Departmental Salary Insights: with SQL and PySpark
Departmental Salary Insights: with SQL and PySpark
Description: This article dives into the SQL and PySpark methodologies to identify employees who earn more than the average salary within their respective departments.
INPUT:

SQL QUERY:
SELECT Name, Salary
FROM (
SELECT Name, Salary, AVG(Salary) OVER(PARTITION BY Department) AS avg_sal
FROM Employee
) AS department_salaries
WHERE Salary > avg_sal;
PYSPARK QUERY:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window as w
from pyspark.sql.functions import avg, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
# Define the schema for the DataFrame
schema = StructType([
StructField("EmpID", StringType(), True),
StructField("Name", StringType(), True),
StructField("Department", StringType(), True),
StructField("Salary", DoubleType(), True)
])
# Initialize a Spark session
spark = SparkSession.builder.appName("interview1").getOrCreate()
sc = spark.sparkContext
# Create an RDD from the data
data = [
("1", "Alice", "HR", 60000.00),
("2", "Bob", "Finance", 75000.00),
("3", "Charlie", "IT", 90000.00),
("4", "David", "HR", 65000.00),
("5", "Eve", "Finance", 80000.00)
]
rdd = sc.parallelize(data)
# Create a DataFrame from the RDD using the schema
df = spark.createDataFrame(rdd, schema)
# Create a window specification to partition the data by department
spec = w.partitionBy("Department")
# Calculate the average salary for each department using a window function
df = df.withColumn("avg_sal", avg("Salary").over(spec))
# Filter the DataFrame to include only employees earning more than the average salary in their respective departments
df.filter(col("Salary") > col("avg_sal")).show()
OUTPUT:

CONCLUSION:
In conclusion, the SQL and PySpark queries provided in this article offer a powerful way to analyse salary data within departments and identify employees who earn above their departmental averages. By understanding window functions and partitioning data, users can efficiently extract valuable insights and make informed decisions based on their data.
Comments
Post a Comment