Big Data Engineering

Showing posts from February, 2024

Difference between distinct() and dropDuplicates()

February 28, 2024

distinct()
Checks the entire row: if two or more rows have the same value in every column, only one of them is kept. In other words, it returns the distinct rows based on the values of all columns in the DataFrame.

dropDuplicates(subset: optional)
It is more versatile than distinct() because it can also remove duplicates based on specific columns. When no argument is passed, dropDuplicates() performs the same task as distinct() and compares entire rows. However, you can also specify a subset of columns to consider when looking for duplicates.

Example code: distinct():

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('test').getOrCreate()
    sc = spark.sparkContext
    sales_data = [ (1, "Product A", 10, 5.0, 50.0), (2, "Product B", 8, 7.5, 60.0), (1, "Product A", 10, 5.0, 50.0), (4, "Product D", 15, 8.0 ...

ACID Tables in Hive:

February 22, 2024

Starting from version 0.14, Hive supports Online Transactional Processing (OLTP) style operations. Before that, Hive only supported Online Analytical Processing (OLAP).

What is OLTP?
OLTP systems are designed to handle transaction-oriented workloads, where multiple users perform concurrent transactions. In OLTP, we can perform operations like Insert, Update, and Delete. In OLAP, existing rows are typically not modified.

Now, what is ACID?
· A (Atomicity): transactions are either executed completely or not at all. There are no partial transactions.
· C (Consistency): the database remains in a consistent state even in the presence of failures.
· I (Isolation): the intermediate states of one transaction are not visible to other transactions until the transaction is completed.
· D (Durability): once a transaction is committed, its changes are permanent and survive system failures.

These are tables that al...

Troubleshooting “Column Not Callable” Error in PySpark

February 22, 2024

While working with PySpark, you might encounter the “Column not callable” error. This error can occur for various reasons, including incorrect syntax or a misspelled method name. In this blog post, we’ll explore one specific cause of this error: misspelling alias inside the agg function, and how to fix it.

The Misspelling Issue
One specific cause of this error is misspelling the alias method when using the agg function. For example:

Reading the data

    from pyspark.sql.functions import avg
    temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)]
    temp_df = spark.createDataFrame(temp_data, ["day", "temp"])

[Output of the above code]

Now let us perform the aggregation:

    from pyspark.sql.functions import avg
    temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)]
    temp_df = spar...

Handling a PySpark TypeError: unsupported operand type(s) for +: 'int' and 'str'

February 21, 2024

You’ve been working with PySpark, trying to calculate the rolling sum of a column in your DataFrame, and you’ve set everything up perfectly. However, when you run your code, you face the dreaded TypeError: unsupported operand type(s) for +: 'int' and 'str'.

The Reason:
At first glance, the error suggests a type mismatch: an arithmetic operation between an integer (int) and a string (str). In this scenario, though, the data is not at fault. You forgot a tiny but essential step: importing the 'sum' function! Without that import, the name sum refers to Python's built-in sum(), which cannot add up a column name passed as a string.

Here is the sample Hospital data:

    from pyspark.sql.window import Window as w
    from pyspark.sql.functions import col
    df_hospital = spark.read.options(header='True', inferSchema='True').csv('C:/Users/hp/Python_Sums_20thfeb/pyspark/hospital.csv')
    ovr_sp = w.partitionBy('department').orderBy('discharge_...
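The message can be reproduced without Spark at all, which makes the root cause easier to see. This is a minimal sketch that assumes the column was passed to sum by name; "amount" is a hypothetical column name, not one taken from the hospital data.

```python
import builtins

# Without `from pyspark.sql.functions import sum`, the name `sum` refers to
# Python's built-in. It starts from 0 and tries to add each character of the
# string to it, which fails on the very first character.
try:
    builtins.sum("amount")  # "amount" is a hypothetical column name
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

One way to avoid this kind of name clash is to import the module under an alias, `from pyspark.sql import functions as F`, and call `F.sum(...)`, so the built-in is never shadowed and never used by accident.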

Departmental Salary Insights: with SQL and PySpark

February 21, 2024

Description: This article dives into the SQL and PySpark methodologies to identify employees who earn more than the average salary within their respective departments.

INPUT:

SQL QUERY:

    SELECT Name, Salary
    FROM (
        SELECT Name, Salary,
               AVG(Salary) OVER (PARTITION BY Department) AS avg_sal
        FROM Employee
    ) AS department_salaries
    WHERE Salary > avg_sal;

PYSPARK QUERY:

    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window as w
    from pyspark.sql.functions import avg, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Define the schema for the DataFrame
    schema = StructType([
        StructField("EmpID", StringType(), True),
        StructField("Name", StringType(), True),
        StructField("Department", StringType(), True),
        StructField("Salary", DoubleType(), True)
    ])

    # Initialize a Spark session
    spark = SparkSessio...

Troubleshooting PySparkTypeError: [NOT_ITERABLE]

February 15, 2024

Ever encountered a confusing error message like 'Column is not iterable' while working with PySpark? Here's a relatable scenario: you're trying to find the highest salary from a list of employees using PySpark. You've got your DataFrame set up, but when you run your code, the error pops up, leaving you scratching your head. The reason? You forgot a tiny but essential step: importing the 'max' function! Here's the fix, in plain language:

    # Don't forget to import the 'max' function!
    from pyspark.sql.functions import max

    # Now, let's find that maximum salary:
    data = [("John", 2500.0), ("Alice", 3000.0), ("Bob", 2200.0)]
    usr_data = spark.createDataFrame(data, ["name", "salary"])

    +-----+------+
    | name|salary|
    +-----+------+
    | John|2500.0|
    |Alice|3000.0|
    |  Bob|2200.0|
    +-----+------+

    max_sal = usr_data.agg(max(us...