Posts

Showing posts from 2024

Boost Your Spark Efficiency with Eager Evaluation!

spark.sql.repl.eagerEval.enabled: As Spark enthusiasts, we're all familiar with its lazy evaluation model, meaning transformations aren't executed until an action is called. But what if I told you there's a way to see your results sooner? spark.sql.repl.eagerEval.enabled is a configuration option that can revolutionize your Spark development experience! By enabling this setting, you no longer need to call .show() to view your results while developing or testing code in a notebook. By default, Spark has this configuration disabled, and eager evaluation can hurt performance on large datasets, so use it wisely. 📝 Example: check out this PySpark snippet to see spark.sql.repl.eagerEval.enabled in action: from pyspark.sql import SparkSession spark = SparkSession.builder.master("local[1]").appName("program").getOrCreate() name = [('Apple'...
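Below is a minimal, runnable sketch of the setting in action (the data and column names here are illustrative stand-ins, since the original snippet is truncated):

from pyspark.sql import SparkSession

# Enable eager evaluation so that a DataFrame typed at the end of a notebook
# cell is rendered as a table without an explicit .show().
spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("eager-eval-demo")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .getOrCreate()
)

fruits = spark.createDataFrame(
    [("Apple", 10), ("Banana", 20)],
    ["name", "qty"],
)

# In a notebook or PySpark shell, evaluating `fruits` now displays its rows
# eagerly; in a plain script you would still call fruits.show().
fruits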

Difference between distinct() and dropDuplicates()

distinct() checks the entire row: if all columns are the same between two or more rows, only the first row is kept. In other words, it returns distinct rows based on the values of all columns in the DataFrame. dropDuplicates(subset: optional) is more versatile than distinct(), as it can also be used to pick distinct values from specific columns. When no argument is passed to dropDuplicates(), it performs the same task as distinct(), checking the entire row. However, you can also specify a subset of columns to consider when looking for duplicates. Example code: distinct(): from pyspark.sql import SparkSession spark = SparkSession.builder.appName('test').getOrCreate() sc = spark.sparkContext sales_data = [ (1, "Product A", 10, 5.0, 50.0), (2, "Product B", 8, 7.5, 60.0), (1, "Product A", 10, 5.0, 50.0), (4, "Product D", 15, 8.0 ...
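To make the comparison concrete, here is a small self-contained sketch (the dataset and column names are assumptions, since the original code is truncated):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

sales_data = [
    (1, "Product A", 10, 5.0, 50.0),
    (2, "Product B", 8, 7.5, 60.0),
    (1, "Product A", 10, 5.0, 50.0),  # exact duplicate of the first row
    (2, "Product B", 9, 7.5, 67.5),   # same id/product, different quantity
]
df = spark.createDataFrame(sales_data, ["id", "product", "qty", "price", "total"])

# distinct() compares entire rows, so only the exact duplicate is removed.
df.distinct().show()

# dropDuplicates() with no argument behaves exactly like distinct().
df.dropDuplicates().show()

# dropDuplicates(subset) compares only the listed columns, so one of the two
# "Product B" rows is dropped even though their quantities differ.
df.dropDuplicates(["id", "product"]).show()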

ACID Tables in Hive:

Starting from version 0.14, Hive supports Online Transaction Processing (OLTP). Before that, Hive only supported Online Analytical Processing (OLAP). What is OLTP? OLTP systems are designed to handle transaction-oriented workloads, where multiple users perform concurrent transactions. In OLTP, we can perform operations like Insert, Update, and Delete; in OLAP, tables cannot be modified. Now, what is ACID? · A — Atomicity — transactions are either executed completely or not at all; there are no partial transactions. · C — Consistency — the database remains in a consistent state even in the presence of failures. · I — Isolation — the intermediate states of one transaction are not visible to other transactions until the transaction is completed. · D — Durability — once a transaction is committed, its changes are permanent and survive system failures. These are tables that al...
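As a hedged illustration of the DDL involved (the table name and columns are made up): classic Hive ACID tables are stored as ORC and flagged with the transactional table property. Spark's own support for reading and writing Hive ACID tables is limited, so in practice this statement is usually run in the Hive CLI or beeline; it is shown here through spark.sql only to keep the example in PySpark.

from pyspark.sql import SparkSession

# Assumes a Hive-enabled Spark build and metastore; the DDL is the point here.
spark = (
    SparkSession.builder
    .appName("acid-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# A full ACID (transactional) Hive table: ORC storage plus the
# 'transactional' table property.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_acid (
        id      INT,
        product STRING,
        qty     INT
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")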

Troubleshooting “Column Not Callable” Error in PySpark

While working with PySpark, you might encounter the "Column not callable" error. This error can occur for various reasons, including incorrect syntax or a misspelled keyword. In this blog post, we'll explore one specific cause of this error, misspelling the alias keyword in the agg function, and how to fix it. The Misspelling Issue: in this scenario, the alias keyword is misspelled when calling the agg function. For example, reading the data: from pyspark.sql.functions import avg temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)] temp_df = spark.createDataFrame(temp_data, ["day", "temp"]) Output of the above code. Now let us perform the aggregation: from pyspark.sql.functions import avg temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)] temp_df = spar...
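Here is a short sketch of both the failure and the fix (the misspelling shown, alais, is illustrative): because a PySpark Column treats an unknown attribute as a nested-field reference, calling the misspelled name raises "TypeError: 'Column' object is not callable".

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("alias-typo-demo").getOrCreate()

temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)]
temp_df = spark.createDataFrame(temp_data, ["day", "temp"])

# Misspelled keyword: 'alais' is resolved as a (non-existent) nested field of
# the Column, and calling it raises the "column not callable" TypeError.
# temp_df.agg(avg("temp").alais("avg_temp")).show()

# Correct spelling: .alias() renames the aggregated column as intended.
temp_df.agg(avg("temp").alias("avg_temp")).show()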

Handling a PySpark TypeError: ‘unsupported operand type(s) for +: ‘int’ and ‘str’

You've been working with PySpark, trying to calculate the rolling sum of a column in your DataFrame, and you've set everything up perfectly. However, when you run your code, you face the dreaded TypeError: 'unsupported operand type(s) for +: 'int' and 'str''. The Reason: the error indicates a type mismatch, as if you were performing arithmetic on two different data types — an integer (int) and a string (str). In this scenario, you forgot a tiny but essential step: importing the 'sum' function from pyspark.sql.functions, so Python's built-in sum() was used instead. Here is the sample hospital data: from pyspark.sql.window import Window as w from pyspark.sql.functions import col df_hospital = spark.read.options(header='True', inferSchema='True').csv('C:/Users/hp/Python_Sums_20thfeb/pyspark/hospital.csv') ovr_sp = w.partitionBy('department').orderBy('discharge_...
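A self-contained sketch of the fix follows (the tiny in-memory dataset and column names are stand-ins for the hospital.csv file, which isn't reproduced here). Without the import, Python's built-in sum() is picked up; given a column-name string it tries to add the string's characters to the integer 0, which is exactly the int + str TypeError above.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window as w
from pyspark.sql.functions import col, sum  # Spark's sum, not the Python built-in

spark = SparkSession.builder.appName("rolling-sum-demo").getOrCreate()

# Stand-in for the hospital data read from CSV in the post.
df_hospital = spark.createDataFrame(
    [
        ("cardio", "2024-01-01", 100),
        ("cardio", "2024-01-02", 150),
        ("ortho",  "2024-01-01", 80),
    ],
    ["department", "discharge_date", "bill_amount"],
)

# Rolling (cumulative) sum of the bill amount per department, ordered by date.
ovr_sp = w.partitionBy("department").orderBy("discharge_date")
df_hospital.withColumn("rolling_bill", sum(col("bill_amount")).over(ovr_sp)).show()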