Posts

Showing posts from 2024

Boost Your Spark Efficiency with Eager Evaluation!

spark.sql.repl.eagerEval.enabled: As Spark enthusiasts, we're all familiar with its lazy evaluation model, meaning transformations aren't executed until an action is called. But what if I told you there's a way to see your results sooner? spark.sql.repl.eagerEval.enabled is a configuration option that can revolutionize your Spark development experience! By enabling this setting, you no longer need to call .show() to view your results while developing or testing code in a notebook. By default, Spark has this configuration disabled, and eager evaluation can hurt performance on large datasets, so use it wisely. 📝 Example: check out this PySpark snippet to see spark.sql.repl.eagerEval.enabled in action: from pyspark.sql import SparkSession spark = SparkSession.builder.master("local[1]").appName("program").getOrCreate() name = [('Apple'...
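Below is a minimal, runnable sketch of the setting in action (the data and column names here are illustrative stand-ins, since the original snippet is truncated):

from pyspark.sql import SparkSession

# Enable eager evaluation so that a DataFrame typed at the end of a notebook
# cell is rendered as a table without an explicit .show().
spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("eager-eval-demo")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .getOrCreate()
)

fruits = spark.createDataFrame(
    [("Apple", 10), ("Banana", 20)],
    ["name", "qty"],
)

# In a notebook or PySpark shell, evaluating `fruits` now displays its rows
# eagerly; in a plain script you would still call fruits.show().
fruits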

Difference between distinct() and dropDuplicates()

distinct() checks the entire row: if all columns are the same between two or more rows, only the first row is kept. In other words, it returns distinct rows based on the values of all columns in the DataFrame. dropDuplicates(subset: optional) is more versatile than distinct(), as it can also be used to pick distinct values from specific columns. When no argument is passed to dropDuplicates(), it performs the same task as distinct(), checking the entire row. However, you can also specify a subset of columns to consider when looking for duplicates. Example code: distinct(): from pyspark.sql import SparkSession spark = SparkSession.builder.appName('test').getOrCreate() sc = spark.sparkContext sales_data = [ (1, "Product A", 10, 5.0, 50.0), (2, "Product B", 8, 7.5, 60.0), (1, "Product A", 10, 5.0, 50.0), (4, "Product D", 15, 8.0 ...
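To make the comparison concrete, here is a small self-contained sketch (the dataset and column names are assumptions, since the original code is truncated):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

sales_data = [
    (1, "Product A", 10, 5.0, 50.0),
    (2, "Product B", 8, 7.5, 60.0),
    (1, "Product A", 10, 5.0, 50.0),  # exact duplicate of the first row
    (2, "Product B", 9, 7.5, 67.5),   # same id/product, different quantity
]
df = spark.createDataFrame(sales_data, ["id", "product", "qty", "price", "total"])

# distinct() compares entire rows, so only the exact duplicate is removed.
df.distinct().show()

# dropDuplicates() with no argument behaves exactly like distinct().
df.dropDuplicates().show()

# dropDuplicates(subset) compares only the listed columns, so one of the two
# "Product B" rows is dropped even though their quantities differ.
df.dropDuplicates(["id", "product"]).show()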

ACID Tables in Hive:

Starting from version 0.14, Hive supports Online Transaction Processing (OLTP). Before that, Hive only supported Online Analytical Processing (OLAP). What is OLTP? OLTP systems are designed to handle transaction-oriented workloads, where multiple users perform concurrent transactions. In OLTP, we can perform operations like Insert, Update, and Delete; in OLAP, tables cannot be modified. Now, what is ACID? · A — Atomicity — transactions are either executed completely or not at all; there are no partial transactions. · C — Consistency — the database remains in a consistent state even in the presence of failures. · I — Isolation — the intermediate states of one transaction are not visible to other transactions until the transaction is completed. · D — Durability — once a transaction is committed, its changes are permanent and survive system failures. These are tables that al...
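As a hedged illustration of the DDL involved (the table name and columns are made up): classic Hive ACID tables are stored as ORC and flagged with the transactional table property. Spark's own support for reading and writing Hive ACID tables is limited, so in practice this statement is usually run in the Hive CLI or beeline; it is shown here through spark.sql only to keep the example in PySpark.

from pyspark.sql import SparkSession

# Assumes a Hive-enabled Spark build and metastore; the DDL is the point here.
spark = (
    SparkSession.builder
    .appName("acid-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# A full ACID (transactional) Hive table: ORC storage plus the
# 'transactional' table property.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_acid (
        id      INT,
        product STRING,
        qty     INT
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")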

Troubleshooting “Column Not Callable” Error in PySpark

While working with PySpark, you might encounter the "Column not callable" error. This error can occur for various reasons, including incorrect syntax or a misspelled keyword. In this blog post, we'll explore one specific cause of this error, misspelling the alias keyword in the agg function, and how to fix it. The Misspelling Issue: in this scenario, the alias keyword is misspelled when calling the agg function. For example, reading the data: from pyspark.sql.functions import avg temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)] temp_df = spark.createDataFrame(temp_data, ["day", "temp"]) Output of the above code. Now let us perform the aggregation: from pyspark.sql.functions import avg temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)] temp_df = spar...
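Here is a short sketch of both the failure and the fix (the misspelling shown, alais, is illustrative): because a PySpark Column treats an unknown attribute as a nested-field reference, calling the misspelled name raises "TypeError: 'Column' object is not callable".

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("alias-typo-demo").getOrCreate()

temp_data = [("Monday", 20.0), ("Tuesday", 25.0), ("Wednesday", 22.5)]
temp_df = spark.createDataFrame(temp_data, ["day", "temp"])

# Misspelled keyword: 'alais' is resolved as a (non-existent) nested field of
# the Column, and calling it raises the "column not callable" TypeError.
# temp_df.agg(avg("temp").alais("avg_temp")).show()

# Correct spelling: .alias() renames the aggregated column as intended.
temp_df.agg(avg("temp").alias("avg_temp")).show()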

Handling a PySpark TypeError: ‘unsupported operand type(s) for +: ‘int’ and ‘str’

You've been working with PySpark, trying to calculate the rolling sum of a column in your DataFrame, and you've set everything up perfectly. However, when you run your code, you face the dreaded TypeError: 'unsupported operand type(s) for +: 'int' and 'str''. The Reason: the error indicates a type mismatch, as if you were performing arithmetic on two different data types — an integer (int) and a string (str). In this scenario, you forgot a tiny but essential step: importing the 'sum' function from pyspark.sql.functions, so Python's built-in sum() was used instead. Here is the sample hospital data: from pyspark.sql.window import Window as w from pyspark.sql.functions import col df_hospital = spark.read.options(header='True', inferSchema='True').csv('C:/Users/hp/Python_Sums_20thfeb/pyspark/hospital.csv') ovr_sp = w.partitionBy('department').orderBy('discharge_...
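A self-contained sketch of the fix follows (the tiny in-memory dataset and column names are stand-ins for the hospital.csv file, which isn't reproduced here). Without the import, Python's built-in sum() is picked up; given a column-name string it tries to add the string's characters to the integer 0, which is exactly the int + str TypeError above.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window as w
from pyspark.sql.functions import col, sum  # Spark's sum, not the Python built-in

spark = SparkSession.builder.appName("rolling-sum-demo").getOrCreate()

# Stand-in for the hospital data read from CSV in the post.
df_hospital = spark.createDataFrame(
    [
        ("cardio", "2024-01-01", 100),
        ("cardio", "2024-01-02", 150),
        ("ortho",  "2024-01-01", 80),
    ],
    ["department", "discharge_date", "bill_amount"],
)

# Rolling (cumulative) sum of the bill amount per department, ordered by date.
ovr_sp = w.partitionBy("department").orderBy("discharge_date")
df_hospital.withColumn("rolling_bill", sum(col("bill_amount")).over(ovr_sp)).show()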