Repartition() vs Coalesce() in PySpark
In PySpark, optimizing the number of partitions in a DataFrame is essential for performance. repartition() and coalesce() are two methods commonly used for this purpose.
1. Purpose:
- repartition() can both increase and decrease the number of partitions, whereas coalesce() can only decrease it. Both are used to control how work and output files are split up, which directly affects performance and execution time (see the sketch below).
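A minimal sketch of this basic behavior is shown below; the app name, master setting, range size, and partition counts are illustrative assumptions, not part of the original example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionBasics").master("local[4]").getOrCreate()

df = spark.range(0, 1_000_000)                     # starts with a handful of partitions
print(df.rdd.getNumPartitions())

# repartition() can scale the partition count in either direction (full shuffle)
print(df.repartition(8).rdd.getNumPartitions())    # increased
print(df.repartition(2).rdd.getNumPartitions())    # decreased

# coalesce() only reduces; asking for more partitions than exist has no effect
print(df.coalesce(2).rdd.getNumPartitions())       # decreased, no full shuffle
print(df.coalesce(16).rdd.getNumPartitions())      # stays at the original count
```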
2. Differences:
- Scenario: Suppose there are 3 partitions, and you want only 2 partitions.
- repartition() performs a full shuffle, redistributing all the data evenly across the 2 new partitions.
- coalesce() avoids a full shuffle by merging whole partitions: for example, the 3rd partition is combined with the 1st, while the 2nd is left untouched. The resulting partitions are therefore not balanced; one may end up holding roughly twice as much data as the other (see the sketch after this list).
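The difference in distribution can be observed by counting rows per partition with spark_partition_id(). This is a hedged sketch; the 3-partition setup and row counts are assumptions chosen to mirror the scenario above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("DistributionCheck").master("local[3]").getOrCreate()

# Start from 3 partitions, as in the scenario above
df = spark.range(0, 90_000).repartition(3)

def rows_per_partition(frame):
    # Tag each row with the id of the partition it lives in, then count per partition
    return (frame.withColumn("pid", spark_partition_id())
                 .groupBy("pid").count()
                 .orderBy("pid")
                 .collect())

# repartition(2): full shuffle, roughly equal row counts per partition
print(rows_per_partition(df.repartition(2)))

# coalesce(2): whole partitions are merged, so one output partition
# ends up holding roughly twice as much data as the other
print(rows_per_partition(df.coalesce(2)))
```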
3. Example Code:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Repartition_Coalesce").master("local[1]").getOrCreate()

# Two single-column DataFrames built from ranges, inner-joined on their values
val1 = spark.range(2252, 5555555242555).toDF("value1")
val2 = spark.range(343535, 3456557224890).toDF("value2")
val_join = val1.join(val2, val1["value1"] == val2["value2"], "inner")
print("Partition count is", val_join.rdd.getNumPartitions())

# Write the same result twice, once with coalesce(5) and once with repartition(5),
# to separate output directories so the two runs can be compared
val_join.coalesce(5).write.mode('overwrite').parquet("C:/path_coalesce")
val_join.repartition(5).write.mode('overwrite').parquet("C:/path_repartition")
```
When using coalesce(5), no new shuffle boundary is added, so the reduction propagates upstream: the join on the parent DataFrame (val_join) itself runs in only 5 partitions. This is the flip side of coalesce() not causing a full shuffle; it can also reduce the parallelism of the work that produces the data.
When using repartition(5), a shuffle boundary is inserted after the join, so the join still runs with its original parallelism (65 tasks in this run), and only the final write stage partitions the data into 5 files.
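One way to confirm this is to inspect the physical plan with explain(). The sketch below assumes it is run as a continuation of the example code above, reusing the val_join DataFrame:

```python
# coalesce(5): the plan shows a Coalesce node and no extra Exchange,
# so the join's output partitions are simply merged down to 5
val_join.coalesce(5).explain()

# repartition(5): the plan shows Exchange RoundRobinPartitioning(5),
# i.e. a full shuffle boundary before the 5 output partitions
val_join.repartition(5).explain()
```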
4. Conclusion:
Choose repartition() when a full shuffle and even distribution are required. Opt for coalesce() when minimizing partitions without a full shuffle is a priority.