Understanding Union() and UnionByName() in PySpark

February 09, 2024

🚀 Union() vs UnionByName() in PySpark

In PySpark, when working with DataFrames, union() and unionByName() are two methods used for merging data from multiple DataFrames.

Purpose of the two functions:

o union() in PySpark is used to merge two tables with the same column count irrespective of the column names. It simply appends the rows of the second table below the rows of the first table.

o unionByName() on the other hand is used to merge data from two tables based on their column names. It ensures that the columns are matched and appended accordingly.

Differences:

o Example 1:

§ union() simply merges two tables without considering column names or data types.

§ unionByName() merges tables based on column names, without considering the order of columns.

Example Code:

spark = SparkSession.builder.appName("Union_Example").getOrCreate()

# Creating DataFrames

mob = spark.createDataFrame([('Jio', 120, 'Ambani'), ('iphone', 2, 'Steve')], ["Company", "id", "Owner"])

mob1 = spark.createDataFrame([('Mark', 34, 'Fb', 2012), ('Billgates', 2, 'Windows', 2002)], ["Owner", "id", "Company", "year"])

# Performing union

mob_union = mob.union(mob1)

mob_union.show()

# Performing unionByName

mob_unionByName = mob.unionByName(mob1)

mob_unionByName.show()

+-------+---+------+

|Company| id| Owner|

+-------+---+------+

| Jio|120|Ambani|

| iphone| 2| Steve|

+-------+---+------+

+---------+---+-------+----+

+---------+---+-------+----+

| Mark| 34| Fb|2012|

|Billgates| 2|Windows|2002|

+---------+---+-------+----+

# Output for unionByName():

+---------+---+-------+----+

+---------+---+-------+----+

| Mark| 34| Fb|2012|

|Billgates| 2|Windows|2002|

| Jio|120| null|null|

+---------+---+-------+----+

# output of union(): AnalysisException: [NUM_COLUMNS_MISMATCH]

This is because union() can only be performed with a matching column count

Conclusion:

o Use union() when you want to merge two tables with the same column count, regardless of column names.

o Use unionByName() when you want to merge tables based on column names, ensuring that columns are matched and appended accordingly.

Search This Blog

Big Data Engineering

Understanding Union() and UnionByName() in PySpark

Comments

Post a Comment

Popular posts from this blog

Departmental Salary Insights: with SQL and PySpark

Troubleshooting PySparkTypeError: [NOT_ITERABLE]

Handling a PySpark TypeError: ‘unsupported operand type(s) for +: ‘int’ and ‘str’