Difference between distinct() and dropDuplicates()
distinct()

distinct() compares entire rows: when two or more rows have the same value in every column, only one copy is kept. In other words, it returns the distinct rows based on the values of all columns in the DataFrame.

dropDuplicates(subset: optional)

dropDuplicates() is more versatile than distinct() because it can also deduplicate on specific columns. When no argument is passed, it does the same job as distinct() and compares entire rows. When a subset of columns is passed, two rows count as duplicates if they match on those columns alone, even if their other columns differ.

Example code: distinct():

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()
sc = spark.sparkContext

sales_data = [
    (1, "Product A", 10, 5.0, 50.0),
    (2, "Product B", 8, 7.5, 60.0),
    (1, "Product A", 10, 5.0, 50.0),
    (4, "Product D", 15, 8.0 ...
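
To make the difference concrete without a running Spark session, here is a minimal plain-Python sketch that models the two behaviours on a list of tuples. It is not Spark code: the helper functions, the column names (id, product, quantity) and the sample rows are all assumptions made for illustration. It also keeps the first matching row in input order, whereas Spark does not guarantee which of several matching rows survives dropDuplicates() with a subset, nor the order of the result.

```python
from typing import List, Optional, Sequence, Tuple


def drop_duplicates(
    rows: Sequence[Tuple],
    columns: Sequence[str],
    subset: Optional[Sequence[str]] = None,
) -> List[Tuple]:
    """Hypothetical model of dropDuplicates(): keep one row per distinct key.

    With no subset, the key is the whole row, which matches distinct().
    """
    key_columns = list(columns) if subset is None else list(subset)
    key_indexes = [list(columns).index(name) for name in key_columns]

    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            # Assumption of this sketch: the first row seen for a key is kept.
            seen.add(key)
            kept.append(row)
    return kept


def distinct(rows: Sequence[Tuple], columns: Sequence[str]) -> List[Tuple]:
    """Hypothetical model of distinct(): deduplicate on every column."""
    return drop_duplicates(rows, columns, subset=None)


# Illustrative data, not the article's sales_data.
COLUMNS = ["id", "product", "quantity"]
ROWS = [
    (1, "Product A", 10),
    (2, "Product B", 8),
    (1, "Product A", 10),  # identical to the first row in every column
    (3, "Product A", 12),  # same product, but different id and quantity
]

# Whole-row comparison: only the exact duplicate is removed.
full_row_result = distinct(ROWS, COLUMNS)

# Comparison on "product" only: every later "Product A" row is removed.
by_product_result = drop_duplicates(ROWS, COLUMNS, subset=["product"])

print(full_row_result)
print(by_product_result)
```

In PySpark itself, the corresponding calls would be df.distinct() and df.dropDuplicates(["product"]) on a DataFrame with a column named product.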