To combine a series of PySpark DataFrames using shared keys, you can follow this procedure. First, create a SparkSession and read in the DataFrames:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Combine DataFrames").getOrCreate()
df1 = spark.read.csv("path/to/df1.csv", header=True)
df2 = spark.read.csv("path/to/df2.csv", header=True)
df3 = spark.read.csv("path/to/df3.csv", header=True)
Next, use the join function to merge the DataFrames on their shared column:
combined_df = df1.join(df2, on="shared_column").join(df3, on="shared_column")
Note that the join function takes the name of the shared column as an argument. You can continue chaining join calls as needed to combine more DataFrames.
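To make the chaining concrete without needing a Spark cluster, here is a plain-Python sketch of what a chained inner equi-join does: each join keeps only the rows whose shared key matches on both sides, then merges their columns. The data and the key name "id" are invented for illustration; PySpark's join does this at scale.

```python
def inner_join(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    # Index the right-hand rows by key for O(1) lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for lrow in left:
        # Emit one merged row per matching right-hand row.
        for rrow in index.get(lrow[key], []):
            merged = dict(lrow)
            merged.update(rrow)
            joined.append(merged)
    return joined

df1 = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
df2 = [{"id": 1, "city": "Oslo"}, {"id": 3, "city": "Lima"}]
df3 = [{"id": 1, "score": 9}]

# Mirrors df1.join(df2, on="id").join(df3, on="id"): only id=1
# appears in all three inputs, so only that row survives.
combined = inner_join(inner_join(df1, df2, "id"), df3, "id")
```

As with PySpark's default inner join, rows without a match on every key are dropped; chaining simply feeds the result of one join into the next.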
If you need to rename a column after the join, use the .withColumnRenamed() function:
combined_df = combined_df.withColumnRenamed("old_column_name", "new_column_name")
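Conceptually, the rename rebuilds each row with the old column name swapped for the new one, leaving every other column untouched. A minimal plain-Python sketch, with column names invented for illustration:

```python
def with_column_renamed(rows, old, new):
    """Return rows with column `old` renamed to `new`, other columns unchanged."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]

rows = [{"old_column_name": 1, "other": "a"}]
renamed = with_column_renamed(rows, "old_column_name", "new_column_name")
```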
Finally, write the combined DataFrame to disk (note that Spark writes a directory of part files at this path, not a single CSV file):
combined_df.write.csv("path/to/output.csv", header=True, mode="overwrite")
Note: The above code assumes that you are using the PySpark API for version 2.0 or higher. If you are using an earlier version, some of the syntax may be different.