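Assuming that the two data frames have the same schema and that DataFrame2 has at least one row, the following Scala function can be used to replace null column values in DataFrame1 with the value from the first row of DataFrame2 for the same column. Note that this is a per-column default taken from a single row, not a row-by-row match: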

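import org.apache.spark.sql.DataFrame

def replaceNullValues(df1: DataFrame, df2: DataFrame): DataFrame = {
  // Read the first row of df2 once; its values act as per-column defaults.
  val firstRow = df2.head()
  // Skip columns whose default is itself null, since na.fill cannot take null as a replacement.
  val defaults: Map[String, Any] = df1.columns.toSeq.flatMap { colName =>
    Option(firstRow.getAs[Any](colName)).map(v => colName -> v)
  }.toMap
  // na.fill has no overload for a single value of type Any, so pass a Map instead.
  df1.na.fill(defaults)
}

The function takes two data frames as input and returns a new data frame. It reads the first row of the second data frame once, then builds a map from each column name in the first data frame to the value that row holds for that column, leaving out any column whose value in that row is null. Finally, it calls na.fill() with the map, which replaces the null values in each listed column with the mapped value. The fill value is whatever the first row of the second data frame contains, not the first non-null value and not the value from the matching row. na.fill() only accepts Integer, Long, Float, Double, String and Boolean replacement values, so columns of other types (dates, for example) will raise an IllegalArgumentException.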

To use the function, simply call it with the two data frames as arguments:

val df1 = ... // original data frame with some null values
val df2 = ... // data frame with replacement values for nulls
val filledDF = replaceNullValues(df1, df2)

The filledDF data frame will have the same schema and number of rows as df1, but with the null values in each column replaced by the value that the first row of df2 holds for that column.
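
If the goal is a true row-by-row replacement, one possible approach is to join the two data frames on a key and use coalesce. The following is only a sketch under assumptions that are not part of the answer above: the helper name replaceNullsByKey is hypothetical, and it assumes both data frames share a key column (keyCol) whose values are unique in df2.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

// Hypothetical helper: fills each null in df1 with the value from the df2 row sharing the same key.
// Assumes keyCol exists in both data frames and is unique in df2.
def replaceNullsByKey(df1: DataFrame, df2: DataFrame, keyCol: String): DataFrame = {
  // Alias both sides so that same-named columns can be referenced unambiguously.
  val left = df1.alias("l")
  val right = df2.alias("r")

  // Keep df1's column order; prefer df1's value and fall back to df2's when it is null.
  val selected = df1.columns.map { c =>
    if (c == keyCol) col(s"l.`$c`")
    else coalesce(col(s"l.`$c`"), col(s"r.`$c`")).as(c)
  }

  left
    .join(right, col(s"l.`$keyCol`") === col(s"r.`$keyCol`"), "left") // left join keeps every df1 row
    .select(selected: _*)
}
```

Calling replaceNullsByKey(df1, df2, "id") would then return df1's rows with each null replaced by the value from the df2 row that has the same id. Rows in df1 with no matching key in df2 keep their nulls.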