One efficient way to count the null and NaN values in each column of a PySpark DataFrame is to combine the Column.isNull() method and the isnan() function from pyspark.sql.functions with agg(), summing the resulting flags for each column.

Here is an example:

from pyspark.sql import functions as F

# create a PySpark dataframe
df = spark.createDataFrame([(1, 2, None), (4, None, float('nan')), (7, 8, 9.0)], ['col1', 'col2', 'col3'])

# calculate the number of Null and Nan values for each column
null_counts = df.agg(*[F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df.columns])
nan_counts = df.agg(*[F.sum(F.isnan(F.col(c)).cast('int')).alias(c) for c in df.columns])

# print the results
null_counts.show()
nan_counts.show()

This will output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   0|   1|   1|
+----+----+----+

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   0|   0|   1|
+----+----+----+

In this example, we first create a PySpark DataFrame containing some null and NaN values. We then call agg() with a generator expression that builds one aggregate per column: Column.isNull() (or isnan()) flags each value, cast('int') turns the boolean flags into 0/1, and F.sum() adds them up. alias() keeps the original column names in the result, and show() displays the two summary DataFrames. Note that null and NaN are distinct in Spark: col3 contains one null and one NaN, so it shows up in both counts, while the null in col2 is not counted as NaN.
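
If you want both kinds of missing values together in a single pass over the data, a common variation is to count the rows where a column is either null or NaN. Here is a minimal sketch of that approach, reusing the same df as above (the isnan() check assumes the columns are numeric):

from pyspark.sql import functions as F

# count rows per column that are either null or NaN, in one aggregation
missing_counts = df.agg(*[
    F.count(F.when(F.col(c).isNull() | F.isnan(F.col(c)), c)).alias(c)
    for c in df.columns
])
missing_counts.show()

F.count() only counts non-null values, and F.when() returns null whenever the condition is false, so each column's count is exactly the number of rows where that column is null or NaN. For the DataFrame above this should report 0, 1, and 2 missing values for col1, col2, and col3 respectively.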