You can count null values in a PySpark DataFrame by combining the groupBy() and agg() functions. Because count() only counts non-null values, wrap each column in when(col(c).isNull(), c) so that only the rows where that column is null get counted. Here is an example:

from pyspark.sql.functions import col, count, when

# create sample dataframe
df = spark.createDataFrame([(1, None, 'foo'), (2, 'bar', None), (3, 'baz', 'qux'), (4, None, None)], ['id', 'col1', 'col2'])

# count the null values in each column
null_counts = df.groupBy().agg(*(count(when(col(c).isNull(), c)).alias(c) for c in df.columns))

null_counts.show()

This will output:

+---+----+----+
| id|col1|col2|
+---+----+----+
|  0|   2|   2|
+---+----+----+

This means there are 0 null values in id, 2 null values in col1, and 2 null values in col2.
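
If you want the null counts per group rather than for the whole DataFrame, pass the grouping column(s) to groupBy() and keep the same aggregation. Here is a minimal sketch, assuming a hypothetical category column that is not part of the original data:

from pyspark.sql.functions import col, count, when

# hypothetical data with a 'category' grouping column (assumed, not from the original example)
grouped = spark.createDataFrame(
    [('a', None, 'foo'), ('a', 'bar', None), ('b', 'baz', 'qux'), ('b', None, None)],
    ['category', 'col1', 'col2'])

# count nulls per group, skipping the grouping column itself
per_group_nulls = grouped.groupBy('category').agg(
    *(count(when(col(c).isNull(), c)).alias(c) for c in grouped.columns if c != 'category'))

per_group_nulls.show()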