You can group and count null values in a PySpark dataframe by using the groupBy() and agg() functions with a per-column aggregate. Here is an example:
from pyspark.sql.functions import col, sum as spark_sum
# create a sample dataframe with some null values
df = spark.createDataFrame([(1, None, 'foo'), (2, 'bar', None), (3, 'baz', 'qux'), (4, None, None)], ['id', 'col1', 'col2'])
# cast each column's isNull() flag to an integer and sum it per column
null_counts = df.groupBy().agg(*(spark_sum(col(c).isNull().cast('int')).alias(c) for c in df.columns))
null_counts.show()
This will output:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 0| 2| 2|
+---+----+----+
This means that there are 2 null values in col1, 2 null values in col2, and 0 null values in id.
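If you prefer to avoid the boolean-to-integer cast, an equivalent approach is to combine count() with when(): count() ignores nulls, and when() returns null whenever its condition is false, so each aggregate only counts the rows where the column is null. The sketch below assumes the same df and an active SparkSession named spark; null_counts_alt is just an illustrative variable name.
from pyspark.sql.functions import count, when, col, lit
# when() yields lit(1) only for null rows, so count() effectively counts nulls per column
null_counts_alt = df.agg(*(count(when(col(c).isNull(), lit(1))).alias(c) for c in df.columns))
null_counts_alt.show()
# collect the single result row into a plain Python dict, e.g. {'id': 0, 'col1': 2, 'col2': 2}
print(null_counts_alt.collect()[0].asDict())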