
How can I group and count null values in a PySpark Dataframe?

asked 2022-09-15 11:00:00 +0000 by bukephalos

1 Answer

answered 2022-08-08 02:00:00 +0000 by huitzilopochtli

You can group and count null values in a PySpark dataframe by using the groupBy() and agg() functions together with a count(when(...)) expression for each column. Here is an example:

from pyspark.sql.functions import col, count, when

# create sample dataframe
df = spark.createDataFrame([(1, None, 'foo'), (2, 'bar', None), (3, 'baz', 'qux'), (4, None, None)], ['id', 'col1', 'col2'])

# count the null values in every column (count() ignores nulls, so we
# count a when() expression that is non-null only on null rows)
null_counts = df.groupBy().agg(*(count(when(col(c).isNull(), c)).alias(c) for c in df.columns))

null_counts.show()

This will output:

+---+----+----+
| id|col1|col2|
+---+----+----+
|  0|   2|   2|
+---+----+----+

This means col1 and col2 each contain two null values, while id contains none.
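If you instead need the null counts per group, you can pass a key column to groupBy(). Below is a minimal sketch of that variant; the df2, category, and value names are hypothetical, invented for illustration:

from pyspark.sql.functions import col, count, when

# hypothetical dataframe with a 'category' column to group on
df2 = spark.createDataFrame(
    [('a', None), ('a', 'x'), ('b', None), ('b', None)],
    ['category', 'value'])

# count nulls in 'value' separately for each category
per_group = df2.groupBy('category').agg(
    count(when(col('value').isNull(), 'value')).alias('null_values'))

per_group.orderBy('category').show()
# +--------+-----------+
# |category|null_values|
# +--------+-----------+
# |       a|          1|
# |       b|          2|
# +--------+-----------+

The count(when(...)) trick works because when() yields the literal string 'value' only for null rows (and null otherwise), and count() counts only non-null results.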


