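The lead function in PySpark is a window function. For each row, it returns the value of the specified column from the row that comes a given number of rows (the offset, 1 by default) after the current row in the window's ordering. It depends only on row position, not on whether the column's value changes. If no such row exists, it returns null unless a default value is passed as the third argument.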
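As a rough illustration of these semantics, here is a plain-Python sketch of what lead computes over a single, already-ordered list of values. The helper name lead_values is hypothetical, and this is only a model of the behavior, not how Spark implements the function.

```python
def lead_values(values, offset=1, default=None):
    """Plain-Python model of lead() over one already-ordered partition.

    For position i, return the value at position i + offset, or `default`
    when that position falls outside the partition.
    """
    n = len(values)
    return [
        values[i + offset] if 0 <= i + offset < n else default
        for i in range(n)
    ]


names = ["John", "Sam", "Tom", "Mark", "Dan"]
next_names = lead_values(names)
print(next_names)  # ['Sam', 'Tom', 'Mark', 'Dan', None]
```

The last element has nothing after it, so it falls back to the default (None here, which corresponds to null in Spark).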

For example, consider the following PySpark code:

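from pyspark.sql import Window
from pyspark.sql.functions import lead

# Assumes an active SparkSession named `spark` (as in the pyspark shell or a notebook).
df = spark.createDataFrame([(1, "John"), (2, "Sam"), (3, "Tom"), (4, "Mark"), (5, "Dan")], ["id", "name"])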

df.show()

+---+----+
| id|name|
+---+----+
|  1|John|
|  2| Sam|
|  3| Tom|
|  4|Mark|
|  5| Dan|
+---+----+

df.select("*", lead("name", 1).over(Window.orderBy("id")).alias("next_name")).show()

+---+----+---------+
| id|name|next_name|
+---+----+---------+
|  1|John|      Sam|
|  2| Sam|      Tom|
|  3| Tom|     Mark|
|  4|Mark|      Dan|
|  5| Dan|     null|
+---+----+---------+

In this example, lead is applied to the name column with an offset of 1, so each row receives the name from the row that follows it. Like any window function, lead needs a window specification: Window.orderBy("id") orders the rows by id, which defines what "next" means. Because this window has no partitionBy, Spark treats the whole DataFrame as one partition and moves all rows into it, which is fine for a small example but can be slow on large data. Finally, alias names the new column next_name for readability.

The output shows the name from the following row for each row of the DataFrame: the row with id=1 gets next_name=Sam, the row with id=2 gets next_name=Tom, and so on. The last row has next_name set to null because no row follows it.