The lead function in PySpark is a window function that returns the value of a specified column from a row that comes a given number of rows after the current row in an ordered window. With the default offset of 1, it returns the column's value from the next row relative to the current row.
For example, consider the following PySpark code:
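from pyspark.sql import Window
from pyspark.sql.functions import lead
# Assumes an active SparkSession named spark, as in the pyspark shell.
df = spark.createDataFrame([(1, "John"), (2, "Sam"), (3, "Tom"), (4, "Mark"), (5, "Dan")], ["id", "name"])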
df.show()
+---+----+
| id|name|
+---+----+
|  1|John|
|  2| Sam|
|  3| Tom|
|  4|Mark|
|  5| Dan|
+---+----+
df.select("*", lead("name", 1).over(Window.orderBy("id")).alias("next_name")).show()
+---+----+---------+
| id|name|next_name|
+---+----+---------+
|  1|John|      Sam|
|  2| Sam|      Tom|
|  3| Tom|     Mark|
|  4|Mark|      Dan|
|  5| Dan|     null|
+---+----+---------+
In this example, the lead function retrieves, for each row of the DataFrame, the name from the following row. It is applied to the name column with an offset of 1, so it returns the value of name one row ahead of the current row. The window specification Window.orderBy("id") orders the rows by the id column, which defines what "next" means; without an ordering, the result of lead would not be well defined. Finally, the new column is aliased to next_name for readability.
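To make the semantics concrete, here is a minimal plain-Python sketch of what lead computes over an ordered window. It is only an illustration, not how Spark implements it; the helper name lead_values and its parameters are hypothetical and chosen for this example.

```python
def lead_values(rows, order_key, value_key, offset=1, default=None):
    """Hypothetical helper that mimics lead(value_key, offset, default)
    over a window ordered by order_key."""
    # Order the rows first, as Window.orderBy does.
    ordered = sorted(rows, key=lambda r: r[order_key])
    result = []
    for i, row in enumerate(ordered):
        j = i + offset
        # Look `offset` rows ahead; fall back to `default` past the end.
        nxt = ordered[j][value_key] if 0 <= j < len(ordered) else default
        result.append({**row, "next_" + value_key: nxt})
    return result


# Input deliberately out of order to show that the ordering step matters.
rows = [
    {"id": 3, "name": "Tom"},
    {"id": 1, "name": "John"},
    {"id": 5, "name": "Dan"},
    {"id": 2, "name": "Sam"},
    {"id": 4, "name": "Mark"},
]
result = lead_values(rows, "id", "name")
for r in result:
    print(r["id"], r["name"], r["next_name"])
```

The sketch sorts before looking ahead, which mirrors why the window needs an orderBy clause: "next row" only has meaning relative to an ordering.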
The output shows that lead has returned, for each row, the name from the row that follows it. For example, for the first row with id=1, the next row has name=Sam. Similarly, for the second row with id=2, the next row has name=Tom, and so on. In the last row, next_name is null because no row follows it.