You can modify the year column in your dataset with either PySpark or pandas, as follows:

  1. Using PySpark:

Assuming your PySpark dataframe is called "df" and the year column is called "year", you can use the "when" and "otherwise" functions to create a new column with the modified year values:

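from pyspark.sql.functions import when

# Map selected years to range labels; every other year keeps its value.
# The fallback is cast to string so all branches share one type.
df = df.withColumn("new_year",
                   when(df.year == 2022, "2022-23")
                   .when(df.year == 2021, "2021-2022")
                   .otherwise(df.year.cast("string")))

This creates a new column called "new_year" that holds the modified labels for 2021 and 2022 and keeps every other year's value, cast to a string so the whole column has a single type.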

  2. Using Python (pandas):

Assuming you have a Pandas dataframe called "df" and the year column is called "year", you can use the "apply" function and a lambda function to modify the year values:

df["new_year"] = df["year"].apply(lambda x: "2022-23" if x == 2022 else 
                                  ("2021-2022" if x == 2021 else x))

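This creates a new column called "new_year" that holds the modified labels for 2021 and 2022 and leaves the other year values unchanged. Because the untouched years stay integers while the labels are strings, the new column has mixed types (object dtype).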
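If more years need labels later, a dictionary lookup may be easier to maintain than nested conditionals. The following is only a sketch of that alternative, not part of the original answer: the sample data and the name "year_labels" are made up for illustration, and the fallback years are converted to strings so the column ends up with one type.

```python
import pandas as pd

# Hypothetical sample data; only the "year" column name comes from the answer.
df = pd.DataFrame({"year": [2020, 2021, 2022, 2023]})

# Year-to-label lookup (illustrative name); add entries here instead of
# nesting more conditionals.
year_labels = {2022: "2022-23", 2021: "2021-2022"}

# map() yields NaN for years missing from the dict, so fall back to the
# original year, converted to a string to keep a single type in the column.
df["new_year"] = df["year"].map(year_labels).fillna(df["year"].astype(str))

print(df)
```

Unlike the "apply" version, this leaves every value in "new_year" as a string, which tends to be easier to work with if the column is later written to a file or joined against another table.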