How can I modify the year column in my dataset using PySpark or Python, so that if the year is 2022, it becomes 2022-23, and if the year is 2021, it becomes 2021-2022?

asked 2023-07-18 02:19:09 +0000

scrum

1 Answer

answered 2023-07-18 02:31:01 +0000

bukephalos

You can use PySpark or Python to modify the year column in your dataset as follows:

  1. Using PySpark:

Assuming your PySpark DataFrame is called "df" and the year column is called "year", you can chain the "when" function with the "otherwise" column method to create a new column with the modified year values:

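from pyspark.sql.functions import when

# Map 2022 and 2021 to their labels; cast every other year to a string
# so that all branches of the expression share one type.
df = df.withColumn("new_year",
                   when(df.year == 2022, "2022-23")
                   .when(df.year == 2021, "2021-2022")
                   .otherwise(df.year.cast("string")))

This creates a new column called "new_year" that holds the labels for 2021 and 2022 and, for every other year, the original value as a string. The explicit cast keeps the column's type from depending on Spark's implicit type coercion.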

  2. Using Python (pandas):

Assuming you have a pandas DataFrame called "df" and the year column is called "year", you can use the "apply" method with a lambda function to modify the year values:

df["new_year"] = df["year"].apply(lambda x: "2022-23" if x == 2022 else 
                                  ("2021-2022" if x == 2021 else x))

This creates a new column called "new_year" with the modified values for 2021 and 2022. Other years are copied over unchanged, so if "year" holds integers the new column ends up with a mix of strings and integers (object dtype).
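
A dictionary lookup is a possible alternative to the nested lambda and avoids a Python-level function call per row. This is a minimal sketch, assuming the "year" column holds integers; YEAR_LABELS is a hypothetical name and the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical lookup table; extend it with further years as needed.
YEAR_LABELS = {2022: "2022-23", 2021: "2021-2022"}

# Illustrative data only, not taken from the question.
df = pd.DataFrame({"year": [2020, 2021, 2022]})

# Series.map yields NaN for years that are not keys of the dict,
# so fill those gaps with the original year rendered as text.
year_as_text = df["year"].astype(str)
df["new_year"] = df["year"].map(YEAR_LABELS).fillna(year_as_text)

print(df)
```

Here every value in "new_year" is a string, which is usually easier to work with downstream than a column that mixes strings and integers.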

Stats

Seen: 12 times

Last updated: Jul 18 '23