Apache Spark is a distributed computing framework for parallel processing of large-scale data. It runs on a cluster of machines and distributes computation across them by dividing each piece of work into smaller tasks that can be executed in parallel. Here is how a Spark job, stage, and task relate to one another:

  1. Job - A job represents the full set of computations Spark runs to produce the result of a single action (for example count() or collect()). Transformations are lazy, so nothing executes until an action is called; at that point the scheduler submits a job and splits it into a sequence of stages, where the output of each stage provides the input to the next.

  2. Stage - A stage is a set of tasks that can run in parallel without moving data between machines. Spark splits a job into stages at shuffle boundaries: narrow transformations such as map and filter are pipelined together within one stage, while wide transformations such as reduceByKey or groupByKey require data to be redistributed (shuffled) across the cluster and therefore start a new stage. Internally, every stage except the last writes shuffle output for the stage that follows it, and the final stage computes the job's result.

  3. Task - A task is the smallest unit of work in Apache Spark. Each task processes one partition of the data on a single executor and either writes shuffle output for the next stage or returns its result to the driver. The tasks within a stage are independent, so they run in parallel across the executors in the cluster, each working on a different partition of the same data set. The sketch below shows how these pieces fit together.
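
To make this concrete, here is a minimal PySpark sketch (assuming PySpark is installed locally; the local master, app name, and partition count are arbitrary choices for illustration). Calling collect() submits one job; the reduceByKey shuffle splits it into two stages; and each stage runs one task per partition.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[4]" just gives this illustration
# four worker threads on one machine.
spark = (SparkSession.builder
         .master("local[4]")
         .appName("job-stage-task-demo")
         .getOrCreate())
sc = spark.sparkContext

# An RDD with 8 partitions: each stage that processes it runs 8 tasks.
rdd = sc.parallelize(range(100_000), numSlices=8)

# Narrow transformations (map, filter) are pipelined into a single stage.
pairs = rdd.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] % 2 == 0)

# reduceByKey is a wide transformation: it shuffles data by key,
# so it marks a stage boundary.
sums = pairs.reduceByKey(lambda a, b: a + b)

# Transformations are lazy; calling the collect() action submits one job.
# The scheduler splits it into two stages (before and after the shuffle),
# each consisting of one task per partition.
result = sums.collect()
print(sorted(result))

spark.stop()
```

While the application is running, the Spark web UI (http://localhost:4040 by default) lists the job, its stages, and the individual tasks, which is the easiest way to see this breakdown for your own code.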