
Apache Spark's architecture is built around the following fundamental components:

  1. Cluster Manager: Spark relies on a cluster manager to handle resource allocation and scheduling. It can run on Apache Mesos, Hadoop YARN, or Spark's built-in standalone cluster manager; the first sketch after this list shows how the master URL selects one.

  2. Driver Program: The driver program runs the application's main function, controls the overall execution of the application, and creates the SparkContext, which is the entry point to any Spark functionality (also shown in the first sketch below).

  3. Executors: Executors are processes launched on the worker nodes that perform the actual computation. Each executor runs as a separate JVM process and executes the tasks assigned to it by the driver program; the second sketch below shows how executor resources are typically configured.

  4. RDD: The RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark: an immutable, partitioned collection of objects that can be processed in parallel and recomputed from its lineage if a partition is lost (third sketch below).

  5. Data Sources: Spark can read and write data from various sources such as HDFS and Amazon S3 through its connectors, and it also integrates with external systems such as Apache Hive, Apache Cassandra, and JDBC databases (fourth sketch below).

  6. Modules: Spark ships with several built-in modules: Spark SQL, Spark Streaming, MLlib, and GraphX. These provide higher-level abstractions on top of the core engine, making it easier to perform specific tasks such as structured data processing, stream processing, machine learning, and graph processing (final sketch below).
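
To make the first two points concrete, here is a minimal PySpark driver sketch. The application name is illustrative, and the "local[2]" master URL is only an assumption for running without a cluster; a real deployment would pass a YARN, Mesos, or standalone master URL instead.

```python
from pyspark.sql import SparkSession

# A minimal driver program. "local[2]" runs Spark in-process with 2 threads;
# swap in "yarn" or a standalone URL like "spark://<host>:7077" (hypothetical
# host) to hand resource allocation over to a real cluster manager.
spark = (
    SparkSession.builder
        .appName("architecture-demo")  # illustrative name
        .master("local[2]")
        .getOrCreate()
)

# The session wraps the SparkContext, the classic entry point to Spark.
sc = spark.sparkContext
print(sc.version)

spark.stop()
```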
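
Executor resources are usually requested through configuration when the session is created or at spark-submit time. This sketch uses the standard spark.executor.* settings; the specific values are arbitrary, and in local mode they are largely ignored.

```python
from pyspark.sql import SparkSession

# Request executor resources with the standard spark.executor.* settings.
# The values are arbitrary examples; under YARN or the standalone manager
# they size the executor JVMs, while local mode mostly ignores them.
spark = (
    SparkSession.builder
        .appName("executor-sizing-demo")
        .config("spark.executor.instances", "4")  # executor processes to request
        .config("spark.executor.memory", "2g")    # heap per executor
        .config("spark.executor.cores", "2")      # concurrent tasks per executor
        .getOrCreate()
)

spark.stop()
```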
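
A small RDD sketch, again assuming a local master: a Python collection is distributed into an RDD, transformed lazily, and an action triggers the parallel computation on the executors.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

# parallelize() turns a local collection into an immutable, partitioned RDD.
numbers = sc.parallelize(range(1, 11))

# map() and filter() are lazy transformations; reduce() is an action that
# actually schedules tasks on the executors.
total = (numbers.map(lambda x: x * x)
                .filter(lambda x: x % 2 == 0)
                .reduce(lambda a, b: a + b))
print(total)  # 220: the sum of the even squares 4 + 16 + 36 + 64 + 100

spark.stop()
```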
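
Reading and writing data sources goes through the DataFrame reader/writer API. The paths below are hypothetical placeholders, not real locations; changing the scheme or format targets a different store.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasource-demo").master("local[2]").getOrCreate()

# The paths here are hypothetical placeholders, not real locations.
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# The same API reaches other stores by changing the scheme or format,
# e.g. "s3a://bucket/key" for Amazon S3 or .format("jdbc") for databases.
df.write.mode("overwrite").parquet("hdfs:///data/events_parquet")

spark.stop()
```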
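
Finally, as one example of the higher-level modules, Spark SQL lets you register a DataFrame as a temporary view and query it with SQL. The data is inlined purely for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").master("local[2]").getOrCreate()

# Build a small DataFrame inline (illustrative data only).
people = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# Spark SQL: register the DataFrame as a view and query it declaratively.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```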