Spark's architecture has the following fundamental aspects:
Cluster Manager: Spark uses a cluster manager to handle resource allocation and scheduling. It can run on several cluster managers, including Apache Mesos, Hadoop YARN, and Spark's built-in standalone cluster manager.
Driver Program: The driver program controls the overall execution of the application and creates the SparkContext, which is the entry point to all Spark functionality.
Executors: Executors are processes launched on the worker nodes that perform the actual computation. Each executor runs as a separate JVM process and executes the tasks assigned to it by the driver program.
RDD: The RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark. It is an immutable, distributed collection of objects that can be processed in parallel across the cluster.
Data Sources: Spark can read and write data from sources such as HDFS and Amazon S3 through a range of connectors. It can also integrate with external systems like Apache Hive, Apache Cassandra, and JDBC databases.
Modules: Spark ships with several built-in modules: Spark SQL, Spark Streaming, MLlib, and GraphX. These provide higher-level abstractions over RDDs, making it easier to perform specific tasks such as structured data processing, machine learning, and graph processing.
Asked: 2021-08-21 11:00:00 +0000
Last updated: Oct 19 '21
How can Python import data from a centralized location?
How can Spring Boot and MySQL be utilized for CRUD operations?
How can SSL be used with CqlSessionFactoryBean in Spring Boot Cassandra?
Where does my Spring Boot application load its database from?
Can specific items be eliminated from Parquet and NoSQL Targets?
What is the method to compute the dimensions of data types like blob, map<text,text> in Cassandra?
How can I retrieve the Cassandra keyspace using double quotation marks?