Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, along with an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark can be configured with multiple cluster managers; besides that, it can run in local mode and standalone mode. Standalone mode is sufficient for developing applications in Spark. Spark processes run in the JVM, so Java should be pre-installed on the machines on which we have to run a Spark job.
# Add the Linux Uprising Java PPA (the target file name below is one reasonable choice)
echo "deb http://ppa.launchpad.net/linuxuprising/java/ubuntu bionic main" | tee /etc/apt/sources.list.d/linuxuprising-java.list
apt install dirmngr
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EA8CACC073C3DB2A
# Refresh the package index so apt can see the new repository
apt update
apt install oracle-java11-installer
apt install oracle-java11-set-default
# Download Spark 2.4.3 pre-built for Hadoop 2.7 from the Apache archive, then unpack it
wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
tar xvzf spark-2.4.3-bin-hadoop2.7.tgz
ln -s spark-2.4.3-bin-hadoop2.7 spark
Next we will write a basic Scala application to load a file and print its contents to the console. But before we start writing any Spark application, let's get familiar with a few terms used in Spark applications:
Application JAR
The user program and its dependencies are bundled into the application JAR so that all of its dependencies are available to the application.

Driver Program
It acts as the entry point of the application.
This is the process which starts the complete execution.

Cluster Manager
This is an external service which manages the resources needed for job execution.
It can be the standalone Spark manager, Apache Mesos, YARN, etc.

Deploy Mode
Cluster – here the driver runs inside the cluster.
Client – here the driver is not part of the cluster; it is only responsible for job submission.

Worker Node
This is a node that runs the application code in the cluster, ideally on a machine that contains the data.

Executor
A process launched on a worker node that runs tasks.
It uses the worker node's resources and memory.

Task
The fundamental unit of work that is run by an executor.

Job
It is a combination of multiple tasks.

Stage
Each job is divided into smaller sets of tasks called stages.
Stages run sequentially and depend on each other.
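To make these terms concrete, here is a minimal sketch in the Scala shell (where sc is the ready-made SparkContext); the partition count and sample data are illustrative:

// A transformation only describes the work; nothing runs yet
val numbers = sc.parallelize(1 to 100, 4) // 4 partitions -> 4 tasks per stage
val doubled = numbers.map(_ * 2)          // narrow transformation, stays in one stage
// An action triggers a job, which Spark splits into stages and tasks
val total = doubled.reduce(_ + _)
println(total) // 10100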
SparkContext
It gives the application program access to the distributed cluster.
It acts as a handle to the resources of the cluster.
We can pass custom configuration using the SparkContext object.
It can be used to create RDDs, accumulators and broadcast variables.
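In the interactive shell this object already exists as sc; a standalone application creates it explicitly. A minimal sketch, assuming local mode and an illustrative application name and setting:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SparkIntro")           // hypothetical application name
  .setMaster("local[2]")              // local mode with 2 threads
  .set("spark.executor.memory", "1g") // example of a custom configuration
val sc = new SparkContext(conf)
// The context is now the handle for creating RDDs, accumulators and broadcast variables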
RDD (Resilient Distributed Dataset)
RDD is the core abstraction of Spark's API.
It distributes the source data into multiple partitions, over which we can operate in parallel.
Various transformations and actions can be applied over an RDD.
An RDD is created through the SparkContext.
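The sketch below, assuming the interactive shell with its ready-made sc and some illustrative sample data, shows an RDD being created and a transformation and an action applied to it:

val lines = sc.parallelize(Seq("spark is fast", "spark is general purpose"))
// Transformations are lazy; they only describe the computation
val words = lines.flatMap(_.split(" "))
val sparkWords = words.filter(_ == "spark")
// Actions trigger execution and return a result to the driver
println(sparkWords.count()) // 2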
Accumulators
They are used to carry a shared variable across all partitions.
They can be used to implement counters (as in MapReduce) or sums.
An accumulator's value can only be read by the driver program.
It is created through the SparkContext.
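A minimal sketch, assuming Spark 2.x's built-in long accumulator and illustrative sample data:

// The accumulator is created through the SparkContext
val badRecords = sc.longAccumulator("badRecords")
val records = sc.parallelize(Seq("1", "2", "oops", "4"))
records.foreach { r =>
  // Tasks running on executors can only add to it
  if (!r.forall(_.isDigit)) badRecords.add(1)
}
// Only the driver can read the accumulated value
println(badRecords.value) // 1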
Broadcast Variables
They are another way of sharing variables across partitions.
A broadcast variable is read-only.
It allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task, thus avoiding repeated transfer of the same data.
Any common data that is needed by every stage is distributed across all the nodes.
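A minimal sketch with a hypothetical lookup table shows the pattern:

// Broadcast a small lookup map once to every node
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
val codes = sc.parallelize(Seq("IN", "US", "IN"))
// Each task reads the cached copy via .value instead of receiving its own copy
val countries = codes.map(c => countryNames.value.getOrElse(c, "Unknown"))
countries.collect().foreach(println)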
Spark provides programming APIs in Java, R, Scala and Python to manipulate data. An interactive shell is available for three of these four languages: R, Scala and Python. Unfortunately, Java does not provide an interactive shell as of now.
ETL Operation in Apache Spark
In this section we will walk through a basic ETL (Extract, Transform and Load) operation in the interactive shell.
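As a preview, here is a minimal sketch of such a flow in the Scala shell, assuming a hypothetical input file input.txt in the current directory:

// Extract: load the raw file as an RDD of lines
val raw = sc.textFile("input.txt") // hypothetical file name
// Transform: split into words and count occurrences
val counts = raw.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
// Load: print to the console and/or write the result out
counts.collect().foreach(println)
counts.saveAsTextFile("word-counts") // hypothetical output directory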