
How To Install Apache Spark on Debian Stretch

Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark can be configured with multiple cluster managers, and it can also run in local mode and standalone mode. Standalone mode is a good fit for developing Spark applications. Spark processes run in the JVM, so Java must be installed on every machine that will run Spark jobs.

nano /etc/apt/sources.list
deb bionic main
apt install dirmngr
apt-key adv --keyserver hkp:// --recv-keys EA8CACC073C3DB2A

apt update
apt install oracle-java11-installer
apt install oracle-java11-set-default

tar xvzf spark-2.4.3-bin-hadoop2.7.tgz
ln -s spark-2.4.3-bin-hadoop2.7 spark

nano ~/.bashrc
source ~/.bashrc
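
The post does not show which lines go into ~/.bashrc. A minimal sketch, assuming the spark symlink created above sits in your home directory (adjust the path if you unpacked the archive elsewhere), could be:

```shell
# Assumption: the "spark" symlink created above sits in your home directory.
export SPARK_HOME="$HOME/spark"
# Put the Spark launch scripts (spark-shell, spark-submit) and the
# standalone start/stop scripts on the PATH.
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
```

After source ~/.bashrc, running echo $SPARK_HOME should print the directory you chose.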

Next we will write a basic Scala application that loads a file and prints its content to the console. But before we start writing any application, let's get familiar with a few terms used in Spark applications.

    Application jar
        The user program and its dependencies are bundled into the application jar so that all dependencies are available to the application.
    Driver program
        It acts as the entry point of the application. This is the process that starts the complete execution.
    Cluster Manager
        This is an external service that manages the resources needed for job execution.
        It can be the standalone Spark manager, Apache Mesos, YARN, etc.
    Deploy Mode
        Cluster: the driver runs inside the cluster.
        Client: the driver runs outside the cluster, on the machine that submits the job.
    Worker Node
        This is the node that runs the application program on the machine which contains the data.
    Executor
        A process launched on the worker node that runs tasks.
        It uses the worker node's resources and memory.
    Task
        The fundamental unit of a job, run by the executor.
    Job
        It is a combination of multiple tasks.
    Stage
        Each job is divided into smaller sets of tasks called stages. Stages run in sequence, and later stages depend on earlier ones.


    SparkContext
        It gives the application program access to the distributed cluster.
        It acts as a handle to the resources of the cluster.
        We can pass custom configuration using the SparkContext object.
        It can be used to create RDDs, accumulators and broadcast variables.
    RDD (Resilient Distributed Dataset)
        The RDD is the core of Spark's API.
        It distributes the source data into multiple chunks on which we can perform operations simultaneously.
        Various transformations and actions can be applied to an RDD.
        An RDD is created through the SparkContext.
    Accumulator
        This is used to carry a shared variable across all partitions.
        Accumulators can be used to implement counters (as in MapReduce) or sums.
        An accumulator's value can only be read by the driver program.
        It is created through the SparkContext.
    Broadcast Variable
        Again, a way of sharing variables across the partitions.
        It is a read-only variable.
        It allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task, which reduces communication cost.
        Any common data that is needed by each stage is distributed across all the nodes.
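
Several of the terms above (application jar, cluster manager, deploy mode) map directly onto spark-submit options. The sketch below only illustrates that mapping. The jar name, main class and master URL are hypothetical placeholders, not values from this post, and the job is submitted only if spark-submit and the jar actually exist.

```shell
#!/bin/sh
# All values below are hypothetical placeholders, not taken from the post.
app_jar="my-app.jar"            # application jar: user program plus dependencies
main_class="com.example.Main"   # class whose main() starts the driver program
master="local[2]"               # master URL; local[2] means local mode with 2 threads
deploy_mode="client"            # client: driver runs on the submitting machine

# Show the command that the settings above would produce.
submit_cmd="spark-submit --class $main_class --master $master --deploy-mode $deploy_mode $app_jar"
echo "$submit_cmd"

# Submit only when Spark is installed and the jar is present.
if command -v spark-submit >/dev/null 2>&1 && [ -f "$app_jar" ]; then
    spark-submit --class "$main_class" --master "$master" \
        --deploy-mode "$deploy_mode" "$app_jar"
fi
```

To target a real cluster manager instead of local mode, you would change the master value, for example to a spark:// URL for the standalone manager or to yarn.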

Spark provides programming APIs for manipulating data in four languages: Java, R, Scala and Python. Interactive shells are available for three of them: R, Scala and Python. Spark does not currently provide an interactive shell for Java.

ETL Operation in Apache Spark

In this section we will learn a basic ETL (Extract, Transform and Load) operation in the interactive shell.
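
The shell session itself is not included in the post as captured here. As a rough sketch only, the script below writes a small input file and a few Scala commands, then feeds them to spark-shell if it is installed. The working directory, file names and the chosen transformation (dropping empty lines and upper-casing the rest) are illustrative assumptions, not the author's example.

```shell
#!/bin/sh
# Hypothetical working directory and file names, chosen for illustration only.
workdir="$(mktemp -d)"
printf 'alpha\n\nbeta\ngamma\n' > "$workdir/input.txt"

# Scala commands for the Spark shell. "sc" is the SparkContext that
# spark-shell creates automatically.
cat > "$workdir/etl.scala" <<EOF
val lines = sc.textFile("$workdir/input.txt")              // extract
val cleaned = lines.filter(_.nonEmpty).map(_.toUpperCase)  // transform
cleaned.saveAsTextFile("$workdir/output")                  // load
System.exit(0)
EOF

# Run the script only if spark-shell is on the PATH.
if command -v spark-shell >/dev/null 2>&1; then
    spark-shell -i "$workdir/etl.scala"
else
    echo "spark-shell not found; script written to $workdir/etl.scala"
fi
```

Note that saveAsTextFile writes a directory of part files rather than a single file, and it fails if the output directory already exists.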


4 comments on “How To Install Apache Spark on Debian Stretch”


Hadoop (docker-search) on Ubuntu:

nano /etc/apt/sources.list
deb bionic main
apt install dirmngr
apt-key adv --keyserver hkp:// --recv-keys EA8CACC073C3DB2A

apt update
apt install oracle-java11-installer
apt install oracle-java11-set-default

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/ >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost

tar xzf hadoop-3.2.0.tar.gz
mv hadoop-3.2.0 hadoop

nano ~/.bashrc
export HADOOP_HOME=/root/hadoop

nano /root/hadoop/etc/hadoop/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64

hdfs namenode -format
cd /root/hadoop/sbin/

cd /root/hadoop/etc/hadoop
nano core-site.xml

nano hdfs-site.xml

nano mapred-site.xml

nano yarn-site.xml

hdfs namenode -format
cd $HADOOP_HOME/sbin/

nano ~/.bashrc
tar xvzf spark-2.4.3-bin-hadoop2.7.tgz
ln -s spark-2.4.3-bin-hadoop2.7 spark

apt-get install hadoop-2.7 hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker

nano ~/.bashrc
source ~/.bashrc

apt-get install apt-transport-https ca-certificates curl gnupg2 software-properties-common
curl -fsSL | apt-key add -
add-apt-repository "deb [arch=amd64] stretch stable"

apt-get update
apt-get install docker-ce
systemctl status docker

docker search hadoop
docker pull hadoop
docker images
docker run -i -t hadoop /bin/bash
docker ps
docker ps -a

docker start
docker stop

docker attach

Running your first crawl job in minutes

Starts docker container and forwards ports to host

Inject seed urls
/data/sparkler/bin/ inject -id 1 -su ''

Start the crawl job
/data/sparkler/bin/ crawl -id 1 -tn 100 -i 2 # id=1, top 100 URLs, do -i=2 iterations

Running Sparkler with seed urls file:
nano sparkler/bin/seed-urls.txt
copy paste your urls
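
If you prefer not to paste into an editor, the seed file can also be written from the command line. This is a minimal sketch that assumes the file holds one URL per line. The two example.com and example.org URLs are placeholders and the file is written to the current directory, so replace both with your own values.

```shell
#!/bin/sh
# Hypothetical seed file location and placeholder URLs; replace with your own.
seed_file="seed-urls.txt"
cat > "$seed_file" <<'EOF'
https://example.com/
https://example.org/
EOF

# Count the seed lines as a quick sanity check.
seed_count="$(grep -c '^http' "$seed_file")"
echo "wrote $seed_count seed URLs to $seed_file"
```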

Inject the seed URLs with the following command:
/sparkler/bin/ inject -id 1 -sf seed-urls.txt

Start the crawl job.
To crawl until all new URLs are exhausted, use -i -1. Example: /data/sparkler/bin/ crawl -id 1 -i -1

Access the dashboard http://localhost:8983/banana/ (forwarded from docker image).


Apache Solr is an open-source search platform written in Java. Solr provides full-text search, spelling suggestions, custom document ordering and ranking, and snippet generation and highlighting. We will help you install Apache Solr on Debian using Solution Point VPS Cloud.

apt install default-jdk
java -version

tar xzf solr-8.2.0.tgz solr-8.2.0/bin/ --strip-components=2
bash ./ solr-8.2.0.tgz

systemctl stop solr
systemctl start solr
systemctl status solr

su - solr -c "/opt/solr/bin/solr create -c spcloud1 -n data_driven_schema_configs"


Your own Rocketchat Server on Ubuntu:

snap install rocketchat-server
snap refresh rocketchat-server
service snap.rocketchat-server.rocketchat-server status
sudo service snap.rocketchat-server.rocketchat-mongo status
sudo service snap.rocketchat-server.rocketchat-caddy status

sudo journalctl -f -u snap.rocketchat-server.rocketchat-server
sudo journalctl -f -u snap.rocketchat-server.rocketchat-mongo
sudo journalctl -f -u snap.rocketchat-server.rocketchat-caddy

How do I back up my snap data?
sudo service snap.rocketchat-server.rocketchat-server stop
sudo service snap.rocketchat-server.rocketchat-mongo status | grep Active
Active: active (running) (…)
sudo snap run rocketchat-server.backupdb

sudo service snap.rocketchat-server.rocketchat-server start
sudo service snap.rocketchat-server.rocketchat-server restart
sudo service snap.rocketchat-server.rocketchat-mongo restart
sudo service snap.rocketchat-server.rocketchat-caddy restart
