Getting Started With PySpark on Ubuntu with Jupyter Notebook

Apache Spark is the future of big data platforms. But why is Apache Spark so popular, and what makes it a go-to solution for big data projects? It is an open-source, scalable, cluster-computing framework for analytics applications. The main feature of Spark is its in-memory computing, which greatly increases processing speed.

Spark offers lightning-fast data processing capabilities and supports various programming languages such as Python, Scala, Java, and R. You can use any of these programming languages to leverage this outstanding big data environment and build sophisticated applications.

We have already seen the benefits of Python for programming, such as its gentle learning curve and readable code. Using it for data analysis and processing with Spark can further simplify the development of robust, performance-oriented solutions for processing data on a large scale.

In this programming article, we will walk through the steps to install PySpark on Ubuntu and use it in conjunction with Jupyter Notebook for the future data science projects on our blog.

Jupyter Notebook is a web application that enables you to run Python code. It makes coding more interactive and lets you experiment with the language freely. The best part is that Jupyter Notebook also supports other languages like R, Haskell, and Ruby.

Before getting started with Apache Spark on Ubuntu with Jupyter Notebook, let’s first explore its various features.


Apache Spark Features

1. Speed

Apache Spark can run workloads up to 100 times faster than traditional disk-based engines such as Hadoop MapReduce. It achieves this high performance for both batch and streaming data using a DAG scheduler, an efficient query optimizer, and a physical execution engine.

2. Ease of use

Spark makes it easier for developers to build parallel applications using Java, Python, Scala, R, and SQL shells. It provides high-level APIs for developing applications using any of these programming languages.

3. Powerful Caching

Apache Spark has a versatile in-memory caching mechanism, which makes it very fast. With this powerful caching, Spark can store the results of computations so that they can be accessed quickly any number of times.
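
To see what caching looks like in practice, here is a minimal PySpark sketch. It assumes a SparkSession named spark is already available (we set up PySpark later in this article), and the example DataFrame is purely illustrative.

numbers = spark.range(1000000)  # example DataFrame with one million rows
numbers.cache()                 # ask Spark to keep this data in memory
numbers.count()                 # the first action computes the result and fills the cache
numbers.count()                 # later actions reuse the in-memory copy instead of recomputing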

4. Generality

[Image: the Spark stack. Image Source: Apache Spark]

Apache Spark includes various powerful libraries – MLlib for machine learning, GraphX for graph processing, Spark Streaming, and Spark SQL with DataFrames. You can use any one or several of these libraries in your applications.

5. Runs Everywhere

Spark runs everywhere: on Hadoop, Kubernetes, Apache Mesos, as a standalone cluster, or in the cloud.

Steps to install PySpark on Ubuntu

PySpark is an API that enables Python to interact with Apache Spark.
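
To give a feel for what PySpark code looks like, here is a minimal sketch of a program that creates a DataFrame and prints it. The application name and the sample data are placeholders chosen for illustration.

from pyspark.sql import SparkSession

# create (or reuse) a local SparkSession, the entry point of a PySpark application
spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()

# build a tiny DataFrame from in-memory data and display it
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()

spark.stop()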

Step 1: Install Apache Spark.

Download Apache Spark from the official downloads page and extract the downloaded package using this command: ~$ tar xvzf spark-2.4.5-bin-hadoop2.7.tgz
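
If you prefer to download from the terminal, you can fetch the same package with wget. The URL below assumes version 2.4.5 from the Apache archive; adjust it for the release you want.

~$ wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz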

 


Step 2: Rename the extracted directory to spark and move it to the /usr/lib directory using these terminal commands.

~$ sudo mv spark-2.4.5-bin-hadoop2.7 spark

~$ sudo mv spark /usr/lib

Step 3: Now add the JAVA_HOME and SPARK_HOME paths to the bashrc file.

First, open the bashrc file with ~$ sudo vim ~/.bashrc and add the following lines:

export JAVA_HOME=<your java path>

export PATH=$PATH:$JAVA_HOME/bin

export SPARK_HOME=/usr/lib/spark

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Step 4: Now verify that Spark installed successfully. Reload the shell configuration with ~$ source ~/.bashrc and then run spark-shell.

If everything goes well, you will see the Spark welcome banner followed by the interactive Scala prompt (scala>).

 

Steps to install Jupyter Notebook on Ubuntu

Step 1: Update the local apt package index and install pip and the Python development headers with these commands.

~$ sudo apt update

~$ sudo apt install python3-pip python3-dev

Step 2: Next, create a virtual environment. A virtual environment helps you manage your project and its dependencies in isolation. To install virtualenv, run:

~$ sudo pip3 install virtualenv

Step 3: In this step, we will create a virtual environment in the home directory.

~$ virtualenv ~/my_env

Step 4: Now, activate the virtual environment.

~$ source ~/my_env/bin/activate

After executing this command, you should see "(my_env) ~$", which indicates that your environment is activated.

Step 5: In this virtual environment, install Jupyter Notebook using this command.

(my_env) ~$ pip install jupyter

Step 6: Run this command. If you are running it on a local machine, it will open Jupyter Notebook in your browser automatically; otherwise, copy the link displayed in the terminal into your browser.

(my_env) ~$ jupyter notebook

To close Jupyter Notebook, press Control + C and press Y for confirmation.
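
With both Spark and Jupyter in place, one common way to use PySpark from a notebook is the findspark package, which makes the Spark installation visible to Python. This is an optional extra step not covered above; the install command and the SPARK_HOME path below assume the setup from the earlier steps.

(my_env) ~$ pip install findspark

Then, in a notebook cell:

import findspark
findspark.init("/usr/lib/spark")  # point findspark at SPARK_HOME from Step 3

from pyspark.sql import SparkSession

# start a local SparkSession and run a quick sanity check
spark = SparkSession.builder.appName("notebook-test").getOrCreate()
spark.range(5).show()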

Wrapping Up

Apache Spark is one of the largest open-source projects for data processing. It is a unified analytics engine that has been widely adopted by enterprises and small businesses alike because of its scalability and performance. In this article, you set up PySpark on Ubuntu with Jupyter Notebook. In the next tutorial, we will write our first PySpark program. Subscribe to our mailing list to get all the latest updates straight to your inbox.

If you run into any trouble, let us know in the comments section and I will help you set up PySpark on your Ubuntu machine. If you would like a similar tutorial for Windows, let us know as well, and we will update this article to include the steps to install PySpark on Windows.

