Apache Spark is widely seen as the future of big data platforms. But why is it so popular, and what makes it a go-to solution for big data projects? It is an open-source, scalable cluster-computing framework for analytics applications. Its main feature is in-memory computing, which increases processing speed.
Spark offers lightning-fast data processing and supports several programming languages, including Python, Scala, Java, and R. You can use any of these languages to build sophisticated applications on this big data platform.
We have already seen the benefits of Python for programming, such as being easy to learn and offering better code readability. Using it for data analysis and processing with Spark can further simplify the development of robust, performance-oriented solutions for processing data at scale.
In this article, we will walk through the steps to install PySpark on Ubuntu and use it in conjunction with Jupyter Notebook for the future data science projects on our blog.
Jupyter Notebook is a web application that lets you run Python code in the browser. It makes coding more interactive and lets you experiment with the language freely. The best part is that Jupyter Notebook also supports other languages such as R, Haskell, and Ruby.
Before getting started with Apache Spark on Ubuntu with Jupyter Notebook, let’s first explore its various features.
Apache Spark Features
1. Speed
Apache Spark can run some workloads up to 100 times faster than Hadoop MapReduce. It achieves high performance for both batch and streaming data using a DAG scheduler, an efficient query optimizer, and a physical execution engine.
2. Ease of use
Spark makes it easier for developers to build parallel applications in Java, Python, Scala, R, and SQL. It provides high-level APIs for these languages, and you can also use it interactively from the Scala, Python, R, and SQL shells.
3. Powerful Caching
Apache Spark has a versatile in-memory caching layer, which makes it very fast. With this caching, Spark can keep the results of a computation in memory so that they can be reused any number of times without being recomputed.
Apache Spark includes several powerful libraries: MLlib for machine learning, GraphX for graph processing, Spark Streaming, and Spark SQL and DataFrames. You can combine any of these libraries in the same application.
5. Runs Everywhere
Spark runs on Hadoop, Kubernetes, Apache Mesos, in standalone mode, or in the cloud.
Steps to install PySpark on Ubuntu
PySpark is an API that enables Python to interact with Apache Spark.
Step 1: Install Apache Spark.
Download Apache Spark from the official Apache Spark downloads page and extract the downloaded package using this command:
~$ tar xvzf spark-2.4.5-bin-hadoop2.7.tgz
Step 2: Rename the extracted directory to spark and move it to the /usr/lib directory using these terminal commands.
~$ mv spark-2.4.5-bin-hadoop2.7 spark
~$ sudo mv spark /usr/lib
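Step 3: Now add the JAVA_HOME and SPARK_HOME paths to the .bashrc file.
First, open the .bashrc file (sudo is not needed for a file in your own home directory):
~$ vim ~/.bashrc
and add these lines, with no spaces around the = sign:
export JAVA_HOME=<your java path>
export SPARK_HOME=/usr/lib/spark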
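If you are unsure what to use for the Java path, a small sketch like the following may help. It assumes GNU readlink (standard on Ubuntu) and that Spark lives in /usr/lib/spark as in Step 2. The function name guess_java_home and the PATH line are illustrative additions, not part of the original steps.

```shell
# Hypothetical helper: guess JAVA_HOME by resolving the `java` binary found on PATH.
guess_java_home() {
    java_bin="$(command -v java)" || return 1
    # Follow symlinks (for example /usr/bin/java -> /etc/alternatives/java -> real JVM),
    # then strip the trailing /bin/java to get the installation directory.
    dirname "$(dirname "$(readlink -f "$java_bin")")"
}

# SPARK_HOME defaults to the location used in Step 2 unless it is already set.
export SPARK_HOME="${SPARK_HOME:-/usr/lib/spark}"
# Assumption: add Spark's bin directory to PATH so that pyspark can be run by name.
export PATH="$PATH:$SPARK_HOME/bin"

if JAVA_HOME_GUESS="$(guess_java_home)"; then
    echo "Suggested JAVA_HOME: $JAVA_HOME_GUESS"
else
    echo "java not found on PATH; install a JDK first" >&2
fi
```

After editing .bashrc, run `source ~/.bashrc` or open a new terminal so the variables take effect.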
Step 4: Now verify that Spark installed successfully.
If everything goes well, Spark starts without errors and prints its version information.
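One way to run this check is to ask spark-submit for its version. This is a sketch that assumes Spark was moved to /usr/lib/spark as in Step 2. The function name check_spark is hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: confirm that a Spark installation is present and runnable.
# Uses the first argument, then SPARK_HOME, then /usr/lib/spark as the install location.
check_spark() {
    spark_home="${1:-${SPARK_HOME:-/usr/lib/spark}}"
    if [ -x "$spark_home/bin/spark-submit" ]; then
        # spark-submit --version prints the Spark version banner and exits.
        "$spark_home/bin/spark-submit" --version
    else
        echo "spark-submit not found under $spark_home" >&2
        return 1
    fi
}

check_spark || echo "Spark check failed; revisit Steps 1 to 3."
```

If the banner does not appear, the usual suspects are a wrong SPARK_HOME or a JAVA_HOME that does not point at a working Java installation.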
Steps to install Jupyter Notebook on Ubuntu
Step 1: Update the local apt package index and install pip and the Python headers with these commands.
~$ sudo apt update
~$ sudo apt install python3-pip python3-dev
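To confirm that the packages from this step are available before moving on, a quick check along these lines can be used. The helper name check_tools is hypothetical; it only reports whether each command can be found on PATH.

```shell
# Hypothetical helper: report which of the given commands are available on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool ($(command -v "$tool"))"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools python3 pip3 || echo "Install the missing packages with apt before continuing."
```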
Step 2: Next, install virtualenv. A virtual environment helps you manage a project and its dependencies in isolation from the rest of the system.
~$ sudo pip3 install virtualenv
Step 3: In this step, we will create a virtual environment in the home directory.
~$ virtualenv ~/my_env
Step 4: Now, activate the virtual environment.
~$ source ~/my_env/bin/activate
After executing this command, you should see “(my_env) ~$”, which indicates that your environment is activated.
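Besides the prompt prefix, the activation script exports a VIRTUAL_ENV variable, which gives a scriptable way to check the same thing. The sketch below uses a hypothetical function name, venv_status.

```shell
# Hypothetical helper: report whether a virtual environment is active in this shell.
# The activate script exports VIRTUAL_ENV, so an empty value means nothing is active.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active: $VIRTUAL_ENV"
    else
        echo "no virtual environment active"
        return 1
    fi
}

venv_status || echo "Run: source ~/my_env/bin/activate"
```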
Step 5: In this virtual environment, we will install the Jupyter Notebook using this command.
(my_env) ~$ pip install jupyter
Step 6: Run this command. If you are running it on a local machine, it will open your browser and start Jupyter Notebook; otherwise, copy the link displayed in the terminal into your browser.
(my_env) ~$ jupyter notebook
To close Jupyter Notebook, press Control + C and press Y for confirmation.
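To use the two installations together, one common approach is to tell Spark's pyspark launcher to use Jupyter as its driver front end through the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables. The sketch below assumes Spark is in /usr/lib/spark (or wherever SPARK_HOME points) and that the virtual environment with Jupyter is active. The function name pyspark_notebook is hypothetical.

```shell
# Hypothetical helper: start the PySpark launcher with Jupyter Notebook as the driver.
# PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are read by Spark's bin/pyspark script;
# they are set only for this one command, not for the rest of the shell session.
pyspark_notebook() {
    spark_home="${SPARK_HOME:-/usr/lib/spark}"
    PYSPARK_DRIVER_PYTHON=jupyter \
    PYSPARK_DRIVER_PYTHON_OPTS=notebook \
        "$spark_home/bin/pyspark" "$@"
}
```

Running `pyspark_notebook` from inside the activated my_env should open Jupyter Notebook with PySpark importable in new notebooks. Setting the variables per command rather than in .bashrc keeps the plain `pyspark` shell available as well.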
Apache Spark is one of the largest open-source projects for data processing. It is a unified analytics engine that has been widely adopted by enterprises and small businesses because of its scalability and performance. In this article, you set up PySpark on Ubuntu with Jupyter Notebook. In the next tutorial, we will write our first PySpark program.
If you run into any trouble, let us know in the comments section and we will help you set up PySpark on your Ubuntu machine. If you want a similar tutorial for Windows, let us know as well, and we will update this article to include the steps to install PySpark on Windows.