
Setting up PySpark in Jupyter Notebook

2019-07-31

I want to use PySpark from Jupyter Notebook for a convenient view of program output.

Preparation

This post assumes the configurations in this earlier post. I followed this article.

Method

There are two ways to set up PySpark with Jupyter Notebook. They are explained in detail in the article above. I would like to supplement the article by providing a summary and highlighting some caveats.

Option 1: Open notebook directly from PySpark

  1. Create the environment variables PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS and set them to jupyter and notebook, respectively.
  2. Open cmd and type pyspark; this should open Jupyter Notebook in the browser (see the sanity check after these steps).
  3. Run the following code in the notebook. (sample.txt is taken from the Wikipedia page of Apache Spark.)
     import random

     num_samples = 100000000

     # Monte Carlo estimate of pi: the fraction of random points in the unit
     # square that land inside the unit circle approaches pi/4.
     def inside(p):
         x, y = random.random(), random.random()
         return x*x + y*y < 1

     count = sc.parallelize(range(0, num_samples)).filter(inside).count()
     pi = 4 * count / num_samples
     print(pi)

     # Count and preview the lines of sample.txt.
     lines = sc.textFile("sample.txt")
     print(lines.count())
     print(lines.first())
    

    The output should look like this.
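
Because PYSPARK_DRIVER_PYTHON is set to jupyter, the pyspark launcher starts Jupyter itself and hands the notebook a ready-made SparkContext bound to the name sc. A minimal sanity check, assuming the notebook really was opened via pyspark (the appName shown in the comment is only the usual default, not something guaranteed here):

     # The pyspark launcher created the SparkContext for us; just inspect it.
     print(sc)           # e.g. <SparkContext master=local[*] appName=PySparkShell>
     print(sc.version)   # version of the Spark installation that launched the notebook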

Option 2: Invoke Spark environment in notebook on the fly

  1. Install the findspark module by running pip install findspark.
  2. Create an environment variable SPARK_HOME and set it to the path of the Spark installation (or pass the path directly to findspark.init(); see the sketch after these steps).
  3. Launch Jupyter Notebook.
  4. Paste the following code at the start of the notebook.
     import findspark
     findspark.init()
     import pyspark
    
  5. Run the following code.
     import random

     sc = pyspark.SparkContext()

     num_samples = 100000000

     # Monte Carlo estimate of pi, as in option 1.
     def inside(p):
         x, y = random.random(), random.random()
         return x*x + y*y < 1

     count = sc.parallelize(range(0, num_samples)).filter(inside).count()
     pi = 4 * count / num_samples
     print(pi)
     sc.stop()

     # Count and preview the lines of sample.txt.
     sc = pyspark.SparkContext()
     lines = sc.textFile("sample.txt")
     print(lines.count())
     print(lines.first())
     sc.stop()
    

    The output should look like this.
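
If setting SPARK_HOME system-wide is inconvenient, findspark.init() also accepts the Spark installation path as an argument. A minimal sketch, assuming Spark is unpacked at C:\spark\spark-2.4.3-bin-hadoop2.7 (a made-up path; substitute your own):

     import findspark
     # Pass the installation path explicitly instead of relying on SPARK_HOME.
     findspark.init("C:\\spark\\spark-2.4.3-bin-hadoop2.7")  # hypothetical path

     import pyspark
     sc = pyspark.SparkContext()
     print(sc.version)
     sc.stop()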

Note that option 1 does not require manually creating a SparkContext object, while option 2 does. As a result, if the notebook created in option 1 is opened from a regular Jupyter Notebook rather than from PySpark, the sc variable will not be defined. Conversely, if the notebook created in option 2 is opened from PySpark, the line sc = pyspark.SparkContext() is redundant, and the program will raise an error saying that only one SparkContext can run at a time.
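
One way to make the same notebook run under both setups is SparkContext.getOrCreate(), which reuses the context PySpark already started (option 1) and creates a fresh one otherwise (option 2). A minimal sketch, assuming pyspark is importable, i.e. the notebook was opened from PySpark or findspark.init() has already run:

     import pyspark
     # Reuses the sc provided by option 1, or creates a new one under option 2.
     sc = pyspark.SparkContext.getOrCreate()
     print(sc.version)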

