Setting Up a PySpark Notebook using Docker: A Step-by-Step Guide

Hello readers! Today, we will guide you through setting up a PySpark Jupyter Notebook using Docker. This tutorial is particularly useful for anyone keen on diving into the world of big data analytics with PySpark.

Step 1: Pull the Docker Image

Start by pulling the jupyter/all-spark-notebook image, which ships with Spark 3.5.0.

docker pull jupyter/all-spark-notebook:spark-3.5.0

Step 2: Set Up Your Workspace

Before we run our Docker image, we need to set up a directory where our Spark data will be stored.

Create a directory named sparkdata in your workspace. Remember, you'll need to adjust the path in the next step to match your system setup. You can store any CSV files you want to analyze in this directory.

We dropped in the CSV file from https://www.kaggle.com/datasets/lava18/google-play-store-apps (shared below for your convenience).

Step 3: Run the Docker Image

Now, let's run the Docker image and map our sparkdata directory to the container.

Replace /Users/datasciencefm/workspace/sparkdata with the path to your sparkdata directory.

docker run -d -P --name notebook -v /Users/datasciencefm/workspace/sparkdata:/sparkdata jupyter/all-spark-notebook:spark-3.5.0

Step 4: Retrieve the Port Mapping

To access the Jupyter Notebook, we need to know which host port has been mapped to the container's port 8888. Execute the following command:

docker port notebook 8888

For us, the output was 0.0.0.0:55002.

Step 5: Fetch the Notebook Token

For security reasons, Jupyter Notebooks are protected by tokens. Let's fetch this token:

docker logs --tail 3 notebook

Your output should resemble this URL: http://127.0.0.1:8888/lab?token=YOUR_TOKEN_HERE

Step 6: Access the Jupyter Notebook

Replace the default port in the URL with the one you identified in Step 4. In our case, the updated URL is:

http://127.0.0.1:55002/lab?token=82f8bf90a86103ca2e029241df7a59200c1aff788e1c2fbb

Simply paste this URL into your browser to open your Jupyter Notebook.

Step 7: Dive into PySpark!

Now, it's time to get our hands dirty with PySpark. Start a new Python notebook and enter the following code:

from pyspark.sql import SparkSession

# Spark session & context
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext

# Read the CSV file from the mounted /sparkdata folder
csv_path = "/sparkdata/googleplaystore_user_reviews.csv"  # Adjust the file name to match the CSV you placed in sparkdata
df = spark.read.csv(csv_path, header=True, inferSchema=True)  # Assumes the CSV has a header row

# Show the DataFrame
df.show()

# To run an action on the DataFrame, for example counting the rows:
print("Number of rows:", df.count())

This will initialize a Spark session, read a CSV file from the /sparkdata directory, and display its contents.
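
Before moving on, it's worth confirming that Spark inferred the schema you expect. A quick check (the column names depend on which CSV you placed in sparkdata; for the Kaggle user-reviews file they should include Translated_Review, which the later examples rely on):

# Inspect the inferred schema and column names before any analysis
df.printSchema()
print(df.columns)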

Extract top keywords from the reviews column

from pyspark.sql.functions import split, explode, lower, col
from pyspark.ml.feature import StopWordsRemover
from collections import Counter

# Tokenize the reviews
df_words = df.withColumn("word", explode(split(lower(col("Translated_Review")), "\\W+")))

# Remove stop words
stop_words = set(StopWordsRemover.loadDefaultStopWords("english"))  # Default English stop words
df_filtered = df_words.filter((df_words.word != "") & (~df_words.word.isin(stop_words)))  # Also drop the empty tokens produced by the split

# Count word frequency
word_frequency = df_filtered.groupBy("word").count().orderBy(col("count").desc())

# Display the top keywords
word_frequency.show()

# To pull the results back into Python as (word, count) tuples
keywords_list = word_frequency.rdd.map(lambda row: (row["word"], row["count"])).collect()  # Use item access: row.count is the built-in tuple method, not the column
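
If you only need the first few keywords rather than the whole table, a small follow-up like this keeps the collected result small (the top-20 cutoff below is just an example):

# Collect only the top 20 keywords instead of the full frequency table
top_keywords = word_frequency.limit(20).collect()
for row in top_keywords:
    print(row["word"], row["count"])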

Extract top bi-grams and tri-grams from the reviews column

from pyspark.sql.functions import split, col, explode, lower, when
from pyspark.ml.feature import NGram, StopWordsRemover

# Replace null values in the Translated_Review column with an empty string
df = df.withColumn("Translated_Review", when(col("Translated_Review").isNull(), "").otherwise(col("Translated_Review")))

# Tokenize the reviews
df_tokenized = df.withColumn("words", split(lower(col("Translated_Review")), "\\W+"))

# Remove stop words
stop_words = StopWordsRemover.loadDefaultStopWords("english")
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words", stopWords=stop_words)
df_filtered = remover.transform(df_tokenized)

# Generate bi-grams
bigram = NGram(n=2, inputCol="filtered_words", outputCol="bigrams")
df_bigrams = bigram.transform(df_filtered)

# Generate tri-grams
trigram = NGram(n=3, inputCol="filtered_words", outputCol="trigrams")
df_trigrams = trigram.transform(df_bigrams)

# Count bi-gram frequency
bigram_freq = df_bigrams.withColumn("bigram", explode(col("bigrams"))) \
    .groupBy("bigram").count() \
    .orderBy(col("count").desc())

# Count tri-gram frequency
trigram_freq = df_trigrams.withColumn("trigram", explode(col("trigrams"))) \
    .groupBy("trigram").count() \
    .orderBy(col("count").desc())

# Display results
bigram_freq.show()
trigram_freq.show()
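
One design note: each show() above recomputes the tokenization and stop-word removal from scratch. If you rerun these counts or work with a larger file, caching df_filtered first avoids repeating that work; a minimal, optional sketch:

# Optional: cache the filtered tokens so both n-gram aggregations reuse them
df_filtered.cache()
df_filtered.count()  # Materialize the cache before re-running the frequency counts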

Finding similar words using Word2Vec algorithm

from pyspark.ml.feature import Word2Vec, Tokenizer
from pyspark.sql.functions import col, when

# Handle null values
df = df.withColumn("Translated_Review", when(col("Translated_Review").isNull(), "").otherwise(col("Translated_Review")))

# Tokenization
tokenizer = Tokenizer(inputCol="Translated_Review", outputCol="words")
df_words = tokenizer.transform(df)

# Apply Word2Vec
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="words", outputCol="features")
model = word2Vec.fit(df_words)
df_vectors = model.transform(df_words)

input_word = "good"  # replace 'good' with your desired word
num_synonyms = 10  # number of similar words you want to fetch

# Find synonyms using the trained Word2Vec model
synonyms = model.findSynonyms(input_word, num_synonyms)

synonyms.show()
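
findSynonyms returns a small DataFrame with word and similarity columns, so pulling the results back into plain Python is straightforward:

# Collect the synonyms as (word, cosine similarity) pairs
synonym_pairs = [(row["word"], row["similarity"]) for row in synonyms.collect()]
print(synonym_pairs)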

Access Our Spark Cluster

Interested in enhancing your data engineering skills? Apply now to gain access to our Spark cluster:

  • Hands-on Learning: Experience the power of learning on an actual Spark cluster.
  • End-to-End Workflows: Build and test complete data processing workflows.
  • Jupyter Notebooks: Use Jupyter Notebooks to write and test your PySpark and SparkSQL code.
  • Curriculum Access: From beginner to advanced, follow a structured curriculum to solve complex problems.
  • Industry Datasets: Work with datasets like MovieLens for recommender systems and Common Crawl for NLP tasks, plus many more.

Conclusion

Congratulations! You've successfully set up a PySpark environment using Docker and executed your first few PySpark commands in a Jupyter Notebook.

The world of big data analytics is vast and offers numerous exciting challenges. With the tools you've set up today, you're well-equipped to tackle them head-on. Dive in, explore, and don't hesitate to experiment with your data using PySpark.

Happy data analyzing!