CommonCrawl on Spark: Reliably Processing 28TB of Data (6300 WARCs, 7 CC Dumps) on a Spark Cluster Crunch terabytes of data with a fault-tolerant processing pipeline on Apache Spark!
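The heart of that fault tolerance is isolating failures per record and per file, so one corrupt WARC doesn't kill a multi-terabyte job. Here is a minimal sketch of the idea, assuming warcio is installed and using hypothetical local WARC paths; it is an illustration, not the post's exact pipeline.

```python
from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator

spark = SparkSession.builder.appName("warc-fault-tolerant").getOrCreate()

def process_warc(path):
    """Yield (path, url, status) per response record; swallow per-record errors."""
    out = []
    try:
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                try:
                    if record.rec_type == "response":
                        url = record.rec_headers.get_header("WARC-Target-URI")
                        out.append((path, url, "ok"))
                except Exception as exc:  # isolate a single corrupt record
                    out.append((path, None, f"record-error: {exc}"))
    except Exception as exc:  # isolate a single unreadable file
        out.append((path, None, f"file-error: {exc}"))
    return out

paths = ["/data/warcs/example-00000.warc.gz"]  # hypothetical input list
results = spark.sparkContext.parallelize(paths).flatMap(process_warc)
print(results.take(5))
```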
BigBanyanTree: Enriching WARC Data With IP Information from MaxMind Enriching Common Crawl WARC files with geolocation data from MaxMind unlocks a wealth of insights. Learn how to do it using Apache Spark!
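Since MaxMind's database reader is not serializable, a common pattern is to open it once per partition on the executors. A minimal sketch, assuming the geoip2 package and a GeoLite2-City.mmdb file at a hypothetical path on every worker:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("maxmind-enrich").getOrCreate()

def enrich_partition(rows):
    import geoip2.database
    # Open the reader once per partition; it is not serializable.
    with geoip2.database.Reader("/data/GeoLite2-City.mmdb") as reader:
        for ip in rows:
            try:
                resp = reader.city(ip)
                yield (ip, resp.country.iso_code, resp.city.name)
            except Exception:
                yield (ip, None, None)  # unknown / private IPs

ips = spark.sparkContext.parallelize(["8.8.8.8", "1.1.1.1"])
df = ips.mapPartitions(enrich_partition).toDF(["ip", "country", "city"])
df.show()
```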
Serializability in Spark: Using Non-Serializable Objects in Spark Transformations Discover strategies to effectively harness Spark's distributed computing power when working with third-party or custom library objects that aren't serializable.
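To give a flavor of one such strategy: construct the object on the executors inside mapPartitions instead of shipping it from the driver. A minimal sketch with a hypothetical NonSerializableClient standing in for any unpicklable third-party object:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serialization-demo").getOrCreate()

class NonSerializableClient:
    """Pretend this holds a socket or file handle that pickle can't ship."""
    def lookup(self, x):
        return x * 2

def with_client(partition):
    client = NonSerializableClient()  # created per partition, on the executor
    for item in partition:
        yield client.lookup(item)

rdd = spark.sparkContext.parallelize(range(10), numSlices=2)
print(rdd.mapPartitions(with_client).collect())
```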
Building ScriptScope Part 1: Extracting Top Used JS Libraries from Common Crawl using Apache Spark Learn how you can build website technology analysis tools like BuiltWith and Wappalyzer! In this blog, we identify the top-used JS libraries for 2024!
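Once script `src` URLs have been pulled out of the HTML, identifying libraries reduces to a DataFrame aggregation. A minimal sketch with toy data and a naive, illustrative regex (not the post's actual matching logic):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scriptscope-counts").getOrCreate()

srcs = spark.createDataFrame(
    [("https://cdn.example.com/jquery-3.7.1.min.js",),
     ("https://cdn.example.com/react.production.min.js",),
     ("https://cdn.example.com/jquery.min.js",)],
    ["src"],
)

# Match a few well-known library names in the URL (illustrative only).
libs = srcs.withColumn(
    "library",
    F.regexp_extract(F.lower(F.col("src")), r"(jquery|react|angular|vue)", 1),
).filter(F.col("library") != "")

libs.groupBy("library").count().orderBy(F.desc("count")).show()
```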
BigBanyanTree: Parsing HTML source code with Apache Spark & Selectolax Dive into the world of data extraction! Learn how to parse HTML source code from Common Crawl WARC files with Apache Spark and Selectolax, and unlock the insights hiding in raw web pages.
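For the extraction step itself, Selectolax's CSS selectors can run inside a Spark UDF. A minimal sketch, assuming selectolax is installed on the workers; the selector and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from selectolax.parser import HTMLParser

spark = SparkSession.builder.appName("selectolax-parse").getOrCreate()

@F.udf(returnType=ArrayType(StringType()))
def script_srcs(html):
    """Return the src attribute of every <script> tag in the page."""
    if not html:
        return []
    tree = HTMLParser(html)
    return [n.attributes.get("src") for n in tree.css("script")
            if n.attributes.get("src")]

df = spark.createDataFrame(
    [('<html><script src="/app.js"></script></html>',)], ["html"]
)
df.select(script_srcs("html").alias("srcs")).show(truncate=False)
```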
Zero to Spark - BigBanyanTree Cluster Setup This blog details the setup of an Apache Spark cluster in standalone mode for data engineering using Docker Compose on a dedicated Hetzner server. It also covers the setup of other utilities such as JupyterLab and a Llama 3.1 8B LLM service.
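Once the cluster is up, a notebook attaches to it by pointing a SparkSession at the standalone master. A minimal sketch, assuming a hypothetical spark-master hostname and the default 7077 port (the actual service names come from the compose file):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # standalone master URL (assumed)
    .appName("bigbanyantree")
    .config("spark.executor.memory", "4g")  # illustrative resource setting
    .getOrCreate()
)
print(spark.version)
```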