CommonCrawl on Spark: Reliably Processing 28TB of Data (6300 WARCs, 7 CC Dumps) on a Spark Cluster Crunch terabytes of data with a fault-tolerant processing pipeline on Apache Spark!
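The heart of that fault tolerance is isolating failures per record and per file, so one corrupt WARC doesn't kill a multi-terabyte job. Here is a minimal sketch of the idea, assuming warcio is installed and using hypothetical local WARC paths; it is an illustration, not the post's exact pipeline.

```python
from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator

spark = SparkSession.builder.appName("warc-fault-tolerant").getOrCreate()

def process_warc(path):
    """Yield (path, url, status) per response record; swallow per-record errors."""
    out = []
    try:
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                try:
                    if record.rec_type == "response":
                        url = record.rec_headers.get_header("WARC-Target-URI")
                        out.append((path, url, "ok"))
                except Exception as exc:  # isolate a single corrupt record
                    out.append((path, None, f"record-error: {exc}"))
    except Exception as exc:  # isolate a single unreadable file
        out.append((path, None, f"file-error: {exc}"))
    return out

paths = ["/data/warcs/example-00000.warc.gz"]  # hypothetical input list
results = spark.sparkContext.parallelize(paths).flatMap(process_warc)
print(results.take(5))
```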
BigBanyanTree: Enriching WARC Data With IP Information from MaxMind Enriching Common Crawl WARC files with geolocation data from MaxMind unlocks a wealth of insights. Learn how to do it using Apache Spark!
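Since MaxMind's database reader is not serializable, a common pattern is to open it once per partition on the executors. A minimal sketch, assuming the geoip2 package and a GeoLite2-City.mmdb file at a hypothetical path on every worker:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("maxmind-enrich").getOrCreate()

def enrich_partition(rows):
    import geoip2.database
    # Open the reader once per partition; it is not serializable.
    with geoip2.database.Reader("/data/GeoLite2-City.mmdb") as reader:
        for ip in rows:
            try:
                resp = reader.city(ip)
                yield (ip, resp.country.iso_code, resp.city.name)
            except Exception:
                yield (ip, None, None)  # unknown / private IPs

ips = spark.sparkContext.parallelize(["8.8.8.8", "1.1.1.1"])
df = ips.mapPartitions(enrich_partition).toDF(["ip", "country", "city"])
df.show()
```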
Serializability in Spark: Using Non-Serializable Objects in Spark Transformations Discover strategies to effectively harness Spark's distributed computing power when working with third-party or custom library objects that aren't serializable.
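To give a flavor of one such strategy: construct the object on the executors inside mapPartitions instead of shipping it from the driver. A minimal sketch with a hypothetical NonSerializableClient standing in for any unpicklable third-party object:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serialization-demo").getOrCreate()

class NonSerializableClient:
    """Pretend this holds a socket or file handle that pickle can't ship."""
    def lookup(self, x):
        return x * 2

def with_client(partition):
    client = NonSerializableClient()  # created per partition, on the executor
    for item in partition:
        yield client.lookup(item)

rdd = spark.sparkContext.parallelize(range(10), numSlices=2)
print(rdd.mapPartitions(with_client).collect())
```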
Building ScriptScope Part 1: Extracting Top Used JS Libraries from Common Crawl using Apache Spark Learn how you can build website technology analysis tools like BuiltWith and Wappalyzer! In this blog, we identify the top-used JS libraries for 2024!
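Once script `src` URLs have been pulled out of the HTML, identifying libraries reduces to a DataFrame aggregation. A minimal sketch with toy data and a naive, illustrative regex (not the post's actual matching logic):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scriptscope-counts").getOrCreate()

srcs = spark.createDataFrame(
    [("https://cdn.example.com/jquery-3.7.1.min.js",),
     ("https://cdn.example.com/react.production.min.js",),
     ("https://cdn.example.com/jquery.min.js",)],
    ["src"],
)

# Match a few well-known library names in the URL (illustrative only).
libs = srcs.withColumn(
    "library",
    F.regexp_extract(F.lower(F.col("src")), r"(jquery|react|angular|vue)", 1),
).filter(F.col("library") != "")

libs.groupBy("library").count().orderBy(F.desc("count")).show()
```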
BigBanyanTree: Parsing HTML source code with Apache Spark & Selectolax Dive into the world of data extraction! Learn how to parse HTML source code from Common Crawl WARC files with Apache Spark and Selectolax, and unlock the insights hiding in raw web pages.
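For the extraction step itself, Selectolax's CSS selectors can run inside a Spark UDF. A minimal sketch, assuming selectolax is installed on the workers; the selector and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from selectolax.parser import HTMLParser

spark = SparkSession.builder.appName("selectolax-parse").getOrCreate()

@F.udf(returnType=ArrayType(StringType()))
def script_srcs(html):
    """Return the src attribute of every <script> tag in the page."""
    if not html:
        return []
    tree = HTMLParser(html)
    return [n.attributes.get("src") for n in tree.css("script")
            if n.attributes.get("src")]

df = spark.createDataFrame(
    [('<html><script src="/app.js"></script></html>',)], ["html"]
)
df.select(script_srcs("html").alias("srcs")).show(truncate=False)
```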
Zero to Spark - BigBanyanTree Cluster Setup This blog details the setup of an Apache Spark cluster in standalone mode for data engineering using Docker Compose on a dedicated Hetzner server. It also covers the setup of other utilities such as JupyterLab and a Llama 3.1 8B LLM service.
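Once the cluster is up, a notebook attaches to it by pointing a SparkSession at the standalone master. A minimal sketch, assuming a hypothetical spark-master hostname and the default 7077 port (the actual service names come from the compose file):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # standalone master URL (assumed)
    .appName("bigbanyantree")
    .config("spark.executor.memory", "4g")  # illustrative resource setting
    .getOrCreate()
)
print(spark.version)
```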