BigBanyanTree: Enriching WARC Data With IP Information from MaxMind You can gain many insights by enriching Common Crawl WARC files with geolocation data from MaxMind. Learn how to do that using Apache Spark!
Serializability in Spark: Using Non-Serializable Objects in Spark Transformations Discover strategies to effectively harness Spark's distributed computing power when working with third-party or custom library objects that aren't serializable.
Building ScriptScope Part 1: Extracting Top Used JS Libraries from Common Crawl using Apache Spark Learn how you can build website technology analysis tools like BuiltWith and Wappalyzer! In this post, we identify the most-used JS libraries of 2024!
BigBanyanTree: Parsing HTML source code with Apache Spark & Selectolax Dive into the world of data extraction! Learn how to parse HTML source code from Common Crawl WARC files with Apache Spark and Selectolax, and unlock its potential for analysis.
Leveraging LLMs in Recommendation Systems Discover personalized content effortlessly with LLM-powered recommendation systems!
Movie Recommender System Using PySpark Explore the world of cinema with our cutting-edge PySpark movie recommender system that provides tailored movie suggestions to match your unique tastes and preferences.
Unlocking the Power of PySpark SQL: An end-to-end tutorial on App Store data Discover the capabilities of PySpark SQL querying in our comprehensive tutorial on App Store data, and see how it makes big data analytics easier and more efficient!