CommonCrawl on Spark: Reliably Processing 28TB of Data (6300 WARCs, 7 CC Dumps) on a Spark Cluster Crunch TBs of data with a solid, error-tolerant system on Apache Spark!
BigBanyanTree: Enriching WARC Data With IP Information from MaxMind You can gain a lot of insights by enriching CommonCrawl WARC files with geolocation data from MaxMind. Learn how to do that using Apache Spark!
Serializability in Spark: Using Non-Serializable Objects in Spark Transformations Discover strategies to effectively harness Spark's distributed computing power when working with third-party or custom library objects that aren't serializable.
Building ScriptScope Part 1: Extracting Top Used JS Libraries from Common Crawl using Apache Spark Learn how you can build website technology analysis tools like BuiltWith and Wappalyzer! In this blog, we identify the top-used JS libraries of 2024!
Generating Synthetic Text2SQL Instruction Dataset to Fine-tune Code LLMs Creating text2SQL data with defined roles and sub-topics that guide natural language question generation using GPT-4.
Multi-Doc RAG: Leverage LangChain to Query and Compare 10K Reports Embark on a deep dive into RAG as we explore QnA over multiple documents and the fusion of cutting-edge LLMs with LangChain. Learn how LangChain works along the way!
Exploring Agents: Get Started by Creating Your Own Data Analysis Agent LLMs have taken the world by storm, but on their own they CAN'T do any particular task very well and often produce unreliable results. If there's one thing they do well, it's following cleverly and meticulously crafted prompts. Let's use this ability of theirs to build some agents!