CommonCrawl on Spark: Reliably Processing 28TB of Data (6300 WARCs, 7 CC Dumps) on a Spark Cluster

Reliably crunch TBs of data with a solid error-tolerant system on Apache Spark!
Developing a data product

I wanted to develop a product end to end. One idea was a service that identifies the web technologies used by popular sites. Building such a product "end to end" would require crawling URLs, parsing raw website data, data processing, server-side web