CommonCrawl on Spark: Reliably Processing 28TB of Data (6300 WARCs, 7 CC Dumps) on a Spark Cluster

Reliably crunch TBs of data with a solid error-tolerant system on Apache Spark!
Developing a data product

I wanted to develop a product end to end. One idea was a service that identifies the web technologies used by popular sites. Building such a product "end to end" would require crawling URLs, parsing raw website data, data processing, server-side web