datascience.fm - The #1 Data Science Channel

Data Product

CommonCrawl on Spark: Reliably Processing 28TB of Data (6300 WARCs, 7 CC Dumps) on a Spark Cluster

Reliably crunch TBs of data with a solid error-tolerant system on Apache Spark!
Suchit G Nov 7, 2024
Developing a data product

I wanted to develop a product end to end. One idea was a service that identifies the web technologies used by popular sites. The "end to end" development of such a product would require crawling URLs, parsing raw website data, data processing, server-side web
Harsh Singhal | DataScienceFM Jul 17, 2019
