The top 1 million Alexa sites seemed like a good place to find a list of URLs to crawl.
Most of the development was done using AWS services, though the final web application is served using Google Cloud's Cloud Run service.
There are a few high-level moving parts that need to be highlighted:
- Data processing
- Web service/application
Simply put, crawlers are programs that extract information from URLs. URLs found while parsing the information extracted from previously visited URLs are also crawled; this process is popularly described as "spidering". Though simple in concept, implementing a high-performance crawler is challenging. For my purposes, I created a simple crawler as shown in the diagram below:
- A Simple Queue Service (SQS) queue is seeded with the top 1M Alexa URLs.
- Each EC2 instance runs the crawler process. The crawler process 1) requests a URL to crawl from SQS, 2) fetches the URL, and 3) submits the received response to a Kinesis Firehose stream (a sketch of this loop follows the list).
- The Kinesis Firehose stream transports data to S3.
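A minimal sketch of that crawler loop, assuming a Splash instance running locally on port 8050 and hypothetical queue and delivery-stream names, might look like this:

```python
import boto3
import requests

sqs = boto3.client("sqs")
firehose = boto3.client("firehose")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # hypothetical
STREAM_NAME = "crawl-results"  # hypothetical Firehose delivery stream


def crawl_forever():
    while True:
        # 1) Request a URL to crawl from SQS (long polling).
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            url = msg["Body"]
            # 2) Fetch the URL through the local Splash headless browser.
            rendered = requests.get(
                "http://localhost:8050/render.html",
                params={"url": url, "wait": 1.0},
                timeout=60,
            )
            # 3) Submit the received response to the Kinesis Firehose stream.
            firehose.put_record(
                DeliveryStreamName=STREAM_NAME,
                Record={"Data": rendered.text.encode("utf-8") + b"\n"},
            )
            # Delete the message so each URL is only visited once.
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )


if __name__ == "__main__":
    crawl_forever()
```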
The crawler uses Splash, a headless browser service, to request each URL. A powerful feature of Splash is the ability to write Lua scripts to customize how a page is rendered.
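For example, Splash exposes an /execute endpoint that runs a user-supplied Lua script against a page. A small sketch (again assuming Splash on port 8050; the script and URL here are just illustrative) that returns both the rendered HTML and the HAR log:

```python
import requests

# A small Lua script for Splash's /execute endpoint: load the page,
# wait briefly for scripts to run, then return the HTML and HAR log.
LUA_SCRIPT = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(1.0)
  return {html = splash:html(), har = splash:har()}
end
"""

resp = requests.post(
    "http://localhost:8050/execute",
    json={"lua_source": LUA_SCRIPT, "url": "https://example.com"},
    timeout=60,
)
result = resp.json()
print(result["html"][:200])
```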
It is important to rate-limit your crawling and not overload websites with your requests. For this project, I only visited each URL once and did not crawl any outbound links. Additionally, always respect robots.txt directives when you crawl.
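As a sketch, Python's standard-library robotparser can check a URL against its site's robots.txt before fetching (the user-agent string below is a placeholder):

```python
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "my-crawler"  # placeholder user-agent string


def allowed_by_robots(url: str) -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        # If robots.txt cannot be fetched, err on the side of caution.
        return False
    return rp.can_fetch(USER_AGENT, url)


print(allowed_by_robots("https://example.com/some/page"))
```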
The final results are stored in a SQLite database and served using a Python Flask web application, which you can play with in the demo.
Try search terms such as "Prebid" or "Mixpanel".
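A minimal sketch of such an app, assuming a hypothetical results table with url and technology columns, could look like this:

```python
import sqlite3

from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "results.db"  # hypothetical SQLite database of crawl results


@app.route("/search")
def search():
    term = request.args.get("q", "")
    conn = sqlite3.connect(DB_PATH)
    try:
        # Hypothetical schema: results(url TEXT, technology TEXT)
        rows = conn.execute(
            "SELECT url FROM results WHERE technology LIKE ?",
            (f"%{term}%",),
        ).fetchall()
    finally:
        conn.close()
    return jsonify([r[0] for r in rows])


if __name__ == "__main__":
    app.run()
```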