NLP

Gain relevant experience by creating your own project

Common Crawl (CC) is a very popular dataset that can support a wide range of text analysis and NLP tasks.

If you have an AWS account, you can query the CC index files, which are stored in Parquet format, using Athena. Follow the instructions shared in this article to get started.

If you don't want to use Athena, you can simply get hold of a single Parquet file and analyze it as described here.

Once the index data is available in an Athena table, you can extract a small sample using the query below.

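-- Sample roughly 0.1% of the index rows for one crawl
WITH data AS (
  SELECT
    url_host_name
  FROM
    "ccindex"."ccindex"
  WHERE crawl = 'CC-MAIN-2021-25'
    AND subset = 'warc'
    AND rand() <= 0.001  -- keep each row with probability 0.001
)
-- Count the sampled rows per host name
SELECT
  url_host_name,
  count(*)
FROM data
GROUP BY
  url_host_name
;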
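To make the sampling step concrete, here is a minimal Python sketch of what the query does: it keeps each index row with probability 0.001 and then counts the surviving rows per host name. The function name `sample_host_counts`, the `seed` parameter, and the toy host names are assumptions made for illustration; none of them come from Athena or Common Crawl.

```python
import random
from collections import Counter


def sample_host_counts(host_names, rate=0.001, seed=None):
    """Keep each row with probability `rate`, then count rows per host."""
    rng = random.Random(seed)
    # Counterpart of `WHERE rand() <= rate`: one independent draw per row.
    sampled = (host for host in host_names if rng.random() <= rate)
    # Counterpart of `GROUP BY url_host_name` with `count(*)`.
    return Counter(sampled)


if __name__ == "__main__":
    # Hypothetical toy index: 200,000 rows spread over three made-up hosts.
    rows = (
        ["a.example"] * 150_000
        + ["b.example"] * 45_000
        + ["c.example"] * 5_000
    )
    counts = sample_host_counts(rows, rate=0.001, seed=42)
    for host, n in counts.most_common():
        print(host, n)
```

Because every row is drawn independently, hosts with many captures are likely to appear in the sample, while hosts with only a few captures will often be missing. The counts are therefore sampled counts, not totals for the crawl.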

Download the results by clicking the icon in the top-right corner of the results pane, which appears below the query editor.

The file is approximately 35 MB in size, and you now have a dataset of URL host names, each with its sampled count, for your analysis.

This dataset is available on Kaggle and comes with a notebook that you can use as a starting point.
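As a first pass over the downloaded file, the sketch below loads the host names with their counts and totals them by the last label of each host name (for example `com` or `org`). It assumes the export is a CSV with a header row, the host name in the first column, and the count in the second. The file name `results.csv` and the helper names `load_host_counts` and `suffix_counts` are hypothetical.

```python
import csv
from collections import Counter


def load_host_counts(csv_path):
    """Read (host name, count) pairs from the downloaded results file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        # Columns are read by position: host name first, count second.
        return {row[0]: int(row[1]) for row in reader if len(row) >= 2}


def suffix_counts(host_counts):
    """Sum the sampled counts by the last dot-separated label of each host."""
    totals = Counter()
    for host, n in host_counts.items():
        suffix = host.rsplit(".", 1)[-1].lower()
        totals[suffix] += n
    return totals


if __name__ == "__main__":
    # "results.csv" is a placeholder for wherever you saved the download.
    counts = load_host_counts("results.csv")
    for suffix, n in suffix_counts(counts).most_common(10):
        print(suffix, n)
```

Splitting on the last dot is a deliberate simplification: it treats `co.uk` hosts as `uk` and returns the final octet for hosts that are bare IP addresses. A public-suffix-aware parser would be the natural next step if those cases matter for your analysis.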
