Common Crawl (CC) is a widely used web-crawl dataset that suits many text
analysis and NLP tasks.
If you have an AWS account, you can access the CC index files in Parquet format.
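
As a rough sketch of what that access can look like from Python, the snippet
below builds the S3 prefix for one partition of the columnar index and reads a
few rows from it. The bucket layout and column names follow the Common Crawl
article on the columnar index. The crawl ID `CC-MAIN-2018-09` is only an
example, the helper names `partition_path` and `read_sample` are hypothetical,
and the read step assumes `pyarrow` is installed and AWS credentials are
configured.

```python
# Sketch only: locate one partition of the Common Crawl columnar index
# and read a small sample of rows from it.

# Root of the Parquet index table, per the Common Crawl columnar index article.
CC_INDEX_ROOT = "s3://commoncrawl/cc-index/table/cc-main/warc"


def partition_path(crawl: str, subset: str = "warc") -> str:
    """Build the S3 prefix for one crawl/subset partition of the index."""
    return f"{CC_INDEX_ROOT}/crawl={crawl}/subset={subset}/"


def read_sample(crawl: str, columns: list[str], num_rows: int = 10):
    """Read the first few rows of one partition into a pyarrow Table.

    Needs network access and AWS credentials, so it is not called below.
    """
    # Imported lazily so partition_path() works even without pyarrow installed.
    import pyarrow.dataset as ds

    dataset = ds.dataset(partition_path(crawl), format="parquet")
    return dataset.head(num_rows, columns=columns)


if __name__ == "__main__":
    # Example crawl ID (an assumption); substitute the crawl you need.
    print(partition_path("CC-MAIN-2018-09"))
```

With credentials in place, a call such as
`read_sample("CC-MAIN-2018-09", ["url", "warc_filename", "warc_record_offset", "warc_record_length"])`
should return a small table that maps URLs to their positions in the WARC files.
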
Follow the instructions shared in this article
[https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/]
to get