Postgres text similarity with commoncrawl domains
Common Crawl [https://commoncrawl.org/] is a public repository of web crawl data
made available for analysis. In this post I want to extract the list of domains
crawled, load them into a Postgres database, and play with the text similarity
functions provided by the pg_similarity extension.
The basic steps to be