Building ScriptScope Part 1: Extracting Top Used JS Libraries from Common Crawl using Apache Spark

It's astonishing how much information you can extract from the Common Crawl dataset. From analyzing the top used TLDs to finding what bot-blocking software your competitor's site is using, A LOT is possible!

In this ScriptScope series, we'll analyze the usage patterns of JavaScript libraries over the past 10 years. We'll spell out how to do trend analysis, co-occurrence analysis, use-case specific JS library analysis, and more, in detail!

Introduction

Say you're looking around for the top-used JS libraries for a particular year. You come across tools like BuiltWith and Wappalyzer.

Being the curious cat that you are, you're probably wondering how that's done. It feels like magic, doesn't it? Well, let me tell you — you can do it too! With a little help from this blog and a decently spec'ed machine, you're all set.


BuiltWith and Wappalyzer

BuiltWith and Wappalyzer are tools for identifying the technology stack of websites and doing competitor research. Here are a couple of screenshots showing what each tool offers, drawing a parallel to what we're doing here:

JavaScript Library Usage Distribution — builtWith.com
JavaScript Libraries Technologies Market Share — Wappalyzer.com

We'll now see how to do this ourselves!


Getting the Top Used JS Libraries

There are 3 high-level parts involved in this analysis:

  1. Downloading and processing a sample of the WARC files for a particular year
  2. Parsing and extracting JS Libraries from the processed data
  3. Performing aggregations to get the final list of top JS Libraries

1. Processing a Sample of the Internet (WARC files)

The Common Crawl dataset contains "snapshots" of the internet captured regularly since 2008. It is a huge dataset, measured in petabytes. The best part is that it is free and accessible to anyone, conveniently hosted on Amazon S3. There are a lot of cool things that can be done with this dataset.

The data in Common Crawl is available as WARC, WAT, and WET files.

💡
WARC: The Web ARChive file format stores archives of digital resources and content. Each file is a concatenation of one or more WARC records containing the metadata and content of pages harvested/crawled from the World Wide Web.

WAT: The Web Archive Transformation files store the computed metadata from the raw crawls (WARC files).

WET: The WARC Encapsulated Text file format stores only the text extracted from the body of the HTML pages, excluding any HTML code and media.

The "response" record type in the WARC files contains the responses sent by the servers when CommonCrawl crawled them. No points for guessing ;), but these responses are in HTML format.

The src attributes can then be extracted from the <script> tags in the HTML and parsed to get JS libraries.
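The post linked below describes how we did this at scale with Apache Spark and Selectolax. As a quick illustration of the idea, here is a minimal stdlib sketch using Python's html.parser (the sample HTML and the ScriptSrcExtractor name are made up for this example):

```python
from html.parser import HTMLParser

class ScriptSrcExtractor(HTMLParser):
    """Collect the src attribute of every <script> tag."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)

page = '<html><body><script src="/js/jquery.min.js"></script><script>var x = 1;</script></body></html>'
parser = ScriptSrcExtractor()
parser.feed(page)
print("|".join(parser.srcs))  # inline scripts have no src, so only one value here
```

Joining the collected attributes with | mirrors the pipe-separated script_src_attrs column we work with later.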

Read more about how we extracted the src attributes:

BigBanyanTree: Parsing HTML source code with Apache Spark & Selectolax
Dive into the world of data extraction! Learn how to parse HTML source code from Common Crawl WARC files with Apache Spark and Selectolax for insightful analysis and unlock the potential of HTML source code.

2. Parsing JS Libraries

We now have the processed data from WARC files sampled from a Common Crawl dump. In this example, we'll use the data processed from the 2022-49 crawl (the crawl conducted in week 49 of 2022) and perform the parsing and aggregations using (Py)Spark.

Learn more about the Spark setup we're running this exercise on from this in-depth blog post: Zero to Spark - BigBanyanTree Cluster Setup.

Let's go through the code now. We'll start by defining the imports and connecting to the cluster. But before that, make sure you have huggingface_hub installed so you can download our dataset from HuggingFace.

pip install huggingface_hub

import re
import requests

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, ArrayType
from huggingface_hub import snapshot_download
from urllib.parse import urlparse

spark = SparkSession.builder \
    .appName("top-jslibs") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

The next step is to download our processed dataset, which has the following columns extracted from Common Crawl WARC files: ['ip', 'host', 'server', 'script_src_attrs', 'year']. The script_src_attrs column holds the src attributes from the script tags, with multiple attribute values separated by |. You can look around the dataset in this 🤗 embed:

Feel free to check out our other datasets at hf.co/big-banyan-tree!

# Download the processed dataset from HF
snapshot_download(repo_id="big-banyan-tree/BBT_CommonCrawl_2024", repo_type="dataset", allow_patterns="script_extraction_out/*.parquet", local_dir=".")
# Read the processed data
df = spark.read.parquet("script_extraction_out")
df.show(n=10)
+---------------+--------------------+------------------+--------------------+----+
|             ip|                host|            server|    script_src_attrs|year|
+---------------+--------------------+------------------+--------------------+----+
|   217.160.0.83|https://nadinesim...|            Apache|https://nadinesim...|2024|
| 185.127.236.35|https://azionecat...|         Glaucorm3|\'https://azionec...|2024|
|   52.15.227.12|https://secureftp...|              NULL|js/dist/jquery.mi...|2024|
|   2.59.135.142|https://www.zeitk...|             nginx|/_assets/a739cde7...|2024|
|    104.21.60.7|http://www.250r.r...|        cloudflare|http://www.250r.r...|2024|
|  46.183.11.162|https://taliarand...|             nginx|https://taliarand...|2024|
|    65.60.61.13|https://almost4x4...|            Apache|./assets/javascri...|2024|
|185.227.138.230|http://gabrik-hor...|              NULL|http://gabrik-hor...|2024|
|   23.227.38.32|https://unifamy.c...|        cloudflare|//unifamy.com/cdn...|2024|
|  104.21.66.126|https://thesmartl...|        cloudflare|https://thesmartl...|2024|
+---------------+--------------------+------------------+--------------------+----+
only showing top 10 rows

This is what we have from the processed data.

Where do we go from here to get the top-used JS libraries? We could take this approach:

  • Split the script_src_attrs column on | to convert it into ArrayType.
  • Extract JS libraries using regex and perform an explode on the new column.
  • count the host column grouped by the extracted JS libraries.

But there's a problem with this approach. The WARC files contain multiple response records for the same domain on different paths (e.g., https://google.com/search?q=abcd and https://google.com/imghp). JS libraries would therefore be counted multiple times for the same domain, skewing the counts toward domains with more crawled paths.
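To see the skew concretely, here's a tiny stdlib sketch with made-up records, where each tuple is one crawled page:

```python
from collections import Counter

# Hypothetical crawl records: one (domain, js_lib) pair per crawled page,
# so google.com shows up twice because two of its paths were crawled
records = [
    ("google.com", "jquery.js"),    # from /search?q=abcd
    ("google.com", "jquery.js"),    # from /imghp
    ("example.com", "jquery.js"),
]

naive = Counter(lib for _, lib in records)
deduped = Counter(lib for _, lib in set(records))  # unique (domain, lib) pairs

print(naive["jquery.js"])    # 3 -- google.com counted once per path
print(deduped["jquery.js"])  # 2 -- one count per domain
```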

To avoid this, we first extract the domains and the JS libraries, and then remove identical domain-JS library pairs. Let's spell this out in code.

Extracting Domains

We'll use a PySpark UDF to extract the domains into a new column. The urlparse function from Python's urllib comes in handy to do this.

@F.udf(StringType())
def extract_domain(host):
    parsed_url = urlparse(host)
    return parsed_url.netloc
df = df.withColumn("domain", extract_domain("host"))
df.select("domain").show(n=10, truncate=False)
+---------------------------------------+
|domain                                 |
+---------------------------------------+
|nadinesimmerock.com                    |
|azionecattolica.arcidiocesi.siracusa.it|
|secureftp.rci.com                      |
|www.zeitklicks.de                      |
|www.250r.ru                            |
|taliarandall.com                       |
|almost4x4.com                          |
|gabrik-hormozgan.com                   |
|unifamy.com                            |
|thesmartlocal.co.th                    |
+---------------------------------------+
only showing top 10 rows
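As a quick aside (plain Python, no Spark needed): urlparse only populates netloc when the URL includes a scheme. The host values in this dataset do include one, as the output above shows, but it's a caveat worth knowing:

```python
from urllib.parse import urlparse

# With a scheme, netloc is the domain we want
print(urlparse("https://nadinesimmerock.com/about").netloc)  # nadinesimmerock.com

# Without a scheme, urlparse treats the whole string as a path
print(urlparse("nadinesimmerock.com/about").netloc)  # '' (empty string)
```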

Extracting JS Libraries

@F.udf(ArrayType(StringType()))
def attrs_to_libs(src_attrs):
    """Parse list of src attrs to keep JS libs ONLY."""
    splits = src_attrs.split('|')
    for i, s in enumerate(splits):
        pattern = r"/?(?:js\.)?([^/?]+\.js)"
        m = re.search(pattern, s)
        if m:
            splits[i] = m.group(1)
        else:
            splits[i] = None
            continue
        # treat jquery.min.js and jquery.js as the same library
        splits[i] = re.sub(r"\.min\.", '.', splits[i])

    return splits

Let's break down what's happening in the above function:

  • The script_src_attrs column is split on |.
  • For each element in the split list, the JS library is extracted using the regex pattern. Click here for a breakdown of this regex pattern.
  • If a match is found, it is stored in splits[i]; otherwise the element is set to None and the rest of the loop body is skipped.
  • Notice that we substitute .min. with ., essentially treating, say, jquery.min.js and jquery.js as the same library.
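Before optimizing anything, let's sanity-check what this pattern actually captures on a few hypothetical src values:

```python
import re

jslib_regex = re.compile(r"/?(?:js\.)?([^/?]+\.js)")

samples = [
    "https://cdn.example.com/jquery.min.js?v=1",  # query strings are cut off by [^/?]
    "./assets/javascripts/bootstrap.js",          # relative paths work too
    "/img/logo.png",                              # not a JS file -> no match
]

libs = []
for s in samples:
    m = jslib_regex.search(s)
    # normalize .min. the same way the UDF does
    libs.append(re.sub(r"\.min\.", ".", m.group(1)) if m else None)

print(libs)  # ['jquery.js', 'bootstrap.js', None]
```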

Now, this UDF works, but we can improve it. Every time re.search is called with a pattern string, the pattern has to be compiled (or at least looked up in re's internal cache) before the text is searched. Since this happens inside a loop that runs for every row, that repeated work adds up.

We can avoid this repetitive step by compiling the regex once and then broadcasting it to all the worker nodes. Pretty nifty, right? :)

jslib_regex = re.compile(r"/?(?:js\.)?([^/?]+\.js)")
regex_broadcast = spark.sparkContext.broadcast(jslib_regex)

Modifying the UDF to include this, we get:

@F.udf(ArrayType(StringType()))
def attrs_to_libs(src_attrs):
    """Parse list of src attrs to have JS libs ONLY."""
    splits = src_attrs.split('|')
    regex_obj = regex_broadcast.value
    for i, s in enumerate(splits):
        m = regex_obj.search(s)
        if m:
            splits[i] = m.group(1)
        else:
            splits[i] = None
            continue
        splits[i] = re.sub(r"\.min\.", '.', splits[i])
            
    return splits
domain_lib_df = df.withColumn("js_lib", attrs_to_libs("script_src_attrs"))
domain_lib_df.select("js_lib").show(n=5)
+--------------------------------+
|js_lib                          |
+--------------------------------+
|[jquery.js, jquery-migrate.j...]|
|[woocommerce.js, jquery.js, ...]|
|[jquery.js, NULL, bootstrap....]|
|[react.min.js, react-dom.min...]|
|[jquery.blockui.js, woocomme...]|
+--------------------------------+
only showing top 5 rows

As you can infer from the datatype passed to the UDF decorator, the resulting column has ArrayType() values, with the array elements being the extracted JS libraries.

3. Getting Top JS Libraries using Aggregations

These array elements can now be put on separate rows using PySpark's explode function. We might also have some NULL values as array elements, so we'll remove those too.

domain_lib_df = domain_lib_df.select(domain_lib_df.domain, F.explode(domain_lib_df.js_lib).alias("js_lib")).dropna()
domain_lib_df.select("domain", "js_lib").show(n=10)
+--------------------------+----------------------+
|domain                    |js_lib                |
+--------------------------+----------------------+
|nadinesimmerock.com       |jquery.js             |
|nadinesimmerock.com       |jquery-migrate.js     |
|nadinesimmerock.com       |jquery.blockui.js     |
|azionecattolica.arcidio...|woocommerce.js        |
|azionecattolica.arcidio...|jquery.js             |
|secureftp.rci.com         |slick.min.js          |
|www.zeitklicks.de         |fancybox.js           |
|www.zeitklicks.de         |bootstrap.min.js      |
|www.250r.ru               |react.min.js          |
|taliarandall.com          |redux.min.js          |
+--------------------------+----------------------+
only showing top 10 rows

Now, there can be duplicate domain-js_lib pairs. We don't want these influencing the counts, so let's drop such duplicates.

domain_lib_df = domain_lib_df.dropDuplicates(["domain", "js_lib"])

We are almost there. Only one aggregation remains to get to our final goal!

To get the final counts, we must group by the js_lib column and count how many domains use a particular JS library.

count_df = domain_lib_df.groupby("js_lib").agg(F.count("domain").alias("domain_count"))

Let's sort this in descending order.

sorted_df = count_df.sort("domain_count", ascending=False)

That's it! Here's the moment of truth.

sorted_df.show(n=50, truncate=False)
+--------------------------------------------------------------------------------+------------+
|js_lib                                                                          |domain_count|
+--------------------------------------------------------------------------------+------------+
|jquery.js                                                                       |3649381     |
|jquery-migrate.js                                                               |2291335     |
|core.js                                                                         |959250      |
|index.js                                                                        |937461      |
|bootstrap.js                                                                    |892675      |
|hooks.js                                                                        |798132      |
|main.js                                                                         |787037      |
|i18n.js                                                                         |775130      |
|scripts.js                                                                      |764052      |
|comment-reply.js                                                                |748273      |
|frontend.js                                                                     |681768      |
|api.js                                                                          |653214      |
|wp-polyfill.js                                                                  |635136      |
|cookie.js                                                                       |618353      |
|script.js                                                                       |548814      |
|jquery.blockui.js                                                               |517469      |
|woocommerce.js                                                                  |501323      |
|common.js                                                                       |493211      |
|waypoints.js                                                                    |492947      |
|custom.js                                                                       |459783      |
|add-to-cart.js                                                                  |452764      |
|imagesloaded.js                                                                 |451984      |
|owl.carousel.js                                                                 |451184      |
|regenerator-runtime.js                                                          |446713      |
|slick.js                                                                        |442980      |
|underscore.js                                                                   |440106      |
|adsbygoogle.js                                                                  |433745      |
|theme.js                                                                        |383817      |
|frontend-modules.js                                                             |380701      |
|webpack.runtime.js                                                              |372203      |
|preloads.js                                                                     |368922      |
|load_feature-9f951eb7d8d53973c719de211f807d63af81c644e5b9a6ae72661ac408d472f6.js|367641      |
|features-1c0b396bd4d054b94abae1eb6a1bd6ba47beb35525c57a217c77a862ff06d83f.js    |366054      |
|wp-polyfill-inert.js                                                            |351462      |
|wp-embed.js                                                                     |331821      |
|email-decode.js                                                                 |322340      |
|sourcebuster.js                                                                 |312386      |
|order-attribution.js                                                            |311339      |
|jquery-ui.js                                                                    |299792      |
|wp-util.js                                                                      |294530      |
|cdn.js                                                                          |289011      |
|front.js                                                                        |286595      |
|cart-fragments.js                                                               |283559      |
|polyfill.js                                                                     |283218      |
|app.js                                                                          |275751      |
|navigation.js                                                                   |269894      |
|webpack-pro.runtime.js                                                          |263654      |
|swiper.js                                                                       |261707      |
|jquery.magnific-popup.js                                                        |258535      |
|lazysizes.js                                                                    |257031      |
+--------------------------------------------------------------------------------+------------+
only showing top 50 rows

There are a few JS files here that are of no interest to us: generic names like index.js and main.js that don't identify a particular library. We'll filter them out.

# List of file names to filter out
unwanted_files = ["api.js", "index.js", "script.js", "frontend.js", "main.js", "custom.js", "frontend-modules.js", "front.js", "navigation.js", "app.js", "theme.js"]

sorted_df = sorted_df.filter(~F.col("js_lib").isin(unwanted_files))

sorted_df.show(n=30)
+--------------------------------------------------------------------------------+------------+
|js_lib                                                                          |domain_count|
+--------------------------------------------------------------------------------+------------+
|jquery.js                                                                       |3649381     |
|jquery-migrate.js                                                               |2291335     |
|core.js                                                                         |959250      |
|bootstrap.js                                                                    |892675      |
|hooks.js                                                                        |798132      |
|i18n.js                                                                         |775130      |
|comment-reply.js                                                                |748273      |
|wp-polyfill.js                                                                  |635136      |
|cookie.js                                                                       |618353      |
|jquery.blockui.js                                                               |517469      |
|woocommerce.js                                                                  |501323      |
|waypoints.js                                                                    |492947      |
|add-to-cart.js                                                                  |452764      |
|imagesloaded.js                                                                 |451984      |
|owl.carousel.js                                                                 |451184      |
|regenerator-runtime.js                                                          |446713      |
|slick.js                                                                        |442980      |
|underscore.js                                                                   |440106      |
|adsbygoogle.js                                                                  |433745      |
|frontend-modules.js                                                             |380701      |
|webpack.runtime.js                                                              |372203      |
|preloads.js                                                                     |368922      |
|load_feature-9f951eb7d8d53973c719de211f807d63af81c644e5b9a6ae72661ac408d472f6.js|367641      |
|features-1c0b396bd4d054b94abae1eb6a1bd6ba47beb35525c57a217c77a862ff06d83f.js    |366054      |
|wp-polyfill-inert.js                                                            |351462      |
|wp-embed.js                                                                     |331821      |
|email-decode.js                                                                 |322340      |
|sourcebuster.js                                                                 |312386      |
|order-attribution.js                                                            |311339      |
|jquery-ui.js                                                                    |299792      |
+--------------------------------------------------------------------------------+------------+
only showing top 30 rows

There are a few seemingly mundane yet interesting JS files here.

... I can keep going on.

Another interesting file is the seemingly random load_feature-9f951eb7d8d53973c719de211f807d63af81c644e5b9a6ae72661ac408d472f6.js, which is used by Shopify.

💡
This is some valuable information right here! A use case for this would be, for example, finding a list of websites using this JS file. Et voilà! As a Shopify plugin developer, here is your lead list already!

Take a close look at the above list of top-used JS libraries. Compare it with what Wappalyzer shows. Can you find any similarities? Heck yeah, you can! I'll make a callout box just for this :)

💡
This exercise of ours yielded accurate results corroborated by popular tools like Wappalyzer.

Many libraries shown in our aggregation above (like jQuery, core-js, jQuery Migrate, Swiper, Slick, Underscore.js, OWL Carousel, etc.) appear in the top JS libraries list in Wappalyzer.

Phew! That was fun, right? I'm super thrilled that we got to work on this. It's still unreal that we replicated what Wappalyzer has done (at least a tiny part of what they offer ;) ).


Open Sourcing our Dataset

The dataset used in this analysis is fully open-sourced by us and is available on HuggingFace 🤗 for download. We encourage you to tinker with the dataset and can't wait to see what you build with it!

big-banyan-tree (Big Banyan Tree)
Org profile for Big Banyan Tree on Hugging Face, the AI community building the future.

Our dataset has two parts (separated as directories on HF):

  • ipmaxmind_out
  • script_extraction_out

"ipmaxmind_out" has IPs from response records of the WARC files enriched with information from MaxMind's database.

"script_extraction_out" has IP, server among other columns, and the src attributes of the HTML content in the WARC response records. Visit the datasets' HuggingFace repositories for more information.


Conclusion

They say a picture is worth a thousand words. I say it's worth even more (this blog has 1500+ words, excluding the code outputs), and I'm not going to bore you with a super-long conclusion that you probably won't read. Instead, I'll let my newfound interest in draw.io do the talking :)

I hope you enjoyed reading this blog as much as I enjoyed writing it. As always, stay tuned for more interesting stuff!