Building ScriptScope Part 1: Extracting Top Used JS Libraries from Common Crawl using Apache Spark
Learn how you can build website technology analysis tools like builtWith and Wappalyzer! In this blog, we identify the top-used JS libraries for 2024!
It's astonishing how much information you can get out of the Common Crawl data. From analyzing the top used TLDs to finding out what bot-blocking software your competitor's site is using, A LOT is possible!
In this ScriptScope series, we'll analyze the usage patterns of JavaScript libraries over the past 10 years. We'll spell out how to do trend analysis, co-occurrence analysis, use-case specific JS library analysis, and more, in detail!
Introduction
Say, you are looking around to find the top-used JS libraries for a particular year. You come across tools like builtWith and Wappalyzer.
Being the curious cat that you are, you're probably wondering how that's done. It feels like magic, doesn't it? Well, let me tell you — you can do it too! With a little help from this blog and a decently spec'ed machine, you're all set.
builtWith and Wappalyzer
builtWith and Wappalyzer are tools for finding the technology stack of websites and doing competitor research. Here are a couple of screenshots of what the respective tools offer, drawing a parallel to what we are doing here:
We'll now see how to do this ourselves!
Getting the Top Used JS Libraries
There are 3 high-level parts involved in this analysis:
- Downloading and processing a sample of the WARC files for a particular year
- Parsing and extracting JS Libraries from the processed data
- Performing aggregations to get the final list of top JS Libraries
1. Processing a Sample of the Internet (WARC files)
The Common Crawl dataset contains "snapshots" of the internet from crawls conducted regularly since 2008. It is a huge dataset, measured in petabytes. The best part is that this dataset is free and accessible to anyone, and it is conveniently available on Amazon S3. There are a lot of cool things that can be done using this dataset.
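As a taste of how accessible it is, here's a small sketch (my own illustration, not part of the pipeline below; the crawl label CC-MAIN-2024-33 is just an example) that fetches the list of WARC files published for one crawl:

import gzip
import requests

# Every crawl publishes a gzipped list of its WARC file paths.
crawl = "CC-MAIN-2024-33"  # example crawl label; swap in the crawl you want
resp = requests.get(f"https://data.commoncrawl.org/crawl-data/{crawl}/warc.paths.gz", timeout=60)
resp.raise_for_status()

warc_paths = gzip.decompress(resp.content).decode().splitlines()
print(len(warc_paths), "WARC files in this crawl")
# Paths are relative; prefix with https://data.commoncrawl.org/ to download one.
print(warc_paths[:3])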
The data in Common Crawl is available as WARC, WAT, and WET files.
WARC: The Web ARChive files store the raw crawl data, including the HTTP request and response records.
WAT: The Web Archive Transformation files store the computed metadata from the raw crawls (WARC files).
WET: The WARC Encapsulated Text files store only the text extracted from the body of the HTML pages, excluding any HTML code and media.
The "response" record type in the WARC files contains the responses sent by the servers when CommonCrawl crawled them. No points for guessing ;), but these responses are in HTML format.
The src attributes can then be extracted from the <script> tags in the HTML and parsed to get JS libraries.
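If you'd like a feel for how that extraction works on raw WARC data before reaching for our processed dataset, here's a minimal sketch using the warcio and beautifulsoup4 packages (an illustration of the idea, not the exact code from the post linked below):

from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def script_srcs_from_warc(warc_path):
    """Yield (page URL, list of script src attributes) for each response record."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            soup = BeautifulSoup(html, "html.parser")
            srcs = [tag["src"] for tag in soup.find_all("script", src=True)]
            if srcs:
                yield url, srcs

# Usage (hypothetical local file name):
# for url, srcs in script_srcs_from_warc("sample.warc.gz"):
#     print(url, srcs[:3])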
Read more about how we extracted the src attributes:
2. Parsing JS Libraries
We now have the processed data from the WARC files sampled from a Common Crawl dump. In this example, we will be using the data processed from a 2024 crawl (the BBT_CommonCrawl_2024 dataset downloaded below) and performing the parsing and aggregations using (Py)Spark.
Learn more about the Spark setup we are running this exercise on from this wonderful blog post documenting it in depth: Zero to Spark - BigBanyanTree Cluster Setup.
Let's go through the code to do this now. We'll start by defining the imports and connecting to the cluster, but before that, make sure you have huggingface_hub installed so you can download our dataset from HuggingFace.
pip install huggingface_hub
import pyspark.sql.functions as F
import re, requests
from pyspark.sql.types import StructField, StructType, StringType, ArrayType
from huggingface_hub import snapshot_download
from pyspark.sql import SparkSession
from urllib.parse import urlparse
spark = SparkSession.builder \
    .appName("top-jslibs") \
    .master("spark://spark-master:7077") \
    .getOrCreate()
The next step is to download our processed dataset, which has the following columns extracted from CommonCrawl WARC files: ['ip', 'host', 'server', 'script_src_attrs', 'year']. The script_src_attrs column has the src attributes from the script tags. Multiple attribute values are separated by |. You can look around the dataset in this 🤗 embed:
Feel free to check out our other datasets at hf.co/big-banyan-tree!
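To make the format concrete, here's what a (made-up) script_src_attrs value could look like and how it splits on the separator:

# A hypothetical script_src_attrs value for a single row
attrs = "https://example.com/wp-includes/js/jquery/jquery.min.js?ver=3.7.1|/assets/js/main.js|//cdn.example.net/slick/slick.min.js"

# Individual src attributes are separated by '|'
for src in attrs.split("|"):
    print(src)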
# Download the processed dataset from HF
snapshot_download(repo_id="big-banyan-tree/BBT_CommonCrawl_2024", repo_type="dataset", allow_patterns="script_extraction_out/*.parquet", local_dir=".")
# Read the processed data
df = spark.read.parquet("script_extraction_out")
df.show(n=10)
+---------------+--------------------+------------------+--------------------+----+
| ip| host| server| script_src_attrs|year|
+---------------+--------------------+------------------+--------------------+----+
| 217.160.0.83|https://nadinesim...| Apache|https://nadinesim...|2024|
| 185.127.236.35|https://azionecat...| Glaucorm3|\'https://azionec...|2024|
| 52.15.227.12|https://secureftp...| NULL|js/dist/jquery.mi...|2024|
| 2.59.135.142|https://www.zeitk...| nginx|/_assets/a739cde7...|2024|
| 104.21.60.7|http://www.250r.r...| cloudflare|http://www.250r.r...|2024|
| 46.183.11.162|https://taliarand...| nginx|https://taliarand...|2024|
| 65.60.61.13|https://almost4x4...| Apache|./assets/javascri...|2024|
|185.227.138.230|http://gabrik-hor...| NULL|http://gabrik-hor...|2024|
| 23.227.38.32|https://unifamy.c...| cloudflare|//unifamy.com/cdn...|2024|
| 104.21.66.126|https://thesmartl...| cloudflare|https://thesmartl...|2024|
+---------------+--------------------+------------------+--------------------+----+
only showing top 10 rows
This is what we have from the processed data.
Where do we go from here to get the top-used JS libraries? We could take this approach:
- Split the script_src_attrs column on | to convert it into ArrayType.
- Extract JS libraries using regex and perform an explode on the new column.
- count the host column grouped by the extracted JS libraries.
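Sketched out in PySpark, that naive pipeline would look roughly like this (my own illustration of the steps above, using the same regex we break down later):

naive_counts = (
    df
    # Split the '|'-separated attributes and put each src on its own row
    .withColumn("src", F.explode(F.split("script_src_attrs", r"\|")))
    # Pull out the JS file name with the regex we discuss below
    .withColumn("js_lib", F.regexp_extract("src", r"/?(?:js\.)?([^/?]+\.js)", 1))
    .filter(F.col("js_lib") != "")
    .groupby("js_lib")
    .agg(F.count("host").alias("host_count"))
    .sort("host_count", ascending=False)
)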
But there's a problem with this approach. The WARC files contain multiple records for the same domain crawled at different paths (e.g., https://google.com/search?q=abcd and https://google.com/imghp). This would lead to a JS library being counted multiple times for the same domain, so the counts would be skewed toward domains with a higher number of crawled paths.
To avoid this, we first extract the domains and the JS libraries, and then remove identical domain-JS library pairs. Let's spell this out in code.
Extracting Domains
We'll use a PySpark UDF to extract the domains into a new column. The urlparse function from Python's urllib comes in handy to do this.
@F.udf(StringType())
def extract_domain(host):
    parsed_url = urlparse(host)
    return parsed_url.netloc

df = df.withColumn("domain", extract_domain("host"))
df.select("domain").show(n=10, truncate=False)
+---------------------------------------+
|domain |
+---------------------------------------+
|nadinesimmerock.com |
|azionecattolica.arcidiocesi.siracusa.it|
|secureftp.rci.com |
|www.zeitklicks.de |
|www.250r.ru |
|taliarandall.com |
|almost4x4.com |
|gabrik-hormozgan.com |
|unifamy.com |
|thesmartlocal.co.th |
+---------------------------------------+
only showing top 10 rows
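As a quick sanity check of why grouping by domain helps with the path problem (plain Python, made-up URLs): different crawled paths on the same site collapse to a single domain.

from urllib.parse import urlparse

# Two different crawled paths on the same (hypothetical) site...
print(urlparse("https://google.com/search?q=abcd").netloc)  # google.com
print(urlparse("https://google.com/imghp").netloc)          # google.com
# ...map to the same domain, which is what we'll deduplicate on later.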
Extracting JS Libraries
@F.udf(ArrayType(StringType()))
def attrs_to_libs(src_attrs):
    """Parse list of src attrs to have JS libs ONLY."""
    splits = src_attrs.split('|')
    for i, s in enumerate(splits):
        pattern = r"/?(?:js\.)?([^/?]+\.js)"
        m = re.search(pattern, s)
        if m:
            splits[i] = m.group(1)
        else:
            splits[i] = None
            continue
        splits[i] = re.sub(r"\.min\.", '.', splits[i])
    return splits
Let's break down what's happening in the above function:
- The script_src_attrs column is split on |.
- For each element in the split list, the JS library is extracted using the regex pattern. Click here to understand and get a breakdown of this regex pattern.
- If a match is found, it is stored in splits[i]; otherwise the element is set to None and the rest of the loop body is skipped.
- Notice that we are substituting .min. with ., essentially treating, say, jquery.min.js and jquery.js as the same library.
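To see the regex in action outside of Spark, here's a small standalone example on a few made-up src values:

import re

pattern = r"/?(?:js\.)?([^/?]+\.js)"

samples = [
    "https://example.com/wp-includes/js/jquery/jquery.min.js?ver=3.7.1",
    "/assets/javascripts/app.js",
    "https://www.googletagmanager.com/gtag/js?id=G-XXXX",  # no *.js file name here
]

for s in samples:
    m = re.search(pattern, s)
    lib = re.sub(r"\.min\.", ".", m.group(1)) if m else None
    print(s, "->", lib)
# prints jquery.js, app.js, and None respectively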
Now, this UDF works, but we can improve it. Every time re.search is called here, the pattern string has to be processed again (compiled, or at least looked up in the regex cache), and since the call sits inside a loop, this happens for every element of every row.
We can avoid this repetitive step by compiling the regex once and then broadcasting it to all the worker nodes. Pretty nifty, right? :)
jslib_regex = re.compile(r"/?(?:js\.)?([^/?]+\.js)")
regex_broadcast = spark.sparkContext.broadcast(jslib_regex)
Modifying the UDF to include this, we get:
@F.udf(ArrayType(StringType()))
def attrs_to_libs(src_attrs):
    """Parse list of src attrs to have JS libs ONLY."""
    splits = src_attrs.split('|')
    regex_obj = regex_broadcast.value
    for i, s in enumerate(splits):
        m = regex_obj.search(s)
        if m:
            splits[i] = m.group(1)
        else:
            splits[i] = None
            continue
        splits[i] = re.sub(r"\.min\.", '.', splits[i])
    return splits
domain_lib_df = df.withColumn("js_lib", attrs_to_libs("script_src_attrs"))
domain_lib_df.select("js_lib").show(n=5)
+--------------------------------+
|js_lib |
+--------------------------------+
|[jquery.js, jquery-migrate.j...]|
|[woocommerce.js, jquery.js, ...]|
|[jquery.js, NULL, bootstrap....]|
|[react.min.js, react-dom.min...]|
|[jquery.blockui.js, woocomme...]|
+--------------------------------+
only showing top 5 rows
As you can infer from the datatypes passed to the UDF, the resultant column has ArrayType() values, with the array elements being the extracted JS libraries.
3. Getting Top JS Libraries using Aggregations
These array elements can now be put on separate rows using PySpark's explode function. We might also have some NULL values as array elements, so we'll remove those too.
domain_lib_df = domain_lib_df.select(domain_lib_df.domain, F.explode(domain_lib_df.js_lib).alias("js_lib")).dropna()
domain_lib_df.select("domain", "js_lib").show(n=10)
+--------------------------+----------------------+
|domain |js_lib |
+--------------------------+----------------------+
|nadinesimmerock.com |jquery.js |
|nadinesimmerock.com |jquery-migrate.js |
|nadinesimmerock.com |jquery.blockui.js |
|azionecattolica.arcidio...|woocommerce.js |
|azionecattolica.arcidio...|jquery.js |
|secureftp.rci.com |slick.min.js |
|www.zeitklicks.de |fancybox.js |
|www.zeitklicks.de |bootstrap.min.js |
|www.250r.ru |react.min.js |
|taliarandall.com |redux.min.js |
+--------------------------+----------------------+
only showing top 10 rows
Now, there can be duplicate domain-js_lib pairs. We don't want these influencing the counts, so let's drop such duplicates.
domain_lib_df = domain_lib_df.dropDuplicates(["domain", "js_lib"])
We are almost there. Only one aggregation remains to get to our final goal!
To get the final counts, we must group by the js_lib column and count how many domains use a particular JS library.
count_df = domain_lib_df.groupby("js_lib").agg(F.count("domain").alias("domain_count"))
Let's sort this in descending order.
sorted_df = count_df.sort("domain_count", ascending=False)
That's it! Here's the moment of truth.
sorted_df.show(n=50, truncate=False)
+--------------------------------------------------------------------------------+------------+
|js_lib |domain_count|
+--------------------------------------------------------------------------------+------------+
|jquery.js |3649381 |
|jquery-migrate.js |2291335 |
|core.js |959250 |
|index.js |937461 |
|bootstrap.js |892675 |
|hooks.js |798132 |
|main.js |787037 |
|i18n.js |775130 |
|scripts.js |764052 |
|comment-reply.js |748273 |
|frontend.js |681768 |
|api.js |653214 |
|wp-polyfill.js |635136 |
|cookie.js |618353 |
|script.js |548814 |
|jquery.blockui.js |517469 |
|woocommerce.js |501323 |
|common.js |493211 |
|waypoints.js |492947 |
|custom.js |459783 |
|add-to-cart.js |452764 |
|imagesloaded.js |451984 |
|owl.carousel.js |451184 |
|regenerator-runtime.js |446713 |
|slick.js |442980 |
|underscore.js |440106 |
|adsbygoogle.js |433745 |
|theme.js |383817 |
|frontend-modules.js |380701 |
|webpack.runtime.js |372203 |
|preloads.js |368922 |
|load_feature-9f951eb7d8d53973c719de211f807d63af81c644e5b9a6ae72661ac408d472f6.js|367641 |
|features-1c0b396bd4d054b94abae1eb6a1bd6ba47beb35525c57a217c77a862ff06d83f.js |366054 |
|wp-polyfill-inert.js |351462 |
|wp-embed.js |331821 |
|email-decode.js |322340 |
|sourcebuster.js |312386 |
|order-attribution.js |311339 |
|jquery-ui.js |299792 |
|wp-util.js |294530 |
|cdn.js |289011 |
|front.js |286595 |
|cart-fragments.js |283559 |
|polyfill.js |283218 |
|app.js |275751 |
|navigation.js |269894 |
|webpack-pro.runtime.js |263654 |
|swiper.js |261707 |
|jquery.magnific-popup.js |258535 |
|lazysizes.js |257031 |
+--------------------------------------------------------------------------------+------------+
only showing top 50 rows
There are a few JS files that are of no interest to us or not applicable here. We will filter them out.
# List of file names to filter out
unwanted_files = ["api.js", "index.js", "script.js", "frontend.js", "main.js", "custom.js", "frontend-modules.js", "front.js", "navigation.js", "app.js", "theme.js"]
sorted_df = sorted_df.filter(~F.col("js_lib").isin(unwanted_files))
sorted_df.show(n=30, truncate=False)
+--------------------------------------------------------------------------------+------------+
|js_lib |domain_count|
+--------------------------------------------------------------------------------+------------+
|jquery.js |3649381 |
|jquery-migrate.js |2291335 |
|core.js |959250 |
|bootstrap.js |892675 |
|hooks.js |798132 |
|i18n.js |775130 |
|comment-reply.js |748273 |
|wp-polyfill.js |635136 |
|cookie.js |618353 |
|jquery.blockui.js |517469 |
|woocommerce.js |501323 |
|waypoints.js |492947 |
|add-to-cart.js |452764 |
|imagesloaded.js |451984 |
|owl.carousel.js |451184 |
|regenerator-runtime.js |446713 |
|slick.js |442980 |
|underscore.js |440106 |
|adsbygoogle.js |433745 |
|frontend-modules.js |380701 |
|webpack.runtime.js |372203 |
|preloads.js |368922 |
|load_feature-9f951eb7d8d53973c719de211f807d63af81c644e5b9a6ae72661ac408d472f6.js|367641 |
|features-1c0b396bd4d054b94abae1eb6a1bd6ba47beb35525c57a217c77a862ff06d83f.js |366054 |
|wp-polyfill-inert.js |351462 |
|wp-embed.js |331821 |
|email-decode.js |322340 |
|sourcebuster.js |312386 |
|order-attribution.js |311339 |
|jquery-ui.js |299792 |
+--------------------------------------------------------------------------------+------------+
only showing top 30 rows
There are a few seemingly mundane yet interesting JS files here.
- preloads.js: Used on Shopify websites for performance enhancements.
- email-decode.js: Used by Cloudflare to obfuscate email addresses from bots while keeping them visible to humans.
- cart-fragments.js: Used on WooCommerce sites to update the cart without refreshing the page.
... and I could keep going.
Another interesting, again seemingly random, file is load_feature-9f951eb7d8d53973c719de211f807d63af81c644e5b9a6ae72661ac408d472f6.js, which is used by Shopify.
Take a close look at the above list of top-used JS libraries. Compare it with what Wappalyzer shows. Can you find any similarities? Heck yeah, you can! I'll make a callout box just for this :)
Many libraries shown in our aggregation above (like jQuery, core-js, jQuery Migrate, Swiper, Slick, Underscore.js, OWL Carousel, etc.) appear in the top JS libraries list in Wappalyzer.
Phew! That was fun, right? I'm super thrilled that we got to work on this. It's still unreal that we replicated what Wappalyzer has done (at least a tiny part of what they offer ;) ).
Open Sourcing our Dataset
The dataset used in this analysis is fully open-sourced by us and is available on HuggingFace 🤗 for download. We encourage you to tinker with the dataset and can't wait to see what you build with it!
Our dataset has two parts (separated as directories on HF):
- ipmaxmind_out
- script_extraction_out
"ipmaxmind_out" has IPs from response records of the WARC files enriched with information from MaxMind's database.
"script_extraction_out" has IP, server among other columns, and the src attributes of the HTML content in the WARC response records. Visit the datasets' HuggingFace repositories for more information.
Conclusion
They say a picture is worth a thousand words. I say it is worth even more (because this blog has 1500+ words excluding the code outputs) and I'm not going to bore you with a super-long conclusion that you probably won't read. Instead, I'll let my newfound interest in draw.io do the talking :)
I hope you enjoyed reading this blog as much as I enjoyed writing it. As always, stay tuned for more interesting stuff!