BrilliantSoupGPT: Revolutionizing HTML Parsing with Advanced AI

In the dynamic and ever-evolving world of Machine Learning (ML) and Artificial Intelligence (AI), the introduction of BrilliantSoupGPT marks a significant milestone.

As the name suggests, BrilliantSoupGPT isn't just an ordinary tool; it's a groundbreaking blend of beauty and brilliance in the realm of HTML parsing and data extraction.

Pioneering a New Era of Data Parsing

The concept of BrilliantSoupGPT stems from the need to streamline and enhance the process of extracting data from HTML content.

In the digital age, where data is king, the ability to efficiently parse and analyze web data is invaluable. This is where BrilliantSoupGPT comes into play, offering an unparalleled solution for developers, data scientists, and AI enthusiasts.

Crafting a Specialized Dataset for Fine-Tuning

The foundation of BrilliantSoupGPT's effectiveness lies in its specialized training dataset.

Developed to fine-tune large language models like GPT-4, this dataset focuses on real-world HTML examples.

The process involves extracting HTML content from sources such as Common Crawl, and then employing GPT-4 to generate Python code snippets that use the BeautifulSoup library.

These snippets are designed to execute specific tasks on the extracted HTML, ensuring the model's proficiency in handling diverse and complex data structures.

An Innovative Approach to Dataset Creation

The dataset creation involves a meticulous process:

  1. Sampling Common Crawl Files: By accessing a vast repository of web crawl data, a diverse range of HTML content is gathered.
  2. Extracting and Processing HTML: The HTML content of each URL is extracted and serves as the input to GPT-4.
  3. Generating BeautifulSoup Code: GPT-4 is prompted to write Python code snippets (BeautifulSoup code) that perform specified tasks on the given HTML.
  4. Verification and Compilation: Each output is rigorously checked for accuracy, ensuring the final dataset is of the highest quality.

This approach not only enhances the model's capability to handle real-world HTML structures but also ensures a robust and versatile training regime.
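
As a concrete illustration of what one verified example might look like once compiled, here is a minimal sketch of a training record serialized as JSONL (the field names instruction/input/output are an assumption that mirrors the prompt format shown later in this post, not a fixed schema):

import json

# Hypothetical training record; field names mirror the ##Instruction## / ##Input## /
# ##Output## prompt format used later in this post.
record = {
    "instruction": "Extract the text of all <p> tags from the HTML below.",
    "input": "<div><p>Open-source models are gaining in popularity.</p></div>",
    "output": (
        "from bs4 import BeautifulSoup\n"
        "soup = BeautifulSoup(html_content, 'html.parser')\n"
        "print([tag.get_text() for tag in soup.find_all('p')])"
    ),
}

# One JSON object per line (JSONL) is a common on-disk format for SFT datasets.
with open("sft_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")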

Cost-Effective and Efficient Dataset Development

The development of this dataset is both cost-effective and efficient. The process involves minimal expenses for generating examples and conducting manual evaluations.

Utilizing a blend of automated AI-driven processes and manual oversight, the dataset aims for both precision and economy.

The BrilliantSoupGPT in Action

Once fine-tuned, BrilliantSoupGPT showcases its potential in a variety of applications. From developing Chrome plugins that interact with web page contents to executing specialized data extraction tasks, the model's versatility is evident.

Its ability to provide Python code in response to specific HTML parsing tasks makes it a valuable asset for developers seeking fast and accurate solutions.

In summary, BrilliantSoupGPT is not just a tool; it's a revolutionary step forward in the field of AI and web data parsing.

With its advanced capabilities, cost-effective training process, and wide range of applications, BrilliantSoupGPT is poised to become an indispensable asset in the toolbox of modern developers and data scientists.

Developing a Supervised Fine-Tuning (SFT) dataset

Below is the prompt fed into Bard.


I want to build a fine-tuning dataset to fine-tune a large language model like GPT4.

The data I need should be in the following form. Below I have shown the format and an example.
_______
##Instruction##
Extract text from the HTML below.

##Input##
<div><p> The news today from the world of ML are <span> that open-source </span> models are gaining in popularity </p></div>

##Output##
from bs4 import BeautifulSoup

# HTML content
html_content = "<div><p> The news today from the world of ML are <span> that open-source </span> models are gaining in popularity </p></div>"

# Creating a BeautifulSoup object and specifying the parser
soup = BeautifulSoup(html_content, 'html.parser')

# Parsing all the <p> tags and extracting the text from within them
p_tags_text = [tag.get_text() for tag in soup.find_all('p')]

print(p_tags_text)
_______

Create more examples like the above in varying complexity of the Input HTML and the Output beautiful soup code.
Fine-tuning GPT-4 examples, created with Bard.

An example output of the prompt from Bard is shared below.
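
For illustration, an output that follows the requested format might look like the following (a hypothetical sketch of the kind of example the prompt asks for, not Bard's verbatim response):

##Instruction##
Extract all the links (href attributes) from the HTML below.

##Input##
<ul><li><a href="https://example.com/a">First</a></li><li><a href="https://example.com/b">Second</a></li></ul>

##Output##
from bs4 import BeautifulSoup

# HTML content
html_content = '<ul><li><a href="https://example.com/a">First</a></li><li><a href="https://example.com/b">Second</a></li></ul>'

# Creating a BeautifulSoup object and specifying the parser
soup = BeautifulSoup(html_content, 'html.parser')

# Collecting the href attribute of every <a> tag
links = [tag['href'] for tag in soup.find_all('a')]

print(links)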

Generating large amounts of data for real-world HTML examples

Common Crawl is a repository of Internet web crawl data. The crawled data is made available for free and is hosted on AWS S3 (https://commoncrawl.org/get-started).
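
As a rough sketch, a handful of pages can be pulled from a single crawl file with the warcio library (warcio is one common choice, not prescribed here; the WARC path below is a placeholder, and any entry from a crawl's warc.paths listing would do):

import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder path: pick any WARC file listed in a crawl's warc.paths.gz index
warc_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz"

pages = []
with requests.get(warc_url, stream=True) as resp:
    for record in ArchiveIterator(resp.raw):
        # "response" records carry the raw HTML payload of a crawled page
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            pages.append((url, html))
        if len(pages) >= 100:  # sample a small number of pages per file
            break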

Approach

  1. Sample a few Common Crawl files
  2. Extract the HTML of each URL
  3. Prompt GPT-4 to write a BeautifulSoup code snippet that performs a specific task on the HTML extracted in step 2.
  4. Verify the output is as expected.

Collect 1k such examples.
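
A minimal sketch of step 3, assuming the OpenAI Python client (the prompt wording, the helper name generate_example, and the use of the gpt-4 chat model are illustrative choices, not a fixed recipe):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_example(task: str, html: str) -> str:
    """Ask GPT-4 for a BeautifulSoup snippet that performs `task` on `html`."""
    prompt = (
        f"##Instruction##\n{task}\n\n"
        f"##Input##\n{html}\n\n"
        "##Output##\n"
        "Write Python code using BeautifulSoup that performs the instruction "
        "on the input HTML. Return only the code."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Each (task, HTML) pair sampled from Common Crawl becomes one candidate SFT example
code_snippet = generate_example("Extract the text of all <p> tags.", "<div><p>Hello</p></div>")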

Cost Considerations

SFT Dataset Creation

The Bard prompt above is roughly 231 tokens, and its response contained 4 examples, roughly 800 output tokens in total.

GPT-4 (list pricing: $0.03 per 1K input tokens, $0.06 per 1K output tokens):

0.03 * (231/1000) + 0.06 * (800/1000) ≈ $0.055

Call it roughly $0.07 per call. In the Common Crawl approach each call carries one full HTML page and yields one example, so 1,000 examples come to roughly $70.
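
Spelled out as a quick back-of-the-envelope calculation (the 800 output tokens and the one-call-per-example scaling are assumptions):

# GPT-4 list pricing at the time of writing: $0.03 / 1K input tokens, $0.06 / 1K output tokens
PRICE_IN, PRICE_OUT = 0.03, 0.06

prompt_tokens, output_tokens = 231, 800  # the Bard prompt above, plus ~4 examples of output

cost_per_call = PRICE_IN * prompt_tokens / 1000 + PRICE_OUT * output_tokens / 1000
print(f"${cost_per_call:.3f} per call")  # ~$0.055, i.e. roughly 6-7 cents

# Budgeting one GPT-4 call per example (Common Crawl prompts carry full HTML pages,
# so the real per-call cost is likely higher)
print(f"${cost_per_call * 1000:.0f} for 1,000 examples")  # ~$55-70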

Manual Evaluation of SFT dataset

Manual evaluation of 1 example in a Kaggle notebook should take a Junior/Intern SWE and a Business Analyst about 10 minutes.

In a day, the combined team can conduct 50 to 80 evals. Each eval runs the generated BeautifulSoup code on the HTML input to check that the code executes without error and produces output that solves the task.

At that pace, 1k evals take roughly 13 to 20 working days; with overhead, budget 20 to 25 days.
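
The "runs without error" part of this check can be automated before a human looks at the output. A minimal sketch, assuming the generated snippet either embeds the HTML itself or reads it from a variable named html_content (both the harness and that convention are assumptions):

import traceback

def run_snippet(code: str, html: str):
    """Execute a generated BeautifulSoup snippet against its HTML input.

    Returns (ran_without_error, error_trace). A human still judges whether the
    printed output actually solves the task.
    """
    scope = {"html_content": html}  # in case the snippet expects the HTML in scope
    try:
        exec(code, scope)
        return True, ""
    except Exception:
        return False, traceback.format_exc()

snippet = (
    "from bs4 import BeautifulSoup\n"
    "soup = BeautifulSoup(html_content, 'html.parser')\n"
    "print([t.get_text() for t in soup.find_all('p')])"
)
ok, err = run_snippet(snippet, "<div><p>Hello</p></div>")
print(ok)  # True if the snippet executed cleanly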

The cost of hiring junior/intern SWEs and Business Analysts may vary across industries and geographies. This cost can be offset by working with partners and AI agencies.

Cost of Fine-Tuning

Cost of fine-tuning a base 7b or 13b model

Anyscale: fine-tuning is billed at a fixed cost of $5 per run plus a per-million-token rate. For example, a fine-tuning job of Llama-2-13b-chat-hf with 10M tokens at $2/million tokens would cost $5 + $2 x 10 = $25.

From https://docs.endpoints.anyscale.com/pricing

For an SFT dataset of 1k examples, on the order of a couple of million training tokens, this works out to roughly $5 to $10 per run.

Fine-tuned Model Evaluation

The fine-tuned model needs to be assessed for quality. The model will be tested on scraping tasks in a similar manner to the SFT dataset creation, and the evaluation team can expect to spend another couple of weeks testing BrilliantSoupGPT.
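
A sketch of what that test loop could look like on a held-out set of examples (generate_code stands in for whatever endpoint serves the fine-tuned model; the pass criterion here is only "executes without error", and a stricter check would compare the extracted content against a reference answer):

def pass_rate(generate_code, held_out_examples):
    """Rough quality metric: fraction of held-out tasks whose generated code runs cleanly.

    generate_code(task, html) -> Python code is a stand-in for calling BrilliantSoupGPT.
    """
    passed = 0
    for example in held_out_examples:
        code = generate_code(example["instruction"], example["input"])
        try:
            exec(code, {"html_content": example["input"]})
            passed += 1  # executed without error; the output still needs human review
        except Exception:
            continue
    return passed / len(held_out_examples)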

How will BrilliantSoupGPT be used?

There are a variety of ways BrilliantSoupGPT can be used.

Imagine you are building a Chrome plugin to chat with the contents of a web page.

To extract content from the web page, you could take one of two approaches.

GPT-4

  1. You can send the HTML source to an LLM and describe your task, e.g., extract all text or extract contact details. The LLM then works on the HTML to perform the task.
  2. Depending on which LLM you use, your results may vary. GPT-4 does a great job in such cases but can be an expensive proposition.

Your Own Task-specific LLM

  1. The other approach is to develop a model such as BrilliantSoupGPT.
  2. You would fine-tune an existing base model, such as Llama 2 or CodeLlama, on the dataset created above.
  3. Once fine-tuned, BrilliantSoupGPT is the model you prompt with the task and the HTML to receive the Python code (see the sketch after this list).
  4. The Python code is then executed on the HTML to extract the content that solves the task.
  5. With BrilliantSoupGPT you can use a smaller model, which can be made to run faster and cheaper than a larger general-purpose model.
  6. And by building on an open-source LLM, you can adopt the best that open-source models have to offer.
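
A minimal sketch of steps 3 and 4, assuming the fine-tuned weights are served through the Hugging Face transformers text-generation pipeline (the model name your-org/brilliantsoup-13b is a placeholder):

from transformers import pipeline

# Placeholder name for the fine-tuned Llama 2 / CodeLlama checkpoint
generator = pipeline("text-generation", model="your-org/brilliantsoup-13b")

task = "Extract all e-mail addresses from the HTML below."
html = "<div><p>Contact us at <a href='mailto:hello@example.com'>hello@example.com</a></p></div>"

# Prompt in the same ##Instruction## / ##Input## / ##Output## format used for fine-tuning
prompt = f"##Instruction##\n{task}\n\n##Input##\n{html}\n\n##Output##\n"
result = generator(prompt, max_new_tokens=256)[0]["generated_text"]
code = result[len(prompt):]  # keep only the newly generated BeautifulSoup code

# Step 4: execute the returned code against the page HTML
exec(code, {"html_content": html})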

In this post you have learned how to develop your own task-specific LLM. If you are looking to develop your own task-specific LLMs, feel free to reach out to contact@denselayer.ai.

Dense Layer AI technologies develops bespoke AI solutions. Reach out to us today.