Cheerio LLM Generator & Sandbox

This tool lets you generate JavaScript code for a Cheerio.js extractor using an LLM (Large Language Model) and quickly test it against the provided HTML in a virtual JavaScript sandbox. The generated extractor code can later be used in ScrapeNinja API calls and in the ScrapeNinja online web scraper builder to extract structured data from similar webpages. This is an experimental tool.


Why?

Large Language Models are actively used in web scraping, but the most popular approach is to convert each HTML page into markdown and feed it into the model. This is not always optimal, as the model may not understand the structure of the HTML page. It is also prohibitively slow and expensive when scraping a large number of similar pages (e.g., 10k product pages or 10k news items from the same website), because every page requires its own model call. This tool instead uses an LLM to generate JavaScript extractor code once; that code can then run on every similar page, which scales much better.

Workflow

The workflow consists of three sequential steps:

1. Generate the JS code according to the prompt, using the sample HTML as a reference. This involves HTML cleanup, compression, and trimming, so that valuable information reaches the LLM without noisy, useless data, while the DOM structure is preserved.
2. Run the generated JS code against the sample HTML to evaluate the quality of the extractor.
3. (Optional) Improve the generated JS code after analyzing its output by adding a new improvement prompt.
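
The cleanup, compression, and trimming in the first step can be pictured with a rough sketch like the one below. This is not the tool's actual implementation: the function name `compressHtml`, the `maxLength` parameter, the returned field names, and the regex-based stripping are all assumptions made for illustration, and a real pipeline would more likely use an HTML parser.

```javascript
// Hypothetical sketch of HTML cleanup, compression, and trimming before
// sending a sample to an LLM. Not the tool's real code.
function compressHtml(html, maxLength = 20000) {
  const cleaned = html
    // Drop content that rarely helps an extractor: scripts, styles, comments.
    .replace(/<script\b[\s\S]*?<\/script\s*>/gi, '')
    .replace(/<style\b[\s\S]*?<\/style\s*>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '')
    // Collapse whitespace runs; tags and attributes (the DOM structure) stay.
    .replace(/\s+/g, ' ')
    .trim();

  // Trim to a budget so the prompt stays within the model's context window.
  const trimmed = cleaned.slice(0, maxLength);

  return {
    html: trimmed,
    inputLength: html.length,
    cleanedInputLength: cleaned.length,
    // Share of the original input removed by cleanup (0..1).
    compressionRatio: html.length === 0 ? 0 : 1 - cleaned.length / html.length,
  };
}

const sample = '<div>\n  <script>track()</script>\n  <h2 class="title">Hello</h2>\n</div>';
console.log(compressHtml(sample).html);
```

The sketch keeps tags and attributes intact, because the generated extractor depends on selectors that match the original DOM.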

This sandbox was created to streamline Node.js HTML scraper development. It evolved from the first-generation Cheerio Sandbox, which lets you test your own extractors.

How to use the AI Cheerio Generator

First, paste the HTML code of the web page you want to analyze. You can get it from the ScrapeNinja online sandbox by copying the HTML from the web scraper's output, or use the "View Page Source" context menu item in your web browser. Keep in mind that "View Page Source" shows the raw source, which may differ from what you see rendered in the browser. This Chrome extension lets you compare the raw HTML with the HTML of the browser-rendered page. ScrapeNinja can retrieve both versions, since it has two scraping engines, but it is much better to try raw HTML first, as extracting a browser-rendered page is a slow, fragile, and resource-intensive process.
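
One quick way to decide whether raw HTML is enough is to check that the values you want to extract actually appear in it. The helper below is a hypothetical sketch (the name `findMissingValues` is not part of ScrapeNinja or Cheerio). It does a plain substring search, so it ignores HTML entity encoding and text split across tags.

```javascript
// Hypothetical helper: report which expected values are absent from raw HTML.
// If values are missing, the page likely renders them with JavaScript and the
// browser-rendered HTML may be needed instead.
function findMissingValues(rawHtml, expectedValues) {
  return expectedValues.filter((value) => !rawHtml.includes(value));
}

const rawHtml = '<div id="app"></div><h1>Blue Widget</h1>';
console.log(findMissingValues(rawHtml, ['Blue Widget', '$19.99']));
```

An empty result suggests the raw HTML already contains everything the extractor needs.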

Once you've pasted the HTML source, run the AI code generator. It will generate JavaScript code that extracts the data from the HTML. Remember that this is only the first step: next, run the generated code against the HTML to see the results.
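
Conceptually, running an extractor means evaluating the generated source against the sample HTML and capturing either the result or the error. The sketch below is an assumption about how that could look, not the sandbox's actual code; `runExtractor` is a hypothetical name, and `new Function` is used only for brevity. It provides no isolation, so do not run untrusted code this way.

```javascript
// Hypothetical sketch: evaluate extractor source code against sample HTML.
// `code` must define a function named extract(input, cheerio).
function runExtractor(code, input, cheerio) {
  try {
    // Build a wrapper that defines extract() and immediately calls it.
    const run = new Function('input', 'cheerio', `${code}\nreturn extract(input, cheerio);`);
    return { ok: true, result: run(input, cheerio) };
  } catch (err) {
    // Syntax errors and runtime errors both end up here.
    return { ok: false, error: String(err && err.message ? err.message : err) };
  }
}

// Usage with a trivial extractor that does not need Cheerio at all.
const code = 'function extract(input, cheerio) { return { length: input.length }; }';
console.log(runExtractor(code, '<p>hi</p>', null));
```

Returning the error as data instead of throwing makes it easy to pass the message into an improvement prompt.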

Extracting article/blog/news text data from arbitrary websites

To extract articles and news data from many websites (hundreds), writing and maintaining a generic JS extractor is not feasible. Instead, consider a specialized tool: the Article Extractor API project, which uses the ScrapeNinja scraping engine under the hood.

How to use this in a real project:

You can use this extractor function in your local Cheerio installation (this requires Node.js to be installed) or in the ScrapeNinja extractor field.

Running your extractor locally:

Step #1. Create a project folder and install node-fetch and cheerio

mkdir your-project-folder && \
  cd "$_" && \
  npm i -g create-esnext && \
  npm init esnext && \
  npm i node-fetch cheerio -y

Step #2. Copy and paste the code

Create a new empty file, e.g. scraper.js, and paste this code into it:

// cheerio 1.x has no default export, so import the module namespace
import * as cheerio from 'cheerio';

// paste the extractor function here
function extract(input, cheerio) { ... } // the extractor function can now be called as extract()

// retrieve your input with node-fetch or from the file system
const input = '<h2 class="title">YOUR TEST INPUT</h2>';

const results = extract(input, cheerio);

// the JSON data is now in the results variable
console.log(results);
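
For illustration, an extractor for the test input above might look like the sketch below. It is a hand-written assumption about the general shape of an extractor, not actual output of the generator; the selector `h2.title` simply matches the sample input. It relies only on Cheerio's `load()` function and the `.text()` method.

```javascript
// Hypothetical extractor: receives the HTML string and the cheerio module,
// returns a plain object that can be serialized to JSON.
function extract(input, cheerio) {
  // Parse the HTML and get a jQuery-like query function.
  const $ = cheerio.load(input);

  return {
    // Text of the element(s) matching h2.title, without surrounding whitespace.
    title: $('h2.title').text().trim(),
  };
}
```

With the sample input from the template, the returned object would contain a single `title` field holding the heading text.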

Step #3. Launch

node ./scraper.js

Running your scraper with the extractor in ScrapeNinja:

Copy and paste the function's code into the "extractor" field of the ScrapeNinja sandbox, then put the generated ScrapeNinja code into your local Node.js script.