Why?
Large Language Models are widely used in web scraping, but the most popular approach is to convert each HTML page into Markdown and feed it into the model. This is not always optimal: the model may miss the structure of the HTML page, and it is prohibitively slow and expensive to scrape a large number of similar pages (e.g., 10k product pages or 10k news items from the same website) this way. This tool instead builds a pipeline that uses an LLM to generate JavaScript extraction code, which scales much better.
Workflow
The workflow consists of three sequential steps:
- Generate the JS code according to the prompt, using the sample HTML as a reference. This involves HTML cleanup, compression, and trimming, so the valuable DOM structure can be fed into the LLM without overwhelming it with noisy, useless data (see the sketch after this list).
- Run the generated JS code against the sample HTML to evaluate the quality of the extractor.
- (Optional) Improve the generated JS code after analyzing its output by adding a new improvement prompt.
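To make the cleanup step concrete, here is a minimal sketch of what HTML compression might look like. The node and attribute choices (keeping class, id, and href, trimming long text nodes) are illustrative assumptions, not the tool's actual implementation:

import * as cheerio from 'cheerio';

// Illustrative cleanup: strip noisy nodes and bulky attributes so the
// sample HTML fits into the LLM context while the DOM structure survives.
function compressHtml(html) {
  const $ = cheerio.load(html);
  // scripts, styles and inline SVG carry no value for selector generation
  $('script, style, svg, noscript').remove();
  $('*').each((_, el) => {
    // keep only the attributes that CSS selectors typically rely on
    for (const attr of Object.keys(el.attribs || {})) {
      if (!['class', 'id', 'href'].includes(attr)) $(el).removeAttr(attr);
    }
  });
  // truncate long text nodes -- structure matters more than full content
  $('*').contents().each((_, node) => {
    if (node.type === 'text' && node.data.length > 80) {
      node.data = node.data.slice(0, 80) + '…';
    }
  });
  return $.html();
}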
This sandbox was created to streamline Node.js HTML scraper development. It evolved from the first generation of the Cheerio Sandbox, which lets you test your own extractors.
How to use the AI Cheerio Generator
First of all, you need to paste the HTML code of the web page you are analyzing. You can use the ScrapeNinja online sandbox for this and copy the HTML from the web scraper's output, or simply use the "View Page Source" context menu item in your web browser. It is important to understand that "View Page Source" shows the raw source, which is not what you see in a real browser. This Chrome extension allows you to compare the raw HTML with the HTML of the browser-rendered page. ScrapeNinja can extract both versions, since it has two scraping engines, but it is much better to start with the raw HTML: extracting a browser-rendered page is a slow, fragile, and resource-intensive process.
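If you prefer to grab the raw HTML programmatically rather than via the browser, a minimal sketch with node-fetch (the URL is a placeholder) looks like this:

import fetch from 'node-fetch';

// fetch the raw HTML -- the same thing "View Page Source" shows,
// before any JavaScript runs in a browser
const res = await fetch('https://example.com/some-page');
const rawHtml = await res.text();
console.log(rawHtml.length, 'characters of raw HTML');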
Once you've pasted the HTML source, run the AI code generator. It will generate JavaScript code that extracts the data from the HTML. Don't forget that this is just the first step: you can now run the generated code against the HTML to see the results.
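For reference, the generated code is a plain function that receives the HTML and a cheerio instance. A hypothetical example for a product listing (the .product, .title, and .price selectors are made up here) might look like:

function extract(input, cheerio) {
  const $ = cheerio.load(input);
  // collect one object per matched product card
  return $('.product').map((_, el) => ({
    title: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
  })).get();
}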
Extracting article/blog/news text data from arbitrary websites
To extract articles and news data from many websites (hundreds), it is not feasible to write and maintain a generic JS extractor. Instead, consider using a specialized tool: the Article Extractor API project, which leverages the ScrapeNinja scraping engine under the hood.
How to use this in a real project:
You can use this extractor function in your local Cheerio installation (you need a working Node.js installation for this) or in the ScrapeNinja extractor field.
Running your extractor locally:
Step #1. Create a project folder and install node-fetch & cheerio
mkdir your-project-folder && \
cd "$_" && \
npm i -g create-esnext && \
npm init esnext && \
npm i node-fetch cheerio
Step #2. Copy & paste the code
Create a new empty file, e.g. scraper.js, and paste the code into it:
import * as cheerio from 'cheerio';

// paste the extractor function here
function extract(input, cheerio) { ... }

// the extractor function can now be called as extract()
// retrieve your input from node-fetch or the file system
const input = '<h2 class="title">YOUR TEST INPUT</h2>';

let results = extract(input, cheerio);

// the extracted JSON data is now in the results variable
console.log(results);
Step #3. Launch
node ./scraper.js
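If you want to run the extractor against a live page instead of a hardcoded string, a sketch of the same script using node-fetch (the URL is a placeholder) could look like this:

import * as cheerio from 'cheerio';
import fetch from 'node-fetch';

// paste the generated extractor here
function extract(input, cheerio) { /* ... */ }

// pull the live HTML and feed it to the extractor
const res = await fetch('https://example.com/some-page');
const html = await res.text();
console.log(extract(html, cheerio));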
Running your scraper with the extractor in ScrapeNinja:
Just copy & paste the function code into the "extractor" field in the ScrapeNinja sandbox, then put the generated ScrapeNinja code into your local Node.js script.
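The sandbox generates the exact request code for you. As a rough illustration only (the endpoint and field names below are assumptions based on the ScrapeNinja RapidAPI docs, so prefer the generated code), the resulting script is a plain HTTP call that sends the target URL together with the extractor source:

import fetch from 'node-fetch';

// illustrative only -- copy the real request from the ScrapeNinja sandbox
const res = await fetch('https://scrapeninja.p.rapidapi.com/scrape', {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'X-RapidAPI-Key': 'YOUR_RAPIDAPI_KEY', // placeholder
  },
  body: JSON.stringify({
    url: 'https://example.com/some-page',
    // the extractor function travels as a string
    extractor: 'function extract(input, cheerio) { /* generated code */ }',
  }),
});
console.log(await res.json());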