New Jul 2024 Cheerio Sandbox v2: Use AI to write your JS extractor!
Three ways to extract data values from HTML using Cheerio (a combined sketch follows the list):
1. Index-based CSS Selector Extraction: The tdByIndex property extracts the text from the second cell of the second row in a table. The :eq() pseudo-class selects elements by their index, starting from 0.
2. Functional Traversing: The tdTraversed property demonstrates a more functional approach to DOM traversal. It starts by selecting a table, finds a td element within it, moves to its parent tr element, navigates to the next tr sibling, and finally selects the second td element within that row.
3. Sibling Element Extraction by Text: The tdByText property showcases how to extract data based on the text content of a preceding element, which is well suited to HTML pages with fully dynamic CSS classes. The selector $('td:contains("Second row title") + td').text() finds a td element containing the text "Second row title", then selects its immediately following td sibling and extracts its text.
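To make the three approaches concrete, here is a minimal runnable sketch; the sample HTML, the exact selectors, and the result values are illustrative assumptions that mirror the descriptions above, not the sandbox's exact demo code:

import * as cheerio from 'cheerio'; // cheerio 1.x namespace import; older versions also exposed a default export

// Sample two-row table; real pages will have different markup.
const html = `
  <table>
    <tr><td>First row title</td><td>First row value</td></tr>
    <tr><td>Second row title</td><td>Second row value</td></tr>
  </table>`;

function extract(input, cheerio) {
  const $ = cheerio.load(input);
  return {
    // 1. Index-based: second cell of the second row, using the 0-based :eq() pseudo-class
    tdByIndex: $('table tr:eq(1) td:eq(1)').text(),

    // 2. Functional traversing: table -> first td -> parent tr -> next tr -> its second td
    tdTraversed: $('table').find('td').first()
      .parent()        // the first tr
      .next()          // the second tr
      .find('td').eq(1)
      .text(),

    // 3. Sibling extraction by text: the td right after the cell containing the title
    tdByText: $('td:contains("Second row title") + td').text(),
  };
}

console.log(extract(html, cheerio));
// => { tdByIndex: 'Second row value', tdTraversed: 'Second row value', tdByText: 'Second row value' }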
Why?
This sandbox was created to streamline Node.js HTML scraper development. It evolved from the primary ScrapeNinja Live Sandbox, which executed HTTP requests and scraped a target website on every form submission. This wasn't efficient for rapid HTML extractor testing, especially with challenging and slow sites. By isolating the HTML extraction component, we've made iterative REPL coding for HTML extraction quicker and more efficient. Debugging cheerio extractors locally can be time-consuming, requiring multiple test runs to ensure consistent syntax. Learn more about the sandbox's creation and functionality in our blog post.
How to write your perfect extractor
Websites change their HTML layouts and break things, so the perfect, bulletproof extractor is the one you never had to write. Before scraping HTML, check whether the website you are scraping exposes some sort of JSON API you can use instead.
The extractor uses the cheerio Node.js package, so start by reading its documentation.
Cheerio is similar to jQuery in many cases, but with notable and sometimes annoying differences.
The best tool for finding and testing your CSS selectors is the Chrome DevTools console.
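As a quick sanity check, you can try a selector in the DevTools console first and then port it to cheerio. The snippet below is a small illustrative sketch; note that jQuery-style pseudo-classes such as :eq() and :contains() work in cheerio but are not valid for the browser's querySelectorAll:

// In the Chrome DevTools console, $$ is an alias for document.querySelectorAll:
//   $$('li.price')[0].textContent
// The same standard CSS selector behaves identically in cheerio:
import * as cheerio from 'cheerio';

const $ = cheerio.load('<ul><li class="price">19.99</li></ul>');
console.log($('li.price').text());          // '19.99'

// jQuery-style extensions exist only on the cheerio side:
console.log($('li:contains("19")').text()); // '19.99' -- querySelectorAll would reject this selector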
Extracting article/blog/news text data from arbitrary websites
To extract article and news data from many (hundreds of) websites, it's not feasible to write and maintain a generic JS extractor. Instead, consider using a specialized tool: the Article Extractor API project, which leverages the ScrapeNinja scraping engine under the hood.
How to use in a real project:
You can use this extractor function with your local cheerio installation (you need a Node.js installation for this) or in the ScrapeNinja extractor field of the /scrape endpoint.
Running your extractor locally:
Step #1. Create a project folder and install node-fetch & cheerio
mkdir your-project-folder && \
cd "$_" && \
npm i -g create-esnext && \
npm init esnext && \
npm i node-fetch cheerio
Step #2. Copy&paste the code
Create a new, empty file such as scraper.js and paste this code into it:
import cheerio from 'cheerio'

// paste the extractor function here
function extract(input, cheerio) { ... }

// the extractor function can now be called as extract()
// retrieve your input from node-fetch or file system
const input = '<h2 class="title">YOUR TEST INPUT</h2>';

let results = extract(input, cheerio);

// the json data is now located in results variable
console.log(results);
Step #3. Launch
node ./scraper.js
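For reference, here is a fuller scraper.js sketch that fetches a live page with node-fetch and runs a simple extractor on it. The URL and the selector are placeholders, and the namespace import assumes cheerio 1.x (older release candidates also exposed a default export):

import fetch from 'node-fetch';
import * as cheerio from 'cheerio';

// Extractor in the same shape as above: raw HTML in, plain JS object out.
function extract(input, cheerio) {
  const $ = cheerio.load(input);
  return {
    // example.com exposes a single <h1>; replace with selectors for your target page
    title: $('h1').first().text().trim(),
  };
}

// Top-level await works because the project was initialized as an ES module.
const response = await fetch('https://example.com/');
const html = await response.text();

const results = extract(html, cheerio);
console.log(results); // => { title: 'Example Domain' }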
Running your scraper with an extractor in ScrapeNinja:
Just copy & paste the function code into the "extractor" field in the ScrapeNinja sandbox, then put the generated ScrapeNinja code into your local Node.js script.
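A sketch of what the resulting Node.js call can look like; the endpoint URL and auth headers below assume the RapidAPI distribution of ScrapeNinja and are placeholders, so prefer the exact code generated by the ScrapeNinja sandbox:

import fetch from 'node-fetch';

// The extractor travels as a string in the "extractor" field and is executed
// server-side against the scraped HTML.
const extractor = `function extract(input, cheerio) {
  const $ = cheerio.load(input);
  return { title: $('h1').first().text().trim() };
}`;

const response = await fetch('https://scrapeninja.p.rapidapi.com/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-RapidAPI-Key': 'YOUR_RAPIDAPI_KEY',           // placeholder credentials
    'X-RapidAPI-Host': 'scrapeninja.p.rapidapi.com', // may differ for your setup
  },
  body: JSON.stringify({
    url: 'https://example.com/', // target page to scrape
    extractor,                   // your extractor function, as a string
  }),
});

const data = await response.json();
// Inspect the JSON response for the extractor output.
console.log(data);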