How to do client-side web scraping with JavaScript and jQuery?

The web is increasingly built with front-end frameworks such as Next.js, React, and Angular, so we need new ways to scrape data from websites. This article shows you how to write your own client-side web scraping scripts with jQuery in Agenty.

I’ll show you how to get your web data with just a few lines of JavaScript or jQuery in Agenty, so you can automate your business by collecting data from websites.

Scraping data with jQuery

  • Create a new scraping agent or edit an existing one in your account.
  • Go to the Configuration tab and click on Field and collection in the sidebar.
  • Click on the jQuery button at the top right of the fields section.

Here you can write any JavaScript or jQuery code to scrape data from the given URL. The code is injected into the current browser page and evaluated in the browsing session.

As a demo, let’s build a scraper that extracts news from the Hacker News website using these 10 lines of code.


function extract() {
    const result = [];
    $('.itemlist tr[id]').each(function (index, tr) {
        const item = {
            rank: index,
            url: $(tr).find('.titlelink').attr('href'),
            title: $(tr).find('.titlelink').text()
        };
        result.push(item);
    });
    return result;
}

If you look at this jQuery code, I use each() to iterate over every table row that has an id, and the find() function to locate the elements matching the given selector within each row.

Then I scrape the data with the text() or attr() functions to extract text content or attribute values.

I also added the index to keep track of each story’s rank on the website, so we can see which news appeared at which position.
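The iterate-and-push pattern above doesn’t depend on the DOM itself. Here is a minimal, runnable sketch of the same logic over plain objects, with hypothetical sample data standing in for the table rows:

```javascript
// Hypothetical sample data standing in for the <tr> rows on the page.
const rows = [
  { href: 'https://example.com/a', text: 'Story A' },
  { href: 'https://example.com/b', text: 'Story B' },
];

function extract(rows) {
  const result = [];
  // Like jQuery's .each(), iterate with an index and collect one item per row.
  rows.forEach(function (row, index) {
    result.push({ rank: index, url: row.href, title: row.text });
  });
  return result;
}

console.log(extract(rows)); // → two items, with ranks 0 and 1
```

The only thing jQuery adds on a real page is the selector-driven traversal; the accumulation into a result array is plain JavaScript.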


Run your agent by clicking the Run button and see the tabular output on the Result tab, downloadable in CSV, JSON, and TSV formats.

Scraping data with JavaScript

Scraping with JavaScript is a way to get data from any website that doesn’t expose it through an API or any other channel into your system. In this section we will see how scraping can be done in plain JavaScript, without the jQuery dependency.


function extract() {
    const result = [];
    const elements = document.querySelectorAll('.itemlist tr[id]');
    let index = 0;
    for (let element of elements) {
        const item = {
            rank: index++,
            url: element.querySelector('.titlelink').getAttribute('href'),
            title: element.querySelector('.titlelink').innerText
        };
        result.push(item);
    }
    return result;
}

In the JavaScript version we have simply switched to a for...of loop, using querySelectorAll() and querySelector() to traverse the elements in the document, and extracting the text and attribute values with the native innerText property and getAttribute() function.
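One detail worth noting: getAttribute('href') returns the attribute exactly as written in the HTML, which on Hacker News is often a relative URL such as item?id=123. Assuming you want absolute links in your output, the built-in URL constructor can resolve them against the page’s base URL; a small sketch:

```javascript
// Resolve a possibly-relative href against a base URL.
// In the browser you would pass document.baseURI as the base.
function toAbsolute(href, base) {
  return new URL(href, base).href;
}

const base = 'https://news.ycombinator.com/';
toAbsolute('item?id=123', base);           // 'https://news.ycombinator.com/item?id=123'
toAbsolute('https://example.com/x', base); // already absolute, returned as-is
```

You could call toAbsolute() on the url field inside the loop before pushing each item.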

Optionally, we can also test our code in Chrome Developer Tools to make development and debugging easier. Follow these steps:

  1. Open Developer Tools
  2. Go to the Sources tab
  3. Go to Snippets and create a new snippet
  4. Paste the code
  5. Add one more line at the bottom to execute the extract() function and print the result using console.table()
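Step 5 can be sketched like this; the stub extract() below only stands in for the real scraper pasted above it, since the actual function needs a live page to run against:

```javascript
// Stub standing in for the pasted extract() function, which needs a live page.
function extract() {
  return [
    { rank: 0, url: 'item?id=1', title: 'Story A' },
    { rank: 1, url: 'item?id=2', title: 'Story B' },
  ];
}

// The extra line added at the bottom of the snippet:
console.table(extract());
```

console.table() renders the array of objects as a sortable table in the console, with one column per property, which makes it easy to spot rows where a selector matched nothing.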


Once tested, remove the console.table(extract()); line and paste the code into Agenty to run your web scraping agent against a batch of input URLs.

14-day free trial

Automate your business with advanced, fully-featured agents on Agenty, a fast, scalable, no-code web automation tool for scraping, change monitoring, and more.

Sign up for free →

No credit card required