Puppeteer is one of the most popular open-source libraries for controlling headless browsers. Our no-code Puppeteer API lets you scrape a website with a simple ad-block option that intercepts requests and blocks ads and tracking requests to speed up your data scraping agents.
Filters
Agenty uses the EasyList ad-blocking filters by default to block ads on web requests. You can view and download the filters from the EasyList GitHub repository.
If a particular website is slow to scrape, you can also use Chrome DevTools to find its third-party requests and prepare a custom list of domains to block by passing them in the rejectRequestPattern array.
Find ads on network requests
To find the third-party requests in Chrome DevTools, open the developer tools (Ctrl + Shift + I) and go to the Network tab.
Click the filter icon to show the filter panel, then enter the website URL prefixed with - to list requests to all external domains, excluding the current website. For example, to find external network requests on the nytimes.com website:
-url:nytimes.com
In the filtered results, we can easily find the requests that serve ads. For example, the filter shows that 67 of 1083 requests match.
I can see ads from these domains in my DevTools filter results:
googlesyndication.com
doubleclick.net
amazon-adsystem.com
...
To exclude multiple URLs, we can also use the | symbol in a regular expression. For example:
-/nytimes.com|nyt.com/
This hides requests to both nytimes.com and nyt.com, which makes it easier to find third-party requests unrelated to the original content.
Using the pipe symbol, we can exclude many different websites or URLs, find the unnecessary requests to block, and improve the performance of the web scraping software.
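The same first-party/third-party split that the DevTools filter performs can also be sketched in code, for example over a list of request URLs copied out of the Network tab. This is only a minimal sketch: the helper name findThirdPartyHosts and the sample URLs are made up for illustration and were not captured from a real page load.

```javascript
// Hypothetical helper: given request URLs and the site's own domains,
// return the distinct third-party hostnames with their request counts.
function findThirdPartyHosts(urls, firstPartyDomains) {
  const counts = new Map();
  for (const raw of urls) {
    const host = new URL(raw).hostname;
    // A host is first-party if it equals a listed domain or is a subdomain of it.
    const isFirstParty = firstPartyDomains.some(
      (domain) => host === domain || host.endsWith("." + domain)
    );
    if (!isFirstParty) {
      counts.set(host, (counts.get(host) || 0) + 1);
    }
  }
  // Sort by request count, busiest host first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Illustrative sample data only.
const sampleUrls = [
  "https://www.nytimes.com/",
  "https://static01.nyt.com/images/logo.png",
  "https://securepubads.g.doubleclick.net/tag/js/gpt.js",
  "https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js",
  "https://securepubads.g.doubleclick.net/gampad/ads",
];

console.log(findThirdPartyHosts(sampleUrls, ["nytimes.com", "nyt.com"]));
```

Hosts that appear near the top of the result and clearly serve ads or tracking are the natural candidates for a custom block list.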
So, we can add all these domains to the rejectRequestPattern array to block the requests in the network interceptor.
rejectRequestPattern: [
  "googlesyndication.com",
  "/*.doubleclick.net",
  "/*.amazon-adsystem.com",
  "/*.adnxs.com",
]
Or, if you are using the Agenty cloud portal to manage your scraping agents, you can just go to agent configuration > browser settings and enter the list of domains to block.
Ad blocker in browser API
The ad blocker option is enabled by default in scraping requests. If you want to block any custom request, just pass it in the rejectRequestPattern array.
rejectRequestPattern: [
  "adsystem.com"
]
You can also use a regular expression to match a pattern. For example, *.adsystem.com will also block its matching subdomains:
adserver1.adsystem.com
track.adsystem.com
...
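To make the subdomain behaviour concrete, here is a minimal sketch of one way such a pattern could be checked against a request's hostname. How Agenty evaluates patterns internally is not described here, so this is an assumption: the helpers patternToRegExp and isBlocked are hypothetical, and * is treated as a simple wildcard while every other character is matched literally.

```javascript
// Hypothetical helper: convert a wildcard pattern into a regular expression.
function patternToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*/g, ".*"); // "*" matches any run of characters
  return new RegExp(escaped);
}

// Hypothetical helper: true if the URL's hostname matches any pattern.
function isBlocked(url, patterns) {
  const host = new URL(url).hostname;
  return patterns.some((pattern) => patternToRegExp(pattern).test(host));
}

const patterns = ["*.adsystem.com"];
console.log(isBlocked("https://adserver1.adsystem.com/banner.js", patterns));
console.log(isBlocked("https://track.adsystem.com/pixel.gif", patterns));
console.log(isBlocked("https://www.example.com/index.html", patterns));
```

In this sketch the match is not anchored, so a plain pattern such as adsystem.com also matches any hostname that contains it.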
Or, if you want to disable the blocker, just pass the blockAds: false option in the request body.
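As a rough illustration of how the two options sit together in a request body, consider the sketch below. Only blockAds and rejectRequestPattern come from the text above; the url field name and the helper buildRequestBody are assumptions made for this example, so the real body schema should be checked against the Agenty API reference.

```javascript
// Hypothetical helper: assembles a request body for the browser API.
// "url" is an assumed field name; blockAds and rejectRequestPattern are the
// options described above.
function buildRequestBody(url, { blockAds = true, rejectRequestPattern = [] } = {}) {
  const body = { url, blockAds };
  // Only include custom patterns when there are any.
  if (rejectRequestPattern.length > 0) {
    body.rejectRequestPattern = rejectRequestPattern;
  }
  return body;
}

// Keep the default blocker and add one custom domain.
console.log(JSON.stringify(buildRequestBody("https://example.com", {
  rejectRequestPattern: ["adsystem.com"],
})));

// Disable ad blocking entirely.
console.log(JSON.stringify(buildRequestBody("https://example.com", { blockAds: false })));
```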
Ad blocker in Puppeteer
We can use the setRequestInterception(true) method in Puppeteer to enable the interceptor.
Once interception is enabled, we can block any resource with simple if/else logic that checks each request against our rules and decides whether it should be aborted or allowed to continue.
Here is an example:
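// Enable request interception so every request passes through our handler.
await page.setRequestInterception(true);

// URL patterns to block. String.prototype.match() treats each string as a regular expression.
const rejectRequestPattern = [
  "googlesyndication.com",
  "/*.doubleclick.net",
  "/*.amazon-adsystem.com",
  "/*.adnxs.com",
];
const blockList = []; // URLs of the requests that were blocked

page.on("request", (request) => {
  if (rejectRequestPattern.some((pattern) => request.url().match(pattern))) {
    blockList.push(request.url());
    request.abort();
  } else {
    request.continue();
  }
});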
The full code is available on my gist.
Running this example in Visual Studio Code resulted in 16 requests being blocked before the screenshot was captured.
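The abort-or-continue decision in the example above can also be pulled into a small factory so it can be exercised without launching a browser. This is a sketch under stated assumptions, not the code from the gist: createAdBlockHandler and fakeRequest are hypothetical names, and the fake object only imitates the url(), abort() and continue() methods of a Puppeteer request.

```javascript
// Hypothetical helper: builds a "request" event handler for Puppeteer.
// Blocked URLs are appended to blockList so they can be inspected afterwards.
function createAdBlockHandler(patterns, blockList) {
  // Compile once instead of on every request. Each string is treated as a
  // regular expression, as String.prototype.match() does with a string argument.
  const regexes = patterns.map((pattern) => new RegExp(pattern));
  return (request) => {
    const url = request.url();
    if (regexes.some((regex) => regex.test(url))) {
      blockList.push(url);
      return request.abort();
    }
    return request.continue();
  };
}

// Stand-in for a Puppeteer request, exposing only the methods used above.
function fakeRequest(url) {
  const calls = [];
  return {
    calls,
    url: () => url,
    abort: () => calls.push("abort"),
    continue: () => calls.push("continue"),
  };
}

const blocked = [];
const handler = createAdBlockHandler(
  ["googlesyndication.com", "/*.doubleclick.net"],
  blocked
);
const adRequest = fakeRequest("https://securepubads.g.doubleclick.net/tag/js/gpt.js");
const pageRequest = fakeRequest("https://www.nytimes.com/");
handler(adRequest);
handler(pageRequest);
console.log(adRequest.calls, pageRequest.calls, blocked);
```

With a real page, the handler would be registered with page.on("request", createAdBlockHandler(rejectRequestPattern, blockList)) after calling page.setRequestInterception(true), and blockList.length would give the number of blocked requests.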