How to block ads with Puppeteer to speed up your web scraping

Puppeteer is one of the most popular open-source libraries for controlling headless browsers. Our no-code Puppeteer API lets you scrape a website with a simple adblock option that intercepts requests and blocks ads and tracking requests to speed up your data scraping agents.

Filters

Agenty uses the EasyList ad blocking filters by default to block ads on web requests. You can view and download the filters from the EasyList GitHub repository.
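To give a rough idea of what such a filter list contains, here is a minimal sketch that pulls plain domain rules out of EasyList-style text. It assumes the common `||domain^` rule form and ignores everything else (comments, cosmetic rules, exception rules, rules with options). The function name extractBlockedDomains and the sample text are hypothetical, and this is not a description of how Agenty processes the list internally.

```javascript
// Hypothetical helper: collect domains from simple "||domain^" rules only.
function extractBlockedDomains(filterText) {
  const domains = [];
  for (const rawLine of filterText.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines, comments ("!"), the list header ("[...]")
    // and exception rules ("@@").
    if (!line || line.startsWith("!") || line.startsWith("[") || line.startsWith("@@")) {
      continue;
    }
    // Keep only rules of the exact form ||domain^ (no options, no paths).
    const match = /^\|\|([a-z0-9.-]+)\^$/i.exec(line);
    if (match) {
      domains.push(match[1].toLowerCase());
    }
  }
  return domains;
}

// Made-up sample in EasyList-style syntax, for illustration only.
const sampleFilters = [
  "[Adblock Plus 2.0]",
  "! Title: sample list",
  "||doubleclick.net^",
  "||googlesyndication.com^$third-party",
  "##.ad-banner",
  "@@||example.com^",
  "||amazon-adsystem.com^",
].join("\n");

console.log(extractBlockedDomains(sampleFilters));
```

The rule with the `$third-party` option is skipped on purpose, since handling options correctly needs a full filter parser rather than a one-line regular expression.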

If a particular website is slow to scrape, you can also use Chrome DevTools to find the third-party requests and prepare a custom list of domains to block, then pass it in the rejectRequestPattern array.

Find ads on network requests

To find third-party requests in Chrome DevTools, open the developer tools (Ctrl + Shift + I) and go to the Network tab.

Click the filter icon to show the filter panel, then enter the website URL prefixed with - to show all requests to external domains, excluding the current website. For example:

-url:nytimes.com

This finds the external network requests on the nytimes.com website.

In the filtered results, we can easily spot the requests that serve ads. For example, the filter shows that 67 of 1083 requests match.

I can see ads served from these domains in my DevTools filter results:

googlesyndication.com
doubleclick.net
amazon-adsystem.com
...

To exclude multiple URLs, we can also use the | symbol in a regular expression. For example:

-/nytimes.com|nyt.com/

This filters out nytimes.com and nyt.com, making it easier to find third-party requests unrelated to the original content.

Using the pipe symbol, we can exclude many different websites or URLs to easily find the unnecessary requests to block and improve scraping performance.
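The same idea can be sketched in code: given a list of request URLs (for example, copied from the Network tab), keep only the hosts that do not belong to the site being scraped and count how often each one appears. The helper name findThirdPartyHosts and the sample URLs are hypothetical and used for illustration only.

```javascript
// Hypothetical helper: count requests per third-party hostname.
function findThirdPartyHosts(requestUrls, firstPartyDomains) {
  const counts = new Map();
  for (const requestUrl of requestUrls) {
    let hostname;
    try {
      hostname = new URL(requestUrl).hostname;
    } catch {
      continue; // skip anything that is not an absolute URL
    }
    if (!hostname) continue; // e.g. data: URLs have no hostname

    // A host is first-party if it equals a listed domain or is a subdomain of it.
    const isFirstParty = firstPartyDomains.some(
      (domain) => hostname === domain || hostname.endsWith("." + domain)
    );
    if (!isFirstParty) {
      counts.set(hostname, (counts.get(hostname) || 0) + 1);
    }
  }
  // Most frequently requested hosts first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Illustrative sample URLs (not real captured traffic).
const sampleRequests = [
  "https://www.nytimes.com/",
  "https://static01.nyt.com/app.js",
  "https://ads.doubleclick.net/tag.js",
  "https://ads.doubleclick.net/pixel.gif",
  "https://c.amazon-adsystem.com/bid.js",
];

console.log(findThirdPartyHosts(sampleRequests, ["nytimes.com", "nyt.com"]));
```

Hosts near the top of the result are the first candidates to review for blocking, although a frequent third-party host is not necessarily an ad server.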

We can then add all these domains to the rejectRequestPattern array to block the requests in the network interceptor.

rejectRequestPattern: [
    "googlesyndication.com",
    "/*.doubleclick.net",
    "/*.amazon-adsystem.com",
    "/*.adnxs.com",
]

Or, if you are using the Agenty cloud portal to manage your scraping agents, you can just go to agent configuration > browser settings and enter the list of domains to block.

Ad blocker in browser API

The ad blocker option is enabled by default in scraping requests. If you want to block any custom request, just pass in the rejectRequestPattern array.

rejectRequestPattern: [
 "adsystem.com"
]

You may also use regular expressions to match a pattern. For example, *.adsystem.com will also block its matching subdomains:

adserver1.adsystem.com
track.adsystem.com
...
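How the API interprets these patterns internally is not shown here. As one possible reading, the sketch below treats * as a wildcard and converts the pattern into a JavaScript RegExp. The conversion step is an assumption on my part: *.adsystem.com is not itself a valid JavaScript regular expression, because a leading * has nothing to repeat. The helper name wildcardToRegExp is hypothetical.

```javascript
// Hypothetical helper: turn a wildcard pattern such as "*.adsystem.com"
// into a case-insensitive RegExp. This is an assumed interpretation,
// not the documented behavior of any API.
function wildcardToRegExp(pattern) {
  const source = pattern
    .split("*")
    // Escape regex metacharacters in the literal parts.
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    // Each "*" becomes "match anything".
    .join(".*");
  return new RegExp(source, "i");
}

const adsystemPattern = wildcardToRegExp("*.adsystem.com");

const sampleUrls = [
  "https://adserver1.adsystem.com/ad.js",
  "https://track.adsystem.com/pixel",
  "https://example.com/index.html",
];

for (const url of sampleUrls) {
  console.log(url, adsystemPattern.test(url) ? "blocked" : "allowed");
}
```

With this reading, the literal dot before adsystem.com is required, so the pattern matches subdomains such as track.adsystem.com but not unrelated hosts.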

Or, if you want to disable the blocker, just pass the blockAds: false option in the request body.

Ad blocker in Puppeteer

We can use the setRequestInterception(true) method in Puppeteer to enable the interceptor.

Once interception is enabled, we can block any resource with simple if/else logic that matches each request against our rules and decides whether it should be aborted or continued.

Here is an example:

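  // Assumes `page` is an existing Puppeteer Page and this runs inside an async function.
  await page.setRequestInterception(true);

  // Each string is treated as a regular expression by String.prototype.match().
  const rejectRequestPattern = [
    "googlesyndication.com",
    "/*.doubleclick.net",
    "/*.amazon-adsystem.com",
    "/*.adnxs.com",
  ];
  const blockList = []; // URLs of the requests that were aborted

  page.on("request", (request) => {
    const url = request.url();
    if (rejectRequestPattern.some((pattern) => url.match(pattern))) {
      blockList.push(url);
      request.abort();
    } else {
      request.continue();
    }
  });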

The full code is available on my gist.

Running this example in Visual Studio Code resulted in 16 requests being blocked before the screenshot was captured.
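For reuse across several pages, the interception logic above could be wrapped in a small helper. This is only a sketch and not the code from the gist: the function names shouldBlock and attachAdBlocker are hypothetical, and the helper relies only on the Puppeteer calls already shown (setRequestInterception, page.on, request.url, request.abort, request.continue).

```javascript
// Hypothetical helper: true if the URL matches any pattern.
// Patterns are compiled as regular expressions, mirroring String.prototype.match().
function shouldBlock(url, patterns) {
  return patterns.some((pattern) => new RegExp(pattern).test(url));
}

// Hypothetical helper: enable interception on a Puppeteer Page and abort
// matching requests. Returns the array that collects blocked URLs.
async function attachAdBlocker(page, patterns) {
  const blocked = [];
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    const url = request.url();
    if (shouldBlock(url, patterns)) {
      blocked.push(url);
      request.abort();
    } else {
      request.continue();
    }
  });
  return blocked;
}

// Usage sketch (needs the puppeteer package and an async function; not run here):
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   const blocked = await attachAdBlocker(page, ["googlesyndication.com", "/*.doubleclick.net"]);
//   await page.goto("https://www.nytimes.com", { waitUntil: "networkidle2" });
//   await page.screenshot({ path: "page.png" });
//   console.log(`${blocked.length} requests blocked`);
//   await browser.close();
```

Keeping the matching logic in a separate pure function makes it possible to check a pattern list against sample URLs without launching a browser.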
