Scraping Data from JSON using Regex

In this tutorial, we will learn how to extract data from JSON pages or APIs using a scraping agent with the super-fast Regular Expression (Regex) extractor by Agenty. JSON (JavaScript Object Notation) is a lightweight data-interchange format, widely used by websites and APIs to deliver data in a structured way.

JSON Example

For this demo, I have created a JSON example page, https://cdn.agenty.com/examples/json/json-example-1.json, whose content is shown below. It is an array of objects, where each object has 5 properties (rank, cnt, tt, au, yr) and their corresponding values.

[
  {
    "rank": 1,
    "cnt": 5,
    "tt": "The Great Gatsby",
    "au": "F. Scott Fitzgerald",
    "yr": "1990"
  },
  {
    "rank": 2,
    "cnt": 5,
    "tt": "The Grapes of Wrath",
    "au": "John Steinbeck",
    "yr": "1972"
  },
  {
    "rank": 3,
    "cnt": 5,
    "tt": "The Catcher in the Rye",
    "au": "J.D. Salinger",
    "yr": "1993"
  },
  {
    "rank": 4,
    "cnt": 4,
    "tt": "Invisible Man",
    "au": "Ralph Ellison",
    "yr": "1988"
  },
  {
    "rank": 5,
    "cnt": 4,
    "tt": "The Sound and the Fury",
    "au": "William Faulkner",
    "yr": "1987"
  },
  {
    "rank": 6,
    "cnt": 4,
    "tt": "The Sun Also Rises",
    "au": "",
    "yr": "1988"
  },
  {
    "rank": 7,
    "cnt": 4,
    "tt": "Things Fall Apart",
    "au": "Chinua Achebe",
    "yr": "1996"
  },
  {
    "rank": 8,
    "cnt": 4,
    "tt": "Lolita",
    "au": "Vladimir Vladimirovich Nabokov",
    "yr": "1983"
  },
  {
    "rank": 9,
    "cnt": 4,
    "tt": "A Passage to India",
    "au": "E. M. Forster",
    "yr": "1984"
  },
  {
    "rank": 10,
    "cnt": 4,
    "tt": "1984",
    "au": "George Orwell",
    "yr": "1977"
  },
  {
    "rank": 11,
    "cnt": 4,
    "tt": "Beloved",
    "au": "Toni Morrison",
    "yr": "1987"
  },
  {
    "rank": 12,
    "cnt": 4,
    "tt": "Native Son",
    "au": "Richard T. Wright",
    "yr": "1940"
  },
  {
    "rank": 13,
    "cnt": 4,
    "tt": "Catch-22",
    "au": "Joseph Heller",
    "yr": ""
  },
  {
    "rank": 14,
    "cnt": 4,
    "tt": "Go Tell it on the Mountain",
    "au": "James Baldwin",
    "yr": "1954"
  },
  {
    "rank": 15,
    "cnt": 4,
    "tt": "On the Road",
    "au": "Jack Kerouac",
    "yr": "1991"
  },
  {
    "rank": 16,
    "cnt": 3,
    "tt": "ULYSSES",
    "au": "James Joyce",
    "yr": "1961"
  },
  {
    "rank": 17,
    "cnt": 3,
    "tt": "Don Quixote",
    "au": "Miguel de Cervantes",
    "yr": "1982"
  },
  {
    "rank": 18,
    "cnt": 3,
    "tt": "To the Lighthouse",
    "au": "Virginia Woolf",
    "yr": "1982"
  },
  {
    "rank": 19,
    "cnt": 3,
    "tt": "Madame Bovary",
    "au": "Gustave Flaubert",
    "yr": "1998"
  },
  {
    "rank": 20,
    "cnt": 3,
    "tt": "An American Tragedy",
    "au": "Theodore Dreiser",
    "yr": "2000"
  }
]
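Before writing any patterns, it helps to confirm the structure described above. The following is a minimal Python sketch, not part of Agenty: it parses two objects copied from the example (written on one line each for brevity) and lists the property names and value types.

```python
import json

# Two objects copied from the example page, kept as raw text.
sample = '''[
  {"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"},
  {"rank": 6, "cnt": 4, "tt": "The Sun Also Rises", "au": "", "yr": "1988"}
]'''

books = json.loads(sample)

# Property names, in the order they appear in the first object.
keys = list(books[0].keys())
print(keys)  # ['rank', 'cnt', 'tt', 'au', 'yr']

# Value types: rank and cnt are numbers, the other three are quoted strings.
types = {name: type(value).__name__ for name, value in books[0].items()}
print(types)
```

Note that rank and cnt are bare numbers, while tt, au and yr are quoted strings, and that some string values (such as au in the second object) are empty. Both details matter when choosing the expressions later on.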

Since this is not an HTML page, we can't use our Chrome Extension to generate CSS selectors automatically. Instead, we need to create the agent manually and then edit it in Agenty to add or update the fields and the URL. So, let's create a placeholder agent from the samples; you can also create one from any website.

Create Agent

  1. Go to the agents
  2. Click on New Agent
  3. Pick any of the example agents available in the Sample agent section. (We are going to edit the URL, selectors, and fields anyway, so any demo agent will do as a starting point.)

Regex

We will be using REGEX to extract the individual field values from the JSON objects, so use any online REGEX tester to build your expressions. In this example, I am going to use rubular.com to demonstrate each expression and its instant result. Enter the sample values in the “Your test string” box, and the tool displays the matching result and groups as soon as you type your expression.

To write the pattern for our first field, rank, we use the property name literally and an expression for the value, because the value changes from object to object. Looking at the content, the value is always a number, so we can use (\d+).

Final REGEX for the rank field in JSON will be: "rank": (\d+),

group rank in JSON

Similarly, the next field, cnt, is also a number, so we can just use "cnt": (\d+),

Scraping cnt field using regex from JSON
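If you prefer to check the two numeric patterns locally rather than in an online tester, here is a minimal sketch using Python's built-in re module. The sample string is copied from the first two objects of the example; the variable names are only for illustration.

```python
import re

# First two objects of the example, with the original line breaks and spacing.
sample = '''  {
    "rank": 1,
    "cnt": 5,
    "tt": "The Great Gatsby",
    "au": "F. Scott Fitzgerald",
    "yr": "1990"
  },
  {
    "rank": 2,
    "cnt": 5,
    "tt": "The Grapes of Wrath",
    "au": "John Steinbeck",
    "yr": "1972"
  }'''

rank_pattern = r'"rank": (\d+),'
cnt_pattern = r'"cnt": (\d+),'

# With a single capture group, findall returns the text of group 1 per match.
ranks = re.findall(rank_pattern, sample)
cnts = re.findall(cnt_pattern, sample)

print(ranks)  # ['1', '2']
print(cnts)   # ['5', '5']
```

The captured values come back as strings, even though they are numbers in the JSON, because a regular expression only sees text.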

The third field, tt, is not a number, so we can’t use (\d+) here, because (\d+) extracts digits only. Instead, we need ([^"]+). Here [^"] matches any character except a double quote, so the match stops at the closing ", and + means one or more times (use * instead to allow zero or more times, that is, to also match empty values). So, to extract the value of tt, the final expression is: "tt": "([^"]+)",

Match groups in regex
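To see why the group is built on [^"] rather than a plain "match anything", here is a small Python sketch. It assumes a minified, single-line version of one object (the example page itself is pretty-printed) and compares the expression from this tutorial with a greedy (.*) alternative.

```python
import re

# One object from the example, minified onto a single line (an assumed variant).
minified = '{"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"}'

# [^"]+ cannot cross a double quote, so the group stops at the end of the title.
bounded = re.findall(r'"tt": "([^"]+)",', minified)

# .* is greedy and runs on to the last '",' it can find on the line.
greedy = re.findall(r'"tt": "(.*)",', minified)

print(bounded)  # ['The Great Gatsby']
print(greedy)   # ['The Great Gatsby", "au": "F. Scott Fitzgerald']
```

Bounding the group by the closing quote keeps the expression working whether the JSON is pretty-printed or served on one line.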

Similarly, we can test the REGEX expressions for the 4th and 5th fields (au and yr) as well.
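The expressions for au and yr are not spelled out here, so the following Python sketch uses assumed patterns modelled on the tt one. Two details from the sample data shape them: some au and yr values are empty, so the sketch uses * instead of +, and yr is the last property of each object, so there is no trailing comma to match.

```python
import re

# Two objects from the example that contain an empty value.
sample = '''  {
    "rank": 6,
    "cnt": 4,
    "tt": "The Sun Also Rises",
    "au": "",
    "yr": "1988"
  },
  {
    "rank": 13,
    "cnt": 4,
    "tt": "Catch-22",
    "au": "Joseph Heller",
    "yr": ""
  }'''

# Assumed patterns: * (zero or more) lets the group match an empty string.
au_pattern = r'"au": "([^"]*)",'
yr_pattern = r'"yr": "([^"]*)"'   # last property in the object: no trailing comma

authors = re.findall(au_pattern, sample)
years = re.findall(yr_pattern, sample)
print(authors)  # ['', 'Joseph Heller']
print(years)    # ['1988', '']

# With + (one or more), the object with the empty author is skipped entirely.
strict_authors = re.findall(r'"au": "([^"]+)",', sample)
print(strict_authors)  # ['Joseph Heller']
```

Whether a skipped match or an empty match is preferable depends on how you want empty values to appear in the result, so treat the choice of * here as one reasonable option rather than the required one.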

Add Fields

Now we have the REGEX expression and the matching group number for every field we want to scrape from the JSON. Next, we need to edit the scraping agent and add each field's expression and group index, selecting REGEX as the field type.

  1. Edit the scraping agent by clicking on the Edit tab.
  2. Add a new field and give it a name, such as rank in the screenshot below.
  3. Select REGEX as the Type, paste your regular expression in the Regex Pattern box, and enter the group number in the Group Name or Index box.

select regex as a type

  4. Similarly, add the next field, cnt, and enter its expression in the Regex Pattern box and the group number in the Group Index box.

Scraping cnt field from JSON with regex expression

  5. tt field

using regex to extract field tt from JSON

  6. au field

Add field au to extract from JSON using regex

  7. yr field

Scrape yr field using regex pattern

  8. In the same way, we can add any number of fields and enter their REGEX expressions and matching group numbers.
  9. Save the scraping agent configuration. (Remember, saving the agent only updates the configuration; we need to re-run the agent by clicking on the Start button for the changes to show up in the result.)
  10. Change the SOURCE URL to the page with the JSON content, if not already set: https://cdn.agenty.com/examples/json/json-example-1.json
  11. Re-run your agent to refresh the result with the new agent configuration.

Execution

Once the job is completed, we can see the JSON scraping result in the Result tab. We can also add any number of URLs with a similar structure to scrape data from other JSON pages or APIs.
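To get a rough idea of what the result rows look like, the field setup can be approximated locally in Python. This is only a sketch under stated assumptions, not how Agenty works internally: FIELDS and extract_rows are hypothetical names, the au and yr patterns are assumed (the tutorial gives only rank, cnt and tt), and rows are built by lining up each field's matches by position.

```python
import re

# Hypothetical field configuration: field name -> (regex pattern, group index).
FIELDS = {
    "rank": (r'"rank": (\d+),', 1),
    "cnt": (r'"cnt": (\d+),', 1),
    "tt": (r'"tt": "([^"]+)",', 1),
    "au": (r'"au": "([^"]*)",', 1),   # assumed pattern
    "yr": (r'"yr": "([^"]*)"', 1),    # assumed pattern
}

def extract_rows(text, fields=FIELDS):
    """Collect every match per field, then pair the columns up by position."""
    columns = {
        name: [m.group(index) for m in re.finditer(pattern, text)]
        for name, (pattern, index) in fields.items()
    }
    names = list(columns)
    # zip stops at the shortest column, so a field that misses a match
    # shortens the output and shifts later values into the wrong rows.
    return [dict(zip(names, values)) for values in zip(*columns.values())]

sample = '''  {
    "rank": 10,
    "cnt": 4,
    "tt": "1984",
    "au": "George Orwell",
    "yr": "1977"
  },
  {
    "rank": 13,
    "cnt": 4,
    "tt": "Catch-22",
    "au": "Joseph Heller",
    "yr": ""
  }'''

rows = extract_rows(sample)
for row in rows:
    print(row)
```

Because the columns are paired by position in this sketch, every field needs exactly one match per object; that is the practical reason for choosing expressions that still match when a value is empty.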
