Scraping Data from a CSV File

In this tutorial, we will learn about “CSV Content Scraping”. CSV stands for comma-separated values: tabular data saved as plain text, with the fields separated by commas. Each line of the file is a data record, and each record consists of one or more fields.
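
To make the record and field structure concrete, here is a minimal Python sketch that splits a few lines of the example data into records and fields. It uses Python's standard csv module purely as an illustration. It is not part of the Agenty workflow, and the variable names are my own.

```python
import csv
import io

# A header line plus two records taken from the example file in this tutorial.
sample = (
    "Rank,Retailer Name,# Stores,Revenue\n"
    "1,Wal-Mart,4358,307736000.000000\n"
    "2,Kroger,3609,78326000.000000\n"
)

# Each line becomes one record; each record is a list of comma-separated fields.
records = list(csv.reader(io.StringIO(sample)))
header, rows = records[0], records[1:]

for row in rows:
    # Pair every field with its column name from the header line.
    print(dict(zip(header, row)))
```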

For example, we will use the URL “https://cdn.agenty.com/examples/csv/top-us-retailers-2011.txt” to build a “CSV Content Scraping” agent.

Note: The Agenty Chrome extension can’t be used to set up REGEX fields, so we need to create a dummy agent or use one from the samples, and then edit that agent in the agent editor to add the REGEX fields.

In this example, I am going to use this page: https://cdn.agenty.com/examples/csv/top-us-retailers-2011.txt

Step 1: Create a new web scraping agent using the Chrome extension, or use an example agent from the samples.

Step 2: Edit the agent in the agent editor and go to the Collection > Fields section.

Step 3: Go to the example page (or the page you want to extract) and open the CSV source using the “View source” option in your browser.

CSV Source

Rank,Retailer Name,# Stores,Revenue
1,Wal-Mart,4358,307736000.000000
2,Kroger,3609,78326000.000000
3,Target,1750,65815000.000000
4,Walgreen,7456,61240000.000000
5,The Home Depot,1966,60194000.000000
6,Costco,412,58983000.000000
7,CVS Caremark,7217,57464000.000000
8,Lowe's,1723,48175000.000000
9,Best Buy,1312,37110000.000000
10,Sears Holdings,3484,35362000.000000
11,Safeway,1475,33262000.000000
12,SUPERVALU,2436,30975000.000000
13,Rite Aid,4750,25196000.000000
14,Publix,1173,25072000.000000
15,Macy's,852,24864000.000000
16,Ahold USA/Royal Ahold,751,23518000.000000
17,McDonald's,14027,23130000.000000
18,Delhaize America,1627,18799000.000000
19,Amazon.com,0,18526000.000000
20,Kohl's,1083,18391000.000000
21,Apple Stores/iTunes,233,18064000.000000
22,J.C.Penney,1099,17659000.000000
23,YUM! Brands,17619,17306000.000000
24,TJX,2206,16751000.000000
25,Meijer,198,15319000.000000
26,True Value,4701,16738000.000000
27,H-E-B,299,14947000.000000
28,Dollar General,9372,13035000.000000
29,ShopRite,273,11800000.000000
30,Gap,2502,11718000.000000
31,BJ'S WholesaleClub,189,10876000.000000
32,Subway,24200,10373000.000000
33,Wendy's/Arby's Restaurants,9406,10026000.000000
34,Nordstrom,204,9624000.000000
35,Staples,1575,9204000.000000
36,Ace Hardware,4047,9101000.000000
37,Toys 'R' Us,1486,9066000.000000
38,Whole Foods Markets,288,8736000.000000
39,Bed Bath & Beyond,1114,8700000.000000
40,7-Eleven,6586,8513000.000000
41,Burger King Holdings,7258,8437000.000000
42,Aldi,1135,8362000.000000
43,Army Air Force Exchange,183,8309000.000000
44,Limited Brands,2645,8247000.000000
45,A&P,382,8123000.000000
46,Menard,260,8032000.000000
47,Verizon Wireless,2330,8021000.000000
48,Family Dollar,6785,7867000.000000
49,Ross Stores,1054,7860000.000000
50,Darden Restaurants,1824,7603000.000000
51,Starbucks,11131,7560000.000000
52,Office Depot,1125,7557000.000000
53,Winn-Dixie,485,7207000.000000
54,Hy-Vee,236,6838000.000000
55,Trader Joe's,359,6817000.000000
56,GameStop,4488,6610000.000000
57,Giant Eagle,384,6398000.000000
58,AutoZone,4364,6098000.000000
59,Dillard's,308,6020000.000000
60,DineEquity,3295,5884000.000000
61,Advance Auto Parts,3537,5876000.000000
62,Dollar Tree,4055,5801000.000000
63,Barnes & Noble,1343,5715000.000000
64,OfficeMax,904,5655000.000000
65,Wegman's Food Markets,76,5599000.000000
66,O'Reilly Automotive,3570,5398000.000000
67,QVC,0,5236000.000000
68,Defense Commissary Agy.,184,5046000.000000
69,AT&TWireless,2315,4990000.000000
70,Save Mart,238,4968000.000000
71,Dell,0,4946000.000000
72,Big Lots,1398,4903000.000000
73,PetSmart,1118,4839000.000000
74,RadioShack,5602,4615000.000000
75,Alimentation Couche-Tard,3862,4528000.000000
76,Dick's Sporting Goods,525,4414000.000000
77,Albertsons,221,4316000.000000
78,WinCo Foods,78,4300000.000000
79,Sherwin-Williams,3279,4226000.000000
80,Ruddick Corp.,199,4099000.000000
81,Neiman Marcus,77,3723000.000000
82,Michaels Stores,1103,3673000.000000
83,Burlington Coat Factory,458,3660000.000000
84,Tractor Supply Co.,1001,3639000.000000
85,Stater Bros. Holdings,167,3607000.000000
86,Foot Locker,2592,3577000.000000
87,Belk,305,3513000.000000
88,Price Chopper Supermkts.,128,3500000.000000
89,IKEA North America,37,3459000.000000
90,Williams-Sonoma,577,3447000.000000
91,Sports Authority,464,3409000.000000
92,SonyStyle,55,3401000.000000
93,Raley's,143,3364000.000000
94,OSI Restaurant Partners,1249,3314000.000000
95,Ingles Markets,202,3274000.000000
96,Brinker International,1337,3090000.000000
97,HSN,20,2998000.000000
98,Bon-Ton Stores,275,2980000.000000
99,Abercrombie & Fitch,1017,2846000.000000
100,ShopKo Stores,341,2832000.000000

Now use any REGEX editor tool to write and test your REGEX pattern. I am using rubular.com in this example, and I created this permanent link if you want to try it out: http://rubular.com/r/nIrllPL4Oe
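
If you prefer to test a pattern locally, the following Python sketch shows one possible row-level expression with four capture groups. The pattern is my own assumption based on the CSV layout above, and it is not necessarily the expression saved behind the rubular.com link. It also assumes that retailer names never contain a comma and that lines end with a plain newline.

```python
import re

# Hypothetical pattern: one capture group per column (rank, name, stores, revenue).
# re.MULTILINE lets ^ and $ match at the start and end of every line.
ROW_PATTERN = re.compile(r"^(\d+),([^,]+),(\d+),([\d.]+)$", re.MULTILINE)

# A few lines copied from the CSV source, including the header row.
sample = (
    "Rank,Retailer Name,# Stores,Revenue\n"
    "1,Wal-Mart,4358,307736000.000000\n"
    "2,Kroger,3609,78326000.000000\n"
    "39,Bed Bath & Beyond,1114,8700000.000000\n"
)

# findall returns one tuple of groups per matching row.
# The header is skipped because "Rank" is not a number.
matches = ROW_PATTERN.findall(sample)

for rank, name, stores, revenue in matches:
    print(rank, name, stores, revenue)
```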

Once the REGEX expression is created, go to the Agenty agent editor, edit the field, and paste the expression into the “REGEX pattern” box. Because the REGEX expression I created matches the entire row (all 4 fields), I can use the same expression in all 4 fields and change the “Group index” to 1, 2, 3, or 4 for the respective field.
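
The idea of reusing one expression with a different group index per field can be sketched in Python as follows. The pattern, the field names, and the extract_field helper are hypothetical and only mimic the behavior described above. They are not Agenty's actual configuration format or API.

```python
import re

# Hypothetical row pattern with four capture groups (same assumption as a
# plain "rank,name,stores,revenue" layout with no commas inside names).
ROW_PATTERN = re.compile(r"^(\d+),([^,]+),(\d+),([\d.]+)$", re.MULTILINE)

# Each field reuses the same pattern and differs only in its group index.
FIELD_GROUPS = {"rank": 1, "retailer_name": 2, "stores": 3, "revenue": 4}


def extract_field(text, group_index):
    """Return the chosen capture group from every row that matches."""
    return [m.group(group_index) for m in ROW_PATTERN.finditer(text)]


sample = (
    "Rank,Retailer Name,# Stores,Revenue\n"
    "1,Wal-Mart,4358,307736000.000000\n"
    "2,Kroger,3609,78326000.000000\n"
)

# Build one list of values per field, keyed by field name.
columns = {name: extract_field(sample, idx) for name, idx in FIELD_GROUPS.items()}
print(columns)
```

Keeping a single row-level pattern means only one expression has to be maintained if the CSV layout changes.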

Now click the Save button to save the agent configuration, and then go back to the main agent page.

Click the Start button to start the scraping agent and wait for the job to complete. Once the job is completed, you can see the extracted result, as in the screenshot below:
