Automate Web Scraping — A Solution for Murphy's Law of Data Extraction

IBM i, Solaris, Windows, UNIX, Linux, AIX, Mac OSX
February 27, 2016
Data Scraping Tutorial banner

We currently live in the information age where everything is at our fingertips. Is Drew Brees the current NFL Passing Leader? Is houndstooth or herringbone more “in” this season? What will the weather be like this weekend? When asked these questions, most of us will hop on Google/Bing/Wikipedia and search for the answer. More and more information is being made available online—and the business world is no different. People are printing out less physical media.  Rather, the information is being made available in digital formats with various consumption methods such as laptops, tablets and smart phones. As more business information makes its way to the web, being able to interact and make sense of that data becomes essential.

Pulling data from a webpage is nothing new. Web scraping and web data extraction tools have been around forever.  The issue with these tools is the huge amount of processing and filtering required to retrieve usable information. And according to Murphy ’s Law, the moment you finish a filtering algorithm to process the data accurately, the web designer will change the site.

That is where our Extract Table activity comes into play. With the Extract Table action, Automate brings structured data extraction capabilities that easily integrate with other components and tools in your infrastructure. Extract Table keeps the information in table format so you can bypass all the processing and references by column/row.

We recently worked with a customer on a Web Interactivity Task that involved the issue described above. The client had to log into a website with username/password and retrieve a section of information.  They needed to pull all of the information from the table and then evaluate particular fields.  We could have grabbed each individual element, but that would have proved tedious and cumbersome. So we decided to use the Extract Table action. Once we have verified that the section of data was indeed an HTML Table Element, we used the Extract Table to grab everything into a dataset table. From there we could evaluate the contents and branch appropriately.

In this particular example, we will be using the stock tickers from Yahoo Finance, and are going to schedule this task to run at the close of market Eastern Time. Once we have retrieved the information, we are going to parse through the values. The values will be written to an Excel workbook: worksheet1 if the changes are positive; worksheet2 if they are negative.

Our process is going to look something like this.

Automate Scraping Process

So first, let’s open up the Yahoo Extract Table Task to look at how to extract the table. The first thing you need to do is create a variable that contains the tickers. 

Yahoo Extract Table Task

Yahoo Extract Table Task - Create Variable

Next, we need to open a new browser to Yahoo Finance. Select the open activity and specify in the Page URL box.

Open a new browser to Yahoo Finance

Next, select the appropriate box to set the values of the stock tickers. 

Set the values of the stock tickers

Yahoo Finance Screenshot

Once you select the box that you want, scroll down and specify the interaction. You want to place the content of the variable into the search box. 

Specify Interaction - Automated Web Scraping

Once that step is done, you want to click the “get quotes” button. Use the Click Action to select the button. 

Click Get Quotes - Web Scraping

Yahoo Finance Screenshot

Once you have finished that and run it once, your browser should be displaying something like this:

Quote Results

Drag over another web browser and then select the wait action. Once you’ve done that, drag over the final web browser action and select the Extract Table activity. 

Select the Extract Table activity

Select the browser by dragging over the magnifying glass. Then drag the hand over to the table. 

Drag over the magnifying glass

If you have done this successfully, there should be some table identifier information located in the details. 

Table identifier information

Scrolling down, be sure to enter the name of a dataset here. It stores the information from the table.

Enter the name of a dataset

Once you have done this, close any IE browsers you have open and run the task. At the bottom you’ll see your dataset. 

Close any IE browsers, Run the task

Right click on it and select Inspect. Then re-run your task one more time. You should be able to see all of this information in your dataset. 

Re-run your task

Now let’s write the information to Excel; it will create a new workbook. 

Write the information to Excel

Next we want to loop through our dataset to figure out which sheet we are going to write the information to.

Loop through our dataset

Lastly we want to build in the logic that places Positive Gains (+) into worksheet 1 and Negative Gains (-) into worksheet 2; and create two variables: one that stores the current row of worksheet 1, and the other that stores the current row of worksheet 2. Then apply the following if-else logic. 

Positive Gains (+) into worksheet 1 and Negative Gains (-)

Now that we have that all taken care of, lets run the task and verify the results. Figure 13 verifies that the positive values have been placed into worksheet1. Figure 14 verifies that the negative values have been placed into worksheet2.

Verify that the positive values have been placed into worksheet1


Now that you’ve discovered this powerful new tool, go out there and crunch some data!

Learn more about web scraping, screen scraping, and other data extraction here

Get Started

Try Automate free for 30 days!