
Following up on a recent post, this post is also about automating a process that could be performed manually but really should be programmed and automated. Numerous government agencies, in different countries, have made their data available online. While this sounds like a very good thing for researchers, it often comes with the following problem: in order to access each piece of data, and there are many pieces, one has to fill out a web form and submit it. When there are multiple years, multiple spatial units, and multiple sub-categories, the number of permutations becomes too large. Yes, one could possibly spend a week filling out all the different combinations of values across the dropdown boxes, radio buttons, and text fields and hitting submit after submit; but the chances of making a mistake are large, the opportunity cost of the time spent is non-negligible, and the thought of doing this all over again when you want to update your project with new data might be daunting enough to drop the project altogether. This is without even commenting on the reproducibility of the project.

 

Hopefully you are convinced that if a better method exists, it is worth picking up. Luckily, such a method does exist, and you can use it to produce something of value after just a few hours of playing around with it. Web scraping (hereafter: scraping) is a general term that refers to getting data from the web. Some scraping techniques focus on processing large bodies of text and parsing them into a structured database format. This post is about a different set of techniques that are meant to simulate and automate the manual actions a user performs. In this post I will cover a quick tutorial about a Python module called Selenium, which provides a simple interface through which to code those tedious actions.

 

This post is a simple guide to scraping with Selenium, and I am assuming you have rudimentary knowledge of Python. Honestly, if you have some background in any other programming or scripting language and are able to produce the canonical “Hello World!” message in Python, you can probably follow and execute the code in this post. Selenium is available for both Python 2.X and 3.X, and given that you have a working Python installation, all you need to do is type the following from the terminal (or command prompt):
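```
pip install selenium
```

(Depending on your setup, the command might be pip3 or python -m pip install selenium instead.)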

The second ingredient we will need is a WebDriver. Simply think of this as a browser with which Selenium can play nicely. I use the Chrome WebDriver, but a Firefox WebDriver exists as well. To install one, you just need to download it and extract the executable into the path of your Python installation. After you have installed either one we can proceed to a simple example that demonstrates how to set things up and perform a basic action. Once we cover the basics we will continue to a more complicated example involving scraping data from the US Geological Survey Water Quality Samples for the Nation.

 

Let’s import the necessary components and write a script that selects some objects on the “Code” page. You might want to open that page in a different tab now.
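A minimal version of that script might look as follows. The URL is a placeholder for this site’s “Code” page, and the find_element_by_* call reflects the Selenium releases available when this post was written (newer releases use find_element(By.ID, ...) instead). The listing is laid out so that the line numbers referenced in the next paragraph line up with it.

```python
from selenium import webdriver

# URL of the "Code" page on this website (placeholder address)
url = "http://www.example.com/code"

driver = webdriver.Chrome()

driver.get(url)

python_tab_id = "ui-id-2"
python_tab = driver.find_element_by_id(python_tab_id)

python_tab.click()
```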

Line 1 imports the webdriver from the Selenium module. On line 4 I define a string variable with the URL of the “Code” page on this website. Line 6 creates the driver object, and line 8 sends the driver to the defined URL. The next part is where things get mildly interesting. If you open the “Code” page you will see that there are three different tabs, and LaTeX is the default one. Line 10 defines a string variable with the HTML id of the Python tab, line 11 tells the driver object to find it based on that id, and line 13 sends a mouse click to that tab. This results in the focus shifting from the LaTeX tab to the Python tab, just as if you were using your mouse to click on the Python tab.

 

For future reference, in case the WebDriver opens but does not load the page you told it to and you have no idea why things are no longer working, try downloading the newest version of the WebDriver and upgrading your Selenium installation (the same command used to install it, with the --upgrade flag added). Hopefully that should be enough.
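In other words, something along these lines:

```
pip install --upgrade selenium
```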

 

We should always remember to close the driver object once we are done with it. Simply issue the following command at the console:
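```python
driver.close()  # driver.quit() also works, and shuts down the browser session entirely
```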

While this might seem underwhelming, it encompasses the basic concept behind using Selenium. A web page, or a web form, has a structure through which we can identify the different elements on it. Once we know how to refer to each element we can tell the driver to select it, input text into it, choose a value from a dropdown menu, etc. The big question is how to identify those elements. The good news is that if you know how to right click, you are basically certified.
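To make that concrete, here is roughly what those interactions look like once an element has been located. The ids below are hypothetical placeholders, and the driver object is the one from the example above.

```python
from selenium.webdriver.support.ui import Select

button = driver.find_element_by_id("some-button-id")
button.click()                                   # select / press it

text_box = driver.find_element_by_id("some-text-field-id")
text_box.send_keys("some text")                  # input text into it

dropdown = Select(driver.find_element_by_id("some-dropdown-id"))
dropdown.select_by_visible_text("Some option")   # choose a value from a dropdown menu
```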

 

Head over to the “Code” page and right click the “Python” tab. Choose “Inspect Element.” A window should open up with a piece of text highlighted. In that short piece of text you will see, right before the word “Python,” the string id="ui-id-2", and this is where the string variable value on line 10 comes from. Basically, by right clicking and inspecting the different elements you can map the entire structure of the page. Once you know the elements and values you want to choose from, you just need to loop over them.

 

There are different ways to identify the elements on a web page, and not all of them will always have an id defined for them. Sometimes we will need to refer to them by name, by CSS selector, or by something called an XPath. These concepts will become clearer through an example, so we might as well jump to the next, more complicated, case.
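For reference, these are the main ways of locating an element with the older find_element_by_* calls (the locator values here are hypothetical placeholders):

```python
driver.find_element_by_id("some-id")                      # by the id attribute
driver.find_element_by_name("some-name")                  # by the name attribute
driver.find_element_by_css_selector("input.some-class")   # by a CSS selector
driver.find_element_by_xpath("//input[@value='abc']")     # by an XPath expression
```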

At some point I needed to extract data about water quality samples for about 600 parameter codes. The story of how I matched the chemicals I was interested in to their parameter codes involves fuzzy string matching, which means matching strings that refer to the same thing but are spelled a bit differently. This is a problem because any programming language will return False for the expression “mac n’ cheese” == “mac and cheese” even though the two strings refer to the same thing. Python has a nifty module to handle these cases and provide you with a match even when the strings are not identical. It is called Fuzzy Wuzzy, and I would write an additional tutorial about it, but fellow SusDever Anthony D’Agostino already did. I highly recommend knowing how to use Fuzzy Wuzzy, as fixing fuzzy string matches manually will be detrimental to your health. This was a long digression; back to scraping.
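A minimal illustration, assuming the fuzzywuzzy package is installed (pip install fuzzywuzzy):

```python
from fuzzywuzzy import fuzz, process

print("mac n' cheese" == "mac and cheese")    # False -- an exact comparison fails

# fuzzywuzzy scores the similarity on a 0-100 scale instead
print(fuzz.ratio("mac n' cheese", "mac and cheese"))

# and can pick the best match out of a list of candidates
print(process.extractOne("mac n' cheese", ["mac and cheese", "green eggs and ham"]))
```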

 

I had 600 parameter codes and a website that claims it can handle up to 200 codes submitted at once but then proceeds to crash. Clearly, I had to do this one by one, or better yet, let Selenium do it one by one for me. Here is the full working script that extracts the data by parameter code. It is a bit long, but I will cover it part by part. The length is mainly due to the fact that it also captures errors that could happen along the way and handles the files produced after each iteration (naming them and moving them to the destination folder).
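The sketch below lays out that script, with placeholders for anything site-specific: the URL pieces, the id of the parameter-code text field, the name and option of the dropdown menu, and the download and destination paths all need to be replaced with the values you find by inspecting the form and with your own folders. It uses the older find_element_by_* calls, as in the earlier example, and it is laid out so that the line numbers referenced in the following paragraphs line up with it.

```python
import os
import time
import shutil

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

# placeholder paths -- point these at your browser's download folder and your project folder
download_path = "/path/to/Downloads/qwdata"   # qwdata is the default NWIS file name
dest_folder = "/path/to/project/data"         # where the renamed files should end up

codes = ["00010", "00095", "00300"]  # a short list instead of the full 600 codes

url_start = "https://nwis.waterdata.usgs.gov/..."  # first part of the form URL (placeholder)
url_end = "...qw"                                  # second part of the form URL (placeholder)
driver = webdriver.Chrome()

def extract_code(code, url_start, url_end):
    # restore a leading zero that may have been truncated
    code = str(code)
    if len(code) < 5:
        code = code.zfill(5)

    driver.get(url_start + url_end)

    # the parameter-code text field (placeholder id -- inspect the element to find the real one)
    box = driver.find_element_by_id("parameter_cds")
    box.send_keys(code)

    # the "Tab-separated data" radio button, located by its XPath
    driver.find_element_by_xpath("//input[@value='rdb']").click()

    # choose an option from a dropdown menu, also by XPath (placeholder name and option)
    menu = Select(driver.find_element_by_xpath("//select[@name='date_format']"))
    menu.select_by_visible_text("YYYY-MM-DD")

    box.send_keys(Keys.RETURN)  # hit Enter to submit the query

for code in codes:
    extract_code(code, url_start, url_end)

    waiting = True  # True while the current code is still being processed
    timeout = 0     # counts how long we have been waiting on the server

    while waiting:
        if "No sites" in driver.page_source:
            waiting = False  # no data for this code, move on to the next one
        elif os.path.isfile(download_path):
            # a file has started downloading
            downloading = True

            # check at short intervals whether the file is still being updated;
            # once its size stops changing, assume the download has finished
            # (not fool-proof, but it is the best check I am aware of)
            while downloading:
                size_before = os.path.getsize(download_path)
                time.sleep(5)  # increase the interval for larger files
                if os.path.getsize(download_path) == size_before:
                    downloading = False
                    waiting = False

            shutil.move(download_path, os.path.join(dest_folder, code + ".txt"))  # rename and move in one fell swoop
        else:
            # neither "No sites" nor a downloaded file has been detected yet
            time.sleep(60)  # give the server another minute
            timeout += 1
            if timeout == 10:
                extract_code(code, url_start, url_end)  # resubmit the query
                timeout = 0

driver.close()
print("Done scraping all parameter codes.")
```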

The first lines just import all the necessary components. Notice we are importing a second item from the Selenium module called “Keys.” This is used to send keyboard strokes, that is, to type text into a text field. On line 13 I create a short list of codes instead of the full 600 of them, because we just need something to loop over (in the real script I simply open a .csv file which has all the codes and dump them into a list). Lines 15-17 create the URL strings and the driver object.

 

On line 19 we begin to define the function we will use to extract each code. The function receives three arguments: the code to extract and the two strings that define the full URL. Lines 20-23 handle cases where the code should have a leading zero but it got truncated. Line 25 sends the driver to the web form. Once we are on the web form we need to enter the parameter code in the text field, so first we need to identify that text field: right click it and choose Inspect Element to see the value of its id.

 

Once we have identified the text field, we use the driver to send_keys to it, and the keys are going to be the value provided in the function’s argument code. We also need to select an option from a radio button and from a dropdown menu, and we are going to use XPath for that. Ideally, either ids or names should be used to identify objects on a web form, but sometimes whoever designed the web form did not bother with assigning ids or names. When that happens we have to rely on the actual location of the object. While this allows for a high degree of flexibility, it is also very brittle: if the website gets updated and things move around, the code will stop working.

 

Finding the XPath is just like finding the id or the name: you inspect the element, only now you need to use the location of the item in the web form. While there are automatic tools that can do this for you, they sometimes generate the wrong XPaths, and it is probably worth going through what an XPath is exactly at least once. Let’s try and find the XPath for the “Tab-separated data” radio button, which is at the bottom of the page. Right click the radio button and inspect it. The “//” defines the relative path, “input” brings you to a crossroad with multiple options, and “[@value='rdb']” chooses the right path to walk down. To see how the path would have been different for a different element, inspect the radio button just above that one and notice that while the name is the same, the value is not.
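Putting those pieces together:

```python
# "//"             -> search anywhere in the document (a relative path)
# "input"          -> consider all the <input> elements
# "[@value='rdb']" -> keep the one whose value attribute equals 'rdb'
tab_separated = driver.find_element_by_xpath("//input[@value='rdb']")
tab_separated.click()
```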

 

Now that the function is defined, we start looping over all the different values we need to extract (line 40) and call the function separately for each such code (line 41). Once we call the function, we hope that the parameter code query has resulted in a data file that is being downloaded. However, many other things could happen, and the following lines of code try to capture those cases and ensure smooth processing of all the codes.

 

On line 43 we define a Boolean, waiting: while it is True, we are still processing the current code and cannot move on to the next one. Next, on line 44 we define a variable, timeout, which will act as a counter and is initialized at zero, assuming the server has not returned a timeout error. We then enter a while loop that keeps processing the same code until we determine that either: (1) there is no data for that code, (2) a file was downloaded, or (3) the server returned a timeout error.

 

Line 47 checks whether the text “No sites” appears in the source code of the page. This is basically the error message you get on the page when a specific parameter code results in no data fetched by the server. If that is the case, we set waiting to False, the while loop exits, and the for loop advances to the next value. If we do not receive that error, we continue and check whether a file is currently being downloaded. Notice you will need to edit the path where the file should be located (usually the default folder to which the browser downloads files). For the NWIS system the default file name is qwdata, so line 49 is simply checking whether there is a file by that name at the specified path (we will later rename and move that file, so there is no need to worry about what happens when we loop over to the next parameter code).

 

Assuming the file has started downloading, we are now interested in learning whether it has finished downloading. We change the downloading flag to True and check at short intervals (if the files you are downloading are large you can increase the interval size) whether it is still being updated. Once the code detects that the file has not been updated (this is not a fool-proof way of checking, but it is the best one I am aware of), we update the Boolean flags by changing both downloading and waiting to False.
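One way to factor that check out is a small helper that blocks until a file stops growing; the interval and path here are placeholders, and this is a sketch of the idea rather than the exact code used above.

```python
import os
import time

def wait_for_download(path, interval=5):
    """Wait until the file at `path` exists and stops growing, then return its size."""
    while not os.path.isfile(path):       # wait for the download to start
        time.sleep(interval)
    size = -1
    while size != os.path.getsize(path):  # size still changing -> keep waiting
        size = os.path.getsize(path)
        time.sleep(interval)
    return size
```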

 

The following lines simply perform the cleanup. We rename and move the file in one fell swoop. Since waiting has been set to False, we exit the while loop and are back at the for loop, only now we are processing the next item in the parameter codes list.

 

Line 64 is meant to capture the cases where we did not get a “No sites” message and no file has been detected. It could be that the server is still processing the query and this might take a while, but how long should we allow the server to keep us waiting before we declare a timeout and retry the query? I guess there is no right or wrong answer, and I set the cap at 10 minutes. Each time we check for “No sites” or for the presence of a file and find that neither has happened, we increase the timeout counter by 1; when it reaches 10, we submit the query again, reset the timeout counter, and go through the steps within the while loop again. This could potentially get us stuck in this loop if there is something problematic happening with a specific code. One way to avoid this is to add a counter for the number of failed attempts and, once that crosses some threshold value, to just skip this specific code.

When the for loop ends, we simply close the webdriver and print a message to the console. You could instead print some sort of progress indicator, for example how many items are left to loop over in the list, if you want to make sure you have not been stuck on the same code for a few days.

 

While this code is extremely specific to just one application, I hope it provides a good example of how to think about and approach scraping a website. Once you understand how this example works, I think you can quickly apply it to other settings and your desired data, web form interface be damned.

 

I will be writing a follow-up post to this one where I discuss how to scrape content off a webpage and structure it in a database. That too will involve Python, but it will not involve water quality data.
