
Scraping a Web Form with Multiple Pages Using Selenium – A Neat Trick

February 23, 2016 | February 10th, 2021

In a previous post I covered the basics of using a Python module called Selenium to automate the process of filling out web forms. This is useful when you need to extract large quantities of data through such an interface, or when there is not that much data to extract but getting it involves too many manual clicks.

 

A problem you might face while using Selenium is that some web forms span more than a single page, and you need to know that you are on the right page before you start issuing new Selenium commands. If you are not, at best the commands will raise an error; at worst you will fill in the wrong item (one with the same name or id) on a different page and not even know about it.

 

So how do you know whether you are on the right page, or which page you are on? There is a very simple trick that is well detailed at the following reference.

(If you have been watching this for more than a few seconds I feel obliged to tell you – nothing is going to load)

The basic concept of the method is simple. You know you are on some page i and you are about to issue a command that should take you to page i+1. Before issuing it, you grab an item that you know is present on page i; after issuing it, you enter a while loop and keep checking that item. Once the page has been replaced, touching the item fails with a “stale” error (StaleElementReferenceException in the Python bindings), and that signals that you are no longer on page i. You use this error to exit the while loop and you are once again on your way to scrape those precious bytes. Unless something very abnormal happened you should now be on page i+1, but if you are working with something that is abnormal by definition, or are just very risk averse, it might be necessary to add a second check to verify that you are indeed on page i+1.
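
The loop described above can be sketched roughly as follows. This is only an illustration under a few assumptions, not the code from the reference: the helper names `wait_until_stale` and `page_has`, the timeout, and the polling interval are all made up for this example (the timeout is an addition so the loop cannot spin forever), and the fallback exception class exists only so the sketch can be tried without Selenium installed.

```python
import time

try:
    from selenium.common.exceptions import StaleElementReferenceException
except ImportError:
    # Stand-in so the sketch can be exercised without Selenium installed.
    class StaleElementReferenceException(Exception):
        pass


def wait_until_stale(element, timeout=10.0, poll=0.1):
    """Poll `element` (a reference taken on page i) until it goes stale.

    Returns True once the old page is gone, or False if `timeout`
    seconds pass first. Both defaults are arbitrary choices.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            element.is_enabled()  # any call that touches the old element will do
        except StaleElementReferenceException:
            return True  # the reference is dead, so page i has been replaced
        time.sleep(poll)
    return False


def page_has(driver, by, value):
    """Optional second check: is a marker item of page i+1 present right now?"""
    return len(driver.find_elements(by, value)) > 0


# Hypothetical usage with a real driver (the ids below are invented):
#   from selenium.webdriver.common.by import By
#   old_page = driver.find_element(By.TAG_NAME, "html")   # reference on page i
#   driver.find_element(By.ID, "next-button").click()     # should lead to page i+1
#   if wait_until_stale(old_page) and page_has(driver, By.ID, "page-2-field"):
#       ...  # safe to start filling in page i+1
```

Selenium also ships a ready-made version of the first check: `WebDriverWait` combined with `expected_conditions.staleness_of`. The hand-written loop is shown here only because it mirrors the description above step by step.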

 

It is a very short and clean implementation that solves something that is easy for a user to understand but very complicated for Selenium (for reasons briefly discussed in the above reference).