Saving Yourself a 1000+ Clicks – OR – How to Automate Browser Interactions Using Selenium & Python

In the course of my Ph.D. dissertation research, I used a physics model called TIE-GCM to study the terrestrial atmosphere. One way to run this model is via an online interface hosted on the CCMC website, which lets you run the model with either fixed or customized settings. Customized runs take a while, since CCMC has to rerun the model with the new parameters. I was in a time crunch to get a particular set of model runs for my dissertation. Specifically, I wanted the model results along the observation path of an instrument called SABER (on board the TIMED satellite). Getting them meant repeating the same mouse clicks and manual selections on the webpage over and over (in my case, a couple of hundred times!). Hence I wrote this tool to automate the process.

This script was written in Python, using Selenium and the Chrome WebDriver. The web interface to download the data (for the model run I had requested earlier) can be found here.

The script is designed to do several things:

  • Set the latitude, longitude and altitude (the trajectory of the satellite)
  • Set time / date ranges
  • Select the output format
  • Submit request
  • Use a GET request to get the ASCII data file generated by the tool

Notes:

  • try ... except blocks are used to handle timeouts (which may occur frequently, depending on the load on the server).
  • Several variables are hard-coded into the script, and these will have to be changed to match your requirements.

Explanation:


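 from selenium import webdriver

 path = "......\\TIEGCM Scraping\\Version2Data\\2003OctData\\Index1"
 # Define the URL below
 URL = "http://ccmc.gsfc.nasa.gov/cgi-bin/run_timeseries.cgi?dir=11379"
 timestep_value = "model data"

 # Start Chrome and open the interface page (the later snippets use `browser`)
 browser = webdriver.Chrome()
 browser.get(URL)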
 
  • The path variable is where you want the data to be saved on your local system.
  • The URL variable is specific to the CCMC run you requested. The one used above is for a run I requested earlier.
  • The timestep_value sets the time cadence of the requested output. I chose “model data”.

import numpy as np
import pandas as pd
import scipy.io as spio

# Set the latitude range and resolution
lat = np.arange(-50, 85, 5)
table = pd.DataFrame(data=lat, columns=['Lat'], index=None)

# Get the longitudes from a .mat file.
mattable = spio.loadmat("Longitude_Start_2003Storm_V1", matlab_compatible=True)
longstartv1 = mattable['Longitude_Start_Index1_V2']  # ascending pass of the satellite
longstartv0 = mattable['Longitude_Start_Index0_V2']  # descending pass of the satellite

  • A dataframe called table is created to hold all the latitudes that we wish to get data for.
  • mattable holds the contents of a MATLAB file with the longitude values for the ascending and descending passes of the satellite.

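One detail worth checking is where the latitude grid ends: np.arange excludes its stop value, so the grid stops at 80 rather than 85. The sketch below reproduces the grid with plain Python's range, which follows the same half-open rule. The variable name latitudes is mine, used only for illustration.

```python
# Same half-open interval as np.arange(-50, 85, 5): the stop value 85 is excluded.
latitudes = list(range(-50, 85, 5))

print(len(latitudes))               # number of rows that end up in `table`
print(latitudes[0], latitudes[-1])  # first and last latitude in the grid
```

If you need 85 included, the stop value has to be pushed past it (for example 90).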
startTime = 1007

everythingWorks = True

for day in range(0,2):
    for lat in range(0, len(table)):
        for row in longstartv1[lat,day]:
            for altitude in range(100, 125, 5):
                #Setting timestep option on target URL
                timestep = Select(browser.find_element_by_name("time_cadence"))
                option = timestep.select_by_value((timestep_value))

                #Setting minimum time
                time1 = Select(browser.find_element_by_name("Time1"))
                option = time1.select_by_value(str(startTime))

                #Setting maximum time
                time2 = Select(browser.find_element_by_name("Time2"))
                option = time2.select_by_value(str(startTime+68))

                # Setting minimum latitude
                minLat = browser.find_element_by_name("X2MIN")
                minLat.clear()
                minLat.send_keys(str(table.iloc[lat]['Lat']))

                # Setting maximum latitude
                maxLat = browser.find_element_by_name("X2MAX")
                maxLat.clear()
                maxLat.send_keys(str(table.iloc[lat]['Lat']))

                # Setting minimum longitude
                minLon = browser.find_element_by_name("X1MIN")
                minLon.clear()
                longitudeMin = row[0]
                minLon.send_keys(str(longitudeMin))

                # Setting maximum longitude
                maxLon = browser.find_element_by_name("X1MAX")
                maxLon.clear()
                longitudeMax = -(360-longitudeMin-24)
                maxLon.send_keys(str(longitudeMax))

                # Setting minimum altitude
                minAlt = browser.find_element_by_name("X3AMIN")
                minAlt.clear()
                minAlt.send_keys(altitude)

                # Setting maximum altitude
                maxAlt = browser.find_element_by_name("X3AMAX")
                maxAlt.clear()
                maxAlt.send_keys(altitude)

                # Selecting output options
                browser.find_element_by_name("Output_Pointdata").click()

                # Clicking "Update Plot" button
                browser.find_element_by_xpath('.//input[@type="submit" and @value="Update Plot"]').click()

The “elements” on the webpage were identified by looking at the CCMC interface’s source code. Their names were then passed to Selenium in order to change the values in the various user input fields.
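Since every text field is handled with the same three steps (find, clear, type), the repetition could be folded into a small helper. This is only a sketch: set_text_field is a hypothetical name, not part of the original script or of Selenium, and it takes the lookup function as an argument so that it does not depend on a particular Selenium version.

```python
def set_text_field(find_by_name, name, value):
    """Clear the text input called `name` and type `value` into it.

    `find_by_name` is any callable that maps an element name to an element,
    for example browser.find_element_by_name in the Selenium release
    used in this post.
    """
    field = find_by_name(name)
    field.clear()
    field.send_keys(str(value))
    return field
```

With it, setting the minimum latitude becomes set_text_field(browser.find_element_by_name, "X2MIN", table.iloc[lat]['Lat']). Selenium 4.3 and later removed the find_element_by_* methods, so there you would pass lambda n: browser.find_element(By.NAME, n) instead.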

  • The outermost for loop iterates through the days.
  • The next for loop iterates through the various latitudes listed in the table we created earlier.
  • The third for loop iterates through the longitudes (ascending pass) for a given latitude and day.
  • The innermost for loop iterates through the altitudes, from 100 to 120 in steps of 5.
  • Once inside the innermost loop, the (Selenium) webdriver locates the element (a drop-down) that has the name “time_cadence” and selects the value “model data”.
  • Next, “Date/Time1” is set based on the value of “startTime”. This points to Date: 2003/10/27 Time: 00:00:00.
  • “Date/Time2” is set by adding 68 to 1007, which is Date: 2003/10/27 Time: 22:40:00.
  • “X2MIN” is the element (input text) that accepts the minimum latitude
  • “X2MAX” is the element (input text) that accepts the maximum latitude
  • “X1MIN” refers to the input for the minimum longitude (for the Plot Area)
  • “X1MAX” refers to the input for the maximum longitude (for the Plot Area)
  • Finally, “Height1” and “Height2” (elements “X3AMIN” and “X3AMAX”) are set to the same altitude.
  • Selenium then locates the “Update Plot” button and clicks on it.

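The Time1 and Time2 values are plain integers, so it helps to be able to translate them back into timestamps. The sketch below does that under two assumptions inferred from this post rather than from any CCMC documentation: index 1007 is 2003/10/27 00:00:00, and one step is 20 minutes (72 steps cover 24 hours). The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumptions inferred from the post: index 1007 is 2003/10/27 00:00:00,
# and each index step is 20 minutes (72 steps per 24 hours).
ANCHOR_INDEX = 1007
ANCHOR_TIME = datetime(2003, 10, 27, 0, 0, 0)
STEP = timedelta(minutes=20)


def index_to_datetime(index):
    """Convert a Time1/Time2 option value to the timestamp it stands for."""
    return ANCHOR_TIME + (index - ANCHOR_INDEX) * STEP


print(index_to_datetime(1007))       # value used for Time1
print(index_to_datetime(1007 + 68))  # value used for Time2
```

For a different run, the anchor index and step would have to be read off the drop-down options of that run's page.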
while everythingWorks == True:
      try:
          wait = WebDriverWait(browser,100)
          wait.until(EC.presence_of_element_located((By.LINK_TEXT, "EPS image")))
      except TimeoutException:
          browser.quit()

As the script iterated through the latitudes and longitudes, I kept getting timeout errors. To fix this, I added the try ... except so that the iteration that just failed is retried. If a timeout exception does occur, the webdriver instance is restarted and further attempts are made until the next page loads (which is detected by the presence of an “EPS image” link).
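The retry idea can also be written as a small, general-purpose function, which makes it easier to cap the number of attempts. This is a sketch, not the code from the script: run_with_retry, action and recover are hypothetical names, and the exception type to retry on is passed in by the caller.

```python
def run_with_retry(action, recover, retry_on=(Exception,), max_attempts=5):
    """Call `action` until it succeeds, calling `recover` after each failure.

    Only exceptions listed in `retry_on` trigger a retry. The last error is
    re-raised once `max_attempts` attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts:
                raise
            recover()
```

In this script, action would fill in the form and wait for the “EPS image” link, recover would quit and restart the browser, and retry_on would be (TimeoutException,). Capping the attempts avoids looping forever when the server is down.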

The next step involves the downloading of the data.

# Locating the text file / output file link
ahref = browser.find_element_by_partial_link_text('ASCII')
link = ahref.get_attribute("href")

# Identifying the text file name
fileName = os.path.basename(link)

# Transferring cookies from current session in chromedriver to requests library
all_cookies = browser.get_cookies()
cookies = {}
for cookie in all_cookies:
    cookies[cookie["name"]] = cookie["value"]

# Requesting the link associated with the text file.
r = requests.get(link,cookies=cookies)

# Checking for a successful connection to the server.
if r.status_code == 200:
    print("Downloading data for day %d, Latitude %.2f, longitudeMin %.2f, longitudeMax %.2f, altitude %d" %(day, table.iloc[lat]['Lat'], longitudeMin, longitudeMax, altitude) )    # Notifying the user about downloading 
    data = r.text # Extracting the text from the file online
    file_name = os.path.join(path,fileName)  # Creating path for the target file with the same name as original file obtained from server
    with open(file_name, 'w') as w:
         w.write(data)

The new page that loads after the request is submitted contains an ASCII link to the data we need.

  • We assign the link to a variable called “link”.
  • Following this, we take the file name from the link, so that the local file gets the same name.
  • The cookies are then copied into a dictionary so that the requests library can download the data using the same session.
  • The script checks for a successful response from the server (status code 200), and then writes the data to a file.

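The two bookkeeping steps here (flattening the cookies and building the local file path) can be tried without a browser. The sketch below uses made-up cookie and link values; cookies_to_dict and target_path are hypothetical helper names, not functions from the script.

```python
import os


def cookies_to_dict(all_cookies):
    """Flatten Selenium-style cookie records into a {name: value} dict."""
    return {cookie["name"]: cookie["value"] for cookie in all_cookies}


def target_path(link, folder):
    """Build the local path for a download, keeping the server's file name."""
    return os.path.join(folder, os.path.basename(link))


# Made-up values for illustration only.
example_cookies = [{"name": "session", "value": "abc123", "path": "/"}]
print(cookies_to_dict(example_cookies))
print(target_path("http://example.com/output/run_01.txt", "data"))
```

Note that os.path.basename simply takes everything after the last slash, so a link with a query string would carry that query string into the file name.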
                # Reopening main URL
                try:
                    browser.get(URL)
                # Handling time out by restarting the browser
                except TimeoutException:
                    browser.quit()
                    browser = webdriver.Chrome()
                    browser.get(URL)

    startTime += 72 # This is based on the resolution you want to use. 72 here refers to a resolution of 24 hours

This bit redirects the webdriver back to the original interface URL and, if it has trouble accessing the URL, restarts the chromedriver. The next iteration will have the startTime incremented by 72 (equivalent to 24 hours).
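To see which time windows the script ends up requesting, the start and end indices for each day can be listed ahead of time. This is a small sketch using the numbers from this post (start index 1007, an increment of 72 per day, and an offset of 68 between Time1 and Time2); the names are mine.

```python
START_INDEX = 1007   # first time index used in the post
STEPS_PER_DAY = 72   # increment applied to startTime after each day
WINDOW = 68          # offset between Time1 and Time2


def daily_windows(n_days):
    """Return the (Time1, Time2) index pair requested for each day."""
    return [
        (START_INDEX + day * STEPS_PER_DAY,
         START_INDEX + day * STEPS_PER_DAY + WINDOW)
        for day in range(n_days)
    ]


print(daily_windows(2))
```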

You can find the entire code here.
