
My web parsing journey

Mission - A quick way to pick out new job listings for particular search parameters in a minimalist format, and to learn something along the way.

I was interested in the area of web parsing and the idea of quickly gathering information from a webpage. My idea was to parse a job-search website with the parameters I was interested in and have it find me new job offers in an effective manner. A birdie told me that "the modern way to do it is called MechanicalSoup, Selenium is for oldies" - let's see.

MechanicalSoup with URL parameters approach

First, a look at the library MechanicalSoup. It is built on the requests and BeautifulSoup libraries and can follow links and submit forms. It sounds like an ideal lightweight alternative to Selenium when you only want basic interaction and flexible information scraping.
I defined the URL with parameters, parsed the listed jobs from the first page, separated the information, and output only the important bits (name, URL, location, remote possibility).
Everything looked fine: the results of the first page were listed and MechanicalSoup parsed the elements cleanly. Then something didn't add up - the results on the second page seemed far too unrelated to my search term. I assumed that the browser object might have difficulty holding the cookies on this page and simply 'forgets' everything when it clicks 'Next'.

I originally thought I might fix it by adding some parameters to .submit_selected() or by tinkering with cookie options, but after exhausting my DuckDuckGo searches and my willingness to study the direct and indirect documentation, I decided to switch to another approach to at least bring the use case to a successful end.

MechanicalSoup is not yet ready to replace Selenium in all aspects of web parsing (at least for some webpages). Or so I thought.

Code is here


Selenium with BeautifulSoup approach

Selenium is the person you call when you want the job done reliably. At least that is mostly the case. I quickly fired up the Chrome browser (and discovered that there is a new way of handling the driver - the driver file is no longer needed, it is handled automatically), set the parameters, navigated to the search page, and selected all the necessary locators to fire off the search. Everything went pretty smoothly; the script navigated through the results and gathered all the job data with BeautifulSoup -
job_details = [text for text in soup_job.stripped_strings]
a_tags = soup_job.find_all('a', href=True) 
and listed the jobs. I also added a way for the script to select only the newest jobs (those it hadn't found previously): it stores the search results per search term in a pandas DataFrame and each time lists only the difference from the new search (afterwards updating the main DataFrame):
job_list_df_new = pd.DataFrame()
for job in all_jobs:
    if job[1] not in job_list_df.values:  # job[1] is the job URL, used as the unique key
        # DataFrame.append was removed in pandas 2.0; build a one-row frame and concat
        row = pd.DataFrame([job])
        job_list_df_new = pd.concat([job_list_df_new, row], ignore_index=True)
job_list_df = pd.DataFrame(all_jobs)  # remember this search for the next run
if not job_list_df_new.empty:
    print("New jobs:", job_list_df_new)
else:
    print("There are no new jobs for this search term")
This did the job, at least for a week.
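For reference, the setup described above can be sketched as follows. With Selenium 4.6+ the bundled Selenium Manager downloads the matching driver on its own, so no driver file is needed. Here a static HTML snippet stands in for driver.page_source so the parsing part runs without a browser; the markup is a hypothetical placeholder.

```python
# In the real script the page comes from the browser:
#
#   from selenium import webdriver
#   driver = webdriver.Chrome()       # no executable_path / driver file needed
#   driver.get(search_url)
#   page_source = driver.page_source
#
# Below, a fixed snippet plays the role of page_source.
from bs4 import BeautifulSoup

page_source = """
<h2>Results</h2>
<a href="/jobs/1">QA Engineer</a>
<a href="/jobs/2">Test Automation Lead</a>
"""

# the same extraction as in the snippets above
soup_job = BeautifulSoup(page_source, "html.parser")
job_details = [text for text in soup_job.stripped_strings]
a_tags = soup_job.find_all("a", href=True)
print(job_details)
print([a["href"] for a in a_tags])
```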
The next Monday I woke up to find that the script no longer worked - Selenium was failing to submit the search. I tried tinkering with the waits and with the way the script selects the elements, but to no avail; they seem to have made the website (maybe on purpose) quite hard to automate with Selenium.
Code is here
I was quite frustrated after this.
Will my work be in vain?
Will I ever be able to search for job offers elegantly?

Requests with BeautifulSoup approach

A revelation came the next day. Since MechanicalSoup implements both Requests and BeautifulSoup, why not combine those directly and use their full potential? I looked at the POST request the search query made, experimented with its parameters, tried to figure out which ones are crucial for the search, and narrowed it down to the following set:

data_test_zurich = {'__seo_search':'search',
'random_search_id' was necessary, and I suppose it somehow groups together the pagination of the search results. Feeding those into a requests.Session() gave me the desired behavior - consistent and reliable job listings across multiple result pages.
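The general pattern looks like this. The endpoint and field names below are hypothetical placeholders (the real site's parameters differ); the request is only prepared, not sent, so the sketch needs no network - in the real script this is simply session.post(url, data=data).

```python
# Sketch: a requests.Session keeps cookies across result pages, and the
# POST form data carries the narrowed-down search parameters.
import requests

session = requests.Session()          # holds cookies between page requests

data = {
    "__seo_search": "search",
    "query": "tester",                # hypothetical field names - the real
    "page": "1",                      # site's parameter set differs
}

# Prepare the request to inspect what would be sent over the wire
req = requests.Request("POST", "https://example-jobs.test/search", data=data)
prepared = session.prepare_request(req)
print(prepared.method, prepared.url)
print(prepared.body)
```

Because the session carries the cookies (and the grouping id travels in the form data), every result page belongs to the same logical search - exactly what the MechanicalSoup attempt was losing.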
Code is here

I put it together with the useful functionality from the previous attempts (show only new jobs, remember the last search results) and added some perks like a nicer console view and a command-line interface with click. In the end I basically came full circle - starting to explore with MechanicalSoup, switching to old mate Selenium, and finally solving the problem with the very libraries MechanicalSoup is built on.
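A click interface for such a scraper can be sketched like this. The option names are illustrative, not the actual script's interface; click's CliRunner invokes the command in-process, standing in for a shell call.

```python
# Hedged sketch of a click CLI wrapping the search; options are made up.
import click
from click.testing import CliRunner

@click.command()
@click.option("--term", default="tester", help="Search term to query.")
@click.option("--pages", default=3, help="Number of result pages to fetch.")
def search_jobs(term, pages):
    """List new job offers for TERM (only those not seen before)."""
    click.echo(f"Searching '{term}' across {pages} page(s)...")
    # ...run the requests + BeautifulSoup search and print only new jobs...

# exercise the command in-process instead of from a shell
result = CliRunner().invoke(search_jobs, ["--term", "qa", "--pages", "2"])
print(result.output, end="")
```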

