
My web parsing journey

Mission - A quick way to list new job postings for particular search parameters in a minimalist format, and to learn something along the way

I was interested in the area of web parsing and the idea of quickly gathering information from a webpage. My idea was to parse a job search website with the parameters I was interested in and have it find new job offers for me in an effective manner. A birdie told me that "The modern way to do it is called Mechanicalsoup, Selenium is for oldies" - let's see.

Mechanicalsoup with URL parameters approach

First, a look at the Mechanicalsoup library. It is built on the requests and beautifulsoup libraries, and it can follow links and submit forms. It sounds like an ideal lightweight alternative to Selenium when you only want basic interaction and flexible information scraping.
I defined the URL with parameters, parsed the listed jobs from the first page, separated the information and output only the important bits (name, URL, location, remote possibility).
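A minimal sketch of that step, not the code linked from this post: the base URL, the parameter names and the HTML markup below are all made up for illustration, since the real site is not named here. It builds a search URL from parameters and pulls the four fields out of each job card with Beautifulsoup.

```python
from urllib.parse import urlencode, urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and parameter names (assumptions, not the real site).
BASE_URL = "https://jobs.example.com/search"


def build_search_url(term, location):
    """Encode the search parameters into the query string."""
    return BASE_URL + "?" + urlencode({"term": term, "location": location})


# Made-up markup standing in for one results page.
SAMPLE_HTML = """
<div class="job"><a href="/job/101">QA Engineer</a>
  <span class="location">Zurich</span><span class="remote">Remote possible</span></div>
<div class="job"><a href="/job/102">Test Automation Engineer</a>
  <span class="location">Bern</span><span class="remote">On site</span></div>
"""


def parse_jobs(html):
    """Return (name, url, location, remote) tuples, one per job card."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job"):
        link = card.find("a", href=True)
        jobs.append((
            link.get_text(strip=True),
            urljoin(BASE_URL, link["href"]),  # make relative links absolute
            card.select_one(".location").get_text(strip=True),
            card.select_one(".remote").get_text(strip=True),
        ))
    return jobs


print(build_search_url("test automation", "Zurich"))
for job in parse_jobs(SAMPLE_HTML):
    print(" | ".join(job))
```

With Mechanicalsoup the HTML would come from the browser object instead of a string, but the parsing part stays plain Beautifulsoup either way.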
Everything looked fine: the results of the first page were listed and Mechanicalsoup parsed the elements cleanly. Then something didn't add up - the results on the second page seemed too unrelated to my search word. I assumed that the browser object might have difficulty holding on to the cookies on this page and just 'forgets' everything when it clicks 'Next'.

I originally thought that I might fix it by adding some parameters to .submit_selected() or by tinkering with the cookie options, but after exhausting my DuckDuckGo searches and my willingness to study the direct and indirect documentation, I decided to switch to another approach to at least bring the use case to a successful end.

Mechanicalsoup is not yet ready to replace Selenium in all aspects of web parsing (at least for some webpages). At least so I thought.

Code is here


Selenium with Beautifulsoup approach

Selenium is the one you call when you want the job done reliably. At least that is mostly the case. I quickly fired up the Chrome browser (and discovered that there is a new way of handling the driver - the file is no longer needed, it is handled by a command -
), set the parameters, navigated to the search page and selected all the necessary locators to fire up the search. Everything went smoothly: the script navigated through the results, gathered all the job data with Beautifulsoup -
job_details = [text for text in soup_job.stripped_strings]
a_tags = soup_job.find_all('a', href=True) 
and listed the jobs. I also added a way for the script to select only the newest jobs (those it had not found previously) - storing the search results per search term in a pandas dataframe and, on each run, listing only the difference from the new search (afterwards updating the main dataframe):
import pandas as pd

# Keep only the jobs whose second field (job[1]) is not in the stored results
new_jobs = [job for job in all_jobs if job[1] not in job_list_df.values]
job_list_df_new = pd.DataFrame(new_jobs)
# Remember the current results for the next comparison
job_list_df = pd.DataFrame(all_jobs)
if not job_list_df_new.empty:
    print("New jobs:", job_list_df_new)
else:
    print("There are no new jobs for this searchterm")
This did the job, at least for a week.
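For the comparison to survive between runs, the previous results have to live somewhere on disk. A rough sketch of one way to do that, assuming a CSV file per search term and the column names below (both are assumptions for illustration, not taken from the linked code):

```python
from pathlib import Path

import pandas as pd

# Assumed column layout: name, URL, location, remote possibility.
COLUMNS = ["name", "url", "location", "remote"]


def report_new_jobs(all_jobs, store):
    """Return the jobs not seen in the previous run and update the stored results.

    `all_jobs` is a list of 4-tuples, `store` is the path of the CSV file
    that remembers the last search (a hypothetical file, one per search term).
    """
    store = Path(store)
    current = pd.DataFrame(all_jobs, columns=COLUMNS)
    if store.exists():
        previous = pd.read_csv(store)
    else:
        previous = pd.DataFrame(columns=COLUMNS)
    # A job counts as new when its URL was not in the previous results.
    new_jobs = current[~current["url"].isin(previous["url"])]
    # Overwrite the store so the next run compares against this search.
    current.to_csv(store, index=False)
    return new_jobs
```

The first call for a search term reports every job as new; later calls with the same file path report only the ones that appeared since the last run.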
The next Monday I woke up to find that the script no longer worked - Selenium was failing to submit the search. I tried tinkering with the waits and with the way the script selects the elements, but to no avail. They seem to have made the website (maybe on purpose) quite hard to automate with Selenium.
Code is here
I was quite frustrated after this.
Will my work be in vain?
Will I ever be able to search for job offers elegantly?

Requests with Beautifulsoup approach

A revelation came the next day. If Mechanicalsoup builds on both Requests and Beautifulsoup, why not combine those directly and use their full potential? I looked at the POST request the search query made, experimented with the parameters in it to figure out which ones are crucial for the search, and narrowed it down to the following set

data_test_zurich = {'__seo_search':'search',
'random_search_id' was necessary, and I suppose it somehow groups together the pagination of the search results. Passing those parameters to a requests.Session() gave me the desired behavior - consistent and reliable job listing across multiple result pages.
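A sketch of how such a paginated POST could look. Only '__seo_search' and 'random_search_id' come from the request described above; the URL, the value of 'random_search_id' and the name of the page parameter are placeholders. The function takes the session as an argument, so it works with a requests.Session() or with any object that offers the same post() method.

```python
# Hypothetical endpoint and payload values (assumptions for illustration).
SEARCH_URL = "https://jobs.example.com/search"
SEARCH_DATA = {"__seo_search": "search", "random_search_id": "abc123"}


def collect_pages(session, url, search_data, pages, page_param="page"):
    """POST the same search payload once per result page and return the HTML bodies.

    `session` is expected to behave like requests.Session. Reusing one session
    means cookies set by the first response are sent with the later requests.
    `page_param` is an assumed name for the pagination field.
    """
    bodies = []
    for page in range(1, pages + 1):
        # Same search parameters every time, only the page number changes.
        payload = dict(search_data, **{page_param: page})
        response = session.post(url, data=payload, timeout=10)
        response.raise_for_status()
        bodies.append(response.text)
    return bodies


# Usage with the real library (performs network requests, so not run here):
#   import requests
#   with requests.Session() as session:
#       html_pages = collect_pages(session, SEARCH_URL, SEARCH_DATA, pages=3)
```

Each returned body can then be handed to Beautifulsoup for parsing, the same way as in the earlier attempts.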
Code is here

I put it together with the useful functionality from the previous attempts (show only new jobs, remember the last search results) and added some perks like a better view in the console and a command line interface with click. In the end I basically came full circle - starting to explore with Mechanicalsoup, switching to old mate Selenium and finally solving the problem with the libraries on which Mechanicalsoup is built.

