
My web parsing journey

Mission - A quick way to pick out new job listings for particular search parameters in a minimalist format, and to learn something along the way.

I was interested in the area of web parsing and the idea of quickly gathering information from a webpage. My idea was to parse a job-search website with the parameters I was interested in and have it find me new job offers in an effective manner. A birdie told me that "the modern way to do it is called MechanicalSoup, Selenium is for oldies" - let's see.

MechanicalSoup with URL parameters approach

First, a look at the library MechanicalSoup. It is built on the requests and BeautifulSoup libraries and can follow links and submit forms. It sounds like an ideal lightweight alternative to Selenium when you only want basic interaction and flexible information scraping.
I defined the URL with parameters, parsed the listed jobs from the first page, separated the information, and output only the important bits (name, URL, location, remote possibility).
Everything looked fine: the results of the first page were listed and MechanicalSoup parsed the elements cleanly. Then something didn't add up - the results on the second page seemed far too unrelated to my search term. I assumed that the browser object might have difficulty holding the cookies on this page and simply 'forgets' everything when it clicks 'Next'.

I originally thought I might fix it by adding some parameters to .submit_selected() or by tinkering with cookie options, but after exhausting my DuckDuckGo searches and my willingness to study the direct and indirect documentation, I decided to switch to another approach to at least bring the use case to a successful end.

MechanicalSoup is not yet ready to replace Selenium in all aspects of web parsing (at least for some webpages). Or so I thought.

Code is here


Selenium with BeautifulSoup approach

Selenium is the person you call when you want the job done reliably. At least that is mostly the case. I quickly fired up the Chrome browser (and discovered that there is a new way of handling the driver - the driver file is no longer needed, it is handled automatically), set the parameters, navigated to the search page, and selected all the necessary locators to fire off the search. Everything went pretty smoothly; the script navigated through the results and gathered all the job data with BeautifulSoup -
job_details = [text for text in soup_job.stripped_strings]
a_tags = soup_job.find_all('a', href=True) 
and listed the jobs. I also added a way for the script to select only the newest jobs (those it hadn't found previously): it stores the search results per search term in a pandas DataFrame and each time lists only the difference from the new search (afterwards updating the main DataFrame):
job_list_df_new = pd.DataFrame()
for job in all_jobs:
    if job[1] not in job_list_df.values:  # job[1] is the job URL, used as the unique key
        # DataFrame.append was removed in pandas 2.0; build a one-row frame and concat
        row = pd.DataFrame([job])
        job_list_df_new = pd.concat([job_list_df_new, row], ignore_index=True)
job_list_df = pd.DataFrame(all_jobs)  # remember this search for the next run
if not job_list_df_new.empty:
    print("New jobs:", job_list_df_new)
else:
    print("There are no new jobs for this search term")
This did the job, at least for a week.
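For reference, the setup described above can be sketched as follows. With Selenium 4.6+ the bundled Selenium Manager downloads the matching driver on its own, so no driver file is needed. Here a static HTML snippet stands in for driver.page_source so the parsing part runs without a browser; the markup is a hypothetical placeholder.

```python
# In the real script the page comes from the browser:
#
#   from selenium import webdriver
#   driver = webdriver.Chrome()       # no executable_path / driver file needed
#   driver.get(search_url)
#   page_source = driver.page_source
#
# Below, a fixed snippet plays the role of page_source.
from bs4 import BeautifulSoup

page_source = """
<h2>Results</h2>
<a href="/jobs/1">QA Engineer</a>
<a href="/jobs/2">Test Automation Lead</a>
"""

# the same extraction as in the snippets above
soup_job = BeautifulSoup(page_source, "html.parser")
job_details = [text for text in soup_job.stripped_strings]
a_tags = soup_job.find_all("a", href=True)
print(job_details)
print([a["href"] for a in a_tags])
```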
The next Monday I woke up to find that the script no longer worked - Selenium was failing to submit the search. I tried tinkering with the waits and with the way the script selects the elements, but to no avail; they seem to have made the website (maybe on purpose) quite hard to automate with Selenium.
Code is here
I was quite frustrated after this.
Will my work be in vain?
Will I ever be able to search for job offers elegantly?

Requests with BeautifulSoup approach

A revelation came the next day. Since MechanicalSoup implements both Requests and BeautifulSoup, why not combine those directly and use their full potential? I looked at the POST request the search query made, experimented with its parameters, tried to figure out which ones are crucial for the search, and narrowed it down to the following set:

data_test_zurich = {'__seo_search':'search',
'random_search_id' was necessary, and I suppose it somehow groups together the pagination of the search results. Feeding those into a requests.Session() gave me the desired behavior - consistent and reliable job listings across multiple result pages.
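The general pattern looks like this. The endpoint and field names below are hypothetical placeholders (the real site's parameters differ); the request is only prepared, not sent, so the sketch needs no network - in the real script this is simply session.post(url, data=data).

```python
# Sketch: a requests.Session keeps cookies across result pages, and the
# POST form data carries the narrowed-down search parameters.
import requests

session = requests.Session()          # holds cookies between page requests

data = {
    "__seo_search": "search",
    "query": "tester",                # hypothetical field names - the real
    "page": "1",                      # site's parameter set differs
}

# Prepare the request to inspect what would be sent over the wire
req = requests.Request("POST", "https://example-jobs.test/search", data=data)
prepared = session.prepare_request(req)
print(prepared.method, prepared.url)
print(prepared.body)
```

Because the session carries the cookies (and the grouping id travels in the form data), every result page belongs to the same logical search - exactly what the MechanicalSoup attempt was losing.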
Code is here

I put it together with the useful functionality from the previous attempts (show only new jobs, remember the last search results) and added some perks like a nicer console view and a command-line interface with click. In the end I basically came full circle - starting to explore with MechanicalSoup, switching to old mate Selenium, and finally solving the problem with the very libraries MechanicalSoup is built on.
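A click interface for such a scraper can be sketched like this. The option names are illustrative, not the actual script's interface; click's CliRunner invokes the command in-process, standing in for a shell call.

```python
# Hedged sketch of a click CLI wrapping the search; options are made up.
import click
from click.testing import CliRunner

@click.command()
@click.option("--term", default="tester", help="Search term to query.")
@click.option("--pages", default=3, help="Number of result pages to fetch.")
def search_jobs(term, pages):
    """List new job offers for TERM (only those not seen before)."""
    click.echo(f"Searching '{term}' across {pages} page(s)...")
    # ...run the requests + BeautifulSoup search and print only new jobs...

# exercise the command in-process instead of from a shell
result = CliRunner().invoke(search_jobs, ["--term", "qa", "--pages", "2"])
print(result.output, end="")
```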

