Indeed Job Scraper

❗ Disclaimer

Please note that this was a hobby project to learn more about web scraping and the anti-scraping mechanisms used on job boards. I am not responsible for any violation resulting from the direct or indirect use of my scripts or of the information presented here.

Overview

In this project, I built three scrapers with different libraries to scrape Indeed Canada job posts in Ontario (on Jan 23, 2020). The three are built with, respectively:

1. requests & bs4
2. selenium & bs4
3. scrapy

Why make three scrapers?

Because my initial attempts were countered by anti-scraping mechanisms, such as Google reCAPTCHA.

Google reCAPTCHA throws 5 to 10 challenges in one sitting when a large number of requests is detected from the same IP address, the same user agent, etc.

I first wrote the scraper with Requests and bs4, which was stopped by reCAPTCHA after about 900 jobs (roughly 10 minutes in). Hoping to resolve the reCAPTCHAs manually, I switched to browser automation with Selenium and added logic so that when a Google reCAPTCHA is thrown, the program pauses and waits for user input. The program did pause about 1000 jobs in, and I was able to resolve the reCAPTCHAs manually, but for some unknown reason the scraper always stopped after the reCAPTCHAs were resolved.
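The pause-and-wait logic described above can be sketched roughly as follows. This is a minimal illustration, not the original script: the marker strings, the function names, and the `max_prompts` limit are all assumptions, and the real challenge page may need a different detection rule.

```python
from typing import Callable

# Assumed markers for a reCAPTCHA challenge page (hypothetical; adjust to
# whatever the real challenge page actually contains).
CAPTCHA_MARKERS = ("g-recaptcha", "recaptcha/api.js")


def looks_like_captcha(page_source: str) -> bool:
    """Return True if the HTML contains any of the assumed reCAPTCHA markers."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def wait_until_solved(get_page_source: Callable[[], str],
                      prompt: Callable[[str], str] = input,
                      max_prompts: int = 5) -> bool:
    """Pause for the user while a challenge is showing.

    Returns True once the page no longer looks like a challenge, or False
    if it still does after max_prompts pauses.
    """
    for _ in range(max_prompts):
        if not looks_like_captcha(get_page_source()):
            return True
        # Blocks until the user presses Enter (when prompt is input).
        prompt("reCAPTCHA detected. Solve it in the browser, then press Enter: ")
    return not looks_like_captcha(get_page_source())
```

With Selenium, `get_page_source` could be `lambda: driver.page_source`, called after each page load. Re-reading the page after the pause, rather than assuming the challenge is gone, is one way to notice when the scraper ends up on an unexpected page after the reCAPTCHA is resolved.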

At this stage, I considered several solutions:

1. Continue debugging to figure out why the scraper stopped after the reCAPTCHAs were resolved manually;
2. Get past the reCAPTCHA by using speech-to-text to transcribe the audio file offered by the accessibility option (but this is clearly an abuse of the feature, even if it works); or
3. Rotate user agents and/or proxies to avoid triggering the anti-scraping mechanisms.

I decided to go with the last option.

Rather than setting up user agent rotation by hand, I found that it could be set up easily with Scrapy. I refactored my script into a Scrapy project and used a Scrapy user agent middleware. The script successfully scraped all 1500 job posts in Ontario in about 3 minutes.
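For readers unfamiliar with how user agent rotation fits into Scrapy, here is a hand-written sketch of a downloader middleware. It is not necessarily the middleware used in this project: the class name, the `rng` parameter, and the user agent strings are hypothetical placeholders. Scrapy calls `process_request` on each enabled downloader middleware before a request is sent, and returning `None` lets processing continue.

```python
import random

# Hypothetical pool for illustration only; a real pool should contain
# user agent strings from current browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS, rng=None):
        self.user_agents = list(user_agents)
        # An injectable random source keeps the choice reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # let Scrapy continue handling the request


# In settings.py (the module path "myproject.middlewares" is an assumption):
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
#     "myproject.middlewares.RotateUserAgentMiddleware": 400,
# }
```

Setting the built-in `UserAgentMiddleware` to `None` disables it, so the rotating middleware alone decides the header.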

Published Jan 24, 2021