ITZone

Job Board Scraping with Rails

Job listings are a popular kind of data to work with. While there are some great public databases of job-related information, the same data can also be gathered from many different sites.

In this article I build a job board in Ruby on Rails, hosted for free on Heroku, with a scraper that runs automatically once per day.

Setup

I assume you have already installed Ruby, Rails and a database. Create a new project or cd into an existing one.

Here I use PostgreSQL because it makes deploying to production on Heroku easier later on. You can choose any database.

If you add more gems, don’t forget to run bundle install.

Configure the Database

Open the file /config/database.yml and update it to use the postgresql adapter, setting the username and password entries to your own PostgreSQL credentials.

Next, run the command rails db:create to create the database.

Create the ORM Model

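Use scaffold to quickly generate a Job model: rails g scaffold Jobs title:string company:string url:string location:string

This generates a migration file, db/migrate/20200813034505_create_jobs.rb (the timestamp prefix will differ on your machine). It defines a jobs table with the four string columns above plus the default timestamps.

Run rails db:migrate so that Rails creates the table and its columns for us.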

Create a Rake Task

In /lib/tasks, create a new file called scrape.rake. This is where code that you want to run on a schedule lives.

Here I will crawl the best-jobs listing on TopCV, with the scraping code going in scrape.rake.

Nokogiri is an HTML parsing library for Ruby. This article does not cover how to use the gem; please refer to the Nokogiri documentation.

Run Locally

Now let’s try it out by running rake scrape.

Once it has finished running, you can view your local data using pgAdmin, DBeaver or the console.

The only remaining problem is that we don’t want to run it manually every day. To solve that, deploy to Heroku and schedule the task there. If you have your own server, you can use tools such as Whenever or Sidekiq to set up the schedule instead.

Deploy to Production

I assume you already have a Heroku account and have pushed the code to GitHub or GitLab.

Create a new app on the Heroku dashboard.

Go to the Deploy tab > connect the app to GitHub > select your repo and branch > Manual deploy > select Deploy.

The deploy process can take a few minutes.

Configure Scheduler in Production

Now we need to turn the rake task above into a scheduled job. In the Heroku dashboard, open the Resources tab > search for the Heroku Scheduler add-on > add it to your project.

Note: Although this add-on is free, Heroku still requires you to set up billing.

Open Heroku Scheduler and add a job that runs rake scrape once a day at the time you want.

Note 2: The scheduler time is in UTC. Here I set 6:30 AM UTC, which is 1:30 PM in UTC+7.
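To double-check a scheduler time against your own time zone, the conversion can be done in plain Ruby. This is a small sketch using only the standard Time class; the helper names are made up for illustration and the calendar date used inside them is arbitrary, since only the time of day matters.

```ruby
# Convert a UTC time of day to a local time of day for a fixed offset.
def utc_to_local(hour, min, offset)
  Time.utc(2000, 1, 1, hour, min).getlocal(offset).strftime("%H:%M")
end

# Convert a local time of day (at a fixed offset) back to UTC.
def local_to_utc(hour, min, offset)
  Time.new(2000, 1, 1, hour, min, 0, offset).getutc.strftime("%H:%M")
end

puts utc_to_local(6, 30, "+07:00")   # 13:30
puts local_to_utc(13, 30, "+07:00")  # 06:30
```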

Important note: Don’t crawl other people’s pages more often than you need to, or their site may block your IP.
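If a task ever needs to fetch more than one page, one simple way to stay polite is to pause between requests. The sketch below is an illustration only: fetch_politely, its delay and fetcher parameters, and the User-Agent string are all hypothetical and not part of the original task, and a two-second pause is an arbitrary choice rather than a rule any site publishes.

```ruby
require "net/http"
require "uri"

# Fetch each URL in turn, sleeping `delay` seconds between requests.
# `fetcher` can be swapped for any callable, which helps when trying this out
# without touching the network.
def fetch_politely(urls, delay: 2.0, fetcher: nil)
  fetcher ||= lambda do |url|
    uri = URI(url)
    request = Net::HTTP::Get.new(uri)
    # Identify the crawler; this string is a placeholder.
    request["User-Agent"] = "job-board-demo (contact: you@example.com)"
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request).body
    end
  end

  urls.each_with_index.map do |url, index|
    sleep(delay) if index.positive? # no pause before the first request
    fetcher.call(url)
  end
end
```

Called as fetch_politely(["https://example.com/jobs?page=1", "https://example.com/jobs?page=2"]), it returns the response bodies in order, with one pause between the two requests.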

And here are my results
