Job Board Scraping with Rails

Tram Ho

Job data is something many people like to work with. While there are some great public databases of job-related information, the same data can also be gathered from many different sites.

In this article I build a job board in Ruby on Rails whose scraper runs automatically once per day, hosted for free on Heroku.

Setup

Let's assume you have already installed Ruby, Rails, and a database. Create a new project or cd into an existing one.

Here I use PostgreSQL because it makes deploying to production on Heroku easier later on. You can choose any database.

If you add more gems, don't forget to run bundle install.

Configure the Database

Open the file /config/database.yml and fill in your PostgreSQL username and password. Next, run the command rails db:create to create the database.

Create the ORM Model

Use scaffold to quickly generate a Job model: rails g scaffold Jobs title:string company:string url:string location:string

This generates a migration file, db/migrate/20200813034505_create_jobs.rb.

Run rails db:migrate so that Rails creates the table and its columns for us.

Create a Rake Task

In /lib/tasks, create a new file called scrape.rake. Code that you want to write once and run on a schedule lives here.
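As a rough illustration of what such a file could look like, here is a minimal sketch of a task skeleton. The task body is a placeholder, and the :environment stub exists only so the snippet runs outside a Rails project; inside Rails, that task is already defined and loads the app and its models.

```ruby
require "rake"

# Rails defines the :environment task, which loads the app and its models.
# It is stubbed here only so this sketch can run outside a Rails project.
task :environment unless Rake::Task.task_defined?("environment")

desc "Scrape job listings and store them (sketch)"
task scrape: :environment do
  # The real task would fetch the page, parse it, and save Job records.
  puts "Scraping jobs..."
end
```

Inside lib/tasks/scrape.rake you would drop the require and the stub line and keep only the desc and task definition.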

Here I crawl the best jobs listed on topcv; the scraping code goes in scrape.rake.

Nokogiri is a Ruby HTML parsing library. This article does not cover how to use the gem, so please refer to its documentation.

Run Locally

Now let's run the task: rake scrape

Once it has finished running, you can view your local data using pgAdmin, DBeaver, or the Rails console.

The only problem is that we don't want to run it manually every day. To solve that, deploy to Heroku and schedule the task. If you have your own server, you can use Whenever or Sidekiq to set up the schedule instead.

Deploy to Production

I assume you already have a Heroku account and have pushed the code to GitHub or GitLab.

Create a new app on the dashboard

Go to the Deploy tab > App connected to GitHub > select your repo and branch > Manual deploy > select Deploy.

The deploy process can take a few minutes.

Configure Scheduler in Production

Now we need to turn the rake task above into a scheduled job. In the Heroku dashboard, open the Resources tab > search for the Heroku Scheduler add-on > add it to your project.

Note: Although this add-on is free, Heroku still requires you to set up billing.

Open Heroku Scheduler and add a job that runs rake scrape once per day.

Note 2: The scheduler is configured in UTC. Here I set 6:30 AM UTC, which is 1:30 PM in UTC+7.
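If you want to double-check the conversion for your own time zone, a small sketch like the following may help. The utc_to_offset helper is hypothetical, and the date is arbitrary because only the time of day matters here.

```ruby
# Convert a time of day given in UTC to a fixed UTC offset.
# The date (2020-08-13) is arbitrary; it is only needed to build a Time object.
def utc_to_offset(hour, minute, offset)
  Time.utc(2020, 8, 13, hour, minute).getlocal(offset)
end

local = utc_to_offset(6, 30, "+07:00")
puts local.strftime("%H:%M %:z") # prints "13:30 +07:00"
```

Note that a fixed offset ignores daylight saving time, which is fine for UTC+7 but may not be for other zones.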

Important Note: Don't crawl other people's pages continuously when you don't need to, or their site may block your IP.
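One simple way to stay polite is to enforce a minimum pause between requests. The sketch below is only an illustration under that assumption: the RequestThrottle class, its parameters, and the two-second interval are hypothetical and not part of the original scraper.

```ruby
# Enforces a minimum interval between consecutive requests.
# The clock and sleeper are injectable so the class can be tested without waiting.
class RequestThrottle
  def initialize(min_interval:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_request_at = nil
  end

  # Blocks until at least min_interval seconds have passed since the last call.
  def wait
    unless @last_request_at.nil?
      remaining = @min_interval - (@clock.call - @last_request_at)
      @sleeper.call(remaining) if remaining > 0
    end
    @last_request_at = @clock.call
  end
end

throttle = RequestThrottle.new(min_interval: 2.0)
throttle.wait # the first call returns immediately; later calls pause if needed
```

Calling throttle.wait right before each HTTP request in the rake task would space the requests out by at least the chosen interval.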

And here are my results

Source : Viblo