Job Board Scraping with Rails

Friday, 14/08/2020

Tram Ho

Data related to jobs (jobs) is one of the favorite data of many people. While there are some great public databases with job-related information, it can also be gathered from many different sites.

In this article I build a job board in Ruby on Rails that will automatically run once per day, hosted for free on Heroku.

Setup

Let’s say you have successfully installed Ruby, Rails and DB. Create a new project or cd to an existing project of yours!

rails new scrape_app -d postgresql
cd scrape_app/

rails new scrape_app -d postgresql

cd scrape_app/

Here I use postgresql to make deploying the production to heroku easier underneath. You can choose any DB.

If using more gem, don’t forget to bundle install !

Configure the Database

Open the file /config/database.yml and update again as follows:

default: &amp;default
  adapter: postgresql
  encoding: unicode
  host: localhost
  username: xxx # fill in your username
  password: xxx # fill in your password
  port: 5432
  pool: &lt;%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %&gt;

development:
  &lt;&lt;: *default
  database: scrape_app_development
  
production:
  &lt;&lt;: *default
  database: scrape_app_production
  username: scrape_app
  password: &lt;%= ENV['SCRAPE_APP_DATABASE_PASSWORD'] %&gt;

default: &default

adapter: postgresql

encoding: unicode

host: localhost

username: xxx # fill in your username

password: xxx # fill in your password

port: 5432

pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>

development:

<<: *default

database: scrape_app_development

production:

<<: *default

database: scrape_app_production

username: scrape_app

password: <%= ENV['SCRAPE_APP_DATABASE_PASSWORD'] %>

Replace xxx as your username / password posgres. Next, run the command rails db:create to create database.

Create the ORM Model

Use scaffold to create a quick Job model for rails g scaffold Jobs title:string company:string url:string location:string

We will get this file db/migrate/20200813034505_create_jobs.rb

class CreateJobs &lt; ActiveRecord::Migration[5.2]
  def change
    create_table :jobs do |t|
      t.string :title
      t.string :company
      t.string :url
      t.string :location

      t.timestamps
    end
  end
end

class CreateJobs < ActiveRecord::Migration[5.2]

def change

create_table :jobs do |t|

t.string :title

t.string :company

t.string :url

t.string :location

t.timestamps

end

Run rails db:migrate so that rails creates the table and properties for us.

Create a Rake Task

In /lib/tasks create a new file called scrape.rake . When you want to write code and schedule it to run, this is where the code is located.

Here I will crawl the best jobs of the topcv. Update file scrape.rake :

task scrape: :environment do
  require 'open-uri'
 
  # Link mình muốn vào
  URL_BASIC = 'https://www.topcv.vn/viec-lam-tot-nhat'

  # Cho Nokogiri open url và inspect nó
  doc_basic = Nokogiri::HTML(open(URL_BASIC))
  
  # Đếm số trang cần crawl
  pages = doc_basic.search('ul.pagination &gt; li').count - 2

  # Bắt đầu crawl từng page với index là i
  (1..pages).each do |i|
    URL = "https://www.topcv.vn/viec-lam-tot-nhat?page=#{i}"
    doc = Nokogiri::HTML(open(URL))
    postings = doc.search('div.result-job-hover')

    postings.each do |p|
      # Lấy thông tin từng field mình mong muốn theo tag html
      job_title = remove_n_string(p.search('div &gt; div &gt; h4.job-title &gt; a &gt; span.bold.transform-job-title').text)
      url = remove_n_string(p.search('div &gt; div &gt; h4.job-title &gt; a')[0]['href'])
      company = remove_n_string(p.search('div &gt; div &gt; div.row-company').text)
      location = remove_n_string(p.search('div &gt; div &gt; div#row-result-info-job &gt; div.address')[0].text)

      # Bỏ qua các job đã tồn tại trong database
      if Job.where(title: job_title, location: location, company: company).count &lt;= 0
        Job.create(
          title: job_title,
          location: location,
          company: company,
          url: url
        )
        puts 'Added: ' + (job_title ? job_title : '')
      else
        puts 'Skipped: ' + (job_title ? job_title : '')
      end
    end
  end
end

# Loại bỏ n chứa trong string
def remove_n_string string
  string.gsub("n","")
end

task scrape: :environment do

require 'open-uri'

# Link mình muốn vào

URL_BASIC = 'https://www.topcv.vn/viec-lam-tot-nhat'

# Cho Nokogiri open url và inspect nó

doc_basic = Nokogiri::HTML(open(URL_BASIC))

# Đếm số trang cần crawl

pages = doc_basic.search('ul.pagination > li').count - 2

# Bắt đầu crawl từng page với index là i

(1..pages).each do |i|

URL = "https://www.topcv.vn/viec-lam-tot-nhat?page=#{i}"

doc = Nokogiri::HTML(open(URL))

postings = doc.search('div.result-job-hover')

postings.each do |p|

# Lấy thông tin từng field mình mong muốn theo tag html

job_title = remove_n_string(p.search('div > div > h4.job-title > a > span.bold.transform-job-title').text)

url = remove_n_string(p.search('div > div > h4.job-title > a')[0]['href'])

company = remove_n_string(p.search('div > div > div.row-company').text)

location = remove_n_string(p.search('div > div > div#row-result-info-job > div.address')[0].text)

# Bỏ qua các job đã tồn tại trong database

if Job.where(title: job_title, location: location, company: company).count <= 0

Job.create(

title: job_title,

location: location,

company: company,

url: url

)

puts 'Added: ' + (job_title ? job_title : '')

else

puts 'Skipped: ' + (job_title ? job_title : '')

end

# Loại bỏ n chứa trong string

def remove_n_string string

string.gsub("n","")

end

Nokogiri is Ruby’s html parsing library, this article will not cover the use of this gem. Please refer here .

Run Locally

Now let’s get started – let’s run code rake scrape

Once it’s done running, you can view your local data using PGAdmin, DBeaver or console …

Our only problem is that we don’t want to run it manually every day. To overcome that, deploy to Heroku and schedule task. If you have a server available, use Whenever, sidekiq to set up your schedule.

Deploy to Production

By default you already have a heroku account and push the code to github or gitlab.

Create a new app on the dashboard

Go to the Deploy tab> App connected to GitHub> Select your repo and branch> Manual deploy> Select Deploy

The deploy process can take a few minutes.

Configure Scheduler in Production

Now we need to configure rake task above into scheduled job. In the Heroku tab choose Tab Resources> Search for the Heroku Scheduler addon> Add that Addon to your project.

Note: Although this addon is free, heroku still requires you to set Billing.

Open Heroku Scheduler to proceed to config as follows:

Note 2: Now config is UTC time. Here I set 6:30 AM UTC which means 1:30 PM +7

Important Note: Don’t intentionally continually crawl other people’s pages when it’s not needed if you don’t want their site to block your IP.

And here are my results

References

Share the news now

Source : Viblo

Job Board Scraping with Rails

Setup

Configure the Database

Create the ORM Model

Create a Rake Task

Run Locally

Deploy to Production

Configure Scheduler in Production

References

TikTok becomes the second largest social platform in South Africa

The fastest depreciating after 9 months of launch, iPhone 14 Pro Max continues to break the bottom in Vietnam

Beginner's guide to R: Introduction

10 essential SublimeText plugins for JavaScript developers