How are popular ranking algorithms such as Reddit and Hacker News working?

Tram Ho

Have you ever wondered how people arrange posts on the homepage, the "hot" section of news sites such as Reddit , Hacker News , entertainment sites such as trending sections of Youtube , 9GaG , and Comedy Ignoring , Does anyone know , … Or closest to us is like the Trending section of Viblo ?

Please think about personalizing the content for each user or applying machine learning to the ranking. In this article, I would like to ask you to learn and analyze the calculation and ranking of the "hot" article using the most simple and pure scoring algorithm .

Algorithm for ranking posts is hot and popular like Reddit and Hacker News

overview

First of all, you need to determine what is the peculiarity of your system / website , how to rank the articles on the home page:

  • If your system provides knowledge and answers like Quora , StackOverFlow : you need to prioritize providing the most useful information to users first, old or new content is less important. Quora and StackOverFlow both rank the highest upvote – downvote answers on the top. Ranking by "hot" will not be very useful in this case.
  • If your system is a form of news and entertainment social network : on the homepage, you will need to provide fresh, interesting information to the masses, restricting old and boring information. So you need a ranking algorithm format to sort the article properly. This is the part I will focus on mostly in this article.

Here are some factors that you can use to calculate the score for the article:

  • Likes / votes : This is the most reliable factor, because usually only for users who have registered an account, need users to perform specific actions, click the Like / (+) button and each person usually only vote / like once. Depending on where you can only be released / liked, or even dislike, or downvote the posts. The higher the number of likes / votes, the higher the score naturally.
  • Time to post : This is a very important factor, which helps you identify new, "trendy" posts to give to users. As such, old and outdated articles need to be deducted more points, and new and recent posts need to be deducted less. Sometimes the system also calculates the score according to the style: "the more recent articles are, the more points they get", like Reddit, for example.
  • Number of Comments / Number of Comments : This is not as important as the 2 above, because we don't know whether people are commenting or complimenting, but it is also helpful. Getting lots of comments proves that this topic is exciting and interesting.
  • Views / watch time : This is the most uncertain factor, you should limit using it to rank. However, there are still cases, as Google clearly has tracking click / bounce rate to determine the ranking of Web pages on search results.

Recipe overview

I would like to temporarily give the formula for the simplest article writing as follows:

S c o r e = ( u p v o t e d o w n v o t e ) t i m e e l a S p e d {score} = ({upvote} – {downvote}) – {timeelasped}

With

t i m e e l a p S e d {timeelapsed}

is penalized according to the elapsed time. You can define how the points will be deducted. For example, I can set that every 4 hours will pass, the article will be deducted 1 point.

For example a few cases:

  • Posts posted 6 hours ago, with 5 upvotes, 1 downvote, will have a score of
    ( 5 first ) ( 6 4 ) = 2.5 (5-1) – ( frac {6} {4}) = 2.5

  • Posts posted 12 hours ago, with 25 upvotes, 4 downvotes, will have a score of
    ( 25 4 ) ( twelfth 4 ) = 18 (25-4) – ( frac {12} {4}) = 18

  • Posts posted 30 days ago, with 32 upvotes, 2 downvotes, will have a score of
    ( 32 2 ) ( 720 4 ) = 150 (32-2) – ( frac {720} {4}) = – 150

Looking at the example, we can see that the above formula looks "temporary", because the higher the upvote will get at the top, but still has the number of points decreasing over time.

However, there are cases where an article gets to the top, then every day it receives 6, 7 upvotes or more. Because the article is at the top, so it 's snowball ( snowball rolling effect, anyone who plays LOL most definitely 😏), which means more and more upvotes for it is very common. So the consequence is that after … 2 years later this article is still hanging around your top 10 trending (?)

We will now look at some of the most popular sites that apply their ranking, and analyze how they overcome the "snowball" case.

Reddit algorithm (old)

The reason for saying old because Reddit used to open source their system, but have switched to close their source code for some time. However, their source code while still open source is still shared and worth learning.

Source code

The scoring section to rate their hot posts like this ( GitHub ):

Explain

A brief explanation of the above algorithm is as follows: Call:

  • d a t e date

    is the time the article was posted in Unix Timestamp format,

  • 1134028003 1134028003

    is a fixed time (August 12, 2005 @ 7:46 am (UTC), can be roughly understood as the time when Reddit started operating), which is the time format. Unix Timestamp.

I am

S e c o n d S seconds

is the time interval between them (in seconds):

S e c o n d S = d a t e 1134028003 seconds = {date} -1134028003

Call

S S

is the upvote – downvote difference of the article:

S = u p S d o w n S s = ups – downs

From there we have

S i g n sign

is the sign of

S S

just now. In other words,

S i g n sign

denotes whether the sign of the voting result is positive or negative:

  • S i g n = first sign = 1

  • S i g n = first sign = -1

  • S i g n = 0 sign = 0

To be given

S S

into the logarithm function, one must change

S S

in terms of absolute value, ie impossible to be negative; and make it always bigger than 1. I temporarily put that number as

n n

:

  • n = S n = | s |

  • n = first n = 1

Finally, we can calculate the "hot" score of the article:

f ( S e c o n d S , S i g n , n ) = log ten n + S i g n S e c o n d S 45000 f (seconds, sign, n) = log_ {10} n + frac {{sign} * {seconds}} {45000}

Analysis

Looking at Reddit's formula, we immediately see the similarity with the "overview formula" at the beginning of our article:

  • First term (
    log ten n log_ {10} n

  • Second term (
    S i g n S e c o n d S 45000 frac {{sign} * {seconds}} {45000}

Suming these two terms, we will finalize the article. But there are some "interesting" things here: why do you need logarithmic functions? Why divide by number 45000? Now we will go to learn little by little.

Rank 2 (points calculated from time of posting)

This section also serves the purpose of making recent posts score better than previous posts. But instead of subtracting points based on time passed, Reddit chose to add points to recent posts.

More specifically, they did it by: first choosing a fixed timeline in the past, and the difference between that timeline and the time of writing. For example (assuming not counting votes):

  • My website was launched on Christmas Eve, December 12, 2019, at 0:00, and I would like to select this time period as a "fixed timeline". It has a Unix Timestamp value of 1577206800.
  • The first article was posted at 6 am (6 hours later), Unix Timestamp is 1577228400, the effective time is
    1577228400 1577206800 = 21600 1577228400-1577206800 = 21600

  • The second post was posted at 12:30, the time difference is
    1577251800 1577206800 = 45000 1577251800-1577206800 = 45000

  • The 3rd article is posted at 1am the next day (the 26th), the time difference is
    1577296800 1577206800 = 90000 1577296800-1577206800 = 90000

From the example above, we see that, for every 12.5 hours after the time is Christmas Eve, the new post will be added 1 point. 45000 is the number of seconds corresponding to 12.5h. This is possible because if the current date in the formula will make the point constantly changing and limiting the ability to use the cache or execute scoring in the background.

Number 1 (the part that counts points from votes)

First of all, probably many readers have not been in contact with Math for a long time, so I would like to briefly mention the shape of a logarithm base function 10: Basic logarithm function 10 To make it easier to imagine, I would like to offer a few cases:

  • For an article with 10 votes (ie, upvote – downvote), that post will be added completely
    log ten ten = first log_ {10} 10 = 1

  • For posts that have double the number of votes above, that is, 20 votes , that post only received
    log ten 20 1.3 log_ {10} 20 approx1.3

  • Another article has more votes than the first 10 posts, ie 100 votes , that article only received
    log ten 100 = 2 log_ {10} 100 = 2

As another example, in order for an article that was posted 3 days ago to rank higher than the one that was just posted , it needs to be more

259200 45000 = 5.76 frac {259200} {45000} = 5.76

Thus, thanks to the logarithm function that:

  • The first vote has the highest value
  • The later votes are getting less and less valuable
  • For the older posts to stay on the trending, the bigger the number of votes must be received

In short, this is how Reddit counteracts the "snowball" case as I mentioned in the Overview section, and helps keep the content on the Reddit frontpage always fresh.

What about HackerNews's algorithm?

Unlike Reddit, HackerNews is still open source to this day. The HackerNews system is written in Arc, and the source code for the ranking function of the articles on the homepage is as follows:

To me, code like Lisp or Arc as above looks quite scary. The above code, if rewritten in Python, would look like this:

In summary, the code above is quite simple formula as below.

Recipe

HackerNews's formula is somewhat easier to understand and simpler than Reddit's, specifically as follows:

S c o r e = p first ( t + 2 ) G score = frac {p-1} {(t + 2) ^ {G}}

With:

  • p p

    : Upvote of the article (upvote-downvote). This vote should be subtracted 1 to not count the upvote of the writer.

  • t t

    : Time between posting time and the current time (in hours). For example, posts 2 hours ago will be

    t = 2 t = 2

  • The "gravity" constant, the default is
    1.8 1.8

    .

Explain

Instead of counting points by votes and by the time and total, HackerNews uses the division between the number of votes and the time posted (in hours) to calculate the rank of the article.

Some comments can be made as follows:

  • The longer the writing time, the lower the score, because of the number
    t t

    is placed in the denominator.

  • To control whether the score decreases rapidly or slowly over time, a constant is used
    G G

    is the number of gravity. The higher the gravity number, the faster the score will decrease over time.

With

G = 1.8 G = 1.8

HackerNews ranking algorithm

Refer

Share the news now

Source : Viblo