Build a system that pushes millions of notifications per hour

Friday, 14/04/2023

Tram Ho

Basic introduction to the push system

You are probably familiar with notifications from banking apps when you transfer money, from ride-hailing apps when you book or complete a ride, or from food-ordering apps that send you discounts every day.
A push notification system has become a core, necessary component of most applications. The problem we focus on here is an application that sends advertising messages.
Such a system has to send notifications to a large share of its user base, which can mean up to millions of users.

Problem

Have you ever wondered how a push notification system manages to push messages to millions of users?
The question is like asking how to eat an elephant. Obviously you cannot eat a whole elephant at once; it has to be cut up and eaten piece by piece.
The solution here works the same way: we cannot push millions of messages at the same moment, so each time we take a limited number of messages, send them, and repeat until everything has been sent.
Before going into detail on sending messages to millions of customers, let's go back to how to send a notification to one customer.

How to send a message to a user

Your system must call a third-party API such as Firebase Cloud Messaging (FCM). FCM then delivers the message to the user's phone.

So what information is needed when calling the API?

Message: the content of the message displayed on the client side.
Token: FCM identifies each device by a token, so given a token FCM can deliver to the right device. But the tokens belong to FCM's device identification system, so how does our App get hold of a token to use when calling FCM?

FCM provides an API that lets the client register for a token and send it back to the App for saving. The flow is as simple as this:

The client calls FCM's token registration API.
FCM generates a token, saves the information in a cache or db (this is not FCM's actual storage mechanism; I drew it this way so that you can picture the service also saving this information), then returns the token to the client.
The client calls the App's API and sends it the token to save, perhaps in a cache or in the db.

Because the token can be reused many times to send messages, token registration usually happens only at the first login to the app, or when the token has expired (push messages can no longer be sent with it) and has to be re-registered.

Send messages to many people

To create an advertising campaign, we need a web backend for selecting the user set and the content to send; the underlying push system then handles the sending.

This is the overall architecture of the system:

Web Backend: creates a campaign and saves its information, including message content, user set, sending time, and so on.
Builder Service: polls for campaigns that are due to be sent, builds the set of users to push to in Redis based on the campaign's conditions, and updates the status of each campaign whose set has been built.
Push Worker: polls the db for campaigns whose user set has been built, gets the users' tokens from Redis, builds the messages and calls the FCM API, then pushes the responses to a queue so that their state is saved asynchronously. When processing is complete, it updates the campaign to a push-completed status.

Problems

Clearly the most important service in the system is the Push Worker, and what matters most is how efficiently it interacts with the internal components and with FCM. The problems that slow down the push system are:

Workers interacting inefficiently with internal components.
Calling the FCM API once per user, which is inefficient for a large number of users.
Handling invalid or expired tokens, which slows down the push.

Making the Push Worker efficient

How does each worker process things quickly?
Should the thread that scans the list of campaigns also handle the push? Of course not.
With one thread handling everything from polling to pushing, the system cannot be fast.
So each worker has only one thread responsible for polling for new campaigns. It then hands the push tasks to a thread pool whose threads do the pushing.
To make this work well, we need to manage the number of active threads. If all threads are already busy pushing, don't add more tasks to the thread pool.
If you want to push multiple campaigns at the same time, you also need to limit the number of threads working on any one campaign.
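
The threading model above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the class name PushDispatcher, its methods, and the use of a semaphore to track idle threads are all assumptions made for the example. The polling thread calls submit, which waits whenever every push thread is busy instead of queueing more work.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the calling (polling) thread dispatches, a fixed pool pushes.
class PushDispatcher {
    private final ExecutorService pool;
    private final Semaphore idleThreads; // one permit per idle push thread

    PushDispatcher(int pushThreads) {
        this.pool = Executors.newFixedThreadPool(pushThreads);
        this.idleThreads = new Semaphore(pushThreads);
    }

    // Blocks the polling thread while every push thread is busy,
    // so at most pushThreads tasks are ever in flight.
    void submit(Runnable pushTask) {
        idleThreads.acquireUninterruptibly();
        try {
            pool.execute(() -> {
                try {
                    pushTask.run();
                } finally {
                    idleThreads.release(); // free the slot even if the push fails
                }
            });
        } catch (RuntimeException e) {
            idleThreads.release(); // the task was rejected, so give the slot back
            throw e;
        }
    }

    // Stops accepting tasks and waits for running ones; true if all finished in time.
    boolean shutdownAndWait() {
        pool.shutdown();
        try {
            return pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Convenience for the sketch: runs all tasks and returns how many completed.
    static int runAll(List<Runnable> tasks, int pushThreads) {
        PushDispatcher dispatcher = new PushDispatcher(pushThreads);
        AtomicInteger completed = new AtomicInteger();
        for (Runnable task : tasks) {
            dispatcher.submit(() -> {
                task.run();
                completed.incrementAndGet();
            });
        }
        dispatcher.shutdownAndWait();
        return completed.get();
    }
}
```

To cap the threads used by a single campaign, one option under the same assumptions is a second, smaller semaphore per campaign that a task must also acquire before it runs.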

Use batch push instead of single push

Let's see how a single push is handled.

Suppose you send 100 push messages to FCM. Each call involves:

Sending a request to FCM.
FCM taking a little time to process it (this is very fast; I explain why below).
FCM sending the response back to the Push Worker.

So how does FCM handle it? Note that the response FCM returns includes a code indicating whether the token is valid, and that at that point the push message has not yet been sent to the client.

What does that mean? When a request comes in, FCM only validates the request and the token, puts the notification on a queue for later processing, and returns the response immediately. That processing is very fast.
So the slowest part of calling the API turns out to be opening the connection and sending each request over the internet to Firebase.
To optimize this, FCM has another API that allows sending up to 500 messages in one call. This greatly improves throughput.
For push messages with shared content, we can use multicast send, which sends the same content to many tokens and saves bandwidth.
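
As a rough sketch of the batching step, the worker can split a campaign's tokens into chunks no larger than the batch limit and issue one request per chunk. The class name TokenBatcher and the partition helper are made up for this example, and the actual FCM call is left out; only the limit of 500 comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a token list into batches for a batch push API.
class TokenBatcher {
    // Batch limit mentioned in the article; treat it as configurable.
    static final int MAX_BATCH = 500;

    static List<List<String>> partition(List<String> tokens, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int from = 0; from < tokens.size(); from += batchSize) {
            int to = Math.min(from + batchSize, tokens.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(tokens.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < 1200; i++) {
            tokens.add("token-" + i);
        }
        for (List<String> batch : partition(tokens, MAX_BATCH)) {
            // A real worker would hand this batch to the FCM client here,
            // paying for one request instead of batch.size() requests.
            System.out.println("would send " + batch.size() + " tokens in one request");
        }
    }
}
```

With 1200 tokens this produces three requests (500, 500 and 200 tokens) instead of 1200 single calls.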

Dealing with Junk Tokens

Not everyone who installs your app is an active user. Some people use it a few times and then delete it, or open it only once a year.
As a result there is a large number of expired tokens to which nothing can be delivered.
If you keep sending to them, you only waste resources and time, especially in systems with many inactive users.
So how do we determine that a user is no longer active? We rely on the error code in FCM's response to determine that the token is no longer valid. We then mark the token as needing a refresh, so that when the user logs back into the app the client registers a new token by itself.
From then on, when the Push Worker sees that a token is marked invalid, it no longer sends notifications to that user.
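
A minimal sketch of this bookkeeping is shown below. Everything in it is an assumption for illustration: the class InvalidTokenTracker, the in-memory set (a real system would keep the flag in the db or Redis), and the choice to treat only the UNREGISTERED error code as a dead token. Check which codes your FCM client actually reports before relying on the mapping.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: remembers dead tokens and filters them out of later pushes.
class InvalidTokenTracker {
    // Assumption: codes that mean "this token will never work again".
    // UNREGISTERED is the FCM code for a token that is no longer registered.
    static final Set<String> DEAD_TOKEN_CODES = Set.of("UNREGISTERED");

    private final Set<String> invalidTokens = new HashSet<>();

    // errorCodes.get(i) belongs to tokens.get(i); null means the push was accepted.
    void record(List<String> tokens, List<String> errorCodes) {
        for (int i = 0; i < tokens.size(); i++) {
            String code = errorCodes.get(i);
            if (code != null && DEAD_TOKEN_CODES.contains(code)) {
                invalidTokens.add(tokens.get(i));
            }
        }
    }

    // Tokens still worth sending to, in their original order.
    List<String> sendable(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!invalidTokens.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    boolean isInvalid(String token) {
        return invalidTokens.contains(token);
    }
}
```

Transient errors are deliberately not treated as dead tokens here, since the same token may succeed on a later attempt.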

Other optimizations

How should we fetch the users in a set and their tokens?

We tend to default to for-loop thinking. For example, when you want to get 100 users from a set in Redis, what do you do?
You would write a for loop to fetch them:

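int BATCH_SIZE = 100;
List<String> users = new ArrayList<>(BATCH_SIZE);
for (int i = 0; i < BATCH_SIZE; i++) {
    // Each sPop call is a separate round trip to Redis.
    String user = redis.sPop(key);
    if (user == null) {
        break; // assumed: the client returns null once the set is empty
    }
    users.add(user);
}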

Doing that is equivalent to making 100 separate round trips to Redis, one per command.

You will find this is the same problem as the single push described above. To avoid it, Redis also provides ways to save round-trip time (RTT) and avoid context switching.

Redis pipelining: lets you send a batch of commands at once, without waiting for each result to come back as in the loop above.
Commands that return multiple results at once: for example, to pop 100 users you can use sPop(key, count).

The same applies to fetching tokens. Depending on whether you store tokens in a hash or as plain key-value pairs, you can use HMGET or MGET to get the tokens of many users at once.
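
To make the round-trip difference concrete, here is a small sketch that counts calls against an in-memory stand-in for a Redis set. FakeRedisSet and RoundTripDemo are invented for this example and are not a Redis client; the fake also pops members in insertion order, whereas the real SPOP removes random members. Only the number of calls is meant to carry over.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// In-memory stand-in for one Redis set; counts one "round trip" per call.
class FakeRedisSet {
    private final Deque<String> members = new ArrayDeque<>();
    int roundTrips = 0;

    FakeRedisSet(int size) {
        for (int i = 0; i < size; i++) {
            members.add("user-" + i);
        }
    }

    // Like SPOP key: one member per call, null when the set is empty.
    String sPop() {
        roundTrips++;
        return members.poll();
    }

    // Like SPOP key count: up to `count` members in a single call.
    List<String> sPop(int count) {
        roundTrips++;
        List<String> popped = new ArrayList<>(count);
        while (popped.size() < count && !members.isEmpty()) {
            popped.add(members.poll());
        }
        return popped;
    }
}

class RoundTripDemo {
    // The for-loop approach from the article: one call per user.
    static List<String> popOneByOne(FakeRedisSet set, int n) {
        List<String> users = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            String user = set.sPop();
            if (user == null) {
                break; // set exhausted
            }
            users.add(user);
        }
        return users;
    }

    public static void main(String[] args) {
        FakeRedisSet loopSet = new FakeRedisSet(150);
        List<String> viaLoop = popOneByOne(loopSet, 100);
        System.out.println("loop: " + viaLoop.size() + " users, round trips = " + loopSet.roundTrips);

        FakeRedisSet batchSet = new FakeRedisSet(150);
        List<String> viaBatch = batchSet.sPop(100);
        System.out.println("batch: " + viaBatch.size() + " users, round trips = " + batchSet.roundTrips);
    }
}
```

Both approaches return the same number of users, but the loop pays the network latency once per user while the count variant pays it once per batch.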

Saving push responses

Because Push Workers send push responses through a queue, they don't spend time saving to the db themselves, which leaves more capacity for pushing.
But as we keep saying, the responses still have to be saved eventually, so the question is how to save them efficiently.
This is the third time the same story comes up, and the answer is still batching. Our system uses MongoDB, which supports batch updates, just as in the FCM and Redis cases.
Apply it and you will see a remarkable speed improvement.
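
One way to picture the consumer side of that queue is the sketch below, which drains queued responses in chunks and hands each chunk to a bulk writer. The class ResponseBatchWriter and the Consumer-based bulkWriter are assumptions for illustration; in a MongoDB-backed system the writer would be whatever bulk operation the driver offers, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: saves queued push responses in batches, not one at a time.
class ResponseBatchWriter {
    // Drains the queue in chunks of at most batchSize, passes each chunk to
    // bulkWriter, and returns how many bulk writes were issued.
    static <T> int flush(BlockingQueue<T> queue, int batchSize, Consumer<List<T>> bulkWriter) {
        int writes = 0;
        List<T> batch = new ArrayList<>(batchSize);
        // drainTo moves up to batchSize queued items without blocking.
        while (queue.drainTo(batch, batchSize) > 0) {
            bulkWriter.accept(new ArrayList<>(batch)); // one db call per chunk
            batch.clear();
            writes++;
        }
        return writes;
    }

    public static void main(String[] args) {
        BlockingQueue<String> responses = new LinkedBlockingQueue<>();
        for (int i = 0; i < 2500; i++) {
            responses.add("response-" + i);
        }
        int writes = flush(responses, 1000,
                chunk -> System.out.println("bulk write of " + chunk.size() + " responses"));
        System.out.println("total bulk writes: " + writes);
    }
}
```

A long-running consumer would call flush on a timer or whenever the queue grows past a threshold, so that small trickles of responses are still persisted promptly.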

Conclusion

After reading this article, you can see how powerful batching is. A single concept has been applied in many different places.
And for the problem of how to eat an elephant, you probably now have the right answer: break the elephant into small portions (batches) and eat it gradually!
Note that the portions have to be just the right size. If you want to understand this more deeply, ask yourself why FCM only allows batches of 500, why db batches of about 1000 are effective, and how many commands per Redis pipeline are effective.

The article ends here. See you in the next posts.

Source : Viblo