How does my NodeJS API handle peak traffic?


An optimization story that turned into a workshop and now into this article, on the topic of building high-load API applications with NodeJS.

Reading this article, you won't just get a keyword or an arrow pointing somewhere else to go look; you'll get concrete code you can apply to your project today. So what are you waiting for? Scroll down.

First things first

Yeah, but don't be in a hurry. My name is Minh Monmen, and today I will continue to show you how a farmer doing technology differs from a scientist doing technology.

This is still the story of the ten-thousand-CCU livestream page (actually about 3.5k) from the article Me, NuxtJS and the live stream of ten thousand CCUs, but this time on the backend. Modest as that number is, let's still call it a peak, and there is still a story to tell. Once I actually reach tens of thousands and then millions of CCUs, I'll hide all the secrets away in private inboxes and charge consulting coins instead of sharing them publicly like this.

OK yet? Come on!

A little bit about the problem

As in the previous post, my problem is a livestream page with ten thousand CCUs (i.e. aiming for ten thousand CCUs =))). Some of the basic requirements are as follows:

  • The system is read-heavy, with a read/write ratio of almost 99/1
  • The data on the page is mostly dynamic and can be customized per user
  • The hot data changes relatively often and is updated by a worker system behind the scenes

As for the livestream infrastructure itself, another party has already taken care of and optimized it; my job is simply to build a website where users can watch and interact.

The architecture I’m using

Currently, I'm running a very basic model… a monolith. Of course, I have separated the frontend (NuxtJS) from the backend API (Node.js). This is the first step toward being able to scale easily, without needing a big architecture like microservices.

This livestream page has quite a bit of dynamic data, and to develop it quickly while staying compatible with most devices, from ancient to brand new, from laptop to mobile to embedded browsers, we still use a short-polling mechanism. You can google plenty of articles about other realtime data update techniques such as long-polling, server-sent events, websockets,… but I guarantee none of them are as easy to implement and ship as quickly as short polling. And that is the most important factor for this product: speed of development.
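On the client side, short polling is nothing more than re-fetching the same endpoint on a timer. A minimal sketch (the 15-second interval matches the one mentioned later in this article; renderItem is a hypothetical render function):

```javascript
// Minimal short-polling loop: re-fetch the item every 15 seconds.
// Once ETags are enabled (described later in this article), the browser's
// HTTP cache revalidates transparently, so unchanged data costs almost no
// body bandwidth.
async function pollItem(id, intervalMs = 15000) {
  const tick = async () => {
    try {
      const res = await fetch(`/items/${id}`);
      renderItem(await res.json()); // hypothetical render function
    } catch (err) {
      console.error('poll failed', err); // ignore and retry on the next tick
    } finally {
      setTimeout(tick, intervalMs);
    }
  };
  tick();
}
```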

On the frontend, the framework is NuxtJS. Mainly because Vue (Nuxt's base) has an easy-to-learn, familiar syntax, and choosing Nuxt gives me Server-Side Rendering right from project init. Nuxt supports both SPA and SSR modes, so whichever I end up needing, I don't have to bolt it on later.

On the backend, my web server of choice is Fastify. Of course it's not just because the internet says Fastify is fast with low overhead and I applied it blindly. What matters is knowing when you need to care about overhead. If your team is still drowning in slow queries and latencies measured in hundreds of milliseconds or seconds, then there's no point worrying about so-called framework overhead. But when an API is measured in a few milliseconds to tens of milliseconds, that's when you pick a framework because of its overhead.

Framework overhead is a concept used when comparing the processing capabilities of two frameworks. Basically, you compare two empty requests and see which one has the lower latency, i.e. which one is faster. For example, comparing Fastify with Express, Fastify has the smaller overhead and is therefore faster.

The most important thing when making a choice for performance reasons is to really understand what you're trading off and whether the trade is worth it. For example, by choosing Fastify I have to accept that I'm giving up a lot of features Express has out of the box, and that my devs may not be as familiar with Fastify as they are with Express,…

Of course, with a reasonable application architecture, changing the web server is not a big deal. It’s even quite easy. Take a look at my project directory structure:
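Roughly, it looks like this (an approximate sketch; the exact folder and file names are illustrative):

```
src/
├── api/          # web-server entrypoint: controllers, routes, validators only
│   ├── controllers/
│   ├── routes/
│   ├── validators/
│   └── index.js  # starts the Fastify server
├── worker/       # worker entrypoint: cron jobs, queue consumers
│   └── index.js
└── common/       # shared business logic, models, cache helpers
```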

Here I've implemented a multiple-entrypoint structure to start the different components of the project, such as api, worker,… Shared logic lives in common. The api directory contains only web-server concerns such as controllers, routes, validators,… and no business logic. That's why swapping the web server from the familiar Express to Fastify was relatively easy.

Some basic numbers

Ok, before we start, let’s go through the basic numbers related to the performance of the current system.

Let’s start with one of the most popular API calls in the app, which is the get item detail API:

  • Test machine configuration: Core i5 4200H 4 cores, 12GB RAM + SSD
  • Test environment: local DB in Docker, code running as a single Node process
  • Endpoint: /items/:id

All the results here come from testing my real running code with real data, so reality is a bit more complicated than the code shown below. The performance-relevant differences, however, are all reflected.

To make the results comparable and exclude network-related factors, I added one more empty endpoint that does nothing but return a canned copy of the /items/:id response. That way every test returns the same response size, eliminating differences in network and data transfer.
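A sketch of what such a baseline endpoint could look like (the route name and canned payload are assumptions):

```javascript
// "Empty" baseline endpoint: returns a pre-built item payload without touching
// the database, so it measures only framework + serialization overhead.
const fastify = require('fastify')();

// Illustrative payload shaped like a real /items/:id response.
const cannedItem = { id: 1, name: 'sample item', description: 'lorem ipsum', price: 100 };

fastify.get('/items-empty/:id', async () => cannedItem);

fastify.listen({ port: 3000 });
```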

Each test was run with 100k requests at a concurrency of 200 using the k6.io tool.
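For reference, the k6 script behind these numbers boils down to something like this (host, port and item id are assumptions):

```javascript
// k6 load test: 200 virtual users sharing 100k total iterations.
import http from 'k6/http';

export const options = {
  vus: 200,           // concurrency
  iterations: 100000, // total requests, shared across the VUs
};

export default function () {
  http.get('http://localhost:3000/items/1');
}
```

The results: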

| No. | Test | HTTP time (s) | HTTP RPS | avg (ms) | min (ms) | max (ms) | p95 (ms) | Node CPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Empty GET | 18.5 | 5396 | 36.95 | 0.91 | 140.66 | 53.47 | 115.00% |
| 2 | Get by id | 154.9 | 645 | 309.59 | 174.96 | 557.06 | 349.28 | 230.0% |

The Empty GET row represents the overhead of the framework and of the response data itself: the code does almost nothing except return the right data, so it shows how good performance could possibly get.

In essence, my single Node instance has maxed out its CPU (100% meaning one full CPU core), i.e. it has used up all the resources it can. But then you may wonder: Node.js is always said to be single-threaded, so why does it sometimes use 115% CPU, and sometimes as much as 230% CPU (i.e. more than 2 cores)? Has the single-threaded label been a scam all along?

~> Not a scam, but it's not the whole story. The event loop that runs our JavaScript is single-threaded: we can be sure our own computational code runs on a single thread. The I/O underneath, handled by libuv (a library at the core of Node), is different: asynchronous work such as file system operations, DNS lookups and some crypto is performed on a thread pool with a default size of 4 threads (network sockets, by contrast, use the operating system's non-blocking I/O). The size of libuv's thread pool can be configured via the UV_THREADPOOL_SIZE environment variable.

I'm using a Postgres database with node-postgres as the driver; deep inside, that can go through libpq, and libpq works with libuv. That's why my Node process climbed to 230% CPU when I called the Get by id API.

There's a good trick question for Node.js developers: if you run only 1 instance, can a Node application use more than 1 CPU? If you only think "single-threaded" and quickly answer no, you'd be wrong.
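A quick standalone way to see this for yourself (not part of the project; the iteration counts are arbitrary): async crypto such as pbkdf2 is dispatched to the libuv thread pool, so a single Node process can exceed 100% CPU in top.

```javascript
// Run with: node threadpool-demo.js   (optionally UV_THREADPOOL_SIZE=8)
// Watch the process in `top`: it climbs well above 100% CPU because the
// pbkdf2 calls run on libuv's thread pool, not on the event-loop thread.
const crypto = require('crypto');

for (let i = 0; i < 4; i++) {
  crypto.pbkdf2('password', 'salt', 5_000_000, 64, 'sha512', (err) => {
    if (err) throw err;
    console.log(`pbkdf2 #${i} done`);
  });
}
```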

Implement cache-aside model with Redis

Calling the database directly, performance is of course not great. In my experience, on some systems 10 CCU generate about 1 req/s, so the 645 req/s above can only cover about 6k CCU, and that's under very ideal conditions: little data and users only calling get item by id. At that rate, my journey to 10,000 CCU is going to cost a lot more resources.

No way am I going to keep paying tribute to the cloud capitalists like that. So let's start with the simplest solution everyone uses, caching with Redis, and see how it does.

(Diagram: the cache-aside model with Redis.)

The simplest way to cache with Redis is to use the cache-aside model. I can encapsulate this cache implementation with the following function:
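A minimal sketch of such a helper (the Redis client is assumed to be ioredis; the key naming and JSON serialization are assumptions):

```javascript
// Cache-aside helper: try Redis first, fall back to the data source on a miss,
// then store the result with a TTL so subsequent calls hit the cache.
const Redis = require('ioredis');
const redis = new Redis(); // assumes a local Redis instance

async function getOrSet(key, ttlSeconds, getData) {
  const cached = await redis.get(key);
  if (cached !== null) {
    return JSON.parse(cached); // cache hit
  }

  const value = await getData(); // cache miss: hit the database
  await redis.set(key, JSON.stringify(value), 'EX', ttlSeconds);
  return value;
}
```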

Then my code calls it like this:
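Something like this in the controller (the route, cache key format and itemRepository call are assumptions):

```javascript
// Item detail endpoint: cache the response under `item:<id>` for 30 seconds.
fastify.get('/items/:id', async (request) => {
  const { id } = request.params;
  return getOrSet(`item:${id}`, 30, () => itemRepository.findById(id));
});
```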

Done: I've cached the item response with a time-to-live of 30s. Let's test the performance with k6.io again:

| No. | Test | HTTP time (s) | HTTP RPS | avg (ms) | min (ms) | max (ms) | p95 (ms) | Node CPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Empty GET | 18.5 | 5396 | 36.95 | 0.91 | 140.66 | 53.47 | 115.00% |
| 2 | Get by id | 154.9 | 645 | 309.59 | 174.96 | 557.06 | 349.28 | 230.0% |
| 3 | Cache Redis | 24.1 | 4143 | 48.14 | 10.86 | 195.4 | 68.19 | 125.00% |

Well, the speed is pretty good, isn't it? With the first step done, I applied this caching mechanism to a number of other data-fetching APIs on the page, with TTLs depending on how often the data changes in each place.

Short-polling and the public bandwidth problem

Everything was fine until the CCU count grew and I discovered a problem: our public output bandwidth skyrocketed to hundreds of Mbps. For a normal website, a number close to 100Mbps is not trivial. And when you're targeting 10,000 CCU but currently only have 2-3k CCU, seeing bandwidth that high means you should take another look. Especially since I had already switched the user-facing application to Nuxt CSR instead of SSR (i.e. almost all static assets are cached on a CDN and the HTML is extremely light).

I quickly found the problem: it was the short-polling mechanism we were using.

In the previous article Me, NuxtJS and the live stream of ten thousand CCUs, I already touched on this issue. I'll just reiterate that short-polling is a reasonably good fit for the problem we're solving and for the resource budget of this project.

Above, I chose Fastify as the web framework, and here is the first trade-off of that performance choice: I have to learn to do everything myself. Express had ETag support preconfigured for me, marking unchanged responses and returning HTTP 304 so the client doesn't spend bandwidth re-downloading the response body. With Fastify I have to set this up myself. It's not hard; the hard part is knowing that you need to do it at all.
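With Fastify this can be done with a plugin. A sketch assuming the @fastify/etag plugin:

```javascript
const fastify = require('fastify')();

// Generate an ETag for every response and answer with 304 Not Modified when
// the client's If-None-Match header matches, so unchanged poll responses
// cost headers only, not the full body.
fastify.register(require('@fastify/etag'));
```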

After turning on ETag, public output bandwidth dropped by about 4 times, which is reasonable: our data changes roughly once a minute while short-polling is configured at 15s per request, so most polls return an unchanged response.

In addition, for some special list-style APIs, instead of refreshing the entire list I made a few more simple optimizations, separating data that rarely changes from data that changes often, so that only the frequently-changing list is refreshed. That cut public output bandwidth by another few times.

As a result, with 3k CCUs , my public output bandwidth is only about 10Mbps .

The power of promises and the Thundering Herds problem

Although I was quite satisfied with having the system running fine in such a short time, I couldn't help noticing a few ominous signs while monitoring database load and slow queries.

  • Slow queries occur when database CPU is high.
  • Many duplicate slow queries appear at the same time.

Ok, my system regularly has cron jobs pulling in large amounts of data updated by other systems. High database load at those times is normal, and even slow queries under that load are acceptable. But many duplicate slow queries at the same time is a sign of impending collapse.

If you've read my Caching Dafa series, you'll recognize what this means: it's exactly the Thundering Herd problem, aka Cache stampede. The cache keeps missing at the same moment because the first request hasn't finished (and populated the cache) before the next requests arrive. Those requests all go straight to the database, piling more load onto it until it's overloaded and the system dies.

To solve this, the natural idea is: when the first request misses the cache, mark the database call as in progress, so subsequent requests don't hit the db directly but instead wait together with that first call. Luckily, Node.js is essentially single-threaded, so it's easy to create and access a global variable without ever worrying about race conditions.

To lock the database call, I use the following code:
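A sketch of that lock, following the description below (the callingMaps name, the un-awaited getData() call and the value = await promise step come from the text; the rest is an assumption):

```javascript
const Redis = require('ioredis');
const redis = new Redis();

// One in-flight promise per cache key: while a database call for a key is
// running, every other request for that key awaits the same promise instead
// of issuing its own query.
const callingMaps = new Map();

async function getOrSet(key, ttlSeconds, getData) {
  const cached = await redis.get(key);
  if (cached !== null) {
    return JSON.parse(cached); // cache hit
  }

  // Another request is already fetching this key: wait for its result.
  if (callingMaps.has(key)) {
    return callingMaps.get(key);
  }

  const promise = getData(); // intentionally NOT awaited: store the promise itself
  callingMaps.set(key, promise);

  try {
    const value = await promise;
    await redis.set(key, JSON.stringify(value), 'EX', ttlSeconds);
    return value;
  } finally {
    callingMaps.delete(key); // release the "lock" whether it succeeded or failed
  }
}
```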

Notice the line const promise = getData();. I didn't forget the await when calling getData; it's intentional. getData() returns a promise, and I immediately store that promise in a global Map called callingMaps under the calling key. Now, if another request arrives while value = await promise; hasn't resolved yet, it will find the in-flight promise in callingMaps and wait for its result instead of sending another request to the db.

By saving the promise of the in-flight db call, I've eliminated all redundant requests to the database when the Thundering Herd phenomenon occurs. Even if 1,000 requests arrive at the same time on a cache miss, only 1 of them reaches the db.

Caching promises is an extremely powerful technique if you know how to use it properly. Just remember not to reflexively add await to const promise = getData();, or the whole trick falls apart =))))

Memory cache and internal bandwidth problem

Having more or less solved the public bandwidth problem, we move on to internal bandwidth. Although we used promises to fight the Thundering Herd by keeping the in-flight promise around, we quickly realized that wasn't enough for a system whose CCU keeps growing.

As you can see from the monitoring graphs, our internal bandwidth is about 10 times the public bandwidth. This is after optimizing public bandwidth with ETag: the system still spends internal bandwidth fetching data from Redis; it just no longer spends public bandwidth sending a response that matches the client's ETag.

Still the same story: 100Mbps of internal traffic is nothing on its own, but after operating many systems running Redis in the cloud, we've seen Redis bandwidth reach several hundred Mbps, with "waiting for network" showing up quite a lot. On top of that, internal VPS networking in the cloud comes with no specific bandwidth commitment, so burning a few hundred Mbps there is quite risky.

So how do we optimize internal bandwidth on a running system? Simple: save it by… not using it anymore.

If I switched the Redis cache entirely to an in-memory cache, the network problem would be almost solved. But using only a memory cache brings countless other problems, such as:

  • Invalidating the cache is difficult
  • Memory on a single instance is limited
  • Cold cache on every re-deploy of the service
  • Horizontal scaling reduces cache efficiency (low cache hit ratio),…

And so on and so forth. This livestream page has a relatively complex backend with many APIs and different kinds of data; it's not as simple as the service in the "Super fast API" problem with Golang and MongoDB, so converting completely to a memory cache is not suitable at this point.

Ok, if we can't switch completely, let's use both. We can add one more function that wraps the getOrSet above, along these lines:
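A sketch of that wrapper (getOrSetBurst is a hypothetical name; node-cache and useClones: false are the library and option named below, the rest is an assumption):

```javascript
// Two-level cache: a small in-memory "burst" cache in front of the Redis cache.
// It stores the *promise* returned by getOrSet (no clone, thanks to
// useClones: false), so all callers within burstTtl seconds share one Redis
// round-trip. getOrSet is the Redis cache-aside helper from above.
const NodeCache = require('node-cache');

const memCache = new NodeCache({ useClones: false });

function getOrSetBurst(key, ttlSeconds, burstTtl, getData) {
  const cached = memCache.get(key);
  if (cached !== undefined) {
    return cached; // a pending or already-resolved promise
  }

  const promise = getOrSet(key, ttlSeconds, getData); // no await: cache the promise itself
  memCache.set(key, promise, burstTtl);               // TTL in seconds
  return promise;
}
```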

Here I removed the callingMaps mechanism built above, because it's no longer needed. Notice that when I call getOrSet in the wrapper (return getOrSet(...) or value = getOrSet()), there is no await: what gets returned and cached is the promise itself.

The wrapper uses node-cache, a very popular in-memory cache library for Node. Also note the useClones: false option at initialization: it makes node-cache keep the promise object itself instead of cloning it, which saves memory and lets every caller reuse the same promise created by the first request.

Using it is quite simple: just pass one extra parameter, burstTtl, the memory-cache duration, as in the example below.
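For example (the route, key format, 30s Redis TTL and 2s burst TTL are illustrative):

```javascript
// 30 seconds in Redis, 2 seconds in the local memory cache.
fastify.get('/items/:id', async (request) => {
  const { id } = request.params;
  return getOrSetBurst(`item:${id}`, 30, 2, () => itemRepository.findById(id));
});
```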

This burst time is shorter than ttl, the Redis cache time. I usually set burstTtl to only a few seconds, just enough to avoid hitting Redis too much during request spikes. That way the system keeps almost all the properties of the Redis cache, such as accepting invalidation with only a few seconds of delay, while the extra memory each instance uses stays relatively small,…

Let's test the performance again with k6.io:

| No. | Test | HTTP time (s) | HTTP RPS | avg (ms) | min (ms) | max (ms) | p95 (ms) | Node CPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Empty GET | 18.5 | 5396 | 36.95 | 0.91 | 140.66 | 53.47 | 115.00% |
| 2 | Get by id | 154.9 | 645 | 309.59 | 174.96 | 557.06 | 349.28 | 230.0% |
| 3 | Cache Redis | 24.1 | 4143 | 48.14 | 10.86 | 195.4 | 68.19 | 125.00% |
| 4 | Cache Mem (burst) | 22.1 | 4516 | 44.16 | 2.92 | 176.58 | 59.81 | 120.00% |

As you can see, the point of adding the memory cache is not speed: the benchmark shows it's only about 10% better than the Redis cache above. What it does is keep the data and code characteristics almost unchanged while cutting internal bandwidth by roughly 5-6 times.

This is how a bachelor of economics optimizes a system: improve what's there, rather than replacing every solution and creating new problems.

Summary

From this story, I took away 3 things:

  • Optimization is a balancing act between the output you gain and the input cost you pay.
  • The hard part is increasing output at a reasonable cost, not trading everything for the best possible numbers.
  • Caching itself is easy; combining cache types to fit a specific problem is hard, and that's where combat experience really counts.

That's all. Upvote if you found it interesting.


Source: Viblo