Distributed monolith and an amazing save by the DevOps team


Whether the code is right or wrong, the victim is always the DevOps team.

I just made that sentence up for fun, but after thinking about it for a while, I believe it deserves to become a new truth of the infrastructure world, no joke. Fellow DevOps, save it somewhere, so that when the system breaks you have something to use as a lamenting caption on Facebook.

This is the story of deploying a monolith application as a distributed system, and a more positive view of what is usually called an anti-pattern in the microservice world.

First things first

First, the familiar greeting: I am Minh Monmen, a Herbicide. That is what we call the DevOps team at our company, because this team specializes in finding and eradicating developer code. Besides weeding, I also take on the role of solution architect for many different projects. In general, my work revolves around a loop of finding – killing – bringing solutions.

This story happened a long time ago, back when I was working on a social network. But recently I ran into a similar situation, this time as an observer, which reminded me to tell it to you. Hopefully, after this story, you will have an emergency backup plan if you ever find yourself in a similar spot.

Before starting, I also hope that you already understand what the terms microservice, monolith and distributed mean, so I don't have to waste time explaining them again. The article may also touch on other knowledge or terms related to kubernetes, api gateway, … which you can look up yourself.

Ready to sit down with some tea and sweets and listen? OK, let's start.

Story from a monolith

Everyone starts with a monolith. And it SHOULD be that way.

In the era of cloud, distributed systems and microservices, every developer wants to build a microservice system from scratch, i.e. build it to that standard right from the beginning. There is nothing surprising about that, because nobody wants to sit and fix a slow, old, traditional app, bolting a head and a tail onto it so it looks like microservices. Everyone wants their work to be shiny and done right away.

But life is not a dream, and the story of building a product is certainly not a poem. Raising a child has to start from birth; only after years of care does it grow into a beautiful, well-behaved student. It's not as if you could stay pregnant for 18 years and then give birth to a fully grown adult who needs no raising at all.

Making a product means considering how to launch it as soon as possible. Once it is released, we can fix things as we go. I already covered the topic of over-engineering in the article Software Architect – Bad practices, so you can refer to that.

OK, so start from a monolith to ship the product fast. In the beginning we, too, started from a single PHP monolith containing all the logic.

It ran fine in the dev environment with a handful of internal users. But once it went to production, and with the media effect on top, quite a lot of users came in, and the app rolled over and died. Although this was expected 😅, the fact that it died so quickly left us quite shocked and confused about how to handle it. Many fixes were applied overnight, such as scaling the database, creating more indexes, scaling the app, and so on.

These measures were all more or less effective; gradually the load on the database and backend service went down, and the application became faster to access. However, it only took a while for the app to die again.

Well then, time to check the system:

  • Database load was normal
  • App load was normal
  • The number of users was not (that) high

However, nginx (which we had put in front of php-fpm) kept returning status 499 (nginx's code for a client that gave up and closed the connection before getting a response), and the mobile app could not call the API at all, leaving it paralyzed.

Initially, I thought the php-fpm config running in the container was off, so it created too few workers and could only handle a few requests at a time. However, even after we raised the php-fpm worker settings and scaled the app to nearly 200 instances (200 pods on kubernetes), it still only took a while for it to die.
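
For context, the knob in question is php-fpm's worker pool. Here is a minimal sketch of how such a pool config might be shipped into the pods via a ConfigMap; the names and numbers are illustrative, not our actual settings:

```yaml
# Hypothetical ConfigMap holding the php-fpm pool config, mounted into the backend pods.
# pm.max_children caps how many requests a single pod can handle concurrently,
# so raising it (or adding pods) only helps while the workers are not stuck waiting.
apiVersion: v1
kind: ConfigMap
metadata:
  name: php-fpm-pool
data:
  www.conf: |
    [www]
    listen = 127.0.0.1:9000
    pm = dynamic
    pm.max_children = 40      ; more workers per pod
    pm.start_servers = 10
    pm.min_spare_servers = 5
    pm.max_spare_servers = 15
```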

The scene at the time was exactly this:

The boss paced back and forth, the devs stood bewildered: “What's wrong with my code?”

The third-party service at fault

After using all 72 magic spells to debug, we finally figured out who the real culprit behind the system's death was.

The thing is, we had originally used an open-source chat platform from a third party. Although it worked very well as an internal chat system, when scaled up to serve the general public it simply could not keep up and responded very slowly. The waiting on requests to that chat system filled up all the php-fpm worker processes, so ordinary requests to the app were also rejected by nginx after waiting too long.

Knowing the offender was already a significant victory. But how to fix it was a problem we had never faced before. Of course, everyone knows that in this situation you should optimize the chat service for faster responses, put timeouts in your backend so it doesn't wait on the chat service for too long, and so on. But all of those textbook methods require one ingredient: time. And time was something we did not have.

At that time, service mesh, traffic control and circuit breakers were not yet popular, so don't tell me: dying because one service calls another is an inherent problem of microservices, why didn't you just apply those measures?

Distributed monolith to the rescue

After racking our brains, the only feasible (and quick) solution we came up with at the time was to accept a sacrifice.

Our system was deployed on kubernetes using a helm chart and managed with ArgoCD. So we took advantage of how easy deployment and routing were there: we deployed the very same app as a second deployment, then routed all the chat-related APIs to that new deployment. And our first distributed monolith was born.

With this setup, all the core APIs related to users, posts, comments, etc. are served by one deployment, and chat is served by another. Of course, the chat deployment kept dying on and off afterwards, but chat was just an in-app feature, and most users could still use the other features normally. This is accepting that one part of the system gets sacrificed for a higher goal. Later on, I kept running into similar problems, where some secondary resource kills the main service, or a (less important) read action kills a (more important) write action; the fastest way to handle them was to create multiple deployments with separate configs to limit the impact of those actions.
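
For illustration only, here is a minimal sketch of what that split can look like on kubernetes, assuming an ingress-based setup and hypothetical names (backend-core, backend-chat); this is not our actual manifest:

```yaml
# Second Deployment running the exact same monolith image, reserved for chat traffic.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-chat
spec:
  replicas: 10
  selector:
    matchLabels: { app: backend, tier: chat }
  template:
    metadata:
      labels: { app: backend, tier: chat }
    spec:
      containers:
        - name: php
          image: registry.example.com/backend:1.2.3   # same image as backend-core
          ports:
            - containerPort: 8080
---
# Service in front of the chat clone (backend-core has an equivalent Service).
apiVersion: v1
kind: Service
metadata:
  name: backend-chat
spec:
  selector: { app: backend, tier: chat }
  ports:
    - port: 80
      targetPort: 8080
---
# Route only the chat-related API paths to the sacrificial deployment;
# everything else keeps hitting backend-core.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: backend
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/chat
            pathType: Prefix
            backend:
              service:
                name: backend-chat
                port: { number: 80 }
          - path: /
            pathType: Prefix
            backend:
              service:
                name: backend-core
                port: { number: 80 }
```

If the chat clone falls over, only the /api/chat paths go down with it; users, posts and comments keep working.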

Although plain helm can already deploy one chart many times with different values files, I recommend using a helm deployment management tool so these clones are easier to keep track of. The one we use is ArgoCD.
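
To make that concrete, here is a sketch of how the same chart can be instantiated as two ArgoCD Applications, one per clone, differing only in the values file; the repo URL, paths and file names are hypothetical:

```yaml
# One ArgoCD Application per clone of the same helm chart.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: backend-chat
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/our-team/backend-chart.git   # hypothetical repo
    targetRevision: main
    path: charts/backend
    helm:
      valueFiles:
        - values-chat.yaml    # e.g. different replica count, resources, ingress paths
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

A second Application (say, backend-core) would point at the same chart with values-core.yaml, and ArgoCD keeps both clones in sync from the same source.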

Some more sharing

The root of the problem is the interdependence between components in a microservice system: when service A depends on service B, B's death takes A down with it, or service B's data has to be shared with service A.

This is also one of the hardest problems when implementing microservices. There are plenty of design patterns for dealing with it, such as:

  • Combining services, or redefining their responsibilities, when two services are too dependent on each other.
  • Creating multiple components inside one service that share data but do not call each other (e.g. separate read and write components).
  • Using an event/subscription model to sync data between services, so each service manages its own data.

Or, if cross-calling cannot be avoided because the system design cannot be changed, you can use tools that help manage it, such as a service mesh with circuit breaker and timeout features, to keep the services from dragging each other down. However, when applying a service mesh, pay attention to the latency it adds to every service-to-service call. Sometimes that number is very large and you will not accept it easily (we, for example, did not dare to use it).
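
For reference only (we did not actually deploy this, and every name and number below is made up), this is roughly what a timeout plus circuit breaker looks like when expressed as Istio-style service mesh config:

```yaml
# Cap how long callers wait for the chat service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: chat-svc
spec:
  hosts:
    - chat-svc
  http:
    - timeout: 2s
      route:
        - destination:
            host: chat-svc
---
# Eject chat instances that keep failing, so callers fail fast instead of queuing up.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: chat-svc
spec:
  host: chat-svc
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 50
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 100
```

The timeout makes callers give up on a slow dependency quickly, and the outlier detection ejects instances that keep failing, which is exactly the "don't let one slow dependency eat all your workers" behavior we were missing back then.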

Another note: the way services call each other should also follow a clear layering strategy. If you can avoid it, don't let every service call every other service directly in a tangled web.

Instead, divide the system clearly into two layers:

In this layering:

  • The core service layer directly manages data and performs the main operations. Inside the core layer you can organize many small services for specialized jobs, such as optimizing reads, writes, and so on.
  • The edge service layer aggregates data from the different core services and returns it to the user.

With a design like this, cross-calling only happens at the edge services, and it becomes much easier to optimize connections, timeouts, retries, failover, and so on.

For example:

  • Core services include post svc, comment svc, profile svc, …
  • Edge services include feed svc, search svc, …
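
One way this layering shows up in config (again, a hypothetical sketch, not taken from a real system) is that only the edge services are exposed at the gateway, while the core services remain plain cluster-internal Services that only the edge layer calls:

```yaml
# Only edge services face the outside world.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: edge
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /feed
            pathType: Prefix
            backend:
              service:
                name: feed-svc
                port: { number: 80 }
          - path: /search
            pathType: Prefix
            backend:
              service:
                name: search-svc
                port: { number: 80 }
# post-svc, comment-svc and profile-svc stay as ClusterIP Services,
# reachable only from inside the cluster (i.e. from the edge layer).
```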

Summary

When people read about the distributed monolith, they usually only find negative articles: it's an anti-pattern, it's doing microservices wrong, it's something to avoid, and so on. However, in some special cases it can serve as an emergency firefighting plan until the developers can change the code.

Some signs that this solution is worth considering when you run into trouble:

  • The system's problems come from secondary resources, third parties, or unimportant actions affecting important ones.
  • The secondary resource can be separated from the main resource.
  • Traffic to the secondary resource can be split off at the routing step.

One note: this is only a temporary solution on the road to microservices. Don't abuse it and think it is the thing that will carry your heavy monolith application into a new era.

Cordial greetings, and onward to victory.


Source: Viblo