Surely a lot of people know the principles of KISS (Keep It Simple, Stupid) and YAGNI (You Aren’t Gonna Need It). The rough Vietnamese equivalent is “using a buffalo-butchering knife to kill a chicken”. These principles are very important, yet they are violated all the time, because devs love taking risks and trying cool new models — and you can’t know the consequences until you try, but once you do, wow. You end up with a system that is both fragile and complex, packed with the latest technology, solving the simplest problems in the most convoluted ways, producing the buggiest, hardest-to-maintain software.
Today I want to talk about a problem I run into fairly often: normalization and denormalization. Let me briefly explain the two.
First, what is normalization?
Advantages:
- Data is stored in one place, so duplication is avoided.
- It saves storage space.
- An update touches just one record. For example, if manager A’s area changes, you update a single record in the Manager table; no other table has to be modified.
Disadvantages:
- Joining large tables hurts performance (remember: it is joins over big data that hurt).
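To make the one-record-update point concrete, here is a minimal sketch of the normalized layout, using sqlite3 from the Python standard library. The table and column names (Manager, Employee, area) are illustrative, not taken from any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Manager  (id INTEGER PRIMARY KEY, name TEXT, area TEXT);
    CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT,
                           manager_id INTEGER REFERENCES Manager(id));
""")
conn.execute("INSERT INTO Manager VALUES (1, 'A', 'Hanoi')")
conn.executemany("INSERT INTO Employee VALUES (?, ?, 1)",
                 [(i, f"emp{i}") for i in range(1, 1001)])

# Manager A moves to a new area: exactly one row changes,
# no matter how many employees reference this manager.
cur = conn.execute("UPDATE Manager SET area = 'Da Nang' WHERE id = 1")
print(cur.rowcount)  # 1
```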
So what is denormalization?
Advantages:
- Queries are fast: no joins needed.
Disadvantages:
- You lose all the advantages of normalization.
- If the data changes often, it becomes a nightmare. What if you rename a manager (rare, admittedly, but it happens) who is managing thousands of people? That one change means updating thousands of records.
- If that manager leaves and is replaced by someone else (this happens all the time), how many records do you have to update then?
- In general, modifying the data is expensive; if the data is written frequently, denormalization becomes a real problem.
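The counter-sketch: the same data denormalized, with the manager’s name copied onto every Employee row (again an illustrative schema). Renaming the manager now touches a thousand rows instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT,"
             " manager_name TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, 'A')",
                 [(i, f"emp{i}") for i in range(1, 1001)])

# The manager is renamed (or replaced): every employee row must be rewritten.
cur = conn.execute(
    "UPDATE Employee SET manager_name = 'B' WHERE manager_name = 'A'")
print(cur.rowcount)  # 1000
```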
That’s the theory. Now let me get into the real stories.
Story 1: The organizational structure story. We all know the organizational structure is an important part of a business: it is referenced everywhere, and in practice it rarely changes, so denormalizing — storing redundant structure names in other tables — is acceptable. So why am I bringing it up? Because… wait for the next post! That’s what people usually say, but let me describe it now. Suppose your organizational structure is stored as follows.
This is the parent–child model based on parentID. Its drawback is that finding ancestors or descendants requires recursion, which hurts performance. So the dev team used the PATH ENUMERATION model to speed up queries. For tree patterns in databases, you can refer here: see page 55 for PATH ENUMERATION and page 48 for tree types. The table then looks like this.
Now, to find all the children and descendants of the Hanoi branch, you just query:

SELECT * FROM Organization_unit WHERE path LIKE '1/2%'
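Here is a runnable sketch of that PATH ENUMERATION query with sqlite3. The ids and paths are made up: 1 is the company root and 1/2/ is the Hanoi branch; I store paths with a trailing slash and match on the '1/2/' prefix so that, say, a unit 20 would not match by accident.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organization_unit"
             " (id INTEGER PRIMARY KEY, name TEXT, path TEXT)")
conn.executemany("INSERT INTO Organization_unit VALUES (?, ?, ?)", [
    (1, "Company",        "1/"),
    (2, "Hanoi branch",   "1/2/"),
    (3, "Ba Dinh branch", "1/2/3/"),
    (4, "Team X",         "1/2/3/4/"),
    (5, "HCM branch",     "1/5/"),
])

# All descendants of the Hanoi branch: a single LIKE on the path prefix,
# no recursion needed.
rows = conn.execute(
    "SELECT name FROM Organization_unit"
    " WHERE path LIKE '1/2/%' AND id <> 2 ORDER BY id"
).fetchall()
print([r[0] for r in rows])  # ['Ba Dinh branch', 'Team X']
```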
That’s it. Convenient, isn’t it? And it costs nothing to store this one extra field on the Organization_unit table itself. However, for fear of excessive JOINs, the dev team decided to copy the path into every table that uses it. For example, the employee table was stored as follows:
As you can see, with this layout, renaming a structure — the Ba Dinh branch, say — is fine: only the employees of the Ba Dinh branch need their stored structure name updated. But what about path — when does that field get updated? As the excellent SQL Antipatterns slides I mentioned above point out, the path model gets complicated when things change: updating paths after a tree shift, such as moving a branch or promoting it up a level, means changes in many places. To keep the code simple, the dev team decided that on every change to the structure tree (add, edit, delete), they would rewrite the entire path field for the whole structure table — and for every table where the path field was copied. So every structure update, even of a single record, rewrote the path field in all the tables that stored it, several of which had millions of records. Result: the database hung, the customers screamed, and I went numb.
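For what it’s worth, path enumeration does not inherently force a whole-table rewrite on every change: a tree move can be scoped to just the affected subtree with a prefix replace. A minimal sketch, where the schema, ids, and paths are all made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organization_unit (id INTEGER PRIMARY KEY, path TEXT)")
conn.executemany("INSERT INTO Organization_unit VALUES (?, ?)", [
    (1, "1/"), (2, "1/2/"), (3, "1/2/3/"), (4, "1/2/3/4/"), (5, "1/5/"),
])

# Move unit 3 (and everything under it) from under 1/2/ to under 1/5/:
# swap the '1/2/' prefix for '1/5/', but only on the moved subtree.
cur = conn.execute("""
    UPDATE Organization_unit
    SET path = '1/5/' || substr(path, length('1/2/') + 1)
    WHERE path LIKE '1/2/3/%'
""")
print(cur.rowcount)  # 2 — only the moved subtree is rewritten, not the table
```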
Story 2: The edit history story.
Each record stores its own edit history, and following the denormalize-for-speed idea, the dev team decided to put an edit history field right on the original data table, like this:
Stored this way, reading the history back is instantly fast, and writing is just appending to the end of the field — nothing more. Then one fine day, well under 1000 years later, the log data naturally grew too large. On the list screen, for convenience (and to follow the recommendation of not hitting the server many times), the dev also loaded the history together with the data. Over time the accumulated history became huge and seriously slowed data loading. On top of that, searching by edit date was painful: with everything packed into one field, you can’t use a normal SELECT to pull out just what you need, and loading thousands of lines of history is pointless anyway.
So the dev team had to grind through a migration, building a separate, normalized history table — and from then on, querying was faster (yes, the normalized version turned out faster than the denormalized one). Data could be loaded with pagination, the list screen ran smoothly without loading any history, inserting was simple and fast, and looking history up by year became convenient. When it was done, I felt light. Oh, and another project had the same problem: syncing to Elasticsearch, where the history was so big it hung the sync process (tears again).
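A sketch of the kind of normalized history table the team ended up with (the names Record, History, edited_at are my own, illustrative ones). History rows are appended as separate rows, so they can be paged and filtered with ordinary SQL instead of parsing one huge packed blob.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Record  (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE History (id INTEGER PRIMARY KEY,
                          record_id INTEGER REFERENCES Record(id),
                          edited_at TEXT, change TEXT);
    CREATE INDEX idx_history_record ON History(record_id, edited_at);
""")
conn.execute("INSERT INTO Record VALUES (1, 'current value')")
conn.executemany(
    "INSERT INTO History (record_id, edited_at, change) VALUES (1, ?, ?)",
    [(f"2023-01-{d:02d}", f"edit {d}") for d in range(1, 31)])

# Page 2 of the history, 10 entries per page — the list screen never
# has to load all of it, and date filters are plain WHERE clauses.
page = conn.execute("""
    SELECT edited_at, change FROM History
    WHERE record_id = 1
    ORDER BY edited_at DESC
    LIMIT 10 OFFSET 10
""").fetchall()
print(len(page))  # 10
```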
Story 3: The master-detail story.
Before this story, let me explain what master-detail is. On Facebook, for example, your post is the master and its comments are the detail. Under the normalized model you would save comments in their own table and posts in another. But once again, to avoid JOINs out of fear for performance, the dev team decided to store them in the same table. The problem became the same as the history one — except worse, because while history is append-only, comments can be edited (and in the real project the detail had many fields and was edited a lot). Every edit meant unpacking the data blob, finding the right spot inside it, and fixing it there (just hearing that makes my heart pound and my legs shake). The upshot: editing a comment took forever, and loading data was slow too, because every load pulled in all the comments. The program crawled, the customers called again. And, as you can guess, simply going back to good old KISS and YAGNI brought the smiles back.
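A sketch of the master–detail split, using the Facebook-style example from the text (the Post and Comment table names are made up). With comments in their own table, editing one comment is a single-row UPDATE; nothing is unpacked or repacked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Post    (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE Comment (id INTEGER PRIMARY KEY,
                          post_id INTEGER REFERENCES Post(id),
                          content TEXT);
""")
conn.execute("INSERT INTO Post VALUES (1, 'hello world')")
conn.executemany("INSERT INTO Comment VALUES (?, 1, ?)",
                 [(i, f"comment {i}") for i in range(1, 101)])

# Edit one comment out of 100: exactly one row changes.
cur = conn.execute("UPDATE Comment SET content = 'edited' WHERE id = 42")
print(cur.rowcount)  # 1
```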
PS: So does JOIN actually hurt performance? People are often afraid of JOINs, but understand one thing: JOINs are scary with big data — joining two large tables can indeed cause problems. In stories 2 and 3, however, we had already identified the exact record to retrieve. So the JOIN touches only one record, and it goes through the primary key and foreign key (and a foreign key should always have an index), so it doesn’t hurt performance at all! Denormalization mainly pays off when you retrieve many records, or when the retrieval condition isn’t on the primary and foreign keys. If you’re querying by key, denormalization isn’t necessary.
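The point about keyed JOINs can be checked directly: with an index on the foreign key, fetching one master’s details is an index search, not a table scan. A sqlite3 sketch (table names are mine; the EXPLAIN QUERY PLAN output wording is SQLite-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Post    (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE Comment (id INTEGER PRIMARY KEY, post_id INTEGER,
                          content TEXT);
    -- the index on the foreign key that the text insists on
    CREATE INDEX idx_comment_post ON Comment(post_id);
""")

# Ask SQLite how it would execute a keyed master-detail JOIN.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT p.content, c.content
    FROM Post p JOIN Comment c ON c.post_id = p.id
    WHERE p.id = 1
""").fetchall()
details = " ".join(row[3] for row in plan)
print(details)  # shows a SEARCH using idx_comment_post, not a full SCAN
```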