Introduce
In this article, I would like to introduce everyone to the algorithms and libraries to build the ID generator (ID generator) that I often use to solve these problems.
ID generator (ID generator)
First of all, you must have understood what an ID generator is, according to ChatGPT baby
An ID generator is a tool or software that creates unique identification numbers for entities such as individuals, products, or transactions. These IDs are used to efficiently identify and track items in databases and systems. Some common types of ID generators include sequential ID generators, random ID generators, and hash-based ID generators.
Vietnamese sub
An ID generator is a tool or software that generates unique identifiers for objects such as individuals, products, or transactions. These IDs are used to efficiently identify and track entries in databases and systems. Some common types of ID generators include sequential ID generators, random ID generators, and hash-based ID generators.
There are 3 most popular types of ID generators today
- Sequential ID generator : the most common one you often see is the ascending ID in the Databases (Mysql, PosgreSql…), using this method the advantage is that the id looks pretty, always ensuring uniqueness. However, the reverse is that DBs often use sequence to count incrementally and this counting needs to be processed sequentially when encountering a large number of requests, which will be a weakness that cannot be “swallowed”. This method is suitable for small/medium systems where the inserted data is not too large.
- Random ID generator : this method will randomize the characters/numbers based on the input input. However, to ensure that the ID does not match, the random result length usually needs to be large enough. The most famous library that every dev has heard of is probably UUID . In this article, I will introduce you to another library that is lighter, faster and better.
- ID generator based on hash(Hash) : this method uses the band algorithm by bringing in data from many sources, be it email, id server, random string…. The method combined with other algorithms has can generate IDs with extremely low duplication, commonly used in distributed applications. In this article we will learn about Twitter’s Snowflake algorithm.
Random ID generator
In this article, I use Go to make a demo
According to the developer, NanoId is faster than UUID
“Nano ID is quite comparable to UUID v4 (random-based). It has a similar number of random bits in the ID (126 in Nano ID and 122 in UUID), so it has a similar collision probability — for there to be a one in a billion chance of duplication, 103 trillion version 4 IDs must be generated”
According to NanoID, if you use Standard mode with a length of 21 characters, every 1,000,000 (1 million) reuquests per paper will take you more than 40,000 years to have a 1% duplicate rate.
Below is a picture of the calculation of duplicate probability of another tool of NanoID. virtual ha =))).
Usually if you just want to use zero, the possibility of duplicates will increase quite a lot, but if you need to generate IDs for 1 million requests per second, forget about NanoID, and if you have a few tens of thousands, then NanoID is still great.
Golang for example.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | func main() { canonicID, err := nanoid.Standard(21) if err != nil { panic(err) } id1 := canonicID() log.Printf("ID 1: %s", id1) // Makes sense to use CustomASCII since 0-9 is ASCII. decenaryID, err := nanoid.CustomASCII("0123456789", 12) if err != nil { panic(err) } id2 := decenaryID() log.Printf("ID 2: %s", id2) } |
And this is the result
1 2 3 4 5 6 | log.Printf("time start: %s", time.Now()) for i := 0; i < 10000000; i++ { nanoid.Standard(21) } log.Printf("end start: %s", time.Now()) |
I tried adding genera to 10 million codes with 21 characters long and it took about 31 seconds. With sequential processing, I find the speed to be very good.
OK! Now do a high-intensity test to see if there is any infection.
Case 1: create a week from 10 million codes
1 2 3 4 5 6 7 8 9 10 11 12 | m := make(map[string]int) log.Printf("time start: %s", time.Now()) for i := 0; i < 10000000; i++ { canonicID, err := nanoid.Standard(21) if err != nil { panic(err) } m[canonicID()] = 1 } log.Printf("end start: %s", time.Now()) log.Printf("total: %d", len(m)) |
Case 2: simulate 100 requests together, each request generates 1000 codes and woww, no duplicates detected.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | func main() { wg := &sync.WaitGroup{} wg.Add(1000) total := 0 startTime := time.Now() for i := 0; i < 1000; i++ { fmt.Println("* create chanel ", i) count := make(chan int) go Standard(count, wg, &total) } wg.Wait() log.Printf("total: %d", total) log.Printf("start time : %s, end time %s", startTime, time.Now()) } func Standard(ch chan int, wg *sync.WaitGroup, total *int) { m := make(map[string]int) for i := 0; i < 10000; i++ { canonicID, err := nanoid.Standard(21) if err != nil { panic(err) } m[canonicID()] = 1 } *total += len(m) wg.Done() ch <- len(m) } |
Case 3: create 1 million channels, each channel 100 requests.
ah, the code above you guys edit a little and then test it, my computer can’t run it.
In some cases where the programmer needs an ID with increasing ability, in this case NanoID will probably no longer be preferred, especially with distributed systems that will make the problem really difficult, However. consider the case of random code NanoID is still worth considering.
Distributed ID generator
As I mentioned in the ID generator section above, there will be a lot of inadequacies if using the DB incrementer, especially with distributed systems that will make the problem really difficult.
Twitter’s snowflake algorithm is the typical solution in this context
The code generated by the Snowflale algorithm is 64-bit and is divided into four parts:
- Do not use the first bit because this bit is the sign bit.
- timestamp: uses 41 bits to represent the timestamp when the request is received, in milliseconds.
- datacenter_id: 5 digits to represent the id of the datacenter.
- worker_id: 5 digits to represent the id of the server.
- sequence_id: the last is the 12-bit loop increment id (increments from 0 to 111111111111 and then returns to 0).
With this mechanism snowfllake can generate 2 ^ 12 = 4096
per millisecond or about 4096 million per second per server.
timestamp
stores 41 bits, so by calculation it will only work until 2039/9/7 23:47:35
timestamp
, datacenter_id
, worker_id
and sequence_id
are four fields, timestamp and sequence_id separately generated by the program at runtime. The datacenter_id and worker_id need to be obtained during the deployment phase and once the program has been started, it cannot be changed.
Let’s go to the demo, let’s go to Golang again today
lib: https://github.com/bwmarrin/snowflake
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | func main() { n, err := snowflake.NewNode(1) if err != nil { println(err) os.Exit(1) } for i := 0; i < 10; i++ { // tạo ID id := n.Generate() fmt.Println("id", id) fmt.Println( "node: ", id.Node(), "step: ", id.Step(), "time: ", id.Time(), "n", ) } } |
IDs are generated sequentially, isn’t it awesome.
Next, I simulate there are 4 nodes, each node has 10 requests.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | func main() { wg := &sync.WaitGroup{} wg.Add(4) for i := 0; i < 4; i++ { fmt.Println("* create chanel ", i) go gen(i, wg) } wg.Wait() } func gen(node int, wg *sync.WaitGroup) { n, err := snowflake.NewNode(int64(node)) if err != nil { println(err) os.Exit(1) } for i := 0; i < 10; i++ { // tạo ID id := n.Generate() fmt.Println( "id: ", id, "node: ", id.Node(), "step: ", id.Step(), "time: ", id.Time(), "n", ) } wg.Done() } |
You can see the picture to check the results or copy the code to try it
and of course, in addition to using it, you can also customize a few things to suit such as Epoch, NodeBits, StepBits…
But is 2039 too short for you? if it’s too short, then Sonyflake is an alternative, Sonyflake has a very similar design to Snowflake, but different in the way bits are distributed.
Sonyflake’s Sequence ID is similar to Snowflake’s and Machine ID is Node’s id. The difference between sonyflake is the configuration parameters during startup (see more at: https://github.com/sony/sonyflake )
Reference
https://developer.twitter.com/en/docs/twitter-ids
https://github.com/matoous/go-nanoid
https://github.com/bwmarrin/snowflake
https://github.com/sony/sonyflake
https://zalopay-oss.github.io/