Today the whole company is on vacation while I stay on duty. My work is already wrapped up, so I have plenty of free time, and I'd like to use it to write a few words about that work. About a year ago, I was handed the problem of identifying spam users. When I first got it, I smiled to myself, thinking "this is simple." But once I actually started, I found out I was mistaken.
# The hard parts
## Difficulty 1: Defining spam
The most commonly used definition is:
> Spam is the act of sending messages that are meaningless and annoying to the recipient.
But what counts as meaningless and annoying to the recipient? The recipient is not one person, but many people. For example, person A texts person B and person C asking for a sex chat. Person B finds it annoying and insulting, but person C agrees and enjoys it. So is this spam or not?!
## Difficulty 2: Data
When building a machine learning model, the most time-consuming and most important step is data processing. What does data processing mean here? It means analyzing the data to derive attributes (features), then labeling the data. To give you a picture: I had a data set of 10 million records, i.e. 10 million lines of user behavior, and my job was to analyze it and find the behaviors that distinguish spam users from non-spam users, with no hints to start from. That's the job of a data analyst.
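To make "turning raw behavior logs into attributes" concrete, here is a minimal sketch in pandas (which the post mentions using later). The column names (`user_id`, `recipient`, `ts`) and the derived features are my own illustrative assumptions, not the actual schema or feature set.

```python
import pandas as pd

# Hypothetical behavior log: one row per message sent.
logs = pd.DataFrame({
    "user_id":   ["a", "a", "a", "b", "b"],
    "recipient": ["r1", "r2", "r3", "r1", "r1"],
    "ts": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 10:01", "2021-01-01 10:02",
        "2021-01-01 10:00", "2021-01-01 12:00",
    ]),
})

# Aggregate raw rows into per-user attributes a model could consume.
features = logs.groupby("user_id").agg(
    msg_count=("recipient", "size"),            # total messages sent
    unique_recipients=("recipient", "nunique"), # breadth of targeting
    first_ts=("ts", "min"),
    last_ts=("ts", "max"),
)
# Messages per hour of activity (spammers tend to burst).
span_h = (features["last_ts"] - features["first_ts"]).dt.total_seconds() / 3600
features["msgs_per_hour"] = features["msg_count"] / span_h.clip(lower=1 / 60)
```

Each row of `features` is then what gets hand-labeled as spam or not spam.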
After finding the attributes comes the "heavy lifting": labeling. Labeling is like teaching a kid (the kid here being the computer) how an orange differs from an apple. That should tell you how important labeling is: it determines the success of the model. If you keep telling a child that an apple is an orange, and vice versa, then when he grows up he will call an apple an orange. And once that habit is learned it is very hard to unlearn; often the easiest way to reteach is to "reset". :-<
But the most accurate labels come from "heavy lifting", meaning labeling by hand. Meanwhile, for this spam problem, I was the only one working on it from A to Z, from data analysis to system design and development. It's impossible to label 10,000,000 records alone, so I could only label about 3,000.
## Difficulty 3: User privacy policy
To protect users, all user data such as phone numbers and account names is encrypted. To identify a user, I only have access to the user's id, which is not an auto-incrementing integer but an encoded string, and I don't know how it was encoded or generated. This is completely understandable and doesn't get in the way of building the system.
But people usually recognize spam based on content, such as the text of an email or SMS, and here I can't do that: the content of the messages cannot be viewed at all. So the question is, can spam users still be identified? The answer is yes, but the accuracy won't match what you'd get by combining user behavior with content.
Suppose we did identify spam based on content. That is also very difficult. Detecting spam in email, or on a system with fairly standard writing style like Spiderum, makes the NLP easy enough. But on SMS and chat systems such as Mocha and Zalo it is a nightmare, because users constantly use abbreviations, teen code, or some private language that only... insiders can understand. To teach that to a machine, you have to analyze, label, and build models all over again. In other words, you rebuild an NLP core for each of those codes or languages.
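One common first step in that kind of rebuild is normalizing slang back to standard words before any model sees the text. Here is a toy sketch; the dictionary entries are invented examples, not a real teen-code lexicon.

```python
# Hypothetical abbreviation -> standard-word map; a real system would need
# thousands of entries per community dialect, mined and labeled by hand.
TEEN_CODE = {"ko": "khong", "dc": "duoc", "bn": "ban"}

def normalize(message: str) -> str:
    """Replace known abbreviations with their standard forms, token by token."""
    return " ".join(TEEN_CODE.get(tok, tok) for tok in message.lower().split())
```

The painful part is not this function but building and maintaining the map, which is exactly the per-language rework described above.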
## Difficulty 4: Real-time processing of big data
When I started, I had no experience handling big data with Hadoop, Spark, and the like. I only used pure Python and backend libraries like pandas and sklearn. Initially, detecting spam behavior, from feature extraction through detection, took 2 hours for about 200,000 users / 10,000,000 records per day. After switching to a queue plus a split-per-rule strategy with parallel and distributed processing (all hand-written, without any big data framework), the whole job took only about 20 minutes.
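The split-and-parallelize idea can be sketched like this: partition the per-user records into chunks and run the rule check on each chunk concurrently. The rule itself (`count > 100`) and the thread-based executor are illustrative assumptions; the post describes hand-rolled parallel and distributed processing, not this exact code.

```python
from concurrent.futures import ThreadPoolExecutor

def check_chunk(batch):
    # Made-up rule for illustration: flag users with more than 100
    # messages in the window.
    return [uid for uid, count in batch if count > 100]

def detect_spam(counts, workers=4):
    # Split the (user_id, message_count) pairs into one chunk per worker.
    chunks = [counts[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = ex.map(check_chunk, chunks)
    # Merge the per-chunk results back into one flagged list.
    return sorted(uid for flagged in results for uid in flagged)
```

For CPU-bound pure-Python work, processes (or separate machines, as in the post) are what actually buy speed; threads here just keep the sketch runnable and short.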
Another tricky part is identifying a user with spam behavior in the shortest possible time; in other words, identifying a spam user based on behavior in almost no time at all?! That's right, you didn't misread. Behavior is something that must accumulate over a period of time, yet it has to be identified in real time? =((((((((
But necessity is the mother of invention. I handle it by continuously rescanning user behavior at a very small interval delta t, e.g. once per minute. So at time t0 a user has not spammed yet, but by time t1 that user has accumulated enough spam behavior and is flagged. Because the scan interval delta t is very small, this can be considered near real time, which satisfies the requirement of detecting a user's spam behavior as soon as possible. :3
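The micro-batch idea above can be sketched as: accumulate per-user events, and every delta t run a pass that drops stale events and flags whoever crosses a threshold. The window, threshold, and rule are all hypothetical placeholders for illustration.

```python
from collections import defaultdict, deque

WINDOW = 3600     # accumulate behavior over the last hour (seconds) - assumed
THRESHOLD = 100   # made-up rule: this many messages per window means spam

events = defaultdict(deque)  # user_id -> timestamps of recent messages

def record(user_id, ts):
    """Called as behavior logs stream in."""
    events[user_id].append(ts)

def scan(now):
    """One micro-batch pass (run every delta t): evict old events,
    flag users whose accumulated behavior crosses the threshold."""
    flagged = []
    for uid, ts_queue in events.items():
        while ts_queue and now - ts_queue[0] > WINDOW:
            ts_queue.popleft()
        if len(ts_queue) >= THRESHOLD:
            flagged.append(uid)
    return flagged
```

Running `scan` every minute means a spammer is caught at most one delta t after accumulating enough behavior, which is the "near real time" compromise described above.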
## Difficulty 5: Finding the attributes of spam users
Back to the difficulty of analyzing user data. Finding the attributes that distinguish spam users from ordinary users is like panning for gold. The first thing is to ask questions: what do users usually do? What does a spammer want? From there, derive the attributes.
But life is not like a dream. Spammers behave a lot like normal users. For example, the interval between messages sent by spammers is about the same as that of regular texters, and so on.
So how do you find the attributes? The answer is to keep thinking, query the data, visualize it into charts and metrics, compare spam users with normal users =))) and explain the meaning and origin of each one!
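The keep-or-discard loop for a candidate attribute might look like this: summarize its distribution in a small labeled sample of each group and see whether the two overlap. The feature (distinct recipients per day) and every number below are made up for illustration.

```python
import statistics

# Hypothetical samples of one candidate feature for labeled users.
spam_recipients   = [40, 55, 38, 70, 62]   # distinct recipients per day
normal_recipients = [3, 5, 2, 8, 4]

def summarize(xs):
    # Compare simple summary statistics between the two groups.
    return statistics.mean(xs), statistics.median(xs)

spam_mean, spam_median = summarize(spam_recipients)
norm_mean, norm_median = summarize(normal_recipients)
# If the distributions barely overlap, keep the feature; if they look
# alike (as message intervals did above), discard it and try another.
```

In practice this is done over charts and many candidate features, but the comparison logic is the same.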
## Difficulty 6: Evaluation
Usually, when evaluating a classification model, people use accuracy or true/false positives/negatives. But that's when a labeled dataset is already available for research. When applied in practice, the data we have is purely unlabeled. So how do you evaluate a model when you would have to label about 10,000,000 records per day? =(((((( The only sure way is to hire a team of labelers.
But the poor can't play like that; all we can do is gather statistics and assess based on the results the model returns. For example, if the model flags 100 spam users per day and 90 of them are truly spam, we can provisionally call it 90% correct. On top of that, compare the model's flagged rate against the spam rate we estimated earlier from statistics. If the difference between that earlier estimate and the model's rate is within 10%, the odds are very high that your model is on track.
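Both checks described above are just two ratios; a minimal sketch, with all numbers invented for illustration:

```python
def estimated_precision(flagged: int, confirmed: int) -> float:
    # Spot-check precision: manually verify a day's flagged users.
    return confirmed / flagged

def rate_gap(model_rate: float, baseline_rate: float) -> float:
    # Relative difference between the model's flagged rate and the
    # previously estimated base rate of spam.
    return abs(model_rate - baseline_rate) / baseline_rate

precision = estimated_precision(flagged=100, confirmed=90)
gap = rate_gap(model_rate=0.0011, baseline_rate=0.0010)
```

If `precision` is high and `gap` stays under the 10% tolerance, the model is probably behaving, without labeling millions of records.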
Over the last 2 months or so I have built the model, and the rate is quite good at 96% (this number is judged only from the spam detections the model returns). Well then, I'm temporarily satisfied. But there is still a lot of work to do and to optimize, because users are always probing for ways to get past the AI model.