Spam comments (comments with spam URLs, indecent content, sensitive personal information, …) are a problem that news sites, e-commerce sites, classifieds, blogs, and more face daily. A system that can block or flag these spam comments therefore makes the problem much easier to handle.
Original article posted on the author's personal blog: Building a simple spam comment blocking system
In this article, I would like to share my experience building a very simple but effective system to block spam comments. Best of all, it can easily be deployed for small products without much cost or effort.
A look at the comment spam problem
To understand the problem in more detail, we need to separate the types of spam comments and analyze their characteristics in order to find a reasonable solution. Here is a simple breakdown:
- One type of spam comment tries to place a link to another website on your site. With this type, the comment content will always contain a URL, so you can use that signal to check for it.
- Another type leaves sensitive personal information, such as emails or phone numbers; this information is usually structured, so it can be detected. Blocking it also helps comply with privacy laws (the goal is to avoid disclosing customers' personal information and to protect them).
- The most common type is comments with indecent, obscene content. These comments usually contain obscene and profane words.
- A more difficult type is ironic or sarcastic comments, which disparage or insult without using obscene words. This type is less common, though.
For the first two types, we can check simply with regular expressions in any popular programming language. The last type (irony and sarcasm) I will not discuss in this article: I tried to approach it as a classification problem using fastText, but it was not effective. Perhaps the development of deep learning and contextual embeddings will eventually solve it, but overall it is quite difficult and I have not had a chance to try again.
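To make the first two checks concrete, here is a minimal regular-expression sketch in Python. The patterns, thresholds, and function name are my own illustrative assumptions, not rules from any production system; a real deployment would tune them to its own data.

```python
import re

# Heuristic patterns for the first two spam types (URL spam and
# structured personal information). Illustrative assumptions only.
URL_RE = re.compile(r"https?://\S+|www\.\S+\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
# Vietnamese-style phone numbers: optional +84 prefix or leading 0,
# followed by 8-10 more digits.
PHONE_RE = re.compile(r"(?:\+?84|0)\d{8,10}\b")

def is_structured_spam(comment: str) -> bool:
    """Flag comments carrying URLs or personal info (email / phone)."""
    return bool(URL_RE.search(comment)
                or EMAIL_RE.search(comment)
                or PHONE_RE.search(comment))
```

Because these types are structured, a handful of patterns like this already catches most of them without any machine learning.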
The remainder of the article therefore covers a method to handle the third type of spam comment above.
What makes the problem difficult?
- If you apply a machine learning (text classification) approach, the biggest difficulty is that the data must be labeled: for each comment you need to know in advance whether or not it is spam. I tried this direction myself, labeling by a few rules derived from observing the data, but I failed, probably due to inexperience. Still, from my perspective, if you have the budget to build a labeled dataset, you can follow this direction, or keep iterating on the classification model.
- The comment data is very noisy; I mean there are many misspellings. Acronyms, text typed without diacritics, teencode (youth slang spellings), and so on are major difficulties for this problem, and you will need to handle them if you use the machine learning approach above.
- The deployed system must ensure high accuracy; think of the phrase "better to block wrongly than to let one slip". In practice you are not allowed to miss spam comments, especially those that are sensitive for your system. In other words, you have to maximize recall if you apply a classification model to this problem.
Solution to block spam comments
As analyzed above, the third comment type comes with sensitive words or phrases, and it is the type of spam comment that appears most often and is the most dangerous to let through. Therefore, my approach is to use a dictionary: for each comment, if it contains a sensitive word, conclude that it is a spam comment.
But the problem is: where do we get a complete list of sensitive words and phrases to use?
If you have ever used the word2vec word embedding model, you must know the concept of "word similarity". It can even be used as a way to evaluate whether your word2vec model is really good or not!
Visual representation of word2vec embeddings in space; source: https://github.com/sonvx/word2vecVN
In fact, the word2vec model learns each word from the words adjacent to it, within a fixed context window that you specify. Synonyms and words with similar meanings therefore tend to appear in the same contexts, and as a result, after training, the vectors of closely related words end up close to each other. Look at these sentence pairs and you will see it immediately:
- (1) **anh** là người giữa cuộc đời này, (2) **em** là người giữa cuộc đời
- (1) anh **yêu** em, (2) anh … em
- (1) tôi **đếch** cần, (2) tôi … cần

In each pair, the two highlighted words occur in the same context, so word2vec places their vectors close together.
From this similar-word search capability, I came up with the following procedure for building a Vietnamese sensitive-word dictionary:
- Build an initial sensitive-word dictionary. It does not need to be complete, just as large as you can manage; you can do this by writing it yourself or by collecting word lists from the internet.
- Train a word2vec model on data from social networks or forums; Wikipedia is also a suitable corpus. Avoid news data, because news articles rarely contain the sensitive words we need. The bigger your dataset the better; it should be at least a few GB.
- Expand the dictionary from step 1 by feeding each word into the word2vec model trained in step 2 and taking the top similar words. You need to review the similar words the model returns, to make sure words you do not want are not mixed in.
- Repeat step 3 two or three times and you will have the sensitive-word dictionary you need. If possible, retrain the word2vec model on new data to discover new sensitive words.
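The expansion loop in steps 3 and 4 can be sketched as below. Since training a real word2vec model is out of scope here, a tiny hand-made word-to-vector table stands in for the trained model; all words, vectors, and the similarity threshold are illustrative assumptions, not output from the author's actual model.

```python
import math

# Toy stand-in for a trained word2vec model: word -> vector. In practice
# these vectors come from a model trained on social/forum data (e.g.
# fastText); the words and values here are made up for illustration.
VECTORS = {
    "badword1": [0.9, 0.1, 0.0],
    "badword2": [0.85, 0.15, 0.05],  # near-synonym of badword1
    "badword3": [0.8, 0.2, 0.1],
    "hello":    [0.0, 0.9, 0.4],
    "thanks":   [0.1, 0.8, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(word, topn=2):
    """Top-n nearest neighbours of `word` by cosine similarity."""
    ref = VECTORS[word]
    scores = [(w, cosine(ref, v)) for w, v in VECTORS.items() if w != word]
    return sorted(scores, key=lambda x: -x[1])[:topn]

def expand_dictionary(seed, rounds=2, threshold=0.95):
    """Steps 3-4: grow the seed list by repeated similarity lookups.
    Candidates below `threshold` would go to manual review instead of
    being added automatically."""
    dictionary = set(seed)
    for _ in range(rounds):
        new_words = set()
        for word in list(dictionary):
            for cand, score in most_similar(word):
                if score >= threshold and cand not in dictionary:
                    new_words.add(cand)
        if not new_words:
            break
        dictionary |= new_words
    return dictionary
```

With a real model you would swap `most_similar` for the library's own nearest-neighbour query and keep the manual review step, which is what prevents unwanted words from leaking into the dictionary.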
Once you have the dictionary, all that remains is to use it to block spam comments automatically. Of course, we can add a few more functions to use this dictionary as effectively as possible.
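A minimal dictionary-based check might look like the sketch below; the dictionary contents and function name are placeholders for illustration, not the author's actual word list.

```python
import re

# Hypothetical mini-dictionary; a real one would be the expanded
# sensitive-word list built in the previous section.
SENSITIVE_WORDS = {"badword", "slur", "profanity"}

# One compiled pattern with word boundaries, so a dictionary word never
# fires on a substring inside a longer, harmless word.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(SENSITIVE_WORDS))) + r")\b",
    re.IGNORECASE,
)

def is_spam(comment: str) -> bool:
    """A comment is spam if it contains any dictionary word."""
    return PATTERN.search(comment) is not None
```

Compiling the whole dictionary into one alternation keeps the per-comment check to a single regex scan, which is what makes this approach cheap at serving time.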
You can refer to the sensitive-word dictionary I built myself here.
And so, with nothing more than the word similarity problem, we can solve the hardest part of blocking spam comments. Building this dictionary does not take long either, probably about a week or less to complete.
Advantages and disadvantages of the method
Every method has its advantages and disadvantages, and the dictionary-based method I propose is no exception. Here is my own assessment of it.
Overall, this solution works relatively well while the time and cost of deployment stay low.
On the advantages:
- Low cost and quick to build. The data can be taken from a Wikipedia dump or from social comments (Vietnamese YouTube comments are quite plentiful; my social-crawler project may be useful for this). The data just needs to be collected and fed to the word2vec model. I use fastText's word2vec implementation for training because it does not demand much hardware and trains very fast. fastText also provides pretrained vectors built from the Vietnamese Wikipedia dump, which can be downloaded and used directly.
- Comments that contain sensitive words written as teencode, abbreviations, etc. are still caught, because the word2vec model learns those variants from the data.
- Fast processing: since it is just a dictionary lookup, checking costs very little compute.
- With a dictionary you can add new sensitive words instantly, whereas with a machine learning method you would have to retrain the model.
On the disadvantages:
- It cannot handle rude comments when the dictionary is missing the words they use. That is why I keep emphasizing the "complete dictionary" factor.
- Comments written without diacritics are hard to spot.
- A perfectly fine comment may be blocked when the dictionary contains ambiguous words, such as "dog" ("chó"), which in Vietnamese is both an ordinary word and an insult.
Optimizing the approach
I no longer work on this problem. However, it can still be improved considerably with some subjective suggestions of mine below, and I really welcome readers to share any good ideas they have for it.
- If you use this method as the main one, you should update the dictionary regularly, because Vietnamese vocabulary also evolves over time.
- If you use this method as the main one, try to eliminate or limit ambiguity by widening the word window around ambiguous words, i.e. matching on phrases instead of single words. For example, "óc chó" (walnut) contains "chó" (dog) but is not sensitive at all.
- Preprocess comments before checking them (applicable to both the machine learning and dictionary methods): for example, normalize teencode using a list of common Vietnamese teencode words, and restore diacritics for unaccented comments (refer to existing solutions for this) to increase accuracy.
- After deploying this solution, it is worth investing in an end-to-end machine learning model, which can achieve better results and be easier to manage and deploy.
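The preprocessing ideas above (teencode normalization and handling ambiguous phrases) can be sketched as follows. The teencode map and whitelist here are tiny illustrative assumptions, not a curated list; a real system would load both from maintained word lists.

```python
# Hypothetical teencode map; a real one would come from a curated list
# of common Vietnamese teencode spellings.
TEENCODE = {"ko": "không", "dc": "được", "bt": "bình thường"}

# Phrases that contain a sensitive word but are harmless in context,
# e.g. "óc chó" (walnut) contains "chó" (dog). Illustrative assumption.
WHITELIST_PHRASES = ["óc chó"]

def normalize(comment: str) -> str:
    """Lowercase, blank out whitelisted phrases, and expand teencode
    tokens before the dictionary check runs."""
    text = comment.lower()
    for phrase in WHITELIST_PHRASES:
        text = text.replace(phrase, " ")
    tokens = [TEENCODE.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

Running this normalization in front of the dictionary check reduces both misses (teencode variants) and false positives (whitelisted phrases), for either the dictionary or the machine learning pipeline.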
Thank you for your interest; I look forward to comments from readers and experts!