Vietnamese celebrity face data and the face recognition problem

Tram Ho

Apple uses face recognition to unlock mobile devices; Facebook uses it to tag friends' faces and connect the community; financial companies are using face recognition to authenticate payments in place of physical cards; airports and stations use it for security control; schools and companies want attendance systems and automatic grading based on face authentication, and so on.

These are typical examples of how strongly and widely face recognition has developed in real-world use.

There have been many studies of this problem: many models have been released, and many pre-trained models are available along with free public face datasets. The results achieved on those datasets are very good, but many observations show that applying them to practical problems in Vietnam does not work as well. To provide a dataset dedicated to researching and optimizing face recognition for Vietnamese people, I would like to introduce VN-celebrity, a mini-dataset with more than 23k faces of more than 1,000 Vietnamese people. Hopefully this dataset will serve your research needs and lead to more pre-trained models for Vietnamese faces being published widely.

In this article, I will explain how the data was collected and aggregated, so that you can build a dataset of your own or contribute to this one to help it grow. (The dataset has been fully published for study and research purposes here.)

Build a list of celebrity names

Before collecting Vietnamese face data for research, what concerned me most was privacy and image rights. The largest source of face data one can reach is social networks; however, that source is off limits because of privacy. Running scripts on other people's pages to scrape their photos and personal information is not allowed (although I know there are major companies using this data source). As far as I know, if I collect only for research purposes, I can use a more open data source, Google Images (although the number of images there with a known identity is far smaller than on social networks). Of course, if a person who appears in a photo asks me not to use their image anymore, that is their right.

To retrieve identified images from Google Images (mainly images from newspapers), I need a list of names to search for. These people must be well known enough that searching for their names readily returns their photos. I used Vietnamese Wikipedia to define this list, because the people listed there are well known enough to be found easily.

On the same platform as Wikipedia is Wikidata, a free, multilingual, secondary database that collects structured data to support other projects on the Wikimedia platform. In fact, I have been using data from the Wiki platform quite a lot recently.

To get the names of Vietnamese people on Wikipedia in the simplest way, go to the Wikidata Query Service, execute a query for people with Vietnamese nationality, then download the results as JSON or CSV.

However, I have learned a bit of SPARQL, and I prefer doing this in code.

First install the necessary library.

Next, I run the query (you can also generate this code in the Wikidata Query Service interface by clicking the code button above the result table).
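The original listing is not preserved here, so what follows is only a minimal sketch of such a query in Python. It assumes the SPARQLWrapper package and the public endpoint at https://query.wikidata.org/sparql. The exact query text, the variable names (?person, ?personLabel) and the helper names fetch_results and extract_names are assumptions made for illustration, not the author's code. The label service is asked for Vietnamese first and English second.

```python
import re
import sys

ENDPOINT_URL = "https://query.wikidata.org/sparql"

# P31 = "instance of", Q5 = "human", P27 = "country of citizenship", Q881 = Vietnam.
# The label service returns the Vietnamese label if present, otherwise the English one.
QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P27 wd:Q881 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "vi,en". }
}
"""


def fetch_results(endpoint_url, query):
    """Send the query to the endpoint and return the decoded JSON response."""
    # Imported lazily so the parsing helper below works without the package.
    from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

    user_agent = "vn-celeb-example Python/%d.%d" % sys.version_info[:2]
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


def extract_names(results):
    """Return the unique labels from a SPARQL JSON result, in first-seen order."""
    names = []
    seen = set()
    for row in results["results"]["bindings"]:
        name = row.get("personLabel", {}).get("value", "").strip()
        # Items with no label in the requested languages come back as a bare Q-id.
        if not name or re.fullmatch(r"Q\d+", name):
            continue
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Usage (needs network access and the SPARQLWrapper package):
# list_name_celeb = extract_names(fetch_results(ENDPOINT_URL, QUERY))
```

Entries that come back as a bare Q-id are dropped because they would not be useful as search keywords.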

The result is likewise a list of names of people on Wikipedia.

Let's analyze the code a bit.

I take names from the Vietnamese and English Wikipedia only, rather than from every language.

In Wikidata, identifiers starting with P denote properties, and identifiers starting with Q denote items, which serve as the values of those properties. P31 ("instance of") states that the object is a specific example or member of some class (the object usually has an appropriate name label), and Q5 is the class labelled "human". Similarly, P27 denotes nationality, and Q881 is Vietnam.

In this query, I get the names of people with Vietnamese nationality who appear on the English and Vietnamese wikis.

To exploit more information from Wikipedia, you should definitely learn a bit of SPARQL.

Collect image data using Google Image Search

After acquiring the list of people named on Wikipedia, I searched for their photos on Google Image Search. Fortunately, I did not have to write the whole crawl script myself, because there is an open-source package, Google Images Download. It is quite simple to use: install it by following the instructions on the package's home page, then run a short script that passes each name as a search keyword.
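The download step can be sketched as follows. The original script is not preserved here, so the helper names build_arguments and download_all are hypothetical; the sketch assumes the google_images_download package and its keywords, limit and output_directory arguments. The output directory is passed explicitly to match the name used in this article. Whether the package still works against the current Google Images pages has not been verified here.

```python
def build_arguments(name, limit=50, output_directory="download"):
    """Build the argument dict for one search; the keyword is the person's name."""
    return {
        "keywords": name,                      # one search per person
        "limit": limit,                        # number of images to fetch
        "output_directory": output_directory,  # parent folder for all people
    }


def download_all(names, limit=50):
    """Run one image search per name; images land in one subfolder per keyword."""
    # Imported lazily: third-party package, installed as described on its home page.
    from google_images_download import google_images_download

    downloader = google_images_download.googleimagesdownload()
    for name in names:
        downloader.download(build_arguments(name, limit))


# Usage (needs network access and the google_images_download package):
# download_all(list_name_celeb, limit=50)
```

Searching one name at a time, rather than joining all names into a single keyword string, avoids problems with names that themselves contain a comma, since the package splits the keywords string on commas.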

After running the script, we get a directory named download containing one subdirectory per person in list_name_celeb, named after that person and holding their images.

Note that these are simply the images that Google Images Download returned for the given keywords; there is no guarantee that a result actually shows that person, or shows that person alone. We need to re-check. For example, the images collected for singer Toc Tien are shown below:

Here I take only 50 photos per person; you can get more by changing the value of the limit field.

Face detection: Detect and extract faces in photos

So far, we have a lot of photos of the people named in list_name_celeb. However, as mentioned above, this data is noisy, very noisy. A returned image may show a different person, or more than one person.

Our goal is to build a dataset of human faces for studying problems like face recognition, face verification, etc., so we only care about the faces in the photos.

To extract the faces from the photos, I reviewed pre-trained models for face detection and decided to use the MTCNN implementation from David Sandberg's FaceNet.

The face region of each image is extracted and resized to 128 × 128 pixels and 182 × 182 pixels, with an extra margin of 10 pixels taken around the face in each image.
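The cropping step can be sketched as below. This is not David Sandberg's code: the detector itself is left out, and the box is assumed to arrive as (left, top, right, bottom) pixel coordinates from whatever face detector is used. The helper names expand_box and crop_face are hypothetical, the image is assumed to be a Pillow image, and the 10-pixel margin is assumed to apply on each side of the box.

```python
def expand_box(box, margin, image_width, image_height):
    """Grow a (left, top, right, bottom) box by `margin` pixels per side.

    The result is clamped so that it never leaves the image.
    """
    left, top, right, bottom = box
    return (
        max(int(left) - margin, 0),
        max(int(top) - margin, 0),
        min(int(right) + margin, image_width),
        min(int(bottom) + margin, image_height),
    )


def crop_face(image, box, size, margin=10):
    """Crop one detected face from a Pillow image and resize it to size x size."""
    crop_box = expand_box(box, margin, image.width, image.height)
    return image.crop(crop_box).resize((size, size))


# Usage (needs Pillow and a box produced by a face detector):
# from PIL import Image
# photo = Image.open("photo.jpg")
# face_128 = crop_face(photo, (40, 30, 120, 130), size=128)
# face_182 = crop_face(photo, (40, 30, 120, 130), size=182)
```

Clamping matters for faces near the border of a photo, where adding the margin would otherwise produce coordinates outside the image.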

Finally, the noise reduction stage. At this point, human effort was poured into deleting, from each person's folder, the photos that do not show that person. This takes a lot of time and is boring.

However, the result makes the effort worthwhile. After noise reduction, more than 23,000 images of more than 1,000 people remained. This is a significant number and worth researching. The dataset is similar to Labeled Faces in the Wild (LFW), which has more than 13,000 photos of 5,749 people, but it contains more images and is exclusively of Vietnamese people.

Finally, we will briefly analyze this new dataset.


The dataset includes 23,105 faces of 1,020 people who appear on Vietnamese Wikipedia. There are roughly 20 photos per person on average (23105 / 1020 ≈ 22.7); 7 people have only 2 photos, and the person with the most has 105. The distribution of the number of photos per person is as follows:
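The same summary figures can be recomputed from a local copy of the data with a short sketch like the one below. It assumes one folder per person containing that person's face images, and the helper names count_images and summarize, as well as the list of file extensions, are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Map each person's folder name to the number of image files inside it."""
    counts = {}
    for person_dir in sorted(Path(root).iterdir()):
        if person_dir.is_dir():
            counts[person_dir.name] = sum(
                1
                for f in person_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def summarize(counts):
    """Summarize a non-empty {person: number_of_images} mapping."""
    values = list(counts.values())
    total = sum(values)
    fewest = min(values)
    return {
        "people": len(values),
        "faces": total,
        "mean": total / len(values),
        "min": fewest,
        "max": max(values),
        # How many people share the smallest number of photos.
        "people_at_min": sum(1 for v in values if v == fewest),
    }


# Usage (the folder name is a placeholder for wherever the data was unpacked):
# print(summarize(count_images("vn_celeb_faces")))
```

The per-person counts returned by count_images are also what a histogram of the distribution would be drawn from.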

Another feature of this dataset worth noting is that the images collected for the same person may come from very different periods and very different situations: there are childhood photos and photos in old age, black-and-white photos and color photos. Here are some examples (from left to right, top to bottom: General Vo Nguyen Giap, poet Huy Can, hero Pham Tuan, Xuan Bac, Van Dung, Xuan Hinh):

[Images: Vo Nguyen Giap, Huy Can, Pham Tuan, Xuan Bac, Van Dung, Xuan Hinh]

The data set was also successfully used in a celebrity identification contest organized by AIVIVN in March 2019. Below are the results of the competition.

The dataset has been fully published here for study and research on problems related to face recognition for Vietnamese people.

I hope that this dataset, combined with techniques like transfer learning and fine-tuning, will help you achieve better results on Vietnamese-specific problems.

Thank you for reading the article!

Source: viblo