Application of Naive Bayes algorithm in solving diabetes diagnosis problem

Tram Ho

Hi everyone, it has been a long time since my last Viblo article about Machine Learning. Today we look at a method that is not new in machine learning but remains one of the most effective for classification and prediction problems. The algorithm is Naive Bayes, one of the most typical classification algorithms based on probability theory. What is special about this article is that, unlike my previous Viblo articles, I will not use the Naive Bayes implementation from the Scikit-learn package. Instead, I will guide you through implementing the algorithm in Python step by step, using a practical example: diagnosing whether a person has diabetes based on his or her clinical indicators. OK, let’s start.

Method of learning based on probability

If you have read my articles on Machine Learning or other tutorials on the subject, you will have seen that there is a very close connection between Machine Learning and probability theory. Classification methods based on probability theory can basically be understood as calculating the probability that an event falls into each possible outcome. The higher the probability of an outcome, the more likely the event is to turn out that way. This is particularly significant in the prediction and classification problems of Machine Learning. One question is, “So how does our innocent computer determine that probability?” In statistical learning, each probabilistic method usually comes with a probability distribution suited to the problem. For each probability distribution we have a separate way of calculating the quantities the algorithm needs, such as the expectation and the standard deviation, which we will explore and calculate explicitly within the scope of this article. Now we continue to the next part.

Naive Bayes algorithm

Bayes' theorem is probably no longer strange to us. It describes the relationship between conditional probabilities, which means we can calculate an unknown probability from other, known conditional probabilities. The Naive Bayes algorithm is built on calculating these conditional probabilities. The name itself already sounds a little naive. Why "Naive"? It is no coincidence that the algorithm was given this name. It rests on the assumption that the dimensions of the data X = (x_1, x_2, ..., x_n) are probabilistically independent of one another. This assumption is quite naive, because in practice it almost never holds: we rarely find a data set whose components have nothing to do with each other. Nevertheless, this naive assumption gives unexpectedly good results. The assumption that the dimensions are independent is what is called Naive Bayes, and the way of determining the class of a data point based on this assumption is called the Naive Bayes Classifier (NBC). Thanks to this assumption, training and testing are extremely fast and simple, so we can use NBC for large-scale problems. In fact, NBC works quite effectively in many real-world problems, especially text classification problems such as filtering spam messages or spam emails. In this article, I will work with you to apply the theory of NBC to a new problem: diagnosing diabetes.
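To make the independence assumption concrete, the classifier can be written out as follows. This is the standard textbook form of Naive Bayes, not a formula taken from this article; the symbols c (a class) and \hat{c} (the predicted class) are introduced here for illustration.

```latex
% Bayes' theorem for a class c and a data point X = (x_1, ..., x_n)
p(c \mid X) = \frac{p(X \mid c)\, p(c)}{p(X)}

% Naive assumption: the dimensions are independent given the class
p(X \mid c) = \prod_{i=1}^{n} p(x_i \mid c)

% p(X) is the same for every class, so the predicted class is
\hat{c} = \arg\max_{c} \; p(c) \prod_{i=1}^{n} p(x_i \mid c)
```

Because p(X) does not depend on the class, it can be dropped when we only need to compare classes against each other.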

Diabetes data set

This dataset contains data from 768 volunteers, both people with diabetes and people without. It has the following attributes:

  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U / ml)
  6. Body mass index (weight in kg / (height in m) ^ 2)
  7. Diabetes pedigree function
  8. Age (years)

For each volunteer, the data includes the indicators listed above and the disease status: class 1 for sick, class 0 for not sick. In essence, this is a two-class classification problem, and other classification methods such as SVM, Random Forest or kNN can also classify it quite well. If I have the opportunity, I will present those methods on another occasion. The data set is stored in CSV format, where each row is one volunteer, columns 1 to 8 correspond to the indicators listed above, and the last column is the volunteer's status.

One thing to note is that each indicator is a continuous variable, not a discrete value, so when we apply the Naive Bayes algorithm we need to assume a probability distribution for it. One of the most common distributions used for this purpose is the Gaussian distribution. Let’s learn a little about it, because understanding how it works is what lets us put it into practice.

Gaussian distribution

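For a data point $x_i$ belonging to a class $c_i$, we assume that $x_i$ follows a normal distribution with expectation $\mu$ and standard deviation $\sigma$. The probability density function of $x_i$ is then defined as follows:

$$p(x_i \mid c_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$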

This is the calculation used by the sklearn library, but in this article I will guide you through implementing it by hand. Implementing it manually helps us understand the problem better.


Manual implementation

Load data

Our data is saved as a CSV file, so we will use Python’s csv library to read it.
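A minimal sketch of this loading step, assuming the file has no header row and that every column is numeric. The function name `load_csv` and the file name in the usage comment are placeholders, not names from the original source code.

```python
import csv


def load_csv(filename):
    """Read a headerless, all-numeric CSV file into a list of rows of floats."""
    with open(filename, newline="") as f:
        # Skip empty lines and convert every field to float.
        return [[float(value) for value in row] for row in csv.reader(f) if row]


# Usage (the file name is hypothetical):
# dataset = load_csv("diabetes.csv")
# print(len(dataset), "rows loaded")
```

Each row then holds the eight indicators followed by the class label as the last value.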

Calculate the standard deviation

As mentioned above, we need code to calculate the standard deviation of a continuous random variable. The standard deviation is the square root of the average squared deviation of the values from their mean.

Here $\overline{x}$ denotes the average value of the random variable across the data set. We implement functions in Python to calculate the average value and the standard deviation.
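A sketch of these two helpers is shown below. It assumes the sample form of the standard deviation, which divides by N - 1; the population form divides by N instead, and the original code may have used either. The function names are illustrative.

```python
import math


def mean(numbers):
    """Average value of a list of numbers."""
    return sum(numbers) / len(numbers)


def standard_deviation(numbers):
    """Sample standard deviation (divides by N - 1, an assumption of this sketch)."""
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1)
    return math.sqrt(variance)
```

Note that `standard_deviation` needs at least two values, otherwise the division by N - 1 fails.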

Data preprocessing

Before starting any Machine Learning problem, preprocessing the data is very important. If the data set is not clean, we need extra steps such as sampling the data, removing missing values and transforming the data into a form suitable for processing. The diabetes data set is already in a standard form, so we only need to choose a data representation that fits the algorithm. As mentioned above, we will use the standard deviation and the average value to calculate the necessary probabilities, so we need a function that converts the original data into a set of averages and standard deviations for the later probability calculations.
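One way to build these summaries is to group the rows by class and then compute the average and standard deviation of every column within each group. The sketch below assumes the class label is the last value in each row and uses the sample standard deviation; all function names are illustrative.

```python
import math
from collections import defaultdict


def mean(numbers):
    return sum(numbers) / len(numbers)


def standard_deviation(numbers):
    avg = mean(numbers)
    return math.sqrt(sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1))


def separate_by_class(dataset):
    """Group feature vectors by their class label (the last value of each row)."""
    separated = defaultdict(list)
    for row in dataset:
        separated[row[-1]].append(row[:-1])
    return dict(separated)


def summarize_by_class(dataset):
    """Map each class to a list of (mean, standard deviation) pairs, one per feature."""
    return {
        label: [(mean(col), standard_deviation(col)) for col in zip(*rows)]
        for label, rows in separate_by_class(dataset).items()
    }


# Toy rows with two features and a label; made-up values, not from the diabetes data.
toy = [[1.0, 20.0, 0], [3.0, 22.0, 0], [10.0, 40.0, 1], [12.0, 44.0, 1]]
summaries = summarize_by_class(toy)
```

The result has one entry per class, and each entry has one (mean, standard deviation) pair per indicator.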

Calculate the probability of each continuous variable according to the Gaussian distribution

Based on the theory above, we calculate the probabilities for the random variables, including $p(x)$ for each health indicator and $p(x \mid c)$, the probability of that indicator given each class.
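The class-conditional part can be sketched as below: a Gaussian density for a single indicator, and the product of those densities over all indicators for each class. The product relies on the naive independence assumption. The names and the shape of `summaries` (class mapped to a list of (mean, standard deviation) pairs) are assumptions of this sketch.

```python
import math


def gaussian_probability(x, mean, stdev):
    """Gaussian density of x for the given mean and standard deviation (stdev > 0)."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


def class_likelihoods(summaries, input_vector):
    """For each class, multiply the per-feature densities p(x_i | c)."""
    likelihoods = {}
    for label, feature_summaries in summaries.items():
        likelihood = 1.0
        for x, (mean, stdev) in zip(input_vector, feature_summaries):
            likelihood *= gaussian_probability(x, mean, stdev)
        likelihoods[label] = likelihood
    return likelihoods
```

A feature whose standard deviation is zero within a class would cause a division by zero here, so real data may need a small guard for that case.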

Prediction based on probability

This step applies the Bayes theorem mentioned above to predict the class from the indicators in the data set.
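A sketch of the prediction step: for every class, multiply the class prior by the Gaussian densities of the indicators and pick the class with the highest score. The `priors` argument (class mapped to its relative frequency in the training data) and the toy numbers are assumptions for illustration; the original code may have handled the prior differently.

```python
import math


def gaussian_probability(x, mean, stdev):
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


def predict(summaries, priors, input_vector):
    """Return the class c that maximizes p(c) * product of p(x_i | c)."""
    best_label, best_score = None, -1.0
    for label, feature_summaries in summaries.items():
        score = priors[label]
        for x, (mean, stdev) in zip(input_vector, feature_summaries):
            score *= gaussian_probability(x, mean, stdev)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up summaries for two features: class -> [(mean, stdev), ...]
toy_summaries = {0: [(1.0, 0.5), (20.0, 2.0)], 1: [(10.0, 0.5), (40.0, 2.0)]}
toy_priors = {0: 0.5, 1: 0.5}
label = predict(toy_summaries, toy_priors, [1.2, 21.0])
```

With many features the product can become very small, so summing logarithms of the probabilities is a common alternative.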


After the initial data processing step, we run training as follows:
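The original training code is not reproduced here. As a rough sketch of what this step typically involves, the data is first split into a training part and a test part, and the class priors are estimated from the training part. The function names, the 0.67 default ratio and the `seed` parameter are assumptions of this sketch.

```python
import random
from collections import Counter


def split_dataset(dataset, split_ratio=0.67, seed=None):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]


def class_priors(dataset):
    """Relative frequency of each class label (the last value of each row)."""
    counts = Counter(row[-1] for row in dataset)
    total = len(dataset)
    return {label: count / total for label, count in counts.items()}
```

Training a Gaussian Naive Bayes model then amounts to computing these priors plus the per-class average and standard deviation of every indicator on the training rows.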

Implementation results

After implementing it, we find that the algorithm runs very fast and gives about 75% accuracy, depending on how the data is split.
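Accuracy here is taken to mean the percentage of test rows whose predicted class matches the true class. A small helper for it, with an illustrative name, might look like this:

```python
def accuracy(actual, predicted):
    """Percentage of positions where the predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)
```

Because the train/test split is random, the value changes from run to run unless the shuffle is seeded.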

Next, to check whether our implementation is correct, I will compare it against the sklearn library in the next section.

Using the sklearn library

Splitting the data

First we need to divide the initial data set into two matrices: one containing the 8 indicators of each volunteer described in the first section, and one containing the corresponding classes.
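Assuming each row stores the indicators first and the class label last, this division can be sketched in plain Python. The function name is illustrative.

```python
def split_features_labels(dataset):
    """Split rows into a feature matrix (all but the last column) and a label list."""
    features = [row[:-1] for row in dataset]
    labels = [int(row[-1]) for row in dataset]
    return features, labels
```

The two lists can be passed directly to sklearn estimators, which accept nested Python lists as well as NumPy arrays.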


We add the following code to the main() function above to train the library model:
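The original snippet is not reproduced here. A minimal sketch of training with sklearn's `GaussianNB` is shown below, using a tiny made-up training set with two features in place of the real diabetes data.

```python
from sklearn.naive_bayes import GaussianNB

# Made-up toy data: two features per row, labels 0 and 1 (not the diabetes data).
X_train = [[1.0, 20.0], [1.5, 21.0], [0.8, 19.0],
           [10.0, 40.0], [9.5, 41.0], [10.4, 39.0]]
y_train = [0, 0, 0, 1, 1, 1]

# Fit a Gaussian Naive Bayes model: per class, sklearn estimates the prior
# plus the mean and variance of every feature.
model = GaussianNB()
model.fit(X_train, y_train)
```

In the real program, `X_train` and `y_train` would be the indicator matrix and class list built from the training part of the data set.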


After training, we use the model to evaluate the test data set.
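Evaluation can be sketched with `predict` and sklearn's `accuracy_score`. The data below is again a made-up toy set, so the resulting accuracy says nothing about the diabetes problem.

```python
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Made-up toy data with two well separated classes (not the diabetes data).
X_train = [[1.0, 20.0], [1.5, 21.0], [0.8, 19.0],
           [10.0, 40.0], [9.5, 41.0], [10.4, 39.0]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[1.1, 20.0], [9.8, 40.5]]
y_test = [0, 1]

model = GaussianNB().fit(X_train, y_train)

# Predict the class of every test row and compare with the true labels.
predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, predictions)  # fraction between 0 and 1
```

`accuracy_score` returns a fraction, so it has to be multiplied by 100 to be compared with a percentage figure.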

Compare the two implementations

After running the test, both approaches give the same results, which shows that our manual implementation is correct.

Source code

You can refer to the source code in the article here


We have come a long way, from probability theory to the Naive Bayes algorithm, and then applied it to the problem of classifying diabetes patients. At its core, the problem is still one of gathering statistics about relationships in the data and finding the probabilities that matter for the question we care about (here, whether a person has diabetes or not). Although the accuracy is not high, both because of the nature of the method and because the data set is not large enough, the exercise helps readers see how to implement the Naive Bayes algorithm.


Source : viblo