Learn about DNA and use machine learning methods to predict gene families

Tram Ho

In our previous post giving an overview of bioinformatics, we listed a number of applications of computer science methods to the collection, storage, organization, analysis, and visualization of biological data. This article discusses how DNA data is represented and structured, and how it can be processed. It also provides a small example of classifying gene families using simple machine learning methods.

What is DNA

DNA (deoxyribonucleic acid), also abbreviated ADN (from the French acide désoxyribonucléique), is a molecule that contains the biological instructions that make up the characteristics of each species. DNA, along with the genetic information it contains, is passed from adult organisms to their offspring during reproduction.

The human genome has about 20,000-25,000 genes, which are packed into chromosomes located in the nucleus of a cell. It consists of about 3 billion base pairs, built from four types of nucleotides denoted “A”, “C”, “G”, and “T”. Each of us has a distinct genome, but we still share large parts of it.

For DNA data, machine learning techniques can be applied to:

  • Capture dependencies in the data
  • Infer and discover new biological hypotheses

So in this article, we will learn how to interpret the structure of DNA and how machine learning algorithms can be used to build predictive models on DNA sequence data.

Reference: https://medium.com/analytics-vidhya/demystify-dna-sequencing-with-machine-learning-and-python-bdbaeb177f56

How the DNA is represented

Most DNA molecules consist of two biopolymer strands coiled around a common axis to form a double helix.

Illustrations from https://biologydictionary.net/double-helix/

These two DNA strands are called polynucleotides because they are composed of nucleotide monomers. Each nucleotide is made up of one of four nitrogen-containing nucleobases (cytosine (C), guanine (G), adenine (A), or thymine (T)) bound to the sugar deoxyribose and a phosphate group. The nucleotides are linked into a DNA strand by covalent bonds between the sugar of one nucleotide and the phosphate group of the next, forming an alternating sugar-phosphate “backbone”.

The order or sequence of nucleobases determines which biological instructions are contained in a DNA sequence. For example, a specific region on chromosome 15 is important in determining eye color.

How Python handles DNA data

As we know, Python has many libraries that support data processing and visualization, including some for biological data. Two worth mentioning are Biopython and Squiggle.

  • Biopython is a set of Python modules that provide functions for working with DNA, RNA, and protein sequences, such as computing the reverse complement of a DNA sequence or finding motifs in protein sequences. It provides many parsers for reading the major genetic databases and formats, such as GenBank, Swiss-Prot, and FASTA.
  • Squiggle is a software tool that automatically generates interactive web-based two-dimensional graphical representations of raw DNA sequences. Squiggle implements several previously published sequence visualization algorithms and introduces new visualization methods designed to maximize usability.

To install both libraries from a Jupyter Notebook, we run pip install biopython and pip install squiggle, each prefixed with ! when typed into a notebook cell.

Genetic data is usually stored in one of several formats, one of which is FASTA. In a FASTA file, each record starts with a header line beginning with ">", followed by one or more lines of sequence data.
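To make the format concrete, here is a minimal sketch of a FASTA parser in plain Python. The snippet of FASTA text is made up for illustration and is not the example file discussed in this article, and parse_fasta is a hypothetical helper name.

```python
# A made-up FASTA snippet for illustration only (not the article's example file).
fasta_text = """>seq1 hypothetical example record
ATGCGTACGTTAGC
GGCTAAGCTA
>seq2 another hypothetical record
TTGACCGTA
"""

def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the first word after '>'
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append it to the current record
            records[current_id] += line
    return records

records = parse_fasta(fasta_text)
for seq_id, seq in records.items():
    print(seq_id, len(seq), seq)
```

Note that a sequence may span several lines, so the parser concatenates every line until the next header.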

The full content of the example file can be found at http://www.cbs.dtu.dk/services/NetGene2/fasta.php . The entire FASTA content is shown in bold on that page. Copy that part and save it as a plain text file with the .fa extension, such as example.fa; that file can then be used as data for the Squiggle example.

To display matplotlib plots inline in Jupyter Notebook, we use the %matplotlib inline magic command.

Next, with the Biopython library we can read the information of the data file with the following commands:
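As a rough sketch of this step, the snippet below reads FASTA records with Biopython's SeqIO module. To keep it runnable without a file on disk, it parses a made-up in-memory record; with the real data we would pass the path of example.fa instead.

```python
from io import StringIO
from Bio import SeqIO  # provided by the biopython package

# Made-up in-memory FASTA data standing in for example.fa
fasta_text = ">seq1 hypothetical record\nATGCGTACGTTAGC\nGGCTAAGCTA\n"

# With a real file, pass its path instead: SeqIO.parse("example.fa", "fasta")
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    print("ID:", record.id)
    print("Description:", record.description)
    print("Length:", len(record.seq))
    print("Sequence:", record.seq)
    print("Reverse complement:", record.seq.reverse_complement())
```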

The results are as follows:

Of course besides just printing it out we can use the Squiggle library mentioned above to visualize the data. Using the command below, Squiggle will open a data visualization web page on which we can move, zoom, and interact with the data.

There are also many Squiggle options that we can learn more about at the library’s documentation page.

Predict gene families based on machine learning methods

A gene family is a set of several similar genes, formed by duplication of a single original gene, that generally have similar biochemical functions. Genes are classified into families based on shared nucleotide or protein sequences.

Knowing the sequence of the protein encoded by a gene allows researchers to apply methods that find similarities among protein sequences, which provide more information than similarities or differences among DNA sequences.

The content above is taken from the reference 7.15A: Gene Families.

Prepare the data

In this example, we will use human, chimpanzee, and dog DNA data. The example is based on the repo of author Nagesh Singh Chauhan, so the data for the three species is also available in the DNA-Sequence-Machine-learning repo. The data includes gene families labeled as follows:

Gene family                    Class label
G protein coupled receptors    0
Tyrosine kinase                1
Tyrosine phosphatase           2
Ion channel                    5
Transcription factor           6

After downloading the above data, first as usual, we import the libraries we need to use:

Next, with the pandas library we read the downloaded data as follows:
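A minimal sketch of the loading step follows. The file layout is an assumption: a tab-separated text file with one column holding the DNA sequence and one holding the integer class label. The column names "sequence" and "class", the file name human_data.txt, and the rows themselves are hypothetical.

```python
from io import StringIO
import pandas as pd

# Made-up tab-separated content mimicking the assumed layout:
# one DNA sequence and one integer class label per row.
raw = (
    "sequence\tclass\n"
    "ATGCCCCAACTAAATAC\t0\n"
    "ATGAACGAAAATCTGTT\t2\n"
    "ATGTGTGGCATTTGGGC\t5\n"
)

# For a real file, replace StringIO(raw) with its path,
# e.g. "human_data.txt" (an assumed file name).
human = pd.read_csv(StringIO(raw), sep="\t")

print(human.head())
print(human["class"].value_counts().sort_index())
```

Checking the class counts right after loading is a quick way to see whether the gene families are balanced.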

Convert DNA sequence data into a k-mer feature matrix

The k-mer counting technique breaks a sequence into overlapping “words” of length k. For example, with k = 6 the sequence ATGCATGCA becomes ATGCAT, TGCATG, GCATGC, CATGCA. In this example, we will use k = 6. The following function is used to apply the k-mer technique:
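One possible way to write such a function is sketched here. The name get_kmers and the choice to lowercase the sequence are assumptions made for illustration.

```python
def get_kmers(sequence, k=6):
    """Split a sequence into all overlapping substrings of length k."""
    sequence = sequence.lower()
    # A sequence of length n yields n - k + 1 k-mers (none if n < k)
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

kmers = get_kmers("GTGCCCAGGT", k=6)
print(kmers)  # ['gtgccc', 'tgccca', 'gcccag', 'cccagg', 'ccaggt']
```

Because consecutive k-mers overlap by k - 1 characters, the local order of the bases is preserved even though the k-mers are later treated as an unordered bag of words.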

Next, the sentencize function joins the k-mers of a sequence into a k-mer “sentence”:
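A sketch of what sentencize could look like, assuming it simply joins the k-mers with spaces so that text vectorizers can treat each k-mer as a word. The signature and the get_kmers helper are assumptions.

```python
def get_kmers(sequence, k=6):
    """Split a sequence into all overlapping substrings of length k."""
    sequence = sequence.lower()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def sentencize(sequence, k=6):
    """Join the overlapping k-mers of a sequence into one space-separated string."""
    return " ".join(get_kmers(sequence, k))

sentence = sentencize("ATGCATGCA")
print(sentence)  # atgcat tgcatg gcatgc catgca
```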

From there we obtain sents_human and y_human through the following statement:
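The statement could look roughly like the sketch below, which applies sentencize to every sequence and keeps the class labels as the target. The DataFrame contents and the column names "sequence" and "class" are made up for illustration.

```python
import pandas as pd

def sentencize(sequence, k=6):
    """Turn a sequence into a space-separated string of overlapping k-mers."""
    sequence = sequence.lower()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up rows in the assumed layout (columns "sequence" and "class")
human = pd.DataFrame({
    "sequence": ["ATGCATGCA", "GGGCCCGGGA"],
    "class": [0, 5],
})

# One k-mer sentence per sequence, and the matching class labels
sents_human = [sentencize(seq) for seq in human["sequence"]]
y_human = human["class"].to_numpy()

print(sents_human[0])
print(y_human)
```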

Next, we vectorize the sentences with a bag-of-words (BOW) model and split the data into training and test sets:
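A minimal sketch of this step with scikit-learn is shown here. The toy sequences are randomly generated stand-ins for the real data, and the vectorizer settings, the 80/20 split, and the random seed are assumptions rather than the article's exact values.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def sentencize(sequence, k=6):
    """Turn a sequence into a space-separated string of overlapping k-mers."""
    sequence = sequence.lower()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up toy data: class 0 is AT-rich, class 1 is GC-rich
rng = random.Random(0)
seqs = ["".join(rng.choices("ACGT", weights=w, k=60))
        for w in ([4, 1, 1, 4], [1, 4, 4, 1]) for _ in range(20)]
labels = [0] * 20 + [1] * 20

sents = [sentencize(s) for s in seqs]

# Bag of words: each column counts how often one k-mer occurs in a sequence
cv = CountVectorizer()
X = cv.fit_transform(sents)

# Hold out 20% of the sequences for testing, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

print(X.shape, X_train.shape, X_test.shape)
```

CountVectorizer also accepts an ngram_range argument, which would count runs of neighbouring k-mers instead of single k-mers; whether that helps is something to check on the real data.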


Classification by Multinomial Naive Bayes

Initialize the model and train with the following command:

Then we use the trained model to predict on the test set, storing the result in y_pred:
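The training and prediction steps might look like the following sketch, which uses scikit-learn's MultinomialNB on made-up toy data. The smoothing value alpha=0.1 is an assumption, not a value taken from this article.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def sentencize(sequence, k=6):
    """Turn a sequence into a space-separated string of overlapping k-mers."""
    sequence = sequence.lower()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up toy data: class 0 is AT-rich, class 1 is GC-rich
rng = random.Random(0)
seqs = ["".join(rng.choices("ACGT", weights=w, k=60))
        for w in ([4, 1, 1, 4], [1, 4, 4, 1]) for _ in range(20)]
labels = [0] * 20 + [1] * 20

X = CountVectorizer().fit_transform([sentencize(s) for s in seqs])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

# alpha is the smoothing strength; 0.1 is an assumed value
model = MultinomialNB(alpha=0.1)
model.fit(X_train, y_train)

# Predict the class of each held-out sequence
y_pred = model.predict(X_test)
print(y_pred)
```

Multinomial Naive Bayes is a natural fit here because the features are non-negative k-mer counts.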

From that result, we can evaluate the model using the following common metrics:
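As an illustration, the sketch below computes accuracy, precision, recall, F1, and a confusion matrix with scikit-learn. The label lists are made up, get_metrics is a hypothetical helper name, and the weighted averaging is an assumption, although it is consistent with recall matching accuracy in the results reported later in this article.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true and predicted labels for three classes
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

def get_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
    recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
    f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
print("accuracy = %.3f\nprecision = %.3f\nrecall = %.3f\nf1 = %.3f"
      % (accuracy, precision, recall, f1))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

With weighted averaging, each class contributes in proportion to its number of true samples, which is why weighted recall always equals accuracy.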

From there we obtained the following results

Support vector machine

The following code is used to train as well as evaluate the results of the model
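A rough sketch of training and evaluating a support vector machine on the same kind of features is given here. The toy data is made up, and the linear kernel is an assumption; the kernel and other hyperparameters used for the results in this article are not stated.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def sentencize(sequence, k=6):
    """Turn a sequence into a space-separated string of overlapping k-mers."""
    sequence = sequence.lower()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up toy data: class 0 is AT-rich, class 1 is GC-rich
rng = random.Random(0)
seqs = ["".join(rng.choices("ACGT", weights=w, k=60))
        for w in ([4, 1, 1, 4], [1, 4, 4, 1]) for _ in range(20)]
labels = [0] * 20 + [1] * 20

X = CountVectorizer().fit_transform([sentencize(s) for s in seqs])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

# The linear kernel is an assumed choice for this sketch
svm_model = SVC(kernel="linear")
svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, y_pred)
svm_f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)
print("accuracy = %.3f, f1 = %.3f" % (svm_accuracy, svm_f1))
```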

The results are as follows:

Predictions from DNA sequence data of other species

Similar to processing human genome data, we proceed with DNA data of other species as follows:
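One detail worth illustrating: the data of the other species should be vectorized with the vocabulary already learned from the human data, so that the feature columns line up with what the models were trained on. A minimal sketch with made-up sequences (the variable names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

def sentencize(sequence, k=6):
    """Turn a sequence into a space-separated string of overlapping k-mers."""
    sequence = sequence.lower()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up sequences standing in for the human and chimpanzee data
human_seqs = ["ATGCATGCATGCA", "GGGCCCGGGCCCA"]
chimp_seqs = ["ATGCATGCATGGA", "TTTTTTTTTTTTT"]

# fit_transform learns the k-mer vocabulary from the human data
cv = CountVectorizer()
X_human = cv.fit_transform([sentencize(s) for s in human_seqs])

# transform (not fit_transform) reuses that vocabulary for the other species;
# k-mers never seen in the human data are simply ignored
X_chimp = cv.transform([sentencize(s) for s in chimp_seqs])

print(X_human.shape, X_chimp.shape)
print(X_chimp.toarray())
```

The second made-up chimpanzee sequence shares no k-mer with the human data, so its row is all zeros. This is one way a model trained on one species can lose accuracy on a more distant one.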

Next, define the evaluate_classifier function as follows:
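A possible shape for evaluate_classifier is sketched below. Only the function name comes from this article; its signature, the returned dictionary, the weighted averaging, and the toy data (including the make_toy_data helper) are assumptions for illustration.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.naive_bayes import MultinomialNB

def sentencize(sequence, k=6):
    """Turn a sequence into a space-separated string of overlapping k-mers."""
    sequence = sequence.lower()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def make_toy_data(seed, n_per_class=20):
    """Made-up data: class 0 sequences are AT-rich, class 1 are GC-rich."""
    rng = random.Random(seed)
    seqs = ["".join(rng.choices("ACGT", weights=w, k=60))
            for w in ([4, 1, 1, 4], [1, 4, 4, 1]) for _ in range(n_per_class)]
    return [sentencize(s) for s in seqs], [0] * n_per_class + [1] * n_per_class

def evaluate_classifier(model, X, y_true):
    """Predict with a fitted model and return common classification metrics."""
    y_pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }

# "Human" stand-in for training, another toy set standing in for a second species
sents_train, y_train = make_toy_data(seed=0)
sents_other, y_other = make_toy_data(seed=1)

cv = CountVectorizer()
model = MultinomialNB(alpha=0.1).fit(cv.fit_transform(sents_train), y_train)

scores = evaluate_classifier(model, cv.transform(sents_other), y_other)
for name, value in scores.items():
    print("%s = %.3f" % (name, value))
```

Wrapping the evaluation in one function makes it easy to apply the same metrics to each model and each species.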

Applying the evaluate_classifier function to the two models trained on our human DNA data, we obtain:

  • Using the Multinomial Naive Bayes model:
    • Prediction on chimpanzee DNA
      • accuracy = 0.993
      • precision = 0.994
      • recall = 0.993
      • f1 = 0.993
    • Prediction on dog DNA data
      • accuracy = 0.926
      • precision = 0.934
      • recall = 0.926
      • f1 = 0.925
  • Using the support vector machine model:
    • Prediction on chimpanzee DNA
      • accuracy = 0.968
      • precision = 0.971
      • recall = 0.968
      • f1 = 0.969
    • Prediction on dog DNA data
      • accuracy = 0.493
      • precision = 0.803
      • recall = 0.493
      • f1 = 0.436

These results suggest that humans and chimpanzees are much more closely related than humans and dogs: both models trained on human DNA score lower on dog DNA than on chimpanzee DNA, and the gap is especially large for the SVM model (accuracy 0.493 on dog DNA versus 0.968 on chimpanzee DNA).


Bioinformatics, a field that integrates high-throughput biological data and statistical modeling through intensive computation, has attracted great attention recently, and DNA sequencing is one of its core problems. The vast amount of information gained from sequencing has given us a deeper understanding of organisms. This article briefly covered the basics of DNA, introduced some libraries used to visualize and process DNA data, and finally gave a small example of applying computer science methods, specifically machine learning, to process DNA sequence data from several species and extract some insights from it.

Source : Viblo