
CN107292124A - Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning

Info

Publication number: CN107292124A
Application number: CN201710490528.5A
Authority: CN
Inventors: 郑灏
Original assignee: Guangdong Guosheng Medical Technology Co Ltd
Current assignee: Guangdong Guosheng Medical Technology Co Ltd
Priority date: 2017-06-25
Filing date: 2017-06-25
Publication date: 2017-10-24

Abstract

The present invention provides a method that uses the eigenvector results of principal component analysis (PCA) to guide the initialization of a deep-learning neural network, and belongs to the technical field of operational taxonomic unit (OTU) identification for metagenomes. Using the ReLU activation function and repeated cross-validation learning, the method performs hierarchical OTU classification of a metagenome from preprocessed metagenomic features, and has the advantages of high specificity and sensitivity.

Description

Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning

Technical field

The invention belongs to the technical field of operational taxonomic unit (OTU) identification for metagenomes, and more particularly relates to a metagenome OTU identification method based on hierarchical principal component deep learning.

Background technology

Metagenomics is an emerging field of bioinformatics and molecular biology research. Its techniques bypass the traditional isolation and culture of microorganisms and extract total DNA directly from environmental samples, opening a new chapter in the study of the species and distribution of environmental microorganisms.

Operational taxonomic unit (OTU) identification is a core technique in metagenomics; its purpose is to determine the microbial species in a metagenome and their proportions. The recent rapid development of next-generation sequencing has made in-depth metagenomic research possible, so good OTU classification algorithms have become all the more important.

The currently popular OTU identification and classification methods are TETRA and Phylopythia. TETRA identifies OTUs in a metagenome using tetramer sequence features; Phylopythia identifies OTUs in a metagenome from known DNA sequences using a support vector machine. Both methods, however, have low specificity and sensitivity in OTU identification and cannot meet the needs of further scientific analysis.

Summary of the invention

In view of the above problems in the prior art, the present invention provides a method that uses the eigenvector results of principal component analysis (PCA) to guide the initialization of a deep-learning neural network and, using the ReLU activation function and repeated cross-validation learning, performs hierarchical OTU classification of a metagenome from preprocessed metagenomic features. The method has the advantages of high specificity and sensitivity.

The present invention achieves the above object through the following technical solution:

A metagenome OTU identification method based on hierarchical principal component deep learning, comprising the following steps:

Step S1, sample processing: isolate the microorganisms present in the sample, extract all DNA from the microorganisms, and perform high-throughput sequencing on the extracted DNA;

Step S2, data preprocessing: perform a preliminary analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicated DNA sequence information and DNA sequence information from known low-quality regions;

Step S3, genomic feature analysis: extract and analyze the chaotic sequence features of the DNA hexamer structure to determine and obtain the metagenomic feature information;

Step S4, principal component analysis: input the metagenomic feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information;

Step S5, building the neural network classification model: build a neural network classification model using the PCA result of step S4 as initialization information, then, using the ReLU activation function f(x) = max(0, x) and repeated cross-validation learning, perform hierarchical OTU classification of the metagenome.
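As an illustration of the kind of feature vector step S3 could produce, the following is a minimal Python sketch that counts normalized hexamer (6-mer) frequencies in a single DNA sequence. The patent does not specify how the hexamer chaotic sequence features are computed, so plain frequency counting is an assumption made here, and the names `hexamer_frequencies`, `HEXAMERS`, `ALPHABET` and `K` are hypothetical.

```python
from collections import Counter
from itertools import product

K = 6  # hexamer length, as named in step S3
ALPHABET = "ACGT"
# Fixed ordering of all 4**6 = 4096 possible hexamers, so every sequence
# maps to a vector with the same layout.
HEXAMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def hexamer_frequencies(sequence):
    """Return normalized 6-mer frequencies for one DNA sequence.

    Assumption: plain frequency counting stands in for the patent's
    hexamer feature analysis. Windows containing characters other than
    A/C/G/T (for example N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + K]
        for i in range(len(seq) - K + 1)
        if set(seq[i:i + K]) <= set(ALPHABET)
    )
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than K, or no valid window: all-zero vector.
        return [0.0] * len(HEXAMERS)
    return [counts[h] / total for h in HEXAMERS]
```

Each sequence yields a 4096-dimensional vector, which is the sort of high-dimensional input that the principal component analysis of step S4 is meant to reduce.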

The data preprocessing of step S2 further includes step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; if conserved region sequences are present, perform OTU classification on them using BLAST; if no conserved region sequences are present, proceed directly to step S3 after step S2 ends.
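The control flow of steps S2 and S21 can be sketched in Python as below. This is only an outline under stated assumptions: the patent does not say how duplicates, low-quality regions or conserved regions are detected, so `low_quality_ids` and the `has_conserved_region` predicate are hypothetical placeholders, and no actual BLAST call is made.

```python
def preprocess(sequences, low_quality_ids):
    """Step S2 sketch: drop exact duplicates and sequences flagged low quality.

    `sequences` is a list of (sequence_id, dna_string) pairs;
    `low_quality_ids` is a hypothetical set of ids known to be low quality.
    """
    seen = set()
    kept = []
    for seq_id, seq in sequences:
        if seq_id in low_quality_ids or seq in seen:
            continue
        seen.add(seq)
        kept.append((seq_id, seq))
    return kept


def route(sequences, has_conserved_region):
    """Step S21 sketch: split sequences into a BLAST queue and a step S3 queue.

    `has_conserved_region` is a caller-supplied predicate (hypothetical);
    sequences for which it returns True would be classified with BLAST,
    the rest continue to feature analysis in step S3.
    """
    to_blast, to_s3 = [], []
    for item in sequences:
        if has_conserved_region(item[1]):
            to_blast.append(item)
        else:
            to_s3.append(item)
    return to_blast, to_s3
```

Routing per sequence follows the embodiments, where step S3 operates on the sequences left unclassified after the BLAST step.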

The specific operations of the principal component analysis in step S4 are as follows:

For the high-dimensional sequence feature vectors $\{x\}$ in $X$, a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$ is obtained by singular value decomposition. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, where $y = \Theta x - \Theta \mu_x$;

Here $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ is used as the initial information that guides the building of the neural network classification model.
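One way to read this transformation is written out below. It assumes that the $N$ feature vectors are stacked as rows of a centered data matrix $X_c$ and that $\Theta$ is taken as the transposed right singular vectors; the patent does not state which factor of the decomposition is used, and the symbols $N$, $X_c$, $U$, $\Sigma$ and $V$ are introduced here only for illustration.

```latex
\begin{aligned}
\mu_x &= \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
X_c = \begin{bmatrix} (x_1-\mu_x)^T \\ \vdots \\ (x_N-\mu_x)^T \end{bmatrix}
      \in \mathbb{R}^{N \times M}, \\
X_c &= U \Sigma V^T, \qquad
\Theta = V^T \in \mathbb{R}^{M \times M}, \qquad \Theta\Theta^T = I, \\
y &= \Theta\,(x - \mu_x) = \Theta x - \Theta \mu_x .
\end{aligned}
```

Under this reading the rows of $\Theta$ are the principal directions, and the last line is the same mapping $y = \Theta x - \Theta \mu_x$ given in step S4.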

Brief description of the drawings

Fig. 1 shows the comparison between the method of the present invention and TETRA on the simHC dataset.

Fig. 2 shows the comparison between the method of the present invention and Phylopythia on simulated synthetic data.

Fig. 3 shows the overall comparison of the method of this patent with TETRA and Phylopythia.

Embodiment

The invention is further described below with reference to specific embodiments.

Embodiment one: OTU classification using the simHC dataset

OTU classification is performed on the widely used simHC dataset. simHC contains the genomes of 113 species, with DNA lengths ranging from 130 to 3,754 bp. The procedure comprises the following steps:

Because the species genomes in the simHC dataset have already been extracted using conventional isolation and extraction methods, the microorganism isolation in step S1 is omitted, and high-throughput sequencing is performed directly on the genomes in the simHC dataset.

Step S2, data preprocessing: perform a preliminary analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicated DNA sequence information and DNA sequence information from known low-quality regions.

Step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; conserved region sequences are present, so a preliminary OTU classification is performed using BLAST.

Step S3, genomic feature analysis: extract and analyze the chaotic sequence features of the DNA hexamer structure of the remaining unclassified sequences to determine and obtain the feature information of the simHC dataset, such as structural information, the positions of the functional sequences, and sequence information features.

Step S4, principal component analysis: input the simHC dataset feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information, following the step S4 operations described above to obtain the unitary matrix Θ.

Step S5, building the neural network classification model: build a neural network classification model using the unitary matrix Θ obtained from the PCA of step S4 as initialization information, then, using the ReLU activation function f(x) = max(0, x) and repeated cross-validation learning, perform hierarchical OTU classification of the simHC dataset. The classification results and the comparison with TETRA are shown in Fig. 1.
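The ReLU activation and the use of Θ as initialization information in step S5 can be sketched in Python as follows. The patent does not describe how Θ enters the network, so this sketch assumes that Θ supplies the initial weights of the first layer and that −Θμ supplies its initial bias; the function names `relu` and `first_layer` are hypothetical.

```python
def relu(x):
    """ReLU activation f(x) = max(0, x), applied element-wise to a list."""
    return [max(0.0, v) for v in x]


def first_layer(theta, mu, x):
    """Hidden activations of a layer whose weights start at theta.

    Assumption: weights W = theta and bias b = -theta @ mu, so the
    pre-activation equals theta @ (x - mu) = theta @ x - theta @ mu,
    the step S4 mapping. `theta` is a list of rows; `mu` and `x` are
    lists of the same length as each row.
    """
    centered = [xi - mi for xi, mi in zip(x, mu)]
    y = [sum(w * c for w, c in zip(row, centered)) for row in theta]
    return relu(y)
```

During training the weights would move away from Θ; the sketch covers only the starting point that the PCA result provides.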

Embodiment two: OTU classification using simulated synthetic data

Because the species genomes in the simulated synthetic data are assembled from the genomes of known species, the microorganism isolation in step S1 is omitted, and high-throughput sequencing is performed directly on the genomes in the simulated synthetic data.

Step S3, genomic feature analysis: extract and analyze the chaotic sequence features of the DNA hexamer structure of the remaining unclassified sequences to determine and obtain the feature information of the simulated synthetic data, such as structural information, the positions of the functional sequences, and sequence information features.

Step S4, principal component analysis: input the simulated synthetic data feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information, following the step S4 operations described above to obtain the unitary matrix Θ.

Step S5, building the neural network classification model: build a neural network classification model using the unitary matrix Θ obtained from the PCA of step S4 as initialization information, then, using the ReLU activation function f(x) = max(0, x) and repeated cross-validation learning, perform hierarchical OTU classification of the simulated synthetic data. The classification results and the comparison with Phylopythia are shown in Fig. 2.
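The cross-validation learning named in step S5 is not detailed in the patent. As a hedged illustration, the Python sketch below generates the train and test index splits of ordinary k-fold cross-validation; the choice of k-fold, the function name `k_fold_indices`, and the parameters `k` and `seed` are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Sample indices are shuffled once with a fixed seed, dealt into k
    folds, and each fold serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Calling the generator several times with different `seed` values gives repeated cross-validation, which is one possible reading of the repeated learning described in the patent.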

Embodiment three: overall comparison of the method of this patent with TETRA and Phylopythia

The method of this patent was applied repeatedly to different samples for OTU classification and compared with TETRA and Phylopythia. The comparison results are shown in Fig. 3.

The three embodiments above show that the method provided by this patent has higher specificity and sensitivity.
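The patent does not state how specificity and sensitivity were computed. For reference, the Python sketch below uses the standard one-vs-rest definitions for a single OTU label; the function name `sensitivity_specificity` and its arguments are hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, otu):
    """One-vs-rest sensitivity and specificity for a single OTU label.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    Returns NaN for a metric whose denominator is zero.
    """
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == otu and p == otu:
            tp += 1
        elif t == otu:
            fn += 1
        elif p == otu:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

For a multi-class OTU result, the function would be applied once per label and the values averaged or reported per OTU.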

The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, and all of these fall within the protection scope of the present invention. Accordingly, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A metagenome operational taxonomic unit (OTU) identification method based on hierarchical principal component deep learning, characterized in that it comprises the following steps:

2. The metagenome OTU identification method based on hierarchical principal component deep learning according to claim 1, characterized in that: the data preprocessing of step S2 further includes step S21, conserved sequence classification, which determines whether the metagenome contains conserved region sequences; if conserved region sequences are present, OTU classification is performed on them using BLAST; if no conserved region sequences are present, step S3 is performed directly after step S2 ends.

3. The metagenome OTU identification method based on hierarchical principal component deep learning according to claim 1, characterized in that: the specific operations of the principal component analysis in step S4 are as follows:

For the high-dimensional sequence feature vectors $\{x\}$ in $X$, a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$ is obtained by singular value decomposition. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, where $y = \Theta x - \Theta \mu_x$;