CN107292124A - Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning
- Publication number
- CN107292124A (application number CN201710490528.5A)
- Authority
- CN
- China
- Prior art keywords
- pivot
- information
- dna
- sequence
- taxon
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Genetics & Genomics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Chemical & Material Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Computational Linguistics (AREA)
- Analytical Chemistry (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention provides a method that uses the eigenvector results of principal component analysis (PCA) to guide the initialization of neural-network deep learning, and belongs to the technical field of operational taxonomic unit (OTU) identification for metagenomes. Using the ReLU activation function and repeated cross-validation learning, the method performs hierarchical OTU classification of a metagenome on the preprocessed metagenomic features, and has the advantages of high specificity and sensitivity.
Description
Technical field
The invention belongs to the technical field of operational taxonomic unit (OTU) identification for metagenomes, and more particularly relates to a metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning.
Background technology
Metagenomics is an emerging field of bioinformatics and molecular biology research. Its techniques bypass the traditional isolation and culture of microorganisms and extract total DNA directly from environmental samples, opening a new chapter for scientists studying the species and distribution of environmental microorganisms.
Operational taxonomic unit (OTU) identification is a core technique in metagenomics; its purpose is to study the microbial species in a metagenome and their proportions. The recent widespread development of next-generation sequencing has made in-depth metagenomic studies possible, so good OTU classification algorithms have become all the more important.
Currently popular OTU identification and classification methods include TETRA and Phylopythia. TETRA performs OTU identification on a metagenome using tetramer sequence features; Phylopythia performs OTU identification on a metagenome using known DNA sequences and a support vector machine. However, the specificity and sensitivity of OTU identification by both methods are low and cannot meet the needs of further scientific analysis.
Summary of the invention
In view of the above problems in the prior art, the present invention provides a method that uses the eigenvector results of principal component analysis (PCA) to guide the initialization of neural-network deep learning and, through the ReLU activation function and repeated cross-validation learning, performs hierarchical OTU classification of a metagenome on the preprocessed metagenomic features. The method has the advantages of high specificity and sensitivity.
The present invention achieves the above object through the following technical solution:
A metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning, comprising the following steps:
Step S1, sample processing: isolate the microorganisms present in the sample, extract all DNA from the microorganisms, and perform high-throughput sequencing on the extracted DNA;
Step S2, data preprocessing: perform a preliminary analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions;
Step S3, gene feature analysis: extract the DNA hexamer-structure chaotic sequence features for analysis, and determine and obtain the metagenomic feature information;
Step S4, principal component analysis: input the metagenomic feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information;
Step S5, building the neural network classification model: build a neural network classification model using the principal component analysis results of step S4 as initialization information, then perform repeated cross-validation learning with the ReLU activation function f(x) = max(0, x), and carry out hierarchical operational taxonomic unit classification of the metagenome.
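Step S3 does not spell out how the hexamer-based features are computed. As a minimal sketch, assuming that plain overlapping 6-mer frequency counting stands in for the feature analysis (the helper name `hexamer_frequencies` and the decision to skip windows containing ambiguous bases are illustrative assumptions, not taken from the patent), a fixed-length feature vector could be built as follows:

```python
from itertools import product

K = 6  # hexamer length, per step S3
ALPHABET = "ACGT"
# All 4**6 = 4096 possible hexamers, in a fixed lexicographic order.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def hexamer_frequencies(seq):
    """Return a 4096-element list of relative hexamer frequencies.

    Hypothetical helper: windows containing characters other than
    A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


features = hexamer_frequencies("ACGTACGTAC")
print(len(features), sum(1 for f in features if f > 0))
```

Each sequence then maps to a vector of the same length regardless of how long the sequence is, which is the kind of input the later principal component step expects.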
Wherein, the data preprocessing of step S2 further includes step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; if conserved region sequences are present, classify them into operational taxonomic units using BLAST; if not, proceed directly to step S3 once step S2 is complete.
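Steps S2 and S21 can be pictured as a filter followed by a routing decision. The sketch below is illustrative only: the `Record` type, the mean-quality threshold used as a stand-in for "known low-quality regions", and the `has_conserved_region` predicate are all hypothetical, and the BLAST-based classification itself is not implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    name: str
    seq: str
    mean_quality: float  # hypothetical per-sequence quality score


def preprocess(records: Iterable[Record], min_quality: float = 20.0) -> List[Record]:
    """Step S2 sketch: drop low-quality records and exact duplicate sequences."""
    seen = set()
    kept = []
    for rec in records:
        if rec.mean_quality < min_quality:
            continue  # assumed proxy for a known low-quality region
        key = rec.seq.upper()
        if key in seen:
            continue  # duplicate DNA sequence information
        seen.add(key)
        kept.append(rec)
    return kept


def route(records: Iterable[Record],
          has_conserved_region: Callable[[Record], bool]
          ) -> Tuple[List[Record], List[Record]]:
    """Step S21 sketch: split records into (for BLAST, for step S3)."""
    to_blast, to_features = [], []
    for rec in records:
        (to_blast if has_conserved_region(rec) else to_features).append(rec)
    return to_blast, to_features


raw = [
    Record("r1", "ACGTACGT", 35.0),
    Record("r2", "acgtacgt", 33.0),  # duplicate of r1, ignoring case
    Record("r3", "GGGGCCCC", 10.0),  # below the quality threshold
    Record("r4", "TTGACAGG", 30.0),
]
clean = preprocess(raw)
# Toy predicate: treat a fixed motif as the "conserved region" marker.
to_blast, to_features = route(clean, lambda r: "TTGACA" in r.seq)
print([r.name for r in to_blast], [r.name for r in to_features])
```

In practice the predicate would be replaced by whatever conserved-region test the pipeline uses; only the records in the second list continue to feature extraction.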
Wherein, the specific operation of the principal component analysis in step S4 is as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as the initial information that guides construction of the neural network classification model.
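The transformation y = Θx − Θμ_x can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function name `pca_transform`, the toy random data, and the choice to take Θ as the matrix of right singular vectors of the centered data are illustrative, since the patent does not say which matrix is decomposed.

```python
import numpy as np


def pca_transform(X):
    """Return (theta, mu, Y) for a data matrix X of shape (n_samples, M).

    theta is an orthogonal M x M matrix whose rows are principal directions,
    mu is the sample mean, and each row of Y is y = theta @ (x - mu).
    """
    mu = X.mean(axis=0)
    centered = X - mu
    # full_matrices=True makes vt a square M x M orthogonal matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=True)
    theta = vt
    Y = centered @ theta.T
    return theta, mu, Y


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # toy correlated features
theta, mu, Y = pca_transform(X)

# theta is orthogonal, and y = theta x - theta mu holds for every sample.
print(np.allclose(theta @ theta.T, np.eye(4)))
print(np.allclose(Y[0], theta @ X[0] - theta @ mu))
```

Because Θ is real and orthogonal, the mapping only rotates the centered data; the transformed components are mutually uncorrelated on the sample.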
Brief description of the drawings
Fig. 1: comparison of the results of analyzing the simHC data set with the method of the present invention and with TETRA.
Fig. 2: comparison of the results of analyzing simulated synthetic data with the method of the present invention and with Phylopythia.
Fig. 3: overall comparison of the method of this patent with TETRA and Phylopythia.
Detailed description of the embodiments
The invention is further described below with reference to specific embodiments.
Embodiment one: OTU classification using the simHC data set
OTU classification is performed on the widely used simHC data set, which contains the genomes of 113 species with DNA lengths ranging from 130 to 3,754 bp.
The metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning comprises the following steps:
Because the species genomes in the simHC data set have already been extracted by conventional isolation and extraction methods, the microorganism isolation of step S1 is omitted, and high-throughput sequencing is performed directly on the genomes in the simHC data set.
Step S2, data preprocessing: perform a preliminary analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions.
Step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; where conserved region sequences are present, perform a preliminary operational taxonomic unit classification using BLAST.
Step S3, gene feature analysis: extract the hexamer-structure chaotic sequence features of the remaining unclassified DNA for analysis, and determine and obtain the feature information of the simHC data set, such as structural information, the positional information of each functional sequence, and sequence information features.
Step S4, principal component analysis: input the feature information of the simHC data set, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information, as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as the initial information that guides construction of the neural network classification model.
Step S5, building the neural network classification model: build a neural network classification model using the unitary matrix $\Theta$ obtained from the principal component analysis of step S4 as initialization information, then perform repeated cross-validation learning with the ReLU activation function f(x) = max(0, x), and carry out hierarchical operational taxonomic unit classification of the simHC data set. The classification results and the comparison with TETRA are shown in Fig. 1.
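Step S5 can be illustrated with a small NumPy sketch. Everything beyond "initialize from Θ, use ReLU, cross-validate" is an assumption made for illustration: the single hidden layer, the softmax output with cross-entropy loss, plain gradient descent, the learning rate, the five folds, and the toy two-class data are not specified in the patent, and the hierarchical part of the classification is not modeled.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)  # f(x) = max(0, x)


def pca_matrix(X):
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=True)
    return vt, mu


def train(X, y, n_classes, epochs=200, lr=0.1, seed=0):
    """One hidden ReLU layer whose weights start at the PCA matrix theta."""
    rng = np.random.default_rng(seed)
    theta, mu = pca_matrix(X)
    W1 = theta.copy()          # PCA-guided initialization (assumed form)
    b1 = -theta @ mu           # so that W1 x + b1 = theta (x - mu) at the start
    W2 = rng.normal(scale=0.1, size=(n_classes, W1.shape[0]))
    b2 = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        Z = X @ W1.T + b1
        H = relu(Z)
        logits = H @ W2.T + b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - onehot) / len(X)   # gradient of mean cross-entropy w.r.t. logits
        dZ = (G @ W2) * (Z > 0)     # backpropagate through ReLU
        W2 -= lr * G.T @ H
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * dZ.T @ X
        b1 -= lr * dZ.sum(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    W1, b1, W2, b2 = params
    return np.argmax(relu(X @ W1.T + b1) @ W2.T + b2, axis=1)


def cross_validate(X, y, n_classes, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        params = train(X[train_idx], y[train_idx], n_classes)
        scores.append(np.mean(predict(params, X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))


# Toy data: two well-separated Gaussian clusters standing in for two OTUs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, size=(60, 4)),
               rng.normal(-2.0, 1.0, size=(60, 4))])
y = np.array([0] * 60 + [1] * 60)
accuracy = cross_validate(X, y, n_classes=2)
print(accuracy)
```

Starting the hidden layer at Θ means the network initially computes the ReLU of the principal components, and training then adjusts those weights rather than learning them from a random start.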
Embodiment two: OTU classification using simulated synthetic data
The metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning comprises the following steps:
Because the species genomes of the simulated synthetic data are assembled from known species genes, the microorganism isolation of step S1 is omitted, and high-throughput sequencing is performed directly on the genomes in the simulated synthetic data.
Step S2, data preprocessing: perform a preliminary analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions.
Step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; where conserved region sequences are present, perform a preliminary operational taxonomic unit classification using BLAST.
Step S3, gene feature analysis: extract the hexamer-structure chaotic sequence features of the remaining unclassified DNA for analysis, and determine and obtain the feature information of the simulated synthetic data, such as structural information, the positional information of each functional sequence, and sequence information features.
Step S4, principal component analysis: input the feature information of the simulated synthetic data, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information, as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as the initial information that guides construction of the neural network classification model.
Step S5, building the neural network classification model: build a neural network classification model using the unitary matrix $\Theta$ obtained from the principal component analysis of step S4 as initialization information, then perform repeated cross-validation learning with the ReLU activation function f(x) = max(0, x), and carry out hierarchical operational taxonomic unit classification of the simulated synthetic data. The classification results and the comparison with Phylopythia are shown in Fig. 2.
Embodiment three: overall comparison of the method of this patent with TETRA and Phylopythia
The method of this patent was applied repeatedly to different samples for operational taxonomic unit classification and compared with TETRA and Phylopythia; the comparison results are shown in Fig. 3.
The three embodiments above show that the method provided by this patent has higher specificity and sensitivity.
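The patent does not state how specificity and sensitivity were computed. Assuming the standard one-vs-rest definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP) for a chosen taxon, a minimal sketch follows; the labels are made-up toy values, not results from the patent.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity and specificity for the label `positive`."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Toy labels for three hypothetical taxa A, B and C.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]
result_b = sensitivity_specificity(y_true, y_pred, "B")
print(result_b)
```

Computing the pair once per taxon and averaging is one common way to summarize a multi-class OTU classifier, though the patent does not say which summary it used.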
The embodiments described above express only several implementations of the present invention. Although they are described specifically and in detail, they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, and all of these fall within the scope of protection of the present invention. The scope of protection of this patent is therefore determined by the appended claims.
Claims (3)
1. A metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning, characterized in that it comprises the following steps:
Step S1, sample processing: isolate the microorganisms present in the sample, extract all DNA from the microorganisms, and perform high-throughput sequencing on the extracted DNA;
Step S2, data preprocessing: perform a preliminary analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions;
Step S3, gene feature analysis: extract the DNA hexamer-structure chaotic sequence features for analysis, and determine and obtain the metagenomic feature information;
Step S4, principal component analysis: input the metagenomic feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information;
Step S5, building the neural network classification model: build a neural network classification model using the principal component analysis results of step S4 as initialization information, then perform repeated cross-validation learning with the ReLU activation function f(x) = max(0, x), and carry out hierarchical operational taxonomic unit classification of the metagenome.
2. The metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning according to claim 1, characterized in that the data preprocessing of step S2 further includes step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; if conserved region sequences are present, classify them into operational taxonomic units using BLAST; if not, proceed directly to step S3 once step S2 is complete.
3. The metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning according to claim 1, characterized in that the specific operation of the principal component analysis in step S4 is as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as the initial information that guides construction of the neural network classification model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710490528.5A CN107292124A (en) | 2017-06-25 | 2017-06-25 | Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107292124A true CN107292124A (en) | 2017-10-24 |
Family
ID=60099549
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710490528.5A Pending CN107292124A (en) | 2017-06-25 | 2017-06-25 | Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107292124A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101168729A (en) * | 2006-10-25 | 2008-04-30 | 中国科学院上海生命科学研究院 | Highly effective hydrogen yield photosynthetic bacterium and application thereof |
CN102477460A (en) * | 2010-11-24 | 2012-05-30 | 深圳华大基因科技有限公司 | Method for sequencing and clustering analysis of metagenome 16S hypervariable region V6 |
CN102517392A (en) * | 2011-12-26 | 2012-06-27 | 深圳华大基因研究院 | Metagenome 16S hypervariable region V3 based classification method and device thereof |
CN106055928A (en) * | 2016-05-29 | 2016-10-26 | 吉林大学 | Classification method for metagenome contigs |
WO2017070123A1 (en) * | 2015-10-19 | 2017-04-27 | Dovetail Genomics, Llc | Methods for genome assembly, haplotype phasing, and target independent nucleic acid detection |
CN106682454A (en) * | 2016-12-29 | 2017-05-17 | 中国科学院深圳先进技术研究院 | Method and device for data classification of metagenome |
Non-Patent Citations (4)
Title |
---|
THOMAS J. SHARPTON et al.: "A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data", 《PLOS COMPUTATIONAL BIOLOGY》 * |
丁啸: "基于序列特征的宏基因组数据分析方法研究", 《中国博士学位论文全文数据库 基础科学辑》 * |
刘建军: "太赫兹时域光谱技术在转基因物质检测上的识别方法研究", 《中国博士学位论文全文数据库 农业科技辑》 * |
罗幸: "宏基因组分类分析方法的研究和应用", 《中国优秀硕士学位论文全文数据库 基础科学辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111564179A (en) * | 2020-05-09 | 2020-08-21 | 厦门大学 | Species biology classification method and system based on triple neural network |
CN111564179B (en) * | 2020-05-09 | 2022-04-29 | 厦门大学 | Species biology classification method and system based on triple neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20171024 |