CN107292124A - Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning - Google Patents

Publication number
CN107292124A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710490528.5A
Other languages
Chinese (zh)
Inventor
郑灏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Guosheng Medical Technology Co Ltd
Original Assignee
Guangdong Guosheng Medical Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-06-25
Filing date
2017-06-25
Publication date
2017-10-24

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Chemical & Material Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computational Linguistics (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method that uses the eigenvector results of principal component analysis to guide the initialization of a deep-learning neural network, and belongs to the technical field of operational taxonomic unit (OTU) identification for metagenomes. Using the ReLU activation function and repeated cross-validation learning, the method performs hierarchical OTU classification of a metagenome from preprocessed metagenomic features, and has the advantages of high specificity and sensitivity.

Description

Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning
Technical field
The invention belongs to the technical field of operational taxonomic unit identification for metagenomes, and in particular relates to a metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning.
Background
Metagenomics is an emerging field of bioinformatics and molecular biology research. It bypasses traditional microbial isolation and culture by extracting total DNA directly from environmental samples, opening a new chapter in the study of the species and distribution of environmental microorganisms.
Operational taxonomic unit (OTU) identification is a core technique in metagenomics; its purpose is to determine the microbial species in a metagenome and their proportions. The recent widespread adoption of next-generation sequencing has made in-depth metagenomic studies possible, which makes good OTU classification algorithms all the more important.
Currently popular OTU identification and classification methods include TETRA and Phylopythia. TETRA identifies OTUs in a metagenome using tetramer sequence signatures; Phylopythia identifies OTUs in a metagenome using known DNA sequences and a support vector machine. However, the specificity and sensitivity of OTU identification with both methods are low and cannot meet the needs of further scientific analysis.
Summary of the invention
In view of the above problems in the prior art, the invention provides a method that uses the eigenvector results of principal component analysis to guide the initialization of a deep-learning neural network and, using the ReLU activation function and repeated cross-validation learning, performs hierarchical OTU classification of a metagenome from preprocessed metagenomic features. The method has the advantages of high specificity and sensitivity.
The invention achieves the above purpose through the following technical solution:
A metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning, comprising the following steps:
Step S1, sample processing: isolate the microorganisms present in the sample, extract all DNA from the microorganisms, and perform high-throughput sequencing on the extracted DNA;
Step S2, data preprocessing: perform an initial analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions;
Step S3, gene feature analysis: extract and analyze the chaos sequence features of DNA hexamer structures to determine and obtain the metagenomic feature information;
Step S4, principal component analysis: input the metagenomic feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information;
Step S5, building the neural network classification model: build a neural network classification model using the principal component analysis result of step S4 as initialization information, then apply the ReLU activation function f(x) = max(0, x) with repeated cross-validation learning to perform hierarchical operational taxonomic unit classification of the metagenome.
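The repeated cross-validation mentioned in step S5 is not specified further in the text. As a minimal sketch, assuming ordinary k-fold splitting (the function name `k_fold_splits`, the fold count and the seed are illustrative choices, not taken from the patent), the index bookkeeping could look like this:

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Assumption: plain k-fold splitting over shuffled sample indices;
    the patent only says "multiple cross validation learning".
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation
```

Each sample lands in exactly one validation fold, so every feature vector is used for both training and validation across the k rounds.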
The data preprocessing of step S2 further includes step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences. If it does, operational taxonomic unit classification is performed using BLAST; if it does not, step S3 is performed directly after step S2 ends.
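The branching in step S21 can be sketched as a simple routing step. The sketch below assumes a caller-supplied predicate `has_conserved_region` (hypothetical; the text does not say how conserved regions are detected) and does not call BLAST itself; it only separates the sequences that would go to BLAST from those that continue to step S3.

```python
def route_sequences(sequences, has_conserved_region):
    """Split sequences into (for_blast, for_feature_analysis).

    has_conserved_region is a hypothetical predicate supplied by the
    caller; the patent does not define the detection rule.
    """
    for_blast, for_feature_analysis = [], []
    for seq in sequences:
        if has_conserved_region(seq):
            for_blast.append(seq)             # classified by BLAST in step S21
        else:
            for_feature_analysis.append(seq)  # passed on to step S3
    return for_blast, for_feature_analysis
```

For illustration only, a toy predicate such as `lambda s: "GGATCC" in s` can stand in for a real conserved-region test.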
The specific operation of the principal component analysis in step S4 is as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as initialization information that guides the building of the neural network classification model.
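One way to read the transformation above, offered as a sketch rather than the patent's own derivation: assume the $N$ feature vectors are stacked as the rows of a centered data matrix $X_c$ (the text does not say which matrix is decomposed), and take $\Theta$ to be the transposed matrix of right singular vectors.

```latex
\begin{aligned}
\mu_x &= \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
X_c = \begin{bmatrix} (x_1-\mu_x)^T \\ \vdots \\ (x_N-\mu_x)^T \end{bmatrix} \in \mathbb{R}^{N \times M},\\
X_c &= U \Sigma V^T, \qquad \Theta = V^T, \qquad \Theta\Theta^T = I_M,\\
y &= \Theta\,(x - \mu_x) = \Theta x - \Theta \mu_x .
\end{aligned}
```

Under this reading the rows of $\Theta$ are the principal directions, and the sample variance of the $k$-th component $y_k$ is $\sigma_k^2/(N-1)$, where $\sigma_k$ is the $k$-th singular value of $X_c$.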
Brief description of the drawings
Fig. 1 compares the results of analyzing the simHC data set with the method provided by the invention and with TETRA.
Fig. 2 compares the results of analyzing simulated synthetic data with the method provided by the invention and with Phylopythia.
Fig. 3 shows the overall comparison of the method of this patent with TETRA and Phylopythia.
Embodiments
The invention is further described below with reference to specific embodiments.
Embodiment one: OTU classification using the simHC data set
OTU classification is performed using the widely used simHC data set. simHC contains the genomes of 113 species, with DNA lengths ranging from 130 to 3,754 bp. The procedure is as follows:
A metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning, comprising the following steps:
Because the species genomes in the simHC data set have already been extracted using conventional isolation and extraction methods, the microorganism isolation of step S1 is omitted and high-throughput sequencing is performed directly on the genomes in the simHC data set.
Step S2, data preprocessing: perform an initial analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions.
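The filtering in step S2 can be illustrated with a small sketch. It assumes each record is a pair of a sequence and its per-base quality scores, and it uses a mean-quality threshold (`min_mean_quality`, an illustrative parameter) as a stand-in for the "known low-quality regions" that the text does not define.

```python
def preprocess(records, min_mean_quality=20.0):
    """Drop duplicate sequences and low-quality records.

    records: iterable of (sequence, list_of_quality_scores).
    The mean-quality cutoff is an assumption made for illustration;
    the patent does not state how low-quality regions are identified.
    """
    seen = set()
    kept = []
    for seq, quals in records:
        seq = seq.upper()
        if seq in seen:
            continue  # duplicate DNA sequence information
        if not quals or sum(quals) / len(quals) < min_mean_quality:
            continue  # treated here as low quality
        seen.add(seq)
        kept.append(seq)
    return kept
```

The first acceptable copy of each sequence is kept, and the input order is preserved.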
Step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; where conserved region sequences exist, preliminary operational taxonomic unit classification is performed using BLAST.
Step S3, gene feature analysis: extract and analyze the chaos sequence features of the hexamer structures of the remaining unclassified DNA to determine and obtain the feature information of the simHC data set, such as structural information, the positions of functional sequences, and sequence information features.
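The text does not spell out how the hexamer features are computed. A minimal sketch, assuming plain overlapping 6-mer frequencies as a stand-in for the "chaos sequence features" (the names `HEXAMERS`, `INDEX` and `hexamer_frequencies` are illustrative), gives each sequence a 4096-dimensional feature vector:

```python
from itertools import product

# All 4**6 = 4096 hexamers over the DNA alphabet, in a fixed order.
HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]
INDEX = {h: i for i, h in enumerate(HEXAMERS)}

def hexamer_frequencies(seq):
    """Return the relative frequency of every overlapping hexamer in seq.

    Windows containing characters other than A, C, G, T are skipped.
    Assumption: simple k-mer frequencies; the patent's exact feature
    definition is not given.
    """
    counts = [0] * len(HEXAMERS)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - 5):
        j = INDEX.get(seq[i:i + 6])
        if j is not None:
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(HEXAMERS)
    return [c / total for c in counts]
```

Vectors of this size are the kind of high-dimensional input that the principal component analysis of step S4 is meant to reduce.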
Step S4, principal component analysis: input the simHC data set feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information, specifically as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as initialization information that guides the building of the neural network classification model.
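The mapping y = Θx − Θμx can be reproduced numerically. The sketch below assumes the feature vectors are the rows of an array `X` and that Θ is taken from the singular value decomposition of the centered data (the patent does not say which matrix is decomposed); the function names are illustrative.

```python
import numpy as np

def pca_transform_matrix(X):
    """Return (theta, mu) for an (N, M) feature array X.

    theta is the (M, M) orthogonal matrix of right singular vectors
    (as rows) of the centered data; mu is the feature mean.
    """
    mu = X.mean(axis=0)
    # full_matrices=True guarantees an (M, M) matrix even when N < M.
    _, _, vt = np.linalg.svd(X - mu, full_matrices=True)
    return vt, mu

def project(theta, mu, x):
    """Map one feature vector: y = theta x - theta mu."""
    return theta @ x - theta @ mu
```

Because the singular values are returned in descending order, the leading components of y carry the most variance.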
Step S5, building the neural network classification model: build a neural network classification model using the unitary matrix Θ obtained from the principal component analysis of step S4 as initialization information, then apply the ReLU activation function f(x) = max(0, x) with repeated cross-validation learning to perform hierarchical operational taxonomic unit classification of the simHC data set. The classification results and the comparison with TETRA are shown in Fig. 1.
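How Θ enters the network is not detailed in the text. One plausible reading, sketched here as an assumption, is to use Θ as the initial first-layer weights and −Θμx as the initial bias, so that the first pre-activation equals y before training begins. The class `TinyClassifier` is hypothetical; training and the hierarchical (level-by-level) classification are omitted.

```python
import random

def relu(values):
    """ReLU activation f(x) = max(0, x), applied element-wise."""
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

class TinyClassifier:
    """One hidden layer whose weights start from the PCA matrix theta."""

    def __init__(self, theta, mu, n_classes, seed=0):
        rng = random.Random(seed)
        # Assumption: W1 = theta and b1 = -theta @ mu, so W1 x + b1 = y.
        self.W1 = [row[:] for row in theta]
        self.b1 = [-s for s in matvec(theta, mu)]
        # Output layer starts from small random weights.
        self.W2 = [[rng.uniform(-0.1, 0.1) for _ in theta]
                   for _ in range(n_classes)]

    def hidden(self, x):
        z = [a + b for a, b in zip(matvec(self.W1, x), self.b1)]
        return relu(z)

    def predict(self, x):
        scores = matvec(self.W2, self.hidden(x))
        return max(range(len(scores)), key=scores.__getitem__)
```

Starting from the principal directions gives the first layer a decorrelated view of the input before any gradient step is taken.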
Embodiment two: OTU classification using simulated synthetic data
A metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning, comprising the following steps:
Because the species genomes of the simulated synthetic data are assembled from known species genes, the microorganism isolation of step S1 is omitted and high-throughput sequencing is performed directly on the genomes in the simulated synthetic data.
Step S2, data preprocessing: perform an initial analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions.
Step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; where conserved region sequences exist, preliminary operational taxonomic unit classification is performed using BLAST.
Step S3, gene feature analysis: extract and analyze the chaos sequence features of the hexamer structures of the remaining unclassified DNA to determine and obtain the feature information of the simulated synthetic data, such as structural information, the positions of functional sequences, and sequence information features.
Step S4, principal component analysis: input the feature information of the simulated synthetic data, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information, specifically as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as initialization information that guides the building of the neural network classification model.
Step S5, building the neural network classification model: build a neural network classification model using the unitary matrix Θ obtained from the principal component analysis of step S4 as initialization information, then apply the ReLU activation function f(x) = max(0, x) with repeated cross-validation learning to perform hierarchical operational taxonomic unit classification of the simulated synthetic data. The classification results and the comparison with Phylopythia are shown in Fig. 2.
Embodiment three: overall comparison of the method of this patent with TETRA and Phylopythia
The method of this patent was applied repeatedly to classify the operational taxonomic units of different samples and compared with TETRA and Phylopythia; the comparison results are shown in Fig. 3.
The three embodiments above show that the method provided by this patent has higher specificity and sensitivity.
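The text does not state how specificity and sensitivity were computed. As a sketch, assuming the standard one-versus-rest definitions for a single OTU (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)), they can be derived from true and predicted labels as follows; the function name is illustrative.

```python
def sensitivity_specificity(true_labels, predicted_labels, otu):
    """One-versus-rest sensitivity and specificity for a single OTU."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == otu and pred == otu:
            tp += 1
        elif truth == otu:
            fn += 1
        elif pred == otu:
            fp += 1
        else:
            tn += 1
    # NaN marks an undefined ratio (no positives or no negatives).
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Averaging these values over all OTUs is one common way to summarize a multi-class comparison, though the patent does not say which summary its figures use.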
The embodiments described above express only several implementations of the invention. Although their description is relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, all of which fall within the scope of protection of the invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims (3)

1. A metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning, characterized in that it comprises the following steps:
Step S1, sample processing: isolate the microorganisms present in the sample, extract all DNA from the microorganisms, and perform high-throughput sequencing on the extracted DNA;
Step S2, data preprocessing: perform an initial analysis of the reads, contigs and scaffolds obtained in step S1, and remove duplicate DNA sequence information and DNA sequence information from known low-quality regions;
Step S3, gene feature analysis: extract and analyze the chaos sequence features of DNA hexamer structures to determine and obtain the metagenomic feature information;
Step S4, principal component analysis: input the metagenomic feature information, screen the important feature information by statistical testing, and perform principal component analysis on the important feature information;
Step S5, building the neural network classification model: build a neural network classification model using the principal component analysis result of step S4 as initialization information, then apply the ReLU activation function f(x) = max(0, x) with repeated cross-validation learning to perform hierarchical operational taxonomic unit classification of the metagenome.
2. The metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning according to claim 1, characterized in that the data preprocessing of step S2 further includes step S21, conserved sequence classification: determine whether the metagenome contains conserved region sequences; if it does, operational taxonomic unit classification is performed using BLAST; if it does not, step S3 is performed directly after step S2 ends.
3. The metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning according to claim 1, characterized in that the specific operation of the principal component analysis in step S4 is as follows:
For the high-dimensional sequence feature vectors $\{x\}$ in $X$, singular value decomposition yields a unitary matrix $\Theta \in \mathbb{R}^{M \times M}$. Each vector $x$ in the high-dimensional space is mapped by a linear transformation to $y \equiv [y_1, y_2, \ldots, y_M]^T$, with $y = \Theta x - \Theta \mu_x$,
where $\mu_x$ is the mean of $\{x\}$. The resulting $\Theta$ serves as initialization information that guides the building of the neural network classification model.
CN201710490528.5A 2017-06-25 2017-06-25 Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning Pending CN107292124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710490528.5A CN107292124A (en) 2017-06-25 2017-06-25 Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning

Publications (1)

Publication Number Publication Date
CN107292124A true CN107292124A (en) 2017-10-24

Family

ID=60099549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710490528.5A Pending CN107292124A (en) 2017-06-25 2017-06-25 Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning

Country Status (1)

Country Link
CN (1) CN107292124A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564179A (en) * 2020-05-09 2020-08-21 厦门大学 Species biology classification method and system based on triple neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101168729A (en) * 2006-10-25 2008-04-30 中国科学院上海生命科学研究院 Highly effective hydrogen yield photosynthetic bacterium and application thereof
CN102477460A (en) * 2010-11-24 2012-05-30 深圳华大基因科技有限公司 Method for sequencing and clustering analysis of metagenome 16S hypervariable region V6
CN102517392A (en) * 2011-12-26 2012-06-27 深圳华大基因研究院 Metagenome 16S hypervariable region V3 based classification method and device thereof
CN106055928A (en) * 2016-05-29 2016-10-26 吉林大学 Classification method for metagenome contigs
WO2017070123A1 (en) * 2015-10-19 2017-04-27 Dovetail Genomics, Llc Methods for genome assembly, haplotype phasing, and target independent nucleic acid detection
CN106682454A (en) * 2016-12-29 2017-05-17 中国科学院深圳先进技术研究院 Method and device for data classification of metagenome

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
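THOMAS J. SHARPTON et al.: "A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data", 《PLOS COMPUTATIONAL BIOLOGY》 *
丁啸: "Research on metagenomic data analysis methods based on sequence features", 《China Doctoral Dissertations Full-text Database, Basic Sciences》 *
刘建军: "Research on identification methods for detecting transgenic materials using terahertz time-domain spectroscopy", 《China Doctoral Dissertations Full-text Database, Agricultural Science and Technology》 *
罗幸: "Research and application of metagenomic classification analysis methods", 《China Master's Theses Full-text Database, Basic Sciences》 *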

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564179A (en) * 2020-05-09 2020-08-21 厦门大学 Species biology classification method and system based on triple neural network
CN111564179B (en) * 2020-05-09 2022-04-29 厦门大学 Species biology classification method and system based on triple neural network

Similar Documents

Publication Publication Date Title
Peay et al. Fungal community ecology: a hybrid beast with a molecular master
Abdelkareem et al. VirNet: Deep attention model for viral reads identification
CN106446597B (en) Several species feature selecting and the method for identifying unknown gene
CN101303730A (en) Integrated system for recognizing human face based on categorizer and method thereof
CN109886021A (en) A kind of malicious code detecting method based on API overall situation term vector and layered circulation neural network
CN106548041A (en) A kind of tumour key gene recognition methods based on prior information and parallel binary particle swarm optimization
CN112365929A (en) Method for analyzing microbial population induction effect based on metagenome data
CN107292124A (en) Metagenome operational taxonomic unit identification method based on hierarchical principal component deep learning
CN109949863A (en) A method of spirit quality is identified based on Random Forest model
Eisenstein The secret life of cells
CN103348350B (en) Information nucleic acid processing means and processing method thereof
Varshavsky et al. Compact: A comparative package for clustering assessment
CN108009402A (en) A kind of method of the microbial gene sequences disaggregated model based on dynamic convolutional network
CN109686406A (en) A kind of phylogenetic tree figure production method and system
CN110443318A (en) A kind of deep neural network method based on principal component analysis and clustering
CN106709273B (en) The matched rapid detection method of microalgae protein characteristic sequence label and system
CN103339632B (en) Information nucleic acid treating apparatus and processing method thereof
CN109582292B (en) Online interaction cloud platform based on genomics and bioinformatics
CN106650311A (en) Detection and recognition method and system for microorganisms
JP2003028855A (en) Method for evaluation and display of clustered result
Trapnell et al. Monocle: Differential expression and time-series analysis for single-cell RNA-Seq and qPCR experiments
CN110196979B (en) Intent recognition method and device based on distributed system
Lee Use of collections in taxonomic research with a focus on genetic data
Köseoğlu METATRANSCRIPTOMICS ANALYSIS USING MICROBIOME RNA-SEQ DATA
Lu A Comparison of Cell Type Identification for Single-Cell RNA Sequencing Data Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171024