CN110997936A - Method and device for genotyping based on low-depth genome sequencing and application of method and device - Google Patents
- Publication number
- CN110997936A (application number CN201780093812.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
Abstract
A method for genotyping based on low-depth genome sequencing is provided. The method comprises the following steps: (a) performing low-depth genome sequencing on the whole genome of a sample to be tested to obtain a sequencing result consisting of a plurality of sequencing data; (b) constructing, for at least one known variation site, a reference sequence set of the known variation site, wherein the reference sequence set contains the variation types of the known variation site and the sequences upstream and downstream of the variation site; (c) aligning the sequencing result obtained in step (a) against the reference sequence set to determine an alignment result for each known variation site, wherein the alignment result comprises the matching variation types of the sequencing result and the number of matches for each matching variation type; and (d) determining a high-probability variation type of the known variation site based on the alignment result.
Description
PRIORITY INFORMATION
None
The invention relates to the field of biotechnology, specifically to genotyping and pedigree analysis, and more specifically to a method and a device for genotyping based on low-depth genome sequencing and applications thereof.
Existing breed identification (also referred to herein as "pedigree analysis") services are typified by the canine genetic test offered by Wisdom Panel. That test uses a customized microarray chip to determine the genotype at given single-base variation sites in a pet dog's DNA, and then compares the genotypes with data from purebred dogs in a database to report the proportion of each breed in the dog being tested.
The above prior art is based on microarray chips, and each chip run requires hundreds of samples, so samples must be pooled before they are run together on the instrument. As a result, the experimental cycle for any single pedigree analysis sample is long and the cost is high; that is, the technology cannot deliver a detection report to the user quickly and cheaply. In addition, chip-based detection requires a relatively high DNA concentration in the sample, so sampling carries a certain probability of failure. This stricter sampling requirement further aggravates the long cycle and high cost of chip-based detection.
Thus, current techniques for genotyping and pedigree analysis remain to be improved.
Disclosure of Invention
The present invention aims to solve at least one of the problems of the prior art. To this end, an object of the present invention is to provide a genotyping and breed identification technique with low cost and a short detection cycle.
It should be noted that the present invention has been completed based on the following work and findings of the inventors:
The inventors believe that genotyping and pedigree analysis can be performed based on whole genome sequencing, because as the cost of whole genome sequencing falls, genotyping and pedigree analysis based on whole genome sequencing will become cheaper than chip-based detection. Moreover, a sequencing-based scheme does not require samples to be pooled and, compared with the chip scheme, requires less DNA, has a higher sampling success rate and a shorter experimental cycle, and can deliver a detection report quickly.
Furthermore, through a series of experiments and exploratory work, the inventors surprisingly found that genotyping and pedigree analysis of a biological sample to be tested can be effectively realized based on low-depth genome sequencing by obtaining the genotypes of known variation sites from whole-genome low-depth sequencing data, expressing uncertain typing results in probabilistic form, and increasing the tolerance to missing values when comparing against an existing breed database of dogs.
Thus, in one aspect, the present invention provides a method for genotyping based on low-depth genome sequencing, comprising: (a) performing low-depth genome sequencing on the whole genome of a sample to be tested to obtain a sequencing result consisting of a plurality of sequencing data; (b) constructing, for at least one known variation site, a reference sequence set of the known variation site, wherein the reference sequence set contains the variation types of the known variation site and the sequences upstream and downstream of the variation site; (c) aligning the sequencing result obtained in step (a) against the reference sequence set to determine an alignment result for each known variation site, wherein the alignment result comprises the matching variation types of the sequencing result and the number of matches for each matching variation type; and (d) determining a high-probability variation type of the known variation site based on the alignment result. The inventors surprisingly found that, with the method of the present invention, the known variation sites of the sample to be tested can be effectively genotyped from low-depth genome sequencing data, and the organism from which the sample is derived can then be effectively subjected to pedigree analysis based on the genotyping result. In addition, the method has the advantages of low cost, a short detection cycle, and accurate and reliable detection results.
In another aspect, the invention provides a method of performing pedigree analysis on an organism. According to an embodiment of the invention, the method comprises: (1) performing low-depth genome sequencing on the genome of a sample of the organism to be tested using the method described above, and genotyping at least one known variation site of the organism to be tested; and (2) determining the ancestry of the organism based on the genotyping result. According to the embodiment of the invention, this method allows the known variation sites of the sample to be genotyped from low-depth genome sequencing data, so that the pedigree of the organism can be determined.
In yet another aspect, the invention provides an apparatus for genotyping based on low-depth genome sequencing. According to an embodiment of the present invention, the genotyping apparatus comprises: a sequencing unit for performing low-depth genome sequencing on the whole genome of the sample to be tested to obtain a sequencing result consisting of a plurality of sequencing data; a reference sequence set construction unit for constructing, for at least one known variation site, a reference sequence set of the known variation site, wherein the reference sequence set contains the variation types of the known variation site and the sequences upstream and downstream of the variation site; an alignment unit, connected to both the sequencing unit and the reference sequence set construction unit, for receiving the sequencing result from the sequencing unit and aligning it against the reference sequence set to determine an alignment result for each known variation site, wherein the alignment result comprises the matching variation types of the sequencing result and the number of matches for each matching variation type; and a high-probability variation type determination unit, connected to the alignment unit, for determining the high-probability variation type of the known variation site based on the alignment result. With this apparatus, the known variation sites of the sample to be tested can be genotyped from low-depth genome sequencing data; the apparatus is convenient to operate, low in cost, and short in detection cycle, and its detection results are accurate and reliable.
In yet another aspect, the invention provides a system for performing pedigree analysis on an organism. According to an embodiment of the invention, the system comprises: the genotyping apparatus described above, which performs low-depth genome sequencing on the genome of a sample of the organism to be tested and genotypes at least one known variation site of the organism using the low-depth genome sequencing-based genotyping method described above; and an ancestry determining device, connected to the genotyping apparatus, for determining the ancestry of the organism based on the genotyping result. According to the embodiment of the invention, the system can genotype the known variation sites of the sample from low-depth genome sequencing data and thereby determine the pedigree of the organism; it is convenient to operate, low in detection cost, and short in detection cycle, and its detection results are accurate and reliable.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a schematic flow diagram of a method of genotyping based on low-depth genome sequencing according to an embodiment of the invention;
FIG. 2 shows a schematic structural diagram of an apparatus for genotyping based on low-depth genome sequencing according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a system for performing a pedigree analysis on an organism according to an embodiment of the invention;
FIG. 4 shows the results of the pedigree analysis of the pet dog tested in Example 1;
FIG. 5 shows the results of the principal component analysis used to verify the result for the pet dog tested in Example 1;
FIG. 6 shows the results of the pedigree analysis of the pet dog tested in Example 2.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Genotyping method and apparatus
In one aspect of the invention, the invention provides a method for genotyping based on low-depth genomic sequencing. According to the embodiment of the invention, by using the method, the known mutation sites of the sample to be detected can be effectively genotyped based on the low-depth genome sequencing data, and further, the organism from which the sample to be detected is derived can be effectively subjected to pedigree analysis based on the obtained genotyping result. In addition, the method for genotyping based on low-depth genome sequencing has the advantages of low cost, short detection period and accurate and reliable detection result.
According to an embodiment of the invention, referring to fig. 1, the method comprises the steps of:
(a) Performing low-depth genome sequencing on the whole genome of the sample to be tested to obtain a sequencing result consisting of a plurality of sequencing data.
According to embodiments of the invention, the low-depth genome sequencing may be high-throughput sequencing with a sequencing depth of no more than 5. According to some specific examples of the invention, the sequencing depth does not exceed 3.
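As a rough illustration of what a sequencing depth of no more than 5 (or 3) implies, average depth can be estimated as the total number of sequenced bases divided by the genome size. The sketch below is only an aid to reading: the function names, the read count, the read length, and the genome size of about 2.4 Gb are illustrative assumptions and do not come from this document.

```python
def average_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate average sequencing depth as total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def is_low_depth(depth: float, max_depth: float = 5.0) -> bool:
    """Check a depth against an upper bound (5 by default, 3 in some examples)."""
    return depth <= max_depth


# Hypothetical run: 48 million reads of 100 bp on a genome of about 2.4 Gb.
depth = average_depth(48_000_000, 100, 2_400_000_000)
print(depth, is_low_depth(depth), is_low_depth(depth, max_depth=3.0))
```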
(b) Constructing, for at least one known variation site, a reference sequence set of the known variation site, wherein the reference sequence set contains the variation types of the known variation site and the sequences upstream and downstream of the variation site.
It should be noted that the term "variation type" as used herein is to be understood broadly and can be any change relative to the wild type, including but not limited to single nucleotide polymorphisms and insertions and deletions of sequence fragments. Thus, according to embodiments of the present invention, the known variation sites may include sites known to carry single nucleotide polymorphisms, fragment insertions, or fragment deletions.
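Step (b) can be pictured as building one reference sequence per variation type at each known site, by joining the upstream flank, the variation type, and the downstream flank. The following is a minimal sketch under that assumption; the `VariantSite` class, the field names, and the example sequences are hypothetical and are not defined in this document.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class VariantSite:
    site_id: str
    upstream: str             # flanking sequence before the site
    downstream: str           # flanking sequence after the site
    alleles: Tuple[str, ...]  # variation types: a base, an inserted fragment, or "" for a deletion


def build_reference_set(sites: Iterable[VariantSite]) -> Dict[Tuple[str, str], str]:
    """Map (site_id, variation type) to upstream + variation type + downstream."""
    reference = {}
    for site in sites:
        for allele in site.alleles:
            reference[(site.site_id, allele)] = site.upstream + allele + site.downstream
    return reference


# Hypothetical sites: one SNP and one insertion/deletion.
sites = [
    VariantSite("snp1", "ACGT", "TTGA", ("A", "G")),
    VariantSite("indel1", "CCAT", "GGTA", ("", "TTT")),
]
reference_set = build_reference_set(sites)
print(reference_set[("snp1", "G")])
```

In practice the flanks would be long enough to cover a full short read on either side of the site; the four-base flanks here only keep the example readable.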
(c) Aligning the sequencing result obtained in step (a) against the reference sequence set to determine an alignment result for each known variation site, wherein the alignment result comprises the matching variation types of the sequencing result and the number of matches for each matching variation type.
According to some embodiments of the invention, the sequencing data are split into a plurality of short sequences of equal length before step (c) is performed. In low-depth genome sequencing, mismatched bases can significantly reduce the efficiency of genotyping. Therefore, for low-depth genome sequencing, for example at a sequencing depth of no more than 3, mismatches should be avoided as far as possible. The inventors found that the longer a piece of sequencing data is, the greater the probability that it contains a mismatched base. Splitting the sequencing data into a plurality of short sequences therefore effectively reduces the probability of mismatches and improves the genotyping efficiency of low-depth genome sequencing. According to some specific examples of the invention, the short sequences are no more than 50 bp long. According to further embodiments of the present invention, the short sequences are preferably 35 bp long. This reduces the mismatches that arise in overly long sequences, which would otherwise cause reads that should align to the corresponding position to be filtered out by mistake.
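The splitting step described above can be sketched as cutting each read into consecutive pieces of a fixed length. The text does not say whether the pieces overlap or what happens to a trailing remainder; the sketch below assumes non-overlapping pieces and drops any remainder so that every piece has equal length. The function name is hypothetical.

```python
from typing import List


def split_read(read: str, length: int = 35) -> List[str]:
    """Cut a read into consecutive, non-overlapping pieces of exactly `length` bases.

    Assumption: a trailing remainder shorter than `length` is discarded so that
    all pieces have equal length.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [read[i:i + length] for i in range(0, len(read) - length + 1, length)]


# A hypothetical 100 bp read yields two 35 bp pieces; the last 30 bases are dropped.
read = "ACGT" * 25
pieces = split_read(read)
print(len(pieces), [len(p) for p in pieces])
```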
(d) Determining a high-probability variation type of the known variation site based on the alignment result.
In step (c), alignment yields the matching variation types of the sequencing result and the number of matches for each. It will be appreciated by those skilled in the art that which variation types are matched, and how often, depends on the actual variation type at the particular site, i.e., the known variation site. Therefore, once the alignment result has been obtained, the high-probability variation type of the known variation site can be inferred from it by reasoning backwards. The method of the embodiment of the invention can thus obtain relatively reliable genotyping results from low-depth sequencing results.
According to the embodiment of the present invention, the manner of determining the high-probability variation type from the alignment result, that is, the backward inference described above, is not particularly limited.
According to an embodiment of the present invention, step (d) includes determining the high-probability variation type based on a Bayesian model. According to some specific examples of the present invention, the Bayesian model takes the probability of occurrence of a predetermined variation type at a predetermined known variation site as the prior probability, and obtains the posterior probability from the alignment result of step (c). Here, "predetermined" means determined in advance. Specifically, the Bayesian model takes the probability of occurrence of a specific known type at a predetermined known variation site as the prior probability, and combines it with the number of occurrences of that variation type observed in the alignment to give the posterior probability for that variation type, from which the high-probability variation type of the variation site can be determined. In particular, the formula

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

is used to determine the high-probability variation type of a specific variation site, where A denotes the variation type at the site and B the observed alignment result. The Bayesian model takes the known type probability of the known variation site as the prior term P(A)/P(B). The prior probability can be determined by statistical analysis of a plurality of control samples, i.e., samples whose type at the variation site is known, or by assuming that all types are equally likely at the site; for example, for an SNP site, the probability of each of A, T, G, or C occurring at the site is 0.25. The alignment result serves as the observation: when one read aligns to the sequence corresponding to a certain type, the likelihood of the base type carried by that read under that type is P(B | A). The posterior probability P(A | B) obtained from the Bayesian model gives the final high-probability variation type of the known variation site.
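To make the Bayesian step concrete, the sketch below computes a posterior over the candidate variation types at one site from their match counts. It is a simplified illustration, not the model of this document: it assumes a single true variation type per site, independent reads, and a fixed per-read error rate (`error_rate`, an assumed parameter) spread evenly over the other types. The uniform prior of 0.25 per base for an SNP follows the text; the function name is hypothetical.

```python
from typing import Dict, Optional


def allele_posterior(
    match_counts: Dict[str, int],
    prior: Optional[Dict[str, float]] = None,
    error_rate: float = 0.01,
) -> Dict[str, float]:
    """Return P(type | observed matches) for each candidate variation type."""
    alleles = list(match_counts)
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two candidate variation types")
    if prior is None:
        # Equal prior for every type, e.g. 0.25 for A, T, G, C at an SNP site.
        prior = {a: 1.0 / n for a in alleles}
    unnormalized = {}
    for true_allele in alleles:
        likelihood = 1.0
        for observed, count in match_counts.items():
            # Assumed error model: a read matches the true type with
            # probability 1 - error_rate, otherwise any other type equally.
            p = 1.0 - error_rate if observed == true_allele else error_rate / (n - 1)
            likelihood *= p ** count
        unnormalized[true_allele] = prior[true_allele] * likelihood
    evidence = sum(unnormalized.values())  # plays the role of P(B)
    return {a: v / evidence for a, v in unnormalized.items()}


# Two reads matched "A" at a hypothetical SNP site; nothing matched the others.
posterior = allele_posterior({"A": 2, "C": 0, "G": 0, "T": 0})
best = max(posterior, key=posterior.get)
print(best, posterior)
```

With no matching reads at all, the posterior equals the prior, which is how an uncertain typing result at low depth stays expressed as a probability rather than a hard call.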
In addition, to make a large number of alignment results easy to compare, the alignment results can be built into a match count-variation type database. The type of database is not particularly limited; according to some embodiments of the present invention, it may take the form of a hash table in which the variation type is the key and the number of matches is the value. The database can then be queried more quickly and conveniently, and the results are more accurate and reliable.
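A hash table of this kind maps naturally onto a Python dictionary. The sketch below counts matches per variation type, using exact substring matching as a stand-in for a real aligner; that substitution, the `(site_id, variation type)` key layout, and the rule of ignoring fragments that match more than one type at a site are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (site_id, variation type)


def count_matches(short_sequences: Iterable[str], reference_set: Dict[Key, str]) -> Dict[Key, int]:
    """Build a hash table: key = (site, variation type), value = number of matches."""
    counts: Dict[Key, int] = defaultdict(int)
    for fragment in short_sequences:
        hits_by_site = defaultdict(list)
        for (site_id, variation_type), ref_seq in reference_set.items():
            if fragment in ref_seq:  # exact match stands in for alignment
                hits_by_site[site_id].append(variation_type)
        for site_id, hits in hits_by_site.items():
            # Only fragments that distinguish a single variation type are counted;
            # a fragment lying wholly in the shared flank matches every type.
            if len(hits) == 1:
                counts[(site_id, hits[0])] += 1
    return dict(counts)


# Hypothetical reference sequences for one SNP site with types A and G.
reference = {("snp1", "A"): "ACGTATTGA", ("snp1", "G"): "ACGTGTTGA"}
fragments = ["GTGTT", "GTGTT", "ACGT", "GTATT"]
table = count_matches(fragments, reference)
print(table)
```

Looking up the count for a given variation type is then a single dictionary access, for example `table.get(("snp1", "G"), 0)`.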
Accordingly, in another aspect of the present invention, the present invention provides an apparatus for genotyping based on low-depth genomic sequencing. The apparatus is suitable for performing the aforementioned methods of genotyping based on low-depth genomic sequencing. By utilizing the device, the known mutation sites of the sample to be detected can be subjected to genotyping based on low-depth genome sequencing data, and the device is convenient to operate, low in cost, short in detection period and accurate and reliable in detection result.
Referring to fig. 2, the genotyping apparatus 1000, according to an embodiment of the present invention, includes: a sequencing unit 100, a reference sequence set construction unit 200, an alignment unit 300 and a high probability variation type determination unit 400.
According to an embodiment of the present invention, the sequencing unit 100 is used for low-depth genome sequencing of the whole genome of a sample to be tested, so as to obtain a sequencing result composed of a plurality of sequencing data. According to some embodiments of the invention, in the sequencing unit 100, the low depth genome sequencing is high throughput sequencing, with a sequencing depth of no more than 5. According to some embodiments of the invention, the sequencing depth is no more than 3.
According to some embodiments of the invention, the reference sequence set constructing unit 200 is configured to construct, for at least one known mutation site, a reference sequence set of the known mutation site, the reference sequence set containing a mutation type of the known mutation site and sequences upstream and downstream of the mutation site. According to some embodiments of the invention, the known sites of variation include sites known to have single nucleotide polymorphisms, fragment sequence insertions and deletions.
According to the embodiment of the present invention, the alignment unit 300 is connected to both the sequencing unit 100 and the reference sequence set construction unit 200, and is configured to receive the sequencing result from the sequencing unit 100 and align it against the reference sequence set, so as to determine an alignment result for each known variation site, where the alignment result includes the matching variation types of the sequencing result and the number of matches for each matching variation type.
According to some embodiments of the present invention, the apparatus further comprises a sequence dividing unit (not shown in the figures) connected to the sequencing unit 100 and the alignment unit 300, respectively, for dividing the sequencing data into a plurality of short sequences with equal length in advance before performing the alignment. According to some specific examples of the invention, the short sequence is no more than 50bp in length. According to further embodiments of the invention, the short sequence is 35bp in length.
According to some embodiments of the present invention, the high-probability variation type determination unit 400 is connected to the alignment unit 300 and is configured to determine the high-probability variation type of the known variation site based on the alignment result. According to some embodiments of the present invention, the high-probability variation type determination unit 400 determines the high-probability variation type based on a Bayesian model. Specifically, the Bayesian model takes the probability of occurrence of a specific known type at a predetermined known variation site as the prior probability, and combines it with the number of occurrences of that variation type observed in the alignment to give the posterior probability for that variation type, from which the high-probability variation type of the variation site can be determined. In particular, the formula

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

is used to determine the high-probability variation type of a specific variation site, where A denotes the variation type at the site and B the observed alignment result. The Bayesian model takes the known type probability of the known variation site as the prior term P(A)/P(B). The prior probability can be determined by statistical analysis of a plurality of control samples, i.e., samples whose type at the variation site is known, or by assuming that all types are equally likely at the site; for example, for an SNP site, the probability of each of A, T, G, or C occurring at the site is 0.25. The alignment result serves as the observation: when one read aligns to the sequence corresponding to a certain type, the likelihood of the base type carried by that read under that type is P(B | A). The posterior probability P(A | B) obtained from the Bayesian model gives the final high-probability variation type of the known variation site.
In addition, to make a large number of alignment results easy to compare, the alignment results can be built into a match count-variation type database. The type of database is not particularly limited; according to some embodiments of the present invention, it may take the form of a hash table in which the variation type is the key and the number of matches is the value. The database can then be queried more quickly and conveniently, and the results are more accurate and reliable.
Method and system for pedigree analysis
In yet another aspect, the invention provides a method of performing pedigree analysis on an organism. According to the embodiment of the invention, this method allows the known variation sites of a sample of the organism to be tested to be genotyped from low-depth genome sequencing data, so that the pedigree of the organism can be determined.
According to an embodiment of the invention, the method comprises: (1) performing low-depth genome sequencing on the genome of a biological sample to be tested by using the method, and performing genotyping on at least one known variation site of the biological sample to be tested; (2) determining the ancestry of the organism based on the results of the genotyping.
The species of organism to which the method of the present invention is applied is not particularly limited; dogs, cats, and even humans can be subjected to pedigree analysis by the method of the present invention. Thus, according to some embodiments of the invention, the organism is an animal. According to some embodiments of the invention, the animal includes the domestic cat (Felis silvestris catus) and the domestic dog (Canis lupus familiaris).
As will be understood by those skilled in the art, the term "pedigree analysis" as used herein refers to determining the blood relationship, origin, or lineage of a test organism, for example a pet such as a cat or dog. For a particular animal, this means determining the breed of its female or male parent and of its more distant ancestors.
According to some embodiments of the invention, in step (2), the ancestry of the organism is determined based on predetermined characteristic genotypes of close relatives of the organism.
According to an embodiment of the present invention, the step (2) further comprises:
scoring at least one candidate close relative of the organism for at least one of the known variation sites, based on the high-probability variation type of the test organism and the known variation type of the at least one candidate close relative, so as to determine a similarity value for each candidate close relative.
Specifically, according to an embodiment of the present invention, determining the ancestry of the organism further comprises: comparing, for the known variation sites, the high-probability variation type of the organism to be tested with the variation types of a plurality of candidate close relatives, and scoring each candidate close relative so as to determine its similarity value. It will be understood by those skilled in the art that a higher similarity value indicates a closer relationship between the test organism and the candidate organism. It should be noted that the similarity value is the characteristic value referred to in the examples; the two terms are used interchangeably herein, and both denote the relative similarity between the pet dog to be tested and each of the candidate breeds.
According to an embodiment of the present invention, the step (2) further comprises:
dividing at least a portion of the genomic sequence of the test organism into a plurality of windows, each of the plurality of windows containing at least one of the known sites of variation; and
classifying at least a portion of the plurality of windows based on the similarity values of the respective candidate close relatives, so as to determine the candidate close-relative source to which each of the at least a portion of the plurality of windows corresponds.
Specifically, according to an embodiment of the present invention, determining the ancestry of the organism further comprises: dividing the DNA sequence of the organism to be tested into a plurality of windows of approximately the same length, each window containing at least one known variation site; and classifying each of the resulting windows based on the similarity values of the candidate close relatives, so as to determine the close-relative source corresponding to each window. It should be noted that the method for classifying the windows based on the similarity values is not particularly limited; it may use models including but not limited to random forest, support vector machine and naive Bayes, implemented with the classification methods of the party and libsvm libraries in the R language. The preferred classification method is a random forest model. A random forest is a classification model that combines decision trees to obtain a better result: a plurality of decision trees are constructed, each decision tree classifies the samples according to the weight of each point in combination with the input characteristic values, and the classification given by the random forest model is obtained by combining the classifications of the individual decision trees. Thus, according to an embodiment of the present invention, the genome sequence is divided into fragments of the same length, and the window of each fragment is classified using the base types of the variation sites within it, for example the SNP typing, as characteristic values, so that the window can be assigned to a breed; that is, the DNA sequence of the window is presumed to originate from that breed.
It should be noted that the "same length window" described herein should tolerate a certain amount of length deviation, e.g., 1-10% up and down. According to an embodiment of the present invention, the delineation may be performed in the following manner:
The N SNP sites to be detected on chromosome 1 are denoted S1, S2, S3, ..., Sn; the distance from S1 to S2 is denoted D1, the distance from S2 to S3 is denoted D2, and so on. Given a fixed window size X, the largest set of consecutive SNP sites $S_1, S_2, \ldots, S_a$ satisfying $\sum_{i=1}^{a-1} D_i \le X$ is assigned to one window, which is numbered 1. Then, according to the same rule, the largest set of SNP sites $S_{a+1}, S_{a+2}, \ldots, S_b$ satisfying $\sum_{i=a+1}^{b-1} D_i \le X$ is assigned to another window, which is numbered 2. By analogy, after the windows of chromosome 1 have been delineated, chromosome 2 is divided into windows using the same rule, and so on until all autosomes have been divided into windows.
The specific value of X is determined by the species to be tested and can be 1% of the total length of the autosomes in the whole genome sequence of that species, for example the dog.
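The delineation rule can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name is hypothetical, the SNP sites of one chromosome are given as base positions, and a window is closed as soon as adding the next site would make the span from the first site of the window exceed X.

```python
def cut_windows(positions, window_size):
    """Greedily group the SNP positions of one chromosome into windows.

    The span of a window (last position minus first position) equals the
    sum of the distances D_i between its consecutive sites, and is kept
    at or below `window_size`.
    """
    windows = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[0] > window_size:
            windows.append(current)  # close the window that is full
            current = []
        current.append(pos)
    if current:
        windows.append(current)
    return windows

# Toy positions and window size, for illustration only.
example_windows = cut_windows([0, 5, 9, 12, 30], window_size=10)
```

With the toy input the sites at 0, 5 and 9 form window 1, the site at 12 forms window 2, and the site at 30 forms window 3. Running the function once per autosome and numbering the windows consecutively gives the genome-wide window list.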
After obtaining the possible close-relative source corresponding to each window, according to an embodiment of the present invention, the method further comprises: determining the distance spanned on the genome sequence of the organism to be tested by the known variation sites corresponding to each close-relative source, and determining the ancestry weight of each close-relative source based on the obtained distance.
According to an embodiment of the present invention, preferably, the step (2) may include:
determining the distance of the known variant site corresponding to each candidate parent source on the genome sequence of the organism to be tested;
determining a pedigree weight for each of the candidate parent sources based on the distance.
According to an embodiment of the present invention, after determining the ancestry weight of each close-relative source, the method further comprises: obtaining the breed components of the organism to be tested through weighted calculation; and verifying the obtained breed components of the organism to be tested by a cluster analysis method, so as to determine the ancestry of the organism to be tested. According to some specific examples of the invention, the cluster analysis method is principal component analysis. Principal component analysis is a commonly used dimension reduction method: the multidimensional variables are linearly combined, the few dimensions with the largest variance are found, and the original data are projected onto the new coordinate axes, so that the reduced data retain as much of the information in the original data as possible. According to an embodiment of the present invention, the principal component analysis can be performed using the ppca function in the pcaMethods package in the R language.
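The weighted calculation of breed components can be sketched in Python as follows. This is an illustration under assumptions rather than the implementation of the invention: the function name is hypothetical, each window is assumed to carry one predicted breed label, and the weight of a window is taken to be its length divided by the total autosome length, as in the window weights described elsewhere in this description. The window lengths and labels below are toy values.

```python
from collections import defaultdict

def breed_components(windows, total_autosome_length):
    """Sum the window weights per predicted breed.

    windows: iterable of (breed_label, window_length_in_bases)
    total_autosome_length: the quantity called WG in the description
    """
    components = defaultdict(float)
    for breed, length in windows:
        components[breed] += length / total_autosome_length
    return dict(components)

# Toy example: three windows on a 100-base "genome".
toy_windows = [
    ("Siberian Husky", 30),
    ("Greenland sled dog", 25),
    ("Siberian Husky", 45),
]
toy_components = breed_components(toy_windows, 100)
```

In the toy example the two windows assigned to the first breed contribute 0.30 and 0.45, so the estimated components are 0.75 and 0.25.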
According to some specific examples of the present invention, the method of performing a pedigree analysis on an organism of the present invention may comprise the steps of:
1) Performing low-depth genome sequencing on the whole genome of the sample to be tested. If the reads obtained from the second-generation sequencing platform are longer than 50 bp, each read is cut sequentially from front to back into several short sequences of equal length, and the newly cut short sequences form a new file, called cut-read.
2) The data of the gene chip to be used are found on the website (https://www.illumina.com), and the specified list of single-base variations to be detected, together with the reference sequences before and after each variation, is downloaded. Reference sequences corresponding to the different types at each site to be detected can then be generated in the manner specifically described in the examples; the resulting file is called SNP-index. The detectable variations are not limited to single-base mutations but also include insertions and deletions of short, well-defined sequences of known variant fragments.
3) SOAPaligner2 is downloaded from the website (http://soap.genomics.org.cn); using the SNP-index file obtained in step 2) as input, the data structures required for alignment are built with the 2bwt-builder command.
4) The cut-reads from step 1) are aligned to the SNP-index reference sequences with the soap command, using the parameters "-v 0 -M 0 -r 0".
5) According to the alignment results obtained in step 4), a hash table is established with the name of each aligned SNP-index as the key and its number of occurrences as the value, and the hash table is updated by traversing the alignment results so as to obtain the alignment count of each SNP-index.
6) Assuming that the paternal strand and the maternal strand are detected with equal probability during sequencing, the Bayesian formula is applied to the hash table obtained in step 5). The known type probability of the known variation site is used as the prior probability, where each type is assumed to be equally likely at the variation site; this value is referred to as P(A)/P(B). The alignment result obtained in step 4) is used as the observation: when one read aligns to the sequence corresponding to a certain typing, the probability that the typing value is the base type corresponding to the read is P(B|A), and the posterior probability P(A|B) obtained from the Bayesian model is taken as the final high-probability variation type of the known variation site. The possible single-base typing results of all sites at different depths are obtained according to the formula of the Bayesian model.
7) Comparing the genotypes obtained in step 6) with the single-base typing results of samples of different breeds in the background database, and obtaining a characteristic value for each candidate breed from the expected number of identical sites. It should be noted that, for each site, if the typing result is identical to that of a database sample, the characteristic value of the corresponding breed is increased by one, and if the results differ, the characteristic value is unchanged; the total is then divided by the number of samples of that breed to give the average characteristic value of the breed.
8) Dividing the DNA of the organism to be detected into a plurality of windows with equal length according to the sequence of the positions of the single base mutation appearing on different chromosomes, wherein each window comprises at least one single base mutation site.
The N SNP sites to be detected on chromosome 1 are denoted S1, S2, S3, ..., Sn; the distance from S1 to S2 is denoted D1, the distance from S2 to S3 is denoted D2, and so on. Given a fixed window size X, the largest set of consecutive SNP sites $S_1, S_2, \ldots, S_a$ satisfying $\sum_{i=1}^{a-1} D_i \le X$ is assigned to one window, which is numbered 1. Then, according to the same rule, the largest set of SNP sites $S_{a+1}, S_{a+2}, \ldots, S_b$ satisfying $\sum_{i=a+1}^{b-1} D_i \le X$ is assigned to another window, which is numbered 2. By analogy, after the windows of chromosome 1 have been delineated, chromosome 2 is divided into windows using the same rule, and so on until all autosomes have been divided into windows.
The specific value of X is determined by the species to be tested; for the dog it is 1% of the total length of the autosomes in the complete genome sequence.
9) For each window obtained in step 8), using the characteristic values obtained for the different breeds in step 7), the short DNA segment of the window is classified with models including but not limited to random forest, support vector machine and naive Bayes, using the classification methods of the party and libsvm libraries in the R language. The classification result is the likely breed of the DNA sequence, and the classification is based on the characteristic values obtained in step 7) for known purebred dogs of each breed.
The classification results of the windows are recorded as b1, b2, ..., bn, where each classification result corresponds to a dog breed. The final breed component estimate for a breed $k$ is $\sum_{i=1}^{n} \mathbf{1}[b_i = k]$; that is, the classification results of the individual segments are added up to obtain the total for each breed.
It should be noted that, as will be understood by those skilled in the art, in the foregoing notation "S1, S2, S3, ..., Sn" for the N SNP sites to be detected on chromosome 1 and "b1, b2, ..., bn" for the classification results of the windows, the index n has different meanings: in "Sn" it numbers the SNP sites to be detected, whereas in "bn" it numbers the windows, and "bn" denotes the classification result of the DNA of that window.
According to an embodiment of the invention, the method further comprises calculating the breed components of the organism to be tested by weighting the results of the different windows obtained in step 8) according to the lengths of the DNA sequences they represent, so that the ancestry of the organism to be tested is determined from the proportions of its breed components.
For a window $i$ containing the SNP sites $S_a, S_{a+1}, \ldots, S_b$, the weight of the window is $w_i = \frac{1}{WG}\sum_{j=a}^{b-1} D_j$, where WG is the total number of bases of the autosomes in the dog's whole genome. Each window delineated in step 8) receives a classification result; the classification results of the windows are denoted b1, b2, ..., bn, each corresponding to a dog breed, and the final breed component estimate for a breed $k$ is $\sum_{i=1}^{n} w_i \cdot \mathbf{1}[b_i = k]$.
10) Verifying the detection result obtained in step 9) by using principal component analysis or other clustering methods.
Specifically, the breeds with the largest components among those obtained in step 9) are selected and clustered using principal component analysis or another clustering method. The mean distance between the sample to be tested and the samples of each breed is calculated from the clustering result; if the breed closest to the sample to be tested is the predominant breed obtained in step 9), the result of step 9) is considered reliable.
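The distance check of step 10) can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name is hypothetical, the samples are assumed to have already been projected into a low-dimensional space (for example by principal component analysis), and plain Euclidean distance is used. The coordinates below are toy values.

```python
import math
from collections import defaultdict

def closest_breed(query_point, reference_points):
    """Return the breed with the smallest mean distance to the query.

    reference_points: list of (breed, coordinates) in the reduced space.
    Also returns the mean distance per breed for inspection.
    """
    by_breed = defaultdict(list)
    for breed, point in reference_points:
        by_breed[breed].append(math.dist(query_point, point))
    mean_dist = {b: sum(d) / len(d) for b, d in by_breed.items()}
    return min(mean_dist, key=mean_dist.get), mean_dist

toy_refs = [("A", (0.0, 0.0)), ("A", (0.0, 2.0)), ("B", (10.0, 10.0))]
nearest, distances = closest_breed((0.0, 1.0), toy_refs)
```

If `nearest` equals the predominant breed from step 9), the two analyses agree, which is the consistency criterion described in the step.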
According to some specific examples of the present invention, step 7) is implemented as follows: for each detected site, the typing result of the organism to be tested obtained in step 6) is compared one by one with each sample of each breed in the background database, so as to obtain the similarity (i.e. the characteristic value) between the organism to be tested and each breed. Specifically, the typing result of the organism to be tested at a site is compared with the typing results at that site of the multiple samples of a breed in the background database; if the typing results of the organism to be tested and of a database sample at the site are consistent, the similarity (i.e. the characteristic value) between the sample to be tested and the breed is increased by one, and the comparison results for the multiple samples of the same breed in the background database are weighted and averaged to obtain the similarity (i.e. the characteristic value) of that breed.
The method for performing pedigree analysis on an organism can quickly and accurately obtain the genotyping results of the corresponding sites from second-generation low-depth whole-genome sequencing data. Since the average sequencing depth is only 1x to 2x, it cannot be known in advance which single-base variation sites will be covered, and an exact typing result cannot be obtained for every site. By expressing the uncertain typing results as probabilities and by increasing the tolerance to missing values when comparing against the existing breed database of the organism to be tested, the ancestry of the organism to be tested can nevertheless be determined effectively. (It should be noted that there is no clear-cut answer as to how many missing values this "increased tolerance" can accommodate: the accuracy of breed determination decreases as the proportion of missing values in the data increases, and according to current experience the number of detected SNP sites should be no less than 25% of the total.) In practical application the prospects are broad. For example, the method of the invention can provide a pedigree certificate for a purebred pet dog, proof of a direct genetic relationship between two dogs, or a determination of whether two samples come from the same dog (giving a genetic identity card for the pet dog or cat); it can also give quantitative ancestral component proportions for a mixed-breed dog, and a predicted breed tree within three generations.
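The per-breed similarity calculation can be sketched in Python as follows. This is an illustration under assumptions, not the implementation of the invention: the function name and data layout are hypothetical, genotypes are compared without regard to allele order, sites missing from a database sample are skipped, and the "weighted average" over the samples of a breed is taken to be a plain mean.

```python
def breed_similarity(query, breed_samples):
    """Mean number of matching sites between the query and each breed.

    query: {snp_id: genotype}, e.g. {"snp1": "CC"}
    breed_samples: {breed: [{snp_id: genotype}, ...]}
    """
    def norm(genotype):
        # Treat "AG" and "GA" as the same genotype.
        return "".join(sorted(genotype))

    scores = {}
    for breed, samples in breed_samples.items():
        total = 0
        for sample in samples:
            for snp_id, genotype in query.items():
                ref = sample.get(snp_id)
                # Add one for each site whose typing results agree.
                if ref is not None and norm(ref) == norm(genotype):
                    total += 1
        scores[breed] = total / len(samples) if samples else 0.0
    return scores

toy_query = {"snp1": "CC", "snp2": "AG"}
toy_db = {
    "breed_x": [{"snp1": "CC", "snp2": "GA"}, {"snp1": "CC", "snp2": "GG"}],
    "breed_y": [{"snp1": "AA"}],
}
toy_scores = breed_similarity(toy_query, toy_db)
```

In the toy data the two samples of `breed_x` match the query at two sites and one site respectively, giving a mean of 1.5, while `breed_y` matches at none.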
In yet another aspect of the invention, a system for performing a pedigree analysis on an organism is provided. According to embodiments of the invention, the system can genotype the known variation sites of a sample of the organism to be tested based on low-depth genome sequencing data so as to determine the ancestry of the organism, and it offers convenient operation, low detection cost, a short detection period, and accurate and reliable results.
According to an embodiment of the present invention, referring to fig. 3, the system 10000 includes: a genotyping device 1000 and a pedigree determination device 2000.
According to an embodiment of the present invention, the genotyping device 1000 is used for performing low-depth genome sequencing on the genome of a sample of a test organism and genotyping at least one known variation site of the test organism by using the aforementioned method for genotyping based on low-depth genome sequencing. According to some embodiments of the invention, the organism is an animal. According to some specific examples of the invention, the animal comprises a domestic cat (Felis silvestris catus) or a domestic dog (Canis lupus familiaris).
According to some embodiments of the invention, the ancestry determining means 2000 is connected to the genotyping means 1000 for determining the ancestry of the organism based on the results of the genotyping. According to some embodiments of the present invention, in the ancestry determining apparatus 2000, the ancestry of the organism is determined based on a predetermined characteristic genotyping of the close relative of the organism.
According to some embodiments of the present invention, the ancestry determining apparatus 2000 further comprises a similarity value determining unit, which is adapted to compare the high probability mutation type of the test organism with the mutation types of a plurality of the candidate organism relatives for the known mutation sites, and score each of the candidate organism relatives to determine a similarity value of each of the candidate organism relatives.
According to some embodiments of the present invention, the ancestry determining apparatus 2000 further comprises a close-relative source determining unit, which divides the DNA sequence of the organism to be tested into a plurality of windows of approximately the same length, each window containing at least one of the known variation sites, and classifies each of the resulting windows based on the similarity values of the candidate close relatives so as to determine the close-relative source corresponding to each window. It should be noted that the method for classifying the windows based on the similarity values is not particularly limited; it may use models including but not limited to random forest, support vector machine and naive Bayes, implemented with the classification methods of the party and libsvm libraries in the R language. The preferred classification method is a random forest model. A random forest is a classification model that combines decision trees to obtain a better result: a plurality of decision trees are constructed, each decision tree classifies the samples according to the weight of each point in combination with the input characteristic values, and the classification given by the random forest model is obtained by combining the classifications of the individual decision trees. Thus, according to an embodiment of the present invention, the genome sequence is divided into fragments of the same length, and the window of each fragment is classified using the base types of the variation sites within it, for example the SNP typing, as characteristic values, so that the window can be assigned to a breed; that is, the DNA sequence of the window is presumed to originate from that breed.
According to some embodiments of the invention, the ancestry determining apparatus 2000 further comprises an ancestry weight determining unit adapted to determine the distance spanned on the genome sequence of the organism to be tested by the known variation sites corresponding to each close-relative source, and to determine the ancestry weight of each close-relative source based on the obtained distance.
According to some embodiments of the present invention, the ancestry determining apparatus 2000 further comprises an ancestry determining unit adapted to perform a principal component analysis on the ancestry weights of the respective close-relative sources so as to determine the ancestry of the test organism. Principal component analysis is a commonly used dimension reduction method: the multidimensional variables are linearly combined, the few dimensions with the largest variance are found, and the original data are projected onto the new coordinate axes, so that the reduced data retain as much of the information in the original data as possible. According to an embodiment of the present invention, the principal component analysis can be performed using the ppca function in the pcaMethods package in the R language.
It should be noted that the method, the apparatus and the application of genotyping based on low-depth genome sequencing according to the present invention have at least one of the following advantages:
1. The method of the invention for performing a pedigree analysis on an organism obtains breed components from low-depth sequencing data and thereby realizes pedigree analysis.
2. The method of the invention for genotyping based on low-depth genome sequencing uses low-depth whole-genome data to estimate the typing result of a known variation site (such as a single-base variation site), whereas traditional variant detection software, such as GATK, cannot normally give a result when the depth is low. In addition, by constructing reference sequences from the sequences before and after each site to be detected, the invention can obtain an accurate single-base typing result in one fifth of the time and with one tenth of the memory required by the traditional method.
3. The method of the invention for performing a pedigree analysis on an organism gives a quantitative estimate of the ancestral components. Similar calculation methods can in future be used to detect the breed components of pets such as pet cats and pet birds and of economically important livestock such as cattle and chickens, and can also be used to detect human ancestral components.
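The construction of per-allele reference sequences mentioned in advantage 2, together with the cutting of reads into short sequences, can be sketched in Python as follows. This is a minimal illustration under assumptions: the function names and the `<SNP id>_<base>` naming are hypothetical, a flank is whatever sequence is supplied (the examples use 50 bp on each side), reads no longer than the cut length are kept whole, and a trailing piece shorter than the cut length is discarded.

```python
def build_snp_index(snp_id, left_flank, alleles, right_flank):
    """Return {name: sequence} with one reference sequence per allele:
    upstream flank + allele base + downstream flank."""
    return {
        f"{snp_id}_{allele}": left_flank + allele + right_flank
        for allele in alleles
    }

def cut_read(read, length=50):
    """Cut a read front to back into consecutive pieces of `length` bases.

    Assumptions: a read no longer than `length` is returned unchanged,
    and a trailing piece shorter than `length` is dropped.
    """
    if len(read) <= length:
        return [read]
    return [read[i:i + length] for i in range(0, len(read) - length + 1, length)]

# Toy flanks of 2 bases instead of 50, for readability.
toy_index = build_snp_index("snpX", "AC", ("A", "G"), "TT")
toy_pieces = cut_read("A" * 120, length=50)
```

A cut read that overlaps the variation site can then match only the reference sequence carrying the same allele when no mismatches are allowed, which is what lets the match counts be read directly as allele observations.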
The scheme of the invention will be explained with reference to the examples. It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting the scope of the invention. Where the examples do not specify particular techniques or conditions, they are carried out according to techniques or conditions described in the literature of the art (for example, see Molecular Cloning: A Laboratory Manual, 3rd edition, J. Sambrook et al., translated by Huang Peitang et al., Science Press) or according to the product instructions. Reagents or instruments for which no manufacturer is indicated are all conventional, commercially available products.
Example 1:
Referring to FIG. 1, the organism to be tested is genotyped by the method of the present invention, and a pedigree analysis is then performed on it.
The organism to be detected is a pet dog, and the dog owner self-states that the organism to be detected is a Siberian Husky. The biological sample to be detected is a saliva sample and is obtained by non-invasively sampling the pet dog by using a PG-100 saliva sampler.
The method comprises the following specific steps:
1) Low-depth genome sequencing is performed on the whole genome of the sample to be tested using a BGI-seq500 sequencing platform. Specifically, DNA is extracted from the saliva, the whole-genome DNA is amplified by an enzyme cutting method, and a library is then constructed. Whole-genome low-depth sequencing is then carried out on the BGI-seq500, with a sequencing depth of 2x to 3x. The reads obtained from the second-generation sequencing platform are cut sequentially into short sequences of 50 bp, and the newly cut short sequences form a new file called cut-read.
2) Illumina Canine HD gene chip data are found from a website (ftp:// webdata2: webdata2@ ussd-ftp. Illumina. com/downloads/product files/Canine HD _ B.csv), and the file in the link is downloaded, wherein the file is a list of single base variation to be detected and reference sequences before and after variation.
The 50 bp sequence before the variation site, the base type at the variation site, and the 50 bp sequence after the variation site are joined in order to obtain the reference sequence for that variation type at the site; the resulting file is called SNP-index. Because each variation site has two possible alleles, two reference sequences are constructed for the same site according to this rule, one for each base type, and each reference sequence is named after the number of the corresponding SNP site and its base type. The SNP-index entries for bases A and G at the site numbered BICF2G630100019 are shown as follows.
3) SOAPaligner2 is downloaded from the website (http://soap.genomics.org.cn); using the SNP-index file obtained in step 2) as input, the 2bwt-builder command is used to create the 13 index files required for alignment, with the suffixes amb, ann, bwt, fmv, hot, lkt, pac, rev.bwt, rev.fmv, rev.lkt, rev.pac, sa, and sai.
4) The cut-reads from step 1) are aligned to the SNP-index reference sequences with the soap command, using the parameters "-v 0 -M 0 -r 0".
5) According to the alignment results obtained in step 4), a hash table is established with the name of each aligned SNP-index as the key and its number of occurrences as the value, and the hash table is updated by traversing the alignment results so as to obtain the alignment count of each SNP-index. The results of this step are shown in the following table; since it contains 160,000 rows, only the first three rows are listed. Each row represents an event in which a cut-read from step 1) is aligned to an SNP-index obtained in step 3); the first column is the SNP number, and the second column is the base corresponding to the read:
SNP number | Base |
BICF2S23657714 | C |
BICF2G630130992 | G |
BICF2G630708586 | G |
6) Assuming that the paternal strand and the maternal strand are detected with equal probability during sequencing, the possible single-base typing results of all sites at different depths are obtained from the hash table of step 5) according to the Bayesian formula.
Assuming that the probability of occurrence of each type at the mutation site is the same, the value is referred to as P (A)/P (B). The sequencing alignment result obtained from the step 4) at this point is used as an observation, the probability that the typing value is the base type corresponding to the read is called P (B | A), and the posterior probability P (A | B) is obtained according to the Bayesian formula described below, that is, the possible typing value at this point is obtained.
The results of this step are shown in the following table; since it contains 160,000 rows, only the first three rows are listed. The first column is the SNP ID, the second column is one possible typing, the third column is the probability of the typing in the second column, the fourth column is another possible typing for the site, and the fifth column is the probability of the typing in the fourth column:
SNP ID | Typing 1 | Probability value of type 1 | Typing 2 | Probability value of type 2 |
BICF2S23657714 | CC | 0.67 | AC | 0.33 |
BICF2G630130992 | GG | 0.8 | GC | 0.2 |
BICF2G630708586 | GG | 0.67 | AG | 0.33 |
7) The genotypes obtained in step 6) are compared with a background database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90441), and a characteristic value is obtained for each candidate breed from the expected number of identical sites. For each SNP site to be detected, the typing result at that site obtained in step 6) is compared one by one with the samples in the background database; if the typing results at the site are consistent, the similarity (i.e. the characteristic value mentioned here) between the sample to be tested and the breed is increased by one. Every sample of every breed in the background database is compared in this way, and after all samples of a breed have been compared, the total is divided by the number of samples of that breed in the database to obtain the similarity of that breed, i.e. its characteristic value.
The results of this step are shown in the following table; since it contains 70 rows, only the first four rows are listed. The first column is the characteristic value and the second column is the corresponding breed:
Characteristic value | Breed |
94607.4 | Siberian Husky |
89423.8028571428 | Greenland Sledge Dog |
89404.921 | Alaskan Malamute |
89399.9492857142 | Chihuahua |
8) Dividing the DNA of the organism to be detected into a plurality of windows with equal length according to the sequence of the positions of the single base mutation appearing on different chromosomes, wherein each window comprises a plurality of sites of the single base mutation.
The N SNP sites to be detected on chromosome 1 are denoted S1, S2, S3, ..., Sn; the distance from S1 to S2 is denoted D1, the distance from S2 to S3 is denoted D2, and so on. Given a fixed window size X, the largest set of consecutive SNP sites $S_1, S_2, \ldots, S_a$ satisfying $\sum_{i=1}^{a-1} D_i \le X$ is assigned to one window, which is numbered 1. Then, according to the same rule, the largest set of SNP sites $S_{a+1}, S_{a+2}, \ldots, S_b$ satisfying $\sum_{i=a+1}^{b-1} D_i \le X$ is assigned to another window, which is numbered 2. By analogy, after the windows of chromosome 1 have been delineated, chromosome 2 is divided into windows using the same rule, and so on until all autosomes have been divided, giving 100 windows numbered 1, 2, 3, ..., 100.
X is 1% of the length of the sum of autosomes in the genome of the dog, i.e., 21 Mbp.
9) Based on the windows obtained in step 8) and the length of the DNA sequence represented by each window, the small DNA segment in each window is classified separately, using the characteristic values of the different breeds obtained in step 7) and the random forest model of the party library and the libsvm library in the R language. The classification result for each window is a possible breed for the corresponding DNA sequence, and the basis for classification is the characteristic values obtained in step 7) for known purebred dogs of each breed. The classification results of the windows are denoted $b_1, b_2, \ldots, b_{100}$, and each label comes from the classification given by the random forest model in this step, so that each of $b_1, b_2, \ldots$ corresponds to one dog breed. $w_1, w_2, \ldots, w_n$ are the weights of the windows, i.e., the ratio of the length of the sequence corresponding to each window to the total sequence length. The final breed component estimate for a breed $B$ is $P_B = \sum_{i=1}^{n} w_i \, \mathbb{1}[b_i = B]$, where $w_i = L_i / WG$, $L_i$ is the length of the DNA sequence contained in window $i$, and $WG$ is the total number of autosomal bases in the dog's whole genome. In other words, the classification results of the windows are averaged with weights given by the window lengths on the chromosomes, yielding the total share of each breed.
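The length-weighted aggregation of step 9) can be sketched as follows, assuming the per-window breed labels have already been produced by a classifier. The function name `breed_composition` and the toy labels and lengths are hypothetical and chosen only to illustrate the weighting.

```python
from collections import defaultdict


def breed_composition(window_labels, window_lengths, genome_length):
    """Sum the weights w_i = L_i / WG of all windows assigned to each breed.

    window_labels: breed label b_i predicted for each window.
    window_lengths: length L_i (in bases) of the DNA sequence in each window.
    genome_length: WG, the total number of autosomal bases.
    """
    composition = defaultdict(float)
    for label, length in zip(window_labels, window_lengths):
        composition[label] += length / genome_length
    return dict(composition)


# Toy inputs, invented for illustration.
labels = ["breed_A", "breed_A", "breed_B", "breed_A"]
lengths = [30, 20, 40, 10]
composition = breed_composition(labels, lengths, genome_length=100)
```

If the windows cover the whole autosomal genome, the resulting shares sum to 1; otherwise they sum to the covered fraction and can be renormalized.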
Through the weighted average of the window classification results, the ancestry of the organism to be detected is found to be 61% Siberian Husky + 39% Greenland sled dog (see FIG. 4; the photograph in FIG. 4 shows the pet dog to be tested).
10) Verifying the detection result obtained in the step 9) by using a principal component analysis method.
Principal component analysis is a common dimensionality reduction method that is implemented in many programming languages and yields a result directly from the input data. It comprises the following steps: 1) computing the covariance matrix of the input matrix; 2) computing the eigenvalues and eigenvectors of that covariance matrix; 3) selecting the two eigenvectors with the largest eigenvalues; and 4) projecting the input matrix onto those eigenvectors.
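The four steps listed above can be sketched with NumPy. Note that this is plain covariance-based PCA written for illustration, not the probabilistic PCA routine of the R package used in the example; the function name `pca_project` and the random input matrix are assumptions.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project the rows of data onto the top principal components."""
    # Center each column so the covariance matrix describes spread around the mean.
    centered = data - data.mean(axis=0)
    # Step 1: covariance matrix of the input (columns are variables).
    cov = np.cov(centered, rowvar=False)
    # Step 2: eigenvalues and eigenvectors (eigh returns them in ascending order).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Step 4: project the centered input onto the selected eigenvectors.
    return centered @ components


# Random matrix standing in for a samples-by-features genotype table.
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))
projected = pca_project(data)
```

Each row of `projected` gives the coordinates of one sample on the two main axes, which is the kind of two-dimensional scatter used to compare a test sample with reference breeds.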
Specifically, the five highest-ranking breeds obtained in step 9) were selected and clustered using the ppca function of the pcaMethods package in the R language.
Fig. 5 shows the principal component analysis results used to verify the dog to be tested. The horizontal and vertical axes are the two most important components, with the Greenland sled dogs at the top left, the Siberian Huskies at the bottom right, and the dog to be tested in the middle. The dog to be tested lies between the Greenland sled dogs and the Siberian Huskies, in line with the proportions obtained in step 9). That is, the detection result obtained in step 9) is verified to be accurate.
The inventors performed breed component estimation for the pet dog according to the above steps, starting from the off-machine sequencing data, obtained a detection result within 2 hours, and reported the result to the owner of the pet dog.
Furthermore, to verify the accuracy of the method, the inventors compared the breed components calculated in this example with the pet dog owner's own description and found the two to be highly consistent.
Specifically, the raw reads obtained from sequencing totaled 5.5 Gbp, a depth of about 2.3x, and the ancestry analysis result is shown in FIG. 4 (a three-generation breed pedigree chart, covering great-grandparents, grandparents and parents, inferred from the DNA data of the pet dog to be tested).
Example 2:
The organism to be tested was subjected to pedigree analysis according to the method of Example 1.
The organism to be tested was a pet dog named Weini, which the owner described as a poodle. The final pedigree analysis result is shown in FIG. 6 (a three-generation breed pedigree chart, covering great-grandparents, grandparents and parents, inferred from the DNA data of the pet dog to be tested). As shown in FIG. 6, the pet dog to be tested is 100% miniature poodle (the photograph in FIG. 6 shows the pet dog to be tested).
In addition, the present invention has been widely used to produce breed component reports for pet dogs belonging to internal users of the applicant (BGI, "Huada Gene"). A total of 48 reports have been issued, covering both purebred and crossbred dogs. It should be emphasized that, based on this practice, the method can deliver a test report within 1 week of receiving a dog's saliva sample, and that the data analysis does not require a mainframe: a report can be produced on a personal computer (4 GB memory) in 2 hours per sample.
The method for genotyping based on low-depth genome sequencing can effectively perform genotyping on the known mutation sites of the sample to be detected based on low-depth genome sequencing data, and further can effectively perform pedigree analysis on the organism from which the sample to be detected is derived based on the obtained genotyping result. In addition, the method has the advantages of low detection cost, short detection period and accurate and reliable detection result.
Although specific embodiments of the invention have been described in detail, those skilled in the art will appreciate that various modifications and substitutions of those details may be made in light of the overall teachings of the disclosure, and such changes are intended to be within the scope of the present invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Claims (40)
- A method for genotyping based on low-depth genomic sequencing, comprising: (a) performing low-depth genome sequencing on the whole genome of a sample to be tested so as to obtain a sequencing result consisting of a plurality of sequencing data; (b) constructing, for at least one known variation site, a reference sequence set of the known variation site, wherein the reference sequence set contains the variation types of the known variation site and the sequences upstream and downstream of the variation site; (c) comparing the sequencing result obtained in step (a) with the reference sequence set so as to determine a comparison result for each known variation site, wherein the comparison result comprises the matching variation types of the sequencing result and the number of matches of each matching variation type; and (d) determining a high-probability variation type of the known variation site based on the comparison result.
- The method of claim 1, wherein the low depth genomic sequencing is high throughput sequencing and the sequencing depth is no more than 5.
- The method of claim 2, wherein the sequencing depth is no more than 3.
- The method of claim 1, wherein the sequencing data is pre-segmented into a plurality of short sequences of equal length before performing step (c).
- The method of claim 4, wherein the short sequence is no more than 50bp in length.
- The method of claim 5, wherein the short sequence is 35bp in length.
- The method of claim 1, wherein the known variant sites comprise sites known to have single nucleotide polymorphisms, fragment sequence insertions, and deletions.
- The method of claim 1, wherein in step (d), the high probability mutation type is determined based on a Bayesian model.
- The method according to claim 8, wherein the bayesian model uses the probability of occurrence of a predetermined mutation type of a predetermined known mutation site as a prior probability and the alignment result obtained in step (c) as a posterior probability.
- The method of claim 8, further comprising constructing the comparison result as a hash table, wherein the mutation type is a key and the number of matches is a key value.
- A method of performing a pedigree analysis on an organism, comprising: (1) performing low depth genome sequencing of a genome of a test organism sample and genotyping at least one known mutation site of the test organism using the method of any one of claims 1 to 10; (2) determining the ancestry of the organism based on the results of the genotyping.
- The method of claim 11, wherein the organism is an animal.
- The method of claim 12, wherein the animal comprises a domestic cat or a domestic dog.
- The method of claim 11, wherein in step (2), the ancestry of the organism is determined based on a predetermined characteristic genotyping of the close relative of the organism.
- The method of claim 14, wherein step (2) further comprises: scoring at least one of the candidate organism's neighbors for at least one of the known mutation sites based on the high probability type of mutation for the test organism and at least one of the candidate organism's neighbors' known types of mutation to determine a similarity value for each of the candidate organism's neighbors.
- The method of claim 15, wherein step (2) further comprises: dividing at least a portion of the genomic sequence of the test organism into a plurality of windows, each of the plurality of windows containing at least one of the known sites of variation; and classifying at least a portion of the plurality of windows based on the similarity values of the respective candidate organism's neighbors to determine candidate sources of the neighbors to which the at least a portion of the plurality of windows correspond.
- The method of claim 16, wherein the classification is performed by at least one of a random forest model, a support vector machine, and naïve Bayes.
- The method of claim 16, wherein step (2) further comprises: determining the distance of the known variant site corresponding to each candidate parent source on the genome sequence of the organism to be tested; determining a pedigree weight for each of the candidate parent sources based on the distance.
- The method of claim 18, wherein step (2) further comprises: determining the ancestry of the test organism based on the ancestry weight of each candidate parent source.
- The method of claim 19, further comprising, after determining the pedigree weight for each of the closely related sources: obtaining the variety component of the organism to be detected through weighting calculation, and verifying the obtained variety component result of the organism to be detected through a cluster analysis method so as to determine the ancestry of the organism to be detected based on the ancestry weight of each candidate parent source.
- An apparatus for genotyping based on low depth genomic sequencing, comprising: a sequencing unit for performing low-depth genome sequencing on the whole genome of the sample to be tested so as to obtain a sequencing result consisting of a plurality of sequencing data; a reference sequence set constructing unit, configured to construct, for at least one known variation site, a reference sequence set of the known variation site, where the reference sequence set includes a variation type of the known variation site and sequences upstream and downstream of the variation site; a comparison unit, connected to the sequencing unit and the reference sequence set constructing unit respectively, for receiving a sequencing result from the sequencing unit and comparing the sequencing result with the reference sequence set so as to determine a comparison result of each known variation site, wherein the comparison result comprises a matching variation type of the sequencing result and the matching times of the matching variation type; and a high-probability mutation type determining unit, connected to the comparison unit, for determining the high-probability mutation type of the known mutation site based on the comparison result.
- The apparatus of claim 21, wherein in the sequencing unit, the low depth genomic sequencing is high throughput sequencing with a sequencing depth of no more than 5.
- The device of claim 22, wherein the sequencing depth is no more than 3.
- The apparatus of claim 21, further comprising a sequence segmentation unit, connected to the sequencing unit and the alignment unit, respectively, for segmenting the sequencing data into a plurality of short sequences of equal length in advance before performing the alignment.
- The apparatus of claim 24, wherein the short sequence is no more than 50bp in length.
- The apparatus of claim 25, wherein the short sequence is 35bp in length.
- The apparatus of claim 21, wherein the known variant sites comprise sites known to have single nucleotide polymorphisms, fragment sequence insertions, and deletions.
- The apparatus of claim 21, wherein the high probability mutation type is determined based on a bayesian model.
- The apparatus according to claim 29, wherein the bayesian model uses a predetermined mutation type occurrence probability of the predetermined known mutation sites as a prior probability and the alignment result as a posterior probability.
- The apparatus of claim 29, further comprising constructing the comparison result as a hash table, wherein the mutation type is a key and the number of matches is a key value.
- A system for performing a pedigree analysis on an organism, comprising: a genotyping device as claimed in any one of claims 21 to 30, for low depth genome sequencing of the genome of a sample of an organism to be tested and genotyping at least one known mutation site of the organism to be tested using the method as claimed in any one of claims 1 to 10; an ancestry determining device coupled to the genotyping device for determining the ancestry of the organism based on the results of the genotyping.
- The system of claim 31, wherein the organism is an animal.
- The system of claim 32, wherein the animal comprises a domestic cat or a domestic dog.
- The system of claim 31, wherein in the lineage determination device, the lineage of the organism is determined based on a predetermined characteristic genotyping of the close relatives of the organism.
- The system of claim 34, wherein the ancestry determination device further comprises a similarity value determination unit adapted to score at least one of the candidate organism relatives for at least one of the known mutation sites based on the high probability mutation type of the test organism and at least one of the known mutation types of the candidate organism's relatives to determine a similarity value for each of the candidate organism's relatives.
- The system according to claim 35, wherein the ancestry determining means further comprises a close source determining unit adapted to perform the steps of: dividing at least a portion of the genomic sequence of the test organism into a plurality of windows, each of the plurality of windows containing at least one of the known sites of variation; and classifying at least a portion of the plurality of windows based on the similarity values of the respective candidate organism's neighbors to determine candidate sources of the neighbors to which the at least a portion of the plurality of windows correspond.
- The system of claim 36, wherein the classification is performed by at least one of a random forest model, a support vector machine, and naïve Bayes.
- The system of claim 36, wherein the ancestry determination device further comprises an ancestry weight determination unit adapted to: determining the distance of the known variant site corresponding to each candidate parent source on the genome sequence of the organism to be tested; and determining a pedigree weight for each of the candidate parent sources based on the distance.
- The system of claim 38, wherein said lineage determination means further comprises determining the lineage of said test organism based on the lineage weights of said candidate parent sources.
- The system of claim 39, further comprising, after determining the pedigree weight for each of the closely related sources: obtaining the variety component of the organism to be detected through weighting calculation, and verifying the obtained variety component result of the organism to be detected through a cluster analysis method so as to determine the ancestry of the organism to be detected based on the ancestry weight of each candidate parent source.
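As an illustration of the claims that recite a Bayesian model and a hash table keyed by mutation type, the following sketch calls a genotype from per-allele match counts. It is not the claimed implementation: the diploid biallelic site, the binomial read-error likelihood, the `error_rate` value, the uniform priors, and the function name `call_genotype` are all assumptions introduced for the example, since the claims do not specify a likelihood model.

```python
from math import comb


def call_genotype(counts, ref, alt, priors, error_rate=0.01):
    """Return the most probable genotype and all posteriors for one known site.

    counts: hash table mapping allele (mutation type) -> number of matching reads.
    priors: dict mapping genotype string (e.g. "AA", "AG", "GG") -> prior probability.
    """
    n_ref = counts.get(ref, 0)
    n_alt = counts.get(alt, 0)
    n = n_ref + n_alt
    # Assumed probability of observing the alt allele under each genotype.
    alt_prob = {ref + ref: error_rate, ref + alt: 0.5, alt + alt: 1 - error_rate}
    unnormalized = {}
    for genotype, p in alt_prob.items():
        # Binomial likelihood of the observed counts given this genotype.
        likelihood = comb(n, n_alt) * p**n_alt * (1 - p) ** n_ref
        unnormalized[genotype] = priors[genotype] * likelihood
    total = sum(unnormalized.values())
    posteriors = {g: v / total for g, v in unnormalized.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Toy counts for one site at low depth, invented for illustration.
uniform = {"AA": 1 / 3, "AG": 1 / 3, "GG": 1 / 3}
best, posteriors = call_genotype({"A": 2, "G": 1}, "A", "G", uniform)
```

At low depth only a few reads cover each site, so the prior can noticeably shift the call; replacing the uniform prior with population frequencies of each mutation type would be the natural refinement.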
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/101128 WO2019047181A1 (en) | 2017-09-08 | 2017-09-08 | Method for genotyping on the basis of low-depth genome sequencing, device and use |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110997936A true CN110997936A (en) | 2020-04-10 |
CN110997936B CN110997936B (en) | 2024-05-10 |
Family
ID=65635230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201780093812.7A Active CN110997936B (en) | 2017-09-08 | 2017-09-08 | Method, device and application of genotyping based on low-depth genome sequencing |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110997936B (en) |
WO (1) | WO2019047181A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883207A (en) * | 2020-07-31 | 2020-11-03 | 武汉蓝沙医学检验实验室有限公司 | Identification method of biological genetic relationship |
CN113186255A (en) * | 2021-05-12 | 2021-07-30 | 深圳思勤医疗科技有限公司 | Method and device for detecting nucleotide variation based on single molecule sequencing |
CN113470746A (en) * | 2021-06-21 | 2021-10-01 | 广州市金域转化医学研究院有限公司 | Method for reducing artificially introduced error mutation in high-throughput sequencing and application |
CN113517022A (en) * | 2021-06-10 | 2021-10-19 | 阿里巴巴新加坡控股有限公司 | Gene detection method, feature extraction method, device, equipment and system |
CN116168763A (en) * | 2022-09-06 | 2023-05-26 | 安诺优达基因科技(北京)有限公司 | Method and device for grouping and assembling autotetraploid genome, method and device for constructing chromosome and application of method and device |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113637747B (en) * | 2021-06-21 | 2023-02-03 | 深圳思勤医疗科技有限公司 | Method for determining SNV and tumor mutation load in nucleic acid sample and application |
CN113327646B (en) * | 2021-06-30 | 2024-04-23 | 南京医基云医疗数据研究院有限公司 | Sequencing sequence processing method and device, storage medium and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101539967A (en) * | 2008-12-12 | 2009-09-23 | 深圳华大基因研究院 | Method for detecting mononucleotide polymorphism |
US20140114582A1 (en) * | 2012-10-18 | 2014-04-24 | David A. Mittelman | System and method for genotyping using informed error profiles |
CN106755300A (en) * | 2016-11-17 | 2017-05-31 | 中国科学院华南植物园 | A kind of method for recognizing Kiwi berry hybrid strain to filial generation genome contribution proportion |
US20170213127A1 (en) * | 2016-01-24 | 2017-07-27 | Matthew Charles Duncan | Method and System for Discovering Ancestors using Genomic and Genealogic Data |
Non-Patent Citations (3)
Title |
---|
ARIEL W. CHAN et al.: "Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data", pages 1 - 17 * |
JEAN-SIMON BROUARD et al.: "Low-depth genotyping-by-sequencing (GBS) in a bovine population: strategies to maximize the selection of high quality genotypes and the accuracy of imputation", vol. 18, pages 1 - 14 * |
RUIQIANG LI et al.: "SNP detection for massively parallel whole-genome resequencing", vol. 19, pages 1124 - 1132, XP055069881, DOI: 10.1101/gr.088013.108 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883207A (en) * | 2020-07-31 | 2020-11-03 | 武汉蓝沙医学检验实验室有限公司 | Identification method of biological genetic relationship |
CN113186255A (en) * | 2021-05-12 | 2021-07-30 | 深圳思勤医疗科技有限公司 | Method and device for detecting nucleotide variation based on single molecule sequencing |
CN113517022A (en) * | 2021-06-10 | 2021-10-19 | 阿里巴巴新加坡控股有限公司 | Gene detection method, feature extraction method, device, equipment and system |
CN113470746A (en) * | 2021-06-21 | 2021-10-01 | 广州市金域转化医学研究院有限公司 | Method for reducing artificially introduced error mutation in high-throughput sequencing and application |
CN113470746B (en) * | 2021-06-21 | 2023-11-21 | 广州市金域转化医学研究院有限公司 | Method for reducing artificially introduced error mutation in high-throughput sequencing and application thereof |
CN116168763A (en) * | 2022-09-06 | 2023-05-26 | 安诺优达基因科技(北京)有限公司 | Method and device for grouping and assembling autotetraploid genome, method and device for constructing chromosome and application of method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110997936B (en) | 2024-05-10 |
WO2019047181A1 (en) | 2019-03-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||