Abstract
Large genomic DNA sequences contain regions with distinctive patterns of sequence organization. We describe a method using logarithms of probabilities based on seventh-order Markov chains to rapidly identify genomic sequences that do not resemble models of genome organization built from compilations of octanucleotide usage. Data bases have been constructed from Escherichia coli and Saccharomyces cerevisiae DNA sequences of > 1000 nt and human sequences of > 10,000 nt. Atypical genes and clusters of genes have been located in bacteriophage, yeast, and primate DNA sequences. We consider criteria for statistical significance of the results, offer possible explanations for the observed variation in genome organization, and give additional applications of these methods in DNA sequence analysis.
Full text
PDFSelected References
These references are in PubMed. This may not be the complete list of references from this article.
- Barondess J. J., Beckwith J. A bacterial virulence determinant encoded by lysogenic coliphage lambda. Nature. 1990 Aug 30;346(6287):871–874. doi: 10.1038/346871a0. [DOI] [PubMed] [Google Scholar]
- Binns M. Contamination of DNA database sequence entries with Escherichia coli insertion sequences. Nucleic Acids Res. 1993 Feb 11;21(3):779–779. doi: 10.1093/nar/21.3.779. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Burge C., Campbell A. M., Karlin S. Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci U S A. 1992 Feb 15;89(4):1358–1362. doi: 10.1073/pnas.89.4.1358. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Coderre P. E., Earhart C. F. The entD gene of the Escherichia coli K12 enterobactin gene cluster. J Gen Microbiol. 1989 Nov;135(11):3043–3055. doi: 10.1099/00221287-135-11-3043. [DOI] [PubMed] [Google Scholar]
- Cuticchia A. J., Ivarie R., Arnold J. The application of Markov chain analysis to oligonucleotide frequency prediction and physical mapping of Drosophila melanogaster. Nucleic Acids Res. 1992 Jul 25;20(14):3651–3657. doi: 10.1093/nar/20.14.3651. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Green P., Lipman D., Hillier L., Waterston R., States D., Claverie J. M. Ancient conserved regions in new gene sequences and the protein databases. Science. 1993 Mar 19;259(5102):1711–1716. doi: 10.1126/science.8456298. [DOI] [PubMed] [Google Scholar]
- Hardison R., Miller W. Use of long sequence alignments to study the evolution and regulation of mammalian globin gene clusters. Mol Biol Evol. 1993 Jan;10(1):73–102. doi: 10.1093/oxfordjournals.molbev.a039991. [DOI] [PubMed] [Google Scholar]
- JOSSE J., KAISER A. D., KORNBERG A. Enzymatic synthesis of deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base sequences in deoxyribonucleic acid. J Biol Chem. 1961 Mar;236:864–875. [PubMed] [Google Scholar]
- Karlin S., Brendel V. Chance and statistical significance in protein and DNA sequence analysis. Science. 1992 Jul 3;257(5066):39–49. doi: 10.1126/science.1621093. [DOI] [PubMed] [Google Scholar]
- Karlin S., Brendel V. Patchiness and correlations in DNA sequences. Science. 1993 Jan 29;259(5095):677–680. doi: 10.1126/science.8430316. [DOI] [PubMed] [Google Scholar]
- Kleffe J., Borodovsky M. First and second moment of counts of words in random texts generated by Markov chains. Comput Appl Biosci. 1992 Oct;8(5):433–441. doi: 10.1093/bioinformatics/8.5.433. [DOI] [PubMed] [Google Scholar]
- Lamperti E. D., Kittelberger J. M., Smith T. F., Villa-Komaroff L. Corruption of genomic databases with anomalous sequence. Nucleic Acids Res. 1992 Jun 11;20(11):2741–2747. doi: 10.1093/nar/20.11.2741. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Oliver S. G., van der Aart Q. J., Agostoni-Carbone M. L., Aigle M., Alberghina L., Alexandraki D., Antoine G., Anwar R., Ballesta J. P., Benit P. The complete DNA sequence of yeast chromosome III. Nature. 1992 May 7;357(6373):38–46. doi: 10.1038/357038a0. [DOI] [PubMed] [Google Scholar]
- Ouellette B. F., Clark M. W., Keng T., Storms R. K., Zhong W., Zeng B., Fortin N., Delaney S., Barton A., Kaback D. B. Sequencing of chromosome I from Saccharomyces cerevisiae: analysis of a 32 kb region between the LTE1 and SPO7 genes. Genome. 1993 Feb;36(1):32–42. doi: 10.1139/g93-005. [DOI] [PubMed] [Google Scholar]
- Pesole G., Prunella N., Liuni S., Attimonelli M., Saccone C. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res. 1992 Jun 11;20(11):2871–2875. doi: 10.1093/nar/20.11.2871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sharp P. M., Lloyd A. T. Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure. Nucleic Acids Res. 1993 Jan 25;21(2):179–183. doi: 10.1093/nar/21.2.179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tagle D. A., Stanhope M. J., Siemieniak D. R., Benson P., Goodman M., Slightom J. L. The beta globin gene cluster of the prosimian primate Galago crassicaudatus: nucleotide sequence determination of the 41-kb cluster and comparative sequence analyses. Genomics. 1992 Jul;13(3):741–760. doi: 10.1016/0888-7543(92)90150-q. [DOI] [PubMed] [Google Scholar]