Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 1994 Jul 19;91(15):7134–7138. doi: 10.1073/pnas.91.15.7134

Atypical regions in large genomic DNA sequences.

S Scherer 1, M S McPeek 1, T P Speed 1
PMCID: PMC44353  PMID: 8041759

Abstract

Large genomic DNA sequences contain regions with distinctive patterns of sequence organization. We describe a method using logarithms of probabilities based on seventh-order Markov chains to rapidly identify genomic sequences that do not resemble models of genome organization built from compilations of octanucleotide usage. Data bases have been constructed from Escherichia coli and Saccharomyces cerevisiae DNA sequences of > 1000 nt and human sequences of > 10,000 nt. Atypical genes and clusters of genes have been located in bacteriophage, yeast, and primate DNA sequences. We consider criteria for statistical significance of the results, offer possible explanations for the observed variation in genome organization, and give additional applications of these methods in DNA sequence analysis.

Full text

PDF
7134

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

  1. Barondess J. J., Beckwith J. A bacterial virulence determinant encoded by lysogenic coliphage lambda. Nature. 1990 Aug 30;346(6287):871–874. doi: 10.1038/346871a0. [DOI] [PubMed] [Google Scholar]
  2. Binns M. Contamination of DNA database sequence entries with Escherichia coli insertion sequences. Nucleic Acids Res. 1993 Feb 11;21(3):779–779. doi: 10.1093/nar/21.3.779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Burge C., Campbell A. M., Karlin S. Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci U S A. 1992 Feb 15;89(4):1358–1362. doi: 10.1073/pnas.89.4.1358. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Coderre P. E., Earhart C. F. The entD gene of the Escherichia coli K12 enterobactin gene cluster. J Gen Microbiol. 1989 Nov;135(11):3043–3055. doi: 10.1099/00221287-135-11-3043. [DOI] [PubMed] [Google Scholar]
  5. Cuticchia A. J., Ivarie R., Arnold J. The application of Markov chain analysis to oligonucleotide frequency prediction and physical mapping of Drosophila melanogaster. Nucleic Acids Res. 1992 Jul 25;20(14):3651–3657. doi: 10.1093/nar/20.14.3651. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Green P., Lipman D., Hillier L., Waterston R., States D., Claverie J. M. Ancient conserved regions in new gene sequences and the protein databases. Science. 1993 Mar 19;259(5102):1711–1716. doi: 10.1126/science.8456298. [DOI] [PubMed] [Google Scholar]
  7. Hardison R., Miller W. Use of long sequence alignments to study the evolution and regulation of mammalian globin gene clusters. Mol Biol Evol. 1993 Jan;10(1):73–102. doi: 10.1093/oxfordjournals.molbev.a039991. [DOI] [PubMed] [Google Scholar]
  8. JOSSE J., KAISER A. D., KORNBERG A. Enzymatic synthesis of deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base sequences in deoxyribonucleic acid. J Biol Chem. 1961 Mar;236:864–875. [PubMed] [Google Scholar]
  9. Karlin S., Brendel V. Chance and statistical significance in protein and DNA sequence analysis. Science. 1992 Jul 3;257(5066):39–49. doi: 10.1126/science.1621093. [DOI] [PubMed] [Google Scholar]
  10. Karlin S., Brendel V. Patchiness and correlations in DNA sequences. Science. 1993 Jan 29;259(5095):677–680. doi: 10.1126/science.8430316. [DOI] [PubMed] [Google Scholar]
  11. Kleffe J., Borodovsky M. First and second moment of counts of words in random texts generated by Markov chains. Comput Appl Biosci. 1992 Oct;8(5):433–441. doi: 10.1093/bioinformatics/8.5.433. [DOI] [PubMed] [Google Scholar]
  12. Lamperti E. D., Kittelberger J. M., Smith T. F., Villa-Komaroff L. Corruption of genomic databases with anomalous sequence. Nucleic Acids Res. 1992 Jun 11;20(11):2741–2747. doi: 10.1093/nar/20.11.2741. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Oliver S. G., van der Aart Q. J., Agostoni-Carbone M. L., Aigle M., Alberghina L., Alexandraki D., Antoine G., Anwar R., Ballesta J. P., Benit P. The complete DNA sequence of yeast chromosome III. Nature. 1992 May 7;357(6373):38–46. doi: 10.1038/357038a0. [DOI] [PubMed] [Google Scholar]
  14. Ouellette B. F., Clark M. W., Keng T., Storms R. K., Zhong W., Zeng B., Fortin N., Delaney S., Barton A., Kaback D. B. Sequencing of chromosome I from Saccharomyces cerevisiae: analysis of a 32 kb region between the LTE1 and SPO7 genes. Genome. 1993 Feb;36(1):32–42. doi: 10.1139/g93-005. [DOI] [PubMed] [Google Scholar]
  15. Pesole G., Prunella N., Liuni S., Attimonelli M., Saccone C. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res. 1992 Jun 11;20(11):2871–2875. doi: 10.1093/nar/20.11.2871. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Sharp P. M., Lloyd A. T. Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure. Nucleic Acids Res. 1993 Jan 25;21(2):179–183. doi: 10.1093/nar/21.2.179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Tagle D. A., Stanhope M. J., Siemieniak D. R., Benson P., Goodman M., Slightom J. L. The beta globin gene cluster of the prosimian primate Galago crassicaudatus: nucleotide sequence determination of the 41-kb cluster and comparative sequence analyses. Genomics. 1992 Jul;13(3):741–760. doi: 10.1016/0888-7543(92)90150-q. [DOI] [PubMed] [Google Scholar]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES