DOI: 10.5555/1065226.1065285

Aligning database columns using mutual information

Published: 15 May 2005

Abstract

As in many large organizations, government data is partitioned in many different ways and collected at different times by different people. The resulting massive data heterogeneity means government staff cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. A case in point is the California Air Resources Board (CARB), which faces the challenge of integrating the emissions inventory databases of California's 35 air quality management districts into a state inventory. This inventory must be submitted annually to the US EPA, which in turn must perform quality assurance tests on the state inventories and integrate them into a national emissions inventory used to track the effects of national air quality policies. The premise of our research is that the manual labor required for database wrapping and integration can be significantly reduced by automatically learning mappings from the data. In this research, we applied statistical algorithms to discover correspondences across comparable datasets. We have seen particular success with an information-theoretic model, called SIfT (Significance Information for Translation), that performs data-driven column alignments. We have applied SIfT to map the Santa Barbara County Air Pollution Control District's 2001 emissions inventory database to the California Air Resources Board statewide inventory database. A fully customizable interface to the SIfT toolkit is available at http://sift.isi.edu/, allowing users to create new alignments, navigate the information-theoretic model, and inspect alignment decisions. On a broader scale, this work makes strides toward addressing a central problem in data management: integrating legacy data.
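
The abstract does not spell out how SIfT scores columns, so the following is only a minimal sketch of one plausible form of data-driven column alignment with mutual information, not the authors' implementation. It assumes that each column is represented by the pointwise mutual information between the column and the values it contains, and that columns from two databases are matched by cosine similarity over those vectors. The function names (`pmi_vectors`, `cosine`, `align`) and the toy column names and values are hypothetical.

```python
import math
from collections import Counter


def pmi_vectors(columns):
    """Map each column name to a {value: PMI(column, value)} vector.

    `columns` is a dict of column name -> list of cell values taken
    from a single database. Only positive PMI scores are kept.
    """
    counts = {name: Counter(values) for name, values in columns.items()}
    total = sum(sum(c.values()) for c in counts.values())
    value_totals = Counter()
    for c in counts.values():
        value_totals.update(c)

    vectors = {}
    for name, c in counts.items():
        column_total = sum(c.values())
        vec = {}
        for value, n in c.items():
            # PMI = log( P(column, value) / (P(column) * P(value)) )
            pmi = math.log((n * total) / (column_total * value_totals[value]))
            if pmi > 0:
                vec[value] = pmi
        vectors[name] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(source, target):
    """Pair each source column with its most similar target column."""
    sv, tv = pmi_vectors(source), pmi_vectors(target)
    return {s: max(tv, key=lambda t: cosine(sv[s], tv[t])) for s in sv}


# Hypothetical toy data: a district table and a statewide table that
# share values but use different column names.
source = {"fac_id": [101, 102, 103], "pollutant": ["CO", "NOX", "CO"]}
target = {"FACILITY": [101, 103, 104], "POL": ["NOX", "CO", "SO2"]}

mapping = align(source, target)
print(mapping)  # {'fac_id': 'FACILITY', 'pollutant': 'POL'}
```

The sketch relies only on overlapping cell values, never on column names, which is the sense in which such an alignment is data-driven. A practical system would also need a similarity threshold so that a column with no real counterpart is left unaligned.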

Published In

dg.o '05: Proceedings of the 2005 national conference on Digital government research
May 2005
328 pages

Sponsors

  • NSF: National Science Foundation

Publisher

Digital Government Society of North America

Author Tags

  1. database alignment
  2. information theory
  3. mutual information

Qualifiers

  • Article

Conference

dg.o '05
Sponsor:
  • NSF
dg.o '05: Digital government research
May 15 - 18, 2005
Atlanta, Georgia, USA

Acceptance Rates

Overall Acceptance Rate 150 of 271 submissions, 55%
