Article

Aligning database columns using mutual information

Authors:

Patrick Pantel,

Andrew Philpot,

Eduard Hovy

dg.o '05: Proceedings of the 2005 national conference on Digital government research

Pages 205 - 210

Published: 15 May 2005

Abstract

As in many large organizations, the Government's data is partitioned in many different ways and collected at different times by different people. The resulting massive data heterogeneity means that government staff cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. A case in point is the California Air Resources Board (CARB), which faces the challenge of integrating the emissions inventory databases of California's 35 air quality management districts into a single state inventory. This inventory must be submitted annually to the US EPA, which in turn must perform quality assurance tests on the state inventories and integrate them into a national emissions inventory used to track the effects of national air quality policies. The premise of our research is that the manual labor required for database wrapping and integration can be significantly reduced by automatically learning mappings from the data. In this research, we applied statistical algorithms to discover correspondences across comparable datasets. We have seen particular success with an information-theoretic model, called SIfT (Significance Information for Translation), that performs data-driven column alignments. We have applied SIfT to map the Santa Barbara County Air Pollution Control District's 2001 emissions inventory database to the California Air Resources Board statewide inventory database. A fully customizable interface to the SIfT toolkit is available at http://sift.isi.edu/, allowing users to create new alignments, navigate the information-theoretic model, and inspect alignment decisions. On a broader scale, this work makes strides toward addressing a central problem in data management: integrating legacy data.
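The abstract does not spell out how SIfT scores candidate alignments, so the following is only a minimal sketch of one way mutual information can drive column alignment, not the authors' implementation. It assumes that each column is represented by the pointwise mutual information (PMI) between the column and the values it contains, and that columns are compared by cosine similarity. The function names (`pmi_vectors`, `cosine`, `align`) and the toy tables are hypothetical and invented for illustration.

```python
import math
from collections import Counter


def pmi_vectors(columns):
    """Map each column name to a {value: PMI(column, value)} vector.

    `columns` maps a column name to the list of cell values in that column.
    PMI(c, v) = log( P(c, v) / (P(c) * P(v)) ), estimated from raw counts.
    """
    counts = {name: Counter(values) for name, values in columns.items()}
    total = sum(sum(c.values()) for c in counts.values())
    value_totals = Counter()
    for c in counts.values():
        value_totals.update(c)

    vectors = {}
    for name, c in counts.items():
        col_total = sum(c.values())
        vectors[name] = {
            value: math.log((n * total) / (col_total * value_totals[value]))
            for value, n in c.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(source, target):
    """Pair each source column with its most similar target column.

    Returns {source_name: target_name or None}; None means no target
    column had a positive similarity score.
    """
    # Tag names so identically named columns in the two tables stay distinct.
    merged = {("source", n): vals for n, vals in source.items()}
    merged.update({("target", n): vals for n, vals in target.items()})
    vectors = pmi_vectors(merged)

    mapping = {}
    for s in source:
        scores = {
            t: cosine(vectors[("source", s)], vectors[("target", t)])
            for t in target
        }
        best = max(scores, key=scores.get)
        mapping[s] = best if scores[best] > 0 else None
    return mapping


# Hypothetical toy tables; the column names and values are made up.
district = {
    "pollutant": ["CO", "NOX", "CO", "SOX"],
    "county": ["Santa Barbara", "Ventura", "Santa Barbara", "Kern"],
}
statewide = {
    "POL": ["NOX", "CO", "PM10", "CO"],
    "COUNTY": ["Kern", "Ventura", "Los Angeles", "Santa Barbara"],
}
print(align(district, statewide))
```

A greedy best-match per source column is the simplest possible decision rule; a real system would likely also need thresholds, one-to-one constraints, and handling for columns whose values only partially overlap.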

Index Terms

Aligning database columns using mutual information
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Information systems
  1. Data management systems


Published In

dg.o '05: Proceedings of the 2005 national conference on Digital government research

May 2005

328 pages

Conference Chairs: Lois Delcambre (Portland State University), Genevieve Giuliano (DGRC and USC/ISI)

Publisher

Digital Government Society of North America

Conference

dg.o '05: Digital government research

May 15 - 18, 2005

Atlanta, Georgia, USA

Acceptance Rates

Overall Acceptance Rate 150 of 271 submissions, 55%

Bibliometrics

Total Citations: 5
Total Downloads: 333
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 25 Jan 2025

Cited By

Kaza S, Chen H (2008). Evaluating ontology mapping techniques. Decision Support Systems 45:4 (714-728). doi:10.1016/j.dss.2007.12.007. Online publication date: 1-Nov-2008.
https://dl.acm.org/doi/10.1016/j.dss.2007.12.007
Degwekar S, DePree J, Beck H, Thomas C, Su S, Cushing J, Pardo T (2007). Event-triggered data and knowledge sharing among collaborating government organizations. Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains (102-111). doi:10.5555/1248460.1248477. Online publication date: 20-May-2007.
https://dl.acm.org/doi/10.5555/1248460.1248477
Kaza S, Wang Y, Chen H (2007). Enhancing border security. Decision Support Systems 43:1 (199-210). doi:10.1016/j.dss.2006.09.007. Online publication date: 1-Feb-2007.
https://dl.acm.org/doi/10.1016/j.dss.2006.09.007
Pantel P, Philpot A, Hovy E, Fortes J, Macintosh A (2006). Matching and integration across heterogeneous data sources. Proceedings of the 2006 international conference on Digital government research (438-439). doi:10.1145/1146598.1146738. Online publication date: 21-May-2006.
https://dl.acm.org/doi/10.1145/1146598.1146738
Kaza S, Wang Y, Chen H (2006). Suspect vehicle identification for border safety with modified mutual information. Proceedings of the 4th IEEE international conference on Intelligence and Security Informatics (308-318). doi:10.1007/11760146_27. Online publication date: 23-May-2006.
https://dl.acm.org/doi/10.1007/11760146_27
