DOI: 10.5555/1760894.1760959
Article

Electricity based external similarity of categorical attributes

Published: 30 April 2003

Abstract

Similarity or distance measures are fundamental to data mining tools. Categorical attributes abound in databases: the Car Make, Gender, and Occupation fields in an automobile insurance database, for example, are very informative. Unfortunately, categorical data is not easily amenable to similarity computations. A domain expert might manually specify some or all of the similarity relationships, but this is error-prone, infeasible for attributes with large domains, and of no use for cross-attribute similarities, such as between Gender and Occupation. External similarity functions define a similarity between, say, Car Makes by looking at how they co-occur with the other categorical attributes. We exploit a rich duality between random walks on graphs and electrical circuits to develop REP, an external similarity function. REP is theoretically grounded, whereas the only prior work was ad hoc. We demonstrate the usefulness of REP in two experiments. First, we cluster categorical attribute values and show improved inferred relationships. Second, we use REP effectively as a nearest-neighbour classifier.
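The abstract does not define REP, so the following is only a minimal sketch of the general idea it builds on, not the authors' method. It assumes a graph whose nodes are attribute values, with an edge between two values whenever they co-occur in a record and a conductance equal to the co-occurrence count. The effective resistance between two values, which is known to be proportional to the commute time of a random walk between them, then serves as a distance. The function names `build_conductances` and `effective_resistance`, and the `make=`/`occ=` value labels, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_conductances(records):
    """Map each unordered pair of co-occurring values to its co-occurrence count.

    Assumption: the count is used directly as the edge conductance.
    """
    cond = defaultdict(float)
    for record in records:
        for u, v in combinations(sorted(set(record)), 2):
            cond[(u, v)] += 1.0
    return dict(cond)


def effective_resistance(cond, source, target):
    """Effective resistance between two nodes of a connected weighted graph.

    Grounds `target` at potential 0, injects one unit of current at `source`,
    and solves the reduced Laplacian system; the potential at `source` is the
    resistance.
    """
    if source == target:
        return 0.0
    nodes = sorted({n for edge in cond for n in edge})
    free = [n for n in nodes if n != target]  # grounded node is dropped
    idx = {n: i for i, n in enumerate(free)}
    k = len(free)

    # Augmented matrix [reduced Laplacian | injected current].
    a = [[0.0] * (k + 1) for _ in range(k)]
    for (u, v), c in cond.items():
        for x, y in ((u, v), (v, u)):
            if x in idx:
                a[idx[x]][idx[x]] += c
                if y in idx:
                    a[idx[x]][idx[y]] -= c
    a[idx[source]][k] = 1.0

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("graph must be connected")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                if factor:
                    for j in range(col, k + 1):
                        a[r][j] -= factor * a[col][j]

    s = idx[source]
    return a[s][k] / a[s][s]


# Toy records (invented for illustration): (Car Make, Occupation).
records = [
    ("make=Toyota", "occ=Teacher"),
    ("make=Honda", "occ=Teacher"),
    ("make=Honda", "occ=Engineer"),
    ("make=Ferrari", "occ=Engineer"),
]
cond = build_conductances(records)
near = effective_resistance(cond, "make=Toyota", "make=Honda")
far = effective_resistance(cond, "make=Toyota", "make=Ferrari")
```

In this toy data the co-occurrence graph is a simple chain, so Toyota ends up closer to Honda (they share an occupation) than to Ferrari (connected only through Honda). Repeated co-occurrences raise an edge's conductance and therefore lower the resistance between the values it links.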



Published In

PAKDD '03: Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
April 2003
610 pages
ISBN: 3540047603
Editors:
  • Kyu-Young Whang
  • Jongwoo Jeon
  • Kyuseok Shim
  • Jaideep Srivastava

Sponsors

  • Statistical Research Center for Complex Systems
  • KAIST: Korea Advanced Institute of Science and Technology
  • The Korean Datamining Society
  • Advanced Information Technology Research Center
  • Korea Info Sci Society: Korea Information Science Society
  • Air Force Office of Scientific Research/Asian Office of Aerospace R&D


Publisher

Springer-Verlag

Berlin, Heidelberg


Qualifiers

  • Article
