Electricity based external similarity of categorical attributes

Authors:

Christopher R. Palmer,

Christos Faloutsos

PAKDD '03: Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining

Pages 486 - 500

Published: 30 April 2003

Abstract

Similarity or distance measures are fundamental to data mining tools. Categorical attributes abound in databases. Fields such as Car Make, Gender, and Occupation in an automobile insurance database are very informative. Unfortunately, categorical data is not easily amenable to similarity computations. A domain expert might manually specify some or all of the similarity relationships, but this is error-prone and infeasible for attributes with large domains, and it does not help with cross-attribute similarities, such as between Gender and Occupation. External similarity functions define a similarity between, say, Car Makes by looking at how they co-occur with the other categorical attributes. We exploit a rich duality between random walks on graphs and electrical circuits to develop REP, an external similarity function. REP is theoretically grounded, whereas the only prior work was ad hoc. The usefulness of REP is shown in two experiments. First, we cluster categorical attribute values, showing improved inferred relationships. Second, we use REP effectively as a nearest neighbour classifier.
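The abstract does not define REP itself, so the following is only a minimal sketch of the general idea it points to: treat co-occurrence counts between categorical values as conductances in a circuit and compare values by the effective resistance between them. The function names `cooccurrence_graph` and `effective_resistance`, the toy records, and the choice of one edge per co-occurring value pair are all assumptions made for illustration, not the authors' construction.

```python
from collections import defaultdict


def cooccurrence_graph(records):
    """Build a weighted graph over (attribute, value) nodes.

    Assumption for this sketch: the conductance of an edge is the number of
    records in which the two values appear together.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for record in records:
        items = list(record.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                u, v = items[i], items[j]
                graph[u][v] += 1.0
                graph[v][u] += 1.0
    return graph


def effective_resistance(graph, a, b):
    """Effective resistance between nodes a and b of a connected graph.

    Node b is grounded (voltage 0), one unit of current is injected at a,
    and the resulting voltage at a equals the effective resistance.
    """
    if a == b:
        return 0.0
    nodes = [n for n in graph if n != b]
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    # Reduced Laplacian (row and column of b removed), augmented with the RHS.
    aug = [[0.0] * (size + 1) for _ in range(size)]
    for u in nodes:
        i = index[u]
        for v, w in graph[u].items():
            aug[i][i] += w
            if v != b:
                aug[i][index[v]] -= w
    aug[index[a]][size] = 1.0
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("graph must be connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return aug[index[a]][size]


# Hypothetical toy insurance records (not data from the paper).
records = [
    {"make": "Volvo", "job": "teacher"},
    {"make": "Saab", "job": "teacher"},
    {"make": "Ferrari", "job": "banker"},
    {"make": "Saab", "job": "banker"},
]
graph = cooccurrence_graph(records)
volvo, saab, ferrari = ("make", "Volvo"), ("make", "Saab"), ("make", "Ferrari")
print(f"{effective_resistance(graph, volvo, saab):.2f}")     # 2.00
print(f"{effective_resistance(graph, volvo, ferrari):.2f}")  # 4.00
```

In this toy graph Volvo and Saab share an occupation, so the resistance between them is lower than between Volvo and Ferrari, which are linked only through Saab. By the result of Chandra et al., effective resistance is proportional to the commute time of a random walk between the two nodes, which is the duality the abstract refers to.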

Published In

PAKDD '03: Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining

April 2003

610 pages

ISBN: 3540047603

Editors:
Kyu-Young Whang, Korea Advanced Institute of Science and Technology, Computer Science Department, Daejeon, Korea
Jongwoo Jeon, Seoul National University, Department of Statistics, Seoul, Korea
Kyuseok Shim, Seoul National University, School of Electrical Engineering and Computer Science, Seoul, Korea
Jaideep Srivastava, University of Minnesota, Department of Computer Science and Engineering, Minneapolis, MN

In-Cooperation

SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data

Publisher

Springer-Verlag

Berlin, Heidelberg
