Exploratory Data Mining and Data Cleaning
May 2003
Publisher:
  • John Wiley & Sons, Inc.
  • 605 Third Ave. New York, NY
  • United States
ISBN: 978-0-471-26851-2
Published: 01 May 2003
Pages: 203
Abstract

From the Publisher:

A groundbreaking addition to the existing literature, Exploratory Data Mining and Data Cleaning serves as an important reference for data analysts who need to analyze large amounts of unfamiliar data, for operations managers, and for students in undergraduate or graduate-level courses dealing with data analysis and data mining.

Cited By

  1. Luo Z, Xiong K, Zhu J, Chen R, Shu X, Weng D and Wu Y (2024). Ferry: Toward Better Understanding of Input/Output Space for Data Wrangling Scripts, IEEE Transactions on Visualization and Computer Graphics, 31:1, (1202-1212), Online publication date: 1-Jan-2025.
  2. Petricek T, Burg G, Nazábal A, Ceritli T, Jiménez-Ruiz E and Williams C (2023). AI Assistants: A Framework for Semi-Automated Data Wrangling, IEEE Transactions on Knowledge and Data Engineering, 35:9, (9295-9306), Online publication date: 1-Sep-2023.
  3. Domova V and Vrotsou K (2023). A Model for Types and Levels of Automation in Visual Analytics: A Survey, a Taxonomy, and Examples, IEEE Transactions on Visualization and Computer Graphics, 29:8, (3550-3568), Online publication date: 1-Aug-2023.
  4. Han L, Chen T, Demartini G, Indulska M and Sadiq S (2023). A Data-Driven Analysis of Behaviors in Data Curation Processes, ACM Transactions on Information Systems, 41:3, (1-35), Online publication date: 31-Jul-2023.
  5. Yu S, Chen T, Han L, Demartini G and Sadiq S (2023). DataOps-4G: On Supporting Generalists in Data Quality Discovery, IEEE Transactions on Knowledge and Data Engineering, 35:5, (4668-4681), Online publication date: 1-May-2023.
  6. Yu S, Han L, Indulska M, Sadiq S and Demartini G Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency Proceedings of the ACM Web Conference 2023, (3859-3867)
  7. Kasica S, Berret C and Munzner T Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, (1-18)
  8. Sarathy J, Song S, Haque A, Schlatter T and Vadhan S Don’t Look at the Data! How Differential Privacy Reconfigures the Practices of Data Science Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, (1-19)
  9. Parulian N and Ludäscher B DCM explorer Proceedings of the 14th International Workshop on the Theory and Practice of Provenance, (1-6)
  10. De Bie T, De Raedt L, Hernández-Orallo J, Hoos H, Smyth P and Williams C (2022). Automating data science, Communications of the ACM, 65:3, (76-87), Online publication date: 1-Mar-2022.
  11. Moore J, Goffin P, Wiese J and Meyer M (2021). An Interview Method for Engaging Personal Data, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5:4, (1-28), Online publication date: 27-Dec-2022.
  12. Lynch C, Gore R, Collins A, Cotter T, Grigoryan G and Leathrum J Increased need for data analytics education in support of verification and validation Proceedings of the Winter Simulation Conference, (1-12)
  13. Parulian N and Ludäscher B Towards Transparent Data Cleaning: The Data Cleaning Model Explorer (DCM/X) Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries, (326-327)
  14. Yang J, He Y and Chaudhuri S (2021). Auto-pipeline, Proceedings of the VLDB Endowment, 14:11, (2563-2575), Online publication date: 1-Jul-2021.
  15. Han L, Chen T, Demartini G, Indulska M and Sadiq S On Understanding Data Worker Interaction Behaviors Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, (269-278)
  16. Wu Y, Tannen V and Davidson S PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, (447-462)
  17. Passalis N and Tefas A (2019). Discriminative clustering using regularized subspace learning, Pattern Recognition, 96:C, Online publication date: 1-Dec-2019.
  18. van den Burg G, Nazábal A and Sutton C (2022). Wrangling messy CSV files by detecting row and type patterns, Data Mining and Knowledge Discovery, 33:6, (1799-1820), Online publication date: 1-Nov-2019.
  19. Chen Y, Martins R and Feng Y Maximal multi-layer specification synthesis Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, (602-612)
  20. Pervaiz F, Vashistha A and Anderson R Examining the challenges in development data pipeline Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies, (13-21)
  21. Bleifuß T, Bornemann L, Johnson T, Kalashnikov D, Naumann F and Srivastava D (2018). Exploring change, Proceedings of the VLDB Endowment, 12:2, (85-98), Online publication date: 1-Oct-2018.
  22. Sutton C, Hobson T, Geddes J and Caruana R Data Diff Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, (2279-2288)
  23. He Y, Chu X, Ganjam K, Zheng Y, Narasayya V and Chaudhuri S (2018). Transform-data-by-example (TDE), Proceedings of the VLDB Endowment, 11:10, (1165-1177), Online publication date: 1-Jun-2018.
  24. Lee S, Choi Y, Lee R, Park G and Hong E (2018). A decision support system for scientists by processing large-scale satellite images on a distributed computing environment, Multimedia Tools and Applications, 77:11, (14305-14326), Online publication date: 1-Jun-2018.
  25. He Y, Ganjam K, Lee K, Wang Y, Narasayya V, Chaudhuri S, Chu X and Zheng Y Transform-Data-by-Example (TDE) Proceedings of the 2018 International Conference on Management of Data, (1785-1788)
  26. Keller S, Shipp S, Korkmaz G, Molfino E, Goldstein J, Lancaster V, Pires B, Higdon D, Chen D and Schroeder A (2018). Harnessing the power of data to support community‐based research, WIREs Computational Statistics, 10:3, Online publication date: 15-Apr-2018.
  27. Sadiq S, Dasu T, Dong X, Freire J, Ilyas I, Link S, Miller M, Naumann F, Zhou X and Srivastava D (2018). Data Quality, ACM SIGMOD Record, 46:4, (35-43), Online publication date: 22-Feb-2018.
  28. Pires B, Goldstein J, Higdon D, Sabin P, Korkmaz G, Shipp S, Keller S, Ba S, Hamall K, Koehler A and Reese S A Bayesian simulation approach for supply chain synchronization Proceedings of the 2017 Winter Simulation Conference, (1-12)
  29. Feng Y, Martins R, Van Geffen J, Dillig I and Chaudhuri S (2017). Component-based synthesis of table consolidation and transformation tasks from examples, ACM SIGPLAN Notices, 52:6, (422-436), Online publication date: 14-Sep-2017.
  30. Feng Y, Martins R, Van Geffen J, Dillig I and Chaudhuri S Component-based synthesis of table consolidation and transformation tasks from examples Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, (422-436)
  31. Wang S, Shi W, Zhao Y and Shu Y (2017). Improved Approximation Algorithm for Maximal Information Coefficient, International Journal of Data Warehousing and Mining, 13:1, (76-93), Online publication date: 1-Jan-2017.
  32. Wang X, Gulwani S and Singh R (2016). FIDEX: filtering spreadsheet data using examples, ACM SIGPLAN Notices, 51:10, (195-213), Online publication date: 5-Dec-2016.
  33. Wang X, Gulwani S and Singh R FIDEX: filtering spreadsheet data using examples Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, (195-213)
  34. Dong X, Kementsietsidis A and Tan W (2016). A Time Machine for Information, ACM SIGMOD Record, 45:2, (23-32), Online publication date: 28-Sep-2016.
  35. Chiang F and Sitaramachandran S (2016). Unifying Data and Constraint Repairs, Journal of Data and Information Quality, 7:3, (1-26), Online publication date: 27-Sep-2016.
  36. Krishnan S, Wang J, Wu E, Franklin M and Goldberg K (2016). ActiveClean, Proceedings of the VLDB Endowment, 9:12, (948-959), Online publication date: 1-Aug-2016.
  37. Krishnan S, Haas D, Franklin M and Wu E Towards reliable interactive data cleaning Proceedings of the Workshop on Human-In-the-Loop Data Analytics, (1-5)
  38. Singh R (2016). BlinkFill, Proceedings of the VLDB Endowment, 9:10, (816-827), Online publication date: 1-Jun-2016.
  39. Li D, Wang S, Yuan H and Li D (2016). Software and applications of spatial data mining, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6:3, (84-114), Online publication date: 1-May-2016.
  40. Ilyas I and Chu X (2015). Trends in Cleaning Relational Data, Foundations and Trends in Databases, 5:4, (281-393), Online publication date: 1-Oct-2015.
  41. Deshpande M, Ray D, Dixit S and Agasti A ShareInsights Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, (1925-1940)
  42. Bergman M, Milo T, Novgorodov S and Tan W Query-Oriented Data Cleaning with Oracles Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, (1199-1214)
  43. Malik H, Davis I, Godfrey M, Neuse D and Mankovskii S Detecting Discontinuities in Large Scale Systems Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, (345-354)
  44. Dasu T, Loh J and Srivastava D Empirical glitch explanations Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, (572-581)
  45. Shyr J and Spisic D (2014). Automated data analysis, WIREs Computational Statistics, 6:5, (359-366), Online publication date: 18-Aug-2014.
  46. Wang J, Krishnan S, Franklin M, Goldberg K, Kraska T and Milo T A sample-and-clean framework for fast and accurate query processing on dirty data Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, (469-480)
  47. Liu S, Zhao Q and Wu X (2014). Feature selection based on partition clustering, International Journal of Knowledge-based and Intelligent Engineering Systems, 18:2, (135-142), Online publication date: 1-Apr-2014.
  48. Morton K, Balazinska M, Grossman D and Mackinlay J (2014). Support the data enthusiast, Proceedings of the VLDB Endowment, 7:6, (453-456), Online publication date: 1-Feb-2014.
  49. Hafen R, Gibson T, van Dam K and Critchlow T Large-scale exploratory analysis, cleaning, and modeling for event detection in real-world power systems data Proceedings of the 3rd International Workshop on High Performance Computing, Networking and Analytics for the Power Grid, (1-9)
  50. Neme A and Nido A Exploratory Data Analysis through the Inspection of the Probability Density Function of the Number of Neighbors Proceedings of the 12th International Symposium on Advances in Intelligent Data Analysis XII - Volume 8207, (310-321)
  51. Ienco D, Pitarch Y, Poncelet P and Teisseire M Knowledge-Free Table Summarization Proceedings of the 15th International Conference on Data Warehousing and Knowledge Discovery - Volume 8057, (122-133)
  52. Dey D and Kumar S (2013). Data Quality of Query Results with Generalized Selection Conditions, Operations Research, 61:1, (17-31), Online publication date: 1-Jan-2013.
  53. Heer J and Kandel S (2012). Interactive analysis of big data, XRDS: Crossroads, The ACM Magazine for Students, 19:1, (50-54), Online publication date: 1-Sep-2012.
  54. Kandel S, Parikh R, Paepcke A, Hellerstein J and Heer J Profiler Proceedings of the International Working Conference on Advanced Visual Interfaces, (547-554)
  55. Sivogolovko E and Novikov B Validating cluster structures in data mining tasks Proceedings of the 2012 Joint EDBT/ICDT Workshops, (245-250)
  56. Kwak D and Kim K (2012). A data mining approach considering missing values for the optimization of semiconductor-manufacturing processes, Expert Systems with Applications: An International Journal, 39:3, (2590-2596), Online publication date: 1-Feb-2012.
  57. Guo P, Kandel S, Hellerstein J and Heer J Proactive wrangling Proceedings of the 24th annual ACM symposium on User interface software and technology, (65-74)
  58. Köksal G, Batmaz İ and Testik M (2011). A review of data mining applications for quality improvement in manufacturing industry, Expert Systems with Applications: An International Journal, 38:10, (13448-13467), Online publication date: 15-Sep-2011.
  59. Naubourg P, Savonnet M, Leclercq É and Yétongnon K A approach to clinical proteomics data quality control and import Proceedings of the Second international conference on Information technology in bio- and medical informatics, (168-182)
  60. Han J, Kamber M and Pei J (2011). Data Mining, 10.5555/1972541, Online publication date: 29-Jul-2011.
  61. Chen S, Dong X, Lakshmanan L and Srivastava D We challenge you to certify your updates Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, (481-492)
  62. Kandel S, Paepcke A, Hellerstein J and Heer J Wrangler Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (3363-3372)
  63. Ioannou E, Nejdl W, Niederée C and Velegrakis Y (2010). On-the-fly entity-aware query processing in the presence of linkage, Proceedings of the VLDB Endowment, 3:1-2, (429-438), Online publication date: 1-Sep-2010.
  64. Akoglu L, McGlohon M and Faloutsos C OddBall Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part II, (410-421)
  65. Mayfield C, Neville J and Prabhakar S ERACER Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, (75-86)
  66. Rodic J and Baranovic M Generating data quality rules and integration into ETL process Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP, (65-72)
  67. Embury S, Missier P, Sampaio S, Greenwood R and Preece A (2009). Incorporating Domain-Specific Information Quality Constraints into Database Queries, Journal of Data and Information Quality, 1:2, (1-31), Online publication date: 1-Sep-2009.
  68. Batini C, Cappiello C, Francalanci C and Maurino A (2009). Methodologies for data quality assessment and improvement, ACM Computing Surveys, 41:3, (1-52), Online publication date: 1-Jul-2009.
  69. Madnick S, Wang R, Lee Y and Zhu H (2009). Overview and Framework for Data and Information Quality Research, Journal of Data and Information Quality, 1:1, (1-22), Online publication date: 1-Jun-2009.
  70. Dembczyński K, Kotłowski W and Sydow M (2008). Effective Prediction of Web User Behaviour with User-Level Models, Fundamenta Informaticae, 89:2-3, (189-206), Online publication date: 15-Jan-2009.
  71. Bernstein P and Haas L (2008). Information integration in the enterprise, Communications of the ACM, 51:9, (72-79), Online publication date: 1-Sep-2008.
  72. Abdelzaher T, Khan M, Le H, Ahmadi H and Han J Data mining for diagnostic debugging in sensor networks Proceedings of the Second international conference on Knowledge Discovery from Sensor Data, (1-24)
  73. Silva A, Calais P, Pereira A, Mourão F, Almeida J, Meira W and Góes P A seller's perspective characterization methodology for online auctions Proceedings of the 10th international conference on Electronic commerce, (1-10)
  74. Petrosino A and Staiano A (2008). Fuzzy modeling for data cleaning in sensor networks, International Journal of Hybrid Intelligent Systems, 5:3, (143-151), Online publication date: 1-Aug-2008.
  75. Lee D and Tsatsoulis C Domain independent data discrepancy detection using ensemble learning Proceedings of the 12th WSEAS international conference on Computers, (88-94)
  76. Dembczyński K, Kotłowski W and Sydow M (2008). Effective Prediction of Web User Behaviour with User-Level Models, Fundamenta Informaticae, 89:2-3, (189-206), Online publication date: 1-Apr-2008.
  77. Farinha J and Trigueiros M An extensible metadata framework for data quality assessment of composite structures Proceedings of the 9th international conference on Data Warehousing and Knowledge Discovery, (34-44)
  78. Srivastava D and Velegrakis Y Intensional associations between data and metadata Proceedings of the 2007 ACM SIGMOD international conference on Management of data, (401-412)
  79. Haas L Beauty and the beast Proceedings of the 11th international conference on Database Theory, (28-43)
  80. Cho S, Koudas N and Srivastava D Meta-data indexing for XPath location steps Proceedings of the 2006 ACM SIGMOD international conference on Management of data, (455-466)
  81. Berti-Équille L Quality-Aware association rule mining Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, (440-449)
  82. Fuxman A, Fuxman D and Miller R ConQuer Proceedings of the 31st international conference on Very large data bases, (1354-1357)
  83. Muthukrishnan S (2005). Data streams, Foundations and Trends® in Theoretical Computer Science, 1:2, (117-236), Online publication date: 1-Aug-2005.
  84. Bugajski J, Grossman R, Sumner E and Tang Z An event based framework for improving information quality that integrates baseline models, causal models and formal reference models Proceedings of the 2nd international workshop on Information quality in information systems, (40-45)
  85. Koudas N, Marathe A and Srivastava D SPIDER Proceedings of the 2005 ACM SIGMOD international conference on Management of data, (876-878)
  86. Fuxman A, Fazli E and Miller R ConQuer Proceedings of the 2005 ACM SIGMOD international conference on Management of data, (155-166)
  87. Knobbe A Multi-Relational Data Mining Proceedings of the 2005 conference on Multi-Relational Data Mining, (1-118)
  88. Garofalakis M and Kumar A (2005). XML stream processing using tree-edit distance embeddings, ACM Transactions on Database Systems, 30:1, (279-332), Online publication date: 1-Mar-2005.
  89. Koudas N, Marathe A and Srivastava D Flexible string matching against large databases in practice Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, (1078-1086)
  90. Guha S, Koudas N, Marathe A and Srivastava D Merging the results of approximate match operations Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, (636-647)
  91. Leser U and Freytag J Mining for patterns in contradictory data Proceedings of the 2004 international workshop on Information quality in information systems, (51-58)
  92. Andritsos P, Miller R and Tsaparas P Information-theoretic tools for mining database structure from large data sets Proceedings of the 2004 ACM SIGMOD international conference on Management of data, (731-742)
  93. Herbert K, Gehani N, Piel W, Wang J and Wu C (2004). BIO-AJAX, ACM SIGMOD Record, 33:2, (51-57), Online publication date: 1-Jun-2004.
  94. Korn F, Muthukrishnan S and Zhu Y Checks and balances Proceedings of the 29th international conference on Very large data bases - Volume 29, (536-547)
  95. Dasu T, Vesonder G and Wright J Data quality through knowledge engineering Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, (705-710)
Contributors
  • Fairleigh Dickinson University