research-article

CUDIA: Probabilistic cross-level imputation using individual auxiliary information

Published: 08 October 2013

Abstract

In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher, more aggregated level, for example as state- or county-level summaries, or as averages over health zones such as hospital referral regions (HRRs) or hospital service areas (HSAs). Such levels constitute partitions of the underlying individual-level data, and these partitions may not match the groupings that would have been obtained by clustering the data on individual-level attributes. Moreover, treating aggregated values as representative of the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? In this article, we seek better utilization of variably aggregated datasets, possibly assembled from different sources. We propose a novel cross-level imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased. The imputed values can then be used in subsequent predictive modeling, yielding improved accuracy. Experimental results on a simulated dataset and on the Behavioral Risk Factor Surveillance System (BRFSS) dataset illustrate the generality and capabilities of the proposed framework.
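
The ecological fallacy named in the abstract can be made concrete with a toy calculation. The sketch below is not the CUDIA model and uses no data from the article: the three hypothetical regions, their values, and the helper `pearson` are all invented for illustration. It only shows that a correlation computed on group averages can have the opposite sign of the correlation among the underlying individuals.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical data: three regions (k = 0, 1, 2), five individuals each.
# Within every region, y falls as x rises; across regions, both rise with k.
groups = {k: [(k + i, k - i) for i in range(5)] for k in range(3)}

# Individual-level view: pool every (x, y) pair.
xs = [x for pts in groups.values() for x, _ in pts]
ys = [y for pts in groups.values() for _, y in pts]
individual_r = pearson(xs, ys)

# Aggregate-level view: only the per-region averages are published.
gx = [mean(x for x, _ in pts) for pts in groups.values()]
gy = [mean(y for _, y in pts) for pts in groups.values()]
group_r = pearson(gx, gy)

print(f"individual-level r = {individual_r:+.2f}")  # -0.50
print(f"region-average r   = {group_r:+.2f}")       # +1.00
```

In this constructed case the region averages suggest a perfect positive association, while the individuals are negatively associated. Reading the aggregate relationship as an individual-level one is exactly the mistake that motivates imputing individual-level values instead of substituting group averages.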

Supplementary Material

a66-park-apndx.pdf (park.zip)
Supplemental movie, appendix, image, and software files for CUDIA: Probabilistic cross-level imputation using individual auxiliary information

Published In

ACM Transactions on Intelligent Systems and Technology  Volume 4, Issue 4
Survey papers, special sections on the semantic adaptive social web, intelligent systems for health informatics, regular papers
September 2013
452 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/2508037

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 October 2013
Accepted: 01 August 2012
Revised: 01 May 2012
Received: 01 December 2011
Published in TIST Volume 4, Issue 4

Author Tags

  1. BRFSS
  2. Clustering
  3. privacy preserving data mining

Qualifiers

  • Research-article
  • Research
  • Refereed
