DOI: 10.1145/2597008.2597138
Article

Redacting sensitive information in software artifacts

Published: 02 June 2014

Abstract

In the past decade, there have been many well-publicized cases of source code leaking from different well-known companies. These leaks pose a serious problem when the source code contains sensitive information encoded in its identifier names and comments. Unfortunately, redacting the sensitive information requires obfuscating the identifiers, which will quickly interfere with program comprehension. Program comprehension is key for programmers in understanding the source code, so sensitive information is often left unredacted.
To address this problem, we offer a novel approach for REdacting Sensitive Information in Software arTifacts (RESIST). RESIST finds and replaces sensitive words in software artifacts in a way that reduces the impact on program comprehension. We evaluated RESIST experimentally with 57 professional programmers from over a dozen different organizations. Our evaluation shows that RESIST effectively redacts software artifacts, making it difficult for participants to infer sensitive information while maintaining a desired level of comprehension.
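The general idea of replacing sensitive words inside identifiers and comments can be illustrated with a minimal sketch. This is not the RESIST algorithm, whose word-selection and replacement strategy the abstract does not describe; the `REPLACEMENTS` mapping, the `redact` function, and the example words are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mapping from sensitive words to neutral substitutes.
# A real tool would have to choose these words; here they are assumed.
REPLACEMENTS = {"salary": "amount", "patient": "record"}

# Matches the word parts of camelCase and snake_case identifiers,
# as well as ordinary words in comments.
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def _match_case(original, replacement):
    """Give the replacement the same capitalization as the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def redact(text, replacements=REPLACEMENTS):
    """Replace every sensitive word part in text, preserving its case."""
    def substitute(match):
        word = match.group(0)
        neutral = replacements.get(word.lower())
        return _match_case(word, neutral) if neutral else word

    return WORD.sub(substitute, text)


print(redact("getPatientSalary"))
print(redact("// compute salary for the patient"))
```

Because the mapping is applied uniformly, every occurrence of an identifier is renamed the same way. The sketch does not check for name collisions or language keywords, which a practical renaming tool would need to handle.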



Published In

ICPC 2014: Proceedings of the 22nd International Conference on Program Comprehension
June 2014
325 pages
ISBN:9781450328791
DOI:10.1145/2597008
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

  • TCSE: IEEE Computer Society's Tech. Council on Software Engin.

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. associative rules
  2. privacy
  3. program comprehension
  4. redaction

Qualifiers

  • Article

Conference

ICSE '14

