research-article

“Synthesizing input grammars”: a replication study

Authors:

Bachir Bendrissou,

Rahul Gopinath,

Andreas ZellerAuthors Info & Claims

PLDI 2022: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation

Pages 260 - 268

https://doi.org/10.1145/3519939.3523716

Published: 09 June 2022 Publication History

Abstract

When producing test inputs for a program, test generators ("fuzzers") can greatly profit from grammars that formally describe the language of expected inputs. In recent years, researchers thus have studied means to recover input grammars from programs and their executions. The GLADE algorithm by Bastani et al., published at PLDI 2017, was the first black-box approach to claim context-free approximation of input specification for non-trivial languages such as XML, Lisp, URLs, and more.

Prompted by recent observations that the GLADE algorithm may show lower performance than reported in the original paper, we have reimplemented the GLADE algorithm from scratch. Our evaluation confirms that the effectiveness score (F1) reported in the GLADE paper is overly optimistic, and in some cases, based on the wrong language. Furthermore, GLADE fares poorly in several real-world languages evaluated, producing grammars that spend megabytes to enumerate inputs.

References

[1]

[n. d.]. Artifact Review and Badging – Version 1.1. https://www.acm.org/publications/policies/artifact-review-and-badging-current

[2]

[n. d.]. GLADE Implementation and experiments. https://github.com/obastani/glade-full

[3]

[n. d.]. Grammars written for ANTLR v4. https://github.com/antlr/grammars-v4

[4]

Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Inf. Comput., 75, 2 (1987), Nov., 87–106. issn:0890-5401 https://doi.org/10.1016/0890-5401(87)90052-6

Digital Library

[5]

Dana Angluin and Michael Kharitonov. 1995. When Won’t Membership Queries Help? J. Comput. System Sci., 50, 2 (1995), 336–355. https://doi.org/10.1145/103418.103420

Digital Library

[6]

Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing Program Input Grammars. In ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, NY, USA. 95–110. isbn:978-1-4503-4988-8 https://doi.org/10.1145/3140587.3062349

Digital Library

[7]

Juan Caballero, Heng Yin, Zhenkai Liang, and Dawn Song. 2007. Polyglot: Automatic Extraction of Protocol Message Format Using Dynamic Binary Analysis. In ACM Conference on Computer and Communications Security. ACM, New York, NY, USA. 317–329. isbn:978-1-59593-703-2 https://doi.org/10.1145/1315245.1315286

Digital Library

[8]

Alexander Clark. 2010. Distributional learning of some context-free languages with a minimally adequate teacher. In International Colloquium on Grammatical Inference. 24–37. https://doi.org/10.1007/978-3-642-15488-1_4

[9]

Jean-Christophe Deprez and Arun Lakhotia. 2000. A Formalism to Automate Mapping from Program Features to Code. In IWPC. 69–78. https://doi.org/10.1109/WPC.2000.852481

[10]

E Mark Gold. 1967. Language identification in the limit. Information and control, 10, 5 (1967), 447–474. https://doi.org/10.1016/S0019-9958(67)91165-5

[11]

E Mark Gold. 1978. Complexity of automaton identification from given data. Information and control, 37, 3 (1978), 302–320. https://doi.org/10.1016/S0019-9958(78)90562-4

[12]

Rahul Gopinath, Björn Mathis, and Andreas Zeller. 2020. Mining input grammars from dynamic control flow. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 172–183. https://doi.org/10.1145/3368089.3409679

Digital Library

[13]

Rahul Gopinath and Andreas Zeller. 2019. Building Fast Fuzzers. CoRR, arxiv:1911.07707. arxiv:1911.07707

[14]

Benedikt Hauptmann, Elmar Juergens, and Volkmar Woinke. 2015. Generating refactoring proposals to remove clones from automated system tests. In Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension. 115–124. https://doi.org/10.1109/ICPC.2015.20

Digital Library

[15]

Renáta Hodován, Ákos Kiss, and Tibor Gyimóthy. 2018. Grammarinator: a grammar-based open source fuzzer. In Proceedings of the 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation. 45–48. https://doi.org/10.1145/3278186.3278193

Digital Library

[16]

Matthias Höschele and Andreas Zeller. 2016. Mining Input Grammars from Dynamic Taints. In IEEE/ACM International Conference on Automated Software Engineering. ACM, New York, NY, USA. 720–725. https://doi.org/10.1145/2970276.2970321

Digital Library

[17]

C Jacobs and D Grune. 1990. Parsing techniques: A practical guide. Springer. isbn:978-0-387-68954-8

[18]

Neil Kulkarni, Caroline Lemieux, and Koushik Sen. 2021. Learning Highly Recursive Input Grammars. In 36th IEEE/ACM International Conference on Automated Software Engineering. ACM New York, NY, USA. https://doi.org/10.1109/ASE51524.2021.9678879

[19]

Zhiqiang Lin, Xiangyu Zhang, and Dongyan Xu. 2010. Reverse Engineering Input Syntactic Structure from Program Execution and Its Applications. IEEE Transactions on Software Engineering, 36, 5 (2010), Sept., 688–703. issn:0098-5589 https://doi.org/10.1109/TSE.2009.54

Digital Library

[20]

José Oncina and Pedro Garcia. 1992. Identifying regular languages in polynomial time. In Advances in structural and syntactic pattern recognition. World Scientific, 99–108. https://doi.org/10.1142/9789812797919_0007

[21]

Václav Rajlich and Norman Wilde. 2002. The role of concepts in program comprehension. In Proceedings 10th International Workshop on Program Comprehension. 271–278. https://doi.org/10.1109/WPC.2002.1021348

[22]

Andreas Zeller, Rahul Gopinath, Marcel Böhme, Gordon Fraser, and Christian Holler. 2021. Parsing Inputs. In The Fuzzing Book. CISPA Helmholtz Center for Information Security. https://www.fuzzingbook.org/html/Parser.html Retrieved 2021-11-16 14:26:40+01:00

Cited By

Bendrissou BCadar CDonaldson A(2024)Grammar Mutation for Testing Input ParsersACM Transactions on Software Engineering and Methodology10.1145/3708517Online publication date: 20-Dec-2024
https://dl.acm.org/doi/10.1145/3708517
Jia XTan G(2024)V-Star: Learning Visibly Pushdown Grammars from Program InputsProceedings of the ACM on Programming Languages10.1145/36564588:PLDI(2003-2026)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656458
Zheng MShi QLiu XXu XYu LLiu CWei GZhang X(2024)ParDiff: Practical Static Differential Analysis of Network Protocol ParsersProceedings of the ACM on Programming Languages10.1145/36498548:OOPSLA1(1208-1234)Online publication date: 29-Apr-2024
https://dl.acm.org/doi/10.1145/3649854

Index Terms

“Synthesizing input grammars”: a replication study
1. Security and privacy
  1. Software and application security
    1. Software reverse engineering
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Parsers
    2. Software maintenance tools

Recommendations

Preference logic grammars

Preference logic grammars (PLGs) are introduced in this paper as a concise, declarative, modular, and efficient means of resolving ambiguity in logic grammars. Preference logic grammars can be thought as extensions of definite clause grammars (DCGs) and ...
Simple LR(k) grammars

A class of context-free grammars, called the “Simple LR(k)” or SLR(k) grammars is defined. This class has been shown to include weak precedence and simple precedence grammars as proper subsets. How to construct parsers for the SLR(k) grammars is also ...
Lambek Grammars as Second-Order Abstract Categorial Grammars
New Frontiers in Artificial Intelligence
Abstract
We demonstrate that for all practical purposes, Lambek Grammars (LG) are strongly equivalent to Context-Free Grammars (CFG) and hence to second-order Abstract Categorial Grammars (ACG). To be precise, for any Lambek Grammar LG there exists a ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PLDI 2022: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation

June 2022

1038 pages

ISBN:9781450392655

DOI:10.1145/3519939

General Chair:
Ranjit Jhala
University of California at San Diego, USA
,
Program Chair:
Işil Dillig
University of Texas at Austin, USA

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2022

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Badges

Author Tags

Qualifiers

Research-article

Conference

PLDI '22

Sponsor:

SIGPLAN

PLDI '22: 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation

June 13 - 17, 2022

CA, San Diego, USA

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
564
Total Downloads

Downloads (Last 12 months)90
Downloads (Last 6 weeks)8

Reflects downloads up to 26 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Bendrissou BCadar CDonaldson A(2024)Grammar Mutation for Testing Input ParsersACM Transactions on Software Engineering and Methodology10.1145/3708517Online publication date: 20-Dec-2024
https://dl.acm.org/doi/10.1145/3708517
Jia XTan G(2024)V-Star: Learning Visibly Pushdown Grammars from Program InputsProceedings of the ACM on Programming Languages10.1145/36564588:PLDI(2003-2026)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656458
Zheng MShi QLiu XXu XYu LLiu CWei GZhang X(2024)ParDiff: Practical Static Differential Analysis of Network Protocol ParsersProceedings of the ACM on Programming Languages10.1145/36498548:OOPSLA1(1208-1234)Online publication date: 29-Apr-2024
https://dl.acm.org/doi/10.1145/3649854

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten