Shape analysis (program analysis)

Last updated August 09, 2023

In program analysis, shape analysis is a static code analysis technique that discovers and verifies properties of linked, dynamically allocated data structures in (usually imperative) computer programs. It is typically used at compile time to find software bugs or to verify high-level correctness properties of programs. In Java programs, it can be used to ensure that a sort method correctly sorts a list. For C programs, it might look for places where a block of memory is not properly freed.

Applications

Shape analysis has been applied to a variety of problems:

Memory safety: finding memory leaks, dereferences of dangling pointers, and discovering cases where a block of memory is freed more than once.^[1]^[2]
Finding array out-of-bounds errors^{[ citation needed ]}
Checking type-state properties (for example, ensuring that a file is open() before it is read())^{[ citation needed ]}
Ensuring that a method to reverse a linked list does not introduce cycles into the list^[2]
Verifying that a sort method returns a result that is in sorted order^{[ citation needed ]}

Example

Shape analysis is a form of pointer analysis, although it is more precise than typical pointer analysis. Pointer analysis attempts to determine the set of objects to which a pointer can point (called the points-to set of the pointer). Unfortunately, these analysis are necessarily approximate (since a perfectly precise static analysis could solve the halting problem). Shape analysis can determine smaller (more precise) points-to sets.

Consider the following simple C++ program.

Item*items[10];for(inti=0;i<10;++i){items[i]=newItem(...);// line [1]}process_items(items);// line [2]for(inti=0;i<10;++i){deleteitems[i];// line [3]}

This program builds an array of objects, processes them in some arbitrary way, and then deletes them. Assuming that the process_items function is free of errors, it is clear that the program is safe: it never references freed memory, and it deletes all the objects that it has constructed.

Unfortunately, most pointer analyses have difficulty analyzing this program precisely. In order to determine points-to sets, a pointer analysis must be able to name a program's objects. In general, programs can allocate an unbounded number of objects; but in order to terminate, a pointer analysis can only use a finite set of names. A typical approximation is to give all the objects allocated on a given line of the program the same name. In the example above, all the objects constructed at line [1] would have the same name. Therefore, when the delete statement is analyzed for the first time, the analysis determines that one of the objects named [1] is being deleted. The second time the statement is analyzed (since it is in a loop) the analysis warns of a possible error: since it is unable to distinguish the objects in the array, it may be that the second delete is deleting the same object as the first delete. This warning is spurious, and the goal of shape analysis is to avoid such warnings.

Summarization and materialization

Shape analysis overcomes the problems of pointer analysis by using a more flexible naming system for objects. Rather than giving an object the same name throughout a program, objects can change names depending on the program's actions. Sometimes, several distinct objects with different names may be summarized, or merged, so that they have the same name. Then, when a summarized object is about to be used by the program, it can be materialized—that is, the summarized object is split into two objects with distinct names, one representing a single object and the other representing the remaining summarized objects. The basic heuristic of shape analysis is that objects that are being used by the program are represented using unique materialized objects, while objects not in use are summarized.

The array of objects in the example above is summarized in separate ways at lines [1], [2], and [3]. At line [1], the array has been only partly constructed. The array elements 0..i-1 contain constructed objects. The array element i is about to be constructed, and the following elements are uninitialized. Shape analysis can approximate this situation using a summary for the first set of elements, a materialized memory location for element i, and a summary for the remaining uninitialized locations, as follows:

0 .. i-1	i	i+1 .. 9
pointer to constructed object (summary)	uninitialized	uninitialized (summary)

After the loop terminates, at line [2], there is no need to keep anything materialized. The shape analysis determines at this point that all the array elements have been initialized:

0 .. 9

pointer to constructed object (summary)

At line [3], however, the array element i is in use again. Therefore, the analysis splits the array into three segments as in line [1]. This time, though, the first segment before i has been deleted, and the remaining elements are still valid (assuming the delete statement hasn't executed yet).

0 .. i-1	i	i+1 .. 9
free (summary)	pointer to constructed object	pointer to constructed object (summary)

Notice that in this case, the analysis recognizes that the pointer at index i has not been deleted yet. Therefore, it doesn't warn of a double deletion.

Related Research Articles

In computer science, an array is a data structure consisting of a collection of elements, of same memory size, each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called one-dimensional array.

C is a general-purpose computer programming language. It was created in the 1970s by Dennis Ritchie, and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems, device drivers, protocol stacks, though decreasingly for application software. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems.

In computer science, garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced; such memory is called garbage. Garbage collection was invented by American computer scientist John McCarthy around 1959 to simplify manual memory management in Lisp.

In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence. In its most basic form, each node contains data, and a reference to the next node in the sequence. This structure allows for efficient insertion or removal of elements from any position in the sequence during iteration. More complex variants add additional links, allowing more efficient insertion or removal of nodes at arbitrary positions. A drawback of linked lists is that data access time is a linear function of the number of nodes for each linked list because nodes are serially linked so a node needs to be accessed first to access the next node. Faster access, such as random access, is not feasible. Arrays have better cache locality compared to linked lists.

In computer science, a priority queue is an abstract data-type similar to a regular queue or stack data structure. Each element in a priority queue has an associated priority. In a priority queue, elements with high priority are served before elements with low priority. In some implementations, if two elements have the same priority, they are served in the same order in which they were enqueued. In other implementations, the order of elements with the same priority is undefined.

In computing, a segmentation fault or access violation is a fault, or failure condition, raised by hardware with memory protection, notifying an operating system (OS) the software has attempted to access a restricted area of memory. On standard x86 computers, this is a form of general protection fault. The operating system kernel will, in response, usually perform some corrective action, generally passing the fault on to the offending process by sending the process a signal. Processes can in some cases install a custom signal handler, allowing them to recover on their own, but otherwise the OS default signal handler is used, generally causing abnormal termination of the process, and sometimes a core dump.

In computing, an optimizing compiler is a compiler that tries to minimize or maximize some attributes of an executable computer program. Common requirements are to minimize a program's execution time, memory footprint, storage size, and power consumption.

In computer science, an in-place algorithm is an algorithm that operates directly on the input data structure without requiring extra space proportional to the input size. In other words, it modifies the input in place, without creating a separate copy of the data structure. An algorithm which is not in-place is sometimes called not-in-place or out-of-place.

<span class="mw-page-title-main">Pointer (computer programming)</span> Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program. A program's control-flow graph (CFG) is used to determine those parts of a program to which a particular value assigned to a variable might propagate. The information gathered is often used by compilers when optimizing a program. A canonical example of a data-flow analysis is reaching definitions.

A struct in the C programming language is a composite data type declaration that defines a physically grouped list of variables under one name in a block of memory, allowing the different variables to be accessed via a single pointer or by the struct declared name which returns the same address. The struct data type can contain other data types so is used for mixed-data-type records such as a hard-drive directory entry, or other mixed-type records.

In computing, an uninitialized variable is a variable that is declared but is not set to a definite known value before it is used. It will have some value, but not a predictable one. As such, it is a programming error and a common source of bugs in software.

In the C++ programming language, new and delete are a pair of language constructs that perform dynamic memory allocation, object construction and object destruction.

In computer science, pointer analysis, or points-to analysis, is a static code analysis technique that establishes which pointers, or heap references, can point to which variables, or storage locations. It is often a component of more complex analyses such as escape analysis. A closely related technique is shape analysis.

Memory safety is the state of being protected from various software bugs and security vulnerabilities when dealing with memory access, such as buffer overflows and dangling pointers. For example, Java is said to be memory-safe because its runtime error detection checks array bounds and pointer dereferences. In contrast, C and C++ allow arbitrary pointer arithmetic with pointers implemented as direct memory addresses with no provision for bounds checking, and thus are potentially memory-unsafe.

Increment and decrement operators are unary operators that increase or decrease their operand by one.

Reinhard Wilhelm is a German computer scientist.

Susan Beth Horwitz was an American computer scientist noted for her research on programming languages and software engineering, and in particular on program slicing and dataflow-analysis. She had several best paper and an impact paper award mentioned below under awards.

Whiley is an experimental programming language that combines features from the functional and imperative paradigms, and supports formal specification through function preconditions, postconditions and loop invariants. The language uses flow-sensitive typing also known as "flow typing."

Mooly (Shmuel) Sagiv is an Israeli computer scientist known for his work on static program analysis. He is currently Chair of Software Systems in the School of Computer Science at Tel Aviv University, and CEO of Certora, a startup company providing formal verification of smart contracts.

References

↑ Rinetzky, Noam; Sagiv, Mooly (2001). "Interprocedural Shape Analysis for Recursive Programs" (PDF). Compiler Construction. Lecture Notes in Computer Science. Vol. 2027. pp. 133–149. doi:10.1007/3-540-45306-7_10. ISBN 978-3-540-41861-0.
1 2 Berdine, Josh; Calcagno, Cristiano; Cook, Byron; Distefano, Dino; o'Hearn, Peter W.; Wies, Thomas; Yang, Hongseok (2007). "Shape Analysis for Composite Data Structures" (PDF). Computer Aided Verification. Lecture Notes in Computer Science. Vol. 4590. pp. 178–192. doi:10.1007/978-3-540-73368-3_22. ISBN 978-3-540-73367-6.

Bibliography

Neil D. Jones; Steven S. Muchnick (1982). "A flexible approach to interprocedural data flow analysis and programs with recursive data structures". Proceedings of the 9th ACM SIGPLAN-SIGACT symposium on Principles of programming languages - POPL '82. ACM. pp. 66–74. doi:10.1145/582153.582161. ISBN 0897910656. S2CID 13266723.{{cite book}}: CS1 maint: date and year (link)

Mooly Sagiv; Thomas Reps; Reinhard Wilhelm (May 2002). "Parametric shape analysis via 3-valued logic" (PDF). ACM Transactions on Programming Languages and Systems. ACM. 24 (3): 217–298. CiteSeerX 10.1.1.147.2132 . doi:10.1145/292540.292552. S2CID 101653.

Wilhelm, Reinhard; Sagiv, Mooly; Reps, Thomas (2007). "Chapter 12: Shape Analysis and Applications". In Srikant, Y. N.; Shankar, Priti (eds.). The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. CRC Press. pp. 12–1–12–44. ISBN 978-1-4200-4382-2.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Rinetzky, Noam; Sagiv, Mooly (2001). "Interprocedural Shape Analysis for Recursive Programs" (PDF). Compiler Construction. Lecture Notes in Computer Science. Vol. 2027. pp. 133–149. doi:10.1007/3-540-45306-7_10. ISBN 978-3-540-41861-0.

[Be07-2] 1 2 Berdine, Josh; Calcagno, Cristiano; Cook, Byron; Distefano, Dino; o'Hearn, Peter W.; Wies, Thomas; Yang, Hongseok (2007). "Shape Analysis for Composite Data Structures" (PDF). Computer Aided Verification. Lecture Notes in Computer Science. Vol. 4590. pp. 178–192. doi:10.1007/978-3-540-73368-3_22. ISBN 978-3-540-73367-6.

[1]

[2]

v t e Compiler optimizations
Basic block	Peephole optimization Local value numbering
Loop optimization	Automatic parallelization Induction variable Loop fusion Loop-invariant code motion Loop inversion Loop interchange Loop nest optimization Loop splitting Loop unrolling Loop unswitching Software pipelining Strength reduction
Data-flow analysis	Available expression Common subexpression elimination Constant folding Dead-store elimination Induction variable recognition and elimination Live-variable analysis Use-define chain
SSA-based	Global value numbering Sparse conditional constant propagation
Code generation	Instruction scheduling Instruction selection Register allocation Rematerialization
Functional	Deforestation Tail-call elimination
Global	Interprocedural optimization
Other	Bounds-checking elimination Compile-time function execution Dead-code elimination Expression templates Inline expansion Jump threading Profile-guided optimization
Static analysis	Alias analysis Array-access analysis Control-flow analysis Data-flow analysis Dependence analysis Escape analysis Pointer analysis Shape analysis Value range analysis