Inline expansion

Last updated

In computing, inline expansion, or inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function. Inline expansion is similar to macro expansion, but occurs during compilation, without changing the source code (the text), while macro expansion occurs prior to compilation, and results in different text that is then processed by the compiler.

Contents

Inlining is an important optimization, but has complicated effects on performance. [1] As a rule of thumb, some inlining will improve speed at very minor cost of space, but excess inlining will hurt speed, due to inlined code consuming too much of the instruction cache, and also cost significant space. A survey of the modest academic literature on inlining from the 1980s and 1990s is given in Peyton Jones & Marlow 1999. [2]

Overview

Inline expansion is similar to macro expansion as the compiler places a new copy of the function in each place it is called. Inlined functions run a little faster than the normal functions as function-calling-overheads are saved, however, there is a memory penalty. If a function is inlined 10 times, there will be 10 copies of the function inserted into the code. Hence inlining is best for small functions that are called often. In C++ the member functions of a class, if defined within the class definition, are inlined by default (no need to use the inline keyword); otherwise, the keyword is needed. The compiler may ignore the programmer’s attempt to inline a function, mainly if it is particularly large.

Inline expansion is used to eliminate the time overhead (excess time) when a function is called. It is typically used for functions that execute frequently. It also has a space benefit for very small functions, and is an enabling transformation for other optimizations.

Without inline functions, the compiler decides which functions to inline. The programmer has little or no control over which functions are inlined and which are not. Giving this degree of control to the programmer allows for the use of application-specific knowledge in choosing which functions to inline.

Ordinarily, when a function is invoked, control is transferred to its definition by a branch or call instruction. With inlining, control drops through directly to the code for the function, without a branch or call instruction.

Compilers usually implement statements with inlining. Loop conditions and loop bodies need lazy evaluation. This property is fulfilled when the code to compute loop conditions and loop bodies is inlined. Performance considerations are another reason to inline statements.

In the context of functional programming languages, inline expansion is usually followed by the beta-reduction transformation.

A programmer might inline a function manually through copy and paste programming, as a one-time operation on the source code. However, other methods of controlling inlining (see below) are preferable, because they do not precipitate bugs arising when the programmer overlooks a (possibly modified) duplicated version of the original function body, while fixing a bug in the inlined function.

Effect on performance

The direct effect of this optimization is to improve time performance (by eliminating call overhead), at the cost of worsening space usage [lower-alpha 1] (due to duplicating the function body). The code expansion due to duplicating the function body dominates, except for simple cases, [lower-alpha 2] and thus the direct effect of inline expansion is to improve time at the cost of space.

However, the primary benefit of inline expansion is to allow further optimizations and improved scheduling, due to increasing the size of the function body, as better optimization is possible on larger functions. [3] The ultimate impact of inline expansion on speed is complicated, due to multiple effects on performance of the memory system (primarily instruction cache), which dominates performance on modern processors: depending on the specific program and cache, inlining particular functions can increase or decrease performance. [1]

The impact of inlining varies by programming language and program, due to different degrees of abstraction. In lower-level imperative languages such as C and Fortran it is typically a 10–20% speed boost, with minor impact on code size, while in more abstract languages it can be significantly more important, due to the number of layers inlining removes, with an extreme example being Self, where one compiler saw improvement factors of 4 to 55 by inlining. [2]

The direct benefits of eliminating a function call are:

The primary benefit of inlining, however, is the further optimizations it allows. Optimizations that cross function boundaries can be done without requiring interprocedural optimization (IPO): once inlining has been performed, additional intraprocedural optimizations ("global optimizations") become possible on the enlarged function body. For example:

These can be done without inlining, but require a significantly more complicated compiler and linker (in case caller and callee are in separate compilation units).

Conversely, in some cases a language specification may allow a program to make additional assumptions about arguments to procedures that it can no longer make after the procedure is inlined, preventing some optimizations. Smarter compilers (such as Glasgow Haskell Compiler) will track this, but naive inlining loses this information.

A further benefit of inlining for the memory system is:

The direct cost of inlining is increased code size, due to duplicating the function body at each call site. However, it does not always do so, namely in case of very short functions, where the function body is smaller than the size of a function call (at the caller, including argument and return value handling), such as trivial accessor methods or mutator methods (getters and setters); or for a function that is only used in one place, in which case it is not duplicated. Thus inlining may be minimized or eliminated if optimizing for code size, as is often the case in embedded systems.

Inlining also imposes a cost on performance, due to the code expansion (due to duplication) hurting instruction cache performance. [6] This is most significant if, prior to expansion, the working set of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., L1 cache), but after expansion it no longer fits, resulting in frequent cache misses at that level. Due to the significant difference in performance at different levels of the hierarchy, this hurts performance considerably. At the highest level this can result in increased page faults, catastrophic performance degradation due to thrashing, or the program failing to run at all. This last is rare in common desktop and server applications, where code size is small relative to available memory, but can be an issue for resource-constrained environments such as embedded systems. One way to mitigate this problem is to split functions into a smaller hot inline path (fast path), and a larger cold non-inline path (slow path). [6]

Inlining hurting performance is primarily a problem for large functions that are used in many places, but the break-even point beyond which inlining reduces performance is difficult to determine and depends in general on precise load, so it can be subject to manual optimization or profile-guided optimization. [7] This is a similar issue to other code expanding optimizations such as loop unrolling, which also reduces number of instructions processed, but can decrease performance due to poorer cache performance.

The precise effect of inlining on cache performance is complicated. For small cache sizes (much smaller than the working set prior to expansion), the increased sequentiality dominates, and inlining improves cache performance. For cache sizes close to the working set, where inlining expands the working set so it no longer fits in cache, this dominates and cache performance decreases. For cache sizes larger than the working set, inlining has negligible impact on cache performance. Further, changes in cache design, such as load forwarding, can offset the increase in cache misses. [8]

Compiler support

Compilers use a variety of mechanisms to decide which function calls should be inlined; these can include manual hints from programmers for specific functions, together with overall control via command-line options. Inlining is done automatically by many compilers in many languages, based on judgment of whether inlining is beneficial, while in other cases it can be manually specified via compiler directives, typically using a keyword or compiler directive called inline. Typically this only hints that inlining is desired, rather than requiring inlining, with the force of the hint varying by language and compiler.

Typically, compiler developers keep the above performance issues in mind, and incorporate heuristics into their compilers that choose which functions to inline so as to improve performance, rather than worsening it, in most cases.

Implementation

Once the compiler has decided to inline a particular function, performing the inlining operation itself is usually simple. Depending on whether the compiler inlines functions across code in different languages, the compiler can do inlining on either a high-level intermediate representation (like abstract syntax trees) or a low-level intermediate representation. In either case, the compiler simply computes the arguments, stores them in variables corresponding to the function's arguments, and then inserts the body of the function at the call site.

Linkers can also do function inlining. When a linker inlines functions, it may inline functions whose source is not available, such as library functions (see link-time optimization). A run-time system can inline function as well. Run-time inlining can use dynamic profiling information to make better decisions about which functions to inline, as in the Java Hotspot compiler. [9]

Here is a simple example of inline expansion performed "by hand" at the source level in the C programming language:

intpred(intx){if(x==0)return0;elsereturnx-1;}

Before inlining:

intfunc(inty){returnpred(y)+pred(0)+pred(y+1);}

After inlining:

intfunc(inty){inttmp;if(y==0)tmp=0;elsetmp=y-1;/* (1) */if(0==0)tmp+=0;elsetmp+=0-1;/* (2) */if(y+1==0)tmp+=0;elsetmp+=(y+1)-1;/* (3) */returntmp;}

Note that this is only an example. In an actual C application, it would be preferable to use an inlining language feature such as parameterized macros or inline functions to tell the compiler to transform the code in this way. The next section lists ways to optimize this code.

Inlining by assembly macro expansion

Assembler macros provide an alternative approach to inlining whereby a sequence of instructions can normally be generated inline by macro expansion from a single macro source statement (with zero or more parameters). One of the parameters might be an option to alternatively generate a one-time separate subroutine containing the sequence and processed instead by an inlined call to the function. Example:

MOVE FROM=array1,TO=array2,INLINE=NO

Heuristics

A range of different heuristics have been explored for inlining. Usually, an inlining algorithm has a certain code budget (an allowed increase in program size) and aims to inline the most valuable callsites without exceeding that budget. In this sense, many inlining algorithms are usually modeled after the Knapsack problem. [10] To decide which callsites are more valuable, an inlining algorithm must estimate their benefit—i.e. the expected decrease in the execution time. Commonly, inliners use profiling information about the frequency of the execution of different code paths to estimate the benefits. [11]

In addition to profiling information, newer just-in-time compilers apply several more advanced heuristics, such as: [4]

Benefits

Inline expansion itself is an optimization, since it eliminates overhead from calls, but it is much more important as an enabling transformation. That is, once the compiler expands a function body in the context of its call site—often with arguments that may be fixed constants—it may be able to do a variety of transformations that were not possible before. For example, a conditional branch may turn out to be always true or always false at this particular call site. This in turn may enable dead code elimination, loop-invariant code motion, or induction variable elimination.

In the C example in the previous section, optimization opportunities abound. The compiler may follow this sequence of steps:

The new function looks like:

intfunc(inty){if(y==0)return0;if(y==-1)return-2;return2*y-1;}

Limitations

Complete inline expansion is not always possible, due to recursion: recursively inline expanding the calls will not terminate. There are various solutions, such as expanding a bounded amount, or analyzing the call graph and breaking loops at certain nodes (i.e., not expanding some edge in a recursive loop). [12] An identical problem occurs in macro expansion, as recursive expansion does not terminate, and is typically resolved by forbidding recursive macros (as in C and C++).

Comparison with macros

Traditionally, in languages such as C, inline expansion was accomplished at the source level using parameterized macros. Use of true inline functions, as are available in C99, provides several benefits over this approach:

Many compilers can also inline expand some recursive functions; [13] recursive macros are typically illegal.

Bjarne Stroustrup, the designer of C++, likes to emphasize that macros should be avoided wherever possible, and advocates extensive use of inline functions.

Selection methods

Many compilers aggressively inline functions wherever it is beneficial to do so. Although it can lead to larger executables, aggressive inlining has nevertheless become more and more desirable as memory capacity has increased faster than CPU speed. Inlining is a critical optimization in functional languages and object-oriented programming languages, which rely on it to provide enough context for their typically small functions to make classical optimizations effective.

Language support

Many languages, including Java and functional languages, do not provide language constructs for inline functions, but their compilers or interpreters often do perform aggressive inline expansion. [4] Other languages provide constructs for explicit hints, generally as compiler directives (pragmas).

In the Ada programming language, there exists a pragma for inline functions.

Functions in Common Lisp may be defined as inline by the inline declaration as such: [14]

(declaim(inlinedispatch))(defundispatch(x)(funcall(get(carx)'dispatch)x))

The Haskell compiler GHC tries to inline functions or values that are small enough but inlining may be noted explicitly using a language pragma: [15]

key_function::Int->String->(Bool,Double){-# INLINE key_function #-}

C and C++

C and C++ have an inline keyword which serves as a hint that inlining may be beneficial; however, in newer versions, its primary purpose is instead to alter the visibility and linking behavior of the function. [16]

See also

Notes

  1. Space usage is "number of instructions", and is both runtime space usage and the binary file size.
  2. Code size actually shrinks for very short functions, where the call overhead is larger than the body of the function, or single-use functions, where no duplication occurs.

Related Research Articles

In mathematics and computer science, mutual recursion is a form of recursion where two mathematical or computational objects, such as functions or datatypes, are defined in terms of each other. Mutual recursion is very common in functional programming and in some problem domains, such as recursive descent parsers, where the datatypes are naturally mutually recursive.

<span class="mw-page-title-main">Macro (computer science)</span> Rule for substituting a set input with a set output

In computer programming, a macro is a rule or pattern that specifies how a certain input should be mapped to a replacement output. Applying a macro to an input is known as macro expansion. The input and output may be a sequence of lexical tokens or characters, or a syntax tree. Character macros are supported in software applications to make it easy to invoke common command sequences. Token and tree macros are supported in some programming languages to enable code reuse or to extend the language, sometimes for domain-specific languages.

OCaml is a general-purpose, high-level, multi-paradigm programming language which extends the Caml dialect of ML with object-oriented features. OCaml was created in 1996 by Xavier Leroy, Jérôme Vouillon, Damien Doligez, Didier Rémy, Ascánder Suárez, and others.

An optimizing compiler is a compiler designed to generate code that is optimized in aspects such as minimizing program execution time, memory use, storage size, and power consumption. Optimization is generally implemented as a sequence of optimizing transformations, algorithms that transform code to produce semantically equivalent code optimized for some aspect. It is typically CPU and memory intensive. In practice, factors such as available memory and a programmer's willingness to wait for compilation limit the optimizations that a compiler might provide. Research indicates that some optimization problems are NP-complete, or even undecidable.

Java and C++ are two prominent object-oriented programming languages. By many language popularity metrics, the two languages have dominated object-oriented and high-performance software development for much of the 21st century, and are often directly compared and contrasted. Java's syntax was based on C/C++.

Template metaprogramming (TMP) is a metaprogramming technique in which templates are used by a compiler to generate temporary source code, which is merged by the compiler with the rest of the source code and then compiled. The output of these templates can include compile-time constants, data structures, and complete functions. The use of templates can be thought of as compile-time polymorphism. The technique is used by a number of languages, the best-known being C++, but also Curl, D, Nim, and XL.

In computer science, program optimization, code optimization, or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources. In general, a computer program may be optimized so that it executes more rapidly, or to make it capable of operating with less memory storage or other resources, or draw less power.

In the C and C++ programming languages, an inline function is one qualified with the keyword inline; this serves two purposes:

  1. It serves as a compiler directive that suggests that the compiler substitute the body of the function inline by performing inline expansion, i.e. by inserting the function code at the address of each function call, thereby saving the overhead of a function call. In this respect it is analogous to the register storage class specifier, which similarly provides an optimization hint.
  2. The second purpose of inline is to change linkage behavior; the details of this are complicated. This is necessary due to the C/C++ separate compilation + linkage model, specifically because the definition (body) of the function must be duplicated in all translation units where it is used, to allow inlining during compiling, which, if the function has external linkage, causes a collision during linking. C and C++ resolve this in different ways.

In compiler theory, common subexpression elimination (CSE) is a compiler optimization that searches for instances of identical expressions, and analyzes whether it is worthwhile replacing them with a single variable holding the computed value.

In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C or Ada.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.

In computer science, a tail call is a subroutine call performed as the final action of a procedure. If the target of a tail is the same subroutine, the subroutine is said to be tail recursive, which is a special case of direct recursion. Tail recursion is particularly useful, and is often easy to optimize in implementations.

In computer programming, a nested function is a named function that is defined within another, enclosing, block and is lexically scoped within the enclosing block – meaning it is only callable by name within the body of the enclosing block and can use identifiers declared in outer blocks, including outer functions. The enclosing block is typically, but not always, another function.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most execution time of a scientific program is spent on loops; as such, many compiler optimization techniques have been developed to make them faster.

In computer science, loop fission is a compiler optimization in which a loop is broken into multiple loops over the same index range with each taking only a part of the original loop's body. The goal is to break down a large loop body into smaller ones to achieve better utilization of locality of reference. This optimization is most efficient in multi-core processors that can split a task into multiple tasks for each processor.

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

Haxe is a high-level cross-platform programming language and compiler that can produce applications and source code for many different computing platforms from one code-base. It is free and open-source software, released under an MIT License. The compiler, written in OCaml, is released under the GNU General Public License (GPL) version 2.

In computer programming, a pure function is a function that has the following properties:

  1. the function return values are identical for identical arguments, and
  2. the function has no side effects.

In computer programming, a function is a callable unit of software logic that has a well-defined interface and behavior and can be invoked multiple times.

Tracing just-in-time compilation is a technique used by virtual machines to optimize the execution of a program at runtime. This is done by recording a linear sequence of frequently executed operations, compiling them to native machine code and executing them. This is opposed to traditional just-in-time (JIT) compilers that work on a per-method basis.

References

  1. 1 2 Chen et al. 1993.
  2. Chen et al. 1993, 3.4 Function inline expansion, p. 14.
  3. 1 2 3 Prokopec et al., An Optimization Driven Incremental Inline Substitution Algorithm for Just-In-Time Compilers, CGO'19 publication about the inliner used in the Graal compiler for the JVM
  4. Chen et al. 1993, 3.4 Function inline expansion, p. 19–20.
  5. 1 2 Benjamin Poulain (August 8, 2013). "Unusual speed boost: size matters".
  6. See for example the Adaptive Optimization System Archived 2011-08-09 at the Wayback Machine in the Jikes RVM for Java.
  7. Chen et al. 1993, 3.4 Function inline expansion, p. 24–26.
  8. Description of the inliner used in the Graal JIT compiler for Java
  9. Scheifler, An Analysis of Inline Substitution for a Structured Programming Language
  10. Matthew Arnold, Stephen Fink, Vivek Sarkar, and Peter F. Sweeney, A Comparative Study of Static and Profile-based Heuristics for Inlining
  11. Peyton Jones & Marlow 1999, 4. Ensuring Termination, pp. 6–9.
  12. Inlining Semantics for Subroutines which are Recursive" by Henry G. Baker
  13. DeclarationINLINE, NOTINLINE at the Common Lisp HyperSpec
  14. 7.13.5.1. INLINE pragma Chapter 7. GHC Language Features
  15. https://en.cppreference.com/w/cpp/language/inline