Section: Research Program

Static Analysis and Transformation of programs

Participants : Laurent Hascoët, Valérie Pascual, Ala Taftaf.

abstract syntax tree

Tree representation of a computer program, that keeps only the semantically significant information and abstracts away syntactic sugar such as indentation, parentheses, or separators.

control flow graph

Representation of a procedure body as a directed graph, whose nodes, known as basic blocks, each contain a sequence of instructions and whose arrows represent all possible control jumps that can occur at run-time.

abstract interpretation

Model that describes program static analysis as a special sort of execution, in which all branches of control switches are taken concurrently, and where computed values are replaced by abstract values from a given semantic domain. Each particular analysis gives birth to a specific semantic domain.

data flow analysis

Program analysis that studies how a given property of variables evolves with execution of the program. Data Flow analysis is static, therefore studying all possible run-time behaviors and making conservative approximations. A typical data-flow analysis is to detect, at any location in the source program, whether a variable is initialized or not.

data dependence analysis

Program analysis that studies the itinerary of values during program execution, from the place where a value is defined to the places where it is used, and finally to the place where it is overwritten. The collection of all these itineraries is stored as Def-Use and Use-Def chains or as a data dependence graph, and data flow analysis most often rely on this graph.

data dependence graph

Directed graph that relates accesses to program variables, from the write access that defines a new value to the read accesses that use this value, and from the read accesses to the write access that overwrites this value. Dependences express a partial order between operations, that must be preserved to preserve the program's result.

The most obvious example of a program transformation tool is certainly a compiler. Other examples are program translators, that go from one language or formalism to another, or optimizers, that transform a program to make it run better. AD is just one such transformation. These tools use sophisticated analysis [15] . These tools share their technological basis. More importantly, there are common mathematical models to specify and analyze them.

An important principle is abstraction: the core of a compiler should not bother about syntactic details of the compiled program. The optimization and code generation phases must be independent from the particular input programming language. This is generally achieved using language-specific front-ends and back-ends. Abstraction can go further: the internal representation becomes more language independent, and semantic constructs can be unified. Analysis can then concentrate on the semantics of a small set of constructs. We advocate an internal representation composed of three levels.

  • At the top level is the call graph, whose nodes are modules and procedures. Arrows relate nodes that call or import one another. Recursion leads to cycles.

  • At the middle level is the flow graph, one per procedure. It captures the control flow between atomic instructions. Loops lead to cycles.

  • At the lowest level are abstract syntax trees for the individual atomic instructions. Semantic transformations can benefit from the representation of expressions as directed acyclic graphs, sharing common sub-expressions.

To each level belong symbol tables, nested to capture scoping.

Static program analysis can be defined on this internal representation, which is largely language independent. The simplest analyses on trees can be specified with inference rules [18] , [26] , [16] . But many analyses are more complex, and better defined on graphs than on trees. This is the case for data-flow analyses, that look for run-time properties of variables. Since flow graphs may be cyclic, these global analyses generally require an iterative resolution. Data flow equations is a practical formalism to describe data-flow analyses. Another formalism is described in [19] , which is more precise because it can distinguish separate instances of instructions. However it is still based on trees, and its cost forbids application to large codes. Abstract Interpretation [20] is a theoretical framework to study complexity and termination of these analyses.

Data flow analyses must be carefully designed to avoid or control combinatorial explosion. At the call graph level, they can run bottom-up or top-down, and they yield more accurate results when they take into account the different call sites of each procedure, which is called context sensitivity. At the flow graph level, they can run forwards or backwards, and yield more accurate results when they take into account only the possible execution flows resulting from possible control, which is called flow sensitivity.

Even then, data flow analyses are limited, because they are static and thus have very little knowledge of actual run-time values. Far before reaching the very theoretical limit of undecidability, one reaches practical limitations to how much information one can infer from programs that use arrays [32] , [21] or pointers. In general, conservative over-approximations are always made that lead to derivative code that is less efficient than possibly achievable.