## Section: Scientific Foundations

### Static program analysis

Static program analysis is concerned with obtaining information about the run-time behaviour of a program without actually running it. This information may concern the values of variables, the relations among them, dependencies between program values, the memory structure being built and manipulated, the flow of control, and, for concurrent programs, synchronisation among processes executing in parallel. Fully automated analyses usually yield approximate information about the actual program behaviour. An analysis is correct if the information it computes covers all possible behaviours of the program; its precision is improved by reducing the amount of spurious behaviour it reports, i.e., behaviour that can never occur.

Static analysis has traditionally found most of its applications in the area of program optimisation where information about the run-time behaviour can be used to transform a program so that it performs a calculation faster and/or makes better use of the available memory resources. The last decade has witnessed an increasing use of static analysis in software verification for proving invariants about programs. The Celtique project is mainly concerned with this latter use. Examples of static analysis include:

Data-flow analysis as it is used in optimising compilers for imperative languages. The properties can either be approximations of the values of an expression (“the value of variable $x$ is greater than 0” or “$x$ is equal to $y$ at this point in the program”) or more intensional information about program behaviour such as “this variable is not used before being re-defined” in the classical “dead-variable” analysis [71].
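The classical dead-variable analysis can be sketched in a few lines. The following is a minimal backward liveness computation over a tiny three-address program; the instruction representation (def/use sets plus successor indices) is a simplifying assumption made for the sketch, not a representation taken from the text.

```python
def liveness(instrs):
    """instrs: list of (defs, uses, succs) triples, indexed by position.
    Iterates  in(i) = use(i) U (out(i) \ def(i)),
              out(i) = U { in(s) | s in succs(i) }
    to a fixpoint, returning the live-in set of each instruction."""
    live_in = [set() for _ in instrs]
    changed = True
    while changed:
        changed = False
        for i, (defs, uses, succs) in enumerate(instrs):
            out = set().union(*(live_in[s] for s in succs)) if succs else set()
            new_in = uses | (out - defs)
            if new_in != live_in[i]:
                live_in[i] = new_in
                changed = True
    return live_in

def dead_assignments(instrs, live_in):
    """An assignment is dead when the variable it defines is not live after it."""
    dead = []
    for i, (defs, uses, succs) in enumerate(instrs):
        out = set().union(*(live_in[s] for s in succs)) if succs else set()
        if defs and not (defs & out):
            dead.append(i)
    return dead

# x = 1; y = x + 1; x = 2; return x   -- y is assigned but never used
prog = [
    ({"x"}, set(), [1]),   # 0: x = 1
    ({"y"}, {"x"}, [2]),   # 1: y = x + 1  (dead: y is never read)
    ({"x"}, set(), [3]),   # 2: x = 2
    (set(), {"x"}, []),    # 3: return x
]
live = liveness(prog)
```

On this example, the analysis flags instruction 1 as a dead assignment, since `y` is never used before the end of the program.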

Analyses of the memory structure include shape analysis, which aims at approximating the data structures created by a program. Alias analysis is another data-flow analysis that determines which variables in a program address the same memory location. Alias analysis is fundamental for all kinds of programs (imperative, object-oriented) that manipulate state, because alias information is necessary for the precise modelling of assignments.
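A may-alias relation is typically derived from a points-to analysis: two variables may alias when their points-to sets intersect. The following flow-insensitive, Andersen-style sketch works over two illustrative statement forms (`x = new ...` at a labelled allocation site, and the copy `x = y`); this representation is an assumption made for the example.

```python
def points_to(stmts):
    """stmts: ('new', x, site) meaning x = new at allocation site,
    or ('copy', x, y) meaning x = y.  Iterates the set-inclusion
    constraints to a fixpoint and returns the points-to map."""
    pts = {}
    changed = True
    while changed:
        changed = False
        for s in stmts:
            if s[0] == 'new':
                _, x, site = s
                if site not in pts.setdefault(x, set()):
                    pts[x].add(site)
                    changed = True
            else:  # copy: pts(x) must include pts(y)
                _, x, y = s
                extra = pts.get(y, set()) - pts.setdefault(x, set())
                if extra:
                    pts[x] |= extra
                    changed = True
    return pts

def may_alias(pts, x, y):
    """x and y may alias iff their points-to sets share an allocation site."""
    return bool(pts.get(x, set()) & pts.get(y, set()))

stmts = [('new', 'a', 'A1'), ('copy', 'b', 'a'), ('new', 'c', 'A2')]
pts = points_to(stmts)
```

Here `a` and `b` may alias (both may point to site `A1`), while `a` and `c` cannot.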

Control flow analysis finds a safe approximation to the order in which the instructions of a program are executed. This is particularly relevant in languages where functions can be passed as arguments to other functions, making it impossible to determine the flow of control from the program syntax alone. The same phenomenon occurs in object-oriented languages, where it is the dynamic class of an object (rather than the static type of the variable containing it) that determines which method a given method invocation will call. Control flow analysis is an example of an analysis whose results do not in themselves lead to dramatic optimisations (although they might enable in-lining of code) but are necessary for subsequent analyses to give precise results.
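The object-oriented case can be illustrated with class-hierarchy analysis (CHA), the simplest approximation of dynamic dispatch: a call `o.m()` where `o` has static type `C` may reach the definition of `m` in `C` or in any subclass of `C`. The class table below is an illustrative assumption.

```python
subclasses = {'Shape': ['Circle', 'Square'], 'Circle': [], 'Square': []}
defines    = {'Shape': {'area'}, 'Circle': {'area'}, 'Square': set()}

def cone(c):
    """c together with all of its transitive subclasses."""
    out = [c]
    for s in subclasses.get(c, []):
        out += cone(s)
    return out

def cha_targets(static_type, method):
    """Classes whose definition of `method` the call may dispatch to."""
    targets = set()
    for c in cone(static_type):
        k = c
        while method not in defines.get(k, set()):
            # the class inherits the method: walk up to the parent class
            k = next(p for p, subs in subclasses.items() if k in subs)
        targets.add(k)
    return targets
```

A call through static type `Shape` may dispatch to `Shape.area` (inherited by `Square`) or to the override in `Circle`, whereas a call through static type `Circle` has a single possible target.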

Static analysis possesses strong **semantic foundations**, notably abstract interpretation [51], that allow its correctness to be proved. The implementation of static analyses is usually based on well-understood constraint-solving techniques and iterative fixpoint algorithms. In spite of the nice mathematical theory of program analysis and the solid algorithmic techniques available, one problematic issue persists, *viz.*, the *gap* between the analysis that is proved correct on paper and the analyser that actually runs on the machine. While this gap might be small for toy languages, it becomes significant for real-life languages, for which the implementation and maintenance of program analysis tools become a software engineering task in their own right. A *certified static analysis* is an analysis that has been formally proved correct using a proof assistant.
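The constraint-solving and fixpoint machinery mentioned above can be sketched generically: an analysis produces one monotone equation per program point over a lattice, and a worklist algorithm computes the least solution by chaotic iteration. The sign lattice and the small loop example are standard illustrations chosen for this sketch, not constructions from the text.

```python
BOT, NEG, POS, TOP = 'bot', 'neg', 'pos', 'top'

def join(a, b):
    """Least upper bound in the flat sign lattice."""
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP

def negate(v):
    """Abstract transfer function for unary minus."""
    return {POS: NEG, NEG: POS}.get(v, v)

def solve(n, deps, transfer):
    """deps[j]: the points whose values equation j reads.
    transfer(i, env): right-hand side of equation i.
    Computes the least fixpoint by worklist iteration."""
    env = [BOT] * n
    work = list(range(n))
    while work:
        i = work.pop()
        new = transfer(i, env)
        if new != env[i]:
            env[i] = new
            # re-examine every equation that depends on point i
            work.extend(j for j in range(n) if i in deps[j])
    return env

# x := 1; while (...) { x := -x }   -- sign of x at the three points
deps = [set(), {0, 2}, {1}]
def transfer(i, env):
    if i == 0:
        return POS                      # after x := 1
    if i == 1:
        return join(env[0], env[2])     # loop head: entry or back edge
    return negate(env[1])               # after x := -x

env = solve(3, deps, transfer)
```

The solver correctly concludes that `x` is positive after the initial assignment but may be of either sign inside the loop.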

In previous work we studied the benefit of using abstract
interpretation for developing **certified static analyses**
[49] , [74] . The development of
certified static analysers is an ongoing activity that will be part of
the Celtique project. We use the Coq proof assistant which allows for
extracting the computational content of a constructive proof. A Caml
implementation can hence be extracted from a proof of existence, for
any program, of a correct approximation of the concrete program
semantics. We have isolated a theoretical framework based on abstract
interpretation allowing for the formal development of a broad range of
static analyses. Several case studies for the analysis of Java byte
code have been presented, notably a memory usage analysis
[50]. This work has recently found application in the context of Proof Carrying Code and has also been successfully applied to a particular form of static analysis based on term rewriting and tree automata [3].

#### Static analysis of Java

Precise context-sensitive control-flow analysis is a fundamental
prerequisite for precisely analysing Java programs.
Bacon and Sweeney's Rapid Type Analysis (RTA) [42] is a
scalable algorithm for constructing an initial call-graph of the
program. Tip and Palsberg [80] have proposed a variety of
more precise but scalable call graph construction algorithms
*e.g.,* MTA, FTA, XTA which accuracy is between RTA and 0'CFA.
All those analyses are not context-sensitive. As early as 1991,
Palsberg and Schwartzbach [72] , [73] proposed a theoretical
parametric framework for typing object-oriented programs in a
context-sensitive way. In their setting, context-sensitivity is
obtained by explicit code duplication and typing amounts to analysing
the expanded code in a context-insensitive manner. The framework
accommodates for both call-contexts and allocation-contexts.
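The idea behind RTA can be sketched as follows: a virtual call is resolved only to classes that are actually instantiated somewhere in the reachable part of the program, and reachability is recomputed until stable. Method bodies are modelled here as lists of `new` and `call` events; the program below is an illustrative assumption.

```python
# method body: list of ('new', ClassName) or ('call', method_name)
methods = {
    'main':        [('new', 'Circle'), ('call', 'area')],
    'Circle.area': [],
    'Square.area': [],
}
implementations = {'area': {'Circle': 'Circle.area', 'Square': 'Square.area'}}

def rta(entry):
    """Rapid Type Analysis: dispatch virtual calls only on classes
    instantiated in reachable code; iterate until reachability is stable."""
    instantiated, prev = set(), None
    while True:                          # outer fixpoint
        reachable, work = set(), [entry]
        while work:
            m = work.pop()
            if m in reachable:
                continue
            reachable.add(m)
            for kind, arg in methods[m]:
                if kind == 'new':
                    instantiated.add(arg)
                else:                    # virtual call
                    for cls, impl in implementations[arg].items():
                        if cls in instantiated:
                            work.append(impl)
        if reachable == prev:
            return reachable, instantiated
        prev = reachable

reach, inst = rta('main')
```

Since `Square` is never instantiated, RTA excludes `Square.area` from the call graph, whereas a class-hierarchy-based construction would include it.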

To assess the respective merits of different instantiations, scalable implementations are needed. For Cecil and Java programs, Grove *et al.* [60], [59] have explored the algorithmic design space of contexts for benchmarks of significant size. Later, Milanova *et al.* [66] evaluated, for Java programs, a notion of context called *object-sensitivity*, which abstracts the call-context by the abstraction of the `this` pointer. More recently, Lhotak and Hendren [64] have extended the empirical evaluation of object-sensitivity using a BDD implementation that can cope with benchmarks otherwise out of reach. Besson and Jensen [46] proposed to use Datalog in order to specify context-sensitive analyses. Whaley and Lam [81] have implemented a context-sensitive analysis using a BDD-based Datalog implementation.
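The appeal of the Datalog style is that an analysis is written as a handful of logical rules whose least model is computed by a generic engine. As a minimal sketch (the naive bottom-up evaluator and the example facts are assumptions made here), a context-insensitive points-to analysis reduces to two rules:

```python
# Facts extracted from the program:
#   new(v, h)    -- v = new ... at allocation site h
#   assign(v, w) -- v = w
new    = {('a', 'H1'), ('c', 'H2')}
assign = {('b', 'a')}

def solve_datalog():
    """Least model of:
       pointsTo(v, h) :- new(v, h).
       pointsTo(v, h) :- assign(v, w), pointsTo(w, h).
    computed by naive bottom-up iteration."""
    points_to = set(new)
    while True:
        derived = {(v, h)
                   for (v, w) in assign
                   for (w2, h) in points_to if w == w2}
        if derived <= points_to:
            return points_to
        points_to |= derived

pt = solve_datalog()
```

Production implementations replace this naive evaluation with semi-naive iteration or, as in the BDD-based systems cited above, with symbolic relation representations.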

Control-flow analyses are a prerequisite for other analyses. For instance, the security analysis of Livshits and Lam [65] and the race analyses of Naik, Aiken and Whaley [67], [68] rely heavily on the precision of an underlying control-flow analysis.

Control-flow analysis makes it possible to statically prove the absence of certain run-time errors such as “message not understood” or cast exceptions. Yet it does not tackle the problem of null pointers. Fähndrich and Leino [55] propose a type system for checking that fields are non-null after object creation. Hubert, Jensen and Pichardie have formalised the type system and derived a type-inference algorithm computing the most precise typing [63]. The proposed technique has been implemented in a tool called NIT [62]. Null pointer detection is also performed by bug-detection tools such as FindBugs [62]. The main difference is that the FindBugs approach is neither sound nor complete, but effective in practice.
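A drastically simplified version of the non-null discipline conveys the idea: a field may be typed non-null only if every constructor initialises it and no assignment may ever store null into it. The summary representation of fields below is an assumption made for the sketch, not the actual NIT representation.

```python
# field -> (initialised in every constructor?, values ever assigned to it)
fields = {
    'name':  (True,  ['nonnull']),            # always initialised, never null
    'cache': (False, ['nonnull']),            # may be read before initialisation
    'next':  (True,  ['nonnull', 'null']),    # null is stored explicitly
}

def infer_nullness(fields):
    """Most precise typing under the two conditions above:
    'nonnull' only when both hold, 'maybenull' otherwise."""
    return {f: 'nonnull' if init and 'null' not in vals else 'maybenull'
            for f, (init, vals) in fields.items()}

typing = infer_nullness(fields)
```

Only `name` can be declared non-null; `cache` fails the initialisation condition and `next` fails the assignment condition.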

#### Quantitative aspects of static analysis

Static analyses yield qualitative results, in the sense that they compute a safe over-approximation of the concrete semantics of a program, w.r.t. an order provided by the abstract domain structure. Quantitative aspects of static analysis are two-sided: on the one hand, one may want to express and verify (compute) quantitative properties of programs that are not captured by the usual semantics, such as time, memory, or energy consumption; on the other hand, there is a deep interest in quantifying the precision of an analysis, in order to tune the balance between the complexity of the analysis and the accuracy of its result.

The term “quantitative analysis” is often related to probabilistic models for abstract computation devices such as timed automata or process algebras. In the field of programming languages, which is more specifically addressed by the Celtique project, several approaches have been proposed for quantifying resource usage: a non-exhaustive list includes memory usage analysis based on specific type systems [61], [41], linear-logic approaches to implicit computational complexity [43], a cost model for Java byte code [37] based on size-relation inference, and WCET computation by abstract-interpretation-based loop bound interval analysis techniques [52].

We have proposed an original approach for designing static analyses that compute program costs: inspired by a probabilistic approach [75], a quantitative operational semantics for expressing the cost of executing a program has been defined. The semantics is seen as a linear operator over a dioid structure similar to a vector space. The notion of long-run cost is particularly interesting in the context of embedded software, since it provides an approximation of the asymptotic behaviour of a program in terms of computation cost. As in classical static analysis, an abstraction mechanism makes it possible to effectively compute an over-approximation of the semantics, both in terms of costs and of accessible states [48]. An example of cache miss analysis has been developed within this framework [79].
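A toy instance gives the flavour of the linear-operator view: transitions of a program form a cost matrix over the (max, +) dioid, where matrix product replaces addition by max and multiplication by +, and the long-run cost per transition appears as the growth rate of iterated powers. The three-state transition system below, and the choice of the (max, +) dioid, are assumptions made for this sketch; the framework of [48] is considerably more general.

```python
NEG_INF = float('-inf')

def matmul(a, b):
    """Matrix product in the (max, +) dioid: (A*B)[i][j] = max_k A[i][k] + B[k][j]."""
    n = len(a)
    return [[max(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Transitions: 0 -> 1 (cost 1), 1 -> 2 (cost 3), 2 -> 1 (cost 1).
# The loop between 1 and 2 costs 4 every two transitions, so the
# asymptotic cost is 2 per transition.
M = [[NEG_INF, 1,       NEG_INF],
     [NEG_INF, NEG_INF, 3],
     [NEG_INF, 1,       NEG_INF]]

def long_run_cost(m, n=200):
    """Approximate the long-run cost as (max entry of M^n) / n."""
    p = m
    for _ in range(n - 1):
        p = matmul(p, m)
    return max(x for row in p for x in row if x != NEG_INF) / n
```

On this example the approximation is exact: the most expensive path of length 200 costs 400, giving a long-run cost of 2 per transition.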

#### Semantic analysis for test case generation

The semantic analysis of programs can be combined with efficient
constraint solving techniques in order to extract specific
information about the program, *e.g.*, concerning the
accessibility of program points and the feasibility of execution paths [76], [54]. As
such, it has an important use in the automatic generation of test
data. Automatic test data generation has received considerable attention in recent years
with the development of efficient, dedicated constraint-solving
procedures and compositional techniques [58].

We have made major contributions to the development of **constraint-based testing**, a two-stage process
consisting of first generating a constraint-based
model of the program's data flow and then, given a testing objective such as a statement to reach
or a property to invalidate, extracting
a constraint system to be solved.
Using efficient constraint-solving
techniques makes it possible to generate test data that satisfy the testing
objective, although this generation might not always terminate.
In a certain sense, these constraint techniques can be seen as efficient decision procedures,
and as such they are competitive with the best software model
checkers employed to generate test data.
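A minimal instance of the two-stage process can be sketched as follows: each branch decision on the path to the target statement contributes one constraint, and any solution of the conjunction is a test input reaching the target. The single-variable interval solver below stands in for the dedicated constraint-solving procedures cited above and is an assumption made purely for illustration.

```python
def path_constraints_to_interval(constraints):
    """Each constraint is ('>', c) or ('<', c) on one integer input.
    Returns the interval of solutions, or None if the path is infeasible."""
    lo, hi = float('-inf'), float('inf')
    for op, c in constraints:
        if op == '>':
            lo = max(lo, c + 1)   # x > c  tightens the lower bound
        else:
            hi = min(hi, c - 1)   # x < c  tightens the upper bound
    return (lo, hi) if lo <= hi else None

# Target: the body of  `if x > 10: if x < 20: target()`.
path = [('>', 10), ('<', 20)]
sol = path_constraints_to_interval(path)
# A contradictory path condition is detected as infeasible:
infeasible = path_constraints_to_interval([('>', 10), ('<', 5)])
```

Any value in the resulting interval, for instance its lower bound, is a test datum that drives execution to the target; an empty interval proves the path infeasible, which is exactly the accessibility information mentioned above.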