Section: Research Program
Algorithmic Differentiation
Participants: Laurent Hascoët, Valérie Pascual, Ala Taftaf.

Glossary
 algorithmic differentiation
(AD, aka Automatic Differentiation) Transformation of a program that returns a new program computing derivatives of the initial program, i.e. some combination of the partial derivatives of the program's outputs with respect to its inputs.
 adjoint
Mathematical manipulation of the Partial Differential Equations that define a problem, obtaining new differential equations that define the gradient of the original problem's solution.
 checkpointing
General trade-off technique, used in adjoint AD, that accepts duplicate execution of a part of the program in exchange for saving some of the memory space otherwise used to store intermediate results.
Algorithmic Differentiation (AD) differentiates programs. The input of AD is a source program $P$ that, given some $X\in\mathbb{R}^{n}$, returns some $Y=F(X)\in\mathbb{R}^{m}$, for a differentiable $F$. AD generates a new source program $P'$ that, given $X$, computes some derivatives of $F$ [6].
Any execution of $P$ amounts to a sequence of instructions, which is identified with a composition of vector functions. Thus, if
$\begin{array}{rcl} P & \text{runs} & \{I_{1};I_{2};\cdots I_{p};\},\\ F & \text{then is} & f_{p}\circ f_{p-1}\circ\cdots\circ f_{1}, \end{array}$  (1)
where each $f_{k}$ is the elementary function implemented by instruction $I_{k}$. AD applies the chain rule to obtain derivatives of $F$. Calling $X_{k}$ the values of all variables after instruction $I_{k}$, i.e. $X_{0}=X$ and $X_{k}=f_{k}(X_{k-1})$, the Jacobian of $F$ is
$F'(X)=f'_{p}(X_{p-1})\,.\,f'_{p-1}(X_{p-2})\,.\,\cdots\,.\,f'_{1}(X_{0})$  (2)
which can be mechanically written as a sequence of instructions $I'_{k}$. This can be generalized to higher-order derivatives, Taylor series, etc. Combining the $I'_{k}$ with the control of $P$ yields $P'$; since the control flow of $P$ is kept as is, this differentiation is piecewise.
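As an illustration of equation (2), the following minimal Python sketch builds the Jacobian of a two-instruction program as the product of the elementary Jacobians. The toy program, the helper `matmul`, and the names `f1`, `df1`, `f2`, `df2` and `jacobian` are assumptions made for this example only; they are not taken from any AD tool.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Hypothetical toy program P acting on the state X = [x0, x1]:
#   I1:  x1 = x0 * x1
#   I2:  x0 = sin(x1)
def f1(X):
    return [X[0], X[0] * X[1]]

def df1(X):
    # Jacobian of f1 evaluated at X
    return [[1.0, 0.0],
            [X[1], X[0]]]

def f2(X):
    return [math.sin(X[1]), X[1]]

def df2(X):
    # Jacobian of f2 evaluated at X
    return [[0.0, math.cos(X[1])],
            [0.0, 1.0]]

def jacobian(X0):
    """F'(X0) = f2'(X1) . f1'(X0), with X1 = f1(X0)."""
    X1 = f1(X0)
    return matmul(df2(X1), df1(X0))
```

Here $F(x_0,x_1)=(\sin(x_0x_1),\,x_0x_1)$, so the first row of the result should equal $(x_1\cos(x_0x_1),\,x_0\cos(x_0x_1))$ and the second row $(x_1,\,x_0)$. Forming the full matrix product is what the projections discussed next avoid.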
In practice, many applications only need cheaper projections of $F'(X)$ such as:

Sensitivities, defined for a given direction $\dot{X}$ in the input space as:
$F'(X).\dot{X}=f'_{p}(X_{p-1})\,.\,f'_{p-1}(X_{p-2})\,.\,\cdots\,.\,f'_{1}(X_{0})\,.\,\dot{X}$  (3)
This expression is easily computed from right to left, interleaved with the original program instructions. This is the tangent mode of AD.

Adjoints, defined after transposition ($F^{\prime *}$), for a given weighting $\overline{Y}$ of the outputs as:
$F^{\prime *}(X).\overline{Y}=f_{1}^{\prime *}(X_{0})\,.\,f_{2}^{\prime *}(X_{1})\,.\,\cdots\,.\,f_{p-1}^{\prime *}(X_{p-2})\,.\,f_{p}^{\prime *}(X_{p-1})\,.\,\overline{Y}$  (4)
This expression is most efficiently computed from right to left, because matrix$\times$vector products are cheaper than matrix$\times$matrix products. This is the adjoint mode of AD, most effective for optimization, data assimilation [32], adjoint problems [27], or inverse problems.
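The two modes can be sketched by hand on a small program. The Python code below is only an illustration under assumptions: the three-instruction program `P` and the hand-written `P_tangent` and `P_adjoint` are hypothetical, and they mimic the kind of code a source-transformation AD tool might produce rather than the output of any particular tool.

```python
import math

def P(x1, x2):
    a = x1 * x2        # I1
    b = math.sin(a)    # I2
    y = b + x1         # I3
    return y

def P_tangent(x1, x2, x1d, x2d):
    """Tangent mode: each derivative instruction is interleaved
    with (and placed just before) the original instruction."""
    ad = x1d * x2 + x1 * x2d
    a = x1 * x2
    bd = math.cos(a) * ad
    b = math.sin(a)
    yd = bd + x1d
    y = b + x1
    return y, yd

def P_adjoint(x1, x2, yb):
    """Adjoint mode: a forward sweep runs P, then a backward sweep
    applies the transposed derivatives in reverse instruction order."""
    # forward sweep (the intermediate value a is kept for later use)
    a = x1 * x2
    b = math.sin(a)
    y = b + x1
    # backward sweep
    x1b, x2b = 0.0, 0.0
    bb = yb                  # adjoint of I3: y = b + x1
    x1b += yb
    ab = math.cos(a) * bb    # adjoint of I2: b = sin(a)
    x1b += x2 * ab           # adjoint of I1: a = x1 * x2
    x2b += x1 * ab
    return y, x1b, x2b
```

A common consistency check is the dot-product test: for any $\dot{X}$ and $\overline{Y}$, the value `yd * yb` returned by the tangent code should match `x1b * x1d + x2b * x2d` returned by the adjoint code, up to rounding.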
Adjoint AD builds a very efficient program [29]. This adjoint program computes the gradient in a time that is independent of the number of parameters $n$ and is only a small multiple of the runtime of $P$. In contrast, computing the same gradient with the tangent mode would require running the tangent differentiated program $n$ times.
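The difference in cost can be made concrete with a small sketch. It assumes a toy program that computes the Euclidean norm of its $n$ inputs; the functions `norm_tangent`, `norm_adjoint`, `gradient_by_tangent` and `gradient_by_adjoint`, and the `calls` counter, are hypothetical names introduced for this example. Run counts are used here as a stand-in for runtime.

```python
import math

calls = {"tangent": 0, "adjoint": 0}   # counts runs of each differentiated program

def norm_tangent(x, xd):
    """Tangent program of y = sqrt(sum_i x_i^2), for one direction xd."""
    calls["tangent"] += 1
    s, sd = 0.0, 0.0
    for xi, xdi in zip(x, xd):
        sd = sd + 2.0 * xi * xdi
        s = s + xi * xi
    y = math.sqrt(s)
    yd = sd / (2.0 * y)
    return y, yd

def norm_adjoint(x, yb=1.0):
    """Adjoint program of the same y: one forward and one backward sweep."""
    calls["adjoint"] += 1
    s = 0.0
    for xi in x:                       # forward sweep
        s = s + xi * xi
    y = math.sqrt(s)
    sb = yb / (2.0 * y)                # backward sweep starts here
    xb = [0.0] * len(x)
    for i in reversed(range(len(x))):  # loop iterations visited in reverse
        xb[i] = 2.0 * x[i] * sb
    return y, xb

def gradient_by_tangent(x):
    """One tangent run per input: n runs in total."""
    n = len(x)
    grad = []
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        grad.append(norm_tangent(x, e_i)[1])
    return grad

def gradient_by_adjoint(x):
    """A single adjoint run yields the whole gradient."""
    return norm_adjoint(x)[1]
```

For `x = [3.0, 4.0, 12.0]`, both functions should return the gradient $x/\|x\|$, the first after three tangent runs and the second after one adjoint run. The sketch assumes $x\neq 0$, where the norm is differentiable.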
However, the $X_{k}$ are required in the reverse of their computation order. If the original program overwrites a part of $X_{k}$, the differentiated program must restore $X_{k}$ before it is used by $f_{k+1}^{\prime *}(X_{k})$. Therefore, the central research problem of adjoint AD is to make the $X_{k}$ available in reverse order at the lowest cost, using strategies that combine storage, repeated forward computation from available previous values, or even inverted computation from available later values.
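Two extreme strategies for this data-flow reversal can be sketched as follows, assuming a toy program that overwrites its single variable $p$ times. The names `adjoint_store_all` and `adjoint_recompute_all` are hypothetical and chosen for this example; practical tools typically combine both strategies, for instance through checkpointing.

```python
import math

def P(x, p):
    for _ in range(p):
        x = math.sin(x)        # each instruction overwrites x
    return x

def adjoint_store_all(x, p, yb=1.0):
    """Storage strategy: push each value on a stack before it is
    overwritten, pop it during the backward sweep. Memory grows with p."""
    stack = []
    for _ in range(p):         # forward sweep
        stack.append(x)        # save X_{k-1}
        x = math.sin(x)
    xb = yb
    for _ in range(p):         # backward sweep
        x_prev = stack.pop()   # restore X_{k-1}
        xb = math.cos(x_prev) * xb
    return x, xb

def adjoint_recompute_all(x0, p, yb=1.0):
    """Recomputation strategy: rebuild each X_{k-1} from X_0 when needed.
    No stack is used, but the number of forward steps grows like p^2."""
    y = P(x0, p)
    xb = yb
    for k in range(p, 0, -1):
        x_prev = P(x0, k - 1)  # repeated forward computation
        xb = math.cos(x_prev) * xb
    return y, xb
```

Both functions should return the same derivative $\prod_{k=1}^{p}\cos(X_{k-1})$. The first pays for it in memory and the second in repeated execution, which is the trade-off that checkpointing tunes.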
Another research issue is to make the AD model cope with the constant evolution of modern language constructs. Since the old days of Fortran 77, novelties include pointers and dynamic allocation, modularity, structured data types, objects, vector notation, and parallel programming. We keep developing our models and tools to handle these new constructs.