Section: New Results

Resilient workflows for distributed multidiscipline optimization

Participants : Toan Nguyen, Laurentiu Trifan.

A distributed platform based on the YAWL workflow management system has been designed and implemented to deploy HPC applications on the Grid5000 network infrastructure. The goal is to provide a generic environment for the design of large-scale, fault-tolerant applications that require HPC resources, see Fig. 2 and [39].

The platform provides application-level fault-tolerance, i.e., resilience, so that the workflow execution can be restarted whenever abnormal behavior or system-level errors occur. A variety of errors can thus be managed, ranging from execution time-outs to out-of-bounds parameter values, with the help of user intervention when necessary [40].

The error management procedure uses exception handlers in YAWL to trigger the appropriate corrective actions, which are defined by rules invoking the relevant compensating workflows. Once the rules are defined, error handling can be made transparent to the users [41].
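
The rule-based dispatch described above can be illustrated with a minimal Java sketch. It does not use the YAWL API and is not the platform's actual code: the class `CompensationRules`, the `ErrorKind` categories, and the workflow names are all hypothetical, chosen only to show how an error kind might be mapped to a compensating workflow, with user intervention as the fallback when no rule applies.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical rule table mapping error kinds to compensating workflows (not the YAWL API). */
final class CompensationRules {

    /** Illustrative error categories: the two examples named in the text, plus a catch-all. */
    enum ErrorKind { EXECUTION_TIMEOUT, PARAMETER_OUT_OF_BOUNDS, UNKNOWN }

    // One compensating workflow name per error kind; names are assumptions for illustration.
    private final Map<ErrorKind, String> rules = new EnumMap<>(ErrorKind.class);

    /** Registers the compensating workflow to launch for a given error kind. */
    CompensationRules register(ErrorKind kind, String compensatingWorkflow) {
        rules.put(kind, compensatingWorkflow);
        return this;
    }

    /** Returns the workflow to launch, or empty when no rule matches and a user must decide. */
    Optional<String> select(ErrorKind kind) {
        return Optional.ofNullable(rules.get(kind));
    }

    public static void main(String[] args) {
        CompensationRules rules = new CompensationRules()
                .register(ErrorKind.EXECUTION_TIMEOUT, "restartFromLastCheckpoint")
                .register(ErrorKind.PARAMETER_OUT_OF_BOUNDS, "resetParametersAndRerun");

        // An error with no registered rule falls back to user intervention.
        for (ErrorKind kind : ErrorKind.values()) {
            String action = rules.select(kind).orElse("ask the user");
            System.out.println(kind + " -> " + action);
        }
    }
}
```

Keeping the rules in a table separate from the handlers is one way to make the handling transparent to users once the rules are in place: adding a new error case then means registering a rule rather than changing the workflow itself.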

An original scheme based on asymmetric checkpoints has been designed in order to reduce overhead in both checkpointing and application restarts. It minimizes the number of checkpoints that must be created, based on default rules and user-specific needs.
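
As a rough illustration of how default rules and user-specific needs might be combined to limit the number of checkpoints, consider the following Java sketch. It is not the asymmetric checkpoint scheme itself, whose details are not given here. The class `CheckpointPolicy`, the runtime threshold used as the default rule, and the task names are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical checkpoint policy: one default rule plus per-task user overrides. */
final class CheckpointPolicy {

    // Assumed default rule: only tasks expected to run at least this long are checkpointed.
    private final long minRuntimeSeconds;

    // User-specific needs: force (true) or suppress (false) a checkpoint for a named task.
    private final Map<String, Boolean> userOverrides = new HashMap<>();

    CheckpointPolicy(long minRuntimeSeconds) {
        this.minRuntimeSeconds = minRuntimeSeconds;
    }

    /** Records a user decision that takes precedence over the default rule. */
    CheckpointPolicy override(String taskName, boolean checkpoint) {
        userOverrides.put(taskName, checkpoint);
        return this;
    }

    /** Decides whether a checkpoint should be taken after the given task. */
    boolean shouldCheckpoint(String taskName, long estimatedRuntimeSeconds) {
        Boolean forced = userOverrides.get(taskName);
        if (forced != null) {
            return forced; // a user override wins over the default rule
        }
        return estimatedRuntimeSeconds >= minRuntimeSeconds;
    }
}
```

With a policy such as `new CheckpointPolicy(600).override("solver", false)`, a task that is cheap to recompute is never checkpointed under the default rule, while the user can still suppress or force a checkpoint on any individual task.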

The platform is currently developed in Java on Linux workstations and should be portable to Windows and MacOS, although this has not been tested yet.

Examples are deployed on the Grid5000 national network infrastructure using the OMD2 test-cases (e.g., vehicle air-conditioner pipe optimization). The goal here is to provide a demonstrator platform that deploys large-scale optimization applications involving several (typically over five) HPC clusters distributed on the Grid5000 network. The coarse-grain structure of the application is defined by a workflow that monitors the distributed execution of the parallel component codes on the various clusters, providing resilience capabilities in case of system and application errors, see Fig. 5.

Figure 5. Application definition using YAWL.