Section: Scientific Foundations

System management and control

Management (or Administration) is the function that aims at maintaining a system's ability to provide its specified services, with a prescribed quality of service. We approach management as a control activity, involving an event-reaction loop: the management system detects events that may alter the ability of the managed system to perform its function, and reacts to these events by trying to restore this ability. The operations performed under system and application administration include observation and monitoring, configuration and deployment, resource management, performance management, and fault management.
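The event-reaction loop described above can be illustrated with a minimal sketch. Everything in it is a hypothetical illustration rather than a system built by the team: the `ManagedSystem` class, the `overload` event, the 0.8 utilization threshold and the "add one worker" reaction are all assumed. Detection observes the managed system and emits events; reaction tries to restore the prescribed quality of service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ManagedSystem:
    """Hypothetical managed system: a pool of workers serving a load."""
    workers: int = 2
    capacity_per_worker: int = 100
    load: int = 0

    def utilization(self) -> float:
        return self.load / (self.workers * self.capacity_per_worker)


@dataclass
class Event:
    kind: str
    value: float


def detect(system: ManagedSystem, high: float = 0.8) -> List[Event]:
    """Observation step: report an event when utilization exceeds the threshold."""
    u = system.utilization()
    return [Event("overload", u)] if u > high else []


def react(system: ManagedSystem, event: Event) -> None:
    """Reaction step: try to restore service quality (assumed policy: add a worker)."""
    if event.kind == "overload":
        system.workers += 1


def control_loop(system: ManagedSystem, max_steps: int = 10, high: float = 0.8) -> int:
    """Run detect/react iterations until no event is raised; return the number of reactions."""
    steps = 0
    for _ in range(max_steps):
        events = detect(system, high)
        if not events:
            break
        for event in events:
            react(system, event)
        steps += 1
    return steps


if __name__ == "__main__":
    system = ManagedSystem(workers=2, capacity_per_worker=100, load=350)
    steps = control_loop(system)
    print(steps, system.workers, round(system.utilization(), 2))
```

In a real deployment the detection step would rely on monitoring probes and the reaction step on configuration and deployment operations; the sketch only shows the shape of the loop.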

Up to now, administration tasks have mainly been performed in an ad hoc fashion. Much of the knowledge needed for these tasks is not formalized and remains part of the administrators' know-how and experience. As the size and complexity of systems and applications increase, administration costs take up a major part of total information processing budgets, and the difficulty of administration tasks tends to approach the limits of the administrators' skills. For example, an analysis of the causes of failures of Internet services [99] shows that most service downtime may be attributed to management errors (e.g., wrong configuration), and that software failures come second. In the same vein, unexpected load variations are difficult to manage, since they require short reaction times that human administrators cannot achieve.

The above motivates a new approach, in which a significant part of management-related functions is performed automatically, with minimal human intervention. This is the goal of the so-called autonomic computing movement [82]. Several research projects [51] are active in this area, and recent surveys [83], [79] cover the main research problems related to autonomic computing. Of particular importance for Sardes' work are the issues associated with configuration, deployment and reconfiguration [71], and techniques for constructing control algorithms in the decision stage of administration feedback loops, including both discrete control techniques [66] and continuous ones [76].

Management and control functions built by Sardes also require the development of distributed algorithms [92], [111] at different scales, from algorithms for multiprocessor architectures [78] to algorithms for cloud computing [94] and for dynamic peer-to-peer computing systems [55], [100]. Of particular relevance in the latter contexts are epidemic protocols such as gossip protocols [107], because of their natural resilience to node dynamicity (churn) and their inherent scalability.
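To give an intuition for the scalability of epidemic protocols, the following is a minimal simulation sketch of push-style gossip dissemination. It is an assumed, simplified model and not one of the protocols cited above: rounds are synchronous, each informed node contacts `fanout` peers chosen uniformly at random, there are no failures and no churn, and the names `push_gossip`, `fanout` and `max_rounds` are hypothetical.

```python
import random
from typing import Tuple


def push_gossip(n: int, fanout: int = 1, seed: int = 0,
                max_rounds: int = 10_000) -> Tuple[int, int]:
    """Simulate push gossip among n nodes, starting from node 0.

    Returns (rounds executed, number of informed nodes).
    """
    rng = random.Random(seed)  # seeded generator for reproducible runs
    informed = {0}
    rounds = 0
    while len(informed) < n and rounds < max_rounds:
        newly_informed = set()
        for _node in informed:
            for _ in range(fanout):
                # Each informed node pushes the message to a uniformly random peer
                # (possibly itself or an already informed node).
                newly_informed.add(rng.randrange(n))
        informed |= newly_informed
        rounds += 1
    return rounds, len(informed)


if __name__ == "__main__":
    for size in (64, 1024, 16384):
        rounds, count = push_gossip(size)
        print(size, rounds, count)
```

With a fanout of 1 the informed set can at most double per round, so dissemination needs at least log2(n) rounds; in this simplified model the number of rounds observed in practice grows slowly with n, which is the property that makes such protocols attractive at large scale.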