Most of these classical fault-tolerant techniques have been developed for safety-critical applications and rely on expensive design approaches, such as redundant implementation, that may not suit other classes of software systems well [1,2].
In recent years, research in different fields has converged on the definition and deployment of self-adaptive and autonomic software systems, that is, systems able to autonomously recover from problems at runtime [3]. Self-adaptation covers various classes of problems and techniques, and is specialized into different self-* techniques depending on the class of problem. For example, self-configuring systems can assemble and configure themselves based on a description of high-level goals, while self-protecting systems can take autonomous action when threatened by attempts to violate their security and safety guarantees. In this chapter we discuss self-healing systems, that is, systems that can recover from functional failures of their constituent components.
Building on key ideas expressed by Kephart and Chess, most self-adaptive systems rely on a variant of the “autonomic cycle” [3]. In their model, an autonomic element, that is, a component or a whole system, is under the control of an autonomic manager, which monitors and analyzes the execution of the element and, in case of problems, plans and executes changes to the system's configuration.
Most research on self-healing systems has addressed issues directly related to adaptability, that is, the planning and execution phases of the autonomic cycle. Such work usually assumes that suitable monitoring and analysis mechanisms exist, instead of treating them as a research problem. Our research tackles the problem of precise failure detection, and thus develops techniques for the monitoring phase of self-healing systems. Even though the detection of functional failures has been explored extensively in the literature on software validation and verification, in the context of self-healing systems we face new challenges. To be acceptable as monitors in production systems, automatic failure detectors (1) cannot rely on human operators to arbitrate the validity of detected problems, (2) must incur only limited performance overhead, (3) must detect failures precisely and produce only a few, if any, false alarms, and (4) must detect failures as early as possible. The results of our research facilitate the automatic generation of runtime monitors that meet these criteria. A fifth criterion that might be considered in self-adaptive systems concerns the effects of adaptations on the correctness of failure detectors. However, the assumptions we make in Sec. 4 allow us to set this consideration aside for the discussion in this chapter.
We argue that a complete and consistent set of well-tailored assertions can meet the requirements above. Encoding thoroughly analyzed system invariants into assertions produces automated oracles and removes human operators from the loop. A careful choice of logic constructs in assertions can ensure low overhead. Assertions suitably placed in critical locations can detect failures precisely and early enough to support efficient fixing.
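As a simple illustration of these points, the following sketch encodes a component invariant as an inline Java assertion; the BoundedBuffer class and its invariant are hypothetical examples chosen here, not taken from the systems discussed in this chapter.

    // A component invariant encoded as an inexpensive in-line assertion
    // that acts as an automated oracle. BoundedBuffer is illustrative only.
    public class BoundedBuffer {
        private final Object[] items;
        private int count = 0;

        public BoundedBuffer(int capacity) {
            items = new Object[capacity];
        }

        public void put(Object item) {
            // Placed at a critical location, the assertion detects a protocol
            // violation before it corrupts the buffer state: precise, early,
            // and constant-time, with no human operator in the loop.
            assert count < items.length : "put() on full buffer, count=" + count;
            items[count++] = item;
        }

        public Object take() {
            assert count > 0 : "take() on empty buffer, count=" + count;
            Object item = items[--count];
            items[count] = null;  // avoid retaining a stale reference
            return item;
        }
    }

In a deployed self-healing system such a violation would be reported to the monitoring infrastructure rather than aborting the program; the standard Java assert mechanism (enabled with java -ea) is used here only to illustrate placement and cost.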
In current practice, assertions are either added directly to the code by programmers [4,5], or are generated from formal specifications that describe invariants of data structures and algorithms [6,7]. In both cases, getting the specification right is non-trivial and highly error-prone [8,9]. Additionally, when writing such