Mapping Tightly-Coupled Applications on Volatile Resources
Henri Casanova, Fanny Dufossé, Yves Robert, Frédéric Vivien

To cite this version:
Henri Casanova, Fanny Dufossé, Yves Robert, Frédéric Vivien. Mapping Tightly-Coupled Applications on Volatile Resources. 2012. ensl-00697621

HAL Id: ensl-00697621
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00697621
Submitted on 15 May 2012

Abstract—Platforms that comprise volatile processors, such as desktop grids, have been traditionally used for executing independent-task applications. In this work we study the scheduling of tightly-coupled iterative master-worker applications onto volatile processors. The main challenge is that workers must be simultaneously available for the application to make progress. We consider two additional complications: one should take into account that workers can become temporarily reclaimed and, for data-intensive applications, one should account for the limited bandwidth between the master and the workers.

In this context, our first contribution is a theoretical study of the scheduling problem in its off-line version, i.e., when processor availability is known in advance. Even in this case the problem is NP-hard. Our second contribution is an analytical approximation of the expectation of the time needed by a set of workers to complete a set of tasks and of the probability of success of this computation. This approximation relies on a Markovian assumption for the temporal availability of processors. Our third contribution is a set of heuristics, some of which use the above approximation to favor reliable processors in a sensible manner. We evaluate these heuristics in simulation. We identify some heuristics that significantly outperform their competitors and derive heuristic design guidelines.

I. INTRODUCTION

In this paper we study the problem of scheduling parallel applications onto volatile processors. We target typical scientific iterative applications in which a master process parallelizes the execution of each iteration across worker processes. Each iteration requires the execution of a fixed number of tasks, with a global synchronization at the end of each iteration. In [1] we have studied this problem when these tasks are independent. In this work instead we consider tightly-coupled tasks that exchange data throughout each iteration, thus requiring that workers be simultaneously available. This work and that in [1] cover the two extremes of the parallelization spectrum, and are together representative of a large class of scientific applications.

We consider a platform that consists of processors that alternate between periods of availability and periods of unavailability. When available each processor runs a worker process, and a master process can choose to enroll a subset of these workers to participate in the application execution. Worker unavailability can be due to software faults, in which case unavailability may last only the time of a reboot. A hardware failure can lead to a longer unavailability period, until a repair is completed and followed by a reboot. We consider a third source of processor unavailability, which comes from cycle-stealing scenarios: when a processor is contributed to the platform by an individual owner, this owner can reclaim it at any time without notice for some unknown length of time. A difference here is that the processor is merely preempted (as opposed to being terminated) until the processor is no longer reclaimed. A worker process on this processor can later resume its computation. Accordingly, we use a 3-state availability model: UP (available), DOWN (crashed, computation is lost) and RECLAIMED (preempted, but computation can resume later). Our platform model also accounts for the fact that, due to bandwidth limitation, the master is only able to communicate simultaneously with a limited number of workers (to send them the application program as well as task data). This limitation corresponds to the bounded multi-port model [2]. It turns out that limiting the communication capacity of the master dramatically complicates the design of scheduling strategies. But without this limitation it would be in principle possible to enroll thousands of new processors at each iteration, which is simply not feasible in practice even if this many processors are available.

Given the above application and platform models, and given a deadline (typically expressed in hours or days), the scheduling problem we study is that of maximizing the expected number of application iterations successfully completed before the deadline. Informally, during each iteration, one must use the “best” processors among those that are simultaneously UP; these could be the fastest ones, or those expected to remain UP for the longest time. In addition, with processors failing, becoming reclaimed, and becoming UP again later, one has to decide when and how to change the set of currently enrolled processors. Each such change comes at a price: first, the application program needs to be sent to newly enrolled processors, thereby consuming some of the master’s bandwidth; second, and more importantly, iteration computation that was only partially completed is lost due to the tight coupling of tasks.

Our contribution in this work is threefold. First, we determine the complexity of the off-line scheduling problem, i.e., when processor availability is known in advance. Even with such knowledge the problem is NP-hard. Second, we compute approximations of the expectation of the time needed by a set of processors to complete a set of tasks and of the probability that this computation succeeds. These approximations provide a sound basis for making sensible scheduling decisions. Third, we design several on-line heuristics that we evaluate in simulation. Some of these contributions assume Markovian processor availability, which is not representative of real-world platforms but provides a tractable framework for obtaining theoretical and experimental results in laboratory conditions.

This paper is organized as follows. In Section II, we discuss related work. We give a formal definition of the application and platform models in Section III. In Section IV we define the scheduling problem and establish off-line complexity results. In Section V, we introduce a 3-state Markovian model of processor availability, and use this model to compute approximations of relevant probabilistic quantities. In Section VI, we describe several heuristics for solving the on-line scheduling problem, which are evaluated in simulation in Section VII. Finally, we summarize our results and provide perspectives on future work in Section VIII.

II. RELATED WORK

Iterative applications that can be implemented in master-worker fashion are widely used in computational linear algebra for sparse linear systems (e.g., [3]), eigenvalue problems (e.g., [4]), image processing (e.g., [5]), signal processing (e.g., [6]), etc.

Several authors have proposed scheduling approaches for such applications (see, e.g., [7], [8], [9]). In this work we consider volatile compute resources, such as those found in desktop grids, whose volatility has been studied in [10], [11], [12], [13]. Several authors have studied the “bag-of-tasks” scheduling problem on these platforms, either at an Internet-wide scale or within an Enterprise [14], [15], [16], [17], [18], [19], [20], [21]. Most of these works propose simple greedy scheduling algorithms that rely on mechanisms to select processors based on static criteria (e.g., processor clock-rates, benchmark results, time zone), on simple statistics of past availability [14], [16], [19], [15], and on predictive statistical models of availability [20], [21], [17], [18]. These criteria are used to rank processors but also to exclude them from consideration [14], [16]. The work in [18] is particularly related to our own in that it uses a Markov model of processor availability (but without accounting for temporary preemption). Given the wealth of scheduling approaches, in [19] the authors propose to automatically instantiate the parameters that together define the behavior of a generic scheduling algorithm. Most works published in this area are of a pragmatic nature and few theoretical results have been sought or obtained (one exception is the work in [22]).

A key aspect of our work is that we seek to develop scheduling algorithms that explicitly manage the master’s bandwidth. Limited master bandwidth is a known issue for desktop grid computing [23], [24], [25] and must therefore be addressed even though it complicates the scheduling problem. To the best of our knowledge, except for our previous study of independent tasks [1], no previous work has made such an attempt.

III. MODELS AND ASSUMPTIONS

We assume that time is discretized into a sequence of time-slots of arbitrarily chosen duration. For simplicity when we say “at time $t$” we imply “at discrete time-slot $t$”. Our approach is agnostic to the time-slot duration. The duration that makes sense in practice depends on the application and/or platform, ranging from seconds to minutes or possibly hours.

A. Application model

We consider an application that performs a sequence of iterations. Each iteration consists of executing $m$ tasks and ends with a global synchronization. All $m$ tasks are identical (in terms of computational cost) and communicate throughout the iteration execution. Therefore, all tasks must make progress at the same rate. If a task is terminated prematurely (due to a worker failure), all computation performed so far for the current iteration is lost, and the entire iteration has to be restarted. If a task is suspended (due to a worker becoming temporarily reclaimed), then the entire execution of the iteration is also suspended. Due to the global synchronization, there is no overlap between communications and computations. We thus consider that an iteration proceeds in two phases: a communication phase and a computation phase. Finally, before being able to compute, a worker must acquire the application code once (e.g., binary executable, byte code), of constant size $V_{prog}$ in bytes, and the input data for each task and iteration, of constant size $V_{data}$ in bytes.

B. Platform model

The platform comprises $p$ processors. Since each processor executes a worker process, we use the terms processor and worker interchangeably. Worker $P_q$, $q = 1, \ldots, p$, can be in one of three states (UP, RECLAIMED or DOWN), and transitions between these states occur for each processor at each time-slot independently of the other processors. More precisely:

- Any UP processor can become DOWN or RECLAIMED.
- Any UP or RECLAIMED processor can become DOWN. It then loses the application program and all the data for its current tasks. If it was computing some of these tasks, these computations are lost.
- Any UP processor can become RECLAIMED. The processor does not lose any state. If it was receiving the application program or data for a task, the communication is temporarily suspended. If it was computing a task, the computation on all processors is temporarily suspended.

We denote by \( S_q \) the vector that gives the state of \( P_q \) at each time-slot starting with time-slot 0.

\( P_q \) computes a task in \( w_q \) time-slots if it remains UP. If \( w_q = w \) for each processor \( P_q \), then the processors are homogeneous. The master has network bandwidth \( BW \) and communicates with a worker with bandwidth \( bw \), meaning that we assume same capacity links from the master to each worker. In this work we equate bandwidth with data transfer rate, acknowledging that in practice the data transfer rate is a fraction of the physical bandwidth. Let \( n_{prog} \) be the number of workers receiving the program at time \( t \), and let \( n_{data} \) be the number of workers receiving the input data of a task at time \( t \). The constraint on the master’s bandwidth is simply written as \( n_{prog} + n_{data} \leq n_{com} = \lfloor BW / bw \rfloor \).

Indeed, consider a worker on processor \( P_q \) that is communicating at time \( t \). Either \( P_q \) is receiving the program, or it is receiving data for a task. In both cases, it does this at data transfer rate \( bw \). Overall, the master can execute only a limited number \( n_{com} \) of such communications simultaneously. The time for a worker to receive the program is \( T_{prog} = V_{prog} / bw \), and the time to receive the data is \( T_{data} = V_{data} / bw \). For simplicity we assume that \( T_{prog} \) and \( T_{data} \) consist of integral numbers of time-slots. We also assume that the master is always UP, which can be enforced by using, for instance, two dedicated servers with a primary backup mechanism.
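As a small illustration of the bandwidth constraint, the following sketch computes \( n_{com} \), \( T_{prog} \) and \( T_{data} \) from the platform parameters. The function names and the numerical values are assumptions chosen for illustration; they are not taken from the simulator used in this work.

```python
def comm_parameters(BW, bw, V_prog, V_data):
    """Return (n_com, T_prog, T_data) under the bounded multi-port model."""
    n_com = int(BW // bw)      # floor(BW / bw): max simultaneous transfers
    T_prog = V_prog / bw       # time-slots needed to receive the program
    T_data = V_data / bw       # time-slots needed to receive one task's data
    return n_com, T_prog, T_data


def bandwidth_ok(n_prog, n_data, n_com):
    """Check the master's constraint n_prog + n_data <= n_com."""
    return n_prog + n_data <= n_com


# Hypothetical values (bytes per time-slot, and bytes), for illustration only.
n_com, T_prog, T_data = comm_parameters(BW=100.0, bw=40.0, V_prog=80.0, V_data=40.0)
```

With these hypothetical values the master can serve two workers at a time, so a third simultaneous transfer would violate the constraint.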

C. Application execution model

Let \( config(t) \) denote the set of workers enrolled by the master, or configuration, at time \( t \). The configuration is determined by an application scheduler, and in this work we propose algorithms to be used by this scheduler. To complete an iteration, enrolled workers must progress concurrently throughout the computations. One worker may be assigned several tasks and execute them concurrently if it has enough memory to do so. Formally, we define for each worker \( P_q \) a bound \( \mu_q \) on the maximum number of tasks that it can execute concurrently. We assume that \( \sum_{q=1}^{p} \mu_q \geq m \), otherwise the configuration cannot execute the application. The \( m \) tasks are mapped onto \( k \leq m \) workers. Each enrolled worker \( P_q \) is assigned \( x_q \) tasks, where \( \sum_{q=1}^{p} x_q = m \).

To be able to compute their tasks, the \( k \) enrolled workers must have received the application program and all necessary data. More precisely:

- Each enrolled worker \( P_q \) must receive the program, unless it has received it at some previous time and has not been DOWN since then.
- In addition, each worker \( P_q \) must receive \( x_q \) data messages (one per task) from the master. Suppose that since the beginning of the current iteration \( P_q \) has received \( x'_q \) data messages. At least \( (x_q - x'_q)T_{data} \) time-slots are needed for this communication, and likely more since the master can be engaged in at most \( n_{com} \) concurrent communications.

Overall, the computation can start at a time \( t \) only if each of the \( k \) enrolled workers is in the UP state, has the program, has the data of all its allocated tasks, and has never been in the DOWN state since receiving these messages. Because tasks must proceed in lock-step, the execution goes at the pace of the slowest worker. Hence the computation of an iteration requires \( \max_q(x_qw_q) \) time-slots of concurrent computations (not necessarily consecutive, due to workers possibly being reclaimed). Consider the interval of time between time \( t_1 \) and time \( t_2 = t_1 + \max_q(x_qw_q) + t' - 1 \) for some \( t' \). For the iteration to be successfully completed by time \( t_2 \), between \( t_1 \) and \( t_2 \) there must be \( \max_q(x_qw_q) \) time-slots for which all enrolled workers are simultaneously UP, and there may be \( t' \) time-slots during which one or more workers are RECLAIMED.
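The computation-phase rule above can be sketched as follows. This is a minimal illustration under stated assumptions: worker states are encoded as strings over `u`, `r`, `d` (one character per time-slot), and the function name and the example traces are hypothetical.

```python
def computation_end(states, x, w, t1):
    """Return the time-slot at which the computation phase completes.

    states[q]: string over 'u' (UP), 'r' (RECLAIMED), 'd' (DOWN), one char per slot.
    x[q]: number of tasks assigned to worker q; w[q]: slots per task on worker q.
    Returns None if an enrolled worker goes DOWN first or the traces run out.
    """
    needed = max(x[q] * w[q] for q in x)      # pace of the slowest worker
    horizon = min(len(states[q]) for q in x)
    done = 0
    for t in range(t1, horizon):
        slot = [states[q][t] for q in x]
        if "d" in slot:                        # tight coupling: everything is lost
            return None
        if all(s == "u" for s in slot):        # progress only when all are UP
            done += 1
            if done == needed:
                return t
    return None


# Hypothetical traces for two enrolled workers.
states = {2: "uuurruuu", 3: "uuuuruuu"}
x = {2: 2, 3: 1}
w = {2: 2, 3: 3}
t2 = computation_end(states, x, w, t1=0)
```

Here \( \max_q(x_qw_q) = 4 \) and two time-slots are lost to reclaimed workers, so the phase ends at \( t_2 = 0 + 4 + 2 - 1 = 5 \), in agreement with the expression for \( t_2 \) above.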

The scheduler may choose a new configuration at each time \( t \). If at least one worker in \( config(t) \) becomes DOWN, the scheduler must select another configuration and restart the iteration from scratch. Even if all workers in \( config(t) \) are UP, the scheduler may decide to change the configuration because more desirable (i.e., faster, more reliable) workers have become available. Let \( P_q \) be a newly enrolled worker at that point, i.e., \( P_q \in config(t+1) \setminus config(t) \). \( P_q \) needs to receive the program unless it already has a copy of it and has not been DOWN since receiving it. In all cases, \( P_q \) needs to receive task data, i.e., \( x_q \) messages of \( V_{data} \) bytes. This holds true even if \( P_q \) had been enrolled at time \( t' < t \) but was un-enrolled since then. In other words, any interrupted communication must be resumed from scratch if the worker became DOWN or was removed from the configuration.

An example iteration execution with \( m = 5 \) tasks and \( p = 5 \) processors is shown in Figure 1. For this example the processors are heterogeneous with \( \forall 1 \leq i \leq 5, w_i = i, n_{com} = 2, T_{prog} = 2, \) and \( T_{data} = 1 \).
Figure 1: Example iteration execution.

The communication phase of this iteration is executed between time 1 and time 7. At time 1, the 3 processors selected can receive data, but because of the bandwidth constraint P4 remains idle during the first 3 time-slots. Processor P3 is temporarily reclaimed after it has downloaded the application program. Due to the bandwidth constraint, processors are idle while others download task data (e.g., P3 is idle between time 4 and time 5). At time 7, all 3 processors have downloaded the application program and the data for all tasks assigned to them. They all begin computing. At time 10, P2 is temporarily reclaimed and the computation is suspended with half of the computation of each task completed. When P2 becomes available again, at time 12, P3 has been reclaimed and the computation cannot resume immediately. P3 becomes available again at time 13, P2 and P4 are UP, and the computation continues. If a processor had become DOWN, say, at time 14, all the computation would have been lost and the communication phase would have been restarted from scratch. At time 16, the processors synchronize and a new iteration can start. At that time processors P1 and P5 may be included in the configuration.

IV. OFF-LINE COMPLEXITY

The scheduling problem is to maximize the expected number of completed application iterations before time N, where N is a specified deadline. In this section, we assess the complexity of the off-line version of this problem, assuming full knowledge of future worker states. In other words, S_q[j] is known for 1 ≤ q ≤ p and 1 ≤ j ≤ N. We show that the simplest off-line and deterministic versions of the problem are NP-hard.

Fixed number of workers: Consider the problem with no communications (Tprog = Tdata = 0), and identical workers with w_q = w and µ_q = µ = 1. m workers must be enrolled to complete an iteration. The problem reduces to finding w time-slots such that there exist m workers that are simultaneously UP during all these w time-slots. We call this version of the problem Off-Line-Coupled (µ = 1).

Flexible number of workers: Consider the problem with no communications (Tprog = Tdata = 0), and identical processors with w_q = w and µ_q = µ = +∞ (in fact µ = m is sufficient). The problem is less constrained than Off-Line-Coupled (µ = 1). Either one finds m processors that are simultaneously UP during w time-slots, or one finds ⌈m/2⌉ workers that are simultaneously UP during 2w time-slots, or one finds ⌈m/3⌉ workers that are simultaneously UP during 3w time-slots, and so on. We call this version of the problem Off-Line-Coupled (µ = +∞).
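To make the trade-off in the flexible case concrete, the sketch below enumerates, for each number j of tasks per worker, how many workers are needed and for how many time-slots they must be simultaneously UP. The function name and the values of m and w are illustrative assumptions.

```python
from math import ceil


def flexible_options(m, w):
    """List (tasks per worker j, workers needed, time-slots needed) for j = 1..m."""
    return [(j, ceil(m / j), j * w) for j in range(1, m + 1)]


# Hypothetical instance: m = 5 tasks, w = 2 time-slots per task.
options = flexible_options(m=5, w=2)
```

Each tuple is one candidate shape for a solution; the off-line problem asks whether the known availability vectors admit at least one of them.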

Theorem 4.1: Problems Off-Line-Coupled (µ = 1) and Off-Line-Coupled (µ = ∞) are NP-hard. The proof is provided in Section A of the appendix.

V. ANALYTICAL APPROXIMATIONS

In this section, we compute the expectation of the time needed by a configuration to compute a given workload conditioned on this computation being successful (i.e., with no worker becoming DOWN), as well as the probability of success. Intuitively, these quantities seem relevant for developing scheduling heuristics that account for the need for workers to be UP simultaneously and for workers that can become temporarily RECLAIMED. To compute the above expectation and probability, we introduce a Markov model of processor availability. The availability of processor P_q is described by a 3-state recurrent aperiodic Markov chain, defined by 9 probabilities: P_{i,j}^{(q)}, with i,j ∈ {u,r,d}, is the probability for P_q to move from state i at time-slot t to state j at time-slot t+1, which does not depend on t.
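A minimal sketch of such a 3-state chain is given below, assuming the 9 transition probabilities of one processor are stored in a nested dictionary. The data layout, the function name and the example matrix are assumptions made for illustration.

```python
import random

STATES = ("u", "r", "d")


def simulate_availability(P, n_slots, start="u", rng=None):
    """Draw a state trace of length n_slots from a 3-state Markov chain.

    P[i][j] is the probability of moving from state i to state j in one time-slot.
    """
    rng = rng or random.Random()
    state = start
    trace = [state]
    for _ in range(n_slots - 1):
        weights = [P[state][nxt] for nxt in STATES]
        state = rng.choices(STATES, weights=weights)[0]
        trace.append(state)
    return trace


# Degenerate (deterministic) chain, used only to show the mechanics: u -> r -> d -> u.
P_cycle = {
    "u": {"u": 0.0, "r": 1.0, "d": 0.0},
    "r": {"u": 0.0, "r": 0.0, "d": 1.0},
    "d": {"u": 1.0, "r": 0.0, "d": 0.0},
}
trace = simulate_availability(P_cycle, n_slots=4)
```

Because the transition probabilities do not depend on t, the next state is drawn from the row of the current state only.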

We are aware that the Markov, i.e., memory-less, assumption for processor availability does not hold in practice. For instance, several authors have observed that the duration of availability intervals in production desktop grids is often far from being exponentially distributed for a 2-state scenario in which processors
are either UP or DOWN [10], [26], [11], [12], [20]. Unfortunately, there is no consensus in the literature on a realistic model, even though some of these studies may suggest semi-Markov models with approximately Weibull or Log-Normal holding times. Deriving a realistic 3-state statistical model of processor availability is thus an open research question that is outside the scope of this work. Instead, we opt for a Markov model because it is simple and lends itself to tractable analysis. This model gives us a framework in which to design heuristics that trade off worker speed for reliability. Furthermore, it allows us to evaluate our trade-off approaches in “laboratory conditions.”

A. Probability of success and expected duration of a computation

Consider a set \( S \) of workers all in the UP state at time 0. This set is assigned a workload that requires \( W \) time-slots of simultaneous computation. To complete this workload successfully, all the workers in \( S \) must be simultaneously UP during another \( W - 1 \) time-slots. They can possibly become RECLAIMED (thereby temporarily suspending the execution) but must never become DOWN in between. What is the probability of the workload being completed? And, if it is successfully completed, what is the expectation of the number of time-slots until completion?

Definition 1: Knowing that all processors in a set \( S \) are UP at time-slot \( t_1 \), let \( P_{+}^{S} \) be the conditional probability that they will all be UP simultaneously at a later time-slot, without any of them going to the DOWN state in between. Formally, knowing that \( \forall P_q \in S, S_q[t_1] = u, P_{+}^{S} \) is the conditional probability that there exists a time \( t_2 > t_1 \) such that \( \forall P_q \in S, S_q[t_2] = u \) and \( S_q[t] \neq d \) for \( t_1 < t < t_2 \).

Definition 2: Let \( E^{(S)}(W) \) be the conditional expectation of the number of time-slots required by a set of processors \( S \) to complete a workload of size \( W \) knowing that all processors in \( S \) are UP at the current time-slot \( t_1 \) and none will become DOWN before completing this workload. Formally, knowing that, for all \( P_q \in S \), \( S_q[t_1] = u \), and that there exist \( W - 1 \) time-slots \( t_2 < t_3 < \cdots < t_W \), with \( t_1 < t_2 \), \( S_q[t_i] = u \) for \( i \in [2, W] \), and \( S_q[t] \neq d \) for \( t \in [t_1, t_W] \), \( E^{(S)}(W) \) is the expectation of \( t_W - t_1 + 1 \) conditioned on success.

Theorem 5.1: It is possible to approximate the values of \( P_{+}^{S} \) and \( E^{(S)}(W) \) numerically up to an arbitrary precision \( \varepsilon \) in fully polynomial time.

Proof Sketch. Consider a set \( S \) of processors, all available at time-slot 0. Let \( P_u^{(q)}(t) \) be the probability that a processor \( P_q \) that was UP at time 0 is UP again at time \( t \), without having been DOWN in between, and let \( P_u^{S}(t) = \prod_{P_q \in S} P_u^{(q)}(t) \) be the product of all such probabilities for the processors in \( S \). Let \( E_u(S) = \sum_{t > 0} P_u^{S}(t) \). \( E_u(S) \) can be approximated to a precision \( \varepsilon \) in time polynomial in \( 1/\varepsilon \). Then, \( P_{+}^{S} = \frac{E_u(S)}{1 + E_u(S)} \) if, in set \( S \), at least one processor has a nonzero probability of going DOWN, and \( P_{+}^{S} = 1 \) otherwise. Let \( A(S) = \sum_{t > 0} t \times P_u^{S}(t) \). \( A(S) \) can be approximated to a precision \( \varepsilon \) in time polynomial in \( 1/\varepsilon \). Then, \( E^{(S)}(W) = 1 + (W - 1) \times \frac{A(S)}{E_u(S) \, (1 + E_u(S))} \). The complete proof is provided in the appendix. \( \square \)
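The series in the proof sketch can be evaluated by simple truncation. The sketch below is an illustration under explicit assumptions: it uses the renewal identities \( P_{+}^{S} = E_u(S) / (1 + E_u(S)) \) and \( E^{(S)}(W) = 1 + (W-1) A(S) / (E_u(S)(1 + E_u(S))) \), which are our reading of the sketch rather than a quotation of the appendix; it truncates the sums at a fixed horizon `t_max` instead of deriving it from \( \varepsilon \); and it assumes that at least one processor can go DOWN, so that the sums converge. All names are hypothetical.

```python
def up_without_down(P, t_max):
    """u[t] = Pr(UP at t and never DOWN in 1..t | UP at 0), for t = 0..t_max."""
    up, rec = 1.0, 0.0          # mass in UP / RECLAIMED with no DOWN so far
    out = [1.0]
    for _ in range(t_max):
        up, rec = (up * P["u"]["u"] + rec * P["r"]["u"],
                   up * P["u"]["r"] + rec * P["r"]["r"])
        out.append(up)
    return out


def approx_success_and_duration(chains, W, t_max=2000):
    """Return (P_plus, E_W) for a set of independent processors (truncated series)."""
    joint = [1.0] * (t_max + 1)
    for P in chains:            # independence: joint probability is a product
        joint = [a * b for a, b in zip(joint, up_without_down(P, t_max))]
    E_u = sum(joint[1:])
    A = sum(t * joint[t] for t in range(1, t_max + 1))
    p_plus = E_u / (1.0 + E_u)
    expected = 1.0 + (W - 1) * A / (E_u * (1.0 + E_u))
    return p_plus, expected


# Hypothetical single-processor chain (only the rows for 'u' and 'r' are needed).
P1 = {"u": {"u": 0.9, "r": 0.05, "d": 0.05},
      "r": {"u": 0.1, "r": 0.8, "d": 0.1}}
p_plus, e_w = approx_success_and_duration([P1], W=3)
```

For a single processor the result can be checked against the direct computation \( P_{u,u} + P_{u,r}P_{r,u}/(1 - P_{r,r}) = 0.925 \), which the truncated series reproduces.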

B. Probability of success and expected duration of a communication

The previous section gave approximations for the probability of success and conditional expected duration for computations. Unfortunately, similar approximations cannot be obtained for communications due to the complexity added by the \( n_{com} \) constraint. Instead, we resort to a coarser approximation as explained hereafter. Let \( S \) be a set of enrolled workers. For worker \( P_q \in S \), let \( n_q \) be the number of time-slots of communication needed to receive the application program and all the data of its allocated tasks. Suppose first that \( |S| \leq n_{com} \). In this case, the expected communication time on worker \( P_q \), \( E_q \), can be estimated precisely by reusing the result of the previous section: \( E_q = E^{(P_q)}(n_q) \). We then estimate the expected communication time of the current configuration as \( E_{comm}^{(S)} = \max_{P_q \in S} \{ E^{(P_q)}(n_q) \} \). In the case \( |S| > n_{com} \), obtaining an estimate close to the actual expected communication time seems out of reach. Instead, we use a coarser estimation: \( E_{comm}^{(S)} = \max \left\{ \frac{\sum_{P_q \in S} E^{(P_q)}(n_q)}{n_{com}} \right\} \).

Let \( P_{NU}^{(P_q)}(t) \) denote the probability that worker \( P_q \), which was UP at time \( t' \), does not become DOWN between time \( t' \) and time \( t' + t \). The probability of success is then estimated as \( P_{comm}^{(S)} = \prod_{P_q \in S} P_{NU}^{(P_q)}(t) \). Note that \( E_{comm}^{(S)} \) does not take into account the time needed after the end of all communications for all workers to be UP simultaneously. The probability of success of an iteration is estimated by multiplying the probability of success of the communications and the probability of success of the computations.

VI. ON-LINE HEURISTICS

We propose heuristics for solving the on-line version of our scheduling problem, i.e., assuming no knowledge of future processor states. Conceptually, we distinguish between two classes of heuristics. **Passive heuristics** conservatively keep current processors active as long as possible. In other words, the current configuration is changed only when one of the enrolled processors becomes DOWN. In this case, all previously executed work is lost. However, a worker that has not become DOWN but has already received task data can reuse that data if the scheduler reassigns tasks to it. **Proactive heuristics** allow for a complete reconfiguration even if no worker fails, possibly aborting ongoing computation if a better configuration is found. This makes it possible for an iteration to never complete. A criterion must thus be derived to decide whether and when such an aggressive reconfiguration is worthwhile. Our proactive heuristics are defined by a pair (criterion, passive heuristic). When a new configuration is computed using the heuristic, it is compared to the current configuration according to the criterion. If the new configuration is better than the current one, then it is launched, leading to new communications and task allocations. Otherwise, the execution continues with the current configuration for an additional time-slot.

We also include results for a baseline **Random heuristic** that allocates tasks to UP processors randomly using a uniform distribution.

**A. Passive heuristics**

Passive heuristics assign tasks to workers, which must be in the UP state, one by one until m tasks are assigned. Each task is assigned to a worker according to a criterion that defines the heuristic. As described hereafter, we consider four different criteria: probability of success, expected completion time, estimated yield, and estimated apparent yield.

- **IP (Incremental: Probability of success)** – This heuristic attempts to find configurations with high probability of success. The next task is assigned to the worker such that the probability of success of all currently assigned tasks (including the new one) is maximized. More precisely, consider the set $S$ of workers with at least one task already assigned. For each worker $P_q$, either in $S$ or not, we compute the probability $P^{(S)}(q)$ of success of the communication and the computation if the additional task is assigned to $P_q$, using the results of Section V: $P^{(S)}(q) = P^{(S \cup \{P_q\})}(W_q) \times P_{comm}^{(S \cup \{P_q\})}$, with $W_q$ the maximal load in $S \cup \{P_q\}$ with an additional task on $P_q$. We assign the next task to worker $P_{q_0}$, with $q_0 = \text{ArgMax}_q \{P^{(S)}(q)\}$.

- **IE (Incremental: Expected completion time)** – This heuristic attempts to find fast configurations, without considering reliability. The next task is assigned to the worker that minimizes the expected execution time of the iteration. More precisely, consider the set $S$ of workers with at least one task already assigned. For each worker $P_q$, either in $S$ or not, we compute the expected communication time $E_{comm}^{(S \cup \{P_q\})}$ and the expected computation time $E^{(S \cup \{P_q\})}(W_q)$ with an additional task on $P_q$. We obtain the expected duration of the iteration $E^{(S)}(q) = E_{comm}^{(S \cup \{P_q\})} + E^{(S \cup \{P_q\})}(W_q)$. We assign the next task to worker $P_{q_0}$, with $q_0 = \text{ArgMin}_q \{E^{(S)}(q)\}$.

- **IY (Incremental: Expected yield)** – This heuristic assigns the next task to the worker that maximizes the yield of the configuration. The yield is the expected value of the inverse of the execution time of the current iteration, which we estimate as follows. For a given configuration with probability of success $P$ and expected completion time $E$ for an iteration that has already been running for $t$ time-slots, the yield is estimated as $Y = \frac{P}{t + E}$. Intuitively, we expect the yield to achieve a trade-off between reliability (probability of success) and execution speed. Consider the set $S$ of workers with at least one task already assigned. For each processor $P_q$, either in $S$ or not, we compute the expected yield with an additional task on $P_q$: let $P^{(S)}(q)$ be the probability computed for heuristic IP, $E^{(S)}(q)$ be the expected completion time computed for heuristic IE, and $t$ be the time spent since the beginning of the current iteration. We assign the next task to worker $P_{q_0}$, with $q_0 = \text{ArgMax}_q \left\{\frac{P^{(S)}(q)}{t + E^{(S)}(q)}\right\}$.
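The incremental heuristics above share the same greedy skeleton and differ only in the quantity that is optimized. A minimal sketch of that skeleton follows. The `score` callback, the capacity dictionary `mu` and the stand-in criterion used in the example (the pure compute time $\max_q x_q w_q$, which ignores availability and communications) are assumptions for illustration; they are not the estimates of Section V.

```python
def incremental_assign(workers, m, mu, score, maximize=True):
    """Assign m tasks one by one, each to the worker giving the best score.

    workers: iterable of UP worker ids; mu[q]: max tasks worker q can hold.
    score(alloc): value of a tentative allocation {worker id: number of tasks}.
    """
    alloc = {}
    for _ in range(m):
        best_q, best_val = None, None
        for q in workers:
            if alloc.get(q, 0) >= mu[q]:
                continue                      # worker q is full
            trial = dict(alloc)
            trial[q] = trial.get(q, 0) + 1    # tentatively add one task on q
            val = score(trial)
            better = best_val is None or (val > best_val if maximize else val < best_val)
            if better:
                best_q, best_val = q, val
        if best_q is None:
            raise ValueError("not enough capacity for m tasks")
        alloc[best_q] = alloc.get(best_q, 0) + 1
    return alloc


# Stand-in criterion for the example: compute time of the slowest worker.
w = {1: 1, 2: 2, 3: 3}
mu = {1: 5, 2: 5, 3: 5}
alloc = incremental_assign([1, 2, 3], m=5, mu=mu,
                           score=lambda a: max(a[q] * w[q] for q in a),
                           maximize=False)
```

Plugging in an estimate of the probability of success (maximized), of the expected completion time (minimized) or of the yield (maximized) as `score` would give IP-, IE- and IY-like variants, respectively.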

**B. Proactive heuristics**

Our proactive heuristics are designed as follows. Consider an application executing on a platform using a passive heuristic $H$ and criterion $C$ at some time $t$. The configuration $config(t − 1)$ was selected by $H$ at time $t' ≤ t − 1$ because of a configuration change due to a proactive decision, due to a worker becoming DOWN, or due to the beginning of a new iteration. Let $config_1 = config(t') = config(t − 1)$. At time $t'$, the configuration was measured by criterion $C$ with value $c'$. Suppose that by time $t$ no worker in this configuration has failed. Between $t'$ and $t$, some work may have been done: some communications may be in process or completed, and computations may have started. Consequently, the measure of this configuration given by $C$ should be updated to account for the progress between $t'$ and $t$. Let $c$ be the updated value of criterion $C$ for the current configuration. At step $t$, a new configuration is computed from scratch using heuristic $H$, as if no task were allocated to any worker. Let $config_2$ be this new configuration and $c_2$ its measure by $C$. If $c ≥ c_2$, then the current configuration at time $t − 1$ is kept for another time-slot: $config(t) = config_1$. Otherwise, the current configuration is interrupted, and the new configuration is $config(t) = config_2$.
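One time-step of this decision rule can be sketched as below. The callbacks `build_config` (the passive heuristic $H$ run from scratch) and `criterion` (the measure $C$), as well as the `higher_is_better` flag, are hypothetical names; the flag is an assumption added so that a criterion to be minimized, such as an expected completion time, can be handled by the same comparison.

```python
def proactive_step(current, updated_value, build_config, criterion,
                   higher_is_better=True):
    """Keep the current configuration or switch to a freshly computed one.

    updated_value: criterion value c of the current configuration, updated
    to account for the progress made since it was launched.
    """
    candidate = build_config()            # config_2, computed from scratch by H
    c2 = criterion(candidate)
    if higher_is_better:
        keep = updated_value >= c2        # ties favor the running configuration
    else:
        keep = updated_value <= c2
    return current if keep else candidate


# Toy illustration with made-up criterion values.
kept = proactive_step("config_1", 0.8, lambda: "config_2", lambda cfg: 0.7)
switched = proactive_step("config_1", 0.6, lambda: "config_2", lambda cfg: 0.7)
```

Resolving ties in favor of the running configuration avoids discarding partially completed work when the new configuration is not strictly better.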

For certain criterion choices, a heuristic could diverge and continually change the configuration, even with
workers that are reliably UP. To avoid this divergence, proactive criteria have to respect the following constraint: a given configuration that has been running for \( t + 1 \) time-slots must be better for the proactive criterion than the same configuration running for \( t \) time slots. With this constraint, all possible configurations are ordered by their value for the selected criterion at the beginning of the iteration, and a lower-ranked configuration in this order cannot be chosen to replace the current configuration. As the number of possible configurations is finite, no proactive heuristic can diverge. The four criteria used to define passive heuristics in the previous section meet this constraint. However, \( \text{AY} \) (Apparent Yield) leads to many (unnecessary) configuration changes before converging, while the other criteria should be stable. Hence, for the proactive criterion \( C \), we only retain \( P \) (Probability of success), \( E \) (Expected completion time) and \( Y \) (Expected yield). Any passive heuristic \( H \) can be used as the building block for a proactive heuristic. We thus obtain \( 3 \times 4 \) proactive heuristics named \( C-H \) where \( C \in \{P, E, Y\} \) and \( H \in \{\text{IP, IE, IY, IAY}\} \), plus the RANDOM heuristic.

VII. EXPERIMENTAL EVALUATION

A. Methodology

To evaluate our heuristics we use a discrete-event simulator. (The simulator is publicly available at [http://graal.ens-lyon.fr/~dufosse/changing_platforms.tar.gz](http://graal.ens-lyon.fr/~dufosse/changing_platforms.tar.gz).) The inputs to the simulator are the values of all the parameters listed in Section III. The simulator implements temporal processor availability as Markov processes as described in Section V. All our experiments are for \( p = 20 \) processors. The Markov model for processor availability is defined as follows. For each processor \( P_q \), we pick a random value uniformly distributed between 0.90 and 0.99 for each \( P_{x,x}^{(q)} \) value (for \( x = u, r, d \)). We then set \( P_{x,y}^{(q)} \) to \( 0.5 \times (1 - P_{x,x}^{(q)}) \), for \( x \neq y \). An experiment is defined by the Markov model for each processor and by three parameters: \( m \), the number of tasks per iteration; \( n_{com} \), the constraint on the master’s communication bandwidth; and a third parameter, \( w_{\min} \), defined as follows. For each processor \( P_q \), we pick \( w_q \) uniformly between \( w_{\min} \) and \( 10 \times w_{\min} \). \( T_{\text{data}} \) is set to \( w_{\min} \), meaning that the fastest processor has a computation-communication ratio of 1. \( T_{\text{prog}} \) is set to \( 5 \times w_{\min} \), meaning that downloading the program takes 5 times as much time as downloading the data for a task. We define our experimental space as \((m, n_{com}, w_{\min})\) with \( m \in \{5, 10\} \), \( n_{com} \in \{5, 10, 20\} \), and \( w_{\min} \in \{1, 2, \ldots, 10\} \). For each possible instantiation of \((m, n_{com}, w_{\min})\), we create 10 random experimental scenarios so as to obtain different instantiations of the various random parameters. Then, for each experimental scenario, we run 10 trials where each trial uses a different random number generator seed to produce multiple realizations of the Markov chain transitions.
The total number of generated problem instances is thus \( 2 \times 3 \times 10 \times 10 \times 10 = 6,000 \). We emphasize that our goal here is not to instantiate a representative model for a desktop grid and application, but rather to create arbitrary and simple synthetic experimental scenarios that will make it possible to highlight inherent strengths and weaknesses of our proposed heuristics.
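As an illustration of this setup, the following sketch draws one processor's Markov model and enumerates the experimental space. It is not the simulator's code: all names are hypothetical, and the dictionary layout is an arbitrary choice.

```python
import itertools
import random

STATES = ("u", "r", "d")  # UP, RECLAIMED, DOWN


def random_markov_model(rng: random.Random) -> dict:
    """One processor: P[x][x] uniform in [0.90, 0.99], the rest split evenly."""
    model = {}
    for x in STATES:
        stay = rng.uniform(0.90, 0.99)
        model[x] = {y: (stay if y == x else 0.5 * (1.0 - stay)) for y in STATES}
    return model


def experimental_space() -> list:
    """All (m, n_com, w_min) triples of the experimental space."""
    return list(itertools.product((5, 10), (5, 10, 20), range(1, 11)))


SCENARIOS_PER_POINT = 10   # random scenarios per (m, n_com, w_min)
TRIALS_PER_SCENARIO = 10   # seeds per scenario
total_instances = len(experimental_space()) * SCENARIOS_PER_POINT * TRIALS_PER_SCENARIO
```

With 60 parameter triples, 10 scenarios each, and 10 trials per scenario, `total_instances` is 6,000, of which 3,000 have $m = 5$ and 3,000 have $m = 10$.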

In all experiments, rather than fixing \( N \), the deadline in number of time-slots, we instead fix the number of iterations to 10. The quality of an application execution is then measured by the time needed to complete 10 iterations, or makespan. This problem is equivalent to the problem of maximizing the number of iterations before a deadline. It is also simpler to instantiate since it does not require choosing a meaningful deadline, which would depend on application and platform characteristics. For some experimental scenarios and some heuristics, the time needed to complete 10 iterations successfully can be extremely large, making it impossible to obtain results in a reasonable amount of time. Consequently, we limit the makespan to 1,000,000 seconds and declare that a heuristic fails if it reaches this limit, and succeeds otherwise.

The makespans vary widely between problem instances depending on processor availability patterns, which requires that we define relative metrics for a sound comparison of our heuristics. As expected, with more tasks (larger \( m \)), the number of failures increases for each heuristic. With \( m = 5 \), no heuristic fails for more than 5 out of the 3,000 problem instances. With \( m = 10 \), heuristics fail for up to 6% of the 3,000 problem instances. In both cases, IE is the most robust: it fails only for 1 of the 3,000 \( m = 5 \) instances, and for 81 of the 3,000 \( m = 10 \) instances. Importantly, whenever heuristic IE fails, all the other heuristics also fail. For this reason we use IE as a reference. For a heuristic \( H \), we compute the relative difference (expressed in percentage) between the makespan that it achieves for a given experimental scenario (averaged over all 10 trials) and the one achieved by IE for the same scenario, assuming that heuristic \( H \) succeeds. The relative difference is defined as:

\[
\frac{\text{makespan}_H - \text{makespan}_{IE}}{\min(\text{makespan}_H, \text{makespan}_{IE})}
\]

The denominator is always the makespan of the best performing heuristic, so as to allow consistent comparisons. We denote this metric by \( \% \text{diff} \). We also report the fraction of trials where heuristic \( H \) obtains a makespan no larger than that of IE (denoted by \( \%\text{wins} \)), and the fraction of trials where heuristic \( H \) obtains a makespan that does not exceed that of IE by more than 30% (denoted by \( \%\text{wins}_{30} \)). Finally, we report the standard deviation over all scenarios (column \( \%\text{stdv} \)).
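These metrics can be computed as in the sketch below. The function names are hypothetical, and reading %wins as "makespan no larger than that of IE" is an assumption inferred from the fact that IE itself scores 100% in the tables.

```python
from typing import Iterable, Tuple


def relative_diff(makespan_h: float, makespan_ie: float) -> float:
    """%diff: signed difference relative to the better (smaller) makespan."""
    return 100.0 * (makespan_h - makespan_ie) / min(makespan_h, makespan_ie)


def win_rates(pairs: Iterable[Tuple[float, float]], slack: float = 0.30) -> Tuple[float, float]:
    """Return (%wins, %wins_slack) over (makespan_H, makespan_IE) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    wins = sum(1 for h, ie in pairs if h <= ie)                     # assumed %wins definition
    wins_slack = sum(1 for h, ie in pairs if h <= (1.0 + slack) * ie)
    return 100.0 * wins / n, 100.0 * wins_slack / n
```

Dividing by the smaller of the two makespans makes the metric symmetric: a heuristic that is twice as fast as IE scores -100%, and one that is twice as slow scores +100%.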

Table I

<table>
<thead>
<tr>
<th>Heuristic</th>
<th>#fails</th>
<th>$\%\text{diff}$</th>
<th>$\%\text{wins}$</th>
<th>$\%\text{wins}_{30}$</th>
<th>$\%\text{stdv}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Y-IE</td>
<td>2</td>
<td>-11.82</td>
<td>72.58</td>
<td>92.09</td>
<td>0.42</td>
</tr>
<tr>
<td>P-IE</td>
<td>2</td>
<td>-10.58</td>
<td>70.98</td>
<td>91.19</td>
<td>0.44</td>
</tr>
<tr>
<td>E-IAY</td>
<td>4</td>
<td>-10.40</td>
<td>64.75</td>
<td>85.15</td>
<td>0.77</td>
</tr>
<tr>
<td>E-IY</td>
<td>4</td>
<td>-3.40</td>
<td>59.91</td>
<td>81.64</td>
<td>0.80</td>
</tr>
<tr>
<td>IE</td>
<td>1</td>
<td>0.00</td>
<td>100.00</td>
<td>100.00</td>
<td>0.00</td>
</tr>
<tr>
<td>IAY</td>
<td>2</td>
<td>13.59</td>
<td>51.07</td>
<td>76.42</td>
<td>1.93</td>
</tr>
<tr>
<td>E-IP</td>
<td>4</td>
<td>19.35</td>
<td>47.73</td>
<td>69.69</td>
<td>0.98</td>
</tr>
<tr>
<td>IY</td>
<td>2</td>
<td>24.22</td>
<td>45.26</td>
<td>70.85</td>
<td>1.96</td>
</tr>
<tr>
<td>IP</td>
<td>2</td>
<td>52.03</td>
<td>34.79</td>
<td>58.54</td>
<td>2.11</td>
</tr>
<tr>
<td>E-IE</td>
<td>5</td>
<td>53.93</td>
<td>39.57</td>
<td>64.51</td>
<td>2.57</td>
</tr>
<tr>
<td>Y-IAY</td>
<td>3</td>
<td>99.75</td>
<td>53.89</td>
<td>70.77</td>
<td>5.55</td>
</tr>
<tr>
<td>Y-IY</td>
<td>3</td>
<td>113.01</td>
<td>49.22</td>
<td>66.80</td>
<td>5.73</td>
</tr>
<tr>
<td>P-IAY</td>
<td>3</td>
<td>125.27</td>
<td>50.28</td>
<td>67.33</td>
<td>6.08</td>
</tr>
<tr>
<td>Y-IP</td>
<td>2</td>
<td>145.05</td>
<td>38.56</td>
<td>55.54</td>
<td>5.90</td>
</tr>
<tr>
<td>P-IY</td>
<td>3</td>
<td>145.78</td>
<td>42.54</td>
<td>59.66</td>
<td>6.22</td>
</tr>
<tr>
<td>P-IP</td>
<td>2</td>
<td>176.92</td>
<td>36.92</td>
<td>52.00</td>
<td>6.61</td>
</tr>
<tr>
<td>RANDOM</td>
<td>0</td>
<td>2124.42</td>
<td>0.00</td>
<td>0.20</td>
<td>22.54</td>
</tr>
</tbody>
</table>

Table I shows results for $m = 5$ tasks, with heuristics sorted by increasing $\%\text{diff}$. The number of failures for each heuristic is shown in the #fails column of the table and is at most 5 (recall that IE, the reference heuristic, fails for only 1 instance). Consequently, although some heuristics fail on some scenarios, these failures do not have a large impact on our results.

These results show the efficiency of all our heuristics when compared with the RANDOM heuristic. RANDOM is on average more than 20 times worse than IE while all other heuristics have a $\%\text{diff}$ less than 200%. As seen in the table, only 4 heuristics lead to a $\%\text{diff}$ value lower than that obtained by IE, with 3 of these heuristics more than 10 points lower. These 4 heuristics are all proactive. We conclude that the best proactive heuristics are significantly better than the best passive heuristics. Several observations can be made on the results in the table. A first one is that using the yield as a heuristic or a criterion is better than using the probability of success. In other terms, heuristic C-IY is better than heuristic C-IP, and heuristic Y-H is marginally better than P-H (in this case an inspection of the simulation traces shows that Y-H and P-H lead to mostly identical executions). Everything else being equal, considering the yield is better than considering the probability of success because it accounts for the hosts' compute speeds. A second observation is that heuristic C-IAY is better than heuristic C-IY, thus confirming that the "apparent yield" has merit and is a direct improvement of the yield metric. Anecdotally, while E-IY and E-IAY obtain similar results for $\%\text{wins}$ and $\%\text{wins}_{30}$, E-IAY is significantly better than E-IY on average. A third observation is that although Y-IE and P-IE lead to good results, all other proactive heuristics with the same criteria (i.e., Y-H and P-H) rank last, with $\%\text{diff}$ values of roughly 100% or more. Finally, the key observation is that the best heuristics (the top 5 heuristics, including the reference IE) all account for expected execution time either as a criterion for selecting a new configuration (E-IAY, E-IY) or as a host selection mechanism (Y-IE, P-IE, IE). A seemingly sensible expectation is thus that E-IE would be very efficient.
But instead, E-IE leads to poor results, with a makespan that is on average more than 50% longer than that of IE, the fourth lowest $\%\text{wins}$ value and the fifth lowest $\%\text{wins}_{30}$ value. The reason for these poor aggregate results is that E-IE leads to inefficient schedules for problem instances in which the fastest processor is unreliable.

B. Results for $m = 10$

Table II

<table>
<thead>
<tr>
<th>Heuristic</th>
<th>#fails</th>
<th>$\%\text{diff}$</th>
<th>$\%\text{wins}$</th>
<th>$\%\text{wins}_{30}$</th>
<th>$\%\text{stdv}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Y-IE</td>
<td>141</td>
<td>-10.33</td>
<td>71.35</td>
<td>88.42</td>
<td>0.54</td>
</tr>
<tr>
<td>P-IE</td>
<td>141</td>
<td>-8.62</td>
<td>69.64</td>
<td>87.23</td>
<td>0.55</td>
</tr>
<tr>
<td>E-IAY</td>
<td>178</td>
<td>-6.10</td>
<td>66.62</td>
<td>81.93</td>
<td>1.58</td>
</tr>
<tr>
<td>E-IY</td>
<td>176</td>
<td>8.04</td>
<td>61.90</td>
<td>77.87</td>
<td>3.07</td>
</tr>
<tr>
<td>E-IP</td>
<td>168</td>
<td>29.68</td>
<td>55.12</td>
<td>71.86</td>
<td>3.01</td>
</tr>
<tr>
<td>IAY</td>
<td>152</td>
<td>136.65</td>
<td>46.98</td>
<td>69.31</td>
<td>14.76</td>
</tr>
<tr>
<td>IY</td>
<td>152</td>
<td>147.77</td>
<td>42.06</td>
<td>64.47</td>
<td>14.76</td>
</tr>
</tbody>
</table>

In this section we discuss results for $m = 10$ tasks, but only for the reference IE and those 7 heuristics that achieve a $\%\text{diff}$ value below 50% for $m = 5$ (the largest such value being in fact below 25%): Y-IE, P-IE, E-IAY, E-IY, IAY, E-IP and IY. Results are shown in Table II. Only two of these heuristics do not consider expected completion time as a criterion: IAY and IY. These two heuristics rank reasonably high in terms of $\%\text{diff}$ with $m = 5$ tasks, but are over 130% with $m = 10$ tasks. The ranking of the heuristics is almost unchanged when compared to the $m = 5$ results, even if $\%\text{diff}$ values are higher. While for $m = 5$ E-IY leads to a negative $\%\text{diff}$ value, for $m = 10$ this value becomes positive. For $m = 10$ only three heuristics achieve negative $\%\text{diff}$ values: Y-IE, P-IE and E-IAY. With $m = 10$, most heuristics fail for more than 5% of the problem instances. Given that IE is the most robust heuristic, it should come as no surprise that those proactive heuristics that use IE lead to the lowest number of failures. One conclusion from these results is that Y-IE is only slightly less robust than IE (failing on 4.7% of instances as opposed to 2.7% for IE) but leads to significantly better performance: according to Table II, it achieves a $\%\text{diff}$ value below $-10\%$, a makespan no larger than that of IE for more than 71% of the instances, and a makespan more than 30% larger than that of IE in less than 12% of the instances.
Figure 2 shows %diff values versus $w_{\text{min}}$ for $m = 10$ tasks. $w_{\text{min}}$ is a synthetic parameter we have defined to instantiate our problem instances. Essentially, a larger $w_{\text{min}}$ value means longer tasks and longer data transfers, thus leading to a more “difficult” instance. These results show that Y-IE is the best or close to the best heuristic up to $w_{\text{min}} \approx 8$. For large values of $w_{\text{min}}$ it is outperformed by several other heuristics, such as P-IE, but also by the reference heuristic IE. IE is the best option for large values of $w_{\text{min}}$! An intuitive explanation is that when the instance is difficult, meaning that the probability of success is low due to long computations and communications, a good way to obtain a short makespan is to try to find the fastest workers and “hope for the best.” When looking at the whole $w_{\text{min}}$ range, P-IE appears to be a good alternative to Y-IE. For low $w_{\text{min}}$ values it outperforms IE significantly, and for large $w_{\text{min}}$ it is outperformed by it only marginally. Recall from Table II that Y-IE and P-IE experience exactly the same number of failures (141 failures out of the 3,000 instances).

**Figure 2. %diff for $m = 10$ tasks vs. $w_{\text{min}}$.**

**VIII. CONCLUSION**

Unlike previous work that has considered loosely-coupled master-worker applications, in this work a single processor failure can have a dramatic effect on application execution. Furthermore, our problem formulation includes a limit on the available bandwidth from the master to the workers and the possibility for processors to be temporarily reclaimed. We have proved the problem to be NP-complete in an off-line setting, i.e., with full knowledge of future processor states. By assuming a Markov model of processor availability, we have proposed polynomial time approximation schemes to compute the expected completion time of a computation and its probability of success. We have then proposed 16 heuristics that are either passive (change the set of enrolled processors only when a processor failure occurs) or proactive (change the set of enrolled processors when a better set is available even if no failure occurs). These heuristics are easily defined as combinations of two among four sensible metrics: probability of success, expected completion time, expected yield and expected apparent yield.

All these heuristics widely outperform a baseline heuristic that allocates tasks to processors randomly. In addition, our simulation results show that four of our 16 heuristics lead to significantly better results than the remaining 12. Passive heuristic IE is the most robust, which is why we have used it as a reference, but it does not lead to the best makespans. Heuristic Y-IE, which attempts to optimize expected execution time while proactively deciding to change the set of enrolled processors based on yield, leads to the best average results. Heuristic P-IE, which changes configuration based on probability of success, leads to more stable performance across our set of experimental problem instances as it is never significantly outperformed by IE. The conclusion is that a proactive heuristic that selects processors to minimize expected execution time and changes configuration based on yield or probability of success is promising.

We have made it plain that the Markov assumption for processor availability is not meant to be representative of real-world platforms. Nevertheless, faced with the lack of an acknowledged and validated model, we have opted for a Markov model. The advantage of this model is that it is simple. It provides us with a tractable framework in which we can not only develop heuristics but also evaluate the merit of heuristic ideas in “laboratory conditions.” If a more realistic model arises, then a next step in this research would be to determine to what extent the principles from our heuristics can be adapted to the new model. Given that the results in Section V rely on the Markov assumption heavily, it is unlikely that similar results would hold in a non-Markovian model. However, it may be possible to develop coarser estimates of our four criteria in the new model so as to design meaningful and effective heuristics. If no such new model arises, then an interesting next step would be to simply build a flawed Markov model based on
real-world processor availability traces, and investigate how “wrong” the Markov heuristics are in a real-world setting when compared to heuristics that have been proposed in previous work that do not rely on sophisticated probabilistic criteria for making scheduling decisions.

REFERENCES


APPENDIX

A. Proof of Theorem 4.1

Proof:

(i) NP-completeness of Off-Line-Coupled (μ = 1)

The decision problem associated with Off-Line-Coupled (μ = 1) is: given a value w and p state vectors $S_q$, can we find m processors that are simultaneously UP during at least w time-steps? This problem clearly belongs to NP: the $m \times w$ sub-matrix is a certificate of polynomial (and even linear) size.

For the completeness, we use a reduction from ENCD, the Exact Node Cardinality Decision problem [27]. Let $I_1$ be an instance of ENCD: given a bipartite graph $G = (V \cup W, E)$ and two integers $a$ and $b$ such that $1 \leq a \leq |V|$ and $1 \leq b \leq |W|$, does there exist a bi-clique with exactly $a$ nodes in $V$ and $b$ nodes in $W$? Recall that a bi-clique $C = U_1 \cup U_2$ is a complete induced sub-graph: $U_1 \subseteq V, U_2 \subseteq W$, and for every $u_1 \in U_1, u_2 \in U_2$, the edge $(u_1, u_2) \in E$.

We construct the following instance $I_2$ of Off-Line-Coupled (μ = 1): we let $p = |V|$ and $N = |W|$. Resource $R_i$ (which corresponds to vertex $v_i \in V$) is UP at time-step $j$ (which corresponds to vertex $w_j \in W$) if and only if $(v_i, w_j) \in E$. Finally we let $m = a$ and $w = b$. The size of the instance $I_2$ is linear in the size of the instance $I_1$. We show that $I_1$ has a solution if and only if $I_2$ does. Suppose first that $I_1$ has a solution $C = U_1 \cup U_2$. We select the corresponding $U_1$ processors and the same $U_2$ time-steps. Because we have a bi-clique, each selected processor is UP at each selected time-step, hence $I_2$ has a solution. Suppose now that $I_2$ has a solution. The corresponding sub-matrix translates into a bi-clique with $a$ nodes in $V$ and $b$ nodes in $W$, hence a solution to $I_1$.
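The construction for μ = 1 can be illustrated on small instances with the sketch below. The function names are hypothetical, vertices are assumed to be numbered from 0, and the checker is an exponential brute force meant only to make the equivalence concrete, not a practical algorithm.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def encd_to_offline(
    num_v: int, num_w: int, edges: Iterable[Tuple[int, int]], a: int, b: int
) -> Tuple[List[List[bool]], int, int]:
    """Build (availability matrix, m, w): processor i is UP at step j iff (v_i, w_j) is an edge."""
    edge_set = set(edges)
    up = [[(i, j) in edge_set for j in range(num_w)] for i in range(num_v)]
    return up, a, b


def has_solution(up: List[List[bool]], m: int, w: int) -> bool:
    """Brute force: are there m processors simultaneously UP during at least w time-steps?"""
    n_steps = len(up[0]) if up else 0
    for procs in combinations(range(len(up)), m):
        common = sum(1 for j in range(n_steps) if all(up[i][j] for i in procs))
        if common >= w:
            return True
    return False
```

Asking for at least $w$ common time-steps is equivalent to asking for exactly $w$, since any $w$ of the common steps form a valid sub-matrix.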

(ii) NP-completeness of Off-Line-Coupled (μ = ∞)

We use the same instance $I_1$ of ENCD as in the first part of this proof. We construct the following instance $I_2$ of Off-Line-Coupled (μ = ∞): we let $p = |V|$ and $N = 2|W| + 1$. Resource $R_i$ (which corresponds to vertex $v_i \in V$) is UP at time-step $j \leq |W|$ (which corresponds to vertex $w_j \in W$) if and only if $(v_i, w_j) \in E$. All processors are UP at each time-step $j$ such that $|W| + 1 \leq j \leq N$. Finally we let $m = a$ and $w = b + |W| + 1$. Intuitively, this amounts to adding $|W| + 1$ new vertices to $W$, each of which is connected to every vertex in $V$.

The size of the instance $I_2$ is linear in the size of the instance $I_1$. We show that $I_1$ has a solution if and only if $I_2$ does. Suppose first that $I_1$ has a solution $C = U_1 \cup U_2$. We select the corresponding $U_1$ processors and the same $U_2$ time-steps, plus the last $|W| + 1$ time-steps. We have $w = b + |W| + 1$, hence $I_2$ has a solution. Suppose now that $I_2$ has a solution. The corresponding sub-matrix translates into a bi-clique with $x$ processors and $y$ time-steps. If $x < m$ then at least one processor executes at least two tasks per iteration, and we need at least $2w$ time-steps to perform an iteration. But $2w > N$, which is a contradiction. Hence $x = m$ and $y = w$. At most $|W| + 1$ of the UP time-steps are greater than $|W|$, hence at least $b$ of them are smaller than or equal to $|W|$: this leads to a solution to $I_1$.

B. Proof of Theorem 5.1

Proof: Consider a set $S$ of processors, all available at time slot 0. Consider the probability $P_+(t)$ that all these processors are simultaneously UP again for the first time at time $t$. This means that for all $0 < t' < t$, there exists at least one processor RECLAIMED at time $t'$. Also, none of the processors in $S$ goes DOWN between 0 and $t$.

Let $P^{(q)}_{u \to u}(t)$ be the probability that a processor $P_q$ that was UP at time 0 is UP again at time $t$, without having been DOWN in between, and let $P^{(S)}_{u \to u}(t) = \prod_{P_q \in S} P^{(q)}_{u \to u}(t)$. For each processor $P_q$, the value $P^{(q)}_{u \to u}(t)$ can be computed by considering its transition matrix raised to the power $t$, knowing that the initial state is UP. We form the product to compute $P^{(S)}_{u \to u}(t)$. We derive that

$$P_+(t) = P^{(S)}_{u \to u}(t) - \sum_{0 < t' < t} P_+(t') \times P^{(S)}_{u \to u}(t - t').$$

The probability $P_+^{(S)}$ that all the processors in $S$ will be simultaneously UP again at some point, before the first failure of any of them, is

$$P_+^{(S)} = \sum_{t > 0} P_+(t) = \sum_{t > 0} P^{(S)}_{u \to u}(t) - \sum_{t > 0} \sum_{0 < t' < t} P_+(t') \times P^{(S)}_{u \to u}(t - t') = \sum_{t > 0} P^{(S)}_{u \to u}(t) - \left( \sum_{t' > 0} P_+(t') \right) \times \left( \sum_{t > 0} P^{(S)}_{u \to u}(t) \right).$$

Let $E_u(S) = \sum_{t > 0} P^{(S)}_{u \to u}(t)$. Suppose that all processors are UP at time slot 0. Let $A_t$ be the random variable that is equal to 1 if all processors are UP at time slot $t$ and no processor goes DOWN in between, and to 0 otherwise. Then $E(A_t) = P^{(S)}_{u \to u}(t)$. By linearity of the expectation, we have $E(\sum_{0 < t' \leq t} A_{t'}) = \sum_{0 < t' \leq t} P^{(S)}_{u \to u}(t')$. Suppose that, in set $S$, at least one processor has a nonzero probability of going DOWN. Then $\sum_{0 < t' \leq t} P^{(S)}_{u \to u}(t')$ converges as $t \to \infty$, and we can conclude that $E(\sum_{t > 0} A_t) = \sum_{t > 0} P^{(S)}_{u \to u}(t)$. Hence $E_u(S)$ is the expected number of time slots with all processors UP, before one of these processors fails. The equation above then reads $P_+^{(S)} = E_u(S) - E_u(S) \times P_+^{(S)}$, from which we derive that $P_+^{(S)} = \frac{E_u(S)}{1 + E_u(S)}$. This holds if, in set $S$, at least one processor has a nonzero probability of going DOWN. Otherwise, $P_+^{(S)} = 1$.
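As a numerical sanity check, the sketch below computes the first-return probabilities from the recursion on $P_+(t)$ and compares their sum with $E_u(S) / (1 + E_u(S))$, the value obtained by solving $P_+^{(S)} = E_u(S) - E_u(S) \times P_+^{(S)}$. The function names and the truncation horizon are assumptions made for illustration; each processor is described by its hypothetical $2 \times 2$ one-step transition matrix restricted to the UP and RECLAIMED states.

```python
from typing import List, Sequence, Tuple

Matrix = Tuple[Tuple[float, float], Tuple[float, float]]


def up_to_up(matrix: Matrix, horizon: int) -> List[float]:
    """Return [P_uu(1), ..., P_uu(horizon)] for one processor, i.e. (M^t)[0][0]."""
    (a, b), (c, d) = matrix
    row = (1.0, 0.0)  # the processor starts in the UP state
    out = []
    for _ in range(horizon):
        row = (row[0] * a + row[1] * c, row[0] * b + row[1] * d)
        out.append(row[0])
    return out


def first_return_probability(matrices: Sequence[Matrix], horizon: int) -> Tuple[float, float]:
    """Return (sum of P_+(t) for t <= horizon, E_u / (1 + E_u)) with E_u truncated likewise."""
    p_uu = [1.0] * horizon  # p_uu[t - 1] holds the joint probability for set S at time t
    for m in matrices:
        p_uu = [x * y for x, y in zip(p_uu, up_to_up(m, horizon))]
    p_plus: List[float] = []
    for t in range(1, horizon + 1):
        conv = sum(p_plus[tp - 1] * p_uu[t - tp - 1] for tp in range(1, t))
        p_plus.append(p_uu[t - 1] - conv)
    e_u = sum(p_uu)
    return sum(p_plus), e_u / (1.0 + e_u)
```

For instance, `first_return_probability([((0.9, 0.05), (0.05, 0.9))], 400)` should return two values that agree up to truncation and rounding error.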

We now consider the expected time \( E^S(W) \) to execute \( W \) time slots of computation, conditioned by the fact that no processor in \( S \) will fail. The first time slot of computation is done at \( t = 0 \). Let \( E^S_c \) be the expected time of the next time slot of computation. Then,

\[
E^S_c = \sum_{t > 0} t \times P_+(t) = \sum_{t > 0} t \times P^S_{u \to u}(t) - \sum_{t > 0} t \times \left( \sum_{0 < t' < t} P_+(t') \times P^S_{u \to u}(t - t') \right) = \sum_{t > 0} t \times P^S_{u \to u}(t) - \sum_{t > 0} \sum_{t' > 0} (t + t') \times P_+(t') \times P^S_{u \to u}(t)
\]

Let \( A(S) = \sum_{t > 0} t \times P^S_{u \to u}(t) \). Then, \( E^S_c = A(S) - E^S_c \times E_u(S) - P^S_+ \times A(S) \), from which we derive that \( E^S_c = \frac{A(S)(1 - P^S_+)}{1 + E_u(S)} \).

We now explain how we numerically approximate the values of \( E_u(S) \) and \( A(S) \). Let \( \varepsilon \) be the desired precision. Consider for some value \( T \) the difference between \( E_u(S) \) and \( \sum_{0 < t < T} P^S_{u \to u}(t) \). We have \( P^S_{u \to u}(t) = \prod_{P_q \in S} P^q_{u \to u}(t) \), with \( P^q_{u \to u}(t) \) the probability that processor \( P_q \), which was UP at time 0, is UP at time \( t \) without having been DOWN. For a processor \( P_q \in S \), let

\[
M_q = \begin{bmatrix} P^q_{u \to u} & P^q_{u \to r} \\ P^q_{r \to u} & P^q_{r \to r} \end{bmatrix}
\]

be its one-step transition matrix restricted to the UP and RECLAIMED states, so that \( P^q_{u \to u}(t) = (M_q^t)[0, 0] \).

With \( \lambda^q_1 \) denoting the largest eigenvalue of \( M_q \), we obtain \( P^q_{u \to u}(t) \leq (\lambda^q_1)^t \). Hence \( P^S_{u \to u}(t) \leq \left( \prod_{P_q \in S} \lambda^q_1 \right)^t \) and \( \sum_{t \geq T} P^S_{u \to u}(t) \leq \left( \prod_{P_q \in S} \lambda^q_1 \right)^T \times \frac{1}{1 - \prod_{P_q \in S} \lambda^q_1} \). Let \( \Lambda = \prod_{P_q \in S} \lambda^q_1 \). We obtain that \( E_u(S) - \sum_{0 < t < T} P^S_{u \to u}(t) \leq \varepsilon \) as soon as \( \frac{\Lambda^T}{1 - \Lambda} \leq \varepsilon \). Thus, we can compute an approximation of \( E_u(S) \) with precision \( \varepsilon \) in polynomial time.

Similarly, we obtain \( A(S) - \sum_{0 < t < T} t \times P^S_{u \to u} \leq \varepsilon \) as soon as \( \Lambda^T \left( \frac{T}{1-\Lambda} + \frac{\Lambda}{(1-\Lambda)^2} \right) \leq \varepsilon \). Therefore \( A(S) \) can be approximated with precision \( \varepsilon \) in polynomial time.
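The truncation scheme can be sketched as follows, assuming that \( \lambda^q_1 \) is the largest eigenvalue of the \( 2 \times 2 \) matrix \( M_q \) and that \( 0 < \Lambda < 1 \). Function names are hypothetical. The sketch picks the smallest \( T \) satisfying the bound for \( A(S) \), which also satisfies the bound for \( E_u(S) \) since \( T \geq 1 \).

```python
import math
from typing import Sequence, Tuple

Matrix = Tuple[Tuple[float, float], Tuple[float, float]]


def largest_eigenvalue(matrix: Matrix) -> float:
    """Largest eigenvalue of a 2x2 matrix with nonnegative entries (closed form)."""
    (a, b), (c, d) = matrix
    return 0.5 * (a + d) + math.sqrt((0.5 * (a - d)) ** 2 + b * c)


def approximate_eu_and_a(matrices: Sequence[Matrix], eps: float) -> Tuple[float, float, int]:
    """Return (E_u estimate, A estimate, T), each estimate within eps of the full series."""
    lam = math.prod(largest_eigenvalue(m) for m in matrices)
    if not 0.0 < lam < 1.0:
        raise ValueError("the bound requires 0 < Lambda < 1")
    T = 1
    while lam ** T * (T / (1.0 - lam) + lam / (1.0 - lam) ** 2) > eps:
        T += 1
    rows = [(1.0, 0.0) for _ in matrices]  # every processor starts UP
    e_u = 0.0
    a_s = 0.0
    for t in range(1, T):  # partial sums over 0 < t < T
        joint = 1.0
        for k, ((a, b), (c, d)) in enumerate(matrices):
            r0, r1 = rows[k]
            rows[k] = (r0 * a + r1 * c, r0 * b + r1 * d)
            joint *= rows[k][0]
        e_u += joint
        a_s += t * joint
    return e_u, a_s, T
```

Because \( \Lambda^T \) decays geometrically, the required \( T \) grows only logarithmically in \( 1/\varepsilon \) for a fixed \( \Lambda \).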