Efficient Algorithms for Residual Connectedness Reliability of 
the Distributed Systems

Zhongshi He*   and    Yinong Chen
Highly Dependable Systems Research Programme
University of the Witwatersrand , Johannesburg, South Africa
{zhe, yinong }@cs.wits.ac.za
http://www.cs.wits.ac.za/research/programme.html



Abstract

This paper discusses the reliability of a distributed system in which the links are considered reliable while the computing nodes may fail with certain probability. We need to find a subset of nodes in the system to run replicas of a critical task so that the task can survive as long as possible. The system configuration can be changed due to nodes' failures and task reallocation (system reconfiguration) has to be performed after a number of nodes become faulty. The task reallocation problem is converted into a well-known graph theory problem and the reliability of a task is modelled by the residual connectedness reliability R(G) of a graph G. To perform optimal task reallocation, we need to calculate R(G) after a number of nodes become faulty. This paper proposes a new approach with efficient algorithms for evaluating R(G). Numerical and graphic results are presented.
Key words. network reliability, bounds, connectedness, distributed system, hypercube.

1. Introduction
An ongoing project of the Highly Dependable Systems Research Programme (PHDS) at the University of the Witwatersrand is the development of a dependable distributed system and its application to the Internet. The current research related to this project includes design, specification, verification, reliability evaluation and implementation of a virtual service redirector for critical internet applications [5]. This paper discusses the reliability evaluation of the system.
The distributed system we are developing consists of a network of computing nodes. The network can be bus-based or dedicated links. This paper studies the case that the computing nodes are connected by dedicated links into an arbitrary topology or a graph. The tasks running on the system are illustrated in Figure 1. 
The system is intended to run various internet applications, for example, firewall, mail server, web server, etc., with different levels of criticality. The higher the criticality is, the more copies of the task will be run in the system. The fault-tolerant protocols will check the consistency among the replicas and perform fault detection and fault isolation. Faulty nodes will be removed from the systems. The hashing and redirection components will allocate the required copies of tasks to available nodes. The policy of task allocation involves the following. 
If a node is working, tasks on the node should not be reallocated to minimize the reallocation overhead.
If a node become faulty, a copy of a task on the node will be reallocated to another node only if the number of copies of the task drops to a critical level ( extra redundant copies may have been created in order to reduce the number of task reallocation.
If a copy has to be reallocated, it must be ensured that all nodes have relatively balanced loads. 
Copies of a task must be allocated to a subset of nodes of the system, such that the least number of task reallocation is necessary.
To implement this hashing and redirection policy, we need to find a subgraph to host the replicas of a critical task in a graph such that the subgraph can survive the maximum number of node or edge (link) faults without breaking the connectedness of the subgraph.


There are various reliability problems that occur when a distributed system is modelled by a graph whose nodes or edges or both have an associated probability of failure. This development has led to a more general interest in reliability measures and associated computational techniques. A number of measures have been defined in the literature[1~7]. For example, the K-terminal reliability of a graph: Let G = (V, E) be a graph, and K ( V, a special subset of V. Suppose that the elements (nodes and edges) of G may fail with certain probabilities, the K-terminal reliability RK(G) of G is the probability that there is some subgraph H in G such that all elements of H are operational and all nodes of K lie in a single component of H. 
If we restrict our attention to graphs G in which elements fail independently of each other with equal probabilities, there are three kinds of element failure models:
( Edge failure the nodes of a graph are perfectly reliable but the edges fail independently of each other with equal probability q1.
( Node failure the edges of a graph are perfectly reliable but the nodes fail independently of each other with equal probabilities q0.
( Node-and-edge failure both edges and nodes fail independently of each other with edge-probabilities q1, and with node-probabilities q0.
For the edge failure model, Viliant [14] has proven that computing RK(G) in general is NP-hard, even for |K| = 2. Subsequently, Provan [10,11] showed that even for planar graphs the computation of RK(G) is NP-hard. Algorithms for estimating network reliability were proposed by Provanl [9], Brown [3] and Shogan [12] etc, lower and upper bounds were obtained. Palmer [8] discussed the analysis property ( gradual property of reliability for a complete graph. For the node failure model, the problem of computing RK(G) is also shown to be NP-hard for general graphs and it remains so even for chordal graphs and comparability graphs[13].
To find the optimal subgraph for our task allocation efficiently, we need to calculate RK(G) in a very short period of time to avoid degrading the system performance. An NP algorithm is not acceptable.	
In this paper, we mainly consider the following reliability problem: Node failure model ( the network is considered to be in an operational state if the surviving nodes induce a connected subgraph of G. The residual connectedness reliability of a graph G, denoted R0(G), is the probability that the subgraph induced by the surviving nodes is nonempty and connected.
Let G be an undirected graph with perfectly reliable edges and n nodes that may fail independently with equal probability q0. Then the residual connectedness reliability R0(G) may be written as [13] 
       R0(G) = ( 1 (        				(1)
where Ai(G) stands for the number of connected node induced subgraphs of G with i nodes and (G) stands for the number of the disconnected node induced subgraphs of G with n ( i nodes. 
The computing of R0(G) in general is proved to be NP-hard and remains NP-hard for split graph as well as planar and bipartite graphs [13]. Consequently, methods and heuristic algorithms for estimating R0(G) are useful. As the problem of computing  is NP-complete even for a split graph (a special graph), obviously, it is hard to find out a good estimator of R0(G) if we make direct use of formula (1). 
If one can find a way other than using the combination formula (1), perhaps a better estimator, like upper and lower bounds, can be obtained. The better estimator means both a shorter running time and better accuracy. This is the aim of this paper. 
After unsuccessful attempts to find an existing algorithm to meet our requirement, we realize that we need to do our own research to solve the problem. This paper describes the results we obtained from this research. Based on a solid understanding and analysis of the complexity of the problem, we designed efficient algorithms to find
(  the upper bound of the residual connectedness reliability of a given graph and have shown that the upper bound we found is very close to the real R0(G);
(  the lower bound of the residual connectedness reliability and have shown that the lower bound is very close to the real R0(G);
(  an efficient algorithm to find a subgraph in a given graph so that the reallocation of a critical task to this sub graph will ensure that the residual connectedness reliability will fall between the upper and lower bounds. 
As examples, we calculated the residual connectedness reliability of some well-known graphs. The difference between the upper and lower bounds, as well as the trends of properties of the reliability are given in diagrams. 
The research results are important for our system design. They also contribute to general graph algorithms where finding residual connectedness reliability is a common interest in many problems.
In the rest of the paper, we introduce the methods for finding the upper and lower bounds of reliability R0(G). Section 3 then gives the corresponding algorithms for the bounds and the algorithm for finding a subgraph with maximum residual connectedness reliability. In section 4, we apply the methods to some typical classes of graphs, for example complete graph, star graph, hypercube and Harary graph. Section 5 gives the numerical results and analysis of the results. Section 6 concludes the paper.
6. Conclusions
A new approach and efficient algorithms for evaluation (upper bound and lower bound) of the residual connected reliability have been proposed in the paper. Meanwhile, we have applied our approach to several typical classes of graphs (networks), which shows that the efficiency is very good from both computational results and theoretical results. The results are used in the distributed system project to find a subgraph for the task allocation

Reference
[1] K.K.Aggarwal, Y.C.Chopra, and J.S.Bajwa, "Reliability evaluation by network decomposition", IEEE Trans. Reliab. R-31, 1982, pp. 355-358.
[2] M.O.Ball, "Computional complexity of network reliability analysis _ an overview", IEEE Trans. Reliab. R-35, 1986, pp. 230-239.
[3] J.I.Brown, C.J.Colbourn, and J.S.Devitt, "Network transformation and bounding network reliability", Network, 23, 1993, pp. 1-17.
[4] D.Bulka, J.B.Dugan, "Network s-t reliability bounds using a 2-dimensional reliability polynomial", IEEE Trans. Reliab. R-43, 1994, pp. 39-45.
[5] Y. Chen, V. Galpin, S. Hazelhurst, R. Mateer, and C. Mueller, "Modelling software development of a decentralised virtual service redirector for Internet applications", The 7th IEEE Workshop on Future Trends of Distributed Computing Systems, Cape Town, December 1999, pp. 235 - 241.
[6] C.J. Colbourn, "Computing residual connectedness reliability for restricted networks", Discrete Apple Math., 44, 1993, pp. 221-232.
[7] Z.S. He, The fault tolerance and reliability of computer communication networks: analysis and synthesis, Ph. D. Dissertation, Chongqing University, 1996.10.
[8] E.M.Palmer, Graphical evolution-an introduction to the theory of random graphs.  John Wiley & Sons, New York, 1985.
[9] J.S.Provan, "Bounds on reliability of networks", IEEE Trans. Reliab.R-35, 1986, pp.  260-268.
[10] J.S.Provan, "The complexity of reliability computations in planar and acyclic graphs", SIAM J.Comput.15, 1986, pp. 695-702.
[11] J.S.Provan, and M.O.Ball, "The complexity of counting cuts and computing the probability that a graph is connected", SIAM J.Comput.12, 1983, pp. 777-788.
[12] A.W.Shogan, "A recursive algorithm for bounding network reliability", IEEE Trans. Reliab. R-26, 1977, pp. 322-327.
[13] K.Sutner, A.Satyanarayana, and C.Suffel, "The complexity of the residual node connectedness reliability problem:,  SIAM J.Comput. 20, 1991, pp. 149-155.
[14] L.G.Valiant, "The complexity of enumeration and reliability problems", SIAM J.Comput. 8, 1979, pp. 410-421.


* Dr. He is a faculty member of Chongqing University, P.R. China, and is currently a visiting scholar at the University of Witwatersrand.
1