
Fault-Tolerant Mechanism of the Distributed Cluster Computers

Affiliations:	State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China; Department of Automation, Tsinghua University, Beijing 100084, China; State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China

SHANG Yizi, JIN Yang, WU Baosheng. Fault-Tolerant Mechanism of the Distributed Cluster Computers[J]. Tsinghua Science and Technology, 2007, 12(Z1): 186-191

Authors:	SHANG Yizi, JIN Yang, WU Baosheng

Abstract:	Distributed systems that offer high performance and stability are widely used in large-scale scientific and engineering computing. This paper discusses a fault-tolerant mechanism for Linux environments, namely a scheme and framework for building a stable computing platform, that improves the fault tolerance of such systems. Based on the structure and function of the distributed system, active-list and file-invocation strategies are employed in task management. Multilevel fault tolerance is achieved by repeating processes on a single node and by migrating tasks across multiple nodes. The manager node agent introduced in this paper administers the nodes using the list and assigns tasks according to the nodes' performance, thereby making full use of the cluster resources. An evaluation method is proposed to appraise the performance. The analysis shows that the proposed scheme is useful, at the cost of some additional memory overhead.

Keywords:	distributed system; active list; file invocation; multilevel fault-tolerance
This article is indexed in CNKI, Wanfang Data, and other databases.
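
The abstract describes two levels of fault tolerance (repeating a process on a single node, then migrating the task to another node) coordinated by a manager node agent that keeps an active list and assigns work according to node performance. The sketch below is one possible reading of that idea, not the authors' implementation: the names `Node`, `ManagerAgent`, `local_retries`, and the numeric `performance` score are all assumptions made for illustration, and failures are simulated with `RuntimeError`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    name: str
    performance: float  # hypothetical score; higher means more capable


class ManagerAgent:
    """Hypothetical manager-node agent that tracks nodes in an active list."""

    def __init__(self, nodes: List[Node], local_retries: int = 2):
        self.active = list(nodes)           # the "active list" of usable nodes
        self.local_retries = local_retries  # extra attempts on the same node

    def ranked_nodes(self) -> List[Node]:
        # Assumed policy: try the best-performing active node first.
        return sorted(self.active, key=lambda n: n.performance, reverse=True)

    def run(self, task: Callable[[Node], str]) -> str:
        for node in self.ranked_nodes():
            # Level 1: repeat the process on the same node.
            for _ in range(1 + self.local_retries):
                try:
                    return task(node)
                except RuntimeError:
                    continue
            # Level 2: drop the node from the active list and migrate
            # the task to the next-best node.
            self.active.remove(node)
        raise RuntimeError("no active node could complete the task")


# Usage: node n1 always fails, so the task migrates to n2.
attempts: List[str] = []


def flaky_task(node: Node) -> str:
    attempts.append(node.name)
    if node.name == "n1":
        raise RuntimeError("simulated failure")
    return f"done on {node.name}"


agent = ManagerAgent([Node("n1", 0.9), Node("n2", 0.5)])
result = agent.run(flaky_task)
```

In this run the task is attempted three times on `n1` (one initial attempt plus two local retries) before `n1` is removed from the active list and the task succeeds on `n2`. Keeping the retry loop inside the migration loop makes the two levels explicit; a real cluster would also need heartbeat detection and checkpointed task state, which this sketch leaves out.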
