Fault-Tolerant Mechanism of the Distributed Cluster Computers
Affiliations: State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China; Department of Automation, Tsinghua University, Beijing 100084, China
Abstract: Distributed systems with high performance and stability are commonly adopted in large-scale scientific and engineering computing. In this paper, we discuss a fault-tolerant mechanism for a Linux environment that improves the fault tolerance of the system, namely a scheme and framework for building a stable computing platform. Based on the structure and function of the distributed system, active list and file invocation strategies are employed in task management. Multilevel fault tolerance is achieved through repeated processes on a single node and task migration across multiple nodes. The manager node agent introduced in this paper administers the nodes using the list and dispatches tasks according to each node's performance, and can thus make full use of the cluster resources. An evaluation method is proposed to appraise the performance. The analysis results show that the proposed scheme is useful, at the cost of some additional memory overhead.
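The two levels of fault tolerance named in the abstract (repeating a process on a single node, then migrating the task to another node) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the `performance` score, the `local_retries` parameter, and the use of `RuntimeError` to signal a failed run are all assumptions made for illustration, and the active list is modelled as a plain Python list.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical entry in the manager's active list."""
    name: str
    performance: float  # assumed score; higher means preferred
    alive: bool = True


def run_with_fault_tolerance(
    task: Callable[[Node], Any],
    active_list: List[Node],
    local_retries: int = 2,
) -> Tuple[str, Any]:
    """Run `task` on the best-performing node, retrying locally before migrating.

    Level 1: repeat the task on the same node up to `local_retries` extra times.
    Level 2: mark the node as not alive and migrate to the next-best node.
    """
    candidates = sorted(
        (n for n in active_list if n.alive),
        key=lambda n: n.performance,
        reverse=True,
    )
    for node in candidates:
        for _ in range(1 + local_retries):
            try:
                return node.name, task(node)
            except RuntimeError:
                continue  # level 1: repeat on the same node
        node.alive = False  # level 2: drop the node and migrate the task
    raise RuntimeError("task failed on every active node")


if __name__ == "__main__":
    def sample_task(node: Node) -> int:
        # Assumed failure model: node "b" always fails.
        if node.name == "b":
            raise RuntimeError("simulated node failure")
        return 42

    nodes = [Node("a", 1.0), Node("b", 3.0), Node("c", 2.0)]
    print(run_with_fault_tolerance(sample_task, nodes, local_retries=1))
```

Sorting by a single score is the simplest possible dispatch policy. A real manager node agent would presumably refresh the list from node heartbeats and measured load, which this sketch does not model.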


SHANG Yizi, JIN Yang, WU Baosheng. Fault-Tolerant Mechanism of the Distributed Cluster Computers[J]. Tsinghua Science and Technology, 2007, 12(Z1): 186-191
Authors: SHANG Yizi, JIN Yang, WU Baosheng
Keywords: distributed system; active list; file invocation; multilevel fault-tolerance
This article is indexed in CNKI, Wanfang Data, and other databases.