Principal Investigators:
Ken Birman
4105c Upson Hall
Keith Marzullo
University of California, San Diego
Please use the Horus navigation bar under the title picture to visit the Horus Web pages, which include all publications, information on how to get the software, links to related projects, and other useful information. A quad chart with an overview of the project is also available.
The need for reliable group communication arises whenever a collection of two or more programs needs to cooperate in a networked environment. Our general interest is in providing tools for building reliable computing applications in extremely demanding NII environments. Such environments are characterized by tight response time constraints and by specifications that mandate certain levels of data throughput, latency, and error suppression, and they combine security or privacy requirements with fault-tolerance considerations.
A good example of such an application would involve a co-processor for a modern telecommunication switch. Such co-processors must be highly fault-tolerant while respecting tight real-time constraints even as they reconfigure because of a failure or system upgrade. They raise challenging parallelism and security issues, and are clearly beyond the capabilities of first-generation process-group systems such as the Isis Toolkit, which we developed under ARPA funding during the period 1985-1990. A general discussion of the approach can be found in our May 1996 Scientific American article, a copy of which is available online on the Scientific American web server.
The main criticism of process group systems like Isis is that they tend to be inflexible: they provide all users with a single set of interfaces and a single fixed notion of “reliability” that is often mismatched with the specialized forms of reliability needed in demanding applications. No single reliability solution can satisfy all needs. For this reason, Horus is designed to be extremely flexible. The presentation of Horus can be varied as needed by embedding it into existing operating system or language environments. The properties of Horus process groups can also be customized, ranging from Isis-style process groups (used for fault-tolerant replication of data and services) to other sorts of groupware or collaborative workgroup applications, where reliability may involve providing isochronous communication, security, or other properties. Horus obtains this flexibility using a unique plug-and-play architecture (we describe it as resembling a “Lego” construction set).
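To make the plug-and-play idea concrete, here is a minimal sketch of how protocol layers might be stacked like building blocks. It is not the Horus API: the names `Layer`, `Sequencer`, `Checksum`, and `Stack` are hypothetical, and the two example properties (sequence numbering and a simple checksum) were chosen only for illustration.

```python
class Layer:
    """Hypothetical protocol layer: transforms messages going down and coming up."""

    def down(self, msg: dict) -> dict:
        return msg

    def up(self, msg: dict) -> dict:
        return msg


class Sequencer(Layer):
    """Stamps each outgoing message with a sequence number."""

    def __init__(self) -> None:
        self.next_seq = 0

    def down(self, msg: dict) -> dict:
        stamped = {**msg, "seq": self.next_seq}
        self.next_seq += 1
        return stamped

    def up(self, msg: dict) -> dict:
        # Strip the header this layer added on the way down.
        return {k: v for k, v in msg.items() if k != "seq"}


class Checksum(Layer):
    """Adds a toy checksum on send and verifies it on receive."""

    @staticmethod
    def _digest(body: str) -> int:
        return sum(body.encode()) % 65536

    def down(self, msg: dict) -> dict:
        return {**msg, "sum": self._digest(msg["body"])}

    def up(self, msg: dict) -> dict:
        if self._digest(msg["body"]) != msg["sum"]:
            raise ValueError("corrupted message")
        return {k: v for k, v in msg.items() if k != "sum"}


class Stack:
    """Composes layers: messages pass top-to-bottom on send, bottom-to-top on receive."""

    def __init__(self, layers) -> None:
        self.layers = list(layers)

    def send(self, body: str) -> dict:
        msg = {"body": body}
        for layer in self.layers:
            msg = layer.down(msg)
        return msg

    def receive(self, msg: dict) -> str:
        for layer in reversed(self.layers):
            msg = layer.up(msg)
        return msg["body"]
```

In this sketch, an application that needs only ordering would build `Stack([Sequencer()])`, while one that also wants corruption detection would build `Stack([Sequencer(), Checksum()])`. The application code calling `send` and `receive` is the same in both cases.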
Both Horus and Isis make use of a communication model called virtual synchrony, which was introduced by Isis in 1987. This model makes it possible to implement fault-tolerance in software, and also proves to be a powerful source of design simplification and a useful conceptual tool for distributed software designers. Horus and Isis support toolkits with solutions to problems such as consistent replication of data (with or without persistence), load-balanced request execution, and even availability of services in the presence of network partitions.
Much of the distributed systems community has adopted this model, and the underlying theory is increasingly well understood. This theory is making it possible for us to develop security solutions that avoid denial-of-service attacks and that can provide security in the presence of network partitioning failures.
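The central guarantee of virtual synchrony, as it is commonly described, is that group members which survive from one membership view to the next see the same multicasts in the same order relative to the view change. The toy simulation below illustrates only that observable property. It is an in-memory sketch under strong assumptions (a single process, no real network, and a hypothetical `Group` class that orders all events centrally) and does not reflect how Horus or Isis implement the model.

```python
class Group:
    """Toy process group: every event is appended to each current member's log."""

    def __init__(self, members) -> None:
        self.view_id = 0
        self.members = list(members)
        first_view = ("view", self.view_id, tuple(self.members))
        self.logs = {m: [first_view] for m in self.members}

    def multicast(self, sender: str, payload: str) -> None:
        """Deliver a message to all members of the current view."""
        if sender not in self.members:
            raise ValueError(f"{sender} is not in the current view")
        event = ("msg", self.view_id, sender, payload)
        for member in self.members:
            self.logs[member].append(event)

    def fail(self, member: str) -> None:
        """Remove a member and announce the new view to the survivors."""
        self.members.remove(member)
        self.view_id += 1
        event = ("view", self.view_id, tuple(self.members))
        for survivor in self.members:
            self.logs[survivor].append(event)


group = Group(["a", "b", "c"])
group.multicast("a", "x=1")   # delivered in view 0 to a, b, and c
group.fail("c")               # survivors a and b install view 1
group.multicast("b", "x=2")   # delivered in view 1 to a and b only
```

After this run, the logs of `a` and `b` are identical, and the log of the failed member `c` is a prefix of theirs. A replicated data structure that applies updates in log order would therefore stay consistent across the survivors.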
Process group distributed computing has been remarkably successful in settings that demand fault-tolerant or self-managed distributed solutions. Isis is now an $8M/year product line of Stratus Computer, Inc., and is being used in settings such as the forthcoming new generation of the French air traffic control system, cellular telephone systems, stock exchanges, VLSI chip fabrication lines, and many others. The military has applied Isis in a number of projects, and other government agencies are launching similar efforts. A visible example is the HiperD Naval project (a prototype of next-generation AEGIS technology), which is using Isis to manage a network of high-performance computers in support of battle management tasks.
Horus goes far beyond Isis in both flexibility and performance. On the performance side, we have been working with Thorsten von Eicken's UNet and Active Message technology to layer Horus over ATM networks. This permitted us to demonstrate a ten-fold performance improvement over what Horus achieves on Ethernet, which is itself a substantial improvement over the performance of Isis for similar tasks. Horus is also much more predictable than Isis, both in guaranteeing low delivery latencies and in sustaining high data throughput rates. Indeed, we believe that video data rates and groupware/conferencing applications are within reach with Horus technology over ATM.
Illustrating this point, we built a realistic emulation of a telecommunications switch co-processor as an experimental application of Horus. We were able to show scalability, high performance (22,000 “calls” per second, all completed within 100ms deadlines), and fault-tolerance (we can service the processor or crash nodes while continuously guaranteeing that deadlines will be met). The potential military applications of this approach include tracking and target assessment functions in systems like AEGIS, on-board fly-by-wire aircraft control, and other advanced applications requiring extreme reliability, guaranteed performance, and a high degree of autonomy.
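As a rough back-of-the-envelope reading of the figures above, Little's law (average number in the system equals arrival rate times average time in the system) bounds how many calls are being processed at once. The calculation below is only a sketch: it assumes a steady-state load and treats the 100ms deadline as an upper bound on per-call latency, and the function name is hypothetical.

```python
def max_calls_in_flight(calls_per_second: int, deadline_ms: int) -> float:
    """Upper bound on concurrent calls via Little's law: L = rate * time."""
    return calls_per_second * deadline_ms / 1000


# Figures quoted in the text: 22,000 calls per second, 100 ms deadlines.
bound = max_calls_in_flight(22_000, 100)
print(f"At most {bound:.0f} calls in flight at steady state")
```

Under these assumptions, the emulated co-processor never needs to track more than about 2,200 concurrent calls, even at full load.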
Looking to the future, we see a major opportunity to unify the themes of parallel, fault-tolerant, real-time, and secure distributed computing in a single, highly modular architecture. By doing so, and by making the solutions sufficiently transparent and easy to use, Horus can offer a major step forward in our ability to engineer demanding distributed applications. Today, robust distributed computing remains an ad hoc problem that we deal with largely as an afterthought. With Horus, security and fault-tolerance can be introduced transparently into conventional distributed systems, even as more ambitious distributed systems benefit from a uniquely powerful, flexible, and high-performance technology base. Such a technology is urgently needed by industry, and we believe Horus will demonstrate that extremely powerful integrated solutions are reaching the level of maturity demanded by ambitious commercial application developers.
All our research on Horus is freely available to other research groups in academic settings, industry, or the military. Through industry collaborations with Isis Distributed Systems (a subsidiary of Stratus Computer), BBN, and others, Horus will also become a commercial product line within the next two years or so.
Isis plans to launch a Horus-based product line focused on real-time applications, high performance embedded availability applications, and other Horus applications that exceed the performance or flexibility capabilities of Isis. The Isis commercial effort focuses on adding value to Horus, incorporating it into end-user deliverables, and supporting the resulting technology. Stratus Computer, the parent company of Isis, is exploring the use of Horus in its RADIO architecture: a new generation of PC-based scalable, parallel, highly available cluster computing products that run the NT operating system.