A redesigned ISIS and Meta under Mach.

Principal Investigators: Ken Birman

4105c Upson Hall
Cornell University
Ithaca, NY, 14853-9219
phone: (607) 255-9199
fax: (607) 255-4428
email: ken@cs.cornell.edu

Keith Marzullo

University of California, San Diego
La Jolla, CA, 92093
phone: (619) 534-3729
fax: (619) 534-7029
e-mail: marzullo@cs.ucsd.edu

Please use the Horus navigation bar under the title picture to visit the Horus Web pages, which offer all publications, information on how to get the software, links to related projects, and other useful information. A quad chart with an overview of the project is also available.

The 1996 project summary for the Defense Advanced Research Projects Agency (DARPA)

OBJECTIVE

Horus is a communication system that solves robustness problems for demanding distributed applications. Issues of fault-tolerance, security, guaranteed real-time responsiveness, and self-management are addressed in a single, integrated framework and supported by a rigorous theoretical model. The technology is presented through toolkits embedded into high-level language frameworks and "wrappers," which enhance versions of standard system call and library interfaces so that programs that use those interfaces can be transparently hardened. During the coming year, our objectives center on making the current version of Horus widely available and on integrating it into a Cornell campus-wide collaboration and groupware environment for ATM networks, called CU-Links. CU-Links can be understood as a prototype for future military intelligence analysis and C4-I systems.

APPROACH

The need for reliable group communication arises whenever a collection of two or more programs needs to cooperate in a networked environment. Our general interest is in providing tools for building reliable computing applications in extremely demanding NII environments. Such environments are characterized by tight response time constraints and by specifications that mandate certain levels of data throughput, latency, and error suppression, and they combine security or privacy requirements with fault-tolerance considerations.

A good example of such an application is a co-processor for a modern telecommunication switch. Such co-processors must be highly fault-tolerant while respecting tight real-time constraints even as they reconfigure because of a failure or system upgrade. They raise challenging parallelism and security issues, and are clearly beyond the capabilities of first-generation process-group systems such as the Isis Toolkit, which we developed under ARPA funding during the period 1985-1990. A general discussion of the approach can be found in our May 1996 Scientific American article, a copy of which is available online on the Scientific American web server.

The main criticism of process group systems like Isis is that they tend to be inflexible: they provide all users with a single set of interfaces and a single fixed notion of "reliability" that will often be mismatched with the specialized forms of reliability needed in demanding applications. No single reliability solution can satisfy all needs. For this reason, Horus is designed to be extremely flexible. The presentation of Horus can be varied depending on the need, by embedding it into existing operating system or language environments. The properties of Horus process groups can also be customized, ranging from Isis-style process groups (used for fault-tolerant replication of data and services) to other sorts of groupware or collaborative workgroup applications, where reliability may involve providing isochronous communication, security, or other properties. Horus obtains this flexibility using a unique plug-and-play architecture (we describe it as resembling a "Lego" construction set).
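The plug-and-play idea can be illustrated with a minimal sketch, assuming that each property is packaged as a layer with a common send/deliver interface and that layers are stacked to suit the application. The names below (Layer, Checksum, XorCipher, Stack) are hypothetical and chosen only for illustration; they are not Horus interfaces, and the XOR "cipher" is a toy stand-in for a real security layer.

```python
class Layer:
    """Hypothetical protocol layer: transforms a message going down (send)
    and undoes that transformation coming back up (deliver)."""

    def send(self, msg: bytes) -> bytes:
        return msg

    def deliver(self, msg: bytes) -> bytes:
        return msg


class Checksum(Layer):
    """Appends a one-byte checksum on send and verifies it on delivery."""

    def send(self, msg: bytes) -> bytes:
        return msg + bytes([sum(msg) % 256])

    def deliver(self, msg: bytes) -> bytes:
        body, check = msg[:-1], msg[-1]
        if sum(body) % 256 != check:
            raise ValueError("corrupted message")
        return body


class XorCipher(Layer):
    """Toy stand-in for an encryption layer (illustration only, not secure)."""

    def __init__(self, key: int):
        self.key = key

    def send(self, msg: bytes) -> bytes:
        return bytes(b ^ self.key for b in msg)

    def deliver(self, msg: bytes) -> bytes:
        # XOR with the same key is its own inverse.
        return bytes(b ^ self.key for b in msg)


class Stack:
    """Composes layers: send runs top to bottom, deliver runs bottom to top."""

    def __init__(self, layers):
        self.layers = list(layers)

    def send(self, msg: bytes) -> bytes:
        for layer in self.layers:
            msg = layer.send(msg)
        return msg

    def deliver(self, msg: bytes) -> bytes:
        for layer in reversed(self.layers):
            msg = layer.deliver(msg)
        return msg


# An application that wants integrity checking plus (toy) confidentiality:
stack = Stack([Checksum(), XorCipher(0x5A)])
wire = stack.send(b"hello")
assert stack.deliver(wire) == b"hello"
```

An application with different needs would build a different stack from the same parts, for instance `Stack([Checksum()])` alone, without changing the code that calls `send` and `deliver`.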

Both Horus and Isis make use of a communication model called virtual synchrony, which was introduced by Isis in 1987. This model makes it possible to implement fault-tolerance in software, and also proves to be a powerful source of design simplification and a useful conceptual tool for distributed software designers. Horus and Isis support toolkits with solutions to problems such as consistent replication of data (with or without persistence), load-balanced request execution, and even availability of services in the presence of network partitions.
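The core guarantee of virtual synchrony can be sketched with a toy, single-process model: membership changes are reported to members as numbered views, and members that pass through the same view see the same messages in it. The code below is an illustration under strong simplifying assumptions (no network, no concurrency, events applied to all current members at once); the Group class and its methods are hypothetical and are not the Isis or Horus API.

```python
class Group:
    """Toy in-memory process group. Each member keeps a log of the view
    changes and messages it has seen, in order."""

    def __init__(self, members):
        self.view_id = 0
        self.members = set(members)
        self.logs = {m: [] for m in members}
        self._install_view()

    def _install_view(self):
        # A new view is reported to exactly the members it contains.
        self.view_id += 1
        event = ("view", self.view_id, tuple(sorted(self.members)))
        for m in self.members:
            self.logs[m].append(event)

    def multicast(self, sender, payload):
        # A message is tagged with the view in which it was sent and is
        # delivered to every member of that view before any later view.
        if sender not in self.members:
            raise ValueError("sender is not in the current view")
        event = ("msg", self.view_id, sender, payload)
        for m in self.members:
            self.logs[m].append(event)

    def join(self, member):
        self.members.add(member)
        self.logs.setdefault(member, [])
        self._install_view()

    def fail(self, member):
        self.members.discard(member)
        self._install_view()


def messages_in_view(log, view_id):
    """Messages a member delivered while a given view was installed."""
    return [e for e in log if e[0] == "msg" and e[1] == view_id]


g = Group(["a", "b", "c"])
g.multicast("a", "x=1")   # delivered to a, b, c in view 1
g.fail("c")               # view 2 = {a, b}
g.multicast("b", "x=2")   # delivered to a, b in view 2
g.join("d")               # view 3 = {a, b, d}

# Survivors agree on what was delivered in each view they shared.
assert messages_in_view(g.logs["a"], 1) == messages_in_view(g.logs["c"], 1)
assert g.logs["a"] == g.logs["b"]
```

Because every replica sees the same events in the same order relative to membership changes, replicated data can be updated by simply applying delivered messages, which is the design simplification the model is meant to provide. A real implementation must achieve this ordering across a network in the presence of failures, which this sketch does not attempt.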

Much of the distributed systems community has adopted this model, and the underlying theory is increasingly well understood. This theory is making it possible for us to develop security solutions that avoid denial of service attacks and that can provide security in the presence of network partitioning failures.

Process group distributed computing has been remarkably successful in settings that demand fault-tolerant or self-managed distributed solutions. Isis is now an $8M/year product line of Stratus Computer Inc., and is being used in settings such as the forthcoming new generation of the French air traffic control system, cellular telephone systems, stock exchanges, VLSI chip fabrication lines, and many others. The military has applied Isis in a number of projects, and other government agencies are launching similar efforts. A visible example is the HiperD Naval project (a prototype of next-generation AEGIS technology), which is using Isis to manage a network of high-performance computers in support of battle management tasks.

Horus goes far beyond Isis in both flexibility and performance. On the performance side, we have been working with Thorsten von Eicken's UNet and Active Message technology to layer Horus over ATM networks. This permitted us to demonstrate a ten-fold performance improvement over what Horus achieves on Ethernet, which is itself a substantial improvement over the performance of Isis for similar tasks. Horus is also much more predictable than Isis, both in guaranteeing low delivery latencies and in sustaining high data throughput rates. Indeed, we believe that video data rates and groupware/conferencing applications are within reach with Horus technology over ATM.

Illustrating this point, we built a realistic emulation of a telecommunications switch co-processor as an experimental application of Horus. We were able to show scalability, high performance (22,000 "calls" per second, all completed within 100ms deadlines), and fault-tolerance (we can service the processor or crash nodes while continuously guaranteeing that deadlines will be met). The potential military applications of this approach include tracking and target assessment functions in systems like AEGIS, on-board fly-by-wire aircraft control, and other advanced applications requiring extreme reliability, guaranteed performance, and a high degree of autonomy.
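As a rough back-of-the-envelope reading of these figures, Little's law (average number in the system = arrival rate x average time in the system) bounds how many calls are in progress at once. The calculation below is an illustration that assumes steady-state operation; the function name is hypothetical, and the result is derived from the two numbers quoted above rather than measured.

```python
def max_average_calls_in_flight(calls_per_second: int, deadline_ms: int) -> float:
    """Little's law, L = lambda * W. If every call completes within the
    deadline, then W <= deadline, so L <= lambda * deadline."""
    return calls_per_second * deadline_ms / 1000


# 22,000 calls per second, each completed within a 100 ms deadline.
bound = max_average_calls_in_flight(22_000, 100)
assert bound == 2200
```

Under these assumptions, the emulated co-processor holds on average no more than 2,200 calls in progress at any moment, which gives a sense of the state that must remain consistent while nodes are serviced or crash.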

Looking to the future, we see a major opportunity to unify the central themes of parallel, fault-tolerant, real-time, and secure distributed computing in a single, highly modular architecture. By doing so, and by making the solutions sufficiently transparent and easy to use, Horus can offer a major step forward in our ability to engineer demanding distributed applications. Today, robust distributed computing remains an ad hoc problem that we deal with largely as an afterthought. With Horus, security and fault-tolerance can be introduced transparently into conventional distributed systems, even as more ambitious distributed systems benefit from a uniquely powerful, flexible, and high-performance technology base. Such a technology is urgently needed by industry, and we believe Horus will demonstrate that extremely powerful integrated solutions are reaching the level of maturity demanded by ambitious commercial application developers.

All our research on Horus is freely available to other research groups in academic settings, industry, or the military. Through industry collaborations with Isis Distributed Systems (a subsidiary of Stratus Computer), BBN, and others, Horus will also become a commercial product line within the next two years or so.

Recent Accomplishments

Plans for 1997.

Technology transition

All research versions of Horus are available at no fee for research users in military, academic, or commercial settings, and we seek demanding early users of the technology with whom to collaborate, for example in telecommunications applications. Horus technology transition will occur through licensing agreements with Isis Distributed Systems Inc. (the company that commercialized the Isis Toolkit), BBN, and other companies.

Isis plans to launch a Horus-based product line focused on real-time applications, high-performance embedded availability applications, and other Horus applications that exceed the performance or flexibility capabilities of Isis. The Isis commercial effort focuses on adding value to Horus, incorporating it into end-user deliverables, and supporting the resulting technology. Stratus Computer, the parent company of Isis, is exploring the use of Horus in its RADIO architecture: a new generation of PC-based scalable, parallel, highly available cluster computing products that run the NT operating system.


Comments to Werner Vogels