Kimera Test Suite
We built a stand-alone Java bytecode verifier as part of the Kimera security firewall. The bytecode verifier forms an essential component of the trusted computing base of a Java engine. The verifier independently assures that a given foreign applet will not perform an unsafe operation without having to rely on or trust the originator of the applet. A Java virtual machine relies on the verifier as the principal line of defense against potentially malicious code. Consequently, the verifier's correctness is critical to the security of the Java engine.
Since it is computationally difficult to formally prove that a Java verifier implementation is correct, we decided to build a comprehensive test suite and a testing methodology that could be used to validate our verifier. We tested our verifier by comparing its behaviour against those of commercial verifiers. In simplest terms, we fed millions of automatically generated Java classes to our verifier, and compared its opinion on the safety of the test files against commercially available verifiers. The automatically generated test suite achieves broad coverage over the possible bugs and non-conforming behavior in Java applets. Our comparative testing methodology has enabled us to make our verifier more secure than commercial state of the art verifiers, and has uncovered numerous flaws in popular Java implementations.
To generate our test cases, we selected a small number of interesting class files as bases. We then automatically introduced random one-byte errors in to these base class files and compared the results of the Kimera verifier against commercial verifiers (Figure 1). Cases where the class file was accepted by our verifier but rejected by a commercial verifier pointed to a potential bug in our verifier. We would fix the bug and continue. Cases where the class file was rejected by our verifier but accepted by a commercial verifier indicated a potential bug in the commercial implementation. We have informed the developers of the commercial verifiers of our results.
We have automated the generation and comparison stages of testing so that human involvement is not necessary except to resolve differences. As long as the commercial verifier agrees with our bytecode verifier, that class file does not need to be inspected. However if the commercial bytecode verifier passes a class file as safe where our bytecode verifier has classified it as unsafe, the comparison stage flags the file and the exact reason why the discrepancy occurred.
Once a discrepancy is detected, the class file that caused the verifiers to differ typically needs to be inspected manually. We were not satisfied with the existing, commercial bytecode examination tools for this purpose, since they often neglected to print crucial information in the class file that highlights the discrepancies. Instead, we developed a bytecode disassembler to simplify the task of determining what is wrong with automatically generated class files.
In comparative testing, it is possible that the reference implementation may contain some flaws in common with the commercial verifiers, in which case some faults might lie undetected. Hence, comparative testing does not guarantee the absence of faults, but is a crucial tool in locating existing problems. Further, we have taken a number of precautions to ensure that our verifier implementation is strong. First, concentrating all verification checks into a small, single, separable component instead of distributing them throughout the Java runtime has allowed us to manually certify the verifier easily. We have retained a clear correspondance between the Java safety axioms in the JVM specification and their implementation in our code. We have also annotated each axiom with a precise English description such that its purpose is clear by inspection.
In generating the test suite, we look at two types of coverage statistics, category coverage and axiom coverage. An axiom is a single statement whose violation in a class file renders it unsafe. An example of an axiom is "When the LADD instruction is executed, there must be a long data type at the top of the stack." There are several hundred such fine-grain axioms for Java. Many of these axioms are closely related. For example, "When the IADD instruction is executed, there must be an integer data type at the top of the stack" is a very similar check. We classify related axioms into overall categories to simplify management of code invariants. For instance, in this case, both of these axioms fit into the same category, "instructions must execute with the proper types on the stack." Our bytecode verifier contains 595 axioms in 69 categories.
In figures 2 and 3, we show the number of axiom categories covered versus the number of mutated class files that were fed into our verifier. Note that the mutations are not cumulative, that is, each mutated class file differs from the base class file by one byte. Even though we expected one-byte errors to be limited in the breadth of coverage they can generate, we found that the automatically generated test suite has broad coverage. As shown in Figure 3, the test suite exercises 65 of the 69 categories of checks made in our verifier. Figures 4 and 5 show the coverage information on an axiom granularity. 495 checks out of the 595 axioms in our verifier are individually triggered. Note from Figures 4 and 5 that even after half a million iterations on five different bases, check coverage has not flattened out. We expect that the coverage would increase with more iterations of the generator.
Coverage can also be tailored to fit the needs of the bytecode verifier we are testing. For example, once an extensive set of test classes have been generated, one can easily select and filter out flaws related to LoadConstant errors. Choosing base class files that exhibit more opportunities for the flaw to occur also helps in the tailoring process. Repetition of constructs in the base class files increase the likelihood that random variations will affect the constructs that the bases exercise.
Test suites generated by randomly altering single bytes, while producing very broad coverage, do not produce the deep coverage needed in order for our test suite to be complete. The randomly introduced errors are often easily detectable and do not exercise the more subtle aspects of Java code verification. Subtle flaws or flaws that may exist under complicated circumstances are typically not triggered with this approach. We have looked into applying a variety of automatic generation techniques to create deeper, more complicated test cases. We are currently working on one such technique for automatically generating complicated test cases that stress Java bytecode verification.
Figure 6 shows the amount of time it took to produce, classify and record increasing numbers of mutated class files using our verifier. The time it takes to generate and classify class files using our methodology increases linearly with the size of the test suite and the size and complexity of its base class files. However, each class file in the test suite can be generated independently of other class files in the test suite. Consequently, the generation and classification of the test suite distributes easily onto a network of workstations. The comparison of the test suite with other bytecode verifiers can be distributed as well. As a result, automated mutation testing is scalable across a set of computers.
Our verification tests so far have not included link-time checks, which verify the consistency of a class file with respect to other Java classes. We have recently added this class of link-time checks into our verifier, and we are using the same methodology that we used to uncover bytecode verification flaws to uncover link flaws that may exist in other Java implementations.
Using our test suite, flaws in bytecode verifiers can be found quickly and decisively. Test cases can be crafted and classified, and verifiers can be compared automatically. We believe that our application of N-version programming to software testing is a first in the Java domain, and has helped us uncover a large number of flaws with commercial Java implementations. As a result of the rapid rise of Internet technologies, distributed computing based on Java, and Internet commerce, there are solid financial and personal implications of security flaws. We believe that our test suite and testing methodology help provide a tool for locating and eliminating such problems.