Loop Unrolling in the Networking Domain
Loop unrolling is a common and well-understood optimization. The basic insight is that many loops do the same computation many times, each time jumping to a new block of code. By duplicating loop body code and performing basic transformations on induction variable computations, we should be able to do fewer computations and we need fewer jumps, which always tend to hurt performance.
However, loops in the domain of network programming languages have entirely unique significance. Typical hardware targets (e.g., routers) are quite different from traditional CPUs, sometimes requiring statically allocated, straight-line code. As a result, many languages in the networking domain are intended to be terminating, or at least not to be Turing complete. Loops, then, usually ought to be not much more than syntactic sugar for essentially copy-pasting the same code some n
times with a small change at each copy. For this reason, loop-unrolling is less an optimization than a fundamental feature of a language's static and dynamic semantics. It is also needed less for the sake of effeciently repeating the same comuptation thousands of times, and more for finding an acceptable version of the source program which can be loaded onto networking hardware.
This project explores this unique treatment of loop unrolling in the network programming language P4. P4 is used to program network switches and routers, and is one of the most practical and widely used languages in the domain. Most of the language resembles basic C without for
and while
statements. Loops in P4 come in the form of the parser sub-language, which resembles a finite state machine (FSM) and is used by a network device to extract metadata from in-going packets into useful data structures.
Unfortunately, P4's only specification comes in the form of a ~200 page PDF document consisting of vague English descriptions, pseudocode, and examples. For instance, the document acknowledges that targets typically require straight-line code, yet it turns out that the parser sublanguage (the only part that has loops) is Turing complete. All the spec does to address this is make a vague suggestion that this be handled on a per-target basis:
Although parsers may contain loops, provided some header is extracted on each cycle, the packet itself provides a bound on the total execution of the parser. In other words, under these assumptions, the computational complexity of a P4 program is linear in the total size of all headers, and never depends on the size of the state accumulated while processing data (e.g., the number of flows, or the total number of packets processed).
...
Architectures may impose (static or dynamic) constraints on the number of parser states that can be traversed for processing each packet. For example, a compiler for a specific target may reject parsers containing loops that cannot be unrolled at compilation time or that may contain cycles that do not advance the cursor.
It is almost unbelievable that the above two paragraphs comprise the entirety of the P4 spec's treatment of this problem, especially in a Turing complete language which targets hardware that is often not Turing complete. It is clear that this needs further study.
Petr4 is a long-term project which strives to build a formally verified end-to-end compiler for P4, along with formal operational semantcs for the surface syntax. As it exists now, it consists of a front-end with a type-checker and evaluator in OCaml, from which formal semantics have been extracted on pencil-and-paper.
My project's goal is to study the problem of loop-unrolling in P4 by proposing a more clearly-defined static semantics for parser loops and implementing them in conjunction with basic loop unrolling using the Petr4 ASTs. The implementation should not be too concerned with optimization so much as correctness, termination, and static allocation. My current version is in OCaml and runs about 500 loc. By comparison, the implementation of the reference compiler loop unrolling, p4c, runs about 350 loc in C, and does not provide a tool for static analysis. The hope is that this implementation will soon be ported to Coq, where it can be the first of many transformations on P4 programs to be verified by Petr4 and related projects.
Implementation
Static Semantics of Parser Loops
The surface syntax of P4 parsers consists of local declarations followed by states. These states are comprised of a highly restricted set of statements (variable assignments and method calls only) followed by a transition statement, indicating a jump to another state. Transitions may be conditional, branching to arbitrarily many different states. The result is a system in which arbitrary irreducible control flow arises quite easily.
However, the common case is to use a natural loop in the parser to populate various data structures provided by the language. In particular, parser loops that extract from the packet on each iteration are guaranteed to terminate due to the fact that the packet has a finite length. Consider the following P4 program:
#include <core.p4>
#include <v1model.p4>
struct metadata { }
header headers {
bit<8> nextAddr;
bit<8> bos;
}
parser MyParser(packet_in packet,
out headers[12] hdrs,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
state start {
packet.extract(hdrs.next);
transition select(hdrs.last.bos) {
8w0 : start;
default : accept;
}
}
}
control MyChecksum(inout headers[12] hdr, inout metadata meta) {
apply { }
}
control MyIngress(inout headers[12] hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
apply {
standard_metadata.egress_spec = 9;
}
}
control MyEgress(inout headers[12] hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
apply { }
}
control MyDeparser(packet_out packet, in headers[12] hdr) {
apply { }
}
//this is declaration
V1Switch(MyParser(), MyChecksum(), MyIngress(), MyEgress(), MyChecksum(), MyDeparser()) main;
The parser has a loop comprised of one state, which extracts from the packet every iteration. In a general purpose programming language, it is natural to conclude that this control flow contains a potentially infinite loop. Indeed, what if the incoming packet never has a non-zero bos
value? However, the particular semantics of P4 guarantee that this loop terminates. The packet is a finite buffer, and extracting from the packet increments a pointer to an index in the buffer. What's more, incrementing the pointer beyond the size of the buffer raises an error which immediately causes the current parser to exit.
With these considerations, I am able to provide a concrete proposal for a static analysis tool of parser loops. Accepted parsers should meet the following conditions:
- They should have reducible control flow
- Every distinct natural loop should include a block which both extracts from the packet and dominates all blocks in the loop whose successors include the loop header
These conditions are enough to exclude programs such as the following two examples:
#include <core.p4>
#include <v1model.p4>
struct metadata { }
struct headers { }
parser MyParser(packet_in packet,
out headers hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
state start {
transition select(1w0) {
0 : first;
1 : second;
}
}
state first {
transition second;
}
state second {
transition first;
}
}
control MyChecksum(inout headers hdr, inout metadata meta) {
apply { }
}
control MyIngress(inout headers hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
apply {
standard_metadata.egress_spec = 9;
}
}
control MyEgress(inout headers hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
apply { }
}
control MyDeparser(packet_out packet, in headers hdr) {
apply { }
}
//this is declaration
V1Switch(MyParser(), MyChecksum(), MyIngress(), MyEgress(), MyChecksum(), MyDeparser()) main;
and
#include <core.p4>
#include <v1model.p4>
struct metadata { }
struct headers { }
parser MyParser(packet_in packet,
out headers hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
state start {
transition start;
}
}
control MyChecksum(inout headers hdr, inout metadata meta) {
apply { }
}
control MyIngress(inout headers hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
apply { }
}
control MyEgress(inout headers hdr,
inout metadata meta,
inout standard_metadata_t standard_metadata) {
apply { }
}
control MyDeparser(packet_out packet, in headers hdr) {
apply { }
}
//this is declaration
V1Switch(MyParser(), MyChecksum(), MyIngress(), MyEgress(), MyChecksum(), MyDeparser()) main;
The former is a canonical example of irreducible control flow, with a loop that is not formed around a backedge. The latter is a simple example of a reducible control flow with loops that do not consume the packet on each iteration.
My implementation runs these checks, providing the user with the options to fail when they are not satisfied or merely to emit a warning and produce unmodified code. Surprisingly, this turned out to be the most computationally heavy part of the implementation, not the actual transformation. While natural loops are easily identified about a backedge in the control flow, arbitrary loops require a full traversal of the control flow graph (CFG). I used Tarjan's Strongly Connected Components Algorithm, which boasts linear runtime, to compute strongly-connected components of an arbitrary CFG. I then check that they all satisfied the criteria for forming a natural loop: that there exists a unique header node h
which dominates all other nodes in the strongly-connected component and which is the only node with incoming edges from outside of the loop. Having identified the natural loops, I then traverse them to ensure that every path back to the header consumes the packet. Again, this ensures termination as packets are guaranteed to be finite.
Loop Transformation
The next phase of my implementation transforms the CFG. I implement a basic form of loop unrolling which merely duplicates the loop body n
times (n
is measured in copies of the loop body). I chose to skip the steps of identifying and collapsing induction variable computations in this version, since the domain is not particularly concerned with the efficiency of these loops. I attempt to use some hueristics (below) to compute n
statically, and failing this fall back to a default value for n
provided by the user. The user also provides a flag indicating whether the unrolling should result in a control flow that still contains loops with the body duplicated, or if they prefer that the control flow raise an error after the n
iterations. This will be useful for a user who is able to determine based on their target devices how many iterations are truly needed. Additionally, the transformation in future versions in Coq may be verifiable relative to an assumption that such an n
exists.
One particularly interesting hueristic I was able to identify for the computation of n
was related to header stacks. Header stacks are a data type in P4 consisting of an array and a pointer to an index in the array. Fortunately, extracting from the packet into a header stack has the side effect of incrementing the index pointer. Even better, the sizes of header stacks is a compile-time known constant used by the type system, and the dynamic semantics of incrementing the stack pointer past the stack size is clearly defined. Therefore, parser loops extracting into header stacks are guaranteed to run a compile-time known constant number of iterations before terminating. Loops which extract to header stacks every iteration can be unrolled to n+1
copies of straight-line code, where n
is the size of the stack, and the semantics ought to be identical (this clearly merits a Coq verification in the future).
Hueristics for computing n
statically require further study, and indeed should never be able to acheive completeness due to decidability concerns. As for my current version, it is limited in capacity for completeness because it uses the raw AST output by the parser. While this AST is preferable because it provides a pretty printer back to P4 source code, I am unable to support the header stack hueristic. To do so, I will need to switch to the typed AST, fully decorated with typing information by the type checker, which will be possible once pretty printing is supported. This done, computing n
for header stacks becomes trivial, whereas it is unnecessarily difficult in my current version.
Evaluation
I use three metrics for evaluation: correctness of the code transformation and the soundness and completeness of the static analysis tools.
For correctness, I begin with a minimal collection of toy p4 programs which I developed for the sake of test-driven development of loop unrolling. Residing in petr4/unroll-test
, these are edge-case control flow programs which are primarily expected-fail for the static analysis. In addition to these, Petr4 conveniently includes a robust test harness which runs the front-end on the entirety of p4c's benchmark suite. Combined with some custom test cases from the front-end development, this runs over 100 realistic p4 programs, each on several realistc packets, all in under a few seconds. In order to acquire reasonable evidence of the correctness of my transformation, I merely had to insert a call to my unroller in this test harness before type checking. After some work, my tool achieves the same correctness as the Petr4's evaluater on unmodified code. The only limitation of this approach is that it does not include any negative tests, so I am much more sure of the correctness of my code transformation tool than I am of my static analysis tool.
Soundness would require that every non-terminating P4 program be rejected by my static analysis. With a tool that rejects irreducible CFGs and loops that do not consume the packet, I can be reasonably comfortable with the soundness of my analysis. I group exceptions to soundness in my current version into two categories. The first consists of pending edge cases that are trivial upon switching to the typed AST. For example, I currently treat any call to extract
as consuming the packet, whereas calling extract
on an empty datatype does not actually consume the packet. Fixing this hole is difficult in the untyped AST but trivial in the typed one. The second category might be described as small semantic edge cases which would require a more computationally heavy analysis to resolve. The primary example of these is that I treat calls to advance
(consume a certain number of bits in the packet without writing them anywhere) as consuming the packet without checking that the argument is non-zero. Doing so would likely require something even as involved as a dataflow analysis, and could be elided by porting my tool to slightly lower-level stages of the Petr4 pipeline where this is not a concern.
Completeness, as discussed, is unattainable in practice. Difficulties already arise due to the fact that less is known about analyzing and transforming irreducible control flow, a task that would be necessary in order to increase completeness. However, I do view my hueristics for statically computing the number of unrollings as a promising step towards completeness, especially where the most common P4 use cases are concerned.