Modern dynamic languages, such as Python and Ruby, offer productivity advantages for many common kinds of applications. But their support for parallelism is typically disappointing.
Current interpreters pessimistically assume that all data is potentially shared. This resembles assumptions in mainstream imperative languages like C and Java. Unlike those languages, however, these languages have high-level operations that will almost certainly break in terrible ways in the presence of machine-level data races. For this reason, interpreters are typically forced to lock all accesses to all data, precluding parallelism even when dealing only with thread-local state. Many interpreters, including CPython and MRI, use a single lock that ensures that only one thread runs at a time. Even in interpreters that incur the complexity necessary to provide fine-grained locking or other global synchronization, single-threaded performance must be sacrificed and programmers are implicitly encouraged to write racy programs. I wrote about these problems earlier in the context of PyPy’s chosen approach.
To address this common shortcoming, I propose a straightforward language-level distinction between thread-local and private data together with a dynamically enforced locking discipline to prevent unsynchronized accesses a priori. The runtime is thus freed from over-synchronizing beyond the program’s specified requirements. As an additional benefit, a locking discipline encourages programmers to reason about synchronization rather than relying on imagined atomicity.
The most basic realization of this direction is a universal distinction between thread-local and shared objects. Object headers are augmented with (immutable) metadata consisting of a “shared” bit and, when that bit is zero, the thread ID of the “owner” thread. The programmer indicates which type of object to construct when calling a constructor. In this sense, shared-ness is a property of objects, not classes (dynamic types). Whenever a thread reads or writes an object, it checks the header. If the object is shared, the thread checks to ensure that a global lock is held. If the object is private, it compares its thread ID to the object’s. If either check fails, a runtime exception is thrown.
Despite its obvious shortcomings, this notional system improves over the status quo in several ways. Thread-local accesses run at full speed (modulo the thread ID check, which will likely be dwarfed by the hash table lookup typically required for field access). The programmer must acquire a global lock before reading or writing shared data, so races are impossible. Because sufficient locking is provided by the program, the interpreter does not need to add any synchronization.
One immediate problem with the proposed system is the lack of read sharing. I propose to address this by introducing a third object state, “shared and immutable.” Reads from such an object are allowed from any thread. Writes result in a runtime error. To construct an object, the programmer makes an immutable copy of another object (which itself may be local or global). Since the object is initialized as immutable, races on the object header are impossible.
The second obvious problem is the single, global lock for shared objects. While this is strictly better than CPython, for example, whose global lock must protect even private and immutable data, it is a clear threat to scalability and composability of performance. To address this, extend the header of shared objects to contain a set of locks that must be held when accessing the object. This set is defined at object creation time and is immutable. (This resembles multithreaded Cyclone but is checked dynamically rather than statically, incurring the accompanying trade-off between safety and expressivity.) Threads maintain a lockset and, before accessing a shared object, check that the object’s lockset is a subset of the thread’s.
Such a system precludes complex locking patterns in which an object’s lockset changes over time or intermediate happens-before edge provides implicit mutual exclusion without any consistent lockset for a given object. But it’s hard to concoct examples where this kind of flexibility is necessary. In fact, the assumption behind classical lockset analysis (e.g., Eraser) is sufficient to justify this restriction.
In my view, these mechanisms should be oriented toward library programmers, who in turn must implement higher-level concurrency constructs. (Very rarely do raw OS threads with shared mutable state represent the correct programming model for applications.) So for the most part, end programmers will not be exposed to locking discipline enforcement. Instead, they will enjoy the benefits of race-free parallel execution in the confines of a more restrictive model—such as actors, fork/join, or CSP—provided by a library. (Recall that interpreter limitations currently prevent even expert programmers from providing efficient parallelism.)
One potential criticism of this direction questions its relevance in a setting where performance is secondary (i.e., scripting languages). Why bother attempting to provide efficiency for languages that are doomed to be slow, where performance has already been sacrificed for productivity? But scalable parallelism is an ideal tool to address the performance shortcomings of dynamic languages. Even if a program pays a constant-factor slowdown for the convenience of a dynamic language, we can help make up for this difference by scaling performance with core count. If efficient parallelism can be made accessible enough, high-productivity languages can begin to close the gap with lower-level languages, in which parallelism is typically error-prone.
Several questions remain before this model is fully-formed:
- What should the language-level API look like for assigning locksets to objects? For example, as in Java, each object could implicitly be associated with a lock.
- How restrictive is the requirement that every object have its lockset determined at creation time? Are there cases when it would be useful to transition an object between thread-local and shared states?
- More specifically, how easy is it to code up common restrictive concurrency models using this approach?
- What is the performance overhead of the object metadata check on every field access?