The new atomic classes are the hidden gems of java.util.concurrent

Brian Goetz (brian@quiotix.com), Principal Consultant, Quiotix

Summary: Until JDK 5.0, it was not possible to write wait-free, lock-free algorithms in the Java language without using native code. The addition of the atomic variable classes in java.util.concurrent changes that situation. Follow along with concurrency expert Brian Goetz as he explains how these new classes have enabled the development of highly scalable nonblocking algorithms in the Java language.

Fifteen years ago, multiprocessor systems were highly specialized systems costing hundreds of thousands of dollars (and most of them had two to four processors). Today, multiprocessor systems are cheap and plentiful, nearly every major microprocessor has built-in support for multiprocessing, and many support dozens or hundreds of processors.

To exploit the power of multiprocessor systems, applications are generally structured using multiple threads. But as anyone who's written a concurrent application can tell you, simply dividing up the work across multiple threads isn't enough to achieve good hardware utilization; you must ensure that your threads spend most of their time actually doing work, rather than waiting for more work to do, or waiting for locks on shared data structures.

The problem: coordination between threads

Few tasks can be truly parallelized in such a way as to require no coordination between threads. Consider a thread pool, where the tasks being executed are generally independent of each other. If the thread pool feeds off a common work queue, then the process of removing elements from or adding elements to the work queue must be thread-safe, and that means coordinating access to the head, tail, or inter-node link pointers. And it is this coordination that causes all the trouble.

The standard approach: locking

The traditional way to coordinate access to shared fields in the Java language is to use synchronization, ensuring that all access to shared fields is done while holding the appropriate lock. With synchronization, you are assured (assuming your class is properly written) that whichever thread holds the lock that protects a given set of variables will have exclusive access to those variables, and changes to those variables will become visible to other threads when they subsequently acquire the lock. The downside is that if the lock is heavily contended (threads frequently ask to acquire the lock when it is already held by another thread), throughput can suffer, as contended synchronization can be quite expensive. (Public service announcement: uncontended synchronization is now quite inexpensive on modern JVMs.) Another problem with lock-based algorithms is that if a thread holding a lock is delayed (due to a page fault, scheduling delay, or other unexpected delay), then no thread requiring that lock may make progress.

Volatile variables can also be used to store shared variables at a lower cost than that of synchronization, but they have limitations. While writes to volatile variables are guaranteed to be immediately visible to other threads, there is no way to render a read-modify-write sequence of operations atomic, meaning, for example, that a volatile variable cannot be used to reliably implement a mutex (mutual exclusion lock) or a counter.

Implementing counters and mutexes with locking

Consider the development of a thread-safe counter class, which exposes operations to get, increment, and decrement the counter value. Listing 1 shows how such a class might be implemented using locking (synchronization).

Listing 1. A synchronized counter class
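A minimal sketch of such a counter, with synchronized get, increment, and decrement methods (the class and method names here are illustrative):

public class SynchronizedCounter {
    private int value;

    // All access goes through synchronized methods, so reads and
    // read-modify-write updates are atomic and visible across threads.
    public synchronized int getValue() {
        return value;
    }

    public synchronized int increment() {
        return ++value;
    }

    public synchronized int decrement() {
        return --value;
    }
}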
The atomic read-modify-write combination shows up in many concurrent algorithms. The code in Listing 2 implements a simple mutex; its acquire method is also an atomic read-modify-write operation.

Listing 2. A synchronized mutex class
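A minimal sketch of such a mutex: a try-acquire style lock whose acquire method tests and claims ownership in one synchronized (and therefore atomic) step. The names and details are illustrative, and this is a sketch, not a production lock:

public class SynchronizedMutex {
    private Thread owner;

    // Acquiring the lock is a read-modify-write on "owner": test that it
    // is free, then record the current thread as the owner.
    public synchronized boolean acquire() {
        if (owner != null)
            return false;               // already held by some thread
        owner = Thread.currentThread();
        return true;
    }

    public synchronized void release() {
        if (owner == Thread.currentThread())
            owner = null;
    }
}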
The counter class in Listing 1 works reliably, and in the presence of little or no contention will perform fine. However, under heavy contention, performance will suffer dramatically, as the JVM spends more time dealing with scheduling threads and managing contention and queues of waiting threads and less time doing real work, like incrementing counters. You might recall the graphs from last month's column showing how throughput can drop dramatically once multiple threads contend for a built-in monitor lock using synchronization. While that column showed how the new ReentrantLock class is a more scalable replacement for synchronization, locking itself has some deeper drawbacks.

With locking, if one thread attempts to acquire a lock that is already held by another thread, the thread will block until the lock becomes available. This approach has some obvious drawbacks, including the fact that while a thread is blocked waiting for a lock, it cannot do anything else. This scenario could be a disaster if the blocked thread is a high-priority task (a hazard known as priority inversion). Using locks has some other hazards as well, such as deadlock (which can happen when multiple locks are acquired in an inconsistent order). Even in the absence of hazards like these, locks are simply a relatively coarse-grained coordination mechanism, and as such are fairly "heavyweight" for managing a simple operation such as incrementing a counter or updating who owns a mutex. It would be nice if there were a finer-grained mechanism for reliably managing concurrent updates to individual variables; and on most modern processors, there is.

Hardware synchronization primitives

As stated earlier, most modern processors include support for multiprocessing. While this support, of course, includes the ability for multiple processors to share peripherals and main memory, it also generally includes enhancements to the instruction set to support the special requirements of multiprocessing. In particular, nearly every modern processor has instructions for updating shared variables in a way that can either detect or prevent concurrent access from other processors.

The first processors that supported concurrency provided atomic test-and-set operations, which generally operated on a single bit. The most common approach taken by current processors, including Intel and Sparc processors, is to implement a primitive called compare-and-swap, or CAS. (On Intel processors, compare-and-swap is implemented by the cmpxchg family of instructions. PowerPC processors have a pair of instructions called "load and reserve" and "store conditional" that accomplish the same goal; MIPS has a similar pair, except the first is called "load linked.")

A CAS operation includes three operands: a memory location (V), the expected old value (A), and a new value (B). The processor will atomically update the location to the new value if the value that is there matches the expected old value; otherwise it will do nothing. In either case, it returns the value that was at that location prior to the CAS instruction. (Some flavors of CAS will instead simply return whether or not the CAS succeeded, rather than fetching the current value.) CAS effectively says "I think location V should have the value A; if it does, put B in it; otherwise, don't change it but tell me what value is there now."

The natural way to use CAS for synchronization is to read a value A from an address V, perform a multistep computation to derive a new value B, and then use CAS to change the value of V from A to B. The CAS succeeds if the value at V has not been changed in the meantime.
Instructions like CAS allow an algorithm to execute a read-modify-write sequence without fear of another thread modifying the variable in the meantime, because if another thread did modify the variable, the CAS would detect it (and fail) and the algorithm could retry the operation. Listing 3 illustrates the behavior (but not the performance characteristics) of the CAS operation; the value of CAS is that it is implemented in hardware and is extremely lightweight (on most processors):

Listing 3. Code illustrating the behavior (but not performance) of compare-and-swap
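A sketch of CAS semantics expressed in Java, using synchronization only to make the simulation itself atomic:

public class SimulatedCAS {
    private int value;

    public synchronized int getValue() {
        return value;
    }

    // Atomically set value to newValue if it currently equals
    // expectedValue; either way, return the value that was present
    // before the call, so the caller can tell whether the CAS succeeded.
    public synchronized int compareAndSwap(int expectedValue, int newValue) {
        int oldValue = value;
        if (oldValue == expectedValue)
            value = newValue;
        return oldValue;
    }
}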
Implementing counters with CAS

Concurrent algorithms based on CAS are called lock-free, because threads do not ever have to wait for a lock (sometimes called a mutex or critical section, depending on the terminology of your threading platform). Either the CAS operation succeeds or it doesn't, but in either case, it completes in a predictable amount of time. If the CAS fails, the caller can retry the CAS operation or take other action as it sees fit. Listing 4 shows the counter class rewritten to use CAS instead of locking:

Listing 4. Implementing a counter with compare-and-swap
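A sketch of a CAS-based counter built on the SimulatedCAS class above; the increment loop simply retries until its CAS succeeds:

public class CasCounter {
    private final SimulatedCAS value = new SimulatedCAS();

    public int getValue() {
        return value.getValue();
    }

    public int increment() {
        int oldValue = value.getValue();
        // If another thread changed the value between our read and our
        // CAS, the CAS returns a different old value; re-read and retry.
        while (value.compareAndSwap(oldValue, oldValue + 1) != oldValue)
            oldValue = value.getValue();
        return oldValue + 1;
    }
}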
Lock-free and wait-free algorithms

An algorithm is said to be wait-free if every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. By contrast, a lock-free algorithm requires only that some thread always make progress. (Another way of defining wait-free is that each thread is guaranteed to correctly compute its operations in a bounded number of its own steps, regardless of the actions, timing, interleaving, or speed of the other threads. This bound may be a function of the number of threads in the system; for example, if ten threads each execute the CAS-based increment of Listing 4 once, in the worst case a given thread will have to retry at most nine times before its increment completes.)

Substantial research has gone into wait-free and lock-free algorithms (also called nonblocking algorithms) over the past 15 years, and nonblocking algorithms have been discovered for many common data structures. Nonblocking algorithms are used extensively at the operating system and JVM level for tasks such as thread and process scheduling. While they are more complicated to implement, they have a number of advantages over lock-based alternatives -- hazards like priority inversion and deadlock are avoided, contention is less expensive, and coordination occurs at a finer level of granularity, enabling a higher degree of parallelism.

Until JDK 5.0, it was not possible to write wait-free, lock-free algorithms in the Java language without using native code. With the addition of the atomic variable classes in the java.util.concurrent.atomic package, that has changed. The atomic variable classes can be thought of as a generalization of volatile variables, extending the volatile concept to support atomic conditional compare-and-set updates. While the atomic variable classes might look superficially like the synchronized counter of Listing 1, under the hood their operations are turned into the hardware synchronization primitives the platform provides.

Finer grained means lighter weight

A common technique for tuning the scalability of a concurrent application that is experiencing contention is to reduce the granularity of the lock objects used, in the hopes that more lock acquisitions will go from contended to uncontended. The conversion from locking to atomic variables achieves the same end -- by switching to a finer-grained coordination mechanism, fewer operations become contended, improving throughput.

Atomic variables in java.util.concurrent

Nearly all the classes in the java.util.concurrent package use atomic variables instead of synchronization, either directly or indirectly. These classes could not have been constructed without the JVM improvements in JDK 5.0, which exposed (to the class libraries, but not to user classes) an interface to access hardware-level synchronization primitives. The atomic variable classes, and other classes in java.util.concurrent, in turn expose these features to user classes.

Achieving higher throughput with atomic variables

Last month, I looked at how the ReentrantLock class compared to built-in synchronization, using a benchmark in which each thread repeatedly rolled a die with a thread-safe pseudorandom number generator (PRNG). Listing 5 shows the PRNG implementation using synchronization, and the alternate implementation using CAS. Note that the CAS must be executed in a loop, because it may fail one or more times before succeeding, which is always the case with code that uses CAS.

Listing 5. Implementing a thread-safe PRNG with synchronization and atomic variables
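A sketch of the two implementations. The benchmark's actual seed-update function isn't reproduced here, so calculateNext below (a simple linear congruential step) is a stand-in, and the class names are illustrative:

import java.util.concurrent.atomic.AtomicInteger;

public class PseudoRandom {
    // Stand-in seed-update function: a simple linear congruential step.
    static int calculateNext(int prev) {
        return prev * 1103515245 + 12345;
    }
}

class PseudoRandomUsingSynch {
    private int seed;

    public PseudoRandomUsingSynch(int s) { seed = s; }

    // The read-modify-write of the seed is protected by the object's lock.
    public synchronized int nextInt(int n) {
        int s = seed;
        seed = PseudoRandom.calculateNext(s);
        return Math.abs(s % n);
    }
}

class PseudoRandomUsingAtomic {
    private final AtomicInteger seed;

    public PseudoRandomUsingAtomic(int s) {
        seed = new AtomicInteger(s);
    }

    // Retry the compare-and-set until no other thread has updated the
    // seed between our read and our write.
    public int nextInt(int n) {
        for (;;) {
            int s = seed.get();
            int nextSeed = PseudoRandom.calculateNext(s);
            if (seed.compareAndSet(s, nextSeed))
                return Math.abs(s % n);
        }
    }
}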
The graphs in Figures 1 and 2 below are similar to those shown last month, with the addition of another line for the atomic-based approach. The graphs show throughput (in rolls per second) for random number generation using various numbers of threads on an 8-way Ultrasparc3 box and a single-processor Pentium 4 box. The numbers of threads in the tests are deceptive; these threads exhibit far more contention than is typical, so they show the break-even between ReentrantLock and AtomicLong at a far lower number of threads than a more realistic program would.

Figure 1. Benchmark throughput for synchronization, ReentrantLock, fair Lock, and AtomicLong on an 8-way Ultrasparc3

Figure 2. Benchmark throughput for synchronization, ReentrantLock, fair Lock, and AtomicLong on a single-processor Pentium 4

Most users are unlikely to develop nonblocking algorithms using atomic variables on their own; they are more likely to use the versions provided in java.util.concurrent, such as ConcurrentLinkedQueue. Developers may find use for atomic variables directly as a higher-performance replacement for shared counters, sequence number generators, and other independent shared variables that otherwise would have to be protected by synchronization (the sequence-generator case is sketched at the end of this article).

JDK 5.0 is a huge step forward in the development of high-performance, concurrent classes. By exposing new low-level coordination primitives internally, and providing a set of public atomic variable classes, it now becomes practical, for the first time, to develop wait-free, lock-free algorithms in the Java language. The classes in java.util.concurrent in turn build on these atomic variable classes to provide nonblocking algorithms with scalability that their lock-based counterparts cannot match.
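Returning to the sequence-generator example mentioned above, a minimal sketch using AtomicLong (the class and method names are illustrative):

import java.util.concurrent.atomic.AtomicLong;

public class SequenceGenerator {
    private final AtomicLong sequence = new AtomicLong();

    // getAndIncrement is an atomic read-modify-write backed by a hardware
    // primitive such as CAS, so no lock is ever acquired or contended.
    public long next() {
        return sequence.getAndIncrement();
    }
}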