Tuesday, November 30, 2010

How to Debug and Detect Deadlocks

Deadlocks in 2 minutes

Deadlocks are software bugs where multiple threads (typically one of them being the main thread) wait indefinitely on common synchronization primitives (locks) because their resolution is mutually interdependent.

Thread A waits on lock L1 to be released, but it held by thread B, who in turns waits on lock L2 to be released while it is held by thread A.

The scenario above is by far the most common source of deadlocks. The fundamental cause is that the locks are acquired in a different order by two different threads, and depending on an unpredictable race condition, can deadlock at runtime.

For example, the following snippet can cause a deadlock when Property.get() and Property.set() are accessed by two different threads:

class Property {
    static final public String DEFAULT_VALUE = "...";
    HashMap map = new HashMap();
    ArrayList listeners = new ArrayList();

    synchronized void set(String key, String value) {
        synchronized (map) {
            map.put(key, value);
        }
        for (Listener listener : listeners)
            listener.notifyChanged();
    }

    String get(String key) {
        synchronized (map) {
            String value = map.get(key);
            if (value == null) {
                value = DEFAULT_VALUE;
                set(key, value);
            }
            return value;
        }
    }
}

When the method Property.set() is called, the lock for "this" and "map" are acquired (using the synchronized keyword), while when Property.get() is called, the locks "map" and "this" are acquired, in the opposite order than Property.set().

This can cause a random deadlock when Property.get() and Property.set() are called from two different threads.

The example illustrate that:

  1. It is easy to write code that has incorrect lock acquisition order.
  2. It is hard to detect such errors, until a deadlock occurs randomly at runtime, and it might not be consistently reproducible.

Considering large source base where thousand of line of code acquire locks and have a very complex code path, it is often nearly impossible to prevent deadlocks from occurring at runtime, at the developer are faced with reducing the concurrency of the software, leading to poor usability, and sub-optimal performance.

Here is where the deadlock preventer tool comes to help.


The Deadlock Preventer

The deadlock preventer is a tool that dynamically instruments the java bytecode of an application when running in a java virtual machine in order to analyze the lock ordering and detect potential deadlocks without the need to rely on unpredictable and unreliable race conditions.

By running the deadlock preventer while executing the java program (the Eclipse IDE in our case) with a good code coverage (basically covering all features that the customers are susceptible to use), we can ensure that the code does not contain incorrect lock order, and is prone to deadlocks at runtime.

When we run the Property class in the example above with the following junit test code:

@Test
public void testProperty() {
    final Property property = new Property();
    Thread thread = new Thread(new Runnable() {
        @Override
        public void run() {
            property.get("foo");
        }
    });
    thread.start();
    try {
        thread.join();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    try {
        property.set("bar", "value");
        fail("should throw an exception");
    } catch (RuntimeException e) {
        assertTrue(e.getMessage().contains("ERROR"));
    }
}

We obtain a RuntimeException (as validated by the junit test case - something optional for the deadlock preventer to issue), but also the following error report on the console:

***DEADLOCK PREVENTER*** ERROR: Inconsistent locking order while acquiring lock: java.util.HashMap (id=0) in thread: 1 (main)
Property.set(Main.java:812)
Main.testProperty(Main.java:38)
...
(junit stack)
with predecent : Property (id=1)
Property.set(Main.java:811)
Main.testProperty(Main.java:38)
...
(junit stack)
Previously acquired lock: java.util.HashMap (id=0) in thread: 9 (Thread-0)
Property.get(Main.java:818)
Main$1.run(Main.java:28)
java.lang.Thread.run(Thread.java:619)
Previously acquired precedent: Property (id=1)
Property.set(Main.java:811)
Property.get(Main.java:822)
Main$1.run(Main.java:28)
java.lang.Thread.run(Thread.java:619)

The important aspect is that we receive this error report deterministically, each time the code runs, whether or not the code actually deadlocks at runtime given a race condition (something we made impossible with the thread.join() call).

Once the error is detected, it can then be corrected by the developer.

Note also that the deadlock preventer will report the error whether or not the code is actually susceptible to run concurrently or not, which in our case was impossible because we made the main thread block until the second thread finished execution (something that defeats the purpose to have a separate thread to begin with).

The deadlock preventer can also issue warnings when an inverse lock acquisition is detected in the same thread, but never run in different thread - it is always a good programming practice to have consistent lock order.

Using the deadlock preventer agent in a java program does not change any of the application logic and code path, but it does change its performance characteristics. Specifically, acquiring and releasing locks become computationally expensive. Ricks of actual deadlocks are effectively increased too, since the time spent while an incorrect lock acquisition order is progress is longer, especially if the 'Interactive' setting is used (see below).

Therefor, if the java program deadlocks, the last conflict reported by the deadlock preventer agent will be the out that refers to the deadlock obtained.

Using the Deadlock Preventer Agent

The deadlock preventer agent is available in two forms for two different development role.

  1. As a set of eclipse plugins for developers to test existing code, and investigate potential deadlocks.
  2. As a standalone launcher for testers, to run on a test machine in the background while the Eclipse Workbench is performing automated tests.

The Deadlock Preventer Eclipse integration.

agent_view

The deadlock preventer Eclipse integration consist of 3 plugins that contribute a new view in eclipse, as shown above (the view is accessible from the menu "Menu / Show View/Other..." under the "Other/Deadlock Preventer" section).

The Deadlock Preventer view allows the developer to select an existing launch configuration (only launch configurations for "Eclipse Application" as listed), configure settings, and click "Debug". The debugged Eclipse session will be automatically instrumented, and report any error in the view, under the "Conflicts" tab.

The best way to use the tool for an Eclipse developer, is first to add a breakpoint on the RuntimeException, then use the 'Interactive' mode, which allows to skip over some errors that are not of interest, and throw an exception for the relevant one.

Important: Note that if the 'throw exception' mode is selected, it will cause the code path to divert from the normal execution path, so it could cause sub-sequent errors, and bugs that would not otherwise happen.

Correcting incorrect lock acquisition order

Generally speaking, when two locks (A, B) are acquired in an inverse order by two different code paths susceptible to be called concurrently (i.e. by two different threads), a deadlock can occur.

For example, if thread 1 acquires the locks A and then B, while thread 2 acquires the locks B and then A, a deadlock is possible.

Thread 1: A -> B

Thread 2: B -> A

There are only 2 ways to avoid the problem:

  1. Re-order the lock acquisition, so that both threads acquire the locks in the same order

or

  1. Introduce a third lock, C, and make both threads acquire it before acquiring either A or B.

In the last case, the pattern would become:

Thread 1: C-> A –> B

Thread 2: C-> B -> A

The lock "C" would be in effect a "guard", preventing the locks A and B to be acquired concurrently, and avoiding the deadlock. The deadlock preventer agent recognize such patterns, and do not report any errors.

Obviously, the first solution is optimal, since it does not reduce concurrency as the second does, but it is sometimes impractical to re-order lock acquisition order, especially when access or modification of existing code is not possible.

Needless to say, removing (even partially) the locks is probably not a good solution, since the deadlock preventer agent does not report data corruption due to concurrent data access.

Miscellaneous comments

Running a full instrumented Eclipse workbench while the deadlock preventer agent analyzes the lock acquisition order, especially through the debugger, is demanding in computational resources. A recent (preferably 4 core or more) CPU is highly recommended.


Not all causes of deadlocks are detected at the moment by the deadlock preventer engine, such as Semaphore release leaks, or async execution from a background thread to the main thread.

(source code of the deadlock preventer to be posted within a few days)

7 comments:

ipsi said...

Err.... Half your blog post seems to be missing. Was included just fine in my RSS feed, but it's certainly not here.

Serge Beauchamp said...

sorry, I'm making some edit/cleaning up. the full content will be displayed shortly.

scratsh said...

Looks nice!

Where is the standalone application available?
And is there an update site for the three plugins?

Serge Beauchamp said...

Sorry about the source code not being available yet, I'm still preparing it for an EPL release.

I wanted to get the post out today since I submitted a talk a ECon2011 about it (https://www.eclipsecon.org/submissions/2011/view_talk.php?id=2200&search=deadlock), and wanted to get a blog post out for today's deadline.

franck102 said...

I find the log messages a bit confusing. In your example, the two traces that start with "Previously acquired" are in the same order as the two traces that don't?
In other terms, in looks like both "now" and "previously" the same precedent then the same lock were acquired in the same order - what am I missing?

Serge Beauchamp said...

I agree with you, the log is displayed in a structured way that makes it easier to parse (and generate) than to read.

I would be easier if it would be displayed as follows:


***DEADLOCK PREVENTER*** ERROR: Inconsistent locking order while acquiring locks in thread: 1 (main)

Property.set(Main.java:812) <== acquiring: java.util.HashMap (id=0)
Property.set(Main.java:811) <== acquiring: Property (id=1)
Main.testProperty(Main.java:38)

and thread: 9 (Thread-0):

Property.set(Main.java:811) <== acquiring: Property (id=1)
Property.get(Main.java:818) <== acquiring: java.util.HashMap (id=0)
Property.get(Main.java:822)
Main$1.run(Main.java:28)
java.lang.Thread.run(Thread.java:619)


What do you think?

Serge Beauchamp said...

The latest source and build now report the conflict as mentioned above:

https://github.com/sbeauchamp/Deadlock-Preventer/zipball/master