Thursday, September 5, 2013

Java final fields on x86 a no-op?

I have always enjoyed digging into the details of multi-threaded programming. Despite years of reading about CPU memory consistency models, wait-free and lock-free algorithms, the Java Memory Model (JMM), Java Concurrency in Practice, and so on, I still create multi-threaded programming bugs. It's always a wonderfully humbling experience that reminds me how complicated this problem is.

If you've read the JMM, then you might remember that one of the areas they strengthened was the guarantee of visibility of final fields after the constructor completes. For example,
public class ClassA {
   public final String b;

   public ClassA(String b) {
      this.b = b;
   }
}

ClassA x = new ClassA("hello");

The JMM states that every thread (even threads other than the one that constructed the instance, x, of ClassA) will always observe x.b as "hello" and will never see a value of null (the default value for a reference field), provided the constructor does not leak a reference to the object before it completes.

This is really great! It means that we can create immutable objects just by marking the fields as final, and any constructed instance can automatically be shared amongst threads with no other work to guarantee memory visibility. Woot! The flip side is that if ClassA.b were not marked as final, then you would have no such guarantee, and other threads could observe an x.b == null result (if no other "safe publication" mechanism were employed to get visibility).
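As a minimal sketch of that contrast: ClassA mirrors the class above, while ClassB, PublicationSketch, sharedA, sharedB and readFromOtherThread are hypothetical names invented for this illustration. The non-final ClassB is published through an AtomicReference, which is one of several possible safe-publication mechanisms.

```java
import java.util.concurrent.atomic.AtomicReference;

class PublicationSketch {
    // Immutable: the final field is covered by the JMM's end-of-constructor guarantee.
    static final class ClassA {
        final String b;
        ClassA(String b) { this.b = b; }
    }

    // Hypothetical non-final twin: no such guarantee if published through a data race.
    static final class ClassB {
        String b;
        ClassB(String b) { this.b = b; }
    }

    // A plain (racy) field is tolerable for ClassA only because b is final.
    static ClassA sharedA;

    // ClassB goes through a safe-publication mechanism instead:
    // AtomicReference.set has the memory effects of a volatile write.
    static final AtomicReference<ClassB> sharedB = new AtomicReference<>();

    static String[] readFromOtherThread() {
        sharedA = new ClassA("hello");
        sharedB.set(new ClassB("hello"));
        String[] seen = new String[2];
        // Note: Thread.start() and join() add their own happens-before edges,
        // so this demo cannot exhibit a stale read; it only shows the two idioms.
        Thread reader = new Thread(() -> {
            seen[0] = sharedA.b;
            seen[1] = sharedB.get().b;
        });
        reader.start();
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

Calling PublicationSketch.readFromOtherThread() returns the two field values as seen by the reader thread.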

Well, when the new JMM was created, everyone's favorite JCP member, Doug Lea, wrote a cookbook to help JVM developers implement the new memory model rules. If you read this, then you will see that the "rules" state that JIT compilers should emit a StoreStore memory barrier right before the constructor returns. This StoreStore barrier is a kind of "memory fence". When emitted in the assembly instructions, it means that no memory writes (stores) after the fence can be re-ordered before memory writes that appear before the fence. Note that it doesn't say anything about reads: they can "hop" the fence in either direction.

So what does this mean? Well, think about what the compiler does when you call a constructor:
ClassA x = new ClassA("hello");
  gets broken down into pseudo-code steps of:

1. pointer_to_A = allocate memory for ClassA 
    (mark word, class object pointer, one reference field for String b)
2. pointer_to_A.whatever class meta data = ...
3. pointer_to_A.b = address of "hello" string
4. emit a StoreStore memory barrier per the JMM
5. x = pointer_to_A
The StoreStore barrier at step 4 ensures that any earlier writes (such as those to the class meta-data and to field b) are not re-ordered with the write to x in step 5. This is what makes sure that if x is visible to another thread, that thread can't see x without seeing x.b as well. Without the StoreStore memory barrier, steps 3 and 5 could be re-ordered, the write to main memory for x could show up before the write to x.b, and another CPU core could observe pointer_to_A.b to be 0 (null), which would violate the JMM.
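The five steps can be spelled out by hand in Java. This is only a sketch under assumptions: it uses VarHandle.storeStoreFence(), which exists in Java 9 and later (newer than the JRE discussed in this post), and the names FenceSketch, ClassB and publish are hypothetical. It illustrates the ordering of the writes only; it is not a substitute for final or volatile, because the JMM makes no promise to a thread that reads x through a data race.

```java
import java.lang.invoke.VarHandle;

class FenceSketch {
    // Hypothetical class with a NON-final field, so the JIT adds no barrier for us.
    static final class ClassB {
        String b;
    }

    // Shared reference that other threads might read (step 5 writes it).
    static ClassB x;

    static ClassB publish(String value) {
        ClassB pointerToB = new ClassB();   // steps 1-2: allocate, set up the header
        pointerToB.b = value;               // step 3: write the field
        VarHandle.storeStoreFence();        // step 4: earlier stores may not move below later stores
        x = pointerToB;                     // step 5: publish the reference
        return pointerToB;
    }
}
```

With a final field, the JIT is responsible for step 4; here the fence is written explicitly so its position relative to steps 3 and 5 is visible.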

Great news! However, if you look at that cookbook you'll see some interesting things: (1) a lot of people are writing JVMs on lots of processor architectures! (2) *all* of the memory barriers on x86 are no-ops except the StoreLoad barrier! This means that on x86 the StoreStore memory barrier above is a no-op, and thus no assembly is emitted for it. It does nothing! This is because x86 has a strong memory model, "total store ordering" (TSO). x86 does not re-order stores with other stores, so every core observes another core's writes in the order in which they were issued. Thus, the write in step 5 would never appear before the write in step 3 to any other thread anyway, and there is no need to emit a memory fence. Other CPU architectures have weaker memory models which do not make such guarantees, and there the StoreStore memory fence is necessary. Note that weaker memory models, while perhaps harder or less intuitive to program against, give the CPU more freedom to re-order operations, which can make more efficient use of cache writes and reduce cache coherency work.

Obviously you should continue to write correct code that follows the JMM. BUT it also means (unfortunately or fortunately) that forgetting the final keyword will not lead to bugs if you're running on x86, as I do at work.

To really drill this home and ensure that there are no other side effects that aren't described in the cookbook, I ran the x86 assembly outputter as described here and captured the output of calling the constructor for ClassA (with final on the reference-type field) and the constructor for a ClassB, which was identical to ClassA except without the final keyword on the field. The x86 assembly output is identical. So from a JIT perspective, on x86 (not Itanium, not ARM, etc.), the final keyword has no impact.
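For anyone who wants to repeat the comparison, here is a minimal harness sketch. It is not the harness used for this post: FinalFieldHarness, warmUp and the iteration count are hypothetical choices, and the JIT may inline callA and callB into the loop rather than (or in addition to) compiling them as standalone methods.

```java
class FinalFieldHarness {
    // Field marked final.
    static final class ClassA {
        final String b;
        ClassA(String b) { this.b = b; }
    }

    // Identical, except the field is not final.
    static final class ClassB {
        String b;
        ClassB(String b) { this.b = b; }
    }

    // Two tiny methods whose compiled code can be compared side by side.
    static ClassA callA(String s) { return new ClassA(s); }
    static ClassB callB(String s) { return new ClassB(s); }

    // Calls both methods often enough that the JIT is likely to compile them.
    static long warmUp(int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += callA("hello").b.length();
            checksum += callB("hello").b.length();
        }
        return checksum;
    }

    public static void main(String[] args) {
        // The checksum is printed so the loop has an observable result.
        System.out.println(warmUp(200_000));
    }
}
```

On HotSpot, running this with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly prints the generated code, assuming the hsdis disassembler plugin is installed for your JVM.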

If you're curious what the assembly code looks like, here it is below. Note the lack of any locked instructions. When Oracle's 7u25 JRE emits an x86 StoreLoad memory fence, it does so by emitting lock addl $0x0,(%rsp), which adds zero to the word at the top of the stack. That is a no-op, but since it's locked, it has the effect of a full fence (which meets the criteria of a StoreLoad fence). There are a few different ways in x86 to cause the effect of a full fence, and they are discussed in the OpenJDK mailing list. They observed that, at least on Intel Nehalem, the lock add of 0 was the most compact and efficient.
  0x00007f152c020c60: mov    %eax,-0x14000(%rsp)
  0x00007f152c020c67: push   %rbp
  0x00007f152c020c68: sub    $0x20,%rsp         ;*synchronization entry
                                                ; - com.argodata.match.profiling.FinalConstructorMain::callA@-1 (line 60)
  0x00007f152c020c6c: mov    %rdx,(%rsp)
  0x00007f152c020c70: mov    %esi,%ebp
  0x00007f152c020c72: mov    0x60(%r15),%rax
  0x00007f152c020c76: mov    %rax,%r10
  0x00007f152c020c79: add    $0x18,%r10
  0x00007f152c020c7d: cmp    0x70(%r15),%r10
  0x00007f152c020c81: jae    0x00007f152c020cd6
  0x00007f152c020c83: mov    %r10,0x60(%r15)
  0x00007f152c020c87: prefetchnta 0xc0(%r10)
  0x00007f152c020c8f: mov    $0x8356f3d0,%r11d  ;   {oop('com/argodata/match/profiling/FinalConstructorMain$ClassA')}
  0x00007f152c020c95: mov    0xb0(%r11),%r10
  0x00007f152c020c9c: mov    %r10,(%rax)
  0x00007f152c020c9f: movl   $0x8356f3d0,0x8(%rax)  ;   {oop('com/argodata/match/profiling/FinalConstructorMain$ClassA')}
  0x00007f152c020ca6: mov    %r12d,0x14(%rax)   ;*new  ; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
  0x00007f152c020caa: mov    %ebp,0xc(%rax)     ;*putfield a
                                                ; - com.argodata.match.profiling.FinalConstructorMain$ClassA::<init>@6 (line 17)
                                                ; - com.argodata.match.profiling.FinalConstructorMain::callA@6 (line 60)
  0x00007f152c020cad: mov    (%rsp),%r10
  0x00007f152c020cb1: mov    %r10d,0x10(%rax)   ;*new  ; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
  0x00007f152c020cb5: mov    %rax,%r10
  0x00007f152c020cb8: shr    $0x9,%r10
  0x00007f152c020cbc: mov    $0x7f152b765000,%r11
  0x00007f152c020cc6: mov    %r12b,(%r11,%r10,1)  ;*synchronization entry
                                                ; - com.argodata.match.profiling.FinalConstructorMain::callA@-1 (line 60)
  0x00007f152c020cca: add    $0x20,%rsp
  0x00007f152c020cce: pop    %rbp
  0x00007f152c020ccf: test   %eax,0x9fb932b(%rip)        # 0x00007f1535fda000
                                                ;   {poll_return}
  0x00007f152c020cd5: retq   
  0x00007f152c020cd6: mov    $0x8356f3d0,%rsi   ;   {oop('com/argodata/match/profiling/FinalConstructorMain$ClassA')}
  0x00007f152c020ce0: xchg   %ax,%ax
  0x00007f152c020ce3: callq  0x00007f152bfc51e0  ; OopMap{[0]=Oop off=136}
                                                ;*new  ; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
                                                ;   {runtime_call}
  0x00007f152c020ce8: jmp    0x00007f152c020caa  ;*new
                                                ; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
  0x00007f152c020cea: mov    %rax,%rsi
  0x00007f152c020ced: add    $0x20,%rsp
  0x00007f152c020cf1: pop    %rbp
  0x00007f152c020cf2: jmpq   0x00007f152bfc8920  ;   {runtime_call}
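For contrast, a store to a volatile field is the kind of operation that does need a StoreLoad barrier, so compiling something like the sketch below is where I would expect a locked instruction such as the lock addl described above to show up on x86. The names VolatileStoreSketch, flag, plain, volatileStore and plainStore are hypothetical, and the exact instruction sequence depends on the JVM version.

```java
class VolatileStoreSketch {
    static volatile int flag;   // volatile: a store here is followed by a StoreLoad barrier
    static int plain;           // plain field: an ordinary mov, no fence expected

    // Candidate for a locked instruction in the JIT output on x86.
    static void volatileStore(int value) {
        flag = value;
    }

    // Baseline to compare against: no barrier required.
    static void plainStore(int value) {
        plain = value;
    }
}
```

Comparing the compiled code of volatileStore and plainStore is one way to see what a barrier that is not a no-op looks like on x86.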