
Poll me, maybe?

The java.util.Queue interface has the ever important poll/offer methods. The Javadoc for poll documents it thus: "Retrieves and removes the head of this queue, or returns null if this queue is empty."
This allows the caller to assume that if the poll method returns null the queue is empty, and in a single threaded world this is quite straightforward. A problem for multi-producer lock-free concurrent queues, however, is that the queue may not be empty even though no element is visible to the consumer. "WOT?" I hear you cry, let's look at an example:
The offer/poll methods above are those of an MPSC linked queue as originated by Mr. Vyukov (MPSC - Multi-Producer, Single-Consumer) and ported into Java by your humble servant (though others have ported this algo before me: Akka/RxJava/Netty, and others...).
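(The original snippet is not reproduced here; below is a minimal sketch in its spirit, using plain AtomicReference/AtomicReferenceFieldUpdater plumbing rather than the JCTools internals. The line numbers mentioned further down refer to the original post's snippet, not to this sketch.)

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

class MpscLinkedQueueSketch<E> {
    static final class Node<E> {
        @SuppressWarnings("rawtypes")
        static final AtomicReferenceFieldUpdater<Node, Node> NEXT =
            AtomicReferenceFieldUpdater.newUpdater(Node.class, Node.class, "next");
        private E value;
        private volatile Node<E> next;

        Node(E value) { this.value = value; }
        void soNext(Node<E> n) { NEXT.lazySet(this, n); } // ordered store of next
        Node<E> lvNext()       { return next; }           // volatile load of next
        E getAndNullValue()    { E v = value; value = null; return v; }
    }

    private final AtomicReference<Node<E>> producerNode;
    private Node<E> consumerNode; // single consumer, so a plain field will do for this sketch

    MpscLinkedQueueSketch() {
        Node<E> stub = new Node<>(null);
        producerNode = new AtomicReference<>(stub);
        consumerNode = stub;
    }

    public boolean offer(E e) {
        if (e == null) throw new NullPointerException();
        Node<E> n = new Node<>(e);
        Node<E> prevProducerNode = producerNode.getAndSet(n); // the xchgProducerNode step
        // between the getAndSet above and the soNext below, prevProducerNode.next is null:
        // the chain is momentarily broken
        prevProducerNode.soNext(n);
        return true;
    }

    public E poll() {
        Node<E> nextNode = consumerNode.lvNext();
        if (nextNode != null) {
            E e = nextNode.getAndNullValue();
            consumerNode = nextNode;
            return e;
        }
        return null; // may be returned while the queue is not actually empty, see below
    }
}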
Where be dragons?

Multiple Producers Break The Chain

How the above algorithm works for multiple producers:

  • On line 17 we use the same mechanism offered by AtomicReference.getAndSet to atomically swap the producerNode with the new node.
  • This means no producerNode is ever returned more than once, preventing producer threads from overwriting the producerNode reference field and that node's reference to the next node.
  • We use the same mechanism offered by AtomicReference.lazySet to set this new node as the next node from the previous.
  • On the consumer thread side we process nodes by grabbing the next node and replacing the consumerNode with it after we pluck out its value.

The problem is in the offer method lines 17-19 where we first xchgProducerNode and then set the new node (now the producerNode) to be the next node after the previous producerNode. For a short moment the old producer node has null as its next element; the chain is broken. This short moment is of undefined length: depending on scheduler pressure and the whimsical sense of humour of the gods, the producer thread which got interrupted at line 18 (puff) may be held up for a while.
And yet, while this producer thread is sleeping (what dreams may come?), other producers can make progress. They may add any number of nodes to the queue, each linking the producerNode to the next. The producerNode can be at any 'distance' from that suspended node on that suspended thread waiting for its next field to be patched through, and there may be any number of further broken links in the way.
Looking at the poll method in this light the problem becomes more obvious. If a node may have its next set to null due to the timing described above, then poll may return null when the queue is in fact not empty.

Unbreaking The Chain

To fix this issue we could do the following:
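(A hedged sketch of such a fix, continuing the sketch above; this follows the spirit of what the text describes, not the original code.)

    public E poll() {
        Node<E> currConsumerNode = consumerNode;
        Node<E> nextNode = currConsumerNode.lvNext();
        if (nextNode == null && currConsumerNode != producerNode.get()) {
            // the chain is broken but the queue is not empty:
            // spin until the suspended producer patches next through
            while ((nextNode = currConsumerNode.lvNext()) == null) {
                // busy spin
            }
        }
        if (nextNode != null) {
            E e = nextNode.getAndNullValue();
            consumerNode = nextNode;
            return e;
        }
        return null; // only when the queue is genuinely empty
    }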

And indeed this is how other implementors have chosen to tackle the issue (in spirit, if not in code).
In doing so we have given up quite a lot however:
  1. poll() is no longer wait free. Which is a shame, I quite like wait freedom.
  2. The consumer thread is volunteered into spin waiting on the next field to become visible.
  3. In the event of hitting null we now read the producerNode. This introduces the potential for a cache miss. This is a big problem to my mind as this cache miss has an unknown and potentially very large cost.
  4. The producerNode read from the consumer thread will have a negative impact on the producer threads contending to write to it. This has been previously explored here. This will be particularly bad when the consumer is spinning on the poll() method while waiting for the next value.
Is it worth it? I'm not convinced... Given that a Queue already mandates an isEmpty method, could we not settle for a relaxed definition of poll? Given that the above observation of queue emptiness is also (by its lock-free and concurrent nature) imprecise, should we really sacrifice performance/scalability for it?
At the moment I am leaning towards offering weaker guarantees for the lock-free queues offered as part of JCTools, but I'm hoping to get some feedback from prospective users on how important that part of the queue interface is to their purposes.

NOTE: The same issue exists for the offer method when we look at an SPMC queue, as discussed in this issue.

The many meanings of volatile read and write

Just a quick note on the topic as I find I keep having this conversation. Volatile fields in Java provide three distinct features:
  1. Atomicity: volatile long and double fields are guaranteed to be atomically written. This is not otherwise the case for long and double. See JLS section 17.7 for more details. Also see this excellent argument made by Shipilev on why all fields could be made atomic with no significant downside.
  2. Store/Load to/from memory: a normal field load may get hoisted out of a loop and be done once, a volatile field is prevented from being optimized that way and will be loaded on each iteration. Similarly stores are to memory and will not be optimized.
  3. Global Ordering: A volatile write acts as a StoreLoad barrier thus preventing previous stores from being reordered with following loads. A volatile read acts as a LoadLoad barrier and prevents following loads from happening before it. This is opposed to the meaning of volatile in C/C++ where only other volatile loads/stores are prevented from reordering.
I would personally prefer to have these more refined tools at my disposal for when I need them, but volatile is a 3-in-1 sort of tool...
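A minimal illustration of point 2 above (the class and names are mine, not from any particular library): without volatile the JIT may hoist the read of the flag out of the loop and spin forever; with volatile the flag is re-loaded on every iteration.

class StoppableWorker implements Runnable {
    volatile boolean stop; // remove 'volatile' and the loop may never observe the change

    @Override
    public void run() {
        long spins = 0;
        while (!stop) { // volatile load, performed on every iteration
            spins++;
        }
        System.out.println("stopped after " + spins + " spins");
    }
}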

What about AtomicLong.lazySet?

For those of you wondering (as I did) whether or not AtomicLong.lazySet (A.K.A Unsafe.putOrderedLong) provides atomicity, it would seem the answer is yes. Digging through the JVM source code for the putOrderedLong intrinsic yields the following nugget:
Look at that perfectly pleasant C++ code! The store is indeed made atomic. We can further test this observation by looking at the generated assembly for a 32 vs 64 bit JVM:
There you go! Atomicity is preserved! Hoorah!

The volatile read surprise

On occasion, and for perfectly good reasons, I find myself trying to answer such deep existential questions as this one. Which is faster:
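(The benchmark code itself is not reproduced here; a hedged reconstruction follows. goodOldLoop and the bunn/fox names appear later in the text; the name of the enhanced-for variant, sweetLoop, is my guess, and the array sizing via @Param is omitted for brevity.)

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class LoopBenchmark {
    int[] bunn = new int[1000];

    @Benchmark
    public void goodOldLoop(Blackhole fox) {
        for (int i = 0; i < bunn.length; i++) { // the explicit, old school for loop
            fox.consume(bunn[i]);
        }
    }

    @Benchmark
    public void sweetLoop(Blackhole fox) {      // the syntactic sugar (enhanced for) version
        for (int b : bunn) {
            fox.consume(b);
        }
    }
}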
As you can see from the sample I turn to JMH to help me resolve such questions. If you know not what JMH is you may enjoy reading previous posts on the subject (start with this one). In short it is a jolly awesome framework for benchmarking java:
  • @Benchmark annotated methods will get benchmarked
  • The framework will pass in a Blackhole object that will pretend to 'consume' the values you pass into it and thus prevent the JIT compiler from dead code eliminating the above loops to nothing.
Assuming we are all on the same page with this snippet above, let the game begin!

Yummy yummy sugar!

So I ran the above benchmarks on a heavy duty benchmarking machine and got the following results for different array sizes:
It sure looks like that syntactic sugar is much better! more than twice as fast! awesome?

Must give us pause

At this point we could either:
  1. Declare syntactic sugar the clear winner and never write the old style for loops ever again 'cause they be slow like everything old! we hates them old loops! hates them!
  2. Worry that we are being a bit stupid
I get very little sleep and I was never very bright, so I'll go for 2. 
This benchmark result seems off, it's not what we expect. It would make sense for the JVM to make both loops the same, and yet they seem to work out very differently. Why, god? whhhhhhhy?
The above benchmark is a tiny piece of code, and is a fine example of a nano-benchmark (to use the term coined by Shipilev for benchmarks of nano-second scale). These are pretty suspect benchmarks at the best of times, so you want to be quite alert when trying to make sense of them. When stuff doesn't make sense it is best to see what the JIT compiler made of your code and hit the assembly! Printing the JIT generated assembly is a neat party trick (sure to win you new friends and free drinks) and results in loads of funky text getting thrown at you. I was going to do a whole walk through the assembly but I have promises to keep and miles to walk before I sleep (some other time, I promise). So let's just skip to the WTF moment.

Into the hole

The assembly code for the goodOldLoop is long and painful to read through, and that in itself is a clue. Once you work out the control flow you'll sit there scratching your head and wondering. The thing that stands out (when the assembly smoke clears) is that bunn is loaded on every iteration, bunn.length is loaded and an array boundary check happens. This is surely a terrible way to interpret a for loop... 
The culprit turns out to be a volatile read in Blackhole.consume:

The above method ensures that a consumed value will not be subject to DCE even if it is completely predictable. The values for b1, b2 being volatile cannot be assumed to stay the same and so require re-examination. The side effect is however that we now have a volatile load in the midst of our for loop. A volatile load of one value requires the JVM to perform subsequent loads from memory (rather than reuse cached values) to preserve happens-before relationships; in this case the field bunn is reloaded on every iteration of the loop. If bunn may have changed then its length may have also changed... sadness follows. To test this theory we can make a third loop:
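(A hedged reconstruction of that third loop; goodOldLoopReturns is the name used later in the text, the local variable name is mine.)

@Benchmark
public void goodOldLoopReturns(Blackhole fox) {
    int[] sunn = bunn; // snapshot the field into a local, much like the enhanced for loop does
    for (int i = 0; i < sunn.length; i++) {
        fox.consume(sunn[i]);
    }
}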
This performs much like the sweet syntactic sugar version:

Lessons learnt?

  • Nano benchmarks and their results are hard to interpret. When in doubt read the assembly, when not in doubt smack yourself to regain doubt and read the assembly. It's very easy for a phenomenon you are not looking to benchmark to slip into the benchmark.
  • Sugar is not necessarily bad for you. In the above case the syntactic sugar interpretation by the JVM was a better match to our intuition than the explicit old school loop. By being explicit we inhibited optimisation, despite intending the same thing. The enhanced for loop, as the JLS calls it, is semantically different from the basic for loop in that it assumes some sort of snapshot iterator taken at the beginning of the loop and used throughout, which for primitive arrays means taking the form used in goodOldLoopReturns.
  • Blackhole.consume is also a memory barrier, and these come with some side effects you may not expect. In larger benchmarks these may be negligible but in nano benchmarks every little thing counts. This is a fine use case for a 'weak' volatile read, one which requires a memory read but no memory barrier (see the previous post on the compound meaning of volatile access).

Disassembling a JMH Nano-Benchmark

I often feel it is nano-benchmarks that give microbenchmarks a bad name (that and the fact MBMs tend to sell crack and their young bodies). Putting to one side the latter issue for bleeding heart liberalists to solve, we are left with the former. In this post I'd like to help the budding nano-benchmark writer resolve and investigate the embarrassing dilemma of: "What just happened?"
"What just happened?" is a question you should almost always ask yourself when running a nano-benchmark. The chances of the compiler finding out your benchmark does nothing, or that significant part of your benchmark can be omitted, are surprisingly large. This is partly a case of extreme cleverness of compiler writers and partly the simplicity of the benchmark code potentially leaving the door open to optimisations perhaps not possible in the wild. The best way to answer the question is to have a look at the assembly end result of your benchmark code.
Hipster developer that I am, I use JMH to write microbenchmarks. Chances are you should too if you are writing nano/micro benchmarks as it goes a long way toward solving common issues. In the rest of this post we'll be looking at the assembly produced by JMH benchmarks and explaining away the framework so that you can more easily find your way in your own benchmark.

The NOOP benchmark

I started with the observation that nano-benchmarks sometime get optimized away, if they did they'd have the same end result as this benchmark:
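(A sketch of such an empty benchmark; the method name noop matches the one used in the commands and results below.)

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class NoopBenchmark {
    @Benchmark
    public void noop() {
        // deliberately does nothing at all
    }
}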
Exciting stuff! So we measure nothing at all. How are we measuring this? JMH generates some code around a call to the above method that will do the measurement:
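(A paraphrased sketch of that generated code, not the actual JMH-generated source; the types here are simplified stand-ins, and NoopBenchmark is the class holding the noop() method above.)

static final class Control { volatile boolean isDone; }
static final class Result  { long startTime, stopTime, realTime, operations; }

void noop_avgt_jmhLoop(Control control, Result result, NoopBenchmark benchmark) {
    long operations = 0;
    long realTime = 0;                 // redundant, as noted further down
    result.startTime = System.nanoTime();
    do {
        benchmark.noop();              // the @Benchmark method being measured
        operations++;
    } while (!control.isDone);         // volatile read of the isDone flag
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.operations = operations;
}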
So we have a while loop, spinning on the isDone flag and counting how many times we can manage to execute it until someone tells us to stop (by setting the isDone flag to true). It follows therefore that the measurement overhead is:
  • Reading the volatile field isDone (L1 hitting read, predictable)
  • Incrementing a counter (on the stack)
But healthy skepticism is what this is all about, let's see what the generated assembly looks like! I'll be gentle, assembly is often hard on the eyes.

Getting The Assembly Output

To try this at home you'll need a drink, a JVM set up to print assembly, and the sample code. Build the project with Maven, then run the benchmark and generate the assembly using the following command:
$JAVA_HOME/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.noop_avgt_jmhLoop -XX:PrintAssemblyOptions=intel -XX:-UseCompressedOops -jar target/microbenchmarks.jar -i 5 -wi 5 -f 0 ".*.noop"> noop.ass
I'm only printing the measurement method, using the Intel syntax instead of the default AT&T and disabling compressed oops to get simpler output for this particular exercise. The output will contain several versions of the compiled method, I will be discussing the final version which is the last in the output.
Now that we have the assembly printed we can get familiar with the structure of the JMH measurement loop as it is translated into assembly:

This is just the preliminaries for the method, so not much to see except noting which reference is in which register to help interpret the rest of the code. The comments in the printout are generated by the JVM, my comments are prefixed with [NW].
Once all the pieces are in place we can move on to some actual work.

Measurement Loop: 2 Timestamps diverged in a yellow wood

Refresh your memory of what the java code above does and let's see if we can see it here:
Have a sip and scan slowly. Here's some nuggets to consider:

  • As expected the noop() method is not called and any mention of it is gone from the measurement loop.
  • The first iteration of the loop has been 'peeled', this is common practice.
  • Even though we never call noop(), we still have to do the null check for the benchmark reference.
  • The sharp-eyed reader will have noticed the redundant realTime variable in the generated measurement loop; so has the JIT compiler, and it has been replaced with setting the result.realTime field directly to 0.
  • RBP is an 8 byte register, EBP is the lower half of the same register. Setting EBP to 1 in the peeled first iteration is the same as setting RBP to 1.
  • The measurement loop includes a safepoint! put that down as further measurement overhead.
This is the simplest benchmark one can write with JMH. On my test machine (an Intel Xeon E5-2697 v2 @ 2.70GHz) doing nothing is quite fast at 0.288 ns/op.
As you may have expected, reading the generated assembly is not so pleasant, I find the generated comments are very helpful for orientation and the timestamp calls on either side of the measurement loop help in zooming in on the important bits.

A Nano-Benchmark: i++

Nothing says "nano-benchmark" like benchmarking a single operation. Let's have a go at it!
The generated loop is the same, but this time that crafty old JIT compiler cannot just do nothing with our code. We will finally learn the true cost of incrementing an integer! Given the overhead includes a long increment already I might even guess the cost at 0.25 ns/op, so maybe the result reported by JMH will be 0.5 ns/op? A warm fuzzy feeling of wisdom.
But when I run this benchmark on the same machine I learn to my dismay that incrementing an integer takes 1.794 ns/op according to my JMH benchmark. Damn integers! why does the JVM torture us so with slow integer increments?
This is a silly benchmark, and the result makes absolutely no sense as an estimate of the cost of the ++ operator on integers. So what does it mean? Could it be that the JIT compiler failed us? Let's have a look at the assembly:
So why is the reported cost so much higher than our expectation?

What just happened?

My increment method got translated perfectly into: "inc DWORD PTR [r8+0x10]". There is no compiler issue.  The comparison I made between incrementing the operations counter and incrementing the benchmark field is flawed/misguided/stupid/ignorant when taking into account the benchmark framework.
The context in which we increment operations is:
  • It's a long variable allocated on the stack
  • It's used in a very small method where there is no register pressure
  • It follows that operations is always kept in a register
  • ADD/INC on a register costs very little (it's the cheapest thing you can do usually)
The context in which we increment benchmark.i is:
  • It's a field on the benchmark object
  • It's subject to happens-before rules so cannot be hoisted into a register inside the measurement loop (because control.isDone is a volatile read, see this post for more detail)
  • It follows that benchmark.i is always a memory location
  • INC on a memory location is not so cheap (by nano benchmark standards)
Consulting with the most excellent Agner Fog instructions tables tells me that for Ivy Bridge the latency for INC on memory is 6 cycles, while the latency on ADD for a register is 1. This indeed agrees to some extent with the cost reported by JMH (assuming 0.288 was for one cycle, 0.288 * 6 = 1.728 which is pretty close to 1.794).  But that's bad analysis. The truth is that cost is not additive, particularly when nano-benchmarks are concerned. In this case the cost of the INC seems to swallow up the baseline cost we measured before. 
Is there something wrong with JMH? I don't think so. If we take the benchmark to be "an attempt at estimating the cost of calling a method which increments a field" then I would argue we got a valid answer. It's not the only answer however. Calling the same method in a context which allows further optimizations would yield a different answer.


Celebrating 2 years of blogging!

2 years ago I started on this blog with a short and relatively under-exciting post about intrinsics. I was not happy with that first post. But you have to start somewhere I guess ;-). I set myself a target of writing 2 posts a month and pretty much kept to it (43 posts and 1 page). Some posts took huge investment, some less, I learnt something new while writing every one of them.
I spent last week at Joker Conf and Gee Con, I don't think I'd have been invited to speak at either were it not for my blog. I'm also pretty sure I owe my current job (and other job offers) to the blog. I'm still surprised to meet people who read it. Most seem happy. It proved to be a lot of work, but just the sort of excuse I needed to dig deeper into corners of Java and concurrency I find exciting. Some of the effort that went into the blog became the ground work for JCTools. I guess what I'm trying to say is it worked out very well for me both in driving my learning process and gaining me some recognition that led to rewarding experiences and opportunities. Also, some other people seem to enjoy it :-)
The name of the blog proved puzzling for many (not a big surprise really), so in case you're still wondering where it came from, here's the relevant strip from Calvin & Hobbes:

I am a huge Bill Watterson fan, you should buy yourself the full C&H set, it will prove a more lasting reading material than any performance/Java/programming book you own. Also, I've seen many performance related discussions go a similar way to the above exchange...
A huge thank you to the readers, commenters and reviewers, urging me this way and steering me that way. Let's see if I can keep it up another 2 years :-)

The JVM Write Barrier - Card Marking

In Java, not all value stores are created equal; in particular, storing object references is different to storing primitive values. This makes perfect sense when we consider that the JVM is a magical place where object allocation, relocation and deletion are somebody else's problem. So while in theory writing a reference field is the same as writing the same sized primitive (an int on 32-bit JVMs or with compressed oops on, a long otherwise) in practice some accounting takes place to support GC. In this post we'll have a look at one such accounting overhead, the write barrier.

What's an OOP?

An OOP (Ordinary Object Pointer) is the way the JVM views Java object references. They are pointer representations rather than actual pointers (though they may be usable pointers). Since objects are managed memory, OOP reads/writes may require a memory barrier of the memory management kind (as opposed to the JMM ordering barrier kind):
"A barrier is a block on reading from or writing to certain memory locations by certain threads or processes.
Barriers can be implemented in either software or hardware. Software barriers involve additional instructions around load or store operations, which would typically be added by a cooperative compiler. Hardware barriers don’t require compiler support, and may be implemented on common operating systems by using memory protection." 
- Memory Management Reference, Memory Barrier
"Write barriers are used for incremental or concurrent garbage collection. They are also used to maintain remembered sets for generational collectors." 
- Memory Management Reference, Write Barrier
 In particular this means card marking.

Card Marking

All modern JVMs support a generational GC process, which works under the assumption that allocated objects mostly live short and careless lives. This assumption leads to GC algorithms where different generations are treated differently, and where cross generation references pose a challenge. Now imagine the time to collect the young generation is upon our JVM, what do we need to do to determine which young objects are still alive (ignoring the Phantom/Weak/Soft reference debate and finalizers)?
  • An object is alive if it is referenced by a live object.
  • An object is alive if a static reference to it exists (part of the root set).
  • An object is alive if a stack reference to it exists (part of the root set).
The GC process therefore:
"Tracing garbage collectors, such as copying, mark-sweep, and mark-compact, all start scanning from the root set, traversing references between objects, until all live objects have been visited.
A generational tracing collector starts from the root set, but does not traverse references that lead to objects in the older generation, which reduces the size of the object graph to be traced. But this creates a problem -- what if an object in the older generation references a younger object, which is not reachable through any other chain of references from a root?" - Brian Goetz, GC in the HotSpot JVM
Illustration By Alexey Ragozin
It is worth reading the whole article to get more context on the cross generational reference problem, but the solution is card marking:
"...the heap is divided into a set of cards, each of which is usually smaller than a memory page. The JVM maintains a card map, with one bit (or byte, in some implementations) corresponding to each card in the heap. Each time a pointer field in an object in the heap is modified, the corresponding bit in the card map for that card is set."
A good explanation of card marking is also given here by Alexey Ragozin. I have taken liberty to include his great illustration of the process.
So there you have it, every time an object reference is updated the compiler has to inject some accounting logic towards card marking. So how does this affect the code generated for your methods?

Default Card Marking

OpenJDK/Oracle 1.6/1.7/1.8 JVMs default to the following card marking logic (assembly for a setter such as setFoo(Object bar) ):

So setting a reference throws in the overhead of a few instructions, which boil down to:
CARD_TABLE [this address >> 9] = 0;
This is significant overhead when compared to primitive fields, but is considered necessary tax for memory management. The tradeoff here is between the benefit of card marking (limiting the scope of required old generation scanning on young generation collection) vs. the fixed operation overhead for all reference writes. The associated write to memory for card marking can sometimes cause performance issues for highly concurrent code. This is why in OpenJDK7 we have a new option called UseCondCardMark.
[UPDATE: as JP points out in the comments, the (>> 9) is converting the address to the relevant card index. Cards are 512 bytes in size so the shift is in fact address/512 to find the card index. ]
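As a conceptual model only (Java pseudo-form, not HotSpot code; addressOf, CARD_SHIFT and cardTable are illustrative stand-ins for JVM internals):

abstract class CardMarkingModel {
    static final int CARD_SHIFT = 9;   // 512-byte cards, hence the >> 9 above
    static byte[] cardTable;           // one byte per card, maintained by the JVM
    Object foo;

    void setFoo(Object bar) {
        foo = bar;                                        // the reference store itself
        long cardIndex = addressOf(this) >>> CARD_SHIFT;  // which card covers this object
        cardTable[(int) cardIndex] = 0;                   // unconditionally dirty the card
    }

    // hypothetical: Java objects have no stable address, the JIT works with the real one
    abstract long addressOf(Object o);
}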

Conditional Card Marking

This is the same code run with -XX:+UseCondCardMark:

Which boils down to:
if (CARD_TABLE [this address >> 9] != 0) CARD_TABLE [this address >> 9] = 0; 
This is a bit more work, but avoids the potentially concurrent writes to the card table, thus side stepping some potential false sharing through minimising recurring writes. I have been unable to make JDK8 generate similar code with the same flag, regardless of which GC algorithm I run with (I can see the code in the OJDK codebase... not sure what the issue is, feedback/suggestions/corrections welcome).

Card Marking G1GC style?

Is complicated... have a look:
To figure out exactly what this was about I had to have a read in the Hotspot codebase. A rough translation would be:
oop oldFooVal = this.foo;
if (GC.isMarking != 0 && oldFooVal != null){
  g1_wb_pre(oldFooVal);
}
this.foo = bar;
if (((this ^ bar) >> 20) != 0 && bar != null) {
  g1_wb_post(this);
}
The runtime calls are an extra overhead whenever we are unlucky enough to either:
  1. Write a reference while card marking is in process (and old value was not null)
  2. Target object is 'older' than new value (and new value is not null)
The interesting point to me is that the generated assembly ends up being somewhat fatter (nothing like your mamma) and has a significantly worse 'cold' case (cold as in less likely to happen), so in theory mixing up the generations will be painful.

Summary

Writing references incurs some overhead not present for primitive values. The overhead is in the order of a few instructions which is significant when compared to primitive types, but minor when we assume most applications read more than they write and have a healthy data/object ratio. Estimating the card marking impact is non-trivial and I will be looking to benchmark it in a later post. For now I hope the above helps you recognise card marking logic in your print assembly output and sheds some light on what the write barrier and card marking is about.


The Mythical Modulo Mask

It is an optimisation well known to those who know it well that % by a power of 2 can be replaced by a much cheaper AND with that power of 2 minus 1. E.g:
x % 8 == x & (8 - 1)
[4/11/2014 NOTE] This works because the binary representation for N which is a power of 2 will have a single bit set to 1 and (N-1) will have all the bits below that set to 1 (e.g 8 = 00001000, 8-1= 00000111). When we do x AND (N-1) only the remainder of x / N will match the N-1 mask.
[4/11/2014 NOTE + slight spoiler: this only works when x >= 0]
The reason the & is so much cheaper is that while % is implemented using the DIV instruction, & is just AND, and as it turns out DIV is expensive and AND is cheap on x86 CPUs (and other places too I think). The optimisation is used in the Disruptor as well as the JCTools circular array queues and in ArrayDeque and other JDK classes. Is it time to replace % with & everywhere in your code which has this opportunity?
[4/11/2014 NOTE]  Technical term for this sort of optimization is Strength Reduction

Starting Simple

Let's start with some basic benchmarks:
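(A hedged reconstruction of the kind of benchmarks involved, based on the benchmark names in the results below; the original code likely differs in detail, and the consume/noop baselines are omitted.)

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ModuloMaskBenchmark {
    int[] array = new int[128];
    int mask = array.length - 1;    // pre-calculated mask field
    int someIndex = 107;            // a non-constant, in-range index the JIT cannot fold away

    @Benchmark
    public int moduloLengthNoMask()         { return someIndex % array.length; }

    @Benchmark
    public int moduloConstantLengthNoMask() { return someIndex % 128; }   // constant divisor

    @Benchmark
    public int moduloLengthMask()           { return someIndex & (array.length - 1); }

    @Benchmark
    public int moduloMask()                 { return someIndex & mask; }
}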
And the results (on JDK8u5/E5-2697 v2 @ 2.70GHz/-XX:-UseCompressedOops for consistency between assembly and results):
  Benchmark                   Score   error  Units
  moduloLengthNoMask          3.448 ± 0.007  ns/op
  moduloConstantLengthNoMask  1.150 ± 0.002  ns/op
  moduloLengthMask            1.030 ± 0.006  ns/op
  moduloMask                  0.862 ± 0.001  ns/op
  consume                     0.719 ± 0.001  ns/op
  noop                        0.287 ± 0.000  ns/op

So pretty much as per expectation the modulo operation is far more expensive than the mask:
  • The clever JIT is aware of the optimisation opportunity and will replace a constant % with the &. It is not a perfect replacement, but pretty close.
  • At this sort of low digit ns benchmark we can’t make a statement such as “modulo is 4 times more expensive” because the same machine produces a baseline of 0.287ns/op for the noop benchmark and 0.719ns/op for the consume benchmark. If we deduct the consume result from the other scores we see a 1 : 25 ratio between the costs. Is that a good way to model performance? not really either, performance is not additive so simply subtracting one cost from the other doesn't really work at this scale. The truth is somewhere fuzzy in between and if we really care we should look at the assembly.
  • It seems that using a pre-calculated mask field is more awesome than using the "array length - 1" as a mask. That is consistent with the expectation that the re-calculation of the mask on the fly, as well as loading the value to be used for that calculation, is more expensive than using the pre-calculated field.
I love it when a plan comes together...

Going Deeper

The reason we wanted the modulo in the first place was to read from the array, right? So let’s try that:
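(Continuing the reconstruction above, now reading from the array through each flavour of index calculation; again a sketch based on the benchmark names, not the original code.)

@Benchmark
public int readByLengthNoMask() { return array[someIndex % array.length]; }

@Benchmark
public int readByLengthMask()   { return array[someIndex & (array.length - 1)]; }

@Benchmark
public int readByMask()         { return array[someIndex & mask]; }

@Benchmark
public int readNoMask()         { return array[someIndex]; } // my guess: a plain, bounds-checked read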
And the results:
  Benchmark                   Score   error  Units
  readByLengthNoMask          3.736 ± 0.005  ns/op
  readByConstantLengthNoMask  1.437 ± 0.001  ns/op
  readByMask                  1.347 ± 0.022  ns/op
  readByLengthMask            1.181 ± 0.049  ns/op
  readNoMask                  1.175 ± 0.004  ns/op
Well, what’s this I see? "length-1" mask is leading the chart! How’d that happen?
To quote from the famous “Jack and the FlumFlum Tree”:
“Don’t get your knickers in a twist!” said Jack,
“Let’s have a look in the patchwork sack.”
Lets start with the generated assembly for the constant modulo:
I didna see that one coming! The modulo on a constant is not your garden variety & mask affair, since it turns out our original assertion about the mask/modulo equality is only true for positive numbers. The JIT in its wisdom is dealing with the negative case by doing (x = -x; x = x&15; x = -x;).
I think the above case could be made a tiny bit faster by switching the branch around (so jump for negative value). It’s easy however to see what happens if we simplify the constant version further by using a constant mask:
And results:
  Benchmark                   Score   error  Units
  moduloConstantLengthNoMask  1.150 ± 0.002  ns/op
  moduloConstantLengthMask    0.860 ± 0.001  ns/op
  readByConstantLengthNoMask  1.437 ± 0.001  ns/op
  readByConstantLengthMask    1.209 ± 0.017  ns/op
So minor joy on the modulo, and reading is better than plain mask, nearly as good as the "length-1" mask. Oh well, let's move on.
The big surprise was the mask calculated on the fly from the array length version. How can calculating the mask on the fly, which seemed to be slower, end up being faster when reading from the array? Who feels like more assembly?
I was hoping the JVM was clever enough to remove the array bound checks, but that didn’t happen. What’s happening here is that the length load serves the purpose of both creating the mask and checking the bounds. This is not the case for the mask version where we load the mask for the index calculation and the length for the bounds check, thus paying for 2 loads instead of one:
So removing the computation did not make a difference because the bound check requires the extra load of the length anyhow, can we make the bounds check go away? Of course we can, but it’s Unsafe!!! Let’s do it anyways!
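(A hedged sketch of the Unsafe read; the reflective acquisition of Unsafe and the offset plumbing are the usual boilerplate, the benchmark shape is my reconstruction.)

private static final sun.misc.Unsafe UNSAFE = getUnsafe();
private static final long INT_ARRAY_BASE    = UNSAFE.arrayBaseOffset(int[].class);
private static final int  INT_ELEMENT_SHIFT = 2; // log2 of Unsafe.arrayIndexScale(int[].class)

private static sun.misc.Unsafe getUnsafe() {
    try {
        java.lang.reflect.Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (sun.misc.Unsafe) f.get(null);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

@Benchmark
public int unsafeReadByMask() {
    int index = someIndex & mask;
    return UNSAFE.getInt(array, INT_ARRAY_BASE + ((long) index << INT_ELEMENT_SHIFT));
}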
The assembly:

Shazzam! no bounds check, but look at all the work that’s gone into the unsafe read of the array. It would have been so much better if the unsafe read enjoyed the same addressing mode as normal array reads like so “r8d,DWORD PTR [r9+r10*4+0x18]”, but it seems the JIT compiler is not recognising the opportunity here. What’s the performance like?
  Benchmark                   Score   error  Units
  readByMask                  1.347 ± 0.022  ns/op
  readByLengthMask            1.181 ± 0.049  ns/op
  readNoMask                  1.175 ± 0.004  ns/op
  unsafeReadByMask            1.152 ± 0.001  ns/op

This is even better than no mask at all. Yay?
Well… sort of. If you mean to have the fastest ‘get’ from the array that allows for an array size which is not an application constant, then this is a mini-win. In particular it saves you a load of the array length in this case, and loads can cost anything really. In the case where index and mask are long we can get better code generated:
But performance is much the same for this case. Seems like there’s not much left to win in this case.
For completeness sake we can compare the no mask result with an Unsafe equivalent:
  Benchmark                   Score   error  Units
  unsafeReadByNoMask          1.038 ± 0.022  ns/op
  readNoMask                  1.175 ± 0.004  ns/op

So it seems slipping past the array boundary check is worth something, but is it generally worth it? What if we weren't dealing with just the one element?

Bound Check Elimination

Looking at the above optimisation we need to accept that it is probably only worth it if array bound checks happen on every access. If we now compare a sum over an array:
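(A hedged reconstruction of the three summing loops compared below, reusing the Unsafe plumbing from the earlier sketch; the original code may differ.)

@Benchmark
public int loopOverArrayStraight() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        sum += array[i];
    }
    return sum;
}

@Benchmark
public int loopOverArrayUnsafeInt() {
    int sum = 0;
    for (int i = 0; i < array.length; i++) {
        sum += UNSAFE.getInt(array, INT_ARRAY_BASE + ((long) i << INT_ELEMENT_SHIFT));
    }
    return sum;
}

@Benchmark
public int loopOverArrayUnsafeLong() {
    int sum = 0;
    for (long i = 0; i < array.length; i++) { // long index: note the safe point poll discussed below
        sum += UNSAFE.getInt(array, INT_ARRAY_BASE + (i << INT_ELEMENT_SHIFT));
    }
    return sum;
}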

We get the following results (length=100):
  Benchmark                    Score    error  Units
  loopOverArrayStraight        26.855 ± 0.060  ns/op
  loopOverArrayUnsafeInt       41.413 ± 0.056  ns/op
  loopOverArrayUnsafeLong      76.257 ± 0.171  ns/op
Oh Unsafe, why you so sucky sucky? How come the unsafe versions suck so significantly? isn’t Unsafe the cure to all performance problems?
Once the bounds check is eliminated by the JIT we can see that for the UnsafeInt we have the same issue with addressing conversion, only now the cost is not compensated for by the bound check removal. The UnsafeLong version is even worse, how come?
The generated loop for the int case is long and boring because it’s unrolled, the long case is pretty small:
2 'bad' things just happened:
  1. Addressing didn’t workout the way we’d like. Instead of the desired “mov    r11d,DWORD PTR [r9+rdi*4+0x18]” we get a two stage setup where we do:”lea    r10,[r9+rdi*4]” and then “add    r11d,DWORD PTR [r10+0x18]”. Bummer.
  2. We got a safe point poll in the loop. This is happening because long indexed loops are considered potentially very long (as opposed to shorter int loops... heuristics for time to safe point) and so include a safe point poll.
So we want to fix the addressing mode and stick to having an int index. If we were to insist on using Unsafe (perhaps because we are trying to do this with off heap memory instead of an array) we’d have to do this:
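(A hedged sketch of that tweak: keep an int loop counter, so the counted loop needs no safe point poll, and advance a running byte offset instead of recomputing base + (i << 2) on every iteration.)

@Benchmark
public int loopOverArrayUnsafePointer() {
    int sum = 0;
    long offset = INT_ARRAY_BASE;
    for (int i = 0; i < array.length; i++) {
        sum += UNSAFE.getInt(array, offset); // the address is simply array + offset
        offset += 4;                         // advance by the int element size
    }
    return sum;
}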
[4/11/2014 NOTE]  Note that what we really want here is more than just getting rid of the multiplication/widening, we want the JIT to identify the expression calculated for offset as relative array access and pick the correct addressing mode for MOV to use. There are clever people out there trying to make sure this will work better in the future.
This removes the need for a safe point poll and simplifies addressing to the point where we nearly match the iteration over the array case (length=100):
  Benchmark                    Score    error  Units
  loopOverArrayStraight        26.855 ± 0.060  ns/op
  loopOverArrayUnsafePointer   27.377 ± 0.049  ns/op
We can explore the relationship between the implementations by testing for different array sizes:
array size    10    100     1000     10000
straight      4.3   26.8    289.2    2883.7
unsafeP       4.8   27.3    296.1    2886.4

So it seems that the smaller the array the more relative advantage the array iteration has when iterating in this fashion. This should not really be surprising, there's nothing here to confuse the JIT compiler and iterating over arrays is important enough to optimize. We have to work hard to get close to the JIT compiler when it does what it does best.


Summary

We had a simple optimisation in mind, replace a % with &:
  • Observed that for the case where constants are used the JIT is able to perform that optimisation for us almost as well as we’d do ourselves (we have no way of specifying positive only modulo, i.e uint).
  • We proved the viability of the optimisation in 2 variations, using a pre-calculated mask field and using (array.length - 1)
  • Using the optimisation in the context of a circular array read showed an interesting reversal in performance. We observed the cause of this reversal to be the array.length load: it is required for the bounds check anyway and is reused for the on-the-fly mask calculation, whereas the pre-calculated mask version pays for an extra load.
  • Using Unsafe we managed to bypass the array bound check and get the best result using the mask for a single read. 
  • When we try the same method naively in a loop (over the whole array) array bound check is eliminated and plain old array access is the best performer.
  • To regain the performance for Unsafe access we have to tweak the code to avoid safe point polling as well as to get the addressing mode we want in the resulting assembly. Even then plain array access is better for smaller arrays.
Simple innit?
Some notes on methodology:
  • I ran the same experiments on different Intel generations; you get different results but the assembly remains the same. E.g. on older CPUs the maximum instructions per cycle would be less than on the Ivy Bridge CPU I've used here, which will lead to instructions spilling over to the next cycle. The L1 latency could be higher, leading to loads dominating the costs, etc. This ends up giving a slightly different balance to compute vs. memory load. Overall analysis holds.
  • Using -XX:-UseCompressedOops was done for the sake of consistent assembly and results. Using compressed oops makes loads look a bit clumsier and I wanted to have less to explain. But running with the flag on (as it is by default) also affects results on this scale. In particular, because the compressed oops require a shift to be used and shifters are a limited resource on the CPU (1 on Westmere, 2 on Ivy Bridge), it can end up adding a cycle to the results.
  • Running these same experiments on a laptop was good for getting the assembly out and a vague sense of scale for results, but measurements had far greater error in that environment. Also note that laptops and desktops tend to be a generation ahead of servers where processors are concerned.
  • An interesting experiment would be to look at same experiment with the JMH perfasm profiler. I did that but could not figure out how to get Intel syntax out of it and so for consistency sake stuck with what I had. Left as an exercise to the reader :P
Many thanks to J.P. Bempel and Peter Hughes for reviewing, any issues remaining were added by me after they reviewed the post.

The Escape of ArrayList.iterator()

{This post assumes some familiarity with JMH. For more JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Escape Analysis was a much celebrated optimisation added to the JVM in Java 6u23:
"Based on escape analysis, an object's escape state might be one of the following:
  • GlobalEscape – An object escapes the method and thread. For example, an object stored in a static field, or, stored in a field of an escaped object, or, returned as the result of the current method.
  • ArgEscape – An object passed as an argument or referenced by an argument but does not globally escape during a call. This state is determined by analyzing the bytecode of called method.
  • NoEscape – A scalar replaceable object, meaning its allocation could be removed from generated code.
After escape analysis, the server compiler eliminates scalar replaceable object allocations and associated locks from generated code. The server compiler also eliminates locks for all non-globally escaping objects. It does not replace a heap allocation with a stack allocation for non-globally escaping objects." - from Java 7 performance enhancements 
Alas, proving an object never escapes is a difficult problem, and many people feel they cannot rely on this optimisation to kick in and "do the right thing" for them. Part of the problem is that there is no easy way to discover if a particular allocation has been eliminated (on a debug OpenJDK build one can use -XX:+UnlockDiagnosticVMOptions -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations).
The EscapeAnalysis skepticism leads some people to go as far as claiming that the JIT compiler fails to eliminate the iterator allocation of collections, which are everywhere:

But why skepticize what you can analyze?

Theory: even simple iterators do not benefit from Escape Analysis

So let's give this some thought. I generally think the best of the JVM developer bunch. Sure they may miss here and there, they're just human after all, but we are talking about a good strong bunch of engineers. I tend to assume that if there is a simple case for an optimization that is worthwhile then they have not failed to capitalize on it. I would therefore be quite surprised if indeed iterators do not benefit from escape analysis as they seem quite natural candidates, particularly in the syntactic sugar case, but even in the explicit case. Still, my trust in the JVM engineers is no evidence, how can I prove this works for a given setup?
  1. Use a debug build... I invite the readers to try this method out, didna do it.
  2. Use a memory profiler, like the one packed with VisualVM or YourKit
  3. Setup an experiment to observe the before/after effect of the desired optimization. Use -XX:+/-DoEscapeAnalysis and examine GC logs to observe the effect.
  4. Look at the assembly...

Experiment: Observe a loop over an array list

Reach for your tool belts and pick the swiss army knife that is JMH. Here's the benchmark I will use for this:
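(A hedged sketch of the benchmark; the method name sumIteratorOverList matches the results quoted below, the rest is my reconstruction.)

import java.util.ArrayList;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class IteratorEscapeBenchmark {
    @Param("1000")
    int size;
    ArrayList<Integer> list;

    @Setup
    public void fill() {
        list = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            list.add(i);
        }
    }

    @Benchmark
    public int sumIteratorOverList() {
        int sum = 0;
        for (Integer v : list) { // the enhanced for loop allocates an Iterator... or does it?
            sum += v;
        }
        return sum;
    }
}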
This is one of them times when I'm not particularly concerned with the performance of this bit of code as such, but JMH makes a good crucible for java code. The same effort that went into correct measurement enables the examination of code in an isolated manner. This makes JMH a good tool for testing the JIT compiler.

Measurement: Profile the experiment

I want to plug the experiment into a profiler, so I set the number of iterations to 1000 and get into it man, profiling, doing it you know, like a... like a sex machine, can I count it all? Here's YourKit reporting:

Ouch! that array list iterator is right there at the top! all that trust I put in the JVM developers? GONE! Let's get a second opinion from JVisualVM, just to be sure:

Oh me, oh my, this is bad...
Finally, with tears in our eyes, let us try see what Java Mission Control has to say:

How curious! the iterator allocation is gone! but JMC is outnumbered 2 to 1 here. Could it be that some profilers are pooping their pants?

Measurement: Measure with -XX:+/-DoEscapeAnalysis

Sometimes we get lucky and the optimization we want to examine comes with a handy flag to turn it on and off. We expect escape analysis to remove the iterator allocation and thus leave us with a benchmark which generates no garbage. We are also fortunate that JMH comes with the handy GC profiler which simply examines the GC JMX bean to inform us if any collections happened. Running the benchmark with a short list size and a small heap should trigger plenty of young generation GC cycles in each measurement iteration. Lo and behold, with escape analysis on:
$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \
 -jvmArgs="-Xms64m -Xmx64m -XX:+DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"
Benchmark                                           (size)    Score    Error   Units
sumIteratorOverList                                   1000  816.367 ± 17.768   ns/op
sumIteratorOverList:@gc.count.profiled                1000    0.000 ±    NaN  counts
sumIteratorOverList:@gc.count.total                   1000    0.000 ±    NaN  counts
sumIteratorOverList:@gc.time.profiled                 1000    0.000 ±    NaN      ms
sumIteratorOverList:@gc.time.total                    1000    0.000 ±    NaN      ms


$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \
 -jvmArgs="-Xms64m -Xmx64m -XX:-DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"
Benchmark                                           (size)    Score    Error   Units
sumIteratorOverList                                   1000  940.567 ± 94.156   ns/op
sumIteratorOverList:@gc.count.profiled                1000   19.000 ±    NaN  counts
sumIteratorOverList:@gc.count.total                   1000   42.000 ±    NaN  counts
sumIteratorOverList:@gc.time.profiled                 1000    9.000 ±    NaN      ms
sumIteratorOverList:@gc.time.total                    1000   27.000 ±    NaN      ms
Now that is what we hoped for, escape analysis saves the day! Please note above numbers are from running on my laptop using Oracle Java 8u20, I was not aiming for measurement accuracy, just wanted to verify the compilation/GC aspects. Laptops are good enough for that.

WTF Profilers? Why you lie?

There's a big difference in how JVisualVM/YourKit work and how JMC works, in particular:
  • JVisualVM/YourKit: treat the JVM as a black box and profile via the JVMTI interfaces and bytecode injection. Work on all JVMs.
  • Java Mission Control: use internal JVM counters/reporting APIs only available to the Oracle JVM (so can't profile OpenJDK/Zing/IBM-J9 etc)
So why should this get in the way of escape analysis? Searching through the OpenJDK source code we can spot the following:
void C2Compiler::compile_method(ciEnv* env, ciMethod* target, int entry_bci) {
  assert(is_initialized(), "Compiler thread must be initialized");

  bool subsume_loads = SubsumeLoads;
  bool do_escape_analysis = DoEscapeAnalysis &&!env->jvmti_can_access_local_variables();
This explains a similar issue, but not this one. The above code is relevant for debuggers and other tools relying on stepping through code and examining variable values.
This issue is the result of some instrumentation trapping the reference, at least for JVisualVM. We can look at the assembly to be certain, this is the iterator non-allocation before we connect JVisualVM to the process:
Note how the iterator is initialized with no allocation actually taking place (note also how the line 4 annotation is a big fat lie, this is actually getting modCount from the list). The method is recompiled to the same code when we attach the agent. And here's what happens when we turn on memory profiling: the iterator is now allocated (see the 'new' keyword at line 18 above). We can see that the instrumenting agent added a traceObjAlloc call into the iterator constructor (line 52 above). JVisualVM is open source, so we can dig into the code of the above trace method and the instrumentation code, I leave it as an exercise to the reader. If I was a better man I might have a go at fixing it, maybe another day.

Summary

  • EscapeAnalysis works, at least for some trivial cases. It is not as powerful as we'd like it, and code that is not hot enough will not enjoy it, but for hot code it will happen. I'd be happier if the flags for tracking when it happens were not debug only.
  • Profilers can kill EscapeAnalysis. In the example above we have very little code to look at, so it is easy to cross examine. In the real world you'd be profiling a much larger codebase, looking for allocation elimination opportunities, I suggest you have a healthy doubt in your profiler.
  • Java Mission Control is my current favourite profiler for the Oracle JVM. It's a shame the underlying APIs are not made public and standard so that tool makers can rely on them for all JVMs. Perhaps one day at least part of these APIs will be made public.
Thanks Peter Lawrey for reviewing this post!

    MPMC: The Multi Multi Queue vs. CLQ

    JCTools, which is my spandex side project for lock-free queues and other animals, contains a lovely gem of a queue called the MpmcArrayQueue. It is a port of an algorithm put forward by D. Vyukov (the lock free ninja) which I briefly discussed in a previous post on ring-buffer queues.
    The implementation is covered to a large extent in that post, but no performance numbers are provided to evaluate this queue vs. the JDK's own MPMC queue, the illustrious ConcurrentLinkedQueue (CLQ for short). Let's fix that!

    Welcome to the Machine

    Before we start the party, we must establish the test bed for these experiments. The results will only be relevant for similar setups and will become less indicative the more we deviate from this environment. So here it is, the quiet performance testing machine (shitloads of power and nothing else running):
    • OS: CentOS 6.4
    • CPU: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz, dual socket, each CPU has 12 cores (24 logical core as HT is on). This is an Ivy Bridge, so decent hardware if not the fanciest/latest.
    • JVM: Oracle JDK 1.8u25
    • Load average (idle): 0.00, 0.02, 0.00 -> it's oh so quiet....
    I'll be using taskset to pin threads to a given topology of producers/consumers which is important because:
    • Running 2 logical threads on the same physical core will share the queue in the L1 cache.
    • Running on 2 physical cores will share the queue in the LLC.
    • Running on different sockets (i.e. 2 different CPUs) will force data to be shared via QPI.
    • Each topology has its own profile, there's no point in averaging the lot together.
    For the sake of these benchmarks I'll be pinning the producer/consumer threads to never share a physical core, but remain on the same CPU socket. In my particular hardware this means I'll use the 12 physical cores on one CPU using taskset -c 0-11.
    For the sake of keeping things simple I ran the benchmarks several times and averaged the results, no fancy graphs (still recovering from the new year). Throughput results are quoted in ops per microsecond, if you are more comfortable with ops per second just multiply by 1 million. Latency results are in nanoseconds per operation.

    Throughput Benchmarks: Hand rolled 1 to 1

    Let's start with the simplest benchmark in JCTools. The QueuePerfTest is really a single producer/consumer test:
    • A producer thread is spun up which feeds into the queue as fast as it can.
    • The main thread plays the consumer thread and empties the queue as fast as it can. 
    The original benchmark was a sample used by Martin Thompson in one of his talks and I tweaked it further. It's nice and simple and has the added benefit of verifying that all the elements arrived at their destination. It also opens the door to loop unrolling which you may or may not consider a reasonable use case. This benchmark calls Thread.yield() if an offer/poll call fails which represents a rather forgiving backoff policy, which is why we also have the BusyQueuePerfTest which busy spins on fail.
    Here's what we get:
    Queue  Benchmark            Throughput(ops/µs)
    CLQ    QueuePerfTest        10.7
    Mpmc   QueuePerfTest        65.1
    CLQ    BusyQueuePerfTest    15.3
    Mpmc   BusyQueuePerfTest    63.5
             
    The observed throughput for CLQ varies quite a bit between iterations, but the full run average is not too unstable. The performance also depends on the size of the heap, as CLQ allocates a node per item passed and the GC overhead can become significant. In my runs I set the heap size to 1GB and the benchmark also calls System.gc() between measurement iterations. In any case, off to a good start, the JCTools queue is looking well.

    Throughput benchmarks: JMH Joy

    I'm a big JMH fan (see the JMH reference page), and these benchmarks show some of the power JMH has. The benchmark code is very simple and it's clear what's being measured (see the sketch after this list):
    • We have 2 methods, offer and poll
    • Thread groups hammer each method
    • We use @AuxCounters to report successful/failed interactions with the queue
    • The throughput is the number of pollsMade, i.e. the number of items delivered
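    (A hedged sketch of the shape of this benchmark; the real JCTools benchmark class differs in setup, queue construction and backoff handling.)

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import org.openjdk.jmh.annotations.AuxCounters;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Group;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Group)
    public class QueueThroughputSketch {
        static final Integer ONE = 1;
        Queue<Integer> q = new ConcurrentLinkedQueue<>(); // swapped for MpmcArrayQueue via a parameter in the real benchmark

        @AuxCounters
        @State(Scope.Thread)
        public static class OfferCounters {
            public long offersMade, offersFailed;
        }

        @AuxCounters
        @State(Scope.Thread)
        public static class PollCounters {
            public long pollsMade, pollsFailed;
        }

        @Benchmark
        @Group("tpt")
        public void offer(OfferCounters counters) {
            if (q.offer(ONE)) counters.offersMade++; else counters.offersFailed++;
        }

        @Benchmark
        @Group("tpt")
        public void poll(PollCounters counters) {
            if (q.poll() == null) counters.pollsFailed++; else counters.pollsMade++;
        }
    }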
    The nice thing is that once I got this setup I can now play with the number of threads in each group with no further investment (see an introduction to multi-threaded benchmarks with JMH).  I can run several iterations/forks and go make some coffee while JMH crunches through the lot and gives me a nice report (JMH options used for these benchmarks: "-gc -f 5 -i 5 -wi 5 -p qType=MpmcArrayQueue,ConcurrentLinkedQueue -p qCapacity=1000000" the -gc option forces a GC cycle between iterations as before, I use the -tg option to control number of producer/consumer threads). Here's the results for QueueThroughputBackoffNone (again results are in ops/µs, so higher is better):
    Queue  1P1C        2P2C      3P3C      4P4C      5P5C      6P6C
    CLQ     7.9 ±1.7 , 4.7 ±0.2, 3.8 ±0.1, 3.7 ±0.4, 3.1 ±0.1, 2.9 ±0.1
    Mpmc   68.4 ±11.1, 9.2 ±1.1, 7.3 ±0.5, 5.8 ±0.4, 5.2 ±0.4, 5.3 ±0.2

    Note the columns stand for number of producers/consumers so 1P1C will be running with -tg 1,1 and will have one thread hitting the offer and one the poll. 2P2C will have 2 consumers and 2 producers etc.
    So on the one hand, joy, MPMC still consistently ahead, on the other hand it is not some sort of magic cure for contention. If you have multiple threads hitting a shared resource you will suffer for it. The initial hit is the worst, but the following degraded performance curve isn't too bad and we seem to stay ahead.

    Average Round Trip Time Benchmarks


    This benchmark was set up to model a bursty load on queue implementations where we measure the time it takes for a burst of messages to travel from an originating thread, to a chain of threads inter-linked by queues, back to the same thread. This benchmark is focusing on near empty queues and the notification/wakeup time from when an item is placed in a queue until it becomes visible to another thread. It also highlights any batching optimisations impact as bursts grow larger.
    The benchmark has been discussed in some detail in 2 previous posts(1, 2)
    We have 2 axes we can explore here, burst size and ring size; I have not explored all the possibilities. The results seem quite straightforward and I'm short on time. I tested with burst size=1,10,100 in the default configuration, i.e. chain length 2 (so a ping-pong kind of setup), and just for the heck of it I ran the burst size=100 with a chain length of 8 (so 1->2->3...->7->8->1). There you go (results are now in nanoseconds per operation, lower is better):
    Queue  b=1,c=2   b=10,c=2   b=100,c=2    b=100,c=8
    CLQ    488 ±11,  2230 ±98,  15755 ±601,  24886 ±2565
    Mpmc   422 ±7,   1718 ±51,   5144 ±287,   8714 ± 200

    Note that the headers shorthand stands for burst size(b) and chain length(c) so b=1,c=2 which is the default config for the benchmark stands for burst size of 1 (so send 1 message and wait until 1 message is received back) and chain length of 2 (so 2 threads linked up in this ring: 1->2->1).
    The difference on single element exchange is not that great, but as burst size grows the gap widens significantly. This is perhaps down to the fact MPMC enjoys a much denser delivery of data, minimising the cache misses experienced as part of the message exchange. Note that extending the length of the chain seems to add the same percentage of overhead for each implementation, resulting in the same ratio for chain length 8 as we saw for chain length 2 (MPMC is 3 times 'cheaper' than CLQ).

    Summary

    This was all rather dry, but I hope it helps people place the MPMC alternative in context. I would suggest you consider using MpmcArrayQueue in your application as a replacement to CLQ if:
    • You need a bounded queue
    • You are concerned about latency
    • You don't need to use the full scope of methods offered by Queue and can make do with the limited set supported in JCTools queues

    HdrHistogram: A better latency capture method

    Some years back I was working on a latency sensitive application, and since latency was sensitive it was a requirement that we somehow capture latency both on a transaction/event level and in summary form. The event level latency was post processed from the audit logs we had to produce in any case, but the summary form was used for live system monitoring. We ended up with a solution I've seen since in many places (mild variations on a theme), which is what I've come to think of as the linear buckets histogram:
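    (A hedged sketch of such a hand-rolled linear-buckets histogram, not the original code: fixed-width buckets, with anything above the expected maximum lumped into an overflow bucket.)

    public class LinearBucketsHistogram {
        private final long bucketWidth;
        private final long[] buckets; // the last bucket counts overflows

        public LinearBucketsHistogram(long maxExpectedValue, int bucketCount) {
            this.bucketWidth = Math.max(1, maxExpectedValue / bucketCount);
            this.buckets = new long[bucketCount + 1];
        }

        public void record(long value) {
            int index = (int) (value / bucketWidth);
            if (index >= buckets.length - 1) {
                buckets[buckets.length - 1]++; // anything beyond the max loses all precision
            } else {
                buckets[index]++;
            }
        }

        public long countBetween(int fromBucket, int toBucket) {
            long sum = 0;
            for (int i = fromBucket; i < toBucket; i++) {
                sum += buckets[i];
            }
            return sum;
        }
    }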

    The above data structure was an easy solution to a problem we had little time to solve, but it left much to be desired.
    These days the histogram problem is gloriously solved by the HdrHistogram (High Dynamic Range), and though it's been around for a couple of years now I still find olden hand rolled histograms in many a code base. Cut that shit out boys and girls! Let me show you a better way :-)

    HdrHistogram highlights:
    • Mature and battle tested, this data structure has been in production for many companies for a while now. Kinks have been unkinked and knickers untwisted.
    • Multi-lingual support: current implementations available in Java, C, C#, Erlang, Go and more are on the way.
    • Auto-resizing histograms (if you exceed your initial maximum value estimate)
    • Compact memory footprint supporting high precision of values across a wide range.
    • Compressed lossless serialization/de-serialization
    • Plotting scripts for gnuplot, a webby charting tool and an excel sheet chart
    • Lock-free concurrency support for recording and logging from multiple threads
    • Zero allocation on recording path (unless resizing which is optional, and then only if value exceeds initially specified max recorded value)
    • Constant time measurement which is less than cost of calling System.nanoTime() (on the cost, scalability and trustworthiness of nanoTime read Shipilev's excellent report)
    It is truly as if the bees-knees and the dogs-bollocks had a baby, is it not?

    Mama, what's a histogram?

    "Histogram of arrivals per minute" by DanielPenfield
    Own work. Licensed under CC BY-SA 3.0 via Wikimedia Common
    Well, you could look it up on wikipedia, but the short answer is that a histogram is a summary of data in terms of the frequency of ranges of values. So given the following data set [1,2,3,...,100000] captured for a second I could summarize it in several ways:
    • I could capture the whole range of values, each value in a bucket of its own, assigning each value a frequency of 1 per second.
    • I could have a single bucket for values between 0 and 100,000, this bucket will have the frequency of 100,000 times per second.
    These are both a bit silly: the first is as bad as dealing with the full set of data, the second tells us nothing about the way the 100,000 values break down within the range. Still, these are the 2 extremes of histograms; the alternatives lie between them in terms of the data they offer, but there are many ways to skin a cat (apparently, :( poor cats):
    • Capture the values in 1000 buckets, each bucket representing a range of 100 values: [1..100][101..200]...[99,901..100,000] that will result in 1,000 buckets each with a frequency of 100. This is the sort of histogram described above where all buckets capture the same fixed range.
    • Capture the values in 17 buckets, each bucket K representing a range [2^K..(2^(K+1)-1)] e.g. [1..1][2..3][4..7]...[65,536..131,071]. This would be a good solution if we thought most values are likely to be small and so wanted higher precision on the lower range, with lower and lower precision for larger values. Note that we don't have to use 2 as the base for exponential histogram, other values work as well.
    Both of the above solutions trade precision across a large range against storage space. In both solutions I am required to choose the histogram precision up-front, and we expect to pay for a large range with either space or precision. Now that we realize what the variables are we can describe these solutions:
    • Linear buckets: For a range 0..R we will have to pay R/B space for buckets of range B. The higher R is the more space we require, we can compensate by picking a large B.
    • Exponential buckets: For a range 0..R we require space of log2 of R. The bucket size grows exponentially as we track higher values.
    The problem we face with latency data points is that the range of values we want to capture is rather large. It is not unreasonable to expect the latency outliers to be several orders of magnitude larger than the typical observed measurement. For example, it may well be that we are timing a method whose cost is in the 100s of nanoseconds, or a high speed connection round trip on the order of 1-100µs but on occasion our latency is dominated by some stop the world GC pause, or network congestion, which is in the order of 10ms to a few seconds. How can we correctly size the range of our histogram? Given the possibility of multi-second GC delays we need to cover a typical range of 1000ns to 100,000,000,000ns. If we used a linear histogram with a 100µs bucket we'd need 1,000,000 buckets (assuming an int counter this will add up to a ~4MB data structure).
    The HdrHistogram follows a different path to the above solutions and manages to accommodate a large range with a high precision across the range in a limited space.

    How does it work?

    Here's what the documentation has to say:
    "Internally, data in HdrHistogram variants is maintained using a concept somewhat similar to that of floating point number representation: Using an exponent a (non-normalized) mantissa to support a wide dynamic range at a high but varying (by exponent value) resolution. AbstractHistogram uses exponentially increasing bucket value ranges (the parallel of the exponent portion of a floating point number) with each bucket containing a fixed number (per bucket) set of linear sub-buckets (the parallel of a non-normalized mantissa portion of a floating point number). Both dynamic range and resolution are configurable, with highestTrackableValue controlling dynamic range, and numberOfSignificantValueDigits controlling resolution."
    Hmmm... I'll admit to having difficulty immediately understanding what's happening from the above text, precise though it may be. I had to step through the code to get my head around what works why, read the above again and let it simmer. I'm not going to dig into the implementation because, while interesting, it is not the point of this post. I leave it to the reader to pester Gil Tene(the author of HdrHistogram) with implementation related questions.
    The principal idea is a mix of the exponential and linear histograms to support a dynamic range precision that is appropriate to the time unit scale. At the scale of seconds we have a precision of milliseconds, at the scale of milliseconds we have a precision of microseconds etc. This translates roughly into exponential scale buckets which have linear sub-buckets.

    Example: From raw recording to histogram

    A while back I posted a Java ping utility which measures the round trip between minimal client and server processes. Each round trip was recorded into a large array and every set number of round trips the measurements were summarized in percentiles:
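    In sketch form (illustrative names, not the original utility code) the raw recording approach looks like this:
    import java.util.Arrays;

    public class RawRttRecorder {
        private final long[] rtts = new long[1_000_000]; // ~8MB of raw samples
        private int count;

        public void record(long rttNanos) {
            rtts[count++] = rttNanos;
        }

        public long percentile(double p) {
            long[] sorted = Arrays.copyOf(rtts, count);
            Arrays.sort(sorted);                          // sort before picking percentiles
            int index = (int) Math.ceil((p / 100.0) * count) - 1;
            return sorted[Math.max(0, index)];
        }
    }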
    Recording raw data is the simplest way of capturing latency, but it comes at a price. The long[] used to capture the latencies is ~8MB in size, this is for a million samples and in a real application can grow without bounds until some cutoff point where we decide to summarize or discard the data. When we want to report percentiles we have to sort it and pick the relevant data points. This is not usually an acceptable solution (because of the memory footprint), but it offers absolute accuracy and is trivial to implement (until you have to consider serialization, concurrency and visualization, but otherwise trivial).
    Replacing this measurement method with a histogram is straight forward:
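    A minimal sketch of that replacement (assuming we only care about values up to a minute, captured with 2 significant decimal digits):
    import java.util.concurrent.TimeUnit;
    import org.HdrHistogram.Histogram;

    public class HdrRttRecorder {
        private final Histogram histogram = new Histogram(TimeUnit.MINUTES.toNanos(1), 2);

        public void record(long rttNanos) {
            histogram.recordValue(rttNanos);
        }

        public void report() {
            histogram.outputPercentileDistribution(System.out, 1000.0); // report values in µs
        }
    }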
    This histogram is 31KB (according to JOL, add it to your utility belt if it ain't there already) when using 2 decimal places of precision, which is good enough in most cases; increasing the precision to 3 decimal places increases the size to 220KB. Either way this is a large improvement over 8MB. We could reduce the memory consumption further if we were willing to limit the maximum data point count per bucket and use an integer/short histogram (ints seem like a reasonable choice).
    If we print both measurement methods for the same run we can see the difference between the raw data and the HDR representation, which is naturally slightly less accurate (# lines are HdrHistogram, @ lines are raw data, each line represents 1M data points):
    We can see that reported percentiles are pretty close to the raw data:

    • Note that nanoTime on Mac reports in µs granularity, which is why the real values (@ lines) all end with 3 zeros.
    • Note that the max/min reported are adjusted to the correct histogram resolution (not a big deal, but slightly surprising).

    What would have happened with our hand rolled solution? To keep a range of 0 to 60,000,000,000ns in a linear histogram of the same memory footprint we would need to limit ourselves to roughly 220k/8=64k buckets. Each bucket would have a granularity of roughly 1ms, which would have translated to very limited visibility on the lower end of the spectrum, as most data sets are actually all under 1ms. This would have also completely skewed our percentiles (i.e. 99.99% of results land in the first 1ms bucket, leaving no percentile breakdown of the behaviour). We could try to tackle the issue by picking a lower range to cover (which, if applied to HdrHistogram, will minimize memory usage further) or by blowing the memory budget on finer grained buckets.

    Example: Aeron samples, much percentiles! Such graphs! Wow!

    Percentiles are commonly used to describe the latency SLA of a given system and a typical application will report a range of percentiles to reflect the probability of a given response time. In this context we can say that a 99%ile latency of 1ms means that 99% of all requests were handled in <= 1ms.
    Aeron is a low latency, reliable UDP messaging library. Since latency is an obvious concern, all the samples utilize the HdrHistogram lib to demonstrate measurement and report results, here are the relevant excerpts from the Ping sample:
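    Not the actual Aeron sample code, but the measurement side boils down to something like this sketch (the onPong callback and the send timestamp hand-off are made up for illustration):
    import java.util.concurrent.TimeUnit;
    import org.HdrHistogram.Histogram;

    public class PingMeasurement {
        // track round trips up to an hour with 3 significant decimal digits
        private final Histogram histogram = new Histogram(TimeUnit.HOURS.toNanos(1), 3);

        public void onPong(long sendTimeNanos) {
            histogram.recordValue(System.nanoTime() - sendTimeNanos);
        }

        public void printResults() {
            histogram.outputPercentileDistribution(System.out, 1000.0); // values in µs
        }
    }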
    This results in a pile of text getting poured into the console, not that friendly:
    Fear not, HdrHistogram comes packed with a handy charty thingy! Here's what the above histogram looks like when plotted:

    To get this graph:
    • Save output above to a text file
    • Open a browser, and go here (the same HTML is in the project here)
    • Choose your file, choose percentiles to report and unit to report in
    • Export the picture and stick it in your blog!
    This histogram was provided to me by Martin Thompson (one of the Aeron developers) and is from a test run in a performance lab. We can see that Aeron is delivering a solid 7µs RTT up to the 90%ile where latency starts to gradually grow. In this particular data set the maximum observed latency was 38µs. This is a great latency profile. It is far more common for the max and 99.99%ile to be orders of magnitude more than the 90%ile.
    I could similarly plot this histogram using a gnuplot script to be found here. The gnuplot script is very handy for scripted reporting on large runs. It also allows for plotting several histograms on the same graph to allow visual comparison between benchmark runs for instance.

    Example: Compressed histogram logging

    Because the SLA is often specified in percentiles, it is common for applications to log only percentiles and not histograms. This leads to a reporting problem, as it turns out that percentile outputs cannot be combined to produce meaningful aggregate percentiles. The solution would be to log the full histogram data, but who wants a log that grows by 31KB every 5 seconds just to capture one histogram? Worry not, HdrHistogram comes with a compressed logging format and log writer and all that good stuff:
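    Here's a sketch of the log writer in action (the file name and 5 second interval are made up for the example):
    import java.io.FileNotFoundException;
    import java.util.concurrent.TimeUnit;
    import org.HdrHistogram.Histogram;
    import org.HdrHistogram.HistogramLogWriter;

    public class IntervalLogging {
        public static void main(String[] args) throws FileNotFoundException, InterruptedException {
            HistogramLogWriter logWriter = new HistogramLogWriter("latency.hlog");
            logWriter.outputLogFormatVersion();
            logWriter.outputStartTime(System.currentTimeMillis());
            logWriter.outputLegend();

            Histogram interval = new Histogram(TimeUnit.MINUTES.toNanos(1), 2);
            for (int i = 0; i < 12; i++) {
                interval.recordValue(TimeUnit.MICROSECONDS.toNanos(100 + i)); // stand-in for real measurements
                TimeUnit.SECONDS.sleep(5);
                logWriter.outputIntervalHistogram(interval);                  // compressed interval histogram
                interval.reset();
            }
        }
    }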
    How many bytes is that logged histogram costing us? The 31KB histogram compressed down to 1KB in my example, but the result of compression will depend on the histogram captured. It is fair to assume that histograms compress well as the array of buckets is full of zeros (on the byte level) as most buckets are empty or low count.
    If 1KB sounds like a lot, consider that a day's worth of 10s interval histograms will result in an 8MB log file, which seems pretty acceptable even if you have a hundred such files. The benefit is that you will now have high precision interval latency data that you can reliably use to create longer interval latency data. You can use the HistogramLogProcessor to produce a full or partial log summary histogram for plotting as above.
    I believe there are some truly exciting data visualizations one could build on top of this data, but sadly that's not where my skills lie. If you've got skills to show off in this area I'm sure HdrHistogram would value your contribution.

    Example: jHiccup and concurrent logging

    jHiccup is a pause measurement tool used to capture OS or JVM 'hiccups'. It deserves its own post but I'll try and summarize it in a few points:
    • jHiccup runs a HiccupRecorder thread which sleeps for a period of time (configurable) and measures the delta between the wakeup time and actual time. The failure to be re-scheduled is taken as a potential OS/JVM hiccup. The size of the hiccup is recorded in a histogram.
    • jHiccup can be run as an agent in your own process, an external process, or both.
    • jHiccup has been ported to C as well.
    • People typically use jHiccup to help characterize and diagnose disruptions to execution in their system. While not every hiccup is the result of a STW pause we can use the jHiccup agent evidence, correlated with an external jHiccup process and the JVM GC logs, to support root cause analysis. A significant hiccup is a serious sign of trouble, meaning a thread was denied scheduling for the length of the hiccup. We can safely assume in most cases that this is a sign that other threads were similarly disturbed.
    Gil Tene originally wrote HdrHistogram as part of jHiccup, but as HdrHistogram turned out to be more generally useful the two were split. The reason I bring jHiccup up in this context is that it serves as a regularly maintained full blown real world example of using an HdrHistogram.
    jHiccup has 2 interesting threads, with roles that parallel many real world applications out there:
    • The measuring thread (HiccupRecorder): This is the thread that sleeps and wakes up and so on. The rate at which it does that is potentially quite high and we don't want to skew the measurement by performing IO on this thread. Similarly, many real world applications will have critical threads where it is not desirable to introduce IO. Since this is the case actual persistence will be performed on another thread.
    • The monitoring/logging thread(HiccupMeter): This thread will wake up at regular intervals and write the last interval's histogram to the log file. But since it is reading a histogram while another thread is writing to the histogram we now need to manage concurrency.
    HdrHistogram offers a synchronization facility to serve exactly this use case in the form of the Recorder:
    • Recording a value in the recorder is a wait free operation (on JDK8, can be lock free on older depending on the getAndAdd implementation for AtomicLongArray).
    • The Recorder also comes in a single-writer flavour, which minimizes the concurrency related overheads.

    Under the covers the recorder is using an active histogram and an inactive one, swapped seamlessly when an interval histogram is requested. Using a Recorder looks much like using a normal histogram(full code here):
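    A sketch of that setup, with the measuring thread recording and the logging thread swapping out an interval histogram to write (names here are illustrative):
    import org.HdrHistogram.Histogram;
    import org.HdrHistogram.HistogramLogWriter;
    import org.HdrHistogram.Recorder;

    public class HiccupStyleLogging {
        private final Recorder recorder = new Recorder(3);   // 3 significant decimal digits
        private Histogram intervalHistogram;                 // recycled between intervals

        // measuring thread, potentially at a high rate:
        public void recordHiccup(long hiccupNanos) {
            recorder.recordValue(hiccupNanos);               // wait-free on JDK8
        }

        // logging thread, once per interval:
        public void logInterval(HistogramLogWriter logWriter) {
            intervalHistogram = recorder.getIntervalHistogram(intervalHistogram);
            logWriter.outputIntervalHistogram(intervalHistogram);
        }
    }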
    And that's concurrent logging sorted ;-).

    Summary

    With HdrHistogram now hitting version 2.1.4 and offering a wealth of tried and tested functionality along with cross platform implementations and a standardized compressed logging format it is definitely time you gave it a go! May your latencies always be low!
    If you are looking for a pet project and have a gift for UI thingies a latency explorer for the histogram interval logs would be an awesome contribution!

    Correcting YCSB's Coordinated Omission problem

    The Nimbus Cloud Serving Board
    YCSB is the Yahoo Cloud Serving Benchmark (also on wiki): a generic set of benchmarks setting out to compare different key-value store providers under a set of loads:
    The goal of the Yahoo Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different "key-value" and "cloud" serving stores.
    The code is open for extension and contribution and all that good stuff, you can get it here. And it has become a tool for comparing vendors in the NoSQL space. The benchmarks set out to measure latency and throughput. The terms are not directly defined in the paper, but the following statement is made:
    The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. [...] Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. [...] The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.
    What could possibly go wrong?™
    It can go this wrong, for instance: an order of magnitude difference in results for different percentiles, leading to some poor decision making on how much hardware you'll need, leading to getting fired from your job and growing old bitter and twisted mumbling to yourself as you get drunk on the street corner until you freeze to death on a winter night. So potentially this is a risk to your future well being, listen up!


    It's broken? Coordinated WOT?

    When you measure latency bad,
    Mr. Tene is sad
     :(
    My colleague at Azul, Gil Tene, the magnificent, glorious, multi-dimensional, coding CTO, officer and gentleman (that's my bonus sorted) has been doing a lot of latency related preaching and teaching in the last few years. He has given the following talks at any number of conferences, but if you happened to have missed them, watch them NOW:
    In particular he has coined the term "Coordinated Omission" (see raging discussion on Mechanical Sympathy) to describe instances in which the measuring/monitoring system coordinates measurement with the system under test/measurement such that samples are biased. This issue manifests in many load generating frameworks where the call into the system under test is done synchronously and the measurement thread holds up the next sample while the call is ongoing. This enables the system under test to delay requests that would have been made during the synchronous call thus skewing the sample set. Consider for example a system where:
    • We set out to measure a request every 1ms (from a single thread, synchronously)
    • The first 500 calls come back in 100µs each (so call K starts at Kms and returns at Kms + 100µs )
    • Call 501 takes 500 milliseconds (starts at 500ms, returns at 1 second)
    • Call 502 takes 100µs
    See the problem?
    The problem is that call 502 did NOT happen at its designated time, and saying it took 100µs fails to capture this. It failed the assumptions laid out in the first sentence because we were blocked for 500ms. If we were to stick to our original schedule we would be making calls 502 to 1000 in the time it took for call 501 to execute. How should we treat this departure from plan?
    Is it safe to come out yet?
    1. Ignore it! it will go away by itself! - This is the coordinated omission way. We are now reporting numbers that are no longer according to the test plan, which means that our "latency versus throughput curves" are off the mark. This is a very common solution to this issue.
    2. Fail the test, we wanted a call every ms and we didn't get that - This is an honest hardline answer, but it potentially throws the baby with the bath water. I think that if you set out to schedule 1000 calls per second you might want to see how often this falls apart and how. But this answer is The Truth™, can you handle it? If one is to start from scratch and write their own load generator I propose a read of the Iago load test framework philosophy page: "Iago accurately replicates production traffic. It models open systems, systems which receive requests independent of their ability to service them. Typical load generators measure the time it takes for M threads to make N requests, waiting for a response to each request before sending the next; if your system slows down under load, these load testers thus mercifully slow down their pace to match. That's a fine thing to measure; many systems behave this way. But maybe your service isn't such a system; maybe it's exposed on the internet. Maybe you want to know how your system behaves when N requests per second come in with no "mercy" if it slows down.". This is a fine sentiment.
    3. Coordinated Omission correction: Adjust the results to reflect the expected call rate. This can be done in a straightforward manner if the 'missing' calls are added back with a latency which reflects the period for which they were delayed. This correction method is supported out of the box by HdrHistogram but the discussion regarding its over- or under-estimation of the impact of the delay is outside the scope of this post.
    4. Coordinated Omission avoidance: Measure all calls according to original schedule. We are now saying: "If I can't make the call, the meter is still running!". This is particularly relevant for systems where you would typically be making the requests to the system under test from a thread pool. That thread pool would be there to help you support asynchronous interaction where the API failed in giving you that option. Like JDBC... Like many key-value pair provider APIs.
    This last solution is the one we'll go for in this post, but I would urge you to consider the results critically. In particular if you are trying to simulate independent access to a web server (as opposed to a DB via a thread pool) then the adherence to schedule might be hugely optimistic of the results in your case. This is because failing to generate independent load may have all sorts of beneficial effects on the system under test.
    For the YCSB benchmark I'm assuming the harness/load generator is simulating a web serving layer accessing the key-value store in an effort to serve an unbounded, uncoordinated user request load via a predefined thread pool. So it's door number 4 for me. The corrected load generator is here.


    Step 0: Some preliminary work (not strictly required)

    HdrHistogram, as approved by Freddie!
    As described previously here, we should all just get on with capturing latency using HdrHistograms. So as a first step toward correcting YCSB I have gone in and added an HdrHistogram measurement container. This is pretty straight forward as all I needed to modify was the Measurements class to allow a new measurement type. While I was there I tweaked this and that and the following list of changes to that class emerged:
    1. Add new measurement type and corresponding command-line option("-p measurementtype=hdrhistogram")
    2. Add combined measurement option allowing old/new measurement side by side: "hdrhistogram+histogram"
    3. Add support for capturing both corrected and uncorrected measurements for the same run.
    4. Use CHM instead of synchronizing around a HashMap.
    The new measurement type supports logging lossless HdrHistogram data to a file (controlled by the hdrhistogram.fileoutput=<true|false> option and the hdrhistogram.output.path=<path> option) as well as better precision percentile data and lock free logging of latencies. This is not very interesting work but if you are interested in the "How would I plug in my own data structure to capture latency into YCSB?" topic have fun. It was not necessary for correction but it was good to do so that better quality results can be observed. You're welcome.


    Step 1: Demonstrate the issue

    YCSB includes a very useful means of verifying the measurements in the form of a mock DB driver. This means we can test our assertions regarding coordinated omission without setting up a key value store of any kind. The mock DB is called BasicDB and is the default DB used. We can configure it to simulate a pause and see what happens (-p basicdb.verbose=false -p basicdb.simulatedelay=4  will make the mock DB stop logging every action and simulate a latency of 0-4ms for each action). I added a further option to the BasicDB which allows us to turn off the randomization of the delay (-p basicdb.randomizedelay=false).
    Let's consider our expectations in the case where a DB simply cannot handle request quickly enough. We can setup an experiment with the following settings: -target 1000 -threads 1 -s -p status.interval=1 -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -p measurementtype=hdrhistogram -p maxexecutiontime=60
    Here's what they all mean:
    • -target 1000 -> We aim to test 1000 requests per second
    • -threads 1 -> We have a single client thread
    • -s -p status.interval=1 -> We will be printing out status every second (I made the status interval configurable)
    • -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -> The DB will sleep 4ms on each request, so the maximum we can hope for is 250, no noisy printing per operation please
    • -p measurementtype=hdrhistogram -> Use HdrHistogram to capture the latencies
    • -p maxexecutiontime=60 -> Run for one minute, then exit and print summary
    This DB is obviously failing, it can't keep up with the rate of incoming requests and according to our model they queue up. The time measured per call is reflected in the summary for the READ operations:
    [READ], Operations, 12528.0
    [READ], AverageLatency(us), 4477.102809706258
    [READ], MinLatency(us), 4018.0
    [READ], MaxLatency(us), 44703.0
    [READ], 95thPercentileLatency(ms), 4.0
    [READ], 99thPercentileLatency(ms), 4.0
    [READ], Return=0, 12528
    But this completely ignores the time spent on the queue. If we were measuring according to schedule we'd get the following set of latencies:
    Latency[k] = 4 + 3*(k-1) ms
    The max latency would be for the last request to get in. We ran for 60 seconds, at 250 requests/sec which means our last request was (k=15000) and had a latency of 45 seconds when measured from the time we intended to make it. This number reflects the system's failure to handle load far more correctly than the numbers quoted above.


    Step 2: Working to Schedule

    The YCSB load generator has a weak notion of schedule, in the sense that it opts for option number 1 above and will just execute the operations when it can. When faced with the task of correcting this kind of issue (in a pile of foreign code) we need to look for 2 things in the load generator:
    1. "Scheduling an action to run at time X" - This will involve some calls to one of the many scheduling facilities in the JDK:
      1. Thread.sleep is an old favourite, but TimeUnit also supports a sleep method. A search for sleep in the code base will cover both. This is what YCSB was using to schedule the next event to fire.
      2. Code submitting tasks to java.util.Timer, or alternatively the ScheduledExecutorService
      3. Code using LockSupport.parkNanos
      4. Object.wait(...)
      5. others?
    2. "Measuring the length of an operation" - This will involve calls to System.nanoTime() or currentTimeMillis(). For YCSB this is found to happen for example here.

    To correct this problem I had to introduce the concept of 'intended start time' to the operations measurement. Schedule for YCSB is specified by the -target command line option which sets the overall number of operations per second to be attempted by the load generator. This is optional, and the default is to go as fast as you can manage, i.e. with no schedule but the back pressure from the system under test to guide us. I'm not sure what a reasonable assumed request rate would be in this case, so I did not correct it. NOTE: If you don't specify target no correction will take place.
    The target parameter is translated to a per-thread operation rate (number of threads is set via the threads option, default is 1) so if we have 10 threads, and the target request rate is 1000 (-target 1000 -threads 10) we will have each thread hitting the store with 100 requests per second. The client threads randomize the first operation time to avoid all hitting the store on the same interval. I did some ground work here by setting the units across the board to nanoseconds and naming interval parameters appropriately, nothing too exciting.
    The actual correction at its core involves:
    1. Record the operation's intended start time
    2. Use the intended start time when computing latency
    Sadly the way YCSB measures latency does not lend itself to a simple in place fix. The operations are scheduled by the ClientThread which calls into a workload, calling into a DB, which is actually the DBWrapper which measures the latency (for calling into an actual DB implementation) and reports it to the central Measurements singleton. This means that changing the Workload/DB API to include a startTime parameter to each call is quite a far reaching change which would require me to dig through all the DB driver implementations and would result in a very unpleasant time for all.
    I settled on using a thread local on the Measurements object to transfer the start time to the DBWrapper, it is not a nice way to do things (and I'm happy to hear better suggestions) but it does the job without modifying the API.
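    In sketch form (simplified, not the actual YCSB code) the hand-off looks something like this:
    // Measurements: carries the intended start time across the call chain via a thread local
    public class Measurements {
        private static final ThreadLocal<Long> intendedStartTimeNs = new ThreadLocal<>();

        public void setIntendedStartTimeNs(long timeNs) {
            intendedStartTimeNs.set(timeNs);
        }

        public long getIntendedStartTimeNs() {
            return intendedStartTimeNs.get();
        }
    }
    // ClientThread, just before invoking the operation:
    //     measurements.setIntendedStartTimeNs(scheduledTimeNs);
    // DBWrapper, once the operation returns:
    //     long intendedLatencyNs = endTimeNs - measurements.getIntendedStartTimeNs();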
    Once we have:
    1. ClientThread setting up the start time for the operation via Measurements
    2. DBWrapper using the start time from Measurements to compute the operation latency
    That's pretty much it. For extra points I wanted to include some facilities to compare measurements before/after the change. These can be removed if we accept HdrHistogram as a replacement and if we accept we only want to measure the intended latency, which would result in a much smaller PR.


    Step 3: is the issue solved?

    Running the setup from step 1 such that it produces the intended latency as well as the original measurement side by side(-p measurement.interval=both) yields the following result for the READ operations:
    [READ], Operations, 12414.0
    [READ], AverageLatency(us), 4524.981069759949
    [READ], MinLatency(us), 4018.0
    [READ], MaxLatency(us), 24703.0
    [READ], 95thPercentileLatency(ms), 4.0
    [READ], 99thPercentileLatency(ms), 4.0
    [READ], Return=0, 12414
    [Intended-READ], Operations, 12414.0
    [Intended-READ], AverageLatency(us), 2.359010991606251E7
    [Intended-READ], MinLatency(us), 4256.0
    [Intended-READ], MaxLatency(us), 4.6989311E7
    [Intended-READ], 95thPercentileLatency(ms), 42369.0
    [Intended-READ], 99thPercentileLatency(ms), 46530.0

    This reflects the effect a backed up system would have on latency as we discussed in Step 1 above. It's actually a bit worse, because the average cost of calling the mock DB with a sleep of 4ms is 4.5ms. As we can see the maximum latency is 46.9 seconds, reflecting the fact that the last read to execute was scheduled to hit the system 13.1 seconds into the run.


    Step 4: The limitations of the harness

    We can now also consider the perfect DB, for the sake of observing the shortcomings of the test harness, by setting the mock DB delay to 0 (-p basicdb.simulatedelay=0):
    [READ], Operations, 56935.0
    [READ], AverageLatency(us), 0.01796785808377975
    [READ], MinLatency(us), 0.0
    [READ], MaxLatency(us), 49.0
    [READ], 95thPercentileLatency(ms), 0.0
    [READ], 99thPercentileLatency(ms), 0.0
    [READ], Return=0, 56935
    [Intended-READ], Operations, 56935.0
    [Intended-READ], AverageLatency(us), 232.37026433652412
    [Intended-READ], MinLatency(us), 0.0
    [Intended-READ], MaxLatency(us), 39007.0
    [Intended-READ], 95thPercentileLatency(ms), 0.0
    [Intended-READ], 99thPercentileLatency(ms), 0.0

    How come it takes so long to measure a noop? Why such large differences? Here are some generic theories and how they panned out:
    • The JVM running the load generator is running with suboptimal settings(-Xms64m -Xmx64m -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime, Oracle JDK8u31) on a busy Mac laptop running on battery
    This is no way to benchmark anything, but the interesting thing is that if we have no schedule to stick to the test harness is willing to just ignore the issue. If we run on a decent machine (with a decent OS) we get nicer results. This is from a server class machine running CentOS6.3/OracleJDK8u25 with same settings:
    [READ], Operations, 56930.0
    [READ], AverageLatency(us), 0.44417705954681186
    [READ], MinLatency(us), 0.0
    [READ], MaxLatency(us), 20.0
    [READ], 95thPercentileLatency(ms), 0.0
    [READ], 99thPercentileLatency(ms), 0.0
    [READ], Return=0, 56930
    [Intended-READ], Operations, 56930.0
    [Intended-READ], AverageLatency(us), 146.31262954505533
    [Intended-READ], MinLatency(us), 15.0
    [Intended-READ], MaxLatency(us), 14255.0
    [Intended-READ], 95thPercentileLatency(ms), 0.0
    [Intended-READ], 99thPercentileLatency(ms), 0.0
    This is still significant.

    • The JVM suffers from warmup related artefacts
    This certainly correlated to the max values I'm seeing here. When looking at the status line for the first second I see:
    [READ: Count=22, Max=14, Min=0, Avg=0.64, 90=0, 99=14, 99.9=14, 99.99=14]
    [Intended-READ: Count=23, Max=14255, Min=15, Avg=5563.39, 90=13287, 99=14255, 99.9=14255, 99.99=14255]
     But after a few seconds the process settles and we see much better results, this is typical:
    [READ: Count=947, Max=14, Min=0, Avg=0.02, 90=0, 99=0, 99.9=2, 99.99=14]
    [Intended-READ: Count=946, Max=194, Min=61, Avg=151.29, 90=165, 99=175, 99.9=186, 99.99=194]
    A good way to handle this issue is by relying on the HdrHistogram output files to grab relevant time periods for analysis. With the original YCSB output we have the percentile summary data, but this is not something we can combine for analysis. With the lossless interval histogram logs we can look at any sub-period (which is longer than one interval, but shorter than the whole run) and get accurate full range histogram data. A common practice is to discard warmup period results; I'm not a fan of throwing away data, but since this is the load generator warmup I'd think it's quite legitimate. It's perhaps an interesting feature to add to such a framework that the framework can be warmed up separately from the system, to examine cold system behaviour.
    • Thread.sleep/LockSupport.parkNanos are not super accurate and may wakeup after the intended operation start time
    I've added an option for spinning instead of sleeping (so burn a CPU). This has improved the average value dramatically from ~146µs to ~3.1µs. A typical status line now looks like:
    [READ: Count=947, Max=13, Min=0, Avg=0.02, 90=0, 99=0, 99.9=3, 99.99=13]
    [Intended-READ: Count=948, Max=57, Min=0, Avg=0.47, 90=1, 99=12, 99.9=26, 99.99=57]
    It is obviously not desirable for the load generator to burn a CPU instead of sleeping, but sleeping introduces scheduling inaccuracies, an accuracy issue we didn't have to deal with when not measuring from a schedule. Spinning didn't change the magnitude of the measured outliers, but it has dramatically reduced their number. The take away here is just that there are accuracy limitations to the load generator's ability to stick to schedule.

    • GC pauses that are large enough to derail the schedule on the load generator side are now captured. If we don't track the intended start time, a GC pause that falls between measured operations (in the measurement gap) leaves no trace that we have gone off schedule.
    We should capture GC logs on load generator side and make sure we correlate the GC events with recorded latencies. Here's a GC pause being captured by the corrected measurement:
    [READ: Count=952, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0] 
    [Intended-READ: Count=952, Max=14, Min=0, Avg=0.03, 90=0, 99=0, 99.9=3, 99.99=14]
    [GC (Allocation Failure) [PSYoungGen: 17895K->1824K(18944K)] 17903K->1840K(62976K), 0.0024340 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
    Total time for which application threads were stopped: 0.0026392 seconds
    [READ: Count=957, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0]
    [Intended-READ: Count=957, Max=2719, Min=0, Avg=5.21, 90=0, 99=0, 99.9=2119, 99.99=2719]
    This process is running with a 64M heap, you can expect longer pauses as the heap grows (in particular as the young generation grows).

    • The operation setup time is now being measured as well as the operation itself.
    When running with the spin option we can see the average operation cost is ~3.1µs, this is all test harness overhead and is really quite negligible in the context of network hopping operations. In other words, nothing to worry about for this harness but could well prove an issue for others.


    Step 5: The Good, The Bad And The STW pausing DB

    Many software processes have a latency profile that is far from normally distributed. To see what YCSB makes of this kind of profile, now that we can compare corrected vs. uncorrected measurement, I have built a mock DB that has 4 modes of latency (p is a uniform random number in [0,1]):
    1. Awesome (p < 0.9): we return in 200µs-1ms
    2. Good (0.9 < p < 0.99): we return in 1-10ms
    3. Minor Hiccup( 0.99 < p < 0.9999): we hit a bump, but only one thread is affected 10-50ms
    4. Major Hiccup(0.9999 < p): we hit a STW pause(because GC/THP/LBJ/STD/others), all threads stop for 50-200ms
    I implemented the above with a read write lock, where the STW pause grabs the write lock and all the others grab the read lock. This is far from a perfect representation of a system (everyone waits for STW as intended, but also STW waits for everyone to start), but it will do. If you feel strongly that a better simulation is in order, write one and let's discuss in the comments!
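    For reference, here's a sketch of what such a mock might look like (hypothetical code, not the exact implementation; the sleep ranges follow the modes above):
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class HiccupingMockDB {
        private final ReadWriteLock lock = new ReentrantReadWriteLock();

        public void read() throws InterruptedException {
            double p = ThreadLocalRandom.current().nextDouble();
            if (p < 0.9999) {
                lock.readLock().lock();                          // normal operations share the read lock
                try {
                    if (p < 0.9)        sleepMicros(200, 1_000);     // awesome
                    else if (p < 0.99)  sleepMicros(1_000, 10_000);  // good
                    else                sleepMicros(10_000, 50_000); // minor hiccup, this thread only
                } finally {
                    lock.readLock().unlock();
                }
            } else {
                lock.writeLock().lock();                         // major hiccup, everyone waits
                try {
                    sleepMicros(50_000, 200_000);
                } finally {
                    lock.writeLock().unlock();
                }
            }
        }

        private static void sleepMicros(long fromMicros, long toMicros) throws InterruptedException {
            long micros = ThreadLocalRandom.current().nextLong(fromMicros, toMicros);
            Thread.sleep(micros / 1000, (int) ((micros % 1000) * 1000));
        }
    }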
    What sort of profile will we see? How far off course will our measurements be if we don't stick to schedule? Here's this setup run at a rate of 10,000 requests per second, with 25 threads (so each thread is trying for 250 reqs/sec or 1 request per 4ms):
    [READ], Operations, 569516.0
    [READ], AverageLatency(us), 1652.1852871561116
    [READ], MinLatency(us), 210.0
    [READ], MaxLatency(us), 142463.0
    [READ], 95thPercentileLatency(ms), 1.0
    [READ], 99thPercentileLatency(ms), 19.0
    [READ], Return=0, 569516

    According to these numbers, the max is quite high but the overall impact of hiccups is not too severe (all depends on your requirements of course). Even at this stage we can see that the effect of global pauses is skewing the other measurements (if you hit a short operation while a STW pause is in progress you still have to wait for the STW event to finish).
    The corrected measurements tell a different story:
    [Intended-READ], Operations, 569516.0
    [Intended-READ], AverageLatency(us), 24571.6025835973
    [Intended-READ], MinLatency(us), 268.0
    [Intended-READ], MaxLatency(us), 459519.0
    [Intended-READ], 95thPercentileLatency(ms), 83.0
    [Intended-READ], 99thPercentileLatency(ms), 210.0


    How can this be right? Can this be right?
    • At a rate of 10000 request per second, the unlikely Major Hiccup is likely to happen every second. Consider this next time someone tells you of a 99.99%ile behaviour. Given an event rate of 10K per second, 99.99% is suddenly not very rare. Consider that at this rate there's likely to be a few events that are worse.
    • The average major hiccup is 125ms long, in this time 125/4 events are delayed on all 25 threads -> 125 * 25 / 4 = 781 events are delayed from starting, they will further delay each other as they execute. In roughly 12 seconds we can see how it is quite probable that one of these events is another major hiccup. What with all the queuing up behind the first one etc, the pileup becomes quite reasonable.
    • The probability of a 'mode' is not the probability of the per event latency once STW and queuing effects are in play.
    I've made the mock DB print out 'OUCH' every time we get slapped with a STW event. It turns out that we got very unlucky in this run and hit three of these in a row:
    56 sec:
    [READ: Count=9192, Max=83903, Min=238, Avg=1745.13, 90=1531, 99=26959, 99.9=79551, 99.99=83775]
    [Intended-READ: Count=9208, Max=159999, Min=303, Avg=16496.92, 90=54271, 99=103807, 99.9=150527, 99.99=158335]
    OUCH
    OUCH
    OUCH
    57 sec: 
    [READ: Count=9642, Max=129727, Min=247, Avg=2318, 90=1799, 99=40607, 99.9=125631, 99.99=127359]  
    [Intended-READ: Count=9635, Max=459519, Min=320, Avg=102971.39, 90=200319, 99=374271, 99.9=442367, 99.99=457983]

    This is quite telling.
    The view on what the worst second in this run was is wildly different here. Because the uncorrected measurement takes each event as it comes, it will take the view that 75 events were delayed by these hiccups, and none by more than 130ms. But from the corrected measurement point of view all the queued up measurements were affected and were further delayed by each other.
    I've re-run, this time logging interval histograms in their compressed form for every second in the run. Logging a 60 seconds run with 1 second interval data cost me 200k (we can tweak the construction in OneMeasurementHdrHistogram to minimize the cost). I can take the compressed logs and use the HistogramLogProcessor script provided with HdrHistogram to process the logs (you need to copy the HdrHistogram.jar into the script folder first). Running:
    ./HistogramLogProcessor -i READ.hdr -o uncorrected -outputValueUnitRatio 1000
    ./HistogramLogProcessor -i Intended-READ.hdr -o corrected -outputValueUnitRatio 1000
    Will produce *.hgrm files for both. I then use the plotFiles.html to generate the following comparison:


    They tell very different stories, don't they?
    The red line will have you thinking your system copes gracefully up to the 99%ile, slowly degrading to 20ms. When measuring correctly, however, the system is shown to degrade very quickly, with the 20ms line crossed as early as the median and the 99%ile being 10 times the original measurement. The difference is even more pronounced when we look at one of those terrible seconds where we had back to back STW hiccups. I can use the HistogramLogProcessor script to produce partial summary histograms for the 3 seconds around that spike:
    ./HistogramLogProcessor -i Intended-READ.hdr -o correctedOuch3 -outputValueUnitRatio 1000 -start 1425637666.488 -end 1425637668.492
    Similarly we can compare a good second with no STW pauses:


    Summary

    Coordinated Omission is a common problem in load generators (and other latency reporters), we had a look at fixing YCSB, an industry standard load generator:
    • Replaced the data structure used to capture latency with HdrHistogram: that is just generally useful and gives us better data to work with when examining the corrected measurement
    • Found scheduling code and introduced notion of operation start time.
    • Found measuring code and captured both operation cost (uncorrected measurement) and scheduled time latency (corrected measurement).
    • Use a mock system under test to evaluate measurement of known scenario. This is a very handy thing to have and luckily YCSB had this facility in place. In other places you may have to implement this yourself but it's a valuable tool to have in order to better understand the measurement capabilities of your harness. This helped highlight the scale of scheduling inaccuracies and test harness overhead per operation, as well as the scale of test harness error during its own warmup period.
    • Use HdrHistogram facilities to visualise and analyse latency histogram data from the compressed histogram logs.
    Thanks go to this post's kind reviewers: Peter Huges, Darach, and Philip Aston


    Announcing the JAQ(Java Alternative Queues) Project


    {This post is part of a long running series on lock free queues, checkout the full index to get more context here}
    To quote Confucius:
    "To learn and then practice it time and again is a pleasure, is it not? To have friends come from afar to share each others learning is a pleasure, is it not? To be unperturbed when not appreciated by others is gentlemanly, is it not?" - Analects 1:1
    It is obvious to me the old man was talking about open source software, where we repeat what we learn, share with friends from afar, and try and behave when no one seems to get it. In this spirit I am going to try and apply lessons learnt and put together a concurrent queues library for Java - Java Alternative Queues.
    It's early stages, but at this point I would value some feedback on:
    1. Intentions
    2. Interfaces and usability
    3. Project roadmap

    Intentions

    When concurrent queues are concerned, it is my opinion that the JDK offering has been robust, but too generic to benefit from the performance gains offered by a more explicit declaration of requirements. JAQ would tackle this by providing queues through a requirements focused factory interface allowing the user to specify upfront:
    1. Number of producers/consumers
    2. Growth: Bounded/Unbounded
    3. Ordering (FIFO/other)
    4. Size
    5. Prefer throughput/latency
    To see a wider taxonomy of queues see 1024Cores.net's excellent analysis. At this point all the queues I plan to implement are non-blocking and lock-free as my main interest lies in that direction, but many of the principles may hold for blocking implementations and those may get added in the future.

    Interfaces and Usability

    I like the idea of separating several entities here:
    • ConcurrentQueueFactory - Tell me what you need, through a ConcurrentQueueSpec.
    • ConcurrentQueue - The queue data structure, provided by the factory. At the moment it does nothing but hand out producer/consumer instances. This is where pesky methods such as size/contains may end up. I'm not keen on supporting the full Queue interface so feedback on what people consider essential will be good.
    • ConcurrentQueueConsumer - A consumer interface into the queue, provided by the queue. I'm planning to support several consumption styles.
    • ConcurrentQueueProducer - A producer interface into the queue, provided by the queue.
    The producer/consumer instances are thread specific and the appropriate thread should call into the queue provider method. Here is the old QueuePerfTest converted to use the new interface (I cleared out the irrelevant cruft for timing this and that):
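    Stripped down, using the proposed interfaces looks roughly like this (the spec construction and method names are illustrative guesses at the API, not the final thing):
    // build a queue from a requirements spec (illustrative constructor: 1 producer, 1 consumer,
    // bounded to 1024 elements, FIFO, prefer throughput)
    ConcurrentQueueSpec spec = new ConcurrentQueueSpec(1, 1, 1024, Ordering.FIFO, Preference.THROUGHPUT);
    ConcurrentQueue<Integer> queue = ConcurrentQueueFactory.newQueue(spec);

    // producer thread:
    ConcurrentQueueProducer<Integer> producer = queue.createProducer();
    producer.offer(777);

    // consumer thread:
    ConcurrentQueueConsumer<Integer> consumer = queue.createConsumer();
    Integer e;
    while ((e = consumer.poll()) == null) {
        // spin until a value shows up
    }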

    I realize this goes against the current Queue interface, but part of the whole point here is that the more we know about the calling code the better performance/correctness we can hope to offer.

    Roadmap

    I'd like to tackle the work in roughly the following order:
    • Specify/document/settle on high level interfaces (initial cut done)
    • SPSC implementations/tests/benchmarks (good bounded SPSC implementation is done, some benchmarks)
    • MPSC implementations/tests/benchmarks (some bounded MPSC variants are included but not integrated)
    • SPMC implementations/tests/benchmarks (not started)
    • MPMC implementations/tests/benchmarks (not started)
    There's more I want to do in this area, but the above will keep me busy for a while so I'll start with that and increase the scope when it reaches a satisfying state.
    I'm using JMH (and getting valuable support from @shipilev) for benchmarking the queues and hoping to use JCStress to test multi-threaded correctness. 

    Contributors/Interested Parties

    I know I'll be using this library in the near future for a few projects, but I hope it will be generally useful so your feedback, comments and observations are very welcome. I've not been involved much in open-source projects before, so any advice on project setup is also welcome. Finally, if you feel like wading in and cutting some code, adding some tests or benchmarks, reporting some bugs or expressing interest in features BRING IT ON :-) pull requests are very welcome.

    On Arrays.fill, Intrinsics, SuperWord and SIMD instructions

    {This post turned rather long, if you get lazy feel free to skip to the summary}
    Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:
    "intrinsic functions show up as normal methods or native methods"
    Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace its apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worth optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:
    1. Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
    2. Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
    3. Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
    4. A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.
    The intrinsics are all set up in vmSymbols.hpp and if you look, you'll see Arrays.fill is NOT on the list. So why am I talking about Chewbacca? Because it is something like an intrinsic...

    The Arrays.fill SIMD Opportunity

    Arrays.fill is the Java memset (it fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) it is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? The opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs, i.e. Sandy Bridge onwards; I find this listing of Intel intrinsics useful to explain and sort through the available instructions). These instructions allow the CPU to operate on up to 256 bit (512 bit soon) chunks of data, thus transforming 32 byte-sized MOV instructions into a single wide MOV instruction (e.g. the Intel C intrinsic _mm256_storeu_si256 or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.
    The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
    Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:

    Roughly speaking the algorithm above is:
    1. Fill up an XMM register with the intended value
    2. Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
    3. Write leftover 32 byte chunk (skipped if no matching leftovers)
    4. Write leftover 8 byte chunks (skipped if no matching leftovers)
    5. Write leftover 4 bytes (skipped if no matching leftovers)
    6. Write leftover 2 bytes (skipped if no matching leftovers)
    7. Write leftover 1 bytes (skipped if no matching leftovers)
    It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation, this might seem complex but is pretty standard... anyway, moving right along.

    The Arrays.fill 'intrinsic'

    Arrays.fill is different from System.arrayCopy because, as its absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:
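    For example, a plain forward loop storing a loop-invariant value into consecutive elements (a minimal sketch of the shape that gets matched):
    static void manualFill(byte[] array, byte value) {
        // simple forward loop, storing a loop-invariant value into every element
        for (int i = 0; i < array.length; i++) {
            array[i] = value;
        }
    }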
    And replacing it with a call into the JVM memset implementation (I recently learnt the same thing is done by GCC as well, see code to assembly here). The pattern matching bit is done in loopTransform.cpp. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called intrinsify_fill.
    Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.
    Let's prove me right (my favourite thing, beats kittens and all that crap). Here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to a hand rolled array copy:
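    The benchmark is roughly along these lines (a sketch: 32KB byte arrays, method names matching the result table below):
    import java.util.Arrays;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class ArrayFill {
        private static final int SIZE = 32 * 1024;
        private final byte[] bytes = new byte[SIZE];
        private final byte[] target = new byte[SIZE];
        private byte value = 1;

        @Benchmark
        public void fillBytes() {
            Arrays.fill(bytes, value);
        }

        @Benchmark
        public void manualFillBytes() {
            for (int i = 0; i < bytes.length; i++) {
                bytes[i] = value;
            }
        }

        @Benchmark
        public void manualReversedFillBytes() {
            for (int i = bytes.length - 1; i >= 0; i--) {
                bytes[i] = value;
            }
        }

        @Benchmark
        public void copyBytes() {
            System.arraycopy(bytes, 0, target, 0, SIZE);
        }

        @Benchmark
        public void manualCopyBytes() {
            for (int i = 0; i < bytes.length; i++) {
                target[i] = bytes[i];
            }
        }
    }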
    And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
    ArrayFill.fillBytes                561.540 ± 10.814 ns/op
    ArrayFill.manualFillBytes          557.901 ± 5.255  ns/op
    ArrayFill.manualReversedFillBytes 1017.856 ± 0.425  ns/op
    ArrayFill.copyBytes               1300.313 ± 13.482 ns/op
    ArrayFill.manualCopyBytes         1477.442 ± 13.030 ns/op

    We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly:

    So what have we learnt so far:
    • Use System.arrayCopy, it is better than your handrolled loop. But surprisingly not hugely better, hmmm.
    • You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...
    • Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.

    Are The Other Types Filling The Love?

    It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).
    The JMH benchmark code is as you'd expect:
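    Something along these lines (a sketch; in the post's results these methods sit in the same ArrayFill class as before, split out here for brevity):
    import java.util.Arrays;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class TypedArrayFill {
        // 32KB of data per array
        private final char[] chars = new char[16 * 1024];
        private final short[] shorts = new short[16 * 1024];
        private final int[] ints = new int[8 * 1024];
        private final float[] floats = new float[8 * 1024];
        private final long[] longs = new long[4 * 1024];
        private final double[] doubles = new double[4 * 1024];
        private final Object[] objects = new Object[8 * 1024];
        private final Object instance = new Object();

        @Benchmark public void fillChars()   { Arrays.fill(chars, 'c'); }
        @Benchmark public void fillShorts()  { Arrays.fill(shorts, (short) 1); }
        @Benchmark public void fillInts()    { Arrays.fill(ints, 1); }
        @Benchmark public void fillFloats()  { Arrays.fill(floats, 1f); }
        @Benchmark public void fillLongs()   { Arrays.fill(longs, 1L); }
        @Benchmark public void fillDoubles() { Arrays.fill(doubles, 1d); }
        @Benchmark public void fillObjects() { Arrays.fill(objects, instance); }
    }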
    Here's some reasonable expectations:
    • If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
    • Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
    • An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card is per 512 bytes, each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.
    • We should not rely on expectations, because performance is best measured.
    As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):
    ArrayFill.fillBytes     561.540 ± 10.814  ns/op
    ArrayFill.fillChars     541.901 ±  4.833  ns/op
    ArrayFill.fillShorts    532.936 ±  4.508  ns/op
    ArrayFill.fillInts      543.165 ±  3.788  ns/op
    ArrayFill.fillFloats    537.599 ±  2.323  ns/op
    ArrayFill.fillLongs     326.770 ±  3.267  ns/op
    ArrayFill.fillDoubles   346.840 ±  5.786  ns/op
    ArrayFill.fillObjects  4388.242 ± 11.945  ns/op

    Say WOT?
    For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
    This is what happens when we turn off the OptimizeFill flag:
    ArrayFill.fillBytes    1013.803 ± 0.227  ns/op
    ArrayFill.fillChars     323.806 ± 3.879  ns/op
    ArrayFill.fillShorts    323.689 ± 4.499  ns/op
    ArrayFill.fillInts      326.336 ± 1.559  ns/op
    ArrayFill.fillFloats    319.449 ± 2.048  ns/op
    ArrayFill.fillLongs     328.692 ± 3.282  ns/op
    ArrayFill.fillDoubles   345.035 ± 6.362  ns/op
    ArrayFill.fillObjects  4397.130 ± 7.161  ns/op

    Strange innit? Now we got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.


    Superword to the rescue! 

    Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):
    In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
    ...
    Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
    ...
    Vector parallelism is a subset of superword level parallelism.
    The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on its impact here, and to do that I only need to know how to turn it on and off (core competency for any software person). It's on by default, so the above results are what happens when it is on; here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
    ArrayFill.fillBytes   8501.270 ±  2.896  ns/op
    ArrayFill.fillChars   4286.161 ±  4.935  ns/op
    ArrayFill.fillShorts  4286.168 ±  3.146  ns/op
    ArrayFill.fillInts    2152.489 ±  2.653  ns/op
    ArrayFill.fillFloats  2140.649 ±  2.587  ns/op
    ArrayFill.fillLongs   1105.926 ±  2.228  ns/op
    ArrayFill.fillDoubles 1105.820 ±  2.393  ns/op
    ArrayFill.fillObjects 4392.506 ± 11.678  ns/op


    Life is revealed in all its sucky splendour! This is what happens when the compiler shows you no love... Did I say no love? Hang on, things can get a bit worse.

    Detour: Unsafe? We don't serve you kind here

    To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
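    The gist boils down to something like the following sketch (UNSAFE grabbed reflectively from sun.misc.Unsafe.theUnsafe; the field names are mine and are set up once in @Setup):
      static final int SIZE = 32 * 1024;
      byte[] buffer;       // on-heap target, new byte[SIZE]
      long bufferOffset;   // UNSAFE.arrayBaseOffset(byte[].class)
      long offheapAddress; // UNSAFE.allocateMemory(SIZE)

      @Benchmark
      public void unsafeFillOnHeapBytes() {
        for (int i = 0; i < SIZE; i++) {
          UNSAFE.putByte(buffer, bufferOffset + i, (byte) 1);
        }
      }

      @Benchmark
      public void unsafeFillOffheapBytes() {
        for (int i = 0; i < SIZE; i++) {
          UNSAFE.putByte(offheapAddress + i, (byte) 1);
        }
      }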
    The results are:
    ArrayFill.unsafeFillOffheapBytes  9742.621 ±  2.270  ns/op
    ArrayFill.unsafeFillOnHeapBytes  12640.019 ±  1.977  ns/op
    ArrayFill.fillBytes (for reference)   561.540 ± 10.814  ns/op

    The Unsafe variants do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
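    Along the lines of (same hypothetical fields as the sketch above):
      @Benchmark
      public void unsafeSetOnHeapBytes() {
        UNSAFE.setMemory(buffer, bufferOffset, SIZE, (byte) 1);
      }

      @Benchmark
      public void unsafeSetOffheapBytes() {
        UNSAFE.setMemory(offheapAddress, SIZE, (byte) 1);
      }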
    With the result:
    ArrayFill.unsafeSetOffheapBytes   1259.281 ± 21.294  ns/op
    ArrayFill.unsafeSetOnHeapBytes    1275.158 ± 27.950  ns/op
    Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.

    Summary and Musings

    It's the circle of life!
    So what did we learn:
    • There's another kind of 'intrinsic' like optimization, which uses pattern matching to swap a block of code rather than a method. This is employed for the intrinsification of memset like memory fill loops (in particular Arrays.fill). It's not an intrinsic technically, but you know what I fucking mean.
    • System.arraycopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
    • The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
    • In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test).
    • If you use Unsafe you will be excluded from these optimizations.
    So I look at this process and I imagine history went something like this:
    We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.
    Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.
    And there's a final punch line to this story. Memset/Memcpy are such common and important opportunities for optimization, so Intel has decided to offer an assembly 'recipe' for them and save everyone the effort in writing them:
    3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
    Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for soft- ware in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual(September 2014)]
    From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the Intel manual discussion of the performance differences is in itself interesting both on the results and the methodology level). One obvious advantage of this method is that it results in much, much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
    [UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
    Thanks go to the kind reviewers Peter 'Massive' Hughes, Darach and the Shipster.

    Porting Pitfalls: Turning D.Vyukov MPSC Wait-free queue into a j.u.Queue

    {This post is part of a long running series on lock free queues, checkout the full index to get more context here}
    D. Vyukov is an awesome lock-free dude, and I often refer to his instructive and invaluable website 1024cores.net in my posts. On his site he covers lock free queue implementations and in particular a wait-free MPSC linked node queue. This is really rather special when you consider that normally MP would imply lock-free rather than wait-free guarantees. I've ported his algorithm to Java (and so have many others: Netty/Akka/RxJava etc.), and had to tweak it to match the Queue interface. In this post I'd like to explain the algorithm, its translation to Java, and the implications of making it a j.u.Queue.

    Lock free vs. Wait free

    Let's review the definitions:
    • Wait-free: thread progress is guaranteed, all operations finish in a bounded number of steps.
    • Lock-free: global progress is guaranteed, though a particular thread may be blocked.

    An example of a transition from lock free to wait free is available with JDK8 changes to AtomicReference::getAndSet(). The change was made by utilizing the newly available Unsafe::getAndSetObject intrinsic which translates directly to XCHG (on x86). So where we used to have for AtomicReference:
    T getAndSet(T newVal) {
      T currentValue;
      do {
        currentValue = val; // val is a volatile field
      } while (!Unsafe.compareAndSwapObject(this, VAL_FIELD_OFFSET, currentValue, newVal));
      return currentValue;
    }
    Now we have:
    T getAndSet(T newVal) {
      return Unsafe.getAndSetObject(this, VAL_FIELD_OFFSET, newVal);
    }
    I discussed a similar change to AtomicLong.getAndAdd in a previous post, replacing the CAS loop with LOCK XADD.

    The Vyukov Wait Free MPSC queue

    This is a LinkedList type structure and the interesting methods are offer and poll, here's the original (I did some formatting):
    Awesome in its simplicity, deceptive in its genius. Be aware that the head/tail meaning is the other way around from what most people are used to. I personally go for producer/consumerNode (head = producer side, tail = consumer side in the above snippet) in my code, but for consistency I'll stick with Mr. V's notation for the porting exercise.
    But how do we manage the same memory barriers in Java? We can be nice about it and use the AtomicFieldUpdater or more brutal and use Unsafe. I find you get better performance with Unsafe, but you should consider how appropriate this is for you. In any case, here's what we end up with:
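    For illustration, here's a minimal sketch of such a port using AtomicReference/AtomicReferenceFieldUpdater rather than Unsafe (the names are mine, not the JCTools ones; head is the producer end and tail the consumer end, per Mr. V's notation):
      import java.util.concurrent.atomic.AtomicReference;
      import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

      class Node<E> {
        @SuppressWarnings("rawtypes")
        static final AtomicReferenceFieldUpdater<Node, Node> NEXT =
            AtomicReferenceFieldUpdater.newUpdater(Node.class, Node.class, "next");
        E value;
        volatile Node<E> next;

        Node(E value) { this.value = value; }
        void soNext(Node<E> n) { NEXT.lazySet(this, n); } // ordered store
        Node<E> lvNext() { return next; }                 // volatile load
        E getAndNullValue() { E v = value; value = null; return v; }
      }

      class MpscLinkedQueueSketch<E> {
        final AtomicReference<Node<E>> head; // producer end
        Node<E> tail;                        // consumer end, owned by the single consumer

        MpscLinkedQueueSketch() {
          Node<E> stub = new Node<>(null);
          head = new AtomicReference<>(stub);
          tail = stub;
        }

        public boolean offer(E e) {
          if (e == null) throw new NullPointerException();
          Node<E> n = new Node<>(e);
          Node<E> prev = head.getAndSet(n); // atomic XCHG, wait free
          prev.soNext(n);                   // a stall between these 2 lines leaves a 'bubble'
          return true;
        }

        // Vyukov semantics: returns null if the next node is not yet visible
        public E poll() {
          Node<E> next = tail.lvNext();
          if (next == null) {
            return null;
          }
          E value = next.getAndNullValue();
          tail = next; // the consumed node becomes the new stub
          return value;
        }
      }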
    The code is pretty similar. Now if we wanted to complete the j.u.Queue we could extend AbstractQueue and implement only size()/peek() and we're done (I'm not going to bother with the iterator()):
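    If we make the sketch above extend AbstractQueue<E>, the naive completions would look something like this (note the size loop terminating on a null next, which comes back to bite us further down):
      public int size() {
        int size = 0;
        for (Node<E> n = tail.lvNext(); n != null; n = n.lvNext()) {
          size++;
        }
        return size;
      }

      public E peek() {
        Node<E> next = tail.lvNext();
        return next == null ? null : next.value;
      }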
    There we go, seems reasonable don't it? Off to the pub to celebrate some kick ass lock free programming!

    Vyukov::poll() ain't Queue::poll!

    I've discussed this particular annoyance in a previous post which has some overlap with this one. The names are the same, but as it turns out the guarantees are quite different. While Vyukov is doing a fine job implementing a queue, not every queue is a j.u.Queue. In particular for this case, the poll() method has a different contract:
    • j.u.Queue: Retrieves and removes the head of this queue, or returns null if this queue is empty.
    • Vyukov: Retrieves and removes the head of this queue, or returns null if next element is not available.
    Why wouldn't the next element be available? Doesn't that mean the queue is empty? Sadly this ain't the case. Imagine for instance there are 2 producers and we run both threads step by step in a debugger. We break at line 33 (4th line of offer, the Java version):
    1. Producer 1: we step 1 line, we just replaced head with node n. We are suspended before executing line 35.
    2. Producer 2: we let the program continue. We've replaced head and linked it to the previous head.
    What state are we in? Let's look at head/tail nodes and where they lead (I number the nodes in order of assignment to head):
    • head = Node[2], this is the node created by Producer 2. We also know that Node[1].next = Node[2], because we let offer run its course on Producer 2.
    • tail = Node[0], the node we allocated in the constructor. This is the node head was before the first producer came along. This is what prev is equal to for Producer 1, but because we suspended that thread it never set its next value. Node[0].next is still null!

    If a consumer came along now they would get a null from poll(), indicating the queue is empty. But the queue is obviously not empty!
    So it seems we cannot deduce that the queue is empty from looking at tail.next. Here are 2 valid indicators that the queue is empty:
    • head == tail : this is the starting point set in the constructor and where the consumer ends up after consuming the last element
    • head.val == null : head can only have a value of null if it is tail
    Here's a solution to a correct poll() and the knock on effect on peek():
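    In terms of the earlier sketch, a corrected version looks roughly like this, spinning over the 'bubble' when head tells us the queue cannot actually be empty:
      public E poll() {
        Node<E> currTail = tail;
        Node<E> next = currTail.lvNext();
        if (next == null && currTail != head.get()) {
          // not empty: a producer is stuck between the XCHG and linking its node,
          // spin until the next reference becomes visible
          while ((next = currTail.lvNext()) == null) { }
        }
        if (next == null) {
          return null; // tail == head, genuinely empty
        }
        E value = next.getAndNullValue();
        tail = next;
        return value;
      }

      public E peek() {
        Node<E> currTail = tail;
        Node<E> next = currTail.lvNext();
        if (next == null && currTail != head.get()) {
          while ((next = currTail.lvNext()) == null) { }
        }
        return next == null ? null : next.value;
      }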
    This is a bit annoying because now we have a wait-free offer(), but poll() and peek() are only lock-free (just the consumer thread may be blocked, producers can make progress).
    This pitfall is tricky enough that not only did I fall for it (on another queue algorithm, but same mistake), it also took Mr. Manes by surprise while writing an interesting variant of this queue for his high performance cache implementation (I filed an issue which he promptly fixed), and struck the notorious Dave Dice when considering Vyukov's MPMC queue (see comments for discussion).
    So is this it? Are we done?
    Almost... size() is still broken.

    A Good Size

    It's not really surprising that size is broken given the terminating condition for the size loop was relying on next == null to terminate the count. Size was also broken in 2 other subtle ways:
    • The interface for size dictates that it returns a positive int. But given that the queue is unbounded it is possible (though very unlikely) for it to have more than 2^31 elements. This would require that the linked list consume over 64GB (16b + 8b + 8b = 32b per element; refs are 8b since this requires more than a 32GB heap, so no compressed oops). Unlikely, but not impossible. This edge condition is handled for ConcurrentLinkedQueue (same as here), and for LinkedBlockingQueue (by bounding its size to MAX_INT, so it's not really unbounded after all) but not for LinkedList (size is maintained in an int and is blindly incremented).
    • Size is chasing a moving target and as such can in theory never terminate as producers keep adding nodes. We should limit the scope of the measurement to the size of the queue at the time of the call. This is not done for CLQ and should perhaps be considered.
    Here's the end result:
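    Again as a sketch on top of the earlier code: the measurement is bounded to the producer node observed on entry and the count is capped at Integer.MAX_VALUE:
      public int size() {
        Node<E> chaser = tail;
        final Node<E> observedHead = head.get(); // bound the measurement to this snapshot
        int size = 0;
        while (chaser != observedHead && size < Integer.MAX_VALUE) {
          Node<E> next;
          // spin over a 'bubble' left by a suspended producer
          while ((next = chaser.lvNext()) == null) { }
          chaser = next;
          size++;
        }
        return size;
      }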
    If you need a full SPSC/MPSC linked queue implementation they are available in JCTools for your pleasure, enjoy!

    Special thanks to Doug and Ben for reviewing!

    Degrees Of (Lock/Wait) Freedom

    Yo, Check the diagonal, three brothers gone...
    I've been throwing around the terms lock-free and wait-free in the context of the queues I've been writing, perhaps too casually. The definition I was using was the one from D. Vyukov's website (direct quote below):
    • Wait-freedom: Wait-freedom means that each thread moves forward regardless of external factors like contention from other threads, other thread blocking. Each operations is executed in a bounded number of steps. 
    • Lock-freedom: Lock-freedom means that a system as a whole moves forward regardless of anything. Forward progress for each individual thread is not guaranteed (that is, individual threads can starve). It's a weaker guarantee than wait-freedom. [...] A blocked/interrupted/terminated thread can not prevent forward progress of other threads. Consequently, the system as a whole undoubtedly makes forward progress.
    • Obstruction-freedom: Obstruction-freedom guarantee means that a thread makes forward progress only if it does not encounter contention from other threads. That is, two threads can prevent each other's progress and lead to a livelock. 
    • Blocking Algorithms: It's the weakest guarantee - basically all bets are off, the system as a whole may not make any forward progress. A blocked/interrupted/terminated thread may prevent system-wide forward progress infinitely. 
    The above definitions refer to "forward progress/moving forward" in the context of "system" and "thread". In particular they translate "X-freedom" to "a guarantee that T/S makes forward progress".
    Now "thread" is a well defined term, but what counts as forward progress of a given thread?
    To my mind a thread which is free to return from a method is free to 'make progress'. For example: if the next element in a queue is not visible to a consumer thread I can return control to the consumer thread which is then free to make progress on anything it feels like and try getting an element out of the queue later. Similarly a producer thread which is unable to add elements to a queue is 'free' to 'make progress' even if the queue is full. In short, I interpreted 'freedom' as regaining control of execution.

    Freedom? yeah, right...

    My interpretation however is limited to the scope of the thread and assumes it has other things to do (so placing the thread within the context of a given 'system'). So the freedom to make progress is really a freedom to make progress on 'other things'. The definitions above, when applied to a given data structure, have no knowledge of the 'system' and it is therefore fair to assume nothing. And so, if the 'system' is viewed as being concerned only with using the data structure, it seems my view of 'regained control freedom' is not in line with the 'progress making freedom'.
    Let's consider for example the linked Multi-Producer-Single-Consumer queue discussed in the previous post (this is not the j.u.Queue compliant version):

    Now let us assume a producer has stalled in that unpleasant gap between setting the head and pointing prev to the new head (at line 34); this is what I've come to think of as a 'bubble' in the queue. The next producer will see the new head, but the consumer will not see it (or any nodes after it) until prev.next is set.
    Other producers can 'make progress' in the sense that elements will get added to the queue. Those elements' visibility to the consumer, however, is blocked by the 'bubble'. What is the correct degree of freedom for these producers? What about the consumer?
    When describing the queue Vyukov makes the following statements:
    • Wait-free and fast producers. One XCHG is maximum what one can get with multi-producer non-distributed queue.
    • Push [offer in the Java version] function is blocking wrt consumer. I.e. if producer blocked in (* [line 34]), then consumer is blocked too. Fortunately 'window of inconsistency' is extremely small - producer must be blocked exactly in (* [line 34]).
    So producers are wait-free, but consumer is blocking. The consumer is truly blocked by the 'bubble' and can make no progress in the terms of the data structure. Despite other producers adding nodes the consumer cannot see those nodes and is prevented from consuming the data. The fact that control is returned to the caller is considered irrelevant. And if we are being precise we should call this a blocking queue.

    Life imitates Art

    The "Art of Multiprocessor Programming" offers a different definition:
    "A concurrent object implementation is wait free if each method call completes in a finite number of steps. A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps" - page 99
    This sentiment is similarly echoed in "Wait-Free Queues With Multiple Enqueuers and Dequeuers" a paper on lock free queues:

    • [...] to ensure that a process (or a thread) completes its operations in a bounded number of steps, regardless of what other processes (or threads) are doing. This property is known in the literature as (bounded) wait-freedom.
    • lock-freedom ensures that among all processes accessing a queue, at least one will succeed to finish its operation.

    Hmmm... no system, no progress. The boundaries are now an object and a method, which are well defined. I think this definition matches my understanding of 'regained control freedom' (open to feedback). If a method returns, progress or no progress, the condition is fulfilled. Under this definition the queue above is wait-free.
    The wiki definition is again closer to the one outlined by Vyukov:
    • An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.
    • Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
    • Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.
    We are back to a fuzzy definition of system and progress, but the underlined sentence above is interesting in highlighting suspension. In particular I'm not sure how this is to be interpreted in the context of the lock-free definition, where suspension is not an issue if some threads can keep going.
    Note that once we have fixed poll to behave in a j.u.Queue compliant way: The poll method is no longer wait-free by any definition.

    What about the JCTools queues?

    JCTools covers the full range of MPMC/MPSC/SPMC/SPSC range of queues, and aims for high performance rather than strict adherence to any of the definitions above. To be more precise in the definitions I would say:
    • SpscArrayQueue/SpscLinkedQueue are Wait Free (on both control and progress senses)
    • MpscArrayQueue is lock free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.
    • MpscLinkedQueue is wait free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.
    • SpmcArrayQueue is lock free on the consumer side and blocking on the producer side. I'm planning to add a weakOffer method which will be wait free in the control sense.
    • MpmcArrayQueue is blocking on both the producer and consumer side. I'm planning to add a weakOffer/Poll method which will be lock free in the control sense.
    Does it matter that the queues do not meet the exact definitions set out above? As always it depends on your needs...

      Non-Blocking Algorithms On The JVM?

      {Full disclosure: please note when reading the next section that I work with Azul Systems on the Zing JVM, I'm not trying to sell anything, but I am naturally more familiar with it and consider it awesome :-)}
      What happens when we include the JVM in our definition of a system? Can you build a non-blocking algorithm on the JVM? All the JVMs I know of are blocking in some particular cases; the ones important to the discussion above are:
      1. Allocation: The common case for allocation is fast and involves no locking, but once the young generation is exhausted a collection is required. Young generation collections are stop the world events for Oracle/OpenJDK. Zing has a concurrent young generation collector, but under extreme conditions or bad configuration it may degrade to blocking the allocator. The bottom line is that allocation on the JVM is blocking and that means you cannot consider an algorithm which allocates to be non-blocking. To be non-blocking a system would have to provably remove the risk of a blocking collection event on allocation.
      2. Deoptimization: Imagine the JIT compiler in its eager and aggressive compilation strategy has decided your offer method ends up getting compiled a particular way which ends up not working out (assumptions on inheritance, class loading, passed in values play a part). When the assumption breaks a deoptimization may take place, and that is a blocking event. It is hard to prove any piece of code is deoptimization risk free, and therefore it is hard to prove any Java code is non-blocking. Deoptimization is in many cases a warmup issue. Zing is actively battling this issue with ReadyNow which reloads previously recorded compilation profiles for the JVM, greatly reducing the risk of deoptimization. Oracle/OpenJDK users can do a warmup run of their application before actually using it to reduce the risk of deoptimization. My colleague Doug Hawkins gave a great talk on this topic which will give you far more detail than I want to go into here :-).
      3. Safepoints: If your algorithm includes a safepoint poll it can be brought to a halt by the JVM. The JVM may bring all threads to a safepoint for any number of reasons. Your method, unless inlined, already has a safepoint on method exit (OpenJDK) or entry (Zing). Any non-counted loop will have a safepoint poll, and this includes your typical CAS loop. Can you prevent safepoints from ever happening in your system? It might be possible for a JVM to provide users with compiler directives which will prevent safepoint polls from being inserted in certain code paths, but I don't know of any JVM offering this feature at the moment.
      In short, if you are on the JVM you are probably already blocking. Any discussion of non-blocking algorithms on the JVM is probably ignoring the JVM as part of the 'system'. This needs to be considered if you are going to go all religious about wait-free vs. lock-free vs. blocking. If a system MUST be non-blocking, it should probably not be on the JVM.

      Summary

      Any term is only as good as our ability to communicate with others using it. In that sense, splitting hairs about semantics is not very useful in my opinion. It is important to realize people may mean any number of things when talking about lock-free/wait-free algorithms and it is probably a good idea to check if you are all on the same page.
      I personally find the distinction between control/progress freedom useful in thinking about algorithms, and I find most people mean lock/wait-free excluding the effects of the JVM...
      Thanks Martin & Doug for the review, feedback and discussion, any remaining errors are my own (but their fault ;-) )


      Object.equals, primitive '==', and Arrays.equals ain't equal

      It is a fact well known to those who know it well that "==" != "equals()", the example usually going something like:
        String a = "Tom";
        String b = new String(a);
        -> a != b but a.equals(b)

      It also seems reasonable therefore that:
        String[] arr1 = {a};
        String[] arr2 = {b};
        -> arr1 != arr2 but Arrays.equals(arr1, arr2)

      So far, so happy... That's examples for you...
      For primitives we don't have the equals method, but we can try boxing:
        int i = 0;
        int j = 0;
        -> i == j and also ((Integer)i).equals((Integer)j), and Arrays.equals({i}, {j})


      Floats ruin everything

      But... some primitives are more equal than others. Consider for instance the following example:
        float f1 = Float.NaN;
        float f2 = Float.NaN;
        -> f1 != f2, and also f1 != f1, NaNs are nasty

      This is what floats are like, children; they be treacherous little hobbitses and no mistake. The fun starts when you box them fuckers:
        -> ((Float)f1).equals((Float)f2), because Java likes to fix shit up
        -> Arrays.equals({((Float)f1)},{((Float)f2)}), hmm consistent
        -> but also: Arrays.equals({f1},{f2})...

      This is counter to how one would normally think arrays are compared. You would think that for primitives (skipping arrays null and length checks):
        boolean feq = true;
        for(int i=0;i<farr1.length;i++){
          if(farr1[i] != farr2[i]) {
            feq = false; break;
          }
        }
      Is the same as:
        boolean feq = Arrays.equals(farr1,farr2);
      But for double[] and float[] the contract has been changed to accommodate NaN. This is perhaps a good decision on the JDK authors' side, but it is somewhat surprising.
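      Per the Arrays.equals(float[], float[]) javadoc, element equality is defined as new Float(f1).equals(new Float(f2)), which boils down to comparing bit patterns. Roughly:
        // effectively what Arrays.equals uses per float element
        static boolean elementEquals(float f1, float f2) {
          // NaN equals NaN here, while +0.0f and -0.0f (which are == to each other) do not
          return Float.floatToIntBits(f1) == Float.floatToIntBits(f2);
        }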
      Conclusion: 2 arrays can be equal, even if some elements are not equal.

      Objects are weird

      Let's consider objects for a second. We started with the observation that a != b does not imply !a.equals(b), but what about a == b?
      The Object.equals() javadoc offers the following wisdom:
      The equals method implements an equivalence relation on non-null object references:

      • It is reflexive: for any non-null reference value x, x.equals(x) should return true.
      • It is symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
      • It is transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
      • It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
      • For any non-null reference value x, x.equals(null) should return false.
      You know what the most important word in the above lines is? SHOULD
      In the wonderful land of Should, all classes behave and if they bother implementing an equals method they follow the above rules (and also override the hashcode method while they are there).
      But what happens when an object is implemented perversely, like this:
        public class Null {
          @Override
          public boolean equals(Object obj) {
            return obj == null;
          }
        }

      This has some nice implications:
        Null n = new Null();
        -> n != null, but n.equals(null)
        Object[] a = {n};
        Object[] b = {null};
        Object[] c = {n};
        -> Arrays.equals(a, b) is true, because n.equals(null)
        -> Arrays.equals(b, a) is false, because null != n
        -> Arrays.equals(a, a) is true, because a == a
        -> Arrays.equals(a, c) is false, because !n.equals(n)

      Or to quote The Dude: "Fuck it man, let's go bowling"


      JMH perfasm explained: Looking at False Sharing on Conditional Inlining

      There is an edge that JMH (read the jmh resources page for other posts and related nuggets) has over other frameworks. That edge is so sharp you may well cut yourself using it, but given an infinite supply of bandages you should definitely use it :-) This edge is the ultimate profiler, the perfasm (pronounced PERF-AWESOME!, the exclamation mark is silent). I've been meaning to write about it for a while and as it just saved my ass recently...

      SpscGrowableArrayQueue's False Sharing Issue

      JCTools includes a specialized SPSC (single-producer/consumer) bounded queue aimed at actor systems, the SpscGrowableArrayQueue. This queue is quite neat because it is combining the compactness of a linked queue with the awesome throughput of an array backed queue. The idea is quite simple, and similar in spirit to an ArrayList:
      1. Start with a small buffer, default to 16 elements
      2. As long as the queue doesn't grow beyond that just stay small
      3. If on offer you find that the queue is full double the size of the underlying buffer
      The mechanics of the queue resizing, handling intermediate queue states, and detecting a new buffer from the consumer are a bit involved and will perhaps be expanded on some other day.
      Because the queue is geared at actor systems, per-queue footprint is important, so I reworked the memory layout of SpscArrayQueue and skimped on the padding where possible. If you have no idea what I mean by padding, and why it is done, you can read the following posts:
      1. False sharing and the MESI protocol details related to cached counters: This explores the motivation of padding fields in the first place
      2. Discovering and controlling object layout(Using an early version of JOL): This show cases the JOL tool and how it should be used to discover object layout
      3. SPSC Revisited - part I: An empiricist tale: This post discusses inlining the counters into a queue, using inheritance to control field order and introduce padding. In this post I show how to pad the counters from each other and also how to pre/post pad the class and the buffer.
      4. A more high level summary of false sharing is given in this post
      With an overhead of roughly 128b per padding this was instrumental in reducing the high memory cost per queue. The hot counters were still padded from each other, but the class pre/post padding was removed, as was the padding of the elements array.
      So, I gave up on some of the padding, but reasoned that in most cases this should not make a difference because the hottest fields were still padded from each other.

      Why so slow?

      Now, I expect a queue that is very similar to SpscArrayQueue, but adds some features, to be slower. There's just very little you can do about this, doing more will cost you something. But given that resizing is a rather exceptional event for this queue I thought this would be a minor hit, maybe a 10-20% reduction in performance (for a certain definition of performance). JCTools has some benchmarks included and I ran the QueueThroughputBackoffNone which has a producer and a consumer thread chewing on the queue as hard as they can.
      Since the numbers are not very important here, I'll stick to running benchmarks on my laptop (an MBP 11,1/ Ubuntu 15.04/Zulu JDK8u45 - Zulu is a supported and tested OpenJDK build) in this post. Rest assured that I have confirmed the same results on real hardware. The results I quote below are the pollsMade figure which reflects the actual delivered throughput.
      To my puzzlement I found:
      SpscArrayQueue          361.223 ±  7.156  ops/us
      SpscGrowableArrayQueue   64.277 ± 31.803  ops/us

      Crap performance and terrible variance, bugger me sideways.
      Looking at the code, I thought there must be something I was doing to upset the mighty JIT spirits. Maybe my methods were too big? My branches too unpredictable? My variable names too offensive? So I tweaked the code this way and that, looked at the inlining log (-XX:+PrintInlining) and the assembly (-XX:+PrintAssembly/-XX:CompileCommand=print,*). I got some minor improvements, but it mostly still sucked. What's wrong? A quick look at "-prof perfasm" and a fair amount of head scratching led to the answer. The code (before and after) is on github, the main focus for this post is perfasm and its usage so I won't dive into it.

      Before we start: How does perfasm work?

      To use perfasm you'll need a Linux (or Windows) OS, running on real hardware, the relevant perf tool installed, and a JVM setup to print assembly.
      Perf is (amongst other things) an instruction level profiler, of the kind that traditionally doesn't work with Java (though things are slowly changing). One of the features offered by perf is the "perf record/annotate" workflow. Perf interrupts your process repeatedly and records the current PC (program counter) at each sample. This sampling of the program counter is recorded over a period of time to be post processed by the annotate feature which correlates samples to methods, lines of code and assembly instructions. The challenge for perf when dealing with Java code is that the binary form of each method only exists for the duration of that process lifetime. This means the PC is mostly referring to methods that are nowhere to be found when the annotation stage comes along.
      To summarize: perf record works, but perf annotate is broken for Java.
      To make perfasm work JMH captures the JVM compiler outputs by enabling the following flags:
        -XX:+UnlockDiagnosticVMOptions
        -XX:+LogCompilation
        -XX:LogFile=...
        -XX:+PrintAssembly
        -XX:+PrintInterpreter
        -XX:+PrintNMethods
        -XX:+PrintNativeNMethods
        -XX:+PrintSignatureHandlers
        -XX:+PrintAdapterHandlers
        -XX:+PrintStubCode
        -XX:+PrintCompilation
        -XX:+PrintInlining
        -XX:+TraceClassLoading
        -XX:PrintAssemblyOptions=syntax
      The data collected here allows perfasm to do the annotation process by itself. I'll not bore you with the details of the output processing, but hats off to Mr. Shipilev who ploughed through the gory mess and made this work. The end result is a detailed output offering you the assembly of the hottest regions in your program, along with a list of the top hottest methods including both native and Java parts of the stack.
      Because perf allows the recording of any number of events the definition of hottest depends on the events you choose to profile. The default events are cycles and instructions, but you can specify any number of events (E.g. -Djmh.perfasm.events=cycles,cache-misses). The first event specified will be the 'hotness' qualifier.

      What do you get?

      The perfasm output is split into 4 sections:
      1. Annotated assembly output for the top HOT regions(titled "Hottest code regions (>10.00% "cycles" events):")
      2. A list of the hot regions in your benchmark (titled "[Hottest Regions]"). This is the list of compiled methods inclusive of native methods, which is why we have the next section.
      3. A list of the hottest methods after inlining, so only java methods in this section (titled "[Hottest Methods (after inlining)]").
      4. A distribution of the cycles between types of regions (titled "[Distribution by Area]"). This will inform you of how the split goes between compiled code, kernel, JVM etc.
      The most interesting of the 4 sections (and the only one I'm going to explain here) is the annotated hot regions section. I tend to edit the output of assembly spouting tools to make it more readable, but for the sake of explaining what's what here's the original output with footnotes:
      To focus on the main features I trimmed out a lot of the assembly code, which is where you see the "<MORE OF SAME/>" comments. Note the legend at the start of the output describing the 2 leftmost columns: cycles and instructions sample percentages.
      Here's the footnotes explained:
      1. (Line 7) This column is the instruction address. This is important as this column is both how perfasm matches the perf record output with the PrintAssembly output and how poor sods like yours truly have to look for jump destinations in the code.
      2. (Line 7) This column has the actual instructions, in AT&T syntax (src -> dst).
      3. (Line 7) This column is the instruction annotation generated by PrintAssembly. It is not always correct, but it's helpful in connecting the dots. Multi-level inlining of methods can be seen in action everywhere.
      4. (Line 12) The condition/loop edge annotation is a wonderful recent addition to perfasm. It makes life marginally more bearable for the assembly consumer by connecting the jumps with their destinations. This is the start point annotation ""
      5. (Line 16) This is the end point annotation ""
      When looking at this you will be searching for suspect hot spots in the code. Some instruction which is chewing away your cycles. If you're in luck you'll be able to tweak your code just right to side step it and then you are off and away :-)

      Back on Track: Show me the cycles!

      Right, so I point Captain Awesome at my benchmark and what do I see (trimmed addresses and edges as there's no important conditions here, also trimmed package names and moved code annotation left):
      This instruction, eating 30% of my cycles (line 7, needs HEALING!), is part of a guard generated to ensure that inlining the poll method call is still a valid decision, that the instance observed at the callsite is of the class for which we inlined the poll method. This is not really a Java code line, this is the aftermath of inlining a virtual call (through an interface, see Aleksey's method dispatch inlining article to learn more on inlining).
      To be fair, the above is kind of hard to understand. Why is a comparison between a constant and a register eating 30% of the cycles? According to every instruction 'cost' manual this should take 1 cycle (and the CPU can do 3-4 of them in parallel too), so this is obviously not the instruction I'm looking for.
      This is a phenomenon known as 'skid', where the reported instruction at a given sample is inaccurate because modern CPUs are complicated. See the following dialogue on the linux-perf-user mailing list:
      > I think Andi mentioned this to me last year -- that instruction profiling was no longer reliable. 
      It never was. 
      > Is this due to parallel and out-of-order execution? (ie, we're sampling the instruction pointer, but that's set to the resumption instruction, not the instructions being processed in the backend?). 
      Most problems are due to 'skid': It takes some time to trigger the profiling interrupt after the event fired. [...] There are also other problems, for example an event may not be tied to an instruction. Some events have inherently large skid.
      This sounds grim, but at the end of the day this is as accurate as profiling can get. It's not perfect, but it's still very valuable and you get used to it (there are ways to minimize skid discussed in the link, I've not tried those).
      What you end up doing is looking for instructions just before the blamed instruction, or instructions on which the blamed instruction is dependent, which may be more reasonably blamed for the bottleneck. Given the CMP is not the problem, we must ask why would the CPU spend so much time at it? A CMP or a TEST will often get blamed for the price of the load into the registers they use; in this case the CMP is most probably being blamed for the load of the queue type from the object header one instruction back:
      "0x8(%r12,%r11,8),%r8d  ; implicit exception: dispatches to 0x00007f83c93f929d"
      It doesn't help that the comment talks about some implicit exception (this is an implicit null check), where it could say "getkid(q) + implicit_nullchk(q)" or something similar to indicate we are loading the kid (klass id) from the object header (see object header layout details here).
      Now that I had pointed a finger at this operation, I was still quite confused. This was not an issue for any of the other queues, why would loading the kid be such a bottleneck in this case? Maybe I'm wrong about this (always a good assumption)? To prove this is the issue I created a duplicate of the original benchmark where I use the SpscGrowableArrayQueue directly instead of going via the interface; for comparison I also benchmarked the SpscArrayQueue in the same fashion:
      SpscArrayQueue          369.544 ±  1.535  ops/us
      SpscGrowableArrayQueue  272.021 ± 12.133  ops/us

      Now that's more like it! The expected 20% difference is more like 30%, but this is much closer. This still begs the question, why is the type check for SpscGrowableArrayQueue so expensive? We can see that for SpscArrayQueue this makes very little difference, how is SpscGrowableArrayQueue different?

      Messing with the KID

      I had a long hard look at this issue, which didn't help, then slept on it, which did help, and realized the problem here is that the object header is false-sharing with the producer fields. When I trimmed down the padding on this class in an effort to minimize allocation, I removed the array post and pre-padding as well as the class pre/post padding and reasoned that for the most part I need not worry about the object's neighbours false sharing. What I failed to realize was that the consumer and producer threads might be frequently hitting the object header, in this benchmark on every call. Once I realized this was the issue I reinstated the pre-padding such that the producer index is far enough from the object header to stop interfering with it and the problem went away (see before and after, it's a messy diff as I fixed some other niggles while I was there, you can take the last version and play with adding/removing the padding and reordering the index and cold fields to verify my claims).
      Here's the (slightly trimmed for readability) perfasm output for the same conditional inlining check in the padded version:

      As an interesting side note, this issue was not visible in the hand rolled throughput benchmarks. This is because in those benchmarks the queue is hoisted into a variable before the consume/produce loop which means the conditional inlining check can be done out of loop as well. This is great for getting nice numbers but hides an issue which users are likely to hit. Credit goes to JMH for herding benchmarks down a path which forces these issues into the measured scope.

      Summary

      The main focus of this post is perfasm, I hope it helps get you started on using it. The broader context in which you would use this tool is less explicit. I use perfasm regularly, but I'm also reasonably happy to look at assembly and compiler annotation, which I know most people are not. I find it to be invaluable for issues, like the one described in this post, where the cost is between the Java lines rather than the lines themselves. Any profiler can give you a broad feel of where performance bottlenecks are, and a non-safepoint-biased profiler can show you the hottest line of code. What a Java profiler will not tell you about is the generated JVM code between your lines. It will attribute the cost of those instructions to the nearest Java line, and that can become a very confusing chase.
      It is also worth pointing out that any nano-benchmark (measuring operations in the 0-500 nanoseconds range) practically requires you to look at the assembly for its analysis. But 500ns can still be an awful lot of code and an assembly level profiler is very handy. At this point it is worth mentioning the Oracle Solaris Studio (first prize for least known profiler every year since release). It is a great assembly level profiler, and just generally a great profiler. If your measurement needs to take place outside the cosy comforts of a JMH benchmark I would recommend you give it a spin.
      Finally, this investigation came in the context of a workflow that is followed in the development of JCTools. I would loosely describe it as follows:
      1. Implement new feature/data structure
      2. Reason about expected impact/performance as part of design and implementation
      3. Test expectations using the existing set of benchmarks
      4. Expectations are far off the mark (if not... not sure, will figure it out when it happens)
      5. Dig in until either expectations or code are cured.
      This has a whiff of scientific exploration to it, but I assure you it is not done quite so seriously and I often fail to follow my own advice (or worse, other people's advice). The habit of testing performance assumptions/expectations has offered me many an afternoon spent banging my head on a variety of surfaces. Perfasm has been instrumental in reducing the amount of head banging, but I fear nothing short of permanent brain damage will actually solve the problem.
      This post has been kindly reviewed by Aleksey Shipilev, Darach Ennis and Richard Warburton. Any remaining errors are entirely their fault ;-)

      When I say final, I mean FINAL!

      {This post is part of a long running series on lock free queues, checkout the full index to get more context here}
      Having recently bitched about the lack of treatment of final fields as final I was urged by Mr. Shipilev to demonstrate the issue in a more structured way (as opposed to a drunken slurred rant); I have now recovered my senses to do just that. The benchmark being run and the queue being discussed are covered in this post, so please refresh your memory for context if you need. The point is clear enough without full understanding of the context though.
      It is perhaps a fact well known to those who know it well that final fields, while providing memory visibility guarantees, are not actually immutable. One can always use reflection, or Unsafe, to store new values into those fields, and in fact many people do (and Cliff Click hates them and wishes them many nasty things). This is (I believe) the reason behind some seemingly trivial optimizations not being done by the JIT compiler.

      Code Under Test: FFBufferWithOfferBatch.poll()

      The buffer field is a final field of FFBufferWithOfferBatch and is being accessed twice in the method above. A trivial optimization on the JIT compiler side would be to load it once into a register and reuse the value. It is 'immutable' after all. But if we look at the generated assembly (here's how to, I also took the opportunity to try out JITWatch which is brilliant):
      We can see buffer is getting loaded twice (line 15, and again at line 24). Why doesn't JIT do the optimization? I'm not sure... it may be due to the volatile load forcing a load order that could in theory require the 'new' value in buffer to be made visible... I don't know.

      Hack around it, see if it makes a difference

      Is that a big deal? Let's find out. The fix is trivial:
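      The shape of the change is simply to read the final field once into a local and use that. A generic sketch, not the actual FFBufferWithOfferBatch code (the offsetOf/loadVolatile/storeOrdered helpers are made-up names standing in for the Unsafe plumbing):
        // before: 'buffer' (a final field) is read on both accesses, so the JIT emits 2 loads
        public E poll() {
          final long offset = offsetOf(consumerIndex);
          final E e = loadVolatile(buffer, offset); // 1st read of buffer
          if (e == null) {
            return null;
          }
          storeOrdered(buffer, offset, null);       // 2nd read of buffer
          consumerIndex++;
          return e;
        }

        // after: load the field once into a local; the JIT now emits a single load
        public E poll() {
          final E[] lb = buffer;
          final long offset = offsetOf(consumerIndex);
          final E e = loadVolatile(lb, offset);
          if (e == null) {
            return null;
          }
          storeOrdered(lb, offset, null);
          consumerIndex++;
          return e;
        }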
      And the assembly code generated demonstrates the right behaviour now (one load at line 15):
      Now, was that so hard to do? And more importantly, does it make any difference to performance? As discussed previously, the throughput benchmark is sensitive to changes in the cost balance between offer/poll. The optimization creates an interesting change in the pattern of the results:
      The benchmark is run on Ubuntu 13.10/JDK7u45/i7@2.4, the x axis is the index of the benchmark run and the Y axis is the result in ops/sec. The chart displays the results from before the change (B-*) and after (A-*) with different sparse data settings. We can see the change has accelerated the consumer, leading to increased benefit from sparse data that was not visible before. With sparse data set to 1 the optimization results in a 2% increase in performance. Not mind blowing, but still. The same change applied to the producer thread loop (localizing the reference to the queue field) discussed in the previous post enabled a 10% difference in performance, as the field reference stopped the loop from unrolling and was read on each iteration. I used the poll() example here because it involves a lot less assembly code to wade through.

      Hopefully this illustrates the issue to Mr. Shipilev's content. Thanks goes to Gil Tene for pointing out the optimization to me and to Chris Newland for JITWatch.

      Unsafe Pointer Chasing: Running With Scissors

      Love running? Love scissors? I know just the thing for you! Following on from recent discussion on the Mechanical Sympathy mailing list I see an anti pattern worth correcting in the way people use Unsafe. I say correcting as I doubt people are going to stop, so they might as well be made aware of the pitfalls. This pattern boils down to a classic concurrency bug:

      Q: "But... I not be doing no concurrency or nuffin' guv"
      A: Using Unsafe to gain a view of on-heap addresses is concurrent access by definition.

      Unsafe address: What is it good for?

      Absolutely nothing! sayitagain-huh! I exaggerate, if it was good for nothing it would not be there, let's look at the friggin manual:
      As we can see the behaviour is only defined if we use the methods together, and by that I mean that get/putAddress are only useful when used with an address that is within a block of memory allocated by allocateMemory. Now undefined is an important word here. It means it might work some of the time... or it might not... or it might crash your VM. Let's think about this.

      Q: What type of addresses are produced by allocateMemory?
      A: Off-Heap memory addresses -> unmanaged memory, not touched by GC or any other JVM processes

      The off-heap addresses are stable from the VM point of view. It has no intention of running around changing them; once allocated they are all yours to manage, and whether you cut your fingers in the process or not is completely in your control, which is why the behaviour is defined. On-Heap addresses on the other hand are a different story.

      Playing With Fire: Converting An Object Ref to An Address

      So imagine you just had to know the actual memory address of a given instance... perhaps you just can't resist a good dig under the hood, or maybe you are concerned about memory layout... Here's how you'd go about it:
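      The usual trick (borrowed from JOL, sketched from memory here rather than copied, with UNSAFE obtained reflectively from sun.misc.Unsafe.theUnsafe) is to park the reference in an Object[] and read the slot raw:
        // sketch: turn an object reference into its current heap address
        static long addressOf(Object o) {
          Object[] holder = new Object[] { o };
          long baseOffset = UNSAFE.arrayBaseOffset(Object[].class);
          int scale = UNSAFE.arrayIndexScale(Object[].class);
          if (scale == 8) {
            // no compressed oops: the slot holds a full 64 bit address
            return UNSAFE.getLong(holder, baseOffset);
          } else if (scale == 4) {
            // compressed oops: a 32 bit narrow oop, commonly decoded by a shift of 3
            // (the real decoding depends on heap size/base, so treat this as an approximation)
            return (UNSAFE.getInt(holder, baseOffset) & 0xFFFFFFFFL) << 3;
          }
          throw new AssertionError("unexpected index scale: " + scale);
        }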
      Now... you'll notice the object ref needs a bit of cuddling to turn into an address. Did I come up with such devilishly clever code myself? No... I will divulge a pro-tip here:
      If you are going to scratch around the underbelly of the JVM, learn from as close to the JVM as you can -> from the JDK classes, or failing that, from an OpenJDK project like JOL (another Shipilev production)
      In fact, the above code could be re-written to:
      Now that we have the address what can we do with it? Could we use it to copy the object? Maybe we could read or modify the object state? NO! We can but admire its numerical beauty and muse on the temperamental values waiting at the other end of that address. The value at the other end of this address may have already been moved by GC...

      Key Point: On-Heap Addresses Are NOT Stable

      Consider the fact that at any time your code may be paused and the whole heap can be moved around... any address value you had which pointed to the heap is now pointing to a location holding data which may be trashed/outdated/wrong and using that data will lead to a funky result indeed. Also consider that this applies to class metadata or any other internal accounting managed by the JVM.
      If you are keen to use Unsafe in the heap, use object references, not addresses. I would urge you not to mix the 2 together (i.e. have object references to off-heap memory) as that can easily lead to a very confused GC trying to chase references into the unknown and crashing your VM.

      Case Study: SizeOf an Object (Don't do this)

      This dazzling fit of hackery cropped up first (to my knowledge) here on the HighScalability blog:
      This is some sweet machete swinging action :-). The dude who wrote this is not suggesting it is safe, and only claims it is correct on a 32bit VM. And indeed, it can work and passes cursory examination. The author also states correctly that this will not work for arrays and that with some corrections this can be made to work for 64 bit JVMs as well. I'm not going to try and fix it for 64 bit JVMs, though most of the work is already done in the JOL code above. The one flaw in this code that cannot be reliably fixed is that it relies on the native Klass address (line 6) remaining valid long enough for it to chase the pointer through to read the layout helper (line 8). Spot the similarity to the volatile bug above?
      This same post demonstrates how to forge references from on-heap objects to off-heap 'objects', which in effect lets you cast a pointer to a native reference to an object. It goes on to state that this is a BAD IDEA, and indeed it can easily crash your VM when GC comes a knocking (but it might not, I didn't try).

      Case Study: Shallow Off-Heap Object Copy (Don't do this)

      Consider the following method of making an off-heap copy of an object (from here, Mishadof's blog):
      We see the above is using the exact same method for computing size as demonstrated above. It's getting the on-heap object address (limited correctness, see the addresses discussion above), then copying the object off-heap and reading it back as a new object copy... Calling Unsafe.copyMemory(srcAddress, destAddress, length) is inviting the same concurrency bug discussed above. A similar method is demonstrated in the HighScalability post, but there the copy method used is Unsafe.copyMemory(srcRef, srcOffset, destRef, destOffset, length). This is important as the reference using method is not exposed to the same concurrency issue.
Both are of course playing with fire by converting off-heap memory to objects. Imagine this scenario:
      • a copy of object A is made which refers to another object B, the copy is presented as object C
      • object A is de-referenced leading to A and B being collected in the next GC cycle
• object C is still storing a stale reference to B, which is no longer managed by the VM
      What will happen if we read that stale reference? I've seen the VM crash in similar cases, but it might just give you back some garbage values, or let you silently corrupt some other instance state... oh, the fun you will have chasing that bugger down...

      Apologies

I don't mean to present either of the above post authors as fools; they are certainly clever and have presented interesting findings for their readers to contemplate, without pretending their readers should run along and build on their samples. I have personally commented on some of the code on Mishadof's post and admit my comments were incomplete in identifying the issues discussed above. If anything, I aim to highlight that this hidden concurrency aspect can catch out even the clever.
Finally, I would be a hypocrite if I told people not to use Unsafe; I end up using it myself for all sorts of things. But as Mr. Maker keeps telling us, "Be careful, because scissors are sharp!"

      Where is my safepoint?

My new job (at Azul Systems) leads me to look at JIT compiler generated assembly quite a bit. I enjoy it despite, or perhaps because of, the amount of time I spend scratching my increasingly balding cranium in search of meaning. On one of these exploratory rummages I found a nicely annotated line in the Zing (the Azul JVM) generated assembly:
      gs:cmp4i [0x40 tls._please_self_suspend],0
      jnz 0x500a0186
      Zing is such a lady of a JVM, always minding her Ps and Qs! But why is self suspending a good thing?

      Safepoints and Checkpoints

There are a few posts out there on what a safepoint is (here's a nice one going into when it happens, and here is a long quote from the Mechanical Sympathy mailing list on the topic). Here's the HotSpot glossary entry:
      safepoint

      A point during program execution at which all GC roots are known and all heap object contents are consistent. From a global point of view, all threads must block at a safepoint before the GC can run. (As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle.) From a local point of view, a safepoint is a distinguished point in a block of code where the executing thread may block for the GC. Most call sites qualify as safepoints. There are strong invariants which hold true at every safepoint, which may be disregarded at non-safepoints. 
      To summarize, a safepoint is a known state of the JVM. Many operations the JVM needs to do happen only at safepoints. The OpenJDK safepoints are global, while Zing has a thread level safepoint called a checkpoint. The thing about them is that at a safepoint/checkpoint your code must volunteer to be suspended to allow the JVM to capitalize on this known state.
      What will happen while you get suspended varies. Objects may move in memory, classes may get unloaded, code will be optimized or deoptimized, biased locks will unbias.... or maybe your JVM will just chill for a bit and catch its breath. At some point you'll get your CPU back and get on with whatever you were doing.
This will not happen often, but it can happen, which is why the JVM makes sure you are never too far from a safepoint and voluntary suspension. The above instruction from Zing's generated assembly of my code is simply that check. This is called safepoint polling.
Zing's safepoint polling mechanism compares a thread-local flag with 0. The comparison is harmless as long as the checkpoint flag is 0, but if the flag is set to 1 it will trigger a checkpoint call (the JNZ following the CMP4i will take us there) for that particular thread. This is key to Zing's pause-less GC algorithm, as application threads can be brought to a checkpoint independently of one another.
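The cooperative nature of the scheme is easier to internalize with a plain Java analogue (purely illustrative, nothing like the code the JIT actually plants):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// A toy analogue of cooperative suspension: the worker polls a flag and
// volunteers to park; the "VM" flips the flag, does its thing, resumes.
public class CheckpointAnalogy {
    static final AtomicBoolean pleaseSelfSuspend = new AtomicBoolean(false);

    public static void main(String[] args) throws InterruptedException {
        Thread app = new Thread(() -> {
            long work = 0;
            while (!Thread.currentThread().isInterrupted()) {
                work++;                          // "application" work
                if (pleaseSelfSuspend.get()) {   // the poll: cheap on the common path
                    LockSupport.park();          // suspend at a known point
                }
            }
            System.out.println(work);
        });
        app.start();

        Thread.sleep(100);
        pleaseSelfSuspend.set(true);   // the "VM" asks the thread to stop
        Thread.sleep(100);             // ...work done at the "checkpoint"...
        pleaseSelfSuspend.set(false);
        LockSupport.unpark(app);       // and the thread gets its CPU back
        Thread.sleep(100);
        app.interrupt();
        app.join();
    }
}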

Reader Safepoint

      Having happily grokked all of the above I went looking for the OpenJDK safepoint.

      Oracle/OpenJDK Safepoints

I was hoping for something equally polite in the assembly output from Oracle, but no such luck. Beautifully annotated though the Oracle assembly output is when it comes to your code, it maintains some opaqueness where its internals are concerned. After some digging I found this:
      test   DWORD PTR [rip+0xa2b0966],eax        # 0x00007fd7f7327000
                                                      ;   {poll}
No 'please', but still a safepoint poll. The OpenJDK mechanism for safepoint polling is to access a page that is protected when suspension at a safepoint is required, and unprotected otherwise. Accessing a protected page will cause a SEGV (think exception) which the JVM will handle (nice explanation here). To quote from the excellent Alexey Ragozin blog:
      Safepoint status check itself is implemented in very cunning way. Normal memory variable check would require expensive memory barriers. Though, safepoint check is implemented as memory reads a barrier. Then safepoint is required, JVM unmaps page with that address provoking page fault on application thread (which is handled by JVM’s handler). This way, HotSpot maintains its JITed code CPU pipeline friendly, yet ensures correct memory semantic (page unmap is forcing memory barrier to processing cores).
The [rip+0xa2b0966] addressing is a way to save on space when storing the page address in the assembly code. The address commented on the right is the actual page address, and is equal to rip (the instruction pointer register) + the given constant. This saves space as the constant is much smaller than the full address representation. I thank Mr. Tene for clearing that one up for me.
If we were to look at safepoint polls throughout the assembly of the same process, they would all follow the above pattern of pointing at the same global magic address (via this rip-relative trick). Setting the magic page to protected will trigger the SEGV for ALL threads. Note that the Time To Safepoint (TTSP) is not reported as GC time and may prove a hidden performance killer for your application. The effective cost of this global safepoint approach goes up the more runnable (and scheduled) threads your application has (all threads must wait for a safepoint consensus before the operation to be carried out at the safepoint can start).
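Time To Safepoint can be stretched surprisingly far by innocent looking code. A classic illustration (results vary by JVM version and flags; newer HotSpot versions mitigate this with loop strip mining) is a hot counted int loop, which the JIT typically compiles without a poll on the back edge, leaving every other thread waiting at the safepoint until the loop exits:

public class LongTimeToSafepoint {
    public static void main(String[] args) throws InterruptedException {
        Thread spinner = new Thread(() -> {
            long sum = 0;
            // A counted int loop: typically JIT compiled without a safepoint
            // poll on the back edge, so once in it the thread cannot reach a
            // safepoint until the loop exits.
            for (int i = 0; i < Integer.MAX_VALUE; i++) {
                sum += i;
            }
            System.out.println(sum);
        });
        spinner.start();
        Thread.sleep(100); // let the loop get hot and compiled
        long start = System.nanoTime();
        System.gc();       // needs a global safepoint: waits for the spinner too
        System.out.printf("GC returned after %dms%n",
                (System.nanoTime() - start) / 1_000_000);
        spinner.join();
    }
}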


Find The Safepoint Summary

In short, when looking for safepoints in Oracle/OpenJDK assembly, search for poll. When looking at Zing assembly, search for _please_self_suspend.


