Psychosomatic, Lobotomy, Saw

Java Intrinsics are not JNI calls

That is a fact well known to those who know it well... but sadly some confusion still abounds, so here's another go at explaining the differences.
An intrinsic is well defined on Wikipedia; to summarize: it is a function 'macro' to be handled by the compiler. The JIT compiler supports a large list of such intrinsic function macros. The beauty of the two concepts joined together (intrinsic functions and JIT compilation optimizations) is that the JIT compiler can replace whole functions with a single processor instruction (or a handful of them), chosen for the particular processor detected at runtime, and get a great performance boost.
Where it all gets a bit confusing is that the intrinsic functions show up as normal methods or native methods when browsing the source code. Regardless of what they look like (native/Java) they will magically be transformed into far more performant instructions when picked up by the JIT. (Caution must be taken with this piece of advice, as JIT compiler behaviour can change from one JVM implementation to another and turn your crafty choice of functions from a speedy chariot into a soggy pumpkin... one such important JVM is Android's Dalvik.)
To get an idea of the range of functions which benefit from this nifty trick you can check out this list on this Java gaming Wiki, or have a look in this header file where you can find a more definitive list (look for do_intrinsic) and also get a view of how these things hang together. Some classes/methods on the list:
  • The wonderful Unsafe. Almost all intrinsics.
  • Math: abs(double); sin(double); cos(double); tan(double); atan2(double, double); sqrt(double); log(double); log10(double); pow(double, double); exp(double); min(int, int); max(int, int); 
  • System: identityHashCode(Object); currentTimeMillis(); nanoTime(); arraycopy(....);
  • And many many more... 
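To see this in action, here is a small sketch (my own example, not from the original post) comparing an intrinsified JDK method with a hand rolled equivalent; running it with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining on a HotSpot JVM should mark the Integer.bitCount call as an intrinsic:

// A minimal sketch (my example, not the post's code): Integer.bitCount is on the
// intrinsic list, the manual loop below is not, yet both look like plain Java.
public class IntrinsicDemo {
    static int manualBitCount(int v) {
        int count = 0;
        for (int i = 0; i < 32; i++) {
            count += (v >>> i) & 1;
        }
        return count;
    }

    public static void main(String[] args) {
        long intrinsicSum = 0, manualSum = 0;
        for (int i = 0; i < 100000000; i++) {
            intrinsicSum += Integer.bitCount(i); // typically compiled to a single POPCNT where supported
            manualSum += manualBitCount(i);      // plain JIT-compiled loop
        }
        System.out.println(intrinsicSum + " " + manualSum);
    }
}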
See Mike Barker's detailed explanation here, which is really much more in depth than mine. He did everything I meant to do when I set out to write this little post, so you really must check it out.
One takeaway from this is that looking through the code is not enough to reason about performance. In some cases where the performance is surprisingly good you will find an intrinsic standing behind that little boost you didn't see coming.

Breaking String encapsulation using Unsafe

As we all know, String is immutable, which is great, and it is also defensive about its internals to maintain that immutability, which is also great, but... sometimes you want to be able to get/set that damn internal char[] without copying it. While arguably this is not very important to most people, it is quite desirable on occasions when you are trying to get the best performance from a given piece of code. Here's how to break the encapsulation using Unsafe:
1. Acquire Unsafe:
2. Get the field offsets for String fields:
3. Use the offsets to get/set the field values:
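The embedded snippets boil down to something like the following sketch (my reconstruction, assuming the pre-JDK7 String layout; see the update at the end of the post):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// A sketch of the three steps above (my reconstruction, not the original gist),
// assuming the pre-JDK7 String layout: value[], offset, count.
public class StringUnsafeAccess {
    private static final Unsafe UNSAFE;
    private static final long VALUE_OFFSET;
    static {
        try {
            // 1. Acquire Unsafe through its private static 'theUnsafe' field
            Field unsafeField = Unsafe.class.getDeclaredField("theUnsafe");
            unsafeField.setAccessible(true);
            UNSAFE = (Unsafe) unsafeField.get(null);
            // 2. Get the field offset for String.value
            VALUE_OFFSET = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // 3. Use the offset to read the backing char[] without copying it
    public static char[] getChars(String s) {
        return (char[]) UNSAFE.getObject(s, VALUE_OFFSET);
    }

    // ...or to write a new backing array (writing into a final field, see the caveats below)
    public static void setChars(String s, char[] chars) {
        UNSAFE.putObject(s, VALUE_OFFSET, chars);
    }
}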
Now this seems a bit excessive, doesn't it? Isn't encapsulation important? Why would you do that? The bottom line is that this is about scraping some extra performance juice out of your system, and if you are willing to get your hands dirty the above can give you a nice boost.
Getting the data out of String is far less questionable than altering its internal final fields, just so we are clear. So it is not really recommended that you use the set functionality as illustrated above, unless you are sure it's going to work. Here is why it's generally a bad idea to write into final fields.
Using other techniques it should be possible to hack your way into the package-private String constructor, which would spare us that bit of hackiness to the same effect.
Measurements and a real world use case to follow...

Update(05/12/2012):
As per the source code used for the next post, the above would break on JDK7 as String no longer has the offset and count fields. The sentiment however stays the same and the final result is on GitHub.

Encode UTF-8 String to a ByteBuffer - faster

Summary: By utilizing Unsafe to gain access to String/CharBuffer/ByteBuffer internals and writing a specialized UTF-8 encoder we can get a significant (>30%) performance gain over traditional methods, and maintain that advantage for both heap and direct buffers across different JDK versions.

"Not all who wonder are lost"

Not so recently I've been looking at some code that de-serializes messages through a profiler, and a hotspot came up around the decoding/encoding of UTF-8 strings. Certainly, I thought to myself, this has all been sorted by some clever people somewhere. I had a look around and found lots of answers, but not much to keep me happy. What's the big deal, you wonder:
  1. UTF-8 is a clever little encoding; read up on what it is on Wikipedia. It's not as simple as those lovely ASCII strings and thus far more expensive to encode/decode. The problem is that Strings are backed by char arrays, which in turn can only be converted back and forth to bytes using encodings. A UTF-8 character can be encoded to 1-4 bytes. Strings also do not carry their length in bytes, and are often serialized without that handy piece of information being available, so it's a case of finding out how many bytes you need as you go along (see the end result implementation for a step by step).
  2. Java Strings are immutable and defensive: if you give them chars they copy them, and if you ask for the chars contained they get copied too.
  3. There are many ways to skin the encoding/decoding cat in Java that are considered standard. In particular there are variations around the use of Strings, Charsets, arrays and Char/Byte buffers (more on that to come).
  4. Most of the claims/samples I could find were not backed by a set of repeatable experiments, so I struggled to see how the authors arrived at the conclusions they did.

"Looking? Found someone you have!"

After much looking around I stumbled upon this wonderful dude at MIT, Evan Jones, and his excellent set of experiments on this topic, Java String Encoding, and his further post on the implementation behind the scenes. The code is now on GitHub (you'll find it on his blog in the form of a zip, but I wanted to fork/branch from it): https://github.com/nitsanw/javanetperf and a branch with my code changes is under https://github.com/nitsanw/javanetperf/tree/psylobsaw

Having downloaded this exemplary bit of exploratory code I had a read and ran it (requires python). When I run the benchmarks included in the zip file on my laptop I get the following results (the unit of measurement has been changed to millis, the test iterations upped to 1000, etc. The implementation is the same, I was mainly trying to get stable results):

* The above was run using JDK 1.6.37 on Ubuntu 12.04 (64 bit) with an i5-2540M CPU @ 2.60GHz × 4.
I find the names a bit confusing, but Mr. Jones must have had his reasons. In any case here's what they mean:
  • bytebuffer - uses a CharBuffer.wrap(String input) as the input source to an encoder.
  • string - uses String.getBytes("UTF-8")
  • chars - copies the string chars out by using String.getChars, the chars are the backing array for a CharBuffer then used as the input source to an encoder.
  • custom - steals the char array out of String by using reflection and then wrapping the array with a CharBuffer.
  • once/reuse - refers to the encoder object and its bits: whether a new one was created per String or the same one reused.
  • array/buffer - refers to the destination of the encoding: a byte array or a ByteBuffer.
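Roughly, and only as a sketch of the shapes being compared (not Mr. Jones' exact code), the main flavours look like this:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Sketches of the compared flavours; the benchmark harness around them is omitted.
public class EncodingFlavours {
    static byte[] viaString(String input) throws Exception {
        return input.getBytes("UTF-8");                       // the 'string' variant
    }

    static ByteBuffer viaWrap(String input, CharsetEncoder encoder, ByteBuffer out) {
        out.clear();
        encoder.reset();
        encoder.encode(CharBuffer.wrap(input), out, true);    // the 'bytebuffer' variant
        return out;
    }

    static ByteBuffer viaChars(String input, CharsetEncoder encoder, ByteBuffer out) {
        char[] chars = new char[input.length()];
        input.getChars(0, input.length(), chars, 0);          // copy the chars out: the 'chars' variant
        out.clear();
        encoder.reset();
        encoder.encode(CharBuffer.wrap(chars), out, true);
        return out;
    }

    public static void main(String[] args) throws Exception {
        CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder(); // 'reuse' this across calls
        ByteBuffer out = ByteBuffer.allocate(4096);
        System.out.println(viaString("hello").length + " " + viaWrap("hello", encoder, out).position()
                + " " + viaChars("hello", encoder, out).position());
    }
}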
The story the above results tell is quite interesting, so before I dive into my specialized case, here's how I read them:
  • There are indeed many flavours for this trick, and they vary significantly in performance.
  • The straightforward, simplest way, i.e. using String.getBytes(), is actually not bad when compared to some of the other variations. This is quite comforting on the one hand; on the other hand it highlights the importance of measuring performance 'improvements'.
  • The best implementation requires caching and reusing all the working elements for the task, something which happens under the hood in String for you. String also caches them in thread local variables, so it in fact gives you thread safety on top of what the other encoding implementations offer.
  • Stealing the char array using reflection is more expensive than copying it out. Teaching us that crime does not pay, in that particular case.
One oddity I stumbled on while researching the topic was the fact that using String.getBytes(Charset) is actually slower than using the charset name. I added that case as string2 in the chart above. This is due to the getBytes(String) implementation re-using the same encoder (as long as you keep asking for the same charset) while the one taking the Charset object creates a new encoder every time. Rather counter-intuitive, as you'd assume the Charset object would spare String.getBytes the need to look it up by name. I read that as the JDK developers optimizing the most common case to benefit the most users, leaving the less used path worse off. But the lesson here is, again, to measure, not assume.

"Focus!"

The use case I was optimizing for was fairly limited: I needed to write a String into a ByteBuffer. I knew a better way to steal the char array, as discussed in a previous post, using Unsafe (called that one unsafe). Also, if I was willing to tempt fate by re-writing final fields I could avoid wrapping the char array with a new CharBuffer (called that one unsafe2). I also realized that calling encoder.flush() was redundant for UTF-8 so trimmed that away (for both unsafe implementations; I also did the same for chars and renamed it chars2). Re-ran and got the following results; the chart focuses on the use case and adds relevant experiments:


We got a nice 10% improvement there for stealing the char[] and another 5% for re-writing a final field. Not to be sneezed at, but not that ground breaking either. It's worth noting that this 10-15% is saved purely by skipping a memory copy and the CharBuffer wrap, which tells us something about the cost of memory allocation when compared to raw computation cycles.
To further improve on the above I had to dig deeper into the encoder implementation (which is where unsafe3 above comes into play), which uncovered good news and bad news.

"The road divides"

The bad news was an unpleasant little kink in the works. As it turns out, string encoding can go down 2 different paths: one implementation is array based for heap ByteBuffers (on heap, backed by arrays), the other works through the ByteBuffer methods (so each get/set does explicit boundary checks) and will be the chosen path if one of the buffers is direct (off heap buffers). In my case the CharBuffer is guaranteed to be heap, but it would be quite desirable for the ByteBuffer to be direct. The difference in performance when using a direct ByteBuffer is quite remarkable, and in fact the results suggest you are better off using string and copying the bytes into the direct buffer. But wait, there's more.

"And then there were three"

The good news was that the topic of String encoding has not been neglected and that it is in fact improving from JDK version to JDK version. I re-ran the test for JDK 5, 6 and 7 with heap and direct buffers. I won't bore you with the losers of this race, so going forward we will look only at string, chars2, and the mystery unsafe3. The results for heap buffers:
Heap buffer encoding results
The results for direct buffers(note the millis axis scale difference):
Direct buffer encoding results
What can we learn from the above?
  • JDK string encoding performance has improved dramatically for getBytes() users. So much so, in fact, that it renders previous best practice less relevant. The main benefit of going down the Charset encode route is now the reduction in memory churn.
  • The improvement for the unsafe implementations suggests there have been quite a few underlying improvements in the JIT compiler around inlining intrinsics.
  • Speed has increased across the board for all implementations, joy for all.
  • If you are providing a framework or a library you expect people to use with different JDKs, and you need to provide comparable performance, you need to be aware of underlying implementation changes and their implications. It might make sense to back-port bits of the new JDK and add them to your jar as a means of bringing the same goods to users of old JDKs. This sort of policy is fairly common with regard to the concurrent collections and utilities.
 And now to unsafe3...

    "Use the Source Luke"

At some point I accepted that to go faster I would have to roll up my sleeves and write an encoder myself (have a look at the code here). I took inspiration from a Mark Twain quote: "The kernel, the soul — let us go further and say the substance, the bulk, the actual and valuable material of all human utterances — is plagiarism." No shame in stealing from your betters, so I looked at how I could specialize the JDK code to my own use case (a sketch of the resulting encoding loop follows the list below):
• I was looking at encoding short strings, and knew the buffers had to be larger than the strings, or else bad things happen. As such I could live with changing the interface of the encoding such that when it fails to fit the string in the buffer it returns overflow and you are left at the same position you started at. This means encoding is a one-shot operation; no more encoding loop picking the task up where it left off.
• I used Unsafe to pick the address out of the byte buffer object and then used it to put bytes directly into memory. In this case I had to keep the boundary checks in, but it still out-performed the heap one on JDK7, so maybe the boundary checks are not such a hindrance. I ran out of time with this, so feel free to experiment and let me know.
• I tweaked the use of types and the constants and so on until it went this fast. It's fiddly work and you must do it with an experiment to back up your theory. At least in my experience a gut feeling is really not sufficient when tuning your code for performance. I would encourage anyone with interest to have a go at changing it back and forth to the JDK's implementation and figuring out the cost/benefit of each difference. If you can further tune it please let me know.
• Although not explicitly measured throughout this experiment, there is a benefit in using the unsafe3 method (or something similar) in reducing the memory churn involved in the implementation. The encoder is very lightweight and, ignoring its own memory cost, involves no further memory allocation or copying in the encoding process. This is certainly not the case for either the chars2 solution or the string one, where the overhead is proportional to the encoded string length.
    • This same approach can be used to create an encoder that would conform to the JDK interface. I'm not sure what the end result would perform like, but if you need to conform to the JDK interface there's still a win to be had here.
• The direct buffer results suggest it might be possible to further improve the heap buffer result by using Unsafe to put the byte in rather than array access. If that is the case then it's an interesting result in and of itself. I will be returning to that one in future.
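To make the shape of such an encoder concrete, here is a minimal sketch (my own illustration, not the actual unsafe3 code; it writes via the ByteBuffer API rather than Unsafe, and treats malformed surrogate pairs as a failure where the JDK would substitute a replacement character):

import java.nio.ByteBuffer;

// Hypothetical single-shot UTF-8 encoder in the spirit described above: encode the
// whole char[] or report overflow, with no encoder state carried between calls.
public class OneShotUtf8Encoder {
    static boolean encode(char[] chars, ByteBuffer out) {
        int start = out.position();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c < 0x80) {                        // 1 byte
                if (out.remaining() < 1) { out.position(start); return false; }
                out.put((byte) c);
            } else if (c < 0x800) {                // 2 bytes
                if (out.remaining() < 2) { out.position(start); return false; }
                out.put((byte) (0xC0 | (c >> 6)));
                out.put((byte) (0x80 | (c & 0x3F)));
            } else if (Character.isSurrogate(c)) { // 4 bytes for a valid surrogate pair
                if (i + 1 >= chars.length || !Character.isSurrogatePair(c, chars[i + 1])
                        || out.remaining() < 4) { out.position(start); return false; }
                int cp = Character.toCodePoint(c, chars[++i]);
                out.put((byte) (0xF0 | (cp >> 18)));
                out.put((byte) (0x80 | ((cp >> 12) & 0x3F)));
                out.put((byte) (0x80 | ((cp >> 6) & 0x3F)));
                out.put((byte) (0x80 | (cp & 0x3F)));
            } else {                               // 3 bytes
                if (out.remaining() < 3) { out.position(start); return false; }
                out.put((byte) (0xE0 | (c >> 12)));
                out.put((byte) (0x80 | ((c >> 6) & 0x3F)));
                out.put((byte) (0x80 | (c & 0x3F)));
            }
        }
        return true;
    }
}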

And there you have it: a way to encode strings into byte buffers that is at least 30% faster than the JDK provided ones (for now), and which gives you consistent performance across JDK versions and for heap and direct buffers :)

    Experimentation Notes: on herding processes and CPUs

Summary: notes on CPU affinity and C-state management from recent benchmarking efforts.
Your OS assumes your machine is used for a mix of activities, all of which have roughly the same priority and all of which must stand in line to get at limited resources. This is mostly terrific and lets us multi-task to our inner ADHD child's content. There are times however when you might want to exercise some control over who goes where and uses what; here are some notes to assist with this task. This is not a proper way to do this (it will all go away on restart, for one), and I'm not much of a UNIX guru; this is rather an informal cheat sheet to help you get where you are going.

    IRQ Balance

Stop the OS from spreading interrupts fairly across the cores:
    service irqbalance stop

    CPU power saving - cpufreq

    Your OS is trying to be green and put your CPUs to sleep when it thinks you are not using them. Good for some but not when you are in a hurry: 
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    When you are done set it back to scaling the frequency on demand:
    echo ondemand | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    Pin processes to CPU cores

To pin all your processes to a particular set of CPUs (just CPU 0 in this example):
    for i in `ps -eo pid` ; do sudo taskset -pc 0 $i ; done
    When you are done you can let them roam again:
    for i in `ps -eo pid` ; do sudo taskset -pc 0-3 $i ; done
This is useful when benchmarking: everybody moves to one core and you taskset your benchmarking process onto the cores left over. Note that some processes may refuse to move. If you are in a NUMA environment you might want to use numactl instead.

    Atomic*.lazySet is a performance win for single writers

Summary: For programs respecting the Single Writer principle, the Atomic*.lazySet method and its underlying Unsafe.putOrdered* intrinsic present a performance win in the form of significantly cheaper volatile writes.

A few months ago I attended Martin Thompson's excellent Lock Free Algorithms course. The course walks through some familiar territory for those who have been reading his blog and read through the Disruptor, and lots of other goodies which are not. Most of all, the dude himself is both amazingly knowledgeable on all things concurrent, and a clear presenter/teacher on a topic that is confusing and often misunderstood. One of the facilities we utilized during that course, and one that is present under the covers of the Disruptor, was lazySet/putOrdered. It was only after the course that I wondered what that bit of magic is and how/why it works. Having talked it over briefly with Martin, and having dug up the treasures of the internet, I thought I'd share my findings to highlight the utility of this method.

    The origins of lazySet

    "In the beginning there was Doug"

And Doug said: "This is a niche method that is sometimes useful when fine-tuning code using non-blocking data structures. The semantics are that the write is guaranteed not to be re-ordered with any previous write, but may be reordered with subsequent operations (or equivalently, might not be visible to other threads) until some other volatile write or synchronizing action occurs." - Doug Lea is one of the main people behind Java concurrency and the JMM, and the man behind the java.util.concurrent package. Carefully reading his definition of lazySet, it is not clear that it guarantees much at all in and of itself.
The description of where it might prove useful is also not that encouraging: "The main use case is for nulling out fields of nodes in non-blocking data structures solely for the sake of avoiding long-term garbage retention" - which implies that the implementers of lazySet are free to delay the set indefinitely. Nulling out values whose visibility you don't particularly care about does not sound like such a hot feature.
The good bit is however saved for last: "lazySet provides a preceding store-store barrier (which is either a no-op or very cheap on current platforms), but no store-load barrier" - let's refresh our memory from Doug's cookbook (no muffins there :-(, but lots of crunchy nuggets of wisdom):
    StoreStore Barriers: The sequence: Store1; StoreStore; Store2 ensures that Store1's data are visible to other processors (i.e.,flushed to memory) before the data associated with Store2 and all subsequent store instructions. In general, StoreStore barriers are needed on processors that do not otherwise guarantee strict ordering of flushes from write buffers and/or caches to other processors or main memory.

    StoreLoad Barriers: The sequence: Store1; StoreLoad; Load2 ensures that Store1's data are made visible to other processors (i.e., flushed to main memory) before data accessed by Load2 and all subsequent load instructions are loaded. StoreLoad barriers protect against a subsequent load incorrectly using Store1's data value rather than that from a more recent store to the same location performed by a different processor. Because of this, on the processors discussed below, a StoreLoad is strictly necessary only for separating stores from subsequent loads of the same location(s) as were stored before the barrier. StoreLoad barriers are needed on nearly all recent multiprocessors, and are usually the most expensive kind. Part of the reason they are expensive is that they must disable mechanisms that ordinarily bypass cache to satisfy loads from write-buffers. This might be implemented by letting the buffer fully flush, among other possible stalls.
We all like cheap (love no-op) and hate expensive when it comes to performance, so we would all like lazySet to be as good as a volatile set, just a lot cheaper. A volatile set would require a StoreLoad barrier, which is expensive because it has to make the data available to everyone before we get on with our tasks, and get the latest data in case someone else changed it. This is implicit in the line "protect against a subsequent load incorrectly using Store1's data value rather than that from a more recent store to the same location". But if there is only a single writer we don't need to do that, as we know no one will ever change the data but us.
And from that it follows that, strictly speaking, lazySet is at the very least as correct as a volatile set for a single writer.
    At this point the question is when (if at all) will the value set be made visible to other threads.

    "Dear Doug"

The Concurrency Interest mailing list is an excellent source of informal Q&A with the Java concurrency community, and the question I ask above has been answered there by Doug himself:
    1) Will lazySet write actually happens in some finite time?
    The most you can say from the spec is that it will be written no later than at the point that the process must write anything else in the Synchronization Order, if such a point exists. However, independently of the spec, we know that so long as any process makes progress, only a finite number of writes can be delayed. So, yes.
2) If it happens (== we see the spin-wait loop finish) -- does it mean that all writes preceding lazySet are also done, committed, and visible to thread 2, which finished the spin-wait loop?
    Yes, although technically, you cannot show this by reference to the Synchronization Order in the current JLS.
    ...
    lazySet basically has the properties of a TSO store
To give credit where credit is due, the man who asked the question is Ruslan Cheremin, and if you can read Russian (or what Google Translate makes of it) you can see he was similarly curious about the guarantees provided by lazySet; his inquiry and the breadcrumbs it left made my job much easier.
Now that we've established that lazySet definitely should work, and that Doug promises us an almost free volatile write for single writers, all we need to quantify is how lazy lazySet is exactly. In Doug's reply he suggests the publication is conditional on further writes being made, somehow causing the CPU to flush the store queue at some unknown point in the future. This is not good news if we care about predictable latency.

    Lazy, Set, Go!

To demonstrate that lazySet is in fact fine in terms of latency, and to further demonstrate that it is a big win for single writers, I put together some experiments. I wanted to demonstrate the low level mechanics behind lock-free, wait-free (no synchronized blocks/locks/wait/notify allowed) inter-thread communication, and to do so I re-implemented/trimmed to size AtomicLong as a VolatileLong, because we don't need the atomicity offered on set, and I also wanted to add a direct (as in not ordered or volatile) setter of the value (the full code is here):
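A sketch of the kind of VolatileLong used (my reconstruction; field and method names are mine, the real code is linked above):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// One value, three ways to write it: volatile, ordered (lazySet) and plain.
class VolatileLong {
    private static final Unsafe UNSAFE = getUnsafe();
    private static final long VALUE_OFFSET;
    static {
        try {
            VALUE_OFFSET = UNSAFE.objectFieldOffset(VolatileLong.class.getDeclaredField("value"));
        } catch (NoSuchFieldException e) { throw new RuntimeException(e); }
    }
    private volatile long value;

    long volatileGet()       { return value; }                                    // volatile load
    void volatileSet(long v) { value = v; }                                       // volatile store (StoreLoad)
    void lazySet(long v)     { UNSAFE.putOrderedLong(this, VALUE_OFFSET, v); }    // StoreStore only
    void directSet(long v)   { UNSAFE.putLong(this, VALUE_OFFSET, v); }           // plain store, no ordering

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}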
I hid the actual choice of setter by creating a Counter interface with a get and set method. The get always uses the volatile getter, as using the direct one results in the values never being propagated; it's included for completeness. The experiments were run with same core and cross core affinity.

Ping Pong - demonstrate lazySet has good latency characteristics

We have 2 threads who need to keep pace with each other such that one informs the other of its current long value, and waits for the other to confirm it got it before incrementing that same value and repeating. In the real world there is a better (as in faster) way to implement this particular requirement of keeping 2 threads in step, but as we are looking at single writers we will make a counter for each thread and maintain the single writer principle. The full code is here, but the core is:
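Here is a sketch of that core (my reconstruction, not the linked code; Counter is the get/set interface described above, backed by one of the VolatileLong setters):

// Each loop runs on its own thread; each counter has exactly one writer.
final class PingPong {
    interface Counter { long get(); void set(long value); }

    static void pingLoop(Counter ping, Counter pong, long iterations) {
        for (long v = 1; v <= iterations; v++) {
            ping.set(v);                 // single writer of ping
            while (pong.get() != v) ;    // spin until the value is echoed back
        }
    }

    static void pongLoop(Counter ping, Counter pong, long iterations) {
        for (long v = 1; v <= iterations; v++) {
            while (ping.get() != v) ;    // spin until the next value arrives
            pong.set(v);                 // single writer of pong
        }
    }
}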
Note that this is rather contrived behaviour in a lock-free, wait-free program, as the threads spin-wait on each other, but as a way of measuring latency it works. It also demonstrates that even though no further writes are made after lazySet, the value still 'escapes' as required.

    Catchup - demonstrate lazySet cost for single writer


One thread is using a counter to mark inserted values into an array. The other thread reads the value of this counter and scans through the array until it catches up with the counter. This is a typical producer/consumer relationship and the low cost of our lazy write is supposed to shine here by not imposing the cost of a full StoreLoad barrier on the writing thread. Note that in this experiment there is no return call from the consumer to the producer. The full code is here, but the core is:
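Again only a sketch of the core (my reconstruction; the real code is linked above):

// Producer fills the array and publishes progress through the counter (single writer);
// the consumer chases the counter and reads everything published so far.
final class Catchup {
    static void produce(long[] values, PingPong.Counter counter) {
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
            counter.set(i + 1);              // volatile, lazy or direct set, depending on the experiment
        }
    }

    static long consume(long[] values, PingPong.Counter counter) {
        long sum = 0;
        int read = 0;
        while (read < values.length) {
            long available = counter.get();
            while (available == read) {      // spin until the producer publishes more
                available = counter.get();
            }
            for (; read < available; read++) {
                sum += values[read];
            }
        }
        return sum;
    }
}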

    Results and analysis


    Sorry for the slightly lame chart, I'll explain:
• Labels go Experiment -> set method -> affinity, i.e. 'Catchup direct same' means it's the result for running the catchup experiment using the direct set, with affinity set up such that both threads are running on the same core.
    • Yellow bar is maximum run time, orange is median, blue is minimum.
As we can see, for the Ping Pong experiment there is practically no difference between the different methods of writing. Latency is fairly stable, although volatile performs slightly better in that respect. The Catchup experiment demonstrates the fact that volatile writes are indeed significantly more expensive (5-6 times) than the alternatives.
The curious guest at this party is the direct write. It shouldn't really work at all, and yet not only does it work, it also seems like a better performer than lazySet/putOrdered. How come? I'm not sure. It certainly isn't a recommended path to follow, and I have had variations of the experiments hang when using the direct set. The risk here is that we are completely at the mercy of the JIT compiler's cleverness not realizing that our set can legally be done to a register rather than a memory location. We also have no guarantees regarding re-ordering, so using it as a write marker as done in catchup is quite likely to break in more complex environments or under closer inspection. It may be worthwhile using it when no happens-before guarantee is required for prior writes, i.e. for publishing thread metrics or other such independent values, but it is an optimization that might backfire at any time.

    Summary:

lazySet/putOrdered as a means of providing a happens-before edge to memory writes is one of the building blocks of the Disruptor and other frameworks. It is a useful and legitimate method of publication and can provide measurable performance improvements, as demonstrated.

    Further thoughts on related topics...

As part of the data collection for this article I also looked at padded variations of the volatile long, used to defend against false sharing, and implemented a few variations of those. I went on to implement the same padded and un-padded variations as off-heap structures and compared the performance of each, hitting some interesting issues along the way. In the end I decided it is best to keep this post focused and put the next step into another post; the code is however available for reading on GitHub and should you find it interesting I'm happy to discuss.


    Java ping: a performance baseline utility

Summary: an open source mini utility for establishing a baseline measurement of Java application TCP latency, a short discussion on the value of baseline performance measurements, and a handful of measurements taken using the utility.

We all know and love ping, and in most environments it's available to us as a means of testing basic network latency between machines. This is extremely useful, but ping is not written in Java and it's also not written with low latency in mind. This is important (or at least I think it is) when examining a Java application and trying to make an informed judgement on observed messaging latency in a given environment/setup.
    I'll start with the code and bore you with the philosophy later:
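Here is a sketch of the client side to give the flavour of it (my reconstruction, not the actual utility, which is linked from this post):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// A non-blocking channel used in a spin loop to send a small message and await the echo.
public class JavaPingClient {
    public static void main(String[] args) throws Exception {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 12345));
        channel.configureBlocking(false);
        channel.socket().setTcpNoDelay(true);
        ByteBuffer buffer = ByteBuffer.allocateDirect(32);
        for (int i = 0; i < 100000; i++) {
            long start = System.nanoTime();
            buffer.clear();
            while (buffer.hasRemaining()) channel.write(buffer);   // spin the 32 bytes out
            buffer.clear();
            while (buffer.hasRemaining()) channel.read(buffer);    // spin until the echo is fully back
            long rtt = System.nanoTime() - start;
            // record rtt in a histogram here
        }
        channel.close();
    }
}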


The mechanics should be familiar to anyone who has used NIO before; the notable difference from common practice is using NIO non-blocking channels to perform essentially blocking network operations.
The code was heavily 'inspired' by Peter Lawrey's socket performance analysis post and code samples (according to Mr. Lawrey's licence you may have to buy him a pint if you find it useful, I certainly owe him one). I tweaked the implementation to make the client spin as well as the server, which improved the latency a bit further. I separated the client and server, added an Ant build to package them with some scripts and so on. Notes:
    • The server has to be running before the client connects and will shut down when the client disconnects. 
    • Both server and client will eat up a CPU as they both spin in wait for data on the socket channel. 
    • To get the best results pin the process to a core (as per the scripts).

     

    Baseline performance as a useful measure

When measuring performance we often compare the performance of one product to the next. This is especially true when comparing higher level abstraction products which are supposed to remove us from the pain of networking, IO or other such ordinary and 'technical' tasks. It is important however to remember that abstraction comes at a premium, and having a baseline measure for your use case can help determine that premium. To offer a lame metaphor, this is not unlike considering the bill of materials in the bottom line presented to you by your builder.
While this is not a full blown application, it illustrates the cost/latency inherent in doing TCP networking in Java. Any other cost involved in your application's request/response latency needs justifying. It is reasonable to make all sorts of compromises when developing software, and indeed there are many a corner to be cut in a 50 line sample that simply would not do in a full blown server application, but the 50 line sample tells us something about the inherent cost. Some of the overhead you may find acceptable for your use case, other times it may not seem acceptable, but having a baseline informs you on the premium.
• On the same stack (hardware/JDK/OS) your application will be slower than your baseline measurement, unless it does nothing at all.
    • If you are using any type of framework, compare the bare bones baseline with your framework baseline to find the basic overhead of the framework (you can use the above to compare with Netty/MINA for instance).
    • Consider the hardware level functionality of your software to match with baseline performance figures (i.e: sending messages == socket IO, logging == disk IO etc.). If you think a logging framework has little overhead on top of the cost of serializing a byte buffer to disk, think again.

    Variety is the spice of life

    To demonstrate how one would use this little tool I took it for a ride:
    • All numbers are in nanoseconds
    • Tests were run pinned to CPUs, I checked the variation between running on same core, across cores and across sockets
    • This is RTT(round trip time), not one hop latency(which is RTT/2)
• The code prints out a histogram summary of pinging a 32 byte message. Mean is the average, 50% means 50% of round trips had a latency below X, 99%/99.99% in the same vein. (Percentiles are commonly used to measure latency SLAs.)
To start off I ran it on my laptop (i5/Ubuntu 12.04/JDK7) on loopback, the result was:
    Same core:   mean=8644.23, 50%=9000, 99%=16000, 99.99%=24000
    Cross cores: mean=5809.40, 50%=6000, 99%=9000, 99.99%=23000
Sending and receiving data over loopback is CPU intensive, which is why putting the client and the server on the same core is not a good idea. I went on to run the same test on a beefy test environment, which has 2 test machines with tons of power to spare, and a choice of NICs connecting them together directly. The test machine is a dual socket beast so I took the opportunity to run on loopback across sockets:
    Cross sockets:           mean=12393.97, 50%=13000, 99%=16000, 99.99%=29000
    Same socket, same core:  mean=11976.68, 50%=12000, 99%=16000, 99.99%=28000
    Same socket, cross core: mean=7663.82, 50%=8000, 99%=11000, 99.99%=23000
Testing the connectivity across the network between the 2 machines, I compared 2 different 10Gb cards and a 1Gb card available on that setup. I won't mention make and model as this is not a vendor shootout:
    10Gb A:  mean=19746.08, 50%=18000, 99%=26000, 99.99%=38000
    10Gb B:  mean=30099.29, 50%=30000, 99%=33000, 99.99%=44000
    1Gb  C:  mean=83022.32, 50%=83000, 99%=87000, 99.99%=95000
The above variations in performance are probably familiar to those who do any amount of benchmarking, but may come as a slight shock to those who don't. This is exactly what people mean when they say your mileage may vary :). And this is without checking for further variation by JDK version/vendor, OS, etc. There will be variation in performance depending on all these factors, which is why a baseline figure taken from your own environment can provide a useful estimation tool for performance on the same hardware. The above also demonstrates the importance of process affinity when considering latency.


    Conclusion

An average RTT latency of 20 microseconds between machines is pretty nice. You can do better by employing better hardware and drivers (kernel bypass), and you can make your outliers disappear by fine tuning JVM options and the OS. At its core Java networking is pretty darn quick; make sure you squeeze all you can out of it. But to do that, you'll need a baseline figure to let you know when you can stop squeezing, and when there's room for improvement.

    Direct Memory Alignment in Java

Summary: First in a (hopefully) quick series of posts on memory alignment in Java. This post introduces memory alignment, shows how to get memory aligned blocks, and offers an experiment and some results concerning unaligned memory performance.

Since the first days of Java, one of the things you normally didn't need to worry about was memory. This was a good thing for all involved, who cared little if some extra memory was used, where it was allocated and when it would be collected. It's one of the great things about Java, really. On occasion however we do require a block of memory. A common cause for that is doing IO, but more recently it has been used as a means of having off-heap memory, which allows some great performance benefits for the right use case. Even more recently there has been talk of bringing back C-style structs to Java, to give programmers better control over memory layout. Having stayed away from direct memory manipulation for long (or having never had to deal with it) you may want to have a read of the most excellent series of articles "What Every Programmer Should Know About Memory" by Ulrich Drepper. I'll try and cover the required material as I go along.

     

    Getting it straight

Memory alignment is mostly invisible to us when using Java, so you can be forgiven for scratching your head. Going back to basic definitions here:
    "A memory address a, is said to be n-byte aligned when n is a power of two and a is a multiple of nbytes.
    A memory access is said to be aligned when the datum being accessed is n bytes long and the datum address is n-byte aligned. When a memory access is not aligned, it is said to be misaligned. Note that by definition byte memory accesses are always aligned."
    In other words for n  which is a power of 2:
    boolean nAligned = (address%n) == 0;
    Memory alignments of consequence are:
1. Type alignment (1, 2, 4, 8) - Certain CPUs require aligned access to memory and as such would get upset (fault, or lose atomicity) if you attempt misaligned access to your data. E.g. if you try to MOV a long to/from an address which is not 8-byte aligned.
2. Cache line alignment (normally 64 bytes, can be 32/128/other) - The cache line is the atomic unit of transport between main memory and the different CPU caches. As such, packing relevant data to that alignment and within those boundaries can make a difference to your memory access performance. Other considerations are required here due to the rules of interaction between the CPU and the cache lines (false sharing, word tearing, atomicity).
3. Page size alignment (normally 4096 bytes, can be configured by the OS) - A page is the smallest chunk of memory transferable to an IO device. As such, aligning your data to page size, and using the page size as your allocation unit, can make a difference to interactions when writing to disk/network devices.
Memory allocated in the form of objects (anything produced by 'new') and primitives is always type aligned and I will not be looking much into it. If you want to know more about what the Java compiler makes of your objects, this excellent blog post should help.
As the title suggests I am more concerned with direct memory alignment. There are several ways to get direct memory access in Java; they boil down to 2 methods:
    1. JNI libraries managing memory --> not worried about those for the moment
    2. Direct/MappedByteBuffer/Unsafe which are all the same thing really. The Unsafe class is at the bottom of this gang and is the key to official and semi-official access to memory in Java.
The memory allocated via Unsafe.allocateMemory will be 8 byte aligned. Memory acquired via ByteBuffer.allocateDirect will ultimately go through Unsafe.allocateMemory but will differ in a few important ways:

    1. ByteBuffer memory will be zero-ed out.
2. Up until JDK6, direct byte buffers were page aligned at the cost of allocating an extra page of memory for every byte buffer allocated. If you allocated lots of direct byte buffers this got pretty expensive. People complained, and as of JDK7 this is no longer the case. This means that direct ByteBuffers are now only 8 byte aligned.
3. Less crucial, but convenient: memory allocated via ByteBuffer.allocateDirect is freed as part of the ByteBuffer object's GC. This mechanism is only half comforting as it depends on the finalizer being called, and that is not much to depend on.

    Allocating an aligned ByteBuffer

    No point beating about the bush, you'll find it's very similar to how page alignment works/used to work for ByteBuffers, here's the code:
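A sketch of the idea (my reconstruction, not the post's exact code; the buffer address is read reflectively from Buffer's 'address' field):

import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;

// Over-allocate by 'align' bytes, then slice a capacity-sized window starting
// at the next aligned address inside the over-sized buffer.
public class AlignedBuffers {
    public static ByteBuffer allocateAligned(int capacity, int align) {
        if (Integer.bitCount(align) != 1) {
            throw new IllegalArgumentException("alignment must be a power of 2");
        }
        ByteBuffer buffy = ByteBuffer.allocateDirect(capacity + align);
        long address = addressOf(buffy);
        int offset = (int) ((align - (address & (align - 1))) & (align - 1));
        buffy.position(offset);
        buffy.limit(offset + capacity);
        return buffy.slice();
    }

    private static long addressOf(ByteBuffer buffer) {
        try {
            Field addressField = Buffer.class.getDeclaredField("address");
            addressField.setAccessible(true);
            return addressField.getLong(buffer);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}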

    To summarize the above, you allocate as much extra memory as the required alignment to ensure you can position a large enough memory slice in the middle of it.

     

    Comparing Aligned/Unaligned access performance

Now that we can have an aligned block of memory to play with, let's test drive some theories. I've used Caliper as the benchmark framework for this post (it's great, I'll write a separate post on my experience/learning curve) and intend to convert previous experiments to use it when I have the time. The theories I wanted to test out in this post are:
    Theorem I: Type aligned access provides better performance than unaligned access.
    Theorem II: Memory access that spans 2 cache lines has far worse performance than aligned mid-cache line access.
    Theorem III: Cache line access performance changes based on cache line location.
    To test the above theorems I put together a small experiment/benchmark, with an algorithm as follows:
1. Allocate a page aligned memory buffer of set sizes: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 pages. Allocation time is not measured but done up front. The different sizes will highlight performance variance which stems from the cache hierarchy.
2. We will write a single long per cache line (iterating through the whole memory block), then go back and read the written values (iterating through the whole memory block again). There are usually 64 cache lines in a page (the 4096b/64b case), but we can only write 63 longs which 'straddle' 2 lines, so the non-straddled offsets skip the first line to keep the same number of operations.
    3. Repeat step 2 (2048/(size in pages)) times. The point is to do the same number of memory access operations per experiment.
    Here is the code for the experiment:
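The measured loop is along these lines (a sketch, my reconstruction of the description above rather than the actual benchmark code):

import sun.misc.Unsafe;

// Write one long per 64-byte cache line at the given offset into the block, then read
// them all back. The last line of the block is skipped so a straddling offset (e.g. 60)
// never runs past the end of the block.
public class UnalignedAccessPass {
    static long writeReadPass(Unsafe unsafe, long blockAddress, long blockSize, int offset) {
        long lastLine = blockAddress + blockSize - 64;
        for (long line = blockAddress; line < lastLine; line += 64) {
            unsafe.putLong(line + offset, line);
        }
        long sum = 0;
        for (long line = blockAddress; line < lastLine; line += 64) {
            sum += unsafe.getLong(line + offset);
        }
        return sum; // returned so the JIT cannot eliminate the reads
    }
}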


Before moving on to the results please note that this test is very likely to yield different results on different hardware, so your mileage is very likely to vary. If you find the time to run the test on your own hardware I'd be very curious to compare notes. I used my laptop, which has an i5-2540M processor (Sandy Bridge, dual core, hyper-threaded, 2.6GHz). The L1 cache is 32K, L2 is 256K and L3 is 3M. This is important in the context of the changes in behaviour we would expect to see as the data set exceeds the capacity of each cache. Should you run this experiment on a machine with different cache sizes, expect your results to vary accordingly.
This experiment demonstrates a very simple use case (read/write a long). There is no attempt to use the many features the CPU offers which may indeed require stronger alignment than simple reading and writing. As such, take the results to be limited in scope to this use case alone.
Finally, note that the charts reflect performance relative to offset 0 memory access. As the performance also varied as a function of memory buffer size, I felt this helps highlight the differences in behaviour which are related to offset rather than size. The effect of the size of the memory buffer is reflected in this chart, showing time as a function of size for offset 0:
The chart shows how the performance is determined by the cache the memory buffer fits into. The cache sizes in pages are 8 for L1, 64 for L2, and 768 for L3, which correspond to the steps (200us, 400us, 600us) we see above, with a final step into main memory (1500us). While not the topic at hand, it is worth noting the effect a reduced data set size can have on your application's performance, and by extension the importance of realistic data sets in benchmarking.

     

    Theorem I: Type aligned access provides better performance than unaligned access - NOT ESTABLISHED (for this particular use case)

    To test Theorem I I set offset to values 0 to 7. Had a whirl and to my surprise found it made no bloody difference! I double and triple checked and still no difference. Here's a chart reflecting the relative cost of memory access:

What seems odd is the difference in performance for the first 4 bytes, which are significantly less performant than the rest for memory buffer sizes which fit in the L1 cache (32k = 4k * 8 = 8 pages). The rest show relatively little variation, and for the larger sizes there is hardly any difference at all. I have been unable to verify the source of the performance difference...
What is this about? Why would Mr. Drepper and many others on the good web lie to me? As it turns out, Intel in their wisdom have sorted this little nugget out for us recently, with the Nehalem generation of CPUs, such that unaligned type access has no performance impact. This may not be the case on other non-Intel CPUs, or older Intel CPUs. I ran the same experiment on my i5 laptop and my Core2 desktop, but could not see a significant difference. I have to confess I find this result a bit alarming given how it flies in the face of accepted good practice. A more careful read of Intel's Performance Optimization Guide reveals the recommendation to align your types stands, but it is not clear what adverse effects it may have for the simple use case we have here. I will perhaps attempt to construct a benchmark aimed at uncovering those issues in a further post. For now it may comfort you to know that breaking the alignment rule seems to not be the big issue it once was on x86 processors. Note that behaviour on non-x86 processors may be wildly different.

     

    Theorem II: Memory access that spans 2 cache lines has far worse performance than aligned mid-cache line access - CONFIRMED

To test Theorem II I set the offset to 60. The result was a significant increase in the run time of the experiment. The difference was far more pronounced on the Core2 (x3-6 the cost of normal access) than on the i5 (roughly x1.5 the cost of normal access), and we can look forward to it becoming less significant as time goes by. This result should be of interest to anybody using direct memory access as the penalty is quite pronounced. Here's a graph comparing offsets 0, 32 and 60:

    The other side effect of cache line straddling is loss of update atomicity, which is a concern for anyone attempting concurrent direct access to memory. I will go back to the concurrent aspects of aligned/unaligned access in a later post.

     

    Theorem III: Cache line access performance changes based on cache line location - NOT ESTABLISHED

This one is based on Mr. Drepper's article:
    3.5.2 Critical Word Load

    Memory is transferred from the main memory into the caches in blocks which are smaller than the cache line size. Today 64 bits are transferred at once and the cache line size is 64 or 128 bytes. This means 8 or 16 transfers per cache line are needed.
    The DRAM chips can transfer those 64-bit blocks in burst mode. This can fill the cache line without any further commands from the memory controller and the possibly associated delays. If the processor prefetches cache lines this is probably the best way to operate.
    If a program's cache access of the data or instruction caches misses (that means, it is a compulsory cache miss, because the data is used for the first time, or a capacity cache miss, because the limited cache size requires eviction of the cache line) the situation is different. The word inside the cache line which is required for the program to continue might not be the first word in the cache line. Even in burst mode and with double data rate transfer the individual 64-bit blocks arrive at noticeably different times. Each block arrives 4 CPU cycles or more later than the previous one. If the word the program needs to continue is the eighth of the cache line, the program has to wait an additional 30 cycles or more after the first word arrives
To test Theorem III I set the offset values to the different type aligned locations for a long on a cache line: 0, 8, 16, 24, 32, 40, 48, 56. My expectation was to see slightly better performance for the 0 location, as explained in Mr. Drepper's paper, with degrading performance as you get further from the beginning of the cache line, at least for the large buffer sizes where data is always fetched from main memory. Here's the result:

The results surprised me. As long as the buffer fit in my L1 cache, writing/reading from the first 4 bytes was significantly more expensive than elsewhere on the line, as discussed in Theorem I, but the other locations did not show measurable differences in cost. I pinged Martin Thompson, who pointed me at the Critical Word First feature/optimization explained here:
    Critical word first takes advantage of the fact that the processor probably only wants a single word by signaling that it wants the missed word to appear first in the block.  We receive the first word in the block (the one we actually need) and pass it on immediately to the CPU which continues execution.  Again, the block continues to fill up in the "background".
I was unable to determine when exactly Intel introduced the feature to their CPUs, and I'm not sure which processors fail to supply it.

    Conclusions/Takeaways

Of the 3 concerns stated, only cache line straddling proved to be a significant performance consideration. The cache line straddling impact is diminishing in later Intel processors, but is unlikely to disappear altogether. As such it should feature in the design of any direct memory access implementation and may be of interest to serialization/de-serialization code.
    The other significant factor is memory buffer size, which is obvious when one considers the cache hierarchy. This should prompt us to make an effort towards more compact memory structures. In light of the negligible cost of type mis-alignment we may want to skip type alignment driven padding of structures when implementing off-heap memory stores.
Finally, this made me realize just how fast the pace of development in hardware is, and how it invalidates past conventional wisdom and assumptions. So, yet another piece of evidence validating the call to 'Measure, Measure, Measure' all things performance.

    Running the benchmarks yourself

    1. Code is on GitHub
    2. Run 'ant run-unaligned-experiments'
    3. Wait... it takes a while to finish
    4. Send me the output?
    To get the most deterministic results you should run the experiments pinned to a core which runs nothing else, that would reduce noise from other processes 'polluting' the cache.

    Many thanks to Martin, Aleksey Shipilev and Darach Ennis for proof reading and providing feedback on initial drafts, any errors found are completely their fault :-).

    Second part is here: Alignment, Concurrency and Torture

    Using Caliper for writing Micro Benchmarks

Summary: Caliper is a micro benchmarking framework; here's how to get going with it using Ant + Ivy, showing a before/after of an existing benchmark and working around the broken dependency/option for measuring memory allocation.
    Writing Micro-Benchmarks in Java is famously hard/tricky/complicated. If you don't think so read these articles:
    Caliper is an open source project which aims to simplify the issue by... giving you a framework for writing benchmarks! It's still a bit rough around the edges, but already very useful.

     

    Get started

I'll assume you've got Ant sorted and nothing more; you'll need to add Ivy support to your build. You can use the following, which I cannibalised from the Ivy sample:
    And you will need an ivy.xml file to bring in Caliper:
You'll notice the allocation dependency is excluded and that the build has a task in it to download the jar directly from the website... there's good reason for that. Ivy uses Maven repositories to get its dependencies and the Maven-produced java-allocation-instrumenter jar is sadly broken. You can fix it by downloading it manually from here. There is probably a cleaner way to handle this with Ivy using resolvers and so on, but this is not a post about Ivy, so I won't bother.
    You can use an Eclipse plugin to support Ivy integration and bring the jars into your project. You'll still need to get the allocation.jar and sort it out as described below.
    Now that we got through the boring bits, let's see why we bothered.

    UTF-8 Encoding benchmarks: Before and after

To give context to this tool you need to review how hand rolled benchmarks often look. In this case I'll just revisit a benchmark I did for a previous post, measuring different methods of encoding UTF-8 Strings into a byte buffer. The full code base is here, but here's the original code used for benchmarking and comparing (written by Evan Jones):
This is quite typical, actually better than most benchmarking code as things go. But it's quite a lot of code to basically compare a few ways of achieving the same thing. If you are going to do any amount of benchmarking you will soon grow tired of this boilerplate and come up with some framework or other... So how about you don't bother? Here's how the benchmark looks when Caliper-ized, including the code actually under test:
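The Caliper-ized version has roughly this shape (a sketch using the Caliper 0.5-era API; class, field and method names here are mine, not the actual benchmark's):

import java.nio.ByteBuffer;
import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

// One timeXxx method per flavour under comparison; reps is supplied by Caliper.
public class Utf8EncodingBenchmark extends SimpleBenchmark {
    @Param({"true", "false"})
    boolean direct;
    private ByteBuffer out;
    private String input;

    protected void setUp() {
        out = direct ? ByteBuffer.allocateDirect(4096) : ByteBuffer.allocate(4096);
        input = "a not too short string with some funny characters: \u00e9\u00e8\u20ac";
    }

    public int timeStringGetBytes(int reps) throws Exception {
        int result = 0;
        for (int i = 0; i < reps; i++) {
            out.clear();
            out.put(input.getBytes("UTF-8"));
            result += out.position();
        }
        return result;
    }

    public static void main(String[] args) {
        Runner.main(Utf8EncodingBenchmark.class, args);
    }
}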
Note there's hardly any code concerned with the benchmarking functionality, and that in fewer lines of code we also fit in 3 flavours of the code we wanted to compare. Joy!

Running the main gives you the following output:
Now, isn't that nice? You get this lovely little ASCII bar on the right, and the result units are sorted. Yummy! Here are some command line options to play with:
    --trials <n> : this will run several trials of your benchmark. Very important! you can't rely on a single measurement to make conclusions.
    --debug : If you want to debug the process this will not spawn a new process to run your benchmark so that you can intercept the breakpoints easily.
    --warmupMillis <millis> : how long to warm up your code for.
    --runMillis <millis> : how long should a trial run take
    --measureMemory : will measure and compare allocations
Isn't that great? Sadly the last one (measureMemory) is a bit annoying to get working because:
    1. The dependency jar does not work
    2. Just getting the right jar is not enough because...
    3. You need to set up a magical environment variable: ALLOCATION_JAR
4. Don't rename the allocation jar; the name is in the manifest and is required for the java agent to work.
    Here's an Ant task which runs the UTF-8 benchmark with measureMemory:
And the output illustrates how using String.getBytes will cause a lot of extra allocation compared to the other methods:
That took a bit of poking around the internet to sort out, but now you don't have to. And now that it's so easy to write micro benchmarks, there's less of an excuse not to measure before implementing a particular optimization.
Finally, to their eternal credit, the writers of Caliper include a page on the project site which highlights some of the pitfalls and considerations around micro benchmarks, so please "Beware the Jabberwock" :P
    Enjoy.

    Experimentation Notes: Java Print Assembly

    Summary: How to print out Java compiled assembly instructions on Linux. Not hard to do, just a reminder how.
    This is with the Oracle JDK 7, on Ubuntu 12.10, but should work similarly elsewhere.
    To get the assembly code print out of your java process add the following options to your Java command line:
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly ...
    Which doesn't work :( :
    Java HotSpot(TM) 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output
    Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
    So where do you get hsdis-amd64.so? and where should you put it?
    • Project Kenai got different versions for download, pick the one that matches your OS.
    • Rename it to hsdis-amd64.so and put it under /usr/lib/jvm/java-7-oracle/jre/lib/amd64/ the actual path on your machine may differ if you installed your JDK in another method/put it elsewhere etc.
    Sorted! Enjoy a fountain of assembly code spouting from your code.

    Alignment, Concurrency and Torture (x86)

    Summary: Exploring the effects of unaligned/aligned concurrent access in Java for off/on heap memory, how to test for correctness, torture and cushions. 

Concurrent access to memory is a pain at the best of times. The JMM offers Java programmers a way to reason about concurrency in a platform independent manner, and while not without fault it's certainly better than having no memory model at all. Other languages are catching up of course, but the focus of this article is not comparative concurrency, so look it up yourselves ;).
    When we go off heap via Unsafe or Direct Byte Buffers, or when we utilize Unsafe to gain direct access to on heap memory we are on our own. Direct memory access strips away the JVM guarantees and leaves you exposed to the underlying OS/architecture memory access rules and the interpretation of the JVM runtime.
    Should you be worried?
    Silly question.
    If you are not the worried type, follow this link with my blessings.
    Worried? Here's 2 guarantees now gone to worry about:
    • Atomicity - memory access on the JVM is guaranteed to be atomic for references and primitives (bar non-volatile long/double), even when the underlying system does not support it. No similar guarantee is available for ByteBuffers; in fact ByteBuffers are explicitly not thread safe, so atomicity is not guaranteed at all. Unsafe has no guarantees for anything, so caution is advised.
    • Word tearing - this means concurrent updates to adjacent bytes in memory do not result in corruption of state. Some processors perform memory writes in units of a word, which can be 2 or 4 bytes. This can result in the unmolested parts of the word overwriting bytes written to by other threads. While the JLS is binding for 'plain' memory access (via assignment) the guarantee does not necessarily extend to off heap memory.
    The problem with verifying these behaviours is that concurrency guarantees and issues are by definition timing dependent. This means that while most of the time there is no issue, that is not to say that throwing a few extra processors in, or changing memory layout, or maybe just waiting a bit longer will not trigger the issue...
    So how do you prove your code is thread safe once you stepped off heap and into the wild? I recently answered a question on stack overflow regarding testing for correctness under multi-threaded access of a data structure. My answer was accepted, so obviously it's correct, I cut out the niceties to keep this short:
     1.  Formally prove the correctness of your program in the terms of the JMM. 

     2.  Construct test cases which demonstrate the intended behaviour above by using count down latches or other means of suspending threads of execution at specific points in your program to force contention.

     3.  Statistically demonstrate correctness by exercising the code over a sufficient period of time from multiple threads.
    This is all fine when you write 'safe' Java, but options 1, and therefore 2, go out the window when you are straying off spec. This is an important observation to make: if the rules are not well defined then testing the edge cases for those rules is not that meaningful.
    Which leaves us with number 3.

    Science is torture

    It is a fact well known to those who know it well that a scientific experiment can only prove your theory is either definitely wrong or so far right. This (limited) power of induction from past experiences is what empiricism is all about. To prove JVM implementations implement the JMM correctly Mr. Shipilev (Benchmarking Tzar) has put together the Java Concurrency Torture Suite into which he drags JVMs and tries to break them on any number of architectures (not to worry, he tries it on cute bunnies first). It's a great project for anyone trying to reason about the JMM as it gives you examples in code of edge cases and the corresponding range of expected behaviours on all architectures. It also means that if you have questionable behaviours you want to explore on a particular architecture Aleksey just saved you a lot of work.
    The JCTS project is a very professional piece of work, so I'll try not to mince words over what you can read in the readme. The general approach is that if you leave it (the concurrent behaviour) long enough the edge cases will come crawling out of the woodwork. Here's one bug found using it, the discussion in the ticket is quite interesting.
    The test runs also report how many times each state was detected, which has the added benefit of educating us on the distribution of occurrences for the different cases. The next version may support ASCII bar charts too. Instead of yapping on about it, let's see how atomicity is tackled and all will become clear.

     

    Biggles! Fetch...THE CUSHIONS!

    The JCTS already has atomicity tests, I just had to add my own to target unaligned access. I added 2 tests, one for unaligned access within the line and one for crossing the cache line; for brevity this is just the long test:
    The method for allocating aligned direct memory byte buffers is described in this post. The cross cache line case is very similar with the position being set intentionally to cause cache straddled access.
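    For reference, here is a minimal sketch of the over-allocate-and-slice approach to getting an aligned direct ByteBuffer. This is my condensed version rather than the code from the linked post, and it leans on the sun.nio.ch.DirectBuffer internal API to read the buffer address, so treat it as an illustration only:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class AlignedBuffers {
        // align must be a power of 2 (e.g. 8 for longs, 64 for a cache line)
        public static ByteBuffer allocateAligned(int capacity, int align) {
            // over-allocate so we can shift the start to an aligned address
            ByteBuffer raw = ByteBuffer.allocateDirect(capacity + align);
            long address = ((sun.nio.ch.DirectBuffer) raw).address();
            int offset = (int) (align - (address & (align - 1))) & (align - 1);
            raw.position(offset);
            raw.limit(offset + capacity);
            // the slice starts at an aligned address and is 'capacity' bytes long
            return raw.slice().order(ByteOrder.nativeOrder());
        }
    }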
    I expected the straddled line access to be non-atomic and for once I was right (read on, it doesn't last)!!!
    Note the fine printed output, a joy! The test runs several times and we can see how in the majority of cases things are fine, the observed result is either 0 or -1. But there are other values which should not be there (when our mother is not), those observed values being the result of the lack of atomicity. Reading the value in some undefined trashed state gives us... mmm... trash. As I suspected! Patting myself on the shoulder I moved on to the unaligned non-straddling tests.
    The results suggested that unaligned access inside the cache line is atomic on my machine. I admit it was not what I expected; this is a bit surprising. Good old Intel, they can really make this work!


    Our chief weapon is surprise

    Going through these experiments and results reminded me of a comment made by Gil Tene (Azul CTO) on Martin Thompson's blog:
    Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non atomic updates within a single cache line can still be seen non-atomically by other threads. The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).
    BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on an x86, a LOCK operations that spans boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).
    I read the above to suggest Gil was thinking atomicity is a given across the cache line, so I sent him an e-mail to discuss my findings. Now Gil has been under the JVM hood more than most people, and this turned out to be quite enlightening:
    I'm not saying that cross cache line boundaries cannot induce non-atomicity to occur more often. I *am* saying that whenever you see non-atomicity on accesses that cross cache line boundaries, that same non-atomicity potential was there [for the same code] even within a single cache line, and is usually there regardless of whether or not the in-same-cache-line access is aligned to the type size boundary. 
    He went on to point out that experimentation as a means of proving correctness is not possible (I told you it's well known) and that where the spec doesn't promise you anything, expect nothing:
    ... there is no way to verify atomicity by experimentation. You can only disprove it, but cannot prove it.
    The only way you can control atomicity on a known x86 arch. for off heap access would be to control ALL x86 instructions used to access your data, where knowing exactly which instructions are used and what the layout restrictions you impose would allow you to prove atomicity through known architectural qualities.

     

    I didn't expect a kind of Spanish Inquisition

    Mmmm... but how would I know? Given there are no guarantees on the one hand and the limitations of experimentation on the other, we are back to square one... And while musing on Gil's wise words I ran the unaligned access test again, and what do you know:

    Bugger. Bugger. Bugger. It seems like a special value sneaks in every once in a while. This happens quite infrequently and as things worked out it failed to happen until my little exchange with Gil was over, leading me to believe Gil possesses that power of the experienced to have reality demonstrate for them on cue. On the other hand, now I know it doesn't work. Certainty at last.
    What about aligned memory access? I gave it a good run and saw no ill effects, but given Gil's warnings this may not be enough. The JCTS already had a test for aligned access and there are no particular expectations there. This is important: it is OK for a JVM implementation to give you no atomicity on putLong/Int/... (via Unsafe or DirectByteBuffer). Aligned access on x86 is atomic, but as Gil points out we don't have full control on every level so that may not be a strong enough guarantee...

    Conclusion? 

    The takeaway from the above is that off-heap storage should be treated as if atomicity is not guaranteed, especially if no attempt is made at proper type alignment. Is that the end of the world? Certainly not, but it does mean precautions must be taken if data is to be written and read concurrently. Note that mutating shared data is always a risk, but you may go further without noticing issues with a POJO approach. Also note that some of the tools used for proper object publication are not available, in particular final fields.


    Many thanks to Aleksey Shipilev and Gil Tene for their feedback and guidance along the way. I leave word tearing as an exercise to the interested reader.

    Update 14/02/2013:
    Martin Thompson got back to me asking if the same atomicity issue is still there for putOrdered*/put*Volatile when the cache line is crossed. I expected it would be (so did he), wrote some more tests and it turned out we were both correct. It makes no difference if you use put*/putOrdered*/put*Volatile/compareAndSwap*, if you cross the cache line there is no atomicity. Similarly I would expect unaligned access within the cache line to lack atomicity, but have not had time to add the tests. Code is on GitHub for your pleasure.
    Thanks Martin.
    Update 16/03/2013:
    The GitHub links to the code are temporarily removed, please contact for details if required.

    Merging Queue, ArrayDeque, and Suzie Q

    A merging queue is a map/queue data structure which is useful when only the latest update/value for a particular topic is required and you wish to process the data in FIFO manner. Two flavours are implemented, benchmarked and discussed.
    The merging queue is a useful construct for slow consumers. It allows a bounded queue to keep receiving updates while the space required is limited by the number of keys. It also allows the consumer to skip old data. This is particularly of interest for systems dealing with fast moving data where old data is not relevant and will just slow you down. I've seen this requirement in many pricing systems in the past few years, but there are other variations.
    Here's the interface:
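    The actual interface is in the linked code; as a rough guide it boils down to something like this (the method names here are my guesses, not necessarily the originals):

    // A guess at the shape of the merging queue contract: offering an existing
    // key replaces the pending value but keeps the key's place in the queue.
    public interface MergingQueue<K, V> {
        void offer(K key, V value);

        // remove and return the oldest pending value, or null if empty
        V poll();

        boolean isEmpty();
    }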

    What about LinkedHashMap?

    Now it is true that LinkedHashMap offers similar functionality and you could use it to implement a merging queue as follows:
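    Something along these lines (a sketch of the idea rather than the original code):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LinkedMergingQueue<K, V> {
        private final Map<K, V> lastValMap = new LinkedHashMap<K, V>();

        public void offer(K key, V value) {
            // put() on an existing key replaces the value but does NOT change
            // the insertion order, which gives us the merging behaviour for free
            lastValMap.put(key, value);
        }

        public V poll() {
            if (lastValMap.isEmpty())
                return null;
            // clumsy: we need a key set iterator just to find the head entry
            return lastValMap.remove(lastValMap.keySet().iterator().next());
        }

        public boolean isEmpty() {
            return lastValMap.isEmpty();
        }
    }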
    This works, but the way we have to implement poll() is clumsy. What I mean by that is that it looks like we are asking for a lot more than we want to work around some missing functionality. If you dig into the machinery behind the expression "lastValMap.remove(lastValMap.keySet().iterator().next())" there's an awful lot of intermediate structures we need to jump through before we get where we are going. LinkedHashMap is simply not geared toward being a queue; we are abusing it to get what we want.

    ArrayDeque to the rescue!

    ArrayDeque is one of the unsung heroes of the Java collections. If you ever need a non-concurrent queue or stack look no further than this class. In its guts you'll find the familiar ring buffer. It doesn't allocate or copy anything when you pop elements out or put them in (unless you exceed the capacity). It's cache friendly (unlike a linked list). It's LOVELY!
    Here's a merging queue implementation using a HashMap and ArrayDeque combined:
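    Roughly along these lines (again a sketch rather than the original code; it assumes values are never null):

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;

    public class ArrayDequeMergingQueue<K, V> {
        private final Map<K, V> lastValMap = new HashMap<K, V>();
        private final ArrayDeque<K> keys = new ArrayDeque<K>();

        public void offer(K key, V value) {
            // a null previous value means the key is not currently queued
            if (lastValMap.put(key, value) == null)
                keys.offer(key);
        }

        public V poll() {
            K key = keys.poll();
            if (key == null)
                return null;
            // set the entry to null instead of removing it, trading memory
            // (for unbounded key sets) against less GC/rehash churn
            return lastValMap.put(key, null);
        }

        public boolean isEmpty() {
            return keys.isEmpty();
        }
    }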
    You can replace the HashMap with an open address implementation to get more cache friendly behaviour for key collisions if you like, but in the name of KISS we won't go down that particular rabbit hole. As the comment states, setting entries to null rather than removing them is an optimization with a trade off. If your key set is not of a finite manageable range then this is perhaps not the way to go. As it stands it saves you some GC overheads. This optimization is not really open to you with LinkedHashMap where the values and their order are managed as one.
    ArrayDeque is a better performer than any other queue for all the reasons discussed in this StackOverflow discussion, which boil down to:

    • backed by a ring buffer (yes, like the Disruptor! you clever monkeys) 
    • it uses a power of 2 sized backing array, which allows it to replace modulo(%) with a bit-wise and(&) which works because x % some-power-of-2 is the same as x & (some-power-of-2 - 1)
    • adding and removing elements is all about changing the head/tail counters, no copies, no garbage (until you hit capacity).
    • iterating through an array involves no pointer chasing, unlike linked list.


    I like the way you walk, I like the way you talk, Susie Q!

    I'm using a micro benchmarking framework which is both awesome and secret, so sadly the benchmark code is not entirely peer review-able. I will put the benchmark on GitHub when the framework makers give me the go-ahead, which should be soon enough. Here are the benchmarks:

    The results(average over multiple runs) are as follows:
    Experiment                            Throughput           Cost
    array.measureOffer,                   100881.077 ops/msec, 10ns
    array.measureOffer1Poll1,              41679.299 ops/msec, 24ns
    array.measureOffer2Poll1,              30217.424 ops/msec, 33ns
    array.measureOffer2Poll2,              21365.283 ops/msec, 47ns
    array.measureOffer1000PollUntilEmpty,    102.232 ops/msec, 9804ns
    linked.measureOffer,                  103403.692 ops/msec, 10ns
    linked.measureOffer1Poll1,             24970.200 ops/msec, 40ns
    linked.measureOffer2Poll1,             16228.638 ops/msec, 62ns
    linked.measureOffer2Poll2,             12874.235 ops/msec, 78ns
    linked.measureOffer1000PollUntilEmpty,    92.328 ops/msec, 10830ns
    --------

    Interpretation:

    • Offer method cost for both implementations is quite similar at 10ns, with the linked implementation marginally faster perhaps.
    • Poll method cost is roughly 14ns for the array deque based implementation, and 30ns for the linked implementation. Further profiling has also shown that while the deque implementation generates no garbage the linked implementation has some garbage overhead.
    • For my idea of a real world load the array deque is 10% faster.
    Depending on the ratio between offer and poll the above implementation can be quite attractive. Consider for instance that queues tend to be either empty, or to see quite a buildup when a burst of traffic comes in. When you are dealing with relatively little traffic the cost of polling is more significant; when you are merging a large buildup of updates into your queue the offer cost is more important. Luckily this is not a difficult choice as the array deque implementation is only marginally slower for offering and much faster for polling.
    Finally, a small real world gem I hit while writing this blog. When benchmarking the 1k offer/queue drain case for the linked implementation I hit this JVM bug - "command line length affects performance". The way it manifested was bad performance (~50 ops/ms) when running with one set of parameters and much better performance when using some extra parameters to profile GC which I'd have expected to slow things down if anything. It had me banging my head against the wall for a bit, I wrote a second benchmark to validate what I considered the representative performance etc. Eventually I talked to Mr. Shipilev who pointed me at this ticket. I was not suffering the same issue with the other benchmarks, or the same benchmark for the other implementation which goes to show what a slippery sucker this is. The life lesson from this is to only trust what you measure. I can discard the benchmark result if I like, but if you change your command line arguments in a production environment and hit a kink like that you will have a real problem.
    Many thanks to Doug Lawrie with whom I had a discussion about his implementation of a merging event queue (a merging queue stuck on the end of a Disruptor) which drove me to write this post.

    Update 08/03/2013: Just realized I forgot to include a link to the code. Here it is.

    Single Producer/Consumer lock free Queue step by step

    Reading through a fine tuned lock free single producer/consumer queue. Working through the improvements made and their respective impact.
    Back in November, Martin Thompson posted a very nice bit of code on GitHub to accompany his Lock Free presentation. It's a single producer/consumer queue that is very fast indeed. His repository includes the gradual improvements made, which is unusual in software, and offers us some insights into the optimization process. In this post I break down the gradual steps even further, discuss the reasoning behind each step and compare the impact on performance. My fork of his code is to be found here and contains the material for this post and the upcoming post. For this discussion focus on the P1C1QueueOriginal classes.
    The benchmarks for this post were run on a dual socket Intel(R) Xeon(R) CPU E5-2630 @ 2.30GHz machine running Linux and OpenJDK 1.7.09. A number of runs were made to ensure measurement stability and then a representative result from those runs was taken. Affinity was set using taskset. Running on the same core means the 2 threads are pinned to 2 logical cores on the same physical core. Across cores means pinned to 2 different physical cores on the same socket. Across sockets means pinned to 2 different cores on different sockets.

    Starting point: Lock free, single writer principle

    The initial version of the queue, P1C1QueueOriginal1, while very straightforward in its implementation, already offers us a significant performance improvement and demonstrates the important Single Writer Principle. It is worthwhile reading and comparing offer/poll with their counterparts in ArrayBlockingQueue.
    Running the benchmark for ArrayBlockingQueue on same core/across cores/across sockets yields the following result (run the QueuePerfTest with parameter 0):
    same core      - ops/sec=9,844,983
    across cores   - ops/sec=5,946,312 [I got lots of variance on this one, took an average]
    across sockets - ops/sec=2,031,953

    We expect this degrading scaling as we pay more and more for cache traffic. These results are our out-of-the-box JDK baseline.
    When we move on to our first cut of the P1C1 Queue we get the following result (run the QueuePerfTest with parameter 1):
    same core      - ops/sec=24,180,830[2.5xABQ]
    across cores   - ops/sec=10,572,447[~2xABQ]
    across sockets - ops/sec=3,285,411[~1.5xABQ]

    Jumping jelly fish! Off to a good start with large improvements on all fronts. At this point we have gained speed at the expense of limiting our scope from multi producer/consumer to single producer/consumer. To go further we will need to show greater love for the machine. Note that the P1C1QueueOriginal1 class is the same as Martin's OneToOneConcurrentArrayQueue.

    Lazy loving: lazySet replaces volatile write

    As discussed previously on this blog, single writers can get a performance boost by replacing volatile writes with lazySet. We replace the volatile long fields for head and tail with AtomicLong and use get for reads and lazySet for writes in P1C1QueueOriginal12. We now get the following result (run the QueuePerfTest with parameter 12):
    same core      - ops/sec=48,879,956[2xP1C1.1]
    across cores   - ops/sec=30,381,175[3xP1C1.1 large variance, average result]
    across sockets - ops/sec=10,899,806[3xP1C1.1]

    As you may or may not recall, lazySet is a cheap volatile write such that it provides happens-before guarantees for single writers without forcing a drain of the store buffer. This manifests in this case as lower overhead to the thread writing, as well as reduced cache coherency noise as writes are not forced through.
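    To make the substitution concrete, here is a minimal sketch of the pattern (assumed names, not the queue code itself): the single writer publishes via lazySet, readers keep using get.

    import java.util.concurrent.atomic.AtomicLong;

    public class LazyTail {
        private final AtomicLong tail = new AtomicLong();

        // producer only: an ordered store, no full fence/store buffer drain
        public void publish(long newTail) {
            tail.lazySet(newTail);
        }

        // consumer: a regular volatile read
        public long readTail() {
            return tail.get();
        }
    }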

    Mask of power: use '& (power of 2 - 1)' instead of '%'

    The next improvement is replacing the modulo operation for wrapping the array index location with a bitwise mask. This 'trick' is also present in ring buffer implementations, Cliff Click's CHM and ArrayDeque. Combined with the lazySet improvement this version is Martin's OneToOneConcurrentArrayQueue2 or P1C1QueueOriginal2 in my fork.
    The result (run the QueuePerfTest with parameter 2):
    same core      - ops/sec=86,409,484[1.9xP1C1.12]
    across cores   - ops/sec=47,262,351[1.6xP1C1.12 large variance, average result]
    across sockets - ops/sec=11,731,929[1.1xP1C1.12]

    We made a minor trade off here, forcing the queue to have a size which is a power of 2, and we got some excellent mileage out of it. The modulo operator is quite expensive, both in its own latency and in the way it limits instruction throughput, and the mask is a trick worth employing when the opportunity arises.
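    A tiny illustration of the swap (assuming, as the queue does, a capacity which is a power of 2):

    public class IndexMask {
        static final int CAPACITY = 1024;       // must be a power of 2
        static final int MASK = CAPACITY - 1;

        static int slowIndex(long sequence) { return (int) (sequence % CAPACITY); }
        static int fastIndex(long sequence) { return (int) (sequence & MASK); }

        public static void main(String[] args) {
            // run with -ea: for non-negative sequences both forms agree
            for (long s = 0; s < 10000; s++)
                assert slowIndex(s) == fastIndex(s);
        }
    }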
    So far our love for the underlying architecture is expressed by offering cheap alternatives for expensive instructions. The next step is sympathy for the cache line.

    False sharing

    False sharing is described elsewhere (here, and later here, and more recently here where the coming solution via the @Contended annotation is discussed). To summarize, from the CPU cache coherency system's perspective, if a thread writes to a cache line then it 'owns' it; if another thread then needs to write to the same line they need to exchange ownership. When this happens for writes into different locations the sharing is 'falsely' assumed by the CPU and time is wasted on the exchange. The next step of improvement is made by padding the head and tail fields such that they are not on the same cache line in P1C1QueueOriginal21.
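    For illustration, here is a rough sketch of one common padding idea (not the code from the repository): subclass fields are laid out after the parent's, so the padding pushes whatever gets allocated next onto another cache line. Padding both sides takes a deeper class hierarchy, and field reordering across JVM versions can defeat naive schemes, as discussed in a later post here.

    import java.util.concurrent.atomic.AtomicLong;

    // 7 longs of padding, roughly one 64 byte cache line after the inherited value
    class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6, p7;
    }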
    The result (run the QueuePerfTest with parameter 21):
    same core      - ops/sec=88,709,910[1.02xP1C1.2]
    across cores   - ops/sec=52,425,315[1.1xP1C1.2]
    across sockets - ops/sec=13,025,529[1.2xP1C1.2]

    This made less of an impact than previous changes. False sharing is a less deterministic side effect and may manifest differently based on luck-of-the-draw memory placement. We expect code which avoids false sharing to have less variance in performance. To see the variation run the experiment repeatedly; this will result in different memory layouts of the allocated queue.

    Reducing volatile reads

    A common myth regarding volatile reads is that they are free; the next improvement step shows that to be false. In P1C1QueueOriginal22 I have reversed the padding of the AtomicLong (i.e. head and tail are back to being plain AtomicLong) and added caching fields for the last read value of head and tail. As these values are only used from a single thread (tailCache is used by the consumer, headCache by the producer) they need not be volatile. Their only use is to reduce volatile reads to a minimum. Normal reads, unlike volatile reads, are open to greater optimization and may end up in a register; volatile reads are never served from a register (i.e. always from memory).
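    Here is a condensed sketch in the spirit of this step (not the original code) showing how the cached counters keep volatile reads off the fast path:

    import java.util.concurrent.atomic.AtomicLong;

    public class CachedCounterQueue<E> {
        private static final int CAPACITY = 1024;         // power of 2
        private static final int MASK = CAPACITY - 1;
        @SuppressWarnings("unchecked")
        private final E[] buffer = (E[]) new Object[CAPACITY];
        private final AtomicLong tail = new AtomicLong(); // written by producer only
        private final AtomicLong head = new AtomicLong(); // written by consumer only
        private long headCache;                           // producer only, plain field
        private long tailCache;                           // consumer only, plain field

        public boolean offer(E e) {
            final long currentTail = tail.get();
            final long wrapPoint = currentTail - CAPACITY;
            if (headCache <= wrapPoint) {
                headCache = head.get();                   // the only volatile read on this path
                if (headCache <= wrapPoint)
                    return false;                         // genuinely full
            }
            buffer[(int) (currentTail & MASK)] = e;
            tail.lazySet(currentTail + 1);
            return true;
        }

        public E poll() {
            final long currentHead = head.get();
            if (currentHead >= tailCache) {
                tailCache = tail.get();                   // refresh only when we look empty
                if (currentHead >= tailCache)
                    return null;                          // genuinely empty
            }
            final int index = (int) (currentHead & MASK);
            final E e = buffer[index];
            buffer[index] = null;
            head.lazySet(currentHead + 1);
            return e;
        }
    }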
    The result (run the QueuePerfTest with parameter 22):
    same core      - ops/sec=99,181,930[1.13xP1C1.2]
    across cores   - ops/sec=80,288,491[1.6xP1C1.2]
    across sockets - ops/sec=17,113,789[1.5xP1C1.2]

    By Toutatis!!! This one is a cracker of an improvement. Not having to load the head/tail value from memory as volatile reads makes a massive difference.
    The last improvement made is adding the same false sharing guard we had for the head and tail fields around the cache fields. This is required as these are both written at some point and can still cause false sharing, something we all tend to forget can happen to normal fields/data even if they have nothing to do with concurrency. I've added a further implementation P1C1QueueOriginal23 where only the cache fields are padded and not the head and tail. It makes for a slight further improvement, but as the head and tail are still suffering from false sharing it is not a massive step forward.

    All together now!

    The final version P1C1QueueOriginal3 packs together all the improvements made before:
    • lock free, single writer principle observed. [Trade off: single producer/consumer]
    • Set capacity to power of 2, use mask instead of modulo. [Trade off: more space than intended]
    • Use lazySet instead of volatile write.
    • Minimize volatile reads by adding local cache fields. [Trade off: minor size increment]
    • Pad all the hot fields: head, tail, headCache, tailCache [Trade off: minor size increment]
    The result (run the QueuePerfTest with parameter 3):
    same core      - ops/sec=110,405,940[1.33xP1C1.2]
    across cores   - ops/sec=130,982,020[2.6xP1C1.2]
    across sockets - ops/sec=19,338,354[1.7xP1C1.2]

    To put these results in the context of the ArrayBlocking queue:
    same core      - ops/sec=110,405,940[11xABQ]
    across cores   - ops/sec=130,982,020[26xABQ]
    across sockets - ops/sec=19,338,354[9xABQ]

    This is a great improvement indeed, hats off to Mr. Thompson.

    Summary, and the road ahead

    My intent in this post was to give context and add meaning to the different performance optimizations used in the queue implementation. At that I am just elaborating Martin's work further. If you find the above interesting I recommend you attend one of his courses or talks (if you can find a place).
    During the course of running these experiments I encountered great variance between runs, particularly in the case of running across cores. I chose not to explore that aspect in this post and picked representative measurements for the sake of demonstration. To put it another way: your mileage may vary.
    Finally, my interest in the queue implementation was largely as a data structure I could port off heap to be used as an IPC messaging mechanism. I have done that and the results are in my fork of Martin's code here. This post evolved out of the introduction to the post about my IPC queue; it grew to the point where I decided they would be better apart, so here we are. The IPC queue is working and achieves similar throughput between processes as demonstrated above... coming soon.

    135 Million messages a second between processes in pure Java

    Porting an existing single producer/single consumer concurrent queue into an IPC mechanism via memory mapped files and getting 135 million messages throughput in pure Java.
    In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometimes you need to leave your JVM and go out of process. Inter Process Communication (IPC) is a different problem from inter thread communication; can it be cracked by the same approach?

    IPC, what's the problem?

    Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:
    • Faster than socket communication.
    • An out of process integration option with applications written in other languages.
    • A means of splitting large VMs into smaller ones, improving performance by allowing GC and JIT specialization.
    For IPC to be attractive it has to be fast, otherwise you may as well go for network based solutions which would extend beyond your local machine uniformly. I attended an Informatica conference a while back and got talking to Todd Montgomerey about the Disruptor and mechanical sympathy; he suggested that IPC should be able to perform as well as inter thread messaging. I found the idea interesting and originally meant to port the Disruptor, but Martin's queue is simpler (and quicker) so I went for that instead. Starting with a good algorithm/data structure is very good indeed; now I just needed to bridge the gap and see if I could maintain the benefits.

     

    Off the heap we go!

    To do IPC we must go off heap. This has several implications for the queue, most importantly that references are not supported. Also note that data is copied into and out of the queue, though one could extend my implementation to support a zero copy interaction where a struct is acquired, written and committed instead of the offer method, and similarly acquired, read and finally released instead of the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.
    My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory, there is no intention to use the persisted values though further development in that direction may prove interesting to some. So now that I got me some shared memory, I had to put the queue in it.
    I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:
    • aligned access means writing data types to addresses which divide by their size.
    • unaligned access is not atomic, which is bad for concurrency :(
    • unaligned access is slow, which is bad for performance :(
    • unaligned access may not work, depending on OS and architecture. Not working is very bad :(
    Sorting out alignment is not such a big deal once you know how it works. One of the nice things about going off-heap was that solving false sharing became far more straightforward. Move your pointer and you're in the next cache line, job done. This left me rather frustrated with the tricks required to control memory layout in Java. Going back to the original implementation you will notice the Padded classes whose role it is to offer false sharing protection. They are glorious hacks (with all due respect) made necessary by this lack of control. The @Contended annotation coming in JDK 8 will hopefully remove the need for this.
    This is how the memory layout worked out:


    To illustrate in glorious ASCII graphics (each - is a byte), this is what the memory layout looks like when broken into cache lines:
    |--------|--------|--------|head....|--------|--------|--------|--------|
    |--------|--------|--------|tailCach|--------|--------|--------|--------|
    |--------|--------|--------|tail----|--------|--------|--------|--------|
    |--------|--------|--------|headCach|--------|--------|--------|--------|
    |int1int2|int3int4|int5int6|etcetcet|cetcetce|tcetcetc|etcetcet|cetcetce|
    ...



    I played around with mixing off heap counters with an on heap buffer but in the interest of brevity I'll summarize and say the JVM does not like that very much, and the end result performance is not as good as the all-on-heap/all-off-heap solutions. The code is available with everything else.
    Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in its current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it.
    Let me summarize the recipe at this point:
    • Create a memory mapped file large enough to hold:
      • 4 cache lines for counters/cached counters.
      • 4 bytes(per integer) * queue capacity (must be a power of 2).
      • 1 spare cache line to ensure you can align the above to the cache line.
    • Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
    • Steal the address and get the contained aligned byte buffer.
    • Setup pointers to the counters and the beginning of the buffer
    • Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
    • Replace use of array with use of offset pointers into buffer and Unsafe access.
    • Test and debug until you work out the kinks...
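    To make the counter related steps a little more concrete, here is a rough sketch (assumed names and layout, not the original code) of accessing the off heap counters and the int elements via Unsafe:

    import sun.misc.Unsafe;

    public class OffHeapPointers {
        private final Unsafe unsafe;     // obtained via reflection, as shown earlier on this blog
        private final long headAddress;  // aligned address of the 'head' counter
        private final long tailAddress;  // aligned address of the 'tail' counter
        private final long arrayBase;    // address of the first int element
        private final int mask;          // capacity - 1, capacity being a power of 2

        public OffHeapPointers(Unsafe unsafe, long headAddress, long tailAddress,
                               long arrayBase, int capacity) {
            this.unsafe = unsafe;
            this.headAddress = headAddress;
            this.tailAddress = tailAddress;
            this.arrayBase = arrayBase;
            this.mask = capacity - 1;
        }

        long head()           { return unsafe.getLongVolatile(null, headAddress); }
        void head(long value) { unsafe.putOrderedLong(null, headAddress, value); }
        long tail()           { return unsafe.getLongVolatile(null, tailAddress); }
        void tail(long value) { unsafe.putOrderedLong(null, tailAddress, value); }

        int  element(long index)            { return unsafe.getInt(arrayBase + ((index & mask) << 2)); }
        void element(long index, int value) { unsafe.putInt(arrayBase + ((index & mask) << 2), value); }
    }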
    The above code should give you a fair idea how it works out and the rest is here. This queue can work in process and out of process as demonstrated in the tests included in the repository. Now that it works (for the limited use case, and with room for further improvement... but works), is it fast enough? not so fast? is it...<gasp> ... FASTER????!?!?!

    Smithers, release the hounds

    Here are the numbers for using the different implementations in process:

    Implementation/Affinity        Same core   Cross core   Cross socket
    P1C1QueueOriginal3             110M        130M         19M
    P1C1OffHeapQueue               130M        220M         200M
    P1C1QueueOriginalPrimitive     124M        220M         215M


    Confused? Let me explain. The first line is the measurements taken for the original queue. Similar to what was presented in the previous post, though I saw a slight improvement in the results with the compile threshold increased to 100000.
    The second line is my off-heap implementation of the same algorithm. It is significantly faster. This is not IPC yet, this is in process. The reason it is faster is that data is inlined in the queue, which means that by loading an entry in the queue we get the data as opposed to a reference to the data. A reference is what you get when you have an Object[] array: the array holds the references and the data is elsewhere, which seems to become more painful the further we get from the producer.
    The last entry is a mutation of P1C1QueueOriginal3 into a primitive array backed queue to compare performance like for like. As you can see this displays very similar results to the off heap implementation, supporting the theory that data in-lining is behind the observed performance boost.
    The lesson here is an old one, namely that pointer chasing is expensive business, further amplified by the distance between the producing CPU and the consuming CPU.
    The off-heap queue can offer an alternative to native code integration as the consuming thread may interact directly with the off-heap queue and write results back to a different off-heap queue.
    Running a similar benchmark adapted to use a memory mapped file as the backing DirectByteBuffer for the off-heap queue we get:
        same core      - ops/sec=135M
        across cores   - ops/sec=98M
        across sockets - ops/sec=25M



    JOY! A pure Java IPC that gives you 135M messages per second is more throughput than you'd get with most commercial products out there. This is still not as fast as the same queue in process and I admit I'm not sure what the source of the performance difference is. Still, I am quite happy with it.
    A few notes/observations from the experimentation process:
    1. I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
    2. The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
    3. Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
    4. Moving the queue allocation into the test loop changes the performance profile dramatically.
    5. I've not had time to fully explore the size of the queue as a variable in the experiment but the little I've done suggests it makes a difference, choose the right size for your application.
    I realize this post is rather less accessible than the previous one, so if you have any questions please ask.

      Writing Java Micro Benchmarks with JMH: Juicy

      Demonstrating use of JMH and exploring how the framework can squeeze every last drop out of a simple benchmark.
      Writing micro benchmarks for Java code has always been a rather tricky affair with many pitfalls to lookout for:
      • JIT:
        • Pre/Post compilation behaviour: After 10K (default, tuneable via -XX:CompileThreshold) invocations your code will morph into compiled assembly, making it hopefully faster, and certainly different from its interpreted version.
        • Specialisation: The JIT will optimise away code that does nothing (i.e. has no side effects) and will optimise for single interface/class implementations. A smaller code base, like a micro-benchmark, is a prime candidate.
        • Loop unrolling and OSR can make benchmark code (typically a loop of calls to the profiled method) perform differently from how it would in real life.
      • GC effects:
        • Escape analysis may succeed in a benchmark where it would fail in real code.
        • A buildup to a GC might be ignored in a run or a collection may be included.
      • Application/Threading warmup: during initialisation, threading behaviour and resource allocation can lead to significantly different behaviour than steady state behaviour.
      • Environmental variance:
        • Hardware: CPU/memory/NIC etc...
        • OS
        • JVM: which one? running with which flags?
        • Other applications sharing resources

        Here's a bunch of articles on Java micro-benchmarks which discuss the issue further:
        Some of these issues are hard to solve, but some are addressable via a framework and indeed many frameworks have been written to tackle the above.

        Let's talk about JMH

        JMH (Java Micro-benchmarks Harness or Juicy Munchy Hummus, hard to say as they don't tell you on the site) is the latest such framework, and as it comes out of the workshop of the very people who work hard to make the OpenJDK JVM fly, it promises to deliver more accuracy and better tooling than most.
        The source/project is here and you will currently need to build it locally to have it in your maven repository, as per the instructions. Once you've done that you are good to go, and can set yourself up with a maven dependency on it.
        Here is the project I'll be using throughout this post, feel free to C&P to your heart's content. It's a copy of the JMH samples project with the JMH jar built and maven sorted and all that so you can just clone and run without setting up JMH locally. The original samples are pure gold in terms of highlighting the complexities of benchmarking, READ THEM! The command line output is detailed and informative, so have a look to see what hides in the tool box.
        I added my sample on top of the original samples; it is basic (very very basic) in its use of the framework, but the intention here is to help you get started, not drown you in detail, and give you a feel for how much you can get out of it for very little effort. Here goes...

        It's fun to have fun, but you've got to know how

        For the sake of easy comparison and reference I'll use JMH to benchmark the same bit of code I benchmarked with a hand rolled framework here and later on with Caliper here. We're benchmarking my novel way of encoding UTF-8 strings into ByteBuffers vs String.getBytes() vs the best practice recommendation of using a CharsetEncoder. The benchmark compares the 3 methods by encoding a test set of UTF-8 string samples.
        Here's what the benchmark looks like when using JMH:
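        Here is a trimmed down sketch of the shape of such a benchmark class, using the JMH annotations of that era (@GenerateMicroBenchmark has since been renamed @Benchmark); the sample strings and buffer size below are stand-ins, not the actual test set:

        import org.openjdk.jmh.annotations.*;

        import java.nio.ByteBuffer;
        import java.nio.charset.Charset;

        @State(Scope.Thread)
        public class Utf8EncodingBenchmark {
            private String[] strings;
            private ByteBuffer dest;

            @Setup
            public void init() {
                // stand-in samples; the real benchmark loads a set of UTF-8 strings
                strings = new String[]{"hello", "καλημέρα", "こんにちは"};
                dest = ByteBuffer.allocateDirect(4096);
            }

            @GenerateMicroBenchmark
            public int stringGetBytes() {
                int bytes = 0;
                for (String s : strings) {
                    dest.clear();
                    dest.put(s.getBytes(Charset.forName("UTF-8")));
                    bytes += dest.position();
                }
                return bytes; // return a value so the JIT can't discard the work
            }
        }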
        We're using three JMH annotations here:
        1. State - This annotation tells JMH how to share benchmark object state. I'm using the Thread scope which means no sharing is desirable. There are 2 other scopes available: Group (for sharing the state between a group of threads) and Benchmark (for sharing the state across all benchmark threads).
        2. Setup - Much like the JUnit counterpart the Setup annotation tells JMH this method needs to be called before it starts hammering my methods. Setup methods are executed appropriately for your chosen State scope.
        3. GenerateMicroBenchmark - Tells JMH to fry this method with onions.

        A lot of good tricks, I will show them to you, your mother will not mind at all if I do

        To get our benchmarks going we need to run the generated microbenchmarks.jar. This is what we get:

        Nice innit?
        Here's the extra knobs we get on our experiment for our effort:
        • I'm using some command line options to control the number of iterations/warmup iterations, here's the available knobs on that topic:
          • i - number of benchmarked iterations, use 10 or more to get a good idea
          • r - how long to run each benchmark iteration
          • wi - number of warmup iterations
          • w - how long to run each warmup iteration (give ample room for warmup, how much will depend on the code you try and measure, try and have it execute 100K times or so)
        • To choose which benchmarks we want to run we need to supply a regular expression to filter them or ".*" to run all of them. If you can't remember what you packed use:
          • v - verbose run will also print out the list of available benchmarks and which ones were selected by your expression
          • l - to list the available benchmarks
        • If you wish to isolate GC effects between iterations you can use the gc option, this is often desirable to help getting more uniform results.
        • Benchmarks are forked into separate VMs by default. If you wish to run them together add "-f false"
          The output is given for every iteration, then a summary of the stats. As I'm running 3 iterations these are not very informative (this is not recommended practice and was done for the sake of getting sample outputs rather than accurate measurement; I recommend you run more than 10 iterations and compare several JMH runs for good measure) but if I were to run 50 iterations they'd give me more valuable data. We can choose from a variety of output formats to generate graphs/reports later. To get CSV format output add "-of csv" to your command line, which leaves you to draw your own conclusions from the data (no summary stats here):
            The above has your basic requirements from a benchmark framework covered:
            1. Make it easy to write benchmarks
            2. Integrate with my build tool
            3. Make it easy to run benchmarks (there's IDE integration on the cards to make it even easier)
            4. Give me output in a format I can work with
            I'm particularly happy with the runnable jar as a means to packaging the benchmarks as I can now take the same jar and try it out on different environments which is important to my work process. My only grumble is the lack of support for parametrization which leads me to use a system property to switch between the direct and heap buffer output tests. I'm assured this is also in the cards.

            I will show you another good game that I know

            There's even more! Whenever I run any type of experiment the first question is how to explain the results and what differences one implementation has over the other. For small bits of code the answer will usually be 'read the code you lazy bugger', but when comparing 3rd party libraries or when putting large compound bits of functionality to the test, profiling is often the answer, which is why JMH comes with a set of profilers:
            1. gc: GC profiling via standard MBeans
            2. comp: JIT compiler profiling via standard MBeans
            3. cl: Classloader profiling via standard MBeans
            4. hs_rt: HotSpot (tm) runtime profiling via implementation-specific MBeans
            5. hs_cl: HotSpot (tm) classloader profiling via implementation-specific MBeans
            6. hs_comp: HotSpot (tm) JIT compiler profiling via implementation-specific MBeans
            7. hs_gc: HotSpot (tm) memory manager (GC) profiling via implementation-specific MBeans
            8. hs_thr: HotSpot (tm) threading subsystem via implementation-specific MBeans
            9. stack: Simple and naive Java stack profiler

            Covering the lot exceeds the scope of this blog post, so let's focus on the obvious ones that might prove helpful for this experiment. Running with the gc and hs_gc profilers (note: this should be done with a fixed heap for best results, I'm just demonstrating the output here) gives this output:

            The above supports the theory that getBytes() is slower because it generates more garbage than the alternatives, and highlights the low garbage impact of custom/charset encoder. Running with the stack and hs_rt profilers gives us the following output:
            What I can read from it is that getBytes() spends less time in encoding than the other 2 due to the overheads involved in getting to the encoding phase. The custom encoder spends the most time on encoding, but what is significant is that since it outperforms the charset encoder while the ratios are similar, we can deduce that the encoding algorithm itself is faster.

            But that is not all, oh no, that is not all

            The free functionality does not stop here! To quote Tom Waits:
            It gets rid of your gambling debts, it quits smoking
            It's a friend, and it's a companion,
            And it's the only product you will ever need
            Follow these easy assembly instructions it never needs ironing
            Well it takes weights off hips, bust, thighs, chin, midriff,
            Gives you dandruff, and it finds you a job, it is a job
            ...
            'Cause it's effective, it's defective, it creates household odors,
            It disinfects, it sanitizes for your protection
            It gives you an erection, it wins the election
            Why put up with painful corns any longer?
            It's a redeemable coupon, no obligation, no salesman will visit your home
            'What more?' you ask, well... there's loads more functionality around multi threading I will not attempt to try in this post and several more annotations to play with. In a further post I'd like to go back and compare this awesome new tool with the previous 2 variations of this benchmark and see if, how and why results differ...
            Many thanks to the great dudes who built this framework of whom I'm only familiar with Master Shipilev (who also took time to review, thanks again), they had me trial it a few months back and I've been struggling to shut up about it ever since :-)

            Using JMH to Benchmark Multi-Threaded Code

            Exploring some of the multi-threading related functionality in JMH via a benchmark of false shared counters vs. uncontended counters.
            As discussed in the previous post, the JMH framework introduced by the good folk of the OpenJDK offers a wealth of out of the box functionality to micro benchmark writers. So much in fact that a single post is not enough! In this post I'll explore some of the functionality on offer for people writing and measuring multi-threaded code.

            Sharing != Caring?

            The phenomenon I'm putting under the microscope here is the infamous False Sharing. There are many explanations of what False Sharing is and why it's bad to be found elsewhere (Intel, usenix, Mr. T) but to save you from having to chase links here's mine. Here goes:
            In order to maintain memory view consistency across caches, modern CPUs must observe which CPU is modifying which location in memory. 2 CPUs modifying the same location are actually sharing and must take turns in doing so, leading to contention and much back and forth between the CPUs invalidating each other's view of the memory in question. Sadly the protocol used by the CPUs is limited in granularity to cache lines. So when CPU 1 modifies a memory location, the whole cache line (the memory locations to the right and left of the actual location) is considered modified. False sharing occurs when CPU1 and CPU2 are modifying 2 distinct memory locations on the same cache line. They shouldn't interfere with each other, as there is no actual contention, but due to the way cache coherency works they end up uselessly chattering to each other. The image to the right is taken from the Intel link provided above.
            I will be using JMH to demonstrate the impact of false sharing by allocating an array of counters and accessing a counter per thread. I am comparing the difference between using all the counters bunched together and spacing them out such that no false sharing happens. I've repeated the same experiment for plain access, lazySet and volatile access to highlight that false sharing is not limited to memory explicitly involved in multi-threading building blocks.


            Multi-Threaded experiments with JMH

            JMH offers support for running multi-threaded benchmarks, which is a new feature for such frameworks (to the best of my knowledge). I'll cover some of the available annotations and command line options I used for this benchmark. First up, let's look at the code:
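            Here is a sketch of the plain write variant (not the original code; the names and the padding factor are mine): a Benchmark scoped state holds the shared counter array, a Thread scoped state hands each thread its slot, and the two benchmark methods differ only in which slot they bump.

            import org.openjdk.jmh.annotations.*;

            import java.util.concurrent.atomic.AtomicInteger;

            public class FalseSharingBench {
                static final int PADDING = 16; // longs per slot in the unshared case, more than one cache line

                @State(Scope.Benchmark)
                public static class SharedCounters {
                    public final long[] counters = new long[1024 * PADDING];
                }

                @State(Scope.Thread)
                public static class ThreadIndex {
                    static final AtomicInteger THREAD_IDS = new AtomicInteger();
                    int packedIndex;
                    int paddedIndex;

                    @Setup
                    public void init() {
                        int id = THREAD_IDS.getAndIncrement();
                        packedIndex = id;           // adjacent longs -> false sharing
                        paddedIndex = id * PADDING; // each counter on its own cache line
                    }
                }

                @GenerateMicroBenchmark
                public void shared(SharedCounters s, ThreadIndex t) {
                    s.counters[t.packedIndex]++;
                }

                @GenerateMicroBenchmark
                public void unshared(SharedCounters s, ThreadIndex t) {
                    s.counters[t.paddedIndex]++;
                }
            }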

            The lazySet and volatile access versions are very similar using an AtomicLongArray instead of the primitive array used above.
            In the previous post I briefly described the State annotation and made the most basic use of it. Here's the full description from the javadoc (slightly edited to read as one paragraph here):
            State: Marks the state object. State objects naturally encapsulate the state on which benchmark is working on. The Scope of state object defines to which extent it is shared among the worker threads. Three Scope types are supported: 
            • Benchmark: The objects of this scope are always shared between all the threads, and all the identifiers. Note: the state objects of different types are naturally distinct. Setup and TearDown methods on this state object would be performed by one of the worker threads. No other threads would ever touch the state object.
            • Group: The objects of this scope are shared within the execution group, across all the identifiers of the same type. Setup and TearDown methods on this state object would be performed by one of the group threads. No other threads would ever touch the state object.
            • Thread: The objects of this scope are always unshared, even with multiple identifiers within the same worker thread. Setup and TearDown methods on this state object would be performed by a single worker thread exclusively. No other threads would ever touch the state object.
            In the above example I'm exploring the use of scoped state further by utilising both a Benchmark scope state to store the shared array and a Thread scope state to hold the thread counter index for both the false shared and no sharing cases.
            The state classes I define are instantiated by JMH rather than directly by my benchmark class. The framework then passes them as parameters to the methods being benchmarked. The only difference between the methods being the use of a different index for accessing the array of counters. As I add more threads, more instances of the Thread state will materialize, but only a single Benchmark state object will be used throughout.


            Great Expectations

            Before we start looking at what JMH offers us let's consider our expectations of the behaviour under test:
            1. We expect shared and unshared cases to behave the same when there is only a single benchmarking thread.
            2. When there is more than one benchmarking thread we expect the unshared case to outperform the shared case.
            3. We expect the unshared case to scale linearly, where the shared case performance should scale badly.
            We can put our expectations/theory to the test by running the benchmark and scaling the number of threads participating. JMH offers us several ways to control the number of threads allocated to a benchmark:
            • -t 4 - will launch the benchmark with 4 benchmark threads hitting the code.
            • -tc 1,2,4,8 - will replace both the count of iterations and the number of threads by the thread count pattern described. This example will result in the benchmark being run for 4 iterations, iteration 1 will have 1 benchmarking thread, the second will have 2, the third 4 and the last will have 8 threads.
            • -s - will scale the number of threads per iteration from 1 to the maximum set by -t
            I ended up either setting the number of threads directly or using the 'pattern' option. Variety is the spice of life though :-) and it's nice to have the choice.
            False sharing as a phenomenon is a bit random (in Java, where memory alignment is hard to control) as it relies on the memory layout falling a particular way on top of the cache line. Consider 2 adjacent memory locations being written to by 2 different threads: should they be placed such that they are on different cache lines in a particular run, then that run will not suffer false sharing. This means we'll want to get statistics from a fair number of runs to make sure we hit the variations we expect on memory layout. JMH supports running multiple forks of your benchmark by using the -f option to determine the number of forks.

            And the results are?

            Left as an exercise to the reader :-). I implore you to go have a play, come on.
            Here's a high level review on my experimentation process for this benchmark:

            • As sanity check, during development I ran the benchmark locally. My laptop is a dual core MBA and is really not a suitable environment for benchmarking, still the results matched the theory although with massive measurement error:
              • Even at this point it was clear the false sharing is a significant issue. Results are consistent, if unstable in their suggestion of how bad it is. 
              • Also as expected, volatile writes are slower than lazy/ordered writes which are slower than plain writes.
            • Once I was happy with the code I deployed it on a more suitable machine. A dual socket Intel Xeon CPU E5-2630 @ 2.30GHz, using OpenJDK 7u9 on CentOS 6.3 (OS and JDK are 64bit). This machine has 24 logical cores and is tuned for benchmarking (what that means is a post unto itself). First up I thought I'd go wild and use all the cores, just for the hell of it. Results were interesting if somewhat flaky.
              • Using all the available CPU power is not a good idea if you want stable results ;-)
            • I decided to pin the process to a single socket and use all its cores; this leaves plenty of head room for the OS and gives me data on scaling up to 12 threads. Nice. At this point I could have a meaningful run with 1,2,4,8,12 threads to see the way false sharing hurts scaling for the different write methods. At this point things started to get more interesting:
              • Volatile writes suffering from false sharing degrade in overall performance as more threads are added, 12 threads suffering from false sharing achieve less than 1 thread. 
              • Once false sharing is resolved volatile writes scale near linearly until 8 threads, the same is true for lazy and plain. 8 to 12 seem to not scale quite so cleanly.
              • Lazy/Plain writes do not suffer quite as much from false sharing but the more threads are at work the larger the difference (up to 2.5x) becomes between false sharing and unshared.
              Scaling from 1-12 threads
• I focused on the 12 thread case, running longer/more iterations and observing the output; it soon became clear the results for the false-shared access suffer from high run-to-run variance. The reason for that, I assumed, was the different memory layout from run to run resulting in different cache line populations. In the case of 12 threads we have 12 adjacent longs. Given the cache line is 8 longs wide they can be distributed over 2 or 3 lines in a number of ways (2-8-2, 1-8-3, 8-4, 7-5, 6-6). Going ever deeper into the rabbit hole I thought I could print out the thread specific stats (choose -otss, JMH delivers again :-) ). The thread stats did deliver a wealth of information, but at this point I hit slight disappointment... There was no (easy) way for me to correlate thread numbers from the run to thread numbers in the output, which would have made my life slightly easier in characterising the different distributions...
• Eventually I'd had enough and played with the data until I was happy with the results. And I finally found a use for a stacked bar chart :-). Each thread score is a different color, the sum of all thread scores is the overall score.
Here are the results for plain/volatile shared/unshared running on 12 cores (20 runs, 5 iterations per run: -f 20 -i 5 -r 1):
              Plain writes suffering from false sharing. Note the different threads score
              differently and the overall result falls into several different 'categories'.
              Plain writes with no false sharing. Threads score is fairly uniform,
              and overall score is stable.
              Volatile writes suffering from false sharing, note the score varies
              by less than it did in the plain write case but similar categories emerge.
              Volatile writes with no false sharing, similarly uniform behaviour
              for all threads.
              Hope you enjoyed the pictures :-)


              Know Thy Java Object Memory Layout

Sometimes we want to know how an object is laid out in memory; here are 2 use cases and 2 solutions which remove the need to hypothesise.
A few months back I was satisfying my OCD by reading up on Java object memory layout. Now Java, as we all know and love, is all about taking care of such pesky details as memory layout for you. You just leave it to the JVM son, and don't lose sleep over it.
              Sometimes though... sometimes you do care. And when you do, here's how to find out.

              In theory, theory and practice are the same

Here's an excellent article from a few years back which tells you all about how Java should lay out your object. To summarise (a small worked example follows the list):
• Objects are 8-byte aligned in memory (address A is K aligned if A % K == 0)
• All fields are type aligned (long/double is 8 aligned, integer/float 4, short/char 2)
• Fields are packed in the order of their size, except for references which are last
• Fields of different classes are never mixed, so if B extends A, an object of class B will be laid out in memory with A's fields first, then B's
• Subclass fields start at a 4-byte alignment
• If the first field of a class is a long/double and the class starting point (after the header, or after the super class fields) is not 8 aligned then a smaller field may be swapped in to fill the 4-byte gap.
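A small worked example (my own hypothetical class; exact offsets depend on the JVM, this assumes a 64-bit HotSpot with compressed oops and a 12-byte header):

public class LayoutExample {
    byte flag;      // smallest primitive, packed after the ints
    long total;     // must be 8-byte aligned
    int count;      // may be pulled forward to fill the gap after the 12-byte header
    Object ref;     // references go last
}
// One plausible layout per the rules above:
//  0-12 : header
// 12-16 : int count      (fills the gap so that 'total' can be 8 aligned)
// 16-24 : long total
// 24-25 : byte flag
// 25-28 : padding
// 28-32 : Object ref     (4 bytes with compressed oops)
// 32    : object end (8 aligned)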
              The reasons why the JVM doesn't just plonk your fields one after the other in the order you tell it are also discussed in the article, to summarise:
              • Unaligned access is bad, so JVM saves you from bad layout (unaligned access to memory can cause all sorts of ill side effects, including crashing your process on some architectures)
• Naive layout of your fields would waste memory; the JVM reorders fields to reduce the overall size of your object
              • JVM implementation requires types to have consistent layout, thus requiring the sub class rules
              So... nice clear rules, what could possibly go wrong?

              False False Sharing Protection

              For one thing, the rules are not part of the JLS, they are just implementation details. If you read Martin Thompson's article about false sharing you'll notice Mr. T had a solution to false sharing which worked on JDK 6, but no longer worked on JDK 7. Here are the 2 versions:

It turns out the JVM changed the way it orders the fields between 6 and 7, and that was enough to break the spell. In fairness there is no rule specified above which requires the field order to correlate with the order in which the fields were defined, but... it's a lot to worry about and it can trip you up.
Just as the above rules were still fresh in my mind, LMAX (who kindly open sourced the Disruptor) released the Coalescing Ring Buffer. I read through the code and came across the following:

I approached Nick Zeeb on the blog post which introduced the CoalescingRingBuffer and raised my concern that the fields accessed by the producer/consumer might be suffering from false sharing. Nick's reply:
              I’ve tried to order the fields such that the risk of false-sharing is minimized. I am aware that Java 7 can re-order fields however. I’ve run the performance test using Martin Thompson’s PaddedAtomicLong instead but got no performance increase on Java 7. Perhaps I’ve missed something so feel free to try it yourself.
Now Nick is a savvy dude, and I'm not quoting him here to criticise him. I'm quoting him to show that this is confusing stuff (so in a way, I quote him to comfort myself in the company of other equally confused professionals). How can we know? Here's one way I thought of after talking to Nick:

Using Unsafe I can get the field offset from the object reference; if 2 fields are less than a cache line apart they can suffer from false sharing (depending on the end location in memory). Sure, it's a bit of a hackish way to verify things, but it can become part of your build so that in the case of version changes you won't get caught out.
              Note that it's the runtime which will determine the layout, not the compiler, so if you build on version X and run on Y it won't help you much.
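Something along these lines (a sketch, not the original gist; ExampleQueue and its field names are placeholders, and 64 bytes is an assumed cache line size):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class FieldDistanceCheck {
    // placeholder class standing in for whatever you want to verify
    static class ExampleQueue {
        volatile long producerIndex;
        volatile long consumerIndex;
    }

    public static void main(String[] args) throws Exception {
        Unsafe unsafe = getUnsafe();
        long p = unsafe.objectFieldOffset(ExampleQueue.class.getDeclaredField("producerIndex"));
        long c = unsafe.objectFieldOffset(ExampleQueue.class.getDeclaredField("consumerIndex"));
        long distance = Math.abs(p - c);
        // assumed cache line size; 64 bytes on most current x86 hardware
        if (distance < 64) {
            throw new AssertionError("Fields are " + distance + " bytes apart, false sharing is possible");
        }
    }

    static Unsafe getUnsafe() throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }
}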
              Enough of that false sharing thing... so negative... why would you care about memory layout apart from false sharing? Here's another example.

              The Hot Bunch

Through the blessings of the gods, at about the same time LMAX released the CoalescingRingBuffer, Gil Tene (CTO of Azul) released HdrHistogram. Now Gil is seriously, awesomely bright and knows more about JVMs than most mortals (here's his InfoQ talk, watch it), so I was keen to look into his code. And what do you know, a bunch of hot fields:

What Gil is doing here is good stuff: he's trying to get related fields to huddle together in memory, which will improve the likelihood of them ending up on the same cache line, saving the CPU a potential cache miss. Sadly the JVM has other plans...
So here is another tool to add to your tool belt to help make sense of your memory layout: Java Object Layout. I bumped into it by accident, not while obsessing about memory layout at all. Here's the output for Histogram:

Note how histogramData jumps to the bottom and subBucketMask is moved to the top, breaking up our hot bunch. The solution is ugly but effective: move all fields but the hot bunch to an otherwise pointless parent class:
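The shape of the fix is roughly this (a simplified sketch, not the actual HdrHistogram code; only the field names mentioned above are taken from the real class, the rest are illustrative):

// cold fields parked on a parent class so they don't split the hot bunch
class ColdHistogramFields {
    long totalCount;
    Object histogramData;   // the field that jumped to the bottom in the layout above
}

class Histogram extends ColdHistogramFields {
    // the hot bunch: fields used together on the recording fast path stay adjacent,
    // because subclass fields are never mixed with the parent's fields
    int subBucketHalfCountMagnitude;
    int subBucketHalfCount;
    long subBucketMask;     // the field that was moved to the top in the layout above
    long[] counts;
}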

              And the new layout:

              Joy! I shall be sending Mr. Tene a pull request shortly :-)

              Java Concurrent Counters By Numbers

              It is common practice to use AtomicLong as a concurrent counter for reporting such metrics as number of messages received, number of bytes processed etc. Turns out it's a bad idea in busy multi-threaded environments... I explore some available/up-and-coming alternatives and present a proof of concept for another alternative of my own.
              When monitoring an application we often expose certain counters via periodical logging, JMX or some other mechanism. These counters need to be incremented and read concurrently, and we rely on the values they present for evaluating load, performance or similar aspects of our systems. In such systems there are often business threads working concurrently on the core application logic, and one or more reporting/observing threads which periodically sample the event counters incremented by the business threads.
Given the above use case, let's explore our options (or if you are the impatient kind skip to the summary and work your way backwards).

              Atomic Mother: too much love

              We can use AtomicLong, and for certain things it's a damn fine choice. Here's what happens when you count an event using AtomicLong:
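The increment boils down to the JDK's CAS retry loop; AtomicLong.incrementAndGet in JDK7 reads roughly like this (comments map to the steps below):

public final long incrementAndGet() {
    for (;;) {
        long current = get();                 // 1. volatile read of the current value
        long next = current + 1;              // 2. increment
        if (compareAndSet(current, next))     // 3. compare and swap current with next
            return next;                      // 4. success; otherwise loop back to 1
    }
}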
              This is a typical CAS retry loop. To elaborate:

              1. Volatile read current value
              2. Increment
              3. Compare and swap current with new
              4. If we managed to swap, joy! else go back to 1
              Now imagine lots of threads hitting this code, how will we suffer?
              • We could loop forever (in theory)!  
• We are bound to experience the evil step sister of false sharing, which is real sharing. All writer threads are going to be contending on this memory location, and thus on one cache line, digging their heels in until they get what they want. This will result in much cache coherency traffic.
• We are using CAS (translates to lock cmpxchg on Intel, read more here) which is a rather expensive instruction (see the wonderful Agner reference tables). This is much like a volatile write in essence.
              But the above code also has advantages:
• As it says on the tin, this is an atomic update of the underlying value. The counter will pass through every sequential value, and execution only continues once our increment has taken effect.
• It follows that values returned by this method are unique.
• The code is readable, the complex underside neatly cared for by the JMM (Java Memory Model) and a CAS.
              There's nothing wrong with AtomicLong, but it seems like it gives us more than we asked for in this use case, and thus delivers less in the way of writer performance. This is your usual tradeoff of specialisation/generalisation vs performance. To gain performance we can focus our requirements a bit:
              1. Writer performance is more important than reader performance: We have many logic threads incrementing, they need to suffer as little as possible for the added reporting requirement.
              2. We don't require a read for each write, in fact we're happy with just incrementing the counters: We can live without the atomic joining of the inc and the get.
              3. We're willing to sacrifice the somewhat theoretical accuracy of the read value if it makes things better: I say theoretical because at the end of the day if you are incrementing this value millions of times per second you are not going to report on each value and if you try and capture the value non-atomically then the value reported is inherently inaccurate.
              The above use case is in fact benchmarked in the Asymmetric example bundled with JMH. Running a variation with N writers and one reader we get the following (code is here):
AtomicCounterBenchmark.rw:get(N=1)  - 4.149 nsec/op
AtomicCounterBenchmark.rw:inc(N=1)  - 105.439 nsec/op
AtomicCounterBenchmark.rw:get(N=11) - 43.421 nsec/op
AtomicCounterBenchmark.rw:inc(N=11) - 748.520 nsec/op
AtomicCounterBenchmark.rw:get(N=23) - 115.268 nsec/op
AtomicCounterBenchmark.rw:inc(N=23) - 2122.446 nsec/op

Cry, beloved writers! The reader pays nothing while our important business threads thrash about! If you are the sort of person who finds joy in visualising concurrent behaviour, think of 11 top paid executives frantically fighting over a pen to fill in an expenses report to be read by some standards committee.

              WARNING: Hacks & Benchmarks ahead!

              In order to make the JDK8 classes used below run on JDK7 and in order to support some of my own code used below I had to run the benchmarks with my own Thread class. This led to the code included being a hack of the original JDK8 code. A further hack was required to make JMH load my thread factory instead of the one included in JMH. The JMH jar included is my hacked version and the changed class is included in the project as well. All in the name of experimentation and education...
The benchmarks were all run on an Intel Xeon dual socket, 24 logical core machine. Because of the way JMH groups work (you have to specify the exact number of threads for asymmetric tests) I benchmarked on the above setup and did not bother going further (1/11/23 writers to 1 reader). I did not benchmark situations where the number of threads exceeds the number of cores. Your mileage may wildly vary etc, etc, etc. The code is on GitHub, as are the results summary and notes on how the benchmarks were run.

              The Long Adder

              If ever there was a knight in shining armour for all things java concurrency it is Doug Lea, and Doug has a solution to the above problem coming to you in JDK 8 (or sooner if you want to grab it and use it now) in the form of LongAdder (extends Striped64). LongAdder serves the use case illustrated above by replacing the above increment logic with something a bit more involved:
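The gist isn't embedded here, but the add(long) method being picked apart below reads roughly as follows in JDK8 (Cell, base, casBase and longAccumulate are internals inherited from Striped64):

public void add(long x) {
    Cell[] as; long b, v; int m; Cell a;
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        boolean uncontended = true;
        if (as == null || (m = as.length - 1) < 0 ||
            (a = as[getProbe() & m]) == null ||
            !(uncontended = a.cas(v = a.value, v + x)))
            longAccumulate(x, null, uncontended);
    }
}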

Well... a lot more involved really. 'Whatup Doug?' you must be asking yourself (I assume, maybe it makes instant sense to you, in which case shut up smart asses). Here's a quick breakdown:
              Look! No contention!
• LongAdder has a field called cells which is an array of Cells whose length is a power of 2; each Cell is a volatile long padded against false sharing (the padding counts on field order, they should probably have used inheritance to enforce that, like JMH's BlackHole). Kudos to Doug for the padded cells pun!
• LongAdder has a field called base which is a volatile long (no padding here).
              • if ((as = cells) != null || !casBase(b = base, b + x)) - Load the volatile cells field into a local variable. If it's not null keep going. If it is null try and CAS the locally read value of base with the incremented value. In other words, if we got no cells we give the old AtomicLong method a go. If there's no contention this is where it ends. If there is contention, the plot thickens...
              • So we've either hit contention now or before, our next step is to find our thread's hash, to be used to find our index in the cells array.
• if (as == null || (m = as.length - 1) < 0 || (a = as[getProbe() & m]) == null || !(uncontended = a.cas(v = a.value, v + x))) - a big boolean expression, innit? Let's elaborate:
  • as == null - as is the local cells copy; if it's null we ain't got no cells.
                • (m = as.length - 1) < 0 - we have a cells array, but it's empty. Compute mask while we're here.
                • (a = as[getProbe() & m]) == null - get the cell at our thread index (using the old mask trick to avoid modulo), check if it's null.
  • !(uncontended = a.cas(v = a.value, v + x)) - we got a cell and it's not null, try and CAS its value to the new value.
• So... we either have no cell for this thread, or we failed to CAS the value in our cell; let's call longAccumulate(x, null, uncontended), where the plot thickens
              If you thought the above was complex, longAccumulate will completely fry your noodle. Before I embark on explaining it, let us stop and admire the optimisations which go into each line of code above, in particular note the careful effort made to avoid redundant reads of any data external to the function and the need to cache locally all values of volatile reads.
              So... what does longAccumulate do? The code explains itself:

              Got it? Let's step through (I'm not going to do line by line for it, no time):
• Probe is a new field on Thread, put there to support classes which require Thread id hashing. It's initialised/accessed via ThreadLocalRandom and used by ForkJoinPool and ConcurrentHashMap (this is JDK8, none of it happens yet in JDK7 land). getProbe steals it directly from Thread using Unsafe and a field offset, cheeky! In my version I added a field to my CounterThread to support this.
              • If there's no array we create it. The cells array creation/replacement is a contention point shared between all threads, but should not happen often. We use cellsBusy as a spin lock. The cells array is sort of a copy on write array if you will, but the write spin lock is not the array, it's cellsBusy.
              • If there's no cell at our index we create one. We use cellsBusy to lock for writing to the table.
• If we contend on the cell we expand the cells array; the array is only expanded if it's smaller than the number of available CPUs.
              • If CAS on the cell we have fails we modify our probe value in hope of hitting an empty/uncontended slot.
The above design (as I read it) is about compromise between contention, CPU cycles and memory. We could for instance simplify it by allocating all the cells upfront and setting the number of cells to the number of CPUs (nearest power of 2 >= NCPU to be exact). But imagine this counter on a 128 CPU beast: we've just allocated 128 * (56 * 2 + 8 + change) bytes -> over 15k for a counter that may or may not be contended. We pay for our frugal approach every time we hit contention between threads. On the plus side, every time we grow the cells array the likelihood of collision goes down. To further help things settle, threads will change cells to eventually find less contended cells if their probe value is an issue.
Here's a discussion between Doug and Nathan Reynolds on the merits of having per-CPU counters which may help shed further light on the design. I got the feeling from reading it that the cells are an abstraction of per-CPU counters, and given no facility in the language to discover the processor ID on the fly it seems like a good approach. Doug also mentions @Contended as a means to replace the manual padding, which is probably on its way.
              I can think of 2 ways to improve the above slightly:
1. Use an array of longs instead of cells, padding by leaving slots of the array empty. This is essentially the approach taken by Cliff Click here: ConcurrentAutoTable (one reviewer pointed out the CAT vs Doug pun...). This should reduce the space requirement (only need half the padding) and will inline the values in the array (one less pointer to chase). If we had value type/struct array support in Java we wouldn't need this trick. The even stride in memory should also improve the read performance.
2. Under the assumption of cells correctly reflecting the CPU affinity of threads we could co-host counters in the same cell. Of course if the same cell is hit from several processors we've just re-introduced false sharing :(... maybe hold off on that one until we have a processor ID.
              Assuming all threads settle on top of their respective CPUs and no contention happens, this implementation is still bound by doing volatile writes and the logistics surrounding the cell acquisition.
              How does it perform (benchmark code is here)?
              LongAdderCounterBenchmark.rw:get(N=1)  - 10.600 nsec/op
              LongAdderCounterBenchmark.rw:inc(N=1)  - 249.703 nsec/op
              LongAdderCounterBenchmark.rw:get(N=11) - 752.561 nsec/op
              LongAdderCounterBenchmark.rw:inc(N=11) - 44.696 nsec/op
              LongAdderCounterBenchmark.rw:get(N=23) - 1380.936 nsec/op
              LongAdderCounterBenchmark.rw:inc(N=23) - 38.209 nsec/op

Nice! Low cost updates and high (but acceptable) read cost. This better reflects our use case, and if you are implementing performance counters in your application you should definitely consider using a back-ported LongAdder. This is the approach taken in the Yammer Metrics project. Their Counter holds a back-ported LongAdder whose performance is (code here):
              LongAdderBackPortedCounterBenchmark.rw:get(N=1)  - 5.854 nsec/op
              LongAdderBackPortedCounterBenchmark.rw:inc(N=1)  - 192.414 nsec/op
              LongAdderBackPortedCounterBenchmark.rw:get(N=11) - 704.011 nsec/op
              LongAdderBackPortedCounterBenchmark.rw:inc(N=11) - 40.886 nsec/op
              LongAdderBackPortedCounterBenchmark.rw:get(N=23) - 1188.589 nsec/op
              LongAdderBackPortedCounterBenchmark.rw:inc(N=23) - 38.309 nsec/op

Perhaps because of the changes between the 2 versions, or because LongAdder was designed to run on a JDK8 VM, the results favour the backport over the JDK8 version when running on a JDK7 install. Note that the 1 writer/reader case performs badly, far worse than AtomicLong. On the high contention path this reverts to expected behaviour.

For completeness I tested how Cliff's CAT (I took the code from the link provided earlier) performs (here's the benchmark). CAT has a nice _sum_cache field which is used to avoid reading the table if it has not been invalidated. This saves the reader a lot of work at theoretically little cost to the writers. The results I got running the benchmark varied wildly, with get() looking great when inc() wasn't and get() performing very badly when inc() performed superbly. I suspect the shortcut above to be at the heart of this instability, but have not had time to get to the root of it. I exclude the results as they varied too wildly from run to run to be helpful.

              Why Should You Care?

Why indeed? Why are Doug and Cliff worried? If you've watched the Future Of The VM talk featuring those very same dudes, you'd have noticed the future has many many many CPUs in it. And if we don't want to hit a wall with our data structures, which worked so well back when we had only 4 CPUs to worry about, then we need to get cracking on writing some mighty parallel code :-)
And if it turns out your performance counters are in fact the very reason you can't perform, wouldn't that be embarrassing? We need to be able to track our systems' vital stats without crippling those systems. Here is a paper from Mr. Dave Dice (follow his blog, great stuff) and friends looking at scalable performance counters. Let me summarise for my lazier readers:
              • "commonly-used naive concurrent implementations quickly become problematic, especially as thread counts grow, and as they are used in increasingly Non-Uniform Memory Access (NUMA) systems." - The introduction goes on to explore AtomicLong type counters and non-threadsafe hacks which lose updates. Both are unacceptable. Using LOCK ADD  will not save you - "Experiments confirm our intuition that they do not adequately addresses the shortcomings of the commonly used naive counters. "
              • Section 2.1 "Minimal Space Overhead" while interesting is not really an option for a Java hacker as it requires us to know which NUMA node we are on. In future, when processor ID is available to us we may want to go back to that one. Further more conflicts are resolved by letting updaters wait.
              • Section 2.2 "Using A little More Space" outlines something similar to LongAdder above, but again using NUMA node IDs to find the slot and contend locally on it. It then boils down to these implementations - "suffer under high contention levels because of contention between threads on the same node using a single component." - not surprising given that a NUMA node can have 12 logical cores or more hitting that same component, CASing away.
              • "won't be nothing you can't measure anymore"
              • Section 3 onward are damn fascinating, but are well beyond the scope of this humble post. Dice and friends go on to explore statistical counters which using fancy maths reduce the update rate while maintaining accuracy. If you find 'fancy maths' unsatisfying, JUST READ IT!!! The end approach is a combination of statistical counters and the original LongAdder style counters. It is balancing CPU cycles vs. the cost of contention for the sake of memory efficiency and finding the right balance.
              The reason I mention this article is to highlight the validity of the use case. Writes MUST be cheap, reads are less of a worry. The more cores are involved, the more critical this issue becomes. The future, and in many ways the present on the server side, is massively parallel environments. We still need counters in these environments, and they must perform. Can we get better performance if we were to give up some of the features offered by LongAdder?

              TLC: Love does not stand sharing

What the LongAdder is trying to avoid, and the above article is sidestepping as a viable option, is having a counter copy per thread. If we could have ProcessorLocal counters then that would probably be as far as we need to go, but in the absence of that option should we not consider per-thread counters? The downside, as one would expect, is memory consumption, but if we weigh the padding against the duplicates then there are use cases where per-thread counters may prove more memory efficient. The upside is that we need not worry about contention, and can replace the costly CAS with the cheap lazySet (see previous post on the topic). Even if we pad, there are ways to minimize the padding such that all counters share a given thread's padding. If you work in an environment where you typically have fixed-size thread pools executing particular tasks, and performance is critical to what you do, then this approach might be a good fit.
              The simplest approach is to use ThreadLocal as a lookup for the current thread view of the shared counter and have get() sum up the different thread copies. This ThreadLocalCounter (with padding) scales surprisingly well:
              ThreadLocalCounterBenchmark.rw:get(N=1)  - 61.563 nsec/op
              ThreadLocalCounterBenchmark.rw:inc(N=1)  - 15.138 nsec/op
              ThreadLocalCounterBenchmark.rw:get(N=11) - 725.532 nsec/op
              ThreadLocalCounterBenchmark.rw:inc(N=11) - 12.898 nsec/op
              ThreadLocalCounterBenchmark.rw:get(N=23) - 1584.209 nsec/op
              ThreadLocalCounterBenchmark.rw:inc(N=23) - 16.384 nsec/op
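The idea in a minimal sketch (my own, not the linked implementation; the padded cell and list management are deliberately simplistic):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadLocalCounterSketch {
    // padded counter cell; one side padded, counting on cells being allocated apart
    static class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6 = 7;
    }

    private final List<PaddedAtomicLong> all = new CopyOnWriteArrayList<PaddedAtomicLong>();
    private final ThreadLocal<PaddedAtomicLong> local = new ThreadLocal<PaddedAtomicLong>() {
        @Override
        protected PaddedAtomicLong initialValue() {
            PaddedAtomicLong cell = new PaddedAtomicLong();
            all.add(cell); // thread joins the pool of counters on first use
            return cell;
        }
    };

    public void inc() {
        PaddedAtomicLong cell = local.get();
        cell.lazySet(cell.get() + 1); // single writer per cell, lazySet is enough
    }

    public long get() {
        long sum = 0;
        for (AtomicLong cell : all)
            sum += cell.get();
        return sum;
    }
}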

              A room of one's own
There are minor details worth considering here, such as how ThreadLocal copies are added to the pool of counters and how to deal with dead threads. While joining is easily done via ThreadLocal.initialValue, there is sadly no cleanup callback to utilise on Thread termination, which means dead counters can collect in the pool. In an environment in which threads live short and careless lives only to be replaced by equally foolish threads this approach will not scale, as counters are left stuck in the array. A simple solution here is to have the pool check for dead counters when new threads join. Counters may still be left in the pool long after the Thread died, but it will do a rudimentary check for leftovers on each thread added. Another solution is to do the check on each get(), but this will drive the cost of get() up. Yet another approach is to use WeakReferences to detect counters dropping off. Pick whichever suits you; you could even check on each get() if it's seldom enough. This post is getting quite long as is, so I'll skip the code, but here it is if you want to read it.
For all its simplicity the above code delivers triple the performance of LongAdder for writers, at the cost of allocating counters per thread. With no contention LongAdder will deliver the cost of a single AtomicLong, while the TLC will cost the same as it does under contention. The get() performance is similar to LongAdder, but note that, contended or not, the get() performance of the TLC gets worse with every thread added, where LongAdder will only deteriorate with contention, and only up to the number of CPUs.

              Become a member today!

A more specific solution requires us to add a counter field to our thread. This saves us the lookup cost, but may still leave us exposed to false sharing should fields on our thread be updated from another thread (read this concurrency-interest debate on making Thread @Contended). Here's how this one performs (benchmark is here):
              ThreadFieldCounterBenchmark.rw:get(N=1)  - 25.264 nsec/op
              ThreadFieldCounterBenchmark.rw:inc(N=1)  - 3.723 nsec/op
              ThreadFieldCounterBenchmark.rw:get(N=11) - 483.723 nsec/op
              ThreadFieldCounterBenchmark.rw:inc(N=11) - 3.177 nsec/op
              ThreadFieldCounterBenchmark.rw:get(N=23) - 952.433 nsec/op
              ThreadFieldCounterBenchmark.rw:inc(N=23) - 3.045 nsec/op

              This is 10 times faster than LongAdder on the inc(), and 700 times faster than AtomicLong when the shit hits the fan. If you are truly concerned about performance to such an extreme this may be the way to go. In fact this is exactly why ThreadLocalRandom is implemented using a field on Thread in JDK8.
              A cooking pun!
Now, cooking your own Thread classes is certainly not something you can rely on in every environment. Still, it may prove an interesting option for those with more control over their execution environments. If you are at liberty to add one field you can add a few: bunch the lot together with no padding between them on your thread, and pad the tail to be safe. As they are all written to from the same thread there is no contention to worry about, and you may end up using less space than the LongAdder.
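A minimal sketch of the thread-field approach (my own illustration, not the benchmark's CounterThread; the volatile write could be replaced with an ordered/lazy write for an even cheaper inc):

public class CounterThreadSketch extends Thread {
    // written only by this thread, read by the reporting thread
    volatile long counter;

    public CounterThreadSketch(Runnable target) {
        super(target);
    }

    // called from within the owning thread; single writer, so no lost updates
    public static void inc() {
        ((CounterThreadSketch) Thread.currentThread()).counter++;
    }
}
// The reader keeps references to all CounterThreadSketch instances and sums their counter fields.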

              Summary

For all counters the results reflect 1/11/23 writers busily updating a counter while a single reader thread reads:

• AtomicLong -> One counter for all threads -> fast get(4ns/43ns/115ns) but slow inc(105ns/748ns/2122ns). inc() gets slower as contention grows. Least memory required.
• LongAdder -> A table of counters -> slow get(10ns/752ns/1380ns) but faster inc(105ns/40ns/38ns). get() gets slower as contention grows, inc() is slow when no cells are present, but settles on a fast figure once it gets going. Memory consumption increases with contention, but there is little overhead with no contention. Cells are padded, so each added cell has 122b of overhead.
• ThreadLocalCounter -> Every thread has a counter in its ThreadLocal map -> slow get(61ns/725ns/1584ns) but even faster inc(15ns/12ns/16ns). get() gets slower as the number of threads grows, inc() is pretty constant and low cost. Memory consumption increases with every thread added, regardless of contention. Counters are padded, so each has 122b of overhead. It is suggested you use a shared counters class to amortise the padding cost.
• ThreadFieldCounter -> Every thread has a counter field -> get(25ns/483ns/952ns) is slower than AtomicLong, but better than the rest. The fastest inc(4ns/3ns/3ns). get() gets slower as the number of threads grows, inc() is constant and low cost. Memory consumption increases with every thread added, regardless of contention. Counters are not padded but you may want to pad them.

I believe that for certain environments the counter-per-thread approach may prove a good fit. Furthermore, for highly contended counters, where LongAdder and the like will expand, I believe this approach can be extended to provide a more memory efficient alternative when the thread sets are fixed or limited and a group of counters is required rather than just one counter.

              A Java Ping Buffet

              Buffet pings
When considering latency/response time in the context of client/server interactions I find it useful to measure the baseline, or no-op, round trip between them. This gives me a real starting point for appreciating the latency 'budget' available to me. This blog post presents an open source project offering several flavours of this baseline measurement for server connectivity latency.

              How did it come about?

              I put the first variation together a while back to help review baseline assumptions on TCP response time for Java applications. I figured it would be helpful to have a baseline latency number of the network/implementation stack up to the application and back. This number may seem meaningless to some, as the application does nothing much, but it serves as a boundary, a ballpark figure. If you ever did a course about networking and the layers between your application and the actual wire (the network stack), you can appreciate this measurement covers a round trip from the application layer and back.
              This utility turned out to be quite useful (both to myself and a few colleagues) in the past few months. I tinkered with it some more, added another flavour of ping, and another. And there you have it, a whole bloody buffet of them. The project now implements the same baseline measurement for:
              • TCP
                • Busy spin on non-blocking sockets
                • Selector busy spin on selectNow
                • Selector blocking on select
                • Blocking sockets
• Old IO sockets are not covered (maybe later)
              • UDP
                • Busy spin on non-blocking sockets
• All the other cases covered for TCP are not covered (maybe later)
              • IPC via memory mapped file
This code is not entirely uniform and I beg your forgiveness (and welcome your criticism) if it offends your sensibilities. The aim was simplicity and little enough code that it needs little in the way of explaining. All it does is ping, and measure. All measurements are in nanoseconds (thanks Ruslan for pointing out the omission).
              The original TCP spinning client/server code was taken from one of Peter Lawrey's examples, but it has been mutilated plenty since, so it's not really his fault if you don't like it. I also had great feedback and even some code contribution from Darach Ennis. Many thanks to both.
              My mother has a wicked back hand

              Taking the code for a drive

              Imagine that you got some kit you want to measure Java baseline network performance on. The reality of these things is that the performance is going to vary for JVM/OS/Hardware and tuning for any and all of the ingredients. So off you go building a java-ping.zip (ant dist), you copy it onto your server/servers of choice and unzip (unzip java-ping.zip -d moose). You'll find the zip is fairly barebones and contains some scripts and a jar. You'll need to make the scripts runnable (chmod a+x *.sh). Now assuming you have Java installed you can start the server:
              $ ./tcp-server.sh spin &
              And then the client:
              $ ./tcp-client.sh spin 

              And in lovely CSV format the stats will pour into your terminal, making you look busy.
              Min,50%,90%,99%,99.9%,99.99%,Max
              6210,6788,9937,20080,23499,2189710,46046305
              6259,6803,7464,8571,10662,17259,85020
              6275,6825,7445,8520,10381,16981,36716
              6274,6785,7378,8539,10396,16322,19694
              6209,6752,7336,8458,10381,16966,55930
              6272,6765,7309,8521,10391,15288,6156039
              6216,6775,7382,8520,10385,15466,108835
              6260,6756,7266,8508,10456,17953,63773
              Using the above as a metric you can fiddle with any and all the variables available to you and compare before/after/this/that configuration.
In the previous post on this utility I covered the variance you can observe for taskset versus roaming processes, so I won't bore you with it again. All the results below were acquired while using taskset. For IPC you'll get better results when pinning to the same core (different threads) but a worse tail. For TCP/UDP the best results I observed were across different cores on the same socket. If you are running across 2 machines then ignore the above and pin as makes sense to you (on NUMA hardware the NIC can be aligned to a particular socket, have fun).
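For example, pinning the server and client to different cores on the same socket might look like this (the core numbers are arbitrary, adjust to your topology):
$ taskset -c 2 ./tcp-server.sh spin &
$ taskset -c 4 ./tcp-client.sh spin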
The tool allows further tweaking of whether or not it will yield when busy-spinning (-Dyield=true) and of adding a wait between pings (-DwaitNanos=1000). These are provided to give you a flavour of what can happen as you relax the hot loops into something closer to a back-off strategy, and as you let the client/server 'drift in their attention'.

              Observing the results for the different flavours

The keen observer will notice that average latency is not reported. Average latency is not latency. Average latency is just TimeUnit/throughput. If you have a latency SLA you should know that. An average is a completely inappropriate tool for measuring latency. Take for example the case where half your requests take 0.0001 millis and half take 99.9999 millis: how is the average latency of 50 millis useful to you? Gil Tene has a long presentation on the topic which is worth a watch if the above argument is completely foreign to you.
The results are a range of percentiles; it's easy enough to add further analysis as all the observed latencies are recorded (all numbers are in nanoseconds). I considered using a histogram implementation (like the one in the Disruptor, or HdrHistogram) but decided it was better to stick to the raw data for something this small and focused. This way no precision is lost, at the cost of a slightly larger memory footprint. This is not necessarily appropriate for every use case.
              Having said all that, here is a sample of the results for running the code on semi-respectable hardware (all runs are pinned using taskset, all on default settings, all numbers are in nanoseconds):
              Implementation, Min,   50%,   90%,   99%,   99.9%, 99.99%,Max
              IPC busy-spin,  89,    127,   168,   3326,  6501,  11555, 25131
              UDP busy-spin,  4597,  5224,  5391,  5958,  8466,  10918, 18396
              TCP busy-spin,  6244,  6784,  7475,  8697,  11070, 16791, 27265
              TCP select-now, 8858,  9617,  9845,  12173, 13845, 19417, 26171
              TCP block,      10696, 13103, 13299, 14428, 15629, 20373, 32149
              TCP select,     13425, 15426, 15743, 18035, 20719, 24793, 37877

Bear in mind that this is RTT (Round Trip Time), so a request-response timing. The above measurements are also over loopback, so no actual network hop. The network hop on 2 machines hooked into each other via a network cable will be similar; anything beyond that and your actual network stack will become more and more significant. Nothing can cure geography ;-)
              I am sure there are further tweaks to make in the stack to improve the results. Maybe the code, maybe the OS tuning, maybe the JVM version. It doesn't matter. The point is you can take this and measure your stack. The numbers may differ, but the relative performance should be fairly similar.

              Is it lunch time?

This is a bit of a detour, but bear with me. On the IPC side of things we should also start asking ourselves: what is the System.nanoTime() measurement error? What sort of accuracy can we expect?
              I added an ErrPingClient which runs the test loop with no actual ping logic, the result:
              Min, 50%, 90%, 99%, 99.9%, 99.99%,Max
              38,  50,  55,  56,  59,    80,    8919

Is this due to JVM hiccups? Inherent inaccuracy of the underlying measurement method used by the JVM? At this sort of time scale the latency measurement becomes a problem unto itself and we have to revert to counting on (horrors!) average latency over a set of measurements to cancel out the inaccuracy. To quote the Hitchhikers Guide: "Time is an illusion, and lunch time doubly so"; we are not going to get exact timings at this resolution, so we will need to deal with error. Dealing with this error is not something the code does for you, just be aware some error is to be expected.

              What is it good for?

              My aim with this tool (if you can call it that) was to uncover baseline costs of network operations on a particular setup. This is a handy figure to have when judging the overhead introduced by a framework/API. No framework in the world could beat a bare bones implementation using the same ingredients, but knowing the difference educates our shopping decisions. For example, if your 'budget' for response time is low the overhead introduced by the framework of your choice might not be appropriate. If the overhead is very high perhaps there is a bug in the framework or how you use it.
              As the tool is deployable you can also use it to validate the setup/configuration and use that data to help expose issues independent of your software.
Finally, it is a good tool to help people who have grown to expect Java server application response times to be in the tens of milliseconds range wake up and smell the scorching speed of today's hardware :-)

              Single Producer Single Consumer Queue Revisited: An empiricist tale - part I

Applying lessons learnt about memory layout to the previously discussed SPSC (Single Producer Single Consumer) queue. Tackling and observing some manifestations of false sharing. Getting bitch slapped by reality.

In preparation for implementing an MPMC (Many Producers Many Consumers) queue I went back to Martin Thompson's SPSC queue, which I dissected in detail in this blog post. I was going to use it as the basis for the new queue with a few changes and discuss the transition, in particular an opportunity offered in the implementation of said queue by the addition of the new getAndAdd intrinsic in JDK 8. I was going to... but then I thought, 'let me try a couple more things!' :-).

              Where were we?

In case you can't be bothered to go back and read the whole thing again, here's a quick summary. Mr. Thompson open sourced a few samples he discusses in his presentation on lock-free algorithms, in particular an SPSC queue developed across a few stages. I broke down the stages further and benchmarked before and after to explore the effect of each optimisation as it is applied. I then ported the same queue into an off-heap implementation and used it to demonstrate a fast IPC mechanism capable of sending 135M messages per second between processes. The queue demonstrated the following techniques:
Snowman contemplating evolution
              • lock free, single writer principle observed.
              • Set capacity to power of 2, use mask instead of modulo. 
              • Use lazySet instead of volatile write.
              • Minimize volatile reads by adding local cache fields.
• Pad all the hot fields: head, tail, headCache, tailCache to avoid false sharing.
              So... what's to improve on? There were a few niggles I had looking at this code again, some I've mentioned before. The last time I benchmarked the original queue implementation I noticed high run to run variance in the results. This was particularly prominent when running across 2 cores on the same socket or across sockets.
              To expose the variance I modified the test to produce a summary line (each test runs 20 iterations, the summary is the average of the last 10 test iterations) and ran it 30 times. The results demonstrate the variance (results are sorted to highlight the range, X-axis is the run index, Y-axis is a summary of the run, SC means producer and consumer run on the same core, CC means they run across cores on the same socket):


              Original queue performance
              OUCH! We get half the performance 20% of the time. The results were very stable within a given run, leading me to believe there was a genuine issue at play.

              So I thought, let's poke the i's and kick the t's, see if we can shake the variance out of the bugger.

              Terms & Conditions

Benchmarks were carried out on a dual socket Xeon running CentOS 6 and Oracle JDK 1.7u17. Affinity was set using taskset; the scripts used are included with the code, as are the raw results and the LibreOffice spreadsheets used for the graphs. Furry, fluffy animals were not harmed and people of different religions and races were treated with the utmost respect and afforded equal opportunity. I ran the cross socket tests, and the data is included, but I chose not to discuss them as no improvement was made on that front and they would make this already long post longer.


              Flatten me counters!

              To start off, I was not thrilled with the way the counter padding was implemented, for 2 reasons:
              1. By using container classes for the counters we introduce indirection, we could optimise by inlining the data into the queue structure.
2. The Padded* classes only pad to one side; we are counting on the instances being laid out together because they are instantiated together. In the right environment I'm pretty sure this can go wrong. By go wrong I mean the instances might get allocated/placed next to data modified elsewhere, leading to false sharing. By inlining the counters and forcing a strict order we could kill 2 birds with one stone.
To inline the counters and provide the padding required for false-sharing protection I used inheritance to force layout (as outlined previously here). I used Unsafe to get the field offsets and implement lazySet directly on the fields inlined in my class (this replaces the original PaddedLong/PaddedAtomicLong; the same method is used in AtomicLong to implement lazySet). The code:
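The gist isn't reproduced here, but a rough sketch of the shape of the thing (class and field names taken from the layout output below; only the lazySet on tail is shown, head is handled similarly in the real code) would be:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class ColdFields { protected int capacity; protected int mask; protected Object[] buffer; }
class L1Pad extends ColdFields { long p10, p11, p12, p13, p14, p15, p16; }
class TailField extends L1Pad { protected volatile long tail; }
class L2Pad extends TailField { long p20, p21, p22, p23, p24, p25, p26; }
class HeadCache extends L2Pad { protected long headCache; }
class L3Pad extends HeadCache { long p30, p31, p32, p33, p34, p35, p36; }
class HeadField extends L3Pad { protected volatile long head; }
class L4Pad extends HeadField { long p40, p41, p42, p43, p44, p45, p46; }
class TailCache extends L4Pad { protected long tailCache; }
class L5Pad extends TailCache { long p50, p51, p52, p53, p54, p55, p56; }

public class SPSCQueue1Sketch extends L5Pad {
    private static final Unsafe UNSAFE;
    private static final long TAIL_OFFSET;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            TAIL_OFFSET = UNSAFE.objectFieldOffset(TailField.class.getDeclaredField("tail"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // replaces PaddedAtomicLong.lazySet on the old tail counter
    protected void lazySetTail(long value) {
        UNSAFE.putOrderedLong(this, TAIL_OFFSET, value);
    }
}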

              It ain't pretty, but it does the job. The layout can be verified using the excellent java-object-layout tool (I format the output for brevity):
              psy.lob.saw.queues.spsc1.SPSCQueue1
               offset  size     type description
                    0    12          (assumed to be the object header + first field alignment)
                   12     4      int ColdFields.capacity
                   16     4      int ColdFields.mask
                   20     4 Object[] ColdFields.buffer
                   24-72  8     long L1Pad.p10-16
                   80     8     long TailField.tail
                   88-136 8     long L2Pad.p20-p26
                  144     8     long HeadCache.headCache
                  152-200 8     long L3Pad.p30-36
                  208     8     long HeadField.head
                  216-264 8     long L4Pad.p40-46
                  272     8     long TailCache.tailCache
                  280-328 8     long L5Pad.p50-56
                  336                (object boundary, size estimate)

              Wicked! Run the same tests above to see what happened:
              Original vs. Inlined counters(I1)

              We get a small improvement when running on the same core (average 3% improvement), but the cross core behaviour is actually worse. Bummer, keep kicking. 

              Padding the class fields

              If we look at the above memory layout, we'll notice the fields capacity, mask and buffer are left flush against the object header. This means that they are open to false sharing with other objects/data allocated on the same cache line. We can add a further love handle on that big boy to cover that front:
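In code the extra padding is just another layer at the root of the hierarchy, something like (again a sketch, names taken from the layout output below):

class L0Pad { long p00, p01, p02, p03, p04, p05, p06, p07; }
class ColdFields extends L0Pad { protected int capacity; protected int mask; protected Object[] buffer; }
// ... the rest of the hierarchy from the previous sketch is unchanged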

              Note that by happy coincidence we have already padded the tail side of the fields as part of our flattening exercise. So now the layout is:
              psy.lob.saw.queues.spsc3.SPSCQueue3
               offset  size     type description
                    0    12          (assumed to be the object header + first field alignment)
                   12     4          (alignment/padding gap)
                   16-72  8     long L0Pad.p00-07
                   80     4      int ColdFields.capacity
                   84     4      int ColdFields.mask
                   88     4 Object[] ColdFields.buffer
                   92     4          (alignment/padding gap)
                   96-144 8     long L1Pad.p10-16
                  152     8     long TailField.tail
                  160-208 8     long L2Pad.p20-26
                  216     8     long HeadCache.headCache
                  224-272 8     long L3Pad.p30-36
                  280     8     long HeadField.head
                  288-336 8     long L4Pad.p40-46
                  344     8     long TailCache.tailCache
                  352-400 8     long L5Pad.p50-56
                  408                (object boundary, size estimate)

              Try again and we get:
              Original vs Inlined counters and padded class(I3)


              Same single core behaviour as above, but the cross core behaviour looks stable. Sadly the cross core results are worse than the original in many cases. Good thing there is one more trick up our sleeves.

              Padding the data

              So, this last step may seem a bit over the top, padding the sides of the buffer as well as spacing out all the bloody fields. Surely you must be joking? I padded the buffer by allocating an extra 32 slots to the array and skipping the first 16 on access. The object layout remains the same, you'll have to imagine the extra padding (code is here). But the results are:
              Original vs Inlined counters, padded class and padded data(I4)


              Run fat Q! Run! This is a nice kick in the single core category (10% increase) and in the cross core it is pretty flat indeed. So when the original behaves they are much the same, but the average result is a 10% improvement. Very little variance remains.

              But it's UGLY!

Y do we care?
Got me there, indeed it is. Why do we need to go this far to get rid of flaky behaviour? This is one of them times when having a direct line to memory management really helps, and Java has always been about you not having to worry about this sort of thing. There is help on the way in the form of the @Contended annotation which would render the above class much nicer, but even then you will need to pad the buffer by yourself. If you look at how the OffHeapQueue manages this issue, you'll see that less padding is required when you know the data alignment. Sadly there is no @Align(size=byte|short|int|long|cache|page) annotation coming anytime soon, and the @Contended annotation is not clear on how you could go about marking an instance rather than a class/field as contended.

              Hang on, what if we do it the other way around?

For all you logically minded people out there who think: "we applied the changes together, but not separately. What if this is all about the data padding? We could fix the original without all this fuss...". I feel your pain, obsessive brothers. So I went back and added 2 variants: one of a padded data original (referred to as O2 in the graphs) and another of the padded data and inlined fields without the class field padding. I was surprised by the results:

              Same Core comparison of data padding impact


Padding the data, in and of itself, made things worse for the original implementation and the inlined implementation when running on the same core. Padding the data sucks!
              Cross Core comparison of data padding impact
              When we look at the cross core results we can see some benefit from the data padding, suggesting it is part of the issue we are trying to solve, but not the whole story.
              Padding the cold fields by itself also did little for the performance, as demonstrated above, but removed some of the variance. The 2 put together however gave us some extra performance, and killed the variance issue. Why? I don't know... But there you go, holistic effects of memory layout at work.

              A happy ending?

              Well... sort of. You see, all the above tests were run with the following JVM options:
              -XX:+UseNUMA -XX:+UseCondCardMark -XX:CompileThreshold=100000
              And thus the results present a certain truth, but maybe not all of the truth for everyone all of the time... I decided to waste some electricity and rerun some variations of the above options to see what happens. Running with no JVM options I got the following:
              Cross core - no JVM opts
• I4, which is the final inlined version, is still quite stable, but its performance lags behind the other implementations.
• O1, which is the original implementation, has less variance than before (could be luck, who knows) and has the best average result.

              Same core - no JVM opts

              • This time I3 (inlined counters, padded class) is the clear winner.
              • I2 (inlined counters, padded data) is second, followed by I4 and I1.
              • When running on the same core, inlining the counters delivers a minor performance win.
              Running with -XX:+UseCondCardMark:
CondCardMark cross core
              • Using CondCardMark has reduced the range somewhat, but still a fair range for all.
              • I4 and I3 are stable, but the overall best average score goes to I1 (133M).
• I4 is overall slightly better than O1 but worse than O2 (original implementation with padded data)
CondCardMark same core
              • I1 is the best by quite a bit, followed by I4.
              Running with -XX:+UseCondCardMark -XX:CompileThreshold=100000:
              -XX:+UseCondCardMark -XX:CompileThreshold=100000 - Cross Core
              • With the increased compile threshold O2 has pulled to the top followed by I4 and O1.
              • I4 is looking very much like O1
              • I1, which was the clear winner a second ago, is now rubbish.

              -XX:+UseCondCardMark -XX:CompileThreshold=100000 - Same Core
• On the same core we are now seeing the same behaviour we saw in the beginning.
              • I4 is the clear winner, I1 and I3 pushed down to second place etc.
              • This is odd, why would giving the JIT more data for compilation push the performance of I1 down?
              And for comparison the results from -XX:+UseNUMA -XX:+UseCondCardMark -XX:CompileThreshold=100000 presented together:
              All together now - Cross Core
              • I4 is the most stable, although O2 has the best overall average throughput.
              • Comparing with all the other combinations we seem to have degraded the performance of other options on our way to find the best option for I4 :(
              All together now - Same Core

              What does it all mean?

At this point I got quite frustrated. The above approach was producing improvements to the variance, and even an improvement to overall performance on occasion, but the effect was not as decisive and clear as I'd have liked. Some of the variation I just could not shake; even with the best result above, I4 is still moving between 110M and 120M and has a spike on either side of this range.
These results are a fine demonstration of how time consuming and error prone the estimation of performance can be. To collect the above results I had to set up the same test to run for each implementation 30 times for each of the affinity options (same core/cross core/cross socket) and repeat for 4 JVM options combinations (8 impls * 30 runs * 3 affinity setups * 4 JVM_OPTS + extra trial and error runs... = lots of time, quite a bit of data). This result is likely to be different on different hardware, it is likely to change with JDK8, other JVM options and so on. Even with all this effort of collecting data I am very far from having a complete answer. Is this data enough to produce meaningful results/insights at all?
To a certain extent there is evident progress here towards eliminating some of the sources of the run-to-run variation. Looking at the above results I feel justified in my interpretation of the object layout and the test results. Running the same code on other hardware I've observed good results for I4 and similar variation for O1, so not all is lost. But this journey is, surprisingly enough, still not over...

              More, more, more!

              David Hume
              If you found this riveting tale of minute implementation changes and their wacky effect on performance absorbing you will love the next chapter in which:
• I explore cross socket performance and its implications
• We contemplate the ownership of memory and its impact
              • The original implementation is evolved further
              • We hit a high note with 250M ops per second
              This post is long enough as is :-)

              SPSC Revisited - part II: Floating vs Inlined Counters

              Continued from here, we examine the comparative performance of 2 approaches to implementing queue counters.
If we trace back through the journey we've made with our humble SPSC queue, you'll note that a structural back and forth has happened. In the first post, the first version had the following layout (produced using the excellent java-object-layout tool):
              uk.co.real_logic.queues.P1C1QueueOriginal1
               offset  size     type description
                    0    12          (assumed to be the object header + first field alignment)
                   12     4 Object[] P1C1QueueOriginal1.buffer
                   16     8     long P1C1QueueOriginal1.tail
                   24     8     long P1C1QueueOriginal1.head
                   32                (object boundary, size estimate)
              We went on to explore various optimisations which were driven by our desire to:
1. Replace volatile writes with lazySet (a quick sketch of the difference follows this list)
              2. Reduce false sharing of hot fields
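As a reminder of what the first point means in code, here is a minimal sketch of a counter exposing both a plain volatile write and an ordered (lazySet) write. This is illustrative only and not the queue's actual code; the class and method names are made up for the example.

import java.util.concurrent.atomic.AtomicLong;

// lazySet performs the store with release (store-store) ordering but without the
// full StoreLoad fence a volatile write implies; a single writer thread can use it
// safely for publishing its own counter.
class CounterExample {
    private final AtomicLong tail = new AtomicLong();

    void publishWithVolatileWrite(long next) {
        tail.set(next);       // full volatile write: stronger ordering, costlier
    }

    void publishWithLazySet(long next) {
        tail.lazySet(next);   // ordered write: no immediate flush guarantee, much cheaper
    }

    long read() {
        return tail.get();    // volatile read on the consumer side
    }
}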
              To achieve these we ended up extracting the head/tail counters into their own objects and the layout turned to this:
              uk.co.real_logic.queues.P1C1QueueOriginal3
               offset  size       type description
                    0    12            (assumed to be the object header + first field alignment)
                   12     4        int P1C1QueueOriginal3.capacity
                   16     4        int P1C1QueueOriginal3.mask
                   20     4   Object[] P1C1QueueOriginal3.buffer
                   24     4 AtomicLong P1C1QueueOriginal3.tail
                   28     4 AtomicLong P1C1QueueOriginal3.head
                   32     4 PaddedLong P1C1QueueOriginal3.tailCache
                   36     4 PaddedLong P1C1QueueOriginal3.headCache
                   40                  (object boundary, size estimate)
But that is not the whole picture: we now had 4 new objects (2 of each class below) referenced from the above object:
              uk.co.real_logic.queues.P1C1QueueOriginal3$PaddedLong
               offset  size type description
                    0    12      (assumed to be the object header + first field alignment)
                   12     4      (alignment/padding gap)
                   16     8 long PaddedLong.value
                   24-64  8 long PaddedLong.p1-p6
                   72            (object boundary, size estimate)
               
              uk.co.real_logic.queues.PaddedAtomicLong
               offset  size type description
                    0    12      (assumed to be the object header + first field alignment)
                   12     4      (alignment/padding gap)
                   16     8 long AtomicLong.value
                   24-64  8 long PaddedAtomicLong.p1-p6
                   72            (object boundary, size estimate)
These counters are different from the original because they are padded (on one side at least), but they also represent a different overall memory layout/access pattern. The counters are now floating. They can get relocated at the whim of the JVM. Given these are in all probability long lived objects I wouldn't think they move much after a few collections, but each collection presents a new opportunity to shuffle them about.
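For illustration, padded counters along these lines might look like the following minimal sketch. The names mirror the layout dumps above, but the bodies are illustrative rather than the exact classes from the repository.

import java.util.concurrent.atomic.AtomicLong;

// One-sided padding: the trailing longs keep the next allocated object's hot
// fields off the cache line holding 'value'.
class PaddedLong {
    public long value;                       // the counter itself
    public long p1, p2, p3, p4, p5, p6;      // trailing padding
}

// AtomicLong already holds the volatile long value; the extra fields pad after it.
class PaddedAtomicLong extends AtomicLong {
    public long p1, p2, p3, p4, p5, p6;
}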
              In my last post I explored 2 directions at once:
              1. I fixed all the remaining false-sharing potential by padding the counters, the class itself and the data in the array.
              2. I inlined the counters back into the queue class, in a way returning to the original layout (with all the trimmings we added later), but with more padding.
              I ended up with the following layout:
              psy.lob.saw.queues.spsc4.SPSCQueue4
               offset  size     type description
                    0    12          (assumed to be the object header + first field alignment)
                   12     4          (alignment/padding gap)
                   16-72  8     long L0Pad.p00-07
                   80     4      int ColdFields.capacity
                   84     4      int ColdFields.mask
                   88     4 Object[] ColdFields.buffer
                   92     4          (alignment/padding gap)
                   96-144 8     long L1Pad.p10-16
                  152     8     long TailField.tail
                  160-208 8     long L2Pad.p20-26
                  216     8     long HeadCache.headCache
                  224-272 8     long L3Pad.p30-36
                  280     8     long HeadField.head
                  288-336 8     long L4Pad.p40-46
                  344     8     long TailCache.tailCache
                  352-400 8     long L5Pad.p50-56
                  408                (object boundary, size estimate)
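The layout above is forced by chaining padding classes through inheritance, so the JVM cannot reorder the hot fields next to each other. A rough sketch of the idea follows; the class and field names track the dump, but the bodies are illustrative, the data padding of the buffer is omitted, and this is not the exact source.

// Each level adds either a group of padding longs or a single hot field;
// inheritance fixes the relative order of the groups in the final object.
class L0Pad { long p00, p01, p02, p03, p04, p05, p06, p07; }

class ColdFields extends L0Pad {
    final int capacity;
    final int mask;
    final Object[] buffer;
    ColdFields(int capacity) {
        this.capacity = capacity;        // assumed to be a power of two
        this.mask = capacity - 1;
        this.buffer = new Object[capacity];
    }
}

class L1Pad     extends ColdFields { long p10, p11, p12, p13, p14, p15, p16; L1Pad(int c)     { super(c); } }
class TailField extends L1Pad      { long tail;                              TailField(int c) { super(c); } }
class L2Pad     extends TailField  { long p20, p21, p22, p23, p24, p25, p26; L2Pad(int c)     { super(c); } }
class HeadCache extends L2Pad      { long headCache;                         HeadCache(int c) { super(c); } }
// ... and so on through L3Pad, HeadField, L4Pad, TailCache and L5Pad,
// with the queue class itself extending the last pad level.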
This approach worked well, reducing the run to run variance almost completely and delivering good performance. The problem was that it failed to hit the highs of the original implementation and its variants, particularly when running across cores.
To further explore the difference between the inlined and floating versions I went back and applied the full padding treatment to the floating counters version. This meant replacing PaddedLong and PaddedAtomicLong with fully padded implementations, adding padding around the class fields and padding the data. The full code is here; it's very similar to what we've done to pad the other classes. The end result has the following layout:
              psy.lob.saw.queues.spsc.fc.SPSPQueueFloatingCounters4
               offset  size             type description
                    0    12                  (assumed to be the object header + first field alignment)
                   12     4                  (alignment/padding gap)
                   16-72  8             long SPSPQueueFloatingCounters4P0.p00-p07
                   80     4              int SPSPQueueFloatingCounters4Fields.capacity
                   84     4              int SPSPQueueFloatingCounters4Fields.mask
                   88     4         Object[] SPSPQueueFloatingCounters4Fields.buffer
                   92     4 VolatileLongCell SPSPQueueFloatingCounters4Fields.tail
                   96     4 VolatileLongCell SPSPQueueFloatingCounters4Fields.head
                  100     4         LongCell SPSPQueueFloatingCounters4Fields.tailCache
                  104     4         LongCell SPSPQueueFloatingCounters4Fields.headCache
                  108     4                  (alignment/padding gap)
                  112-168 8             long SPSPQueueFloatingCounters4.p10-p17
                  176                        (object boundary, size estimate)

              psy.lob.saw.queues.spsc.fc.LongCell
               offset  size type description
                    0    12      (assumed to be the object header + first field alignment)
                   12     4      (alignment/padding gap)
                   16-64  8 long LongCellP0.p0-p6
                   72     8 long LongCellValue.value
                   80-128 8 long LongCell.p10-p16
                  136            (object boundary, size estimate)

              psy.lob.saw.queues.spsc.fc.VolatileLongCell
               offset  size type description
                    0    12      (assumed to be the object header + first field alignment)
                   12     4      (alignment/padding gap)
                   16-64  8 long VolatileLongCellP0.p0-p6
                   72     8 long VolatileLongCellValue.value
                   80-128 8 long VolatileLongCell.p10-p16
                  136            (object boundary, size estimate)
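As a rough idea of what such fully padded cells might look like, here is a sketch following the field names in the dumps. It is not the exact code from the repository: in particular, the ordered write here goes through AtomicLongFieldUpdater, whereas the real classes may well use Unsafe directly.

import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Leading padding keeps neighbours off the cache line in front of 'value'.
class VolatileLongCellP0 { long p0, p1, p2, p3, p4, p5, p6; }

class VolatileLongCellValue extends VolatileLongCellP0 {
    volatile long value;
}

class VolatileLongCell extends VolatileLongCellValue {
    // Trailing padding protects 'value' from objects allocated after this one.
    long p10, p11, p12, p13, p14, p15, p16;

    private static final AtomicLongFieldUpdater<VolatileLongCellValue> UPDATER =
            AtomicLongFieldUpdater.newUpdater(VolatileLongCellValue.class, "value");

    long volatileGet()          { return value; }
    void lazySet(long newValue) { UPDATER.lazySet(this, newValue); }
}

// The plain (non-volatile) cell used for the producer/consumer local caches.
class LongCellP0 { long p0, p1, p2, p3, p4, p5, p6; }
class LongCellValue extends LongCellP0 { long value; }
class LongCell extends LongCellValue { long p10, p11, p12, p13, p14, p15, p16; }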
If we cared more about memory consumption we could count the object header as padding and pad with integers to avoid the alignment gaps. I'm not that worried, so I won't. Note that the floating counters have to consume double the padding required by the inlined counters, as they have no guarantees about their neighbours on either side. In the interest of comparing the impact of the data padding separately, I also implemented a version without data padding.

              Which one is more better?

While the charts produced previously are instructive and good at highlighting the variance, they make the post very bulky, so this time we'll try a table. The data, with the charts, is available here for those who prefer them. I've expanded the testing of JVM parameters a bit to cover the effect of the combinations of the 3 options used before; otherwise the method and setup are the same. The abbreviations stand for the following:
              • O1 - original lock-free SPSC queue with the original padding.
              • FC3 - Floating Counters, fully padded, data is not padded.
              • FC4 - Floating Counters, fully padded, data is padded.
              • I3 - Inlined Counters, fully padded, data is not padded.
              • I4 - Inlined Counters, fully padded, data is padded.
• CCM - using the -XX:+UseCondCardMark flag
              • CT - using the -XX:CompileThreshold=100000 flag
              • NUMA - using the -XX:+UseNUMA flag
Same Core (each cell is min, mid, max; all numbers are in millions of ops/sec):
     | No Opts     | CCM         | CT          | NUMA        | CCM+CT      | CCM+NUMA    | CT+NUMA     | CCM+CT+NUMA
O1   | 107,108,113 | 97,102,103  | 111,112,112 | 88,108,113  | 97,103,103  | 100,102,103 | 111,112,112 | 102,103,103
FC3  | 105,108,113 | 102,103,104 | 111,112,113 | 103,108,113 | 102,103,103 | 97,103,103  | 111,112,113 | 102,103,103
FC4  | 93,96,101   | 87,89,90    | 99,100,101  | 82,95,100   | 89,90,90    | 86,90,90    | 81,101,101  | 89,90,90
I3   | 103,123,130 | 97,98,100   | 129,129,130 | 122,129,130 | 105,106,107 | 97,98,99    | 128,129,130 | 104,105,107
I4   | 108,113,119 | 103,105,106 | 111,113,113 | 99,118,120  | 105,113,114 | 98,104,114  | 112,112,113 | 105,114,114

              Cross Core:
     | No Opts     | CCM         | CT          | NUMA        | CCM+CT      | CCM+NUMA    | CT+NUMA     | CCM+CT+NUMA
O1   | 38,90,130   | 57,84,136   | 40,94,108   | 38,59,118   | 58,98,114   | 55,113,130  | 39,53,109   | 57,96,112
FC3  | 94,120,135  | 109,141,154 | 94,105,117  | 98,115,124  | 103,115,128 | 116,129,140 | 95,108,118  | 102,120,127
FC4  | 106,124,132 | 118,137,152 | 104,113,119 | 107,119,130 | 107,126,133 | 114,132,150 | 105,114,119 | 99,123,131
I3   | 86,114,122  | 88,113,128  | 72,96,111   | 90,112,123  | 86,98,108   | 85,117,125  | 88,96,111   | 90,99,108
I4   | 49,107,156  | 78,133,171  | 58,100,112  | 48,126,155  | 108,128,164 | 88,143,173  | 55,96,113   | 104,115,164

              I leave it to you to read meaning into the results, but my take aways are as follows:
• Increasing the CompileThreshold is not equally good for all code in all cases. In the data above it is not proving helpful in and of itself in the cross core case for any implementation. It does seem to help once CCM is thrown in as well.
• Using CondCardMark makes a mighty difference. It hurts performance in the same core case, but greatly improves the cross core case for all implementations. The difference made by CCM is covered by Dave Dice here, and it goes a long way towards explaining the variance experienced in the inlined versions when running without it.
              • NUMA makes little difference that I can see to the above cases. This is as expected since the code is being run on the same NUMA node throughout. Running across NUMA nodes we might see a difference.
              • As you can see there is still quite a bit of instability going on, though as an overall recommendation thus far I'd say I4 is the winner. FC4 is not that far behind when you consider the mid results to be the most representative. It also offers more stable overall results in terms of the variance.
• 173M ops/sec! It's a high note worth musing over... But... didn't I promise you 250M? I did; you'll have to wait.
The above results also demonstrate that data inlining is a valid optimization with measurable benefits. I expect the results in more real-life scenarios to favour the inlined version even more, as it offers better data locality and predictable access over the floating counters variants.
One potential advantage may be available to the floating counters if we could allocate each counter on its writer thread. I have not explored this option, but based on Dave Dice's observations I expect some improvement. This would make the queue quite awkward to set up, but it's worth a go.
              There's at least one more post coming on this topic, considering the mechanics of the experiment and their implications. And after that? buggered if I know. But hang in there, it might become interesting again ;-)

UPDATE: thanks to Chris Vest for pointing out that I left out the link to the data; fixed now.

