
Faster Java Without Changing Any Code by Simon Ritter


Applications that run on the JVM benefit from a managed runtime environment with JIT compilation that can easily exceed the performance of natively compiled code. In this byte-sized session, we'll look at some ways to improve the performance of your JVM-based applications without having to change a single line of code or even recompile!


Well, good afternoon and welcome. So I have 15... actually, I've got 16 minutes according to this. This is incredible. So I've got 16 minutes to talk about how we can improve Java's performance without changing any code. And what I want to do is start off with exploring what are the reasons that we need to improve performance. Because the JVM is a very powerful piece of software. It's really the reason that Java has become and maintained its popularity over the last, what is it now, 26 years? But there are some things about the way that the JVM works that can impact the performance of your applications in a way that might not be ideal for your application. First of those is latency. How quickly can we respond to a request from the user to our application? There are a number of different things that can impact the latency of the JVM, but the biggest of those is garbage collection: automated memory management. We don't have to worry about allocating space for our objects, and more importantly, we don't have to worry about reclaiming that space once we've finished with it. So who here has ever experienced a garbage collection pause? Yes. Okay. Garbage collection pauses can vary. They can be a few milliseconds, they can be a few tens of milliseconds. If you're unlucky, they are seconds. If you're really unlucky, they're minutes. And in the most extreme case that I came across, a one and a half day garbage collection pause. Now, that's a really extreme thing when you've got a very, very large heap. But the reality is that you often will see garbage collection pauses that can impact your application's performance. And one of the other things that's important about that is the fact that the pause times for most algorithms are proportional to the size of the heap, not to the amount of data you've got. The bigger your heap, the longer your pause is. The second thing to look at is throughput of the application.
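As an illustration (not from the talk): you can observe the cumulative time a JVM has spent in garbage collection from inside a running application, using the standard `java.lang.management` API. This is a minimal sketch; the allocation pattern and numbers are made up for the demo.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Allocate lots of short-lived objects, then ask the JVM's collectors
// how much time they report having spent collecting so far.
public class GcPauseDemo {

    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // cumulative milliseconds; -1 if unsupported
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        List<byte[]> survivors = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            byte[] garbage = new byte[1024];            // dies young: minor-GC fodder
            if (i % 100_000 == 0) survivors.add(garbage); // a few live longer
        }
        System.out.println("Time in GC so far: " + totalGcMillis()
                + " ms, survivors kept: " + survivors.size());
    }
}
```

Note that this reports collector time, not pause time as the application experiences it; the speaker returns to that distinction later with jHiccup.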
How many transactions per second can we actually deal with? The way that the JVM works is, again, very efficient, because it uses just-in-time compilation. We look at how many times a method is called. When it gets called frequently enough, we decide that rather than using bytecodes, we'll compile it into native instructions. And that happens in a couple of phases. We use C1 to compile code very quickly, but it doesn't optimize the code very heavily. And then we use C2, which optimizes much more heavily based on profiling data that we've collected. And the thing that we don't want to see there, which does impact performance, is deoptimizations. Because we can use speculative optimizations, we assume that the code will continue to work in the way that we've seen it work so far. And if we make a mistake about that, we have to throw away code and then recompile it, and the level of optimization there is really key. And then the third thing from a performance perspective is warm-up time. How quickly can we get that application running at the 100% level of performance for that application? That time is how long it takes to analyze which methods we need to compile, compiling them with C1, then compiling them with C2. And the problem with that, the real problem, is that every time we start the application, we go through the whole same process again. We don't learn from our mistakes in the past. So what we've done at Azul is to address those problems. And we started off and we said, well, let's not rewrite Java from scratch. The OpenJDK HotSpot JVM is very good. It's very powerful. So let's take that as a starting point and then change some things, make things a little bit better for particular instances. The key thing, once we've done that, is that we want the JVM to be the same. And this kind of comes back to the title, with the idea that we can change performance without changing any code.
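To make the C1/C2 tiering concrete, here is a small sketch (not from the talk). A method called tens of thousands of times crosses HotSpot's default tiered-compilation thresholds: it is interpreted first, then compiled quickly by C1, then recompiled with heavier optimization by C2. Run it with `-XX:+PrintCompilation` to watch the tiers appear in the log.

```java
// A method that becomes "hot". The JVM interprets it first, then compiles
// it with C1 (fast, lightly optimized), and later with C2 (slower to
// compile, heavily optimized using the profile gathered so far).
public class WarmupDemo {

    // Small and simple enough to be inlined once it gets hot.
    static long hot(long x) {
        return x * 31 + 7;
    }

    public static void main(String[] args) {
        long acc = 0;
        // Enough calls to cross the default tiered-compilation thresholds.
        for (int i = 0; i < 100_000; i++) {
            acc += hot(i);
        }
        System.out.println("acc = " + acc);
    }
}
```

A speculative optimization here would be, for example, inlining a virtual call that has only ever had one receiver type; loading a second implementing class later forces the deoptimization the speaker describes.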
To make sure that's true, we run all of the TCK tests on our JDK. The TCK is part of the Java SE standard. It's something like 1,500 tests that you need to pass to show that your implementation matches the specification. So we do that. And what that means is that you can take your Java application, you don't need to change any of the code, you don't need to recompile it; you take exactly the same class files and JAR files, and all you do is run the Prime JVM rather than the HotSpot JVM. In terms of the changes that we made addressing those three things that we just talked about: first, we replace all of the garbage collectors in HotSpot with one called C4. That means there's no G1, there's no Shenandoah, there's no ZGC, there's no CMS, there's no serial collector, and so on. The second thing we've done is to replace the second half of the JIT compiler, the one that does more heavy optimization, with one that's called Falcon. And I'll talk a little bit more about that later on, but it's based on another open source project called LLVM, which some of you may have heard of. And then the third thing, to address that warm-up time, is what we call ReadyNow. ReadyNow is about the idea of taking a profile of a running application, and then the next time you run that application, you can use that profile to avoid a lot of the work that you would normally do. One thing about what we do with our Prime JVM is that we made a decision to only support it on Linux. Now, technically, there's no reason we couldn't do it on Windows, but the reason we've done it on Linux is because we see a bigger market there. And the reason that technically we need to do that is because we also do some clever things in terms of memory management, depending on how big your heap is. If it's less than a terabyte, then you can just use some system calls that we have in the kernel that enable us to do some interesting memory management techniques.
If it's over a terabyte, then we actually install a kernel module to enable us to handle memory management more efficiently, based on the profile of how the JVM works. What this allows us to do... and I think I've actually made a mistake in my slide here, because one of my colleagues told me this morning that we've actually changed the maximum size of our heap, so we can actually scale all the way up to 20 terabytes of heap without changing the profile of garbage collection. So you won't see any increases in pause time going all the way from half a gigabyte up to 20 terabytes of heap. So what is C4? C4 is the Continuously Concurrent Compacting Collector, and this is a different algorithm in terms of garbage collection. It uses some different techniques. The first thing to understand, though, is that some of what it does is quite familiar. We divide the heap into young and old generations. It is a generational heap space, and we do that because, if you look at most objects that you allocate in a Java application, almost all of them will only be used for a very, very short space of time. If you can garbage collect most of the objects you allocate in the young generation, you can reduce the load on the old generation and have overall better efficiency for the memory management of your application. All of the phases that we have in terms of our collector are not only concurrent, meaning that we can do them at the same time as the application threads are still running, which means no GC pauses, but they're also parallel. We can decompose each of those phases into multiple threads, have them handle different parts of the workload, and therefore get the work done more quickly. A very important part about this is that it doesn't have a stop-the-world compacting fallback situation. Many collectors, and G1 is a good example, will reach a point where they're doing a lot of concurrent work and maintaining that very low pause time.
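As a rough intuition aid (not Azul's code, and deliberately sequential, whereas the whole point of C4 is doing this concurrently with the application): the mark/relocate/remap pipeline the speaker describes next can be modeled on a toy "heap" where objects are just indices with outgoing references.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy model of mark -> relocate -> remap over a tiny object graph.
// Each "object" is an int[] of outgoing references (indices into the heap).
public class MarkRelocateRemapDemo {

    // Returns the forwarding table: new address per object, -1 if reclaimed.
    static int[] collect(List<int[]> heap, int root) {
        // Phase 1: mark. Walk the graph from the root, flagging live objects.
        boolean[] marked = new boolean[heap.size()];
        Deque<Integer> stack = new ArrayDeque<>(List.of(root));
        while (!stack.isEmpty()) {
            int obj = stack.pop();
            if (marked[obj]) continue;
            marked[obj] = true;
            for (int ref : heap.get(obj)) stack.push(ref);
        }
        // Phase 2: relocate. Give live objects new, compacted addresses.
        int[] forwarding = new int[heap.size()];
        int next = 0;
        for (int i = 0; i < heap.size(); i++) {
            forwarding[i] = marked[i] ? next++ : -1; // -1 = space reclaimed
        }
        // Phase 3: remap. Rewrite every surviving reference to point at
        // the object's new address.
        for (int i = 0; i < heap.size(); i++) {
            if (!marked[i]) continue;
            int[] refs = heap.get(i);
            for (int j = 0; j < refs.length; j++) refs[j] = forwarding[refs[j]];
        }
        return forwarding;
    }
}
```

The hard part C4 solves is letting application threads keep running while objects move, which is where the loaded value barrier below comes in.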
But at some point they'll have too much fragmentation, the allocation rate is too high, and they will then fall back and say, right, stop everything and do a compacting collection on the old generation. That's when you see these very large pause times. In terms of our algorithm, it's three phases. We do a marking phase to identify which objects are still in use. We do a relocation phase, which enables us to move objects around within the heap to compact them. And then we do a remapping phase to update the information in the object headers, all of which is done concurrently. How we can do this concurrently is through what we call the loaded value barrier. It's effectively a read barrier, meaning that every time you access an object in the heap, we will intercept that, and we will look at some bits in the object header to see if the state of that object header matches what we need it to be. If it doesn't, we'll then jump to a handler and do some work. What this allows us to do is to enforce two rules. The first of those rules is: if we're in the marking phase, where we're identifying objects that are still in use, then every reference you get is guaranteed to be marked. So there's no accidental garbage collection because we missed something, where you start using an object that doesn't get marked and we then garbage collect it. The second thing we can enforce is that if we're in either the relocating or remapping phases, the reference you get to the object will always be the correct one. Even if we're moving it around in the heap, the reference you get means you can make changes to the state of that object, and those changes won't be lost. So no inconsistent data. Many people will say to me, well, hang on, if you're going to intercept every object reference that I make, isn't that going to degrade performance, because you're doing extra work? And the answer is yes and no. Yes it is, because technically we are looking at the object header and looking at the state of a bit.
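The shape of that barrier can be sketched as follows. This is illustrative only: the real loaded value barrier is a couple of machine instructions emitted by the JIT compiler, not Java code, and the state-bit encoding here is invented for the demo.

```java
// Sketch of a loaded-value-barrier-style check. A loaded reference carries
// state bits; if they don't match the collector's current "good" state,
// a slow path fixes the reference (marks it, or remaps it to the object's
// new location) before the application ever sees a stale value.
public class LvbSketch {

    static final int GOOD_STATE = 0b01; // what the collector currently expects
    static int slowPathCalls = 0;       // just for observing the demo

    // Model a reference as an int whose low two bits are the state.
    static int loadBarrier(int ref) {
        if ((ref & 0b11) == GOOD_STATE) { // fast path: one test...
            return ref;                   // ...and the jump is not taken
        }
        return slowPath(ref);             // rare case: fix up the reference
    }

    static int slowPath(int ref) {
        slowPathCalls++;
        // Mark and/or remap, then rewrite the state bits so the NEXT load
        // of this reference takes the fast path ("self-healing").
        return (ref & ~0b11) | GOOD_STATE;
    }
}
```

The self-healing property is the design point: each stale reference pays the slow path at most once, so barrier overhead stays on the two-instruction fast path almost all of the time.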
But in reality, what that equates to is two instructions: a test on the bit and, if the bit's in the wrong state, a jump to a handler. So it's a test and a jump. Two instructions at the Intel microarchitecture level is essentially one micro-operation, so less than 1 ns of overhead per object reference. Again, you might say, well, I do lots of object references, but the reality is that if you're doing anything which involves L2 cache, L3 cache, or main memory, the effect of those reads is going to be far more than our loaded value barrier. How do we measure this? How can we show this in reality? Well, we created a thing called jHiccup. And what that does is spend most of its time asleep. So it's a thread we put alongside your application: no interaction, no recoding. It spends its time asleep; it sleeps for one millisecond. When it wakes up, it says, what time did I actually wake up, and compares that to when it expected to wake up. Any delta might be because of a garbage collection pause. We can log that, generate some histogram files, and then produce a graph. Here is a graph. What this shows on the left-hand side is an Elasticsearch application running on a 128 gigabyte heap. Very typical profile: we've got lots of small spikes, which are minor GC pauses, and a couple of big spikes, which are major GC pauses. On the right, we've got Zing, or Prime as we call it. And remember, high is bad here. So you might look at that and go, well, hang on, that looks actually no better than the left-hand side. The left-hand side looks good. But I've been deliberately misleading here, in that I've scaled the graphs differently. On the left-hand side I've got zero to 8 seconds, and on the right-hand side I've got zero to 80 milliseconds. And if I scale those the same, then we suddenly see how that actually works. So all of the garbage collection pauses are eliminated. As I said, the second thing we did is the idea of the Falcon JIT compiler replacing C2.
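The jHiccup idea is simple enough to sketch in a few lines. This is a minimal illustration of the technique, not the real tool (jHiccup also records results into HdrHistogram files rather than just tracking a maximum).

```java
// Minimal jHiccup-style probe: sleep for 1 ms, then compare when we
// actually woke up with when we expected to. Any extra delay is a
// "hiccup": time when no thread could run, e.g. during a GC pause.
public class HiccupProbe {

    // Returns the worst observed hiccup, in nanoseconds, over `samples` sleeps.
    static long worstHiccupNanos(int samples) {
        long worst = 0;
        try {
            for (int i = 0; i < samples; i++) {
                long before = System.nanoTime();
                Thread.sleep(1);                          // expect ~1 ms
                long actual = System.nanoTime() - before; // what we really got
                long hiccup = actual - 1_000_000L;        // delta beyond 1 ms
                if (hiccup > worst) worst = hiccup;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();           // stop sampling early
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("worst hiccup: " + worstHiccupNanos(100) + " ns");
    }
}
```

Because the probe does no work of its own, anything it observes beyond scheduler jitter is a platform-level stall that every application thread also suffered.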
This is based on the LLVM open source project. We looked around and said, well, it makes sense to reuse work that's already been done. It's a very powerful piece of compiler technology, with lots of companies working on it: people like Intel, Nvidia, Microsoft, and so on. What this allows us to do is to generate better performing code from the JIT compiler and therefore get higher throughput. Essentially, we can do things like better intrinsics, and we can do more inlining of code. One of the things that we see a lot is better use of vector operations: very wide registers, single instruction, multiple data. ReadyNow, as I said, is the idea of eliminating, as much as possible, the warm-up time associated with an application having to go through the analysis of which methods to compile, compiling them, and so on. So we let the application run and then we take a profile. The profile tells us which classes are loaded and which classes are initialized. It tells us all the profiling data that we collected during that run, and the speculative optimizations that failed, so we know not to make that mistake again. And we take a copy of all the compiled code. When you run the application again, before we get to your main entry point, we'll load all the classes we can, initialize all the classes we can, and then either compile methods straight away or, if we have them in the cache of compiled code that we took, reuse that and not have to do the compilation again. So this makes for much better performance in terms of getting to that overall speed. Literally, when you get to main, you'll have about 98% of the performance you had when you took the profile. There's a couple of things that we can't do to get to that 100% level, around the way that the JVM is defined in terms of its startup, and also things like lambda expressions can be a little bit complicated, although we are working on that.
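In practice this is driven by a pair of JVM options. The sketch below uses the ProfileLogOut/ProfileLogIn option names as I recall them from Azul's Platform Prime documentation, and `myapp.jar` is a placeholder; check the current docs before relying on the exact spellings.

```shell
# First run: record a ReadyNow profile while the app handles real traffic.
java -XX:ProfileLogOut=readynow.profile -jar myapp.jar

# Later runs: replay the profile, so classes are loaded and initialized and
# cached compilations are reused before main() is reached.
java -XX:ProfileLogIn=readynow.profile -jar myapp.jar
```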
Just to summarize, then: basically, the idea of our JVM is to address those three key areas of performance. So it's about starting Java faster, using ReadyNow and our profiles; it's about getting Java to go faster, increasing the overall performance and throughput of the application by generating better code; and it's about keeping Java staying fast by not having these garbage collection pauses that you get associated with GC. In terms of a cloud environment, this can really help you to reduce the size of the instances you have. It can reduce the number of nodes you have in a cluster and end up saving you money in a big way. There it is. As I said, it's a simple replacement for other JVMs. You don't have to recode, you don't even have to recompile, and everything will work in exactly the same way from a functional point of view. Better performance, yes, but your applications will do the same thing that they would do if you were running on HotSpot or another JVM. So it is a simple replacement. And if you want to, you can try our Platform Prime: you can go to our website and pick up a trial there. And with that, thank you very much.