by Nikola Grcevski
At: FOSDEM 2020 https://video.fosdem.org/2020/H.1302/reducing_gc_times.webm
<p>In this talk we'll explore ways that the JVM can reduce the object allocation rate of Java programs automatically by performing stack allocation of objects that are known to be local to a method, or in compiler terms non-escaping. The discussion is focused on employing the escape analysis optimization in the OpenJDK Hotspot C2 compiler to determine which Java objects can be stack allocated, and how this optimization can reduce pressure on the JVM garbage collectors.</p>
<p>We'll show some results on how various real world applications can benefit from such optimizations and describe the methodology of how we prototyped this in OpenJDK. Our work is only in prototype state at this moment and we are looking for more data to understand how broadly applicable this optimization is. This work wouldn't be possible without free open source access to Java.</p>
Room: H.1302 (Depage) Scheduled start: 2020-02-01 14:20:00
Hi, everyone. This is going to be a compiler talk about the C2 compiler in OpenJDK, specifically an optimization we worked on at Microsoft, surprisingly, called stack allocation. I'll start with a quick introduction to what currently exists in C2, then move on to the work we did in our engineering group, show some results on popular benchmarks, and hopefully finish off with the things we still need to do. This is still very much a work in progress; we haven't submitted a patch to the OpenJDK compiler mailing list yet. And just a warning: there's going to be a bunch of compiler jargon here, which may be boring for some people. I can't work around that, unfortunately. So, stack allocation. The main motivation was to alleviate some of the GC pressure. The idea was originally brought to us by Kirk Pepperdine; some of you may know him, he's big on GC tuning in the real world. As we were looking at improving some Spark workloads, he would often say that these allocations in Scala give the GC seizures, and that it would be really good if we could try to fix this. So how are we going to do it? Not by eliminating the allocation outright, but by allocating on the stack frame rather than on the heap. It's a well-known compiler optimization; it just hasn't been done in C2 yet. When can we do this? When the object does not escape the current method's context. An object typically escapes through returns, calls to methods, or by being stored to fields and passed around. So there are places where we can do it and places where we can't. Here's an example: a very simple Java program with three allocations. We box two integers and then finally return another one. Integer is an immutable class, so the addition creates a new object that gets returned. The first two objects do not escape the method.
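The slide itself isn't reproduced in the transcript, but the example being described would look roughly like this (a reconstruction, not the speaker's exact code):

```java
// Reconstructed sketch of the slide's example: autoboxing creates three
// Integer objects in this method; only the returned sum escapes.
public class BoxExample {
    static Integer add(int x, int y) {
        Integer a = x;   // boxed: never leaves this method
        Integer b = y;   // boxed: never leaves this method
        return a + b;    // a and b are unboxed; the sum is boxed and escapes via return
    }
}
```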
The last one, the returned result, does escape, and the way we tell which ones escape and which don't is a compiler backend pass called escape analysis. A bit more about escape analysis: it was introduced in a paper from IBM T.J. Watson Research a while back. The paper describes two kinds of the optimization, a flow-sensitive one and a flow-insensitive one. The one implemented in C2 is the flow-insensitive version, and that's the right choice: the paper describes both but shows that they perform similarly well, and the flow-insensitive one is much easier to implement and maintain and less memory intensive. C2 currently uses escape analysis for two purposes. The first is monitor elimination: if an object is proven not to escape, it can only be seen by one thread, so we can eliminate the monitorenter and monitorexit operations on it; no synchronization is needed. So if you're using a StringBuffer instead of a StringBuilder, this will help. The second one, which was the interesting one for us, is scalar replacement. This is the optimization that we extended. Scalar replacement works on the same concept: you take the original object and break it apart into its individual parts, doing away with the actual allocation. Breaking up the object turns its fields into ordinary locals that sit on the stack, so there's no heap allocation. Here is another example, slightly different from before; people who know this stuff may notice the subtle difference. We do the three allocations as before. Now, moving to the next step, scalar replacement comes in and turns the code into this.
The primitive data type inside the Integer object was extracted and stored as a local variable on the stack. When can we do this? Mainly when we don't need the original form of the object anymore, when we can prove that the object as a whole Integer is never needed. Now let's look at some of the limitations of scalar replacement. There are a number of reasons why scalar replacement can fail, and when we did the analysis and ran a bunch of benchmarks, what we found to be the main cause of scalar replacement failing in C2 is the introduction of control flow. Namely (compiler talk again, bear with me) we have two definitions on the two sides of the control flow: on one side the object is instantiated, a new MyClass object, and on the other side we're pulling one out of an array. Coming down to the final return, o.getX(), which one do we have in our hands? We need the original shape of the object, we need to do a field load from it, because while one side could maybe be scalar replaced, the other side needs the full object coming from that array. How common is this issue? Back to my original example: the version on the left we can scalar replace, but that's not what typically happens with autoboxing. The version on the right is what C2 is unable to handle, mainly because Integer.valueOf, which is what actually gets called when you autobox a primitive, internally has the exact same pattern I showed on the previous slide. It has a compare against a range, -128 to 127 (which is also configurable): if the value falls in this range, you get a cached Integer object for free from a static array, which is a poor man's version of allocation elimination, but it does actually work for that range of values.
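The pattern that defeats scalar replacement can be sketched like this (a reconstruction mirroring the Integer.valueOf cache; MyClass and the field names are made up for illustration):

```java
// A control-flow merge where one side allocates and the other side
// pulls an existing object out of a static array, as Integer.valueOf does.
// After the merge, o.getX() needs a real object either way, so the
// allocation on the first path cannot be scalar replaced.
public class MergeExample {
    static class MyClass {
        final int x;
        MyClass(int x) { this.x = x; }
        int getX() { return x; }
    }

    static final MyClass[] CACHE = { new MyClass(7) };

    static int get(boolean fresh, int v) {
        MyClass o;
        if (fresh) {
            o = new MyClass(v);  // non-escaping on its own, but...
        } else {
            o = CACHE[0];        // ...this side is a real heap object
        }
        return o.getX();         // field load after the merge
    }
}
```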
But every time you go above or below that range, you get a heap-allocated object. So can we make this work? There are compiler optimizations that could potentially make this little example work, but they have drawbacks. One typical approach is cloning: the optimizer could say, we have a condition and we can go either way, so why don't we just specialize the method body for this side and for that side? But say you have another branch; this becomes exponential, the code grows insanely, so it's not actually very useful. The other way you could do it is maybe by code motion, but then you're stuck with side effects. Say the array we're pulling the object out of contains a null: you have to throw the NullPointerException on exactly the right line. So these kinds of optimizations are limited in how often we can apply them. This is where we came in. We said, well, we don't have to scalar replace it; what if we actually do the allocation, but allocate on the stack? The object's shape is preserved as it normally would be, and it just lives on the stack. Unlike a bare primitive it has a bit of extra stuff: a flags field, a class pointer, everything you normally expect from an object. Now, is this useful? Well, yes. Consider this example with a loop. I cleverly return the primitive data type, so my object doesn't escape. But this loop will keep generating new objects every time around: Integer is immutable, so there's a new object for every addition you do. And stack allocation will also work on the previous example, because on both sides of the merge we now have a plain Java object.
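The loop just described would look something like this (a reconstruction; the accumulator and bound are made up):

```java
// A boxed accumulator that never escapes: Integer is immutable, so
// without stack allocation every `sum += i` builds a fresh Integer
// on the heap (once the value leaves the valueOf cache range).
public class LoopBoxing {
    static int sumTo(int n) {
        Integer sum = 0;       // boxed, but local to this method
        for (int i = 0; i < n; i++) {
            sum += i;          // unbox, add, box a brand-new Integer
        }
        return sum;            // only the primitive value escapes
    }
}
```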
Going back to the merge example: the field load in o.getX() now has no problem existing; on one side it will read from the stack, and on the other side it will read from the object we got from the static array. So how did we implement this in C2? That brings me to the second part of the presentation. Charlie Gracie and myself, inspired by the words of Kirk, started looking into implementing this with stack allocation. We had to modify escape analysis in C2 to recognize the cases where we can safely stack allocate objects. Not all non-escaping objects can be stack allocated; I'll show some of the limitations later on, and there are plenty of them. We implemented a stack allocation path in macro expansion, so we had to write a separate path that kept only the part where we stack allocate the object and removed everything else. So how do we stack allocate? This was one of the bigger revelations: we reuse the BoxLock node, which was used for monitors, mainly because we needed a way to communicate a stack slot, which was not expressible in any other way, from the IR back to the codegen, to say: hey, this should be a reference to a point on the stack somewhere. So right now stack-allocated objects end up where the lock slots would be, which is right after all the spills and locals, before the reserved registers on the frame. There was more we had to worry about. As soon as we did that, we immediately got an assert in the garbage collector saying: what the hell is this, you're giving me a reference pointing into the stack, that's not right. So we had to extend the GC stack scanning to support these objects, because what it looks like is a local on the stack that points to another stack location, which the GC doesn't expect. A more subtle issue I'll describe later is detecting live ranges of objects in loops, and there were a couple of other items, such as removing the write barriers.
We obviously can't do a card mark for a store to a stack location that isn't covered by a card. The other items were already handled by similar code we found for scalar replacement, so we were able to leverage that. We had to similarly implement rematerialization of objects on deoptimization: for any safe point the allocation can reach, we had to inject a scalar-replaced object node (SafePointScalarObjectNode, I believe it's called) where we describe which fields need to be copied over to a heap object. So here's what the GC sees when scanning: below the locals you now have a pure stack-allocated object. The first slot will be a flags field, then a class pointer, and then some references. The GC needs to be taught that as it walks the stack, there's a reference over here, and it needs to find all of the object's fields and mark them so no objects are lost. Now, this issue of overlapping live ranges was a subtle gotcha. To be honest, I sort of knew about it, but I'd forgotten; I used to work at IBM on the J9 JIT compiler, where we used to handle this, but we had to learn it again from scratch. It's an interesting case. Say these two objects here are stack allocated. As soon as we get to the second definition, where value = result, what ends up happening is that these two addresses, which are actually addresses on the stack, become the same. So when that definition comes around on the second iteration of the loop, value will hold the address that result had before; but result, being stack allocated, is always at the same address. So all of a sudden, after you enter this if the first time, you never enter it again. Typically, when this was a heap allocation, you would get a new address every time you allocate, whether from the thread-local heap buffer or from the heap somewhere else.
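The pattern being described might look like this (a reconstruction; the helper method and values are made up, and the comments describe what would go wrong if both objects shared one stack slot):

```java
// With heap allocation, compute() returns a distinct object each time
// (for values outside the Integer.valueOf cache), so the identity
// comparison below is true on every iteration. If `value` and `result`
// were folded onto the same stack slot, their addresses would be equal
// from the second iteration on, and the branch would never be taken again.
public class LiveRangeExample {
    static Integer compute(int i) {
        return i * 1000;  // boxed; outside the valueOf cache for i >= 1
    }

    static int countChanges(int n) {
        Integer value = null;
        int changes = 0;
        for (int i = 0; i < n; i++) {
            Integer result = compute(i);
            if (value != result) {   // reference (address) comparison
                changes++;
                value = result;
            }
        }
        return changes;
    }
}
```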
But with a heap allocation it's a new address each time, so the address comparison works, and the location where the object is stored is different. Once its home is the stack, the address is always the same, and that's actually what we want: we want to reuse the slot for the sake of fewer cache misses, as well as to remove the allocation. But then we end up with this problem. So we had to add code to detect this and reject one of the two as a candidate for stack allocation; one of the objects stays on the heap. So, on to the current limitations, the things we can't do right now. We don't stack allocate objects that have monitors; that's kind of a side effect of using BoxLock nodes, and we just didn't finish the work. It's not hard to do, but some of the monitor elimination code eventually compacts the box lock slots, which would move our stack-allocated objects, so we'd mess things up. We just don't do it yet. The second limitation is the main performance issue we have right now: we do not allow stack-allocated objects to point to each other. Obviously a heap parent would mean escaping, and that's already handled by escape analysis, but stack-allocated pointing to stack-allocated is also not allowed at the moment. There are ways to resolve that, but right now we don't do it. We don't have compressed oops support yet, mainly because you can have a merge point with a heap object on one side and a stack-allocated object on the other. The heap object would get stored compressed, while compressing a stack address doesn't work, because you cannot guarantee that the address will be within that 32-bit range. We don't stack allocate arrays at the moment either; we just ran out of time, there's no particular reason why we didn't do it. Primitive arrays would be simple; reference arrays would need special consideration around array copies. We just didn't have time to finish before this presentation. And finally (thank you, Ron Pressler) we may need to do something special for Project Loom here.
Loom prevents stack allocation of objects that live across method calls, because in the mode where Loom does the fast relocation of the stack, it's just a simple memcpy. So if you have a reference on the stack pointing to a stack-allocated object, nobody is there to patch it, to update it. We'll get there. So now some good news: these are the performance improvements we actually got with this prototype. For a compiler guy like me this is amazing, because I would usually work for three or four months for a two or three percent improvement; having a range of applications get significant speedups is really good to see. Stack-allocated objects pointing to stack-allocated objects would be another massive improvement if we get it right, because there are certain patterns in Scala that are very common that do have an object graph pointing to each other, which we currently reject. And so, finally, the last bit of the presentation: when can you see this patch? Nowhere right now. Charlie and I are in the process of migrating our patches from JDK 11. There's no particular reason we were on JDK 11; we were just looking at Spark and continued down that path from the build we were using. We're migrating it to the tip and cleaning up the code, and as soon as that's done, we'll post it to the compiler mailing list and ask for review. That's the plan. Our next steps, from our perspective: we have to stabilize the prototype and clean it up. We still have a few crashes we haven't looked at, and a few benchmarks we couldn't run because of issues. We've started working on removing the limitations one by one; stack-allocated pointing to stack-allocated would probably be my first pick. Right now we only support G1 and Parallel GC, where the marking was extended to walk stack-allocated objects, so we need to sit down and see how it would actually work in the other GCs.
And finally, we'll look for more opportunities in other real-world applications. I want to see if we can improve the various REST frameworks out there that people build with, and maybe Elasticsearch, things like that. Which leads me to the end of the presentation: if you like what you saw here, please stay in touch. We'd like to work with everyone here; both Charlie and I are really new to the code base and need a lot of help to actually make this a reality, and help reviewing the patch would be awesome if anybody is willing to do that. That's it, thank you. Any questions? Do we have time? Yes, five minutes.

[Audience] Can you say something about your write barrier implementation? You said that it's always a performance win, but if the stack allocation fails, then presumably it gets more expensive, right?

That's right, and that's exactly where I was going; I had two appendix slides about the objects-referencing-each-other issue, which leads me to the stack write barrier. We currently remove the write barriers on stack-allocated objects, because we are sure that when we make an object a candidate for stack allocation, it will never become part of something else. So if it has a field and you write into that field, you don't need a write barrier, because nobody is ever going to see that object; it lives on the stack. Now, the reason why we can't do stack-allocated pointing to stack-allocated is exactly this case. Take this example where we have two objects pointing to each other. When we get to the bottom part, we load the original test object from the wrapper and do t.x; everything is fine, we removed the write barrier. But what if there was another call in between, like this, that actually gave us a heap object?
Now coming down here, it's either a heap object or a stack object on which we do t1.x, so we don't know. In that case there are two ways to handle it: we can detect the case in the analysis and reject it, which is a sure rejection of candidates, or we can extend the write barrier to actually look at the stack ranges and say: yes, this falls within the stack, you're good, keep going, don't worry about it. That would increase the cost of the write barrier if we took that approach.

[Audience] Very interesting results, thanks a lot. Quick question: have you considered allocating such objects on the heap instead of the stack? I mean, reserve a special heap region, like TLABs, but just for the duration of a single call, or a chain of them. By doing that you could significantly simplify the requirements on the runtime: you don't need to treat special objects on the stack, since everything stays on the heap.

I'd have to consider how that would work; I can't think it through in my head right now, but yeah, we'll think about it. Maybe there's a way we can actually do that, we didn't know. There was a question here?

[Audience, partly inaudible] ...we had some benchmark that was very problematic and stressed the flow-sensitive case you were mentioning before, especially with immutable types. I measured it myself; it actually creates copies every time, which we're hoping this would take care of.

[Charlie] I just want to be clear that there's still a limitation there: there's no partial escape analysis here.

Yeah, absolutely. We can look into that next; right now we're not doing it, so any time we see something that escapes, we give up on it, even if the escape is in a cold call or something that's never reached. OK, well, I think we're done. Thank you. Thanks.
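The appendix example discussed in the Q&A isn't shown in the transcript; a reconstruction of the pattern might look like this (class and field names are made up):

```java
// Two objects pointing to each other through a wrapper. If `w` and the
// first `T` are stack allocated, the stores into their fields need no
// card-marking write barrier. But a call can hand back a heap object,
// so after the merge the store target is either stack or heap, and the
// barrier can no longer be removed safely.
public class WrapperExample {
    static class T { int x; }
    static class Wrapper { T t; }

    static T fromHeap() {
        T t = new T();
        t.x = 7;
        return t;             // escapes: this one lives on the heap
    }

    static int demo(boolean swap) {
        Wrapper w = new Wrapper();
        w.t = new T();        // barrier removable if everything stays local
        w.t.x = 42;
        if (swap) {
            w.t = fromHeap(); // now w.t may be a heap object
        }
        return w.t.x;         // t is either stack- or heap-resident here
    }
}
```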