Video details

Debugging a Non Reproducible Crash - Alexandre Moureaux, React Advanced 2021

React
10.24.2021
English

React Advanced 2021 #ReactAdvanced #GitNation Website – https://reactadvanced.com/
Follow the link to watch the full version of all the conference talks, QnA’s with speakers and hands-on workshop recordings → https://portal.gitnation.org/events/react-advanced-conference-2021
Talk: Debugging a Non Reproducible Crash POV: Your app has a crash affecting thousands of users, but for the life of you, you can't reproduce it and have no idea what's causing it. Hear the story of an epic struggle to vanquish a non reproducible bug and learn what to do (and what not to do) when facing such a foe.

Transcript

Today, I'm going to tell you a story: the story of a bug and our fight against it, a bug so vicious and cruel that it actually caused us no less than 200 crashes. But introductions first. Hi, everyone. I'm Alex. I'm very excited to be here at React Advanced London. I'm a tech lead at BAM. We're based in Paris, and we develop mobile apps in Flutter, native, and, of course, React Native.

Our story begins in October. We're a team of nine people, and we're very happy and proud to release version 4.3 of our app. Why are we so happy and proud? Because we were getting ready for the October 11 live event that the app was covering, and we were adding a lot of essential features to the app. Super. We're super happy.

But then the unexpected occurs: suddenly our crash rate goes up significantly. Our crash reporting tool, Sentry, is under heavy fire. It's reporting an exception every minute, then a lot of exceptions every minute, then basically an exception every second. It's getting overwhelming. All of those exceptions are a bit different, but they all kind of have the same shape. They're like this: a JSApplicationIllegalArgumentException, error while updating property, a certain property, in shadow node of type, a React Native component.

So the first thought is: we did QA this release. We did test it a lot. Why did we not see this happening? Also, if you search a bit more about this error, it tends to happen if you set a wrong value on a style. For example, if I set paddingTop to NaN, not a number, this is what would occur. So it sounds like something quite easy to detect. Well, maybe it happens only in certain extreme cases that we have not tested properly before. But it turns out that Sentry is reporting that it happens for every user, on every Android device.
So this is an Android-only issue, but all Android devices are affected. Also, in our app, you can favorite a team, for example, to change the experience of the app a bit, but it doesn't matter. Whichever team you're favoriting doesn't impact this; you're getting the crash.

All right, we have a big crash. We have a big fire to put out. So let's start by trying to reproduce the crash. Fortunately, we configured Sentry, our crash reporting tool, to tell us what the user was doing before triggering the crash. Here we see that the user is opening the app, starting the first screen of the app, which is called Home. And boom, it crashes instantly. So basically, you're telling me that it affects any device, it crashes on startup, it affects any user, and we can't reproduce it. We've never seen it before. How is that even possible?

All right, I guess step two, if you can't reproduce, is analyzing the stack trace. So let's take a look. I did say that we have several different errors. Let's take a look at the first one. This one is an ArrayIndexOutOfBoundsException. It's a Java error, and it's happening in a class called SimplePool, a class from the Android v4 support library. It's happening in SimplePool.release, at line 116 of Pools.java.

To be honest, at this point, I'm like, I don't even know what SimplePool is, and I don't even know why I'm in the Android source code. There's a big fire to put out, and it feels like it's going to take a lot of time to figure out what's going on, because I don't really understand this. So let's find an easier solution to put out the fire. One idea would be: could we just roll back our release? Well, if you're a mobile app developer, you know that we can't really roll back a release. We have to deploy a new release with the old code.
So it's kind of annoying. It means that users will get an update of the app that just reverts everything. At this point in time, we know that our crash rate is about 10%. So a user opening the app has a one in ten chance of crashing. But it seems that whenever they restart it, it works. Also, this release has great value for users. It turned out to be one of our highest-rated releases despite this outstanding crash. So we thought, no, let's not roll back. It's not the end of the world. A 10% crash rate is outrageously big, but let's try to fix it another way.

All right, we know that the crash rate is 10%, so I can devise a battle plan. I'm just going to take six Android devices. I'm going to trigger, with a script, ten app launches per device. So statistically, I should get like five to ten crashes, right? At least that would be some kind of reproduction. I would finally be able to see the issue, and if I get a fix, I would be able to test it. The result was that I didn't get any crashes. None whatsoever. Quite unlucky. So I guess we need to find something else.

Another idea was: what actually changed? Our previous release was not crashing. This release is crashing. So what did we introduce between the two releases that crashes the app? My thought at this point was that maybe we should take a look at the native dependencies we upgraded, because this is a Java exception, so it happens in native code. So probably the culprit is a native dependency that we upgraded. It turns out that we upgraded two native dependencies since the last release. The first one was react-native-svg, and the second one was native-navigation. You probably don't know native-navigation. It's actually a fork that we made of an Airbnb navigation library, which uses, well, native navigation.
And it turns out that we ourselves added some features to improve the performance at startup. So it sounds like a very nice culprit. We upgraded it to improve the performance at startup. We get crashes at startup. It sounds like this one should be the culprit behind our crashes.

As you may know, in the Play Store, you can roll out a new version of your app to only a subset of your users. For example, you can roll out the new version to 10% of your users. That allows us to devise a new battle plan. If native-navigation is the culprit, we can test it. We downgrade native-navigation. We release a new version that we roll out to only 10% of our users. We check back. We should be able to see in an hour or so whether the new release is successful. If it is, then we roll out the new release to everybody, because the crash is fixed. But what if it doesn't fix the crash? I guess in this case, let's just downgrade the other one, the SVG library. We do the same. We roll out to 10% of our users. We check back. If it's a success, full rollout. Okay, cool. We won.

But what if, again, that didn't fix the crash? That would mean that we updated our app twice, and each time, 10% of our users got an update that didn't do anything and didn't fix the crash. That's a potential source of uninstalls: when users get a lot of updates of an app that don't do anything for them, it sometimes happens that they uninstall the app because of this. So to be honest, that plan is, yeah, kind of dumb.

All right, I guess at this point we need to go deeper. We really need to understand the bug, and we really need to analyze it. So let's take a look. Again, our bug, as you recall, was an ArrayIndexOutOfBoundsException.
All right, let's take a look at where it was happening. It was happening, as you might remember, in a class called SimplePool inside the Android v4 support library code. Basically the bug was this. We have an array of objects called mPool, and we have an index called mPoolSize. And we're trying to access this array at index mPoolSize, which apparently equals minus one. Now, you don't need to be an expert Java developer to know that accessing an array at index minus one is really not a good idea. So you can understand where the crash is coming from. The value of mPoolSize is minus one, which is not good.

So the question now is: what can modify mPoolSize? It is modified in only one place. It's initialized to ten and then it's only decreased in this function called acquire inside SimplePool. This is the only place where mPoolSize changes, in this function, and it gets decreased. But you might notice something, right? There's a condition there to protect it from going below zero: if mPoolSize is greater than zero, then decrease mPoolSize. So it sounds impossible that mPoolSize would become minus one, because if mPoolSize is zero, you cannot decrease it any further. That really sounds impossible.

Okay, I guess it's time to bring out our ultimate weapon. And of course, I'm talking about the debugger. Against the bug, the ultimate weapon: the debugger. So, let's open Android Studio. The famous function which decreases mPoolSize was the function acquire. And in the stack traces that we were seeing, we would see that this was called from the React Native code, in the class called DynamicFromMap and the function create. So exactly this line. All right, let's put a breakpoint there.
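As a reading aid, the guard described above can be sketched roughly as follows. This is a simplified, hypothetical pool written for illustration, not the support library's actual source: the class name `SimplePoolSketch`, the field names, and the `size()` helper are assumptions. It includes a `release` method because the stack trace pointed there, and it shows why, on a single thread, the counter can never drop below zero.

```java
// Simplified, hypothetical sketch of a fixed-size object pool, modeled on the
// description in the talk. It is NOT the support library's actual source.
class SimplePoolSketch<T> {
    private final Object[] pool;
    private int poolSize; // number of instances currently stored in the pool

    SimplePoolSketch(int maxPoolSize) {
        pool = new Object[maxPoolSize];
    }

    @SuppressWarnings("unchecked")
    T acquire() {
        // The guard from the talk: only decrease the counter if it is above zero.
        if (poolSize > 0) {
            int lastIndex = poolSize - 1;
            T instance = (T) pool[lastIndex];
            pool[lastIndex] = null;
            poolSize--;
            return instance;
        }
        return null; // empty pool: the counter is left untouched
    }

    boolean release(T instance) {
        if (poolSize < pool.length) {
            // If poolSize were ever -1, this array write would throw
            // ArrayIndexOutOfBoundsException.
            pool[poolSize] = instance;
            poolSize++;
            return true;
        }
        return false; // pool is full, instance is not kept
    }

    int size() {
        return poolSize;
    }

    public static void main(String[] args) {
        SimplePoolSketch<String> pool = new SimplePoolSketch<>(10);
        pool.release("a");
        pool.acquire();
        pool.acquire(); // second acquire on an empty pool is a no-op
        System.out.println("size after extra acquire: " + pool.size());
    }
}
```

On one thread, calling `acquire` on an empty pool simply returns `null`, which is why a negative counter looks impossible at this point in the story.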
So, the first hit of the breakpoint: basically I just ran the app, and I was hitting the first breakpoint quite fast. It's telling me that DynamicFromMap is used by React Native for component properties. Here we're updating the width of a certain component. You can imagine that this is going to happen a lot, because every time we modify a style, we hit the breakpoint. So, yeah, breakpoint hit two was kind of similar. And actually, I clicked 34 times on the Play button and had similar results.

But, oh, revelation at the 34th click, because from hit 1 to 33, I'm getting something like this. Hit 34, I'm getting something like this. There is a very subtle difference in those hits: from hit 1 to 33, the thread that Android Studio was reporting is mqt_native_modules. This is the thread that React Native usually uses to deal with native code through the bridge. But on hit 34, the thread used is the main thread.

So what this means is I was not triggering the bug in this case, but this gave me a very big clue. This function, DynamicFromMap.create, could be called from different threads. If we take a look at hit 34, we notice that in this case, the property being updated was a property called fill, and this really doesn't sound like a React Native style property, right? Indeed, it's actually an SVG property. So it was the SVG upgrade all along that caused this bug.

So let's see what can happen. With react-native-svg, we upgraded to v7, and they started using this code, DynamicFromMap.create, to improve the performance of native SVG animations. But they were using it from the main thread, while React Native was using it from mqt_native_modules. So how can this impossible condition actually happen?
Well, when you have something impossible happening in Java, it's usually because of thread safety. As JavaScript developers, we're not really used to having multiple threads and having to deal with that. But when you do React Native, you also get Java in the mix. So you get thread safety in the mix.

So here, this impossible condition: it could happen that two threads, thread A and thread B, reach the condition "if mPoolSize is greater than zero" at pretty much the same time, both think it is greater than zero, and both enter the condition at the same time. That means they both decrease it. It's kind of like this: thread A sees that mPoolSize greater than zero is true. Cool. But it doesn't have time to decrease it yet. It doesn't have time to get out of the function yet, because thread B is entering the condition as well, and checking that mPoolSize greater than zero is true. If at the beginning mPoolSize is one, then it's still one when we check the condition for thread B. Then what happens is they both decrease mPoolSize, so it becomes zero and then minus one. So wow, we know where this is coming from, and this is why it was so hard to reproduce: it's a race condition that was very tricky to trigger.

So let's fix it. When we investigated, we found that there was a pull request on React Native dealing with this, dealing with thread safety on DynamicFromMap.create. So in collaboration with the React Native core contributor who submitted this pull request and the react-native-svg maintainers, we devised a final battle plan. We patched React Native locally. We deployed this version to 10% of our users, just in case, to check. And then, of course, we checked back. Was it successful? Yes. Finally we fixed it, and our crash rate was back to normal.
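The thread A / thread B interleaving described above can be reproduced deterministically in a small sketch. Everything here is an illustrative assumption: the class name `CheckThenActRaceSketch`, the `CyclicBarrier` used as a test hook to force both threads past the check before either one decrements, and the `synchronized` variant, which is one common remedy and not necessarily what the React Native pull request did.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the check-then-act race from the talk.
class CheckThenActRaceSketch {

    // Unsynchronized check-then-act. The barrier is an artificial hook that
    // makes both threads pass the "size > 0" check before either decrements.
    static int unsafeFinalSize() {
        AtomicInteger poolSize = new AtomicInteger(1);
        CyclicBarrier bothPassedCheck = new CyclicBarrier(2);
        Runnable acquire = () -> {
            if (poolSize.get() > 0) {           // both threads see 1 here
                try {
                    bothPassedCheck.await();    // wait until the other thread has checked too
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
                poolSize.decrementAndGet();     // 1 -> 0, then 0 -> -1
            }
        };
        runOnTwoThreads(acquire);
        return poolSize.get();
    }

    // Same logic, but the check and the decrement happen under one lock,
    // so the second thread sees 0 and leaves the counter alone.
    static int synchronizedFinalSize() {
        int[] poolSize = {1};
        Object lock = new Object();
        Runnable acquire = () -> {
            synchronized (lock) {
                if (poolSize[0] > 0) {
                    poolSize[0]--;
                }
            }
        };
        runOnTwoThreads(acquire);
        return poolSize[0];
    }

    private static void runOnTwoThreads(Runnable task) {
        Thread a = new Thread(task, "thread-A");
        Thread b = new Thread(task, "thread-B");
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unsafe final size: " + unsafeFinalSize());
        System.out.println("synchronized final size: " + synchronizedFinalSize());
    }
}
```

In this sketch the unsafe variant ends at minus one and the synchronized variant ends at zero. In a real app there is no barrier forcing the overlap, which is consistent with the crash showing up for only a fraction of launches and never on demand.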
All right, well, this was fantastic, but maybe a few takeaways from this. The first one is this: you should use your crash reporting tool extensively, and you should configure it to be able to use it, because you're going to get crashes in production, and probably you're going to get crashes that you can't reproduce. So you should know what the user was doing before triggering the crash. Out of the box, you're not necessarily going to have this in your crash reporting tool, so you should set it up so that it's easy to see, for example, the screens that your user is navigating to. You should also add as many details about the user as possible, of course in a GDPR-friendly manner. For example, in our case, adding which teams the user was favoriting to change their experience, because sometimes you trigger bugs only in certain cases in your app.

Then you should, of course, monitor your release health. A 10% crash rate, of course, is outstanding. It's really bad. I mean, a 0.2% crash rate is a bit better. The market standard is about 0.3 to 0.4% for Android; it's even lower for iOS. And if you do that, it also allows you to do one thing: protect your users. That's what we did. After this, every time we were deploying a new release, we were rolling it out to 10% of our users first. Of course, we should never have outstanding crashes like this in those 10%. But in case it does happen, at least we've impacted only 10% of our users, so the rest of the users are not impacted. And of course that means you're able to know if the release was successful, so you're able to monitor the health of your release. And of course, you need to have time between the initial rollout and, for example, a live event. In our case, we had a live event on October 11, and we did the release on October 9. Not really a good idea.

And the final one is this: you can learn a lot by digging deeper.
I have never learned as many things as when I was going through a bug that I could not reproduce. I dived deeper into the code, into the source code of the libraries I was using, and every time I learned so much. And that's it. Thank you for watching, and do hit me up if you have any questions, on the Discord channel or on Twitter. Thank you, bye.