Video details

How to Fail at Serverless (Without Even Trying) - Kam Lasater

Serverless
07.17.2022
English

Serverless is great. I love serverless. I'm not always sure serverless loves me. I want to share some of the ways I've failed with serverless apps going from local, to deployed, to running with production workloads.

Transcript

It feels a little bit like a confessional here, so I hope this can be a safe space. A little bit of a trigger warning: hopefully a friend of yours may have made these mistakes. I know I've made plenty of mistakes in my career. In the transportation industry, or with nurses, mistakes are things people talk about, right? And that's how we prevent making mistakes in the future. So that's what I'm trying to encourage here. Let's talk about things that didn't work out so well, maybe have a few laughs. Hopefully we can find some humor here. This is AWS focused because everybody's in AWS. Sorry Cloudflare, and sorry Microsoft, but I guess that's the world.

So, serverless is stateless. Who agrees with this? Only a few. Okay, that's fine. Classic serverless use case: we've got a step function breaking up some PDFs, doing some image manipulation, maybe reading files in from S3, writing them back out to S3. So everything's great, right? This application works wonderfully. I see some smirks. How does this fail? Any suggestions from the audience for how this might fail? Oh yes, /tmp is stateful per instance, right? But if you have the right sized files and a bursty load, you'll flush /tmp because you'll be flushing those instances. So after some time it'll fail, but as soon as it fails, that instance gets flushed, another instance starts, you restart from the state machine, and error handling takes care of it.

So we clear out /tmp, say, okay, that was a stupid one, we can do better next time, and we keep running. Our deployment cadence on the app starts to decrease to maybe every two weeks, and we keep getting errors on this. We start to get out-of-memory errors that people can't figure out, so they go and turn debugging on, right? Well, touching an environment variable on Lambda recycles your instances. We had to take a Lambda, put it in isolation, keep the concurrency at one, and start running images through it until we found that after 12,000 images we hit an out-of-memory error. What was happening? ImageMagick apparently has a 20-year-old bug where it leaves a file descriptor open, so you would eventually fill up memory and fail on the Lambda. This was made more difficult to find because of the recycling on environment variable changes, but that led to the actual solution, and I don't think there's anybody from AWS here: after processing an image, if you just update an environment variable on yourself, you will flush all versions of the Lambda that are running and you'll just recycle. I see some smirks; I think there are some people in the audience who have done this. So we've just proved that serverless is stateful.

Another use case. We're in a microservice environment using JWT tokens. Everything's great. You now want to add session invalidation, so you put in Passport.js. Everything works great in dev and test. How does this fail? Well, as soon as you have a second instance, right? You have a second memory space. Boom. The solution that I've heard is to try Cognito. I would leave that as an exercise for the user; I've not heard positive responses after somebody goes home and does that on their own. We got burned by Cognito too. It's not a very funny story, so I don't know if it's to be included here.
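To make that environment-variable recycling trick concrete, here is a minimal sketch, assuming the function's execution role is allowed to read and update its own configuration; the marker variable name is made up for illustration.

```python
# Hedged sketch: force Lambda to recycle its warm execution environments by
# touching an environment variable on the function itself, as described above.
# Assumes the role has lambda:GetFunctionConfiguration and
# lambda:UpdateFunctionConfiguration on this function; RECYCLE_MARKER is a
# hypothetical variable used only as a change marker.
import os
import time
import boto3

lambda_client = boto3.client("lambda")

def recycle_self() -> None:
    """Bump a throwaway env var so Lambda tears down warm instances."""
    function_name = os.environ["AWS_LAMBDA_FUNCTION_NAME"]  # set by the runtime
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    variables = config.get("Environment", {}).get("Variables", {})
    variables["RECYCLE_MARKER"] = str(time.time())
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
```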
What about this one: S3 is global. Who agrees that S3 is global? Nobody thinks S3 is global. What's that? No. So, it's great, we built in us-east-1 because everybody else is building in us-east-1. That's where all the cool kids are. We have our path thing, our S3 paths, and of course you want to hard-code your strings in all your CloudFormation templates and app code and CI/CD scripts and your manual manipulation. So how does this fail? What happened? I think it was fall of 2019: a global DDoS on the S3 global endpoint. So if you'll notice here, it's the global endpoint versus the regional endpoint. S3 was still up, the regional endpoints were still up, the regional DNS was still up. Everything was still there and accessible, right? All the nines of S3. The one thing that was down was the global endpoint. Well, what was the one thing that our whole application and all of its build systems and all of its associated maintenance code relied on? The global veneer on top of this multi-regional service. So we couldn't launch new versions of the code, we couldn't build new versions of the code. We would have had to rewrite the deployment scripts for the build system and change how the build system stored and retrieved things. Oh, and again, bonus for the user: if you implement best practices for CI/CD, you only let code get into production if you've merged it through a GitHub PR. So now you have to go and touch more systems, and the controls that were supposed to keep you safe and keep your system stable are now keeping you from actually recovering from an outage.

Okay. S3 has infinite read. Yeah. Okay, good, I got one. So, use case: data lake. We're just writing all our millions and billions and trillions of images and JSON files into S3. We were processing things in batches coming from our clients, so we keyed it based on day, and then we had a random transaction ID, and then the type, whether it was a particular final version, or some debugging internal output, or maybe some particular attribute about the package that came in. And so we wanted to add reporting to this. So we took Athena, the serverless reporting tool, pointed it at this data lake, and started to write some reports and render them into QuickSight. Everything's good, right? Works in dev, works in test. You ship it and then go off on your Christmas holiday, right? Yes. Athena and S3 seem to like communicating in 429s. This is what we learned. You have to pay extra to learn this, because you have to turn S3 metrics on so that you can then determine that S3 is returning 429s to Athena. Now, Athena has all this retry logic, and it's going to pull as much data from S3 as fast as it can, so it welcomes the 429s. It just says, okay, well, how much data do you have and how many shards do you have? And that's how it communicates. And S3 at the time (I don't know if anybody here knows this) needed a fixed path and then some randomness, a fixed prefix and then some randomness, and that's how it would do the sharding internally, right? Because it needs to decide which server the data is actually going to go on. So let me just jump back for a second. If we look at this, we see that we had the day key in there. So that would create a non-fixed path for S3 to shard across, and it was hot-spotting. Athena was now hot-spotting a single shard, which was exactly the shard we needed for our transactional writes and our transactional reads.
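As a rough illustration of that keying problem, here is a sketch contrasting the day-first layout with one that puts the high-cardinality randomness first; the field names and layouts are illustrative rather than the exact keys we used, and current S3 partitioning behaves differently than it did at the time.

```python
# Hedged sketch of the two key layouts discussed above. The point is where
# the randomness sits in the key, not the exact path pieces.
import datetime
import uuid

def day_first_key(data_type: str) -> str:
    """Original style: day, then a random transaction id, then the type.
    Everything written 'today' shares one leading prefix, so traffic
    concentrates on a single S3 partition (the hot spot)."""
    day = datetime.date.today().isoformat()
    transaction_id = uuid.uuid4().hex
    return f"{day}/{transaction_id}/{data_type}/output.json"

def random_first_key(data_type: str) -> str:
    """Alternative style: put the high-cardinality random piece first so
    objects spread across partitions, keeping day and type for humans."""
    day = datetime.date.today().isoformat()
    transaction_id = uuid.uuid4().hex
    return f"{transaction_id}/{day}/{data_type}/output.json"
```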
Some other great bonuses here: S3 is so reliable that we had no tests in our code to check if S3 failed on a read or failed on a write. We assumed that if S3 is not here, the world has come to an end, and we'll just restart the transaction at the top. Just restart the step function, because something terrible has happened, and we don't need retries in the code itself. I already talked about the timing: definitely do it before you leave for Christmas vacation. That's a good time, everybody's free anyways.

If you use your reports to check on system health, so you can tell whether your transactions are succeeding or failing, then you can give that to your users, but you can also use it as operators. And then when there's a problem in the system, the operators can go to the dashboard and hit refresh as well, so they can all trigger Athena queries onto the same hot-spotted prefix. So you can DDoS yourself. You can actually help Athena DDoS yourself. Yeah. So that was a fun one. So, of course, right after that, we're going to turn the reports into scheduled refreshes and do some other things.

So, S3 has replication. I don't know, this is a true statement; that's not a trick question here. Okay. We were hot-spotting the production data lake, so we should probably replicate it, create a read instance, point our Athena reports at that, and then schedule refreshes. So that was great, everything was working, right? We start ingesting more data, we start adding more graphs, more Athena queries. How does this fail? Well, finance calls: our data is now in us-east-2 and our Athena queries are still being run from us-east-1. And we are now reading several terabytes a day, or more like several terabytes an hour, and paying Amazon for the privilege of shipping data from Ohio to Northern Virginia, processing it in Northern Virginia, and then putting up a graph at 1:00 a.m., and then doing it again at 2:00 a.m. and 3:00 a.m. It's hard, though, to deploy your code if your build system is in us-east-1, and to change it all and set it up to deploy in us-east-2. Finance just thought it was getting a little expensive. So if you just turn down the refresh on the reports that you don't need, you may not have to actually move your code. So this might not actually be fixed; I don't know, I left before that was the case. So we'll put a question mark next to that one.

So, S3 has infinite capacity. Yeah. Yes. No, I think it has infinite capacity to charge you money; that's my general opinion. The lifecycle policy is optional, right? And it's only 2.3 cents per gigabyte-month, right? It's nothing. You just write another JSON file, write another image, write some interim debugging output. Again, this was sort of the keying structure; the mandate was, when in doubt, write it, we can always clean it up later. We have this day-based path thing, so we can delete things. It's not a problem, right? And this was performant on a transactional level for operating and for doing the Athena queries: we could use that day to downscope things, and when we were transforming into Parquet format, we could use the day as a partition key. So it was performant, except for the reading-across-regions part; let's leave that out for a second. We could read just the day that mattered.
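On the retries we admitted we skipped, here is a minimal sketch of what leaning on boto3's built-in retry configuration could look like; the retry mode and numbers are illustrative, not what we actually ran.

```python
# Hedged sketch: give the S3 client explicit retry behaviour instead of
# assuming that if S3 fails, the world has ended and the whole step function
# should restart. Values here are illustrative.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(
        retries={
            "max_attempts": 10,   # retry throttling and transient 5xx errors
            "mode": "adaptive",   # client-side rate limiting when throttled
        }
    ),
)

def read_object(bucket: str, key: str) -> bytes:
    """Read an object, letting botocore absorb transient failures before we
    give up and fail the surrounding transaction."""
    response = s3.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```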
So, any thoughts on how that might fail? Okay, well, finance calls again. This time it's a little bit trickier, because we're now storing so much data: we've increased the amount we're writing per transaction, we've increased the number of transactions that we're doing, and we're now ready to implement a lifecycle policy. That keying structure I just talked about being performant? Lifecycle policies need a fixed prefix. You can put a star in, but it's a star at the end of the lifecycle policy. So if your type of data determines how long you want to store it, you have a problem. If the type is final output, for example, you want to store that for seven years; we were in the finance industry, so you want to store that for a long time. If the type is some interim image, or some interim blob of your OCR output or something bloated like that, you just want it for debugging; you don't mind flushing that after a few months. So this created a problem where we couldn't actually flush the data that we wanted to flush, and we were going to have to keep all the data for seven years.

So, helpfully, Amazon suggested that we go and transition all of the files to Glacier, which would have worked; that would be a really nice solution and would really have dropped the price. There's one small little warning that pops up in the console when you go to transition buckets or objects to Glacier. I don't know if anybody's seen this: a warning that transition request charges may apply. I was curious: okay, what charges exactly? And if you calculate it, at least on this bucket we were looking at, with our million, billion, trillion, however many objects we had, the rough cost was about $100K to press that button in the console. I don't know, has anybody pressed the $100K button in the AWS console before? I have not. The film guy in the back has. And as you can imagine, not a lot of operators, SREs, DevOps folks, or developers are going to say, oh yeah, I have the authority to press that button. So of course that creates this organizational discussion, right? It leads back into: now we really need to gather requirements about how long we're keeping this data. Why is it costing us so much? Why is it going to cost us $100K? Oh, by the way, it's only costing us, what, $15-20K a month right now? Let's have another meeting on it. You find that your team starts to rotate out of your company, and if you do this long enough, you can actually have everybody rotate out. And so it could still be there; I don't know if they've even implemented the path-based policy to do the seven-year retention on anything. So this one's a little bit frightening.
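For reference, here is a sketch of what a prefix-scoped lifecycle configuration looks like in boto3. It assumes a hypothetical bucket where the data type is the leading part of the key, which is exactly what our day-first layout did not give us; bucket name, prefixes, and retention periods are all placeholders.

```python
# Hedged sketch: lifecycle rules are scoped by a *leading* prefix, so a
# day/transaction/type key layout cannot be targeted per data type this way.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",  # hypothetical
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-debug-output",
                "Status": "Enabled",
                "Filter": {"Prefix": "debug/"},  # only works as a leading prefix
                "Expiration": {"Days": 90},
            },
            {
                "ID": "archive-final-output",
                "Status": "Enabled",
                "Filter": {"Prefix": "final/"},
                # Each transitioned object incurs a per-request transition
                # charge, which is where the "$100K button" comes from at scale.
                "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
            },
        ]
    },
)
```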
Infrastructure as code is just YAML, just some config, right? More smirks. Okay, I know. I guess the cockroach gives it away, though, right? No tests needed, right? You're leading the industry, you're doing better than all those other people, just go out to happy hour and everything's awesome, right? This got us into trouble in some really interesting ways. Infrastructure as code, when it contains states language and defines step functions, is more than config. Step Functions were great; they saved us from a lot of other problems, and given the batch transaction processing we were doing, they were a very good fit and helped us a lot in visualizing it, so they were very powerful. But they also bound that infrastructure as code, that YAML file, to the control flow of our code. I don't know very many organizations that test the error paths, both the known errors and the unknown errors. Well, this makes it more difficult. I haven't seen good testing frameworks for how to inject errors into the states language inside of infrastructure as code. Maybe there are some out there that I'm not aware of. That's fine.

Another embarrassing one, so as we come to the confessional, continue the confessions. We had a VPC that was in infrastructure as code. That was great, everything's wired up, it's named nicely, driven off of variables or attributes. If you don't check that the attributes are actually different on the input to your template, you may deploy a VPC that you think spans multiple AZs but is actually just defined with duplicate subnets in the same AZ. Does that make sense? The other thing we found out came from an outage, luckily not a production outage, where the mapping of AZ names to physical AZs did not match across our accounts. So it would fail in one account, but it would not fail in another account, even though the variables into the VPC infrastructure as code were the same. We found out later that the mappings were different per account, and we couldn't bind new accounts to be exactly like all our other accounts. I'm not sure if this has been fixed at the platform level; I'm seeing some nodding, so this may have changed. Ask your TAM, I guess, is the takeaway there. Yeah. And also you can find this during a failover test or a DR test. So if you're doing tests of failure, this is probably a good place to do it.

More infrastructure as code fails. Yeah. ELB access logs: if you're not going and checking that they're actually getting emitted, even if you think you're generating them, they might not be there, I guess is my point, and you'll notice when you get DDoSed and you're looking for who is sending you a bunch of traffic.

This one, I don't know why. Amazon loves their two-pizza teams, and where it gets incredibly frustrating to me is that Lambda, very helpfully, will create a log group for you. That log group is named something that Lambda decides on, and it will set the retention policy to infinity, because it doesn't know how long you want to keep the logs. I mean, I don't think they know that they want you to keep them forever, but that's what ends up happening, because you end up creating more and more log groups: each time you launch a new stack, you create a new log group that you're going to have to then go into CloudWatch to delete. Which, by the way, CloudWatch now has a select-all option for deleting log groups. I think this is literally from customers who have created so many never-expire log groups that they got PO'ed, and finally CloudWatch decided to add that functionality, versus Lambda letting you configure which CloudWatch log group to create or to pipe logs to. The other problem is, once you've created the Lambda, if the log group is already named, how are you ever going to pull it into infrastructure as code? I've never seen that happen. We've always just ended up changing the name of the Lambda and going in and cleaning up manually in the back end. Did you have a... it's coming? Okay, we'll take a look. The tooling that I have seen, I think, could best be described as needs improvement.
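One workaround I have seen is to create the log group yourself before Lambda does and pin a retention policy on it. A minimal sketch with boto3, assuming a hypothetical function name and an arbitrary retention period:

```python
# Hedged sketch: pre-create the Lambda's log group and set retention, so
# Lambda doesn't create it for us with "never expire".
import boto3

logs = boto3.client("logs")

function_name = "image-processor"           # hypothetical function name
log_group = f"/aws/lambda/{function_name}"  # the name Lambda would pick

try:
    logs.create_log_group(logGroupName=log_group)
except logs.exceptions.ResourceAlreadyExistsException:
    pass  # Lambda (or an earlier deploy) got there first

logs.put_retention_policy(logGroupName=log_group, retentionInDays=30)
```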
Native runtimes are the best on Lambda. I actually do think the native runtimes are the best, because isn't the whole point of Lambda that you don't have to build your own OS, you don't have to manage all this complexity? Where it does get problematic is the best practice, as I've heard it. Well, how did we run into this problem? We ran into this problem because AWS, on Python (I believe it was maybe 3.2 or 3.4), stopped bundling requests as part of their boto3 SDK. They can go and update the SDK as part of the native runtime, and they generally will send you alerts, but if those alerts are not going to the right developers in your organization, somebody might be helpfully archiving them and leaving them for somebody else to deal with sometime later. So they can change the SDK out from under you, and the best practice, even though the SDK is included as part of the native runtime, is that AWS suggests you pin the SDK by including it as part of your own code bundle. Well, for Node at least, that's a very svelte 75 MB of SDK, or 30% of your budget for code. The other option is to use custom runtimes, which you get the privilege of paying for, along with increased cold start times; I don't know if other people have better stats on this, but we were seeing roughly 200 to 1,000 millisecond slowdowns in container Lambda cold starts versus native runtimes. So, kind of painful either way; I still go with the native runtimes.

Scales on demand, yes, to infinity and beyond. So we were working in a microservice environment and we wanted every service to pack its own backpack, which meant it needed to deploy all the infrastructure it needed; it couldn't rely on some other service to do it. And so to do integration tests, we needed to deploy fleets of CloudFormation stacks for each git push. So on each PR we deploy this fleet of CloudFormation stacks. You isolate each developer from each other, and each change that they're making from each other. It really worked out great... well, until you hit control plane API limits. For REST APIs, you can only create two custom domains per minute. I think CloudFormation is known for its minutes to dopamine, or maybe hours to dopamine. This was definitely helpful for developers; it gives them the xkcd excuse, except instead of "my code's compiling" it's "it's launching on CloudFormation, that's why I'm shooting pool or shooting Legos with a Nerf gun" or something. It's also great: I would suggest you turn on Config, and then you can pay and make sure they hit their quarterly revenue numbers with all these infrastructure changes you're making. Because in your dev environment, you definitely want to see your highest bill going to Config when you're running serverless code. Apparently (I've been talking to our rep) the suggestion, or the best practice, for how to use Config in a dev environment when you have serverless is to turn it off. Still pending, but it seems like an odd suggestion from the rep we've talked to.

Okay, how am I doing on time? Serverless is secure. I get a big laugh out in the back. Okay. Luckily, this did not make it any further than an architecture review, when the service architect kind of pulled us aside and said: the way you wrote that Lambda permission policy, technically any API Gateway inside of AWS could invoke that Lambda, because you're just constraining it to the API Gateway principal and lambda:InvokeFunction. You didn't actually restrict it to a source ARN of the exact API Gateway, or to the account that you were in.
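Here is a sketch of the tighter version of that permission, scoping the invoke grant to one specific API with a SourceArn. The function name, account ID, and API ID are placeholders.

```python
# Hedged sketch: grant invoke only to a specific API Gateway, rather than
# leaving it open to the apigateway.amazonaws.com principal in general.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.add_permission(
    FunctionName="orders-handler",            # hypothetical function
    StatementId="allow-only-our-api",
    Action="lambda:InvokeFunction",
    Principal="apigateway.amazonaws.com",
    # Without SourceArn, any API Gateway inside of AWS could be wired up
    # to invoke this function, as called out in the architecture review above.
    SourceArn="arn:aws:execute-api:us-east-1:123456789012:abc123def0/*/*/*",
)
```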
Kind of dovetailing from that: even if you don't have stars in a policy, sometimes permissions can be overly broad in ways that are easy to copy and paste around but could potentially create real problems. So luckily we found that one. Okay, I just covered that. So this is me, co-founder of Cyclic. We're a serverless Node.js platform, much in the vein of infrastructure-from-code. You can reach me on Twitter, and that's my email. I would love to hear your stories of fail. I'm sure we all have funny stories that maybe we can only recount in a more private setting, but hopefully I can break through the wall and start a little bit more of a conversation. So thank you for entertaining me.