Video details

Sven Malvik - Stretching for Zero Downtime Upgrades in Azure - NDC Copenhagen 2022


Imagine you navigate to the Azure portal and click AKS upgrade; two minutes later, your entire online presence is inaccessible to your users. This presentation discusses Vipps' journey of managing infrastructure in Azure.
In this session, we will walk you through the design decisions behind our Azure infrastructure over the last few years. We start by discussing the flaws and challenges of our early architecture, then introduce our current setup of immutable infrastructure and explain why we are still not happy. Finally, you will learn why and how we favor a multi-cluster architecture in Azure.
The target audience is Azure administrators and architects who want to learn from Vipps' experiences with Azure services, in particular AKS. We assume that you have a basic understanding of Azure and Kubernetes.


Yes, though, we will start now. Thank you very much for being here. I want to talk about upgrade day for Azure Kubernetes Service. You know, when you remember maybe back five or ten years ago, when you upgraded an application on your own machine, that might fail; that can happen. And then you maybe need to reinstall it and provision it again. That is fine, but in a mission-critical system, that's really a problem. Today I want to talk about how we do it at Vipps. But first of all, who is Vipps, actually? Those coming from Norway may know it, but otherwise you might not. It is basically a payment service provider, which all Norwegians use. Every Norwegian knows Vipps because literally everybody has Vipps on their phone. It's about seven years old. Right now there are about 5 million people in Norway, and over 4 million have installed Vipps and use it on a regular basis. But what can you actually do with it? Normally, when you transfer money to a person, to a friend or so, you would need their bank account number. With Vipps it's pretty easy: you just need their cell phone number. Or, what I like very much, paying my bills. Every invoice that comes from any online store or bank, it doesn't matter, all of them are in the Vipps app, which makes it super simple across any bank in Norway. What's even cooler, what I love very much and everybody really uses, is when I buy stuff online. I can visit a shop that I don't even know yet, I haven't been there before and they don't know me, but I can log in with Vipps because I have the app, and I can pay with it, and the online store doesn't need to know anything about me. The thing is that everybody is using this service in Norway, and the financial authorities consider it a mission-critical environment, so it cannot go down.
But when you have all your workload running on one system and you upgrade it, upgrades can fail, and so can this one, even though it's managed. We'll talk about this. What I like very much, but is super unimportant, is that Vipps is the most recommended brand in Norway. That's maybe also why we are merging with MobilePay now, after summer. It's more interesting to pay for stuff than Netflix and Spotify, which is kind of funny. However, the most important day in Norway is 17 May, the National Independence Day. Two and a half million transactions go through our AKS systems on that day, which is huge. And everybody is paying basically between 11:00 and 12:00, which makes it kind of a denial-of-service attack, which is pretty interesting. That's also a reason why we chose to go to the cloud, because the cloud manages stuff like that and can scale more or less automatically. That's what they say, at least. So today it looks pretty much like this: you have the API consumer, who asks for an IP address through Traffic Manager, and then the traffic goes through Application Gateway, where you have the web application firewall (WAF) enabled and a DDoS plan attached. From there it goes to Azure API Management. All of our APIs, and we have about 100 or more applications, are in Azure API Management. And from there the traffic goes to AKS, where all the services are. The problem here is that AKS has a version. Even though it's a managed service, it's not really managed; it's more like "manage", because you need to ensure that you are always up to date and upgrade frequently, so you get all your security patches and all the new features that you want. The underlying Kubernetes you have to upgrade yourself. But upgrades can fail. And they have failed at Vipps.
And that was really bad, because everything was basically down at Vipps and the whole nation could not log in or pay or whatever. People really had to go into the banks and do all that stuff. And if it's not working on an important date like 17 May, when everybody is paying for their hot dogs and so on, you can't have that. What I'm talking about today is, first of all, the challenges of upgrading AKS, what we experienced with AKS in Azure. Then the main upgrade strategies, because when we stood there and realized, okay, the simplest way of upgrading AKS in Azure doesn't work, what alternatives are there? We discussed with Microsoft and came up with three strategies. There might be many more, but those are the three. Then we'll talk about the solution that we went for, which is immutable infrastructure. It sounds super cool, but it was really not so cool for us. Then we looked into multi-clustering, basically having multiple AKS clusters, just so that if one fails, it doesn't really matter, we have the other one, and I'll show how this works. And at the end there's this new service, Azure Container Apps, which might be a solution to our problem. I'll just shortly introduce it to you. So there are basically three challenges, or there are many challenges, but I want to discuss three that we experienced at Vipps. But first of all, why not shortly discuss node pools in AKS? When we started with AKS four years ago, we had one node pool. A node pool is basically a group of VMs which share the same VM type. We had one of them at the beginning because the concept of multiple node pools didn't exist. But now it does; you can have multiple node pools. So say you have one node pool and you upgrade it. What happens is that the control plane, which is more or less the managed part of AKS, gets upgraded first, always. But that also means that when this fails, you cannot do anything anymore.
With AKS you are basically screwed. You cannot interact with it anymore. So when there is a security issue or so, you can't fix it; you have to create a support ticket and then you have to wait it out until they fix it for you. However, when you upgrade a worker node pool, it creates a new buffer node. All the workload of one node in this node pool is switched over to other nodes: the node is cordoned and drained. Then that node is reimaged and started again, and workload can switch back over to it. This is basically how it works. The first real issue we faced was application dependencies. They have nothing to do with AKS, but they can happen. We had a Spring Boot application, a configuration server, basically, which all the applications depend on. That means when this one configuration server fails, the other applications will also fail. And the thing is that this config server needed more than 30 seconds to start, two minutes or so, because it was really poorly configured. After 30 seconds, which is the default, it gets restarted, and restarted again, and so it will never come up. This config server could start on the one node which was updated and refreshed, and all the other applications as well, but it never really came up because it was in a failing state, and all those applications were in a failing state too. Then the next node was reimaged, and the same happened. And the next, and the next. There's no way you can stop this process. You're just going down and you fail. You can't do anything, which is super bad. That happened to us once. Then we once had the exceeding of resource limits. That means you are in the middle of an AKS upgrade and a new buffer node is created. Everything works really fine, but then suddenly the next buffer node is supposed to be created, but it doesn't work. And you are standing there and don't really know what's happening.
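A common way to avoid exactly this kind of restart loop is to give a slow starter a longer startup window before the liveness probe can kill it. A minimal sketch, assuming a Spring Boot app exposing the usual actuator health endpoint (the image name, port, and paths here are illustrative, not Vipps' actual configuration):

```yaml
# Hypothetical deployment fragment for a slow-starting config server.
# The startupProbe gives the app up to 5 minutes (30 x 10s) to come up;
# only after it succeeds does the livenessProbe take over and restart
# the container on failure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: config-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: config-server
  template:
    metadata:
      labels:
        app: config-server
    spec:
      containers:
        - name: config-server
          image: example.azurecr.io/config-server:1.0.0   # placeholder
          startupProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            periodSeconds: 15
```

With a budget like this, a node drain during an upgrade no longer turns a two-minute startup into an endless crash loop.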
Why is the buffer node not created anymore? I mean, it should work automatically, but it didn't. We found out after some digging, we were very new to AKS four years ago, that there are resource limits. There is a lot of stuff, I wouldn't call them details, that you have to know. There are resource limits on the subscription level, which means that you will exceed them at some point and you have to clean things up or do something about it. In the meantime, you are in an inconsistent state, AKS is in a failed state, and you can't do anything apart from calling Microsoft support and asking them for help. And while you're doing this, you wait. And let's just imagine you have breaking changes: then you have some nodes which are already upgraded, but other nodes that are not upgraded yet, and then you really have a problem. Many services might not be running because of this inconsistent state. Then you get random failures and customers start complaining, of course. Another failure that happened was random node failures. For example, we had one or several nodes where the Azure provider didn't really work, so those nodes couldn't really connect to Azure. Then you have problems with identity, applications are not going to start, and again you really don't know what you should do. Then you have to create a Microsoft ticket again and waste your time. You're super dependent on Microsoft when you upgrade AKS, with single node pool upgrades in particular. And that's, I guess, also the default when you create an AKS cluster in Azure.
And many of you, and especially us, we were quite naive and thought, okay, let's just try AKS, because it's a managed Kubernetes, super easy. We provisioned it, moved or deployed some new applications to it, and we were quite fine and super happy with it. But as it turns out, it's quite hard, and we struggled a lot. What we actually did in this case was super simple: just deleting the nodes and doing it again. But when this happens automatically, you really don't know what you should do, and you create Microsoft tickets. So there are three strategies that we discussed with Microsoft, and I want to go through them and then discuss which strategy we went for, and why it wasn't even a good idea in the end. The first one is the simplest one; that's the one we did, super naively, at the time. We went into the portal, searched for the AKS cluster, went to the configuration part, and then just clicked upgrade and selected the version that we wanted. That's how it works. It's super simple and you can't really do anything wrong. One node after the other is just upgraded, and you wait it out and see what happens. The other strategy is with node pools. You have one control plane, and again, this needs to be upgraded first, and then you have all the other nodes. In this example, the system node pool runs with a very small VM type, which is fine, because when that fails, let's say I have some monitoring services running there, they can just switch over to, for example, node pool one, where the general-purpose applications run, and that works quite well. The thing is, when you have node pool two, for example, deployed on an H8 VM type, and this fails, those applications cannot move to a general-purpose VM like the B8 type, because they have other requirements; they are forced to run on those specialized VM types.
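In practice, pinning a workload to a particular node pool like this is done with node selectors (and, optionally, taints and tolerations). A hedged sketch, assuming the pool is named `nodepool2` (AKS labels every node with its pool name; the taint key and value here are made up for illustration):

```yaml
# Hypothetical pod spec fragment: schedule only onto the H-series pool.
spec:
  nodeSelector:
    agentpool: nodepool2          # AKS sets this label per node pool
  tolerations:
    - key: workload               # assumed taint on the dedicated pool
      value: high-performance
      effect: NoSchedule
```

The taint keeps general-purpose pods off the expensive pool; the selector keeps the demanding pods on it.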
So you need to have at least two node pools of those kinds. But again, what is critical here is that you have this control plane, the managed part of AKS, which needs to be upgraded first, and if that fails, nothing works. This is a risk that we considered too high for us, because everybody is using our service, our AKS, and the financial authorities consider it mission-critical. This is a risk that we didn't want to take. Okay. Then we thought: what if we just create a new AKS, entirely new, and then we can just switch over to the new one? We have Azure API Management in front, and it's just a named value, just a switch, and then we are there. In theory, that's super easy, and I will show you why it's not so easy. To compare them side by side: on the left you have the one-node-pool upgrade through the portal. Super simple, there are really no extra costs, costs are very low, but the risk is super high. And it's not complex, because you didn't add anything to it. On the right, on the other extreme, you have the new AKS, which is super complex, because you first have to create it in exactly the same way, with exactly the same configuration, and then you have to move everything over. The cost is also very high, because you have everything twice, a redundant system, and you have to have the manpower to actually do the switch, which can also be complicated. But the risk is super low, because you can test everything beforehand, which is very, very nice. So we went for the third option, and we thought, okay, instead of having just one AKS, have two, with API Management communicating with both of them. In order to make this happen, we need infrastructure as code. I mean, we should all have infrastructure as code, but at that time we weren't really good at it; it was a lot of click-ops. Okay, we thought, that's good, but why not do it for the entire infrastructure?
I mean, we are under the financial authorities; we need to have our infrastructure documented, so we need infrastructure as code anyway. And then we said, okay, let's just do this, because then we have a test client which can entirely test the new infrastructure, and then we know exactly whether it works or not. There are always some breaking changes in major versions of Kubernetes, and then we can test them. Actually, we can test everything: new rules in Application Gateway, changes in Azure API Management, and all that nitty-gritty infrastructure stuff. We said, okay, then we will do this, let's go for it. But first we needed to have the code for the infrastructure, and there are several options: there's ARM, there's Pulumi, there's Terraform, there's Bicep. We had done some Terraform in some teams, but at that time secrets were quite visible in the state file, so it wasn't really an option for us. That's solved now, I guess. Then there was ARM. That is what everybody was doing, it's well documented, and so on. So we created a tool that generates all those ARM templates for us, so that you just have configurations and it spits out the ARM templates. But this got quite complicated, and nobody really cares about what's in there. The thing is that this tool was a C# tool, which worked quite fine, but you have to have developers who can understand and maintain that code. And when you are a platform team, you are not really into developing C# applications. So this was quite a problem. We could maintain it, but the code base was super bad; it was hundreds of lines that nobody really cared about. Then we went to something else: Pulumi. Pulumi was nice because it was a more configuration-style kind of coding, which everybody understood. And you can even develop in your language of choice.
That means that not just the platform team, but other teams can also take this and develop their own infrastructure. Let's say they need a SQL database or so; they can do it themselves, because they are autonomous. And they can use Pulumi in their own language, for example Go, while we do it in another language, JavaScript for example. It turns out that it works nicely, but all those dependencies on those modules and so on, this is not really a fun way of working. It didn't really work out for us so well. Even though it worked, it was not fun at all. Then Bicep came out and we said, okay, let's give it a shot. Bicep is really native; everybody who works in Azure should understand Bicep. So we tried it out, and until now we are super happy with it. It's clean, it's easy, you can do a lot of stuff. I know there are still some constraints, but they all get fixed over time, and you can always get around them. That's what we do today. So we had implemented the entire infrastructure as code, first in ARM, then Pulumi, and then Bicep. And then you create the entire infrastructure again, inactive, which you then want to make active. But first of all you have to test it, right? You are the platform team; you don't know anything about those applications or how to test them. So you need to have test pipelines in place. So you go around to all the teams and ask them: do you have a test pipeline? And they give you their pipelines. Some of them work, some of them don't. We made the first switch two years ago. The entire platform team, about seven or eight people, were together on a Sunday night till Monday morning, two or three hours around midnight. We were on Teams, going through all the steps that were necessary to make this change. At that time, for example, Azure API Management didn't have this cross-region feature.
So you had to create your own in the other resource group, for example, or VNet, and then take a backup and do a restore. Same for AKS: backup and restore of all those workloads. That's a lot of work. And sometimes when you do it, especially in Azure API Management, this backup-or-restore procedure fails, and then you have to create a support ticket. This is a process that takes a long time, several hours. Then you need to test it, and to be sure that everything works and is well tested, you get all the testers from all the teams joining the meeting and helping test before you make the switch. And then you need to consider that there might be failures, and there are always failures, so you have all those domain experts with you as well. So we were about 20 or 25 people at midnight on a Sunday doing this. And it worked. It was not easy; we had to figure out some problems and so on. But then we thought, okay, we cannot really do this every time with 25 people. We need to get better at this, with fewer people, maybe two or three. So we really learned that this is not an option at all. We cannot do a switch like that. Then we thought, okay, what about multi-clustering? That means you have one AKS there and another AKS there. There's no active and inactive; they are both active all the time. So you could maybe have 90% of traffic going to one and 10% to the new one, and just test it out. That's a way to do it. So we thought, okay, let's investigate how to do this. But we also learned something very important at that point: immutable infrastructure done completely, as we had done it, is a really stupid way of working, because there are components like Application Gateway or Azure API Management that are really managed, especially Azure API Management. You don't need to reprovision that one at all. Why should you? Why not just let it be there and do nothing?
Why not just have one Application Gateway and one Azure API Management, as we had before, and just two AKS clusters, because that's what we want. The problem really was upgrading AKS. Immutable infrastructure was not our problem; upgrading was, nothing more. So we went back and thought, okay, this could be something: just having Azure API Management, as in the previous picture, and making the switch there. It's a few lines of code; just implement this, and then you can go into Azure API Management. Those who are familiar with Azure API Management know named values. It's just a simple value that you change; it's super easy. With that we said, okay, let's try this out. Then you can have weights, right? 10% there and 90% there. And it worked quite well. But the problem with this one is Azure API Management itself, because there you have policies, where you create the code which makes this happen. And these policies are developed in C# within XML, which makes them almost impossible to maintain. Nobody likes that, and nobody really likes to learn it. We would need experts who can make this happen, so we would be really dependent on certain persons in the organization. And we didn't really want that. But this was one option. The other option, which is not really an option, but we considered it anyway, is to make this switch based on Traffic Manager, because Traffic Manager also has weights. You can say 10% this way and 90% this way. But this is not really load balancing, because you go one minute this way and the other minute that way; it doesn't really work like that. And then you have caching in Azure API Management as well. So it's nothing like round robin; it's just this way or that way. Then we considered a third option, which was having an Application Gateway there. This is a layer 7 kind of proxy. The problem here was that Application Gateway is a very, very complex service, I would say.
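Going back to the API Management option for a moment: the weighted switch described above is typically done in an inbound policy. A hedged sketch, assuming two named values holding the cluster backend URLs (the names and the 10/90 split are illustrative, not Vipps' actual policy):

```xml
<!-- Hypothetical APIM inbound policy: send roughly 10% of requests to
     the new cluster, the rest to the old one. The {{...}} tokens are
     named values holding the backend base URLs. -->
<inbound>
    <base />
    <choose>
        <when condition="@(new Random().Next(100) < 10)">
            <set-backend-service base-url="{{aks-new-backend-url}}" />
        </when>
        <otherwise>
            <set-backend-service base-url="{{aks-old-backend-url}}" />
        </otherwise>
    </choose>
</inbound>
```

This is exactly the C#-inside-XML style the talk complains about: a few lines that work, but that only a handful of people in the organization will want to maintain.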
The configuration you need there, you have to have as infrastructure as code as well, and we learned the hard way that working with Application Gateway is not really working for us. Application Gateway is a service that we would rather not use at all because of its complexity. You have listeners, you have backend pools, and you have some other stuff there. It works, and you can have weighted round robin, but that's a really complicated configuration to have there. Then we said, okay, why not a simple load balancer? This is also an option, but Azure load balancers cannot have other load balancers as backends. It's not possible. So this doesn't work either, even though it would be super nice. They also have to be in the same VNet, which is also required, and that makes it a bit hard when you have your infrastructure as code written so that they reside in their own VNets. This doesn't really work. I mean, you can get around it, but it's hard, and you would have to rewrite everything. And you cannot have another load balancer as a backend, which is what you have as part of AKS. So this doesn't work either. And then there was a question about cross-region load balancers. Yes, they might solve this with VNet peering, but again, you cannot have a public or another load balancer in a backend pool. So what AKS upgrade solution did we choose? We went completely back again and said, okay, active-active is not really working for us. We want to upgrade the service in a very secure way, nothing more. And then we thought, okay, we have all this infrastructure as code in place and we can provision a completely new infrastructure. Even though we said we didn't want that, we said, okay, in case we really have trouble with AKS, we can just provision it in another region, but we probably won't ever need it. So let's just go back, have one again, and find another way of upgrading it in a secure way.
But at the same time, we can go the West Europe route or the North Europe route, with an Azure Application Gateway and Azure API Management, so that when there are failures on one route, you can go the other way. And actually, we had a problem once, not some weeks ago, where the one route didn't really work and we didn't know why, so we routed to the other route, and that worked perfectly for us. But still, what is the problem with AKS at this point? How can we upgrade it? And then suddenly something new came from Kubernetes; we read about it. And this is the pod disruption budget. This is really super nice, because first you have your node pools, where you can upgrade node by node or node pool by node pool, which helps a lot, but they can fail. So you need to make sure that you have more node pools in place. But the node pool upgrade can also fail, and it's not really fun to be in such an upgrade, even though you feel quite secure. With pod disruption budgets, you can say: okay, I want to make sure that, for example, two replicas of this config server are always running. That's your disruption budget, and the upgrade won't proceed if it would violate it; you always make sure that the minimum number of pods is still running. And we experienced this once again: we made an upgrade and it failed, but luckily the pod disruption budget saved us at that point. So this was not really a problem anymore. But then there's still the question: you also still have this control plane, which might fail. This is something that we have never experienced yet, so we said, okay, this risk we can take. And we did. However, we always have another AKS in place, the inactive one, which we always can and do deploy, and where we test everything. And when we test and know there are no breaking changes, then we make the switch in this one.
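The pod disruption budget described above is a small manifest. A minimal sketch (the name and label are assumptions):

```yaml
# Voluntary disruptions (such as node drains during an AKS upgrade)
# will not proceed if fewer than two config-server pods would remain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: config-server-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: config-server
```

During an upgrade, the drain of a node blocks until evicting its pods would still leave at least `minAvailable` replicas running elsewhere, which is what stops a failing upgrade from taking the whole service down.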
So there's still a limited amount of risk, but it's so minimal that we consider it almost nothing. And then there came a new service, Azure Container Apps, and we thought, okay, that looks quite nice; we like the idea behind it very, very much. First, where are we today, or where were we before Azure Container Apps? You have virtual machines, where you are super flexible and can do everything yourself, literally everything. Then you have AKS, where some parts are managed, but a lot of the stuff you have to do yourself, like upgrading; still, you have all those nodes that you can see, and you can take them out manually. So it's managed, but you can still manage everything yourself, and you are super flexible. Then you have container instances, which are single hypervisor-isolated containers, but you cannot really scale them, and load balancing is also quite hard; you have to provision a new one, and so on. It works, but it's still kind of very low level. And then you have Azure Functions, where you have this functional model. They are serverless, and really everything is taken care of for you. It's very managed, and because it follows this functional model, it provides a lot of benefits and helps you create services. But what's the good stuff that we like very much with Azure Container Apps? First of all, you have revisions out of the box. You just install an application, put it in a container, provision it, and then you can say: okay, now I want a new revision. And that's basically all you do. Then you can say 20/80, or you can do A/B testing or gradual deployments, which is quite nice. You don't get this in Kubernetes or AKS; you have to do it yourself, you are responsible for making it happen. But here in Azure Container Apps, you get this out of the box.
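The revision traffic splitting mentioned above is just configuration on the Container App resource. A hedged Bicep sketch (app name, revision suffixes, port, and the 80/20 split are all illustrative):

```bicep
// Hypothetical Container App fragment: keep two revisions active and
// split ingress traffic 80/20 between them for a gradual rollout.
resource app 'Microsoft.App/containerApps@2022-03-01' = {
  name: 'my-app'
  location: resourceGroup().location
  properties: {
    configuration: {
      activeRevisionsMode: 'Multiple'
      ingress: {
        external: true
        targetPort: 8080
        traffic: [
          {
            revisionName: 'my-app--v1'
            weight: 80
          }
          {
            revisionName: 'my-app--v2'
            weight: 20
          }
        ]
      }
    }
  }
}
```

Shifting the rollout forward is then just editing the two weights, rather than running a canary controller yourself.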
And that was super nice, because we are developing this ourselves in Kubernetes right now, with Flagger, I think the tool is called. What it also has out of the box is Dapr; you can just use it because it's natively there. In Kubernetes, you have to do everything yourself. With Dapr you have, for example, a circuit breaker, which we really don't have right now. That means if a service is failing, the other services don't keep calling it all the time; instead it says, okay, let's make a cut, wait five minutes, and try again. Right now you would basically overload the system, and you get this out of the box with Azure Container Apps. Then you also have publish and subscribe: you can send messages to a service bus, and it listens to the service bus automatically; everything is automatically there. State management is taken care of for you as well, with for example Cosmos DB. On the other side you have KEDA, the Kubernetes event-driven autoscaler. Today you have those pods, and they can scale by themselves based on CPU and memory, which is quite nice, but that doesn't always help. Very often you need to scale based on other metrics, like for example the number of requests coming into Application Gateway at the front, or stuff like that. With KEDA you can make this happen; you can trigger scaling based on many different things. This is automatically there. What came out just last week, when Microsoft Build was on, is that, I think, Dapr and KEDA can now both be added as extensions to AKS as well. But still, you have to do a lot of things manually yourself. For example, what you can do there: you can have Azure API Management in front that calls an application which has a public IP address, or a public ingress which is accessible. And those applications can talk through Dapr to other applications that have no public endpoint. You get this automatically with Azure Container Apps.
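To make the KEDA point concrete: in plain Kubernetes or AKS you would install KEDA and write a ScaledObject yourself. A hedged sketch scaling on queue depth instead of CPU (the deployment name, queue name, threshold, and auth reference are assumptions):

```yaml
# Hypothetical KEDA ScaledObject: scale the orders-service deployment
# based on the number of messages waiting in an Azure Service Bus queue.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-scaler
spec:
  scaleTargetRef:
    name: orders-service
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "50"          # target messages per replica
      authenticationRef:
        name: servicebus-auth        # TriggerAuthentication with the connection
```

In Azure Container Apps, the equivalent scale rule is part of the app's configuration, so there is nothing extra to install or operate.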
You just define: this is public, or this is not public. You can even put Azure Container Apps in your own VNet, a public VNet or an internal VNet. In a public VNet you get DNS automatically as well. But you can also come with your custom VNet, which makes things so much easier. The thing is, though, that you are not really flexible at all. For example, the biggest drawback, we think, is that you have no access to the Kubernetes API. We develop, for example, some Kubernetes operators to simplify the whole workflow of deployments. Say you have an application which comes with an API, with a spec and a policy for API Management. You need to deploy your application to AKS, and you need to deploy your API. So what you previously did was: you have two pipelines, or maybe one pipeline, one deploying to AKS and the other one to API Management. This works fine, and your teams are probably responsible for those pipelines. But what you can do with operators, and we do this a lot, is deploy those APIs from AKS. That means you deploy your applications, and a Kubernetes operator that you wrote yourself, an API operator, takes those APIs and deploys them to Azure API Management automatically. Combined with GitOps, for example, you don't need any pipeline at all. You're totally pipeline-free; it's just GitOps and AKS with Kubernetes operators. You don't get this flexibility in Azure Container Apps. And if you are a larger organization, and we are merging with MobilePay now, so we get even larger, then you cannot really do Azure Container Apps. Maybe it works for some team that does some other stuff on the side, but if it's your main runtime environment, then it doesn't really work, because you need to create all the tooling around it once again. Also, you only have general-purpose VMs; sorry, I should mention they didn't say what type of nodes or VM types they use.
I think when they say general purpose, they mean the D series. But if you have applications that have other requirements, you can't really change this; you're really dependent on this one VM type, whatever it is. And then there are those operators I talked about, and you cannot even do kubectl logs, for example, all the stuff that you like as a developer; you cannot do this. You have to go into the portal, to Azure Log Analytics, and get your logs from there. So Azure Container Apps really sits in between Container Instances and Azure Functions. It's quite nice to have and simplifies everything. You could maybe consider, or reconsider, taking the API gateways from Azure API Management and deploying them in Azure Container Apps, so that they run outside Kubernetes, as they do now, but as containers; in terms of updates and scalability, that's much more flexible. But it's not really an option for us, because you don't have any flexibility there. And if you have many teams, we have around 12 to 15 teams or so, you cannot really help them. Well, you can help them, but you need to build a lot of tooling around it from scratch again. All the slides here, and a lot of the text and so on, are in this ebook that you can download if you want. Everything is described there in detail: active-active and so on, immutable infrastructure in more detail, our infrastructure as code, how we did it and why it didn't really work, and so on. Quite detailed, about 30 pages. Not so much about Azure Container Apps, but a bit. And if you want to get in touch, just get in touch. If you have any questions, just send them to me, or ask now. To summarize: we went from really high risk, where we just pushed the one button and everything was basically down, to low risk, still risk, because we created a completely new infrastructure, which was super hard work and is not really scalable.
When we also merge with MobilePay, that doesn't really work anymore. But then we went from low risk to what we consider no risk. I mean, there's always a little risk, but we consider it no risk, because we have done it many, many times now. We have four environments, we do it regularly in all of them, and it really works pretty well with pod disruption budgets, which are really critical and which we have learned to use in Kubernetes. And then Azure Container Apps as an alternative option to AKS: it really depends. For normal teams, probably not. But if you have some special use cases that are totally outside the main domain, which is payment services and so on, then it might be a good option. And with that, thank you.