Video details

Event driven architectures on Azure - Graeme Foster - NDC Melbourne 2022


Ever wondered how to build event-driven architectures on Azure, and what pitfalls to look out for?
Event-driven architectures have become very popular in recent times. They help reduce the complexity of building modern microservices-based systems by providing patterns to avoid tight coupling, decrease latency, and establish efficient and reliable communication.
Based on our practical experience working with multiple large customers, in this session, we will dive into these architectures. We will introduce concepts including CQRS, Event Sourcing, Sagas, etc., and debunk a few myths leaving you prepared for building modern applications on Azure.
Check out more of our featured speakers and talks at


Good afternoon, everybody. You've made it this far in the conference, so, like, well done. If you're like me, you probably needed a nap earlier today. I think that's what happens at this time, right? But hey, thank you for coming to see this talk so late in the conference. Quick introduction: my name is Graeme Foster. I work as a cloud solution architect at Microsoft. That's been my job for the last almost two years now. But before that I was pretty fortunate to get to work on some fairly large microservices, event-sourced, event-driven architectures, worked with some really smart architects at the time, picked up some interesting ideas, came up with some of my own, and learned a fair bit about those architectures from reading, watching videos, all that kind of stuff. So I'm hoping to share a lot of those fundamental eventing and messaging patterns with you today. I want to start out talking about what an event-driven architecture is, and about some of the problems that we have to face when we come into an event-driven architecture, especially when we start breaking things up into lots of small, disparate services and we move away from these monolithic applications where we can do things consistently and atomically in a database transaction. So we're going to be thinking about patterns like that. And because I work for Microsoft now, I'm going to break apart some of the services that we offer for hosting, publishing and receiving events, and try to layer that into some of the patterns that we're going to talk about as well. So not so much code in this talk, not so much live demos. I didn't really want to risk a live demo at the end of the conference; I thought that might be a little bit dangerous. And in fact, because I now wear an architect hat, there are a lot of diagrams in this.
So I've put my draw.io hat on, and it's kind of a weird experience for me because I've been an engineer for like 25 years or something. So: diagram heavy, very little code, just a lot of core patterns and mapping them onto Azure services. To do this, I'm going to hang the ideas and concepts off a use case. Now, it starts off as a very, very simple use case. Imagine two BAs in the higher echelons of their big glossy tower in a city like Melbourne. They're walking around, they've been reading some Gartner report or something like that, and they came up with an idea that said: hey, did you know, if only you could manage your employees' wellbeing status, you would realize that there's a direct link between their wellbeing and their productivity. Disclaimer: I do not think there is a study that says that. I don't know, there are probably laws against that kind of thing. But the business idea was that if you could get an employee every morning to say one to five on how they're feeling, and then suggest something that they could do (go for a walk, breathe in some nice fresh air), they could be more productive, okay? But it's very important we only ask an employee this once and once only every day. So these are the business requirements, and they've been handed down from the lofty heights of the ivory tower up there to the engineering team, who are a little bit pressed for time and have to build something. They were given a couple of weeks. So this is what they came up with. This was their first build of this service, and we're going to use this and embellish it as we go throughout the talk, adding new requirements onto it. But it's a lazy team, okay? They're quite pragmatic. They don't really want to over-complicate something, which I think is a pretty good way to be. So they want to build something that is as simple as possible, but that in their opinion, you know, will do the job, and will do the job okay. And what they've come up with is pretty straightforward.
Graeme is going to log in every morning and enter a one-to-five wellbeing rating. The system is going to leverage an existing correspondence service (it already exists, that's step three there) to send an email to Graeme, and it's also going to record that state in a database. Could be a Cosmos database, SQL, whatever. Doesn't really matter. It could be a flat file if you really want it to be. Okay, so this is an architecture, it's really straightforward, but if you think about it, there are a few potential problems lurking in here, and it's super important to know what they are and then decide if they're going to affect what you've done and whether you need to go back and maybe re-engineer this. So I'm looking at it and I'm thinking: well, we've got a tight coupling to a downstream service here, right? If that correspondence service goes down, what's going to happen? Can we still send that email? Maybe we can do a little back-off and retry. But what if it's down for a couple of hours or so? That could be a potential problem for these requirements. What happens if our employee wellbeing service is wildly popular, and we realize that that correspondence service is written on an old J2EE server running on a cranky VM somewhere, and we suddenly overload it at 09:00 in the morning with 100 requests and we take it down? What if it was involved in sending regulation reports out to the government or something? That could be pretty nasty, right? So when I look at an architecture like this, my initial thoughts are: well, you know, if it's a small company, 20 employees, and it's okay if we don't send that email, this could be completely fit for purpose. Super quick, super simple, let's ship it and see what happens. So we ship it, we see what happens, and bug reports start flying in. You see, a lot of those initial assumptions turned out to be correct.
This correspondence service is actually quite flaky and it can go down for an hour or two, quite straightforwardly; it happens very, very often. Which means, based off our original architecture, those users are not getting their correspondence, they're just getting errors. And this is a problem. A, we want the service to be up, and B, it turned out that it was critically important that we send those messages to the user. So we might need to refactor and rearchitect a little bit. What are we going to do? Well, let's try to keep the mantra of keeping this as simple as it can be, without getting too complicated. And there's a great pattern for something like this: where we want to decouple services, we can introduce a queue and send a message instead. So if we look at this architecture, really the main change is in number three there. We no longer directly call the correspondence service. Instead, we're just going to drop a message onto a queue, a super simple queue. It's a lightweight service, it will do the job beautifully, and it means that we can decouple that user experience from the actual backend system that sends the message. There are a couple of other benefits in here as well. That downstream service, we know it doesn't like being overloaded, so we can use a queue to make sure that we limit the amount of throughput we're putting down onto that correspondence service. It can act as a bulkhead, we call it: we can send it loads of messages and it will just hold on to them, and it will allow them to be dispatched by the wellbeing emailer in a managed-throughput sort of way. So it's a lovely architecture, and if we look at Azure, we've got a few choices for a service we can use here. The two that stand out to me are storage queues and Service Bus. Now, our requirement is pretty straightforward here, right?
We don't need lots of the shiny features that a proper message bus is going to give us. We don't necessarily need deduplication, we don't need sessions, we don't need transactions, we don't need request-response; we just want something that we can offload these messages to. So in this scenario, a storage queue is actually a pretty cool choice. It's simple, it's rock solid, the SLAs on them are really strong, so that's great. It supports this idea of dequeuing those messages using competing consumers. I'm not sure if anyone has come across that before, but the idea here is that we've got a whole bunch of messages in a queue and we want to take those messages off, and we don't want to do it in a serial fashion, but we want some kind of protection so that if we fail to process a message, we'll retry it later. So the idea of competing consumers is that we can leverage a concept of the storage queue called a peek lock, which means that when we want to take a message off, we don't actually take the message off. Instead we look at it at the head of the queue and we put a lock on it. Another consumer can come along, and when they go to the first message on the queue, they see the one behind it; they don't see the one that we've taken a peek lock on. And if we fail to process that message, well, that's okay, it'll become visible again so we can process it later. So competing consumers means that, let's say it's a quiet time of day and no one is using our system, well, we don't really need much compute to drain that queue because there's not a lot in there. But as load starts to increase, we can add more consumers as we need them, spread the processing out across those consumers, and then scale back down when we don't need them. And if we know we've got physical limits on the downstream services, we can make sure that we don't process too much.
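The peek-lock and competing-consumer behaviour described above can be sketched with a toy in-memory queue. This is a conceptual model only, not the Azure SDK (which implements this with visibility timeouts on `azure-storage-queue`); all names here are mine:

```python
import collections
import itertools

class PeekLockQueue:
    """Toy model of storage-queue peek-lock semantics (concept only,
    not the Azure SDK). A consumer locks the head message; other
    consumers skip locked messages; an abandoned message becomes
    visible again so it can be retried later."""

    def __init__(self):
        self._visible = collections.deque()
        self._locked = {}                  # lock id -> message body
        self._lock_ids = itertools.count()

    def send(self, body):
        self._visible.append(body)

    def receive(self):
        """Peek-lock the first visible message, or return None."""
        if not self._visible:
            return None
        lock = next(self._lock_ids)
        self._locked[lock] = self._visible.popleft()
        return lock, self._locked[lock]

    def complete(self, lock):
        del self._locked[lock]             # processed: gone for good

    def abandon(self, lock):
        # processing failed: make the message visible again for a retry
        self._visible.append(self._locked.pop(lock))
```

Two consumers can receive concurrently without seeing each other's locked messages, and an abandoned message comes back for a later retry, which is exactly the protection described above.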
Maybe we lock ourselves to three consumers because we know that's what we can get away with. So it's pretty cool, supported by storage queues, a super simple service that will do the job. It's kind of nice, right? There's a potential issue in there, though, which is: what happens if that queue goes down? We write the state to the database and then we can't send our message. So we've potentially still got that problem from before, but we've decoupled from the backend system, we've reduced the load on it. It's not going to go down. We're not going to take it down. We don't worry if they're deploying to it. So we're getting somewhere. It's pretty good. Let's ship it and see what happens. Man, yeah, okay. It had five nines on its SLA, but it's still going down at certain times of day. So we're finding that out now: users haven't been getting their correspondence, and I heard there was an HR liability claim. Someone said that they were told to go for a walk by our system and they slipped on the ice, but we've got no audit trail, because the queue was down at that point in time and we never got a message to them. That's a bit of a fabricated scenario, but I'm sure in your business domains you could find equally interesting scenarios. So the problem here is that we've uncovered a hidden distributed transaction across these services. If you think back to the old days, everything used to be bundled up in a monolithic database. And when we were deploying on premise, we had access to these things called two-phase commit transaction engines. If anybody in here ever managed to configure one, then I'd love to know who you are, because you're probably part of a select bunch of people in the world. But the idea was that we could have disparate databases or message queues or things like that, and we could actually open a transaction that spanned those different downstream services using two-phase commit.
And if one of them failed, everything else would get rolled back. In the cloud, we don't have that luxury for most of our services. We're using managed services here. We've got Azure Functions, we've got managed queues, we've got Cosmos databases. Most of them support some notion of a transaction against themselves, but not against other services. So what are we going to do? It feels like we've really hit a roadblock here. It turns out there's actually a really, really simple pattern that we can reach for. It's called the transactional outbox pattern. I was trying to find the source of it earlier, but I can't quite find out where it came from. It's one of those patterns that feels like it should have come from Gregor Hohpe or somebody like that in one of the enterprise integration patterns books. But a transactional outbox is super straightforward, and instead of solving the problem by working around distributed transactions, it just sidesteps them completely and says: well, what about if we only transacted against one resource? So what we can see here now is that number two stores the local state into the database like we used to, but it stores a bit of extra information inside that document. It stores the intent to do something in the future. We call it an outbox, and it's like saying: look, here's the local transaction; there's a little thing I need to do later, so I'm just going to put my intent to do it in here. And then we're going to leverage a feature of our Cosmos database, though most relational databases support something similar as well: we're going to use the change feed. So we're going to create another function, the dispatcher, and what it's going to do is subscribe to the Cosmos change feed on those documents.
And what will happen is that every time we update or insert a new document, that dispatcher function will get notified, and it can look at the document, peek into the outbox and say: okay, well, I need to send a message now. So what it will do is call the correspondence service and then go and write the document back into the database. Now, you might realize there's a distributed transaction in there as well, because we're still transacting against the correspondence service and calling a database. But it's different this time, because we know the intent to send that message was stored earlier on. So we know that if we fail to do it this time, we'll have another go, and then another go, and eventually we'll get that message sent, which is very different to the earlier approach where we just might not get it sent at all. Because remember, there's no such thing as exactly-once delivery, right? No matter who tells you there is, you either have at least once or at most once. And in this case, we're going for at-least-once delivery of that message, because we don't think sending the same message multiple times will be a problem for our downstream employees when they get it. So this is pretty cool, right? If it's vital to guarantee that you do something when you're communicating across multiple services, this is a really good pattern. Here's an example, here's my Cosmos document. It's JSON. I said no code, right? Well, this is just JSON, so I'm good with that. I've got an ID, I've got a status, I've got a recommendation, I've got an outbox. Bang. I'll put it in as one document and send the message later. So that's pretty cool. I'm happy with that. Let's move along; I think we've solved that problem. Okay, so it turned out that our employee rating service is actually wildly successful, and we've got a whole bunch of really good data doing the rounds. And HR have heard about this.
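As an aside, the whole outbox-plus-dispatcher loop can be sketched end to end. A plain dictionary stands in for Cosmos, and a scan over it stands in for the change feed trigger; function and field names here are mine, not from the talk's slide:

```python
# One atomic write: the local state plus the intent ("outbox")
# to send correspondence later -- a single-resource transaction.
def save_rating(db, employee_id, rating, recommendation):
    db[employee_id] = {
        "id": employee_id,
        "status": rating,
        "recommendation": recommendation,
        "outbox": [{"type": "SendCorrespondence", "sent": False}],
    }

# Dispatcher: in Azure this would be a function subscribed to the
# Cosmos change feed; here we just scan the store for unsent intents.
def dispatch(db, send_message):
    for doc in db.values():
        for intent in doc["outbox"]:
            if not intent["sent"]:
                send_message(doc)        # may fail; we'd simply retry later
                intent["sent"] = True    # write back once the send succeeds
```

Because the intent is stored before the send and only marked done afterwards, a crash between the two just means another attempt later: at-least-once delivery, exactly as described above.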
This is good stuff, right? We want it. We want to know about your data here. So this gets delivered down to our engineering team, and we think: well, what are we going to do? We've got this outbox, it sends the correspondence. How about we broadcast an event after we've sent that correspondence? So we go from this little process where we did one thing and one thing well, which is to send the email, and now we want to chain a second thing onto that: we want to send the email, then we want to broadcast the event. So this is starting to sound awfully like a workflow. Do this, then do that. And when it comes to Azure, I can think of at least three different services we have that can support you in your workflows. So let's go from the top. I wanted this one to be at the bottom because it's kind of low code, but I couldn't make PowerPoint work that way. I'm just not very good at PowerPoint, unfortunately, which is why I'm not very good at low code as well. We have services like Power Automate, and if you're a business user and you want to do things like look at a SharePoint document, send an Outlook email, do stuff across those Office-type services, then Power Automate is your friend. It's going to do what you want it to do. It's straightforward, it's a WYSIWYG; not so good on the CI/CD and unit testing. I'm not really sure if you could automate a test against it, maybe; happy to be told you can. But I'm not going to use it for this scenario. Now, Logic Apps is the next one that pops up on our website. We say Logic Apps is aimed at pro integrators, developers, and IT pros. I'm not sure if anyone classes themselves in that category in here. But Logic Apps give you a similar type of GUI-style workflow where you drag and drop.
You have these connectors that can call out to other services, and we start to see the introduction of expression-style languages, where we can look at the outputs from one thing and send them into the next thing. They support CI/CD, which is good, and they have a lot of out-of-the-box connectors, which is pretty cool. So maybe that's for you. Personally, for me, for this scenario: I'm a developer, and my development team think this might get a bit more complicated. So they decided they don't want to use the Logic Apps GUI-style interface, and instead they want to use these things called durable functions. I'm smiling already, because I love durable functions. I think they are so cool. Durable functions essentially allow you to write serverless workflows in an imperative language of your choice, which is pretty cool, right? You can use .NET, you can use JavaScript, TypeScript, Python, PowerShell; Java support dropped just the other day. And what you can do is essentially write code that defines a workflow, but the little parts of that workflow all run in a serverless style, meaning they only come to life when they need to be there, then they disappear. It's a really cool paradigm. You can basically write imperative code, define all the steps, and let the durable functions runtime go through the hassle of making sure that each of those steps is guaranteed to run. The whole thing is backed by storage queues, and I think there's another provider you can use for it now as well. You can write unit tests around these things, because they're just .NET code at the end of the day; you can write whatever logic you want inside them and test it. You can bring in your NuGet libraries and interfaces and all that kind of good stuff. And this is what they look like. There are two parts, really, that make up durable functions. We've got the part on the left, which is called the orchestrator.
Think of this as the code that knows about the workflow. It's where you write all the steps and all the logic, it's where you spin off the new activities, it's where you go round loops, it's where you can set timers and come back if things didn't happen. I'm using .NET code here, and I've left a line of code in there for parallel tasks, but running from the top it's about the simplest thing you can see: you've got the awaits, the WhenAll, the context call-activity-async calls. The other part of a durable function is the activities. The activities are where you write the code that is essentially the implementation of the individual workflow steps. So in my scenario here, I've got two activities: SendMessage and BroadcastSentiment. You can see at the top right I've got a SendMessage activity that just pokes the correspondence service, and underneath it I have BroadcastSentiment. Even better, we can still opt into Azure Functions bindings on these things as well. So if I want to send a message via Service Bus or something, I can just use my attributes on the code at the bottom. So durable functions are an awesome way to create serverless workflows on Azure, and indeed that's what my team has decided to do. If we look down at the bottom now, steps five and eight are essentially the durable function in action. The first step in the chain is going to send the correspondence via the service, and then number eight will go and broadcast an event, in this case into Service Bus. Now, why do I use Service Bus and not Event Grid or Event Hubs? That's a great question, and I'm sure whatever answer I come up with, somebody will challenge it anyway. But for this particular scenario, there are a couple of things that led me to that thinking. The first is that I'm using Service Bus because I see these as discrete business events. We talk about Event Hubs more for analytical-style data, where maybe each data point doesn't have much meaning in itself.
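Stepping back for a second: the orchestrator-plus-activities split described above can be mimicked in miniature with Python generators. This is a toy illustration of the paradigm (workflow code yields activity calls, a runtime drives them and feeds results back), not the real Durable Functions API or its replay machinery:

```python
def orchestrator(context):
    """Reads top to bottom like a Durable Functions orchestrator:
    send the email first, then broadcast the sentiment event."""
    result = yield ("SendMessage", context["employee_id"])
    yield ("BroadcastSentiment", result)

def run(orchestrator_fn, activities, context):
    """Tiny stand-in for the durable runtime: execute each yielded
    activity and feed its result back into the workflow code."""
    gen = orchestrator_fn(context)
    try:
        step = next(gen)                  # first (activity, argument) pair
        while True:
            name, arg = step
            step = gen.send(activities[name](arg))
    except StopIteration:
        pass                              # workflow finished
```

The point of the sketch is the shape: the workflow stays readable, imperative code, while something else owns actually executing (and, in the real service, checkpointing and retrying) each step.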
I see each of these as a fully formed piece of information that describes some business activity in my system, and that's one reason I've chosen to use Service Bus. The second is that my security team do not like public IP addresses, and unfortunately Event Grid does not allow you to have a subscriber on a private IP address; they have to be public. You can send messages to a Service Bus from an Event Grid, but similar thing: the Service Bus has to be publicly addressable. So if you're in a kind of enterprise organization where it's super important that things are on private IP addresses, then unfortunately, as cool a service as it looks, you might have problems getting it through the door. But this is cool. We've broadcast the sentiment out, and the HR service has a little subscriber and can go and listen to that. Now, HR are a little bit frustrated with us, because we're not sending them enough information. All we've actually broadcast (our development team being a little bit lazy) is the employee ID, the date, and the wellbeing rating, one to five. And they're crying out, saying: we want to know what you told them. Why aren't you telling us? There's not enough in this event. Which leads us to an interesting question: how much information should we put in an event when we broadcast it out? There are a few different schools of thought about how to get that additional information to the end subscribers, and there are two that I've called out here. The first is to create what I think Martin Fowler called an event-carried state transfer. The idea of event-carried state transfer is that we actually put more information into the event, just in case someone wants it. We don't know who that is. They might not be there yet, but maybe in a week's time someone's going to want it. So let's put it in. Let's start putting more into that event. It's a cool pattern. It means that our downstream subscribers can be completely decoupled from us, right?
We send the event, they get it, and they don't need to introspect anything else about us; they've got everything they need in that event. All I would say is: be a little bit careful about that. I worked on a project a few years ago, going back about five or six years, where we had a lot of services broadcasting events around. I think it was about 120 by the time I left. And we found that each service tended to want a bit more, or need a bit more, and a bit more and a bit more, until eventually every event that was coming off a domain seemed to have the entire aggregate root of that entity in there, just so the other services could work out what they needed to do. Now, if you find yourself falling into that pattern, I would probably take a step back and wonder whether you're entering nanoservice kind of world, where you've started to break apart your domains too much. They're too small, and they all need to know about the insides of each other. Maybe, instead of firing another shot into the asteroids, start sticking them back together again and creating, like, Goldilocks-sized services. I don't know if there's a term, but a right-sized service, maybe like the SOA everyone talked about many, many years ago. So that's one way you can do this. The other way is you could have the HR service make a callback to the wellbeing service to say: hey, look, employee one, this was the day, this was the wellbeing status; what did you tell them? That can be a really good pattern if you need up-to-date information. Maybe it's not okay in your events to have old, stale data. Maybe you need to go back and get the latest source of truth. I don't know; it's a legitimate requirement at the end of the day. But if you do that, then the only thing to be wary of is that you have now put a coupling back between those two services.
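The two options above are really just two data shapes plus, in the second case, a callback. As a sketch (all field names and the lookup function are made up for illustration):

```python
# Option 1: event-carried state transfer -- a "fat" event carries
# everything a subscriber might need, so no callback is required.
fat_event = {
    "employee_id": "emp-1",
    "date": "2022-06-15",
    "rating": 4,
    "recommendation": "go for a walk",   # extra state, just in case
}

# Option 2: a "thin" event plus a callback to the owning service
# when a subscriber needs the detail (fresher data, but it
# re-couples the subscriber to the publisher).
thin_event = {"employee_id": "emp-1", "date": "2022-06-15", "rating": 4}

def handle_thin_event(event, wellbeing_lookup):
    """Enrich a thin event by calling back to the wellbeing service."""
    detail = wellbeing_lookup(event["employee_id"], event["date"])
    return {**event, **detail}
```

The trade-off is exactly the one in the talk: the fat event decouples subscribers at the cost of growing payloads; the thin event stays small and fresh at the cost of a runtime dependency.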
So a lot of the patterns we spoke about around outboxes and intent and single transactions, you might have to go back to that kind of world to satisfy your requirement. But these two patterns are both totally legitimate ways of getting that extra information to the subscribers of those events. So this is pretty cool, right? We're going with the slightly larger event, I think that's what we're going to try to do, so let's push it out there and see what happens. And all is good, which is cool. Now, I've said the word event a hell of a lot in the last 10 minutes or so, and it's a little bit of a minefield when we come to cloud providers, because there are so many damn services that all seem to be able to do events for you. So let's just spend a few minutes going through the Azure toolbox of event-style services, to think a little bit about where they make sense, where maybe they're not pitched, and why you might choose one or the other. Let's start with Event Hubs. It says event in the name, right? So this must be the one to use for all the things, surely, for all of your eventing-style patterns. Event Hub works very similarly to Apache Kafka. There are differences, sure, but the general architecture and concepts are the same. And we refer to Event Hub at Microsoft as a mass event ingestion engine. Imagine you've got machinery, or you've got log files piling into your web servers, and you just need to get them somewhere where they can be analyzed. That's where we talk about basically smashing Event Hub, sending it millions and millions and millions of tiny data points that you can potentially analyze later on. So we talk about analytical data as opposed to business events. Now, again, this is something off our website more than anything, and I think there's a good challenge around whether Event Hub could be used to send business events as well.
But certainly if you're thinking about data points, let's say from a machine down in the plant somewhere, and you're worried about whether that machine is going to break, then the vibration data is really useful. Imagine you've got streams of data coming off it, but each vibration point is kind of meaningless on its own. When you start to push these through stream processing engines, maybe Stream Analytics or HDInsight or something like that, you can do analytical queries over tumbling and sliding windows and start to infer potential problems that might be coming up inside your machines. Event Hub stores data in partitions. When you create your Event Hub, you actually choose the number of partitions that you want, and that's a really delicate thing to get right, because there can be ramifications of not getting that number right. So, one possible thing that could happen: let's say you decided on two partitions. What that means is that every piece of data you send to Event Hub will essentially land in one of those two physical partitions. Now, when it comes to consuming that data, we don't use competing consumers like we spoke about earlier. Instead we organize nodes into what we call a consumer group, and each node essentially has affinity to one or more partitions. So if you have two partitions and one consumer, that consumer will get the events from both; happy days. If you have two consumers, each consumer will align to a partition, and they will get the events in the order that they came out of that partition. If you have three consumers and two partitions, one consumer attaches to one partition, one consumer attaches to the other, and the third consumer has nothing to do; it just sits there idle, because of the way the affinity works. Now, that can be really interesting, because it means that you need to make sure that your partition count is based off your highest potential need to consume that data, not the lowest potential need to consume it.
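That idle-consumer effect can be shown with a toy assignment function. The real Event Hubs SDK load-balances partition leases across a consumer group dynamically; this sketch only illustrates that partitions, not consumers, cap the parallelism:

```python
def assign_partitions(partitions, consumers):
    """Toy round-robin partition-to-consumer affinity, Event Hub style:
    each partition is owned by exactly one consumer in the group, so
    any consumers beyond the partition count end up with nothing."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

With two partitions, a third consumer gets an empty assignment, which is why the talk says to size partitions for your highest potential consumption need.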
If you get smashed with 500 events at a single point in time, let's say for a couple of seconds, and you want to churn through those quickly, you can't just spin up new consumers if you don't have enough partitions holding that data. So it's quite an interesting choice that you have to make early on. People might tell you that with Event Hub you can send business messages, like "transact against this account". Sure, you could put a message on it, but are you going to get poison message detection? Are you going to get deduplication? Are you going to get request-response? If you find yourself having to build all of those things on top of your eventing engine, then there's a possibility that maybe you should have chosen a message bus instead. Which brings us to our second service, Service Bus. Service Bus is essentially an enterprise message bus to suit all your messaging needs. We talk about it as something that you send your high-value business messages to, and it's actually a really, really useful service, with a whole bunch of functionality built into it. We can have sessions. We can do filters on topics and subscriptions, so one subscriber might not get every message that comes in. We have deduplication support, so that if you send a message multiple times with the same ID on it, Service Bus will drop the duplicates and only give you one of those messages. We can do ordering: even if you have competing consumers that are all churning through messages, we can put sessions against sets of messages, so they get affinity to a consumer, and that consumer will only get the messages in that session. We have request-response, so if you want to reply to a message, you can go and send a message back. And we have topics to enable subscribers to come along and subscribe to messages. Now, it's slightly different to Event Hub, in that it's actually very different to Event Hub.
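Duplicate detection, one of those built-in features, can be mimicked with a toy queue that drops repeated message IDs. Concept only, not the Azure SDK; real Service Bus does this inside a configurable time window:

```python
class DedupingQueue:
    """Toy model of Service Bus duplicate detection: a message sent
    with an ID we've already seen is silently dropped, so retried
    sends don't produce duplicate deliveries."""

    def __init__(self):
        self._seen_ids = set()
        self._messages = []

    def send(self, message_id, body):
        if message_id in self._seen_ids:
            return False                   # duplicate: dropped
        self._seen_ids.add(message_id)
        self._messages.append(body)
        return True

    def receive_all(self):
        msgs, self._messages = self._messages, []
        return msgs
```

This pairs naturally with the at-least-once sending discussed earlier: the producer can retry freely, and the bus collapses the duplicates.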
With Service Bus, if you're not there at the point a message is sent, you can't just come along later, subscribe to that queue and see those messages, because they're delivered, and once they're delivered, that is the end of that message, really. They don't find themselves in new queues. Whereas with Event Hub, those messages stay there for as long as the retention period. So you can come along later; I think seven days is the default on standard, but I think it can go up to 90 or something on the premium tiers. So you can bring in other things to analyze that data later in time. But with Service Bus, they're delivered and they're gone. Both are pull-based models, though. You need a consumer, a host process, that is fetching those messages back. So if you set up a Service Bus binding in your Azure function, there's actually a function host that is subscribing to the Service Bus and pulling down messages. Which is one of the main distinctions between Service Bus and Event Grid. Event Grid has got the word event in it, like Event Hub, right? So it must be the choice for events. And it's kind of cool how it works. It's cloud-native and it's reactive in nature, which essentially means that when you send Event Grid a message, it will deliver it, bang, to all the subscribers straight away. There's no polling; it will just call your function or push a message into your queue, which is pretty cool. But the scale it's built at is astronomical. If you push a million messages in a second into Event Grid, it will call you a million times, which, if you're a legacy downstream little API running on a single server, like that correspondence service, could probably set you on fire at the point that all those messages come through. So be wary of its scale. And it can only deliver to public endpoints, which is a problem. But what we often see is combinations of services.
So you might standardize on Event Grid for all of your events and concerns, but use Service Bus as a delivery channel for some legacy systems that can't handle that throughput. So let Service Bus buffer that for you, and then use competing consumers on the other side. Now, the other call-out I want to do is Cosmos change feeds, which are quite similar to Event Hubs. The only thing I would say about them is just be wary of allowing downstream applications to connect to your Cosmos change feed, because really that is your internal schema. So what you don't want is to have third-party systems coupled to your internal schema, and then you need to make a change, because that's going to be an awful lot of conversations with people who will tell you, like, "No, I don't have budget to make that change on my end, and I'm more important than you, so you're not allowed to change." So whenever you're sending events, think of your events as contracts, just like your APIs are. When you defined your Swagger contracts, or your WSDLs from back in the day, that was, like, super fixed. You only changed that if you really had to. So, similar to that, if you are going to use Cosmos change feeds, maybe don't let people couple to your internal entities. Maybe have another collection which has got your contract entities, which are more fixed in nature. All right. So, HR are not happy, unfortunately, because it turns out what happened is that the correspondence service didn't like the load that we were putting on it, and the correspondence service started sending emails in an asynchronous fashion. So previously we sent an email and we didn't hear back until it was delivered. HR have been told they're not allowed to use the data until that email has definitely been sent. But now that is asynchronous, so it's not sent when we ask. Instead, the correspondence people have said that they will raise an event from their system when they finally send that email.
And only then should we broadcast our information out that the wellbeing service has finished. So this is starting to get more complicated now, right? It's no longer a "do this, then do that". It's a "do this, then sit and wait; event comes back in; all right, cool, the event's come back, we've sent it, now let's do the other thing". So we're starting to see the beginnings of much more complicated workflows arising. And I want to work through three fairly common patterns, all subtly different, for how they go about this. So the first pattern is called a process manager. Those of you that use NServiceBus may have seen it referred to as a saga as well. I'm going to step back from that debate, but suffice to say that most implementations I've seen call this a process manager, and we will look at sagas in a minute, which I think are slightly different. But we're going to implement this using durable functions still, because we can, and the main change is number eight down there, the send email. Because this is a serverless function, we can actually just put the process to sleep: just hang up and wait for an external event to wake you up again, instead of just going through the motions. So in this scenario, what happens is the function yields. It goes to sleep. The email gets sent, and then in number nine the correspondence service puts another message on. And we have another function that lives outside the durable function, call it a correspondence service trigger. This one listens to that Service Bus queue, and every time a message comes in, it looks at it and says, "OK, you've sent some correspondence, and I can see my correlation ID here, so I know which instance of my running workflow this is related to." And then that external function can call back into the durable function and say the event has happened. At that point in time the durable function wakes up again and starts to move on to the next step.
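That yield-then-wake flow can be sketched as a toy process manager. This is not the real Durable Functions runtime, just a minimal simulation of its shape, with all names invented: the orchestration yields when it hits a wait, and an external trigger with a matching event wakes it up to run the next step.

```python
# Toy process manager mimicking the durable-functions shape described above:
# run activities, park the workflow on a wait, resume when the external
# event arrives. Illustration only -- not the Durable Functions runtime.
def wellbeing_orchestration():
    yield ("activity", "record_wellbeing")
    yield ("activity", "request_email")     # step 8: ask the correspondence service
    yield ("wait_event", "EmailSent")       # go to sleep until step 9 wakes us
    yield ("activity", "broadcast_to_hr")   # only runs after the email went out

class Runner:
    def __init__(self, gen):
        self.gen = gen
        self.log = []            # activities that have completed
        self.waiting_for = None  # name of the event we're parked on, if any
        self._advance()

    def _advance(self):
        for kind, name in self.gen:
            if kind == "activity":
                self.log.append(name)
            else:
                self.waiting_for = name   # park: the generator is suspended here
                return
        self.waiting_for = None           # ran to completion

    def raise_event(self, name):
        # the external trigger function calls this when its message arrives
        if name == self.waiting_for:
            self.waiting_for = None
            self._advance()

r = Runner(wellbeing_orchestration())
print(r.log)                  # the two activities before the wait have run
r.raise_event("EmailSent")    # correspondence service's event wakes it up
print(r.log[-1])
```

In the real thing the correlation ID picks which parked instance to wake; here there's only one instance, so the event name stands in for that match.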
So this is called a process manager. Now, it's got some really good benefits, right? You can version this whole thing, so I can release a version two of my durable function and completely change the process for new workflows that need to happen. Maybe that involves new steps, or reordering steps, and we can do it just using slots in Azure Functions. We can have it such that when the dispatcher, number four there, initiates an orchestration one time, it initiates it in the production slot, and the next time it initiates the one in the green slot. So we can do things like that, and we can allow the ones that were in process just to run their course and get to the end and finish. So that's pretty cool: being able to version long-running workflows, that's pretty sweet. It is a bit tricky to version in-process workflows, though. If a workflow is halfway through, well, most workflow engines have this problem, right? It's normally XML spaghetti somewhere. With durable functions it's a pretty cool implementation; it uses event sourcing behind the scenes. But that means that the order in which the activities finish is super important. When the orchestrator replays, it expects things in the order that the code says they're going to be in, so if you go and change that order, you could find yourself in a bit of a mess. So versioning in-process: not so good. You'll also find with durable functions that a fair amount of boilerplate logic starts to slip into your code. Sometimes it can be hard to look at a more complex workflow and see the logic, because it's just a lot of "call activity async", "wait for external event async", and it's like, where has the logic gone? That can be a bit of a downside to them. But a process manager in general can be really useful. Let's say you need to report on the status of all in-process workflows. Let's say I've got 1,050 loan applications going or something. They're all long-running workflows. How many at this stage?
How many at that stage? Because there's a central thing that actually knows about that workflow, we can ask it. We can say to it, hey, what's the status of all the in-process workflows? And it can give us that information, right? So that's pretty cool. But with durable functions in general, the more external services are involved, the more complicated these become, and the durable function can get quite complicated and difficult to reason about. So there are some good points and bad points. It's a brilliant tool to have in your toolkit. Is it always the right choice? Probably not. Is it sometimes the right choice? Yes, probably it is. So let's flip it completely on its head. A process manager has got a central orchestrator. What if we just completely do away with the central orchestrator and have no central orchestrator at all? I was watching Stranger Things season two last night, and it reminded me of the hive mind at the end, where they heat one bit of it and everything feels it. With a choreography, there is no central orchestrator of this workflow. Instead the behavior, the process itself, kind of arises based off the interactions between topics and subscriptions, between different services. So you can't point at a place and say, "that's where the logic is". Instead the logic is everywhere, and it kind of arises. So if we look at this now, there is no durable function anymore. Instead the dispatcher just raises an event that says "wellbeing was recorded". We have two services, the correspondence and the HR service, and they both listen to that event. The correspondence service says, "this is great, I've got an event here, I need to go and send an email, because I know that's what I do when I see this event". The HR service hears the event and it says, "okay, so the wellbeing kicked into action, but I can't do anything yet, because I know I need to wait for the correspondence to be sent".
So, if you like, "I'm going to go into a mini kind of process-manager state myself, where I wait for the other event to come in". It subscribes to the events from the correspondence service, and when it gets the event, it says, "OK, cool, now I can use this data". We did have a requirement that said that we had to send the email, so often what you'll see in scenarios like this, and the other one, is maybe a timer event somewhere. It's almost like, "this is my timeout, and if I don't hear that the email was sent in 30 seconds, then I'm going to send that message again". So there are kind of weird sorts of things you need to think about when you're building these kinds of distributed workflows. But this is a choreography approach. It is very difficult to version these, because where's the logic, right? How do you version that thing when it's running in process? How would you roll out version two? You'd need, like, a whole new set of queues. You'd need to get every service lined up to listen to the new topics and to react differently, and it would be a bit of a communication problem. There could be a lot of OCM involved in getting something like that rolled out. But it's a super elegant model, right? Because, you know, the logic just arises. It's kind of beautiful in a way. As more events come on board and the processes get bigger, though, this can become extremely complicated. Like, it's one thing not to be able to pinpoint the logic; it's another just to have no idea what the state of anything is, because there's a thousand things involved in it. If you want to know the status of all the in-process workflows, that's a little bit trickier. You might need to build another service that listens to all the events coming off the services that are involved and starts to build up its own reporting view of the world, right? You can do it, but it could be a little bit complicated. So I tend to find that these things work really well almost at the end of your business process.
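The choreography just described can be sketched in a few lines: no orchestrator anywhere, just services subscribing to each other's events, with the process emerging from those subscriptions. The bus here is a plain in-memory stand-in for Service Bus topics or Event Grid, and all the names are invented for illustration.

```python
# Choreography sketch: there is no central orchestrator. Each service
# reacts to events; the workflow emerges from the subscriptions.
# The in-memory "bus" stands in for topics/subscriptions on a real broker.
from collections import defaultdict

subscribers = defaultdict(list)
audit = []   # trace of what happened, in order

def publish(event, payload=None):
    for handler in list(subscribers[event]):
        handler(payload)

def on(event):
    def register(handler):
        subscribers[event].append(handler)
        return handler
    return register

@on("WellbeingRecorded")
def hr_service(_):
    # HR can't use the data yet: it notes the workflow started and waits,
    # a mini process-manager state of its own
    audit.append("hr_waiting")

@on("WellbeingRecorded")
def correspondence_service(_):
    audit.append("email_sent")
    publish("EmailSent")             # raises its own event when done

@on("EmailSent")
def hr_service_continue(_):
    audit.append("hr_used_data")     # only now may HR use the data

publish("WellbeingRecorded")         # the dispatcher's only job
print(audit)
```

Notice there is no line of code you can point at and call "the workflow"; the ordering constraint (email before HR uses the data) lives entirely in which events each service subscribes to.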
So maybe the process manager is the thing that you want to actually orchestrate and organize your logic with, because there's a business process there: you can almost find a document for it in the organization and codify it. But at the end of it, to allow other services to react and to do things that are unrelated to that process, raising events is a really nice way to let the rest of the world know that your thing has finished. So again, it's a way you can run a long-running, distributed, serverless workflow. Another tool to have in your belt. It's pretty cool. Choice four. That was interesting: choice two straight to choice four. I don't remember choice three. There's no choice three; I'm just seeing things. Choice four is a pattern... hold on, I've lost a slide somewhere. How did I do that? Oh my God. Right, hold on. Two seconds. That's so embarrassing. Where did it go? How is the undo buffer going to work for me? This is why you should never do things last minute. I'll tell you what, we don't need a slide for it. We can imagine it. We can just imagine it. It was a pattern called the saga pattern. Right. Oh, hold on. Is that going to wait for me? No. Something has to go wrong in the last talk of the day, right? Oh, I didn't touch anything. Perfect. All right. There's a pattern called the saga. That's not working as well, so that was life for this one anyway. Imagine you stepped back into the 1920s and you went into a big organization and you said, "Hey, Acme Corporation, how are you going to run this long-running business process?" "OK, well, we're going to start in this department. We're going to get a big brown manila envelope and we're going to put something inside it that details the process that we need to happen. We're going to do a little bit of it in this department over here, we're going to write a list of names on the front of that envelope, and we're going to put it onto the internal mail, and then the mail person will take that."
They'll look at the next name and say, "Okay, I'm going to take that to department two over here." That department opens it up: "Okay, I can see what we're trying to do here. I can see my bit of the instructions. So what I'm going to do is do my bit, then I'm going to look at the front of the envelope, cross my name off, and put it back onto the mail truck," and so on and so on. It will go through all the different departments. If something goes wrong during that, the department that had the problem will put some information about what went wrong in there, and then they will start the envelope going back through the departments that it came from, giving each department an opportunity to roll back or, rather, compensate, because we can't always roll back in these worlds: compensate for the behavior that they enacted upon that workflow. And then it will go all the way back to the start, giving everybody a chance to compensate. So we refer to that as a saga, basically. And with the tools that we've just used, we can implement a saga really nicely. We don't need a durable function, and we don't have events in the same way. Instead, the wellbeing process starts off with a simple message that it puts on Service Bus, with the names of the queues it needs this message to flow between in the headers. It puts the message onto the bus, where it goes, in this case, to the correspondence service first of all, which sends its email. It then looks at the headers and says, "okay, cool, now that message needs to go to the HR broadcast topic," and so on. And everything interacting with that has a chance to (a) add some business value to it and (b), if things didn't go well, just start the message going back up the chain instead. So that's three very different approaches to how you could model a workflow using sequences of queues, messages and events, and they've each got their pros and cons associated with them.
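The manila-envelope story is essentially the routing-slip form of a saga, and it can be sketched in a few lines. Everything here is invented for illustration: the message carries the list of steps, each "department" does its bit and crosses itself off, and a failure walks the message back through compensations.

```python
# Routing-slip saga sketch, as in the manila-envelope story: the message
# carries its remaining steps; each handler does its bit and forwards it on;
# a failure sends it back through compensations. Names are made up.
def run_saga(steps, handlers, compensations, fail_at=None):
    done = []
    for name in steps:                    # the envelope moves department to department
        if name == fail_at:
            # something went wrong here: walk back through everyone who acted
            for prev in reversed(done):
                compensations[prev]()
            return ("compensated", done)
        handlers[name]()
        done.append(name)                 # cross the name off the envelope
    return ("completed", done)

log = []
handlers = {
    "correspondence": lambda: log.append("email_sent"),
    "hr":             lambda: log.append("hr_updated"),
    "payroll":        lambda: log.append("payroll_updated"),
}
compensations = {
    "correspondence": lambda: log.append("email_retraction_sent"),
    "hr":             lambda: log.append("hr_rolled_back"),
}

status, _ = run_saga(["correspondence", "hr", "payroll"],
                     handlers, compensations, fail_at="payroll")
print(status, log)
```

Note the compensations run in reverse order of the forward steps, and they are business actions ("send a retraction email"), not database rollbacks: that's the compensate-rather-than-roll-back point from the story.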
And there's not a wrong or right answer to these problems. Just look at the pros and cons against the business criteria, come up with a good decision matrix, and work out which is the better approach, because they'll all get you where you need to be. Pick the approach that fits the problem you've got: that would be my main advice. Now, there's number four. Let's see if I can get this working. Oh my word. Now PowerPoint has gone into one of those super weird modes. Where is my screen? Now it's frozen. Awesome. Sorry. I've got time, I can make this work. All right, slide one, two... things are moving, awesome. Good. Advance... HR... choreography... choice four: third party. So we've used primitives so far to deliver our workflows. We've gone really raw on queues and events and service buses and event grids, which are all awesome building blocks. But the thing is, once you get into more complicated scenarios with messaging, you find yourself having to write a lot of abstractions on top of those to really let you just focus on writing the business logic and get away from the idea of, like, queues and subscriptions and all that boilerplate stuff that we'd rather avoid. And it turns out there's a whole bunch of pretty cool libraries out there that can really help you with this. They sit on top of our services, not just ours but other clouds as well; they abstract that away from you and give you an abstraction layer that makes much more sense in messaging applications. So I've picked on MassTransit here. There's obviously NServiceBus as well, and I'm sure there's a whole bunch of other ones. So another way you could rewrite this is to use, in this case, a container app, which is a great way to run a long-running process in Azure now, which we've only just got, which is awesome. It can sit there as a daemon process, and MassTransit will deal with all the subscriptions to the queues and things like that that we need.
And then it will just allow us to write our business logic that defines what happens when the different messages start to come in. Now, these things are awesome, but the only thing I'd say to be aware of is that they really need you to opt in other parts of your organization as well. Because if you've just got one service running on one of these abstractions like MassTransit, well, there's a lot of stuff it's doing behind the scenes to deal with subscriptions and routing and things like that. So either get your other services using the same thing, or it's going to be difficult to have only one service in that world. So again, another option: if you're all in on this, it could be a really good way to go, because you get the abstractions given to you, and they leverage the underlying services and get all the first-class goodness from them. So there you go. I'm not going to go further than that, because I'm not going to have time for event sourcing. There's a whole bunch of considerations on top of what we've spoken about today that you really need to think about in this kind of architecture: your messaging, your hosting services, your processing type, but then correlation IDs, durability, partitioning, back-off, retries, idempotent receivers, you name it. All sorts of stuff that is going to be important when you're building these architectures. But I hope that I've given you a little bit of an insight into the Azure services and the way we can put them together and leverage them to help build these distributed, event-driven architecture type applications. So thank you very much. Thanks for bearing with me when it all went wrong in my little slide deck there. And any questions that you have, feel free. I think we've got a few minutes left, so go for it. Yes, go for it. What is request-response? It probably applies less to an event and more to a message, really. Although, having said that, I did look earlier.
So, thinking about Service Bus — the question was, what is request-response, in consideration of an event. In Service Bus, these things are all just messages at the end of the day, no matter whether you receive that message on a queue or a topic. In Service Bus, queues are kind of like point-to-point, where you put a message on a queue and there's a single thing that reads off that queue. A topic is this idea of one-to-many, where you put a message and point it at a topic, and you can have different subscribers that will all receive that message in their queue; they'll all get their own copy of it. And request-response is just this idea that you can actually put a message back onto a queue that refers to the original message. So it just means a message flows back down into the originating system, and whether that's on a subscription or point-to-point doesn't really matter in the Service Bus world. So a good example might be, let's say you have got a system that wants to put a message on to drive a live online auction, where you might have lots of different services that can potentially put something onto that page. So you might put it onto a queue, but then each one potentially will respond within, let's say, a third of a second. The first few that come in with a response, maybe you'll use that to drive your UI. Cool. Go for it. Yeah, it's not a big lag. In all honesty, the change feed gets updated the moment that you persist a document, right, that's a very quick thing. The lag would be in your own thing that polls the change feed, because that's a polling model, and you probably don't want to smash the change feed because you'll get rate-limited against it. So you tend to go maybe every second or something like that, but really it should be no more than a couple of seconds before you get the change coming in, unless there's thousands of changes, and then obviously you've got to work through them.
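The request-response idea from that first question boils down to two properties on the message: a reply-to address and a correlation ID, so the original sender can match the reply to its request. A minimal local sketch of that shape, with in-memory queues standing in for the real broker and all names invented:

```python
# Request-response sketch: the reply is just another message, carrying the
# correlation ID of the original so the sender can match it up. A stand-in
# for reply-to / correlation-id message properties, not a real broker API.
import queue
import uuid

request_q, reply_q = queue.Queue(), queue.Queue()

def responder():
    # one of the many services watching the request queue
    msg = request_q.get_nowait()
    reply_q.put({
        "correlation_id": msg["id"],            # refer back to the original
        "body": f"quote for {msg['body']}",
    })

corr_id = str(uuid.uuid4())
request_q.put({"id": corr_id,
               "body": "lot-17",
               "reply_to": "reply_q"})           # where the answer should go

responder()

reply = reply_q.get_nowait()
print(reply["correlation_id"] == corr_id)        # True: this answers our request
```

In the auction example from the talk, many responders could each drop a reply onto the same reply queue; the correlation ID is what lets the UI pick out which replies belong to which request.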
Are there any other types of patterns for other scenarios, like retry? Absolutely, yeah. I think if you only need to transact against a single system — if your service, like your microservice, owns its own data — that's a really good way of doing it, and you can always retry against the downstream system. But, you know, again, it depends. If it's super critical that you transact across both of those systems, then the outbox, I think, is a really good way to go. If you can deal with not sending a few of those messages, then you can definitely keep it simpler by having a bit of back-off and retry and then just saying, "I couldn't do it for that one." But I think it comes back down to your business requirements and how important it is to send that message. Cool. All right. Going once, going twice. End of conference. Approaching Richard's locknote. Thank you, everybody, for coming today.
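The outbox idea mentioned in that last answer is worth a sketch: write the business row and the outgoing message in the same local transaction, then have a relay publish from the outbox afterwards, so you can never save the data without (eventually) sending its message. A minimal illustration using SQLite as the local store; the tables, names and event shape are all invented:

```python
# Transactional-outbox sketch: the business write and the outgoing message
# are committed in one local transaction; a relay publishes unsent outbox
# rows later. Illustration only; table and event names are made up.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wellbeing (emp TEXT, score INT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, "
           "payload TEXT, sent INT DEFAULT 0)")

def record_wellbeing(emp, score):
    with db:   # one atomic transaction: either both rows commit, or neither
        db.execute("INSERT INTO wellbeing VALUES (?, ?)", (emp, score))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"event": "WellbeingRecorded", "emp": emp}),))

def relay(publish):
    # a background process: publish each unsent row, then mark it sent.
    rows = db.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

published = []
record_wellbeing("E42", 7)
relay(published.append)
print(len(published))   # 1
```

One caveat worth knowing: if the relay crashes between publishing and marking a row sent, it will publish that row again on restart, which is exactly why the outbox pairs naturally with deduplication or idempotent receivers on the consuming side.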