Video details

You want to code, ship and own your code? Build your Observability muscle by Gregory Ouillon

Career
01.05.2022
English

As DevOps teams increase their velocity, they are facing big questions all along the software lifecycle, e.g. how to:
Develop high-performance, production-ready code that can be easily supervised
Automate release management, testing and performance validation down to production
Detect, understand and resolve performance issues faster along the full stack
Tie software performance to customer experience and business performance
Build a data-driven team culture and implement meaningful SLOs
Boost collaboration and share insights between Developers, SREs and Engineers
Take full ownership of their code and architecture in production and continuously improve
In this session, we will discuss how high-performance DevOps teams build their Observability muscle to deploy ever faster with greater confidence and take full ownership of their software.

Transcript

With me I have Gregory Ouillon. I hope I pronounced your last name correctly, Gregory? Kind of, yes, that's okay. So Gregory is here with us today to talk about DevOps and observability. With this, I think the floor is yours. Gregory, please take it away.

Yeah, thank you very much, Andres. Hi everyone, good afternoon. The title of this talk is about ownership of code across the development cycle, and how I believe that observability has become, over the last few years, an absolute capability and muscle that developers, DevOps teams and SREs should develop to perform at their best. I'm Greg, field CTO EMEA for New Relic. Over the last two and a half years I have been meeting with hundreds of team leaders in the software and digital space, so I'm happy to bring a few best practices and observations about what works when it comes to achieving high performance in the dev cycle.

What we'll cover today: very briefly, try to define observability the way I see it; then how it relies on making sure you properly instrument your stack; then how it helps you manage performance, deploy and collaborate faster, better understand your customers but also your product and business teams, and build a virtuous circle of continuous improvement. And then we'll have a few wrap-up takeaways.

So, observability, where does it come from? As we have all known for many years now, DevOps can deliver two things concurrently: velocity, deploying fast and often, but also developing high-quality code at the same time. It all sits on this idea that if you can give teams a bit more autonomy and decoupling, if you can let them automate the way they release their deployments, and if you can make sure that with this autonomy they are empowered and own their code, then you will get the performance. It's also this idea that you do well what you do often.

Now, that does put quite an onus on DevOps and development teams, right? You code it, you ship it, you own it, you fix it, and if you're not there already, there's a good chance that you will be soon. What that means is that businesses expect you to deliver innovation and agility: features to grow the business, great customer experiences and, as the foundation of it all, assured availability, reliability and performance. That's really the foundation for everything. But they also want, as the company grows, for you to scale your teams, scale knowledge, and improve the efficiency and unit cost of your development and your architectures. For you, that means you have to be able to release fast and often, you have to be able to detect and resolve issues as productively and as fast as possible, and you need to get into that blameless continuous improvement that will take you to the next stage. And if you have time, manage a bit of technical debt.

One of the realities I see right now is that a few years ago we had the shift-left movement: leaving it to developers to own their code and the quality of their code, testing early, doing more automated unit testing and then staging, and developing high-quality code. The reality is that what we see right now is actually a shift right.
That shift right is about the fact that, with the acceleration of deployment cycles and continuous delivery, more and more of the tests done in staging can only be partial: load testing, smoke testing. A lot of the actual measurement and validation of the stack and the release is done in production. Why is that? Because there is a lot of velocity and fragmentation in the code. It is very hard to maintain an entire environment in staging that represents production, and it is extremely difficult to catch 100% of issues before production. It is also in production that you get real-life usage conditions with all your APIs and third-party dependencies, that you get feedback on customer experience, and that you can observe business uptake. What that implies is that developers and DevOps teams are getting involved across the whole development lifecycle, at all stages, and they need to understand and measure their code at all stages.

It's also very true because the modern app is obviously different: new microservice architectures, cloud, DevOps. Yesterday we still had monoliths. They were owned by one team, there were a few releases per year, we still had servers we hosted that had names on them, and we could still keep up with our CMDB. With that, it was easy to monitor and alert on a few parameters in the stack and look at the logs. Today, with potentially hundreds of microservices developed by tens of teams, with potentially thousands of releases per day, all deployed on containers, some of which have lifespans of minutes, the architecture has become far too fragmented, complex and volatile for you to monitor it in a static way. So observability is there to essentially allow you to instrument all of the stack, end to end, across systems, bring all the telemetry into one place, and from that telemetry get a real understanding of how your stack behaves and, if there are issues, why they are developing.

Today, most of the teams I meet with face different realities. Sometimes their monitoring landscape was born from history and archaeology, or sometimes from very disparate choices by teams picking their best-of-breed tool for their part of the stack. The reality is that very often there is a high level of fragmentation and data silos across the various telemetries. It can be by telemetry type or it can be by stack. But the risk is that if you have this disjoint telemetry, you cannot easily correlate it, so you do not get the insights from those correlations and that end-to-end understanding. It unavoidably increases the time it takes to resolve issues, because you have to go to multiple teams and multiple tools. It is also harder to stitch together, and there is quite some effort and maintenance needed to keep the data current. And it only reinforces the blame game: I still see too many teams where each team has their own tool, essentially to conclude that their part of the stack is okay and the issue must be somewhere else.

So observability is about unified telemetry. The whole idea is that you are going to instrument, through all sorts of technologies, your whole infrastructure, from bare metal, VMs and hosts to containers and Kubernetes clusters, and on top of that instrument all of your software, your code, with application performance monitoring, but also your middleware and all of the cloud PaaS that you have in your stack.
And on top of that, you will also instrument the digital experience, in the browser, on the mobile device, or leveraging synthetics that emulate humans to test your stack on an ongoing basis. And on top of that, you will add business data; I will come back later to why it is so important to also easily bring in business data. All that instrumentation can be based on vendor agents or open source, as long as it all goes into one unified telemetry store where the data gets correlated and aggregated, and where each component of the stack is represented by an entity that carries all the telemetry of that component.

When you have that, this is how you can understand how all these components behave and how they relate to each other, and so understand the performance of your stack. That's how you can detect issues and resolve them faster, and that's how you can therefore deliver strong SLOs and SLAs on performance, which allows you to translate that into customer experience and therefore business conversion. It also allows you, at your own pace, to progressively eliminate toil: as you bring data together, you eliminate some tools, you eliminate some silos, and you have less work to do to maintain one database, one view of your telemetry. And then it allows you to innovate faster, because you can measure your releases easily.

So it all starts with instrumentation. Today, instrumentation needs to be holistic and open. I think the time is gone when you would instrument a stack by choosing a vendor and then installing only their agents. What we see today is that the challenge is to aggregate and collect metrics, events, logs and traces. That's MELT, the lingua franca that all observability engineers should speak. They will collect that telemetry from agents, which can be built by vendors, open source or not, and optimized for a particular part of the stack; or they can leverage integrations built by those vendors for multiple software stacks and components; or they can leverage open source telemetry, ingesting Prometheus metrics, Jaeger traces, StatsD, or data from log forwarders. You can also ingest custom data from any API that lets you push data in. And you can progressively move to OpenTelemetry. OpenTelemetry will play a center-stage role in the coming years in bringing a standardized approach to data models and specifications for telemetry data. More and more, we also see technologies like eBPF, which has been around for a while, taking center stage as well, for instance to provide deep telemetry inside Kubernetes clusters.

You will also instrument your middleware and your cloud PaaS components. Why is that? It seems obvious, but I keep meeting teams where the dev teams have their APM, the infrastructure teams have their infrastructure monitoring, and there is a huge, crying gap in the middle on middleware. Most of the time, application issues surface in that middleware, whether caused by infrastructure or by software, yet somehow this remains a key gap. So if you want visibility into the health of your stack, you need to instrument those middleware components.

And obviously logs. For some time, people selling APM or infrastructure monitoring would tell you that logs were a dead thing. Actually not: logs contain some of the memory and intellectual property of developers. They might be unstructured, they might be legacy.
Some people might have lost track of what's in these logs, but they are very precious data. And obviously, if you bring them into an observability platform, you can do what you usually do with logs on a log platform, but you can do better: you can correlate those logs with other types of telemetry. What that allows you to do, for instance, is that if you can propagate a trace ID or a transaction ID in your logs, you get the ability to instantly select and filter the logs pertaining to a particular transaction or trace. That's a huge saving in time and effort, you no longer have to swap context between platforms, and your logs take on a whole different meaning.

The other aspect of logs is that they are interesting beasts you can do fun things with, like detecting patterns. An observability platform can digest millions of log entries and detect a catalog, a dictionary, of patterns that recur in your logs. You can give those patterns meaning, and from that meaning you can create alert policies and visualizations that give your logs a new lease of life and new uses.

Obviously, observability is also all about automation. Most of the teams we work with want to do this as code. They want to ship code already instrumented. So observability platforms don't live in a vacuum: they should not be just an operational tool, and therefore they have to be able to talk to the development environment. They have to make it easy to instrument or fix issues from the IDE. They should be deployable entirely through your CI/CD pipeline, and whenever you push code you can push deployment markers that will be useful later on to understand changes that could cause issues in production or staging. You also want to enforce that whenever you build dashboards, alerts or service level objectives, all of these are pushed consistently at every deployment, for every release, so that you ensure consistency of monitoring and telemetry. As soon as the systems are in production, they start reporting, and that gives you precious data that you can use, for instance, for your release management or your load testing, allowing you to do continuous validation of your releases and deployments in staging and production. And obviously, downstream, all the issues, alerts and incidents raised by the platform can be cleared in the platform, but can also be integrated into the collaboration and ticketing tools your workflows are using. So it's very important to enable this kind of end-to-end automation of observability as code.

Now that you are instrumented and deployed, observability is going to help you manage the health of your systems, detect issues and resolve them quickly. The principle of full-stack observability is that you can navigate across the stack because all the data has been aggregated and correlated in a consistent way. You could start from application performance monitoring, investigating a transaction, looking into a stack trace, and moving to a distributed trace. You might then see all the related infrastructure components underneath it and move into your Kubernetes cluster or into a middleware component. And on and on: you can navigate the full stack to isolate where issues might be and resolve them faster.
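As a rough illustration of the trace correlation in logs described earlier in this passage, here is a minimal sketch, using only the Python standard library, of propagating a trace ID into every log line so that logs can later be filtered per transaction. The field name `trace_id`, the logger name and the ID generation are assumptions for illustration, not any specific vendor's or standard's format.

```python
import contextvars
import logging
import uuid

# Context variable holding the trace ID of the transaction currently being processed.
current_trace_id = contextvars.ContextVar("current_trace_id", default="none")

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record before it is formatted."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(order_id: str) -> None:
    # In a real system the trace ID would come from the incoming request
    # (for example via W3C Trace Context headers); here we just generate one.
    current_trace_id.set(uuid.uuid4().hex)
    logger.info("processing order %s", order_id)
    logger.info("payment authorized for order %s", order_id)

handle_request("A-1042")
```

With a trace or transaction ID present on every line, filtering the logs for one user journey becomes a single query instead of a cross-platform hunt.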
Observability still comes a bit from APM, so you will start with the good old SRE golden signals for your key transactions: latency, throughput, errors, saturation. You may also introduce a score like Apdex to better compare transactions in a unified model in terms of customer satisfaction, response time and quality. Once you have those basics, you can obviously leverage distributed tracing. We see that distributed tracing is still hard to use: a lot of users still find it difficult to select a specific trace and move into deep troubleshooting and analysis of traces. So distributed tracing is evolving. First, it is becoming fully open, spanning OpenTelemetry, OpenTracing and custom or proprietary agents, so that you can really stitch traces end to end and get much better visibility. But we are also now able to group traces by topology: we can recognize all the traces that are similar, give them a meaning, and give you health signals at the trace-group level, which lets you much better understand the health of that group of transactions and then dig into a specific trace for deep debugging. Distributed tracing is set to take on ever more importance with microservices, but in the customer base it is still, I would say, taking off.

As modern developers and DevOps teams, one of the key points is that you own the full stack, meaning that you don't want to own just your transactions. You want to understand how the underlying containers and Kubernetes clusters behave, you want to understand your hosts and how they might be running out of capacity, you want to understand your network, you want to do some debugging at the Kubernetes cluster level, you want to go into traces. Very importantly, what that means is that you get an end-to-end, full-stack, shared understanding of what's going on. Resolution is much faster because you get that visibility across the various parts of the stack.

Another thing we see changing in observability is that, more and more, architectures are very large and complex; there are a lot of components in the architecture. So there are new, top-down ways of looking at your architecture: looking at the entire estate of all your components and being able to show at a glance which ones are, for instance, alerting. Based on that, you can immediately get a sense of which part of the stack is starting to alert and what the blast radius of an incident could be, and therefore home in very rapidly on the components where you want to start your investigation and troubleshooting workflow. Instead of starting bottom up, you move top down to better detect where issues are starting.

And you can also get a little help from the machine, right? MLOps and AIOps are coming into those observability platforms in a way that helps you see across your entire estate. For instance, by monitoring all the telemetry of your stack, without you having to do anything, the platform can spot what might be a developing situation. It could show you, for instance, that a particular transaction has an error rate that has exploded within the last five minutes compared to the last hour. Maybe it has not turned into an incident yet because you don't have an alert defined for it. So the ability to detect changes in behavior very early, either across the full estate or through anomaly detection, can tell you the unknown unknowns. It can surface issues in your stack that will turn into incidents within minutes. And obviously ML is also there to give you root cause analysis.
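To make the "error rate exploded in the last five minutes compared to the last hour" idea concrete, here is a minimal sketch of that kind of baseline comparison. It is a deliberately naive illustration, with fixed windows and a simple ratio threshold, not how any particular platform's anomaly detection actually works.

```python
from dataclasses import dataclass
import time

@dataclass
class RequestSample:
    timestamp: float   # Unix epoch seconds
    is_error: bool

def error_rate(samples: list[RequestSample], window_seconds: float, now: float) -> float:
    """Error rate over the trailing window ending at `now`."""
    recent = [s for s in samples if now - s.timestamp <= window_seconds]
    if not recent:
        return 0.0
    return sum(s.is_error for s in recent) / len(recent)

def looks_anomalous(samples: list[RequestSample], now: float,
                    short_window: float = 300, long_window: float = 3600,
                    ratio_threshold: float = 3.0, min_rate: float = 0.01) -> bool:
    """Flag when the 5-minute error rate is several times the 1-hour baseline."""
    recent = error_rate(samples, short_window, now)
    baseline = error_rate(samples, long_window, now)
    if recent < min_rate:     # ignore tiny absolute error rates
        return False
    if baseline == 0:         # errors appeared out of nowhere
        return True
    return recent / baseline >= ratio_threshold

# Example: a transaction that was healthy for most of the hour and is now failing.
now = time.time()
samples = [RequestSample(now - 3000 + i, is_error=False) for i in range(500)]
samples += [RequestSample(now - 60 + i, is_error=(i % 2 == 0)) for i in range(60)]
print(looks_anomalous(samples, now))  # True: recent error rate is far above baseline
```

A real platform would do this continuously, per transaction and per signal, which is what allows it to surface a degradation before anyone has written an alert condition for it.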
So it's late at night, you're on call, it's 2:00 a.m., your pager is ringing, and all of a sudden you realize that you have an issue to solve. Well, God bless you: ML has been able to put together a full root cause analysis for you, giving you context, proposed runbooks, which teams might need to be engaged for troubleshooting, lots of context that will help you be operational very quickly and maybe reduce your stress.

What we also see is that observability is challenging the CMDB concept. If you can get a real-time view of all the components in your stack and how they relate to each other, horizontally through transactions and microservices or vertically through the stack, then you get it right, in real time: this is what your stack looks like. And if you can zoom in and zoom out in real time, get a detailed view of the related entities for whichever entity you look at, understand the infrastructure components and the middleware, and see the alerting state, then that's a winner. All of that can also be filtered through tagging and filtering so that you can get exactly where you want to be in your complex architecture.

So you have far fewer incidents and you resolve them much faster. Now you can focus on accelerating your deployments and increasing the confidence that your deployments are successful. Obviously, releasing has evolved a lot over the last few years, from basic before/after comparisons to all sorts of new strategies, from blue-green to canary to dark launches and feature toggling. The role of observability there is to make sure that you measure the non-functional performance criteria that will tell you whether those deployments are successful and whether a new release performs as well as or better than the previous ones. Relying on tagging, versioning and dashboards, you can really swarm your deployment teams around dashboards to build strong confidence that the deployment is successful. Many companies also integrate their observability with load testing or release management tools. That allows them to define automated criteria for accepting or refusing a release, or even triggering an automated rollback based on performance criteria. I've seen companies able to deploy a new version, measure it and decide to roll back within 60 seconds.

Releasing fast is a muscle that you build progressively. What you learn is to define which KPIs are actually relevant to the success of your releases. Maybe it starts with very basic things like latency and error rate, but it will progressively extend into completely different dimensions: crashes, middleware behavior or saturation, but also business impact or actual customer experience. There's no one truth here, and it's important that teams progressively work out, with their SREs and their business, the criteria that define a successful release.

One of the keys to deploying fast is fixing fast. What we also see is that with all the observability data in one place, you can have a unified view of all your errors, and that gives you a very good sense of the urgency and priority of those errors, their criticality and their impact on the business. It also gives you a sense of prioritization for your incident and problem management and your technical debt, maybe for the next sprint. It allows you to inform decisions about fixing, to assign an incident or an error to a team for resolution, and to collaborate in real time.
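As a sketch of the kind of automated release gate described a little earlier, the snippet below compares a canary's error rate and latency against the current baseline and decides whether to promote or roll back. The thresholds, metric names and data structures are assumptions for illustration; a real setup would pull these numbers from its observability platform's API inside the deployment pipeline.

```python
from dataclasses import dataclass

@dataclass
class ReleaseMetrics:
    error_rate: float      # fraction of failed requests, e.g. 0.02 = 2%
    p95_latency_ms: float  # 95th percentile response time

def release_decision(baseline: ReleaseMetrics, canary: ReleaseMetrics,
                     max_error_increase: float = 0.01,
                     max_latency_ratio: float = 1.2) -> str:
    """Return 'promote' or 'rollback' based on simple non-functional criteria."""
    if canary.error_rate > baseline.error_rate + max_error_increase:
        return "rollback"  # error rate degraded beyond tolerance
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"  # latency degraded by more than 20%
    return "promote"

baseline = ReleaseMetrics(error_rate=0.004, p95_latency_ms=180.0)
canary = ReleaseMetrics(error_rate=0.031, p95_latency_ms=190.0)
print(release_decision(baseline, canary))  # rollback: error rate jumped
```

A CI/CD pipeline that runs a check like this a minute or two after the canary starts taking traffic is how the "measure and roll back within 60 seconds" scenario becomes possible.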
So I really like this idea that, again, top down, you get all your errors in one place, which gives you a sense of what you should work on as a priority, as a team, to get the maximum return for the business and for customers.

And it only gets better. I think what we will see now is telemetry data reaching the integrated development environment of our developers. It will allow them to stay in their preferred environment and actually fix issues in production directly from their IDE. They will receive a message; that message will contain a description of the error in production from their SRE; it will contain a stack trace; they will click on that stack trace, which will open directly onto their code in their Git repository, and they will be able to see which line of code might actually be causing that error in production. They will be able to commit the fix and deploy, and within minutes that error in production will have been fixed directly from the IDE by the developer. The same mechanisms will allow developers to collaborate on pull requests, on code quality and peer reviews, but also to instrument code with observability directly from the IDE.

So you fix issues fast, you have fewer issues, you deploy fast. Now you'd like to make sure that you understand what that does to your customers and to your business. The first thing you need is obviously to understand your front-end experience. But it's not just the front end. It's the ability to understand all the technical and perceived metrics, like the Core Web Vitals, all the crashes, errors and stack traces, all that technical material, but it also needs to be session traces that go all the way from the endpoint down to the back end. What customers really care about is the end-to-end experience. It's not just the front end; obviously optimizing the front end is important, but many of the issues that customers feel during their journey are caused by back-end issues: payment microservices, latency, errors, et cetera. So it is very key that you can tie front-end, mobile and web experience, or synthetics, together with the end-to-end view of the back end.

And once you get clear about the technical criteria that influence the performance of your stack, you can start to relate them to your funnel conversion, customer engagement and satisfaction. You start to build the muscle that tells you that if you degrade a certain parameter, like response time, you start to see your conversion fall in some key areas of your funnel and in some of your key transactions; you start to be able to measure frustration or bounce rates based on the degradation of performance criteria. Once you are in that spot, you start to produce much better code, you validate your releases much faster, and you increase your customer experience and satisfaction.

And then you can change the conversation with your business and product teams. What we see more and more is that companies that have developed their observability muscle actually enrich their telemetry data with business data. It can be in the form of attributes: whenever a key transaction executes, you also send some key attributes of that transaction, maybe the loyalty tier of the customer, maybe the product class they are consuming, so that faceting and pivoting analysis of the data becomes much more powerful and fine-grained. But they can also send entire streams of business metrics.
They can track revenue, they can track supply chain stocks, anything that is relevant to describe their business. What that allows those teams to do is to create a close, real-time tie between the performance of their IT stack and software and their business. Once they have that, they can develop a strong visual sense of what their business is and how it behaves, and make sure that each team, whether business, marketing, product or operations, gets a real-time view of where the business is. It will not compete or overlap with their BI and analytics data warehouse stack, but it will give them that sense of losing or winning customers in real time.

Once you have it all, fewer issues, fixed faster, deploying fast, understanding the customer and the business, then you want to move into continuous improvement. The practice of observability requires that you establish baselines: defining normal. And defining normal does not mean defining good; it means defining normal, as in: if I do nothing, how does my stack behave in its normal state? From there, you get into the art and craft of defining targets for those key indicators and maybe turning them into service level objectives. Again, there is a vast choice and array of indicators that can turn into an SLO. It all depends on your services, on your architecture, on what your product teams are after. It can be business related as a primary priority, it can be infrastructure driven. What is most important is this idea that you define normal and you define the objective, and when the objective sits above what is normal today, that means you have some work to do: some continuous improvement, some technical debt to actually clear.

And then you want to manage SLOs, and you want it to be easy. You want to be able to represent SLOs in a standard way so that each team can have a good sense of their SLOs. They can own them, they can see when they breach them, they can be proactive about meeting their SLAs. But what I see as most important is that most of the winning teams work on a well-defined set of SLOs and SLIs that describe their stack end to end, and they show these consistently, on big TV screens in the rooms and on dashboards. This is the lingo that they all speak and share to describe the health of their stack, from the lowest levels of infrastructure to the business.

So let's close with a few takeaways. Observability is all about having all your data and telemetry in one place so that you can correlate it and understand it across the entire stack, across all your systems. Everybody starts from a different place in terms of observability. Some will still be fighting with reactive incident management and customer calls telling them that they have an issue, but progressively they will move up the ladder to become much more proactive performance managers. When you have fewer incidents and you solve them faster, you've got more time to spend on improving the resiliency and performance of your stack, and you get much more proactive at detecting potential degradations. You become a performance manager. And once you are at that stage, you are ready to start caring about what your customers experience.
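To illustrate the SLO idea from the continuous-improvement discussion above in the simplest possible terms, here is a small sketch that compares observed availability against an SLO target and reports the remaining error budget. The target, the request counts and the implicit 30-day window are arbitrary assumptions, not a recommendation.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare observed availability against an SLO and report error budget usage.

    slo_target: e.g. 0.999 for "99.9% of requests succeed over the window".
    """
    allowed_failures = total_requests * (1 - slo_target)
    availability = 1 - failed_requests / total_requests
    return {
        "availability": round(availability, 5),
        "slo_target": slo_target,
        "allowed_failures": int(allowed_failures),
        "budget_consumed": round(failed_requests / allowed_failures, 2),
        "slo_met": availability >= slo_target,
    }

# Example: 5 million requests over a 30-day window, 3,200 of them failed.
print(error_budget_report(slo_target=0.999, total_requests=5_000_000, failed_requests=3_200))
# budget_consumed 0.64 -> roughly two thirds of the window's error budget used.
```

The same shape of calculation works for latency-based or business-related SLIs; what changes is only which telemetry you count as "good" versus "bad".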
You have automated your deployment cycle, you can compare versions, you can detect frustrated customers and friction points, and ultimately you move into the data-driven digital stage, where you have enough custom and business data that you can completely tie the performance of your stack to the performance of your business.

So that's why I say we should all build our observability muscle. Like everybody, we all have different sizes of that muscle. But if you have observability as part of your development cycle, you will spend much less time solving issues, so you'll have more time for developing cool stuff or making your stack more resilient. You will have a better connection with your business, who will be more engaged and confident in how you manage your technical stack and your technical debt, and you will be able to relate that to how it translates into customer experience. From an execution standpoint, it means that you will have control of and confidence in your stack, that you'll be deploying much faster and more often because of that and because of your ability to measure instantly, and that you'll be able to scale not just your architecture but your teams, because you will have a shared body of knowledge across the teams that allows you to scale.

I can't emphasize enough that observability provides one source of truth. For many teams who have been working in silos for many years, doing their best and working hard, being in a situation where all of a sudden everyone is working from the same page is a huge benefit in terms of collaboration and morale. And obviously observability is a way to improve by learning across the full stack, collaborating with others and growing your career.

So with that, I would like to thank you very much. I hope I was able to illustrate some of the benefits of observability. Join us at the booth, join [email protected], and if you want to play and learn with New Relic and with observability, we have a perpetual free tier. I put the link here, so just try our free tier and build your muscle. Thank you very much, and we'd like to take questions.

Well, thank you very much, Greg, for this very thorough presentation on observability. I personally like very much the four-step model that you presented on your summary page. For now I have to say that I don't see any questions in the chat; apparently attendees are being very shy today, as has happened in all the sessions. But I do have one thing that I would like to ask you, which is: you mentioned the possibility of different tools and vendors being able to collaborate with one another because of open tracing. How is that working out so far? Is there some sort of foundation, a place where vendors, consumers, everybody can come together and discuss how these things should work?

Yes. What we have seen over the last few years is that initially distributed tracing was developed by APM vendors, so it was proprietary technology. Like New Relic: you installed a New Relic agent, it was able to stitch traces automatically across the deployed agents and it would give you tracing, but it was proprietary. Then we had the open source stage, with OpenTracing, Jaeger and Zipkin, and some platforms like New Relic started to accept those formats, so that if you had one part of your stack instrumented with New Relic and another part instrumented with OpenTracing or some other open source tooling, we would be able to stitch those traces together into an end-to-end trace.
Then there has been a lot of work done in the W3C on trace context propagation, so there is now a standard, and it feeds into OpenTelemetry: the way you describe your headers and structure your distributed traces is now becoming a full standard. It is generally available, and I think anyone looking at observability today, and especially at tracing, should make sure that whatever they select is W3C Trace Context and CNCF OpenTelemetry compliant.
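For readers curious what the W3C Trace Context standard mentioned here looks like in practice, below is a small sketch of building and parsing a `traceparent` header. The header layout (version, trace-id, parent-id, flags) follows the published W3C format; the surrounding helper code is just an illustration, not part of any SDK.

```python
import re
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes  -> 16 hex chars
    flags = "01" if sampled else "00"  # 01 means the trace is sampled
    return f"00-{trace_id}-{parent_id}-{flags}"

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$")

def parse_traceparent(header: str) -> dict | None:
    """Return the header fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header)
    return match.groupdict() if match else None

header = make_traceparent()
print(header)                    # e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
print(parse_traceparent(header))
```

Because vendor agents and OpenTelemetry SDKs propagate this same header across service boundaries, traces can be stitched end to end even when different parts of the stack are instrumented with different tools.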