In this presentation, Oliver Gould, creator of Linkerd---the first CNCF project to incorporate Rust, and a major driver of the early Rust networking ecosystem---will present an argument that the future of cloud software and the cloud native ecosystem will be tied to the Rust programming language. Oliver will argue that, while Go is the lingua franca of the *current* cloud native ecosystem, Rust will be the lingua franca---or at least one lingua franca---of the *future* ecosystem. He will draw parallels between the constraints of cloud environments, the core principles of the cloud native ecosystem, and the principles of Rust itself, grounded in concrete examples from Linkerd's Rust proxy as well as other projects. Finally, he will present a roadmap for the future of Rust in the cloud native world, both as captured by the CNCF ecosystem and beyond.
PUBLICATION PERMISSIONS: Original video was published with the Creative Commons Attribution license (reuse allowed). Link: https://www.youtube.com/watch?v=BWL4889RKhU
Hi, thanks for coming. This talk is about why Rust is going to be a foundational technology for the future of cloud native infrastructure. Before I get to that, let me introduce myself. My name is Oliver Gould. These are my dogs, and I'm the creator of a project called Linkerd, a service mesh that's been part of the CNCF since 2016. I'm also the CTO of a company called Buoyant, where we make this and some other infrastructure tools. Before that, I worked at Internet companies like Twitter and Yahoo, really focused on production operations and infrastructure, and that's really the lens through which this talk is going to be delivered. This talk is basically three parts. First, I want to take you through a brief history of the cloud from my perspective. Next, I want to get into the details of why I think Rust is so important to cloud technology. And then we'll wrap up with a quick tour of the Rust toolkit that we use in Linkerd. I want to emphasize, when we're talking about the history, this is the history from my perspective; like all history, it's subjective. If you've been in the industry through this whole time, you may have a slightly different perspective on things. That's fine. But I think it's important to set the table for where we've come from before we talk about where we're going. So when I entered the industry, when I started working at Yahoo in 2007, Yahoo was a big old Internet company. They had literally millions of physical hosts that were managed: managed by dozens of hardware teams, people in data centers, people provisioning hardware, and also dozens of sysadmin teams. Our team in production operations was one of many responsible for managing these fleets of hosts across the world. Because of all the legacy systems that accrue over time, the environment was extremely heterogeneous: lots of FreeBSD, Linux creeping in, OS versions and configurations proliferating in every different way. The idea here is that these were largely what we call pets and not cattle: fairly bespoke configurations for millions of hosts. At the time, if you wanted to get hardware, if you wanted a new server, you had to go to something called the Hardware Review Committee. This was literally a meeting with the CTO of the company, David Filo, where you had to justify your need for a server or for a fleet of servers. It was undoubtedly slow and laborious and a little bit stressful to get hardware. This was done to save costs and make sure we were using things efficiently, but it's a really different way of getting hardware than we have today. And the first problem I started working on there, the problem I worked on through most of my time at Yahoo, was config management. Across these millions of hosts, how do we make sure that they get security patches? How do we make sure that new users at the company get access to the hosts, or that when a user leaves the company they no longer have access? How do we manage the proliferation of configuration? This is a hard problem. We worked on it for years, and it wasn't unique to Yahoo. Lots of companies at the time were going through similar problems; they all had to manage hosts connected to the Internet securely. And so a proliferation of projects came up around this: first CFEngine, a long time ago, written in C, and then in the mid-to-late 2000s we saw new products coming online, mostly in Ruby or Python: projects like Puppet, Chef, Ansible, and of course others.
The job of a config management system is kind of simple: they mostly have to run commands and generate templates. The scripting languages that were coming into popularity at the time were a good fit for that; they're system scripting languages. And the real job of config management systems is to make the host able to run an application. Frequently, config management systems were responsible for actually deploying code, for getting application software ready to run. That meant there was a pretty tight coupling between your host database and your actual application. It's not really where we are today. Around the same time, throughout the early 2000s, we saw a proliferation of virtualization technology: projects like FreeBSD jails on the earlier side, and then Xen and Solaris zones, building up to Linux cgroups, which so much of what we're talking about today is built on. Really, the job of these virtualization technologies was to make systems multi-tenant. No longer do I have to have a single host with a single application, or even a single user or customer; now I can run multiple operating systems on a single piece of hardware, and this really changed the game. Of course, all of this stuff is very low-level operating system code, written in C at best, but also with lots of assembly to get the job done, because you're virtualizing hardware; you're actually mimicking what a machine does. This gave birth to a new set of products and services that we really call the cloud today. It made the data center a service, or the data center a product. EC2 is probably the first widely available one of these, and of course many have followed since. I remember having a conversation with a colleague who was at Netflix, probably 2009 or 2010, where he told me how Netflix was moving from their AIX mainframes (Netflix originally was on AIX mainframes) to AWS. And this made no sense to me. Again, I was working at Yahoo with these massive fleets of bespoke systems, and the idea that a growing, popular Internet company would go to Amazon's infrastructure didn't make any sense. The great thing about this, as we all know, is that it made servers accessible. There's no longer a hardware review committee I have to go to to get a new server. There's no longer somebody in the data center I have to call to fix a server. Now I just have an API and a credit card, and I get access to a server, an Internet-connected server. This means that students can actually just get online and get server technology easily. And as startups and businesses, we get access to these things without having to get data center contracts or any of the overhead that was associated with the time before this. The other interesting thing here is that Linux is really tied to this, right? Linux does not generally require licenses to operate, so this was a great fit. I no longer have to get a Microsoft or whatever license to get into prod; I now just get a free operating system. And I also no longer have to worry about the vast array of driver compatibility issues, which were kind of a headache before this. So now I can just make an API call, click a button, and I get a server that's ready to operate on the Internet. Great: it really reduces the barriers to getting involved in server technology. However, there are some downsides here. We're still dealing with hosts, even though we're dealing with virtual hosts.
With VMs, we still have hosts as our primary abstraction, and config management is still a big problem, which is why we have all these config management companies and projects coming online. We also have very little control over the hardware, which means we have less reliability. No longer are we trying to have one superpowered machine stay online all the time; now we're dealing in a world where systems might fail and we can't call the data center to deal with it. We really have to consider variable performance. No longer do I have exclusive access to a machine; I might have other businesses running on this machine that are dealing with lots of traffic. And so these concerns end up changing how we actually think about operating services. We have a whole new set of failures: soft failures, more frequent failures. And this gives birth to a new way of testing, a new methodology called chaos testing, also coming out of Netflix, probably in 2011 or so, building on that. By the time I got to Twitter in 2010, we have a new set of technologies coming online that really decouple applications from hosts. No longer, as someone who's operating a service, do I have to think about config management or SSH into a host or all the overhead of that. Now, through projects like Mesos and Aurora, I just want to ship my workload: I want to write software, build an artifact, and get it running. This is really great for Twitter and other growing, scaling companies (Uber, Lyft, et cetera), where you now need to focus on developer productivity. We have hundreds of engineers; how do we get them to write software and ship it quickly without having to think about all the operational overhead? How do we stratify that and separate that? The downside of these projects was that they were really operationally complex. It was pretty hard, if not impossible, to run a full Mesos cluster on your laptop. You actually needed quite a bit of hardware to get started, or some pretty beefy cloud boxes. So there's some overhead here; this is not a broadly accessible technology that you can just get started with. Also, we're dealing with a lot of JVM runtime, which comes with runtime costs (overhead, memory, CPU) and operational costs in terms of debugging GC and things like that. And in this world, we're dealing with highly dynamic systems, where Mesos may reschedule tasks or instances without there being any kind of user involvement. So we have to deal with things like service discovery and load balancing and retries and timeouts and all the things that are necessary to manage services at this scale. At Twitter, we were working on a library called Finagle, and that's really what came to be the core of the first version of Linkerd: as we dealt with all these production issues, we dealt with making communication more reliable in this library called Finagle. The idea with Linkerd was, well, how do we package that up into a proxy and make it accessible to folks who are not writing software with Finagle? Following that, around the same time, there's a new set of technologies coming online: what we call cloud native. It really starts with Docker in a lot of ways.
So Docker is building on Linux cgroups, the technology we were talking about a little bit ago, and Docker makes it possible, as many of you know, I'm sure, to package up an application, ship it somewhere, and get it running with resource constraints. It pulls in parts of the config management story and isolates them into an artifact that really is almost a whole operating system running in a binary. Kubernetes extends that model and makes it possible to take a cluster of servers and just run these Docker containers anywhere. And with that, we have a heavy reliance on the network. What we call microservice architectures are tiny services that are distributed in a data center or in a cluster, and they communicate over the network. So tools like gRPC and Envoy and Linkerd fit into this world to really focus on managing the complexity of a dynamic system: we deal with fault tolerance, we deal with the fact that we have to load balance, and a lot of these things. Kubernetes and Linkerd, and especially Docker as well, focus on user experience and on reducing the costs of managing it, of getting started, of understanding it, to make it accessible for application owners to get running. We focus on applications and not hosts; we've finally broken those barriers down. So let me take a little detour and describe what Linkerd is, in case you don't know, and then we'll get into why this is so important for us. Linkerd is a service mesh. What's a service mesh? It's a pattern of deploying rich data planes, generally as a sidecar proxy, that deal with this communication complexity. So we have to deal with load balancing over a set of instances, a set of replicas in a cluster. I have to deal with making sure that everything gets TLS by default, because I may not trust the network that I'm running on. And I also want to have identity on either side of this: I want to know which workload is talking to which workload, and that's easily done through TLS. So what we do is deploy a proxy sidecar next to every application, and this helps manage communication and complexity. In Linkerd, there are kind of two halves to this. We have a control plane, which talks to the Kubernetes API and deals with a lot of the configuration and discovery, and with the fact that things are dynamic, feeding that to the proxies. And the proxies are supposed to be very lightweight, small instances that you can fit many, many of on a host to service traffic. We can look at it like this, right, where we have the Kubernetes API; Kubernetes is, of course, written in Go, and we have the Linkerd control plane, which today is also written in Go. We chose Go for the control plane because it's so coupled to the Kubernetes API, because we want to use client-go. We don't want to have to write a Kubernetes client from scratch and think about all the complexity of what's in a Kubernetes client. This was, again, three-plus years ago when we were starting, and we wanted Linkerd's control plane to feel like part of the Kubernetes ecosystem, so we chose Go for that. But when we went to the sidecar proxy, we chose Rust, and that has been a great experience. When we were starting, though, it was really rough around the edges. We had to bootstrap ourselves. We had to build lots of technology. We invested heavily in the Rust ecosystem to make this work. So why is Rust going to happen now? What about this moment? Why is Rust so appealing to us at this point in time?
Well, Rust gives us a bunch of primitives to build components. It's a programming language which focuses on safety, efficiency, composability, and really on making developers productive. I, as an engineer, can write a data plane proxy, a microservice proxy, and I can do that with high confidence that it's not going to have memory leaks or memory safety issues, and that it will do the job well. And we want to use this to build cloud native technology. Cloud native systems are, again, dynamic, networked, fault tolerant, and loosely coupled. These things end up actually lining up pretty closely, and I'm going to get into why. But first, let me take you back to the OSI model. This is one of my favorite pictures; every talk I do has this. We look at the application stack, or the networking stack, and we see all these layers. But really, for applications, for the people building websites and user-facing applications, this is how the world looks. They should only really care about their application logic, whether it's tweets or pictures or payments or whatever, and maybe the presentation, whether that's JSON or protobuf, or the details of how it's rendered and shared. But everything beneath that is infrastructure. It's the cloud. Somebody has to build that stuff, and that somebody is us. Down at the bottom, we have physical layers and link layers, which are really the domain of cloud providers or data center hardware. And this middle blue layer is where we spend all our time as infrastructure developers. Things we build like Linkerd and Kubernetes really fit into this middle blue layer that's not talked about too much. So we're all becoming systems programmers. Anyone working in the cloud native space is really not an application developer. Applications are end-user facing; systems programmers build software that supports applications. And generally these things have to be highly trustworthy, meaning they're going to work safely and correctly, and they generally have pretty tight performance requirements. This is really where Rust fits in. Rust is a native language, meaning we actually compile to native code; we're not running in a JIT or a runtime VM. So we have access to low-level memory primitives and things like that. But we also need to do that safely, and that's where I think Rust really shines. I'm going to walk through some comparisons. I'm going to compare it to Go, because Go is what I know well and is really the state of the art in cloud native, and we'll talk about where Rust really makes improvements over the current state of things. So here's a really simple example: a function that fails, and I call it and I ignore the error. This is a bug, right? If something fails, we should have to handle it. Rust makes that really easy, and it uses the type system to do it. One of the big advantages of Rust is a really nice type system, and types let us express constraints in a much richer way. Really, the goal of all of what we're doing here is taking things that might fail at runtime, when the application is running, and trying to make them fail before we even build the thing, before we test it, as early as possible, during the compilation phase. And Rust really excels at this. So here's the same function that just fails. If we ignore the error, Rust will actually emit a compiler warning. In Linkerd we deny warnings, so this will prevent Linkerd from building, and we'll have to fix it before we go on.
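To make that concrete, here's a minimal sketch of the pattern (not the actual slide code): a fallible function returning Result, a call that ignores it, and the explicit handling the compiler pushes you toward.

```rust
// A minimal sketch: a fallible function returning Result, and what the
// compiler does when the caller ignores it.
use std::io;

fn might_fail() -> Result<(), io::Error> {
    Err(io::Error::new(io::ErrorKind::Other, "something went wrong"))
}

fn main() {
    // Result is #[must_use], so ignoring it produces an `unused_must_use`
    // warning; with warnings denied (as in Linkerd's builds), this becomes
    // a hard compile error instead of a silent bug.
    might_fail();

    // The caller is pushed to handle the failure explicitly instead:
    if let Err(e) = might_fail() {
        eprintln!("operation failed: {e}");
    }
}
```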
Another example, pretty similar: here's a place where I access an uninitialized value, and we've had this type of bug in the Linkerd control plane or CLI countless times, more times than I can count. This is a big pain in my neck, and Rust again makes it simple with the type system. No longer can I access something that hasn't been initialized; if I try to do that, I'll actually get a compilation error. I have to deal with the fact that something may not be set. There's no null value, and Rust's Option is the closest thing we have, where a value either exists or it doesn't. But it's part of the type system. To fix this, I can get the same runtime failure that I would get in Go, but I actually have to document it: I have to call expect with an error message. So it's no longer just segfaulting because I did something dumb; Rust makes me deal with these things before I even compile. Similarly, concurrency becomes a big issue, especially in a proxy like Linkerd. We have multiple connections and requests going at once, we're talking to the control plane, there's lots of concurrent access, and in Go this can be quite dangerous by default. So here, this is actually from the Go by Example website, where they demonstrate how to use a mutex; I've just left the mutex out, and Go will happily compile it and it will even run. So if I run this thing for a second, it works just fine, which is great, right? Unless I run it for longer: I increase the run time here to 10 seconds and all of a sudden I hit an error. So this is completely non-deterministic, right? I can write tests against this path, and then when I ship it to prod, this thing can fail in an unexpected way. This is virtually impossible to do in Rust. Here's the same code, effectively, written in Rust. When I try to compile this, I'll actually get an error that says, hey, this map you built, you can't use it in multiple places at once; somebody has to own this thing. This idea of a borrow checker is really an ownership model for who owns memory, for what code is responsible for it. This means that I can't even compile this code in Rust, because the access patterns are unsafe. To fix this, I actually have to go and put a mutex in. This is effectively the same fix that should exist in the Go code, but in Rust the compiler enforces it. When I add the mutex, everything works as you'd expect, which is great: Rust has made my program safer without my even writing tests. My last example here is that Rust has this idea called RAII, resource acquisition is initialization, and this deals with lifetimes, tying back to that borrowing and ownership model: when I drop something, it no longer exists. So here in the example on the left, we have Go code that sends two messages on a channel and then drops the sender, and then I have another task that continually reads from that channel. If I run this, it runs until that error happens, we hit a deadlock, and Go exits, which is fine; Go should fail in this case, because that's the best it can do. But we can do better in Rust. In Rust, I have this type system again, and what the type system lets me do is get an optional value back, so there's no runtime failure here. I can't even write the code in a way that doesn't work; I have to handle these conditions, and when I do that, we see that we get some value back every time we do a read.
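To ground those examples, here's first a rough Rust sketch of the shared-map case (not the slide code): without Arc and Mutex the borrow checker rejects the program; with them it compiles and is free of data races.

```rust
// A sketch of the shared-state example: several threads incrementing
// counters in one map. The Arc<Mutex<...>> is not optional; without it,
// this will not compile.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let counters: Arc<Mutex<HashMap<String, u64>>> =
        Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counters = Arc::clone(&counters);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // The lock guard is the only way to touch the map.
                    let mut map = counters.lock().unwrap();
                    *map.entry("a".to_string()).or_insert(0) += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    println!("{:?}", counters.lock().unwrap());
}
```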
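And here's a small sketch of the channel example, using the standard library's mpsc channel (Tokio's channel returns an Option instead of a Result, but the idea is the same): when the sender is dropped, the closed channel shows up as a value the receiver must handle, not as a deadlock at runtime.

```rust
// A sketch of the RAII / channel-drop example: the sender sends two
// messages and is then dropped, which closes the channel. The receiver
// sees the closure as a value it must handle.
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        tx.send("one").unwrap();
        tx.send("two").unwrap();
        // `tx` is dropped here, closing the channel.
    });

    // recv() yields Ok(msg) while the channel is open and an Err once
    // every sender is gone, so the loop ends cleanly.
    while let Ok(msg) = rx.recv() {
        println!("got {msg}");
    }
    println!("channel closed, done");
}
```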
So these are examples of the type of safety net that we get from Rust, and we haven't even gotten to any of the details around memory access or Rust's kind of provable safety. These are all ways that we take failures that could happen at runtime, after I've shipped my software to production, when I hit a weird corner case where things can crash and break the rest of the system; Rust lets us take all of those types of failures and bring them back into the development cycle, so they have to be dealt with explicitly. And finally, let me take you through a quick tour of Tokio's async ecosystem. Tokio is a Rust library that's kind of similar to Netty, if you're familiar with the JVM; it gives us asynchronous I/O. When I start a program, I set up the Tokio runtime, and this lets us run I/O concurrently without having to have a thread per connection. It gives us something you can basically think of as similar to Go's runtime, where Go has these green threads that let you run things concurrently and just block; Rust and Tokio give you a kind of similar set of primitives, and Tokio has an ecosystem around it that really lets us build systems on these good primitives that are trustworthy. The first one, which we've invested heavily in for Linkerd (a lot of Tower's primitives come out of Linkerd), is this system called Tower, which is really similar to Finagle's services. It's a service abstraction where there's a request and a response, and a set of layers, or middlewares, that stack these things together so they can be reused: I can write loosely coupled components and then bind them together. So here's an example from the Linkerd proxy. This is an HTTP client: for every endpoint we're talking to, we build one of these services. It has a reconnect layer, it lets us do Linkerd's tap feature, and it adds metrics. All of these features are orthogonal; they have no dependencies on each other, really. So I can write all of these separate modules that are easily tested or easily shared and reused without having to couple them together. This is a really great building block. I also should emphasize that a lot of the primitives we've developed for Linkerd are freely available in the Tower library and framework, so you can use, for instance, Linkerd's load balancer without having to pull in Linkerd; the tower-balance crate is something we've contributed back upstream. This is, again, a set of reusable components that you can use to build new systems. Another library that we've been heavily involved in is something called Tonic. Tonic is a gRPC binding for Rust, again really bound up with Tower and Tokio's async runtime and async networking. Tonic lets me write little gRPC services. So here on the right, this is a load-testing service that I wrote, and we just have to implement it: we take a gRPC protobuf, we write the function that's generated from the API, and now we have a network server. This makes it really easy to build microservices, or little pieces of services, in Rust, again with Tokio's async runtime. Finally, I want to call out another library, which is newer in this ecosystem, called kube-rs. kube-rs is basically client-go for Rust. What it gives us is Kubernetes API bindings and primitives, built on the Tokio primitives, that can be merged easily with Tonic gRPC services or Tower services.
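To make the Tokio model concrete, here's a minimal sketch (assuming the tokio crate with its full feature set, not code from the talk): one runtime, many cheap tasks doing I/O concurrently, with no thread per connection.

```rust
// A minimal Tokio sketch: a TCP echo server where each connection is a
// lightweight task on the runtime rather than an OS thread.
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (mut socket, peer) = listener.accept().await?;
        tokio::spawn(async move {
            let mut buf = [0u8; 1024];
            loop {
                match socket.read(&mut buf).await {
                    // Connection closed or errored: stop this task.
                    Ok(0) | Err(_) => break,
                    // Echo the bytes back to the peer.
                    Ok(n) => {
                        if socket.write_all(&buf[..n]).await.is_err() {
                            break;
                        }
                    }
                }
            }
            println!("connection from {peer} closed");
        });
    }
}
```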
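And here's a hedged sketch of the Tower pattern (not Linkerd's actual client stack; it assumes the tower crate with its full feature set): an inner service wrapped with orthogonal middleware layers via ServiceBuilder, the same general shape as the reconnect, tap, and metrics stack described above.

```rust
// A sketch of a Tower service stack: the inner "client" is just an async
// function, and each middleware layer is independent of the others.
use std::time::Duration;
use tower::{service_fn, Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // The innermost service: request in, response out.
    let client = service_fn(|req: String| async move {
        Ok::<_, tower::BoxError>(format!("response to {req}"))
    });

    // Orthogonal layers stacked around the client.
    let mut svc = ServiceBuilder::new()
        .timeout(Duration::from_secs(10)) // fail slow calls
        .concurrency_limit(100)           // bound in-flight requests
        .service(client);

    let rsp = svc.ready().await.unwrap().call("GET /".to_string()).await;
    println!("{rsp:?}");
}
```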
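Finally, here's a hedged, simplified sketch of the kind of thing kube-rs enables (assuming the kube, k8s-openapi, and tokio crates and a valid kubeconfig): reading Pods from the Kubernetes API and indexing the container ports they expose. The prototype described next keeps such an index up to date with a watch; this stand-in just lists pods once to keep it short.

```rust
// A simplified sketch (not the actual prototype): list all Pods and build
// an index from pod name to the container ports it declares. A real
// controller would use kube's watcher to keep this index current.
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, Api, Client};
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::all(client);

    // pod name -> container ports declared by that pod
    let mut index: HashMap<String, Vec<i32>> = HashMap::new();

    for pod in pods.list(&ListParams::default()).await? {
        let name = pod.metadata.name.clone().unwrap_or_default();
        let ports = pod
            .spec
            .iter()
            .flat_map(|spec| spec.containers.iter())
            .flat_map(|c| c.ports.clone().unwrap_or_default())
            .map(|p| p.container_port)
            .collect();
        index.insert(name, ports);
    }

    println!("{index:#?}");
    Ok(())
}
```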
And so here is an example from a prototype I'm building which watches all pods in a cluster and indexes which ports are available on those pods. This is something I'm really excited about, because it means that as we replace or add new controllers in Linkerd, we can start doing them in Rust, where several years ago this was totally not possible. Now we have a rich ecosystem of projects around Rust, and around Tokio specifically, that we can use to stamp out new infrastructure code that is going to be much safer, that we're going to be more productive writing, and that we'll generally have a much easier time with. So, in summary: cloud computing creates new, ubiquitous abstractions. We no longer have to deal with managing hosts or acquiring hardware in nearly the same way that we did a decade ago. Now we have Kubernetes APIs; we have lots of glue beneath the application, and we need that to work well. We've all become systems programmers. Anyone who would have been in operations a decade ago is basically a systems programmer now, but filling that out by having an industry of people writing C would not be great. It hasn't been great: we have security vulnerabilities, we have safety issues, C has a pretty steep learning and development curve, and Rust makes it way more accessible. One of the things I'm most excited about in the Rust ecosystem is the number of young engineers getting involved here. People in school or just out of school are really gravitating towards Rust, and I think the industry is going to be transformed by this: we're going to have a much richer, more reliable systems ecosystem that's built on Rust. Finally, this wasn't possible a few years ago. There's been a tremendous amount of investment: our team has invested heavily in Rust and these ecosystem libraries, and folks at Amazon, Microsoft, Google, you name it, have been investing in Rust. I think this really points to a future that is going to be much safer, more efficient, better for the environment, and more reliable. Thanks for coming. I hope this talk was useful, and I hope you enjoy the rest of the talks today. Have a good one. Bye.