Three engineers, at various points, each take their own approach to adding Rust to a C codebase, each more ambitious than the last. I initially just wanted to replace the server’s networking and event loop with an equally fast Rust implementation. We’d reuse many core components that were in C and just call into them from Rust. Surely it wouldn’t be that much code…
Pelikan is Twitter’s open source and modular framework for in-memory caching, allowing us to replace Memcached and Redis forks with a single codebase and achieve better performance. At Twitter, we operate hundreds of cache clusters storing hundreds of terabytes of small objects in memory. In-memory caching is critical, and demands performance, reliability, and efficiency.
In this talk, I’ll share my adventures in working on Pelikan and how rewriting it in Rust can be more than just a meme.
Brian Martin Twitter @brayniac
Brian is a software engineer at Twitter where he focuses on performance and optimization projects. He contributes to several Twitter open source projects including: Pelikan, Rezolus, and rpc-perf. Brian is passionate about high performance software, systems tuning, monitoring, benchmarking, and Rust. When not hacking on code, Brian volunteers in his local search and rescue unit and is training his dog to find people who get lost in the wilderness.
Recorded at Strange Loop 2021 https://thestrangeloop.com
Welcome to "Whoops! I Rewrote It in Rust." My name is Brian Martin, my pronouns are he/him, and I'm a software engineer at Twitter. I've been at Twitter for seven years and have been using Rust for six. I've worked on a few different open source projects at Twitter: a benchmarking tool for cache performance, a systems performance telemetry agent, and now a cache server. All of these projects are in Rust. When I'm not at a computer, I'm a volunteer with my local search and rescue unit, and I'm training my dog Riker to find people who get lost in the wilderness, which is also a lot of fun.

This talk is framed around caching, specifically distributed key-value storage. You can think of it basically as a hashmap over the network. Caching is really high throughput and very low latency: we expect around 100,000 requests per second on a single core, with typical latencies below one millisecond. It's very fast because we're just grabbing data out of memory and pushing it over the network. We use caches to store frequently accessed items, which helps protect the database or backend services by reducing the number of requests they have to handle. This helps particularly if you have spikes of traffic for frequently accessed items: you can just hold those in the cache, and the database doesn't need to serve those requests. It can also speed things up, because the cache tends to respond a lot faster than a database would. So caching winds up being pretty important for distributed service and system performance.

There are a couple of common open source solutions in this space. Memcached and Redis are both written in C and are used a lot in the industry. That leads me to Pelikan, which is Twitter's caching framework that we want to use to replace our forks of Memcached and Redis. It gives us a single codebase to produce multiple caching solutions and to easily experiment with new things in this space, and it's a lot easier to manage one codebase than to go make changes in a couple of different forks whenever there's a new feature we want to add.

I wanted to do some work on Pelikan to add TLS support, so we'd be able to encrypt traffic between the client and the cache server. I was talking to my manager, Yao, and we were thinking that we could probably do this in Rust and add TLS support that way. I personally feel a lot more comfortable using Rust than C, particularly in this space: the safety and all the nice language benefits of Rust seemed like they'd be good. But we wanted to make sure the performance matched the C implementation; we didn't want to take a hit there.

There were some previous efforts that made me think doing this in Rust was going to be possible. Back in 2018, an engineer wanted to add a new storage library to Pelikan and decided to try doing it in Rust. The idea was to keep using the core of Pelikan, put an FFI wrapper around this Rust storage library, and just include that in Pelikan. This worked out pretty well: development was really quick, which was impressive, and it showed that a hybrid Rust and C server was workable. It became the first commit of Rust into Pelikan, which was super exciting. Then in 2019, another engineer was a bit more ambitious and wanted to try using Rust for the actual server code.
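As a rough illustration of that 2018-style hybrid, exposing a Rust key-value store to C callers over FFI might look something like the sketch below. The names are hypothetical, not the actual Pelikan bindings; the C side would see an opaque pointer and a handful of extern functions (with a header typically generated by a tool like cbindgen).

```rust
// Illustrative only: a Rust store exposed to C through an opaque pointer.
use std::collections::HashMap;
use std::slice;

pub struct Store {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

#[no_mangle]
pub extern "C" fn store_create() -> *mut Store {
    // The C side holds this as an opaque handle.
    Box::into_raw(Box::new(Store { map: HashMap::new() }))
}

#[no_mangle]
pub unsafe extern "C" fn store_put(
    store: *mut Store,
    key: *const u8,
    key_len: usize,
    val: *const u8,
    val_len: usize,
) {
    // Null checks and error reporting are omitted in this sketch.
    let store = &mut *store;
    let key = slice::from_raw_parts(key, key_len).to_vec();
    let val = slice::from_raw_parts(val, val_len).to_vec();
    store.map.insert(key, val);
}

#[no_mangle]
pub unsafe extern "C" fn store_destroy(store: *mut Store) {
    // Reclaim the Box we handed out in store_create.
    drop(Box::from_raw(store));
}
```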
For that 2019 effort, they decided to use Tokio, which is an async framework for Rust (an executor and a lot of other things), as the server runtime, but to keep leveraging the C components from the existing cache server: storage, buffers, the foundational libraries, essentially. This was really neat. It showed that we could do much more advanced layering of C and Rust in the codebase, and I was super excited to see that Rust could take on a larger role in Pelikan.

As I mentioned before, performance of cache services is super important; it's a very, very fast service. And to make sure it's fast, we need to do performance testing: you can't improve what you can't measure. Typically we apply a synthetic workload that's rate limited and record the latency for each request into an HDR-style histogram. The goal is to get the highest throughput we can within some latency bound, and typically we might define that as the p999 latency needing to stay below some number.

So why p999, the 99.9th percentile? That seems pretty far out in the tail; why would we worry about it? It's only 0.1% of your requests that will see latencies that slow or slower. But there's this thing called fanout: we might need to make multiple requests to the caching layer in order to fulfill some higher-level request. If we have a fanout of ten, meaning we need to make ten queries to the cache and wait for all of them to come back, about 1% of requests at that higher level will see latencies that meet or exceed the p999. It gets a lot worse with a fanout of 100: now roughly 9.5% of those requests will see latencies that high, and fanouts that large aren't uncommon. Think about something like loading a home timeline, where there's a lot of data you need to fetch. The lesson is that tail latencies show up a lot more often than you'd think, so we do care about latencies way out in the tail.

Okay, so we had this implementation of Twemcache (Twitter's Memcached fork) in Pelikan, in Rust, from 2019. It's Memcached protocol compatible, it used Tokio for its networking, and it used the C storage library from the C implementation of Pelikan Twemcache. I wanted to see whether this Rust server would be a good starting point for adding TLS support, so first we needed to test the performance. Unfortunately, it didn't perform as well as the C implementation. Throughput was about 10% to 15% lower, which means we'd need more instances to serve the same aggregate QPS, and latency was about 25% to 30% higher at the p999. That wasn't going to be great for production: we might start to see timeouts and errors, and making a service 10% to 15% more expensive is not an easy thing to sell.

To visualize the benchmarking results, we're comparing the C implementation in blue with the Rust implementation in red, looking at the p999 latency in microseconds across a range of throughputs. We ramp the rate limit and trace out a latency curve, and we do expect that as more and more traffic comes into the server, things start backing up, some resource saturates, and latency goes up.
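As a quick aside on the fanout arithmetic above: if we assume the parallel cache requests see latency independently, the chance that at least one of n requests lands at or beyond the p999 is 1 - 0.999^n. A minimal sketch of that calculation:

```rust
// Probability that at least one of `fanout` parallel cache requests is at or
// beyond the p999 latency, assuming independent, identically distributed
// request latencies (a simplification).
fn p_at_least_one_slow(fanout: i32) -> f64 {
    1.0 - 0.999f64.powi(fanout)
}

fn main() {
    println!("fanout  10: {:.1}%", 100.0 * p_at_least_one_slow(10)); // ~1.0%
    println!("fanout 100: {:.1}%", 100.0 * p_at_least_one_slow(100)); // ~9.5%
}
```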
Coming back to the chart, we can clearly see that the Rust implementation has worse latency: it sits higher up, and it doesn't reach the same throughput; it doesn't go as far to the right-hand side of the chart. So it's not going to perform as well. But Rust is known to be pretty fast. It should be as fast as C for this; we're not doing anything crazy, just reading data from memory and putting it on the network. So let's see if we can make it faster by taking a few steps back.

A really basic server, sort of the mother of all servers, is this thing called a ping server. It's a little more advanced than an echo server, you could maybe argue, but it's really simple. It knows two commands: if you send it PING, it sends back PONG, and if you send it QUIT, it hangs up the session. Really basic, but it has all the components you need for a network service: connection management, request parsing, response composition, the whole lifecycle. And you can make it production grade, with all the debugging, tracing, and metrics you'd want from a real network service. So it's a really good starting point: we can work out the framework, make sure all of those pieces are in place, and start layering in TLS support so that both plaintext and encrypted sessions can be handled by this ping server.

Since I was concerned about matching the performance of the C implementation, I wanted to make things as equivalent as possible. The C implementation of the ping server was basically just using epoll on Linux, so we could use this library called mio, which is a foundational library for Tokio and is essentially a nice wrapper around epoll (or the equivalent on macOS or Windows), and we wouldn't need to worry about async executor performance and all those other really complex things. I was reducing the surface area of the problem: stepping back to epoll, writing the event-handling loop myself, so it would look a lot more like the C version and I wouldn't need to debug the performance of Tokio and everything on top of it.

To get to a prototype quickly, I decided to use some foundational libraries I'd been using in my other projects. That meant I could focus on getting something built quickly with a pure Rust build, and if I ran into any problems with those libraries, I could always switch back to the C implementations later; it's just a lot easier to get started with a pure Rust build. At this point I also added TLS support using BoringSSL, so I could be sure the design accommodated TLS sessions. And then we benchmark again. Just like in the previous chart, we're looking at p999 request latency versus request rate, and the two are pretty close. That looked good, so we could open a pull request merging this into the Pelikan repo on GitHub. Awesome. Now I could move on to what I was supposed to be doing and actually work on a cache server. The real goal was to add TLS support to our Memcached fork, and the primary difference between that and the ping server is a much more complicated protocol.
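To make the mio-based approach concrete, here's a minimal sketch of a ping server built directly on mio's poll loop. It handles only PING/PONG and QUIT, skips buffering, TLS, metrics, and proper error handling, and is not the actual Pelikan code:

```rust
// Cargo.toml (assumed): mio = { version = "0.8", features = ["os-poll", "net"] }
use std::collections::HashMap;
use std::io::{Read, Write};

use mio::net::{TcpListener, TcpStream};
use mio::{Events, Interest, Poll, Token};

const LISTENER: Token = Token(0);

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(1024);
    let mut listener = TcpListener::bind("127.0.0.1:12321".parse().unwrap())?;
    poll.registry().register(&mut listener, LISTENER, Interest::READABLE)?;

    let mut sessions: HashMap<Token, TcpStream> = HashMap::new();
    let mut next_token = 1;

    loop {
        poll.poll(&mut events, None)?;
        for event in events.iter() {
            match event.token() {
                LISTENER => {
                    // Accept every connection that is ready; stops on WouldBlock.
                    while let Ok((mut stream, _addr)) = listener.accept() {
                        let token = Token(next_token);
                        next_token += 1;
                        poll.registry()
                            .register(&mut stream, token, Interest::READABLE)?;
                        sessions.insert(token, stream);
                    }
                }
                token => {
                    let mut close = false;
                    if let Some(stream) = sessions.get_mut(&token) {
                        // A real server buffers partial reads and writes and keeps
                        // reading until WouldBlock; this sketch assumes one whole
                        // request arrives per readable event.
                        let mut buf = [0u8; 64];
                        match stream.read(&mut buf) {
                            Ok(0) => close = true, // peer hung up
                            Ok(n) if buf[..n].starts_with(b"QUIT") => close = true,
                            Ok(n) if buf[..n].starts_with(b"PING") => {
                                stream.write_all(b"PONG\r\n")?;
                            }
                            _ => {}
                        }
                    }
                    if close {
                        sessions.remove(&token);
                    }
                }
            }
        }
    }
}
```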
With the Memcached protocol we have variable-length requests, so we need to focus on the parsing and on being able to respond with actual, meaningful data. To get started, and to make sure the request parsing and response composition were really solid, I just took the standard library's HashMap, wrapped it a little, and had it stand in for the storage library so I could get some initial benchmarking results. Once I was confident that was looking good, it would be time to create a better wrapper around the C storage library. I wasn't super happy with some of the abstractions there, so I was going to wrap it and start including it in my server. Super easy; this isn't taking long at all.

Of course, I ran into issues. Choosing Rust implementations of some of those foundational libraries left me stuck when it came to integrating the storage library. Since we want really good observability and debuggability, the C storage library ties directly into the C metrics and logging frameworks, and now I had something that wasn't compatible with them. So I had a choice: I could rip out the Rust versions of my foundational libraries, go back to the C implementations, and try to create more ergonomic wrappers around those; or I could figure out some way to make the C library talk to the Rust metrics and logging. It was all getting really gross. I spent a lot of time feeling crushed by this problem and trying to figure out a good way forward, and ideas that had seemed bad at first were starting to sound better. Maybe I could rewrite the storage library in Rust.

There were a lot of reasons not to do this. The storage library is essentially the place where we run into all sorts of things that aren't easy to do in Rust. We have self-referential data structures. We have to manage our own memory allocations to avoid external fragmentation. We've got linked lists all over the place (there's a really good article series online about learning Rust with entirely too many linked lists; that's how tough they can be), and we have pointers everywhere. Lots of unsafe code. Here there be dragons.

On the other hand, maybe we could simplify things. What if we started with something that's single threaded only, so we don't have to worry about concurrent access and modification? Maybe we could implement just the hash table in Rust and use the C code to allocate the big blocks of memory. But that wound up being depressingly difficult as well. Luckily, around this time there was a newer research storage design by a PhD student who was working with us, and it happened to be a lot easier to port to Rust. I took that design and simplified it even further. I decided to go single threaded only, so I wouldn't have to worry about locking and concurrency, things I could get wrong. While Rust does enable fearless concurrency, the primitives for locking are too big to use in this context: we're very byte-efficient in these data structures, so we don't have much room to work with. Best of all, I wound up being able to manage all of the memory from Rust without doing anything terribly exotic. So that was cool. Of course, rewriting has downsides, right? It has a cost.
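Before getting to those costs, here's a simplified sketch of what "managing all the memory from Rust" can look like: reserve one large allocation up front and hand out fixed-size segments from it by index, so individual items never touch the global allocator and external fragmentation stays bounded. The names and layout are illustrative, not the actual Pelikan storage code.

```rust
// Illustrative only: one big heap carved into fixed-size segments.
struct SegmentHeap {
    data: Box<[u8]>,     // one large allocation for all item storage
    segment_size: usize,
    free: Vec<u32>,      // ids of segments available for (re)use
}

impl SegmentHeap {
    fn new(segment_size: usize, segments: u32) -> Self {
        Self {
            data: vec![0u8; segment_size * segments as usize].into_boxed_slice(),
            segment_size,
            free: (0..segments).rev().collect(),
        }
    }

    /// Take a free segment and get a mutable view of its bytes.
    fn alloc(&mut self) -> Option<(u32, &mut [u8])> {
        let id = self.free.pop()?;
        let start = id as usize * self.segment_size;
        Some((id, &mut self.data[start..start + self.segment_size]))
    }

    /// Return a segment to the free list (e.g. after expiration or eviction).
    fn release(&mut self, id: u32) {
        self.free.push(id);
    }
}
```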
As for the cost: it took me around two months of work to get a pure Rust storage library implemented, which essentially duplicated effort that had already gone into the design in the first place, and at the end of that I still didn't have full feature parity; there was going to be more work to do. But there were also benefits. Instead of integrating the older storage library, we skipped right ahead to the newer design, which has benefits for a lot of our workloads for reasons I'll get into a little later. The rewrite also let us discover some things about the new storage library and come up with new ideas to integrate into it to help a subset of workloads, and all of that helped make the new library more production ready.

Then there were the benefits that came from rewriting in Rust specifically. I think the quote from the rust-lang.org homepage really captures what I experienced: Rust truly does empower everyone to build reliable and efficient software. I was able to confidently write a storage library that was reliable, had good performance characteristics, and took advantage of the awesome language features and tools. On top of that, Yao wound up sending me this quote to include, and I think it puts things really nicely: "The Rust implementation finally made Pelikan modules what they ought to be, but couldn't before due to limitations of the C language." That feels exactly right, and it's just as fast. I think it's really cool, coming from someone who has been doing awesome work in C for a very long time, to see (no pun intended) the benefits of Rust for this type of use case.

Rust also has really awesome tools, and there were a couple I found particularly helpful during this work. One was cargo bench with Criterion, which is great for microbenchmarking and let me focus on optimizing the critical components of the design. cargo-fuzz is also really great: it gave me a lot of confidence in my protocol library and helped make sure I hadn't missed some edge case in the protocol parsing. Fuzzing is an awesome tool for this kind of thing, and together these helped me improve the code and make sure it was high performance and reliable.

I also thought it was pretty cool how trait bounds, generics, and all the other nice parts of Rust let me express the composability of the libraries. Both the ping server and the cache server are only a couple hundred lines, which is mostly just initialization and defining what combination of components makes up the server. The cache server is the combination of the seg storage library with the Memcached protocol, and when we add a new protocol, or want to try yet another storage design, it will be just as easy to express how those things compose. Since we're using traits to define the interface, the contract between these libraries, it will be very easy to add a new protocol or storage library and reuse the rest of the work I've done (see the sketch below).

Okay, so benchmarking. This is always fun. Here we're comparing upstream Memcached with the Rust implementation of Pelikan Segcache, and if we zoom in, you can see that at the upper end of the throughputs tested, the Rust implementation actually had lower latency than Memcached. I think that's super cool.
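Going back to the composability point for a moment, here's a minimal sketch of the "server = protocol + storage" idea expressed with traits and generics. The trait and type names are hypothetical, not the actual Pelikan interfaces.

```rust
// Illustrative only: compose a server from a protocol and a storage engine.
trait Protocol {
    type Request;
    type Response;
    /// Try to parse one request from the buffer; None means "need more bytes".
    fn parse(&self, buf: &[u8]) -> Option<Self::Request>;
    /// Serialize a response into the outgoing buffer.
    fn compose(&self, response: &Self::Response, out: &mut Vec<u8>);
}

trait Storage<P: Protocol> {
    /// Execute a parsed request against the storage engine.
    fn execute(&mut self, request: P::Request) -> P::Response;
}

struct Server<P: Protocol, S: Storage<P>> {
    protocol: P,
    storage: S,
}

impl<P: Protocol, S: Storage<P>> Server<P, S> {
    fn handle(&mut self, buf: &[u8], out: &mut Vec<u8>) {
        if let Some(request) = self.protocol.parse(buf) {
            let response = self.storage.execute(request);
            self.protocol.compose(&response, out);
        }
    }
}

// A particular cache server is then just a choice of components, e.g.:
// type SegCacheServer = Server<MemcacheProtocol, SegStorage>;
```

With interfaces like this, a new server is mostly a different choice of type parameters plus some initialization, which is why the binaries can stay at a couple hundred lines.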
And again, Memcached is blazingly fast, so it's always fun to try and win one of these little benchmark face-offs; I just think it's neat.

I mentioned before that there were benefits we got specifically from using the new storage library. To understand those, we first need to understand how the two storage libraries differ. First, slab storage. This is what you'd find in Memcached, and it's what I was trying to avoid writing in Rust. There's a hash table whose buckets are linked lists of items, and the bulk item storage is divided up into units called slabs, where each slab holds items up to some size. Essentially, items are grouped by size into slab classes, so if we evict an entire slab, all of those items will be a similar size: some might be very old and close to expiration, but some might be very new. There are also a whole bunch of other linked lists that help the server track things like least-recently-used order. Typically we want to operate on slabs, though, and evict an entire slab at once under memory pressure. And the hash table is just a chained linked list.

The segment storage library uses something called a segment instead of a slab. The big departure is that instead of grouping items by their size, we group items by when they will expire. This is done by writing items into append-only log chunks (the segments) and indexing those into TTL buckets: we group items with a similar TTL together and just keep appending them into a segment, and when that segment fills up, we move on to another one. The hash table design is also a bit different: it uses something called bulk chaining, which makes it a lot more efficient by operating on cache-line-sized chunks.

There are a bunch of little hidden benefits to this storage design. With items sorted by TTL, we can remove expired items very efficiently. Memcached and Twemcache traditionally use lazy expiration, where on the read path the server has to check whether the item it just found for a key is expired, and if it is, it does some cleanup and returns a cache miss. Newer versions of Memcached do have a background thread that scans through and removes expired items. We want to do this because expired items are taking up memory that could be used for something else, and under memory pressure we're going to evict an entire group of items without necessarily removing just the expired ones; we might start removing items that aren't expired yet. If we can remove expired items earlier, we're making room for incoming writes. With the segment storage design, we can do this very efficiently without scanning through the items: we have an index of the segments and when they're going to expire, so we can just walk each segment chain, and if a segment is expired, clear it out and put it back on the free list.

This also opens up a path to new eviction strategies. We can remove the items that are closest to expiration, whereas typically you might have something like slab random, least recently used, or least frequently used. There's also a merging strategy in the segment storage design, which picks items from each segment in a chain and combines them, keeping the items that should give you the highest hit rate.
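Here's a simplified sketch of that segment-and-TTL-bucket layout, showing why proactive expiration can be a walk over segment chains rather than over individual items. The field names and structure are illustrative, not the actual Segcache code.

```rust
// Illustrative only: segments chained per TTL bucket, expired as whole units.
struct Segment {
    write_offset: u32,   // append position within this segment's bytes
    expire_at: u32,      // coarse expiration time shared by every item inside
    next: Option<u32>,   // next segment id in the same TTL bucket's chain
}

struct TtlBucket {
    head: Option<u32>,   // oldest segment in this bucket (expires first)
    tail: Option<u32>,   // segment currently being appended to
}

struct SegCache {
    segments: Vec<Segment>,      // per-segment metadata; item bytes live elsewhere
    ttl_buckets: Vec<TtlBucket>, // indexed by bucketed TTL
}

impl SegCache {
    /// Proactive expiration: walk each bucket from the head and free whole
    /// segments whose expiration time has passed, without touching items.
    fn expire(&mut self, now: u32, free: &mut Vec<u32>) {
        for bucket in &mut self.ttl_buckets {
            while let Some(id) = bucket.head {
                if self.segments[id as usize].expire_at > now {
                    break; // segments later in the chain expire even later
                }
                bucket.head = self.segments[id as usize].next;
                free.push(id);
            }
            if bucket.head.is_none() {
                bucket.tail = None;
            }
        }
    }
}
```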
This winds up being beneficial to a lot of workloads. If you have a variety of TTLs (maybe some very short-lived objects and some very long-lived ones), you want to be able to clear out the things that are expired, and to do it efficiently, and this design lets us do that.

We also wind up being a lot more efficient in terms of per-item metadata. The objects we keep in cache tend to be very small. The PhD student working with us did a study across our production cache traces (we've published anonymized cache traces on GitHub, along with some analysis), and what we found is that our item sizes are very small, on the order of 100 bytes. When you look at something like Memcached, which carries 56 bytes of metadata per item, that's a lot. Pelikan's slab implementation has 39 bytes of metadata per item, which is better but still large. The segment storage design uses about five bytes per item on average, because it shares a lot of that metadata across a larger collection of items. So the metadata overheads are pretty significant.

Additionally, this storage design lends itself very well to persistent memory. Intel Optane persistent memory is basically a very fast SSD on your memory channel. You can operate it in App Direct mode, where it presents itself like a block device but doesn't go through the system's page cache at all, so it winds up being pretty efficient to access. We're able to split things so the metadata still lives in DRAM while the bulk item storage sits on persistent memory, and that really lowers the cost per gigabyte of cache capacity. We still get about 93% of the DRAM-only performance on a 4:1 read/write synthetic workload. That's pretty impressive, because now we can consolidate: we can get a lot more PMEM into a box and grow our cache sizes. We're basically making memory free, which is pretty cool.

So, to summarize some of what I talked about. There are next steps for getting this to production: it needs to reach feature completeness, which hopefully happens very soon, then more testing, as always, then production canary testing, and then on to deployment. There's also future work, which can hopefully proceed in parallel with the path to production. One item is multi-threaded performance: since the storage library is single threaded only, the threading model I went with is a little different, and I want to spend some time optimizing it. General optimization too; we always want to make things faster. And we need a Redis replacement as well, so we're looking to do that with Pelikan and Rust; there's io_uring and all sorts of really fun stuff we can play with.

Rewriting has costs and benefits. The extra time could have caused me to miss a deadline, if I had been working toward one, and I was duplicating work that had already been done, so we paid for that twice. On the other hand, it is a lot easier to work with an all-Rust codebase. No more CMake, which is so nice; CMake and Cargo together is not much fun. And we were able to get some new ideas added to this storage library.
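One last sketch before wrapping up, to make the DRAM/PMEM split from a moment ago concrete: one way to use App Direct mode from Rust is to memory-map a file on a DAX-mounted persistent-memory filesystem for the bulk item bytes, while the hash table and segment metadata stay in ordinary heap memory. This is only a sketch under that assumption, using the memmap2 crate; the path and size are hypothetical and this is not the actual Pelikan code.

```rust
// Cargo.toml (assumed): memmap2 = "0.9"
use std::fs::OpenOptions;

use memmap2::MmapMut;

fn map_item_storage(path: &str, bytes: u64) -> std::io::Result<MmapMut> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;      // e.g. a file under a DAX-mounted /mnt/pmem
    file.set_len(bytes)?;  // size the backing file to the cache capacity
    unsafe { MmapMut::map_mut(&file) }
}

fn main() -> std::io::Result<()> {
    // Hypothetical path and size; metadata structures would stay in DRAM.
    let mut storage = map_item_storage("/mnt/pmem/segcache.data", 64 << 30)?;
    storage[0] = 0; // the mapping is used like any other mutable byte slice
    Ok(())
}
```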
C and Rust are both very, very fast, but it does take some work to write code that is high performance. With Rust we were able to match the C implementation, and a lot of that was time spent on profiling and benchmark-driven development, which I found very useful while working on this.

Rust has really awesome tools. Cargo is a really great build system, and after fighting with CMake in Pelikan, I am even more thankful for it. cargo-fuzz is an awesome tool: it's a plugin for Cargo, and it's great to have easy access to fuzzing these protocols during development. cargo bench for microbenchmarking is really useful as well. And then things like rustfmt and Clippy let me make sure my code was nicely formatted, more idiomatic, and consistent in style, which made it a lot easier to work on and review.

I'm happy to say that Pelikan has an exciting future with Rust. It looks like we'll be able to migrate the whole codebase to Rust, and I think there's a lot of fun work to come. Thanks for coming and listening to the talk, and thanks to all the folks involved in putting together this conference. Feel free to reach out to me with any questions, or just grab me and come chat. Thanks.