Video details

Advanced Caching Patterns at Wix Microservices by Natan Silnitsky

Microservices
01.05.2022
English

For more info on the next Devoxx UK 👉 https://www.devoxx.co.uk
Wix operates at a huge scale of traffic: more than 500 billion HTTP requests and more than 1.5 billion Kafka business events per day.
This talk goes through 4 Caching Patterns that are used by Wix's 1500 microservices in order to provide the best experience for Wix users along with saving costs and increasing availability.
A cache will reduce latency by avoiding the need for a costly query to a DB, an HTTP request to another Wix service, or a call to a 3rd-party service. It will also reduce the scale needed to serve these costly requests.
It will also improve reliability, by making sure some data can be returned even if the aforementioned DB or 3rd-party service is currently unavailable.
The patterns include:
- Configuration Data Cache - persisted locally or to S3
- HTTP Reverse Proxy Caching - using Varnish Cache
- Kafka topic based 0-latency Cache - utilizing compact logs
- (Dynamo)DB+CDC based Cache and more - for unlimited capacity with continuously updating LRU cache on top
Each pattern is optimal for different use cases, but all of them help reduce costs and gain performance and resilience.

Transcript

Welcome everyone. I hope everyone can hear me. So my name is Natan Silnitsky and I'm on the backend infrastructure team at Wix.com. Wix has a huge scale of traffic, with more than 500 billion HTTP requests and more than 1.5 billion Kafka business events every single day. And this talk is about three caching patterns that are used by Wix's 2,200 microservices in order to provide the best experience for our users, while also making sure to reduce costs and increase availability. One of these patterns was actually developed by my team, the Data Streams team at Wix Tech and Infrastructure, so I'm really excited to share that pattern with you, along with the rest.

So let's start with a few words about Wix. Wix is a website building platform where you can manage your online presence and also manage your business online. We have more than 200 million registered users from 190 countries and more than 100 million websites running on the platform, which is around 5% of all websites on the internet.

So let's look at a few issues that Wix developers are faced with from time to time. We have Rebecca from Wix Stores, and she's in charge of the Stores checkout service. Now, the checkout service needs to fetch some administration information from the Wix App Market. It does that when the checkout service starts up, and the information is required in order for it to be configured correctly. But what happens if that other service is unreachable for some reason? Maybe there are network connection problems, or maybe the Wix App Market itself is currently down. If we don't deal with it, the checkout service will also be down, because it doesn't have the information required to start up correctly.

Another use case from Wix Stores, the e-commerce solution, is with Alex. Alex is in charge of the Stores catalog service, which uses a Stores catalog database where it keeps all the catalog products and so on. Recently Alex has noticed database latencies and even failures for certain queries in the high percentiles, meaning that users will sometimes not get the best experience they can. So that's a big issue, and Alex would like to resolve it.

So as your traffic grows, so will the overall loading time and the overall load on your system. Response times and latency increase, and network costs increase as well, because we do a lot of networking all around the system in order to fetch all the data that is required. And you may end up with database failures due to high load on the database.

Now if you do introduce a cache, you will definitely reduce latency, as long as there is a cache hit most of the time. You also reduce the needed scale of your system, so you don't need as many service instances or Kubernetes pods as when you don't use a cache, because the cache will be the source of the response; you don't need to go all the way to other parts of your system. And it also improves reliability, because reading from a cache is much simpler and faster than doing complex operations. So overall you will definitely increase the reliability of your system: even if, let's say, some database is unavailable, if you have a cached response, that doesn't even matter.

So let's take a look at a few of the use cases we have found at Wix and what kind of caching pattern we can use for them. The first one is a high risk or cost of network failure. That's like the example I showed at the beginning, where you want to start up and fetch information from remote sources, but they may not be available over the network.
So here, caching external critical configuration data, data that is required for the actual running of your service, is really key.

Now, the second use case is a bit more generalized and more ubiquitous, and that's improving the average latency of data-layer access. All our services usually have some database there, and as usage grows, the latency of retrieving information from the database can really deteriorate. So at Wix we use a combination of DynamoDB as a persistence layer and, on top of that, an in-memory cache per service instance, which is populated by fetches but also by CDC events. CDC, change data capture, is the ability to get a stream of events every time your database gets updated. We'll cover that in depth in a little while.

Now, the third use case is when you have very high external traffic, like browser HTTP requests coming into your data center where all of your services are deployed, and you want to provide the fastest HTTP response back; for example site rendering, where you need to return HTML and things like that. Here you would like to put a cache before your services, not in the middle of your system, even before the request reaches your service. It serves as a proxy to your service, and it may be possible to avoid going into your service at all if there is a cache hit.

Now, while we discuss use cases where a cache helps, it's also important to understand when not to use caches. If you have a very young product, let's say you're working on a prototype, you're starting to test it out in the market and you don't expect huge traffic in the first few weeks or even months, I wouldn't architect this product with a cache built in from the beginning, right? Because caches do add complexity to your architecture. Especially tricky is the whole area of cache invalidation: making sure you don't have stale data that can really hurt what you're trying to test with your proof of concept or prototype. So be careful not to use a cache when you don't really need to.

Okay, so now we're going to delve into each of these caching patterns in depth so you really get a good grasp of why they are needed and how they are implemented. We're talking about the S3-backed static cache, using simple cloud object storage that can serve as a cache. We also have the DynamoDB plus CDC cache, the database event streaming area. And we have the HTTP reverse proxy cache, a proxy before your service that actually tries to eliminate traffic to your service. And we'll start with the first one, of course.

So we're going back to the issue I talked about at the beginning of this talk: how to make sure that a service overcomes unreliable requests, especially when it needs to start up with some external configuration data. At Wix we have a library called Koboshi, a remote data structure that is set up with user handler code for how to fetch the external information. So our e-commerce service, in its init of Koboshi, will fetch app definitions and all kinds of other configuration data that it requires from the Wix App Market. Koboshi will first try to fetch the information from the App Market if it does not yet exist in the Amazon S3 cloud object store; and if it does already exist there, it will fetch the information from there first, and that way it guarantees high availability.
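To make the pattern concrete, here is a minimal, hedged sketch of the same idea in Scala using the AWS SDK v2. It is not Koboshi's actual API; the bucket, key and the loadFromRemote callback are illustrative placeholders. On startup it prefers the copy persisted in S3, and a periodic refresh updates the backup whenever the remote source (the App Market in this example) is reachable.

```scala
import scala.util.Try
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{GetObjectRequest, PutObjectRequest}

// Hedged sketch only, not Koboshi's real API. Bucket, key and loadFromRemote are placeholders.
class S3BackedConfig(s3: S3Client, bucket: String, key: String, loadFromRemote: () => String) {

  private def readBackup(): Option[String] = Try(
    s3.getObjectAsBytes(GetObjectRequest.builder().bucket(bucket).key(key).build()).asUtf8String()
  ).toOption

  private def writeBackup(data: String): Unit =
    s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(), RequestBody.fromString(data))

  // Startup: prefer the durable S3 copy so the service can always come up; go to the remote
  // source only if no backup exists yet (first ever run).
  def load(): String = readBackup().getOrElse {
    val fresh = loadFromRemote()
    writeBackup(fresh)
    fresh
  }

  // Periodic refresh: if the remote source (e.g. the App Market) is reachable, update the S3 backup;
  // if it is down, keep serving the last known-good copy.
  def refresh(): Option[String] =
    Try(loadFromRemote()).toOption.map { fresh => writeBackup(fresh); fresh }
}
```

In the checkout example, loadFromRemote would wrap the App Market call, so a restart still succeeds while the App Market is down, as long as a backup already exists in S3.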
Now, if you are not using Kubernetes as your deployment platform, you can actually use the local disk and just persist the data there with Koboshi. It's really a matter of whether your disk actually belongs to the service in the long run. In Kubernetes, your local disk keeps getting unmounted from your pod and your pod gets recycled, so you can't really rely on the disk for your service instances in the long term. For that we use AWS S3.

Now, if the App Market is down, like I said, when the e-commerce service starts up, Koboshi will get the information from AWS S3, and then it will periodically try to fetch the information again, because this configuration data could potentially be updated at some point, even though it's mostly static. And once App Market connectivity is up and running again, the S3 bucket backing the cache will be updated if that's needed, if there's some new information there.

So to summarize this pattern: we're talking about a read-through type of cache. A read-through cache means that you don't access the source-of-truth data yourself; the cache accesses the data for you and you just ask it to retrieve it. So it will either retrieve something from its own storage or retrieve it from the external source of truth. We put a read-through cache like Koboshi in front of the third-party service. Now, it's important to understand that you can only use this type of cache if the data can fit into memory. If you have, say, terabytes of data that you need to load at startup, that may be a problem, so this won't be a good fit. But if your data can easily fit into memory, then it's really a good fit. And like I said, it's best used for external sources of static configuration.

What I wouldn't use this type of cache pattern for is highly dynamic data that keeps getting changed: you have a key, but the value keeps getting changed. This pattern is really not a good fit for that, because it's more about startup loading time. You don't want fetching happening all the time, because it can really affect the performance of your app, and it wouldn't really be wise to cache this sort of information this way. So it's really for statically updated, usually quite stable, data.

Okay, so next we're going to discuss our DynamoDB-based solution for reducing database latency. Here we have the example of the Stores catalog service that fetches information about catalog products from the Stores database, and it got slower and slower as traffic increased, so the owner decided they want to put a cache here. So how did they introduce the cache? They used our own library for this; it's called the Cached Key-Value Store. The way it works is that the service first writes new information and updates to the Stores catalog database, and then it also puts the information in the cache. This is an LRU cache: least recently used items are evicted when the size reaches the limit, and the data is stored there. Now, when the service tries to retrieve information, it will first retrieve from the cache, and only if there's a cache miss, if the information is not there, will it try to retrieve from the database. So as long as, statistically, most of the time there's a cache hit and the data is found in the local in-memory LRU cache, we're good to go.
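Here is a small, hedged sketch of that read/write flow. It is not the actual Cached Key-Value Store library, just the cache-aside idea with a bounded in-memory LRU in front of a stand-in "database" map:

```scala
import java.util.Collections

// Hedged sketch of the cache-aside flow, not the actual Cached Key-Value Store library.
// Writes go to the source of truth first and then to a bounded in-memory LRU cache;
// reads try the LRU first and fall back to the "database" (a stand-in map here) on a miss.
class CatalogCache(maxEntries: Int) {

  // LinkedHashMap in access-order mode evicts the least recently used entry once the limit is hit.
  private val lru: java.util.Map[String, String] = Collections.synchronizedMap(
    new java.util.LinkedHashMap[String, String](16, 0.75f, true) {
      override def removeEldestEntry(eldest: java.util.Map.Entry[String, String]): Boolean =
        size() > maxEntries
    })

  private val database = scala.collection.concurrent.TrieMap.empty[String, String] // stand-in for the catalog DB

  def put(productId: String, productJson: String): Unit = {
    database.put(productId, productJson) // 1. write to the source of truth
    lru.put(productId, productJson)      // 2. then update the cache
  }

  def get(productId: String): Option[String] =
    Option(lru.get(productId)).orElse {  // cache hit: no database round trip
      val fromDb = database.get(productId)
      fromDb.foreach(lru.put(productId, _)) // cache miss: read the DB and warm the cache
      fromDb
    }
}
```

The DynamoDB persistence layer and the CDC updates described next sit on top of exactly this kind of local structure.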
But we have a lot of instances of our service, so traffic is spread around among them, and a local in-memory cache on its own would not be very useful, right? We would probably have a lot of cache misses. So we don't only put the data into the local in-memory cache, we also put it into Amazon DynamoDB, so it is persistent and every instance can access it.

Now, we chose Amazon DynamoDB for the Cached Key-Value Store because it has great cross-data-center replication and great conflict resolution. So we can have data synchronized geographically all around the world, and conflicts will be automatically resolved. It is eventually consistent, and you can also put in as much information as you require; there's no size limit for this kind of cloud database.

Now let's think about this: okay, we put information in the database, but how are we going to update the rest of the instances that have not received this cache update? So we use DynamoDB change events, CDC, and we convert them to Kafka events, because at Wix we use Kafka. We have a dedicated topic per DynamoDB key-value store table, and it has all of the updates done to this table, one after the other, sequentially per partition. So we know that the same key will always be updated sequentially. Now, the library that holds the in-memory LRU cache also has a Kafka consumer, and it consumes the information from the CDC topic, this dedicated stream of events from the database, and updates the in-memory cache. So all of our service instances get automatically updated, no matter which of the instances actually made the change, which is great. And of course, if the key already exists, it will just be updated with the new value: if K1 had the value V1, now it is updated to V2.

Now there's also the question of what happens when the service restarts or a new version is deployed. Potentially it will start off with a completely empty in-memory cache, and that will mean cache misses. We'll have a cold cache and much worse latency, because we need to go to the database. In order to solve that, when the service starts up, the library first consumes the most recent events from the Kafka topic in order to have the ability to return cache hits. And because these are the most recent events, there is a greater likelihood of a good percentage of cache hits, because of the temporal locality rule of caching.

There's also the option to take not only the most recent events but the entire Kafka topic. That's because Kafka has a mode where you can keep a topic as a key-value storage of its own; it's called log compaction in Kafka. And in case you know that, at least for now, all of your data can fit into memory, you can set this topic as compacted, and the library will first fetch all of the events from the compacted topic and fill up your cache, meaning you'll have a 100% cache hit ratio, which is great. Now, with compaction it's important to configure it to aggressively clean up stale values, so it won't take too long to read all of the partitions sequentially. But as long as you do that, you're going to have fast startups with 100% cache hits, which is great. Of course, if you have a situation where your data set is growing and it's outgrowing the amount of memory your instances have, you can switch back to a non-compacted, regular topic and just take the most recent events.
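As a rough illustration of that startup warm-up plus live-update loop, here is a hedged sketch using the plain Kafka consumer API. Wix's own library handles this internally; the topic and broker names below are made up for the example.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Hedged sketch of the warm-up plus live-update loop, not Wix's library. Names are made up.
object CdcCacheWarmup {
  def main(args: Array[String]): Unit = {
    val cache = scala.collection.concurrent.TrieMap.empty[String, String]

    def applyEvent(key: String, value: String): Unit =
      if (value == null) cache.remove(key) // tombstone: the key was deleted in DynamoDB
      else cache.put(key, value)

    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer   = new KafkaConsumer[String, String](props)
    val topic      = "catalog-kv-store-cdc" // assumed name of the per-table CDC topic
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(topic, p.partition())).asJava

    // Every instance reads the whole topic itself, so partitions are assigned directly (no consumer group).
    consumer.assign(partitions)
    consumer.seekToBeginning(partitions)
    val endOffsets = consumer.endOffsets(partitions).asScala.map { case (tp, off) => tp -> off.longValue() }

    // Warm-up: replay the (ideally compacted) topic up to the offsets that were current at startup.
    while (partitions.asScala.exists(tp => consumer.position(tp) < endOffsets(tp)))
      consumer.poll(Duration.ofMillis(200)).asScala.foreach(r => applyEvent(r.key(), r.value()))

    // Live phase: keep applying CDC events so this instance sees updates made by any other instance.
    while (true)
      consumer.poll(Duration.ofSeconds(1)).asScala.foreach(r => applyEvent(r.key(), r.value()))
  }
}
```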
So of course there's a trade-off here, because you won't get a 100% hit ratio. But when you have so much data in your data set, you really can't expect the hit ratio to be 100%. You can play around with the number of events you consume for warm-up and hope you'll get a really high hit ratio.

Okay. So to summarize: we're usually talking about a cache-aside kind of situation, where you put the cache in front of your database in order to have fast access times. But it can also be a cache on its own: when you want to use it as a key-value store and you don't need a separate database, this can be your database with a built-in cache, which is great as well. And here we don't have a limitation on the data size, because we have a configurable LRU in-memory cache, and any data that doesn't fit in there gets discarded, so we don't have that problem anymore. We traded this off against the hit ratio, of course. So it's great for getting faster access to databases, and when you want a simple key-value store with a cache built on top of it. It's a great general-purpose pattern, but I'm not sure I would use it for highly critical application startup information, because then you would need to start up a Kafka consumer, and that can take time. It's probably better to just say: hey, S3, give me the information back, and be done with it. So if you have critical application startup information, I would probably go with the first pattern and not this one.

Okay, great. We've reached our final pattern for this talk, which is the HTTP reverse proxy cache. This is the biggest one in terms of scale. At Wix it's really important for us to provide the best experience for Wix website owners, and even more importantly for the visitors to their sites. We want them to have a great experience and really fast page speeds. For that we had to optimize all the site rendering that we do. A few years ago we switched to rendering our sites on the server side (server-side rendering), and that of course means faster rendering, because there are limits, especially on mobile, to the amount of compute power a mobile browser has available to render the site. So rendering it on the server and just bringing back the HTML will definitely be faster. But it still takes compute power and it still takes time, and our sites are visited millions of times every single day, so there's a lot of opportunity here for caching the results.

So Wix deploys Varnish caches all around the data centers, and they are placed right at the edge of the data center, right after DNS resolution and the load balancers. Now, if the site was already cached, the response comes back immediately from the Varnish cache. And of course, if there's a cache miss, the Varnish cache will forward the request over to the site rendering service. That's why it's a reverse proxy: it proxies requests, but usually the request will stop at the gate of the cache and the response will return immediately. Now, this is of course a very simplified flow; there are a lot more steps when you enter a Wix data center, a lot more security, load balancing and so on. And I'll also mention that Varnish Cache is open source, so you can check it out yourself.
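Varnish itself is written in C and configured with VCL, so the following is only a toy sketch of the reverse-proxy caching idea in Scala, using the JDK's built-in HTTP server and client. The site-renderer.internal host is a made-up placeholder, and a real setup would also handle TTLs, headers, error cases and purging.

```scala
import java.net.InetSocketAddress
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.concurrent.ConcurrentHashMap
import com.sun.net.httpserver.{HttpExchange, HttpServer}

// Toy sketch of a caching reverse proxy, not Varnish. It only shows hit vs. miss at the edge.
object ReverseProxyCacheSketch {
  private val client = HttpClient.newHttpClient()
  private val cache  = new ConcurrentHashMap[String, Array[Byte]]() // cache key -> rendered HTML

  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/", (exchange: HttpExchange) => {
      val key  = exchange.getRequestURI.toString                     // simplified cache key: path + query
      val body = cache.computeIfAbsent(key, k => renderViaOrigin(k)) // hit: answer from memory; miss: forward
      exchange.getResponseHeaders.add("Content-Type", "text/html")
      exchange.sendResponseHeaders(200, body.length.toLong)
      exchange.getResponseBody.write(body)
      exchange.close()
    })
    server.start()
  }

  // Cache miss: forward the request to the (hypothetical) site rendering service and keep the result.
  private def renderViaOrigin(pathAndQuery: String): Array[Byte] = {
    val request = HttpRequest.newBuilder(URI.create("http://site-renderer.internal" + pathAndQuery)).GET().build()
    client.send(request, HttpResponse.BodyHandlers.ofByteArray()).body()
  }
}
```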
Now, what about cache invalidation? Here it is very important. We are talking about huge scale, and when sites are updated from time to time by the Wix site owners, the old versions in the caches have to be invalidated immediately, so that site visitors don't get stale sites, which would be really bad. The way we do that is that each component that publishes new site versions, especially the editor services, emits a Kafka event when a new site version is published: a site-published purge request with the site ID. Then we have a dedicated Varnish purger service that gets this site ID from the event, goes to all the Varnish caches and requests a purge: hey, do you have a cached response for this site ID? And if a Varnish cache does, that entry gets deleted. The way the matching works is that each incoming HTTP request to render a site already carries an ETag with the site ID, so it can be perfectly matched to the purge request. And now that we purged this site ID, the Varnish caches no longer have that cached HTTP response, so the next time there's a request for this site it will go all the way through to the site rendering service, and the new version of the site will be cached, which is exactly what we wanted.

Now let's talk about size limits. With Varnish we are talking about a huge scale of traffic, so you have a few options. There is the memory-based option, which means the whole Varnish cache lives in memory, but that means you probably have to pay a lot to have that much memory provided to you by your cloud provider. If you want a less expensive option, you can use file-based storage, which is cheaper because it's stored on disk, but you can have a performance penalty because of disk fragmentation. Varnish does offer a third option, which is also backed by files but has a fragmentation-proof algorithm; this is only for paying customers, not the regular open source version, as it's part of Varnish Cache Plus.

Great. So to summarize the HTTP reverse proxy pattern: we're talking about a type of cache that sits before your application, and in terms of cache size configuration it really depends on the backend storage you choose, according to the options I showed on the previous slide. It's best used when you want to improve latency for relatively stable HTTP responses to external requests coming into your microservices ecosystem. But I wouldn't use it if you have extremely complex cases of knowing when your data is no longer valid and fresh. If it's going to be really hard to invalidate, because you have aggregated data from multiple sources and it's really hard to figure everything out, then I wouldn't use this kind of cache; I would use more local caches on each of the data sources you're going to aggregate.

And I've created this very simple flow chart to help understand the different use cases and when to use each cache. Is your data crucial for startup, and does it not really change a lot, being mostly static configuration data? Then you can just keep a backup in S3 and be done with it; you don't have to make it more complex than it is. But if your data is not really crucial for startup and it can change quite a lot, then a solution like the DynamoDB-plus-CDC event cache fits, or Redis caches, where you can also use the pub/sub pattern. With Redis that's quite an equivalent concept to what I covered with DynamoDB and Kafka, so you could check that out as well.
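Going back to the invalidation flow above, here is a hedged sketch of what a purger service could look like: a Kafka consumer for site-published events that sends a PURGE to every Varnish node. The topic name, the node list and the /purge/{siteId} convention are assumptions for illustration; Wix's real purger and the Varnish-side matching against the stored site-ID tag (configured in VCL) will differ.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

// Hedged sketch of a purger service. Topic name, Varnish node list and the /purge/{siteId}
// convention are assumptions for illustration, not Wix's actual setup or Varnish's default config.
object VarnishPurgerSketch {
  private val varnishNodes = Seq("http://varnish-1:6081", "http://varnish-2:6081") // assumed node list
  private val http         = HttpClient.newHttpClient()

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")
    props.put("group.id", "varnish-purger")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.List.of("site-published")) // assumed topic for "site published" events

    while (true)
      for (record <- consumer.poll(Duration.ofSeconds(1)).asScala) {
        val siteId = record.key() // assume the event key carries the site ID
        varnishNodes.foreach(node => purge(node, siteId))
      }
  }

  // Ask one Varnish node to drop its cached responses for this site; the Varnish side needs matching
  // VCL that maps the PURGE request to the site-ID tag stored with the cached response.
  private def purge(node: String, siteId: String): Unit = {
    val request = HttpRequest.newBuilder(URI.create(s"$node/purge/$siteId"))
      .method("PURGE", HttpRequest.BodyPublishers.noBody())
      .build()
    http.send(request, HttpResponse.BodyHandlers.discarding())
  }
}
```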
So the DynamoDB-plus-CDC event cache, or Redis, is a general-purpose kind of caching, but there are differences between Redis and DynamoDB that I can't cover within the scope of this talk, unfortunately. And of course, if you have external clients, meaning we're not talking about optimizing specific database access but about external clients of your system whose responses you want to cache, and it's relatively easy to invalidate those responses, then I would go with something like Varnish, or with content delivery networks that serve static content really fast. There are a lot of companies out there, like Fastly and Akamai, that are in the business of making sure data is delivered as fast as possible. Of course those are more expensive solutions, so it depends on your traffic scale and your available choices.

Okay, so I have more in-depth coverage of all these patterns in a blog post; that is where I started talking and writing about this very important topic of caching optimizations. And I also wanted to share a little bit about Greyhound. We have our own high-level Kafka SDK used by 2,000 microservices, called Greyhound, which has a lot of great features, like parallel consumption: whenever you want to increase the amount of parallel work you do in a specific service instance, that's much trickier to do with the regular Kafka SDK, but with Greyhound it's really, really simple. There are batch consumers with Greyhound, so you can easily get higher throughput with bulk APIs, and more resiliency with all kinds of retry policies to make sure that consumption and production of messages in your event-driven system are eventually successfully processed, and all kinds of other great features. I really recommend you check it out. I put the slides on SlideShare with all of the links, so you can just check that out and you will find the slides. And I want to thank you very much. You can also follow me on Twitter and LinkedIn to get updates on all the infrastructure we have on our team: Kafka, event-driven style development, key-value stores, caches, all around data for backend services and software engineering in general. And you can also find blog posts and previous talks I gave at conferences on my website. And if there are any questions, I will be more than happy to answer them.

Thank you, Natan. So I'm looking at the questions now. If you've got questions, now is the moment; there is a Q&A section that you can use. There is a little delay in the streaming, so let's wait a little bit.

Sure. If it's interesting, I can also discuss the difference between using DynamoDB and Redis.

For sure. We've got a couple of minutes for that.

Okay, great. So at Wix we decided that we want to have the fastest cache response possible, and that's why we chose to have these in-memory LRU caches. Nothing is faster than just getting a response from memory. Now, if you go with Redis, it usually means you have a Redis cluster for the cache, which means an extra hop. It's in the same data center, so it doesn't add too much latency, it's not that big a deal, but we do have use cases where this zero latency really improves the overall latency of the system.
Any hop you introduce means worse performance, and it also introduces another failure point, because it is an extra network call. With Redis you also need to configure sharding when you reach bigger data sets; with DynamoDB, as long as you have relatively well distributed partitions for your key values, you don't need to deal with that, and you get endless scale. So that's the main difference between choosing these two options. But Redis is extremely popular as a caching solution and it works really, really well for most use cases, so of course, I recommend that