Video details

Kristie Wirth: Stats Don't Have to Be Scary: Automatic A/B Test Analysis Using Python

Python
08.29.2022
English

Learn how my workplace manages to analyze dozens of concurrent A/B tests with millions of data points! I’ll discuss our previous manual analysis process, some things that have changed, and do a down-to-earth walkthrough of how you too can use Python to automate analyzing your tests.

Transcript

All right, so I am so excited to talk to you all today. As mentioned, I'm Kristie Wirth. I'm going to talk about A/B testing, which is a topic I'm really passionate about. There is a lot of potential for how A/B testing can benefit a bunch of different businesses, so I hope it's useful to you. The other part of this that I'm really excited about is demystifying statistics. I think statistics can have a sort of feel of scariness or complication to it, and I want to really bring it down to earth and make all of this accessible for anyone, whatever your role is. But before I talk about my fun tool for automating some of the aspects of A/B testing, first I want to talk about what A/B testing is and why you might even want to do it. So A/B testing, if you're not familiar, is when you have a website with multiple versions of something and you want to test out some new feature. You want to see: if users see this new feature, will they be more likely to sign up or upgrade or whatever metric is important to your business? Sometimes when I talk about A/B testing to various engineers or people at my business, they go, "Kristie, do I really have to do this fancy data thing that you're so excited about? I have this feature. I know it's going to be great. So how about I just install this cool new thing on the website and then, I don't know, maybe we can collect some data before and after, and then you can do your data magic and that'll be good enough, right? Or maybe I will install the feature and we'll just see who does things with the feature or not, and then we can compare those people and see the positive impact. No A/B testing needed. Let's keep it simple." But the problem with those approaches is that they don't actually give you the ability to make causal statements, which is really important. So what do I mean by causal statements? Causal statements are things like this: installing my fancy new feature should lead to a 5% increase in upgrade rate. You probably want to say something like that. Not all the time, but a lot of the time when you have a new feature, you want to be able to brag about it. You want to understand the impact. But you can't make those kinds of statements without a formal A/B test. The reason is that if you do something like a before-and-after comparison, the problem is that something else might have changed in that time frame that led to any effects you're seeing. So you can't attribute it to feature X that you've installed on the website. It could be something else that happened during that time period. If you do something like comparing users who interact with a feature versus those who don't, then you're missing out on any potential differences between those groups. There could be something inherently different about users who interact with your feature versus users who don't. Maybe users who interact with whatever feature you've created are already more engaged with your website, so they're already going to be more likely to sign up or whatever you're measuring. So you're not really measuring the impact of your feature. And you might be thinking that some of that sounds like semantics. Why do you care about any of that?
And the reason all of that matters is because if you think that your feature caused some effect, like higher sign-up rates, but that's not actually what happened, then when you go on and install your cool new feature for all of the users, you're not going to see that effect continue over time. So you'll have a project that seemed really great, and then it flops when you actually start looking at the data. That's no good. You don't want that. The other part is that it can lead to misunderstandings about your users. You go on to pursue further projects based on what you think has an effect, but really you didn't do an A/B test, so you don't know for sure, and you end up doing a bunch of projects that don't lead to the effect you're hoping for. That's no good. For example, let's say you didn't do an A/B test because you said, who needs data, I just want to install things. And what you did is decide that adding a button to your web page increases sign-up rates. Well, that's great. So you add all of the buttons to all of your web pages, and then all of a sudden you realize the sign-up rates didn't move at all. Maybe you've sunk all of this time and effort and engineering into making the coolest buttons, and somehow you're not seeing an impact at all. So that's no good. That's a somewhat silly example, but you can imagine in other situations how you might pursue different projects based on misunderstandings and not see an effect. So there are a lot of components to A/B testing. I'm going to go through some of the logic of how it works, and then later walk through some of the code that does this and how you might automate it yourself. But let's start with the real basics. What do you need to do in an A/B test? Well, you need to assign people to conditions. You've got a visitor coming to your web page and you need to put them into some sort of bucket. So we've got the treatment bucket. The treatment bucket is for people who see the fancy new feature and get to interact with that thing you've created. And then you've also got the control bucket; that is, the group of people who are seeing the old version without anything fancy and new installed. Right? And how do you divide people into those buckets? That's something you might need to think about. It can be as simple as half and half: you just randomly put people into each. But one thing you might not think about is the risk of your experiment, and I'd urge you to think about that as well. So if you've done a huge redesign of your pricing page, that could be amazing for users, but it could also have a really unintended negative effect where all of a sudden people are not upgrading and your revenue is going down. That's no good. So you might want to do something more conservative, where you only show this new feature to 30% of new visitors to your website, or maybe, if you're really worried, if you're kind of nervous, just show 10%. Then if you realize partway through the experiment that you're completely ruining your upgrade rate, you can just stop it. Accomplishing all of this might sound simple offhand, but there are a couple of things you need to consider about how you actually put people into buckets, beyond just the percentages I'm talking about. One thing that's important is to have the same person always go to the same condition in a given experiment, because you don't want your users to have a mismatched, confusing experience.
It can be jarring if suddenly they're seeing the new feature one day and the old feature the next day; they're just completely getting flip-flopped on your website. That can be confusing, so we don't want to do that. But it's also really important that users see different things in different tests. If you have a bunch of tests running at the same time, and we always have a bunch of A/B tests going, then it's really important that the same user isn't always in the treatment group. For example, they shouldn't always be seeing the new feature, because then you run into the problem of: well, this person had higher sign-up rates, which is great, but was it this feature, or was it this other test, or that other test? You can't really separate those if they're all related, so you don't want them related at all. So how do we do all that? Well, we came up with a sort of clever process for how we assign people into conditions. It involves some math, but stay with me, it's going to be great. All right, so a given A/B test runs on something called an interstate. Let's say we've got an A/B test running on Interstate 227, and the interstate is a random prime number. I'll tell you why that's important later, but this is A/B test 227. Then we've got our interstate for that test and we split it up into lanes. Here we decided we wanted half and half in each condition, which is pretty standard, so we put half in the control group: zero to 113, that's about half of the interstate. And then the other lane is going to be our treatment lane; they're going to see the fancy new feature, and that's about the other half of the interstate. So we've got a visitor come along to our website, and this little guy has an identity ID of 5664, and we're going to assign him to a condition. How do we do that? Well, we do a little math. We take the identity ID modulo the interstate and we get a number back. And that number, 216, is in the treatment range of numbers, so this person goes to the treatment. You might be thinking, that sounds needlessly complicated, why are you doing this? Because of the reasons I mentioned earlier. We want the same person in the same condition in the same test. So what's important here is that this math includes the identity ID, so the same person is always going to go to the same place. The other part of this math is that we have the interstate in there, so as long as our tests are running on different interstates, people are going to end up in different conditions in different tests. And in case you're wondering what a modulo is, that's when you do the division and take the remainder term left over, if you remember long division from high school or wherever you learned it. The modulo part isn't horribly important; it's more just some math that does something random-ish involving the identity ID and the interstate to accomplish the things I talked about earlier. The one other important point here is that the interstate is a prime number, which I said earlier. That's important because if you use evenly divisible numbers, you get a lot of the same results; you get a lot of people landing in the same conditions. So we're trying to get that randomization while also keeping it consistent in the ways we want it to be. Lots of things to consider. So this is great. We've got this fun little process, and it does all the things we want it to do. Fantastic.
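To make that concrete, here's a minimal Python sketch of what that assignment step could look like. The function name, the 50/50 lane split, and the exact boundary handling are illustrative assumptions based on the process described above, not the exact production implementation.

```python
def assign_condition(identity_id: int, interstate: int, control_percent: float = 0.5) -> str:
    """Deterministically bucket a visitor for one A/B test.

    The same identity_id always lands in the same condition for a given
    interstate, and tests on different (prime) interstates shuffle people
    differently, so nobody is stuck in the treatment group for every test.
    """
    # Lane boundary: with 50% control on interstate 227, lanes 0-113 are control.
    control_cutoff = round(interstate * control_percent)
    lane = identity_id % interstate
    return "control" if lane < control_cutoff else "treatment"


# The example from the talk: visitor 5664 on interstate 227 lands in lane 216,
# which falls in the treatment range.
print(5664 % 227)                   # 216
print(assign_condition(5664, 227))  # treatment
```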
But the problem is that we kept all of this really manual for a while. We told our poor engineers that they had to remember what prime numbers are, they had to figure out something about a modulo, and then they had to put people in these different buckets based on all that fancy math. And remember, these are the same engineers who didn't even want to do an A/B test in the first place. And we didn't do a great job of explaining what I just told you. We should have told them all that, but we didn't, and so engineers were confused, rightfully so. They were putting different tests on the same interstate, which defeats the whole process, and mixing up various parts of the arithmetic that needed to be set up. It was very confusing. So what I decided to do was automate a lot of this process and make the setup a bit easier. I built a web app that accomplishes all of the logic that I just talked about, but instead of having people manually code this in, it generates code for you. They can literally go to this web app, call their experiment whatever they want to call it, and set up the percentage they want in the control, which goes back to all that talk about risk I mentioned earlier. It's often 50/50, but if it's something risky, you might want a higher percentage in the control. And that gives them back some automatically generated code. They don't have to worry about interstates; they can totally forget that concept and leave it to the data scientists. Instead they get code that works. They can literally just copy and paste it. It's got the name of the test set up in the format that we want, so it's consistent. It also has the lanes defined nicely based on that control percentage, but they don't even need to think about it. And the interstate is automatically determined, and I check to make sure it's not one that's already being used (a rough sketch of that setup logic is below). As a bonus, I generated some front-end code too, because you need to toggle the feature off and on in some fashion. This is a bit more of a work in progress, since you need to add a bunch of stuff to actually toggle your feature, but it's a great jumping-off point and it ensures some consistency in our code. All right, so I've talked a lot about the setup process, but the big part here that's really exciting is the analysis portion of things. So how does that look? Well, first we need to actually get the data from the test somehow. And this is going to involve a bunch of SQL, which I'm not going to show you here; I'm going to show you the logic behind the SQL, and you can look at that later. So we've got a person come to our web page. We need to assign them to some condition. We do the fancy process behind the scenes that I talked about, with the modulo and the lanes and the interstate and such, and this person ends up in the treatment condition. The point when the person appears in our study is called the birth event, and that's a term I'll use a couple of times later. You can remember it's the birth event because they are born into the study. If they don't have a birth event, they do not exist in our A/B test. But once they are born into the study, we assign them to a condition based on that process. Then you have someone else born into the study. This is the moment they visit the web page, just before the new feature gets toggled off or on. This person goes to the control condition based on all of that logic, and so on and so forth.
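Since the talk doesn't show the generator itself, here is a rough sketch of what that setup side might do: pick a prime interstate that no other running test is using, and turn the chosen control percentage into lane ranges. The function names and the idea of passing in a set of used interstates are assumptions for illustration only.

```python
import random


def is_prime(n: int) -> bool:
    """Trial-division primality check; plenty fast for interstate-sized numbers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def pick_interstate(used_interstates: set, low: int = 100, high: int = 1000) -> int:
    """Pick a random prime interstate that isn't already assigned to another test."""
    candidates = [n for n in range(low, high) if is_prime(n) and n not in used_interstates]
    return random.choice(candidates)


def build_lanes(interstate: int, control_percent: float) -> dict:
    """Translate the chosen control percentage into contiguous lane ranges."""
    cutoff = round(interstate * control_percent)
    return {"control": range(0, cutoff), "treatment": range(cutoff, interstate)}


# Example: set up a new 50/50 test while avoiding interstates already in use.
interstate = pick_interstate(used_interstates={227, 331})
print(interstate, build_lanes(interstate, control_percent=0.5))
```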
We have a bunch of visitors. The other important term here is the death event. The death event is the metric of interest we care about. We want to know whether someone is going to sign up more often with the new feature, or upgrade, or whatever metric matters to you. So the death event is the first time that someone ever does that event, the first time they sign up. And you can remember that terminology because the death event is when they leave the study: if you sign up, then we no longer care what you do after that point. You've reached the conclusion of the study. It's also important to set a time frame for how we're going to calculate whether people convert or not. Here we set a time frame of one day. So what we want to know is: do people have that death event within the time period of one day? And this is a concept called censoring. This is really important because we need to give everyone an equal chance of converting. If we have someone who just came to our web page a couple of hours ago versus someone who came several months ago, they're not going to have equal chances of signing up. So instead it's important to set time frames that we check for each individual person. Once they first visit the given page with the feature, we give them one day to sign up or not. If they don't convert within that time frame, then we don't count them as converting. The other thing I care about is exclusions. Let's say I found that this person was an employee at Zapier. Then I would exclude them entirely from the study, because I'm not going to include employees; they're just testing stuff. I also exclude spam users; I don't want those people in the A/B testing. And I exclude people whose death event happened before their first birth event, because if they already signed up, they might not do it again. Sometimes people do that death event multiple times, but people who've already done it once probably look different from people doing it for the first time. So we only want to look at people who have not done the thing yet, who have never signed up, and see whether our feature is going to influence that. In addition, we also exclude people who haven't had the entire time period yet. So if I've set my time period to one day and someone only appeared in my study five hours ago, I'm going to exclude them entirely, at least for now, because they don't have the same chance as everyone else in the study. We really want to give people equal chances so we're comparing apples to apples, so we just completely take them out if it's been less than one day since they appeared. The next thing I've got to do is compute the time from the birth event to the death event, so from when they first appear on the web page to the point when they sign up or upgrade or whatever I'm looking at. If you see a null here, that's because they never did that thing. So that's fairly self-explanatory. But what I've got to do with those is then mark them as successes or failures. Did they convert or not? Did they sign up or not? And that's based on the time frame plus the censoring concept. Pretty self-explanatory: this person signed up within an hour, success. This person in two hours, success. This person never signed up, clearly not a success. This person signed up, but it took them a really long time, and since we have that time frame, we're blindly looking at that time frame cutoff and that's it.
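Here's a small pandas sketch of that censoring and success/failure bookkeeping, assuming the birth and death timestamps have already been pulled out of the SQL. The column names and the one-day window are placeholders, and the rows are toy data.

```python
import pandas as pd

WINDOW = pd.Timedelta(days=1)    # the censoring window from the example above
now = pd.Timestamp.now(tz="UTC")

# Toy data: each row is one visitor's birth event and (possibly missing) death event.
df = pd.DataFrame({
    "condition": ["treatment", "treatment", "control", "control"],
    "birth_at": [now - pd.Timedelta(days=3), now - pd.Timedelta(days=2),
                 now - pd.Timedelta(days=2), now - pd.Timedelta(hours=5)],
    "death_at": [now - pd.Timedelta(days=3) + pd.Timedelta(hours=1), pd.NaT,
                 now - pd.Timedelta(hours=1), pd.NaT],
})

# Exclude anyone who hasn't had the full window yet (e.g. born only 5 hours ago):
# they haven't had an equal chance to convert, so they're left out for now.
df = df[now - df["birth_at"] >= WINDOW].copy()

# Success = the death event happened within the window after the birth event.
# A missing death event, or one outside the window, counts as not converting.
time_to_convert = df["death_at"] - df["birth_at"]
df["converted"] = time_to_convert.notna() & (time_to_convert <= WINDOW)

# Tally successes and totals per condition, ready for the analysis step.
print(df.groupby("condition")["converted"].agg(successes="sum", trials="count"))
```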
So they signed up outside the time frame, we count them as not converting, and so on and so forth. Ultimately we're just going to tally those numbers. So we have our treatment condition, we've got our control, we've got two people who converted, one person who didn't, and so on and so forth. Now we have the data from our actual test, so we're done, right? Not quite. I want to bring us back to the main question we're trying to answer by doing an A/B test. I want to know, ultimately, if I install some feature, will it have an effect? That's what you really want to know. That's why you're doing an A/B test and not just installing it and ignoring the data entirely. And you might think that you can just look at the results, the counts we calculated, and figure out from there what you would expect to see. But what's challenging is that what you saw in the A/B test is a particular moment in time with a particular group of people, and there's always some uncertainty. That's what we like to talk about in statistics: uncertainty, right? And if you install this feature, you're not going to see exactly what you saw in the test; you're going to see something maybe similar-ish. So how do we actually figure out a prediction of what you're going to see when you install it? Because that's what matters. You know the data from the actual test; you've done those counts, you've got your fancy SQL set up, you know who converted and who didn't. But what actual conversion rates led to that? That's the part that's a little tricky. If we got two successes and one failure, what's the actual underlying conversion rate? Maybe it's something like 68%, or maybe it's 65%. You don't know for sure. If you kept collecting more data, you might see something that more closely approximates the true conversion rate, but you're never certain. So what we want to figure out is the underlying conversion rate, and how we can use it to predict what might actually happen, not just what we saw with this particular group of people in the test. So I'm going to walk through some code that illustrates this, and I'll certainly share all my snippets later. I start with some code that generates a bunch of values. These are beta distributions that generate 100,000 values each. What I'm doing is generating a distribution of possible conversion rates, because as I mentioned, we know the data we saw, that's a fact, but what we don't know is the actual conversion rate. So I want to generate a whole bunch of possibilities, and 100,000 is just a lot; that's why I chose it. I do this based on the data we saw, the number of successes and the number of failures, and I do it for the control group and the treatment group: 100,000 possible conversion rates that could have led to the data we saw. Then what I do is subtract those from each other. That's it. These are just lists of numbers, just possible conversion rates, and I take the treatment minus the control. Why do I do that? Because I want to see the effect of my feature, right? You want to be able to say something like: it's an increase of 1%, or a decrease of 5%. And the way you do that is by looking at the difference between the control group, who saw nothing related to your feature, and the treatment group, who saw something related to your feature. The difference between those is the effect you might expect when installing. So I subtract them. I want to say how much higher or lower the treatment was when they saw that fancy feature.
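A minimal NumPy sketch of that simulation step, using the two-successes-one-failure treatment tally from the toy example plus made-up control counts. Drawing from Beta(successes + 1, failures + 1) is my assumption of a standard flat prior; the talk doesn't spell out the exact parameterization used.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # number of plausible underlying conversion rates to simulate per group

# Observed counts (toy numbers: the talk's treatment tally plus made-up control counts).
treatment_successes, treatment_failures = 2, 1
control_successes, control_failures = 1, 2

# 100,000 conversion rates that could plausibly have produced what we observed.
treatment_rates = rng.beta(treatment_successes + 1, treatment_failures + 1, size=N)
control_rates = rng.beta(control_successes + 1, control_failures + 1, size=N)

# The effect of the feature across all those simulated realities:
# how much higher (or lower) the treatment's rate is than the control's.
effects = treatment_rates - control_rates
print(round(effects.mean(), 3))  # rough central estimate of the lift
```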
But I also want to know it for 100,000 possible realities. And there's always something about uncertainty; these are statistics, after all. So what I'm going to do is set my threshold for how comfortable with uncertainty I am. Here I'm setting it to 0.002, which translates to: I'm comfortable with 0.2% uncertainty. Or in other words, I want to be 99.8% sure about this, which is pretty darn sure, but I like to be sure. You'll never be 100% sure, right? There's always that level of ambiguity. And then what I'm going to do is figure out the actual prediction I have. I've figured out how sure I am, I've got some list of effects, but I don't have an actual prediction I can tell somebody, so I've got to get to that point. So I take my 100,000 effects, my 100,000 differences, things like an increase of 0.3% or a decrease of 0.2%, whatever. I've got that list, I sort them from smallest to largest, and I shave 0.1% off the bottom and 0.1% off the top. And that gives me, if you're following along, the middle 99.8% of my list of values, which is also how sure I want to be. It works out. Then ultimately I've got to translate that list of values. I've got this range I've computed, but I want to make it easier for the users. I'm hoping you all are following along with me, but this is still a bit complex, so I want to simplify it in the actual automatic A/B testing app. What I do is give a simple recommendation statement based on the range I've calculated, that 99.8% range, and it looks a little different based on the goal. A lot of times you want to increase something; you want to increase sign-ups, et cetera. If that's your goal, if that's the ideal, let's say you got back this range based on all that computing we did in the last slide: you're 99.8% sure that you will see anything from a 3% increase all the way up to a 12% increase. That's the range of possibilities we think might happen, the middle 99.8% of that list of values. Based on that, what I have the automatic A/B testing app tell you is to install. And that kind of checks out if you think about it, right? Because if you want something to increase and the only thing that comes back is increases, we don't know exactly what the increase is, it might be 3%, it might be 6%, it might be 12%, but those are all still successes. That's what you want. So if you're only going to get increases, then you should probably install. Done. The thing that gets a bit more murky is when you have a mix of decreases and increases. You want to increase sign-up rate or upgrade rate, and I tell you that, based on the calculations, you could have anything from a decrease of 1% all the way up to an increase of 4%. It's going to be somewhere in that range based on the data I saw. That's where I say something a bit more nuanced: it's up to you. Which sounds a bit muddy, but I try to give some guidance as well on what you might do with that "up to you" part, because ultimately it does depend on a couple of different factors. If your feature is really easy to install, you might just go for it. If it affects only a small number of people, it might not be a big deal to take on some amount of risk there, a slight risk of a decrease. The other thing to notice is that the range is weighted toward the positive end of the scale, right? There are more positive numbers in that range than negative numbers, so an increase is more likely than a decrease.
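Continuing that sketch, here's one way the trim-the-tails-and-check-the-range step might look, covering both the "increase" goal described here and the flipped "decrease" goal that comes up next. The np.quantile call is just a vectorized version of the sort-and-shave described above, and the recommendation wording and helper name are hypothetical.

```python
import numpy as np


def recommend(effects: np.ndarray, goal: str = "increase", uncertainty: float = 0.002) -> str:
    """Turn simulated effects into a plain-language recommendation.

    Shaves uncertainty/2 off each tail (0.1% + 0.1% = 0.2% total), leaving the
    middle 99.8% range of plausible effects, then checks whether that whole
    range points in the direction the team is hoping for.
    """
    low, high = np.quantile(effects, [uncertainty / 2, 1 - uncertainty / 2])
    if goal == "increase":
        # Worst case is the low end: if even that is an increase, recommend installing.
        if low > 0:
            return f"Install: expected change between {low:+.1%} and {high:+.1%}"
    else:
        # Goal is a decrease (e.g. support tickets): worst case is the high end.
        if high < 0:
            return f"Install: expected change between {low:+.1%} and {high:+.1%}"
    return f"Up to you: expected change between {low:+.1%} and {high:+.1%}"


# Using the `effects` array from the previous snippet:
# print(recommend(effects, goal="increase"))
```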
So based on that, you might go ahead and install, but it depends: if it's a lot of work, maybe you want to be more sure. There's a lot of nuance here that's hard to completely bake into an automatic tool, but you can give your users the tools to make those kinds of decisions for themselves. And just to hammer that home: ultimately, I'm looking at the lower end of the scale here because I care about the worst possible thing that could happen. Conversely, the logic on the decrease side is very similar, just flipped, right? If you want to decrease the rate of support tickets or something like that, then you're just doing the opposite. If you got back a range that only includes decreases, yeah, you should install; you're 99.8% sure some sort of decrease will happen, so install it. And if the range is mixed, then, as you might predict, it's ultimately up to you. And just to be clear, here I'm looking at the other end of the spectrum: what's the highest possible value? If you were trying to decrease support tickets and you could possibly get an 8% increase in support tickets, that's pretty high. You might not want to install that feature; it might be backfiring. So how do I do all this in the fancy web app? I'll walk through what it actually looks like, but it's built on all of the underlying concepts that I was talking about. Now, if someone wants results from their A/B test, instead of manually asking a data person to do the analysis every single time and having that process be kind of slow, they can go to a web app and specify the name of the experiment they're running, choose the death event from a list (that's the metric they care about, the thing they want to have an impact on), whether they want to increase or decrease that thing, and the time window they care about. That's the censoring concept we talked about; you have to be able to give everyone that equal chance, right? Maybe it's one day, maybe it's seven days; it depends on what you're looking at. And they get back some pretty easy-to-use data. So here's just an instrumentation check to make sure your experiment is running as expected. This has some information about what percent of people are seeing the new version versus the control. If you expected that to be 50/50, then you're in good shape; otherwise maybe you did something wrong with the setup. And there's just a date range of how long it's been running. This is the most important part: the analysis message. This is combining all of the logic that we've talked about. Ultimately, I give them a really bold, simple statement with a fun emoji that they should install. And if you're following along, that's because of the range, that predicted range we talked about, the 99.8% one. Here the result has all positive numbers; only increases are expected. So this is where the "definitely install" message comes up. The "maybe install" message is a bit more nuanced. There's some extra information in here, but ultimately I'm telling them it's up to you. There are a lot of factors that can influence whether you may or may not want to install this, and it's hard to automatically tell you exactly what to do here, but here's the information you can use to make that kind of decision. I want to empower our users to make that decision for themselves based on the different situations they're in. Here there's a mix, and it's slightly weighted toward the positive side.
They're hoping for an increase, so you might want to install here, but it depends on some other things as well. I've got a bunch of resources; I'll certainly share my slides link on Slack. This is hopefully all the building blocks you need to make a similar web app for yourself. I've got the condition assignment code that I walked through in great detail here. The actual SQL that I use to get those conversion rates has a lot of the complex logic that we talked about: the exclusions, the time periods, all of that. The way I've actually got it set up in my web app is as a string of SQL, and I literally run that SQL differently based on what people select in the form in the web app; I just populate the string with different values. I've got the analysis code that we talked about, which does all of the simulations of the various possible conversion rates. And there's some bonus code for if you have to work with Redshift, like I unfortunately do; there are some snippets I've figured out to make that a little easier and run your SQL. All right, so ultimately that's my presentation. I've got my LinkedIn and GitHub if you want to follow along with some stuff I'm working on. My five-second plug for where I work: I work at a lovely company called Zapier and we're hiring, so ask me about that if you'd like, or just ask me questions later. I don't have time for questions right now, but I would love to chat with anyone during the conference. So thank you so much.