Video details

Probabilistic Python: An Introduction to Bayesian Modeling with PyM || Chris Fonnesbeck


Bayesian statistical methods offer a powerful set of tools to tackle a wide variety of data science problems. In addition, the Bayesian approach generates results that are easy to interpret and automatically account for uncertainty in quantities that we wish to estimate and predict. Historically, computational challenges have been a barrier, particularly to new users, but there now exists a mature set of probabilistic programming tools that are both capable and easy to learn. We will use the newest release of PyMC (version 4) in this tutorial, but the concepts and approaches that will be taught are portable to any probabilistic programming framework.
This tutorial is intended for practicing and aspiring data scientists and analysts looking to learn how to apply Bayesian statistics and probabilistic programming to their work. It will provide learners with a high-level understanding of Bayesian statistical methods and their potential for use in a variety of applications. They will also gain hands-on experience with applying these methods using PyMC, specifically including the specification, fitting and checking of models applied to a couple of real-world datasets.
PUBLICATION PERMISSIONS: PyData provided Coding Tech with the permission to republish PyData talks.
CREDITS: PyData YouTube channel:


It's very good to be back here. I gave a tutorial back in 2019 when the world was a different place. I'm glad there's still a world to come back to and give talks to. So welcome everyone. With these tutorials. It's kind of difficult because there's always a mixed background, right? Some of you out there probably know some statistics and some machine learning and some Python, and different combinations of both. So I'm not going to make any assumptions about your background knowledge here. This is meant to be a kind of a high level overview of Bayesian statistics, probabilistic programming, and PMC. So we won't go in depth into anything. This is kind of just a taste. And there's lots and lots of resources out there for you if you want to dig a little bit deeper. As Matteo mentioned, I work for the Philadelphia Phillies, which is a baseball team in the United States. We are hiring, by the way. We get a good office view, unless you don't like baseball, in which case it's kind of an obnoxious view. Software developers, quantitative analysts, data engineers. If you can throw baseball 100 miles an hour, release pictures, we're also looking for so let me know. So probabilistic programming isn't really a new thing. The term is kind of new. When we talk about a probabilistic program, we're really talking about any computer program whose output is non deterministic or it's partially based on the outcomes of random numbers. Really. You can do PP in any language that has a random number generator. But of course, some choices of language are better than others. And what we're really specifically talking about here essentially is a language where there are additional primitives, right? So we're used to primitives in the sense of integers and floats and strings. A probabilistic programming has primitives that are stochastic. So objects that generate random numbers have stochastic properties, particularly from known statistical distributions, normal distributions, binomials, things like that. So you could have an object that is a normal distribution. You can draw 100 random values from that. It can get more sophisticated than that. There are distributions over entire functions, like a Gaussian process, where you can draw a function from it, a distribution or distribution over functions, and make predictions on that. And then importantly, you can condition one based on the other. So you can have a variable that's drawn from one distribution and then use that draw to condition the draw of another random variable. So what this does is it allows us to specify statistical models, probability models, at a very high level, right? It's all about abstracting away a lot of this stuff. And the main reason we do probabilistic programming is to facilitate Bayesian inference. And Bayesian inference is probably different than the statistics you may have learned at university, although newer programs are incorporating this stuff a little bit more. Why should we do first of all, having said, I'd make no assumptions. What is Beijing inference? What is Bayesian statistics? This is Andrew Gelman's definition and he's a prominent bayesian out of Columbia University. He says, practical methods for making inferences from data, using probability models for quantities we observe and about which we wish to learn. So the idea is you build a model using probability distribution. Everything that you are interested in has a probability distribution associated with it. And the probabilities that we're talking about here are a little bit backwards from what you might be used to. And in fact, Beijing inference used to be called inverse probability, because what we're going to do is we're going to infer from effects to causes. So why here is some data that you observed. Theta is our unknown quantities, random variables, predictions, whatever you're interested in. And you're going to observe something about the world, and we're going to use that to infer something about the things you're interested in. So it's these conditioning statements that help determine what the causes might be. Why would we want to do this? This is, again, a definition from Lincoln Barker's book on basic statistics. It's attractive because it's useful. There are sort of philosophical reasons that you might want to be Bayesian rather than frequentist, but it's very useful and it's very simple, at least conceptually it's very simple. All the pieces are very simple, but it can be used to build something large, complex and interesting. So it's a utility, really, that motivates me. Here's the requisite showing of the base formula. So this is what we're doing here. We're going to use probability distributions to characterize what we know and don't know about the stuff we're interested in. Again and again, theta. Here are the things we're interested in. It could be one variable, it could be 100 variables, be an entire model. And so the idea is you have some information about the world before you start doing your experiment or your project. And we encapsulate that in a prior. This is what we know before seeing the data. And then we have some observations, some data that we encapsulate as a statistical likelihood, and we use that. The magic of Bayes formula is that we use that to update our information about theta and give us that inverse probability that I talked about before. So after having observed something, what can I say about theta? This is really just a rule for learning from data, and that's really what machine learning is trying to do all the time. So in probabilistic programming, we want to come up with a program that does this for us because that's kind of the hard part. Invasion inference is the actual calculation of that posterior distribution. And in a stochastic program, what we're doing is we're specifying most of the right hand side of that equation. So we're going to give it information about our prior, we're going to get information about the data. And you model this kind of joint distribution of your data and your unknowns. So let's look at these pieces a little bit. So, prior distributions. Again, we're trying to quantify the uncertainty in the latent variables before we've seen the day. We may not know much about it. So everything has to be given a distribution in base. So we might give it a normal distribution, which is your bell curve, maybe a mean of zero and a standard deviation of one. It says that we know essentially it's somewhere between negative four and four. Perhaps this is also a normal distribution that line across the bottom. So normal with means zero, but standard deviation of 100. So variance of 10,000, essentially a straight line in this region and that means we don't know very much about it, right? It's diffuse, it's uninformative, largely uninformative. Often we know something about the system we're studying, right? So let's say something like, I don't know, an infectious disease that ravages the world. Its prevalence is going to be low even during a pandemic, right? So a beta distribution describes things between zero and one. So a rate can be either anything between nonexistent to everywhere zero and one. And even a prevalent disease is going to be 10% or less most of the time. So that's a beta 150. So this is again characterizing what we know ahead of time. Then we have the likelihood, the other piece that we're responsible for specifying, how does the data relate to the model. So now we're conditioning your model on the observed data. And our data can look a lot of different ways, right? So our data is often normally distributed heights and weights of people. Any sums of other variables tend to be normal because of something called the central limit theorem. So what we observe x, sorry, I switched it from wide x, x here could be normal with some unknown mew and variance that we need to estimate baseball, right? So we have hitters that tried to hit the ball in a certain number of tries. So we want to estimate the probability that this hitter might get a hit sometime in the future. So we're going to use a binomial distribution to describe that sort of data successes out of a certain number of trials. Let's say you're on a website and you're trying to estimate expected numbers of visitors per month. There are likelihoods for characterizing counts of things when that could be up in the tens or hundreds of thousands. And so the magic of Bayes is that it's going to take those two pieces of information, what we know before some new information, and generate something afterwards. So that funny little symbol in between the two sides of the equation means proportional to needs, equal to up to, but not including a constant value. And so full bays looks like that. And the thing that makes Bayes hard is essentially the probability of y underneath in the denominator of that fraction in Bayes formula that is kind of a weird thing. Probability of Y, probability of the data. It's actually the marginal probability of the data. So it's the numerator integrated over all of the thetas. And I took calculus in school. I can do integration when there's like one variable. I think I took a test where I had to do twice two variables. Multi variable integration is very hard and nearly impossible, and particularly in models where you have hundreds of thousands of parameters that you have to integrate over. So it requires numerical methods in order to do it. And that's what probabilistic programming is going to do for us. Is it's going to take that inference procedure that's usually very complex and tricky and kind of make it go away. But like I like to do, and when I teach this in a course, you can do it by hand for simple problems. And I think it's useful even in a situation like this, to do this together. So we'll do a simple problem by hand and then we'll look at Pine Seat. So I'm going to use again a baseball example because I have lots of baseball data. So I'm going to estimate, for those of you unfamiliar with the sport, the picture throws the ball, the batter is going to try to hit it. And it's very hard to hit a baseball when it's thrown at you at high rate of speed. And so they strike out some of the time. So strike out is when you take three swings at the ball and you miss and then you're done and the next batter comes up. If we're trying to minimize the strikeout rate of a batter, if we're running a baseball team. So we have a player on the Phillies named Nick Castellanos. He plays right field. And back in 2019, he came up to that 22 times to start the season, and he only had one strikeout, which is really low. But of course, 22 play appearance isn't all that much. You'll probably have 600 of those during the year. So what we want to know is what's his true strikeout? What's the underlying truth there? Because you could get that by accident. So the reason we can do this one by hand is a little trick, a little statistical trick called conjugacy. And for certain data, for certain models with certain types of variables, you can do the base formula update without having to do any integration. And in this case you can, because that strikeout data, remember, is binomial, right? It's how many? 22 tries? And he did it once. So that's binomial data. If you put a beta prior on that and we saw the beta prior for the disease thing, right? It's values between zero and one. We're estimating a rate of strikeouts here. We can call that a beta distribution. And it turns out if you combine the two of those, you get a posterior distribution that's also a beta. So that's pretty handy, right? Because you can estimate that and then maybe observe some more data and take the posterior that's a beta distribution and make it the prior the next time and look at some more data. And you can keep turning the Bayesian crank that way. So this is what it looks like. A little bit of math here, not too much. This is what a binomial distribution looks like. So theta is our probability of the event Y or the number of events, right? And if I just look at what I call kernel of this distribution, it's theta to the Y, one minus theta the N minus Y. So the number of successes and the numbers of failures and their associated probabilities. Well, the beta distribution looks like this. And if you squinted those, they look really similar, don't they? Theta to the something, one minus theta to the something. In this case, the variable of interest is theta up there. The variable of interest is Y, but they're the same form. So you can squish those together and reduce it and you end up with another beta distribution. And a beta distribution is parameterised by alpha and beta, essentially the number of prior successes and prior failures. And so it's beta alpha plus the number of observed successes, alpha plus the number of observed successes, and beta plus the number of failures and minus Y. So let's do it. Let's do some bayes. So this is what one of 22 looks like in terms of a likelihood and it implies a strikeout rate that's below 10%, although it could be as high as 30%. And it just kind of happened by accident. We're not sure. Well, let's get a prior for this. How do we get a prior for this? What's our prior? What I did for a prior is I went to my big baseball database and I pulled out all of the batting data over three seasons of playing and I took out all the players with over 200 played appearances. So lots of data and I got their strikeout rates. And in Scifi, there's a big stats library in Scipi and you can just fit distributions, whichever one you want, any distribution you call, in this case beta, because it's a beta distribution fit, it'll give you the parameter estimates and that turns out it's a beta 1035. We've got the formula to do both. And so the black line over there is our prior. So that's the data. The prior is the dark blue line and then the posterior is always going to be somewhere between the prior and the likelihood, right? It's a little bit closer to the prior in this case because there wasn't much data. And now we have kind of reasonable posterior estimate. We didn't have to choose that prior, we could choose a different one. What if we know a little bit more about baseball? Is that oh, Castianos is a right fielder, different players in different positions have different characteristics. So I'm just going to take the right fielders in my data set, more restricted data set to use, and that turns out to be a beta 1547, and that corresponds to the red here. So now we're hedging our bets a little bit more back towards the prior of a higher strikeout rate, because outfielders are big, heavy hitters that strike out a lot, right? Which one is right? We're not sure. But the point is, you can choose your prior based on whatever information you have, and you really want to bring as much information as you can to the table to get the best estimate. The goal here is to get the right answer, not to be statistically pure, right? As long as you tell people what your prior is, as long as you specify it as part of your model, you can argue later about whether or not that was a good choice or not. Okay, that's the idea. That's kind of your base by hand, but your model and your data will quickly outstrip what you're able to do by hand, and you'll have to turn to some software to help you out for bigger models. And here in 2022, we're extremely blessed with a lot of different choices when it comes to doing Bayes. Even just in Python. These are just the ones for doing this in Python. Python and Pi MC Pyro, non Pyro, kind of the Facebook thing, TensorFlow probability, if you're a Google guy and others. I just want to give you a subset here. We'll talk about here about PMC, because I work on that project. So Pioneer is just for fitting Bayesian statistical models using a variety of methods. There are more than one method for doing Bayesian inference. Something called Markov Chain Monte Carlo is primarily what we use, and I'm going to describe that in a little bit of detail later. It includes all those statistical distributions you'll need. That's kind of one of the things that it takes a little bit time to learn to kind of use Beijing methods is learning all the different statistical distributions and when and where they should be used. It leverages a package called RVs, which kind of spun out of Python of Pioneer years ago, for doing lots of output analysis and plotting. It's very extensible. So if you find a probability distribution that we don't have, it's easy to implement your own. You can extend even the algorithms for fitting the models if you're so disposed. And that includes things like nonparametric Bayesian methods, which is another great aspect of Bayes that we don't have time to go into today. And we just released a new version. Prime C Four was released just a couple of weeks ago. We're very excited about that. It has now GPU support. It supports different computational back ends like Jack's and Nimba, a lot more user friendly than before, and generally runs faster. So all the things you like to see in new versions of software. So I encourage you to try it out. We'll use it here in our examples. Okay, I'm going to switch now to Jupiter notebook if I can find my way around my own computer. Feel free to interrupt with questions, by the way. If anybody has them, I can answer them while I'm switching. Yes. The question was, among the different libraries, are there differences between what each of them has to offer? I would say that the differences are mainly in terms of high versus low level. Stan is probably the market leader, if you like, and it's a little bit lower level than PMC, and it's built really just to do Markov chain Monte Carlo really quickly. And it tends to look a little bit more like R than Python. So it depends on kind of the interfaces that you like to use. But most of the main ones, Pyropytorch, they're all going to be used. And some of them have placed emphasis on different methods for fitting Bayesian model. So NumPyro Pyro tend to be more towards the variational inference side, which is for bigger data sets. So that kind of reflects that choice as well. But they all generally do the same thing. They allow you to fit your model at kind of a high level where you don't need to know necessarily very much about the fitting algorithms and then generate results for you. Okay, yes, go ahead. Well, there's not too much to say about the likelihood, right? So a likelihood is really just another probability distribution. The only thing is that, and we'll see this in a second, it's pinned to your observation, so they don't change when you run the model. It's kind of used to inform the rest of the model. So mechanistically in terms of how you specify your model, there's really no difference between your likelihoods and your priors. It's really just that your likelihoods have data associated with them. You can draw new values from that distribution, but they're just normal distributions, binomial distributions, et cetera. Model selection. So there's really three big steps to doing Bayesian inference. One is write your model down, specify all your probability distributions, whichever choice you want to make fit the model, which is what this talk is mostly concerned with. And then the third is model checking. And it's a very important step. And that's where you go in and you see whether your model fits the data. And if you've made a bad choice of prior likelihood, that should show up in kind of bad goodness of fit tests. It's funny that you ask about the likelihood. Most people are worried about the prior. You're worried about the likelihood. That's great because they're equally important. Okay, so I'm going to quickly run through how to fit some models in Pi. And see here we are going to use more, another baseball example to motivate this. Hopefully my data imports will run in finite time here. All right, I'll talk while it's doing that. Makes me nervous that it's running this long just to import. Anyway, so this is the sticky stuff, incident. So last year baseball started cracking down on pitchers putting stuff on their baseballs. And why would they do that? Well, pitchers try to get a good grip on their ball, on the ball, and spin it so that it'll generate some movement, and movement makes it harder to hit. So it's been doing this as far back as baseball goes, which is well over 100 years, putting stuff like pine tar and mixtures of Rosin and sunscreen, anything to make it kind of tacky and sticky. And this caused a lot more strikeouts, which aren't as fun for spectators to watch. And so even though the rule was kind of in the rule books, they started cracking down in kind of the middle of last year, and you'd get a ten game suspension and a fine and things like that. And just to give you an idea, this is why they do it. So this is one of our pictures, aaron Nola throwing a curveball, and you'll see right at the end, it kind of dips it's because he spins the ball really quickly and it makes it really hard to hit. They started cracking down very invasive checks, unbuckling their belts, checking the insides of their gloves. Nobody's very happy about this. What we're going to try to do here is try to estimate whether or not the sticky stuff crackdown had an effect on the spin rates of baseball. So we collect a ton of data. Now major League Baseball does. Every game generates six to seven terabytes of data because they monitor all the players on the field, the ball, the movement of the ball, and it's actually all available for free on the MLB Advanced Media website, the Stackcast website. And so I downloaded all of the data from 2021, the year where they cracked down and looked at their spin rates. So this is what the data looks like. We can plot it. So here's the spin rates by date, the little gap in the middle as they take a break in July to do the All Star Games. There's no games for a little bit under a week, but otherwise continue it. These are all the pitches and the spin rates on each. So on average, there's about 2500 rpm. That's the spin rate of the ball with lots of variation. What we're interested in knowing here is there a change during the season in spin rate due to the crackdown trying to avoid getting suspended. And does that change points correspond to when the crackdown happened, which was June 3, was when the information kind of leaked out that they were going to crack down. So let's build a model for this. What is the model going to look like here? I'm going to extract some data. All I need is the date and the spin rate. So that's what these two lines is doing. And so the model I'm going to use here is a change point model. It's a relatively simple model just to demonstrate how to use PMC. The idea here is that really the simplest possible model is to have a constant mean spin rate across the league, and at some point there would be a change due to this change in policy. So there'd be an early and a late spin rate. So what we statisticians would call a piecewise constant model, two pieces. And so mu one and mu two are the two rates. Tau is going to be our change point. And the tricky part here is that tau is not known. I mean, we have a guess at what it was if our hypothesis is true, but we really don't know what it is. And that's the nice thing in Bayes, is that anything you don't know, you just put probability distribution around it and you estimate it included in your model. And so that's what we're going to do here. So two mus and a tau are variables and then we'll have two likelihoods and these are going to be normally distributed with the two means and then some common sigma, maybe the variance changed afterwards as well. I'm not saying this is a good model, this is just a simple model and somewhat reasonable given what we're trying to do. Okay, so let's step through building this in PMC, really, there are two different types of objects in an Abyssian model, stochastic and deterministic variables. So mu one, mu two, the means and the change point, tau are all random variables. We don't know what they are. Oh, and sigma, the kind of distribution of noise around the mean, so they arise from a physical random process. Sorry, they're random in the sense that we don't know them. So we don't think that the underlying mean necessarily is changing except in a way that we include in our model. But we're using probability distributions to characterize our information state about that variable. So it's an epistemic randomness here, not necessarily one that's stochastic, and it's representing our uncertainty about these values. There are other objects that are deterministic. So in a given year, the rate is completely determined by me and if it's on one side of tower or the other. So the vector of rates for every day of baseball is going to be deterministic given those random variables. Right. So we have stochastic things and deterministic things. All right, so here's the rest of our what choices are we going to give for our probability distributions while our means are going to be normal with mean 2500? Because that was I peaked at the mean there and then a standard deviation of 100, which is a variance of 10,000. So way bigger than we need to do. I'll just say uniform between zero and tea being the last day of the season and then a half normal distribution for cigna it's half normal because variances have to be positive, right? Kind of have negative variance. So we just chop the negative side off of the normal distribution. In PMC, it looks like this. So a couple of things to note. Here the with statement. We use a with statement which in Python is a context manager. And context managers generally are used for things like network connections, file opening and closing. And it kind of manages all that for you. Here we're going to use it to open and close a model and add variables to it. So it's a nice little trick that makes models easy to specify. So we're going to create a model object, which is really just a big container for our variables and how they're related to one another. We're going to call it spin rate model. And then here are two of our prior. So here's mew and mew here has shape of two because there's the early and the late. And then there's the uniform distribution with a lower and an upper bound that I'm using just the first and the last day of the season, which I've encoded as integers from zero to big t. Does that make sense? Notice each distribution has a name, it's got a label, essentially, and it's been added to the model, okay? And in fact, you can't even specify, you can't even use these random variables outside of a model. If I just say I want a normal called x with mean mu zero and sigma ten and I try to run it, I get an error, I get an index error. But if you scroll down to the bottom, it says no model on context stack, which is needed to instantiate distributions. It's got a lot of overhead associated with it that's used for integrating that distribution into the model. And under the hood, oops, I changed the names of these things, so change this to me. Under the hood, all of these variables are things called tensor variables inside the Asara library. Asara is essentially the back end for PMC that does all of the heavy lifting, particularly building the kind of model graph. It is based on Fiano, which is a now extinct deep learning library. And thanks to the efforts of Brandon Willard and others in the Asara team, basically co opted that package of that library and turned it essentially into a probabilistic programming engine that allows us to compile it to different back ends. So you're going to see a lot of mention of Asara that's really just kind of the back end of time. See, all the distributions that you're seeing here are based on our subclasses of a distribution super class, in particular the continuous subclass. And so every distribution really all it needs to have is a method for returning the log probability that's what we use to fit the model. Optionally, a random method for drawing. New random values. It doesn't have to have that. But if you want to sample from your model, you'll want to include that as well. So if you're implementing your own distributions, that's roughly how it looks. So for any of these variables, for example, tau, we can pass it at a value of five and get the log probability back, right? So that's the log probability of day five of the season, essentially similarly random. You use that via the draw function. So we're going to draw five values of Mu. Remember, Mu is the size, too. There's two means. So it's drawing five sets of two in this case, right? If you do want to use our statistical distributions outside of a PMC model, you can do that. There's a class method called disk that just pulls out the statistical distribution. So here I created a normal called x of size 1000, and then I generated a histogram of that. So you can use them outside of PIMC. You've just got to extract it in the appropriate way. PIMC has a ton of distributions. There's the whole list. So chances are what you need is in here. And if it isn't, I've just showed you essentially how to do it. Okay, every distribution has a shape. Remember me as a size, too. So this is a unified distribution, but it's kind of a batch of two of them, right? It's not a multivariate normal, although it could be, but you can have kind of batches of distributions like this. Okay, let's keep going here. So the nice thing about the context manager is that you can add some variables, close it up, crack it open again, add a few more. So we're going to add sigma here because we had one more random variable. This is our observation noise, essentially, and the model is keeping track of what's in it and all of the details. So those are the variables we've added so far. And PMC is doing a lot of stuff under the hood for you here. For example, when you specify a half normal distribution, as I said before, it's constrained to be positive. But when we're doing model estimation and we're doing MCMC sampling, we don't want to have to worry about that lower bound of zero. And so everything under the hood gets transformed to the real line. So in this case, we're going to log transform it so you pass a negative one to this. I don't know why this is taking so long. It's undefined. It's undefined at negative values. And if you look under the hood here, you'll see that the name implies that something's happened sigma underscore log. So it's been log transformed, but you don't have to worry about that. They inverse transform it when you get the results of the model. So you don't have to. It's really just something that's convenient for computation. All right, now deterministic variable. So deterministic variable, again is anything that's completely determined by the value of its parents. So our vector of rates across the season is completely determined by mu and tau. And so this is it here. It's a little switch statement when tau is greater than day end. So it's the first half of the season, it's mew zero, otherwise it's mew one. And notice here that I didn't give this a name like I did with the other things. It was normal mu and then uniform tau. This is kind of an anonymous thing here with no name associated with it because this is sort of an intermediate calculation. I'm not really interested in this vector of copied values, essentially, and so in this sense they'll kind of be thrown away at the end of the model fitting exercise. If you want to specify something as some bad color choices here, if you want to specify something as if you want to make it persist in the model and summarize it at the end, you have to wrap it in a PMC deterministic object and then you give it a name. So now this will exist in the trace that you get out the other end, otherwise you can just leave it anonymously. And of course the shape on this is going to be bigger because we are so there's one for every pitch that's thrown here at 7791. All right, last step, the likelihood. So as I said before, there's really no difference between stochastic and observed variables in the model. One just has data associated with it, the others don't. And so the way that you do that in prime C is you pass the data to the distribution using the observed argument here. So spin rate is the vector of spin rates for every pitch, otherwise it's just a regular normal distribution. Mu is R, so it's our vector of rates. And then sigma is the observation noise and you can always see what your likelihoods are. Oops, haven't run the cell yet. There it is. You can always check what it is looking at observed RVs. And then another step that's useful at the end is to actually look at your model so you can generate a picture of the graph using graphize. This is always good, right? If you're building more complex models, I often do this and I'll have like a node floating around unattached to anything and that would cause problems. So just make sure that your mall is all connected the way you think it should be, okay? And then now we run this and all we have to do is call sample. I anticipate this to take longer than it should because something is weird with my computer today. So this is the magic inference button. This is a phrase coined by Thomas Vicky, who's here in this meeting. He allowed me to say he's trademarked it, but he let me say it. Well, this is the thing, I didn't give it any arguments just to show you how easy it can be. So there are a lot of defaults applied here. And we'll go into that in the kind of the next section. And so it will run this in just a few seconds. Yeah, it should take way less time than this. Sorry about that. It took about 2 seconds the other day. So anyway, see what it says here. Auto assigning nut sampler. We don't know what that is yet. Initializing nuts using something and then multiprocess sampling. It's going to sample four chains and it shows you the random variables that it's estimating. And then it's running about ten times as long as it usually takes. Okay. And then we can generate plots of things. So one we're interested in here is tau. So the mean of tau is 68. Although it's got a weird bimodal thing, it's maybe a little uncertain about where that is. The mean is 68. And if you look at where in the calendar 68 is, it's June 8, which is less than a week after they changed the regulations. So there's a little bit of evidence here of spin rates changing. All right, next I'm going to dig into a little bit more into MCMC. Yes. Please ask a question while I'm oh, yeah, we're definitely interested in how big it is. Let me see. Oh, you can't see this anymore. Mu Naught was 2539 and Mu one was 2442. So about 100 rpm less, which can make the difference. It's a game of fine margins. All right, go for it. Once it's compiled, it's kind of out of pinec's hands. So it compiles to C, which is what Seano gave us before Jax, and then a little bit of numbers at the moment, potentially more than that. And all it's doing is compiling it for the MCMC computation. So all the features aren't really relevant. It's just doing the gradient calculation and all that kind of MCMC steps. All right, so we're going to talk about MCMC. The main challenge in doing fitting behavior models resides in actually calculating the joint posterior distribution over all of the models. Right. So this was the hard piece. Right. Integration is hard. Now, if I was teaching my course here at Vanderbilt, I would spend like two weeks going through from the basic this is kind of ranked in order of complexity, all the different ways you can approximate a posterior distribution from simple but probably inadequate to getting closer and closer to something that's usable. Right. We're going to skip ahead to the workhorse. This is kind of the de facto standard method now for fitting Bayesian model. So MCMC stands for Markov Chain. Monte Carlo. So two pieces to that Monte Carlo and Markov chain. So it's going to simulate a Markov chain. So Monte Carlo is just a fancy word for simulation using a random number generator. And we're going to simulate a Markov chain where the function of interest, our posterior distribution is its unique invariant stationary distribution. Lots of jargon there. First jargon we're going to break here is what a Markov chain is. So Markov chain is something called a stochastic process, which is just jargon for an indexed set of random variables. So it's typically indexed by T because it's often temporal, but it doesn't have to be, and it's Markovian. So the Markovian condition is all it says is that the probability of some variable, its value in the next time step, given its current value, the last value, all the way back to its initial value, is the same as the probability of its next value given just its current value. So essentially the history doesn't matter other than right now, right? These are pretty common. Like, we play Monopoly. That's a mark up chain, right? You landing in Park Place, and the next turn is only determined by the fact that you're, I don't know, on Reading Railway and not that you were in jail three turns ago or anything like that. The current state matters. The past doesn't matter. So it's kind of mild non independence, if you want to think of it that way. It's very hard to estimate Beijing models by drawing independent samples. So what we resort to are dependent samples, and we can get the same sort of result. It can't just be any Markov chain that we simulate here, though. It's got to be a special one. The specialness comes from a property called the Detailed Balance Equation. We want a Markov chain that obeys this property, and this has to do with balancing movement to and from different states. Glossing over lots of complexity here, but I just wanted to give you a little bit of intuition about what's going on here. If it obeys this property, then pi is the limiting distribution of the chain I E, the posterior distribution. So pi of Y, pi of x is going to be our posterior distribution. So anytime you see a new MCMC algorithm pop up, the paper will show that it's reversible. It's kind of the thing that you have to prove before anybody believes you the very simplest MCMC out. So there's no such thing as MCMC as an algorithm. It's a classic algorithm, right? It's like bird. There's no birds called a bird. They've all got species associated with them. Same thing here. So one of the simplest ones is the Metropolis algorithm. And it's really easy. A couple of steps here. That last one should be a four, not a one. I don't know why that happened. So four steps, you initialize all the parameters of your model to arbitrary values, and then you have another distribution that's easy to sample from, and you draw potential new values for that distribution. You evaluate whether it's good or not. And whether it's good or not depends on the ratio of the posterior log posterior distribution under the new values, theta prime over its current values. You don't really know what this posterior distribution is. But you can always remember, every variable has a log P in pine C. You can always calculate the log probability for any value. So you can always do this. You have this acceptance probability. If that probability is higher than that one, you accept it all the time. And if you don't, you accept it probabilistically. You draw a uniform random variable and you kind of flip a coin. So you sometimes accept bad values. It's part of this reversibility thing. You don't just want to find the peak of the distribution. You want to find the whole thing. And so Metropolis sampling kind of looks like this. It's proposing values, sometimes accepting them. I can't remember who I stole this from. It was probably Thomas. Did I stole this from you, Thomas? No. No. Okay, then I did it. This is mine. But notice it's getting stuck here. So this is a complaint problem. This is actually a thousand dimensional multivariate normal. So it's not complex. It's very simple. It's just large. Right. It's big geometry. And so it's kind of getting stuck. It's not doing a very good job here. Interesting. Well, the reason it's getting stuck is it's not very good. Metropolis algorithm is not very good. It's very easy to implement. It's very general, but it doesn't do well for larger models. And the reason is that as models get big, the volume in terms of the space of the parameter space grows exponentially. So that exponential line is the volume, the probability or the density. The probability density function is there. So as you move away from the mode, the volume gets very big and the density doesn't. So most of the probability is actually away from the mode. And so it's really easy to get stuck in areas of low probability. This area of high probability, it's hard to see if there is shade. It's called the typical set. And that's what we want to do, is estimate the typical set. And to do that, we need a better algorithm. And the better algorithm is something called Hamiltonian Monte Carlo. Hamiltonian Monte Carlo is yet another MCMC algorithm. But this one uses more information about your model. Specifically, it uses gradient information. So, Metropolis, we were doing a random walk, right? We were randomly drawing a new value. And as you might sort of intuitively, that's not very efficient, right? Just random values. Are they going to be good or not? Probably not. But if we use information about the gradient, we can get essentially better suggested values and do a better job. And so what this does is it simulates kind of a physical analogy of a particle moving. You think about a skateboarder and a skate park, right? And you kind of push your skateboarder into the park. And what does she do? She kind of rolls around following the topology of the park. Right. Or a marble in a salad bowl or something like that, right. It goes up and down. And so we do this to explore the posterior distribution. And so we add an auxiliary variable to simulate essentially well, position and momentum, or kinetic energy and potential energy, and in kind of discrete time, discretised time. And so it requires the gradient. And so derivatives are easy nowadays on computers. Integration is hard. Derivatives are easy, particularly when you have a package like Fiano or Asara or TensorFlow or Pi Torch, whatever. They all give you these things for free now. So we calculate essentially the changes in position and the changes in momentum and use that to monitor where to go in terms of characterizing the typical set of the distribution. So you take all of these little leapfrog steps. You're kind of following the contours of the posterior distribution. You stop, you take a value, and then you flick the particle in a different random direction. And so the algorithm is even easier than Metropolis sample. A new velocity from a univariate gaussian, ie. Your initial energy perform a certain number of leak frog steps to get a new state because you're simulating it's a continuous system, but you're using discrete steps to approximate it. And then you do an accept reject move because you're doing this approximation at the end and you accept it or not. And Hamiltonian MC looks like this. Boom. And it can even do more complicated things than multivariations. It can be those weird banana shaped distributions with a huge ridge on it or whatever. In fact, that logo, at the very beginning, one of our developers calling Carol did simulating the PMC logo. So it can be super complex stuff. Okay? And then I'll skip right to the punchline. Remember when I ran Sample, it said initializing nuts. Nuts is a particular tuned auto tuning version of Hamiltonian Monte Carlo called the no uturn sampler. And it's no uturn because when you do that, moving around the space, what happens when you go up a hill and you're eating the top of the hill? You roll backwards again, and you want to stop before you do that uturn because you're just recalculating the same values over and over again. So the no uturn sampler is the state of the art, and this is why we use it. So this is the same 10,000 dimensional or 1000 dimensional multivarianormal, just marginal two variables from that. And that's what Metropolis does, another MCMC sampler that nobody uses anymore. And that's what nuts does relative to just independent draws, right? So that's what we want to do, and that's why we use nuts. This few minutes of talk would have taken like a week in my class. So we've glossed over a lot of stuff, but hopefully that gives you a little bit of intuition. Is there a question while I'm moving back over? Yes, you don't. And that's why you want an auto tuning algorithm. That's what nuts does. It optimizes the number of lead fog steps. It depends on the model. Right. Simpler models, you want to do as few as possible. The goal isn't to perfectly follow the path, it's to get a good point. And as you can imagine, the more steps you have to take, the more computationally intensive it is. You want to take as few as possible. And so the nuts algorithm estimates the optimal number for you? Yeah, good question. And I wish I could mirror this thing again. There we go. All right, let's see. 1154. I started out, I have a half an hour still, is that right? Cool. We may even have time for questions, although we're taking questions during it. So any more questions, by the way, since we're on schedule? Yeah, sorry, I didn't understand that. Sorry, can you repeat it again? What is it? It's determined automatically. There's two things. There's two components. So the position is your actual posterior distribution that you're interested in, and the momentum is the auxiliary variable that you're giving with the energy. Your posterior distribution is the position. So it's not using kinetic energy, potential energy. It's using position and momentum. And so the momentum is auxiliary, it gets thrown away and you keep the position. All right, let's dig into this a little bit here. So I'm going to use another sports example, but I'll at least use a sport that most of you are probably familiar with. So six nations. Rugby, the Premier, european Rugby Competition. Those are the only six countries that play rugby. I guess you could argue whether Italy plays it or not, but they participate. Version of this is on our website as one of the examples. So what we're doing is we're taking historical rugby data to try to predict the outcome of the next tournament. It's probably not valid because we're using results from 2014, probably entirely different players on all of these teams by now, but forget that for the moment. And so we're going to take a whole bunch of historical scores and try to predict new scores. I'm going to do that by estimating latent variables, which is what Bayesian inference is good for. And the latent variables will be the strengths, the underlying unobserved strengths of these teams in terms of scoring and preventing points from being scored. Right, so this is all the data, a home score and an away score for every game going back to 2014. So our model here will look a little different. So again, you got to think about our unknown variables and our likelihoods and so forth. Let's in this case, start with the likelihood. Work backwards. So scoring points, they're discrete. You can't score half points, you can't score negative points. So a poison is a reasonable approach here. Poisson distribution is accounting. Distribution has to be positive. The mean and the variance are equal. So as the numbers get larger, the variance gets larger. Reasonable choice for counting things that don't have an upper bound like a binomial does. Okay, so we're going to have two Poissons here, home score and away score, your score and your opponent score. Right. So two Poisson means and then the means in turn will be determined by a log linear equation. The reason it's going to be log linear is because the Poisson mean has to be positive and the way to constrain that is to make it inverse log. Right. So exponential of that. And so there will be two pieces, alpha and delta, attack and defense, which is why I chose those. So for the away team, it'll be the alpha. I think I have that backwards. Yeah, it should be the alpha of the away team and the delta the defense of the home team. Or no, sorry, this is backwards. This is the home. So home is going to be the alpha of the home team and the defense of the away team. And then Ada is a factor for home field advantage. You play better at home because all the fans are yelling for you and you sleep in your own bed at night and stuff like that. So some unknown factor but we're going to do is transform those so that they're centered on the mean. Kind of like we did for the baseball, we centered on the mean. So this is going to be the mean score of a rugby match. And then so that they all add up to zero, these are going to be transformed so they're a little easier to interpret. Okay. And then for the away team it's exactly the same but there's no Ada. And then the alphas and delta are reversed. So it's the alpha of the away team and the delta of the home team. I hope that's clear. Well, they can't be both. Oh, I see, then you could do it, right? You just leave the Ada out of both of them if they're playing like in the United States or something, right? I don't think that happens in six nations though, so we don't have to worry about it. All right, I'm just going to explain this panda factorizes, my favorite pandas function. It essentially does what Scikitlearn's label encoder does. So it takes a list of team names here and it turns them into integers from zero to the number of teams. It's going to turn it into zero through five in this case. Right. So one of the newer features of Pioneer, it's actually not a pimp C four thing, it was available in later versions of PMC. Three is that you can specify the size of your variables in your model using labeled arrays rather than just integers. So I'm going to take my list of teams and use them as the coordinates. And then when I print out the results and whatever, it will have the name of England, Wales, et cetera on them. And so I put that in a dictionary and I pass that to the model as the coordinates argument. All right. And another thing that I'm going to do that's relatively new in Pi MC is I'm going to make my data mutable. What does that mean? Mutable means I can change my data. Why would I want to change my data? Well, I'll fit my model using as we do in machine learning. We fit our model using fitting data, and then I want to use this model, right? I want to say, okay, now that I fit my model, I want to see what Wales in England do next week when they play in Wales. And so I'll change the data and then without having to rerun the model, it'll generate predictions for me. So I have this thing called mutable data. So I'm going to change the home and away teams later on. So what it's really doing is adding it as a node in the graph as symbolic note, and then when it's symbolic, I can feed whatever values that I want into them. Okay, so next step prior. What are my priorities going to be? They're all going to be normal, right? We're dealing with not rates, but actually factors that have been transformed to the real line. So my home field advantage ada is just going to be normal with me. One sigma, one. They seem small, but this is on the inverse log scale, so they get blown up when you call exponential on them. So that's actually quite diffuse. And then mew. So exponential three is about 17 or 20 or something like that. So that's a reasonable approximate mean for what a population average score is going to be. And then alphas and deltas. And now notice I didn't call shape, now I call dims. And I just passed the list of the dictionary of teams to that. And so it'll make sure that it's got one for each team on attack and defense. Go ahead. It shouldn't really be science, so you could give them uninformative prior. There's a thing in here called a flat prior that's just not even a real distribution. You really want to add as much domain expertise as you can. The only domain I don't know anything about rugby. I work in baseball, so only information I know is what the scores kind of tend to be like. You can't get scores of a million and they don't tend to be like one or zero. So I actually just peeked at the population and put the population meaning here. And that's a valid thing to do here because one way of thinking about your prior is actually the underlying population that you're trying to infer, like that baseball player, right? I use kind of the population of all baseball players. That's a great prior, because you know what that is. You don't know where your player is in that. But mine has got to be in there somewhere. Or most likely you want to allow for it to possibly be the greatest baseball player ever. But that's not likely to be the case unless there's lots of data to support it. What I do here, variable name, ada already exists. Well, no, it doesn't. Oh, I hit it twice, didn't I? I must have hit it twice. Okay, deterministic. So here's my transformation. So at me. That's Asaratensor. So you try to use Asara functions here when you can. It's got most of what NumPy has, all the important ones, anyway. So there's alpha star and delta star, and then my model for home and away, just like I wrote above. Mu plus Ada plus alpha plus delta for home, same for away. Except no ada likelihoods. Remember, the only difference here is we have observed and so we're passing rugby data home score the first one, rugby data, away score for the second one. And always like to strip out the Pandas stuff and use the values. In principle, it works with Pandas, but sometimes I get unexpected things going on. And again, we're going to look at our model again, and it's bigger, it's more complex. Now, even this simple model, and just for convenience, I'm going to wrap this all, the whole model in a function, so I can just create a new model whenever I need to, because I'm going to intentionally do bad things with this model. Sorry the colors are so bad here, hold on, let me go back to my is that easier to see? Yeah. So remember, I ran sample with no arguments before, and everybody freaked out because there's a whole bunch of defaults here, and they're usually reasonable, but it's always good to know what you can change. And there's even more than this, but these are the ones you'd change more often, right? So you want to know how many draws do you need? Thousands of good. That's how many draws after the model is finished. Tuning, usually only for Hamiltonian Monte Carlo, usually 1000 or so is good enough, assuming the models converged. Tuning for small models, thousands. Also pretty good. I like to do two by default, whatever. We'll see in a minute. How you know whether you need more or not. A lot of these things get auto assigned. So the step means the MCMC algorithm, the step method. And so it knows how to do a step of a particular algorithm. It'll assign them automatically unless you want to override it. I'll show you how to do that in just a second. You can specify initial values if you want to, how many chains you want to sample from all of these things. I'll go through some of these here in a second. So assigning a sampling algorithm. Usually you want to let it do this itself, unless you have a good reason to override it, but you can. So I can instantiate a Metropolis algorithm and pass it and it'll do Metropolis sampling. But we saw that Metropolis algorithm isn't very good, so you wouldn't want to. Do this most of the time under the hood. Every distributions has a very algorithm has a competence associated with it that says, I'm good at continuous variables, I'm good at discrete variables, don't use me here, things like that. And so it goes through when it auto assigns and determines the best, the ones with the highest confidence score and assigns it. So in general, continuous variables get the nut sampler. But those variables have to be continuous, right? Because it uses gradient information. You can't take gradients of discrete things. So if it's discrete, you get Metropolis. Sorry. And then binary gets a binary metropolis where you can only assign zeros and ones. Okay, that's generally it. But there are other samplers that are available to you. Initial values you can pass your own. Otherwise it will use the mean or the mode or the median, depending on which distribution you can sample to get it again. Most of the time here, unless you have a tricky model, that it can be helpful to specify an initial value, you can override it by passing them here. So let's do a simple thing here. 500 draws with no tuning, using Metropolis with an initializing A to be negative one. I'm just doing this. Oh, God. Any questions while it's sampling this really simple model for a short period of time? Yes. To monitor the data, update the model with new data. Well, you can predict with the model with new data, but if you've got new data that you want to use to fit the model, you've got to refit it. If you want to be a really awesome bayesian, you really want to take your posteriors from the last one and use it in the prior. But that's harder to do than you think, right? Because you'll get sometimes a weird posterior distribution. And how do you turn that into a prior? If it's done a good job, all your posters are going to be multivariate normal, and you can just make the multivariate normal. So you could do that, but you kind of manually have to do that. All right, it finished finally. Actually, it only took one set, so it's having a problem compiling, I guess, for some reason. Okay. And if we look at Ada here, it doesn't look very good. So this is a trace plot from RVs. Those are supposed to be histograms over there. They don't look very good. And then this is the trace. So this is the sequence of 500 values. I didn't do any tuning on here, and I used a crappy algorithm. So this is what we expect, right? So they all started at negative one and they popped up, but they're not really in the same region. And there are four chains here, so there's four different lines, so it doesn't look very good. All right, let's do it for real now. And hopefully this doesn't take very long. So I'm going to do this is really the default. I could have just called sample without any arguments. Here we go. So it's auto signing, blah, blah, blah. Yes, this is a bad demo, sorry. It usually runs a lot faster. I'm going to run everything below here just to speed things up, okay? And look, you got some warnings here. The acceptance probability does not match the target, blah, blah, blah. There are five divergences after tuning blah, blah, blah. Okay, we'll see what those are later. So we try to warn you when things go wrong as best we can. Pardon me. Should be close to 0.8 and it's not. So it's supposed to be 0.8%. Don't worry, I'll get to it. But here we see estimates of alpha. This is the attacking strength. And so England's the best as they are at everything, right? So there are four chains and they all look the same, which should inspire confidence, right? They all started and they all ended up at the same place in each of these. So they look pretty good. The countries are where we expect them to be and then defense. So it's flipped around now. So Italy is the worst and now Ireland's the best. So these comport to what we expect, mu and Ada. So this is a posterior plot. So this gives you an HDI, which is a probability interval. So there's a 94% probability, the true values between those two points and then its distribution. So eight is 00:24, Mu is 2.8. Remember, these are on the log scale. So if we convert these, 2.8 is about 16 points, 2.8 plus 2.24 is about 21 points. So it's a home field advantage of four and a half points, which I guess is reasonable in American football. It's like three points, I think, for home field most of the time. And it probably varies by country too, right, depending on where you are. These samples are stored, used in a modified X array, back end, so labeled arrays, which is pretty nice. And that's a class inside of the RVs package. Again, so you can see our posterior distribution has, for each chain, all of the draws for each of the teams. They're labeled with names, so they're really easy to use. So you got your posterior in there, your log likelihood, sample stats for things like, well, we'll look at what some of these are in a second and even a copy of the data in here. So inference data is one of the relatively recent innovations of PMC. That makes things a lot easier. Parallel sampling. You really want to do more than one chain. For one thing, all our computers here have probably at least four cores, if not eight. And this is all embarrassingly parallel. Every chain can be sampled independently of one another, so there's no reason not to. So by default, it takes the half of the number of cores you have available. So I got four automatically without telling you anything in this example, I picked six chains and using four cores. So it did it partially parallel. And so you end up with N times C chains here. So this was for alpha. So there's six of them. So it's six x 100 x six is the size. And you typically want to combine all of these together at the end. So kind of the last topic. We have about 15 minutes left is an important piece. We referred to it earlier on. Model checking is very important. You can get reasonable looking results, and it just could be complete nonsense. And it's hard to tell sometimes whether it is. There's two components to model checking convergence, diagnostics, at least for MCMC. Did MCMC actually work? Right? MCMC can work. And your model may or may not be good, but you at least want to make sure MCMC works before you proceed. And given that MCMC worked okay, what is the fit of the model like? Does it make any sense? Does it correspond to what the data suggests? All of those things? So how do we do this stuff? Conversions, diagnostics. The easiest thing is to gaze at the trace. Just look at it. That's what a trace looks like. So this is one run. I call this bad trace because I did bad things with it. What did I do? I only tuned it for ten iterations, which was not very good. And you can see this is for two chains. They don't correspond very well. This kind of gets stuck in places from time to time, so it needs to be tuned a little bit more. Tune one. You want to look for this sort of fuzzy caterpillar kind of thing going on over here. Right, remember what we need, what we're trying to look for here, a stationary, homogeneous Markov chain. Means the chain isn't moving around. The mean is staying relatively constant. The variance is staying relatively constant, that sort of thing. This is getting a little closer, but it's not even quite there yet. Remember we saw a warning about divergences. In fact, this is exactly the warning. I pasted it down here. There were five divergences. After tuning increased target accept or reparametrize, we try to suggest what to do to resolve problems where we can. A divergence is when this member, Hamilton Monte Carlo, was doing the leapfrog steps and then sampling leapfrog steps and the leapfrog steps. We're trying to trace this continuous Hamiltonian system as best it can. Well, sometimes it does a poor job of it. So here are a couple where it's done a good job. So the true distribution here is this cone of red and yellow. And so here, it's wandering around nicely there. It's not doing too bad. And this one, it got launched into space over here, and it diverts way off in a low probability region. That would be a divergence. You don't need to freak out if this happens, unless it happens a lot. So we only had five, in this case, over 1000 samples. I wouldn't worry about that. Setting a higher target acceptance is one way of resolving this. So what that will do, it will increase the number of baby steps that it has to take so that the acceptance rate. So zero eight is the default. It's a relatively good default. If you're having trouble boost it to zero nine or zero 95 or 99 even. It'll take longer to run, but you'll get fewer divergences, even better reparametrize. Your model. It's sometimes a sign that your model is a little bit wonky. So I won't go into what that means here. But reparametrization often works too. With these gradient based samplers, there's some information that you can get from those because of the remember we're tracking energy here, momentum and so forth. And based on the estimates of the energy at each step relative to the overall energy of the sampling, you can estimate how well the algorithm has been exploring the posterior distribution. And it's based on this notion of a bayesian fraction of missing information that, again, I will gloss over a little bit. You want it to be close to one, and so I don't know, those look close to one to me. There's no real good cut off. But the easiest way to deal with this is to do an energy plot. So this plot is the energy transitions at each step versus the overall marginal energy after fitting. And you want them to sort of overlap like they do here. Bad ones look like that where the marginal energy distribution is way wider. And so it's saying that the transition energy just isn't covering everything. So that's a bad sign. I sometimes get like bimodal ones where there's two blue bumps on either side. That's a bad sign, too. One other that's worth noting is something called used to be called the Gelman Ruben statistic. Now it's called our hat, or the potential scale reduction. This is essentially an analysis of variance of the traces. If you know what an analysis of variance is, it's comparing the variation within a particular chain to the variance between chains. And if they're the same, they should be similar to one another. If your chains are in different spots, the between variants is wider than the within. Right? So that's kind of what an innovative does. So as it converges on the overall variance, the Rhett will get close to one. So all you have to know is you want your Rhett close to one. So if you call RVs Summary, you get a nice summary table and your Rhett will be over here. And look, they're all really close to one. So we're happy about these. So convergence diagnostics done. We've got yes, the burn in. Yes. Apparently we're not supposed to call it burn in anymore. We're supposed to call it tuning. So it's tuned for $1,000. Again, I usually pick 2000 except for a really simple model back in the bad days of Metropolis, you'd do like $100,000, so it's much better now. But, you'll know, if you need more, you'll have signs of lack of convergence. In some cases, you just need to run it longer, that's all. So goodness of fit. Last topic, goodness of fit. This is essentially a calibration check for your model. And the best way to do this sort of thing is to compare the output from your model to the data that was used to fit the model. And really, if you draw a whole bunch of samples from your model, your data should just look like another draw from that model. Otherwise you can't really claim that this is the data generating model, which is what you want. The nice thing is that Bayes gives you an automatic way of doing this. This is called the posterior predictive distribution, and this is it here. So F of theta given y is just our posterior, remember? And then P of y tilde given theta, p tilde are new values, things we're going to sample from the data given theta. And that data has already been conditioned on why this is the predictive distribution. This is just our likelihood. We're drawing values from our likelihood. Likelihood is y given theta as well. So we're including the residual uncertainty in our parameters and the stochastic sampling uncertainty of just drawing from a binomial distribution. And then you integrate over theta, which sounds intimidating because integration is hard, but you get this for free already because you fit your model using MCMC, so we don't have to worry about it. So we can get this. All we have to do is after fitting our data or our model sample from our livelihoods, and so we have a method for doing that in Pioneer. It's called sample posterior predictive. And this argument here will just stick it back onto the trace. Sorry for the functional programmers out there, but it's just going to add it to that trace silently without telling you. And so we'll have a thing in here called posterior predictive now, and we can do plots. RV has plot PPC, and these are cumulative distributions of draws from the model, which are blue, and their mean is a dashed yellow line and the data is a black line, which is kind of hard to see here, but that's a good sign too, right? If you can't even see your data, that's a good sign. If it was out here going over here, we'd be worried, right? So these are good checks. The observed data looks like draws from the posterior distribution. Got ten minutes left. I lied. I had one more topic, making predictions, because I mentioned this before, right? I can swap in new data. So now that I fit my model, I want to know when Wales comes to visit England, who's going to win? And I can do that. So I opened my model again, just as if I was going to add more variables to it. Except now I call set data and then all I'm adding here are indices. I'm pulling out the index corresponding to England and Wales, adding one to home team, one to away team, and then I call sample posterior predictive, and then I pull out the values. And those are posterior predictive distributions of scores of England playing Wales, presumably at Wembley or something. Right. So England is going to win, apparently. And what you can do is pair wise, subtract or not subtract, but do the calculate a boolean for when the home is greater than the away and take their mean. And so this implies an 82% probability of England winning. But this is kind of a dumb simple model, the one that's on our website uses a hierarchical model. I don't think that will help this here. We should probably change that. That only helps when you have imbalanced data sets, really. But the idea is the same. You can build a way more complex model than this, right? Theta could be more nuanced than that. It could have more information about the teams. We could use better historical information somehow, waiting, more recent events, stuff like that. So you can always make more. In fact, I was doing like an individual based one here for the pitching model, where he had different pictures, because every picture is going to respond to the sticky stuff incident differently. Like this guy might have been cheating here, that sort of thing. So you can always build more complex models. But that wasn't kind of the point here. You want to learn more, there's more here, including a brand new course from PMC Labs, thomas Vicky's company. They've got a course called Intuitive Bays, where you can learn a little bit more about all this stuff. So if you're interested, track Thomas down. He looks like that guy back there anyway. I have five minutes left, I think. So, a couple of questions. Do bookmakers use PMC? I know one that does, but I'm not allowed to gamble on sports because I work for Major League Baseball. Do they use Bayesian methods? A lot of them probably use machine learning, I would imagine, but I'm guessing I only know one that I know for sure does. But I would imagine that you would think this would be a good approach for those sorts of problems. Yeah, sorry. The question was, do bookmakers use PMC? And I know at least one that does. Any other questions? Yeah, the question was, if you have different samples for different variables, how does that work? Do you do them at the same time or sequentially? PMC will. So HMC happens all at once. You walk over the entire multidimensional space. If I had a discrete variable in there, I would have a mixture of Metropolis and HMC and it would do one than the other. I'm not sure which order would do it so it would condition on one. And it gets very sketchy when you do these mixed ones. I don't think there's been a lot of research into what that does to the convergence, because it will change the geometry when you kind of modify the model outside of the HMC. But sometimes it works. I've done it a few times and I've gotten good results. So from a Pragmatic sense, it seems to work. I would imagine if you had a lot of discrete variables, it would be harder, but usually it's like one variable that has to be discrete and it tends to work. But that's just based on my experience. Maybe two more. Yes. One at the back. Yeah, it wouldn't be a strength of this. I would see how it goes. If there was a lot of data, for example. But yeah, you might then look at tree based models or something. One more question. Yeah. Is there a cluster of CPUs? Not out of the box. It uses GPUs now under Jax Thomas. Can you answer that about clusters? So the answer was version 4.1 will help you more with that. It's all right. I think we're done. But feel free to talk to Thomas or I sometime during the conference. Hope to see you out there. Thanks a lot.