Presented by Women Who Code BlockDataPy Tech Summit
Summit Playlist: https://youtube.com/playlist?list=PLVcEZG2JPVhfJ2vkMZZfkAJeChW6dkbgw
Speaker: Anna Astori
Have you experienced issues with your Python code while wrangling new data sets and crunching big numbers? Have you wondered what are the places where you could make your Python scripts run faster? Let's talk! In this session, learn options to speed up your data processing operations. This talk can be especially valuable for those who are just starting in data science and Python development. Nonetheless, everyone interested in efficient Python is welcome!
💻 WWCode Digital Events: https://www.womenwhocode.com/events
🔎 Job Board: https://www.womenwhocode.com/jobs
💌 Make a Donation: https://www.womenwhocode.com/donate
So, my name is Anna Astori. I'm a software developer. At the moment I'm actually going back to academia and moving countries, but previously I worked at Jackson and Amazon. Today I'm going to be talking about optimizing Python code and providing some tips and tricks for faster code. This talk is actually based on a Medium post that I published some time ago, and I'll provide the link for it at the end. Here I'll just dive right into faster Python code.
So what is fast code? How do we define it? And, more importantly, how can we measure it? There are three aspects against which we can measure code: time, CPU consumption, and memory consumption. For each one of those there's a great deal of tools available in Python. For instance, for time you can use the time or timeit modules. Quick caveat here: measuring time is actually a little tricky. For instance, your operating system can interfere while you're doing it, so you have to be aware of how things happen and how to take care of that in order to avoid getting skewed results. There's a great read that I'm linking here called "Falsehoods Programmers Believe About Time". You can also simply google the title and find it; it's been republished on multiple platforms on the web. For the couple of examples that I'm going to use later in the presentation, I'll measure time; it's still accepted as one of the general benchmarks. For CPU measurements, there is the very popular cProfile module, and for memory, memory_profiler. As a matter of fact, I also have a Medium post about profiling Python, and I'm attaching a link here below on the page. So feel free to check it out for a little more detail about those packages: how to use them, what else they provide, and, for example, the difference between the time and timeit modules.
So what are the things that you can do?
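The measurement tools named above can be sketched roughly as follows. This is only an illustration under assumptions: the `sum_of_squares` workload and the repeat counts are made up for the example and do not come from the talk. The third-party memory_profiler package is not shown because it has to be installed separately.

```python
import cProfile
import io
import pstats
import time
import timeit


def sum_of_squares(limit):
    """Hypothetical workload, used only to have something to measure."""
    return sum(i * i for i in range(limit))


# Wall-clock timing of a single run with time.perf_counter().
# A single run is sensitive to whatever else the operating system is doing.
start = time.perf_counter()
sum_of_squares(100_000)
elapsed = time.perf_counter() - start
print(f"single run: {elapsed:.4f} s")

# timeit repeats the call many times; taking the minimum of several
# repeats is a common way to reduce noise from other processes.
timings = timeit.repeat(lambda: sum_of_squares(100_000), number=10, repeat=3)
print(f"best of 3 (10 calls each): {min(timings):.4f} s")

# cProfile reports how many calls were made and where the time was spent.
profiler = cProfile.Profile()
profiler.enable()
sum_of_squares(100_000)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The absolute numbers will differ from machine to machine, so they are best read as relative comparisons between two versions of the same code.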
When you start a project, you're working with lots and lots of data, and all of a sudden you realize your code is hitting a wall and it's just taking forever to run. What can you do? Luckily, there are a number of things you can do, starting from very straightforward tweaks to your code all the way to more generalized and sophisticated approaches that I'll talk a little more about later. So I'll start with the simple things.
The first suggestion that comes to mind is to look out for places where you can replace for loops with list comprehensions. What do I mean by this, and what value do they bring? Let's take an example. Here I have a function that squares integers with a for loop. All it does is create the output list, iterate over each integer within the limit, append its square to the output list, and finally return the list. Here you can also see sample code showing how to call the timeit function within your Python code and print the result to standard output. On my machine the output was about 9.3 seconds.
Can we do better? Yes, we can, with a list comprehension. The top screenshot is a new function which achieves exactly the same thing using a list comprehension. In the middle screenshot you can see my code calling timeit again and printing out the result, and this time it's slightly over 6 seconds, which, in relative terms, is a great improvement. Of course, if I were working with really large data input, the benefit would be even more noticeable. That shows the benefit of list comprehensions in Python; they're a great tool.
Can we do maybe even better? Actually, yes: it might be a good idea to try to avoid unnecessary for loops and even list comprehensions altogether.
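The slides are not reproduced in this transcript, so here is a minimal sketch of what the for-loop and list-comprehension comparison described above might look like. The function names, the `LIMIT` value, and the number of repetitions are assumptions; the timings it prints will not match the 9.3 and 6 second figures quoted in the talk.

```python
import timeit


def square_integers_loop(limit):
    """Build the list of squares with an explicit for loop."""
    output = []
    for i in range(limit):
        output.append(i * i)
    return output


def square_integers_comprehension(limit):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(limit)]


LIMIT = 100_000  # assumed input size, chosen so the sketch runs quickly

loop_time = timeit.timeit(lambda: square_integers_loop(LIMIT), number=20)
comp_time = timeit.timeit(lambda: square_integers_comprehension(LIMIT), number=20)

print(f"for loop:           {loop_time:.3f} s")
print(f"list comprehension: {comp_time:.3f} s")
```

Both functions return the same list; only the way the list is built differs, which is what makes the timing comparison meaningful.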
One of the patterns that I come across sometimes in real-life applications, but actually a lot more often in LeetCode-style interview problems, is where you know that you're going to be iterating over, say, a list of items, doing some computation for each one of them, and you want to store the results in a predefined list or array. A nice trick that you can use in Python is the multiplication shortcut. Here's my example. Say I have a rows list, which is a list of actual items, and I can simply multiply the default zero value by the length of this input rows list; my list of per-row results would then look something like the output below. That avoids loops and even list comprehensions altogether. There are some caveats to this approach as well. I'm not going to go into a lot of detail here because of the scope of a lightning talk, but I'll be happy to talk more about it if anyone has further questions. I love this topic.
What else can we do? Another area that I would suggest exploring is the Python built-ins. Especially if you're just starting out in coding or data science, it sometimes feels a little tempting to write this kind of logic yourself, because you can definitely do it. But I would recommend against it. The Python built-in functions that I'm talking about include very simple things like sum and max, and also map, reduce, filter, et cetera. The great thing about them is that they're oftentimes implemented in C under the hood and optimized for various scenarios, so they'll do things a lot more efficiently than code that I might have written myself.
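As a rough illustration of the two ideas above, the sketch below preallocates a result list with the multiplication shortcut and then uses a few built-ins in place of hand-written loops. The `rows` data and the per-row computation are hypothetical, and the caveat shown is one well-known pitfall, not necessarily the one the speaker had in mind.

```python
from functools import reduce

rows = ["a", "bb", "ccc", "dddd"]  # hypothetical input items

# Create one slot per row without a loop or a comprehension.
row_lengths = [0] * len(rows)
for index, row in enumerate(rows):
    row_lengths[index] = len(row)
print(row_lengths)  # [1, 2, 3, 4]

# One well-known caveat: multiplying a list that holds a mutable object
# repeats references to the same object rather than copying it.
shared = [[]] * 3
shared[0].append("x")
print(shared)  # [['x'], ['x'], ['x']]

# Built-ins instead of hand-written loops.
total = sum(row_lengths)
longest = max(row_lengths)
even_lengths = list(filter(lambda n: n % 2 == 0, row_lengths))
doubled = list(map(lambda n: n * 2, row_lengths))
# In Python 3, reduce is imported from functools rather than being a built-in.
product = reduce(lambda a, b: a * b, row_lengths)
print(total, longest, even_lengths, doubled, product)
```

The multiplication shortcut is safe for immutable defaults such as `0` or `None`; for nested lists a comprehension such as `[[] for _ in rows]` avoids the shared-reference problem.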
For instance, another example that I would like to mention here is something that also shows up a lot in real-life applications, if you're working with text data like I had to do, and also in LeetCode-style problems where you maybe need to reduce an input string, or find all the non-repeating characters, or something like that. You might be tempted to just start building the output string character by character, but that might not be a very efficient way to do it. It might be a lot more efficient to append the characters to an intermediate list and then call the string join method on it. Why? Because when you're building a string character by character, under the hood Python is going to store every intermediate version of the string and thus take up a lot of memory. Imagine you're working with really large data input: that will take up a lot of memory and will really slow down your program.
The last one I wanted to talk about here and give an example for is operator.itemgetter, which is very handy in the following kind of example. Say I have a list of tuples of the first name and last name of users and I want to sort them, but by last name rather than first name, which would be the default. In that case I can use operator.itemgetter, highlighted here in magenta, as the key. You can see the sample output that it would produce in that case, and it does it very quickly and very nicely.
What else can we do? Another thing to think about is this: if you're working with a lot of objects and their attributes, and you're referencing those attributes quite a few times in one piece of code, it might be a good idea to assign them to local variables. A very typical example: imagine I have a rectangle object that has height and width attributes.
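Stepping back to the string-building and `operator.itemgetter` advice above, a minimal sketch might look like this. The deduplication task, the function names, and the sample names in `users` are assumptions made for illustration; the slides with the original code are not part of this transcript.

```python
from operator import itemgetter


def unique_chars_concat(text):
    """Collect first occurrences by growing a string one character at a time."""
    seen = set()
    result = ""
    for ch in text:
        if ch not in seen:
            seen.add(ch)
            result += ch  # may create a new string object on each step
    return result


def unique_chars_join(text):
    """Same result, but collect the characters in a list and join once."""
    seen = set()
    chars = []
    for ch in text:
        if ch not in seen:
            seen.add(ch)
            chars.append(ch)
    return "".join(chars)


print(unique_chars_join("mississippi"))  # misp

# Sort (first name, last name) tuples by the element at index 1.
users = [("Grace", "Hopper"), ("Ada", "Lovelace"), ("Barbara", "Liskov")]
by_last_name = sorted(users, key=itemgetter(1))
print(by_last_name)
```

Plain `sorted(users)` would order the tuples by first name, because tuples compare element by element starting from index 0.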
So here I'm going to assign the rectangle's height and width to the rectangle_height and rectangle_width variables and then reuse them later in my code to compute the surface area, and then maybe the perimeter further down, or maybe something else. Why could this be a little more efficient? It's probably not going to shave off a lot of execution time, but, to put it super concisely, if you're referencing the object's attributes directly, Python has to resolve the self object first and then its attributes, whereas if you're using local variables it skips one step. So it can come in handy.
Along the lines of thinking carefully about your objects and data structures, my general suggestion is always to explore and learn: know the data structures and objects that you're using well, especially ones that are new to you, and choose your approach wisely. What kinds of things do I mean here? For example, say you're working with Python dictionaries and you want to check for a key in a dictionary, which is actually a very interesting question with a lot of sides to it. Python 3 allows you two versions of the syntax: you can use the top one, if key in dict, or the bottom one, if key in dict.keys(), and there are others. But it turns out that in Python 3 the top one is going to be a little faster, so why not use that? As usual, it all depends on the general logic of your code, but if you're not using the actual keys for anything else later in your code, just go for the faster, simpler version.
Another dilemma that I see pop up sometimes, and that actually trips up even experienced developers, is the choice between the list and set data structures. When you know that you're going to retrieve an element from the container, which one do you go for? We know that retrieving an element from a set is super efficient, so let's make good use of that.
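The local-variable and dictionary-key points above can be sketched as follows. The `Rectangle` class, its method names, and the `ages` dictionary are hypothetical. With only a couple of uses per attribute the difference is likely to be negligible; the idea matters more when an attribute is read many times, for example inside a loop.

```python
import timeit


class Rectangle:
    """Hypothetical class with the two attributes mentioned in the talk."""

    def __init__(self, height, width):
        self.height = height
        self.width = width

    def describe_direct(self):
        # Every use goes through an attribute lookup on self.
        surface = self.height * self.width
        perimeter = 2 * (self.height + self.width)
        return surface, perimeter

    def describe_local(self):
        # Look each attribute up once, then reuse the local names.
        rectangle_height = self.height
        rectangle_width = self.width
        surface = rectangle_height * rectangle_width
        perimeter = 2 * (rectangle_height + rectangle_width)
        return surface, perimeter


rect = Rectangle(3, 4)
print(rect.describe_direct(), rect.describe_local())  # (12, 14) (12, 14)

# Two ways to test for a key; both give the same answer.
ages = {"ada": 36, "grace": 85}
print("ada" in ages, "ada" in ages.keys())  # True True

in_dict = timeit.timeit(lambda: "ada" in ages, number=200_000)
in_keys = timeit.timeit(lambda: "ada" in ages.keys(), number=200_000)
print(f"key in dict:        {in_dict:.4f} s")
print(f"key in dict.keys(): {in_keys:.4f} s")
```

The `.keys()` form does an extra method call to create a view object before the membership test, which is where any small difference in the timings would come from.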
On the other hand, watch out if you're actually getting a list as input, which is very frequent in Python. You might be tempted, again, to turn this list into a set first and then retrieve an element from there. However, if you're only retrieving one element, what will happen? Python has to iterate over every element of the list first for those elements to be added to the set. So you'll actually create extra overhead that you don't need, and you won't really get any benefit from retrieving the element from a set. So, once again, know your data structures and your objects well and choose your approach wisely.
What happens if you've replaced your loops with list comprehensions everywhere you could and local variables don't help anymore? What can you do? That might be the time to pull out the big guns. It sounds like you're working with a pretty big application with real-life data at this point, and if you're not yet familiar with some of the libraries that are out there for Python, like NumPy and pandas, they should definitely be on your radar. NumPy uses its own implementation of arrays that are more compact and faster than Python lists. I'm not going to go into detail on how they are implemented; that's open information that you can find on the web, and it's really interesting. Here I'm just adding some information about the benefits and efficiency of NumPy, taken from the official NumPy documentation. Similarly, pandas is super popular. It uses mechanisms like vectorization, which also allow you to avoid loops altogether and make things run a lot faster. However, pandas is super interesting because some of its default data structures are actually not even designed to be super efficient.
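Going back to the list-versus-set advice at the start of this part of the talk, the trade-off can be sketched with membership checks. The `items` list, the `target` value, and the repetition counts are assumptions chosen so that the sketch runs quickly.

```python
import timeit

items = list(range(100_000))  # hypothetical input that arrives as a list
target = 99_999

# One lookup: scan the list directly, or build a set first and then look up.
# Building the set already visits every element, so converting for a single
# lookup tends not to pay off.
one_lookup_list = timeit.timeit(lambda: target in items, number=20)
one_lookup_set = timeit.timeit(lambda: target in set(items), number=20)
print(f"one lookup, list:            {one_lookup_list:.4f} s")
print(f"one lookup, convert to set:  {one_lookup_set:.4f} s")

# Many lookups: build the set once, then reuse it for every check.
targets = range(0, 100_000, 1_000)  # 100 lookups
items_set = set(items)
many_list = timeit.timeit(lambda: [t in items for t in targets], number=1)
many_set = timeit.timeit(lambda: [t in items_set for t in targets], number=1)
print(f"100 lookups, list:           {many_list:.4f} s")
print(f"100 lookups, prebuilt set:   {many_set:.4f} s")
```

The conversion cost is paid once, so a set becomes worthwhile when the same collection is queried repeatedly.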
But pandas is such a mature library by now that there are many ways to scale it to large data sets and fine-tune its efficiency. That's even covered in its official documentation, and I'm providing a link here, so definitely check that out.
Lastly, one more thing that I wanted to mention, especially if you have a long-running, really large application: what might really be beneficial is just-in-time, or JIT, compilers. JIT compilers collect information about the data types that your code is using and, thanks to this information, create very specific machine code that helps your program run a lot faster. On the other hand, something to keep in mind is that you probably won't see much benefit from a JIT compiler if you're applying it to a one-off short script. But if it's a big program, it could be your friend. There are two really popular ones. One is PyPy, which is a Python implementation of Python. Here again I'm providing some basic information about its benefits and some benchmarking numbers mentioned in its official documentation. The other one is Numba, which is a compiler in its own right that also claims to be at least a few times faster than the standard CPython implementation.
That's all I have for my suggestions for today. Here at the bottom of the page I'm again linking the URL for my original post on Medium about optimizing your Python code. Feel free to check it out. Also feel free to follow me on Twitter or connect on LinkedIn. Thank you so much for your time.