Can simple open source tools compete with the music recommendations provided by Spotify and other big names?
This talk will look at how the open source world can stay relevant in a world where music listening has become dependent on commercial streaming services and users expect an element of recommendation. Expect to see small-tech solutions for music recommendations based around GNOME's Tracker search engine and the open, community-powered database MusicBrainz.
PUBLICATION PERMISSIONS: Original video was published with the Creative Commons Attribution license (reuse allowed). Link: https://www.youtube.com/watch?v=7APk4GqIlD0
In this talk the emphasis is on simple, so I can promise you there's going to be no machine learning, because interesting as it is, it's not very simple. Hopefully this will be something anyone can understand. I am going to go into details and do some demos, but the idea is that it's a recommendation engine that we can all understand.
So you've probably noticed that recommendation engines are everywhere these days. Think about your day today. Have you done something in response to a recommendation algorithm? Maybe you spoke to someone on a social network because an algorithm showed you their post. Maybe you bought something because you were recommended it online. Maybe you got into an argument because an algorithm showed you something you disagree with. Maybe you listened to a song.
My interest in music means that I use different music services, and I've found the recommendations sometimes very good, sometimes not, sometimes predictable. I thought, given that recommenders are taking over the world, maybe I can make a recommendation algorithm, and then at least I'll know what's going on with it.
But I'm an operating systems developer, so I have to start from scratch. I thought, what goes into a recommendation algorithm? Well, data goes in, then there's a process, and then some more data, which is hopefully more interesting, comes out of the other side. In the case of a music recommender, on the right-hand side we get a list of songs. On the input side, it's more complicated. You might take a music collection. A music collection is really a list of songs, but one where the order isn't important. You might take social data like other people's playlists, but those are also lists of songs. Maybe you'll take the history of what someone listened to in the past, which is also a list of songs, ordered by time.
And maybe you look at analysis of the pieces of music, which might say how fast it is, how dark it is, whether it's metal or ska or gospel. We could keep that data in a playlist as well, if we can store it alongside the songs. So we can already simplify this back down: a playlist goes in, some kind of processing happens, and a playlist comes out. As long as we can handle arbitrary playlists, we can have this simple model.
That's nice, because this, to me, looks like a shell pipeline. As an operating systems developer, I spend a lot of time writing shell pipelines. Data scientists are more likely to use something like IPython or Jupyter. The same tools work for modeling pipelines, whatever you want to use.
So how do we represent a playlist? I've been using this simple JSON format where each line is a JSON object and a set of attributes describes the song. And I've created some Python code wrapped up in a command-line tool called Calliope, however you want to pronounce it, and that allows me to create some pipelines. So here's a very simple, I wouldn't call it a recommender yet, but a playlist generation pipeline. It asks the Tracker search engine on my computer to show all of the songs I have, then it shuffles the list and chooses five, and then it uses another command-line tool to select just the title and the creator. There we go. We've created a playlist already.
I want to talk a bit more about the format. Firstly, the fact that it's a list of JSON objects is deliberate. That makes it more usable in a traditional command line, because you can start the tool on the left and it prints an object, then another one, then another one, and the next tool in the pipeline can read these as they arrive. So you don't have to generate a huge list of a million songs, wait for that, and then shuffle it. In the case of shuffle you do have to wait, but other processing can be done a line at a time, which is a kind of lazy evaluation.
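The one-object-per-line format and the shuffle, pick, select pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Calliope's actual code: the function names, the example songs and the `duration` values are all made up for the sketch, and an in-memory stream stands in for the Tracker query.

```python
import io
import json
import random


def read_playlist(stream):
    """Lazily yield one playlist item (a dict) per non-empty JSON line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def shuffle(items, seed=None):
    """Shuffling needs the whole list, so this stage cannot be lazy."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items


def head(items, count):
    """Pass through only the first `count` items."""
    for index, item in enumerate(items):
        if index >= count:
            break
        yield item


def select_fields(items, fields):
    """Keep only the named attributes of each item, one item at a time."""
    for item in items:
        yield {key: item[key] for key in fields if key in item}


def write_playlist(items, stream):
    """Write each item as one JSON object per line."""
    for item in items:
        stream.write(json.dumps(item) + "\n")


# Made-up songs standing in for a real music collection.
example = io.StringIO(
    '{"title": "Song A", "creator": "Artist 1", "duration": 180000}\n'
    '{"title": "Song B", "creator": "Artist 2", "duration": 240000}\n'
    '{"title": "Song C", "creator": "Artist 1", "duration": 200000}\n'
)
output = io.StringIO()
stages = head(shuffle(read_playlist(example), seed=1), 2)
write_playlist(select_fields(stages, ["title", "creator"]), output)
print(output.getvalue(), end="")
```

Every stage except `shuffle` is a generator, so it handles one item at a time and never needs the full list in memory, which is the lazy behaviour the line-based format is meant to allow.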
Apart from that, the playlist format is not special. It's an adaptation of XSPF, pronounced "spiff", which has existed since 2006, I think. It's a standard, pretty common playlist format. It's based on XML, which I have not used because it's no longer 2006, but pretty much everything else is the same. One really cool thing about XSPF is that it's described as portable. In the case of a playlist, being portable means that it's not tied to a specific streaming or storage medium. If you have a list of files on your computer, that's not portable: you can't play it on a different computer. So the way XSPF works is that you record details about the song, like title and artist, and then later you resolve it to some actual content.
So I want to dive in already with some demos to show how this works and how easy it is to play with it. These are all live demos. Things can break. Please bear with me, but let's see what happens. So I've got a playlist here, but I can't play the playlist because it's just a JSON object. So what if we resolve the playlist against the Spotify Web API? This is using an API key that I've got saved locally. Out of the pipeline comes another JSON object. I'm going to run this into the jq tool, so it comes out in color. There we go. Much better. So now you can see we've got the title, the creator, and we've got a load of metadata from Spotify. We should have a location field, but now we don't. This changed recently, but we have enough information that we can generate a Spotify URL.
What about a different service? We could resolve against MusicBrainz with the same input. You would expect that to work. The command is different, so I'm going to annotate it with some information from MusicBrainz. Again, it's a live demo. The results are cached, but obviously my cache is empty, so if I run this a second time, it will be fast. Okay. And now we have some information from MusicBrainz. I could also resolve this against files on my local computer.
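The idea of resolving a portable playlist can be sketched as a lookup from song details to a location. The catalog below is purely hypothetical (an in-memory dict with invented file paths), and `resolve` is an illustrative name rather than a Calliope function. A real resolver would query Spotify, MusicBrainz or a local library instead.

```python
import json

# Hypothetical local catalog, keyed by (creator, title) in lower case.
CATALOG = {
    ("artist 1", "song a"): "file:///music/artist-1/song-a.flac",
    ("artist 2", "song b"): "file:///music/artist-2/song-b.flac",
}


def resolve(items, catalog):
    """Add a `location` to each item whose creator and title are found."""
    for item in items:
        key = (item.get("creator", "").lower(), item.get("title", "").lower())
        resolved = dict(item)  # copy, so the caller's item is left untouched
        if key in catalog:
            resolved["location"] = catalog[key]
        yield resolved


playlist = [
    {"creator": "Artist 1", "title": "Song A"},
    {"creator": "Artist 3", "title": "Song Z"},  # not in the catalog
]
resolved = list(resolve(playlist, CATALOG))
for item in resolved:
    print(json.dumps(item))
```

Items that cannot be resolved pass through unchanged, so the same portable playlist can later be tried against a different source.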
There's another command which interfaces with the beets music library, which is a cool Python tool to organize your music library. So I can ask beets for a list of all the tracks. It's quite slow because there are some improvements needed in beets. In fact, it's too slow. Let me show you albums instead. Okay.
Hopefully by now you're seeing what this tool Calliope really is: it's a toolkit for reading data from disparate places and writing it out in a standardized format, which is the first thing we need in order to do some kind of interesting recommendations. Most of these tools are simple. They are a few hundred lines of Python, or maybe 1,000 lines in the more complicated ones, and I intend for them to stay that way. Now, this is a nice way to generate playlists, but I also want to show you that you can export to a number of different playlist formats. So now we've generated a resolved Spotify playlist. I can export it in XSPF format, for example, and I can import that into a music player application or put it on my iPod or anything.
So back to the slides. Let's talk a little bit about Spotify. Spotify is probably the number one in music recommendations as of today. By their own numbers, they have thousands of engineers working on the product. They have thousands of data pipelines running constantly. They have 50 million tracks, a quarter of which nobody has ever listened to. And they are capturing half a trillion events every day. So that's a lot of events per user. I think they have about 250 million users, right? So think how many events they're capturing per user. They know a lot about you, more than just a list of songs that you listen to. What can they do with that data? Quite a lot. Can we compete with Spotify in the open source world? That would be difficult, right? Because we don't have 16,000 engineers at our disposal, but we also don't need to. Anything is better than nothing, right?
So an open source music recommender, even a very basic one, is already kind of fun. It's already helping us learn about music recommendations, and it's already helping us not depend on Spotify. So if we're going to recommend songs, we need more than just randomness. We need some data. I'm going to talk about some open places and some not open places that we can get data from.
Firstly, MusicBrainz, which is a website. It's kind of a catalog, so you can go on MusicBrainz and look up any music release, any song, and it will tell you way more information than you ever wanted. It can tell you when and where something was recorded, what record label it was on, who else has released music on that record label, who produced the albums, when they were born. Huge amounts of data. The data is all open data. It's released under an open license. The website itself is open. It's maintained by volunteers, so it's a really cool data source to use, and obviously Calliope interfaces with that and lets you make use of the MusicBrainz data. So we have lots of ideas for fun things, like recommending artists on the same record label or from the same place.
ListenBrainz is from the same team as MusicBrainz, and it's a way to record your listening history, so you can hook it up to your music player. You can install a browser add-on called, what's it called, Web Scrobbler, and that will record things you play in your web browser to ListenBrainz. You can look at other people's listens, and it gives you some kinds of analysis about what you've been listening to. So start using that. It's really fun. And in future it's going to mean you can get better music recommendations.
Of course, Spotify. If you use Spotify, they record your listening habits as well, but they won't give them back to you. Even if you go through the long process of requesting all your data, which they have to provide legally, they still only give you one year of your listening history.
They won't give you everything unless you specifically ask for it. So if you're not already, start using ListenBrainz. It's more fun.
The website in the bottom left is Last.fm, which is actually an older website for recording what you listen to. It's not an open website. It's owned by CBS Interactive, I think. Calliope can also interface with that. I have been recording the music I listen to for years, and I used to do this with Last.fm. Last.fm has an API as well, with a lot of tags created by users. So there's an interesting data source. You can look up an artist and see what millions of users have labeled it as, and we can use that for recommendations.
Finally, Spotify actually has a really powerful API, and you can access Spotify's own data. My last image, in the bottom right, is an example of the Spotify analysis you can get for a song, like how danceable it is, what key it's in, whether it's got lots of speech, whether it sounds acoustic. Now, I don't think you can query 5 million songs using this API, but you can probably query 20 or 30 songs for free with no problem. So there's a lot of interesting data we can use.
I'm going to show a couple more demos of things that you can do with the listening history. Now, I said to use ListenBrainz. I'm a bad example here because I'm still using Last.fm, so I'm going to be showing the Last.fm history command. Calliope doesn't yet support ListenBrainz, but it would be a fun thing to contribute if you want to join in the project. Let me show you what the Last.fm history command can do. So if I give it my username and I say... that's not the command. That's the command. So I say, give me a list of the scrobbles, the last five songs I listened to, and there we go: there are the last five songs I listened to, in the form of a playlist. Let's mine this data a bit more, so I can ask what artists I've listened to. I wonder if I've got this one recorded in my backlog. Yes.
So I'm asking it for all the artists I've listened to in the last six months. There's an SQL database in the back end here, which I'm not going to show you, but in the background this is scraping the Last.fm data and putting it in an SQL database so that we can query it locally. And now I'm saying, query everything I've listened to in the last six months and only return things which I listened to ten times or more. And then I'm going to select just the creator field, because that's the only one we care about. So here's a list of all the artists I discovered in the last six months, and I must have liked them because I listened to them ten times or more. So now you know about my taste in music. It's quite obscure. But now we could perhaps make a playlist of these artists. I'm not going to show you how.
What am I going to show you? Here's another interesting thing we can do. Kind of the inverse: I can say, show me all the tracks which I didn't play in the last year. This is going to be a lot of tracks, I think. Let's see how many it is. I'm going to count them using word count: 34,000 tracks which I haven't listened to in the last year, but which I listened to before that. How can we make a playlist? Let's randomly pick five songs. Okay, so there's an interesting playlist: five songs that I haven't listened to in the last year.
Is there anything else we can do? The last thing I want to talk about before we break a bit for questions is the select command. So shuffle is fine, but there's nothing really too smart going on there. We're taking the data, randomly shuffling it and spitting it out. That's already a lot better than nothing. But let's do something a bit more advanced. So the select command uses a simple type of algorithm called a local search.
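As a rough illustration of what a local search for a target duration could look like, here is a minimal hill-climbing sketch in Python. It is not the select command's implementation: the function names, the made-up pool, the use of seconds for `duration`, and the accept-only-improvements rule are all assumptions chosen to keep the example short.

```python
import random


def total_duration(playlist):
    """Total playing time of a playlist, in seconds."""
    return sum(song["duration"] for song in playlist)


def local_search(pool, target, tolerance, max_steps=10_000, seed=None):
    """Hill climbing: propose adding or removing one song per step and
    keep the change only if it brings the total closer to `target`."""
    rng = random.Random(seed)
    playlist = []
    for _ in range(max_steps):
        error = abs(total_duration(playlist) - target)
        if error <= tolerance:
            break  # the duration constraint is satisfied
        candidate = list(playlist)
        unused = [song for song in pool if song not in playlist]
        if unused and (not candidate or rng.random() < 0.5):
            candidate.append(rng.choice(unused))
        elif candidate:
            candidate.remove(rng.choice(candidate))
        if abs(total_duration(candidate) - target) < error:
            playlist = candidate
    return playlist


# Made-up pool: 20 songs lasting between 150 and 180 seconds.
pool = [{"title": f"Song {i}", "duration": 150 + 10 * (i % 4)} for i in range(20)]
result = local_search(pool, target=30 * 60, tolerance=90, seed=42)
print(len(result), "songs,", total_duration(result), "seconds")
```

A search that only accepts improvements can get stuck when no single add or remove helps, which is one reason practical local search methods also allow some worsening moves. The pool and tolerance here are chosen so that this cannot happen.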
I can feed in this data, randomize it, and then I can say, give me a playlist with a duration of 60 minutes. Because it's a pipeline, I need to put a dash to tell each command to read from the previous command in the pipeline. Let's see what this comes up with. Actually, 60 minutes is quite long. Let's have a 30-minute playlist so that hopefully it fits on the screen. Hopefully this doesn't take too long. In fact, this isn't going to work, because we haven't annotated the songs with duration information. That's possible, but it's outside the scope of this talk. What I would do next would be to resolve the files somehow, either against my local music collection or against MusicBrainz, to find out how long they are. Then I would have the duration field, and I could actually select songs based on duration.
But because time is short, I'm going to move on to the last slide, which is a paper I read, and this is where I got the idea for the select command. It's from 2008 and you can read it online for free. The link is in the slides for this talk. It uses what's called a local search algorithm to recommend playlists. Local search is quite a simple algorithm in the sense that it's what a person might do. If you had to choose from a pool of 1,000 songs and you had some constraints, you would start by picking one, then pick another, and eventually you've got too many, so you throw some of them away, and then you pick some more until your playlist fits the constraints. There's more to it than that, but that's the fundamental idea. This paper is a really interesting read. They demonstrate recommending music and generating playlists by defining a series of constraints. So here's an example from the paper of one of the tests they did in their research, and there's a list of different constraints.
Now, this is an academic paper, so the right-hand side is quite mathematical, but on the left-hand side, in the description column, you can see the first constraint is that all the songs should be different. The second constraint is that they should be released between 1980 and 2001. The third constraint is that 20% should be Stevie Wonder, always a good choice, and so on. A local search algorithm can take a collection of songs and find, not the best result, but a result which satisfies those constraints. And the select command is an implementation of that. Now, it's not a complete implementation, and it is very simple, but this is how I really think we're going to get some engaging and useful open source music recommendations.
So just to summarize: recommendation engines are a thing. They're here to stay. Kids now are growing up watching YouTube and using recommenders every day, and they're going to become more and more effective. The Calliope project aims to make simple, fun music recommendations. It's a project you can hack on and use right now. You can pip install it and run it from the command line. It's full of bugs, so please open issues and merge requests with whatever you find. It's something I work on in my spare time, and I don't have time to polish it.
And one final point: I think the design of Calliope is much more important than the code. I think the model of simple, self-contained tools which communicate with a well-defined format can work for any type of recommendations, not just music. It can work for any programming language just as much as Python. And with that, I'm interested to see if there are any questions. I'll leave the screen sharing, but that's the end of the talk.
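The three constraints read out from the paper's example (all songs different, released between 1980 and 2001, 20% by one artist) can be written as penalty functions that a local search would try to drive to zero. This is a sketch under assumptions, not the paper's formulation or Calliope's code: the penalty definitions, the `year` field and the placeholder songs are all invented for illustration.

```python
def all_different(playlist):
    """Penalty: number of repeated (creator, title) pairs."""
    keys = [(song["creator"], song["title"]) for song in playlist]
    return len(keys) - len(set(keys))


def released_between(first_year, last_year):
    """Build a penalty counting songs released outside the given years."""
    def penalty(playlist):
        return sum(
            1 for song in playlist
            if not first_year <= song["year"] <= last_year
        )
    return penalty


def share_by_creator(creator, share):
    """Build a penalty for how far the creator's share is from `share`."""
    def penalty(playlist):
        if not playlist:
            return share
        count = sum(1 for song in playlist if song["creator"] == creator)
        return abs(count / len(playlist) - share)
    return penalty


def cost(playlist, constraints):
    """Total penalty; zero means every constraint is satisfied."""
    return sum(constraint(playlist) for constraint in constraints)


constraints = [
    all_different,
    released_between(1980, 2001),
    share_by_creator("Stevie Wonder", 0.2),
]

# Placeholder songs; titles and years are invented, not real releases.
playlist = [
    {"creator": "Stevie Wonder", "title": "Song 1", "year": 1980},
    {"creator": "Artist A", "title": "Song 2", "year": 1985},
    {"creator": "Artist B", "title": "Song 3", "year": 1990},
    {"creator": "Artist C", "title": "Song 4", "year": 1995},
    {"creator": "Artist D", "title": "Song 5", "year": 2005},
]
print(cost(playlist, constraints))
```

Here only the last song breaks a constraint (its year is outside the range), so a search that swaps single songs in and out has a clear next move to try.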