Video details

The Magic of Music Matching

JavaScript
05.25.2020
English

Roy van Rijn

Recorded at GOTO Oslo 2020. Arthur C. Clarke once said: "Any sufficiently advanced technology is indistinguishable from magic." The first time I used Shazam (the music matching app) it felt just like that: pure magic. The app briefly listens with the microphone and tells you which song is playing. As a programmer I generally have a pretty good understanding of what happens behind the scenes; in this case, however, I was absolutely gobsmacked. After a weekend of studying, reading scientific papers, experimenting and coding, I had a working Shazam clone written in Java. During this talk I'll reveal what I've learned and which algorithms and tricks are being used.

Transcript

So hi, everyone. I'm going to talk about the magic of music matching, and not for the first time today, because I'm screen recording this. And if my face looks a bit angry, it's because this is not the first attempt. First of all, my neighbor decided to cut tiles outside, which messed up the audio completely. The second time, my laptop ran out of hard drive space. So I really hate recording this webcast. I do love to be on the cruise with you guys; sadly, it was canceled and I was unable to travel. Stupid coronavirus. But please, it's not your fault. I just hate recording, I just hate talking to a screen, and this is another attempt and I really hope this goes well, because I'm not looking forward to doing this presentation a fourth time. So bear with me. My name is Roy van Rijn. I'm a Java Champion and I'm the founder and leader of the Rotterdam Java User Group. I work for a company called OpenValue and I'm running the Rotterdam branch. The best one, obviously. No, just kidding. And you can follow me on Twitter, just go to this Twitter handle, or read my blog on my website. So, why do you people program? For me, it was to play games. The best way to get games when I first had a PC was to get a book like this. These books were at my local library and I could just go there, and they had pages filled with source code. You couldn't just copy-paste the source code; you actually had to type everything into the terminal, without line numbers, without an IDE, without anything helping you, no typos allowed. And when everything was done, you could run the interpreter and you could play a game. And this made me understand how it all works. Well, not all of it, but it made me understand how programs work, what kind of stuff goes into them. Whenever I made a typo somewhere, it would show me. So this, for me, demystified a little bit what software actually does and what you can do with it. But for me, software still has these magical moments. One of these magical moments was when I first got a computer: everything was text and console based, you just had 80 characters and that was it. And then it went to images. Suddenly you had logos and photos, you actually had a UI, you could see stuff. This was magical. I had no idea how you would program something that would draw an image. Another thing was the first time I navigated a 3D world. I had no idea how you could possibly program something like Wolfenstein or Quake, where you can just walk around, look around. Mind-blowing. Or the first time my computer spoke to me: early games had these bleeps, then I got a sound card and they had songs, and then suddenly it said stuff to me in a human voice. I couldn't understand how a computer was able to do that. Another magical thing was OCR. You gave it a picture, and from that it magically deduced that there are some characters in it and it would turn them into text that you could edit and copy-paste. And I knew how to program, but I had no idea how you would make an OCR program, or voice recognition, for example. Sure, early voice recognition applications were complete and utter crap; only in the last couple of years did they become kind of usable. But I had no idea how you would make something like that. Another thing was Google Earth, the first time I saw my own house in a satellite image. You could move around, you could zoom out and see the entire Earth. It was magical.
So some of these moments I'll never forget. And the same thing happened when I first used, or heard about, Shazam. So, the story. This all happened in the summer of 2010, somewhere in my home town, which we are currently looking at. I still live there, or actually, I live there again now. I was having a beer with my friends, and one of my friends walked up to me and said: hey Roy, look at this app. He pulled out his phone, he starts Shazam, and he holds his phone up in the air. The phone listens for a while and then it tells me the name of the song, and my mind is completely blown. How does it do that? How does it listen, deduce which song it could be, and tell me which one it is? Mind-blowing. I had no idea. So the next morning, when I finally woke up with a bit of a hangover, I thought: I have to know how this works. That was the first thing I decided. I have the whole weekend, I have nothing to do, so my goal for this weekend is: I have to know how Shazam works. So I do what every good developer does: I Google "how do I listen to a microphone with Java?" And it turns out this is actually pretty easy to do. There is an audio system in Java which you can just access, and it gives you access to the microphone. So if you want to code this, the first thing you need to do is define an audio format. In this audio format we put a couple of things: a certain sample rate, a sample size in bits; it's a microphone, so we have a single channel; the data is signed and it's in big-endian format. And this audio format works for almost all microphones. With this audio format we can define something that's called a DataLine.Info for a TargetDataLine. And then, most importantly, we access the audio system and we call getLine. This gives us a line we can open and start, and it has a small buffer. Everything that comes into your microphone is put into this small buffer, and we can read it from this TargetDataLine. We need to do this quickly, because if we don't do it quickly enough, the small internal buffer will overflow and we will lose information from the microphone. So in a small separate thread I'm defining a small byte buffer, I'm defining a huge output stream which will eventually contain all our microphone data, and I'm starting a small infinite loop, or close to it. This loop is constantly reading from the TargetDataLine and putting everything into the ByteArrayOutputStream. So it's just quickly flushing everything into the big buffer that will eventually store all your microphone data. And this is enough. This is enough in Java to just listen to the microphone in the background and store all the data. So, a couple of hours into my Saturday, I'm like: yeah, I'm almost halfway there. I mean, I've already got an app, it's starting, it's listening to the microphone, I'm almost there, I just need to do the matching part. But then I began to think: what's actually in this data? What does a microphone give us? So I decided to print out the data that's in this byte array, and I saw something like this. And I'm like: wait a minute, what is this? What am I looking at? What does a microphone do, what kind of data does it give me? I had no idea. So I opened up Excel, or OpenOffice, and I decided to plot it in a graph. And then I noticed something: yeah, this looks familiar. If you've ever used an audio editor, this is what audio looks like.
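As a recap of that capture code in one place, here is a minimal sketch using the Java Sound API. It is not the original code from the talk; the 44.1 kHz sample rate, the 16-bit sample size and the 1 KB read buffer are assumptions.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import java.io.ByteArrayOutputStream;

public class MicrophoneCapture {

    // Assumed settings: 44.1 kHz, 16-bit samples, mono, signed, big-endian.
    private static final AudioFormat FORMAT = new AudioFormat(44100f, 16, 1, true, true);

    private volatile boolean running = true;
    private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();

    public void start() throws LineUnavailableException {
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, FORMAT);
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
        line.open(FORMAT);
        line.start();

        // Read from the small internal buffer quickly, on a separate thread,
        // so it never overflows, and flush everything into one big byte array.
        Thread reader = new Thread(() -> {
            byte[] buffer = new byte[1024];
            while (running) {
                int count = line.read(buffer, 0, buffer.length);
                if (count > 0) {
                    recorded.write(buffer, 0, count);
                }
            }
            line.stop();
            line.close();
        });
        reader.start();
    }

    public void stop() { running = false; }

    public byte[] data() { return recorded.toByteArray(); }
}
```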
So when I was silent, the line was almost flat, and when I made a sound, like clapping, the waves started to appear. So yeah, this is audio. But then I realized it was time for some biology and physics, because what does this all mean? What is sound? What do these numbers mean? Basically, sound is this: when I clap my hands, the air molecules get pushed away, and this wave propagates through all the other air molecules until it hits your eardrum. Your eardrum moves with this wave, with this pressure, and your brain interprets that as sound. Turns out the microphone is exactly the same. It also has a small membrane in it, which vibrates when there is pressure. So now we can also understand what the sample rate and the sample size do. The sample rate is just: how often does the microphone record its position, is the membrane pushed this way or that way? And the sample size in bits says something about the precision: how precise is the measurement of the thing that's being moved? So we can do two things to increase the quality of what we are recording: we can do more measurements, increase the sample rate, or we can improve the accuracy of each measurement. These are all things you can play with when you record sound, basically anywhere, not just in Java. Turns out, though, these positions are kind of useless. Because, sure, our ear detects these waves, and it detects where the membrane in our ear is, but that's not what matters, because our head hears something different. Our head, our brain, detects frequencies. So, for example, if you have a single wave, a single frequency, this is a low frequency; our head hears just this one frequency. Our head does not hear something moving up and down, but what we record is a signal going up and down. If we have a higher frequency, the waves get shorter; this is a higher frequency. And it gets even more complicated if we start to combine two waves. For example, this low frequency and this high frequency combined do something very weird to the wave, but all our brain interprets is just two distinct frequencies. Turns out, though, you can turn the upper image into the lower image, and you can do this using something called the Fourier transformation. You've probably heard about the Fourier transformation, but in the coming slides I'm actually going to explain how this Fourier transformation kind of works. So Stuart Riffle, he blogged on an old website, which is no longer online, I couldn't find any copy of it, about how the Fourier transformation works. He explained it in an excellent way, and I'm just going to repeat what he blogged on his website. Imagine you have a certain recording. So this wave is coming in and we're plotting it. Basically what's happening, and I'm going to use my little remote control as a prop here, is that the waves go up and down and we are drawing it like this: we are moving our imaginary stick along the x axis. That's how we draw this wave. But with the Fourier transformation, we are not going to do this, we are going to do this: we are going to rotate our stick at a certain frequency, and that frequency is the important part here. And we still draw our signal. So what happens if we take this audio fragment and we draw it while rotating the axis? You get a pattern like this: all kinds of circles, and these circles are around the center point, because we are rotating around the center point.
And, yeah, that's what happens. But now let's look at the same graph, but now we're rotating at three hertz, a different speed. And this happens: now we get ellipses and we're being pulled off center. This is important, because now let's calculate the average. The average is no longer in the middle; we are getting pulled off center. And why does this happen? Well, for example, if you have a single wave going at a certain frequency and we are rotating at that same frequency, you'll get a peak that's always on the same side. So if you rotate and you record the average, the further the average is being pulled off center, the more this signal is actually present in the recording. So if we take this same recording and we remove the three hertz signal entirely, this happens: now it's all centered around the middle again, because the three hertz frequency is no longer in the original sample. Imagine you do this for all the frequencies there are, and you measure how far the average gets pulled off center. It turns out this is all the Fourier transformation does. So this is the mathematical formula, X_k = (1/N) · Σ_n x_n · e^(−2πi·k·n/N), and you can actually see what we just described in it. X_k is the energy we want at a particular frequency k. We have to spin (that's the e to the minus i) our signal (which is x_n) around a circle (the 2π) at a particular frequency (the k), and average a bunch of points along that path (the sum over n). So it's basically the same thing we just said, written as a mathematical formula. An excellent explanation of the Fourier transformation. That being said, though, now I kind of understand how it works, but I'm still too dumb to implement this myself. Luckily there's Commons Math, so we don't have to: we can just call Commons Math and it does the Fourier transformation for us.
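As a sketch of both ideas, here is a naive "spin the signal around a circle and measure how far the average is pulled off center" version of that formula, next to the Apache Commons Math call that does the real (fast) Fourier transform. The class name and the assumption that the window length is a power of two are mine, not from the talk.

```java
import org.apache.commons.math3.complex.Complex;
import org.apache.commons.math3.transform.DftNormalization;
import org.apache.commons.math3.transform.FastFourierTransformer;
import org.apache.commons.math3.transform.TransformType;

public class FrequencyAnalysis {

    // Naive version of the intuition: rotate the signal around a circle at the
    // given frequency (in cycles per window) and measure how far the average
    // point is pulled away from the center. The further it is pulled, the more
    // of that frequency is present in the signal.
    static double energyAt(double[] window, double frequency) {
        double re = 0;
        double im = 0;
        for (int n = 0; n < window.length; n++) {
            double angle = -2 * Math.PI * frequency * n / window.length;
            re += window[n] * Math.cos(angle);
            im += window[n] * Math.sin(angle);
        }
        // Average of all rotated points; its distance from the origin is the energy.
        return Math.hypot(re / window.length, im / window.length);
    }

    // The same thing for all frequencies at once, using Commons Math.
    // The input length must be a power of two (for example 4096 samples).
    static Complex[] fft(double[] window) {
        FastFourierTransformer transformer =
                new FastFourierTransformer(DftNormalization.STANDARD);
        return transformer.transform(window, TransformType.FORWARD);
    }
}
```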
There is one huge cost, though, because if we do the Fourier transformation, we lose time entirely. If you have a three or four minute recording and you do the Fourier transformation on this three minute recording, you lose everything. You get the frequencies, but you get all the frequencies that are present in the three minute recording at the same time. It's like the entire band playing all the notes on all the instruments at the same time, for three minutes, and doing that for every note that's in the song. That's not very useful, because that just makes a lot of noise. There are a lot of frequencies, it's pretty much noise, and you cannot say which song it is. We need something more: we need time. We need to know when a particular note is being played and when the drum is being hit. So late in the day, I'm getting a bit depressed. I'm looking out of the window and I'm like: yeah, I had these waves in the time domain, now I have the frequencies, and I've lost track of time completely. How do I combine these? And then it suddenly hit me, because looking out of the window, the window is actually the solution. We can do a trick that's called windowing. We take the entire song and we just look at the first part, and on this first part we do the Fourier transformation. Then we shift the window and we do the Fourier transformation again, and then we shift the window and do the Fourier transformation again. Now, we do lose a little bit of time information, because we don't know where exactly inside the window certain frequencies are being played, but we do have a window in which it happens. So how do you program this? Well, first I take everything out of the byte array, then I cut it into fixed-size pieces, and then for each piece I do the Fourier transformation. Later I updated it a little bit, because you can do even better: you can program it with a sliding window. Instead of moving an entire slice at a time, you just slide it a little bit, do the Fourier transformation, slide it a little bit again, do the Fourier transformation, slide it a little bit again, do the Fourier transformation. If you do it this way, you get even more frequency information, because you can use a larger window, and you have more time information, because you're slowly sliding instead of moving the entire thing at once.
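As a sketch, the fixed-window version of that could look something like this in Java. The 4096-sample window size is an assumption (the FFT wants a power of two), and it reuses the fft helper and the 16-bit big-endian format from the earlier sketches.

```java
import org.apache.commons.math3.complex.Complex;

public class Spectrogram {

    static final int WINDOW_SIZE = 4096; // assumed window length, must be a power of two

    // Turn the raw 16-bit signed big-endian mono bytes into sample values:
    // every pair of bytes is one measurement of the membrane's position.
    static double[] toSamples(byte[] raw) {
        double[] samples = new double[raw.length / 2];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = (short) ((raw[2 * i] << 8) | (raw[2 * i + 1] & 0xFF));
        }
        return samples;
    }

    // Cut the recording into fixed windows and Fourier-transform each one.
    // Every row of the result is one slice of time with its frequencies;
    // a sliding-window version would simply advance by less than WINDOW_SIZE.
    static Complex[][] spectrogram(byte[] rawAudio) {
        double[] samples = toSamples(rawAudio);
        int slices = samples.length / WINDOW_SIZE;
        Complex[][] result = new Complex[slices][];
        for (int s = 0; s < slices; s++) {
            double[] window = new double[WINDOW_SIZE];
            System.arraycopy(samples, s * WINDOW_SIZE, window, 0, WINDOW_SIZE);
            result[s] = FrequencyAnalysis.fft(window); // helper from the previous sketch
        }
        return result;
    }
}
```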
Quick question: does anyone know what this is? Well, you could shout it out and put up your hands, but this is a screen recording, so I'll just spoil it. It also says it in the picture here: it's a spectrum analyzer. It turns out what we've just created in Java is a spectrum analyzer. If you take these windows, these small pieces of transformed data, we have the frequencies; if we plot them with the low frequencies at the bottom and the high frequencies at the top, and just put each slice after the other as an image, you get something that looks like this. This is actually a recording of Aphex Twin. Aphex Twin is a DJ and he makes music, or well, "music"; this song is not really music, it's some, I don't know, fancy noise. But what he likes to do is put spectral images inside his music. So this YouTube user used a very advanced spectrum analyzer to listen to the song. Let's just listen to it. Yeah, that's something. Not sure if it's music, but you can clearly see there's a human head inside this song. And for me, this was the perfect test case. I knew this existed, so I looked it up on YouTube, and sure enough, I programmed this. And this is the same thing as we had before: I listen to the microphone, and in the image at the bottom the lower frequencies are at the bottom, the high frequencies at the top, and the brighter the blue, the more present that signal is. And as you can see, it kind of resembles the image above. The spectrum analyzer they used at the top is much, much better, my Java code is pretty crappy, but you can see the lines and you can kind of see the outline of the face. And I'm pretty sure that if I used the right logarithmic scale and programmed it a little bit better, I could get pretty close to the image they had. But this told me my code was working, ish. So, with the hangover and an entire day of learning new stuff, by the time it was a quarter past ten it was time for a recap, and then I was going to bed. Sound is just pressure and vibrations. Microphones and ears are pretty damn similar. Recording sound in Java is easy. There is a problem with the time and the frequency domain: you can transform between the two with the Fourier transformation, and we can use windowing to get some time information back. And if you combine everything, you end up with a spectrum analyzer. It's not Shazam yet, but we did build a spectrum analyzer on the first day. So the next morning I continued. It's my Sunday; I didn't have kids back then, I had the entire weekend to myself. What we need to do now is fingerprinting. I used the spectrum analyzer of the day before and I used the song from Queen called Under Pressure, and I recorded it, and this is what it looks like. Then I recorded it a second time, and this is what that looked like. I've aligned these things, and you can clearly see the images look fairly similar. So if we want to match a certain song, all we need to do is make sure these images are similar. So what did I do? I decided to take this recording and cut it into some bands: low frequency bands, middle frequencies, high frequencies. For each slice I pick the loudest point in each frequency band, so I get a couple of points. And I did the recording again, so these are two recordings, and now you can see the red dots, which are just the loudest frequencies in these eight bands. If we overlap these, you can clearly see that they are similar. This can be useful to actually match. The samples aren't exactly the same, but you get the idea. So I take this single slice, this single slice of frequencies with dots; it has a lot of frequencies in it, but I just pick the loudest frequency in each band, and then I compress everything down to just 64 different values, and these are stored as bits in a single long. This is very efficient: a single long can store the fingerprint of one slice. And this is what it looks like. You can see at the top we have these four single bits, and those are the four red dots; those are now in the four bits here, and so on for the other bands. So if we do this for the entire song, the entire song is now just a list of longs. And I want to do this for every song that's in my MP3 library. So, it was eleven o'clock, and I wanted to create a library of these fingerprints. To do this, we need to process MP3 files, and this cannot be done out of the box with the Java audio system; you need a couple of libraries. I ended up using JLayer, which is a real-time MP3 decoder, MP3SPI, which is a Java Sound plugin based on JLayer, and finally Tritonus, which is an implementation of the Java Sound API for these plugins. This is the code I wrote ten years ago. It's a recursive function: we start with a root directory, then we get everything that's in the directory. If an entry ends with .mp3, we process the audio file, and then we do the same thing as we do with the microphone: we put everything into these fingerprint longs and store them somewhere. And if it's a directory, we recurse. Like I said, that was ten years ago; today I would do this completely differently, because now we have the NIO library and we can just do Files.walk with a certain path, filter all the files, and if the file name ends with mp3, we run the algorithm on it. So yeah, going from this to this: much better. The new API makes it so much easier to process files. So what did I have at that moment, ten years ago? I had a library of 3000 songs turned into these fingerprints, and I had a method to capture these fingerprints from my microphone, which I made the day before. The only thing we need to do now is match: we need to match something that's captured with the microphone against these songs in the database. So the recorded fragment looks something like this: we just have a series of longs, which is our fingerprint. And the powerful algorithm behind, for example, the stuff that Shazam and SoundHound do, and that my thing does, is that we can basically do a hash lookup for each and every long, and hash lookups are very fast. So I take this single slice and I look up all the songs that also contain this slice, and I also record the offset. And if we have enough matches for a particular song, we can just call it a match. And this is very cool, especially when you use the offset, because then you can say: I'm pretty sure we have a match for this song, and it's actually matching one minute and 20 seconds into the song. So this is very powerful.
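To make the hashing idea concrete, here is a compact sketch of the fingerprinting and matching: pick the loudest FFT bin in a few bands per slice, pack those bin numbers into one long, index every song's longs in a hash map, and let a recorded fragment vote for (song, offset) pairs. The band boundaries, the bit layout and the voting scheme are assumptions for illustration, not the exact code from the talk; walking the MP3 library and decoding the files is left out here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.math3.complex.Complex;

public class FingerprintIndex {

    // Assumed frequency band boundaries (FFT bin indices) for one slice.
    private static final int[] BANDS = {40, 80, 120, 180, 300};

    // fingerprint long -> list of (songId, sliceOffset) pairs that contain it
    private final Map<Long, List<long[]>> index = new HashMap<>();

    // Pick the loudest FFT bin in each band and pack the bin numbers
    // into the bits of a single long: one long per slice.
    static long fingerprint(Complex[] slice) {
        long hash = 0;
        int start = 0;
        for (int band : BANDS) {
            int loudest = start;
            for (int bin = start; bin < band && bin < slice.length; bin++) {
                if (slice[bin].abs() > slice[loudest].abs()) {
                    loudest = bin;
                }
            }
            hash = (hash << 9) | loudest; // 9 bits per band, assumed layout
            start = band;
        }
        return hash;
    }

    // Index a song: store every slice fingerprint together with its position.
    void addSong(long songId, Complex[][] spectrogram) {
        for (int slice = 0; slice < spectrogram.length; slice++) {
            long hash = fingerprint(spectrogram[slice]);
            index.computeIfAbsent(hash, k -> new ArrayList<>()).add(new long[]{songId, slice});
        }
    }

    // Match a recorded fragment: every slice is a cheap hash lookup.
    // Count hits per (song, offset difference); a consistent offset means the
    // fragment lines up with one specific position inside one specific song.
    long bestMatch(Complex[][] fragment) {
        Map<String, Integer> votes = new HashMap<>();
        for (int slice = 0; slice < fragment.length; slice++) {
            List<long[]> hits = index.get(fingerprint(fragment[slice]));
            if (hits == null) continue;
            for (long[] hit : hits) {
                String key = hit[0] + ":" + (hit[1] - slice); // song + aligned offset
                votes.merge(key, 1, Integer::sum);
            }
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(e -> Long.parseLong(e.getKey().split(":")[0]))
                .orElse(-1L);
    }
}
```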
At that point, just in time for dinner, I had a working system. So, time for a little demo. I've started my small application here, and as you can see right about here, four hundred and eighty-three songs are being loaded into the database, because I no longer have a couple of thousand MP3 files; I now have Spotify. On Spotify I've opened a playlist with the five hundred greatest songs, which should overlap with the four hundred and eighty-three songs in there. Normally I would ask a volunteer to come up on stage, open his or her phone and shuffle this playlist, but right now I'm going to demo it myself because there's nobody here. So I'll just shuffle this playlist and let's see if my computer can recognize the songs. [music plays and the songs are recognized one by one] Yep, that was a quick one. This one is recognizable. Cool. Surely you will recognize Bob Dylan. We should end on a high note. Yeah, that was a quick one. So as you can see, most of the time it's able to guess the song, and if I refined the code, it would probably work a bit better. I hadn't touched this code in ten years and I quickly got the demo up and running again before I could travel to Copenhagen. But hey, there's another thing we can do, instead of just matching a song. For the second demo I've actually taken the entire audio from the movie Rush and indexed everything. So when I start the demo right now, it's listening for that particular movie. But it's not only trying to detect that it hears the movie, it's also trying to detect where in the movie we are. So I've opened up YouTube, there are a lot of clips from that movie, and I'll just randomly select one of the clips and my laptop will listen. And if it finds a match, it will start the audio at exactly that time, so I can stop my phone and the audio will continue playing on my laptop, synchronized, which is pretty cool. Let's hope it works. So let's just pick this scene. [clip plays] Let's skip somewhere to the middle. [clip plays] There it is: my computer analyzed the audio and found a match at about one hour and thirty-four minutes into the movie. Let's do it again with a different piece of the movie. [clip plays, the laptop picks it up] And it continues playing. So not only can we match a certain song, we can also match where in the song we are, which is pretty cool. So, so far the demos. Other things we could do with this: one idea was, for example, to detect duplicate songs. I did this with my old MP3 library and I noticed there were a couple of songs that I had on multiple albums, or live versions of songs. This works perfectly; you can totally use this algorithm to detect duplicate songs. Another idea I had was: what if we could use this to align subtitles?
For example, when you have a movie, you download an SRT file, and an SRT file is nothing else than the text of the subtitles together with a certain point in time, a timestamp. But if the movie you're watching has broken subtitles, maybe they are a couple of minutes off, it cannot correct itself. If you would base the subtitles on audio fingerprints instead, it would always work perfectly. Even if your movie starts a couple of minutes or a couple of seconds later, it would align itself according to the audio, which is exactly what we need. I haven't done this yet, but it could totally work with this algorithm. Fingerprinting is also used for copyright detection; I'm pretty sure YouTube is using a similar algorithm to detect copyright infringement in videos, for example. Could this be used for speech recognition? No, not really. That's just very different, because here we're matching against a known audio piece and it has to match almost perfectly; speech recognition works in a completely different way, so this is not very useful there. So, end of day two, end of my long weekend of discovery. A small recap: you can use fingerprinting to match audio, and using hashes is super fast. Shazam and SoundHound are very cool, but it's not impossible to make something like this in a weekend. There is a lot of room for improvement; my algorithm is far from perfect, and I'm pretty sure it struggles if the library becomes too large, and there are a lot of things we could do better, but it's working. At the end of the day I had a working music matcher and I knew how this stuff worked. So now would be the time for questions, but I'm not there, so you can ask questions on Twitter. If you have questions about this, send me a tweet, send me an email, give me a call, drop by my house. I'm here. Thanks for watching.