Using Mediapipe to Create Cross Platform Machine Learning Applications With React - Shivay Lamba


React Advanced 2021 #ReactAdvanced #GitNation
Talk: Using Mediapipe to Create Cross Platform Machine Learning Applications With React
This talk gives an introduction to MediaPipe, Google's open-source machine learning solutions framework that allows running machine learning models on low-powered devices and helps integrate those models with mobile and web applications. It gives creative professionals a lot of dynamic tools and makes machine learning easy to use, so they can create powerful and intuitive applications with little or no prior knowledge of machine learning. We will see how MediaPipe can be integrated with React, giving easy access to machine learning use cases when building web applications.


Hello everyone. I'm Shivay Lamba, and I currently work on MediaPipe. I'm so excited to be speaking at React Advanced on the topic of using MediaPipe to create cross-platform machine learning applications with React. A lot of this talk is going to center around machine learning, MediaPipe, and how you can integrate MediaPipe with React to create really amazing applications. So without wasting any further time, let's get started.

First of course: today, machine learning is literally everywhere. Look at any kind of application and you'll see machine learning being used there, whether it's education, healthcare, fitness, you name it. You'll find applications of machine learning today in each and every industry known to humankind. That makes machine learning so much more important to use in web applications as well, and as more and more web applications come to market, we are seeing a lot more machine learning use cases on the web too. Let's actually look at a few examples. Over here, we can see face detection happening inside Android. Then you can see hands getting detected in the iPhone XR image. Then you can see the Nest Cam, which everyone knows as a security camera. Then you can see some web effects, where a lady has facial effects applied to her face in the browser. Or you can also see the Raspberry Pi and other such microcontroller-based chips or devices that run on the edge. So what do all of these have in common? That's the question. The thing that is common to all of them is MediaPipe. So what exactly is MediaPipe? MediaPipe is essentially Google's open-source, cross-platform framework that helps you build different kinds of perception pipelines.
What that means is that we are able to use multiple machine learning models together in a single end-to-end pipeline to, let's say, build something, and we'll look at some of the common use cases very soon. It was previously used widely in a lot of research-based products at Google, but it has now been upstreamed, and since it's an open-source project everyone can use it. It can process any kind of audio, video, or image data, and also sensor data. It helps primarily with two things: one is dataset preparation for different kinds of machine learning pipelines, and the other is building end-to-end machine learning pipelines themselves. Some of the features included in MediaPipe: first, fast inference, because everything is actually happening on device. Secondly, you just have to build it once, and different kinds of solutions, including Python, JavaScript, Android, and iOS, can all use it. You build once and run on different types of platforms; that is why we call it a cross-platform framework. These are also ready-to-use solutions: you just have to import them and integrate them into your code, and they can be used very easily. And the best part is that it is open source, so all the different solutions and code bases can be found in the MediaPipe repository under Google's organization on GitHub.

Now, looking at some of the most commonly used solutions: one of the best-known is the selfie segmentation solution, which is actually being used in Google Meet, where you can see the different kinds of backgrounds you can apply, or a blurring effect. What it does is use a segmentation mask to detect only the humans in the scene, so it is able to extract only the information belonging to the humans.
And then we have Face Mesh, which gives you more than 400 facial landmarks, and you can make a lot of interesting applications with it, for example AR filters or makeup try-on. Then we have hair segmentation, which allows you to segment out only the hair. Then we have standard computer-vision algorithms like object detection and tracking, which you can use to detect specific objects, and we have face detection. We also have hand tracking, which can track your hands; you could use it for things like hand-based gestures to control, let's say, your web application. Then you have full human pose detection and tracking, which you could use to create some kind of fitness application or a dance application that can actually track you. Then we have holistic tracking, which tracks your entire body: your face, your hands, your whole pose. It's a combination of the human pose, hand tracking, and Face Mesh solutions. We have some more advanced object detection, like the 3D object detection that can help you detect bigger objects such as chairs, shoes, and tables. And then there are a lot of other solutions that you can go ahead and look at. These are all end-to-end solutions that you can directly implement; that is why MediaPipe solutions are so popular. And just to look at some real-world examples where it's actually being used: we just spoke about the Face Mesh solution, which you can see over here powering the AR lipstick try-on on YouTube. Then we have AR-based movie filters that can be used directly on YouTube.
Then we have some Google Lens services, where you can see augmented reality taking place. And it's used not only for augmented reality but for other kinds of inference as well, like Google Lens translation, which also uses MediaPipe pipelines under the hood. And you can see augmented faces, which again are based on the face mesh. So let's look at a very quick live-perception example of how this actually takes place. What we're going to be looking at is hand tracking. Essentially, we take an image or a video of your hand and we're able to put landmarks on it. What are landmarks? Landmarks are these dots that you see superimposed on your hand; they denote the different key points of your hand. So this is what the example is going to look like. How would that simple perception pipeline look? First you take your video input, then you get the frames from the video and transform each frame into a size that is usable by the Tensors, because internally MediaPipe uses TFLite, that is, TensorFlow Lite. So you're working with Tensors, higher-dimensional numerical arrays that contain all the information for the machine learning model. You do an image transformation of your frame into the size the model expects, the images get converted into the mathematical format of Tensors, and then you run the machine learning inference on those.
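The frame-to-tensor resizing step described above can be sketched as plain math. This is a hedged illustration, not MediaPipe's actual implementation: the 256×256 input size is an assumption (real input sizes depend on the model), and the function name is my own.

```javascript
// Sketch of the "transform frame to tensor size" step, assuming a
// hypothetical 256x256 model input. The frame is scaled down to fit
// while preserving aspect ratio, and the leftover space is padded.
function letterbox(frameWidth, frameHeight, inputSize) {
  // Pick the scale that fits the frame inside inputSize x inputSize.
  const scale = Math.min(inputSize / frameWidth, inputSize / frameHeight);
  const newWidth = Math.round(frameWidth * scale);
  const newHeight = Math.round(frameHeight * scale);
  return {
    scale,
    newWidth,
    newHeight,
    // Padding is split evenly on both sides of the scaled frame.
    padX: Math.floor((inputSize - newWidth) / 2),
    padY: Math.floor((inputSize - newHeight) / 2),
  };
}
```

For a 640×480 webcam frame and a 256×256 input, this gives a scale of 0.4 with 32 pixels of vertical padding, which is why the landmarks later have to be mapped back to the original frame's coordinates.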
Then you do some high-level decoding of the output Tensors, which results in the creation of the landmarks, and then you render those landmarks on top of the image to get the output. So essentially, if you take your hand and superimpose those landmarks on top of it, you finally get the result you see: hand tracking. This way we can build these kinds of pipelines. And what's happening behind the scenes, or under the hood, is the concept of graphs and calculators. If you are aware of the graph data structure, of how a graph has edges and vertices, a MediaPipe graph works in a similar manner: whenever you create any kind of perception pipeline, it consists of a graph with nodes and edges, where each node denotes a calculator. The calculators are C++ configuration files that define exactly what kind of transformation happens; you can think of the calculator as the main brain behind the solution you're implementing. Data comes into a node, gets processed, and comes out of the node, and all of those connections, the edges, are what represent the entire MediaPipe graph. Each calculator has an input port, which is what comes into the calculator, and an output port, which is what comes out once the calculations and transformations have been done. So that is how you can think of the entire perception pipeline: different kinds of calculators wired together to form one particular solution, all of it represented through the MediaPipe graph.
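The landmarks that come out of the decoding step are normalized coordinates in the range [0, 1], relative to the frame; before they can be rendered on a canvas they have to be mapped to pixel space. A minimal sketch of that mapping (the helper name is illustrative, not part of the MediaPipe API):

```javascript
// MediaPipe landmarks are normalized to [0, 1] relative to the frame.
// Convert one landmark to pixel coordinates for canvas rendering.
// (Helper name is my own, not a MediaPipe API.)
function toPixelCoords(landmark, canvasWidth, canvasHeight) {
  return {
    x: Math.round(landmark.x * canvasWidth),
    y: Math.round(landmark.y * canvasHeight),
  };
}
```

So a landmark at `{ x: 0.5, y: 0.25 }` on a 640×480 canvas lands at pixel (320, 120); the drawing utilities covered later do this conversion for you.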
So that's essentially the back-end structure of any MediaPipe solution, what's going on behind it. You can look at the docs to learn more about calculators and graphs by going to docs.mediapipe.dev. You can also visualize different types of perception pipelines: the one we used was a very simple one, just detecting the landmarks on your hand, but if you have much more complex pipelines, you can go to viz.mediapipe.dev to visualize them and look at some of the pipelines offered on that site. And now, coming to the essential part, what this talk is really all about: how can you integrate MediaPipe with React? There are a lot of NPM modules shared by the MediaPipe team at Google. Some of these include face mesh, face detection, hands, holistic (which combines the face mesh, hands, and pose), Objectron (the 3D object detection), and pose. And we have selfie segmentation, which we had covered; it is basically how the Zoom or Google Meet backgrounds work. For all of these you'll find the relevant NPM packages, and you can refer to this particular slide. You can also look at the real-world examples that have been provided by the MediaPipe team; these are available on CodePen, so you can refer to any of them to see how they have been implemented. But what we are going to do is implement this directly in React. So here is a brief example of how it's supposed to work. In the first piece of code, at the top, we have imported React, and we have also imported the webcam, because the input stream that we are going to work with comes from the webcam.
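To illustrate how these NPM packages are typically wired up: the solution is installed from NPM, and its model and wasm assets are resolved from a CDN through a `locateFile` callback. A minimal sketch, assuming the `@mediapipe/selfie_segmentation` package mentioned in the talk:

```javascript
// Sketch: resolving a MediaPipe solution's model/wasm assets from the
// jsDelivr CDN. The locateFile callback tells the solution where to
// fetch each asset file it needs at runtime.
const locateFile = (file) =>
  `https://cdn.jsdelivr.net/npm/@mediapipe/selfie_segmentation/${file}`;

// In a real browser app you would then do something like:
// import { SelfieSegmentation } from "@mediapipe/selfie_segmentation";
// const selfieSegmentation = new SelfieSegmentation({ locateFile });
```

This CDN-backed loading is what the talk means by "implementing the CDN" in the component code: the NPM package provides the JavaScript API, while the heavyweight model files are fetched lazily from the CDN.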
So we have integrated the webcam, and then we have integrated one of the solutions over here as an example: the selfie segmentation solution, which we have imported from the @mediapipe/selfie_segmentation NPM module. We have also integrated the MediaPipe camera utilities, which are used to fetch the frames from the camera. We do also have other utilities that help you draw the landmarks, which we'll discover in a bit. After that you can see the code where we have used the actual MediaPipe selfie segmentation. And again, the best part about this is that you're not expected to write 100, 200, 300 lines of machine learning code. That's the benefit of using MediaPipe solutions: everything is packed into this code. Important, essential machine learning tasks like object detection and object tracking, which usually run to 200 or 300 lines of code, can be done in less than 20 to 30 lines. Over here we have simply created our function for the selfie segmentation, where we use the webcam as one reference and a canvas on top of it as another reference: the webcam is the base, you get your frames from the webcam, and then you use the canvas element on top of it to render the results. And over here you can see that we are just using the CDN to get the MediaPipe selfie segmentation solution, and then we are rendering the results on top of whatever is being detected. But so far it's all been discussion; we'll move quickly into the code with the demonstration. So if everyone is excited, I'm more than happy to share the demonstration now. Let me go back to my VS Code. Over here, I have implemented a very simple React application.
It's a simple Create React App application; you can set one up very easily by following the Create React App documentation. In my main App.js code, I have integrated four different examples: hands, face mesh, holistic, and the selfie segmentation solution. I'll quickly show you the demos of all of these, but I mainly want to demonstrate how easy it is to integrate this kind of MediaPipe solution, a machine learning solution, into React. Within my app function, I have for now commented out all the other solutions. The first one I'll demonstrate is the face mesh. I have imported the components for each of these, and currently I'm just returning the face mesh component. If I go very quickly to my face mesh component and we walk through the code, you can see that I've imported React and some of the landmark sets. Whenever we're talking about the face, we want the right eye, left eye, eyebrows, lips, nose, and so on; these are the landmark groups we have imported specifically from the face mesh package. Then we have created our function, and we have created the FaceMesh object that will be used to render the face mesh on top of our webcam. Over here we again use the CDN, and we use the face mesh onResults callback to render the results we receive. We start off by getting the camera: we use a new Camera object to get the reference to the webcam, and with await we have created an async function, because the machine learning model itself can take some time to load; that is why we use an async function to wait for the model to actually load.
And that is why we send the face mesh the webcam reference, that is, the current frame. Once your camera loads and the frames start coming in, we send each frame to our face mesh so it can render the landmarks. In the onResults function, we take our video input and render the canvas element on top of it using the canvas context, and then we render the facial landmarks on top of the actual frame that we see. That is what you see over here using drawConnectors: this is a utility provided by MediaPipe, and we are using it to render all the different landmarks. Finally, we return our webcam and also the canvas that is going to be placed on top of it, and then we export this React component and use it in our app. So, jumping very quickly into the demonstration: I'll open an incognito window and go to localhost:3000, where it will open up the webcam, and we should be able to see it. As you can see, hi everyone, that's the second me, and very soon I should be able to see the face mesh actually land on top of the demo. As you can see, that's the face mesh. Boom, great. And as I move around, open my mouth, close my eyes, you can see how all the facial landmarks follow along. So I can close this and quickly change to another demonstration, let's say the selfie segmentation this time. What I'll do is comment out my face mesh, uncomment the selfie segmentation, and save it. While it's loading, you can see the selfie segmentation code over here. Again, what we're doing here is providing a custom background.
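The camera-to-model loop described above can be sketched in plain JavaScript. This is a hedged stand-in, not the talk's actual component: in the browser the loop is driven by the Camera utility from @mediapipe/camera_utils, and `makeStubModel` here is a stub with the same shape as a MediaPipe solution object (e.g. FaceMesh) so the loop can be exercised anywhere.

```javascript
// Results are delivered through the callback registered with onResults,
// and send() is awaited because inference on each frame is asynchronous.
async function runFrameLoop(model, frames) {
  for (const frame of frames) {
    await model.send({ image: frame });
  }
}

// A tiny stub with the same onResults/send shape as a MediaPipe solution,
// so the loop above can run outside the browser. Purely illustrative.
function makeStubModel() {
  let callback = () => {};
  return {
    onResults(cb) { callback = cb; },
    async send({ image }) { callback({ image, landmarks: [] }); },
  };
}
```

The real component follows the same pattern: register onResults once, then keep awaiting send() with the current webcam frame.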
The background we're going to provide is defined over here, where we use the canvas fillStyle. It will not recolor your human body; it will recolor everything else. That is why we use the fill style: to, let's say, add a virtual background. So if I quickly go back to my incognito window, I can go to localhost:3000, and hopefully everything works fine. I can see my camera and my frame coming in, and very soon it should load the selfie segmentation model; let's just wait for it. And as you can see, boom, blue background. Essentially, how it's working is that it takes in your frame and segments out your human body, and it colors the rest of the entire background with this blue color, because it is able to separate the human body from everything else. Similarly, you can try out the various other solutions that are there. For my demonstration I showed you the face mesh and the selfie segmentation, but you can also try all the other ones shared in the NPM modules from your React code. That is essentially what I wanted to show in the demonstration. Again, it's super quick: even with the selfie segmentation, the actual logic we write runs from roughly line 10 to line 36 of the component. Within a few dozen lines of code you're creating these kinds of wonderful applications, and you can just think about what amazing applications you could create with the help of MediaPipe. And these are just two of the examples that I have shown you.
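The virtual-background effect described above boils down to a per-pixel choice driven by the segmentation mask. A minimal sketch of that rule in pure JavaScript (the real component does this on the canvas with the mask MediaPipe returns; the function name and the 0.5 threshold are my own assumptions):

```javascript
// For each pixel: if the mask says "person" (value above a threshold),
// keep the camera pixel; otherwise paint the background color.
// Mask values are assumed normalized to [0, 1], with person = high.
function composite(framePixels, maskValues, backgroundColor, threshold = 0.5) {
  return framePixels.map((pixel, i) =>
    maskValues[i] > threshold ? pixel : backgroundColor
  );
}
```

In the demo, `backgroundColor` is the blue set via fillStyle, which is why everything except the segmented person turns blue.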
So the sky is the limit, and the best part is that you don't really need to know what's happening behind the scenes. You don't need to know what kind of computer vision or machine learning is happening; you just have to integrate it. That is why today we are seeing MediaPipe being used in so many live examples, in production environments, by companies and by startups. It's really the future of the web, and it integrates easily with React, which is the future of the web as well. So with that, that brings an end to my presentation. I hope you liked it. You can connect with me on Twitter or on my GitHub at shivaylamba, and if you have any questions about MediaPipe, or about integrating MediaPipe with React, I'll be more than happy to help you out. I hope everyone has a great React Advanced. I really loved being a part of it, and hopefully next year, whenever it takes place, I'll meet everyone in the real world. So thank you so much. With that, I sign off. That's it. Thank you so much.