Real-time Rendering of Big Data | Mustafa Abdul-Kader


Mustafa Abdul-Kader

Current presentations of real-time data limit our ability to act on the data immediately. The growing spread and collection of data create an ever-increasing need for better and more thoughtful analytics and monitoring of that data. This session will uncover ways of using and extending an organization's data pipeline to perform real-time updates to the presentation layer.
We'll uncover how to design real-time communication systems between the server and client and embark on the journey of creating a real-time data dashboard with the necessary components to view events happening in real-time.
Mustafa is a tech enthusiast, urban dweller, and full-time dad to two cats. Currently, a Software Engineer @ nvisia based out of Chicago.


So good evening, everybody. My name is Mustafa Abdul-Kader and I am a software engineer. I do want to give a little bit of a sales pitch on myself and the company I work for. I've been a software engineer professionally for about four or five years. Four or five is maybe a stretch, maybe three or four; I've sort of lost track of time. Computers and programming have always been a big part of my life. I love exploring new things and new technologies and seeing everything there is out there to explore and understand. I think the open source community and technology is fascinating, and it's cool how, even though we're all at home currently, we're still able to produce great things and collaborate. That really drives innovation, and I think nvisia is a great company to help foster that attitude and mentality. We're a software consulting firm based out of Chicago. We have offices in Milwaukee and Madison as well. And I think we are hiring, so if you guys are interested or looking, feel free to reach out to me and I can get you in touch with the appropriate channels.

With that being said, let's get started. As you guys may have read from the description, we will be covering big data and WebSockets. A lot of people, I think, especially the general population, have this assumption that big data is pretty much spyware that will track your every movement and encapsulate all the information there is out there. And maybe some companies are trying to do that. But big data is more about leveraging analytics and information to help drive business decisions and get useful, meaningful analytics out of that data. And big data doesn't necessarily have to be enterprise level. It could be anything from your IoT device publishing temperature data to video camera feeds and analyzing those; really anything is possible. Big data sort of represents the
conglomerate of all things data and how you use it effectively.

WebSockets are a web technology that allows full-duplex communication between a server and a client without having to constantly poll for more data. That is nice because having a socket open really opens up the possibilities of what you can do on the front end, how you can integrate your systems on the back end, and how you hook all the information processing going on there into your front end. In our example, that front end will be using React.

So you might be wondering what a few use cases of this technology are. I don't know how many of you have worked in corporations that process information at a very fast rate, but there are a multitude of examples, from logistics, processing shipment info and any new events that come from that shipment, to financial firms that need to flag or process transactions in an effective manner, and how they design their pipelines and manipulate that data to extract all the information out of it.

In our case, we'll be using a very popular topic, which is COVID. There's really no implementation in place for real-time analytics of COVID data, and this might be because we've never had to deal with something on this scale before. Johns Hopkins does release daily reports on their GitHub of any new information they get. But all that data is aggregated from different sources. Faxes, I'm guessing, are a big part of that, emails maybe; I'm not really sure how that information gets collected. But there is a bit of a delay before that information is readily available to everybody. And this poses a problem, because with the information being delayed, we can't really be as reactive to it as we could be if we had that information more readily available.
Sorry, I don't know if you guys can hear my cat. So let's dive into some of the different ways we can pull this data, or rather, retrieve the data. We have three different mechanisms that currently exist and that I think are the most popular. I'm sure there are others that I may not know about, but we'll be looking at what polling is, how WebSockets work, and server-sent events.

Polling is the act of, as a client, continuously requesting a resource. This is generally OK if your data doesn't update all too often. Let's say you have alerts on a web page that you want to display to the user. They don't necessarily need to have that data instantly available, so maybe polling every five minutes is a good solution for that. But as you can see here (I don't know if you guys can see my mouse cursor), the request life cycle, and this is an encrypted request, has a few steps that happen transparently to the end user. First, the client has to send the synchronize message, and then the client and server acknowledge each other. After that they set up the key exchange and get everything ready to send the request. This is very exhaustive. You can see the timelines here: this will take two hundred milliseconds. That doesn't sound like a lot, but when we have messages being sent to us at a very fast and continuous rate, paying two hundred milliseconds on every one of these requests will definitely add up. We'd like to have the most performant technology available to us to handle this information exchange.

We can avoid repeating all these steps by switching over to WebSockets. Now, this is going to have the same initial handshake as described here, but afterwards we just have a full, persistent connection between the client and the server.
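
To put rough numbers on that comparison, here is a minimal back-of-the-envelope sketch. It assumes the worst case for polling, where every poll opens a fresh connection and pays the full setup cost; with HTTP keep-alive the real figure would be lower. The function names are made up for illustration, and the 200 ms figure is simply the one quoted from the slide.

```typescript
// Rough cost model for connection setup. All numbers are illustrative.
// Assumption: every poll opens a brand-new connection and pays the full
// handshake; a persistent connection pays it exactly once.

const SETUP_MS = 200; // setup time quoted in the talk for one encrypted request

/** Total setup time when each of `requests` polls opens a new connection. */
function pollingSetupMs(requests: number, setupMs: number = SETUP_MS): number {
  return requests * setupMs;
}

/** A persistent connection pays the setup cost once, however many messages follow. */
function persistentSetupMs(messages: number, setupMs: number = SETUP_MS): number {
  return messages > 0 ? setupMs : 0;
}

const updates = 1000;
console.log(`polling:    ${pollingSetupMs(updates)} ms of setup`);
console.log(`persistent: ${persistentSetupMs(updates)} ms of setup`);
```

Under these assumptions, 1,000 updates cost 200,000 ms of setup when polled one connection at a time, against 200 ms for a single persistent connection.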
Now, this is full duplex, meaning the server can send messages to the client and the client can also send messages back to the server. You might be wondering why we need WebSockets for this sort of retrieval of real-time data if we only want the server to send us data. That's where we get into server-sent events where, instead of being full duplex (sorry, scroll wheel), we just have communication from the server to the client. But that has its own limitations in the number of connections you can have open. I believe all major web browsers do support server-sent events, but they didn't get as much popularity and traction as WebSockets, which seem to have a bigger foundation of library support and of articles and blog posts to help you better understand them. So I do believe that server-sent events are not here to stay, but if you need something a bit lighter weight, maybe that is the approach you need to take. For our example, we'll be using WebSockets. Even if we won't use the full communication bandwidth available to us, we'll have this foundation in place to make this very easy and seamless.

You can note that the client or the server can both close the connection, and obviously, if the connection gets severed by external means, such as a network outage, that connection will close on its own. So we can keep the connection alive till we're done with it, and that'll be something like a page transition or a redirect.

Next slide. Now, how does this all come into play? We need to implement some sort of message bus for our WebSocket to bind to. Most enterprise applications that do message processing will probably have a message bus in place, just because it eases adding new consumers or new publishers to the message bus. It's sort of like the internal workings of
an electric grid or something, in that you don't really think about it, but it powers or maintains a whole bunch of the infrastructure that we need to live our lives normally.

There are a few different message buses available, and our example will be using Kafka. It might be a bit overkill for a single consumer and a single publisher, but the benefit is that it represents the kind of real-time message bus that you'd normally see in a corporate environment. A lot of people do have experience with it, and if you don't, it simply works like this: any application can push messages onto a topic, and any other application can either listen to that topic or choose to just pass on it. You can end up chaining these consumers to also be publishers and create a pipeline of transformations or analytics, like putting data into a database, then normalizing that data, then having another consumer send that data off to a third-party integration that you have connected. In our use case, we'll have a consumer that publishes the messages it receives to any open WebSocket connection.

Now, there's a nice library called RxJS, and I think the Rx series spans a multitude of different languages. The framework, I think, has bindings to .NET and Java. You might be familiar with this and you might not be, but it is a really nice way to handle... well, let me read the description. This is from their website: ReactiveX is a library for composing asynchronous and event-based programs by using observable sequences. Now, that's a lot of buzzwords, but effectively what it is is an extension to your stream. You can imagine this line here being a pipe, and any messages that come through it are these little circles. And you can apply different transformations or mutations or utility functions to this pipeline to transform that data.
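
The pipe picture can be sketched in a few lines without any library. This is a hand-rolled, synchronous stand-in written only to illustrate the idea; it is not RxJS, and `Pipe`, `fromArray`, and `mapPipe` are hypothetical names chosen for this sketch.

```typescript
// A hand-rolled stand-in for an observable pipe. NOT RxJS itself:
// `Pipe`, `fromArray`, and `mapPipe` are hypothetical names for illustration.

type Listener<T> = (value: T) => void;

/** A pipe is a function that, once subscribed to, pushes messages to a listener. */
type Pipe<T> = (listener: Listener<T>) => void;

/** Source pipe that emits each element of an array in order. */
function fromArray<T>(values: T[]): Pipe<T> {
  return (listener) => values.forEach((v) => listener(v));
}

/** Operator: returns a new pipe whose messages are transformed by `fn`. */
function mapPipe<A, B>(source: Pipe<A>, fn: (a: A) => B): Pipe<B> {
  return (listener) => source((a) => listener(fn(a)));
}

// Building the pipeline does no work yet; nothing has been emitted.
const timesTen = mapPipe(fromArray([1, 2, 3]), (n) => n * 10);

// Messages only flow once something subscribes.
const received: number[] = [];
timesTen((n) => {
  received.push(n);
});
console.log(received); // the three inputs, each multiplied by ten
```

Each circle on the line corresponds to one call of the listener, and each operator wraps the pipe before it, which is why operators can be stacked freely.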
So, for example, this is a map operator, and all it is doing is multiplying any event that comes in (in this case they are numbers) by 10, so you get 10, 20, 30. This is a very strong library to use when you're integrating with real-time data, just because it handles events as a sort of primitive in your code base. They label it reactive-driven development, or I think that's what it is. So you can react to changes in data or inputs to the data very rapidly, and it is all lazily evaluated. This right here signifies a completion; in our case, this would be the closing of that WebSocket. But the operation doesn't happen immediately, only once the stream gets subscribed to, so we could be using it somewhere else, for example.

With that being said, it's time to show you guys the demo. So we will start here. Let me know if you guys can all see this fine. I'm hoping you can. I think we are good to go.

To kick things off: I did a similar presentation for Tony, so that's why the name is there. But for this one, I'll be focusing more on the front end and the React side of things. I'd like to give a general overview of what it takes to bring up a system like this. So we have three scripts here. We have startup.sh, and all this is responsible for is starting Postgres, starting up our Kafka, and creating a topic. We already have everything up, so I won't need to run that. We have produce.sh. What this one does is call my covid-data-gen. Just to give a little background on what the application is, we have a program called covid-data-gen that is responsible for generating realistic COVID data, and it uses Faker.js to generate names, dates of birth, and stuff like this. We'll look at that in a little bit.
And we have a Java app that exposes a WebSocket endpoint to connect to, but also listens as a consumer to this case-events topic and publishes any of those messages to whoever is connected.

So all this script is doing is running that covid-data-gen program for the number of case events that we want to generate, slurping all that into an array using jq, and then iterating through each one of those, assigning the ID as the key, and piping that to kafkacat to publish the message out.

With that being said, we can go and have a look at the data gen. Let's go and open the index to break this down. We bring in a few imports. We use a data store, and this is to handle updating already existing data. For example, when a person goes and gets tested, we need another event to handle whether they're positive or negative and what the outcome was, whether they recovered or died. So this just manages that store for us. There's a block of code that creates a logger and sets up some constants, but the important one is this chunk right here. This is sort of the root of the application. To break it up a little bit: we have the number of case events we want to generate, and we'll decide between a new event or an update event. If it's a new event, we'll handle a new event; otherwise, we'll handle an update event. And then we'll log it appropriately.

We can see an example of this. We generate, for example, two. What happened here is that it tried generating an update event, but there was no existing event to update, so it skipped that one. But we have this one right here, which we can expect to be a new event. Just to show it a little more easily: we have an ID, we have a test date, we have a date of birth, the name, the location, the age, and the status. Repeating this, we can see that Josh Kling was confirmed, and it's an update.
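
The generator loop described above can be sketched roughly as follows. This is not the covid-data-gen source: the event shape is trimmed down, the 50/50 split between new and update events is an assumption, placeholder names stand in for the Faker-generated ones, and every update simply moves a case to confirmed.

```typescript
// Sketch of the generator loop; not the actual covid-data-gen code.
// Assumptions: 50/50 split between new and update events, placeholder names
// instead of Faker output, and updates always set the status to CONFIRMED.

type Status = "POTENTIAL" | "CONFIRMED" | "NEGATIVE" | "RECOVERED" | "DEAD";

interface CaseEvent {
  id: number;
  name: string;
  status: Status;
}

/**
 * Runs `count` generation steps. Each step picks between a new case and an
 * update to an existing one. An update attempted while the store is empty is
 * skipped, so fewer than `count` events may come back.
 */
function generateEvents(count: number, random: () => number = Math.random): CaseEvent[] {
  const store = new Map<number, CaseEvent>(); // remembers cases so updates can find them
  const events: CaseEvent[] = [];
  let nextId = 1;

  for (let i = 0; i < count; i++) {
    if (random() < 0.5) {
      // New event: every case starts out as a potential case.
      const created: CaseEvent = { id: nextId, name: `Person ${nextId}`, status: "POTENTIAL" };
      nextId++;
      store.set(created.id, created);
      events.push(created);
    } else {
      const existing = Array.from(store.values());
      if (existing.length === 0) continue; // nothing to update yet, so skip this step
      const index = Math.min(existing.length - 1, Math.floor(random() * existing.length));
      const target = existing[index];
      if (target === undefined) continue;
      const updated: CaseEvent = { ...target, status: "CONFIRMED" };
      store.set(updated.id, updated);
      events.push(updated);
    }
  }
  return events;
}

console.log(generateEvents(2)); // may hold fewer than two events if an update was skipped
```

Passing the random source in as a parameter is a design choice made for this sketch so the loop can be exercised deterministically; the skipped update in the demo run corresponds to the `continue` branch.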
So this is what's going to simulate our real-time data, just because there's no ability to get real-time data currently. And that wraps up the data gen.

We're going to have a brief look at the Java application that drives this. We have a few classes here, so let's go into the controller. This just has two endpoints that are exposed: one for the stats, and we can see this over here, the confirmed, recovered, deaths, and negative cases. That's going to be the initial page load. And then we have the recent cases, and that's going to be that bottom portion that shows the recent cases. What else do we have? We have a repository. This just connects to our Postgres database. And we have some model classes, just POJOs that represent the status and the type, you know, nothing meaningful.

And this, I think, is the more important aspect: our consumer implementation. I know this is a very high tech conference, but bear with me just a little bit here, because it's important to understand how this all ties together. All this is doing is listening on this topic and handling the event with the case event service. And the case event service is just checking if that case exists, adding it to our database, and then sending the message to our WebSocket. That's all it does. All this code is on GitHub, so I won't dive into it too deeply.

And then this is the part that I think we've all been waiting for. This is the dashboard, and it is written in React with the Ant Design system. We can see that we have just a few components, but the one where everything happens is the main dashboard. Now, I know React is taking a big step towards using hooks instead of the class declaration for component initialization. Truthfully, I'm not a big fan of hooks. I think the component life cycle made more sense using classes.
But for the sake of not being grilled, I decided to change these to functional components. We'll break this down step by step, but primarily we have a hook, useState, that sets up our COVID stats: potential, confirmed, negative, recovered, and dead. This is the state that will update on any new message that we get. We have this handleData, which we'll explore in a little bit. And we have two effects here. These just get set up on initialization: one calls and sets the stats; the other one calls the recent cases and sets those. We do a little transformation here, just because the date does come back as an array when Jackson serializes it, and I'll explain why I chose that design decision in a little bit. And then our last effect is our most important, which is where RxJS comes into play. We'll look into that in a bit. But the rest of it is just for setting up the component.

So let's look at our observable. Let me get rid of these quick. The cool thing is RxJS does expose a module for WebSockets, and it takes care of all the grunt work of connecting and maintaining the connection. When we create it, it creates a subject, and the subject is just a source of events, or messages rather. And the cool thing about RxJS is that we can pipe each message and transform that data accordingly. We just have two simple transformations here, but I'm sure you can imagine the possibilities are endless as to what you can do with it. For example, we have the map operator, and all this does is transform these two properties into dates because, like I said before, they came in as an array of year, month, and day. That just goes to show that it's very adaptable to fit your use case better, whatever that may be. Even if the data might not be represented the way you want, you can extend or modify or mutate it however you'd like. The second one is called tap.
Now, I don't know if you guys are familiar with the Unix tool called tee, but this one hooks into the pipe of transformations and takes the message but doesn't really act on it. All it does is grab the message and forward it along as if it was never there. All I'm doing here is demonstrating that we can use this, for example, for logging purposes. So we're just taking that data and logging it.

And if we go back: when we're ready to handle those messages coming in, we take that observable and we just subscribe to it. This is what triggers that lazy initialization and effectively starts it up. And that's where handleData comes into play. That is a callback, and we're going to explore that right now.

We have status, which we just pull out of the data that we have, and we have the case data, and I'll show you why I split that out. If a status is present, we set the stats. If they're dead or recovered, then the confirmed count goes down. If they're negative or confirmed, the potential count goes down. Not the case itself or the person of the case, but the stats we use to manage it. And then otherwise, whatever else is remaining, I think this would be potential, would go up by one. And then we just return the stats object.

Now, for the case data, here's the transformation we're applying. We're concatenating using Lodash, and we're adding the data to the front, but we're also removing any element in that array that has that same ID, because if we get a message for a case that was previously potential but now is confirmed, we don't want to show it in our recent cases twice. We want to remove the old one and add this new one. And that manages that for us. That is cool.

And I can run this to show you guys how it all comes into play. Yarn start. Ignore these warnings. And these.
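
The message handling walked through above comes down to three small pure functions, sketched here under stated assumptions rather than copied from the dashboard code. The names `toDate`, `applyStatus`, and `mergeRecent` are hypothetical; the sketch assumes the month in the date array is 1-based, that the count for the incoming status goes up by one (the talk only spells out which count goes down), and it uses plain array methods where the talk uses Lodash. The limit of five recent cases comes from the Q&A later in the session.

```typescript
// Sketch of the dashboard's message handling. Hypothetical names throughout.

type Status = "POTENTIAL" | "CONFIRMED" | "NEGATIVE" | "RECOVERED" | "DEAD";

interface Stats {
  potential: number;
  confirmed: number;
  negative: number;
  recovered: number;
  dead: number;
}

/** Converts a [year, month, day] array into a Date. Assumes a 1-based month. */
function toDate([year, month, day]: [number, number, number]): Date {
  return new Date(year, month - 1, day);
}

/** Returns updated counters for one incoming status; the input is not mutated. */
function applyStatus(stats: Stats, status: Status): Stats {
  const next: Stats = { ...stats };
  switch (status) {
    case "DEAD":
      next.dead += 1; // assumed increment
      next.confirmed -= 1; // a confirmed case reached an outcome
      break;
    case "RECOVERED":
      next.recovered += 1; // assumed increment
      next.confirmed -= 1;
      break;
    case "NEGATIVE":
      next.negative += 1; // assumed increment
      next.potential -= 1; // a potential case got its result
      break;
    case "CONFIRMED":
      next.confirmed += 1; // assumed increment
      next.potential -= 1;
      break;
    default:
      next.potential += 1; // anything else is a brand-new potential case
  }
  return next;
}

/** Puts `incoming` at the front, drops any older entry with the same id, keeps `limit` items. */
function mergeRecent<T extends { id: number }>(recent: T[], incoming: T, limit: number = 5): T[] {
  return [incoming, ...recent.filter((c) => c.id !== incoming.id)].slice(0, limit);
}

const start: Stats = { potential: 1, confirmed: 0, negative: 0, recovered: 0, dead: 0 };
console.log(applyStatus(start, "CONFIRMED"));
console.log(mergeRecent([{ id: 1, status: "POTENTIAL" }], { id: 1, status: "CONFIRMED" }));
```

Because all three functions return new values instead of mutating their inputs, they fit naturally into a state setter or a stream operator.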
You know, this is also published on GitHub, so if you guys want to explore it or even fix my code, I'd be more than happy to accept your pull request. I am not the best React developer, but I do like playing around with it, so I'm open to any suggestions in case I've done something wrong, or maybe there's a better way of doing something.

And this is just going to start our Gradle instance, our Java instance with Gradle. And covid-data-gen we don't even have to start; it just runs once as a binary, so we can move on from there. Looks like everything is working. If we go back to the UI and refresh this page... hopefully this is working. I might have cleared the database prior, so let's go over that one real quick. It just passes the clear flag to our covid-data-gen, which clears the data store, and it also clears out any events from the database. So, yeah, all zeros, so I think it was just empty. And now we can run produce and pass in a small number of events.

So here are the events coming through. They went a bit too fast for us to really see it happening in real time, but we can see that the potential number of cases is five, the confirmed is one, there are no negative or recovered yet, and unfortunately, one person died. And we can see all the cases here. It looks like we do have a bug in our data, so ignore that. That's OK. What we're going to do now is crank that number up to one hundred so we can see it all coming in real time.

And we can see that we definitely have an opportunity here to really push the boundaries of what is available to us. We no longer have to wait for messages to come in or press a refresh button on the grid to poll for new data. For someone who is monitoring COVID events and the latest updates, a tool like this could be helpful in getting something meaningful out of the data: identifying hot spots sooner, closing down states that seem to be on the uptick.
So it can really have a lot of potential there. I do know that Georgia Tech, and this validated my theory quite a bit, which was nice... Georgia Tech students, I believe, created a real-time dashboard for COVID hot spots, which lets you place your mouse over a certain area and check the severity at that location. After doing a bit more research on it, it did seem like it was still day-old information. But if they had the resources available to handle real-time data, or if the real-time data was even generally available to them, they would have had better luck in catching this early on and being able to quarantine zones appropriately.

And just to demonstrate, we can refresh this page and see that these numbers are all the same. We don't really run into any state issues, because the same messages that are being inserted into the database are being handled on the UI as well. So we can guarantee that what we see here is an accurate representation of that information.

And, hopefully this works, to show a more realistic example, we could have a thousand events coming in at a time. Maybe we even want to set a delay, because this information is unreadable at a certain point when it's coming in too fast. We can buffer it and add some lag to give a better user experience and give the analyst an opportunity to really understand what's going on. Yeah, it's fun to just see it add and remove and mutate itself without really having to poll. As you can imagine, getting this state by firing GET requests would certainly be a strain, and I would imagine it would even slow down your browser if you're trying to access that data at the speed that it is coming in. And I'm sorry, my cat's in the way.

So I guess that wraps up my presentation. I do want to note lastly, if that's OK, that this has implications beyond just COVID.
I mean, I think that if something like this had been available to us from the start, we could have saved a countless number of lives. And I don't want to get political here, but there's a real opportunity for real-time data to be used not only for business gain, but also for general life-saving technology that really could, I think, change the world we live in. I know that information already spreads at a very rapid rate and a lot of us are consumed by it. But if we take it away from trying to drive social media analytics and apply it to something more, I think, noble, we could foster a better world and really showcase that technology isn't evil and that it's here to help us and not, you know, take us down. So with that being said, I'll end my speech there and stop sharing.

And yeah, I guess I'm open to any questions if any of you guys have any. I know I was a little short on time there, so I apologize. But I could dig deeper into the code or talk about anything you guys would want me to.

Oh, yeah, awesome. Thanks a lot. Let's see if we got any questions, if you want to stick around and dive any deeper. Brian Nordquist asks, is that a cat?

Yeah, and in a tiny apartment, they're sort of everywhere. So they are getting all up in my business, trying to, you know, walk on my keyboard or mess with my mouse or anything.

James Barkway has a question. Go ahead. Are you muted, James?

Yeah, all right, now we're good. So I don't want to get too far out over my skis here; I mean, I'm not sure I know what I'm talking about. But I was curious about the first thing you were showing, with WebSockets versus polling. On a project that I've been working on, we've got a bunch of GraphQL middleware stuff and we're using Apollo, and we're just starting to experiment with having Apollo do some of that kind of polling for back-end value changes to communicate to the front end.
I don't know if you've looked at that or how you'd compare that with RxJS. It may be that this is a left-field question.

Got it. Now, as far as I know, Apollo is an npm library, or JavaScript library, for GraphQL communication, right? I guess it doesn't really solve the problem of constantly having to poll. You're still limited by the three technologies available to you. I mean, GraphQL effectively is just aggregation of data into a certain response. I think they still use HTTP as their transport mechanism. I'm not too familiar, so anyone can correct me if I'm wrong, but I think you'd still be limited by constantly having to poll. I don't think Apollo will help you solve that problem. If that's data that's infrequently changing, then it might be fine to use it to update the state in your front-end application. But if you find that you're polling way too frequently, then you're just hurting the bandwidth of the network infrastructure between the client and the server and putting unnecessary load on the server.

Yeah, so GraphQL has subscriptions, which maintain an active connection to your server, most commonly via WebSocket.

Oh, yeah, so it sounds like it also connects to WebSockets to manage the real-time communication. I'm glad I was on point with that.

Yeah. I haven't messed around with it. It's something one of our other devs implemented, so I'm not that familiar with it. Like I said, it could be I just don't know what I'm talking about. But anyway, that was interesting. Thank you.

Absolutely. Good question.

I have a question. I see how this would allow for really quick communication between the server and the client. I'm wondering, would you see any potential bottlenecks for an entire nation trying to update the database that this would be pulling from? Would there be any problems from that?
Because it seems like your generator can very quickly create these events and put them in the database for a thousand or so events. But if, say, an entire nation was trying to create this one grand set of data, would there be any bottlenecks that would have to be dealt with on that end?

I would imagine there would be. And you're right, a thousand events is very low scale, low volume. The nice thing about big data is that it's not really tied to a certain technology. We have measures and technology in place that we can leverage to handle a large volume of data. And the nice thing is Kafka sort of works as a receiver as well. So we can throw in... excuse me one second. If anyone wants to see, this is the cat that's been bothering me all day, here on the ground. I'm sorry, I lost my train of thought. But we can use Kafka as a way to handle the volume of data and scale the consumers to process that information as needed. So, for example, let's say every state has a connection or access to publish data to a central system, and maybe we need 10 consumers to handle all that data coming in. We have a webhook that publishes the message to a Kafka topic. I'm just visualizing this example. We might not necessarily use Postgres, because I don't think we need the consistency guarantees there; we might use something that is more performant and can just do mass inserts. We can break some ACID guarantees to handle that influx of data and maybe process it at a later time, like at low-volume hours such as the middle of the night, to throw it into a relational database, for example. So, I mean, it's just a matter of how you want to implement the technology.

But if your question is how reliable WebSockets are, it'll be dependent on how much data the client can receive before the memory runs out.
The nice thing about the example that we saw was that our stats were simply numbers, so that doesn't take up much memory, and our recent cases were limited to the top five, so we just got rid of anything that was a bit older. That way, we never have to worry about running out of memory from a huge number of events. I'm sure there are certain complex applications that you might need to address a bit differently. But I think there are, not necessarily workarounds, but solutions to these problems. I hope I answered your question and didn't just ramble.

Yeah, no, you did. It just seems like such a big task to get all that data into one place, and I guess I'm just unfamiliar with the technologies that would do it. But it might be somewhat tangential to what you were talking about, which is how to create a seamless connection between data that is getting updated that frequently and the client.

Right. And I mean, any sort of CRUD application with volume will also run into similar issues. It's just a matter of how you architect the application and scale it. It's in scalability mostly, and reliability, that you really tend to see these problems arise. But in the case that you do have a system that's getting this much data and you need someone to look at it frequently, then I guess I'm just trying to demonstrate that WebSockets are the way to go in terms of viewing it more rapidly and without overhead.

Cool. Any other questions? Comments? Concerns?

I've got a quick question, and my information is probably a bit out of date, but years ago I remember there would often be trouble maintaining WebSocket connections, especially over load balancers. Have you encountered anything like that? In the past, even though WebSockets were practically standardized, people still tended to do a lot of long polling for certain use cases.
Have you encountered any problems like that?

Yes. So unfortunately, I've never been on a project that required WebSockets to be used for data transfer. As much as I think it's a great idea, the opportunity just never arose. So I've never had to deal with things like a reverse proxy or a firewall, for example, getting in the way. But I've been following WebSockets for a while. In one of my college classes, I made a checkers game using WebSockets, and you would definitely run into intermittent issues with the connection dropping randomly. I don't know if that was necessarily the technology failing as much as the implementation of the technology not being ready yet. I think there are mechanisms in place now that handle the re-establishment of that connection in case it drops, or maybe there is a network outage for a second, or you're on your phone and you transfer from mobile data to Wi-Fi, so that it can re-establish that connection and pick up where you left off. It's just a matter of setting up that infrastructure.

Yeah, I can add something to that. At my last company, I architected an application where we had a real-time map, and on that map there were vehicles moving around and all that. So we used WebSockets. I remember when I first set it up, I think it was in an AWS environment, and I think I had it behind a classic load balancer, which doesn't work for WebSockets. So I had to change it to the application load balancer, and then it started working. So in a way, I think you're right. In your question you asked, is that an old thing? I think that was a limitation a long time ago, but I don't think it is much anymore.

OK, cool. And then somebody else said here that their last job was a trading app that used WebSockets served by a Spring app, and disconnects weren't really a problem outside of a client losing network connectivity. Same thing, and I think a trading app is a great example of this.
You can see Robinhood, if you guys have experience using that. It is amazing how responsive it is and how they can update the price of the stock immediately. I'm not sure if they use WebSockets or if they poll every second. But that's where the inspiration came from: trading apps, I think, are real time, and there are a lot of trades happening during the week. And it's definitely cool that you're using WebSockets to solve these problems.

Yeah, it's pretty awesome. I love it. I love the push, you know. There was even one time I did a mobile app and I had a dashboard thing, and that wasn't even using WebSockets, but using server push events. What was that called? It was just a one-way, server-to-client push. Even that was awesome at the time, and that was like 2014. I think the future should be push. I don't think anybody should be polling for data anymore.

I mean, we have webhooks.

Yeah, that's what I'm saying. 2014 was my first taste of it, even though it wasn't bidirectional. It was only server-to-client push, whatever that was called. It was popular in Ruby on Rails at that time. But yeah, then WebSockets are the ultimate.