Video details

Data Analysis with Python


Presented by WWCode Connect Forward 2021. Speaker: Nimrita Koul
This talk covers the steps involved in getting a dataset ready for fitting a machine learning model using Python.


I'm very excited to be here tonight. It's around 11:00 p.m. in India and I'm kind of sleepy, but there's still great excitement in speaking and connecting with people here. We can see your energy. This is a hands-on session and, at the same time, a beginner's tutorial for those who are interested in using Python for data analysis. They can be students, research scholars, or anybody who is interested in exploring Pandas for their data analysis work, as a hobby or as a professional. As I was introduced, I am an assistant professor. I have around 15 years of experience teaching computer science students, and you can find this notebook at my GitHub URL. I have shared the URL here with all of you, and here are some references which I used to compile this notebook for you. To demonstrate the facilities in Python, Pandas and NumPy that I'll be talking about, I have used the Divorce Predictors data set, which is available on the UCI repository. You can find the link here and the corresponding reference here. So with that, we will start with the general steps involved in almost any data analysis project. The very first step is to go and collect your data set. We often find that the data set we get is not in exactly the format we are looking for, so you need to do a lot of initial transformation. First is just exploration. Then you need to do certain transformations to get your data set into a format where you can understand it and use it for your further tasks in the data analysis. So after you get your data set, you get it into one of the formats that can be easily integrated with Python or any other programming language that you might be using for your work. Here we are focusing on Python, and I'll be talking about Pandas, NumPy and a few other Python libraries.
So we'll see how to get our data from a CSV file or any other file format into a Pandas data frame. After you have your data frame ready, we begin the steps involved in data wrangling, which may involve handling your missing values, combining certain columns, dropping certain columns, or transforming certain columns. After this, we do some exploratory analysis of our data: looking at the trends in the data columns, finding the mean, median, mode, things like that. Once this phase is over, we begin something called feature engineering, which involves a more advanced look at your attributes and applying higher-level transformations to them, with the main aim of fitting a model or deriving meaningful insights from your data set. Of course, after this the next step is training your model and then drawing insights or making predictions, whatever may be the case for you. The data set we are going to use as an example here is the divorce predictors data set. This data set has 54 attributes. It was prepared by the authors of this paper by having respondents fill in a survey form. They were given 54 questions, and the respondents replied by giving each one a rank from zero to four, according to how strongly they agree or disagree with it. So these 54 attributes are somewhat like questions. Based on the responses of the people who took this survey, the authors have tried to predict whether the respondents will have a divorce or face the danger of having a divorce. So we downloaded this data set and looked at the way the attributes are presented there. This is a numerical data set, a simple data set. As I told you, this session is aimed at beginners, so I have not taken a very complicated data set. We will look at this data set as we go ahead, and we will be using NumPy and Pandas, some functionality of Matplotlib, and of course Seaborn.
NumPy is used primarily for numerical computing and scientific computing, for applying statistical functions and linear algebra functions on your data. Pandas is used for data which is in the form of tables, multivariate data. Matplotlib and Seaborn are two libraries which are used for 2D and 3D plotting in Python. We have scikit-learn, which is an excellent library for applying machine learning to your data. We also have SciPy and statsmodels for some advanced statistical computing on your data. So once you have your data set ready, and once you decide to work with Python and the libraries available in Python, the first step is to figure out which software, which IDE, you are going to use. In this session I'm demonstrating the example to you through a Jupyter notebook, which comes bundled with the Anaconda distribution of Python. Anaconda is a specialized distribution for data science related projects. A lot of the libraries which you require for data analysis in Python come pre-bundled with your Anaconda Python distribution, NumPy, Pandas, Matplotlib, etc. all being among those libraries, so we do not have to explicitly install them. Since I am using a Jupyter notebook in Anaconda, we directly import these libraries and we are all set to use them. So I imported NumPy, Pandas, Matplotlib and Seaborn. I will also be demonstrating a bit of model building, therefore I have imported scikit-learn as well, and since I'll be using logistic regression for demonstration purposes, I have imported the logistic regression module. I will come to these terms as we go ahead in the session. So I have this divorce CSV file, which contains 54 attributes and also a final label or final class, which has the predicted values for all the people who responded to this survey. There were around 170 people, so my data set has 170 rows. Now, Pandas has built-in functions read_csv and read_excel.
read_csv helps you to read a CSV file from your hard disk or from an Internet URL into a data frame. So df is the name of my data frame object here, and I can see its contents. I can do a preliminary exploratory analysis of my data by using functions and attributes like head, tail, shape, etc. Here I am interested in looking at the first five rows of my data. The default number of rows that the head function shows us is five, although you can set a number here and see as many rows as you might be interested in. So this is what my data frame consists of. The column headings look like this: Unnamed: 1, Unnamed: 2, Unnamed: 3, and so on, and the name of the first column is something like this. These are not very readable column names, so I think I need to do something here. I need to convert this into something which is more readable, something more meaningful. So that's exactly what we'll do. Here I have changed the contents of this table by skipping the first row, right? I'm not interested in this row. So I changed the contents of my data frame by just passing an extra argument to my read_csv function, which is called skiprows. What it did was skip reading this first row into my data frame. This is the output when I look at the contents of the new data frame. It's a lot better than the previous one, right? I do not have this row here, so attribute one, attribute two, and so on are now the names of my columns. This also is not very meaningful, but it's a little better compared to Unnamed: 1, Unnamed: 2, and so on. So we'll do a little more processing to get these into more meaningful, more comprehensible column names. The Pandas objects, both the data frame and the single-column data storage object, Series, provide a number of attributes and built-in functions which help make life easy for a data analyst.
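The effect of skiprows can be sketched with a small in-memory file. This is only an illustration: the stray first line, the column names and the values below are made up and stand in for the real divorce predictors file, whose exact layout is not shown in the talk.

```python
import io
import pandas as pd

# Hypothetical stand-in for the raw file: one extra line sits above the real header row.
raw = io.StringIO(
    "exported survey data,,\n"
    "Atr1,Atr2,Class\n"
    "2,2,1\n"
    "4,4,1\n"
    "0,1,0\n"
)

# Without skiprows, the extra line is taken as the header,
# so pandas has to invent "Unnamed: N" column names.
df_raw = pd.read_csv(raw)
print(df_raw.columns.tolist())

# Rewind the buffer and skip the first line so the real header row is used.
raw.seek(0)
df = pd.read_csv(raw, skiprows=1)
print(df.head())  # head() shows the first five rows by default
```

With a file on disk or a URL, the first argument to read_csv would be the path or address instead of the in-memory buffer.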
So we will be looking at some of them which are frequently used, especially at the beginning stages. The shape attribute helps us to see the number of rows and columns in the data frame. What I get here is that I have 170 rows and 55 columns: there are 54 attribute columns, and the 55th column is the class column, which shows me the predicted labels for each of the 170 respondents, each of the 170 rows in my data frame. I can check whether there are any missing values in any of the columns in my data frame. There's a built-in function, isna, for that. So df.isna() is to figure out whether there is any missing value in any of the columns, in any of the attributes of my table. I can chain together multiple functions, so on the output of isna I can also apply an aggregator called sum. The output here is telling me how many missing values there are in each of my columns. Incidentally, this data set has no missing values. None of the columns has any missing values, so the output you see is zero for all the columns. Here we are looking at the overall sum, so we have done a three-level chaining. First we are checking whether there is any null value in the data frame, then we are finding the sum of null values in each of the columns, and then we are applying a second-level sum to find the overall number of missing values in our data frame. So sum and mean are two very common aggregator functions, and there are a lot more. The NumPy library offers us a number of aggregator functions, including statistical summary functions, which can be applied either column-wise or row-wise, or even on the entire data frame in one go. Here we are working only on the attribute five column and we are trying to see how many values in this column are not null. We know that all values are present, so all 170 rows are not null, and the sum comes to 170.
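A minimal sketch of these checks follows, on a small made-up frame. Unlike the divorce predictors data, which has no missing values, one NaN is placed here on purpose so that the counts are not all zero; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration; the real data has 170 rows and 55 columns.
df = pd.DataFrame({
    "Atr1": [2, 4, 0, 3],
    "Atr5": [0, 1, np.nan, 4],  # one missing value, added deliberately
    "Class": [1, 1, 0, 0],
})

print(df.shape)  # (number of rows, number of columns)

# Chaining: isna() gives a True/False table, sum() counts the Trues per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# A second sum() collapses the per-column counts into one overall number.
total_missing = df.isna().sum().sum()
print(total_missing)

# How many values in a single column are present (not null).
present_in_atr5 = df["Atr5"].notnull().sum()
print(present_in_atr5)
```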
There's an attribute called columns which helps you see the names of all your columns. In the data frame there are 54 attributes, and the last column is the class value. We can sort the data frame based on any one column. Here we have sorted this data frame using the function sort_values. I'm sorting based on the values in the attribute five column, and I can specify whether I want it in ascending order or in descending order. In this case I want it in descending order. The values in attribute five will occur in descending order, and the values in the rest of the columns will be rearranged accordingly. From the output I just want to print the top five rows. So here's the output. We can see the row numbers have been shuffled in order to satisfy this descending order criterion on attribute five. We can also see the count of each unique value present in any particular column. Here I'm interested in attribute 45, and I want to find out how many times each unique value is present in that column. I can see here that four is present in the column 61 times, three is present 36 times, and so on. We can see that this output is not sorted by value. If I am interested in a sorted output of the value_counts function, I can apply an index-sorting function here. The result in this case appears in sorted order: I have the count of zeros first, the count of ones next, the count of twos next, and so on, so the output is ordered by the index. We have heard about outliers. We know that outliers affect the results of our machine learning pipelines and of our data analysis projects. So during data analysis we look for outliers and try to identify them early in order to handle them. One way of identifying your outliers is through plotting, or visual analysis. That is what I have tried to do in this cell, for attribute 45.
There is no specific reason for going with attribute 45. This is just random. Using the values in attribute 45, I have plotted a histogram. This histogram shows me how many respondents have responded with each particular value in column 45. So what we see here is something like this. I have five possible values in the column: zero, one, two, three and four. And here I am looking at the counts. How many people have responded with zero? How many with one, how many with two, and how many with three and four? Looking at the heights of these bars, we can see that a lot more people have responded with four for attribute 45, right? So this count is way beyond the range in which the counts for the rest of the four values lie. Looking at this histogram, it becomes a little easier for us to identify a value which is a bit extreme, which is a little beyond the range of the rest of the values for the same column in my data set. You can also compute the skew and kurtosis of any of the columns of your data frame. You can also compute them for your entire data set in one go. So here we have computed the skew for attribute 45, the kurtosis again for attribute 45, and then the skew for the entire data set. I can find that my attribute 45 is skewed left by 0.46, and there's a kurtosis of -1.2 in this column. Then I can find the skew value for each of the columns, all 54 columns. We can see the list of attributes again by using the columns attribute. Next is describe, a statistical summary function available in Pandas for a data frame object as well as for a Series object. It helps you to see the statistical summary values for every column. How many values are there in the column? What is the mean of all the values in that column? What is the standard deviation, the percentiles, the minimum value, the maximum value, and so on?
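The sorting, counting and summary calls described above can be sketched together on a tiny made-up frame. The values are invented, and the use of sort_index to order the counts by response value is an assumption about which function the notebook applies.

```python
import pandas as pd

# Made-up responses on the 0-4 scale; not the real survey data.
df = pd.DataFrame({
    "Atr5": [1, 4, 0, 3, 2, 5],
    "Atr45": [0, 3, 4, 4, 4, 2],
})

# Sort by one column in descending order and keep the top rows.
top = df.sort_values("Atr5", ascending=False).head(3)
print(top)

# Count how often each response occurs; the result is ordered by frequency.
counts = df["Atr45"].value_counts()
print(counts)

# Reorder the counts by the response value itself (assumed: sort_index).
counts_by_value = counts.sort_index()
print(counts_by_value)

# Skew and kurtosis for one column, then skew for every column at once.
col_skew = df["Atr45"].skew()
col_kurt = df["Atr45"].kurt()
all_skew = df.skew()
print(col_skew, col_kurt)

# describe() reports count, mean, std, min, quartiles and max per column.
summary = df.describe()
print(summary)
```

A histogram of the same column could be drawn with df["Atr45"].hist() when Matplotlib is installed; it is left out here so the sketch needs only Pandas.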
There is another function which gives you summary information about your data frame, which is called info. The info function tells you not only the column names, it also tells you how many non-null values there are in each column and what the data type of that particular column is. This is the info function. Both the describe and info functions help you to look at your data in a little bit of depth, in order to make decisions regarding further processing that you may need to do on your data set. We have the index attribute, which helps you to see the structure of the index of your data frame. In this case I can see my index is a number. It starts at zero and goes all the way up to 169. There are 170 rows in my data set, and the index increases by one after each row. Again, we have an attribute called T. This attribute helps you to transpose your data set, which means that all the columns that you have in the original data frame will be converted into rows and vice versa. This is a long data set, therefore I had to scroll down quite a bit. Then we have certain statistical summary functions. We have the mean function. It will find you the mean of every attribute, every column in your data frame. You can also find the mean row-wise by specifying the axis parameter to the mean function. You can also find the median of every column in the data frame. Again, this can be found for the rows as well by specifying the axis parameter. We can also find the standard deviation by using the std function. These functions are built into the data frame object. The NumPy library also has a lot of functions, and we can use NumPy functions as well for computing these statistics on our data frames. Right, the mode function: I'm computing the mode for each of my columns.
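As a rough illustration of info, index, T and the row-wise versus column-wise statistics, here is a sketch on a small made-up frame (four rows instead of 170; names and values are assumptions).

```python
import pandas as pd

# Made-up frame for illustration only.
df = pd.DataFrame({
    "Atr1": [2, 4, 0, 2],
    "Atr2": [2, 4, 1, 3],
    "Atr3": [4, 4, 1, 0],
})

df.info()          # column names, non-null counts and dtypes
print(df.index)    # a RangeIndex that starts at 0 and steps by 1

transposed = df.T  # columns become rows and vice versa
print(transposed)

col_means = df.mean()        # one mean per column (the default, axis=0)
row_means = df.mean(axis=1)  # one mean per row
col_medians = df.median()
col_std = df.std()
print(col_means, row_means, col_medians, col_std, sep="\n")
```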
Since this display is not very pleasant to the eye, I can compute a transpose of this result, and then I can see the mode of each of my columns displayed in a little more readable fashion. Now what I want to do here is see the mode of column 45, and I want it to be displayed like this. I want to make the index say mode, and I want to see exactly the value of the mode for column 45. So I have accessed column 45 like this, by using double brackets. This is called subsetting or indexing your data frame. Then I called the mode function on it. The value is stored here in mode. I'm renaming this first row, which is labelled zero, and then I'm displaying the result. So this is what it looks like. It's a lot more readable than the previous few cells we had. We can apply the describe function on individual columns. We saw how describe works for the entire data frame. It can work for individual columns as well. You can have the output of the describe function displayed in a little more readable manner if you set the names of the functions as the index for it. This indexing can be done by using this reset_index function. So the values that we are computing are the statistical functions that it is showing to us, and these functions have been put as the index for this data frame. Now we will be demonstrating a bit of feature engineering, a bit of model building, and so on. For that purpose, what I have done next is extract the class column of the data set and shift it into a separate data frame. So I accessed the class column of the data frame into a separate Series object called target. I converted that Series object into a data frame using the pd.DataFrame constructor. I can see the first five rows of this data frame using target.head(). I'm sure we have heard about this head function a lot since I started.
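One way the mode and describe displays could look in code is sketched below. The data is made up, and the exact arguments used in the notebook for renaming and for reset_index are not visible in the talk, so these calls are a plausible reading rather than the speaker's exact cells.

```python
import pandas as pd

# Made-up frame; each column has a single most frequent value.
df = pd.DataFrame({"Atr44": [0, 1, 1, 3], "Atr45": [4, 4, 0, 3]})

# Mode of every column, transposed so each column name becomes a row.
modes = df.mode().T
print(modes)

# Double brackets select a one-column data frame; the row labelled 0
# is then renamed to "mode" for a more readable display.
mode_45 = df[["Atr45"]].mode().rename(index={0: "mode"})
print(mode_45)

# describe() on a single column, with the statistic names moved
# out of the index into an ordinary column by reset_index().
stats_45 = df["Atr45"].describe().reset_index()
print(stats_45)
```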
So these are the classes for all 170 rows of my data frame, where the class is a binary variable. It has two values, zero and one, maybe one where the chances of a divorce are high and zero where the chances of a divorce are pretty low. So after we have extracted the class column from our data frame and moved it to a separate data frame, we can drop it from our original data frame. This is how we can remove one or more columns from a data frame. So df.drop: to the drop function we can specify a list of columns to be dropped. We can also specify whether this drop operation should take place along the columns or along the rows. Of course, in this case we want to drop a column, so the axis parameter takes the value of one. We want this drop to take effect on the same data frame, without storing the result in a separate data frame object. Therefore I have passed the parameter inplace here and set it to True. After this drop function, if we look at the names of the columns again, we'll see that they do not contain a column called class, which was present in the earlier data frame. What I've done next is that instead of using these column names Atr1 to Atr54, which do not convey much meaning to us, I have renamed these columns to some meaningful English words. The words which I've used here come from the questions that the authors have shared on the UCI website. So question number one was, does the discussion end when one of the spouses apologizes? Can I ignore differences with my spouse? And so on. Based on the questions, I decided to go ahead with these column names. Of course, when you are doing your own data science project you can decide your column names, keeping in mind that it's best to have column names which are meaningful. So once we rename the columns, we can go ahead and check the top five rows.
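Separating the label, dropping it and renaming columns might look like the sketch below. The frame is made up, and the new column names are hypothetical examples in the spirit of the survey questions, not the names used in the speaker's notebook.

```python
import pandas as pd

# Made-up frame with two attribute columns and the class label.
df = pd.DataFrame({"Atr1": [2, 4, 0], "Atr2": [2, 4, 1], "Class": [1, 1, 0]})

# Copy the label column into its own one-column data frame.
target = pd.DataFrame(df["Class"])
print(target.head())

# Drop the label from the features: axis=1 means "drop a column",
# inplace=True changes df itself instead of returning a new frame.
df.drop(["Class"], axis=1, inplace=True)

# Rename the remaining columns to something readable (hypothetical names).
df.rename(
    columns={"Atr1": "apology_ends_discussion", "Atr2": "ignore_differences"},
    inplace=True,
)
print(df.columns.tolist())
```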
And of course the column names have been replaced with names which convey a little bit of meaning to the reader. Now we can find the correlation of these columns with the target class. There's a function called corrwith which allows us to compute the correlation between two different data frames. So df.corrwith: I want to compute the correlation of every column in my data frame with the class, which is present in my target data frame. So here are the results. We can see that almost all the columns have a high and positive correlation: 0.86, 0.82, 0.8, 0.81 and so on. The exceptions are, I think, a few columns such as I don't have time and we are like strangers, which have a low correlation of around 0.4 and 0.54. There are a few more, columns 46, 47 and 48, which have a somewhat lower correlation compared to the other columns, which are pretty high at 0.8 and above. So by looking at these correlation values, you can make a decision about which columns are important for your further data analysis work. Correlation analysis is one of the easiest, and I think one of the first-level, methods of deciding upon the relevance of your columns for your data science related projects. Once you have identified the important columns, you can go ahead and train your models on only those columns. In this case, as yet, I have not discarded any column. I'm trying to fit a model on this data set as it is, without dropping any columns. What we generally do when we build a model is separate our data set into a training set and a test set. This helps prevent data leakage into the trained model, and it helps you to have a better validated model. For making this split, scikit-learn provides us a function called train_test_split. I have used that function here. I passed it the entire df and the class labels, which are in the target data set. I have used a split ratio of 0.7 and 0.3.
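The correlation step with corrwith can be sketched as follows, on made-up data. One detail is an assumption about the notebook: the label is passed as a single column (a Series), because corrwith matches two data frames column by column on their names, which would not line the attributes up against a differently named label column.

```python
import pandas as pd

# Made-up features: Atr1 and Atr2 follow the label closely, Atr6 does not.
df = pd.DataFrame({
    "Atr1": [0, 1, 3, 4, 4, 0],
    "Atr2": [1, 0, 4, 3, 4, 1],
    "Atr6": [2, 0, 1, 2, 1, 3],
})
target = pd.DataFrame({"Class": [0, 0, 1, 1, 1, 0]})

# Correlation of every feature column with the class label.
corr_with_class = df.corrwith(target["Class"])
print(corr_with_class)

# Ranking the values makes the weakly related columns easy to spot.
ranked = corr_with_class.sort_values(ascending=False)
print(ranked)
```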
That means 70% of my rows will go into the training data set and the remaining 30% will form the test data set. Once we have split our data into a training set and a test set, we can start building or fitting our model. Logistic regression is a model that can help us do binary classification very simply. I'm not demonstrating a complex model here. This is a very beginner-friendly model. So we are constructing an object of logistic regression and fitting it on the training data set. We have passed it the training data, and we are also passing it the labels for that training data. The score function, which is available in the LogisticRegression class, helps us to see the score or the accuracy of our model. So we will see what score we obtained with our original data frame. We'll also see the coefficients for all the columns once this model has been fit on them, and we'll see the value of the intercept generated by the logistic regression model we have just constructed. We can then also use the object of this model to make predictions on our test data. Almost all the models that we have in scikit-learn provide us these very standard functions: fit, score, predict and so on. We can also see the accuracy score, recall and precision. All these standard metrics which are used to test a classifier are given to us in the scikit-learn library's metrics module. So we will see the value of the accuracy score here, and finally we will see the confusion matrix for this particular model. We see that the accuracy score here is one, which means that it is 100% accurate for all the test rows. It has been able to predict values which are exactly the same as the values available in the data set. We see the coefficients for all 54 columns. For a few of them they are negative and small, and for a few of them they are positive. We see the value of the intercept and then we see the predictions. And finally this is our confusion matrix.
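A minimal sketch of the split, fit, score and predict sequence is given below. The data is synthetic and deliberately easy to separate, so the scores say nothing about the real divorce predictors data; the random_state value and the four column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: class 1 rows answer 3 or 4, class 0 rows answer 0 or 1.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
high = rng.integers(3, 5, size=(100, 4))
low = rng.integers(0, 2, size=(100, 4))
features = np.where(labels[:, None] == 1, high, low)

df = pd.DataFrame(features, columns=["Atr1", "Atr2", "Atr3", "Atr4"])
target = pd.Series(labels, name="Class")

# 70% of the rows go to training, 30% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.3, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

print(model.score(X_test, y_test))    # accuracy on the test rows
print(model.coef_, model.intercept_)  # one coefficient per column, plus intercept

pred = model.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))  # rows: true class, columns: predicted class
```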
I have 25 values which have been correctly classified into class one. We have 26 values which have been correctly classified into class zero, and none of the values have been misclassified in either of the classes. Next we'll look at some simple feature engineering. In the previous cell we fitted our model with 54 features. By looking at the features, we can see that a number of them can be combined together, and we can reduce the number of features. So that's exactly what I have done here. I have created ten new features, and these features are combinations of the features that were originally there in the data frame. For example, I created a feature we can ignore differences, which is based on features one, two, three and four in my original data set, which had 54 features. There are multiple transformations that can be applied for combining the columns in a given data frame. For this very simple example, I have just taken the sum of the values in these columns to construct the value in the new column. Similarly, this new column we don't have time at home is just the sum of the original values in attribute six and attribute seven of the original data frame. With this in mind, I have ended up with ten new columns, which are simply constructed by combining together the original columns, and I have reduced the dimensionality from 54 to 10. Now let us see, was I right in doing it? Did it impact the performance of the same logistic regression model? How does this feature engineering work for me? So we will now have this done in Python. I created a new data frame object, df1, here, and then I added columns to it. My new column we can ignore differences is nothing but the sum of the values in the corresponding columns of the original data frame. This is df1, and this is our previous data frame, df. We created ten new columns like this, applying very basic feature engineering here.
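The column-summing idea can be sketched on a tiny made-up frame. The grouping of attributes one to four and six to seven follows the talk; the values and the exact new column names are assumptions.

```python
import pandas as pd

# Made-up responses for six of the original attributes.
df = pd.DataFrame({
    "Atr1": [2, 4, 0], "Atr2": [2, 4, 1], "Atr3": [4, 4, 1],
    "Atr4": [1, 4, 0], "Atr6": [0, 1, 3], "Atr7": [0, 0, 2],
})

# New, smaller frame that shares the row index of the original.
df1 = pd.DataFrame(index=df.index)

# Each engineered feature is the row-wise sum of a group of original columns.
df1["ignore_differences"] = df[["Atr1", "Atr2", "Atr3", "Atr4"]].sum(axis=1)
df1["no_time_at_home"] = df["Atr6"] + df["Atr7"]

print(df1.head())
```

Summing is only one possible way of combining columns; a mean or a weighted score would be built the same way.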
And now we'll look at the top five rows of our data frame. Here they are, right? We have the ten new columns presented to us like this. Now we'll look at the correlation values of these columns with each other. I have also plotted a heat map here using the Seaborn library, just for the sake of demonstration, and also to indicate that when we are doing visual analysis of our data, both Matplotlib and Seaborn prove very handy to us. So this is a heat map here. The darker the cell, the lower is the value of correlation. The lighter the cell, the higher is the value of correlation. So what we find here is that if two columns correlate very closely, if the values of correlation are very high, either positive high or negative high, we can drop one of those two columns, right? Columns which are highly correlated with each other will have a similar impact on the model. During feature reduction or dimensionality reduction, we can either drop one of them, or we can drop all except one in a group of columns or attributes which are highly correlated with each other. So plotting a heat map of the correlation matrix like this helps us to easily identify which columns are redundant and can be dropped. Here we are looking at the correlation of each of the ten newly constructed attributes with the target class, the labels that have been given to us. We can see that of the ten columns I have only one which has relatively less correlation with the class, which is we don't have time at home. This has a correlation of 0.57 with the label. Look at the rest of the columns. They really have a high positive correlation with the class label, values around 0.9, such as 0.92, and so on. That means all of these columns which I have constructed now are very relevant to the class, to the prediction that we had in our original data set. So next we will just have a bar plot of these correlation values.
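The redundancy check that the heat map supports visually can also be done numerically. The sketch below uses made-up columns and a hypothetical threshold of 0.9 to list pairs of columns that are strongly correlated with each other; it is not taken from the speaker's notebook.

```python
import pandas as pd

# Made-up frame: "b" tracks "a" closely, "c" does not.
df1 = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

# Pairwise correlation matrix; this is the table a heat map would colour.
corr = df1.corr()
print(corr)

# Hypothetical cut-off above which two columns are treated as redundant.
threshold = 0.9

# Walk the upper triangle so each pair of columns is reported once.
redundant = [
    (left, right)
    for i, left in enumerate(corr.columns)
    for right in corr.columns[i + 1:]
    if abs(corr.loc[left, right]) > threshold
]
print(redundant)
```

With Seaborn installed, the same matrix could be passed to its heatmap function to get the coloured view described in the talk.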
For the correlation of each of these ten columns with the class label, we first compute the correlation once again, we get the names of all the columns in one more Series variable, and then we use the Matplotlib library to plot a visual representation of the correlation values we just saw in the previous cell. Only this column, we don't have time at home, has a lower correlation with the class labels. This is around 0.54, and every other column has a high correlation. So by a visual inspection of this, we can either drop this column, or decide to do some transformation on it, or combine it with some other column and get rid of it if required. Combining data together, summarizing data together, grouping data together also helps us to derive insights from our data and to know more about it. Pandas provides us three important functions which help you to group and summarize your data. There is groupby. We have a pivot table constructing function in Pandas, which is also very important for a data analyst, to summarize data and look at it from different angles by grouping it under different categories, multilevel categories. Then there's a crosstab function as well, which helps us to look at the frequency of different values in our columns. It also helps you to compute other aggregations if you are interested. So we'll look at the pivot table first. There's a function called pivot_table in our data frame object which can help you to create pivot tables. We'll look at this pivot table through an example where I'm not using the data frame created earlier. This is a new data frame. The name is example_df, and it is a very simple data frame that just has four columns: name, age, gender and salary. There are just four rows. So we are printing the head of this data frame. It has just four rows, and now we are constructing the pivot table.
What I'm interested in is grouping the rows of this data frame. These four rows I want to group first based on gender and then based on name. And once the data has been grouped, the values that I'm interested in looking at are the salary values, the values from the salary column. So I'm providing two column names here, therefore my pivot table will construct a two-level index. The first-level index will be based on the gender values, the second-level index will be based on the values in the name column, and then the data that will be displayed to me will come from the salary column. This is the output of this pivot_table function. We can see that there has been a first-level grouping based on the genders, female and male. And then there's a second level of grouping, or second-level index, which contains the names. Both of these names are from the gender female, both of these names are from the gender male, and then I have the values coming from the salary column for all these names, grouped under the two gender groups. If I'm not interested in looking at plain values, I can also apply some aggregator function on these values. You can apply a single aggregator function, or you can apply a set of functions. The aggfunc argument of the pivot_table function helps you to specify the aggregator you might be interested in applying on your data frame. So in example_df.pivot_table, specify which column you want to be used as the index column, specify what values you are interested in looking at, and you can specify what aggregator function you want to apply on the values. Here we have grouped the data according to just the gender, and for each gender I'm displaying the age, and not just the age but the mean value of age. So here I can see the mean age, which is the average of the ages of both the females and the average of the ages of both the males.
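The two pivot tables just described can be sketched as follows. The four-row frame is a reconstruction: only the column names and the names Ram and Anita come from the talk, while the other names, ages and salaries are invented for illustration.

```python
import pandas as pd

# Reconstructed example frame; most values are made up.
example_df = pd.DataFrame({
    "Name": ["Anita", "Ram", "Sita", "Mohan"],
    "Age": [24, 23, 30, 28],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Salary": [50000, 40000, 60000, 45000],
})

# Two index columns give a two-level index: gender first, then name.
by_gender_name = example_df.pivot_table(index=["Gender", "Name"], values="Salary")
print(by_gender_name)

# One index column plus an aggregator: mean age per gender.
mean_age = example_df.pivot_table(index="Gender", values="Age", aggfunc="mean")
print(mean_age)
```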
We can also specify an additional argument called columns in the pivot_table function. It helps you to display multiple columns, taken from the values of the column that you have specified here. You can also specify the aggregator you want to apply, and you might find that by applying this pivot table you generate certain missing values in the resulting pivot table. Those missing values can be replaced with zero, or any value that you want to specify, through this argument called fill_value. This is an argument of the pivot_table function. So here what I have done is that I have pivoted my data frame on the gender column. I have two rows here, one for the female group and another for the male group. I have specified the columns to come from the column called name in my data frame. Therefore my pivot table has four columns, one pertaining to each name in the name column. And these values are coming from the column called age. So we are looking at the ages of the two females in the female group and the two males in the male group. These other values were supposed to be NaNs, or missing values, because Ram is not a female and Anita is not a male. So these four values here were supposed to be NaNs originally. Since we have specified a fill_value of zero, the NaNs have been changed to zeros for me in the output. You can also apply more than one aggregator function, which we have done here. We have used NumPy sum and NumPy mean. So what the pivot_table function gives me here is the data grouped according to gender, and not only is it giving me the sum, it's also giving me the mean of the salary column. We have also specified a margins argument here. What this does is add an additional row at the bottom which covers all the entries in both the groups, right? The overall sum of the values in both the male group and the female group. We can apply a query on the result of a pivot table as well.
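A sketch of the columns, fill_value, multiple-aggregator and margins options follows, on the same reconstructed four-row frame (mostly made-up values). The aggregators are given by name as strings here, as a stand-in for the NumPy functions mentioned in the talk.

```python
import pandas as pd

# Reconstructed example frame; most values are made up.
example_df = pd.DataFrame({
    "Name": ["Anita", "Ram", "Sita", "Mohan"],
    "Age": [24, 23, 30, 28],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Salary": [50000, 40000, 60000, 45000],
})

# One row per gender, one column per name; cells with no data
# (for example Ram in the Female row) are filled with 0 instead of NaN.
ages = example_df.pivot_table(
    index="Gender", columns="Name", values="Age", fill_value=0
)
print(ages)

# Two aggregators at once, plus an "All" row covering both groups.
salary = example_df.pivot_table(
    index="Gender", values="Salary", aggfunc=["sum", "mean"], margins=True
)
print(salary)
```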
So we created a pivot table like this using both sum and mean. We stored the result in an object called pt. So this is my pt object, and I can fire a query on it where I'm interested only in one group, where the gender is female. So in pt.query, pt is the object of your pivot table. You can query for only one group, where the gender is female. So this is how we can query a single group from our pivot table. The groupby function works similarly to pivot_table, except that in some cases the output is not very readable and there's a limit to how many levels of index you can go to with groupby. This is a similar result to the one we got earlier with the pivot table. We are grouping the rows here by the value in the gender column, and then we are interested only in the female group. So this is what we get as output. There are two females, we have the ages, obviously the gender is female, and we have the salaries: the four columns we had in our original data frame. In addition to groupby and pivot_table, there is also a crosstab function, which by default will give you the frequency of every unique value in your table. The function has a syntax like this: you call it as pandas.crosstab and you pass the data frame, or the columns of the data frame, which you want to use to construct your cross table. So here I'm interested in constructing a cross table between the name column and the age column, so we can see, for each of the names and each of the values of age, how many people have that age. So I have only one person having age 24, only one person having age 23, and so on. We can also plot the results of this cross tab in the form of a heat map, where the same values which we saw here in the table are displayed to us in the form of heat, again for ease of readability and ease of getting an insight from the data. So here we have applied crosstab on our data frame df1, which is of reduced dimensionality: the ten columns that we constructed from the original 54 columns.
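The query, groupby and crosstab steps above might look roughly like this. It is a sketch under assumptions: the data frame is hypothetical, the pivot table uses a single aggregator to keep its columns flat, and get_group is one way to pick out the female group (the talk does not name the exact call it used).

```python
import pandas as pd

# Hypothetical four-row frame; values are made up for illustration.
df = pd.DataFrame({
    "name": ["Anita", "Sita", "Ram", "Shyam"],
    "gender": ["female", "female", "male", "male"],
    "age": [24, 23, 30, 28],
    "salary": [50000, 60000, 55000, 65000],
})

# Pivot table stored in an object called pt, then queried for one group.
# query() can refer to the index level "gender" by name.
pt = df.pivot_table(index="gender", values="salary", aggfunc="sum")
female_only = pt.query("gender == 'female'")

# groupby: group rows by gender, then keep only the rows of the female group.
females = df.groupby("gender").get_group("female")

# crosstab: by default, the frequency of each (name, age) combination.
counts = pd.crosstab(df["name"], df["age"])

print(female_only)
print(females)
print(counts)
```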
So I have tried to construct a cross tab here between two columns, the column "we can ignore our differences" and a second column. But this cross tab is not a straight cross tab. I have divided the values in "we can ignore our differences" into three bins, low, medium and high, and used these bins as the index for my cross tab, and then I have the values of the second column. I can see how the frequency of each of these values is divided among the three bins for "we can ignore our differences". This is the pivot table we have constructed on our reduced dimensionality data frame for divorce predictors, for two columns, "we can ignore our differences" and "we have similar values". So for every value of these two indexes we will get the corresponding values of all the remaining eight columns. Finally, we are trying to fit a model on this reduced dimensionality data set. So, as in the case of the original data set, I have split my data frame into a test portion and a training portion. We have trained this model on df1, which is our reduced dimensionality table. We have the same classifier, a logistic regression classifier, the same value of random state, and the same training-testing split ratio. So we will see that the value of the score, the value of accuracy, remains just the same. So by reducing the number of columns from 54 to just ten, we have not lost any accuracy. So what we understand from this is that we can reduce the dimensionality, we can reduce the computational complexity and the storage space requirement. We can also reduce the complexity of the model by intelligently identifying just a subset of the original columns and going ahead and building our models with them. We have plotted the Seaborn heat map again for the accuracy score obtained for each of the two classes. This is the confusion matrix plotted as a heat map here. Now we see the effect of dropping some columns randomly from our df1, which already has a reduced dimensionality of just ten.
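The idea of refitting the same classifier on fewer columns and comparing accuracy can be sketched as below. This is only an illustration under stated assumptions: the data is synthetic and is not the Divorce Predictors data set, the column names are hypothetical, and the split ratio and random_state values are placeholders because the talk does not state the ones it used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a ten-column reduced data frame (NOT the real data).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

data = {}
# Three informative columns: answers 0-1 for class 0, answers 3-4 for class 1.
for i in range(3):
    data[f"informative_{i}"] = 3 * y + rng.integers(0, 2, size=n)
# Seven noise columns that are unrelated to the label.
for i in range(7):
    data[f"noise_{i}"] = rng.integers(0, 5, size=n)
df1 = pd.DataFrame(data)


def fit_and_score(features: pd.DataFrame, target: np.ndarray) -> float:
    """Fit logistic regression on a fixed split and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.25, random_state=42
    )
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


informative = [c for c in df1.columns if c.startswith("informative")]

acc_all = fit_and_score(df1, y)                              # all ten columns
acc_subset = fit_and_score(df1[informative], y)              # informative only
acc_noise = fit_and_score(df1.drop(columns=informative), y)  # informative dropped

print(acc_all, acc_subset, acc_noise)
```

Keeping the split and the classifier settings fixed across runs is what makes the accuracies comparable: any change in the score then comes from the choice of columns alone.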
So we dropped a single column. Here we are checking whether there are any duplicated values in the data frame. So I'm seeing a False here, which means there's no duplicated value. And after dropping this column I have fitted the same model once again, and I see that by dropping one more column I also do not see any change in the accuracy at all. Finally, I dropped a few more columns and I ended up with one, two, three, four, five, six, only six columns. I fit a model on these six columns and I see that the score still continues to be the same. Which essentially means that these six columns are sufficient predictors for predicting whether a couple is going to face the chance of a divorce in future or not. So after this, what I did is I tried to drop further from six. I tried to reduce further, and I figured out that this reduces the accuracy from 1 to 0.8. What it helped me to figure out is that the columns that I dropped last, these three columns, are the essential columns. They have an impact on the performance of my model, and I should go back and keep them in my final set of selected features.
So with that I have come to the end of the time allocated to me, and I have also come to the end of the notebook. Just a quick conclusion. Python is an excellent tool for everybody who is interested in joining the data analysis task force, either for your project work at school level, at university level, or in the professional field. It has excellent support in terms of resources online and in terms of developers who can support you on forums like Stack Exchange or Stack Overflow. You can do small projects as well as enterprise-level projects using Python. And of course, there is excellent community support. There's an excellent set of libraries, and those libraries are continuously being added to and enhanced. So all the best to everybody who is about to pick up Python for analysis. I hope it works well for you. Thank you so much. Any questions here? Yeah.
Yeah, everyone was enjoying that. So at the moment I have one question that is anonymous. So let me just read it out for you. I'm currently a beginner learner in data science and ML and thought of starting a recommender system project. Do you have any advice on what I should be paying attention to and resources to use?
This is a great question, and I really would like to congratulate you for choosing Python and building your project on recommender systems using Python. For anybody who is starting with Python data science, whether they are building an NLP-based project or a recommender system or a general machine learning predictor, it is very important to begin your data science journey with the preprocessing steps: understanding the data sets, understanding linear regression. I would say, first of all, after understanding your data, after understanding the data formats and all, you should try to understand linear regression. And if I am allowed, even before that, even before understanding linear regression and simple relations between data, we should try to brush up our backgrounds in probability, statistics and a bit of linear algebra. My experience as a teacher has been that when we teach our students to build models, we have to go back and teach them the equation of a line. We have to go back and teach them what the meaning of coefficients is, what the meaning of intercepts is, and all of that. So if you are beginning, and I'm sure you will do great, before going head-on into models and their complexity, please brush up on your concepts in probability, statistics, linear algebra and calculus. Along with that, read up on data analysis basics. What are the different formats of data? How do we do preprocessing? What are the different kinds of models? What are the features of all of those models? What kinds of models are suitable for what kinds of data? And of course for recommendation systems.
A lot of people use collaborative filtering, where they get data sets which have been created by crowdsourcing and then they try to apply some kind of filtering to figure out what most people have recommended, and things like that. So for any data science student, the beginning is always in probability, linear algebra, calculus and statistics, then programming, a lot of practice, and then exploring libraries like pandas, NumPy and sklearn.
No doubt. Thank you so much. That would be really helpful. Thank you. We have another question from Louisa. No problem, Louisa, I can read it out here. How are age and income related to the ten attributes the model ends up using for the prediction? Not sure if it was because I missed a part of the presentation, but I didn't understand how both things are related.
That's a very valid question, Louisa, thank you for asking. Actually, the reduced data frame which had ten columns did not have age or income as a column. Age and income were two columns in a separate data frame which was used to demonstrate pivot table, cross table and group by. That table had only four columns: name, age, gender and income. That had nothing to do with divorce predictors. Does that answer your question?
Yeah, she messages "Perfect." Thank you. All right. So I don't think we have any other questions. We still have one minute with us. I'm just sharing with you that it's around 12:00 a.m. in Nimrita's time zone and she's so excited. I'm so impressed by the energy at midnight. Thank you so much. No doubt. Thank you, everyone. Thank you so much. Yeah. So for everyone, please join the other sessions. You can go to the stage and have the other sessions as well. We have networking, exposition. Yeah. All right. Thank you so much. See you later. Bye.