CHENNAI HOUSE PRICE PREDICTION PROBLEM (PREDICTIVE ANALYSIS-1)



PROBLEM STATEMENT

Chennai House Price Prediction (Regression)

A real estate firm wants to develop a model that accurately predicts real estate prices for houses in Chennai, based on the firm's past transactions.

1. Hypothesis Generation:-                

  • Surrounding area/Location of the house

  • Number of Rooms/Floors in the house

  • Is the house a part of a society

  • Condition of the house

2. Understanding the variable:-

  • House Features

  • INT_SQFT

  • N_BEDROOM, N_BATHROOM, N_ROOM

  • QS_ROOMS, QS_BATHROOM, QS_BEDROOM, QS_OVERALL

  • SALE_COND

  • BUILDTYPE

3. Surrounding and locality

  •  AREA

  • DIST_MAINROAD

  • PARK_FACIL

  • UTILITY_AVAIL

  • STREET

  • MZZONE

4. House Sale price

  • PRT_ID

  • COMMIS

  • SALES_PRICE

SOLUTION

Let us start by loading the data set.
We first load the data set, then check its shape and display the first few rows.
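The loading step can be sketched roughly as follows. The real file is not shown in this post, so the sketch reads a tiny in-memory CSV instead; the rows and their values are made up for illustration, and only the column names come from the variable list above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the firm's CSV file. In practice you would pass
# the real file path to pd.read_csv instead of this in-memory sample.
sample_csv = io.StringIO(
    "PRT_ID,AREA,INT_SQFT,N_BEDROOM,N_BATHROOM,N_ROOM,SALES_PRICE\n"
    "P001,Karapakkam,1004,1,1,3,7600000\n"
    "P002,Anna Nagar,1986,2,1,5,21717770\n"
    "P003,Adyar,909,1,1,3,13159200\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows of the data set
```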

Data Exploration:-


By default the describe function works only on continuous variables. By comparing each column's count with the shape of the table we can find the columns with missing values, and the minimum and maximum values hint at the presence of outliers.
Passing include='all' to describe covers every variable; for categorical variables it also reports the number of unique values and the most frequent value.
The isnull function returns True wherever a value is missing and False otherwise, and .sum() then gives the number of missing values in each column.
To check the data types of all the variables in the data set we use the dtypes attribute.
From the data types we can also tell which variables are categorical and which are continuous.
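Here is a small sketch of these exploration calls, run on a made-up four-row frame rather than the real data set (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data with one missing bedroom count; values are invented.
df = pd.DataFrame({
    "INT_SQFT": [1004, 1986, 909, 1855],
    "N_BEDROOM": [1.0, 2.0, np.nan, 1.0],
    "AREA": ["Karapakkam", "Anna Nagar", "Adyar", "Adyar"],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique/top/freq for categoricals
missing_per_column = df.isnull().sum()     # number of missing values per column
column_types = df.dtypes                   # data type of each column

print(numeric_summary)
print(missing_per_column)
print(column_types)
```

A count in `numeric_summary` that is lower than `df.shape[0]` points to missing values in that column.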

I created a new data frame called temp to put all of this into a summary. It stores the data type, null count, and unique count of each variable, and printing it shows these three values side by side. We can see that the bedroom and bathroom counts are stored as float, which makes them look like continuous variables, but their unique counts show that they take only a few whole-number values.
This confirms that the bathroom variable is stored as float even though it holds discrete counts, so we will change its data type during data manipulation.
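One possible way to build such a summary frame is sketched below. The name `temp` and the three column labels follow the description above; the toy data is made up.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
df = pd.DataFrame({
    "N_BEDROOM": [1.0, 2.0, 1.0, 1.0],
    "N_BATHROOM": [1.0, 2.0, np.nan, 1.0],
    "AREA": ["Adyar", "Anna Nagar", "Adyar", "Karapakkam"],
})

# One row per variable, one column per summary statistic.
temp = pd.DataFrame(index=df.columns)
temp["data_type"] = df.dtypes
temp["null_count"] = df.isnull().sum()
temp["unique_count"] = df.nunique()  # NaN is not counted as a value

print(temp)
```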
That is the overall view of our data; now let us do a univariate analysis.
We plot a histogram of our target variable. The plot is slightly right skewed, with a few outliers on the right side.
We set a threshold of 1.8 crore and treat any price above 1.8 crore as an outlier.
With the outliers excluded the distribution looks close to normal, whereas the histogram above showed a clear right skew.
We then take another continuous variable, the interior area in square feet (INT_SQFT), and plot its histogram in the same way.
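The threshold step can be sketched as below. This assumes the outliers are simply filtered out before plotting, which the post does not spell out. The prices are made up, and the sketch computes histogram bins with NumPy instead of drawing the plot so that it runs without a plotting backend.

```python
import numpy as np
import pandas as pd

# Invented sale prices in rupees; 1 crore = 10,000,000.
df = pd.DataFrame({"SALES_PRICE": [
    6_500_000, 8_200_000, 9_900_000, 11_000_000,
    12_500_000, 14_000_000, 19_500_000, 23_000_000,
]})

THRESHOLD = 18_000_000  # 1.8 crore

skew_before = df["SALES_PRICE"].skew()          # positive means right skewed
trimmed = df[df["SALES_PRICE"] <= THRESHOLD]    # drop prices above the threshold
skew_after = trimmed["SALES_PRICE"].skew()

# Bin counts that a histogram of the trimmed prices would display.
# trimmed["SALES_PRICE"].plot.hist(bins=5) would draw them (needs matplotlib).
counts, bin_edges = np.histogram(trimmed["SALES_PRICE"], bins=5)

print(skew_before, skew_after)
print(counts)
```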


Now we check a few of the categorical variables, starting with the number of bedrooms. The raw value counts alone do not tell us how the data is distributed, so we divide them by the length of the data to get percentages, which are easier to interpret.
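A short sketch of turning value counts into shares, on invented bedroom counts:

```python
import pandas as pd

# Invented bedroom counts for eight houses.
df = pd.DataFrame({"N_BEDROOM": [1, 1, 2, 1, 2, 3, 1, 2]})

counts = df["N_BEDROOM"].value_counts()  # raw number of houses per category
share = counts / len(df)                 # fraction of all houses per category

# pandas can also do the division itself:
share_alt = df["N_BEDROOM"].value_counts(normalize=True)

print(share)
```

Multiplying `share` by 100 gives the percentages.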

Next we take the number of rooms (N_ROOM) and generate its value counts, which tell us how the number of rooms is distributed in the data.
For another variable we also generate a bar plot of the value counts.
Here the variable has only a few categories; when a variable has many categories, a bar plot makes the counts easier to read.
So this time I take a variable with a larger number of categories. Our AREA variable has 17 categories, but many of them are just spelling mistakes; there are really only 7 main categories.
We need to fix all of these during data manipulation.


I then took the parking facility variable. Approximately the same number of houses have parking as do not. We also see a spelling error in this variable; it affects only 2 rows, and we will correct it during data manipulation.

Data Manipulation

Now we deal with all the errors we found in the data set:
1. Drop duplicates
2. Fill the missing values
3. Correct the data types
4. Fix the spelling errors in the variables
First we drop all the duplicate rows from the data set.

After dropping the duplicates we check the shape of the data set. The shape has not changed, so the data set had no duplicate rows. The drop_duplicates function drops a row when all of its values match those of an earlier row; by default the first occurrence is kept and the later ones are deleted. We can also pass a subset of columns, in which case rows are dropped when they repeat the values in just those columns. The function does not change our data frame by default; to modify it in place we need to pass inplace=True (the boolean, not the string 'True').
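The behaviour described above can be seen on a made-up frame with one repeated row:

```python
import pandas as pd

# Toy data: the second and third rows are exact duplicates.
df = pd.DataFrame({
    "PRT_ID": ["P001", "P002", "P002", "P003"],
    "AREA": ["Adyar", "Anna Nagar", "Anna Nagar", "Adyar"],
    "SALES_PRICE": [9_000_000, 15_000_000, 15_000_000, 9_500_000],
})

# Returns a new frame; df itself is left unchanged.
deduped = df.drop_duplicates()

# Only compare the AREA column; keep the first row seen for each area.
by_area = df.drop_duplicates(subset=["AREA"], keep="first")

# Modify a frame in place; note the boolean True, not the string 'True'.
df_copy = df.copy()
df_copy.drop_duplicates(inplace=True)

print(df.shape, deduped.shape, by_area.shape, df_copy.shape)
```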


After dropping the duplicates we need to treat the missing values, so we run the isnull function again to check for them.

Different ways to deal with missing values

  • Remove the rows with missing values
  • Fill with the mean or median in the case of a continuous variable
  • Fill with the mode in the case of a categorical variable
  • Fill using other independent variables
First we try deleting the rows with missing values.
Dropping rows or columns with missing values loses some information, so rather than dropping we prefer imputing the missing values.
For our first variable we check the mode and then fill the missing values with it using the fillna function. This time we use inplace=True so that the change is made in our original data set.
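A minimal sketch of both options, assuming the first variable is the bedroom count (the post does not name it) and using invented values. The fill is done through the frame-level fillna with a dict, which is one way to make inplace=True reliably update the original frame.

```python
import numpy as np
import pandas as pd

# Toy data with one missing bedroom count.
df = pd.DataFrame({
    "N_BEDROOM": [1.0, 2.0, np.nan, 1.0, 1.0],
    "INT_SQFT": [1004, 1986, 909, 1855, 1220],
})

# Option 1: drop the row; its INT_SQFT value is lost along with it.
dropped = df.dropna()

# Option 2: impute with the mode (most frequent value) of the column.
bedroom_mode = df["N_BEDROOM"].mode()[0]
df.fillna({"N_BEDROOM": bedroom_mode}, inplace=True)

print(len(dropped), len(df))
print(df["N_BEDROOM"].isnull().sum())
```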

We fill the missing bathroom counts according to the number of bedrooms in the house: 1 for houses with one bedroom and 2 for houses with more than one bedroom.
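The rule above could be written with boolean masks, as in this sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Toy data: two houses are missing their bathroom count.
df = pd.DataFrame({
    "N_BEDROOM": [1, 2, 1, 3],
    "N_BATHROOM": [1.0, np.nan, np.nan, 2.0],
})

missing = df["N_BATHROOM"].isnull()

# One bedroom -> assume 1 bathroom; more than one bedroom -> assume 2.
df.loc[missing & (df["N_BEDROOM"] == 1), "N_BATHROOM"] = 1
df.loc[missing & (df["N_BEDROOM"] > 1), "N_BATHROOM"] = 2

print(df)
```

Rows that already had a bathroom count are left untouched because the mask selects only the missing ones.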


For the overall quality score we use the other independent variables. We first check the head of these variables and then impute each missing value with their average.
We check the shape of the data and the number of missing values in the column. We then define our own fillna function that returns the average of the three quality score variables, and write a one-line lambda function that calls it wherever the overall score is missing and otherwise returns the original value. Finally we run the isnull function to check whether all the missing values are filled, and we can see that none remain.
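A sketch of this imputation follows. The helper is named `fill_overall` here (a hypothetical name, chosen so it is not confused with the pandas fillna method), and the scores are made up.

```python
import numpy as np
import pandas as pd

# Toy quality scores; two overall scores are missing.
df = pd.DataFrame({
    "QS_ROOMS": [4.0, 3.0, 2.4],
    "QS_BATHROOM": [3.0, 4.2, 3.6],
    "QS_BEDROOM": [5.0, 2.4, 3.0],
    "QS_OVERALL": [4.1, np.nan, np.nan],
})


def fill_overall(row):
    """Return the average of the three room-level quality scores."""
    return (row["QS_ROOMS"] + row["QS_BATHROOM"] + row["QS_BEDROOM"]) / 3


# Use the average only where QS_OVERALL is missing; keep existing values.
df["QS_OVERALL"] = df.apply(
    lambda row: fill_overall(row) if pd.isnull(row["QS_OVERALL"]) else row["QS_OVERALL"],
    axis=1,
)

print(df["QS_OVERALL"].isnull().sum())
```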

Next we come to the data types. We need to change the data types of N_BATHROOM, N_BEDROOM, and N_ROOM, so we convert them to object using the .astype function, passing a dict that maps each column name to its new data type. If we wanted int instead, we would simply replace object with int. In short, astype is used to change the data type of a column.
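Both conversions are sketched below on invented values. Converting to int only works once the missing values have been filled, since NaN cannot be cast to an integer.

```python
import pandas as pd

# Toy data: counts stored as float, with no missing values left.
df = pd.DataFrame({
    "N_BEDROOM": [1.0, 2.0, 1.0],
    "N_BATHROOM": [1.0, 1.0, 2.0],
    "N_ROOM": [3.0, 5.0, 4.0],
})

# Dict of column name -> target data type.
as_object = df.astype({"N_BEDROOM": "object", "N_BATHROOM": "object", "N_ROOM": "object"})

# Same call with int instead of object.
as_int = df.astype({"N_BEDROOM": int, "N_BATHROOM": int, "N_ROOM": int})

print(as_object.dtypes)
print(as_int.dtypes)
```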
Now we are going to replace the misspelled category names. First I print the value counts for all the categories of each variable.


We generated the value counts of every variable. Through these value counts we can analyse our data more efficiently and merge all the spelling variants into the main category.

As we can see, there is a spelling mistake in the parking column. We can easily merge it into the correct category using the .replace function, which can replace any value with another.
We fix all the duplicate categories in the other variables using the same process.
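The fix can be sketched as follows. The misspellings shown here ('Noo', 'Karapakam', 'Ana Nagar') are hypothetical examples, since the post does not list the actual variants, and the column name PARK_FACIL is an assumption.

```python
import pandas as pd

# Toy data with hypothetical spelling variants.
df = pd.DataFrame({
    "PARK_FACIL": ["Yes", "No", "Noo", "Yes"],
    "AREA": ["Karapakkam", "Karapakam", "Anna Nagar", "Ana Nagar"],
})

# Map each misspelled value to the main category it belongs to.
df["PARK_FACIL"] = df["PARK_FACIL"].replace({"Noo": "No"})
df["AREA"] = df["AREA"].replace({
    "Karapakam": "Karapakkam",
    "Ana Nagar": "Anna Nagar",
})

print(df["PARK_FACIL"].value_counts())
print(df["AREA"].value_counts())
```

Printing the value counts again afterwards is a quick way to confirm that only the main categories remain.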

BIVARIATE ANALYSIS


First I print all the columns present in our data set and list down some hypotheses about which independent variables are going to affect the dependent variable.
Now we are going to check our hypotheses.
Our first hypothesis is that an increase in interior area increases the price of the house, and the plot shows a clear relationship between the two.
In the graph above, the points are coloured by the purpose of the building. We can easily see that houses for commercial purposes have a higher price than the others, while houses meant for residence have a comparatively low price.
Our second hypothesis is that an increase in bedrooms and bathrooms increases the house price. In this pivot table we use the median of the house prices. We also get NaN in some places, which is valid because some combinations of bedrooms and bathrooms do not occur in the data.
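A pivot table of median prices by bedrooms and bathrooms might look like this sketch; the rows are invented, and they are chosen so that some combinations are absent and show up as NaN.

```python
import pandas as pd

# Toy data: no house has 1 bedroom with 2 bathrooms, or 3 bedrooms with 1.
df = pd.DataFrame({
    "N_BEDROOM": [1, 1, 2, 2, 2, 3],
    "N_BATHROOM": [1, 1, 1, 2, 2, 2],
    "SALES_PRICE": [7_000_000, 9_000_000, 11_000_000,
                    12_000_000, 14_000_000, 16_000_000],
})

pivot = df.pivot_table(
    values="SALES_PRICE",
    index="N_BEDROOM",     # rows: number of bedrooms
    columns="N_BATHROOM",  # columns: number of bathrooms
    aggfunc="median",      # median price in each cell
)

print(pivot)
```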
 
We now take another important variable, the overall quality score, to check whether it affects the sale price. Surprisingly, we do not find any relationship between the quality score and the sale price. All the graph tells us is that very few houses have a high price, which affects the density of the plot; otherwise there is no linear relationship between the two.
Next we plot the quality scores of the bedrooms, bathrooms, and rooms to check whether they affect the sale price of the houses.
The resulting plots are again surprising: these quality score variables do not affect the house price either, and there is no particular linear relation between them and the price.
This time we draw box plots of the same variables, and we can see that the distributions of all three quality scores are practically the same.
To check the building type hypothesis we use the groupby function and take the median price for each type of building. We can easily see that commercial buildings are priced higher than the other types.
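The groupby step is sketched below. The column name BUILDTYPE and the category labels are assumptions, and the prices are made up.

```python
import pandas as pd

# Toy data; category labels and prices are illustrative only.
df = pd.DataFrame({
    "BUILDTYPE": ["Commercial", "House", "Others",
                  "Commercial", "House", "Others"],
    "SALES_PRICE": [15_000_000, 9_000_000, 10_000_000,
                    17_000_000, 10_000_000, 11_000_000],
})

# Median sale price for each building type.
median_by_type = df.groupby("BUILDTYPE")["SALES_PRICE"].median()

print(median_by_type.sort_values(ascending=False))
```

The same pattern works for any categorical variable, for example the parking facility.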
First we plot a histogram for the commercial houses in Anna Nagar. We can clearly see that there are no houses priced below 1.60 crore, and most of them have very high sale prices.
I do the same for the 'House' building type. Most of these houses have a high price, but it is still below 1.5 crore. That is all about the house-related features.
Here we again use the groupby function to check whether the sale price is affected by the parking facility.
We can see a clear difference: houses with a parking facility have a higher price than houses without one.
Instead of only comparing the numbers, we also make a plot, which shows the difference between houses with and without parking more clearly.
Here we create a pivot table to compare the price across areas.
NOTE:- A pivot table and the groupby function do almost the same thing; the pivot table just offers more options.
We can easily see that T Nagar has the highest price and Karapakkam has the lowest. Now we plot the sale price for different areas.
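To illustrate the note above, this sketch computes the same per-area summary both ways. The median is an assumed choice of aggregate (the post does not say which one was used for areas), and the prices are invented.

```python
import pandas as pd

# Toy data; prices are illustrative only.
df = pd.DataFrame({
    "AREA": ["T Nagar", "T Nagar", "Karapakkam", "Karapakkam", "Anna Nagar"],
    "SALES_PRICE": [14_000_000, 16_000_000, 6_000_000, 8_000_000, 13_000_000],
})

# Pivot table: returns a DataFrame with one row per area.
by_pivot = df.pivot_table(values="SALES_PRICE", index="AREA", aggfunc="median")

# groupby: returns a Series with the same numbers.
by_group = df.groupby("AREA")["SALES_PRICE"].median()

print(by_pivot)
print(by_group.idxmax(), by_group.idxmin())
```

The pivot table becomes the more convenient of the two once a second variable is added through its `columns` argument.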
For Karapakkam the values are low: all the prices are below 1.4 crore.
We can do the same for Anna Nagar.
Here many houses are priced above 1.5 crore, in contrast to the Karapakkam prices.
