With a suitable classification model, we can predict whether a house was built before 1980 or in 1980 or later. I also show the accuracy score of the model I chose for this case. It is interesting to see how scikit-learn provides the tools to train and test ML models.
Read and format project data
# Include and execute your code here
import pandas as pd

url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
dwellings_ml = pd.read_csv(url)
dwellings_ml.head()
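| | parcel | abstrprd | livearea | finbsmnt | basement | yrbuilt | totunits | stories | nocars | numbdrm | ... | arcstyle_THREE-STORY | arcstyle_TRI-LEVEL | arcstyle_TRI-LEVEL WITH BASEMENT | arcstyle_TWO AND HALF-STORY | arcstyle_TWO-STORY | qualified_Q | qualified_U | status_I | status_V | before1980 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00102-08-065-065 | 1130 | 1346 | 0 | 0 | 2004 | 1 | 2 | 2 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 00102-08-073-073 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 00102-08-078-078 | 1130 | 1346 | 0 | 0 | 2005 | 1 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 00102-08-081-081 | 1130 | 1146 | 0 | 0 | 2005 | 1 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 00102-08-086-086 | 1130 | 1249 | 0 | 0 | 2005 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |

5 rows × 51 columns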
QUESTION|TASK 1
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
I chose to display 3 charts comparing the year built with the selling price, the number of bathrooms, and the number of bedrooms. My assumption is that these comparisons can help us understand which columns could be used to train the model and how the data is distributed over the years.
import plotly.express as px

# Relationship between the year built and the sell price
fig1 = px.scatter(
    dwellings_ml, x='yrbuilt', y='sprice', color='sprice',
    title='Sell Price and Year Built',
    labels={'yrbuilt': 'Year Built', 'sprice': 'Sell Price'})

# Relationship between the year built and the number of bathrooms
baths_count = dwellings_ml.groupby(['yrbuilt', 'numbaths']).size().reset_index(name='count')
fig2 = px.bar(
    baths_count, x='yrbuilt', y='count', color='numbaths',
    title='Number of Bathrooms vs. Year Built',
    labels={'yrbuilt': 'Year Built', 'count': 'Number of Houses', 'numbaths': 'Number of Bathrooms'})

# Relationship between the year built and the number of bedrooms
bedrooms_count = dwellings_ml.groupby(['yrbuilt', 'numbdrm']).size().reset_index(name='count')
fig3 = px.bar(
    bedrooms_count, x='yrbuilt', y='count', color='numbdrm',
    title='Number of Bedrooms vs. Year Built',
    labels={'yrbuilt': 'Year Built', 'count': 'Number of Houses', 'numbdrm': 'Number of Bedrooms'})

fig1.show()
fig2.show()
fig3.show()
QUESTION|TASK 2
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
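I tried a Decision Tree Classifier and a Random Forest Classifier, both with default settings apart from random_state=42, and both were good at predicting whether a house was built before 1980. Before training, I dropped parcel, which is an arbitrary identifier with no relation to the year built, and yrbuilt, which gives the answer away because before1980 follows directly from it. On the test set the Random Forest Classifier reached about 92.6% accuracy against about 90.4% for the Decision Tree Classifier, so I decided that the Random Forest Classifier would work best.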
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Columns to drop to prepare training and test data
features_to_drop = ['parcel', 'abstrprd', 'before1980', 'yrbuilt']
X = dwellings_ml.drop(columns=features_to_drop)
y = dwellings_ml['before1980']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Show the shape of the datasets to confirm the split
# (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

# Calculate accuracy
accuracy_score(y_test, y_pred)
0.9039860343322665
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the model on the training data
rf_classifier.fit(X_train, y_train)

# Predict on the testing data
y_pred_rf = rf_classifier.predict(X_test)
accuracy_score(y_test, y_pred_rf)
0.9262438172825138
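The reason yrbuilt has to be left out of the features can be shown with a minimal sketch. It assumes, based on the task wording, that before1980 is 1 exactly when yrbuilt is below 1980. The rows are invented for illustration and only the column names come from the dataset.

```python
# Invented toy rows; only the column names come from the dataset.
# Assumption: before1980 == 1 exactly when yrbuilt < 1980.
rows = [
    {"yrbuilt": 1955, "before1980": 1},
    {"yrbuilt": 1979, "before1980": 1},
    {"yrbuilt": 1980, "before1980": 0},
    {"yrbuilt": 2004, "before1980": 0},
]

# A "model" that only looks at yrbuilt needs no training at all.
predictions = [1 if row["yrbuilt"] < 1980 else 0 for row in rows]
labels = [row["before1980"] for row in rows]

# Fraction of rows where the rule matches the label
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(rows)
print(accuracy)
```

A feature that reproduces the target this directly would make any accuracy score meaningless, which is why it is excluded before the split.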
QUESTION|TASK 3
Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.
The most important features selected by the Random Forest Classifier are the area the houses belong to, the architectural style (such as being one-story), and the number of bathrooms. The model relied mostly on these columns to make its predictions, which helped it reach about 92% accuracy.
# Extract feature importances from the model
feature_importances = rf_classifier.feature_importances_

# Create a DataFrame for visualization
features_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance
features_df = features_df.sort_values(by='Importance', ascending=False)

# Visualizing the most important features
fig = px.bar(
    features_df.head(5), x='Importance', y='Feature', orientation='h',
    title='5 Most Important Features in Predicting Year Built',
    labels={'Feature': 'Feature', 'Importance': 'Importance Score'})
fig.show()
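The feature_importances_ values plotted here are impurity-based: scikit-learn adds up how much each feature's splits reduce impurity (Gini by default), averages over the trees, and normalizes the result to sum to 1. The sketch below computes that reduction for a single split in plain Python. The helper names and the toy values are invented for illustration and are not taken from the dataset or from scikit-learn's internals.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p0^2 - p1^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(feature, labels, threshold):
    """Drop in Gini impurity from splitting rows on feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted_children


# Invented toy columns (hypothetical values, 8 houses)
stories = [1, 1, 1, 1, 2, 2, 2, 2]
numbaths = [1, 2, 1, 2, 1, 2, 1, 2]
before1980 = [1, 1, 1, 0, 0, 0, 0, 1]

stories_gain = impurity_decrease(stories, before1980, threshold=1)
baths_gain = impurity_decrease(numbaths, before1980, threshold=1)
print(stories_gain, baths_gain)
```

In this toy data the split on stories separates the classes better than the split on numbaths, so it earns the larger impurity decrease. A feature that wins many such splits across the forest ends up near the top of the importance chart.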
QUESTION|TASK 4
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
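I chose precision, recall, and the AUROC score. Precision is the share of houses the model labels ‘before 1980’ that really were built before 1980, so it shows how trustworthy a ‘before 1980’ prediction is. Recall is the share of houses actually built before 1980 that the model identifies, so it shows how many of those houses the model misses. AUROC summarizes how well the model separates the two classes across all decision thresholds: 0.5 is no better than chance and 1.0 is perfect separation.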
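To make the three metrics concrete, here is a minimal plain-Python sketch on invented labels and scores (1 stands for ‘before 1980’). The helper functions are written for illustration only. In the report itself the equivalent scikit-learn functions would be precision_score, recall_score, and roc_auc_score from sklearn.metrics, with AUROC computed from predicted probabilities rather than hard labels.

```python
# Invented toy data: true labels and predicted probabilities of class 1
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # hard labels at a 0.5 threshold


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


precision, recall = precision_recall(y_true, y_pred)
auc = auroc(y_true, y_score)
print(precision, recall, auc)  # 0.75 0.75 0.8125
```

Here 3 of the 4 houses predicted as ‘before 1980’ are correct (precision 0.75), 3 of the 4 true ‘before 1980’ houses are found (recall 0.75), and a ‘before 1980’ house receives the higher score in 13 of the 16 positive-negative pairs (AUROC 0.8125).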