Juan Zurita - Data Science Portfolio - Client Report

Elevator pitch

We can predict if a house was built before or after 1980 with the correct training model. I also show the accuracyscore of the model I chose for this case. It is really interesting to see how sklearn provides tools to train and test ML models.

Read and format project data

# Include and execute your code here
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"

dwellings_ml = pd.read_csv(url)

dwellings_ml.head()

	parcel	abstrprd	livearea	yrbuilt	totunits	stories	nocars	numbdrm	...	qualified_Q	qualified_U	status_I
0	00102-08-065-065	1130	1346	2004	1	2	2	2	...	1	0	1
1	00102-08-073-073	1130	1249	2005	1	1	1	2	...	1	0	1
2	00102-08-078-078	1130	1346	2005	1	2	1	2	...	1	0	1
3	00102-08-081-081	1130	1146	2005	1	1	0	2	...	1	0	1
4	00102-08-086-086	1130	1249	2005	1	1	1	2	...	0	1	1

5 rows × 51 columns

QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

I chose to display3 charts comparing the relationship between the Year Built and the Sell Price, Number of Bathroms, and Number of Bedrooms. My assumption is that these comparissons can help us understand what data could be used to train the model and how data is shaped along the years.

Show the code

# Relationship between the year built and the sell prices
fig1 = px.scatter(dwellings_ml, x='yrbuilt', y='sprice', color='sprice', title='Sell Price and Year Built', labels={'yrbuilt': 'Year Built', 'sprice': 'Sell Price'
})


# Relationship between the year built and the number of bathrooms
baths_count = dwellings_ml.groupby(['yrbuilt', 'numbaths']).size().reset_index(name='count')

fig2 = px.bar(baths_count, x='yrbuilt', y='count', color='numbaths', title='Number of Bathrooms vs. Year Built',
              labels={'yrbuilt': 'Year Built', 'count': 'Number of Houses', 'numbaths': 'Number of Bathrooms'})


# Relationship between the year built and the number of bedrooms
bedrooms_count = dwellings_ml.groupby(['yrbuilt', 'numbdrm']).size().reset_index(name='count')

fig3 = px.bar(bedrooms_count, x='yrbuilt', y='count', color='numbdrm', title='Number of Bedrooms vs. Year Built',
              labels={'yrbuilt': 'Year Built', 'count': 'Number of Houses', 'numbdrm': 'Number of Bedrooms'})


fig1.show()
fig2.show()
fig3.show()

QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.

I chose to use a Decision Tree Classifier, and a Random Forest Classifier which was pretty good at guessing the yearbuilt. I dropped columns which are not related to the yearbuilt because are random like parcel, or ones that are too crucial for the model like yrbuilt. After comparing the accuracy result, I decided that a Random Forest Classifier would work best.

Show the code

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Columns to drop to prepare training and test data
features_to_drop = ['parcel', 'abstrprd', 'before1980', 'yrbuilt']
X = dwellings_ml.drop(columns=features_to_drop)
y = dwellings_ml['before1980']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Show the shape of the datasets to confirm the split
# (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

dt_classifier = DecisionTreeClassifier(random_state=42)

dt_classifier.fit(X_train, y_train)

y_pred = dt_classifier.predict(X_test)

# Calculate accuracy
accuracy_score(y_test, y_pred)

0.9039860343322665

Show the code

from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the model on the training data
rf_classifier.fit(X_train, y_train)

# Predict on the testing data
y_pred_rf = rf_classifier.predict(X_test)

accuracy_score(y_test, y_pred_rf)

0.9262438172825138

QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.

The most important features selected by the RFC are the are which hauses belong to, the style of architecture like having one story and the number of bathrooms. These show that mostly these columns where used to predict data. This helped the model to be 92% accurate.

Show the code

# Extract feature importances from the model
feature_importances = rf_classifier.feature_importances_

# Create a DataFrame for visualization
features_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance
features_df = features_df.sort_values(by='Importance', ascending=False)

# Visualizing the most important features
fig = px.bar(features_df.head(5), x='Importance', y='Feature', orientation='h',
             title='5 Most Important Features in Predicting Year Built',
             labels={'Feature': 'Feature', 'Importance': 'Importance Score'})

fig.show()

QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

I chose the Precision, Recall and AUROC score. These scores show how good a model is by giving a score on how often the model is right at predicting, how good the model is at predicting a ‘before 1980’ house, and how the model overall is able to predict a house built before or after 1980; respectively.

Show the code

from sklearn.metrics import precision_score, recall_score, roc_auc_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auroc = roc_auc_score(y_test, y_pred)

(precision, recall, auroc)

(0.9249242247610165, 0.9214866434378629, 0.898073021991411)