Read and format project data
import pandas as pd
df = pd.read_csv('https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv', encoding='ISO-8859-1')
Course DS 250
Juan Zurita
The goal of this project is to build a machine learning model that predicts whether a person makes more or less than $50K, using only their answers to a survey about Star Wars. In principle, a survey about Star Wars could carry enough signal to predict this, and I think it is worth trying.
Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
column_mapping = {
'RespondentID': 'RespondentID',
'Have you seen any of the 6 films in the Star Wars franchise?': 'SeenAnyFilm',
'Do you consider yourself to be a fan of the Star Wars film franchise?': 'FanOfFranchise',
'Which of the following Star Wars films have you seen? Please select all that apply.': 'SeenFilm1',
'Unnamed: 4': 'SeenFilm2',
'Unnamed: 5': 'SeenFilm3',
'Unnamed: 6': 'SeenFilm4',
'Unnamed: 7': 'SeenFilm5',
'Unnamed: 8': 'SeenFilm6',
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'RankFilm1',
'Unnamed: 10': 'RankFilm2',
'Unnamed: 11': 'RankFilm3',
'Unnamed: 12': 'RankFilm4',
'Unnamed: 13': 'RankFilm5',
'Unnamed: 14': 'RankFilm6',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'HanSoloOpinion',
'Unnamed: 16': 'LukeSkywalkerOpinion',
'Unnamed: 17': 'PrincessLeiaOrganaOpinion',
'Unnamed: 18': 'AnakinSkywalkerOpinion',
'Unnamed: 19': 'ObiWanKenobiOpinion',
'Unnamed: 20': 'EmperorPalpatineOpinion',
'Unnamed: 21': 'DarthVaderOpinion',
'Unnamed: 22': 'LandoCalrissianOpinion',
'Unnamed: 23': 'BobaFettOpinion',
'Unnamed: 24': 'C-3P0Opinion',
'Unnamed: 25': 'R2D2Opinion',
'Unnamed: 26': 'JarJarBinksOpinion',
'Unnamed: 27': 'PadmeAmidalaOpinion',
'Unnamed: 28': 'YodaOpinion',
'Which character shot first?': 'WhoShotFirst',
'Are you familiar with the Expanded Universe?': 'KnowsExpandedUniverse',
'Do you consider yourself to be a fan of the Expanded Universe?æ': 'FanOfExpandedUniverse',
'Do you consider yourself to be a fan of the Star Trek franchise?': 'FanOfStarTrek',
'Household Income': 'Income',
'Location (Census Region)': 'Location'
}
df_cleaned = df.rename(columns=column_mapping)
df_cleaned.columns
Index(['RespondentID', 'SeenAnyFilm', 'FanOfFranchise', 'SeenFilm1',
'SeenFilm2', 'SeenFilm3', 'SeenFilm4', 'SeenFilm5', 'SeenFilm6',
'RankFilm1', 'RankFilm2', 'RankFilm3', 'RankFilm4', 'RankFilm5',
'RankFilm6', 'HanSoloOpinion', 'LukeSkywalkerOpinion',
'PrincessLeiaOrganaOpinion', 'AnakinSkywalkerOpinion',
'ObiWanKenobiOpinion', 'EmperorPalpatineOpinion', 'DarthVaderOpinion',
'LandoCalrissianOpinion', 'BobaFettOpinion', 'C-3P0Opinion',
'R2D2Opinion', 'JarJarBinksOpinion', 'PadmeAmidalaOpinion',
'YodaOpinion', 'WhoShotFirst', 'KnowsExpandedUniverse',
'FanOfExpandedUniverse', 'FanOfStarTrek', 'Gender', 'Age', 'Income',
'Education', 'Location'],
dtype='object')
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report, provide example(s) of the reformatted data with a short description of the changes made.
Filter the dataset to respondents that have seen at least one film.
| | RespondentID | SeenAnyFilm | FanOfFranchise | SeenFilm1 | SeenFilm2 | SeenFilm3 | SeenFilm4 | SeenFilm5 | SeenFilm6 | RankFilm1 | ... | YodaOpinion | WhoShotFirst | KnowsExpandedUniverse | FanOfExpandedUniverse | FanOfStarTrek | Gender | Age | Income | Education | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
5 rows × 38 columns
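The code for this filter is not shown in the report. A minimal sketch of how it could look, assuming "seen at least one film" is read as a `SeenAnyFilm` answer of `'Yes'`. The toy frame `df_cleaned_toy` and its values are made up for illustration; `df_filtered` is the name the report uses in later steps.

```python
import pandas as pd

# Toy stand-in for df_cleaned (hypothetical values, not the survey data).
df_cleaned_toy = pd.DataFrame({
    'RespondentID': [101, 102, 103, 104],
    'SeenAnyFilm': ['Yes', 'No', 'Yes', None],
})

# Assumption: keep only respondents who answered 'Yes' to SeenAnyFilm.
# .copy() gives an independent frame, so adding columns later is safe.
df_filtered = df_cleaned_toy[df_cleaned_toy['SeenAnyFilm'] == 'Yes'].copy()

print(df_filtered.index.tolist())
```

The original row labels are kept after filtering, which would explain why the preview above skips some index values.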
Create a new column that converts the age ranges to a single number. Drop the age range categorical column.
Index(['RespondentID', 'SeenAnyFilm', 'FanOfFranchise', 'SeenFilm1',
'SeenFilm2', 'SeenFilm3', 'SeenFilm4', 'SeenFilm5', 'SeenFilm6',
'RankFilm1', 'RankFilm2', 'RankFilm3', 'RankFilm4', 'RankFilm5',
'RankFilm6', 'HanSoloOpinion', 'LukeSkywalkerOpinion',
'PrincessLeiaOrganaOpinion', 'AnakinSkywalkerOpinion',
'ObiWanKenobiOpinion', 'EmperorPalpatineOpinion', 'DarthVaderOpinion',
'LandoCalrissianOpinion', 'BobaFettOpinion', 'C-3P0Opinion',
'R2D2Opinion', 'JarJarBinksOpinion', 'PadmeAmidalaOpinion',
'YodaOpinion', 'WhoShotFirst', 'KnowsExpandedUniverse',
'FanOfExpandedUniverse', 'FanOfStarTrek', 'Gender', 'Income',
'Education', 'Location', 'AgeGroup'],
dtype='object')
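The code for the age conversion is not shown either. A sketch using the same map-then-drop pattern the report applies to education: the `'18-29'` label and its code `1` appear in the report's output, while the other three labels and their codes are assumptions about the survey's categories.

```python
import pandas as pd

# Toy stand-in for df_filtered (hypothetical rows).
df_filtered = pd.DataFrame({
    'RespondentID': [1, 2, 3, 4, 5],
    'Age': ['18-29', '30-44', '45-60', '> 60', None],
})

# Only '18-29' -> 1 is confirmed by the report; the rest are assumed labels.
age_mapping = {
    '18-29': 1,
    '30-44': 2,
    '45-60': 3,
    '> 60': 4,
}

# Unmapped or missing answers become NaN, so the new column is float.
df_filtered['AgeGroup'] = df_filtered['Age'].map(age_mapping)
df_filtered = df_filtered.drop(columns=['Age'])

print(df_filtered)
```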
Create a new column that converts the education groupings to a single number. Drop the school categorical column.
education_mapping = {
'Less than high school degree': 1,
'High school degree': 2,
'Some college or Associate degree': 3,
'Bachelor degree': 4,
'Graduate degree': 5
}
df_filtered['EducationGroup'] = df_filtered['Education'].map(education_mapping)
df_filtered = df_filtered.drop(columns=['Education'])
df_filtered.columns
Index(['RespondentID', 'SeenAnyFilm', 'FanOfFranchise', 'SeenFilm1',
'SeenFilm2', 'SeenFilm3', 'SeenFilm4', 'SeenFilm5', 'SeenFilm6',
'RankFilm1', 'RankFilm2', 'RankFilm3', 'RankFilm4', 'RankFilm5',
'RankFilm6', 'HanSoloOpinion', 'LukeSkywalkerOpinion',
'PrincessLeiaOrganaOpinion', 'AnakinSkywalkerOpinion',
'ObiWanKenobiOpinion', 'EmperorPalpatineOpinion', 'DarthVaderOpinion',
'LandoCalrissianOpinion', 'BobaFettOpinion', 'C-3P0Opinion',
'R2D2Opinion', 'JarJarBinksOpinion', 'PadmeAmidalaOpinion',
'YodaOpinion', 'WhoShotFirst', 'KnowsExpandedUniverse',
'FanOfExpandedUniverse', 'FanOfStarTrek', 'Gender', 'Income',
'Location', 'AgeGroup', 'EducationGroup'],
dtype='object')
Create a new column that converts the income ranges to a single number. Drop the income range categorical column.
Index(['RespondentID', 'SeenAnyFilm', 'FanOfFranchise', 'SeenFilm1',
'SeenFilm2', 'SeenFilm3', 'SeenFilm4', 'SeenFilm5', 'SeenFilm6',
'RankFilm1', 'RankFilm2', 'RankFilm3', 'RankFilm4', 'RankFilm5',
'RankFilm6', 'HanSoloOpinion', 'LukeSkywalkerOpinion',
'PrincessLeiaOrganaOpinion', 'AnakinSkywalkerOpinion',
'ObiWanKenobiOpinion', 'EmperorPalpatineOpinion', 'DarthVaderOpinion',
'LandoCalrissianOpinion', 'BobaFettOpinion', 'C-3P0Opinion',
'R2D2Opinion', 'JarJarBinksOpinion', 'PadmeAmidalaOpinion',
'YodaOpinion', 'WhoShotFirst', 'KnowsExpandedUniverse',
'FanOfExpandedUniverse', 'FanOfStarTrek', 'Gender', 'Location',
'AgeGroup', 'EducationGroup', 'IncomeGroup'],
dtype='object')
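The income conversion code is also not shown. A sketch under stated assumptions: the codes for `'$0 - $24,999'`, `'$25,000 - $49,999'` and `'$100,000 - $149,999'` can be read off the report's before and after previews, while the `'$50,000 - $99,999'` and `'$150,000+'` labels and their codes are assumed.

```python
import pandas as pd

# Toy stand-in for df_filtered (hypothetical rows).
df_filtered = pd.DataFrame({
    'RespondentID': [1, 2, 3, 4],
    'Income': [None, '$0 - $24,999', '$100,000 - $149,999', '$25,000 - $49,999'],
})

income_mapping = {
    '$0 - $24,999': 1,
    '$25,000 - $49,999': 2,
    '$50,000 - $99,999': 3,      # assumed label and code
    '$100,000 - $149,999': 4,
    '$150,000+': 5,              # assumed label and code
}

# Missing income answers stay NaN after the mapping.
df_filtered['IncomeGroup'] = df_filtered['Income'].map(income_mapping)
df_filtered = df_filtered.drop(columns=['Income'])

print(df_filtered)
```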
Create your target (also known as “y” or “label”) column based on the new income range column.
| | RespondentID | SeenAnyFilm | FanOfFranchise | SeenFilm1 | SeenFilm2 | SeenFilm3 | SeenFilm4 | SeenFilm5 | SeenFilm6 | RankFilm1 | ... | WhoShotFirst | KnowsExpandedUniverse | FanOfExpandedUniverse | FanOfStarTrek | Gender | Location | AgeGroup | EducationGroup | IncomeGroup | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | I don't understand this question | Yes | No | No | Male | South Atlantic | 1.0 | 2.0 | NaN | False |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | I don't understand this question | No | NaN | No | Male | West North Central | 1.0 | 2.0 | 1.0 | False |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | I don't understand this question | No | NaN | Yes | Male | West North Central | 1.0 | 3.0 | 4.0 | True |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Greedo | Yes | No | No | Male | West North Central | 1.0 | 3.0 | 4.0 | True |
6 | 3.292719e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Han | Yes | No | Yes | Male | Middle Atlantic | 1.0 | 4.0 | 2.0 | False |
5 rows × 39 columns
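The preview above is consistent with a target that is `True` when `IncomeGroup` is 3 or higher. A sketch of that rule on toy values copied from the preview's `IncomeGroup` column; the exact expression is an assumption, since the report does not show it.

```python
import pandas as pd

# IncomeGroup values taken from the five preview rows above.
df_filtered = pd.DataFrame({
    'IncomeGroup': [float('nan'), 1.0, 4.0, 4.0, 2.0],
})

# Assumed rule: groups 3 and above correspond to more than $50K.
# Comparisons with NaN evaluate to False, so rows with missing
# income are labelled False rather than being dropped.
df_filtered['Target'] = df_filtered['IncomeGroup'] > 2

print(df_filtered['Target'].tolist())
```

Because missing incomes end up labelled `False` under this rule, it may be worth considering whether those rows should be removed before training instead.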
One-hot encode all remaining categorical columns.
::: {#cell-Q2f Get Dummies .cell execution_count=9}
['RespondentID',
'AgeGroup',
'EducationGroup',
'IncomeGroup',
'Target',
'SeenAnyFilm_Yes',
'FanOfFranchise_No',
'FanOfFranchise_Yes',
'SeenFilm1_Star Wars: Episode I The Phantom Menace',
'SeenFilm2_Star Wars: Episode II Attack of the Clones',
'SeenFilm3_Star Wars: Episode III Revenge of the Sith',
'SeenFilm4_Star Wars: Episode IV A New Hope',
'SeenFilm5_Star Wars: Episode V The Empire Strikes Back',
'SeenFilm6_Star Wars: Episode VI Return of the Jedi',
'RankFilm1_1',
'RankFilm1_2',
'RankFilm1_3',
'RankFilm1_4',
'RankFilm1_5',
'RankFilm1_6',
'RankFilm2_1',
'RankFilm2_2',
'RankFilm2_3',
'RankFilm2_4',
'RankFilm2_5',
'RankFilm2_6',
'RankFilm3_1',
'RankFilm3_2',
'RankFilm3_3',
'RankFilm3_4',
'RankFilm3_5',
'RankFilm3_6',
'RankFilm4_1',
'RankFilm4_2',
'RankFilm4_3',
'RankFilm4_4',
'RankFilm4_5',
'RankFilm4_6',
'RankFilm5_1',
'RankFilm5_2',
'RankFilm5_3',
'RankFilm5_4',
'RankFilm5_5',
'RankFilm5_6',
'RankFilm6_1',
'RankFilm6_2',
'RankFilm6_3',
'RankFilm6_4',
'RankFilm6_5',
'RankFilm6_6',
'HanSoloOpinion_Neither favorably nor unfavorably (neutral)',
'HanSoloOpinion_Somewhat favorably',
'HanSoloOpinion_Somewhat unfavorably',
'HanSoloOpinion_Unfamiliar (N/A)',
'HanSoloOpinion_Very favorably',
'HanSoloOpinion_Very unfavorably',
'LukeSkywalkerOpinion_Neither favorably nor unfavorably (neutral)',
'LukeSkywalkerOpinion_Somewhat favorably',
'LukeSkywalkerOpinion_Somewhat unfavorably',
'LukeSkywalkerOpinion_Unfamiliar (N/A)',
'LukeSkywalkerOpinion_Very favorably',
'LukeSkywalkerOpinion_Very unfavorably',
'PrincessLeiaOrganaOpinion_Neither favorably nor unfavorably (neutral)',
'PrincessLeiaOrganaOpinion_Somewhat favorably',
'PrincessLeiaOrganaOpinion_Somewhat unfavorably',
'PrincessLeiaOrganaOpinion_Unfamiliar (N/A)',
'PrincessLeiaOrganaOpinion_Very favorably',
'PrincessLeiaOrganaOpinion_Very unfavorably',
'AnakinSkywalkerOpinion_Neither favorably nor unfavorably (neutral)',
'AnakinSkywalkerOpinion_Somewhat favorably',
'AnakinSkywalkerOpinion_Somewhat unfavorably',
'AnakinSkywalkerOpinion_Unfamiliar (N/A)',
'AnakinSkywalkerOpinion_Very favorably',
'AnakinSkywalkerOpinion_Very unfavorably',
'ObiWanKenobiOpinion_Neither favorably nor unfavorably (neutral)',
'ObiWanKenobiOpinion_Somewhat favorably',
'ObiWanKenobiOpinion_Somewhat unfavorably',
'ObiWanKenobiOpinion_Unfamiliar (N/A)',
'ObiWanKenobiOpinion_Very favorably',
'ObiWanKenobiOpinion_Very unfavorably',
'EmperorPalpatineOpinion_Neither favorably nor unfavorably (neutral)',
'EmperorPalpatineOpinion_Somewhat favorably',
'EmperorPalpatineOpinion_Somewhat unfavorably',
'EmperorPalpatineOpinion_Unfamiliar (N/A)',
'EmperorPalpatineOpinion_Very favorably',
'EmperorPalpatineOpinion_Very unfavorably',
'DarthVaderOpinion_Neither favorably nor unfavorably (neutral)',
'DarthVaderOpinion_Somewhat favorably',
'DarthVaderOpinion_Somewhat unfavorably',
'DarthVaderOpinion_Unfamiliar (N/A)',
'DarthVaderOpinion_Very favorably',
'DarthVaderOpinion_Very unfavorably',
'LandoCalrissianOpinion_Neither favorably nor unfavorably (neutral)',
'LandoCalrissianOpinion_Somewhat favorably',
'LandoCalrissianOpinion_Somewhat unfavorably',
'LandoCalrissianOpinion_Unfamiliar (N/A)',
'LandoCalrissianOpinion_Very favorably',
'LandoCalrissianOpinion_Very unfavorably',
'BobaFettOpinion_Neither favorably nor unfavorably (neutral)',
'BobaFettOpinion_Somewhat favorably',
'BobaFettOpinion_Somewhat unfavorably',
'BobaFettOpinion_Unfamiliar (N/A)',
'BobaFettOpinion_Very favorably',
'BobaFettOpinion_Very unfavorably',
'C-3P0Opinion_Neither favorably nor unfavorably (neutral)',
'C-3P0Opinion_Somewhat favorably',
'C-3P0Opinion_Somewhat unfavorably',
'C-3P0Opinion_Unfamiliar (N/A)',
'C-3P0Opinion_Very favorably',
'C-3P0Opinion_Very unfavorably',
'R2D2Opinion_Neither favorably nor unfavorably (neutral)',
'R2D2Opinion_Somewhat favorably',
'R2D2Opinion_Somewhat unfavorably',
'R2D2Opinion_Unfamiliar (N/A)',
'R2D2Opinion_Very favorably',
'R2D2Opinion_Very unfavorably',
'JarJarBinksOpinion_Neither favorably nor unfavorably (neutral)',
'JarJarBinksOpinion_Somewhat favorably',
'JarJarBinksOpinion_Somewhat unfavorably',
'JarJarBinksOpinion_Unfamiliar (N/A)',
'JarJarBinksOpinion_Very favorably',
'JarJarBinksOpinion_Very unfavorably',
'PadmeAmidalaOpinion_Neither favorably nor unfavorably (neutral)',
'PadmeAmidalaOpinion_Somewhat favorably',
'PadmeAmidalaOpinion_Somewhat unfavorably',
'PadmeAmidalaOpinion_Unfamiliar (N/A)',
'PadmeAmidalaOpinion_Very favorably',
'PadmeAmidalaOpinion_Very unfavorably',
'YodaOpinion_Neither favorably nor unfavorably (neutral)',
'YodaOpinion_Somewhat favorably',
'YodaOpinion_Somewhat unfavorably',
'YodaOpinion_Unfamiliar (N/A)',
'YodaOpinion_Very favorably',
'YodaOpinion_Very unfavorably',
'WhoShotFirst_Greedo',
'WhoShotFirst_Han',
"WhoShotFirst_I don't understand this question",
'KnowsExpandedUniverse_No',
'KnowsExpandedUniverse_Yes',
'FanOfExpandedUniverse_No',
'FanOfExpandedUniverse_Yes',
'FanOfStarTrek_No',
'FanOfStarTrek_Yes',
'Gender_Female',
'Gender_Male',
'Location_East North Central',
'Location_East South Central',
'Location_Middle Atlantic',
'Location_Mountain',
'Location_New England',
'Location_Pacific',
'Location_South Atlantic',
'Location_West North Central',
'Location_West South Central']
:::
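The column list above has the shape produced by `pd.get_dummies` without `drop_first`. A small sketch on a toy frame (hypothetical values), assuming the default behaviour in which numeric columns pass through unchanged and text columns are expanded into one indicator column per category:

```python
import pandas as pd

# Toy stand-in for df_filtered with one numeric and two text columns.
toy = pd.DataFrame({
    'AgeGroup': [1.0, 2.0, 1.0],
    'WhoShotFirst': ['Han', 'Greedo', None],
    'Gender': ['Male', 'Female', 'Male'],
})

# Text columns are replaced by <column>_<category> indicator columns.
# Missing answers get no indicator of their own (dummy_na defaults to False).
toy_ohe = pd.get_dummies(toy)

print(list(toy_ohe.columns))
```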
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
# SeenFilm: percentage of the 835 respondents who have seen each film
import plotly.express as px
# Multiply before rounding so the labels are whole percentages
HasSeenMoviesPercent = [round(df_filtered['SeenFilm1'].count() / 835 * 100),
                        round(df_filtered['SeenFilm2'].count() / 835 * 100),
                        round(df_filtered['SeenFilm3'].count() / 835 * 100),
                        round(df_filtered['SeenFilm4'].count() / 835 * 100),
                        round(df_filtered['SeenFilm5'].count() / 835 * 100),
                        round(df_filtered['SeenFilm6'].count() / 835 * 100)]
fig_1 = px.bar(x=HasSeenMoviesPercent,
               y=['Star Wars: Episode I The Phantom Menace',
                  'Star Wars: Episode II Attack of the Clones',
                  'Star Wars: Episode III Revenge of the Sith',
                  'Star Wars: Episode IV A New Hope',
                  'Star Wars: Episode V The Empire Strikes Back',
                  'Star Wars: Episode VI Return of the Jedi'],
               text=HasSeenMoviesPercent,
               title="Which 'Star Wars' Movies Have You Seen? (Of 835 respondents)",
               )
fig_1.show()
# WhoShotFirst: percentage of the 834 respondents giving each answer
# Multiply before rounding so the labels are whole percentages
hanShotFirst = [round(df_filtered_ohe['WhoShotFirst_Han'].sum() / 834 * 100),
                round(df_filtered_ohe['WhoShotFirst_Greedo'].sum() / 834 * 100),
                round(df_filtered_ohe["WhoShotFirst_I don't understand this question"].sum() / 834 * 100)]
fig_2 = px.bar(x=hanShotFirst,
               y=['Han', 'Greedo', "I don't understand this question"],
               title="Who Shot First? (According to 834 respondents)",
               text=hanShotFirst)
fig_2.show()
Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
I used a decision tree classifier (DTC) and a random forest classifier (RFC) to try to predict whether a survey respondent earns more than $50K. Unfortunately, the accuracy scores were not high enough to conclude that either model can predict this. In the run shown below, the DTC scored about 0.54 (54%) and the RFC about 0.60 (60%); neither classifier is seeded, so the scores shift somewhat from run to run. An accuracy score is the fraction of test cases the model classifies correctly. Even though the scores are above 50%, and someone could argue the models work, I concluded that the survey data is not sufficiently related to income.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
features = df_filtered_ohe.columns
features = features.drop('Target')
features = features.drop('IncomeGroup')
X = df_filtered_ohe[features]
y = df_filtered_ohe['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Decision Tree Classifier
dt_classifier = DecisionTreeClassifier()
dt_classifier = dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)
dt_classifier_accuracy = accuracy_score(y_test, y_pred)
# Random Forest Classifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)
rf_classifier_accuracy = accuracy_score(y_test, y_pred_rf)
(dt_classifier_accuracy, rf_classifier_accuracy)
(0.5444839857651246, 0.6014234875444839)
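One way to put these accuracy scores in context is to compare them with a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. A minimal sketch on made-up labels; `y_test_toy` is hypothetical, and the real comparison would use the `y_test` from the split above.

```python
import pandas as pd

# Hypothetical test labels (True = earns more than $50K); not the survey data.
y_test_toy = pd.Series(
    [True, False, False, True, False, False, False, True, False, False]
)

# Share of the most common class = accuracy of always predicting that class.
baseline_accuracy = y_test_toy.value_counts(normalize=True).max()

print(f"Majority-class baseline: {baseline_accuracy:.2f}")
```

If the baseline on the real test labels turns out to be close to the model scores, the models add little over always guessing the majority class.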