Read and format project data
= pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json") df
Course DS 250
Juan Zurita
Analyzing the flights_missing.json dataset offers insights into flight delays and associated factors such as airline performance and weather conditions. By exploring the data, we can uncover patterns in the frequency and duration of flight delays, identify which airlines are most prone to delays, and assess the impact of weather on flight schedules. Utilizing visualizations, we can depict correlations between delay durations and various parameters like time of day, day of the week, and weather conditions, enabling stakeholders to make informed decisions for improving flight operations and passenger experiences.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
I chose the 9th row which shows NaN on the key “minutes_delayed_nas”
airport_code IAD
airport_name Washington, DC: Washington Dulles International
month February
year 2005.0
num_of_flights_total 10042
num_of_delays_carrier 284
num_of_delays_late_aircraft 631.0
num_of_delays_nas 691
num_of_delays_security 4
num_of_delays_weather 28
num_of_delays_total 1639
minutes_delayed_carrier 15573.0
minutes_delayed_late_aircraft 39840
minutes_delayed_nas NaN
minutes_delayed_security 169
minutes_delayed_weather 1359
minutes_delayed_total 78878
Name: 9, dtype: object
Which airport has the worst delays? Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
For this question, I calculated the percentage of delayed flights each airport reported and sorted the values by the highest percentage of delayed fligths in each airport. This help us see which airport one should avoid because of the probability in 100 to have your flights delayed. It is the San Francisco (SFO) one
q2 = (df.groupby('airport_code').agg(
total_flights=('num_of_flights_total', 'sum'),
delayed_flights=('num_of_delays_total', 'sum'),
avg_delay_time=('minutes_delayed_total', 'mean')))
q2['delayed_percentage'] = q2['delayed_flights'] / q2['total_flights'] *100
q2 = q2.sort_values(by='delayed_percentage', ascending=False)
print(q2)
total_flights delayed_flights avg_delay_time \
airport_code
SFO 1630945 425604 201140.098485
ORD 3597588 830825 426940.371212
ATL 4430047 902443 408969.136364
IAD 851571 168467 77905.136364
SAN 917862 175132 62698.848485
DEN 2513974 468519 190707.431818
SLC 1403384 205160 76692.204545
delayed_percentage
airport_code
SFO 26.095546
ORD 23.093945
ATL 20.370958
IAD 19.783083
SAN 19.080428
DEN 18.636589
SLC 14.618950
What is the best month to fly if you want to avoid delays of any length? Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
I dropped NaN values in month from the dataframe, and made a chart which shows the relationship between delays each month and total flights. This shows which month has the lowest percentage of delays which is September.
df_months = df.dropna(subset=['month'])
q3 = df.groupby('month').agg(
total_flights=('num_of_flights_total', 'sum'),
delayed_flights=('num_of_delays_total', 'sum')
)
q3['delayed_percentage'] = q3['delayed_flights'] / q3['total_flights']*100
months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
q3 = q3.reindex(months_order)
q3_visual = px.bar(q3, x=q3.index, y='delayed_percentage',
color='delayed_percentage',
labels={'delayed_percentage': 'Delayed Percentage'},
title='Proportion of Delayed Flights by Month')
q3_visual.update_layout(xaxis_title='Month', yaxis_title='Proportion of Delayed Flights')
q3_visual.show()
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
I successfully included both categories into one column and used the conditions to make the calculations to create that column.
100% of delayed flights in the Weather category are due to weather
30% of all delayed flights in the Late-Arriving category are due to weather.
From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.
airport_code airport_name month \
0 ATL Atlanta, GA: Hartsfield-Jackson Atlanta Intern... January
1 DEN Denver, CO: Denver International January
2 IAD NaN January
3 ORD Chicago, IL: Chicago O'Hare International January
4 SAN San Diego, CA: San Diego International January
year num_of_flights_total num_of_delays_carrier \
0 2005.0 35048 1500
1 2005.0 12687 1041
2 2005.0 12381 414
3 2005.0 28194 1197
4 2005.0 7283 572
num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security \
0 1109.104072 4598 10
1 928.000000 935 11
2 1058.000000 895 4
3 2255.000000 5415 5
4 680.000000 638 7
num_of_delays_weather num_of_delays_total minutes_delayed_carrier \
0 448 8355 116423.0
1 233 3153 53537.0
2 61 2430 NaN
3 306 9178 88691.0
4 56 1952 27436.0
minutes_delayed_late_aircraft minutes_delayed_nas \
0 104415 207467.0
1 70301 36817.0
2 70919 35660.0
3 160811 364382.0
4 38445 21127.0
minutes_delayed_security minutes_delayed_weather minutes_delayed_total \
0 297 36931 465533
1 363 21779 182797
2 208 4497 134881
3 151 24859 638894
4 218 4326 91552
num_of_delays_weather_total
0 3769.431222
1 1119.150000
2 960.150000
3 4502.250000
4 674.700000
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.
From this barplot we see that there is a high proportion caused by weather in general. If we only used the weather condition provided, we would probably see less of a proportion thus making us feel like there is not a big relationship between common weather delays and total delays.
df['total_weather_delays'] = df['num_of_delays_nas'] + df['num_of_delays_weather']
total_flights = df.groupby('airport_code')['num_of_flights_total'].sum().reset_index()
q5 = df.merge(total_flights, on='airport_code', how='left')
q5['all_delayed_by_weather'] = (q5['total_weather_delays'] / total_flights['num_of_flights_total'])*100
q5_fig = px.bar(q5, x='airport_code', y='all_delayed_by_weather',
title='Proportion of Flights Delayed by Weather at Each Airport',
labels={'weather_delay_proportion': 'Proportion of Flights Delayed by Weather', 'airport': 'Airport'})
q5_fig.update_xaxes(title_text='Airport')
q5_fig.update_yaxes(title_text='Proportion of Flights Delayed by Weather')
q5_fig.show()