Juan Zurita - Data Science Portfolio - Client Report - Late flights and missing data (JSON files)

Elevator pitch

Analyzing the flights_missing.json dataset offers insights into flight delays and associated factors such as airline performance and weather conditions. By exploring the data, we can uncover patterns in the frequency and duration of flight delays, identify which airlines are most prone to delays, and assess the impact of weather on flight schedules. Utilizing visualizations, we can depict correlations between delay durations and various parameters like time of day, day of the week, and weather conditions, enabling stakeholders to make informed decisions for improving flight operations and passenger experiences.

Read and format project data

df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.

I chose the 9th row which shows NaN on the key “minutes_delayed_nas”

Read and format data

df.replace({
    -999: np.nan,
    'n/a': np.nan,
    'Febuary': 'February',
    '': np.nan,
    '1500+': 1500
}, inplace=True)

df = df.query("month != 'na'")

avg_delay = df['num_of_delays_late_aircraft'].mean()

df['num_of_delays_late_aircraft'].replace(np.nan, avg_delay, inplace=True)

df.iloc[9]

airport_code                                                                 IAD
airport_name                     Washington, DC: Washington Dulles International
month                                                                   February
year                                                                      2005.0
num_of_flights_total                                                       10042
num_of_delays_carrier                                                        284
num_of_delays_late_aircraft                                                631.0
num_of_delays_nas                                                            691
num_of_delays_security                                                         4
num_of_delays_weather                                                         28
num_of_delays_total                                                         1639
minutes_delayed_carrier                                                  15573.0
minutes_delayed_late_aircraft                                              39840
minutes_delayed_nas                                                          NaN
minutes_delayed_security                                                     169
minutes_delayed_weather                                                     1359
minutes_delayed_total                                                      78878
Name: 9, dtype: object

QUESTION|TASK 2

Which airport has the worst delays? Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

For this question, I calculated the percentage of delayed flights each airport reported and sorted the values by the highest percentage of delayed fligths in each airport. This help us see which airport one should avoid because of the probability in 100 to have your flights delayed. It is the San Francisco (SFO) one

Read and format data

q2 = (df.groupby('airport_code').agg(
total_flights=('num_of_flights_total', 'sum'),
delayed_flights=('num_of_delays_total', 'sum'),
avg_delay_time=('minutes_delayed_total', 'mean')))

q2['delayed_percentage'] = q2['delayed_flights'] / q2['total_flights'] *100

q2 = q2.sort_values(by='delayed_percentage', ascending=False)

print(q2)

              total_flights  delayed_flights  avg_delay_time  \
airport_code                                                   
SFO                 1630945           425604   201140.098485   
ORD                 3597588           830825   426940.371212   
ATL                 4430047           902443   408969.136364   
IAD                  851571           168467    77905.136364   
SAN                  917862           175132    62698.848485   
DEN                 2513974           468519   190707.431818   
SLC                 1403384           205160    76692.204545   

              delayed_percentage  
airport_code                      
SFO                    26.095546  
ORD                    23.093945  
ATL                    20.370958  
IAD                    19.783083  
SAN                    19.080428  
DEN                    18.636589  
SLC                    14.618950

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)

I dropped NaN values in month from the dataframe, and made a chart which shows the relationship between delays each month and total flights. This shows which month has the lowest percentage of delays which is September.

Read and format data

df_months = df.dropna(subset=['month'])

q3 = df.groupby('month').agg(
    total_flights=('num_of_flights_total', 'sum'),
    delayed_flights=('num_of_delays_total', 'sum')
)

q3['delayed_percentage'] = q3['delayed_flights'] / q3['total_flights']*100


months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
q3 = q3.reindex(months_order)

q3_visual = px.bar(q3, x=q3.index, y='delayed_percentage', 
                   color='delayed_percentage',
                   labels={'delayed_percentage': 'Delayed Percentage'},
                   title='Proportion of Delayed Flights by Month')
q3_visual.update_layout(xaxis_title='Month', yaxis_title='Proportion of Delayed Flights')

q3_visual.show()

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

I successfully included both categories into one column and used the conditions to make the calculations to create that column.

Read and format data

delays_mean = df['num_of_delays_late_aircraft'].mean(skipna=True)

df['num_of_delays_late_aircraft'].fillna(delays_mean, inplace=True)

100% of delayed flights in the Weather category are due to weather

Read and format data

df['num_of_delays_weather_total'] = df['num_of_delays_weather']

30% of all delayed flights in the Late-Arriving category are due to weather.

Read and format data

df['num_of_delays_weather_total'] += df['num_of_delays_late_aircraft'] * 0.3

From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.

Read and format data

df.loc[df['month'].isin(['April', 'May', 'June', 'July', 'August']), 'num_of_delays_weather_total'] += df['num_of_delays_nas'] * 0.4
df.loc[~df['month'].isin(['April', 'May', 'June', 'July', 'August']), 'num_of_delays_weather_total'] += df['num_of_delays_nas'] * 0.65

print(df.head())

  airport_code                                       airport_name    month  \
0          ATL  Atlanta, GA: Hartsfield-Jackson Atlanta Intern...  January   
1          DEN                   Denver, CO: Denver International  January   
2          IAD                                                NaN  January   
3          ORD          Chicago, IL: Chicago O'Hare International  January   
4          SAN             San Diego, CA: San Diego International  January   

     year  num_of_flights_total num_of_delays_carrier  \
0  2005.0                 35048                  1500   
1  2005.0                 12687                  1041   
2  2005.0                 12381                   414   
3  2005.0                 28194                  1197   
4  2005.0                  7283                   572   

   num_of_delays_late_aircraft  num_of_delays_nas  num_of_delays_security  \
0                  1109.104072               4598                      10   
1                   928.000000                935                      11   
2                  1058.000000                895                       4   
3                  2255.000000               5415                       5   
4                   680.000000                638                       7   

   num_of_delays_weather  num_of_delays_total  minutes_delayed_carrier  \
0                    448                 8355                 116423.0   
1                    233                 3153                  53537.0   
2                     61                 2430                      NaN   
3                    306                 9178                  88691.0   
4                     56                 1952                  27436.0   

   minutes_delayed_late_aircraft  minutes_delayed_nas  \
0                         104415             207467.0   
1                          70301              36817.0   
2                          70919              35660.0   
3                         160811             364382.0   
4                          38445              21127.0   

   minutes_delayed_security  minutes_delayed_weather  minutes_delayed_total  \
0                       297                    36931                 465533   
1                       363                    21779                 182797   
2                       208                     4497                 134881   
3                       151                    24859                 638894   
4                       218                     4326                  91552   

   num_of_delays_weather_total  
0                  3769.431222  
1                  1119.150000  
2                   960.150000  
3                  4502.250000  
4                   674.700000

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.

From this barplot we see that there is a high proportion caused by weather in general. If we only used the weather condition provided, we would probably see less of a proportion thus making us feel like there is not a big relationship between common weather delays and total delays.

Read and format data

df['total_weather_delays'] = df['num_of_delays_nas'] + df['num_of_delays_weather']

total_flights = df.groupby('airport_code')['num_of_flights_total'].sum().reset_index()

q5 = df.merge(total_flights, on='airport_code', how='left')
q5['all_delayed_by_weather'] = (q5['total_weather_delays'] / total_flights['num_of_flights_total'])*100


q5_fig = px.bar(q5, x='airport_code', y='all_delayed_by_weather',
title='Proportion of Flights Delayed by Weather at Each Airport',
labels={'weather_delay_proportion': 'Proportion of Flights Delayed by Weather', 'airport': 'Airport'})
q5_fig.update_xaxes(title_text='Airport')
q5_fig.update_yaxes(title_text='Proportion of Flights Delayed by Weather')

q5_fig.show()