Juan Zurita - Data Science Portfolio - Client Report - Finding relationships in baseball

Show the code

#Establish connection with database
sqlite_file = r'lahmansbaseballdb.sqlite'

con = sqlite3.connect(sqlite_file)

q = '''
    SELECT * 
    FROM sqlite_master 
    WHERE type='table'
    '''
    
table = pd.read_sql_query(q,con)
table.filter(['name'])

	name
0	allstarfull
1	appearances
2	awardsmanagers
3	awardsplayers
4	awardssharemanagers
5	awardsshareplayers
6	batting
7	battingpost
8	collegeplaying
9	divisions
10	fielding
11	fieldingof
12	fieldingofsplit
13	fieldingpost
14	halloffame
15	homegames
16	leagues
17	managers
18	managershalf
19	parks
20	people
21	pitching
22	pitchingpost
23	salaries
24	schools
25	seriespost
26	teams
27	teamsfranchises
28	teamshalf

Elevator pitch

Analyzing the “lahmansbaseball.db.sqlite” database using SQLite and Python’s sqlite3 library provides valuable insights into various aspects of baseball, including team performance, player statistics, and league dynamics. By querying the database, we can uncover trends in team standings over multiple seasons, track individual player performances, and assess the overall competitiveness of different leagues. Visualizing the data can help in understanding patterns such as batting averages, home run frequencies, and pitching statistics, allowing analysts and enthusiasts to gain deeper insights into the sport’s dynamics and evolution over time. Additionally, exploring relationships between player demographics, team strategies, and game outcomes can offer valuable insights for teams, coaches, and fans alike.

QUESTION|TASK 1

Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

This table shows information for players with school ID as idbyuid and their salary with the year and the team they were playing for.

Read and format data

q1 = '''
    SELECT DISTINCT p.playerID, s.schoolID, sa.salary, sa.yearID, sa.teamID
    FROM people AS p
    INNER JOIN collegeplaying AS c ON p.playerID = c.playerID
    INNER JOIN schools AS s ON c.schoolID = s.schoolID
    INNER JOIN salaries AS sa ON p.playerID = sa.playerID
    WHERE s.schoolID = 'idbyuid'
    ORDER BY sa.salary DESC;
    '''
    
question_1 = pd.read_sql_query(q1,con)
    
question_1

	playerID	schoolID	salary	yearID	teamID
0	lindsma01	idbyuid	4000000.0	2014	CHA
1	lindsma01	idbyuid	3600000.0	2012	BAL
2	lindsma01	idbyuid	2800000.0	2011	COL
3	lindsma01	idbyuid	2300000.0	2013	CHA
4	lindsma01	idbyuid	1625000.0	2010	HOU
5	stephga01	idbyuid	1025000.0	2001	SLN
6	stephga01	idbyuid	900000.0	2002	SLN
7	stephga01	idbyuid	800000.0	2003	SLN
8	stephga01	idbyuid	550000.0	2000	SLN
9	lindsma01	idbyuid	410000.0	2009	FLO
10	lindsma01	idbyuid	395000.0	2008	FLO
11	lindsma01	idbyuid	380000.0	2007	FLO
12	stephga01	idbyuid	215000.0	1999	SLN
13	stephga01	idbyuid	185000.0	1998	PHI
14	stephga01	idbyuid	150000.0	1997	PHI

QUESTION|TASK 2

This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)

Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

Read and format data

q2a = '''
   SELECT playerID, yearID, 
   ROUND(CAST(SUM(H) AS FLOAT) / NULLIF(SUM(AB), 0), 3) AS batting_average
    FROM batting
    WHERE AB > 0
    GROUP BY playerID, yearID
    ORDER BY batting_average DESC
    LIMIT 5;
    '''
question_2a = pd.read_sql_query(q2a,con)
    
question_2a

	playerID	yearID	batting_average
0	abernte02	1960	1.0
1	abramge01	1923	1.0
2	acklefr01	1964	1.0
3	alanirj01	2019	1.0
4	alberan01	2017	1.0

Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.

Read and format data

q2b = '''
    SELECT playerID, yearID, 
       ROUND(CAST(SUM(H) AS FLOAT) / NULLIF(SUM(AB), 0), 3) AS batting_average
    FROM batting
    WHERE AB >= 10  
    GROUP BY playerID, yearID
    ORDER BY batting_average DESC, playerID
    LIMIT 5;
'''
question_2b = pd.read_sql_query(q2b,con)
    
question_2b

	playerID	yearID	batting_average
0	nymanny01	1974	0.643
1	carsoma01	2013	0.636
2	altizda01	1910	0.600
3	silvech01	1948	0.571
4	puccige01	1930	0.563

Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.

Read and format data

q2c = '''
    SELECT playerID, 
       ROUND(CAST(SUM(H) AS FLOAT) / NULLIF(SUM(AB), 0), 3) AS career_batting_average
    FROM batting
    GROUP BY playerID
    HAVING SUM(AB) >= 100  
    ORDER BY career_batting_average DESC
    LIMIT 5;
'''
question_2c = pd.read_sql_query(q2c,con)
    
question_2c

	playerID	career_batting_average
0	cobbty01	0.366
1	barnero01	0.360
2	hornsro01	0.358
3	jacksjo01	0.356
4	meyerle01	0.356

QUESTION|TASK 3

Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?

For this question I chose the Seattle Mariners and the Boston Reds to compare their team’s total salary. I see that the total salary from the Boston Reds is higher than the Mariners which makes me think they are better paid, making any baseball player to aim to play for that team. I also think about each city’s economy over time and in general which may also explain why Boston players would have to be paid more than Seattle ones.

Read and format data

q3 = '''
    SELECT teamID, SUM(salary) AS salary
    FROM salaries
    WHERE teamID IN ('SEA', 'BOS')
    GROUP BY teamID;
'''

question_3 = pd.read_sql_query(q3,con)
    
question_3


visual = px.bar(question_3, x='teamID', y='salary', title='Salaries Comparison',
             labels={'teamID': 'Team', 'salary': 'Team Total Salary'})

visual.show()