#Establish connection with databasesqlite_file =r'lahmansbaseballdb.sqlite'con = sqlite3.connect(sqlite_file)q =''' SELECT * FROM sqlite_master WHERE type='table' '''table = pd.read_sql_query(q,con)table.filter(['name'])
name
0
allstarfull
1
appearances
2
awardsmanagers
3
awardsplayers
4
awardssharemanagers
5
awardsshareplayers
6
batting
7
battingpost
8
collegeplaying
9
divisions
10
fielding
11
fieldingof
12
fieldingofsplit
13
fieldingpost
14
halloffame
15
homegames
16
leagues
17
managers
18
managershalf
19
parks
20
people
21
pitching
22
pitchingpost
23
salaries
24
schools
25
seriespost
26
teams
27
teamsfranchises
28
teamshalf
Elevator pitch
Analyzing the “lahmansbaseball.db.sqlite” database using SQLite and Python’s sqlite3 library provides valuable insights into various aspects of baseball, including team performance, player statistics, and league dynamics. By querying the database, we can uncover trends in team standings over multiple seasons, track individual player performances, and assess the overall competitiveness of different leagues. Visualizing the data can help in understanding patterns such as batting averages, home run frequencies, and pitching statistics, allowing analysts and enthusiasts to gain deeper insights into the sport’s dynamics and evolution over time. Additionally, exploring relationships between player demographics, team strategies, and game outcomes can offer valuable insights for teams, coaches, and fans alike.
QUESTION|TASK 1
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
This table shows information for players with school ID as idbyuid and their salary with the year and the team they were playing for.
Read and format data
q1 =''' SELECT DISTINCT p.playerID, s.schoolID, sa.salary, sa.yearID, sa.teamID FROM people AS p INNER JOIN collegeplaying AS c ON p.playerID = c.playerID INNER JOIN schools AS s ON c.schoolID = s.schoolID INNER JOIN salaries AS sa ON p.playerID = sa.playerID WHERE s.schoolID = 'idbyuid' ORDER BY sa.salary DESC; '''question_1 = pd.read_sql_query(q1,con)question_1
playerID
schoolID
salary
yearID
teamID
0
lindsma01
idbyuid
4000000.0
2014
CHA
1
lindsma01
idbyuid
3600000.0
2012
BAL
2
lindsma01
idbyuid
2800000.0
2011
COL
3
lindsma01
idbyuid
2300000.0
2013
CHA
4
lindsma01
idbyuid
1625000.0
2010
HOU
5
stephga01
idbyuid
1025000.0
2001
SLN
6
stephga01
idbyuid
900000.0
2002
SLN
7
stephga01
idbyuid
800000.0
2003
SLN
8
stephga01
idbyuid
550000.0
2000
SLN
9
lindsma01
idbyuid
410000.0
2009
FLO
10
lindsma01
idbyuid
395000.0
2008
FLO
11
lindsma01
idbyuid
380000.0
2007
FLO
12
stephga01
idbyuid
215000.0
1999
SLN
13
stephga01
idbyuid
185000.0
1998
PHI
14
stephga01
idbyuid
150000.0
1997
PHI
QUESTION|TASK 2
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
Read and format data
q2a =''' SELECT playerID, yearID, ROUND(CAST(SUM(H) AS FLOAT) / NULLIF(SUM(AB), 0), 3) AS batting_average FROM batting WHERE AB > 0 GROUP BY playerID, yearID ORDER BY batting_average DESC LIMIT 5; '''question_2a = pd.read_sql_query(q2a,con)question_2a
playerID
yearID
batting_average
0
abernte02
1960
1.0
1
abramge01
1923
1.0
2
acklefr01
1964
1.0
3
alanirj01
2019
1.0
4
alberan01
2017
1.0
Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
Read and format data
q2b =''' SELECT playerID, yearID, ROUND(CAST(SUM(H) AS FLOAT) / NULLIF(SUM(AB), 0), 3) AS batting_average FROM batting WHERE AB >= 10 GROUP BY playerID, yearID ORDER BY batting_average DESC, playerID LIMIT 5;'''question_2b = pd.read_sql_query(q2b,con)question_2b
playerID
yearID
batting_average
0
nymanny01
1974
0.643
1
carsoma01
2013
0.636
2
altizda01
1910
0.600
3
silvech01
1948
0.571
4
puccige01
1930
0.563
Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
Read and format data
q2c =''' SELECT playerID, ROUND(CAST(SUM(H) AS FLOAT) / NULLIF(SUM(AB), 0), 3) AS career_batting_average FROM batting GROUP BY playerID HAVING SUM(AB) >= 100 ORDER BY career_batting_average DESC LIMIT 5;'''question_2c = pd.read_sql_query(q2c,con)question_2c
playerID
career_batting_average
0
cobbty01
0.366
1
barnero01
0.360
2
hornsro01
0.358
3
jacksjo01
0.356
4
meyerle01
0.356
QUESTION|TASK 3
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?
For this question I chose the Seattle Mariners and the Boston Reds to compare their team’s total salary. I see that the total salary from the Boston Reds is higher than the Mariners which makes me think they are better paid, making any baseball player to aim to play for that team. I also think about each city’s economy over time and in general which may also explain why Boston players would have to be paid more than Seattle ones.
Read and format data
q3 =''' SELECT teamID, SUM(salary) AS salary FROM salaries WHERE teamID IN ('SEA', 'BOS') GROUP BY teamID;'''question_3 = pd.read_sql_query(q3,con)question_3visual = px.bar(question_3, x='teamID', y='salary', title='Salaries Comparison', labels={'teamID': 'Team', 'salary': 'Team Total Salary'})visual.show()