Page 1 of 2 12 LastLast
Results 1 to 25 of 28

Thread: Lahman Database Simple Question

  1. #1

    Lahman Database Simple Question

    I recently download the lahman database and I had a very easy question i was hoping someone would help me with. Im pretty much new to access but I been playing around with the queries and im getting the hang of it slowly. I can seem to do queries involving single seasons but im having trouble with cumulative totals.

    For instance.......say I wanted a list of the top ten all-time leaders in home runs.....how would I go about doing that. As you know the database is broken into years......so im unsure how i would do something cumulative. Thanks for your help

  2. #2
    Join Date
    Aug 2005
    Posts
    12,862
    Blog Entries
    2
    Quckest way is to query batting. Pick player ID and HR. Then in button menu section you should see the greek letter E hit that and some cells will pop up that say group by. Going into that cell and change the HR to sum. It will total up all the homers for all the playerID's. Now from there you can just switch over to the display or again in the above button section you can tell it to display only the top ten. But make sure to set the cells for HR on descending.

  3. #3
    Join Date
    Apr 2006
    Location
    Magrathea
    Posts
    5,679
    Blog Entries
    2
    Wow, that's pretty cool. I queried it for AB, H, HR, and K, and calculated BABIP for all hitters with 5000+ ABs through 2005 (the Lahman database won't be updated until November).

    The top 25:

    Code:
    FName	LName		BABIP
    Ty	Cobb		0.372
    Rogers	Hornsby		0.365
    Rod	Carew		0.361
    Derek	Jeter		0.36
    Billy	Hamilton	0.352
    Harry	Heilmann	0.351
    Bill	Terry		0.35
    Ed	Delahanty	0.348
    Wade	Boggs		0.348
    George	Sisler		0.346
    Roberto	Clemente	0.346
    Kiki	Cuyler		0.346
    Kirby	Puckett		0.345
    Tony	Gwynn		0.345
    Tris	Speaker		0.345
    Dan	Brouthers	0.344
    Willie	McGee		0.343
    Manny	Ramirez		0.342
    Mo	Vaughn		0.342
    Jesse	Burkett		0.342
    Willie	Keeler		0.34
    Lou	Brock		0.34
    Edgar	Martinez	0.34
    Eddie	Collins		0.34
    Babe	Ruth		0.34
    And the bottom 25:

    Code:
    FName	LName		BABIP
    George	McBride		0.228
    Monte	Cross		0.234
    Mickey	Doolan		0.244
    Wid	Conroy		0.245
    Eddie	Miller		0.246
    Graig	Nettles		0.248
    Germany	Smith		0.249
    Del	Crandall	0.251
    Ed	Brinkman	0.252
    Joe	Pepitone	0.252
    Darrell	Evans		0.253
    Dave	Kingman		0.256
    Roger	Maris		0.257
    John	Mayberry	0.257
    Rocky	Colavito	0.258
    Harmon	Killebrew	0.258
    Willie	Jones		0.258
    Eddie	Joost		0.259
    Mickey	Stanley		0.259
    Everett	Scott		0.259
    Terry	Turner		0.259
    Tommy	Corcoran	0.259
    Mark	McGwire		0.26
    Andre	Thornton	0.26
    Bones	Ely		0.26
    This is one of those just for fun things. I'll play around with this kind of thing more in my spare time. Doubt I'll post any more of the numbers though.

  4. #4
    Thanks. I figured it was something very easy.........guess ill have to play around with it more to get the full hang of it.

  5. #5
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    I taught myself basic Access manipulation by necessity because I had to use it in the process of calculating PCA ratings...it's fairly straightforward to learn enough to give you some weapons to fight with when answering basic questions...once you get the hang of setting up relationships and doing nested queries...there's a lot you can do.

  6. #6
    Is access a good program to use......or should I also try programs like mysql?

  7. #7
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    Access gets the job done for most of the basics...MySQL requires you to be able to write your own queries and takes a bit longer to learn...and Access is a slightly more limited platform but based on MySQL. If you think you have the time and desire to learn how to write MySQL queries, I recommend you devote that time to learning it...it's a good skill to have if you want to research baseball statistics. You don't have to though...if you're just looking to answer some routine questions, then Access works well.

  8. #8
    Im definitely going to look into SQL. Theres actually in Access when you write the queries you can view it in SQL Form. I figured id start with access then work my way up to SQL lol.

  9. #9
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    That is a good way to do it, TA...that's how I started...plus I bought a 20 dollar MySQL Tutorial book which was well written and easy to follow from the people at SQLab...it didn't take all that long once I understood Access to understand MySQL.

  10. #10
    Whats the name of that book?

    Theres so many books out there on access and mysql im not sure which one to look into.

  11. #11
    You should devote a good year to MS Access, and forget about MySQL. Unless you are doing something very intensive, MS Access gets the job done almost all the time. The QUERY designer in MS Access is outstanding.

  12. #12
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    The book I have (and it's a very helpful book) is called "MySQL Tutorial" and it's published by the source...MySQL Press...very good book and not overly dense to read...also fairly cheap to buy.

    If you want to learn Access really well, a good book that takes you from beginner to functionally fluent is Microsoft Access, Data Analysis. Do a search at Amazon for those titles and you should find them high on the list...with good peer ratings. I recommend learning both MySQL and Access, because there will be times when it's convenient to use the nice GUI in Access for simple queries and times when it's better to have direct control in MySQL.

    I also recommend you use SQLab's Query Browser software to do all of your work in MySQL...gives you nice user friendly buttons, drag and drop functionality, some helpful administrative tools and the easy ability to export your results in Excel format.

  13. #13
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    Quote Originally Posted by Tango Tiger
    You should devote a good year to MS Access, and forget about MySQL. Unless you are doing something very intensive, MS Access gets the job done almost all the time. The QUERY designer in MS Access is outstanding.
    I agree with Tango that Access will get the job done for most things, and I agree you should start with Access first so that you get used to database language and structure in an easy format, but Access does have significant limits and learning MySQL isn't much of a stretch over Access.

  14. #14
    Quote Originally Posted by Tango Tiger
    You should devote a good year to MS Access, and forget about MySQL. Unless you are doing something very intensive, MS Access gets the job done almost all the time. The QUERY designer in MS Access is outstanding.
    My mistake was I had went into Mysql first without checking out Access (I didnt have access on my comp and didnt feel like getting it). Now that I have both, access is much easier to learn. Like you said, the query designer is great.

    Thanks matt....Ill look into that "Mysql Tutorial"......there's so many books on this I was just wondering which is the best to learn with.

  15. #15
    what's the best way of getting through the retrosheet game data?

    Like if I wanted to look at all the times that pitcher's tried to pickoff a runner, what would you do? Do you use access or something else?

    I'm still learning with access, so I just do see a way to answer this if there is any.
    ...we came in
    it's time for dodger baseball!
    is this where...

  16. #16
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    Access is not an appropriate platform if you intend to load all of the event files as one table...it will crash your computer trying to handle that much information. MySQL is much more efficient. If you look at just a few years at a time, the event file data can be handled in Access just fine, but I was forced into MySQL for event file work because I wanted the whole range of years.

  17. #17
    Join Date
    Aug 2005
    Posts
    12,862
    Blog Entries
    2
    Quote Originally Posted by dabum1
    what's the best way of getting through the retrosheet game data?

    Like if I wanted to look at all the times that pitcher's tried to pickoff a runner, what would you do? Do you use access or something else?

    I'm still learning with access, so I just do see a way to answer this if there is any.
    Well the PBP data on retrosheet isn't that detailed. You can find how many times a player was picked off not how many times it was tried.

    The quickest way is to download the PBP files, follow the directions on retrosheet, and then load the data into access. Access will then walk you through setting up the fields. Retrosheet provides you with the names of what all the fields are. Once done you simply filter for pickoffs.

    Once you learn how to play with the data and pick what fields you want it is probably easier to create a text file that has the names of all the fields you are looking at and then join that text file to the PBP data, then load it into access. That way you don't have to handwrite every field.

  18. #18
    So how would you do something like ERA leaders from 2000 to 2005?

    I thought if i did master.namelast, master.nameFirst, pitching.ERA (group by avg) and pitching.yearID it would work but everytime I select yearID it will just give me records for the top ERA seasons.....so how do you tell access you only want years 2000-2005 but you want the avg era?

  19. #19
    Join Date
    Aug 2005
    Posts
    12,862
    Blog Entries
    2
    That one is easy you set up a filter in the year ID.

    >1999 in the year ID will set it up so that it only looks at years after 1999

    Also a quick one to get names and save on cells is to do this Name: [namefirst] & " " & [namelast]

    Or flip it around and you will get the first and last name combined. Makes it easier when you are copy & pasting.

    Now then it looks like what you want is the average ERA for pitchers in each year, well to do that you don't look for players. You would do yearID, ipouts, and earned runs,and league/team if you want to break it up by league/team. Then select sum for earned runs and IPouts.

    I'm away from access right now but then I believe you would have to save it at this point and then do a new query. In this query you load the query you just saved and select yearID and then type this:
    LgERA: =([ER]/([ipouts]/3))*9 Or something like that. Access is a little bit more finicky the excel so I am not totally sure on the language.

    Note though that when you load the new query some of your fields will say SumofER. To avoid this you would type this into a field ER: [ER] and that should avoid the sum tag so that you don't have to rewrite that in later queries.

    So to break it down. The first query was set up to convert all the data from individual player season data to yearly league data. The second query is the query that would then manipulate that data into what you want the final product to be.

    Sometimes in Access you might end up using4 or 5 different queries to get the final answer.

  20. #20
    why download this when you have everything at baseballreference.com?

    it just takes up memory imo.

  21. #21
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    b-r.com is a great site for quick scans, but it's not personalized...you can't make your own queries...ask your own questions about the data.

  22. #22
    Quote Originally Posted by Ubiquitous

    Now then it looks like what you want is the average ERA for pitchers in each year, well to do that you don't look for players. You would do yearID, ipouts, and earned runs,and league/team if you want to break it up by league/team. Then select sum for earned runs and IPouts.
    Actually what I was looking for was Top ERA Pitchers Between 2000 and 2005
    I guess what I was thinking about was taking every pitcher that pitched between 2000 and 2005 and seeing which ranked at the top.

    Sorta like taking an all-time home run leaders list but only wanting to know from 2000 onward, excluding anything that happened before that point.

    For instance If I did a query on home run leaders it would look something like
    Arron 755
    Ruth 714
    Bonds 708
    etc etc..........but what if I only wanted to look at who hit the most homeruns after 99? If I put it a yearID filter, it will separate my results by year, but im wanting the sum of home runs...just after 99.

    Am i right in thinking that in access....id have to do more then one query to do this?

    PS..sorry for all the questions

  23. #23
    What you want is something like this:

    select playerid, sum(ER)/sum(IPouts)*27 as ERA
    from pitcher
    where year >= 2000 and year <=2005
    group by playerid

    The "where" clause limits your dataset to only what matches there. Once you have the dataset in question, the "select" clause does it calculation. It uses the "group by" to see how it should do it. In this case, it creates one bucket per player. For each bucket, it has a pocket to sum up all the ER, and another pocket to sum up all the IPouts.

    The "group by" can contain multiple fields. If you do that, then you should put those exact same fields in the "select" list.

  24. #24
    The "significant limits" that Matt mentions will only apply to people like him, who will parse through many many years of Retro data. My rough rule of thumb is that MS Access does an excellent job when your master table has 1 million records or less. At 10 million records or more, it starts to choke.

    It also depends how many tables/'records are part of your JOIN.

  25. #25
    Quote Originally Posted by Tango Tiger
    What you want is something like this:

    select playerid, sum(ER)/sum(IPouts)*27 as ERA
    from pitcher
    where year >= 2000 and year <=2005
    group by playerid

    The "where" clause limits your dataset to only what matches there. Once you have the dataset in question, the "select" clause does it calculation. It uses the "group by" to see how it should do it. In this case, it creates one bucket per player. For each bucket, it has a pocket to sum up all the ER, and another pocket to sum up all the IPouts.

    The "group by" can contain multiple fields. If you do that, then you should put those exact same fields in the "select" list.

    Ahhh thanks thats exactly what i was looking for.

Page 1 of 2 12 LastLast

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •