07-07-2006, 08:07 AM
You can double-check your work against this:
I'm not sure if you or Ruane do what I do (remove the home half of the 9th and later innings, and remove any partial innings).
As for how to figure out runs to end of inning, it was posted in a SQL here several threads ago. For each half-inning, you need to know the total number of runs scored. That's in one table. For each play, you need to know how many runs have scored so far, before the PA. Runs to end of inning (REOI) is simply the difference.
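A minimal Python sketch of the REOI calculation described above, not the SQL that was posted earlier. The field names (game_id, inning, half, runs_on_play) are made up for illustration, and the plays are assumed to be listed in chronological order.

```python
from collections import defaultdict


def runs_to_end_of_inning(plays):
    """Return REOI for each play: half-inning total minus runs scored before the PA.

    `plays` is a chronological list of dicts with the hypothetical keys
    game_id, inning, half ("top"/"bottom") and runs_on_play.
    """
    # Pass 1: total runs scored in each half-inning (the "one table").
    totals = defaultdict(int)
    for play in plays:
        key = (play["game_id"], play["inning"], play["half"])
        totals[key] += play["runs_on_play"]

    # Pass 2: running count of runs scored so far, before each PA.
    scored_so_far = defaultdict(int)
    reoi = []
    for play in plays:
        key = (play["game_id"], play["inning"], play["half"])
        reoi.append(totals[key] - scored_so_far[key])
        scored_so_far[key] += play["runs_on_play"]
    return reoi
```

Averaging these REOI values by base/out state would then give a run expectancy table.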
07-07-2006, 08:20 AM
Tango - In step 4 of hack #10, I did everything it said: I unzipped the file BDB-sql-2005-08-02.sql and imported it into MySQL using the command he said to use. After that, I went to step 5 to check that everything was there. He says to type show tables; after mysql>, and so I did, but it says "no database selected". I've been having this problem since yesterday. Any suggestions?
07-07-2006, 08:31 AM
Tango...I now have Linear Weights calculated and it looks very very similar.
I didn't throw out the bottom of the 9th inning or incomplete innings because I don't believe in throwing things out if I don't have to...although I understand why you did it (because often the bottom of the ninth ends with men still on base and less than three outs because the game is over)...it didn't make much of a difference. We're talking maybe 1/1000th of a run in the extreme and I don't see many examples that rise to that level.
The next step would be to find a way to extend my LW table back to 1871.
07-07-2006, 10:05 AM
I agree that you shouldn't throw things out if you don't have to. But, in this case, you do have to. You have guys on base, which means there is run potential on base. By leaving those innings in, you are essentially saying that all the runners left on base have been put out. As well, by including the bottom half of the ninth inning, and if that inning goes to three outs, you have a selective sampling issue since you know that the maximum number of runs scored did not exceed the run differential.
By your own reasoning that you shouldn't throw things out if you don't have to, you also should not keep data that you need to throw out.
As for the .001 runs in the extreme: you can figure it out exactly by comparing your 99-02 data to my data.
As for extending it back, we already know how to do that, right? I'm going to release my RE simple-Markov program soon enough. Still don't know when, but hopefully before the season ends.
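The filtering rule argued for earlier in this post (drop the home half of the 9th and later innings, and drop any partial inning) could be sketched like this in Python. The function name and its arguments are hypothetical; the number of outs recorded is assumed to be known for each half-inning.

```python
def keep_half_inning(inning, home_batting, outs_recorded):
    """Return True if a half-inning should stay in the run expectancy sample.

    inning        -- inning number (1, 2, ...)
    home_batting  -- True for the bottom half of the inning
    outs_recorded -- outs made before the half-inning ended
    """
    # Home half of the 9th or later: scoring is capped by the run
    # differential, so it is dropped even when it runs to three outs.
    if home_batting and inning >= 9:
        return False
    # Partial innings leave runners on base who were never put out.
    return outs_recorded == 3
```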
07-07-2006, 10:08 AM
the command you need to run to get the show tables command to work is
"USE bbdatabank;" (without the quotes).
this tells mysql to use the bbdatabank database that you created. In mysql you can create multiple databases so that command informs mysql which one you want to use.
so for hack 10, step 5 should be:
mysql> USE bbdatabank;
mysql> show tables;
| Tables_in_bbdatabank |
hope this helps.
07-07-2006, 10:30 AM
current season pbp data
I have been reading this thread for a while and was motivated to contribute.
I am a software engineer by trade but my sabermetric knowledge is only in the infant stage.
I have modified the code from baseball hacks to get current season data and it seems to be working. I do not have the pbp stuff working yet but I am fairly certain that with some time I can get it working. I modified the scripts for my personal use but after seeing how much time you guys have been spending on this I want to contribute somehow (so you guys can get back to finding out interesting stuff for me to think about). I can work on getting this to wherever it would be easiest for you guys to use.
I have in-laws in town this weekend so I will try to steal away some time and clean up the code and figure out where would be the best place to put it.
The one thing that I have not done, which would appear to be necessary, is to check the files for data consistency. I am more than willing to work with whoever to do the work on the coding side (something I know), but I will probably need some direction on what to look for (i.e. the switch hitter problem and the like).
Once we get this working to everyone's satisfaction then I can start working on parsing code to parse the pbp data for the current season.
If you guys could give a direction or set of priorities on what is most important to you, then I can start working on this. Thanks for all your research; those of us who don't chime in much do appreciate it.
07-07-2006, 10:40 AM
I might tweak my RE method to account for 9th innings, but I think what I would do is calculate RE without those innings and then put them back in and assert that the expected number of runs at the end of the 9th scored.
IOW, if you end the ninth with the bases loaded and 2 outs on a walk-off single (bases were loaded before...by default the scoring assumes no additional advances at the end of a game that wouldn't be necessary to push across the winning run)...then you have an RE of about .5 or .6, depending on the scoring environment of the game, "dangling"...2 outs/bases loaded scores half a run each time it is left, so you could assume the ninth inning has the expected score.
I assume your simple Markov program does not include SB and CS and the other baserunning events...that's going to make it have limited use to me because I need to cover all of the events, but I'll nonetheless be looking forward to its release...I can find other ways to extrapolate the value of the baserunning events.
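A rough Python sketch of that idea: credit a walk-off half-inning with its actual runs plus the run expectancy of the base/out state left "dangling". The function name, the state encoding, and the 0.55 value are placeholders (the value is taken only from the "about .5 or .6" figure above, not from a real RE table).

```python
def credited_inning_runs(actual_runs, end_state, run_expectancy):
    """Actual runs plus the RE of the state left dangling by a walk-off.

    end_state      -- (outs, bases) when the game ended, or None if the
                      half-inning ended normally on the third out
    run_expectancy -- dict mapping (outs, bases) to expected runs, built
                      from innings that were played to completion
    """
    if end_state is None:
        return float(actual_runs)
    return actual_runs + run_expectancy[end_state]


# Placeholder table: 2 outs, bases loaded worth roughly half a run.
example_re = {(2, "123"): 0.55}
walkoff_total = credited_inning_runs(1, (2, "123"), example_re)
```

Note that this assumes a league-average outcome for the stranded runners.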
07-07-2006, 10:47 AM
It's great to see a lurker show up and volunteer his time like this. If you look at the kind of data that is available in 2005's retrosheet event files, you'll get a good idea of what we're looking for in current-season PBP info. We'd want to be able to integrate that data into the retrosheet database if at all possible, so you'd have to find a way to map MLB.com player identifications to the retroID system.
Originally Posted by cell76
I'm sure others will have suggestions for what they want, but that's my ultimate agenda with the current-season data...I want to be able to work it into my existing DB.
07-07-2006, 10:52 AM
I have a question for you guys too BTW.
As I've been preparing my database including the gamelog files, the event files, and the baseball-databank and moving it toward a more compact form, I've been saving the commands one at a time into a master database generation file.
I will make that file, along with the necessary tables you have to start with in order to make the SQL script work, available to anyone who wants it...I believe it may save some of you valuable time. I will warn you that running it will take a LONG time...you'd better have a pretty good computer with a LOT of RAM (at LEAST 1 Gig, though I ran each command one at a time on a computer with 2 gigs and estimate that running the whole thing in order will take something like 3 or 5 hours on this machine)...
It handles a PBP database that is 7,400,694 records large, a gamelog file that is 186,000+ records long, and the baseball-databank which is not in and of itself trivial so there is no avoiding it taking a long time to process.
But now that I have it, I'm happy to share if anyone else wants to work with the same database I have.
07-07-2006, 11:13 AM
Matt, I suggest posting it at baseball-databank, so that we can all see it, and make suggestions/improvements.
07-07-2006, 11:15 AM
Incidentally, here's my problem with dropping all bottom of the 9ths.
The league's scoring environment (what we're trying to define by these linear weights) INCLUDES a team's ability to score in the bottom of the 9th inning or in extra innings. And if you look at scoring rates at the 9th and extra frames I'm betting you'll find that it's not the same as the first few innings...it therefore should be represented as much as possible.
I could see my way to excluding all half-innings that did not end with out #3, but I can't see excluding bottom 9ths that ended with the third out being scored.
Now the problem with that is that as you say, there's a selection bias when you take one but not the other, because now you're choosing only the bottom 9s that didn't result in game winning rallies, which probably lowers the overall run scoring rate in the 9th inning.
You can't throw out all bottom 9s because the scoring environment then is different than it is when the starter is in there. You can't throw out only some of them because it creates a selection bias.
There's got to be a way to handle incomplete bottom 9s that doesn't bias the data.
07-07-2006, 11:19 AM
Originally Posted by Tango Tiger
Heck you might be able to improve the efficiency of my code (I'm sure you would be able).
It's as annotated as I thought appropriate with comment tags, so you can see what I was trying to do with each set of queries.
To run it, you need the event files in a very specific format I chose at the start of this (it was not the default arrangement...I threw a lot of the needless extra information away and included the eventNum field)...so I will have to provide that file as well (it's huge though...don't think I can post that to Yahoo)
07-07-2006, 11:54 AM
Actually, provide the batch file that created that file.
I agree about your point about throwing out the ninth inning, since those would be a non-random sample (Mariano Rivera, Hoffman, etc). Your solution is one that could be acceptable (by using the RE matrix to "assume" what would have happened to the runners left on base when the home team wins). Of course, if you do that, you assume a league average pitcher for those cases.
In the past, this is what I did, calling it "implied runs" and "implied outs", with those LOB guys. But I was also interested in the run frequency matrix (Tables 8, 9 in The Book), and doing this "implied" thing was a bother.
It comes down to the lesser of two evils: throwing things out, like I do now, or "implying" what would have happened, as I used to do.
Keeping as-is (implying that the runners on base were put out) is not an option (thank you Bill Paxton).
07-07-2006, 11:54 AM
Yeah...I just proved that both my gamelog master table and my event file master table are too large to post at Yahoo...(and I can't send them by e-mail either)...
What do you propose I do to post that information so that people can DL it and use it to run the script?
07-07-2006, 12:03 PM
To create the allpbp.csv file I just gave this command in the command line prompt:
BEVENT -f 0, 2, 4, 8, 9, 10, 14, 26, 27, 28, 33, 34, 35, 38, 39, 40, 46, 47, 58, 59, 60, 61, 96 -y 1957- *.EV* > allpbp.csv
Those are the fields I accepted when I loaded the database and, given that all of the .eva and .evn files were together in the same directory...they all loaded into one table.
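For anyone loading that file outside of SQL, here is a small Python sketch that keys each column by the BEVENT field number from the command above. It only assumes the output is ordinary comma-separated text with one play per line and the fields in the order requested; the function name is made up, and no field meanings are assumed.

```python
import csv

# Field numbers exactly as listed in the BEVENT command above.
FIELDS = [0, 2, 4, 8, 9, 10, 14, 26, 27, 28, 33, 34, 35,
          38, 39, 40, 46, 47, 58, 59, 60, 61, 96]


def read_bevent_rows(lines):
    """Yield one dict per play, mapping BEVENT field number to its string value.

    `lines` is any iterable of text lines, e.g. an open allpbp.csv file.
    """
    for row in csv.reader(lines):
        if len(row) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} columns, got {len(row)}")
        yield dict(zip(FIELDS, row))
```

Usage would look like `with open("allpbp.csv", newline="") as f: plays = list(read_bevent_rows(f))`.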
07-07-2006, 01:03 PM
That's what I suggest you do, just like Adler. Provide your scripts, so that people can recreate your database. You should not be providing the raw data.
07-07-2006, 01:26 PM
Adler provided scripts? What'd he work on?
the gamelog text file produced by following the instructions in Baseball Hacks for creating that table, incidentally. I'll just leave those instructions at the site.
07-08-2006, 07:02 AM
OK Tango...I'm curious what you think of some of these gut reactions to your "short-cut RE method"...
I'm thinking the further back in time you go, the less valid your approach will be because you don't account for the ever-increasing error rate creating extra baserunners at second and third (you only use the double and triple rates). I'm thinking it's going to be tough to get a good handle on the 19th century game because a lot of un-recorded events cause baserunner advances and run scoring that I have no way of accounting for. I'm wondering what you think the best approach might be to using modern data on the average results of an error to make projections about the expected results of errors in the early history of the game.
07-08-2006, 05:40 PM
Originally Posted by cell76
I did that and it says "empty set". I don't understand why. I put in EXACTLY what he said to put in for step #4, "import the database". I wait a few minutes like it says, then go to step #5 and do what you said, and still nothing happens.
07-10-2006, 08:13 AM
Matt, actually we know how many errors the defense has committed, right? You can make a reasonable guess at the percentage of those errors being "reaching base" errors.
As well, you can try to figure it out by remembering that, by the end of the inning, every batter is either out, LOB, or scored. You can try to use current data to figure out how often a batter is out on base, etc.
You don't have the data, so you make a best guess.
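The bookkeeping identity mentioned above (every batter ends the inning out, left on base, or scored) can be written as a one-line check. This is only a sketch in Python with made-up names; it treats outs as all outs made by the batting side, whether at the plate or on the bases.

```python
def left_on_base(plate_appearances, runs, outs):
    """Solve PA = outs + LOB + runs for LOB.

    Works for a single half-inning or for season totals, as long as the
    three inputs cover the same set of innings.
    """
    lob = plate_appearances - runs - outs
    if lob < 0:
        raise ValueError("inputs are inconsistent: PA < runs + outs")
    return lob
```

Where one of the three terms is missing in older seasons, the other two bound it, which is what makes a best guess possible.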
07-10-2006, 08:28 AM
I also have some questions about the PBP data, Tango, if you don't mind.
The way things are written in the event file documentation, it's ambiguous to me how certain events are recorded.
Let's say you have a runner at first and next batter hits a single...then the runner at first gets thrown out trying to get to third on the hit. Is that recorded as one event (a single) and no indication of how the other runner was wiped out? Or is that recorded as two events (single, runners at first and second, "other advance" runner removed from second)?
Basically I'm asking if a play is complex and involves adding and removing runners at the same time, how is it recorded in the PBP database?
07-10-2006, 08:29 AM
Also...if there's a hit and a subsequent error on the same play allowing runners to advance, does that get recorded as one event or two?
07-10-2006, 11:12 AM
In the event files, everything, I believe, is one event. However, BEVENT may split that into multiple plays.
I suggest you go through this documentation, as John Jarvis did a great job at explaining the data:
07-10-2006, 11:40 AM
Alright...it's all one event in the event file. How do I confirm whether BEVENT does the breaking up of the play...because if it doesn't, I'm going to find it nearly impossible to actually use the event file data.
07-10-2006, 01:09 PM
It'll definitely break up some plays, and others not. I'm sure if you post your question at Retrolist, someone will answer. You can also download
and see if you can make sense of the source code.