
Sabermetric Book


  • The more I use his book, the less I like it. That's not really his fault; it's more my own. I'm a novice at programming, and a lot of his material assumes a level of expertise that is above mine. I've spent several hours and have accomplished basically nothing directly from the book. It did cause me to branch out on my own and get things done, just not in the way he describes, but it was because of his book.

    For instance, I now have a quick way to load the header into my database instead of having to manually tell the database each field's name every time I create one. I also now have PBP data for each year in a text file, ready to be used in a database, which I did not have before.

    I haven't tried to load the data into MySQL yet because I fear it will cause another wasted half day. I was really hoping the language he used would work in Access, but alas it does not, so that shortcut is denied.


    • Oh, and I even got the roster problem solved. I don't know if you recall, but in the last discussion about EVA files we were having problems dealing with the roster files. Well, that is one script of his that does work rather well. Also, while I was playing around with Tango's tip, I figured out that his .bat file was not properly worded, which was causing some of the problems as well. The way it was worded, each year's ev* files would be extracted into one gigantic file instead of 49 separate files.

      It has to look like this to get individual yearly files:
      bevent -y 1957 1957*.ev* > 1957.txt
      Whereas Tango's tip had this line:
      bevent -y 1957 *.ev* > 1957.txt
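
For anyone who would rather not type the corrected line once per season, here is a minimal sketch that just builds the command text for a range of years. It only uses the command form shown above; the function name bevent_commands and the year range in the demo are my own placeholders, and the output would still need to be pasted into a .bat file.

```python
def bevent_commands(first_year, last_year):
    """Build one bevent command line per season, each writing its own file.

    The command form is copied from the corrected line in the post above;
    the function name and arguments are illustrative only.
    """
    commands = []
    for year in range(first_year, last_year + 1):
        # Prefixing the wildcard with the year is what keeps each
        # season's event files out of the other seasons' output.
        commands.append(f"bevent -y {year} {year}*.ev* > {year}.txt")
    return commands


if __name__ == "__main__":
    # Example range only; use whatever seasons you actually have.
    for line in bevent_commands(1957, 1959):
        print(line)
```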


      • I would agree with you that his assumptions about a person's programming skill are a little lofty...I haven't been able to do much with his hacks either, although I am starting to learn enough MySQL to be able to do some of these things on my own...his "check_field_sizes" hack works REALLY well, and I now use it every time I want to read data into MySQL, for example.


        • It works, but I have finally realized after several missteps that you have to play with it slightly to use it. That is something he neglects to tell you; I think he assumes you know you will have to. The part I am talking about is the LOAD DATA LOCAL INFILE part. In that part you have to tell it where the file is, but he neglects to tell you that. So after numerous head beatings I figured that part out. What I do now is run the script and then just copy and paste the lines into mysql. Works fine like that.

          I finally got everything to work after several nerve-racking hours, and I have finally produced an empirical run expectancy chart based on the PBP from 2004. Hopefully it is right; if somebody like misterdirt can check it, I would appreciate it.

          Bases	0 out	1 out	2 out
          BE	0.54	0.29	0.11
          1st	0.93	0.55	0.24
          2nd	1.17	0.71	0.34
          3rd	1.45	0.97	0.37
          1sta2nd	1.49	0.97	0.46
          1sta3rd	1.86	1.24	0.54
          2nda3rd	2.15	1.48	0.63
          BaseFul	2.27	1.60	0.82


          • Also for some clarification:

            Man on 3rd with no outs has a run expectancy of 1.45, and this situation occurred 509 times. So does this mean that about 738 total runs were scored after teams entered this situation? 509*1.45, right? So how does one figure out the odds of a run scoring with this data? Or does one need other data besides this to figure that out?


            • For that, you would need to ask "how many times in those X number of times where the situation arose did zero runs score?"

              Once you have the empirical odds of NOT scoring, you'll have the odds of scoring at least one run.
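
To make the arithmetic in these two posts concrete, here is a small sketch. The 509 and 1.45 come from the post above; the zero-run count in the demo is a made-up placeholder, since nobody in the thread has posted that number, and the function names are mine.

```python
def total_runs(times_in_state, run_expectancy):
    """Approximate total runs scored after reaching a state: count * RE."""
    return times_in_state * run_expectancy


def prob_at_least_one_run(times_in_state, times_zero_runs):
    """Empirical odds of scoring at least once: 1 - P(no runs score)."""
    if times_in_state <= 0:
        raise ValueError("times_in_state must be positive")
    if not 0 <= times_zero_runs <= times_in_state:
        raise ValueError("times_zero_runs must be between 0 and times_in_state")
    return 1.0 - times_zero_runs / times_in_state


if __name__ == "__main__":
    # 509 occurrences and an RE of 1.45 are the figures quoted above.
    # Because 1.45 is rounded, this is only an approximate run total.
    print(total_runs(509, 1.45))

    # HYPOTHETICAL zero-run count, purely to show the call; the real
    # value has to be counted from the play-by-play data.
    print(prob_at_least_one_run(509, 100))
```

The point is that the RE table alone is not enough: you need one extra count per state, the number of times no run scored for the rest of the inning.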


              • Ub - I ran the numbers for 2004 and got REs consistently lower than yours by about .01 or .02. If you could post the raw numbers of both the counts and additional runs scored for each event, I could check whether the fault lies in your program or mine. I suspect that it is in my count of additional runs scored, as I got 728 for man on third, 0 out, and the same 509 count of events as you did. It is more likely that a program would fail to count a run than double count one.

                Also, usually RE tables exclude all home batting events in the 9th inning or later because of the chance that a walk off run will shorten the inning before all potential runs are scored. But give me the numbers for all events as you have already calculated them.
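
Here is a rough sketch of the calculation being compared, including the usual exclusion of home batting in the 9th or later. The event layout (the keys state, inning, home_batting, runs_rest_of_inning) is my own assumption for illustration, not the BEVENT field layout. It returns the raw count and run total for each state alongside the RE, since those are the numbers being asked for.

```python
from collections import defaultdict


def run_expectancy(events, skip_late_home=True):
    """Return {state: (count, runs, runs / count)} for each base/out state.

    Each event is a dict with HYPOTHETICAL keys:
      state               - any hashable base/out label
      inning              - inning number
      home_batting        - True when the home team is at bat
      runs_rest_of_inning - runs scored from this event to the end
                            of the half inning
    """
    counts = defaultdict(int)
    runs = defaultdict(int)
    for ev in events:
        # Walk-offs can end the inning early, so home half innings from
        # the 9th on are usually dropped from RE tables.
        if skip_late_home and ev["home_batting"] and ev["inning"] >= 9:
            continue
        counts[ev["state"]] += 1
        runs[ev["state"]] += ev["runs_rest_of_inning"]
    return {s: (counts[s], runs[s], runs[s] / counts[s]) for s in counts}
```

Running it once with skip_late_home=True and once with False gives both versions of the table, which makes it easier to see whether a disagreement comes from the late home innings.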


                • Ub - I checked my program. I did have a problem with additional runs scored on walk offs in the 9th inning or later. I had known that when I calculated my RE table several months ago but had ignored it since I was eliminating those innings anyway. I subsequently forgot that I had left those incorrect numbers in there. Try calculating the RE table eliminating home batting after the 8th inning and then we can compare.


                  • Ub - Fixed my problem with additional runs. My chart now looks identical with yours.


                    • Thanks Misterdirt for checking.

                      Thanks Matt, I should have figured that one out. Hopefully I can blame it on it being late at night.

                      Now that I have done it, hopefully I can play around with his scripts to do the rest of what I want to do. I also should be able to come up with linear weights as well. But then there will be some regression issues; he uses a program called R, so we'll see.


                      • I'm able to do the linear weights value of a home run. That was pretty easy, and I am thinking that a walk and a triple are going to be simple as well. But I imagine that everything else is a little more involved and that you need more data than just the RE and how many times each situation occurred.
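
For reference, a sketch of why the home run is the easy case: the ending state is always bases empty with the same outs, so its value in any one state is the runs scored plus RE(empty) minus RE(start). The RE numbers below are copied from the 2004 chart earlier in the thread. The state labels, function names, and the weights argument are my own, and whether you weight by how often each situation occurred or by how often home runs were actually hit in it is left to the caller.

```python
# RE values from the 2004 chart posted earlier in the thread,
# indexed as RE[bases][outs]. The labels are illustrative.
RE = {
    "empty":   (0.54, 0.29, 0.11),
    "1st":     (0.93, 0.55, 0.24),
    "2nd":     (1.17, 0.71, 0.34),
    "3rd":     (1.45, 0.97, 0.37),
    "1st_2nd": (1.49, 0.97, 0.46),
    "1st_3rd": (1.86, 1.24, 0.54),
    "2nd_3rd": (2.15, 1.48, 0.63),
    "loaded":  (2.27, 1.60, 0.82),
}

# Number of runners on base in each configuration.
RUNNERS = {"empty": 0, "1st": 1, "2nd": 1, "3rd": 1,
           "1st_2nd": 2, "1st_3rd": 2, "2nd_3rd": 2, "loaded": 3}


def home_run_value(bases, outs):
    """Run value of a home run hit in one base/out state."""
    runs_scored = RUNNERS[bases] + 1  # every runner plus the batter
    # After a home run the bases are empty and the outs are unchanged.
    return runs_scored + RE["empty"][outs] - RE[bases][outs]


def average_home_run_value(weights):
    """Weighted average over states; weights maps (bases, outs) -> count."""
    total = sum(weights.values())
    return sum(home_run_value(b, o) * n for (b, o), n in weights.items()) / total
```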


                        • What you need to calculate every linear weight is the average starting RE for each event and the average finishing RE for those events...

                          The PBP database includes all state data at the start of the event and the destination info for every play, so what you need is a script that creates a single number that represents the base/out state before and then another single number that represents the base/out state after each play...then you can just ask how many times each specific event resulted in each unique change in base/out state...since you know what the change in RE is between any two base/out states you can get an answer from there.


                          • Yes but creating the script is the part I'm not familiar with.


                            • Matt, you are right on. What you should do in a database is have 9 fields (which can be collapsed into two if you want to get fancy). Start1B, Start2B, Start3B, StartOuts, End1B, End2B, End3B, EndOuts, RunsOnPlay.

                              All the base fields should be set to 1 or 0, or True/False. All the starting fields and the RunsOnPlay should probably be available directly from BEVENT (been a while for me).

                              For the ending, I think you are told where a runner ends up, in BEVENT, right? So, something like

                              UPDATE eventTable
                              SET End1B = 1
                              WHERE DestBatter = 1;

                              UPDATE eventTable
                              SET End2B = 1
                              WHERE DestBatter = 2
                              OR Dest1B = 2;


                              (Also make all the Dest fields 0 when EndOuts = 3.)

                              So, what you've done here is established the starting and ending states for each event.

                              Put your RE matrix in a table (24 rows) that looks like:
                              Runner1B, Runner2B, Runner3B, Outs, RE

                              (Add a 25th row for Outs = 3)

                              Then you join
                              SELECT s.RE, e.RE
                              FROM eventTable et, reMatrix s, reMatrix e
                              WHERE s.Runner1B = et.Start1B
                              AND e.Runner1B = et.End1B
                              etc, etc, etc

                              That'll give you the starting and ending RE for every event. The difference is your Linear Weights. (This is the derivation of Table 6 in The Book.)
                              Author of THE BOOK -- Playing The Percentages In Baseball
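
Here is one way the join above could be spelled out in full, run against an in-memory SQLite database so it can be tried without setting up MySQL. The table and column names follow the post. The EventType column is my own addition so the result can be averaged per event, and adding RunsOnPlay to the RE difference is my reading of why that field is in the list of nine. Treat it as a sketch under those assumptions.

```python
import sqlite3


def linear_weights(events, re_rows):
    """Return [(EventType, count, average run value)] from a self-join.

    events:  rows of (EventType, Start1B, Start2B, Start3B, StartOuts,
                      End1B, End2B, End3B, EndOuts, RunsOnPlay)
    re_rows: rows of (Runner1B, Runner2B, Runner3B, Outs, RE), including
             the extra row for Outs = 3 with RE 0.
    EventType is a HYPOTHETICAL column added for grouping.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE reMatrix (Runner1B INT, Runner2B INT, "
                "Runner3B INT, Outs INT, RE REAL)")
    con.execute("CREATE TABLE eventTable (EventType TEXT, Start1B INT, "
                "Start2B INT, Start3B INT, StartOuts INT, End1B INT, "
                "End2B INT, End3B INT, EndOuts INT, RunsOnPlay INT)")
    con.executemany("INSERT INTO reMatrix VALUES (?,?,?,?,?)", re_rows)
    con.executemany("INSERT INTO eventTable VALUES (?,?,?,?,?,?,?,?,?,?)",
                    events)
    # s is the RE row for the starting state, e for the ending state.
    query = """
        SELECT et.EventType, COUNT(*),
               AVG(e.RE - s.RE + et.RunsOnPlay)
        FROM eventTable et
        JOIN reMatrix s
          ON s.Runner1B = et.Start1B AND s.Runner2B = et.Start2B
         AND s.Runner3B = et.Start3B AND s.Outs = et.StartOuts
        JOIN reMatrix e
          ON e.Runner1B = et.End1B AND e.Runner2B = et.End2B
         AND e.Runner3B = et.End3B AND e.Outs = et.EndOuts
        GROUP BY et.EventType
        ORDER BY et.EventType
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows
```

Any event whose start or end state has no matching row in reMatrix drops out of the inner joins, which is one reason the 25th row for Outs = 3 matters.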


                              • See, I would just create one column called StartBO and one called EndBO, where each one goes from 0 to 24 (0 = bases empty none out, 1 = bases empty 1 out...etc...and 24 = 3 outs), and then one column for the event type...then you just

                                GROUP BY StartBO, EndBO, EventType

                                and COUNT the number that fit in each unique grouping and SUM the number of runs scored on each play.
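
A sketch of the single-number encoding and the grouping, for anyone who wants to try it outside the database first. The post only pins down 0, 1, and 24; the order of the base configurations in between is my own assumption, as are the function names.

```python
from collections import defaultdict


def encode_state(on_1b, on_2b, on_3b, outs):
    """Collapse a base/out state into one number from 0 to 24.

    0 = bases empty none out, 1 = bases empty one out, 24 = three outs,
    as in the post. The ordering of the other base configurations
    (1st, then 2nd, then 1st and 2nd, ...) is an ASSUMPTION.
    """
    if outs >= 3:
        return 24
    base_index = int(bool(on_1b)) + 2 * int(bool(on_2b)) + 4 * int(bool(on_3b))
    return base_index * 3 + outs


def group_transitions(plays):
    """Mimic GROUP BY StartBO, EndBO, EventType with COUNT and SUM(runs).

    plays: iterable of (start_bo, end_bo, event_type, runs_on_play).
    Returns {(start_bo, end_bo, event_type): (count, total_runs)}.
    """
    groups = defaultdict(lambda: [0, 0])
    for start_bo, end_bo, event_type, runs in plays:
        group = groups[(start_bo, end_bo, event_type)]
        group[0] += 1       # COUNT
        group[1] += runs    # SUM of runs scored on the play
    return {key: tuple(value) for key, value in groups.items()}
```

Since the RE difference between any two encoded states is known, each group's count and run total are enough to work out an average run value per event type.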

