Page 6 of 14 FirstFirst ... 45678 ... LastLast
Results 126 to 150 of 327

Thread: Sabermetric Book

  1. #126
    Your last post seems a much more thoughtful and balanced response. I agree with everything in it. I also use RE tables as a basis for linear weights to be able to project a player's future performance. I don't know whether Tango uses empirical data or a Markov chain to produce the RE tables that he has presented. I use empirical data aggregated over 3 seasons and do not regress the data to any larger aggregation. There is absolutely no difference between the linear weights that Tango generates from his RE table and the linear weights that I generate from my RE table.

    I suspect that you have not actually gone through the process of creating linear weights from an RE table, or you would understand why there would not be a difference. Even if you have anomalous values in some of the base-out states because of small sample sizes for those states, the resulting linear weights are unaffected. This is because when you are determining the value of an offensive event (for example, a single), you are multiplying the change in base-out state value by the number of occurrences of that base-out state and then adding the resulting values over all the base-out states. Since the number of occurrences of the anomalous base-out states is small (that is why you are having the anomalies to begin with), their effect on the total run value of a single does not register within the level of precision used in linear weights. Try calculating a linear weights value with a manually altered value for the man on third 0 outs state and you will see what I mean.
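
    To make the suggested experiment concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the RE values and state counts are made-up placeholders (not Tango's table or anyone's real data), the names `RE`, `COUNT`, `after_single`, and `single_value` are hypothetical, and a single is modeled crudely as every runner moving up exactly one base. It weights the change in run expectancy by how often each base-out state occurs, then repeats the calculation with the man-on-third, 0-outs cell altered by 0.20 runs.

```python
# Placeholder run-expectancy values by base state, for 0, 1, and 2 outs.
# Illustrative numbers only, not a published table.
RE = {
    "___": (0.50, 0.27, 0.10),
    "1__": (0.88, 0.52, 0.22),
    "_2_": (1.12, 0.68, 0.32),
    "__3": (1.38, 0.95, 0.36),
    "12_": (1.48, 0.92, 0.44),
    "1_3": (1.78, 1.17, 0.50),
    "_23": (1.98, 1.38, 0.58),
    "123": (2.30, 1.55, 0.76),
}

# Hypothetical counts of plate appearances starting in each state (0, 1, 2 outs).
COUNT = {
    "___": (45000, 32000, 25000),
    "1__": (11000, 13000, 13000),
    "_2_": (3000, 6000, 8000),
    "__3": (500, 2000, 3000),
    "12_": (2500, 5000, 6000),
    "1_3": (1000, 2500, 3500),
    "_23": (600, 1700, 2000),
    "123": (700, 2000, 2500),
}


def after_single(bases):
    """Simplified single: each runner advances one base, batter to first."""
    on1, on2, on3 = (c != "_" for c in bases)
    runs = int(on3)  # only the runner on third scores in this toy model
    new_bases = "1" + ("2" if on1 else "_") + ("3" if on2 else "_")
    return new_bases, runs


def single_value(re_table, counts):
    """Occurrence-weighted average change in run expectancy for a single."""
    weighted = 0.0
    total = 0
    for bases, per_out in counts.items():
        new_bases, runs = after_single(bases)
        for outs, n in enumerate(per_out):
            delta = re_table[new_bases][outs] - re_table[bases][outs] + runs
            weighted += n * delta
            total += n
    return weighted / total


# Manually alter the man-on-third, 0-outs cell by 0.20 runs.
altered = dict(RE)
altered["__3"] = (RE["__3"][0] + 0.20,) + RE["__3"][1:]

base_value = single_value(RE, COUNT)
altered_value = single_value(altered, COUNT)
print(f"single, original table: {base_value:.4f}")
print(f"single, altered table:  {altered_value:.4f}")
print(f"difference:             {altered_value - base_value:+.4f}")
```

    With these placeholder counts, the altered cell accounts for roughly a quarter of one percent of plate appearances, so the shift in the single's value stays well under .001 runs.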

    So there is really no gain in regressing to try and remove those anomalies for those who create linear weights from RE tables. There is, however, a loss for people like Ub who are looking at the team level for explanations for how a team might be better or worse at creating runs. Regressing to remove the anomalies removes just the data he is looking for, i.e. the differences between that team and an average team. He still must investigate whether those differences in data are based on actual differences within the team or just on sample size but at least he has a starting point for his investigations.

  2. #127
    You guys did a great job with your posts, especially the last two. I'll go through all the posts, and add whatever clarity I can.

  3. #128
    Why do you use the triple rate for br3? A lot of runners reach third base who had nothing at all to do with tripling.
    br = br1+br2+br3

    br is the number of initial baserunners. That is, where did the batter land. br3 is the number of times the batter landed on third base, so that's essentially his triples.

    br2 is basically his doubles, but we should include his reaching on 2b errors, and getting to 2b on throws to other bases.

    HRs are split 36%, 32.6%, 31.4%. I couldn't figure out a quick way to directly compute baserunners but PA's are split 34.5%, 33.1%, and 32.4%.
    Right, I'm trying to keep things nice and simple. The biggest non-randomness is with walks, as those are given a lot more with 1b open than not. Pitchers and hitters adjust based on the context. They implicitly understand the linear weights by the 24 base-out states, and understand each event has a different impact, which is why walks, the easiest of the outcomes to control, occur so non-randomly.

    Tango said:
    As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.
    This statement is absolutely wrong and the reason that it is wrong is what leads to other misuses of the RE table.
    This is a very good point, and deserves more explanation. When a reader looks at the RE table, all he sees is a static table, a table that is presumed to come from something whereby every cell in the table is affected a certain way.

    In reality, each value in the table is in fact computed independently of the other 23. They are based on empirical data, and are nothing more than samples of reality (the way 600 PA from Todd Helton and 600 PA from Jose Cruz, Sr. are nothing more than samples of their performance, one mostly at Coors, and one mostly at the Astrodome).

    The empirical RE table simply says: "of the times that a runner happened to be on third base and there were two outs, how many runs ended up scoring for that inning". (And same question for first base). It's clear that it's not the exact same players in the same number of PA that make up both samples. And even if it were, the size of the sample would be so small as to have a huge margin of error.

    These empirical RE tables should come with a margin of error, or a reader has to be experienced enough to see the RE-man-on-3b-2-outs and the RE-man-on-1b-2-outs in the same light he'd see Todd Helton and Jose Cruz numbers, if stacked side-by-side.
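
    As a rough illustration of attaching a margin of error to a single RE cell, here is a small Python sketch. It assumes each inning in the sample is independent and uses a normal approximation (mean plus or minus 1.96 standard errors), which is only approximate because runs per inning are skewed. The function name `re_with_margin` and the sample data are hypothetical.

```python
import math
import statistics


def re_with_margin(runs_per_inning, z=1.96):
    """Return (mean runs, approximate 95% margin of error) for one base-out cell."""
    n = len(runs_per_inning)
    mean = statistics.mean(runs_per_inning)
    std_err = statistics.stdev(runs_per_inning) / math.sqrt(n)
    return mean, z * std_err


# Hypothetical samples: runs scored in the rest of the inning from one state.
small_sample = [0, 0, 1, 0, 2, 0, 0, 1]   # a handful of occurrences
large_sample = small_sample * 50          # same mix, 50 times as many innings

for label, sample in (("small", small_sample), ("large", large_sample)):
    mean, margin = re_with_margin(sample)
    print(f"{label}: RE = {mean:.3f} +/- {margin:.3f} (n = {len(sample)})")
```

    Both samples have the same average, but the interval around the small one is far wider, which is the sense in which two cells of a one-season team table may not be distinguishable from each other.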

    When I present RE tables, it only makes sense to use them if they've been adjusted, or the sample size is so large as to make the margin of error very small.


    If there's something else that I didn't address, let me know.

  4. #129
    Pitchers and hitters adjust based on the context. They implicitly understand the linear weights by the 24 base-out states, and understand each event has a different impact, which is why walks, the easiest of the outcomes to control, occur so non-randomly.

    True, and because pitchers have more control over the progress of the PA than batters, walks occur more frequently in situations that hurt the defensive team less.

    Right, I'm trying to keep things nice and simple.

    I can understand the desire to keep things simple and your method succeeds admirably at that. I was just trying to understand logically how it works and if there was any additional information that was available in the pre PBP era that could be incorporated. For some research purposes it would seem like all available information should be incorporated even if it makes the process slightly more complicated and the gains are minimal.

    Do you also have an estimation process for the number of occurrences of each base-out state? If not, how can you convert the RE table into an estimation of linear weights?

  5. #130
    Tango said:

    As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.


    This statement is absolutely wrong and the reason that it is wrong is what leads to other misuses of the RE table.


    What I was trying to express here, and doing a poor job of it, was that the RE tables don't show how easy it is to score from a base. How easy it is to score from a base is shown in a ONE_RUN+ table. Something I know that you know because you have calculated them. But a fact that is often forgotten when the uses of RE tables are discussed.
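
    A tiny Python sketch of the distinction, using invented numbers: run expectancy is the average number of runs scored in the rest of the inning, while the chance of scoring is the fraction of innings in which at least one run scores. The two samples and the name `summarize` are hypothetical and chosen only to show that the two measures can point in opposite directions.

```python
def summarize(runs_after_state):
    """Return (run expectancy, probability of scoring at least one run)."""
    n = len(runs_after_state)
    run_expectancy = sum(runs_after_state) / n
    p_score = sum(1 for r in runs_after_state if r > 0) / n
    return run_expectancy, p_score


# Invented rest-of-inning run totals for two states with 2 outs.
from_first = [0, 0, 0, 0, 0, 0, 0, 2, 0, 3]   # rare but big innings
from_third = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # more frequent single runs

for label, sample in (("runner on 1B", from_first), ("runner on 3B", from_third)):
    re_value, p_score = summarize(sample)
    print(f"{label}: RE = {re_value:.2f}, P(at least one run) = {p_score:.2f}")
```

    In this made-up data the runner-on-first state has the higher RE while the runner-on-third state has the higher chance of scoring, so a higher RE cell does not by itself say it is easier to score from that base.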

  6. #131
    Quote Originally Posted by misterdirt
    Tango said:

    As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.


    This statement is absolutely wrong and the reason that it is wrong is what leads to other misuses of the RE table.


    What I was trying to express here, and doing a poor job of it, was that the RE tables don't show how easy it is to score from a base. How easy it is to score from a base is shown in a ONE_RUN+ table. Something I know that you know because you have calculated them. But a fact that is often forgotten when the uses of RE tables are discussed.
    Yes, absolutely. You can *fairly* say that this is a true statement, for the lead-runner states (xx3, 1x3, x23, 123 for chance of scoring from 3B), (x2x, 12x for chance of scoring from 2B), (1xx for chance of scoring from 1B).

    If you look in the book, table 9, you will probably find the chance of scoring at least one run (or 1 minus chance of scoring no runs), is around the .875 level for 3B, using any of those 4 base states, and 0 outs. It will probably follow similarly for the other 2 out states. It's not exactly the same, since bases loaded will allow the runner from 3B to score from a walk, and the x23 just needs two walks instead of the three that xx3 would need.

    Studying Tables 9 and 10 is certainly something that is hugely recommended in following along.

  7. #132
    Quote Originally Posted by misterdirt
    I can understand the desire to keep things simple and your method succeeds admirably at that. I was just trying to understand logically how it works and if there was any additional information that was available in the pre PBP era that could be incorporated. For some research purposes it would seem like all available information should be incorporated even if it makes the process slightly more complicated and the gains are minimal.
    I have a basic Markov program that generates what you want here. It doesn't have basestealing or other non-batter events. It's pretty cool, because it simply takes a team's batting line, and generates the RE table from the Markov program (takes one second to run). I'll eventually release the code to the public.

    So, we have a few ways to do the RE tables. The shortcut way, this basic Markov program, and then a really complex Markov.
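
    Tango's program is not public, so the following is not his code. It is only a minimal Python sketch of the general idea of a basic Markov approach: take per-PA event probabilities (here a hypothetical batting line in `PROBS`), apply fixed advancement rules, and iterate until the expected runs for each of the 24 base-out states settle. The advancement rules are deliberate simplifications (runners take exactly one base on a single, two on a double, no steals, no errors, no double plays), and all names are assumptions.

```python
from itertools import product

# Hypothetical per-PA event probabilities (they sum to 1).
PROBS = {"out": 0.68, "bb": 0.09, "1b": 0.15, "2b": 0.045, "3b": 0.005, "hr": 0.03}


def advance(bases, event):
    """Return (new_bases, runs) for a non-out event under simplified rules."""
    b1, b2, b3 = bases
    if event == "bb":  # only forced runners move
        if b1 and b2 and b3:
            return (1, 1, 1), 1
        if b1 and b2:
            return (1, 1, 1), 0
        if b1:
            return (1, 1, b3), 0
        return (1, b2, b3), 0
    if event == "1b":
        return (1, b1, b2), b3
    if event == "2b":
        return (0, 1, b1), b2 + b3
    if event == "3b":
        return (0, 0, 1), b1 + b2 + b3
    if event == "hr":
        return (0, 0, 0), 1 + b1 + b2 + b3
    raise ValueError(f"unknown event: {event}")


def run_expectancy(probs, sweeps=200):
    """Iterate the expected-runs equations over all 24 base-out states."""
    states = [(b, o) for b in product((0, 1), repeat=3) for o in range(3)]
    re = {s: 0.0 for s in states}
    for _ in range(sweeps):
        updated = {}
        for bases, outs in states:
            total = 0.0
            for event, p in probs.items():
                if event == "out":
                    # Third out ends the inning, so nothing more is expected.
                    total += p * (re[(bases, outs + 1)] if outs < 2 else 0.0)
                else:
                    new_bases, runs = advance(bases, event)
                    total += p * (runs + re[(new_bases, outs)])
            updated[(bases, outs)] = total
        re = updated
    return re


table = run_expectancy(PROBS)
for bases in product((0, 1), repeat=3):
    row = "  ".join(f"{table[(bases, o)]:.3f}" for o in range(3))
    print(bases, row)
```

    Because every cell comes from the same event probabilities, a table built this way cannot show the small-sample oddities of an empirical one; the trade-off is that it is only as realistic as its advancement rules.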

    Do you also have an estimation process for the number of occurrences of each base-out state? If not, how can you convert the RE table into an estimation of linear weights?
    You're right, you'd need that if you want to get the LWTS (though, not necessarily... more on that later). Here again, I use a 3-2-1 rule, but I'll see if I can come up with something better.

  8. #133
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    MrD...you are correct that I have not thus far attempted to generate my own LWTS using this method...

    I do have a problem with using three-year aggregate data to filter out sample size errors...if you do that...1930 doesn't come out right. Or 1987. Or 1911 and 1912. I'm operating under the presumption that major changes in the way runs are scored over a single season will cause major changes in the LWTS.

  9. #134
    I do have a problem with using three-year aggregate data to filter out sample size errors...if you do that...1930 doesn't come out right. Or 1987. Or 1911 and 1912. I'm operating under the presumption that major changes in the way runs are scored over a single season will cause major changes in the LWTS.

    I debated about whether to go three years, single year, or single year single league (to differentiate DH from non DH), or 3 year single league. I finally decided that 3 year was best for the studies I was doing. But if I was doing a different study I might decide otherwise. Certainly if you have reason to believe that there are causal factors rather than random variation that are causing the year to year variations you should use a single year. But you still would not want to regress those variations to a larger data set. Why? Because you have just decided that they are not due to random variation.

  10. #135
    So, we have a few ways to do the RE tables. The shortcut way, this basic Markov program, and then a really complex Markov.

    Does your really complex Markov vary the data by batting order position?

  11. #136
    Quote Originally Posted by misterdirt
    Does your really complex Markov vary the data by batting order position?
    Yes, that's how I did the batting order chapter (with the pitcher moving around, etc). The complex Markov used 5-dimensional arrays. The program itself is fairly small, but it is mind-numbing to program, and then to debug or enhance. It's one of the few programs I wrote for which I had to put in extensive documentation.

  12. #137
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    I don't suppose I could convince you to let me have a look at your Markov program, Tango...my partner in crime Randy Fiato is a programmer who took an interest in what he was calling "baseball as a state machine" and the two of us might be able to make improvements if we could look at the code for a while and study it. Just a thought...I doubt I would ever be able to program something as complex as your version myself, and Randy doesn't have time to write whole new programs, but perhaps we can come to some sort of accommodation that would help further my own research and could result in continued improvements in our understanding?

  13. #138
    Right now, I don't release any of my work. But, eventually, I may.

  14. #139
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    I can certainly understand the impulse to keep your work under wraps while you're still tinkering with it and before you have the opportunity to get significant publications out to prove that you were the one who did the research...I just wanted to let you know that I'm interested in picking up some of the research threads that you have started and have as yet shown no interest in continuing (for example...the first basemen saving errors study...possibly a more advanced catchers study involving looking for their effect on pitchers' DIPS statistics...etc)

    Some of the things I'd like to pursue really require Markov simulation to make them work (though neither of the two things I mentioned above do...LOL)

  15. #140
    What I do is not hard. It just requires a lot of time and patience. I suggest you learn a programming language, any programming language. Learn arrays, and understand recursion.

  16. #141
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    I have some experience with EIGHT programming languages, Tango.

    C++
    Perl
    MySQL
    Visual Basic
    IDL
    R
    Java
    HTML (ok...not really a programming language...but still worth mentioning).

    I was a computer science major for 3 semesters, and as soon as I hit object oriented programming and advanced recursion methods, my brain exploded. What you do may not be difficult for you, but not everyone is really wired to think in the way you need to think to program on a functional level. Believe me...it's not for lack of effort that I lack programming chops...I've attacked this research from so many angles it's not even funny...I haven't made headway in any of them.

  17. #142
    Perl is probably the one to focus on. You don't need OO or advanced recursion methods. Just be able to call a function without getting into a loop. As for making headway, the only suggestion I can offer is to pick up a Perl O'Reilly book, and do the exercises start to finish. If you have done all that, and you haven't made headway, then I guess programming is not for you.

  18. #143
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    I haven't done that with Perl yet...

    And I shouldn't say I've made zero headway...my MySQL query writing ability has significantly improved in the last year...enough that I am confident that with a powerful enough computer I can organize and normalize all of the available data into one database.

    I will see what I can do with Perl...the only problem with that language as I understand it is it's extremely slow for mathematical calculations.

  19. #144
    Join Date
    Aug 2005
    Posts
    12,862
    Blog Entries
    2
    I downloaded PERL today and I am probably going to play around with it since Baseball Hacks has some scripts for it. Most notably the one for run expectancy.

  20. #145
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    There's an RE script in the Hacks book?

    I didn't see that...I'll have to look through it.

  21. #146
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    Actually, Ubi...the RE hack is done entirely in MySQL...which is very encouraging, since that's the language I know best...LOL

  22. #147
    Join Date
    Aug 2005
    Posts
    12,862
    Blog Entries
    2
    Yes, but to get the data quickly and easily in his book you will need PERL. I used his programs to download all of the event files in one go and then bevent them all in one shot instead of doing it one at a time.

  23. #148
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    Right...the warning I would give you on that is you get a lot of extra useless information and the database you construct with the standard bevent program is ungainly and too big for most computers. That's OK if you're just going to grab small pieces of the data at a time, as long as you have things well indexed, but it can cause problems...I've created the beginnings of my own database already...just waiting for my computer upgrades to be finished and I'll continue work on that...but I'm doing something much more streamlined so that I can see the entire dataset all at once.

    PERL is useful for parsing text into your code, which makes it ideal for stealing data off the web.

  24. #149
    Join Date
    Aug 2005
    Posts
    12,862
    Blog Entries
    2
    I'm finding out more and more that his hacks are pretty unwieldy. I've had to change a few things around just to get it to download all the files. I then had to basically strip his script for unpacking the zip files down to just unzipping the files. I couldn't get the bevent part of his script to work properly, so I ended up using Tango's tips from the ASS conversation, and then added the header file and all of these text files into one mega file. It took a while and it is just finishing up. Hopefully when all that is done the historical PBP file will be set up just like he expects it to be set up in the Hacks book and then from there I can move on to other things.

    Nice, a 3 gig file that should be a joy to work with.

  25. #150
    Join Date
    May 2005
    Location
    Where all students live...nowhere.
    Posts
    8,900
    He uses his pBP2K file which has 2000-2004 only...his hacks work in that context but if you try them on the HUGE file...it's going to explode and your computer will need to be restarted. Fair warning.
