
How to reproduce Linear Weights


  • How to reproduce Linear Weights

    (The thread title should probably read "How do you reproduce Linear Weights?")

    I have a pedagogical question.

    I understand that Linear Weights have been around for years and their values are pretty much agreed upon. I am writing an example for some Linear Regression software and thought it would be cool if I could reproduce the coefficients.

    I plugged in team data from both leagues for 1946-89 into a Least Squares Zero-Intercept Linear Regression model with R as the dependent variable and ABminusH, 1B, 2B, 3B, HR, BB, SO, SB, CS as the independent variables and trained. I got the following coefficients:

    0.51 : 1B
    0.68 : 2B
    1.21 : 3B
    1.48 : HR
    0.37 : BB
    0.17 : SB
    -0.22 : CS
    0.0007 : SO
    -0.10 : Outs
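
    For readers who want to try the same kind of fit, here is a minimal sketch of a zero-intercept least-squares regression in plain Python. It does not use the 1946-89 team data from this post: the team counts are randomly generated, and TRUE_WEIGHTS and RANGES are made-up values chosen only so the fit has something to recover. The synthetic runs are noise-free, so the fit returns the weights exactly; real team data is noisy and will not.

```python
import random

EVENTS = ["1B", "2B", "3B", "HR", "BB", "Outs"]
# Made-up "true" weights, used only to generate the synthetic data.
TRUE_WEIGHTS = [0.50, 0.70, 1.00, 1.40, 0.30, -0.10]
# Made-up ranges for team-season counts of each event.
RANGES = [(850, 1150), (180, 320), (20, 60), (80, 220), (400, 650), (3900, 4300)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_zero_intercept(rows, y):
    """Least squares with no intercept: solve (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * target for r, target in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


random.seed(0)
teams = [[random.randint(lo, hi) for lo, hi in RANGES] for _ in range(200)]
runs = [sum(w * x for w, x in zip(TRUE_WEIGHTS, team)) for team in teams]
fitted = fit_zero_intercept(teams, runs)

for name, coef in zip(EVENTS, fitted):
    print(f"{coef:6.3f} : {name}")
```

    Adding random noise to the synthetic runs is a quick way to see how far the fitted coefficients for rare events such as triples can drift from the values that generated the data.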

    On one hand, I was pleased that this sort of looks like the right answer. I did no special massaging of the data, just using the software as a black box. The near-zero value of the strikeout was fun to see. On the other hand, I notice some sizeable discrepancies in the values for 2B & 3B (I'm not so worried about SB/CS as I understand those are sometimes fudged for leverage reasons).

    Does anyone know what else is done in training the Linear Weights model to obtain the known coefficients? For my software example, I think this is good enough, but now I'm curious.

    Let me restate that in no way am I claiming the known values are not correct. I did almost no work in training the above model and I'm simply interested to know if there is a way I can come closer to the published coefficients.


  • #2
    I got very similar answers when I calculated my dynamic linear weights for those years. The values of linear weights do change with time...some seasons are different than others. I believe Palmer used 1960s/70s/mid 80s data in his first analysis...but that doesn't explain the differences.

    What software are you using? I've been working with SPSS of late...trying to figure out all of its features to see if I can improve and automate my season-by-season modelling processes.

    As far as I know, the linear weights people use now aren't done with multilinear regression though...they're done with Run/Out expectancy tables.


    • #3
      Yes, the regression model on a TEAM level will give you weird and wrong results. You have to understand that every coefficient itself has its own uncertainty level. The run value of the double is something like .66 +/- .15, 95% of the time. The triple's value is even worse. You may think that having 600 or whatever teams is a good sample size, but it hardly is.

      The CORRECT way to do it is with a change in run expectancy model. I give out the whole shebang in the book.

      You can also look at some results from Tom Ruane here:

      He gives the LWTS run values on a year-by-year, league-by-league level. Those are based on thousands of individual plays.
      Author of THE BOOK -- Playing The Percentages In Baseball


      • #4
        I'm using some in-house software at work (work-related tools applied to stuff like baseball and poker are always a big hit). I suppose I could easily switch to using 'R'.

        Thanks Matt for letting me know that my answers are reproducible given the method I used.

        Thanks Tango for the link to the Tom Ruane article showing that you have to go down to individual plays to get better results.

        Thanks for the quick replies, guys!


        • #5
          There's only one problem Tango.

          How the heck do you extend linear weights to eras before PBP data?


          • #6
            That's no problem. The run expectancy matrix can be estimated rather easily. The LWTS numbers are determined based on the frequency of each state-to-state transition, for each event. (If you have my book, you can probably get some good insights on these things.)
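
            As a rough illustration of the change-in-run-expectancy idea, the sketch below credits each play with the runs it scored plus the run expectancy of the state it ended in, minus the run expectancy of the state it started in, then averages by event. The RE numbers and the six plays are invented for the example; they are not taken from the book or from any real season, and a real calculation would cover all 24 base-out states and thousands of plays.

```python
from collections import defaultdict

# Hypothetical run expectancy by (outs, runners) state; values are invented.
RE = {
    (0, "---"): 0.50, (0, "1--"): 0.88,
    (1, "---"): 0.27, (1, "1--"): 0.52,
    (2, "---"): 0.10, (2, "1--"): 0.22,
}

# Hypothetical plays: (event, start state, end state, runs scored).
# An end state of None means the play made the third out.
PLAYS = [
    ("1B", (0, "---"), (0, "1--"), 0),
    ("1B", (1, "---"), (1, "1--"), 0),
    ("HR", (0, "1--"), (0, "---"), 2),
    ("HR", (2, "---"), (2, "---"), 1),
    ("OUT", (0, "---"), (1, "---"), 0),
    ("OUT", (2, "1--"), None, 0),
]


def event_run_values(plays, re_table):
    """Average (runs + RE after - RE before) for each event type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event, start, end, runs in plays:
        end_re = 0.0 if end is None else re_table[end]
        totals[event] += runs + end_re - re_table[start]
        counts[event] += 1
    return {event: totals[event] / counts[event] for event in totals}


values = event_run_values(PLAYS, RE)
for event, value in values.items():
    print(f"{event:>3}: {value:+.3f}")
```

            Because each event is averaged over the states in which it actually occurred, the result depends on how often each state-to-state transition happens, which is the frequency weighting described above.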

            You can also look here:
            Author of THE BOOK -- Playing The Percentages In Baseball


            • #7
              Thanks Tango...I'm attempting to determine how to make your weighting system work to account for actual runs rather than runs different from average (your out factor is like -.25 which means the average player isn't producing any RC by your method...I want LWs to actually model Run Scoring directly)


              • #8
                No, that is not a true statement. -.25 means the out generates .25 runs less than an average PA (which includes hits, walks, HR, outs).

                To model run scoring directly, you use BaseRuns (see my site). And from BaseRuns you can generate the custom LWTS (see earlier link). And from the custom LWTS you can generate the RE matrix (unpublished).
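
                One way to picture the BaseRuns-to-custom-LWTS step is the "plus one" approach: compute BaseRuns for a team, add one event, and take the difference. The sketch below assumes one commonly cited simple version of the BaseRuns formula (A*B/(B+C) + D with the B factor shown in the code); other versions exist, and this is not necessarily the one used on the site. The TEAM line is a made-up batting line, not real data.

```python
def base_runs(ab, h, tb, hr, bb):
    """One commonly cited simple BaseRuns version (an assumption here)."""
    a = h + bb - hr                                        # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02    # advancement
    c = ab - h                                             # outs
    d = hr                                                 # automatic runs
    return a * b / (b + c) + d


# Hypothetical team-season batting line.
TEAM = dict(ab=5500, h=1450, tb=2300, hr=160, bb=550)

# How one extra event of each type changes the batting line.
PLUS_ONE = {
    "1B": dict(ab=1, h=1, tb=1),
    "2B": dict(ab=1, h=1, tb=2),
    "3B": dict(ab=1, h=1, tb=3),
    "HR": dict(ab=1, h=1, tb=4, hr=1),
    "BB": dict(bb=1),
    "Out": dict(ab=1),
}


def custom_weights(team):
    """Marginal run value of one extra event of each type."""
    base = base_runs(**team)
    weights = {}
    for event, delta in PLUS_ONE.items():
        bumped = {key: value + delta.get(key, 0) for key, value in team.items()}
        weights[event] = base_runs(**bumped) - base
    return weights


weights = custom_weights(TEAM)
for event, value in weights.items():
    print(f"{value:+.3f} : {event}")
```

                These are absolute weights (the out comes out near -.10, not -.25), and they change whenever the team line changes, which is what makes them custom to the run environment.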
                Author of THE BOOK -- Playing The Percentages In Baseball


                • #9
                  Uh is what I said untrue?

                  You say "An out is -0.25 runs which means it is worth 0.25 runs less than the average PA"

                  I said "An out is like -.25 runs by the RE matrix, which means an average player (note...meaning he gets average plate appearances!) produces no runs above average by your method."

                  What I said was correct I'm fairly certain.


                  • #10
                    By the way Tango...are you absolutely certain that there is no change in the relationships between the events relative to each other that is not explained by the run scoring environment?

                    I know you hate linear regression approaches, but I found through dynamic linear weight research that the more rare a positive event (HR for example) the more it tended to be worth compared to the other events.

                    I found that the value of a HR in the modern game is actually at a near-all-time low and that it was worth more back in 1912 despite the lower scoring environment...and that singles were worth much, much less back then (today...about 0.52 R...back then...about 0.39), as were walks (today...about 0.38...back then...about 0.26).

                    To me...that made sense. In the deadball era, the chances of you advancing after getting a single were remarkably lower than they are today...the home run was runs in the bank...on the board...put it away pally. In the modern game, you have a higher chance to score those runs without the longball.


                    • #11
                      Matt, I think your post #10 is saying what my link in post #6 is saying. Can you click that link, and see if that's the case?


                      You also said "which means the average player isn't producing any RC by your method". This is not a true statement. An average player DOES create runs (RC). He just does it at the same rate as... the average player.

                      This might simply be a confusion in definition. RC being runs created being absolute and total runs created.

                      LWTS being runs created above average.
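
                      The two definitions can be put side by side in a short sketch. Every number in it is hypothetical: the event weights, the league-average line per 600 PA, and the league runs per PA. It also assumes, for illustration only, that the out weight is chosen so the league-average line sums to zero.

```python
# Hypothetical average-relative weights for the non-out events.
WEIGHTS = {"1B": 0.47, "2B": 0.77, "3B": 1.04, "HR": 1.40, "BB": 0.31}
# Hypothetical league-average batting line per 600 PA.
AVERAGE_LINE = {"1B": 100, "2B": 28, "3B": 3, "HR": 17, "BB": 52, "Out": 400}
# Hypothetical league scoring rate.
LEAGUE_RUNS_PER_PA = 0.12


def out_weight(weights, league_line):
    """Out value that makes the league-average line total zero (assumption)."""
    positive = sum(weights[event] * league_line[event] for event in weights)
    return -positive / league_line["Out"]


def runs_above_average(line, weights, out_value):
    """LWTS in the average-relative sense."""
    hits_and_walks = sum(weights[event] * line[event] for event in weights)
    return hits_and_walks + out_value * line["Out"]


def absolute_runs(line, weights, out_value, runs_per_pa):
    """Total runs created: runs above average plus an average share per PA."""
    plate_appearances = sum(line.values())
    relative = runs_above_average(line, weights, out_value)
    return relative + plate_appearances * runs_per_pa


out_value = out_weight(WEIGHTS, AVERAGE_LINE)
avg_relative = runs_above_average(AVERAGE_LINE, WEIGHTS, out_value)
avg_absolute = absolute_runs(AVERAGE_LINE, WEIGHTS, out_value, LEAGUE_RUNS_PER_PA)

print(f"out weight:                {out_value:+.3f}")
print(f"average player, relative:  {avg_relative:+.1f}")
print(f"average player, absolute:  {avg_absolute:.1f}")
```

                      Under these assumptions the average player comes out at zero runs above average and still creates about 72 runs in absolute terms, so the two statements describe the same player on different scales.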
                      Author of THE BOOK -- Playing The Percentages In Baseball


                      • #12
                        You define linear weights as an average relative method...I define them as an absolute method. That's the difference. The problem with average-relative methodology is that it's not a good assumption that league average defines that league. There are other skewing factors...talent depth and disposition (some years the pitchers are better than the hitters...some years the hitters are better than the pitchers) being a big one...

                        BTW I have clicked that link and was reacting in post #10 to your conclusion that singles, doubles, triples, home runs, and walks all increase in value with increasing run scoring. I'm not convinced that's correct. I think that in increasingly homer-friendly environments, singles, doubles, and possibly triples increase in value and HRs *decrease* in value...and in low-scoring environments, the HR hits a PREMIUM value because it's runs on the board with 100% certainty whereas other events become less likely to actually produce runs.


                        • #13
                          Matt, I really don't know how to respond to your first paragraph. There's like three things you are talking about there, and I don't see how it's a LWTS problem any more or less than it's an RC problem or BsR problem.


                          Matt: the only way to be convinced is to actually run a simulator or Markov chains and do the work. I understand why you are thinking the way you are, but until you run it through a realistic process, it's just a nice thought.

                          The research that I have done, some published and some not, shows that the run value of the HR starts at 1.00 (obviously), and increases to a certain point, and then decreases until it converges to 1.00 when the team OBP approaches 1.000. That tipping point is around 10-12 RPG, or around an OBP level of .500.

                          Those interested can go here:

                          The run value of the walk is pretty much a straight-line value and it tracks OBP. Pretty much, OBP = run value of a walk. (Not exactly, but that's the basic idea). The run value of the single has a higher slope, and then converges towards 1.00, etc, etc.

                          Until other research shows otherwise, you should consider this research to be the standard. I would be glad to publicize any other research that supplants mine as the new standard.
                          Author of THE BOOK -- Playing The Percentages In Baseball


                          • #14
                            If I knew what a Markov chain was, perhaps I'd be more impressed.

                            As for running simulations...I'm not entirely convinced mathematical modelling of baseball games is capturing the interaction between events...I could be wrong, after all you've done more research in that area (by a loooong margin).

                            Make no mistake...I've read some of your work in this area and have been favorably impressed. It just makes intuitive sense to me that in an environment where HRs are common, each HR would have less impact on the game (relative to the other events).


                            • #15
                              I think the book lays it out pretty well, so you might be more impressed after reading it.

                              Your last statement is true under certain conditions. After all, this image:

                              shows how quickly the gap between a HR and 3B closes. It's a very long complicated process, and it's not a simple straight-line estimate. An environment where HR are common, and other events are not, or an environment where HR are common and other events are as well, will give you different gaps in run values.

                              At the extreme, a league where the OBP = .100, and the HR/PA is also = .100 (meaning no other forms of getting on base exist), the run value of the HR = 1.00.

                              But, if the OBP = .200, and HR/PA stays at .100, the run value of the HR will go up to say 1.100, while the run value of the walk would probably jump to .150. (I don't know, just guessing). But if the OBP = .200 and HR/PA = 0, the run value of the walk would be more like .100.
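
                              The extreme cases above can be checked with a toy Markov chain. The sketch below is a deliberately stripped-down league with only three events (out, walk, home run), so runners only ever move when forced; it is an illustration under those assumptions, not the model used in the book. The function name run_values and its parameters are hypothetical. Run values are computed as the average change in run expectancy plus runs scored, weighted by how often each base-out state is reached.

```python
def run_values(p_bb, p_hr, sweeps=500):
    """Average-relative run values in a toy league of outs, walks and HRs.

    A state is (outs, n): n runners on base, always forced, so n = 0..3
    means empty, 1st, 1st+2nd, loaded.
    """
    probs = {"out": 1.0 - p_bb - p_hr, "bb": p_bb, "hr": p_hr}
    states = [(outs, n) for outs in range(3) for n in range(4)]

    def step(outs, n, event):
        """Return (runs scored, next outs, next n) for one event."""
        if event == "out":
            return 0, outs + 1, n
        if event == "bb":
            return (1, outs, 3) if n == 3 else (0, outs, n + 1)
        return n + 1, outs, 0  # home run clears the bases

    re = {state: 0.0 for state in states}

    def expectancy(outs, n):
        return 0.0 if outs == 3 else re[(outs, n)]

    # Run expectancy: repeat the one-step recursion until it settles.
    for _ in range(sweeps):
        for outs, n in states:
            total = 0.0
            for event, p in probs.items():
                runs, outs2, n2 = step(outs, n, event)
                total += p * (runs + expectancy(outs2, n2))
            re[(outs, n)] = total

    # Expected number of plate appearances that start in each state.
    visits = {state: 0.0 for state in states}
    for _ in range(sweeps):
        new = {state: 0.0 for state in states}
        new[(0, 0)] = 1.0  # every inning starts with nobody on, nobody out
        for (outs, n), count in visits.items():
            for event, p in probs.items():
                _, outs2, n2 = step(outs, n, event)
                if outs2 < 3:
                    new[(outs2, n2)] += count * p
        visits = new

    total_pa = sum(visits.values())
    values = {}
    for event in probs:
        acc = 0.0
        for (outs, n), count in visits.items():
            runs, outs2, n2 = step(outs, n, event)
            acc += count * (runs + expectancy(outs2, n2) - re[(outs, n)])
        values[event] = acc / total_pa
    return values


for p_bb, p_hr in [(0.0, 0.1), (0.1, 0.1), (0.2, 0.0)]:
    v = run_values(p_bb, p_hr)
    print(f"BB/PA={p_bb:.1f} HR/PA={p_hr:.1f}  HR={v['hr']:.3f}  BB={v['bb']:.3f}")
```

                              With HR/PA = .100 and no walks the bases are always empty, so the HR comes out at exactly 1.00; once walks are added there are sometimes runners on, and the HR is worth more than one run. The exact figures depend entirely on the toy advancement rules.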

                              I'm not sure you and I are really disagreeing about anything. You really have to lay out exactly your parameters so that we can establish exactly their impacts.
                              Author of THE BOOK -- Playing The Percentages In Baseball

