# Thread: Handling errors in run estimation

I'm new to this site, and relatively new to sabermetrics (though I read some of Bill James's books as a kid in the '80s).

I've been working on analyzing the results of the first season (2007) of the Israel Baseball League (IBL - see http://www.israelbaseballleague.com), a professional league which played a 40-game season this past summer. Some of my findings are available on my blog: http://biblemetrics.blogspot.com.

As you can see from my recent posts there, I'm having trouble reconciling the conventional (MLB) run estimation formulas with the IBL stats. About 20% more runs were scored than the predictions.

I suspect that the main element missing is errors. IBL games featured about three times as many errors as in the majors. But none of the run estimators seem to take errors into account.

Has anyone done any work in this direction? I would think that error rates must also be high in the minors - do the run estimators have the same problems there?

Thanks,

The Iblemetrician
Israel

2. Wow...I had no idea they were taking an interest in baseball in the middle east.

FWIW, some run estimators do take errors into account. Linear Weights, if you use one of the more detailed equations, has error terms. I'm guessing though that errors have a different value in a pre-major-league environment like that because, much like softball, errors would probably result in many more bases advanced by the batter and other baserunners than they do in the major leagues. What kind of information do you have at your disposal regarding the IBL? Do they track just the basics or do you have play-by-play records? If you've got something more detailed than your basic 1B, 2B, 3B, BB, K, HR, RBI, R type stats, I'd like to know. Even if you don't, whatever stats you have from the IBL would serve as a potentially valuable window into measuring league quality...a league as new and (no offense to the Israelis) "young" developmentally as the IBL could teach us a lot about the earmarks of sub-major-league baseball.

Wow, this is really neat. I'd heard about IBL earlier this year, but it's fascinating to think about applying sabr work to such a novel system. I'm subscribing to your blog.

With the run estimators, the problem you're probably running into is that you're dealing with a massively different run environment than in MLB, which means that any linear weights based on MLB numbers will miss pretty badly (as you noted on your blog).

Seems like base runs would be your best bet moving forward, as it's supposed to hold up well in extreme/unusual environments because it's based on a decent model of run scoring. And you can modify it when you need to include additional terms--and once you have a good model of run scoring, you can generate your own custom linear weights.

Patriot has a nice writeup on base runs that should be helpful (as well as other run estimators):
http://gosu02.tripod.com/id108.html

Here's a "full" version of base runs from his article:
A = H + W + HB - HR - CS - DP
B = .777*S + 2.61*D + 4.29*T + 2.43*HR + .03*(W + HB - IW) - .747*IW + 1.30*SB + .13*CS + 1.08*SH + 1.81*SF + .70*DP - .04*(AB - H)
C = AB - H + SH + SF
D = HR

Where base runs = A * B/(B+C) + D
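For anyone who wants to try the formula in code, here is a minimal Python sketch of the four terms exactly as written above. The class and field names are my own, and I'm assuming S means singles (H minus doubles, triples and homers) and W means total walks including intentional ones; neither is spelled out in the formula itself.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """League or team totals. Field names are illustrative, not from the article."""
    AB: int
    H: int
    D: int = 0   # doubles
    T: int = 0   # triples
    HR: int = 0
    W: int = 0   # walks (assumed to include intentional walks)
    HB: int = 0  # hit by pitch
    IW: int = 0  # intentional walks
    SB: int = 0
    CS: int = 0
    SH: int = 0
    SF: int = 0
    DP: int = 0


def base_runs_terms(s: BattingLine):
    """Return the (A, B, C, D) terms of the 'full' BaseRuns version quoted above."""
    singles = s.H - s.D - s.T - s.HR  # assumption: S in the formula is singles
    A = s.H + s.W + s.HB - s.HR - s.CS - s.DP
    B = (0.777 * singles + 2.61 * s.D + 4.29 * s.T + 2.43 * s.HR
         + 0.03 * (s.W + s.HB - s.IW) - 0.747 * s.IW
         + 1.30 * s.SB + 0.13 * s.CS + 1.08 * s.SH + 1.81 * s.SF
         + 0.70 * s.DP - 0.04 * (s.AB - s.H))
    C = s.AB - s.H + s.SH + s.SF
    D = s.HR
    return A, B, C, D


def base_runs(s: BattingLine) -> float:
    """BaseRuns = A * B / (B + C) + D."""
    A, B, C, D = base_runs_terms(s)
    return A * B / (B + C) + D
```

Calling `base_runs(BattingLine(AB=..., H=..., ...))` with league totals gives estimated league runs; any stat you don't have can be left at its default of 0.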

Now I am totally unqualified to be advising on this, as I've never tried to manipulate such an equation before. But it seems to me that errors would result in additional men on base (the "A" term) as well as additional advancement of runners (the "B" term).

So you'd definitely add reached-base-on-errors to the A term (if you don't have reached-base-on-errors, and just total errors, you might need to add a coefficient here to approximate reached-base-on-errors).

You'd add errors to the B term as well, though you'll have to futz around with the coefficient to find something that works (I'd probably start by equating them to a single).
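The two adjustments above could be sketched roughly like this in Python. Everything here is a guess at one way to do it, not an established method: `roe_share` (the fraction of total errors that put the batter on base) and `roe_weight` (the B coefficient, started at the single's .777) are hypothetical knobs you'd have to tune. The `roe_counted_as_outs` switch reflects that a reached-on-error is scored as an at-bat without a hit, so it is already sitting inside AB - H in the C term unless you take it out.

```python
def estimate_roe(total_errors: float, roe_share: float) -> float:
    """Approximate reached-on-error from total errors.

    roe_share is a hypothetical coefficient; it has to be estimated from
    play-by-play data or borrowed from another league.
    """
    return roe_share * total_errors


def add_errors(A: float, B: float, C: float, roe: float,
               roe_weight: float = 0.777,
               roe_counted_as_outs: bool = True):
    """Return BaseRuns (A, B, C) terms adjusted for reached-on-error."""
    A_adj = A + roe                # extra men on base
    B_adj = B + roe_weight * roe   # extra advancement, valued like a single to start
    # ROE is an AB without a hit, so it is in AB - H; optionally remove it from outs.
    C_adj = C - roe if roe_counted_as_outs else C
    return A_adj, B_adj, C_adj


def base_runs_from_terms(A: float, B: float, C: float, D: float) -> float:
    """BaseRuns = A * B / (B + C) + D."""
    return A * B / (B + C) + D
```

You'd compute A, B, C and D from the regular formula first, pass A, B and C through `add_errors`, and then compare `base_runs_from_terms` against actual runs while adjusting the two coefficients.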

I'm sure others might have much better suggestions on how to go about this, but this might get you started.
-j

4. The run value of an error on the batter reaching base in MLB is about .02 runs higher than a single. So, as a general rule, just count an error like a single.

I would start with http://www.tangotiger.net/markov.html to see how the IBL compares. The constraints of the system (lack of outs on baserunners) get more exposed the lower the quality of the league (where outs on the bases are likely far more common).

5. Tango...I don't think you can assume that a reach-on-error in the IBL will be similar to a single. In weaker leagues, errors are usually more spectacular and result in compounding errors (think about a softball game where one bad throw often leads to another). I think an error is going to be worth a lot more relative to a single in a weak, error-prone league.

Originally Posted by SABR Matt
Wow...I had no idea they were taking an interest in baseball in the middle east.
Well, I wouldn't go that far. Most of the fans either grew up in North America or lived there for a while - or their parents did. Only 10% of the players are Israeli; the rest are mostly Americans, followed by Dominicans and a surprisingly talented contingent of Australians. I don't see baseball competing with soccer or basketball here anytime in my lifetime, at least.

Do they track just the basics or do you have play by play records?
One thing the league did well was recordkeeping. I have the play-by-play game logs, though I have yet to write the software to process them. For now, I'm working from the box scores which lump all errors together as E. Hopefully at some point I'll be able to break down reached-on-error, not to mention other juicy data like flyballs and groundballs, taking the extra base, etc.

Even if you don't, whatever stats you have from the IBL would serve as a potentially valuable window into measuring league quality...a league as new and (no offense to the Israelis) "young" developmentally as the IBL could teach us a lot about the earmarks of sub-major-league baseball.
One question I'd certainly like to look at is how to assess the play level of the league. Is there any reliable way to do that other than to look at players who have played in both the IBL and other leagues, and compare their performance levels? Is it possible to assess play levels without a broader context - like looking at pitcher control, maybe?

Thanks to everyone for the suggestions on the formulas. Looks like I could use some data on the actual consequences of errors in the league - how many batters reached base, how many runners advanced. Interesting that the conventional stats don't distinguish between those very different situations.

8. Indeed...conventional statistics have a severe failing when it comes to recording the real meaning of errors.

If you have a complete PBP record, there's a LOT you can do in terms of linear analysis...I recommend you spend your time working on a way to process that information, because once you process that, you can compare the rates at which "sloppy" events occur to similar rates in the majors throughout history. HBP, WP, SB, E etc...I'm willing to bet that the IBL is very similar to 19th century baseball...a PBP record of the IBL could give us tools to better understand our own past.

So I took the visual cue from the "powered by mlb.com" logo and went digging. I found all of the PBP data for the Israel Baseball League in Gameday format at the following location...

http://gd2.mlb.com/components/game/ind/year_2007/

Now if anyone around here has a working version of Adler's Hack for parsing the 2007 Gameday data, we'd be in business.

Originally Posted by weskelton
I found all of the PBP data for the Israel Baseball League in Gameday format
Amazing! Shows how out of touch I am - I don't even know what to do with Gameday format. Until now, I've been parsing the web pages downloaded straight from the IBL's site.

Thanks for the discovery.

It worked - the full BaseRuns formula yielded good estimates without any tweaking on my part. Details are in my latest blog post. Thanks again for the pointers.

A related question: Does anyone know where I can find league-average summary stats for current minor leagues?

12. baseball-reference.com has minor league data now...including league summary information...for all leagues 1992-2007.

I actually did a little data massaging of my own using the numbers that Iblemetrician had posted on his blog...

Basically I took the percentage of extra errors in the IBL that were above and beyond what would have been expected in MLB and turned them into singles, following Tango's suggestion. I also treated the HBPs as BBs. Then I plugged the revised run environment into Tango's Markov modeler. My values were as follows...

AB=837, H=262, 2B=39, 3B=2, HR=25, BB=163, K=165
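The massaging step described above could look something like the following Python sketch. This is only an approximation of the idea, not the exact spreadsheet work: the function name and the `expected_errors` input (how many errors an MLB-quality defense would have made over the same playing time) are assumptions, and it leaves AB alone on the theory that a reached-on-error is already counted as an at-bat.

```python
def massage_line(ab, h, bb, hbp, errors, expected_errors):
    """Fold 'extra' errors into hits (as singles) and HBP into walks.

    expected_errors is a hypothetical input: the error total an MLB-level
    defense would be expected to commit in the same number of games.
    """
    extra_errors = max(errors - expected_errors, 0)  # only errors above the MLB norm
    return {
        "AB": ab,                 # unchanged: ROE is already an at-bat
        "H": h + extra_errors,    # extra errors become singles
        "BB": bb + hbp,           # HBP treated as walks
    }
```

Doubles, triples, homers and strikeouts pass through untouched, so only H and BB differ from the raw line.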

The result was a run environment that would score 8.083 runs per 27 outs (outs = AB - H). When treating the extra errors as singles (not outs), the IBL actually scored at a rate of 7.889 runs per 27 outs. This is compared to a Markov value of only 6.341 runs per 27 outs using the un-massaged data.
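To make the comparison concrete, here is a small Python sketch of the arithmetic, using only the figures quoted in this post. The helper names are mine; `runs_per_27` is included just to show how a rate per 27 outs is formed from a runs total, which isn't given here.

```python
def outs(ab: int, h: int) -> int:
    """Batting outs approximated as AB - H."""
    return ab - h


def runs_per_27(runs: float, n_outs: float) -> float:
    """Scale a runs total to a rate per 27 outs."""
    return 27.0 * runs / n_outs


def pct_gap(estimate: float, actual: float) -> float:
    """Percent by which an estimate is above (+) or below (-) the actual rate."""
    return 100.0 * (estimate - actual) / actual


massaged_outs = outs(837, 262)               # 575 outs in the massaged line
gap_massaged = pct_gap(8.083, 7.889)         # Markov on massaged data vs. actual
gap_raw = pct_gap(6.341, 7.889)              # Markov on un-massaged data vs. actual
print(massaged_outs, round(gap_massaged, 1), round(gap_raw, 1))
```

So the massaged estimate lands roughly 2.5% high, while the un-massaged one comes in close to 20% low, which lines up with the shortfall described in the first post.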

So yes, it does seem that the errors account for a good deal of the error.

