I thought the history buffs might find this one interesting enough that I decided to post it here...the "sabermetrics" involved here are very light mathematically, so it fits in.
This is just experimental, because to properly scale my difficulty rating, I had to arbitrarily choose the marginal value you'll see in a moment...I'm working on ways to more rigorously define it.
OK...bear with me for a moment while I explain where I got this idea.
I've been looking for the LONGEST time for a way to objectively rate how "deep" or "difficult" a league was...
I never liked James' subjective timeline adjustment...it seemed WAY too simple. But how do you go about seeing how skilled the players within a league are as a group?
The idea came to me through a discussion I had with Randy Fiato (TKD) about what defines "bad baseball". It is intuitively obvious that when two bad teams face each other, the games will be sloppy more frequently...mistakes will be made in all aspects of the game. Pitching mistakes...hitting mistakes...fielding mistakes...baserunning blunders.
What will this look like statistically though? A classic idea proposed by sabermetricians in the 70s was to rate players based on standard deviations from the mean...it has been observed many times that the standard deviation of batting average has been fluctuating through time but trending down...(there's a famous paper on the disappearance of the .400 hitter that discusses this...the author's name escapes me for some reason).
Batting average is not however explanatory enough...what we want to know is...does the standard deviation of run scoring per side per game change with time the way it does for batting average? Are we cycling closer and closer to the mean as time advances?
A quick survey using retrosheet.org's game logs reveals that in fact standard deviation is changing with time...but perhaps not the way you might think. It became immediately apparent that the standard deviation of run scoring on a per game basis was directly dependent on the league average run scoring rate. In fact, an r^2 of 0.9301 exists between those two variables...low scoring leagues have small standard deviations...high scoring leagues have larger standard deviations.
Does this mean that high scoring leagues are "weaker"...less deep with talent? Of course not. It's hard to argue that the deadball era was a better level of play than today's game, even with expansion. Consider: the player pool has expanded to include approximately 50 times more potential baseball players than it did back then, minor league scouting and development didn't exist in the deadball era, and the equipment and field conditions were often horrendous, making for sloppy games far more frequently than in today's major leagues.
This dependence on run scoring environment is not however the only problem with using standard deviation to rate the difficulty of a league or the players within the league. There is a fundamental logical flaw. The use of standard normal z scores presumes that the league and/or player distribution was normal...neither is the case.
The player distribution is pyramidal...the top 1% of the humans who play baseball make the major leagues (liberally...it might be closer to .001%)...if we could rate every baseballer from tee-ball to Japan to MLB to high school...the distribution of skill might be normal. Meanwhile, the distribution of runs scored per side per game in a league is the summation of a series of one-game match-ups...each match-up behaving according to the laws of probability as governed by the intrinsic strengths of both combatants...the result of that process is a non-normal, significantly skewed distribution...high extreme values will have an exaggeratedly large z-score...shutouts are a sign of bad play too, but there is a lower bound to how "bad" you can be in the non-scoring direction.
Given this lower bound...and the resulting tendency for variations in ability to manifest themselves in the rightward biasing direction (large numbers of high scoring games relative to the mean run scoring environment)...we fall back on MEASURING the skew of the league's RS distribution to get an idea about how erratic/weak that league was.
The positives...skew is not dependent on the run scoring environment...it is never affected by the mean of a probability distribution. Skew is uni-directional...meaning the lower bound shouldn't interfere with an accurate measurement of positive skew (skew is defined to be positive when the longer tail of a distribution points to the right on a number line). Skewness also does not presume a distribution is normal...it describes how non-normal a distribution is.
Logically...skew tells you how frequently extremes occur...more extremes mean more variation in intrinsic team strengths...and therefore...a weaker league.
If the run scoring distribution were normal (had no skew) this would mean that there was ZERO variation in player ability across the league...this would be the "ideal" league...but we know this to be humanly impossible to achieve...nonetheless...it serves to demonstrate that more skew is a larger deviation from the ideal league.
Skewness of a distribution is easily measured:
Skew = SUM[(x - u)^3] / [(n - 1) * s^3]
Where x is the observed game/side runs scored, u is the league average runs scored per side per game, n is the number of game/sides within the league and s is the standard deviation of the distribution.
Placing the s term in the expression automatically scales the skew value, so that higher scoring leagues, which will naturally have a wider range of run scoring outcomes, do not appear to have higher skew.
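For the curious, that formula is only a few lines of Python (my own sketch...`runs` is one league-season's list of runs scored per side per game):

```python
# Skew = sum((x - u)^3) / ((n - 1) * s^3), exactly as written above.
from statistics import mean, stdev

def league_skew(runs):
    """Sample skewness of a league's runs-per-side-per-game distribution."""
    u = mean(runs)            # league average runs per side per game
    s = stdev(runs)           # sample standard deviation (n - 1 form)
    n = len(runs)
    return sum((x - u) ** 3 for x in runs) / ((n - 1) * s ** 3)

# A toy distribution with one big blowout game pulls the skew positive:
print(league_skew([0, 1, 1, 2, 2, 3, 4, 9]))
```

A perfectly symmetric list (say, [1, 2, 3, 4, 5]) comes out to zero, which is the "no skew" ideal discussed below.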
When I plotted skew of the run scoring distribution against time, what I found was a somewhat messy but nonetheless encouraging trend toward gradually decreasing skew with time. There was a lot of noise in the plot...probably because skew is heavily impacted by large outliers, so extreme games might have had a disproportionately large pull on skew...it was therefore necessary to smooth the skew values.
I chose to use a normally weighted 7-year running mean of skew values for each league (normally weighted implies a larger emphasis on the center year...think of the shape of the bell curve) to smooth out the fluctuations...
It makes sense to smooth the data because although players change from season to season...the overall strength of the league cannot possibly fluctuate by overly large amounts...there are hundreds of players in any given league...turnover from year to year is no larger than 5-10% so we should expect league strengths to change gradually except in extreme circumstances like during WWII.
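Here's a sketch of what I mean by a normally weighted running mean. The bell-curve width (sigma = 1.5 years) is a choice I made for illustration, not a magic number:

```python
# Gaussian-weighted running mean over a 7-year window. Each year's
# smoothed value is a weighted average of its neighbors, with the
# center year weighted most heavily (the shape of the bell curve).
import math

def smooth(values, window=7, sigma=1.5):
    half = window // 2
    out = []
    for i in range(len(values)):
        num = den = 0.0
        # Window is truncated at the ends of the series.
        for j in range(max(0, i - half), min(len(values), i + half + 1)):
            w = math.exp(-((j - i) ** 2) / (2 * sigma ** 2))  # bell-curve weight
            num += w * values[j]
            den += w
        out.append(num / den)
    return out

raw = [1.4, 1.1, 1.6, 1.2, 1.0, 1.3, 0.9, 1.1]  # noisy yearly skews (made up)
print(smooth(raw))
```

Because each output is a weighted average of its window, the smoothed series always stays inside the range of the raw one...the outlier years get pulled toward their neighbors, which is the whole point.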
I'm considering alternatives to this normally weighted running mean idea...I may for instance measure the skewness of a longer period than one year...perhaps skew is more persistent if you include more than one year of data...either way...the smoothed values were eye popping and aligned very well with my expectations for where baseball was weak and where it was strong.
But this doesn't end the problem.
Assuming Smoothed skew is an appropriate measure of league strength, we need to put it in a form that allows strong leagues to score higher than weak leagues...and it would in fact be ideal if we got the scores to range from 0 to 1 so that they could be used multiplicatively...(for instance...if we rate 1872 as a 0.5 league...we would cut player wins in half in 1872 to get an idea of how many wins they'd be worth in a strong league)
We can make use of the exponential function here...it makes sense to use the exponential given that major league baseball represents the top of the baseball pyramid and the drop in skew value from typical leagues to great ones is likely to be large.
It also gives us the right range if used properly. Skewness can theoretically range from 0 to infinity in this case (it can't range negatively because of the lower bound at zero)...if we take a skewness of zero, e^0 = 1...if we take a skewness value approaching infinity, e^skew blows up toward infinity...ah but if we make that e^-skew...e^-0 is still 1, but a large skew implies 1/(e^large), which asymptotically approaches zero.
One more step though...no baseball league...no matter how great...will ever have a skew of zero. Here's the nasty part where I have to arbitrarily pick a marginal skew value. This was just me visually examining the graph of smoothed skew with time and seeing what the skew appeared to be approaching (the overall curved trend appears to be leveling off slowly but surely).
I chose a value of 0.8 as the minimum skew...though I experimented with other values.
This was applied by simply subtracting 0.8 from each skew value obtained by the smoothing process before converting them with the exponential decay function.
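Putting those last two steps together (the 0.8 is the marginal value from above):

```python
# Convert a smoothed skew value into a 0-to-1 league strength rating:
# subtract the assumed marginal skew, then apply exponential decay.
import math

MARGINAL_SKEW = 0.8  # the arbitrarily chosen minimum skew from above

def strength(smoothed_skew):
    """Low skew -> strength near 1 (strong league); high skew -> near 0."""
    return math.exp(-(smoothed_skew - MARGINAL_SKEW))

print(round(strength(0.8), 3))   # a league right at the floor scores 1.0
print(round(strength(1.45), 3))  # higher skew -> weaker league
```

One caveat worth noting: a smoothed skew that dips below the 0.8 floor would score above 1, which is part of why picking that marginal value carefully matters.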
The end result is quite interesting to me...
Here are the top 20 most difficult leagues by this method (the 20 weakest follow below):
Year Lg Strength
1984 AL 0.968
1985 AL 0.967
1997 AL 0.947
1995 AL 0.946
1996 AL 0.943
1998 AL 0.942
1986 AL 0.941
1983 AL 0.941
1983 NL 0.932
1933 AL 0.928
1934 AL 0.928
1999 AL 0.925
1994 AL 0.925
1982 NL 0.923
1937 AL 0.919
1935 AL 0.913
1938 AL 0.912
1936 AL 0.909
1987 AL 0.907
1962 AL 0.906
The early deadball era looks to me to have been very weak competitively...though obviously not as bad as the old National Association...which plays like a modern AA or A league.

And here are the 20 weakest leagues:

Year Lg Strength
1910 NL 0.691
1909 AL 0.690
1944 NL 0.688
1902 NL 0.687
1901 NL 0.683
1885 NL 0.682
1905 AL 0.679
1911 NL 0.675
1881 NL 0.666
1875 NA 0.665
1906 AL 0.663
1908 AL 0.654
1907 AL 0.651
1874 NA 0.637
1873 NA 0.614
1884 NL 0.612
1882 NL 0.589
1872 NA 0.578
1883 NL 0.560
1871 NA 0.528
Thoughts from the peanut gallery?