Thanks. I'm looking forward to reading about the rest of the study.
Wow...please do continue with this bit of logic...I'm very curious to see how this turns out...because I will probably adapt something like this to construct my RE matrices for seasons without PBP
Matt, and just to show how even having thousands of plays is not enough: in the 1964 AL, the RE with a man on 2B and 2 outs is .325, while with a man on 3B it's .292.
(The NL that year is .312, .394. From 1960-2004, it was .229, .276, respectively)
The best model is one grounded in logic, or through a Markov chain.
Yep...there is definitely error associated with just the empirical data from season to season. I recall in your book how you showed that you couldn't even get the LW of a HR with no one on base correct using the empirical data...not perfectly so anyway.
Perhaps you could explain to me the logical basis for the 3/2/1 rule, Tango. I'm sure there is one...but I'd like to see it fleshed out so I can put a period on the sentence that is that assertion.
The short answer is that you have 4.5 batters to the end of the inning when you have 0 outs, 3 batters with 1 out, and 1.5 batters with 2 outs. The fleshing out will be done on my site near the end of the series. (Btw, I updated the thread on my site with more info.)
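A quick check of the arithmetic in that short answer. This is only a sketch: reading the 3/2/1 label as the ratio of batters remaining is my assumption, since the full explanation is deferred to the site.

```python
# Batters remaining until the end of the inning, as quoted in the post above.
batters_left = {0: 4.5, 1: 3.0, 2: 1.5}

# Express each figure relative to the 2-out value.
ratio = {outs: n / batters_left[2] for outs, n in batters_left.items()}
print(ratio)  # {0: 3.0, 1: 2.0, 2: 1.0}
```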
Okay Tango I gave it a whirl with the 2005 Cubs. Everything seems to make sense except I added a few things.
For instance, with a player on third, not only did I consider the error rate, but I also factored in the WP and PB rates along with balks. Secondly, I found that a walk occurred for the Cubs not around 10% of the time but less often, though the difference was so small it wasn't going to matter. But the WP and PB part does change the numbers.
The other thing is I went ahead and added in reached second on an error for the Cubs. It happened only once, and they actually got to third twice.
So anyway here is the 2005 Cubs RE:
Code:
             0      1      2
BE       0.488  0.267  0.104
1st      0.860  0.515  0.228
2nd      1.115  0.685  0.313
3rd      1.339  0.938  0.393
1st2nd   1.487  0.933  0.437
1st3rd   1.711  1.186  0.517
2nd3rd   1.966  1.356  0.602
BasFull  2.338  1.604  0.726
Last edited by Ubiquitous; 06-15-2006 at 06:40 PM.
Oh, also I forgot to mention that I didn't use the assumption of a 50% chance of scoring from third with less than 2 outs. I used the actual Cubs data: the Cubs scored from third with less than 2 outs 68 times out of 150 outs.
And I was wondering if it was correct that a man on third with one out was more valuable than a man on first and second with one out?
Last edited by Ubiquitous; 06-15-2006 at 06:41 PM.
I came up with the Cubs having a .289 chance of scoring from third with 2 outs. The average was .268, and error and WP/PB was .021 plus negligible walks, so it comes out to .289. I realize that by adding WP/PB only to third base, it is the only state getting inflated, so now I can understand why it is possible for it to be higher than 1st and 2nd. With 1 out it is .671, and with no outs it is .851.
If I ignore WP/PB, that part goes down to .012, for totals of .280/.663/.842 as compared to .289/.671/.851, which would knock third base, 1 out, below 1st and 2nd, 1 out.
0 1 2
BE .482 .266 .109
1st .741 .501 .245
2nd 1.183 .749 .318
3d 1.615 .857 .214
1st2nd 1.338 .801 .369
1st3rd 1.382 1.095 .250
2nd3rd 1.958 1.127 .464
BasFull 1.813 1.347 .587
This is the table I got for 2005 from the actual empirical data. Pretty darn close to what you got, considering that actual occurrences for some of the base-out states are only in the 20s. I am not sure what kind of study would benefit from having REs on a team basis rather than a league basis or even a multiple-year basis, given the sample size errors inherent in the smaller data set.
Damn. How do you get the table to print nice and straight?
You mean like that?
Code:
             0      1      2
BE        .482   .266   .109
1st       .741   .501   .245
2nd      1.183   .749   .318
3d       1.615   .857   .214
1st2nd   1.338   .801   .369
1st3rd   1.382  1.095   .250
2nd3rd   1.958  1.127   .464
BasFull  1.813  1.347   .587
I take it that is the Cubs RE?
I see based on the empirical data that indeed for the Cubs a man on third with one out was worth more than a man on 1st and 2nd with one out, and it looks that way as well with no outs.
Question for you: is there an easy way of doing it the empirical way? Meaning, I have the PBP data for the Cubs in 2005, and I hesitate to go the empirical route since I have to believe there is an easier way to manipulate the data than by setting up filters in Access and counting manually from there.
Side by side, empirical on the left, Tango's on the right
Code:
             0      1      2                 0      1      2
BE        .482   .266   .109    BE       0.488  0.267  0.104
1st       .741   .501   .245    1st      0.860  0.515  0.228
2nd      1.183   .749   .318    2nd      1.115  0.685  0.313
3d       1.615   .857   .214    3rd      1.339  0.938  0.393
1st2nd   1.338   .801   .369    1st2nd   1.487  0.933  0.437
1st3rd   1.382  1.095   .250    1st3rd   1.711  1.186  0.517
2nd3rd   1.958  1.127   .464    2nd3rd   1.966  1.356  0.602
BasFull  1.813  1.347   .587    BasFull  2.338  1.604  0.726
Well, wouldn't the different runs in the RE be a reflection of that team? Granted, some of the smaller data points would have to be taken with a grain of salt. But couldn't one look at a team as a whole with, say, a runner on first and then compare that to other teams' REs with runners on first? See if one team got more or less out of this situation and then from there try to find out why. Isolate why one team had a higher RE than another team in that situation, whether because of speed, power, or something else.
Gosh, the charts look so pretty when you do them!
Well, wouldn't the different runs in the RE be a reflection of that team? Granted, some of the smaller data points would have to be taken with a grain of salt. But couldn't one look at a team as a whole with, say, a runner on first and then compare that to other teams' REs with runners on first? See if one team got more or less out of this situation and then from there try to find out why. Isolate why one team had a higher RE than another team in that situation, whether because of speed, power, or something else.
The RE tables would reflect the individual team's efforts but in terms of analysis I don't think they would tell you more than looking at production from each line-up slot. If you know that the #1 and #2 hitters are getting on base a lot and the #3 and #4 hitters are getting more than average extra base hits you have a pretty good idea how a team is scoring its runs. Learning team chemistry from the RE tables is tougher. Take for example the 70 times that the Cubs had a man on 3d with 1 out last year. I have them scoring 60 times. But you have no idea whether they are scoring because the next batter hit a single or a home run. Or a sacrifice fly or a bunt. Or the next three batters walked.
Question for you: is there an easy way of doing it the empirical way? Meaning, I have the PBP data for the Cubs in 2005, and I hesitate to go the empirical route since I have to believe there is an easier way to manipulate the data than by setting up filters in Access and counting manually from there.
Easy is in the mind of the beholder. I use Access, and add a base outs state field to every event in my EventsGeneral table. I also add a runs scored in the rest of inning field. With those, a single query can sum the runs scored in rest of inning grouped by base out state. The work is in adding the fields.
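To make that concrete, here is a minimal sketch of the single grouped query, run through Python's sqlite3 instead of Access. The column names `base_out` and `runs_rest_of_inning` and the six rows are made up for illustration; only the EventsGeneral table name comes from the post.

```python
import sqlite3

# Hypothetical schema: one row per event, with the two added fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EventsGeneral (base_out INTEGER, runs_rest_of_inning INTEGER)"
)
# Made-up rows for illustration only (not real Cubs data).
conn.executemany(
    "INSERT INTO EventsGeneral VALUES (?, ?)",
    [(0, 1), (0, 0), (1000, 1), (1000, 2), (2, 1), (2, 0)],
)

# One grouped query: occurrences, total runs, and average runs per state.
rows = conn.execute(
    """
    SELECT base_out,
           COUNT(*)                 AS occurrences,
           SUM(runs_rest_of_inning) AS runs,
           AVG(runs_rest_of_inning) AS run_expectancy
    FROM EventsGeneral
    GROUP BY base_out
    ORDER BY base_out
    """
).fetchall()

for row in rows:
    print(row)
```

The average column is the run expectancy for that state; as noted, the real work is in populating the two fields.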
For the fields, that makes sense. So what do you do? You create a field and then link it to the outs field and the runners on base as well? How does that work? Would you have to set up a separate query to do a count of runners on base and then link it back? Then how do you also add the outs to it, or does it simply become a code? Like 1 would be BE, no outs; 2 would be BE, 1 out; 3 would be BE, 2 outs; 4 would be runner on 1st, no outs; and so on. Either way, how is that achieved? Another question is what exactly is measured. You say you create a base/out number for every play; is that the base/out situation before the play or after the play? Meaning, if the play is a single and the bases are empty with 1 out, would the number in the box be BE, 1 out or runner on 1st, 1 out?
Thinking about it more,
I'm guessing that to create the field you have to take it to Excel, right? Create the formula there for base/out and runs scored to the end of the inning and then bring it back, right? If that is the way, I think I could do that pretty easily, so I guess the main question would be my last one in the first paragraph above: which base/out situation do you use? The one before the play or the one after?
I guess that refutes the belief that the Cubs couldn't score from third base last year.
How did you get 60 times out of 70 with a man on third and 1 out? I got 31 times in 73 situations. That is, in just that base/out situation, the runner scored 31 times before the base/out situation changed. Is that how it is done, or does any and all scoring, even after the base/out situation changes, count toward man on third, 1 out? Meaning, if you have a man on third with 1 out, the batter K's, and then the next batter gets a single, that run counts toward man on third, 1 out, and yet it doesn't add another opp? We only count opps when they initially reach the base/out situation, and don't lock the result to just the very next play but to whatever happens to the end of the inning?
Guys,
I just want to say I'm impressed with the initiative and the great work being done. I'll reply to each point made shortly.
Tom
For instance with a player on third not only did I consider the error rate but I also factored in WP and PB rate along with a balk.
Ub, certainly you can add as much as possible. I was getting worried about putting in "too much", but then again, maybe I shouldn't. Maybe I'll break out my state-transition matrix for each event for 1999-2002, and present those numbers. Then, I can leave it to the reader how he wants to create his own RE matrix. When I created my Markov RE charts for the book, I did in fact take all possible events and transitions.
I used the actual Cubs data which was the Cubs scored from third with less than 2 outs 68 times out of 150 outs.
Yes, you should use whatever information you do have on hand. After all, we are trying to construct a matrix that shows how the runs did score. You can see how it might be quite difficult if you use a very small number of games. So, not only are you thinking "how did they score", but you have to ask, "if they continue at this pace". When you've got 6000 PA in a season, that's a reasonable thing to agree to. But in terms of scoring from 3B with less than 2 outs, you would need some regression. A lot of it is park-dependent, so it would be nice to calculate this by park.
BE .482 .266 .109
1st .741 .501 .245
2nd 1.183 .749 .318
3d 1.615 .857 .214
As you can see, empirical data will give you oddball results, such that it's easier to score from FIRST base than THIRD base, with 2 outs. So, 162 games of a team is still not enough.
I am not sure what kind of study would benefit from having REs on a team basis rather than a league basis or even a multiple year basis given the sample size errors inherent in the smaller data set
This was answered quite well by Ub. The chance of scoring depends on the whole state-transition matrix. And those transitions are not static. If I added an additional parameter, quality of batter, you would see how the RE matrix would balloon. Right now, we are assuming that it's always an average team batter at the plate. On my site, in the "Walking Bonds" blog entry, I show you how to change the win probability table based on who the batter is. A lot easier said than done.
How did you get 60 times out of 70 with man on third 1 out? I got 31 times in 73 situations.
As discussed, we are talking about the chance of that runner scoring, at all. So, that means to the end of the inning. If you go through my calculations, I show you how to figure out the chance of scoring for that particular PA, and then for the rest of the inning. So, it would be something like a 50% chance of scoring right there and then, and then of the 50% of the time that he doesn't score, he's got a 30% chance of scoring later in the inning. So, .50 + .50*.30 = .65 (more or less).
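The two-stage arithmetic above, written out as a small sketch. The function name is hypothetical, and the 50% and 30% figures are the illustrative ones from the post, not measured rates.

```python
def chance_of_scoring(p_this_pa: float, p_later: float) -> float:
    """Chance the runner scores on this PA, or later if he does not."""
    return p_this_pa + (1.0 - p_this_pa) * p_later

total = chance_of_scoring(0.50, 0.30)
print(round(total, 2))  # 0.65
```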
Oh, and as for how to figure out the empirical RE, that was also answered, and I do the same thing.
Create a view/query to generate a table that gives you:
game,half-inning,r
Something like:
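-- half_inning uses an underscore: a hyphen is not valid in an unquoted column name
create table InningRuns as
select game, half_inning, sum(r) as inning_runs
from events
group by game, half_inning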
(Half-inning could be inning,team, or inning,homeVisitor, or whatever. I convert the top/bottom inning into a number from 1 to 18 for a 9-inning game.)
Once you have that, you do a join of your events table to the InningRuns table. You already know how many runs have already scored from the events table, and you know how many did score for the inning in InningRuns table. The difference is the number of runs scored from that point to the end of the inning.
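Here is one way the two steps might look end to end, sketched with Python's sqlite3 so it can be run as-is. The `seq` column (play order within the half-inning), the `inning_runs` alias, and the four rows are assumptions made for illustration; a real events table will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns: seq orders plays, r is runs scored on the play.
conn.execute(
    "CREATE TABLE events (game TEXT, half_inning INTEGER, seq INTEGER, r INTEGER)"
)
# Made-up half-inning: walk, double play, home run, strikeout.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [("G1", 1, 1, 0), ("G1", 1, 2, 0), ("G1", 1, 3, 1), ("G1", 1, 4, 0)],
)

# Step 1: total runs per half-inning.
conn.execute(
    """
    CREATE TABLE InningRuns AS
    SELECT game, half_inning, SUM(r) AS inning_runs
    FROM events
    GROUP BY game, half_inning
    """
)

# Step 2: join back; runs to end of inning = inning total minus runs
# already scored before this play.
rows = conn.execute(
    """
    SELECT e.seq,
           i.inning_runs - (
               SELECT COALESCE(SUM(p.r), 0)
               FROM events AS p
               WHERE p.game = e.game
                 AND p.half_inning = e.half_inning
                 AND p.seq < e.seq
           ) AS runs_to_end
    FROM events AS e
    JOIN InningRuns AS i
      ON i.game = e.game AND i.half_inning = e.half_inning
    ORDER BY e.seq
    """
).fetchall()
print(rows)  # [(1, 1), (2, 1), (3, 1), (4, 0)]
```

If the events table already carries a running score, the correlated subquery can be replaced by that column.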
Something like:
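-- half_inning uses an underscore: a hyphen is not valid in an unquoted column name
create table InningRuns as
select game, half_inning, sum(r) as inning_runs
from events
group by game, half_inning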
Similar to what I do but I find it easier to use the max function on home score and visitor score grouped by inning and game. I then link that to the EventsGeneral table.
I do everything in Access. There is no need to go to Excel for this. For base-out state I query the EventsRunners table for each on-base situation individually. There is probably a way to use nested if functions to do it in one query, but I find nested ifs complicated to do in Access. I code a man on first as 1000, a man on second as 200, a man on third as 30. Outs are in the units column. These can be added together to create a single code for each base-out situation. Men on first and third with two outs would be 1032, etc. Makes things very readable and easy.
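The coding scheme is easy to sketch outside Access too. The function below is hypothetical (Python, not an Access expression), but it uses exactly the 1000/200/30-plus-outs values described above.

```python
def base_out_code(on_first: bool, on_second: bool, on_third: bool, outs: int) -> int:
    """Combine runners and outs into one code: 1000 + 200 + 30 + outs."""
    if outs not in (0, 1, 2):
        raise ValueError("outs must be 0, 1, or 2")
    return 1000 * on_first + 200 * on_second + 30 * on_third + outs

print(base_out_code(True, False, True, 2))    # 1032
print(base_out_code(False, False, False, 0))  # 0
```

Because each base gets its own digit, the code can be read back at a glance: 1231 is bases loaded with one out.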
Run Expectancy counts all runs scored in the inning on all plays subsequent to the base-out situation being evaluated, not just on the following play. For example, a batter walks to lead off an inning. Base-out situation is 1000. Next batter up GIDP. Third batter homers. Fourth batter strikes out. For this inning the base-out states and runs-scored totals would be: base-out state 0, 1 run, 1 occurrence; BoS 1000, 1 run, 1 occurrence; BoS 2, 1 run, 2 occurrences.
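The tally for that example inning can be reproduced with a few lines. This is just a sketch of the counting rule, with each play recorded as a hypothetical pair of (state before the play, runs on the play).

```python
from collections import defaultdict

# (base-out state before the play, runs scored on the play)
inning = [
    (0, 0),     # leadoff walk
    (1000, 0),  # GIDP
    (2, 1),     # home run
    (2, 0),     # strikeout ends the inning
]

runs = defaultdict(int)
occurrences = defaultdict(int)

remaining = sum(r for _, r in inning)  # runs still to come in the inning
for state, r in inning:
    runs[state] += remaining      # credit all runs from this point on
    occurrences[state] += 1
    remaining -= r                # runs on this play are now in the past

for state in sorted(runs):
    print(state, runs[state], occurrences[state])
```

Summed over a season, runs divided by occurrences for each state gives the empirical RE.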
This was answered quite well by Ub. The chance of scoring depends on the whole state-transition matrix. And those transitions are not static. If I added an additional parameter, quality of batter, you would see how the RE matrix would balloon. Right now, we are assuming that it's always an average team batter at the plate. On my site, in the "Walking Bonds" blog entry, I show you how to change the win probability table based on who the batter is. A lot easier said than done.
I certainly can see the benefit from having a working Markov chain RE where you can plug in a hypothetical situation and do a quick "what if" study. I still question whether Ub can glean much benefit from individual team REs. If he uses empirical data he has the small sample problems that we have already discussed. If he uses your method, which is kind of a question and answer Markov, or a Markov without matrix algebra, there are still some small sample problems plus you have supplied some of the transition information from large data sets that may not apply to the actual team being studied or even the wider range of run environments found at the team level.
For something like the chance of scoring from 3B with 2 outs, there's really no need to use empirical data. What you want is the team batting average and reached-on-error rate.
For chances of scoring with exactly 0 and 1 out, I would use the team/park data, regressed, if I thought the makeup of the team or the park was kinda unique. So, that's where those numbers come into play.
Otherwise, the empirical data isn't necessarily needed. It should all follow-through from the exact number of runs that did score.
If you had continued my process, you would also need to know how often you go from 1b to 3b on a single, for each out, etc ( http://www.tangotiger.net/destmob.html ), and again, you would care about the team/park makeup. The end-result is taking that complex approach, or the quick shorthand that I presented, will get you to pretty much the same spot.
Doesn't using a team's actual situational numbers in this shortcut approach sort of defeat the purpose of having the shortcut approach?
That purpose being twofold:
a) Save the user from needing situational data to come up with a reasonably accurate RE chart for a team or league
b) avoid the hazards of small sample size and ground your RE tables in logic
Just sayin...
Using the data I used wasn't exactly time consuming, and, as with Runs Created, if the information is available, use it. The things I changed were the assumption that a team scores from third with 0 or 1 outs 50% of the time (the Cubs came in a little under), the assumption that there is a walk in 10% of at-bats, and the number of WP/PB. Tweaking the walk rate to reflect what you are looking at doesn't alter the logic, nor is it complicated to do or unavailable throughout history. Using WP/PB is like adding SB/CS data to the RC formula. Does it alter the logic of RC? It is minor, and if you have the data it is real quick to add it into the shortcut. Nor is the shortcut exactly 2*2=4. In order to use the formula one has to look at data, and if one is looking at the data then one might as well use what is available if it is easy to incorporate.
If I'm using the shortcut method to look at a year, wouldn't I want the data to be as close as I can get it? The shortcut, I believe, is meant as an introductory lesson in making one's own RE. An ice-breaker: see, you probably thought it was hard, but look how simple it is. A way to create an RE if one lacks the resources or know-how to do one based on more advanced mathematical formulas. A way to let the armchair analyst use one of the tools of the more advanced stat-heads.
Last edited by Ubiquitous; 06-16-2006 at 10:01 AM.