Announcement

Collapse
No announcement yet.

Sabermetric Book

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Well, I did say "which can be collapsed into two if you want to get fancy".

    ***

    Your group by will result in a 24x25x20 number of rows. You still need to then do a join. You do the join, and do sum(s.RE)/sum(n), sum(e.RE + runsOnPlay)/sum(n), and GROUP BY eventType.

    Or you can collapse the two into one query, as I showed earlier.
    Author of THE BOOK -- Playing The Percentages In Baseball

    Comment


    • I know that wasn't the only step..I was just saying I think it makes wording the queries a little easier if your data is as compact as possible.

      Comment


      • I do do it the way you are describing it, but I wouldn't recommend it until you are used to manipulating the 24 base/out states. The clean way is to do it the way I described it. Once you have a handle on that, then you can worry about collapsing them. After all, what if you are only interested in the 2-out states for some other reason? Now, you've got to join to a 24-row table that will decipher your baseOut field (which is what I also do, and also don't recommend).

        When you are setting things up, keep it clean, and don't worry about efficiency or size.
        Author of THE BOOK -- Playing The Percentages In Baseball

        Comment


        • All the starting fields and the RunsOnPlay should probably be available directly from BEVENT (been a while for me).

          RunsOnPlay probably should be a BEVENT field but it isn't. It has to be calculated.

          (Also make all the Dest fields 0 when EndOuts = 3.)

          If you do this you cheat players out of value when they get a hit that results in a third out on the basepaths. Example: a single with 2 outs and men on first and second where the lead runner makes the third out trying for home should not have the same result as a strikeout. The resulting base out state should be bases loaded 2 outs for computation purposes of hit value and value to the batter.

          See I would just create one column called StartBO and one called EndBO where each one went from 0 to 24 (0 = bases empty none out, 1 = bases empty 1 out...etc...and 24 = 3 outs) and then one row for the event type...then you just

          Again Matt, have you ever actually done this? Creating the EndBO state from the Runner destination fields is not trivial. I use a single code for base out state as a mentioned in posts 97 and 106. Instead of 1 to 24 as Matt suggests it has a man on first as a 1 in the thousands place, a man on 2nd as a 2 in the hundreds place, a man on 3d as a 3 in the tens place and outs in the ones place. This has the advantage that the base out state is evident visually in the code itself.

          and COUNT the number that fit in each unique grouping and SUM the number of runs scored on each play.

          This wouldn't work. You have to do an intermediate step of calculating the change in run value on the play. Change in run value is the (RE of the EndBO state - the RE of the Start BO state) + Runs scored on the Play. Change in Run Value is what then gets SUMMED just for offensive event type, not for every StartBO state. The SUM of Change in Run value is then divided by the Count of the number of events.

          Ub - Are you now more confused than ever?

          Comment


          • MrD...I knew all of that (about how to calcualte a linear weight)...I just didn't say it correctly in my post...sorry if I caused confusion.

            I don't generally consider it a good idea to use a string of numbers as a code for the base/out state because that can be a pain in the ass to attempt to query even with usage of the substr function....but I do see the appeal of using that code in the visual inspection of the data.

            I'm also contending with trying to create ONE database that includes ALL of the data for ALL of the years...I'm going for "all of the info is there" in "as small a data field as possible".

            Comment


            • I don't generally consider it a good idea to use a string of numbers as a code for the base/out state because that can be a pain in the ass to attempt to query even with usage of the substr function....but I do see the appeal of using that code in the visual inspection of the data.

              Its not a string of numbers, its a number. I find it simple to query it for any single base out state or any class of states. For example, looking at all men on first and third states would be >1029 and <1033. What's the problem?

              Comment


              • Interesting.

                And clever.

                I misunderstood what you meant by how you coded your data, MrD. That's a most efficient way to do it...and it doesn't increase the size of the piece of data I would need to utilize something similar by more than one byte per record (using just 0-24 I would need a 1 byte integer, using your method, I'd need a 2-byte integer).

                Comment


                • Originally posted by misterdirt
                  Ub - Are you now more confused than ever?
                  I understand what you are talking about for something like this if it isn't written down step-by-step or the person is sitting next to me it is lost on me. I understand what you need to do I just don't know the computer language and steps to do it

                  Comment


                  • wouldn't RBI on play work? Though I guess there a few plays in which a run scores but no RBI.

                    Comment


                    • wouldn't RBI on play work? Though I guess there a few plays in which a run scores but no RBI.

                      Not really. You'd lose about 1100 runs a year. And its not the total runs that you lose, its that you lose them from only a couple of types of offensive events, errors and outs mostly. So those outcomes would be highly affected.

                      The method I use is to create a new field called previous event number by subtracting 1 from every event number. Then you create another new field that has every event number the same except it assigns 0 to the last event number of a game. Then you can map the EventsGeneral table onto itself by linking those two fields. This gives you the HOME_SCORE and VISITOR_SCORE of the next event that you can add to your database. The HOME_SCORE or VISITOR_SCORE of the next event is obviously the HOME_SCORE or VISITOR_SCORE of the original event plus the runs scored on the play so you can extract the runs scored on play by subtraction. For the final event of the game you have to subtract from the appropriate FINAL_SCORE. There is probably a much more elegant solution that everybody but me has already discovered.

                      Comment


                      • or you can jjut use the vis_score and home_score fields from the NEXT event (by adding one to every event) and subtract from those the current score.

                        Comment


                        • or you can jjut use the vis_score and home_score fields from the NEXT event (by adding one to every event) and subtract from those the current score.

                          That doesn't work.

                          Comment


                          • If you do this you cheat players out of value when they get a hit that results in a third out on the basepaths.
                            I wasn't worrying about the splitting up of hitting/baserunning values at this point.

                            ***

                            As for runs on play, why not just sum all those destbatter, dest1B, dest2b, dest3b fields? Doing a self-join on the eventsTable is a huge processing cost.
                            Author of THE BOOK -- Playing The Percentages In Baseball

                            Comment


                            • I wasn't worrying about the splitting up of hitting/baserunning values at this point

                              It is not the normal splitting up of baserunning/hitting values which I don't bother with either. It is the fact that with 2 outs the hitter loses all value for his hit except for the value of runs scored on the play, unless you are zeroing that out too, in which case he loses all value. The linear weights will probably work out the same as the value lost is probably within the margins of precision given. But the value lost to the hitter in calculating his changes in RE may be significant for that hitter.

                              As for runs on play, why not just sum all those destbatter, dest1B, dest2b, dest3b fields? Doing a self-join on the eventsTable is a huge processing cost.


                              Obviously, you can do that, but I find it more time consuming. I don't know what you mean about processing cost. A table maps on itself very quickly.

                              Comment


                              • Sure, it's time consuming to code (2 minutes, instead of 20 seconds for the self-join). But, as the eventsTable gets bigger, the processing cost is huge. Each year has 200,000 records. If you do what Matt is suggesting, you're going to have a database with 5 to 10 million records. Joining a 10-million row table to another 10-million row table to get a 10-million row output is an enormous cost.

                                On the other hand, if you want to just join one team, or maybe one year, that's ok. But, I highly recommend doing the update as I'm suggesting.
                                Author of THE BOOK -- Playing The Percentages In Baseball

                                Comment

                                Ad Widget

                                Collapse
                                Working...
                                X