Announcement

Collapse
No announcement yet.

Sabermetric Book

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • It was Ub I was trying to help and he was only asking about doing one year. I did it with three years and it was not a problem. As for Matt, doing querying in Access with a database that large (if it is even possible in Access) would be time consuming, especially nested IF functions.

    Comment


    • Who said anything about Access?

      I do those kinds of complex queries in Query Browser...hard-coded MySQL

      Comment


      • I don't have access to the data right now (my main computer is begin upgraded almost entirely for the purpose of being able t handle the 7.4 million row PBP events database) so I have a question.

        Do the visitor score and home score fields reflect the score BEFORE or AFTER a play has already occured?

        Comment


        • Before the play.

          Comment


          • ALright can answer this?

            I ran the hack #60 in the book the way it was written for 2004 and it worked like a charm. I then tried to do it for 2001 and got way off numbers. This is what the original hack looks like:
            create table runs_by_inning2004 AS
            select game_id, inning, batting_team,
            if (batting_team=0,
            min(vis_score) +
            sum(if(runner_on_1st_dest>3,1,0)) +
            sum(if(runner_on_2nd_dest>3,1,0)) +
            sum(if(runner_on_3rd_dest>3,1,0)) +
            sum(if(batter_dest>3,1,0)),
            min(vis_score)) AS vis_score_end_of_inning,
            min(vis_score) AS vis_score_beginning_of_inning,
            if (batting_team=1,
            min(home_score) +
            sum(if(runner_on_1st_dest>3,1,0)) +
            sum(if(runner_on_2nd_dest>3,1,0)) +
            sum(if(runner_on_3rd_dest>3,1,0)) +
            sum(if(batter_dest>3,1,0)),
            min(home_score)) AS home_score_end_of_inning,
            min(home_score) AS home_score_beginning_of_inning
            from pbp.pbp2k
            where substring(game_id,4,4)="2004"
            group by game_id, inning, batting_team;
            you then create an index:
            create index runs_by_inning2004_idx
            ON runs_by_inning2004(game_id, inning, batting_team);
            Then you do the RE:
            select p.outs,
            if (p.first_runner != "", 1, 0) AS runner_on_1st,
            if (p.second_runner != "", 1, 0) AS runner_on_2nd,
            if (p.third_runner != "", 1, 0) AS runner_on_3rd,
            sum(if (p.batting_team=0,
            r.vis_score_end_of_inning - p.vis_score,
            r.home_score_end_of_inning - p.home_score)
            ) / count(*)
            AS expected_runs,
            count(*) AS N
            from fullpbp2004 p inner join runs_by_inning2004 r
            on p.game_id=r.game_id AND p.inning=r.inning
            AND p.batting_team=r.batting_team
            group by runner_on_1st, runner_on_2nd, runner_on_3rd, outs;

            Now the only change I do to that is I change the fullpbp2004 name to the name of the file with all my pbp data since I don't have a table named fullpbp2004. I honestly don't know where he got that name since he never introduced it before and he was getting his pbp data from a different file to begin with. But changing the name the file in which I have all my pbp gets a proper return for 2004 as verified by Misterdirt. Now then when I change it to 2001 by basically replacing all the 2004 part to 2001 I get oddball results. I get bases empty no outs at 1.33 runs expected at 89,000+ occurences. What am I doing wrong?

            Comment


            • It shouldn't change anything { and I'm not familiar with MySQL, but other databases sometimes have a problem with count(*) }, can you change
              from: count(*)
              to: sum(1)

              (In some databases, you'd have to specify p.* or r.*, or even a field name, perhaps even non-nullable. Sum(1) simply bypasses all those issues. It might even be better processing-wise, but I've never tested it.)
              Author of THE BOOK -- Playing The Percentages In Baseball

              Comment


              • By the way, while this table is fine as coded:
                runs_by_inning2004

                I don't do it that way, especially if all you care about is the RE. This will suffice:
                select game_id, inning, batting_team,
                (
                sum(if(runner_on_1st_dest>3,1,0)) +
                sum(if(runner_on_2nd_dest>3,1,0)) +
                sum(if(runner_on_3rd_dest>3,1,0)) +
                sum(if(batter_dest>3,1,0))
                ) as bat_runs_for_inning
                ;

                If you want to get fancy, you can do:
                int(runner_on_1st_dest/4) ... this returns a 0 if the value is 0,1,2,3, and a 1 if it's 4,5,6,7 .... since the max value is 6, it'll work like a charm. I'm pretty sure that the "IF" costs more than the int and "/4".

                I just did this quick, so test it first.
                Author of THE BOOK -- Playing The Percentages In Baseball

                Comment


                • Originally posted by misterdirt
                  Before the play.
                  OK...so if the Vis_Score and Home_Score Fields are the score before the play...then it follows that the NEXT home_score and vis_score will always be the score after the play except when it is the last play of the game.

                  It seems like there is a way that should be able to be used to access the score of the play after yours and the score of your current play and take the difference. Forgive my naivite if I'm wrong...I'm away from my database and can't play around with it yet, and have not attempted anything beyond some simple size-reduction queries...

                  of course you could also just do something like create temporary scoring flags (batterScores, 1stRunnerScores, 2nRunneScores, 3rdRunnerScores) that equal 1 if the corresponding destination base is 4 and 0 in all other cases and then just derive your runsOnPlay field from that by adding up the flag counts. That would probably be the fastest thing to do with a large database.

                  Comment


                  • Tango...why is the max value of the destination fields 6? I thought it was 0 for no advance, 1 for first base, 2 for second base, 3 for third base and 4 for scorring...and nothing above that.

                    Comment


                    • Matt/188: that's what I do in 187.

                      Matt/189: I think it's 4 for scoring an earned run, 5 for an unearned run, and 6 for a team unearned run, or somethign like that. Check Retrosheet's documentation. It's there.
                      Author of THE BOOK -- Playing The Percentages In Baseball

                      Comment


                      • I've been through the documentation three dozen times and never noticed that...*sigh*...I'm sure it's there I just never saw it...when I get back my (upgraded) computer I'll have some work to do.

                        Comment


                        • Oh and correct me if I'm wrong but doesn't your query in 187 produce runs scored in the ENTIRE inning? I'd rather have information on each PLAY that I can easily group up and add together, and I think rather than calling the SUM function four times and then asking for an IF statement to check it, it would be faster and more useful to calculate runsOnPlay for each single play as a simple sum on scoring flags, and then just delete the scoring flag fields and work without all the complex IF statements in the summation query when I start asking about run scoring for a whole inning.

                          You have to use the IF statements at some point...I'd rather do that in pre-processing and do it just once...and then leave no record of that for all future calculations.

                          Comment


                          • http://www.retrosheet.org/datause.txt

                            RETROSHEET: HOW TO USE OUR EVENT FILES

                            ....

                            58 batter dest* (5 if scores and unearned, 6 if team unearned)
                            59 runner on 1st dest* (5 if scores and unearned, 6 if team unearned)
                            60 runner on 2nd dest* (5 if scores and unearned, 6 if team unearned)
                            61 runner on 3rd dest* (5 if socres and uneanred, 6 if team unearned)
                            Author of THE BOOK -- Playing The Percentages In Baseball

                            Comment


                            • Matt/192: Ouch, that's what happens when I don't think too much. Yes, of course what I do won't work.
                              Author of THE BOOK -- Playing The Percentages In Baseball

                              Comment


                              • Obviously you know way more than I do about how to make things work the way you feel comfortable...I'm sure what you posted is a part of a larger method...

                                Comment

                                Ad Widget

                                Collapse
                                Working...
                                X