Questions about Samples and Populations


  • Questions about Samples and Populations

    Question about samples and populations, with regard to baseball players and stats.

    I should qualify this by saying I'm no expert here, and I'm hoping to get some insight. Perhaps there is a statistician here who can address the question. I'm no statistician, but I am fascinated by the numbers behind baseball, and always looking to learn.

    Someone pointed out that a certain player was batting .500-ish in the big leagues after a promotion. I forget the exact numbers, but the player was batting 8-for-16 or something to that effect. I noted that yes, it's nice, but you can't draw many conclusions from that because it's a small sample size. This person came back and said no, it's not a small sample size, that they had used all the possible data for this player that was available -- that they had effectively used all the data available in the population.

    This seemed squirrelly to me, and while I understand what this person is trying to say, I feel like the "population" of data is the 100+ years of pitcher and batter data that we have recorded, and looking at an 8-for-16 stretch for one player is just a small sample size from millions of data points recorded for the sport. I feel like this person has arbitrarily narrowed their definition of "population" to just this particular player in this particular stretch of 16 at-bats.

    What's the deal here? Is it wrong to say this is a "small sample size" or is this person misusing the idea of a population?

  • #2
    Well, considering that an entire season is approximately 650 plate appearances, 16 plate appearances is just a small slice of that. In fact, that would only account for about 2.5% of an entire season of PAs. In other words, this player is going to have about 40 more stretches of 16 plate appearances throughout the season. What can you deduce from an 8-for-16 stretch? That this player is capable of hitting major league pitching - and that's the extent of what you can conclude from that minuscule sample. Those 16 have essentially no bearing on the next 16 plate appearances; the two stretches are close to independent of each other. Baseball is not a game of pure chance, but results still fluctuate a great deal over the course of the season. With that being said, EVERY season in baseball history has shown that a .500 batting average is impossible to sustain over an entire season.
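
    To put a rough number on how little an 8-for-16 stretch says, here is a minimal sketch that treats each at-bat as an independent trial with a fixed hit probability. That model, and the .250 "true" average, are assumptions made purely for illustration; real at-bats are not perfectly independent or identical.

```python
from math import comb


def prob_at_least(hits: int, at_bats: int, true_avg: float) -> float:
    """P(X >= hits) when X ~ Binomial(at_bats, true_avg)."""
    return sum(
        comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(hits, at_bats + 1)
    )


# Share of a ~650 PA season covered by a 16 PA stretch.
share_of_season = 16 / 650

# Chance that an assumed true .250 hitter gets 8 or more hits in 16 at-bats.
p_hot_stretch = prob_at_least(8, 16, 0.250)

print(f"share of season: {share_of_season:.1%}")
print(f"P(8+ hits in 16 AB | .250 hitter): {p_hot_stretch:.3f}")
```

    Under these assumptions the chance comes out to a few percent for any single 16 at-bat stretch, so with hundreds of hitters and dozens of stretches each, somebody ordinary is always in the middle of one.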

    Do you know who Sandy Leon is? I wouldn't be surprised if you didn't. He has a career .626 OPS in over 1,000 career plate appearances. He's not a good hitter. But in 2016 with the Red Sox, he caught lightning in a bottle and slashed .310/.369/.476 in 283 plate appearances. That shiny slash line was fueled by a ridiculous 20-for-40 start. There are inner-circle Hall of Famers who never had 20 hits in 40 at-bats - it was a total statistical anomaly. Those types of things happen in baseball. While people had reason to believe that Sandy Leon had become a good hitter, he turned into an offensive pumpkin in the seasons following. If you saw only those 40 ABs, it would be perfectly rational to believe that Leon had evolved into a good hitter, but in context, you would consider that he had a poor offensive track record and you would expect him to regress heavily. Regression to the mean is an important concept in statistics, and it basically means that extreme outlier numbers in a small sample will tend to normalize over a larger sample. And by normalize I mean that a player's numbers will reflect his true talent level over time. The regression monster eventually bites every player.

    In a vacuum, a bad player can play very well - a good player can slump like crazy. It's important to stay grounded and wait it out before making any meaningful conclusions.
    Last edited by Francoeurstein; 04-13-2019, 07:51 PM.
    Rest in Peace Jose Fernandez (1992-2016)



    • #3
      Originally posted by fergieprs
      What's the deal here? Is it wrong to say this is a "small sample size" or is this person misusing the idea of a population?
      It may be a bit dated by this time, but some good discussion on stabilization of metrics:
      https://library.fangraphs.com/principles/sample-size/
      https://www.baseballprospectus.com/f...ats-stabilize/
      http://tangotiger.com/index.php/site...nal-half-noise
      Jacquelyn Eva Marchand (1983-2017)
      http://www.tezakfuneralhome.com/noti...uelyn-Marchand



      • #4
        The person OP was talking to was probably trying to mislead him/her by misusing the statistical definitions of "population" and "sample".
        Last edited by pedrosrotatorcuff; 04-15-2019, 06:20 AM.



        • #5
          The FG links that JoF posted are excellent. Note that BA stabilizes at about 900 AB, or about a season and a half. The relevant stat that stabilizes most quickly is K-rate, which is why fans frequently look at a player's K-rate, say for a minor leaguer who shows some promise, or a rookie early in his first season. BA is actually one of the last stats to stabilize. It takes much longer than OBP, for example, because so much of it depends on BIP luck, whereas OBP is driven to a great extent by walk rate, which stabilizes relatively quickly.
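
          One common way to use a stabilization point is to treat it as the sample size at which you regress the observed number halfway to the mean. A minimal sketch of that idea follows; the .250 prior average is an assumed league-ish figure, and the 900 AB stabilization point is just the rough number mentioned above, so treat the output as illustrative rather than a real projection.

```python
def regressed_average(
    hits: int,
    at_bats: int,
    prior_avg: float = 0.250,      # assumed league-average BA (illustrative)
    stabilization_ab: int = 900,   # rough stabilization point for BA
) -> float:
    """Blend the observed average with the prior, weighting the prior
    as if it were `stabilization_ab` at-bats of league-average hitting."""
    return (hits + prior_avg * stabilization_ab) / (at_bats + stabilization_ab)


# An 8-for-16 start barely moves the estimate off the prior.
hot_start = regressed_average(8, 16)

# A .300 average over 900 AB is regressed exactly halfway back to .250.
season_and_a_half = regressed_average(270, 900)

print(f"8-for-16   -> {hot_start:.3f}")
print(f"270-for-900 -> {season_and_a_half:.3f}")
```

          The point of the design is that the weight on the observed average is AB / (AB + 900), so 16 at-bats get under 2% of the weight while 900 at-bats get exactly half.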

          There are many other examples like Leon, usually players who were new to the majors and whom pitchers hadn't figured out yet. Puig set a record with 44 hits in his first 100 AB. He's developed into a decent hitter, but not a superstar.



          • #6
            Originally posted by fergieprs
            What's the deal here? Is it wrong to say this is a "small sample size" or is this person misusing the idea of a population?
            The person you were talking to was being a pedant.



            "Batting stats and pitching stats do not indicate the quality of play, merely which part of that struggle is dominant at the moment."

            -Bill James



            • #7
              The sample size needed depends on a few factors. You can't just eyeball sample size. The term you need to look at is statistical "power". The N needed depends on the strength of the effect and the variance. Some clinical studies are significant with a double-digit N while others aren't with 1000+.

              And of course your friend's argument is not valid. If you flip a coin once you have also used all the data available, but you have zero predictive value.

              https://en.m.wikipedia.org/wiki/Power_(statistics)
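
              As a rough sketch of how N depends on the strength of the effect, the snippet below uses the textbook normal-approximation formula for a one-sided, one-sample test of a proportion. The function name, the 5% significance level, the 80% power target and the batting averages compared are all assumptions chosen for illustration, and the approximation is crude at small N.

```python
from math import ceil, sqrt
from statistics import NormalDist


def at_bats_needed(p0: float, p1: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate at-bats needed to tell a true p1 hitter from a p0 hitter
    (one-sided test of a proportion, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)       # quantile for the desired power
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


# Small effect: is this a .300 hitter rather than a .250 hitter?
small_gap = at_bats_needed(0.250, 0.300)

# Huge effect: is this a .500 hitter rather than a .250 hitter?
large_gap = at_bats_needed(0.250, 0.500)

print(f".250 vs .300: about {small_gap} AB")
print(f".250 vs .500: about {large_gap} AB")
```

              The huge gap needs only a couple dozen at-bats while the realistic gap needs hundreds, which is the catch: nobody is a true .500 hitter, so the differences that actually matter in baseball are the small ones.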
              I now have my own non-commercial blog about training for batspeed and power using my training experience in baseball and track and field.
