This concerns the second dataset described in the thread.
637pitchers.txt is attached (below)
Format and dimensions
- one comma-delimited table, 638 x 22, where the first row gives field names
NOTE. The format is comma-separated values, which is commonly indicated by the filename extension ".csv". Here the filename ends in ".txt" because BaseBall-Fever doesn't accept the other. Renaming it to ".csv" may help, if only as a reminder of the format.
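For anyone reading the file with a script, here is a minimal sketch using Python's standard csv module. The function name read_pitchers and the sample values are made up for illustration; the only facts taken from the description above are that the table is comma-delimited and that the first row gives field names.

```python
import csv
import io

def read_pitchers(stream):
    """Parse comma-delimited text whose first row gives the field names."""
    reader = csv.reader(stream)
    header = next(reader)  # first row: field names
    rows = [dict(zip(header, values)) for values in reader]
    return header, rows

# Stand-in data: two invented records with only three of the 22 fields.
sample = io.StringIO(
    "playerID,IP,ERA+\n"
    "example01,2000.0,110\n"
    "example02,1500.1,109\n"
)
header, rows = read_pitchers(sample)
print(len(header), len(rows))
```

With the attachment itself, passing open("637pitchers.txt", newline="") should give 22 field names and 637 records if the file matches the dimensions stated above. All values come back as text, so numeric fields need converting before use.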
Sources
- The Baseball Database for MS Access, "lahman5.4", covering 1871-2006. Multiple editions, annually updated, are distributed at Sean Lahman's Baseball Archive.
- player pages at Baseball-Reference.com
- player pages or 'DT cards' at BaseballProspectus.com
Coverage: 637 pitchers or major league pitching careers, 1871-2008.
They include everyone with one of these career achievements:
- at least 2000 innings (407 checked against career leaders by Innings at bb-ref)
- at least 1500 innings and 109 ERA+ (239)
- at least 1000 innings and 116 ERA+ (186 checked against career leaders by ERA+)
They also include everyone from the phase one polls for "Top 100 Pitchers" at BaseBall-Fever.org (346?).
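The three statistical thresholds can be written as a single test. This is only a sketch: the function name meets_threshold is made up, "at least" is read as greater-than-or-equal for both innings and ERA+, and pitchers included only because of the phase one polls are not captured by it.

```python
def meets_threshold(ip, era_plus):
    """True if a career satisfies any of the three stated thresholds."""
    return (
        ip >= 2000                           # at least 2000 innings
        or (ip >= 1500 and era_plus >= 109)  # 1500 innings and 109 ERA+
        or (ip >= 1000 and era_plus >= 116)  # 1000 innings and 116 ERA+
    )

# Invented career lines, not records from the file.
print(meets_threshold(2100, 95))
print(meets_threshold(1600, 108))
print(meets_threshold(1050, 120))
```

The counts in parentheses above overlap, since a pitcher can satisfy more than one threshold.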
Fields
- playerID, nameFirst, nameLast, debut,
- lahmanID, time, name,
- XIP, RA, DH, DR, DW, NRA, RAA, PRAA, PRAR, DERA,
- IP, ERA+, PA, OPS+
- RpS (Reliefs per Start)
playerID, nameFirst, nameLast, debut
At the last minute, for convenience I included these four fields from lahman5.4, edited only by truncating [debut] at four characters, the debut year only. [playerID] rather than lahmanID (below) is used in that database to identify records in tables of season playing data. [playerID] is also used to define internet addresses for player pages at both baseball-reference and baseballprospectus.
lahmanID, time, name
[lahmanID] is the unique numerical identifier for people in the baseball-databank and in lahman5.4.
[time] represents MLB debut decade as an integer, -3 to 10 meaning 1870s to 2000s.
[name] is a version of the pitcher's name, surname followed by one space and some initial(s).
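To show how [time] relates to [debut], here is a small sketch. It assumes the coding is linear between the two stated endpoints (-3 for the 1870s, 10 for the 2000s), which puts 0 at the 1900s; the function names are made up.

```python
def debut_decade_code(debut_year):
    """Decade code for a debut year, assuming -3 = 1870s ... 10 = 2000s."""
    return debut_year // 10 - 190

def decade_label(time):
    """Readable decade for a [time] code under the same assumption."""
    return f"{1900 + 10 * time}s"

print(debut_decade_code(1871), decade_label(-3))
print(debut_decade_code(2008), decade_label(10))
```

In the file both fields arrive as text, so [debut] needs int() before it is passed to debut_decade_code.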
XIP, RA, DH, DR, DW, NRA, RAA, PRAA, PRAR, DERA (blue: scanty coverage of lesser pitchers)
The first ten fields of performance data are new sabrmetrics by Clay Davenport from the player 'DT cards' at baseballprospectus.com. Three of them (blue) are present only in some records that I added or checked recently. Career coverage should be complete through 2008 but the sabrmetrics evolve and I systematically checked only active players after the 2008 season.
[DERA] is a measure of pitching quality on a runs per 9 innings scale. Par is 4.50, so 450/DERA is an index on the ERA+ scale. I call it 'DERA+'. Compared with other normalized measures of pitcher runs allowed (or runs/9), the crucial difference is that Davenport uses his own attribution of runs to pitchers and fielders rather than using official earned and unearned runs. The other "normalizations" are essentially what other sabrmetricians use, adjusting for the team's run-scoring environment.
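The 'DERA+' index is a one-line calculation. A sketch with invented DERA values (the function name dera_plus is made up):

```python
def dera_plus(dera):
    """Index on the ERA+ scale: par DERA of 4.50 maps to 100."""
    return 450 / dera

print(dera_plus(4.50))  # par
print(dera_plus(3.00))  # better than par, so above 100
```

As with ERA+, higher is better: a DERA below 4.50 gives an index above 100.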
IP, ERA+, PA, OPS+
The sources for IP, ERA+, PA, and OPS+ are player pages at baseball-reference.com and "the Baseball Database" lahman5.4. That database covers 1871-2006. I updated many active players thru 2008, mainly by checking against career leaders in IP and ERA+. For the active players I tried to remember to check innings at the baseballprospectus DT cards.
[RpS] for 'Reliefs per Start' is a career statistic: games pitched in relief or 'Reliefs' divided by 'Starts plus one'. Specifically, using official pitching statistics G and GS, that is (G-GS)/(GS+1). The highest values of RpS are the numbers of pitching games for a few relief pitchers who never started; the lowest value is zero for several starting pitchers who never relieved. RpS is derived from the 1871-2006 database without systematic update thru 2008.
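The RpS formula in code, as a sketch with invented career totals (the function name reliefs_per_start is made up):

```python
def reliefs_per_start(g, gs):
    """(G - GS) / (GS + 1), from official games and games started."""
    return (g - gs) / (gs + 1)

print(reliefs_per_start(500, 0))    # never started: equals games pitched
print(reliefs_per_start(400, 400))  # never relieved: zero
print(reliefs_per_start(30, 9))     # mixed career
```

The 'plus one' in the denominator is what keeps the statistic defined for relievers with no starts.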