24 May 2004

Similarity Scores

By Michael David Smith

Take two running backs. One is already in the Hall of Fame. Let's call him Larry Csonka. The other, you're sure, deserves to be in the Hall of Fame. Let's call him Terrell Davis. You're certain the two players are similar: You point out that both were Super Bowl MVPs and that their career numbers look somewhat alike. But you're having trouble convincing the guy next to you at the bar, and you'd have even more trouble convincing those 39 guys on the Hall of Fame selection committee. What can you do?

An answer might come in the form of similarity scores. Like much in the world of sports statistical analysis, similarity scores are stolen from Bill James, who developed them for baseball. If two players had identical career numbers, they would have a similarity score of 1,000. Of course, two players never have identical career numbers. So what we do is deduct a certain number of points for each statistical category in which the two players differ. For example, with running backs, we could subtract:

1 point for a 100-yard difference in rushing yards
1 point for a 100-yard difference in receiving yards
1 point for a difference of 50 carries
1 point for a difference of 50 catches
1 point for a difference of one rushing touchdown
1 point for a difference of one receiving touchdown
1 point for each difference of one-hundredth of a yard per carry

So let's look at Csonka, the Hall of Famer, and Davis, the former MVP who probably won't get into the Hall:

Player           Carries  Rush Yds  Rush TD  Catches  Rec Yds  Rec TD   YPC  Similarity to Csonka
Larry Csonka        1891      8081       68      106      820       4  4.27  1000
Terrell Davis       1655      7607       60      169     1280       5  4.60   942
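
For readers who want to check the arithmetic, here is a minimal Python sketch of the deduction scheme, using the example weights listed above and the two career lines from the table. The names `CareerLine`, `UNIT_PER_POINT` and `similarity` are invented for this illustration. It also assumes deductions are fractional (a 474-yard gap costs 4.74 points) and that yards per carry is taken as printed, since the rounding rules aren't spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CareerLine:
    carries: int
    rush_yards: int
    rush_td: int
    catches: int
    rec_yards: int
    rec_td: int
    ypc: float  # yards per carry, as printed in the table


# Size of the gap that costs one point in each category (the example weights).
UNIT_PER_POINT = {
    "rush_yards": 100,
    "rec_yards": 100,
    "carries": 50,
    "catches": 50,
    "rush_td": 1,
    "rec_td": 1,
    "ypc": 0.01,
}


def similarity(a: CareerLine, b: CareerLine) -> float:
    """Start at 1,000 and deduct for every category in which the careers differ."""
    deduction = sum(
        abs(getattr(a, stat) - getattr(b, stat)) / unit
        for stat, unit in UNIT_PER_POINT.items()
    )
    return 1000 - deduction


csonka = CareerLine(1891, 8081, 68, 106, 820, 4, 4.27)
davis = CareerLine(1655, 7607, 60, 169, 1280, 5, 4.60)
print(round(similarity(csonka, davis), 1))
```

Under those assumptions the pair comes out at roughly 942.7, in line with the 942 shown in the table.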

This doesn't prove anything, but it does at least give you a statistical basis for your argument about Larry Csonka and Terrell Davis. A similarity score of 942 means Davis's career is quite similar to Csonka's. When we add in the other similarities that statistics don't measure (both were Super Bowl MVPs and won two rings, both played with Hall of Fame quarterbacks, both played behind good offensive lines), we can say that Davis and Csonka had very similar careers, and that's good ammunition for the people who want to see Davis in the Hall of Fame.

We can go through every quarterback, running back and receiver in the Hall of Fame and find the most similar players statistically, and that will give us an idea of which other players are similar to Hall of Famers and therefore worthy of induction themselves. With receivers we would use only receptions, yards, touchdown receptions and yards per catch, while with quarterbacks we'd add interceptions and completion percentage to yards and touchdowns. Unfortunately, we'd probably also have to normalize against the league average because of the stat inflation that has happened with the NFL's passing numbers.
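
How that normalization would work isn't settled here. One simple possibility, offered only as a sketch, is to rescale each total by the ratio of a baseline league average to the league average of the player's own era. The function `era_adjust` and its parameters are hypothetical, and the numbers in the example are placeholders, not real league figures.

```python
def era_adjust(raw_total: float, league_avg_then: float, league_avg_baseline: float) -> float:
    """Rescale a stat from its own era to a common baseline era.

    Assumption: inflation is handled by a simple ratio of league averages.
    Other schemes (per-game rates, standard deviations) are equally plausible.
    """
    if league_avg_then <= 0:
        raise ValueError("league average must be positive")
    return raw_total * league_avg_baseline / league_avg_then


# Placeholder numbers for illustration only, not real league figures:
# a passer with 2,000 yards when the league average was 1,500, restated
# for a baseline era in which the league average is 3,000.
print(era_adjust(raw_total=2000, league_avg_then=1500, league_avg_baseline=3000))  # 4000.0
```

The adjusted totals, rather than the raw ones, would then feed into the point deductions.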

Similarity scores require a huge amount of research, and football research is in its infancy, but let's start here with running backs, who are more similar through the years than quarterbacks or receivers. And among running backs we'll start with Jim Brown, who's generally recognized as the gold standard of running backs. What you see below are all the running backs in NFL history whose similarity score to Jim Brown is at least 800. I also added the NFL's all-time leading rusher, Emmitt Smith.

Player           Carries  Rush Yds  Rush TD  Catches  Rec Yds  Rec TD   YPC  Similarity to Brown
Jim Brown           2359     12312      106      262     2499      20  5.22  1000
Barry Sanders       3062     15269       99      352     2921      10  4.99   911
Joe Perry           1929      9723       71      260     2021      12  5.04   894
OJ Simpson          2404     11236       61      203     2142      14  4.67   878
Eric Dickerson      2996     13259       90      281     2137       6  4.43   864
Tony Dorsett        2936     12739       77      398     3554      13  4.34   846
Franco Harris       2949     12120       91      307     2287       9  4.11   846
Jim Taylor          1941      8597       83      225     1756      10  4.43   834
John Riggins        2916     11352      104      250     2090      12  3.89   831
Marshall Faulk      2576     11213       97      673     6274      34  4.35   829
Thurman Thomas      2877     12074       65      472     4458      23  4.20   820
Marcus Allen        3022     12243      123      587     5411      21  4.05   814
Walter Payton       3838     16726      110      492     4538      15  4.36   806
Earl Campbell       2187      9407       74      121      806       0  4.30   804
Emmitt Smith        4142     17418      155      500     3119      11  4.20   747
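
The list above can be rebuilt mechanically: score every back against Brown, keep those at or above the cutoff, and sort. Here is a rough Python sketch over a handful of rows from the table. It assumes the example weights given earlier with fractional deductions, so the scores land within a few points of the published figures rather than exactly on them. The names `CAREERS`, `UNITS` and `most_similar` are invented for the illustration.

```python
# Career lines copied from the table:
# (carries, rush yards, rush TD, catches, receiving yards, receiving TD, YPC)
CAREERS = {
    "Jim Brown": (2359, 12312, 106, 262, 2499, 20, 5.22),
    "Barry Sanders": (3062, 15269, 99, 352, 2921, 10, 4.99),
    "Joe Perry": (1929, 9723, 71, 260, 2021, 12, 5.04),
    "OJ Simpson": (2404, 11236, 61, 203, 2142, 14, 4.67),
    "Emmitt Smith": (4142, 17418, 155, 500, 3119, 11, 4.20),
}

# Gap that costs one point in each slot, in the same order as the tuples
# (the example weights; assumed, since the exact weights may differ).
UNITS = (50, 100, 1, 50, 100, 1, 0.01)


def similarity(a, b):
    """1,000 minus the summed per-category deductions."""
    return 1000 - sum(abs(x - y) / unit for x, y, unit in zip(a, b, UNITS))


def most_similar(target, careers, cutoff=800):
    """Return (name, score) pairs at or above the cutoff, best match first."""
    base = careers[target]
    scored = [
        (name, similarity(base, line))
        for name, line in careers.items()
        if name != target
    ]
    kept = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for name, score in most_similar("Jim Brown", CAREERS):
    print(f"{name:15s} {score:6.1f}")
```

On these rows the sketch keeps Sanders, Perry and Simpson in that order and drops Smith below the 800 line, which matches the table even though the individual scores differ slightly.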

I want to make it clear here that I'm not equating similarity to Jim Brown with greatness. Faulk's score, for instance, is lowered because he did significantly more than Brown catching the ball. That's certainly not a bad thing. Payton and Smith chose to play past their primes and therefore moved up and away from Brown in career totals while simultaneously moving down and away from Brown in yards per carry. If Smith and Payton had retired on top the way Brown and Sanders did, they'd be more similar to Brown (though still not as similar to Brown as Sanders is), but I'm not going to criticize Smith and Payton for choosing to play as long as they could. Similarity scores aren't intended to show the best players; they're intended to show the most similar players.

Sanders and Perry had short, great careers, just like Brown, so seeing Sanders and Perry on top of the similarity list suggests that the system works. It's striking to see how similar the career totals of Riggins and Brown are, and that's why it's important to use yards per carry as one of the similarity categories. We can see that Riggins finished within 1,000 yards rushing, 500 yards receiving and 10 total touchdowns of Brown, but it's not until we look at yards per carry that we realize just how far short Riggins falls.
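
To see how much of the Riggins gap comes from yards per carry, the deductions can be broken out by category. This is a small Python sketch under the same assumptions as before (the example weights, fractional deductions, yards per carry as printed). The names `CATEGORIES`, `UNITS` and `deductions` are made up for the illustration.

```python
CATEGORIES = ("carries", "rush_yards", "rush_td", "catches", "rec_yards", "rec_td", "ypc")
# Gap that costs one point in each category (assumed example weights).
UNITS = (50, 100, 1, 50, 100, 1, 0.01)

# Career lines as printed in the table, in CATEGORIES order.
brown = (2359, 12312, 106, 262, 2499, 20, 5.22)
riggins = (2916, 11352, 104, 250, 2090, 12, 3.89)


def deductions(a, b):
    """Points deducted in each category for the gap between two careers."""
    return {
        cat: abs(x - y) / unit
        for cat, x, y, unit in zip(CATEGORIES, a, b, UNITS)
    }


breakdown = deductions(brown, riggins)
# Print the categories from largest deduction to smallest.
for cat, points in sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat:10s} {points:6.2f}")
print(f"{'total':10s} {sum(breakdown.values()):6.2f}")
```

With these weights, the yards-per-carry gap alone accounts for about 133 of the roughly 168 points Riggins loses. The volume categories together cost him about 35.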

If I could add another statistic to similarity scores for running backs, it would be fumbles. But so far I haven't been able to find fumble stats for the old-timers. Fumbles for running backs are as important a statistic as interceptions for quarterbacks, yet for some reason the NFL doesn't seem to want to reveal the number of fumbles its past running backs have produced. (Although in researching this article I did find another similarity between Csonka and Davis: Csonka had 21 career fumbles and Davis had 20.)

One of the biggest differences between football and baseball is the way the number of games played in a season has affected season and career totals. Baseball gave Roger Maris's home run record an asterisk when he had the audacity to use eight extra games to top Babe Ruth. But eight games represents only 5 percent of the baseball season. NFL seasons are 33 percent longer now than they were in 1960, when Jim Brown was in the middle of his career. Combine that fact with offense-friendly rule changes, and skill position players have much better numbers now than they did in the past. That poses a problem for similarity scores, but not a fatal problem. After all, when we examine the two players most similar to Brown we get one modern player and one who preceded Brown.

This is, to the best of my knowledge, the first attempt at using similarity scores to get a historical perspective on football. (I have seen similarity scores used for fantasy football, but those compared only modern players and made no attempt at comparing current players to the stars of the past.) This is only a first try, and I hope readers will give their thoughts on how it can be improved.
