08 Sep 2004
by Aaron Schatz
(Updated October 7 with slight changes to correlation coefficients due to a mistake that switched wins and points for San Francisco and Seattle.)
Today, along with our redesign, we introduce the latest upgrade to our DVOA (Defense-adjusted Value Over Average) ratings system that breaks the entire NFL season down play-by-play to determine how teams perform compared to the league average. This is the main statistic that we use to rate both teams and players at Football Outsiders, but I know it can get confusing to constantly see percentages and those four letters and have no idea what the heck we are talking about because you just discovered Football Outsiders yesterday. So for all those who have just started reading the site in the last couple of months, I wanted to give another explanation of how it works and why DVOA is better than standard statistics. Afterwards, I'll go into the changes in DVOA that are reflected in the new stats pages presented today. This official unveiling of the latest version of the DVOA system also paves the way for me to bring out the long-awaited 2000 and 2001 numbers over the next couple of weeks.
One running back runs for three yards. Another running back runs for three yards. Which is the better run?
It sounds like a trick question, doesn't it? Two runs of the same length -- how can one be better than the other? But the answer to this trick question is the secret to finally breaking through the biases inherent in NFL statistics.
What is the down and distance? Is it 3rd-and-2, or 2nd-and-15? Where on the field is the ball? Does the player get only three yards because he hits the goal line and scores? Is this player's team up by two touchdowns in the fourth quarter, so that he is running out the clock, or down by two touchdowns so that the defense is playing purely against the pass? Is our running back playing against Baltimore, or Kansas City?
All of these variables have an impact on what we should expect from any offensive play in football. Three yards is different depending on the situation. Sometimes it is a success, like when it gets a first down. Sometimes it is a failure, like when it comes on first down with the team losing by two touchdowns with five minutes left. And yet, conventional NFL statistics count plays based solely on their yardage. The NFL determines the best players by adding up all their yards no matter what situations they came in or how many plays it took to get them. That's a problem because football has two objectives that get you closer to scoring: gaining yards, and achieving first downs. These two goals need to be balanced to determine a player's value or a team's performance. All the yards in the world aren't useful if they all come in eight-yard chunks on third-and-tens.
The popularity of fantasy football only exaggerates the problem. Fans have gotten used to judging players based on how much they help fantasy teams win and lose, not how much they help real teams win and lose. But fantasy scoring skews things by counting the yard between the one and the goal line as 61 times more important than all the other yards on the field. Let's say Keyshawn Johnson catches a pass on 3rd-and-15 and goes 50 yards but gets tackled two yards from the goal line, and then Eddie George takes the ball on 1st-and-goal from the two-yard line and plunges in for the score. Or, let's say that the Giants take a touchback on the opening kickoff, and the Dallas defense stuffs Tiki Barber twice, and on third-and-10 Kurt "Nine Fingers" Warner throws the ball into the arms of Terence Newman, who gets taken down by Jeremy Shockey at the two-yard line. Then on the ensuing 1st-and-goal, George scores a touchdown.
Has George done something special? Not really. When an offense gets the ball on 1st-and-goal at the two-yard line, they are going to score a touchdown five out of six times. In the first situation, George is getting the credit that primarily belongs to the passing game. In the second situation, George is getting the credit that primarily belongs to the defense.
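The "61 times" figure can be checked with a little arithmetic. The sketch below assumes a common fantasy scoring format (one point per 10 rushing or receiving yards, six points per touchdown); the article does not state which format it has in mind, so those values are assumptions.

```python
# Assumed scoring (not stated in the article): 1 point per 10 yards, 6 per touchdown.
# Work in tenths of a point so the arithmetic stays exact.
TENTHS_PER_YARD = 1
TENTHS_PER_TOUCHDOWN = 60

def yard_value_tenths(scores_touchdown: bool) -> int:
    """Fantasy value, in tenths of a point, of a single yard gained."""
    value = TENTHS_PER_YARD
    if scores_touchdown:
        value += TENTHS_PER_TOUCHDOWN
    return value

ratio = yard_value_tenths(True) // yard_value_tenths(False)
print(ratio)  # 61
```

Under that assumed format, the yard that crosses the goal line is worth 6.1 fantasy points while every other yard is worth 0.1.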
Can we do a better job of distributing credit for scoring points and winning games? That's the goal of VOA, or Value Over Average. VOA breaks down every single play of the NFL season to see how much success offensive players achieved in each specific situation compared to the league average. It uses a value based on both total yards and yards towards a first down, based on work done by Pete Palmer, Bob Carroll, and John Thorn in their seminal book, The Hidden Game of Football. On first down, a play is considered a success if it gains 45% of needed yards; on second down, a play needs to gain 60% of needed yards; on third or fourth down, only gaining a new first down is considered success.
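To make the success thresholds concrete, here is a minimal sketch in Python. The function and constant names are hypothetical, and it covers only the basic success test described above, not the full "success points" system or any situational adjustments.

```python
# Share of the needed yards that a play must gain to count as a success,
# using the thresholds quoted in the text above.
SUCCESS_THRESHOLDS = {1: 0.45, 2: 0.60, 3: 1.00, 4: 1.00}

def is_successful(down: int, yards_to_go: int, yards_gained: int) -> bool:
    """Return True if the play meets the success threshold for its down."""
    if down not in SUCCESS_THRESHOLDS:
        raise ValueError(f"down must be 1-4, got {down}")
    return yards_gained >= SUCCESS_THRESHOLDS[down] * yards_to_go

print(is_successful(3, 2, 3))   # True: three yards on 3rd-and-2 moves the chains
print(is_successful(2, 15, 3))  # False: three yards on 2nd-and-15 falls short of 60%
```

The two calls are the same three-yard run from the opening question, judged in two different situations.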
We then expand upon that basic idea with a more complicated system of "success points." A successful play is worth one point, an unsuccessful play zero points. Extra points are awarded for big plays, gradually increasing to three points for 10 yards, four points for 20 yards, and five points for 40 yards or more. There are fractional points in between. (For example, eight yards on 3rd-and-10 is worth 0.54 "success points.") Losing three or more yards is -1 point, an interception is -8 points, and a fumble is worth anywhere from -2.15 to -6.54 points depending on how often a fumble in that situation is lost to the defense. Red zone plays are worth more, and there is a bonus given for a touchdown. (The system is a bit more complex than the one in Hidden Game thanks to a number of developments, including the larger penalty for turnovers, the fractional points, and a slightly higher baseline for success on first down.)
Every single play run in the NFL gets a "success value" based on this system, and then that number gets compared to the average success values of plays in similar situations for the entire season for all players, adjusted for a number of variables. These include down and distance, field location, time remaining in game, and current scoring lead or deficit. Rushing plays are compared to other rushing plays, passing plays to other passing plays, tight ends get compared to tight ends and wideouts to wideouts. Going back to our example of the three-yard rush, if Player A gains three yards under a set of circumstances where the average NFL running back gains only one yard, it can be argued that Player A has a certain amount of value above others at his position. Likewise, if Player B gains three yards on a play where, under similar circumstances, an average NFL back would be expected to gain four yards, it can be argued that Player B has negative value relative to others at his position. If you divide a player's total success value by the average success values of all players in each of the situations he faced, you get VOA, or Value Over Average.
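The division described above can be sketched as follows. This is an illustration, not the actual implementation: the function name and the numbers are hypothetical, the situational adjustments are left out, and expressing the ratio as a percentage above or below average is an assumption made so that a league-average player comes out at 0%.

```python
from typing import Iterable, Tuple

def value_over_average(plays: Iterable[Tuple[float, float]]) -> float:
    """Each play is (success_value, league_average_success_value_in_that_situation).

    Returns the fraction by which the player's total exceeds the baseline total,
    so 0.0 is league average and 0.25 means 25% above average.
    """
    plays = list(plays)
    total = sum(value for value, _ in plays)
    baseline = sum(average for _, average in plays)
    if baseline == 0:
        raise ValueError("baseline success value must be non-zero")
    return total / baseline - 1.0

# Hypothetical numbers for illustration only.
player_a = [(1.5, 1.0), (1.0, 1.0)]
print(f"{value_over_average(player_a):+.1%}")  # +25.0%
```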
Of course, the biggest variable in football is the fact that each team plays a different schedule. By adjusting each play based on the defense's average success in stopping that type of play over the course of a season, we get DVOA, or Defense-adjusted Value Over Average. Rushing and passing plays are adjusted based on down and location on the field; receiving plays are also adjusted based on how the defense performs against passes to running backs, tight ends, and wide receivers.
One of the hardest parts of understanding a new statistic is grasping the idea of what numbers represent good performance or bad performance. We try to make that easy with DVOA, because it gets compared to average. Therefore, 0% always represents league-average. A positive DVOA represents more scoring, and a negative DVOA represents less scoring. This is why the best offenses have positive DVOA ratings (Kansas City: +28.4%) and the best defenses have negative DVOA ratings (Baltimore: -32.0%). Ratings for teams and starting players generally follow that scale, with the best being around 30% and the worst being around -30% (opposite for defense). With fewer situations to measure, the numbers spread out a bit more, so you'll see more extreme DVOA ratings for part-time players and for measurements of teams in more specific situations (for example, passing on third downs).
(Confusion alert: Originally, we called the adjusted VOA for defense something else, but finally we decided that it was better to call opponent-adjusted VOA the same thing in every instance, and most people thought "DVOA" just sounded better than "OVOA" or "AVOA" even though, yes, when we're talking about defenses the "D" can't stand for defense. Think of it as standing for "dependent on opponent" or something.)
The biggest advantage of DVOA is the ability to break teams and players down to find strengths and weaknesses in a variety of situations. In the aggregate, DVOA may not be quite as accurate as some of the other, similar "power ratings" formulas based on comparing drives rather than individual plays, but unlike those other ratings DVOA can be separated not only by player but also by down, or by week, or by distance needed for first down. This can give us a better idea of not just which team is better but why, and what a team has to do in order to improve itself in the future. Some readers have criticized us for using DVOA in too many different ways, but that's the idea behind the number -- since it takes every single play into account, it can be used to measure a player or a team's performance in any situation. And, since it compares each play only to plays with similar circumstances, it gives a more accurate picture of how much better a team really is compared to the league as a whole. The list of top DVOA offenses on third down, for example, is more accurate than the conventional NFL conversion statistic because it takes into account that converting third-and-long is more difficult than converting third-and-short.
What about special teams? The special teams method is different, since each play on special teams has a single goal instead of two goals like rushing and passing plays. Either you want to get the ball through the uprights, you want to kick the ball really far, or you want to return the ball for as many yards as possible. Punts and kickoffs are judged based on the difference in point value between each kick and an average kick from that position on the field. Punt returns and kick returns are judged based on the difference in point value between each return and an average return from the spot where the ball is picked up. Each field goal is compared to the league-average percentage of field goals from that distance. The whole method is described in detail here, although it has recently undergone a major revision, improving the value of touchbacks and adding an adjustment for weather, stadium, and altitude. An article on those revisions is coming later this season.
Now, while DVOA is the best indicator of how a player (or team) has performed on a per play basis, it has a major flaw. It doesn't take into account the value of a player being involved in a greater number of plays, even if his performance is league-average. A player who is involved in more plays can draw the defense's attention away from other parts of the offense. If that player is a running back, he can take time off the clock with repeated runs. And most importantly, nearly every player is a starter for a reason: he is better than the alternative.
Let's say you have a running back who carries the ball 300 times in a season. What would happen if you were to remove this player from his team's offense? What would happen to those 300 plays? Well, the player would not be replaced by thin air. This is why you have to compare performance to some kind of baseline; two yards is not two yards better than the alternative. On the other hand, while comparing players to the league average works on a per play basis, it doesn't work on a total basis because a player removed from an offense is not generally replaced by a similar player. Those 300 plays will generally be given to a significantly worse player, someone who is the backup because he doesn't have as much experience and/or talent.
To take this into account, we borrowed the concept of replacement level from Baseball Prospectus. Using a scale similar to the scale BP uses to determine baseball's replacement level, we've determined that a replacement level player has a DVOA of roughly -13.3%. (If you want to know why, it is explained in the original article introducing PAR.) Instead of determining value by comparing each play's "success value" to the average, as in DVOA, each play is instead compared to a number roughly 13.3% below the average success value of similar plays. That gives us value over a replacement level player, a better representation of a player's total contribution to his team on all his plays.
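As a rough sketch of that comparison, the snippet below lowers each play's baseline by 13.3% and sums the differences. The function name and the sample data are hypothetical, and the position-specific replacement levels are ignored.

```python
REPLACEMENT_LEVEL = -0.133  # general replacement-level DVOA quoted in the text

def success_points_over_replacement(plays, replacement_level=REPLACEMENT_LEVEL):
    """Sum each play's success value minus the replacement baseline.

    plays: iterable of (success_value, league_average_success_value) pairs.
    """
    return sum(
        value - average * (1.0 + replacement_level)
        for value, average in plays
    )

# Hypothetical league-average back: 300 plays, each exactly at an average of 1.0.
average_back = [(1.0, 1.0)] * 300
print(round(success_points_over_replacement(average_back), 1))  # 39.9
```

A back who is exactly average on every play has a DVOA of 0%, yet over 300 plays he still accumulates value over replacement, which is the gap a per-play rating cannot show.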
Actually, while in general replacement level is -13.3%, technically it is different for each position depending on whether we are measuring passing, rushing, or receiving. And, of course, the real replacement player is different for each team in the NFL. Chicago's backup quarterback (Chandler) was better than its starting quarterback (Stewart). Houston's second-team running back (Davis) was better than the guy who began the year as a starter (Mack). Both of Steve McNair's backups performed well. Sometimes, the drop from the starter to the backup is even greater than the general drop to replacement level. Despite being known over the years for depth, Denver last year had a massive drop when they went from their starting running back to his backup, and when they went from the starting quarterback (Plummer) to his backup (Beuerlein). Since you need to generalize for the league as a whole, and no starter can be blamed for the poor performance of his backup, we use the same general replacement level across the league.
Of course, giving a number of "success value points over replacement level" would be fairly useless to the average fan and even the non-average fan. If I tell you Tommy Maddox was worth 77 success value points over replacement, you would have no idea what the heck I was talking about. So we translate those success value points into a number that represents actual points. After working through statistics from the past four seasons, our best approximation is that a team made up entirely of replacement-level players would be outscored 407 to 260, finishing with a 4-12 record. Conveniently, this is close to the average record of the last four expansion teams. But part of the reason this team gives up so many more points than it scores is that it has replacement-level special teams. Those replacement level special teams are worth -27 points, making the actual baseline for determining offensive value 274 points (the baseline for defensive value is 394 points).
With a bit of math, it works out that each "success value point" over replacement level is worth about .48 actual points above this offensive baseline. We also adjust this number for the strength of the opponents each player has faced. Now I can tell you that Tommy Maddox was worth 37 points more than a replacement level quarterback in 2003, or 37 DPAR (which stands for Defense-adjusted Points Above Replacement). That's nothing compared to Peyton Manning, of course, who was worth 127 points more than a replacement level quarterback, or 127 DPAR.
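The arithmetic behind these figures can be checked with a short sketch. The names are hypothetical and the opponent adjustment is omitted. The even split of the special teams value between offense and defense is an assumption; it is used here only because it reproduces the 274 and 394 baselines after rounding.

```python
POINTS_PER_SUCCESS_POINT = 0.48  # approximate conversion quoted in the text

def points_above_replacement(success_points: float) -> float:
    """Convert success value points over replacement into actual points."""
    return success_points * POINTS_PER_SUCCESS_POINT

print(round(points_above_replacement(77)))  # 37

# Assumption: the -27 special teams points are split evenly between
# the offensive and defensive baselines.
REPLACEMENT_POINTS_FOR, REPLACEMENT_POINTS_AGAINST = 260, 407
SPECIAL_TEAMS_POINTS = -27
offense_baseline = REPLACEMENT_POINTS_FOR - SPECIAL_TEAMS_POINTS / 2
defense_baseline = REPLACEMENT_POINTS_AGAINST + SPECIAL_TEAMS_POINTS / 2
print(offense_baseline, defense_baseline)  # 273.5 393.5
```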
Official NFL statistics rank offenses and defenses based on yardage, but VOA has a much higher correlation with winning. (I'm using VOA, which is not adjusted for strength of opponents, since yardage and wins aren't adjusted for strength of opponents.) For those unfamiliar with correlation coefficients, a higher number closer to 1.0 means a closer connection between two variables.
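For readers who want to see what a correlation coefficient measures, here is a minimal sketch of the standard Pearson formula in plain Python. The team ratings and win totals are made up for illustration and are not taken from our data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)

# Made-up ratings and win totals for five teams, for illustration only.
ratings = [-0.20, -0.05, 0.00, 0.10, 0.25]
wins = [4, 7, 8, 9, 12]
print(round(pearson_r(ratings, wins), 3))
```

A result near 1.0 means the two lists rise and fall together, a result near -1.0 means one rises as the other falls, and a result near 0 means no linear relationship.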
Over the past four seasons, the correlation of yards gained is .512 to wins and .814 to points scored. The correlation of offensive VOA is .622 to wins and .840 to points scored.
On defense, the correlation of yards allowed is -.498 to wins and .751 to points allowed. The correlation of defensive VOA is -.616 to wins and .824 to points allowed. (The correlations to wins are negative because fewer yards allowed or lower VOA means a better defense and thus more wins.)
The accuracy of VOA looks even better when you combine offense and defense. After all, field position is important on both sides of the ball; a defense that stops the opposition sets up its offense for better field position on the start of the next drive. Often what looks like offense is actually a team's defense putting its own offense in a better position to score (which explains how the 2003 Ravens managed to score more than three points a game).
Over the past four years, correlation of yards gained minus yards allowed is .739 to point differential (points scored minus points allowed) and .679 to wins. Correlation of VOA, offense and defense, is .933 to point differential and .858 to wins. Add in special teams, and the correlation of VOA is .957 to point differential and .873 to wins.
Compared to wins and points, VOA and DVOA are also better predictors of performance the next season. From 2000-2003, the year-to-year correlation of wins was only .233. Year-to-year correlation of point differential was .293. But year-to-year correlation of VOA was .429, and year-to-year correlation of DVOA (with adjustments for strength of schedule in each season) was .488.
You need to have the entire play-by-play of a season in order to compute DVOA, so it is useless for comparing players of today to players of history. As of this writing, we have processed four seasons, 2000-2003.
DVOA is limited by what's included in the official NFL play-by-play, so we can't say which teams have the best offensive DVOA when play-faking, or the best defensive DVOA against three-receiver sets.
Since play-by-play lists tackles, sacks, and interceptions, but not attempted tackles, or attempted sacks or interceptions, we don't have individual DVOA or DPAR for defensive players at this point.
DVOA is still far away from the point where we can use it to represent the value of a player separate from the performance of his ten teammates that are also involved in each play. That means that when we say, "Priest Holmes has a DVOA of 17.6%," what we are really saying is "Priest Holmes, playing in the Kansas City offensive system with the Kansas City offensive line blocking for him and Trent Green selling the fake when necessary, has a DVOA of 17.6%."
We use it a lot, and DVOA is a very good tool for measuring a team or player, but it is not the only one. It has to be understood in context.
For those who are keeping track, here's where we've been so far:
Since many of the changes being made today are not in the main DVOA itself, but rather in some of our other statistics that are based on DVOA, I'm just calling this version 4.1. Before I explain all the changes, I'll do the same thing here that I did when I introduced v4.0 a few months ago, showing how the new version moves teams up or down atop the 2003 rankings. Here are the top ten teams in offense, defense, and overall with both the new version (version 4.1) and the version that's been on the website since May (version 4.0). Yes, Kansas City's offense and special teams were so strong last year that they come out on top despite terrible defense:
All fumbles equal: The biggest change made in DVOA v4.1 is that all fumbles are now considered equal. Previously, teams and players were both given negative value for a lost fumble, the same as an interception, but a fumble recovered by the offense was ignored. However, research done by Jim Armstrong in this guest column showed that defensive fumble recoveries are very inconsistent from year to year, and looking closely at running back numbers seemed to indicate that while some backs do have a tendency to fumble more often year after year, the percentage of those fumbles lost is random. (For example, Pittsburgh recovered all of Jerome Bettis' fumbles in 2001 and none of them in 2002 or 2003.)
I created two versions of DVOA, one with both fumbles lost and fumbles kept given half the penalty of an interception, and one with fumbles lost given the whole penalty and fumbles kept given no penalty. Over the past four seasons, the version that considered all fumbles equal correlated better on both offense and defense from year to year, from the first half of each season to the second half, and even from all games in odd numbered weeks to all games in even numbered weeks. That's good enough for me to believe that it is a better indicator of true ability. (If you want specifics, the 2000-2003 year-to-year correlation of DVOA without fumbles equal was .464; with fumbles equal it was .488.)
However, this means that DVOA now deviates even further from a team's actual record because, of course, in real life that random aspect of recovering fumbles has a huge impact on the game. So while DVOA correlates better from year to year, it correlates worse with a team's actual record. A very good example here is 12-4 St. Louis, which sees its total DVOA cut nearly in half by this change. St. Louis defenders saw opposition fumbles bounce into their arms at an absurd rate in 2003. That meant more wins for St. Louis last season but it is not likely to carry over to 2004.
I should note that "all fumbles equal" isn't quite the case. Fumbles are given different negative values depending on how often that kind of fumble is recovered by the offense. For example, offenses recover three out of every four aborted snap fumbles, about half of all fumbles on sacks or QB rushes, and roughly 35 percent of all fumbles on running back rushes. The value of a fumble on a reception changes with the length of the reception; the longer the yardage, the more likely a fumble is going to be lost to the defense because the receiver is running out there all alone without a bunch of teammates around who can lunge for that loose ball.
More accurate PAR: When I first introduced the concept of Points Above Replacement, I admitted that I wasn't quite sure if I had the right multiplier to turn my system's "value points" into a representation of actual points scored on the field. It turns out that I was off more than I expected. After analyzing the last four years of data it looks like I was off by about 25 percent. Therefore, all PAR and DPAR numbers are now 25 percent higher (well, except the negative ones, which are lower). This means that every DPAR number in every article before August 2004 is now 25 percent too low, but I'm not going to be able to go back to fix them all because unfortunately I don't have one of those neat Twilight Zone watches that stops time. However, sometime this season I'll put together an article that breaks down a couple of teams to show just how well PAR now correlates with actual scoring.
Overhauled special teams ratings: I mean really, really overhauled. Special teams are now based on a baseline of four years of data. Touchbacks on kickoffs have been fixed so they count as better plays for the kicking team. A new "weather" adjustment has been added; it does not actually take into account weather, but rather it adjusts based on week and type of stadium (separated into cold, warm, dome, and Denver) to give more credit to quality kicking later in the year and penalize kicks in domes and high altitude. (This makes Denver's 2003 punting look so bad that the punter is now listed in the record as "Alan Smithee.") This is another issue where I'll be doing a larger article sometime in the next couple months instead of explaining the whole thing here.
More accurate WEIGHTED DVOA: WEIGHTED DVOA is the attempt to figure out how a team is playing right now, as opposed to over the season as a whole, by making recent games more important than earlier games. Using trial and error I've arrived at a new set of coefficients that make WEIGHTED DVOA (over the last seven weeks of each of the past four years) a much better indicator of which team will win the next game of the season. The importance of games 10-12 weeks ago has been reduced, the importance of games 13-14 weeks ago is really small, and games 15 weeks ago and earlier now don't count at all. For end of season rankings, this does things like boost the 2003 Eagles and drop the 2003 Chiefs, and wait until you see what it does to the 2001 Chargers. Yikes.
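The general shape of that weighting can be sketched as a weighted average. Only the cutoffs (10-12, 13-14, and 15 or more weeks ago) come from the description above. The weight values are hypothetical placeholders, since the actual coefficients are not given here, and weighting whole-game ratings rather than individual plays is a simplification.

```python
def game_weight(weeks_ago: int) -> float:
    """Hypothetical weights; the actual coefficients are not published here."""
    if weeks_ago >= 15:
        return 0.0   # games 15 or more weeks ago do not count
    if weeks_ago >= 13:
        return 0.1   # "really small" weight (assumed value)
    if weeks_ago >= 10:
        return 0.5   # reduced weight (assumed value)
    return 1.0       # recent games at full weight (assumed value)

def weighted_rating(games):
    """games: iterable of (weeks_ago, single_game_rating) pairs."""
    games = list(games)
    total_weight = sum(game_weight(weeks_ago) for weeks_ago, _ in games)
    if total_weight == 0:
        raise ValueError("no games recent enough to carry weight")
    weighted_sum = sum(game_weight(weeks_ago) * rating for weeks_ago, rating in games)
    return weighted_sum / total_weight

# A poor game from 16 weeks ago is ignored entirely.
print(weighted_rating([(1, 0.20), (16, -0.50)]))  # 0.2
```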
Slightly upgraded ESTIMATED WINS: The formula that estimates wins based on DVOA in specific situations has been slightly altered. Since the importance of red zone performance was already boosted in DVOA, there is no longer an additional adjustment for red zone offense and the one for red zone defense is smaller. ESTIMATED WINS is now derived from offense, defense, special teams, variance (i.e. game-to-game consistency), red zone defense, defense in the second half of close games, and two new variables: offense in the second half of close games and offense in the first quarter, demonstrating the importance of getting out to an early lead.
VOA still correlates with actual team wins: The VOA rating, the non-adjusted number, now excludes not only strength-of-opponent adjustments but also the "all fumbles equal" adjustment on offense and defense and the "weather effects" adjustment on special teams. That means that VOA is a better indicator of how a team has actually performed on the field over the course of the season, how many games the team has won and how many points it has scored and allowed. DVOA is a better indicator of wins and scoring going forward.
Once again the 2003 Rams provide a good example. The team's DVOA is now 9.1%, 13th in the league, but the VOA is 21.3%, fourth in the league. The Rams are adjusted downward because they play in a dome (making special teams look better than they really are), because they played the fourth-easiest schedule in the league according to DVOA, and because they recovered an extraordinary number of opponent fumbles. I'm really sorry to be mean to the Rams, but the good news for their fans is that the DVOA prediction system thinks Seattle will decline in 2004 even more than St. Louis will.
Note that all player ratings, including the "non-adjusted" VOA and PAR, do include the "all fumbles equal" adjustment. Also, special teams ratings don't yet include an all fumbles equal rating because I don't yet have anything to indicate when a fumble is just some guy muffing the punt, picking it right up, and running, compared to a guy who goes twenty yards upfield and then gets the ball stripped.
Better correlation to 2000-2003: This new version of DVOA and all associated statistics were tested based on correlation to points and wins -- and year-to-year correlation -- for the last four years, not just the last two years. That means that some numbers will actually be slightly less accurate for 2003. My goal was to increase the accuracy over a larger statistical sample (126 teams, not 64) in hopes that this will mean more accurate numbers in 2004. If after a few weeks it looks like 2004 is distinctly similar to 2002-2003 and distinctly NOT similar to 2000-2001, I reserve the right to fiddle with things a bit.
If you wish to look at the new numbers, and comment on them, you'll find new tables for total team efficiency, as well as offense and defense. Player stats are now updated with these new formulas as well, and they are all listed in the drop-down menu on the top of the page under JUST THE STATS. Right now, the site has 2002 and 2003 numbers, with 2004 numbers coming as the season progresses. We'll publish 2000 and 2001 with commentary in the next few weeks, and we're working on 1996-1999 as well.