by Aaron Schatz
Over the next couple of weeks, we're going to run a series of articles we're calling FO Basics. We get a lot of questions about our work, but there are also a lot of readers who never ask questions. We hope this series will answer some of those questions and clear up some confusing points, even for readers who don't post on the message boards.
- August 30: Where our stats come from, and the difference between charting stats and play-by-play stats.
- August 31: A summary of research from our first seven years.
- September 1: Our college stats, how they differ from our NFL stats and from each other.
- September 6: The importance (and limitations) of watching games on tape.
- September 7: Regression towards the mean -- what it means, and how we use it.
So let's start with a look at our stats. My goal here is not to fully explain how our formulas work -- we have a separate page you can read with a lot of those explanations -- but rather to try to clear up some common misconceptions about where our stats come from and whether or not they count as "subjective" or "objective."
I'll start the discussion by separating FO stats into four different categories: play-by-play, game charting, historical stats, and projections.
PLAY-BY-PLAY STATISTICS
Play-by-play statistics themselves come in two different categories: counting stats and formulas.
Play-by-Play Counting Stats are the simplest type of statistics: the numbers that come directly from the official NFL play-by-play. That starts with all the standard statistics, anything you can find on NFL.com: yards, carries, passes, receptions, touchdowns, sacks, interceptions, and fumbles. This category also includes the defensive statistics which are technically "unofficial" but are tracked by the league, included in the play-by-play, and listed on NFL.com: tackles, passes defensed, pass targets, and quarterback hits.
Even though these stats come straight out of the play-by-play, the numbers you find at Football Outsiders, or in Football Outsiders Almanac, aren't necessarily the same as the numbers you will find over on pro-football-reference. For example, if you look at our stats page for quarterbacks, the column for "passes" does not include clock-killing spikes, and the column for "runs" does not include kneeldowns. Nonetheless, these stats on FO are still essentially play-by-play stats, as they count plays as officially reported by the league. (The official gamebooks will usually note a kneeldown, and often mark a clock-killing spike differently from other incomplete passes.)
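As a rough illustration of that filtering step, here is a toy sketch in Python. The record layout and field names are invented for the example; they are not FO's actual database schema.

```python
# Illustrative only: toy play-by-play records with invented field names.
plays = [
    {"type": "pass", "yards": 12, "spike": False},
    {"type": "pass", "yards": 0, "spike": True},   # clock-killing spike
    {"type": "run", "yards": -1, "kneel": True},   # kneeldown
    {"type": "run", "yards": 5, "kneel": False},
]

# Count passes and runs the way the FO stats pages do:
# spikes and kneeldowns are excluded from the totals.
passes = sum(1 for p in plays if p["type"] == "pass" and not p.get("spike"))
runs = sum(1 for p in plays if p["type"] == "run" and not p.get("kneel"))

print(passes, runs)  # 1 1
```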
Many of our individual defensive stats also qualify as play-by-play counting stats. Total Defensive Plays, Stops, Stop Rate, Defeats, and Average Yardage of Tackles are all based solely on the information in the official gamebooks, not the game charting.
Finally, I would include in this category all the "average stats" that don't adjust numbers or give different weights to different stats in order to create some kind of new rating: yards per carry, yards per reception, touchdown-to-interception ratio, completion percentage, and so forth.
Play-by-Play Formulas take the official NFL play-by-play data and use math to create new ratings that try to adjust for context or measure multiple skills with a single, simpler number. The most common play-by-play formula is NFL passer rating. It isn't common around here, but it gets used everywhere else. Win Probability statistics used on various sites are essentially play-by-play formulas, since they analyze the last few years of play-by-play to determine each team's chances to win based on the game situation and time remaining. P-F-R's Simple Rating System and the stuff Jeff Sagarin does would also fall into this category.
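Passer rating is a good example of this kind of formula because, unlike DVOA, its math is public: four per-attempt components, each capped between 0 and 2.375, averaged and scaled. Here it is as a short Python function:

```python
def passer_rating(comp, att, yards, td, ints):
    """Standard NFL passer rating: four components, each clamped to [0, 2.375]."""
    clamp = lambda x: max(0.0, min(x, 2.375))
    a = clamp(((comp / att) - 0.3) * 5)        # completion percentage
    b = clamp(((yards / att) - 3) * 0.25)      # yards per attempt
    c = clamp((td / att) * 20)                 # touchdown rate
    d = clamp(2.375 - (ints / att) * 25)       # interception rate
    return (a + b + c + d) / 6 * 100

# 20-of-30, 250 yards, 2 TD, 1 INT:
print(round(passer_rating(20, 30, 250, 2, 1), 1))  # 100.7
```

The clamping is why the rating tops out at the famous "perfect" 158.3: every component maxed at 2.375 gives (4 × 2.375) / 6 × 100.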
Most of the stats you find at Football Outsiders, especially during the NFL season, fall into this category. That includes DVOA, DYAR, Adjusted Line Yards, and all our special teams stats. This is a very important point about DVOA: It is based entirely on the official play-by-play. It does not incorporate game charting except in a couple of very minor ways. Unfortunately, this limits us at times. For example, DVOA and DYAR for receivers are not adjusted for dropped passes, since drops are not marked in the official play-by-play. However, I've seen various comments on the Internet mentioning that DVOA is "subjective" because it is based on the information we get through game charting, and it is important for people to understand that the play-by-play stats are very different from the charting stats.
DVOA is an objective metric, unless you use an extremely strict definition of "objective." It looks at every single play and the yardage gained, then compares it to a baseline based on down, distance, and other elements of the situation. Then we add the opponent adjustments, which are based on all the plays the opponent has had during that season. Nearly everything involved in DVOA is strictly based on the official NFL play-by-play logs.* Plays are scored in "success points," but we've spent a lot of time working on the formula for success points to make it correlate as closely as possible with wins. Those success points are compared to baselines that aren't chosen out of thin air; they represent the average performance of every NFL team over the last few years in the situation being measured. Likewise, the cutoffs for measuring Adjusted Line Yards are based on regression analysis of multiple years' worth of NFL data.
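To make the shape of that calculation concrete, here is a deliberately tiny sketch. The success-point values and baselines below are invented for illustration; FO's actual tables are far more detailed and are not published.

```python
# Toy DVOA-shaped calculation with invented numbers -- the real
# success-point values and baselines are FO's own.
# Baseline: league-average success points by down (toy values).
BASELINE = {1: 0.8, 2: 0.7, 3: 0.5}

def success_points(yards, distance):
    """Invented scoring: full credit for a conversion, partial credit otherwise."""
    if yards >= distance:
        return 1.0
    return max(0.0, yards / distance)

def dvoa_like(plays):
    """Value over average, expressed as a percentage of the baseline."""
    earned = sum(success_points(p["yards"], p["distance"]) for p in plays)
    expected = sum(BASELINE[p["down"]] for p in plays)
    return (earned - expected) / expected * 100

plays = [
    {"down": 1, "distance": 10, "yards": 12},
    {"down": 2, "distance": 10, "yards": 3},
    {"down": 3, "distance": 7, "yards": 2},
]
print(round(dvoa_like(plays), 1))  # -20.7
```

Note that every input here is objective play-by-play data; the "subjective" part, as the article says, is only in how the scoring tables were designed in the first place.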
We work very hard to make sure that our stats do not specifically favor any team or player. All the calculations that we've done to create the DVOA formula are based on trying to improve the accuracy for every team in the league over a multiple-year period. This is one reason why we tend to dismiss suggestions or complaints that revolve around the rating for one specific team. The goal is to create a rating that does the best job of measuring all 32 teams, and it is unlikely that some flaw in the system is affecting one team but none of the other 31. However, we tend to seriously consider any patterns showing that multiple teams in multiple years share a certain quality and are all overrated or underrated. We're always looking to be more accurate, because, well, nobody likes to be wrong. Of course, a few subjective decisions had to be made during the development of DVOA and other metrics. We had to decide exactly where to draw the baselines for an "average play," and what adjustments to make or not make. These decisions were made as carefully as possible.
We are often asked if DVOA is supposed to measure how well teams have played in the past or how well they are likely to play in the future. The answer is "sort of both." The idea is to try to measure some kind of platonic ideal of how good a team is right now.
When I try to improve DVOA, I'm fiddling with what splits I use to create baselines, or what the added adjustment is for touchdowns, or various other things. In comparing each version of DVOA to the others, I'm trying to maximize (and balance) two things:
- How well we are measuring the long-term quality of the team. That means looking at the correlation of each team's DVOA from year to year. Sometimes I also compare the first half of the season to the second half, or odd weeks to even weeks, to try to get the most consistent rating that filters out the effects of circumstance and random chance.
- How well we are measuring winning. This means trying to get the best correlation between wins and the non-adjusted rating (i.e. no changes based on opponent strength, and fumbles are only negative events if they end in a turnover rather than recovery by the offense).
The resulting metric doesn't perfectly measure how well a team has played in the past, or how well it will play in the future, but it does a good job of balancing the two. For example, we know that red zone performance tends to revert to the mean over time, so if we wanted a rating specifically meant to predict future results, we would make red zone plays less important. In reality, however, DVOA and DYAR treat red zone plays as more important than plays on the rest of the field, because better red zone performance means the team is going to win more games. We give a bonus on plays that score touchdowns... but at a certain point, we stopped raising that bonus because the correlation to winning was no longer improving and the year-to-year correlation started to get weaker.
One thing I've said in interviews, and I'll say it again here, is that there is a pretty good chance that DVOA is not the absolutely, positively most accurate power rating out there. P-F-R's Simple Rating System gets similar results and -- not a surprise, given the name -- is a lot simpler. However, obsessing over the small differences in accuracy between DVOA and other power ratings misses the point. The beauty of DVOA is that it is derived from play-by-play, and therefore can be broken up into any grouping of plays you want: by down, by player, by location on the field, and so forth. That kind of matchup analysis is very important to what we do at Football Outsiders.
Finally, I should point out that DVOA is not the only metric we use around here. Economists who get into sports analysis often try to drill player value down to a single uber-stat, because only a single stat that you can use for all players allows you to compare player value to dollars. However, the philosophy of Football Outsiders is the exact opposite. Our goal is more stats, not fewer. Each stat we use tells part of the story about why a team is playing well or playing poorly. DVOA gives you an overall picture, but to see the details you also have to use Adjusted Line Yards, Adjusted Sack Rate, the defensive stats like Stops and Defeats, and the game charting stats.
The special teams statistics are a good example of a place where some people might get confused by terminology. We turn the total value of special teams into a DVOA rating so that we can combine it with offense and defense, but the individual ratings for each aspect of special teams are not based on "success points" like DVOA (because they don't have to measure both progress towards a first down and progress towards the goal line) and they are based on totals, not percentages.
*We do incorporate game charting into DVOA in a few small ways: to determine whether an aborted snap was a bad handoff or a blown pass play, to mark squib kicks, and occasionally to change a backwards pass, which is officially a running play, so that it counts as a pass play in our stats. There's one other somewhat subjective element in the FO play-by-play formulas: we mark end-of-half interceptions (and a few almost-end-of-half interceptions) as "Hail Mary" and do not count them as turnovers. People might have different opinions about, for example, whether a 40-yard interception thrown on third down from midfield with 20 seconds left should count as a "Hail Mary" or not.
GAME CHARTING STATISTICS
Game charting statistics are the numbers we get from an armada of volunteers watching games on tape and then marking down things that aren't tracked by standard play-by-play. We're not the only people doing this, of course, but as of right now all the various game charting projects out on the Interwebs are separate from each other, which means the stats you see from FO may be different from the stats as tracked by K.C. Joyner, or Stats Inc., or others. We're all working off the same limited television camera angles, so we're all making mistakes. For the time being, there's no alternative.
Are the game charting stats objective or subjective? Well, sort of both. Everything we're asking charters to mark has a specific definition. A screen pass is a screen pass. A scramble is a scramble. A dropped pass, theoretically, should be easy to identify. The problem is that a lot of the events we're marking in the game charting end up somewhere in between one designation and another. If a wide receiver has to jump for a pass, and gets his hands on it but loses control, is that pass overthrown or dropped? When a linebacker comes late on a blitz, does he count as a pass rusher on a delayed blitz? Or did he just notice that the running back he was covering was blocking, in which case the play called for him to rush the passer? Unfortunately, we have no choice but to ask our volunteers to make decisions like this. Some charting stats involve more subjectivity than others -- for example, identifying a draw or a screen is a lot more cut and dried than deciding on a quarterback hurry.
What is important, however, is that the game charting project is an attempt to measure events. It is no different from the official scorers assigning tackles or intended receivers on each play, two items in the official play-by-play which can be tough to discern. No players are graded, and we try not to ask the game charters to assign blame on plays unless there is a specific negative event: a broken tackle, a blown block, or a dropped pass. If the offensive line blows the blocking call and leaves a pass rusher unblocked, we don't try to figure out which lineman had the assignment; we just mark "Rusher Untouched."
For the most part, game charting stats are counting stats. A defensive player will have a certain number of hurries, a certain number of dropped interceptions, a certain number of broken tackles, and so forth. There are a few game charting formulas as well. We take game charting stats and adjust them for context, much like we try to adjust the standard play-by-play stats for context. An example would be "Adjusted Yards per Pass," where we adjust the yards allowed by defensive players in coverage based on the quality of the receiver involved.
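To show what a charting formula like that looks like in miniature, here is a toy version of adjusting coverage yards by receiver quality. The adjustment factors and categories are invented for the example; FO's actual adjustment is more involved.

```python
# Toy charting adjustment: scale yards allowed in coverage by the
# quality of the receiver faced. Factors are invented for illustration.
RECEIVER_FACTOR = {"elite": 0.85, "average": 1.0, "backup": 1.15}

targets = [  # (yards allowed, class of receiver covered)
    (15, "elite"),
    (4, "average"),
    (9, "backup"),
]

# Yards against better receivers count for less; yards against
# backups count for more.
adj_yards = sum(yds * RECEIVER_FACTOR[cls] for yds, cls in targets)
adj_per_pass = adj_yards / len(targets)
print(round(adj_per_pass, 2))  # 9.03
```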
HISTORICAL STATISTICS
The category of "historical statistics" has a lot of crossover with the category of "play-by-play statistics," of course. However, when we think of play-by-play stats around here, we tend to think of stats from the years for which we have play-by-play: 1993-2010. Historical stats, of course, would include yardage and touchdown totals going all the way back to the start of the NFL. It also includes stuff like draft information, game scores, and "official" (i.e. team-reported) heights and weights.
There are also formulas developed from historical statistics, of course. That would include Adjusted Games Lost, Draft Value (as based on the infamous "draft value trade chart"), and P-F-R's Approximate Value.
PROJECTIONS
Projections are the formulas that try to predict how well a team or player will do in the future. In general, these projections are based on the idea that the best team in the league doesn't always finish with the best record, and the 20th best team in the league doesn't necessarily finish with the 20th best record. There are a lot of random elements in a football season, and there are a lot of intangibles that we can't project. So the projections represent a range of possible likely results for each team or player. The numbers you see in the KUBIAK fantasy football projections spreadsheet are the average of those possibilities, not a definite prediction of the exact numbers we expect from a player. That's why we say that a player's specific rank in our projections isn't as important as the overall sense of whether KUBIAK thinks he will be better or worse than the past, and whether KUBIAK thinks he is overrated or underrated by conventional wisdom.
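The "average of a range of possibilities" idea can be sketched as a simple simulation. Everything here is invented (the player, the distribution, the numbers); it just shows why a projection is a mean of many plausible seasons rather than a single prediction:

```python
import random

# Illustrative only: a projection as the mean of many simulated seasons,
# not a single point prediction. Player and numbers are invented.
random.seed(1)

def simulate_season(mean_yards=1100, sd=250):
    """One possible season outcome for a hypothetical running back."""
    return max(0, random.gauss(mean_yards, sd))

sims = [simulate_season() for _ in range(10_000)]
projection = sum(sims) / len(sims)

print(round(projection))  # close to 1100, but any one season can vary widely
```

The spread of `sims` around the mean is the honest part of a projection: two players with the same projected total can have very different ranges of likely outcomes.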
The projections we produce before the season are not "DVOA." They are "DVOA projections." People do sometimes get confused between the two. I've seen comments to the effect of "FO's preseason projections are based on an analysis of every play." That's partly true, since various cuts and splits of DVOA go into the projection system, but the projections also consider a lot of other variables such as age and experience at various positions, team pace, recent drafts, free agent additions, and so on.
It's also important to note that there are two main projection systems around here for the NFL: the team projections, and KUBIAK. KUBIAK is the term we use to refer to the fantasy football projections, and the team projections are among the variables used in KUBIAK.
As I often say, intangibles are called intangibles because they are intangible. We don't do stats that measure leadership or team chemistry. That doesn't mean these things don't exist, just that we can't measure them. Leadership and chemistry can develop over time and will affect other teammates, and it is hard for us to guess how. "Heart," on the other hand, is just another element of a player's performance, no different from strength, speed, or ability to learn the playbook. Contrary to popular belief, there is a stat that measures heart. There are a lot of them, in fact. They are called "stats." Most of the guys who have "heart" also have pretty good stats. Fred Biletnikoff used to smoke a pack of cigarettes, throw up for 20 minutes, and then go out and shred every defense he faced. He had great numbers. Anquan Boldin took the field three weeks after breaking his face. His numbers are pretty good too. But if badly rated Player X has so much heart, why didn't he use it to get a few more first downs last year?