Writers of Pro Football Prospectus 2008

17 Jan 2011

The Truth Wears Out

No, we aren't suddenly linking to articles about Paul Pierce. I saw this in an old New Yorker today and I found it fascinating. The article discusses how scientists are discovering that statistically valid studies that are well-accepted sometimes can't be replicated with the same results. I think the suggestion is that even the generally accepted standards for statistical validity may be too low, and there's way more randomness involved than any of us ever realized.

This probably has a lot of application to what we do around here. There's no question that some of the things we think we've discovered, we may eventually discover that we didn't really discover. A good example is the "third down rebound," which seems to have disappeared for offenses over the last couple seasons. In a lot of cases, we're already starting with particularly small sample sizes -- and as this article points out, even good-sized samples may be too small.

You also have to remember that when we're studying football, you can't keep all the variables constant. I think it's pretty clear at this point that the Lewin Career Forecast isn't quite as valuable as it was a couple of years ago, because the rate of screens and high-percentage throws in college football has risen to a ridiculously high level. That doesn't mean it doesn't have value; it just means that it isn't the be-all and end-all. And it doesn't mean that it wasn't more valid when David Lewin came up with it five years ago; it is possible that circumstances have simply changed.

So I'll certainly have this in mind when I'm writing about things for the next few months. The problem is that I can't say I have this in mind in every single article; it would get really tiring for regular readers. I've always said that I take pride in the fact that Football Outsiders leads the league in couching our opinions, but you can't stick two paragraphs of opinion-couching "what ifs" into every single article. That's a particular problem with picking games. I think 90 to 95 percent of readers would get angry if we characterized every single playoff game as a toss-up, but effectively, even if we think one team has a 65 percent chance of winning, that's really sort of a toss-up. The team we don't pick will win one time out of three, which isn't particularly rare. And frankly, most predictions are even closer than that. Upsets happen. Randomness happens. We may not write that in every single article, but trust us, we know it to be true.

Posted by: Aaron Schatz on 17 Jan 2011

80 comments, Last at 30 Jan 2011, 6:49am by erniecohen


by pr9000 :: Mon, 01/17/2011 - 11:46pm

FWIW, that New Yorker article was the best thing I read all year in a magazine.

by Noah Arkadia :: Mon, 01/17/2011 - 11:46pm

No worries here. I've always taken it that way.

by loveshack304 (not verified) :: Mon, 01/17/2011 - 11:49pm

Good points here; the trouble with sample sizes is especially problematic in football where one single game needs to be predicted (as opposed to the other major sports where the playoffs are played in series, allowing for at least a little more predictability).

by Rob C. (not verified) :: Mon, 01/17/2011 - 11:53pm

Guys, I love the site, but the annoyance about Jets' fans on Simmons' podcast and defensiveness about your playoff picks is coming across as desperate this week. We get it, you're wrong sometimes, but just let it be.

by SammyGlick :: Tue, 01/18/2011 - 10:26am

Rob, did you read the article? This link could not possibly be interpreted as a defense of incorrect picks.

by Sander :: Tue, 01/18/2011 - 12:03am

The article is good; it reminds me of a lot of arguments made by Nassim Taleb in his books.

I wonder to what extent Football Outsiders and other statistical NFL (or sports) websites have fallen prey to this phenomenon. It's probably very unexciting to publish articles with "we didn't find that so-and-so had a big effect on team wins", so selective reporting is certainly a problem. With a statistical significance threshold of 95%, one out of 20 of all researched subjects should be a 'false positive' of sorts. And that's not including the research where statistical significance is hard to establish because there's simply too small a sample size.

by Verified (not verified) :: Tue, 01/18/2011 - 11:16am

"With a statistical significance threshold of 95%, one out of 20 of all researched subjects should be a 'false positive' of sorts."

This is only true if the null hypothesis is true in every case. With alpha = 0.05 (i.e., a significance threshold of 95%), the probability of a false alarm (FA), or the probability of getting a statistically significant result (ssr) given that the null hypothesis is true (H0), i.e., Pr(FA) = Pr(ssr|H0), is 0.05. The probability of a statistically significant result given that the null is not true, or Pr(ssr|~H0), may be any number between 0 and 1, depending on how the alternative to the null is expressed. The general probability of observing a statistically significant result is Pr(ssr) = Pr(ssr|H0)Pr(H0) + Pr(ssr|~H0)Pr(~H0), which is not usually going to be equal to alpha.
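The decomposition above is easy to check numerically. Here's a minimal Python sketch; the power and base-rate numbers are invented for illustration, not drawn from any real field:

```python
# Numeric illustration of Pr(ssr) = Pr(ssr|H0)Pr(H0) + Pr(ssr|~H0)Pr(~H0),
# with made-up inputs: alpha, assumed power, and an assumed base rate of
# true nulls among studied hypotheses.
alpha = 0.05        # Pr(significant | H0 true)
power = 0.80        # Pr(significant | H0 false) -- an assumption, not a law
p_h0 = 0.50         # assumed share of studied hypotheses where H0 is true

# Overall chance any given study reports a significant result:
p_sig = alpha * p_h0 + power * (1 - p_h0)

# Of the significant results, the share that are false alarms:
false_discovery = (alpha * p_h0) / p_sig

print(f"Pr(significant result) = {p_sig:.3f}")                  # 0.425
print(f"Share of significant results that are false = {false_discovery:.3f}")
```

Note that the overall rate (0.425 here) is nowhere near alpha, which is the commenter's point: the 1-in-20 figure only describes the world where every null is true.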

by Bowl Game Anomaly :: Tue, 01/18/2011 - 12:08pm

It would be more accurate to say that one out of every 20 studies that show a significant effect is a false positive. Since the vast majority of published studies are ones which reject the null hypothesis (97% according to the article), what Sander said was close enough.

by Jim Glass (not verified) :: Tue, 01/18/2011 - 12:06am

The article discusses how scientists are discovering that statistically valid studies that are well-accepted sometimes can't be replicated with the same results...

There are even studies saying that most published scientific studies (at least in some fields) are just plain wrong. The logic works like this:

First, if a 95% confidence interval is used, then 1 of 20 study results that pass the test will be wrong. Second, journals only publish studies that have interesting results that people will want to read -- not boring ones that confirm what everybody already knows. Third, what's more interesting, more counter-intuitive, than things that aren't true but are "scientifically" claimed to be true? Put it all together: if 1,000 studies are conducted and pass the 95% confidence test, then 50 of them will be wrong -- and *they* will be the ones that get published, because their findings are "new," "interesting," "breakthrough," etc.

E.g.: if 20 studies are run on the shape of the Earth, with a 95% confidence test 19 will find the Earth is round, one will find the Earth is banana-shaped -- and that last one will be published in the esteemed Journal of New Findings in Earth Science.
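The arithmetic behind that argument can be sketched with a toy simulation: run many studies of effects that don't exist at all, and see how many clear the p < .05 bar anyway. (This deliberately assumes every null is true, which is the naive reading that later commenters push back on.)

```python
# Toy publication-bias simulation: every studied effect is nonexistent,
# so every "significant" result is a false positive -- and those are the
# ones that would get written up. Purely illustrative.
import random

random.seed(42)
ALPHA = 0.05
N_STUDIES = 1000

published = 0
for _ in range(N_STUDIES):
    p_value = random.random()   # under a true null, the p-value is uniform on [0, 1]
    if p_value < ALPHA:
        published += 1          # only the "interesting" result gets submitted

print(f"{published} of {N_STUDIES} null studies produced a publishable finding")
# Expect roughly 50 -- and every one of them is wrong.
```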

Of course, when these new studies say that most studies are wrong -- especially when their findings are new, surprising and counter-intuitive -- and you look at them in reference to themselves....

by akn (not verified) :: Tue, 01/18/2011 - 4:25am

As a researcher myself, I have to strongly disagree with the assertions you're making here about the state of scientific studies. You're trying to draw a lot of conclusions from one idea: that all studies use a 95% statistical threshold, and that at any time 5% of these studies are just wrong. This is incorrect.

First of all, the vast majority of studies, while using p < .05 as a threshold, report significantly lower p values (e.g., p < .0001). The stronger statistical certainty is not emphasized because most people accept 1/20 as a reasonable cutoff, and comparing that to 1/10000 isn't really a useful discussion.

Secondly, you are making a mistake of statistical interpretation known as the multiple comparisons problem. When you have several results, all with p < .05, you cannot simply assume that 5% of them are wrong or different. Remember, each of those studies concluded that its result could only be produced by chance 5% of the time. But none of them said anything about the rightness or wrongness of their result--that's just an assumption you are adding.

Instead, if you want to analyze several studies in an attempt to produce a consensus, you do what is known as a meta-analysis. This approach uses a whole different set of statistics (things like Bonferroni correction) to pool the results from several studies. The generalizations that arise from these pooled studies are much stronger statistically than individual results.
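The Bonferroni idea mentioned above is simple enough to sketch: when you run m tests at once, you compare each p-value to alpha/m instead of alpha, which keeps the chance of *any* false positive across the family at or below alpha. The p-values below are invented for illustration:

```python
# Minimal sketch of the Bonferroni correction for multiple comparisons.
# With m tests, each p-value must beat alpha/m, not alpha.
def bonferroni_significant(p_values, alpha=0.05):
    """Return which results survive the corrected threshold alpha/m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Ten hypothetical study results. Uncorrected, five would pass p < .05;
# with m = 10 the threshold tightens to 0.005, and only two survive.
p_vals = [0.001, 0.004, 0.012, 0.030, 0.049, 0.20, 0.35, 0.51, 0.74, 0.98]
print(bonferroni_significant(p_vals))
```

(Bonferroni is strictly a multiple-comparisons correction; pooling effect sizes across studies, as a full meta-analysis does, involves more machinery than this.)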

A properly performed meta-analysis forms the basis for scientific consensus. Can the consensus be wrong? Sure, if there is a fundamental and systematic error that affected all the studies involved. Can the consensus be incorrect by chance? Extremely unlikely (and much less than the traditional 5% assumption).

I can't comment on the idea that interesting or counterintuitive results get disproportionately emphasized in scientific journals. As someone who has published several articles in peer-reviewed journals, I can tell you that in order to publish a result that wasn't expected, we generally have to go a few more rounds with the reviewers to get approval. I see that as a strength, not a limitation, of the peer-review process.

Additionally, I've never had a problem publishing a study that was "boring." I may not be able to get it into a top-tier journal like Nature or Science, but that's only because Nature and Science are space-limited journals. They're designed to only highlight small glimpses into much larger fields. There are dozens of more specific journals that would gladly accept a "boring" but scientifically important study, and I spend most of my literature review efforts in these more specific journals anyway. If I'm going to borrow or reproduce techniques from other researchers, I don't look at their Nature or Science articles, I look at the dozens of methodology papers that are cited by those articles. That's where the real science lies.

by Jim Glass (not verified) :: Tue, 01/18/2011 - 3:44pm

Dude, don't be so defensive. I'm not saying science is witchcraft. I'm published myself.

But you might be more open-minded, and remember that a key to scientific advance is skepticism, and there is always more grounds for it than we are usually prone to believe.

If you are looking for a more credible authority than a stranger in a football forum to argue the point with, try for instance New Scientist: "Most scientific papers are probably wrong".

by ChaosOnion :: Tue, 01/18/2011 - 9:30am

Second, journals only publish studies that have interesting results that people will want to read -- not boring ones that confirm what everybody already knows.

You know what are interesting results that people want to read? Experiments that attempt to replicate a published result but cannot. Repeating an experiment and validating a stated hypothesis is part of building upon existing work. If a result is not repeatable, no one can build upon it, and a number of people trying to stand on its shoulders will report findings invalidating the original claim.

by scottybsun (not verified) :: Tue, 01/18/2011 - 2:07pm

Well, you can actually measure the earth's shape with virtually 100% accuracy.

The 95% threshold is more for social science research- hard sciences can be much more precise.

by Jim Glass (not verified) :: Tue, 01/18/2011 - 3:52pm

I would hope that a post describing the Journal of New Findings in Earth Sciences as publishing an article saying the Earth is banana-shaped would be taken as being a bit tongue-in-cheek to illustrate general principles to a readership of football fans rather than scientists and statisticians.

For a more straight-science take on the issue check the paper available through the link to New Scientist I just put above.

by Joshua Northey (not verified) :: Tue, 01/18/2011 - 12:18am

"I think 90 to 95 percent of readers would get angry if we characterized every single playoff game as a toss-up, but effectively, even if we think one team has a 65 percent chance of winning, that's really sort of a toss-up."

A) I think you are flat out round about the first assertion.

B) What is wrong with characterizing it as a 65/35 game? You guys shoot yourselves in the foot by getting away from the pure numbers constantly.

As just one example, you would avoid a ton of stupid "my team won last week, why hasn't their ranking gone up?" comments by simply listing last week's RATING (the DVOA delta) instead of last week's ranking.

by Joshua Northey (not verified) :: Tue, 01/18/2011 - 12:21am

Wow, some awesome synesthesia or whatever there.

I meant to type flat out wrong, but having flat in the sentence made my brain type round. Brains are so funny :)

by OrangeMath :: Tue, 01/18/2011 - 12:36am

Would you kindly post a link to the article or at least the issue date?
Ooops, the title is the link. Funny, but it works. Thank you.

by Joseph :: Tue, 01/18/2011 - 1:05am

I wonder if comparing this article to your work on the Curse of 370 would placate Brian Burke of Advanced NFL Stats a little bit. One of his top articles is "The Myth of the Curse of 370"--but, in reality, FO has recently said that it was a catchy name as much as an exact number. However, the general hypothesis IS true--overusing your RB in one year is not helpful for his career or his production in the following year.
POSSIBLE BOOK ARTICLE SUGGESTION--maybe, now that you have what, 18 years of data, you could find a rough cutoff number below which RB production remains reasonably constant and effective. For example, if a RB has fewer than 320 carries, his DYAR/DVOA/YPC tends to be stable, regardless of W-L record, O-line quality, etc. I am fairly confident that you would need a rate stat like DVOA/YPC/Success Rate, because you have already shown that total runs, and by extension individual RB carries, will be somewhat reliant on game situation and W-L record.

by AnonymousA (not verified) :: Tue, 01/18/2011 - 3:03pm

"However, the general hypothesis IS true--overusing your RB in one year is not helpful for his career nor his production in the following year."

Bzzzt. Wrong! Go read Burke's article again. This is an example of bias, and it's not simply "the curve goes up, we chose 370 as a good point of dichotomy"; it's "the curve is wacky, and we arbitrarily chose the highest point on it and went from there."

Just to be 100% clear: the curse of 370 is a myth, fudging the numbers ("it was a catchy name as much as exact number!") doesn't help, and FO's initial analysis was wrong in both method and result.

by Shattenjager :: Tue, 01/18/2011 - 1:05am

I must rebut at least partially, but Steve Novella can far better than I can: http://theness.com/neurologicablog/?p=2580

by Kirt O (not verified) :: Tue, 01/18/2011 - 1:29am
by Lebo :: Tue, 01/18/2011 - 9:21am

I preferred Myer's rebuttal to Novella's, as Myer was more willing to acknowledge the validity of the points that Lehrer raises (even though Lehrer doesn't present them all that well).

I think that you have to expect some level of sensationalism with most forms of journalism. However, if Lehrer replaced 'all scientific studies' with 'most cutting-edge studies that are publicised' then I believe he'd have a valid point. To me, the most disturbing issue that Lehrer raises is that medical treatments have not adjusted seemingly long after the initial findings on which they are based have been disproved.

(Note: I don't know much about the modern state of medicine or scientific research beyond the three linked articles that I just read.)

by Stats are for losers (not verified) :: Tue, 01/18/2011 - 10:22am


Physics or stamp collecting, as they say.

by Mr Shush :: Tue, 01/18/2011 - 2:07pm

Of course, maths is just a minor branch of philosophy that happens to be reasonably useful in a day-to-day sort of way . . .

by spenczar :: Tue, 01/18/2011 - 6:10pm

And also is actually correct.

by Mr Shush :: Tue, 01/18/2011 - 9:12pm

Correct? It's a logical language. It's not correct, because it's not the right kind of thing to be a candidate for correctness. It's certainly no more "correct" than, say, first order quantifier logic.

8+8=16 is correct (though we might prefer to say valid).

∀x{F(x)→G(x)}∧F(a):G(a) is valid.

There's nothing to choose between them.

by tuluse :: Wed, 01/19/2011 - 9:03am

Yes but the assertions made in the math match up with all observations.

Of course everything is true until it's not.

by Mr Shush :: Wed, 01/19/2011 - 11:06am

So do the ones in the logic. Well, I mean, not necessarily once you start talking about things like modal logics and arguing about whether the S5 step is legitimate or not, but that's because there's no way of checking it against reality, which I'm pretty sure is true of some things in maths as well.

Adding eight apples to eight apples does indeed result in sixteen apples.

All NFL players being men and Sam Bradford being an NFL player does indeed entail that Sam Bradford is a man.

Of course, number theory/philosophy of logic is a step up from maths/logic . . .

by spenczar :: Wed, 01/19/2011 - 1:15pm

Okay, yes, mathematical logic is just as correct as the rest of math.

How correct is anything that's ever been written in epistemology or ethics?

When you look at the blend of algebraic/logical philosophy, the further you get from algebra/logic, the closer you get to philosophy, and the further you get from validity.

by Mr Shush :: Thu, 01/20/2011 - 9:00am

"Okay, yes, mathematical logic is just as correct as the rest of math."

Bass ackwards. Maths is a subset of logic, not the other way around.

"How correct is anything that's ever been written in epistemology or ethics?"

Yes, ethics turns out to be a crock, and should now only be taught in History of Philosophy classes or some such. Epistemology's a slightly more complex case, because while knowledge does seem to almost certainly not actually be anything resembling a useful basic concept, some of the findings that people have come up with while trying and failing to work out what knowledge might be are definitely useful and can be called correct with as much confidence as anything else, logics included. Excluding ethics, aesthetics and the French, the perception of modern philosophy as a wishy-washy subject is pretty off-base.

by Spielman :: Wed, 01/19/2011 - 8:31am

Never argue with a philosopher in a "my field's phallus is bigger than your field's phallus" debate. It's kind of a specialty for them, so it won't end well.

by Mr Shush :: Wed, 01/19/2011 - 11:11am

Hey, it may be a gargantuan self-congratulatory circle jerk, but at least we're good at it. Not like those no good hacks in literary theory, who can't even spew coherent useless nonsense . . .

by Spielman :: Wed, 01/19/2011 - 11:34am

Well played.

by William Lloyd Garrision III (not verified) :: Wed, 01/19/2011 - 1:21pm

Amen to that.

by theslothook :: Tue, 01/18/2011 - 1:20am

DVOA captures a macro view of a team's output, but that output is determined by so many factors that the current statistics simply can't explain. We can't separate the individual parts of a team and parse the data in order to analyze them on their own merits, and so that leaves stat models wide open to errors. I applaud FO a great deal; it's a giant leap in the right direction. But like all technology, we value its contributions while understanding its limitations, and that's why we strive to improve upon it.

by Kal :: Tue, 01/18/2011 - 2:49am

I think this is a reasonable thing to be concerned about - which is why it's important to take those easily distilled 'facts' about football and continue to re-analyze them with new data.

For instance, is it true that STOMPS are better than GUTS in determining better teams?

Is it true that home field advantage only has a certain value? Is that value variable depending on the field, or depending on whether it's the playoffs?

Is the Curse of 370 applicable any more?
Is the 3rd down regression to the mean applicable?
Is defensive success repeatable?

Another issue is that too often a cause is inserted when it's really not known or there's just not enough supporting data. This is a natural human trait; we are wired to want to understand what caused something. But often just stating the effect is profound enough.

by Jim Glass (not verified) :: Tue, 01/18/2011 - 7:26pm

For instance, is it true that STOMPS are better than GUTS in determining better teams?

Absolutely and unquestionably.

by Kal :: Wed, 01/19/2011 - 5:42pm

You can always question, especially with such a small data set.

Also, why did you pick 11 wins? Plenty of playoff teams - even teams with fairly good success, like this year's Packers and Arizona two years ago - had fewer than 11 wins. Seems odd to exclude when you could have just put that in and seen the results.

Similarly, why was the breakpoint 10? Does it change significantly for 9, 11, 14? Again, with such a small sample size it seems very easy to massage data to suit your needs.

by speedegg :: Tue, 01/18/2011 - 3:49am

I still think the 3rd down rebound/regression towards the mean (and other truths) is valid. With some manufacturing processes and statistical process controls you occasionally run into the problem of a "moving mean". There were times where the expected mean kept changing. The best answer people came up with was the variance in raw materials used in the process and the number of test runs (N) used to validate the process was too small.

So with the 3rd down effect either coaches read about it and started changing their practice plans to counteract it (adjustment to process), passing is more prolific in the current NFL (systemic change), and/or the mean is shifting (upward?).

by erniecohen :: Tue, 01/18/2011 - 11:34am

I don't think either of these explanations is likely. How would you change your practice plan to counteract 3rd down rebound?

In any game, a subject that suffers bad luck in one trial is likely to improve in the next; this is just a statistical fact. So the third-down rebound question is essentially whether third-down conversion rate is strongly correlated with success rates on other downs; if so, any major deviation is good/bad luck, and would lead to a rebound. If it is skill, you expect a much weaker effect, related purely to the replacement of players and adjustments in opponent strategies.
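The "statistical fact" here is regression to the mean, and it can be sketched with a toy simulation: give every offense the same true conversion skill, simulate two seasons, and watch the teams that ran hot in year one fall back in year two. The skill rate and attempt count below are invented round numbers, not real league figures:

```python
# Toy regression-to-the-mean demo for the third-down rebound. Every team
# has identical true skill, so any year-one extremes are pure luck and
# should vanish in year two. Numbers are illustrative assumptions.
import random

random.seed(1)
TRUE_RATE = 0.38     # assumed league-wide third-down conversion skill
ATTEMPTS = 200       # roughly a season's worth of third downs per team

def season_rate():
    """One simulated season: fraction of third downs converted."""
    return sum(random.random() < TRUE_RATE for _ in range(ATTEMPTS)) / ATTEMPTS

teams = [(season_rate(), season_rate()) for _ in range(32)]

# Teams that looked well above average in year one...
hot = [(y1, y2) for y1, y2 in teams if y1 > TRUE_RATE + 0.02]
avg_y1 = sum(y1 for y1, _ in hot) / len(hot)
avg_y2 = sum(y2 for _, y2 in hot) / len(hot)

print(f"Hot teams, year 1: {avg_y1:.3f}  year 2: {avg_y2:.3f}")
# Year two lands back near the true rate of 0.38 -- the "rebound."
```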

by William Lloyd Garrision III (not verified) :: Wed, 01/19/2011 - 1:29pm

Or look at 3rd down rebound rates with a finer-toothed comb: i.e., last year they converted 40% of third downs, this year it's 50%. But note, last year their average 3rd down distance to go was 5 yards, and this year it's 4 yards. Etc., etc.

I am no Donovan McNabb zealot, but it would drive me crazy this year when the announcers kept on saying that his 3rd down conversion rate was subpar. Yours would be too if you were always in 3rd and 8.

by NJBammer :: Tue, 01/18/2011 - 9:18am

I think it's time to totally re-evaluate the foundation upon which DVOA rests. Though you have done good work, maybe you should take a refreshing step back and rethink your approach, like Bill James did when he developed "Win Shares." With all you have learned over the past 10 years, I assume that if you started from scratch you would have a better base model to rest the data upon.

by DGL :: Tue, 01/18/2011 - 10:33am

Plus, I know if it were me I'd find it a hell of a lot of fun to take a month off and just rebuild it from scratch.

by Jerry :: Wed, 01/19/2011 - 4:00am

This is a very different world than the one where James came to prominence. He would put out a book once a year, then have twelve months to generate and play with new stuff (less whatever time he spent on free-lance writing assignments). In the internet age, there's constant demand for material. Even if Aaron wants to take a month off and rebuild DVOA from scratch, that's a month where he's not generating any content for this site, and I imagine that preparing the book causes enough of those periods.

by NJBammer :: Wed, 01/19/2011 - 10:52am

FWIW, James developed Win Shares in 2002, well into the internet age. I think it's more a matter of mental attitude than of the requirement to generate content. This kind of problem has been with us for hundreds of years: people continue to build on a flawed system just because they have so much invested in it, only to suddenly start over almost from scratch and develop something better. (See the subject of the movie "Longitude" for a great example of this.)

by Jerry :: Thu, 01/20/2011 - 4:15am

My point is that James never had to generate content for baseballabstract.com, for better and for worse. IIRC, his work on Win Shares had a significant effect on the schedule for the Historical Abstract rewrite, but there was no other pressure to get stuff out there.

To the extent that people make a living from FO, they need to keep bringing people to the site, and new content is the best way to drive that. If that doesn't leave enough time to reimagine DVOA, it's understandable. And, of course, if someone can come up with a better system, we can always use another good site.

by William Lloyd Garrision III (not verified) :: Thu, 01/20/2011 - 2:03pm

Fair points, but I'm not sure I buy the notion that people sit around generating content because of a "mental attitude" they have. That may be the impetus for the comments from the peanut gallery, but not for the article/book writers. Very little high-quality content is generated without a business motivation or competitive pressure.

I don't know all that much about James's life, but I do know that he worked arbitration cases back in the day, and I would imagine he made a fair amount of money doing so. I remember reading his 2003 abstract and the section he had on George Bell, and how he wrote that he worked the case for Bell's side, arguing that despite his high error total, none of the errors had actually come in situations or games that mattered--so he should not be penalized in the pocketbook for those mistakes. Then he worked another case a year later for Bell, and did the same play-by-play analysis to prove the same point.

Perhaps George Bell was James's only client. I doubt it, though. I think there's always been money in the evaluation and selling of baseball skills--and I have always looked at James's writings as his "advertisement" to agents and players alike about how he can spin your data twelve ways from Sunday to make you look better than you appear at first glance.

To me, he publishes new data and valuation formulas to attract customers as well as readership. That is what drove his reworking of the formula as much as any creative itches he had to scratch.

To me, the "Fielding Bible" is an even more transparent sales pitch for "consulting" work and database sales--which by the way I have no problem with. James's commentary on Brandon Phillips and Chase Utley is interesting in light of this thinking.

In the end, I think the market for newer and better stats is driven by pro athletes, their agents, and the pro teams evaluating, buying, and selling talent as much as it is driven by anything else.

by William Lloyd Garrision III (not verified) :: Wed, 01/19/2011 - 1:43pm

I disagree.

I read an article last year in the Wall Street Journal that threw out a stat along the lines that, on average, you can find only one person with a college degree in an MLB dugout. On a personal note, one of the stupidest places I ever spent time was in the dugout of a college baseball team; it's a fairly unintellectual sport for a few reasons.

But football is different. By the time you get to the high school varsity team, you have to memorize (OK, rudimentary) playbooks and weekly game plans, and you get at least an hour or two of film a week, where you analyze all of those moving parts, and your individual performance as part of that machine--in full context--in front of your peers and coaches for scrutiny and review. Football is by its nature a much more intellectual, analytical game, for the players perhaps even more than the fans.

To me, DVOA looks remarkably akin to a coach shutting off the projector and saying something like: "Even though we lost, I still give us an 'A' for effort, a 'B+' for defense, a 'C-' for special teams," etc., etc.

Most of the FO stats, to me, are restating--with valuable granularity and color commentary--the post-mortems and examinations of conscience that a football player or coach goes through. They put names and numbers on truths you grew up having to face on a weekly basis.

Baseball was different. James was essentially a lone intellectual voice in the wilderness. He had to reinvent his rating system because it wasn't standing up to the scrutiny of peer review once other smart folks joined the fray en masse. Football has already had lots of smart people thinking about the system and its parts for decades. It has already evolved to a pretty high level.

by Vicious Chicken Of Bristol (not verified) :: Tue, 01/18/2011 - 10:09am

Or it could be that football has too many variables, too many intangibles, to be accurately modeled by statistical methods.

And though you claim there is no bias, any system that assigns levels of success to certain plays, ignores others as irrelevant, and calls yet others fungible is inherently subjective.

Football stats are a good way to tell you what happened, not necessarily what is going to happen.

by NJBammer :: Tue, 01/18/2011 - 3:11pm

I agree. I think game theory can be applied--though alas never verified--to better understand football successes and failures. The real problem is that any defense can game-plan to negate another team's advantages. This happens very quickly, certainly in no more than a game or two, and once a weakness is found it is consistently exploited.

So we see, for example, that Michael Vick has trouble when forced to run to his right. Once teams know this and can exploit it, all the DVOA in the world will not help predict his next game; it's worse than worthless, it's misleading. Every writer, fan, and otherwise slave to the narratives can see Vick's weakness, and see that it's being exploited, but DVOA is slow to catch on. And once DVOA does catch on, Vick is a very different quarterback, likely doing different things in response to the new attack--which, again, DVOA is slow to recognize.

In the case of football, I can't see any way that straight statistical analysis, no matter how comprehensive, will ever be able to keep up with the complexity of the sport and the changing strategy. I favor more of an advanced approach to tell us what actually happened, rather than a predictive one. But for that, maybe a narrative will do just as well, anyway?

by Sophandros :: Tue, 01/18/2011 - 10:25am

Rigorous research and reevaluation of methods is clearly ranked too low because most people don't understand statistical analysis and/or the scientific method. Merely regurgitating what some talking head says on TV (or better yet, on sports talk radio) is way better than this. IIRC, U must run to win and never go for it on 4th down, you fools! The game iz won in the trenches!!!one! TTYL

Sports talk radio and sports message boards are the killing fields of intellectual discourse.

by DGL :: Tue, 01/18/2011 - 10:34am

We have a winner.

by Verified (not verified) :: Tue, 01/18/2011 - 11:24am

It appears that we have a winner today, sure, but how will it look tomorrow? A year from now? Ten years from now?

by erniecohen :: Tue, 01/18/2011 - 11:17am

I think that Aaron is taking the wrong lessons from this article. The big problem in essentially all sports statistical analysis is not the sample size; it is the experimental methodology and the publication bias. The proper way to do a scientific study is:

- You look at some data (or do some theory) and make a hypothesis;

- You commit to doing a specific, precisely described experiment, as well as to how you will publish the result;

- You perform the experiment on an independent dataset and publish the result as promised. You never mention the original, motivating data.

Now, because everybody on the FO staff knows NFL football in considerable depth, there is no independent set of data from the past that has not already shaped their views. So the only legitimate experiments for FO (or anyone else researching sports) are those where you follow the above process, post your predictions, then see whether the predictions are borne out in future games. Aaron correctly points out that the NFL is a moving target; if you want to deal with this, you have to make a theory that is parameterized by whatever changes you wish to track; you cannot revise the theory (or the experiment) in the middle of the experiment. Every time you change the formula for DVOA, you are starting a new experiment. If you do this, you will not be subject to the problems discussed in this paper.

So if you want to treat what you are doing as science, I suggest that FO create a special section on the site where they formally propose experiments for future games/seasons, and commit to posting (forever) the results of those experiments, regardless of their outcome. The most obvious candidate for such predictions is the use of DVOA/WDVOA to predict outcomes of future games. I suggest that you should (1) number the DVOA versions, (2) post the DVOA version number with the predictions, and (3) continue to use old versions to make predictions for at least 2-3 years. (Note that you have to commit to a precise lifetime in advance.) You aren't allowed to give preferential posting to those formulae that happen to do well in the experiments, and you can't try a tweak, watch it fail for a while, and then stop posting its results (as you have done in the past).
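The commitment scheme described above could be sketched as something like the following append-only ledger. To be clear, the class and method names here are my own invention, purely to make the proposal concrete:

```python
from dataclasses import dataclass, field

@dataclass
class PredictionLedger:
    """Append-only record of pre-registered picks, keyed by formula version."""
    entries: list = field(default_factory=list)

    def register(self, version, game, predicted_winner):
        # Committed before the game is played; never edited or deleted afterward.
        self.entries.append({"version": version, "game": game,
                             "pick": predicted_winner, "actual": None})

    def record_result(self, game, actual_winner):
        # Results get filled in for every registered pick, win or lose.
        for e in self.entries:
            if e["game"] == game:
                e["actual"] = actual_winner

    def accuracy(self, version):
        # Scored over every completed game; no cherry-picking which picks count.
        scored = [e for e in self.entries
                  if e["version"] == version and e["actual"] is not None]
        if not scored:
            return None
        return sum(e["pick"] == e["actual"] for e in scored) / len(scored)

ledger = PredictionLedger()
ledger.register("DVOA-v7.1", "NE@NYJ", "NE")
ledger.register("DVOA-v7.1", "GB@ATL", "GB")
ledger.record_result("NE@NYJ", "NYJ")  # upset: the pick stays on the record
ledger.record_result("GB@ATL", "GB")
print(ledger.accuracy("DVOA-v7.1"))
```

The point of the structure is that the version number travels with every pick, so an old formula keeps accumulating results even after a new one is introduced.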

by AnonymousA (not verified) :: Tue, 01/18/2011 - 3:08pm

Absolute best post in the thread. Of course, we know the result -- DVOA gets updated every year precisely BECAUSE this would make FO look bad. I'll actually make a hypothesis of my own: if today's DVOA ratings are used to predict game winners over the next five years, their percentage correct will be higher than "pick the favorite on the spread" at most twice.

by AnonymousA (not verified) :: Tue, 01/18/2011 - 3:09pm

Self-reply to be clear: I mean "today's DVOA formula", not "ratings".

by DaveRichters :: Thu, 01/20/2011 - 5:44pm

You perform the experiment on an independent dataset and publish the result as promised.

FO isn't doing any experiments. It would surely be useful but the NFL would not cooperate. All we can do is observe, we have no variable to manipulate. And there really isn't one formal "method" for doing science as you suggest there is.

by erniecohen :: Sun, 01/30/2011 - 6:49am

FO is doing experiments, the main variable of which is the DVOA formula. The independent datasets are future NFL games/seasons. (This is precisely the kind of experiment that astronomers do all the time.)

There isn't just one formal method of doing science. (For example, I happen to work in an area where the underlying mathematics is given, so theories can be confirmed with mathematical theorems, rather than experiments.) But the method that I described above of generating a collection of results (i.e., the outputs of the studies) is statistically meaningful (i.e., can be used to draw conclusions about the world), whereas violating any of these conditions makes the outputs statistically useless.

For example, suppose that I wanted to prove that all people are male. A simple way to do this is to randomly sample 100 people and see if they are all male; if they are (and if I do a good job of random sampling), my hypothesis is very likely to be correct. However, if I decide to publish only those studies that confirm my hypothesis, these published results are useless. (Especially if people replicating my experiment have the same publication criteria.)
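A quick simulation shows how that publication filter poisons the record; the sample size of 5 per study and the "all male" publication threshold are made up purely for illustration:

```python
import random

random.seed(0)

def run_study(n=5):
    """Sample n people from a true 50/50 population; True means male."""
    return [random.random() < 0.5 for _ in range(n)]

studies = [run_study() for _ in range(10000)]
# Publish only studies that "confirm" the all-male hypothesis.
published = [s for s in studies if all(s)]

# Every published study is 100% male, even though the population is an even
# split; anyone reading only the published record is completely misled.
print(len(published), "of", len(studies), "studies published")
print("fraction male in published studies:",
      sum(map(sum, published)) / (5 * len(published)))
print("fraction male in all studies:",
      sum(map(sum, studies)) / (5 * len(studies)))
```

About 1 study in 32 passes the filter by chance, and the published subset unanimously supports a false hypothesis.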

by Some_FF-Player_in_nawlins (not verified) :: Tue, 01/18/2011 - 12:01pm

To me, it is obvious that any systematic statistical analysis of Football is inherently limited by the following facts:

1) There is usually a significant drop-off between a starter at any position and that position's replacement. It is impossible to completely know the entire difference, both individually and in how those differences affect the rest of the team. In the NFL, injuries happen. It's impossible to completely account for them with just numbers.

2) You are dealing with human beings. As history has shown us, that fact alone can lead to variable performance on a second-by-second basis due to any number of factors. If there were just machines out on the field that could be expected to turn in consistent performance play after play, you'd be able to almost completely predict the outcome of every play from the statistics generated by past performance alone.

3) A good number of NFL games are still played outdoors, with all the variable effects that entails. Given that many players react to being outdoors, and then to the specific weather conditions of each contest, you still have a major source of randomization for a good percentage of games each year.

4) Coaches, especially near the end of the season, often make decisions that are not completely guided by the end result of the game being a victory. That also throws a bit of variability into things.

5) The validity of year-to-year statistical comparison diminishes with time. The players that made up most of the NFL in 2000 are largely no longer in the league. The rules of the game have been subtly tweaked year after year. This makes comparisons less and less valid as the gap between the years being compared grows.

6) Certain positions have a proportionately higher impact on overall team performance than others. Because of this, the natural randomness in human performance has a greater skew there than in other spots. This doesn't just apply to quarterbacks, but also to positions like blind-side offensive linemen, safeties, and kickers.

There are so many things that can affect a team's performance play to play, game to game, and season to season that any statistical measure of them is always going to be of extremely limited predictive value. You will always be able to say that Team A will likely have success in the running game against Team B because Team A is statistically above average at running and Team B is statistically below average at stopping the run, but that doesn't say much about the likelihood of Team A defeating Team B.

A reasonable person will realize these facts and know that any system is limited by them. An unreasonable person will likely not care about any of these facts and want your system of analysis to be perfect.

by Anonymous Coward (not verified) :: Tue, 01/18/2011 - 1:55pm

So Bill Polian was right all along?

by M :: Tue, 01/18/2011 - 1:55pm

One thing I haven't seen mentioned in all of the critiques is that FO cannot operate in a vacuum. It needs to make money somehow, which usually implies the need to attract a wider audience. Not all of the people who frequent the site are research scientists or statistics majors, so if everything were written only with that audience in mind, the site might not make enough money to survive. Last I checked, Aaron is not a "trust fund baby" with no one to support but himself; indeed, many of the contributors to this site do not have unlimited financial and time resources to make everything we read here "perfect".

Yes, the way some of the information is presented does seem "dumbed down" (especially the ESPN content), and perhaps the theoretical statistics behind the model testing aren't 100% rigorous. However, before FO, what football-only site gave the geek-fan a forum to discuss football, statistics, and what measurements of the latter can truly say about the former? I recall only a couple, but can't think of their exact names right now. Frankly, pro-football-reference and FO seem to have developed a nice little symbiotic relationship that allows both sites to generate more traffic and thrive in the current reality of the internet.

It's my sincere hope that the intellectual discourse on the discussion threads doesn't end up causing a schism between more casual fans and stats geeks that indirectly alters the site content and causes one faction or the other to abandon the site. Constructive criticism of the metrics/models is fundamentally good, but it should be done with the understanding that this is a business that does not have unlimited resources and therefore cannot cater only to the "intellectual elite" of football fans.

by William Lloyd Garrision III (not verified) :: Tue, 01/18/2011 - 3:53pm

Perhaps this is true, but I am left wondering why, how and when the fancy of this "larger audience" shifted away from a desire to ponder the 10 mistakes that guys make towards the desire to want to get Tony Dungy's take on how to be a friend.

by chemical burn :: Tue, 01/18/2011 - 4:28pm

Personally, I'm interested in both the 10 mistakes I'm making and how Tony Dungy thinks I could be a friend.

by erniecohen :: Tue, 01/18/2011 - 5:16pm

That's why I specifically suggested they separate the "scientific" part of the site from the "popular" part of the site. The scientific part of the site - where truth doesn't wear out - is where people go if they want the truth.

by Podge (not verified) :: Tue, 01/18/2011 - 4:29pm

So, just to be clear:

When you got to the bit about measuring symmetry, did you put your hands together and check that your fingers were the same length?

by Mr Shush :: Tue, 01/18/2011 - 8:18pm

Well, yes. Obviously.

by Bill N (not verified) :: Tue, 01/18/2011 - 5:22pm

I challenge some of what Lehrer claims, at least in physics. The Particle Data Group has plots of the neutron lifetime and the axial vector coupling ratio (which is what I presume he means by the weak coupling ratio).


Fallen 10 standard deviations!?! From the current measurement, perhaps, but not from the initial measurements, which had very large error bars; it has fallen by maybe 3 standard deviations. Moreover, the numbers are all negative: it went from -1.20 to about -1.26, hardly a large absolute change, but also a larger effect, not a smaller one.

On the other hand, Ioannidis has a very good description of why much research in some fields is wrong, and backs it up with the math. Essentially, it is Bayesian: if real effects are unlikely in a field, then when you do find one, you are likely to be mistaken. A positive result is surprising, and therefore likely to be noticed. The typical conventions for power and significance don't hold up well when an effect is highly likely or highly unlikely a priori.
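Ioannidis's point can be made concrete with the standard positive-predictive-value calculation; the prior, power, and alpha values below are illustrative assumptions on my part, not numbers from his paper:

```python
def ppv(prior, power=0.8, alpha=0.05):
    """Probability that a statistically significant finding is actually true."""
    true_pos = power * prior          # real effects that reach significance
    false_pos = alpha * (1 - prior)   # null effects that reach significance anyway
    return true_pos / (true_pos + false_pos)

# In a field where real effects are rare, "p < 0.05" is weak evidence:
for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior:>4}: P(true | significant) = {ppv(prior):.2f}")
```

With a 50% prior, a significant result is true about 94% of the time; drop the prior to 1% and it's true only about 14% of the time, even with respectable power.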

Andrew Gelman suggests that statistical significance testing can be biased towards large results. Statistical significance relies on both sample size and effect size. Especially with small sample and large errors, only a large measured effect will appear as statistically significant. Gelman recommends a retrospective power analysis to make explicit the effect size that could be resolved with the final sample.

Also from a Bayesian perspective, Donald Berry and James Berger show that statistical significance testing can be overrated. A small p is not that strongly correlated with the actual likelihood. Berry and Berger showed how the inverse probability (significance testing gives the probability of Y given NOT X; Bayesian probability asks, given Y, how likely is X) can be very different from the significance calculation 1-p. For very small p, one can underestimate the probability of a chance outlier by orders of magnitude. Berry and Berger would recommend applying Bayesian analysis.

I elaborate somewhat more at http://billnichols.wordpress.com/2011/01/18/does-the-truth-wear-out-how-...

by William Lloyd Garrision III (not verified) :: Tue, 01/18/2011 - 6:12pm

Let's tone it down and refocus. We are talking about football here, not the inner workings of the galaxy.

The stuff on this site is not just good, it's great. I offer up here my own personal SABRmetic journey, for what it's worth:

-Memorizing the backs of baseball and football cards as a child, instinctively knowing they were a bit incomplete and misleading.
-Reading Moneyball in 2004 and having my eyes opened.
-Reading Baseball Prospectus every day at work.
-Getting bored of Baseball Prospectus after reading it every day for a few years.
-Rejoicing to learn that "Moneyball" thinking had come to football, and using Football Outsiders to guide my 2007 fantasy draft.
-Reading Football Outsiders every day. Thinking that in general, it was just as good as Baseball Prospectus was in its heyday.
-Thinking that most of the people who read this site and post are much smarter than me; I often feel like Joey from Friends after re-reading my comments.
-Thinking today that perhaps many of those people I used to think were smarter than me are, in fact, just dorks.

by Shattenjager :: Tue, 01/18/2011 - 6:37pm

The linked article is talking about the inner workings of the galaxy.

I don't have a problem with what Aaron said about football at all, but people should be posting arguments against the article itself if it's going to be linked.

by Bill N (not verified) :: Tue, 01/18/2011 - 7:27pm


I link to Ioannidis and the Atlantic article too on my blog.

by William Lloyd Garrision III (not verified) :: Wed, 01/19/2011 - 12:44pm

Perhaps, but I should never be subjected to a sentence like this again on a football site: "The Particle Data Group has plots of the neutron lifetime and axial vector coupling ratio."

And though it was difficult to read that article and not fall asleep, I did make it through to the end, where it said: "The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise." I already knew that, though, because I know that there are lies, damned lies, and statistics.

Many pharma studies are done in the name of advertising, so I don't believe them all anyway. I also only selectively listen to baseball SABR stuff because I think that a lot of it is a sales pitch to MLB teams for consulting work or databases by the researchers, so its value is likely to be somewhat exaggerated. Just like the value of the software I sell to customers is likely to be somewhat exaggerated by me. I approach Football Outsiders with the same healthy skepticism.

I also, by the way, think that Moneyball was used by the Oakland A's as a PR campaign to distract attention away from the shockingly rampant level of steroid use in their clubhouse. For years I have thought this is what Joe Morgan wanted to say, but he has too much restraint to say it. I also think from time to time that we didn't really land on the moon back in '69 and the whole thing was staged by the government in the desert in Arizona somewhere, so perhaps I think too much.

Point is, I don't see FO as overtly trying to "sell" me something here, outside of an annual guide for $14.95, so my radar doesn't pick up the "intentional bias alert" signals that are the real cause of failure in the pharma, etc. studies cited in the article. As a result, I think that some of the hairsplitting isn't applicable here.

by Bill N (not verified) :: Tue, 01/18/2011 - 8:53pm

Hey, the guy left himself exposed, I delivered the hit.

by The Hypno-Toad :: Wed, 01/19/2011 - 4:54am

Eeesh... You guys amaze me. I know enough about statistics and science to know that I don't know either very well, but I know I'm at least reading the right site because people who are very, very smart seem to be very interested in what is said here.
I've said it before, and I'm sure I will say it again. The content on this site is great, but what keeps me coming back is the excellence in the comments.

by MJK :: Wed, 01/19/2011 - 3:13pm

A good example is the "third down rebound," which seems to have disappeared for offenses over the last couple seasons.

I went back and read the original season preview article from a number of years ago where Aaron first introduced the "third down rebound" effect, and (while I don't have a link handy to the article right now), I'm almost certain that in that original article, Aaron stated that the 3rd down rebound effect only seemed to apply on defense, not on offense.

Later, Aaron and the other Outsiders started talking about it all the time for both offense and defense, but I don't know if they did further research that implied that it might apply to offenses too and either didn't mention it, or I missed the article where they did. It's also possible that the original caveat got accidentally dropped.

In the former case, it may be that a known phenomenon for defense was extrapolated to offense on the basis of too little data, as Aaron suggests. If the latter was the case, then it never disappeared...it was just never there in the first place...

by Bowl Game Anomaly :: Wed, 01/19/2011 - 11:52pm

Wasn't the original article about the San Diego Chargers' offensive improvement?

by NYExpat :: Thu, 01/20/2011 - 3:06pm

I also think Aaron is drawing the wrong lessons from the piece; the PZ Myers explanation was very good (separating the facts from the "story" that Lehrer set them in).

Sample size isn't what's at question here. Look at it this way: if I flip 100 fair coins 10 times each, there's a good chance that one of them will come up heads 9 out of 10 times, despite the fact that they're all fair. No matter how large a sample you take of each coin, if you involve enough coins, some of them are going to be outliers by chance alone.
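The arithmetic bears this out; a rough check of the coin example:

```python
from math import comb

# Chance a single fair coin shows >= 9 heads in 10 flips:
# C(10,9) + C(10,10) favorable outcomes out of 2^10.
p_one = (comb(10, 9) + comb(10, 10)) / 2**10   # 11/1024, about 1.1%

# Chance at least one of 100 independent coins does so.
p_any = 1 - (1 - p_one) ** 100                 # roughly two in three
print(f"one coin: {p_one:.4f}, at least one of 100: {p_any:.2f}")
```

So even though any single "9-of-10" coin looks wildly biased, with 100 coins you should expect to see one about two-thirds of the time.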

Sample size is a problem inasmuch as there's a limited number of experiments being run each year, and it's a moving target, as Aaron said. This is why I think the stuff Tanier and Muth do is so important to this site: breaking down plays and technique. Along those lines, the NFL's protectiveness of game film makes work here much murkier than it could otherwise be. It doesn't need to be said every time, but let's just say linking to this XP once in a while wouldn't hurt. :-)

by Bill N (not verified) :: Thu, 01/20/2011 - 3:32pm

A coin is not quite the right example. It's what might be called an uninformative prior: a priori, the outcome is a 50-50 toss-up. We get into real trouble with outliers when we fail to understand the base rates, especially where a given outcome is highly likely or unlikely.

A classic example: you witness a hit-and-run accident at night involving a taxi. You think it was blue. There are two taxi companies in your city, the "blue" and "green" taxi companies. What are the odds you are right? Now, let me tell you that the "blue" company has only 10% as many taxis as "green". Now what do you think? What if "blue" doesn't operate after dark?

How you interpret the actual data should be informed by what else you know.
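For the curious, that taxi problem works out like this under Bayes' theorem; the 80% witness accuracy is my own assumed number, since the comment above doesn't specify one:

```python
def p_blue_given_saw_blue(base_rate_blue, witness_accuracy=0.8):
    """Bayes: P(taxi is blue | witness says blue)."""
    saw_blue_and_blue = witness_accuracy * base_rate_blue
    saw_blue_and_green = (1 - witness_accuracy) * (1 - base_rate_blue)
    return saw_blue_and_blue / (saw_blue_and_blue + saw_blue_and_green)

# Equal fleets: the witness report dominates.
print(p_blue_given_saw_blue(0.5))   # 0.8
# "Blue" has only 10% of the taxis: the base rate pulls the answer way down.
print(p_blue_given_saw_blue(0.1))   # about 0.31
# "Blue" doesn't operate after dark: a base rate of zero beats any witness.
print(p_blue_given_saw_blue(0.0))   # 0.0
```

The same evidence (the witness saying "blue") supports very different conclusions depending on what else you know, which is exactly the point about base rates.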

by William Lloyd Garrision III (not verified) :: Fri, 01/21/2011 - 5:04pm

+2 points for that comment.

My experience in general is that the better folks get at the quant piece, the worse they are at the qualitative piece. Or maybe those who aren't good at the qualitative piece in the first place gravitate towards quantitative approaches.

For instance, in that article they mentioned that studies were showing the effectiveness of schizophrenia drugs diminishing with greater use. My initial, qualitative assessment would be: well, once a drug is pushed into the market, it gets over-prescribed and over-requested, so you have people who don't really need it, and therefore won't get results from it, using it and weighing the data down.

My oldest son was "diagnosed" with autism when he was 4. He's not autistic, by the way. But I learned that the system has so many market forces looking to bring a child into the business that is autism therapy that many get identified into the category incorrectly and get factored into all of the statistics on the category, so now your results are comparing apples to apples and oranges. What looks like a changing dynamic in the category is really a changing of the category.

3rd down trend tracking is my latest pet peeve in this category, where the "what" gets asked but not the "why". Did a team improve its third-down percentage because it regressed to the mean, or because it got a better running back, or a better O-line, so that now its third downs tend to be 3rd-and-4 instead of 3rd-and-7?

by William Lloyd Garrision III (not verified) :: Fri, 01/21/2011 - 5:09pm

When I got my MBA, we used the case study method, as I assume most others do. And although I understand that, for the sake of a structured group conversation, you don't want external, unvalidated data points and arguments introduced from all over the place, I also think a great dysfunction gets encouraged by this approach.

By sticking only to the facts presented, as they are presented, and not informing your analysis with the other things that you know to be true, simply because they weren't written up in the case or one of its exhibits, you form some bad analytical habits. You end up with accountants who think they are economists, and this causes all sorts of problems.