The Shifting Wind

Bill Belichick, Patriots. Matt LaFleur, Packers.
Photo: USA Today Sports Images

NFL Week 16 - To even the least analytics-minded observer, something seemed different during Week 16 of the 2021 NFL season. Coaches appeared to be going for it on fourth-and-short routinely, often in their own territory, and yet commentators’ heads weren’t exploding. Has the thinking really shifted, and is it here to stay?

There were approximately 100 non-trivial fourth-down decisions that occurred during Week 16. “Non-trivial” is the subset that eliminates obvious late-game desperation attempts or, to be more specific, the type of decisions NFL coaches have always gotten right. EdjSports analyst Ian O’Connor, in his weekly fourth-down report, identified 35 correct fourth-down go attempts. This is the highest count since we have been keeping records and may well be the highest count for a single week in NFL history. Of the 65 decisions that were deemed to be suboptimal, many fell into the gray area of low confidence, and only 16 resulted in a net cost of greater than 2% Game-Winning Chance (GWC). Perhaps we are close to reaching herd immunity.

While there is clearly a visible change in behavior this season, there are still many indicators that suggest we have a long way to go to reach the type of enlightenment that machines have introduced to so many other competitive games. I strongly suggest watching the “AlphaGo” documentary to get an idea of what I am talking about. Bill Belichick may be the most revered coach in NFL history, yet he sits near the bottom of the EdjSports Critical Call Rankings. His aggressiveness on fourth downs began to regress around 2011, and he has not managed to keep up with his more analytically open-minded peers in recent years, at least in this specific category of coaching. Even this weekend, as we witnessed all this progress on fourth downs, Matt LaFleur managed to squander 12.4% GWC on a single decision late in the contest against the Browns. With a fourth-and-6 on their own 42-yard line and clinging to a two-point lead with 2:14 remaining in the game, the Packers punted the ball away to the Browns. On the subsequent drive, a Baker Mayfield interception on third down from midfield kept the Browns from exploiting LaFleur’s error in judgment. To be fair, this is the type of highly leveraged fourth-down decision that almost every NFL coach would still get wrong. It defies all the commonly held precepts that have governed such decisions for decades: 1) the Packers were leading in the game; 2) the ball was in their own territory; and 3) it was a long fourth-and-6. Yet this was, by our extensive analysis, the costliest decision of the week.

There is still much room for improvement, but we have come a very long way. When Chuck Bower (an experimental physicist and one of the founders of EdjSports) and I first presented our model and revelations to an NFL team, it was January of 2004. We had begun working on the fourth-down simulation model in 2001 and finally managed to get an audience with Marvin Lewis and the Cincinnati Bengals’ front office. Unfortunately, we were met with cynicism, apathy, and patronizing well wishes. Motivated by other independent research and the innovative metrics and analytical insights of Aaron Schatz, we pressed on and were ultimately successful in contracting with more than one-third of the NFL teams over the next 15 years, including two Super Bowl champions. Because of the complex nature of NFL franchises, shaping the structure of analytics departments is not always the same as extracting benefit. With few exceptions, our tool was treated more as a novelty than a game changer. There has always been a big difference between awareness and implementation when it comes to NFL coaching. However, this season marks a satisfying, and long overdue, tipping point toward adoption.


24 comments, Last at 29 Dec 2021, 1:22pm

#1 by young curmudgeon // Dec 27, 2021 - 3:12pm

I'm generally on board with going for it on fourth down. But I'm a little surprised that the decision to punt from your own 42 is such a costly one. I'd like to know how many timeouts each team had before I buy into the idea. If you fail to get the first down, Cleveland gets the ball and doesn't need that many yards to reach a point from which a field goal is at least possible. (I do not know the weather conditions or the status/ability of Cleveland's kicker.) If you punt and Cleveland takes over on, say, the 20, they need 38 yards just to reach the position on the field where you were when you failed on your fourth down try, plus some more to get to field goal range. Depending on Cleveland's and Green Bay's timeout situation, I can see punting as a decision that ranges from slightly less preferable to somewhat less preferable, but I have a hard time seeing it as 12% less preferable.

How extensively has GWC data been vetted? Does GWC take into account that Aaron Rodgers is GB's quarterback (improves your odds of making a fourth and 6) or that Cleveland's offensive strength is rushing (more difficult to gain sufficient yardage in remaining time)? And, while I'm not a statistician, I'm suspicious of calculations that factor together a number of unknowns and assumptions along with a limited body of actual results yielding a figure as precise as 12.4%. Can you demonstrate that the actual number shouldn't have been 11.7% or 13.3%?

Points: 0

#2 by Darren // Dec 27, 2021 - 5:17pm

The model comes up with an estimate. 12.4% is the rounded estimate (because your bs detector would absolutely go off if they claimed 12.435918%). It absolutely could be 11.7% or 9.54%, we'll never know.

Points: 0

#3 by ImNewAroundThe… // Dec 27, 2021 - 6:26pm

But most take in timeouts, like this one

Still hard to believe it was the worst. This one, where the team was down and had 5 fewer yards to get. Or this one (but it was between two irrelevant teams and at least went for points which might make a difference).

Points: 0

#4 by DisplacedPackerFan // Dec 27, 2021 - 7:10pm

Yeah 12% feels big to me too. At the time this one felt a bit like a toss-up. Maybe a 5% swing.

Weather was in the 30's but no real wind. Cleveland had a kicker in his first NFL game who had missed an XP to start the game. But had made a 37 yard FG and his other XP after that.
GB had 2 timeouts left. CLE had all 3.
GB had managed 6 yards over 6 plays on their 2 previous drives. They had gotten 17 yards on 5 plays on the drive that led to the 4th and long 6 decision.
CLE had been moving the ball via the run (219 for the game) but not so well through the air. Mayfield had 3 picks to that point.

I've been wondering for awhile now about including drive success in models and wonder if recent drive success or failure might have an impact. I do not believe that every play in the NFL is independent of each other because the game is played by humans and while it may be a small factor psychology, accumulated wear and tear, fatigue all do play a role and I'd be pretty impressed if they were accurately modeled.

Figuring CLE would get 4 stoppages of the clock (1 play before the 2 minute warning plus 3 timeouts), not punting and failing felt very much like allowing them to continue to run the ball with the success they had been having the whole 2nd half and getting in place for a chip shot if not a TD and not leaving enough clock for GB even with 2 TO to respond. With the offense having stalled out on 3 drives in a row (any of which could have essentially sealed the game if they had been successful by chewing the clock or adding points) the 4th and 6 didn't feel great.

Punting felt like it might force CLE to pass which they had not been so great at because while the running game was working and they had 4 chances to stop it they still probably needed more time. So the yardage felt more important than it usually does.


Like you I'm not sure I believe the model can be that precise and I would guess the error bars are larger than anyone wants to admit. But again I figured the model would be around 5% and to me that is right at the edge of what I think the grey area is for a game with this many players and as many things that can happen on any given play. Baseball is by far the easiest of the professional sports to model as you basically only ever have 3 players at most interacting in meaningful ways. Basketball is also often just 2 players (the shooter and the defender) that really matter though it can expand to significantly more. Football it's almost always at least 10 (5 lineman, QB, 2 rushers, receiver, coverage is probably the fewest that actually have a major impact). So the grey area is going to be larger and coaches will have to make calls.

Maybe the details of the model really are that good. But I've seen the EDJ model say 12 and other models say 5 before. I don't know which are more accurate and it's in the model builders interest to keep that information hard to find.

So yeah maybe the model is telling me I still haven't evolved my thinking enough. But I'm also interested in what the GWC was for punting. If it was still, say, 52% and going for it would have been 64.4%, then a coach who doesn't trust how his offensive players have been playing or how his run defense has been playing doesn't feel like he's risking as much to drop to a coin flip on the grey area factors that are a lot harder to model or even measure. Though of course part of the point of models saying that the decision was wrong, and why it doesn't matter if the overall chance to win is still in your favor, is process over results.

Also I need to get over my growing distrust of models. It's contradictory because I know that models are getting more accurate. But I also know that some of the methods to build them have been and can still be deeply flawed. I think my real issue is that more and more people are blindly trusting models. There are more models out there (for everything not just football) and some of them are wrong and blindly trusting them puts you in bad situations. So I end up pushing back more against the models when it's really the blind trust that is the issue. The more opaque a model is the more I tend to push back because it's asking for blind trust. I think part of why I don't mind the DVOA black box is because it's not completely black; they open it up and show us every now and then and explain how they built it. So it's possibly my own fault for not looking into the EDJ GWC articles as much to see if my concerns have been addressed. Also I suppose I need to get over model chances being presented without the error margins. I get that it confuses innumerate folks, but I wish it was standard practice.

Points: 0

#5 by young curmudgeon // Dec 27, 2021 - 8:44pm

Seems a little intellectually dishonest (or at least sloppy) to state that the model dictates a 12.4% chance (as I stated, seems very precise to me) without citing some error bars.  I haven't researched the issue, and may well not possess the statistical acumen to grasp all the details if I were to research it, but it just seems to me that there have to be a lot of assumptions/presuppositions/hypotheticals/weightings built into the formula and those could be questionable or even flat-out wrong, as well as a wealth of variables in each game situation. 

I just get a little nervous that many people seem to be buying into "Game Winning Chance" as an established thing.  I think it expands our knowledge and our thinking, and gives us another tool to work with. I don't think it's graven in stone.
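
One piece of the error-bar question can be put in numbers. If a win probability is estimated by simulating the game many times, the simulation noise alone has a standard error of roughly sqrt(p(1-p)/n). The minimal Python sketch below applies that to the hypothetical 64.4% vs. 52% split floated earlier in the thread; the function names and the 10,000-run simulation count are assumptions made for illustration, not anything EdjSports has published. It deliberately ignores the larger source of uncertainty raised above, namely the modeling assumptions themselves.

```python
import math


def win_rate_standard_error(p, n_sims):
    """Standard error of a win rate estimated from n_sims independent simulated games."""
    return math.sqrt(p * (1.0 - p) / n_sims)


def difference_interval(p_go, p_punt, n_sims, z=1.96):
    """Approximate 95% interval for (p_go - p_punt), treating the two runs as independent.

    Covers sampling noise from the simulations only, not model error.
    """
    diff = p_go - p_punt
    se = math.sqrt(
        win_rate_standard_error(p_go, n_sims) ** 2
        + win_rate_standard_error(p_punt, n_sims) ** 2
    )
    return diff - z * se, diff + z * se


# Hypothetical inputs: the thread's 64.4% / 52% example and an assumed run count.
low, high = difference_interval(p_go=0.644, p_punt=0.52, n_sims=10_000)
print(f"GWC edge for going for it: {low:.3f} to {high:.3f}")
```

Quadrupling the number of simulated games halves the width of this interval, but no number of extra runs shrinks the uncertainty that comes from the inputs.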

Points: 0

#7 by Frank Frigo // Dec 27, 2021 - 10:17pm

12.4% is the model's best assessment of the GWC cost of punting vs a passing attempt.  Here are a few things that go into that:

  • The simulations are fully customizable for the specific characteristics of each team based on DVOA data and updated each week. Key injury adjustments are also considered
  • The simulations play the game to conclusion (running clock, timeouts, etc) from any unique game state (ball position, clock, yards-to-first, score etc).
  • Our GWC outputs have been extensively tested and calibrated against 20 years of NFL data and also more recently against betting market data.
  • We can stress test our GWC outputs by adjusting any input parameters within the game state or the offensive and defensive characteristics of the opponents.

In this particular situation, it appears the model is heavily weighting the Packers' top-rated passing offense vs a more average Browns' passing defense. With regard to the comparative 4th down bot analysis that was posted, our (customized) expected conversion rates on the 4th and 6 are higher, as well as the GWC for the Browns after the punt. I want to again emphasize that 12.4% is the model's best assessment of the GWC cost based on our underlying assumptions. That figure could of course be lower, or higher. For instance, if we swap the Packers for the Saints, we effectively change their offensive passing DVOA from first to last. Similarly we can swap the Browns for the Cowboys, thus changing their defensive passing DVOA to first from average. With both of these adjustments applied, the simulation STILL favors a passing attempt on 4th and 6, although by a very slim margin. In summary, we like to quote the model's GWC differences as that is the best information we have. More importantly, we are comfortable labeling a decision such as this as a clear error if the analysis survives a stress test that stretches the original assumptions to extreme levels.
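
The shape of that stress test can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the EdjSports simulation: it collapses the decision to three GWC values and one conversion probability, and every number and function name below is a hypothetical placeholder chosen for illustration.

```python
def go_for_it_gwc(p_convert, gwc_if_convert, gwc_if_fail):
    """Expected Game-Winning Chance of attempting the fourth-down conversion."""
    return p_convert * gwc_if_convert + (1.0 - p_convert) * gwc_if_fail


def break_even_conversion_rate(gwc_punt, gwc_if_convert, gwc_if_fail):
    """Conversion probability at which going for it and punting are worth the same."""
    return (gwc_punt - gwc_if_fail) / (gwc_if_convert - gwc_if_fail)


# Hypothetical inputs for illustration only; these are NOT model outputs.
GWC_PUNT = 0.75
GWC_IF_CONVERT = 0.95
GWC_IF_FAIL = 0.60


def stress_test(conversion_rates):
    """Return (p_convert, go GWC, edge over punting) for each assumed conversion rate."""
    rows = []
    for p in conversion_rates:
        go = go_for_it_gwc(p, GWC_IF_CONVERT, GWC_IF_FAIL)
        rows.append((p, go, go - GWC_PUNT))
    return rows


for p, go, edge in stress_test([0.35, 0.45, 0.55]):
    print(f"p_convert={p:.2f}  go={go:.3f}  edge vs punt={edge:+.3f}")

print("break-even p_convert:",
      round(break_even_conversion_rate(GWC_PUNT, GWC_IF_CONVERT, GWC_IF_FAIL), 3))
```

Under these made-up inputs the recommendation flips once the assumed conversion rate drops below the break-even value, which is the kind of boundary a stress test is looking for. A recommendation that holds even when the inputs are pushed to extremes is the more robust one.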

Points: 0

#8 by vodkaferret // Dec 28, 2021 - 8:08am

In reply to by Frank Frigo

Would you calibrate against betting market data? That shows what people think should be done in a given situation, or what effect people think a decision had on a teams chance to win. That isn't necessarily the same as the actual impact on chance to win.

Or if you're taking it from the changes in the odds, then that simply shows the changes the bookmaker thinks are necessary to separate people from their money.

Either way, the relevance to a model that is aiming to measure true GWC is not very high at all. They are measures of two different things, how can you use one to calibrate the other?

Points: 0

#9 by Frank Frigo // Dec 28, 2021 - 8:36am

If you believe that highly liquid, betting market data (which implies GWC) is irrelevant information, I suggest you go out and exploit it. Best of luck. 

From our perspective, it is one of many important components that can play a role in accurately tuning the model.
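
For readers wondering how a betting line "implies" a win probability at all, the standard conversion is simple arithmetic. The Python sketch below turns a pair of American moneyline prices into implied probabilities and then rescales them to sum to one, which is one common way to strip out the bookmaker's margin. The prices are invented for illustration, and nothing here describes how EdjSports actually uses market data.

```python
def implied_probability(american_odds):
    """Raw implied win probability from an American moneyline price."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100.0)
    return 100.0 / (american_odds + 100.0)


def remove_margin(prob_a, prob_b):
    """Rescale two raw implied probabilities so they sum to 1 (proportional method)."""
    total = prob_a + prob_b
    return prob_a / total, prob_b / total


# Hypothetical two-way market: favorite at -150, underdog at +130.
raw_fav = implied_probability(-150)
raw_dog = implied_probability(130)
fair_fav, fair_dog = remove_margin(raw_fav, raw_dog)

print(f"raw: {raw_fav:.3f} + {raw_dog:.3f} = {raw_fav + raw_dog:.3f}")
print(f"margin removed: {fair_fav:.3f} / {fair_dog:.3f}")
```

The raw probabilities sum to more than one; the excess is the bookmaker's margin. Whether the rescaled number reflects the true chance or only the market's perception of it is exactly the point being argued in this exchange.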

Points: 0

#10 by vodkaferret // Dec 28, 2021 - 9:15am

So your answer to why it's relevant is "because we say so, dumbass" - yes, I'm paraphrasing but still. Many thanks for your condescending reply.

Could I ask you to explain a little further why you believe that betting market data is relevant. Because it doesn't imply GWC as you say, it implies human perception of GWC. This is not necessarily the same thing.

Points: 0

#11 by ImNewAroundThe… // Dec 28, 2021 - 9:27am

In reply to by vodkaferret

Usually what it's used for. And analytics twitter already delved into this objective/subjective  argument the other week. 

Points: 0

#12 by vodkaferret // Dec 28, 2021 - 9:46am

Ok.. Could you point me in the direction of some of this discussion please? A starting point is enough. (I have no idea who / what people constitute analytics Twitter or were having that conversation)...

I'd still argue strongly that the average person understands probability very poorly, so using an industry designed to monetize that fact as a calibration seems more than a little flawed. But I'd be very interested to read those discussions.


Points: 0

#13 by ImNewAroundThe… // Dec 28, 2021 - 9:55am

Look at the replies from verified people, most notably in Ben's 2nd tweet in the thread.

Points: 0

#6 by armchair journ… // Dec 27, 2021 - 9:48pm

It would seem that a far more productive response to the -12.4 result would have been to question the model rather than the coaching. It certainly would have made for a more worthwhile post. 

Points: 0

#15 by JacqueShellacque // Dec 28, 2021 - 10:23am

The purpose of analytical models isn't to find the 'right' answer, it's to probe the process, to determine if pre-conceived notions are in fact correct or based on poor assumptions, to uncover hidden leverage, in order to (paraphrasing Sir Karl) discover less obvious dependencies within the social or decision-making sphere. I think we'd all agree there's nothing qualitatively new in assessing the GB decision on 4th and 6, as we all know if they succeeded in the conversion they increase their chance of winning tremendously. So that makes the model's assessment of this situation far less interesting - ie, no hidden leverage has been uncovered. Quantitatively  the numerical result derived could be subject to model error (ie, one parameter tweaked a bit could give a wildly different result). So a decision-maker could be quite right to question the direction towards which the number seems to be pointing. In fact, it should be human nature to do so. Analytics people, please - find the hidden leverage, don't try to quantify with precision that which can never be determined with precision. 

Points: 0

#16 by ImNewAroundThe… // Dec 28, 2021 - 10:37am

In reply to by JacqueShellacque

Essentially the notion is on the finely tuned models to prove why they're not matching up? One parameter being changed likely doesn't change much, hence why it was as high as it was.

The number seems high, I agree, but the recommendation remains the same. I linked to Ben's model above, same recommendation. What model (aka different parameters) says the OPPOSITE of this, Ben's, etc?

And to look at the actuality of what happened, it showed Cleveland might have won if GB doesn't get away with a penalty on the last turnover. Ironically approx where they would've turned the ball over if they failed on the 4th.

Points: 0

#17 by JacqueShellacque // Dec 28, 2021 - 2:01pm

The whole purpose behind analysis of situations like this is to avoid rationalizations based on the outcomes bias. This is a big no-no, falling for it leads to being fooled by randomness into thinking one knows more than one really does:

"And to look at the actuality of what happened, it showed..."

These are non-repeatable scenarios, not experiments proving the 'correctness' of a model or not. The 'actuality of what happens' if I decide to cook meth in my basement might be a stack of cash, and even if that's the result 'in actuality' it doesn't mean the decision to do so is correct.

This is also naive:

"One parameter being changed likely doesn't change much"

How do we know? If there's anywhere in these equations where the denominator is a percentage, then a small change in that denominator means a yuuge change in the result it spits out. I don't think anyone is arguing directly for or against the decision to punt, only the idea that the ultimate decision to punt can be definitively classed as an error.

Points: 0

#18 by Frank Frigo // Dec 28, 2021 - 2:20pm

JacqueShellacque, did you read my explanation of simulation parameters and stress testing (speaking to uncertainty) that was posted above?  If so, please specifically address what aspects of the criterion you disagree with.  

Points: 0

#20 by JacqueShellacque // Dec 29, 2021 - 8:25am

and it still misses my point. Analytical methods should look for what's hidden, not try to quantify what's seemingly obvious but in an apparently more precise way than other dummies might. Because what you're asserting here, and even the way in which you're coming to that conclusion, is unfalsifiable. 

Points: 0

#19 by ImNewAroundThe… // Dec 28, 2021 - 3:15pm

But I can see why they said go for it.

But again, I want to see if ANY model said punt. The purpose of this is to point out the mistakes, not make up excuses for the coaches.

How is it naive? You're basing this on what? You think if the parameter was 4th and 7 it changes anything? Again, why the strong push for a model trying to prove coaches right when they choose wrong? That doesn't make any sense for an article! Again, I would like to see any model that said LaFleur made the correct choice. This and Ben's say he didn't! If all the models say medium+ go for it...it was likely an error! Sorry! Let's not be hurt by definitions now that say these professionals might make mistakes! They'll survive.

Points: 0

#21 by JacqueShellacque // Dec 29, 2021 - 8:26am

Naivete comes from too much faith in models. The people who ran Long Term Capital Management had it all worked out in 1998 too, there was no way they'd lose money. Using much the same methods. As I said above, the conclusions can't be falsified, so that calls for humility.

Points: 0

#23 by ImNewAroundThe… // Dec 29, 2021 - 12:56pm

In reply to by JacqueShellacque

But not too much in human coaches somehow.

It was the wrong decision. Why is it hard for you to label it as such? No one said it wouldn't work out still for them. Gee. Again, show me ANY model that shows the opposite recommendation.

Points: 0

#22 by Noahrk // Dec 29, 2021 - 11:52am

After reading all of the above comments, I agree both with Jacque and with Packerfan. It's no good making the "right" decision unless you know why it is the right decision. Not when there are secondary decisions to be made. For example, perhaps going for it is the right decision because if you fail you retain a higher than expected chance to win. Maybe (just hypothesizing here) that chance to win comes from selling out to stop the run so that if you lose it will be through a big play that will lead to a quick score and still leave you time to come back and win it. If so, you could only take advantage of the "right" decision if you were willing to play a certain kind of defense after failing to convert.

IMO it is better to make a suboptimal decision with a plan than an optimal decision without one -or in other words, just because the model says so. Otherwise you could be walking into a trap fashioned out of your own ignorance.

Points: 0

#24 by JacqueShellacque // Dec 29, 2021 - 1:12pm

the idea of sample paths, and I think that's what you're getting at here. If the decision is made to go for it on 4th and 6 from your own 42 in the middle of the 2nd quarter, there are far more sample paths (and thus more likelihood of achieving these expected values) than inside the 2 minute warning of the 4th quarter, up by 2. The number of sample paths drops significantly, one of which is failing to convert and giving the ball back with only about 10 yards needed for a credible FG attempt, with no time left to respond. For this reason, I suspect the model will overestimate the GWC. That's not to say I would've disagreed with a decision to try to convert, only that I would not have taken the model as telling me what should definitely be done. Models can be just as ignorant as troglodyte NFL coaches. Like the Long Term Capital Management fiasco, run by multiple Riksbank prize-winning economists. Frank himself states that the goal is to increase the probability of winning, 'on average'. Significantly reduce the number of sample paths though and you can't get that average.

Let me reiterate that I only object to the idea of using analytical results such as these to state whether a given decision was 'correct' or not. That simply can't be determined. I could've said, for example, that because GB ended up winning the game, the decision to punt was 'correct'. We all know that's bad reasoning, and all share the same thinking in that regard (outcome and confirmation bias all rolled into one). So there's more agreement between us than difference. I think had the conclusion been stated thusly: "Analysis of the 4th and 6 situation shows that conventional wisdom and its aversion to the risk of turning the ball over on downs in one's own territory late in the game could be overstated, and aggressive approach attempting the conversion may have ultimately been more valuable than punting" it would've been more reasonable. Infallibility didn't wear well with popes, and it definitely won't work with numbers people.

Points: 0
