Offensive Personnel vs. Men in the Box: A Football Causality Tutorial
by Zachary Binney
Hundreds of analysts (a.k.a. football nerds) are rolling up their sleeves this fall to use NFL tracking data – which contains the location of all 22 players plus the ball measured 10 times per second – to try and predict how many yards a running play will gain as part of the NFL’s Second Annual Big Data Bowl.
While the tracking data allows for much more sophisticated analyses, two potentially valuable predictors for those not wanting to get their hands too dirty are the offense’s personnel and the number of defensive “men in the box” (MIB) close to the line of scrimmage. Indeed, within hours of the Big Data Bowl’s data being posted, Twitter user @deryck_cg1 had already investigated how rushing yards vary with all combinations of offensive personnel and MIB. There is a vigorous discussion in the NFL analytics community about the relative importance of these two variables in determining rushing success.
One tempting approach is to throw both in a regression model and see which one has a stronger association. While this is a fine approach for a prediction problem like the Big Data Bowl, it can cause some major problems depending on the particular question you’re trying to answer. To understand why, you’re going to need a (basic) crash course in the field of causal inference. I’ll be gentle, I promise.
Note that this entire article is an oversimplification that omits a lot of important variables that can impact offensive and defensive personnel choices and rushing results and doesn’t use the best measure of rushing success. It is intended only as a tutorial that demonstrates how some seemingly correct analyses can be wrong. It is not a standalone comprehensive analysis. Also, the code used to produce all results in this article is available here, but the underlying data is proprietary. You can, however, apply this code to the Big Data Bowl data to conduct a similar analysis.
In order to help us sort out our thoughts about offensive personnel, MIB, and rushing success we’ll begin by drawing something called a directed acyclic graph (DAG) to represent our situation. A “graph” here just means we’re drawing some points connected by lines; “directed” means those lines are arrows; and “acyclic” (no cycle) means you can’t follow the arrows to get back to where you started from. The arrows imply that the thing at the tail of the arrow causes the thing it's pointing at. If you can read a play diagram, you can read a DAG.
Here is a DAG diagramming our situation (graphic generated using DAGitty):
What’s going on here? Let’s start with the 3 individual arrows:
- Offensive personnel has some direct effect on rushing success. This could be, for example, the effect of the offense having more or fewer blockers available to help on the play.
- Offensive personnel impacts MIB because the defense reacts to what the offense shows, stacking the box or backing off.
- MIB impacts rushing success. Fewer MIB means more rushing yards.
The three green arrows represent two (hypothesized) causal paths whereby offensive personnel impacts rushing success: a direct effect, and an indirect effect that “flows through” MIB. In other words, our diagram says offensive personnel impacts rushing success both by a.) altering the options and blockers for the offense (direct effect), and b.) altering the response of the defense in terms of MIB (indirect effect).
Another way to say the latter is that MIB mediates offensive personnel’s effect on rushing success.
Drawing a DAG is a helpful first step as it forces you to think about how all your variables relate to one another. As you’ll see below it can stop you from making some major analytical mistakes.
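To make the "follow the arrows" idea concrete, here is a minimal sketch in Python that writes the DAG described above as a dictionary and lists every causal path from offensive personnel to rushing success. The node names and the helper `causal_paths` are illustrative choices, not part of the original analysis.

```python
# Toy encoding of the DAG described above: each key points at the things it causes.
# Node names are illustrative labels, not identifiers from any dataset.
DAG = {
    "offensive_personnel": ["mib", "rushing_success"],
    "mib": ["rushing_success"],
    "rushing_success": [],
}

def causal_paths(dag, start, end, path=None):
    """Return every path that follows the arrows from start to end."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for child in dag.get(start, []):
        found.extend(causal_paths(dag, child, end, path))
    return found

paths = causal_paths(DAG, "offensive_personnel", "rushing_success")
for p in paths:
    # A two-node path is the direct effect; anything longer passes through a mediator.
    kind = "direct" if len(p) == 2 else "indirect (via " + ", ".join(p[1:-1]) + ")"
    print(" -> ".join(p), "|", kind)
```

The two paths it prints are the direct effect and the indirect effect through MIB. Because the graph is acyclic, the recursion always terminates.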
Questions We Could Ask
There are three types of questions we can ask of any dataset: descriptive, predictive, and causal. Here are examples of each for NFL rushing success:
- Descriptive: What is the distribution of rushing yards on plays out of 11 personnel? 22 personnel? Against 6 MIB? 8 MIB?
- Predictive: How many yards am I likely to gain in the future if I run 11 personnel against 6 MIB? 11 personnel against 8 MIB? 22 personnel against 6 MIB? This is the type of question being asked in the Big Data Bowl.
  - DAGs don’t matter here. We just use whatever data we have on hand to make the best predictions we can.
- Causal: How many more or fewer yards can I expect to gain on this play if I run it out of 11 rather than 12 or 13 personnel?
  - DAGs are critical here to ensure we don’t mess up our analysis.
Let’s tackle each of these questions in order.
Descriptive: Visualizing the Data
This is always a good place to start. Let’s look at the actual distribution of rushing yards under various situations. We used data from Football Outsiders, Sports Info Solutions, and ESPN on 17,622 rushes from 2016-18. The sample is limited to non-scramble rushes on first through third downs with 2+ yards to go and five offensive linemen, outside the opponent’s 30-yard line, in the first three quarters, excluding the final 2 minutes of the first half.
Figure 1. Distributions of rushing yards by offensive personnel grouping.
Figure 2. Distributions of rushing yards by MIB.
There are modest differences in yards per carry (YPC) under different offensive personnel (12 and 13 personnel yielded about 4.3 YPC, 22 personnel 4.6 YPC, and 11 and 21 personnel 4.8 YPC). There is a somewhat clearer drop for each additional MIB (5.8/4.9/4.5/4.3 YPC for <=5/6/7/8+ MIB).
Predictive: Combined Regression Model
In predictive modeling we use whatever data we have on hand to predict what will happen in the future. So a model with offensive personnel and MIB at the snap is fine for predicting that play's rushing success. Here are some predictions for average YPC on future first-and-10 plays using a (Bayesian) lognormal regression model (fit using the brms package in R):
Figure 3. Predicted average YPC by offensive personnel and MIB.
Figure 3 shows the distributions of predicted average YPC from our model. For example, for first-and-10 with 11 personnel vs. 6 MIB we would estimate about 4.85 YPC, with a 90% credible interval from 4.73-4.98 YPC (meaning we are 90% sure the true average YPC is in this range).
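For readers new to credible intervals, here is a rough sketch in Python of how a 90% interval is read off a set of posterior draws. The draws are stand-ins simulated from a normal distribution picked only to resemble the interval quoted above; they are not output from the article's model.

```python
import random
from statistics import mean, quantiles

random.seed(1)

# Stand-in posterior draws for average YPC (11 personnel vs. 6 MIB, first-and-10).
# A real analysis would take these from the fitted model; here they are simulated
# from a normal distribution chosen only to resemble the interval quoted above.
draws = [random.gauss(4.85, 0.076) for _ in range(4000)]

# quantiles(..., n=20) returns the 5%, 10%, ..., 95% cut points.
cuts = quantiles(draws, n=20)
lower, upper = cuts[0], cuts[-1]

print(f"posterior mean: {mean(draws):.2f}")
print(f"90% credible interval: {lower:.2f} to {upper:.2f}")
```

The interval is simply the range that holds the middle 90% of the draws, which is why it can be read as "90% sure the true average is in here."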
Here MIB is pretty clearly the stronger predictor of rushing success, and offensive personnel has a much smaller association. But what if we tweak our question slightly to something more valuable: what would happen if we change offensive personnel?
Causal: Offensive Personnel and Rushing Success
There is a theory that switching to (lighter, pass-friendly) 11 personnel could be more successful than (heavier, run-friendly) 12, 13, 21, or 22 personnel if it tricks the defense into expecting a pass.
So our head coach comes to us and asks, “Should we run out of 11 personnel more?” Coach is asking about the causal effect of changing offensive personnel. If we do X (change offensive personnel), will it change Y (YPC)?
In our model above we found that, against 6 MIB on first down, 11 personnel runs gained on average 0.13 more yards (90% credible interval -0.05 to +0.31 more yards) than 12 personnel runs. This difference is quite small. (The results were similar for 11 vs. 13 personnel.)
We “controlled” or “adjusted” for – that is, removed the effect of – MIB by including it in our regression model. In DAG terms, here’s what happened:
That box around MIB indicates that we controlled for it in our regression. That box “blocked” any effect of offensive personnel on rushing success that flows through MIB, so we only see its (smaller) direct effect – the single green arrow and causal path in DAG 2.
Another way of thinking about this is that we only see the effect of offensive personnel “conditional on” some specific number of MIB. For example, is running out of 11 personnel or 12/13 personnel more successful when facing 6 MIB? Or 8 MIB?
This doesn’t answer our coach’s question! It answers what would happen if we run out of 11 personnel but the defense doesn’t change its MIB. It doesn’t account for the full effect that offensive personnel has on rushing success because it explicitly ignores that offensive personnel can alter MIB. It’s an underestimate.
What Would Be Better?
Simpler is actually better here. If we run a model without MIB we will see offensive personnel’s total effect – a combination of its direct and indirect effects, represented by all the green arrows in DAG 1. In other words, we will get the total effect of having fewer blockers but also possibly tricking the defense.
Such a model shows that, on first down, 11 personnel runs gained on average 0.43 more yards (90% credible interval +0.27 to +0.59 more yards) than 12 personnel runs. Whether switching to 11 personnel nets you 0.43 or 0.13 additional YPC is a major difference! (We see similar results for 11 vs. 13 personnel.)
To help us further understand the difference between the two models take a look at the brief animation below, which shows our estimated average YPC for 11, 12, and 13 personnel in the models that do and don’t control for MIB.
Figure 4. Predicted average YPC for 11, 12, and 13 personnel, controlling or not controlling for MIB (GIF).
In the model that controls for MIB there is only a modest difference between 11 and 12/13 personnel in terms of YPC. But when we don’t control for MIB, there’s suddenly a big gap! That’s because we include offensive personnel’s effect on MIB only in the latter model.
The difference between these two models is also an example of mediation analysis. By looking at the effect of offensive personnel on rushing success with and without controlling for MIB we can understand how much of the effect of offensive personnel on rushing success flows through MIB – in other words, how strong a mediator MIB is. The answer is: the vast majority of the difference in rushing success across different offensive personnel is due to changes in MIB, and offensive personnel appears to have a small direct effect on rushing success, if any.
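The with-and-without-MIB comparison can be sketched with simulated plays. The Python example below invents a world where 11 personnel has a small direct edge over 12 and also draws lighter boxes, then estimates the effect both ways. Every number in it is made up for illustration, and it compares simple group averages rather than fitting the article's lognormal model.

```python
import random
from statistics import mean

random.seed(0)

# Made-up data-generating process, for illustration only.
DIRECT_EFFECT = 0.1    # yards 11 personnel gains over 12 with MIB held fixed
YARDS_PER_MIB = -0.4   # change in yards for each extra man in the box
BOX_ODDS = {           # how the defense responds: weights for 6, 7, 8 MIB
    11: [0.6, 0.3, 0.1],
    12: [0.1, 0.4, 0.5],
}

def simulate_play(personnel):
    mib = random.choices([6, 7, 8], weights=BOX_ODDS[personnel])[0]
    yards = (7.0 + DIRECT_EFFECT * (personnel == 11)
             + YARDS_PER_MIB * mib + random.gauss(0, 2))
    return personnel, mib, yards

plays = [simulate_play(p) for p in (11, 12) for _ in range(20000)]

def avg_yards(personnel, mib=None):
    return mean(y for p, m, y in plays
                if p == personnel and (mib is None or m == mib))

# Not controlling for MIB: compare all 11 runs with all 12 runs (total effect).
total = avg_yards(11) - avg_yards(12)

# Controlling for MIB: compare within each box count, then average the gaps,
# weighting each box count by how many plays it has (direct effect).
weights = {m: sum(1 for _, mib, _ in plays if mib == m) for m in (6, 7, 8)}
direct = sum(weights[m] * (avg_yards(11, m) - avg_yards(12, m))
             for m in (6, 7, 8)) / len(plays)

indirect = total - direct
print(f"total {total:.2f} | direct {direct:.2f} | indirect {indirect:.2f}")
print(f"share flowing through MIB: {indirect / total:.0%}")
```

In this toy world the true direct effect is 0.1 yards and the true total effect is 0.46 yards, so the estimates should land near those values. Applying the same subtraction to the article's point estimates (0.43 total, 0.13 direct) leaves about 0.30 yards, roughly 70% of the total, flowing through MIB. That back-of-the-envelope split treats the point estimates as exact and the effects as additive, so take it as a rough guide only.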
The results are even starker if we compare 11 personnel to 21 and 22 personnel. In the model that controls for MIB, 11 personnel looks worse – if we hold MIB constant, runs out of 22 have a higher YPC than runs out of 11. Duh, more big guys is better. In the model that doesn’t control for MIB, however, rushes are more successful from 11 personnel than 22 personnel!
Figure 5. Predicted average YPC for 11, 21, and 22 personnel, controlling or not controlling for MIB (GIF).
All that said, sometimes you only want an exposure's direct effect, meaning you actually want DAG 2. Consider an example of evaluating offensive line play. "Offensive line skill" has some direct impact on rushing yards as well as an indirect effect on rushing yards flowing through MIB since defenses may choose to stack the box against better run-blocking lines. Controlling for MIB would be wise if you want to estimate the true skill of an offensive line independent of its effects on MIB – that is, how one line compares to another when both face 6 MIB. But it may not be if you want to estimate that line’s total value – which should incorporate any effect they have on defenses choosing to stack the box.
Now, are you ready for one more twist?
What if I Want to Know the Effect of MIB Instead?
A defensive coach might ask “If we put another man in the box, will that help us stop the run?” Here, interestingly, you would want the first model we ran with both offensive personnel and MIB – not one with just MIB. Here’s the DAG for this situation:
Notice MIB and offensive personnel switched places because MIB is now what we want to change – our exposure as epidemiologists call it. Offensive personnel is not part of how MIB affects rushing success (a mediator) – it’s what’s called a confounder instead. Offensive personnel causes both the thing we're interested in changing (MIB) and the outcome we want to measure (rushing yards).
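The mediator-versus-confounder distinction depends only on which way the arrows point relative to the exposure. Here is a minimal sketch in Python, using the same three-node graph, that labels a variable once you name the exposure and outcome. The helper names and the simplified definitions (a confounder here is any common cause with a path to the outcome that avoids the exposure) are assumptions made for illustration.

```python
# Same toy DAG as in the article's diagrams; node names are illustrative labels.
DAG = {
    "offensive_personnel": ["mib", "rushing_success"],
    "mib": ["rushing_success"],
    "rushing_success": [],
}

def descendants(dag, start, avoid=None):
    """Everything reachable from start by following arrows, never passing through avoid."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in dag.get(node, []):
            if child != avoid and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def role(dag, exposure, outcome, var):
    # Mediator: exposure causes var, and var causes the outcome.
    if var in descendants(dag, exposure) and outcome in descendants(dag, var):
        return "mediator"
    # Confounder: var causes the exposure, and also reaches the outcome
    # along some path that does not go through the exposure.
    if (exposure in descendants(dag, var)
            and outcome in descendants(dag, var, avoid=exposure)):
        return "confounder"
    return "neither"

print(role(DAG, "offensive_personnel", "rushing_success", "mib"))
print(role(DAG, "mib", "rushing_success", "offensive_personnel"))
```

The graph never changes between the two calls. Only the choice of exposure does, and that alone flips MIB's partner variable from something to leave out (a mediator) to something to control for (a confounder).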
Think of it this way: 7 MIB might look worse than 6 MIB just because 7 MIB is used more against heavier offensive personnel. The heavier personnel is said to confound the true effect of that extra MIB. If the 7 MIB caused the heavier offense, that would be part of its effect – there would be a green “causal” path of forward-pointing green arrows in DAG 3 that we would want to include in our analysis. But we’re assuming it didn’t – MIB is merely a response to offensive personnel, creating a red “non-causal path” that we want to “block.” A “non-causal” path is basically one that starts by going backwards up, rather than forwards down, an arrow.
A model with just MIB would give us a combination of the green path we want and the red path we don’t. Even if MIB had no effect on rushing success (no green arrow from MIB -> rushing success), this red path would make them appear associated unless we block it. So we need to control for offensive personnel to “block” the red “non-causal path” from MIB to rushing success to get a clear picture of what would happen if we added an extra MIB, holding offensive personnel equal.
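The claim that the red path creates an association even when MIB does nothing can be checked with a small simulation. In the Python sketch below, an extra man in the box has exactly zero effect on yards, heavy personnel hurts yards a little, and the defense stacks the box against heavy personnel. All numbers are invented for illustration.

```python
import random
from statistics import mean

random.seed(0)

# Made-up numbers, for illustration only.
TRUE_MIB_EFFECT = 0.0     # in this toy world an extra man in the box does nothing
HEAVY_PENALTY = -0.5      # hypothetical direct effect of heavy personnel on yards
P_SEVEN_IN_BOX = {"light": 0.2, "heavy": 0.8}   # defense reacts to personnel

def simulate_play(personnel):
    mib = 7 if random.random() < P_SEVEN_IN_BOX[personnel] else 6
    yards = (4.5 + HEAVY_PENALTY * (personnel == "heavy")
             + TRUE_MIB_EFFECT * (mib - 6) + random.gauss(0, 2))
    return personnel, mib, yards

plays = [simulate_play(p) for p in ("light", "heavy") for _ in range(20000)]

def avg_yards(mib, personnel=None):
    return mean(y for p, m, y in plays
                if m == mib and (personnel is None or p == personnel))

# Model with just MIB: the non-causal path through personnel leaks in.
unadjusted = avg_yards(7) - avg_yards(6)

# Controlling for personnel: compare 7 vs. 6 MIB within each grouping, then average.
adjusted = mean(avg_yards(7, p) - avg_yards(6, p) for p in ("light", "heavy"))

print(f"7 vs. 6 MIB, ignoring personnel:        {unadjusted:+.2f} yards")
print(f"7 vs. 6 MIB, holding personnel equal:   {adjusted:+.2f} yards")
```

The unadjusted comparison should show 7 MIB "costing" about 0.3 yards purely because it is used more often against heavy personnel, while the adjusted comparison should sit near zero, the true effect in this toy world.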
This post is already too long, but a great resource for mediation versus confounding in DAGs is Dr. Miguel Hernan’s online course.
Are Offensive Personnel or MIB More "Important"?
MIB has the bigger effect in a model with both variables; end of story, right?
Yes and no. MIB has a stronger association with rushing success, and most of offensive personnel’s effect is due to how it changes MIB.
We also must consider whether we have a predictive or causal question, our target audience (important for whom?), and – if our question is causal – how modifiable each variable is.
To the offense, offensive personnel’s effect is more important because it’s what they control – even if any effect is largely through its impact on MIB. For the defense, MIB is more important because it’s what they control (and it has the stronger association).
If you simply want to predict – but not change! – rushing success, MIB is also more important because it’s the stronger predictor. But if you want to talk about how to change rushing success – for example advocating for more 11 personnel or not – you’re back in causal territory and need to be more careful.
If you're interested in learning more about the differences between predictive and causal (a.k.a. explanatory) modeling, a long but pretty readable paper is the classic "To Explain or to Predict?" from Galit Shmueli.
Another Example from Public Health
We all agree smoking raises your risk of lung cancer. But how? Among other things the chemicals in cigarette smoke can cause mutations in the DNA of your lung cells, making them cancerous. The actual DNA mutation – if we could measure it – would show a stronger association with lung cancer than smoking simply because it’s closer in the “causal chain.”
So which is more important? The cancerous mutation is the immediate cause of the cancer and the stronger predictor of it, but we can’t snap our fingers and prevent that mutation. We can, however, stop people from smoking. In that sense smoking matters more despite “cancerous mutation” having a stronger association with cancer.
Also consider what would happen if we looked at smoking while “adjusting” for cancerous mutations, such as by including both in a regression model. We would see no effect of smoking “conditional” on whether you did or did not have a cancerous mutation. That is, in those who did not have a cancerous mutation, whether they smoked would appear to have no effect on their risk of cancer. Same for those who did. But that does not mean that smoking is unimportant.
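A few lines of arithmetic show why adjusting for the mutation makes smoking look harmless. The Python sketch below assumes, purely for illustration, that smoking affects cancer only through the mutation, and the probabilities are invented rather than taken from any study.

```python
# Made-up probabilities for a world where smoking affects cancer only through the mutation.
P_MUTATION = {"smoker": 0.20, "non-smoker": 0.02}
P_CANCER = {True: 0.50, False: 0.001}   # keyed by "has the mutation?"; smoking does not appear

def cancer_risk(smoking, mutation=None):
    """Risk of cancer given smoking status, optionally also given mutation status."""
    if mutation is not None:
        return P_CANCER[mutation]   # once mutation status is known, smoking adds nothing
    p = P_MUTATION[smoking]
    return p * P_CANCER[True] + (1 - p) * P_CANCER[False]

for mutation in (True, False):
    gap = cancer_risk("smoker", mutation) - cancer_risk("non-smoker", mutation)
    print(f"mutation={mutation}: risk difference {gap:.3f}")

overall_gap = cancer_risk("smoker") - cancer_risk("non-smoker")
print(f"not adjusting for mutation: risk difference {overall_gap:.3f}")
```

Within each mutation group the risk difference is exactly zero, yet smokers overall carry about 9 percentage points more risk, because smoking changes who ends up in the mutation group.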
In this analogy offensive personnel is smoking, the lung DNA mutation is MIB, and cancer is rushing success. (You should pass more.)
Caveats
This post oversimplifies the game to focus on the main point: how to approach predictive versus causal questions. We did not consider any factors like player skill and team tendencies that would have substantial impacts on both offensive personnel and MIB choices. Nor did we consider different measures of offensive personnel, game flow, or any feedback between offensive and defensive choices.
We also know that all yards are not created equal – this is the whole idea behind DVOA. This analysis would be better with a metric, such as DVOA or EPA or WPA, that better captures the true value of each running play. But this article just uses YPC to keep things simple – again, it’s intended mainly as a tutorial, not a comprehensive analysis – and to be consistent with the Big Data Bowl, which asks participants to predict yards gained.
Finally, because rushing yards are right-skewed, average YPC isn’t the best metric to use. Some form of quantile regression, or simply comparing the full predictive distributions, would be better. The latter is what the Big Data Bowl is smartly asking participants to do, but I've just presented YPC here for simplicity.
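To see why the average can mislead for right-skewed outcomes, here is a small Python sketch that draws toy "yards" from a lognormal distribution and compares the mean with the median and other quantiles. The parameters are arbitrary and are not fitted to any rushing data.

```python
import random
from statistics import mean, median, quantiles

random.seed(2)

# Toy right-skewed "yards" drawn from a lognormal; parameters are arbitrary.
yards = [random.lognormvariate(1.2, 0.8) for _ in range(50000)]

deciles = quantiles(yards, n=10)   # 10th, 20th, ..., 90th percentiles
print(f"mean   {mean(yards):.2f}")
print(f"median {median(yards):.2f}")
print(f"10th / 90th percentile: {deciles[0]:.2f} / {deciles[-1]:.2f}")
```

A handful of long gains pulls the mean well above the median, so two situations with the same average can have quite different typical outcomes. Comparing quantiles or whole distributions avoids that trap.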
All analyses are adjusted for down-and-distance, though those are not written in the DAGs.
The take-home message here is to think carefully about the exact question – descriptive, predictive, or causal – you’re trying to answer and choose a suitable approach. Write it down. If it’s a causal question, identify the thing you want to change (exposure) and draw a DAG to make sure you do your analysis correctly – in most cases, controlling for confounders and avoiding mediators.
So should you throw both offensive personnel and MIB in a model and see what happens? It depends on your question! If you simply want to predict rushing success: sure. If you want to estimate the total effect of switching offensive personnel: no.
14 comments, Last at 25 Oct 2019, 2:45am
#1 by theslothook // Oct 23, 2019 - 2:37pm
This is a classic endogeneity problem in economics. If the question is, "What is my expected ypc if I go 11 personnel?", then both regressions will fail for different reasons.
Regression 1 fails because you cannot use MIB ex ante. It's like forecasting an airplane's flight duration by including the landing time.
Regression 2 fails because you are omitting an important variable which causes bias. Like trying to predict a person's income by age without including education.
The right approach is to predict MIB and use that prediction in your model to predict ypc.
#4 by Zach Binney // Oct 23, 2019 - 3:47pm
We might end up talking past each other because you're speaking economics and I'm speaking epidemiology, but I don't think I agree with what you're saying.
Your approach is one approach (prevalent in econ, not so much in epi). Not the right approach.
As I explain in the article, MIB is a mediator. When trying to estimate the total causal effect of an exposure (offensive personnel), you don't control for mediators. There's no "bias" by not controlling for it. There's only bias by controlling for it.
Your situation sounds more like time-varying confounding (in econ-speak I think that's "omitted variable bias?"), where the same variable (at different times) is a mediator and a confounder.
I'm also not sure what you mean by you can't use MIB because it's ex ante. It's after the exposure (offensive personnel) but before the outcome. Of course you can use it to predict the outcome. It's just a mediator, and as such it's part of the effect of offensive personnel.
This is all assuming the question you've outlined is even causal. I'm not sure if you meant it to be causal or predictive.
#5 by theslothook // Oct 23, 2019 - 4:27pm
Agreed, there is a danger we might be talking past one another. I'm not familiar with statistical jargon related to epidemiology. It's bad enough random effects and fixed effects mean different things between psychology and economics.
Let me try to be clearer. If the question is..."Hey small fry, I want to run the ball on the next drive, can you tell me which personnel grouping will give me the best chance of success" - then you can't use MIB presumably because that only occurs after the personnel grouping has been declared. In other words, it might be 11 personnel depending on if the defense reacts by going light. However, it might be 22 because MIB might be heavy. You don't know in real time.
Maybe I was confused by your equation and the setup in general, but as I understand it, if you were trying to predict ypc based on some factors like down and distance, etc., and then you added your offensive personnel but left out the MIB, that would be committing an omitted variable bias. Because we know MIB explains some of the variation in ypc, leaving it out essentially causes the right-hand error term to become correlated with the left-hand variable, and you get biased estimators.
#7 by Zach Binney // Oct 23, 2019 - 5:15pm
OK, I think I see what you're saying. There are two different points here:
Don't use MIB when "predicting" running success based on changes in offensive personnel - obviously I agree because that's what this whole article is about. You're saying you can't do this because MIB comes after offensive personnel, I'm saying don't do it because it's a mediator. Not sure how close those two are, but I think they're similar issues just coming from different disciplines.
Omitted variable bias - this seems to be pretty close to confounding? I'm not an expert here, but from some quick reading OVB occurs when a relevant predictor correlated with another predictor is left out. Confounding is when we have a predictor that causes both our outcome and exposure. But here we have a situation where we're leaving out a predictor caused (based on a priori knowledge that, very broadly speaking, defense responds to offense) by another. That means that you actually want MIB's effects subsumed into the coefficient of offensive personnel, as that more accurately describes what offensive personnel does. Doing that doesn't introduce a bias. It is very very common knowledge in epidemiology that it is a bad idea to control for mediators unless you want only an exposure's direct effects, and that's exactly what MIB is here. It's not a confounder and I can't see how excluding it would induce OVB.
Excluding MIB makes your predictions less accurate because it's an important predictor (which is why you should include it in any predictive model!), but doing so shouldn't skew the causal effect of offensive personnel because MIB is caused by, rather than causes, offensive personnel.
#9 by theslothook // Oct 23, 2019 - 5:27pm
I see what you mean now. Yes we use confounding variables in economics as well, but they are slightly different from omitted variable bias.
If I understand your point, you are saying - yes we are leaving MIB out, but whatever variation is lost is actually captured within the variation of offensive formation, thus there is no error bias by leaving it out. It's akin to saying, yes we didn't include your letter grades when predicting income, but since we have your years in college and the coursework you took... we are gravy.
#3 by Zach Binney // Oct 23, 2019 - 3:43pm
Not at all! The idea is you read this, hopefully understand some of it, and become more qualified. None of us is qualified, really. We're just all trying to stumble a little closer to the truth every day.
#6 by theslothook // Oct 23, 2019 - 4:32pm
As an aside, I think MIB and offensive personnel have a weird dual causality going on that makes estimation particularly difficult. As I see it, an offense will make a guess as to what personnel grouping a defense will bring out based on down and distance while the defense makes its own guess. Then, once the personnel have been declared, the respective coaches will augment the formation to fit the personnel on the field.
Isn't that ultimately a wash you might ask? No, because revisiting our original question..."coach, it depends. If we trot out a full spread lineup, then we can expect the defense to bring out a bunch of corners and safeties, but if we suddenly go into a tight formation, then maybe the defense will still stay light and THAT will lead us to getting the highest ypc"
In essence, a simple model may give you a one size fits all answer to a very context dependent problem. This is kind of what I meant by endogenous. The variables themselves have a weird mixture of causation and reverse causation depending on time varying circumstances that makes the regression model a mess.
It's what plagued economics circa 1970.
#8 by Zach Binney // Oct 23, 2019 - 5:19pm
This I completely agree with and mention it in the caveats (basically, I ignored it because this is just a very basic tutorial).
In general, though, there's only limited time in which teams can respond, and usually it's the defense responding to the offense. The offense declares its personnel, then the defense is allocated sufficient time to make substitutions if there were any changes. Even if they don't make any subs, though, the defense can move players into or out of the box based on what the offense shows. The offense might audible to a different playcall, but they can't change their personnel at that point. So there's not really that much feedback because the offensive personnel pretty much has to be finalized before MIB is, and we're only working with what both are at the time of the snap (i.e. the final position).
#10 by Lost Ti-Cats Fan // Oct 23, 2019 - 8:39pm
The next iteration is likely "how many runs can you make out of 11 personnel before the defense is likely to add an extra man in the box?" Or maybe it's a ratio, "at a pass:run ratio of 2:1 the odds we face 6 MIB is X, but at 3:1 it drops to Y." Or maybe it's based on run success, "when our average yards gained on runs out of the 11 personnel exceeds 4, the odds we face 6 MIB increases by X%"
#12 by Dan // Oct 24, 2019 - 2:56am
Sorting out causality is actually a lot more complicated than this.
For one thing, the offense can change its play after the defense has lined up. There is a causal arrow from MIB to the play call (run vs. pass). The historical success of runs out of 11 does not tell us the track record of teams that chose to send 11 personnel in there with a run call. It tells us the track record of teams who chose to send 11 personnel in there, and then saw the defense's formation, and then decided to (either stick with or audible to) a run call. If we decide to run more and pass less, then we're effectively taking cases where the defense's alignment was one that the offense chose to pass against, and changing those offensive play calls into runs. (Hopefully we're doing this with the borderline cases, rather than running against the extreme cases where the defense is really strongly selling out to stop the run.)
There are also increasingly many RPO plays where the run vs. pass decision is made after the snap.
Also, the defense's decision about how many men to put in the box (and how much to focus on stopping the run after the snap) depends on what they've seen on tape from your team (and the rest of the league). If you call more and more runs out of 11 personnel, then other teams will notice that tendency and treat your 11 as more of a run personnel rather than a pass personnel, and then it can get harder to run out of that personnel. Your data tell you what has worked in the historical equilibrium, but once you've decided to do things differently based on those data then the other side responds and you move to a new equilibrium.
Also, if the coach is asking "Should we run out of 11 personnel more?" he probably cares about whether that will help the team win, not just about what will happen on the plays when we run out of 11 personnel. And the decision can have effects on passing plays as well as on running plays. The threat of either running or passing puts pressure on the defense, and changing the offense's run tendencies changes when and how the defense feels pressure to respect the run.
A relatively straightforward example of this is with play action passing. If you want to fool the defense with play action, you should call it with personnel, formations, game situations, and initial post-snap action that are similar to running plays. So if you switch to running less out of 12 and more out of 11, then you should probably also run more of your play-action passes out of 11 rather than 12. If 11 play-action works worse than 12 play-action, then that's a disadvantage to shifting your run game into more 11 personnel; if 11 play-action works better, then that's an advantage of the change in rushing tendencies.
These sorts of intricacies take us past the point where we have the data necessary to draw the causal graphs and make calculations. In my opinion, they're a big part of what makes football and football analytics fun & interesting & challenging.
#14 by RobotBoy // Oct 25, 2019 - 2:45am
It's why good coaches want to have the personnel to run or pass effectively out of any formation and why having personnel that allow you to do that is so invaluable. When they had Gronkowski, NE was able to give defenses fits; go light and he helps run you over. Go heavy and he's blowing by an LB in the open field. On the other side, having DBs who tackle well helps to minimize the risk of teams bowling you over in the ground game (LBs who can cover don't hurt either). NE was always going to struggle adjusting to life without him and some of the offensive issues stem directly from his absence. NE's success has almost always depended on Belichick winning the scheme battle - he doesn't have enough high-level skill players to beat you when you know what's coming. He tried to prepare for Gronk's absence by loading up on pass-catching backs but injuries have limited the new-look offense. Even so, the Pats still do an excellent job of baiting defenses with play-action that sets up exactly the same as a previous run, down to pulling linemen and the way TEs hold blocks. Brady does a beautiful job of selling the run which is no surprise, he's been at it for a while.
#13 by Zach Binney // Oct 24, 2019 - 6:50am
No, not news to me at all. That's why I wrote the caveats section. You make some great points; this article just wasn't designed to address them. It's a basic tutorial about causality using a massively oversimplified example problem that people were still mis-analyzing. It is certainly not a comprehensive analysis of what would happen if a team ran more out of 11. That would require, like, a book, and this was already at 3000 words, and I didn't want to muddy up the tutorial intent of the article with more football intricacies. :)