Data Science versus Econometrics
“I’ve been reading about data science, and I noticed that a lot of the econometric models I know kept coming up. What is the difference between them?”
That’s a question I’ve been asked a few times over the past year and a half. It’s taken different forms (sometimes it’s a data scientist reading about econometrics), it’s come from classmates and coworkers, and it’s even cropped up in job interviews. I read once that if three people ask something it’s worth writing about, so let’s get to it.
It tends to come from people who know a lot about one field and a little about the other, like my coworker who works in economic consulting and was curious about data science and machine learning. Short answer: I see four main differences:
- What the results of an analysis look like.
- How important interpretability is.
- Which models and techniques are used.
- Their approach to the bias-variance tradeoff.
For the long answer you’ll need to read a bit more. But before we can get there, we need to pin down exactly what we mean by Machine Learning (ML), Econometrics, and Data Science.
- Machine Learning: an ever-growing set of algorithms and statistical techniques which allow a computer to extract some result or pattern from data without being explicitly programmed to do so. For example: linear regression, decision trees, neural networks.
- Econometrics: the application of mathematics and statistics to answer questions in economics, usually in an academic context. For example: using data to find the effect of education on income.
- Data Science: the application of mathematical, statistical, and computational techniques to data in a specific area, generally (but not always) in a business context. For example: forecasting a business’ sales for next year.
These definitions might leave some things out. Machine Learning is a field of study in its own right where one might invent new theoretical techniques, evaluate algorithms’ performance under a variety of conditions, or examine the ethical use of these techniques. Here though, I’m referring to it as a toolbox of algorithms and techniques.
We can say that econometrics and data science both use techniques from machine learning. But the definitions of econometrics and data science are still very similar, so what gives? A close reading will reveal two differences in the definitions: domain and setting. Econometrics is specific to economics, while data science is broadly applicable. Does that make econometrics a type of data science? I wouldn’t say so, although there is a substantial overlap between the fields.
The other difference is setting. Econometrics is an academic, scientific, experimental field used to test economic theory; it is to economic theory what experimental physics is to theoretical physics. Therefore it places great emphasis on causality, on answering the question “How do we know that X causes Y?” Data science, on the other hand, is a more outcome-based, predictive field. It cares more about “How well does X predict Y?” with less of a mind towards whether X causes Y. This is where those four differences from earlier come in, and where the two fields take techniques from machine learning and run in different directions.
I think it’s worth pausing here for a disclaimer. I’m painting these fields in broad strokes, and my definitions and observations are neither complete nor absolute. You will find data scientists who focus very heavily on explainable machine learning and causal inference, and econometricians doing prediction or even applying computer vision or natural language processing to economics. Even so, I think these are useful generalizations when trying to delineate the majority of work in both fields.
Now, back to those four differences. We’ll start with the results of an analysis. If you were to pick a random economics paper and read through the methodology section, there is one machine learning technique you would almost certainly come across: ordinary least squares (OLS) linear regression. If you’re coming from a data science background, you’ve almost certainly used linear regression before, but most economics papers use it in a way that might seem unusual to you. You won’t see a train-test split. Features will be chosen based on economic theory, not through feature selection. Most notably, you won’t see any predictions.
In econometrics, the output of an analysis isn’t a prediction about unseen data. For an economist, the interesting parts of a regression are the coefficients and their significance, because the end goal is answering a question like “Can we prove that, all else equal, a change in someone’s level of education causes a change in their income?” Before you can argue for such a causal relationship, you have to demonstrate that a relationship exists in the first place. And that relationship isn’t demonstrated and quantified with predictions; it’s done with coefficients.
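To make that concrete, here’s a minimal sketch of what that econometric style looks like in code. The data and the column names (income, education) are made up for illustration; the point is that the whole sample is used, nothing is held out, and the “result” is the coefficient table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: income (in dollars) and years of education for 500 people.
rng = np.random.default_rng(0)
education = rng.integers(10, 21, size=500)
income = 15_000 + 3_000 * education + rng.normal(0, 10_000, size=500)
df = pd.DataFrame({"income": income, "education": education})

# Fit OLS on the full sample -- no train-test split, no held-out data.
model = smf.ols("income ~ education", data=df).fit()

# The "result" is the coefficient table: estimates, standard errors, p-values.
print(model.summary())
print(model.params["education"])  # estimated effect of one more year of education
```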
On the other side of this, if you were to take a set of coefficients and submit them to a Kaggle competition, you wouldn’t score very well. Why is that? Well, Kaggle — as the “Home for Data Science” — isn’t expecting coefficients; it’s expecting predictions. Data science in general revolves around prediction. In other words, econometrics is based around inference, which, to quote An Introduction to Statistical Learning (ISLR), is “…understanding the association between [the output] Y and [our input variables].” Conversely, data science is based around prediction: the use of observed data to make educated guesses about unobserved data.
We can illustrate this with a hypothetical. Imagine you’re a data scientist working for a business which wants to measure and predict employee efficiency on a sales team to help create quotas for new team members. Your output variable might be sales in dollars, and your features might be things like years of education, tenure with the current company, years of total experience, and salary. You train a linear regression model on most of the sales team, test it on some held-out observations, and find that it does a good job of predicting sales figures. That year, you run the predictions for some newer members of the team and find that they line up well with what actually happens.
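A minimal sketch of that workflow might look like the following. The data here is synthetic and the column names are hypothetical, but the shape of the exercise is what matters: split, fit, and judge the model by how well it predicts the held-out salespeople, not by its coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic sales-team data, one row per salesperson (column names are made up).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "education_years": rng.integers(12, 20, n),
    "tenure_years": rng.uniform(0, 10, n),
    "other_experience_years": rng.uniform(0, 15, n),
})
df["total_experience_years"] = df["tenure_years"] + df["other_experience_years"]
df["salary"] = 40_000 + 4_000 * df["total_experience_years"] + rng.normal(0, 5_000, n)
df["sales"] = 50_000 + 10_000 * df["tenure_years"] + 0.5 * df["salary"] + rng.normal(0, 20_000, n)

X = df[["education_years", "tenure_years", "total_experience_years", "salary"]]
y = df["sales"]

# Hold out part of the team; the result that matters is out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("MAE on held-out salespeople:", mean_absolute_error(y_test, model.predict(X_test)))
```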
An economist would point out a few problems. There may be some reverse causality between salary and performance, in that high performers command higher salaries. Furthermore, there is multicollinearity between tenure and total experience, since total experience = tenure + other experience. Still, it may be the case that removing any of those features would result in worse predictions by whichever metric is chosen for evaluation. The main issue this hypothetical economist would raise is that this method can lead to biased (i.e. incorrect) coefficients. That is to say, the coefficients will be far away from the true, causal relationship between input and output.
Because the economist cares about inference through coefficients, having inaccurate ones makes it hard to extract meaning from them. This brings us to the second point: model interpretability. What does it mean for a model to be interpretable? To quote this Amazon white paper:
> If [an organization] wants high model transparency and wants to understand exactly why and how the model is generating predictions, they need to observe the inner mechanics of the … method. This leads to interpreting the model’s weights and features to determine the given output. This is interpretability.
This should sound very familiar if you’ve been following along. In terms of linear regression, an interpretable model is one that lets us look at the coefficients and see how much each variable contributes to the final output. Interpretability is not a binary; it is a spectrum. Furthermore, it goes beyond the results of a single fitted model: interpretability is generally discussed for entire classes of model — keep this point in mind for later.
Why is interpretability important in econometrics? If your goal is to infer the effect of one variable on another, you need to be able to say exactly what that effect is. That’s interpretability, and it is crucial for doing good inference. Going back to ISLR, in prediction “[the estimator f-hat] is often treated as a black box, in the sense that one is not typically concerned with the exact form of [f-hat], provided that it yields accurate predictions for [the output] Y.”
In short, you can do prediction without interpretability. In fact, interpretability can even be counterproductive to accurate prediction. That is not to say that interpretability in prediction, or in data science generally, is useless. It’s a fascinating area of study in and of itself, and one which is becoming more and more important as machine learning and data science become part of everyday life. I’d encourage you to read more about it.
Conversely, inference needs to rely on interpretability because without interpretability there is no inference. To quote ISLR again, when doing inference “[the estimator] f-hat cannot be treated as a black box, because we need to know its exact form.” Since this “exact form” is tied to the algorithm used for estimation, interpretability requirements give us the third difference: which models are used.
The chart above, taken from the Amazon white paper I quoted earlier, illustrates the tradeoff between a model’s performance and its interpretability. There is some scholarly debate about whether this performance-interpretability tradeoff actually exists, but for the time being we’ll accept it. Digging into that debate is beyond this article’s scope, and the tradeoff is a useful enough mental model for our purposes.
Looking at the chart, we see the common econometric models clustered in the upper left. Recall that linear regression is the bread and butter of econometrics. Logistic regression is less common in econometrics, but then so are classification problems. Decision tree models are rarer still, but come up occasionally, especially in finance.
After that, you run into machine learning algorithms which almost never appear in economics. I can think of one paper that uses neural nets and SVMs, and it’s an unorthodox paper to say the least. Even then, it doesn’t use those models to do inference: the main conclusion is drawn with linear regression, while the other models are used to derive intermediate variables from unstructured data.
We could go through each of these models in turn and talk about how they work and why you would or wouldn’t want to use them in econometrics, but that’s been done already. Instead, we’ll look at the extremes. Linear regression is great for inference because the coefficients tell you exactly what the effect of a variable is — “a one-unit increase in X increases Y by B” is easy to explain. Even if you introduce things like interaction terms or polynomials, it still tells you a lot about the process underlying your data. However, if you want to make predictions, you’ll need to spend a lot of time on feature engineering to avoid running afoul of the assumptions of linear regression.
Neural networks are not like that. The ones we throw at complex problems tend to be wide and deep, with non-linear activation functions, so finding the effect of a one-unit change in one input is a long and complex process that may not actually tell you very much about the underlying data. Neural nets are powerful because they can automatically find patterns in the data, such as non-linearities and interaction effects, without being explicitly told to look for them, but this comes at the expense of easy interpretability.
Additionally, neural networks are not necessarily deterministic, unlike linear regression which has a closed-form solution. Two networks trained on the same data can have different results, which means if you’re an economist trying to learn about the effects of education on income, and you put your data into two neural networks, you can get two different results. They might be close, but how do you know which one is right? How do you know either is?
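You can see this non-determinism with a toy sketch (using scikit-learn’s MLPRegressor as a stand-in for a neural network; the data is made up): refitting OLS gives identical coefficients every time, while two identical networks with different random seeds give slightly different answers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Toy data: income (in thousands) as a noisy function of years of education.
rng = np.random.default_rng(0)
education = rng.uniform(10, 20, size=(300, 1))
income = 15 + 3 * education[:, 0] + rng.normal(0, 5, size=300)

# OLS has a closed-form solution: refitting gives identical coefficients.
print(LinearRegression().fit(education, income).coef_)
print(LinearRegression().fit(education, income).coef_)

# Two identical networks, different random seeds: similar but not identical fits.
for seed in (0, 1):
    net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=seed)
    net.fit(education, income)
    print(seed, net.predict([[16.0]]))  # predicted income at 16 years of education
```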
If you’re a data scientist who wants to make predictions on a dataset with thousands of features and millions of rows, something like a neural network makes a lot of sense. Even if your model’s parameters aren’t “fixed”, as long as your predictions are good, the neural network is working for your purposes. This applies doubly if your data isn’t numerical, tabular data, but is instead something like sound, images, or text: DALL-E and ChatGPT both use neural networks.
So far, we’ve seen three of the four differences I laid out. Econometrics and data science have differing goals, which define how they approach interpretability and model choice. These three all feed into the fourth, and the most technical of the differences: how each field approaches the bias-variance tradeoff.
This important problem in statistics is a balance between two attributes of models, aptly named bias and variance. Both of those have specific, mathematical definitions which we’re going to ignore for now. If you want that, I’ll point you again to ISLR. Instead, we’ll be working with these simplified, plain-English definitions:
- Bias: A systematic difference between an estimated value and a “true” population value, i.e. the amount by which a model is wrong.
- Variance: how much our estimated function would change if it were trained on a different dataset with the same features.
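For readers who do want the math, the standard decomposition (the version given in ISLR and most statistics texts) splits the expected squared error at a point into exactly these two pieces plus irreducible noise:

$$
\mathbb{E}\Big[\big(y_0 - \hat{f}(x_0)\big)^2\Big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
+ \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}}
$$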
At first glance, bias seems like a much worse problem than variance. After all, a model being wrong seems like a huge issue. However, to quote statistician George Box, “All models are wrong, but some are useful.” Bias is an unavoidable part of modelling, because modelling always involves simplifying assumptions which introduce it. The issue then becomes identifying whether or not your model is still useful. If your data is almost but not quite linear, a linear model will be biased but might be a good enough approximation. If the data is nothing like linear, drawing a straight line through it will net you very little.
Variance is also more of an issue than it might initially seem. Consider this situation: you have one set of data, you train a model on it, and you get a result. Then you do the same on a similar set of data and get a totally different result. Is that difference a function of some underlying difference between the datasets, or of variance in your model?
Bias and variance mirror two ML concepts: underfitting and overfitting. Underfitting is when a model does not adequately capture patterns in the data, resulting in a high-bias model (think linear approximations of non-linear data). Overfitting is when your model captures too much noise in your data and cannot generalize to unseen data, creating a high-variance model.
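A classic toy illustration of both (a sketch with made-up data, not tied to either field’s real workflows): fit polynomials of increasing degree to noisy non-linear data and compare training error with error on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy non-linear data: y = sin(x) plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=(80, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.3, size=80)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typically: degree 1 underfits (high bias, both errors high), degree 15 overfits
# (low training error, worse held-out error), and degree 4 sits in between.
```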
How do these concepts relate to the differences between econometrics and data science? Econometrics tends to focus primarily on minimizing bias, while data science treats it more as a trade-off. As with model choice, this stems from our first observation: that econometrics does mostly inference while data science does prediction.
If you are an economist testing a hypothesis about two variables, being able to say with confidence that your result is as close as possible to the true relationship is vital. You want to heavily reduce bias so that your model is the least wrong it can be. Variance is less of a concern. It would be nice if the relationship you observe generalized (i.e. held when estimated on other data), but it is not a requirement. Variance does play an important role in validating and interpreting econometric models, but it is treated more as a theoretical property than as a practical consideration to control for. Furthermore, variation in results could come from unobserved differences between the datasets, which can become a topic of investigation in and of itself.
Data science treats bias and variance as more of a trade-off. Underfitting means your model won’t perform well. Overfitting is a bigger problem because your model won’t perform well on new data, yet it will look great on the training data. To combat this, data science has a wealth of techniques which decrease variance at the cost of increasing bias. These make for better predictions, but also more bias and less explainability.
Practically, this looks like regularization, ensembling, and augmentation. Regularization like Ridge and LASSO reduces variance by making models simpler: it shrinks (i.e. biases) the coefficients by penalizing their size, which makes inference less clear but handles overfitting. Ensemble techniques rely on training many models on the same dataset (or subsets thereof). Having many models makes it harder to infer variables’ exact effects, but ensemble methods reduce variance. Finally, augmentation involves changing your data in some way to create additional, synthetic data.
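Here’s a small sketch of the regularization point (synthetic data, arbitrary penalty strengths): Ridge and LASSO pull the coefficients toward zero relative to plain OLS, accepting some bias in exchange for a model that is less sensitive to the particular sample it was trained on.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 20 features, only the first three actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_coefs = np.zeros(20)
true_coefs[:3] = [5.0, -3.0, 2.0]
y = X @ true_coefs + rng.normal(0, 5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks all coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)    # shrinks, and zeroes some out entirely

# The penalized coefficients are biased (pulled toward zero), which muddies
# inference but reduces variance and helps with overfitting.
print("OLS:  ", np.round(ols.coef_[:5], 2))
print("Ridge:", np.round(ridge.coef_[:5], 2))
print("LASSO:", np.round(lasso.coef_[:5], 2))
```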
At the risk of belaboring the point: these technical differences result from the differing goals and problems of each field. Take data augmentation as an example. If you’re trying to understand economic patterns in census data, inventing new synthetic data points (i.e. new households) from your existing data sounds like lunacy. If you’re doing image recognition and want to “predict” a label from image data, then your algorithm should recognize a photo of a dog whether or not it’s been flipped, so a flipped copy makes a perfectly good extra training example.
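The image case is easy to sketch in a few lines of plain NumPy (real pipelines would use a library’s augmentation utilities, and the “image” here is just random pixels standing in for a photo):

```python
import numpy as np

# Pretend this array is a photo of a dog: height x width x RGB channels.
image = np.random.default_rng(0).integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
label = "dog"

# A horizontal flip is a new, synthetic training example with the same label.
flipped = np.fliplr(image)
augmented_dataset = [(image, label), (flipped, label)]
print(len(augmented_dataset), "training examples from 1 original image")
```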
That last example works well to tie together all of the differences. Image recognition is a prediction problem in the purview of data science. Therefore, building a low-variance model helps avoid misclassifying new images. Explainability might be a nice-to-have, but it’s not necessary. This comes together to inform which tools we use. Most modern image classification uses neural networks, which fit the criteria outlined above.
Finding demographic patterns in census data on the other hand is by its nature a problem of inference. Because the coefficients are the results, without explainability we don’t have results. We also won’t be applying the model to new data, although we may run the same process of inference on similar datasets (i.e. other countries’ censuses or older censuses). Without new data, minimizing bias becomes more important than minimizing variance: we want to be able to say that our coefficients match the data with a high degree of certainty. Different estimations from different datasets are expected and create new avenues for research. To accomplish this, we are going to use some flavor of linear regression.
Econometrics and data science draw from common techniques in machine learning, and overlap in many places. Still, they are different fields applied to different problems. Data science is, at its heart, a predictive venture which values broad effectiveness over explainability, causality, and theoretical correctness. Econometrics sets out to test theoretical — and usually causal — hypotheses. Consequently, it values interpretability in its methods and exact estimation over generalizability. Neither is “better” than the other in any overarching, innate way. In a similar vein, you might find common tools and techniques among carpenters and stonemasons, but which tools are better depends on what you want to build.
Special thanks to Rajas Pandey, Jenna Goldberg, and others for early feedback on this article.