Over the past year or two I have been trying to delve into the world of machine learning, to angle myself for a job in data science. (Hire me!) Data science is a pretty broad discipline, and covers everything from basic descriptives and visualizations to complex deep learning algorithms and AI. But a key part of data science is machine learning. As I have gone through this process of understanding machine learning, however, I’ve realized that there are a number of tools and procedures that would be useful in psychology as well.[1] So let me share with you some of the wonders of machine learning!
You’re Already Doing Machine Learning…Sort of
The first thing to note is that machine learning uses some of the same tools as psychologists do, but with a different aim in mind. The major tasks in machine learning involve regression (i.e., predicting continuous values), classification (i.e., predicting categorical values or groups), and clustering/segmentation. That means if you’re familiar with linear regression, logistic regression, and factor analysis, you understand the basic idea of each of these tasks, respectively. Linear and logistic regression are common tools for machine learning as well. Data scientists also use a variety of other techniques, depending on the task at hand.[2] But at the core, their primary purpose is to do regression and classification tasks, just like linear and logistic regression do. So if you understand those two, you understand the process of machine learning (even if the specifics of these other techniques are foreign to you). For clustering/segmentation, if you’re familiar with factor analysis and principal components analysis (PCA), you’ve got the general idea.[3] If you understand why people use factor analysis, you understand the basic idea of clustering. Turns out you’ve been doing machine learning all along!
But only sort of. The major difference between scientific research and machine learning is (often) in the general goals. As psychological scientists, our goal is primarily to explain relationships among variables—we develop causal models and theories to provide an explanatory role, and our statistical methods are used to test these explanations. We care about the nature of the relationship between predictors and a criterion, and as such we will take care to examine and interpret the regression coefficients carefully. For example, we might add predictors to a model with covariates and test whether the predictors explain additional variance over the covariates. In contrast, the goal of machine learning is generally to predict. There’s a particular criterion we want to predict as accurately as possible (e.g., number of sales, which movies Netflix customers will watch), and we grab whatever variables are necessary to predict that criterion. The nature of the relationship between predictors and criterion is not as important as the existence of the relationship. If a variable helps to decrease the error in prediction, throw it in the model! Is it a linear, quadratic, cubic relationship? Doesn’t matter, try them all! This is why machine learning tools are sometimes described as a “black box”: the exact values of the coefficients don’t matter. Often there is little emphasis on significance testing of coefficients. Data scientists will look at the overall R², but whether a coefficient is positive or negative is inconsequential. What matters is the accuracy in prediction at the end.
Of course, I don’t want to draw too much of a sharp line between these approaches. Hypotheses about what variables might exert some sort of causal effect can help to determine what variables are important to measure and include in a model. Thus, explanation helps with prediction. And of course, prediction is important in order to show evidence for a theoretical model. The point is that the primary goals are often different. Data scientists often don’t bother with things like p-values. As long as a variable has a non-zero effect (note that I don’t mean “significantly different from zero”, just “not exactly zero”), it can be useful in prediction. And confounds like the third-variable problem (i.e., that a relationship between variables is spurious) don’t really matter if the end goal is prediction.
So why might psychologists ever be interested in this approach? After all, we want theories and models! This is true; however, I think there is still value to considering a pure prediction approach in some circumstances. In some cases, we have particular criteria that are important to predict (for instance, variables with high social impact like HIV rates or high school drop-out rates), but little existing theory to identify a particular course of action. In that situation, it might be worth considering a machine learning approach. It does mean taking a more exploratory approach, but maybe it’s worth throwing everything at the wall and seeing what sticks. That work can then be used to inform future theory. You don’t have to commit to being entirely atheoretical, but sometimes just grabbing as much data as you can and exploring new avenues is a great way to open up a new solution you might not have thought of before. Of course, exploratory analyses can run into the issue of false positives—but more on how that is managed in machine learning below.
The Perils of p-Hacking
Let’s get one thing straight: The best way to avoid p-hacking is to not care about p-values at all. And that’s often the case in machine learning. Like I said, the goal is prediction, so any variable that aids in that goal at all is useful, even if it doesn’t necessarily show a “significant” relationship. (Of course, if you’re dealing with vast amounts of data, as data scientists often are, p-values get less helpful anyway—they’ll either be highly significant or highly non-significant.)
But still, most forms of p-hacking are, at their core, a way of taking advantage of false positives—of chance variation in your particular data set that does not generalize. And over-fitting to the particulars in your data can still be an issue with machine learning—especially given how exploratory it is. So how is this dealt with?
If you read anything about machine learning at all, you’ll read about the bias/variance tradeoff. If you’re like me and these terms are not intuitive at all to you, you can think of them as under-fitting and over-fitting, respectively (see the image below). Psychology often has a problem with under-fitting. We get giddy at the thought of an R² of .2 or .3; we typically deal with linear relationships and linear interactions, even if the ground truth is probably more complex than that. We usually deal with few variables even though that is almost never a complete model of all the variables influencing a particular phenomenon. The bottom line is that under-fitting means bad predictive accuracy. But over-fitting can be just as much of a problem. When you fit too much to the specifics of your data set, you can end up with a model that doesn’t generalize well. Sometimes you get outliers or strange patterns that you don’t want to fit into your general model. And given that machine learning is often extremely exploratory, that can be even more of a problem. So how do data scientists deal with this? There are two methods: regularization and the train-test split.
When psychologists build a model, they often build up—start with linear terms, maybe throw in an interaction if you’ve hypothesized that, possibly add a quadratic effect. Data scientists are more prone to pare down—if you have three predictors, throw in all interactions and exponents up to, say, the fourth degree. That means not only a*b*c, but also a²*b*c, a³*c, b⁴, and all sorts of other combinations. That is a lot of terms,[4] and they add up fast. It’s even worse given that data scientists could be dealing with hundreds of variables. So how does this get dealt with, without (a) blowing up the computer and (b) taking advantage of chance? Regularization is a key tool.
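To get a feel for how quickly those terms pile up, here is a minimal Python sketch that lists every product of three predictors up to the fourth degree. It is my own illustration rather than anything from a particular package, and the function name `polynomial_terms` is made up for this example.

```python
from itertools import combinations_with_replacement

def polynomial_terms(variables, max_degree):
    """List every product of the variables with total degree 1..max_degree."""
    terms = []
    for degree in range(1, max_degree + 1):
        # Drawing variables with replacement yields each product exactly once,
        # e.g. ('a', 'a', 'b', 'c') stands for a^2 * b * c.
        for combo in combinations_with_replacement(variables, degree):
            terms.append(combo)
    return terms

terms = polynomial_terms(["a", "b", "c"], max_degree=4)
print(len(terms))  # 34: 3 linear + 6 second-degree + 10 third-degree + 15 fourth-degree
```

Counted the same way, ten predictors would already give 1,000 terms, which is why something has to rein the model in.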
The specific details of regularization depend on the tool you use, but basically every machine learning algorithm has some regularization component. For linear regression, there are two common ways: ridge regression and lasso regression. Both of these have the effect of pushing coefficients down toward zero.[5] In other words, if you throw in 75 different variables and interactions, a regularized form of regression will tend to push down coefficients, so that only the strongest effects come out in the end. If you have truly linear data, but you throw in a quadratic term, that term will simply get pushed down close to zero so it has little to no effect on your prediction. If there is actually a quadratic effect in there, that quadratic term will stay strong because it’s useful for minimizing prediction error. And that’s it. Really. Throw in as many terms as you want, but the important ones tend to come out at the end. And this helps greatly to reduce over-fitting.
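Here is a rough sketch of that shrinking in action, in plain Python so nothing is hidden. Everything in it is an assumption for illustration: the data are simulated with a purely linear effect, the penalty of 0.2 is arbitrary, the function names are made up, and the fitting routine is a bare-bones coordinate descent (one standard way of fitting these models, not necessarily what your favourite package does). In real work you would reach for a library instead.

```python
import random

def standardize(values):
    """Rescale values to mean 0 and (population) variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def soft_threshold(value, penalty):
    """Move value toward zero by penalty; anything within +/- penalty becomes 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_penalized(columns, y, penalty, kind, sweeps=200):
    """Coordinate descent on standardized predictors; kind is 'lasso' or 'ridge'."""
    n = len(y)
    coefs = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Average product of predictor j with the residual that ignores
            # predictor j's own contribution.
            rho = 0.0
            for i in range(n):
                others = sum(coefs[k] * columns[k][i]
                             for k in range(len(columns)) if k != j)
                rho += col[i] * (y[i] - others)
            rho /= n
            scale = sum(v * v for v in col) / n
            if kind == "lasso":
                coefs[j] = soft_threshold(rho, penalty) / scale
            else:
                coefs[j] = rho / (scale + penalty)
    return coefs

# Simulated data (an assumption for illustration): the true relationship is linear.
rng = random.Random(0)
x = [rng.uniform(-2, 2) for _ in range(200)]
y = [2 * xi + rng.gauss(0, 0.5) for xi in x]
mean_y = sum(y) / len(y)
y = [v - mean_y for v in y]  # center the outcome so no intercept is needed

# Two candidate predictors: the linear term and an unneeded quadratic term.
columns = [standardize(x), standardize([xi ** 2 for xi in x])]

unpenalized = fit_penalized(columns, y, penalty=0.0, kind="ridge")  # plain least squares
ridge = fit_penalized(columns, y, penalty=0.2, kind="ridge")
lasso = fit_penalized(columns, y, penalty=0.2, kind="lasso")

for name, coefs in [("none", unpenalized), ("ridge", ridge), ("lasso", lasso)]:
    print(name, [round(c, 3) for c in coefs])
```

In this simulation the linear coefficient survives under both penalties, just somewhat smaller than the unpenalized estimate. The lasso sets the unneeded quadratic term to exactly zero, while ridge typically leaves it small but non-zero.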
Of course, with regularization, interpreting p-values gets extra unhelpful, because even your “true effects” will tend to get pushed down at least a little bit. But it does reduce a lot of the guesswork when it comes to exploratory research, and it can do amazing things for your overall prediction accuracy. Think about it—you’re essentially trying every combination of every variable and seeing what actually aids in prediction. That’s a really powerful tool, and it forms the backbone of incredibly accurate predictive tools—the sort of thing that might be necessary for, say, allowing a computer to drive your car. These tools are everywhere, from the post office using computers to read handwritten addresses, to Facebook finding faces in your photos, to Amazon recommending products for you. It’s difficult to overstate the success of this approach for pure predictive accuracy. And the fundamental process is: measure tons of variables, collect tons of data, and throw it into a model, letting regularization sort out which variables are really important.
I’m not saying that all of science should work this way. Again, this approach is extremely exploratory. But I do think that psychologists tend to predict really simple models, and it’s tough to know whether we predict linear relationships because we actually think the relationship is linear, or because we only collected data from 30 participants and wouldn’t have enough power to detect a quadratic relationship even if there was one. Psychology especially deals with multiply determined variables all the time. Every variable we measure is probably influenced by at least a dozen different causes—of which we usually measure two or three. Yes, theory is extremely important for driving science forward. But if we can gather a lot of data, being able to explore and test very complex relationships could provide avenues for more complex theoretical models (to be confirmed and replicated!) that better explain human behaviour. But speaking of replication, let’s talk about how data scientists do that.
The train-test split is an extremely powerful tool in machine learning. Even if you think all the above is useless to psychology, I hope I can convince you that the train-test split can be useful.[6] There are many ways that data scientists use this technique, but the simple process is this: Randomly partition your data into a “training set” and a “test set”.[7] Do all the analyses you want on your training set. Throw it around, flip it upside down, whatever. Then, when you have a model that you think is the best model, test it on your test set. That’s it. Essentially, data scientists bake in a direct replication into every data set they use. This is important because, as I keep saying, machine learning is often extremely exploratory in nature. It’s very easy to take advantage of chance, but when you have a (well-powered) direct replication at the end of the process, you can be much more confident in the model you’ve created.
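The basic process can be sketched in a few lines of plain Python. Treat this as a minimal illustration under assumptions of my own: the data are simulated, the 80/20 split and the seeds are arbitrary, and the helper names (`train_test_split`, `fit_line`, `r_squared`) are written out here for the example rather than imported from a library.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(rows, intercept, slope):
    """Share of variance in y explained by the given line, on the given rows."""
    mean_y = sum(y for _, y in rows) / len(rows)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    ss_tot = sum((y - mean_y) ** 2 for _, y in rows)
    return 1 - ss_res / ss_tot

# Simulated data (an assumption for illustration): a modest linear effect plus noise.
rng = random.Random(1)
rows = []
for _ in range(500):
    x = rng.gauss(0, 1)
    rows.append((x, 0.5 * x + rng.gauss(0, 1)))

train, test = train_test_split(rows)

# All exploration and model fitting uses the training rows only.
intercept, slope = fit_line(train)
print("train R^2:", round(r_squared(train, intercept, slope), 3))

# The test rows are touched once, at the very end, with the model held fixed.
print("test  R^2:", round(r_squared(test, intercept, slope), 3))
```

The important design choice is that nothing estimated from the test rows ever feeds back into the model: the intercept and slope are frozen before the test set is scored.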
Splitting your data between training and testing is, of course, not unique to machine learning. But given the recent prominent discussions in psychology about the need for direct replications, this seems like a natural fit. Do literally whatever you want to your data. P-hack it all you want, even! Just replicate it on your test set at the end. That number you get from your test set (whether you want to look at p-values or some measure of predictive accuracy) is the real value of your model.[8]
So why do I think knowing about the train-test split is useful for psychologists? Well, apart from the aforementioned discussion about replication, I think the bottom line is that psychologists do a lot more exploratory research than we let on. In fact, we probably do more exploratory research than we even think we do, given how easy it is to convince ourselves that we tooootally predicted something in advance. It’s hard when you predict an interaction, and you don’t find the interaction but you do find main effects, so you start testing the main effects, and before long you are in some gray area where you tell yourself, “Well if I had thought of this beforehand I would have predicted this”. Splitting your data like this is a nice way to protect yourself against all those human biases. You can fall prey to all the biases you want, just as long as you can replicate it (without those biases) in your test set at the end.
Of course, a train-test split also means giving up some data to set aside as a test set. And given how under-powered much psychological research already is, that might be a hard sell. When you’re only able to get 100 participants, do you really want to set aside 20-50 of them? And is 20-50 participants really enough of a test set to even be worthwhile? Probably not. So of course, the advice is always that larger samples are better. Regardless of whether you split your data or not, larger samples are important. But there is one particular case where I think psychologists can and absolutely should use this approach. Some researchers rely on secondary data, like the World Values Survey or the American Time Use Survey, which have two key characteristics: (a) they’re big, and (b) the data are already collected. The first is important because it means there’s enough data that setting a test set aside is feasible. The second is important because pre-registration is not really possible. I mean, if you’ve literally never looked at this data before, I’d argue that you’re blinded enough that pre-registration could still be helpful. But a lot of times, researchers have used the same data set to test multiple things. And once you’ve looked at the data and gotten a feel for it, even if you are using different variables, I think the “shielding” is gone, and there’s a real possibility that your hypotheses and your analyses end up being data-driven. You’re prone to overfitting. In that case, pre-registration doesn’t mean much, and a train-test split is a much better tool for demonstrating that your models are sound.
Here’s my bold claim: I think that splitting between a training and test set should be required for any analyses that make use of secondary data (at least where the data is large enough for it to be reasonable). If you have not collected the data yourself, you should split the data to validate your analyses. That should be the norm, and I’d argue that journals should not accept any secondary data analysis that does not use a train-test split, or otherwise directly replicate. The danger of p-hacking/overfitting is just too great. I personally know researchers who have made a career from secondary data because it provides them with such flexibility to “prove” pretty much anything they want.[9] For primary collection of data, pre-registration offers a useful tool to convince readers and reviewers that your analyses are confirmatory; but for secondary data, or data used for multiple projects/papers, this option is not of great use. Use a train-test split instead.[10]
I hope I’ve been able to convince you that machine learning can offer insight for psychology. Certainly a pure exploratory approach is not ideal in all situations in science, but I think psychologists generally don’t leverage it enough. For all the complexity of human thoughts and behaviour, we typically use pretty simplistic models. That’s not to say we can’t learn things from linear effects and two-way interactions. But the more we push on into more complex models, the more we can tamp down some of that “random error” in all our analyses that is just waiting to be explained. Having the tools to explore more complex effects is a huge advantage, and it’s worth putting these tools into the tool belt for psychological research.
Addendum: If you’re interested in learning more about any of these techniques, you should check out the excellent paper by Yarkoni and Westfall (2017) that was just recently released online ahead of print. A non-paywalled version is available here. It goes into more depth about some of these approaches.
[1] Note that none of these tools are unique to machine learning, and are also used in other fields, so some of you may already be familiar with them. But my impression is that they are still quite rare in psychological research.
[2] For example: random forest models, support vector machines (SVM), neural networks, ridge regression and lasso regression, Bayesian models, and variants of all the above.
[3] Data scientists often use PCA, but they’re unlikely to use factor analysis; they’re more likely to use some other clustering technique, like hierarchical clustering, or k-means clustering.
[4] 34 terms in total, if I counted right.
[5] Ridge regression tends to result in smaller coefficients; lasso regression tends to result in coefficients dropping down to 0 altogether.
[6] This train-test split has also been referred to as the “lock box approach,” because you essentially lock part of your data away and only touch it at the very end. See Skocik, Collins, Callahan-Flintoft, Bowman, and Wyble (2016) for details.
[7] Note that your training and test sets don’t necessarily need to be the same size—often the test set is about 20-40% of the total data. There’s a tradeoff between having enough data to form an adequate test, and wanting to use as much data as possible to train your model. So it will depend in part on how much total data you have. There’s probably an absolute threshold that you don’t want your test set any smaller than X observations, but also a relative threshold compared to your training set as well.
[8] In fact, data scientists often take this a step further. The test set is sacred—it can’t be touched until the very end, and it can only be used once as a final test. But while you’re exploring different models in your training set, you probably want to have some estimate of the level of accuracy those models will show in your test set. So sometimes data scientists will use an additional “validation” set, partitioned within their training set. One of the best ways of doing this is k-fold cross-validation. There are easy methods for doing this in the most common machine learning packages for R and Python.
[9] Whether they themselves perceive this is questionable; that may influence whether you attribute their actions to dishonesty or to human biases. Either way, the research is questionable.
[10] Some of these large data sets are longitudinal, which offers another option. If your hypotheses are such that you would expect the relationships to hold across time, you could consider using Wave 1 to train your model, and Wave 2 to cross-validate.
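For the curious, the k-fold cross-validation mentioned in the notes above can be sketched in plain Python as follows. This is only an illustration of the idea; the function name `k_fold_indices`, the choice of five folds, and the seed are all assumptions of mine, and the common machine learning packages have their own ready-made versions.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled row indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        # Every fold except the current one is pooled for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Run this on the training set only; the real test set stays locked away.
splits = list(k_fold_indices(n_rows=20, k=5))
for train_idx, validation_idx in splits:
    print(len(train_idx), "training rows,", len(validation_idx), "validation rows")
```

You would fit each candidate model on the training indices of a fold, score it on that fold's validation indices, and average the scores across folds to compare models before the final test.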