Two Cousins Meet

Emerging research that combines insights from machine learning and econometrics may end up transforming both fields.

Data is the new oil: it is an enormously useful resource that can make you rich. Like oil, data is an intermediate commodity; it is used as an input in decision making, both human and automated. Like oil, it needs to be extracted, refined and processed before it is consumed. And just as crude oil is consumed in two main forms (diesel and gasoline), ‘final’ data products come in two varieties: prediction and causal analysis.

Machine learning refers to a set of data analysis techniques pioneered by computer scientists. Unsurprisingly, these techniques exploit large computational resources to uncover complex statistical regularities in datasets. These patterns are then used for prediction. Imagine the retail loan department of a bank predicting how likely its prospective borrowers are to default, or an insurance company predicting the mortality rate of a certain population in order to calculate premiums. Machine learning excels at such tasks.

There is a catch, however. Beyond prediction, decision makers often want to know the impact of their decisions on a particular outcome of interest. A farmer wants to know how much farm yield can be improved by applying fertilizer; a manager wants to know how much sales revenue can be increased by advertising campaigns or by changing prices and other product attributes; and law enforcement agencies want to know how much the crime rate can be reduced by their favorite policy intervention. These tasks are collectively called the estimation of ‘treatment effects’. Off-the-shelf machine learning methods are not guaranteed to work for estimating treatment effects.

The problem with machine learning methods is that they are often based on fragile statistical relationships between variables, such as correlations. And as any basic statistics course tells us, correlation is not causation. For estimating treatment effects, one needs to move beyond simple (and often deceptive) statistical associations and ‘identify’ deep and robust causal relationships. Econometrics, a branch of economics that uses statistical techniques for policy evaluation, has developed many ‘identification’ strategies to do precisely that.
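The gap between correlation and causation can be illustrated with a small simulation. The sketch below is hypothetical throughout: it reuses the farmer example, assumes fertilizer truly adds 2 units of yield, and assumes that farmers with better soil (a hidden confounder) are more likely to buy fertilizer. All numbers and names are invented for illustration.

```python
import random
from statistics import mean

def simulate(n=20000, true_effect=2.0, randomized=False, seed=0):
    """Return mean yield of fertilized farms minus mean yield of the rest."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        soil = rng.gauss(0, 1)  # hidden confounder (assumed): soil quality
        if randomized:
            # Experiment: a coin flip decides who gets fertilizer.
            gets_fertilizer = rng.random() < 0.5
        else:
            # Observational data: better soil makes fertilizer use more likely.
            gets_fertilizer = soil + rng.gauss(0, 1) > 0
        crop = 10 + 3 * soil + true_effect * gets_fertilizer + rng.gauss(0, 1)
        (treated if gets_fertilizer else untreated).append(crop)
    return mean(treated) - mean(untreated)

observational_gap = simulate(randomized=False)
experimental_gap = simulate(randomized=True)
print(f"Observational gap: {observational_gap:.2f}")
print(f"Experimental gap:  {experimental_gap:.2f}")
```

In the observational data the raw gap overstates the assumed effect, because it also picks up the advantage of better soil. Under randomization the gap lands close to the assumed value of 2.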

If machine learning can be used for identifying complex relationships and econometric methods can identify treatment effects, a natural question arises: Why not combine their respective strengths to identify complex treatment effects? And, equally important, what kind of practical problems can be solved by this approach? Susan Athey and Guido Imbens, professors at Stanford Graduate School of Business, answer these questions in a recent paper. They take a machine-learning technique called the ‘regression tree’ and tweak it to identify different treatment effects among different population subgroups.
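To give a flavor of the idea, here is a deliberately simplified sketch. It is not the authors' algorithm. It grows a ‘tree’ with a single split on simulated experimental data, choosing the split that separates subgroups with the most different estimated treatment effects. Two things are assumptions of this sketch rather than details from the article: the data (a hypothetical discount that lifts spending only for customers under 40), and the decision to choose the split on one half of the sample and estimate the effects on the other half.

```python
import random
from statistics import mean

def make_data(n, rng):
    """Simulated randomized experiment; the discount works only under age 40."""
    rows = []
    for _ in range(n):
        age = rng.uniform(18, 70)
        treated = rng.random() < 0.5
        lift = 5.0 if age < 40 else 0.0  # assumed true effect, by subgroup
        spend = 20 + 0.1 * age + lift * treated + rng.gauss(0, 1)
        rows.append((age, treated, spend))
    return rows

def estimated_effect(rows):
    """Difference in mean spend between treated and control customers."""
    treated = [y for _, d, y in rows if d]
    control = [y for _, d, y in rows if not d]
    return mean(treated) - mean(control)

def best_split(rows, candidates, min_leaf=50):
    """Pick the age cut whose two sides differ most in estimated effect."""
    best, best_gap = None, -1.0
    for cut in candidates:
        left = [r for r in rows if r[0] < cut]
        right = [r for r in rows if r[0] >= cut]
        if min(len(left), len(right)) < min_leaf:
            continue  # skip splits that leave a tiny subgroup
        gap = abs(estimated_effect(left) - estimated_effect(right))
        if gap > best_gap:
            best, best_gap = cut, gap
    return best

rng = random.Random(1)
data = make_data(6000, rng)
train, estimate = data[:3000], data[3000:]

# Choose the split on one half, then estimate effects on the other half.
cut = best_split(train, candidates=range(25, 65, 5))
young_effect = estimated_effect([r for r in estimate if r[0] < cut])
old_effect = estimated_effect([r for r in estimate if r[0] >= cut])
print(f"Split at age {cut}: effect below = {young_effect:.2f}, "
      f"effect above = {old_effect:.2f}")
```

An ordinary regression tree would split to predict spending itself. The tweak sketched here is to split on differences in the treatment effect instead, which is what a targeting rule needs.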

One application of their technique is in ‘smart’ customer relationship management. Many industries, such as telecommunications, insurance and online retail, have high customer attrition rates. To retain loyalty, customers are bombarded with personalized discounts and targeted ads (think of the SMS you received offering cheap data packs). Different customers respond differently to these interventions. Naturally, such interventions are costly, so it is important to target precisely; modified regression trees can be used to find targeting rules that deliver the biggest bang for the buck.

Another potential application could be personalized drug testing. Typically, the effectiveness of a drug varies with patient characteristics. A drug that is effective for Caucasian males may be totally ineffective for Asian females. One way of testing for differential effects is to collect data in a single trial and then mine it for subgroup differences. Unfortunately, this procedure is statistically faulty: if you look hard enough, you will find some effects just by fluke. That is why drug trial authorities require multiple trials for each population subgroup. This is statistically correct, but drug trials are extremely costly and time-consuming. Running multiple trials is rarely profitable, especially if the population of interest is small and poor.
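The ‘fluke’ problem is easy to reproduce. The following sketch simulates hypothetical drug trials in which the drug has no effect at all, slices each trial into 20 subgroups, and checks how often at least one subgroup looks ‘significant’ at roughly the 5 percent level. The subgroup count, sample sizes and threshold are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance

def subgroup_z(rng, n_per_arm=50):
    """z-statistic for one subgroup of a trial where the drug does nothing."""
    drug = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    placebo = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    se = sqrt(variance(drug) / n_per_arm + variance(placebo) / n_per_arm)
    return (mean(drug) - mean(placebo)) / se

def trial_has_fluke(rng, n_subgroups=20, threshold=1.96):
    """True if at least one subgroup crosses the significance threshold."""
    return any(abs(subgroup_z(rng)) > threshold for _ in range(n_subgroups))

rng = random.Random(42)
n_trials = 200
fluke_rate = sum(trial_has_fluke(rng) for _ in range(n_trials)) / n_trials
print(f"Share of no-effect trials with a 'significant' subgroup: {fluke_rate:.2f}")
```

With 20 independent subgroups each tested at the 5 percent level, the chance of at least one false alarm is about 1 - 0.95^20, or roughly 64 percent, so most of these no-effect trials will appear to ‘work’ for somebody.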

Hybrid methods come to our rescue. They provide statistical guarantees that are as good as those of multiple trials, but they require only a single trial to collect data. Because old datasets can be mined, in many cases new experiments are not required at all. Sometimes you can have your cake and eat it too.

On a similar theme, a different group of Stanford University researchers used machine learning methods to assess the impact of tax reform on illegal mining in Colombia. Illegal mining, rampant in Colombia, presents enormous environmental, political and fiscal challenges. As expected, illegal mining is hard to measure and reliable statistics are hard to come by.

The researchers used satellite images and a machine learning algorithm to spot areas that were being illegally mined. They then used an econometric technique to tease out the additional impact of the tax reform (which reversed fiscal devolution to mining municipalities) on illegal mining. After the introduction of the tax ‘reform’, illegal mining increased by 1.41 percentage points as a share of the mining area. Besides illegal mining, their approach can be used to evaluate the effectiveness of policies aimed at checking deforestation, desertification and coastal area encroachments.

These select examples are only indicative of the growing interest of econometricians in machine learning methods. The American Economic Association, at its recent annual meeting, held a well-attended special session on ‘Machine learning in econometrics’.

Though machine learning and econometrics were incubated and developed by different research communities, they share a common ancestor: statistics. Like close cousins separated in childhood, they have their similarities and quirks. They use similar methods, but ask and answer slightly different questions. After a long separation, it seems a happy family reunion is in the offing, and their collaborative strength holds enormous promise for future applications.

As Alan Turing, the father of computer science, once said: “We can only see a short distance ahead, but we can see plenty there that needs to be done.”

About the author

Avinash M Tripathi

Avinash M. Tripathi is an Associate Research Fellow (Economics) at the Takshashila Institution. His research interests include competition policy and financial risk management. He prefers a profound answer to a silly question rather than the other way around.