Overcoming Observation Bias
One of the best teachers I had was a second-year physics teacher. He taught in a way that stuck… forever. One day, he gave the class an assignment to construct an experiment to accurately measure the temperature of soup. We had very little to go on, other than that we could bring in our soup, heat it (or not) in the lab, and would be expected to eat the soup when we were done.
This may seem like an easy experiment, but there are a lot of variables. What is the operational definition of “soup”? Are there different consistencies or recipes which are more thermodynamically stable, and thus easier to measure? Is it easier to accurately measure a lot of soup or a little soup? What is the most accurate measuring device for such a scenario? Suffice it to say, there was a lot to think about… it was a soupy problem. My conclusion going into the experiment was that the best choice would be a very homogeneous and thick soup with as few ingredients as possible, all in solution. I chose creamy tomato. I selected a high-end lab calorimeter, which is basically an insulated, closed steel container into which a thermocouple (a fancy electronic thermometer) can be inserted to do the actual temperature measuring.
In the end, every student in the class did much the same thing, with different variations of cold vs. hot and soup recipes that were considered more or less stable for different reasons. Ultimately, we all made the same mistake. By introducing a thermometer into the soup, we changed the temperature of the soup, even if only by a minuscule amount. We were essentially cooling or heating the soup very slightly while measuring it, and thus failing to measure its temperature accurately. This was my first introduction to observer effects.
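To put rough numbers on the thermometer problem, here is a minimal sketch in Python. It assumes a simple heat-balance model in which the soup and the probe exchange heat only with each other. Every value in it (soup mass, specific heat, probe heat capacity, temperatures) is an illustrative assumption, not a measurement from the class.

```python
def equilibrium_temp_c(soup_mass_g, soup_temp_c, probe_temp_c,
                       soup_specific_heat=4.0, probe_heat_capacity=2.0):
    """Temperature after soup and probe settle, ignoring heat loss to the room.

    Assumed units: specific heat in J/(g*degC), roughly that of water;
    probe heat capacity in J/degC. Both defaults are hypothetical.
    """
    soup_heat_capacity = soup_mass_g * soup_specific_heat
    # Heat-capacity-weighted average of the two starting temperatures.
    weighted_sum = (soup_heat_capacity * soup_temp_c
                    + probe_heat_capacity * probe_temp_c)
    return weighted_sum / (soup_heat_capacity + probe_heat_capacity)


true_temp_c = 70.0  # assumed soup temperature before the probe goes in
for mass_g in (300.0, 30.0):
    reading = equilibrium_temp_c(mass_g, true_temp_c, probe_temp_c=20.0)
    error = true_temp_c - reading
    print(f"{mass_g:>5.0f} g of soup: reads {reading:.3f} C, error {error:.3f} C")
```

Under these assumptions the error is never exactly zero, and it grows as the amount of soup shrinks, which is one way to answer the "a lot of soup or a little soup" question.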
In science, such effects become quite nefarious. In some cases, the very act of observing creates a sort of standoff, where any energy used to observe will change the thing you are looking at (if you are interested, you may want to read about the observer effect in quantum measurement, which is often mentioned alongside, though is distinct from, Heisenberg’s Uncertainty Principle). In data science, observer effects are not at the quantum level, but they still exist. If we fail to realize we are dealing with such effects, our observations will introduce bias, often to the point where the conclusions are not only wrong, but potentially dangerous.
Human Behavior: Barber Shops and Being Watched
The easiest way to recognize observer effect problems in data science comes to us from observing human behavior. Social scientists have known for quite some time that people behave differently when they know they are being observed. Such changes in behavior have different names, depending on whether the change is intentional or unintentional and also on the motivation of the observed. Let’s take a simple example…
Suppose you wish to collect information to study the impact of discounts offered through social media by a new small business that has no online sales (a barber shop, for example). You are given point-of-sale data for six months, which links in-store sales transactions with personalized promotions, like a serial-numbered coupon delivered through email. Without getting into all of the ways to modify the discount and all of the different types of advertising, focus on the fundamentals: the soup. Imagine only the data stream you need to analyze. It’s easy to imagine constructing the two modes (groups of behavior) and comparing one to the other, measuring all sorts of traditional (and maybe inventive) patterns such as time of day, method of receiving the offer, amount of discount vs. price paid, and new vs. pre-existing customer. Let’s consider some of the things we might be missing in such an analysis.
Rarely do such analyses consider the fact that most people know they are being manipulated when they use a discount. How does such a manipulation affect the business? More sophisticated analyses consider observer effects in subtle ways, such as longer-term customer behavior after the discount, competitors’ reactions to the change in how your business competes, and other such “observer” effects.
When using data science to understand behavior, consider the impact on the behavior if the observed population is aware of the measurement. Consider also the ways to understand those changes in behavior over time.
The Plot Thickens When the Observed Deceive: Fingerprints and Ill-Fitting Shoes
Now consider other situations where data is used to understand behavior in more complex scenarios, in particular those where we assume the data is being manipulated. A great example is detecting fraud or other malfeasant behavior. When a fraudster is setting the trap, he is highly aware of the “signals” he is generating, and so he establishes a false identity or manipulates the data to appear different before some bad act. Much like a murderer might wear gloves to hide fingerprints or shoes of a different size to obscure tracks, a cybercriminal can change the way they “look” online or do things to alter the trail they leave.
One of the traditional ways to respond to fraud involves understanding the past: looking at known bad situations after the fact, understanding the data available at the time, and trying to establish a profile or other correlation (sometimes a model) to detect similar acts in the future. This analysis is solid science, and absolutely necessary, but not sufficient. Why would we only look at prior bad actions to detect future bad actions if we know that a certain percentage of future bad actions will involve the observer effect, that is, that bad actors will change their behavior if they (or others acting similarly) are detected?
This question brings out a sort of digital cat and mouse game, where the data detection is ramped up to react to new bad behavior (or predicted future bad behavior), thereby causing the bad actors to behave differently. The solutions to such a scenario are quite complex and nuanced. One solution is to use sophisticated methods to detect changes in clusters of behavior, essentially finding “lumpy” clusters in otherwise consistent masses of transactions. Some of these lumpy parts will simply be new benign behavior, such as entities interacting in new, permissible ways. But others will be emerging forms of bad behavior. Discrimination algorithms can essentially isolate these lumpy parts for closer examination by humans or by other, more focused methods, reducing the complexity of looking at the entire set of data.
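As a rough illustration of finding “lumpy” parts, the Python sketch below compares how transactions are distributed across value bins in a baseline period and a recent period, and flags bins whose share has jumped. This is a deliberately simple binned comparison, not the sophisticated methods mentioned above. The function name `lumpy_bins`, the thresholds, and the toy transaction amounts are all hypothetical.

```python
from collections import Counter


def lumpy_bins(baseline, recent, bin_width, min_ratio=3.0, min_count=5):
    """Flag value bins whose share of recent activity far exceeds the baseline.

    Returns sorted tuples of (bin_start, bin_end, recent_count, share_ratio).
    """
    if not baseline or not recent:
        raise ValueError("both periods need at least one transaction")

    def bin_counts(values):
        return Counter(int(v // bin_width) for v in values)

    base_counts = bin_counts(baseline)
    recent_counts = bin_counts(recent)
    # Smoothing floor so bins never seen in the baseline do not divide by zero.
    floor = 1.0 / (len(baseline) + 1)

    flagged = []
    for b, count in recent_counts.items():
        if count < min_count:
            continue  # too few transactions to call it a cluster
        recent_share = count / len(recent)
        base_share = max(base_counts.get(b, 0) / len(baseline), floor)
        ratio = recent_share / base_share
        if ratio >= min_ratio:
            flagged.append((b * bin_width, (b + 1) * bin_width, count, ratio))
    return sorted(flagged)


# Toy data (assumed): amounts spread evenly from 10 to 99 in both periods,
# plus a new clump of 60 transactions near 495 to 499 in the recent period.
baseline = [10 + (i % 90) for i in range(900)]
recent = baseline + [495 + (i % 5) for i in range(60)]

flagged = lumpy_bins(baseline, recent, bin_width=10)
for start, end, count, ratio in flagged:
    print(f"amounts {start} to {end}: {count} recent transactions, share up {ratio:.1f}x")
```

A flagged bin is only a candidate for closer examination. As noted above, some lumps will turn out to be new, benign behavior.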
Pay particular attention to situations where there is intentional manipulation of data. Consider reducing the complexity of the problem by breaking it down into subsets that are “more” and “less” likely to contain manipulations.
Our Future: Machine Learning and Learning from Machines
Often the data science challenge with observer effects is less obvious, especially when human interaction is secondary or absent from the data being analyzed. A good example is when we consider machine learning and other similar approaches to data synthesis.
At first blush, we might presume that “machine” methods are free from observer effects, and for the most part, they are. However, we must remember that most methods, be they supervised or unsupervised, require some sort of training or objective function. Essentially, we tell the algorithms what to look for, or we look at what they found, or both. In all cases, we must consider how we react to the output and how that reaction affects the ongoing evolution of the analysis. We essentially become the “soup,” and our environment is changed very subtly by the tools we are using to understand it.
As we use new methods to understand complex sets of data, we must be aware of how our own behavior as researchers might be changing our conclusions. We must challenge how our thinking itself is changing based on what we are learning.
We are only at the beginning of understanding how the environments we live in will change in the era of big data. Surrounded by things that create data and things that analyze that data, data about our data, and observations intended to change our behavior, we continue to learn what it all means. If we are aware of observer effects, we can have a newer, richer understanding that was never before possible. That understanding can help us to be better detectives, better advisors, better scientists, and better curators of the amazing data assets at our disposal.