There is a dark side to data, sure enough. As we enter the season of Halloween (at least in some parts of the world), I thought it would be an interesting departure from our usual topics to consider some of the creepier corners of data science. How do we know what data is really there? Is it what it seems to be? What is being done with the data we leave behind, and who is doing it? These are some of the questions that have been considered in domains such as fraud, cybersecurity, and even some attempts to understand…the supernatural. Edgar Allan Poe himself alluded to the transient nature of data in the opening lines of “To Helen” when he wrote “I saw thee once—once only—years ago: I must not say how many—but not many.” So into the darkness we go with some Halloween-themed topics in data science.
Black Cat Problems: Looking for data that may not be there
There is an old conundrum about a blind man sitting in a dark room talking to a black cat that isn’t there. The man is blind, so it doesn’t matter that the room is dark, yet he isn’t sure whether the cat is there, just as any sighted person would be unsure. The sighted person, however, would be confounded by the fact that the room is dark and the cat is black. So what does this conundrum have to do with data? Well, sometimes we need to devise methods to look for data when we are not sure it is there to find.
There are two frames of inquiry that underlie all data discovery: positivist and constructivist. Often, one or the other is presumed, but we would do well to stop and think about which is at work in a given situation. Positivist theories presume that a single, objective truth exists that can be discovered through observable and measurable facts. For example, when rescuers construct a search grid to look for a lost hiker (or a lost cat!), they presume that the subject of the search is within a certain region and work from there. In data science, examples of positivist research include searching for an equation that will predict the behavior of a set of data. We presume that such an equation exists, then set about examining longitudinal data (data collected over time) to assess the validity of the assumption by measuring goodness of fit, stability, random error, and so on.
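To make the positivist frame concrete, here is a minimal sketch in Python. The series is synthetic, invented purely for illustration: we presume a linear equation underlies the historical data, fit it, and then measure how well the presumption holds.

```python
import numpy as np

# Hypothetical historical series: a noisy linear trend (synthetic data).
rng = np.random.default_rng(0)
t = np.arange(50)
y = 2.0 * t + 5.0 + rng.normal(0, 3.0, size=t.size)

# Positivist assumption: a single underlying equation exists.
# Fit a line to the past and measure how well it explains the data.
slope, intercept = np.polyfit(t, y, 1)
residuals = y - (slope * t + intercept)
r_squared = 1 - residuals.var() / y.var()

print(f"fit: y = {slope:.2f}*t + {intercept:.2f}, R^2 = {r_squared:.3f}")
```

A high R² supports the presumed equation; a low one suggests the assumption itself, not just the fit, deserves scrutiny.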
Constructivist theories, in contrast, hold that truth is constructed by observers’ perceptions. In science, the “truth” is constructed while the research is being performed. Imagine we were looking at data to understand nationalism in the 1940s vs. the 1970s. Clearly, looking at data from the Second World War (when countries were united to some degree against other groups of countries) would yield different results than looking for nationalism during the Cold War, when countries were much more inwardly focused.
Politics aside, there are many constructivist situations in data science. Some of these are called “Black Cat” problems, where the discovery does not presume that there is anything there to find. Imagine an algorithm searching for hurricanes amidst weather data. Often, there is no hurricane to find, and such data is chaotic by its very nature. Another example would be looking for any kind of bad actor (cyber criminal, fraudster) in a set of data referring to transactions among otherwise benign parties.
There are many methods for approaching Black Cat problems. A key method is signal analysis, wherein we look at a large collection of interaction data. An individual signal (e.g., the fact that one party interacted with another) has very little inferential value, but a large collection allows the discovery of patterns, which can lead to hypotheses that direct learning algorithms to more interesting clusters of transactions.
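A minimal sketch of this idea follows; the transaction log, party names, and threshold are all hypothetical, invented for illustration. No single (sender, receiver) pair tells us much, but counting interactions in aggregate surfaces pairs worth a closer look.

```python
from collections import Counter

# Hypothetical transaction log: (sender, receiver) pairs.
# One interaction says little; the aggregate reveals structure.
transactions = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("dave", "mallory"), ("dave", "mallory"), ("dave", "mallory"),
    ("erin", "frank"), ("mallory", "dave"), ("dave", "mallory"),
]

pair_counts = Counter(transactions)
total = sum(pair_counts.values())

# Flag pairs whose interaction frequency stands well above the average;
# these become candidate clusters for closer inspection.
threshold = 2 * total / len(pair_counts)
suspicious = {pair: n for pair, n in pair_counts.items() if n >= threshold}
print(suspicious)  # -> {('dave', 'mallory'): 4}
```

The flagged pairs are not conclusions, only hypotheses: places where the cat, if present, is more likely to be found.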
Cluster analysis is a related method, wherein we look for specific patterns of data according to pre-defined relationships. An example is isotropy, where all data relationships in a region are similar in a certain aspect. Anisotropy, the opposite, can lead us to more interesting regions to pursue in our “Black Cat” problem, essentially focusing on the places most likely to contain the cat (if there is a cat in the room at all).
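One simple way to quantify anisotropy is to compare the eigenvalues of a region’s covariance matrix; a ratio near 1 means the data looks the same in every direction, while a large ratio flags a preferred direction. This is a sketch with synthetic 2-D data, not a prescription for real-world use.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical regions of 2-D data points (synthetic).
isotropic = rng.normal(0, 1.0, size=(500, 2))               # similar in all directions
stretched = rng.normal(0, 1.0, size=(500, 2)) * [5.0, 0.5]  # strongly directional

def anisotropy(points):
    """Ratio of largest to smallest covariance eigenvalue.
    ~1 means isotropic; much larger means a preferred direction."""
    eigvals = np.linalg.eigvalsh(np.cov(points.T))  # ascending order
    return eigvals[-1] / eigvals[0]

print(f"isotropic region: {anisotropy(isotropic):.1f}")
print(f"stretched region: {anisotropy(stretched):.1f}")
```

Regions with high anisotropy scores are the ones worth decomposing further.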
The most important aspect of working with Black Cat problems is the recognition that the subject of inquiry may not exist at all. The most powerful methods involve progressive decomposition, breaking down the problem into smaller, more tractable subsets that can be explored in greater detail.
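Progressive decomposition can be sketched as a recursive search that only descends into partitions showing some sign of the quarry; the readings and the suspicion test below are hypothetical. (In practice the partition-level test would be a cheap aggregate statistic rather than a full scan of every element.)

```python
def find_anomalies(data, is_suspicious, min_size=4):
    """Recursively split the data, descending only into partitions
    that show any sign of the thing we're looking for. If nothing
    suspicious appears at any level, we conclude the cat may simply
    not be in the room."""
    if not any(is_suspicious(x) for x in data):
        return []                      # nothing here; stop searching
    if len(data) <= min_size:
        return [x for x in data if is_suspicious(x)]
    mid = len(data) // 2
    return (find_anomalies(data[:mid], is_suspicious, min_size)
            + find_anomalies(data[mid:], is_suspicious, min_size))

# Hypothetical example: hunt for outliers in a mostly benign stream.
readings = [10, 11, 9, 10, 97, 10, 11, 10, 9, 10, 10, 103]
print(find_anomalies(readings, lambda x: x > 50))  # -> [97, 103]
```

Crucially, the empty-result branch is a legitimate answer, not a failure: sometimes there is no cat.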
Dark Rooms: Be careful what you wish for…you might get it.
Another interesting situation is something I have referred to as a “Dark Room” problem. In full disclosure, this is an abstraction of a related problem in computational linguistics, so searching for the term may be somewhat unfulfilling (a bit of a Black Cat problem in itself). The scenario goes something like this: Imagine that you want to create the best place to keep information on a particular subject. You collect all of the literature, all of the experts, and all of the teachers on this subject and put them in a locked room. At that moment, this room (the Dark Room) is the best place to go for information on the subject of interest.
Over time, however, the subject matures outside the room, and the experts in the room can only learn of these changes from the imperfect understanding of others outside who relate the information in some imperfect way through the locked door. Furthermore, anyone wishing information about the subject must work through the inconvenience of the locked door (and possibly intermediaries who pass notes under it) to learn about the subject. Over time, expertise develops outside the room, while the experts inside become less expert because of their imperfect information and the latency built into their learning. If the subject matter is dynamic, the room eventually becomes the worst place to go for information on the subject at hand.
Dark Rooms can be created in many ways. Sometimes, information silos are created out of necessity, for example for security reasons, privacy concerns, protection of intellectual property, or other perfectly reasonable considerations. Nevertheless, these Dark Rooms will exhibit the isolation effects described above. One solution is to open a window, in effect allowing some information to pass between the experts in the room and the outside world with careful controls in place. Another strategy is to thoughtfully allow some of the experts to leave the room, in part to enhance their expertise and in part to satisfy legitimate needs outside the room (for example, when medical researchers working for private companies attend conferences to learn from other experts and present some version of their own findings). Of course, in all scenarios, there is risk. It is important to realize that there is risk no matter which strategy is undertaken, including keeping the door locked with the “experts” inside.
Beware of Dark Room scenarios—know when you are creating them (or in one of them) and question what steps are in place to mitigate risks such as confirmation bias, isolation, stagnation of expertise, and declining relevance.
Creepy Use of Data: Ethics are increasingly under scrutiny.
Any discussion of the “dark” side of data science would be incomplete without some discussion about the juxtaposition of sources and uses of data. There are many examples of this concern in the news today. Perhaps the most obvious is the unintended use of data for purposes that would surprise, and possibly perturb, those who directly or indirectly generate data. Examples abound, including court cases where toll information is used to prove the existence of a person at a location and time, and marketing “conclusions” about consumer needs for maternity products based on prior behavior that predicts pregnancy.
There are many strategies in use today to overcome objections to creative use of data, including complicated user agreements (click Agree if you want to get to the next screen, even though we know you probably haven’t read everything in this incredibly long document), laws which attempt to govern the appropriate use of certain data (do-not-call lists), and general principles about privacy and transportability of data.
All such attempts to regulate the ingestion and use of data suffer from the same basic phenomenon: our ability to use data in unexpected (and sometimes very beneficial) ways will always outrun our attempts to protect that data. This argument is based on the fundamental tenets of Big Data, namely that the velocity of change, ever-increasing volume, constantly changing variety, unclear implicit value, and other aspects of data will continue to overwhelm our best attempts to discover, curate, and synthesize the data we create.
This situation is not hopeless; rather, it relies on our ability to adjust our thinking about how data will be used, our perspectives on what is realistically confidential, and many other aspects of living in a world awash in data. Furthermore, as purveyors and practitioners of technologies such as artificial intelligence, deep learning, automata (e.g., “bots”), and other advances in data and related science, we must continue to consider the ethical questions that lie between what we can do and what we should do.
The edge of the possible continues to move at a rate that is increasingly difficult to discern. We must be increasingly aware of what is being done with our data and what we are doing with our data.
In modern times, Halloween is not only about ravens, dreary midnights, goblins and monsters and scary things. It is also about bringing our communities together for something fun and hopeful for our children and our future. As we consider some of the amazing things that can be done with data, we should remember that while some of them are scary, others bring great promise for a new year full of better understanding.