We sometimes use the expression “the straw that broke the camel’s back” to describe something that was just a little bit more than could be tolerated. In Greek, they say something like “the drop that made the glass overflow,” which makes a bit more sense to me because I have never seen a camel with a broken back. Either way, I think we all have a certain intuition (although we don’t actually agree) about when something has been taken too far in a debate, a joke, or some other social interaction. Things are very different when data is involved. In a world awash with data we can see, data we could go and get, and still more data that we wish we had, how do we know when we have enough to make a decision? When do we really have “enough” data?
Picking the Right Data: On Tuition and Intuition
The first exposure most of us have to the question of having enough data comes when we are studying a large set of normally distributed data. If we want to know something about such a large set of data, we cannot just look at it and use our intuition. Statistics tells us that we do not need to look at the whole universe of data either. Rather, we can calculate a sample size that is much smaller than the population and draw a random sample of that size. From this sample we can estimate measures of the population as a whole. While this advice is all true, it is only the beginning.
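As a rough illustration of the sample-size calculation described above, here is a minimal Python sketch. It assumes the goal is to estimate a population mean, that the standard deviation is already roughly known, and that the textbook formula n = (z * sigma / margin)^2 with an optional finite population correction is appropriate. The function name `sample_size_for_mean` and its parameters are hypothetical, chosen only for this example.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence=0.95, population=None):
    """Estimate how many random observations are needed to estimate a mean.

    sigma      -- assumed (approximate) population standard deviation
    margin     -- acceptable margin of error, in the same units as sigma
    confidence -- two-sided confidence level, e.g. 0.95
    population -- optional population size for the finite population correction
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population.
    n = (z * sigma / margin) ** 2
    if population is not None:
        # Finite population correction: a small population needs fewer draws.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size_for_mean(sigma=15, margin=2))
    print(sample_size_for_mean(sigma=15, margin=2, population=1000))
```

The calculation only answers how many observations to draw. It says nothing about whether the data is of the right kind, which is where the harder questions begin.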
What happens when there is more than one set of data? What happens when the conclusion we are trying to reach is not known completely until we see some of the data? What if all of the data is not of the same type, not all numeric, not all accessible, or not even all true? In such cases, there is no simple answer. The reality is a mixture of art and science.
When I was first considering going to college, I had access to one of the first databases of U.S. colleges and universities. This software was introduced long before the Internet, and involved a series of reductive queries, starting with all schools in the database and then progressively reducing the set based on factors such as geography, major subjects available, tuition, etc. I used this system to construct a short list of schools to consider and thought that the decision would be easy. Then I started talking to my friends. I found out that everyone seemed to be using different criteria. Some were considering the ratio of male to female students (which I will admit hadn’t occurred to me, but seemed pretty important) while others were looking at the extracurricular activities available near the campus. There was also a whole host of information I would have loved to know, such as the exact criteria used for acceptance and the acceptance rate per 100 applications, which was confidential and not disclosed.
I had inadvertently stumbled on my first real “big data” experience, with the disparate sets of data that are always present and the danger of rushing to a decision just because I had “enough” data to make one. Having enough data to make any decision, something that can be referred to as a dispositive threshold, does not guarantee that you have enough data to make a good decision. Rushing to make a decision with the data you have, simply because you can, is possibly one of the biggest “big data” mistakes. Unless you take the time and effort to analyze the merits and implications of using the data in hand, there is no defensible premise that such data is sufficient or appropriate to make any particular decision.
The first step in any data-based decision should be to assess the character, quality, and importance to the situation of three distinct sets of data: the data in hand, the data that could be collected, and the data that is known to exist but is inaccessible through any reasonable effort.
Learning from the Missing: On Black Cats and Bad Movies
Because there will nearly always be more data available, how do we know when to stop collecting and start concluding? After we have thoroughly considered what data we have in hand, including all of its foibles, biases, and other deficiencies, we must next form some idea of the data that we are not using. Such consideration is sometimes referred to as a “black cat” exercise, because it is like looking for the proverbial black cat in a dark room. Consider, for example, the decision to land men on the moon. We could not know for certain the exact nature of the lunar surface, yet we set out to land on it. Clearly, we could experiment with materials meant to approximate our assumptions, but until we actually went to the moon, there would be uncertainty. Thus, significant provisions were made to accommodate worst-case scenarios, such as how, and whether, one astronaut could help another who had fallen and couldn’t get up without becoming trapped as well. In such situations, where data is simply not readily available, it is important to understand the limitations of the decisions that are being made and the sensitivity of those decisions to the limitations.
In data, especially when there is a qualitative nature to assessments (for example, when you are considering crowdsourcing or other methods of asking a large group for an opinion), one technique for deciding when to stop collecting is called supersaturation. Supersaturation is a key method in the science of heuristics, where algorithms are designed to perform the same as a group of similarly instructed, similarly incented individuals. In such cases, measurements are taken using as complex a set of attributes as possible. The conclusions of a large set of such measurements are computationally reduced and compared to additional tranches of observation until nothing but trivial changes are observed in the conclusion. Essentially, the conclusion stabilizes. A simple example would be asking a bunch of people coming out of a movie whether they liked the movie and why. After a large number of people had generally said yes or no (with some minority opinion), and after asking why uncovered no new reasons, you might reasonably conclude that you understand the sentiment.
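The movie example can be sketched as a simple stopping rule. The Python below is only one possible reading of “the conclusion stabilizes”: it assumes responses arrive in tranches of (liked, reason) pairs, and it stops at the first tranche that adds no new reasons and moves the cumulative share of positive answers by no more than a chosen tolerance. The function name `tranches_until_stable`, the tolerance value, and the sample responses are all hypothetical.

```python
def tranches_until_stable(tranches, tolerance=0.02):
    """Return (tranche_number, share_liked) once the conclusion stabilizes.

    tranches  -- list of tranches, each a list of (liked: bool, reason: str)
    tolerance -- largest change in the cumulative "liked" share still
                 treated as trivial
    Returns (None, last_share) if the conclusion never stabilizes.
    """
    seen_reasons = set()
    liked = total = 0
    previous_share = None
    for number, tranche in enumerate(tranches, start=1):
        if not tranche:
            continue  # an empty tranche carries no information
        # Reasons in this tranche that no earlier respondent mentioned.
        new_reasons = {reason for _, reason in tranche} - seen_reasons
        seen_reasons |= new_reasons
        liked += sum(1 for verdict, _ in tranche if verdict)
        total += len(tranche)
        share = liked / total
        stable_share = (
            previous_share is not None
            and abs(share - previous_share) <= tolerance
        )
        if stable_share and not new_reasons:
            return number, share
        previous_share = share
    return None, previous_share


if __name__ == "__main__":
    # Made-up responses, purely for illustration.
    responses = [
        [(True, "plot"), (True, "acting"), (False, "length"), (True, "plot")],
        [(True, "acting"), (False, "ending"), (True, "plot"), (True, "plot")],
        [(True, "plot"), (False, "length"), (True, "acting"), (True, "plot")],
    ]
    print(tranches_until_stable(responses))
```

In this made-up data the second tranche leaves the share unchanged but introduces a new reason, so collection continues. Only the third tranche, which adds nothing new on either measure, meets the stopping rule.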
Taking it to the Next Level: Recursion and Fighting in the Dark
One of the great advantages of higher-order intelligence is the ability to learn from our mistakes. I once had a very powerful lesson in a martial arts class where we learned to fight in the dark. At first, everyone thinks it’s impossible. When the lights are turned off, several people attack the person in the center (who knows how to fight in the dark). Insert lots of awkward kicking and punching at nothing and a lot of “ouch.” Next, the lights come on and the instructor asks what we learned. Without getting into the whole lesson, it becomes clear that the person in the center learns where we are when we accidentally come into partial contact. This “information” is then used to mount an offense or a defense in the dark. I can assure you, the next time the lights go out, the attackers do much better, until they eventually earn the right to defend from the center. There are many things that are powerful about this exercise, but perhaps the most powerful is the importance of learning, even from a painful and ineffective first attempt, so that progressive attempts build on everything learned so far. This is a physical example of the algorithmic concept of recursion, whereby each step is informed by the base algorithm but also by the collective learning of all preceding iterations of the same algorithm.
We have been taught that if a machine is designed to do something wrong, it will do it consistently wrong, every time. Of course, machine learning and other non-regressive methods have given rise to even more advanced machines and more advanced ways of looking at data-based decisions, so this premise is no longer entirely true. Recursive learning is one of the many ways that new approaches in data science use data never seen before, or data that was not initially available, to continuously refine performance on problems that may themselves be changing over time.
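One very small way to picture this kind of recursive refinement is an estimate that is corrected by each new observation, so that every step carries forward everything learned so far. The Python sketch below is an illustration under assumptions, not a description of any particular machine learning method: the function name `recursive_estimate` and the `learning_rate` parameter are hypothetical. With no learning rate it reduces to a running mean, and with a fixed learning rate it gives more weight to recent observations, which helps when the problem itself is drifting.

```python
def recursive_estimate(observations, learning_rate=None):
    """Refine an estimate one observation at a time.

    observations  -- iterable of numeric measurements, in arrival order
    learning_rate -- None for a running mean, or a value in (0, 1] that
                     weights recent observations more heavily
    Returns the list of estimates after each observation.
    """
    estimate = None
    history = []
    for count, observation in enumerate(observations, start=1):
        if estimate is None:
            # The first contact is all we know so far.
            estimate = float(observation)
        else:
            step = learning_rate if learning_rate is not None else 1 / count
            # Move the previous estimate toward the new evidence.
            estimate += step * (observation - estimate)
        history.append(estimate)
    return history


if __name__ == "__main__":
    print(recursive_estimate([2, 4, 6]))
    print(recursive_estimate([10, 10, 20, 20], learning_rate=0.5))
```

Each estimate is built from the previous estimate plus a correction, which loosely mirrors the fighter in the center updating a sense of where the attackers are with every accidental contact.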
One of the most promising advances in data science is the development of formal methods to learn from prior experience and to inform future iterations, creating continuously increasing performance even as the very nature of a problem changes.
Thinking about the data available and carefully considering by what right we use it to address a problem is one of the most critical disciplines in data science. We live in a world where new sets of data are constantly becoming available. New tools and new techniques to use that data abound. The critical skill that will set us apart in this sea of tools and data is our ability to assess when we have the right data, and when we have enough of it to make a meaningful decision.