There is a reason why we promise to tell the truth, the whole truth, and nothing but the truth. Each of these is a distinct thing. It’s hard to ignore the concept of Truth, especially in an election year. Accusations of lies, assertions of truth, fact checking, and apologies for “misspeaking” seem to be everywhere. Philosophers have struggled with the concept of truth for millennia, yet it is still one of those concepts that we seem to understand intrinsically until we try to define it. So what do we really know about the truth in data, and how do we know we know it?
The Truth – Accuracy: Populations and Paint Guns
The first thing that becomes obvious in a Big Data world is that finding the truth can be a matter of definition and perspective. Determining which data is true or accurate when there is no reference to compare against is one of the fundamental problems in computer science.
The term accuracy implies the ability to compare to some “correct” source. When I was in school, I really struggled with spelling. If I asked someone how to spell a word, they would invariably say something like “look it up in the dictionary” – which to me required knowing how to spell the word in the first place! Of course, as we get older we realize that dictionaries are only one of the many places we go to look things up. In an internet era, it seems that almost any “truth” can be looked up. The reality is, however, that this assumption only works for what is sometimes called “data at rest” – things that are not changing while you are looking them up, or at least changing slowly enough that the answer you find is still relevant when you find it.
Looking up answers also only works for data which is unambiguous. So, for example, you can look up the population of a country as of the last census (data at rest), but it is problematic to look up the exact number of people in a country right now: the number is changing while you are asking the question, and any source will have a natural latency built in. So we learn the first inconvenient truth about truth: the assumption of comparing to a reference source can be tricky when data is changing or being constructed as part of a process.
Whether you look up the best answer (data at rest with a good reference source) or use some sort of algorithm to construct the best possible answer, it is still necessary to be careful about how you use the “truth” you discover. One reason for comparison is often to measure accuracy, or the difference between some piece of data and the “truth”. Imagine that you had a paint gun that fired paintballs very consistently – every time you pulled the trigger, the ball went exactly where the gun sight was pointing, to within a very small tolerance. You might say this gun is very accurate. Now imagine that sometimes this gun misfired, squirting the paintball directly backwards at the shooter. Finally, imagine another gun that never misfires, but always fires exactly a certain distance to the left. By most definitions, this second gun is precise, but not accurate. It is also very easy to use: just aim a bit to the right and you will always hit the target. The first gun, well… I prefer to be consistently inconsistent (second gun) rather than inconsistently consistent (first gun)!
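The paint-gun story can be sketched in a few lines of code. This is a toy illustration with made-up shot offsets, showing why a stable bias (the second gun) is easy to correct while rare wild errors (the first gun) are not:

```python
import statistics

# Hypothetical shot offsets in cm from the target center.
# Gun 1: usually dead-on, but occasionally misfires wildly.
gun1 = [0.1, -0.2, 0.0, 0.1, -150.0, 0.2, -0.1, 0.1]
# Gun 2: never misfires, but always lands about 5 cm to the left.
gun2 = [-5.1, -4.9, -5.0, -5.2, -4.8, -5.0, -5.1, -4.9]

# Gun 2's error is a stable bias: measure it once and subtract it
# (i.e., "just aim a bit to the right").
bias = statistics.mean(gun2)
corrected = [shot - bias for shot in gun2]

print(max(abs(s) for s in corrected))  # tiny residual: bias correction works
print(max(abs(s) for s in gun1))       # the rare misfire is uncorrectable
```

No averaging trick rescues gun 1, because its error is not systematic – which is exactly why consistent inconsistency is preferable.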
The first thing to remember when discussing truth in data is that there is often no truth against which to compare. The illusion of having a “dictionary” to measure the accuracy of all data often distracts us from more carefully considering the nature of deviation in our data and choosing the best way to use that data for a particular purpose.
The Whole Truth – Completeness: Aging Receivables and Aging Cats
Another issue to consider when thinking about truth in data is completeness. Is missing data worse than data which is complete, but incorrect? If data is intentionally omitted, is that worse? Missing data is a form of failing to tell the whole truth.
Imagine you own a successful animal hospital and you are considering acquiring a smaller practice from a retiring veterinarian. You have access to data on the customers, vendors, recent sales, accounts payable and receivable, and other pertinent information. Your mission is to understand the balance of trade: how much of their business overlaps your pre-existing business? Do you have similar vendors, which might allow negotiating better pricing? Do you serve the same kind of clients, implying that your services are scalable? This problem is a fairly straightforward one. The discovery gets much more interesting, however, when you ask what is missing. What about pending lawsuits? The age of the pets being treated? Required investment in technology (a CAT scan? – sorry, I couldn’t resist)? Certainly, asking a few more questions might yield a completely different view of the situation.
The case with data is much the same, only more automated. In many cases, machine learning or other methods can reach one conclusion given a certain set of data and a completely different conclusion using additional data. One best practice is the concept of measuring saturation – the point at which adding additional data does not change the conclusion. Another technique is measuring the sensitivity of the conclusion to the amount of missing information (for example, assuming worst-case scenarios for missing data and calculating the error introduced into the conclusion).
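The sensitivity technique above can be sketched concretely. This is a minimal, hypothetical example: given some observed records and a count of missing ones, it bounds the conclusion by assuming worst-case values for everything missing (all the names and numbers here are invented for illustration):

```python
# Hypothetical known outcomes (1 = client overlaps our existing business).
observed = [1, 1, 0, 1, 1, 1, 0, 1]
n_missing = 4  # records we could not obtain

observed_rate = sum(observed) / len(observed)

# Worst-case sensitivity bounds: every missing record is a 0,
# or every missing record is a 1.
total = len(observed) + n_missing
low = sum(observed) / total
high = (sum(observed) + n_missing) / total

print(f"observed overlap rate: {observed_rate:.2f}")
print(f"possible range given missing data: {low:.2f} .. {high:.2f}")
```

If the decision would be the same anywhere in that range, the conclusion is insensitive to the missing data; if not, the missing records matter and need to be chased down. Saturation is the related check in the other direction: keep adding data until the conclusion stops moving.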
Of course, these are only simple examples of a much more complex issue. The overall point is to remember that missing information could be intentionally redacted, simply not available, not discoverable, non-existent, or many other forms of “not there” – each of these has a very different implication for the overall conclusions being reached with the data.
Pay careful attention to information which is missing. Make sure that the analytical method accounts for the sensitivity of the conclusion to the missing data, and that the root causes of the missing data are understood.
Nothing but the Truth – Consistency: Earthquakes and Monotonicity
Imagine that you wanted to know how many people died in a recent earthquake. You collect five articles: four say that 40 people were killed, and one says that 400 people were killed. A strict mathematical approach might be to compute the mean – (40+40+40+40+400)/5 – and conclude that about 112 people were killed. However, common sense plays a big part in this scenario. We know that deaths in major disasters tend to be reported only when confirmed. Further inspection of the data reveals that three of the articles stating 40 deaths pre-date the one reporting 400, while the final “40” article post-dates it. A logical explanation is that the more likely answer is 400, and that the newest “40” figure is an out-of-date number republished in error.
Of course, looking at billions of rows of data which varies in ontology is not so simple. Nevertheless, it is imperative that we understand the underlying assumptions in any attempt to resolve seemingly conflicting views of the same events. In the case of the earthquake, there was an implied rule: the principle of monotonicity – such reports tend to stay the same or grow larger unless some major intervening misconception is uncovered. Thus, finding the largest, most recent number is generally a good approach. However, a more nuanced approach might be to look for subsequent monotonicity after a significant downward adjustment.
In general, it is important to have methods that discard data that is distracting, likely to be incorrect, or in some other way known to be irrelevant. It is extremely important that such discarding of data be done in a repeatable, carefully considered way against established rules that can be monitored and well understood.
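One way to make discarding repeatable and monitorable is to express each exclusion as a named rule and log every record it removes. This is a minimal sketch – the rule names, thresholds, and records are all hypothetical:

```python
# Named, auditable discard rules: each rule is explicit, and every
# discarded record is logged with the rule that removed it.
rules = {
    "negative_count": lambda r: r["deaths"] < 0,       # impossible value
    "stale_report":   lambda r: r["age_days"] > 30,    # too old to trust
}

records = [
    {"id": 1, "deaths": 40,  "age_days": 2},
    {"id": 2, "deaths": -1,  "age_days": 1},
    {"id": 3, "deaths": 400, "age_days": 45},
]

kept, audit_log = [], []
for rec in records:
    hit = next((name for name, rule in rules.items() if rule(rec)), None)
    if hit:
        audit_log.append((rec["id"], hit))  # monitorable trail of discards
    else:
        kept.append(rec)

print([r["id"] for r in kept])  # [1]
print(audit_log)                # [(2, 'negative_count'), (3, 'stale_report')]
```

Because every exclusion carries the name of the rule that caused it, the discard policy can be reviewed, tested, and monitored rather than buried inside ad-hoc cleanup code.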
Beware doing simple math in complex situations. Ensure that methods to treat seemingly different measures of the same “fact” are well-considered and carefully tested.
These are a few things to consider when you are thinking about the truth in data. Of course, like any discipline, the most important first step is awareness. Data can be incorrect because it is simply wrong, because it has been intentionally manipulated, or because it has changed over time. Information which is missing, or which is otherwise inconsistent with the corpus of data, can be equally confounding. The search for truth and meaning in data begins with the awareness that there are rarely absolutes. All true data may not be simultaneously true. All missing data may not be missing because it does not exist. The only absolute truth is that there are absolutely no shortcuts to understanding truth in data!