Data Accuracy in Context: Measuring Data Decay
How fast is your data decaying? How much time has to go by before you no longer trust the accuracy of the data sitting in your databases? Different parts of your data decay at vastly different rates, so you can probably still trust some portions after others have become unreliable. Yet even highly decayed data has value for some applications, so this is not a binary, yes-or-no decision.
If you were lost in Cairo, would you prefer to have a 50-year-old map or no map at all? Many of the street names and landmarks would be incorrect, but a lot on an old map could still help you navigate the city. The most helpful objects on the map are the landmarks that tend to stick around across generations (e.g. the pyramids, old Roman alleyways, etc.). The least helpful are those that tend to decay more quickly (e.g. restaurants and businesses, street names, residences, etc.).
The data in your database is no different, and the key is to develop your intuition about the decay rates associated with different kinds of data. Take, for example, the names of people in your database: what does your intuition tell you about their decay rate? First names (given names, forenames) are fairly stable bits of data. Their accuracy may improve or degrade as sources change, but aside from the very rare cases where someone decides to change their name, first names are quite stable – they have an inherently low decay rate.
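One way to make this intuition concrete is to give each field a "half-life": the time after which a recorded value has a 50% chance of no longer being correct. The sketch below assumes simple exponential decay, which is a modeling choice and not something measured here. The field names and half-life values are hypothetical placeholders; real figures would have to come from your own data.

```python
def prob_still_accurate(age_years: float, half_life_years: float) -> float:
    """Probability that a value recorded `age_years` ago is still correct,
    assuming simple exponential decay with the given half-life."""
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (age_years / half_life_years)


# Hypothetical half-lives in years, chosen only for illustration.
HALF_LIFE_YEARS = {
    "first_name": 200.0,     # rarely changes
    "last_name": 40.0,       # changes with marriage, divorce, etc.
    "street_address": 7.0,   # people move
}

if __name__ == "__main__":
    record_age_years = 5
    for field, half_life in HALF_LIFE_YEARS.items():
        p = prob_still_accurate(record_age_years, half_life)
        print(f"{field}: {p:.2f} chance of still being accurate")
```

Under these assumed numbers, a five-year-old first name is almost as trustworthy as a fresh one, while a five-year-old street address has lost a noticeable share of its reliability.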
How about last names (surnames, family names)? Here's where things get more interesting. Are we talking about males or females? Married or single? That's just the beginning of the factors to consider. From what generation? Compared to previous generations, millennial females are actually trending towards adopting their husband's last name, so the decay rate for female last names is highly dependent on generation. Beyond this, there are cultural factors to consider. If we are evaluating the name of a female from a Spanish-speaking culture, once she is married her maiden name will likely still make up the first part of a hyphenated last name, which can include both her father's last name and her husband's last name.
None of this means that it is impossible to measure data accuracy across a large database. On the contrary, the assessment becomes easier once we identify what matters most for a specific use case. When assessing the accuracy of the data in your database, it is key to know the inherent decay rate of the data itself, but it is just as important to understand the intended application of the data.
So, the next time someone asks “how old is the data?”, challenge the question by starting a conversation about the intended use of the data (e.g. finding your way in an ancient city) and the specific data involved in that use case (e.g. the locations of Roman alleyways and pyramids, or street names and residences). Because the use case, the available data, and the decay rates of the specific data together determine utility, the overall “average age” of the entire database, or of any given record in it, should not by itself rule out a favorable outcome.
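The same point can be sketched in code: instead of scoring a record by its age alone, score it against the fields a particular use case actually depends on. This is a minimal illustration, not a definitive method. It assumes exponential decay per field and that fields decay independently of one another, and the use cases, field names, and half-lives are all hypothetical.

```python
def prob_still_accurate(age_years: float, half_life_years: float) -> float:
    """Chance a value is still correct, assuming exponential decay."""
    return 0.5 ** (age_years / half_life_years)


def use_case_fitness(age_years, half_lives, required_fields):
    """Chance that every field a use case depends on is still correct.

    Assumes the fields decay independently, so the per-field
    probabilities can simply be multiplied together.
    """
    fitness = 1.0
    for field in required_fields:
        fitness *= prob_still_accurate(age_years, half_lives[field])
    return fitness


# Hypothetical half-lives (years) and use cases, for illustration only.
HALF_LIFE_YEARS = {"first_name": 200.0, "last_name": 40.0, "street_address": 7.0}
USE_CASES = {
    "personalized_greeting": ["first_name"],
    "direct_mail": ["last_name", "street_address"],
}

if __name__ == "__main__":
    record_age_years = 10
    for use_case, fields in USE_CASES.items():
        score = use_case_fitness(record_age_years, HALF_LIFE_YEARS, fields)
        print(f"{use_case}: {score:.2f}")
```

With these assumed values, the same ten-year-old record scores well for a greeting that needs only a first name and much worse for a mailing that needs a current address, even though its "age" is identical in both cases.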