The Games We Play with Data: A Record-Breaking Look at Data
Over the past few weeks, it would have been hard to live anywhere in sight of a television or a web browser and not have some contact with the Summer Olympics in Rio. Like many people, I follow the sports I played in college (swimming, water polo) and some that I find interesting (gymnastics), as well as a reasonable dose of those I barely understand (canoe sprinting, race walking?). Of course, there is data everywhere: times, distances, world records… lots of data. We soak it all in and just accept it, right? Let’s take a slightly closer look at some of the dynamics at play (pun intended).
Organizing Data: Who Goes First?
The first thing we generally do with data is organize it. That very act implies some concept of what “organization” actually means. By way of example, everyone knows that Olympic Games start with Opening Ceremonies. Although the ceremonies vary widely, one thing that normally stays the same is the Parade of Nations, where athletes enter the venue. The normal convention is to have Greece enter first, followed by all of the remaining guest countries in largely alphabetical order, finishing with the host nation. For example, in 2016, Afghanistan was second after Greece and Zimbabwe was number 205, just before a special Refugee Olympic Team and then the host nation, Brazil.
In Beijing, many people thought that the Chinese had decided to break this tradition. The Chinese ceremony still had Greece first and China last, but in between there seemed to be a rather random order. For example, Malaysia was one of the first few countries, while Singapore was well toward the end. What was actually going on has to do with what is called collation.
Data is often put in order according to some collating sequence (ASCII and Unicode are common computer collation sequences, for example). When we alphabetize, we are ordering data in a sequence guided by an accepted collation sequence, which in English is the Roman alphabet. Many other languages also use the Roman alphabet, but Chinese certainly does not, primarily employing an ideogrammatic writing system based on tens of thousands of characters that largely convey meaning, not sound.
When we put things in order according to the Chinese writing system, we put those that are simpler to write (fewer strokes) first, while those with more complexity come later. This collation is the equivalent of “alphabetizing” in Chinese, and it was the basis for the ordering of the Beijing opening ceremony. Malaysia starts with the character “马”, which is very simple to write, so it went toward the beginning, while Australia starts with “澳”, which is much more complex. As a result, Australia was number 202 of the 204 countries in the Parade of Nations, while Malaysia was number 10! Imagine trying to analyze the Parade of Nations over time: since the underlying collation function has changed, how would that change our perspective or the nature of the conclusions we might reach?
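The difference between the two orderings can be sketched in a few lines of Python. The country names are real, but the stroke counts below are filled in by hand purely for illustration; a real system would rely on a standard collation table rather than a hard-coded map:

```python
# Sort the same country names under two different collations:
# raw Unicode code points vs. stroke count of the first character.
names = ["澳大利亚", "马来西亚", "新加坡"]  # Australia, Malaysia, Singapore

# Python's default string sort compares Unicode code points.
by_codepoint = sorted(names)

# Hand-filled stroke counts (illustrative only, not from a library).
strokes = {"澳": 15, "马": 3, "新": 13}
by_strokes = sorted(names, key=lambda name: strokes[name[0]])

# The two "alphabetical" orders disagree completely.
print(by_codepoint)  # 新加坡, 澳大利亚, 马来西亚
print(by_strokes)    # 马来西亚, 新加坡, 澳大利亚
```

Same strings, same sort function, two entirely different orders: the collation function, not the data, decides who goes first.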
The choices we make when we choose to “organize” data can have a big impact on how we see things, how we find what we are looking for, and how we compare disparate pieces of information. Some storage methods are very good at writing data efficiently but poor at modifying it, because of their dependence on a single collating sequence. Other systems offer multiple “keys,” which essentially give the benefit of organizing the data simultaneously according to several collation sequences.
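The “multiple keys” idea can be sketched the same way: keep one copy of the data and derive as many orderings as there are questions. The medal counts here are Rio 2016 figures; the record layout itself is just one illustrative choice:

```python
from operator import itemgetter

# One copy of the data (Rio 2016 medal counts)...
records = [
    {"country": "USA", "gold": 46, "total": 121},
    {"country": "GBR", "gold": 27, "total": 67},
    {"country": "CHN", "gold": 26, "total": 70},
]

# ...organized simultaneously under two different "keys".
by_gold = sorted(records, key=itemgetter("gold"), reverse=True)
by_total = sorted(records, key=itemgetter("total"), reverse=True)

print([r["country"] for r in by_gold])   # USA, GBR, CHN
print([r["country"] for r in by_total])  # USA, CHN, GBR
```

Note that the two rankings disagree about second place; which collation you pick changes the story the data tells.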
This simple aspect of what is considered “order” and what is considered “out of order” is one of the fundamental issues in looking at data which has never been seen before. How do we order something we don’t understand?
Of course, in a Big Data world, such problems present themselves with much larger sets of data and with much more complex needs for querying, organizing, and making sense of the data.
One of the most fundamental decisions in the discovery and curation of data is whether, and how, to organize it at the outset. With extremely large, highly dynamic sets of data, failure to consider the issue of collation can lead to inefficient processes, or even completely undermine the ability to ask certain types of questions after the fact.
Data about Value: Extraordinary commodities
We all know that the athletes in first place get gold medals, followed by silver, then bronze. This tradition is based on the relative value of the three metals (gold is always more valuable than silver, etc.). We can therefore see the “nominal value” of a medal immediately, even though the real value will fluctuate based on the amount of gold or silver it contains as well as current commodity pricing. (There is about $600 worth of gold in this year’s gold medal, and about half that value in a silver medal.)
But what is the real value? Should we take into account total income to the athlete, including awards, sponsorships, and so on? Should we consider contribution to the sport? Some would argue that the value of the medal lies in the way the winning athlete uses the victory to do some greater good, like mentoring or inspiring others or calling attention to an important cause. Clearly there is a simple way to put a monetary value on the medal, even a “time value of money” way to assess the entire revenue stream associated with winning. But most would argue that all of these methods fall short of capturing the total “value” of winning an Olympic medal.
In data, there are many ways to assess and reassess value. One method is factor analysis, in which we perform empirical tests to determine the attributes of the data most strongly associated with a particular outcome. Then, by scaling those attributes, we can understand the relative value of newly discovered or curated data. Such an approach works very well with quantitative data, but it is problematic and often subjective for qualitative data or for mixed quantitative and qualitative data.
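Factor analysis proper calls for specialized statistical tooling, but the underlying idea, ranking attributes by how strongly they track an outcome, can be sketched with nothing more than a correlation measure. All of the numbers below are invented for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical outcome (1 = adopted the product) and candidate attributes.
outcome = [1, 0, 1, 1, 0, 1, 0, 1]
attrs = {
    "visits": [9, 2, 8, 7, 1, 9, 3, 8],
    "tenure": [5, 4, 6, 5, 5, 4, 6, 5],
}

# Rank attributes by strength of association with the outcome.
ranked = sorted(attrs, key=lambda a: abs(pearson(attrs[a], outcome)),
                reverse=True)
print(ranked)  # strongest predictor first
```

In this toy data, site visits track adoption closely while tenure carries no signal at all, so the empirical test tells us which attribute of the data is worth paying for.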
For qualitative data or mixed-type (heterogeneous) data sets, one way of understanding value is in the context of a particular use case. For example, if we were assessing the value of a new data set for understanding customer behavior, we might assess value differently if we were trying to predict the likelihood of adopting a new product or service than if we were using the same data to understand customer retention. In such cases, it is important to have a consistent frame for measuring the value of mixed data or data that has never been seen before.
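One simple way to keep that frame consistent is to score the same data set under explicit, use-case-specific weights. Everything below, the attribute scores, the weight values, and the use-case names, is invented for illustration:

```python
# Quality scores for one (hypothetical) new data set, on a 0-1 scale.
attributes = {"recency": 0.9, "coverage": 0.6, "history_depth": 0.3}

# Each use case weights the same attributes differently (assumed weights).
weights = {
    "new_product_adoption": {"recency": 0.6, "coverage": 0.3, "history_depth": 0.1},
    "customer_retention":   {"recency": 0.2, "coverage": 0.3, "history_depth": 0.5},
}

# The same data set earns a different value score per use case.
scores = {use_case: sum(attributes[k] * w[k] for k in attributes)
          for use_case, w in weights.items()}
for use_case, score in scores.items():
    print(f"{use_case}: {score:.2f}")
```

Here a fresh but shallow data set scores well for predicting adoption and poorly for retention: the number attached to “value” depends entirely on the declared use case, which is exactly why the frame must be made explicit.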
To understand the value of data, it is useful to consider two dimensions: Character and Quality. The character of the data speaks to what is in the data set, including volume, variety, veracity, and value. Measuring the quality of new data can be complex, especially when there is no standard against which to compare, but it should always be done in a way that is as consistent and reproducible as possible.
The Past and the Future: Fast pools and skimpy bathing suits
Once data is discovered, organized, and understood in terms of quality and character, we can start to infer meaning. One way to do so is longitudinal analysis, in which we look at a set of data over time to understand behavior.
In swimming, one of the sports I know best, a lot goes into who swims in which lane. In general, the faster swimmers earn the right to swim in the center of the pool, where surface waves dissipate more quickly. The reduced turbulence offers a slight advantage. Apart from that, you might think everything else comes down to swimming faster than everyone else… but you would be wrong.
Other factors greatly influence the outcome of swimming events. For example, some pools are considered very “fast” because of their design: deeper water, lane lines engineered to absorb wave energy, and other features. Both Beijing and Rio had extremely fast pools.
Bathing suits themselves have undergone a massive transformation over the last few Olympics. Many remember the very skimpy suits of old, the nearly full-body suits of a few years ago, and the modern suits, which are somewhere in between. Each of these designs was adopted because of factors that purport to let the swimmer go just a bit faster while staying within regulations. (When I was in high school, our practice suits had stripes and our competition suits were a solid color; I once convinced someone that solid colors are faster. Not true, but funny at the time.)
So if swimming (or any sport) is about continuously improving, breaking records, and doing better than before, how do we reconcile the fact that there may be factors to consider before simply comparing who swam the fastest time? If we really wanted to know who the better swimmer was, could we look back at the data and run a multi-factor analysis that rationalized the suits, pools, and other factors to figure out who really swam best? The answer is yes. Such an analysis would not change the stats as they are collected today, but it would provide a much more nuanced understanding.
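A minimal sketch of what such a rationalization might look like is below. The adjustment factors are pure invention; a real analysis would estimate them empirically from the full field of results in each era:

```python
# Assumed speed multipliers per (pool, suit) environment, relative to a
# baseline era. These numbers are made up for illustration only.
era_factor = {
    ("slow_pool", "classic_suit"): 1.000,  # baseline era
    ("fast_pool", "full_body"):    0.985,  # assumed pool + suit advantage
    ("fast_pool", "modern_suit"):  0.990,
}

# (swimmer, raw time in seconds, environment) -- hypothetical swims.
swims = [
    ("A", 48.10, ("slow_pool", "classic_suit")),
    ("B", 47.80, ("fast_pool", "full_body")),
]

# Normalize each time back to the baseline era before comparing.
adjusted = [(name, t / era_factor[env]) for name, t, env in swims]
fastest = min(adjusted, key=lambda entry: entry[1])
print(fastest[0])  # B has the faster raw time, but A wins after adjustment
```

The record book would still list B's raw time, but the adjusted comparison suggests A produced the better swim; that gap between the stat and the story is exactly what longitudinal analysis has to account for.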
When looking at data over time, it is extremely important to consider how the environment and the nature of the data generation itself may have changed, in order to reach the most nuanced understanding of behavior. As data becomes more heterogeneous, such analysis becomes ever more critical to overall understanding.
Looking at the data around us with an eye to what we are being asked to assume, and to how we should interpret what we see, can provide a far more nuanced understanding. Data itself can be useful or misleading depending on how it was created and on what factors influenced its change over time. Finding meaning in data often starts with the realization that it’s never “just about the data.”