Organizing Data: Getting It Right
For many people, January is a time of resolutions, and resolutions to do better or to get organized are common. While resolutions about getting organized are familiar, we eventually realize that there is nuance. On the one hand, we have a notion of what “organized” looks like (a clean desk, a tidy garage) or feels like (less drama with friends and co-workers, more focus on goals that really matter). On the other hand, getting organized is usually much more difficult to accomplish than it seems. Things change. It’s not just our perception that things seem to get disorganized as we attempt to order them. The reality is that increasing entropy (a measure of disorder) is the natural tendency of the universe. It takes energy to keep organization in place. In some respects, organization is unnatural. Without that exertion of energy, things revert to a lower-energy, more disorganized state. Of course, there are huge benefits to organization, but it takes effort.
As it turns out, organization is even trickier when we look at data. Getting data “organized” may seem simpler than in real life (determine rules, apply rules, repeat), but things get more complicated when we look at them more closely.
In the beginning, there was chaos.
There is a famous quote attributed to the I Ching, the classic book of Chinese wisdom: “Before a brilliant person begins something great, they must look foolish to the crowd.” Sometimes, this aphorism is restated as “Before achieving something truly great, there must be chaos.” Either way, we get the sense that there is a “before,” which in the case of organizing is the precious time before decisions are made about how to proceed and how to organize. Often, small decisions can have a very big impact.
In the early days of computers, databases had fixed fields to hold specific types of information. There would be some sort of “dictionary” or documentation that described the structure of the data (sometimes called the ontology or metadata) to avoid any misuse of the structure (metadata violations). Early programming languages, such as COBOL, required memory to be organized before it was used. Many of these same conventions persist in part today. One might conclude that with such structure, everything was very well organized. Actually, reality was different.
Consider names and addresses, a very common thing to store in databases. The database might be organized to hold first (given) name, last (family) name, address, city, state, and postal code. Sometimes, the metadata was dictated by physical limitations or convention (for example, addresses were typically 35 characters because printers output 10 characters to the inch, and address labels were 3.5 inches wide).
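As a minimal sketch of what such a fixed-field layout might look like, the following Python packs a record into fixed-width fields and slices it back out. Only the 35-character address width comes from the arithmetic above (10 characters per inch times 3.5 inches); the other field names and widths are assumptions chosen purely for illustration.

```python
# Hypothetical fixed-width layout. Only the address width is derived from the
# text (10 characters per inch * 3.5-inch label); all other widths are assumed.
CHARS_PER_INCH = 10
LABEL_WIDTH_INCHES = 3.5
ADDRESS_WIDTH = int(CHARS_PER_INCH * LABEL_WIDTH_INCHES)  # 35

LAYOUT = [
    ("first_name", 15),
    ("last_name", 19),
    ("address", ADDRESS_WIDTH),
    ("city", 20),
    ("state", 2),
    ("postal_code", 10),
]
RECORD_WIDTH = sum(width for _, width in LAYOUT)


def pack_record(values: dict) -> str:
    """Pad each field to its fixed width; reject values that do not fit."""
    parts = []
    for name, width in LAYOUT:
        value = values.get(name, "")
        if len(value) > width:
            raise ValueError(f"{name} exceeds {width} characters: {value!r}")
        parts.append(value.ljust(width))
    return "".join(parts)


def unpack_record(record: str) -> dict:
    """Slice a fixed-width record back into named fields."""
    fields, offset = {}, 0
    for name, width in LAYOUT:
        fields[name] = record[offset:offset + width].rstrip()
        offset += width
    return fields
```

Rejecting an oversized value, rather than silently truncating it, is one possible design choice; either way, the width decision made up front constrains every record written afterward.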
Over time, we start to notice the exceptions. What happens when there is a “second” address line, such as an apartment or suite number, or a middle initial? For the address, we add Address2, also 35 characters. For the initial, we add a field and make sure the combined length of first name, middle initial, last name, and the separating spaces doesn’t exceed 35 characters.
Closer inspection of our example reveals even more nuance. For example, this structure is potentially very U.S.-centric. What about non-U.S. addresses? Postal code still works, but now we need additional fields for different address structures (e.g. provinces) and name structures (e.g. patronymics). What happens when printers can print more characters per inch, or labels come in metric sizes?
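The length rule for names can be sketched as a simple check. This is an illustration only: the function names are hypothetical, and it assumes the name prints on one line as first name, middle initial, and last name separated by single spaces.

```python
MAX_LINE = 35  # label line width from the text


def name_line(first: str, last: str, middle_initial: str = "") -> str:
    """Join the name parts with single spaces, skipping an empty initial."""
    parts = [first, middle_initial, last]
    return " ".join(p for p in parts if p)


def fits_label(first: str, last: str, middle_initial: str = "") -> bool:
    """True when the assembled name line fits within the label width."""
    return len(name_line(first, last, middle_initial)) <= MAX_LINE
```

Note that adding the middle initial costs two characters (the initial and its space), so a name that fit before the new field existed may no longer fit afterward.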
Over time, there are metadata violations in the use of our data. These variances imply shortcomings in our original design. For example, we may observe suffixes in the surname field (such as “Jr.”), suggesting we omitted a nuance in our ontology. We can change the ontology. However, we must go back and fix the pre-existing data and all of the systems and processes that use it. It quickly becomes clear how entropy always finds a way into the picture.
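One way to gauge the size of such a fix is to scan the surname field for suffixes before changing the ontology. The sketch below rests on several assumptions: the suffix list is short and non-exhaustive, the field names `last_name` and `suffix` are hypothetical, and real name data would need far more careful handling.

```python
import re
from typing import Dict, List, Tuple

# Assumed, non-exhaustive list of suffixes; a real list would be curated.
SUFFIX_PATTERN = re.compile(r",?\s+(Jr\.?|Sr\.?|II|III|IV)$", re.IGNORECASE)


def split_suffix(surname: str) -> Tuple[str, str]:
    """Return (surname_without_suffix, suffix); suffix is '' if none found."""
    text = surname.strip()
    match = SUFFIX_PATTERN.search(text)
    if not match:
        return text, ""
    return text[:match.start()], match.group(1)


def migrate(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Copy each record, moving any suffix into a separate 'suffix' field."""
    migrated = []
    for record in records:
        surname, suffix = split_suffix(record["last_name"])
        migrated.append({**record, "last_name": surname, "suffix": suffix})
    return migrated
```

Even this small migration only repairs the stored records. Every system and process that reads the surname field still has to learn about the new field.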
Of course, modern databases are far more complex, often encompassing more flexible, self-defining ontologies and supporting capabilities such as ETL (Extract, Transform, Load). Modern database languages allow for looser data typing and self-defining processes, which makes some of the work of describing the data about the data more natural. Nevertheless, we never fully escape the fundamental issue that, in order to curate and synthesize data, sooner or later we have to make some assumptions about the structure or content in the form of rules.
Be extremely thoughtful when determining the initial structure of new data. Small decisions often have an extremely large impact as that data is put into use.
Reposing data: Preparation is everything.
For many years, I have volunteered as an Emergency Medical Technician. Among the many lessons I learned is that real life rarely imitates training, but training is essential. When we arrive on the scene, there is no substitute for being prepared. Everything is in a place. The equipment likely needed for a traffic incident is stored mainly on the outside of the ambulance or rescue truck, where we can get to it from the road. Equipment for performing CPR is kept on the inside, where we can get to it quickly en route to the hospital. Sometimes, I wish we were as careful with where we store our data.
Unfortunately, reality is often different. For example, certain data environments are extremely useful for storing large amounts of data, but less efficient for reorganization after the fact. Other data environments are very popular and commonly used, but not necessarily the best for certain types of analysis. It is extremely important to balance the needs of supporting an IT infrastructure with the needs of discovery, curation, and synthesis of data for analytical purposes.
Consider a very common data science problem: understanding customer behavior in the face of a potential new product launch. Imagine that you are asked to construct an optimistic, typical, and pessimistic scenario for the impact of a product launch in a new market segment.
The analytical environment in this case is a highly dynamic, ongoing process. Involved parties would have constructed some sort of business case for the new product, which might include financial projections, project plans for development, testing and launch, and acceptance criteria. In Agile development, there would be cross-functional teams working with user stories and rapidly changing data sets intended for testing and validation. Other parties may have conducted focus groups to understand end-user behavior and preference. All of these activities collect and consume data. The reality of the situation is that all of that data is rarely accessible to most of the parties creating and consuming it. Data becomes compartmentalized out of necessity.
There are multiple strategies for conducting such an analysis (e.g. extracting relevant data to one common environment, attempting to connect to data in situ via external reference, or requesting extracts from involved parties to suit the analytical need). This scenario may seem overly complex, but in reality it is far simpler than a typical real-world situation. The essential skills for successfully navigating such a challenge go far beyond analysis. They include an understanding of the work at hand, the potential impact on work in progress, and probably a bit of diplomacy, all to get the necessary information organized in a way that is useful.
The new normal: Everything can rarely be seen anywhere.
Another consideration in the modern world of data science is governance and regulatory compliance. Put simply, all over the world there are laws being introduced that relate to where data may be stored, how data may be used, and how subjects of the data must be informed. These restrictions relate to data sovereignty (country-specific concerns about data leaving the place of origin), privacy, and various forms of data misuse (e.g. identity theft, misappropriation of intellectual property). Organizations around the world are rightfully placing a significant amount of focus on data governance, not only from the perspective of bringing outside data into the organization, but also focusing on the location and use of data within the systems and processes controlled by the enterprise.
Data scientists often use the word “curation” to describe the process of storing data in a place that makes sense. For example, we try to keep master data in a centralized location where enterprise systems and processes can access it. We try to cleanse and transform data as early on in the process as appropriate so that downstream processes get maximum benefit. The reality is that laws about where data may be stored and how it may be used often work contrary to efforts of curation. In order to comply with the requirements of regulation and governance, we are often forced to retain copies of information locally, to store intermediate data likely to decay while we wait for authorization to use it, and to engage in other practices contrary to reaching the best possible analytical purpose of the data.
Because governance and analysis of data can often be at odds, it is imperative that all parties work together. Lawmakers can only understand the unintended impact of legislation if those who foresee that impact are appropriately vocal in sharing their collected wisdom. Similarly, people using data that is regulated or subject to internal governance restrictions must not work to “get around” restrictions but rather to understand those restrictions and take them into consideration when assessing what is both feasible and advisable from an analytical perspective. There is no silver bullet. Restrictions are likely to increase as the value and risk associated with data continue to change.
Organization is a perplexing phenomenon. It’s never done. It consumes energy and effort that must be repeated to remain effective. The rules change, requiring even more organization. Yet, without organized data we have chaos. Without organization, much of our data goes unused because it is not available in a way that makes it useful when needed. As with all New Year’s resolutions, we will revisit the issue again and again, hopefully better prepared each time. Happy New Year!