Why Structured Data Is Increasingly Important to Your Case
This article was originally published in Law360 on September 2, 2021
The role of structured data — the tidy but sometimes voluminous rows and columns of data tables often associated with spreadsheets or financial ledgers — in the legal context has been overlooked, underappreciated or at best secondary to its unstructured data counterpart that includes emails, files and agreements.
Simultaneously, dependence on structured data is of fundamental importance as an operational necessity, a competitive advantage and, in some instances, an existential risk. The inevitable convergence of these two opposing trends foreshadows an impending transformation within the legal industry.
In this article, we make the case for reevaluating the concepts of undue burden and proportionality in light of the proliferation of easily accessible structured data. While scouring unstructured documents may still be necessary, we argue that structured data has a larger role to play, potentially enabling parties in civil litigation to realize efficiencies without losing accuracy.
The Proliferation of Data
As the parable goes, during the internet's infancy, a consultant performed an assessment of her client's business and provided the client with recommendations for future operating strategies.
Frustrated, the client responded that the recommendations seemed relevant only to "internet companies" and that his was a "brick-and-mortar outfit."
The consultant replied, quizzically, "All companies are internet companies now."
Decades later, a similar dynamic now applies to data.
Whether multinational banks with complex, far-reaching information technology operations or small retailers relying on software-as-a-service accounting platforms and using social media for targeted advertising, all companies store, consume and generate volumes of data that would have been unthinkable only a few years ago.
Such data are not limited to a business's core operations. In addition to the daily use of business software, company personnel generate a persistent, electronic impression of their activities each time they visit a website, use an internet-connected device, or swipe a security badge.
Simply put, nearly every action taken by a modern company leaves a digital trail. This trail includes not only unstructured data, such as media, emails and memos, but also tabular structured data, which is typically associated with business applications.
A small sampling of statistics related to the growth of data includes the following:
- The International Data Corporation expects the amount of data worldwide to reach 175 zettabytes by 2025, a compound annual growth rate of 61%.
- 90 ZB of this data will be from internet-connected devices in 2025.
- In 2020, every person generated 1.7 megabytes of data per second.
- By 2025, 20% of data will be structured.
- Data interactions, defined as the creation, capture, copying and consumption of data, went up by 5,000% between 2010 and 2020.
As the volume and prevalence of data have grown, so too have the tools and skills for using such data. Visualization tools abound, availing nontechnical employees of analytical capabilities that have historically been the province of IT practitioners.
Graphical extract, transform, load tools enable users to merge and manipulate a vast array of disparate data sources without writing code.
Web-based, serverless database platforms provide analysts with the means to query large, complex datasets without the need to acquire and configure hardware, resulting in low startup costs and quick turnaround.
Standard, open data exchange formats, such as Extensible Markup Language and JavaScript Object Notation, have made sharing data relatively seamless.
Where once the implementation of artificial intelligence and predictive models was confined to a small group of specialists, drag-and-drop data science platforms have greatly simplified their use. According to LinkedIn, data scientist was the No. 3 emerging job in 2020, with 37% growth from the prior year.
Data is everywhere, and it is increasingly easy to use.
Structured Data in Disputes and Investigations
Discovery in disputes and investigations has traditionally concentrated on unstructured data. Originally centered around paper documents, discovery in the past 20 years has seen an increase in electronic documents.
Such data includes, but is not limited to, emails, chat and text messages, loose files and office documents in Microsoft Word and PowerPoint, web content, and data on collaboration tools such as Slack, Zoom and Microsoft Teams.
Unstructured and semistructured data sources have been sensationalized largely due to individuals' and employees' undisciplined communication with others across various channels. Unfortunately, no matter how many times these communication sand traps make headlines, individuals continue to post and send messages they wouldn't want discovered in disputes or investigations.
But many cases or investigations have been won, lost or turned based on structured data rows and columns. Structured data repositories reside in the general ledger, or in finance, accounting, enterprise resource planning, trading, sales and human resource applications. These data repositories have historically been ignored, misunderstood, discounted or labeled of little value in disputes and investigation matters.
There are a myriad of reasons why these repositories have not been accounted for, including but not limited to:
- A lack of understanding and knowledge about structured data sources;
- Agreement by the parties that the structured data sources do not contain relevant information and would not be subject to discovery, resulting in a head-in-the-sand mentality;
- Assumptions that structured data information is difficult to understand and burdensome to extract, obtain and produce;
- Assumptions that discovery requires special structured data expertise, at significant expense, that the matter does not warrant;
- Failure of project team members to understand or communicate with the subject matter experts responsible for the structured data repositories;
- Inability to interpret or analyze structured data efficiently and effectively;
- Lack of knowledge to ask the appropriate questions about the existence of structured data, its accessibility and relevance to the specific matter.
Fast-forward to today, and data is everything to an organization. How it is created, stored, utilized, analyzed, repurposed, retained and protected is paramount.
Data, including largely structured data, is viewed as a competitive advantage and a strategic weapon. The significance of all data, whether unstructured, semistructured or structured, in a dispute or investigation cannot be overstated.
In any dispute or investigation, the goal is to piece together the puzzle to tell the story in the most favorable or defensible light. Legal teams need to tell the complete story. In most instances, this is impossible to do without structured data.
Fortunately, several developments over the past 20 years ensure structured data is now available to complete the story. These include the following:
- Disputes and investigations professionals have a strong overall knowledge of technology and awareness of all data formats. They know where data lives, how to request it and how to use it.
- The elevation of information technology and technical professionals across organizations has bolstered recognition of their overall importance to an organization's success.
- Organizations are utilizing structured data's strategic and competitive advantages, and understand structured data's role for winning in the marketplace.
- Legal professionals recognize the need for structured data to tell a story and prove or win a case.
- Third-party applications have proliferated, and there is a reduced emphasis on proprietary and custom-developed applications.
- Web-based applications have expanded and are more accessible, and there is increased confidence that web data exists and is available and obtainable.
The historical arguments of burden and proportionality as related to structured data were thin even at the time they were made. Today, these arguments are virtually baseless other than in rare circumstances. If companies and legal teams aren't asking for and utilizing structured data in a dispute or investigation matter, they might miss key elements of the story.
Making Use of Structured Data
Given the rate at which structured data is being created across industries, it is not surprising that a similar, albeit slower-moving trend is emerging in the litigation ecosystem.
Structured data offers a wealth of information from which to draw insight. But it is important to recognize how to best use structured data once its relevance becomes apparent. Aside from obtaining access to and copies of the structured data itself, the ability to interpret the underlying data is a critical step for which there are very few available shortcuts.
Pragmatically, it is difficult to work with structured data in its raw form without some frame of reference. Structured data is often not easily understood or self-evident. Instead, structured data is frequently accompanied by complementary nuggets of information commonly referred to as metadata.
Metadata comes in many forms. For the most part, it serves as any information that describes or explains a set of data.
To realize structured data's benefits, it is critical to assess if any of the following articles of metadata are available.
Data Dictionary
A data dictionary is a document that describes every field within a given set of data, including the values each field is allowed to contain. A data dictionary works to answer questions involving what each field means and what information each field represents.
This is an important document to consider during the scoping phase as a potential source of descriptive information to substantiate discovery requests.
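As a minimal sketch of how a data dictionary can be put to work, the Python snippet below checks a handful of records against a dictionary and flags any field or coded value that the dictionary does not explain. Every name here (the fields, the code lists, the sample rows and the find_undocumented_values helper) is hypothetical and chosen only for illustration.

```python
# Hypothetical data dictionary: field name -> description and allowable values.
DATA_DICTIONARY = {
    "txn_type": {
        "description": "Type of ledger transaction",
        "allowed": {"DEBIT", "CREDIT"},
    },
    "status": {
        "description": "Processing status of the transaction",
        "allowed": {"POSTED", "PENDING", "REVERSED"},
    },
    "amount": {
        "description": "Transaction amount in U.S. dollars",
        "allowed": None,  # free-form numeric field, no fixed code list
    },
}


def find_undocumented_values(rows, dictionary):
    """Return (row_number, field, value) for each value the dictionary does not explain."""
    issues = []
    for row_number, row in enumerate(rows, start=1):
        for field, value in row.items():
            entry = dictionary.get(field)
            if entry is None:
                # The field itself is missing from the dictionary.
                issues.append((row_number, field, value))
            elif entry["allowed"] is not None and value not in entry["allowed"]:
                # The field is documented, but this value is not on its code list.
                issues.append((row_number, field, value))
    return issues


# Hypothetical sample of produced records.
sample_rows = [
    {"txn_type": "DEBIT", "status": "POSTED", "amount": "125.00"},
    {"txn_type": "XFER", "status": "PENDING", "amount": "40.00"},
    {"txn_type": "CREDIT", "status": "POSTED", "amount": "15.50", "batch": "A7"},
]

issues = find_undocumented_values(sample_rows, DATA_DICTIONARY)
for issue in issues:
    print(issue)
```

In this sample, the XFER code in the second record and the batch field in the third are flagged. Each is the kind of gap that could prompt a follow-up question to the producing party.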
Data Lineage or Mapping
A data lineage or mapping document helps illuminate how data has migrated from one interface or dataset to another.
For example, if a text field describing the month of the year is converted to a numeric field, a data lineage or mapping document answers questions such as how a source field is mapped to its target field, or which original source was used to populate a given field in a dataset.
This assists in tracing the data and provides a better understanding of it as it moves across different interfaces or undergoes a set of transformation steps, such as parsing, enrichment or decoding.
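The month example can be sketched in a few lines of Python. This is an illustration only: the MONTH_MAPPING structure, the field names and the two helper functions are assumptions, not a standard format for lineage documentation.

```python
# Hypothetical mapping document: how a source field feeds a target field.
MONTH_MAPPING = {
    "source_field": "month_name",  # text field in the source system
    "target_field": "month_num",   # numeric field in the target dataset
    "rule": {
        "January": 1, "February": 2, "March": 3, "April": 4,
        "May": 5, "June": 6, "July": 7, "August": 8,
        "September": 9, "October": 10, "November": 11, "December": 12,
    },
}


def apply_mapping(row, mapping):
    """Return a copy of the row with the target field populated and its lineage noted."""
    source_value = row[mapping["source_field"]]
    result = dict(row)
    result[mapping["target_field"]] = mapping["rule"][source_value]
    # Record where the target value came from so it can be traced later.
    result["_lineage"] = (
        f'{mapping["target_field"]} <- {mapping["source_field"]} ({source_value!r})'
    )
    return result


def trace_back(target_value, mapping):
    """Return the source value(s) that the mapping rule converts to target_value."""
    return [src for src, tgt in mapping["rule"].items() if tgt == target_value]


converted = apply_mapping({"invoice_id": "INV-001", "month_name": "March"}, MONTH_MAPPING)
print(converted["month_num"])
print(converted["_lineage"])
print(trace_back(3, MONTH_MAPPING))
```

The same mapping serves both directions: it documents how the target field was populated, and it lets an analyst work backward from a value in the produced data to the source value behind it.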
Entity Relationship Diagram
If a data dictionary describes each field in isolation, think of the entity relationship diagram, or ERD, as a visual interpretation of how data fields interrelate with one another.
In this context, an entity refers to people, objects or concepts within a set of data. An ERD seeks to answer questions of how correlations can be made between datasets.
ERDs assist in circumstances where there is a value that appears across multiple datasets and it is not clear in the data if the identifier or coded value is in fact the same.
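To illustrate, the Python sketch below assumes an ERD has documented that a cust_id field in a trades dataset refers to the customer_id field in a customer master list. The datasets, field names and link_trades_to_customers helper are all hypothetical.

```python
# Hypothetical datasets that share an identifier under different field names.
customers = [
    {"customer_id": "C001", "name": "Acme Corp"},
    {"customer_id": "C002", "name": "Birch LLC"},
]
trades = [
    {"trade_id": "T1", "cust_id": "C001", "quantity": 100},
    {"trade_id": "T2", "cust_id": "C002", "quantity": 250},
    {"trade_id": "T3", "cust_id": "C009", "quantity": 75},
]

# Assumed relationship taken from the ERD:
# trades.cust_id refers to customers.customer_id.
RELATIONSHIP = ("trades", "cust_id", "customers", "customer_id")


def link_trades_to_customers(trade_rows, customer_rows, relationship):
    """Attach the customer name to each trade; collect trades with no matching customer."""
    _, child_key, _, parent_key = relationship
    customers_by_id = {row[parent_key]: row for row in customer_rows}
    matched, orphans = [], []
    for row in trade_rows:
        customer = customers_by_id.get(row[child_key])
        if customer is None:
            orphans.append(row)  # identifier not found in the customer list
        else:
            matched.append({**row, "customer_name": customer["name"]})
    return matched, orphans


matched, orphans = link_trades_to_customers(trades, customers, RELATIONSHIP)
print([row["trade_id"] for row in matched])
print([row["trade_id"] for row in orphans])
```

Trades that fail to link, such as T3 in this sample, are worth raising with the data owner: the identifier may come from a different system, or the customer list may be incomplete.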
Data Schema or Layout
A listing of all the fields in a dataset along with their data types, field sizes and whether each field is required is known as a data schema or layout. A data schema helps address what types of information should be stored in each field. This assists in circumstances where it is important to understand the ordering of fields in a dataset or the data types of specific fields.
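Where the data sits in a relational database, a layout can often be read directly from the database itself. The Python sketch below uses the standard library's sqlite3 module and SQLite's PRAGMA table_info statement as one example; the ledger_entries table and its columns are hypothetical, and other database systems expose the same information through their own catalog queries.

```python
import sqlite3

# Build a small in-memory database with a hypothetical ledger table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger_entries (
        entry_id   INTEGER      NOT NULL,
        account    VARCHAR(20)  NOT NULL,
        posted_on  DATE,
        amount     DECIMAL(12,2) NOT NULL,
        memo       VARCHAR(255)
    )
    """
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
layout = [
    {
        "position": cid + 1,          # ordering of the field in the table
        "field": name,
        "type": col_type,             # declared data type, including any size
        "required": bool(notnull),    # True when the column is NOT NULL
    }
    for cid, name, col_type, notnull, _default, _pk in conn.execute(
        "PRAGMA table_info(ledger_entries)"
    )
]

for column in layout:
    print(column)

conn.close()
```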
In an ideal legal matter involving structured data, most, if not all, of the metadata will be available to interrogate, review and support the analysis of the datasets of interest. With that said, it is more common for certain pieces of information to be absent, especially in circumstances where the documentation is sparse or just not available for review.
When structured data is without descriptive metadata, it is through interviews with key data owners or stewards that insights about the data can be gleaned. Additionally, reports that were created based upon structured data of interest offer a pathway to decipher and reverse-engineer the data while also serving to validate the completeness of the data being put forth.
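One way to use a report as a completeness check is to recompute its totals from the produced data and compare the two. The Python sketch below assumes a hypothetical report of quarterly totals and a hypothetical extract of transaction rows; the reconcile helper and all figures are invented for illustration.

```python
from decimal import Decimal

# Hypothetical totals taken from a report the business produced.
report_totals = {"2020-Q1": Decimal("1500.00"), "2020-Q2": Decimal("2300.00")}

# Hypothetical rows from the structured data extract.
extract_rows = [
    {"period": "2020-Q1", "amount": "1000.00"},
    {"period": "2020-Q1", "amount": "500.00"},
    {"period": "2020-Q2", "amount": "2000.00"},
]


def reconcile(rows, totals):
    """Return {period: extract_total - report_total} for periods that do not tie out."""
    extract_totals = {}
    for row in rows:
        period = row["period"]
        extract_totals[period] = extract_totals.get(period, Decimal("0")) + Decimal(row["amount"])

    differences = {}
    for period in sorted(set(totals) | set(extract_totals)):
        diff = extract_totals.get(period, Decimal("0")) - totals.get(period, Decimal("0"))
        if diff != 0:
            differences[period] = diff
    return differences


differences = reconcile(extract_rows, report_totals)
print(differences)
```

In this sample the first quarter ties to the report, while the second quarter falls 300.00 short, which would suggest that records are missing from the extract or that the report was built with logic not yet understood.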
Structured data is often found to be highly centralized and, in many cases, resides in a format that is not standard across different industries. As such, a level of care needs to be taken when extracting structured data as the data that might be relevant for a litigation matter is often part of a larger set of information. This larger set of information may take the form of:
- A production system that is constantly running and critical to a business's operations;
- A cloud data store that is being accessed in real time by multiple different consumers;
- An archived database that is offline and will require restoration; or
- A variety of other nuanced scenarios that may be common practice.
Given the variety of storage mechanisms for structured data, an additional piece of information that may offer further insight is the detailing of the methods or protocols by which the data extraction was performed. This insight ensures that both sides understand, and are comfortable with, the approach taken to obtain the structured data and can recognize any inherent limitations that may pose an issue down the line.
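As one possible illustration of documenting an extraction, the Python sketch below builds a short manifest for a delimited extract: where it came from, how it was pulled, which fields it contains, how many rows it holds and a SHA-256 hash of the file contents. The manifest layout, the build_manifest helper, the system name and the query text are assumptions rather than an established protocol.

```python
import csv
import hashlib
import io


def build_manifest(extract_bytes, source_system, method):
    """Summarize a CSV extract so both sides can confirm what was produced."""
    text = extract_bytes.decode("utf-8")
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)                 # first line holds the field names
    row_count = sum(1 for _ in reader)    # remaining lines are data rows
    return {
        "source_system": source_system,
        "extraction_method": method,
        "fields": header,
        "row_count": row_count,
        # The hash lets a recipient confirm the file was not altered in transit.
        "sha256": hashlib.sha256(extract_bytes).hexdigest(),
    }


# Hypothetical extract contents, held in memory to keep the example self-contained.
extract = b"trade_id,price,quantity\nT1,101.25,10\nT2,99.80,5\n"

manifest = build_manifest(
    extract,
    source_system="trading_db (hypothetical)",
    method="SELECT trade_id, price, quantity FROM trades WHERE trade_date >= '2020-01-01'",
)
for key, value in manifest.items():
    print(f"{key}: {value}")
```

A receiving party could recompute the row count and hash on its copy of the file and compare them with the manifest.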
The Convergence of Structured and Unstructured Data
Legal proceedings have historically been ideal venues for discerning complex evidence. The proliferation of data has only served to further memorialize recorded details ripe for explanation.
For example, transaction history, web and mobile interactions, and geolocation details that serve key business and regulatory functions may also contextualize a legal claim around an individual's behavior. At the same time, unstructured messaging data that enables organizational collaboration may convey a more palatable play-by-play relevant to situations when understanding one's state of mind is critical.
As a result, the explosion of data across structured transactional systems and unstructured communication systems has opened the door to new sources for electronically stored information both in the discovery phase of a dispute and throughout a regulatory investigation.
On the structured data side, responsive ESI may include systems that house data for accounting, human resources, marketing, and regulatory and compliance.
As an example, a regulatory system to ensure sanctions compliance may store application programming interface logs that record the instructions users send to initiate actions, such as executing a trade on an exchange. In addition to trade details around price and quantity, the internet protocol address from which the user sent the instruction would be useful for establishing a customer's location.
By augmenting this type of data with publicly available mappings of internet protocol addresses to countries, a strong inference can be made about where the user initiated the instruction. In a legal or regulatory context, this same data may establish a historical timeline of trading activity.
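The enrichment described above can be sketched in Python with the standard library's ipaddress module. Everything specific in the example is an assumption: the log fields, the users and the lookup table are invented, the address ranges are blocks reserved for documentation, and the country labels are placeholders. A real analysis would rely on a maintained geolocation dataset.

```python
import ipaddress

# Hypothetical lookup table of address ranges to countries.
# These ranges are reserved for documentation; the labels are placeholders.
IP_COUNTRY_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "Country A"),
    (ipaddress.ip_network("198.51.100.0/24"), "Country B"),
]

# Hypothetical API log entries for trade instructions.
api_log = [
    {"timestamp": "2021-03-01T14:07:30Z", "user": "u123", "action": "SELL",
     "price": "101.40", "quantity": 4, "ip": "198.51.100.23"},
    {"timestamp": "2021-03-01T14:05:09Z", "user": "u123", "action": "BUY",
     "price": "101.25", "quantity": 10, "ip": "192.0.2.44"},
    {"timestamp": "2021-03-01T14:09:12Z", "user": "u456", "action": "BUY",
     "price": "101.10", "quantity": 2, "ip": "203.0.113.7"},
]


def country_for(ip_text):
    """Return the country whose range contains the address, or None if unknown."""
    address = ipaddress.ip_address(ip_text)
    for network, country in IP_COUNTRY_RANGES:
        if address in network:
            return country
    return None


# Add the inferred country to each entry and order the entries chronologically.
# Timestamps share one ISO 8601 format, so sorting them as text sorts them by time.
timeline = sorted(
    ({**entry, "country": country_for(entry["ip"])} for entry in api_log),
    key=lambda entry: entry["timestamp"],
)

for entry in timeline:
    print(entry["timestamp"], entry["user"], entry["action"], entry["country"])
```

Addresses that fall outside the lookup table are left as unknown rather than guessed. An inference of this kind is also only as strong as the lookup data behind it, and tools such as virtual private networks can obscure a user's true location.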
As for unstructured data, in addition to the preservation of emails and text messages, there is a growing trend around the preservation of collaboration applications within an organization. Collaboration applications allow for group and private messaging and connectivity to plug-in applications including Slack, Dropbox, Microsoft Teams and Google Workspace, among others.
In a survey of industry and legal professionals conducted last year by eDiscovery Today, only 26.8% of respondents said they always or usually have mobile device or collaboration data in their cases, while 20.2% said they rarely or never have either of them.
The likely reason for such a low response is the proportionality or undue burden argument, where the extraction of data may be perceived as excessively time-consuming and costly.
However, courts are recognizing that production of collaboration data is proportional and not unduly burdensome if the requests and searches are limited and focused.
This is evident in the public record of Benebone LLC v. Pet Qwerks Inc., where the U.S. District Court for the Central District of California found in February that "requiring review and production of Slack messages by Benebone is generally comparable to requiring ... search and production of emails and is not unduly burdensome or disproportional to the needs of this case — if the requests and searches are appropriately limited and focused."
Therefore, the combination of increased usage, ease of focused extraction and narrative power has reinforced the need for unstructured data throughout legal proceedings.
While structured and unstructured data are often siloed, the marriage of system data with narrative-based stories may be the most persuasive strategy in a legal setting. In such a union, evidence grounded in structured data, or hard evidence, may be bolstered by unstructured anecdotal data and vice versa.
For example, in the widely reported Lehman Brothers Holdings Inc. v. Citibank NA dispute over the valuation of 30,000 Lehman-facing derivatives trades, which settled in 2017 for $1.74 billion, documented remarks in the unstructured data revealed pricing strategies that were consistent with valuations found in the structured data of key trading systems.
Data has become the ever-present background to modern business.
Historically, unstructured data that can be searched and then reviewed by attorneys to find evidence that will support their claims has been the focus of electronic discovery. However, growing data volumes and the prevalence of structured data alongside new tools and techniques have made the analysis and exchange of structured and unstructured data, individually and collectively, ubiquitous and increasingly user-friendly.
As the barriers to the use of data continue to shrink, its appearance in litigation and disputes will become ever more common and important. Those who spurn its use in discovery do so at their peril.