We live in a world of big data, which gets used in countless ways. "Big data," a term that gained momentum in the early 2000s, refers to data so large, fast, or complex that it is difficult or impossible to process using traditional methods. The way organizations manage and derive insight from it is changing how the world uses business information.
For example, big data is used to determine which ads you see when you’re online, which piece of equipment in a manufacturing facility is not running correctly, where you have traveled, or what you like to eat. With the increased ability to capture and store large amounts of data, businesses, hospitals, government institutions, and many others rely on this data to drive decisions such as product offerings, policymaking, where to build a fast-food restaurant, and of recent relevance, tracking COVID-19 testing and cases.
While the use of big data is not new, any organization that relies on its use to make decisions needs to consider the reliability of the data collected. As the saying “garbage in – garbage out” implies, if the data collected is not accurate, precise, and stable, then the resulting decisions have a higher risk of being flawed.
Accurate, Precise, and Stable
Measurements are accurate if they tend to center around the actual value of the measured entity; that is, the measured values deviate little from the true value. For example, when a hospital collects data (measurements) on the time to process lab requests from the emergency department (ED), the average processing time captured in the electronic medical record system matches (or is very close to) the actual (true) values.
Measurements are precise if they differ from one another by only a small amount. Using the same example of the time to process lab requests from the ED, the individual processing time values are close to one another.
Stability refers to the capacity of a measurement system to produce the same values over time when measuring the same sample. Consider a different example: weighing the amount of liquid hand sanitizer in a sample of bottles. The scale used to take the weight measurements generates the same (or very close) values over time (e.g., daily) for the same sample of bottles.
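The three properties above can be sketched numerically. Here is a minimal illustration in Python, using hypothetical repeat measurements of a single reference sample (the true value and readings are invented for illustration):

```python
# Sketch: quantifying accuracy, precision, and stability for a set of
# repeat measurements. All values below are hypothetical.
from statistics import mean, stdev

true_value = 30.0  # known reference processing time, in minutes (assumed)

# Repeated measurements of the same reference sample on two days
day1 = [29.8, 30.2, 30.1, 29.9, 30.0]
day2 = [30.1, 29.7, 30.3, 30.0, 29.9]

bias = mean(day1) - true_value    # accuracy: closeness to the true value
spread = stdev(day1)              # precision: agreement among repeats
drift = mean(day2) - mean(day1)   # stability: change between time periods

print(f"bias={bias:.2f}  spread={spread:.2f}  drift={drift:.2f}")
```

A measurement system with near-zero bias, small spread, and near-zero drift would qualify as accurate, precise, and stable under these working definitions.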
Big Data ≠ Good Data
But how can we ensure these large amounts of data are accurate, precise, and stable so we can minimize the risk of faulty decisions? Consider an approach called Measurement System Analysis (MSA), which has been fundamental in the continuous improvement domain for many years. The purpose of MSA is to qualify a measurement system for use by quantifying its accuracy, precision, and stability. While MSA is most often associated with Six Sigma project work, it can be of great value in ensuring the quality of your big data, a critical step that should precede any data-based decision making.
Traditionally, MSA calls for performing a small test of the measurement system with multiple people (operators) measuring a sample of items (parts). This approach works well in a process where measurements are taken by people (operators) in the process. However, in our current fast-paced world where data is being automatically collected in large amounts, taking a traditional MSA approach may not make sense.
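As a sketch of what that traditional setup looks like, the toy example below takes a hypothetical operators-by-parts study and separates variation into repeatability and reproducibility using a simplified average-and-range approach (all readings are invented):

```python
# Minimal Gage R&R-style sketch of a traditional MSA study: several
# operators each measure the same parts repeatedly. Data are hypothetical.
from statistics import mean

# measurements[operator][part] -> list of repeat readings
measurements = {
    "op1": {"partA": [10.1, 10.2], "partB": [12.0, 11.9]},
    "op2": {"partA": [10.3, 10.4], "partB": [12.2, 12.1]},
}

# Repeatability: average range of repeat readings (equipment variation)
ranges = [max(r) - min(r)
          for parts in measurements.values()
          for r in parts.values()]
repeatability = mean(ranges)

# Reproducibility: spread of operator averages (appraiser variation)
op_means = [mean(v for r in parts.values() for v in r)
            for parts in measurements.values()]
reproducibility = max(op_means) - min(op_means)

print(f"avg repeat range={repeatability:.2f}  operator spread={reproducibility:.2f}")
```

A full Gage R&R study would apply standard constants or an ANOVA to turn these into variance components; the point here is only the structure of the exercise, which becomes impractical when data is collected automatically at scale.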
Assessing Data Quality Without MSA
Depending on the nature of data collected, the source(s) where it’s collected, and the effort required to evaluate its quality, consider the following techniques as alternative approaches to ensure the quality of your big data before proceeding with data-based decision making.
Review Data Definitions (Operational Definitions)
In a service center environment, the amount of time it takes to complete one transaction or call, or Average Handle Time (AHT), is an important performance metric that drives staffing levels and costs and impacts customer service. However, how AHT is defined can vary. For example, should the time a customer waits on hold be included? Should a customer service representative's after-call work be included? Establishing one clear definition of AHT across your organization is critical to ensure the validity and relevance of your data, and that definition will ultimately drive how you use the data to make decisions. To capture a customer's full experience with the servicing function, you would include their time on hold, time selecting menu options, and time spent speaking with a representative. To calculate the organization's cost per call, however, you would include only the time the representative spends speaking with the customer plus their after-call work. Clarifying the purpose of your data collection efforts is essential, as that purpose influences how metrics should be defined.
Additionally, individual elements of a measure may be stored in different systems or reports. The customer’s wait time and time selecting menu options may be stored in an IVR system, while their talk time and conversation with a representative may be stored in a separate call recording system. Partnering with technology owners who know and understand what data is captured and where is critical to ensuring you collect the right data for your purpose.
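To make the distinction concrete, here is a minimal sketch showing how the same component times yield two different AHT values depending on which operational definition is chosen (field names and values are hypothetical):

```python
# Sketch: two operational definitions of Average Handle Time (AHT)
# computed from the same component times. Data are hypothetical.
calls = [
    # seconds spent in each phase of a call
    {"hold": 45, "menu": 30, "talk": 240, "after_call_work": 60},
    {"hold": 0,  "menu": 20, "talk": 180, "after_call_work": 90},
]

# Customer-experience view: everything the customer waits through
experience = [c["hold"] + c["menu"] + c["talk"] for c in calls]

# Staffing-cost view: time that occupies a representative
rep_cost = [c["talk"] + c["after_call_work"] for c in calls]

aht_experience = sum(experience) / len(experience)
aht_cost = sum(rep_cost) / len(rep_cost)
print(f"experience AHT={aht_experience}s  cost AHT={aht_cost}s")
```

The two averages differ even though the underlying events are identical, which is why the operational definition must be agreed on before the data is used for decisions.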
Know Your Data
During a process improvement project to reduce the time it takes for a hospital lab to process specimens from the emergency department (ED), the team determined that processing time data needed to be extracted from the hospital's electronic medical record (EMR) system. Working with the ED nurse manager as the team lead (Green Belt), we reviewed an Excel file with thousands of lab records. By using a formula to calculate the lab processing times, we learned that several of the processing time values were negative. This raised an immediate red flag for both of us: we were collecting a time-based measurement, and a negative duration makes no sense. I also refer to this as the 'sniff test': you should have a basic understanding of what values are plausible for the measure(s) you are collecting data for. If you are not familiar with how the data is collected or what values are plausible, rely on subject matter experts to ensure the data passes the sniff test.
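A sniff test like this is easy to automate. The sketch below flags negative processing times in a hypothetical EMR extract (column names and timestamps are invented for illustration):

```python
# Sketch: a basic 'sniff test' on time-based records from a hypothetical
# EMR export, flagging impossible negative durations.
from datetime import datetime

records = [
    {"received": "2023-05-01 08:00", "resulted": "2023-05-01 08:45"},
    {"received": "2023-05-01 09:30", "resulted": "2023-05-01 09:10"},  # suspect
]

fmt = "%Y-%m-%d %H:%M"
suspect = []
for r in records:
    minutes = (datetime.strptime(r["resulted"], fmt)
               - datetime.strptime(r["received"], fmt)).total_seconds() / 60
    if minutes < 0:  # a negative processing time cannot be valid
        suspect.append(r)

print(f"{len(suspect)} of {len(records)} records fail the sniff test")
```

Rules like this can be extended with whatever bounds subject matter experts consider plausible (e.g., flagging durations that are implausibly long as well as negative ones).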
Compare Against Another Source
With the increasing complexity of how business is done, many organizations rely on external partners to deliver value to their customers. During a policyholder's automobile claim process, a large insurer partners with a third-party vendor to assess the vehicle's damage and determine the coverage and claim amounts. During a process improvement project to reduce the time it takes to assign a field appraiser to assess vehicle damage, the team lead (Green Belt) attempted to verify the data in the company's claims system. After learning that the data in the claims system was only ~80% accurate, we needed to explore an alternate mechanism to obtain more accurate data. After further investigation, the team identified two sources for this data. The insurance company's claims system captured the date/time the claim was assigned to a field appraiser and the date/time the field appraiser accepted the assignment. The third-party vendor also captured the date/time a field appraiser received the assignment and the date/time the field appraiser accepted the assignment. We obtained a sample of the vendor's data and learned that it was only ~70% accurate. To get more accurate data, we ultimately conducted a manual data collection for a sample of claims over several weeks. However, this due diligence uncovered a contractual issue with the vendor that needed to be addressed (outside the scope of the process improvement project).
Investigating alternative data sources is an effective way to assess data quality when possible.
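When two sources exist, a simple record-by-record comparison can estimate how often they agree. A minimal sketch with hypothetical claim timestamps:

```python
# Sketch: estimating agreement between two data sources that capture the
# same events. Keys and timestamps are hypothetical.
claims_system = {
    "claim1": "2023-06-01 10:00",
    "claim2": "2023-06-01 11:30",
    "claim3": "2023-06-02 09:15",
}
vendor_system = {
    "claim1": "2023-06-01 10:00",
    "claim2": "2023-06-01 11:45",  # disagrees with the claims system
    "claim3": "2023-06-02 09:15",
}

# Compare only the claims present in both systems
shared = claims_system.keys() & vendor_system.keys()
matches = sum(claims_system[k] == vendor_system[k] for k in shared)
agreement = matches / len(shared)
print(f"agreement rate: {agreement:.0%}")
```

A low agreement rate does not tell you which source is right, only that at least one is wrong, which is exactly the signal that prompted the manual data collection in the claims example above.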
Observe the Process (go to the Gemba)
As of 2017, almost 86% of office-based physicians used electronic medical record (EMR) systems, which capture and store a multitude of patient-related data such as diagnoses, treatment plans, prescribed medications and dosages, clinical notes, discharge information, and much more. While these EMR systems are a big data source in healthcare, they may lack the information needed to gain the insights necessary to support quality improvement, monitor patient safety, or measure organizational performance.
During a hospital's effort to reduce or eliminate delays in first case starts in its operating rooms (the first surgical procedure performed in each operating room starting on time as scheduled), there was a commonly held assumption that patients not finding parking contributed to the delays. A Black Belt was tasked with this improvement project and, on reviewing the hospital's EMR data, found no information about when a patient arrived in the hospital's parking lot. To determine whether patient parking was contributing to delays in first case starts, the team lead and another team member went to the Gemba, where the work gets done. For several weeks, they observed patients from their arrival in the parking lot through pre-operative activities until each patient entered the operating room. By performing this fundamental lean practice, the team dispelled the assumption about patient parking and focused on addressing other, more valid causes.
The exponential growth of data captured in extremely large amounts is undisputed, and only increasing. With the growth of the internet of things, connected devices, cloud storage, and increased access to computers and mobile devices across the globe, to name a few factors, big data will certainly continue to grow at an exponential rate. Whether big data is used to make supply chains transparent, guide policymaking, or track a global pandemic, the need for accurate, precise, and stable data will continue. Consider adopting these techniques as they make sense for your organization and situation to limit the risk of flawed conclusions or decisions.