Key points
Excessive data cleansing is often performed automatically and can prevent insights from being uncovered, but advanced analytics software helps users dig into the dirt and uncover better insights.
As process industries become increasingly data-driven, analytics has become a focal point for many subject matter experts (SMEs). Most SMEs do not have a background in data analytics, so they may not consider issues such as data integrity, or what that term really means.
One facet of data integrity that is often overlooked is data cleansing. Data cleansing refers to anything that alters the original data collected by a sensor, such as down-sampling by taking averages, removing outliers, or applying any kind of smoothing algorithm.
This becomes a bigger issue as SMEs and data scientists struggle to handle the growing ocean of data being generated and stored. How can one possibly get any insight out of so much data? It seems unavoidable to summarise, down-sample, or average it out. But is it possible to overclean data?
Many SMEs may not even realise they are cleansing their data, and do so only because they are forced to by whatever tool they are using. Excel’s limit of 1,048,576 rows forces users to look at only a small window in time. Even without reaching this limit, some software applications become very slow when trying to process so much data. Other tools get bogged down if data is too dense, and may automatically take averages to avoid this problem.
Fortunately, a solution is available in the form of advanced analytics applications specifically designed to handle the large amounts of data associated with process plant operations. When evaluating this kind of software, it is important that the default view always displays the raw data.
The software should use an algorithm which dynamically adjusts to the number of pixels on the screen and the number of raw sample points coming from the data source. This enables users to always see the most accurate representation of their data, whether they are looking at one day or three years of data, which is very important when investigating data to create insights.
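To make the idea concrete, the short Python sketch below shows one common way such pixel-aware down-sampling can be implemented (a generic illustration, not any particular vendor’s algorithm): each horizontal pixel is treated as a time bucket, and both the minimum and maximum raw sample in that bucket are kept, so spikes survive the decimation at any zoom level.

import numpy as np

def minmax_decimate(timestamps, values, n_pixels):
    """Keep at most two raw samples (the min and the max) from each
    pixel-wide time bucket so spikes stay visible at any zoom level."""
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)

    # Assign every raw sample to one of n_pixels equal-width time buckets.
    edges = np.linspace(timestamps[0], timestamps[-1], n_pixels + 1)
    buckets = np.clip(np.searchsorted(edges, timestamps, side="right") - 1,
                      0, n_pixels - 1)

    keep = set()
    for b in range(n_pixels):
        idx = np.where(buckets == b)[0]
        if idx.size == 0:
            continue  # no raw samples under this pixel
        keep.add(idx[np.argmin(values[idx])])
        keep.add(idx[np.argmax(values[idx])])

    keep = np.array(sorted(keep))
    return timestamps[keep], values[keep]

Re-running a function like this with a different n_pixels as the user zooms in or out keeps the on-screen trace faithful to the raw data at every scale.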
When to clean
There are definitely instances when data cleansing is not only desired but required for analytics. If you are creating a model or prediction, for example, you will most likely want to use cleansed data as it will produce much better results.
Advanced analytics software makes removing outliers and smoothing data simple through the use of point-and-click tools, helping users understand exactly which data they are removing and which data they are keeping in their analysis.
Note that in no case should this process alter the underlying data in the system of record: cleansing must be defined as calculations performed on the source data, not by changing the source data itself. It is also very important that this process is transparent rather than a black box, so SMEs are confident they are removing and keeping the right data.
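As a simple illustration of cleansing defined as a calculation, the Python sketch below (using pandas on a made-up pressure signal, with an arbitrary outlier threshold) derives a cleansed series by masking outliers and smoothing, while the raw series itself is never overwritten.

import numpy as np
import pandas as pd

# Hypothetical raw pressure signal sampled every two minutes (values are illustrative).
idx = pd.date_range("2023-07-01", periods=720, freq="2min")
raw = pd.Series(30 + np.random.normal(0, 0.5, len(idx)), index=idx)
raw.iloc[200] = 55  # one spurious spike

# Flag samples far from a rolling median (5 psi threshold chosen for illustration),
# interpolate across them, then apply light smoothing. All of this is a derived
# calculation; the 'raw' series is left untouched, as the system of record would be.
rolling_median = raw.rolling("1h").median()
is_outlier = (raw - rolling_median).abs() > 5
cleansed = raw.mask(is_outlier).interpolate(method="time").rolling("30min").mean()

Keeping the cleansed signal as a separate, documented calculation also provides the transparency mentioned above: an SME can see exactly which samples were excluded and why.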
So when could various down-sampling methods actually lead to inaccurate conclusions and wrong answers? Let’s dig into this by looking at a few examples.
Averaging hides problems
Let’s start with a simple but common example. SMEs often look at hourly or daily averages instead of the raw data when monitoring a process, as with this pressure signal. Judging by the daily average, the pressure appears to have been maintained at 30 psi for the entire month of July (Figure 1).
Figure 1: Excessive down-sampling can create a false sense of stable operation.
At first glance it looks like the process has been controlled well and there is nothing further to investigate. However, if we pull up the raw data instead, we see a very different story (Figure 2).
Figure 2: Examination of raw data can reveal issues hidden when only looking at down-sampled values.
In reality, the variability of the pressure has been increasing greatly over time and there are huge pressure swings. This probably needs to be investigated immediately, yet the problem would not have been apparent from looking at just the daily average pressure.
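The effect is easy to reproduce with a few lines of Python on synthetic data (the numbers below are purely illustrative and are not the plant data shown in the figures): the daily average sits near 30 psi all month even as the daily swing grows towards 20 psi.

import numpy as np
import pandas as pd

# Synthetic stand-in for the July pressure data: a 30 psi setpoint with
# swings whose amplitude grows over the month.
idx = pd.date_range("2023-07-01", "2023-07-31 23:58", freq="2min")
growth = np.linspace(0, 1, len(idx))
pressure = pd.Series(30 + 10 * growth * np.sin(np.arange(len(idx)) / 20), index=idx)

daily_mean = pressure.resample("1D").mean()
daily_swing = pressure.resample("1D").agg(lambda day: day.max() - day.min())

print(daily_mean.round(1).head())    # stays close to 30 psi
print(daily_swing.round(1).tail())   # swings approaching 20 psi by month end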
Use the right time period
Looking at hourly or daily averages is only one method used to down-sample or aggregate data for easier visualisation. Another method of down-sampling is keeping only the data point at the beginning or end of the time window (for example, the data point at the top of the hour when examining hourly data).
For example, if you look at the following trend when trying to investigate a process problem reported by an operator, you may not be able to determine the root cause (Figure 3).
Figure 3: Down-sampling this data makes it impossible to spot the underlying issue.
Everything looks to be as expected. However, upon investigation you realise that this data has been down-sampled and shows only the data point at the end of each hour. When looking at the raw data with a two-minute sampling rate, you see a very different picture (Figure 4).
Figure 4: Looking at the raw data makes it much easier to investigate the problem and create insights.
The raw data shows spikes in the temperature that have been causing the process upset, an issue that would be impossible to identify by looking at the down-sampled data.
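The same effect can be demonstrated in a few lines of Python with a synthetic temperature signal (the values are illustrative): keeping only the sample at the end of each hour erases a spike that the raw two-minute data shows clearly.

import pandas as pd

# Synthetic temperature signal: steady at 150 degrees with one short spike
# that does not line up with the top of the hour.
idx = pd.date_range("2023-07-01", periods=720, freq="2min")  # one day at 2-minute samples
temperature = pd.Series(150.0, index=idx)
temperature.iloc[100:103] += 40  # a brief mid-hour spike

hourly_last = temperature.resample("1h").last()  # keep only the end-of-hour sample

print(hourly_last.max())   # 150.0, the spike is invisible after down-sampling
print(temperature.max())   # 190.0, the raw data shows it clearly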
Some like it down and dirty
The lessons are clear: raw data often has some very important information that can be lost by down-sampling or averaging, and it’s important to always use data that is fit-for-purpose. This usually means starting from the raw data and then doing some data cleansing and down-sampling as appropriate for the specific analysis, a task best performed by an SME familiar with the unit or process.
Because advanced analytics software connects directly to the data source, users don’t need to worry about these issues. For example, a special algorithm, called Spike Catcher by one leading vendor, ensures users won’t miss an important aspect of the data, no matter how dense the data, how long the time period being investigated, or how few pixels are available on the screen.
These types of algorithms look at the available pixels on the screen, pick the minimum and maximum during each time period, and display both data points at the correct level of resolution.
As you zoom in and out, the advanced analytics software constantly adjusts the visual representation of your data based on the amount of data to be displayed and the number of pixels on your screen. Once you determine that data cleansing is appropriate for the desired analysis, you can easily use smoothing tools to remove noise or see long-term trends.
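As a minimal sketch of that last step (again in Python with pandas, on an illustrative noisy signal), a short rolling window suppresses sensor noise while a longer window exposes the underlying trend; both are derived views and the raw signal is left intact.

import numpy as np
import pandas as pd

# Illustrative noisy signal with a slow upward drift over one week.
idx = pd.date_range("2023-07-01", periods=7 * 720, freq="2min")
rng = np.random.default_rng(0)
signal = pd.Series(50 + np.linspace(0, 5, len(idx)) + rng.normal(0, 2, len(idx)), index=idx)

denoised = signal.rolling("15min").mean()  # short window: removes sensor noise
trend = signal.rolling("1D").mean()        # long window: reveals the weekly drift

Which window is appropriate is exactly the kind of fit-for-purpose judgement an SME familiar with the process is best placed to make.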
General purpose software, such as spreadsheets, will not have the wide array of tools required for examining the large amounts of data often encountered when SMEs try to create insights from process plant data. The right advanced analytics software, one specifically designed to work with process data, addresses these and other issues to quickly yield results.