Enabling Artificial Intelligence with Clean Data


Applying Artificial Intelligence (AI) requires that data be processed at the speed of operations in order to analyze the performance of any given process system. That way, important decisions can be made quickly to improve the performance of the operation. But for those decisions to be made in real time, data must be of the highest quality, and that presents a challenge to the industry.

Quantity vs quality

IoT has driven a tremendous explosion in the quantity of data. Today, there are many sensors measuring and providing data on temperature, flow, vibration, product viscosity and energy consumption, just to name a few. Add, for example, video and images to that mix and you start to get a feel for the vast amount of data that has to be processed. While this can potentially provide a greater window into any process, it also presents a big quality problem, in that much of that data is all too often “dirty.” Dirty data is neither conducive to fast processing nor useful for providing operational insights.

Issues with dirty data

There are three underlying issues that render data dirty: format, quantity, and correlation.

Format. With so many different vendors supplying various types of sensors, the format of data supplied is often not unified, despite the numerous industry standards. Processing the disparate data can be time-consuming if this data is not unified into a standard format.
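As a rough illustration of what unifying formats can look like, the sketch below maps two vendor payloads onto one common record. Both payload layouts, the field names (`id`, `ts`, `temp_f`, `sensor`, `time`, `celsius`) and the `Reading` type are hypothetical, chosen only for the example; real vendor formats will differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """Unified record (hypothetical): UTC timestamp, temperature in Celsius."""
    sensor_id: str
    timestamp: datetime
    quantity: str
    value: float


def from_vendor_a(payload: dict) -> Reading:
    # Assumed vendor A layout: epoch seconds and degrees Fahrenheit.
    return Reading(
        sensor_id=payload["id"],
        timestamp=datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
        quantity="temperature",
        value=(payload["temp_f"] - 32.0) * 5.0 / 9.0,
    )


def from_vendor_b(payload: dict) -> Reading:
    # Assumed vendor B layout: ISO 8601 string with offset, degrees Celsius.
    return Reading(
        sensor_id=payload["sensor"],
        timestamp=datetime.fromisoformat(payload["time"]).astimezone(timezone.utc),
        quantity="temperature",
        value=float(payload["celsius"]),
    )


a = from_vendor_a({"id": "tank-1", "ts": 0, "temp_f": 212.0})
b = from_vendor_b(
    {"sensor": "tank-2", "time": "1970-01-01T00:00:00+00:00", "celsius": "100"}
)
```

Once every source is converted to the same record at the point of ingestion, downstream processing only has to handle one timestamp convention and one unit.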

Quantity. The huge amount of data captured is often very noisy. Each sensor could be reporting data to the system every 5 seconds, but that doesn’t mean each packet of data is necessarily meaningful. For example, do you need to know the temperature of a tank every 5 seconds, or is every 5 minutes enough? It is highly probable that the sensor is simply recording the same value many times over, so perhaps you only need anomalous data, or data that falls outside expected ranges. Even with the reduction in data storage costs in recent years, saving and reporting data that has no bearing on the process is a waste of resources.
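One common way to cut this kind of noise is to forward a reading only when it differs meaningfully from the last one kept, or when it falls outside the expected range. The sketch below is a minimal version of that idea; the function name, the deadband of 0.5 and the 60 to 90 range are illustrative assumptions, not values from any particular system.

```python
from typing import Iterable, Iterator, Optional, Tuple

Sample = Tuple[int, float]  # (seconds since start, value)


def report_by_exception(
    readings: Iterable[Sample], deadband: float, low: float, high: float
) -> Iterator[Sample]:
    """Yield only readings that changed by more than `deadband`
    since the last kept reading, or that lie outside [low, high]."""
    last_kept: Optional[float] = None
    for t, value in readings:
        out_of_range = value < low or value > high
        changed = last_kept is None or abs(value - last_kept) > deadband
        if out_of_range or changed:
            last_kept = value
            yield (t, value)


# A tank temperature sampled every 5 seconds (made-up values).
raw = [(0, 70.0), (5, 70.1), (10, 70.0), (15, 72.0), (20, 95.0), (25, 95.0)]
kept = list(report_by_exception(raw, deadband=0.5, low=60.0, high=90.0))
```

In this sketch out-of-range readings are always kept, even when they repeat, on the assumption that an ongoing anomaly is worth seeing at full resolution.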

Correlation. Dirty data is often not correlated: data from one sensor or system on its own may not be placed in a useful operational context with other data. For instance, suppose you have a sensor that provides temperature data. If you also correlate temperature with humidity (which commonly shifts with changing weather conditions), you could see the impact on your steam production, giving you context-based intelligence that allows you to make real-time decisions dynamically.
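Correlating two sensor streams usually starts with aligning them in time. The sketch below pairs each temperature reading with the most recent humidity reading that is not too old. The function name `correlate`, the `max_age` tolerance and the sample values are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)


def correlate(
    temps: List[Sample], humidities: List[Sample], max_age: int
) -> List[Tuple[int, float, float]]:
    """Pair each temperature with the latest humidity reading taken
    at or before it, if that reading is at most `max_age` seconds old.
    Both input lists must be sorted by timestamp."""
    hum_times = [t for t, _ in humidities]
    rows = []
    for t, temp in temps:
        i = bisect_right(hum_times, t) - 1  # latest humidity at or before t
        if i >= 0 and t - hum_times[i] <= max_age:
            rows.append((t, temp, humidities[i][1]))
    return rows


temps = [(0, 20.0), (60, 21.0), (400, 25.0)]
hums = [(10, 40.0), (55, 45.0)]
paired = correlate(temps, hums, max_age=120)
```

Temperature readings with no sufficiently recent humidity value are dropped rather than paired with stale data, which keeps the combined record trustworthy at the cost of some gaps.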

Data clean up

In the past, big data was typically analyzed offline by data scientists. Today, data must be usable quickly enough to affect operations as they happen. For this to happen, dirty data needs to be processed into clean data within seconds so that AI can deliver greater process efficiency, for example through advanced learning. With clean data, operational decisions can be made in real time, taking advantage of all that AI can bring.

Reprinted with permission from the original blog post.

About ARC Advisory Group: Founded in 1986, ARC Advisory Group is a leading Boston-based technology research and advisory firm for industry and infrastructure.


About the author

Jane Ren is the Founder and CEO of Atomiton, an IIoT solution provider delivering next-generation artificial intelligence solutions to industrial businesses. Under Jane’s leadership, Atomiton was named a “Top 20 Disruptive Influence in Tech” by CFO Magazine in 2017 for its IoT software stack deployed in oil and gas, smart cities, and industrial automation.

