Open Data Sources for AI in Industry
When starting a new project to enhance the capabilities of a production facility with Artificial Intelligence, the common question is: “Is it feasible?” Artificial intelligence in an industrial context requires a lot of data to train the underlying algorithms. Systems in operation generate data, but these data are often encapsulated, or the databases are not connected. As a result, the team tasked with bringing AI to the corporation may not have access to the company’s own data for building such systems. Under time and budget constraints, the development team then faces the question of how to get the training data.
Why Are Data Sources Important to Get Started With AI?
Predictive systems, fully automated systems, and knowledge discovery systems require data to be trained correctly, and the quality of the data determines the operational results AI systems deliver. If you don’t have enough good data, the trained models often perform badly: the AI cannot build the required abstractions and, as a consequence, the resulting system cannot deliver outstanding results. Although some reinforcement learning methods don’t need lots of data, general supervised Deep Learning needs large amounts of labeled data.
The Process Flow of Data From Training to Operation
When starting an AI project in an industrial context, you need to consider the general workflow of building a usable AI. In the first step, you need access to the relevant historical data, which might take the form of files or databases with the needed information. In the Data Science world, the complete collection of data available in the first step is called a data lake. This data lake contains unstructured and structured data. In general, data in a data lake are stored in a raw format. That means the data going into the lake from different sources are not preprocessed. Collections of data are retrieved from relevant sensors or historical recordings. These data are not only measured values but also data from sources like images, video, or audio.
In the next step, the data are preprocessed. Here the data are inspected and visualized so that subject matter experts can evaluate their quality. They can then be cleaned and reduced, transforming the raw data into data with more meaning.
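The cleaning and reduction step could be sketched roughly as follows. This is a minimal illustration, not a prescribed pipeline: the function names, the plausibility limits, and the sample readings are assumptions made up for this example.

```python
from statistics import mean


def clean_readings(readings, lower, upper):
    """Remove missing values (None) and readings outside a plausible range."""
    return [r for r in readings if r is not None and lower <= r <= upper]


def reduce_readings(readings, window):
    """Reduce the data by averaging consecutive, non-overlapping windows."""
    return [
        mean(readings[i:i + window])
        for i in range(0, len(readings), window)
    ]


# Hypothetical raw temperature readings with a gap and a sensor glitch.
raw = [20.0, 20.5, None, 21.0, 999.0, 21.5, 22.0]
cleaned = clean_readings(raw, lower=-50.0, upper=150.0)
reduced = reduce_readings(cleaned, window=2)
print(cleaned)
print(reduced)
```

In a real project, the plausible range and the window size would come from the subject matter experts who know the sensors.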
These data are the basis for developing the predictive models. Machine Learning algorithms are usually applied to learn from the data. For instance, a data scientist may choose neural network models, which should be validated after training on new, previously unseen data to check the training. The training phase includes multiple feedback rounds to see whether the training results fit the needs.
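The idea of validating on unseen data can be illustrated with a small sketch. Note the assumptions: a simple least-squares line is swapped in for a neural network to keep the example short, and the wear data are synthetic, generated only for this illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and targets."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): wear grows linearly
# with operating hours, plus measurement noise.
rng = random.Random(0)
hours = [float(h) for h in range(100)]
wear = [0.5 * h + 2.0 + rng.gauss(0.0, 1.0) for h in hours]

# Shuffle, then hold out 20% that the model never sees during training.
indices = list(range(len(hours)))
rng.shuffle(indices)
split = int(0.8 * len(indices))
train_idx, val_idx = indices[:split], indices[split:]

model = fit_line([hours[i] for i in train_idx], [wear[i] for i in train_idx])
val_error = mean_abs_error(
    model, [hours[i] for i in val_idx], [wear[i] for i in val_idx]
)
print(f"slope={model[0]:.2f}, validation MAE={val_error:.2f}")
```

If the validation error is much worse than the training error, that is one signal for another feedback round with more data or a different model.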
The Workflow of Machine Learning in Industrial Application
Finally, the ready-to-go AI components should be integrated at enterprise scale. This integration is twofold. On the one hand, there is edge AI, with embedded devices and hardware solutions that accompany machinery on site. On the other hand, there is integration into the enterprise systems in the form of software components. These software modules should be tailored to fit the existing operations.
The Problem Is Not Too Little Data but Too Much
But where do you find all these data to train your neural networks? Since data seem to be the new oil in today’s world, you might expect that finding data sources outside your own enterprise is hard. Industrial companies keep their valuable assets and data to themselves. However, other industries, especially IT companies, went through the same phase of keeping data and source code to themselves, and a small number of companies still do so.
But in recent years, the open-source approach has had enormous success. Even traditionally proprietary companies like Microsoft are tapping into open source. Sharing creates new business opportunities and adds value to whole industries. So industrial associations and consortia have started initiatives to share data. Another source of free and open data is publicly funded activities and research. Organizations like NASA or CERN provide a lot of valuable data. These data sets are used for general tasks and for testing new algorithms. They serve as benchmarks for algorithmic development. When you search the internet for data, you’ll be overwhelmed by the abundance available.
But with these masses of data comes a problem. Artificial Intelligence is a hot topic, and everyone is craving attention. So it’s often hard to decide which open data are suitable for your project. There are a lot of unstructured offerings, data of poor quality, and weakly described data sets. AI is used in so many different fields and for so many different use cases that many data sets will not suit your needs.
Relevant Open Data Sources for Industrial Artificial Intelligence
When you look at the categories of applied industrial AI, you’ll find that you can add AI to a lot of your products and services and thereby improve your customers’ experience. For manufacturing tools, for example, machines that self-diagnose improve the overall performance of the operational installation. Self-diagnosis increases their effectiveness, reliability, and safety, and enhances the longevity of the machines. Such machines detect signs of wear and tear on their own tooltips, such as drills, saw blades, welding tools, or grippers.
The second application you need data for is automation. Trend researchers call it hyper-automation. It gives the automation already present in industrial processes another boost. It makes people obsolete and shift changes negligible. Here, data from standards in autonomous driving and smart robotics are used to give individual training to industrial autonomous vehicles and machines.
A third field where AI is applied is knowledge discovery for engineering systems. The goal here is to find the root causes of problems and eliminate risks with the help of AI. Many critical areas provide a lot of data through sensors and logs. Here, AI can create real insights beyond anomaly detection and simple failure mode sensing. AI can then predict the unexpected: it finds relations between similar incidents in the past and current sensor readings, which helps to prevent problems even before they emerge.
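Relating a current sensor reading to similar past incidents can be sketched, in its simplest form, as a nearest-neighbor lookup. The incident log, the sensor values, and the function name below are hypothetical and only serve as an illustration; real knowledge discovery systems are considerably more involved.

```python
import math

# Hypothetical incident log: sensor snapshot (temperature in °C,
# vibration in mm/s, pressure in bar) recorded shortly before each incident.
past_incidents = [
    ("bearing failure", (85.0, 7.2, 4.1)),
    ("seal leak", (60.0, 2.1, 2.3)),
    ("motor overheating", (110.0, 3.0, 4.0)),
]


def most_similar_incident(current, incidents):
    """Return the past incident whose snapshot is closest (Euclidean distance)."""
    return min(incidents, key=lambda item: math.dist(current, item[1]))


current_reading = (88.0, 6.8, 4.0)
label, snapshot = most_similar_incident(current_reading, past_incidents)
print(label)
```

Because the sensors use different units, the values would normally be scaled to comparable ranges before distances are computed; that step is left out here for brevity.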
What Data Do You Need?
With these fields of application in mind, you can search for relevant publicly available data. Because many industrial applications require massive amounts of sensor data, these data are not always available for direct download. Sometimes you need to access the data via a given API, which connects to the existing databases and lets you extract and analyze the data.
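Accessing data through an API typically means building a query, sending it over HTTP, and parsing the response. The following is a minimal sketch under stated assumptions: the endpoint URL, the parameter names, and the JSON layout are all hypothetical, since every repository defines its own API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and parameter names, used for illustration only.
BASE_URL = "https://data.example.org/api/v1/readings"


def build_query_url(sensor_id, start, end, limit=1000):
    """Assemble the request URL for one sensor and time range."""
    params = {"sensor": sensor_id, "start": start, "end": end, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


def parse_readings(payload):
    """Turn a JSON payload in the assumed layout into (timestamp, value) tuples."""
    records = json.loads(payload)["readings"]
    return [(r["timestamp"], float(r["value"])) for r in records]


def fetch_readings(sensor_id, start, end):
    """Download and parse readings; requires network access."""
    with urlopen(build_query_url(sensor_id, start, end), timeout=30) as response:
        return parse_readings(response.read().decode("utf-8"))


# Offline demonstration with a payload in the assumed format.
sample = '{"readings": [{"timestamp": "2020-01-01T00:00:00Z", "value": "3.7"}]}'
print(build_query_url("pump-7", "2020-01-01", "2020-01-02"))
print(parse_readings(sample))
```

Keeping URL construction and parsing separate from the network call makes it possible to test both without a live connection.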
An example of available sensor data is the Predictive Maintenance of the Turbofan Engine dataset provided by NASA. It features sensor data from 100 engines of the same model. The dataset includes four different sets of engine data generated with the C-MAPSS aircraft engine simulator. The engines were run under different operational conditions and fault modes.
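To give an impression of how such run-to-failure data might be prepared, here is a small sketch. It assumes the commonly described layout of the C-MAPSS text files: whitespace-separated rows with an engine unit number, a cycle counter, three operational settings, and 21 sensor measurements. The sample rows are synthetic placeholders, not real NASA values, and labeling each row with its remaining cycles is one common choice rather than a prescribed method.

```python
from collections import defaultdict

N_SETTINGS = 3
N_SENSORS = 21
N_COLUMNS = 2 + N_SETTINGS + N_SENSORS


def parse_row(line):
    """Split one whitespace-separated row into named parts."""
    fields = line.split()
    if len(fields) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(fields)}")
    return {
        "unit": int(fields[0]),
        "cycle": int(fields[1]),
        "settings": [float(v) for v in fields[2:2 + N_SETTINGS]],
        "sensors": [float(v) for v in fields[2 + N_SETTINGS:]],
    }


def add_remaining_life(rows):
    """Label each row with the cycles left until that engine's last record."""
    last_cycle = defaultdict(int)
    for row in rows:
        last_cycle[row["unit"]] = max(last_cycle[row["unit"]], row["cycle"])
    for row in rows:
        row["rul"] = last_cycle[row["unit"]] - row["cycle"]
    return rows


# Synthetic stand-in rows: engine 1 runs 3 cycles, engine 2 runs 2 cycles.
sample_lines = [
    " ".join([str(unit), str(cycle)] + ["0.0"] * (N_SETTINGS + N_SENSORS))
    for unit, cycle in [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
]
rows = add_remaining_life([parse_row(line) for line in sample_lines])
print([(r["unit"], r["cycle"], r["rul"]) for r in rows])
```

The remaining-life label only makes sense for recordings that actually run until failure; truncated test recordings would need the true remaining life from a separate source.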
These turbofan engine data are sourced from the NASA Prognostics Center of Excellence, PCoE. This NASA department has even more open data sets available, contributed by various universities, agencies, and companies. These time-series data help to create prognostic algorithms: they show the transition from a nominal state to a failed state. Many different industrial tasks are covered. You’ll find milling data and tests on bearings, as well as data on electronics and batteries.
More free and openly available repositories come from the United Kingdom. The UK Oil and Gas National Data Repository, NDR, provides 130 terabytes of offshore data covering more than 12,500 wellbores, 5,000 seismic surveys, and 3,000 pipelines. This data is freely available to everyone. But NDRs are not exclusive to the UK. Such National Data Repositories exist in a lot of countries and provide open data following an open government approach.
Valuable data from governments is not restricted to the oil and gas industry. The British Geological Survey also provides lots of data sets. It offers real-time seismograms and historical data from its more than 100 seismograph stations across the UK, along with over 525 further data sets on different geological topics.
The Main Search Engines for Open Data
The best way to find open data sources for your AI project is to use specific search engines, catalogs, and aggregators. With the help of these tools, you’ll be able to quickly find a fitting data set. They’ll guide you through the jungle of available open data sources. As with a classical search engine, you enter a term for what you’re looking for, and the search engine shows you interesting data sets.
The Google Dataset Search, datasetsearch.research.google.com, gives an impressive overview of existing freely available data sets. Once you’ve done your search, the results give you not only the link to the repository but also direct information about the data formats provided and the way the data is accessible. This newly published tool features about 25 million publicly available datasets.
The Registry of Research Data Repositories, re3data.org, offers a comprehensive text-based search of its linked repositories. It features a nice graphical exploration tool under “search by subject” to find open data. But for the engineering sciences, there are only a few results. Further, this search engine does not lead you directly to the data. It only sends you to the repositories, where your search continues.
With these two starting points, you’ll find the right open data quickly. Open data helps you start your industrial Artificial Intelligence project right away, without having to wait for your operational sensors and enterprise setup to be transformed.