Predictive analytics is the use of data, statistical algorithms, and machine learning techniques to estimate the likelihood of future outcomes based on historical data. Solutions for predictive analytics are entering the market rapidly, and expectations are high. Most recently, UPS introduced a solution that will gather and analyze more than 1 billion data points per day at full scale, including data about package weight, shape, and size, as well as forecast, capacity, and customer data. The company clearly understands the need for predictive analytics as Amazon and others disrupt its industry.
Predictive analytics has the power to help all types of organizations, but before moving forward with a major investment, it’s important to understand how to build a strategy, how organizations can ensure data quality and how predictive analytics models can be improved.
Building a Strategy
Start by identifying a compelling business need, not a technology in search of a problem to solve. Some examples include identifying personalized consumer recommendations for the next item to purchase, predicting which customers are in danger of churning out of your subscriber base and optimizing supply chain management by predicting spikes in demand.
Every industry has its own set of use cases, so it’s important to monitor what the innovators in your space are doing and to benchmark your organization against those leaders.
Does your organization deliver the business benefits they claim to provide to their users?
It’s also important to start small; there’s no need to “boil the ocean.”
Start with low-hanging fruit and a minimum viable product (MVP) that provides tangible business value within three months. At that point, it will be easier to gain support within your organization for the next phase of predictive analytics.
Ensuring Data Quality
Quality is essential. The “dirty secret” of data ingestion is that collecting and cleansing the data can take 60-80% of the total time in an analytics project. As data volumes continue to grow, this cleansing process gets more complex. There is no magic bullet to avoid these difficulties.
Expect them, and plan for them.
Years ago, when data was small and resided in a few dozen tables at most, data ingestion could be performed manually. Now that data has grown too large and too varied for manual handling, machine learning automation is an essential part of the data ingestion process.
For example, rather than manually defining a table’s metadata, e.g., its schema or rules about minimum and maximum valid values, a user should be able to call on a system to learn this information automatically and then enforce those learned rules. Fortunately, a variety of products have been developed that employ machine learning and statistical algorithms to automatically infer information about the data being ingested and largely eliminate the need for manual labor. These include open source systems like Data Tamer and commercial products like Tamr, Trifacta, and Paxata. Bottom line: these products are real, they work, and they should be part of any enterprise’s data ingestion roadmap.
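The idea of learning rules from data and then enforcing them can be sketched in a few lines of Python. This is only a minimal illustration, not how any of the products named above work: the `ColumnRule` structure, the `learn_rules` and `violations` helpers, and the sample package records are all hypothetical, and the "learning" here is nothing more than recording each column's observed type and value range from a trusted sample.

```python
from dataclasses import dataclass


@dataclass
class ColumnRule:
    """Learned constraints for one column (hypothetical structure)."""
    dtype: type
    minimum: float
    maximum: float


def learn_rules(rows):
    """Infer each column's type and observed min/max from trusted sample rows."""
    rules = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        rules[column] = ColumnRule(type(values[0]), min(values), max(values))
    return rules


def violations(row, rules):
    """Return a list of human-readable rule violations for one incoming row."""
    problems = []
    for column, rule in rules.items():
        value = row.get(column)
        if not isinstance(value, rule.dtype):
            problems.append(f"{column}: expected {rule.dtype.__name__}")
        elif not rule.minimum <= value <= rule.maximum:
            problems.append(
                f"{column}: {value} outside [{rule.minimum}, {rule.maximum}]"
            )
    return problems


# Illustrative package records; the values are made up for this example.
trusted = [
    {"weight_kg": 1.2, "length_cm": 30},
    {"weight_kg": 4.8, "length_cm": 55},
    {"weight_kg": 0.4, "length_cm": 12},
]
rules = learn_rules(trusted)

print(violations({"weight_kg": 2.0, "length_cm": 40}, rules))     # clean row
print(violations({"weight_kg": -3.0, "length_cm": "40"}, rules))  # two problems
```

A real system would learn far richer rules (formats, allowed categories, cross-column relationships) and would tolerate legitimate values outside the sampled range, but the learn-then-enforce split stays the same.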
Some other considerations in the data quality process include:
Make it self-service.
In a mid-size enterprise, there will be dozens of new data sources to be ingested every week. A centralized IT organization that must implement each and every request will inevitably become a bottleneck. Make data ingestion self-service by providing users who want to ingest new data sources with easy-to-use tools they can use themselves to prepare the data for ingestion.
Govern the data to keep it clean.
Once you’ve gone to the trouble of cleansing your data, you’ll want to keep it clean. This means introducing data governance, with a data steward responsible for the quality of each data source.
Advertise your cleansed data.
Once you have cleansed a particular data source, can other users easily find it? If your data integration is always done point-to-point, on demand as requested by customers, there is no way for any customer to find data already cleansed for a different customer that could be useful. Your organization should implement a pub-sub (publish-subscribe) model, with a registry of previously cleansed data that is available for lookup by all your users.
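As a rough sketch of what such a registry might look like, the Python below keeps an in-memory catalog of cleansed datasets, lets users look them up by tag, and notifies subscribers when a matching dataset is published. Everything here is an assumption made for illustration: the `CleansedDataRegistry` class, its method names, and the dataset name, location, and steward address are hypothetical, and a production version would sit on a shared catalog or message broker rather than a Python dictionary.

```python
from collections import defaultdict


class CleansedDataRegistry:
    """In-memory catalog of cleansed datasets with tag-based subscriptions."""

    def __init__(self):
        self._datasets = {}                    # dataset name -> metadata
        self._subscribers = defaultdict(list)  # tag -> list of callbacks

    def subscribe(self, tag, callback):
        """Register a callback to run when a dataset with this tag is published."""
        self._subscribers[tag].append(callback)

    def publish(self, name, location, steward, tags):
        """Record a cleansed dataset and notify subscribers of each of its tags."""
        entry = {
            "name": name,
            "location": location,
            "steward": steward,
            "tags": set(tags),
        }
        self._datasets[name] = entry
        for tag in tags:
            for callback in self._subscribers[tag]:
                callback(entry)

    def find(self, tag):
        """Look up the names of previously cleansed datasets carrying this tag."""
        return [e["name"] for e in self._datasets.values() if tag in e["tags"]]


registry = CleansedDataRegistry()
notified = []
registry.subscribe("customers", lambda entry: notified.append(entry["name"]))

# Hypothetical dataset published by one team after cleansing.
registry.publish(
    name="crm_contacts_clean",
    location="warehouse.crm_contacts",
    steward="data-steward@example.com",
    tags=["customers", "crm"],
)

print(registry.find("customers"))  # lookup by any other user
print(notified)                    # subscribers were told about the new dataset
```

Recording the steward alongside each entry ties the registry back to the governance point above: anyone who finds a dataset also finds out who is responsible for its quality.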
Improving Predictive Analytics Model Complexity and Usability
As software engineers gain more experience in developing and deploying production-quality predictive analytics, it has become clear that this development is more complex than that of other types of software, for the following reasons:
- It requires a unique mix of talents, ranging from data science to Big Data to microservices. It can be difficult to assemble a team with the right skills.
- Analytic models require massive amounts of training data in order to provide accurate predictions.
- Analytic software is far more brittle than traditional software. A typical software algorithm is designed to be bullet-proof, able to handle any kind of input data, and its output is deterministic. Analytic algorithms, by contrast, are probabilistic in nature and are highly sensitive to the characteristics of the data with which they were trained. If those characteristics change, e.g., the volume grows or the range of possible values for a given field changes, the model may lose its accuracy and need to be retrained or replaced by an alternate model.
Some suggestions for overcoming the challenges include:
- Implementing a platform such as Amazon SageMaker, Alteryx yHat or Pachyderm to manage the complex development process.
- Building a feedback loop into the development process to constantly monitor the accuracy of predictions and rapidly detect and correct any drift in accuracy.
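The feedback loop in the last suggestion can be illustrated with a small rolling-accuracy monitor. This is a minimal sketch under stated assumptions, not a feature of any platform named above: the `AccuracyMonitor` class, its `baseline`, `window`, and `tolerance` parameters, and the churn labels in the usage example are all hypothetical. It assumes the true outcome for each prediction eventually becomes known so that the two can be compared.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks rolling accuracy of predictions against observed outcomes."""

    def __init__(self, baseline, window=100, tolerance=0.05):
        self.baseline = baseline    # accuracy measured when the model was deployed
        self.tolerance = tolerance  # allowed drop before drift is flagged
        self._hits = deque(maxlen=window)  # most recent correct/incorrect results

    def record(self, predicted, actual):
        """Store whether one prediction matched the outcome later observed."""
        self._hits.append(predicted == actual)

    def accuracy(self):
        """Rolling accuracy over the window, or None before any outcome arrives."""
        return sum(self._hits) / len(self._hits) if self._hits else None

    def drifted(self):
        """True once the window is full and accuracy fell below baseline - tolerance."""
        if len(self._hits) < self._hits.maxlen:
            return False
        return self.accuracy() < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.90, window=10, tolerance=0.05)

# Made-up outcomes: 7 correct churn predictions followed by 3 misses.
for predicted, actual in [("churn", "churn")] * 7 + [("churn", "stay")] * 3:
    monitor.record(predicted, actual)

print(monitor.accuracy())
print(monitor.drifted())
```

When `drifted()` returns true, the response is a process decision rather than a coding one: alert the team, retrain on recent data, or fall back to an alternate model.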
At the end of the day, every predictive analytics algorithm has a customer who must receive predictions that are understandable and actionable. To achieve this, it’s important to create a digest that reports actionable insights and exceptions, rather than flooding your customer with raw data.