Anomaly detection is a monitoring mechanism in which a system tracks the key metrics of a business and alerts users whenever a metric deviates from normal behavior. Conventionally, businesses use a fixed set of thresholds and mark any metric that crosses one as an anomaly. However, this method is reactive: by the time a business recognizes a threshold violation, the damage may already have multiplied. What is needed is a system that constantly monitors data streams for anomalous behavior and alerts users in real time to enable timely action.
The use cases of anomaly detection are numerous and vertical-agnostic. Verticals such as Telecom, Retail, FinTech and Manufacturing have some of the most impactful uses of anomaly detection.
Anomaly detection algorithms can analyze huge volumes of historical data to establish a ‘Normal’ range, and raise red flags when data points fall outside the tolerable range.
A good anomaly detection system should be able to perform the following tasks:
- Identification of signal type and selection of an appropriate model
- Forecasting thresholds
- Anomaly identification and scoring
- Finding root cause by correlating various identified anomalies
- Obtaining feedback from users to check quality of anomaly detection
- Re-training of the model with new data
Identification of signal type
The first task is to identify the type of signal, for instance, whether the chosen data has cyclicity or a trend component. Deep learning models usually do not perform well on sparse data or small volumes of data; for these types of signals, a simple ARIMA or XGBoost model with the right feature engineering might be a better option. For data with good cyclicity in large volumes, deep learning models would be a good choice.
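As a rough illustration of this selection step, the sketch below checks cyclicity with the sample autocorrelation at an expected period and then applies the rule of thumb above. The function names, the `min_points` and `acf_cutoff` values, and the returned labels are all assumptions made for the example, not settings from any particular system.

```python
import numpy as np

def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag (0.0 if it cannot be computed)."""
    x = np.asarray(x, dtype=float)
    if lag < 1 or lag >= len(x):
        return 0.0
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0:
        return 0.0
    return float(np.dot(x[:-lag], x[lag:]) / denom)

def choose_model_family(x, period, min_points=500, acf_cutoff=0.5):
    """Hypothetical rule of thumb: deep learning only for large, clearly cyclic series."""
    cyclic = autocorrelation(x, period) > acf_cutoff
    if len(x) >= min_points and cyclic:
        return "deep_learning"
    # Sparse, short, or weakly cyclic signals: prefer a simpler model.
    return "arima_or_xgboost"
```

For an hourly metric with a daily cycle, `choose_model_family(series, period=24)` would be one way to call it. In practice the cutoffs would need tuning per data source.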
Forecasting thresholds
After every re-train, the model forecasts the threshold limits. These limits are calculated from statistics of the latest training data, such as mean, median and variance. Using a normal distribution analogy and a given confidence level, a threshold is set for the next actual point to be forecasted.
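A minimal sketch of the normal distribution analogy, assuming the limits are simply the mean plus or minus a z-multiple of the standard deviation of recent values. The function name `forecast_thresholds` and the default confidence of 0.99 are illustrative choices; a production system would typically centre the band on the model's forecast for the next point instead of a plain mean.

```python
from statistics import NormalDist, mean, stdev

def forecast_thresholds(recent_values, confidence=0.99):
    """Return (lower, upper) limits for the next point under a normal assumption."""
    mu = mean(recent_values)
    sigma = stdev(recent_values)  # sample standard deviation, needs >= 2 values
    # Two-sided interval: put (1 - confidence) / 2 of the mass in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mu - z * sigma, mu + z * sigma
```

A higher confidence level widens the band and so raises fewer alerts; a lower one tightens it.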
Anomaly identification and scoring
Anomalies are identified whenever a metric moves beyond the specified threshold. However, it is important to quantify the magnitude of the deviation in order to prioritize which anomaly should be investigated or resolved first. In the scoring phase, each anomaly is scored by the magnitude of its deviation from the median, or by how long the metric stays away from normal behavior. The larger the deviation, the higher the score.
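One possible way to score anomalies is sketched below. It groups consecutive out-of-bounds points into runs and, as an assumption of this example, combines both criteria by multiplying the peak deviation from the median (relative to half the threshold band) by the run length. The function name and the scoring formula are hypothetical.

```python
from statistics import median

def score_anomalies(values, lower, upper):
    """Return (start_index, length, score) for each run of out-of-bounds points.

    Assumes upper > lower. Score = peak deviation from the median, expressed
    in half-band units, multiplied by how many points the run lasted.
    """
    values = list(values)
    center = median(values)
    half_band = (upper - lower) / 2
    runs, start = [], None
    # Iterate one step past the end so a run that reaches the last point is closed.
    for i in range(len(values) + 1):
        outside = i < len(values) and not (lower <= values[i] <= upper)
        if outside and start is None:
            start = i
        elif not outside and start is not None:
            peak = max(abs(v - center) for v in values[start:i])
            runs.append((start, i - start, peak / half_band * (i - start)))
            start = None
    return runs
```

Sorting the returned runs by score in descending order gives a simple investigation queue.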
Finding root cause by correlating various identified anomalies
Often, it is difficult to identify the root cause by looking at each metric in a silo. Putting all anomalies together gives a more complete picture of the situation. Consider the example of a sudden increase in traffic on a set of towers for a telecom operator. Viewed individually, each tower simply shows an unexplained spike. But by putting them on a map, it can be identified that the tower in the centre was shut down due to a technical problem, which led to the increase in traffic for all the neighboring towers. This increase could be temporary, so the operator does not need to take any permanent action, such as increasing investment in infrastructure, based on this anomaly alone. To stitch together the entire story, one needs to lay out all anomalies together and understand the context by correlating them with multiple data sources.
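The tower example can be sketched as a simple join between two anomaly streams. The code below assumes hypothetical inputs: a list of `(tower, time)` traffic spikes, a list of `(tower, time)` outages, and a `neighbors` adjacency map. It links a spike to an outage when the affected tower is adjacent to the one that went down and the spike follows within a chosen time window. The names and the 15-unit window are illustrative only.

```python
def explain_traffic_spikes(spikes, outages, neighbors, window=15):
    """Group traffic spikes under the neighboring outage that plausibly caused them.

    Returns (explained, unexplained): a dict mapping each down tower to the
    spikes it explains, and a list of spikes with no matching outage.
    """
    explained, unexplained = {}, []
    for tower, t in spikes:
        cause = next(
            (down for down, t_down in outages
             if down in neighbors.get(tower, ()) and 0 <= t - t_down <= window),
            None,
        )
        if cause is None:
            unexplained.append((tower, t))
        else:
            explained.setdefault(cause, []).append((tower, t))
    return explained, unexplained
```

Spikes that end up in `unexplained` are the ones that still need investigation in their own right.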
Feedback from users to check quality of anomaly detection
Anomaly detection systems are usually designed around tight bounds to highlight deviations quickly, but in the process they sometimes raise many false alarms. In fact, false positives are known to be one of the most prevalent issues in anomaly detection. The end user should therefore be given the flexibility to change the status of a data point from anomaly to normal. After receiving this feedback, the models need to be updated or retrained so that the identified false positives do not recur.
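A minimal sketch of such a feedback loop, assuming alerts carry hashable identifiers. The `FeedbackLog` class, its method names, and the 20% retraining trigger are hypothetical and only illustrate how user relabels could feed back into the retraining decision.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLog:
    """Records which alerts users have relabeled from anomaly to normal."""
    dismissed: set = field(default_factory=set)

    def mark_normal(self, alert_id):
        self.dismissed.add(alert_id)

    def false_positive_rate(self, alert_ids):
        """Share of the given alerts that users dismissed as normal."""
        alert_ids = list(alert_ids)
        if not alert_ids:
            return 0.0
        return sum(a in self.dismissed for a in alert_ids) / len(alert_ids)

    def needs_retrain(self, alert_ids, max_rate=0.2):
        """Flag the model for an update once false alarms exceed max_rate."""
        return self.false_positive_rate(alert_ids) > max_rate
```

The dismissed points can also be fed back into the next training run as confirmed normal examples.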
Re-training of the model with new data
The system needs to re-train on new data continuously in order to adapt to newer trends. It is possible that the pattern itself has changed because of a change in the operating environment, rather than because of anomalous behavior. However, the mechanism needs balance. Updating the model too frequently consumes an excessive amount of computational resources, while updating it too rarely lets the model drift away from the actual trend.
Overall, anomaly detection has gained importance in recent years, due to the exponential growth of available data and the absence of impactful mechanisms to use this data. Anomaly detection systems are well suited to identifying significant deviations while ignoring insignificant noise in the ocean of data, giving businesses the right alarms and insights at the right time.
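One way to strike this balance is to retrain on a trigger instead of on every new batch. The sketch below retrains when the recent mean has shifted by more than a chosen number of training standard deviations, or when the model has gone too long without an update. The function name, the `drift_sigmas` value, and the `max_age` value are assumptions for illustration.

```python
from statistics import mean, stdev

def should_retrain(training_values, recent_values, points_since_retrain,
                   drift_sigmas=2.0, max_age=1000):
    """Decide whether to retrain, based on model age and a simple mean-shift check."""
    # Upper bound on staleness: retrain regardless of drift after max_age points.
    if points_since_retrain >= max_age:
        return True
    sigma = stdev(training_values)
    if sigma == 0:
        return mean(recent_values) != mean(training_values)
    # Drift measured in units of the training data's standard deviation.
    shift = abs(mean(recent_values) - mean(training_values)) / sigma
    return shift > drift_sigmas
```

Raising `drift_sigmas` or `max_age` saves compute at the cost of slower adaptation, and lowering them does the opposite.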