Know My Company
Tell us about your interaction with smart technologies like AI and Cloud-based computing platforms.
When IT wants to drive business transformation, it needs to understand how to do that. If you don’t know what the business impact of IT changes is, that is impossible. The velocity, variety, and volume of IT data make it too difficult for humans to understand how their business and IT are truly interrelated. You need AI to make sense of the data. We recognize that there are three important challenges to tackle: visibility, intelligence, and scale.
Visibility is about having all relevant data accessible for human and AI operators. StackState provides visibility through its proprietary versioned graph database. StackState automatically builds and maintains a real-time topology that works with both business and IT data. The topology can then be populated with every imaginable type of real-time stream, from low-level metrics like memory and disk utilization to high-level business KPIs. We thus start by having a contextual real-time model that shows the interrelatedness between business and IT.
Intelligence is about making sense of the data: understanding how low-level IT metrics affect high-level business KPIs, and being able to rapidly diagnose and fix problems, identify bottlenecks, and benefit the business. StackState makes sense of the heterogeneous data using various AI technologies: anomaly detection, forecasting, clustering, and automatic root cause analysis, to name a few. None of these technologies would be possible without recent developments in the field of AI, such as Deep Learning.
The final challenge is scale. Modern IT landscapes are large and can generate petabytes of data in small periods of time. Applying AI to these amounts of data is impossible without Cloud technologies.
Access to virtually unlimited resources, a good understanding of how the elements of the landscape are connected, and the ability to reason transparently about costs are the key aspects that make it possible to make well-informed decisions on how to spend resources on AI tasks.
How did you start in this space? What galvanized you to start at StackState?
Before StackState, Mark Bakker (Co-Founder of StackState) and I worked at a large financial institution as IT consultants. In this environment, we encountered literally dozens of monitoring tools producing tons of data on what was going on in a highly complex and dynamic IT environment. However, at the time, the moment something went wrong the teams had to sift through all that data to find the needle in the haystack, which would sometimes take them days, while in the meantime some highly critical business applications were completely offline.
A single outage could easily cost millions. In short, they had all the data but did not have the insights. We were asked to come up with a solution to this problem, a problem we have had ourselves many times before in our previous lives, but not at this scale. This is the space in which StackState was born.
What is StackState and how does it transform Data Analytics for AIOps and IT teams?
There are three main ways StackState transforms Data Analytics for AIOps and IT teams:
- StackState transforms Data Analytics for IT teams by relating business to IT and breaking the silos between teams.
- StackState automates Data Analytics.
- StackState allows performing analytics at scale.
To get a holistic view of an IT environment, it is often necessary to pull data from different sources: business KPIs, IT SLAs, and low-level resource utilization metrics from various services. The StackState AIOps platform utilizes current IT investments by combining and analyzing metrics, logs, events, and data beyond typical monitoring data, such as Google Analytics, Twitter, CMDBs, CI/CD tools, service registries, and automation and incident management tools.
StackState brings all the data together in one place, makes the relationships between data explicit, contextualizes the data, and makes it readily available for analysis. StackState uses the variety of data it collects to learn about dependencies, allowing it to build a topology of dynamic IT landscapes in real time. By ‘rewinding’ the topology visualization in time, StackState instantly assists teams in discovering the root cause of incidents and how the impact of these incidents has propagated across business and on-premise, cloud and hybrid IT landscapes. Moreover, by combining the historical and streaming data collected, StackState is able to predict future incidents and their impact.
What is especially important is that StackState makes the relationship between the business and IT visible, makes the activity of other teams visible, and provides context for all the data. This is crucial for Data Analytics: without all relevant data in one place, one cannot make data-driven decisions. StackState makes all relevant data available to both human and AI agents.
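As an illustration of what "rewinding" a topology can mean, here is a minimal Python sketch. It is not StackState's implementation: the `VersionedEdge` class, the `snapshot` and `changes_between` helpers, and the component names are all hypothetical. It assumes every relation is stored with the time interval during which it existed, so that the topology at any past moment can be rebuilt and compared with a later one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionedEdge:
    """A 'source depends on target' relation valid during [start, end)."""
    source: str
    target: str
    start: int                 # timestamp at which the relation appeared
    end: Optional[int] = None  # None means the relation still exists


def snapshot(history, at):
    """Relations that existed at time `at`, as (source, target) pairs."""
    return {
        (e.source, e.target)
        for e in history
        if e.start <= at and (e.end is None or at < e.end)
    }


def changes_between(history, earlier, later):
    """Relations added and removed between two points in time."""
    old, new = snapshot(history, earlier), snapshot(history, later)
    return {"added": new - old, "removed": old - new}


# Hypothetical history: the checkout service switched databases at t=100.
HISTORY = [
    VersionedEdge("webshop", "checkout-service", start=0),
    VersionedEdge("checkout-service", "db-old", start=0, end=100),
    VersionedEdge("checkout-service", "db-new", start=100),
]

# Compare the topology before and after a (hypothetical) incident.
diff = changes_between(HISTORY, earlier=50, later=150)
```

Comparing a snapshot from before an incident with one from after it narrows the search for a root cause to the relations that actually changed in between.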
Data Analytics requires Data Science competency, something that is rare in IT teams. StackState automates Data Analytics by deploying the right algorithms in the right places and presenting results in a way an IT operator can understand and act upon. All that is required from the user is to indicate which business metrics are important (the monitoring goal), and StackState automatically decides which algorithms to deploy in order to meet the user’s requirements. For example, the user may specify customer churn as a KPI; StackState will then automatically detect anomalies in the KPI, relate them to low-level IT metrics, and perform root-cause analysis if a problem occurs. The user does not need expertise in Data Science to take advantage of StackState AI.
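To make the churn example more concrete, the following Python sketch flags anomalies in a KPI and looks for anomalies in a low-level metric that occurred shortly before. It uses a simple rolling z-score, which is a stand-in chosen for illustration and not necessarily an algorithm StackState deploys; the function names, the data, and the thresholds are all assumptions.

```python
from statistics import mean, stdev


def anomalies(series, window=5, threshold=3.0):
    """Indices whose value lies more than `threshold` standard deviations
    from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


def preceding(metric_hits, kpi_hit, max_lag=3):
    """Metric anomalies that happened at most `max_lag` steps before the KPI anomaly."""
    return [m for m in metric_hits if 0 <= kpi_hit - m <= max_lag]


# Hypothetical data, one value per minute.
churn_kpi = [10, 11, 10, 12, 11, 10, 11, 30, 31, 29]
db_latency_ms = [20, 21, 19, 20, 22, 21, 95, 97, 96, 94]

kpi_hits = anomalies(churn_kpi)
metric_hits = anomalies(db_latency_ms)
candidates = preceding(metric_hits, kpi_hits[0])
```

In this toy data the latency jump is flagged one minute before the churn jump, which makes it a candidate cause. A real system would additionally need the topology to know that the database and the KPI are related at all.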
At StackState we believe in “best practices” – pieces of expert knowledge packaged and shared as plugins for StackState – StackPacks. A StackPack is a downloadable extension for StackState developed in-house or by the community. A StackPack can be anything: integration with a third-party monitoring tool, a collection of smart health checks or an AI algorithm. Using StackPacks organizations can accelerate AI adoption without requiring Data Science expertise.
Analyzing data at scale is challenging: an IT system might generate petabytes of data, and it is prohibitively expensive to analyze all of it in real time. However, it is crucial to detect ongoing problems immediately and forecast future ones in a timely fashion. StackState solves this issue by monitoring the things that matter most, the business KPIs, and relating them to low-level signals using the dynamic topology maintained in the proprietary versioned graph database. StackState is able to apply AI algorithms where they have the most impact without wasting computational resources on things that do not affect the business. StackState can do this because it understands the dependency between IT and the business. This way StackState AI can scale to very large IT infrastructures, something that would not be possible without the 4T data model.
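The idea of spending computation only where it affects the business can be sketched as a reachability question on the topology. The Python example below is an assumption-laden illustration, not StackState's algorithm: the `DEPENDS_ON` map, the `relevant_components` helper, and the stream naming scheme (`component.metric`) are hypothetical.

```python
# Hypothetical topology: each node maps to the components it depends on.
DEPENDS_ON = {
    "sales-kpi": ["webshop"],
    "webshop": ["checkout-service", "cdn"],
    "checkout-service": ["payments-db"],
    "internal-wiki": ["wiki-db"],
}

# Hypothetical telemetry streams, named "<component>.<metric>".
STREAMS = [
    "webshop.latency",
    "payments-db.disk",
    "wiki-db.disk",
    "internal-wiki.cpu",
    "cdn.errors",
]


def relevant_components(depends_on, kpis):
    """All nodes that the given KPIs depend on, directly or transitively."""
    seen = set()
    stack = list(kpis)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, []))
    return seen


scope = relevant_components(DEPENDS_ON, ["sales-kpi"])
# Keep only streams whose component can influence the KPI.
to_analyze = [s for s in STREAMS if s.split(".")[0] in scope]
```

Here the wiki and its database are left out of the analysis because nothing the sales KPI depends on is connected to them.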
How big is your AIOps and Product Development team?
Currently, we are about 40 people strong, 30 of whom are engineers, all with a background in Computer Science and/or Artificial Intelligence.
What is the state of AI and NLP for IT and Operations platforms in 2019? How much has it evolved since the time you first started here?
AI for IT operations is necessary when it comes to speeding up operations, scaling up teams, making sense of complex and highly dynamic IT landscapes, and bridging the gap between IT and the business. Humans alone cannot do it anymore; there is a real need for AI in the IT operations space.
However, current solutions on the market do not fully satisfy users. Most vendors focus on a subset of the data: they work only with log data, or traces, or business data. This heavily limits the scope of problems they can solve. There are vendors, for example, that take event data and then reduce the noise during an alert storm, which is great, but this does not help you one bit with finding anomalous behavior in a newly deployed Docker container. There are too many blind spots and too little context to enable even human-level understanding of the IT infrastructure and its relation to the business. Therefore, they cannot deliver fully on the promise of AIOps: being able to solve any number of operational problems using AI with big IT data. We are now starting to see both companies and vendors recognize the possibilities of a more holistic approach. The term observability, hyped at some point, is a sign of that.
Tell us more about your 4T Data Model and the kind of data it ingests from IT companies.
StackState builds all its value on top of a single data model that we call the 4T data model, which stands for Topology, Telemetry, Tracing and Time. With this model, StackState can reason about anything that is going on in a business/IT environment. It is abstract enough to allow modeling of every type of IT resource, from mainframe to microservice and from on-prem to cloud, but concrete enough such that every T solves a concrete set of problems that need
Missing even one T leaves human operators and AI partially blind. Let me illustrate the “missing T problem” with some examples. Without Telemetry, it is hard to see when things are broken or about to break. Without Topology, it is hard to find root causes or reason about the possible impact. Without Traces, it is hard to find performance bottlenecks and application or transactional problems. Without Time, it is hard to reason about the consequences of changes or to meaningfully learn system behavior.
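The "missing T problem" can be written down as a small data structure. The Python sketch below is only one possible reading of the 4T model: the `FourT` class, its field types, and the `missing_ts` helper are hypothetical, and the Time dimension is represented, as an assumption, by a list of timestamped changes. The blind-spot descriptions paraphrase the examples given above.

```python
from dataclasses import dataclass, field


@dataclass
class FourT:
    """Hypothetical container for the four Ts of one environment."""
    topology: dict = field(default_factory=dict)   # component -> dependencies
    telemetry: dict = field(default_factory=dict)  # stream -> [(timestamp, value)]
    traces: list = field(default_factory=list)     # request paths across components
    timeline: list = field(default_factory=list)   # timestamped changes ("Time")


# What becomes hard to see when a given T is absent.
BLIND_SPOTS = {
    "topology": "root causes and possible impact",
    "telemetry": "things that are broken or about to break",
    "traces": "performance bottlenecks and transactional problems",
    "timeline": "consequences of changes and learned system behavior",
}


def missing_ts(model):
    """Names of the Ts for which the model holds no data."""
    return [name for name in BLIND_SPOTS if not getattr(model, name)]


# A model with topology and telemetry, but no traces and no history.
model = FourT(
    topology={"webshop": ["payments-db"]},
    telemetry={"payments-db.cpu": [(0, 0.4)]},
)
blind = {name: BLIND_SPOTS[name] for name in missing_ts(model)}
```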
One of my favorite examples of the 4T model enabling AI comes from one of our clients. In order to correlate the business and IT, they have attached low-level IT resource utilization and the 4 golden IT signals (latency, errors, saturation, and traffic) at component and service levels to business KPIs coming from Google Analytics. This way the client can detect anomalies and issues at the business level, for example a sudden drop in sales, and relate them to IT problems like the deployment of a faulty component. This has had a tremendous impact on the time-to-fix: previously, finding the root cause was done by a team of experts in an operations room and would take hours; now it is done by a single operator in a semi-automatic fashion and often takes minutes.
How should young technology professionals train themselves to work better with Cloud, Automation and AI-based tools?
AI has gone mainstream nowadays and there are many resources online that try to enable people to learn AI by themselves. People should decide how far they are willing to go. Do you want to approach AI as a black box or do you want to know what is going on under the hood? If you do want to understand what is going on under the hood, I recommend spending a significant amount of time on mastering the mathematical underpinnings of AI: linear algebra, calculus, and statistics. Start reading papers and immerse yourself in the AI community.
How does StackState contribute to a successful Digital Transformation/IT Modernization?
The biggest challenge in driving Digital Transformation in 2019 is bridging IT and the business. Once the gap is bridged, IT teams can think about and build strategy in a new way, because their impact is now transparent to all teams within a company. They actually become part of the business outcome.
StackState addresses the problem by putting IT and business data in the same context and providing a single point of truth. This has an immediate impact on both the efficiency of human operators and AI algorithms. It becomes possible to see what IT incidents are impacting the business now and what is likely to impact it in the future. Having business data in the context of IT data allows StackState to perform automatic root-cause analysis by relating anomalies in the business KPIs to IT problems.
The anomaly detection and automatic root cause analysis dramatically reduce the time to repair, and predictive analytics makes it possible to prevent future problems. These technologies reduce the burden on IT operators.
The Good, Bad and Ugly about AI that you have heard or predict –
In the context of AIOps:
Good – good data and good objective – AI automates managing IT systems and makes it easy to run the business. Problems are automatically detected, and many are solved without human intervention. When a human does have to intervene, the AI provides all necessary context for a human operator to make a decision and resolve the problem. The AI learns from human operators and automatically resolves similar problems in the future.
Bad – bad data and good objective – the AI attempts to automate managing IT systems but does a poor job. The data fed to the AI is incomplete and lacks context. The AI increases noise by alerting about irrelevant events – for example, IT errors that do not impact the business. The AI fails to perform root cause analysis because it does not know how components are connected. Anomaly detection does not work because the AI cannot distinguish between normal and abnormal behavior.
Ugly – good data and bad objective – this one is straight from Isaac Asimov stories, where AI “protects” humanity by enslaving it. With the wrong objective, the AI will reduce the error rate by blocking all traffic to the website, reduce latency by immediately returning “404 not found”, reduce false-positive rates of alerts by just never alerting about anything and reporting the Big Bang as a root cause for everything.
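The "wrong objective" failure can be shown with a few lines of arithmetic. In the Python sketch below, the two policies and their request counts are made-up numbers, and the two objective functions are simplified assumptions. An optimizer that only minimizes the error rate prefers blocking all traffic, while one that maximizes successfully served requests does not.

```python
def error_rate(served, failed):
    """Fraction of served requests that failed; 0.0 when nothing was served."""
    return failed / served if served else 0.0


def successful_requests(served, failed):
    """Number of requests that were served without failing."""
    return served - failed


# Hypothetical outcomes for 1000 incoming requests under two policies.
POLICIES = {
    "serve normally": {"served": 1000, "failed": 20},
    "block all traffic": {"served": 0, "failed": 0},
}

# Objective 1: minimize the error rate, ignoring everything else.
best_by_error_rate = min(POLICIES, key=lambda p: error_rate(**POLICIES[p]))

# Objective 2: maximize the number of requests served successfully.
best_by_success = max(POLICIES, key=lambda p: successful_requests(**POLICIES[p]))
```

Blocking all traffic yields a perfect error rate of zero, which is exactly why the choice of objective matters as much as the quality of the data.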
Hideous – bad data and bad objective – Using AI to make sense of bad data to learn something that should already be known. I have seen several vendors that deduce a topology by correlating data such that an AI can report a possible problem that might be caused by a component that might be a database that might have some relationship with your application, which might have something to do with your revenue streams. You certainly don’t want such an AI waking you up in the middle of the night.
How do you promote your ideas?
The weaponization of AI is not really top of mind for me to be honest. I might philosophize about such topics over a beer, but it isn’t really a topic in the AIOps space. I promote the adoption of AIOps so that instead of fixing stuff, people are now free to innovate and build stuff. It allows the organization to move from being “reactive” to being “proactive”. This way it can focus its efforts on conquering and creating new markets instead of trying to maintain the status quo. Our work in the AIOps space is specifically targeted at improving human life, but we need to be vigilant that it actually turns out that way.
The Crystal Gaze
What AI, ML and SaaS start-ups and labs are you keenly following?
I follow all players in the AIOps space, big and small, and learn what there is to learn.
What technologies within AI/NLP and Cloud Analytics are you interested in?
Graph Convolutional Neural Networks, predicting time series at scale, meta and transfer learning, Auto ML, Kubernetes, and Keras-based AI solutions.
What are the new emerging markets for these technology markets?
The IT industry itself. For example, the AIOps space, though still in its infancy, is a space of huge opportunities. The entrance into AIOps is quite smooth because all the data is typically already there.
The opportunity lies in the fact that right now use-cases around security, monitoring, cost control and a bunch more are all separated into their own markets. AIOps is a space that will unify many of these use-cases.
What’s your smartest work-related shortcut or productivity hack?
I am always full of ideas, so time is a very precious resource for me. Once I learned the art of starting something, working on it just long enough to be sure that the new baby project will fly, and then slowly but surely making myself obsolete, I started to achieve far more results.
Tag the one person in the industry whose answers to these questions you would love to read:
Dan Pei, he is doing some really interesting work in the AIOps space.
Thank you, Lodewijk! That was fun, and we hope to see you back on AiThority soon.
Lodewijk started in IT at an early age and has 20+ years of experience. He has worked on 3D engines, graph databases, DevOps and web technology, and is currently really into AI and monitoring. He is currently CTO at StackState, where he is involved in the overall vision, strategy, and leadership of the engineering effort and team.
StackState is the world’s leading monitoring and AIOps platform for hybrid IT. StackState utilizes current IT investments by combining and analyzing metrics, logs, events, and data beyond typical monitoring data, such as Google Analytics, Twitter, CMDBs, CI/CD tools, service registries, and automation and incident management tools. StackState uses the variety of data it collects to learn about dependencies, allowing it to build a topology of dynamic IT landscapes in real time.
By ‘rewinding’ the topology visualization in time, StackState instantly assists teams in discovering the root cause of incidents and how the impact of these incidents has propagated across business and on-premise, cloud and hybrid IT landscapes. Moreover, by combining the historical and streaming data collected, StackState is able to predict future incidents and their impact. Overall, StackState helps organizations make better decisions faster and avoid high-severity outages.