Adding Data Science to the DevOps Toolkit

See previous post in series on workflow scheduling

Every day, Amperity processes data that our customers then use to power their businesses. As such, it’s important that we deliver these artifacts on tight timelines. Testing and craftsmanship strive to minimize errors, but ultimately we must detect and mitigate any problems that do occur. A workflow that runs abnormally long needs an operator or engineer to investigate for possible errors. Without automation, that means a human must watch each workflow and understand what normal operation looks like for each one. Automated anomaly detection approximates and automates that understanding, allowing us to maintain vigilance while scaling the number of customers and workflows being monitored.

Recasting a Systems Problem as a Data Science Problem

To increase our system’s reliability, we can employ a data science perspective to open up a new set of tools and solutions. Rather than trying to list and guard against the myriad ways each task in a workflow might go wrong, we can treat each task as a black box, measuring its duration and recording whether it succeeded. We can then treat this as a data problem and try to predict a task’s duration based on the task type.

Casting a systems problem as a data problem does not rule out simple solutions. Training, deploying, and managing complex models is challenging, but treating something as a data problem doesn’t require immediately delving into fragile and uninterpretable math. Simple data-driven heuristics are often intuitive, easy to implement, and perform well enough. Our heuristic queries our workflow service for a two-week history of each task, filtering out any failed runs. The mean of this history becomes the expected runtime, and the mean plus 2 standard deviations becomes the upper limit on the expected runtime. We regularly check whether tasks are running past their upper limit and alert when they do.

Once a task gets outside the blue region, it is declared anomalous
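The heuristic can be sketched in a few lines of Python. This is a minimal illustration rather than our production code: `TaskRun`, `upper_limit`, and `is_anomalous` are hypothetical names, the history is assumed to have already been fetched from the workflow service, and the use of the sample standard deviation (as opposed to the population one) is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Optional


@dataclass
class TaskRun:
    """One historical run of a task (hypothetical record shape)."""
    duration_s: float
    succeeded: bool


def upper_limit(history: List[TaskRun], num_std: float = 2.0) -> Optional[float]:
    """Return mean + num_std * standard deviation of successful runs.

    Returns None when there are fewer than two successful runs, since a
    sample standard deviation is undefined in that case.
    """
    durations = [run.duration_s for run in history if run.succeeded]
    if len(durations) < 2:
        return None
    return mean(durations) + num_std * stdev(durations)


def is_anomalous(elapsed_s: float, history: List[TaskRun], num_std: float = 2.0) -> bool:
    """True when a running task has exceeded its expected upper limit."""
    limit = upper_limit(history, num_std)
    return limit is not None and elapsed_s > limit


# Example: four successful runs near 100 seconds and one failed run,
# which is filtered out before the statistics are computed.
example_history = [
    TaskRun(100, True),
    TaskRun(110, True),
    TaskRun(90, True),
    TaskRun(100, True),
    TaskRun(500, False),
]
print(upper_limit(example_history))
print(is_anomalous(120, example_history))
```

Returning `None` for a thin history is one possible design choice: it keeps brand-new tasks from alerting until enough successful runs exist to estimate a spread.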

Justification for the Heuristic

[Warning: mathematical hand waving] This heuristic is based on the assumption that task durations are drawn from a heavily imbalanced bimodal distribution. We model the errorless tasks as one normal distribution ($$Good \sim N(\mu_g, \sigma_g)$$), and tasks experiencing some error as a different normal distribution ($$Bad \sim N(\mu_b, \sigma_b)$$), where $$\mu_g \ll \mu_b$$. Which distribution a given task’s duration is drawn from is then decided by a Bernoulli trial that selects $$Good$$ with probability $$p$$, with $$p$$ near 1. We could try to estimate all 5 parameters ($$\mu_g, \sigma_g, \mu_b, \sigma_b,$$ and $$p$$), but we’re trying to keep this simple, so let’s instead leverage the assumption that $$p$$ is nearly 1. This means that the mean and standard deviation of a sample from this distribution will tend to approximate $$\mu_g$$ and $$\sigma_g$$ (we further skew this in our favor by ignoring failed or aborted tasks). We then assert that task durations that appear abnormal based on these estimates are more likely to have been drawn from $$Bad$$, and are thus worth a human investigating.
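To make the hand waving slightly more concrete, the mean and variance of a two-component mixture can be written out. This is a standard identity rather than something from our original analysis, and it assumes that $$p$$ is the probability that a duration comes from $$Good$$ and that $$\sigma_g$$ and $$\sigma_b$$ are standard deviations. Writing $$X$$ for a task duration:

$$E[X] = p\,\mu_g + (1-p)\,\mu_b$$

$$\mathrm{Var}[X] = p\,\sigma_g^2 + (1-p)\,\sigma_b^2 + p(1-p)(\mu_b - \mu_g)^2$$

As $$p \to 1$$, these tend to $$\mu_g$$ and $$\sigma_g^2$$, which is the sense in which the sample statistics approximate the $$Good$$ parameters. The last term grows with the gap between the two means, so a handful of very slow runs can still inflate the estimated standard deviation. Filtering out failed or aborted tasks removes some of that inflation.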

Consider Tailoring the Loss Function to the Implementation

When measuring a model’s performance, it’s important to keep in mind the broader context in which the model will run. Doing so saved us time and effort that would otherwise have been wasted.

A 30-second task that runs twice as long poses much less risk to the pipeline than an hour-long task that runs twice as long. Assuming similar histories, the heuristic would treat these as equally significant anomalies. Introducing more complexity into the heuristic could have accounted for this. One idea was to learn a model that predicts the number of allowed standard deviations based on the expected duration. Instead, a simple solution presented itself.

The logic that alerts on an anomalous running task is executed on a timer. With a 10-minute timer interval, any task that runs longer than 10 minutes is guaranteed to be checked at least once, while a short task might start after one check and finish before the next. When measuring model performance against a dataset, we can model this by subsampling tasks shorter than 10 minutes. Once this “implementation detail” is included in the loss function, it becomes clearer that adding complexity to avoid alerting on short tasks is redundant.
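One way to picture the subsampling is sketched below. The weighting is an assumption, not a description of our evaluation code: it supposes that checks fire at a fixed interval and that a task’s start time is uniformly random relative to the timer, so a task of duration `d` is seen by at least one check with probability `min(1, d / interval)`. The names `observation_probability` and `subsample_for_evaluation` are hypothetical.

```python
import random
from typing import List, Optional

CHECK_INTERVAL_S = 600.0  # 10-minute timer interval


def observation_probability(duration_s: float, interval_s: float = CHECK_INTERVAL_S) -> float:
    """Chance that at least one periodic check lands while the task is running.

    Assumes the task starts at a uniformly random offset within the interval.
    """
    return min(1.0, duration_s / interval_s)


def subsample_for_evaluation(
    durations_s: List[float],
    interval_s: float = CHECK_INTERVAL_S,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Keep each task with the probability that the timer would have seen it.

    Tasks at least as long as the interval are always kept; shorter tasks
    are kept in proportion to their duration.
    """
    rng = rng or random.Random()
    return [
        d for d in durations_s
        if rng.random() < observation_probability(d, interval_s)
    ]


# Example: a 30-second task is rarely seen by the timer, while an
# hour-long task is always seen.
print(observation_probability(30.0))
print(observation_probability(3600.0))
```

Evaluating the heuristic on the subsampled dataset down-weights short tasks in the same way the timer does in production, which is why extra model complexity aimed at short tasks buys little.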