Adding Data Science to the DevOps Toolkit

See the previous post in this series on workflow scheduling

Every day, Amperity processes data that our customers then use to power their businesses. As such, it’s important that we deliver these data artifacts on tight timelines. Testing and craftsmanship minimize errors, but ultimately we must detect and mitigate any problems that do occur. A workflow that runs abnormally long needs an operator or engineer to investigate possible errors, which means a human must manually watch each workflow and understand what normal operation looks like for it. Automated anomaly detection approximates and automates that understanding, allowing us to maintain vigilance while scaling the number of customers and workflows being monitored.

Recasting a Systems Problem as a Data Science Problem

To increase our system’s reliability, we can employ a data science perspective. Rather than trying to list and guard against the myriad ways each task in a workflow might go wrong, we can treat each task as a black box, measuring its duration and recording whether it succeeded. We can then treat this as a data problem and try to predict a task’s duration based on the task type.
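In code, the black-box view reduces each task run to a tiny record. A minimal sketch in Python (the field names here are illustrative, not our actual schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TaskRun:
    """One black-box observation of a task run: what ran, how long, and whether it worked."""
    task_type: str           # the kind of task, e.g. an ingest or export step
    started_at: datetime     # when the run began
    duration_seconds: float  # how long the run took
    succeeded: bool          # True if the task finished without error
```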

Casting a systems problem as a data problem opens up a whole new set of tools and still admits simple solutions. Training, deploying, and managing complex models is challenging, but treating a problem as a data problem doesn’t require immediately delving into fragile, uninterpretable math. Simple data-driven heuristics often provide intuitive models and sufficient performance while being easy to understand and implement. Our heuristic queries our workflow service for a two-week history of each task, filtering out any failed runs. The mean of this history becomes the expected runtime, and the mean plus two standard deviations becomes the upper limit on the expected runtime. We regularly check whether tasks are running past their upper limit and alert when they do.
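A sketch of that heuristic in Python, using the hypothetical TaskRun record above (the two-week window and the two-standard-deviation limit come straight from the description; everything else is illustrative):

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

def runtime_upper_limit(history: list[TaskRun]) -> float:
    """Expected runtime plus two standard deviations, from two weeks of successful runs."""
    cutoff = datetime.utcnow() - timedelta(days=14)
    durations = [run.duration_seconds for run in history
                 if run.succeeded and run.started_at >= cutoff]
    # Needs at least two successful runs; a real system would handle sparse histories.
    return mean(durations) + 2 * stdev(durations)

def is_anomalous(current_duration_seconds: float, history: list[TaskRun]) -> bool:
    """Flag a running task once it exceeds its upper limit."""
    return current_duration_seconds > runtime_upper_limit(history)
```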

Task runtimes plotted over time, with the threshold at which they would be declared anomalies highlighted. Once a task gets outside the blue region, it is declared anomalous.

Justification for the Heuristic

[Warning: mathematical hand waving] This heuristic is based on the assumption that task durations are drawn from a heavily imbalanced multimodal distribution. We model errorless tasks as coming from one normal distribution (\(Good \sim N(\mu_g, \sigma_g)\)), and tasks experiencing some error as coming from a different normal distribution (\(Bad \sim N(\mu_b, \sigma_b)\)), where \(\mu_g \ll \mu_b\). Which distribution a task’s duration is drawn from is then a Bernoulli random variable, with \(p\) near 1. We could try to estimate all five parameters (\(\mu_g, \sigma_g, \mu_b, \sigma_b,\) and \(p\)), but we’re trying to keep this simple, so let’s instead leverage the assumption that \(p\) is nearly 1. This means the mean and standard deviation of a sample from this distribution will tend to most closely approximate \(\mu_g\) and \(\sigma_g\) (we further skew this in our favor by ignoring failed or aborted tasks). We then assert that task durations that appear abnormal based on these estimates are more likely to have been drawn from \(Bad\), and thus worth a human investigating.
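To make the key step explicit, the mean and variance of this mixture are \(p\)-weighted blends of the two modes,

\[
E[T] = p\,\mu_g + (1-p)\,\mu_b, \qquad
\operatorname{Var}[T] = p\,\sigma_g^2 + (1-p)\,\sigma_b^2 + p(1-p)(\mu_b - \mu_g)^2,
\]

so as \(p \to 1\) both collapse toward \(\mu_g\) and \(\sigma_g^2\), provided the \((1-p)\) terms stay small relative to the Good parameters.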

Consider Tailoring the Loss Function to the Implementation

When measuring a model’s performance, it’s important to keep in mind the broader context in which the model will run. Doing so saved us time and effort that would otherwise have been wasted.

A 30-second task that runs twice as long poses much less risk to the pipeline than an hour-long task that runs twice as long, yet, assuming similar histories, the two would look like equally significant anomalies. We could have accounted for this by introducing more complexity into the heuristic; one idea was to learn a model that predicts the number of allowed standard deviations from the expected duration. Instead, a simple solution presented itself.

The actual logic that alerts on an anomalous running task executes on a timer. With a 10-minute interval, any task running longer than 10 minutes is guaranteed to get checked at least once, while short tasks might start after one check and finish before the next. When measuring model performance against a dataset, we can model this by subsampling tasks shorter than 10 minutes. Folding this “implementation detail” into our loss function makes it clear that adding complexity to avoid alerting on short tasks is unnecessary.
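One way to fold that into the evaluation: if checks land at a uniformly random offset within the 10-minute interval, a run of duration \(d\) seconds is observed with probability roughly \(\min(1, d/600)\). A sketch of that subsampling (the weighting is an illustrative assumption, not our exact evaluation code):

```python
import random

CHECK_INTERVAL_SECONDS = 600  # the 10-minute timer that polls running tasks

def seen_by_checker(duration_seconds: float) -> bool:
    """Roughly model whether the periodic check would ever observe this run."""
    return random.random() < min(1.0, duration_seconds / CHECK_INTERVAL_SECONDS)

def evaluation_sample(runs: list[TaskRun]) -> list[TaskRun]:
    """Subsample short runs the way the checker effectively does in production."""
    return [run for run in runs if seen_by_checker(run.duration_seconds)]
```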

Beyond Task Durations

Employing this system gave us good visibility into anomalous task durations, but those are not the only thing that can quietly affect runtimes. A workflow may not start at all, waiting on outputs from another flow. Alternatively, the overall workflow might run long without any individual task crossing the anomaly threshold; for example, if task durations are correlated and, for a particular workflow, they all run slightly long, that can still cause a significant delay in the overall workflow. Because we opted for a simple anomaly detection system, it was easy to add other heuristics to detect and alert on these cases (sketched below). Most insidious of all is the workflow whose task durations slowly creep up. As customers leverage our service more, they want each workflow to deliver more products, and as our team builds new features, more work is added to the pipeline to provide data for them. Both effects cause gradual increases in workflow runtimes, eventually putting them dangerously close to missing the delivery window without ever generating anomalies. These are solved with process: regular reviews of workflow runtimes with the integrations team, powered by the metrics gathered for anomaly detection.
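Because the detector is just a threshold on a rolling history, workflow-level variants drop in with little extra machinery. A hypothetical sketch of the two checks mentioned above (the inputs and grace period are assumptions, not our production code):

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

def workflow_runs_long(total_duration_seconds: float,
                       past_workflow_durations: list[float]) -> bool:
    """Apply the same mean-plus-two-standard-deviations rule to the whole workflow."""
    limit = mean(past_workflow_durations) + 2 * stdev(past_workflow_durations)
    return total_duration_seconds > limit

def workflow_failed_to_start(expected_start: datetime,
                             now: datetime,
                             grace: timedelta = timedelta(minutes=30)) -> bool:
    """Alert when a workflow still hasn't started well past its expected start time."""
    return now > expected_start + grace
```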

Visualizations Break Down Tribal Knowledge

Finally, besides reducing operational load, building an anomaly detection system had another benefit. The first part of tackling a data science problem is extracting and visualizing the data, and these simple visualizations provide a detailed view into the historical behavior of a workflow. Plot task duration over time, and features like growth in the amount of data ingested, the number and complexity of customer queries, and the performance of the core identity resolution model become apparent. In that sense, the “systems problem as a data problem” approach also helps disseminate to new engineers some of the tribal knowledge and intuition veteran engineers have about the product.
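Since the detector already collects per-run durations, the visualization is a thin layer on top of them. A minimal sketch with matplotlib, reusing the hypothetical TaskRun record from earlier:

```python
import matplotlib.pyplot as plt

def plot_task_durations(runs: list[TaskRun], task_type: str) -> None:
    """Plot one task type's durations over time: the view that surfaces gradual drift."""
    points = sorted((run.started_at, run.duration_seconds)
                    for run in runs
                    if run.task_type == task_type and run.succeeded)
    times, durations = zip(*points)
    plt.plot(times, durations, marker="o")
    plt.xlabel("run start time")
    plt.ylabel("duration (seconds)")
    plt.title(f"Task durations over time: {task_type}")
    plt.show()
```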

This task suddenly changes its average runtime: some data was no longer required to be exported.

This task has very short durations two days a week: the data source doesn’t have data to serve on weekends.