Distributed Data Flow Scheduling and Monitoring

At Amperity, our services process terabytes of data across dozens of large brands every day. Our systems ingest raw data in various formats from many locations, perform intelligent identity resolution, generate custom databases with customer profiles and send data to customer systems of action and analytics.

To make sure we process data quickly and cost-effectively, we use workflows. Each workflow is a logically continuous flow of data through different stages of processing until the final desired result is reached. In this post we will review how Amperity approaches setting up these data flows and how we monitor them to make sure all data is processed in every run of each workflow.

Integration with Airflow

Currently our data flow scheduling leverages Apache Airflow. We have a custom service, called glow, that acts as a translation layer between Apache Airflow and the rest of the Amperity application. Our decision to leverage Airflow was based on the fact that it provides basic workflow logic in the form of a DAG (Directed Acyclic Graph), which represents a collection of tasks to run, organized according to their relationships and dependencies. Airflow also provides a web-based user interface that our team leverages to review constructed DAGs, monitor the status of currently running flows, and perform operational tasks such as pausing and restarting tasks in data flows.

With Airflow providing core support for scheduling and monitoring the workflows, we are able to focus our efforts on establishing the best ways for our internal services to configure workflows and react to task completion or failure. The services maintain the configuration and parameters of each workflow for each customer. Every minute or so, the glow service retrieves configurations for all active customer accounts and constructs the DAGs necessary to run their data flows.
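To give a flavor of what this looks like, here is a minimal sketch of the dynamic DAG generation pattern in Airflow, assuming a recent Airflow 2.x release. The fetch_active_tenant_configs() helper, tenant names, schedules, and task callables are hypothetical placeholders for illustration, not the actual glow implementation.

# A minimal sketch of dynamic DAG generation from per-tenant configuration.
# fetch_active_tenant_configs() and the task callables are hypothetical stand-ins;
# the real glow service builds its DAGs from Amperity's internal configuration services.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_active_tenant_configs():
    """Stand-in for reading per-tenant workflow configuration."""
    return [
        {"tenant": "tenant-01", "workflow": "daily-refresh", "schedule": "0 9 * * *"},
        {"tenant": "tenant-02", "workflow": "daily-refresh", "schedule": "0 12 * * *"},
    ]


def build_dag(config):
    dag = DAG(
        dag_id=f"{config['tenant']}-{config['workflow']}",
        schedule_interval=config["schedule"],
        start_date=datetime(2019, 1, 1),
        catchup=False,
    )
    with dag:
        ingest = PythonOperator(task_id="ingest", python_callable=lambda: None)
        transform = PythonOperator(task_id="transform", python_callable=lambda: None)
        deliver = PythonOperator(task_id="deliver", python_callable=lambda: None)
        ingest >> transform >> deliver  # run ingest, then transform, then deliver
    return dag


# Airflow discovers DAGs by scanning module-level globals, so register one DAG per tenant.
for cfg in fetch_active_tenant_configs():
    dag = build_dag(cfg)
    globals()[dag.dag_id] = dag

Registering each generated DAG in the module's global namespace is what lets the Airflow scheduler discover one DAG per tenant workflow from a single definition file.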

Data Flow Scheduling

The Amperity application allows customers to start workflows manually and to schedule them. When customers start workflows from the application, the operations are run by the Amperity services as soon as possible. Scheduled workflows are driven by configuration stored for each customer. The glow service checks each customer's configuration and retrieves the schedules for all of that customer's workflows. It then defines DAGs corresponding to those workflows, including all tasks that need to be performed on the data. When it's time to run, the glow service kicks off the DAGs in Airflow. Below is one example of this interaction.

Segments are one of the core concepts in the Amperity application: they enable businesses to identify interesting groups of users (or statistics about them) for direct marketing or for deriving insights. Here is an example of a SQL query used to define a segment.

Segment editor in the Amperity application

While the Amperity application allows customers to run segments and download results, the primary use for segments is to set up a regular upload of segment results into various customer systems of action and analytics. Customers can configure the desired destination as shown below.

Destination configuration in the segment editor in the Amperity application

Customers can configure the desired schedule for results to be uploaded to their configured destination. Below is an example configuration for sending results of the Test SQL Segment daily at 9:00 am Pacific Time, which respects Daylight Saving Time changes.

Daily schedule configuration in the Amperity application
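Under the hood, a schedule like this maps naturally onto a timezone-aware cron expression. Here is a sketch of how it could be expressed in a recent Airflow 2.x release using a pendulum timezone; the DAG id and task are illustrative rather than Amperity's actual configuration.

# A sketch of how a "daily at 9:00 am Pacific" schedule can be expressed in Airflow.
# Using a pendulum timezone makes the schedule follow Daylight Saving Time changes;
# the DAG id and task are illustrative, not Amperity's actual configuration.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

pacific = pendulum.timezone("America/Los_Angeles")

with DAG(
    dag_id="test-sql-segment-daily-upload",
    schedule_interval="0 9 * * *",          # 9:00 am in the DAG's timezone
    start_date=pendulum.datetime(2019, 1, 1, tz=pacific),
    catchup=False,
) as dag:
    send_results = EmptyOperator(task_id="send-results")

Because the start date is timezone-aware, the 9:00 am cron expression is evaluated in Pacific local time, so runs shift automatically when Daylight Saving Time begins or ends.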

Customers can manually request results to be sent to a destination by selecting the Run option in the application menu.

Kicking off manual request to send results in the Amperity application

Further discussion of the details of segment and destination configuration is beyond the scope of this post. Instead, let's take a look at the details of scheduling. Every minute or so, the glow service checks every service for updated configuration. Once the glow service detects a new destination configuration, it creates a new DAG in Airflow with the collection of tasks needed to ensure the data is generated and sent to the destination. Here is the DAG prepared by the glow service in Airflow for a manual upload of the results from the test SQL segment shown previously.

DAG for manual upload of results in Airflow

Here is part of the tree view of the DAG prepared by the glow service in Airflow for the daily scheduled uploads of results from the test SQL segment example above.

DAG for daily upload of results in Airflow

When it is time to run a workflow for delivering results from this test SQL segment to the configured destination, the glow service requests that Airflow run the corresponding DAG, then monitors it by periodically checking its status in Airflow. As the DAG status changes, the glow service sends progress notifications to other internal services over our internal deployment of Apache Kafka. The processing details of these events are not in scope for this post, but that is something we will cover in future posts.
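As a rough sketch of that interaction, the snippet below kicks off a DAG run through Airflow's stable REST API (available in Airflow 2.x), polls its state, and publishes every state change to a Kafka topic using kafka-python. The endpoint host, credentials, and the workflow-status topic are assumptions for illustration, not the actual glow implementation.

# A sketch of triggering an Airflow DAG run, polling its state, and publishing
# progress events to Kafka. Endpoints follow Airflow 2.x's stable REST API;
# the host, credentials, and "workflow-status" topic are illustrative assumptions.
import json
import time

import requests
from kafka import KafkaProducer

AIRFLOW = "http://airflow.internal:8080/api/v1"
AUTH = ("glow", "secret")  # placeholder credentials

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def run_and_monitor(dag_id: str, poll_seconds: int = 30) -> str:
    # Kick off a new DAG run.
    resp = requests.post(f"{AIRFLOW}/dags/{dag_id}/dagRuns", json={"conf": {}}, auth=AUTH)
    resp.raise_for_status()
    run_id = resp.json()["dag_run_id"]

    last_state = None
    while True:
        run = requests.get(f"{AIRFLOW}/dags/{dag_id}/dagRuns/{run_id}", auth=AUTH).json()
        state = run["state"]
        if state != last_state:
            # Publish a progress event whenever the run changes state.
            producer.send("workflow-status", {"dag_id": dag_id, "run_id": run_id, "state": state})
            last_state = state
        if state in ("success", "failed"):
            return state
        time.sleep(poll_seconds)

The real service tracks many DAG runs at once, but the trigger, poll, and publish steps capture the shape of the interaction.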

SLA Data Flows

A number of the scheduled data flows for each customer correspond to critical business processes. Typically in these scheduled data flows, the Amperity services process customer data for the time period between consecutive runs and produce results our customers expect to receive on a strict timeline. We call these workflows SLA flows (Service Level Agreement flows). Our monitoring tools closely track the SLA flows to ensure successful and timely completion. There are two major types of SLA data flows. The first type extracts customer data from its original location, loads it into Amperity's internal data storage, and then starts all tasks required to perform identity resolution, deduplication, and other transformations of the data received. The second type writes the results of data processing from Amperity's internal storage to customer-provided output locations; these flows are variations of the test SQL segment example described previously.

Using our internal tool called admin-repl, we can retrieve the status of the SLA workflows scheduled to run at a given time, which we refer to as a schedule window. Here is an example of the textual report generated by (sla-summary).

[prod] amperity.repl.admin=> (sla-summary)
2019-05-22T14:00:00.000Z wsw-20190522-36009-7nG7m         tenant-01 open        SLA
2019-05-22T12:00:00.000Z wsw-20190522-28808-4zoXKs        tenant-02 succeeded   SLA
2019-05-22T14:00:00.000Z wsw-20190522-36009-7BSTTJ        tenant-03 succeeded   SLA
2019-05-22T07:00:00.000Z wsw-20190522-10804-6iiUae        tenant-04 open        SLA
2019-05-22T11:00:00.000Z wsw-20190522-25203-7JBGRg        tenant-05 open        SLA
2019-05-22T16:00:00.000Z wsw-20190522-43202-6Ac7Rd        tenant-06 open        SLA
2019-05-22T07:00:00.000Z wsw-20190522-10804-4GPq5b        tenant-07 succeeded   SLA
2019-05-22T13:00:00.000Z wsw-20190522-32405-5kTRB5        tenant-08 open        SLA

If a workflow is marked as succeeded, then all of its tasks completed and the results were delivered to the customer-specified destinations. If it is open, then some parts of the daily flow have not completed yet. We have commands to drill down into each schedule window, such as (sla-report), and into each workflow, such as (flow-report), giving the team full visibility into every workflow and every task within it.
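As a simplified illustration of how these statuses roll up (the actual admin-repl tooling works differently), a workflow stays open until every one of its tasks has succeeded, and a schedule-window summary prints one line per tenant workflow:

# A simplified Python illustration of how per-task states roll up into the
# workflow statuses shown by (sla-summary); the real admin-repl tooling differs.
from typing import Dict, List


def workflow_status(task_states: List[str]) -> str:
    """A workflow is 'succeeded' only when every task completed; otherwise it stays 'open'."""
    return "succeeded" if task_states and all(s == "succeeded" for s in task_states) else "open"


def sla_summary(windows: Dict[str, Dict[str, List[str]]]) -> None:
    """Print one line per tenant workflow within each schedule window."""
    for window, tenants in windows.items():
        for tenant, tasks in tenants.items():
            print(f"{window}  {tenant:<10} {workflow_status(tasks):<10} SLA")


sla_summary({
    "2019-05-22T14:00:00.000Z": {
        "tenant-01": ["succeeded", "running", "pending"],     # still open
        "tenant-03": ["succeeded", "succeeded", "succeeded"],  # succeeded
    },
})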

Data Flow Monitoring

As mentioned earlier, while DAGs are running in Airflow, the glow service checks on the status of each workflow and publishes status updates as events in Kafka. Internal services react to these events in real time to perform the necessary operations. Such updates are also shown to customers in the Amperity application on the notifications tab. Here is an example of a notification shown in the application for the test SQL segment results upload.

Notification about results upload in the Amperity application
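Notifications like the one above are driven by those Kafka events. Here is a sketch of an internal service consuming them with kafka-python; the topic name and message shape follow the producer sketch shown earlier and are assumptions for illustration.

# A sketch of an internal service consuming workflow status events from Kafka
# and turning terminal states into user-facing notifications. The topic name and
# message shape match the producer sketch above and are illustrative assumptions.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "workflow-status",
    bootstrap_servers="kafka.internal:9092",
    group_id="notification-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["state"] == "success":
        print(f"Results for {event['dag_id']} were delivered successfully.")
    elif event["state"] == "failed":
        print(f"Workflow {event['dag_id']} failed; raising an operational alert.")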

In the case of SLA flows, our internal monitoring tools make sure that each flow starts when expected and makes progress as expected. If any task within the flow fails or if the flow gets interrupted in any other way, our internal tools raise operational alerts to notify the team. Leveraging the history of past runs of each data flow in each tenant, we also monitor for abnormal task durations. If a particular task run is detected as anomalous (for example, running much longer than is typical given how consistent prior runs have been), we review the status of the flow, understand the delay, and attempt to put a remediation in place to ensure the data flow completes on schedule.
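As an illustration of the duration check, the sketch below flags a task run as anomalous when it exceeds the historical mean by several standard deviations, with a floor on the deviation so that a very consistent history still tolerates small jitter. The thresholds are illustrative and are not Amperity's actual alerting rules.

# A sketch of flagging anomalous task durations against the history of prior runs.
# The thresholds are illustrative; they are not Amperity's actual alerting rules.
from statistics import mean, pstdev
from typing import Sequence


def is_duration_anomalous(history: Sequence[float], current: float, z_threshold: float = 3.0) -> bool:
    """Return True if the current run is much slower than prior runs of the same task."""
    if len(history) < 5:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = pstdev(history)
    # A highly consistent history (tiny sigma) should still tolerate small absolute jitter.
    sigma = max(sigma, 0.05 * mu)
    return current > mu + z_threshold * sigma


# Example: prior daily runs took roughly 40 minutes; a 95-minute run is flagged.
past_minutes = [38, 41, 40, 39, 42, 40, 41]
print(is_duration_anomalous(past_minutes, 95))   # True
print(is_duration_anomalous(past_minutes, 44))   # False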

This concludes our brief introduction to how Amperity organizes data processing into workflows, and how we approach scheduling and monitoring them. If you would like more details about any part of the system discussed in this post, please send us your comments over email or on Twitter.