Data adds the most value when data sources and transformations are transparent and observable. Without observable data infrastructure, the insights unlocked through analysis can’t be trusted. Sibi’s shift to an event-driven architecture moved us away from traditional batch-load data pipelines. That move introduced a number of challenges to how we load and transform data for analytical use and put our traditional Extract-Transform-Load (ETL) process under pressure. Implementing a modern data stack built on Extract-Load-Transform (ELT) not only met those challenges but also brought several improvements:
- Effortless data pipelines with AWS Kinesis and CloudFormation
- Transformations are handled just-in-time, allowing us to rebuild the entire data warehouse from raw source data in minutes
- Raw source data is available for anyone to explore, opening new lines of communication and reducing time spent debugging
- Rich performance monitoring, test results, and data documentation in a convenient web interface
- Increased speed to recognize and solve data integrity issues through testing and monitoring
Effortless Data Pipelines
A lack of flexibility and visibility in our traditional ETL pipelines left us constantly battling issues:
- Small changes in the events feeding the pipelines required time-consuming code updates
- Data was often left in an inconsistent state with no way to re-run transformations
- Multiple processes updating a single table obfuscated the source of errors
- Redshift’s analytical engine, which favors bulk loads, struggled to keep up with row-by-row inserts and updates. In some cases, events sat in the queue for over 12 hours while Redshift caught up
The first thing we needed was a scalable way to get the data into Redshift. AWS Kinesis Firehose made the extract and load parts of ELT easy with code-free services that ingest events from Amazon EventBridge. We define rules for which events to ingest, and Kinesis manages queueing and bulk-loading into Redshift at configurable intervals. With this approach, all of our pipelines are built from a single parameterized CloudFormation template, which means we can set up a new Kinesis pipeline in minutes with no new code to maintain.
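As a rough sketch of the load side (table and column names here are hypothetical, not our actual schema), each delivery stream targets a plain landing table whose payload column uses Redshift’s SUPER type, which we cover next:

```sql
-- Hypothetical landing table that a Kinesis Firehose delivery stream bulk-loads into.
-- Table and column names are illustrative only.
CREATE TABLE raw_events.order_events (
    event_id        VARCHAR(64),
    event_type      VARCHAR(128),
    event_timestamp TIMESTAMP,
    event_body      SUPER  -- raw JSON payload, queried later with PartiQL
);
```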
Transformations Handled Just-in-Time
The transformation step was a bigger challenge. We turned to Redshift’s powerful SQL engine to transform the raw events into reportable insights. Kinesis loads the raw event data into tables based on the domain the data represents. The event body is stored using Redshift’s SUPER type, which lets us query the raw JSON using dot (.) notation with PartiQL.
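For example, a query along these lines (field names are hypothetical) reaches directly into the JSON body of the landing table sketched above:

```sql
-- Illustrative PartiQL query: dot notation navigates the JSON stored in the SUPER column.
SELECT
    e.event_body.detail.order_id::VARCHAR AS order_id,
    e.event_body.detail.status::VARCHAR   AS order_status,
    e.event_timestamp
FROM raw_events.order_events AS e
WHERE e.event_body.detail.status::VARCHAR = 'shipped';
```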
Using PartiQL we can write SQL queries that do all of the transformation and normalization on-demand. The entire data warehouse can be rebuilt from scratch using the SQL queries at any time. If we discover data integrity issues we simply change the SQL and rebuild all the data with the updated transformations.
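A minimal sketch of that rebuild pattern, using the same hypothetical tables: drop the reporting table and recreate it straight from the raw events, so a fix to the SQL automatically fixes all history.

```sql
-- Illustrative rebuild: the reporting table is recreated entirely from raw source data.
DROP TABLE IF EXISTS analytics.orders;

CREATE TABLE analytics.orders AS
SELECT
    e.event_body.detail.order_id::VARCHAR AS order_id,
    e.event_body.detail.status::VARCHAR   AS status,
    e.event_timestamp                     AS updated_at
FROM raw_events.order_events AS e;
```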
Raw Source Data For Anyone to Explore
Because the events are stored in Redshift’s SUPER type, the raw source events remain available for direct querying. This enables data analysts to dig deeper into the data to discover issues and missed opportunities. Visibility into the raw data leads to a common language across business teams and reinforces naming conventions.
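An analyst can explore the raw stream with nothing more than SQL; for instance, a quick (hypothetical) check of what arrived in the last day:

```sql
-- Ad-hoc exploration of raw events (illustrative).
SELECT
    e.event_type,
    COUNT(*) AS event_count
FROM raw_events.order_events AS e
WHERE e.event_timestamp > DATEADD(day, -1, GETDATE())
GROUP BY e.event_type
ORDER BY event_count DESC;
```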
Rich Performance Monitoring and Documentation
At first, we deployed our SQL queries as Redshift materialized views. Materialized views combine the performance benefits of standard tables with the definitional flexibility of views and can refresh automatically (a sketch of this pattern follows the list below). However, limitations quickly became apparent:
- Deploying an update to a base view was difficult and risky because all downstream views had to be dropped first. In effect we had to drop and rebuild an entire schema to change one field. This caused downtime for the reporting layer each time we deployed changes to a view.
- Materialized views can’t be automatically refreshed if they reference another materialized view. This limitation restricted our data modeling and code re-use.
- Refresh schedules are black-box and not customizable. This led to shadow schedules forcing refresh of certain views and an overall lack of visibility into the freshness of the data.
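The sketch below shows roughly what that earlier pattern looked like (illustrative only, not one of our actual views):

```sql
-- Illustrative Redshift materialized view with auto refresh, the approach we moved away from.
CREATE MATERIALIZED VIEW analytics.mv_daily_event_counts
AUTO REFRESH YES
AS
SELECT
    event_type,
    TRUNC(event_timestamp) AS event_date,
    COUNT(*)               AS event_count
FROM raw_events.order_events
GROUP BY event_type, TRUNC(event_timestamp);
```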
dbt (data build tool), an open-source data transformation framework, offered the solution. dbt resolves the dependency issues of materialized views and provides automated documentation, data lineage tracking, code validation, data testing, build monitoring, and more. SQL files in dbt support Jinja templating and macros, which unlocks new possibilities for code re-use and keeps code clean. Scheduling and orchestration of dbt runs is handled by open-source integrations with our existing Airflow instance. This made implementation simple and fast while giving us full control over when and how models are refreshed.
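A minimal sketch of a dbt model (model and column names are hypothetical): plain SQL plus Jinja, where {{ ref() }} declares the dependency that dbt uses to build models in the right order.

```sql
-- models/marts/orders.sql (illustrative dbt model)
{{ config(materialized='table') }}

SELECT
    order_id,
    status,
    updated_at
FROM {{ ref('stg_order_events') }}  -- ref() builds the dependency graph for us
WHERE status IS NOT NULL
```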
Increased Speed to Solve Data Integrity Issues
Our dbt implementation led to better visibility at every level. dbt’s elementary package gives us automatic anomaly detection and monitoring for even stronger test coverage. We now have instant access to rich information about our data, and a full set of test suites runs during builds or on demand. Past results are stored and made available in an interactive app, so we can explore how model runs and test results have changed over time.
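Tests themselves can be as simple as a SQL file; here is an illustrative singular dbt test (the model and column names are hypothetical) that fails if any rows come back:

```sql
-- tests/assert_orders_have_ids.sql (illustrative singular dbt test)
-- dbt treats any returned rows as test failures.
SELECT *
FROM {{ ref('orders') }}
WHERE order_id IS NULL
```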
Result: More Valuable Data
Despite the challenges of adopting a modern data stack, it has brought us to a better place than ever. Better visibility has improved trust in the data and enabled stronger analytics that add value back to the business. The focus has shifted from maintaining pipelines to improving and augmenting the data.