Data Reliability Engineering (DRE) is the work done to keep data pipelines delivering fresh, high-quality data to the users and applications that depend on them. The goal of DRE is to allow iteration on data infrastructure, the logical data model, and so on as quickly as possible while still guaranteeing (and this is the key part) that the data remains usable for the applications that depend on it.
End users (data scientists examining A/B test results, executives looking at dashboards, customers seeing product recommendations, and so on) don’t care about data quality in the abstract. They care about whether the data they’re seeing is useful for the task at hand. DRE focuses on quantifying and meeting those needs, without slowing down the organization’s ability to grow and evolve its data architecture.
It borrows the core concepts of Site Reliability Engineering, which is used at companies like Google, Meta, Netflix, and Stripe to iterate quickly while keeping their products reliable 24/7. These concepts bring a methodical and quantified approach to defining quality, gracefully handling problems, and aligning teams to balance speed and reliability.
Why is data reliability engineering important now?
Nobody needs to be told how critical data is becoming to nearly every industry. As more roles, not just data science and engineering professionals, interact with data, whether through self-service analytics or the outputs of machine learning models, there’s more demand for it to “just work” every hour of every day.
But in addition to having more users and more use cases to serve, data teams are simultaneously dealing with larger and more diverse volumes of data. Thanks to Snowflake, Databricks, Airflow, dbt, and other modern data infra tools, it’s never been easier to reach a scale where ad hoc approaches can’t keep up.
While the most obvious big-data companies like Uber, Airbnb, and Netflix felt these pains sooner and led much of the foundational work in this discipline, it’s rapidly catching on more broadly.
What are the fundamental principles of data reliability engineering?
The seven principles from Google’s SRE Handbook provide a great starting point for DRE, which can adapt them to deal with data warehouses and pipelines, instead of software applications.
- Embrace risk: Because something will eventually fail, data teams need a plan to detect, manage, and mitigate failures when (or before) they occur.
- Monitor everything: Problems can’t be mitigated if they can’t be detected. Monitoring and alerting give teams the details they need to address data issues.
- Set data-quality standards: Acceptable levels of data quality need to be quantified and agreed upon before teams can act on them. SLIs, SLOs, and SLAs are the standards-setting tools being adopted for DRE.
- Reduce toil: Toil is the manual, repetitive operational work required to keep a system running. To make DRE efficient, teams should reduce toil wherever they can, which lowers overhead.
- Use automation: Automating manual processes helps data teams scale reliability efforts and increase time for tackling higher-order problems.
- Control releases: Making changes is how things improve and how they break. Pipeline code is still code, and like any code it needs to go through QA before deployment.
- Maintain simplicity: Minimizing and isolating the complexity in any one pipeline job goes a long way toward keeping it reliable.
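To make the monitoring and standards-setting principles a little more concrete, here is a minimal Python sketch of a freshness SLI checked against an SLO. It also shows an error budget, a related concept from SRE practice that describes how much failure the SLO still tolerates. Everything in it is an assumption for illustration: the names FreshnessSLO, is_fresh, freshness_sli, meets_slo, and error_budget_remaining are hypothetical, as are the six-hour staleness limit, the 90% target, and the made-up load times. It is not the API of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FreshnessSLO:
    """Hypothetical objective: data may be at most `max_staleness` old,
    and at least `target` (a fraction such as 0.9) of checks must pass."""
    max_staleness: timedelta
    target: float


def is_fresh(last_loaded_at: datetime, checked_at: datetime, slo: FreshnessSLO) -> bool:
    """One monitoring check: was the table loaded recently enough?"""
    return checked_at - last_loaded_at <= slo.max_staleness


def freshness_sli(results: list[bool]) -> float:
    """The SLI: the fraction of checks that passed."""
    if not results:
        raise ValueError("need at least one check result")
    return sum(results) / len(results)


def meets_slo(results: list[bool], slo: FreshnessSLO) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return freshness_sli(results) >= slo.target


def error_budget_remaining(results: list[bool], slo: FreshnessSLO) -> float:
    """Fraction of the allowed failures still unused (negative once the
    SLO has been violated)."""
    allowed_failures = (1.0 - slo.target) * len(results)
    failures = results.count(False)
    if allowed_failures == 0:
        return 1.0 if failures == 0 else float("-inf")
    return 1.0 - failures / allowed_failures


if __name__ == "__main__":
    # Assumed values for illustration only.
    slo = FreshnessSLO(max_staleness=timedelta(hours=6), target=0.9)
    now = datetime(2024, 1, 1, 12, 0)
    # Made-up history: 19 checks saw a recent load, one saw a stale load.
    loads = [now - timedelta(hours=1)] * 19 + [now - timedelta(hours=9)]
    results = [is_fresh(loaded, now, slo) for loaded in loads]
    print("SLI:", freshness_sli(results))
    print("Meets SLO:", meets_slo(results, slo))
    print("Error budget remaining:", error_budget_remaining(results, slo))
```

In a sketch like this, a team that still has error budget left could keep shipping pipeline changes, while a team that has used it up would pause to focus on reliability. That is one way to balance speed and reliability in practice.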
Will data reliability engineering become an industry standard?
While one could argue that data reliability engineering is still an emerging concept, modern companies (Uber, DoorDash, Instacart, etc.) that use data to operate and grow their businesses are leading the charge to establish DRE as a standard practice, and the number of job postings for the role is already starting to grow. Given the pace of business and the need for data to be trusted, expect to see DRE someday be as prevalent as SRE is now.
About the Author
Kyle Kirwan is CEO and co-founder of Bigeye, a data observability platform that helps data teams keep their pipelines fresh and high quality. Data teams at companies like Instacart, Zoom, and Udacity use Bigeye to automate their data monitoring, detect issues proactively, and keep data reliable for the data scientists, executives, and customers who depend on it.