Introduction to Data Reliability Engineering

Miriah Peterson
Weave Lab
Dec 1, 2021

As software engineers, we are well aware of Google’s SRE standards¹ and practices; they have infiltrated nearly every aspect of the software engineering universe. For Data Engineers, however, best practices are still in constant change as the data landscape continues to evolve.

“I see Data Reliability Engineering as a natural extension of the data team. … Data Reliability Engineering means treating data quality like an engineering problem. It’s applying applications and tools to see that data stays for the variety of application use across the business.” — Egor Gryaznov, “Data Engineering Podcast,” episode 224²

At Weave we have a long history of “treating data quality” as an engineering problem. Our Integrations Engineering teams, which are responsible for ingesting 3rd party data, now include a new team focused on data reliability and scaling problems: Integrations Data Reliability Engineering. We, as Data Reliability Engineers, want to understand how SRE standards for software and SysAdmin-style best practices apply to data systems, data pipelines, and other areas of traditional data-based infrastructure.

Integrations’ Data Offering

The joint Integrations teams at Weave are not your typical Integrations teams. They are not just building SDKs or plugins, or connecting to services and making API calls. They are not ETL teams. The Integrations teams create software that ingests, manages, transforms, updates, validates, and syncs all of Weave’s core data³.

“What is ‘data integrity’? When users come first, data integrity is whatever users think it is.” — Google SRE book⁴

We recently audited our Integrations data stack and interviewed several stakeholders about the importance of integration data to their products. This data is core to many aspects of Weave’s products. By applying Data Reliability Engineering (DRE) practices, we can communicate the importance of data to our stakeholders as well as make metrics-driven promises for data availability, data completeness, and data downtime⁵.

The audit of our data stack also included a product investigation carried out by the Integrations Product and Integrations DRE teams. Product representatives conducted several interviews to understand how core data is vital to new product launches, growth initiatives, and customer experience. DRE representatives looked into the current availability of core data and performed a risk assessment for data outages. We learned that without access to accurate and reliable data, new product development slows down. The following is a summary of the current pain points stakeholders raised:

  • Better quality 3rd party data
  • More historical records of 3rd party data and the data state
  • More available 3rd party data
  • A more exhaustive 3rd party data model
  • Better tools for 3rd party data discovery

As an Integrations organization, our number one priority is to enable a good customer experience by providing teams with high-quality data. With an ever-growing number of data partners and a rapidly increasing customer base, it is vital that we establish a set of data standards that contain agreements and promises for the data we deliver. This will allow us to address and alleviate our stakeholders’ pain points while giving our teams the ability to move quickly and efficiently to develop reliable and resilient systems.

Data Level Metrics

At Weave we follow the standard SRE⁶ practice of Service Level Metrics. Our SRE team has guided product teams in defining SLOs, SLAs, and SLIs for all our services. These agreements are defined by real-world occurrences and are measured with real-world data. The goal is to translate these service level practices and apply them to data-specific architectures, data services, and data platforms. Our Integrations DRE team is working on defining Data-SLOs, Data-SLAs, Data-SLIs, and more for the 3rd party data stack. With data service level agreements, objectives, and indicators, we give our stakeholders clear and concise promises about the quality and timeliness of the 3rd party data, which in turn lets them support the products that the data enables.

Let’s explore these data level metrics with a practical example. One thing data level metrics can do is alert us when a system is not working properly, when data is corrupted, or when it is compromised. For protected data like API or client credentials, this is key observability into data systems. With one of our data partners — we’ll call them NTCH — we have Auth and Refresh tokens that have to be provided to all data extraction operations. The Refresh token lasts for 4 hours and is renewed via an Auth process. We need this Refresh token to be live so that we can deliver up-to-date client data to our product teams.
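
To make the token lifecycle concrete, here is a minimal Go sketch of a background refresher that renews the Refresh token well before its 4-hour expiry. The ntchauth package name, the Token type, the refreshAuth stand-in, and the 30-minute safety margin are illustrative assumptions, not our actual client code.

```go
package ntchauth

import (
	"context"
	"log"
	"time"
)

// Token holds the credentials returned by the partner's auth endpoint.
// The field names and the 4-hour lifetime mirror the description above;
// the real NTCH client is internal and not shown in this post.
type Token struct {
	Refresh   string
	ExpiresAt time.Time
}

// refreshAuth is a hypothetical stand-in for the partner's Auth process.
func refreshAuth(ctx context.Context) (Token, error) {
	// ... call the partner's auth endpoint and parse the response ...
	return Token{Refresh: "redacted", ExpiresAt: time.Now().Add(4 * time.Hour)}, nil
}

// KeepTokenFresh renews the Refresh token well before expiry so that data
// extraction operations never run with a stale credential, publishing each
// new token to the extraction pipeline on the tokens channel.
func KeepTokenFresh(ctx context.Context, tokens chan<- Token) {
	for {
		tok, err := refreshAuth(ctx)
		if err != nil {
			// Repeated failures here are exactly what the DSLIs below catch.
			log.Printf("NTCH auth refresh failed: %v", err)
			time.Sleep(time.Minute)
			continue
		}
		tokens <- tok

		// Wake up 30 minutes before the token expires, leaving a safety margin.
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Until(tok.ExpiresAt) - 30*time.Minute):
		}
	}
}
```

A refresher like this is also the natural place to emit the refresh-count signals that the indicators below rely on.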

The following metrics are used to communicate the availability and accuracy of the NTCH partner data to our data stakeholders (a sketch of how these indicators might be instrumented follows the lists):

DSLAs:

  • Maintain data integrity by delivering updated client data once every 15 minutes
  • Data will always be less than one hour old

DSLOs:

  • Our data is no more than one hour old 99% of the time
  • The Auth connection to NTCH is refreshed and maintained so that calls for data extraction succeed 99% of the time

DSLIs:

Refresh token is updated more than X times in a 5 minute period

  • This can be traced to manual updates or service updates
  • This means the Refresh token is expired, invalid, or unable to update and that we are at risk of violating our agreement

Data change record is received less than X times per 15 minute period

  • Shows that zero data values were updated in the last hour and that we are no longer meeting the DSLA that data be less than one hour old

Calls to Data Store happen more than X times in a 5 minute period

  • Indicates Refresh issues, meaning the data is not as accurate or as up to date as we expect and that we are at risk of violating our data age agreement
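
To back these indicators with real measurements, the sketch below (in Go, using the Prometheus client library) exposes one counter per DSLI. The metric names, the package layout, and the example alert expressions in the comments are assumptions for illustration, not our actual instrumentation.

```go
package ntchmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One counter per DSLI above. Metric and package names are illustrative;
// the real metric names and label scheme are not described in this post.
var (
	// Incremented every time the NTCH Refresh token is renewed,
	// whether by the service or by a manual update.
	refreshTokenUpdates = promauto.NewCounter(prometheus.CounterOpts{
		Name: "ntch_refresh_token_updates_total",
		Help: "Number of times the NTCH Refresh token was renewed.",
	})

	// Incremented for every data change record received from NTCH.
	dataChangeRecords = promauto.NewCounter(prometheus.CounterOpts{
		Name: "ntch_data_change_records_total",
		Help: "Number of data change records received from NTCH.",
	})

	// Incremented on every call made to the data store for NTCH data.
	dataStoreCalls = promauto.NewCounter(prometheus.CounterOpts{
		Name: "ntch_data_store_calls_total",
		Help: "Number of calls made to the data store for NTCH data.",
	})
)

// Example alert expressions (Prometheus query syntax) over these counters,
// with X left as a per-partner threshold:
//
//   increase(ntch_refresh_token_updates_total[5m])  > X   // token churn
//   increase(ntch_data_change_records_total[15m])   < X   // data going stale
//   increase(ntch_data_store_calls_total[5m])       > X   // unexpected load
//
// Each firing alert means a DSLA is at risk and warrants DRE intervention.
```

Because each counter maps one-to-one to a DSLI, the alert thresholds (the X values) can be tuned per partner without touching the ingestion code itself.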

Using data level metrics, as shown in this example, allows us first to communicate the limitations of our data partners to our stakeholders while maintaining a high standard of data availability. Second, it allows us to define anomalies in our data that we can detect and report on, sending alerts early when we are at risk of breaking an agreement so that DRE can intervene. And lastly, it allows us to develop and maintain a robust data operations infrastructure that provides reliable data to the many integration data stakeholders.

Conclusion

As we enter the next phase of our data infrastructure and move toward more robust, automated, and distributed data systems, a strong DRE practice lets us maintain a higher level of resilience across those systems. Adherence to DRE practices at the data level gives us strong commitments around the headcount required to operate and maintain existing DSLAs while improving and expanding our 3rd party data offering.

Footnotes

  1. https://sre.google/sre-book
  2. https://www.dataengineeringpodcast.com/data-reliability-engineering-episode-224/
  3. Core data is the data necessary for Weave’s basic product offering. This data is 3rd party data provided by our data partners and is used to connect Weave’s clients to their customers. The absence of this data, or a failure of these data systems, will result in a system-wide outage.
  4. https://sre.google/sre-book/data-integrity/
  5. https://towardsdatascience.com/the-rise-of-data-downtime-841650cedfd5
  6. https://sre.google/sre-book/service-level-objectives/
