Using Data to Solve Software Pain Points

Miriah Peterson
6 min read · Mar 30, 2022

“Move Fast and Break Things” (Mark Zuckerberg, CEO of Facebook) is the mantra of many fast-paced startups. Yet every startup hits a point where infrastructure and data integrity matter as much as, or more than, speed of delivery. In this article, we will discuss the how, where, and when of establishing data integrity in your projects and organizations.


Where to begin

How do you know when it’s time to invest in your data infrastructure? This question isn’t necessarily easy to answer. Business leaders often push for new features that can be sold quickly to increase things like Total Addressable Market. Small companies frequently pivot from one product to another to discover what stakeholders and customers need and want. Even with this constant demand for new products and services, customer experience can always be a pain point. As soon as product, sales, and business leaders start to take an interest in the customer experience, engineers need to focus on paying off tech debt and solidifying the infrastructure.

There are many indicators we can use to measure pain points: poor customer experience, unreliable systems, unexpected outages, or downtime. Engineers often have to negotiate with business leaders for the time to invest in infrastructure, so it is imperative that we invest that time wisely to ensure that start-ups grow into mature, stable software ecosystems that provide predictable and reliable experiences.

Understanding the Data Landscape

As an engineering team, we need to understand the data we are working with and the product value it provides to existing services across the company ecosystem. This is a team scoping investigation, and it can be summarized in three steps:

  • Understanding the compliance requirements of data
  • Understanding the data’s state
  • Understanding who uses the data and how it is used

Understanding Compliance Requirements

When dealing with data systems, a critical first step for any engineer is to understand what privacy requirements your data sets carry. Data often contains protected information. I have dealt with data sources containing PII (Personally Identifiable Information) as well as health information protected under HIPAA (the Health Insurance Portability and Accountability Act); both are protected data sets. Each country has its own requirements around data and around protecting people’s privacy in the systems that use it. You need to understand whether your data handling meets or fails to meet those legal requirements. This often requires talking to your company’s compliance manager or security team to understand the full scope.

Understanding State

What is a data’s state? State tells you whether data is actionable, whether it is fresh or stale, and whether it is consistent or sparse. To understand it, we have to regularly audit our relevant data stores and compare data for consistency across them. We can evaluate query speed and optimize databases, indexes, foreign keys, etc. to increase performance. Above all, the key to understanding our data’s state is gathering its metadata. Metadata such as the data’s age, its distribution, how often it changes, and how often it is accessed helps us as data practitioners understand how reliable the data is, whether it is missing, and whether it will create problems for production or analytic use.
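As a concrete illustration, a small audit job can collect this kind of metadata straight from a relational store. The sketch below is a minimal example, assuming a PostgreSQL table named orders with updated_at and customer_email columns; the connection string, table, and column names are hypothetical and only stand in for whatever your own audit would inspect.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Row count and freshness: two basic pieces of metadata about the table's state.
	var rowCount int64
	var lastUpdated time.Time
	err = db.QueryRow(
		`SELECT COUNT(*), COALESCE(MAX(updated_at), 'epoch') FROM orders`,
	).Scan(&rowCount, &lastUpdated)
	if err != nil {
		log.Fatal(err)
	}

	// Null density in a key column hints at sparse or inconsistent data.
	var nullEmails int64
	if err := db.QueryRow(
		`SELECT COUNT(*) FROM orders WHERE customer_email IS NULL`,
	).Scan(&nullEmails); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("rows=%d last_updated=%s null_emails=%d stale=%v\n",
		rowCount, lastUpdated, nullEmails, time.Since(lastUpdated) > 24*time.Hour)
}
```

Running a job like this on a schedule and recording the output over time is one simple way to turn “what state is our data in?” from a guess into a trend you can point at.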

Understanding User Needs

Understanding user needs is easiest and most effective when stakeholders communicate their needs to you. Stakeholder buy-in is critical for getting support from company leaders; it helps you explain the product’s importance and build stable, long-lasting data service relationships. To accomplish these goals, we need to understand which services and users are accessing our data services and what they want from the data. Engineers should ask themselves these questions to truly understand the basis of their users’ needs (a small sketch follows the list):

  • Are you exposing appropriate endpoints?
  • Are you aggregating data to the user’s needs?
  • Are you making unnecessary assumptions about how data is used?
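For instance, if a consumer only ever needs daily totals, an aggregated endpoint can fit their needs better than a raw data dump. The handler below is a minimal sketch under that assumption; the /orders/daily-totals path, the dailyTotal type, and the canned data are hypothetical names introduced only for illustration.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// dailyTotal is the shape the downstream consumer actually asked for:
// one aggregated row per day, not the raw order records.
type dailyTotal struct {
	Day   string  `json:"day"`
	Total float64 `json:"total"`
}

// fetchDailyTotals stands in for a real aggregation query
// (e.g. SUM(amount) GROUP BY day in the database).
func fetchDailyTotals() []dailyTotal {
	return []dailyTotal{
		{Day: "2022-03-28", Total: 1204.50},
		{Day: "2022-03-29", Total: 987.25},
	}
}

func main() {
	http.HandleFunc("/orders/daily-totals", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// Aggregate server-side so every consumer is not forced to
		// re-implement the same rollup from raw rows.
		if err := json.NewEncoder(w).Encode(fetchDailyTotals()); err != nil {
			log.Printf("encode: %v", err)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The design point is simply that the endpoint mirrors a stated user need rather than an assumption about how the data will be used.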

Duct tape architecture

Start-ups are great places to work. They tend to have new and exciting problems to solve, and they are also full of inexperienced developers working as fast as possible to deliver minimum viable products (MVPs). If engineers aren’t careful, before they know it the MVP service that was supposed to last only a few months is seven years old and has become the core service of the architecture.

In my experience, these long-lived MVP services end up as part of a “duct tape architecture”. This kind of architecture is fragile and very expensive to maintain. It can quickly be patched and re-patched to allow the services to continue running even though you have scaled out many services past what is considered reasonable. Every time a service breaks, engineers race to add a new piece of “duct tape” (a coding quick-fix) to shore things up until the next piece of duct tape comes off.

How do we move past fragile “duct tape” architecture to more reliable and resilient services? Despite its popularity, the solution is not always to gut the code and start over. A better solution is to understand the pain points and address them strategically. To understand these pain points, we first have to understand the metadata about our software systems. This metadata, in the form of metrics, provides insight into the fragility, outages, downtime, and poor customer experience that cause the pain points.

Fragility to Reliability

Systems without failures, although robust, become brittle and fragile. When failures occur, it is more likely that the teams responding will be unprepared, and this could dramatically increase the impact of the incident. (Database Reliability Engineering, Laine Campbell & Charity Majors)

In order to move away from the duct tape method and build more reliable systems for our users, we need to gather service metrics and leverage that data to determine the critical areas of our services to optimize. The four golden signals of SRE help us understand service performance pain points and create long-term solutions (a small instrumentation sketch follows the list):

  • Latency: How long do requests take? How many time-outs? Is the latency coming from your network, your dependency calls, or your business logic? Do you have slow database queries you can optimize?
  • Traffic: Are there trends? Are you seeing spikes or anomalies? Do you have backoff and other appropriate retry practices?
  • Errors: Do you report your errors? Are they easy to find, parse, and read? Are they handled appropriately? Are they actionable?
  • Saturation: Are you saturating heap, stack, or memory? Can you scale up workers? Are you returning appropriate payload sizes? Do you need to batch or paginate your data payloads?
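To show what collecting two of these signals can look like in practice, the sketch below wraps an HTTP handler with a latency histogram and an error counter using the Prometheus Go client. This is a minimal example rather than the article’s own setup; the metric names and the /healthz route are hypothetical.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Latency signal: request duration per path.
	reqDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "http_request_duration_seconds",
			Help: "Request latency by path.",
		},
		[]string{"path"},
	)
	// Errors signal: count of 5xx responses per path.
	reqErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_request_errors_total",
			Help: "Server errors by path.",
		},
		[]string{"path"},
	)
)

// statusRecorder captures the status code so errors can be counted.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next(rec, r)
		reqDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
		if rec.status >= 500 {
			reqErrors.WithLabelValues(path).Inc()
		}
	}
}

func main() {
	prometheus.MustRegister(reqDuration, reqErrors)
	http.HandleFunc("/healthz", instrument("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Traffic and saturation can be layered on the same way (request counters, queue depth, memory gauges) once the basic scrape endpoint is in place.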

Take the metrics from these four areas and create Service Level Objectives (SLOs) and alerts. I have addressed how to use SLOs and error budgets to respond to alerts in this post. These kinds of time budgets can be used to address small edge cases and problems, but if you exhaust the entire budget working on one area of the service or one kind of alert, that is a major pain point that needs to be addressed through additional refactoring or tech-debt-budgeted sprint work.
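As a rough illustration of the error-budget arithmetic (my example figures, not the article’s): a 99.9% availability SLO over a 30-day window leaves 0.1% of that window, about 43 minutes, as the budget for downtime before the objective is blown.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const slo = 0.999             // 99.9% availability target (example value)
	window := 30 * 24 * time.Hour // 30-day SLO window
	// The error budget is the slice of the window you are allowed to fail in.
	budget := time.Duration(float64(window) * (1 - slo))
	fmt.Printf("error budget over %v: %v\n", window, budget.Round(time.Minute))
	// Output: error budget over 720h0m0s: 43m0s
}
```

If one flaky service or one recurring alert eats most of that 43 minutes on its own, that is the signal to schedule dedicated refactoring work rather than another quick fix.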

I have used this data-based approach to investigate, prioritize, and attack many projects at start-up companies moving out of the small, scrappy building stage into a new phase of life. I have found it to be successful and rewarding, and a good way to get buy-in and feedback from engineering and product leaders as we progress through the projects. I recommend a data-first approach to all teams looking to improve reliability and minimize downtime in their data services.


Miriah Peterson

Data Reliability Engineer, Golang Instructor, Twitch streamer, Community organizer