How to avoid breaking APIs

March 2023

When writing graphical user interfaces, the end user is a person. If the user interface changes overnight, the user might be surprised but they can adapt and react accordingly. Most software companies are used to changing their products often in various ways. API products don’t have this luxury.

APIs are used by programs which can’t adapt to change. So, whenever you change an API, the change should be backwards compatible so that the new API version can be used by existing integrations. For a list of breaking changes one can make to JSON APIs, see Breaking changes in JSON APIs.
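
As a rough illustration of what "backwards compatible" means for an existing integration, here is a minimal sketch. The field names and responses are made up for this example and do not come from any real API. The old integration keeps working when a field is added, but fails when a field it reads is renamed:

```python
# Hypothetical integration code written against version 1 of an API.
def read_amount(payment: dict) -> int:
    # The integration only knows about the "amount" field.
    return payment["amount"]


# Response shape the integration was originally built against.
v1_response = {"id": "ch_123456", "amount": 1000}

# Additive change: a new field appears and the old code simply ignores it.
v2_additive = {"id": "ch_123456", "amount": 1000, "currency": "usd"}

# Breaking change: "amount" was renamed (an assumed example of a rename).
v2_renamed = {"id": "ch_123456", "amount_cents": 1000}


def is_compatible(response: dict) -> bool:
    """Return True if the old integration can still read the response."""
    try:
        read_amount(response)
        return True
    except KeyError:
        return False
```

Here `is_compatible(v2_additive)` holds while `is_compatible(v2_renamed)` does not, even though both changes might look equally harmless from the API provider's side.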

In other words, certain decisions are much harder to reverse than others. This changes how API products are developed: when it comes to APIs, it is much harder to ship fast and iterate. And in many cases you can’t really iterate.

Hyrum’s law

From Hyrum’s law, which has (and deserves!) its own website:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

In other words, your API “leaks” promises of how it will behave. For these “hidden” promises, you should define an explicit contract and track those changes as backwards-incompatible. Otherwise, you risk breaking your users' integrations without realizing. There are probably domain-specific "promises" that apply to your API that I can’t cover in this essay. Think about them as you design your API. Here are some hidden promises that a payment API might make:

  • Settlement speed: Money will be available at certain times in certain accounts. In test mode, the payment might clear into the developer’s balance immediately. But in live mode, it would take 5 minutes for the balance to be there because it was handled by some queue. The integration may assume that clearing is immediate because it was immediate during testing. Stripe’s ACH v1 had this problem.
  • FX quote expiry: If someone wants to convert USD to EUR, your API gives them a quote that defines a rate between the two. That quote is available until an expiry time, for example "next 30 minutes". If you shorten that time, integrations that rely on having 30 minutes to react to the quote will break.
  • Dynamic properties: If a property like description is dynamically generated from other data, like {{PRODUCT}} ${{AMOUNT}}, developers will parse out PRODUCT and AMOUNT from it. Later, when you change how description is generated and add QUANTITY to the string, you might break their parsing logic. This also applies to error messages.
  • Async properties: Returning a property asynchronously when it was previously returned synchronously is a breaking change. An example from invoices:
    • In certain jurisdictions there are strict rules on how invoice numbers should be generated. Imagine that in the first version of the API you return the invoice number synchronously when the invoice draft is created. But that causes problems and puts you out of compliance. To be compliant, you start generating the invoice number asynchronously. Existing integrations depend on having that invoice number on creation and thus are broken by the change.
  • Field length: If you return short ids with a fixed length (e.g. ch_123456), users will assume that they will continue to be short and of a fixed length and store them in their database with varchar(16). If you ever start returning longer ids (i.e. longer than 16 characters), they won’t fit in your user’s database and will break their integration.
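
To make the "dynamic properties" promise concrete, here is a minimal sketch of the kind of parser an integration might write. The regular expression, the sample descriptions, and the two "new" formats are assumptions made up for this example. Depending on where QUANTITY ends up in the string, the old parser either fails loudly or silently returns the wrong product:

```python
import re

# Hypothetical parser written against a description generated as
# "{{PRODUCT}} ${{AMOUNT}}", for example "T-shirt $20".
DESCRIPTION_PATTERN = re.compile(r"^(?P<product>.+) \$(?P<amount>\d+)$")


def parse_description(description: str) -> tuple[str, int]:
    """Extract (product, amount) from a generated description string."""
    match = DESCRIPTION_PATTERN.match(description)
    if match is None:
        raise ValueError(f"unexpected description format: {description!r}")
    return match.group("product"), int(match.group("amount"))


old_format = "T-shirt $20"

# Two assumed ways the API might later add QUANTITY to the string.
quantity_suffix = "T-shirt $20 (x2)"  # no longer matches: raises ValueError
quantity_prefix = "2x T-shirt $20"    # still matches, but product is now wrong
```

The second case is arguably worse: `parse_description(quantity_prefix)` succeeds and reports a product called "2x T-shirt", so nothing alerts the developer that their parsing logic is out of date.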

I don’t have a good recipe to figure out what these are, but as Hyrum’s law suggests, they will surprise you and they will happen.

On top of that, all APIs make non-functional promises, many of them surprising:

  • Latency: if they always return within 200ms, developers will assume that they will continue to return within 200ms. Making the API respond within 1000ms may break an integration which times out at 500ms.
  • Price: if today it costs $0.01 per call, a developer might design their integration one way, and if you change the price, it may “break” that integration.
  • Throughput: a threaded integration might break if you reduce the rate limits that throttle the allowed requests per second.
  • Storage: if you store an object for the developer, you just promised that you would store it forever unless there is a specific eviction policy [1].
  • Reliability: This is often an explicit part of the SLA.
  • Consistency: Do integrations expect to read an object after they write it? Most APIs work this way but if you ever relax that and allow caches (which may be stale) to return certain calls, you will break integrations.
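
The consistency promise can be sketched with a toy in-memory backend. Everything here (the class, its methods, and the cache refresh model) is hypothetical and only meant to show how an integration that assumes read-after-write breaks once reads may come from a stale cache:

```python
from typing import Optional


class PaymentStore:
    """Toy in-memory API backend with an optional read cache (hypothetical)."""

    def __init__(self, serve_reads_from_cache: bool) -> None:
        self.serve_reads_from_cache = serve_reads_from_cache
        self._primary: dict[str, str] = {}
        self._cache: dict[str, str] = {}

    def write(self, payment_id: str, status: str) -> None:
        # Writes always go to the primary store; the cache catches up later.
        self._primary[payment_id] = status

    def refresh_cache(self) -> None:
        # Stands in for a periodic, asynchronous cache refresh.
        self._cache = dict(self._primary)

    def read(self, payment_id: str) -> Optional[str]:
        source = self._cache if self.serve_reads_from_cache else self._primary
        return source.get(payment_id)


def create_then_confirm(store: PaymentStore) -> bool:
    """Integration code that assumes it can read what it just wrote."""
    store.write("ch_123456", "succeeded")
    return store.read("ch_123456") == "succeeded"
```

With `PaymentStore(serve_reads_from_cache=False)` the integration's check passes. With `serve_reads_from_cache=True` the same code fails until `refresh_cache()` has run, although no endpoint, field, or type has changed.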

If everything you do might bring you problems, do less

As Hyrum’s law suggests, the more you do, the more mistakes you can make. And because APIs are forever [2], mistakes are forever too. The best way to make fewer mistakes is to do less overall.

So, if you find yourself wondering “will they need it?” when considering a feature, don’t add it. Wait until enough people complain about it before you add it. This way you will be sure they needed it. You don’t want to add surface area that you may regret unless it is really worth it.

For example, let’s say that your API could have a number of side-effects (store something in a database, emit an event to some other service, send an email, etc). Once you add those side-effects, you can’t really remove them. But you can always add them later under some parameter like send_email. When in doubt, do less.
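
A minimal sketch of that idea, assuming a hypothetical create_invoice handler (the function, its arguments, and the recorded side-effect names are all invented for illustration). The core behavior always happens, while the email is opt-in and off by default, so callers that predate the parameter see no change:

```python
from dataclasses import dataclass, field


@dataclass
class InvoiceResult:
    invoice_id: str
    # Records which side-effects ran, so the behavior is easy to inspect.
    side_effects: list[str] = field(default_factory=list)


def create_invoice(customer_email: str, amount: int,
                   send_email: bool = False) -> InvoiceResult:
    """Hypothetical endpoint handler with an opt-in email side-effect."""
    result = InvoiceResult(invoice_id="inv_001")
    # Core behavior that every caller gets.
    result.side_effects.append("stored_in_database")
    if send_email:
        # Added later behind a parameter; existing callers are unaffected.
        result.side_effects.append(f"emailed:{customer_email}")
    return result
```

Going the other way, from "always sends an email" to "sends an email only on request", would silently change behavior for every existing caller, which is why the default matters.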

Restrict your instinct to be helpful: don’t volunteer information in the API responses that the developer doesn’t absolutely need. For example, Stripe originally added count on list endpoints because Mongo returned it from the list queries. It turns out that Mongo struggles to calculate that count as collections grow, and it creates all sorts of performance problems. It is not super clear that Stripe users need count. For that and other reasons, list endpoints are some of the worst endpoints for Stripe to maintain.

What should early Stripe have done? You might think “Ahh, they shouldn’t have added count if nobody asked for it”. No, Stripe shouldn’t have offered list endpoints until absolutely necessary [3]. Behind every user request for a list endpoint lurks a specific problem to be solved that could be better solved in some other way:

  • Reconciliation: “I want to reconcile my payments to the cash in my bank account”
    • If there are lots of payments, a list endpoint will be a moving target: as you list payments to add them up, new payments and changes come in, making it impossible for you to get a true snapshot of what happened in the period.
    • This is better served by immutable reports that can be reconciled against bank statements.
  • Customer portal: “I want to show customers their payments in my app”
    • List endpoints have high latency and customers have to wait a long time for their page to render.
    • The integration should store its notion of orders in their database, and when displaying each individual order, fetch the associated payment from Stripe. Alternatively, Stripe could make a Customer Portal.
  • Metrics: “I want to calculate how much revenue I made each day / month / period”
    • It takes forever to send all the payments over the network each time you want to calculate the metrics.
    • It is better for the integration to store the changes as they come in their own data warehouse and then calculate metrics based on that warehouse. Alternatively, they could use the Stripe dashboard or Sigma.
  • Search: “I want to search through the latest payments for one with certain metadata”
    • The list endpoint will return fresh data (read-after-write) but it will be a very inefficient way to search for one payment.
    • This is better served by a search API (e.g. /search?metadata[key]=value) that has all the needed indexes and can return only the objects that the user needs. (The flip-side is that a search API powered by Elasticsearch is likely going to return data that is a few minutes stale, which may or may not be fine for the user.)

But once integrations depend on list endpoints, Stripe can never retire them.
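
The "moving target" problem from the reconciliation case can be shown with a toy model. The list_page function, the payment ids, and the amounts are all made up for this sketch. The integration adds up payments page by page while the underlying data changes between pages:

```python
def list_page(payments: dict[str, int], page: int,
              page_size: int) -> list[tuple[str, int]]:
    """Hypothetical list endpoint: one page of (id, amount), ordered by id."""
    items = sorted(payments.items())
    start = page * page_size
    return items[start:start + page_size]


payments = {"pay_1": 10, "pay_2": 10, "pay_3": 10, "pay_4": 10}
total_at_start = sum(payments.values())

# The integration reads the first page...
seen = list_page(payments, page=0, page_size=2)

# ...and before it reads the second page, the live data changes.
payments["pay_1"] = 0   # an already-listed payment is refunded
payments["pay_4"] = 25  # a not-yet-listed payment is updated

seen += list_page(payments, page=1, page_size=2)

total_seen = sum(amount for _, amount in seen)
total_at_end = sum(payments.values())
```

In this run the integration computes a total of 55, while the true total was 40 when it started and 45 when it finished. The number it reconciles against the bank statement never existed at any point in time, which is the argument for immutable reports.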

Product development for APIs

So how do you evolve your product if you can’t change it often or do A/B tests [4]? You need to find other ways to test changes. Here is one recipe:

  1. Talk to developers before designing the APIs
  2. Test the APIs internally and maintain real integrations
  3. Release in alpha to external developers, subject to change
  4. Watch them integrate
  5. Revise the API several times while in alpha
  6. Consider the far future when designing

(1) Talk to developers before designing the APIs

Unlike consumers, developers have to think quite a bit before using an API. Before sitting down and writing a program, they have to consider how the API integrates with what they are building. After they do that, they will form strong opinions regarding what they need for their use-case and they’ll discover new requirements as they code the integration.

This means that it is much easier to interview a developer about their needs than a consumer.

For example, businesses want to be paid at different times:

  1. One business wants to be paid right away so that they in turn can pay their suppliers.
  2. The other business wants to get the money 5 days later once they fulfill goods, so that the cash accounting is aligned with the revenue accounting.

When interviewing businesses, it is important to understand why they need what they need. For the settlement timing example above, you may be tempted to add a payment_delay parameter:

  1. One user can pass payment_delay: 0 (days) and get their money right away.
  2. The other one can pass payment_delay: 5 (days) and get their money 5 days later.

But that would’ve been a mistake. The second business fulfills 5 days later on average. They don’t know when they will fulfill the goods in advance. The right parameter would’ve been:

  1. settlement_on: capture to get the money right away.
  2. settlement_on: fulfillment to get the money later, once they hit a new fulfillment API.

By understanding why developers want what they want you can (to some extent) predict other things they will want and make their lives easier in the future.
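
The difference between the two designs can be sketched as follows. The Payment class, the fulfilled_on field, and both functions are assumptions for illustration; only the payment_delay and settlement_on parameter names come from the example above:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Payment:
    captured_on: date
    settlement_on: str  # "capture" or "fulfillment"
    # Set when the (hypothetical) fulfillment API is called.
    fulfilled_on: Optional[date] = None


def settlement_date(payment: Payment) -> Optional[date]:
    """Return when the money should settle, or None if not yet known."""
    if payment.settlement_on == "capture":
        return payment.captured_on
    if payment.settlement_on == "fulfillment":
        return payment.fulfilled_on  # None until the business fulfills
    raise ValueError(f"unknown settlement_on: {payment.settlement_on!r}")


def settlement_date_with_delay(captured_on: date, payment_delay: int) -> date:
    """The rejected design: a fixed delay in days, chosen up front."""
    return captured_on + timedelta(days=payment_delay)


# An order captured on March 1 that happens to take 7 days to fulfill.
order = Payment(captured_on=date(2023, 3, 1), settlement_on="fulfillment")
fixed_delay_date = settlement_date_with_delay(order.captured_on, 5)
order.fulfilled_on = date(2023, 3, 8)
```

With the fixed delay, the money settles on March 6, two days before the goods ship. With settlement_on: fulfillment, settlement follows the actual fulfillment date, whatever it turns out to be.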

(2) Test the APIs internally and maintain real integrations

Once you understand what developers want, you can pretend to be them and then use your API. Ask your coworkers to test the API. (You should also test it, but their testing will be a much more reliable way to find problems.)

Ask everybody to write friction logs as they test the API. Friction logs are what they sound like: a log that details every moment when the developer encountered some friction. Include screenshots and code snippets of what was confusing alongside what you were thinking when it happened.

There is something very fragile about evaluating APIs (and products in general): once you understand how they work, you are no longer a reliable proxy for a new user. There are all sorts of friction and problems that can only be found by fresh eyes. For example, the word account is often ambiguous to new users but well-defined to those who learned it. Therefore, you should weigh the feedback from those fresh eyes heavily: they are seeing things that every new developer will see but you are now blind to.

Ideally, you can maintain a real integration and see its problems over time. In Stripe’s case, this meant having a small side-business to sell things online. A few Stripe employees did this and had great ideas for improving the product [5].

(3) Release an alpha subject to change

Interviews and internal testing will only get you so far. Developers will discover requirements as they integrate and, importantly, long after they integrate. Their support team might discover that your API behaves unexpectedly when support tickets come in. This alpha period lets you change things more fluidly with the expectation that if you do change the API, the alpha integrations will be updated.

Even if they agree to upgrading eventually, alpha users will drag their feet when you ask them to upgrade. Keep this in mind as you plan deprecation timelines for the alpha.

(4) Watch everybody integrate

Many of the problems from an API are subtle confusions. Some examples:

  1. The developer interprets balance as "pending balance" because that is what they care about, when you mean "captured balance".
  2. The developer expects the next step in their integration to be about fulfilling goods but your guide’s next step is about reconciling payments with orders.
  3. The developer is providing services and your guide talks exclusively about selling goods which confuses them: is this guide for me?

If the API ends up working for them after they resolve the confusion, they might not tell you what they struggled with. So, here are some ways to observe that process:

  1. Literally watch them. Go to their office and sit next to them while they program. Start a zoom meeting and be in the background in case they have questions. Make a Slack channel with them and ask them how they are doing every so often.
  2. Interview them right after they’ve integrated when they are likely to still remember the problems they encountered.
  3. Ask them to write a friction log (described above) as they integrate that you can read later. This is hard work so expect only your friends to do it.

(5) Revise the API with the feedback

As you revise the alpha API with users' feedback, try to make those fixes backwards compatible. This will save the alpha users some work. But if you need to, don’t be shy about making breaking changes to existing alpha integrations. This is the only time you can fix your mistakes, so use it!

(6) Consider the far future when designing

After real usage, you’ll learn a lot more about what developers want from your API. What else might they want? What other parts of their business did they talk about? How might your API relate to those? What are some features that they mentioned but didn’t want yet?

You don’t have time to fully design all of those but keep them in mind. As the example above suggests, learning that businesses care about settlement timing and fulfillment should make you ask questions like:

  • Do they fulfill different parts of the order at different times? If so, do they want the different parts of the payment to be captured at different times? How could you handle that if needed?
    • This might make you change the signature of the /fulfill endpoint or ask them to pass more data about what they are fulfilling.
  • Do they ever cancel fulfillments? What should happen to the resulting cash?
    • This might make you consider fulfillments as a separate object with a state that has to be tracked as opposed to a single property on a payment.
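
One way to picture fulfillments as separate objects with tracked state is the sketch below. The class names, the three states, and the rule that only pending fulfillments can change state are all assumptions for illustration, not a description of any real API:

```python
from dataclasses import dataclass, field
from enum import Enum


class FulfillmentState(Enum):
    PENDING = "pending"
    FULFILLED = "fulfilled"
    CANCELED = "canceled"


@dataclass
class Fulfillment:
    amount: int  # the part of the payment tied to this fulfillment
    state: FulfillmentState = FulfillmentState.PENDING

    def _require_pending(self) -> None:
        # Assumed rule: only pending fulfillments may change state.
        if self.state is not FulfillmentState.PENDING:
            raise ValueError(f"cannot change a {self.state.value} fulfillment")

    def fulfill(self) -> None:
        self._require_pending()
        self.state = FulfillmentState.FULFILLED

    def cancel(self) -> None:
        self._require_pending()
        self.state = FulfillmentState.CANCELED


@dataclass
class Payment:
    fulfillments: list[Fulfillment] = field(default_factory=list)

    def amount_to_settle(self) -> int:
        """Only the fulfilled parts of the order are ready to settle."""
        return sum(f.amount for f in self.fulfillments
                   if f.state is FulfillmentState.FULFILLED)
```

A single fulfilled property on the payment could not represent an order where one part shipped, one part was canceled, and one part is still pending. Separate objects leave room for those cases without another breaking change.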

Final warning

This essay has a loose recipe for improving an API design while minimizing future breaking changes. Even with a recipe, backwards compatibility makes iterating on APIs much harder than iterating on UIs. As such, you should extend your timelines when planning the design and rollout of an API.

And you should start the process knowing that you’ll make some irreversible mistakes. Maintaining APIs is not for perfectionists.

Thanks to Remi Jannel and Michelle Bu for reading drafts and providing feedback. Also, thanks to Adam D’Angelo for providing the prompt for the essay.

Footnotes

  1. This is particularly dangerous if you charge per “transaction” (i.e. API call) but your costs are proportional to the storage of all transactions, every month, forever. As your backlog of stored transactions grows, at some point your daily flux of transactions won’t be enough revenue to cover the costs.
  2. This is not strictly true. You can deprecate APIs but users always hate it, it is a lot of work, and hard to do well. For most minor and medium mistakes, it will not be worth the effort.
  3. I was not around when list endpoints were introduced and what I know I learned from others. From what I understand, it seems that the best compromise would’ve been to introduce the list endpoint with relaxed consistency guarantees so that they could be cheaply served from caches. Most of the list endpoint use-cases don’t require the read-after-write consistency that list endpoints offer today.
  4. There are many other reasons you can’t do A/B tests: what if one developer gets the documentation in the A variant and the Stack Overflow answer is from the B variant? Imagine how confusing that would be.
  5. Many Stripe employees had been users of the product before joining and they also had a good understanding of the API's problems.