How to Fail Gracefully

We’ve already talked about “Designing for Disaster,” and this is the next part of that conversation. “Failing gracefully” is a common phrase in the programming world. It means that when something fails, we want the program to stay as functional as possible, or at least help us troubleshoot when the failure does take the whole thing down.

Since mistakes are inevitable, what can we do to help navigate the mess and maybe even reduce the surface area of what our mistakes touch?

Separation of Concerns

The wisdom of senior devs tells us that things inside the system should be modular, simple, and separable.

Kind of like LEGO bricks. A LEGO set is a system built from removable, swappable bricks. Each brick has its own function and tries not to do more than one job at a time: you don’t have an axle that’s also a flame brick and also a person.

“Separation of Concerns” means you should be able to edit or remove one part of the code without breaking the whole system. A function that uses data shouldn’t also be in charge of getting that data. The UI of your website shouldn’t break when you update the text on the website.
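Here’s a minimal Python sketch of that idea. The names (load_prices, total, report) are made up for illustration and don’t come from any real library. The function that does the math never fetches anything itself, so the data source can be swapped out without touching the calculation.

```python
from typing import Callable, Iterable


def load_prices() -> list[float]:
    """Hypothetical data source. It could be replaced by a file or API reader."""
    return [19.99, 5.00, 12.50]


def total(prices: Iterable[float]) -> float:
    """Uses the data, but has no idea where it came from."""
    return round(sum(prices), 2)


def report(load: Callable[[], Iterable[float]]) -> str:
    """Wires a loader to the calculation; either side can change independently."""
    return f"Total: {total(load()):.2f}"


print(report(load_prices))
print(report(lambda: [1.0, 2.0]))  # a different source, same calculation
```

Because report only asks for “something that returns prices,” you can test the math with fake data, and a broken loader is a problem in one small place rather than everywhere.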

This is an ideal—not always possible, but you can get really close. The huge benefit to this is that we can make changes or even break small parts, but our whole system doesn’t fail.

It’s like Mitch Hedberg said, “An escalator can never break: it can only become stairs.” That’s a graceful failure!

Meaningful Errors

Anyone who has used the internet knows the error: 404, Not Found. You’ve gone to a page that doesn’t exist.

Some sites go above and beyond and customize this error page. They may even throw in a link or two to help you get back to the home page, or offer a close guess at where you wanted to be. (The screenshot below gives people a chance to search the site.)

What about the other errors you can get? Sometimes you’ll run into “Server Error: 500.” What does that mean? What are you supposed to do with that? This error page is much less likely to be customized and even less likely to be meaningful.

Usually the devs have a way of getting to the meaningful error on the backend, but it’s still a frustrating experience for you to have an error and zero context.

Meaningful errors help everyone out. We can all troubleshoot better when we have a little bit of context. Maybe you tried to upload a file that was way too big—how would you know that if all it says is “Something went wrong”?
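To make the file-upload example concrete, here’s a rough sketch in Python. Everything in it is an assumption for illustration: the 5 MiB limit, the UploadTooLargeError class, and the validate_upload function are hypothetical names, not part of any framework.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # hypothetical 5 MiB limit


class UploadTooLargeError(ValueError):
    """Carries enough context for the user to fix the problem."""

    def __init__(self, filename: str, size: int, limit: int) -> None:
        self.filename = filename
        self.size = size
        self.limit = limit
        mib = 1024 * 1024
        super().__init__(
            f"'{filename}' is {size / mib:.1f} MiB, but the limit is "
            f"{limit / mib:.1f} MiB. Try a smaller or compressed file."
        )


def validate_upload(filename: str, size: int, limit: int = MAX_UPLOAD_BYTES) -> None:
    """Raise a descriptive error instead of a vague 'Something went wrong'."""
    if size > limit:
        raise UploadTooLargeError(filename, size, limit)


try:
    validate_upload("vacation.mov", 12 * 1024 * 1024)
except UploadTooLargeError as err:
    print(err)
```

The message says what happened, what the limit is, and what to try next. The extra attributes (filename, size, limit) are there so a developer can log or display the details however they like.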

Whether you’re a programmer or a program user, meaningful errors are a key component of a better experience. Errors are unavoidable, so we should fail gracefully by providing descriptive errors, or at least errors that help you get back to a better state.

Escapes and Exits

Do you remember that awful false alarm that hit Hawaii in 2018? Everyone on the islands got an emergency alert that a ballistic missile threat was inbound. Turns out, no missile or even threat was imminent. The message was an accident.

Amid the outrage at the time, I found an article that was a breath of fresh air. It had a different take on what happened: it blamed the system design rather than human error.

“Altogether a poorly designed system is to blame. A good UI makes it hard for people to err, and easier to recover from any remaining errors. The solution is not to scold users…[but] to redesign the system to be less error-prone.”

https://www.nngroup.com/articles/error-prevention/

Take care to examine your systems for deceptive messages, cryptic actions, and confusing processes, and make sure it’s possible to recover or escape when someone takes a highly impactful action. For example, offer a “Cancel” button when someone is about to delete something.
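In code, one way to build in an exit is to require an explicit confirmation and to move things to a trash instead of destroying them. This is only a sketch under my own assumptions: NoteStore and its methods are hypothetical, and a real app would store data somewhere more durable than a dictionary.

```python
class NoteStore:
    """Hypothetical store where deletes need confirmation and can be undone."""

    def __init__(self) -> None:
        self._notes: dict[str, str] = {}
        self._trash: dict[str, str] = {}

    def add(self, name: str, text: str) -> None:
        self._notes[name] = text

    def delete(self, name: str, confirmed: bool = False) -> bool:
        """Do nothing unless the caller explicitly confirms (the 'Cancel' path)."""
        if not confirmed or name not in self._notes:
            return False
        # Move to the trash instead of destroying the data.
        self._trash[name] = self._notes.pop(name)
        return True

    def restore(self, name: str) -> bool:
        """Escape hatch: bring a deleted note back."""
        if name not in self._trash:
            return False
        self._notes[name] = self._trash.pop(name)
        return True

    def names(self) -> list[str]:
        return sorted(self._notes)


store = NoteStore()
store.add("recipe", "Flour, water, salt")
store.delete("recipe")                  # not confirmed, so nothing is deleted
store.delete("recipe", confirmed=True)  # moved to the trash
store.restore("recipe")                 # and recovered
print(store.names())
```

The safe behavior is the default here: forgetting to pass confirmed=True leaves the note alone, and even a confirmed delete can be reversed.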

I highly recommend reading the full article!

Fail Early

There’s a sentiment from Silicon Valley: “Move fast and break things.” It’s a little too brazen and risky, to me. I can appreciate the spirit of it though, which is more along the lines of: “You’re not punished for breaking things, just keep trying.”

With that more tempered view, I think we get broader application. Making mistakes, breaking things, and running into problems are all part of the game, no matter what your occupation or hobby.

Chris Do puts it well: FAIL = First Attempt In Learning.

Build that failure into the process—expect it! Create a space where you can tinker without it affecting the rest of the system. Try to figure out what will solve your problem first, before trying to make it perfect.

Testing frequently will help you discover the holes in your logic, the parts that don’t work together. Check for errors as soon as possible and try to identify them before they affect things further down the line.
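For the coders in the room, checking for errors “as soon as possible” often means validating input right where it enters the system. Here’s a small sketch, assuming a made-up order format with an email and a quantity (parse_order is a hypothetical function, not a real API).

```python
def parse_order(raw: dict) -> dict:
    """Validate at the edge so bad data never reaches later steps."""
    problems = []

    email = raw.get("email")
    if not isinstance(email, str) or "@" not in email:
        problems.append("email is missing or invalid")

    quantity = raw.get("quantity")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity < 1:
        problems.append("quantity must be a whole number of 1 or more")

    if problems:
        # Report everything at once, before any work is done.
        raise ValueError("Invalid order: " + "; ".join(problems))

    return {"email": email.strip().lower(), "quantity": quantity}


print(parse_order({"email": " Pat@Example.com ", "quantity": 2}))

try:
    parse_order({"quantity": 0})
except ValueError as err:
    print(err)
```

Anything past parse_order can trust the data it receives, so a bad order fails at the front door instead of halfway through billing or shipping.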

Defaults

Is there something that can be a safe placeholder if something breaks or doesn’t show up? Defaults aren’t always easy to implement, but when you look for opportunities to provide them, there are more than you might expect.

For example, you’re building a slide-based presentation and you’re waiting on numbers or images or other content. Can you move forward with what you have? Can you put conceptual images in as placeholders in case the Design team doesn’t get to your deck in time?

Defaults help you move forward when you don’t have everything perfect. What’s the least I can deal with and still launch successfully?
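The slide example translates to code pretty directly. In this sketch the field names and placeholder values are all invented for illustration; the point is only that a missing piece gets a safe stand-in instead of breaking the whole slide.

```python
# Hypothetical placeholder values used when content has not arrived yet.
DEFAULTS = {
    "title": "Untitled slide",
    "image": "placeholder.png",
    "stat": "Numbers coming soon",
}


def build_slide(content: dict) -> dict:
    """Fill in any missing or empty field with its safe default."""
    return {key: content.get(key) or fallback for key, fallback in DEFAULTS.items()}


print(build_slide({"title": "Q3 Results", "image": None}))
```

One caveat with this shortcut: `or` treats empty values (None, an empty string, even the number 0) as missing. If 0 could be a legitimate value for a field, a stricter check such as “is this None?” would be the safer choice.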

Document

Not everything can be avoided by strategy or other safeguards. Some mistakes or errors will come up again and again. If you have a place to store the processes or solutions for these problems, they become much easier and faster to handle.

I’ve mentioned this in previous emails, but I can’t overstate how helpful this particular practice has been for me. I use Obsidian to collect these errors and the steps to resolve them, and it has drastically reduced the time it takes to fix things that are nuanced and uncommon.

Cyborg

We’re not all coders, but we do deal with processes and systems all the time. Failing gracefully might mean giving context for mistakes you’ve made or giving people options for proceeding when something goes wrong or you get stuck.

The smaller the “scope” of a project or the more isolated a part of a system, the easier it is to control the impact of its failures. That doesn’t mean you can’t work on big things—but remember that all big things are composed of smaller things.

If you can carve out these smaller pieces and set failure procedures for them, it helps clamp down on the potential chaos when they’re integrated with the larger whole.