When Cloud Giants Fall: Lessons from the Cloudflare Outage

Yesterday, for about an hour, a significant chunk of the internet just... vanished.

Not in a dramatic, Hollywood-style cyberattack kind of way. More like someone tripped over the digital equivalent of a power cord. Cloudflare—the infrastructure company that props up somewhere between 15 and 20 percent of all websites—had what they're diplomatically calling a "routing issue." The rest of us call it Tuesday.

Discord went dark. Shopify stopped shopping. ZeroTier, Crypto.com, and countless others just showed error pages. Even Cloudflare's own status page briefly couldn't tell you that Cloudflare was down, which is the tech equivalent of the fire department catching fire.

The thing about putting all your eggs in one basket

Look, I've been covering infrastructure failures for longer than I care to admit. And I'll say this: Cloudflare is actually pretty damn good at what they do. Their uptime is legendary. Their DDoS protection has saved countless sites from being pummeled into oblivion. When they work, they work brilliantly.

But here's the uncomfortable truth that nobody wants to hear—everybody fails eventually.

It doesn't matter if you're Cloudflare, AWS, Google Cloud, or Microsoft Azure. You're going to have a bad day. Maybe it's a configuration error. Maybe it's a BGP routing mishap (which is what bit Cloudflare this time). Maybe someone accidentally deletes a database. Maybe it's solar flares. Who knows. The point is, it happens.

And when it happens, if you've built your entire operation on top of one provider, you're going down with the ship.

Single points of failure are a hell of a drug

The tech industry has this weird cognitive dissonance about redundancy. We all know single points of failure are bad. We've known this since before many of you were born. It's Infrastructure 101.

Yet somehow, we've collectively decided that the cloud changes this. That because these companies have massive, distributed systems with redundancy built in, we don't need to worry about it anymore. We can just trust them.

That's not redundancy. That's faith.

Real redundancy means you don't have a single vendor who can take you offline. It means spreading your risk. It means—and I know this sounds old-fashioned—actually maintaining control over your own infrastructure destiny.

Your backups shouldn't live in the same house as your primary data

This is where it gets personal for me. I've watched too many businesses get burned because they thought backing up to the same cloud provider as their production environment was good enough.

"But they have different regions!" Sure. And they all connect to the same authentication system, the same control plane, the same billing system. When those fail—and they will—your backups are just as unreachable as your prod data.

The 3-2-1 backup rule exists for a reason: three copies of your data, on two different types of media, with one copy offsite. Notice it doesn't say "three copies of your data, all in AWS, in different availability zones." That's not redundancy. That's just geographic distribution of your single point of failure.
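
If you want to sanity-check your own plan against that rule, here's a minimal sketch. The BackupCopy fields and the example plans are purely illustrative, not tied to any particular backup tool:

```python
# A toy 3-2-1 check: three copies, two media types, one offsite.
from dataclasses import dataclass

@dataclass
class BackupCopy:
    name: str
    media: str      # e.g. "cloud-object-storage", "local-disk", "tape"
    offsite: bool   # lives somewhere other than your primary site/provider

def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite

plan = [
    BackupCopy("production", media="cloud-object-storage", offsite=False),
    BackupCopy("second cloud", media="cloud-object-storage", offsite=True),
    BackupCopy("office NAS", media="local-disk", offsite=True),
]
print(satisfies_3_2_1(plan))  # True

# Three regions in the same provider: three copies, one media type, and
# nothing outside that provider's blast radius. The check fails, which is
# exactly the point.
same_provider = [
    BackupCopy("us-east-1", media="cloud-object-storage", offsite=False),
    BackupCopy("us-west-2", media="cloud-object-storage", offsite=False),
    BackupCopy("eu-west-1", media="cloud-object-storage", offsite=False),
]
print(satisfies_3_2_1(same_provider))  # False
```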

The multi-cloud heresy

Here's what the cloud providers don't want you thinking about: you should probably be using multiple clouds. Not for everything—that way lies madness and bankruptcy. But for your critical data? For your backups? Absolutely.

Store your primary workload wherever makes sense. But put your backups somewhere else. Use a different provider, or better yet, use several. Put one copy on Wasabi, another on Backblaze B2, maybe keep a local NAS for good measure. Make it so that no single company's bad day becomes your catastrophic day.
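
Here's a rough sketch of what that could look like in practice. The endpoints, bucket names, and NAS path below are placeholders (Wasabi and Backblaze B2 both expose S3-compatible APIs, so plain boto3 works against either; check your provider's docs for the right endpoint and region):

```python
# Push the same backup archive to two independent providers plus a local NAS.
# Credentials are assumed to live in environment variables.
import os
import shutil

import boto3  # Wasabi and Backblaze B2 both speak the S3 API

ARCHIVE = "nightly-backup.tar.gz"

destinations = [
    # (label, endpoint URL, bucket, access key env var, secret key env var)
    ("wasabi", "https://s3.wasabisys.com", "my-backups", "WASABI_KEY", "WASABI_SECRET"),
    ("backblaze-b2", "https://s3.us-west-004.backblazeb2.com", "my-backups", "B2_KEY", "B2_SECRET"),
]

for label, endpoint, bucket, key_var, secret_var in destinations:
    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=os.environ[key_var],
        aws_secret_access_key=os.environ[secret_var],
    )
    s3.upload_file(ARCHIVE, bucket, ARCHIVE)
    print(f"uploaded {ARCHIVE} to {label}")

# Third copy: a NAS on the local network, no cloud provider involved at all.
shutil.copy(ARCHIVE, "/mnt/nas/backups/" + ARCHIVE)
print(f"copied {ARCHIVE} to local NAS")
```

None of these three destinations shares a control plane, a billing system, or a bad day with the others, which is the whole idea.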

Is this more complex than clicking "backup to S3" and calling it done? Yes. Does it cost more? Maybe a little. Is it worth it when your business isn't held hostage by someone else's infrastructure problems? You tell me.

Data sovereignty isn't just for paranoid Europeans

The Cloudflare outage is also a good reminder that relying entirely on third-party infrastructure means accepting that you have zero control when things go sideways. You can't fix it. You can't work around it. You can't even get good information about what's happening. You just wait and hope.

That's not a position any business should be comfortable with.

I'm not saying you need to run your own data center. I'm not even saying you should avoid the major cloud providers. They're useful, powerful tools. But they're tools, not solutions. You don't hand over complete control of your data and call it a day.

Keep copies in different places. Use different providers. Maintain the ability to recover your data without needing permission from, or cooperation from, any single company. That's not paranoia. That's just good engineering.

The bottom line

Yesterday's Cloudflare outage will be forgotten by most people within a week. A root cause analysis will be published, lessons will allegedly be learned, and everyone will move on.

But the underlying issue remains: we've built an internet where too much depends on too few companies. And while those companies are generally excellent at their jobs, they're not infallible.

Your backup strategy should assume that everything will fail eventually. Because it will.

Plan accordingly.

—Been doing this too long to believe in perfect uptime anymore