
Centralization Is a Systems Design Problem

Salad Technologies

The opinions and commentary expressed herein are those of the author, and do not necessarily reflect the views of everyone at Salad. We accept a diversity of viewpoints, flavors, and spices in our Kitchen.


Systems design questions are the scourge of first-year web developers everywhere. These prompts have become a standard part of engineering interviews across the stack. Whether you’re a coffee-soused trimester deep or walking on the sunny side of a computer science degree, you’re bound to encounter one. The premises are simple—“build Twitter,” or “diagram a scalable ecommerce backend”—but the challenges are profound.

Engineering managers use systems design questions to screen potential hires for a certain kind of design thinking. Have they anticipated failure in critical subsystems? What have they implemented to deal with a tenfold spike in traffic, and at what stage in the pipeline? Candidates should be able to stitch up a system, implement load balancers at appropriate junctions, address a few high-level concepts, and name-drop the various SaaS suites du jour (without forgetting to plug in the thing).
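For the uninitiated, the “implement load balancers at appropriate junctions” part boils down to something like the sketch below: a toy round-robin balancer that skips unhealthy backends. The hostnames are invented, and real deployments reach for nginx, HAProxy, or a managed cloud balancer rather than twenty lines of Python; this is merely the shape of the answer interviewers are fishing for.

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin load balancer: rotate through backends, skip dead ones."""

    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        # Called when a health check fails; the backend stops receiving traffic.
        self.healthy.discard(backend)

    def pick(self):
        # Walk the rotation until we hit a healthy backend.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends left (hello, single point of failure)")

lb = RoundRobinBalancer(["app-1.internal", "app-2.internal", "app-3.internal"])
lb.mark_down("app-2.internal")
print([lb.pick() for _ in range(4)])  # ['app-1.internal', 'app-3.internal', 'app-1.internal', 'app-3.internal']
```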

Proponents see inherent value in the exercise. It’s a good gut-check for bootcamp grads still foggy on finding the command line, and it’s a useful heuristic for gauging calm under fire. Things can and do go wrong in web development; the systems design question is a shot across the bow.

Good systems design requires disaster forecasting. You’ve always got to scheme out redundancies to weather crises, no matter how far-fetched the scenarios. Can you shrug off a meteor strike on your favorite data center? Is your moderation filter honed for the next channer brigade? Leaving one weak link in your system is like blowing a Christmas light in a series circuit. If you’re not expecting cataclysm, it all goes kablooie.

“Single points of failure” are therefore the most fearsome bogeymen in software engineering. Even rookies develop anaphylactic symptoms at the mere thought of committing this quintessential error of design thinking. You never want your name on the pull request that fubars Facebook.


The franchise and the virus work on the same principle: what thrives in one place will thrive in another. You just have to find a sufficiently virulent business plan, condense it into a three-ring binder—its DNA—Xerox it, and embed it in the fertile line of a well-traveled highway, preferably one with a left turn lane. Then the growth will expand until it runs up against its property lines.

— Neal Stephenson, Snow Crash


DOWN FOR THE COUNT

One day, your grandkids—who have careers designing new Fortnite Islands—will look at you through their adorable, round, $2.5M iBall implants and ask, “Where were you when Facebook deplatformed themselves?”

On October 4, 2021, Facebook and its affiliated services went dark for six hours—all because of a single technical error. It was hardly news to anyone who spurns social media, but it proved to be a huge headache for 3.5B Facebook users suddenly bereft of the ‘book. Half of the world got left on read as WhatsApp, Instagram, and Facebook Messenger disappeared, along with the job security of countless web app developers using Facebook authentication. A rare exodus to Twitter ensued.

Santosh Janardhan, Facebook’s VP of Infrastructure, explained how a single command severed the backbone connections between Facebook’s data centers, which in turn knocked out the company’s DNS and instantly obscured the Facebook network from the web:

Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.

In breaking normal DNS resolution, Facebook had equipped active camo for their entire product line. They most likely watched it happen in real time. It would have been a simple fix, had browsers been able to find their network:

All of this happened very fast. And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.
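If you want to feel the shape of the problem yourself, here is a minimal (and purely illustrative) Python sketch of what every browser, app, and internal tool was doing that afternoon. Resolution is step zero; when it returns nothing, it does not matter how healthy the servers behind it are.

```python
import socket

def can_reach(hostname, port=443, timeout=3.0):
    """Return True if the hostname resolves AND something answers on the port."""
    try:
        # Step zero: DNS. On October 4th this call came back empty for
        # facebook.com everywhere, so nothing below it ever got a chance to run.
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False  # no DNS answer: the servers may be fine, but nobody can find them
    for *_, sockaddr in infos:
        try:
            with socket.create_connection(sockaddr[:2], timeout=timeout):
                return True  # at least one address answered
        except OSError:
            continue
    return False

print(can_reach("facebook.com"))  # on October 4th, this would have printed False for six very long hours
```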

With remote access out of the question, Facebook engineers had to spend hours butting heads with hardware security protocols at the company’s physical data centers. They managed to restore functionality later that day, but only after putting the Internet through an opera’s worth of tension.

When the happy blue logo returned to the web, no one’s private information appeared compromised, but the same couldn’t be said for the company’s public perception.

MEA CULPA 2.0

You see, we lost more than grandma’s political rants during those six hours. Ecommerce markets the size of some GDPs vanished as Facebook’s ad ecosystem went belly-up. The hiccup cost Facebook up to $65M in ad revenue, tombstoned $6B from Mark Zuckerberg’s personal surf wax fund, and even caused Facebook’s seemingly invincible stock to quaver during the outage.

In the calm that followed the blip, the worrywarts of the Internet also suddenly remembered that Facebook is custodian to innumerable petabytes of user data.

To allay fears that malicious activity could have played a part in the outage, Mr. Janardhan published a supplemental explanation the next day, calling the events of October 4th “an error of our own making.” In it, he took pains to document the role of Facebook’s backbone network, flex a few thousand miles of fiber optic cable, and rather breezily reveal the line-level origin of the outage: a bug that prevented an audit tool from preventing the actions of a human who erred.

You read that right.

…in the extensive day-to-day work of maintaining [our] infrastructure, our engineers often need to take part of the backbone offline for maintenance—perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.

This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
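To make that failure mode concrete, here is a hypothetical sketch of such a pre-flight audit. Every name, pattern, and threshold below is invented (Facebook has not published its tooling); the point is only how small a bug has to be before a guardrail waves a catastrophic command straight through.

```python
# Hypothetical pre-flight audit for backbone maintenance commands.
# All names, patterns, and thresholds are invented for illustration.

DESTRUCTIVE_PATTERNS = ("withdraw", "drain", "shutdown")

def audit(command, affected_routers, total_routers):
    """Return True if the command should be allowed to run."""
    blast_radius = affected_routers / total_routers
    if any(pattern in command for pattern in DESTRUCTIVE_PATTERNS):
        # Guardrail: a destructive command should never touch most of the backbone.
        return blast_radius < 0.05
    return True

def run(command, affected_routers, total_routers):
    if not audit(command, affected_routers, total_routers):
        raise PermissionError(f"audit blocked: {command!r}")
    print(f"executing {command!r} on {affected_routers}/{total_routers} routers")

run("assess capacity", affected_routers=4, total_routers=400)  # allowed
try:
    run("drain all backbone links", affected_routers=400, total_routers=400)
except PermissionError as blocked:
    # One inverted comparison, missing pattern, or unhandled edge case in audit()
    # and this command sails through instead -- roughly the failure mode Janardhan describes.
    print(blocked)
```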

Forgive an old-fashioned reader, but do ears still prick up at the passive voice? Mr. Janardhan writes that a command “was issued.” Must we not necessarily infer that a human being gave the directive to set the whole thing in motion? At the risk of dulling Occam’s Razor on a big ol’ hunch, it sure sounds like somebody issued the wrong command.

It’s disconcerting enough to think you might lose access to a valuable social resource the next time a poor Silicon Valley schmo skimps on the protozoan gels in the morning shake. When that resource also accounts for a sizable portion of all Internet traffic, knows more about you than your shrink, and powers most of the apps you use on the daily, it tends to leave you with the creepy-crawlies.

SOURCE: DUDE, TRUST ME

Mistakes do happen. When your buddy lets the trash talk get personal in Mortal Kombat, you find a way to forgive them.* When a monolithic corporate entity bungles itself out of existence for a few hours, effectively disabling entire markets of third-party apps and services, you head for the hills.

Facebook’s architecture is a favorite case study in systems design; it almost defies belief that something like this could happen on their watch.

With all their famed fallbacks, aggressive database sharding, and low-latency data centers, you’d think it was impossible to turn off the lights from any one node on the network. But when someone did, their explanation was more or less “our bad.” Details regarding how the query “unintentionally took down all the connections” are also notably absent. (How many commands can you think of that check availability by beheading your authoritative nameserver?)

Don’t get us wrong; we agree that one ought to be understanding of human error, at least in personal affairs. But when our digital personae are inextricable from those affairs, rented back to us at a premium, and governed by seemingly indifferent corporate mouthpieces, it’s hardly being touchy to ask for a few changes.

As Mr. Janardhan humblebragged about Facebook’s massive network of interconnects, he betrayed the dangerous implication that it’s the unrivaled scale of Facebook’s infrastructure itself that introduces new vectors for error. Though subsequent outages have passed without fanfare, could their sudden prevalence be the first symptoms of untenable scale?

THE BIGGER THEY ARE…

Facebook’s not the only network to endure a crisis. A number of other sites and services have suffered unplanned outages this year due to hacks or design flaws in closed-source software; it’s safe to say it’s been a bad year to do business on the Internet for just about everybody. Consider this abbreviated timeline:

  • March: The SolarWinds hack compromises sites and services used by the U.S. government.
  • April: Hackers force the Hotbit crypto exchange to rebuild servers from scratch.
  • May: The Colonial Pipeline in the United States falls prey to DarkSide hackers.
  • June: The Fastly content delivery network* suffers an outage. The service interruption only lasts an hour, but it offlines the New York Times, CNN, and Reddit, and lames parts of Twitter, Google, and Spotify.
  • July: The Akamai Edge DNS service goes on the fritz, causing issues with prominent websites like Amazon, FedEx, UPS, and Airbnb. Steam and the PlayStation Network also see service disruptions.
  • August: A brute force attack exposes 50 million T-Mobile users’ personal data, including driver’s license images and Social Security numbers. The hacker gloats about how easy it was to gain entry through an unprotected router.
  • September: Security experts discover a backdoor that allows unauthorized Visa payments through Apple Pay.
  • October: Another hack exposes Twitch source code and top creator revenues. Wannabe engineers building Twitch in systems design interviews may now check their answers.

Addressing cybersecurity breaches and infrastructure outages in the same breath might seem like comparing apples to oranges, but these events share a common theme: a breakdown in a single service adversely affected millions of individuals in every instance. To the end user, the only real difference between a hack and a service interruption is whether something got overlooked during execution or design.

Perfection doesn’t factor into it. There will always be another hack, a back door, or a stretch of unplanned downtime. We webizens put our faith in the idea that third-party corporations have done everything they can to minimize risk. Goodwill is hard-won and readily lost.

No one would make the mistake of pronouncing Facebook dead simply because it took a siesta. Despite a middling third quarter, Facebook’s market confidence doesn’t seem to have suffered from the events of October 4th. Analysts are champing at the bit over a Facebook Metaverse. Yet it bears repeating: there are no honest mistakes on a zero trust Internet—nor is there a root cause analysis that can overcome the fundamental weakness of centralization.

…THE HARDER THEY FALL

Decentralization defies cohesion as a unified movement. There are numerous startups, DeFi outfits, and fledgling social networks squabbling over what it all means. Irrespective of market position or political ideology, most advocates prescribe reforms both structural and spiritual. It’s not simply a question of reclaiming web infrastructure; we must also nerf the platforms to minimize their splash effect. The October 4th Facebook outage distills both aspects of the argument.

Most of the Internet exists by proxy of a few corporate entities, or at least runs on their centralized infrastructure. In relying on cloud providers like Amazon Web Services, Google Cloud, or Microsoft Azure to scale and orchestrate their web apps, the biggest companies of Web 2.0 have established the discrete points of failure that could doom the Internet to the Dark Ages, given the right catalyst.

Then there’s Facebook. Even if you think the core app is a boring boomer hangout, you’ve got to admit that the company fundamentally changed the Internet. In the past decade alone, we’ve witnessed the proto social network become a powerhouse platform with its own byzantine data centers. Facebook commandeered the most widely-used messaging applications on Earth and infected the entire advertising industry with its commodification of human attention.

As websites like Facebook expand their feature sets, perceived utility cements their place as necessary services. So arises a “platform”—a monolithic brand with wide spheres of influence, proprietary marketing funnels, and a political war chest. These infinite touchpoints make platforms more than mere conduits for web activity; they are now rudders for billions of web wayfarers. Should Facebook go offline for keeps, the Internet could find itself adrift.

THE METAVERSE MUST FLOW

Today’s Metaverse architects may have ripped their blueprint from the pages of Snow Crash, but what’s in the margins could prove to be even more prophetic.

All the creative and social activity of Neal Stephenson’s Metaverse occurs by grace of the Global Multimedia Protocol Group, a corporation founded by technologists and engineers—much like the companies who established the real-world Internet. What would have happened if someone at GMPG had run the vacuum too close to the cord? Knowingly or not, Stephenson makes a strong argument against centralized infrastructure.

The Metaverse is an idealized psychopia where software enables us to create at the speed of imagination. It’s a place not catty-corner to existence but grafted onto it. We will turn to cutting-edge hardware to superpose our designs on the physical world. Anything that risks experiential discontinuity should be anathematized.

Because the Metaverse must persist by nature, it cannot have a single point of failure. If we continue to orchestrate the destinations of the Internet through third-party cloud services and platforms, we risk violating the core promise of persistence. The only way to conjure the Metaverse is to embrace decentralized systems that offer greater reliability, predictability, and transparency.

SHARE THE LOAD

Distributed infrastructure might make our Metaverse destinations ironclad against cataclysm, but that’s still only half of the battle. Though the Global Multimedia Protocol Group was politically benign, its solitary ownership of the literary Metaverse gave it a vantage that any technocracy would covet. It’s not hard to imagine those libertarian eggheads eventually ceding the mantle to powers more closely aligned with the social media and telecom companies that dominate our Internet.

At Salad, we’re designing intelligent software to find the most lucrative job for lazy hardware. Our engineering team is developing ways to take advantage of emerging technologies so we can distribute partial workloads to individual nodes on our growing network. There are tradeoffs between a hodgepodge supercomputer like Salad and a multimillion-dollar data center, but none more dear than the price of freedom.
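In practice, “distribute partial workloads” means something like the toy scheduler below. The node names and the scoring rule are invented, and a real network weighs far messier signals (uptime, bandwidth, interruption risk), but the principle holds: spread the work in proportion to capacity, and losing any one node costs you a slice of a job rather than the job.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    gflops: float  # advertised compute capacity
    online: bool

def assign_chunks(chunks, nodes):
    """Greedy split: hand each chunk to the node with the least work per unit of capacity."""
    live = [n for n in nodes if n.online]
    if not live:
        raise RuntimeError("no nodes online")
    load = {n.name: 0.0 for n in live}
    plan = {n.name: [] for n in live}
    for chunk in chunks:
        target = min(live, key=lambda n: load[n.name] / n.gflops)
        plan[target.name].append(chunk)
        load[target.name] += 1.0
    return plan

nodes = [
    Node("gaming-rig-01", 30.0, True),
    Node("dorm-laptop-07", 6.0, True),
    Node("basement-miner-13", 24.0, False),
]
chunks = [f"frame-{i:03d}" for i in range(12)]
print(assign_chunks(chunks, nodes))  # 10 chunks to the rig, 2 to the laptop; the offline miner gets none
```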

We believe it’s possible to realize a fully decentralized Internet. Compute-sharing communities like ours will allow individuals to earn meaningful value, do more together, and mount the web insurrection that forces platforms to play fair. Free expression and data privacy can only be ours if we reclaim our compute capacity. After all, the best way to disempower the platforms is to compete.

NEXT: Decentralize, or Be Destroyed!


Footnotes

*You were spamming uppercuts, after all. That’s probably not even what your mother said last night (or any of the nights he claims she “let him borrow her MyPillow”).

*The Reddit shutdown is especially disconcerting to us at Salad. Not for the “d’awws” or the erudite defenses of assholery, but because Reddit’s Moons economy runs on Ethereum. Third-party applications are critical to the widespread adoption of cryptocurrencies. Even though blockchain ecosystems are relatively immune to failure, it’s alarming to think that any avenues of participation could be so easily closed.
