Software Obsolescence, Misconfiguration, and Failure
Computer meltdowns are wreaking havoc around the world. Can database modernization help?
We have been reminded in the past few weeks of the costly and potentially dangerous implications of IT systems that are not well managed and maintained.
What’s the problem? And how can we fix it?
The Jan. 11 failure of the FAA’s antiquated Notice to Air Missions, or NOTAM, system resulted in a temporary shutdown and air-traffic delays across the U.S. The “meltdown” has been blamed on everything from outdated hardware to software misconfiguration to sheer obsolescence.
We need to know exactly what software, systems, and vendors were involved, but those details have been hard to come by.
It was the second transportation industry calamity in a few weeks. In December, Southwest Airlines was forced to cancel more than 16,000 flights during the busy holiday travel period. That fiasco was blamed on a legacy scheduling system, which had a spillover effect on phone support and other operations.
In a New York Times column, Columbia University professor Zeynep Tufekci summed up Southwest Airlines’ issues as being the result of “technical debt.” In other words, it’s the price paid for becoming overly dependent on outdated systems and software.
“While aging code is a common cause of technical debt in older companies—such as with airlines, which started automating early—it can also be found in newer systems, because software can be written in a rapid and shoddy way, rather than in a more resilient manner that makes it more dependable and easier to fix or expand,” writes Professor Tufekci.
These kinds of IT system breakdowns are by no means new. In the early 2000s, tech journalists like me were kept busy reporting on multimillion-dollar ERP boondoggles.
Yet, despite all of these years of (bad) experience, businesses and their tech partners continue to struggle with software implementations, as reported by CIO.com.
See the CIO article, “12 Famous ERP Disasters, Dustups, and Disappointments”
The case for database modernization
In many cases, it’s unclear how much of a wider problem can be traced to the database management system, but there is little doubt that databases are sometimes at fault.
In fact, CNN and others report that the FAA system failure was the result of a “damaged database file,” though there’s no mention of the type of database involved. To make matters worse, the backup system also had a corrupt file, according to CNN.
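One routine safeguard against this kind of failure is to record a checksum when a backup is taken and verify it before the backup is trusted. The Python sketch below is a generic illustration, not a description of the FAA's systems; the function names and file paths are hypothetical. Note its limits: a checksum only detects changes made after the digest was recorded, so it would not catch a file that was already damaged when it was copied.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large backup files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(backup: Path, recorded_digest: str) -> bool:
    """Compare the backup's current digest with the one recorded at backup time."""
    return backup.exists() and sha256_of(backup) == recorded_digest
```

In practice the recorded digest would be stored separately from the backup itself, so that a single damaged volume cannot corrupt both.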
And we have seen other examples of database glitches, such as one involving an Oracle database that caused a two-hour system outage at the VA, DoD, and Coast Guard, as reported by TechTarget. (I have not confirmed the details, but I will check in with Oracle and provide any updates if warranted.)
CIOs understand that modern infrastructure and distributed data architectures can help minimize downtime. It’s just that enterprise-wide IT modernization is resource intensive and can take years.
As you might expect, database vendors have a wide range of newer technologies that can help alleviate what might be considered avoidable mistakes: obsolescence, lack of funding, outdated architecture, etc.
Cockroach Labs takes the idea of database survivability to the extreme, having named its company after a bug that can’t be squashed. Cockroach’s distributed SQL architecture enables what it calls “ultra resilience.”
Likewise, Yugabyte, with its “data lives forever” mantra, provides both synchronous replication across three or more data centers and asynchronous replication between independent database clusters via its xCluster functionality. Here too, the goal is to provide continuous database availability even in the event of failure elsewhere in the enterprise infrastructure.
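To make the idea of synchronous replication across three data centers concrete, here is a toy Python sketch of a majority-quorum write. It is a simplified illustration of the general principle only, not the implementation used by Cockroach Labs or Yugabyte, whose products rely on full consensus protocols. The `Replica` class, the `quorum_write` function, and the data center names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A stand-in for one copy of the data in one data center (hypothetical)."""
    name: str
    healthy: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key: str, value: str) -> bool:
        # An unhealthy replica cannot acknowledge the write.
        if not self.healthy:
            return False
        self.data[key] = value
        return True


def quorum_write(replicas: list[Replica], key: str, value: str) -> bool:
    """Report success only if a strict majority of replicas acknowledged.

    Simplification: a real system would also roll back or repair the
    replicas that accepted a write which failed to reach quorum.
    """
    acks = sum(replica.write(key, value) for replica in replicas)
    return acks > len(replicas) // 2


replicas = [
    Replica("dc-east"),
    Replica("dc-central"),
    Replica("dc-west", healthy=False),  # simulate one data center outage
]
print(quorum_write(replicas, "flight-plan-42", "filed"))
```

With three replicas, the write still succeeds when one data center is down, which is why three is the usual minimum for this approach.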
How to fix
What can businesses do to lower the risk of software and system failures, and the impact they have on business operations, customer experience, and brand reputation? Entire books are devoted to the topic of modern computer architecture and software best practices, but here are a few starter ideas.
Organizational awareness. Tech leaders must have an ongoing conversation with business leaders and line-of-business managers about the state of readiness of their most essential systems and applications.
System modernization and automation strategy. Both the FAA and Southwest Airlines problems were the result of outdated systems and technical debt. A system modernization strategy is essential. In many cases, new automation capabilities can lower risk through features like self-patching, repair, and maintenance. And machine learning can be applied to automate IT operations more broadly.
Multi-cloud. It’s increasingly possible to use cloud services from two or more cloud service providers to establish resiliency across availability zones and geographic regions. Some of the same benefits can be gained with public/private hybrid clouds.
Data architecture. The tools and capabilities of modern data architecture include geo-partitioning, replication, change data capture (CDC), data fabrics and meshes, and data clouds, all of which can create a more robust environment.
Database transformation and migration. The process of moving data from legacy databases to newer, more capable cloud databases can take weeks or months. Database vendors are offering new tools that help with all aspects of data migration, from inventory/discovery to schema conversion to populating the new target databases.
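As a small illustration of the last point, the populate step of a migration is typically followed by a verification step before the new database is trusted. The Python sketch below uses the standard-library sqlite3 module as a stand-in for both the legacy and target databases. The table name, column names, and helper functions are hypothetical, and a real migration would compare far more than row counts.

```python
import sqlite3


def copy_table(source, target, table, columns):
    """Copy every row of `table` from source to target; return rows copied.

    Table and column names are interpolated into the SQL text, so they must
    come from a trusted inventory, never from user input.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = source.execute(f"SELECT {col_list} FROM {table}").fetchall()
    target.executemany(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows
    )
    target.commit()
    return len(rows)


def row_count(conn, table):
    """Return the number of rows currently in `table`."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def migrate_and_verify(source, target, table, columns):
    """Populate the target table, then confirm both sides hold the same row count."""
    copy_table(source, target, table, columns)
    return row_count(source, table) == row_count(target, table)
```

Row counts are the cheapest check. Comparing per-column checksums or sampling rows would catch problems that a matching count can hide.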
Further reading
Many database vendors provide guidance to help. Some examples:
AWS: “How Developers Are Building on AWS to Ensure No Downtime, Ever”
Cockroach Labs: “Highly Available, Resilient Data”
Oracle: “Oracle’s Autonomous Database: The Industry’s First Self-Repairing Database” (PDF)
Modern database systems are certainly part of the answer to these ongoing glitches and system failures. But the ultimate solution also requires investment and an organizational culture that prioritizes always-on performance.
Great post, John!
In a follow-up interview, the CEO of SWA explained they were implementing a new release of the purchased app that cratered under the load--and that they'd implemented EIGHT releases in the prior year. NINE releases of a large, on-prem, app suite--with custom mods, integrations, and undoubtedly UX/CX/report/database changes in one year??? He wanted to sound reassuring, but nine releases of a complex, critical system in a year should scare the heck out of Board members!
And that FAA database error was also scary. A critical system shouldn't allow an employee to just overlay a production database. Again, it looked like they were saying, "Nothing to worry about; just a dumb contractor." And again, where are the IT checks and balances that ensure components are validated before migration to production?