So what happens when a change fails?
Is there chaos or is a contingency plan executed?
High-availability organizations are always evolving their contingency plans when failure occurs, or the likely prospect of failure appears. Rather than put their entire focus into one way of implementing a change and never taking failure into account, these groups may develop one or two contingency approaches. But they always have an ultimate "rollback" plan to return the system to the last known good state.
The need for security and the ability to maintain service levels in increasingly complex business and IT environments should be incentives for all organizations to put rollback plans in place.
Imagine that the most critical server in the company has a planned maintenance window on Sundays from 2 a.m. to 5 a.m. That time slot has been negotiated into service level agreements (SLAs), which are taken very seriously. During that maintenance window, the planned changes are going great.
Then suddenly at 3 a.m. the server stops working — with only two hours to spare. The engineers repeatedly try to get the change in place by making ad hoc modifications. By 5 a.m. they are still pushing to get the changes in, and all the while, the system remains down.
What is wrong with this scenario?
Items four and five are crucial. The engineers should have known how long to try implementing the change, and then when to stop, change gears and execute the rollback plan to bring the system back online within the scheduled amount of time.
To explain, a rollback plan is a recovery plan that aims at returning the system to its last known good state. It may be a tape restore or a reload of a configuration file. The rollback plan is the emergency escape plan to get the system back up before the prescribed amount of time elapses. The allowable time factor is a key point.
There are times when one change is all that will happen. There are other times when the team has to install multiple changes on one host or across many hosts. Getting them all done within a planned maintenance window requires planning.
To make things simple, if there is one host, one change and a three-hour window, then basic logic tells us that we have three hours to get the change done. If there were three changes, then each change would use up some portion of that three-hour window based on estimates. The change planning process should always include a documented rollback plan and estimate as to how long it would take.
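The window arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PlannedChange` class, the `check_plan` function, the change names and the minute values are all hypothetical, and each change's time slot is assumed to include the time needed for its own rollback.

```python
from dataclasses import dataclass


@dataclass
class PlannedChange:
    name: str
    allotted_min: int  # time slot for the change, rollback included (assumption)
    rollback_min: int  # estimated time to execute the documented rollback plan


def check_plan(changes, window_min):
    """Return a list of problems; an empty list means the plan fits the window."""
    problems = []
    for change in changes:
        # A rollback that eats the whole slot leaves no time to do the work.
        if change.rollback_min >= change.allotted_min:
            problems.append(f"{change.name}: rollback estimate leaves no working time")
    total = sum(change.allotted_min for change in changes)
    if total > window_min:
        problems.append(f"plan needs {total} min but the window is {window_min} min")
    return problems


# Hypothetical plan: three changes in a three-hour (180-minute) window.
plan = [
    PlannedChange("patch-os", allotted_min=60, rollback_min=15),
    PlannedChange("upgrade-db", allotted_min=75, rollback_min=25),
    PlannedChange("update-firewall", allotted_min=45, rollback_min=10),
]

print(check_plan(plan, window_min=180))  # []
```

An empty result only says the estimates fit the window; it says nothing about whether the estimates themselves are realistic.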
The time the rollback plan would take to execute sets a key milestone, scheduled back from the end of the permissible time. If the change is allotted one hour and the rollback plan would take 15 minutes, then between 40 and 45 minutes into the change, the engineers must actively decide whether to push ahead and finish, or to execute the rollback plan and restore the system to the last known good state.
If things are going well, then finish implementing the change. If things are going badly, then the team must roll back what has been done so far.
To decide to stop and actually admit the process has failed takes a lot of discipline.
In the previous example, notice the allotment of up to five minutes to decide. The decision control point must take place with enough time to assess the situation and make a decision. Sometimes it even takes a dispassionate third party to make the decision because the engineers, or "change builders," are so deep in the details of the implementation that they fail to recognize that a decision is needed.
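The timing in the one-hour example can be worked out mechanically. The following is a minimal sketch, not a prescribed tool: the function name `decision_window` and the start date are hypothetical, and the five-minute decision allowance is taken from the example in the text.

```python
from datetime import datetime, timedelta


def decision_window(start, allotted, rollback, decision_allowance):
    """Return (decide_from, decide_by) for a single change.

    decide_by is the last moment at which the rollback can still finish
    inside the allotted time; decide_from leaves room to assess and decide.
    """
    deadline = start + allotted
    decide_by = deadline - rollback
    decide_from = decide_by - decision_allowance
    if decide_from < start:
        raise ValueError("rollback plus decision time exceeds the allotted time")
    return decide_from, decide_by


# Hypothetical change starting at 2 a.m. on a Sunday, allotted one hour,
# with a 15-minute rollback plan and up to five minutes to decide.
start = datetime(2024, 1, 7, 2, 0)
decide_from, decide_by = decision_window(
    start,
    allotted=timedelta(hours=1),
    rollback=timedelta(minutes=15),
    decision_allowance=timedelta(minutes=5),
)
print(decide_from.time(), decide_by.time())  # 02:40:00 02:45:00
```

Putting these two times into the change schedule ahead of time gives the team, or a third party, an agreed moment at which the go or rollback decision has to be made.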
Preparations
For a rollback plan to be optimally effective, there are some issues to bear in mind.
First, the engineers must be able to count on the current state matching the official last known good state, whether that state is documented manually or automatically.
If change management, configuration management and release management disciplines are not in place, then precious time can be lost when a rollback plan fails because the production build didn’t match the documented last known good build.
Not only is testing rendered less meaningful since the test system doesn’t mirror production, but it is also far harder to restore a system for which there is no current, accurate configuration data. In these cases, when failure happens during change implementation, work shifts from the vital task of recovery to forensics. You’ll need to ask, "Why is this configuration value 1,000 instead of 1,500? Who changed it and why?"
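One way to catch this kind of mismatch before the maintenance window, rather than during it, is to compare the documented last known good configuration with what is actually running. The sketch below assumes both can be loaded as simple key/value dictionaries; the function name `config_drift` and the setting names are hypothetical.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def config_drift(documented, actual):
    """Return {key: (documented_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(documented.keys() | actual.keys()):
        want = documented.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical values echoing the 1,000 versus 1,500 question above.
documented = {"max_connections": 1500, "timeout_s": 30}
actual = {"max_connections": 1000, "timeout_s": 30}

print(config_drift(documented, actual))  # {'max_connections': (1500, 1000)}
```

A non-empty result before the change begins is a signal to resolve the discrepancy first, since a rollback to the documented state would not actually restore what was running in production.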
System changes can and do fail. As systems become increasingly complex, the probability of subtle differences in production causing a planned change to fail during implementation will climb as well.
Groups worried about meeting service level agreements and stakeholder expectations must recognize this correlation and require rollback plans as part of the change management planning process. Without the ability to recognize failure and quickly recover, the probability of unplanned work and downtime increases… and nobody wants that.