bugs will quickly emerge to inhibit service levels, while countless other
errors will lurk just below the surface in a latent state. The ITIL Problem
Resolution domain provides some well thought-out guidance to deal with these
issues. This article discusses how problem review boards (PRBs) can improve
problem resolution management for complex systems.
At this point, three key terms should be clarified: 1) an incident is any
deviation from the standard operation of a system that causes, or could
cause, a service interruption; 2) a problem is the condition of having
multiple similar incidents; and 3) a known error is the identified root
cause of a problem.
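To make the relationship between the three terms concrete, here is a minimal sketch of how they might be modeled as records. The class and field names (Incident, Problem, root_cause, is_known_error) are illustrative assumptions, not ITIL-defined structures or the schema of any particular service desk tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """A deviation from standard operation that causes, or could cause,
    a service interruption. Field names are hypothetical."""
    incident_id: str
    system: str
    symptom: str


@dataclass
class Problem:
    """A group of similar incidents whose root cause may not yet be known."""
    problem_id: str
    incidents: List[Incident] = field(default_factory=list)
    root_cause: Optional[str] = None  # set once root cause analysis succeeds

    @property
    def is_known_error(self) -> bool:
        # Following the definitions above: a known error exists once the
        # root cause of the problem has been identified.
        return self.root_cause is not None


# Hypothetical example: two similar incidents grouped into one problem.
problem = Problem(
    "P-1",
    [
        Incident("I-1", "mail", "queue stalls"),
        Incident("I-2", "mail", "queue stalls"),
    ],
)
```

In this sketch a problem starts without a root cause and becomes a known error only when one is recorded, which mirrors the progression described above.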
Essentially, from ITIL we understand that there are two management forces at
work. First, there is incident management, which is concerned with restoring
service as quickly as possible, often using workarounds that address known
errors. Second, problem management is geared toward both proactively and
reactively addressing the underlying causal factors of incidents. Readers
might want to review the ITIL Service Support volume’s chapter on Problem
Management to gain a better understanding.
As complexity increases, the percentage of total system understanding held
by any one IT person will decrease. This is because the level of expertise
needed to build complex systems demands the involvement of multiple parties.
There is simply no realistic alternative. Whether systems are developed
entirely in-house, outsourced or some combination thereof, there are multiple
people, even multiple organizations, involved.
Correspondingly, when incidents or problems occur, root cause analysis
demands review by the parties with the appropriate expertise. For example,
to build a large mission-critical clustered server, there will be
involvement from the vendor(s) of the hardware, the software vendors,
internal software development, IT engineering/release management teams,
security, operations and so on.
Problem Review Boards (PRB)
In the same manner that there are change advisory boards (CABs) for updates
to production systems, there must be a parallel group, or groups, reviewing
incidents to spot trends, identify problems and ultimately determine root
causes and mitigations.
Depending on the complexity of the organization, there may be one PRB
overall or a PRB per system. For that matter, some organizations may be so
small or simple that, for whatever reason, they do not need PRBs. In those
cases, it is recommended that they still understand the ITIL Problem
Resolution processes and adopt best practices into their organizations. For
organizations with complex systems, regardless of size, the implementation
of PRBs needs to be seriously considered.
The goal of the PRB is to govern problem management reactively and
proactively. This is done through analyzing incidents as they happen,
reviewing historic trend data and staying abreast of current industry news
and vendor updates. For example, a switch may not have failed yet, but your
PRB may know of an operating system bug that has been identified and
eliminated by another organization using the same switch. Hence, it would be
advisable to assess risks and determine the best means to modify the switch
and mitigate incident risks proactively.
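The proactive check in the switch example can be sketched as a simple match between an equipment inventory and a list of vendor advisories. Everything here is an assumption for illustration: the Device and Advisory records, their fields, and the sample model and version strings are hypothetical and do not come from any real vendor feed or inventory tool.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Device:
    name: str
    model: str
    os_version: str


@dataclass(frozen=True)
class Advisory:
    model: str
    affected_versions: FrozenSet[str]
    summary: str


def devices_at_risk(
    inventory: List[Device], advisories: List[Advisory]
) -> List[Tuple[Device, Advisory]]:
    """Return (device, advisory) pairs where a device runs an affected version."""
    hits = []
    for advisory in advisories:
        for device in inventory:
            same_model = device.model == advisory.model
            if same_model and device.os_version in advisory.affected_versions:
                hits.append((device, advisory))
    return hits


# Hypothetical data: one switch runs a version named in the advisory.
inventory = [
    Device("sw-core-1", "X100", "1.2"),
    Device("sw-edge-1", "X100", "1.4"),
]
advisories = [
    Advisory("X100", frozenset({"1.1", "1.2"}), "OS bug reported elsewhere"),
]
at_risk = devices_at_risk(inventory, advisories)
```

A match like this would only flag candidates. The PRB would still need to assess the risk and decide how, and whether, to change the device.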
The PRB must include representation from relevant stakeholders in order to
review incident trend data effectively for reactive problem management and
to search for risks from a proactive stance. This means the PRB could
comprise vendor personnel, consultants, IT operations, IT release
management, security, and so on. Incidents that appear to be
establishing a trend would then be assigned to teams that would search for
the underlying problem.
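One simple way to surface incidents that "appear to be establishing a trend" is to count similar incidents within a recent time window and flag any group that reaches a threshold. The sketch below assumes incidents are available as (timestamp, system, symptom) tuples. The 30-day window and the threshold of three are arbitrary illustrative defaults that a PRB would need to tune, not recommended values.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

IncidentRow = Tuple[datetime, str, str]  # (timestamp, system, symptom)


def find_trends(
    incidents: Iterable[IncidentRow],
    now: datetime,
    window: timedelta = timedelta(days=30),
    threshold: int = 3,
) -> List[Tuple[str, str]]:
    """Return (system, symptom) pairs seen at least `threshold` times
    within `window` before `now`, sorted alphabetically."""
    cutoff = now - window
    counts = Counter(
        (system, symptom)
        for timestamp, system, symptom in incidents
        if timestamp >= cutoff
    )
    return sorted(key for key, count in counts.items() if count >= threshold)
```

Each pair returned is only a candidate for assignment to a team. Grouping by exact symptom text is a deliberate simplification, and real incident data would likely need categorization before counting.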
Once the underlying problem had been identified, a request for change would
be issued through the change management process to validate and enact the
recommendations of the PRB as they relate to the production systems and
various stored configurations. It is important to note that the PRB must not
spawn a separate change vector to production; rather, it must serve as an
input to the standard change management process, which must itself have an
established means of handling emergency change requests.
Summary
Complexity is increasing the need for effective communication and
coordination among the various groups involved with complex systems. As
complexity and technical specialization grow together, enhanced processes
will be needed to meet service levels, including availability and security.
To foster
appropriate problem identification and root cause analysis, enterprises
should use problem review boards with the necessary stakeholders represented
in order to make decisions for both proactive and reactive problem
management.
How does your organization handle incident and problem management? If you
have any stories or examples you’d like to share, please email me at
[email protected].