Introducing ‘Observability’
Observability is the hot new buzzword in IT Operations, DevOps, Agile, and Site Reliability Engineering (SRE) communities. The concept of observability originally comes from the industrial world, and is defined in Wikipedia as:
“A measure of how well internal states of a system can be inferred from knowledge of its external outputs.”
For example, in a water treatment plant with no instrumentation inside the pipes, a plant operator outside the pipes cannot determine if water is flowing, which way it is flowing, how clean it is, etc. The system lacks observability.
However, by adding flow gauges and quality sensors inside the pipes, connected (by ‘telemetry’) to meters or dashboards outside the pipes, the internal system states (flow speed, water purity, etc.) can be inferred from the external system outputs (meters, dashboards, etc.). The system has observability.
Observability for Software Applications and Services
The same principle can be applied to software. Modern developers are building measurement directly into code, delivering observable status indicators to meters and dashboards outside the application. This allows operations teams (including IT ops, sysadmins, SREs) to, for example:
· Detect, isolate, and alert sooner on critical incidents and events.
· Investigate problem root causes more accurately and efficiently.
· Fix incidents faster with real-time feedback on remediation efforts.
· Conduct more accurate post-incident reviews and post-mortems.
· Better understand problem history to prevent recurrence.
· Close feedback loops with requirements for continuous improvement.
· Use analytics and machine learning to predict and prevent problems.
· And much, much more.
Observability for the Real World
No wonder observability is becoming the norm for cloud-native businesses, which can build and deliver new code unhindered by decades of success and the ‘legacy’ of systems and applications that come with that success.
However, even a large traditional enterprise can build observability into services, even without substantial refactoring. For example:
· With no internal system changes – collect internal system-level data directly from servers, storage, networks, containers, cloud services etc. (e.g. entity performance, utilization, capacity).
· With minor configuration changes – deploy collectd to measure and forward infrastructure attributes (e.g. CPU/memory utilization, network performance, storage IOPS).
· With (probably) minor code changes – deploy statsd to collect and forward metrics from inside your application (e.g. transaction response time, volume, errors etc.).
· With (perhaps) major code changes – use semantic logging (even simple JavaScript injection) to instrument any activity, including business metrics (e.g. sign-ups, click-through rate, revenue).
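As an illustration of the "minor code changes" option, the sketch below emits StatsD-style metrics from inside an application. It assumes a StatsD-compatible daemon listening on UDP port 8125 on the local host. The metric names (shop.checkout.*) and the helper functions are hypothetical and not taken from any particular client library. It relies only on the plain-text StatsD line format of name:value|type.

```python
import socket
import time

# Metric types in the StatsD line format: counter, timer (ms), gauge.
VALID_TYPES = {"c", "ms", "g"}


def format_statsd(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line of the form <name>:<value>|<type>."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line over UDP (fire and forget).

    Host and port are assumptions; adjust them to wherever your daemon listens.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def timed_checkout(handler):
    """Run a (hypothetical) checkout handler and report volume, errors, and latency."""
    start = time.perf_counter()
    try:
        result = handler()
        send_metric(format_statsd("shop.checkout.ok", 1, "c"))
        return result
    except Exception:
        send_metric(format_statsd("shop.checkout.errors", 1, "c"))
        raise
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        send_metric(format_statsd("shop.checkout.response_time", elapsed_ms, "ms"))
```

Wrapping an existing handler this way touches only the call site, which is why this level of instrumentation usually needs no refactoring of the application itself.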
Each approach is valuable to varying degrees. Even basic infrastructure metrics will help to detect and triage many problems, allowing IT Operations teams to answer key technology questions, such as:
· What is a normal transaction volume or resource utilization by hour, day, or month?
· Is my application performing correctly for this time of day, day of week, etc.?
· Are the application infrastructure and configuration sufficient for my current load?
· Are there transaction bottlenecks in certain applications that are causing problems?
· Are there services or systems throwing exceptions and errors that I need to fix?
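The first two questions above come down to comparing a current reading against a baseline for the same time of day. A minimal sketch, assuming samples arrive as (hour_of_day, value) pairs and that "normal" means within a few standard deviations of the hourly mean. Both the function names and the three-sigma default are illustrative choices, not a recommended policy.

```python
from statistics import mean, stdev


def hourly_baseline(samples):
    """Group (hour_of_day, value) samples and return {hour: (mean, stdev)}.

    Hours with fewer than two samples are skipped, because a standard
    deviation cannot be computed from a single point.
    """
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    return {
        hour: (mean(values), stdev(values))
        for hour, values in by_hour.items()
        if len(values) >= 2
    }


def is_normal(baseline, hour, value, tolerance=3.0):
    """Return True if value lies within tolerance standard deviations of the hourly mean."""
    average, spread = baseline[hour]
    return abs(value - average) <= tolerance * spread
```

For example, with 9 a.m. transaction volumes of 100, 110, 90, 105, and 95 on previous days, a reading of 120 falls inside the three-sigma band, while a reading of 400 does not.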
However, application activity recorded in a well-structured semantic log opens up observability into higher-order data, allowing multiple stakeholders to also answer key business questions such as:
· How long are purchases taking at different times of day, or days of the week?
· What is my click-through rate, and how does it vary by customer, transaction, product?
· Is my current revenue number normal right now – and if not, what should I do about it?
· Who is my best customer? My worst? Where should I focus my marketing?
· How many purchases are failing, and why? What customers are affected?
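To make "well-structured semantic log" concrete, here is a minimal sketch that writes each business event as one JSON object per line and then answers the last question above (how many purchases are failing, and why). The field names (event, status, reason, customer) are assumptions chosen for the example, not a standard schema.

```python
import json
from collections import Counter


def make_event(event: str, **fields) -> str:
    """Serialize one business event as a single JSON log line.

    Keys are sorted so that identical events always produce identical lines.
    """
    record = {"event": event, **fields}
    return json.dumps(record, sort_keys=True)


def failed_purchases_by_reason(lines):
    """Count failed purchase events in an iterable of JSON log lines, grouped by reason."""
    reasons = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("event") == "purchase" and record.get("status") == "failed":
            reasons[record.get("reason", "unknown")] += 1
    return reasons


def affected_customers(lines):
    """Return the set of customers with at least one failed purchase."""
    customers = set()
    for line in lines:
        record = json.loads(line)
        if record.get("event") == "purchase" and record.get("status") == "failed":
            customers.add(record.get("customer"))
    return customers
```

Because every line carries named fields rather than free text, the same log can serve both operations (errors, latency) and business stakeholders (failures by customer or product) without re-parsing message strings.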
From Observation to Action with AIOps
Observability itself is not the end goal. More charts and dashboards will not help your business succeed per se. To be truly meaningful, observability must feed action – such as real-time problem and incident triage, closed DevOps feedback loops, or prescriptive problem prevention.
Typically, this means collecting observability data, correlating it with other monitoring outputs, and processing it with advanced analytics and machine learning, to drive ‘known good’ responses into automated actions. Combining monitoring and observability with advanced data integration, machine learning, predictive analytics, and orchestration capabilities delivers what Gartner calls “Artificial Intelligence for IT Operations,” or “AIOps.”
For example, AIOps solutions will take your raw observability data and make it meaningful and actionable by:
· Integrating it with data from critical systems and tools, such as DCIM/APM tools, HTTP events, API outputs, device data, SNMP traps, and even RMF, SMF, or CICS data.
· Improving ‘signal to noise’ by correlating, analyzing, and filtering these integrated datasets to suppress alert storms or isolate the most notable events.
· Leveraging machine learning and predictive analytics to identify and even correct otherwise hidden anomalies to get ahead of potential problems.
· Triggering automated workflows to find, fix, and prevent both known and novel incidents by executing known solutions, even without human intervention.
· Correlating technology and business insights to enable Product Managers and DevOps teams to iterate on new ideas in real time to achieve business goals.
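The "signal to noise" point can be illustrated with a very simple form of alert-storm suppression: forward the first alert for a given source and message, then drop repeats until a quiet window has elapsed. This is only a sketch under stated assumptions. The tuple layout, the five-minute default window, and the function name are hypothetical, and real AIOps products use far richer correlation than exact-match deduplication.

```python
def suppress_duplicates(alerts, window_seconds=300):
    """Keep the first alert per (source, message) within each time window.

    alerts: iterable of (timestamp_seconds, source, message), sorted by time.
    A repeat is forwarded again only once window_seconds have passed since
    the last forwarded alert with the same source and message.
    """
    last_forwarded = {}
    kept = []
    for timestamp, source, message in alerts:
        key = (source, message)
        previous = last_forwarded.get(key)
        if previous is None or timestamp - previous >= window_seconds:
            kept.append((timestamp, source, message))
            last_forwarded[key] = timestamp
    return kept
```

With alerts from "db1" reporting "disk full" at 0, 10, 200, and 400 seconds, only the alerts at 0 and 400 seconds are forwarded. An unrelated alert from another source in between passes through untouched.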
Observability Nirvana
Observability as practiced at (and often preached by) cloud-based startups delivering web-based services is an exciting new world of IT management – but to many traditional IT Ops teams, it does not seem achievable. However, any business can and should adopt observability techniques, including large enterprise IT. Especially as a supplement to traditional monitoring, observability changes the game in software service delivery, and moves IT closer to the nirvana of true business-technology alignment.
About the author:
Andi Mann is the Chief Technology Advocate for Splunk.