Let’s look at an example where a seemingly ordinary, minor server outage led to a 2 million euro loss. The company is an industrial giant delivering products in large quantities to sales locations, so logistics and the timeliness of deliveries are critical to its business.

A server failed on a regular Friday afternoon, but because the CMDB was missing key information, the failure was not seen as particularly important. Consequently, it ended up costing the company over 2 million euros.

How could this happen? They had all the bells and whistles in place and what they believed was a well-managed CMDB. Yet, a seemingly regular server failure escalated into a Major Incident. Suddenly, data quality and the CMDB became a top management topic. Important lessons were learned.

Let’s take a closer look.

Things were Looking Good, but…

The company had invested significantly in its CMDB, and the people involved believed they had a robust system in place:

  • Discovery populated their CMDB.
  • Event management and different monitoring systems were in use.
  • An Enterprise Architecture function was in place, and business processes and applications were described.

However, when mapped onto the Common Service Data Model (CSDM) framework, their CMDB setup revealed some concerning gaps.

CMDB Setup Before the 2 Million Euro Incident

The Enterprise Architecture team primarily managed the DESIGN domain, utilizing their own tools and methods. Some data was in ServiceNow, but not all. Maintenance was sporadic at best. The BUILD domain was not in use at all.

The SELL/CONSUME domain was virtually non-existent, and the lines between Technical and Business Services were blurry.

The MANAGE TECHNICAL SERVICES domain was data-rich, boasting numerous Service Offerings and CI relationships. However, a glaring omission was the lack of connections to the business side or the end-users of the configuration items.

While many Application CIs were linked to Application Management Services offered by a third-party vendor, they weren’t connected to the actual business services or users. This gap made proper impact or root cause analyses based on CI relationships impossible.

FOUNDATION DATA existed, but its quality left much to be desired. There were issues like duplicate company entries, inactive users, and missing links between CIs and their responsible entities.

Let’s now look at how these omissions contributed to the 2 million euro incident.

The 2 Million Euro Incident

Unplanned Server Outage on a Casual Friday Afternoon

Automated monitoring systems flagged a server failure at approximately 2 PM on a Friday. An incident ticket linked to the Server CI that triggered the alert was promptly generated.

A preliminary analysis concluded it was a random server failure that required no immediate action, so a thorough investigation was deferred to Monday. The decision was made because the Server CI lacked any relationships to Applications or Services, and no responsible party was defined.
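
For illustration, here is a minimal sketch of the relationship lookup that could have been run at triage time, using the ServiceNow Table API. The instance URL, credentials, and the server sys_id are placeholders, and the query assumes the out-of-the-box cmdb_rel_ci schema (relationship direction and type names may differ in your model):

```python
import requests

# Placeholders: replace with your own instance and credentials (assumes basic auth).
INSTANCE = "https://your-instance.service-now.com"
AUTH = ("api.user", "api.password")

def upstream_of(ci_sys_id: str) -> list[dict]:
    """Return CIs that sit above the given CI in cmdb_rel_ci, i.e. records
    where the failed CI is the child (direction may vary in your model)."""
    resp = requests.get(
        f"{INSTANCE}/api/now/table/cmdb_rel_ci",
        params={
            "sysparm_query": f"child={ci_sys_id}",
            "sysparm_fields": "parent.name,parent.sys_class_name,type.name",
        },
        auth=AUTH,
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["result"]

# Triage check for the failed server (sys_id is a placeholder).
related = upstream_of("FAILED_SERVER_SYS_ID")
if not related:
    print("No upstream relationships found - exactly the blind spot in this story.")
for rel in related:
    print(f'{rel["type.name"]}: {rel["parent.name"]} ({rel["parent.sys_class_name"]})')
```

In this case, the lookup would have returned nothing, which is itself a warning sign worth acting on.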

Saturday Chaos: Deliveries Ground to a Halt

By Saturday morning, the Service Desk was overwhelmed with urgent calls reporting a critical system failure. Employees were frantic, stating, “Our delivery system is down,” and “We have no idea where our delivery trucks should go.”

Unfortunately, the callers – the delivery truck drivers – only saw that they did not get their delivery routes on their devices. They could only describe a “failed delivery system” and could not offer any further details to the Service Desk agents.

The underlying system, let alone its name, was invisible to its users. The truck drivers were used to the operational details simply appearing on their mobile devices; they never needed to think about how the details got there.

The situation quickly escalated, leading to heated exchanges and the eventual involvement of the Service Desk Lead, who declared it a Major Incident.

Trucks Waiting Around

Emergency Response on Saturday Evening

The Major Incident Manager was summoned and quickly identified the affected system as the Logistics System. Key stakeholders were alerted and convened for an emergency meeting that afternoon. The meeting involved various department heads, the head of application management, and other important stakeholders.

All called in on a Saturday to address this urgent issue.

Sunday Resolution

Somewhere between the second pizza and the third pot of coffee, the team linked the initial server failure to the application issue. They discovered that the failed server hosted one critical API used by the Logistics system. However, it did not affect the entire system – only the delivery of details to the truck fleet.

Once identified, the issue was resolved relatively quickly. However, the ripple effects had already caused significant damage:

  • Delayed deliveries of time-sensitive and expensive products.
  • The need to rapidly re-plan deliveries, delivery routes, and work shifts.
  • A surge of irate customers requiring immediate attention and calming down.
  • A slew of claims that had to be processed in the aftermath.

What can we learn from this?

Lessons Learned: Three Crucial Takeaways

This Major Incident originated from missing details in the CMDB. The absence of data led to a lack of understanding of the true impact of the server failure.

The key lessons learned from this incident, and the actions that followed, were:

  1. Clearly define responsibilities.
  2. Establish meaningful connections between Configuration Items.
  3. Enable Business Impact Analysis, Self-Service, and Automation.

1. Assign Responsibilities

Designating responsible individuals or teams for each Configuration Item (CI) is imperative. This information can be sourced from related records, such as Service Offerings. Establishing a clear path to this information is crucial.

Once this is in place, you’ll know whom to contact for further details in an urgent situation.
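
As a rough sketch of what such a lookup could look like in practice, the snippet below reads the baseline ownership fields of a CI via the ServiceNow Table API. The instance, credentials, and sys_id are placeholders, and the field names (owned_by, managed_by, support_group) assume the baseline cmdb_ci schema, which may be customised in your environment:

```python
import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder
AUTH = ("api.user", "api.password")                 # placeholder credentials

def responsibles_for(ci_sys_id: str) -> dict:
    """Fetch the ownership fields of a single CI; assumes the baseline
    cmdb_ci columns, which may differ in a customised instance."""
    resp = requests.get(
        f"{INSTANCE}/api/now/table/cmdb_ci/{ci_sys_id}",
        params={"sysparm_fields": "name,owned_by.name,managed_by.name,support_group.name"},
        auth=AUTH,
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["result"]

ci = responsibles_for("FAILED_SERVER_SYS_ID")  # placeholder sys_id
missing = [f for f in ("owned_by.name", "managed_by.name", "support_group.name") if not ci.get(f)]
print(ci)
print("Missing responsibilities:", missing or "none - someone can be called right away")
```

If the CI itself carries no ownership data, the same pattern can be pointed at related records such as Service Offerings, as suggested above.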

Read: How to Establish Ownership in Your CMDB

2. Establish CI Relationships

Use the CSDM Crawl phase data models and ensure that every Infrastructure CI (such as a Server) has an upstream relationship to an Application Service. Then, ensure that the Application Service is related to a Business Application.

If your core capabilities rely on Business Service data, then defining the minimum relationships and attributes within that data domain is essential. You will get a more comprehensive view of your IT infrastructure, enabling you to prioritize incidents and conduct preliminary root cause analyses more effectively.
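
One way to keep this in check continuously is a simple audit query. The sketch below, again using the ServiceNow Table API, lists operational servers with no relationship to an Application Service above them. The instance, credentials, and the Application Service class name (cmdb_ci_service_auto) are assumptions you may need to adjust for your setup:

```python
import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder
AUTH = ("api.user", "api.password")                 # placeholder credentials
APP_SERVICE_CLASS = "cmdb_ci_service_auto"          # assumption: baseline Application Service class

def servers_without_app_service(limit: int = 100) -> list[str]:
    """List server CIs with no relationship where an Application Service is
    the parent - the exact gap that hid the business impact in this story."""
    servers = requests.get(
        f"{INSTANCE}/api/now/table/cmdb_ci_server",
        params={
            "sysparm_query": "operational_status=1",  # assumption: 1 = Operational
            "sysparm_fields": "sys_id,name",
            "sysparm_limit": limit,
        },
        auth=AUTH, headers={"Accept": "application/json"}, timeout=30,
    ).json()["result"]

    orphans = []
    for srv in servers:
        rels = requests.get(
            f"{INSTANCE}/api/now/table/cmdb_rel_ci",
            params={
                "sysparm_query": f"child={srv['sys_id']}^parent.sys_class_name={APP_SERVICE_CLASS}",
                "sysparm_fields": "sys_id",
                "sysparm_limit": 1,
            },
            auth=AUTH, headers={"Accept": "application/json"}, timeout=30,
        ).json()["result"]
        if not rels:
            orphans.append(srv["name"])
    return orphans

print("Servers with no Application Service above them:", servers_without_app_service())
```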

Read: How to Accelerate CSDM Alignment

3. Enable Business Impact Analysis, Self-Service and Automation

Develop a comprehensive Business Service Portfolio to complement your technical portfolio. Begin by linking users to services as subscribers. Then, enable your Service Portal and workflow automation to leverage this valuable data. This will significantly enhance the service you can provide your customers.
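
As a hedged sketch of how that subscriber data could be put to work, the snippet below collects the e-mail addresses of users subscribed to an affected Service Offering, so a workflow could notify them proactively. The subscription table and its field names are placeholders; how you model subscribers depends entirely on your own instance:

```python
import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder
AUTH = ("api.user", "api.password")                 # placeholder credentials
SUBSCRIPTION_TABLE = "u_service_subscription"       # placeholder: a custom many-to-many table
                                                    # with 'user' and 'service_offering' references

def subscribers_of(offering_sys_id: str) -> list[str]:
    """Return e-mail addresses of users subscribed to a Service Offering,
    assuming the placeholder subscription table described above."""
    resp = requests.get(
        f"{INSTANCE}/api/now/table/{SUBSCRIPTION_TABLE}",
        params={
            "sysparm_query": f"service_offering={offering_sys_id}",
            "sysparm_fields": "user.name,user.email",
        },
        auth=AUTH, headers={"Accept": "application/json"}, timeout=30,
    )
    resp.raise_for_status()
    return [row["user.email"] for row in resp.json()["result"] if row.get("user.email")]

# With this list, a notification job could tell the affected truck drivers about
# the outage before they ever call the Service Desk.
print(subscribers_of("AFFECTED_OFFERING_SYS_ID"))   # placeholder sys_id
```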

Read: How to Ensure You Can Do Root Cause Analysis

Simulating a Different Outcome

In the wake of this costly incident, a Proof-of-Concept (PoC) project was initiated to explore alternative outcomes—specifically, how the situation might have unfolded if accurate and comprehensive data had been readily available during the crisis.

We adhered to the three key takeaways previously outlined and used the Common Service Data Model (CSDM) as our reference data model. For the simulation, we focused on just two critical Business Applications, one of which was directly implicated in the Major Incident.

The simulation aimed to reconstruct how the incident management process could have transpired if all pertinent information had been available.

The exercise served as a profound eye-opener for nearly everyone involved.

1. Responsibilities Clearly Defined

Had the incident first responder known whom to contact, immediate action could have been taken, streamlining the information-gathering process.

2. Relationships in Place

Had the CI Relationships been properly established,

  • The initial server failure would have been flagged as part of a critical Business Application, prompting the Server Operations team to prioritize it.
  • Key stakeholders could have been alerted proactively, even before the Service Desk received any calls.
  • Service Desk agents could have linked end-user reports to the server failure, potentially resolving the issue on Friday and preventing escalation.

Utilizing a Business Service Portfolio would have offered insights into the business impact by connecting users to applications through Service Offerings. In the simulation, this model was used to create the link between the IT Infrastructure and the Business Processes, both of which had existed before the incident but without any connection between them.

The previously used linking via Technical Services had caused Dependency Views to look like spider webs, where everything is connected to everything and the true impact is hidden in the web.

3. Advanced Steps, Not Simulated

While not part of the simulation, additional advantages could have been realized by implementing the advanced steps.

  • The company could have known which services are consumed by which users (service offering subscriptions).
  • They could have provided Dynamic Portal and Mobile User experiences based on subscribed services.
  • They could have easily (and probably automatically) been able to inform users about issues related to their services.

In Summary

Imagine a scenario where the first caller to the Service Desk had been identified as a user of the affected Logistics System. The Service Desk agent would have instantly seen an open incident under the application. Or, better yet, the original server event would have been assigned the proper priority from the get-go. Relevant stakeholders could have been informed about the issue as it happened.
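
To make that concrete, here is a small, hypothetical sketch of the lookup a Service Desk tool could run when such a call comes in: given the caller, find their subscribed offerings and any open incidents already logged against them. It reuses the placeholder subscription table from the earlier sketch, and the service_offering field on incident is an assumption that depends on your release and configuration:

```python
import requests

INSTANCE = "https://your-instance.service-now.com"  # placeholder
AUTH = ("api.user", "api.password")                 # placeholder credentials
SUBSCRIPTION_TABLE = "u_service_subscription"       # placeholder, see the earlier sketch

def open_incidents_for_caller(user_sys_id: str) -> list[dict]:
    """Find open incidents on Service Offerings the caller subscribes to, so the
    agent immediately sees that the Logistics System already has an open incident."""
    subs = requests.get(
        f"{INSTANCE}/api/now/table/{SUBSCRIPTION_TABLE}",
        params={"sysparm_query": f"user={user_sys_id}",
                "sysparm_fields": "service_offering"},
        auth=AUTH, headers={"Accept": "application/json"}, timeout=30,
    ).json()["result"]
    # Reference fields come back as {'link': ..., 'value': <sys_id>} objects.
    offering_ids = ",".join(row["service_offering"]["value"]
                            for row in subs if row.get("service_offering"))
    if not offering_ids:
        return []
    return requests.get(
        f"{INSTANCE}/api/now/table/incident",
        # Assumption: incidents in your instance carry a service_offering reference.
        params={"sysparm_query": f"active=true^service_offeringIN{offering_ids}",
                "sysparm_fields": "number,short_description,priority"},
        auth=AUTH, headers={"Accept": "application/json"}, timeout=30,
    ).json()["result"]

print(open_incidents_for_caller("CALLER_SYS_ID"))   # placeholder sys_id
```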

Having well-organized underlying data simplifies the tasks for Service Desk Agents and End Users. The right tools are essential for this.

The difference between assuming and knowing your data is accurate was worth a staggering 2 Million Euros in this case.

Mitigating Your Risk

A CMDB is only as good as the data it contains, and it is not about the number of records or CI Classes but the accuracy and usefulness of the data. It must be reliable, current, and actively utilized across various products and processes. There is no point in storing data in the CMDB that nobody uses.

Assigning a monetary value to data quality is often overlooked until a disastrous incident like this one occurs. Making sure that your data quality is good enough is an investment, not a cost.

Data Content Manager (DCM) is a tool designed to enhance and maintain your data quality on the ServiceNow platform, significantly reducing the risk of such costly incidents. Had it been in use here, I daresay this incident would not have become a major one, because the data needed to assess the true impact of a seemingly regular server failure would have been available.


 Thanks for reading and please get in touch if you have any questions!

Mikko Juola

Chief Product Officer at Qualdatrix

LinkedIn


Get a Free Guided Trial

To see how Data Content Manager works in your own environment, please request a Guided Trial from us. A Guided Trial allows you to experience the power of Data Content Manager in your own ServiceNow instance, with your own data. We will guide you through installation, creating Blueprints, running Audits, and interpreting results.

The results are yours to keep.

There’s no cost or commitment, since we know this is the easiest way for you to experience what Data Content Manager can do with your own data.
