CompanyX experienced a server failure that ended up costing them over 2 million euros. How could this happen? With all the bells and whistles in place and a well-managed CMDB, a seemingly regular incident escalated into a Major Incident. Consequently, data quality and the CSDM became a management team topic and important lessons were learned.
The example discussed in this article is inspired by a real-life case that happened at, let us say, CompanyX. For obvious reasons the company, people, systems and specific events are not named, and the case is anonymous. Nevertheless, this is based on a real-life event.
Things were Looking Good
CompanyX had invested considerable time and money in their CMDB. The CMDB was populated by Discovery, event management and several monitoring systems were in use, and an Enterprise Architecture function was in place. Business processes and applications had been described.
Things seemed to be in good shape.
- CMDB populated by Discovery
- Event management and different monitoring systems in use
- Enterprise Architecture function in place, business process and applications described.
Before the incident happened, their “CMDB Setup” looked like this when overlaid on the Common Service Data Model framework:
DESIGN domain was mainly operated by the Enterprise Architecture team using their own tools and methods. Some data had been brought into ServiceNow, but not all. Maintenance was not frequent or continuous by any means.
SELL / CONSUME domain did not really exist at all. What existed was a mixture of Business and Technical Services.
MANAGE TECHNICAL SERVICES domain had a lot of data with Service Offerings and lots of CI relationships. However, no connection to the business or users of the configuration items existed. For example, most of the Application CIs were related to “Application Management Services” provided by a vendor, but not the actual business services or users.
This basically made it impossible to do any impact or root cause analysis based on the CI relationships.
FOUNDATION DATA existed, but its quality was poor: duplicate Companies, inactive Users and missing links between CIs and responsible persons.
So, let’s see what happened when a single server failure occurred.
The 2 Million Euro Incident
Random Server Failure on a Friday afternoon?
Automated monitoring and event management systems identified a server failure around 2 PM on a Friday afternoon. An Incident was created based on the event and was related to a Server CI that triggered the event.
Soon after, an initial analysis of the incident was completed. Result: some random server failure, we can take a better look at this on Monday. This conclusion was reached because the Server CI did not have relationships to any applications or services. A responsible person (Managed by) was also missing.
Trucks not Leaving on Saturday
The day shift on the following Saturday had just started when the first notice of a critical system failure reached the Service Desk. Anxious people started calling from all over the place, reporting that “Our delivery system doesn’t work” and “We don’t know where our delivery trucks should go”.
A Service Desk agent began to investigate what the root cause could be. With weekend staffing in place, they could not just assign it to second line without doing some initial investigation first.
Sadly, nothing could be found when callers only referred to “failed delivery system” and could not provide any further details. The underlying system, let alone its name, was not visible to its users. They were just used to getting these operational details into their hand-held devices.
The callers resorted to venting their frustrations to the Service Desk. Voices were raised and unprofessional language was used.
Consequently, the ticket was escalated to the Service Desk Lead, who soon decided to declare a Major Incident.
Calling all Troops on Saturday Evening
First, the Major Incident Manager was called to work. The manager soon figured out that it was the “Sleek Logistics System” that the callers were talking about. Relevant stakeholders were contacted and called to work.
The first MIM Group meeting was held on Saturday afternoon. This meeting included the Head of Operations, Head of Application Management, Head of you-name-it. Lots of people called in to work overtime on this critical issue.
Sunday brings the Sun (and the Server) up again.
Somewhere between the second pizza and third litre of coffee, the initial server failure was connected to the application issue. It turned out that this server hosted one critical API used by the Logistics system. The failure affected only that API, not the Logistics system as a whole.
The issue itself was fixed rather quickly but the snowball effect had already created lots of other problems. These were the problems that resulted in this incident becoming extremely expensive:
- A lot of (expensive and time sensitive) product did not get delivered on time because…
- Deliveries, delivery routes and work shifts needed to be quickly replanned.
- Lots of angry customers needed to be contacted and…
- a large number of claims needed to be dealt with in the aftermath.
What can we learn to prevent a 2 Million Euro incident like this from happening elsewhere?
Lessons Learned: Three Key Take-aways
This Major Incident began with nothing more than missing details in the CMDB. When the server failed, nobody understood the true impact or knew who to contact to find out. Everything that followed can be traced to this missing data.
The key take-aways and consequent actions from this incident were:
- Define responsibilities.
- Start connecting the dots.
- Plan for Business Impact Analysis, Self-Service and Automation.
Minimum Requirement: Define Responsibilities
Ensure that all CIs have responsible persons defined. Which person or group should know more about the CI? This information can come from a related record, such as a Service Offering. In that case, you need to ensure that the CI has a path to this information.
When this is done, we can at least know who to call and ask for more details, if something urgent happens.
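The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classes, field names (`managed_by`, `owner`) and sample records are hypothetical stand-ins for CMDB data, not ServiceNow APIs. The idea is that a CI either names its responsible party directly or inherits one through a related Service Offering, and anything with neither gets flagged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceOffering:
    name: str
    owner: Optional[str] = None  # hypothetical field: person or group owning the offering


@dataclass
class ConfigurationItem:
    name: str
    managed_by: Optional[str] = None  # responsible person set directly on the CI
    offering: Optional[ServiceOffering] = None  # related record that may carry the owner


def responsible_contact(ci: ConfigurationItem) -> Optional[str]:
    """Return who to call about a CI: its own contact first, then the offering's owner."""
    if ci.managed_by:
        return ci.managed_by
    if ci.offering and ci.offering.owner:
        return ci.offering.owner
    return None


def cis_without_contact(cis: list[ConfigurationItem]) -> list[str]:
    """List the names of CIs for which nobody can be called."""
    return [ci.name for ci in cis if responsible_contact(ci) is None]


# Hypothetical sample data
hosting = ServiceOffering("Logistics Hosting", owner="logistics-ops")
cis = [
    ConfigurationItem("srv-api-01", offering=hosting),      # inherits owner via offering
    ConfigurationItem("srv-db-02", managed_by="dba-team"),  # has a direct contact
    ConfigurationItem("srv-misc-03"),                       # nobody to call
]

print(cis_without_contact(cis))
```

Running a check like this regularly would surface "orphan" CIs before an incident does.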
Next Step: Start Connecting the Dots
Take, for example, the CSDM’s Crawl phase data models and ensure that every Infrastructure CI (such as a Server) has an upstream relationship to Application Service. Then ensure that the Application Service is related to a Business Application.
If your key capabilities depend on the Business Service data, then define the minimum relationships and attributes for that data domain.
This should already give you a better view on your IT Infrastructure and provide means to better prioritize incidents and run initial root cause analysis.
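As a rough sketch of what "connecting the dots" makes possible, the Python below walks upstream relationships from a failed server to find the Business Applications it supports, and lists servers that have no such path. The CI names, classes and the `UPSTREAM` mapping are hypothetical sample data (only "Sleek Logistics System" comes from the case above). A real CMDB would supply the relationships.

```python
from collections import deque

# Hypothetical CI classes, keyed by CI name
CI_CLASS = {
    "srv-api-01": "Server",
    "srv-misc-03": "Server",
    "Routing API": "Application Service",
    "Sleek Logistics System": "Business Application",
}

# Hypothetical upstream relationships: CI -> the CIs that depend on it
UPSTREAM = {
    "srv-api-01": ["Routing API"],
    "Routing API": ["Sleek Logistics System"],
}


def upstream_of(ci: str) -> list[str]:
    """Breadth-first walk over upstream relationships, returning every CI reached."""
    seen, queue, order = {ci}, deque([ci]), []
    while queue:
        current = queue.popleft()
        for parent in UPSTREAM.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def impacted_business_applications(ci: str) -> list[str]:
    """Business Applications that sit upstream of the given CI."""
    return [n for n in upstream_of(ci) if CI_CLASS.get(n) == "Business Application"]


def servers_without_business_application() -> list[str]:
    """Servers with no path to any Business Application (data gaps to fix)."""
    return [
        name
        for name, ci_class in CI_CLASS.items()
        if ci_class == "Server" and not impacted_business_applications(name)
    ]


print(impacted_business_applications("srv-api-01"))
print(servers_without_business_application())
```

The first function is the basis for prioritizing an incident, and the second is a simple data quality check for the Crawl phase relationships.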
Future: Enable Business Impact Analysis, Self-Service and Automation
Create a proper Business Service Portfolio to complement the technical one. Start connecting the users to the services (as subscribers) and enable your Service Portal and workflow automations to use this valuable information when trying to deliver better service to your customers.
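To illustrate how subscriber data could drive automation, here is a minimal Python sketch that works out which users to inform when an application has an issue. The user names, offering names and both mappings are hypothetical. In practice this information would come from service offering subscriptions on the platform.

```python
# Hypothetical subscriptions: user -> service offerings they consume
SUBSCRIPTIONS = {
    "driver.anna": {"Delivery Routing"},
    "driver.ben": {"Delivery Routing", "Timesheets"},
    "clerk.cara": {"Timesheets"},
}

# Hypothetical mapping: service offering -> business application behind it
OFFERING_TO_APPLICATION = {
    "Delivery Routing": "Sleek Logistics System",
    "Timesheets": "HR Portal",
}


def users_to_notify(application: str) -> list[str]:
    """Return the users subscribed to any offering backed by the given application."""
    offerings = {
        offering
        for offering, app in OFFERING_TO_APPLICATION.items()
        if app == application
    }
    return sorted(user for user, subs in SUBSCRIPTIONS.items() if subs & offerings)


print(users_to_notify("Sleek Logistics System"))
```

The same lookup, run in the other direction, could pre-fill a self-service form with the services a user actually consumes.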
Following this expensive occurrence, a Proof-of-Concept project was started to simulate a different outcome: what would have happened if proper data had been in place and available to the people involved during the incident.
We followed the three key takeaways mentioned above. Since CSDM basically defines how data should be modelled and populated into CMDB, it was chosen as a reference data model.
Only two critical business applications were selected for the simulation. Not surprisingly, one of them was the application related to the Major Incident. We basically simulated how the incident process could have gone with all the relevant details in place.
It truly was a real eye opener for most of the people involved.
Simulating a Different Outcome
Minimum Requirement: Responsibilities Clearly Defined
- The first person receiving the Incident would have immediately known who to contact and ask for more information.
Next Step: Relationships in Place
- The original server failure would have been identified as part of a critical Business Application and the Server Operations team would have prioritized it accordingly.
- Key stakeholders could have been informed about the issue before anyone even called the Service Desk.
- Service Desk agents could have connected the End User reported incidents to the server failure and the issue could have been solved already on Friday, essentially avoiding the incident escalating.
- With a Business Service Portfolio you get more insight into business impact by connecting the users to the applications via Service Offerings. This model was used to link the IT Infrastructure to the Business Processes. Both had existed before the incident, but nothing connected them.
- The previous approach of linking via Technical Services made Dependency Views look like spider webs, where everything is connected to everything and the true impact is hidden in the web.
The future part was not included in the Proof-of-Concept, but here are the additional benefits we know could have been realized by doing it as well:
- We could have known which services are consumed by which users (service offering subscriptions)
- We could have provided dynamic portal and mobile user experiences based on subscribed services
- We could have easily (and probably automatically) informed users about issues related to their services
Summing it up
Imagine if the first caller to the Service Desk had been identified as a user of the Logistics System in question. Alternatively, he or she might have created the ticket from a self-service portal where all these details would be pre-filled based on the subscription details.
The Service Desk agent would have immediately seen that there is an open Incident under the application. Or even better, the original server event would have been assigned the proper priority in the beginning. Relevant stakeholders could have been informed about the issue as it happened.
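A minimal Python sketch of that Service Desk lookup follows. It assumes each incident already records the business application its CI rolls up to and that the caller's applications are known. The incident numbers, user names and both data structures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    number: str
    application: str  # business application the affected CI rolls up to
    is_open: bool = True


# Hypothetical data: which applications each user relies on
USER_APPLICATIONS = {
    "driver.anna": ["Sleek Logistics System"],
    "clerk.cara": ["HR Portal"],
}

# Hypothetical incident list
INCIDENTS = [
    Incident("INC0001", "Sleek Logistics System"),   # Friday's server failure
    Incident("INC0002", "HR Portal", is_open=False),  # already resolved
]


def related_open_incidents(caller: str) -> list[str]:
    """Open incidents on any application the caller is known to use."""
    applications = set(USER_APPLICATIONS.get(caller, []))
    return [
        incident.number
        for incident in INCIDENTS
        if incident.is_open and incident.application in applications
    ]


print(related_open_incidents("driver.anna"))
```

With the caller identified, the agent sees the existing incident immediately instead of starting a new investigation from a vague description.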
It is quite easy to keep things simple for the Service Desk Agents and End Users when you have the underlying data in order. To know for sure, you need to have proper tools in place.
The difference between believing and knowing your data is good was worth 2 Million Euros in this case.
Mitigating Your Risk
A CMDB is only good when the data is reliable, up-to-date, and actually USED by other products and processes. It is often difficult to assign a euro or dollar value to quality data. Unfortunately, the monetary analysis is often only considered when an ugly incident like the one described here surfaces.
Data Content Manager is a tool which helps improve and maintain the quality of your data. With DCM you can enforce any data models on the ServiceNow platform and make sure that you get the most out of its fantastic capabilities. With it, you can significantly reduce the risk of a 2 Million Euro incident like this one taking place.