Mitigate the Risk of Cloud Downtime and Data Loss

Senior leadership is asking difficult questions about the organization’s dependency on third-party cloud services and the risk that poses.
IT leaders have limited control over third-party incidents, including cloud service incidents. Yet they are on the hot seat when cloud services go down.
While vendors have swooped in to provide resilience options for the more common SaaS solutions, that is not the case for all cloud services.

Our Advice

Critical Insight

Having no control over the software does not mean having no recovery options. Solutions range from designing an IT workaround using alternate technologies, to predefined third-party service continuity options (e.g. those available for O365), to business workarounds.
Even where there is limited control, you can at least define an incident response plan to streamline notification, assessment, and implementation of workarounds. Leadership wants more options than simply waiting for the service to come back online.
At a minimum, IT’s responsibility is to identify and communicate risk to senior leadership. That starts with a vendor review to identify SLA issues and overall resilience gaps.

Impact and Result

Follow a structured process to assess cloud resilience risk.
Identify opportunities to mitigate risk – at the very least, ensure critical data is protected.
Summarize cloud services risk, mitigation options, and incident response for senior leadership.

Mitigate the Risk of Cloud Downtime and Data Loss Research & Tools

1. Mitigate the Risk of Cloud Downtime and Data Loss – Step-by-step guide to assess risk, identify risk mitigation options, and create an incident response plan.

Even where there is limited control, you can define an incident response plan to streamline notification, assessment, and implementation of workarounds.

Mitigate the Risk of Cloud Downtime and Data Loss Storyboard

2. Cloud Services Incident Risk and Mitigation Review – Review your key cloud vendors’ SLAs, incident preparedness, and data protection strategy.

At a minimum, IT’s responsibility is to identify and communicate risk to senior leadership. That starts with a vendor review to identify SLA and overall resilience gaps.

Cloud Services Incident Risk and Mitigation Review Tool

3. SaaS Incident Response Workflows – Use these examples to guide your efforts to create cloud incident response workflows.

The examples illustrate different approaches to incident response depending on the criticality of the service and options available.

SaaS Incident Response Workflows

4. Cloud Services Resilience Summary – Use this template to capture your results.

Summarize cloud services risk, mitigation options, and incident response for senior leadership.

Cloud Services Resilience Summary

Mitigate the Risk of Cloud Downtime and Data Loss

Resilience and disaster recovery in an increasingly Cloudy and SaaSy world.

Analyst Perspective

If you think cloud means you don’t need a response plan, then get your resume ready.

Most organizations now recognize that they can’t ignore the risk of a cloud outage or data loss. The challenge, given that there is limited control, is “what can I do about it?”

If you still think “it’s in the cloud, so I don’t need to worry about it,” then get your resume ready. When O365 goes down, your executives are calling IT, not Microsoft, to ask what’s being done and what they can do in the meantime to get the business up and running again.

The key is to recognize what you can control and what actions you can take to evaluate and mitigate risk. At a minimum, you can ensure senior leadership is aware of the risk and define a plan for how you will respond to an incident, even if that is limited to monitoring and communicating status.

Often you can do more, including defining IT workarounds, backing up your SaaS data for additional protection, and using business process workarounds to bridge the gap, as illustrated in the case studies in this blueprint.

Frank Trovato
Research Director, Infrastructure & Operations

Info-Tech Research Group

Use this blueprint to expand your DRP and BCP to account for cloud services

As more applications are migrated to cloud-based services, disaster recovery plans (DRP) and business continuity plans (BCP) must include an understanding of cloud risks and actions to mitigate those risks. This includes evaluating vendor and service reliability and resilience, security measures, data protection capabilities, and technology and business workarounds if there is a cloud outage or incident.

Use the risk assessments and cloud service incident response plans developed through this blueprint to supplement your DRP and BCP as well as further inform your crisis management plans (e.g. account for cloud risks in your crisis communication planning).

Overall Business Continuity Plan

IT Disaster Recovery Plan

A plan to restore IT application and infrastructure services following a disruption.

Info-Tech’s Disaster Recovery Planning blueprint provides a methodology for creating the IT DRP. Leverage this blueprint to validate and provide inputs for your IT DRP.

BCP for Each Business Unit

A set of plans to resume business processes for each business unit.

Info-Tech’s Develop a Business Continuity Plan blueprint provides a methodology for creating business unit BCPs as part of an overall BCP for the organization.

Crisis Management Plan

A plan to manage a wide range of crises, from health and safety incidents to business disruptions to reputational damage.

Info-Tech’s Implement Crisis Management Best Practices blueprint provides a framework for planning a response to any crisis, from health and safety incidents to reputational damage.

Executive Summary

Your Challenge

Senior leadership is asking difficult questions about the organization’s dependency on third-party cloud services and the risk that poses.
Migrating to cloud services transfers much of the responsibility for day-to-day platform maintenance but not accountability for resilience.
IT leaders are often responsible for not just the organization’s IT DRP but also BCP and other elements of overall resilience. Cloud risk adds another element IT leaders need to consider.

Common Obstacles

IT leaders have limited control over third-party incidents and that includes cloud services.
With SaaS services in particular, recovery or continuity options may be limited.
While vendors have swooped in to provide resilience options for the more common SaaS solutions, that is not the case for all cloud services.
Part of the solution is defining business process workarounds and that depends on cooperation from business leaders.

Info-Tech’s Approach

At a minimum, IT’s responsibility is to identify and communicate risk to senior leadership. That starts with a vendor review to identify SLA and overall resilience gaps.
Adapt how you approach downtime and data loss risk, particularly for SaaS solutions where there is limited or no control over the system.
Even where there is limited control, you can define an incident response plan to streamline notification, assessment, and implementation of workarounds. Leadership wants more options than simply waiting for the service to come back online.

Info-Tech Insight

Asking vendors about their DRP, BCP, and overall resilience has become commonplace. Expect your vendors to provide answers so you can assess risk. Furthermore, your vendor may have additional offerings to increase resilience or recommendations for third parties who can further assist your goals of improving cloud service resilience.

Key deliverable

Cloud Services Resilience Summary

Provide leadership with a summary of cloud risk, downtime workarounds implemented, and additional data protection.

Additional tools and templates in this blueprint

Cloud Services Incident Risk and Mitigation Review Tool

Use this tool to gather vendor input, evaluate vendor SLAs and overall resilience, and track your own risk mitigation efforts.

SaaS Incident Response Workflows

Use the examples in this document as a model to develop your own incident response workflows for cloud outages or data loss.

This blueprint will step you through the following actions to evaluate and mitigate cloud services risk

Assess your cloud risk

Review your cloud services to determine the potential impact of downtime and data loss, vendor SLA gaps, and each vendor’s current resilience.

Identify options to mitigate risk

Explore your cloud vendor’s resilience offerings, third-party solutions, DIY recovery options, and business workarounds.

Create an incident response plan

Document your cloud risk mitigation strategy and incident response plan, which might include a failover strategy, data protection, and/or business continuity.

Cloud Risk Mitigation: assess risk, identify options to mitigate risk, create an incident response plan.

Phase 1: Assess your cloud risk

Phase 1: Assess your cloud risk
Phase 2: Identify options to mitigate risk
Phase 3: Create an incident response plan

Cloud does not guarantee uptime

Public cloud services (e.g. Azure, GCP, AWS) and popular SaaS solutions experience downtime every year.

A few cloud outage examples:

Microsoft Azure AD outage, March 15, 2022: Many users could not log into O365, Dynamics, or the Azure Portal. Cause: software change.
Three AWS outages in December 2021: December 7 (Netflix and others impacted), December 15 (Duo, Zoom, Slack, others), December 20 (Slack, Epic Games, others). Cause: network issues, power outage.
Salesforce outage, May 12, 2022: Users could not access the Lightning platform. Cause: expired certificate.

Cloud availability

Migrating to cloud services can improve availability, as they typically offer more resilience than most organizations can afford to implement themselves.
However, having multiple data centers, zones, and regions doesn’t prevent all outages, as we see every year with even the largest cloud vendors.
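
To make the availability point concrete, the following is a minimal sketch of the arithmetic behind an uptime SLA: it converts an availability percentage into the downtime a vendor could incur and still meet that SLA, then compares that with the downtime the business can tolerate. The function names, the sample SLA percentages, and the four-hour tolerance are illustrative assumptions, not figures from this blueprint or from any vendor.

```python
# Illustrative sketch only: names and sample values are assumptions.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored for simplicity


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime a vendor can incur in the period and still meet the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (1 - sla_percent / 100)


def sla_gap_hours(sla_percent: float, tolerable_hours_per_year: float) -> float:
    """Positive result: the SLA permits more downtime than the business can tolerate."""
    return allowed_downtime_hours(sla_percent) - tolerable_hours_per_year


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% uptime allows about {hours:.2f} hours of downtime per year")
    # Hypothetical business tolerance of 4 hours per year for a critical service
    print(f"Gap at 99.9%: {sla_gap_hours(99.9, 4.0):.2f} hours")
```

A positive gap is one way to flag a service for the vendor review and mitigation steps described in this blueprint. Note that an SLA percentage says nothing about how the downtime is distributed, so a single long outage can still fall within the annual allowance.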

DR challenges for IaaS, PaaS, and cloud-native

While there are limits to what you control, a traditional “failover” DR strategy can often apply.

High-level challenges and resilience options:

IaaS: No control over the hardware, but you can fail over to another region. This is fairly similar to traditional DR.
PaaS: No control over the software platform (e.g. SQL Server as a service), but you can back up your data and explore vendor options to replicate your environment.
Cloud-native applications: As with PaaS, you can back up your data and explore vendor options to replicate your environment.

Plan for resilience

Include DR requirements when designing a cloud service implementation. For example, for IaaS solutions, identify what data would need to be replicated and what services may need to be “always on” (e.g. database services where high availability is required).
Similarly, for PaaS and cloud-native solutions, consult your vendor regarding options to build in resilience (e.g. the ability to fail over to another environment).
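
As a rough illustration of the failover decision described above for IaaS, the sketch below picks the first healthy region from an ordered preference list. It is a simplified sketch under stated assumptions: the region names are hypothetical, the health flags are assumed to come from whatever monitoring you already run, and a real failover also depends on data replication, DNS or traffic routing, and vendor-specific tooling that is not shown here.

```python
from typing import Mapping, Optional, Sequence


def choose_active_region(
    preferred_order: Sequence[str],
    healthy: Mapping[str, bool],
) -> Optional[str]:
    """Return the first region in preferred_order that is reported healthy.

    Regions missing from the health map are treated as unhealthy.
    Returns None when no region is healthy, which is the cue to fall
    back to business workarounds.
    """
    for region in preferred_order:
        if healthy.get(region, False):
            return region
    return None


if __name__ == "__main__":
    # Hypothetical region names and health results
    order = ["primary-region", "secondary-region"]
    status = {"primary-region": False, "secondary-region": True}
    print(choose_active_region(order, status))
```

Keeping the preference order explicit makes it easy to document which environment is the designated failover target, which is one of the DR requirements worth settling at design time.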

DR challenges for SaaS solutions

SaaS is the biggest challenge because you have no control over any part of the base application stack.

High-level challenges and resilience options:

No control over the hardware (or the facility, maintenance processes, and so on).
No control over the base application (control is limited to configuration settings and add-on customizations or integrations).
Options to back up your data will depend on the service.

Note: The rest of this blueprint is focused primarily on SaaS resilience due to the challenges listed here. For other cloud services, leverage traditional DR strategies and vendor management to mitigate risk (as summarized above).

Focus on what you can control

For SaaS solutions in particular, you must toss out traditional DR. If Salesforce has an outage, you won’t be involved in recovering the system.
Instead, DR for SaaS needs to focus on improving resilience where you do have control and implementing business workarounds to bridge the gap.
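
One way to picture a SaaS incident response plan built around what you can control is as a short, ordered list of steps that varies with the criticality of the service and the workarounds available. The sketch below is a hypothetical model, not the workflows from this blueprint's templates: the `CloudService` fields, the step wording, and the rule that workarounds are invoked only for critical services are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CloudService:
    """Hypothetical record for one SaaS service; field names are illustrative."""
    name: str
    critical: bool
    it_workaround: str = ""        # e.g. an alternate technology; empty if none
    business_workaround: str = ""  # e.g. a manual process; empty if none


def response_steps(service: CloudService) -> List[str]:
    """Build an ordered response: notify, assess, apply workarounds, communicate."""
    steps = [
        f"Notify stakeholders that {service.name} is degraded or down",
        "Assess scope and expected duration from vendor status updates",
    ]
    # Assumption: workarounds are worth invoking only for critical services.
    if service.critical:
        if service.it_workaround:
            steps.append(f"Implement IT workaround: {service.it_workaround}")
        if service.business_workaround:
            steps.append(f"Invoke business workaround: {service.business_workaround}")
    steps.append("Monitor and communicate status until the vendor restores service")
    return steps


if __name__ == "__main__":
    crm = CloudService("CRM", critical=True, business_workaround="record orders on paper forms")
    for number, step in enumerate(response_steps(crm), start=1):
        print(f"{number}. {step}")
```

Even when no workaround exists, the plan still produces the notify, assess, and communicate steps, which reflects the minimum response: monitoring and communicating status rather than simply waiting.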

Table of Contents