Free IT Staff Time: Implement Runbook Automation

Info-Tech Advisor: Research Note

Published: April 03, 2007


Presently, IT staff fill the gap between system monitoring apps and helpdesk ticketing apps. Once the monitoring application reports a problem, IT staff must choose and run diagnostics, analyze and determine the problem, and select and implement a solution. Afterwards, IT staff update a helpdesk ticket.

Runbook Automation = Automated Incident Response

Enter runbook automation. Candidates for runbook automation include enterprises with mature operational processes that can find savings of more than $150,000 by mitigating alert floods and automating routine tasks.

Although the name implies an automated or electronic form of traditional data center documentation, the label is misleading. At present, runbook automation solutions from iConclude (owned by Opsware Inc.), Opalis and RealOps – the three startup evangelists in this space – focus only on the troubleshooting/incident management section of traditional runbook documentation. Runbook automation should probably be called "automated incident response."

Traditional runbooks, whether physical or electronic, simply capture processes. Runbook automation moves beyond documentation by automating response processes and routine tasks. It addresses scaling pains, providing incident resolution for dozens of servers while reducing staff time investment, ticket escalation, alert floods, and resolution time. Further, runbook automation empowers ITIL-based (Information Technology Infrastructure Library) incident management by giving practical, automated expression to otherwise high-level directives while capturing a full audit trail.

Runbook automation builds on three key inputs:

  • Application integration. Runbook automation hooks into existing monitoring and tracking apps, drawing necessary information and leaving an audit trail of its activities.
  • Process logic. These are the enterprise's processes for booting and initializing servers and applications, troubleshooting out-of-parameter incidents, and performing graceful shutdown/restart of aberrant services and devices. Some of these processes are drawn from industry standards or vendor documentation, while others are home-grown to account for individual implementation issues.
  • Workflow design. Runbook automation solutions feature an interface for designing workflows. IT administrators use this to convert the process logic into steps and scripts that the solution will perform.

The resulting system performs the following steps:

  • Detection. The solution identifies the flag from the management software and takes charge of the incident.
  • Diagnostics. Following the process logic described above, the runbook automation solution then initiates various diagnostic steps, working through the process workflow.
  • Repair. If the diagnostic steps indicate a problem that the solution can handle automatically, it applies the fix and then checks whether the fix resolved the issue. If the issue is beyond the runbook automation solution's parameters, it escalates the incident to the appropriate helpdesk staff.
  • Documentation. At every step, the runbook automation solution works with the enterprise's ticketing system to open a ticket, document steps, and close the ticket as appropriate.

For example, Microsoft Operations Manager detects an unresponsive Web server. The automated solution notices the flag and assigns itself the incident, opening a ticket in the appropriate help desk solution. The software then works through standard diagnostics, such as pinging the address and checking the status of the SQL server instance on that server. Depending on the results of those diagnostics, the software could then shut down and restart a service or the server itself. Each action is logged in the help desk ticket. If the restart resolves the problem, the solution closes the ticket and can e-mail an event summary to the appropriate administrator.
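The detect, diagnose, repair, and document sequence can be sketched in a few lines of Python. This is only an illustration of the flow, not how any vendor's product works: the `Ticket` class, the `handle_alert` function, and the diagnostic and repair callables are all hypothetical stand-ins for real monitoring, ticketing, and remote-execution integrations. Escalating when no diagnostic fails is likewise an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ticket:
    """Hypothetical stand-in for a help desk ticket."""
    summary: str
    status: str = "open"
    log: List[str] = field(default_factory=list)

    def note(self, entry: str) -> None:
        # Documentation: every action is appended to the ticket's audit trail.
        self.log.append(entry)


def handle_alert(host: str,
                 diagnostics: Dict[str, Callable[[str], bool]],
                 repair: Callable[[str], bool]) -> Ticket:
    """Run one incident through detection, diagnostics, repair, and documentation."""
    # Detection: take ownership of the flagged incident and open a ticket.
    ticket = Ticket(summary=f"Unresponsive Web server: {host}")
    ticket.note("alert received; incident assigned to automation")

    # Diagnostics: run each check in order and record the result.
    failed = []
    for name, check in diagnostics.items():
        ok = check(host)
        ticket.note(f"diagnostic {name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            failed.append(name)

    # Repair: attempt the fix only if a diagnostic failed, then re-check.
    if not failed:
        # Assumption: an alert the diagnostics cannot reproduce goes to a human.
        ticket.note("no diagnostic failed; escalating to help desk staff")
        ticket.status = "escalated"
    elif repair(host) and all(check(host) for check in diagnostics.values()):
        ticket.note("restart resolved the issue; closing ticket")
        ticket.status = "closed"
    else:
        ticket.note("automatic repair did not resolve the issue; escalating")
        ticket.status = "escalated"
    return ticket
```

In practice the diagnostics dictionary would hold callables such as a ping check and a service-status check, and `repair` would restart the service or the server. The returned ticket's `log` is the audit trail that an event summary e-mail could be built from.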

Recommendations

  1. Build mature operational processes. Runbook automation is best suited to enterprises with a mature operations environment that have developed workflows, process documentation, and triage processes. Enterprises should hold off on runbook automation until the processes are designed and tested.
  2. Determine suitability by examining workload. Enterprises must determine whether the workload that runbook automation would alleviate consumes enough IT resources to offset the $150,000+ price tag. Some solutions start as low as $50,000, but most start in six figures and quickly escalate toward $400,000. Ideal candidates are enterprises with extensive server farms, department-crippling alert floods, and many repetitive tasks to perform across many systems.
  3. Compare cost and time to implement against expected reduction in incident-resolution time. Work with the vendor to select three to five labor-intensive tasks where a runbook automation implementation will return the greatest benefit. Estimate the fiscal savings in staff time, eliminated downtime, and recovered revenue. Compare this against the cost to implement the software, including software prices and staff time required to fully integrate the enterprise's processes into the solution.
  4. Look for these key factors in a solution:
    • Integration with existing monitoring and ticketing apps.
    • Pre-fabricated process flows for industry-standard products, based on ITIL and industry best practices.
    • A vendor-supported community that designs and shares runbook automation workflows for their solution. Ideally, these workflows will be certified by vendors or trusted third parties.
  5. Save time later, not now. Runbook automation solutions promise time savings, but require considerable effort to install and integrate. Look for long-term savings, but in the near term expect to dedicate resources to define workflows and tune the system.
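The cost comparison in recommendations 2 and 3 comes down to simple arithmetic, sketched below. The function names and every input figure other than the $150,000 software price are hypothetical placeholders; each enterprise would substitute its own incident counts, staff rates, and integration estimates.

```python
def annual_savings(incidents_per_year: float,
                   hours_saved_per_incident: float,
                   hourly_staff_cost: float,
                   downtime_cost_avoided: float = 0.0) -> float:
    """Estimated yearly savings: staff time recovered plus avoided downtime cost."""
    staff_savings = incidents_per_year * hours_saved_per_incident * hourly_staff_cost
    return staff_savings + downtime_cost_avoided


def payback_years(software_cost: float,
                  integration_cost: float,
                  savings_per_year: float) -> float:
    """Years needed for savings to cover software plus integration effort."""
    if savings_per_year <= 0:
        return float("inf")  # the investment never pays back
    return (software_cost + integration_cost) / savings_per_year


# Illustrative inputs only (assumptions, not figures from vendors or surveys).
savings = annual_savings(incidents_per_year=4000,
                         hours_saved_per_incident=0.5,
                         hourly_staff_cost=50.0)
years = payback_years(software_cost=150_000,
                      integration_cost=50_000,
                      savings_per_year=savings)
print(f"Estimated annual savings: ${savings:,.0f}; payback in {years:.1f} years")
```

With these placeholder inputs the staff-time savings come to $100,000 a year, so a $200,000 total outlay pays back in two years. Running the same calculation for the three to five tasks selected with the vendor gives a quick sense of whether the workload justifies the price.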

Bottom Line

Runbook automation bridges the gap between monitoring and ticketing applications by providing automated incident response. Enterprises with mature operational processes that can find savings of more than $150,000 by eliminating alert floods and repetitive tasks are ideal candidates.
