AI in Operations: From Chaos to Self-Healing

Sep 19, 2025·By James Gilding

The New Ops Playbook: From Incident Chaos to Self-Healing Systems

For many organisations, the last eighteen months have marked a tipping point. What used to be cautious conversations about the potential of automation and digital transformation are now practical discussions about deployment at scale. Boards and operational leaders alike are realising that incidents are not rare “black swans”; they are a daily reality. The challenge is how to keep delivering resilience while freeing up time to innovate.

Complexity is the default

Systems are more interconnected than ever: cloud sprawl, security risks, continuous deployments, legacy integrations, rising customer expectations. Add to this a steady stream of new initiatives and the result is a web of complexity that even the most capable teams struggle to manage. In fact, studies suggest that more than half of developer time is consumed not by innovation but by managing operational issues.

Why the playbook is shifting

Two years ago, executives tended to be bullish about new technology while frontline teams were more sceptical. That gap has narrowed. Leaders have seen credible use cases emerge: incidents resolved faster, repetitive fixes automated, and new practices turning reactive firefighting into proactive improvement. The conversation has matured from “if” to “how.”

The new playbook recognises three realities:

Incidents are inevitable. Disruptions will continue to occur; pretending otherwise is wishful thinking.
The pace of change is accelerating. Code is shipped faster than ever, and operational teams must keep up.
Time is finite. Every hour spent on manual triage is an hour not spent on innovation, customer experience, or revenue growth.

From firefighting to resilience

Organisations that thrive are those that deliberately rebalance their efforts:

Reduce unplanned work. Push routine and repeatable issues into structured responses, freeing teams from late-night firefights.
Empower responders. Equip frontline staff with the right context, knowledge, and tools so they can act quickly without endless hand-offs.
Institutionalise learning. Treat post-incident reviews as opportunities to build reusable fixes and golden paths, rather than one-off documents.
Design for flow. Keep engineers focused on building and innovating by shielding them from noise and distraction.

Case studies to learn from

Across industries, leaders are showing what “better” looks like:

Global transport and travel firms have streamlined how they handle tens of thousands of operational events each day, carving out thousands of engineering hours for higher-value work.
Digital platforms are standardising development paths and embedding continuous learning into their technical teams, ensuring innovation doesn’t come at the cost of resilience.
Telecoms giants have demonstrated that “operations as code” can radically increase release cadence — from dozens per month to hundreds — by unlocking time previously trapped in incident churn.

The specific technologies vary, but the common thread is clear: automation, intelligent operations, and new ways of working are giving time back to teams and strengthening resilience.

Planned vs unplanned work: the metric that matters

It’s tempting to obsess over metrics like MTTR (mean time to resolve). But the bigger prize is shifting the balance between planned and unplanned work. Planned work drives strategy, customer value, and innovation. Unplanned work consumes energy, causes burnout, and erodes resilience.

Every organisation should ask: What percentage of our capacity is lost to unplanned work, and what will we do about it?
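
For teams that want to put a rough number on that question, here is a minimal sketch in Python. It assumes effort is already logged in hours and tagged as planned or unplanned. The `WorkItem` fields, the `unplanned_share` helper, and the sample figures are all hypothetical, chosen purely for illustration, and are not drawn from any client or tool.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    """One unit of logged engineering effort (hypothetical schema)."""
    description: str
    hours: float
    planned: bool  # True for roadmap work, False for incidents and interrupts


def unplanned_share(items: list[WorkItem]) -> float:
    """Return the fraction of total logged hours spent on unplanned work."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0  # nothing logged, so nothing lost to unplanned work
    unplanned = sum(item.hours for item in items if not item.planned)
    return unplanned / total


# Illustrative numbers only, not data from any real team.
week = [
    WorkItem("Feature build", 60.0, planned=True),
    WorkItem("Incident triage", 25.0, planned=False),
    WorkItem("Manual restart of a stuck job", 15.0, planned=False),
]

print(f"Unplanned work: {unplanned_share(week):.0%}")
```

Tracked week over week, a figure like this is arguably more telling than MTTR alone, because it shows whether capacity is actually moving back towards planned work.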

Where I fit in

Contrary to the photo (AI powered!), my core focus is organisational redesign, growth, and performance enhancement, underpinned by business process improvement. My involvement in back-office IT stacks and wider operations has grown naturally from this work, often through client requests and the realities of BPI initiatives. I don’t position myself as a systems consultant. My expertise lies in spotting operational opportunities, shaping organisational models, and helping leaders align people, processes, and technology.

The platforms, from established incident management suites to emerging AI-assisted tools, are already out there. But without the right structures, ownership, and cultural approach, their potential goes unrealised. That’s where I come in. I work alongside leading partners who provide the technology foundations, while I ensure the conditions for success: clear accountabilities, strong feedback loops, and a roadmap that balances resilience with growth.

The opportunity ahead

The pace of change in operations isn’t slowing down. Customer expectations, security threats, and the complexity of digital ecosystems are only rising. The organisations that will win are those that:

Accept complexity as the norm.
Reduce unplanned work by standardising and automating the repeatable.
Invest in resilience not just as an IT goal, but as a business imperative.
Align their people and processes to make the most of new tools and approaches.

Bottom line: The new playbook isn’t about chasing every shiny tool. It’s about shifting the balance, from chaos to resilience, from firefighting to foresight, and from incident management to intelligent operations. Technology alone won’t get you there. Success comes from redesigning organisations, improving processes, and creating the right structures so new approaches can stick.

That’s where I help. If you’re looking to reduce unplanned work, strengthen resilience, and free your teams to focus on growth, let’s talk. Give me a call or drop me a line to explore how we can shape the right playbook for your organisation.