Survey: When AI Factories Fail, 6 in 10 Enterprises Cannot Tell You Why

New Virtana Study Finds Enterprises Scaling AI Faster Than They Can Govern It

PALO ALTO, Calif.--(BUSINESS WIRE)--Two-thirds of enterprises are running AI infrastructure without system-level visibility, creating a fragile foundation beneath rapidly expanding AI deployments. New research from Virtana found that as AI adoption accelerates, a new operational reality is emerging: innovation is outpacing control.

The AI Factory Reality Check study, based on a survey of 788 US enterprise decision-makers, examines how AI factories operate under real conditions. More than half of respondents are already scaling AI across teams without addressing the system-level observability required to understand and control it. The study documents a widening disconnect between AI factory expansion and the operational foundation needed to sustain it.

“Modern enterprises, including banks, telcos, insurers and airlines, are increasingly dependent on AI-driven services. As a result, one of the greatest risks to the business is any disruption across these AI systems, where failures across applications or underlying infrastructure directly translate into business impact,” said Paul Appleby, CEO of Virtana. “AI systems function as interconnected systems, where infrastructure, data pipelines, token consumption, and model behavior continuously influence outcomes. Yet most organizations still monitor these elements in silos. Without system-wide understanding of these dependencies, they cannot explain how outcomes are produced, control cost, or determine whether those outcomes can be trusted.”

Enterprise AI Has Scaled. Control Has Not.

Enterprise AI has moved beyond pilots into at-scale operations. Fifty-four percent of organizations are already scaling AI across teams, while another 23% are managing production workloads alongside infrastructure expansion. At the largest enterprises, particularly those above $10 billion in revenue, this creates systems that are increasingly difficult to understand and control.

As AI factories scale, system-level observability is not keeping pace. Organizations are expanding AI without the visibility required to understand performance, control cost, or manage risk across the full stack. Instead, critical investments in the operational foundation are being deferred:

56% of enterprises are deferring legacy infrastructure modernization
54% are deprioritizing cost optimization initiatives

At the same time, cost pressures are forcing enterprises to continuously reconfigure their AI systems, often without the visibility to understand the impact of those changes. Eighty percent of enterprises report that the cost of premium AI hardware is reshaping infrastructure decisions. In response:

60% are shifting workloads across hybrid environments
58% are accelerating consolidation to improve per-unit efficiency

These are structural changes to live systems under load. Each shift alters dependencies, resource contention, and performance characteristics across the stack.

“Without system-level observability, organizations cannot determine how these changes affect outcomes, cost, or reliability. As a result, they are continuously optimizing AI systems they do not fully understand, introducing risk with every change,” continued Appleby.

Inside the AI Factory, Visibility Is the Missing Variable

As AI factories scale, visibility is emerging as the missing variable in understanding and controlling system behavior. The research shows that as enterprises expand AI, disparities in system understanding and operational control are becoming more pronounced:

66% of enterprises are operating AI infrastructure without reliable performance baselines
Only 34% describe AI workload performance as highly predictable
That drops to 25% at organizations with more than 50,000 employees

This lack of visibility extends into incident response:

59% cannot automatically identify root cause across infrastructure domains when an alert fires
25% still rely on manual investigations across disconnected consoles as their first response

When AI systems break, they do not fail cleanly. System understanding degrades, forcing teams into reactive analysis while high-cost GPU capacity sits underutilized, issues compound, and outcomes can no longer be fully explained or controlled.

“These are not abstract concerns,” continued Appleby. “As AI becomes core enterprise infrastructure, a clear divide is emerging between organizations that understand how their systems produce outcomes and those that cannot explain or control them. Without visibility across models, tokens, GPUs, and infrastructure, teams absorb hidden cost, performance gaps, and ungoverned risk. Those that understand their systems gain end-to-end visibility and control so they can optimize cost in real time, ensure reliable performance, and prove outcomes. The result is declining resilience, eroding trust, and constrained growth as AI becomes infrastructure that must be governed and optimized at scale.”

ROI Visibility Is the Prerequisite Enterprises Cannot Defer

The study reveals a disconnect between how AI systems operate and how they are observed. A 17-point gap exists between Infra/SRE practitioners and executives on automated root cause capabilities:

69% of Infra/SRE teams report lacking automated cross-domain root cause
52% of executives report the same

This gap reflects a broader breakdown in system-level observability, where critical signals remain fragmented across the stack:

57% cite cost and efficiency metrics as a top challenge
56% cite GPU utilization tracking
52% cite data pipeline visibility

These challenges span business outcomes, AI infrastructure, and data dependencies, yet are still managed in isolation.

GPU cost and utilization remain the most difficult operational challenge for 35% of enterprises, with impact varying by role:

39% of executives experience it as financial accountability pressure
36% of architects cite integration complexity in distributed environments
22% of Infra/SRE teams face it as a scaling and reliability challenge

This variation reflects how different parts of the organization see different fragments of the same system, without a unified view of cause and effect.

Across all roles and revenue bands, enterprise priorities are consistent:

38% need unified visibility across AI and infrastructure layers
32% need AI-driven root cause analysis without manual correlation

Together, these priorities point to a single requirement: system-aware observability that connects performance, cost, and outcomes across the full stack. Today, most enterprises are operating AI systems they cannot fully observe or explain.

Virtana Expands AI Factory Observability with Dell Technologies Partnership

Also announced today, Virtana is extending its Agentic AI-powered observability platform to support Dell’s AI Factory infrastructure, bringing system-aware intelligence across the full AI factory stack, from GPUs and infrastructure to models and AI workloads. Now organizations running Dell-based AI factories can apply continuous, cross-domain analysis across the entire execution system.

Virtana’s autonomous agents correlate GPU utilization, token demand, model behavior, and underlying infrastructure performance in real time, delivering automated, evidence-backed root cause analysis where system complexity is highest. This support moves teams beyond siloed GPU monitoring and fragmented tooling. Instead of chasing signals, operators get clear answers tied to the actual system constraint driving latency, failures, or cost.

Resources

Download the AI Factory Reality Check research report
Learn more at virtana.com
Learn more about Virtana AI Factory Observability
Read the blog: AI Factories Are Breaking Traditional Infrastructure—Here’s How We’re Fixing It
Follow Virtana on LinkedIn and X

Research Methodology

The AI Factory Reality Check is based on an independent survey of 788 US-based professionals at enterprise organizations actively running, piloting, or planning AI workloads in production, with decision-making or significant influence over IT infrastructure, AI strategy, or technology investment. Respondents include application, service, and AI engineering professionals (307), executive leadership (270), infrastructure, cloud, and reliability engineering teams (120), and architects and platform designers (91). Organizations range from under 1,000 to more than 50,000 employees, spanning revenue bands from under $500 million to more than $10 billion.

About Virtana

Virtana delivers the deepest and broadest observability platform for hybrid and multi-cloud, with full-stack AI observability spanning applications, services, data pipelines, GPUs, CPUs, networks, and storage. Powered by high-fidelity data and agentic AI, Virtana provides unmatched visibility across end-to-end IT services and AI workloads, correlating health, performance, cost, and user impact in real time. With advanced event intelligence and autonomous insight generation, Virtana delivers clarity no other provider can match. Trusted by Global 2000 enterprises and public sector organizations, Virtana helps IT operations and DevOps teams reduce risk, strengthen resilience, improve efficiency, and modernize with confidence across multi-cloud, on-premises, and edge environments. Learn more at virtana.com

Contacts

Media Contact
Stephanie Floyd
Bhava Communications for Virtana
virtana@bhavacom.com
