Survey: When AI Factories Fail, 6 in 10 Enterprises Cannot Tell You Why
New Virtana Study Finds Enterprises Scaling AI Faster Than They Can Govern It
PALO ALTO, Calif.--(BUSINESS WIRE)--Two-thirds of enterprises are running AI infrastructure without system-level visibility, creating a fragile foundation beneath rapidly expanding AI deployments. New research from Virtana found that as AI adoption accelerates, a new operational reality is emerging: innovation is outpacing control.
The AI Factory Reality Check study, based on a survey of 788 US enterprise decision-makers, examines how AI factories operate under real conditions. More than half of respondents are already scaling AI across teams without the system-level observability required to understand and control it. The study documents a widening disconnect between AI factory expansion and the operational foundation needed to sustain it.
“Modern enterprises, including banks, telcos, insurers and airlines, are increasingly dependent on AI-driven services. As a result, one of the greatest risks to the business is any disruption across these AI systems, where failures across applications or underlying infrastructure directly translate into business impact,” said Paul Appleby, CEO of Virtana. “AI systems function as interconnected systems, where infrastructure, data pipelines, token consumption, and model behavior continuously influence outcomes. Yet most organizations still monitor these elements in silos. Without system-wide understanding of these dependencies, they cannot explain how outcomes are produced, control cost, or determine whether those outcomes can be trusted.”
Enterprise AI Has Scaled. Control Has Not.
Enterprise AI has moved beyond pilots into at-scale operations. Fifty-four percent of organizations are already scaling AI across teams, while another 23% are managing production workloads alongside infrastructure expansion. At the largest enterprises, particularly those above $10 billion in revenue, this creates systems that are increasingly difficult to understand and control.
As AI factories scale, system-level observability is not keeping pace. Organizations are expanding AI without the visibility required to understand performance, control cost, or manage risk across the full stack. Instead, critical investments in the operational foundation are being deferred:
- 56% of enterprises are deferring legacy infrastructure modernization
- 54% are deprioritizing cost optimization initiatives
At the same time, cost pressures are forcing enterprises to continuously reconfigure their AI systems, often without the visibility to understand the impact of those changes. Eighty percent of enterprises report that the cost of premium AI hardware is reshaping infrastructure decisions. In response:
- 60% are shifting workloads across hybrid environments
- 58% are accelerating consolidation to improve per-unit efficiency
These are structural changes to live systems under load. Each shift alters dependencies, resource contention, and performance characteristics across the stack.
“Without system-level observability, organizations cannot determine how these changes affect outcomes, cost, or reliability. As a result, they are continuously optimizing AI systems they do not fully understand, introducing risk with every change,” continued Appleby.
Inside the AI Factory, Visibility Is the Missing Variable
As AI factories scale, visibility is emerging as the missing variable in understanding and controlling system behavior. The research shows that as enterprises expand AI, disparities in system understanding and operational control are becoming more pronounced:
- 66% of enterprises are operating AI infrastructure without reliable performance baselines
- Only 34% describe AI workload performance as highly predictable
- That figure drops to 25% at organizations with more than 50,000 employees
This lack of visibility extends into incident response:
- 59% cannot automatically identify root cause across infrastructure domains when an alert fires
- 25% still rely on manual investigations across disconnected consoles as their first response
When AI systems break, they do not fail cleanly. System understanding degrades, forcing teams into reactive analysis while high-cost GPU capacity sits underutilized, issues compound, and outcomes can no longer be fully explained or controlled.
“These are not abstract concerns,” continued Appleby. “As AI becomes core enterprise infrastructure, a clear divide is emerging between organizations that understand how their systems produce outcomes and those that cannot explain or control them. Without visibility across models, tokens, GPUs, and infrastructure, teams absorb hidden cost, performance gaps, and ungoverned risk. Those that understand their systems gain end-to-end visibility and control so they can optimize cost in real time, ensure reliable performance, and prove outcomes. The result is declining resilience, eroding trust, and constrained growth as AI becomes infrastructure that must be governed and optimized at scale.”
ROI Visibility Is the Prerequisite Enterprises Cannot Defer
The study reveals a disconnect between how AI systems operate and how they are observed. A 17-point gap exists between Infra/SRE practitioners and executives on automated root cause capabilities:
- 69% of Infra/SRE teams report lacking automated cross-domain root cause
- 52% of executives report the same
This gap reflects a broader breakdown in system-level observability, where critical signals remain fragmented across the stack:
- 57% cite cost and efficiency metrics as a top challenge
- 56% cite GPU utilization tracking
- 52% cite data pipeline visibility
These challenges span business outcomes, AI infrastructure, and data dependencies, yet are still managed in isolation.
GPU cost and utilization remain the most difficult operational challenge for 35% of enterprises, with impact varying by role:
- 39% of executives experience it as financial accountability pressure
- 36% of architects cite integration complexity in distributed environments
- 22% of Infra/SRE teams face it as a scaling and reliability challenge
This variation reflects how different parts of the organization see different fragments of the same system, without a unified view of cause and effect.
Across all roles and revenue bands, enterprise priorities are consistent:
- 38% need unified visibility across AI and infrastructure layers
- 32% need AI-driven root cause analysis without manual correlation
Together, these priorities point to a single requirement: system-aware observability that connects performance, cost, and outcomes across the full stack. Today, most enterprises are operating AI systems they cannot fully observe or explain.
Virtana Expands AI Factory Observability with Dell Technologies Partnership
Also announced today, Virtana is extending its Agentic AI-powered observability platform to support Dell’s AI Factory infrastructure, bringing system-aware intelligence across the full AI factory stack, from GPUs and infrastructure to models and AI workloads. Now organizations running Dell-based AI factories can apply continuous, cross-domain analysis across the entire execution system.
Virtana’s autonomous agents correlate GPU utilization, token demand, model behavior, and underlying infrastructure performance in real time, delivering automated, evidence-backed root cause analysis where system complexity is highest. This support moves teams beyond siloed GPU monitoring and fragmented tooling. Instead of chasing signals, operators get clear answers tied to the actual system constraint driving latency, failures, or cost.
Resources
- Download the AI Factory Reality Check research report
- Learn more at virtana.com
- Learn more about Virtana AI Factory Observability
- Read the blog: AI Factories Are Breaking Traditional Infrastructure—Here’s How We’re Fixing It
- Follow Virtana on LinkedIn and X
Research Methodology
The AI Factory Reality Check is based on an independent survey of 788 US-based professionals at enterprise organizations actively running, piloting, or planning AI workloads in production, with decision-making or significant influence over IT infrastructure, AI strategy, or technology investment. Respondents include application, service, and AI engineering professionals (307), executive leadership (270), infrastructure, cloud, and reliability engineering teams (120), and architects and platform designers (91). Organizations range from under 1,000 to more than 50,000 employees, spanning revenue bands from under $500 million to more than $10 billion.
About Virtana
Virtana delivers the deepest and broadest observability platform for hybrid and multi-cloud, with full-stack AI observability spanning applications, services, data pipelines, GPUs, CPUs, networks, and storage. Powered by high-fidelity data and agentic AI, Virtana provides unmatched visibility across end-to-end IT services and AI workloads, correlating health, performance, cost, and user impact in real time. With advanced event intelligence and autonomous insight generation, Virtana delivers clarity no other provider can match. Trusted by Global 2000 enterprises and public sector organizations, Virtana helps IT operations and DevOps teams reduce risk, strengthen resilience, improve efficiency, and modernize with confidence across multi-cloud, on-premises, and edge environments. Learn more at virtana.com
Contacts
Media Contact
Stephanie Floyd
Bhava Communications for Virtana
virtana@bhavacom.com
