Survey: When AI Factories Fail, 6 in 10 Enterprises Cannot Tell You Why

New Virtana Study Finds Enterprises Scaling AI Faster Than They Can Govern It

PALO ALTO, Calif.--(BUSINESS WIRE)--Two-thirds of enterprises are running AI infrastructure without system-level visibility, creating a fragile foundation beneath rapidly expanding AI deployments. New research from Virtana found that as AI adoption accelerates, a new operational reality is emerging: innovation is outpacing control.

The AI Factory Reality Check study, based on a survey of 788 US enterprise decision-makers, examines how AI factories operate under real conditions. More than half of respondents are already scaling AI across teams without addressing the system-level observability required to understand and control AI. The study documents a widening disconnect between AI factory expansion and the operational foundation needed to sustain it.

“Modern enterprises, including banks, telcos, insurers and airlines, are increasingly dependent on AI-driven services. As a result, one of the greatest risks to the business is any disruption across these AI systems, where failures across applications or underlying infrastructure directly translate into business impact,” said Paul Appleby, CEO of Virtana. “AI systems function as interconnected systems, where infrastructure, data pipelines, token consumption, and model behavior continuously influence outcomes. Yet most organizations still monitor these elements in silos. Without system-wide understanding of these dependencies, they cannot explain how outcomes are produced, control cost, or determine whether those outcomes can be trusted.”

Enterprise AI Has Scaled. Control Has Not.

Enterprise AI has moved beyond pilots into at-scale operations. Fifty-four percent of organizations are already scaling AI across teams, while another 23% are managing production workloads alongside infrastructure expansion. At the largest enterprises, particularly those above $10 billion in revenue, this creates systems that are increasingly difficult to understand and control.

As AI factories scale, system-level observability is not keeping pace. Organizations are expanding AI without the visibility required to understand performance, control cost, or manage risk across the full stack. Instead, critical investments in the operational foundation are being deferred:

  • 56% of enterprises are deferring legacy infrastructure modernization
  • 54% are deprioritizing cost optimization initiatives

At the same time, cost pressures are forcing enterprises to continuously reconfigure their AI systems, often without the visibility to understand the impact of those changes. Eighty percent of enterprises report that the cost of premium AI hardware is reshaping infrastructure decisions. In response:

  • 60% are shifting workloads across hybrid environments
  • 58% are accelerating consolidation to improve per-unit efficiency

These are structural changes to live systems under load. Each shift alters dependencies, resource contention, and performance characteristics across the stack.

“Without system-level observability, organizations cannot determine how these changes affect outcomes, cost, or reliability. As a result, they are continuously optimizing AI systems they do not fully understand, introducing risk with every change,” continued Appleby.

Inside the AI Factory, Visibility Is the Missing Variable

As AI factories scale, visibility is emerging as the missing variable in understanding and controlling system behavior. The research shows that as enterprises expand AI, disparities in system understanding and operational control are becoming more pronounced:

  • 66% of enterprises are operating AI infrastructure without reliable performance baselines
  • Only 34% describe AI workload performance as highly predictable
  • That drops to 25% at organizations with more than 50,000 employees

This lack of visibility extends into incident response:

  • 59% cannot automatically identify root cause across infrastructure domains when an alert fires
  • 25% still rely on manual investigations across disconnected consoles as their first response

When AI systems break, they do not fail cleanly. System understanding degrades, forcing teams into reactive analysis while high-cost GPU capacity sits underutilized, issues compound, and outcomes can no longer be fully explained or controlled.

“These are not abstract concerns,” continued Appleby. “As AI becomes core enterprise infrastructure, a clear divide is emerging between organizations that understand how their systems produce outcomes and those that cannot explain or control them. Without visibility across models, tokens, GPUs, and infrastructure, teams absorb hidden cost, performance gaps, and ungoverned risk. Those that understand their systems gain end-to-end visibility and control so they can optimize cost in real time, ensure reliable performance, and prove outcomes. The result is declining resilience, eroding trust, and constrained growth as AI becomes infrastructure that must be governed and optimized at scale.”

ROI Visibility Is the Prerequisite Enterprises Cannot Defer

The study reveals a disconnect between how AI systems operate and how they are observed. A 17-point gap exists between Infra/SRE practitioners and executives on automated root cause capabilities:

  • 69% of Infra/SRE teams report lacking automated cross-domain root cause
  • 52% of executives report the same

This gap reflects a broader breakdown in system-level observability, where critical signals remain fragmented across the stack:

  • 57% cite cost and efficiency metrics as a top challenge
  • 56% cite GPU utilization tracking
  • 52% cite data pipeline visibility

These challenges span business outcomes, AI infrastructure, and data dependencies, yet are still managed in isolation.

GPU cost and utilization remain the most difficult operational challenge for 35% of enterprises, with impact varying by role:

  • 39% of executives experience it as financial accountability pressure
  • 36% of architects cite integration complexity in distributed environments
  • 22% of Infra/SRE teams face it as a scaling and reliability challenge

This variation reflects how different parts of the organization see different fragments of the same system, without a unified view of cause and effect.

Across all roles and revenue bands, enterprise priorities are consistent:

  • 38% need unified visibility across AI and infrastructure layers
  • 32% need AI-driven root cause analysis without manual correlation

Together, these priorities point to a single requirement: system-aware observability that connects performance, cost, and outcomes across the full stack. Today, most enterprises are operating AI systems they cannot fully observe or explain.

Virtana Expands AI Factory Observability with Dell Technologies Partnership

Also announced today, Virtana is extending its Agentic AI-powered observability platform to support Dell’s AI Factory infrastructure, bringing system-aware intelligence across the full AI factory stack, from GPUs and infrastructure to models and AI workloads. Now organizations running Dell-based AI factories can apply continuous, cross-domain analysis across the entire execution system.

Virtana’s autonomous agents correlate GPU utilization, token demand, model behavior, and underlying infrastructure performance in real time, delivering automated, evidence-backed root cause analysis where system complexity is highest. This support moves teams beyond siloed GPU monitoring and fragmented tooling. Instead of chasing signals, operators get clear answers tied to the actual system constraint driving latency, failures, or cost.

Research Methodology

The AI Factory Reality Check is based on an independent survey of 788 US-based professionals at enterprise organizations actively running, piloting, or planning AI workloads in production, with decision-making or significant influence over IT infrastructure, AI strategy, or technology investment. Respondents include application, service, and AI engineering professionals (307), executive leadership (270), infrastructure, cloud, and reliability engineering teams (120), and architects and platform designers (91). Organizations range from under 1,000 to more than 50,000 employees, spanning revenue bands from under $500 million to more than $10 billion.

About Virtana

Virtana delivers the deepest and broadest observability platform for hybrid and multi-cloud, with full-stack AI observability spanning applications, services, data pipelines, GPUs, CPUs, networks, and storage. Powered by high-fidelity data and agentic AI, Virtana provides unmatched visibility across end-to-end IT services and AI workloads, correlating health, performance, cost, and user impact in real time. With advanced event intelligence and autonomous insight generation, Virtana delivers clarity no other provider can match. Trusted by Global 2000 enterprises and public sector organizations, Virtana helps IT operations and DevOps teams reduce risk, strengthen resilience, improve efficiency, and modernize with confidence across multi-cloud, on-premises, and edge environments. Learn more at virtana.com

Contacts

Media Contact
Stephanie Floyd
Bhava Communications for Virtana
virtana@bhavacom.com
