
ZFLOW AI's Simulation-Guided Optimization Identifies a 1.54× Higher-Throughput Serving Configuration for DeepSeek V4-Pro on 8×B300

Working on PaleBlueDot AI's NVIDIA B300 platform, ZFLOW AI used hardware-aware simulation to find an optimized SGLang serving configuration for high-concurrency DeepSeek V4-Pro inference.

SANTA CLARA, Calif.--(BUSINESS WIRE)--ZFLOW AI today announced a performance optimization milestone on PaleBlueDot AI's 8×NVIDIA B300 bare-metal platform, using simulation to identify an optimized DeepSeek V4-Pro serving configuration on an SGLang stack. To ZFLOW AI's knowledge, this is the first publicly documented simulation-guided serving optimization of a frontier open-source model on NVIDIA's B300 production platform.

ZFLOW AI is building a neutral optimization and control layer for AI infrastructure. Sitting above serving runtimes and below the business decision, ZFLOW AI helps infrastructure teams find the lowest-cost, highest-performance way to run a given workload on a given cluster.

ZFLOW AI's role is complementary to the serving runtime. Building on the high-performance DeepSeek V4 foundation provided by the SGLang ecosystem, ZFLOW AI applies an optimization intelligence layer on top of the runtime — profiling real workload behavior and using hardware-aware simulation to guide deployment and tuning decisions for a specific workload on specific hardware.

In this milestone, ZFLOW AI evaluated DeepSeek V4-Pro serving with SGLang and EAGLE speculative decoding, analyzing serving-architecture tradeoffs, high-concurrency throughput and latency, and next-step multi-node deployment. Under higher-concurrency traffic, the prefill-decode disaggregated configuration reached a peak throughput of 826 tokens/second, approximately 1.54× the peak of the non-disaggregated (monolithic) configuration, with tail latency 2–3× lower. The monolithic path remained favorable for single-stream, low-concurrency, and long-context workloads, including the full 1M-token context.
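
For readers who want to relate the two throughput figures, the short Python sketch below back-calculates the monolithic peak implied by the numbers quoted above. It is illustrative arithmetic only: the release does not state the monolithic peak directly, the function and constant names are hypothetical, and the range assumes that "approximately 1.54×" is a ratio rounded to two decimal places.

```python
def implied_baseline(peak: float, speedup: float) -> float:
    """Return the baseline throughput implied by a peak and a speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return peak / speedup


# Figures quoted in the release.
DISAGG_PEAK_TOK_S = 826.0   # disaggregated peak, tokens/second
REPORTED_SPEEDUP = 1.54     # disaggregated peak / monolithic peak (approximate)

# Point estimate of the monolithic peak (derived here, not reported).
point = implied_baseline(DISAGG_PEAK_TOK_S, REPORTED_SPEEDUP)

# Assumption: the ratio was rounded to two decimals, so the true value
# lies somewhere in [1.535, 1.545). A larger ratio implies a smaller baseline.
low = implied_baseline(DISAGG_PEAK_TOK_S, 1.545)
high = implied_baseline(DISAGG_PEAK_TOK_S, 1.535)

print(f"implied monolithic peak: ~{point:.0f} tokens/second "
      f"(range {low:.0f} to {high:.0f})")
```

Under these assumptions the implied monolithic peak is roughly 536 tokens/second. That figure is an inference from the rounded ratio, not a measured result.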

ZFLOW AI also observed that MTP/EAGLE speculative decoding improved throughput with no measured quality regression in this test run: GSM8K accuracy across EAGLE 3/1/4, EAGLE 1/1/2, and no-MTP configurations stayed within approximately ±1 percentage point. Broader evaluation is ongoing.

ZFLOW AI's simulation further indicates that a two-node B300 configuration is a promising direction for production deployment, which the team plans to validate on hardware as a next step.

“Modern inference optimization is moving beyond manual tuning of individual runtime knobs,” said Dr. Zhibin Xiao, Founder and CEO of ZFLOW AI. “The next layer is a closed-loop workflow connecting real workload execution, hardware simulation, and optimization strategy. Our work on PaleBlueDot AI's B300 platform shows how ZFLOW AI helps infrastructure teams turn raw hardware capability into a workload-specific deployment strategy.”

Full closed-loop auto-optimization for DeepSeek V4-Pro on B300 remains under active development. ZFLOW AI plans to publish a Technical Insights blog detailing the serving-architecture tradeoffs, MTP/EAGLE optimization, and multi-node deployment work.

Teams evaluating DeepSeek V4-Pro or other frontier models on B300 or other next-generation GPU platforms can contact ZFLOW AI at contact@zflow.ai to discuss optimization for their own workloads.

About ZFLOW AI

ZFLOW AI is building a neutral optimization and control layer for AI infrastructure. Sitting above serving runtimes (vLLM, SGLang, TensorRT-LLM, Dynamo) and below the business decision, ZFLOW AI finds the lowest-cost, highest-performance way to run a given workload on a given cluster — across heterogeneous GPU, LPU, NPU, and CPU systems, without locking teams into any single vendor or stack. Learn more at zflow.ai.

About PaleBlueDot AI

PaleBlueDot AI is a Silicon Valley-based AI compute platform with a growing global footprint, delivering high-performance AI compute through a unified platform for enterprise-scale deployment. Guided by its mission to make intelligence universally accessible, PaleBlueDot AI helps organizations build, deploy, and scale AI faster, better, and cheaper.

