Vortex: Balancing Speed and Reliability for AI Inference and Knowledge Retrieval

TLDR: Vortex is a novel ML serving platform that prioritizes Service Level Objectives (SLOs) to deliver predictable low latency and high throughput for AI inference and knowledge retrieval services. It achieves this through a microservice-based architecture, smart data and task placement, opportunistic batching, anticipatory model preloading, and efficient data paths, including leveraging RDMA. Experiments show Vortex significantly outperforms existing systems like TorchServe and Ray Serve, offering more stable latencies and higher request rates for given SLO targets.

The world of Artificial Intelligence (AI) is rapidly expanding, with ML inference and knowledge retrieval becoming crucial services for everything from interactive user applications to sophisticated AI agents. This growth brings a significant challenge: the need for both high processing speed (throughput) and predictable response times (latency), often defined by Service Level Objectives (SLOs). Traditional ML serving platforms typically optimize for throughput, which can lead to unpredictable delays, especially during peak loads.

A new research paper introduces Vortex, an innovative platform designed to tackle this very problem. Vortex takes an “SLO-first” approach, ensuring that latency targets are met without sacrificing high throughput. The paper highlights that for similar tasks, Vortex’s pipelines achieve significantly lower and more stable latencies compared to existing solutions like TorchServe and Ray Serve, often enabling more than double the request rate for a given SLO target. The benefits are even more pronounced when advanced networking technologies like RDMA (Remote Direct Memory Access) are utilized.

The Core Idea Behind Vortex

Vortex’s central premise is to design ML-as-a-service frameworks that facilitate SLOs while still employing batching—a technique that groups multiple requests to enhance throughput. The key is to avoid queuing backlogs, which are a major cause of latency spikes. Vortex views ML services as pipelines, which are essentially directed graphs of ML stages running on dynamically resizable server pools. These pipelines can share components, allowing for opportunistic aggregation of loads even if individual request flows are bursty.
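
To make that abstraction concrete, here is a minimal sketch in Python of how such a service might be declared as a directed graph of stages backed by resizable server pools. The Stage and Pipeline classes, the replica-count hints, and the model names are illustrative assumptions, not Vortex’s actual API.

    # Hypothetical sketch of the pipeline abstraction described above.
    # Stage, Pipeline, and the pool-size hints are illustrative names only.
    from dataclasses import dataclass, field


    @dataclass
    class Stage:
        """One ML component, backed by a dynamically resizable server pool."""
        name: str
        model: str
        min_replicas: int = 1
        max_replicas: int = 4


    @dataclass
    class Pipeline:
        """A directed graph of ML stages; edges define the data flow."""
        name: str
        stages: dict = field(default_factory=dict)
        edges: list = field(default_factory=list)

        def add(self, stage: Stage) -> Stage:
            # Stages can be shared across pipelines, so bursty per-pipeline
            # request flows can be aggregated opportunistically.
            self.stages[stage.name] = stage
            return stage

        def connect(self, upstream: str, downstream: str) -> None:
            self.edges.append((upstream, downstream))


    # A minimal retrieval pipeline: embed the query, then search a vector index.
    embed = Stage("embed", model="multimodal-encoder", min_replicas=2)
    search = Stage("search", model="vector-index")

    retrieval = Pipeline("knowledge-retrieval")
    retrieval.add(embed)
    retrieval.add(search)
    retrieval.connect("embed", "search")

In a real deployment the platform, not the application, would decide how large each stage’s pool should be at any moment; the replica hints above merely mark the pools as resizable.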

How Vortex Achieves Its Goals

The platform incorporates several key innovations:

  • Microservice-based Pipeline Architecture: Vortex uses a novel architecture where ML components are treated as trusted tenants within its address space. This minimizes data copying and network transfers by using pointers when ML components access input or data objects.
  • Dual-Role Servers and Smart Placement: Vortex servers act as both key-value storage servers and compute hosts. Its scheduler intelligently routes queries to components running where their necessary data (models, vector database indices) already resides, minimizing access delays. This aligns well with sharding, where the key-value storage is split into groups of replicas for scalability and fault-tolerance.
  • Avoiding Latency Spikes: Vortex addresses common causes of latency spikes, such as queuing delays and excessively large batch sizes. It employs opportunistic batching, meaning it batches requests when possible but actively manages backlogs and limits batch sizes to remain within SLO targets (a sketch of this policy follows the list). Additionally, it preloads models and other dependent objects into GPU memory before activating new instances, preventing delays during scaling events.
  • System-Level Optimizations: The platform includes several optimizations: smart task placement (collocating components on the same machine if beneficial), pool-oriented microservice management (right-sizing component pools and limiting batch sizes), zero-copy data paths (an asynchronous architecture that minimizes copying and locking), and smart packing (efficiently mapping components to GPUs, including using NVIDIA’s Multi-Instance GPU (MIG) feature).
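
The batching policy lends itself to a small illustration. The sketch below shows one way an SLO-aware dispatcher could form batches opportunistically: it dispatches whatever has already queued rather than waiting for more work, caps the batch size, and shrinks the batch further when the oldest request is close to its deadline. The constants, names, and the specific policy are assumptions for illustration, not Vortex’s implementation.

    # Hypothetical sketch of SLO-aware opportunistic batching; the names,
    # constants, and policy are assumptions, not Vortex's actual code.
    import time
    from collections import deque

    MAX_BATCH = 8          # hard cap: one oversized batch can blow the SLO
    SLO_SECONDS = 0.200    # latency target for this stage
    SAFETY_MARGIN = 0.050  # time reserved for compute and downstream stages


    class Request:
        def __init__(self, payload):
            self.payload = payload
            self.arrival = time.monotonic()


    def next_batch(queue: deque, est_per_item: float = 0.010) -> list:
        """Drain a batch opportunistically without building a backlog."""
        if not queue:
            return []
        waited = time.monotonic() - queue[0].arrival
        remaining = SLO_SECONDS - SAFETY_MARGIN - waited
        # The batch is limited both by the static cap and by how many items
        # can still be processed before the oldest request's deadline.
        affordable = max(1, int(remaining / est_per_item)) if remaining > 0 else 1
        size = min(len(queue), MAX_BATCH, affordable)
        # Dispatch immediately with whatever has queued; never wait for
        # more requests to arrive, which is what creates backlogs.
        return [queue.popleft() for _ in range(size)]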

Real-World Examples

The paper uses two ML pipelines as running examples: PreFLMR, a knowledge retrieval application that takes an image and query to retrieve documents, and AudioQuery, a speech-query RAG (Retrieval Augmented Generation) LLM pipeline that converts audio to text, searches for documents, and generates a spoken response.
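
Using the same illustrative Stage and Pipeline classes from the earlier sketch (again, not Vortex’s real API), AudioQuery could be expressed as a four-stage graph whose retrieval stage is shared with a PreFLMR-style pipeline, so both workloads contribute requests to the same batched component.

    # AudioQuery expressed with the illustrative Stage/Pipeline classes
    # defined in the earlier sketch; stage and model names are assumptions.
    audio_query = Pipeline("audio-query-rag")
    audio_query.add(Stage("asr", model="speech-to-text"))
    audio_query.add(search)  # reuse the shared vector-index stage from above
    audio_query.add(Stage("generate", model="llm"))
    audio_query.add(Stage("tts", model="text-to-speech"))

    audio_query.connect("asr", "search")
    audio_query.connect("search", "generate")
    audio_query.connect("generate", "tts")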

Experimental Validation

Experiments conducted on a cluster of servers with NVIDIA A30 GPUs and RDMA networking demonstrate Vortex’s effectiveness. When compared to TorchServe and Ray Serve, Vortex consistently delivers superior performance. For instance, in the PreFLMR pipeline, Vortex with RDMA achieved significantly lower SLO miss rates at high throughputs compared to Ray Serve. The ability to preload models also proved crucial, preventing latency spikes when the system needed to scale up to handle increased load.

Vortex’s optimized stage-to-stage handoffs, especially when leveraging RDMA, lead to substantially reduced latency variability and faster data transfers between pipeline stages. Even when configured to use TCP instead of RDMA, Vortex still outperforms Ray Serve, highlighting the benefits of its underlying architectural optimizations like zero-copy data paths and avoidance of locks.

In conclusion, Vortex offers a compelling new paradigm for hosting ML inference and knowledge retrieval services. By prioritizing SLOs while intelligently managing throughput, it provides a robust and efficient platform for the next generation of AI-powered applications. You can read the full research paper here.

