Vortex: Balancing Speed and Reliability for AI Inference and Knowledge Retrieval

TLDR: Vortex is a novel ML serving platform that prioritizes Service Level Objectives (SLOs) to deliver predictable low latency and high throughput for AI inference and knowledge retrieval services. It achieves this through a microservice-based architecture, smart data and task placement, opportunistic batching, anticipatory model preloading, and efficient data paths, including leveraging RDMA. Experiments show Vortex significantly outperforms existing systems like TorchServe and Ray Serve, offering more stable latencies and higher request rates for given SLO targets.

The world of Artificial Intelligence (AI) is rapidly expanding, with ML inference and knowledge retrieval becoming crucial services for everything from interactive user applications to sophisticated AI agents. This growth brings a significant challenge: the need for both high processing speed (throughput) and predictable response times (latency), often defined by Service Level Objectives (SLOs). Traditional ML serving platforms typically optimize for throughput, which can lead to unpredictable delays, especially during peak loads.

A new research paper introduces Vortex, an innovative platform designed to tackle this very problem. Vortex takes an “SLO-first” approach, ensuring that latency targets are met without sacrificing high throughput. The paper highlights that for similar tasks, Vortex’s pipelines achieve significantly lower and more stable latencies compared to existing solutions like TorchServe and Ray Serve, often enabling more than double the request rate for a given SLO target. The benefits are even more pronounced when advanced networking technologies like RDMA (Remote Direct Memory Access) are utilized.

The Core Idea Behind Vortex

Vortex’s central premise is to design ML-as-a-service frameworks that facilitate SLOs while still employing batching—a technique that groups multiple requests to enhance throughput. The key is to avoid queuing backlogs, which are a major cause of latency spikes. Vortex views ML services as pipelines, which are essentially directed graphs of ML stages running on dynamically resizable server pools. These pipelines can share components, allowing for opportunistic aggregation of loads even if individual request flows are bursty.
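
To make that abstraction concrete, here is a minimal sketch in Python of how such a service might be declared as a directed graph of stages backed by resizable server pools. The Stage and Pipeline classes, the replica-count hints, and the model names are illustrative assumptions, not Vortex’s actual API.

    # Hypothetical sketch of the pipeline abstraction described above.
    # Stage, Pipeline, and the pool-size hints are illustrative names only.
    from dataclasses import dataclass, field


    @dataclass
    class Stage:
        """One ML component, backed by a dynamically resizable server pool."""
        name: str
        model: str
        min_replicas: int = 1
        max_replicas: int = 4


    @dataclass
    class Pipeline:
        """A directed graph of ML stages; edges define the data flow."""
        name: str
        stages: dict = field(default_factory=dict)
        edges: list = field(default_factory=list)

        def add(self, stage: Stage) -> Stage:
            # Stages can be shared across pipelines, so bursty per-pipeline
            # request flows can be aggregated opportunistically.
            self.stages[stage.name] = stage
            return stage

        def connect(self, upstream: str, downstream: str) -> None:
            self.edges.append((upstream, downstream))


    # A minimal retrieval pipeline: embed the query, then search a vector index.
    embed = Stage("embed", model="multimodal-encoder", min_replicas=2)
    search = Stage("search", model="vector-index")

    retrieval = Pipeline("knowledge-retrieval")
    retrieval.add(embed)
    retrieval.add(search)
    retrieval.connect("embed", "search")

In a real deployment the platform, not the application, would decide how large each stage’s pool should be at any moment; the replica hints above merely mark the pools as resizable.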

How Vortex Achieves Its Goals

The platform incorporates several key innovations:

  • Microservice-based Pipeline Architecture: Vortex uses a novel architecture where ML components are treated as trusted tenants within its address space. This minimizes data copying and network transfers by using pointers when ML components access input or data objects.
  • Dual-Role Servers and Smart Placement: Vortex servers act as both key-value storage servers and compute hosts. Its scheduler intelligently routes queries to components running where their necessary data (models, vector database indices) already resides, minimizing access delays. This aligns well with sharding, where the key-value storage is split into groups of replicas for scalability and fault-tolerance.
  • Avoiding Latency Spikes: Vortex addresses common causes of latency spikes, such as queuing delays and excessively large batch sizes. It employs opportunistic batching, meaning it batches requests when possible but actively manages backlogs and limits batch sizes to remain within SLO targets (a sketch of this policy follows the list). Additionally, it preloads models and other dependent objects into GPU memory before activating new instances, preventing delays during scaling events.
  • System-Level Optimizations: The platform includes several optimizations: smart task placement (collocating components on the same machine if beneficial), pool-oriented microservice management (right-sizing component pools and limiting batch sizes), zero-copy data paths (an asynchronous architecture that minimizes copying and locking), and smart packing (efficiently mapping components to GPUs, including using NVIDIA’s Multi-Instance GPU (MIG) feature).
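
The batching policy lends itself to a small illustration. The sketch below shows one way an SLO-aware dispatcher could form batches opportunistically: it dispatches whatever has already queued rather than waiting for more work, caps the batch size, and shrinks the batch further when the oldest request is close to its deadline. The constants, names, and the specific policy are assumptions for illustration, not Vortex’s implementation.

    # Hypothetical sketch of SLO-aware opportunistic batching; the names,
    # constants, and policy are assumptions, not Vortex's actual code.
    import time
    from collections import deque

    MAX_BATCH = 8          # hard cap: one oversized batch can blow the SLO
    SLO_SECONDS = 0.200    # latency target for this stage
    SAFETY_MARGIN = 0.050  # time reserved for compute and downstream stages


    class Request:
        def __init__(self, payload):
            self.payload = payload
            self.arrival = time.monotonic()


    def next_batch(queue: deque, est_per_item: float = 0.010) -> list:
        """Drain a batch opportunistically without building a backlog."""
        if not queue:
            return []
        waited = time.monotonic() - queue[0].arrival
        remaining = SLO_SECONDS - SAFETY_MARGIN - waited
        # The batch is limited both by the static cap and by how many items
        # can still be processed before the oldest request's deadline.
        affordable = max(1, int(remaining / est_per_item)) if remaining > 0 else 1
        size = min(len(queue), MAX_BATCH, affordable)
        # Dispatch immediately with whatever has queued; never wait for
        # more requests to arrive, which is what creates backlogs.
        return [queue.popleft() for _ in range(size)]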

Real-World Examples

The paper uses two ML pipelines as running examples: PreFLMR, a knowledge retrieval application that takes an image and query to retrieve documents, and AudioQuery, a speech-query RAG (Retrieval Augmented Generation) LLM pipeline that converts audio to text, searches for documents, and generates a spoken response.
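
Using the same illustrative Stage and Pipeline classes from the earlier sketch (again, not Vortex’s real API), AudioQuery could be expressed as a four-stage graph whose retrieval stage is shared with a PreFLMR-style pipeline, so both workloads contribute requests to the same batched component.

    # AudioQuery expressed with the illustrative Stage/Pipeline classes
    # defined in the earlier sketch; stage and model names are assumptions.
    audio_query = Pipeline("audio-query-rag")
    audio_query.add(Stage("asr", model="speech-to-text"))
    audio_query.add(search)  # reuse the shared vector-index stage from above
    audio_query.add(Stage("generate", model="llm"))
    audio_query.add(Stage("tts", model="text-to-speech"))

    audio_query.connect("asr", "search")
    audio_query.connect("search", "generate")
    audio_query.connect("generate", "tts")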

Experimental Validation

Experiments conducted on a cluster of servers with NVIDIA A30 GPUs and RDMA networking demonstrate Vortex’s effectiveness. When compared to TorchServe and Ray Serve, Vortex consistently delivers superior performance. For instance, in the PreFLMR pipeline, Vortex with RDMA achieved significantly lower SLO miss rates at high throughputs compared to Ray Serve. The ability to preload models also proved crucial, preventing latency spikes when the system needed to scale up to handle increased load.

Vortex’s optimized stage-to-stage handoffs, especially when leveraging RDMA, lead to substantially reduced latency variability and faster data transfers between pipeline stages. Even when configured to use TCP instead of RDMA, Vortex still outperforms Ray Serve, highlighting the benefits of its underlying architectural optimizations like zero-copy data paths and avoidance of locks.

In conclusion, Vortex offers a compelling new paradigm for hosting ML inference and knowledge retrieval services. By prioritizing SLOs while intelligently managing throughput, it provides a robust and efficient platform for the next generation of AI-powered applications. You can read the full research paper here.

