TLDR: ARCADE is a new real-time data system designed to efficiently process and query diverse data types like text, images, videos, and spatial information. It addresses limitations of existing systems by offering high-throughput data ingestion, a unified disk-based secondary index for various data modalities, a comprehensive cost-based query optimizer for complex hybrid queries, and an incremental materialized view framework for efficient continuous queries. Built on RocksDB and MySQL, ARCADE significantly outperforms other leading systems in both read-heavy and write-heavy workloads, enabling real-time semantic search and analytics across multimodal data.
In today’s fast-paced digital world, data is constantly being generated from countless sources, including social media, urban systems, and financial markets. This data isn’t just text; it includes images, videos, spatial information, and traditional relational data. The challenge lies in making sense of this vast, diverse, and continuously flowing information in real-time, especially for tasks like semantic search and retrieval.
Existing database systems often struggle with this. Some, designed for multimodal data, can handle complex queries but fall short when it comes to quickly ingesting new data or running queries that update automatically over time. Others, built for real-time data, excel at rapid ingestion but lack comprehensive support for diverse data types or complex combined queries.
Introducing ARCADE: A New Era for Real-Time Data Systems
To bridge this gap, researchers have introduced ARCADE, a novel real-time data system designed to efficiently handle high volumes of incoming data and process sophisticated queries across various data types. ARCADE stands out by offering unified support for hybrid queries (which combine different data types) and continuous queries (which run automatically over time) in a real-time environment.
ARCADE’s core innovations address three major challenges:
- Unified Indexing: Traditional real-time systems often lack a consistent way to index different data types like vectors (for images/text embeddings), spatial data (for locations), and text data within their storage architecture. ARCADE introduces a unified disk-based secondary index that works across all these modalities, built on an efficient LSM-based storage system. This means it can quickly find relevant information regardless of its type.
- Smarter Query Optimization: Current systems often struggle to optimize queries that combine searches across multiple data types. ARCADE features a comprehensive, cost-based query optimizer that intelligently uses all available indexes to speed up hybrid queries. It can even handle complex ‘Hybrid Nearest Neighbor’ queries that rank results based on a combination of similarities, like how close two locations are and how similar two text embeddings are.
- Efficient Continuous Queries: Real-time monitoring and event-driven analytics require queries that run continuously. ARCADE tackles this with an incremental materialized view framework. This system reuses intermediate results from previous query executions, significantly improving efficiency while ensuring that the results are always up-to-date.
ARCADE is built on popular open-source components: RocksDB for storage and MySQL for its query engine. In experiments, it outperformed leading real-time multimodal data systems by up to 7.4 times on read-heavy workloads and 1.4 times on write-heavy workloads.
What Kind of Data and Queries Does ARCADE Handle?
ARCADE supports a wide range of data modalities, including vector data (for embeddings), blob data (for unstructured binary data like images and videos), spatial data (for geographic information), text data, and traditional relational data. This allows users to store, index, and query heterogeneous real-world data within a single system.
The system introduces four expressive query types:
- Hybrid Search Queries: These allow users to filter data based on multiple conditions across relational, vector, spatial, or textual attributes. For example, finding tweets mentioning a keyword within a specific geographic region that are also semantically relevant to a query.
- Hybrid NN Queries: These rank results by combining similarity measures from different modalities, such as embedding distance, spatial proximity, and textual relevance. An example would be finding relevant tweets posted during a specific time range, ranked by a weighted sum of spatial proximity and vector similarity.
- Continuous SYNC Queries: These queries execute at fixed, user-defined intervals, providing up-to-date results over real-time data. Imagine continuously monitoring the number of tweets about a topic across different cities every 60 seconds.
- Continuous ASYNC Queries: These queries automatically re-execute whenever the underlying data changes, ensuring the most current results. This could be used to monitor for the most up-to-date tweets on a given topic for investment research.
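To make the Hybrid NN idea concrete, here is a minimal brute-force sketch in Python. It filters rows by a time range and ranks the rest by a weighted sum of spatial distance and embedding distance. The function name, field names (`ts`, `loc`, `vec`), weights, and the use of Euclidean distance are all illustrative assumptions; this is not ARCADE's query syntax or its index-based execution.

```python
import heapq
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_nn(rows, query_loc, query_vec, k,
              w_spatial=0.5, w_vector=0.5, time_range=None):
    """Return the k (score, id) pairs with the smallest weighted distance.

    Hypothetical helper: scans every row, which a real system would avoid
    by using its indexes.
    """
    scored = []
    for row in rows:
        # Relational filter: keep only rows inside the requested time range.
        if time_range is not None:
            start, end = time_range
            if not (start <= row["ts"] <= end):
                continue
        # Hybrid score: weighted sum of two per-modality distances.
        score = (w_spatial * euclidean(row["loc"], query_loc)
                 + w_vector * euclidean(row["vec"], query_vec))
        scored.append((score, row["id"]))
    return heapq.nsmallest(k, scored)

# Toy data (made up for illustration).
tweets = [
    {"id": 1, "ts": 10, "loc": (0.0, 0.0), "vec": (1.0, 0.0)},
    {"id": 2, "ts": 20, "loc": (3.0, 4.0), "vec": (1.0, 0.0)},
    {"id": 3, "ts": 30, "loc": (0.0, 0.0), "vec": (0.0, 1.0)},
    {"id": 4, "ts": 99, "loc": (0.0, 0.0), "vec": (1.0, 0.0)},
]
top = hybrid_nn(tweets, (0.0, 0.0), (1.0, 0.0), k=2, time_range=(0, 50))
print([tweet_id for _, tweet_id in top])
```

Tweet 4 matches the query exactly but falls outside the time range, so it is filtered out before ranking. Changing the two weights shifts the ranking toward spatial proximity or toward semantic similarity.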
Under the Hood: How ARCADE Works
ARCADE’s architecture is layered, separating query interaction, processing, and storage. Its unified disk-based secondary index framework is a key differentiator. Unlike systems that load entire vector indexes into memory, ARCADE allows for block-level access, reducing memory footprint and improving cache reuse. This index is built in the background, ensuring data ingestion performance isn’t impacted.
For hybrid queries, ARCADE’s optimizer considers all possible index access plans, dynamically selecting the best combination to accelerate queries. For complex Hybrid NN queries, it uses an aggregation algorithm that efficiently identifies top results by leveraging all relevant indexes simultaneously, avoiding costly full data scans.
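The article does not spell out the aggregation algorithm, so the sketch below uses a classic technique of the same flavor, Fagin's Threshold Algorithm, purely as an illustration. It assumes each modality's index can return candidates in ascending distance order and that the combined score is a weighted sum. The function name and the toy data are hypothetical.

```python
import heapq

def threshold_topk(sorted_lists, weights, k):
    """Top-k by weighted-sum distance using threshold-style early stopping.

    sorted_lists: one list per modality of (distance, obj_id), ascending by
    distance, with every object present in every list.
    Returns (top_k, depth), where depth is how many entries were read per list.
    """
    # Stand-in for index point lookups ("random access") in a real system.
    lookups = [{obj: d for d, obj in lst} for lst in sorted_lists]
    seen = {}  # obj_id -> full combined score
    n = len(sorted_lists[0])
    for depth in range(n):
        frontier = []
        for lst in sorted_lists:
            d, obj = lst[depth]  # "sorted access": next-best entry per index
            frontier.append(d)
            if obj not in seen:
                seen[obj] = sum(w * lk[obj] for w, lk in zip(weights, lookups))
        # No unseen object can score below this bound.
        threshold = sum(w * d for w, d in zip(weights, frontier))
        best = heapq.nsmallest(k, ((s, o) for o, s in seen.items()))
        if len(best) == k and best[-1][0] <= threshold:
            return best, depth + 1
    return heapq.nsmallest(k, ((s, o) for o, s in seen.items())), n

# Toy per-modality rankings (made up for illustration).
spatial = [(0.1, "a"), (0.2, "b"), (0.5, "c"), (0.9, "d")]
vector = [(0.1, "b"), (0.3, "a"), (0.6, "d"), (0.8, "c")]
best, depth = threshold_topk([spatial, vector], (0.5, 0.5), k=1)
print(best[0][1], depth)
```

In this toy run the search stops after reading two entries from each list instead of all four, which is the kind of saving the text describes as avoiding a full scan.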
Continuous queries benefit from ARCADE’s incremental materialized view framework. This system intelligently selects which views to materialize and updates them incrementally as new data arrives, rather than recomputing everything from scratch. This ensures both efficiency and data freshness.
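The difference between incremental maintenance and recomputation can be sketched with a per-city tweet count for one topic, similar to the monitoring example earlier. The class below is a hypothetical illustration, not ARCADE's view framework: it applies only the inserted and deleted rows to a stored result, and a separate function recomputes from scratch for comparison.

```python
from collections import Counter

class TopicCityCountView:
    """Hypothetical materialized view: tweets per city for one topic."""

    def __init__(self, topic):
        self.topic = topic
        self.counts = Counter()

    def apply_delta(self, inserted=(), deleted=()):
        """Update the stored result from changed rows only."""
        for row in inserted:
            if row["topic"] == self.topic:
                self.counts[row["city"]] += 1
        for row in deleted:
            if row["topic"] == self.topic:
                self.counts[row["city"]] -= 1
                if self.counts[row["city"]] == 0:
                    del self.counts[row["city"]]

    def result(self):
        return dict(self.counts)

def recompute(rows, topic):
    """Baseline: rebuild the same result by scanning every row."""
    return dict(Counter(r["city"] for r in rows if r["topic"] == topic))

# Toy data (made up for illustration).
base = [
    {"city": "Austin", "topic": "ai"},
    {"city": "Boston", "topic": "ai"},
    {"city": "Austin", "topic": "sports"},
]
view = TopicCityCountView("ai")
view.apply_delta(inserted=base)

# A new batch arrives: one insert and one delete.
new_row = {"city": "Austin", "topic": "ai"}
view.apply_delta(inserted=[new_row], deleted=[base[1]])
current_rows = [base[0], base[2], new_row]
print(view.result() == recompute(current_rows, "ai"))
```

The incremental path touches two rows for the second batch, while the baseline rescans the whole table. Both produce the same answer, which is the freshness guarantee the text refers to.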
Real-World Performance
The researchers developed a benchmark called TRACY (Tweet hybRid And Continuous querY) using real-world tweet, point-of-interest, and city data to evaluate ARCADE. The results showed ARCADE’s superior performance, especially for hybrid Nearest Neighbor queries, where it was 3.5 to 7.4 times faster than the best baseline system, SingleStore-V. For mixed workloads, ARCADE delivered a 1.4x to 7.4x speed-up over leading systems.
In conclusion, ARCADE represents a significant advancement in real-time data systems, offering a unified and efficient solution for processing complex, multimodal data streams. Its innovative indexing, optimization, and continuous query mechanisms pave the way for more responsive user experiences and actionable insights in a data-rich world.


