ARCADE: A Real-Time System for Blending Diverse Data and Continuous Insights

TLDR: ARCADE is a new real-time data system designed to efficiently process and query diverse data types like text, images, videos, and spatial information. It addresses limitations of existing systems by offering high-throughput data ingestion, a unified disk-based secondary index for various data modalities, a comprehensive cost-based query optimizer for complex hybrid queries, and an incremental materialized view framework for efficient continuous queries. Built on RocksDB and MySQL, ARCADE significantly outperforms other leading systems in both read-heavy and write-heavy workloads, enabling real-time semantic search and analytics across multimodal data.

In today’s fast-paced digital world, data is constantly being generated from countless sources, including social media, urban systems, and financial markets. This data isn’t just text; it includes images, videos, spatial information, and traditional relational data. The challenge lies in making sense of this vast, diverse, and continuously flowing information in real-time, especially for tasks like semantic search and retrieval.

Existing database systems often struggle with this. Some, designed for multimodal data, can handle complex queries but fall short when it comes to quickly ingesting new data or running queries that update automatically over time. Others, built for real-time data, excel at rapid ingestion but lack comprehensive support for diverse data types or complex combined queries.

Introducing ARCADE: A New Era for Real-Time Data Systems

To bridge this gap, researchers have introduced ARCADE, a novel real-time data system designed to efficiently handle high volumes of incoming data and process sophisticated queries across various data types. ARCADE stands out by offering unified support for hybrid queries (which combine different data types) and continuous queries (which run automatically over time) in a real-time environment.

ARCADE’s core innovations address three major challenges:

Unified Indexing: Traditional real-time systems often lack a consistent way to index different data types like vectors (for images/text embeddings), spatial data (for locations), and text data within their storage architecture. ARCADE introduces a unified disk-based secondary index that works across all these modalities, built on an efficient LSM-based storage system. This means it can quickly find relevant information regardless of its type.
Smarter Query Optimization: Current systems often struggle to optimize queries that combine searches across multiple data types. ARCADE features a comprehensive, cost-based query optimizer that intelligently uses all available indexes to speed up hybrid queries. It can even handle complex ‘Hybrid Nearest Neighbor’ queries that rank results based on a combination of similarities, like how close two locations are and how similar two text embeddings are.
Efficient Continuous Queries: Real-time monitoring and event-driven analytics require queries that run continuously. ARCADE tackles this with an incremental materialized view framework. This system reuses intermediate results from previous query executions, significantly improving efficiency while ensuring that the results are always up-to-date.
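
As a rough illustration of the cost-based idea in the second point, the sketch below enumerates every subset of index probes for a three-predicate hybrid query and keeps the cheapest. The cost formula, the selectivity figures, and the names `Predicate`, `plan_cost`, and `choose_plan` are hypothetical and chosen only for illustration; the article does not describe ARCADE's actual cost model.

```python
from dataclasses import dataclass
from itertools import combinations

# All numbers and names here are hypothetical; ARCADE's real cost model is not
# described in this article.
TABLE_ROWS = 1_000_000
SCAN_COST_PER_ROW = 1.0   # read and check one row during a full scan
FETCH_COST_PER_ROW = 4.0  # fetch one candidate row after index intersection


@dataclass(frozen=True)
class Predicate:
    name: str           # modality the predicate filters on
    selectivity: float  # estimated fraction of rows that match, in (0, 1]
    probe_cost: float   # assumed cost per entry returned by the index


def plan_cost(indexed):
    """Estimated cost of probing `indexed`, intersecting, then fetching rows.

    Predicates left out of `indexed` would be checked on the fetched rows;
    that residual filtering cost is ignored in this simplified model.
    """
    if not indexed:
        return TABLE_ROWS * SCAN_COST_PER_ROW  # no index used: full scan
    probe = sum(p.selectivity * TABLE_ROWS * p.probe_cost for p in indexed)
    survivors = float(TABLE_ROWS)
    for p in indexed:
        survivors *= p.selectivity  # assumes predicates are independent
    return probe + survivors * FETCH_COST_PER_ROW


def choose_plan(predicates):
    """Return the cheapest subset of predicates to answer via their indexes."""
    candidates = [
        subset
        for size in range(len(predicates) + 1)
        for subset in combinations(predicates, size)
    ]
    return min(candidates, key=plan_cost)


predicates = [
    Predicate("spatial", selectivity=0.01, probe_cost=0.5),
    Predicate("text", selectivity=0.05, probe_cost=0.3),
    Predicate("vector", selectivity=0.5, probe_cost=2.0),
]
best = choose_plan(predicates)
print([p.name for p in best], plan_cost(best))
```

With these made-up numbers, the plan that probes the spatial and text indexes and skips the unselective vector index is cheapest. Exhaustive enumeration is only practical for a handful of predicates; a production optimizer would prune the search space.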

Built on popular open-source components, with RocksDB for storage and MySQL for its query engine, ARCADE performs strongly in the reported experiments: it outperformed leading real-time multimodal data systems by up to 7.4 times on read-heavy workloads and by 1.4 times on write-heavy workloads.

What Kind of Data and Queries Does ARCADE Handle?

ARCADE supports a wide range of data modalities, including vector data (for embeddings), blob data (for unstructured binary data like images and videos), spatial data (for geographic information), text data, and traditional relational data. This allows users to store, index, and query heterogeneous real-world data within a single system.

The system introduces four expressive query types:

Hybrid Search Queries: These allow users to filter data based on multiple conditions across relational, vector, spatial, or textual attributes. For example, finding tweets mentioning a keyword within a specific geographic region that are also semantically relevant to a query.
Hybrid NN Queries: These rank results by combining similarity measures from different modalities, such as embedding distance, spatial proximity, and textual relevance. An example would be finding relevant tweets posted during a specific time range, ranked by a weighted sum of spatial proximity and vector similarity.
Continuous SYNC Queries: These queries execute at fixed, user-defined intervals, providing up-to-date results over real-time data. Imagine continuously monitoring the number of tweets about a topic across different cities every 60 seconds.
Continuous ASYNC Queries: These queries automatically re-execute whenever the underlying data changes, ensuring the most current results. This could be used, for example, to surface the latest tweets on a given topic for investment research.
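
The two hybrid query types above can be made concrete with a small brute-force sketch. It scans a toy in-memory list instead of using any index, so it shows only the semantics of the queries, not how ARCADE executes them. The sample tweets, the function names, and the equal weights are assumptions made for illustration.

```python
import math

# Toy data: each tweet has text, a 2-D location, and a 2-D embedding.
tweets = [
    {"id": 1, "text": "flood warning downtown", "loc": (0.0, 0.0), "emb": (1.0, 0.0)},
    {"id": 2, "text": "flood relief concert", "loc": (3.0, 4.0), "emb": (0.0, 1.0)},
    {"id": 3, "text": "traffic update", "loc": (0.0, 1.0), "emb": (1.0, 0.0)},
    {"id": 4, "text": "flood on main street", "loc": (0.0, 2.0), "emb": (0.6, 0.8)},
]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the embeddings point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def hybrid_search(rows, keyword, center, radius, query_emb, max_cos_dist):
    """Filter: keyword match AND within radius AND semantically close."""
    return [
        r for r in rows
        if keyword in r["text"].split()
        and math.dist(r["loc"], center) <= radius
        and cosine_distance(r["emb"], query_emb) <= max_cos_dist
    ]


def hybrid_nn(rows, center, query_emb, k, w_spatial=0.5, w_vector=0.5):
    """Rank by a weighted sum of spatial distance and embedding distance."""
    def score(r):
        return (w_spatial * math.dist(r["loc"], center)
                + w_vector * cosine_distance(r["emb"], query_emb))
    return sorted(rows, key=score)[:k]


matches = hybrid_search(tweets, "flood", (0.0, 0.0), 2.5, (1.0, 0.0), 0.5)
nearest = hybrid_nn(tweets, (0.0, 0.0), (1.0, 0.0), k=2)
print([r["id"] for r in matches])  # [1, 4]
print([r["id"] for r in nearest])  # [1, 3]
```

The difference between the two is visible in the output: the search query returns every row that passes all filters, while the NN query returns the k best rows under the combined score, even if a row (such as tweet 3) would fail a keyword filter.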

Under the Hood: How ARCADE Works

ARCADE’s architecture is layered, separating query interaction, processing, and storage. Its unified disk-based secondary index framework is a key differentiator. Unlike systems that load entire vector indexes into memory, ARCADE allows for block-level access, reducing memory footprint and improving cache reuse. This index is built in the background, ensuring data ingestion performance isn’t impacted.

For hybrid queries, ARCADE’s optimizer considers all possible index access plans, dynamically selecting the best combination to accelerate queries. For complex Hybrid NN queries, it uses an aggregation algorithm that efficiently identifies top results by leveraging all relevant indexes simultaneously, avoiding costly full data scans.
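
The article does not name the aggregation algorithm. One classic technique that fits the description is the threshold algorithm for top-k aggregation, sketched below as an assumption rather than as ARCADE's implementation. Each modality exposes its items in ascending distance order, as an index scan would, and the loop stops as soon as no unseen item can beat the current top k. The function name and sample distances are hypothetical.

```python
import heapq


def threshold_top_k(lists, weights, k):
    """Top-k items by weighted sum of distances (lower is better).

    lists:   one dict {item_id: distance} per modality; every item is
             assumed to appear in every dict.
    weights: non-negative weight per modality.
    Returns (top_k_item_ids, depth), where depth is the number of rounds of
    sorted access that were needed.
    """
    # Ascending order stands in for what an index scan would return.
    sorted_lists = [sorted(d.items(), key=lambda kv: kv[1]) for d in lists]

    def total(item):
        # "Random access": look the item up in every modality.
        return sum(w * d[item] for w, d in zip(weights, lists))

    seen = {}
    depth = 0
    n = len(sorted_lists[0])
    while depth < n:
        last = []
        for s in sorted_lists:
            item, dist = s[depth]
            last.append(dist)
            if item not in seen:
                seen[item] = total(item)
        depth += 1
        # No unseen item can score below this bound.
        threshold = sum(w * d for w, d in zip(weights, last))
        best = heapq.nsmallest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] <= threshold:
            break
    best = heapq.nsmallest(k, seen.items(), key=lambda kv: kv[1])
    return [item for item, _ in best], depth


spatial = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 1.5, "e": 2.0}
vector = {"a": 0.2, "b": 0.1, "c": 0.8, "d": 0.9, "e": 1.0}
top, depth = threshold_top_k([spatial, vector], weights=(0.5, 0.5), k=2)
print(sorted(top), depth)
```

In this toy run the loop stops after two rounds of sorted access out of five possible, which is the sense in which such an algorithm avoids a full scan. The early stop is only valid when the scoring function is monotone, which a weighted sum with non-negative weights is.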

Continuous queries benefit from ARCADE’s incremental materialized view framework. This system intelligently selects which views to materialize and updates them incrementally as new data arrives, rather than recomputing everything from scratch. This ensures both efficiency and data freshness.
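
A minimal sketch of incremental maintenance, assuming a view that counts tweets per city for one topic: each insert or delete adjusts a single counter instead of rescanning the table. The class `TopicCountView`, its methods, and the sample rows are hypothetical; the article does not describe how ARCADE selects or stores its views.

```python
from collections import Counter


class TopicCountView:
    """Materialized per-city count of tweets whose text mentions a topic."""

    def __init__(self, topic):
        self.topic = topic
        self.counts = Counter()

    def _matches(self, row):
        return self.topic in row["text"].lower()

    def apply_insert(self, row):
        # Only the affected group is touched: O(1) work per new row.
        if self._matches(row):
            self.counts[row["city"]] += 1

    def apply_delete(self, row):
        if self._matches(row):
            self.counts[row["city"]] -= 1
            if self.counts[row["city"]] == 0:
                del self.counts[row["city"]]  # drop empty groups

    def result(self):
        return dict(self.counts)


def recompute(rows, topic):
    """Baseline: rebuild the same result from scratch with a full scan."""
    return dict(Counter(r["city"] for r in rows if topic in r["text"].lower()))


rows = [
    {"city": "Austin", "text": "Flood warning issued"},
    {"city": "Austin", "text": "Great tacos today"},
    {"city": "Boston", "text": "Minor flood near the harbor"},
    {"city": "Austin", "text": "Flood waters receding"},
]
view = TopicCountView("flood")
for row in rows:
    view.apply_insert(row)
print(view.result())  # {'Austin': 2, 'Boston': 1}
```

Under this sketch, a SYNC query would read `result()` on a timer, while an ASYNC query would be notified from inside `apply_insert` and `apply_delete` whenever a count changes.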

Real-World Performance

To evaluate ARCADE, the researchers developed a benchmark called TRACY (Tweet hybRid And Continuous querY) built from real-world tweet, point-of-interest, and city data. ARCADE came out ahead, especially on Hybrid NN queries, where it was 3.5 to 7.4 times faster than the best baseline system, SingleStore-V. For mixed workloads, ARCADE delivered a 1.4x to 7.4x speed-up over leading systems.

In conclusion, ARCADE represents a significant advancement in real-time data systems, offering a unified and efficient solution for processing complex, multimodal data streams. Its innovative indexing, optimization, and continuous query mechanisms pave the way for more responsive user experiences and actionable insights in a data-rich world. You can read the full research paper here.
