TLDR: A new research paper introduces OSCAR, a technical pipeline that uses object status recognition to help people with vision impairments track cooking progress. By understanding the changing state of ingredients and tools, OSCAR significantly improves recipe step prediction accuracy in both instructional videos and real-world non-visual cooking sessions. The study highlights the importance of designing assistive technologies that adapt to diverse user practices and challenging environmental conditions, moving beyond static recipe instructions to provide dynamic, context-aware support.
Cooking is a fundamental part of daily life, but it poses unique challenges for people with vision impairments. The usual non-visual ways of following a recipe, such as a screen reader or a smart speaker, deliver instructions linearly, with no awareness of what is actually happening in the kitchen. This can leave cooks unsure whether a step is complete, where they are in the recipe, or what to do next.
A new research paper, titled “Exploring Object Status Recognition for Recipe Progress Tracking in Non-Visual Cooking,” introduces a technical pipeline called OSCAR (Object Status Context Awareness for Recipes). Developed by Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, and Patrick Carrington, OSCAR aims to address this gap by focusing on “object status” – the condition or transformation of ingredients and tools as cooking progresses. This means recognizing when onions are chopped, sauces thicken, or meat browns, providing a more dynamic understanding of the cooking process.
OSCAR integrates several key components: it parses recipes, extracts object status information (such as "carrots chopped" or "eggs whisked"), aligns that information with visual data from cooking sessions using Vision-Language Models (VLMs) such as CLIP and SigLIP, and applies a time-causal model so that predictions follow the natural forward flow of a recipe. Unlike systems that rely only on text or voice, OSCAR reasons about the real-time visual state of ingredients and tools, enabling both progress tracking and contextual feedback.
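To make this flow concrete, below is a minimal sketch of such a pipeline in Python, assuming frame-by-frame scoring with CLIP via Hugging Face Transformers. The hand-written status phrases, the helper names (score_frame, predict_step), and the one-step look-ahead window are illustrative assumptions for this sketch, not the authors' implementation; OSCAR derives status descriptions automatically from the recipe text.

```python
# Minimal OSCAR-style sketch: score video frames against per-step object
# status phrases with CLIP, then decode the current step time-causally.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One status phrase per recipe step (hand-written here for illustration;
# OSCAR extracts these automatically from the recipe text).
STEP_STATUS = [
    "whole raw onion on a cutting board",
    "onion chopped into small pieces",
    "chopped onion browning in an oiled pan",
    "thickened sauce coating the onions",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_frame(image: Image.Image) -> torch.Tensor:
    """Return a probability-like score for each step's status phrase."""
    inputs = processor(text=STEP_STATUS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_steps)
    return logits.softmax(dim=-1).squeeze(0)

def predict_step(image: Image.Image, prev_step: int, window: int = 1) -> int:
    """Time-causal decoding: the prediction may only hold the current step
    or advance by at most `window` steps, so it follows recipe order."""
    probs = score_frame(image)
    lo = prev_step
    hi = min(prev_step + window, len(STEP_STATUS) - 1)
    return lo + int(torch.argmax(probs[lo:hi + 1]))

# Hypothetical usage on one sampled frame from a cooking session:
frame = Image.open("frame_0421.jpg")          # illustrative file name
current = predict_step(frame, prev_step=1)    # was on step 1 (onion chopped)
```

The time-causal constraint is what keeps a cluttered or ambiguous frame from jumping the tracker several steps forward or backward: each new prediction can only stay on the current step or advance within a small window, mirroring the natural forward flow of a recipe.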
The researchers evaluated OSCAR on two distinct datasets. The first comprised 173 instructional cooking videos from the YouCook2 benchmark. Here, OSCAR substantially improved step prediction accuracy: with CLIP, accuracy jumped from 41.7% to 68.0%, and with SigLIP it rose from 62.2% to 82.8%. The authors attribute the improvement to OSCAR's ability to disambiguate incomplete or occluded visual scenes, differentiate visually similar actions, and handle cluttered frames by focusing on object status changes.
The second, and perhaps more crucial, evaluation involved a real-world dataset of 12 non-visual cooking sessions recorded by blind and low vision individuals in their own homes. This dataset presented unique challenges, including varied lighting, non-standard tool usage, and exploratory interactions common in non-visual cooking. Despite these complexities, OSCAR again showed substantial gains. CLIP’s accuracy increased from 33.7% to 58.4%, and SigLIP’s from 41.9% to 66.7%. These results underscore the feasibility of using OSCAR for procedural tracking in natural, less controlled environments.
The study highlighted several reasons for OSCAR’s success in real-world scenarios. It reduced false positives from prolonged or exploratory interactions (where users might touch or recheck objects without changing their state), accommodated personalized tools and cooking strategies (focusing on ingredient transformation rather than specific tools), and supported an inclusive design that adapts to user routines. The consistency of performance gains across both instructional and real-world datasets suggests that object status modeling is a robust and generalizable approach.
However, the research also identified factors that still limit performance in real-world settings: implicit tasks that recipes never state (like cleaning or discarding waste), frequent rechecking of tools and ingredients, variable lighting conditions, inconsistent camera angles (especially with chest-mounted cameras), and pre-prepared ingredients that can confuse the system. These insights are crucial for designing future assistive systems that are more resilient and user-centered.
The paper concludes by emphasizing that future assistive AI systems need to move beyond rigid step-alignment and embrace dynamic progress inference models that accommodate the fluidity of real-world, non-visual workflows. Object status recognition is presented as a universal design primitive that can offer greater flexibility and better accommodate diverse user routines across various hands-on tasks, not just cooking. The researchers plan to release their non-visual cooking dataset to support further research in this critical area. For more details, you can read the full paper here.


