TLDR: This research introduces a zero-shot human activity recognition (HAR) method for smart homes that avoids relying on large language models (LLMs) and their associated risks, such as privacy invasion and inconsistent predictions. Instead, it converts sensor data and activity labels into natural-language summaries and descriptions, then uses a pre-trained sentence encoder to compare their embeddings for classification, achieving performance comparable to LLM-based state-of-the-art solutions across diverse datasets.
Understanding what people are doing in their smart homes, known as Human Activity Recognition (HAR), is a crucial area of research. Imagine a system that can tell whether someone is cooking, sleeping, or in need of assistance, all without constant supervision or extensive data collection. Traditionally, building such systems has been challenging, often requiring vast amounts of labeled data for training, which is time-consuming and expensive to acquire.
The concept of ‘zero-shot’ recognition has emerged as a promising solution. This means the system can identify activities it has never explicitly been trained on, making it highly adaptable to new smart home environments with different sensor setups and resident behaviors. Recent advancements in this field have heavily relied on Large Language Models (LLMs). These LLMs are fed natural language descriptions of sensor data, often through carefully crafted ‘prompts,’ to classify activities. While effective, this approach comes with significant drawbacks.
The Pitfalls of Prompting LLMs
The reliance on external LLM services introduces several risks. First, there are privacy concerns: sharing sensitive in-home data with an external party may be unacceptable to many users, especially in healthcare applications. Second, the system becomes dependent on the availability and stability of these external services; network issues or service outages could bring the entire HAR system to a halt. Finally, LLMs are known for their unpredictable behavior: their predictions can be inconsistent, and even minor version changes can degrade performance, making them unreliable for critical applications.
A Novel Approach: Thou Shalt Not Prompt
Researchers Sourish Gunesh Dhekane and Thomas Ploetz from the Georgia Institute of Technology have proposed an innovative solution that bypasses the need to prompt LLMs for activity predictions. Their paper, titled “Thou Shalt Not Prompt: Zero-Shot Human Activity Recognition in Smart Homes via Language Modeling of Sensor Data & Activities”, introduces a method that models sensor data and activities directly using natural language and their embeddings to perform zero-shot classification.
The core of their solution lies in two novel modules: ‘Summary Generation’ and ‘Activity Descriptor’.
How It Works: Language Modeling in Action
The process begins by converting raw sensor data into a concise textual summary. This ‘Summary Generation’ module captures the essence of the activity by including key information such as the time of occurrence, the duration of the activity, the top locations where the activity took place, and the most commonly fired sensors. For instance, a summary might describe an activity starting at a certain time, lasting for a specific duration, occurring mainly near a desk, and involving motion sensors.
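As a rough illustration of this kind of summary generation (a sketch, not the authors' exact implementation; the event schema and phrasing here are assumptions), consider:

```python
from collections import Counter
from datetime import datetime

def generate_summary(events):
    """Turn a list of (timestamp, location, sensor_id) events into a
    natural-language summary covering start time, duration, top
    locations, and most frequently fired sensors.
    Illustrative sketch only -- the event format and wording are
    assumptions, not the paper's exact template."""
    times = [e[0] for e in events]
    start, end = min(times), max(times)
    duration_min = (end - start).total_seconds() / 60
    top_locations = [loc for loc, _ in Counter(e[1] for e in events).most_common(2)]
    top_sensors = [s for s, _ in Counter(e[2] for e in events).most_common(3)]
    return (
        f"The activity started at {start.strftime('%H:%M')} and lasted "
        f"about {duration_min:.0f} minutes. It took place mainly in the "
        f"{' and '.join(top_locations)}. The most frequently fired "
        f"sensors were {', '.join(top_sensors)}."
    )

events = [
    (datetime(2024, 1, 1, 9, 0), "workspace", "motion_desk"),
    (datetime(2024, 1, 1, 9, 5), "workspace", "motion_desk"),
    (datetime(2024, 1, 1, 9, 30), "tv room", "motion_tv"),
]
print(generate_summary(events))
```

For the toy events above, this produces a summary of an activity starting at 09:00, lasting about 30 minutes, occurring mainly in the workspace, with desk motion sensors firing most often.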
Simultaneously, the ‘Activity Descriptor’ module generates precise textual descriptions for each activity of interest. These descriptions are crafted by leveraging smart home layouts and available metadata, detailing where an activity is likely to occur, its typical duration, and any signature sensor readings. For example, ‘Desk Activity’ might be described as taking place in the workspace and TV room when a person uses the desk.
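A minimal sketch of such a descriptor, composing a description string from hypothetical layout metadata fields (the field names and phrasing are assumptions, not the paper's template):

```python
def describe_activity(name, locations, typical_duration, signature_sensors):
    """Compose a textual description of an activity from smart-home
    layout metadata: likely locations, typical duration, and
    signature sensors. Sketch only -- fields and wording are
    assumptions for illustration."""
    return (
        f"{name} takes place in the {' and '.join(locations)}. "
        f"It typically lasts {typical_duration}. "
        f"Signature sensors include {', '.join(signature_sensors)}."
    )

desc = describe_activity(
    "Desk Activity",
    ["workspace", "TV room"],
    "between ten minutes and two hours",
    ["desk motion sensor", "chair pressure sensor"],
)
print(desc)
```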
Once both the sensor data summary and the activity descriptions are in text format, a pre-trained sentence encoder (like ‘all-distilroberta-v1’) is used to convert them into numerical representations called embeddings. The system then calculates the similarity between the embedding of the sensor data summary and the embeddings of all possible activity descriptions. The activity label corresponding to the description with the highest similarity is predicted as the ongoing activity. Crucially, this entire process requires no labeled or unlabeled sensor data for training, making it truly zero-shot.
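The matching step can be sketched as follows. A real system would obtain embeddings from a sentence encoder such as 'all-distilroberta-v1' (e.g., via the sentence-transformers library); to keep this sketch self-contained and runnable without a model download, a toy bag-of-words vector stands in for the encoder:

```python
import math
from collections import Counter

def toy_encode(text):
    """Toy bag-of-words stand-in for a pre-trained sentence encoder
    such as 'all-distilroberta-v1'. A real pipeline would call the
    encoder's embedding function here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(summary, descriptions):
    """Predict the activity whose description embedding is most
    similar to the sensor-summary embedding."""
    s_vec = toy_encode(summary)
    return max(descriptions, key=lambda a: cosine(s_vec, toy_encode(descriptions[a])))

descriptions = {
    "Desk Activity": "takes place in the workspace when a person uses the desk",
    "Sleeping": "takes place in the bedroom at night on the bed",
}
summary = "activity near the desk in the workspace with motion sensors firing"
print(classify(summary, descriptions))  # -> Desk Activity
```

With a real sentence encoder, the same argmax-over-similarities structure applies; only `toy_encode` would change, and no training data of any kind is needed.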
Performance and Advantages
The researchers evaluated their approach across six diverse datasets, showcasing its generalizability across different sensing modalities, layouts, and activities. Their solution achieved comparable performance to existing state-of-the-art LLM-based methods, but without the inherent risks. This means the system offers enhanced privacy, operates independently of external services, and provides more consistent predictions.
Furthermore, the method can be extended to ‘few-shot’ scenarios, where a small number of labeled data samples can significantly improve performance, highlighting its compatibility with human-in-the-loop HAR systems.
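One plausible way to fold few-shot labels into this framing (a hypothetical extension sketch, not necessarily the authors' exact mechanism) is to average the embeddings of the few labeled summaries per class into prototypes and classify new summaries against those, again using a toy bag-of-words encoding for self-containedness:

```python
import math
from collections import Counter

def encode(text):
    # Toy bag-of-words stand-in for a pre-trained sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prototypes(labeled_summaries):
    """Average the vectors of the few labeled summaries per activity
    into one prototype per class (hypothetical few-shot extension)."""
    protos = {}
    for label, texts in labeled_summaries.items():
        proto = Counter()
        for t in texts:
            proto.update(encode(t))
        protos[label] = Counter({k: v / len(texts) for k, v in proto.items()})
    return protos

labeled = {
    "Cooking": ["activity in the kitchen with stove sensor firing",
                "short activity near the stove and fridge"],
    "Sleeping": ["long activity in the bedroom at night on the bed"],
}
protos = build_prototypes(labeled)
query = "activity in the kitchen near the stove"
pred = max(protos, key=lambda lab: cosine(encode(query), protos[lab]))
print(pred)  # -> Cooking
```

In a human-in-the-loop setting, each newly labeled sample would simply update its class prototype, with no retraining of the encoder.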
Future Directions
While highly effective, the proposed method has areas for future improvement. Generating more dynamic and nuanced sensor data summaries without LLMs remains a challenge. The system also sometimes struggles to differentiate between semantically similar activities (e.g., ‘Sleeping’ for different residents or ‘Bed to Toilet’ vs. ‘Personal Hygiene’ due to similar movement patterns). Future work will focus on addressing these challenges, potentially through active learning frameworks to intelligently select few-shot examples, further enhancing the robustness of zero-shot HAR systems without relying on LLMs.
This research marks a significant step towards building more private, reliable, and adaptable human activity recognition systems for smart homes, moving away from the complexities and risks associated with large language models.