TLDR: This research paper introduces a federated learning (FL) framework for pose-based human activity recognition (HAR) in smart manufacturing. Using a custom dataset of upper-body gestures from five participants, the study demonstrates that FL, particularly with a Transformer model, significantly outperforms centralized training in terms of generalization accuracy on both global and unseen external test sets. The findings show that FL not only preserves data privacy by avoiding raw data transfer but also substantially improves cross-user generalization, making it a practical and scalable solution for industrial worker assistance systems.
In the evolving landscape of smart manufacturing, accurately recognizing worker actions in real-time is crucial for boosting productivity, ensuring safety, and fostering seamless human-machine collaboration. Traditional methods for human activity recognition (HAR) often rely on large, centralized datasets. However, in industrial environments, this approach presents significant challenges, particularly concerning data privacy and the logistical complexities of centralizing sensitive information from various sites or workers.
A recent research paper, titled “Federated Action Recognition for Smart Worker Assistance Using FastPose,” addresses these challenges by proposing a federated learning (FL) framework for pose-based human activity recognition. The paper, authored by Vinit Hegiste, Vidit Goyal, Tatjana Legler, and Martin Ruskowski, explores how FL can enable decentralized model training without the need to transfer raw, private data, making it well suited to privacy-sensitive industrial scenarios.
Overcoming Data Privacy and Generalization Hurdles
The core of this research lies in its innovative approach to training HAR models. Instead of pooling all data into one central location, federated learning allows individual clients (in this case, different participants or industrial sites) to train models on their local, private datasets. Only the model updates, not the raw data, are shared with a central server, which then aggregates these updates to create a global model. This method inherently preserves data privacy.
The researchers developed a custom skeletal dataset specifically for smart worker assistance, comprising eight industrially relevant upper-body gestures. This data was collected from five volunteer participants, with each participant’s data treated as a distinct client dataset. To process the video data, a modified FastPose model was used to extract 2D skeletal keypoints, simplifying the original 17 keypoints to a more compact 13-joint representation, which helps reduce noise and improve processing efficiency.
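The 17-to-13 keypoint reduction can be illustrated with a minimal sketch. It assumes the pose estimator emits keypoints in the standard COCO 17-point order and that the four eye and ear points are the ones discarded; the article does not say which joints the authors removed, so `DROPPED`, `reduce_keypoints`, and `reduce_sequence` are hypothetical names chosen for illustration only.

```python
# COCO 17-keypoint order (ASSUMPTION: the pose model follows this layout).
COCO_17 = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# ASSUMPTION: eyes and ears are dropped to reach 13 joints.
# The paper's actual selection may differ.
DROPPED = {"left_eye", "right_eye", "left_ear", "right_ear"}
KEEP_IDX = [i for i, name in enumerate(COCO_17) if name not in DROPPED]


def reduce_keypoints(frame):
    """Map one frame of 17 (x, y) pairs to the 13 retained joints."""
    if len(frame) != len(COCO_17):
        raise ValueError(f"expected {len(COCO_17)} keypoints, got {len(frame)}")
    return [frame[i] for i in KEEP_IDX]


def reduce_sequence(frames):
    """Apply the reduction to every frame of a gesture clip."""
    return [reduce_keypoints(frame) for frame in frames]


if __name__ == "__main__":
    # Dummy clip: 3 frames, each keypoint placed at (index, index).
    clip = [[(float(i), float(i)) for i in range(17)] for _ in range(3)]
    reduced = reduce_sequence(clip)
    print(len(reduced), "frames,", len(reduced[0]), "joints per frame")
```

Each reduced frame carries 26 values (13 joints, 2 coordinates each) instead of 34, which is the kind of compact per-frame input a temporal model would then consume.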
Model Architectures and Training Paradigms
Two types of temporal models were evaluated: a Long Short-Term Memory (LSTM) network and a Transformer encoder. These models were trained and assessed under four distinct paradigms:
- Centralized Training: All data from all participants was pooled together and used to train a single model. This represents the traditional approach without privacy considerations.
- Local (Per-Client) Training: Each client trained its own model independently, without any collaboration or data sharing.
- Federated Learning (FedAvg): Clients trained models locally and shared updates with a central server, which aggregated them using weighted federated averaging.
- Federated Ensemble Learning (FedEnsemble): Similar to the FedAvg setup, but the pooled (centralized) dataset was uniformly partitioned among clients. This allowed the researchers to assess the benefits of ensemble learning in a federated setup, even when privacy isn’t the primary concern.
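The aggregation step and the two ways of assigning data to clients can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: model parameters are flattened into plain lists of floats, `fed_avg` weights each client by its sample count (the usual reading of "weighted federated averaging"), and `uniform_partition` interprets the FedEnsemble split as a shuffled, even division of the pooled data. Both function names and the shuffle-based split are assumptions.

```python
import random


def fed_avg(client_weights, client_sizes):
    """One FedAvg aggregation: sample-count-weighted mean of client parameters."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, n in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        for j, w in enumerate(weights):
            global_weights[j] += (n / total) * w
    return global_weights


def uniform_partition(samples, num_clients, seed=0):
    """ASSUMPTION: FedEnsemble-style split = shuffle pooled data, deal it out evenly."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::num_clients] for i in range(num_clients)]


if __name__ == "__main__":
    # Two clients with 1 and 3 samples: the larger client dominates the average.
    print(fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
    # Ten pooled samples dealt to five clients, two each.
    print(uniform_partition(range(10), 5))
```

In the per-participant federated setting, each client's list would instead hold only that participant's recordings, so client distributions differ; the uniform split removes that heterogeneity while keeping the same aggregation step.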
Remarkable Performance Gains
The results were compelling. On a unified global test set, the federated Transformer model achieved 69.5% accuracy, which was a significant 12.4 percentage point improvement over the centralized training approach. The federated LSTM also showed a notable gain of 9.9 percentage points, reaching 59.9% accuracy. These improvements suggest that aggregating diverse local updates in FL acts as a regularization mechanism, preventing overfitting to specific client biases and leading to better generalization.
Even more striking were the results when evaluating the models on an unseen external client – a participant whose data was not included in any training phase. Here, the FL Transformer achieved 64.29% accuracy, a remarkable 52.58 percentage point increase compared to the centralized model. The FedEnsemble Transformer performed even better, reaching 69.98% accuracy, a 58.27 percentage point gain. This demonstrates that FL not only preserves privacy but also substantially enhances the model’s ability to generalize to new, unseen users, which is critical for real-world deployment in diverse industrial settings.
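The centralized baselines are not quoted directly, but they follow from the reported accuracies and gains. The short calculation below assumes each gain is an absolute percentage-point difference against the centralized model of the same architecture; the dictionary labels and the helper `implied_baseline` are illustrative names, not taken from the paper.

```python
# (reported accuracy in %, reported gain over centralized in percentage points)
reported = {
    "global test, FL Transformer": (69.5, 12.4),
    "global test, FL LSTM": (59.9, 9.9),
    "external client, FL Transformer": (64.29, 52.58),
    "external client, FedEnsemble Transformer": (69.98, 58.27),
}


def implied_baseline(accuracy, gain_pp):
    """Centralized accuracy implied by a federated accuracy and its gain."""
    return round(accuracy - gain_pp, 2)


baselines = {
    name: implied_baseline(acc, gain) for name, (acc, gain) in reported.items()
}

if __name__ == "__main__":
    for name, value in baselines.items():
        print(f"{name}: implied centralized accuracy = {value}%")
```

Under that assumption, the centralized models sit at about 57.1% (Transformer) and 50.0% (LSTM) on the global test set, and both external-client figures point to the same centralized Transformer baseline of about 11.71%. That agreement is a useful consistency check on the reported numbers.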
Implications for Smart Manufacturing
The study highlights that federated learning is a highly effective solution for pose-based human activity recognition in industrial environments characterized by distributed and heterogeneous data. It consistently outperformed both centralized training and isolated local models. The observed “ensemble effect” in FedEnsemble learning further suggests that even without strict privacy constraints, FL can be a robust training strategy, especially when dealing with small or distributed datasets common in manufacturing.
This research paves the way for scalable, privacy-aware HAR solutions in smart factories, enabling intelligent assistance systems, enhancing worker safety, and improving productivity without compromising sensitive data. Future work aims to scale this framework to larger client populations, incorporate advanced aggregation methods, and integrate multi-sensor fusion for even greater robustness in challenging industrial environments.


