Abstract
In-hand object pose estimation is essential in various engineering applications, such as quality inspection, reverse engineering, and automated manufacturing. However, accurate pose estimation becomes difficult when objects are heavily occluded by the hand or blurred due to motion. To address these challenges, we propose a novel framework that leverages transformers for spatial-temporal reasoning across video sequences. Our approach uses transformers to capture both spatial relationships within each frame and temporal dependencies across consecutive frames, allowing the model to aggregate information over time and refine its pose predictions. A key innovation of our framework is a visibility-aware module, which dynamically adjusts pose estimates based on the object’s visibility. Drawing on the temporally-aware features extracted by the transformers, this module aggregates pose information across multiple frames, so the model maintains high accuracy even when portions of the object are not visible in certain frames. This capability is particularly crucial in dynamic environments where the object’s appearance can change rapidly due to hand movements or interactions with other objects. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art techniques, achieving a 6% improvement in overall accuracy and over 11% better performance in handling occlusions.
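The record reproduces only the abstract, so the architecture is described at a high level. Purely as an illustrative sketch, and not the authors' implementation, the PyTorch snippet below shows one plausible way to combine a temporal transformer with a visibility-weighted aggregation of per-frame pose estimates; all names (`VisibilityAwareAggregator`, `feat_dim`, `pose_dim`, the weighting scheme) are hypothetical assumptions rather than details taken from the paper.

```python
# Hypothetical sketch (not the published method): per-frame object features are
# fused across time by a transformer encoder, and per-frame pose estimates are
# blended using weights derived from a predicted visibility score.
import torch
import torch.nn as nn


class VisibilityAwareAggregator(nn.Module):
    def __init__(self, feat_dim: int = 256, num_layers: int = 2, pose_dim: int = 9):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        # Temporal transformer: attends across the frames of a video clip.
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Per-frame visibility score in [0, 1].
        self.visibility_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        # Per-frame pose estimate (e.g. 3-D translation plus a continuous
        # rotation representation; the exact parameterisation is assumed).
        self.pose_head = nn.Linear(feat_dim, pose_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) per-frame object features.
        temporal_feats = self.temporal_encoder(frame_feats)
        visibility = self.visibility_head(temporal_feats)       # (B, T, 1)
        per_frame_pose = self.pose_head(temporal_feats)          # (B, T, pose_dim)
        # Down-weight frames in which the object is heavily occluded.
        weights = visibility / visibility.sum(dim=1, keepdim=True).clamp_min(1e-6)
        return (weights * per_frame_pose).sum(dim=1)              # (B, pose_dim)


if __name__ == "__main__":
    model = VisibilityAwareAggregator()
    clip_features = torch.randn(2, 8, 256)  # 2 clips, 8 frames each
    print(model(clip_features).shape)        # torch.Size([2, 9])
```

The weighted sum is the simplest way to let low-visibility frames contribute less to the final estimate; a faithful reimplementation would follow whatever fusion and pose parameterisation the paper actually specifies.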
| Original language | English |
| --- | --- |
| Pages (from-to) | 35733-35749 |
| Number of pages | 17 |
| Journal | IEEE Access |
| Volume | 13 |
| DOIs | |
| Publication status | Published - 2025 |
Keywords
- Pose estimation
- deep learning
- intelligent systems
- machine vision
- robot vision systems
- supervised learning
ASJC Scopus subject areas
- General Computer Science
- General Materials Science
- General Engineering