Combining Object Detectors and Trackers to Efficiently Detect Hands in Egocentric Videos
Ryan Visée (1,2), Jirapat Likitlersuang (1,2), José Zariffa (1,2)
(1) Institute of Biomaterials and Biomedical Engineering, University of Toronto
(2) KITE, Toronto Rehab-University Health Network
Individuals with cervical spinal cord injury (SCI) report upper limb function as their top recovery
priority. As a result, new treatments to improve hand function after SCI are needed. To accurately
measure the true impact these interventions have on patient function and independence, evaluation needs
to occur at home rather than in clinical settings. Although there are currently no tools to evaluate hand
function at home, videos from wearable cameras (egocentric videos) can capture patient
activities and be analyzed automatically using computer vision. However, the first step of this process, hand
detection, is difficult to do robustly and reliably, hindering the deployment of a complete monitoring
system in the home and community. We propose the combination of object detection and tracking
algorithms to create a system for fast and reliable hand detection in egocentric videos.
In previous experiments, a GoPro® Hero 4 wearable camera was used to collect data from 17
individuals with SCI performing activities of daily living in a home simulation laboratory at the Toronto
Rehabilitation Institute. We used these videos to create a large dataset for hand detection by manually
labelling bounding boxes around hands in each frame. Over 167,000 frames were annotated. We used this
data to investigate existing detection (YOLOv2 and Faster R-CNN) and tracking (MF, KCF, OLB, MIL,
TLD) algorithms in isolation. We then used the fastest detector to automatically initialize the best
trackers and to reset them every 100 frames (an interval chosen empirically) or after a tracker failure.
To further minimize detector usage, tracking was disabled when the tracker failed and the detector could
not locate the hand in 3 consecutive frames; the detector then checked only every 60 frames until the
hand was found.
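The scheduling described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the authors' implementation: `detect` and `make_tracker` are hypothetical callables standing in for the detector and tracker, boxes are assumed to be `(x, y, w, h)` tuples, and a detector miss at a scheduled reset is assumed to count toward the 3-miss limit (the text does not specify this case).

```python
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x, y, w, h) layout

REINIT_INTERVAL = 100  # detector resets the tracker every 100 frames (from the text)
MAX_MISSES = 3         # consecutive detector misses before tracking is disabled
IDLE_INTERVAL = 60     # frames between detector checks while tracking is disabled


def run_pipeline(
    frames: Iterable,
    detect: Callable[[object], Optional[Box]],
    make_tracker: Callable[[object, Box], Callable[[object], Optional[Box]]],
) -> Tuple[List[Optional[Box]], int]:
    """Return one box (or None) per frame and the number of detector calls.

    detect(frame) returns a box or None (hypothetical interface).
    make_tracker(frame, box) returns an update(frame) function that
    returns a box, or None on tracker failure (hypothetical interface).
    """
    boxes: List[Optional[Box]] = []
    detector_calls = 0
    update = None  # update function of the active tracker, if any
    age = 0        # frames covered since the last detector initialization
    misses = 0     # consecutive detector misses
    wait = 0       # frames to skip before the next detector check

    for frame in frames:
        if wait > 0:
            # Tracking is disabled; skip frames until the next check.
            wait -= 1
            boxes.append(None)
            continue

        if update is not None and age < REINIT_INTERVAL:
            box = update(frame)
            age += 1
            if box is not None:
                boxes.append(box)
                continue
            # Tracker failure: fall through to the detector.

        detector_calls += 1
        box = detect(frame)
        if box is not None:
            update = make_tracker(frame, box)  # initialize or reset
            age = 1
            misses = 0
        else:
            update = None
            misses += 1
            if misses >= MAX_MISSES:
                wait = IDLE_INTERVAL - 1  # next check 60 frames from now
        boxes.append(box)

    return boxes, detector_calls
```

With a tracker that never fails, the detector in this sketch runs only on every 100th frame, which is where the speed gain over running the detector on every frame would come from.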
The F1-scores (based on a thresholded intersection over union, IoU) for the best detector (YOLOv2) and
the best tracker (MF) alone were 0.90±0.08 and 0.49±0.27, respectively. The best combination method
achieved an F1-score of 0.83±0.16 while running five times faster than the fastest detector alone on a CPU.
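For readers unfamiliar with the metric, an F1-score based on a thresholded IoU can be computed roughly as follows. This is a sketch under assumptions: the IoU threshold of 0.5, the `(x, y, w, h)` box layout, the single-hand-per-frame simplification, and the rule that a low-overlap prediction counts as both a false positive and a false negative are illustrative choices, not values taken from the study.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x, y, w, h) layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def f1_score(
    predictions: Sequence[Optional[Box]],
    ground_truth: Sequence[Optional[Box]],
    threshold: float = 0.5,  # assumed value; the threshold is not stated in the text
) -> float:
    """F1 over frames, with at most one predicted and one labelled box per frame."""
    tp = fp = fn = 0
    for pred, gt in zip(predictions, ground_truth):
        if pred is not None and gt is not None and iou(pred, gt) >= threshold:
            tp += 1
            continue
        if pred is not None:
            fp += 1  # box predicted with no (or too little) overlap
        if gt is not None:
            fn += 1  # labelled hand that was missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Example: one hit, one miss, one spurious box, one badly placed box.
preds: List[Optional[Box]] = [(0, 0, 10, 10), None, (0, 0, 10, 10), (50, 50, 10, 10)]
truth: List[Optional[Box]] = [(0, 0, 10, 10), (0, 0, 10, 10), None, (0, 0, 10, 10)]
score = f1_score(preds, truth)
```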
Combining the fastest detector with the best tracker improved accuracy over online
trackers alone and speed over detectors alone. This improvement brings
researchers one step closer to developing ways to directly measure hand function in a patient’s daily life
at home, thus helping to restore independence after SCI.