Combining Object Detectors and Trackers to Efficiently Detect Hands in Egocentric Videos

Ryan Visée (1,2), Jirapat Likitlersuang (1,2), José Zariffa (1,2)

  1. Institute of Biomaterials and Biomedical Engineering, University of Toronto

  2. KITE, Toronto Rehab-University Health Network

Background

Individuals with cervical spinal cord injury (SCI) report upper limb function as their top recovery priority, and new treatments to improve hand function after SCI are therefore needed. To accurately measure the impact of these interventions on patient function and independence, evaluation needs to occur at home rather than in clinical settings. Although there are currently no tools to evaluate hand function at home, videos from wearable cameras (egocentric videos) can be used to monitor patient activities and can be analyzed automatically using computer vision. However, the first step of this process, hand detection, is difficult to perform robustly and reliably, hindering the deployment of a complete monitoring system in the home and community. We propose combining object detection and tracking algorithms to create a system for fast and reliable hand detection in egocentric videos.

Methods

In previous experiments, a GoPro® Hero 4 wearable camera was used to collect data from 17 individuals with SCI performing activities of daily living in a home simulation laboratory at the Toronto Rehabilitation Institute. We used these videos to create a large dataset for hand detection by manually labelling bounding boxes around hands in each frame; over 167,000 frames were annotated. We used this dataset to evaluate existing detection (YOLOv2 and Faster R-CNN) and tracking (MF, KCF, OLB, MIL, TLD) algorithms on their own. Finally, we used the fastest detector to automatically initialize and reset the best trackers every 100 frames (an interval chosen empirically) or after tracker failure. Further, to minimize detector usage, the tracker was disabled if it failed and the detector was unable to locate the hand in 3 consecutive frames; the detector was then run only every 60 frames until the hand was found.
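To make the scheme concrete, a minimal sketch of this combination logic is shown below. It assumes OpenCV's tracking API and a placeholder detect_hand function standing in for the trained detector; the KCF tracker and the (x, y, w, h) box format are used here purely for illustration, and this is not the implementation used in the study.

```python
import cv2

# Intervals taken from the Methods section: trackers are re-initialized every
# 100 frames (chosen empirically), and after 3 consecutive missed detections
# the detector is only re-run every 60 frames until the hand reappears.
RESET_INTERVAL = 100
IDLE_INTERVAL = 60
MAX_MISSES = 3


def detect_hand(frame):
    """Hypothetical stand-in for the trained detector (YOLOv2 in the study).
    Should return an (x, y, w, h) bounding box, or None if no hand is found."""
    raise NotImplementedError("plug in a trained hand detector here")


def detect_and_track(video_path):
    """Run the detector/tracker combination over a video, one box per frame."""
    cap = cv2.VideoCapture(video_path)
    tracker = None   # active OpenCV tracker, or None when tracking is disabled
    misses = 0       # consecutive frames where the detector found no hand
    frame_idx = 0
    boxes = []

    while True:
        ok, frame = cap.read()
        if not ok:
            break

        box = None
        # Let the tracker follow the hand between scheduled resets.
        if tracker is not None and frame_idx % RESET_INTERVAL != 0:
            ok, tracked = tracker.update(frame)
            box = tuple(int(v) for v in tracked) if ok else None

        if box is None:
            # Run the detector on tracker failure or reset; if tracking is
            # disabled, only check every IDLE_INTERVAL frames.
            run_detector = tracker is not None or frame_idx % IDLE_INTERVAL == 0
            detection = detect_hand(frame) if run_detector else None
            if detection is not None:
                # Re-initialize the tracker on the fresh detection. KCF is used
                # here for brevity; the study's best tracker was Median Flow.
                tracker = cv2.TrackerKCF_create()
                tracker.init(frame, detection)
                box, misses = detection, 0
            elif run_detector:
                misses += 1
                if misses >= MAX_MISSES:
                    tracker = None  # disable tracking until a hand is found again

        boxes.append(box)
        frame_idx += 1

    cap.release()
    return boxes
```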

Results

The F1-scores (based on a thresholded intersection over union, IoU) for the best detector (YOLOv2) and best tracker (MF) alone were 0.90±0.08 and 0.49±0.27, respectively. The best combination method resulted in an F1-score of 0.83±0.16 while running five times faster than the fastest detector alone on a CPU.
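For reference, this metric can be sketched as follows: a predicted box counts as a true positive when its IoU with the manually labelled box exceeds a threshold, and the F1-score is the harmonic mean of the resulting precision and recall. The sketch below is simplified to one box per frame, and the 0.5 threshold is an assumption; it is not the study's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) bounding boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def f1_score(predictions, ground_truths, iou_threshold=0.5):
    """Frame-level F1-score; entries are (x, y, w, h) boxes or None."""
    tp = fp = fn = 0
    for pred, gt in zip(predictions, ground_truths):
        if pred is not None and gt is not None:
            if iou(pred, gt) >= iou_threshold:
                tp += 1
            else:
                fp += 1  # predicted box does not overlap the label enough
                fn += 1  # and the labelled hand is effectively missed
        elif pred is not None:
            fp += 1      # predicted a hand where none was labelled
        elif gt is not None:
            fn += 1      # missed a labelled hand
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```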

Conclusions

The combination of the fastest detector and the best tracker improved accuracy over online trackers alone while improving speed compared to detectors alone. This improvement will bring researchers one step closer to new ways of directly measuring hand function in a patient's daily life at home, thus helping restore independence after SCI.
