In this report, we present our investigations into feature representation, the hand mask modality, past action prediction, and model ensembling for the EPIC-Kitchens Action Anticipation Challenge. Building upon an existing action anticipation model, RULSTM, our framework uses an enhanced feature representation, places more emphasis on many-shot objects, and incorporates an additional hand mask modality. We also explore a network modification to capture past action prediction. Furthermore, to exploit all the training data and aggregate complementary information from different models, we employ a model ensemble. We achieved a top-1 action anticipation accuracy of 16.02% for Seen Kitchens (S1) and 10.11% for Unseen Kitchens (S2). Our submission, under the team name VI-I2R, placed 2nd for both seen and unseen kitchens in terms of top-1 action anticipation accuracy.
License type:
PublisherCopyrights
Funding Info:
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046) and the National Research Foundation, Singapore, under its NRF-ISF Joint Call (Award NRF2015-NRF-ISF001-2541).