|Human activity recognition (HAR) has attracted growing attention from the research community owing to its many applications. In this paper we propose a multi-modal approach to video-based HAR. Our approach uses three modalities: raw RGB video data, depth sequences and 3D skeletal motion data. The skeletal data are transformed into a 2D image representation in the spectral domain. To extract spatio-temporal features from the available data, we propose a novel hybrid deep neural network architecture that combines a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. We focus on the recognition of activities of daily living (ADLs) and medical conditions, and we evaluate our approach on two challenging datasets.|
*** Title, author list and abstract as seen in the Camera-Ready version of the paper that was provided to Conference Committee. Small changes that may have occurred during processing by Springer may not appear in this window.