Section: New Results
Recognizing Pedestrians using Cross-Modal Convolutional Networks
Participants : Danut-Ovidiu Pop, Fawzi Nashashibi.
Pedestrian detection and recognition is of great importance for autonomous vehicles. A pedestrian detection system depends on: 1) the sensors used to capture the visual data, 2) the features extracted from the acquired images, and 3) the classification process. Since datasets of images already exist (Daimler, Caltech and KITTI), we have focused only on the last two points. Our question is whether one modality can be used exclusively (standpoint one) to train the classification model that recognizes pedestrians in another modality, or only partially (standpoint two), to improve the training of the classification model in another modality. If the system is trained on multi-modal data, can it still work when the data from one of the domains is missing? How much information is redundant across the domains (can we regenerate data in one domain from the observations in the other)? How can a multi-modal system be trained when data in one of the modalities is scarce (e.g. many more images in the visual spectrum than in depth)? To our knowledge, these questions have not yet been answered for the pedestrian recognition task. Our work proposes to answer them through various experiments based on the Daimler stereo vision dataset. This year, we performed the following experimental studies (more details can be found in , , ):
a) Three different image modalities (Intensity, Depth, Optical Flow) are considered for improving the classification component, and two training schemes are analyzed: Classical Training and Cross Training. In the Cross Training method, the CNN is trained and validated on different image modalities, in contrast to the Classical Training method, in which each CNN is trained and validated on the same image modality.
b) An incremental model in which a CNN is first trained on image frames of the first modality; a second CNN, initialized by transfer learning from the first, is then trained on frames of the second modality; and finally a third CNN, initialized from the second, is trained on frames of the last modality.
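The incremental training chain in b) can be illustrated with a minimal sketch. The snippet below is an assumption-laden stand-in: it uses a tiny logistic-regression classifier instead of a CNN, and synthetic Gaussian data in place of the Daimler intensity, depth, and optical-flow crops. Only the structure of the method is shown: each stage is initialized with the weights learned by the previous stage, then fine-tuned on the next modality.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_modality(shift, n=200, d=16):
    """Synthetic 'modality': two Gaussian classes, shifted per modality.
    (A stand-in for intensity / depth / optical-flow pedestrian crops.)"""
    X0 = rng.normal(loc=-1.0 + shift, scale=1.0, size=(n, d))
    X1 = rng.normal(loc=+1.0 + shift, scale=1.0, size=(n, d))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def train(X, y, w=None, b=0.0, lr=0.1, epochs=100):
    """Gradient-descent logistic regression; (w, b) may come from a
    previously trained stage, which is the transfer initialization."""
    if w is None:
        w = np.zeros(X.shape[1])          # stage 1: train from scratch
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w = w - lr * (X.T @ g) / len(y)
        b = b - lr * g.mean()
    return w, b

def accuracy(X, y, w, b):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return float(((p > 0.5) == y).mean())

# Three synthetic "modalities" (intensity, depth, optical-flow stand-ins).
X_int, y_int = make_modality(0.0)
X_dep, y_dep = make_modality(0.2)
X_flow, y_flow = make_modality(-0.2)

# Stage 1: train on the first modality from scratch.
w, b = train(X_int, y_int)
# Stage 2: initialize from stage 1, continue training on the second modality.
w, b = train(X_dep, y_dep, w=w, b=b)
# Stage 3: initialize from stage 2, finish on the last modality.
w, b = train(X_flow, y_flow, w=w, b=b)

print("final-stage accuracy:", accuracy(X_flow, y_flow, w, b))
```

In the actual study each stage is a CNN and the transfer step copies the learned convolutional weights rather than a single weight vector, but the sequencing of the three modalities is the same.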
In  two different fusion schemes are studied: