Project-Team:STARS

Inria | Raweb 2017 | Presentation of the Project-Team STARS | STARS Web Site


	PDF	e-Pub

Previous |

Home | Next next

Section: New Results

Multi-shot Person Re-identification in surveillance videos

Participants : Furqan Khan, Seongro Yoon, François Brémond.

keywords: person re-identification, appearance modeling, long term visual tracking

Efficient Video Summarization Using Principal Person Appearance for Video-Based Person Re-Identification

In video-based person re-identification, while most work has focused on problems of person signature representation and matching between different cameras, intra-sample variance is also a critical issue to be addressed. There are various factors that cause the intra-sample variance such as detection/tracking inconsistency, motion change and background. However, finding individual solutions for each factor is difficult and complicated. To deal with the problem collectively, we assume that it is more effective to represent a video with signatures based on a few of the most stable and representative features rather than extract from all video frames. In this work, we propose an efficient approach to summarize a video into a few of discriminative features given those challenges. Primarily, our algorithm learns principal person appearance over an entire video sequence, based on low-rank matrix recovery method. We design the optimizer considering temporal continuity of the person appearance as a constraint on the low-rank based manner. In addition, we introduce a simple but efficient method to represent a video as groups of similar frames using recovered principal appearance. Experimental results (Table 5) show that our algorithm combined with conventional matching methods outperforms state-of-the-art on publicly available datasets PRID2011 [77] and iLIDS-VID [125].

**Table 5.** Comparison of recognition rates (%) at different ranks of various Re-ID methods on PRID and iLIDS-VID. Best results are highlighted in bold.
	PRID				iLIDS-VID
Method	r=1	r=5	r=10	r=20	r=1	r=5	r=10	r=20
HOG3D+RankSVM [125]	19.4	44.9	59.3	77.2	12.1	29.3	41.5	56.3
Color+RankSVM [125]	29.7	49.4	59.3	71.1	16.4	37.3	48.5	62.6
DVR [125]	28.9	55.3	65.5	82.8	23.3	42.4	55.3	68.6
ColorLBP [78]+RankSVM	34.3	56.0	65.5	77.3	23.2	44.2	54.1	68.8
DVDL [80]	40.6	69.7	77.8	85.6	25.9	48.2	57.3	68.9
Color+LFDA [106]	43.0	73.1	82.9	90.3	28.0	55.3	70.6	88.0
AFDA [88]	43.0	72.7	84.6	91.9	37.5	62.7	73.0	81.8
DSVR [126]	40.0	71.1	84.5	92.2	39.5	61.1	71.7	81.0
MTL-LORAE [119]	-	-	-	-	43.0	60.1	70.3	85.3
STFV3D+KISSME [93]	64.1	87.3	89.9	92.0	43.8	69.3	80.0	90.0
CNN+KISSME [142]	69.9	90.6	-	98.2	48.8	75.6	-	92.6
RFA-Net+RankSVM [131]	58.2	85.8	93.4	97.9	49.3	76.8	85.3	90.0
CNN+XQDA [142]	77.3	93.5	-	99.3	53.0	81.4	-	95.1
APR+XQDA [73]	68.6	94.6	97.4	98.9	55.0	87.5	93.8	97.2
TDL [133]	56.7	80.0	87.6	93.6	56.3	87.6	95.6	98.3
RCNN [96]	70.0	90.0	95.0	97.0	58.0	84.0	91.0	96.0
PPA+Euclidean	66.6	90.1	93.5	96.7	29.6	55.7	67.6	79.7
PPA+KISSME	85.7	98.9	99.9	100.0	65.7	92.3	96.8	99.1
PPA+XQDA	87.6	99.2	99.6	99.9	66.8	93.9	97.8	99.8

In order to get a deeper insight, Figure 15 presents some qualitative results visualizing principal appearance groups discovered by our approach without manual supervision. We compare our results with the approach of Shu et.al. [116] and found our results to be relatively more visually coherent and to have more groups. Details of this work can be found in our BMVC paper [36].

Figure 15. Visualization of principal appearance groups. Examples of our algorithm result are shown in comparison with ROSL [116] for the same process. The top and bottom on the left column are ID:218 and ID:016 of iLIDS-VID [125], and the top and bottom on the right column are ID:001 and ID:185 of PRID [77], respectively.

N_{g}

means the number of image groups.

Multi-shot Person Re-identification using Part Appearance Mixture

Appearance based person re-identification in real-world video surveillance systems is a challenging problem for many reasons, including ineptness of existing low level features under significant viewpoint, illumination, or camera characteristic changes to robustly describe a person's appearance. One approach to handle appearance variability is to learn similarity metrics or ranking functions to implicitly model appearance transformation between cameras for each camera pair, or group, in the system. The alternative, that is followed in this work, is the more fundamental approach of improving appearance descriptors, called signatures, to cater for high appearance variance and occlusions. A novel signature representation for multi-shot person re-identification, called Part Appearance Mixture (PAM), is henceforth presented that uses multiple appearance models, each describing appearance as a probability distribution of a low-level feature for a certain portion of an individual's body. It caters for high variance in a person's appearance by automatically trading compactness with variability as can be visually seen in the results presented in Figure 16.

Figure 16. Visualization of full-body appearance mixtures of HOG descriptor. For each person, first image is one of the input images used to learn appearance model. It is followed by the composite images, one for each component of the GMM. Optimal number of components for each appearance model varies between persons. (a)-(d) GMM components focus on different pose and orientation of person. (e)-(g) Transient occlusions are implicitly dealt with in appearance models as components focus on pose and orientation. (h) GMM components focus on different person alignments in the bounding box.

Signature representation has probabilistic interpretation of appearance signatures that allows for application of information theoretic similarity measures. A signature is acquired over coarsely localized body regions of a person in a computationally efficient manner instead of reliance on fine parts localization. We also define a Mahalanobis based distance measure to compute similarity between two signatures. The metric is also amenable to existing metric learning methods and appearance transformation between different scenes can be learned directly using proposed signature representation. Combined with metric learning, rank-1 recognition rates of $92.5 %$ and $79.5 %$ are achieved on PRID2011 [77] and iLIDS-VID [125] datasets, respectively, which establish a new state-of-the-art on both the datasets. Detailed comparisons with other contemporary unsupervised and supervised re-identification methods are presented in table 6 and table 7.

**Table 6.** Recognition rates (%) at different ranks for unsupervised methods. Best results are highlighted in bold.
Spatiotemporal Methods
	PRID2011				iLIDS-VID
Method	r=1	r=5	r=10	r=20	r=1	r=5	r=10	r=20
HOG3D [93]	20.7	44.5	57.1	76.8	8.3	28.7	38.3	60.7
FV3D [93]	38.7	71.0	80.6	90.3	25.3	54.0	68.3	87.7
STFV3D [93]	42.1	71.9	84.4	91.6	37.0	64.3	77.0	86.9
Spatial Methods
	PRID2011				iLIDS-VID
Method	r=1	r=5	r=10	r=20	r=1	r=5	r=10	r=20
SDALF [68]	5.2	20.7	32.0	47.9	6.3	18.8	27.1	37.3
eSDC [141]	25.8	43.6	52.6	62.0	10.2	24.8	35.5	52.9
FV2D [94]	33.6	64.0	76.3	86.0	18.2	35.6	49.2	63.8
PAM-HOG	50.6	72.2	83.6	93.0	22.9	44.3	55.7	69.3
PAM-LOMO	70.6	90.2	94.6	97.1	33.3	57.8	68.5	80.5

**Table 7.** Recognition rates (%) at different ranks of supervised methods. Best results are highlighted in bold.
Dictionary or Feature Learning Methods
	PRID2011				iLIDS-VID
Method	r=1	r=5	r=10	r=20	r=1	r=5	r=10	r=20
DVDL [80]	40.6	69.7	77.8	85.6	25.9	48.2	57.3	68.9
Color+LFDA [106]	43.0	73.1	82.9	90.3	28.0	55.3	70.6	88.0
AFDA [88]	43.0	72.7	84.6	91.9	37.5	62.7	73.0	81.8
MTL-LORAE [119]	-	-	-	-	43.0	60.1	70.3	85.3
RCNN [96]	70.0	90.0	95.0	97.0	58.0	84.0	91.0	96.0
Metric or Rank Learning Methods
	PRID2011				iLIDS-VID
Method	r=1	r=5	r=10	r=20	r=1	r=5	r=10	r=20
HOG3D+RankSVM [125]	19.4	44.9	59.3	77.2	12.1	29.3	41.5	56.3
Color+RankSVM [125]	29.7	49.4	59.3	71.1	16.4	37.3	48.5	62.6
ColorLBP [78]+RankSVM	34.3	56.0	65.5	77.3	23.2	44.2	54.1	68.8
DVR [125]	28.9	55.3	65.5	82.8	23.3	42.4	55.3	68.6
DSVR [126]	40.0	71.1	84.5	92.2	39.5	61.1	71.7	81.0
STFV3D+KISSME [93]	64.1	87.3	89.9	92.0	43.8	69.3	80.0	90.0
LOMO+XQDA [90]	-	-	-	-	53.0	78.5	86.9	93.4
LOMO+SBSR+XQDA [53]	-	-	-	-	68.5	87.9	93.0	96.3
RFA-Net+RankSVM [131]	58.2	85.8	93.4	97.9	49.3	76.8	85.3	90.0
CNN+KISSME [142]	69.9	90.6	-	98.2	48.8	75.6	-	92.6
CNN+XQDA [142]	77.3	93.5	-	99.3	53.0	81.4	-	95.1
PAM-HOG+KISSME	55.3	80.7	90.2	95.6	33.9	60.0	70.2	79.1
PAM-LOMO+KISSME	92.5	99.3	100.0	100.0	79.5	95.1	97.6	99.1

For further details, please refer to our paper [31].

Previous |

Home | Next next