DOI: 10.1145/2683483.2683490
research-article

Total Cluster: A person agnostic clustering method for broadcast videos

Published: 14 December 2014 Publication History

Abstract

The goal of this paper is unsupervised face clustering in edited video material, where face tracks arising from different people are assigned to separate clusters, with one cluster for each person. In particular, we explore the extent to which faces can be clustered automatically without making an error. This is a very challenging problem given the variation in pose, lighting and expression that can occur, and the similarities between different people.
The novelty we bring is threefold: first, we show that a form of weak supervision is available from the editing structure of the material, namely the shots, threads and scenes that are standard in edited video; second, we show that by first clustering within scenes, the number of face tracks can be significantly reduced with almost no errors; third, we propose an extension of the clustering method to entire episodes using exemplar SVMs trained on negative data automatically harvested from the editing structure.
The method is demonstrated on multiple episodes from two very different TV series, Scrubs and Buffy. For both series it is shown that we move towards our goal, and also outperform a number of baselines from previous work.
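
The within-scene clustering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the constraint, the descriptor, or the linkage rule. The sketch assumes that two face tracks overlapping in time cannot show the same person (a hypothetical stand-in for the weak supervision the editing structure provides), uses Euclidean distance on toy 2-D descriptors, and applies complete linkage with an arbitrary threshold. All names (`overlaps`, `cluster_scene`, `tracks`) are illustrative.

```python
from itertools import combinations
from math import dist


def overlaps(a, b):
    """True if two (start, end) frame intervals share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]


def cluster_scene(tracks, threshold):
    """Greedy agglomerative clustering of the face tracks of one scene.

    tracks: list of dicts with 'id', 'feat' (list of floats), 'span' (start, end).
    Repeatedly merges the closest pair of clusters whose distance is within
    the threshold, and never merges clusters that contain temporally
    overlapping tracks (assumed cannot-link constraint).
    """
    clusters = [[t] for t in tracks]

    def cannot_link(c1, c2):
        return any(overlaps(a["span"], b["span"]) for a in c1 for b in c2)

    def linkage(c1, c2):
        # Complete linkage: the distance of the farthest pair (conservative).
        return max(dist(a["feat"], b["feat"]) for a in c1 for b in c2)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if cannot_link(clusters[i], clusters[j]):
                continue
            d = linkage(clusters[i], clusters[j])
            if d <= threshold and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i stays valid
    return [sorted(t["id"] for t in c) for c in clusters]


# Toy scene: tracks "a" and "b" look alike but are on screen at the same time.
tracks = [
    {"id": "a", "feat": [0.00, 0.0], "span": (0, 50)},
    {"id": "b", "feat": [0.10, 0.0], "span": (10, 60)},
    {"id": "c", "feat": [0.02, 0.0], "span": (100, 150)},
    {"id": "d", "feat": [5.00, 5.0], "span": (200, 250)},
]
scene_clusters = cluster_scene(tracks, threshold=1.0)
print(scene_clusters)
```

In this toy scene, "a" and "c" are merged, while "b" stays separate despite its similar descriptor because it overlaps "a" in time. Under the same assumption, such overlapping tracks could also serve as negative examples when training a per-cluster classifier across an episode.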


Published In

ICVGIP '14: Proceedings of the 2014 Indian Conference on Computer Vision Graphics and Image Processing
December 2014
692 pages
ISBN:9781450330619
DOI:10.1145/2683483
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Face track clustering
  2. TV shows
  3. Video-editing structure

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVGIP '14

Acceptance Rates

Overall Acceptance Rate 95 of 286 submissions, 33%
