
Toward a Holistic Approach to the Socio-historical Analysis of Vernacular Photos

Published: 01 November 2022

Abstract

Although family photo albums have been among the most popular forms of photography since the end of the 19th century, scholarly interest in them dates back only to the early 1980s. Such collections of photos may reveal sociological and historical insights regarding specific cultures and times. They are, however, in most cases scattered among private homes and only available on paper or photographic film, making their collection and analysis by historians, socio-cultural anthropologists, and cultural theorists very cumbersome. Computer-based methodologies could aid such a process in various ways, for example, by speeding up the cataloging step with modern computer vision techniques. We here investigate such an approach, introducing the design and development of a multimedia application that may automatically catalog vernacular pictures drawn from family photo albums. To this aim, we introduce the IMAGO dataset, which is composed of photos belonging to family albums assembled at the University of Bologna’s Rimini campus since 2004. Exploiting the proposed application, IMAGO has offered the opportunity of experimenting with photos taken between the years 1845 and 2009. In particular, it has been possible to estimate their socio-historical content, i.e., the dates and contexts of the images, without resorting to any other sources of information. Exceeding our initial expectations, such an approach has revealed its merit not only in terms of performance but also in terms of the foreseeable implications for the benefit of socio-historical research. To the best of our knowledge, this contribution is among the few that move along this path at the intersection of socio-historical studies, multimedia computing, and artificial intelligence.

1 Introduction

Following Kodak’s invention of the first megapixel sensor in 1986, digital photography has slowly grown to substitute its analog predecessor, playing a key role in the early 21st-century digital revolution and social transformation [39, 53]. As a relevant example, photography has modified the way mobile phones are used, as their integration of digital cameras has fostered at once an exponential growth of the photos that are shot and uploaded to the Internet every year and a paradigm shift in mobile communications, which today rely on high-quality multimedia [6, 15, 43, 63]. These phenomena have proven to be game-changers both for how people communicate and for the bloom of new fields of research, as academia and industry have exploited such a plethora of visual data to develop and apply computer vision models to a variety of different problems (e.g., face recognition, autonomous driving) [16, 32, 33, 44, 59, 66]. Now, while a wealth of research is being devoted to the processing and analysis of digital images, much remains to be done for analog ones, mainly because printed images representing a place (or an environment) at a given time may be (1) scattered in numerous public and private collections, (2) of variable quality, and (3) damaged due to hard or continued use or exposure. In addition, any analysis by means of image processing and computer vision algorithms requires an initial, potentially quality-degrading digitization step. However, despite the complications and challenges brought on by analog photographs, they still represent an unparalleled source of information regarding the recent past: in fact, no other visual medium has been used as pervasively to capture the world throughout the 20th century, as the availability of consumer-grade photo cameras supported the spread and popularity of vernacular photography practices (e.g., travel photos, family snapshots, photos of friends and classes) [35, 36].
Family photo albums represent an example of vernacular photography that has drawn the attention of researchers and public institutions. A recent work defines family photo albums as a globally circulating form that not only takes locally specific forms but also “produces localities” that create and negotiate individual stories [48]. Along the same lines, in another relevant contribution, family albums represent a reference point for the conservation, transmission, and development of a community’s Social Heritage [14]. In essence, scholars from different fields agree in identifying such types of photography collections as capable of capturing salient features regarding the evolution of local communities in space and time. A large-scale analysis of such collections of photos is often impossible, as a manual verification of the characteristics of more than a few hundred pictures would be exceedingly burdensome, considering also that in many cases no associated descriptions are available. This is why contributions in this field normally base their findings on the study of small corpora of photos [14, 48]. This work addresses this problem, taking as a case study the socio-historical analysis of a collection of family album photographs: we here present the design and implementation of a multimedia application that, resorting to deep learning models, implements their classification for cataloging purposes. To verify the validity of such an approach, the application is exploited to classify a novel dataset, namely IMAGO, collected and maintained at the University of Bologna [14]. In particular, the contributions of this work amount to:
A deep-learning-based multimedia application to assist socio-historians in their cataloging work, which consists of identifying the socio-historical information of an image, i.e., its shooting year and socio-historical context, according to the definitions provided in [14]. While the dating task has so far been considered in the literature [26, 37, 47], the estimation of the socio-historical context has not yet been investigated.
The introduction of a family photo album collection, namely IMAGO, comprising over 80,000 analog photos taken between 1845 and 2009, belonging to ca. 1,500 families, primarily from the Emilia-Romagna and immediately neighboring regions in Italy.
A thorough evaluation of the performance obtained by Convolutional Neural Network (CNN) models [29, 31, 55] trained on the IMAGO dataset for both the dating and the estimation of the socio-historical context. In order to assess the validity of the proposed framework, the performance of the proposed approach is also contrasted with the expertise of a socio-historical scholar.
A comparison between the performance of the adopted CNN-based approach and a Transformer-based one [23, 58].
The rest of the article is organized as follows: In Section 2 we sketch the necessary socio-historical background. In Section 3 we review the state of the art that falls closest to our contribution. Section 4 aims at presenting our multimedia tool designed to assist socio-historians, whereas Section 5 provides a description and the main characteristics of the dataset adopted to verify the proposed approach. Section 6 presents and validates the models trained on the proposed dataset to define an evaluation baseline. In Section 7 we compare the classification performance of our application with the results obtained by a socio-historical scholar. In Section 8 the performance of the adopted CNN models is instead contrasted with the Transformer-based ones. Finally, in Section 9 an overall discussion is carried out and possible directions for future works are provided.

2 The Socio-Historical Background

In this section we sketch the socio-historical background required to set the stage for this work. In fact, no classification problem can be solved without first clarifying what the classification categories are. This review aims at providing the basics necessary to understand how contexts and categories emerge in socio-historical studies. To do this, we begin by delineating the main differences between traditional and social history. We then explain how and why family photo albums fall within the areas of interest of this field of study, finally introducing the process that socio-historians implement when cataloging a corpus of data.
Social History amounts to an interdisciplinary field of research that combines sociological and historical methods to understand how societies have developed over time and how the past has influenced and may influence the present [50]. In the words of Cabrera, traditional history and social history differ as follows: Traditional history, especially classical political history, was based on the concept of subject: the subjectivity of historical agents was rational and autonomous; the subject a preconstituted center; and, therefore, actions were caused, and fully explained, by the intentions that motivated them. Social history, on the other hand, was based on the concept of society. For social historians, subjectivity and culture are not rational creations but representations or expressions of the social context in which the causes of actions were to be found [9]. Such social contexts, with their own historical logic, represent the ground on which categories are constructed, to grasp the meaning and organize social reality [10]: the categories represent a complex relational network whose nature is neither subjective nor objective but the result of a specific historical phenomenon with its own behavior. Therefore, the categories do not constitute a simple means of transmitting social reality but are an active part in its definition and are called socio-historical contexts.
Now, the starting point of a socio-historical analysis is the space in which the interweaving between individual initiative and social coercion takes place. An attempt is usually made to explain how society works on different theoretical bases, resorting to traditional oppositions: public/private, subjective/objective, ideal/material, visible/invisible, body/conscience. Further analyses are then introduced turning to the concept of social imaginary, defined as “The way in which ordinary people imagine their social contexts which, often, does not translate into a theoretical formulation but is conveyed in images, stories and legends” [12]. In essence, any socio-historical context introduced in such analyses should describe the evolution of social history and therefore the change of sociality and of people’s behavior in a defined space/time. To this aim, socio-historical categories are first identified by studying historical archival documents on different topics (e.g., economics, traditions, wars). Among such documents, contemporary historians now also resort to multimedia sources [19]. Out of the many multimedia sources available today, photography emerges as the one capable of covering the greatest time span so far, although photographs have risen to the dignity of primary sources of information only in the last few decades [54].
For the purposes of this work, socio-historical categories have been obtained relying on the study of family album photos. This particular kind of picture originates from, and at the same time represents, a fundamental component of social structures and a well-known socio-historical abstraction: the Family [8]. The Family is, indeed, a fundamental construct in Social History studies, since it embodies at once the public and the private spheres. In fact, the photos contained in family albums can be read, on the one hand, as private visual memories of one’s own history, destined to remain hidden from society, and, on the other hand, as traces and signs of the collective social imaginary of a given historical period. Family album photographs (e.g., spontaneous and/or anonymous images otherwise destined to remain hidden) thus depict the daily existence of their time and may be read not solely as memories but also as a network of signs, traces, and documents that can be used to interpret the past [54].
Although socio-historical contexts may emerge from the study of archival documents and family album photographs, the specific context of a specific photo may remain hard to tell. This is due to the fact that, without knowing when a picture was taken and what the people portrayed there were doing, it may be impossible to associate any accurate information with the picture. Accurate information in most cases may be obtained only by resorting to the knowledge of the subjects represented in the photo. For this reason, socio-historians rely on the knowledge of the main source, if available, which may be the owner of the photograph, for example. Indeed, such information could be impossible to find: when studying and cataloging a corpus of photos, no reliable source of information may be available. This problem is common for socio-historical scholars, and in such cases they resort to other approaches, which may include classifying data based on a visual inspection, implementing onerous processes to reduce as much as possible the misclassification of socio-historical features. As a relevant example, consider Enns and Martin [24], where the authors collected, visually analyzed, and classified 355 photos related to women involved in agricultural learning activities.

3 Related Work

In this section we analyze the works that fall closest to ours in terms of datasets and tasks. Only a few have so far analyzed analog collections of vernacular photographs [26, 37, 47]. For example, Ginosar et al. [26] employed a deep learning approach to analyze and date 37,921 historical frontal-facing American high school yearbook photos taken from 1928 to 2010. Here, a CNN architecture was trained to analyze people’s faces and predict the year in which a photo was taken. Along the same line, Salem et al. [47] presented a dataset containing images from high school yearbooks, covering in this case the 1950-to-2014 time span (considering 1,400 photos per year). They resorted to CNNs to estimate the precise image shooting year. In order to assess the characteristics that allow a picture to be correctly classified, they considered both color and grayscale images containing (1) faces, (2) torsos (i.e., upper bodies including people’s faces), and (3) random regions from the images. The best performance was obtained considering color images portraying the torsos of people. Their results confirmed that human appearance is strongly related to time. Müller et al. [37] instead analyzed the dating task through the lenses of vernacular and landscape photos belonging to the years 1930 through 1999, amounting to at most 25,000 pictures per year. The authors proposed different baselines relying on deep CNNs, considering dating as both a regression and a classification task. In Table 1 we summarize the characteristics (image content, number of images, and covered time span) of the archives employed in the works described so far. In most cases only specific subsets of such archives have been analyzed by means of computer vision techniques. To ease comparison, the same information regarding the IMAGO collection (i.e., the collection originating the dataset analyzed in this work) is provided in the last row of the same table.
Table 1.
Original Dataset | Type(s) of Photography | Type(s) of Camera | Theme | Cardinality | Period
Ginosar et al. [26] | Portrait | Digital and analog | Frontal face from high school yearbook | 168,055 | 1905–2013
Salem et al. [47] | Portrait | Digital and analog | High school yearbook | ca. 600,000 | 1912–2014
Müller et al. [37] | Vernacular and landscape | Digital and analog | No specific theme | 1,029,710 | 1930–1999
IMAGO collection | Vernacular | Analog | Family albums | ca. 80,000 | 1845–2009
Table 1. Characteristics of Existing Datasets and IMAGO
Other works have already investigated the digital cataloging of historical photos [2, 18, 45]. Lincoln et al. [34], for example, developed a prototype to find duplicates and tag photos depicting similar scenes in the Carnegie Mellon University Archives’ General Photograph Collection. Tilton and Arnold [56], instead, draw on scholarship from semiotics and visual cultural studies to develop a framework called distant viewing, to individuate larger patterns within a corpus that may be difficult to discern by closely studying only a small set of objects (e.g., narrative arcs in American sitcoms). One of the works that falls closest in scope was published by Wevers and Smits [62], where the CHRONIC and the SIAMESET datasets were introduced to study the transition from illustrations to photographs in the history of Dutch newspapers.
Concluding, for the works and datasets cited in this section, no pre-defined socio-historical categories were utilized as means of analysis. In addition, none considered the family album theme: to the best of our knowledge, the present work amounts to the first contribution to investigate the classification of such photos according to the socio-historical context definitions and background (Sections 2, 5.1, and 5.2).

4 A Socio-Historical Cataloging Tool for Family Photo Albums

Socio-historical analyses include dealing with various sources of information, systematically examining their soundness, exemplarity, and meaning, and seeking inter- and intra-correlations and relationships that may help in understanding what really happened in the past [7]. Sources are in general not objective but shaped by the politics, practices, and events that selectively document protest [17]. In summary, the procedure of historical inquiry implies the following steps: (1) identification and selection of sources, (2) registration and classification for further investigation, and (3) a critical inquiry of the collection. From here, a socio-historian’s work can then proceed in multiple directions. A sound socio-historical study may hence require the inspection and classification of hundreds or even thousands of documents and images [4, 24, 49]. This amounts to burdensome work, which often seeks the big picture provided by large corpora of data rather than the specific information returned by a single document or image. Such a process opens the door to the use of automatic tools, capable of classifying great amounts of data in short amounts of time. This was already discussed over two decades ago, for example, in [25], where the author illustrated linguistic and statistical tools that could be profitably used by historians and social historians in the study of events. Now, much more can clearly be expected thanks to the development of computing tools capable of handling growing amounts of multimedia data originating from heterogeneous sources. This would require a holistic approach taking care of source(s): (1) digitization, (2) accessibility through standard interfaces, and (3) analysis with models capable of translating socio-historical tasks into computing ones.
Now, a typical socio-historical task amounts to inferring from and subsequently applying categorical models to large corpora of data (Section 2). We apply this idea to the case of family photos, proposing a multimedia tool capable of processing and cataloging such pictures. To this aim, in Figure 1 we show the components of the proposed application. The core is the Socio-Historical Module (SHM), which is composed of one or more classifiers, depending on the socio-historical tasks of interest. For the purpose of this work, such tasks have been defined on top of family album photos originating from the IMAGO dataset (details regarding its socio-historical value are discussed in Section 5). This dataset offered the opportunity of predicting two pieces of socio-historical information: the context and the shooting year. In brief, the SHM amounts to a tool that may automatically label photos with the obtained predictions, giving, in addition, the opportunity of confirming or correcting such estimates, when necessary, during cataloging procedures.
Fig. 1.
Fig. 1. Schema of the multimedia support application for socio-historians.
The classifiers that compose the SHM could be defined exploiting different kinds of computer vision techniques. However, in the last decade, Deep Learning (DL) approaches have generally provided higher accuracies [51], both for the dating task [26, 37, 47] and for the analysis of historical picture datasets [34, 56]. For such reasons, we also exploited such tools in the development of the SHM. In particular, inspired by the work of Salem et al. [47], we trained several classifiers considering different image regions belonging to the same picture, selected using different criteria. To this aim, we considered the whole image and the crops enclosing the faces and the full figures of the people portrayed there. Such patches are always present since we are dealing with family album photos, which always include at least one person in each photo. To effectively estimate the value provided by such patches in terms of prediction performance, we also considered random ones. Hence, for the whole image and for each of the aforementioned regions, we trained two specific single-input classifiers, one per socio-historical task of interest. Such classifiers are named after the analyzed patches: full image, faces, people, and random patches. The single-input architecture utilizes either a CNN or a Transformer-based backbone and a fully connected layer for the final classification. It is important to notice that the results of such classifiers may not be comparable, as the amount of data utilized to perform a prediction varies depending on whether the full image or parts of it (patches) are used during testing. This fact required us to establish a different evaluation method, considering not a single face/person/random patch but introducing a layer that merges all such activations into a single one per picture. In practice, the activation vectors returned by a single-input classifier (e.g., the face classifier) for each face region were averaged per image in order to compute the most probable class. This process was applied also to the people and random-patches classifiers.
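To make the merging step concrete, the following minimal sketch (our illustration, not the authors’ released code) shows how the activation vectors returned by a single-input classifier for all the patches of one photo can be averaged into a single prediction; the model and patch tensors are assumed to follow standard PyTorch conventions.

```python
import torch

def merge_patch_activations(model, patch_tensors, device="cpu"):
    # patch_tensors: list of (3, H, W) crops (faces, people, or random
    # patches) extracted from one photo; model: a single-input classifier.
    model.eval()
    with torch.no_grad():
        batch = torch.stack(patch_tensors).to(device)   # (num_patches, 3, H, W)
        activations = model(batch)                      # (num_patches, num_classes)
        merged = activations.mean(dim=0)                # one activation vector per photo
    return merged.argmax().item(), merged               # predicted class and merged vector
```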
Finally, we also exploited the ensemble of these models (Figure 2). We resorted to such an approach as it has been successfully applied in the literature [41] and did not require any additional training or tuning of hyper-parameters. This kind of approach was employed not only to exploit the averaging effect [5] but also because it helps identify which type of classifier and data provide a valid contribution at inference time. As represented in Figure 2, such an approach is modular, supporting the selection of the single-input classifiers. However, since we are considering activations coming from a single image or obtained by averaging across multiple regions, these may contain values at different scales. For this reason, we \(l2\)-normalized the different inputs of the ensemble, to support the combination of the activation vectors coming from the full-image, faces, people, and random-patches classifiers. In particular, the final prediction is obtained by averaging the outputs described above and then computing the most probable class.
Fig. 2.
Fig. 2. Ensemble of the different models trained on the proposed datasets. Depending on the information exploited to obtain the final prediction, the activations from a model may be included or not.
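A minimal sketch of the fusion in Figure 2 follows, assuming each selected classifier already provides one (possibly patch-averaged) activation vector per photo; the \(l2\)-normalization and the final averaging follow the description above, while variable names are ours.

```python
import torch
import torch.nn.functional as F

def ensemble_predict(activation_vectors):
    # activation_vectors: one (num_classes,) tensor per selected classifier
    # (full-image, faces, people, random-patches), all for the same photo.
    normalized = [F.normalize(v, p=2, dim=0) for v in activation_vectors]  # bring to a common scale
    fused = torch.stack(normalized).mean(dim=0)                            # average across classifiers
    return fused.argmax().item(), fused
```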
We now move on, in the following section, to present the details regarding the family photo collection considered in this work, the IMAGO collection.

5 IMAGO Dataset

The IMAGO project was started in 2004 by socio-historical scholars to study the evolution of Social History through the lenses of family album photographs. This produced a digitized collection, namely IMAGO,1 of analog family album photos gathered year by year and conserved by the Department of the Arts of the University of Bologna.2 The collection comprises ca. 80,000 photos, taken between 1845 and 2009, belonging to ca. 1,500 Italian family albums, offering the opportunity of studying the evolution of Italian society during the 20th century. Among these, 16,642 images have been labeled by the bachelor’s students in the Fashion Cultures and Practices course, under the supervision of the socio-historical faculty.

5.1 Annotation Process

The annotation process followed (and keeps following, as new photos are acquired from new incoming bachelor’s students in Fashion Cultures and Practices and annotated every year) a simple but strict protocol, involving the following steps:
(1)
During a first lecture, the socio-historical background, the IMAGO dataset construction project, and the different classification categories are presented and explained.
(2)
During a second lecture, the annotation problem is covered in more detail. In particular, the lecture focuses on the importance of the reliability and authenticity of sources of socio-historical materials (including the shooting year). This means explaining that the original owner of the photo should be interviewed whenever possible. In case such person(s) are not available (e.g., the photo is very old), one can find a second-hand informed party (e.g., anyone who might be aware of the context of the given photo). Alternatively, an attempt to infer the socio-historical context and the shooting year (if possible) can be made by analyzing any written annotations on the back of the photo. In case none of these solutions is possible, no annotation is added.
Hence, the information provided by a photograph’s owner amounts to the ground truth from a socio-historical point of view. This assumption in the labeling process is what injects the social component along with the historical one into the dataset. Such an approach is not new to the computer vision community either; other works in the literature have considered the information provided by the owners of images as metadata [3, 38]. These elements highlight the uniqueness of such datasets: since only the owner (or a directly connected party such as a relative or a friend) holds the ground truth, it is not possible to resort to standard labeling services (e.g., Amazon SageMaker Ground Truth or the Google AI Platform Data Labeling Service [1, 27]). This annotation process generated two socio-historical metadata per photo: (1) the socio-historical context and (2) the shooting year [14].

5.2 Socio-historical Context

We here explain how the classes employed to analyze IMAGO have been defined from a socio-historical point of view. To this aim, we report on the rationale behind the use of two exemplar ones, “Motorization” and “Affectivity,” while a more in-depth analysis of all classes may be found in [11, 12, 13, 54]. The “Motorization” class is meant to mark an important change in people’s lifestyle. We can take as an example the boom in motorcycle sales. Such phenomena not only changed the production trend and its related economic ecosystem but also changed the social behavior of people in the area in which such a boom took place. It affected society’s idea of mobility and of how people gathered together. In these terms, the motorization aspect therefore becomes fundamental for the study of Social History. On a completely different plane, instead, the “Affectivity” class regards personal feelings. This class is meant to represent the changes that occurred in affective and family relationships. For example, in the first decades of the 20th century, emotional relationships within the family were marked by estrangement. This phenomenon is also reflected in the photographs that depict wife and husband, parents and children, brothers and sisters. Although all members of the same family, they posed without any affectionate gestures (e.g., hugs). After World War II, things changed, starting with younger people, whose poses changed in terms of distances, contact, hugs, and so forth. In the following we provide the socio-historical categories individuated in the IMAGO dataset [54], along with a brief explanation:
Work: Photos belonging to this class are mostly characterized by people sitting and/or standing in workplaces and wearing work clothes and/or gear.
Free-time: This class includes scenes of leisure time, reconstructing, wherever possible, generational and gender differences. It also includes images representing people visiting far-off landmarks, expanding social relationships and interacting with nature.
Motorization: Although often closely related to the Free-time category, this class has been distinguished as it includes symbolic objects such as cars and motorcycles, which represent a social and historical landmark.
Music: Similar to the Motorization one, this class may also include scenes from leisure time, characterized in this case by the appearance of musical instruments or events.
Fashion: This class includes clothing, which represents a mirror of the articulated intertwining of socio-economic, political, and cultural phenomena. This class is characterized by the presence of symbolic objects and clothes, such as suits, trousers, skirts, and coats.
Affectivity: This class is characterized by the presence of people (e.g., couples, friends, families, or colleagues) bound by inter-personal relationships.
Rites: These are portraits of sacred and/or celebratory events from family lives.
School: This class includes all the photos that represent schools, often characterized by symbolic objects (e.g., desk, blackboard) or groups of students.
Politics: This class contains photos related to political gatherings, demonstrations, and events.
The aforementioned categories are the ones that will be used from now on to implement the socio-historical classification task.

5.3 Exploratory Dataset Analysis

In Figure 3(a) we show the number of labeled images available per year in the 1930-to-1999 time frame; outside this time interval, the number of available images is too small to be visually represented. This figure also exhibits the distribution of the socio-historical information (i.e., shooting year and socio-historical context) over the entire dataset. From this plot, the unbalance that exists in terms of number of photos both per year and per socio-historical context is evident. Figure 3(b) shows four exemplar images from the IMAGO dataset, which belong to different decades and represent different socio-historical contexts. These images are representative of the different characteristics that may be found in each photo (e.g., number of people, clothing, colors, and location).
Fig. 3.
Fig. 3. IMAGO characteristics.

6 Experimental Validation

We first provide details about the dataset pre-processing and the training process and then report the results obtained for both the socio-historical context classification and the dating tasks. The entire IMAGO dataset (the 16,642 labeled photos spanning the 1845–2009 time period) was used during the analysis of the socio-historical context classification task. For what concerns the image dating, 15,673 pictures, covering the 1930-to-1999 temporal interval, have been employed to avoid those years with a very limited number of samples, as already shown in Figure 3(a).

6.1 Dataset Pre-processing and Subdivision

The pre-processing phase aimed at (1) isolating the regions of interest from each photo and (2) improving the quality of the images composing the dataset, resorting to different techniques.
As reported in Section 4, both faces and people represent regions of interest to be exploited for the dating analysis [26, 47]. Following such insight, we created the IMAGO-FACES and the IMAGO-PEOPLE datasets, comprising over 60,000 samples each: the first composed of individual faces, the second of single-person full-figure images. These have been obtained by processing each image of the IMAGO dataset using the open source implementations of YOLO-FACE and YOLO available at [30, 57], respectively. The IMAGO-FACES dataset has been constructed accounting for the number of people portrayed in a photo. In fact, with a fixed-size bounding box, relevant details may be lost (e.g., hairstyle) or pixels related to the faces of other people may be included. To avoid this problem, an adaptive strategy has been adopted: the size of the bounding box used to crop a face depends on the number of people portrayed in a photo—the greater the number of people, the smaller the bounding box. In this way, it was possible to extract the shoulders and the full head of a single person even when a picture portrayed tens of people. Figure 4(a) shows some sample images taken from the IMAGO-FACES dataset considering different decades and different socio-historical contexts. The construction of the IMAGO-PEOPLE dataset follows the same criteria employed for IMAGO-FACES, though images can present different aspect ratios (i.e., people may be standing or sitting in photos). Figure 4(b) shows exemplar images from IMAGO-PEOPLE. It is possible to appreciate that IMAGO-PEOPLE includes details that are not present in IMAGO-FACES (e.g., the clothing of a person).
Fig. 4.
Fig. 4. Sample of different patches: (a) IMAGO-FACES, (b) IMAGO-PEOPLE, and, (c) IMAGO-RANDOM samples.
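The exact cropping rule used to build IMAGO-FACES is not reported; the sketch below only illustrates the stated idea that the margin added around a detected face shrinks as the number of people in the photo grows, so the scaling rule and the base_margin constant are assumptions of ours.

```python
def adaptive_face_crop(box, num_people, img_w, img_h, base_margin=0.8):
    # box: (x1, y1, x2, y2) face bounding box from a detector such as YOLO-FACE.
    # The added margin is inversely proportional to the number of people, so the
    # shoulders and full head are kept without overlapping neighboring faces.
    x1, y1, x2, y2 = box
    margin = base_margin / max(num_people, 1)          # assumed inverse scaling rule
    dw, dh = (x2 - x1) * margin, (y2 - y1) * margin
    return (max(0, int(x1 - dw)), max(0, int(y1 - dh)),
            min(img_w, int(x2 + dw)), min(img_h, int(y2 + dh)))
```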
We then verified the utility of performing denoising and super-resolution operations, as all the images considered in this work derive from scans of the analog prints. For denoising we tested the neural network model from [65] and the Bilateral Filter [46]. For super-resolution, we used an open source implementation of the ESRGAN model [61] within the Image Restoration Toolbox [64]. The overall improvement obtained by adopting such strategies turned out to be negligible, so we opted for an analysis based on the original scans of the analog photos.
The IMAGO-FACES and IMAGO-PEOPLE datasets were defined only to fine-tune the deep learning models for the socio-historical tasks introduced with the IMAGO dataset. So, we will not release such datasets, since their creation is technology dependent. Indeed, in the future, algorithms or models providing more accurate bounding boxes for face and people regions could be introduced.
Finally, to study the possible usefulness of non-human features within a family album photo dataset, we also created a dataset called IMAGO-RANDOM, comprising eight randomly cropped regions, of 128 \(\times\) 128 pixels, from each image in the IMAGO dataset (some samples are reported in Figure 4(c)). Other window sizes were also tested but returned a lower performance.
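A possible way to build IMAGO-RANDOM, assuming nothing more than what is stated above (eight 128 × 128 crops per photo, with positions drawn uniformly at random), is sketched below; this is our illustration, not the authors’ code.

```python
import random
from PIL import Image

def extract_random_patches(image_path, n_patches=8, size=128, seed=None):
    rng = random.Random(seed)
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    patches = []
    for _ in range(n_patches):
        x = rng.randint(0, max(0, w - size))           # top-left corner of the crop
        y = rng.randint(0, max(0, h - size))
        patches.append(img.crop((x, y, x + size, y + size)))
    return patches
```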
All these datasets have been partitioned as follows: 80% for training and 20% for testing; in addition, 10% of the training images are used as the validation set for hyper-parameter tuning. For each image in the train set of IMAGO, the faces, the people portrayed there, and the random patches are extracted and added to the corresponding dataset subset. This process is repeated for the validation and test sets as well, which guarantees that none of the training samples can end up in the validation or test sets.
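A leakage-safe split consistent with the procedure above can be sketched as follows (our assumption of how it may be implemented): photos are partitioned first, and patches are then extracted separately within each subset.

```python
from sklearn.model_selection import train_test_split

def split_photos(photo_ids, seed=0):
    # 80% train / 20% test at the photo level, then 10% of the training photos
    # held out for validation; faces, people, and random patches are extracted
    # afterwards, per subset, so crops of one photo never cross subset boundaries.
    train_ids, test_ids = train_test_split(photo_ids, test_size=0.20, random_state=seed)
    train_ids, val_ids = train_test_split(train_ids, test_size=0.10, random_state=seed)
    return train_ids, val_ids, test_ids
```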

6.2 Model Architecture and Training Settings

All our CNN-based single-input classifiers adopt a well-known architecture pre-trained on ImageNet [21]: the ResNet50 [31]. This architecture was modified by replacing the top-level classifier with a new classification layer, whose structure depends on the socio-historical task (i.e., the number of output classes) and whose weights have been randomly initialized. The pre-trained convolutional layers have been specifically fine-tuned for the given input data and task. In order to verify the independence of our dataset from the specific architecture, we have also considered two other well-known ones: InceptionV3 [55] and DenseNet121 [29]. However, the results were very similar, and we chose the ResNet50 as the main backbone for our analysis since it represents a good tradeoff between performance and number of parameters [20].
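A minimal sketch of the single-input backbone described above, using torchvision (one possible implementation, not necessarily the authors’ exact code): the ImageNet-pretrained ResNet50 keeps its convolutional layers trainable, while the final fully connected layer is replaced with a randomly initialized one sized for the task (9 socio-historical contexts, or 70 year classes for dating).

```python
import torch.nn as nn
from torchvision import models

def build_resnet50_classifier(num_classes):
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)  # ImageNet pre-training
    model.fc = nn.Linear(model.fc.in_features, num_classes)                 # new task-specific head
    return model

# context_model = build_resnet50_classifier(num_classes=9)    # socio-historical contexts
# dating_model = build_resnet50_classifier(num_classes=70)    # years 1930-1999
```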
During the training phase we applied data augmentation (e.g., random crop and horizontal flip) in order to make the model less prone to overfitting. Each model has been fine-tuned using a weighted cross-entropy loss to counter the unbalance in our dataset [40]. The Adam optimizer has been employed with a learning rate of 1e-4 and a weight decay of 5e-4. We set the batch size to 32 for the training of the full-image classifier and to 64 for the faces, people, and random-patches models.
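The training settings above translate into a configuration along the following lines; the optimizer, learning rate, and weight decay are those reported in the text, while the augmentation pipeline and the inverse-frequency class weights are our assumptions of one reasonable implementation.

```python
import torch
import torch.nn as nn
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crop (assumed parameters)
    transforms.RandomHorizontalFlip(),      # horizontal flip
    transforms.ToTensor(),
])

def make_loss_and_optimizer(model, class_counts):
    # Weighted cross-entropy to counter the class unbalance: rarer classes
    # receive larger weights (inverse-frequency weighting is an assumption).
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)
    criterion = nn.CrossEntropyLoss(weight=weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
    return criterion, optimizer
```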

6.3 Socio-historical Context Classification Task Results

In the following sections, we report on the performance obtained with the single-input classifiers and with the ensemble model. We finally provide a qualitative Grad-CAM-based analysis of the behavior of the models.

6.3.1 Single-input and Ensemble Classifiers.

The results are reported in Table 2 and expressed in terms of top-\(k\) accuracy: if the correct class is not the one with the highest predicted probability but falls among the \(k\) classes with the highest predicted probabilities, it is counted as correct. It is possible to appreciate that the full-image classifier exhibits a higher accuracy compared to the other single-input classifiers. To further investigate the reasons behind such a result, we report in Table 3 a comparison between the accuracy of each class considering the different single-input classifiers. As it is possible to observe, the model trained on IMAGO provides the best performance for the Motorization, Rites, Music, School, Affectivity, and Work classes. This may be due to the presence of specific objects that drive the performance of the model, also considering that the model was initialized with the ImageNet pre-trained weights [21], and ImageNet contains classes such as race car and car wheel. Indeed, from a socio-historical point of view, images from the classes Rites and Music could contain physical objects and/or symbols that are representative of that class (e.g., formal attire, musical instruments). Nevertheless, such objects only acquire a meaning when people deal with them. However, the fact that the full-image classifier reached the highest accuracy for the School, Affectivity, and Work classes means that the network has also learned to recognize the presence of groups of people (e.g., school classes, friends standing in front of a monument, a mother hugging her child) and specific clothing. Despite this classifier performing best, there are some peculiar results that have to be discussed. For example, the people classifier performs slightly better for the Fashion and Free-time socio-historical contexts. This is probably due to the fact that the network may be focusing on people’s clothing details and poses instead of exploiting specific objects and/or backgrounds that are not present in the people crops. Exemplar areas on which the models focus in order to classify images are reported in Section 6.3.2. Finally, the Politics class amounts to the only one for which, in terms of performance, the random-patches classifier is comparable to the full-image one.
Table 2.
Table 2. Accuracy for the Socio-historical Single-input Classifiers Considering the Top- \(k\) Predicted Classes ( \(k\) Ranging from 1 to 5)
Table 3.
Table 3. Single-class Accuracy for Each Socio-historical Context Classifier
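For reference, the top-\(k\) accuracy used in Table 2 can be computed as in the following short sketch (standard definition, our code).

```python
import torch

def topk_accuracy(logits, targets, k=1):
    # logits: (N, num_classes) model outputs; targets: (N,) ground-truth labels.
    # A sample counts as correct if its true class is among the k classes
    # with the highest predicted probability.
    topk_classes = logits.topk(k, dim=1).indices             # (N, k)
    hits = (topk_classes == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```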
We also evaluated different ensemble classifiers obtained from combinations of the single-input classifiers. However, such combinations did not provide any significant improvement with respect to just considering the full-image model. For this reason, from now on, we consider the full-image classifier for the analysis that follows and as the socio-historical context classifier in our application (see Figure 1).
Figure 5 shows the confusion matrix obtained with the full-image classifier. It is possible to observe that the classes responsible for the largest share of misclassifications are Fashion, Affectivity, and Free-time. This may be due to different causes. First, some classes share visual elements. For example, pictures labeled with the Work class often depict people in uniform in workplaces. These could mistakenly be classified as belonging to the Fashion class, as pictures in this class are characterized by people posing in particular clothing items. Another example involves the Music and Free-time classes. Indeed, the Music category is characterized by photos portraying people playing instruments or taking part in musical events. The latter, however, could easily be associated with Free-time photos, since they also often portray groups of people in similar environments and poses. Second, the IMAGO dataset is unbalanced, as reported in Figure 3(a) (Section 5). Indeed, the most misclassified classes are also those that contain fewer samples.
Fig. 5.
Fig. 5. Confusion matrix for the full-image classifier.

6.3.2 Grad-Cam Analysis.

We here report a qualitative analysis that aims at highlighting which visual cues led the classifier to associate a specific socio-historical context to a picture. To do so, we exploited the Grad-Cam algorithm [52], which delimits the areas driving the predictions performed by a deep learning model.
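For completeness, a compact Grad-CAM sketch is reported below; this is our own minimal implementation of the algorithm of [52], not the authors’ tooling, and model.layer4 is only an example target layer for ResNet50. The class-score gradients are average-pooled to weight the activations of the chosen convolutional layer, and the ReLU of the weighted sum yields the localization map.

```python
import torch

def grad_cam(model, target_layer, image, class_idx=None):
    feats, grads = [], []
    fh = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        model.eval()
        logits = model(image.unsqueeze(0))                      # (1, num_classes)
        cls = int(logits.argmax()) if class_idx is None else class_idx
        model.zero_grad()
        logits[0, cls].backward()                               # gradients of the target class score
        activations, gradients = feats[0][0], grads[0][0]       # both (C, H, W)
        weights = gradients.mean(dim=(1, 2))                    # global-average-pooled gradients
        cam = torch.relu((weights[:, None, None] * activations).sum(dim=0))
        return cam / (cam.max() + 1e-8)                         # normalized heatmap
    finally:
        fh.remove()
        bh.remove()

# Example usage (hypothetical): heatmap = grad_cam(context_model, context_model.layer4, image_tensor)
```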
Figure 6 depicts samples of correctly classified IMAGO images processed by the Grad-Cam algorithm. Each column, starting from the left, shows five exemplary images belonging to the Affectivity, Fashion, Motorization, Music, Politics, Rites, and School classes, respectively. Such images are representative of the regions exploited by the full-image classifier. In more detail, people in certain poses close to each other (e.g., hugs, holding a baby, handshakes), as shown in the first column of Figure 6, are characteristic of the Affectivity class. Specific objects like earrings, necklaces, and lapels, as well as particular hairstyles, are used to classify a picture as belonging to the Fashion class (second column of the figure). All kinds of vehicles, as well as musical instruments, are used to recognize a given picture as a member of the Motorization or the Music classes, shown in the third and fourth columns, respectively. The presence of a political banner is typical of pictures in the Politics class (fifth column). The model also appears to individuate the objects that characterize the Rites class (e.g., white dress, flowers, pouring a drink, cheers), as shown in the sixth column of Figure 6. Finally, children wearing school uniforms, as well as school gear (e.g., books, pens, desks), are used to recognize pictures in the School class (last column).
Fig. 6.
Fig. 6. Grad-Cam analysis of socio-historical contexts of pictures within IMAGO.
It is not surprising that the model was able to correctly classify pictures belonging to the Motorization and Music classes, as these are clearly characterized by specific objects that are, more importantly, already covered by the ImageNet pre-training [21]. However, also for the majority of the other classes (not studied so far in the literature, to the best of our knowledge), the model seems to be able to isolate and focus on the details that distinguish them.
Figure 7 shows instead some failure cases for the full-image classifier. From the leftmost picture and its probability histogram it is possible to see that a photo containing a car was classified as belonging to the Motorization class but the ground-truth label assigned to the picture was Affectivity (two people standing close to each other in a specific pose). Instead, the rightmost picture and its corresponding probability histogram show that a picture depicting a school class was classified as belonging to School, while the actual one was Work (a teacher is standing in the rightmost part of the picture). Such misclassifications may be traced back to the fact that the IMAGO dataset was labeled by the owners of the pictures. The pictures thus convey such specific points of view, which may not be correctly predicted by the network. On the other hand, however, the point of view of the photo owner amounts to the ground truth, according to the methods adopted in socio-historical studies. In fact, the leftmost picture presented in Figure 7 was classified as Affectivity since the owner of the photograph was the child of the couple there portrayed. The same phenomenon happens in the rightmost one, since the one who labeled the photo was a teacher of those students. This proves the intrinsic challenge that the socio-historical classification task poses, since any classifier, including an expert socio-historian, may be subject to such kind of errors. For such reason, we further investigate such phenomenon in Section 7, analyzing the differences between the predictions obtained with the deep learning model and the choices made by a socio-historian.
Fig. 7.
Fig. 7. Grad-Cam examples of failure cases: Affectivity recognized as Motorization and Work recognized as School.

6.4 Dating Task Results

The dating performance is evaluated exploiting time distances, as also reported in [26, 47]. The time distance defines the tolerance accepted in a prediction with respect to the actual year. For example, if a photo was labeled with the year 1932 and the model returned 1927 (or even 1937), this would be considered correct for those cases where the time distance was set to a value equal to or greater than 5 and wrong otherwise. In this work, model accuracies were computed considering time distances of 0, 5, and 10 years and have been assessed for both single-input and ensemble classifiers.
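The time-distance metric can be expressed as in the following short sketch (standard definition, our code).

```python
import torch

def dating_accuracy(predicted_years, true_years, distance=0):
    # A prediction is correct if it falls within `distance` years of the
    # labeled year; distance = 0 requires the exact shooting year.
    pred = torch.as_tensor(predicted_years, dtype=torch.long)
    true = torch.as_tensor(true_years, dtype=torch.long)
    return ((pred - true).abs() <= distance).float().mean().item()
```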
The evaluation of the single-input classifiers is reported in Table 4. The models fine-tuned on face and people regions achieved a higher accuracy compared to the full-image classifier when considering a time distance equal to 0. This is also true for the random-patches classifier, which, however, performed worse with larger time distances. These results could be explained by model averaging, as the use of more data allows uncertainty to be controlled and the prediction error rate to be reduced [5]. Nevertheless, this increase in performance may also be due to the faces and people classifiers learning specific visual features characteristic of given time slices. To verify whether such improvement was due to the averaging effect, we designed a specific experiment. We considered a test subset composed of all those images containing at least \(n=8\) faces or people crops (as in the case of random crops; see Section 6.1). To weigh the role of the number of faces/people, the accuracy values were computed considering \(k\) faces/people, with \(k\) growing from 1 to \(n\). To ensure the completeness and fairness of this experiment, 1,000 random trials per each \(k\) faces/people/random patches were considered. Results have been grouped by \(k\) and are reported in Table 5.  From these results, we can observe that averaging across multiple inputs generally results in a higher performance, which increases when considering the faces and people regions.
Table 4.
Table 4. Comparison of Single-input Classifiers Dating Performance
Table 5.
Table 5. Single-input Classifiers Averaging Accuracies, along with Their Standard Deviation, Considering an Increasing Number of Patches and a Time Distance d = 0
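A sketch of the experiment behind Table 5 (an assumed implementation, with names of ours): for each photo with at least n patch activations, k of them are sampled at random and averaged before classification, and the procedure is repeated over many random trials.

```python
import random
import torch

def averaged_accuracy_over_k(patch_activations, labels, k, trials=1000, seed=0):
    # patch_activations: per photo, a list of (num_classes,) activation vectors
    # (faces, people, or random patches); labels: the ground-truth year classes.
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        correct = 0
        for acts, label in zip(patch_activations, labels):
            sampled = rng.sample(acts, k)                        # pick k patches at random
            pred = torch.stack(sampled).mean(dim=0).argmax().item()
            correct += int(pred == label)
        accuracies.append(correct / len(labels))
    accuracies = torch.tensor(accuracies)
    return accuracies.mean().item(), accuracies.std().item()    # mean accuracy and its spread
```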
Differently from the socio-historical context classification, an ensemble of different classifiers provides positive results for the dating task. Following the flow described in Figure 2, we proceeded to evaluate different ensemble combinations, exploiting the full-image, faces, people, and random-patches classifiers. Since no significant improvements were observed employing the random-patches classifier, for the sake of clarity, Table 6 only includes the results that involve the full-image (T), faces (F), and people (P) classifiers. It is possible to observe that the best overall performance is obtained with the ensemble combination of all these classifiers. This shows that the model may benefit from averaging across different classifiers, as well as across multiple regions [5]. This model performs better than any single-input classifier. From now on, for all the following experiments, we consider the model that reached the best performance, that is, the ensemble of the full-image, faces, and people classifiers. Moreover, this ensemble has been adopted in our application to estimate the shooting year of a picture.
Table 6.
Table 6. Ensemble Model Considering Different Combinations of Full-image (T), Faces (F), and People (P) Classifiers
Figure 8(a) shows the confusion matrix considering a time distance equal to 0. The diagonal structure demonstrates that the confusion mostly occurs between neighboring years, except for the initial and the final decades (this has been observed also in other works, as in [26]). The confusion created within the first 20 years may be caused by the low quality of the images and the limited number of samples representing those years. The confusion created within the last 20 years, instead, may be related to the fact that the number of images for these years is very limited (as shown in Figure 3(a)). Nevertheless, it is interesting to observe the information provided in Figure 8(b), where the model accuracy and the number of samples per decade are reported. This figure confirms the finding exhibited by the confusion matrix; that is, the model accuracy improves after the 1950s. Figure 8(b) also shows that, despite a reduction in terms of available samples per decade after the 1980s, the performance of the model does not decrease. The accuracy generally improves after the 1950s (also when the number of samples drops); again, this could be related to the fact that the images are of better quality with respect to the previous decades.
Fig. 8.
Fig. 8. Dating task measures for the ensemble model.
Differently from the socio-historical task, we did not carry out a qualitative analysis for the dating task, as such type of analysis may already be found in the literature [26, 47].

7 Human Vs. Machine Assessment

To this point, we exploited the IMAGO dataset to train the models that compose the SHM, amounting to the core of the application designed to help socio-historians in cataloging family album pictures. To assess the performance the application could attain in terms of accuracy, with respect to a human expert, we designed a specific experiment where a socio-historian was asked to categorize all the pictures in the IMAGO test set (amounting to 3,327 pictures), providing both the socio-historical context and the date. In particular, on one hand, the SHM models can provide a ranking for the classes predicted for a specific photo (i.e., top-k for the socio-historical context classification and a time interval confidence for dating). On the other hand, the socio-historian deals with the corpus of images, labeling them based on past archival and cataloging work experiences. In the following we provide the details of the comparison for the two tasks considered in this work.
Socio-historical context classification assessment. For this experiment, the socio-historical expert was given the opportunity of selecting multiple categories per photo. As a result of this possibility, one class was chosen for 2,147 photos, two classes for 1,131 photos, and three classes for 49 pictures. It is interesting to point out that, although the expert was free to use as many labels as desired, no more than three were ever assigned at once. To make a fair comparison, we considered the k most probable classes chosen by the SHM model and compared them with the k classes selected by the socio-historian. We then proceeded to compute the accuracy of the socio-historian and of the model as follows. For example, if the ground truth for a photo was “Affectivity,” the predictions provided by the application and the selections made by the socio-historian would be considered positive if both contained “Affectivity.” Since the scholar could choose the number of categories to assign, we computed such scores cumulatively. In particular, for Cumulative k in Table 7, with k = 1, a prediction is counted as positive in case it matches the ground truth. It follows that, if k \(\gt\) 1, a positive match is recorded in case one of the k predictions matches the ground truth. The results are reported in Table 7. The first, simple observation is that the proposed application obtained accuracy levels that surpassed those obtained by the socio-historical scholar. For example, when considering those pictures that were tagged with only one category by the socio-historical scholar, the scholar obtained an accuracy of 54.82% vs. an accuracy of 64.89% for the application. This occurred also when considering those pictures for which the socio-historian chose one or more classes; the application was still able to obtain a higher performance. In Figure 9 we show a representative example of a case where the model predicts the correct label, unlike the socio-historian. In fact, the socio-historian fails at recognizing a particular detail that only the owner could have known (the subject of the photo is posing wearing a particular outfit); on the contrary, the model correctly classified this image.
Fig. 9.
Fig. 9. Human vs. machine: experiment diagram.
Table 7.
Table 7. Human vs. Machine: Accuracy Comparison for Increasing Values of k (k Indicates the Number of Selections Made by the Socio-historical Scholar and the Most Probable Classes Returned by the Model)
Dating classification assessment. The socio-historian labeled all the pictures belonging to the test set, assigning a year in the [1930, 1999] range. The results are reported in Table 8. The dating module performed better than the socio-historian considering the specific picture shooting year (\(+ 12.58\%\)). The difference in performance decreases when a higher time distance is considered, arriving at \(3.64\%\) when the time distance equals 10.
Table 8.
Table 8. Human vs. Machine: Accuracy Reported for Different Time Distances (d = 0, d = 5, d = 10)

8 CNN Vs. Transformer Performance

The Transformer is a deep learning architecture that relies entirely on the self-attention mechanism to draw global dependencies between input and output [60]. Recent works have shown that such an approach can achieve comparable or even superior performance to CNNs [23, 28, 58]. In particular, the Vision Transformer (ViT) architecture, proposed by Dosovitskiy et al. [23], has achieved state-of-the-art performance on several computer vision benchmarks. For these reasons, we decided to exploit the ViT architecture in this work (for the IMAGO dataset socio-historical context classification and dating). To this aim, we proceeded to fine-tune different ViT configurations (Tiny, Small, Base, and Large), varying the size of the input images (i.e., 224 \(\times\) 224 or 384 \(\times\) 384) and considering patches of 16 \(\times\) 16 pixels. For the training, we followed the procedure reported in [23], while adopting a weighted cross-entropy loss to counter the dataset unbalance [40] and preserving the subdivision into training, validation, and test sets used in our previous experiments. This process was adopted for both the socio-historical context classification and dating for all of the proposed datasets (i.e., IMAGO, IMAGO-FACES, IMAGO-PEOPLE, and IMAGO-RANDOM). The results obtained with ViT, available in Tables 9 and 10, are contrasted there with those previously presented in Sections 6.3 and 6.4.
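A minimal sketch of how such ViT single-input classifiers can be instantiated, assuming the timm library (our choice for illustration; the authors do not specify their implementation): the variant name selects the configuration (Tiny/Small/Base/Large), the input resolution (224 or 384), and the 16 × 16 patch size, while the classification head is randomly initialized and fine-tuned.

```python
import timm

def build_vit_classifier(num_classes, variant="vit_small_patch16_224"):
    # Pretrained ViT with a new classification head sized for the task;
    # e.g., "vit_base_patch16_384" selects the Base configuration at 384 x 384.
    return timm.create_model(variant, pretrained=True, num_classes=num_classes)
```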
Table 9.
Table 9. Comparison of Single-input Classifiers for Socio-historical Context Classification, Considering Both ResNet50 and ViT Models
Table 10.
Table 10. Comparison of Single-input Classifiers for the Dating, Considering Both ResNet50 and ViT Models
On one hand, from the results reported in Table 9, it is possible to observe that in most cases either ViT-Base or ViT-Large outperforms the ResNet50, though at the cost of a much higher number of parameters and thus an increased model complexity. When instead a similar number of parameters is used (e.g., ViT-Small with input size 224 \(\times\) 224), ViTs exhibit a slightly lower performance. Nevertheless, comparing the results shown in Table 11 and Figure 10 with those reported in Table 3 and Figure 5, it is worth noticing that ViT-Small obtains a more balanced per-class accuracy.
Fig. 10.
Fig. 10. Confusion matrix for the ViT-Small full-image classifier.
Table 11.
Table 11. Single-class Accuracy for Each Socio-historical Context Classifier Based on ViT-Small
On the other hand, the results in Table 10 show that the ResNet50 outperforms all single-input ViT configurations for dating. We also considered different ensemble combinations, but no relevant improvements were detected and for this reason they are not reported here. Concluding, the ViT approach exhibits divergent behaviors when applied to dating and socio-historical context classification. Why this occurred may be explained by resorting to [42], where the authors highlighted how ViT (1) incorporates more global information than ResNet at lower layers, leading to different features, and (2) strongly preserves spatial information by adopting class tokens. Indeed, the inclusion of more global information at lower layers and the strong preservation of spatial information could be the reason socio-historical context classification obtained a better accuracy than dating. This is qualitatively represented by a few Grad-Cam examples reported in Figure 11: more accurate activations are obtained when compared to the corresponding examples for ResNet50, reported in Figure 6. On the contrary, dating often requires focusing on specific local visual cues rather than on global ones, as also highlighted by Ginosar et al. [26].
Fig. 11.
Fig. 11. Grad-CAM analysis of socio-historical contexts using ViT-Small.
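Heatmaps of the kind shown in Figure 11 can be produced, in spirit, with the pytorch-grad-cam library; since a ViT outputs a sequence of tokens rather than a spatial feature map, the patch tokens must first be reshaped into a 2D grid. The sketch below is a hedged example: the target layer, the 14 \(\times\) 14 grid (224/16 patches per side), and the model handle are assumptions tied to a ViT-Small/16 backbone, not our exact visualization pipeline.

```python
import torch
import timm
from pytorch_grad_cam import GradCAM

def reshape_transform(tokens, h=14, w=14):
    # Drop the class token and rearrange the patch tokens into a (B, C, H, W) map.
    patches = tokens[:, 1:, :].reshape(tokens.size(0), h, w, tokens.size(2))
    return patches.permute(0, 3, 1, 2)

model = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=5).eval()
cam = GradCAM(model=model,
              target_layers=[model.blocks[-1].norm1],
              reshape_transform=reshape_transform)

input_tensor = torch.randn(1, 3, 224, 224)    # stand-in for a preprocessed photo
heatmap = cam(input_tensor=input_tensor)[0]   # (224, 224) class-activation map
# pytorch_grad_cam.utils.image.show_cam_on_image can overlay `heatmap` on the photo.
```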

9 Conclusions, Limitations, and Future Research

In this work we proposed a multimedia application to assist socio-historians in cataloging family album photos. We then presented the IMAGO dataset, composed of photos belonging to family albums and representing a source of socio-historical knowledge. The dataset amounts to 16,642 pictures, each labeled with its socio-historical metadata: shooting year and context. We trained and tested single-input and ensemble deep learning models based on Convolutional Neural Networks to carry out those tasks. To the best of our knowledge, this is the first work addressing socio-historical context classification, which consists in identifying the sociological and historical context of a picture according to the definitions provided by socio-historical scholars [14]. We then compared the performance of our application with that of a socio-historical scholar. The results of this assessment showed that our application can speed up cataloging processes with no loss of accuracy with respect to a human expert, thus providing important support to socio-historians. Finally, we carried out a comparative analysis considering Transformer-based deep learning models, whose results showed that this approach is also promising for socio-historical analysis. This work only represents a step toward a holistic approach to the socio-historical cataloging problem, as many processes and sources of information are involved. In our specific case, for example, the models were trained on an unbalanced dataset and considered image regions that often included information irrelevant for classification purposes (e.g., background). In addition, when focusing on socio-historical classification or dating, scholars perform analyses that resort at once to different sources of information (e.g., newspapers, magazines, archival documents), as well as to traces belonging to the same historical period. These represent three of the most relevant limitations of this work.
Further investigations in this domain may hence consider (1) larger and more balanced sets of data, (2) a better segmentation of the relevant areas of the images, and (3) the implementation of a multi-modal approach capable of including other sources of information and data formats. Regarding the first point, the availability of larger datasets would likely improve the models' discriminative power, while also reducing possible imbalance problems. Regarding the second, the use of segmentation models may help isolate the most relevant image regions. Finally, multi-modal learning appears to be the approach that may best replicate the comprehensive methodology normally adopted by socio-historians during cataloging processes. Indeed, exploiting knowledge from historical archival documents (and other sources) could improve the general cataloging and analysis effort: for example, knowing how people dressed during a specific period might improve the classification of both the socio-historical context and the date. Such a path, although complex, is not impossible to follow, as recent natural language processing solutions are able to provide discriminative features that could be exploited in our models to improve the overall performance [22], as sketched below.
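As an illustration only, the following sketch extracts a BERT [22] sentence embedding for a hypothetical archival-document caption and concatenates it with an image embedding, as a possible starting point for such a multi-modal classifier. The model names, the caption, and the fusion step are assumptions, not part of our current system.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

caption = "Wedding portrait printed on cardboard, studio backdrop, circa 1930s."
inputs = tokenizer(caption, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    text_feat = text_encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token, shape (1, 768)

image_feat = torch.randn(1, 2048)                   # stand-in for a ResNet50 image embedding
fused = torch.cat([image_feat, text_feat], dim=1)   # (1, 2816) joint representation
# A small MLP head on `fused` could then predict the socio-historical context and/or date.
```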

Footnotes

1. The IMAGO resources are available upon request.

References

[1]
Amazon. 2021. Amazon SageMaker Ground Truth. https://aws.amazon.com/it/sagemaker/groundtruth/.
[2]
S. Barba, F. Fiorillo, P. Ortiz Coder, S. D’auria, and E. De Feo. 2011. An application for cultural heritage in Erasmus Placement. Surveys and 3D cataloguing archaeological finds in Merida (Spain). (2011).
[3]
B. Fernando, D. Muselet, R. Khan, and T. Tuytelaars. 2014. Color features for dating historical color images. In IEEE International Conference on Image Processing (ICIP'14). 2589–2593.
[4]
K. Bentein. 2015. Minor complementation patterns in Post-classical Greek (I–VI AD): A socio-historical analysis of a corpus of documentary papyri. Symbolae Osloenses 89, 1 (2015), 104–147.
[5]
C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[6]
E. Borcoci, D. Negru, and C. Timmerer. 2010. A novel architecture for multimedia distribution based on content-aware networking. In 2010 3rd International Conference on Communication Theory, Reliability, and Quality of Service. IEEE, 162–168.
[7]
L. Bosi and H. Reiter. 2014. Historical methodologies. Methodological Practices in Social Movement Research (2014), 117–143.
[8]
P. Bourdieu. 1996. On the family as a realized category. Theory, Culture & Society 13, 3 (1996), 19–26.
[9]
M. A. Cabrera. 2001. On language, culture, and social action. History and Theory 40, 4 (2001), 82–100.
[10]
M. Á. Cabrera. 2004. Postsocial History: An Introduction. Lexington Books.
[11]
D. Calanca. 2004. Percorsi di storia della famiglia. Rivista Di Storia E Storiografia 5, 5 (Nov. 2004), 203–210.
[12]
D. Calanca. 2005. Album di famiglia. Autorappresentazioni tra pubblico e privato (1870-1950). Storia e Futuro 8–9 (2005).
[13]
D. Calanca. 2006. Fotografie amatoriali e fotografie professionali nell’Italia del boom economico. Storia e Futuro 12 (2006), 134–144.
[14]
D. Calanca. 2011. Italians posing between public and private. Theories and practices of Social Heritage. Almatourism-Journal of Tourism, Culture and Territorial Development 2, 3 (2011), 1–9.
[15]
J. Chen and H. Wang. 2018. Guest editorial: Big data infrastructure I. IEEE Transactions on Big Data 4, 2 (2018), 148–149.
[16]
X. Chen, D. Liu, Z. Xiong, and Z-J. Zha. 2020. Learning and fusing multiple user interest representations for micro-video and movie recommendations. IEEE Transactions on Multimedia (2020).
[17]
E. S. Clemens and M. D. Hughes. 2002. Recovering past protest: Historical research on social movements. In Methods of Social Movement Research, B. Klandermans and S. Staggenborg (Eds.). University of Minnesota Press, Minneapolis, 201–230.
[18]
E. Coburn, E. Lanzi, E. O’Keefe, R. Stein, and A. Whiteside. 2010. The cataloging cultural objects experience: Codifying practice for the cultural heritage community. IFLA Journal 36, 1 (2010), 16–29.
[19]
L. Criscenti, G. D’autilia, and G. De Luna. 2005. L’Italia Del Novecento: Le Fotografie e la Storia. Giulio Einaudi editore.
[20]
E. Culurciello. 2021. Neural Network Architectures. https://towardsdatascience.com/neural-network-architectures-156e5bad51ba.
[21]
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255.
[22]
J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[23]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[24]
K. J. Enns and M. J. Martin. 2015. Gendering agricultural education: A study of historical pictures of women in the agricultural education magazine. Journal of Agricultural Education 56, 3 (2015), 69–89.
[25]
R. Franzosi. 1998. Narrative as data: Linguistic and statistical tools for the quantitative study of historical events. International Review of Social History 43, S6 (1998), 81–104.
[26]
S. Ginosar, K. Rakelly, S. Sachs, B. Yin, and A. A. Efros. 2015. A century of portraits: A visual historical record of American high school yearbooks. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1–7.
[27]
Google. 2021. AI Platform Data Labeling Service. https://cloud.google.com/ai-platform/data-labeling/docs.
[28]
K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao. 2022. A Survey on Vision Transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[29]
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2018. Densely Connected Convolutional Networks. arxiv:1608.06993 [cs.CV]
[30]
J. Redmon. 2019. YOLO: Real Time Object Detection. Retrieved August 3, 2020, from https://github.com/pjreddie/darknet/wiki/YOLO:-Real-Time-Object-Detection.
[31]
K. He, X. Zhang, S. Ren, and J. Sun. 2015. Deep Residual Learning for Image Recognition. arxiv:1512.03385 [cs.CV]
[32]
J. Lemley, S. Bazrafkan, and P. Corcoran. 2017. Deep learning for consumer devices and services: Pushing the limits for machine learning, artificial intelligence, and computer vision. IEEE Consumer Electronics Magazine 6, 2 (2017), 48–56.
[33]
Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei. 2019. Deep metric learning with density adaptivity. IEEE Transactions on Multimedia 22, 5 (2019), 1285–1297.
[34]
M. Lincoln, J. Corrin, E. Davis, and S. B. Weingart. 2020. CAMPI: Computer-aided metadata generation for photo archives initiative. Carnegie Mellon University. Preprint.
[35]
G. Mitman and K. Wilder. 2016. Documenting the World: Film, Photography, and the Scientific Record. University of Chicago Press Chicago, IL.
[36]
MoMA. 2020. Vernacular Photography. https://www.moma.org/collection/terms/vernacular-photography.
[37]
E. Müller, M. Springstein, and R. Ewerth. 2017. “When was this picture taken?”—Image date estimation in the wild. In European Conference on Information Retrieval. Springer, 619–625.
[38]
F. Palermo, J. Hays, and A. A. Efros. 2012. Dating historical color images. In European Conference on Computer Vision. Springer, 499–512.
[39]
M. R. Peres. 2014. The Concise Focal Encyclopedia of Photography: From the First Photo on Paper to the Digital Revolution. CRC Press.
[40]
T. H. Phan and K. Yamamoto. 2020. Resolving Class Imbalance in Object Detection with Weighted Cross Entropy Losses. arxiv:2006.01413 [cs.CV]
[41]
X. Qiu, L. Zhang, Y. Ren, P. N. Suganthan, and G. Amaratunga. 2014. Ensemble deep learning for regression and time series forecasting. In 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL’14). 1–6.
[42]
M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy. 2021. Do vision transformers see like convolutional neural networks? In 35th Conference on Neural Information Processing Systems.
[43]
B. Rainer, S. Petscharnig, C. Timmerer, and H. Hellwagner. 2016. Statistically indifferent quality variation: An approach for reducing multimedia distribution cost for adaptive video streaming services. IEEE Transactions on Multimedia 19, 4 (2016), 849–860.
[44]
M. Roccetti, L. Casini, G. Delnevo, V. Orrù, and N. Marchetti. 2020. Potential and limitations of designing a deep learning model for discovering new archaeological sites: A case with the Mesopotamian floodplain. In Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good. 216–221.
[45]
D. Rosner, M. Roccetti, and G. Marfia. 2014. The digitization of cultural practices. Communications of the ACM 57, 6 (June 2014), 82–87.
[46]
J. Tumblin, S. Paris, P. Kornprobst, and F. Durand. 2007. A gentle introduction to bilateral filtering and its applications. In ACM SIGGRAPH 2007 Courses (SIGGRAPH’07). Association for Computing Machinery, New York, NY, 1–es.
[47]
T. Salem, S. Workman, M. Zhai, and N. Jacobs. 2016. Analyzing human appearance as a cue for dating images. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV’16). IEEE, 1–8.
[48]
M. Sandbye. 2014. Looking at the family photo album: A resumed theoretical discussion of why and how. Journal of Aesthetics & Culture 6, 1 (2014), 25419.
[49]
C. Schreiber. 2014. The construction of “female citizens”: a socio-historical analysis of girls’ education in Luxembourg. Educational Research 56, 2 (2014), 137–154.
[50]
J. Scott and G. Marshall. 2009. A Dictionary of Sociology. Oxford University Press.
[51]
T. J. Sejnowski. 2018. The Deep Learning Revolution. MIT Press.
[52]
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2019. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128, 2 (Oct. 2019), 336–359.
[53]
E. Serafinelli. 2018. Digital Life on Instagram: New Social Communication of Photography. Emerald Group Publishing.
[54]
P. Sorcinelli. 2004. Imago. Laboratorio di ricerca storica e di documentazione iconografica sulla condizione giovanile nel XX secolo. Rivista Di Storia E Storiografia 5, 5 (Nov. 2004), 200–202.
[55]
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2015. Rethinking the Inception architecture for computer vision. arxiv:1512.00567 [cs.CV]
[56]
L. Tilton and T. Arnold. 2019. Distant viewing: Analyzing large visual corpora. Digital Scholarship in the Humanities 34, Supplement 1 (2019), i3–i16.
[57]
T. Nguyen. 2018. Yolo Face Implementation. Retrieved August 3, 2020, from https://github.com/sthanhng/yoloface.
[58]
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347–10357.
[59]
F. Vaccaro, M. Bertini, T. Uricchio, and A. Del Bimbo. 2020. Image retrieval using multi-scale CNN features pooling. In Proceedings of the 2020 International Conference on Multimedia Retrieval. 311–315.
[60]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, I. Polosukhin, and Ł. Kaiser. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[61]
X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy. 2018. ESRGAN: Enhanced super-resolution generative adversarial networks. arxiv:1809.00219 [cs.CV]
[62]
M. Wevers and T. Smits. 2020. The visual digital turn: Using neural networks to study historical images. Digital Scholarship in the Humanities 35, 1 (2020), 194–207.
[63]
W. Yin, T. Mei, C. W. Chen, and S. Li. 2013. Socialized mobile photography: Learning to photograph with social context via mobile devices. IEEE Transactions on Multimedia 16, 1 (2013), 184–200.
[64]
K. Zhang. 2019. Image Restoration Toolbox. https://github.com/cszn/KAIR.
[65]
K. Zhang, W. Zuo, and L. Zhang. 2018. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing 27, 9 (2018), 4608–4622.
[66]
W. Zhang, T. Yao, S. Zhu, and A. E. Saddik. 2019. Deep learning–based multimedia analytics: A review. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 1s, Article 2 (Jan. 2019), 26 pages.
