HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: nth
  • failed: cjhebrew

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: arXiv.org perpetual non-exclusive license
arXiv:2012.05616v2 [cs.CV] 25 Feb 2024

Enhancing Human Pose Estimation in Ancient Vase Paintings via Perceptually-grounded Style Transfer Learning

Prathmesh Madhu [email protected] Pattern Recognition Lab, Friedrich Alexander UniversityMartensstr. 3ErlangenBavaria91054Germany Angel Villar-Corrales [email protected] Autonomous Intelligent Systems, University of BonnFriedrich-Hirzebruch-Allee 5BonnNorth Rhine-Westphalia53115Germany Ronak Kosti [email protected] Pattern Recognition Lab, Friedrich Alexander UniversityMartensstr. 3ErlangenBavaria91054Germany Torsten Bendschus [email protected] Institut für Klassische Archäologie, FAUKochstr. 4 / Nr. 19ErlangenBavaria91054Germany Corinna Reinhardt [email protected] Institut für Klassische Archäologie, FAUKochstr. 4 / Nr. 19ErlangenBavaria91054Germany Peter Bell [email protected] German Studies and Art Studies, Philipps University of MarburgDeutschhausstr. 3MarburgHesse35037Germany Andreas Maier [email protected] Pattern Recognition Lab, Friedrich Alexander UniversityMartensstr. 3ErlangenBavaria91054Germany  and  Vincent Christlein [email protected] Pattern Recognition Lab, Friedrich Alexander UniversityMartensstr. 3ErlangenBavaria91054Germany
Abstract.

Human pose estimation (HPE) is a central part of understanding the visual narration and body movements of characters depicted in artwork collections, such as Greek vase paintings. Unfortunately, existing HPE methods do not generalise well across domains resulting in poorly recognised poses. Therefore, we propose a two step approach: (1) adapting a dataset of natural images of known person and pose annotations to the style of Greek vase paintings by means of image style-transfer. We introduce a perceptually-grounded style transfer training to enforce perceptual consistency. Then, we fine-tune the base model with this newly created dataset. We show that using style-transfer learning significantly improves the SOTA performance on unlabelled data by more than 6% mean average precision (mAP) as well as mean average recall (mAR). (2) To improve the already strong results further, we created a small dataset (ClassArch) consisting of ancient Greek vase paintings from the 6–\nth5 century BCE with person and pose annotations. We show that fine-tuning on this data with a style-transferred model improves the performance further. In a thorough ablation study, we give a targeted analysis of the influence of style intensities, revealing that the model learns generic domain styles. Additionally, we provide a pose-based image retrieval to demonstrate the effectiveness of our method. The code and pretrained models can be found at https://github.com/angelvillar96/STLPose.

Pose Estimation, Greek Vase Paintings, Style Transfer Learning, Digital Humanities
copyright: acmlicensedjournal: JOCCHjournalyear: 2022journalvolume: 1journalnumber: 1article: 1publicationmonth: 1price: 15.00doi: 10.1145/3569089journal: JOCCHccs: Applied computing Digital libraries and archivesccs: Computing methodologies Visual content-based indexing and retrievalccs: Computing methodologies Supervised learning

1. Introduction

Human pose estimation (HPE) is highly challenging as it is difficult to have one method that can generalise across all domains. Estimating human pose involves localising each visible body-keypoint (Fig. 0(c)), however the state-of-the-art (SOTA) methods underperform when tested on different domains, for example ancient Greek vase paintings (Fig. 1). HPE is central to understanding the visual narration and body movements of the characters depicted in these paintings and the recent rapid digitisation of art collections has created an opportunity to use HPE as a tool to digitally examine such artworks. These digital copies are usually either photographic reproductions (Mensink and Van Gemert, 2014) or scans of existing archives (Seguin et al., 2018). In addition to the preservation of cultural heritage, these digital collections allow remote access of the invaluable artistic data to the general public. However, due to content complexity and large size, navigating within such collections is often daunting.

To address the challenge of efficiently analysing large digital collections, several computer vision and image analysis techniques have been used for applications, such as artist identification (Johnson et al., 2008), object recognition (Hall et al., 2015; Crowley and Zisserman, 2014; Westlake et al., 2016), character recognition (Madhu et al., 2019), artistic image classification (Carneiro et al., 2012; Saleh and Elgammal, 2016; Cetinic et al., 2018) and pose-matching (Jenicek and Chum, 2019). However, when these methods are evaluated on a different domain, they show sub-optimal performance, c.f. Fig. 1. Hence, an important challenge is to learn effective representations using little data. Human pose representation is one such example.

Refer to caption
(a) Original
Refer to caption
(b) OpenPose
Refer to caption
(c) Ours
Figure 1. Attic red-figure, Zeus and Ganymede, in ancient Greek vase paintings: 0(a) original image, 0(b) pose estimation by OpenPose (Cao et al., 2017), 0(c) our method.

The understanding of visual narration in Greek vase paintings is one of the main objectives in the field of Classical Archaeology. In order to display the actions and situations of a narrative, as well as to characterise the protagonists, ancient Greek artists made use of a broad variety of often similar image elements (Giuliani, 2003). Some of the key aspects of the narrative are illustrated by meaningful interactions and compositional relationships (e. g., postures or gestures) between the characters displayed in the painting (McNiven, 1983). For example, the divine pursuit scene (Fig. 2\nth1 & \nth2 column) is a recurrent narrative in Greek vase paintings, often characterised by a character moving fast from left to right and reaching out with both his arms to catch a woman on her forearm or shoulder (Stansbury-O’Donnell, 2009).

In this work, we propose to exploit these recurrent character interactions and postures in order to navigate semantically through collections of Greek vase paintings. We address image retrieval in such databases by measuring the similarity between character postures. Since ancient Greek artists made use of the postures to depict similar narratives, the retrieved images should display the same scene.

For human pose-based image retrieval in Greek vase paintings, we need a reliable human pose estimation algorithm. We propose a two-step approach: (1) First, we apply style-transfer (Huang and Belongie, 2017) to the COCO (Lin et al., 2014) dataset to generate a synthetic annotated dataset with the style of Greek vase paintings and fine-tune the baseline person detection and pose estimation models on this dataset (Fig. 5 middle). (2) Second, we fine-tune these models on a newly generated dataset for Classical Archaeology (Fig. 5 bottom). We show that both steps improve the person detection and pose estimation tasks and thus the retrieval performance considerably.

In particular, our main contributions are:

  1. (1)

    We introduce the Styled-COCO-Persons (SCP) and the ClassArch (CA) datasets. SCP is a synthetic dataset, generated by applying style transfer, with different style-intensities, to the images from COCO (only ‘person’ class) to mimic the style of the CA dataset (Greek vase paintings). The CA dataset consists of 1783 images (characters) from 1000+ Greek vase paintings along with pose keypoint annotations.

  2. (2)

    We show that by just using styles of the CA dataset on real images, one can improve the task of human pose estimation in Greek vase paintings without requiring any annotations. We also show that fine-tuning this model with the small CA dataset modestly enhances the performance compared to direct transfer learning. Moreover, styled-tuned models outperform state-of-the-art fine-tuned methods on the SCP and CA datasets.

  3. (3)

    We introduce a perceptual loss for style-transfer and show that this is beneficial for both person detection and pose estimation.

  4. (4)

    Additionally, we show that our styled transfer learning based pipeline is also beneficial for retrieving and discovering similar images based on poses of the character in narratives from ancient Greek vase paintings.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2. (\nth1 column) Divine pursuit scene in ancient Greek vase paintings. The central character, a winged persecutor, is depicted with a similar pose, i. e., arms extended towards the right (observer viewpoint) and legs with large strides. (\nth2 column) Leading the bride scene, with the central character bride depicted with similar poses with her left hand extended forward (observer viewpoint) held by the groom. (\nth3 column) Abduction scene, where character on the left is abducting the character on the right. (\nth4 & \nth5 columns) Wrestling in Agonal and Mythological contexts between two main characters.

2. Related work

The task of representing human poses has been studied since the early days of computer vision (Nevatia and Binford, 1977). However, its importance is much older. Pathosformeln (Becker, 2013), the iconic study of basic constructs (units) of body language is one of the firsts to view the body gesture (or posture) also as a way of voicing inner emotions. Body movement’s depiction is essential and central to the historian (Bell et al., 2013; Bell and Impett, 2019), since it gives a way to recognise the inner emotions or expressions of the character. Impett et al. (Impett and Moretti, 2017) studies it with a geometrical construct by operationalisation of body movements, which could be considered a way of Pathosformeln. McNiven et al. (McNiven, 1983) also used human poses as the basis to study the interactions between characters in ancient vase paintings.

Since several years, Convolutional Neural Networks (CNNs) have been dominating computer vision tasks, and HPE is not an exception. After DeepPose (Toshev and Szegedy, 2014), methods followed that improved HPE by using CNN cascades (Wei et al., 2016) and graphical models (Tompson et al., 2014). Two major types of approaches are popular with HPE: bottom-up and top-down.

Bottom-up pose estimation (Pishchulin et al., 2016; Cao et al., 2017; Insafutdinov et al., 2016) directly estimate the location of all keypoints and assemble them into pose skeletons for all people in the image simultaneously. They use CNNs like ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) as backbones to predict keypoints and optimisation-based matching techniques like DeepCut (Pishchulin et al., 2016) and DeeperCut (Insafutdinov et al., 2016) for combining the keypoints into poses. Cao et al. (Cao et al., 2017) introduced Part Affinity Fields (PAFs) which is able to estimate poses in real-time by solving a bi-partite graph matching problem, as a way to solve the optimisation problem of aggregating poses. Bottom-up techniques’ lack of structural information leads to many false positives and often being outperformed by top-down pose estimation approaches.

Top-down pose estimation (Papandreou et al., 2017; Fang et al., 2017; Chen et al., 2018; Xiao et al., 2018) approach HPE in two steps (Fig. 5first row). The first step addresses the problem of detecting all person instances in the image, whereas the second step aims at predicting the body-keypoints for each of the detected person. For person detection, a specific CNN, e. g., from the R-CNN family (Girshick et al., 2014; Girshick, 2015; Ren et al., 2017) is normally used. The second step involves using a single-person pose estimation model to process each of the person instances independently. Tompson et al. (Tompson et al., 2014) first proposed the use of a CNN with multi-resolution receptive fields for the task of body joint localisation. More recent methods, however, focus on refinement techniques, such as Iterative Error Feedback (Carreira et al., 2016), Stacked-hour-glass networks (Newell et al., 2016) and PoseFix (Moon et al., 2019). Multi-scale approach by Sun et al. (Sun et al., 2019) uses a novel architecture to maintain high-resolution representations through the whole estimation process by repeated multi-scale feature fusions. These SOTA methods have improved the performance on their respective benchmark datasets, however, they fail to generalise to domains like Greek vase paintings.

Domain adaptation based approaches mainly aim at bringing the distributions of source domain closer to that of the target, so that a single model can be used for both domains. Methods using domain adaptation via style-transfer mainly transfer the style of target to the source while training online as a way of domain adaption, using style loss (Rodriguez and Mikolajczyk, 2019) or even adapting the domain progressively (Inoue et al., 2018). Some methods also use feature level alignment for aligning the two domains (Li et al., 2017), and others enforce it via self-similarity and domain-dissimilarity loss (Deng et al., 2018). Recently, various methods have focused on reducing data bias in order to enhance the transfer-ability of features in task-specific layers (Maria Carlucci et al., 2017; Long et al., 2015). (Roy et al., 2019) and (Sun and Saenko, 2016) proposed unsupervised domain adaptation techniques, using minimum entropy consensus and co-variance of source and target features respectively.

A few works using generative models have also been proposed. (Russo et al., 2018) proposed a method for optimising bi-directional image transformations and using class consistency loss, while (Hoffman et al., 2018) proposed a cycle GAN that adapts representations to combine feature-level and pixel level while enforcing structural (cycle loss) and semantic consistency (task-specific). A work closely related to ours (Jenicek and Chum, 2019), highlights the importance of using poses for artwork discovery and its subsequent analysis. Their study, however, is based on artworks with some case study images that are relatively easy for SOTA methods, such as OpenPose (Cao et al., 2017) to estimate the poses. One of the first work done on Greek Vase paintings was by Crowley et al. (Crowley and Zisserman, 2013) to generate a correspondence between descriptions and unknown regions in the images of vase paintings. In their work, they describe the challenges of working with Greek vases and annotate the images automatically. On a similar note, estimating poses for characters in ancient Greek Vase paintings presents a completely different challenge, since OpenPose fails very often (Fig. 1).

Our work’s focus, instead, is on using the style transfer to generate a synthetic dataset from already existing labelled dataset like COCO to improve the pose estimation on unlabelled data like ancient Greek vases, by enforcing a pre-computed perceptual consistency loss.

3. Datasets

In order to train deep networks in a supervised fashion, it is very important to have a high-quality annotated dataset. Hence we work with 3 main datasets vizCOCO-Persons (CP) with images of only the ‘person’ category, its corresponding styled counterpart called Styled-COCO-Persons (SCP), and we also introduce our own annotated dataset called ClassArch (CA). Each dataset is labelled with person bounding boxes and their corresponding body pose keypoints, details shown in Tab. 1. We focus on these datasets for training and evaluation of our models.

Table 1. Datasets used in our experiments. For COCO-Persons (CP), we use images from the person category of COCO (Lin et al., 2014). Styled-COCO-Persons (SCP) is generated by using CP images as content and various splits of ClassArch (CA) images as styles (Images: images, Persons: person bounding boxes, Poses: pose annotations).
Datasetsnormal-→\rightarrow CP & SCP CA
Splitnormal-→\rightarrow Train Val Total Train Val Total
Images 64115 2693 66808 1210 303 1513
Persons 257252 10777 268029 2098 531 2629
Poses 149813 6352 156165 1425 303 1728

(a) COCO-Persons (CP) The Common Objects in COntext dataset (COCO) (Lin et al., 2014) was specifically designed for the detection and segmentation of objects in their natural context. COCO has 328K images with over 2.5M labelled instances divided into 91 semantic object categories (e. g., car, person, dog, banana, etc.). We only consider images that include “person” instances, along with their corresponding bounding boxes and pose-keypoints. We call this split as COCO - Persons (CP). This split is taken across the training and validation sets only since the labels for the test set are not publicly available. Consequently, we use the validation set for testing our models. Tab. 1 shows the exact splits for the dataset in terms of images, persons and pose-keypoints. Figs. 3(a) & 3(b) illustrate some samples of CP dataset.

(b) ClassArch (CA) We introduce a challenging dataset from the domain of Classical Archaeology, called ClassArch (CA) dataset. We chose five different recurrent narratives, viz. ‘Pursuits’, ‘Leading of the Bride’, ‘Abductions’, and ‘Wrestling’ in Agonal and Mythological contexts, taken from the period between the \nth6 and \nth5 century BCE. Pose-based analysis of such paintings is of critical importance for Classical Archaeology as discussed in Sec. 1. Each of the narratives in CA has its own set of characters, which appear recurrently and are depicted with similar features and in almost identical poses. Figs. 3(e) & 3(f) illustrate some images and their corresponding person bounding boxes and person keypoints of the CA dataset. Fig. 2 displays two examples from the ‘Pursuit’ (\nth1 & \nth2 column) and ‘Leading the bride’ (bottom row) narratives. In both scenes, the main characters (‘persecutor’/‘fleeing’ & ‘bride’) are depicted with similar posture in every image. CA has different sets of labels associated with it. There are 1513 images, with 2629 person annotations and 1728 pose annotations. More detailed splits are shown in Tab. 1.

(c) Styled-COCO-Persons (SCP) The images in CP significantly differ in semantic content and style from the ancient Greek vase paintings, c.f. Fig. 3(a) vs. Fig. 3(e)). To bridge this domain gap between CP and CA, we use style transfer to adapt the style of the CP dataset to vase paintings.

Style transfer algorithms render a synthetic image that combines the semantic information from one input (denoted as content image) with the texture from the user-defined style image (Gatys et al., 2016). We apply an efficient and fast style transfer technique using adaptive instance normalisation (AdaIN) (Huang and Belongie, 2017) to create SCP, a synthetic dataset that combines the semantic content of the CP with the style of CA. Fig. 3 illustrates the style transfer procedure. We can visually observe that images of Figs. 3(c) & 3(d) are more closer in styles with Fig. 3(e) than Fig. 3(a), i. e., SCP is closer in style with CA, than CP is with CA. The SCP dataset will be released along with the code.

Refer to caption
Figure 3. Style transfer using AdaIN (Huang and Belongie, 2017) with full style intensity (α=1𝛼1\alpha=1italic_α = 1). AdaIN adjusts the first and second order moments of the ‘Content Image’ to match those of the ‘Style Image’. A ‘Styled Image’ (style-transferred) is generated with the semantic content of the ‘content image’ and style of the ‘Style Image’.

Alpha (α)𝛼(\alpha)( italic_α ) and Style-Sets Huang et al. (Huang and Belongie, 2017) suggest a content-style trade-off technique to control the intensity of style transferred to the content image using α[0,1]𝛼01\alpha\in[0,1]italic_α ∈ [ 0 , 1 ]. Based on this, we generate 2 groups of SCP. First with α=0.5𝛼0.5\alpha=0.5italic_α = 0.5, meaning that we only transfer half of the style intensity to the content images; and a second one in which α𝛼\alphaitalic_α is chosen randomly from the uniform distribution U[0,1]𝑈01U[0,1]italic_U [ 0 , 1 ]. The second group contains images across the whole spectrum, from no style (α=0𝛼0\alpha=0italic_α = 0) to full style (α=1𝛼1\alpha=1italic_α = 1).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
(a) CP Images
Refer to caption
(b) CP Labels
Refer to caption
(c) SCP, α=0.5𝛼0.5\alpha=0.5italic_α = 0.5
Refer to caption
(d) SCP, α=U𝛼𝑈\alpha=Uitalic_α = italic_U
Refer to caption
(e) CA Images
Refer to caption
(f) CA Labels
Figure 4. Dataset Samples:3(a) Images & 3(b) Labels of CP dataset; 3(c) & 3(d) are samples from the SCP dataset with α=0.5𝛼0.5\alpha=0.5italic_α = 0.5 and α=U𝛼𝑈\alpha=Uitalic_α = italic_U respectively; 3(e) shows images with 3(f) the corresponding labels of our CA dataset. Each labelled example shows the corresponding person bounding boxes and their pose keypoints.

Additionally, we generate two more SCP dataset variations with the method described above using a different dataset of style images. We name this dataset as Red-Black figures (100 in total) or just RB, they are similar in style to the CA dataset but do not have any labels. Our hypothesis is that the model should be able to learn the styles and not the content of style images. In the end, we have four groups of SCP dataset with two different combinations of α𝛼\alphaitalic_α with the two style-sets RB and CA.

Open Source Images (OSI) We will publicly release the CP and SCP datasets + annotations. 71 sample OSI links from the CA dataset are in the supplementary material, the rest can be downloaded from Beazley Archive Pottery Database111https://www.beazley.ox.ac.uk/pottery/default.htm for research purposes. We’ll release the permanent links to those images, along with the pose and person annotations.

4. Proposed Method

In this section, we first present our proposed style-based transfer learning approach to enhance pose estimation. We then briefly explain our models that were trained and evaluated for all datasets mentioned in Sec. 3. Lastly, we propose a perceptual loss as a regularizer to improve the estimation of perceptually similar poses with different styles.

4.1. Pose Estimation Approach

We take a top-down approach to pose-estimation, which is divided into two stages. Fig. 5 details the two stages of the top-down approach. The first stage (A/A*) detects all the persons in an image and then estimates the keypoints (B/B*) for each person instance, then creating the poses for each instance by pose-parsing (C/C*). The models without * are trained on styled datasets while the ones with * are fine-tuned on the CA dataset. We use Faster-RCNN (Ren et al., 2017) as our person detector that was trained on the COCO (Lin et al., 2014) dataset. Top-down pose estimation approaches dominate the COCO keypoint detection challenge in the past few years, and several use the HRNet (Sun et al., 2019) as their backbones. Hence, we chose HRNet-W32 (henceforth denoted as HRNet) as our pose estimation model.

2-Step Training Approach

We adopt a 2 step approach to enhancing pose estimation, as shown in Fig. 5. In the first step (Fig. 5, second row), we train our detector and pose estimation model on styled data (different groups of SCP) to generate styled models (A and B in Fig. 5). In this step, the styled models at the end of their training are expected to learn the styles of the target data, while trying to maintain the performance on the original task.

Refer to caption
Figure 5. (first row, Top-down pose estimation) - (A*) styled person detector detects all instances, (B*) for which the body joint locations are predicted using a person keypoint detector, (C*) The pose skeletons are assembled by connecting the detected keypoints for each person. 2 Step Training Approach: Step 1 (second row, Styled Models) Person Detector trained on SCP persons data, and HRNet on SCP poses data; Step 2 (third row, Styled-Tuned Models) Styled Person Detector from second row is fine-tuned on CA persons data, and Styled HRNet is fine-tuned on CA pose data.

In the second step (Fig. 5, third row), we fine-tune these styled models on our CA data. During this step, the models that have learned the styles in the first step, now focus on improving their performance for the target dataset of CA. The final detector (A*) and pose estimator (B*) models are initialized with the A and B models from the first step and then fine-tuned on SCP persons and poses data.

We report all our experiments using four kinds of models for both tasks, person detection and pose estimation.

1. Baseline models are SOTA models. In case of the person detector, we drop the heads for all the classes except one, and fine-tune it on the ‘person’ class, further denoted as our baseline model.

2. Tuned models are SOTA models fine-tuned on the CA dataset. For the detector, we drop all the heads except one (similar as for baseline model) in Faster-RCNN and fine-tune it on persons data of the CA dataset. Likewise, for pose estimation, we take the SOTA HRNet and fine-tune it on pose data of CA.

3. Styled models are SOTA models trained on a particular group of the SCP dataset. As explained in Sec. 3, there are 4 different groups of SCP dataset. Depending on the values of α𝛼\alphaitalic_α and style-set (RB or CA), the Styled models are trained on that particular group, for the detector as well as the pose estimator. Accordingly, there are 4 different Styled models for each of the detector and the pose estimator.

4. Styled-Tuned (Sdnormal-→\rightarrowTd) models are Styled models ((3) above) fine-tuned on CA dataset. Accordingly, for the detector, Sdnormal-→\rightarrowTd model is a Styled Faster-RCNN model fine-tuned on CA persons data. Similarly, for the pose estimator, Sdnormal-→\rightarrowTd model is a Styled HRNet model fine-tuned on CA poses data. Hence, depending on the group of the Styled model, there is an equivalent Sdnormal-→\rightarrowTd model.

4.2. Enforcing Perceptual Similarity

While training the styled models, the network is fed with styled data. The advantage of doing this is to allow the model to expand its capacity to recognise perceptually similar persons/poses with different styles. In order to achieve content consistency in the perceptual space, we enforce a pre-computed perceptual loss (Johnson et al., 2016) while training, in addition to the regular loss. The model is penalised if it is not able to maintain perceptual consistency.

Let’s denote the task loss by LTsubscript𝐿𝑇L_{T}italic_L start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, where LT=Ldetsubscript𝐿𝑇subscript𝐿𝑑𝑒𝑡L_{T}=L_{det}italic_L start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = italic_L start_POSTSUBSCRIPT italic_d italic_e italic_t end_POSTSUBSCRIPT is for detector models, and LT=Lposesubscript𝐿𝑇subscript𝐿𝑝𝑜𝑠𝑒L_{T}=L_{pose}italic_L start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = italic_L start_POSTSUBSCRIPT italic_p italic_o italic_s italic_e end_POSTSUBSCRIPT for pose models. We adopt two flavours of the combined loss, each for the detector as well as pose. In the first flavour (Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT), we adaptively weigh the perceptual loss (Lperceptsubscript𝐿𝑝𝑒𝑟𝑐𝑒𝑝𝑡L_{percept}italic_L start_POSTSUBSCRIPT italic_p italic_e italic_r italic_c italic_e italic_p italic_t end_POSTSUBSCRIPT) with the corresponding detector or the pose loss, as shown in Eq. 1a. While, in the second flavour (Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT), we weigh each loss term with optimal values of λ1subscript𝜆1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and λ2subscript𝜆2\lambda_{2}italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT chosen using hyperparameter optimisation, as shown in Eq. 1b.

(1a) Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1\displaystyle L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT =LT+LTLperceptabsentsubscript𝐿𝑇subscript𝐿𝑇subscript𝐿𝑝𝑒𝑟𝑐𝑒𝑝𝑡\displaystyle=L_{T}+L_{T}\ast L_{percept}= italic_L start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT + italic_L start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∗ italic_L start_POSTSUBSCRIPT italic_p italic_e italic_r italic_c italic_e italic_p italic_t end_POSTSUBSCRIPT
(1b) Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2\displaystyle L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT =λ1LT+λ2Lperceptabsentsubscript𝜆1subscript𝐿𝑇subscript𝜆2subscript𝐿𝑝𝑒𝑟𝑐𝑒𝑝𝑡\displaystyle=\lambda_{1}\ast L_{T}+\lambda_{2}\ast L_{percept}= italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∗ italic_L start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∗ italic_L start_POSTSUBSCRIPT italic_p italic_e italic_r italic_c italic_e italic_p italic_t end_POSTSUBSCRIPT

5. Experiments and Analysis

The exact number of images, person bounding boxes and pose annotations, along with the corresponding train/val splits used for our experiments are mentioned in Tab. 1. In this section, we describe the evaluation protocol to train our detector and pose estimator. We also present the experimental results and discuss our findings.

5.1. Training Setup

In general, we use the standard parameters of the SOTA models and make adjustments to suit our experimental needs. For person detection (Faster-RCNN), we use an initial learning rate (lrinit𝑖𝑛𝑖𝑡{}_{init}start_FLOATSUBSCRIPT italic_i italic_n italic_i italic_t end_FLOATSUBSCRIPT) of 0.0001 with a scheduler lrscheduler𝑠𝑐𝑒𝑑𝑢𝑙𝑒𝑟{}_{scheduler}start_FLOATSUBSCRIPT italic_s italic_c italic_h italic_e italic_d italic_u italic_l italic_e italic_r end_FLOATSUBSCRIPT on plateau (3 epochs) which reduces the lr by a factor of 0.33. We use Adam (Kingma and Ba, 2015) with its default parameters and a batch-size (bs) of 8. Standard multi-task loss metric, a combination of log loss and regression loss (Ldet=LCLS+LRegsubscript𝐿𝑑𝑒𝑡subscript𝐿𝐶𝐿𝑆subscript𝐿𝑅𝑒𝑔L_{det}=L_{CLS}+L_{Reg}italic_L start_POSTSUBSCRIPT italic_d italic_e italic_t end_POSTSUBSCRIPT = italic_L start_POSTSUBSCRIPT italic_C italic_L italic_S end_POSTSUBSCRIPT + italic_L start_POSTSUBSCRIPT italic_R italic_e italic_g end_POSTSUBSCRIPT), is used in our experiments for the detector. We train for 25 epochs on the CP dataset and 30 each on the SCP and CA datasets. However, we found that our models usually converge between 8–12 epochs.
Similar to the detector, we use Adam with its default values for pose models (HRNet) and bs===64. With an lrinit𝑖𝑛𝑖𝑡{}_{init}start_FLOATSUBSCRIPT italic_i italic_n italic_i italic_t end_FLOATSUBSCRIPT of 0.01 and a lrscheduler𝑠𝑐𝑒𝑑𝑢𝑙𝑒𝑟{}_{scheduler}start_FLOATSUBSCRIPT italic_s italic_c italic_h italic_e italic_d italic_u italic_l italic_e italic_r end_FLOATSUBSCRIPT on plateau (3 epochs) which reduces the lr by a factor of 0.1. We train all pose models for 100 epochs. Akin to the original HRNet (Sun et al., 2019), we also use the same configs (augmentations, image size) for fair comparison. Like HRNet, Object Keypoint Similarity (OKS) is used as an evaluation metric in our experiments as a simple Euclidean distance (Lpose=LMSEsubscript𝐿𝑝𝑜𝑠𝑒subscript𝐿𝑀𝑆𝐸L_{pose}=L_{MSE}italic_L start_POSTSUBSCRIPT italic_p italic_o italic_s italic_e end_POSTSUBSCRIPT = italic_L start_POSTSUBSCRIPT italic_M italic_S italic_E end_POSTSUBSCRIPT) for pose estimation.

In both cases, person detection and pose estimation, we report the mean Average Precision (mAP) as well as the corresponding mean Average Recall (mAR).

5.2. Experiments

We compare different models in separate tables to give a clear understanding of our methods. As described in Sec. 3, Style-set (SS) represents two datasets CA and RB and alpha (α𝛼\alphaitalic_α) represents the amount style transferred in the SCP dataset from no style (α𝛼\alphaitalic_α = 0) to full style (α𝛼\alphaitalic_α = 1). Tab. 2 (Results A) compares the baseline models with styled models for detector as well as pose. Similarly, Tab. 3(b) (Results B) compares the tuned with styled-tuned. Tab. 4 (Results C) shows the influence of using different data quantities to fine-tune our models, where as Tab. 5 shows the advantage of using perceptual loss.

Baseline vs Styled models

(Results A, Tab. 2) It is important to understand the impact of styles on the main task for detection and pose estimation. We study the impact of style-transfer by comparing baseline and styled models. As shown in Tab. 2 (SCP column), we observe that the styled models perform consistently much better than their baseline counterpart for detection and pose estimation. When tested on the CA dataset, counter-intuitively, these models underperform in detection. One potential reason is that the network has never seen the complex vase dataset during training. Conversely for pose estimation, styled models unambiguously are better for both SCP and CA datasets. Specifically, styled models, which were not trained on CA, give a considerable jump in performance: 7.62 (mAP) & 8.06 (mAR) when tested on CA.

Table 2. Results A: Comparing baseline model and styled models with different combinations of α𝛼\alphaitalic_α (stylization factor) and SS (RB or CA), 2(a) for pose estimator and 2(b) person detector. α=0.5𝛼0.5\alpha=0.5italic_α = 0.5 or α=U𝛼𝑈\alpha=Uitalic_α = italic_U (randomly sampled from a uniform distribution: U=uniform(0,1)𝑈𝑢𝑛𝑖𝑓𝑜𝑟𝑚01U=uniform(0,1)italic_U = italic_u italic_n italic_i italic_f italic_o italic_r italic_m ( 0 , 1 )). All values in terms of mAP, except CAmAR (mAR). The (+/-) is in reference to the corresponding baselines.
Model CP SCP CA CAmAR α𝛼\alphaitalic_α SS
Baseline 76.5 46.2 24.7 30.9 - -
Styled 73.4 (-3.1) 54.4 (+8.2) 29.7 (+5.0) 36.0 0.5 RB
74.0 (-2.5) 53.7 (+7.5) 30.9 (+6.2) 37.7 U𝑈Uitalic_U RB
74.0 (-2.5) 53.8 (+7.6) 30.6 (+5.9) 36.9 0.5 CA
74.3 (-2.2) 53.5 (+7.3) 32.3 (+7.6) 39.0 U𝑈Uitalic_U CA
(a) Pose Estimation
Model CP SCP CA CAmAR α𝛼\alphaitalic_α SS
Baseline 39.4 24.2 10.4 9.8 - -
37.5 (-1.9) 33.4 (+9.2) 7.6 (-2.2) 9.8 0.5 RB
36.9 (-2.5) 32.1 (+7.9) 6.5 (-3.3) 8.5 U𝑈Uitalic_U RB
37.7 (-1.7) 33.2 (+9.0) 8.2 (-1.6) 10.0 0.5 CA
Styled 37.0 (-2.4) 32.6 (+8.4) 6.5 (-3.3) 8.6 U𝑈Uitalic_U CA
(b) Person Detection

Tuned vs. Styled-Tuned models

(Results B, Tab. 3(b)) With the goal of enhancing pose estimation on our CA dataset, a naive approach is to fine-tune on this data, we call these models as Tuned models. Then, we take the styled models (Tab. 2), which have already learned the styles of CA data, and fine-tune them on our CA data (Styled-Tuned or Sdnormal-→\rightarrowTd). As seen in Tab. 3(b), the Sdnormal-→\rightarrowTd models give a better performance as compared to their Tuned counterparts for pose estimation. Irrespective of the combination of α𝛼\alphaitalic_α and SS, the pose models tend to perform better. We argue that this is partly because the models gradually learn the styles (SCP), while optimising for the main task. During training, the Styled models (Tab. 2) are able to see the different spectrum of style intensities. They adapt the styles while maintaining a consistent performance over the main task. However, for person detection the performance of Sdnormal-→\rightarrowTd model is detrimental in comparison to the Tuned counterpart. One reason for this can be attributed to the overlapping objects: animals and persons – which makes the person detection more difficult for the Sdnormal-→\rightarrowTd models as compared to their Tuned counter parts. Another reason for lower precision is the lack of ground truth annotations for side characters of the scene. For poses however, the overlap of keypoints when compared to the bounding boxes is very small and hence the model generalizes better from styled models in comparison to directly tuned models.

Table 3. Results B: Comparing tuned model with the styled-tuned (Sdnormal-→\rightarrowTd) model, with different combinations of α𝛼\alphaitalic_α (stylization factor) and SS (RB or CA), for 3(a) pose estimator 3(b) and detector. α=0.5𝛼0.5\alpha=0.5italic_α = 0.5 or α=U𝛼𝑈\alpha=Uitalic_α = italic_U (randomly sampled from a uniform distribution: U=uniform(0,1)𝑈𝑢𝑛𝑖𝑓𝑜𝑟𝑚01U=uniform(0,1)italic_U = italic_u italic_n italic_i italic_f italic_o italic_r italic_m ( 0 , 1 )). All values in mAP, except CAmAR (mAR). The (+/-) is in reference to the corresponding baselines.
Model CP SCP CA CAmAR α𝛼\alphaitalic_α SS
Tuned 14.0 9.3 65.6 72.3 - -
11.8 (-2.2) 10.5 (+1.2) 66.8 (+1.2) 73.3 0.5 RB
20.3 (+6.3) 14.5 (+5.2) 67.2 (+1.6) 73.3 U𝑈Uitalic_U RB
34.9 (+20.9) 22.4 (+13.1) 66.6 (+1.0) 73.1 0.5 CA
Sdnormal-→\rightarrowTd 28.0 (+14.0) 18.5 (+9.2) 67.1 (+1.5) 73.6 U𝑈Uitalic_U CA
(a) Pose Estimation
Model CP SCP CA CAmAR α𝛼\alphaitalic_α SS
Tuned - - 49.4 37.0 - -
- - 44.3 (-5.1) 32.9 (-4.1) 0.5 RB
- - 43.0 (-6.4) 32.6 (-4.4) U𝑈Uitalic_U RB
- - 43.7 (-5.7) 33.4 (-3.6) 0.5 CA
Sdnormal-→\rightarrowTd - - 43.9 (-5.5) 32.9 (-4.1) U𝑈Uitalic_U CA
(b) Person Detection

With Tab. 2 and Tab. 3(b), we were able to enhance the performance of pose models, with styled as well as styled-tuned models. Styled models can help to improve the performance with a 7.7 pp (mAP) jump in performance, without any labels. While the styled-tuned models show that fine-tuning with styled models is generally beneficial for the performance, the Sdnormal-→\rightarrowTd model for pose gives a significant 1.6 pp (mAP) performance improvement for CA dataset over its Tuned counterpart.

Table 4. Results C: Comparing tuned model with the styled-tuned (Sdnormal-→\rightarrowTd) model, by training on different quantities (25 %, 50 %, 75 %, 100 %) of the CA data for pose estimation. It clearly shows that the styled model learns quicker. All values in mAP, α=0.5𝛼0.5\alpha=0.5italic_α = 0.5 or α=U𝛼𝑈\alpha=Uitalic_α = italic_U (randomly sampled from a uniform distribution: U=uniform(0,1)𝑈𝑢𝑛𝑖𝑓𝑜𝑟𝑚01U=uniform(0,1)italic_U = italic_u italic_n italic_i italic_f italic_o italic_r italic_m ( 0 , 1 )), and SS = RB or CA
Model 25 % 50 % 75 % 100 % α𝛼\alphaitalic_α SS
Tuned 61.3 65.0 65.1 65.6 - -
Sdnormal-→\rightarrowTd 60.6 65.4 65.1 66.8 0.5 RB
62.1 64.7 66.5 67.2 U𝑈Uitalic_U RB
60.6 65.2 65.4 66.6 0.5 CA
61.4 64.8 65.7 67.1 U𝑈Uitalic_U CA

Influence of data quantity

(Results C, Tab. 4): Tab. 4 shows that Styled-Tuned models learn faster than their Tuned counterpart, for each of the corresponding splits of CA data (model with α=U𝛼𝑈\alpha=Uitalic_α = italic_U, SS=RB is more consistent). Specifically, the model with α=0.5𝛼0.5\alpha=0.5italic_α = 0.5, SS=RB and 50% of CA data gives equivalent performance to the Tuned model trained with whole CA data. We see that the deep learning based models converge faster when they have a suitable initialization of weights (Glorot and Bengio, 2010). We argue that training on styled data helps the model to get a better initialization with respect to the dataset distribution. Consequently, the convergence is faster.

Perceptual Loss Comparison:

Tab. 5 shows the influence of perceptual loss for the model performance. Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT is the experiment with adaptively weighing the detector as well as pose losses (Eq. 1a). For Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT (Eq. 1b), we determined the values of λ𝜆\lambdaitalic_λs through a parameter search (Akiba et al., 2019): λ1=0.43subscript𝜆10.43\lambda_{1}=0.43italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.43 and λ2=0.92subscript𝜆20.92\lambda_{2}=0.92italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.92 for person detection and λ1=0.47,λ2=0.018formulae-sequencesubscript𝜆10.47subscript𝜆20.018\lambda_{1}=0.47,\lambda_{2}=0.018italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.47 , italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.018 for pose estimation. We present results for the combination SS=CA & α=U𝛼𝑈\alpha=Uitalic_α = italic_U for the detector, and we chose SS=CA & α=0.5𝛼0.5\alpha=0.5italic_α = 0.5 for pose estimator.

Tab. 5 shows that the perceptual loss (adaptive:Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT or parameterised one:Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT) indeed helps the styled as well as Sdnormal-→\rightarrowTd models to improve their performance, in general. From Tab. 5(a), we can see that the results for Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT and Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT are equal, however we have to note that this is an empirical observation for different values of lambda. Additionally, we note that perceptual loss does not harm the styled model for pose estimation, but actually helps since the styled model is fine-tuned on CA.

Table 5. Perceptual Loss Comparison: Comparing different loss combinations using the styled or styled-tuned (Sdnormal-→\rightarrowTd) model on the CA dataset for 5(a) pose estimation and 5(b) detection, i. e., just the detector or pose loss Ldetsubscript𝐿𝑑𝑒𝑡L_{det}italic_L start_POSTSUBSCRIPT italic_d italic_e italic_t end_POSTSUBSCRIPT/Lposesubscript𝐿𝑝𝑜𝑠𝑒L_{pose}italic_L start_POSTSUBSCRIPT italic_p italic_o italic_s italic_e end_POSTSUBSCRIPT or in combination with perceptual loss Lcombsubscript𝐿𝑐𝑜𝑚𝑏L_{comb}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b end_POSTSUBSCRIPT in two different variants (Eqs. 1a and 1b. For 5(a): α=0.5𝛼0.5\alpha=0.5italic_α = 0.5 and for 5(b): α=U𝛼𝑈\alpha=Uitalic_α = italic_U. Style-Set (SS) is CA for both.
Model mAP mAR Lcombsubscript𝐿𝑐𝑜𝑚𝑏L_{comb}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b end_POSTSUBSCRIPT
30.6 36.9 Lposesubscript𝐿𝑝𝑜𝑠𝑒L_{pose}italic_L start_POSTSUBSCRIPT italic_p italic_o italic_s italic_e end_POSTSUBSCRIPT
30.5 36.5 Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT
Styled 30.5 36.5 Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT
66.6 73.1 Lposesubscript𝐿𝑝𝑜𝑠𝑒L_{pose}italic_L start_POSTSUBSCRIPT italic_p italic_o italic_s italic_e end_POSTSUBSCRIPT
67.2 73.6 Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT
Sdnormal-→\rightarrowTd 67.2 73.6 Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT
(a) Pose Estimation
Model mAP mAR Lcombsubscript𝐿𝑐𝑜𝑚𝑏L_{comb}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b end_POSTSUBSCRIPT
6.5 8.6 Ldetsubscript𝐿𝑑𝑒𝑡L_{det}italic_L start_POSTSUBSCRIPT italic_d italic_e italic_t end_POSTSUBSCRIPT
11.2 11.0 Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT
Styled 9.8 11.2 Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT
43.9 32.9 Ldetsubscript𝐿𝑑𝑒𝑡L_{det}italic_L start_POSTSUBSCRIPT italic_d italic_e italic_t end_POSTSUBSCRIPT
45.4 34.7 Lcomb1subscript𝐿𝑐𝑜𝑚𝑏1L_{comb1}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT
Sdnormal-→\rightarrowTd 45.7 34.0 Lcomb2subscript𝐿𝑐𝑜𝑚𝑏2L_{comb2}italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT
(b) Person Detection

5.3. Qualitative Pose Estimation Results

Tabs. 3(b) & 5 show that styled-tuned models are consistently giving a better performance than any other method. We visualise the predictions of these models for comparison of their performances. Fig. 5(e) shows four characters (wrestler, fleeing, persecutor, bride) and their pose predictions from each of our 5 proposed models.

As shown in Fig. 5(a), the baseline model is the poorest in pose predictions. It is not able to detect majority of keypoints, confuses between the limbs if multiple characters are present and incorrectly predicts the keypoint locations.

Styled model is generally much better (Fig. 5(b)) than the baseline model (also Tab. 2). It is able to predict more keypoints and does not get confused if multiple characters are present. However, it is not able to predict all the visible keypoints and sometimes (Fig. 5(b), last row) gives worse performance than even a baseline model.

Tuned, styled-tuned (Sdnormal-→\rightarrowTd) and styled-tuned with perceptual loss (Sdnormal-→\rightarrowTdp𝑝{}_{p}start_FLOATSUBSCRIPT italic_p end_FLOATSUBSCRIPT) models are overall quite superior to baseline and styled models. They are able to predict almost all of the visible keypoints, do not confuse between multiple characters and are quite precise with the keypoint locations. However, there are subtle differences that make Sdnormal-→\rightarrowTd models better. They are able to predict all the visible keypoints as shown in Fig. 5(d) and Fig. 5(e), whereas tuned models miss some (e. g., Fig. 5(c) where the shoulder joints are missing). The keypoint location precision is also improved using Sdnormal-→\rightarrowTd models. Visually it is difficult to generalise if models with perceptual loss are better or not, however, they are more precise (Fig. 5(e), third row ankle is corrected, but a shoulder is missed).

Figure 6. Pose models comparison: Pose Predictions on 4 examples each from 5(a) baseline, 5(b) styled, 5(c) tuned, 5(d) styled-tuned (Stnormal-→\rightarrowTd) and 5(e) styled-tuned with perceptual loss (Sdnormal-→\rightarrowTdp𝑝{}_{p}start_FLOATSUBSCRIPT italic_p end_FLOATSUBSCRIPT) models. The results clearly show the superiority of predicted poses with the Stnormal-→\rightarrowTd and Sdnormal-→\rightarrowTdp𝑝{}_{p}start_FLOATSUBSCRIPT italic_p end_FLOATSUBSCRIPT models. The characters starting from the top are called wrestler, fleeing, persecutor and bride
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
(a) Baseline
Refer to caption
(b) Styled
Refer to caption
(c) Tuned
Refer to caption
(d) Sdnormal-→\rightarrowTd
Refer to caption
(e) Sdnormal-→\rightarrowTdp𝑝{}_{p}start_FLOATSUBSCRIPT italic_p end_FLOATSUBSCRIPT

.

Figure 6. Pose models comparison: Pose Predictions on 4 examples each from 5(a) baseline, 5(b) styled, 5(c) tuned, 5(d) styled-tuned (Stnormal-→\rightarrowTd) and 5(e) styled-tuned with perceptual loss (Sdnormal-→\rightarrowTdp𝑝{}_{p}start_FLOATSUBSCRIPT italic_p end_FLOATSUBSCRIPT) models. The results clearly show the superiority of predicted poses with the Stnormal-→\rightarrowTd and Sdnormal-→\rightarrowTdp𝑝{}_{p}start_FLOATSUBSCRIPT italic_p end_FLOATSUBSCRIPT models. The characters starting from the top are called wrestler, fleeing, persecutor and bride

6. Pose-Based Retrieval

Our experiments (Sec. 5) showed that Styled models and Styled-Tuned models achieve better keypoint detection results, quantitatively and qualitatively, than their corresponding counterparts. In this section, we show that our two-step training pipeline is also beneficial for discovering similar images based on character poses. We call the process of retrieving images based on poses as pose-based retrieval.

6.1. Experimental Setup

The database for image retrieval and discovery is built from the CA validation dataset. The database consists of 303 images and their respective detected poses for best of baseline, styled, tuned, Sdnormal-→\rightarrowTd and Sdnormal-→\rightarrowTdp𝑝{}_{p}start_FLOATSUBSCRIPT italic_p end_FLOATSUBSCRIPT (with perceptual loss). We perform two retrieval experiments based on the class label for each image, which is either a character or scene. There are 15 unique characters (C) and 5 Scenes (S). Given a query image, we rank the retrieved images based on the OKS metric (Lin et al., 2014). In order to evaluate the retrieval method, we use the precision as: P*=TP*TP*+FP*superscript𝑃𝑇superscript𝑃𝑇superscript𝑃𝐹superscript𝑃P^{*}=\frac{TP^{*}}{TP^{*}+FP^{*}}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = divide start_ARG italic_T italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_ARG start_ARG italic_T italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT + italic_F italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_ARG, where =*@k{}^{*}=@kstart_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT = @ italic_k, consequently P@k := Precision at k; TP := true positives, FP := false positives and FN := false negatives. We report P@k and mAP, for k=1 and k=5. In all our experiments, we exclude the self-retrieval (query itself) from the evaluation. For this task, we compare all the presented models to highlight the quality of our proposed models from an application perspective. The focus of this work is on enhancing poses for Greek vase paintings and not presenting a novel image retrieval method and hence we do not compare with SOTA image retrieval methods.

Table 6. Retrieval Results: The (C)* models show the retrieval values based on characters, where as the (S)* models show for the scenes. P is Precision, and mAP is mean-Average Precision. α=U𝛼𝑈\alpha=Uitalic_α = italic_U; SS: Style-Set. Sdnormal-→\rightarrowTd* are style-tuned models, where p1=Lcomb1𝑝1subscript𝐿𝑐𝑜𝑚𝑏1p1=L_{comb1}italic_p 1 = italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 1 end_POSTSUBSCRIPT (Eq. 1a) and p2=Lcomb2𝑝2subscript𝐿𝑐𝑜𝑚𝑏2p2=L_{comb2}italic_p 2 = italic_L start_POSTSUBSCRIPT italic_c italic_o italic_m italic_b 2 end_POSTSUBSCRIPT (Eq. 1b)
Model P@1 P@5 mAP α𝛼\alphaitalic_α SS
(C) Baseline 31.7 25.5 21.5 - -
(C) Styled 37.3 30.2 23.1 U𝑈Uitalic_U CA
(C) Tuned 43.0 39.8 27.5 - -
(C) Sdnormal-→\rightarrowTd 47.7 42.2 28.3 U𝑈Uitalic_U CA
(C) Sdnormal-→\rightarrowTdp1𝑝1{}_{p1}start_FLOATSUBSCRIPT italic_p 1 end_FLOATSUBSCRIPT 45.7 41.4 28.0 0.5 CA
(C) Sdnormal-→\rightarrowTdp2𝑝2{}_{p2}start_FLOATSUBSCRIPT italic_p 2 end_FLOATSUBSCRIPT 48.3 41.1 28.4 0.5 CA
(S) Baseline 43.6 43.2 35.9 - -
(S) Styled 46.9 43.6 37.3 U𝑈Uitalic_U CA
(S) Tuned 56.4 52.7 41.5 - -
(S) Sdnormal-→\rightarrowTd 58.8 55.2 42.1 U𝑈Uitalic_U CA
(S) Sdnormal-→\rightarrowTdp1𝑝1{}_{p1}start_FLOATSUBSCRIPT italic_p 1 end_FLOATSUBSCRIPT 57.8 53.5 41.5 0.5 CA
(S) Sdnormal-→\rightarrowTdp2𝑝2{}_{p2}start_FLOATSUBSCRIPT italic_p 2 end_FLOATSUBSCRIPT 58.8 53.4 41.8 0.5 CA

baseline𝑏𝑎𝑠𝑒𝑙𝑖𝑛𝑒baselineitalic_b italic_a italic_s italic_e italic_l italic_i italic_n italic_e

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

styled𝑠𝑡𝑦𝑙𝑒𝑑styleditalic_s italic_t italic_y italic_l italic_e italic_d

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

tuned𝑡𝑢𝑛𝑒𝑑tuneditalic_t italic_u italic_n italic_e italic_d

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

Sdnormal-→\rightarrowTd

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

Sdnormal-→\rightarrowTdp1𝑝1{}_{p1}start_FLOATSUBSCRIPT italic_p 1 end_FLOATSUBSCRIPT

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

Sdnormal-→\rightarrowTdp2𝑝2{}_{p2}start_FLOATSUBSCRIPT italic_p 2 end_FLOATSUBSCRIPT

Refer to caption
(a) q𝑞qitalic_q
Refer to caption
(b) m1(q)subscript𝑚1𝑞m_{1}(q)italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_q )
Refer to caption
(c) m2(q)subscript𝑚2𝑞m_{2}(q)italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_q )
Refer to caption
(d) m3(q)subscript𝑚3𝑞m_{3}(q)italic_m start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( italic_q )
Refer to caption
(e) m4(q)subscript𝑚4𝑞m_{4}(q)italic_m start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ( italic_q )
Refer to caption
(f) m5(q)subscript𝑚5𝑞m_{5}(q)italic_m start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT ( italic_q )
Figure 7. Discovery and Retrieval comparison: 6(a) are query images (fleeing character, pursuit scene), and the remaining 5 columns (mi(q)subscript𝑚𝑖𝑞m_{i}(q)italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_q )) are the five most similar images across four models; \nth1 row – baseline, \nth2 row – styled, \nth3 row – Tuned, \nth4 row – Sdnormal-→\rightarrowTd, \nth5 and \nth6 row – Sdnormal-→\rightarrowTdp1𝑝1{}_{p1}start_FLOATSUBSCRIPT italic_p 1 end_FLOATSUBSCRIPT and Sdnormal-→\rightarrowTdp2𝑝2{}_{p2}start_FLOATSUBSCRIPT italic_p 2 end_FLOATSUBSCRIPT respectively. The results clearly show that the styled-tuned models retrieve the most precise results based on poses.

6.2. Retrieval Results and Discovery

Tab. 6 presents our pose-based retrieval results. We observe that styled-tuned models are consistently better for C and S, and the styled are better than baselines counterparts. Fig. 7 displays a query image 6(a) along with the top-5 ranked retrievals for the six different evaluated models. It can be observed that the tuned and the styled-tuned models outperform the baseline and the styled models. Fig. 7 row 1 shows poor retrieval results for the baseline model. For a pursuit scene, the first two retrieved samples belong to a wrestling scene while the last 2 belong to leading of the bride scene. The styled model (row 2) is already better wherein all the five retrievals belong to the pursuit scene, however at a character level, it retrieves a persecutor at the \nth5 retrieval. Tuned (row 3) and styled-tuned (row 4-6) models perform similarly well. On a closer look, we see that the \nth2 and \nth4 retrieved samples of the styled-tuned model are closer to the query sample compared to the tuned model.

7. Conclusion

We presented a two-stage training approach for using style transfer and transfer learning in combination with perceptual consistency to improve pose estimation in ancient Greek vase paintings. We show that the use of styled transfer learning as a domain adaptation technique for such data significantly improves the performance of state-of-the-art pose estimation models on unlabelled data by 6 % mean average precision (mAP) as well as mean average recall (mAR). We also analysed the impact of styles as progressive learning in a comprehensive manner showing that models learn generic domain styles. We experimentally showed that our proposed method outperforms their corresponding counterparts for human pose estimation. In general, our method can be applied to diverse unlabelled datasets without explicit supervised learning. Our method also provides a way for exploring diverse cross-domain datasets with low or no labels using human poses as a tool. Finally, we also show that our method can be used for pose-based image retrieval and discovery of similar, relevant poses and corresponding scenes in collections such as ancient Greek vase paintings. For future work, we plan to a) introduce the geometric structures of vases in COCO and Styled-COCO datasets as an augmentation technique during training, b) use shape information of the persons into our framework or using segmentation as a prior for vase paintings.

Acknowledgements.
This paper is partially funded by the FAU Emerging Fields Initiative (EFI) project “Iconographics. Computational Understanding of Iconography and Narration in Visual Cultural Heritage” as well as partially funded by the EU H2020 project “Odeuropa” under grant agreement No. 101004469. The authors would also like to thank NVIDIA for their hardware donation.

References

  • (1)
  • Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, United States, 2623–2631.
  • Becker (2013) Colleen M. Becker. 2013. Aby Warburg’s Pathosformel as Methodological Paradigm. The Journal of Art Historiography 9 (2013), 9–CB1.
  • Bell and Impett (2019) Peter Bell and Leonardo Impett. 2019. Ikonographie und Interaktion. Computergestützte Analyse von Posen in Bildern der Heilsgeschichte. Das Mittelalter 24, 1 (2019), 31–53.
  • Bell et al. (2013) Peter Bell, Joseph Schlecht, and Björn Ommer. 2013. Nonverbal communication in medieval illustrations revisited by computer vision and art history. Visual Resources 29, 1-2 (2013), 26–37.
  • Cao et al. (2017) Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE Computer Society, United States, 7291–7299.
  • Carneiro et al. (2012) Gustavo Carneiro, Nuno Pinho da Silva, Alessio Del Bue, and João Paulo Costeira. 2012. Artistic Image Classification: An Analysis on the PRINTART Database. In Computer Vision – ECCV 2012, Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 143–157.
  • Carreira et al. (2016) João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. 2016. Human Pose Estimation with Iterative Error Feedback. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, United States, 4733–4742. https://doi.org/10.1109/CVPR.2016.512
  • Cetinic et al. (2018) Eva Cetinic, Tomislav Lipic, and Sonja Grgic. 2018. Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications 114 (2018), 107–118.
  • Chen et al. (2018) Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. 2018. Cascaded Pyramid Network for Multi-person Pose Estimation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, United States, 7103–7112. https://doi.org/10.1109/CVPR.2018.00742
  • Crowley and Zisserman (2014) Elliot Crowley and Andrew Zisserman. 2014. The State of the Art: Object Retrieval in Paintings using Discriminative Regions. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014. BMVA Press, UK.
  • Crowley and Zisserman (2013) Elliot J Crowley and Andrew Zisserman. 2013. Of gods and goats: Weakly supervised learning of figurative art. learning 8 (2013), 14.
  • Deng et al. (2018) Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. 2018. Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, US, 994–1003. https://doi.org/10.1109/CVPR.2018.00110
  • Fang et al. (2017) Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional Multi-person Pose Estimation. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, US, 2353–2362. https://doi.org/10.1109/ICCV.2017.256
  • Gatys et al. (2016) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image Style Transfer Using Convolutional Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, US, 2414–2423. https://doi.org/10.1109/CVPR.2016.265
  • Girshick (2015) Ross Girshick. 2015. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, US, 1440–1448. https://doi.org/10.1109/ICCV.2015.169
  • Girshick et al. (2014) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, US, 580–587. https://doi.org/10.1109/CVPR.2014.81
  • Giuliani (2003) Luca Giuliani. 2003. Bild und Mythos: Geschichte der Bilderzählung in der griechischen Kunst. CH Beck, München.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 9), Yee Whye Teh and Mike Titterington (Eds.). PMLR, Chia Laguna Resort, Sardinia, Italy, 249–256.
  • Hall et al. (2015) Peter Hall, Hongping Cai, Qi Wu, and Tadeo Corradi. 2015. Cross-depiction problem: Recognition and synthesis of photographs and artwork. Computational Visual Media 1, 2 (01 Jun 2015), 91–103. https://doi.org/10.1007/s41095-015-0017-1
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, US, 770–778.
  • Hoffman et al. (2018) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning. PMLR, PMLR, Sweden, 1989–1998.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, United States, 4700–4708.
  • Huang and Belongie (2017) Xun Huang and Serge Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, United States, 1501–1510.
  • Impett and Moretti (2017) Leonardo Impett and Franco Moretti. 2017. Totentanz. Operationalizing Aby Warburg’s Pathosformeln. Technical Report. Stanford Literary Lab, Stanford.
  • Inoue et al. (2018) Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2018. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, United States, 5001–5009.
  • Insafutdinov et al. (2016) Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. 2016. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision. Springer, Netherlands, 34–50.
  • Jenicek and Chum (2019) Tomas Jenicek and Ondřej Chum. 2019. Linking Art through Human Poses. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Australia, 1338–1345.
  • Johnson et al. (2008) C Richard Johnson, Ella Hendriks, Igor J Berezhnoy, Eugene Brevdo, Shannon M Hughes, Ingrid Daubechies, Jia Li, Eric Postma, and James Z Wang. 2008. Image processing for artist identification. IEEE Signal Processing Magazine 25, 4 (2008), 37–48.
  • Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Computer Vision – ECCV 2016, Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Vol. 9906. Springer International Publishing, Cham, 694–711. https://doi.org/10.1007/978-3-319-46475-6_43
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic gradient descent. In ICLR: International Conference on Learning Representations. ICLR, US, –.
  • Li et al. (2017) Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. 2017. Universal style transfer via feature transforms. In Advances in neural information processing systems. NIPS, USA, 386–396.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, Switzerland, 740–755.
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In International conference on machine learning. PMLR, France, 97–105.
  • Madhu et al. (2019) Prathmesh Madhu, Ronak Kosti, Lara Mührenberg, Peter Bell, Andreas Maier, and Vincent Christlein. 2019. Recognizing Characters in Art History Using Deep Learning. In Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia heritAge Contents. ACM, France, 15–22.
  • Maria Carlucci et al. (2017) Fabio Maria Carlucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulo. 2017. Autodial: Automatic domain alignment layers. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Italy, 5067–5075.
  • McNiven (1983) Timothy John McNiven. 1983. Gestures in Attic Vase Painting: use and meaning, 550-450 BC. University of Michigan, Michigan.
  • Mensink and Van Gemert (2014) Thomas Mensink and Jan Van Gemert. 2014. The rijksmuseum challenge: Museum-centered visual recognition. In Proceedings of International Conference on Multimedia Retrieval. ICML, China, 451–454.
  • Moon et al. (2019) Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. 2019. PoseFix: Model-Agnostic General Human Pose Refinement Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, US, –.
  • Nevatia and Binford (1977) Ramakant Nevatia and Thomas O Binford. 1977. Description and recognition of curved objects. Artificial intelligence 8, 1 (1977), 77–98.
  • Newell et al. (2016) Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European conference on computer vision. Springer, Springer, Netherlands, 483–499.
  • Papandreou et al. (2017) George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. 2017. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, United States, 4903–4911.
  • Pishchulin et al. (2016) Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V Gehler, and Bernt Schiele. 2016. Deepcut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, US, 4929–4937.
  • Ren et al. (2017) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (2017), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
  • Rodriguez and Mikolajczyk (2019) AL Rodriguez and K Mikolajczyk. 2019. Domain adaptation for object detection via style consistency. In BMVC. BMVA Press, UK, –.
  • Roy et al. (2019) Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, and Elisa Ricci. 2019. Unsupervised domain adaptation using feature-whitening and consensus loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, US, 9471–9480.
  • Russo et al. (2018) Paolo Russo, Fabio M Carlucci, Tatiana Tommasi, and Barbara Caputo. 2018. From source to target and back: symmetric bi-directional adaptive gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, US, 8099–8108.
  • Saleh and Elgammal (2016) Babak Saleh and Ahmed Elgammal. 2016. Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature. International Journal for Digital Art History 2, 2 (Oct. 2016), –. https://doi.org/10.11588/dah.2016.2.23376
  • Seguin et al. (2018) Benoit Seguin, Lisandra Costiner, Isabella di Lenardo, and Frédéric Kaplan. 2018. New techniques for the digitization of art historical photographic archives-the case of the cini foundation in venice. In Archiving Conference (2018, 1). Society for Imaging Science and Technology, Society for Imaging Science and Technology, US, 1–5.
  • Stansbury-O’Donnell (2009) Mark Stansbury-O’Donnell. 2009. Structural differentiation of pursuit scenes. σ𝜎\sigmaitalic_σ τ𝜏\tauitalic_τo: Yatromanolakis -, - (2009), 341–372.
  • Sun and Saenko (2016) Baochen Sun and Kate Saenko. 2016. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision. Springer, Springer, Netherlands, 443–450.
  • Sun et al. (2019) Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, US, 5693–5703.
  • Tompson et al. (2014) Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. 2014. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems. NIPS, Canada, 1799–1807.
  • Toshev and Szegedy (2014) Alexander Toshev and Christian Szegedy. 2014. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, USA, 1653–1660.
  • Wei et al. (2016) Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. IEEE, US, 4724–4732.
  • Westlake et al. (2016) Nicholas Westlake, Hongping Cai, and Peter Hall. 2016. Detecting people in artwork with cnns. In European Conference on Computer Vision. Springer, Netherlands, 825–841.
  • Xiao et al. (2018) Bin Xiao, Haiping Wu, and Yichen Wei. 2018. Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV). ECCV, Germany, 466–481.