Abstract
Squamous cell carcinoma is the most common type of cancer that occurs in squamous cells of epithelial tissue. Histopathological evaluation of tissue samples is the gold-standard approach for carcinoma diagnosis. SCC detection based on various histopathological features often employs traditional machine learning approaches or pixel-based deep CNN models. This study aims to detect the keratin pearl, the most prominent SCC feature, by implementing the RetinaNet one-stage object detector. Further, we enhance the model performance by incorporating an attention module. The proposed method is more efficient in detecting small keratin pearls. To the best of our knowledge, this is the first work to detect keratin pearls using an object detection technique. We conducted a comprehensive quantitative and qualitative assessment of the model. The experimental results demonstrate that the proposed approach improved the mAP by about 4% compared to the default RetinaNet model.
1 Introduction
Cancer is a significant cause of death globally, with nearly 10 million deaths in 2020 [38]. Around 70 to 80% of cancers occur in epithelial tissue cells and are called carcinomas. Squamous Cell Carcinoma (SCC) is the most common type of carcinoma in the head and neck region, skin, lungs, esophagus and cervix. Histopathological analysis of tissue biopsy remains the gold-standard approach for cancer diagnosis. Various clinicopathological features such as keratinization, keratin pearl formation and cell arrangement patterns are used to screen SCC initially [3, 5]. The keratin pearl is a keratinized structure, frequently visible in low-grade SCC and rarely observed in high-grade SCC [3]. This feature is most clearly noticeable in low-magnification images (25x, 40x). Figure 1 depicts the formation of keratin pearls in SCC images captured at 40x magnification. In practice, pathologists observe the various histopathological features in SCC samples at low and high magnifications such as 25x, 40x, 100x, 200x and 400x for diagnosis. However, manual evaluation is time-consuming and suffers from inter-/intra-observer variability. Hence, there is an urgent need for automated quantification of the keratinized area and keratin pearl detection from histopathological images. Early detection could undoubtedly improve the survival rate and enable better treatment. In this regard, a computer-assisted diagnostic system could help clinicians obtain faster and more objective diagnostic reports.
Several studies have reported traditional image processing and deep learning approaches for SCC detection and classification [4, 6, 7, 10, 13, 22, 28,29,30, 32, 33, 43, 45,46,47]. Although keratin pearl detection is an initial screening criterion in SCC diagnosis, only minimal attempts to detect keratin pearls have been reported in the literature to date [3, 5]. This study investigates keratin pearl detection using a single-stage object detection model rather than pixel-based deep networks.
We chose an object detection model because it reduces annotation time: drawing bounding boxes around objects is much simpler than pixel-based annotation. Moreover, identifying certain histopathological features such as keratin pearls, nuclei and mitotic figures can aid diagnostic decisions without requiring pixel-based segmentation. This study applies object detection techniques to identify keratin pearls in histopathological images, aiming to assist pathologists in the detection of SCC. We modified the default RetinaNet architecture to detect small keratin pearls and to improve detection confidence. The primary contributions of this study are listed below.
1. A novel object detection method for the detection of keratin pearls.
2. An attention module on top of the RetinaNet backbone network to support the detection of small-scale keratin pearls.
The structure of the paper is outlined as follows. Section 2 presents a short summary of recent articles on SCC detection and classification. In Section 3, we discuss the materials and methods utilized in this study. Section 4 presents the results of the proposed method and its comparison with state-of-the-art models. In Section 5, the paper concludes with a summary of key findings.
2 Related work
In the literature, various approaches have been reported for the detection and classification of SCC in different organs of origin [10, 22, 28, 30, 32, 43, 47]. Additionally, carcinoma features such as increased mitotic division [4] and irregular nuclei shape and size [13, 29, 33, 46] have been detected using machine learning algorithms and deep learning approaches [3, 5, 7, 10, 22, 28, 30, 32, 43, 45, 47]. However, only a couple of studies have addressed keratinization and keratin pearl detection, using the Chan-Vese model [3] and a Random Forest classifier with Gabor features [5].
Deep learning-based methods have been used to extract nuclei from SCC histopathological images [6, 7, 45]. Semantic segmentation models such as FCN, U-Net, SegNet, PSPNet and DeepLab have been exploited for microscopic feature segmentation in histopathological images [12, 16, 21, 31]. As ground truth, these models require pixel-based annotations, which are time-consuming to produce. Pixel-based segmentation may also be unnecessary when certain microscopic features, such as the presence of a keratin pearl, are sufficient to characterize early-stage SCC. In that case, object detection models can detect such features with a much simpler annotation format: bounding box annotation around objects saves considerable time. Most studies report using object detection models such as Faster R-CNN, Mask R-CNN, YOLO, Single Shot MultiBox Detector (SSD) and RetinaNet on natural scene datasets, ImageNet, COCO and other object detection datasets. Recently, these models have been applied in the medical field to different imaging modalities such as endoscopy images, CT images, X-rays and histopathological images to identify cells, nuclei, mitotic divisions, cancer lesions and malaria parasites, and for organ segmentation [1, 11, 17,18,19, 27, 35, 36, 41, 44].
Mask R-CNN combines the Faster R-CNN and FCN architectures. It provides object bounding boxes, classes and binary masks that indicate the object's pixels within the bounding box. Jung et al. [18] and Sebai et al. [35] investigated the use of Mask R-CNN for nuclei and mitotic cell segmentation, respectively, on histopathological images. However, while Mask R-CNN achieves good performance on natural images, its performance is worse on medical image datasets [36]. Hoorali et al. [14] introduced an automatic method to diagnose disease from microscopic images using attention-based Mask R-CNN. Khan et al. [19] proposed a modified Faster R-CNN with dilated convolutions in the backbone model to improve the detection of mitotic nuclei in histopathological images; they achieved an average precision of 50.31%. Further, the authors in [2] presented an automated firearms detection system for cargo X-ray images; in their model, RetinaNet outperformed the two-stage R-CNN model in detection performance while matching the speed of one-stage object detection algorithms, namely YOLO and SSD. A few authors also performed comparative studies between object detection models such as YOLOv3, YOLOv5, SSD and RetinaNet [8, 16, 20, 39]; the findings are presented in Table 1.
Table 1 also summarizes the different techniques used to extract microscopic features from histopathological images. Most studies have utilized segmentation models for this purpose, but there is a growing trend toward Faster R-CNN and Mask R-CNN for feature extraction. This sparked our interest in exploring object detection models for analyzing histopathological images. The aim of object detection is to identify the precise location of objects within images and recognize their respective categories. In this study, we present the RetinaNet architecture to detect one of the significant SCC features, namely the keratin pearl, in histopathological images. Further, we revisit RetinaNet to address the specific challenge of detecting small keratin pearls and to increase the confidence of detected objects.
3 Materials and methods
Section 3.1 covers the experiment details followed by a summary of the RetinaNet model in Section 3.2. In Section 3.3, we provide a detailed architecture of the proposed methodology.
3.1 Experimental details
This section provides a detailed explanation of the experimental setup. Section 3.1.1 presents information on the image acquisition method and the total number of images used in the study. Section 3.1.2 describes the system requirements of the proposed architecture.
3.1.1 Image acquisition
The proposed keratin pearl detection workflow uses 101 images captured with a DP21 camera attached to an Olympus CX31 microscope. The images were collected at 40x magnification (objective lens: 4x; eyepiece: 10x) with a resolution of 1600x1200 pixels. The slides were obtained from the pathology department of Kasturba Medical College (KMC), Manipal, India, and were prepared with the Hematoxylin and Eosin (H&E) staining technique. The images were annotated using the LabelImg software [40] to mark bounding boxes around the keratin pearls. A total of 725 keratin pearl instances are present in the training and validation sets; we further increased the dataset size using data augmentation. The dataset was split into 90% for training and 10% for testing. We also detected keratin pearls in 100x magnified images to test the model's robustness.
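To make the data pipeline concrete, the following is a minimal sketch (not the authors' code) of reading LabelImg's default Pascal VOC XML output into bounding-box tuples; the directory name and class label are illustrative.

```python
# Minimal sketch: parse LabelImg (Pascal VOC) XML annotations into box tuples.
# The "annotations" folder and "keratin_pearl" label are hypothetical names.
import xml.etree.ElementTree as ET
from pathlib import Path

def load_voc_boxes(xml_path):
    """Return a list of (label, xmin, ymin, xmax, ymax) from one XML file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")          # e.g. "keratin_pearl"
        bb = obj.find("bndbox")
        boxes.append((label,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

# Usage: collect annotations for the whole 1600x1200 dataset.
dataset = {p.stem: load_voc_boxes(p) for p in Path("annotations").glob("*.xml")}
```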
3.1.2 System requirements
The experiment is performed on CUDA 10.1 toolkit with cuDNN 7 library on 64-bit Ubuntu 16.04 operating system. All experiments are performed with DELL rack server, model PowerEdge R740 with Intel(R) Xeon(R) Gold 6136 CPU @3.00GHz, RAM 128 GB, hard disk 8TB+4TB RAID, GPU NVIDIA Tesla P-100 (16 GB), Python 3.7 with Jupyter hub multiuser environment. Object detection models are implemented using open-source libraries such as OpenCV, CuDNN and Keras with Tensorflow as the backend.
3.2 RetinaNet
This study presents keratin pearl detection using the RetinaNet model and compares the results with state-of-the-art object detection models and a traditional approach. Further, we modified the RetinaNet model by incorporating an attention mechanism. The default anchor scales and ratios are not effective for our dataset; hence, we separately calculated the anchor ratios and scales based on the object sizes in the training dataset. We also replaced standard convolution with depth-wise separable convolution in the last layers of the RetinaNet model to significantly reduce the number of parameters.
As RetinaNet feature maps become deeper, the object's edge definition becomes blurred and the corresponding regression becomes weak. Moreover, deep feature maps have low resolution, posing challenges in detecting small objects. Hence, we introduced an attention module that captures the interdependence between the channels of the convolutional features, boosting performance at little extra computational cost.
This section describes the default RetinaNet architecture and then elaborates the proposed model workflow.
RetinaNet [24] is a one-stage object detector developed to address the imbalances and inconsistencies that exist in single-stage object detection models such as YOLO and SSD. Two crucial building blocks of RetinaNet are its feature pyramid architecture and the use of focal loss. The model comprises two main components: (1) a backbone network that includes a feature extractor and a Feature Pyramid Network (FPN) [23], and (2) a subnetwork block that includes a classification network with the focal loss function [24] and a box regression network with the smooth \(L_1\) loss function. By default, RetinaNet uses ResNet-50 as the feature extractor, as shown in Fig. 2, taking the c2, c3, c4 and c5 blocks of ResNet-50 as the backbone for feature extraction. Additionally, convolutional operations are applied to the c5 block to introduce two further blocks, c6 and c7. These blocks are incorporated into the FPN to compute feature maps at various scales. The FPN generates a multi-scale feature pyramid using a top-down approach with lateral connections, thus facilitating the detection of objects at different scales. We implemented the default RetinaNet model for keratin pearl detection. Further, we propose a RetinaNet model with an attention module to address the specific challenge of detecting small keratin pearls and to improve the Intersection over Union (IoU) of detected objects.
3.3 Proposed architecture
In the proposed architecture, the detector is a modified RetinaNet model, described in Fig. 3. The proposed model incorporates an attention module on top of the backbone network, discussed in Section 3.3.1. To calculate anchor scales and ratios, we utilize the differential evolution search algorithm [37]. In addition, we use depth-wise separable convolution, which requires fewer multiplications than standard convolution and reduces model parameters, as explained in Section 3.3.2. The loss functions are discussed in Section 3.3.3, and the optimization of the RetinaNet hyperparameters in Section 3.3.4. Overall, this section provides a detailed explanation of the proposed RetinaNet model design, including the various steps we followed to create it.
3.3.1 Attention module
Our proposed study involves a channel-based attention module to identify interdependencies among the feature channels. Squeeze-and-Excitation Networks [15] serve as a reference for the proposed module. The module extracts a degree of importance for each feature channel by reducing its feature map to a single numerical value. We performed pooling operations to reduce the spatial dimensions of the feature maps, using both global max pooling and global average pooling: max pooling preserves the most activating pixels, while average pooling creates a smoothed average of all the pixels. The results of the pooling operations are then fed into a fully connected multi-layer perceptron bottleneck structure to learn adaptive scaling weights for the channels. Figure 4 shows the proposed attention module architecture. It comprises six steps, described below.
- Step 1: Generate the feature maps by employing the backbone network.
- Step 2: Perform the global max pooling and global average pooling operations on the feature maps (also known as feature channels) of the \(C_3\), \(C_4\), \(C_5\), \(C_6\) or \(C_7\) block.
- Step 3: Add the results obtained from the pooling operations, as shown in Equation (1):
$$\begin{aligned} z_k = F_a + F_b \end{aligned}$$ (1)
where \(z_k\) is a scalar value that we call the feature descriptor of the \(k^{th}\) channel. \(F_a\) and \(F_b\) are defined in (2) and (3):
$$\begin{aligned} F_a = \frac{1}{W \times H}\sum _{i=1}^{W} \sum _{j=1}^{H} c^{[block]}_k(i,j) \end{aligned}$$ (2)
$$\begin{aligned} F_b = \max _{i,j}\, c^{[block]}_k(i,j) \end{aligned}$$ (3)
Equations (2) and (3) perform the global average pooling and global max pooling operations, respectively. W and H represent the width and height of the feature map, and \(c^{[block]}_k\) is the \(k^{th}\) feature map (channel) of a specified backbone block, namely \(C_3\), \(C_4\), \(C_5\), \(C_6\) or \(C_7\). Steps 2 and 3 are applied to each channel of the feature maps separately. The output tensor is of size \(1\times 1 \times C\), essentially a vector of length C in which each feature map has been reduced to a single value.
- Step 4: Calculate adaptive scaling weights for each channel using a fully connected multi-layer perceptron bottleneck structure. In Fig. 4, the first dense layer acts as a dimensionality reduction layer with a reduction-factor hyperparameter r, set to 32 in this experiment; this hyperparameter is thoroughly described in [15]. The Swish activation function is then applied. We used Swish instead of ReLU because ReLU zeroes out negative values, which may still help capture the pattern underlying the image; Swish ensures that negative values are not zeroed out [34].
- Step 5: Another dense layer restores the channel dimension of size \(1\times 1 \times C\). This is followed by a sigmoid activation function, which provides each channel a smooth gating function and scales the values to the range 0 to 1 [15].
- Step 6: The final output is obtained by element-wise multiplication between the feature maps in the input tensor and their respective learned weights from the previous step.
The attention module is integrated between the feature extractor and the FPN. We considered ResNet-152 as the backbone to obtain the feature vector; ResNet-152 achieved an mAP of 88.45% with the default parameters, better than all other backbone networks considered in this study. The attention module on top of the backbone network is shown in Fig. 5. The blocks of the backbone network are connected to the FPN as \(C_3, C_4, C_5, C_6\) and \(C_7\), which generate the feature pyramid layers \(P_3, P_4, P_5, P_6\) and \(P_7\); the \(C_2\) block is not connected to the FPN because of its large memory consumption. The FPN propagates rich features via a top-down path to lower stages. We applied the proposed attention module on top of ResNet-152, between the \(C_5\) and \(P_5\) feature maps; the calculation of the \(P_6\) feature maps is also performed on the \(C_5\) block with the attention module results. In this study, the \(P_5\) feature maps carry the global contexts extracted by the attention module, which are propagated to \(P_6\) and \(P_7\) in the top-down pathway. We could also add the attention module to \(C_3\) and \(C_4\); however, we observed during the experiments that the best performance is achieved when the proposed attention module is incorporated between \(C_5\) and \(P_5\).
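The following is a minimal Keras sketch (assuming TensorFlow 2.x) of the six steps above. Only the reduction factor r = 32 and the Swish/sigmoid choices come from the text; the input shape and variable names are illustrative.

```python
# Minimal sketch of the channel-attention block (squeeze-and-excitation style
# with added avg + max pooling, Swish bottleneck and sigmoid gating).
from tensorflow.keras import layers

def channel_attention(feature_map, r=32):
    """Channel attention (steps 2-6) applied to one backbone block, e.g. C5."""
    c = feature_map.shape[-1]                      # number of channels C
    # Steps 2-3: squeeze each channel to a scalar via avg + max pooling.
    z = layers.GlobalAveragePooling2D()(feature_map) \
        + layers.GlobalMaxPooling2D()(feature_map)
    # Step 4: bottleneck dense layer; Swish keeps negative information.
    w = layers.Dense(c // r, activation="swish")(z)
    # Step 5: restore the channel dimension and gate to (0, 1) with sigmoid.
    w = layers.Dense(c, activation="sigmoid")(w)
    # Step 6: element-wise rescaling of the input feature maps.
    return feature_map * layers.Reshape((1, 1, c))(w)

# Usage: insert between the C5 block of ResNet-152 and the P5 FPN level.
c5 = layers.Input(shape=(38, 50, 2048))            # illustrative C5 shape
p5_input = channel_attention(c5)
```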
3.3.2 Depth-wise separable convolution
Depth-wise separable convolution factorizes the standard convolution by separating the spatial filtering process from the feature generation mechanism; it comprises separate depth-wise and point-wise layers. Unlike normal CNNs, which apply convolution across all C channels simultaneously, the depth-wise operation performs convolution on each channel individually, as shown in Fig. 6, producing an output of size \(H\times W\times C\) for C channels with \(N1\times M1\times 1\) kernels. The point-wise convolution then applies a 1x1 convolution across the C channels; for K such filters the output size becomes \(H\times W\times K\). In this study, the last layers of the FPN (the C6 and C7 blocks) in the modified RetinaNet model are replaced with depth-wise separable convolution. This improved convolution efficiency and significantly reduced the model's training parameters [25].
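The parameter saving can be verified with a short Keras sketch; the 256-channel feature map below is illustrative, not taken from the paper.

```python
# Sketch: compare parameter counts of a standard 3x3 convolution against its
# depth-wise separable counterpart on an illustrative 256-channel feature map.
from tensorflow.keras import layers, Input, Model

x = Input(shape=(50, 38, 256))
standard = Model(x, layers.Conv2D(256, 3, padding="same")(x))
separable = Model(x, layers.SeparableConv2D(256, 3, padding="same")(x))

# Conv2D:          3*3*256*256 + 256 biases                  = 590,080 params
# SeparableConv2D: 3*3*256 (depth-wise) + 256*256 (point-wise) + 256 = 68,096
print(standard.count_params(), separable.count_params())
```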
3.3.3 Loss functions
Focal loss
One-stage detectors such as SSD and YOLO suffer from class imbalance problems. Focal loss is an extension of the cross-entropy loss, used for classification; it shifts the emphasis toward the foreground classes by down-weighting the easy background examples, so that hard foreground examples dominate the loss during training [24]. It is expressed in (4):
$$\begin{aligned} FL(p_t) = -\alpha _t (1 - p_t)^{\gamma } \log (p_t) \end{aligned}$$ (4)
where \(p_t\) is the model's estimated probability for the ground-truth class, \(\gamma \) is a focusing parameter that concentrates training on hard foreground examples and \(\alpha \) is a class weighting factor. When \(\gamma \) = 0, focal loss reduces to the cross-entropy loss function. In this study, we utilized the default values of 0.25 for \(\alpha \) and 2 for \(\gamma \), which yielded satisfactory results.
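For concreteness, a minimal TensorFlow sketch of Eq. (4) follows; this is a simplified binary (object vs. background) form, not the authors' implementation.

```python
# Sketch of focal loss, Eq. (4), with the defaults alpha=0.25, gamma=2.
import tensorflow as tf

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    """y_true in {0, 1}; y_pred = predicted foreground probability."""
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
    # Easy examples (p_t close to 1) are down-weighted by (1 - p_t)^gamma.
    return -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
```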
Smooth L1-loss
Smooth L1-loss [9], a combination of the L1 and L2 loss functions, is adopted to perform box regression. Equation (5) defines the smooth L1-loss, where y denotes the discrepancy between the ground-truth and predicted values:
$$\begin{aligned} \text {smooth}_{L_1}(y) = \begin{cases} 0.5\,y^2 &{} \text {if } |y| < 1 \\ |y| - 0.5 &{} \text {otherwise} \end{cases} \end{aligned}$$ (5)
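A corresponding one-line TensorFlow sketch of Eq. (5):

```python
# Sketch of smooth L1 loss, Eq. (5), on the box-regression residual y.
import tensorflow as tf

def smooth_l1(y):
    abs_y = tf.abs(y)
    # Quadratic near zero (L2-like), linear for large errors (L1-like).
    return tf.where(abs_y < 1.0, 0.5 * tf.square(y), abs_y - 0.5)
```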
3.3.4 RetinaNet default parameter optimization
Instead of training the model with the default parameters, we examined different hyperparameters to improve the proposed model's performance. We set the initial learning rate to \(10^{-4}\), with a lower bound of \(10^{-9}\) enforced by the ReduceLROnPlateau callback, and used on-the-fly data augmentation. The number of epochs was set to 50 because RetinaNet converges quickly, so a small number of epochs usually suffices. By default, the model resizes images so that the shortest side equals 800 px; if the longest side then exceeds 1333 px, it is resized to 1333 px. We selected an IoU threshold of 0.5, since an IoU of 0.5 is normally considered a good prediction.
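A hedged sketch of this configuration in Keras follows; the optimizer choice, monitored quantity and callback patience values are assumptions not stated in the paper, while the learning-rate bounds and epoch count follow the text.

```python
# Sketch of the training configuration: lr 1e-4 down to 1e-9 via
# ReduceLROnPlateau, 50 epochs. Optimizer/monitor/patience are assumed.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="loss", factor=0.1, patience=2, min_lr=1e-9)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)

# model.fit(train_generator, epochs=50, callbacks=[reduce_lr, early_stop])
```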
Anchor boxes Anchor boxes were first introduced in Faster R-CNN and later adopted by YOLO, SSD and RetinaNet. Because it is highly difficult to generate region proposals for objects of varied shapes, the anchor configuration plays a crucial role in object detection. RetinaNet uses default anchor boxes with fixed ratios and scales, which may not be effective for all types of objects: by default it uses anchors of base sizes 32, 64, 128, 256 and 512 with different aspect ratios and scales at the pyramid levels P3 to P7 [23], for a total of 9 anchors at each level. We considered two algorithms, k-means clustering and the differential evolution search algorithm [37], to obtain three ratios and three scales for the anchor boxes on our dataset; the ratios and scales generated by differential evolution resulted in a better mAP than those from k-means clustering. Our goal was to obtain the anchor setting that maximizes the overlap between the bounding boxes and the anchor boxes. We obtained optimal scales of 1.093, 1.382 and 1.73 and ratios of 0.662, 1.0 and 1.509, and used these values to train the detector.
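The following sketch illustrates the idea of anchor optimization with SciPy's differential evolution. It is a simplified stand-in for the procedure of [37], scoring anchor shapes against annotated box shapes only (ignoring anchor placement); the bounds, ratio convention and input file are illustrative.

```python
# Sketch: tune 3 anchor scales and 3 ratios by minimizing 1 - mean best IoU
# between annotated box shapes and the generated anchor shapes.
import numpy as np
from scipy.optimize import differential_evolution

def best_iou(box_wh, anchor_wh):
    """Best IoU per box, with boxes and anchors centered at the origin."""
    inter = np.minimum(box_wh[:, None, :], anchor_wh[None, :, :]).prod(-1)
    union = box_wh.prod(-1)[:, None] + anchor_wh.prod(-1)[None, :] - inter
    return (inter / union).max(axis=1)

def objective(params, box_wh, base_sizes=(32, 64, 128, 256, 512)):
    scales, ratios = params[:3], params[3:]
    anchors = []
    for size in base_sizes:                 # pyramid levels P3..P7
        for s in scales:
            for r in ratios:                # here ratio = height / width
                w = size * s / np.sqrt(r)
                anchors.append((w, w * r))
    return 1.0 - best_iou(box_wh, np.array(anchors)).mean()

# box_wh: (N, 2) array of annotated keratin-pearl widths and heights.
box_wh = np.loadtxt("box_sizes.txt")        # illustrative input file
result = differential_evolution(
    objective, bounds=[(0.5, 2.0)] * 3 + [(0.3, 3.0)] * 3, args=(box_wh,))
print("scales:", result.x[:3], "ratios:", result.x[3:])
```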
4 Results and discussion
In this section, we present the outcomes of our experiments, as well as some limitations of our system and potential solutions to overcome them. We describe the performance metrics employed in this study in Section 4.1. Experimental results are shown and comparative analysis with state-of-the-art methods is provided in Section 4.2. Finally, we discuss the overall findings of our study and its limitations in Section 4.3.
4.1 Performance metrics
Standardized evaluation metrics should be used to measure the overall performance of an object detection model; [26] provides a detailed discussion of such metrics. We established an IoU threshold of 0.5 for the Precision (PR), Recall (RE), Miss Rate (MR), False Positives Per Image (FPPI) and mAP metrics. The model is tested on a dataset containing 87 keratin pearl instances, and the evaluation is based on the number of True Positive (TP), False Positive (FP) and False Negative (FN) detections. A TP is a correctly detected keratin pearl with the correct label, an FP is a detected object that is not a keratin pearl and an FN is a keratin pearl that the model failed to detect. As the dataset does not contain any negative instances, there is no true negative class to consider. PR, RE and the F1-score are crucial metrics in the medical field: PR measures the ratio of correct predictions to all detections, RE measures the ratio of correct predictions to all possible detections and the F1-score reflects the balance between PR and RE. Lower values of MR and FPPI indicate better performance. The TP, FP and FN values of our proposed model and the state-of-the-art models are illustrated in Fig. 7.
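A minimal sketch of how PR, RE and F1 follow from greedy IoU matching at the 0.5 threshold is given below; this is one possible matching scheme, not necessarily the evaluation code used here.

```python
# Sketch: greedy TP/FP/FN counting at IoU >= 0.5, then PR, RE and F1.
# MR = 1 - RE and FPPI = total FP / number of images follow directly.
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def score(detections, ground_truths, thr=0.5):
    tp, matched = 0, set()
    for d in detections:
        hit = next((i for i, g in enumerate(ground_truths)
                    if i not in matched and iou(d, g) >= thr), None)
        if hit is not None:
            tp += 1
            matched.add(hit)
    fp, fn = len(detections) - tp, len(ground_truths) - tp
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1
```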
4.2 Results
In this experiment, we validated our proposed model and conducted a comparative analysis with default object detection networks, namely YOLOv3, SSD, YOLOv5-S, YOLOv5-M, YOLOv5-L, YOLOX and RetinaNet. We also compared the results with the traditional image processing approach proposed by Das et al. [3]. The qualitative analysis is presented in Section 4.2.1 and the quantitative analysis in Section 4.2.2.
4.2.1 Qualitative analysis
Figure 8 shows keratin pearl detection on test images with the default RetinaNet and the proposed model, respectively. The yellow circles in Fig. 8 mark keratin pearls detected by the proposed model but missed by the default RetinaNet. The proposed model detected small keratin pearls and also enhanced the confidence of the detected objects. We also tested on normal tissue images in which no keratin pearls are present; the model did not detect any other object as a keratin pearl, hence the images shown in Fig. 13 (d7) and (e7) have no bounding boxes. Although the RetinaNet model is trained on 40x images, it is also able to detect keratin pearls in images captured at 100x magnification, as shown in Fig. 9. When the same area is visualized at higher magnification, the keratin pearl appears larger; hence a model trained on low-magnification images can also detect keratin pearls at higher magnification. We used the TensorBoard visualization toolkit to evaluate model performance. The mAP and loss curves are depicted in Figs. 10 and 11; the mAP curve demonstrates that the proposed model attains an mAP score exceeding 90% by 35 epochs.
We performed a comparative analysis of the proposed model with state-of-the-art models, with results presented in Figs. 12 and 13. The state-of-the-art models are YOLOv3, YOLOv5 (small (S), medium (M), medium with layers 1 to 15 frozen (M frozen) and large (L)) and YOLOX. Our findings show that, unlike all other models, YOLOv3 did not perform well in detecting keratin pearls. YOLOv5 outperformed YOLOv3 by detecting more keratin pearls, but the proposed model detected more keratin pearls than YOLOv5. YOLOX detected almost all keratin pearls, similar to our proposed model, but it also detected eosinophilic cytoplasm and blood vessels as keratin pearls, which is a drawback. The main issue with the existing models is that they tend to predict blood vessels as keratin pearls in normal images, which is not observed with our proposed model. In Figs. 12 and 13, missed keratin pearl detections are marked in black, false positive detections in yellow boxes and improper predictions in green boxes.
Data augmentation applies a series of image transformations to the original images, creating variations that yield new images different from the originals. We applied an on-the-fly data augmentation technique that randomly applies transformations to the input images, such as rotation, translation, shearing, scaling and flipping, and also varies the image brightness, contrast, hue and saturation. This allows the model to learn from a larger and more diverse set of training images, which improves performance and generalization ability and helps avoid overfitting; a sketch of such a pipeline is given below. The proposed model with augmentation demonstrates an improvement in mAP compared to the default RetinaNet model with augmentation, as depicted in Fig. 10.
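The sketch below uses the albumentations library, one common choice for box-aware augmentation; the paper does not name its augmentation library, and all parameter ranges and the class label are illustrative.

```python
# Sketch of an on-the-fly, bounding-box-aware augmentation pipeline covering
# the transformations listed above. Ranges are illustrative assumptions.
import albumentations as A

augment = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1,
                           rotate_limit=15, p=0.5),       # translate/scale/rotate
        A.Affine(shear=(-10, 10), p=0.3),                 # shearing
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.1,
                                   contrast_limit=0.1, p=0.5),
        A.HueSaturationValue(hue_shift_limit=5, sat_shift_limit=10, p=0.3),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

# Applied per sample during training so every epoch sees new variants:
# out = augment(image=img, bboxes=boxes, labels=["keratin_pearl"] * len(boxes))
```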
Finally, we compared our results with the traditional segmentation method proposed by Das et al. [3]; the segmentation result is shown in Fig. 14. We chose this method for comparison because very few articles have been reported on keratin pearl detection. Figure 14(b) shows the ground-truth bounding boxes in black; the red regions show the segmented output and the arrows indicate false positive regions. Comparing the results visually, the approach of Das et al. fails to detect all keratin pearls, does not segment small keratin pearls and also picks up some regions that are not keratin pearls. In contrast, the proposed model detected keratin pearls more accurately and with good confidence scores, as shown in Fig. 8. Moreover, for SCC diagnosis it is essential to detect keratin pearls rather than segment them: an image is classified as well-differentiated SCC if a large number of keratin pearls and malignant cells are present, and failing to detect all keratin pearls in an image may lead to incorrect grading of SCC. Hence, our detection approach would be more feasible than traditional approaches even for SCC grading.
4.2.2 Quantitative analysis
Table 2 reports the results of RetinaNet with different backbone networks at various IoU thresholds, together with the respective mAP values. Among these networks, ResNet-152 achieved the highest mAP. To enhance the model's performance, an attention module is introduced and the hyperparameters are fine-tuned. As a result, the proposed model's mAP improved by approximately 4%, and it detects more keratin pearls per image with higher confidence scores, as shown in Fig. 13. In the original RetinaNet model, the backbone network uses six blocks: four ResNet blocks plus two more blocks (C6 and C7) introduced by applying standard convolution operations to the C5 block, the last block of ResNet. In this study, we replaced the standard convolution operations in the C6 and C7 blocks with depth-wise separable convolution in order to minimize the number of model parameters. Table 3 shows the outcomes after this replacement: it neither degraded the detection performance nor increased the computational cost.
We conducted a comparative analysis of the proposed model with and without data augmentation and found that the proposed model with augmentation outperformed the default RetinaNet model. When the RetinaNet model is trained with augmentation, it achieves an mAP of 88.45 ± 0.06%, whereas the proposed model, which incorporates depth-wise separable convolution and the same augmentation, achieves an mAP of 92.63 ± 0.05%, an improvement of around 4% over the augmented baseline. The results of this experiment are presented in Table 3.
We considered various YOLO-family models with a transfer learning approach, namely YOLOv3, YOLOv5-S, YOLOv5-M, YOLOv5-M with the first 15 layers frozen, YOLOv5-L and YOLOX, using the default settings with 50 epochs and on-the-fly data augmentation. Among these, YOLOv5-L achieved a slightly better mAP than the proposed model; however, the proposed model reached an mAP above 90% by 35 epochs using early stopping. The training time for YOLOv5, YOLOX and RetinaNet is around half an hour, while YOLOv3 took several hours to train on our dataset. We also implemented the Single Shot MultiBox Detector (SSD) model, but the results were not satisfactory. Table 4 presents the mAP of the various models at IoU 0.5 with their default feature extractors.
Further, we evaluated the models on a test dataset containing 87 keratin pearl instances; the resulting performance metrics are tabulated in Table 5. Compared with YOLOv5 and YOLOX, YOLOv3 performed poorly, with fewer TP and more FN cases. YOLOv5-M reported a higher PR than YOLOX, but the proposed model achieved a much better PR than all other models. The MR of the proposed model and YOLOX is nearly identical, but the proposed model's FPPI is lower than that of the other models. Although YOLOX achieved slightly better recall than the proposed model with fewer FN cases, the F1-score, which balances precision and recall, is higher for the proposed model than for all the compared models.
When evaluating the performance of the different models for keratin pearl detection, we also compared their inference speed. Among the models tested, YOLOX had the fastest inference speed and outperformed all other models in this regard. However, in the context of detecting keratin pearls in static images, the performance metrics listed in Table 5 matter more than speed. We therefore gave these metrics priority over speed during the experiments and found that the proposed technique for identifying keratin pearls outperformed the state-of-the-art methods.
We implemented depth-wise separable convolution instead of standard convolution in the feature extraction layers. Although there is a small difference in mAP, it did not affect the detection accuracy. We also listed the number of parameters used in each experiment; from Table 3, it is clear that the model parameters were reduced by 6.2M after using the depth-wise separable convolution network.
We also developed two additional models with varied subnetwork filter sizes to show the importance of filter size in detection; Table 6 presents the results of these three models. With a small kernel size, the model learns more complex and non-linear features than with a larger kernel size. Hence, in this study the 3x3 kernel provided better results than the 5x5 and 7x7 kernels.
4.3 Discussion
Our primary objective is to introduce object detection techniques for microscopic feature detection, since detecting features is sometimes more important than segmenting them: the pixel-based annotation required for segmentation is a time-consuming and tedious task, whereas the bounding boxes used by object detection are easier and faster to annotate. We proposed an object detection model to detect one of the crucial microscopic features of SCC, namely keratin pearls. The proposed model is evaluated with various metrics such as MR, FPPI, PR, RE, F1-score and inference time, and the results are compared with existing state-of-the-art object detection models implemented on the same dataset. The results indicate that the proposed model can effectively detect keratin pearls. The model was also tested on negative images that do not contain any keratin pearls and did not detect any object, whereas all the state-of-the-art methods detected blood vessels as keratin pearls, which count as false positive cases.
We conducted additional experiments to test the robustness of our model on 100x magnification images and compared the results with other methods, as depicted in Fig. 15. One limitation of this study is that our model detected some blood vessels in the 100x magnification images; however, the number of blood vessels detected by our model is relatively low, whereas all other methods detected almost all blood vessels as keratin pearls. This ambiguity could be reduced by incorporating negative cases during training.
The proposed model is trained on images captured with a specific microscope and camera setup, namely the Olympus CX31 microscope with an attached DP21 camera, referred to as M1. During testing, however, images from two different setups are used: M1 and an Olympus BX51 microscope with an attached DP80 camera, referred to as M2. The study revealed higher FP and FN rates in images captured using M2 compared to M1. This could be due to differences in resolution settings and lighting conditions between the two microscopes, which can affect the quality and clarity of the images.
In the future, there is an opportunity to enhance the mean average precision (mAP) of the model by adding more images to the training set. These images could be captured using different microscopes and magnification levels to increase the robustness of the model. Additionally, further investigation could be done to determine how different microscopes and magnification levels impact the performance of the model. Secondly, we have identified a significant characteristic of SCC. However, it is also essential to identify other microscopic features to grade SCC into different classes.
Further, this study attempted to improve the results by changing the filter size, but there is no noticeable improvement. In the future, we plan to conduct additional experiments by modifying other parameters and loss functions.
5 Conclusion
SCC is a prevalent cancer type that can develop in various organs of the human body. Histopathology is commonly used to diagnose cancer, and previous studies have focused on classifying SCC against other types of carcinoma. In contrast, this study utilized an object detection model to identify one of the critical features of SCC. Our objective is to demonstrate the effectiveness of object detection methods in the medical field. Prior research has typically employed traditional machine learning or pixel-based CNN models to identify different histopathological features of carcinoma. We instead introduced an object detection model to detect keratin pearls, a hallmark of SCC diagnosis, in histopathological images. We initially used a RetinaNet model to detect this feature and further improved its performance by adding an attention module. We also compared our results with state-of-the-art methods and achieved a nearly 4% improvement over the default RetinaNet model, with minimal false positive instances compared to existing methods.
Data Availability
The data that support the findings of this study are not openly available due to reasons of sensitivity and are available from the corresponding author upon reasonable request.
References
Cho Y, Lee SM, Cho Y-H, Lee J-G, Park B, Lee G, Kim N, Seo JB (2021) Deep chest x-ray: detection and classification of lesions based on deep convolutional neural networks. Int J Imaging Syst Technol 31(1):72–81
Cui Y, Oztan B (2019) Automated firearms detection in cargo x-ray images using retinanet. Anomaly Detection and Imaging with X-Rays (ADIX) IV 10999, 109990
Das DK, Chakraborty C, Sawaimoon S, Maiti AK, Chatterjee S (2015) Automated identification of keratinization and keratin pearl area from in situ oral histological images. Tissue Cell 47(4):349–358
Das DK, Mitra P, Chakraborty C, Chatterjee S, Maiti AK, Bose S (2017) Computational approach for mitotic cell detection and its application in oral squamous cell carcinoma. Multidimension Syst Signal Process 28(3):1031–1050
Das DK, Bose S, Maiti AK, Mitra B, Mukherjee G, Dutta PK (2018) Automatic identification of clinically relevant regions from oral tissue histological images for oral squamous cell carcinoma diagnosis. Tissue Cell 53:111–119
Das DK, Koley S, Bose S, Maiti AK, Mitra B, Mukherjee G, Dutta PK (2019) Computer aided tool for automatic detection and delineation of nucleus from oral histopathology images for oscc screening. Appl Soft Comput 83:105642
Das N, Hussain E, Mahanta LB (2020) Automated classification of cells into multiple classes in epithelial tissue of oral squamous cell carcinoma using transfer learning and convolutional neural network. Neural Netw 128:47–60
Duragkar A, Guhe S, Sortee A, Singh S, Chandankhede C (2022) Comparison between yolov5 and ssd for pavement crack detection, pp 257–263. Springer
Girshick R (2015) Fast r-cnn. Proceedings of the IEEE international conference on computer vision 1440–1448
Halicek M, Shahedi M, Little JV, Chen AY, Myers LL, Sumer BD, Fei B (2019) Head and neck cancer detection in digitized whole-slide histology using convolutional neural networks. Sci Rep 9(1):1–11
Harsono IW, Liawatimena S, Cenggoro TW (2020) Lung nodule detection and classification from thorax ct-scan using retinanet with transfer learning. J King Saud Univ - Comput Inf
Hassan L, Saleh A, Abdel-Nasser M, Omer OA, Puig D (2021) Promising deep semantic nuclei segmentation models for multi-institutional histopathology images of different organs. Int J Interact Multimed 6(6)
Hiremath P, Iranna YH (2006) Automated cell nuclei segmentation and classification of squamous cell carcinoma from microscopic images of esophagus tissue. In: 2006 International Conference on Advanced Computing and Communications pp 211–216. IEEE
Hoorali F, Khosravi H, Moradi B (2023) An automatic method for microscopic diagnosis of diseases based on urcnn. Biomed Signal Process Control 80:104240
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7132–7141
Hwang JH, Lim M, Han G, Park H, Kim YB, Park J, Jun SY, Lee J, Cho JW (2023) A comparative study on the implementation of deep learning algorithms for detection of hepatic necrosis in toxicity studies. Toxicol Res 1–10
Jung H, Kim B, Lee I, Yoo M, Lee J, Ham S, Woo O, Kang J (2018) Detection of masses in mammograms using a one-stage object detector based on a deep convolutional neural network. PLoS ONE 13(9):0203355
Jung H, Lodhi B, Kang J (2019) An automatic nuclei segmentation method based on deep convolutional neural networks for histopathology images. BMC Biomedical Engineering 1(1):1–12
Khan HU, Raza B, Shah MH, Usama SM, Tiwari P, Band SS (2023) Smdetector: Small mitotic detector in histopathology images using faster r-cnn with dilated convolutions in backbone model. Biomed Signal Process Control 81:104414
Kubera E, Kubik-Komar A, Kurasiński P, Piotrowska-Weryszko K, Skrzypiec M (2022) Detection and recognition of pollen grains in multilabel microscopic images. Sensors 22(7):2690
Lal S, Das D, Alabhya K, Kanfade A, Kumar A, Kini J (2021) NucleiSegnet: robust deep learning architecture for the nuclei segmentation of liver cancer histopathology images. Comput Biol Med 128:104075
Li M, Ma X, Chen C, Yuan Y, Zhang S, Yan Z, Chen C, Chen F, Bai Y, Zhou P et al (2021) Research on the auxiliary classification and diagnosis of lung cancer subtypes based on histopathological images. IEEE Access 9:53687–53707
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision 2980–2988
Liu R, Jiang D, Zhang L, Zhang Z (2020) Deep depthwise separable convolutional network for change detection in optical aerial images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13:1109–1118
Montalbo FJP (2020) A computer-aided diagnosis of brain tumors using a fine-tuned yolo-based model with transfer learning. KSII Transactions on Internet and Information Systems (TIIS) 14(12):4816–4834
Nakasi R, Mwebaze E, Zawedde A, Tusubira J, Akera B, Maiga G (2020) A new approach for microscopic diagnosis of malaria parasites in thick blood smears using pre-trained deep learning models. SN Applied Sciences 2(7):1–7
Nawandhar A, Kumar N, Veena R, Yamujala L (2020) Stratified squamous epithelial biopsy image classifier using machine learning and neighborhood feature selection. Biomed Signal Process Control 55:101671
Nawandhar A, Kumar N, Yamujala L (2021) GPU accelerated stratified squamous epithelium biopsy image segmentation for oscc detector and classifier. Biomed Signal Process Control 64:102258
Noroozi N, Zakerolhosseini A (2016) Differential diagnosis of squamous cell carcinoma in situ using skin histopathological images. Comput Biol Med 70:23–39
Pan X, Li L, Yang D, He Y, Liu Z, Yang H (2019) An accurate nuclei segmentation algorithm in pathological image based on deep semantic network. IEEE Access 7:110674–110686
Rahman T, Mahanta L, Chakraborty C, Das A, Sarma J (2018) Textural pattern classification for oral squamous cell carcinoma. J Microsc 269(1):85–93
Rahman TY, Mahanta LB, Das AK, Sarma JD (2020) Automated oral squamous cell carcinoma identification using shape, texture and color features of whole image strips. Tissue Cell 63:101322
Ramachandran P, Zoph B, Le QV (2017) Searching for activation functions. ArXiv Preprint ArXiv:1710.05941
Sebai M, Wang X, Wang T (2020) MaskMitosis: a deep learning framework for fully supervised, weakly supervised, and unsupervised mitosis detection in histopathology images. Medical & Biological Engineering & Computing 58:1603–1623
Shu JH, Nian FD, Yu MH, Li X (2020) An improved mask r-cnn model for multiorgan segmentation. Math Probl Eng 2020
Storn R, Price K (1997) Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. J Global Optim 11(4):341–359
Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F (2021) Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 71(3):209–249
Tan L, Huangfu T, Wu L, Chen W (2021) Comparison of retinanet, ssd, and yolo v3 for real-time pill identification. BMC Med Inform Decis Mak 21:1–11
Tzutalin: tzutalin/labelImg: LabelImg is a graphical image annotation tool to label object bounding boxes in images. https://github.com/tzutalin/labelImg (Accessed on 07/27/2021)
Umer J, Irtaza A, Nida N (2020) MACCAI LITS17 liver tumor segmentation using retinanet. 2020 IEEE 23rd International Multitopic Conference (INMIC) 1–5
Venugopal A, Nair LS (2022) Two-phase mitotic detection using deep learning techniques, pp. 479–489. Springer
Wang C-W, Yu C-P (2013) Automated morphological classification of lung cancer subtypes using h &e tissue images. Mach Vis Appl 24(7):1383–1391
Wang Q, Bi S, Sun M, Wang Y, Wang D, Yang S (2019) Deep learning approach to peripheral leukocyte recognition. PLoS ONE 14(6):0218808
Wu M, Yan C, Liu H, Liu Q, Yin Y (2018) Automatic classification of cervical cancer from cytological images by using convolutional neural network. Biosci Rep 38(6):20181769
Zhang X, Xing F, Su H, Yang L, Zhang S (2015) High-throughput histopathological image analysis via robust cell segmentation and hashing. Med Image Anal 26(1):306–315
Zhang S, Chen C, Chen C, Chen F, Li M, Yang B, Yan Z, Lv X (2021) Research on application of classification model based on stack generalization in staging of cervical tissue pathological images. IEEE Access 9:48980–48991
Funding
Open access funding provided by Manipal Academy of Higher Education, Manipal.
Ethics declarations
Ethics approval
Histopathological images were collected from the department of pathology, Kasturba Medical College (KMC), Manipal. Ethical approval for this study was obtained from Institutional Ethics Committee (IEC) and IEC Project No. 17/2020.
Conflict of interest/Competing interests
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Prabhu, S., Prasad, K., Lu, X. et al. Single-stage object detector with attention mechanism for squamous cell carcinoma feature detection using histopathological images. Multimed Tools Appl 83, 27193–27215 (2024). https://doi.org/10.1007/s11042-023-16372-z