1. Introduction
Online product reviews have grown in importance alongside e-commerce itself: retail revenues are projected to reach USD 4.88 trillion globally in the coming years due to the rise of internet shopping and mobile applications [
1]. In recent years, e-commerce has increasingly replaced traditional face-to-face interactions, with digital interfaces as the primary mode of customer engagement [
2,
3]. In this context, consumers must carefully evaluate online product offerings by cross-referencing the provided specifications with their requirements. A pivotal element of this decision-making process is analyzing consumer reviews, which are critical sources of information on product quality and performance. These reviews, often substantiated by detailed written assessments, offer valuable insights grounded in personal experience, revealing user opinions, identifying potential shortcomings, and providing recommendations crucial for informing prospective buyers [
4,
5,
6]. A study highlights that the vast amounts of data generated from user reviews are crucial for predicting consumer preferences and trends [
7]. This information can help companies refine their marketing strategies and enhance their products. In addition, user reviews are crucial in influencing purchasing decisions. A study indicates that many customers change their buying decisions based on positive or negative feedback [
8]. A recent public survey indicates that 92% of consumers actively use online reviews as a reference when shopping, and 90% of these respondents state that favorable reviews significantly increase their likelihood of considering a product in their purchasing decisions [
9]. However, the authenticity of these reviews can often be questionable, leading to concerns about deceptive information and unfavorable online purchasing encounters [
10,
11].
This wealth of information plays a dual role in supporting consumers and e-commerce platforms. For consumers, reviews are vital for making well-informed purchasing decisions, enhancing their trust in the platform, and fostering brand loyalty [
12]. From the e-commerce website’s perspective, developers gain valuable feedback that enables them to focus on core customer issues, ultimately improving service quality and increasing overall customer satisfaction [
13,
14,
15].
Customer ratings and reviews are pivotal in e-commerce, serving as trusted sources of information that directly influence purchasing decisions and consumer trust. These elements convey vital information that helps customers make more informed decisions before purchasing. In addition to textual reviews, customers often provide star ratings and numerical values summarizing their general opinion of the product. Typically, these ratings are on a scale from one to five, where one or two stars indicate poor performance, three stars represent a neutral stance, and four or five stars express a positive experience with the product [
16,
17,
18]. Online reviews and ratings are critical factors that affect consumers’ willingness to purchase products. This emphasizes that consumers increasingly rely on reading and providing product reviews and ratings to evaluate potential purchases and communicate with retailers and other consumers. However, reviews without ratings can complicate the assessment of a product’s appeal, especially when the reviews are vague or overly simplistic [
19]. Discrepancies between consumer reviews and their corresponding ratings present a significant challenge in e-commerce. In some instances, reviews are artificially created, either by paying individuals to produce favorable content for authentic products or by using text-generation algorithms to generate fake reviews [
20,
21]. This often leads to inconsistencies where a customer’s written review does not align with the star rating they provide. For example, a review might be highly critical, yet the customer gives the product a four- or five-star rating, or conversely, a glowing review might be paired with a one- or two-star rating. Competitors might exploit this inconsistency by flooding the market with negative reviews about rival products, thereby manipulating online platforms’ ranking algorithms to lower the visibility of the targeted company [
22].
Addressing these inconsistencies is a priority for many e-commerce platforms, as they can make it difficult for customers to discern helpful information from the plethora of available reviews [
23]. To combat this issue, there is a need for a model that can evaluate reviews based on their sentiment polarity—positive or negative—and compare the sentiment-derived rating with the actual customer rating. This comparison would reveal whether the review and rating are consistent and reliable or if the inconsistency suggests that the review should not be trusted when making purchasing decisions [
4,
24,
25]. Organizations also face challenges in managing large volumes of consumer reviews. They need to gather feedback, forecast sales trends, and manage their reputation, which can be overwhelming without proper tools [
26]. Therefore, sentiment analysis (SA) plays a crucial role in e-commerce by enabling businesses to understand customer opinions and enhance decision-making processes. Using text-mining techniques, businesses can create sentiment dictionaries that help identify emotional tendencies in user comments, particularly in live e-commerce settings [
27]. A recent study emphasizes the growing necessity of SA for businesses to extract insights from consumer reviews about their products [
28]. This study shows that product reviews are a significant source of customer feedback, and analyzing the sentiments within these reviews can provide businesses with insights into product quality, customer preferences, and areas needing improvement. SA, or opinion mining, is a sub-field of natural language processing (NLP) that focuses on identifying and extracting subjective information from text data. It involves classifying text into various sentiment categories, such as positive, negative, or neutral, to understand the emotions and opinions expressed by individuals [
29]. This technique has become increasingly important in various applications, including customer feedback analysis, social media monitoring, and market research [
30]. Businesses and organizations can gain valuable insights into consumer opinions, preferences, and satisfaction levels by analyzing sentiments expressed in textual data [
31,
32]. However, the complexity of human language, with its nuances, contextual variations, and subtleties like sarcasm and irony, makes SA a challenging task, particularly when dealing with large-scale and diverse datasets like those found in consumer reviews on e-commerce platforms [
33,
34,
35].
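As a minimal sketch of the comparison described above, the following Python snippet maps a star rating to a polarity label using the one-to-five scale given earlier (one or two stars negative, three neutral, four or five positive) and flags reviews whose predicted sentiment disagrees with the rating. The function names and the sample pairs are hypothetical, and the predicted labels are assumed to come from some sentiment classifier; this is an illustration, not the implementation used in this paper.

```python
from typing import List, Tuple


def rating_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a polarity label (1-2 negative, 3 neutral, 4-5 positive)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating must be in 1..5, got {stars}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def is_consistent(predicted_sentiment: str, stars: int) -> bool:
    """True when the classifier's label agrees with the label implied by the rating."""
    return predicted_sentiment == rating_to_sentiment(stars)


def consistency_rate(pairs: List[Tuple[str, int]]) -> float:
    """Fraction of (predicted_sentiment, stars) pairs that agree; 0.0 for an empty list."""
    if not pairs:
        return 0.0
    return sum(is_consistent(label, stars) for label, stars in pairs) / len(pairs)


# Hypothetical (predicted sentiment, star rating) pairs for illustration only.
reviews = [("negative", 5), ("positive", 5), ("neutral", 3), ("positive", 1)]
flagged = [pair for pair in reviews if not is_consistent(*pair)]
print(flagged)
print(consistency_rate(reviews))
```

Reviews that end up in `flagged` are the ones whose text and rating disagree, which is the signal a platform could use to down-weight or further inspect a review.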
While traditional SA models have demonstrated utility, they exhibit significant limitations in addressing complex classification tasks, particularly in multi-class SA. These models frequently struggle with class imbalance, where specific sentiment categories are disproportionately represented, leading to biased outcomes [
34]. Addressing these limitations necessitates more than conventional methods of review moderation or fundamental text analysis. There is an imperative for advanced SA techniques capable of accurately discerning the underlying sentiment in consumer reviews, even when confronted with complex linguistic features such as sarcasm, irony, or nuanced expression, which can distort the results of traditional methods [
30,
35]. By precisely classifying sentiment, these advanced techniques can identify discrepancies between the textual content of reviews and their associated star ratings, thereby enhancing the reliability of the information provided to consumers.
To address these challenges, this paper introduces a novel hybrid model, WDE-CNN-LSTM, designed to enhance sentiment classification in consumer reviews by synergistically integrating the strengths of Word Embeddings (WDE), Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs). This model aims to mitigate the limitations associated with standalone deep learning (DL) models by incorporating WDE to capture semantic relationships effectively, CNNs to identify local textual patterns, and LSTMs to maintain temporal dependencies within the data [
36]. The research builds on these established DL models to capture complex patterns in the data and improve sentiment classification accuracy. The proposed hybrid architecture facilitates more effective feature extraction from textual inputs and enables a more rigorous evaluation of consistency between the extracted sentiment and the accompanying star ratings. The model enhances classification accuracy across binary, three-class, and five-class SA tasks while ensuring the alignment between the sentiment expressed in the review and the assigned ratings [
37].
1.1. Contributions
The main contributions of this study are as follows:
The proposal of a hybrid DL model, WDE-CNN-LSTM, developed specifically to improve sentiment classification in consumer reviews by incorporating WDE, CNNs, and LSTM networks.
An exhaustive evaluation of the WDE-CNN-LSTM model’s performance across multiple sentiment classification tasks (SCTs), including binary, three-class, and five-class scenarios, employing a diverse dataset of consumer reviews from e-commerce platforms.
The implementation of advanced data preprocessing methods, such as tokenization, padding, and the handling of class imbalances, to transform unstructured customer review data into a form appropriate for DL models.
The introduction of a consistency-driven approach to SA, focusing on aligning sentiment predictions with actual customer ratings to detect inconsistencies and enhance the trustworthiness of sentiment classification.
A comparative investigation of the proposed WDE-CNN-LSTM model against standalone DL techniques, such as CNNs, LSTMs, and WDE-LSTM, showing that it achieves the best precision, recall, F1-score, and overall accuracy.
The execution of comprehensive experiments to evaluate the effectiveness and robustness of the hybrid model, emphasizing its capability to address complex SCTs and produce accurate and consistent sentiment predictions.
A presentation of the suggested model’s practical applications for e-commerce platforms, improving automated customer sentiment interpretation and informing better business decisions by leveraging accurate sentiment classification and consistency analysis.
These contributions reflect the study’s innovative aspects, particularly developing and applying a hybrid deep learning model to improve sentiment analysis in consumer reviews.
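The preprocessing steps named in the contributions (tokenization, padding, and class-imbalance handling) can be sketched in plain Python as follows. This is a hedged illustration only: the helper names, the regular expression, the reserved indices (0 for padding, 1 for out-of-vocabulary words), the choice of post-padding, and the inverse-frequency class weights are all assumptions made for the example and are not taken from this paper.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts: List[str], max_words: int = 10000) -> Dict[str, int]:
    """Assign integer ids by frequency; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word: idx + 2 for idx, (word, _) in enumerate(counts.most_common(max_words))}


def encode_and_pad(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert text to ids, truncate to max_len, and right-pad with zeros (assumed convention)."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))


def class_weights(labels: List[int]) -> Dict[int, float]:
    """Inverse-frequency weights, n_samples / (n_classes * class_count), to offset imbalance."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical toy corpus for illustration only.
corpus = ["Great great great battery battery", "phone"]
vocab = build_vocab(corpus)
print(encode_and_pad("Great phone, terrible screen", vocab, max_len=6))
print(class_weights([1, 1, 1, 0]))
```

The resulting fixed-length integer sequences are the kind of input an embedding layer expects, and the weights give rarer classes a proportionally larger contribution to the training loss.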
1.2. Structure
The structure of this paper is methodically organized as follows:
Section 2 provides a comprehensive review of the relevant literature in sentiment classification and customer review analysis.
Section 3 delves into the intricacies of the proposed DL models, offering a detailed exposition of their components and methodology.
Section 4 systematically outlines the experimental results obtained from implementing various models, including CNN, LSTM, WDE-LSTM, WDE-CNN-LSTM, and the statistical consistency analysis.
Section 5 presents practical implications and recommendations for utilizing the proposed models in real-world applications.
Finally, Section 6 articulates the concluding remarks and discusses potential avenues for future research in enhancing sentiment classification models.
2. Literature Review
This section provides an extensive overview of current studies on SA methods applied to various domains, including healthcare, e-commerce, online services, and tourism. The studies reviewed use a range of machine learning (ML) and DL methods to classify customer sentiments expressed in online reviews, emphasizing the effectiveness and evolution of different techniques in addressing SCTs. From standard ML algorithms like Support Vector Machines (SVM) and Naive Bayes (NB) to advanced DL techniques such as LSTM and Bidirectional Encoder Representations from Transformers (BERT), these studies show the growing sophistication and accuracy of sentiment detection techniques. The reviewed research also recognizes common challenges, such as high computational requirements, data imbalance, and difficulty interpreting complex sentiments like mixed emotions and sarcasm. Furthermore, the studies highlight the demand for more advanced methods, such as hybrid techniques and explainable artificial intelligence (XAI), to improve SA interpretability and performance. Future research directions include optimizing existing techniques for efficiency, expanding their applicability to diverse datasets, and integrating multimodal data to capture a more comprehensive understanding of customer feedback and sentiment.
Hossain et al. [
38] used ML techniques such as SVM, NB, Decision Trees, and Random Forests (RFs) to classify customer sentiments from insurance product reviews into negative, positive, and neutral categories. The dataset contained online consumer reviews of insurance products, which were preprocessed and labeled for SA. The study’s approach effectively handled large datasets and provided accurate sentiment predictions, aiding insurance companies in understanding customer feedback. However, limitations included potential dataset bias due to class imbalance and challenges in analyzing complex sentiments like sarcasm. The authors suggest that future research incorporate advanced DL techniques, such as LSTM and CNN, to improve sentiment detection and extend the analysis to a broader range of insurance products.
Kaur et al. [
39] suggested a DL-based model utilizing a mixed feature extraction approach for consumer SA. The study utilized a combination of WDE, LSTM, and CNN architectures to enhance sentiment classification accuracy in consumer reviews. The dataset incorporated consumer reviews gathered from different e-commerce platforms, which underwent preprocessing to ensure quality input for the model. This mixed model captured local textual patterns and temporal dependencies, achieving high accuracy across binary, three-class, and five-class SCTs. Despite its significance, their proposed model faced restrictions such as high computational complexity and requiring extensive training datasets. Further research is recommended to optimize the model’s computational efficiency and expand its application to more diverse datasets and domains.
Dieksona et al. [
40] conducted an SA study on consumer reviews specifically for Traveloka, a popular travel and accommodation booking platform. They utilized different ML algorithms, including NB, SVM, and RF, to classify customer sentiments into negative and positive classes. The dataset employed in this research consisted of consumer reviews from the Traveloka platform, which were preprocessed to enhance model accuracy. The research showed the efficacy of these algorithms in sentiment classification, with RF achieving the highest accuracy among the tested techniques. However, they noted limitations regarding handling neutral sentiments and the demand for more diverse data to improve model robustness. The forthcoming work suggests expanding the analysis to cover more complex sentiment categories and further exploring DL techniques to improve sentiment detection accuracy.
Huang et al. [
28] studied current methods and future directions for SA on e-commerce platforms. They analyzed different methods, including ML and DL techniques, highlighting their strengths and limitations in handling diverse datasets and sentiment complexities. The review covered a range of datasets typically used in e-commerce SA, such as product reviews and customer feedback, emphasizing the need for high-quality preprocessing and feature extraction to enhance model performance. The advantages of the reviewed techniques include their ability to provide valuable insights into customer behavior and preferences, helping businesses in decision-making procedures. However, they determined several limitations, such as difficulty analyzing complex sentiments like sarcasm and the high computational cost associated with DL techniques. The future directions suggested include developing more efficient algorithms, integrating multi-modal data, and improving the interpretability of SA techniques.
Using a language representation model, Patel et al. [
41] accomplished an SA of customer feedback and reviews for airline services. The research employed advanced NLP methods, including the BERT model, to classify sentiments expressed in consumer reviews into negative, positive, and neutral categories. The dataset comprised consumer reviews of airline services gathered from multiple online platforms and was preprocessed to improve the model’s performance. The outcomes showed that the BERT model surpassed standard ML algorithms in accurately detecting sentiments, particularly in understanding context and handling complex expressions. However, the research noted challenges such as the need for extensive data for model training and high computational requirements. Further research is recommended to optimize the model for lower computational costs and examine its applicability to other fields, such as hotel and tourism services.
Wang et al. [
42] explored the use of large language models (LLMs) for SA in e-commerce, focusing on customer feedback. The research utilized advanced LLMs, including GPT-3 and BERT, to analyze and classify sentiments in consumer reviews across multiple e-commerce platforms. The dataset included diverse customer feedback data from several online retail websites, which were preprocessed to improve the techniques’ effectiveness. The results indicated that LLMs excelled in accurately capturing the nuances of customer sentiment, outperforming standard ML techniques, particularly in handling context-rich and complex reviews. Despite these advantages, the research highlighted limitations such as high computational costs and the need for significant computational resources. Future work suggested includes optimizing LLMs for more efficient SA and expanding the scope to include multi-modal data to understand customer emotions and behaviors better.
Suhartono et al. [
43] designed an SA model for drug product reviews using Deep Neural Networks (DNNs) and weighted WDE. The research applied DNNs integrated with techniques such as Word2Vec and GloVe to improve the sentiment classification of consumer reviews into negative, positive, and neutral sentiments. The dataset included drug product reviews collected from different online pharmaceutical platforms, which were preprocessed and transformed into weighted WDE to serve as input for the DL techniques. The proposed model enhanced performance over standard SA approaches, mainly in catching the nuanced sentiments specific to drug reviews. However, the research encountered limitations such as the demand for large datasets for training and the high computational resources required. Future research is recommended to optimize the model’s efficiency and extend its application to other healthcare-related reviews to improve its generalizability.
Puh and Bagić Babac [
44] concentrated on predicting the sentiment and rating of tourist reviews using ML methods. The research used SVM, RFs, and gradient-boosting techniques to classify tourist reviews and predict their associated ratings. The dataset consisted of tourist reviews from various travel websites, which were preprocessed and labeled for SA and rating prediction. The ML methods showed high accuracy in sentiment classification and rating prediction, providing beneficial insights into customer feedback for the tourism sector. However, limitations included the demand for large, diverse datasets to improve model generalization and the difficulty in analyzing the results for more nuanced sentiments, such as mixed or neutral reviews. Future work suggested expanding the dataset to cover a broader range of tourist destinations and incorporate DL methods to improve predictive performance and robustness.
Taherdoost and Madanchian [
45] executed a comprehensive review of AI techniques and their application in SA, particularly in competitive research contexts. The study examined various AI methods, including ML and DL approaches, and their effectiveness in analyzing sentiments across different datasets, such as consumer reviews and social media content. The authors highlighted the advantages of using AI for SA, including enhanced accuracy, scalability, and the ability to handle large and complex datasets. However, the review also pointed out several limitations, such as the challenges in interpreting nuanced sentiments and the high computational costs associated with advanced techniques. Future research directions proposed by the authors include the development of more efficient AI algorithms, integrating real-time data processing, and exploring multi-language SA to broaden the scope and applicability of AI in competitive environments.
Vatambeti et al. [
46] executed a Twitter SA on online food services utilizing an integrated Elephant Herd Optimization (EHO) approach with a hybrid DL approach. The research employed a dataset of tweets related to online food services, which were preprocessed to clear noise and standardize text for analysis. Their suggested approach incorporated EHO with DL methods such as LSTM and CNN to enhance sentiment classification accuracy by optimizing feature selection and model parameters. The outcomes showed that the hybrid approach significantly outperformed standard ML techniques and standalone DL methods, considering accuracy and computational efficiency. However, limits included the model complexity, which may require high computational resources, and difficulty handling sarcastic or highly nuanced sentiment in tweets. Future research should focus on reducing the model’s complexity and analyzing its application in other fields, such as e-commerce or healthcare.
Iqbal et al. [
47] conducted an SA of consumer reviews, utilizing DL methods to enhance sentiment classification accuracy. The research utilized DL techniques, including LSTM, CNN, and BiLSTM, to investigate consumer reviews gathered from e-commerce platforms. The dataset underwent comprehensive preprocessing to guarantee quality inputs for the DL methods, including tokenization and text normalization. The outcomes showed that DL methods, particularly BiLSTM, exceeded standard ML techniques in accurately classifying sentiments into positive, negative, and neutral categories. Despite their significance, they noted challenges such as the high computational cost of training DL methods and the need for large annotated datasets. Further research should explore model optimization strategies to reduce computational overhead and expand the analysis to contain real-time SA for dynamic consumer feedback monitoring.
Adak et al. [
48] conducted a systematic review on the SA of consumer reviews of food delivery services using DL and explainable artificial intelligence (XAI) methods. The research examined DL techniques, including CNN, LSTM, and hybrid architectures, applied to customer feedback data from major food delivery platforms. They emphasized the significance of using XAI to analyze the predictions made by DL techniques, providing insights into model decision-making processes and improving transparency. The study concluded that DL techniques, especially when combined with XAI, show high accuracy in sentiment classification while delivering understandable outputs for end-users. However, the research also presented challenges related to the interpretability of complex techniques and the high computational costs associated with training these techniques. Future research directions include improving DL techniques’ efficiency and interpretability and extending SA to multi-modal data from different customer feedback channels.
Alantari et al. [
49] empirically compared different ML techniques for a text-based SA of online consumer reviews. The research assessed several ML techniques, including NB, SVM, RFs, and Gradient Boosting, and their significance in classifying sentiments of consumer reviews gathered from different online platforms. The dataset contained diverse consumer reviews, which were preprocessed and labeled for SCTs. The outcomes showed that ensemble techniques like RFs and Gradient Boosting generally exceeded more straightforward methods such as NB and SVM, particularly regarding accuracy and robustness to noisy data. However, the research noted limitations, including challenges in analyzing model outputs and the demand for comprehensive hyperparameter tuning. Further research is recommended to analyze the integration of DL techniques and to focus on model interpretability to understand better the factors influencing sentiment classification.
Marlina et al. [
50] conducted an SA on consumer reviews of natural skincare products utilizing various SA techniques. The research employed ML methods, such as NB and SVM, and DL techniques to classify customer sentiments into negative, positive, and neutral categories. The dataset incorporated reviews of natural skincare products gathered from multiple e-commerce platforms, which were preprocessed to improve data quality and model accuracy. The results showed that DL techniques surpassed standard ML approaches in capturing nuanced sentiments specific to skincare products. However, the research identified limitations regarding handling ambiguous or mixed sentiments and the need for extensive computational resources for DL methods. The forthcoming work suggested expanding the analysis to a broader range of skincare products and enhancing model performance by incorporating more advanced DL methods and feature extraction techniques.
Alzahrani et al. [
51] highlighted the evolution of SA techniques, including ML and DL techniques, and the challenges of fake reviews in e-commerce. The procedure involved an exhaustive data preprocessing pipeline, including steps like punctuation removal, lowercase conversion, and part-of-speech tagging to prepare the review texts for analysis. This research employed a CNN-LSTM model for sentiment classification, using WDE and dropout layers to enhance performance and mitigate overfitting. The CNN-LSTM model achieved a high accuracy of 96%, outperforming traditional methods, with detailed evaluation metrics such as precision, recall, and F1-score demonstrating its effectiveness. However, the research primarily used a dataset from the Amazon website, which may not represent the full spectrum of e-commerce reviews. This restriction could affect the generalizability of the results to other platforms or product categories.
Obiedat et al. [
52] suggested a hybrid evolutionary SVM-based approach for SA of consumer reviews, particularly handling the challenge of imbalanced data distribution. The research used a combination of SVM with evolutionary techniques to optimize model parameters and enhance sentiment classification accuracy on datasets with skewed class distributions. The dataset comprised consumer reviews from different e-commerce platforms, which were preprocessed and utilized to train the hybrid model. The outcomes showed that the hybrid SVM-based approach exceeded standard SVM and other ML methods in addressing imbalanced datasets, delivering better precision, recall, and F1-scores. However, the research also stated limitations related to the computational complexity of the evolutionary algorithms and the demand for fine-tuning to achieve optimal results. Future research directions include optimizing the evolutionary components to reduce computational costs and expanding the approach to multi-class SA tasks.
Table 1 summarises the previous related work, clearly comparing the implemented methods, datasets, insights, and limitations. It highlights the evolution of SA methods, with traditional ML models proving effective for smaller datasets and basic sentiment tasks. In contrast, DL methods demonstrate superior performance for complex datasets and nuanced sentiments. Hybrid approaches bridge gaps in traditional and advanced methods, improving performance in class imbalance scenarios. However, the computational demands of DL models remain a significant challenge, necessitating optimization and scalability for broader applications.
The reviewed studies underscore the significant advancements in SA methodologies that are driven by integrating ML and DL techniques. These approaches have demonstrated notable improvements in accurately classifying customer sentiments across various domains, such as insurance, e-commerce, tourism, and healthcare. While traditional ML algorithms like NB and SVM remain valuable for specific applications, the enhanced capabilities of DL techniques, including LSTM, CNN, and BERT, have shown superior performance in capturing complex sentiments and contextual nuances. However, challenges persist, particularly in managing computational demands, addressing class imbalances, and improving model interpretability. The studies highlight the need for continued innovation, recommending the exploration of hybrid techniques, optimization strategies, and the incorporation of XAI to enhance transparency and trust in SA outcomes.
4. Experimental Results and Analysis
This section presents the experimental results for the proposed models, including CNNs, LSTM, WDE-LSTM, and WDE-CNN-LSTM. The models were assessed using training, validation, and testing datasets. The conclusions were drawn based on the average values of the evaluation metrics. The working environment and parameter settings are outlined in Section 4.1 and Section 4.2, respectively, while Section 4.3 details the performance metrics. Section 4.4 offers a comparative analysis of the models, and Section 4.5 presents the results of the statistical consistency analysis. Finally, Section 4.6 compares the proposed models with state-of-the-art techniques.
4.1. Working Environment
All experiments were carried out on the Google Colab platform, leveraging the computational power of an NVIDIA T4 GPU to ensure efficient processing and faster model training. The code environment was configured using Python version 3.10.12, the primary programming language. Keras version 3.3.3 was employed as the high-level neural network API, facilitating the implementation of DL models with simplicity and flexibility. TensorFlow version 2.15.0 provided the underlying framework for building and training the models, offering powerful tools for machine learning, including support for automatic differentiation, model optimization, and GPU acceleration. This setup allowed for the seamless integration and execution of various DL tasks, from model definition to training and evaluation.
4.2. Parameter Settings
Table 3 provides a summary of the hyperparameters utilized to implement the DL models. These parameters were fine-tuned before training to optimize model performance and behavior. The specified values were employed to conduct the experiments in Python as part of the research discussed in this paper.
4.3. Evaluation Metrics
The proposed models’ performance is assessed using various evaluation metrics, including recall, specificity, precision, accuracy, and the F1-score. These metrics are calculated based on common evaluation parameters for predictive models, such as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) [
63,
64].
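Recall [65] is defined as the ratio of TP to the sum of TP and FN, and is mathematically represented by Equation (24):
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{24} \]
Precision [66] is computed by dividing TP by the sum of TP and FP, as expressed in Equation (25):
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{25} \]
Accuracy is determined using Equation (26):
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{26} \]
F1-score [65] is the harmonic mean of precision and recall, calculated as shown in Equation (27):
\[ F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{27} \]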
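For illustration, the metrics defined in this subsection can be computed directly from the four counts. The sketch below is an assumption-laden example rather than the evaluation code used in this paper: the function name and the sample counts are hypothetical, specificity is taken in its standard form TN / (TN + FP), and a zero denominator is mapped to 0.0 by convention.

```python
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute recall, precision, specificity, accuracy, and F1 from confusion counts."""

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is zero (assumed convention).
        return num / den if den else 0.0

    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is a harmonic mean, it stays close to the lower of precision and recall, which is why it is a useful single figure when the two diverge.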
4.4. Comparative Analysis
This paper proposes a system designed to address inconsistencies in customer reviews. For this purpose, different DL models were deployed. The objective is to obtain a DL model with optimal detection performance, accuracy, and consistency. The proposed models are evaluated on binary and multi-class classification cases to validate them on both simple and more demanding tasks. Each case is discussed in detail in the subsections that follow.
4.4.1. Performance Results of the Proposed CNN Model
The evaluation metrics for the CNN model, as shown in
Table 4, demonstrate the model’s effectiveness across the 2, 3, and 5 score classes. For the binary classification (Score 2), the CNN achieved near-perfect precision, recall, and F1-scores for both classes, indicating minimal errors in detection. In the multi-class classification cases (Scores 3 and 5), the model maintained strong performance across all classes, with slightly lower precision and recall in some categories, particularly for Classes 3 and 4 in Score 5. Despite this, the F1-scores remain high, suggesting a well-balanced performance for the CNN model across all tasks.
The confusion matrices in
Figure 6 further clarify the CNN model’s classification accuracy. In
Figure 6a, the binary classification confusion matrix reveals that the model has high accuracy, with very few misclassifications between Class 0 and Class 1. Similarly,
Figure 6b shows the confusion matrix for the multi-class problem with three classes, where most predictions are correctly classified with only a few errors.
Figure 6c expands the analysis to five classes, showing that while the model handles most predictions correctly, there are some noticeable confusions between closely related classes, particularly in the lower diagonal. These matrices highlight the CNN model’s robustness in binary and multi-class classification scenarios.
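To make the link between a confusion matrix and the per-class metrics explicit, the sketch below builds a matrix from label lists and derives precision, recall, and F1 for each class. It assumes rows index the true class and columns the predicted class, with integer labels starting at 0; the labels in the example are hypothetical and are not taken from the experiments.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes matrix: rows = true, columns = predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # rest of row k
        fp = sum(matrix[r][k] for r in range(n)) - tp   # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


if __name__ == "__main__":
    # Hypothetical three-class labels.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(y_true, y_pred, 3)
    print(cm)
    print(per_class_scores(cm))
```

Off-diagonal entries are the misclassifications discussed above: a large value at row i, column j means that class i is often predicted as class j.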
4.4.2. The Performance Results of the Proposed LSTM Model
Table 5 shows the evaluation metrics for the LSTM model with the 2, 3, and 5 score classes. At Score 2, the model exhibits high performance, with precision, recall, and F1-scores consistently at 0.98–0.99 across all classes, reflecting balanced classification ability. Scores 3 and 5 reveal slightly lower yet still steady performance, particularly with a decline in precision and F1-scores for specific classes (e.g., Class 3 in Score 5 with 0.87 precision and 0.88 F1-score). The overall accuracy is notably high across all scores, with the model achieving up to 98% accuracy for Scores 2 and 3 and a slightly reduced accuracy of 94% at Score 5.
Figure 7 presents three confusion matrices illustrating the performance of the LSTM model in the binary and multi-class classification tasks.
Figure 7a represents the confusion matrix for a binary classification problem, where the model has successfully classified a vast majority of instances correctly, with minor misclassifications observed between the two classes, as indicated by the non-zero off-diagonal values.
Figure 7b expands this to a three-class classification task, where the model performs well. Still, a slight increase in misclassifications is evident, particularly in Class 2, with some instances being confused between neighboring classes.
Figure 7c demonstrates the confusion matrix for a more complex five-class classification task. As expected with an increasing number of classes, the misclassification rate grows, notably with some confusion between Classes 3 and 4 and between Classes 4 and 5, as shown by the higher off-diagonal values in these regions. These matrices collectively highlight that while the LSTM model maintains strong classification capabilities, the complexity of the task introduces more opportunities for misclassification, as observed through the decreasing clarity in the distinction between some classes.
4.4.3. The Performance Results of the Proposed WDE-LSTM Model
Table 6 presents the evaluation metrics for the proposed WDE-LSTM model, covering precision, recall, and F1-scores across three classification tasks with 2, 3, and 5 classes. For the binary classification task (Score 2), the model performs consistently, with precision, recall, and F1-scores ranging from 0.97 to 1.00, leading to an overall accuracy of 98%. For the three-class classification task (Score 3), the model maintains similarly high performance across all metrics, with minimal variation among the macro, micro, and weighted averages, achieving 98% accuracy. However, for the more complex five-class classification task (Score 5), the model’s performance slightly declines, particularly in Class 4, where precision and F1-scores drop to 0.88, contributing to a lower overall accuracy of 94%.
Figure 8 demonstrates strong classification performance across binary, three-class, and five-class tasks. While achieving high accuracy in binary classification, slight misclassifications appear in the three-class scenario, and these increase in the five-class task, particularly between neighboring classes. Despite the growing complexity, the model maintains robust accuracy, showcasing its generalization capabilities across varying classification challenges.
Table 7 presents the evaluation metrics for the proposed WDE-CNN-LSTM model across binary, three-class, and five-class classification tasks. In the binary classification task (Score 2), the model demonstrates exceptional performance, with precision, recall, and F1-scores close to or at 1.00, resulting in an accuracy of 98%. For the three-class classification task (Score 3), the model maintains similarly high performance, with precision, recall, and F1-scores consistently at 0.98–0.99, achieving an overall accuracy of 98%. The performance remains strong even in the five-class classification task (Score 5), where the model shows robust precision, recall, and F1-scores across all classes, achieving approximately 95% accuracy. The minimal variation in the macro, micro, and weighted averages across all tasks suggests that the WDE-CNN-LSTM model exhibits strong generalization capabilities and effectively handles binary and multi-class classification scenarios. In conclusion, the WDE-CNN-LSTM model demonstrated the highest performance among the proposed models.
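The macro, micro, and weighted averages mentioned above differ only in how per-class results are pooled. The following sketch, which assumes per-class (TP, FP, FN) counts are already available and uses hypothetical numbers, illustrates the three conventions for the F1-score.

```python
def averaged_f1(counts):
    """Pool per-class F1 three ways.

    counts: non-empty list of (tp, fp, fn) tuples, one per class.
    Returns a dict with 'macro', 'micro' and 'weighted' F1 values.
    """
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    per_class = [f1(tp, fp, fn) for tp, fp, fn in counts]
    support = [tp + fn for tp, _, fn in counts]  # true instances per class
    total = sum(support)

    # Macro: unweighted mean over classes.
    macro = sum(per_class) / len(per_class)
    # Weighted: mean over classes, weighted by class support.
    weighted = (
        sum(f * s for f, s in zip(per_class, support)) / total if total else 0.0
    )
    # Micro: F1 computed on the counts summed over all classes.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return {"macro": macro, "micro": micro, "weighted": weighted}


if __name__ == "__main__":
    # Hypothetical counts for three classes.
    print(averaged_f1([(1, 1, 1), (2, 1, 0), (1, 0, 1)]))
```

When the three averages are close to each other, as reported for this model, no single class dominates the result; a macro average well below the weighted average would instead point to weak performance on a minority class.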
Figure 9 illustrates the visual results of the WDE-CNN-LSTM model across three classification tasks:
Figure 9a illustrates the confusion matrix for the binary classification task, showing near-perfect accuracy with minimal misclassification;
Figure 9b illustrates the confusion matrix for the three-class task, where the model maintains high accuracy but exhibits some minor misclassifications between neighboring classes; and
Figure 9c illustrates the confusion matrix for the five-class task, reflecting a more significant number of misclassifications, particularly between Classes 4 and 5, though the majority of instances are still accurately classified. These figures collectively demonstrate the model’s robust performance across varying classification complexities.
Figure 10,
Figure 11 and
Figure 12 depict the training, validation accuracy, and loss curves for the WDE-CNN-LSTM model across different classification tasks (binary, three-class, and five-class). In
Figure 10, for the binary classification task (Score 2),
Figure 10a shows that training accuracy increases steadily and converges near 1.0, while validation accuracy levels off at a slightly lower value, indicating good generalization with minimal overfitting.
Figure 10b reveals that training loss decreases rapidly. In contrast, validation loss decreases initially but stabilizes, indicating that the model has reached a balance between fitting the data and avoiding overfitting.
Figure 11, for the three-class task (Score 3), follows a similar trend in
Figure 11a, with both training and validation accuracy converging at high values. In contrast,
Figure 11b shows a steep decline in training loss with a more gradual stabilization of validation loss. Finally,
Figure 12, for the five-class task (Score 5), illustrates in
Figure 12a that training accuracy continues to improve. Still, the gap between training and validation accuracy slightly widens, indicating increased complexity. The training loss declines sharply in
Figure 12b. In contrast, the validation loss decreases and then plateaus, suggesting that while the model learns effectively, the increased complexity of the classification task poses challenges to further minimizing validation loss. These plots highlight the model’s effectiveness in training across various classification tasks while demonstrating the trade-off between model complexity and generalization performance.
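The qualitative reading of these curves (a train/validation gap and a validation-loss plateau) can also be expressed numerically. The sketch below assumes a per-epoch history dictionary with the keys `accuracy`, `val_accuracy`, and `val_loss`, as produced by Keras when accuracy is tracked; the `patience` and `min_delta` thresholds and the example values are hypothetical and are not taken from the paper.

```python
def summarize_history(history, patience=3, min_delta=1e-3):
    """Summarize training curves.

    Returns the final train/validation accuracy gap, the epoch index with
    the best validation loss, and whether validation loss has plateaued
    (no improvement larger than min_delta for `patience` epochs).
    """
    gap = history["accuracy"][-1] - history["val_accuracy"][-1]

    best = float("inf")
    best_epoch = None
    stale = 0  # epochs since the last meaningful improvement
    for epoch, loss in enumerate(history["val_loss"]):
        if loss < best - min_delta:
            best, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1

    return {
        "accuracy_gap": gap,
        "best_epoch": best_epoch,
        "plateaued": stale >= patience,
    }


if __name__ == "__main__":
    # Hypothetical curves shaped like the ones described in the text.
    history = {
        "accuracy": [0.80, 0.90, 0.95, 0.97, 0.98, 0.99],
        "val_accuracy": [0.78, 0.88, 0.92, 0.93, 0.93, 0.93],
        "val_loss": [0.60, 0.40, 0.35, 0.3495, 0.351, 0.350],
    }
    print(summarize_history(history))
```

A growing accuracy gap together with a plateaued validation loss corresponds to the pattern described for the five-class task, and is the usual signal for stopping training early.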
Table 8 presents a comparison between Scores 2, 3, and 5 across the models (CNN, LSTM, WDE-LSTM, and WDE-CNN-LSTM), revealing that lower scores (2 and 3) result in consistently high performance across all metrics, particularly for WDE-LSTM and WDE-CNN-LSTM, which maintain precision, recall, and
F1-scores above 98%. At Score 2, CNN shows lower accuracy (91.19%) compared to the other models, which maintain higher accuracy. As the score increases to 3, all models improve in accuracy, with WDE-CNN-LSTM achieving the best results across all metrics (98.26%). However, at Score 5, there is a general decline in performance across all models, with CNN experiencing the most significant drop, particularly in accuracy. At the same time, WDE-CNN-LSTM retains a relatively stronger performance (around 95%), making it the most robust model across varying score levels.
4.5. Statistical Consistency Analysis
This section assesses the reliability of the SA model when applied to consumer reviews by evaluating the alignment between sentiment polarity scores and the corresponding user-assigned ratings. The consistency was determined by analyzing how well the sentiment scores, derived from the text, matched the numerical ratings provided by the users. The analysis revealed that 96.00% of the reviews were classified as consistent, indicating a strong correlation between the calculated sentiment polarity and the user-assigned scores. This high level of consistency demonstrates that the SA model is highly effective in accurately capturing the sentiment conveyed in the reviews and aligning it with the users’ ratings.
The strong consistency rate of 96.00% highlights the model’s ability to reliably reflect the underlying sentiment of customer feedback, showing that the sentiment scores closely correspond to the assigned ratings.
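One way such a consistency rate could be computed is sketched below. The exact rule used in the study is not specified in this section, so the mapping is an assumption: ratings of 4–5 are treated as positive, 1–2 as negative, and 3 as neutral, and a polarity score in [-1, 1] is treated as neutral inside a hypothetical band of ±0.1. All names and thresholds are illustrative.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a sentiment label (assumed mapping)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


def polarity_to_label(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a label (assumed thresholds)."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def consistency_rate(polarities, ratings, neutral_band=0.1):
    """Fraction of reviews whose text polarity agrees with the star rating."""
    pairs = list(zip(polarities, ratings))
    if not pairs:
        raise ValueError("at least one review is required")
    consistent = sum(
        polarity_to_label(p, neutral_band) == rating_to_label(r)
        for p, r in pairs
    )
    return consistent / len(pairs)


if __name__ == "__main__":
    # Hypothetical reviews: the last two have text that contradicts the rating.
    polarities = [0.8, -0.6, 0.0, 0.4, -0.2]
    ratings = [5, 1, 3, 2, 4]
    print(consistency_rate(polarities, ratings))
```

Under this scheme, the inconsistent reviews are exactly those where the written sentiment and the star rating fall into different categories, which is the kind of mismatch the analysis above aims to quantify.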
4.6. Comparison with the State-of-the-Art Models
A concise comparison of the proposed models is conducted based on evaluation metrics to identify the optimal model. The metrics considered include precision, recall,
F1-score, and accuracy, as shown in
Table 9. It is evident that the proposed models, WDE-CNN-LSTM, WDE-LSTM, and LSTM, outperform the traditional machine learning and DL models across binary, three-class, and five-class classification tasks.
The WDE-CNN-LSTM model demonstrates superior performance, achieving 98% accuracy and an F1-score of 98.26% for binary and three-class tasks. Even in the more complex five-class task, the model maintains a high 95.21% accuracy and 95.20% F1-score. The WDE-LSTM model follows closely, with 98.18% accuracy for binary classification and 93.55% accuracy for five-class tasks, showing a minor drop in performance in more complex scenarios. The LSTM model within the proposed framework also performs well, achieving 98% accuracy for binary classification and 93.53% accuracy for five-class tasks, making it reliable for multi-class scenarios.
The proposed hybrid models show significant improvements compared to traditional models like CNNs and Text-CNN. CNNs achieve 95% accuracy in binary classification but decline to 93.35% accuracy in five-class tasks. Text-CNN, as shown in [
67], consistently performs at 85% accuracy across tasks, highlighting the performance gap between traditional and hybrid approaches. Similarly, while performing well with 91% accuracy in binary classification, BERT struggles in more complex tasks, maintaining the same accuracy across five-class scenarios.
The Novel DL Model for Inconsistency Detection by [
6] excels in five-class classification, achieving 99% accuracy and a 97%
F1-score, setting a new benchmark for performance in complex multi-class classification tasks.
These results align with prior studies where traditional models, such as CNNs and LSTM, struggle with more complex class distributions, as highlighted by [
68,
69]. Integrating Weighted Differential Evolution (WDE) in the proposed models enhances feature selection and convergence, enabling superior performance across diverse class distributions. This is particularly evident in the multi-class tasks where hybrid models, such as WDE-CNN-LSTM, consistently outperform traditional approaches.
Table 9.
Comparison with the state-of-the-art models (Classes 2, 3, and 5).
Method | Size | No. of Classes | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|
WDE-CNN-LSTM (Proposed) | 568,454 | 2 | 0.98 | 0.98 | 0.98 | 0.98 |
WDE-CNN-LSTM (Proposed) | 568,454 | 3 | 0.98 | 0.98 | 0.98 | 0.98 |
WDE-CNN-LSTM (Proposed) | 568,454 | 5 | 0.95 | 0.95 | 0.95 | 0.95 |
WDE-LSTM (Proposed) | 568,454 | 2 | 0.98 | 0.98 | 0.98 | 0.98 |
WDE-LSTM (Proposed) | 568,454 | 3 | 0.98 | 0.98 | 0.98 | 0.98 |
WDE-LSTM (Proposed) | 568,454 | 5 | 0.94 | 0.93 | 0.93 | 0.94 |
LSTM (Proposed) | 568,454 | 2 | 0.98 | 0.98 | 0.98 | 0.98 |
LSTM (Proposed) | 568,454 | 3 | 0.98 | 0.98 | 0.98 | 0.98 |
LSTM (Proposed) | 568,454 | 5 | 0.94 | 0.93 | 0.93 | 0.93 |
CNNs (Proposed) | 568,454 | 2 | 0.98 | 0.98 | 0.98 | 0.98 |
CNNs (Proposed) | 568,454 | 3 | 0.98 | 0.97 | 0.97 | 0.97 |
CNNs (Proposed) | 568,454 | 5 | 0.93 | 0.93 | 0.93 | 0.93 |
Text-CNN [70] | 72,500 | 5 | 0.85 | - | - | 0.85 |
Bi-LSTM [70] | 72,500 | 5 | 0.90 | - | - | 0.90 |
BERT [70] | 72,500 | 5 | 0.91 | - | - | 0.89 |
LSTM [71] | 29,163 | 2 | - | 0.86 | 0.90 | 0.88 |
LSTM [71] | 29,163 | 3 | - | 0.76 | 0.58 | 0.56 |
Novel DL Model [6] | 568,454 | 2 | 0.95 | 0.89 | 0.92 | 0.90 |
Novel DL Model [6] | 568,454 | 3 | 0.95 | 0.78 | 0.81 | 0.79 |
Novel DL Model [6] | 568,454 | 5 | 0.92 | 0.66 | 0.69 | 0.67 |
WDE-CNN [72] | 11,754 | 2 | 0.88 | - | - | 0.88 |
WDE-LSTM [72] | 11,754 | 2 | 0.88 | - | - | 0.88 |
6. Conclusions and Future Work
The paper presents a novel hybrid deep learning model, WDE-CNN-LSTM, that significantly enhances the sentiment classification of customer reviews. The proposed model achieved accuracy rates of 98% for binary and three-class classifications and 95.21% for five-class classification, demonstrating its effectiveness in handling complex sentiment classification tasks (SCTs). The WDE-CNN-LSTM model outperformed standalone models in precision, recall, and F1-score, achieving an F1-score of up to 98.26% for three-class classification. This indicates a robust capability in accurately classifying sentiments. The model showed a high consistency rate of 96.00% between predicted sentiments and actual customer ratings, which is crucial for building trust in sentiment analysis systems. Furthermore, the findings suggest that the hybrid architecture can significantly improve sentiment analysis in customer review systems, leading to more reliable and accurate sentiment classification, which is essential for businesses aiming to understand customer feedback better. The research results are particularly beneficial for e-commerce platforms seeking to enhance customer feedback analysis, data scientists developing sentiment analysis models, and researchers exploring hybrid deep learning approaches in natural language processing.
While the proposed WDE-CNN-LSTM model demonstrates substantial improvements in sentiment classification, certain limitations warrant further investigation. One notable limitation is the computational complexity of the hybrid architecture, which may hinder its deployment in real-time applications or environments with constrained computational resources. This highlights the need for future research to explore optimization techniques, such as model compression or pruning, to reduce computational demands without compromising performance. Addressing these challenges would contribute to the broader adoption and practical implementation of the model.
Future studies could focus on optimizing the proposed hybrid model for efficiency, particularly in terms of computational requirements, to make it more accessible for real-time applications. Research should also address challenges related to class imbalances in datasets, which can affect model performance. Techniques such as data augmentation or synthetic data generation could be explored. Finally, applying the hybrid model to various domains beyond customer reviews, such as healthcare or tourism, could validate its versatility and effectiveness across different contexts. Future work could also investigate adapting the model for multilingual sentiment analysis to expand its usability across diverse markets.