Search Results (118)

Search Parameters:
Keywords = Wikipedia

21 pages, 2465 KiB  
Article
Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence
by Javier López-Otero, Ángel Obregón-Sierra and Antonio Gavira-Narváez
Soc. Sci. 2024, 13(12), 664; https://doi.org/10.3390/socsci13120664 - 11 Dec 2024
Viewed by 937
Abstract
The scientific literature on residential segregation in large metropolitan areas highlights various explanatory factors, including economic, social, political, landscape, and cultural elements related to both migrant and local populations. This paper contrasts the individual impact of these factors on outcomes such as the immigrant rate and neighborhood segregation. To achieve this, a machine learning analysis was conducted on a sample of neighborhoods in the main Spanish metropolitan areas (Madrid and Barcelona), using a database created from a combination of official statistical sources and textual sources, such as Wikipedia. These texts were transformed into indexes using Natural Language Processing (NLP) and other artificial intelligence algorithms capable of interpreting images and converting them into indexes. The results indicate that the factors influencing immigrant concentration and segregation differ significantly, with crucial roles played by the urban landscape, population size, and geographic origin. While land prices showed a relationship with immigrant concentration, their effect on segregation was mediated by factors such as overcrowding, social support networks, and landscape degradation. The novel application of AI and big data, particularly through ChatGPT and Google Street View, has enhanced model predictability, contributing to the scientific literature on segregated spaces. Full article
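
The abstract gives no implementation details, but the core idea of turning qualitative Wikipedia text about a neighborhood into numeric indexes that can sit next to official statistics in a tabular model can be sketched roughly as follows. The embedding model, file name, column names, and target variable are illustrative assumptions, not the authors' actual pipeline.

```python
# Rough sketch (not the authors' pipeline): turn Wikipedia descriptions of
# neighbourhoods into numeric "text indexes" and combine them with official
# statistics in a tabular model. File, columns, and target are hypothetical.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("neighbourhoods.csv")   # hypothetical: one row per neighbourhood

# Embed each neighbourhood's Wikipedia description ...
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = encoder.encode(df["wikipedia_text"].tolist())

# ... and compress the embeddings into a handful of compact text indexes.
for i, component in enumerate(PCA(n_components=5).fit_transform(embeddings).T):
    df[f"text_index_{i}"] = component

features = df.drop(columns=["neighbourhood", "wikipedia_text", "segregation_index"])
scores = cross_val_score(GradientBoostingRegressor(), features,
                         df["segregation_index"], cv=5, scoring="r2")
print("mean cross-validated R^2:", scores.mean())
```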

22 pages, 7770 KiB  
Article
Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation
by Azzah Allahim and Asma Cherif
Appl. Sci. 2024, 14(23), 11104; https://doi.org/10.3390/app142311104 - 28 Nov 2024
Viewed by 456
Abstract
The expanding Arabic user base presents a unique opportunity for researchers to tap into vast online Arabic resources. However, the lack of reliable Arabic word embedding models and the limited availability of Arabic corpora pose significant challenges. This paper addresses these gaps by developing and evaluating Arabic word embedding models trained on diverse Arabic corpora, investigating how varying hyperparameter values impact model performance across different NLP tasks. To train our models, we collected data from three distinct sources: Wikipedia, newspapers, and 32 Arabic books, each selected to capture specific linguistic and contextual features of Arabic. By using advanced techniques such as Word2Vec and FastText, we experimented with different hyperparameter configurations, such as vector size, window size, and training algorithms (CBOW and skip-gram), to analyze their impact on model quality. Our models were evaluated using a range of NLP tasks, including sentiment analysis, similarity tests, and an adapted analogy test designed specifically for Arabic. The findings revealed that both the corpus size and hyperparameter settings had notable effects on performance. For instance, in the analogy test, a larger vocabulary size significantly improved outcomes, with the FastText skip-gram models excelling in accurately solving analogy questions. For sentiment analysis, vocabulary size was critical, while in similarity scoring, the FastText models achieved the highest scores, particularly with smaller window and vector sizes. Overall, our models demonstrated strong performance, achieving 99% and 90% accuracies in sentiment analysis and the analogy test, respectively, along with a similarity score of 8 out of 10. These results underscore the value of our models as a robust tool for Arabic NLP research, addressing a pressing need for high-quality Arabic word embeddings. Full article
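
As a rough illustration of the training sweep described above (Word2Vec and FastText with varying vector size, window size, and CBOW versus skip-gram), here is a minimal gensim sketch; the corpus file, grid values, and probe word are assumptions rather than the paper's configuration.

```python
# Sketch of the embedding training sweep (gensim). Corpus path, grid values,
# and the probe word are illustrative, not the paper's settings.
from gensim.models import FastText, Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence("arabic_corpus.txt")   # hypothetical: one tokenized sentence per line

for model_cls in (Word2Vec, FastText):
    for sg in (0, 1):                        # 0 = CBOW, 1 = skip-gram
        for vector_size in (100, 300):
            for window in (3, 5):
                model = model_cls(corpus, vector_size=vector_size, window=window,
                                  sg=sg, min_count=5, workers=4, epochs=5)
                tag = f"{model_cls.__name__}_sg{sg}_d{vector_size}_w{window}"
                model.save(f"{tag}.model")
                if "كتاب" in model.wv:       # quick sanity check on a common word
                    print(tag, model.wv.most_similar("كتاب", topn=3))
```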

16 pages, 2723 KiB  
Article
Using Deep Learning, Optuna, and Digital Images to Identify Necrotizing Fasciitis
by Ming-Jr Tsai, Chung-Hui Lin, Jung-Pin Lai and Ping-Feng Pai
Electronics 2024, 13(22), 4421; https://doi.org/10.3390/electronics13224421 - 11 Nov 2024
Viewed by 1355
Abstract
Necrotizing fasciitis, which is categorized as a medical and surgical emergency, is a life-threatening soft tissue infection. Necrotizing fasciitis diagnosis primarily relies on computed tomography (CT), magnetic resonance imaging (MRI), ultrasound scans, surgical biopsy, blood tests, and expert knowledge from doctors or nurses. Necrotizing fasciitis develops rapidly, making early diagnosis crucial. With the rapid progress of information technology and systems, in terms of both hardware and software, deep learning techniques have been employed to address problems in various fields. This study develops an information system using convolutional neural networks (CNNs), Optuna, and digital images (CNNOPTDI) to detect necrotizing fasciitis. The determination of the hyperparameters in convolutional neural networks plays a critical role in influencing classification performance. Therefore, Optuna, an optimization framework for hyperparameter selection, is utilized to optimize the hyperparameters of the CNN models. We collect the images for this study from open data sources such as Open-i and Wikipedia. The numerical results reveal that the developed CNNOPTDI system is feasible and effective in identifying necrotizing fasciitis with very satisfactory classification accuracy. Therefore, a potential future application of the CNNOPTDI system could be in remote medical stations or telemedicine settings to assist with the early detection of necrotizing fasciitis. Full article
(This article belongs to the Special Issue Innovations, Challenges and Emerging Technologies in Data Engineering)
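
The CNNOPTDI system itself is not reproduced here; the sketch below only illustrates the general pattern named in the abstract, namely Optuna searching a small CNN's hyperparameters and reporting the best trial, using PyTorch and synthetic stand-in images. The architecture, search space, and data are assumptions.

```python
# Sketch only: Optuna tuning a small CNN's hyperparameters on stand-in data.
# This is not the paper's CNNOPTDI system; architecture, search space, and the
# synthetic tensors below are illustrative assumptions.
import optuna
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in "images" so the sketch runs end to end.
X = torch.randn(64, 3, 64, 64)
y = torch.randint(0, 2, (64,))                      # 0 = other, 1 = necrotizing fasciitis
train_loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(X, y), batch_size=16)

def build_cnn(n_filters, dropout):
    return nn.Sequential(
        nn.Conv2d(3, n_filters, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(n_filters, n_filters * 2, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Dropout(dropout),
        nn.Linear(n_filters * 2 * 16 * 16, 2))      # 64x64 input halved twice -> 16x16

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    n_filters = trial.suggest_categorical("n_filters", [16, 32, 64])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    model = build_cnn(n_filters, dropout)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(3):                              # a few epochs per trial
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        correct = sum((model(xb).argmax(1) == yb).sum().item() for xb, yb in val_loader)
    return correct / len(val_loader.dataset)        # validation accuracy to maximize

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("best hyperparameters:", study.best_params)
```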

16 pages, 715 KiB  
Article
Sentence Embeddings and Semantic Entity Extraction for Identification of Topics of Short Fact-Checked Claims
by Krzysztof Węcel, Marcin Sawiński, Włodzimierz Lewoniewski, Milena Stróżyna, Ewelina Księżniak and Witold Abramowicz
Information 2024, 15(10), 659; https://doi.org/10.3390/info15100659 - 21 Oct 2024
Viewed by 1155
Abstract
The objective of this research was to design a method to assign topics to claims debunked by fact-checking agencies. During the fact-checking process, access to more structured knowledge is necessary; therefore, we aim to describe topics with semantic vocabulary. Classification of topics should go beyond simple connotations like instance-class and rather reflect broader phenomena that are recognized by fact checkers. The assignment of semantic entities is also crucial for the automatic verification of facts using the underlying knowledge graphs. Our method is based on sentence embeddings, various clustering and dimensionality reduction methods (HDBSCAN, UMAP, K-means), semantic entity matching, and term importance assessment based on TF-IDF. We represent our topics in semantic space using Wikidata Q-ids, DBpedia, Wikipedia topics, YAGO, and other relevant ontologies. Such an approach based on semantic entities also supports hierarchical navigation within topics. For evaluation, we compare topic modeling results with claims already tagged by fact checkers. The work presented in this paper is useful for researchers and practitioners interested in semantic topic modeling of fake news narratives. Full article
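
A minimal sketch of the clustering-and-description part of such a pipeline (sentence embeddings, UMAP reduction, HDBSCAN clustering, top TF-IDF terms per cluster). It omits the semantic entity matching to Wikidata, DBpedia, and YAGO, and the model name, parameters, and input file are assumptions.

```python
# Sketch: embed claims, reduce with UMAP, cluster with HDBSCAN, then describe
# each cluster by its top TF-IDF terms. The semantic entity matching step
# (Wikidata/DBpedia/YAGO) is omitted; model, parameters, and file are assumed.
import hdbscan
import pandas as pd
import umap
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

claims = pd.read_csv("debunked_claims.csv")["text"].tolist()   # hypothetical input

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(claims)
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(claims)
terms = vectorizer.get_feature_names_out()

for cluster in sorted(set(labels) - {-1}):          # -1 is HDBSCAN's noise label
    rows = [i for i, label in enumerate(labels) if label == cluster]
    scores = tfidf[rows].sum(axis=0).A1
    top_terms = [terms[i] for i in scores.argsort()[::-1][:5]]
    print(f"topic {cluster}: {top_terms}")
```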

32 pages, 5459 KiB  
Article
Toward the Adoption of Explainable Pre-Trained Large Language Models for Classifying Human-Written and AI-Generated Sentences
by Luca Petrillo, Fabio Martinelli, Antonella Santone and Francesco Mercaldo
Electronics 2024, 13(20), 4057; https://doi.org/10.3390/electronics13204057 - 15 Oct 2024
Viewed by 1676
Abstract
Pre-trained large language models have demonstrated impressive text generation capabilities, including understanding, writing, and performing many tasks in natural language. Moreover, with time and improvements in training and text generation techniques, these models are proving efficient at generating increasingly human-like content. However, they can also be modified to generate persuasive, contextual content weaponized for malicious purposes, including disinformation and novel social engineering attacks. In this paper, we present a study on identifying human- and AI-generated content using different models. Precisely, we fine-tune different models belonging to the BERT family, an open-source version of the GPT model, ELECTRA, and XLNet, and then perform a text classification task using two different labeled datasets—the first one consisting of 25,000 sentences generated by both AI and humans and the second comprising 22,929 abstracts that are ChatGPT-generated and written by humans. Furthermore, we perform an additional phase where we submit 20 sentences generated by ChatGPT and 20 sentences randomly extracted from Wikipedia to our fine-tuned models to verify the efficiency and robustness of the latter. In order to understand the prediction of the models, we performed an explainability phase using two sentences: one generated by the AI and one written by a human. We leveraged the integrated gradients and token importance techniques, analyzing the words and subwords of the two sentences. As a result of the first experiment, we achieved an average accuracy of 99%, precision of 98%, recall of 99%, and F1-score of 99%. For the second experiment, we reached an average accuracy of 51%, precision of 50%, recall of 52%, and F1-score of 51%. Full article
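
The fine-tuning step can be sketched with Hugging Face Transformers roughly as follows; the base checkpoint, file names, and hyperparameters are placeholders rather than the configurations used in the paper.

```python
# Sketch of the fine-tuning step with Hugging Face Transformers. The checkpoint,
# CSV files (columns: text, label with 0 = human, 1 = AI), and hyperparameters
# are placeholders, not the paper's exact configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="human-vs-ai", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
print(trainer.evaluate())
```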

20 pages, 996 KiB  
Article
Entity Linking Model Based on Cascading Attention and Dynamic Graph
by Hongchan Li, Chunlei Li, Zhongchuan Sun and Haodong Zhu
Electronics 2024, 13(19), 3845; https://doi.org/10.3390/electronics13193845 - 28 Sep 2024
Viewed by 595
Abstract
The purpose of entity linking is to connect entity mentions in text to real entities in the knowledge base. Existing methods focus on using the text topic, entity type, linking order, and association between entities to obtain the target entities. Although these methods have achieved good results, they ignore the exploration of candidate entities, leading to insufficient semantic information among entities. In addition, the implicit relationship and discrimination within the candidate entities also affect the accuracy of entity linking. To address these problems, we introduce information about candidate entities from Wikipedia and construct a graph model to capture implicit dependencies between different entity decisions. Specifically, we propose a cascade attention mechanism and develop a novel local entity linkage model termed CAM-LEL. This model leverages the interaction between entity mentions and candidate entities to enhance the semantic representation of entities. Furthermore, a global entity linkage model termed DG-GEL based on a dynamic graph is established to construct an entity association graph, and a random walk algorithm and entity entropy are used to extract the implicit relationships within entities to increase the differentiation between entities. Experimental results and in-depth analyses of multiple datasets show that our model outperforms other state-of-the-art models. Full article
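
CAM-LEL and DG-GEL are not reproduced here. As a generic illustration of the global idea, scoring candidate entities by a random walk over an entity association graph so that mutually coherent candidates reinforce each other, here is a toy sketch using personalized PageRank in networkx; the mentions, candidates, and edge weights are invented.

```python
# Toy illustration of a "global" disambiguation step (not CAM-LEL/DG-GEL):
# rank candidate entities by a random walk (personalized PageRank) over an
# entity association graph. Mentions, candidates, and weights are invented.
import networkx as nx

candidates = {"Paris": ["Paris (France)", "Paris (Texas)", "Paris Hilton"],
              "Seine": ["Seine (river)"]}
association_edges = [("Paris (France)", "Seine (river)", 0.9),
                     ("Paris (Texas)", "Seine (river)", 0.1)]

G = nx.Graph()
G.add_weighted_edges_from(association_edges)

# Restart the walk from unambiguous candidates so coherent entities gain mass.
restart = {"Seine (river)": 1.0}
scores = nx.pagerank(G, alpha=0.85, personalization=restart, weight="weight")

for mention, cands in candidates.items():
    best = max(cands, key=lambda c: scores.get(c, 0.0))
    print(f"{mention} -> {best}")
```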

14 pages, 3422 KiB  
Article
Papers in and Papers out of the Spotlight: Comparative Bibliometric and Altmetrics Analysis of Biomedical Reports with and without News Media Stories
by Artemis Chaleplioglou
Publications 2024, 12(4), 30; https://doi.org/10.3390/publications12040030 - 27 Sep 2024
Viewed by 2061
Abstract
For decades, the discoverability and visibility of a paper relied on the readership of the academic journal where the publication was issued. As public interest in biomedicine has grown, the news media have taken on an important role in spreading scientific findings. This investigation explores the potential impact of news media stories on the citations and altmetrics of a paper. A total of 2020 open-access biomedical research papers, all published in the same year, 2015, and in journals with an impact factor between 10 and 14, were investigated. The papers were split into two groups based on the sole criterion of receiving or not receiving news media coverage. Papers with news media coverage accounted for 44% of the total. They received, on average, 60% more citations, 104% more blogs, 150% more X posts, 106% more Facebook reports, 40% more Wikipedia references, 85% more videos, and 51% more Mendeley readers than papers without news media coverage. The correlation between news media outlets and increased citations and altmetrics is evident. However, the broader societal impact of news media coverage, in terms of bringing scientific matters or discoveries to the public eye, appears to be more robust when compared to the reactions of the scientific community. Full article

24 pages, 3952 KiB  
Article
Confrontation of Capitalism and Socialism in Wikipedia Networks
by Leonardo Ermann and Dima L. Shepelyansky
Information 2024, 15(9), 571; https://doi.org/10.3390/info15090571 - 18 Sep 2024
Cited by 1 | Viewed by 743
Abstract
We introduce the Ising Network Opinion Formation (INOF) model and apply it to the analysis of networks of six Wikipedia language editions. In the model, Ising spins are placed at network nodes/articles and the steady-state opinion polarization of spins is determined from the Monte Carlo iterations in which a given spin orientation is determined by in-going links from other spins. The main consideration was the opinion confrontation between capitalism, imperialism (blue opinion) and socialism, communism (red opinion). These nodes have fixed spin/opinion orientation while other nodes achieve their steady-state opinions in the process of Monte Carlo iterations. We found that the global network opinion favors socialism, communism for all six editions. The model also determined the opinion preferences for world countries and political leaders, showing good agreement with heuristic expectations. We also present results for opinion competition between Christianity and Islam, and USA Democratic and Republican parties. We argue that the INOF approach can find numerous applications for directed complex networks. Full article
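
A toy sketch of the opinion dynamic described above: a few nodes hold fixed "red" and "blue" opinions, and every other node repeatedly adopts the majority opinion of the nodes linking to it until the polarization settles. The random directed graph and parameters are illustrative stand-ins, not a Wikipedia network.

```python
# Toy sketch of the INOF dynamic described above: a few nodes hold fixed "red"
# (+1) and "blue" (-1) opinions; every other node repeatedly adopts the majority
# opinion of its in-neighbours. The random graph is illustrative, not Wikipedia.
import random
import networkx as nx

random.seed(0)
G = nx.gnp_random_graph(200, 0.05, directed=True, seed=42)

fixed = {0: +1, 1: +1, 2: -1, 3: -1}                 # fixed-opinion "source" nodes
spin = {n: fixed.get(n, random.choice([-1, +1])) for n in G}

for _ in range(100):                                 # asynchronous update sweeps
    for n in G:
        if n in fixed:
            continue
        incoming = sum(spin[u] for u, _ in G.in_edges(n))
        if incoming != 0:
            spin[n] = 1 if incoming > 0 else -1

free = [spin[n] for n in G if n not in fixed]
print("share of free nodes ending up 'red':", sum(s > 0 for s in free) / len(free))
```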

26 pages, 1199 KiB  
Article
Exploring the Effectiveness of Shallow and L2 Learner-Suitable Textual Features for Supervised and Unsupervised Sentence-Based Readability Assessment
by Dimitris Kostadimas, Katia Lida Kermanidis and Theodore Andronikos
Appl. Sci. 2024, 14(17), 7997; https://doi.org/10.3390/app14177997 - 7 Sep 2024
Viewed by 899
Abstract
Simplicity in information found online is in demand from diverse user groups seeking better text comprehension and consumption of information in an easy and timely manner. Readability assessment, particularly at the sentence level, plays a vital role in aiding specific demographics, such as language learners. In this paper, we research model evaluation metrics, strategies for model creation, and the predictive capacity of features and feature sets in assessing readability based on sentence complexity. Our primary objective is to classify sentences as either simple or complex, shifting the focus from entire paragraphs or texts to individual sentences. We approach this challenge as both a classification and clustering task. Additionally, we focus our tests on shallow features that, despite their simplistic nature and ease of use, seem to yield decent results. Leveraging the TextStat Python library and the WEKA toolkit, we employ a wide variety of shallow features and classifiers. By comparing the outcomes across different models, algorithms, and feature sets, we aim to offer valuable insights into optimizing the setup. We draw our data from sentences sourced from Wikipedia’s corpus, a widely accessed online encyclopedia catering to a broad audience. We strive to take a deeper look at what leads to better readability classification in datasets that appeal to audiences such as Wikipedia’s, assisting in the development of improved models and new features for future applications with low feature extraction/processing times. Full article
(This article belongs to the Special Issue Knowledge and Data Engineering)
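
The shallow-feature idea can be sketched as follows, using the TextStat library mentioned in the abstract and a scikit-learn classifier in place of WEKA for compactness; the sentences, labels, and feature list are illustrative assumptions.

```python
# Sketch of the shallow-feature approach: textstat readability scores as
# features for a classifier (scikit-learn stands in for WEKA here). The
# sentences, labels, and feature list are illustrative stand-ins.
import textstat
from sklearn.ensemble import RandomForestClassifier

sentences = ["The cat sat on the mat.",
             "Notwithstanding the aforementioned considerations, the committee deferred its ruling.",
             "Dogs like to play outside.",
             "The ramifications of the legislation remain contingent upon judicial interpretation."]
labels = [0, 1, 0, 1]                                # 0 = simple, 1 = complex

def shallow_features(sentence):
    return [textstat.flesch_reading_ease(sentence),
            textstat.syllable_count(sentence),
            textstat.lexicon_count(sentence),
            len(sentence)]                           # raw character length

X = [shallow_features(s) for s in sentences]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([shallow_features("Birds can fly very high.")]))   # expect: [0]
```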

25 pages, 514 KiB  
Article
Bridging Linguistic Gaps: Developing a Greek Text Simplification Dataset
by Leonidas Agathos, Andreas Avgoustis, Xristiana Kryelesi, Aikaterini Makridou, Ilias Tzanis, Despoina Mouratidis, Katia Lida Kermanidis and Andreas Kanavos
Information 2024, 15(8), 500; https://doi.org/10.3390/info15080500 - 20 Aug 2024
Viewed by 841
Abstract
Text simplification is crucial in bridging the comprehension gap in today’s information-rich environment. Despite advancements in English text simplification, languages with intricate grammatical structures, such as Greek, often remain under-explored. The complexity of Greek grammar, characterized by its flexible syntactic ordering, presents unique challenges that hinder comprehension for native speakers, learners, tourists, and international students. This paper introduces a comprehensive dataset for Greek text simplification, containing over 7500 sentences across diverse topics such as history, science, and culture, tailored to address these challenges. We outline the methodology for compiling this dataset, including a collection of texts from Greek Wikipedia, their annotation with simplified versions, and the establishment of robust evaluation metrics. Additionally, the paper details the implementation of quality control measures and the application of machine learning techniques to analyze text complexity. Our experimental results demonstrate the dataset’s initial effectiveness and potential in reducing linguistic barriers and enhancing communication, with initial machine learning models showing promising directions for future improvements in classifying text complexity. The development of this dataset marks a significant step toward improving accessibility and comprehension for a broad audience of Greek speakers and learners, fostering a more inclusive society. Full article
(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

15 pages, 6423 KiB  
Article
Sustainability in Leadership: The Implicit Associations of the First-Person Pronouns and Leadership Effectiveness Based on Word Embedding Association Test
by Qu Yao, Yingjie Zheng and Jianhang Chen
Sustainability 2024, 16(15), 6403; https://doi.org/10.3390/su16156403 - 26 Jul 2024
Viewed by 895
Abstract
The first-person pronoun is an indispensable element of the communication process. Meanwhile, leadership effectiveness, as the result of leaders’ leadership work, is the key to the sustainable development of leaders and corporations. However, due to the constraints of traditional methods and sample bias, it is challenging to accurately measure and validate the relationship between first-person pronouns and leadership effectiveness at the implicit level. Word Embedding Association Test (WEAT) measures the relative degree of association between words in natural language by calculating the difference in word similarity. This study employs the word and sentence vector indicators of WEAT to investigate the implicit relationship between first-person pronouns and leadership effectiveness. The word vector analyses of the Beijing Normal University word vector database and Google News word vector database demonstrate that the cosine similarity and semantic similarity of “we-leadership effectiveness” are considerably greater than that of “I-leadership effectiveness”. Furthermore, the sentence vector analyses of the Chinese Wikipedia BERT model corroborate this relationship. In conclusion, the results of a machine learning-based WEAT verified the relationship between first-person plural pronouns and leadership effectiveness. This suggests that when leaders prefer to use “we”, they are perceived to be more effective. Full article
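
A rough sketch of a WEAT-style comparison, taking the average cosine similarity between first-person pronoun vectors and leadership-effectiveness attribute words, using a small pretrained English model from gensim; the word lists and the model are assumptions and differ from the Chinese and Google News vectors used in the study.

```python
# Sketch of a WEAT-style comparison: is "we" closer to leadership-effectiveness
# words than "I"? Word lists and the vector model are illustrative choices.
import gensim.downloader as api
import numpy as np

wv = api.load("glove-wiki-gigaword-100")            # small pretrained English vectors

targets = {"we": ["we", "us", "our"], "i": ["i", "me", "my"]}
attributes = ["effective", "successful", "competent", "inspiring"]

def mean_similarity(target_words, attribute_words):
    sims = [wv.similarity(t, a) for t in target_words for a in attribute_words
            if t in wv and a in wv]
    return float(np.mean(sims))

for name, words in targets.items():
    print(name, "->", round(mean_similarity(words, attributes), 3))
```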

29 pages, 826 KiB  
Article
Practices and Attitudes of the Research and Teaching Staff at the University of Split about the Online Encyclopedia Wikipedia
by Mirko Duić
Publications 2024, 12(3), 20; https://doi.org/10.3390/publications12030020 - 11 Jul 2024
Viewed by 1536
Abstract
The goal of this study was to investigate the practices and attitudes of the research and teaching staff at the University of Split (Croatia) about the online encyclopedia Wikipedia. The method of a questionnaire-based survey was used to gain insights related to this topic. During February 2024, the survey was completed by 226 respondents. The results show that almost all respondents read Wikipedia articles and believe that the level of their accuracy is quite high. Almost half of the respondents strongly agree with the statement that it would be desirable for faculty staff to write Wikipedia articles with the aim of spreading knowledge about topics from their professional fields. However, a very small number of respondents participated in writing articles for Wikipedia. The respondents also indicated that their greatest motivations to write Wikipedia articles would be having this activity count toward advancement to a higher work position and correcting errors in existing Wikipedia articles. It was also found that most respondents are not very familiar with how Wikipedia works or how to add new content to it. These and other insights from this study can be used to conceive and initiate various activities that can contribute to greater participation of scientific and teaching staff of higher education institutions in writing quality content on Wikipedia, as well as activities that can contribute to better familiarity with the principles and procedures for writing and enhancing its content. Other research methods, such as interviews with scientific and teaching staff of higher education institutions, could be used to acquire further, more detailed answers related to this topic. Full article

13 pages, 1923 KiB  
Article
Infodemiology and Infoveillance of the Four Most Widespread Arbovirus Diseases in Italy
by Omar Enzo Santangelo, Sandro Provenzano, Carlotta Vella, Alberto Firenze, Lorenzo Stacchini, Fabrizio Cedrone and Vincenza Gianfredi
Epidemiologia 2024, 5(3), 340-352; https://doi.org/10.3390/epidemiologia5030024 - 5 Jul 2024
Viewed by 1318
Abstract
The purpose of this observational study was to evaluate the potential epidemiological trend of arboviral diseases most reported in Italy by the dedicated national surveillance system (ISS data) compared to searches on the internet, assessing whether a correlation/association between users’ searches in Google and Wikipedia and real cases exists. The study considers a time interval from June 2012 to December 2023. We used the following Italian search terms: “Virus Toscana”, “Virus del Nilo occidentale” (West Nile Virus in English), “Encefalite trasmessa da zecche” (Tick Borne encephalitis in English), and “Dengue”. We overlapped Google Trends and Wikipedia data to perform a linear regression and correlation analysis. Statistical analyses were performed using Pearson’s correlation coefficient (r) or Spearman’s rank correlation coefficient (rho) as appropriate. All the correlations between the ISS data and Wikipedia or GT exhibited statistical significance. The correlations were strong for Dengue GT and ISS (rho = 0.71) and TBE GT and ISS (rho = 0.71), while the remaining correlations had values of r and rho between 0.32 and 0.67, showing a moderate temporal correlation. The observed correlations and regression models provide a foundation for future research, encouraging a more nuanced exploration of the dynamics between digital information-seeking behavior and disease prevalence. Full article
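
The correlation step can be sketched as follows: align the surveillance case counts with the corresponding Google Trends and Wikipedia series and compute Pearson's r and Spearman's rho with SciPy. The file and column names are assumptions.

```python
# Sketch of the correlation analysis: surveillance case counts vs. search and
# pageview series, using Pearson's r and Spearman's rho. File and column names
# are assumed, not taken from the study.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

df = pd.read_csv("dengue_monthly.csv")   # hypothetical columns: month, iss_cases, gtrends, wiki_views

for signal in ("gtrends", "wiki_views"):
    r, p_r = pearsonr(df["iss_cases"], df[signal])
    rho, p_rho = spearmanr(df["iss_cases"], df[signal])
    print(f"{signal}: r={r:.2f} (p={p_r:.3f}), rho={rho:.2f} (p={p_rho:.3f})")
```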

13 pages, 2270 KiB  
Article
GRAAL: Graph-Based Retrieval for Collecting Related Passages across Multiple Documents
by Misael Mongiovì and Aldo Gangemi
Information 2024, 15(6), 318; https://doi.org/10.3390/info15060318 - 29 May 2024
Viewed by 870
Abstract
Finding passages related to a sentence over a large collection of text documents is a fundamental task for claim verification and open-domain question answering. For instance, a common approach for verifying a claim is to extract short snippets of relevant text from a collection of reference documents and provide them as input to a natural language inference machine that determines whether the claim can be deduced or refuted. Available approaches struggle when several pieces of evidence from different documents need to be combined to make an inference, as individual documents often have low relevance to the input and are therefore excluded. We propose GRAAL (GRAph-based retrievAL), a novel graph-based approach that outlines the relevant evidence as a subgraph of a large graph that summarizes the whole corpus. We assess the validity of this approach by building a large graph that represents co-occurring entity mentions on a corpus of Wikipedia pages and using this graph to identify candidate text relevant to a claim across multiple pages. Our experiments on a subset of FEVER, a popular benchmark, show that the proposed approach is effective in identifying short passages related to a claim from multiple documents. Full article
(This article belongs to the Special Issue 2nd Edition of Information Retrieval and Social Media Mining)
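
GRAAL itself is more involved, but the underlying idea, building a graph of entity mentions that co-occur in passages and then collecting passages that lie on paths between a claim's entities, can be sketched on toy data as follows; the passages and entities are invented for illustration.

```python
# Toy sketch of the idea behind graph-based retrieval: build a graph of entities
# that co-occur in passages, then pull passages whose entities connect the
# entities mentioned in a claim. Data and entity extraction are illustrative.
import itertools
import networkx as nx

passages = {
    "p1": {"Marie Curie", "Nobel Prize", "Physics"},
    "p2": {"Marie Curie", "Warsaw"},
    "p3": {"Nobel Prize", "Alfred Nobel"},
}

G = nx.Graph()
for pid, entities in passages.items():
    for a, b in itertools.combinations(sorted(entities), 2):
        G.add_edge(a, b)
        G[a][b].setdefault("passages", set()).add(pid)

claim_entities = {"Marie Curie", "Alfred Nobel"}

# Collect passages lying on short paths between the claim's entities.
relevant = set()
for a, b in itertools.combinations(sorted(claim_entities), 2):
    path = nx.shortest_path(G, a, b)
    for u, v in zip(path, path[1:]):
        relevant |= G[u][v]["passages"]
print(sorted(relevant))                              # expected: ['p1', 'p3']
```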

21 pages, 3492 KiB  
Article
A Question and Answering Service of Typhoon Disasters Based on the T5 Large Language Model
by Yongqi Xia, Yi Huang, Qianqian Qiu, Xueying Zhang, Lizhi Miao and Yixiang Chen
ISPRS Int. J. Geo-Inf. 2024, 13(5), 165; https://doi.org/10.3390/ijgi13050165 - 14 May 2024
Cited by 1 | Viewed by 2209
Abstract
A typhoon disaster is a common meteorological disaster that seriously impacts natural ecology, social economy, and even human sustainable development. It is crucial to access typhoon disaster information and the corresponding disaster prevention and reduction strategies. However, traditional question and answering (Q&A) methods exhibit shortcomings like low information retrieval efficiency and poor interactivity. This makes it difficult to satisfy users’ demands for obtaining accurate information. Consequently, this work proposes a typhoon disaster knowledge Q&A approach based on an LLM (T5). This method integrates two technical paradigms, domain fine-tuning and retrieval-augmented generation (RAG), to optimize user interaction experience and improve the precision of disaster information retrieval. The process specifically includes the following steps. First, this study selects information about typhoon disasters from open-source databases, such as Baidu Encyclopedia and Wikipedia. Utilizing techniques such as slicing and masked language modeling, we generate a training set and 2204 Q&A pairs specifically focused on typhoon disaster knowledge. Second, we continuously pretrain the T5 model using the training set. This process involves encoding typhoon knowledge as parameters in the neural network’s weights and fine-tuning the pretrained model with Q&A pairs to adapt the T5 model for downstream Q&A tasks. Third, when responding to user queries, we retrieve passages from external knowledge bases semantically similar to the queries to enhance the prompts. This action further improves the response quality of the fine-tuned model. Finally, we evaluate the constructed typhoon agent (Typhoon-T5) using different similarity-matching approaches. Furthermore, the method proposed in this work lays the foundation for the cross-integration of large language models with disaster information. It is expected to promote the further development of GeoAI. Full article
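
The retrieval-augmented inference step (the third step above) can be sketched roughly as follows: retrieve the knowledge-base passage most similar to the question and prepend it to the prompt for a T5 checkpoint. The checkpoint name, retriever model, and tiny knowledge base are assumptions; the paper fine-tunes its own Typhoon-T5 model.

```python
# Sketch of the retrieval-augmented inference step: prepend the most similar
# knowledge-base passage to the question before querying a T5-style model.
# The checkpoint name and the tiny knowledge base are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util
from transformers import T5ForConditionalGeneration, T5Tokenizer

kb = ["Typhoon warnings are issued when sustained winds are expected to exceed 118 km/h.",
      "During a typhoon, stay away from windows and avoid coastal areas."]
question = "What should I do during a typhoon?"

retriever = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(retriever.encode(question), retriever.encode(kb))[0]
context = kb[int(scores.argmax())]

tokenizer = T5Tokenizer.from_pretrained("t5-base")   # stand-in for the fine-tuned checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")
prompt = f"question: {question} context: {context}"
output = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```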