Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models

Zhuo Chen
Wuhan University
Wuhan, China
[email protected]
&Jiawei Liu
Wuhan University
Wuhan, China
[email protected]
Haotan Liu
Wuhan University
Wuhan, China
[email protected]
&Qikai Cheng
Wuhan University
Wuhan, China
[email protected]
&Fan Zhang
Wuhan University
Wuhan, China
[email protected]
Wei Lu
Wuhan University
Wuhan, China
[email protected]
&Xiaozhong Liu
Worcester Polytechnic Institute
USA
[email protected]
Corresponding author.
Abstract

Retrieval-Augmented Generation (RAG) is applied to solve hallucination problems and real-time constraints of large language models, but it also induces vulnerabilities against retrieval corruption attacks. Existing research mainly explores the unreliability of RAG in white-box and closed-domain QA tasks. In this paper, we aim to reveal the vulnerabilities of Retrieval-Enhanced Generative (RAG) models when faced with black-box attacks for opinion manipulation. We explore the impact of such attacks on user cognition and decision-making, providing new insight to enhance the reliability and security of RAG models. We manipulate the ranking results of the retrieval model in RAG with instruction and use these results as data to train a surrogate model. By employing adversarial retrieval attack methods to the surrogate model, black-box transfer attacks on RAG are further realized. Experiments conducted on opinion datasets across multiple topics show that the proposed attack strategy can significantly alter the opinion polarity of the content generated by RAG. This demonstrates the model’s vulnerability and, more importantly, reveals the potential negative impact on user cognition and decision-making, making it easier to mislead users into accepting incorrect or biased information.

1 Introduction

With the rapid development of artificial intelligence, large language models (LLMs) have demonstrated exceptional capabilities in the field of natural language processing. However, constrained by their training data, these models have limited scope of knowledge and lack the most up-to-date information, which can lead to errors or hallucinations when tackling more complex or time-sensitive tasks. Retrieval-Augmented Generation (RAG) combines information retrieval with the generative capabilities of large language models, enhancing the timeliness of knowledge acquisition and effectively mitigating the hallucination problem of these models. When given a query, RAG retrieves the most relevant passages from a knowledge base to augment the input request for the LLM. For example, the retrieved knowledge may consist of a series of text snippets that are semantically most similar to the query. RAG has inspired many popular applications, such as Microsoft Bing Chat, ERNIE Bot, and KimiChat, which use RAG to summarize retrieval results for improved user experience. Open-source projects like LangChain and LlamaIndex provide developers with flexible RAG frameworks to build customized AI applications using LLMs, retrieval models and knowledge bases.

However, as the application scope of RAG expands, its security is increasingly a concern, especially regarding the model performance when faced with malicious attacks. The basic RAG process typically consists of three components: the corpus(refers knowledge bases), the retriever, and the generative large language model. When some of the retrieved passages are corrupted by malicious manipulators, the RAG process can become vulnerable; this is referred to as a retrieval manipulation attack in this paper. Numerous studies have explored various forms of retrieval manipulation attacks, such as adversarial attack on the retriever [14, 16], prompt injection attack [1, 15, 10], jailbreak attack for LLM [6, 12, 27], and poisoning attack targeting the retrieval corpus in RAG [29, 23].

This paper primarily focuses on adversarial ranking poisoning attacks against the retriever in RAG and how such attacks indirectly affect the generative results of the LLM. The threat model presented here is closer to a real-world black-box scenario and can be specifically modeled as follows: the attacker can only make requests to the large model and cannot access the complete corpus, the retriever, or the parameters of the RAG. The attacker can only insert adversarially modified candidate texts into the corpus, while the retriever and the LLM remain black-boxed, intact and unmodifiable. Based on previous studies [13, 2], the retrieval corpus and knowledge base contain millions of candidate texts sourced from the internet, allowing attackers to inject adversarially modified candidate texts by maliciously crafting web content or encyclopedia pages. Representative previous studies by Cho et al. [4] and Zhong et al. [28] utilized predefined white-box retrievers, which are challenging to achieve in real-world scenarios with limited flexibility and practicality. Moreover, these works did not consider testing attacks specifically targeting the integrated generation process, where practical integrated models may mitigate the effects of attacks solely targeting the retriever, thereby reducing their effectiveness. Furthermore, another notable work, PoisonedRAG [29], implemented black-box retrieval poisoning attacks on RAG knowledge bases, effectively exposing relevant security vulnerabilities of RAG. However, its experiments mainly focused on closed-domain question answering, such as "Who is the CEO of OpenAI?" Such questions can be corrected when RAG is combined with fact-checking and value alignment of LLMs. The vulnerabilities explored in this paper primarily target open-ended, controversial, and opinion-based questions in RAG, such as "Should abortion be legal?" These questions demand higher levels of logical analysis and summarization capabilities from large models. Current research in controversial topics is limited, and attacks manipulating opinions on opinion-based questions could potentially cause more profound harm.

Open-ended and controversial topics are issues that lack consensus due to differing opinions and attract widespread attention. These topics often involve opinions from different perspectives, influencing public perception when they are widely discussed. For example, in political elections, Robert Epstein [8] found that manipulating search engines to produce biased search results can alter voters’ voting preferences. Placing passages favoring a particular candidate at the top significantly affects voter trust and favorability towards that candidate. Today, the issue of information homogenization in "information bubbles" has been a major concern among scholars. Zhang Yue et al. [24] proposed that homogenization in information bubbles manifests in three dimensions: selective homogenization, content homogenization, and group homogenization. Content homogenization refers to the phenomenon where people using online media encounter homogeneity in the presented content, often due to the "filter bubbles" that are created by recommendation systems and selectively feed biased information. In scenarios of open-ended and controversial topics, "information bubbles" can lead to the homogenization of user opinions, with people’s views being easily influenced by the stance of the information they encounter. Through manual construction or search engine optimization, opinion manipulation attacks or "cognitive warfare" in open-ended controversial topics is actually widespread in practical applications such as social media and news platform. This phenomenon has numerous negative impacts on society. With the development of large language models, opinion manipulation exploiting RAG vulnerabilities poses a particularly severe threat. Attackers can influence the stance of the model generated content with carefully designed inputs, further endangering users’ cognition and decision-making processes. Therefore, it is of significant theoretical and practical importance to study the vulnerabilities of RAG models against opinion manipulation attacks in black-box setting.

In short, this paper aims to explore the reliability of RAG against black-box opinion manipulation attacks in open-ended controversial topics and investigate the impact of such attacks on user cognition and decision-making. Specifically, we first send specific instructions to obtain the ranking of the retrieval results in the RAG model and analyze the working mechanism of its retrieval module. We train a surrogate model on the obtained retrieval ranking data to approximate the features and relevance preferences of the retriever in RAG [14, 21]. Based on the surrogate model, we design adversarial retrieval attack strategies to manipulate the opinions of candidate documents. By attacking this surrogate model, we generate adversarial opinion manipulation samples and transfer these adversarial samples to the actual RAG model. We then conduct experiments on opinion datasets across multiple topics to validate the effectiveness and impact range of the attack strategies without understanding the internal knowledge of the RAG model. Experiments conducted on opinion datasets across multiple topics show that the proposed attack strategy can significantly alter the opinion polarity of the content generated by RAG. This not only demonstrates the vulnerability of the model but, more importantly, reveals the potential negative impact on user cognition and decision-making, making it easier to mislead users into accepting incorrect or biased information.

2 Related Works

Research on the reliability of neural network models has long been established. In 2013, Szegedy et al. [18] found that applying imperceptible perturbations to a neural network model during a classification task was sufficient to cause classification errors in CV. Later, scholars observed similar phenomenon in NLP. Robin et al. [11] found that inserting perturbed text into original paragraphs significantly distracts computer systems without changing the correct answer or misleading humans. It reflects the robustness of neural network models, i.e., the ability to output stable and correct predictions in tackling the imperceptible additive noises [20]. For large language models, Wang et al. [19] proposed a comprehensive trustworthiness evaluation framework for LLMs, assessing their reliability from various perspectives such as toxicity, adversarial robustness, stereotype bias, and fairness. While large language models have greater capabilities compared to general deep neural network models, they also raise more concerns regarding security and reliability.

As RAG is designed to overcome the hallucination problem in LLMs and enhance their generative capabilities, the reliability of the content generated by RAG is also a major concern. Zhang et al. [25] attempted to explore the weaknesses of RAG by analyzing critical components in order to facilitate the injection of the attack sequence and crafting the malicious document with a gradient-guided token mutation technique. Xiang et al. [22] designed an isolate-then-aggregate strategy, which gets responses of LLMs from each passage in isolation and then securely aggregate these isolated responses, to construct the first defense framework against retrieval corruption attacks. These studies are based on white-box scenarios and primarily focus on the robustness of RAG against corrupted and toxic content.

This paper intends to use adversarial retrieval attack strategies to perturb the ranking results of the retriever, ensuring that opinion documents with a certain stance are ranked as high as possible, thereby guiding the generated responses of the LLM to reflect that stance.

The adversarial retrieval attack strategy starts with manipulation at the word level. Under white-box setting, Ebrahimi et al. [7] utilize an atomic flip operation, which swaps one token for an other, to generate adversarial examples and the method, known as Hotflip. Hotflip gets rid of reliance on rules, but the adversarial text it generate usually has incomplete semantics and insufficient grammar fluency. While it can deceive the target model, it cannot evade perplexity-based defenses. Wu et al. [21] also proposed a word substitution ranking attack method called PRADA. To enhance the readability and effectiveness of the adversarial text, scholars further designed sentence-level ranking attack methods. Song et al. [17] propose an adversarial method under white-box setting, named Collision, which uses gradient optimization and beam search to produce the adversarial text named collision. The Collision method further imposes a soft constraint on collision generation by integrating a language model, reducing the perplexity of the collision. The method has shown promising[14] propose the Pairwise Anchor-based Trigger (PAT) method under black-box setting. Added the fluency constraint and the next sentence prediction constraint, the method generates adversarial text by optimizing the pairwise loss of top candidates and target candidates with adversarial text. Although the time complexity of PAT has increased compared to previous methods, PAT takes ranking similarity and semantic consistency into account, so its manipulation effect on the retrieval ranking of target candidates is superior.

3 Method

Refer to caption
Figure 1: The method for manipulating the opinions of RAG-generated content in black-box scenario

This paper attempts to manipulate the opinions in the responses generated by black-box RAG models on controversial topics, targeting both the retrieval model and the LLM which performs the integrated generation task. Zhang et al. [25] tried to poison context documents to deceive the LLM into generating incorrect content, but this method requires extensive internal details of the LLM application, making it less feasible in real-world scenarios. For black-box RAG, the manipulator has no knowledge of the internal information of the RAG, including model architecture and score function, and can only access the inputs and outputs of the RAG. Specially, the manipulator can only call the interface of the LLM in RAG instead of that of the retriever. Since the inputs consist of the query and the candidate documents and the user’s query cannot be altered, this paper focuses on modifying the candidate documents. Although the manipulator cannot access the entire corpus, they can insert adversarially modified candidate texts into the corpus. The basic framework of RAG consists of the retriever and the generative large language model, which the two are serially connected, the LLM performs the generation task based on the context information retrieved by the retriever. Given that manipulators in a black-box scenario cannot modify the system prompts of the generative large model, it is difficult to directly manipulate the generation results by exploiting the reliability flaws of the LLM itself. Therefore, this paper focuses on exploiting the reliability flaws of the retriever to manipulate the retrieval ranking results. By adding adversarial texts to candidate documents that hold the expected opinion, we increase their relevance to the query, making them more likely to be included in the context passed to the generative large language model. Leveraging the strong capability of LLM for understanding and following instructions, we guide the LLM to generate responses that align with the expected opinion. An overview of this method is shown in Figure 1.

The specific approach for manipulating RAG opinions on controversial topics is as follows: Given a topic q𝑞qitalic_q (the query) from a set of controversial topics Q𝑄Qitalic_Q, we select a expected opinion Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and target all candidate documents dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in the retrieval corpus D𝐷Ditalic_D that hold the Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT opinion. After obtaining the adversarial text padvsubscript𝑝advp_{\text{adv}}italic_p start_POSTSUBSCRIPT adv end_POSTSUBSCRIPT, it is added to dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, transforming the retrieval corpus to D(d;dtpadv)𝐷𝑑direct-sumsubscript𝑑𝑡subscript𝑝advD(d;d_{t}\oplus p_{\text{adv}})italic_D ( italic_d ; italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⊕ italic_p start_POSTSUBSCRIPT adv end_POSTSUBSCRIPT ). Since padvsubscript𝑝advp_{\text{adv}}italic_p start_POSTSUBSCRIPT adv end_POSTSUBSCRIPT can increase the relevance score R(q,dtpadv)𝑅𝑞direct-sumsubscript𝑑𝑡subscript𝑝advR(q,d_{t}\oplus p_{\text{adv}})italic_R ( italic_q , italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⊕ italic_p start_POSTSUBSCRIPT adv end_POSTSUBSCRIPT ) assigned by the retrieval model RM to dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT for query q𝑞qitalic_q, ideally dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT will be ranked at the top of the retrieval results RMk(q)={ddtpadv}subscriptRM𝑘𝑞conditional-set𝑑direct-sumsubscript𝑑𝑡subscript𝑝adv\text{RM}_{k}(q)=\{d\mid d_{t}\oplus p_{\text{adv}}\}RM start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) = { italic_d ∣ italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⊕ italic_p start_POSTSUBSCRIPT adv end_POSTSUBSCRIPT }, guiding the large language model to generate responses that align with the expected opinion: S(LLM(q,RMk(q)))=St𝑆LLM𝑞subscriptRM𝑘𝑞subscript𝑆𝑡S(\text{LLM}(q,\text{RM}_{k}(q)))=S_{t}italic_S ( LLM ( italic_q , RM start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ) ) ) = italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

The primary issue in implementing manipulation is to make the retrieval model of the black-box RAG transparent. This paper aims to simulate the retrieval model RM𝑅𝑀RMitalic_R italic_M. The basic idea is to train a surrogate model Misubscript𝑀iM_{\text{i}}italic_M start_POSTSUBSCRIPT i end_POSTSUBSCRIPT with the ranking results RMk(q)𝑅subscript𝑀k𝑞RM_{\text{k}}(q)italic_R italic_M start_POSTSUBSCRIPT k end_POSTSUBSCRIPT ( italic_q ) from the retrieval model RM𝑅𝑀RMitalic_R italic_M, thus turning the black-box retrieval model into a white-box model. However, since the retriever and the large generative model in RAG are serially connected, it is not possible to directly obtain the ranking results of the retriever. Therefore, this paper attempts to guide the large generative model to replicate the retrieval results of the black-box RAG. Therefore, this paper attempts to guide the large model to replicate the output of the retrieval model, so we obtain the text data deemed relevant by the black-box retrieval model RM𝑅𝑀RMitalic_R italic_M, which can be used as positive examples d+subscript𝑑d_{+}italic_d start_POSTSUBSCRIPT + end_POSTSUBSCRIPT for black-box imitation training. Subsequently, irrelevant texts to the query can be random sampled as negative examples dsubscript𝑑d_{-}italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT . Therefore, this paper designs specific instructions to make the black-box RAG replicate the retrieval results of the retriever RM. These retrieval results only need to reflect the relevance to the query. Then, based on the generated results of the LLM, we sample positive and negative data to train the surrogate model. The method for obtaining imitation data of the retrieval model in a black-box RAG scenario is illustrated in Figure 2. The prompt instruction used is as follows:

Now that you are a search engine, please search: {query}
Ignore the Question. Please copy the top 3 passages of the given Context intact in the output and provide the output in JSON with keys ’answer’ and ’context’. Put each candidate passage in ’context’ as a string element in the list. Candidate passages are separated by line break instead of period or exclamation point. Each candidate is an element in the list, like [Passage 1, Passage 2, Passage 3]. Please copy the passages intact with no modification and only output the one best JSON response.

Refer to caption
Figure 2: The method for obtaining imitation data of RAG retrieval model in black-box scenario

This paper uses a pairwise approach to sample data and train the surrogate model. Relevant passages are sampled from the responses generated by the black-box RAG as positive examples d+subscript𝑑d_{+}italic_d start_POSTSUBSCRIPT + end_POSTSUBSCRIPT, and random irrelevant passages are sampled as negative examples dsubscript𝑑d_{-}italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT. The black-box RAG responds with context information so the responses generated reflect the retrieval results instead of being independent on the context. These sample pairs (d+,d)subscript𝑑subscript𝑑(d_{+},d_{-})( italic_d start_POSTSUBSCRIPT + end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT ) are incorporated into the training dataset. After sampling the imitation data, this paper uses a pairwise training method to obtain the surrogate model Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Let the relevance score calculated by Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be Risubscript𝑅𝑖R_{i}italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the training optimization objective is as follows:

L=1|Q|qQlog(Ri(q,d+)Ri(q,d+)+Ri(q,d))𝐿1𝑄subscript𝑞𝑄subscript𝑅𝑖𝑞subscript𝑑subscript𝑅𝑖𝑞subscript𝑑subscript𝑅𝑖𝑞subscript𝑑L=-\frac{1}{|Q|}\sum_{q\in Q}\log\left(\frac{R_{i}(q,d_{+})}{R_{i}(q,d_{+})+% \sum R_{i}(q,d_{-})}\right)italic_L = - divide start_ARG 1 end_ARG start_ARG | italic_Q | end_ARG ∑ start_POSTSUBSCRIPT italic_q ∈ italic_Q end_POSTSUBSCRIPT roman_log ( divide start_ARG italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_q , italic_d start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ) end_ARG start_ARG italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_q , italic_d start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ) + ∑ italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_q , italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT ) end_ARG )                       [1]

After obtaining the surrogate model Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, this paper transforms the manipulation of RAG-generated opinions in a black-box scenario into manipulation in a white-box scenario. Since we have all the knowledge of the white-box surrogate model Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, this paper directly implements adversarial retrieval attacks on it, generating adversarial text padvsubscript𝑝advp_{\text{adv}}italic_p start_POSTSUBSCRIPT adv end_POSTSUBSCRIPT for the candidate document dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT holding the opinion Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. This paper employs the Pairwise Anchor-based Trigger (PAT) strategy for adversarial retrieval attacks, which is commonly used as a baseline in related research. Subsequently, the generated adversarial text is added to the candidate document with Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Then, the system of the black-box RAG model is queried, and the generated response is obtained. The stance of the response is compared with the stance of the response generated by the RAG without manipulation to evaluate the reliability of the black-box RAG.

PAT, as a representative adversarial retrieval attack strategy, adopts a pairwise generation paradigm. Given the target query, the target candidate item, and the top candidate item(anchor, used to guide the adversarial text generation), the method utilizes gradient optimization of pairwise loss, calculated from the candidate item and the anchor, to find the appropriate representation of an adversarial text. The method also adds fluency constraint and next sentence prediction constraint. By beam search for the words, the final adversarial text, denoted as Tpatsubscript𝑇𝑝𝑎𝑡T_{pat}italic_T start_POSTSUBSCRIPT italic_p italic_a italic_t end_POSTSUBSCRIPT, is iteratively generated in an auto-regressive way. This paper uses Tpatsubscript𝑇patT_{\text{pat}}italic_T start_POSTSUBSCRIPT pat end_POSTSUBSCRIPT as padvsubscript𝑝advp_{\text{adv}}italic_p start_POSTSUBSCRIPT adv end_POSTSUBSCRIPT, with its generated optimization function being [14]:

max(Mi(q,Tpat;w)+λ1logPg(Tpat;w)+λ2fnsp(dt,Tpat;w))subscript𝑀𝑖𝑞subscript𝑇pat𝑤subscript𝜆1subscript𝑃𝑔subscript𝑇pat𝑤subscript𝜆2subscript𝑓nspsubscript𝑑𝑡subscript𝑇pat𝑤\max\left(M_{i}(q,T_{\text{pat}};w)+\lambda_{1}\cdot\log P_{g}(T_{\text{pat}};% w)+\lambda_{2}\cdot f_{\text{nsp}}(d_{t},T_{\text{pat}};w)\right)roman_max ( italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_q , italic_T start_POSTSUBSCRIPT pat end_POSTSUBSCRIPT ; italic_w ) + italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋅ roman_log italic_P start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ( italic_T start_POSTSUBSCRIPT pat end_POSTSUBSCRIPT ; italic_w ) + italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ italic_f start_POSTSUBSCRIPT nsp end_POSTSUBSCRIPT ( italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT pat end_POSTSUBSCRIPT ; italic_w ) )          [2]

In the above formula, Pgsubscript𝑃𝑔P_{g}italic_P start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT is the semantic constraint function, and fnspsubscript𝑓nspf_{\text{nsp}}italic_f start_POSTSUBSCRIPT nsp end_POSTSUBSCRIPT is the next sentence prediction consistency score function between Tpatsubscript𝑇patT_{\text{pat}}italic_T start_POSTSUBSCRIPT pat end_POSTSUBSCRIPT and dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

In terms of dataset, this paper uses the MS MARCO Passages Ranking dataset as the data source for guiding the black-box RAG to generate relevant passages [28] where we sample data pairs to train the surrogate model. Additionally, this paper uses controversial topic data scraped from the PROCON.ORG website as the object of manipulation. The controversial topic dataset includes over 80 topics, covering fields such as society, health, government, education, and science. Each controversial topic is discussed from two stances (pro and con), with an average of 30 related passages, each holding a certain opinion with stance pro or con.

The specific settings details for the RAG manipulation experiment are as follows:

(1) Black-box RAG: This paper represents the black-box RAG process, which serves as the research object, as RAGblacksubscriptRAGblack\text{RAG}_{\text{black}}RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT. It mainly consists of a retriever and a large language model (LLM). The LLMs used are the open-source models Meta-Llama-3-8B-Instruct (LLAMA3-8B) and Qwen1.5-14B-Chat (Qwen1.5-14B). The LLAMA and Qwen series LLMs perform well across various tasks among all open-source models. The prompt connecting the retriever and the LLM in RAGblacksubscriptRAGblack\text{RAG}_{\text{black}}RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT adopts the basic RAG prompt from the Langchain framework:

Use the following pieces of retrieved context to answer the question. Keep the answer concise. Context: {context}. Question: {question}.

(2) Target retriever model and surrogate model: The retriever in RAG is usually a dense retrieval model. Therefore, this paper selects the representative dense retrieval model, coCondenser, as the target retrieval model [9]. Since coCondenser is a BERT-based model, the surrogate model chosen in this paper is the MiniLM model, which is BERT-based and specifically trained on the MS Marco Passage Ranking dataset.

(3) Manipulation target: For a controversial topic q𝑞qitalic_q, documents dtsubscript𝑑𝑡d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT holding the expected opinion Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are manipulated by adding adversarial text padvsubscript𝑝𝑎𝑑𝑣p_{adv}italic_p start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT at the beginning. This manipulation aims to position these perturbed documents as prominently as possible in the top K𝐾Kitalic_K rankings of the RAG retriever RMk(q)subscriptRM𝑘𝑞\mathrm{RM}_{k}(q)roman_RM start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_q ), where K𝐾Kitalic_K denotes the number of paragraphs obtained by the RAG generation model from the retrieval results. In this paper, K𝐾Kitalic_K is set to 3.

(4) Manipulator(the threat model): In the black-box scenario, the manipulator is only authorized to query the RAG, obtain RAG-generated results and modify the target documents. There are no restrictions on the number of calls to RAG. Furthermore, the manipulator has no knowledge of the model architecture, model parameters, or any other information related to the models within the black-box RAG. Modifying the prompt templates used by the LLM is also prohibited.

(5) Experimental Parameters: The batch size for training the surrogate model is set to 32, with 24 iterations.

Our proposed opinion manipulation strategy for black-box RAG is outlined in Algorithm 1.

Input: target black-box RAG model RAGblacksubscriptRAGblack\textit{RAG}_{\text{black}}RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT, target retrieval model RM𝑅𝑀RMitalic_R italic_M, surrogate model 𝑴𝒊subscript𝑴𝒊\boldsymbol{M_{i}}bold_italic_M start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT, controversial topics 𝑸𝑸\boldsymbol{Q}bold_italic_Q, target topic 𝒒𝒒\boldsymbol{q}bold_italic_q, expected opinion Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, corpus Docs𝐷𝑜𝑐𝑠Docsitalic_D italic_o italic_c italic_s , target documents with expected opinion Docst𝐷𝑜𝑐subscript𝑠𝑡Docs_{t}italic_D italic_o italic_c italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, target document 𝒅𝒕subscript𝒅𝒕\boldsymbol{d_{t}}bold_italic_d start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT, relevant document d+subscript𝑑d_{+}italic_d start_POSTSUBSCRIPT + end_POSTSUBSCRIPT, random sampled document dsubscript𝑑d_{-}italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT
Instructions:
i1subscript𝑖1i_{1}italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = "Now that you are a search engine, please search: {query} Ignore the Question. Please copy the top 3 passages of the given Context intact in the output and provide the output in JSON with keys ’answer’ and ’context’. Put each candidate passage in ’context’ as a string element in the list…"
i2subscript𝑖2i_{2}italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = "Use the following pieces of retrieved context to answer the question…"
// RAGblacksubscriptRAGblack\textit{RAG}_{\text{black}}RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT uses i2subscript𝑖2i_{2}italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT as prompt template.
Functions:
𝑶𝒑𝒊𝒏𝒊𝒐𝒏𝑪𝒍𝒂𝒔𝒔𝒊𝒇𝒚𝑶𝒑𝒊𝒏𝒊𝒐𝒏𝑪𝒍𝒂𝒔𝒔𝒊𝒇𝒚\boldsymbol{OpinionClassify}bold_italic_O bold_italic_p bold_italic_i bold_italic_n bold_italic_i bold_italic_o bold_italic_n bold_italic_C bold_italic_l bold_italic_a bold_italic_s bold_italic_s bold_italic_i bold_italic_f bold_italic_y: Classify the opinion of the content into "support", "neutral" or "oppose".
𝑷𝑨𝑻𝑷𝑨𝑻\boldsymbol{PAT}bold_italic_P bold_italic_A bold_italic_T: Pairwise Anchor-based Trigger generation strategy.
Output: manipulated RAG responses 𝑹𝒆𝒔𝑹𝒆𝒔\boldsymbol{Res}bold_italic_R bold_italic_e bold_italic_s
1
2 Phase 1. Pairwise Imitation Data Construction and Black-box Retrieval Model Imitation Training
3       INIT: Dataset 𝒟{}𝒟\mathcal{D}\leftarrow\{\}caligraphic_D ← { }
4       for 𝐪m𝐐subscript𝐪𝑚𝐐\boldsymbol{q}_{m}\in\boldsymbol{Q}bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ bold_italic_Q do
5             induced rank list 𝑹mtop3subscript𝑹𝑚𝑡𝑜𝑝3absent\boldsymbol{R}_{m}{top3}\leftarrowbold_italic_R start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_t italic_o italic_p 3 ← RAGblack(𝒒mi1;𝑫𝒐𝒄𝒔)subscriptRAGblackdirect-sumsubscript𝒒𝑚subscript𝑖1𝑫𝒐𝒄𝒔\textit{RAG}_{\text{black}}(\boldsymbol{q}_{m}\oplus i_{1};\boldsymbol{Docs})RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT ( bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ⊕ italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ; bold_italic_D bold_italic_o bold_italic_c bold_italic_s )
             // RAGblack(𝒒mi1)subscriptRAGblackdirect-sumsubscript𝒒𝑚subscript𝑖1\textit{RAG}_{\text{black}}(\boldsymbol{q}_{m}\oplus i_{1})RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT ( bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ⊕ italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) \approx RM(𝒒m)𝑅𝑀subscript𝒒𝑚RM(\boldsymbol{q}_{m})italic_R italic_M ( bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT )
6             for 𝐝+j𝐑mtop3subscriptsubscript𝐝𝑗subscript𝐑𝑚𝑡𝑜𝑝3\boldsymbol{d_{+}}_{j}\in\boldsymbol{R}_{m}{top3}bold_italic_d start_POSTSUBSCRIPT bold_+ end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ bold_italic_R start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_t italic_o italic_p 3 do
7                   Random sample document as 𝒅jsubscriptsubscript𝒅𝑗\boldsymbol{d_{-}}_{j}bold_italic_d start_POSTSUBSCRIPT bold_- end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
8                   𝒟positive,[𝒒m;𝒅+j;𝒅j]𝒟𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒subscript𝒒𝑚subscriptsubscript𝒅𝑗subscriptsubscript𝒅𝑗\mathcal{D}\leftarrow positive,[\boldsymbol{q}_{m};\boldsymbol{d_{+}}_{j};% \boldsymbol{d_{-}}_{j}]caligraphic_D ← italic_p italic_o italic_s italic_i italic_t italic_i italic_v italic_e , [ bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ; bold_italic_d start_POSTSUBSCRIPT bold_+ end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ; bold_italic_d start_POSTSUBSCRIPT bold_- end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ]
9                  𝒟negative,[𝒒m;𝒅j;𝒅+j]𝒟𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒subscript𝒒𝑚subscriptsubscript𝒅𝑗subscriptsubscript𝒅𝑗\mathcal{D}\leftarrow negative,[\boldsymbol{q}_{m};\boldsymbol{d_{-}}_{j};% \boldsymbol{d_{+}}_{j}]caligraphic_D ← italic_n italic_e italic_g italic_a italic_t italic_i italic_v italic_e , [ bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ; bold_italic_d start_POSTSUBSCRIPT bold_- end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ; bold_italic_d start_POSTSUBSCRIPT bold_+ end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ]
                   // Reverse 𝒅+jsubscriptsubscript𝒅𝑗\boldsymbol{d_{+}}_{j}bold_italic_d start_POSTSUBSCRIPT bold_+ end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and 𝒅jsubscriptsubscript𝒅𝑗\boldsymbol{d_{-}}_{j}bold_italic_d start_POSTSUBSCRIPT bold_- end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to get the negative triple
10                  
11            
12      Train the surrogate model 𝑴𝒊subscript𝑴𝒊\boldsymbol{M_{i}}bold_italic_M start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT on 𝒟𝒟\mathcal{D}caligraphic_D with Eq 1
13       return 𝐌𝐢subscript𝐌𝐢\boldsymbol{M_{i}}bold_italic_M start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT 
14Phase 2. Adversarial Trigger Generation and Opinion Manipulation in RAG Response
15       INIT: RAG Response Set 𝑹𝒆𝒔{}𝑹𝒆𝒔\boldsymbol{Res}\leftarrow\{\}bold_italic_R bold_italic_e bold_italic_s ← { }
16       for 𝐪m𝐐subscript𝐪𝑚𝐐\boldsymbol{q}_{m}\in\boldsymbol{Q}bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ bold_italic_Q do
17             rank list 𝑹msubscript𝑹𝑚absent\boldsymbol{R}_{m}\leftarrowbold_italic_R start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ← 𝑴𝒊(𝒒𝒎;𝑫𝒐𝒄𝒔)subscript𝑴𝒊subscript𝒒𝒎𝑫𝒐𝒄𝒔\boldsymbol{M_{i}(\boldsymbol{q}_{m};Docs)}bold_italic_M start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT bold_( bold_italic_q start_POSTSUBSCRIPT bold_italic_m end_POSTSUBSCRIPT bold_; bold_italic_D bold_italic_o bold_italic_c bold_italic_s bold_)
18             𝒂𝒏𝒄𝒉𝒐𝒓m𝒂𝒏𝒄𝒉𝒐subscript𝒓𝑚absent\boldsymbol{anchor}_{m}\leftarrowbold_italic_a bold_italic_n bold_italic_c bold_italic_h bold_italic_o bold_italic_r start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ← top-1(𝑹msubscript𝑹𝑚\boldsymbol{R}_{m}bold_italic_R start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT)
19             for dj𝐑msubscript𝑑𝑗subscript𝐑𝑚d_{j}\in\boldsymbol{R}_{m}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ bold_italic_R start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT do
20                   if 𝐎𝐩𝐢𝐧𝐢𝐨𝐧𝐂𝐥𝐚𝐬𝐬𝐢𝐟𝐲(dj)=St𝐎𝐩𝐢𝐧𝐢𝐨𝐧𝐂𝐥𝐚𝐬𝐬𝐢𝐟𝐲subscript𝑑𝑗subscript𝑆𝑡\boldsymbol{OpinionClassify}(d_{j})=S_{t}bold_italic_O bold_italic_p bold_italic_i bold_italic_n bold_italic_i bold_italic_o bold_italic_n bold_italic_C bold_italic_l bold_italic_a bold_italic_s bold_italic_s bold_italic_i bold_italic_f bold_italic_y ( italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT then
21                         𝑫𝒐𝒄𝒔𝒕𝑫𝒐𝒄subscript𝒔𝒕absent\boldsymbol{Docs_{t}}\leftarrowbold_italic_D bold_italic_o bold_italic_c bold_italic_s start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT ← djsubscript𝑑𝑗d_{j}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
22                  
23            for 𝐝𝐭j𝐃𝐨𝐜𝐬𝐭subscriptsubscript𝐝𝐭𝑗𝐃𝐨𝐜subscript𝐬𝐭\boldsymbol{d_{t}}_{j}\in\boldsymbol{Docs_{t}}bold_italic_d start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ bold_italic_D bold_italic_o bold_italic_c bold_italic_s start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT do
24                   adversarial trigger 𝒑𝒂𝒅𝒗j𝑷𝑨𝑻(𝑴𝒊;𝒒m,𝒅𝒕j,𝒂𝒏𝒄𝒉𝒐𝒓m)subscriptsubscript𝒑𝒂𝒅𝒗𝑗𝑷𝑨𝑻subscript𝑴𝒊subscript𝒒𝑚subscriptsubscript𝒅𝒕𝑗𝒂𝒏𝒄𝒉𝒐subscript𝒓𝑚\boldsymbol{p_{adv}}_{j}\leftarrow\boldsymbol{PAT}(\boldsymbol{M_{i}};% \boldsymbol{q}_{m},\boldsymbol{d_{t}}_{j},\boldsymbol{anchor}_{m})bold_italic_p start_POSTSUBSCRIPT bold_italic_a bold_italic_d bold_italic_v end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← bold_italic_P bold_italic_A bold_italic_T ( bold_italic_M start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT ; bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_d start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , bold_italic_a bold_italic_n bold_italic_c bold_italic_h bold_italic_o bold_italic_r start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT )
25                   adversarial document 𝒅𝒂𝒅𝒗j𝒅𝒕j𝒑𝒂𝒅𝒗jsubscriptsubscript𝒅𝒂𝒅𝒗𝑗direct-sumsubscriptsubscript𝒅𝒕𝑗subscriptsubscript𝒑𝒂𝒅𝒗𝑗\boldsymbol{d_{adv}}_{j}\leftarrow\boldsymbol{d_{t}}_{j}\oplus\boldsymbol{p_{% adv}}_{j}bold_italic_d start_POSTSUBSCRIPT bold_italic_a bold_italic_d bold_italic_v end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← bold_italic_d start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ⊕ bold_italic_p start_POSTSUBSCRIPT bold_italic_a bold_italic_d bold_italic_v end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
                   𝑫𝒐𝒄𝒔:𝒅𝒕j𝒅𝒂𝒅𝒗j:𝑫𝒐𝒄𝒔subscriptsubscript𝒅𝒕𝑗subscriptsubscript𝒅𝒂𝒅𝒗𝑗\boldsymbol{Docs}:\boldsymbol{d_{t}}_{j}\leftarrow\boldsymbol{d_{adv}}_{j}bold_italic_D bold_italic_o bold_italic_c bold_italic_s : bold_italic_d start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← bold_italic_d start_POSTSUBSCRIPT bold_italic_a bold_italic_d bold_italic_v end_POSTSUBSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT // Replace
26                  
27            𝑹𝒆𝒔𝑹𝒆𝒔absent\boldsymbol{Res}\leftarrowbold_italic_R bold_italic_e bold_italic_s ← RAGblack(𝒒m;𝑫𝒐𝒄𝒔)subscriptRAGblacksubscript𝒒𝑚𝑫𝒐𝒄𝒔\textit{RAG}_{\text{black}}(\boldsymbol{q}_{m};\boldsymbol{Docs})RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT ( bold_italic_q start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ; bold_italic_D bold_italic_o bold_italic_c bold_italic_s )
28            
29      return 𝐑𝐞𝐬𝐑𝐞𝐬\boldsymbol{Res}bold_italic_R bold_italic_e bold_italic_s
Algorithm 1 Opinion Manipulation Strategy for black-box RAG

4 Experiment and Analysis

After imitating the retrieval model of RAGblacksubscriptRAGblack\text{RAG}_{\text{black}}RAG start_POSTSUBSCRIPT black end_POSTSUBSCRIPT to obtain the surrogate model, this paper first compares the ranking ability of the surrogate model Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and the target retrieval model RM𝑅𝑀RMitalic_R italic_M, as well as the similarity of their ranking results, as shown in Table 1, to ensure that the surrogate model has learned the capabilities of the black-box retrieval model.

Table 1: Comparison(%) of ranking results between the surrogate model and the target retrieval model(based on the target retrieval model)
Model MRR@10 NDCG@10 Inter@10 RBO@10
Target retrieval model 87.07 68.16
Surrogate model 87.98 73.73 62.32 48.66

This paper uses Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to reflect the ranking ability of the models themselves; higher values indicate stronger ranking ability in terms of relevance. Inter Ranking Similarity (Inter) and Rank Biased Overlap (RBO) are used to measure the similarity between the ranking results of the surrogate model and the target retrieval model; higher values indicate better performance of the black-box imitation. The weight for RBO@10 is set to 0.7. In Table 1, “–” indicates that the metric is not applicable to the model.

As can be seen from Table 1, the surrogate model Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT trained by black-box imitation is similar to the target retrieval model coCondenser in terms of relevance ranking performance and ranking results, validating the effectiveness of the black-box imitation.

After the black-box imitation training, the white-box surrogate model is conducted with opinion manipulation experiments. Several controversial topics and their opinion text data under the four themes of "Government", "Education", "Society", and "Health" from the PROCON.ORG data are selected as the retrieval corpus. The original retrieval corpus is denoted as Docsorigin. Based on the surrogate model, we generate the corresponding Tpat for the candidate items with the expected opinion Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT on controversial topics, and then insert Tpat at the beginning of the target candidate items to obtain the perturbed retrieval corpus Docsadv. Query RAGblack twice on controversial topics: once with Docsorigin as the retrieval corpus, and once with Docsadv as the retrieval corpus, and obtain the two responses of RAG, representing the answers before and after opinion manipulation. The responses are then classified into three categories based on their opinion on controversial topics: opposing, neutral, and supporting, represented by 0, 1, and 2, respectively, as the opinion scores of the generated responses. This study uses Average Stance Variation (ASV) to represent the average increase of opinion scores of RAGblack responses in the direction of the expected opinion Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT before and after manipulation. A positive ASV indicates that the opinion manipulation towards Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is effective, while a negative ASV indicates that the manipulation actually makes the opinions of RAG responses deviate from Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The larger the ASV value, the more successful the opinion manipulation of RAG responses. Additionally, this paper attempts to obtain the ranking results of the retriever coCondenser to evaluate the effectiveness of the adversarial retrieval manipulation strategy at the ranking stage for dense retrieval. This evaluation is solely for assessment purposes and is not involved in manipulation, as no internal knowledge of RAGblack was leaked during the manipulation process.

After obtaining the ranking results of the retriever model coCondenser, this paper evaluates the manipulation effect with Attack Success Rate (ASR), the average proportion of target opinions in the Top 3 rankings before and after manipulation (Top3origin, Top3attacked), and the Variation of Normalized Discounted Cumulative Gain (VoN-DCG). Higher values of ASR and Vo-NDCG indicate better manipulation effects on ranking, and a larger difference between Top3attacked and Top3origin signifies more significant ranking manipulation effects, too.

Table 2: Manipulation results of RAGblack𝑅𝐴𝐺blackRAG\textsubscript{black}italic_R italic_A italic_G ranking and response opinion
Model CoCondenser Ranking Qwen1.5-14b LLAMA3-8b
Topic ASR𝐴𝑆𝑅ASRitalic_A italic_S italic_R Top3origin𝑇𝑜𝑝subscript3𝑜𝑟𝑖𝑔𝑖𝑛Top3_{o}riginitalic_T italic_o italic_p 3 start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT italic_r italic_i italic_g italic_i italic_n Top3attacked𝑇𝑜𝑝subscript3𝑎𝑡𝑡𝑎𝑐𝑘𝑒𝑑Top3_{a}ttackeditalic_T italic_o italic_p 3 start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_t italic_t italic_a italic_c italic_k italic_e italic_d NDCGvariation𝑁𝐷𝐶subscript𝐺𝑣𝑎𝑟𝑖𝑎𝑡𝑖𝑜𝑛NDCG_{v}ariationitalic_N italic_D italic_C italic_G start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT italic_a italic_r italic_i italic_a italic_t italic_i italic_o italic_n ASV𝐴𝑆𝑉ASVitalic_A italic_S italic_V ASV𝐴𝑆𝑉ASVitalic_A italic_S italic_V
Government 0.17 0.48 0.57 0.06 -0.17 0.25
Education 0.33 0.28 0.39 0.09 0.42 0.5
Society 0.5 0.39 0.56 0.07 0.42 0.5
Health 0.5 0.33 0.44 0.12 0.67 0.5

Refer to caption

Figure 3: Overall effect of RAG opinion manipulation in black-box scenario

Figure 3 shows the significant overall opinion manipulation effect of the adversarial retrieval attack strategy PAT. This paper divides Docsorigin into two parts: document data with an expected opinion of support and document data with an expected opinion of opposition, the expected opinion represents the stance direction we would like RAG response to hold for the target topic after manipulation. The expected opinion Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is set to 2 for supporting and 0 for opposing, and then manipulation is performed. The results show that when the expected opinion is support, the proportion of responses with a supportive stance increases significantly after manipulation, while the proportion of responses with an opposing stance decreases. When the expected opinion is opposition, the proportion of supportive responses decreases significantly after manipulation, while the proportions of neutral and opposing responses both increase. Comparatively, the changes in stance before and after manipulation are slightly larger for LLAMA3-8b than for Qwen1.5-14b, due to stronger ability of LLAMA3-8b to follow instructions.

The results of the theme-specific manipulation experiments are shown in Table 2. The adversarial retrieval attack strategy PAT, applied to the adversarial texts generated by the surrogate model, significantly increased the proportion of candidate items holding expected opinion in the Top 3 of the RAGblack retrieval list, thereby guiding the LLM to change its opinion in the response. However, the manipulation effect of RAGblack generated opinions varies across different themes: for education, society, and health topics, the attack success rate and ranking variation of target items are significantly higher than those in government topics. This suggests that the LLM may have been specifically fine-tuned on government-related dataset, enabling it to mitigate the bias in the retrieval context to some extent. Among these topics, controversial opinions in society and health topics are more susceptible to manipulation. Since these two areas are closely related to people’s lives, opinion manipulation in society and health topics may pose a greater risk.

The manipulation results across different themes still demonstrate relative advantage of LLAMA3-8b in understanding prompts with contextual background intentions and generating effective responses. However, this also indicates that the strong comprehension ability of LLMs may undermine the reliability of the content it generates.

5 Conclusion

In this paper, we explore the vulnerability of retrieval-augmented generation (RAG) models to opinion manipulation against black-box attack in open-ended controversial topics, and delve into the potential impact of such attacks on user cognition and decision-making. Through systematic experiments, we propose a novel adversarial attack strategy about retrieval ranking poisoning. This method significantly affects the polarity of the opinions generated by RAG by crafting adversarial samples, without requiring internal knowledge of the RAG model. The experimental results indicate that the proposed attack strategy successfully alters the opinion of the content generated by the RAG model, revealing the vulnerability and unreliability of RAG when confronted with malicious retrieval corpus. More importantly, this opinion manipulation could have profound impacts on users’ cognition and decision-making processes, potentially leading users to accept incorrect or biased information, causing cognitive changes and public opinion distortion. This phenomenon is particularly significant in open-ended and controversial issues.

Future research will expands the scale of the experiments by including more open-source and commercial RAG systems to more comprehensively evaluate the reliability of viewpoint generation by RAG models. Given the vulnerabilities of RAG models, future work should focus on developing more robust defense strategies. These may include improving the robustness of retrieval algorithms, enhancing the reliability of generation models, and introducing multi-level input filtering mechanisms to counteract adversarial inputs, thereby achieving a balanced optimization of the understanding and reliability of RAG models.

6 Ethical Statement

This paper explores the feasibility of opinion manipulation on black-box RAG models in real-world scenarios. The main goal is to assess the reliability of RAG technology in responding to ranking manipulation at the stage of retrieval, paving the way for future work to enhance the robustness and defense capabilities of RAG technology. This study did not manipulate any commercial RAG systems or real-world data currently in use.·

References

  • [1] Xiangrui Cai, Haidong Xu, Sihan Xu, Ying Zhang, et al. Badprompt: Backdoor attacks on continuous prompts. Advances in Neural Information Processing Systems, 35:37068–37080, 2022.
  • [2] Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023.
  • [3] Zhuo Chen, Jiawei Liu, and Haotan Liu. Research on the reliability and fairness of opinion retrieval in public topics. In 2024 Network and Distributed System Security (NDSS) workshop on AI Systems with Confidential Computing, 2024.
  • [4] Sukmin Cho, Soyeong Jeong, Jeongyeon Seo, Taeho Hwang, and Jong C Park. Typos that broke the rag’s back: Genetic attack on rag pipeline by simulating documents in the wild via low-level perturbations. arXiv preprint arXiv:2404.13948, 2024.
  • [5] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pages 1310–1320. PMLR, 2019.
  • [6] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023.
  • [7] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017.
  • [8] Robert Epstein and Ronald E. Robertson. The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences, 112(33):E4512–E4521, August 2015. Publisher: Proceedings of the National Academy of Sciences.
  • [9] Luyu Gao and Jamie Callan. Unsupervised corpus aware language model pre-training for dense passage retrieval. arXiv preprint arXiv:2108.05540, 2021.
  • [10] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
  • [11] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.
  • [12] Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
  • [13] Zilong Lin, Zhengyi Li, Xiaojing Liao, XiaoFeng Wang, and Xiaozhong Liu. Mawseo: Adversarial wiki search poisoning for illicit online promotion. arXiv preprint arXiv:2304.11300, 2023.
  • [14] Jiawei Liu, Yangyang Kang, Di Tang, Kaisong Song, Changlong Sun, Xiaofeng Wang, Wei Lu, and Xiaozhong Liu. Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models, April 2023. arXiv:2209.06506 [cs].
  • [15] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against llm-integrated applications, june 2023. arXiv preprint arXiv:2306.05499, 2023.
  • [16] Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Yixing Fan, and Xueqi Cheng. Black-box adversarial attacks against dense retrieval models: A multi-view contrastive learning method. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 1647–1656, 2023.
  • [17] Congzheng Song, Alexander M Rush, and Vitaly Shmatikov. Adversarial semantic collisions. arXiv preprint arXiv:2011.04743, 2020.
  • [18] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [19] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In NeurIPS, 2023.
  • [20] Wenqi Wang, Run Wang, Lina Wang, Zhibo Wang, and Aoshuang Ye. Towards a robust deep neural network in texts: A survey. arXiv preprint arXiv:1902.07285, 2019.
  • [21] Chen Wu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models, June 2022. arXiv:2204.01321 [cs].
  • [22] Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, and Prateek Mittal. Certifiably robust rag against retrieval corruption. arXiv preprint arXiv:2405.15556, 2024.
  • [23] Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, and Qian Lou. Badrag: Identifying vulnerabilities in retrieval augmented generation of large language models. arXiv preprint arXiv:2406.00083, 2024.
  • [24] Zhang Yue, ZHUANG Bichen, LI Qingyu, and ZHU Qinghua. Homogenization dilemma:concept analysis and theoretical framework construction of information cocoons. Journal of Library Science in China, 49(3):107–122, 2023.
  • [25] Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, and Yu Jiang. Human-imperceptible retrieval poisoning attacks in llm-powered applications. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 502–506, 2024.
  • [26] Zihan Zhang, Mingxuan Liu, Chao Zhang, Yiming Zhang, Zhou Li, Qi Li, Haixin Duan, and Donghong Sun. Argot: Generating adversarial readable chinese texts. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 2533–2539, 2021.
  • [27] Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. arXiv preprint arXiv:2401.17256, 2024.
  • [28] Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023.
  • [29] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models. arXiv preprint arXiv:2402.07867, 2024.