RAG is an AI framework that lets LLMs draw on additional domain-specific data and generate better, more accurate answers without having to be retrained. The following figure describes the RAG-based digital assistant architecture design.
Figure 6. Digital assistant design
Digital assistant workflow
The following list describes the steps involved in operating the digital assistant.
- Ingest data: Identify and process the PDF files or HTML web pages that contain the domain-specific information the Llama 2 model needs to answer user queries more accurately.
- Extract and split data: The LangChain library extracts the data and splits it into smaller, manageable chunks, because a language model can handle only a limited amount of text at a time, bounded by the context length it was trained with. The text chunks are then converted into numerical vectors, known as embeddings, which are used to perform semantic search (see the first sketch after this list).
- Load embeddings into the vector store: LangChain stores the embeddings in the vector store and retrieves them using different retrievers, such as “similarity distance threshold.” Redis serves as the vector store in this solution, holding the vectors and performing the semantic search (see the second sketch after this list).
- User interface: Gradio provides a simple and intuitive user interface for interacting with the digital assistant. LangChain integrates the Gradio user interface with the other components, such as the LLM and the vector store, so that they work together as a digital assistant (see the final sketch after this list).
- Query processing: When a user submits a query, the query is split into smaller chunks and embeddings are created for them. A semantic search is performed against the vectors stored in Redis, the results are ranked and sent to the Llama 2 model, and the model answers the query based on those results and its pretrained knowledge (see the third sketch after this list).
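The ingest, extract, and split steps might look like the following minimal sketch, assuming the classic (pre-1.0) LangChain package layout. The file path, chunk sizes, and embedding model are illustrative assumptions, not part of the reference design.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings

# Ingest: load a domain-specific PDF into LangChain Document objects.
# "manuals.pdf" is a placeholder path.
docs = PyPDFLoader("manuals.pdf").load()

# Split: break the text into chunks small enough for the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed: the embedding model is an assumption; the solution does not name one.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```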
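Continuing the sketch, the chunk embeddings can be persisted in Redis and exposed through the “similarity distance threshold” retriever mentioned above. The Redis URL, index name, and threshold value are placeholders.

```python
from langchain.vectorstores.redis import Redis

# Store the chunk embeddings in a Redis index.
vectorstore = Redis.from_documents(
    chunks,
    embeddings,
    redis_url="redis://localhost:6379",
    index_name="assistant-docs",
)

# Only chunks within the given vector distance of the query are returned.
retriever = vectorstore.as_retriever(
    search_type="similarity_distance_threshold",
    search_kwargs={"distance_threshold": 0.5},
)
```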
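For query processing, one way to wire the retriever to Llama 2 is LangChain's RetrievalQA chain, which "stuffs" the ranked Redis results into the prompt. Loading Llama 2 through a local transformers pipeline is an assumption here; the solution could equally point at a served endpoint.

```python
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Llama 2 via a local transformers pipeline (gated model; requires HF access).
llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    max_new_tokens=256,
))

# Retrieve relevant chunks from Redis and let the model answer from them.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
print(qa.run("How do I reset the appliance to factory defaults?"))
```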
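Finally, a minimal Gradio front end can forward user queries to the chain defined above; the interface title is illustrative.

```python
import gradio as gr

def answer(query: str) -> str:
    # Forward the user's query to the RetrievalQA chain.
    return qa.run(query)

gr.Interface(
    fn=answer,
    inputs="text",
    outputs="text",
    title="Digital assistant",
).launch()
```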