Introduction
Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models (LLMs) with real-time data retrieval to generate highly contextual and accurate responses. RAG involves two key phases: ingestion and retrieval. During the ingestion phase, data from various sources (such as internal documents, databases, and external APIs) is indexed and stored in a vector database, so the system can quickly retrieve relevant information when needed. In the retrieval phase, the system uses this index to find the information most relevant to a user query, which the LLM then uses to generate a precise, context-aware response.
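To make the two phases concrete, here is a minimal sketch using the same LangChain and Ollama packages as the full example later in this post. The document text, query, and model names are placeholder assumptions, and it presumes an Ollama instance running locally.

```typescript
import { Document } from 'langchain/document';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { OllamaEmbeddings, ChatOllama } from '@langchain/ollama';

// Ingestion phase: embed a source document and index it in a vector store.
const embeddings = new OllamaEmbeddings({ model: 'nomic-embed-text' });
const store = new MemoryVectorStore(embeddings);
await store.addDocuments([
  new Document({ pageContent: 'Internal policy: refunds are processed within 14 days.' }),
]);

// Retrieval phase: find the documents most relevant to the query,
// then hand them to the LLM as context for the final answer.
const hits = await store.similaritySearch('How long do refunds take?', 1);
const llm = new ChatOllama({ model: 'llama3.1', temperature: 0 });
const answer = await llm.invoke(
  `Answer using only this context:\n${hits.map((d) => d.pageContent).join('\n')}\n\nQuestion: How long do refunds take?`,
);
console.log(answer.content);
```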
Key Components of RAG
- Retrieval Module: This component retrieves relevant information from external sources based on the user query.
- Generation Module: This component uses the retrieved information to generate a response, often leveraging a large language model.
- Integration Mechanism: This ensures a seamless flow of information between the retrieval and generation modules (a minimal sketch of this wiring follows the list).
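The full example below wires these components together inside a class. As a rough, standalone sketch of how they map onto LangChain (the model names and prompt here are assumptions, not part of any official recipe), the wiring looks like this:

```typescript
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { ChatOllama, OllamaEmbeddings } from '@langchain/ollama';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { createRetrievalChain } from 'langchain/chains/retrieval';
import { createStuffDocumentsChain } from 'langchain/chains/combine_documents';

// Retrieval module: a vector-store-backed retriever over the indexed documents.
const retriever = new MemoryVectorStore(
  new OllamaEmbeddings({ model: 'nomic-embed-text' }),
).asRetriever();

// Generation module: an LLM that answers from the documents stuffed into its prompt.
const combineDocsChain = await createStuffDocumentsChain({
  llm: new ChatOllama({ model: 'llama3.1' }),
  prompt: ChatPromptTemplate.fromMessages([
    ['system', 'Use the given context to answer the question. Context: {context}'],
    ['human', '{input}'],
  ]),
});

// Integration mechanism: pipes the retrieved documents into the generation chain.
const ragChain = await createRetrievalChain({ retriever, combineDocsChain });
const result = await ragChain.invoke({ input: 'What does the indexed data say?' });
console.log(result.answer);
```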
Example
In this example, I fetch data from Wikipedia and use it to generate a response.
The source of truth is the content of the Wikipedia article. Even though the LLM likely saw this information during training, it answers based on the data retrieved from the article.
This demonstrates how RAG narrows down the information the LLM has to process, improving the speed and accuracy of the response and reducing the risk of hallucination.
```typescript
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { Runnable } from '@langchain/core/runnables';
import { OllamaEmbeddings, ChatOllama } from '@langchain/ollama';
import { createRetrievalChain } from 'langchain/chains/retrieval';
import { createStuffDocumentsChain } from 'langchain/chains/combine_documents';
import { Document } from 'langchain/document';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { WikipediaQueryRun } from '@langchain/community/tools/wikipedia_query_run';

export class WikiGpt {
  embeddingsModel: OllamaEmbeddings;
  vectorStore: MemoryVectorStore;
  combineDocsChain: Promise<Runnable>;
  wikipediaRetriever: WikipediaQueryRun;
  conversationHistory: string[] = [];

  constructor() {
    // Local chat model served by Ollama; temperature 0 keeps answers deterministic.
    const llmModel = new ChatOllama({
      model: 'llama3.1',
      temperature: 0,
      maxRetries: 2,
    });

    // Embedding model used to index documents in the vector store.
    this.embeddingsModel = new OllamaEmbeddings({
      model: 'nomic-embed-text',
      baseUrl: 'http://localhost:11434',
    });

    // In-memory vector store holding the ingested documents.
    this.vectorStore = new MemoryVectorStore(this.embeddingsModel);

    const systemPrompt = `
      Use the given context to answer the question.
      If you don't know the answer, say you don't know.
      Use three sentences maximum and keep the answer concise.
      Context: {context}
    `;
    const prompt = ChatPromptTemplate.fromMessages([
      ['system', systemPrompt],
      ['human', '{input}'],
    ]);

    // Generation module: stuffs the retrieved documents into the prompt.
    this.combineDocsChain = createStuffDocumentsChain({
      llm: llmModel,
      prompt,
    });

    // External data source used during the ingestion phase.
    this.wikipediaRetriever = new WikipediaQueryRun({
      topKResults: 3,
      maxDocContentLength: 7000,
    });
  }

  // Ingestion: embed a piece of text and store it in the vector store.
  async storeMemory(memory: string) {
    const doc = new Document({
      pageContent: memory,
    });
    await this.vectorStore.addDocuments([doc]);
  }

  // Integration: wire the vector-store retriever to the generation chain.
  async getChain() {
    const retriever = this.vectorStore.asRetriever();
    return createRetrievalChain({
      retriever,
      combineDocsChain: await this.combineDocsChain,
    });
  }

  // Retrieval + generation: answer a query from the stored documents,
  // prepending previous answers as lightweight conversation history.
  async getRelevantMemory(query: string) {
    const chain = await this.getChain();
    let input = query;

    if (this.conversationHistory.length) {
      const context = this.conversationHistory.join('\n');
      input = `${context}\n${query}`;
    }

    const response = await chain.invoke({ input });
    if (response.answer) {
      this.conversationHistory.push(response.answer);
    }
    return response.answer;
  }

  // Fetch raw article content from Wikipedia for a given query.
  async fetchWikipediaData(query: string) {
    return this.wikipediaRetriever.invoke(query);
  }
}
```
Result
As part of this example, I ask the LLM what LangChain is.
```typescript
import { WikiGpt } from './wiki-gpt';

const wikiGpt = new WikiGpt();

// Fetch data from Wikipedia
const wikipediaData = await wikiGpt.fetchWikipediaData("LangChain");
console.log("Fetched Wikipedia Data:", wikipediaData);

// Store the fetched data as a memory
await wikiGpt.storeMemory(wikipediaData);

// Retrieve relevant memory
const query = "What is LangChain?";
const relevantMemory = await wikiGpt.getRelevantMemory(query);
console.log("Relevant Memory:", relevantMemory);

// Continue the conversation
const followUpQuery = "Can you tell me about Donald Trump?";
const followUpResponse = await wikiGpt.getRelevantMemory(followUpQuery);
console.log("Follow-Up Response:", followUpResponse);
```
Running this produces the following answer about LangChain:
```
Relevant Memory: LangChain is a software framework that helps integrate large language models (LLMs) into applications, facilitating tasks such as document analysis and summarization, chatbots, and code analysis. It was launched in October 2022 as an open-source project.
```
The follow-up question, however, was not about LangChain, so the model correctly declines to answer from the stored context:
```
Follow-Up Response: I don’t know anything about Donald Trump. The context provided is about LangChain, a software framework for integrating large language models into applications.
```
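As a final, hedged sketch (reusing the WikiGpt methods from above; the exact output will vary with the model and the live article content), the follow-up can be grounded the same way by ingesting the relevant article first:

```typescript
// Fetch and store the relevant article, then ask the follow-up again.
const trumpData = await wikiGpt.fetchWikipediaData("Donald Trump");
await wikiGpt.storeMemory(trumpData);

const groundedFollowUp = await wikiGpt.getRelevantMemory("Can you tell me about Donald Trump?");
console.log("Grounded Follow-Up:", groundedFollowUp);
```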