Building Personal Chatbot - Part 2

Enhancing Our Obsidian Chatbot: Advanced RAG Techniques with Langchain

In our previous post, we explored building a chatbot for Obsidian notes using Langchain and basic Retrieval-Augmented Generation (RAG) techniques. Today, I am sharing the significant improvements I've made to enhance the chatbot's performance and functionality. These advancements have transformed our chatbot into a more effective and trustworthy tool for navigating my Obsidian knowledge base.

System Architecture: The Blueprint of Our Enhanced Chatbot

Let's start by looking at our updated system architecture:

[System architecture diagram of the enhanced chatbot]

This diagram illustrates the flow of our enhanced chatbot, showcasing how each component works together to deliver a seamless user experience. Now, let's dive deeper into each of these components and understand their role in making our chatbot smarter and more efficient.

Key Improvements: Unlocking New Capabilities

Our journey of improvement focused on four key areas, each addressing a specific challenge in making our chatbot more responsive and context-aware. Let's explore these enhancements and see how they work together to create a more powerful tool.

1. MultiQuery Retriever: Casting a Wider Net

Imagine you're trying to find a specific memory in your vast sea of notes. Sometimes, the way you phrase your question might not perfectly match how you wrote it down. That's where our new MultiQuery Retriever comes in – it's like having a team of creative thinkers helping you remember!

# Wrap the base retriever so each user question is expanded into several phrasings
self.multiquery_retriever = CustomMultiQueryRetriever.from_llm(
    self.retriever, llm=self.llm, prompt=self.multiquery_retriever_template
)

The MultiQuery Retriever is a clever addition that generates multiple variations of your original question. Let's see it in action:

Suppose you ask: "What was that interesting AI paper I read last month?"

Our MultiQuery Retriever might generate these variations:

  1. "What artificial intelligence research paper did I review in the previous month?"
  2. "Can you find any notes about a fascinating AI study from last month?"
  3. "List any machine learning papers I found intriguing about 30 days ago."

By creating these diverse phrasings, we significantly increase our chances of finding the relevant information. Maybe you didn't use the term "AI paper" in your notes, but instead wrote "machine learning study." The MultiQuery Retriever helps bridge these verbal gaps, ensuring we don't miss important information due to slight differences in wording.
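To make this concrete, here is a minimal sketch of the kind of prompt such a retriever can use to generate these variations. The wording below is illustrative only; the actual multiquery_retriever_template in my code is different.

from langchain_core.prompts import PromptTemplate

# Illustrative prompt only -- the real multiquery_retriever_template is worded differently
multiquery_prompt = PromptTemplate.from_template(
    """You are an AI assistant helping to search a personal knowledge base.
Generate 3 different rephrasings of the user question below so that notes
written with different wording can still be retrieved.
Return one rephrasing per line.

Question: {question}"""
)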

This approach is particularly powerful for:

  • Complex queries that might be interpreted in multiple ways
  • Recalling information when you're not sure about the exact phrasing you used
  • Uncovering related information that you might not have thought to ask about directly

The result? A much more robust and forgiving search experience that feels almost intuitive, as if the chatbot truly understands the intent behind your questions, not just the literal words you use.

Now that we've expanded our search capabilities, let's look at how we've improved the chatbot's understanding of time and context.

2. SelfQuery Retriever: Your Personal Time-Traveling Assistant

While the MultiQuery Retriever helps us find information across different phrasings, the SelfQuery Retriever adds another dimension to our search capabilities: time. Imagine having a super-smart assistant who not only understands your questions but can also navigate through time in your personal knowledge base. That's essentially what our SelfQuery Retriever does – it's like giving our chatbot a time machine!

# Build a self-querying retriever that can translate natural-language constraints
# (like date ranges) into metadata filters against the Pinecone index
self.retriever = CustomSelfQueryRetriever.from_llm(
    llm=self.llm,
    vectorstore=self.pinecone_retriever,
    document_contents=self.__class__.document_content_description,
    metadata_field_info=self.__class__.metadata_field_info,
)
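The document_contents and metadata_field_info arguments tell the retriever what the notes contain and which metadata fields it is allowed to filter on. Here is a minimal sketch of what those class attributes might look like; the descriptions are an assumption and differ from the ones in my code.

from langchain.chains.query_constructor.base import AttributeInfo

# Illustrative attribute definitions -- the real descriptions in the project differ
document_content_description = "Daily notes from my Obsidian knowledge base"

metadata_field_info = [
    AttributeInfo(
        name="date",
        description="The note's creation date as an integer in YYYYMMDD format, e.g. 20240401",
        type="integer",
    ),
]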

The SelfQuery Retriever is a game-changer for handling queries that involve dates. It's particularly useful when you're trying to recall events or information from specific timeframes in your notes. Let's see it in action:

Suppose you ask: "What projects was I excited about in the first week of April 2024?"

Here's what happens behind the scenes:

  1. The SelfQuery Retriever analyzes your question and understands that you're looking for:
    • Information about projects
    • Specifically from the first week of April 2024
    • With a positive sentiment ("excited about")
  2. It then translates this into a structured query that might look something like this:

    {
      "query": "projects excited about",
      "filter": "and(gte(date, 20240401), lte(date, 20240407))"
    }
    
  3. This structured query is used to search your vector database, filtering for documents within that specific date range and then ranking them based on relevance to "projects excited about".

The magic here is that the SelfQuery Retriever can handle a wide range of natural language date queries:

  • "What did I work on last summer?"
  • "Show me my thoughts on AI from Q1 2024"
  • "Any breakthroughs in my research during the holiday season?"

It understands these temporal expressions and converts them into precise date ranges for searching your notes.

The result? A chatbot that feels like it has an intuitive understanding of time, capable of retrieving memories and information from specific periods in your life with remarkable accuracy. It's like having a personal historian who knows exactly when and where to look in your vast archive of experiences.

This capability is particularly powerful for:

  • Tracking progress on long-term projects
  • Recalling ideas or insights from specific time periods
  • Understanding how your thoughts or focus areas have evolved over time

With the SelfQuery Retriever, your Obsidian chatbot doesn't just search your notes – it understands the temporal context of your knowledge, making it an invaluable tool for reflection, planning, and personal growth.

But how does the chatbot know when each note was created? Let's explore how we've added this crucial information to our system.

3. Adding Date Metadata: Timestamping Your Thoughts

To support date-based queries and make the SelfQuery Retriever truly effective, we needed a way to associate each note with its creation date. This is where date metadata comes into play. I’ve implemented a system to extract the date from each note's filename and add it as metadata during the indexing process:

import re
from datetime import datetime
from typing import Optional

DATE_FORMAT = "%Y-%m-%d"  # daily notes are named like 2024-03-04.md

def extract_date_from_filename(filename: str) -> Optional[int]:
    # Look for a leading YYYY-MM-DD date in the note's filename
    match = re.match(r"(\d{4}-\d{2}-\d{2})", filename)
    if match:
        date_str = match.group(1)
        try:
            date_obj = datetime.strptime(date_str, DATE_FORMAT)
            # Store the date as an integer (e.g. 20240304) so it can be range-filtered
            return int(date_obj.strftime("%Y%m%d"))
        except ValueError:
            return None
    return None

# In the indexing process
document.metadata["date"] = extract_date_from_filename(file)

This metadata allows our SelfQuery Retriever to efficiently filter documents based on date ranges or specific dates mentioned in user queries. It's like giving each of your notes a timestamp, allowing the chatbot to organize and retrieve them chronologically when needed.

With our chatbot now able to understand both the content and the temporal context of your notes, we've added one more crucial element to make it even more helpful: the ability to remember and use information from your conversation.

4. Enhancing MultiQuery Retriever with Chat History: Context-Aware Question Generation

In our previous iteration, we already used chat history to provide context for our LLM's responses. However, we've now taken this a step further by incorporating chat history into our MultiQuery Retriever. This enhancement significantly improves the chatbot's ability to understand and respond to context-dependent queries, especially in ongoing conversations.

Let's see how this works in practice:

Imagine you're having a conversation with your chatbot about your work projects:

You: "What projects did I work on March 1?" Chatbot: [Provides a response about your March 1 projects]

You: "How about March 2?"

Without context, the MultiQuery Retriever might generate variations like:

  1. "What happened on March 2?"
  2. "Events on March 2"
  3. "March 2 activities"

These queries, while related to the date, miss the crucial context about projects.

However, with our chat history-aware MultiQuery Retriever, it might generate variations like:

  1. "What projects did I work on March 2?"
  2. "Project activities on March 2"
  3. "March 2 project updates"

These variations are much more likely to retrieve relevant information about your projects on March 2, maintaining the context of your conversation.

This improvement is crucial for maintaining coherent, context-aware conversations. Without it, the MultiQuery Retriever could sometimes generate less useful variations, particularly in multi-turn interactions where the context from previous messages is essential.

By making the MultiQuery Retriever aware of chat history, we've significantly enhanced its ability to generate relevant query variations. This leads to more accurate document retrieval and, ultimately, more contextually appropriate responses from the chatbot.
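On the prompt side, the only change needed is to give the query-generation prompt access to the conversation so far. Here is a hedged sketch of what multiquery_retriever_template could look like; the actual wording in my code differs.

from langchain_core.prompts import PromptTemplate

# Illustrative prompt only -- the real multiquery_retriever_template is worded differently
multiquery_retriever_template = PromptTemplate.from_template(
    """You are an AI assistant helping to search a personal knowledge base.
Use the conversation history below to resolve references in the user's latest
question (for example, "How about March 2?" after a question about projects),
then generate 3 alternative phrasings of that question.
Return one phrasing per line.

Conversation history:
{history}

Latest question: {question}"""
)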

This enhancement truly brings together the power of our previous improvements. The MultiQuery Retriever now not only casts a wider net with multiple phrasings but does so with an understanding of the conversation's context. Combined with our SelfQuery Retriever's ability to handle temporal queries and our robust date metadata, we now have a chatbot that can navigate your personal knowledge base with remarkable context awareness and temporal understanding.

Custom Implementations: Tailoring the Tools to Our Needs

To achieve these enhancements, we created several custom classes, each designed to extend the capabilities of Langchain's base components. Let's take a closer look at two key custom implementations:

  1. CustomMultiQueryRetriever: This class extends the base MultiQueryRetriever to incorporate chat history in query generation.
  2. CustomSelfQueryRetriever: We customized the SelfQuery Retriever to work seamlessly with our Pinecone vector store and handle date-based queries effectively.

Here's a snippet from our CustomMultiQueryRetriever to give you a taste of how we've tailored these components:

from typing import List

from langchain.chains import LLMChain
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document


class CustomMultiQueryRetriever(MultiQueryRetriever):
    def _get_relevant_documents(
        self,
        query: str,
        history: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
    ) -> List[Document]:
        # Generate alternative phrasings of the question, informed by the chat history
        queries = self.generate_queries(query, history, run_manager)
        if self.include_original:
            queries.append(query)
        documents = self.retrieve_documents(queries, run_manager)
        # Deduplicate results retrieved by the different query variations
        return self.unique_union(documents)

    def generate_queries(
        self, question: str, history: str, run_manager: CallbackManagerForRetrieverRun
    ) -> List[str]:
        # Pass both the question and the conversation history into the prompt
        response = self.llm_chain.invoke(
            {"question": question, "history": history},
            config={"callbacks": run_manager.get_child()},
        )
        if isinstance(self.llm_chain, LLMChain):
            lines = response["text"]
        else:
            lines = response
        return lines

These custom implementations allow us to tailor the retrieval process to our specific needs, improving the overall performance and relevance of the chatbot's responses.

While these enhancements have significantly improved our chatbot, the journey wasn't without its challenges. Let's reflect on some of the hurdles we faced and the lessons we learned along the way.

Challenges and Learnings: Navigating the Complexities of Langchain

While Langchain provides a powerful framework for building RAG systems, we found that its complexity can sometimes be challenging. Digging into different parts of the codebase to understand and modify behavior required significant effort. However, this process also provided valuable insights into the inner workings of RAG systems and allowed us to create a more tailored solution for our Obsidian chatbot.

Some key learnings from this process include:

  • The importance of thoroughly understanding each component before attempting to customize it
  • The value of incremental improvements and testing each change individually
  • The need for patience when working with complex, interconnected systems

These challenges, while sometimes frustrating, ultimately led to a deeper understanding of RAG systems and a more robust final product.

Now that we've enhanced our chatbot with these powerful features, let's explore some of the exciting ways it can be used.

Use Cases and Examples: Putting Our Enhanced Chatbot to Work

With these improvements, our Obsidian chatbot is now capable of handling a wider range of queries with improved accuracy. Here are some example use cases that showcase its new capabilities:

  1. Date-specific queries: "What projects was I working on in the first week of March 2024?"
  2. Context-aware follow-ups: "Tell me more about the meeting I had last Tuesday."
  3. Complex information retrieval: "Summarize my progress on Project X over the last month."

These examples demonstrate the chatbot's ability to understand temporal context, maintain conversation history, and provide more relevant responses. It's not just a search tool anymore – it's becoming a true digital assistant that can help you navigate and make sense of your personal knowledge base.

As exciting as these improvements are, we're not stopping here. Let's take a quick look at what's on the horizon for our Obsidian chatbot.

Future Plans: The Road Ahead

While we've made significant strides in improving our chatbot, there's always room for further enhancements. One exciting avenue we're exploring is the integration of open-source LLMs to make the system more privacy-focused and self-contained. This could potentially allow users to run the entire system locally, ensuring complete privacy of their personal notes and queries.

Conclusion: A Smarter, More Intuitive Chatbot for Your Personal Knowledge Base

By implementing advanced RAG techniques such as MultiQuery Retriever, SelfQuery Retriever, and incorporating chat history, we've significantly enhanced our Obsidian chatbot's capabilities. These improvements allow for more accurate and contextually relevant responses, especially for date-based queries and complex information retrieval tasks.

Building this enhanced chatbot has been a journey of continuous learning and iteration. We've tackled challenges, discovered new possibilities, and created a tool that we hope will make navigating personal knowledge bases easier and more intuitive.

We hope that sharing our experience will inspire and help others in the community who are working on similar projects. Whether you're looking to build your own chatbot or simply interested in the possibilities of AI-assisted knowledge management, we hope this post has provided valuable insights.

You can find the final code in this GitHub repo

If you have any feedback or simply want to connect, please hit me up on LinkedIn or @prabha-tweet

Building an Obsidian Knowledge Base Chatbot: A Journey of Iteration and Learning

As an avid Obsidian user, I've always been fascinated by the potential of leveraging my daily notes as a personal knowledge base. Obsidian has become my go-to tool for taking notes, thanks to its simplicity and the wide range of customization options available through community plugins. With the notes and calendar plugins enabled, I can easily capture my daily thoughts and keep track of the projects I'm working on. But what if I could take this a step further and use these notes as the foundation for a powerful chatbot?

Imagine having a personal assistant that could answer questions like:

  1. "What was that fascinating blog post I read last week?"
  2. "Which projects was I working on back in February 2024?"
  3. "Could you give me a quick summary of my activities from last week?"

Excited by the possibilities, I embarked on a journey to build a chatbot that could do just that. In this blog post, I'll share my experience of building this chat app from scratch, including the challenges I faced, the decisions I had to make, and the lessons I learned along the way. You can find the final code in this GitHub repo

Iteration 1: Laying the Groundwork

To kick things off, I decided to start with a simple Retrieval-Augmented Generation (RAG) system for the app. The stack I chose consisted of:

  • Pinecone for the Vector DB
  • Streamlit for creating the chat interface
  • Langchain framework for tying everything together
  • OpenAI for the Language Model (LLM) and embeddings

I began by embedding my Obsidian daily notes into a Pinecone Vector database. Since my notes aren't particularly lengthy, I opted to embed each daily note as a separate document. Pinecone's simplicity and quick setup allowed me to focus on building the chatbot's functionality rather than getting bogged down in infrastructure.

For the language model, I chose OpenAI's GPT-4, as its advanced reasoning capabilities would simplify the app-building process and reduce the need for extensive preprocessing.
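For reference, here is a minimal sketch of what this initial indexing and retrieval setup can look like with Langchain. The vault path and index name are placeholders, and my actual code differs in the details.

from langchain_community.document_loaders import ObsidianLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA

# Load every note in the vault; each daily note becomes one document
docs = ObsidianLoader("/path/to/obsidian/vault").load()

# Embed the notes and store them in a Pinecone index (name is a placeholder)
vectorstore = PineconeVectorStore.from_documents(
    docs, embedding=OpenAIEmbeddings(), index_name="obsidian-notes"
)

# Simple RAG chain: retrieve the most similar notes, then let GPT-4 answer
llm = ChatOpenAI(model="gpt-4")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

answer = qa_chain.invoke({"query": "What did I do on March 4, 2024?"})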

The initial chatbot workflow was straightforward: embed each daily note, retrieve the most relevant notes for a question, and have GPT-4 answer using those notes as context.

The first version of the chatbot was decent, but I wanted to find a way to measure its performance and track progress as I iterated. After some research, I discovered the RAGAS framework, which is designed specifically for evaluating retrieval-augmented generation systems. By creating a dataset with question-answer pairs, I could measure metrics like answer correctness, relevancy, context precision, recall, and faithfulness.
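A rough sketch of how such an evaluation can be wired up with RAGAS is shown below. The row is a made-up example, the exact column names depend on the RAGAS version, and the dataset I actually used is different.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, context_precision, faithfulness

# Each row pairs a question with the chatbot's answer, the retrieved contexts,
# and a hand-written ground-truth answer (values here are illustrative)
eval_data = Dataset.from_dict({
    "question": ["What did I do on March 4, 2024?"],
    "answer": ["You worked on the Obsidian chatbot and read a paper on RAG."],
    "contexts": [["2024-03-04: Worked on the Obsidian chatbot, read a RAG paper..."]],
    "ground_truth": ["Worked on the Obsidian chatbot and read a RAG paper."],
})

results = evaluate(
    eval_data,
    metrics=[answer_correctness, answer_relevancy, context_precision, faithfulness],
)
print(results)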

[Screenshot of the chatbot interface]

I included all the metrics available through the RAGAS library, as I was curious to see how they would be affected by my improvements. You can read more about RAGAS metrics here. At this stage, I wasn't sure what to make of the numbers or whether they indicated good or bad performance, but it was a starting point.

| Metric                | Base Performance |
|-----------------------|------------------|
| Answer_correctness    | 0.42             |
| Answer_relevancy      | 0.39             |
| Answer_similarity     | 0.84             |
| Context_entity_recall | 0.27             |
| Context_precision     | 0.71             |
| Context_recall        | 0.43             |
| Context_relevancy     | 0.01             |
| Faithfulness          | 0.39             |

Iteration 2: Refining the Approach

With the evaluation framework in place, I reviewed the examples and runs to identify areas for improvement. One thing that stood out was the presence of Dataview queries in my notes. These queries are used in Obsidian to pull data from various notes, similar to SQL queries. However, they don't execute and provide results when the Markdown file is viewed or accessed outside of Obsidian. I realized that these queries might be introducing noise and not adding much value, so I decided to remove them.
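The cleanup itself can be as simple as stripping fenced dataview blocks from each note's text before indexing. This is a minimal sketch under that assumption, reusing the docs list from the indexing sketch above; it is not necessarily the exact preprocessing I ended up with.

import re

DATAVIEW_BLOCK = re.compile(r"```dataview.*?```", re.DOTALL)

def strip_dataview_queries(text: str) -> str:
    # Remove fenced dataview code blocks, which never render outside Obsidian
    return DATAVIEW_BLOCK.sub("", text)

# Applied to each loaded document before embedding
for document in docs:
    document.page_content = strip_dataview_queries(document.page_content)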

After making this change and re-evaluating the chatbot, I was surprised to see that the answer metrics had actually gone down. Digging deeper, I discovered that the vector search wasn't yielding the correct daily notes, even for straightforward queries like "What did I do on March 4, 2024?" On the bright side, context precision had improved since the context no longer contained Dataview queries.

| Metric                | Base | Iteration 2 |
|-----------------------|------|-------------|
| Answer_correctness    | 0.42 | 0.34        |
| Answer_relevancy      | 0.39 | 0.36        |
| Answer_similarity     | 0.84 | 0.81        |
| Context_entity_recall | 0.27 | 0.09        |
| Context_precision     | 0.71 | 0.87        |
| Context_recall        | 0.43 | 0.42        |
| Context_relevancy     | 0.01 | 0.02        |
| Faithfulness          | 0.39 | 0.69        |

To address the issue with vector search, I made two adjustments:

  1. Increased the number of documents returned by the retriever from the default 4 to 20.
  2. Switched to using a MultiQuery retriever.

The goal was to retrieve a larger set of documents, even if their relevancy scores were low, in the hopes that the reranker model would be able to identify and prioritize the most relevant ones.
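In Langchain terms, both adjustments are small. Here is a hedged sketch of what they can look like, reusing the vectorstore and llm from the setup sketch earlier:

from langchain.retrievers.multi_query import MultiQueryRetriever

# 1. Return more candidate documents per search (the default is 4)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# 2. Expand each question into several phrasings before retrieving
multiquery_retriever = MultiQueryRetriever.from_llm(retriever=base_retriever, llm=llm)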

These changes led to a slight improvement in the answer-related metrics compared to the previous iterations. However, the context-related metrics took a hit due to the increased number of documents being considered. I was willing to accept this trade-off for now, as my notes were well-structured, and I believed a highly capable LLM should be able to extract the necessary information.

| Metric                | Base | Iteration 2 | Iteration 2.1 |
|-----------------------|------|-------------|---------------|
| Answer_correctness    | 0.42 | 0.34        | 0.45          |
| Answer_relevancy      | 0.39 | 0.36        | 0.48          |
| Answer_similarity     | 0.84 | 0.81        | 0.85          |
| Context_entity_recall | 0.27 | 0.09        | 0.15          |
| Context_precision     | 0.71 | 0.87        | 0.62          |
| Context_recall        | 0.43 | 0.42        | 0.35          |
| Context_relevancy     | 0.01 | 0.02        | 0.00          |
| Faithfulness          | 0.39 | 0.69        | 0.56          |

Iteration 3: Updating the Evaluation Dataset

As I reviewed the evaluation run, I noticed an interesting pattern. When there were no relevant notes to answer a question, the LLM correctly responded with "I don't know." This matched the ground truth, but the answer correctness was being computed as 0.19 instead of a value closer to 1.

To improve the evaluation process, I updated the dataset to include "I don't know" as the expected answer in cases where no relevant information was available. This simple change had a significant impact on the answer metrics, providing a more accurate assessment of the chatbot's performance.

| Metric                | Base | Iteration 2 | Iteration 2.1 | Iteration 3 |
|-----------------------|------|-------------|---------------|-------------|
| Answer_correctness    | 0.42 | 0.34        | 0.45          | 0.62        |
| Answer_relevancy      | 0.39 | 0.36        | 0.48          | 0.60        |
| Answer_similarity     | 0.84 | 0.81        | 0.85          | 0.89        |
| Context_entity_recall | 0.27 | 0.09        | 0.15          | 0.14        |
| Context_precision     | 0.71 | 0.87        | 0.62          | 0.62        |
| Context_recall        | 0.43 | 0.42        | 0.35          | 0.37        |
| Context_relevancy     | 0.01 | 0.02        | 0.00          | 0.00        |
| Faithfulness          | 0.39 | 0.69        | 0.56          | 0.61        |

The Journey Continues...

At this point, I have a functional chatbot that serves as a powerful search engine for my personal knowledge base. While I'm happy with the progress so far, there's still room for improvement. Some ideas for future iterations include:

  • Implementing document retrieval based on metadata like date, to provide more accurate answers for time-sensitive questions.
  • Exploring the use of open-source LLMs like Llama 3 to keep my data private and self-contained.

Building this chatbot has been an incredible learning experience, showcasing the power of combining Obsidian, vector databases, and language models. Not only has it given me a valuable tool for accessing my own knowledge, but it has also highlighted the importance of iterative development and continuous evaluation.

I hope my journey inspires other Obsidian enthusiasts to explore the possibilities of creating their own personal knowledge base chatbots. By leveraging our daily notes and harnessing the power of AI, we can unlock new ways to interact with and learn from the information we capture.

You can find the final code in this GitHub repo

If you have any feedback or simply want to connect, please hit me up on LinkedIn or @prabha-tweet

Quantized LLM Models

Large Language Models (LLMs) are known for their vast number of parameters, often reaching billions. For example, open-source models like Llama2 come in sizes of 7B, 13B, and 70B parameters, while Google's Gemma has 2B parameters. Although OpenAI's GPT-4 architecture is not publicly shared, it is speculated to have more than a trillion parameters, with 8 models working together in a mixture of experts approach.

Understanding Parameters

A parameter is a model weight learned during the training phase. The number of parameters can be a rough indicator of a model's capability and complexity. These parameters are used in huge matrix multiplications across each layer until an output is produced.

The Problem with Large Number of Parameters

As LLMs have billions of parameters, loading all the parameters into memory and performing massive matrix multiplications becomes a challenge. Let's consider the math behind this:

For a 70B parameter model (like the Llama2-70B model), the default size in which these parameters are stored is 32 bits (4 bytes). To load this model, you would need:

70B parameters * 4 bytes = 280 GB (roughly 260 GiB) of memory

This highlights the significant memory requirements for running LLMs.
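The arithmetic is easy to reproduce; here is a quick back-of-the-envelope calculation in Python:

params = 70e9          # 70B parameters
bytes_per_param = 4    # FP32 = 32 bits = 4 bytes

total_bytes = params * bytes_per_param
print(f"{total_bytes / 1e9:.0f} GB")     # ~280 GB (decimal gigabytes)
print(f"{total_bytes / 2**30:.0f} GiB")  # ~261 GiB (binary gibibytes)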

Quantization as a Solution

Quantization is a technique that reduces a model's size by lowering the precision of its parameters so they can be stored in less memory, for example by representing 32-bit floating-point (FP32) parameters as 16-bit floating-point (FP16) values.

In practice, this loss of precision does not significantly degrade the output quality of LLMs but offers substantial performance improvements in terms of efficiency. By quantizing the model, the memory footprint can be reduced, making it more feasible to run LLMs on resource-constrained systems.

Quantization allows for a trade-off between model size and performance, enabling the deployment of LLMs in a wider range of applications and devices. It is an essential technique for making LLMs more accessible and efficient while maintaining their impressive capabilities.
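As a concrete example, with the Hugging Face Transformers library the same model can be loaded at full or half precision simply by changing the requested dtype. This is a minimal sketch of that idea; it assumes you have access to the Gemma weights and enough memory for the variant you load.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full precision: every parameter stored as a 32-bit float
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# Half precision: the same weights cast to 16-bit floats, roughly half the memory
model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)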

The table below compares the performance of Google’s 2B Gemma model with 32-bit and 16-bit precision. The quantized 16-bit model is 28% faster with approximately 50% less memory usage.

|                                   | Gemma FP32 (32-bit precision) | Gemma FP16 (16-bit precision) |
|-----------------------------------|-------------------------------|-------------------------------|
| # of Parameters                   | 2,506,172,416                 | 2,506,172,416                 |
| Memory Size based on # Parameters | ≈ 2.5B * 4 bytes = 9.33 GB    | ≈ 2.5B * 2 bytes = 4.66 GB    |
| Memory Footprint                  | 9.39 GB                       | 4.73 GB                       |
| Average Inference time            | 10.36 seconds                 | 7.48 seconds                  |

[Figure: distribution of inference time for the FP32 and FP16 models]

Impact on Accuracy

To assess the impact of quantization on accuracy, I generated outputs from both models, embedded them with OpenAI's text-embedding-3-large model, and computed similarity scores. The outputs of the 32-bit and 16-bit models were nearly identical, with a cosine similarity of 0.998, indicating that quantization does not significantly affect the model's accuracy.
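For reference, here is a sketch of how such a similarity check can be done with the OpenAI embeddings API; output_fp32 and output_fp16 stand in for the text generated by the two model variants, and my actual evaluation script differs.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# output_fp32 / output_fp16: generated answers from the FP32 and FP16 models
similarity = cosine_similarity(embed(output_fp32), embed(output_fp16))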

In conclusion, quantization is a powerful technique for reducing the memory footprint and improving the efficiency of LLMs while maintaining their performance. By enabling the deployment of LLMs on a wider range of devices and applications, quantization plays a crucial role in making these impressive models more accessible and practical for real-world use cases.

Note

Inference time and accuracy were measured on 100 random questions; you can find them in the Colab notebook.

Good Resource on this topic

DLAI - Quantization Fundamentals

If you have any feedback or simply want to connect, please hit me up on LinkedIn or @prabha-tweet