The Language of Agents: Decoding Messages in LangChain & LangGraph

Ever wondered how apps get AI to chat, follow instructions, or even use tools? A lot of the magic comes down to "messages." Think of them as the notes passed between you, the AI, and any other services involved. LangChain and LangGraph are awesome tools that help manage these messages, making it easier to build cool AI-powered stuff. Let's break down how it works, keeping it simple!

The Main Players: Core Message Types

LangChain uses a few key message types to keep conversations organized. These are the building blocks for almost any chat interaction.

SystemMessage: Setting the Scene

This message sets the stage. It tells the AI how to behave – its personality, its job, or any ground rules. Think of it as whispering to the AI, "You're a super helpful assistant who loves pirate jokes." You usually send this one first. LangChain figures out how to pass this instruction to different AI models, even if they have their own quirks for system prompts.

HumanMessage: What You Say

Simple enough – this is your input. When you ask a question or give a command, LangChain wraps it up as a HumanMessage. It can be plain text or even include images if the AI supports it. If you just send a string to a chat model, LangChain often handily turns it into a HumanMessage for you.
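Here's a quick sketch of both forms (assuming an OpenAI chat model via langchain-openai; the model name is just an example):

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # example model name

# Explicit message objects
reply = llm.invoke([
    SystemMessage(content="You're a super helpful assistant who loves pirate jokes."),
    HumanMessage(content="Explain vector databases in one sentence."),
])

# A plain string works too; LangChain wraps it in a HumanMessage for you
reply = llm.invoke("Explain vector databases in one sentence.")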

AIMessage: The AI's Response

This is what the AI says back. It's not just text, though! An AIMessage can also include requests for the AI to use "tools" (like searching the web or running some code) and other useful bits like how many tokens it used. If the AI is streaming its response, you'll see AIMessageChunks that build up the full reply.
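A rough sketch of what you might poke at on the reply (attribute names like tool_calls and usage_metadata reflect recent langchain-core versions, so treat this as illustrative):

reply = llm.invoke([HumanMessage(content="What's the weather in Paris?")])

print(reply.content)         # the response text
print(reply.tool_calls)      # tool-use requests, if the model made any
print(reply.usage_metadata)  # token counts, if the provider reports them

# Streaming yields AIMessageChunks that add up to the full reply
for chunk in llm.stream("Tell me a pirate joke"):
    print(chunk.content, end="")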

ToolMessage: Reporting Back from a Mission

If the AI (via an AIMessage) asks to use a tool, your app will run that tool and then send the results back using a ToolMessage. This message needs a special tool_call_id to link it to the AI's original request, which is super important if the AI wants to use multiple tools at once. This is the modern way, an upgrade from the older FunctionMessage.
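A hedged sketch of that round trip (get_weather is a made-up tool for illustration; llm is the chat model from the earlier sketch):

from langchain_core.messages import HumanMessage, ToolMessage
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Return a fake weather report for a city."""
    return f"It is sunny in {city}."

llm_with_tools = llm.bind_tools([get_weather])

messages = [HumanMessage(content="What's the weather in Paris?")]
ai_msg = llm_with_tools.invoke(messages)
messages.append(ai_msg)

# Run each requested tool and report back with the matching tool_call_id
for call in ai_msg.tool_calls:
    result = get_weather.invoke(call["args"])
    messages.append(ToolMessage(content=result, tool_call_id=call["id"]))

final_reply = llm_with_tools.invoke(messages)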

Going Off-Script: ChatMessage and Custom Roles

What if you need a role that's not "system," "user," "assistant," or "tool"? LangChain offers ChatMessage for that. It lets you set any role label you want.

But here's the catch: most big AI models (like OpenAI's GPTs) only understand the standard roles. If you send a ChatMessage with a role like "developer_instructions," they'll likely ignore it or throw an error. So, only use ChatMessage with custom roles if you know your specific AI model supports them. For example, some Ollama models use a "control" role for special commands, and ChatMessage is how you'd send that.
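Constructing one is simple; whether your provider accepts it is the part to verify (sketch only):

from langchain_core.messages import ChatMessage

# Only send a non-standard role if your model documents support for it
msg = ChatMessage(role="control", content="provider-specific command goes here")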

Who Said That? The name Attribute

In a busy chat with multiple users or AI agents, how do you know who said what? All LangChain message classes (HumanMessage, AIMessage, etc.) have an optional name attribute. Its job is to distinguish between different speakers who might share the same role – for instance, to tell "Alice's" HumanMessage from "Bob's."

Provider support for this name field varies. OpenAI’s Chat API (and therefore Azure OpenAI) allows you to set a name on user or assistant messages, so the model can keep track of different participants. However, many other models might ignore or drop the name; LangChain will usually just omit it when sending the request to such models.

In multi-agent setups with LangGraph, this name field is super handy for tagging which agent sent a message, like AIMessage(content="Here’s what I found.", name="ResearchBot"). Even if the underlying AI model doesn't use the name, it's still useful metadata for your application's logic.

Team Chat: Messages in LangGraph Multi-Agent Systems

LangGraph helps you build apps where multiple AI agents work together. Their coordination hinges on a shared message history, typically managed within a component like MessagesState. This acts as a central ledger of the conversation.

Think of it like a meticulously recorded group project chat where everyone sees all messages in the order they were sent. When it's an agent's turn to contribute:

  1. Reads the History: It first accesses the entire current chat history from MessagesState. This gives it full context of everything that has transpired across all participating agents.

  2. Performs its Task: The agent then does its designated job, which might involve thinking, calling an LLM, or using a tool.

  3. Writes Back: Finally, it appends its own messages (e.g., an AIMessage detailing its findings or actions) to the shared MessagesState, thus adding to the ongoing conversation.

This cycle ensures that if Agent A contributes, and then Agent B takes over, Agent B has visibility into the initial request and Agent A's input. This shared, incrementally built history is fundamental.
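Here's a minimal sketch of that read-then-write cycle using LangGraph's prebuilt MessagesState (node and model names are placeholders):

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START, END

llm = ChatOpenAI(model="gpt-4o-mini")  # example model

def research_agent(state: MessagesState):
    history = state["messages"]     # 1. read the full shared history
    reply = llm.invoke(history)     # 2. do the work (here, one LLM call)
    return {"messages": [reply]}    # 3. write back; LangGraph appends it to the shared state

builder = StateGraph(MessagesState)
builder.add_node("research_agent", research_agent)
builder.add_edge(START, "research_agent")
builder.add_edge("research_agent", END)
graph = builder.compile()

result = graph.invoke({"messages": [HumanMessage(content="Find me some AI news.")]})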

Maintaining Order and Clarity in Shared History:

For this system to work without confusion, two things are crucial: messages must be in the correct order, and it must be clear who (or what) sent each message.

  • Chronological Order: MessagesState inherently maintains messages in the order they are added. Each new message is appended, preserving a chronological flow. Furthermore, LangGraph's graph structure itself dictates the sequence of which agent or tool operates next. This controlled execution ensures that contributions are added to the history in a predictable and understandable order.

  • Role and Sender Identification: Knowing "who said what" is vital. While the next section, "Tagging Agents in the Flow," dives deeper into specific techniques like the name attribute and agent-specific SystemMessages, the core idea is that each message carries information about its origin and purpose. The message type itself (e.g., HumanMessage, AIMessage, ToolMessage) provides an initial layer of role definition.

LangGraph leverages this ordered and attributed message history to enable complex interactions. While you can design custom mechanisms to pass specific pieces of information between agents directly, the default and foundational approach is this transparent, shared history that all agents can access and contribute to sequentially.

Tagging Agents in the Flow:

How do you know which agent said what in this shared history?

  • name attribute: As mentioned, AIMessage(content="...", name="FlightBot").

  • System Prompts: Give each agent its own SystemMessage like, "You are HotelBot, specializing in booking accommodations."

  • LangGraph State: You can even add a field to your graph's state to track the last_active_agent.

The safest bet is often to give each agent a clear system instruction about its identity and also log which agent produced each message.
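Putting those together, here's a hedged sketch of a node that does all three (reusing the llm from the earlier sketch; last_active_agent is a custom state field, not a LangGraph built-in):

from langchain_core.messages import AIMessage, SystemMessage
from langgraph.graph import MessagesState

class AgentState(MessagesState):
    last_active_agent: str  # custom field tracking the most recent speaker

HOTEL_BOT_PROMPT = SystemMessage(
    content="You are HotelBot, specializing in booking accommodations."
)

def hotel_bot(state: AgentState):
    reply = llm.invoke([HOTEL_BOT_PROMPT] + state["messages"])
    tagged = AIMessage(content=reply.content, name="HotelBot")
    return {"messages": [tagged], "last_active_agent": "HotelBot"}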

Quick Tips for Message Mastery

  • Order Matters (Usually): A common pattern is SystemMessage first (if you need one), then a HumanMessage from the user, followed by an AIMessage from the model, and then back and forth between Human and AI messages. If tools are involved, ToolMessages follow the AIMessage that requested the tool.

  • Full Context is King: Unless you're specifically managing memory (like trimming old messages), feed the AI the whole accumulated list of messages each time. This gives it the best context. In LangGraph, MessagesState often handles this by injecting the up-to-date history into each agent node.

  • LangChain Simplifies: If you pass a plain string to a chat model, LangChain usually wraps it in a HumanMessage for you.

  • Let LangChain Do the Heavy Lifting: Write your code with LangChain's standard messages. It'll handle the translation to whatever format the specific AI model needs. This makes your code cleaner and easier to switch between different AIs.

  • Mind the Memory: Long chat histories can get too big for an AI's context window (and cost more!). LangGraph and LangChain offer ways to trim or summarize old messages to keep things manageable.
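For that last tip, here's a sketch using langchain-core's trim_messages helper (parameter values are illustrative; messages is your accumulated list and llm your chat model):

from langchain_core.messages import trim_messages

trimmed = trim_messages(
    messages,
    strategy="last",        # keep the most recent messages
    max_tokens=1000,
    token_counter=llm,      # let the chat model count tokens
    include_system=True,    # always keep the SystemMessage
    start_on="human",       # don't start mid tool-call exchange
)
reply = llm.invoke(trimmed)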

Wrapping Up

Messages are the lifeblood of your LangChain and LangGraph applications. Understanding these basic types, how the name field helps with identity, and how messages flow in multi-agent systems will help you build more powerful and reliable AI tools. Stick to the standards when you can, use custom options wisely, and let LangChain handle the provider-specific details. By following these conventions, you'll ensure clear message sequences and effective agent collaboration. Happy building!

Building a Personal Chatbot - Part 2

Enhancing Our Obsidian Chatbot: Advanced RAG Techniques with Langchain

In our previous post, we explored building a chatbot for Obsidian notes using Langchain and basic Retrieval-Augmented Generation (RAG) techniques. Today, I am sharing the significant improvements I've made to enhance the chatbot's performance and functionality. These advancements have transformed our chatbot into a more effective and trustworthy tool for navigating my Obsidian knowledge base.

System Architecture: The Blueprint of Our Enhanced Chatbot

Let's start by looking at our updated system architecture:

![[Pasted image 20240818163253.png]]

This diagram illustrates the flow of our enhanced chatbot, showcasing how each component works together to deliver a seamless user experience. Now, let's dive deeper into each of these components and understand their role in making our chatbot smarter and more efficient.

Key Improvements: Unlocking New Capabilities

Our journey of improvement focused on four key areas, each addressing a specific challenge in making our chatbot more responsive and context-aware. Let's explore these enhancements and see how they work together to create a more powerful tool.

1. MultiQuery Retriever: Casting a Wider Net

Imagine you're trying to find a specific memory in your vast sea of notes. Sometimes, the way you phrase your question might not perfectly match how you wrote it down. That's where our new MultiQuery Retriever comes in – it's like having a team of creative thinkers helping you remember!

self.multiquery_retriever = CustomMultiQueryRetriever.from_llm(
    self.retriever, llm=self.llm, prompt=self.multiquery_retriever_template
)

The MultiQuery Retriever is a clever addition that generates multiple variations of your original question. Let's see it in action:

Suppose you ask: "What was that interesting AI paper I read last month?"

Our MultiQuery Retriever might generate these variations:

  1. "What artificial intelligence research paper did I review in the previous month?"
  2. "Can you find any notes about a fascinating AI study from last month?"
  3. "List any machine learning papers I found intriguing about 30 days ago."

By creating these diverse phrasings, we significantly increase our chances of finding the relevant information. Maybe you didn't use the term "AI paper" in your notes, but instead wrote "machine learning study." The MultiQuery Retriever helps bridge these verbal gaps, ensuring we don't miss important information due to slight differences in wording.

This approach is particularly powerful for:

  • Complex queries that might be interpreted in multiple ways
  • Recalling information when you're not sure about the exact phrasing you used
  • Uncovering related information that you might not have thought to ask about directly

The result? A much more robust and forgiving search experience that feels almost intuitive, as if the chatbot truly understands the intent behind your questions, not just the literal words you use.

Now that we've expanded our search capabilities, let's look at how we've improved the chatbot's understanding of time and context.

2. SelfQuery Retriever: Your Personal Time-Traveling Assistant

While the MultiQuery Retriever helps us find information across different phrasings, the SelfQuery Retriever adds another dimension to our search capabilities: time. Imagine having a super-smart assistant who not only understands your questions but can also navigate through time in your personal knowledge base. That's essentially what our SelfQuery Retriever does – it's like giving our chatbot a time machine!

self.retriever = CustomSelfQueryRetriever.from_llm(
    llm=self.llm,
    vectorstore=self.pinecone_retriever,
    document_contents=self.__class__.document_content_description,
    metadata_field_info=self.__class__.metadata_field_info,
)
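The document_content_description and metadata_field_info referenced above look roughly like this (a sketch of my setup; the AttributeInfo import path can vary between Langchain versions):

from langchain.chains.query_constructor.schema import AttributeInfo

document_content_description = "Daily notes from my Obsidian knowledge base"

metadata_field_info = [
    AttributeInfo(
        name="date",
        description="Creation date of the note as an integer in YYYYMMDD format, e.g. 20240401",
        type="integer",
    ),
]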

The SelfQuery Retriever is a game-changer for handling queries that involve dates. It's particularly useful when you're trying to recall events or information from specific timeframes in your notes. Let's see it in action:

Suppose you ask: "What projects was I excited about in the first week of April 2024?"

Here's what happens behind the scenes:

  1. The SelfQuery Retriever analyzes your question and understands that you're looking for:
    • Information about projects
    • Specifically from the first week of April 2024
    • With a positive sentiment ("excited about")
  2. It then translates this into a structured query that might look something like this:

    {
      "query": "projects excited about",
      "filter": "and(gte(date, 20240401), lte(date, 20240407))"
    }
    
  3. This structured query is used to search your vector database, filtering for documents within that specific date range and then ranking them based on relevance to "projects excited about".

The magic here is that the SelfQuery Retriever can handle a wide range of natural language date queries:

  • "What did I work on last summer?"
  • "Show me my thoughts on AI from Q1 2024"
  • "Any breakthroughs in my research during the holiday season?"

It understands these temporal expressions and converts them into precise date ranges for searching your notes.

The result? A chatbot that feels like it has an intuitive understanding of time, capable of retrieving memories and information from specific periods in your life with remarkable accuracy. It's like having a personal historian who knows exactly when and where to look in your vast archive of experiences.

This capability is particularly powerful for:

  • Tracking progress on long-term projects
  • Recalling ideas or insights from specific time periods
  • Understanding how your thoughts or focus areas have evolved over time

With the SelfQuery Retriever, your Obsidian chatbot doesn't just search your notes – it understands the temporal context of your knowledge, making it an invaluable tool for reflection, planning, and personal growth.

But how does the chatbot know when each note was created? Let's explore how we've added this crucial information to our system.

3. Adding Date Metadata: Timestamping Your Thoughts

To support date-based queries and make the SelfQuery Retriever truly effective, we needed a way to associate each note with its creation date. This is where date metadata comes into play. I’ve implemented a system to extract the date from each note's filename and add it as metadata during the indexing process:

import re
from datetime import datetime
from typing import Optional

DATE_FORMAT = "%Y-%m-%d"  # matches the YYYY-MM-DD prefix of daily-note filenames

def extract_date_from_filename(filename: str) -> Optional[int]:
    match = re.match(r"(\d{4}-\d{2}-\d{2})", filename)
    if match:
        date_str = match.group(1)
        try:
            date_obj = datetime.strptime(date_str, DATE_FORMAT)
            return int(date_obj.strftime("%Y%m%d"))  # e.g. 20240315
        except ValueError:
            return None
    return None

# In the indexing process
document.metadata["date"] = extract_date_from_filename(file)

This metadata allows our SelfQuery Retriever to efficiently filter documents based on date ranges or specific dates mentioned in user queries. It's like giving each of your notes a timestamp, allowing the chatbot to organize and retrieve them chronologically when needed.

With our chatbot now able to understand both the content and the temporal context of your notes, we've added one more crucial element to make it even more helpful: the ability to remember and use information from your conversation.

4. Enhancing MultiQuery Retriever with Chat History: Context-Aware Question Generation

In our previous iteration, we already used chat history to provide context for our LLM's responses. However, we've now taken this a step further by incorporating chat history into our MultiQuery Retriever. This enhancement significantly improves the chatbot's ability to understand and respond to context-dependent queries, especially in ongoing conversations.

Let's see how this works in practice:

Imagine you're having a conversation with your chatbot about your work projects:

You: "What projects did I work on March 1?" Chatbot: [Provides a response about your March 1 projects]

You: "How about March 2?"

Without context, the MultiQuery Retriever might generate variations like:

  1. "What happened on March 2?"
  2. "Events on March 2"
  3. "March 2 activities"

These queries, while related to the date, miss the crucial context about projects.

However, with our chat history-aware MultiQuery Retriever, it might generate variations like:

  1. "What projects did I work on March 2?"
  2. "Project activities on March 2"
  3. "March 2 project updates"

These variations are much more likely to retrieve relevant information about your projects on March 2, maintaining the context of your conversation.

This improvement is crucial for maintaining coherent, context-aware conversations. Without it, the MultiQuery Retriever could sometimes generate less useful variations, particularly in multi-turn interactions where the context from previous messages is essential.

By making the MultiQuery Retriever aware of chat history, we've significantly enhanced its ability to generate relevant query variations. This leads to more accurate document retrieval and, ultimately, more contextually appropriate responses from the chatbot.
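Concretely, the query-generation prompt just gains a history slot. Here's a sketch of what such a template might look like (the wording is illustrative, not my exact prompt):

from langchain_core.prompts import PromptTemplate

multiquery_retriever_template = PromptTemplate(
    input_variables=["question", "history"],
    template=(
        "You generate search queries over my Obsidian notes.\n"
        "Conversation so far:\n{history}\n\n"
        "Latest question: {question}\n\n"
        "Write 3 alternative phrasings of the latest question, one per line, "
        "carrying over any relevant context (topics, dates, projects) from the conversation."
    ),
)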

This enhancement truly brings together the power of our previous improvements. The MultiQuery Retriever now not only casts a wider net with multiple phrasings but does so with an understanding of the conversation's context. Combined with our SelfQuery Retriever's ability to handle temporal queries and our robust date metadata, we now have a chatbot that can navigate your personal knowledge base with remarkable context awareness and temporal understanding.

Custom Implementations: Tailoring the Tools to Our Needs

To achieve these enhancements, we created several custom classes, each designed to extend the capabilities of Langchain's base components. Let's take a closer look at two key custom implementations:

  1. CustomMultiQueryRetriever: This class extends the base MultiQueryRetriever to incorporate chat history in query generation.
  2. CustomSelfQueryRetriever: We customized the SelfQuery Retriever to work seamlessly with our Pinecone vector store and handle date-based queries effectively.

Here's a snippet from our CustomMultiQueryRetriever to give you a taste of how we've tailored these components:

from typing import List

from langchain.chains import LLMChain
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document


class CustomMultiQueryRetriever(MultiQueryRetriever):
    def _get_relevant_documents(
        self,
        query: str,
        history: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
    ) -> List[Document]:
        # Generate alternative phrasings, optionally keep the original query,
        # retrieve documents for each, and deduplicate the combined results.
        queries = self.generate_queries(query, history, run_manager)
        if self.include_original:
            queries.append(query)
        documents = self.retrieve_documents(queries, run_manager)
        return self.unique_union(documents)

    def generate_queries(
        self, question: str, history: str, run_manager: CallbackManagerForRetrieverRun
    ) -> List[str]:
        # Pass both the question and the chat history to the query-generation chain
        response = self.llm_chain.invoke(
            {"question": question, "history": history},
            config={"callbacks": run_manager.get_child()},
        )
        if isinstance(self.llm_chain, LLMChain):
            lines = response["text"]
        else:
            lines = response
        return lines

These custom implementations allow us to tailor the retrieval process to our specific needs, improving the overall performance and relevance of the chatbot's responses.

While these enhancements have significantly improved our chatbot, the journey wasn't without its challenges. Let's reflect on some of the hurdles we faced and the lessons we learned along the way.

Challenges and Learnings: Navigating the Complexities of Langchain

While Langchain provides a powerful framework for building RAG systems, we found that its complexity can sometimes be challenging. Digging into different parts of the codebase to understand and modify behavior required significant effort. However, this process also provided valuable insights into the inner workings of RAG systems and allowed us to create a more tailored solution for our Obsidian chatbot.

Some key learnings from this process include:

  • The importance of thoroughly understanding each component before attempting to customize it
  • The value of incremental improvements and testing each change individually
  • The need for patience when working with complex, interconnected systems

These challenges, while sometimes frustrating, ultimately led to a deeper understanding of RAG systems and a more robust final product.

Now that we've enhanced our chatbot with these powerful features, let's explore some of the exciting ways it can be used.

Use Cases and Examples: Putting Our Enhanced Chatbot to Work

With these improvements, our Obsidian chatbot is now capable of handling a wider range of queries with improved accuracy. Here are some example use cases that showcase its new capabilities:

  1. Date-specific queries: "What projects was I working on in the first week of March 2024?"
  2. Context-aware follow-ups: "Tell me more about the meeting I had last Tuesday."
  3. Complex information retrieval: "Summarize my progress on Project X over the last month."

These examples demonstrate the chatbot's ability to understand temporal context, maintain conversation history, and provide more relevant responses. It's not just a search tool anymore – it's becoming a true digital assistant that can help you navigate and make sense of your personal knowledge base.

As exciting as these improvements are, we're not stopping here. Let's take a quick look at what's on the horizon for our Obsidian chatbot.

Future Plans: The Road Ahead

While we've made significant strides in improving our chatbot, there's always room for further enhancements. One exciting avenue we're exploring is the integration of open-source LLMs to make the system more privacy-focused and self-contained. This could potentially allow users to run the entire system locally, ensuring complete privacy of their personal notes and queries.

Conclusion: A Smarter, More Intuitive Chatbot for Your Personal Knowledge Base

By implementing advanced RAG techniques such as MultiQuery Retriever, SelfQuery Retriever, and incorporating chat history, we've significantly enhanced our Obsidian chatbot's capabilities. These improvements allow for more accurate and contextually relevant responses, especially for date-based queries and complex information retrieval tasks.

Building this enhanced chatbot has been a journey of continuous learning and iteration. We've tackled challenges, discovered new possibilities, and created a tool that we hope will make navigating personal knowledge bases easier and more intuitive.

We hope that sharing our experience will inspire and help others in the community who are working on similar projects. Whether you're looking to build your own chatbot or simply interested in the possibilities of AI-assisted knowledge management, we hope this post has provided valuable insights.

You can find the final code in this GitHub repo

If you have any feedback or simply want to connect, please hit me up on LinkedIn or @prabha-tweet

Building an Obsidian Knowledge base Chatbot: A Journey of Iteration and Learning

As an avid Obsidian user, I've always been fascinated by the potential of leveraging my daily notes as a personal knowledge base. Obsidian has become my go-to tool for taking notes, thanks to its simplicity and the wide range of customization options available through community plugins. With the notes and calendar plugins enabled, I can easily capture my daily thoughts and keep track of the projects I'm working on. But what if I could take this a step further and use these notes as the foundation for a powerful chatbot?

Imagine having a personal assistant that could answer questions like:

  1. "What was that fascinating blog post I read last week?"
  2. "Which projects was I working on back in February 2024?"
  3. "Could you give me a quick summary of my activities from last week?"

Excited by the possibilities, I embarked on a journey to build a chatbot that could do just that. In this blog post, I'll share my experience of building this chat app from scratch, including the challenges I faced, the decisions I had to make, and the lessons I learned along the way. You can find the final code in this GitHub repo

Iteration 1: Laying the Groundwork

To kick things off, I decided to start with a simple Retrieval-Augmented Generation (RAG) system for the app. The stack I chose consisted of:

  • Pinecone for the Vector DB
  • Streamlit for creating the chat interface
  • Langchain framework for tying everything together
  • OpenAI for the Language Model (LLM) and embeddings

I began by embedding my Obsidian daily notes into a Pinecone Vector database. Since my notes aren't particularly lengthy, I opted to embed each daily note as a separate document. Pinecone's simplicity and quick setup allowed me to focus on building the chatbot's functionality rather than getting bogged down in infrastructure.
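At a high level, the indexing step looked roughly like this (a sketch using langchain-community's ObsidianLoader and the langchain-pinecone integration; the path and index name are placeholders):

from langchain_community.document_loaders import ObsidianLoader
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# One document per daily note
docs = ObsidianLoader("/path/to/obsidian/daily-notes").load()

vectorstore = PineconeVectorStore.from_documents(
    docs,
    embedding=OpenAIEmbeddings(),
    index_name="obsidian-notes",  # placeholder index name
)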

For the language model, I chose OpenAI's GPT-4, as its advanced reasoning capabilities would simplify the app-building process and reduce the need for extensive preprocessing.

The initial chatbot workflow looked like this:

The first version of the chatbot was decent, but I wanted to find a way to measure its performance and track progress as I iterated. After some research, I discovered the RAGAS framework, which is designed specifically for evaluating retrieval-augmented generation systems. By creating a dataset with question-answer pairs, I could measure metrics like answer correctness, relevancy, context precision, recall, and faithfulness.
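Running a RAGAS evaluation looks roughly like this (a sketch against the ragas 0.1-style API; the example row is made up):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_dataset = Dataset.from_dict({
    "question": ["What did I do on March 4, 2024?"],
    "answer": ["You worked on the evaluation dataset for the chatbot."],      # chatbot output
    "contexts": [["2024-03-04: worked on building the evaluation dataset"]],  # retrieved notes
    "ground_truth": ["Worked on the evaluation dataset for the chatbot."],
})

results = evaluate(
    eval_dataset,
    metrics=[answer_correctness, answer_relevancy, context_precision,
             context_recall, faithfulness],
)
print(results)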

Chatbot screenshot

I included all the metrics available through the RAGAS library, as I was curious to see how they would be affected by my improvements. You can read more about RAGAS metrics here. At this stage, I wasn't sure what to make of the numbers or whether they indicated good or bad performance, but it was a starting point.

| Metric | Base Performance |
| --- | --- |
| Answer_correctness | 0.42 |
| Answer_relevancy | 0.39 |
| Answer_similarity | 0.84 |
| Context_entity_recall | 0.27 |
| Context_precision | 0.71 |
| Context_recall | 0.43 |
| Context_relevancy | 0.01 |
| Faithfulness | 0.39 |

Iteration 2: Refining the Approach

With the evaluation framework in place, I reviewed the examples and runs to identify areas for improvement. One thing that stood out was the presence of Dataview queries in my notes. These queries are used in Obsidian to pull data from various notes, similar to SQL queries. However, they don't execute and provide results when the Markdown file is viewed or accessed outside of Obsidian. I realized that these queries might be introducing noise and not adding much value, so I decided to remove them.

After making this change and re-evaluating the chatbot, I was surprised to see that the answer metrics had actually gone down. Digging deeper, I discovered that the vector search wasn't yielding the correct daily notes, even for straightforward queries like "What did I do on March 4, 2024?" On the bright side, context precision had improved since the context no longer contained Dataview queries.

| Metric | Base | Iteration 2 |
| --- | --- | --- |
| Answer_correctness | 0.42 | 0.34 |
| Answer_relevancy | 0.39 | 0.36 |
| Answer_similarity | 0.84 | 0.81 |
| Context_entity_recall | 0.27 | 0.09 |
| Context_precision | 0.71 | 0.87 |
| Context_recall | 0.43 | 0.42 |
| Context_relevancy | 0.01 | 0.02 |
| Faithfulness | 0.39 | 0.69 |

To address the issue with vector search, I made two adjustments:

  1. Increased the number of documents returned by the retriever from the default 4 to 20.
  2. Switched to using a MultiQuery retriever.

The goal was to retrieve a larger set of documents, even if their relevancy scores were low, in the hopes that the reranker model would be able to identify and prioritize the most relevant ones.
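In code, those two adjustments look roughly like this (a sketch; vectorstore and llm come from the earlier setup):

from langchain.retrievers.multi_query import MultiQueryRetriever

# 1. Return more candidate documents per search
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# 2. Generate several phrasings of each question and merge the results
retriever = MultiQueryRetriever.from_llm(retriever=base_retriever, llm=llm)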

These changes led to a slight improvement in the answer-related metrics compared to the previous iterations. However, the context-related metrics took a hit due to the increased number of documents being considered. I was willing to accept this trade-off for now, as my notes were well-structured, and I believed a highly capable LLM should be able to extract the necessary information.

| Metric | Base | Iteration 2 | Iteration 2.1 |
| --- | --- | --- | --- |
| Answer_correctness | 0.42 | 0.34 | 0.45 |
| Answer_relevancy | 0.39 | 0.36 | 0.48 |
| Answer_similarity | 0.84 | 0.81 | 0.85 |
| Context_entity_recall | 0.27 | 0.09 | 0.15 |
| Context_precision | 0.71 | 0.87 | 0.62 |
| Context_recall | 0.43 | 0.42 | 0.35 |
| Context_relevancy | 0.01 | 0.02 | 0.00 |
| Faithfulness | 0.39 | 0.69 | 0.56 |

Iteration 3: Updating the Evaluation Dataset

As I reviewed the evaluation run, I noticed an interesting pattern. When there were no relevant notes to answer a question, the LLM correctly responded with "I don't know." This matched the ground truth, but the answer correctness was being computed as 0.19 instead of a value closer to 1.

To improve the evaluation process, I updated the dataset to include "I don't know" as the expected answer in cases where no relevant information was available. This simple change had a significant impact on the answer metrics, providing a more accurate assessment of the chatbot's performance.

| Metric | Base | Iteration 2 | Iteration 2.1 | Iteration 3 |
| --- | --- | --- | --- | --- |
| Answer_correctness | 0.42 | 0.34 | 0.45 | 0.62 |
| Answer_relevancy | 0.39 | 0.36 | 0.48 | 0.60 |
| Answer_similarity | 0.84 | 0.81 | 0.85 | 0.89 |
| Context_entity_recall | 0.27 | 0.09 | 0.15 | 0.14 |
| Context_precision | 0.71 | 0.87 | 0.62 | 0.62 |
| Context_recall | 0.43 | 0.42 | 0.35 | 0.37 |
| Context_relevancy | 0.01 | 0.02 | 0.00 | 0.00 |
| Faithfulness | 0.39 | 0.69 | 0.56 | 0.61 |

The Journey Continues...

At this point, I have a functional chatbot that serves as a powerful search engine for my personal knowledgebase. While I'm happy with the progress so far, there's still room for improvement. Some ideas for future iterations include:

  • Implementing document retrieval based on metadata like date, to provide more accurate answers for time-sensitive questions.
  • Exploring the use of open-source LLMs like LLAMA3 to keep my data private and self-contained.

Building this chatbot has been an incredible learning experience, showcasing the power of combining Obsidian, vector databases, and language models. Not only has it given me a valuable tool for accessing my own knowledge, but it has also highlighted the importance of iterative development and continuous evaluation.

I hope my journey inspires other Obsidian enthusiasts to explore the possibilities of creating their own personal knowledgebase chatbots. By leveraging our daily notes and harnessing the power of AI, we can unlock new ways to interact with and learn from the information we capture.

You can find the final code in this GitHub repo

If you have any feedback or simply want to connect, please hit me up on LinkedIn or @prabha-tweet

Quantized LLM Models

Large Language Models (LLMs) are known for their vast number of parameters, often reaching billions. For example, open-source models like Llama2 come in sizes of 7B, 13B, and 70B parameters, while Google's Gemma has 2B parameters. Although OpenAI's GPT-4 architecture is not publicly shared, it is speculated to have more than a trillion parameters, with 8 models working together in a mixture of experts approach.

Understanding Parameters

A parameter is a model weight learned during the training phase. The number of parameters can be a rough indicator of a model's capability and complexity. These parameters are used in huge matrix multiplications across each layer until an output is produced.

The Problem with Large Number of Parameters

As LLMs have billions of parameters, loading all the parameters into memory and performing massive matrix multiplications becomes a challenge. Let's consider the math behind this:

For a 70B parameter model (like the Llama2-70B model), the default size in which these parameters are stored is 32 bits (4 bytes). To load this model, you would need:

70B parameters * 4 bytes = 280 GB (roughly 260 GiB) of memory

This highlights the significant memory requirements for running LLMs.

Quantization as a Solution

Quantization is a technique used to reduce the size of the model by decreasing the precision of parameters and storing them in less memory. For example, representing 32-bit floating-point (FP32) parameters in a 16-bit floating-point (FP16) datatype.

In practice, this loss of precision does not significantly degrade the output quality of LLMs but offers substantial performance improvements in terms of efficiency. By quantizing the model, the memory footprint can be reduced, making it more feasible to run LLMs on resource-constrained systems.

Quantization allows for a trade-off between model size and performance, enabling the deployment of LLMs in a wider range of applications and devices. It is an essential technique for making LLMs more accessible and efficient while maintaining their impressive capabilities.
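As a concrete illustration, loading a model in half precision with Hugging Face transformers is essentially a one-line change (a sketch; gemma-2b is used here because it's the model compared below):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Default load: FP32 weights, about 4 bytes per parameter
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)

# Half precision: FP16 weights, about 2 bytes per parameter (~50% less memory)
model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)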

The table below compares the performance of Google’s 2B Gemma model with 32-bit and 16-bit precision. The quantized 16-bit model is 28% faster with approximately 50% less memory usage.

| | Gemma FP32 (32-bit precision) | Gemma FP16 (16-bit precision) |
| --- | --- | --- |
| # of Parameters | 2,506,172,416 | 2,506,172,416 |
| Memory size based on # parameters | 2.5B * 4 bytes ≈ 9.33 GB | 2.5B * 2 bytes ≈ 4.66 GB |
| Memory footprint | 9.39 GB | 4.73 GB |
| Average inference time | 10.36 seconds | 7.48 seconds |

(Chart: distribution of inference time for the two models.)

Impact on Accuracy

To assess the impact of quantization on accuracy, I ran the output of both models and computed the similarity score using OpenAI's text-embedding-3-large model. The results showed that the similarity scores between the outputs of the 32-bit and 16-bit models were highly comparable with 0.998 cosine similarity, indicating that quantization does not significantly affect the model's accuracy.
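The comparison itself can be as simple as embedding both outputs and taking the cosine similarity (a sketch with the OpenAI Python client; the notebook's exact code may differ):

import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(text_a: str, text_b: str) -> float:
    resp = client.embeddings.create(model="text-embedding-3-large", input=[text_a, text_b])
    a, b = (np.array(d.embedding) for d in resp.data)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine_similarity(output_fp32, output_fp16)  # outputs from the two models for the same prompt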

In conclusion, quantization is a powerful technique for reducing the memory footprint and improving the efficiency of LLMs while maintaining their performance. By enabling the deployment of LLMs on a wider range of devices and applications, quantization plays a crucial role in making these impressive models more accessible and practical for real-world use cases.

Note

Inference time and accuracy were measured over 100 random questions; you can find them in the Colab notebook.

Good Resource on this topic

DLAI - Quantization Fundamentals

If you have any feedback or simply want to connect, please hit me up on LinkedIn or @prabha-tweet

Karpathy's let's build GPT from scratch

Self Note

This note is for me to understand the concepts

Learning Resource

Karpathy's tutorial on YouTube: Let's build GPT from scratch

The spelled-out intro to neural networks and backpropagation: building micrograd - YouTube

In this video he builds micrograd.

The spelled-out intro to language modeling: building makemore - YouTube

Building makemore [GitHub - karpathy/makemore: An autoregressive character-level language model for making more things](https://github.com/karpathy/makemore)

Dataset: a people-names dataset from a government website

Iteration 1:

    Character-level language model

    Method: Bigram (predict the next character using the previous character)

As seen above, it doesn't give good names; the bigram model is not good at predicting the next character.

In "bigram" model probabilities become the parameter of bigram language model.

Quality Evaluation of model

We will be using the [[Negative maximum log likelihood estimate]]; in our problem we calculate it over the entire training set.

log(1) = 0 and log(a very small number) → \(-\infty\)

We estimate the negative log likelihood as follows:

log_likelihood = 0
n = 0
for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1, ix2 = stoi[ch1], stoi[ch2]
        prob = P[ix1, ix2]  # P is the matrix that holds the bigram probabilities
        n += 1
        log_likelihood += torch.log(prob)
        print(f'{ch1}{ch2}: {prob:.4f}')

print(f'{log_likelihood=}')

# Negative log likelihood has the nice property that the error (loss) should be small, i.e. zero is best
nll = -log_likelihood
print(f'{nll=}')

# Usually people work with the average negative log likelihood
print(f'{nll/n=}')

To avoid assigning zero probability (and hence an infinite loss) to pairs never seen in training, people apply model "smoothing": assigning a small probability to unlikely outcomes.
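In the count-based model this is just adding a constant to the count matrix before normalizing (a sketch; N is the 27x27 bigram count matrix):

# Add-1 smoothing: no bigram gets probability 0, so log(prob) never hits -inf
P = (N + 1).float()
P /= P.sum(1, keepdim=True)  # normalize each row into a probability distribution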

Iteration 2: Bigram Language Model using Neural Network

We need to create a dataset for training, i.e. input and output character pairs (x and y).

One-hot encoding needs to be done before feeding the inputs into the neural network.

We interpret the outputs as log counts ("logits"): log(counts) = logits, so counts = exp(logits).
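The training pairs come from the same bigrams as before; a sketch of how the xs and ys used below can be built (words and stoi as defined earlier):

import torch

xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])  # input character index
        ys.append(stoi[ch2])  # target (next) character index

xs = torch.tensor(xs)
ys = torch.tensor(ys)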

import torch.nn.functional as F

# W: 27x27 weight matrix, e.g. W = torch.randn((27, 27), requires_grad=True)
xenc = F.one_hot(xs, num_classes=27).float()
for i in range(100):

    # Forward pass
    logits = xenc @ W                                      # predicted log-counts
    counts = logits.exp()                                  # counts
    probs = counts / counts.sum(1, keepdim=True)           # softmax: per-row probabilities
    loss = -probs[torch.arange(228146), ys].log().mean()   # 228146 = number of bigram examples
    print(loss.item())

    # Backward pass
    W.grad = None
    loss.backward()

    # Update parameters using the computed gradient
    W.data += -50 * W.grad  # learning rate 50; small values like 0.1 decreased the loss too slowly

Thoughts and comparison of the above two approaches

In the first approach, we added 1 to the actual counts because we don't want to end up with \(-\infty\) for character pairs that never appear in the training dataset. If you add a large number, the actual frequencies become less relevant and we approach a uniform distribution. This is called smoothing.

Similarly, the gradient-based approach has its own form of "smoothing". When all values of W are zero, exp(W) gives all ones and the softmax assigns equal probability to every output. You incentivise this in the loss function by adding a second component, like below:

loss = -probs[torch.arange(228146), ys].log().mean() + (0.1 * (W**2).mean())

The second component pushes W towards zero; 0.1 is the regularization strength, which determines how much weight we give to this regularization term. It plays the same role as the number of "fake" counts added in the first approach.

We took two approaches:

i) a frequency-based model
ii) a NN-based model (optimized using the negative log likelihood)

We ended up with essentially the same model: in the NN-based approach, W represents the log counts (matching the first approach), and we can exponentiate W to recover the counts.

Building makemore Part 2: MLP - YouTube

In this lecture we build makemore to predict the next character based on the last 3 characters.

Embedding

As a first step, we need to build embeddings for the characters; we start with a 2-dimensional embedding.


h = emb.view(-1, 6) @ W1 + b1  # Hidden layer activation (3 context characters x 2 embedding dims = 6 inputs)

We index into the embedding matrix to get the embedding (weights) for each character. Another way to interpret this is one-hot encoding: indexing and multiplying by a one-hot vector produce the same result, so we can think of the embedding matrix as the weights of the first layer of the neural network.

logits = h @ W2 + b2
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
prob.shape
# torch.Size([32, 27])

In the final layer we get a probability distribution over all 27 characters.

# Negative Log likelihood 

loss = -prob[torch.arange(32), Y].log().mean()
loss

In practice, we use mini-batches for the forward and backward passes; this is more efficient than optimizing over the entire dataset at once.

It is much more efficient to take many steps (iterations) with a lower-confidence gradient estimate than a few steps with the exact gradient.

Learning rate

The learning rate is an important hyperparameter. We need to find a reasonable range manually, and we can then use different techniques to search for the optimal value within that range.

Dataset split

It is important to split the dataset into three sets:

  • train split: used to fit the model parameters

  • dev split: used to tune hyperparameters

  • test split: used for the final evaluation of model performance

We improve the model by increasing its complexity, i.e. adding more parameters; for example, the number of hidden-layer neurons can be increased.

In our case, the bottleneck may be the embeddings: we are cramming all the characters into just a two-dimensional space. We can increase the embedding dimension from 2 to 10.

Now we get better name-sounding words than before (when there was just one character of context):

dex.
marial.
mekiophity.
nevonimitta.
nolla.
kyman.
arreyzyne.
javer.
gota.
mic.
jenna.
osie.
tedo.
kaley.
mess.
suhaiaviyny.
fobs.
mhiriel.
vorreys.
dasdro.