Communities, chats, and forums are an endless source of information on a multitude of topics. Slack often replaces technical documentation, and Telegram and Discord communities help with gaming, startup, crypto, and travel questions. Despite the relevance of firsthand information, it is often highly unstructured, which makes it hard to search through. In this article, we will explore the complexities of implementing a Telegram bot that finds answers to questions by extracting information from the chat's message history.
Here are the challenges that await us:

Finding relevant messages. The answer may be scattered across several people's messages or hidden in a link to an external resource.
Ignoring off-topic content. Chats contain a lot of spam and off-topic discussion, which we need to identify and filter out.
Prioritization. Information becomes outdated. How do we know which answer is correct today?
Basic chatbot user flow we are going to implement

Let's walk through the main stages of this user flow and highlight the challenges we will face.
Data preparation

To prepare a message history for search, we need to create embeddings of these messages - vectorized text representations. While dealing with a wiki article or a PDF document, we would split the text into paragraphs and compute a sentence embedding for each.

However, we should take into account the peculiarities that are typical of chats rather than of well-structured text.
Next, we should choose the embedding model. There are many different models for building embeddings, and several factors must be considered when choosing the right one.
To improve the quality of search results, we can categorize messages by topic. For example, in a chat dedicated to frontend development, users can discuss topics such as CSS, tooling, React, Vue, etc. You can use an LLM (more expensive) or classic topic-modeling methods from libraries like BERTopic to classify messages by topic.
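As a rough illustration of the LLM-based option, a message can be classified with a single completion call; the topic list, prompt, and model below are assumptions made for this sketch, not part of the original setup:

import OpenAI from 'openai';

const openai = new OpenAI();

// hypothetical topic list for a frontend chat
const TOPICS = ['css', 'tooling', 'react', 'vue', 'other'];

async function classifyMessage(text: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // a small model is usually enough for classification
    temperature: 0,
    messages: [
      { role: 'system', content: `Classify the message into one of the topics: ${TOPICS.join(', ')}. Reply with the topic only.` },
      { role: 'user', content: text },
    ],
  });
  return completion.choices[0].message.content?.trim().toLowerCase() ?? 'other';
}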
We will also need a vector database to store the embeddings and meta-information (links to the original posts, categories, dates). Many vector stores, such as FAISS, Milvus, or Pinecone, exist for this purpose. A regular PostgreSQL with the pgvector extension will also work.
Processing a user's question

To answer a user's question, we need to convert it into a searchable form: compute the question's embedding and determine its intent.

The result of a semantic search for a question could be similar questions from the chat history, but not the answers to them.
To improve this, we can use the popular HyDE (hypothetical document embeddings) optimization technique. The idea is to generate a hypothetical answer to the question using an LLM and then compute the embedding of that answer. In some cases, this approach allows a more accurate and efficient search for relevant messages among answers rather than questions.
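A minimal sketch of HyDE, assuming an already configured OpenAI client; the prompt and model names are illustrative:

// Generate a hypothetical answer and embed it instead of the raw question (HyDE)
async function getHydeEmbedding(openai: OpenAI, question: string): Promise<number[]> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0.7,
    messages: [
      { role: 'system', content: 'Write a short, plausible answer to the question, as if it were posted in the chat.' },
      { role: 'user', content: question },
    ],
  });
  const hypotheticalAnswer = completion.choices[0].message.content ?? question;

  const res = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: hypotheticalAnswer,
  });
  return res.data[0].embedding;
}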
Finding the most relevant messages

Once we have the question embedding, we can search for the closest messages in the database. The LLM has a limited context window, so we may not be able to include all the search results if there are too many. The question arises of how to prioritize the answers. There are several approaches for this (a small scoring sketch follows the list):

Recency score. Over time, information becomes outdated. To prioritize newer messages, you can calculate a recency score with the simple formula 1 / (today - date_of_message + 1), where the age is measured in days.

Metadata filtering. (You need to identify the topic of the question and of the posts.) This helps narrow down the search, keeping only the posts that are relevant to the topic in question.
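To make this concrete, here is a small sketch of how a recency score could be computed and blended with the semantic similarity score; the 0.8 / 0.2 weights are arbitrary assumptions:

// Recency score: 1 / (today - date_of_message + 1), with the age measured in days
function recencyScore(messageDate: Date, now = new Date()): number {
  const ageInDays = (now.getTime() - messageDate.getTime()) / (1000 * 60 * 60 * 24);
  return 1 / (ageInDays + 1);
}

// Blend semantic similarity with recency to rank candidate messages
function combinedScore(similarity: number, messageDate: Date): number {
  return 0.8 * similarity + 0.2 * recencyScore(messageDate);
}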
Generating the final response

After searching and sorting in the previous step, we can keep the 50-100 most relevant posts that fit into the LLM context.

The next step is to create a clear and concise prompt for the LLM using the user's original query and the search results. It should specify how the LLM should answer the question and include the user's query and the context - the relevant messages we found. For this purpose, it is essential to consider these aspects:
System prompt - instructions to the model that explain how it should process the information. For example, you can tell the LLM to look for the answer only in the data provided.

Context length - the maximum length of the input we can pass to the model. We can count tokens using the tokenizer that corresponds to the model in use; for example, OpenAI models use tiktoken (see the sketch after this list).

Model hyperparameters - for example, the temperature controls how creative the model will be in its responses.

The choice of the model. It is not always worth overpaying for the largest and most powerful model. It makes sense to run several tests with different models and compare the results. In some cases, less resource-intensive models will do the job if high accuracy is not required.
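For example, a rough token count for a prompt can be obtained with the js-tiktoken package (the package choice and encoding are assumptions of this sketch):

import { encodingForModel } from 'js-tiktoken';

// Count tokens to check that the prompt fits into the model's context window
function countTokens(text: string): number {
  const encoder = encodingForModel('gpt-4');
  return encoder.encode(text).length;
}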
Implementation

Now let's try to implement these steps with NodeJS. Here is the tech stack I'm going to use:
NodeJS + TypeScript
MikroORM with PostgreSQL (pgvector) as the vector store
OpenAI API for embeddings and completions
Cohere for reranking
LangChain utilities for text splitting
Let's skip the basic steps of installing dependencies and setting up the Telegram bot, and move straight on to the most important features. Here is the database schema, which will be needed later:
import { Entity, Enum, PrimaryKey, Property, Unique } from '@mikro-orm/core';

// BaseEntity is the project's shared base class (not shown here)
@Entity({ tableName: 'groups' })
export class Group extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @Property({ type: 'bigint' })
  channelId!: number;

  @Property({ type: 'text', nullable: true })
  title?: string;

  @Property({ type: 'json' })
  attributes!: Record<string, unknown>;
}
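The chunk entity referenced later in the article (ContentChunk) is not shown in the original; a minimal sketch of what it could look like, assuming the embeddings are stored in a pgvector column:

@Entity({ tableName: 'content_chunks' })
export class ContentChunk extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @Property({ type: 'bigint' })
  groupId!: number;

  @Property({ type: 'text' })
  text!: string;

  // text-embedding-3-large produces 3072-dimensional vectors
  @Property({ columnType: 'vector(3072)', nullable: true })
  embeddings?: number[];
}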
Split user dialogs into chunks

Splitting long dialogs between multiple users into chunks is not the most trivial task.

Unfortunately, default approaches such as the RecursiveCharacterTextSplitter available in the LangChain library do not account for the peculiarities specific to chats. In the case of Telegram, however, we can take advantage of Telegram threads, which contain related messages and the replies users send to each other.

Every time a new batch of messages arrives from the chat room, our bot needs to perform a few processing steps:
// the import path may vary with the LangChain version
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

class ChatContentSplitter {
  constructor(
    private readonly splitter: RecursiveCharacterTextSplitter,
    private readonly longMessageLength = 200
  ) {}

  // The body was cut off in the original; a minimal illustrative version:
  // short messages become chunks as-is, longer ones are split further.
  // `Message` is the project's chat-message entity (assumed here).
  public async split(messages: EntityDTO<Message>[]): Promise<string[]> {
    const chunks: string[] = [];
    for (const message of messages) {
      if (message.text.length > this.longMessageLength) {
        chunks.push(...(await this.splitter.splitText(message.text)));
      } else {
        chunks.push(message.text);
      }
    }
    return chunks;
  }
}
Embeddings

Next, we need to calculate the embeddings for each of the chunks. For this, we can use the OpenAI model text-embedding-3-large.
public async getEmbeddings(chunks: ContentChunk[]) {
  // the OpenAI API accepts batched input, so we process the chunks in groups of 100
  const batches = groupArray(chunks, 100);
  for (const batch of batches) {
    const res = await this.openai.embeddings.create({
      input: batch.map((c) => c.text),
      model: 'text-embedding-3-large',
      encoding_format: 'float',
    });
    batch.forEach((chunk, i) => {
      chunk.embeddings = res.data[i].embedding;
    });
  }
  await this.orm.em.flush();
}
Answering user questions

To answer a user's question, we first compute the embedding of the question and then find the most relevant messages in the chat history.
public async similaritySearch(embeddings: number[], groupId: number): Promise<ContentChunk[]> {
  // The body was cut off in the original. A rough sketch, assuming the chunks live in
  // PostgreSQL with pgvector and an `embeddings` vector column:
  const connection = this.orm.em.getConnection();
  return connection.execute(
    `SELECT * FROM content_chunks
     WHERE group_id = ?
     ORDER BY embeddings <=> ?::vector
     LIMIT 100`,
    [groupId, `[${embeddings.join(',')}]`]
  );
}

Then we rerank the search results with the help of Cohere's reranking model.
public async rerank(query: string, chunks: ContentChunk[]): Promise<ContentChunk[]> {
  // The body was cut off in the original. A sketch using the cohere-ai SDK;
  // the exact model name is an assumption:
  const response = await this.cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: chunks.map((chunk) => chunk.text),
  });
  // reorder the chunks according to the relevance scores returned by Cohere
  return response.results.map((result) => chunks[result.index]);
}

Next, we ask the LLM to answer the user's question by summarizing the search results. A simplified version of processing a search query looks like this:
public async search(query: string, group: Group) {
  const queryEmbeddings = await this.getEmbeddings(query);
  const chunks = await this.chunkService.similaritySearch(queryEmbeddings, group.id);
  const reranked = await this.cohereService.rerank(query, chunks);

  const completion = await this.openai.chat.completions.create({
    model: 'gpt-4-turbo',
    temperature: 0,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: this.userPromptTemplate(query, reranked) },
    ],
  });
  return completion.choices[0].message;
}

// naive prompt
public userPromptTemplate(query: string, chunks: ContentChunk[]) {
  const history = chunks
    .map((c) => `${c.text}`)
    .join('\n----------------------------\n');
  return `
    Answer the user's question: ${query}
    By summarizing the following content:
    ${history}
    Keep your answer direct and concise. Provide references to the corresponding messages.
  `;
}
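The systemPrompt referenced above is not included in the original; one possible version, purely as an illustration:

const systemPrompt = `
  You are a helpful assistant that answers questions based on the message history of a Telegram group.
  Use only the messages provided in the user prompt as your source of information.
  If the provided messages do not contain the answer, say that you do not know.
`;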
Further improvements

Even after all these optimizations, we may find that the answers of our LLM-powered bot are not ideal or are incomplete. What else could be improved?
For user posts that include links, we can also parse the content of the linked web pages and PDF documents.
Query routing - directing user queries to the most appropriate data source, model, or index based on the query's intent and context, to optimize accuracy, efficiency, and cost.
We can include resources relevant to the topic of the chat room in the search index - at work, this could be documentation from Confluence; for visa chats, consulate websites with the rules, etc.
RAG evaluation - we need to set up a pipeline to evaluate the quality of our bot's responses.