A Knowledge Graph Building Engine Based on TiDB and LlamaIndex: Autoflow
Introduction
Autoflow is an open-source GraphRAG (knowledge graph) tool built on TiDB Vector ([1]), LlamaIndex ([2]), and DSPy ([3]).
Features
A conversational search page similar to Perplexity: the platform ships with an advanced built-in web crawler that makes it easy to index official and documentation websites, walking sitemap URLs to ensure comprehensive coverage and simplify the search process.
You can also edit the knowledge graph to add missing information or correct inaccuracies. This is particularly useful for improving the search experience and ensuring that the information served is accurate and up to date.
Embeddable JavaScript code snippet: integrate the conversational search window into your website by copying and embedding a simple JavaScript snippet. The widget is typically placed in the bottom-right corner of the page to provide instant answers to product-related queries.
Deployment
• Deploy using Docker Compose ([4]) (requirements: 4 CPU cores and 8 GB RAM)
For more details, refer to the documentation: https://tidb.ai/docs/quick-start, or the GitHub repository: https://github.com/pingcap/autoflow
Related knowledge
- TiDB Vector
TiDB Vector Search (beta) provides an advanced search solution for performing semantic similarity search across various data types, including documents, images, audio, and video. It enables developers to easily build scalable applications with generative AI capabilities using familiar MySQL skills.
Concept
Vector search is a search method that prioritizes the meaning of data to provide relevant results.
Unlike traditional full-text search that relies on precise keyword matching and word frequency, vector search converts various data types (such as text, images, or audio) into high-dimensional vectors and performs queries based on the similarity between these vectors. This search method captures the semantic meaning and contextual information of the data, thereby more accurately understanding user intent.
Even if the search term does not exactly match the content in the database, vector search can still return results that match the user's intent by analyzing the semantics of the data.
For example, a full-text search for "a swimming animal" only returns results that contain these exact keywords. However, a vector search can return results for other swimming animals, such as fish or ducks, even if these results do not contain the exact keywords.
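To make this concrete, here is a toy Python sketch. The four-dimensional vectors are hand-made for illustration (real embedding models produce hundreds or thousands of dimensions); the point is that similarity between vectors, not keyword overlap, determines the ranking:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means similar meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings; imagine the axes loosely encode
# [animal-ness, water-ness, vehicle-ness, food-ness].
query    = np.array([0.9, 0.8, 0.0, 0.1])  # "a swimming animal"
duck     = np.array([0.8, 0.9, 0.0, 0.2])  # no keyword overlap with the query
sailboat = np.array([0.0, 0.7, 0.9, 0.0])  # water-related, but not an animal

print("duck:    ", cosine_similarity(query, duck))      # high, ranked first
print("sailboat:", cosine_similarity(query, sailboat))  # lower, ranked below
```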
Vector embedding
A vector embedding, also known simply as an embedding, is a sequence of numbers that represents a real-world object in a high-dimensional space. It captures the meaning and context of unstructured data such as documents, images, audio, and video.
Vector embeddings are crucial in machine learning and are the foundation of semantic similarity search.
TiDB introduces vector data types ([5]) and vector search indexes ([6]), aiming to optimize the storage and retrieval of vector embeddings and enhance their use in AI applications. You can store vector embeddings in TiDB and use these data types to perform vector search queries to find the most relevant data.
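As a minimal sketch of how storage might look in practice, assuming a reachable TiDB instance with vector search (beta) enabled and using the pymysql client (the connection parameters, table name, and tiny 3-dimensional vectors are placeholders for illustration):

```python
import pymysql

# Placeholder connection details: point these at a TiDB instance
# with vector search enabled (e.g., TiDB Cloud Serverless).
conn = pymysql.connect(
    host="127.0.0.1", port=4000, user="root", password="", database="test"
)

with conn.cursor() as cur:
    # A 3-dimensional VECTOR column for illustration; real embeddings
    # typically have hundreds or thousands of dimensions.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INT PRIMARY KEY AUTO_INCREMENT,
            content TEXT,
            embedding VECTOR(3)
        )
        """
    )
    # Vector literals are written as JSON-style arrays inside a string.
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        ("ducks swim on the pond", "[0.8, 0.9, 0.0]"),
    )
conn.commit()
```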
Embedding models
Embedding models are algorithms that transform data into vector embeddings ([7]).
Choosing the right embedding model is crucial to ensuring the accuracy and relevance of semantic search results. For unstructured text data, you can find the best performing text embedding model on the Massive Text Embedding Benchmark (MTEB) Leaderboard ([8]) .
To learn how to generate vector embeddings for specific data types, refer to the integration tutorial or example for embedding models.
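For example, here is a minimal sketch using the open-source sentence-transformers library; the model name is just one example of an open text embedding model, not a recommendation drawn from the MTEB leaderboard itself:

```python
from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is an example open model with 384-dimensional output;
# consult the MTEB leaderboard for models suited to your data and language.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["a swimming animal", "ducks paddle across the pond"]
embeddings = model.encode(sentences)  # numpy array of shape (2, 384)

print(embeddings.shape)
print(util.cos_sim(embeddings[0], embeddings[1]))  # semantic similarity score
```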
How vector search works
After converting the source data into vector embeddings and storing them in TiDB, your application can execute vector search queries to find the data most semantically relevant to a user's query.
TiDB Vector Search identifies the k nearest neighbor (KNN) vectors by using a distance function ([9]) to calculate the distance between a given vector and the vectors stored in the database. The vectors closest to the query vector represent the most semantically similar data.
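Continuing the earlier storage sketch, a KNN query might look like the following, using VEC_COSINE_DISTANCE (one of TiDB's vector distance functions) and ordering by distance so that the k smallest distances, i.e. the k nearest neighbors, come first. The query embedding and connection parameters are placeholders:

```python
import pymysql

conn = pymysql.connect(
    host="127.0.0.1", port=4000, user="root", password="", database="test"
)

k = 3
query_embedding = "[0.9, 0.8, 0.1]"  # embedding of the user's query text

with conn.cursor() as cur:
    # Smaller cosine distance = more similar meaning; the k rows with the
    # smallest distance are the k nearest neighbors.
    cur.execute(
        """
        SELECT content, VEC_COSINE_DISTANCE(embedding, %s) AS distance
        FROM documents
        ORDER BY distance
        LIMIT %s
        """,
        (query_embedding, k),
    )
    for content, distance in cur.fetchall():
        print(distance, content)
```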
[Figure: TiDB vector search diagram]
As a relational database with integrated vector search, TiDB allows you to store data and its corresponding vector representation (i.e., vector embeddings) in one database. You can choose either of the following storage methods:
• Store the data and its corresponding vector representation in different columns of the same table.
• Store the data and its corresponding vector representation in separate tables. In this case, JOIN queries are needed to combine the tables when retrieving data (see the sketch after this list).
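For the second layout, here is a hedged sketch of the JOIN-based retrieval, assuming hypothetical docs and doc_embeddings tables that are illustrative only, not part of any actual Autoflow schema:

```python
import pymysql

conn = pymysql.connect(
    host="127.0.0.1", port=4000, user="root", password="", database="test"
)

# Hypothetical schema for the two-table layout:
#   docs(id, content)                  -- the raw data
#   doc_embeddings(doc_id, embedding)  -- one embedding row per document
query_embedding = "[0.9, 0.8, 0.1]"

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT d.content,
               VEC_COSINE_DISTANCE(e.embedding, %s) AS distance
        FROM docs AS d
        JOIN doc_embeddings AS e ON e.doc_id = d.id
        ORDER BY distance
        LIMIT 5
        """,
        (query_embedding,),
    )
    for content, distance in cur.fetchall():
        print(distance, content)
```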
Use cases
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an architecture designed to optimize the output of large language models (LLMs). Using vector search, a RAG application stores vector embeddings in a database and retrieves relevant documents as additional context when the LLM generates responses, improving the quality and relevance of the answers.
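The overall flow can be sketched as follows; the helper functions embed_query, vector_search, and generate are hypothetical stand-ins for your embedding model, vector database, and LLM, not Autoflow's actual API:

```python
def embed_query(text: str) -> list[float]:
    """Stand-in: call your embedding model here."""
    raise NotImplementedError

def vector_search(embedding: list[float], k: int) -> list[str]:
    """Stand-in: run a KNN query against the vector database (e.g., TiDB)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Stand-in: call your LLM here."""
    raise NotImplementedError

def answer(question: str) -> str:
    # 1. Embed the question; 2. retrieve the most relevant documents;
    # 3. pass them to the LLM as extra context for grounded generation.
    context_docs = vector_search(embed_query(question), k=5)
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_docs)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)
```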
Semantic search
Semantic search is a search technique that returns results based on the meaning of the query rather than mere keyword matching. It uses embeddings to interpret meaning across different languages and various types of data (such as text, images, and audio). Vector search algorithms then use these embeddings to find the data that best satisfies the user's query.
Recommendation engine
A recommendation engine is a system that proactively suggests relevant, personalized content, products, or services to users. It does this by creating embeddings that represent user behavior and preferences. These embeddings help the system identify items that similar users have interacted with or shown interest in, increasing the likelihood that its recommendations are both relevant and appealing.
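One common embedding-based approach (a toy sketch, not a description of any specific production system) represents the user as the average of the embeddings of items they have interacted with, then recommends the nearest unseen items:

```python
import numpy as np

# Toy item embeddings; in practice these come from an embedding model
# trained on item content or interaction data.
items = {
    "running shoes": np.array([0.9, 0.1, 0.0]),
    "trail shoes":   np.array([0.8, 0.2, 0.1]),
    "coffee maker":  np.array([0.0, 0.1, 0.9]),
}

# Represent the user as the average of the items they interacted with.
history = ["running shoes"]
user_vec = np.mean([items[name] for name in history], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Recommend the unseen items closest to the user's embedding.
candidates = [
    (cosine(user_vec, vec), name)
    for name, vec in items.items()
    if name not in history
]
for score, name in sorted(candidates, reverse=True):
    print(f"{name}: {score:.3f}")  # "trail shoes" ranks above "coffee maker"
```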
- DSPy
DSPy is an open-source framework for programming, rather than prompting, language models. It allows you to iterate quickly to build modular AI systems and provides algorithms for optimizing their prompts and weights, whether you're building simple classifiers, complex RAG pipelines, or agent loops.
DSPy stands for Declarative Self-improving Python. Instead of writing brittle prompts, you write compositional Python code and use DSPy's tools to teach your language model to produce high-quality output. This lecture ([10]) is a good conceptual introduction. Learn about the community, ask for help, or start contributing through the GitHub repository and Discord server ([11]).
GitHub Address: https://github.com/stanfordnlp/dspy
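As a taste of the programming model, here is a minimal sketch based on DSPy's 2.x API; the model identifier is a placeholder for any LLM backend DSPy supports, and running it requires the corresponding API credentials:

```python
import dspy

# Configure a language model; "openai/gpt-4o-mini" is a placeholder
# (requires an OPENAI_API_KEY in the environment).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A declarative signature: inputs -> outputs. DSPy turns this into
# a prompt (and can later optimize it) instead of you hand-writing one.
classify = dspy.Predict("sentence -> sentiment")

result = classify(sentence="This knowledge base answered my question instantly!")
print(result.sentiment)
```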