Building Knowledge Graph with LLM Graph Transformer

In this article, we explore LangChain's LLM Graph Transformer and its two modes for building knowledge graphs from text. The tool-based mode is the primary method: it reduces prompt engineering and enables property extraction by using structured output and function calling. When tool support is unavailable, the prompt-based mode serves as a useful fallback, relying on few-shot examples to guide the LLM.

As the article shows, getting an LLM to produce a graph is easy; building a consistent knowledge graph is not. Note that, unlike the tool-based mode, prompt-based extraction does not support property extraction and does not produce isolated nodes.

An in-depth look at LangChain's implementation of LLM-based graph construction

Creating graphs from text is exciting, but also challenging. Essentially, it is about converting unstructured text into structured data. Although this approach has been around for some time, it gained significant traction and entered mainstream applications with the emergence of large language models (LLMs).

Extracting entities and relationships from text to build a knowledge graph.

In the figure above, you can see how information extraction turns raw text into a knowledge graph. On the left, several documents contain unstructured sentences about people and their relationships with companies. On the right, the same information is represented as a graph of entities and their connections, showing who works for or founded which organizations.

But why would you want to extract structured information from text and represent it as a graph? A key reason is to support retrieval-augmented generation (RAG) applications. While using text embedding models over unstructured text is useful, it falls short when answering complex, multi-hop questions ([1]) that require understanding connections between multiple entities, or questions that require structured operations such as filtering, sorting, and aggregation ([2]). By extracting structured information from text and building a knowledge graph, you not only organize data more efficiently but also create a powerful framework for understanding complex relationships between entities. This structured approach makes it easier to retrieve and use specific information, expands the range of questions you can answer, and improves accuracy.

About a year ago, I started experimenting with building graphs using LLMs ([3]), and due to growing interest, we decided to integrate this capability into LangChain as the LLM Graph Transformer ([4]). Over the past year, we have gained valuable insights and introduced new features, which we showcase in this blog post.

The code is available on GitHub ([5]) .

Setting up the Neo4j environment

We will use Neo4j as the underlying graph store, which comes with out-of-the-box graph visualization. The easiest way to get started is to use a free Neo4j Aura ([6]) instance, which provides a cloud-hosted Neo4j database. Alternatively, you can set up a local Neo4j database by downloading the Neo4j Desktop ([7]) application and creating a local database instance.

from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph(
    url="bolt://54.87.130.140:7687",
    username="neo4j",
    password="cables-anchors-directories",
    refresh_schema=False
)
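
Before moving on, it can be worth confirming that the connection details are correct. The snippet below is a minimal sanity check; it assumes nothing beyond the graph object created above and uses the query method to run a trivial Cypher statement.

# Minimal connectivity check: runs a trivial Cypher statement and prints the result.
# If the URL or credentials are wrong, this raises an exception instead.
print(graph.query("RETURN 1 AS ok"))  # expected output: [{'ok': 1}]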

LLM Graph Transformer

The LLM Graph Transformer aims to provide a flexible framework for building graphs with any LLM. With so many different providers and models available, this is no simple task. Fortunately, LangChain steps in and handles much of the standardization. As for the LLM Graph Transformer itself, it's like two cats stacked in a trench coat: it can operate in two completely independent modes.

The LLM Graph Transformer consists of two independent modes for extracting graphs from text. Image by the author.

The LLM Graph Transformer operates in two distinct modes, each designed to generate graphs from documents using an LLM in different scenarios.

1. Tool-based mode (default): When the LLM supports structured output or function calling, this mode uses the LLM's built-in with_structured_output to invoke tools. The tool specification defines the output format, ensuring that entities and relationships are extracted in a structured, predefined way. This is shown on the left side of the figure, with the code for the Node and Relationship classes.

2. Prompt-based mode (fallback): In cases where the LLM does not support tools or function calling, the LLM Graph Transformer falls back to a purely prompt-based approach. This mode uses few-shot examples to define the output format, guiding the LLM to extract entities and relationships as text. The LLM's output is then parsed into JSON by a custom function. The JSON is used to populate nodes and relationships, just as in the tool-based mode, but here the LLM is guided entirely by the prompt rather than by structured tools. This is shown on the right side of the figure, with the example prompt and the resulting JSON output.

Note that even with models that support tools or functions, you can use prompt-based extraction by setting the attribute ignore_tool_usage=True.

Tool-based extraction

We initially chose a tool-based extraction approach because it minimized the need for extensive prompt engineering and custom parsing functions. In LangChain, the with_structured_output method allows you to extract information using tools or functions, with the output defined either through a JSON structure or a Pydantic object. Personally, I find Pydantic objects clearer, so we went with that.

We first define a Node class.

class Node(BaseNode):
    id: str = Field(..., description="Name or human-readable unique identifier")
    label: str = Field(..., description=f"Available options are {enum_values}")
    properties: Optional[List[Property]]

Each node has an id, a label, and optional properties. For brevity, the full descriptions aren't included here. Describing the id as a human-readable unique identifier is important because some LLMs tend to understand ID properties in a more traditional sense, such as random strings or incrementing integers. Instead, we want the entity's name to be used as the id property. We also limit the available label types by simply listing them in the label description. Additionally, LLMs like OpenAI's support an enum parameter, which we also use.

Next, let's look at the Relationship class.


class Relationship(BaseRelationship):
    source_node_id: str
    source_node_label: str = Field(..., description=f"Available options are {enum_values}")
    target_node_id: str
    target_node_label: str = Field(..., description=f"Available options are {enum_values}")
    type: str = Field(..., description=f"Available options are {enum_values}")
    properties: Optional[List[Property]]

This is the second iteration of the Relationship class. Initially, we used nested Node objects for the source and target nodes, but we quickly found that nesting reduced the accuracy and quality of the extraction. So we decided to flatten the source and target nodes into separate fields, for example source_node_id and source_node_label, along with target_node_id and target_node_label. Additionally, we define the allowed values in the descriptions of node labels and relationship types to ensure the LLM adheres to the specified graph schema.
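
To make the flattened format concrete, an extracted relationship would look roughly like the following (an illustrative instance built from the class above, not output copied from a real run):

# Illustrative flattened relationship: source and target are referenced by
# id and label fields rather than nested Node objects.
Relationship(
    source_node_id="Marie Curie",
    source_node_label="Person",
    target_node_id="Pierre Curie",
    target_node_label="Person",
    type="SPOUSE",
    properties=[],
)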

The tool-based extraction method allows us to define properties for nodes and relationships. Below are the classes we use to define them.


class Property(BaseModel):
    """A single property consisting of key and value"""
    key: str = Field(..., description=f"Available options are {enum_values}")
    value: str

Each Property is defined as a key-value pair. While this approach is flexible, it has limitations. For example, we cannot provide a unique description for each property, nor can we specify that some properties are mandatory while others are optional, so all properties are defined as optional. Additionally, properties aren't defined individually for each node or relationship type; they are shared across all of them.
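
For illustration, a node with properties expressed this way would look roughly as follows (hypothetical values, using the classes defined above):

# Node properties are generic key-value pairs shared across all node types.
Node(
    id="Marie Curie",
    label="Person",
    properties=[
        Property(key="birth_date", value="7 November 1867"),
        Property(key="death_date", value="4 July 1934"),
    ],
)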

We also implemented a detailed system prompt ([8]) to help guide the extraction. However, in my experience, function and parameter descriptions often have a greater impact than system messages.

Unfortunately, there is currently no easy way to customize function or parameter descriptions in LLM Graph Transformer.

Prompt-based extraction

Since only a few commercial LLMs and LLaMA 3 support native tools, we implemented a fallback for models without tool support. Even when using a model that does support tools, you can set ignore_tool_usage=True to switch to the prompt-based approach.

Most of the prompt engineering and examples for the prompt-based approach were contributed by Geraldus Wilsen ([9]).

In the prompt-based approach, we have to define the output structure directly in the prompt. You can find the full prompt here ([10]) . In this blog post, we'll just give a high-level overview. We'll start by defining the system prompt.


You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph. Your task is to identify the entities and relations specified in the user prompt from a given text and produce the output in JSON format. This output should be a list of JSON objects, with each object containing the following keys:

- **"head"**: The text of the extracted entity, which must match one of the types specified in the user prompt.
- **"head_type"**: The type of the extracted head entity, selected from the specified list of types.
- **"relation"**: The type of relation between the "head" and the "tail," chosen from the list of allowed relations.
- **"tail"**: The text of the entity representing the tail of the relation.
- **"tail_type"**: The type of the tail entity, also selected from the provided list of types.

Extract as many entities and relationships as possible.

**Entity Consistency**: Ensure consistency in entity representation. If an entity, like "John Doe," appears multiple times in the text under different names or pronouns (e.g., "Joe," "he"), use the most complete identifier consistently. This consistency is essential for creating a coherent and easily understandable knowledge graph.

**Important Notes**:

- Do not add any extra explanations or text.


A key difference in the prompt-based approach is that we ask the LLM to extract only relationships, not individual nodes. This means we won't get any isolated nodes, unlike with the tool-based approach. Additionally, because models lacking native tool support typically perform worse, we don't allow extraction of any properties, whether for nodes or relationships, to keep the extracted output simpler.

Next, we add a handful of few-shot examples to the prompt.

examples = [
    {
        "text": (
            "Adam is a software engineer in Microsoft since 2009, "
            "and last year he got an award as the Best Talent"
        ),
        "head": "Adam",
        "head_type": "Person",
        "relation": "WORKS_FOR",
        "tail": "Microsoft",
        "tail_type": "Company",
    },
    {
        "text": (
            "Adam is a software engineer in Microsoft since 2009, "
            "and last year he got an award as the Best Talent"
        ),
        "head": "Adam",
        "head_type": "Person",
        "relation": "HAS_AWARD",
        "tail": "Best Talent",
        "tail_type": "Award",
    },
    ...
]

Adding custom few-shot examples or extra instructions is not currently supported in this approach. The only way to customize it is to replace the whole prompt through the prompt attribute. We are actively considering expanding the customization options.
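
For example, passing a custom prompt could look roughly like this. This is a sketch under the assumption that the constructor's prompt parameter accepts a ChatPromptTemplate and that the human message uses an {input} variable, as in the default prompt; the system message shown here is a placeholder, not the library's actual prompt.

from langchain_core.prompts import ChatPromptTemplate

# Hypothetical custom prompt: it replaces the entire built-in prompt, so it
# must describe the expected JSON output format itself.
custom_prompt = ChatPromptTemplate.from_messages([
    ("system", "You extract entities and relations and answer strictly in JSON ..."),
    ("human", "Extract information from the following text: {input}"),
])

custom_transformer = LLMGraphTransformer(
    llm=llm,
    ignore_tool_usage=True,
    prompt=custom_prompt,
)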

Next, let's look at how to define the graph schema.

Define the graph schema

When using the LLM Graph Transformer for information extraction, defining a graph schema is essential for guiding the model to build meaningful and structured knowledge representations. A well-defined schema specifies the types of nodes and relationships to be extracted, along with any properties associated with each. This schema serves as a blueprint, ensuring that the LLM consistently extracts relevant information in a way that conforms to the desired knowledge graph structure.

In this blog post, we'll use the opening paragraph of Marie Curie's Wikipedia page ([11]) for testing, with an extra sentence about Robin Williams added at the end.


from langchain_core.documents import Document

text = """
Marie Curie, 7 November 1867 – 4 July 1934, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
Also, Robin Williams.
"""
documents = [Document(page_content=text)]

We will also use GPT-4o in all examples.


from langchain_openai import ChatOpenAI
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI api key")

llm = ChatOpenAI(model='gpt-4o')

First, let's see how the extraction process works without defining any graph schema.

from langchain_experimental.graph_transformers import LLMGraphTransformer
no_schema = LLMGraphTransformer(llm=llm)

Now we can process the documents with the asynchronous aconvert_to_graph_documents function. Using async is recommended for LLM extraction because it allows multiple documents to be processed in parallel. This approach can significantly reduce latency and improve throughput, especially when dealing with many documents.

data = await no_schema.aconvert_to_graph_documents(documents)
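
If you are running this in a plain Python script rather than a notebook, you can drive the coroutine with asyncio. The sketch below also shows how several document batches could be processed concurrently with asyncio.gather (illustrative only, using the no_schema transformer defined above).

import asyncio

async def extract_batches(batches):
    # Each batch is a list of Document objects; process the batches concurrently.
    tasks = [no_schema.aconvert_to_graph_documents(batch) for batch in batches]
    results = await asyncio.gather(*tasks)
    # Flatten the per-batch results into a single list of graph documents.
    return [graph_doc for batch_result in results for graph_doc in batch_result]

# In a script: data = asyncio.run(extract_batches([documents]))
# In a notebook, simply `await` the coroutine as shown above.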

The response of the LLM Graph Transformer is a list of graph documents with the following structure:

[
    GraphDocument(
        nodes=[
            Node(id="Marie Curie", type="Person", properties={}),
            Node(id="Pierre Curie", type="Person", properties={}),
            Node(id="Nobel Prize", type="Award", properties={}),
            Node(id="University Of Paris", type="Organization", properties={}),
            Node(id="Robin Williams", type="Person", properties={}),
        ],
        relationships=[
            Relationship(
                source=Node(id="Marie Curie", type="Person", properties={}),
                target=Node(id="Nobel Prize", type="Award", properties={}),
                type="WON",
                properties={},
            ),
            Relationship(
                source=Node(id="Marie Curie", type="Person", properties={}),
                target=Node(id="Nobel Prize", type="Award", properties={}),
                type="WON",
                properties={},
            ),
            Relationship(
                source=Node(id="Marie Curie", type="Person", properties={}),
                target=Node(
                    id="University Of Paris", type="Organization", properties={}
                ),
                type="PROFESSOR",
                properties={},
            ),
            Relationship(
                source=Node(id="Pierre Curie", type="Person", properties={}),
                target=Node(id="Nobel Prize", type="Award", properties={}),
                type="WON",
                properties={},
            ),
        ],
        source=Document(
            metadata={"id": "de3c93515e135ac0e47ca82a4f9b82d8"},
            page_content="\nMarie Curie, 7 November 1867 – 4 July 1934, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.\nShe was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.\nHer husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.\nShe was, in 1906, the first woman to become a professor at the University of Paris.\nAlso, Robin Williams!\n",
        ),
    )
]

The graph document describes the extracted nodes and relationships. Additionally, the source document is added under the source key.
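
Because the result is a list of GraphDocument objects, the extraction can also be inspected programmatically, for example:

# Inspect the extraction from the first document.
print(data[0].nodes)                     # extracted Node objects
print(data[0].relationships)             # extracted Relationship objects
print(data[0].source.page_content[:80])  # the originating source document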

We can visualize the output in the Neo4j Browser, which gives a clearer and more intuitive picture of the data.

Visualization of two extractions from the same dataset without a defined graph schema. Image by the author.

The image above shows two extraction runs over the same paragraph about Marie Curie. In this case, we used GPT-4o with tool-based extraction, which also allows isolated nodes, as shown in the image. Because no graph schema is defined, the LLM decides at runtime what information to extract, which can lead to variation in the output, even for the same paragraph. As a result, some extractions are more detailed than others, and even the same information may be structured differently. For example, on the left, Marie is represented as the WINNER of the Nobel Prize, while on the right, she WON the Nobel Prize.

Now let's try the same extraction using the prompt-based approach. For models that support tools, you can enable prompt-based extraction by setting the ignore_tool_usage parameter.

no_schema_prompt = LLMGraphTransformer(llm=llm, ignore_tool_usage=True)
data = await no_schema_prompt.aconvert_to_graph_documents(documents)

Again, we can visualize two separate executions in the Neo4j browser.

Visualization of two extractions of the same dataset using the prompt-based approach without a defined graph schema. Image by the author.

With the prompt-based approach, we don't see any isolated nodes. However, as with the previous extractions, the schema can vary between runs, resulting in different outputs for the same input.

Next, let's look at how defining a graph schema can help produce more consistent output.

Define allowed nodes

Constraining the extracted graph structure is highly beneficial because it steers the model toward the specific entities and relationships you care about. With a clearly defined schema, extraction becomes more consistent, the output more predictable, and it stays aligned with the information you actually need. This reduces variation between runs and ensures the extracted data follows a standardized structure, capturing the expected information. With a well-defined schema, the model is less likely to overlook key details or introduce unexpected elements, resulting in cleaner, more usable graphs.

We'll start by defining the expected node types to extract via the allowed_nodes parameter.


allowed_nodes = ["Person", "Organization", "Location", "Award", "ResearchField"]
nodes_defined = LLMGraphTransformer(llm=llm, allowed_nodes=allowed_nodes)
data = await allowed_nodes.aconvert_to_graph_documents(documents)

Here, we define that the LLM should extract five node types: Person, Organization, Location, Award, and ResearchField. For comparison, we visualize two separate executions in the Neo4j Browser.

Visualization of two extractions using predefined node types. Images provided by the author.

By specifying the expected node types, we achieve more consistent node extraction. However, some variation remains. For example, in the first run, "radioactivity" was extracted as a research field, while in the second it was not.

Since we haven't defined the relationships, their types can also vary between runs. Moreover, some runs may capture more information than others. For example, the MARRIED_TO relationship between Marie and Pierre does not appear in both extractions.

Now, let's explore how to define relationship types to further improve consistency.

Define allowed relationships

As we observed, defining only node types still leaves room for variation in relationship extraction. To address this, let's look at how to define relationships as well. The first approach is to specify the allowed relationships using a list of available types.


allowed_nodes = ["Person", "Organization", "Location", "Award", "ResearchField"]
allowed_relationships = ["SPOUSE", "AWARD", "FIELD_OF_RESEARCH", "WORKS_AT", "IN_LOCATION"]
rels_defined = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships
)
data = await rels_defined.aconvert_to_graph_documents(documents)

Let's examine two separate extractions again.

Visualization of two extractions using predefined nodes and relationship types. Images provided by the author.

With both nodes and relationships defined, our outputs become significantly more consistent. For example, Marie is always shown as winning an award, being Pierre's spouse, and working at the University of Paris. However, because the relationships are specified as a general list without restricting which nodes they can connect, some variation still occurs. For example, the FIELD_OF_RESEARCH relationship may appear between a Person and a ResearchField, but sometimes it connects an Award to a ResearchField. Additionally, since relationship directions are not defined, the direction of extracted relationships may be inconsistent.

To address the inability to specify which nodes a relationship can connect and to enforce relationship direction, we recently introduced a new option for defining relationships, shown below.


allowed_nodes = ["Person", "Organization", "Location", "Award", "ResearchField"]
allowed_relationships = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "AWARD", "Award"),
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "IN_LOCATION", "Location"),
    ("Person", "FIELD_OF_RESEARCH", "ResearchField")
]
rels_defined = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships
)
data = await rels_defined.aconvert_to_graph_documents(documents)

Instead of defining relationships as a simple list of strings, we now use a three-element tuple format, where the elements represent the source node type, the relationship type, and the target node type, respectively.

Let's visualize the results again.

Visualization of two extractions using predefined nodes and advanced relationship types. Images provided by the author.

Using the three-tuple approach provides a much more consistent schema for the extracted graphs across multiple executions. However, given the nature of LLMs, there may still be some variation in the level of detail extracted. For example, on the right side, Pierre is shown as having won the Nobel Prize, while on the left, that information is missing.

Define properties

The final enhancement we can make to the graph schema is to define properties for nodes and relationships. Here we have two options. The first is to set node_properties or relationship_properties to True, letting the LLM decide which properties to extract.


allowed_nodes = ["Person", "Organization", "Location", "Award", "ResearchField"]
allowed_relationships = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "AWARD", "Award"),
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "IN_LOCATION", "Location"),
    ("Person", "FIELD_OF_RESEARCH", "ResearchField")
]
node_properties=True
relationship_properties=True
props_defined = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    node_properties=node_properties,
    relationship_properties=relationship_properties
)
data = await props_defined.aconvert_to_graph_documents(documents)
graph.add_graph_documents(data)

Let's check the results.

Extracted node and relationship properties. Image by the author.

We've allowed the LLM to add any node or relationship properties it considers relevant. For example, it chose to include Marie Curie's birth and death dates, her role as a professor at the University of Paris, and the fact that she won the Nobel Prize twice. These additional properties significantly enrich the extracted information.

The second option is to define which node and relationship properties we want to extract.

allowed_nodes = ["Person", "Organization", "Location", "Award", "ResearchField"]
allowed_relationships = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "AWARD", "Award"),
    ("Person", "WORKS_AT", "Organization"),
    ("Organization", "IN_LOCATION", "Location"),
    ("Person", "FIELD_OF_RESEARCH", "ResearchField")
]
node_properties=["birth_date", "death_date"]
relationship_properties=["start_date"]
props_defined = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    node_properties=node_properties,
    relationship_properties=relationship_properties
)
data = await props_defined.aconvert_to_graph_documents(documents)
graph.add_graph_documents(data)

The properties are simply defined as two lists. Let's see what the LLM extracts.

Extracted predefined node and relationship properties. Image by the author.

The birth and death dates are consistent with the previous extraction. This time, however, the LLM also extracted the start date of Marie's professorship at the University of Paris.

Properties do add valuable depth to the extracted information, although the current implementation has some limitations:

• Properties can only be extracted using the tool-based approach.
• All properties are extracted as strings.
• Properties can only be defined globally, not for specific node labels or relationship types.
• There is no option to customize property descriptions to guide the LLM toward more precise extraction.

Strict mode

If you thought we had perfected a way to make the LLM follow the defined schema flawlessly, I have to set the record straight. While we invested considerable effort in prompt engineering, it is challenging to get an LLM, especially a lower-performing model, to follow instructions completely and accurately. To address this, we introduced a post-processing step called strict_mode, which removes any information that does not conform to the defined graph schema, ensuring cleaner and more consistent results.

By default, strict_mode is set to True, but you can disable it with the following code:


LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
    strict_mode=False
)

With strict mode turned off, you may get node or relationship types outside the defined graph schema, as LLMs can be creative with the output structure.

Import graph documents into a graph database

Graph documents extracted by the LLM Graph Transformer can be imported into a graph database such as Neo4j using the add_graph_documents method for further analysis and applications. We'll explore the different import options for different use cases.

Default import

You can use the following code to import nodes and relationships into Neo4j.

graph.add_graph_documents(graph_documents)

This method imports all nodes and relationships directly from the provided graph documents. We've used this approach throughout the blog post to inspect the results of different LLM and schema configurations.

Default import settings. Images provided by the author.

Base entity label

Most graph databases support indexes to optimize data import and retrieval. In Neo4j, indexes can only be set on specific node labels. Since we may not know all node labels in advance, we can handle this by adding a secondary base label to every node using the baseEntityLabel parameter. That way, we can still leverage indexing for efficient imports and retrieval without needing an index for every possible node label in the graph.

graph.add_graph_documents(graph_documents, baseEntityLabel=True)

As mentioned earlier, using the baseEntityLabel parameter will result in each node having an additional Entity label.

Each node gets a secondary label via the baseEntityLabel parameter. Image provided by the author.
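
If your instance does not already have an index on that base label, you could create one yourself. The sketch below assumes the secondary label is named __Entity__ (as in recent LangChain versions) and that nodes are matched on their id property; adjust both if your setup differs.

# Hypothetical index on the shared base label to speed up imports and lookups.
graph.query(
    "CREATE INDEX entity_id_index IF NOT EXISTS FOR (n:__Entity__) ON (n.id)"
)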

Include source documents

The last option is to also import the source document of the extracted nodes and relationships. This method allows us to keep track of which documents each entity appears in. You can import the source document using the include_source parameter.

graph.add_graph_documents(graph_documents, include_source=True)

When inspecting the imported graph, we should see a result similar to the following.

Imported source document. Images provided by the author.

In this visualization, the source document is highlighted in blue, with all entities extracted from it connected by MENTIONS relationships. This setup allows you to build retrievers that use both structured and unstructured search approaches ([12]).
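
For example, once the sources are imported, you could look up which documents mention a given entity directly from Python. This is a sketch that assumes the source nodes carry a Document label, as in recent LangChain versions; the entity id is just an example value.

# Hypothetical lookup: which imported source documents mention Marie Curie?
docs = graph.query(
    "MATCH (d:Document)-[:MENTIONS]->(e {id: $id}) RETURN d",
    params={"id": "Marie Curie"},
)
print(docs)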

Summary

In this post, we explored LangChain's LLM Graph Transformer and its two modes for building knowledge graphs from text. The tool-based mode, our primary approach, uses structured output and function calling, which reduces prompt engineering and allows property extraction. The prompt-based mode, on the other hand, is useful when tools aren't available, relying on few-shot examples to guide the LLM. However, prompt-based extraction does not support property extraction and also does not produce isolated nodes.

We observed that a well-defined graph schema, including allowed node and relationship types, improves the consistency and performance of extraction. A constrained schema helps ensure that the output conforms to the structure we need, making it more predictable, reliable, and usable. Whether using tools or prompts, the LLM Graph Transformer turns unstructured data into more organized, structured representations, supporting better RAG applications and multi-hop query handling.

The code is available on GitHub ([13]). You can also try the LLM Graph Transformer in a no-code environment using Neo4j's hosted LLM Graph Builder application.

Neo4j graph builder([14])


References

• Constructing knowledge graphs from text using OpenAI functions ([15])
• Knowledge Graphs + LLMs = Multi-hop Question Answering ([16])
• Limitations of Text Embeddings in RAG Applications ([17])
• Enhancing the Accuracy of RAG Applications with Knowledge Graphs ([18])