This article demonstrates how to use large language models to extract structured information from text. We will run a simple comparison between llama3.1 and deepseek. Since langchain may support different models to different degrees, and the models themselves have different characteristics, this comparison does not tell us which model is better overall.
Preparation
Before we start writing code, we need to set up the development environment.
- Computer
  All the code in this article can run in an environment without a GPU. My machine configuration:
  - CPU: Intel i5-8400 2.80GHz
  - RAM: 16GB
- Visual Studio Code and venv
  These are popular development tools. The code in this article can be developed and debugged in Visual Studio Code, and we use python's venv to create a virtual environment. For details, see: Configuring venv in Visual Studio Code.
- Ollama
  Deploying local large language models on the Ollama platform is very convenient. With it, langchain can use llama3.1, qwen2.5, and other local models. For details, see: Using the locally deployed llama3.1 model in langchain. A quick sanity check follows this list.
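Once Ollama is running, a minimal check like the one below confirms that langchain can reach the local model. This is just a sketch; it assumes the Ollama service is up and that `ollama pull llama3.1` has already been done.

```python
# Minimal sanity check: ask the local model for a one-line reply.
# Assumes Ollama is running locally and llama3.1 has been pulled.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0.5)
print(llm.invoke("Say hello in one short sentence.").content)
```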
Defining the Data Format
We first use Pydantic to define the format of the data we want to extract.
```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[float] = Field(
        default=None, description="Height measured in meters"
    )
```
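As a quick check of the schema itself, we can instantiate it directly with no LLM involved; the values below are made up purely for illustration.

```python
# Instantiate the schema directly to verify field names and types.
p = Person(name="Alan Smith", hair_color="blond", height_in_meters=1.83)
print(p.model_dump())
# {'name': 'Alan Smith', 'hair_color': 'blond', 'height_in_meters': 1.83}
```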
Setting Up the Prompt
```python
from langchain_core.prompts import ChatPromptTemplate

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)
```
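To see exactly what gets sent to the model, the template can be rendered by hand. A small sketch with a made-up input:

```python
# Fill the template and inspect the resulting chat messages.
rendered = prompt_template.invoke({"text": "Alan Smith is 1.83 meters tall."})
print(rendered.to_messages())
```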
Extracting a Person
```python
from langchain_ollama import ChatOllama


def single_entry(model_name, text):
    structured_llm = ChatOllama(
        model=model_name, temperature=0.5, verbose=True
    ).with_structured_output(schema=Person)
    prompt = prompt_template.invoke({"text": text})
    response = structured_llm.invoke(prompt)
    return response
```
We test it with the following code.
text = "Alan Smith is 1.83 meters tall and has blond hair."
response = single_entry("llama3.1",text)
print(f'\n llama3.1 response:\n{response}')
response = single_entry("deepseek-r1",text)
print(f'\n deepseek-r1 response:\n{response}')
text = "Alan Smith is 6 feet tall and has blond hair."
response = single_entry("llama3.1",text)
print(f'\n llama3.1 response:\n{response}')
response = single_entry("deepseek-r1",text)
print(f'\n deepseek-r1 response:\n{response}')
Test results:
| | llama3.1 | deepseek-r1-tool-calling |
|---|---|---|
| text = "Alan Smith is 1.83 meters tall and has blond hair." | name='Alan Smith' hair_color='blond' height_in_meters=1.83 | name='Alan Smith' hair_color='blond' height_in_meters=1.83 |
| text = "Alan Smith is 6 feet tall and has blond hair." | name='Alan Smith' hair_color='blond' height_in_meters=None | name=None hair_color='blond' height_in_meters=1.8288 |
The first case is fairly simple, and both models did well. The second is harder because the model has to convert feet to meters: llama3.1 did not do the conversion and failed to fill in height_in_meters, while deepseek-r1 converted feet to meters correctly (6 feet = 1.8288 meters) but failed to extract name.
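The commented-out MessagesPlaceholder('examples') in the prompt template above hints at one possible remedy: supplying a worked reference example. Below is a hedged sketch; the name "Mary Lee" and the plain-text answer message are made up for illustration (langchain's extraction how-to formats such examples as tool-call messages, which this sketch does not reproduce).

```python
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Same system instructions as before, with the examples slot enabled.
prompt_with_examples = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"),
        ("human", "{text}"),
    ]
)

# One worked feet-to-meters example (5 feet = 1.524 meters); the
# example person and values here are hypothetical.
examples = [
    HumanMessage("Mary Lee is 5 feet tall."),
    AIMessage('{"name": "Mary Lee", "hair_color": null, "height_in_meters": 1.524}'),
]

prompt = prompt_with_examples.invoke(
    {"text": "Alan Smith is 6 feet tall and has blond hair.", "examples": examples}
)
```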
Extracting Multiple Persons
```python
from typing import List


class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]


def multiple_entry(model_name, text):
    structured_llm = ChatOllama(
        model=model_name, temperature=0.5, verbose=True
    ).with_structured_output(schema=Data)
    prompt = prompt_template.invoke({"text": text})
    response = structured_llm.invoke(prompt)
    return response
```
Let's compare the two models:
text = "Alan Smith is 1.83 meters tall and has blond hair. John Doe is 1.72 meters tall and has brown hair."
response = multiple_entry("llama3.1",text)
print(f'\n llama3.1 response:\n{response}')
response = multiple_entry("MFDoom/deepseek-r1-tool-calling:7b",text)
print(f'\n deepseek response:\n{response}')
text = "Alan Smith is 1.88 meters tall and has blond hair. John Doe is 7 feet tall and has brown hair."
response = multiple_entry("llama3.1",text)
print(f'\n llama3.1 response:\n{response}')
response = multiple_entry("MFDoom/deepseek-r1-tool-calling:7b",text)
print(f'\n deepseek response:\n{response}')
| | llama3.1 | deepseek-r1-tool-calling |
|---|---|---|
| Alan Smith is 1.83 meters tall and has blond hair. John Doe is 1.72 meters tall and has brown hair. | people=[Person(name='Alan Smith', hair_color=None, height_in_meters=None), Person(name='John Doe', hair_color=None, height_in_meters=None)] | failed |
| Alan Smith is 1.88 meters tall and has blond hair. John Doe is 7 feet tall and has brown hair. | people=[Person(name=None, hair_color=None, height_in_meters=None), Person(name=None, hair_color=None, height_in_meters=None)] | people=[Person(name=None, hair_color='blond', height_in_meters=None)] |
Neither model did well this time. Presumably a model with more parameters could parse the content correctly.
Summary
From the exercises above, we can see that llama3.1 and deepseek do reasonably well when extracting a single object, but both fail at extracting a list. deepseek seems to reason through the problem in a more elaborate way and takes longer at inference; it also seems somewhat less stable: inference sometimes succeeds and sometimes fails.
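Given that flakiness, a simple retry wrapper can paper over occasional failures. This is a minimal sketch; `multiple_entry` is the function defined above, and the retry count of 3 is arbitrary.

```python
def invoke_with_retry(model_name, text, attempts=3):
    # Retry a few times, treating exceptions and empty results as failures.
    for i in range(attempts):
        try:
            result = multiple_entry(model_name, text)
            if result is not None:
                return result
        except Exception as e:
            print(f"attempt {i + 1} failed: {e}")
    return None
```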
Code
All the code and related resources for this article have been shared; see:
References:
🪐 Good luck 🪐