Microsoft introduced the concept of GraphRAG back in April of this year, and last week it open-sourced the implementation. The GitHub repo is at https://github.com/microsoft/graphrag and has already collected 6,900+ stars at the time of writing.
Installation
The official docs recommend Python 3.10-3.12. With Python 3.10, project initialization failed for me with errors; after switching to Python 3.11 everything ran normally, presumably because Python 3.10 is incompatible with some of the newer Microsoft SDK dependencies. I therefore recommend a Python 3.11 environment. Installing GraphRAG itself is straightforward; a single command does it:
pip install graphrag
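If you want to verify up front that the interpreter in your environment falls inside the supported range, here is a trivial check; the bounds just mirror the official 3.10-3.12 recommendation, and as noted above 3.11 is the safest pick:

import sys

# GraphRAG officially supports Python 3.10-3.12; 3.11 worked most reliably for me.
ok = (3, 10) <= sys.version_info[:2] <= (3, 12)
print(sys.version.split()[0], "supported" if ok else "unsupported")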
Usage
In this walkthrough I use Ma Boyong's novella 《太白金星有点烦》 as the test corpus to see how well Microsoft's open-sourced GraphRAG handles it.
Note that GraphRAG relies on an LLM to extract entities and relations from the text chunks, so it burns through a lot of tokens. For personal experimentation I don't recommend a GPT-4-class model (too expensive; ignore this advice if money is no object). Balancing cost against quality, I went with the DeepSeek-Chat model.
Initializing the project
Following the official tutorial, I first created a scratch directory myTest, created an input directory under it, and placed the txt version of the book there, renamed to book.txt. Then run python -m graphrag.index --init to initialize the project and generate the configuration files:
mkdir -p ./myTest/input
curl https://www.xxx.com/太白金星有点烦.txt > ./myTest/input/book.txt # example command; put whatever txt file you want to test into input/
cd ./myTest
python -m graphrag.index --init
After it finishes, several new items appear in the current directory (i.e., myTest):
- output: intermediate results of subsequent runs are written here.
- prompts: the prompt templates used during processing.
- .env: the LLM API configuration file; by default it holds a single variable, GRAPHRAG_API_KEY, for the model API key.
- settings.yaml: the overall configuration. If you are using a non-OpenAI model or a non-official API, this is the file to edit so that GraphRAG runs with your settings.
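For reference, the generated .env is just that one key-value line; the value below is a placeholder for your real key:

GRAPHRAG_API_KEY=<your_api_key>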
Configuring the files
First set the model API key in the .env file. This setting is global: once it is configured there, you do not need to repeat it in settings.yaml. The default model in settings.yaml is gpt-4-turbo-preview; if you do not need to change the model or the API endpoint, configuration is already done and you can skip the rest of this section and jump straight to execution.
I use an API key from agicto (mainly because new users get 10 RMB of free credit on sign-up, which is handy for testing). The only things I changed were the API base URL and the chat model name. The complete settings.yaml after my edits:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://api.agicto.cn/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
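Before launching a full index run (which makes a large number of LLM calls), it can be worth a one-off check that the key and endpoint actually respond. A minimal sketch with the openai Python client, assuming the agicto endpoint is OpenAI-compatible as the config above implies:

import os
from openai import OpenAI

# Same endpoint and model as in settings.yaml; the key is the one from .env.
client = OpenAI(api_key=os.environ["GRAPHRAG_API_KEY"],
                base_url="https://api.agicto.cn/v1")
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)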
Building the graph index
This is the core GraphRAG workflow: it builds the graph-based knowledge base that the later question-answering stage runs against. Kick it off with:
python -m graphrag.index
Following the approach laid out in Microsoft's paper, the indexing run performs the following steps:
- Source Documents → Text Chunks: split the source documents into text chunks (a minimal sketch of this step follows the list).
- Text Chunks → Element Instances: extract graph node and edge instances from each text chunk.
- Element Instances → Element Summaries: generate a summary for each graph element.
- Element Summaries → Graph Communities: partition the graph into communities using a community-detection algorithm.
- Graph Communities → Community Summaries: generate a summary for each community.
- Community Summaries → Community Answers → Global Answer: use the community summaries to generate partial answers, then aggregate those into a global answer.
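As referenced in the first step above, here is a minimal sketch of sliding-window token chunking, using the size: 300 / overlap: 100 values and the cl100k_base tokenizer from settings.yaml. It illustrates the idea only and is not GraphRAG's actual implementation:

import tiktoken

def chunk_text(text: str, size: int = 300, overlap: int = 100) -> list[str]:
    # Encode to tokens, then emit windows of `size` tokens where each
    # window overlaps the previous one by `overlap` tokens.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]

chunks = chunk_text(open("input/book.txt", encoding="utf-8").read())
print(f"{len(chunks)} chunks")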
Total runtime depends on the size of the input text. For this example the whole run took roughly 20 minutes and cost about 4 RMB. The output during execution looks like this:
🚀 Reading settings from settings.yaml
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_base_text_un