English | 简体中文

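IEPile: A Large-Scale Information Extraction Corpus

This is the official repository for the paper IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus.

Datasets | Paper | Usage | Limitations | Statement and License | Citation

Please note that IEPile may be updated (we will notify you when an update is released). We recommend using the latest version.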

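News

[2024/05] The paper IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus was accepted to ACL 2024.
[2024/04] Released OneKE, a bilingual (Chinese and English) LLM-based knowledge extraction framework, together with an open-source version obtained by full-parameter fine-tuning of Chinese-Alpaca-2-13B.
[2024/02] Released IEPile, a large-scale (0.32B tokens) bilingual (Chinese and English) information extraction (IE) instruction dataset, along with two models trained on it: baichuan2-13b-iepile-lora and llama2-13b-iepile-lora.
[2023/10] Released InstructIE, a new bilingual (Chinese and English) topic-based IE instruction dataset, together with its paper.
[2023/08] Released knowlm-13b-ie, a 13B model dedicated to information extraction (IE).
[2023/05] Started the instruction-based information extraction project.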

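1.Introduction

IEPile dataset download links: Google Drive | Hugging Face | WiseModel | ModelScope

Please note that the data behind the links above excludes the parts related to the ACE2005 dataset. If you need the unfiltered, complete dataset and have already obtained the required permissions, please contact us by email at guihonghao@zju.edu.cn or zhangningyu@zju.edu.cn, and we will provide the complete dataset.

We carefully collected and cleaned existing information extraction (IE) data, integrating 26 English IE datasets and 7 Chinese IE datasets. As shown in Figure 1, these datasets cover multiple domains, including general, medical, and financial.

Using the proposed schema-based round-robin instruction construction method, we built IEPile, a large-scale, high-quality bilingual (Chinese and English) IE instruction fine-tuning dataset containing about 0.32B tokens.

Based on IEPile, we fine-tuned Baichuan2-13B-Chat and LLaMA2-13B-Chat with LoRA. Experiments show that the fine-tuned Baichuan2-IEPile and LLaMA2-IEPile models achieve comparable results on fully supervised training sets and improve on zero-shot information extraction tasks.

Zero-shot information extraction results

Fully supervised dataset results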

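2.Data

2.1Construction of IEPile

We focus on instruction-based IE, so the construction of the schema in the instruction is crucial: it reflects the specific extraction requirement and varies dynamically. However, existing work often handles the schema rather coarsely when building instructions, namely by using all schemas in the label set. This approach has two potential problems:

1. The number of schemas queried is inconsistent between training and evaluation, which may hurt generalization. For example, if roughly 20 schemas are queried per instruction during training but 10 or 30 during evaluation, model performance may suffer even when the schemas are similar in content to those seen in training.
2. The schemas in an instruction are not contrastive enough. Semantically similar schemas such as "layoff", "resignation", and "dismissal" are ambiguous enough to confuse the model. Such easily confused schemas should appear together in the instruction set more frequently.

We therefore propose two remedies: (1) construct a hard negative schema dictionary; (2) generate instructions in a round-robin manner.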

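Hard Negatives

Suppose dataset $\mathcal{D}$ has a full label set $L$, and $S$ is a text in $\mathcal{D}$. The labels that actually occur in $S$ form the positive label set $Pos_L$, and the labels that do not occur form the negative label set $Neg_L$. In our analysis, we found that the main cause of misjudgment is semantic ambiguity among schemas, which confuses the model. Traditional approaches simply define $Neg_L$ as $L - Pos_L$, which overlooks an important point: negative labels that are semantically close to positive labels deserve special attention. Inspired by contrastive learning, we construct a hard negative dictionary $\mathcal{K}$, whose keys are schemas and whose values are the sets of schemas semantically close to each key. The hard negative set is therefore $Hard_L = \mathcal{K}[Pos_L]$. However, if $Neg_L$ consisted only of $Hard_L$, the model would lack enough negatives to learn from. We therefore define the other negatives as $Other_L = L - Hard_L$, and the final negative label set $Neg_L$ consists of $Hard_L$ plus a small number of labels from $Other_L$. Constructing hard negatives this way makes semantically similar schemas co-occur more often in instructions, and it also reduces the number of training samples without sacrificing performance (for example, 49 schemas that would originally require 12 instructions can be reduced to 3).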
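
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function name `build_negative_schemas`, the `num_other` parameter, and the toy labels are hypothetical, and positives are additionally removed from the "other" pool so that a positive label is never sampled as a negative (an assumption of this sketch).

```python
import random


def build_negative_schemas(all_labels, pos_labels, hard_dict, num_other, seed=0):
    """Return (hard_negatives, sampled_other_negatives) for one text."""
    pos = set(pos_labels)
    # Hard negatives: schemas listed as semantically close to any positive schema.
    hard = set()
    for label in pos:
        hard.update(hard_dict.get(label, []))
    hard -= pos  # a label that occurs in the text is never a negative
    # Other negatives: the rest of the label set (positives excluded in this sketch).
    other = sorted(set(all_labels) - hard - pos)
    rng = random.Random(seed)
    sampled_other = rng.sample(other, min(num_other, len(other)))
    return sorted(hard), sampled_other


# Toy label set and hard negative dictionary (illustrative values only).
labels = ["layoff", "resignation", "dismissal", "award", "strike", "dissolution"]
hard_dict = {
    "layoff": ["resignation", "dismissal"],
    "resignation": ["layoff", "dismissal"],
    "dismissal": ["layoff", "resignation"],
}
hard, other = build_negative_schemas(labels, ["layoff"], hard_dict, num_other=1)
print(hard)   # ['dismissal', 'resignation']
print(other)  # one label drawn from award / dissolution / strike
```

Sampling only a few labels from the "other" pool is what keeps the instruction short while still guaranteeing that the confusable schemas appear next to the positive one.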

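Round-robin Instruction Generation

After the steps above, we obtain the final schema set $L' = Pos_L + Neg_L$. In schema-based IE instruction construction, the schema plays a crucial role: it directly determines the type of information the model must extract and reflects the user's specific needs. Traditional practice puts the complete schema into a single instruction. In this work we instead adopt a round-robin approach that limits the number of schemas queried per instruction to $split\_num$, with values between 4 and 6. $L'$ is thus divided into roughly $|L'|/split\_num$ batches (rounded up), each querying at most $split\_num$ schemas. Even if the total number of schemas queried at evaluation time differs from training, the round-robin mechanism spreads the queries into batches of $split\_num$, which mitigates the drop in generalization.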
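
As a rough illustration of the batching step, the sketch below cuts a schema list into consecutive chunks of at most `split_num` items. The helper name `split_schemas` is hypothetical, and plain consecutive chunking is an assumption; the repository's converter may distribute schemas across batches differently.

```python
import math


def split_schemas(schemas, split_num):
    """Split a schema list into consecutive batches of at most split_num items."""
    if split_num <= 0:
        # Treat non-positive values as "do not split" (mirrors the -1 convention
        # described for the split_num option of convert_func.py).
        return [list(schemas)]
    return [list(schemas[i:i + split_num]) for i in range(0, len(schemas), split_num)]


schemas = [f"schema_{i}" for i in range(10)]
batches = split_schemas(schemas, split_num=4)
print([len(b) for b in batches])                 # [4, 4, 2]
print(len(batches) == math.ceil(len(schemas) / 4))  # True
```

Each batch then becomes the `schema` list of one instruction over the same input text, so a text with 10 candidate schemas yields three instructions instead of one.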

2.2Data Format of IEPile

Each record in IEPile contains four fields: task, source, instruction, and output.

Below is a sample record:

{
  "task": "NER", 
  "source": "MSRA", 
  "instruction": "{\"instruction\": \"你是专门进行实体抽取的专家。请从input中抽取出符合schema定义的实体，不存在的实体类型返回空列表。请按照JSON字符串的格式回答。\", \"schema\": [\"组织机构\", \"地理位置\", \"人物\"], \"input\": \"对于康有为、梁启超、谭嗣同、严复这些从旧文化营垒中走来的年轻“布衣”，他们背负着沉重的历史包袱，能够挣脱旧传统的束缚，为拯救民族的危亡而献身，实在是中华民族的脊梁。\"}", 
  "output": "{\"组织机构\": [], \"地理位置\": [\"中华\"], \"人物\": [\"康有为\", \"梁启超\", \"谭嗣同\", \"严复\"]}"
}

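In this record, the task is NER, the source dataset is MSRA, the schema list to extract is ["组织机构", "地理位置", "人物"], the text to extract from is "对于康有为、梁启超、谭嗣同、严复这些从旧文化营垒中走来的年轻“布衣”，他们背负着沉重的历史包袱，能够挣脱旧传统的束缚，为拯救民族的危亡而献身，实在是中华民族的脊梁。", and the output is {"组织机构": [], "地理位置": ["中华"], "人物": ["康有为", "梁启超", "谭嗣同", "严复"]}.

Note that the order of schemas in the output matches the order of schemas in the instruction.

More task examples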

{
  "task": "RE", 
  "source": "DuIE2.0", 
  "instruction": "{\"instruction\": \"你是专门进行关系抽取的专家。请从input中抽取出符合schema定义的关系三元组，不存在的关系返回空列表。请按照JSON字符串的格式回答。\", \"schema\": [\"国籍\", \"作者\", \"毕业院校\", \"主角\"], \"input\": \"对比日本动画电影在中日两国的票房表现，可以发现，日漫风格的动画，在国内也有圈层限制，即便是宫崎骏《千与千寻》、新海诚《你的名字》，这类日本动画票房榜首的电影，国内票房也停留在5亿左右\"}", 
  "output": "{\"国籍\": [], \"作者\": [{\"subject\": \"你的名字\", \"object\": \"新海诚\"}], \"毕业院校\": [], \"主角\": []}"
}

{
  "task": "EE", 
  "source": "DuEE1.0", 
  "instruction": "{\"instruction\": \"你是专门进行事件提取的专家。请从input中抽取出符合schema定义的事件，不存在的事件返回空列表，不存在的论元返回NAN，如果论元存在多值请返回列表。请按照JSON字符串的格式回答。\", \"schema\": [{\"event_type\": \"人生-求婚\", \"trigger\": true, \"arguments\": [\"求婚对象\"]}, {\"event_type\": \"人生-订婚\", \"trigger\": true, \"arguments\": [\"订婚主体\", \"时间\"]}, {\"event_type\": \"灾害/意外-坍/垮塌\", \"trigger\": true, \"arguments\": [\"受伤人数\", \"坍塌主体\"]}, {\"event_type\": \"人生-失联\", \"trigger\": true, \"arguments\": [\"地点\", \"失联者\"]}], \"input\": \"郭碧婷订婚后，填资料依旧想要填单身，有谁注意向佐说了什么？\"}", 
  "output": "{\"人生-求婚\": [], \"人生-订婚\": [{\"trigger\": \"订婚\", \"arguments\": {\"订婚主体\": [\"向佐\", \"郭碧婷\"], \"时间\": \"NAN\"}}], \"灾害/意外-坍/垮塌\": [], \"人生-失联\": []}"
}

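The fields are described below:

| Field | Description |
| --- | --- |
| task | The task this record belongs to, one of five tasks (`NER`, `RE`, `EE`, `EET`, `EEA`). |
| source | The dataset this record comes from. |
| instruction | The instruction fed to the model, serialized to a JSON string with json.dumps and made up of three parts: `"instruction"`, `"schema"`, and `"input"`. |
| output | The output, a dictionary serialized as a JSON string, where each key is a schema and each value is the extracted content. |

In IEPile, the instruction uses a JSON-like string format, essentially a dictionary serialized to a string, with three main parts: (1) 'instruction': the task description, which outlines the task to perform (one of NER, RE, EE, EET, EEA). (2) 'schema': the list of schemas to extract (entity types, relation types, or event types). (3) 'input': the text to extract from.

instruction.py provides the instruction templates for each task.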
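
Because both `instruction` and `output` are JSON strings nested inside the record, reading a record takes two decoding steps. The sketch below shows this on a shortened version of the NER sample above; the helper name `decode_record` and the abbreviated instruction and input texts are illustrative assumptions.

```python
import json

# A shortened record in the IEPile layout (texts abbreviated for illustration).
record = {
    "task": "NER",
    "source": "MSRA",
    "instruction": json.dumps(
        {
            "instruction": "你是专门进行实体抽取的专家。",
            "schema": ["组织机构", "地理位置", "人物"],
            "input": "康有为、梁启超实在是中华民族的脊梁。",
        },
        ensure_ascii=False,
    ),
    "output": json.dumps(
        {"组织机构": [], "地理位置": ["中华"], "人物": ["康有为", "梁启超"]},
        ensure_ascii=False,
    ),
}


def decode_record(rec):
    """Decode the nested JSON strings and return (schema, input_text, output_dict)."""
    inst = json.loads(rec["instruction"])
    out = json.loads(rec["output"])
    return inst["schema"], inst["input"], out


schema, text, output = decode_record(record)
# The output keys follow the same order as the schema list in the instruction.
print(list(output.keys()) == schema)  # True
```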

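3.Using IEPile to Train Models

3.1Environment

Before you begin, create a suitable virtual environment as follows:

conda create -n IEPile python=3.9   # Create the virtual environment
conda activate IEPile               # Activate the environment
pip install -r requirements.txt     # Install dependencies

3.2Download Data and Models

IEPile dataset download links: Google Drive | Hugging Face

IEPile
├── train.json    # Training set
└── dev.json      # Validation set

Base models supported by the code in this repository include: [llama, alpaca, vicuna, zhixi, falcon, baichuan, chatglm, qwen, moss, openba]

mkdir data         # Put data here
mkdir models       # Put base models here
mkdir results      # Put prediction results here
mkdir lora         # Put LoRA fine-tuning results here

3.3LoRA Fine-tuning

Important: all commands below should be run from the IEPile directory. For example, to run the fine-tuning script, use: bash ft_scripts/fine_llama.bash. Make sure your current working directory is correct, and that every record in the training/validation files contains the instruction and output fields.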

output_dir='lora/llama2-13b-chat-v1'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/llama2-13b-chat' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2' \
    --train_file 'data/NER/train.json' \
    --valid_file 'data/NER/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --bf16 \
    --deepspeed configs/ds_config_bf16.json

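CUDA_VISIBLE_DEVICES="0,1,2,3": Specifies which GPUs are available to the training job. Here "0,1,2,3" means GPUs 0, 1, 2, and 3 are used. If your machine has more than four GPUs, this setting lets you choose which four to use.
--nproc_per_node=4: The number of processes to launch on each node. Since four GPUs are specified in this example, four processes are launched, one per GPU.
For single-GPU training, launch the job with CUDA_VISIBLE_DEVICES=0 python src/finetune.py, where CUDA_VISIBLE_DEVICES=0 selects GPU 0 for the run.
model_name: The model architecture name (7B, 13B, Base, and Chat variants share the same architecture). Currently supported: ["llama", "alpaca", "vicuna", "zhixi", "falcon", "baichuan", "chatglm", "qwen", "moss", "openba"]. Note that this parameter is distinct from --model_name_or_path.
model_name_or_path: The model path; download the corresponding model from HuggingFace.
template: The template name, such as alpaca, baichuan, baichuan2, or chatglm3; see src/datamodule/template.py for all supported names. The default is the alpaca template. For Chat models we recommend the matching template; Base models can use alpaca by default.
train_file, valid_file (optional): Paths to the training and validation files. Note: only JSON files are currently supported. ⚠️ If valid_file is not specified, val_set_size samples are split off from train_file as the validation set.
output_dir: Where the LoRA fine-tuned weights are saved.
val_set_size: The number of validation samples, 1000 by default.
per_device_train_batch_size, per_device_eval_batch_size: The batch size on each GPU; adjust it to your GPU memory. 2 to 4 is recommended for an RTX3090.
max_source_length, max_target_length, cutoff_len: The maximum input length, maximum output length, and truncation length. The truncation length can be viewed simply as maximum input length + maximum output length; set suitable values according to your needs and GPU memory.
If you run out of GPU memory when saving the model after the eval phase, set evaluation_strategy to no.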
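
Two of the settings above combine arithmetically, which is worth checking before changing them. The sketch below uses the values from the example command; the helper names are hypothetical, and the effective batch size formula is the usual one for data-parallel training with gradient accumulation, stated here as an assumption about how the script is run.

```python
def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return num_gpus * per_device_batch * grad_accum_steps


def suggested_cutoff_len(max_source_length, max_target_length):
    """cutoff_len viewed as maximum input length + maximum output length."""
    return max_source_length + max_target_length


# Values from the example command: 4 GPUs, batch size 2 per device, 4 accumulation steps.
print(effective_batch_size(4, 2, 4))   # 32
# max_source_length 400 + max_target_length 300 matches the cutoff_len of 700.
print(suggested_cutoff_len(400, 300))  # 700
```

When reducing per_device_train_batch_size to fit GPU memory, raising gradient_accumulation_steps by the same factor keeps the effective batch size unchanged.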

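Quantization can be enabled by setting bits to 4 (--bits 4); this is recommended on an RTX3090.

For more on parameter configuration, see the src/utils/args directory.

The script for fine-tuning LLaMA2-13B-Chat can be found at ft_scripts/fine_llama.bash.

The script for fine-tuning Baichuan2-13B-Chat can be found at ft_scripts/fine_baichuan.bash.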

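4.Continued Training on In-Domain Data

Although the Baichuan2-IEPile and LLaMA2-IEPile models have undergone extensive instruction fine-tuning on multiple general datasets and thus acquired a degree of general IE capability, they may still show limitations on data from specific domains (such as law, education, science, and telecommunications). To address this, we recommend a second round of training on domain-specific datasets, which helps the models adapt to the semantic and structural characteristics of the domain and improves their extraction ability there.

4.1Training Data Conversion

First, the data must be formatted to contain the instruction and output fields. For this we provide the script convert_func.py, which converts data in batch into a format the model can use directly.

Before using convert_func.py, consult the data directory, which documents the data format required for each task: sample.json describes the format before conversion, schema.json shows how the schema is organized, and train.json describes the format after conversion.

Alternatively, you can directly use zjunlp/InstructIE, a bilingual (Chinese and English) IE dataset covering 12 topics (such as people, vehicles, works of art, natural science, man-made objects, and astronomical objects).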

python ie2instruction/convert_func.py \
    --src_path data/NER/sample.json \
    --tgt_path data/NER/train.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --split_num 6 \
    --random_sort \
    --split train

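language: Supports two languages, zh and en; each language uses a different instruction template.
task: Currently supports five tasks: ['RE', 'NER', 'EE', 'EET', 'EEA'].
split_num: The maximum number of schemas in a single instruction. The default is 4; setting it to -1 disables splitting. The recommended value depends on the task: 6 for NER, and 4 for RE, EE, EET, and EEA.
random_sort: Whether to randomly shuffle the schemas in the instruction. The default is False, in which case schemas are sorted alphabetically.
split: The dataset split, either train or test.

The converted training data contains four fields: task, source, instruction, and output.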
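
Before launching training, it can be useful to confirm that every converted record really carries the fields the fine-tuning script expects. The following is a small sketch under the assumption that records have already been loaded into a list of dictionaries; the helper name `find_invalid_records` is hypothetical and not part of the repository.

```python
import json

REQUIRED_TRAIN_FIELDS = ("instruction", "output")


def find_invalid_records(records, required=REQUIRED_TRAIN_FIELDS):
    """Return indices of records that miss a required field or hold a non-JSON string."""
    bad = []
    for index, rec in enumerate(records):
        try:
            for field in required:
                json.loads(rec[field])  # both fields are stored as JSON strings
        except (KeyError, TypeError, json.JSONDecodeError):
            bad.append(index)
    return bad


records = [
    {"task": "NER", "source": "demo",
     "instruction": json.dumps({"instruction": "...", "schema": ["person"], "input": "..."}),
     "output": json.dumps({"person": []})},
    {"task": "NER", "source": "demo", "instruction": "not json"},  # missing output, bad JSON
]
print(find_invalid_records(records))  # [1]
```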

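Hard negative generation: promotes the co-occurrence of semantically similar, easily confused schemas and reduces the number of training samples.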

python ie2instruction/convert_func.py \
    --src_path data/SPO/sample.json \
    --tgt_path data/SPO/train.json \
    --schema_path data/SPO/schema.json \
    --cluster_mode \
    --hard_negative_path data/hard_negative/SPO_DuIE2.0.json \
    --language zh \
    --task SPO \
    --split_num 4 \
    --random_sort \
    --split train

Add the --cluster_mode and --hard_negative_path data/hard_negative/SPO_DuIE2.0.json arguments. --hard_negative_path points to the hard negative dictionary; hard_dict.json contains the hard negative dictionaries for all datasets involved in IEPile.

4.2Continued Training

| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template |
| --- | --- | --- | --- | --- |
| llama2-13b-iepile-lora | LLaMA2-13B-Chat | llama | bf16 | llama2 |
| baichuan2-13b-iepile-lora | BaiChuan2-13B-Chat | baichuan | bf16 | baichuan2 |
| llama3-8b-iepile-lora | LLaMA3-8B-Instruct | llama | bf16 | alpaca |
| qwen1.5-14b-iepile-lora | Qwen1.5-14B-Chat | qwen2 | bf16 | qwen |
| OneKE | OneKE | llama | bf16 | llama2_zh |

output_dir='lora/llama2-13b-chat-v1-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/llama2-13B-Chat' \
    --checkpoint_dir 'lora/llama2-13b-iepile-lora' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16

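For parameter descriptions, see 3.3LoRA Fine-tuning.
To continue training from fine-tuned LoRA weights, simply point checkpoint_dir to the LoRA weights path, for example 'zjunlp/llama2-13b-iepile-lora'.

Quantization can be enabled by setting bits to 4 (--bits 4); this is recommended on an RTX3090.

Note that when using LLaMA2-IEPile or Baichuan2-IEPile, both lora_r and lora_alpha should stay at 64; we do not provide other recommended settings for these parameters.

To continue training from fine-tuned full model weights, set model_name_or_path to the weights path, such as 'zjunlp/KnowLM-IE-v2', and do not set checkpoint_dir.

The script can be found at ft_scripts/fine_continue.bash.

4.3Continued Training of OneKE

4.3.1Full Supervised Fine-tuning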

output_dir='lora/OneKE-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --bf16

4.3.2LoRA Training

output_dir='lora/OneKE-continue-lora'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16

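5.Prediction

5.1Test Data Conversion

Before converting test data, see the data directory for the data structure each task requires: (1) the input format is shown in sample.json; (2) the schema format is shown in schema.json; (3) the converted format follows train.json. Unlike training data, test input does not need to contain annotation fields (entity, relation, event).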

python ie2instruction/convert_func.py \
    --src_path data/NER/sample.json \
    --tgt_path data/NER/test.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --split_num 6 \
    --split test

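When split is set to test, choose the number of schemas according to the task: 6 is recommended for NER, and 4 for RE, EE, EET, and EEA. The converted test data contains five fields: id, task, source, instruction, and label.

The label field is used for later evaluation. If the input data lacks annotation fields (entity, relation, event), the converted test data will not contain the label field, which suits scenarios without original annotations.

5.2IEPile Test Data

Download the IEPile dataset: Google Drive | Hugging Face | WiseModel | ModelScope

The file tree is as follows: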

IEPile
├── train.json      # Training Set
├── dev.json        # Validation Set
├── IE-en           # English Unified Format Data
│   ├── NER
│   │   ├── CoNLL2003
│   │   │   ├── train.json
│   │   │   ├── dev.json
│   │   │   ├── schema.json   # schema information file
│   │   │   └── test.json
│   │   ├── ...
│   ├── RE
│   ├── EE
│   ├── EET
│   ├── EEA
├── IE-zh           # Chinese Unified Format Data
│   ├── NER
│   ├── RE
│   ├── EE
│   ├── EET
│   ├── EEA

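The following script generates the test instruction data in batch:

bash ie2instruction/eval_data_convert.bash

Set dir_path on the first line of the script to the actual absolute path of the IEPile dataset.

Note: because the order of labels in the converted schema sequence may differ, evaluation results may deviate slightly.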

5.3Base Model + LoRA Prediction

LLaMA2-IEPile | Baichuan2-IEPile model download links: zjunlp/llama2-13b-iepile-lora | zjunlp/baichuan2-13b-iepile-lora

| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template |
| --- | --- | --- | --- | --- |
| llama2-13b-iepile-lora | LLaMA2-13B-Chat | llama | bf16 | llama2 |
| baichuan2-13b-iepile-lora | BaiChuan2-13B-Chat | baichuan | bf16 | baichuan2 |
| llama3-8b-iepile-lora | LLaMA3-8B-Instruct | llama | bf16 | alpaca |
| qwen1.5-14b-iepile-lora | Qwen1.5-14B-Chat | qwen2 | bf16 | qwen |

⚠️ Note that base model + LoRA prediction requires downloading both the LoRA weights and the base model weights. For example, to use baichuan2-13b-iepile-lora (--checkpoint_dir), you also need to download BaiChuan2-13B-Chat (--model_name_or_path). 🚫 Setting only --model_name_or_path lora/baichuan2-13b-iepile-lora does not work.

CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --stage sft \
    --model_name_or_path 'models/llama2-13B-Chat' \
    --checkpoint_dir 'lora/llama2-13b-IEPile-lora' \
    --model_name 'llama' \
    --template 'llama2' \
    --do_predict \
    --input_file 'data/input.json' \
    --output_file 'results/llama2-13b-IEPile-lora_output.json' \
    --finetuning_type lora \
    --output_dir 'lora/test' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300

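During inference, model_name, template, and bf16 must match the settings used in training.
model_name_or_path: The path of the base model, which must match the corresponding LoRA model.
checkpoint_dir: The path of the LoRA weights.
output_dir: This parameter has no effect during inference; any path may be given.
input_file, output_file: The path of the input test file and the path of the output file for predictions, respectively.
cutoff_len, max_new_tokens: The maximum input length and the number of new tokens to generate; adjust them to your GPU memory.

Quantization can be enabled by setting bits to 4 (--bits 4); this is recommended on an RTX3090.

5.4IE-Specific Model Prediction

| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template |
| --- | --- | --- | --- | --- |
| OneKE | OneKE | llama | bf16 | llama2_zh |

OneKE (based on chinese-alpaca2) model download link: zjunlp/OneKE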

CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --stage sft \
    --model_name_or_path 'models/OneKE' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file 'data/NER/test.json' \
    --output_file 'results/OneKE_output.json' \
    --output_dir 'lora/test' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 4

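model_name_or_path: The path of the IE-specific model weights.

Model Usage

Model Download

HuggingFace, ModelScope, WiseModel

Environment Setup

conda create -n OneKE python=3.9
conda activate OneKE
pip install -r requirements.txt

Quick Start

At least 20GB of GPU memory is recommended for training and inference.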

import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    BitsAndBytesConfig
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = 'zjunlp/OneKE' # Set this to the local path where you stored the downloaded model
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


# Quantize OneKE to 4 bits
quantization_config=BitsAndBytesConfig(     
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",  
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()


system_prompt = '<<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n'
sintruct = "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}"
sintruct = '[INST] ' + system_prompt + sintruct + '[/INST]'

input_ids = tokenizer.encode(sintruct, return_tensors="pt").to(device)
input_length = input_ids.size(1)
generation_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=1024, max_new_tokens=512, return_dict_in_generate=True), pad_token_id=tokenizer.eos_token_id)
generation_output = generation_output.sequences[0]
generation_output = generation_output[input_length:]
output = tokenizer.decode(generation_output, skip_special_tokens=True)

print(output)

vLLM Inference

For vLLM environment setup, see its official installation documentation (Installation).

Deploy the service

python -m vllm.entrypoints.openai.api_server --model zjunlp/OneKE

Call the API from a terminal

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "zjunlp/OneKE", "prompt": "[INST] <<SYS>>You are a helpful assistant. 你是一个乐于助人的助手。<</SYS>>{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}[/INST]", "max_tokens": 1024, "temperature": 0}'
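
Hand-escaping the nested JSON in the curl command is error-prone, so the request body can instead be assembled programmatically. The sketch below only builds the payload and does not send it; the function name `build_completion_payload` is hypothetical, while the prompt layout and request fields are copied from the curl example above.

```python
import json

SYSTEM_PROMPT = "<<SYS>>You are a helpful assistant. 你是一个乐于助人的助手。<</SYS>>"


def build_completion_payload(model, instruction, schema, text, max_tokens=1024):
    """Build the JSON body for the /v1/completions endpoint used in the curl example."""
    body = json.dumps(
        {"instruction": instruction, "schema": schema, "input": text},
        ensure_ascii=False,
    )
    prompt = "[INST] " + SYSTEM_PROMPT + body + "[/INST]"
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0}


payload = build_completion_payload(
    model="zjunlp/OneKE",  # should match the name the server was started with
    instruction="You are an expert in named entity recognition. Please extract entities "
                "that match the schema definition from the input. Return an empty list if "
                "the entity type does not exist. Please respond in the format of a JSON string.",
    schema=["person", "organization", "else", "location"],
    text="284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68",
)
print(json.dumps(payload, ensure_ascii=False)[:80])
```

The resulting dictionary can be sent as the JSON body of a POST request to the same endpoint with any HTTP client.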

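GGUF Format Conversion

To convert the model weights from Hugging Face format to GGUF, first clone the llama.cpp GitHub repository, which contains the necessary conversion script:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Next, run the provided Python script convert_hf_to_gguf.py to perform the conversion. Make sure the required Python environment and dependencies are installed. The command is: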

python3 convert_hf_to_gguf.py \
    /disk/disk_20T/ghh/OneKE \
    --outfile /disk/disk_20T/ghh/OneKE.gguf \
    --outtype bf16

Note that the first positional argument is the directory of the original model files, --outfile is where the converted GGUF file is saved, and --outtype sets the numeric precision of the output file.

The converted GGUF version of OneKE: OneKE-gguf

ollama Inference

For ollama environment setup, see the official documentation: https://github.com/ollama/ollama/tree/main

curl -fsSL https://ollama.com/install.sh | sh

Create a Modelfile

FROM ./OneKE-13B-BF16.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
TEMPLATE """[INST] <<SYS>>You are a helpful assistant. 你是一个乐于助人的助手。<</SYS>>{{ .Prompt }}[/INST]"""

Start ollama

ollama serve

In another terminal window, run

ollama create oneke -f Modelfile

ollama run oneke

Input and output

>>> {\"instruction\": \"你是专门进行实体抽取的专家。请从input中抽取出符合schema定义的实体，不存在的实体类型
... 返回空列表。请按照JSON字符串的格式回答。\", \"schema\": [\"人物\", \"地理位置\", \"组织机构\"], \"input
... \": \"在这里恕弟不恭之罪，敢在尊前一诤：前人论书，每曰“字字有来历，笔笔有出处”，细读公字，何尝跳出前人
... 藩篱，自隶变而后，直至明季，兄有何新出？\"}
 {"人物": [], "地理位置": [], "组织机构": []}

>>> {\"instruction\": \"你是专门进行实体抽取的专家。请从input中抽取出符合schema定义的实体，不存在的实体类型
... 返回空列表。请按照JSON字符串的格式回答。\", \"schema\": [\"组织机构\", \"地理位置\", \"人物\"], \"input
... \": \"胡老说，当画画疲倦时就到院里去看看，给这盆花浇点水，给那棵花剪剪枝，回来再接着画，画累了再出去，
... 如此循环往复，脑体结合，有益健康，胜过吃药。\"}
 {"组织机构": [], "地理位置": [], "人物": ["胡"]}

>>> {\"instruction\": \"你是专门进行事件提取的专家。请从input中抽取出符合schema定义的事件，不存在的事件返回
... 空列表，不存在的论元返回NAN，如果论元存在多值请返回列表。请按照JSON字符串的格式回答。\", \"schema\": [{
... \"event_type\": \"产品行为-获奖\", \"trigger\": true, \"arguments\": [\"获奖人\", \"颁奖机构\", \"奖项\
... ", \"时间\"]}, {\"event_type\": \"组织行为-罢工\", \"trigger\": true, \"arguments\": [\"罢工人数\", \"
... 罢工人员\", \"所属组织\", \"时间\"]}, {\"event_type\": \"组织关系-裁员\", \"trigger\": true, \"argument
... s\": [\"裁员方\", \"时间\", \"裁员人数\"]}, {\"event_type\": \"组织关系-解散\", \"trigger\": true, \"ar
... guments\": [\"解散方\", \"时间\"]}], \"input\": \"消失的“外企光环”，5月份在华裁员900余人，香饽饽变“臭”
... 了\"}
 {"产品行为-获奖": [], "组织行为-罢工": [], "组织关系-裁员": [{"trigger": "裁员", "arguments": {"裁员方
": "NAN", "时间": "5月份", "裁员人数": "900余人"}}], "组织关系-解散": []}

Remove the model after exiting

ollama stop oneke

ollama rm oneke

Inference on Mac

import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    BitsAndBytesConfig
)

device = torch.device("mps")
model_path = 'zjunlp/OneKE' # Set this to the local path where you stored the downloaded model
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",  
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()
model = model.to(device)

Launch from the command line with PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py.

Multi-GPU Inference

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model, dispatch_model

max_memory_each_gpu = '15GiB'  # Memory budget per GPU
gpu_device_ids = [0, 1]  # GPUs to spread the model across
no_split_module_classes = ["LlamaDecoderLayer"]  # Keep each decoder layer on a single device
model_path = '/disk/disk_20T/ghh/OneKE' # Set this to the local path where you stored the downloaded model

max_memory = {
    device_id: max_memory_each_gpu for device_id in gpu_device_ids
}

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build the model skeleton without allocating weights; generate() needs the causal-LM head
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16, trust_remote_code=True)

device_map = infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=no_split_module_classes)

print("auto determined device_map", device_map)
# Pin the embeddings, first decoder layer, and output head to GPU 0 (module names of a Llama causal LM)
device_map["model.embed_tokens"] = 0
device_map["model.layers.0"] = 0
device_map["lm_head"] = 0
print("modified device_map", device_map)

load_checkpoint_in_model(model, model_path, device_map=device_map)

model = dispatch_model(model, device_map=device_map)
torch.set_grad_enabled(False)
model.eval()


system_prompt = '<<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n'
sintruct = "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}"
sintruct = '[INST] ' + system_prompt + sintruct + '[/INST]'

input_ids = tokenizer.encode(sintruct, return_tensors="pt")
input_length = input_ids.size(1)
generation_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=1024, max_new_tokens=512, return_dict_in_generate=True), pad_token_id=tokenizer.eos_token_id)
generation_output = generation_output.sequences[0]
generation_output = generation_output[input_length:]
output = tokenizer.decode(generation_output, skip_special_tokens=True)

print(output)

6.Evaluation

We provide scripts for computing the F1 score of each task.

python ie2instruction/eval_func.py \
  --path1 data/NER/processed.json \
  --task NER

task: Currently supports five tasks: ['RE', 'NER', 'EE', 'EET', 'EEA'].
Set sort_by to source to compute the F1 score on each dataset separately.
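
For readers unfamiliar with how an extraction F1 score is typically computed, here is a generic micro-averaged F1 sketch over NER outputs in the IEPile output format. It illustrates the metric only and is not the logic of eval_func.py; the helper names `ner_items` and `micro_f1` and the toy gold and predicted outputs are assumptions made for this example.

```python
def ner_items(output):
    """Flatten an NER output dict into a set of (entity_type, mention) pairs."""
    return {(etype, mention) for etype, mentions in output.items() for mention in mentions}


def micro_f1(gold_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over aligned lists of item sets."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = {"person": ["Robert Allenby", "Miguel Angel Martin"], "location": ["Australia", "Spain"]}
pred = {"person": ["Robert Allenby"], "location": ["Australia", "Spain", "Allenby"]}
# 3 of 4 predictions are correct and 3 of 4 gold items are found.
print(micro_f1([ner_items(gold)], [ner_items(pred)]))  # (0.75, 0.75, 0.75)
```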

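7.Statement and License

We believe that annotated data is a treasury of human wisdom that exists to benefit all of humanity and to help improve our quality of life. We strongly urge all users not to use our corpus for any activity that may harm national or public security or violate laws and regulations. We have done our best to ensure the quality and legality of the data provided. We also recognize that, despite these efforts, unforeseen issues may remain, such as data protection concerns and the risks arising from misuse of the data. We assume no responsibility for such potential problems. For original data subject to licenses stricter than CC BY-NC-SA 4.0, IEPile will abide by those stricter terms. In all other cases, our work is governed by the CC BY-NC-SA 4.0 license.

8.Limitations

From the data perspective, our research focuses on schema-based information extraction (IE), which limits our ability to generalize to human instructions that do not follow our specific format requirements. In addition, we did not explore open information extraction (Open IE); however, if the schema constraint is removed, our dataset would be suitable for Open IE scenarios. Our dataset currently contains only English and Chinese data, and we hope to include more languages in the future. From the model perspective, owing to limited computing resources, our study evaluated only two models, Baichuan and LLaMA, along with several baseline models. Our dataset can be applied to any other large language model (LLM), such as Qwen or ChatGLM.

9.Citation

If you use IEPile or the code, please cite the following paper: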

@article{DBLP:journals/corr/abs-2402-14710,
  author       = {Honghao Gui and
                  Lin Yuan and
                  Hongbin Ye and
                  Ningyu Zhang and
                  Mengshu Sun and
                  Lei Liang and
                  Huajun Chen},
  title        = {IEPile: Unearthing Large-Scale Schema-Based Information Extraction
                  Corpus},
  journal      = {CoRR},
  volume       = {abs/2402.14710},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2402.14710},
  doi          = {10.48550/ARXIV.2402.14710},
  eprinttype    = {arXiv},
  eprint       = {2402.14710},
  timestamp    = {Tue, 09 Apr 2024 07:32:43 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2402-14710.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

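10.Acknowledgements

We are very grateful to the MathPile and KnowledgePile projects for their valuable inspiration. We extend special thanks to the builders and maintainers of the following datasets: AnatEM, BC2GM, BC4CHEMD, NCBI-Disease, BC5CDR, HarveyNER, CoNLL2003, GENIA, ACE2005, MIT Restaurant, MIT Movie, FabNER, MultiNERD, Ontonotes, FindVehicle, CrossNER, MSRA NER, Resume NER, CLUE NER, Weibo NER, Boson, ADE Corpus, GIDS, CoNLL2004, SciERC, Semeval-RE, NYT11-HRL, KBP37, NYT, Wiki-ZSL, FewRel, CMeIE, DuIE, COAE2016, IPRE, SKE2020, CASIE, PHEE, CrudeOilNews, RAMS, WikiEvents, DuEE, DuEE-Fin, FewFC, CCF law, and others, which greatly advanced this research. We are also grateful to InstructUIE and YAYI-UIE for their valuable contributions of data and models to the field of information extraction; our results likewise benefit from their innovation and effort. In addition, we sincerely thank hiyouga/LLaMA-Factory, on which our fine-tuning code is largely based. With the help of these resources we were able to complete this research, and we are deeply grateful.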
