Data construction code for Schema-adaptable Knowledge Graph Construction
Note: First of all, make sure your current path is */AdaKGC/dataset_construct
- Quick links
- Obtain Raw Dataset
- Convert Raw Dataset to Iteration Dataset
- Data Format - Example of instance
- Acknowledgement
<a id="obtain-raw-dataset"></a>
We follow the following methods to obtain raw data. We sincerely thank previous works.
Dataset | Preprocessing |
---|---|
Few-NERD | Few-NERD |
NYT | JointER |
ACE05-Evt | OneIE |
<a id="few-nerd"></a>
mkdir -p data/Few-NERD
wget -O data/Few-NERD/supervised.zip https://cloud.tsinghua.edu.cn/f/09265750ae6340429827/?dl=1
unzip -o -d data/Few-NERD/ data/Few-NERD/supervised.zip && rm data/Few-NERD/supervised.zip
python scripts/process_nerd.py
rm -r data/Few-NERD/supervised
$ tree data/Few-NERD
data/Few-NERD
├── dev.txt
├── test.txt
└── train.txt
<a id="nyt"></a>
mkdir data/NYT-multi
wget -P data/NYT-multi https://raw.githubusercontent.com/yubowen-ph/JointER/master/dataset/NYT-multi/data/train.json
wget -P data/NYT-multi https://raw.githubusercontent.com/yubowen-ph/JointER/master/dataset/NYT-multi/data/dev.json
wget -P data/NYT-multi https://raw.githubusercontent.com/yubowen-ph/JointER/master/dataset/NYT-multi/data/test.json
$ tree data/NYT-multi
data/Few-NERD
├── dev.json
├── test.json
└── train.json
<a id="ace05-evt"></a>
The preprocessing code of ACE05-Evt is following OneIE.
Please follow the instructions and put preprocessed dataset at data/oneie
:
## OneIE Preprocessing, ACE_DATA_FOLDER -> ace_2005_td_v7
$ tree data/oneie/ace05-EN
data/oneie/ace05-EN
├── dev.oneie.json
├── english.json
├── english.oneie.json
├── test.oneie.json
└── train.oneie.json
Note:
nltk==3.5
is used in our experiments, we foundnltk==3.6+
may lead to different sentence numbers.- Ensure that the tree structure of the data is consistent with what we have listed above.
<a id="convert-raw-dataset-to-iteration-datset"></a>
You can obtain all iteration datasets by running the following code:
bash scripts/run_data_convert.bash
All iteration datasets will be located in the /AdaKGC/data
directory.
<a id="details-of-dataset-conversion"></a>
Now let's introduce the details of dataset conversion.
Each iteration dataset has a corresponding configuration file and the configuration file is placed under the directory AdaKGC/config/data_config
.
<a id="example-of-dataset-configuration-file"></a>
# AdaKGC/config/data_config/event/ace05_event_M3.yaml
name: ace05_event
path: data/oneie/ace05-EN
data_class: OneIEEvent
split:
train: train.oneie.json
val: dev.oneie.json
test: test.oneie.json
language: en
new_list:
- acquit
- appeal
- business
- contact
- meet
- merge organization
- start organization
- trial hearing
delete_list:
- Personnel:Elect
- Personnel:End-Position
- Personnel:Nominate
- Personnel:Start-Position
mapper:
Business:Declare-Bankruptcy: business
Business:End-Org: business
Business:Merge-Org: merge organization
Business:Start-Org: start organization
Conflict:Attack: attack
Conflict:Demonstrate: demonstrate
Contact:Meet: meet
Contact:Phone-Write: contact
Justice:Acquit: acquit
Justice:Appeal: appeal
Justice:Arrest-Jail: justice
Justice:Charge-Indict: justice
Justice:Convict: justice
Justice:Execute: execute
Justice:Extradite: extradite
Justice:Fine: fine
Justice:Pardon: pardon
Justice:Release-Parole: release parole
Justice:Sentence: sentence
Justice:Sue: sue
Justice:Trial-Hearing: trial hearing
Life:Be-Born: be born
Life:Die: die
Life:Divorce: life
Life:Injure: life
Life:Marry: marry
Movement:Transport: transport
Personnel:Elect: elect
Personnel:End-Position: end position
Personnel:Nominate: nominate
Personnel:Start-Position: start position
Transaction:Transfer-Money: transfer money
Transaction:Transfer-Ownership: transfer ownership
FAC: facility
GPE: geographical social political
...
delete_list
indicates the schema node to be deleted. For example, if a sample contains the Personal:Elect
tag, but Personal:Elect is in delete_list, then delete the Personal:Elect tag of the sample.
new_list
indicates the change of the schema of iteration n
compared with iteration 1
(i.e., the newly added schema node)
mapper
is a label mapping
between the raw dataset and the iteration dataset, which is important in vertical and replacement segmentation
<a id="file-name-format"></a>
The configuration file name also indicates some information about the iteration dataset. E.g. ace05_event_H2
, where H
indicates that its segmentation mode is horizontal segmentation, and 2
indicates iteration 2.
<a id="schema-hierarchy"></a>
The schemas of the three datasets used in AdaKGC all have a hierarchical structure (on the NYT dataset, we have built a hierarchical schema ourselves). Schema hierarchy usually has a two-layer structure. Here, for simplicity, we call it parent class
and child class
. The parent class can be regarded as a coarse-grained label, while the subclass is a fine-grained label.
The following is an example of schema hierarchy in three datasets:
Justice:Appeal # ace05_event: Justice is parent class, Appeal is child class
building-library # Few-NERD: building is parent class, library is child class
/business/company # NYT: business is parent class, company is child class
<a id="segmentation-mode"></a>
H
: Horizontal segmentation, Using only child class as label, each iteration will add several child class schema nodes (add or remove labels fromdelete_list
). So the number of labels per iteration is variable.V
: Vertical segmentation, Using both parent class and child class as label, each iterationmapper
will change several labels from parent class to child class (delete_list
is always empty). So the number of labels in each iteration is unchanged, but the content of labels change from parent class to child class.M
: Mixed segmentation, A mixture of H and V.R
: Replacement segmentation, Using only child class as label, each iterationmapper
will change several labels from child class to their synonyms (delete_list
is always empty). So the number of labels in each iteration is unchanged, but the content of labels change from child class to their synonyms. Therefore, the number of schema nodes remains unchanged, but the content of the nodes will become their synonyms.
<a id="iteration-change"></a>
Each iteration will change a fixed number of schema nodes, 2 for NYT, 3 for ACE05_Event, and 6 for Few-NERD. iteration 1
usually has the minimum number of schema nodes, and iteration 7
has the maximum number of schema nodes.
<a id="data-format"></a>
We use the same data format as UIE, each iteration dataset has the following items:
$ tree data/iter_1/NYT_H
data/iter_1/NYT_H
├── schema.json
├── test.json
├── train.json
└── val.json
<a id="example-of-instance-from-ace05-event"></a>
ACE2005_Event
{
"text": "She would be the first foreign woman to die in the wave of kidnappings in Iraq .",
"tokens": ["She", "would", "be", "the", "first", "foreign", "woman", "to", "die", "in", "the", "wave", "of", "kidnappings", "in", "Iraq", "."],
"record": "<extra_id_0> <extra_id_0> die <extra_id_5> die <extra_id_0> victim <extra_id_5> woman <extra_id_1> <extra_id_0> place <extra_id_5> Iraq <extra_id_1> <extra_id_1> <extra_id_1>",
"entity": [],
"relation": [],
"event": [
{"type": "die", "offset": [8], "text": "die", "args": [{"type": "victim", "offset": [6], "text": "woman"},
{"type": "place", "offset": [15], "text": "Iraq"}]}
],
"spot": ["die"],
"asoc": ["victim", "place"],
"spot_asoc": [
{"span": "die", "label": "die", "asoc": [["victim", "woman"], ["place", "Iraq"]]}
]
}
NYT
{
"text": "Should Turkey face eastward , toward its Muslim neighbors , or westward , toward Europe ?",
"tokens": ["Should", "Turkey", "face", "eastward", ",", "toward", "its", "Muslim", "neighbors", ",", "or", "westward", ",", "toward", "Europe", "?"],
"record": "<extra_id_0> <extra_id_0> location <extra_id_5> Turkey <extra_id_1> <extra_id_0> location <extra_id_5> Europe <extra_id_0> contains <extra_id_5> Turkey <extra_id_1> <extra_id_1> <extra_id_1>",
"entity": [
{"type": "location", "offset": [14], "text": "Europe"},
{"type": "location", "offset": [1], "text": "Turkey"}
],
"relation": [
{"type": "contains", "args": [{"type": "location", "offset": [14], "text": "Europe"},
{"type": "location", "offset": [1], "text": "Turkey"}]}
],
"event": [],
"spot": ["location"],
"asoc": ["contains"],
"spot_asoc": [
{"span": "Turkey", "label": "location", "asoc": []},
{"span": "Europe", "label": "location", "asoc": [["contains", "Turkey"]]}
]
}
Few-NERD
{
"text": "Now Multan is the name of the city in Pakistan .",
"tokens": ["Now", "Multan", "is", "the", "name", "of", "the", "city", "in", "Pakistan", "."],
"record": "<extra_id_0> <extra_id_0> geographical social political <extra_id_5> Multan <extra_id_1> <extra_id_0> geographical social political <extra_id_5> Pakistan <extra_id_1> <extra_id_1>",
"entity": [
{"type": "geographical social political", "offset": [9], "text": "Pakistan"},
{"type": "geographical social political", "offset": [1], "text": "Multan"}
],
"relation": [],
"event": [],
"spot": ["geographical social political"],
"asoc": [],
"spot_asoc": [
{"span": "Multan", "label": "geographical social political", "asoc": []},
{"span": "Pakistan", "label": "geographical social political", "asoc": []}
]
}
<a id="acknowledgement"></a>
Parts of the code are modified from UIE. We appreciate the authors for making their projects open-sourced.