Dataset publication guidelines
To ensure the proper ingestion of your datasets and to facilitate their dissemination on the EU Vocabularies website, we advise you to comply with the following basic rules:
Packaging format and communication
The content of the future publication will be delivered as a zip archive.
The delivery has to take place in accordance with the scheduled “code freeze” date.
Any change of date has to be communicated at least 2 weeks in advance of “code freeze”.
Unless defined otherwise, the package will be sent to the following email address:
Content of the publication package
A package will not be accepted for publication unless the following components are included:
Dataset file(s)
- The actual dataset files will always be located in the root folder of the archive
- Depending on the type, the files will be in one of the following formats:
  - Semantic vocabularies: RDF, TTL, XML, JSON-LD
  - Generic vocabularies: CSV, GC, XML, SVG
  - Models: OWL, XML Schema, DTD, XML, TTL
  - Alignments: RDF, TTL, XML
Documentation
- Every dataset type intended for publication will be accompanied by at least a documentation file and a release note
- All documentation files associated with the dataset will be stored in the Documentation folder
- The Documentation folder will be located in the root folder of the main package
- The documentation will be provided only in HTML or PDF format
- Every documentation file will clearly state the dataset name and the title of the document at the beginning (first page or first screen to be displayed)
- If only one documentation file is provided, this file will contain at least the following sections:
  - Title of the document
  - Title of the dataset
  - The scope and intended target of the document
  - A basic description of the dataset
    A main section presenting the dataset at large, as well as its intended use, should be included. Such a description might give details about the structure, usage principles, data models, associated statistics, etc.
- The release notes will be stored in the Release folder, which is located in the root folder of the main package
- The release notes will be delivered as an HTML, PDF or TXT file
- The release notes will contain at a minimum: the version ID, a list of the distribution formats included in the release, the contact details of the copyright owner and, if possible, a list of the new elements that the release provides
Optionally, and if relevant for the scope of the dataset, a publication package may also contain:
- Sample files – Packed together as a zip file with the name Samples. Stored in the root folder of the main package
- Diff files – Stored as independent files under the folder Diff that is located in the root folder of the main package
Depending on the type of dataset, some elements of the package might differ.
Any such deviation has to be clarified in advance with the publication team ([email protected])
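The layout rules above lend themselves to an automated check before delivery. The following Python sketch is illustrative only: the names `check_package` and `build_demo_archive` and the wording of the messages are hypothetical, and the check covers only the folder layout, the documentation and release-note formats, and the ban on spaces in file names stated in these guidelines. It cannot tell a dataset file from any other file placed in the root folder.

```python
import io
import posixpath
import zipfile

# Folder names and allowed extensions come from the guidelines;
# the function names and messages are hypothetical illustrations.
ALLOWED_EXTENSIONS = {
    "Documentation": {".html", ".pdf"},
    "Release": {".html", ".pdf", ".txt"},
}


def check_package(archive: zipfile.ZipFile) -> list:
    """Return a list of layout problems found in a publication package."""
    files = [name for name in archive.namelist() if not name.endswith("/")]
    problems = []

    # Dataset files are expected in the root folder of the archive.
    if not any("/" not in name for name in files):
        problems.append("no file in the root folder")

    for folder, extensions in ALLOWED_EXTENSIONS.items():
        members = [name for name in files if name.startswith(folder + "/")]
        if not members:
            problems.append(f"missing or empty {folder} folder")
        for name in members:
            suffix = posixpath.splitext(name)[1].lower()
            if suffix not in extensions:
                problems.append(f"unexpected format in {folder}: {name}")

    # File names must not contain spaces.
    problems.extend(f"space in file name: {name}" for name in files if " " in name)
    return problems


def build_demo_archive(paths):
    """Create an in-memory zip with placeholder content, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as writer:
        for path in paths:
            writer.writestr(path, "placeholder")
    return zipfile.ZipFile(buffer)


demo = build_demo_archive([
    "EuroVoc_SKOS.rdf",
    "Documentation/EuroVoc_User_manual.pdf",
    "Release/EuroVoc_Release_note.txt",
])
print(check_package(demo))  # an empty list means no problem was detected
```

For a real delivery, `zipfile.ZipFile("package.zip")` would replace the in-memory demo archive; the path is a placeholder.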
File naming and conventions
In order to ensure clarity in communicating the scope of each file to the intended users it is advisable to use a proper naming convention for the various files stored in the publication package.
Our preferred file naming structure follows the rules below:
DA – [Required] Dataset name or acronym (e.g. EuroVoc, IMMC, ECLAS, etc.)
FC – [Required] File content, intent or distribution (e.g. Alignment, Example, User_manual, Release_note, Diff, SKOS, MARC, etc.)
VS – [Optional] Version ID or date of the dataset
EXT – File extension (e.g., RDF, TTL, XML, PDF, CSV, etc.)
File name = DA_FC_VS.EXT
No spaces are accepted in the name of the package or in the names of the files it contains.
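As an illustration of the DA_FC_VS.EXT structure, the Python sketch below assembles a file name from its parts and rejects names that contain whitespace. The function name `build_file_name` is hypothetical, and the version value in the second call is a made-up date used only for demonstration.

```python
from typing import Optional


def build_file_name(dataset: str, content: str, extension: str,
                    version: Optional[str] = None) -> str:
    """Assemble DA_FC_VS.EXT; the version part (VS) is optional."""
    parts = [dataset, content]
    if version:
        parts.append(version)
    name = "_".join(parts) + "." + extension.lstrip(".")
    # The guidelines do not accept spaces in file names.
    if any(character.isspace() for character in name):
        raise ValueError(f"file name contains whitespace: {name!r}")
    return name


print(build_file_name("IMMC", "Release_note", "pdf"))
# prints "IMMC_Release_note.pdf"
print(build_file_name("EuroVoc", "SKOS", "rdf", version="20240101"))  # hypothetical date
# prints "EuroVoc_SKOS_20240101.rdf"
```

Because the FC part may itself contain underscores (User_manual, Release_note), building a name from its parts is unambiguous, whereas splitting an existing name back into DA, FC and VS is not.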
In case of non-compliance
If an already existing convention (for content, labels, etc.) was defined and/or used for previously published packages, please inform the publication team ([email protected]) to identify the best approach to be followed.
Downloads via SPARQL queries
Downloads retrieve the official published data directly from the common data repository of the Publications Office (Cellar) in the specified format. The download links are direct links to the downloadable query results. The HTML format opens in a table view in the browser; CSV and JSON files are downloaded.
The CSV files can be imported into Excel (NB: do not use ‘Open with’; instead, open Excel first and import the data: Data/From text/CSV, select the downloaded file, click Import, choose the comma as delimiter, then click Load). The links retrieving JSON can be included directly in external systems to use the data.
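A downloaded CSV file can also be read programmatically rather than through Excel. The Python sketch below assumes that the file is comma-delimited, as stated above, and that its first row holds the column names; the function name `read_query_results` is hypothetical and the sample rows are placeholders, not real data.

```python
import csv
import io


def read_query_results(csv_text: str) -> list:
    """Parse comma-delimited query results into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",")
    return [dict(row) for row in reader]


# Placeholder content standing in for a downloaded file; not real data.
sample = (
    "country_uri,country_en\n"
    "http://example.org/country/A,Example country A\n"
    'http://example.org/country/B,"Example country B, with a comma"\n'
)
rows = read_query_results(sample)
print(rows[1]["country_en"])
# prints "Example country B, with a comma"
```

Using the `csv` module rather than splitting on commas keeps quoted labels that themselves contain a comma intact. For a file on disk, the text could be obtained with `open(path, newline="", encoding="utf-8").read()`, where the path and the UTF-8 encoding are assumptions.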
Only EU Member States must be listed alphabetically using the spelling of their source language. To retrieve the correct listing, the protocol order attribute can be used.
For other countries and territories no specific recommendations are made.
The following query retrieves all the current EU Member States, their related English (preferred) labels and long labels, and their protocol order:
The results are returned in four columns:
- ?country_uri: the identifier of the country or the territory inside the dataset
- ?country_en: the preferred label (in English)
- ?longLabel_en: the related long label (in English)
- ?protocol_order: the related protocol order
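Once the results are loaded as rows, the protocol order column can be used to produce the listing. The Python sketch below is a minimal illustration under assumptions: each row is a dictionary with a `protocol_order` key, and the exact format of the protocol order values is not stated here, so the key sorts all-digit values numerically and anything else as text. The function names and the sample rows are hypothetical placeholders.

```python
def protocol_sort_key(row: dict):
    """Sort numerically when the value is all digits, otherwise as text."""
    value = row["protocol_order"].strip()
    if value.isdecimal():
        return (0, int(value), "")
    return (1, 0, value)


def in_protocol_order(rows: list) -> list:
    """Return the rows sorted by their protocol order value."""
    return sorted(rows, key=protocol_sort_key)


# Placeholder rows; the labels and order values are not real data.
sample_rows = [
    {"country_en": "Example B", "protocol_order": "2"},
    {"country_en": "Example C", "protocol_order": "10"},
    {"country_en": "Example A", "protocol_order": "1"},
]
print([row["country_en"] for row in in_protocol_order(sample_rows)])
# prints ['Example A', 'Example B', 'Example C']
```

Comparing the values as numbers avoids the lexicographic trap in which "10" would sort before "2".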
The following query retrieves all the current countries, their related English (preferred) labels and their ISO codes:
The results are returned in seven columns:
- country_uri: the identifier of the country
- country_en: the preferred labels in English
- named_authority_code: official codes representing the country (it can be different from the ISO codes)
- interinstitutional_style_guide_code: an alpha-2 code used in the EU institutions, which is the same as ISO_31661_alpha2 except for Greece and the United Kingdom
- ISO_31661_alpha2: the two-letter country codes based on ISO 3166-1
- ISO_31661_alpha3: the three-letter country codes based on ISO 3166-1
- ISO_31661_num: three-digit country codes based on ISO 3166-1
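The relationship between the two alpha-2 columns can be sketched in Python as a lookup with two exceptions. The function name `to_style_guide_code` is hypothetical, and the specific values (EL for Greece, UK for the United Kingdom) are taken from the Interinstitutional Style Guide rather than from this page; in practice the interinstitutional_style_guide_code column of the query results remains the authoritative source.

```python
# The style guide code equals the ISO 3166-1 alpha-2 code
# except for Greece (GR -> EL) and the United Kingdom (GB -> UK).
STYLE_GUIDE_EXCEPTIONS = {"GR": "EL", "GB": "UK"}


def to_style_guide_code(iso_alpha2: str) -> str:
    """Map an ISO 3166-1 alpha-2 code to the style guide alpha-2 code."""
    code = iso_alpha2.strip().upper()
    return STYLE_GUIDE_EXCEPTIONS.get(code, code)


print(to_style_guide_code("GR"))  # prints "EL"
print(to_style_guide_code("FR"))  # prints "FR"
```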
Current countries and territories in HTML
The results include all the countries and territories in the dataset (including the "deprecated" countries, which are available for historical reasons).
They are shown in two columns:
- country_uri: the identifier of the country
- country_en: the preferred labels in English
Full list of countries and territories in HTML