Copyright © 2019 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document lists use cases iteratively compiled by the Dataset Exchange Working Group. They identify current shortcomings and motivate the extension of the Data Catalog Vocabulary (DCAT). Further, they motivate the creation of guidelines for and a formalisation of the concept of (application) profiles and how to describe those, and the need for a mechanism to exchange information about those profiles including profile-based content-negotiation.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document was published by the Dataset Exchange Working Group as an Editor's Draft.
Comments regarding this document are welcome. Please send them to [email protected] (archives).
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy.
This document is governed by the 1 February 2018 W3C Process Document.
The provision of metadata describing datasets is crucial to enable data sharing, openly or not, among researchers, governments and citizens. Different communities use a variety of metadata standards to describe their datasets, some of which are highly specialized. W3C’s Data Catalog Vocabulary (DCAT) is in widespread use, but so too are CKAN’s native schema, schema.org's dataset description vocabulary, ISO 19115, DDI, SDMX, CERIF, VoID, INSPIRE and, in the healthcare and life sciences domain, the Dataset Description vocabulary and DATS (ref), among others.
This wealth of metadata standards indicates that there is no complete and universally accepted solution. It is also recognized that DCAT does not provide sufficient vocabulary for some aspects of dataset descriptions (e.g. ways of describing APIs to access datasets, dataset versions, temporal aspects of datasets).
In addition to the variety of standard vocabularies, there have been multiple definitions of application profiles, which define how a vocabulary is used, for example by providing cardinality constraints and/or enumerated lists of allowed values such that data can be validated.
To enable interoperability between services, e-infrastructures and virtual research environments, mechanisms are needed for metadata standards and application profiles to be exposed and ingested through transparent and sustainable interfaces. Thus, we need a mechanism for servers to indicate the available standards and application profiles, and for clients to choose an appropriate one. This leads to the concept of content negotiation by application profile, which is orthogonal to the content negotiation by data format and language that is already part of HTTP.
Within this context, the mission of the Dataset Exchange Working Group as described in its charter is to:
This document represents the results of the Working Group's initial efforts to identify use cases and requirements for all of the above activities. It contains use cases reflecting situations that members of the Working Group and other stakeholders have identified relevant to these goals, and a minimal set of requirements derived from the use cases. The use cases and requirements will be used to develop the three key deliverables described below.
The deliverables for this Working Group as described in the charter are noted below.
An update and expansion of the current DCAT Recommendation. The new version may deprecate, but MUST NOT delete, any existing terms.
A definition of what is meant by an application profile and an explanation of one or more methods for publishing and sharing them.
An explanation of how to implement the expected RFC and suitable fallback mechanisms as discussed at the SDSVoc workshop.
This Working Group was formed as an outcome of the Smart Descriptions & Smarter Vocabularies (SDSVoc) workshop held in Amsterdam from November 30 to December 1, 2016. The first Working Group meeting was held May 18, 2017. At that meeting the Working Group Chairs called for Working Group members and other stakeholders to submit use cases on the Working Group's wiki.
Use cases were written using a template, which required use case authors to provide a problem statement, list of stakeholders experiencing the problem, and requirements suggested given the problem. It was also recommended that use case authors provide references to existing approaches that might be useful for DCAT, links to documents and projects their use cases referenced, related use cases, and editorial comments. In addition, the list of tags below was created and applied to the use cases to reorganize them on demand.
Working Group Chairs grouped related use cases for discussion. Use cases were discussed during the Working Group's weekly meetings and an intensive two-day face-to-face meeting held at the Oxford e-Research Centre, University of Oxford, July 17-18, 2017. After the discussion of each use case, a proposal to accept the use case as-is, accept the use case with changes, or reject the use case was put before Working Group members in attendance for a vote. During voting, Working Group Chairs sought consensus as outlined in Section 3.3 of the W3C Process Document.
The requirements were derived from the accepted use cases. One of the key tasks for the editors of this document was removing duplicate requirements, editing those that remained and adding missing ones. The editors also ensured links from requirements to use cases were maintained.
A tag set has been defined to label the use cases and to filter the document by tag.
Makx Dekkers
Data publisher
In practice, distributions are sometimes made available in a packaged or compressed format. For example, a group of files may be packaged in a ZIP file, or a single large file may be compressed. The current specification of DCAT allows the package format to be expressed in dct:format or dcat:mediaType but it is currently not possible to specify what types of files are contained in the package.
An example of an approach is the way ADMS defines Representation Technique, which could be used to describe the type of data in a ZIP file, e.g. dcat:mediaType="https://www.iana.org/assignments/media-types/application/zip"; adms:representationTechnique="https://www.iana.org/assignments/media-types/text/csv".
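As a minimal sketch, the ADMS-based approach described above could be written in Turtle as follows. The ex: namespace, the distribution URI and the download URL are hypothetical; only the two properties and their values are taken from the text.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix ex:   <http://example.org/> .

# Hypothetical distribution: a ZIP package that contains CSV files
ex:observations-zip a dcat:Distribution ;
  dcat:downloadURL <http://example.org/files/observations.zip> ;
  # Format of the package itself
  dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> ;
  # Type of the files inside the package, following the ADMS approach
  adms:representationTechnique <https://www.iana.org/assignments/media-types/text/csv> .
```

A consumer reading this description could tell, before downloading, that unpacking the ZIP file is expected to yield CSV data.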
Ruben Verborgh
data consumer, data producer, data publisher
While a content type such as application/json
identifies the kind of parser a client needs for a given representation,
it does not cover all assumptions of the server.
In practice, the server will often follow a much stricter pattern than “everything that is valid JSON”,
restricting itself to one or more subsets of JSON.
For the purpose of this use case, we refer to such subsets generically as “profiles”.
A profile captures structural and/or semantic constraints
in addition to those of the media type.
Note that one profile might be used across different media types:
for instance, a profile could be applied to multiple RDF syntaxes.
In order to inform clients that a representation conforms to a certain profile, servers should be able to explicitly indicate which profile(s) a response conforms to. This then allows the client to make the additional structural and/or semantic interpretations that are allowed within that profile.
Clients and servers should be able to indicate their compatibility and/or preference for certain profiles. This enables clients to request a resource in a specific profile, in addition to the specific content type it requests. A client should be able to determine which profiles a server supports, and with which content types. A client should be able to look up more information about a certain profile.
One example of such a profile is a specific DCAT Application Profile, but many other profiles can be created. For example, another profile could indicate that the representation uses a certain vocabulary.
Ruben Verborgh
data consumer, data producer, data publisher
A response of a server can conform to multiple content types.
For example, a JSON-LD response conforms to the following content types: application/octet-stream, application/json, application/ld+json (even though only one of them will typically be indicated).
Similarly, the response of a server can conform to multiple profiles. For example, a profile X could demand that all persons are described with the FOAF vocabulary, and a profile Y could demand that all books are described with the Schema.org vocabulary. Then a response that uses FOAF for people and Schema.org for books clearly conforms to both profiles. In contrast to content types, it is informative to list both profiles, as their conformance is independent.
Therefore, servers should be able to indicate, if they wish to do so, that a response conforms to multiple profiles. Clients should also be able to specify their preference for one or multiple profiles.
This enables a modular design of profiles, which can be combined when appropriate. With content types, only hierarchical combinations are possible. For example, a JSON-LD document is always a JSON document. However, with profiles, this is not necessarily the case: some of them might allow orthogonal combinations (as is the case in the vocabulary example above).
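The vocabulary example above can be sketched as a small Turtle document that satisfies both profiles at once. The profile URIs ex:profile-X and ex:profile-Y and all other ex: resources are hypothetical. Stating conformance inside the payload with dcterms:conformsTo is only one possible way to signal it and is an assumption of this sketch; it does not address signalling at the protocol level.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix schema:  <http://schema.org/> .
@prefix ex:      <http://example.org/> .

# The document itself declares conformance to two independent, hypothetical profiles
<> dcterms:conformsTo ex:profile-X, ex:profile-Y .

# Profile X: persons are described with FOAF
ex:alice a foaf:Person ;
  foaf:name "Alice" .

# Profile Y: books are described with Schema.org
ex:book-1 a schema:Book ;
  schema:name "An Example Book" ;
  schema:author ex:alice .
```

Neither profile implies the other, so removing either conformance statement would lose information that a client cannot recover from the content type alone.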
Most datasets that are maintained long-term and evolve over time have distributions of multiple versions. However, the current DCAT model does not cover versioning in sufficient detail. Being able to publish dataset version information in a standard way will help both producers, who publish their data on data catalogues or archive data, and dataset consumers, who want to discover new versions of a given dataset.
We can also see some similarities between software versioning and dataset versioning: for instance, some data projects release daily dataset distributions, major/minor releases, etc. We can probably apply some of the lessons learned from software versioning. Several existing dataset description models extend DCAT to provide versioning information, for example the HCLS Community Profile.
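A minimal sketch of what such version information might look like, using dcterms:isVersionOf together with the PAV properties pav:version and pav:previousVersion in the style of the HCLS Community Profile. All ex: URIs, the version numbers and the date are hypothetical, and this is an illustration rather than a recommended pattern.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pav:     <http://purl.org/pav/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .

# Version-independent description of the dataset
ex:dataset a dcat:Dataset ;
  dcterms:title "Example dataset"@en .

# One specific release, linked to the abstract dataset and to its predecessor
ex:dataset-v1-1 a dcat:Dataset ;
  dcterms:isVersionOf ex:dataset ;
  pav:version "1.1" ;
  pav:previousVersion ex:dataset-v1-0 ;
  dcterms:issued "2017-06-01"^^xsd:date .
```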
There are multiple reasons to provide different information about the same concept, so if Linked Data is to exist based on URI object identifiers, and these are to relate to the real world entity, rather than specific implementations (i.e. information records), then it is inevitable that different sets of information will be required for different purposes.
Consider a request for the boundary of a country with a coastline. If the coastline is included as a property, this may amount to many megabytes of detail. Alternatively, a generalised simple coastline or a single point may be provided, or the coastline may not be required at all. (In reality there may be many different versions of a coastline based on different legal definitions, practices, or approximation methods.)
Furthermore, in any graph-based response, the depth of traversal of the graph is always a choice. Consider a request to the GBIF taxonomy service to search for a biological species. A response typically includes not just the species, but potentially more information about the hierarchy of the taxonomy (kingdom, phylum, family, genus, etc.), possible synonyms, and possibly a wide range of metadata about name sources, usages and history. There is a need to offer different choices of how deep such a traversal of relationships should go and how much of it should be returned.
Different information models (response schemas), and different choices of content within the same schema, constitute necessary options, and there may be a large number of these.
Thus there is a need for discovering which profiles a service will offer for a given resource, and a canonical machine readable graph of metadata about what such offerings consist of and how they may be invoked. This may be as simple as providing a profile name, or content profile, schema choice, encoding and languages.
Note that the Linked Data API implementation used by the UK Linked Data effort includes the notion of _view parameters in URI requests. These are "named collections of properties", but the API does not provide a means to attach metadata about what such views consist of. An equivalent HTTP header-based profile negotiation would still need to address this requirement in the same way as agent-driven negotiation (https://www.w3.org/Protocols/rfc2616/rfc2616-sec12.html): what is required is a minimal set of metadata and extension mechanisms for this.
Support for a specific profile is also a powerful search axis, potentially encompassing the full suite of semantic specification and resource interoperability requirements. Thus metadata about profile support can be used for both discovery and mediated traversal via forms of content negotiation.
Jonathan Yu, CSIRO
Data provider, data consumer
Users often access datasets via web services. DCAT provides constructs for associating a resource described by dcat:Dataset with dcat:Distribution descriptions. However, the Distribution class provides only the dcat:accessURL and dcat:downloadURL properties for users to access/download something. It would be useful for users to gain more information about the web service endpoint and how they can interact with the data. If information about the web service is known, with appropriate identifiers for the data, then users can understand the additional context and then invoke a call to the web service to access/download the dataset resource or subsets of it.
Jonathan Yu, Simon Cox (CSIRO)
Data provider, data consumer
We want to be able to describe a dataset record using properties appropriate to the dataset type. This is especially the case in a dataset in the geoscience domain, e.g. an observation of a "feature" in the real world captured using a sensor about some property. There are emerging practices on how to represent these semantics for the data, however, DCAT currently only supports association of a dcat:Dataset with dcat:theme to a skos:Concept. Data providers could extend/specialise dcat:theme to provide specific semantics about the association between dcat:Dataset and the ‘theme’ but is this enough? Furthermore, there are broad/aggregated semantics at the dataset level (e.g. observations in the Great Barrier Reef) and then fine grained semantics for elements within a dataset (e.g. sea surface temperature observations in the Great Barrier Reef). Users need a way to view the aggregated collection level metadata and the associated semantics and then they need a way to view record level metadata to obtain/filter on specific information, e.g. instrument/sensor used, spatial feature, observable property, quantity kind, etc.
Properties from the W3C Semantic Sensor Network SOSA ontology and QUDT may be useful in this context.
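The two levels of description discussed above could be sketched roughly as follows, with a collection-level dcat:Dataset and a single record-level sosa:Observation. All ex: resources, the measured value and the timestamp are hypothetical, and linking the observation to the dataset with dcterms:isPartOf is an assumption of this sketch, since how to make that link is precisely the open question.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sosa:    <http://www.w3.org/ns/sosa/> .
@prefix qudt:    <http://qudt.org/schema/qudt/> .
@prefix unit:    <http://qudt.org/vocab/unit/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .

# Collection-level description with a broad theme
ex:gbr-observations a dcat:Dataset ;
  dcterms:title "Observations in the Great Barrier Reef"@en ;
  dcat:theme ex:concept-marine-observations .

# Record-level description of one observation in that dataset
ex:observation-1 a sosa:Observation ;
  dcterms:isPartOf ex:gbr-observations ;
  sosa:hasFeatureOfInterest ex:great-barrier-reef ;
  sosa:observedProperty ex:sea-surface-temperature ;
  sosa:madeBySensor ex:sensor-42 ;
  sosa:resultTime "2017-07-17T00:00:00Z"^^xsd:dateTime ;
  sosa:hasResult [ a qudt:QuantityValue ;
    qudt:numericValue "26.5"^^xsd:decimal ;
    qudt:unit unit:DEG_C ] .
```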
http://dapds00.nci.org.au/thredds/catalogs/fx3/catalog.html
Simon Cox (CSIRO)
Data catalogue
Some users of DCAT may want to apply it to resources that not everyone would consider a 'dataset'. Some examples are text documents, source code, controlled vocabularies, and ontologies. It's not clear what kinds of resources may be described with DCAT or how one would describe the types listed. Users need guidance about the expected scope for DCAT and on the use of whatever terms DCAT chooses to use or recommend for assigning types (e.g., dc:type). There also needs to be a way for a DCAT description to indicate the 'type' of dataset involved (the semantic type, not the media-type).
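As an illustration of assigning a semantic type, the sketch below uses dcterms:type with values from the DCMI Type Vocabulary. The ex: resources are hypothetical, and whether such resources should be typed as dcat:Dataset at all is exactly the scoping question raised above.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctype:  <http://purl.org/dc/dcmitype/> .
@prefix ex:      <http://example.org/> .

# A catalogued resource that is source code rather than a conventional dataset
ex:analysis-scripts a dcat:Dataset ;
  dcterms:title "Analysis scripts"@en ;
  dcterms:type dctype:Software .

# A catalogued text document
ex:technical-report a dcat:Dataset ;
  dcterms:title "Technical report"@en ;
  dcterms:type dctype:Text .
```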
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
The European Commission's Joint Research Centre (JRC) is a multidisciplinary research organization with the mission of supporting EU policies with independent evidence throughout the whole policy life-cycle.
In order to provide a single access and discovery point to JRC data, a corporate data catalog was launched in 2016, in which datasets are documented using a modular metadata schema consisting of a core profile, which defines the elements that should be common to all metadata records, and a set of domain-specific extensions.
The reference metadata standard used is the DCAT application profile for European data portals [DCAT-AP] (the de facto EU standard metadata interchange format), and the related domain-specific extensions - namely, [GeoDCAT-AP], for geospatial metadata, and [StatDCAT-AP], for statistical metadata. The core profile of JRC metadata is however not using [DCAT-AP] as is, but it complements it with a number of metadata elements that have been identified as most relevant across scientific domains, and which are required in order to support data citation.
More precisely, the most common cross-domain requirements identified at JRC are the following:
[VOCAB-DCAT] does not provide guidance on how to model this information. [DCAT-AP] and [GeoDCAT-AP] partially support these requirements - namely, the specification of dataset authors (dcterms:creator [DCTerms]), data lineage (dcterms:provenance [DCTerms]), and input data (dcterms:source [DCTerms]). For the two remaining requirements, the JRC metadata schema makes use of dcterms:isReferencedBy [DCTerms] (publications) and vann:usageNote [VANN] (usage notes).
These solutions allow a simplified description of the dataset context that can be used for multiple purposes, such as assessing the quality and fitness for use of a dataset, or identifying the dataset most commonly used as input data. Additional details could be provided by representing these relationships more precisely with "qualified" forms, using vocabularies such as [PROV-O], [VOCAB-DQV], or [VOCAB-DUV]: for instance, the relationship between a dataset and input data can be complemented with the model used for processing them, and possibly with additional information on the data generation workflow.
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
Data citation is gaining more and more importance as a way to recognize the scientific value of research data, by treating them as traditionally done for scientific publications.
Requirements for data citation include:
A study has been carried out at the European Commission's Joint Research Centre (JRC), in order to create mappings between [DataCite] (the current de facto standard for data citation) and [DCAT-AP].
The results show that [DCAT-AP] covers most of the required [DataCite] metadata elements, but some of them are missing. In particular:
Guidance should be provided on how to model this information in order to enable data citation also in records represented with [VOCAB-DCAT] and related application profiles.
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
A number of different (possibly persistent) identifiers are widely used in the scientific community, especially for publications, but now increasingly for authors and data.
Different approaches are used for representing them in RDF – best practices are needed to enable their effective use across platforms. But more importantly, they need to be made actionable, irrespective of the platforms they are used in.
Encoding identifiers as HTTP URIs seems to be the most effective way of making them actionable. Notably, quite a few identifier schemes can be encoded as dereferenceable HTTP URIs, and some of them also return machine-readable metadata (e.g., DOIs, ORCIDs). Moreover, they can still be encoded as literals, especially if there is a need to know the identifier “type”. In such a case, a common identifier type registry would ensure interoperability.
Another issue concerns the ability to specify primary and secondary identifiers. This may be a requirement when resources are associated with multiple identifiers.
When encoded as HTTP URIs, the usual approach to model primary and alternative identifiers is to use the former as the resource URI, whereas the latter are specified by using owl:sameAs. In this case, the information about the identifier type is not explicitly specified, and can be derived only from the URI syntax - although this is not always possible.
To model identifiers as literals, [VOCAB-DCAT] uses dcterms:identifier, but it makes no distinction between primary / alternative identifiers, or the identifier type. For alternative identifiers, [DCAT-AP] recommends class adms:Identifier [VOCAB-ADMS], which can be used to specify the identifier type, plus additional information - namely, the identifier scheme agency and the identifier issue date. It is worth noting that adms:Identifier has the primary purpose of describing the identifier itself, which makes it less suitable for linking purposes.
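The approaches described so far can be placed side by side in one sketch. The DOI, the alternative URI, the agency name, the date and the datatype ex:local-id-type are all hypothetical and serve only to show where each piece of information would go.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix adms:    <http://www.w3.org/ns/adms#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .

# Primary identifier: a hypothetical DOI used as the resource URI
<https://doi.org/10.5555/12345678> a dcat:Dataset ;
  # Alternative identifier encoded as an HTTP URI
  owl:sameAs <http://example.org/dataset/abc> ;
  # Identifier as a plain literal, as in DCAT
  dcterms:identifier "10.5555/12345678" ;
  # Alternative identifier described with adms:Identifier, as recommended by DCAT-AP
  adms:identifier [ a adms:Identifier ;
    skos:notation "abc"^^ex:local-id-type ;
    adms:schemeAgency "Example Data Centre" ;
    dcterms:issued "2017-05-18"^^xsd:date ] .
```

Only the adms:Identifier form states the identifier type explicitly; with owl:sameAs the type has to be guessed from the URI syntax.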
Finally, a number of vocabularies have defined specific properties for modeling identifier types, such as prism:doi [PRISM] and bibo:doi [BIBO] for DOIs. Moreover, starting from version 3.2, [SCHEMA-ORG] has defined a super-property schema:identifier for all the identifier-specific properties already used in [SCHEMA-ORG].
An alternative approach is to denote the identifier type with an RDF datatype. In such a case, the same property can be used to specify the identifier - e.g., dcterms:identifier. This solution has the advantage of making it easy to identify all literals used as identifiers (you just have to look up / search for the same property), whereas the datatype can be used to filter specific identifier types.
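A minimal sketch of the datatype-based approach follows. The idtype: namespace and its datatypes are hypothetical stand-ins for whatever a common identifier type registry would provide, and the identifier values are made up.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .
@prefix idtype:  <http://example.org/identifier-type/> .

# Same property for every identifier; the datatype carries the identifier type
ex:dataset a dcat:Dataset ;
  dcterms:identifier "10.5555/12345678"^^idtype:doi ,
                     "abc-001"^^idtype:local .
```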
KC: Note that the libraries/archives community has identifiers that are not (yet) actionable, like ISBNs, ISSNs. These can be coded as dcterms:identifier strings but the string itself is not unique. Not sure how these fit into the overall picture but perhaps we can task someone to bring a specific proposal.
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
Documentation on data lineage is crucial to ensure transparency on how data are created and to facilitate their reproducibility. These have been traditional requirements for scientific data, but are currently becoming relevant also in other communities, especially in the public sector when data are used in support of policy making.
Data lineage is typically specified with a more or less detailed human-targeted documentation. In very few cases, this information is represented in a formal, machine-readable way, enabling a (semi)automated data processing workflow that can be used to re-run the experiment from which the data were produced.
[DCAT-AP] uses property dcterms:provenance [DCTerms] to specify human-readable documentation of data lineage, which can be either embedded in metadata or described in a document linked from the metadata record itself. Moreover, dcterms:source can be used to refer to the input data.
[PROV-O] can be used in order to provide a machine-readable description of data lineage, but best practices on how to use it consistently are missing.
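To make the contrast concrete, the sketch below combines the human-readable DCAT-AP pattern with one possible machine-readable PROV-O description of the same lineage. All ex: resources, the lineage text and the timestamps are hypothetical, and the choice of PROV-O properties is only one of several plausible ones, which is the gap in best practice noted above.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .

ex:dataset a dcat:Dataset, prov:Entity ;
  # Human-readable lineage, as in DCAT-AP
  dcterms:provenance [ a dcterms:ProvenanceStatement ;
    rdfs:label "Derived from the input data by running the example model."@en ] ;
  dcterms:source ex:input-data ;
  # Machine-readable lineage with PROV-O
  prov:wasDerivedFrom ex:input-data ;
  prov:wasGeneratedBy ex:processing-run .

ex:input-data a dcat:Dataset, prov:Entity .

# The activity that produced the dataset from the input data
ex:processing-run a prov:Activity ;
  prov:used ex:input-data ;
  prov:wasAssociatedWith ex:research-unit ;
  prov:startedAtTime "2017-07-17T09:00:00Z"^^xsd:dateTime ;
  prov:endedAtTime "2017-07-17T10:30:00Z"^^xsd:dateTime .

ex:research-unit a prov:Agent .
```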
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
Each metadata standard has its own set of agent roles, and they all use their own vocabularies / code lists. E.g., the latest version (2014) of [ISO-19115-1] has 20 roles, and [DataCite] even more.
Two of the main issues concern (a) how to ensure interoperability across roles defined in different standards, and (b) whether it makes sense to support all of them across platforms. The latter point follows from a common issue in metadata standards that support multiple roles with overlapping semantics (e.g., the difference between a data distributor and a data publisher is not always clear). In these scenarios, whenever metadata are not created by specialists, roles are frequently used inconsistently.
As far as research data are concerned, agent roles are important to denote the type of contribution provided by each individual / organization in producing data.
Moreover, in some cases, an additional requirement is to specify the temporal dimension of a role – i.e., the time frame during which an individual / organisation played a given role - and, maybe, also other information – e.g., the organisation where the individual held a given position while playing that role.
[DCTerms] defines a limited number of agent roles as properties. [VOCAB-DCAT] re-uses some of them (in particular, dcterms:publisher), plus it defines a new one, namely, dcat:contactPoint. [DCAT-AP] and [GeoDCAT-AP] provide guidance on the use of other [DCTerms] roles - in particular, dcterms:creator and dcterms:rightsHolder. However, the role properties defined in [DCTerms] and [VOCAB-DCAT] model just a subset of the agent roles defined in other standards. Moreover, they cannot be used to associate a role with other information concerning its temporal / organizational context.
[PROV-O] could be used for this purpose by using a “qualified attribution”. This is, for instance, the approach used in [GeoDCAT-AP] to model agent roles defined in [ISO-19115-1] but not supported in [DCTerms] and [VOCAB-DCAT]:
a:Dataset a dcat:Dataset ;
  prov:qualifiedAttribution [ a prov:Attribution ;
    # The agent role, as per ISO 19115
    dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/ResponsiblePartyRole/owner> ;
    # The agent playing that role
    prov:agent [ a foaf:Organization ;
      foaf:name "European Union"@en ] ] .
However, to address the different use cases, such “qualified roles” should be compatible with the corresponding non-qualified forms, and both should be mutually inferable. For instance, the example above in [GeoDCAT-AP] is considered as equivalent to:
a:Dataset a dcat:Dataset ;
  dcterms:rightsHolder [ a foaf:Organization ;
    foaf:name "European Union"@en ] .
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
Used in its broader sense, the notion of "data quality" covers different aspects, that may vary depending on the domain.
They include, but are not limited to:
In order to provide a mechanism for the consistent representation of data quality, the most frequently used data quality aspects should be identified, based on existing standards (e.g., [ISO-19115-1]) and practices. Such aspects should also be used to identify possible common modeling patterns.
Solutions for modeling data quality have been defined in [DCAT-AP], [GeoDCAT-AP], [StatDCAT-AP], [VOCAB-DQV], and [VOCAB-DUV]. They cover the following aspects:
Notably, the first 4 aspects (those related to "conformance") follow a common pattern, in that the reference vocabularies model them all by using property dcterms:conformsTo [DCTerms].
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
Understanding the level of precision and accuracy of a dataset is fundamental to verify its fitness for purpose. This is typically denoted in terms of spatial or temporal resolution, but other dimensions are also possible.
Some metadata standards include elements for specifying precision. For instance the latest version (2014) of [ISO-19115-1] supports the possibility of specifying spatial resolution in terms of scale (e.g., 1:1,000,000), distance - further split into horizontal ground distance, vertical distance, and angular distance - and level of detail. However, [VOCAB-DCAT] does not provide guidance on how to model this information.
Actually, for some time, [VOCAB-DCAT] included a property dcat:granularity to model precision, which was dropped in the final version of the vocabulary (see ISSUE-10, and, in particular, the mail proposing to drop this property).
This issue was raised during the development of [VOCAB-DQV], and a solution has been proposed on how to model data precision in terms of spatial resolution - expressed as equivalent scale (e.g., 1:1,000,000) or distance (e.g., 1km) - and data accuracy as percentage - see [VOCAB-DQV], Section 6.13 Express dataset precision and accuracy. Notably, the same approach can be followed to model temporal resolution.
[SDW-BP] addresses this problem as well, re-using the approach defined in [VOCAB-DQV], and additionally provides an example of how to specify accuracy by stating conformance with a quality standard - see [SDW-BP], Best Practice 14: Describe the positional accuracy of spatial data.
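The general shape of the [VOCAB-DQV] approach to precision and accuracy can be sketched as follows. This is an approximation written for illustration, not a copy of the example in that specification: the metrics, the dimension, the measured values and all ex: URIs are hypothetical, and the use of a QUDT unit URI for metres is an assumption.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

ex:dataset a dcat:Dataset ;
  dqv:hasQualityMeasurement ex:dataset-resolution, ex:dataset-accuracy .

# Spatial resolution expressed as a distance of 1 km
ex:dataset-resolution a dqv:QualityMeasurement ;
  dqv:isMeasurementOf ex:spatial-resolution-as-distance ;
  dqv:value "1000"^^xsd:decimal ;
  sdmx-attribute:unitMeasure unit:M .

# Accuracy expressed as a percentage
ex:dataset-accuracy a dqv:QualityMeasurement ;
  dqv:isMeasurementOf ex:accuracy-as-percentage ;
  dqv:value "95"^^xsd:decimal .

# Hypothetical metric definitions that the measurements refer to
ex:spatial-resolution-as-distance a dqv:Metric ;
  skos:definition "Spatial resolution of a dataset expressed as distance."@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension ex:precision .

ex:accuracy-as-percentage a dqv:Metric ;
  skos:definition "Accuracy of a dataset expressed as a percentage."@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension ex:precision .

ex:precision a dqv:Dimension .
```

Temporal resolution could follow the same pattern with a different metric and unit.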
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
One of the ways of expressing data quality is to state whether or not a given dataset is conformant with a given quality standard / benchmark.
[ISO-19115-1] supports a way of modeling this information by allowing one to state whether or not a given dataset passed a given test. Moreover, [INSPIRE-MD] extends this approach by supporting an additional possible result, namely, "not evaluated".
Another approach is provided by the [EARL10] vocabulary, which provides a generic mechanism to model test results. More precisely, [EARL10] supports the following possible outcome values (quoting from Section 2.7 OutcomeValue Class):
- earl:passed: Passed - the subject passed the test.
- earl:failed: Failed - the subject failed the test.
- earl:cantTell: Cannot tell - it is unclear if the subject passed or failed the test.
- earl:inapplicable: Inapplicable - the test is not applicable to the subject.
- earl:untested: Untested - the test has not been carried out.
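As a sketch of how one of these outcome values might be attached to a dataset, the following uses the core [EARL10] classes and properties. The dataset, the test criterion, the assertor and the date are hypothetical; the example shows a failed test, an outcome that a plain conformance statement cannot express.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix earl:    <http://www.w3.org/ns/earl#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .

ex:dataset a dcat:Dataset .

# Hypothetical test criterion derived from a quality standard
ex:quality-test a earl:TestCriterion ;
  dcterms:title "Conformance with the example quality standard"@en .

# Hypothetical party that ran the test
ex:validator a earl:Assertor .

# The assertion links who tested, what was tested, against which test, and the result
ex:assertion-1 a earl:Assertion ;
  earl:assertedBy ex:validator ;
  earl:subject ex:dataset ;
  earl:test ex:quality-test ;
  earl:result [ a earl:TestResult ;
    earl:outcome earl:failed ;
    dcterms:date "2017-07-18"^^xsd:date ] .
```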
[VOCAB-DQV] allows data conformance with a reference quality standard / benchmark to be specified. However, this can model only one of the possible scenarios - i.e., when data are conformant.
[GeoDCAT-AP] provides an alternative and extended way of expressing "conformance" by using [PROV-O], allowing the specification of additional information about conformance tests (when this has been carried out, by whom, etc.), but also different conformance test results (namely, conformant, not conformant, not evaluated).
An example of the [GeoDCAT-AP] [PROV-O]-based representation of conformance is provided by the following code snippet:
a:Dataset a dcat:Dataset ;
  prov:wasUsedBy a:TestingActivity .

a:TestingActivity a prov:Activity ;
  prov:generated a:TestResult ;
  prov:qualifiedAssociation [ a prov:Association ;
    # Here you can specify which is the agent who did the test, when, etc.
    prov:hadPlan a:ConformanceTest ] .

# Conformance test result
a:TestResult a prov:Entity ;
  dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/DegreeOfConformity/conformant> .

a:ConformanceTest a prov:Plan ;
  # Here you can specify additional information on the test
  prov:wasDerivedFrom <http://data.europa.eu/eli/reg/2014/1312/oj> .

# Reference standard / specification
<http://data.europa.eu/eli/reg/2014/1312/oj> a prov:Entity, dcterms:Standard ;
  dcterms:title "Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services"@en ;
  dcterms:issued "2010-11-23"^^xsd:date .
The example states that the reference dataset is conformant with the Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services. Since this case corresponds to the scenario supported in [VOCAB-DQV], the [PROV-O]-based representation above is equivalent to:
a:Dataset a dcat:Dataset ;
  dcterms:conformsTo <http://data.europa.eu/eli/reg/2014/1312/oj> .
# Reference standard / specification
<http://data.europa.eu/eli/reg/2014/1312/oj> a prov:Entity, dcterms:Standard ;
  dcterms:title "Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services"@en ;
  dcterms:issued "2010-11-23"^^xsd:date .
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
The types of possible access restrictions of a dataset are one of the key filtering criteria for data consumers. For instance, while searching a data catalogue, I may not be interested in data I cannot access (closed data), or in data that require me to provide personal information (e.g., data that anyone can access, but only after registration).
Moreover, different distributions of the same dataset are often released with different access restrictions. For instance, a dataset containing sensitive information (such as personal data) should not be publicly accessible, although it would be possible to openly release a distribution where these data are aggregated and/or anonymized.
Finally, whenever data are not publicly available, an explanation of why they are closed should be provided, especially when these data are maintained by public authorities or are the outcomes of publicly funded research activities.
[DCAT-AP] models this information at the dataset level by using property dcterms:accessRights [DCTerms], and defines three possible values:
- Public
- Definition: Publicly accessible by everyone.
- Usage note/comment: Permissible obstacles include: registration and request for API keys, as long as anyone can request such registration and/or API keys.
- Restricted
- Definition: Only available under certain conditions.
- Usage note/comment: This category may include: resources that require payment, resources shared under non-disclosure agreements, resources for which the publisher or owner has not yet decided if they can be publicly released.
- Non-public
- Definition: Not publicly accessible for privacy, security or other reasons.
- Usage note/comment: This category may include resources that contain sensitive or personal information.
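As an illustration, the dataset-level pattern described above could be written as follows. This is only a minimal sketch: the ex: namespace and the dataset names are hypothetical, and the access right URIs are assumed to be those of the EU Publications Office "Access right" authority table, which the text above does not name.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/datasets/> .   # hypothetical namespace

# A dataset anyone can access (possibly after registration)
ex:OpenDataset a dcat:Dataset ;
    dcterms:accessRights <http://publications.europa.eu/resource/authority/access-right/PUBLIC> .

# A dataset containing personal data, hence not publicly accessible
ex:SensitiveDataset a dcat:Dataset ;
    dcterms:accessRights <http://publications.europa.eu/resource/authority/access-right/NON_PUBLIC> .
```

With such statements in place, a catalogue could offer access rights as a search facet, which is the filtering scenario motivating this use case.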
In addition to this, the JRC extension to [DCAT-AP] also uses property dcterms:accessRights at the distribution level, with the following possible values:
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
This concerns how to model dataset distributions available via services / APIs (e.g., a SPARQL endpoint), and not via direct file download. In such cases, it is necessary to know how to query the service / API to get the data. Moreover, an additional issue is that a service may provide access to more than one dataset. As a consequence, users do not know how to get access to the relevant subset of data accessible via a service / API.
Although this is a domain-independent issue, it is a key one in the geospatial domain, where data are typically made accessible via services (e.g., a view or download service) that require specific clients in order to be used. In metadata, the link to such services usually points to an XML document describing the service's "capabilities". This of course puzzles non-expert users, who expect instead to get the actual "data".
Some catalogue platforms (such as GeoNetwork and, to some extent, CKAN) are able to make this transparent for some services (typically, view services), but not for all. It would therefore be desirable to agree on a cross-domain and cross-platform approach to deal with this issue.
In [VOCAB-DCAT], the option of accessing data via a service / API is explicitly mentioned, recommending the use of dcat:accessURL to point to it. However, this property is meant to be used, generically, for indirect data download, so it is not enough to know that the URL points to a service endpoint rather than to a download page.
Actually, for some time, [VOCAB-DCAT] included a class dcat:WebService (subclass of dcat:Distribution) to specify that data is available via a service / API. Other subclasses of dcat:Distribution were also defined to specify direct data access (dcat:Download) and data access via an RSS/Atom feed (dcat:Feed). All these subclasses were dropped in the final version of the vocabulary (see ISSUE-8 / ISSUE-9, and related discussion).
A proposal to address this issue has been elaborated in the framework of the DCAT-AP implementation guidelines (see issue DT2: Service-based data access), where two main requirements have been identified:
As far as point (1) is concerned, the proposal is to associate with distributions the following information:
- The type of the distribution, e.g., a service (dcterms:type).
- The specification the service conforms to (dcterms:conformsTo).
An example is provided by the following code snippet. Here, the distribution's access URL points to a service implemented by using the [WMS] standard of the Open Geospatial Consortium (OGC):
a:Dataset a dcat:Dataset ;
  dcat:distribution [ a dcat:Distribution ;
    dct:title "GMIS - WMS (9km)"@en ;
    dct:description "Web Map Service (WMS) - GetCapabilities"@en ;
    dct:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
    dcat:accessURL <http://gmis.jrc.ec.europa.eu/webservices/9km/wms/meris/?dataset=kd490> ;
    # The distribution points to a service
    dct:type <http://publications.europa.eu/resource/authority/distribution-type/WEB_SERVICE> ;
    # The service conforms to the WMS specification
    dct:conformsTo <http://www.opengis.net/def/serviceType/ogc/wms> ] .
As for (2) (i.e., providing a description of the API / service interface), a number of options have been discussed (e.g., describing a service / API by using an OpenSearch document), but no final decision has been taken.
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
In most cases, the relationships between datasets and related resources (e.g., author, publisher, contact point, publications / documentation, input data, model(s) / software used to create the dataset) can be specified with simple, binary properties available from widely used vocabularies, such as [DCTerms] and [VOCAB-DCAT].
As an example, dcterms:source can be used to specify a relationship between a dataset (output:Dataset) and the dataset it was derived from (input:Dataset):
output:Dataset a dcat:Dataset ;
dcterms:source input:Dataset .
input:Dataset a dcat:Dataset .
However, there may be a need to provide additional information concerning, e.g., the temporal context of a relationship, which requires a more sophisticated representation, similar to the "qualified" forms used in [PROV-O]. For instance, the previous example may be further detailed by saying that the output dataset is an anonymized version of the input dataset, and that the anonymization process started at time t and ended at time t′. By using [PROV-O], this information can be expressed as follows:
output:Dataset a dcat:Dataset ;
prov:qualifiedDerivation [
a prov:Derivation ;
prov:entity input:Dataset ;
prov:hadActivity :data_anonymization
] .
input:Dataset a dcat:Dataset .
# The process of anonymizing the data (load the data, process it, and generate the anonymized version)
:data_anonymization
a prov:Activity ;
# When the process started
prov:startedAtTime "2018-01-23T01:52:02Z"^^xsd:dateTime;
# When the process ended
prov:endedAtTime "2018-01-23T02:00:02Z"^^xsd:dateTime .
Besides [PROV-O], vocabularies such as [VOCAB-DQV] and [VOCAB-DUV] can be used to specify relationships between datasets and related resources. However, guidance is needed on how to use them consistently, since the lack of modeling patterns makes it difficult to aggregate this information across metadata records and catalogs.
Moreover, it is important to define mappings between qualified and non-qualified forms (e.g., along the lines of what is done in [PROV-DC]), not only to make their semantic relationships clear (e.g., dcterms:source is the non-qualified form of prov:qualifiedDerivation), but also to enable metadata sharing and re-use across catalogs that may support only one of the two forms (qualified / non-qualified).
[GeoDCAT-AP] makes use of both qualified and non-qualified forms to model agent roles and data quality conformance test results.
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
[VOCAB-DCAT] makes use of quite a general definition of dataset (quoting from [VOCAB-DCAT], Section 5.3 Class: Dataset: "A collection of data, published or curated by a single agent, and available for access or download in one or more formats."), which is meant to be used as broadly as possible (as stated in ISSUE-62).
As such, it could theoretically be used to model a variety of resources, including documents, software, images and audio-visual content. However, the solution adopted in [VOCAB-DCAT] is not able to address the following scenarios:
dcat:Dataset, it is not possible for users to restrict their search to the specific type of resource they are interested in.
These two scenarios are not hypothetical; they reflect what is typically included, e.g., in catalogs following the [ISO-19115-1] or the [DataCite] standards, which model the documented resources in different ways, and both support records about services.
[GeoDCAT-AP] provides a mechanism to model three out of the more than 20 resource types supported in [ISO-19115-1] - namely, dataset, dataset series, and service.
The adopted approach is as follows:
- Datasets and dataset series are modeled with dcat:Dataset.
- The distinction between the two is made with dcterms:type [DCTerms].
- Services are modeled with dcat:Catalog, in case of a catalog service, and with dctype:Service [DCTerms] in all the other cases.
- The type of service is specified with dcterms:type.
A similar approach has been adopted in the study carried out by the European Commission's Joint Research Centre (JRC) to map [DataCite] to [DCAT-AP].
[DataCite] supports 14 resource types. Most of them fall into the generic [VOCAB-DCAT] definition of "dataset", so they are modeled with dcat:Dataset. Moreover, the DCMI Type Vocabulary [DCTerms] is used to model both the dataset "type" and those resource types that cannot be modeled as datasets (events, physical objects, services).
Stephen Richard, Columbia University
A geologic unit dataset has various service distributions e.g. OGC v1.1.1 WFS as GeoSciML v3 GeologicUnit, GeoSciML portrayal GeologicUnit, GeoSciML v4 GeologicUnit, OGC v. 1.3.0 WMS layer portrayed according to stratigraphic age, layer portrayed according to lithology, or layer portrayed according to stratigraphic unit, and as an ESRI feature service. A user's map client software has a catalog search capability, and requires GeoSciMLv4 encoding in order to function correctly.
The metadata must provide sufficient information about the distributions for the catalog client to filter for only services that offer such a distribution in the results offered to the user.
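One way such distribution metadata might look is sketched below, re-using the dcterms:conformsTo pattern applied to the WMS distribution earlier in this document. This is an assumption-laden illustration, not a proposal: the ex: namespace, the endpoint and the identifier standing in for GeoSciML v4 are hypothetical, and the WFS service type URI is assumed by analogy with the WMS one.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:GeologicUnits a dcat:Dataset ;
    dcat:distribution [ a dcat:Distribution ;
        dcterms:title "Geologic units - WFS, GeoSciML v4"@en ;
        # Placeholder endpoint
        dcat:accessURL <http://example.org/services/wfs> ;
        # Service protocol: URI assumed by analogy with the WMS one used in this document
        dcterms:conformsTo <http://www.opengis.net/def/serviceType/ogc/wfs> ;
        # Data encoding: placeholder standing in for an identifier of GeoSciML v4
        dcterms:conformsTo ex:GeoSciML-v4
    ] .
```

A catalog client could then keep only the distributions whose dcterms:conformsTo values include both the service protocol and the encoding it requires.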
Stephen Richard, Columbia University
A dataset is offered via an OData endpoint, and the distribution link is a template with several parameters for which the user must provide values to obtain a valid response. The client must have a means of knowing the valid value domains for the parameters. This could be via a link to an OpenSearch or URI template description document, or via metadata elements associated with the link that define the parameters and their domains.
Riccardo Albertoni (Consiglio Nazionale delle Ricerche), Antoine Isaac (VU University Amsterdam and Europeana)
Data publisher, data consumer, catalog maintainer, application profile publisher
As stated in the recent W3C Recommendation DWBP, “The quality of a dataset can have a big impact on the quality of applications that use it. As a consequence, the inclusion of data quality information in data publishing and consumption pipelines is of primary importance.” DQV is a new RDF vocabulary which extends DCAT with additional properties and classes suitable for expressing the quality of DCAT datasets and distributions. It defines concepts such as measures and metrics to assess the quality of user-defined quality dimensions, but it also places great importance on allowing many actors to assess the quality of datasets and publish their annotations, certificates and opinions about a dataset. The W3C DWBP Working Group left a list of possible topics to be developed which were out of scope for, or could not be covered by, the DWBP group. In particular, some of the wishes left for the Data Quality Vocabulary (DQV) seem to be related to the activity of this group.
The list below groups some of the DQV wishes by the most likely impacted DXWG deliverable. Each requirement in the list might be expanded into a separate use case after a first scrutiny by the group. Some of the DQV wishes might be included either as use cases or as group issues. The choice of the most appropriate way of inclusion depends on the level of commitment that DCAT1.1 will make about quality documentation, and on how much DCAT will rely on DQV for documenting dataset quality.
Thomas D'haenens, Informatie Vlaanderen
Within our government agency we are struggling to combine two targets. On the one hand, we have a European obligation to share datasets about a wide range of topics (going from environment to transport to ...), following the INSPIRE guidelines. These mostly concern georeferenceable datasets, are based on ISO standards, and go into much more detail than DCAT does. On the other hand, we also have an open data policy and implementations based on DCAT (a much leaner metadata vocabulary).
We're now working on a way to map the INSPIRE-based descriptions to DCAT-based descriptions. Since INSPIRE is a European regulation (and thus binding for all European countries), this work ought to be done at a supranational level. At the least, I believe guidelines and mapping rules should be defined within both DCAT(-AP) and INSPIRE to enhance interoperability. The starting point should be that a dataset must be described only once (of course).
INSPIRE-based metadata catalog, see https://metadata.geopunt.be/zoekdienst/apps/tabsearch/index.html?hl=dut
Jaroslav Pullmann, Christian Mader (Fraunhofer)
Data publisher, data portal operator
While the operation and co-existence of data portals hosting DCAT descriptions is out of the group's scope, the standard should support an explicit regulation of their (re)distribution and hosting. This use case refers to scenarios where individual Datasets or entire Catalogs are copied among data portals.
An explicit reference to the original resource should be maintained within any copy, even when both share the same URI (i.e., are local copies of an identical resource). The reference to the original resource should indicate the resource's publication context (data portal) in a way that is accessible for search engine indexing and browser navigation.
Usage policies might further regulate the handling of the distributed entities, e.g., a duty to keep the copies updated, to display the provenance information, or a prohibition of commercial exploitation.
Jaroslav Pullmann, Christian Mader (Fraunhofer)
DXWG members
Considering DCAT a high-level model for data exchange, agree on significant aspects missing so far and define extension points (typically properties) for the re-use and integration of existing standards, application profiles, etc. The reference listing of aspects deemed relevant is based on an evaluation of DXWG use cases and ISO 19115:2014:
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
[VOCAB-DCAT] uses dcterms:temporal [DCTerms] to specify the temporal coverage of a dataset, but does not provide guidance on how to specify the start / end date.
Actually, the only relevant example provided in [VOCAB-DCAT] makes use of a URI operated by reference.data.gov.uk, denoting a time interval modeled by using [OWL-TIME]. Such a sophisticated representation could be relevant for some use cases, but it is quite cumbersome when the requirement is simply to specify a start / end date, and it makes it difficult to use temporal coverage as a filtering mechanism during the discovery phase.
To address this issue, [VOCAB-ADMS] makes use of properties schema:startDate and schema:endDate [SCHEMA-ORG].
[DCAT-AP] follows the same approach.
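The approach based on schema:startDate and schema:endDate can be sketched as follows. The ex: namespace, the dataset name and the dates are hypothetical, and typing the interval as dcterms:PeriodOfTime is an assumption made for the illustration.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/datasets/> .   # hypothetical namespace

ex:ExampleDataset a dcat:Dataset ;
    # Temporal coverage expressed as a simple start / end date pair
    dcterms:temporal [
        a dcterms:PeriodOfTime ;
        schema:startDate "2016-01-01"^^xsd:date ;
        schema:endDate   "2016-12-31"^^xsd:date
    ] .
```

Because both ends of the interval are plain typed literals, a catalogue can filter on temporal coverage with an ordinary date comparison, without dereferencing an external interval URI.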
Other existing approaches are:
- DATS
- DataCite
- Google datasets
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
A key piece of information needed to correctly interpret geospatial data is the spatial coordinate reference system used. For instance, a coordinate reference system can denote the order in which the coordinates are specified (latitude / longitude, longitude / latitude), whether coordinates denote points, lines, surfaces or volumes, and which unit of measurement is used.
This information is normally included in geospatial metadata since, depending on the coordinate reference system used, a dataset can or cannot be used for specific use cases. So, users can filter the relevant datasets during the discovery phase.
Used more broadly, the notion of "reference system" can be applied to other data as well. For instance, consider a dataset consisting of a set of measurements expressed as numbers. Are they percentages or quantities using a specific unit of measurement?
[SDW-BP] addresses this issue in Best Practice 8, and illustrates a number of options that can be followed.
[GeoDCAT-AP] models this information by specifying data conformance with a given standard, as done in [VOCAB-DQV], which, in this case, is a spatial or temporal reference system. As far as spatial reference systems are concerned, they are denoted by the HTTP URIs operated by the OGC CRS register (see [SDW-BP], Example 22):
@prefix ex: <http://data.example.org/datasets/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:ExampleDataset
  a dcat:Dataset ;
  dcterms:conformsTo <http://www.opengis.net/def/crs/EPSG/0/32630> .
<http://www.opengis.net/def/crs/EPSG/0/32630>
  a dcterms:Standard, skos:Concept ;
  dcterms:type <http://inspire.ec.europa.eu/glossary/SpatialReferenceSystem> ;
  dcterms:identifier "http://www.opengis.net/def/crs/EPSG/0/32630"^^xsd:anyURI ;
  skos:prefLabel "WGS 84 / UTM zone 30N"@en ;
  skos:inScheme <http://www.opengis.net/def/crs/EPSG/0/> .
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
The "spatial" or "geographic coverage" of a dataset denotes the geographic area of the phenomena described in the dataset itself.
How dataset spatial coverage is specified varies depending on the domain and metadata standards used. However, the different solutions basically make use of two different approaches (not mutually exclusive):
Geometries are typically used when it is necessary to denote an arbitrary geographic area, which may not correspond to a specific geographical name. Examples include (but are not limited to) satellite images and data from sensors. Geometries are also used in existing data catalogs for discovery and filtering purposes (e.g., this feature is supported in GeoNetwork and CKAN). Moreover, spatial queries are supported by the majority of the existing triple stores (including those not supporting [GeoSPARQL]).
[VOCAB-DCAT] allows the specification of the spatial coverage of a dataset by using dcterms:spatial [DCTerms], and includes an example making use of an HTTP URI from Geonames denoting a geographical area.
However, no guidance is provided on how to denote arbitrary regions with a "geometry" (i.e., a point, a bounding box, a polygon), which is the typical way spatial coverage is specified in geospatial metadata.
The issue is particularly problematic since the existing vocabularies model this information in very different ways. Moreover, geometries can be expressed in a number of formats (e.g., [GML], WKT, GeoJSON [RFC7946]). This situation makes it difficult to use information on spatial coverage effectively, e.g., to support spatial search and filtering.
[SDW-BP] provides comprehensive guidance on how to specify geometries in the Best Practices under Section 12.2.2 Geometries and coordinate reference systems.
As far as metadata are concerned, one of the documented approaches is the solution adopted in [GeoDCAT-AP], which models spatial coverage by using property locn:geometry [LOCN], and recommends encoding the geometry by using [GML] and/or [WKT] - see [SDW-BP], Example 15:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
<http://www.ldproxy.net/bag/inspireadressen/> a dcat:Dataset ;
  dcterms:title "Adressen"@nl ;
  dcterms:title "Addresses"@en ;
  dcterms:description "INSPIRE Adressen afkomstig uit de basisregistratie Adressen, beschikbaar voor heel Nederland"@nl ;
  dcterms:description "INSPIRE addresses derived from the Addresses base registry, available for the Netherlands"@en ;
  dcterms:isPartOf <http://www.ldproxy.net/bag/> ;
  dcat:theme <http://inspire.ec.europa.eu/theme/ad> ;
  dcterms:spatial [
    a dcterms:Location ;
    locn:geometry
      # Bounding box in WKT
      "POLYGON((3.053 47.975,7.24 47.975,7.24 53.504,3.053 53.504,3.053 47.975))"^^geosparql:wktLiteral ,
      # Bounding box in GML (multi-line literal, hence the triple quotes)
      """<gml:Envelope srsName=\"http://www.opengis.net/def/crs/OGC/1.3/CRS84\">
      <gml:lowerCorner>3.053 47.975</gml:lowerCorner>
      <gml:upperCorner>7.24 53.504</gml:upperCorner>
      </gml:Envelope>"""^^geosparql:gmlLiteral ,
      # Bounding box in GeoJSON (multi-line literal, hence the triple quotes)
      """{ \"type\":\"Polygon\",\"coordinates\":[[
      [3.053,47.975],[7.24,47.975],[7.24,53.504],[3.053,53.504],[3.053,47.975]
      ]] }"""^^<https://www.iana.org/assignments/media-types/application/geo+json>
  ] .
Andrea Perego - European Commission, Joint Research Centre (JRC)
data consumer, data producer, data publisher
Cross-catalog harvesting is not a recent practice. Standard catalog services, such as [OAI-PMH] and [CSW], have been designed to support this functionality. However, in the past, this was typically done across catalogs of homogeneous resources, usually pertaining to the same domain.
This has changed in recent years, especially with the publication of cross-sector catalogs of government data. A notable example is the European Data Portal, which harvests metadata from both cross-sector and thematic catalogs across EU Member States. In this scenario, one of the issues to be addressed is the heterogeneity of the metadata standards and harvesting protocols used across catalogs.
A partial solution is provided by the development of harmonized mappings between metadata standards (see, e.g., the geospatial and statistical extensions to [DCAT-AP]), and by enabling catalog platforms, such as CKAN and GeoNetwork, to support multiple harvesting protocols and to map different metadata standards into their internal representation.
An alternative approach is to enable catalogs to provide metadata in different profiles, using a standard harvesting protocol. Notably, standard protocols such as [OAI-PMH] and [CSW] already support the possibility of serving records in different metadata schemas and serializations, by using specific query parameters. So, what is needed is an API-independent mechanism that can be used by clients with the existing catalog service protocols.
HTTP content negotiation may be the most viable solution, since HTTP is the protocol that Web-based catalog services make use of. However, although the HTTP protocol would allow metadata to be served in different formats, it does not support the ability to negotiate the metadata profile.
The GeoDCAT-AP API was designed to enable [CSW] endpoints to serve [ISO-19115-1] metadata based on the [GeoDCAT-AP] profile, by using the standard [CSW] interface - i.e., parameters outputSchema (for the metadata profile) and outputFormat (for the metadata format).
HTTP content negotiation is supported to determine the returned metadata format, without the need of using parameter outputSchema. The ability to negotiate also the profile would enable a client to query a [CSW] endpoint without the need of knowing the supported harvesting protocol.
Besides the resulting RDF serialisation of the source [ISO-19115-1] records, the API returns a set of HTTP Link headers, using the following relationship types:
- derivedfrom: The URL of the source document, containing the ISO 19139 records.
- profile: The URI of the metadata schema used in the returned document.
- self: The URL of the document returned by the API.
- alternate: The URL of the current document, encoded in an alternative serialization.
It is worth noting that, in its current definition, relationship type alternate denotes just a different serialization, and so it cannot be used to list the possibly alternative metadata schemas.
Alejandra Gonzalez-Beltran
data consumer, data producer, data publisher
Many datasets (or catalogs) are produced with support by a sponsor/funder (e.g. scientific datasets that result from a study funded by a funding organisation or datasets produced by governmental organisations) and the ability to describe and group them by funder is important across domains.
Alejandra Gonzalez-Beltran
data consumer, data producer, data publisher
Datasets are related in many different ways, e.g. the relationships between the different versions of a dataset, 'has part' relationships between datasets, derivation, aggregation.
Examples of relationships:
- the Dryad repository defines the concept of a collection of datasets, for example for datasets related by topic (see, e.g., the collection about Galapagos finches: http://datadryad.org/handle/10255/dryad.148)
- the Gene Expression Omnibus repository (GEO) has the concept of series for related data
- in the Investigation/Study/Assay (ISA) model, it is possible to represent the workflow from raw data to processed data and to indicate the process that yielded the new data
- to represent data citation
See the list of relationTypes given in the DataCite schema: http://schema.datacite.org/meta/kernel-4.0/include/datacite-relationType-v4.xsd
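Some of these relationships can already be approximated with simple properties from [DCTerms]. The sketch below covers only versioning, part-whole and derivation; the ex: namespace and all dataset names are hypothetical, and typing the collection as dcat:Dataset is an assumption rather than established practice.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/datasets/> .   # hypothetical namespace

# Versioning: the 2018 release supersedes the 2017 one
ex:Survey2018 a dcat:Dataset ;
    dcterms:replaces ex:Survey2017 .

ex:Survey2017 a dcat:Dataset .

# Part-whole: a collection grouping datasets related by topic
ex:SurveyCollection a dcat:Dataset ;
    dcterms:hasPart ex:Survey2017 , ex:Survey2018 .

# Derivation: processed data obtained from raw data
ex:ProcessedData a dcat:Dataset ;
    dcterms:source ex:RawData .

ex:RawData a dcat:Dataset .
```

Relationships that need more context, such as the process that yielded the derived data, are not captured by these binary properties.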
(Makx Dekkers) Specific cases of relationships that I have come across:
Alejandra Gonzalez-Beltran Deliverable(s): DCAT1.1, AP Guidelines
Data consumer, data producer, data publisher
Summary/descriptive statistics that characterize a dataset are important elements to have a high-level overview of the dataset. This is particularly important for datasets that are not publicly accessible, but whose access could be requested under certain conditions.
The HCLS dataset description includes a number of statistics for RDF datasets: https://www.w3.org/TR/hcls-dataset/. For healthcare data, there is the Automated Characterization of Health Information at Large-scale Longitudinal Evidence System (ACHILLES): https://www.ohdsi.org/analytic-tools/achilles-for-data-characterization/
Makx Dekkers
Data publishers
DCAT defines a Distribution as follows: "Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed". It turns out that people read this differently. The main interpretations are that (a) the data in different Distributions of the same Dataset *only* differ in format, i.e., the data contains the same data points in different representations, and (b) the data in different Distributions might be related in other ways, for example by containing different data points for similar observations, as in the same kind of data for different years.
In the current situation, a variety of approaches can be observed. In an analysis of the data in the DataHub (see link), at least five different approaches could be observed.
Makx Dekkers
Data publisher
The DCAT model contains a hierarchy of the main entities: a catalogue contains datasets and a dataset has associated distributions. This model does not contemplate a situation in which datasets exist outside of a catalogue, while in practice datasets may be exposed on the Web as individual entities without the description of a catalogue. Also, it may be inferred from the current model that a dataset, if it is defined as part of a catalogue, is part of only one catalogue; no consideration is given to the practice of aggregating datasets, for example when the European Data Portal aggregates datasets from national data portals.
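The two situations can be made concrete with the following sketch. All names in the ex: namespace are hypothetical; whether DCAT intends to allow these patterns is precisely the question raised by this use case.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

# The same dataset listed in a national catalogue and in an aggregating one
ex:NationalCatalog a dcat:Catalog ;
    dcat:dataset ex:AirQuality .

ex:AggregatingCatalog a dcat:Catalog ;
    dcat:dataset ex:AirQuality .

ex:AirQuality a dcat:Dataset ;
    dcterms:title "Air quality measurements"@en .

# A dataset exposed on the Web on its own, not referenced by any catalogue
ex:StandaloneDataset a dcat:Dataset ;
    dcterms:title "Standalone dataset"@en .
```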
Makx Dekkers
Data publisher
In the context of W3C working and interest groups (e.g. SWIG, GLD, DWBP) several overlapping vocabularies have been developed for the description of datasets: DCAT, VoID and Data Cube. These vocabularies define similar concepts, but it is not entirely clear how these concepts are related. For example, all three vocabularies define a notion of ‘dataset’: dcat:Dataset, void:Dataset and qb:DataSet. These notions are similar but not entirely equivalent. For example, it has been argued that void:Dataset and qb:DataSet are more like a dcat:Distribution than a dcat:Dataset.
Valentine Charles, Antoine Isaac
The metadata aggregated by Europeana is described using the Europeana Data Model (EDM), whose goal is to ensure interoperability between various cultural heritage data sources. EDM has been developed to be as re-usable as possible. It can be seen as an anchor to which various finer-grained models can be attached, ensuring their interoperability at a semantic level. The alignments done between EDM and other models such as CIDOC-CRM allow the definition of adequate application profiles that enable the transition from one model to another without hindering the interoperability of the data. Currently, Europeana itself maintains data in two specific flavours of EDM, each with a specific XML Schema (for RDF/XML data):
Both "external" and "internal" schemas are available at https://github.com/europeana/corelib/tree/master/corelib-solr-definitions/src/main/resources/eu
Because XML can’t capture all the constraints expressed in the EDM, an additional set of rules was defined using Schematron and embedded in the XML schema. These technical choices limit the constraints that can be checked and result in a validation approach that is less suitable for Linked Data (XML imposes a document-centric approach).
Europeana is not the only one designing and consuming different profiles of EDM in its ecosystem.
Finally, some third party sources of interest (esp. authority data, thesauri, gazetteers) use models that are building blocks of EDM, like SKOS (i.e., EDM can itself be seen as an application profile / extension of SKOS). Sometimes these sources publish their data in different flavours at once (e.g., http://viaf.org), which makes data consumption both easier (a consumer can find the data elements it can consume) and more difficult (a consumer has to separate elements of interest from irrelevant ones).
Europeana has identified two types of AP:
Currently data providers who would like to provide their data to Europeana using their profiles are unable to do it, even when these profiles would be 'compatible' with the Europeana one for ingestion (which typically happens in the case of a basic EDM extension that adds fields on top of the Europeana profile). This is chiefly because of XML rigidities: Europeana ingestion expects a reference to only one profile/schema. It will not recognize profiles that are compatible with it.
Jaroslav Pullmann with contributions by Andrea Perego, Simon Cox et al.
Data authors, data publishers, data consumers
There is an evident demand for capturing various types of time-related information in DCAT. This meta use case provides a topic overview and summary of general requirements on temporal statements shared among detailed use cases each dealing with an individual aspect.
There are two basic layers where temporal modeling applies, the content (a) and the publication life-cycle layer (b). The former refers to the different time dimensions of the data and its elicitation process, i.e. occurrence (phenomenon), overall coverage (scope) and observation time etc. The latter considers stages of the DCAT publication process independently of any domain or content.
While the use cases differ with regard to purpose and interpretation of the temporal expressions some general patterns become apparent. There are references to singular or recurrent, named (last week, Middle Ages, Thanksgiving Day) or formal, numeric expressions (e.g. ISO 8601). These might be relative (today, P15M) or absolute, represent an instant or interval.
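As a minimal sketch of the instant versus interval distinction, the following Python snippet classifies an absolute ISO 8601 expression. The helper name `parse_temporal` and the restriction to slash-separated `start/end` intervals are assumptions made for illustration; relative expressions (today, P15M) and named periods are outside its scope.

```python
from datetime import datetime


def parse_temporal(expr: str):
    """Classify an absolute ISO 8601 expression as an instant or an interval.

    Hypothetical helper: only 'start/end' intervals and plain date or
    date-time instants are handled; durations such as P15M are not.
    """
    if "/" in expr:
        start, end = expr.split("/", 1)
        return ("interval", datetime.fromisoformat(start), datetime.fromisoformat(end))
    return ("instant", datetime.fromisoformat(expr))


instant = parse_temporal("2018-06-01T12:00:00")
interval = parse_temporal("2018-01-01/2018-12-31")
```

A fuller treatment would also need recurrence and named periods, which is the kind of detail delegated to the sub-use cases.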
The description of evidence and motivation in context for these expressions is delegated to sub-use cases.
Possible use cases at content level (a)
Possible use cases at life-cycle and publishing level (b)
Lieven Raes, Thomas D'haenens
Data authors, data publishers, data consumers
In the field we see people describing their datasets confronted with different regulations, profiles etc., each with its own target or goal. Slowly we are starting to cross domain boundaries (especially between geo and open - on a high level), but the process is still hard. This is partly due to the lack of guidelines/recommendations on a higher level (W3C, OGC).
E.g. within the OTN (Open TransportNet) project, harmonization work has been done on different levels (more info: https://www.slideshare.net/plan4all/white-paper-data-harmonization-interoperability-in-opentransportnet). The risk exists that when everyone starts to do so, we lose interoperability along the way.
GeoDCAT-AP has started with a first attempt at bridging the gap between Geo and Open (https://joinup.ec.europa.eu/node/139283). Within Informatie Vlaanderen, a project is running to combine the two worlds in one catalogue with an automated mapping (https://www.w3.org/2016/11/sdsvoc/SDSVoc16_PPT_v02).
See above
Rob Atkinson
data publishers, search engines, data users
Major search engines use mechanisms formalised via schema.org to extract structured metadata from Web resources. It is possible, but not given, that some may directly support DCAT in future. Regardless, consideration should be given to exposing DCAT content using equivalent schema.org elements - and this may perhaps be a case for content negotiation by profile, where equivalent schema.org properties are entailed in a DCAT graph.
Schema.org defines a range of equivalent properties
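A minimal sketch of how such equivalences could be used to expose a schema.org view of a DCAT record is given below. The mapping table is an illustrative subset chosen for this example, not a normative alignment, and the record layout (a flat dictionary keyed by prefixed property names) is an assumption.

```python
# Illustrative subset of DCAT-to-schema.org equivalences (an assumption for
# this sketch, not a normative mapping).
DCAT_TO_SCHEMA = {
    "dct:title": "schema:name",
    "dct:description": "schema:description",
    "dcat:keyword": "schema:keywords",
    "dct:issued": "schema:datePublished",
    "dct:modified": "schema:dateModified",
    "dct:publisher": "schema:publisher",
    "dct:license": "schema:license",
}


def to_schema_org(record: dict) -> dict:
    """Return a schema.org view of a DCAT record, dropping unmapped properties."""
    return {DCAT_TO_SCHEMA[p]: v for p, v in record.items() if p in DCAT_TO_SCHEMA}


record = {
    "dct:title": "Air quality 2018",
    "dcat:keyword": ["air", "quality"],
    "dcat:theme": "environment",  # no equivalent in the table, so it is dropped
}
schema_view = to_schema_org(record)
```

Properties without an entry in the table are silently dropped here; a real profile would need to state whether that loss is acceptable.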
Karen Coyle
Data producers, data consumers. In particular this facilitates sharing between different data consumers.
When considering using data produced by someone else, it is necessary to know not only what their vocabulary terms are, but how those terms are used. This means that you need to know
It would be ideal if the profile could be translated into a validation language (such as ShEx or SHACL). If not, it should at least be able to link to such a language.
Karen Coyle
Data consumers
The GLAM communities (galleries, libraries, archives, museums) produce metadata based on a small set of known guidance rules. These rules determine choices made in creating the metadata such as: form of names for people, families and organizations; selection of primary titles; use of vocabularies like language lists, subject lists, genre and form lists, geographic designators. There needs to be a place in a profile to indicate which of the relevant standards was used in producing the metadata.
The primary metadata format used by libraries includes these, but that is a very narrow case.
Alejandra Gonzalez-Beltran
data consumer, data producer, data publisher
Dataset distributions may or may not comply with different types of standards, e.g. may be represented in specific formats, follow specific content guidelines, may be annotated with specific ontologies, may comply with standards for describing their use of identifiers, etc.
Compliance with specific standards is useful information for data consumers, data producers and data publishers, and it may help to identify how to use a dataset, what tools may be needed, etc. DCAT currently supports describing the file format of a dataset distribution, but it is not possible to indicate compliance with other types of standards.
Jaroslav Pullmann, Keith Jeffery
Data consumers
A prerequisite of communicating, annotating or linking a dataset (or a defined part of it) is its unambiguous identification. Since a dataset and its distributions might evolve over time the identification method has to take into account their versioning. The respective distributions might significantly differ in terms of media type and further serialization properties and should therefore have distinct identifiers.
While DCAT currently does not support resource versioning, subsets (slices) and derivations of a Dataset might be specified as separate, related Dataset instances. Each one is exposed by a set of dedicated Distribution resources identified by a resolvable URI. These Distribution URIs are used to refer to a particular materialization of the (abstract) Dataset. Their design preferably follows the RESTful URI naming conventions. Referencing Distribution metadata has the benefit of providing access to related properties, e.g. usage restrictions and licensing. Conversely, when there are multiple independent copies of a Dataset's metadata across Catalogs, this method suffers from generating alternative identifiers for the same resource (i.e. the same access/download target).
Makx Dekkers
data producer, data publisher, data consumer (of statistical data)
In many cases, data producers and data publishers may want to inform the data consumers about the quality aspects of the data so that consumers better understand the possibilities and risks of using and reusing the data.
Data producers may have human-readable, textual information or more precise machine-readable information either as part of their publication process or as external resources that they can attach to the description of the dataset.
The European StatDCAT application profile for data portals in Europe specifies the optional use of the property dqv:hasQualityAnnotation with a range of a subclass of oa:Annotation from the Open Annotation Model, which allows annotations to be either embedded text or an external resource identified by a URI.
Karen Coyle
user interface developers, data input staff
Profiles can be used to drive input forms for staff creating the data. To facilitate this, as many features as possible of a good input environment need to be supported. Profiles need to have suitable rules for the validation of values, such as date forms and pick lists. There need to be human-readable definitions of terms and, if needed, instructions for input that would accompany a property and its value.
Karen Coyle
data consumers
In the library environment, datasets are issued as periodic aggregated (and up-to-date) files with daily or weekly changes to that file as supplements. The change files have new records that are additions to the file, changed records that must replace the record with the same identifier in the file, and deleted records that must result in the matching record being removed from the local copy of the file.
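The update procedure described above can be sketched as follows. The representation of a change file as a list of `(action, identifier, record)` tuples is a hypothetical simplification; actual library change files use their own record formats.

```python
def apply_changes(base: dict, changes: list) -> dict:
    """Apply a change file to a local copy keyed by record identifier.

    Each change is a hypothetical (action, identifier, record) tuple where
    action is 'add', 'replace' or 'delete'.
    """
    updated = dict(base)  # leave the caller's copy untouched
    for action, identifier, record in changes:
        if action in ("add", "replace"):
            updated[identifier] = record
        elif action == "delete":
            updated.pop(identifier, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return updated


base = {"r1": {"title": "A"}, "r2": {"title": "B"}}
changes = [
    ("add", "r3", {"title": "C"}),
    ("replace", "r1", {"title": "A (revised)"}),
    ("delete", "r2", None),
]
local_copy = apply_changes(base, changes)
```

A consumer can only run such a procedure if the dataset description states that the supplements are deltas against a base file rather than stand-alone replacements.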
Karen Coyle
data producer, data publisher, validation program(s)
Many of the functions needed for a profile are also ones that will be targeted by validation routines, such as cardinality of properties, valid values, etc. To define these redundantly in profiles and in validation routines risks the creation of contradictory rules relating to the profile.
There needs to be a way to coordinate the profile and the validation function. This could be a matter of basing the profile on a defined validation language (SHACL or ShEx or Schematron...), or of deriving the validation rules from the profile. Note, though, that the existing validation languages are quite atomistic and so far there has not been a demonstration of creating a usable profile from a validation language. In any case, the relationship between these two related languages needs to be clarified.
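One way to picture the second option, deriving validation rules from the profile, is sketched below. The profile structure, the property selection and the allowed values are hypothetical; the point is only that cardinality and value constraints are declared once and the checks are generated from that single declaration.

```python
# Hypothetical profile: one declaration per property, shared by documentation
# and validation so the two cannot drift apart.
PROFILE = {
    "dct:title": {"min": 1, "max": 1},
    "dct:description": {"min": 0, "max": 1},
    "dcat:keyword": {"min": 0, "max": None},  # None means unbounded
    "dct:accrualPeriodicity": {"min": 0, "max": 1, "values": {"daily", "weekly"}},
}


def validate(record: dict, profile: dict = PROFILE) -> list:
    """Return a list of human-readable violations; empty means the record conforms."""
    errors = []
    for prop, rule in profile.items():
        values = record.get(prop, [])
        if len(values) < rule["min"]:
            errors.append(f"{prop}: at least {rule['min']} value(s) required")
        if rule["max"] is not None and len(values) > rule["max"]:
            errors.append(f"{prop}: at most {rule['max']} value(s) allowed")
        allowed = rule.get("values")
        if allowed is not None:
            errors.extend(
                f"{prop}: '{v}' is not an allowed value" for v in values if v not in allowed
            )
    return errors


problems = validate({"dct:title": [], "dct:accrualPeriodicity": ["hourly"]})
```

In practice the generated checks would more likely be expressed in SHACL or ShEx than in ad hoc code; the sketch only illustrates the single-source principle.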
Peter Brenton, Simon Cox (CSIRO)
data consumer, data producer, data publisher
It is helpful and often essential to know the business context in which one or more datasets are created and managed, in particular concerning the project, program, initiative through which the dataset was generated. These are typically associated with funding or policy.
The business context links associated entities participating in a project. Projects can be an umbrella or unifying entity for one or many datasets which share the same project context.
DCAT or users of DCAT have often used externally defined classes for associated concepts from FOAF and W3C Organization ontology, but there is not currently any slot or guidance about how to relate a dataset to its business context. Moreover, there is no general agreement on a class for 'Project'. The class might include spatial, temporal, social, descriptive and financial information. There are a number of discipline or domain specific Project classes (see Links below), but there does not appear to be anything available which is sufficiently expressive and generic.
As part of the DXWG there might be an opportunity to define a basic ontology for projects and related concepts. This should have a tight scope and few dependencies, similar to the approach used in W3C Organization ontology.
Peter Winstanley
data producer, data publisher, validation program(s)
Many events in the life cycle of a dataset change the information content - data is added or removed as different versions of the dataset are created. Other events do not alter the information content of the dataset. An example of the latter is deduplication. Perhaps encryption or compression are similar examples. There are issues to be considered relating to the type of deduplication (e.g. file vs block), but in the main these events do not reduce the information content.
The need of this use case is to be able to record these events in the provenance data, but to have some way to indicate that although something was done to the dataset, there was actually no change in information content. In this way it is slightly different to the regular interpretation of a "version".
Rob Atkinson
data user
When considering Linked Data applications viewing datasets there are potentially multiple ways to access items. Large datasets in particular may support access to specific items, queries, subsets and optimally packaged downloads of data. Because the implications for user and agent interaction vary greatly between these modes of access, there is a need to distinguish between distributions that package the entire dataset and access methods that deliver specific items, queries or subsets. Furthermore, access methods delivering queries and subsets may introduce container elements, independently of the dataset itself.
So if we had a metadata catalog that supported a query service that returned a set of DCAT records (using say the BotDCAT-AP profile), and also an API that delivered a specific record using this same profile, we would need to be able to specify that the query service supports the BotDCAT-AP profile and the MyDCATCatalogSearchFunction profile (that specifies how the container structure is implemented?).
This chapter lists the requirements for the Working Group deliverables.
In some requirements the expression 'recommended way' is used. This means that a single best way of doing something is sought. It does not say anything about the form this recommended way should have, or who should make the recommendation.
In some requirements the expression 'canonical property' is used. This identifies a specific requirement to provide a property with the required semantics to meet the requirement and guidance on usage of this property. (Note, a 'recommended way' may also involve such canonical properties - requirements described this way reflect cases where such properties have been implemented by communities and the identified requirement is to consolidate and make these properties generally used by the wider community.)
Many of these requirements depend on interpretation of keywords such as "profile". At this stage these terms are defined in the Tags section. They should be cross-referenced as links, but it may be best to choose an appropriate format first - i.e. a formal glossary?
Encode identifiers as dereferenceable HTTP URIs
Indicate type of identifier (e.g. prism:doi, bibo:doi, ISBN etc.).
Provide means to distinguish the primary and alternative (legacy) identifiers.
Identify DCAT resources that are subject to versioning, i.e. Catalog, Dataset, Distribution.
Provide clear guidance on conditions, type and severity of a resource's update that might motivate the creation of a new version in scenarios such as dataset evolution, conversion, translations etc, including how this may assist change management processes for consumers (e.g. semantic versioning techniques)
Provide a means to identify a version (URI-segment, property etc.). Clarify relationship to identifier of the subject resource.
It must be possible to assign a date to a version. The version identifier might refer to the release date.
Provide a way to indicate the change delta or other change information from the previous version.
Ability to provide information on how to use the data.
Should this also apply to specific distributions?
Express summary statistics and descriptive metrics to characterize a Dataset.
Provide a way to link to structured information about the provenance of a dataset including:
dct:creator, dct:publisher etc. are special cases, which require guidance; further roles may be defined in provenance or other richer models. The requirement is to establish an extensible mechanism, and for profiles to specify canonical equivalents for the special case properties of dcat:Dataset.
Provide means to describe the funding (amount and source) of a Dataset (or entire Catalog).
Provide a means to define a "project" as a research, funding or work organization context of a dataset.
Identify common modeling patterns for different aspects of data quality based on frequently referenced data quality attributes found in existing standards and practices.
This includes potential use and revision of DQV
Aspects include:
Define a way to associate quality-related information with Datasets.
Provide a mechanism to indicate the type of data being described and recommend vocabularies to use given the dataset type indicated.
Providing examples of scope will provide guidance, without being unnecessarily restrictive. The key requirement is interoperability, achieved by using standardised vocabulary terms. It is unclear whether a canonical registry is required or whether communities should constrain choice via DCAT profiles.
Provide recommendations and mechanisms for data providers to describe datasets with a formal description of aspects (e.g. instrument/sensor used, spatial feature, observable property, quantity kind).
Finer grained semantics will also allow dataset dimensions to be described, and distributions described using these semantics - for example how a dataset is composed of multiple subsets, such as a set of image bands or tiles, or parameterised filtering/subsetting services
This requirement applies to catalogues of DCAT records, and is thus related to the concept of profiles, which are expected to define classification dimensions (use of controlled vocabularies in mandatory properties)
Provide means to specify the reference system(s) used in a dataset.
Provide means to specify spatial coverage with geometries.
Allow for specification of the start and/or end date of temporal coverage.
Revise definition of Distribution. Make clearer what a Distribution is and what it is not. Provide better guidance for data publishers.
Define a way to include identification of the schema the described data conforms to
This may include rich information via extensions points, URI templates and parameters, dimensions and subsetting operations, dereferenceable identifiers of service behaviour profiles and canonical identifiers of well-known web service interfaces (e.g. OGC - WFS, WMS, OpenDAP, REST apis).
Such a description may be provided through identifier of a suitable profile that defines interoperability conditions the distribution conforms to.
Ability 1) to describe the type of distribution and 2) provide information about the type of service
Such a description may be provided through a suitable profile identifier that defines a profile of the relevant service type.
Provide a means to specify the container structure of a distribution for access methods that return lists, independently of the specification of the profile the list items conform to.
Related to the distinction between dcat:accessURL and dcat:downloadURL. May be covered by service type, but specifically supports identification of lists vs items (items have no container). Lists may be wrapped in a structural element or not, so this also needs to be described.
Define a way to specify the content of packaged files in a Distribution. For example, a set of files may be organised in an archive format and then compressed, but the dct:format property only indicates the encoding type of the outer layer of packaging.
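The layering problem can be illustrated with Python's standard `mimetypes` module, which reports a compression encoding and an inner media type separately. The helper name `packaging_layers` is hypothetical, and the sketch works from the file name alone, so it says nothing about the files inside the archive, which is exactly the information this requirement asks for.

```python
import mimetypes


def packaging_layers(filename: str) -> list:
    """List the packaging layers implied by a file name, outermost first."""
    media_type, encoding = mimetypes.guess_type(filename)
    layers = []
    if encoding:
        layers.append(encoding)    # outer compression layer, e.g. gzip
    if media_type:
        layers.append(media_type)  # inner archive or media type
    return layers


layers = packaging_layers("observations.tar.gz")
```

A single format value can name only one of these layers, and neither layer describes the packaged files themselves.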
Ability to represent the different relationships between datasets, including: versions of a dataset; collections of datasets (to describe their inclusion criteria and to define the 'hasPart'/'partOf' relationship); and derivation, e.g. processed data that is derived from raw data.
This requirement is to be rolled in here: Update method of Dataset: Indicate the update method of a Dataset description, e.g. whether each new dataset entirely supersedes previous ones (is stand-alone), or whether there is a base dataset with files that effect updates to that base.
Provide a means to indicate the relation of Datasets to a project.
Clarify the relationships between Datasets and zero, one or multiple Catalogs, e.g. in scenarios of copying, harvesting and aggregation of Dataset descriptions among Catalogs.
Provide a way to specify access restrictions for both a dataset and a distribution.
Provide a way to define the meaning of the access restrictions for a dataset or distribution and to specify what is required to access a dataset and distribution.
Provide a way to specify information required for data citation (e.g., dataset authors, title, publication year, publisher, persistent identifier)
Provide a way to link publications about a dataset to the dataset.
Provide a way to cite the original metadata with a dereferenceable identifier.
Provide means to express rights relating to reuse of DCAT metadata
Analyse and compare similar concepts defined by vocabularies related to DCAT (e.g. VOID, Data Cube dataset).
Define guidelines on how to create a DCAT description of a VOID or Data Cube dataset.
Define schema.org equivalents for DCAT properties to support entailment of schema.org compliant profiles of DCAT records.
Ability to express "guidance rules" or "creation rules" in DCAT
Define qualified forms to specify additional attributes of appropriate binary relations (e.g. temporal context).
This requirement is still under review
Specify mapping of qualified to non-qualified forms (lowering mapping). The reverse requires information (qualification) that might not be present/evident.
This requirement is still under review
Profiles are "named collections of properties" or metadata terms (if not RDF).
github:issue/275
A profile can have multiple base specifications.
github:issue/268
One can create a profile of profiles, with elements potentially inherited on several levels.
github:issue/270
Profiles may add to or specialise clauses from one or more base specifications. Such profiles inherit all the constraints from base specifications.
Some data may conform to one or more profiles at once
github:issue/608
Data publishers may publish data according to different profiles, either simultaneously (e.g. in one and the same data "distribution") or in parallel (e.g. via content negotiation)
github:issue/274
Profiles can have human-readable definitions of terms and input instructions.
github:issue/283
There needs to be a property in the profile where the rules for the descriptive content can be provided. This would apply to the entire profile.
github:issue/255
A profile may be (partially) "implemented" by "schemas" (in OWL, SHACL, XML Schema...) that allow different levels of data validation
github:issue/273
Profiles should be able to indicate which external specifications are expected to be applied/have been applied to values of individual properties.
github:issue/280
Profiles may provide rules governing value validity.
github:issue/277
Profiles may provide lists of values to pick from in order to populate data elements.
github:issue/282
Profiles may provide rules on cardinality of terms (including "recommended").
github:issue/276
Profiles may express dependencies between elements of the vocabulary (if A then not B, etc.).
github:issue/278
Profiles can have what is needed to drive forms for data input or for user display.
github:issue/281
A profile should have human-readable documentation that expresses for humans the main components of a profile, which can also be available as machine-readable resources (ontology or schema files, SHACL files, etc). This includes listing of elements in the profile, instructions and recommendations on how to use them, constraints that determine what data is valid according to the profile, etc.
github:issue/272
Profiles may be written in or may link to a document or schema in a validation language (ShEx, SHACL, XML Schema).
github:issue/279
Profiles must be discoverable through machine-readable metadata that describes what is offered and how to invoke the offered profiles.
github:issue/288
From the perspective of management of profiles, and guidance to users and data experts, ecosystems of profiles should be properly described (e.g. in profile catalogues/repositories), especially documenting the relationships between profiles and what they are based on, and between profiles that are based on other profiles.
github:issue/271
Enable the ability to negotiate the data profile via HTTP, similar to the negotiation of metadata formats today.
github:issue/265
There needs to be a way to encode the necessary relations using an HTTP Link header.
github:issue/266
Profiles should support discoverability via search engines.
github:issue/222
A client should be able to determine which profiles are supported by a server, and with which content types or other properties, in order to receive the one most appropriate for their use.
github:issue/285
There should be a way to look up additional information about a profile - this may be machine readable for run-time mediation or used to develop or configure a client.
github:issue/286
According to the notes of group discussions, this requirement has not yet been approved.
Metadata about server profile support can be used for discovery and mediated traversal via content negotiation.
github:issue/264
When requesting a representation, a client must be able to specify which profile it prefers the representation to adhere to. This information about the requested profile is not a replacement for content type (e.g. application/xml), language (e.g. zh-Hant) nor any other negotiated dimension of the representation.
github:issue/289
Some data may conform to one or more profiles at once. A server can indicate that a response conforms to multiple profiles.
github:issue/287
A profile must have an identifier that can be served with a response to an API or HTTP request.
github:issue/284
A short token to specify a profile may be used as long as there is a discoverable mapping from it to the profile's identifying URI.
github:issue/290
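The negotiation requirements above can be pictured with a minimal sketch of server-side profile selection, including the resolution of short tokens to profile URIs. The token registry, the example.org URIs and the function names are all hypothetical, and the sketch deliberately leaves open how the preference is carried in an HTTP request. It covers the profile dimension only; content type and language would be negotiated separately.

```python
# Hypothetical token-to-URI registry; the URIs are placeholders.
PROFILE_TOKENS = {
    "dcat": "http://example.org/profiles/dcat",
    "dcat-ap": "http://example.org/profiles/dcat-ap",
}


def resolve(token_or_uri: str) -> str:
    """Map a short token to its profile URI; pass full URIs through unchanged."""
    return PROFILE_TOKENS.get(token_or_uri, token_or_uri)


def negotiate(preferred: list, supported: list):
    """Return the first client-preferred profile the server supports, or None."""
    supported_uris = {resolve(p) for p in supported}
    for candidate in preferred:
        uri = resolve(candidate)
        if uri in supported_uris:
            return uri
    return None


# The client prefers an unsupported profile first, then falls back to dcat-ap.
chosen = negotiate(["geodcat-ap", "dcat-ap"], ["dcat", "dcat-ap"])
```

Returning the full URI rather than the token keeps the response unambiguous even when different servers use different tokens for the same profile.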