SoftWare Heritage persistent IDentifiers (SWHIDs)#
version 1.6, last modified 2021-04-30
Overview#
You can point to objects present in the Software Heritage archive by means of SoftWare Heritage persistent IDentifiers, or SWHIDs for short, which are guaranteed to remain stable (persistent) over time. Their syntax, meaning, and usage are described below. Note that they are identifiers and not URLs, even though URL-based resolvers for SWHIDs are also available.
A SWHID consists of two separate parts: a mandatory core identifier that can point to any software artifact (or “object”) available in the Software Heritage archive, and an optional list of qualifiers that specify the context where the object is meant to be seen and can point to a subpart of the object itself.
Objects come in different types:
contents
directories
revisions
releases
snapshots
Each object is identified by an intrinsic, type-specific object identifier that is embedded in its SWHID as described below. The intrinsic identifiers embedded in SWHIDs are strong cryptographic hashes computed on the entire set of object properties. Together, these identifiers form a Merkle structure, specifically a Merkle DAG.
See the Software Heritage data model for an overview of object types and how they are linked together. See swh.model.git_objects for details on how the intrinsic identifiers embedded in SWHIDs are computed.
The optional qualifiers are of two kinds:
context qualifiers: carry information about the context where a given object is meant to be seen. This is particularly important, as the same object can be reached in the Merkle graph by following different paths starting from different nodes (or anchors), and it may have been retrieved from different origins, which may evolve between different visits
fragment qualifiers: pinpoint specific subparts of an object
Syntax#
Syntactically, SWHIDs are generated by the <identifier> entry point in the following grammar:
<identifier> ::= <identifier_core> [ <qualifiers> ] ;
<identifier_core> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"snp" (* snapshot *)
| "rel" (* release *)
| "rev" (* revision *)
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
<qualifiers> ::= ";" <qualifier> [ <qualifiers> ] ;
<qualifier> ::=
<context_qualifier>
| <fragment_qualifier>
;
<context_qualifier> ::=
<origin_ctxt>
| <visit_ctxt>
| <anchor_ctxt>
| <path_ctxt>
;
<origin_ctxt> ::= "origin" "=" <url_escaped> ;
<visit_ctxt> ::= "visit" "=" <identifier_core> ;
<anchor_ctxt> ::= "anchor" "=" <identifier_core> ;
<path_ctxt> ::= "path" "=" <path_absolute_escaped> ;
<fragment_qualifier> ::= "lines" "=" <line_number> ["-" <line_number>] ;
<line_number> ::= <dec_digit> + ;
<url_escaped> ::= (* RFC 3987 IRI *)
<path_absolute_escaped> ::= (* RFC 3987 absolute path *)
Where, for both <url_escaped> and <path_absolute_escaped>, all occurrences of ; (and %, as required by the RFC) have been percent-encoded (as %3B and %25 respectively). Other characters can be percent-encoded, e.g., to improve readability and/or embeddability of SWHIDs in other contexts.
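The grammar above can be checked mechanically. The following is a minimal sketch of a syntax-only validator, not the swh.model implementation; the names CORE_RE, LINES_RE and is_valid_swhid are hypothetical. As a simplification, origin values are only required to be non-empty and path values to start with /, rather than being validated against RFC 3987.

```python
import re

# Core identifier: "swh:1:<object_type>:<40 lowercase hex digits>".
CORE_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):[0-9a-f]{40}")
# Value of the "lines" fragment qualifier: one line number or a range.
LINES_RE = re.compile(r"[0-9]+(-[0-9]+)?")


def is_valid_swhid(swhid: str) -> bool:
    """Return True if `swhid` matches the grammar (syntax check only)."""
    # Splitting on ";" is safe because ";" inside values must be
    # percent-encoded as %3B.
    core, *qualifiers = swhid.split(";")
    if not CORE_RE.fullmatch(core):
        return False
    for qualifier in qualifiers:
        key, sep, value = qualifier.partition("=")
        if not sep:
            return False
        if key in ("visit", "anchor"):
            ok = CORE_RE.fullmatch(value) is not None
        elif key == "lines":
            ok = LINES_RE.fullmatch(value) is not None
        elif key == "origin":
            ok = bool(value)  # simplification: no RFC 3987 validation
        elif key == "path":
            ok = value.startswith("/")  # simplification: absolute path only
        else:
            ok = False  # unknown qualifier key
        if not ok:
            return False
    return True


print(is_valid_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"))
```

A check of this kind says nothing about whether the object exists in the archive; it only confirms that the string is well-formed.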
Semantics#
Core identifiers#
: is used as separator between the logical parts of core identifiers. The swh prefix makes explicit that these identifiers are related to SoftWare Heritage. 1 (<scheme_version>) is the current version of this identifier scheme. Future editions will use higher version numbers, possibly breaking backward compatibility, but without breaking the resolvability of SWHIDs that conform to previous versions of the scheme.
A SWHID points to a single object, whose type is explicitly captured by <object_type>:
snp to snapshots,
rel to releases,
rev to revisions,
dir to directories,
cnt to contents.
The actual object pointed to is identified by the intrinsic identifier <object_id>, which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the content and metadata of the object itself, as follows:
for snapshots, intrinsic identifiers are SHA1 hashes of manifests computed as per swh.model.git_objects.snapshot_git_object()
for releases, as per swh.model.git_objects.release_git_object(), which produces the same result as a git release hash
for revisions, as per swh.model.git_objects.revision_git_object(), which produces the same result as a git commit hash
for directories, as per swh.model.git_objects.directory_git_object(), which produces the same result as a git tree hash
for contents, the intrinsic identifier is the sha1_git hash returned by swh.hashutil.MultiHash.digest(), i.e., the SHA1 of a byte sequence obtained by juxtaposing the ASCII string "blob" (without quotes), a space, the length of the content as decimal digits, a NULL byte, and the actual content of the file.
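The rule for contents can be reproduced with the Python standard library alone. The sketch below is an illustration of the byte sequence described above, not the swh.model code; the function name content_swhid is hypothetical.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the core SWHID of a content object from its raw bytes."""
    # "blob", a space, the length in decimal digits, a NUL byte, then the data.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty content has the same identifier as Git's empty blob.
print(content_swhid(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Directories, revisions, releases, and snapshots need the manifest formats defined in swh.model.git_objects and are not covered by this sketch.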
Qualifiers#
; is used as separator between the core identifier and the optional qualifiers, as well as between qualifiers. Each qualifier is specified as a key/value pair, using = as a separator.
The following context qualifiers are available:
origin: the software origin where an object has been found or observed in the wild, as a URI;
visit: the core identifier of a snapshot corresponding to a specific visit of a repository containing the designated object;
anchor: a designated node in the Merkle DAG relative to which a path to the object is specified, as the core identifier of a directory, a revision, a release or a snapshot;
path: the absolute file path, from the root directory associated to the anchor node, to the object; when the anchor denotes a directory or a revision, and almost always when it’s a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root directory is the one pointed to by HEAD (possibly indirectly), and undefined if such a reference is missing;
The following fragment qualifier is available:
lines: line number(s) of interest, usually within a content object
We recommend equipping identifiers meant to be shared with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to present them in the order given above, i.e., origin, visit, anchor, path, lines. Redundant information should be omitted: for example, if the visit is present, and the path is relative to the snapshot indicated there, then the anchor qualifier is superfluous; similarly, if the path is empty, it may be omitted.
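As an illustration of the recommended ordering and of the percent-encoding rule from the Syntax section, here is a minimal sketch that assembles a SWHID from a core identifier and a dictionary of qualifiers. The names QUALIFIER_ORDER, escape_value and format_swhid are hypothetical, and the sketch assumes that qualifier values are passed in unescaped form.

```python
# Recommended presentation order for qualifiers.
QUALIFIER_ORDER = ("origin", "visit", "anchor", "path", "lines")


def escape_value(value: str) -> str:
    """Percent-encode the characters that must be escaped in qualifiers."""
    # "%" goes first, so the "%" introduced by "%3B" is not escaped again.
    return value.replace("%", "%25").replace(";", "%3B")


def format_swhid(core: str, qualifiers: dict) -> str:
    """Join a core identifier and its qualifiers in the recommended order."""
    unknown = set(qualifiers) - set(QUALIFIER_ORDER)
    if unknown:
        raise ValueError(f"unknown qualifiers: {sorted(unknown)}")
    parts = [core]
    for key in QUALIFIER_ORDER:
        if key in qualifiers:
            parts.append(f"{key}={escape_value(qualifiers[key])}")
    return ";".join(parts)


print(format_swhid(
    "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b",
    {"lines": "9-15", "path": "/Examples/SimpleFarm/simplefarm.ml"},
))
# swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15
```

Dropping redundant qualifiers (for instance an anchor made superfluous by the visit) is left to the caller, since it depends on how the object was reached.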
Interoperability#
URI scheme#
The swh URI scheme is registered at IANA for SWHIDs. The present document constitutes the specification of that URI scheme.
Git compatibility#
SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the Git way of computing identifiers for its objects. The <object_id> part of a SWHID for a content object is the Git blob identifier of any file with the same content; for a revision it is the Git commit identifier for the same revision, etc. This is not the case for snapshot identifiers, as Git does not have a corresponding object type.
Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git).
Automatically fixing invalid SWHIDs#
User interfaces may fix invalid SWHIDs by lower-casing the <identifier_core> part of a SWHID, if it contains upper-case letters because of user errors or limitations in software displaying SWHIDs.
However, implementations displaying or generating SWHIDs should not rely on this behavior, and must display or generate only valid SWHIDs when technically possible.
User interfaces should show an error when such an automatic fix occurs, so users have a chance to fix their SWHID before pasting it into another interface that does not perform the same corrections. This also makes it easier to understand issues when a case-sensitive qualifier has its casing altered.
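The fix described in this section could be sketched as follows. This is one possible reading rather than a reference implementation: the function name fix_swhid_case is hypothetical, and the /README.md path in the usage line is made up for illustration. Only the core identifier is lower-cased; the qualifiers are left untouched because their values may be case-sensitive.

```python
def fix_swhid_case(swhid: str):
    """Lower-case the core identifier and report whether a fix was applied."""
    # Everything before the first ";" is the core identifier.
    core, sep, qualifiers = swhid.partition(";")
    fixed = core.lower() + sep + qualifiers
    return fixed, fixed != swhid


fixed, changed = fix_swhid_case(
    "SWH:1:CNT:94A9ED024D3859793618152EA559A168BBCBB5E2;path=/README.md"
)
if changed:
    # A user interface would surface this to the user.
    print("SWHID was not valid and has been corrected to:", fixed)
```

Returning the flag alongside the corrected string lets the caller show the error that this section asks for.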
Examples#
Core identifiers#
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 points to the content of a file containing the full text of the GPL3 license
swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017
swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo support for masks
swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f points to Darktable release 2.3.0, dated 24 December 2016
swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Identifiers with qualifiers#
The following SWHID denotes the lines 9 to 15 of a file content that can be found at absolute path /Examples/SimpleFarm/simplefarm.ml from the root directory of the revision swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0 that is contained in the snapshot swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 taken from the origin https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git:
swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15
Here is an example of a SWHID with a file path that requires percent-escaping:
swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/
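To see what the escaped path stands for, the qualifiers can be split apart and the path percent-decoded. The sketch below uses only the Python standard library; the helper name split_swhid is hypothetical, and the code assumes a syntactically valid SWHID written without whitespace.

```python
from urllib.parse import unquote


def split_swhid(swhid: str):
    """Split a SWHID into its core identifier and a dict of raw qualifiers."""
    # ";" never appears unescaped inside a qualifier value.
    core, *raw_qualifiers = swhid.split(";")
    qualifiers = {}
    for qualifier in raw_qualifiers:
        key, _, value = qualifier.partition("=")
        qualifiers[key] = value
    return core, qualifiers


swhid = (
    "swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04"
    ";origin=https://github.com/web-platform-tests/wpt"
    ";visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499"
    ";anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96"
    ";path=/html/semantics/document-metadata/the-meta-element"
    "/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/"
)
core, qualifiers = split_swhid(swhid)
# Decoding turns %3B back into the ";" that is part of the directory name.
decoded_path = unquote(qualifiers["path"])
print(decoded_path.rsplit("/", 2)[-2])
# x;url=foo
```

The = inside x;url=foo needs no escaping, because only the first = of a qualifier separates the key from the value in this sketch.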
Implementation#
Computing#
An important property of any SWHID is that its core identifier is intrinsic: it can be computed from the object itself, without having to rely on any third party. An implementation of SWHID that allows doing so locally is the swh identify tool, available from the swh.model Python package under the GPL license. This package can be installed via the pip package manager with the one-liner pip3 install swh.model[cli] on any machine with Python (at least version 3.7) and pip installed (on a Debian or Ubuntu system a simple apt install python3 python3-pip will suffice; see the general instructions for other platforms).
SWHIDs are also automatically computed by Software Heritage for all archived objects as part of its archival activity, and can be looked up via the project Web interface.
This has various practical implications:
when a software artifact is obtained from Software Heritage by resolving a SWHID, it is straightforward to verify that it is exactly the intended one: just compute the core identifier from the artifact itself, and check that it is the same as the core identifier part of the SWHID
the core identifier of a software artifact can be computed before its archival on Software Heritage
Choosing what type of SWHID to use#
swh:1:dir: SWHIDs are the most robust SWHIDs, as they can be recomputed from the simplest objects (a directory structure on a filesystem), even when all metadata is lost, without relying on the Software Heritage archive.
Therefore, we advise implementers and users to prefer this type of SWHID over swh:1:rev: and swh:1:rel: to reference source code artifacts.
However, since keeping the metadata is also important, you should add an anchor qualifier to swh:1:dir: SWHIDs whenever possible, so the metadata stored in the Software Heritage archive can be retrieved when needed.
This means, for example, that you should prefer swh:1:dir:a8eded6a2d062c998ba2dcc3dcb0ce68a4e15a58;anchor=swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f over swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f.
Resolvers#
Software Heritage resolver#
SWHIDs can be resolved using the Software Heritage Web interface. In particular, the root endpoint / can be given a SWHID and will lead to the browsing page of the corresponding object, like this: https://archive.softwareheritage.org/<identifier>.
A dedicated /resolve endpoint of the Software Heritage Web API is also available to programmatically resolve SWHIDs; see: GET /api/1/resolve/(swhid)/.
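The two endpoints described above follow a simple URL pattern. The sketch below only builds the URLs and performs no network request; the helper names browse_url and api_resolve_url are hypothetical, and the SWHID is inserted as-is, on the assumption that no additional URL escaping is needed for a core identifier.

```python
ARCHIVE = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL of the browsing page, via the root endpoint of the Web interface."""
    return f"{ARCHIVE}/{swhid}"


def api_resolve_url(swhid: str) -> str:
    """URL of the Web API resolve endpoint: GET /api/1/resolve/(swhid)/."""
    return f"{ARCHIVE}/api/1/resolve/{swhid}/"


swhid = "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
print(browse_url(swhid))
print(api_resolve_url(swhid))
```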
Third-party resolvers#
The following third party resolvers support SWHID resolution:
Identifiers.org; see: http://identifiers.org/swh/ (registry identifier MIR:00000655).
Note that resolution via Identifiers.org currently only supports core identifiers due to syntactic incompatibilities with qualifiers.
Examples:
https://identifiers.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
https://identifiers.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
https://identifiers.org/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
https://n2t.net/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
https://n2t.net/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453
References#
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Identifiers for Digital Objects: the Case of Software Source Code Preservation. In Proceedings of iPRES 2018: 15th International Conference on Digital Preservation, Boston, MA, USA, September 2018, 9 pages.
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Referencing Source Code Artifacts: a Separate Concern in Software Citation. In Computing in Science and Engineering, volume 22, issue 2, pages 33-43. ISSN 1521-9615, IEEE. March 2020.