Collected notes on OCI Artifacts
tarballs, sha digests and json as far as the eye can see
Published at
OCI[1] artifacts are an expansion of the OCI Image and OCI Distribution specifications that allows arbitrary blobs to be stored in the same content-addressable manner as container image layers, configuration, and manifests.
Content Addressing
Content addressing is the idea of storing information based on some function of
its content rather than an arbitrary label[2]. Importantly, this function must reliably and,
for practical purposes, uniquely identify content. If the function were the length of the content, then
all content that is 5 bytes long would compete for the same address, which isn't
very useful. In the case of OCI registries the canonical function is sha256,
although other hashing functions are supported.
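To make the point about choosing the function concrete, here is a minimal sketch of a content-addressed store in Python. The `ContentStore` class and the helper names are made up for illustration and are not part of any OCI tooling; the sketch only shows why length makes a poor address and a cryptographic hash a usable one.

```python
import hashlib


def length_address(content: bytes) -> str:
    """A poor addressing function: every 5-byte blob gets the same address."""
    return str(len(content))


def sha256_address(content: bytes) -> str:
    """The canonical OCI choice: the sha256 hash of the content."""
    return hashlib.sha256(content).hexdigest()


class ContentStore:
    """Hypothetical in-memory store keyed by a function of the content."""

    def __init__(self, address_fn):
        self.address_fn = address_fn
        self.blobs = {}

    def put(self, content: bytes) -> str:
        address = self.address_fn(content)
        self.blobs[address] = content  # a colliding address overwrites silently
        return address

    def get(self, address: str) -> bytes:
        return self.blobs[address]


by_length = ContentStore(length_address)
by_hash = ContentStore(sha256_address)

for store in (by_length, by_hash):
    store.put(b"hello")
    store.put(b"world")

print(len(by_length.blobs))  # both 5-byte blobs landed on one address
print(len(by_hash.blobs))    # each blob kept its own address
```

With length as the address, the second `put` replaces the first; with sha256, both blobs survive and can be fetched back by their addresses.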
Digests
Digests are the combination of a hashing function's name and the result of that
hash when applied to some content. If we want to store OoTR_1465294_OHBKB3ZQLY.zpf
in the registry and then later retrieve it, we would need to know its digest.
Digests always refer to a specific revision of the target because only that revision produces that digest. All content in the registry is addressable via its digest.
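A digest string can be sketched in a few lines of Python. The bytes below are a placeholder, not the real patch file, and `make_digest` and `verify` are illustrative helper names rather than anything defined by the spec.

```python
import hashlib


def make_digest(content: bytes, algorithm: str = "sha256") -> str:
    """Return '<algorithm>:<hex>' for the given bytes."""
    hasher = hashlib.new(algorithm)
    hasher.update(content)
    return f"{algorithm}:{hasher.hexdigest()}"


def verify(content: bytes, digest: str) -> bool:
    """Check that content hashes to the digest it is claimed to have."""
    algorithm, _, _hex = digest.partition(":")
    return make_digest(content, algorithm) == digest


# Placeholder bytes standing in for OoTR_1465294_OHBKB3ZQLY.zpf
patch_bytes = b"not the real patch file"
digest = make_digest(patch_bytes)

print(digest)
print(verify(patch_bytes, digest))         # unchanged content matches
print(verify(patch_bytes + b"!", digest))  # a different revision does not
```

Because the algorithm name travels with the hash, a client can tell from the digest alone how to re-check the bytes it downloaded.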
Tags
Tags are used as human-readable aliases for a specific digest, and multiple tags
may reference the same digest. A tag always refers to exactly one digest, but
which digest it refers to may change over time. To emphasize: any tag
may change its referenced digest over time, not just specific tags such as latest.
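The relationship between tags and digests can be pictured as a small lookup table. This is only a sketch: the `tags` dictionary is a hypothetical stand-in for whatever a registry keeps internally, and the tag names are examples.

```python
import hashlib


def digest_of(content: bytes) -> str:
    return "sha256:" + hashlib.sha256(content).hexdigest()


# Hypothetical stand-in for a repository's tag table: tag name -> digest
tags = {}

first = digest_of(b"first revision")
second = digest_of(b"second revision")

# Several tags may point at the same digest.
tags["latest"] = first
tags["7.1"] = first
before = dict(tags)

# Any tag can be moved later, not only "latest".
tags["latest"] = second
tags["7.1"] = second

print(before["7.1"] == first)
print(tags["7.1"] == second)
```

Pulling by digest always yields the same bytes; pulling by tag yields whatever the table says at that moment.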
mediaType
A mediaType[3] is a human- and machine-readable instruction for how to interpret a blob. A blob is ascribed a media type when it is referenced by a descriptor, and the same blob can have different media types ascribed to it in different descriptors. Media types are made up of the following parts:
- Type
- Subtype Tree
- Suffix
- Parameters
application/vnd.sudonters.zootr.settings.v7.1+json
- application
- vnd.sudonters.zootr.settings.v7.1
- json
This media type refers to a settings file for OOTR version 7.1, stored in JSON format. Such specific media types are extremely common. Not only is developing such taxonomies fun[4], but the resulting names double as semantic labels and versioning schemes.
Versioning is important to allow the artifact to change over time. In these examples I have decided to tie the version of my artifacts to the stable version of the OOTR program at the time of writing. Using the suffix to indicate the representation may seem silly, but it allows clients to request alternative representations, if available, through a process known as content negotiation.
Changing only the suffix of the settings media type indicates a different representation of the same settings, in this case the single-line encoded string. Extensions to the program could introduce XML or YAML formatted versions of the settings. A client may prefer any of these formats, and the server should provide that representation if available.
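Content negotiation can be sketched as a simple preference match. The `+yaml` and `+xml` media types below are hypothetical, following the extensions imagined above, and `negotiate` is an illustrative helper that ignores the weighting (q-values) a real Accept header may carry.

```python
SETTINGS_JSON = "application/vnd.sudonters.zootr.settings.v7.1+json"
SETTINGS_YAML = "application/vnd.sudonters.zootr.settings.v7.1+yaml"  # hypothetical
SETTINGS_XML = "application/vnd.sudonters.zootr.settings.v7.1+xml"    # hypothetical


def negotiate(preferred, available):
    """Return the first media type the client prefers that the server has."""
    for media_type in preferred:
        if media_type in available:
            return media_type
    return None  # nothing acceptable; a server would answer with an error


# The server only has JSON and YAML representations in this sketch.
available = {SETTINGS_JSON, SETTINGS_YAML}

# The client would like XML most, then YAML, then JSON.
chosen = negotiate([SETTINGS_XML, SETTINGS_YAML, SETTINGS_JSON], available)
print(chosen)
```

The client asked for XML first, but since the server does not have it, the next preference that is available wins.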
application/vnd.sudonters.zootr.notes.v7.1+md; charset=utf8
- application
- vnd.sudonters.zootr.notes.v7.1
- md
- charset=utf8
This media type describes a text document written in Markdown and encoded as UTF-8. Which parameters are valid depends on the media type.
The spec defines several of its own media types[5].
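The four parts can be pulled apart mechanically. The following is a rough sketch rather than a complete RFC 6838 parser; `MediaType` and `parse_media_type` are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MediaType:
    type: str
    subtype_tree: str
    suffix: str = ""
    parameters: dict = field(default_factory=dict)


def parse_media_type(value: str) -> MediaType:
    """Split 'type/tree+suffix; key=value' into its four parts."""
    essence, *raw_params = [part.strip() for part in value.split(";")]
    type_, _, subtype = essence.partition("/")
    tree, plus, suffix = subtype.rpartition("+")
    if not plus:  # no '+' present, so the whole subtype is the tree
        tree, suffix = subtype, ""
    parameters = {}
    for raw in raw_params:
        if raw:
            key, _, val = raw.partition("=")
            parameters[key.strip()] = val.strip()
    return MediaType(type_, tree, suffix, parameters)


settings = parse_media_type("application/vnd.sudonters.zootr.settings.v7.1+json")
notes = parse_media_type("application/vnd.sudonters.zootr.notes.v7.1+md; charset=utf8")

print(settings)
print(notes)
```

Note that the version lives inside the subtype tree, so two media types that differ only in suffix describe the same thing in different representations.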
Blobs
Blobs[6] are the actual content we want to store in the registry. If we want to store the content directly, meaning uncompressed and otherwise unmodified, we would create the digest directly from the contents. However, if we want to compress the content, then the digest should be calculated from the compressed content.
This distinction is important because the OCI image specification refers to digests of both compressed and uncompressed content in different contexts, and these digests all refer, conceptually at least, to the same content apart from the difference in compression. Blobs are always referenced by the digest of the raw bytes at rest in the registry.
Blobs themselves do not carry any kind of mediaType. Instead this interpretation is provided via descriptors that reference blobs. This means that blobs are strictly concerned with the raw bytes that make up the content and defer interpretation of the content to other registry resources, like manifests, that declare descriptors.
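The difference between the two digests is easy to see with a sketch. The content here is placeholder bytes rather than the real patch file, and `digest_of` is an illustrative helper.

```python
import gzip
import hashlib


def digest_of(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


content = b"placeholder patch contents\n" * 64
compressed = gzip.compress(content, mtime=0)  # fixed mtime keeps the bytes reproducible

uncompressed_digest = digest_of(content)
stored_digest = digest_of(compressed)  # what the registry would address this blob by

print(uncompressed_digest)
print(stored_digest)
print(digest_of(gzip.decompress(compressed)) == uncompressed_digest)
```

Both digests describe "the same" content conceptually, but only the one computed over the stored bytes can be used to fetch the compressed blob.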
Descriptors
Descriptors are how all content is referenced by resources within the repository. Descriptors are required to carry:
- The mediaType of the blob
- The digest of the blob
- The size in bytes of the blob
The digest is used both to address the blob and to verify that the correct blob was downloaded when interacting with untrusted sources. The size in bytes can similarly be used as a check for the correct blob. The mediaType describes how the client should interpret the blob in the context of this artifact.
An example descriptor of the gzipped OOTR patch file:
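As a sketch of how such a descriptor could be produced and checked in Python: the blob is placeholder bytes, the `+gzip` media type is an assumption made for this example, and `describe` and `matches` are illustrative helpers.

```python
import gzip
import hashlib
import json

# Hypothetical media type for a gzipped patch; the "+gzip" suffix is an assumption.
PATCH_GZIP = "application/vnd.sudonters.zootr.patch.v7.1+gzip"


def describe(blob: bytes, media_type: str) -> dict:
    """Build the three required descriptor fields for a blob."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def matches(blob: bytes, descriptor: dict) -> bool:
    """Cheap size check first, then the digest check."""
    if len(blob) != descriptor["size"]:
        return False
    return describe(blob, descriptor["mediaType"])["digest"] == descriptor["digest"]


# Placeholder bytes standing in for the real gzipped patch file
blob = gzip.compress(b"placeholder patch contents", mtime=0)
descriptor = describe(blob, PATCH_GZIP)

print(json.dumps(descriptor, indent=2))
print(matches(blob, descriptor))
```

A client that downloads the blob from an untrusted source can run the same `matches` check before trusting the bytes.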
A descriptor may have other attributes on it such as annotations -- arbitrary
metadata represented as key-value pairs -- or an artifactType, which is a
secondary mediaType that accurately describes the artifact when the descriptor
does not point to an image layer but is using an image layer mediaType[7].
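There is also an "empty descriptor", reproduced below, which is intended as a kind of "null" value when an artifact does not have any content associated with a property, such as layers or configuration:
{
  "mediaType": "application/vnd.oci.empty.v1+json",
  "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
  "size": 2,
  "data": "e30="
}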
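The values in the empty descriptor are not magic: they all follow from the two-byte payload `{}`. A short sketch that derives them:

```python
import base64
import hashlib
import json

payload = b"{}"  # the entire content of the "empty" blob

empty_descriptor = {
    "mediaType": "application/vnd.oci.empty.v1+json",
    "digest": "sha256:" + hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
    # The optional data field embeds the content itself, base64 encoded.
    "data": base64.b64encode(payload).decode("ascii"),
}

print(json.dumps(empty_descriptor, indent=2))
```

Because the content is embedded in `data`, a client does not need to fetch the blob to know what it contains.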
Manifests
A manifest describes a specific revision of an image or artifact by recording descriptors for the configuration and layers of that revision. The primary manifest mediaType is application/vnd.oci.image.manifest.v1+json, which can be used to describe both image and non-image artifacts.
Config
The manifest config property is intended as machine readable instructions for
interacting with the image or artifact. When the manifest describes an OCI
image, this descriptor points to configuration that describes how to construct
the unionfs for the container, the default entrypoint, and other information
needed to launch the image[8].
An artifact may have similar instructions. An artifact with mediaType
application/vnd.sudonters.zootr.patch.v7.1 might be bundled with a
configuration that indicates the settings and RNG seed used to generate the
specific patch file.
If the artifact does not have any configuration it should use the "Empty Descriptor" described above.
Layers
These are descriptors to blobs within the repository. For images, layers must be an ordered collection of unionfs layers; however, the only restriction placed on artifacts is that they SHOULD have at least one layer and SHOULD use the empty descriptor instead of providing an empty layer collection. Additionally, there is no requirement that all layers in a manifest share the same media type.
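Putting config and layers together, an artifact manifest might be assembled roughly like this. The blobs are placeholders, the choice to use the empty descriptor for config and to list a patch and a settings layer is an assumption made for illustration, and `describe` is a helper invented for this sketch.

```python
import hashlib
import json


def describe(blob: bytes, media_type: str) -> dict:
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


# No configuration in this sketch, so config is the empty descriptor.
empty = describe(b"{}", "application/vnd.oci.empty.v1+json")

# Placeholder blobs; layers in one manifest may use different media types.
patch_layer = describe(
    b"placeholder patch bytes",
    "application/vnd.sudonters.zootr.patch.v7.1",
)
settings_layer = describe(
    b'{"placeholder": true}',
    "application/vnd.sudonters.zootr.settings.v7.1+json",
)

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "artifactType": "application/vnd.sudonters.zootr.patch.v7.1",
    "config": empty,
    "layers": [patch_layer, settings_layer],
}

# The manifest's own digest is taken over the exact bytes that get pushed.
body = json.dumps(manifest, indent=2).encode("utf-8")
manifest_digest = "sha256:" + hashlib.sha256(body).hexdigest()

print(body.decode("utf-8"))
print(manifest_digest)
```

Changing even the whitespace of the serialized manifest changes `manifest_digest`, which is why clients hash the bytes as received rather than re-serializing.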
Indexes
Tags provide an alias to a specific digest. However, if we were to pull and
inspect the Docker image ubuntu:22.04 on an amd64 host and an arm64 host we
would see they have different digests[9]. This is accomplished via
application/vnd.oci.image.index.v1+json documents, aka indexes, which are a
collection of manifest descriptors in a single document. Initially this was
designed to support "multiarch images" but it could also be used any time a
registry might offer various representations of an artifact. The specification
also allows an index to reference another index[10].
After pushing every manifest referenced by the index to the registry, we
additionally push the index itself. If we want to describe a particular
configuration to store in the registry, we might want to store both the JSON
formatted artifact and the encoded string format as separate manifests rather
than as separate layers within the same manifest. We submit the request much as
we would for a manifest but with a different Content-Type header:
PUT /v2/<name>/manifests/<reference>
Content-Type: application/vnd.oci.image.index.v1+json
And the body looks like:
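A rough sketch of such a body, and of the request that would carry it, built in Python. The manifest bytes are placeholders, the `+string` suffix, registry URL, repository name and tag are all hypothetical, and the request is only constructed here, never sent.

```python
import hashlib
import json
import urllib.request

INDEX_TYPE = "application/vnd.oci.image.index.v1+json"
MANIFEST_TYPE = "application/vnd.oci.image.manifest.v1+json"


def manifest_descriptor(manifest_bytes: bytes, artifact_type: str) -> dict:
    """Describe an already-pushed manifest so the index can point at it."""
    return {
        "mediaType": MANIFEST_TYPE,
        "artifactType": artifact_type,
        "digest": "sha256:" + hashlib.sha256(manifest_bytes).hexdigest(),
        "size": len(manifest_bytes),
    }


# Placeholder bytes standing in for the two manifests pushed beforehand
json_manifest = b'{"placeholder": "json settings manifest"}'
string_manifest = b'{"placeholder": "encoded settings manifest"}'

index = {
    "schemaVersion": 2,
    "mediaType": INDEX_TYPE,
    "manifests": [
        manifest_descriptor(
            json_manifest, "application/vnd.sudonters.zootr.settings.v7.1+json"
        ),
        manifest_descriptor(
            string_manifest,
            "application/vnd.sudonters.zootr.settings.v7.1+string",  # hypothetical suffix
        ),
    ],
}
body = json.dumps(index, indent=2).encode("utf-8")

# Hypothetical registry, repository and tag; nothing is sent over the network.
request = urllib.request.Request(
    "http://localhost:5000/v2/zootr/manifests/settings-7.1",
    data=body,
    method="PUT",
    headers={"Content-Type": INDEX_TYPE},
)

print(request.get_method(), request.full_url)
print(body.decode("utf-8"))
```

Each entry in `manifests` is an ordinary descriptor, so a client can pick the representation it wants and then fetch that manifest by digest.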
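Clients retrieve an index by requesting the reference with the matching Accept header:
GET /v2/<name>/manifests/<reference>
Accept: application/vnd.oci.image.index.v1+json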
If an index exists at the reference then the registry will return it. Otherwise an error is returned, which indicates either that there is nothing at the tag at all or that content exists at the tag but in a different representation.
Supplemental Manifests
The registry I was playing with -- registry:2 -- did not support subject and
referrers, which were added in OCI Distribution 1.1, which is pretty new at time
of writing.
A manifest may reference another manifest[11] via the subject field.
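As a sketch, a notes manifest that points at the patch manifest could be built like this. The notes content is a placeholder, the notes media type is reused from the media type examples as an assumption, and the subject's `size` is a stand-in value because the real byte length of the patch manifest is not given in these notes.

```python
import hashlib
import json

MANIFEST_TYPE = "application/vnd.oci.image.manifest.v1+json"
NOTES_TYPE = "application/vnd.sudonters.zootr.notes.v7.1+md"  # assumed artifact type

# Digest of the patch manifest, as quoted in the text
PATCH_MANIFEST_DIGEST = (
    "sha256:06fce1ebf8c29ec26851e9f4fa35c0b9e1329a0ba245efc21f1c387366391c6b"
)
PATCH_MANIFEST_SIZE = 0  # placeholder; must be the real byte length in practice

notes = b"# Notes\n\nPlaceholder notes for the patch.\n"

notes_manifest = {
    "schemaVersion": 2,
    "mediaType": MANIFEST_TYPE,
    "artifactType": NOTES_TYPE,
    "config": {
        "mediaType": "application/vnd.oci.empty.v1+json",
        "digest": "sha256:" + hashlib.sha256(b"{}").hexdigest(),
        "size": 2,
    },
    "layers": [
        {
            "mediaType": NOTES_TYPE,
            "digest": "sha256:" + hashlib.sha256(notes).hexdigest(),
            "size": len(notes),
        }
    ],
    # The subject is itself a descriptor, pointing at the manifest being annotated.
    "subject": {
        "mediaType": MANIFEST_TYPE,
        "digest": PATCH_MANIFEST_DIGEST,
        "size": PATCH_MANIFEST_SIZE,
    },
}

print(json.dumps(notes_manifest, indent=2))
```

The patch manifest itself is untouched by this; the relationship is recorded only on the referencing side.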
Clients may then discover this manifest by asking the registry for manifests
that reference sha256:06fce1ebf8c29ec26851e9f4fa35c0b9e1329a0ba245efc21f1c387366391c6b
-- the patch manifest. A registry that implements the referrers API MUST NOT
return a 404 to these queries[12]. When responding, a registry produces an image
index that holds the referencing manifests. Clients may additionally request
only referencing manifests of a particular type by appending
artifactType=${MEDIATYPE} as a query parameter to the request. For example, a
client might be interested in locating only the notes for a given patch.
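A sketch of building that query in Python follows. The registry address and repository name are hypothetical, the notes media type is assumed from the earlier examples, and `referrers_url` and `only_type` are helper names made up here. The client-side filter is included on the assumption that a client may want to re-check the type itself.

```python
from urllib.parse import parse_qs, urlencode, urlparse

PATCH_MANIFEST_DIGEST = (
    "sha256:06fce1ebf8c29ec26851e9f4fa35c0b9e1329a0ba245efc21f1c387366391c6b"
)
NOTES_TYPE = "application/vnd.sudonters.zootr.notes.v7.1+md"  # assumed artifact type


def referrers_url(registry: str, name: str, digest: str, artifact_type=None) -> str:
    """Build a referrers query, optionally narrowed to one artifactType."""
    url = f"{registry}/v2/{name}/referrers/{digest}"
    if artifact_type:
        # urlencode escapes the '/' and '+' inside the media type
        url += "?" + urlencode({"artifactType": artifact_type})
    return url


def only_type(index: dict, artifact_type: str) -> list:
    """Filter a referrers response (an image index) on the client side."""
    return [
        descriptor
        for descriptor in index.get("manifests", [])
        if descriptor.get("artifactType") == artifact_type
    ]


# Hypothetical registry and repository name
url = referrers_url("http://localhost:5000", "zootr", PATCH_MANIFEST_DIGEST, NOTES_TYPE)
print(url)

# A made-up response with two referrers of different types
response = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"artifactType": NOTES_TYPE, "digest": "sha256:" + "a" * 64, "size": 1},
        {"artifactType": "application/example.other", "digest": "sha256:" + "b" * 64, "size": 1},
    ],
}
print(only_type(response, NOTES_TYPE))
```

An empty `manifests` list is a valid answer: it means the registry supports the query and simply found nothing that references the patch.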
The canonical examples for this mechanism are SBOM[13] manifests and cryptographic signatures[14], where the additional manifests aren't necessary to the operation of the artifact and may in fact be generated by third parties. These additional manifests may still be required by tooling within our infrastructure, e.g. a Kubernetes admission handler that requires a signature from a particular private key before allowing an image to run in the cluster.
[1] Open Container Initiative
[2] not that a function of content is any less arbitrary
[3] aka "Content Type" aka "MIME Type"
[4] Maybe these are the justified hierarchies Chomsky talks about
[5] including application/vnd.oci.descriptor.v1+json which describes a descriptor
[6] not a joke, this is the technical term for "eh it's just some bytes who cares"
[7] for reasons such as legacy support
[9] sha256:56887c5194fddd8db7e36ced1c16b3569d89f74c801dc8a5adbf48236fb34564 and sha256:cf3cc0848a5d6241b6218bdb51d42be7a9f9bd8c505f3abe1222b9c2ce2451ac at time of writing
[10] I'm not 100% sure what the use of this is, but it is nice that it is an option
[11] Unclear if indexes are allowed to participate in this relationship
[12] Returning a 404 means not implemented because it predates 1.1 registries
[13] Software Bill of Materials, a document that details what software is included in an artifact
[14] separate from digests: digests authenticate what the bytes are, signatures authenticate that we said it's okay for those bytes to be around