This proposal introduces a dedicated API to easily figure out a file format.
While a `Publication` is independent of any particular format, knowing the format of a publication file is necessary to:

This API is not tied to `Publication`, so it can be used as a general purpose tool to guess a file format, e.g. during HTTP requests or in the LCP library.
You can use the Media Type API every time you need to figure out the format of a file or bytes.
To use this API efficiently, you should:
You can use `MediaType.of()` directly to sniff raw bytes (e.g. the body of an HTTP response). It takes a closure that returns the bytes lazily.
```swift
let feedLink: Link
let response = httpClient.get(feedLink.href)
let mediaType = MediaType.of(
    bytes: { response.body },
    // You can give several file extension and media type hints, which will be sniffed in order.
    fileExtensions: [feedLink.href.pathExtension],
    mediaTypes: [response.headers["Content-Type"], feedLink.type]
)
```
In the case of an HTTP response, this can be simplified by using the `HTTPResponse.sniffMediaType()` extension:
```swift
let feedLink: Link
let response = httpClient.get(feedLink.href)
let mediaType = response.sniffMediaType(mediaTypes: [feedLink.type])
```
For local files, you can provide an absolute path to `MediaType.of()`. To improve sniffing speed, you should also provide a media type hint if possible – for example if you previously stored it in a database.
```swift
let dbBook = database.get(bookId)
let mediaType = MediaType.of(
    path: dbBook.path,
    mediaTypes: [dbBook.mediaType]
)
```
Reading apps are welcome to extend this API with their own media types. To declare a custom media type, you need to:

1. Create a `MediaType` constant, optionally in the `MediaType.` namespace.
2. Create a sniffer function taking a `MediaType.SnifferContext`.
3. Then either:
   - add the sniffer to the `MediaType.sniffers` shared list to be used globally,
   - or pass it through the `sniffers` argument of `MediaType.of()` on a case-by-case basis.

Here’s an example with Adobe’s ACSM media type.
```swift
// 1. Create the `MediaType` instance.
private let acsmMediaType = MediaType(
    "application/vnd.adobe.adept+xml",
    name: "Adobe Content Server Message",
    fileExtension: "acsm"
)

extension MediaType {
    static var ACSM: MediaType { acsmMediaType }
}

// 2. Create the sniffer function.
func sniffACSM(context: MediaType.SnifferContext) -> MediaType? {
    if
        context.hasMediaType("application/vnd.adobe.adept+xml") ||
        context.hasFileExtension("acsm") ||
        context.contentAsXML?.documentElement?.localName == "fulfillmentToken"
    {
        return MediaType.ACSM
    }
    return nil
}

// 3.1. Declare the sniffer globally.
MediaType.sniffers.append(sniffACSM)
let mediaType = MediaType.of(path: acsmPath)

// 3.2. Or use the sniffer on a case-by-case basis.
let mediaType = MediaType.of(path: acsmPath, sniffers: MediaType.sniffers + [sniffACSM])
```
File formats are represented by `MediaType` instances, which can be used to get the file extension, name and media type string.

However, some formats can be identified by several media type aliases. For example, the canonical type of CBZ is `application/vnd.comicbook+zip`, but it also has a historical alias, `application/x-cbz`. In this case, you should only store the canonical type in a database. You can resolve the canonical version of a known media type using `mediaType.canonicalized`.

All Readium APIs already return canonical media types, so this is useful only if you create your own `MediaType` from strings.
```swift
let fileExtension = MediaType("text/plain")?.canonicalized.fileExtension
```
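
Conceptually, canonicalization amounts to looking up a known alias and falling back on the original value. The following is a minimal sketch of that idea on plain strings, not the Readium implementation: `canonicalAliases` and `canonicalize(_:)` are hypothetical names, and only the CBZ alias comes from this document.

```swift
// Hypothetical alias table (an assumption for illustration): maps a known
// alias to its canonical media type. Only the CBZ pair comes from this document.
let canonicalAliases: [String: String] = [
    "application/x-cbz": "application/vnd.comicbook+zip"
]

/// Returns the canonical form of `mediaType`, or the lowercased input
/// when no alias is known.
func canonicalize(_ mediaType: String) -> String {
    let normalized = mediaType.lowercased()
    return canonicalAliases[normalized] ?? normalized
}

// Store only the canonical value, e.g. before saving it to a database.
let stored = canonicalize("application/x-cbz")
print(stored)
```

Keeping the alias table in one place means stored values stay comparable even when a publication was imported with a historical media type.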
Sniffers are functions with the type `MediaType.Sniffer` whose job is to resolve a `MediaType` from bytes or metadata. Each supported `MediaType` must have at least one matching sniffer to be recognized. Therefore, a reading app should provide its own sniffers to support custom publication formats.
`MediaType` class

Represents a document format, identified by a unique RFC 6838 media type.

`MediaType` handles:

Comparing media types is more complicated than it looks, since they can contain parameters such as `charset=utf-8`. We can’t ignore them because some formats use parameters in their media type, for example `application/atom+xml;profile=opds-catalog` for an OPDS 1 catalog.
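
To make the role of parameters concrete, here is a minimal sketch of a parameter-aware comparison. `SimpleMediaType` is a hypothetical stand-in written for this example, not the proposed `MediaType` class. It skips normalization details such as the uppercased `charset`, and it reads `matches` as `contains` applied in either direction, which is an assumption based on the descriptions in this proposal.

```swift
import Foundation

/// Hypothetical, simplified stand-in for `MediaType` (illustration only).
struct SimpleMediaType {
    let type: String
    let subtype: String
    let parameters: [String: String]

    init?(_ string: String) {
        // Split "type/subtype;key=value" into its components.
        let components = string
            .split(separator: ";")
            .map { $0.trimmingCharacters(in: .whitespaces) }
        guard let essence = components.first else { return nil }
        let types = essence.split(separator: "/").map { $0.lowercased() }
        guard types.count == 2 else { return nil }

        var parameters: [String: String] = [:]
        for component in components.dropFirst() {
            let pair = component.split(separator: "=", maxSplits: 1).map(String.init)
            if pair.count == 2 {
                parameters[pair[0].lowercased()] = pair[1]
            }
        }
        self.type = types[0]
        self.subtype = types[1]
        self.parameters = parameters
    }

    /// `other` is included when the types agree (`*` acts as a wildcard on
    /// this side) and every parameter of this media type is also set, with
    /// the same value, on `other`. Extra parameters on `other` are ignored.
    func contains(_ other: SimpleMediaType) -> Bool {
        guard type == "*" || type == other.type,
              subtype == "*" || subtype == other.subtype
        else { return false }
        return parameters.allSatisfy { other.parameters[$0.key] == $0.value }
    }

    /// Assumption: `matches` is `contains` in either direction.
    func matches(_ other: SimpleMediaType) -> Bool {
        contains(other) || other.contains(self)
    }
}

let atom = SimpleMediaType("application/atom+xml")!
let opds1 = SimpleMediaType("application/atom+xml;profile=opds-catalog")!
print(atom.contains(opds1))  // true: extra parameters on `other` are ignored
print(opds1.contains(atom))  // false: the `profile` parameter is required
```

The second result is why parameters cannot simply be dropped: a plain Atom feed must not be mistaken for an OPDS 1 catalog.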
- `MediaType(string: String, name: String? = null, fileExtension: String? = null)`
  - Creates a `MediaType` from its string representation and an optional name and file extension.
- `string: String` (or `toString()` if more idiomatic).
  - `charset` parameter, which is uppercased.
- `name: String?`
- `fileExtension: String?`
- `type: String`
  - For example, `application` in `application/epub+zip`.
- `subtype: String`
  - For example, `epub+zip` in `application/epub+zip`.
- `parameters: Map<String, String>`
  - For example, `charset=utf-8`.
- `structuredSyntaxSuffix: String?`
  - For example, `+zip` in `application/epub+zip`.
- `encoding: Encoding?`
  - Encoding declared in the `charset` parameter, if there’s any.
  - `Encoding` type provided by the platform, for convenience.
- `canonicalized: MediaType`
  - For example, `application/x-cbz` is an alias of the canonical `application/vnd.comicbook+zip`.
  - `MediaType.of(string) || this`
- `contains(other: MediaType) -> Boolean`, `contains(other: String) -> Boolean`
  - Returns whether the given `other` media type is included in this media type.
  - For example, `text/html` contains `text/html;charset=utf-8`.
  - `other` must match the parameters in the `parameters` property, but extra parameters are ignored.
  - `image/*` contains `image/png` and `*/*` contains everything.
- `matches(other: MediaType) -> Boolean`, `matches(other: String) -> Boolean`
  - Returns whether this media type and `other` are the same, ignoring parameters that are not in both media types.
  - For example, `text/html` matches `text/html;charset=utf-8`, but `text/html;charset=ascii` doesn’t. This is basically like `contains`, but working in both directions.
- `==` (equality)
- `of(mediaTypes: List<String> = [], fileExtensions: List<String> = [], sniffers: List<Sniffer> = MediaType.sniffers) -> MediaType?`
  - `mediaTypes: List<String> = []`
    - For example from a `Link.type`, from a `Content-Type` HTTP header or from a database.
  - `fileExtensions: List<String> = []`
  - `sniffers: List<Sniffer> = MediaType.sniffers`
    - For example, `MediaType.sniffers + [customSniffer]`.
- `ofFile(file: String, mediaTypes: List<String> = [], fileExtensions: List<String> = [], sniffers: List<Sniffer> = MediaType.sniffers) -> MediaType?`
  - `file: String`
- `ofBytes(bytes: () -> ByteArray, mediaTypes: List<String> = [], fileExtensions: List<String> = [], sniffers: List<Sniffer> = MediaType.sniffers) -> MediaType?`
  - `bytes: () -> ByteArray`
Computed properties for convenience. More can be added as needed.
- `isZIP: Boolean`
- `isJSON: Boolean`
- `isOPDS: Boolean`
  - `OPDS1`, `OPDS1Entry`, `OPDS2` or `OPDS2Publication`.
- `isHTML: Boolean`
  - `HTML` or `XHTML`.
- `isBitmap: Boolean`
  - `BMP`, `GIF`, `JPEG`, `JXL`, `PNG`, `TIFF`, `WebP` or `AVIF`.
- `isAudio: Boolean`
- `isPublication: Boolean`
`Link` Helpers

- `mediaType: MediaType`
  - `application/octet-stream` if the type can’t be determined.
  - `MediaType.of(link.type) ?? MediaType.binary`.
- `sniffers: List<Sniffer>`
  - `MediaType`.

Static constants are provided in `MediaType` for well known media types. These are `MediaType` instances, not `String`.
Constant | Media Type | Extension | Name |
---|---|---|---|
AAC | audio/aac | aac | |
ACSM | application/vnd.adobe.adept+xml | acsm | Adobe Content Server Message |
AIFF | audio/aiff | aiff | |
AVI | video/x-msvideo | avi | |
AVIF | image/avif | avif | |
Binary | application/octet-stream | | |
BMP | image/bmp | bmp | |
CBZ | application/vnd.comicbook+zip | cbz | Comic Book Archive |
CSS | text/css | css | |
DiViNa | application/divina+zip | divina | Digital Visual Narratives |
DiViNaManifest | application/divina+json | json | Digital Visual Narratives |
EPUB | application/epub+zip | epub | EPUB |
GIF | image/gif | gif | |
GZ | application/gzip | gz | |
JavaScript | text/javascript | js | |
JPEG | image/jpeg | jpeg | |
JXL | image/jxl | jxl | |
HTML | text/html | html | |
JSON | application/json | json | |
LCPProtectedAudiobook | application/audiobook+lcp | lcpa | LCP Protected Audiobook |
LCPProtectedPDF | application/pdf+lcp | lcpdf | LCP Protected PDF |
LCPLicenseDocument | application/vnd.readium.lcp.license.v1.0+json | lcpl | LCP License |
LCPStatusDocument | application/vnd.readium.license.status.v1.0+json | | |
LPF | application/lpf+zip | lpf | |
MP3 | audio/mpeg | mp3 | |
MPEG | video/mpeg | mpeg | |
NCX | application/x-dtbncx+xml | ncx | |
Ogg | audio/ogg | oga | |
Ogv | video/ogg | ogv | |
Opus | audio/opus | opus | |
OPDS1 | application/atom+xml;profile=opds-catalog | | |
OPDS1Entry | application/atom+xml;type=entry;profile=opds-catalog | | |
OPDS2 | application/opds+json | | |
OPDS2Publication | application/opds-publication+json | | |
OPDSAuthentication | application/opds-authentication+json | | |
OTF | font/otf | otf | |
PDF | application/pdf | | |
PNG | image/png | png | |
ReadiumAudiobook | application/audiobook+zip | audiobook | Readium Audiobook |
ReadiumAudiobookManifest | application/audiobook+json | json | Readium Audiobook |
ReadiumWebPub | application/webpub+zip | webpub | Readium Web Publication |
ReadiumWebPubManifest | application/webpub+json | json | Readium Web Publication |
SMIL | application/smil+xml | smil | |
SVG | image/svg+xml | svg | |
Text | text/plain | txt | |
TIFF | image/tiff | tiff | |
TTF | font/ttf | ttf | |
W3CWPUBManifest | (non-existent) application/x.readium.w3c.wpub+json | json | Web Publication |
WAV | audio/wav | wav | |
WebMAudio | audio/webm | webm | |
WebMVideo | video/webm | webm | |
WebP | image/webp | webp | |
WOFF | font/woff | woff | |
WOFF2 | font/woff2 | woff2 | |
XHTML | application/xhtml+xml | xhtml | |
XML | application/xml | xml | |
ZAB | (non-existent) application/x.readium.zab+zip | zab | Zipped Audio Book |
ZIP | application/zip | zip | |
`MediaType.Sniffer` Function Type

Determines if the provided content matches a known media type.

`MediaType.Sniffer = (context: MediaType.SnifferContext) -> MediaType?`

- `context` holds the file metadata and cached content, which are shared among the sniffers.

`MediaType.SnifferContext` Interface

A companion type of `MediaType.Sniffer` holding the type hints (file extensions, types) and providing access to the file content.

Examples of concrete implementations:

- `MediaType.FileSnifferContext` to sniff a local file.
- `MediaType.BytesSnifferContext` to sniff a bytes array.
- `MediaType.MetadataSnifferContext` to sniff only the media type and file extension hints.

- `mediaTypes: List<String>`
- `fileExtensions: List<String>`
- `encoding: Encoding?`
  - `Encoding` declared in the media types’ `charset` parameter.
  - `Encoding` type provided by the platform, for convenience.
- `contentAsString: String?`
  - Uses the `charset` parameter from the media type hints to figure out an encoding. Otherwise, falls back on UTF-8.
- `contentAsXML: XMLDocument?`
- `contentAsArchive: Archive?`
- `contentAsJSON: JSONObject?`
- `contentAsRWPM: Publication?`
- `hasFileExtension(fileExtensions: String...) -> Boolean`
  - `fileExtensions` array.
- `hasMediaType(mediaTypes: String...) -> Boolean`, `hasMediaType(mediaTypes: MediaType...) -> Boolean`
  - `mediaTypes` array, using `MediaType` to handle the comparison.
- `stream() -> Stream?`
- `read(range: Range<Int>? = null) -> ByteArray?`
  - `range`.
- `close()`
It’s useful to be able to resolve a format from an HTTP response. Therefore, implementations should provide, when possible, an extension to the native HTTP response type.

`HTTPResponse.sniffMediaType(mediaTypes: List<String> = [], fileExtensions: List<String> = [], sniffers: List<Sniffer> = MediaType.sniffers): MediaType?`

- `mediaTypes`
- `fileExtensions`
- `sniffers`

This extension will create a `MediaType.BytesSnifferContext` using the following information:

- `mediaTypes`, in order:
  - `Content-Type` HTTP header,
  - `mediaTypes`, for example to use the value of `Link.type`.
- `fileExtensions`, in order:
  - `Content-Disposition`,
- `bytes`: the response’s body

It’s important to have consistent results across platforms, so we need to use the same sniffing strategy.
Sniffing a format is done in two rounds, because we want to give an opportunity to all sniffers to return a `MediaType` quickly before inspecting the content itself:

To do that, `MediaType.of()` will iterate over all the sniffers twice, first with a `MediaType.SnifferContext` containing only extensions and media types, and the second time with a context containing the content, if available.
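
The two rounds can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: `SketchContext`, `SketchSniffer`, `sniff` and `sniffPDF` are hypothetical names, media types are plain strings, and the content is re-read on each access instead of being cached.

```swift
/// Hypothetical, simplified context: `content` is nil during the light round.
struct SketchContext {
    let mediaTypes: [String]
    let fileExtensions: [String]
    let content: (() -> [UInt8])?
}

/// Hypothetical sniffer type returning a media type string, or nil.
typealias SketchSniffer = (SketchContext) -> String?

func sniff(
    mediaTypes: [String] = [],
    fileExtensions: [String] = [],
    content: (() -> [UInt8])? = nil,
    sniffers: [SketchSniffer]
) -> String? {
    // Round 1 (light): only the hints are exposed to the sniffers.
    let light = SketchContext(mediaTypes: mediaTypes, fileExtensions: fileExtensions, content: nil)
    for sniffer in sniffers {
        if let result = sniffer(light) { return result }
    }

    // Round 2 (heavy): the content is exposed, if there is any.
    guard let content = content else { return nil }
    let heavy = SketchContext(mediaTypes: mediaTypes, fileExtensions: fileExtensions, content: content)
    for sniffer in sniffers {
        if let result = sniffer(heavy) { return result }
    }
    return nil
}

/// Example sniffer: hints first, then the `%PDF-` signature in the content.
let sniffPDF: SketchSniffer = { context in
    if context.fileExtensions.contains("pdf") || context.mediaTypes.contains("application/pdf") {
        return "application/pdf"
    }
    if let bytes = context.content?(), bytes.starts(with: Array("%PDF-".utf8)) {
        return "application/pdf"
    }
    return nil
}

// The extension hint is enough: the content closure is never called.
let fromHint = sniff(fileExtensions: ["pdf"], content: { Array("%PDF-1.7".utf8) }, sniffers: [sniffPDF])
// Without hints, the second round inspects the bytes.
let fromContent = sniff(content: { Array("%PDF-1.7".utf8) }, sniffers: [sniffPDF])
```

Because every sniffer gets a chance on the hints before any content is read, a reliable file extension or `Content-Type` avoids opening the file at all.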
Sniffers can encapsulate the detection of several media types to factor out similar detection logic. For example, the following sniffers were identified. The order of the sniffers is important, because some formats are subsets of others.
In the case of bitmap formats, the default Readium sniffers don’t perform any heavy sniffing, because we only need to detect these formats using file extensions in ZIP entries or media types in a manifest. If needed, a reading app could add additional sniffers doing heavy sniffing of bitmap files.
- Extensions: `audiobook`
- Media types: `application/audiobook+zip`
- Heavy sniffing: `manifest.json` entry, parsed as an RWPM with either:
  - `metadata.@type == http://schema.org/Audiobook`, or
  - `Link` with an audio type, checked with `MediaType::isAudio`

- Media types: `application/audiobook+json`
- Heavy sniffing:
  - `metadata.@type == http://schema.org/Audiobook`, or
  - `Link` with an audio type, checked with `MediaType::isAudio`

- Extensions: `bmp` or `dib`
- Media types: `image/bmp` or `image/x-bmp`

- Extensions: `cbz`
- Media types: `application/vnd.comicbook+zip`, `application/x-cbz` or `application/x-cbr`
- Heavy sniffing:
  - `acbf`, `gif`, `jpeg`, `jpg`, `jxl`, `png`, `tiff`, `tif`, `webp`, `avif` or `xml`
  - `.` and `Thumbs.db` are ignored

- Extensions: `divina`
- Media types: `application/divina+zip`
- Heavy sniffing: `manifest.json` entry parsed as an RWPM, with a reading order containing only bitmap images – checked using `MediaType.isBitmap` on each `Link.type`

- Media types: `application/divina+json`
- Heavy sniffing: `MediaType.isBitmap` on each `Link.type`
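- Extensions: `epub`
- Media types: `application/epub+zip`
- Heavy sniffing: `mimetype` entry containing strictly `application/epub+zip`, encoded in US-ASCII

- Extensions: `gif`
- Media types: `image/gif`

- Extensions: `htm`, `html`, `xht` or `xhtml`
- Media types: `text/html` or `application/xhtml+xml`, checked using `MediaType.isHTML`
- Heavy sniffing: `<html>` root node

- Extensions: `jpg`, `jpeg`, `jpe`, `jif`, `jfif` or `jfi`
- Media types: `image/jpeg`

- Extensions: `jxl`
- Media types: `image/jxl`

- Media types: `application/atom+xml;profile=opds-catalog`
- Heavy sniffing: `<feed>` root node with the XML namespace `http://www.w3.org/2005/Atom`

- Media types: `application/atom+xml;type=entry;profile=opds-catalog`
- Heavy sniffing: `<entry>` root node with the XML namespace `http://www.w3.org/2005/Atom`

- Media types: `application/opds+json`
- Heavy sniffing: `Link` with `self` rel and `application/opds+json` type

- Media types: `application/opds-publication+json`
- Heavy sniffing: `Link` with a rel starting with `http://opds-spec.org/acquisition`

- Media types: `application/opds-authentication+json` or `application/vnd.opds.authentication.v1.0+json`
- Heavy sniffing: `id`, `title` and `authentication`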
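- Extensions: `lcpa`
- Media types: `application/audiobook+lcp`
- Heavy sniffing:
  - `license.lcpl` entry
  - `manifest.json` entry, parsed as an RWPM with either:
    - `metadata.@type == http://schema.org/Audiobook`, or
    - `Link` with an audio type, checked with `MediaType::isAudio`

- Extensions: `lcpdf`
- Media types: `application/pdf+lcp`
- Heavy sniffing:
  - `license.lcpl` entry
  - `manifest.json` entry, parsed as an RWPM with a reading order containing only `Link` with `application/pdf` type

- Extensions: `lcpl`
- Media types: `application/vnd.readium.lcp.license.v1.0+json`
- Heavy sniffing: `id`, `issued`, `provider` and `encryption`

- Extensions: `lpf`
- Media types: `application/lpf+zip`
- Heavy sniffing:
  - `publication.json` entry, containing at least `https://www.w3.org/ns/pub-context` in the `@context` string/array property
  - `index.html` entry

- Extensions: `pdf`
- Media types: `application/pdf`
- Heavy sniffing: `%PDF-`

- Extensions: `png`
- Media types: `image/png`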
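- Extensions: `webpub`
- Media types: `application/webpub+zip`
- Heavy sniffing: `manifest.json` entry parsed as an RWPM

- Media types: `application/webpub+json`
- Heavy sniffing: `Link` with `self` rel and `application/webpub+json` type

- Heavy sniffing: `https://www.w3.org/ns/wp-context` in the `@context` string/array property

- Extensions: `tiff` or `tif`
- Media types: `image/tiff` or `image/tiff-fx`

- Extensions: `webp`
- Media types: `image/webp`

- Extensions: `avif`
- Media types: `image/avif`

- Extensions: `zab`
- Heavy sniffing:
  - `aac`, `aiff`, `alac`, `flac`, `m4a`, `m4b`, `mp3`, `ogg`, `oga`, `mogg`, `opus`, `wav` or `webm`
  - `asx`, `bio`, `m3u`, `m3u8`, `pla`, `pls`, `smil`, `vlc`, `wpl`, `xspf` or `zpl`
  - `.` and `Thumbs.db` are ignored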
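
As an illustration of heavy sniffing on JSON content, the sketch below checks for the `id`, `issued`, `provider` and `encryption` entries listed above for `application/vnd.readium.lcp.license.v1.0+json`. It assumes these are top-level keys of a JSON object; `looksLikeLCPLicense` is a hypothetical helper, not part of the proposed API.

```swift
import Foundation

/// Hypothetical helper (illustration only): returns true when `data` is a
/// JSON object containing all the keys assumed to identify an LCP License.
func looksLikeLCPLicense(_ data: Data) -> Bool {
    guard
        let object = try? JSONSerialization.jsonObject(with: data),
        let json = object as? [String: Any]
    else {
        // Not valid JSON, or not a JSON object at the top level.
        return false
    }
    let requiredKeys = ["id", "issued", "provider", "encryption"]
    return requiredKeys.allSatisfy { json[$0] != nil }
}

// Made-up sample values, only the key names matter for this check.
let sample = Data(#"{"id": "1", "issued": "2020-01-01", "provider": "p", "encryption": {}}"#.utf8)
print(looksLikeLCPLicense(sample))
```

In a real sniffer this check would only run during the second round, after the `lcpl` extension and media type hints failed to match.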