Referring to external media files from XML
Mapping from references in source XML to media locations on disk, and then generating HTML with the correct references to media on a web server is…a little tricky.
The question at hand is: how can we best manage the relationship between the XML documents and source media files (video, audio, images, etc.) in our input and the output HTML documents and their corresponding media files?
For context, suppose we have a collection of XML documents and some of those documents contain references to media files (for convenience, I’m going to focus on images, but the same questions apply to other kinds of external media). The documents, which may be in different directories, are pulled together into a single document that is processed by XSLT. The output is a collection of HTML documents. These documents, which may also be in different directories, have to contain references to the images that will resolve correctly on the web server.
I’m focused on DocBook here, and what the DocBook xslTNG Stylesheets should do, but I don’t think the problem is really specific to DocBook.
It’s important that the stylesheets be able to find the actual images during the transformation so they can determine the dimensions of each image. It’s also important that the HTML output contains URI references to the images on the web server that the browser will resolve successfully.
As a matter of practicality, we have to assume that the directory hierarchies are different. If the fully qualified path name of your first image is /path/to/my/image.png, we can’t assume that <img src="/path/to/my/image.png" /> is going to be a useful reference on the web server.
What are the logical possibilities that we need to support?
Inputs
For the input images, what are the options?
- The images are stored in the same source tree as the XML at locations that are correct relative to the references in the XML.
  If /path/to/image/file.png is the image and /path/to/xml/book/setup/chap01.xml is the source XML, then the reference in chap01.xml is ../../../image/file.png.
- Images are stored in a separate location; references in the XML are relative to that location.
  If /path/to/common/image/file.png is the image, /path/to/xml/book/setup/chap01.xml is the source XML, and the separate location is /path/to/common/, then the reference in chap01.xml is image/file.png.
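To make the two input cases concrete, here’s a minimal sketch in Python. It is not part of the stylesheets and not how they do it; the function name resolve_image and the media_root parameter are invented for illustration. It only shows how a reference in the XML could be turned into a location on disk in each case.

```python
import posixpath


def resolve_image(ref, xml_path, media_root=None):
    """Return the on-disk location of an image referenced from an XML file.

    Hypothetical helper, for illustration only.
    media_root=None models the first case: the reference is relative to
    the XML document. Otherwise the reference is relative to media_root.
    """
    if media_root is not None:
        base = media_root
    else:
        base = posixpath.dirname(xml_path)
    # posixpath.join discards base when ref is already absolute.
    return posixpath.normpath(posixpath.join(base, ref))


xml = "/path/to/xml/book/setup/chap01.xml"

# Case 1: images in the same source tree, reference relative to the XML.
same_tree = resolve_image("../../../image/file.png", xml)

# Case 2: images in a separate location, reference relative to that location.
separate = resolve_image("image/file.png", xml, media_root="/path/to/common/")

print(same_tree)  # /path/to/image/file.png
print(separate)   # /path/to/common/image/file.png
```

An absolute reference passes through unchanged, because posixpath.join discards the base when its second argument is absolute.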
There are (at least) a couple of other possibilities. The fully qualified paths could be used in the XML image references. I think that’s the same as saying the images are stored in a separate location and that location is "", so I don’t think it’s an interestingly different case.
Another logical possibility is that the images are stored in a hierarchy that’s the same as the XML, but at a different root.
In this case, if /path/to/images/image/file.png is the image, /path/to/xml/book/setup/chap01.xml is the source XML, and the alternate root is /path/to/images/, then chap01.xml contains the same relative reference as the first example above.
But that seems a lot harder to manage, and I don’t think it adds anything of value. (In other words, I don’t think anyone would actually do this.)
(Note that I’m ignoring the cases where the images and sources are in a CMS or other database. That’s a different set of problems.)
Outputs
What about outputs? There are a number of complications here, including the fact that the relative locations of the output files may be different from the relative locations of the input files. Also, generally speaking, it’s inconvenient if the images have absolute paths because then it’s difficult to change the location of the document hierarchy on the web server.
For the output images, what are the options? In the examples below, we’re considering what should happen when chap01.xml refers to file.png in the source and the output HTML is /book/ch01.html.
1. The images are stored in the same result tree as the HTML. The HTML should refer to file.png. The image and the HTML file will be in the same directory.
   (Note that it is not the responsibility of the stylesheets to copy the actual images to the HTML hierarchy.)
2. The images are stored in a location that’s relative to the HTML. The HTML should refer to img/file.png. The image will always be in an img directory relative to the location of the HTML.
3. The images are stored in a common location that’s accessed with a relative URI. The HTML should refer to ../img/file.png. The image will always be in an img directory that’s at the root of our output hierarchy, but we want to access it with a relative URI so that the whole subtree can be moved around without breaking any links.
   This is actually a fairly common scenario with systems that give you a preview URI of https://host/username/projects/example/ for a set of pages that you ultimately plan to publish to https://example.com/.
4. The images are stored in a common location that’s accessed with an absolute URI. If we accept option 3 as a requirement, I think we can ignore the case where the absolute URI would be /img/ on the same server. They’re functionally equivalent.
   This case is still important when the images are on a different server. Then the HTML should refer to an absolute location, https://mediaserver.example.com/img/file.png, because the image will always be on that server.
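The four output options can be sketched the same way. This is only an illustration in Python under stated assumptions: the function html_image_ref, its mode names, and the location parameter are all hypothetical, and paths are taken to be rooted at the top of the output hierarchy (so the page is /book/ch01.html, as in the examples above).

```python
import posixpath


def html_image_ref(filename, html_path, mode, location=""):
    """Return the reference the HTML at html_path should use for filename.

    Hypothetical helper; the mode names are invented for this sketch.
    """
    if mode == "same-dir":
        # Option 1: the image sits next to the HTML file.
        return filename
    if mode == "relative-dir":
        # Option 2: a directory relative to each HTML file.
        return posixpath.join(location, filename)
    if mode == "common-relative":
        # Option 3: one common directory, reached with a relative URI.
        target = posixpath.join(location, filename)
        return posixpath.relpath(target, start=posixpath.dirname(html_path))
    if mode == "absolute":
        # Option 4: an absolute URI, for example on another server.
        return location.rstrip("/") + "/" + filename
    raise ValueError("unknown mode: " + mode)


page = "/book/ch01.html"
refs = [
    html_image_ref("file.png", page, "same-dir"),
    html_image_ref("file.png", page, "relative-dir", "img"),
    html_image_ref("file.png", page, "common-relative", "/img"),
    html_image_ref("file.png", page, "absolute",
                   "https://mediaserver.example.com/img/"),
]
for ref in refs:
    print(ref)
# file.png
# img/file.png
# ../img/file.png
# https://mediaserver.example.com/img/file.png
```

Only the third option depends on where the HTML file sits in the output hierarchy: a page at the root, /ch01.html, would get img/file.png instead of ../img/file.png.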
At this point, I should observe that you can put the image files anywhere you like, and you can make the XML references anything you like, and you can produce any HTML you like. You’re free to write your own templates that work for your particular requirements, no matter what they are. And sometimes, that’s really the easiest thing.
My question is, what should the stylesheets support out of the box, and how?
Are there other scenarios for inputs or outputs that I’ve overlooked?
Is it reasonable and necessary that any input configuration should be able to produce any output configuration automatically?