Convert Directory of EPUB Files to OPDS
2021-08-02
Try hard to solve a problem with existing solutions before you go ahead and write another one (problem or solution you choose).
Background Story
For some time now, I have been thinking about a replacement for calibre's internal OPDS server. As anarcat's blog post illustrates, calibre offers a complete set of tools to manage your e-book collection. Among others, calibre provides an OPDS server that allows you to publish your e-book collection following web syndication standards. Depending on your collection, this process can consume a considerable amount of resources during operation. The situation becomes even more interesting if you try to utilize a small single-board computer (e.g. a Raspberry Pi) to get the job done. In one of my previous articles, I've already described how some of these limitations can be avoided to a certain degree.
In this article I want to move one step closer to a solution that better fits my needs. But what exactly are “my needs”?
Personal setup requirements:
- Personal e-book collection (fewer than 500 books)
- Self-hosted server within home network
- Low power consumption to reduce overall operating costs (24/7)
- Reading books exclusively on an e-book reader (dedicated device or application)
These requirements lead to the following minimal solution I could think of:
- running on a small single-board computer (e.g. Raspberry Pi Zero)
- providing a syndication feed of the e-book collection that an e-book reader can subscribe to
In particular, the solution I had in mind does not include:
- a web interface
- authentication support
- uploading and management of e-books
- serving the feed file itself
To provide such features, however, I envision a combination of different tools, each of them handling one part of the overall setup. Within this article, I will concentrate on finding and processing e-books in a directory as a first step.
A Game of Standards
To get started, let's first have a closer look at the input file (i.e. epub) as well as the output file that needs to be served.
Both file formats belong to a family of specifications dedicated to the distribution of digital publications - the so-called Open Publication Distribution System (OPDS).
An epub file is just a zipped file archive that includes, among others, a content.opf file that lists all the content of the epub and follows the Open Packaging Format (OPF). OPF is essentially an XML file format including a metadata block, as can be seen in the following example.
<metadata>
  <dc:identifier id="pub-identifier">9781491969557</dc:identifier>
  <dc:title id="pub-title">9781491969557 converted</dc:title>
  <dc:language id="pub-language">en</dc:language>
  <dc:publisher>O'Reilly Media, Inc.</dc:publisher>
  <dc:date>2019-04-19</dc:date>
</metadata>
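Since an epub is just a zip archive, you can peek inside with standard zip tools; for example, listing the contents reveals where the .opf file actually lives (its path varies between publishers):

unzip -l my.epub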
On the other hand, OPDS also defines so-called OPDS Catalog Feed Documents. These documents are Atom feeds and come in two flavors: Navigation Feeds and Acquisition Feeds.
Let's put the details of Navigation and Acquisition Feeds aside for a moment and concentrate on the actual format: the Atom Syndication Format (Atom for short). Atom, too, is specified as an XML file format and is well known for its use in the RSS/Atom feeds of your daily news provider. An entry of such a feed can be seen in the following snippet.
<entry>
  <title>Bob, Son of Bob</title>
  <id>urn:uuid:6409a00b-7bf2-405e-826c-3fdff0fd0734</id>
  <updated>2010-01-10T10:01:11Z</updated>
  <author>
    <name>Bob the Recursive</name>
    <uri>http://opds-spec.org/authors/1285</uri>
  </author>
  <dc:language>en</dc:language>
  <dc:issued>1917</dc:issued>
  <summary>The story of the son of the Bob and the gallant part he played in
  the lives of a man and a woman.</summary>
  <link rel="http://opds-spec.org/image"
        href="/covers/4561.lrg.png"
        type="image/png"/>
  <link rel="http://opds-spec.org/acquisition"
        href="/content/free/4561.epub"
        type="application/epub+zip"/>
</entry>
In general, transforming XML documents into other documents (which could themselves be XML) can be done using eXtensible Stylesheet Language Transformations (XSLT). Take the following example of an XSLT stylesheet for parsing a metadata block from an epub file.
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:opf="http://www.idpf.org/2007/opf"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="text"/>
  <xsl:template match="opf:package">
    Title: <xsl:value-of select="opf:metadata/dc:title"/>
    Language: <xsl:value-of select="opf:metadata/dc:language"/>
    Creator: <xsl:value-of select="opf:metadata/dc:creator"/>
    Date: <xsl:value-of select="opf:metadata/dc:date"/>
    Description: <xsl:value-of select="opf:metadata/dc:description"/>
  </xsl:template>
</xsl:stylesheet>
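Applied with an XSLT processor, such a stylesheet prints the selected fields as plain text, e.g. (more on the tooling in the next section):

xsltproc style.xsl content.opf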
As a short recap, the general core logic behind the conversion lies in transforming a metadata block from content.opf into an entry item in an atom.xml feed. Both are XML-based documents that can be transformed using eXtensible Stylesheet Language Transformations (XSLT). Thus, as a next step, let's check some tools that can be used to do so.
Tools for XSL Transformations
Before jumping into any details about XSLT: due to the fact that content.opf is essentially an XML file, known command-line tools for working with XML can also be used to work with OPF files. Take the following command line as an example, which I used a lot during development to pretty-print the content of an epub file to stdout (i.e. with content.opf already extracted from the epub).
cat content.opf | xmllint --format -
The same example, but with on-the-fly extraction from an epub file (zip archive):
unzip -p my.epub content.opf | xmllint --format -
Back to topic: the main tool that I'm using for the transformation is xsltproc. I've borrowed this idea from the bashpodder project, which uses xsltproc to process podcast feeds for downloading the related audio files.
The following examples give an overview of a workflow with xsltproc without describing the single XSL files in more detail - that would be too much for this article. An important aspect of writing XSL files, however, is the correct usage of namespaces: the opf and dc prefixes declared in the stylesheet above must match the namespaces used in content.opf, otherwise the select expressions simply match nothing. All relevant files for the following examples can be found within the epub2opds repository.
The workflow to extract information from an e-book file (i.e. its content.opf) is as follows:
unzip -q book.epub OEBPS/content.opf
xsltproc package.xsl OEBPS/content.opf
The resulting XML partial (which is not a valid XML document on its own) can be seen below. The output syntax doesn't need to be XML again (plaintext is also possible); I chose XML as output because that partial snippet should become part of an atom feed anyhow. Also, which pieces of information we extract is up to the selectors chosen within the XSL file.
<title>9781491969557 converted</title>
<id>9781491969557</id>
<updated>2019-04-19</updated>
<author>
<name>Jay McGavren</name>
</author>
<dc:language xmlns:dc="http://purl.org/dc/elements/1.1/">en</dc:language>
<summary>Go makes it easy to build software that's simple, reliable, and efficient.
And this book makes it easy for programmers like you to get started. Google designed
Go for high-performance networking and multiprocessing,
but--like Python and JavaScript--the language is easy to read and use. With this
practical hands-on guide, you'll learn how to write Go code using clear examples
that demonstrate the language in action. Best of all, you'll understand the conventions
and techniques that employers want entry-level Go developers to know.</summary>
Metadata (i.e. package data), however, is not the only information we can get from content.opf. Because it provides a complete inventory of the e-book, we can use the same approach to extract the path of a cover image, as demonstrated below:
xsltproc cover.xsl OEBPS/content.opf
# output: assets/cover.png
unzip -q book.epub assets/cover.png
The output here is good old plaintext, because later on assets/cover.png gets assigned to a variable that serves as input to unzip. As you can see, information and presentation are completely different here, whereas the overall workflow stays the same - a nice example of data-driven processing.
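For illustration, that variable hand-off could look like this (assuming content.opf was already extracted, as in the previous examples):

cover=$(xsltproc cover.xsl OEBPS/content.opf)    # e.g. assets/cover.png
unzip -q book.epub "${cover}"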
To sum it up, xsltproc is a powerful tool to process XML-based input files (i.e. content.opf). The processing is controlled by XSL files, in which both the relevant information and its presentation can be defined. At this point, xsltproc is the core of processing XML-based files. But how are these XML-based files found and extracted, and what happens with the extracted information? To answer these questions, a few other bits and pieces are required, which will come together in the next section.
Script Implementation
It should not come as a surprise that we glue the single bits and pieces together via a shell script. The main steps within the script are the following (a condensed sketch of the overall flow is given below):
- find epub files in a given directory
- opds_dir() extracts metadata and cover from content.opf
- opds_xml() generates the atom feed
I adapted some ideas on how to deal with folders, files, and generating feeds via a shell script from karl's blog.sh. The source of the script can be found within the epub2opds repository for your reference. I will not replicate the complete code here, but rather explain some highlights from it.
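For orientation only, here is a condensed, hypothetical skeleton of how these steps might hang together - this is not the actual repository code, and the paths ./library and ./cache are assumptions:

#!/bin/sh
# hypothetical skeleton, not the epub2opds source
LIB="./library"
CACHE="./cache"

# step 1: find all epub files
find "${LIB}" -type f -name "*.epub" > books.txt

# step 2: extract metadata and cover per book, index them in dir.tsv
opds_dir() {
    while IFS= read -r f; do
        o_dir="${CACHE}/$(basename "${f}" .epub)"
        mkdir -p "${o_dir}"
        unzip -p "${f}" OEBPS/content.opf | xsltproc package.xsl - > "${o_dir}/meta.xml"
        cover=$(unzip -p "${f}" OEBPS/content.opf | xsltproc cover.xsl -)
        unzip -p "${f}" "${cover}" > "${o_dir}/${cover##*/}"
        printf "%s\t%s\t%s\n" "${f}" "${o_dir}/meta.xml" "${o_dir}/${cover##*/}"
    done < books.txt > dir.tsv
}

opds_dir
# step 3: opds_xml() assembles feed.xml from dir.tsv (sketched further below)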
Finding e-books
Finding relevant e-book files is as easy as:
find "./library" -type f -name "*.epub" > books.txt
and leads to a file books.txt that merely contains the list of books:
/dir/ebook.epub
/dir2/ebook.epub
/dir3/ebook.epub
The nice thing here is that we don't need to care about the structure of the folders (i.e. flat or deep hierarchy) because we'll find them all.
Extracting
Extracting the information and the cover is a combination of unzip and xsltproc that can be outlined as follows:
unzip -p book.epub OEBPS/content.opf | xsltproc [package|cover].xsl -
In principle, that works on all of my GNU/Linux machines. I needed some adaptation for FreeBSD, because the unzip command behaves differently on that system (i.e. -p is not there) and you need to specify the qualified path of the file you want to extract. That's why I came up with the following to determine that path first:
unzip -lqq book.epub | grep '\.opf' | awk '{print $NF}'
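Putting both steps together, a portable variant could look like the following sketch (it assumes the archive contains exactly one .opf file and no spaces in its path):

opf=$(unzip -lqq book.epub | grep '\.opf' | awk '{print $NF}')
unzip -q book.epub "${opf}"      # extract the file by its qualified path
xsltproc package.xsl "${opf}"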
Also, for dealing with e-books that don't include a cover image, there is an unknown.jpg that can be used as a drop-in.
The resulting file dir.tsv (a tab-separated file) then contains lines like:
library/Head First Go.epub cache/Head First Go/meta.xml cache/Head First Go/cover.png
I'm using a printf statement here rather than an echo because of the side effects of the latter when dealing with tabs and newlines.
printf "%s\t%s\t%s\n" "${f}" "${meta}" "${o_dir}/${cover##*/}"
Generating feed.xml
Generating the atom feed is done within the opds_xml() function, which contains a static part (i.e. the header) and a dynamic part where the single entries get assembled. Due to the fact that the output of the meta information is already formatted as XML, we can simply include it within the entry itself.
<entry>
$(cat "${meta}")
...
</entry>
The function iterates over the entries in dir.tsv and fills the data into the relevant fields. Granted, generating the feed in this way is somewhat inconsistent, because you could also use XSL to get that job done.
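To make the shape of that function more concrete, here is a hedged, condensed sketch - again not the repository code, and a real feed header needs more elements (id, title, updated):

# static header, then one <entry> per line of dir.tsv
opds_xml() {
    printf '<feed xmlns="http://www.w3.org/2005/Atom">\n'
    while IFS="$(printf '\t')" read -r f meta cover; do
        printf '  <entry>\n%s\n  </entry>\n' "$(cat "${meta}")"
    done < dir.tsv
    printf '</feed>\n'
}

opds_xml > feed.xml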
Closing Words
That's it! This article started as a small collection of field notes on how to use the different tools and became more extensive than planned. Of course, to provide a complete hosting solution for your e-book library, you still need to find a way to upload books and serve the feed.xml file. A potential solution consisting of a webdav server and python's HTTP module is described within the README. Last but not least, I want to mention related projects, for example dir2opds, written in golang, and opdscreate, which was written in perl.
Currently, this approach serves me quite well on an RPi Zero, running 24/7. However, an improvement I can think of right away is to introduce an incremental update, rather than deleting the generated files and rebuilding everything from scratch.
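One possible direction, as a sketch only (it assumes the cache layout from above; note that the -nt test is a widespread but non-standard extension of test(1)):

# re-extract only when the epub is newer than its cached meta.xml
for f in library/*.epub; do
    o_dir="cache/$(basename "${f}" .epub)"
    if [ ! -e "${o_dir}/meta.xml" ] || [ "${f}" -nt "${o_dir}/meta.xml" ]; then
        echo "update needed: ${f}"
    fi
done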