Convert Directory of EPUB Files to OPDS

2021-08-02

This article describes a script for generating an OPDS catalogue from a directory of e-books.

Background Story

For some time, I have been thinking about a replacement for calibre's internal OPDS server. According to anarcat's blog post, calibre offers a complete set of tools to manage your e-book collection. Among other things, calibre provides an OPDS server that allows you to publish your e-book collection following web syndication standards. Depending on the size of your collection, this server can consume a considerable amount of resources during operation. The situation becomes even more interesting if you try to use a small single-board computer (e.g. a Raspberry Pi) to get the job done. In one of my previous articles, I've already described how some of these limitations can be avoided to a certain degree.

In this article, I want to move one step closer to a solution that better fits my needs. But what exactly are “my needs”?

Personal setup requirements:

Personal e-book collection (fewer than 500 books)
Self-hosted server within home network
Low-power consumption for reducing overall operation costs (24/7)
Reading books exclusively on an e-book reader (dedicated device or application)

These requirements lead to the following minimal solution I could think of:

running on a small single-board computer (e.g. Raspberry Pi Zero)
providing a syndication feed of the e-book collection that an e-book reader can subscribe to

Notably, the solution I had in mind does not include:

a web interface
authentication support
uploading and management of e-books
serving a feed file as such

To provide such features, however, I envision a combination of different tools, each handling one part of the overall setup. In this article, I will concentrate on finding and processing e-books in a directory as a first step.

A Game of Standards

To get started, let's have a closer look at the input file (i.e. the EPUB file) as well as the output file that needs to be served.

Both file formats belong to a family of specifications dedicated to the distribution of digital publications: the so-called Open Publication Distribution System (OPDS).

An EPUB file is just a ZIP archive that includes, among other things, a content.opf file that lists all the content of the EPUB file and follows the Open Packaging Format (OPF). OPF is essentially an XML file format that includes a metadata block, as can be seen in the following example.

<metadata >
    <dc:identifier id="pub-identifier">9781491969557</dc:identifier>
    <dc:title id="pub-title">9781491969557 converted</dc:title>
    <dc:language id="pub-language">en</dc:language>
    <dc:publisher>O'Reilly Media, Inc.</dc:publisher>
    <dc:date>2019-04-19</dc:date>
</metadata>

On the other hand, OPDS also defines so-called OPDS Catalog Feed Documents.

These documents are Atom Feeds and are either Navigation Feeds or Acquisition Feeds.

Let's set the details of Navigation and Acquisition Feeds aside for a moment and concentrate on the actual format: the Atom Syndication Format (Atom for short). Atom is also specified as an XML file format and is well known from the RSS/Atom feeds of your daily news provider. An entry of such a feed can be seen in the following snippet.

<entry>
    <title>Bob, Son of Bob</title>
    <id>urn:uuid:6409a00b-7bf2-405e-826c-3fdff0fd0734</id>
    <updated>2010-01-10T10:01:11Z</updated>
    <author>
      <name>Bob the Recursive</name>
      <uri>http://opds-spec.org/authors/1285</uri>
    </author>
    <dc:language>en</dc:language>
    <dc:issued>1917</dc:issued>
    <summary>The story of the son of the Bob and the gallant part he played in
      the lives of a man and a woman.</summary>
    <link rel="http://opds-spec.org/image"    
          href="/covers/4561.lrg.png"
          type="image/png"/>
    <link rel="http://opds-spec.org/acquisition"
          href="/content/free/4561.epub"
          type="application/epub+zip"/>
 </entry>

In general, XML documents can be transformed into other documents (which may themselves be XML) by using eXtensible Stylesheet Language Transformations (XSLT). Take the following example of an XSLT template for parsing the metadata block of an EPUB file.

<xsl:template match = "opf:package">
    Title: <xsl:value-of select="opf:metadata/dc:title"/>
    Language: <xsl:value-of select="opf:metadata/dc:language"/>
    Creator: <xsl:value-of select="opf:metadata/dc:creator"/>
    Date: <xsl:value-of select="opf:metadata/dc:date"/>
    Description: <xsl:value-of select="opf:metadata/dc:description"/>
</xsl:template>
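The template above is only a fragment. To run it, it has to sit inside a stylesheet that declares the opf and dc prefixes. The following is a minimal sketch under my own assumptions, not the package.xsl from the repository: the file name demo.xsl and the sample content.opf are made up for illustration, while the namespace URIs are the standard ones for OPF, Dublin Core and XSLT.

```shell
#!/bin/sh
# Sketch: write a tiny OPF file and a matching stylesheet to a temp dir,
# then run xsltproc if it is installed. All file names are illustrative.
workdir=$(mktemp -d)

cat > "$workdir/content.opf" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
    <dc:language>en</dc:language>
  </metadata>
</package>
EOF

cat > "$workdir/demo.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:opf="http://www.idpf.org/2007/opf"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/opf:package">
    <xsl:text>Title: </xsl:text>
    <xsl:value-of select="opf:metadata/dc:title"/>
    <xsl:text>&#10;Language: </xsl:text>
    <xsl:value-of select="opf:metadata/dc:language"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
EOF

# Only call xsltproc when it is available on this machine.
if command -v xsltproc >/dev/null 2>&1; then
    xsltproc "$workdir/demo.xsl" "$workdir/content.opf"
else
    echo "xsltproc not installed; stylesheet written to $workdir/demo.xsl" >&2
fi
```

The elements in content.opf live in a default namespace, so a match on a plain, unprefixed package would select nothing in XSLT 1.0. That is why the prefix declarations matter.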

As a short recap, the core logic behind the conversion lies in transforming a metadata block from content.opf into an entry item in an atom.xml feed. Both are XML-based documents that can be transformed by using eXtensible Stylesheet Language Transformations (XSLT). Thus, as a next step, let's check some tools that can be used to do so.

Tools for XSL Transformations

Before jumping into any details about XSLT: because content.opf is essentially an XML file, the usual command-line tools for working with XML can also be used on OPF files. Take the following command line as an example, which I used a lot during development to pretty-print the content of an OPF file to stdout (here, content.opf has already been extracted from the EPUB file).

cat content.opf | xmllint --format -

The same example, but with on-the-fly extraction from an EPUB file (ZIP archive):

unzip -p my.epub content.opf | xmllint --format -

Back to the topic: the main tool that I'm using for the transformation is xsltproc. I've borrowed this idea from the bashpodder project, which uses xsltproc to process podcast feeds for downloading the related audio files.

The following examples give an overview of a workflow with xsltproc without describing the individual XSL files in more detail; that would be too much for this article. An important aspect of writing XSL files, however, is the correct usage of namespaces. All relevant files for the following examples can be found in the epub2opds repository.

The workflow to extract information from an e-book file (i.e. content.opf) is as follows:

unzip -q book.epub OEBPS/content.opf
xsltproc package.xsl OEBPS/content.opf

The resulting XML partial (which is not a valid XML document on its own) can be seen below. The output syntax doesn't need to be XML again (plain text is also possible); I chose XML as output because the partial snippet will become part of an Atom feed anyway. Which pieces of information end up in the output depends on the selectors chosen within the XSL file.

<title>9781491969557 converted</title>
<id>9781491969557</id>
<updated>2019-04-19</updated>
<author>
<name>Jay McGavren</name>
</author>
<dc:language xmlns:dc="http://purl.org/dc/elements/1.1/">en</dc:language>
<summary>Go makes it easy to build software that's simple, reliable, and efficient. 
And this book makes it easy for programmers like you to get started. Google designed 
Go for high-performance networking and multiprocessing, 
but--like Python and JavaScript--the language is easy to read and use. With this 
practical hands-on guide, you'll learn how to write Go code using clear examples 
that demonstrate the language in action. Best of all, you'll understand the conventions 
and techniques that employers want entry-level Go developers to know.</summary>

Metadata (i.e. package data), however, is not the only information we can get from content.opf. Because it provides a complete inventory of the e-book, we can use the same approach to extract the path of the cover image, as demonstrated below:

xsltproc cover.xsl OEBPS/content.opf
# output: assets/cover.png
unzip -q book.epub assets/cover.png

The output here is good old plain text because, later on, assets/cover.png gets assigned to a variable that serves as input to unzip. As you can see, both the extracted information and its presentation are completely different, whereas the overall workflow stays the same. What a nice example of data-driven processing.

To sum up, xsltproc is a powerful tool for processing XML-based input files (e.g. content.opf). The processing is controlled by XSL files, which define both the relevant information and its presentation. At this point, xsltproc is the core of processing XML-based files. But how are these files found and extracted, and what happens with the extracted information? Answering these questions requires a few more bits and pieces, which come together in the next section.

Script Implementation

It should not come as a surprise that we glue the individual bits and pieces together with a shell script. The main steps within the script are the following:

find EPUB files in a given directory
opds_dir() extracts metadata and cover from content.opf
opds_xml() generates the Atom feed

I adapted some ideas on how to deal with folders and files and how to generate feeds via a shell script from karl's blog.sh. The source of the script can be found in the epub2opds repository for your reference. I will not replicate the complete code here, but rather explain some highlights.

Finding e-books

Finding relevant e-book files is as easy as:

find "./library" -type f -name "*.epub" > books.txt

and leads to a file books.txt that merely contains the list of books:

/dir/ebook.epub
/dir2/ebook.epub
/dir3/ebook.epub

The nice thing here is that we don't need to care about the folder structure (flat or deeply nested) because find will locate them all.
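Once the list exists, it can be processed line by line. The following is a small sketch of that pattern, not code from the repository: the function names list_books and count_books are hypothetical, and the sort step is my own addition to get a stable order.

```shell
#!/bin/sh
# Sketch: collect EPUB paths into a list file and walk through the list.
# list_books and count_books are illustrative names.

list_books() {
    # $1: library root, $2: output list file
    find "$1" -type f -name "*.epub" | sort > "$2"
}

count_books() {
    # $1: list file produced by list_books
    # IFS= and -r keep leading spaces and backslashes in file names intact.
    n=0
    while IFS= read -r book; do
        printf 'found: %s\n' "$book"
        n=$((n + 1))
    done < "$1"
    printf 'total: %d\n' "$n"
}
```

A call such as list_books ./library books.txt followed by count_books books.txt would print every book and the total. File names containing newlines would break this line-based approach, which I consider acceptable for a personal library.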

Extracting

Extracting the information and cover is a combination of unzip and xsltproc that can be outlined as follows:

unzip -p book.epub OEBPS/content.opf | xsltproc [package|cover].xsl -

In principle, that works on all of my GNU/Linux machines. I needed some adaptation for FreeBSD because the unzip command behaves differently on that system (i.e. -p is not there), so you need to specify the qualified path of the file you want to extract. That's why I came up with the following:

unzip -lqq book.epub | grep '\.opf' | awk '{print $NF}'
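One limitation of printing the last field is that it cuts off paths containing spaces. The following variant is a sketch under the assumption that the listing has the usual Info-ZIP layout of size, date, time and name; the function name opf_from_listing is hypothetical. It reads the listing from stdin so that it can be tried without an actual archive.

```shell
#!/bin/sh
# Sketch: print the first .opf path found in "unzip -lqq" style output.
# Assumes each line looks like: <size> <date> <time> <path>
opf_from_listing() {
    awk '/\.opf$/ {
        # Drop the size, date and time columns; the rest is the path,
        # which may itself contain spaces.
        sub(/^ *[0-9]+ +[^ ]+ +[^ ]+ +/, "")
        print
        exit
    }'
}
```

With a real archive this would be used as unzip -lqq book.epub | opf_from_listing, and the printed path can then be handed to unzip as the member to extract.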

To deal with e-books that don't include a cover image, there is an unknown.jpg that can be used as a drop-in replacement.

The resulting file dir.tsv (a tab-separated file) contains:
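The fallback logic could look roughly like this. It is a sketch, not the code from the repository; the function name choose_cover is hypothetical, and it assumes the cover has already been extracted to a file (or that extraction produced nothing).

```shell
#!/bin/sh
# Sketch: decide which cover image to reference for a book.
# $1: path of the extracted cover (may be empty or point to a missing file)
# $2: fallback image, e.g. unknown.jpg
choose_cover() {
    if [ -n "$1" ] && [ -s "$1" ]; then
        # The cover exists and is not empty: use it.
        printf '%s\n' "$1"
    else
        # No usable cover: fall back to the placeholder image.
        printf '%s\n' "$2"
    fi
}
```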

library/Head First Go.epub  cache/Head First Go/meta.xml    cache/Head First Go/cover.png

I'm using a printf statement here rather than echo because of the side effects of the latter when dealing with tabs and newlines.

printf "%s\t%s\t%s\n" "${f}" "${meta}" "${o_dir}/${cover##*/}"
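To illustrate the round trip, here is a sketch that writes one row with printf and reads it back with the tab character as field separator. The function names write_row and read_rows are hypothetical; the row content is the example from above. printf is the safer choice because echo treats backslash escapes and options differently from shell to shell.

```shell
#!/bin/sh
# Sketch: write and read tab-separated rows without relying on echo.
tab=$(printf '\t')

write_row() {
    # $1: epub path, $2: meta file, $3: cover file
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

read_rows() {
    # $1: TSV file; splitting on tab only keeps spaces in paths intact.
    while IFS="$tab" read -r epub meta cover; do
        printf 'epub=%s meta=%s cover=%s\n' "$epub" "$meta" "$cover"
    done < "$1"
}
```

One caveat of this approach: the shell treats tab as whitespace when splitting, so an empty field in the middle of a row would be swallowed. Every column therefore needs a value, which is one more reason for a fallback cover.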

Generating feed.xml

Generating the Atom feed is done within the opds_xml() function, which contains a static part (i.e. the header) and a dynamic part where the individual entries get assembled. Because the metadata output is already formatted as XML, we can simply include it within the entry itself.

<entry>
 $(cat "${meta}")
 ...
</entry>

The function iterates over the entries in dir.tsv and fills the data into the relevant fields. Admittedly, generating the feed this way is somewhat inconsistent, because you could also use XSL to get that job done.
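Put together, the static header and the per-entry loop might be sketched as follows. This is a simplified illustration and not the opds_xml() function from the repository: the name opds_feed, the feed id and title, and the hard-coded image type are placeholders of my own, and the result has not been checked against an OPDS validator.

```shell
#!/bin/sh
# Sketch: print a minimal Atom feed for the rows of a TSV file.
# $1: TSV file with epub path, meta partial and cover path per line
opds_feed() {
    tab=$(printf '\t')
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)

    # Static part: feed header (id and title are placeholders).
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <id>urn:example:library</id>
  <title>My Library</title>
  <updated>${now}</updated>
EOF

    # Dynamic part: one entry per row, reusing the XML partial as is.
    while IFS="$tab" read -r epub meta cover; do
        printf '  <entry>\n'
        cat "$meta"
        printf '    <link rel="http://opds-spec.org/image" href="%s" type="image/png"/>\n' "$cover"
        printf '    <link rel="http://opds-spec.org/acquisition" href="%s" type="application/epub+zip"/>\n' "$epub"
        printf '  </entry>\n'
    done < "$1"

    printf '</feed>\n'
}
```

Calling opds_feed dir.tsv > feed.xml would write the feed. Note that the sketch does not escape special characters such as spaces or ampersands in the href values, so real paths would need URL and XML escaping first.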

Closing Words

That's it! This article started as a small collection of field notes on how to use the different tools and became more extensive than planned. Of course, to provide a complete hosting solution for your e-book library, you need to find a way to upload books and serve the feed.xml file. A potential solution consisting of a WebDAV server and Python's HTTP module is described in the README. Last but not least, I want to mention related projects, for example dir2opds, written in Go, and opdscreate, written in Perl.

Currently, this approach serves me quite well on an RPi Zero and runs 24/7. However, an improvement I can think of right away is to introduce incremental updates rather than deleting the generated files and rebuilding from scratch.
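One possible starting point for such an incremental update is a check on modification times. This is only a sketch of the idea and not part of the current script; the function name needs_update is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the cached file has to be rebuilt.
# $1: EPUB file, $2: cached file generated from it (e.g. meta.xml)
needs_update() {
    if [ ! -e "$2" ]; then
        # Nothing cached yet: the book must be processed.
        return 0
    fi
    # find prints the EPUB path only if it was modified after the cache.
    [ -n "$(find "$1" -newer "$2")" ]
}
```

Inside the main loop, a book would then only be extracted again when needs_update reports success for it.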