Posted on by & filed under general.

My first post on the TeleRead blog is up: Small pieces, loosely joined. It reflects my thinking from working with epub these last few weeks and with open source publishing in general.

There are a number of projects I’ve got lined up and they’re all going to follow the famous imperative that good programmers should be lazy. The kind of laziness I discuss in the article (re-use and domain-specific languages) isn’t what Larry Wall meant, but I’ll maintain it’s a solid foundation for digital publishing.

Posted on by & filed under general.

I’ll be at the annual meeting for SSP, the Society for Scholarly Publishing, this Thursday and Friday in Boston, MA. I’d love to meet with people about ebooks, the epub standard, and digital publishing in general.

There are a number of talks I’m looking forward to, especially in the areas of deep web reference discovery, ebooks (obviously) and applying the lessons of agile software development to publishing workflows. It should be a good conference.

Drop me a line if you want to meet up.

Posted on by & filed under tools.

For one-off checks or use by non-developers, there is now a web front-end for the valuable epubcheck validation tool.

Uploaded files are run through the validator and any error messages are reported. The error report includes some notes to help decipher the sometimes-cryptic messages (notes are taken from the epubcheck wiki).

epub files are deleted immediately after validation, but take care not to upload any proprietary or sensitive documents. This tool provides no guarantees that any information it receives is secure.

The tool is running epubcheck version 0.9.5. The plan is for the front-end to track updates of the epubcheck library.
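For anyone curious how a front-end like this might handle an upload, here is a minimal sketch in Python. It is not the code behind the tool. The jar path `EPUBCHECK_JAR` and the function names are hypothetical, and the only thing it assumes about epubcheck is the usual `java -jar <jar> <file>` invocation. The point it illustrates is the delete-after-validation behavior described above.

```python
import os
import subprocess
import tempfile

# Assumption: location of a locally installed epubcheck jar (hypothetical path).
EPUBCHECK_JAR = "epubcheck-0.9.5.jar"


def run_epubcheck(epub_path, jar_path=EPUBCHECK_JAR):
    """Run epubcheck on a file and return its combined output as text."""
    result = subprocess.run(
        ["java", "-jar", jar_path, epub_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def validate_upload(data, validator=run_epubcheck):
    """Write uploaded bytes to a temp file, validate it, then delete it."""
    fd, path = tempfile.mkstemp(suffix=".epub")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return validator(path)
    finally:
        # Remove the upload whether or not validation succeeded.
        os.remove(path)
```

The validator is passed in as a parameter so the file handling can be exercised without Java installed.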

Posted on by & filed under tools.

I have split threepress into two different projects hosted on Google Code:

  1. threepress search, which is the web application that is running on
  2. epub-tools, which is a repository for standalone tools which can be used in other projects

Most developers will be interested in epub-tools.  Experimental projects will start in the search application, receive feedback from the digital publishing community, and get packaged up for distribution in epub-tools.

The tools project has only one application now: tei2epub.

The current version of tei2epub includes these recent changes:

  1. The latest version of the validation utility epubcheck (0.9.5)
  2. The NCX files now validate against the NCX DTD as well as epubcheck’s schema (thanks to Jon Noring for testing)
  3. Some corrections were made to render TOCs more attractively in Adobe Digital Editions

Posted on by & filed under content.

The last set of Gutenberg HTML books that was planned for demonstration on threepress has been added. As usual, data loading took more time and uncovered more problems than expected, which is always a reason to add as many samples as possible. This set includes one non-fiction book (On the Origin of Species) and one with verse components (The Jungle Book); both required significant updates to the XSLT that converts the Gutenberg DTD to TEI.

To expand the project in useful ways I’d like to be able to add:

  1. Other content types besides novels, especially reference
  2. Content from other document formats, such as DocBook
  3. Native, highly-tagged TEI documents

Wikipedia and its cohorts are by far the largest source of freely licensed data on the web now, but they aren’t encoded in XML. Publishers are unlikely to use wiki formatting to mark up their content, so developing a workflow to convert from wiki to TEI doesn’t seem productive.

XML data welcome!

Posted on by & filed under tools.

The most useful standalone tool in threepress right now is tei2epub, which the system uses to convert its internal source XML to the emerging e-book standard format epub.

TEI is the Text Encoding Initiative, one of the most popular markup formats for printed works (especially in academia). All of the content on threepress has been converted from the Gutenberg format to TEI upon ingestion into the site.

epub is the shorthand for the e-book format proposed by the International Digital Publishing Forum (IDPF), which uses XHTML and custom metadata formats. An e-book bundle is distributed in ZIP file format with its text and supplementary media “bound” together.
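To make the "bound together" idea concrete, here is a rough Python sketch of writing the ZIP container. It is not taken from tei2epub. The function name `write_epub` and the `OEBPS/content.opf` location are my own illustrative choices. What the container format does require is that the `mimetype` entry comes first and is stored uncompressed, and that `META-INF/container.xml` points at the package file.

```python
import zipfile

# Points readers at the package (OPF) file; the OEBPS/ path is a common
# convention, used here as an assumption.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def write_epub(path, files):
    """Bundle a mapping of archive names to content into an epub-style ZIP."""
    with zipfile.ZipFile(path, "w") as archive:
        # The mimetype entry must be first and stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for name, content in files.items():
            archive.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
```

This only produces the outer bundle. A file built this way will pass validation only if the package, NCX and XHTML files placed inside it are themselves valid.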

tei2epub is written in Python with XSLT and comes bundled with the latest version of epubcheck for validating its output. It is meant for developers rather than end-users (unlike the recent BookGlutton epub converter), and because most of the functionality is in the XSLT, it should be easy to port to other languages. Like all threepress tools, it is released under the BSD license, which means it is free for commercial and non-commercial use. You may download the ZIP version of the current release or get the latest version from svn at

Current limitations:

  1. tei2epub has not been tested on extensively marked-up TEI. It leverages the standard TEI to XHTML stylesheets distributed by TEI, but it is unknown whether epub readers will support all of the resulting markup
  2. It accepts only a single source document (i.e. an entire TEI book)
  3. It does not handle images or other kinds of media

Any of the above can be addressed with the addition of more complex TEI source books.

Edited May 22, 2008 to point resources at a new standalone repository.

Posted on by & filed under features.

Although this project is primarily aimed at tools for searching and reading textual content, software developers have increasing options to easily develop high-quality graphical applications. The program described here is written in the graphical environment Processing, but Adobe Flash or Microsoft’s Silverlight can be used for similar purposes.

I imagine applying techniques such as this to create algorithmic, generative book trailers that exploit words in the text or use imagery derived from the web.

These two examples are the same program, threewords, running the text of Pride and Prejudice. Each time it displays a word, it records the frequency of that word. As terms appear more and more often, they zoom towards the viewer. Common words such as “the” are excluded. It would be possible to collapse all forms of a word to its common stem (the Xapian search engine used by threepress has stemming capability), but this version does not stem.
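The counting logic is simple enough to sketch outside of Processing. The following is a Python approximation, not the threewords source: the stop list is a tiny illustrative stand-in, and the function name `running_frequencies` is made up for this example. It yields each displayed word with its count so far, which is the number a renderer could use to decide how far to zoom.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop list; the program's real list is not shown.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "is", "was"}


def running_frequencies(text, stop_words=STOP_WORDS):
    """Yield (word, count so far) for each non-stop word, in reading order."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in stop_words:
            continue  # common words are never displayed or counted
        counts[word] += 1
        yield word, counts[word]


for word, seen in running_frequencies("The truth is a truth universally acknowledged"):
    print(word, seen)
# truth 1
# truth 2
# universally 1
# acknowledged 1
```

Like the version described above, this does no stemming, so "truth" and "truths" would be counted separately.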

The first movie is of the initial four chapters, run at a readable speed:

The second is of the entire text, at 16X speed (2 minutes in length):

However, the application looks best when run locally. Processing exports standalone versions for Windows, Mac OS X and Linux. Source code is included in the application folders.

Look for more text-based movies in the coming months.

Posted on by & filed under content.

Two books that should’ve been in the initial release were added today: A Tale of Two Cities by Charles Dickens and The Cask of Amontillado by Edgar Allan Poe.

Tale was challenging because of the way the “books” were organized (they’re called parts in threepress).  This book exposed a bug in the way I was handling chapter ordering, which I’ve fixed.

Cask is my only content with no chapters, as it’s a short story.  I could make that more transparent to the user than the current implementation (right now content is assigned to a pseudo-chapter called “Complete story”), but whether I do that will depend on which is the outlying case: books or single-chaptered works.  Right now it’s mostly books, so that feels like the natural way to organize the site.
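As a rough illustration of the pseudo-chapter approach (a sketch only, not the site's Django code; `normalize_chapters` and the tuple layout are hypothetical), the idea is that every work ends up with at least one chapter, so the rest of the site never has to special-case short stories:

```python
PSEUDO_CHAPTER_TITLE = "Complete story"


def normalize_chapters(chapters, full_text):
    """Return a non-empty list of (title, text) pairs for a work."""
    if chapters:
        return list(chapters)
    # Works with no chapters get a single pseudo-chapter holding all the text.
    return [(PSEUDO_CHAPTER_TITLE, full_text)]
```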

Posted on by & filed under features.

I just added support for AdaptiveBlue’s AB Meta format on all book pages.  I’m only supporting type, author and title because that’s all the metadata I have in the source XML. Hopefully I can find some content from other sources which is tagged in more detail.

I chose to use the Dublin Core namespace (rather than AdaptiveBlue’s) because it’s more familiar to me and more widely used in the industry.

Ironically it was much simpler to add AB Meta to my Django source code than it was to even explain how to do it in WordPress, as I did in a post on the Tools of Change blog.

Posted on by & filed under general.

threepress is a repository for open source software designed for use by publishers.

What this means is:

  1. All of the software is free, meaning there is no cost associated with it. It also means free in the sense of unencumbered: it can be modified or re-purposed for any use, including commercial use.
  2. Most packages re-use other tools (which are themselves open source, but may have slightly different licensing restrictions). One of the goals of threepress is to maximize existing toolkits — carefully modified to suit the needs of publishers — but not to re-invent whole processes.

More pragmatically, threepress also serves as a sandbox for me to experiment with projects as part of my consulting business. Although in many cases it will be impossible to do so, I hope to convince publishers that it is in their own interest to use and release projects in an open source context. For more information on consulting, see this About page.