The ePub eBooks Metadata Mess

The subject is metadata.

Metadata is the information that is supposed to accompany each eBook so that things such as (but not limited to) arranging them by Writer or Genre or Publisher or Date Published are possible.

If you’re going on vacation, for example, and want to take along a mystery, how could you quickly find one in an eBook library comprised of several hundred editions?

Such sorting is why we have computers. They do the grunt work.

But they can only do it based on data. When the data — the metadata — isn’t there, all hell breaks loose and life is rotten for everybody.

The world runs on metadata. This financial mess we’re in? It’s all metadata — information abstracted from its original source. Metadata is derivative, abstracted data.

So if you think metadata is some little thing the world can do without, you are wrong.

Metadata is one of the little details that, ages ago, Steve Jobs would have cared about. His attention to detail in the screens of the original circa-1984 Macintosh is the stuff of legends. He would criticize down to the individual pixel. Well, Jobs, metadata is akin to those pixels you once studied.

In my previous post, about iTunes ePub eBook display options, I encountered bizarre arrangements of the eBooks. Something was plain wrong there and I wanted to find out what it was.

Some of it is not Apple’s fault, but parts of it are Apple’s fault.

This post’s purpose is to wake up every book publisher (major, minor, and single-writer), every book distributor (Smashwords, Feedbooks, et al.), and Apple itself.

Let’s begin.

This is all wrong:



When sorted by Categories, it shouldn’t look like that at all.

Again, with a count of books for each Category. Except that not all are Categories. A Christmas Carol is a Category? And why is the Gutenberg Holmes not under either Novels or Action & Adventure? And why is the 23rd Century book in its own Category and not under Science-Fiction? Murder Piping Hot, from Smashwords, should be under a Mystery category — so where did haggis come from? Also notice how many are thrown into Unknown Genre! What’s happening with all of this metadata?

So what was going on here?

I had to go look at the metadata. Fortunately, there is a free tool that enables that: Calibre.



(Note that I’ve had to redact book info not relevant to this post. I didn’t want to go through the hassle of having to re-add them to Calibre later.)

I added each one to Calibre’s Library and called up the metadata. That’s accomplished either via the menu or right-clicking on a title, as I have done here:



The metadata screen looks like this:



I’m not going to burden everyone with having to Click = Big for every snap, so I will crop them all to the part that’s important, as seen here in red:



And so we’ll examine the metadata for the first book, Murder Piping Hot, which seemed to have two anomalies. The first was its author’s name coming up as firstname-lastname, while the book next to it was lastname-firstname. The second was its appearing under the Category of “haggis.”

Is any of that metadata correct and if not, why not?

As it turns out, the Author field is correct: Ann Morven — or firstname-lastname.

But where did the “haggis” come from? From the Tags field, where it also appears with “mystery” and “robertburns.”

Some of you think you just had an AHA! moment, but save that for now.

Let’s look at the book next to it now, Skyrider:

We can see the Author field is incorrectly filled in. We never see books with lastname-firstname as the author!

Under Tags, we see there is “Action & Adventure,” which makes you begin to doubt that AHA! moment.

We will skip to The Hunting Party, which wound up under Unknown Genre. Why did it?

Right away, we can see there is nothing in the Tags field. That answers the question of Unknown Genre.

But look! There’s a glaring error in the Author Sort field! It should be lastname-firstname, not firstname-lastname. This file is from Feedbooks and was probably quickly put together for me to test earlier today, so we won’t hold this error against it right now. But this does illustrate how easy it is to make a mistake with such important data and how such a little thing can have a cascading effect that can louse up everyone’s day.
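To make the difference between the Author field and the Author Sort field concrete, here is a minimal Python sketch. The function name author_sort is hypothetical, and its rule (treat the last word as the surname) is an assumption made purely for illustration. It is not how Calibre, Feedbooks, or iTunes actually fill in the field, and it will get suffixes and multi-word surnames wrong.

```python
def author_sort(display_name: str) -> str:
    """Turn a 'firstname lastname' display name into a 'lastname, firstname' sort key.

    Naive assumption: the last whitespace-separated word is the surname.
    """
    name = display_name.strip()
    # A name that already contains a comma is assumed to be in sort form.
    if "," in name:
        return name
    parts = name.split()
    # Single-word names (or empty strings) have nothing to reorder.
    if len(parts) < 2:
        return name
    surname = parts[-1]
    given = " ".join(parts[:-1])
    return f"{surname}, {given}"


print(author_sort("Ann Morven"))  # Morven, Ann
```

The point is only that the Author field is what a reader sees and the Author Sort field is what a library sorts by. Put either value in the wrong field and the shelf comes out scrambled.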

We’ll hop to Strange Future, which came up under the bizarre Category of “23rd Century.” Let’s look at its Tags field:

The field is cut off. Instead of scrolling back and forth, it’s better to select all of it and paste it in a text program. It then reveals:

future society, timetravel, future earth, 23rd century, satire, science fiction comedy, futuristic novel, future life, time travel

And those of you who earlier had the AHA! moment just think you saw its validation. No. Wait for it.

I’m not going to address those tags. I just wanted to show what they are.

Now onto that Christmas Carol book:

And let me extract what’s in its Tags field:

romance, caden leigh, a christmas carol, scrooge, fiction

Where is your AHA! now? (I’ll get to that later.)

Let’s hop to the Sherlock Holmes book:

And extract its Tags field metadata:

Private investigators — England — Fiction, Detective and mystery stories, English, Holmes, Sherlock (Fictitious character) — Fiction

And now everyone who thought they had that AHA! moment — their heads explode!

Because up to now, we suspected that what was happening with Tags was that iTunes was doing an alphabetic sort first and then putting the first tag (or set) as the Category. But here it has bizarrely and illogically grabbed “Holmes, Sherlock (Fictitious character)” as the Category field! If it were doing an alphabetical sort, it would have grabbed “England” instead.
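The alphabetical-sort suspicion can be checked mechanically against the tags quoted in this post. What follows is a rough Python sketch, not Apple's algorithm, which I have no way of seeing. The function names are hypothetical. The tag lists are copied from the Tags fields above, with the order of the Murder Piping Hot tags assumed. How the Holmes subject headings split into individual tags is also my assumption (the dashes are written as "--"). A different split could change which tag sorts first, but not to the one iTunes displayed.

```python
def first_alphabetical(tags):
    """Return the tag that sorts first, ignoring case (the hypothesis under test)."""
    return min(tags, key=str.casefold)


def hypothesis_holds(tags, observed_category):
    """True if the first tag alphabetically equals the Category iTunes showed."""
    return first_alphabetical(tags).casefold() == observed_category.casefold()


# (tags as quoted in the post, Category that iTunes displayed)
BOOKS = {
    "Murder Piping Hot": (
        ["haggis", "mystery", "robertburns"],
        "haggis",
    ),
    "Strange Future": (
        ["future society", "timetravel", "future earth", "23rd century",
         "satire", "science fiction comedy", "futuristic novel",
         "future life", "time travel"],
        "23rd Century",
    ),
    "A Christmas Carol": (
        ["romance", "caden leigh", "a christmas carol", "scrooge", "fiction"],
        "A Christmas Carol",
    ),
    "Sherlock Holmes": (
        # Assumed split of the subject headings into four tags.
        ["Private investigators -- England -- Fiction",
         "Detective and mystery stories",
         "English",
         "Holmes, Sherlock (Fictitious character) -- Fiction"],
        "Holmes, Sherlock (Fictitious character)",
    ),
}

for title, (tags, observed) in BOOKS.items():
    verdict = "matches" if hypothesis_holds(tags, observed) else "does NOT match"
    print(f"{title}: alphabetical guess {verdict} the displayed Category")
```

Under those assumptions the guess fits the first three books and fails on the Holmes book, which is exactly where the AHA! falls apart.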

Now you can see why I say some of this is also Apple’s fault.

Something very screwy is happening there with iTunes.

And to confirm that something screwy is happening with iTunes, here’s one more book’s metadata:

See? If iTunes was simply doing an alphabetical sort and then grabbing the first word or possible set, it would have placed this under the Category of “Philosophy,” not “Philosophy, Theology.”

I don’t know what kind of algorithm Apple is using. What I suspect is that iTunes has a faulty database it matches against. If a match fails, it then does an alpha sort of Tags and grabs ... something.

But this isn’t good enough.

How many people are going to look for a Mystery book under haggis?!

Some of you will protest that the books I’ve shown you are from publishers who won’t be in the iBookstore. Well guess what? Murder Piping Hot is a Smashwords book, and it’s going to be in the iBookstore. Under the frikkin Category of haggis, apparently!

If you think none of this matters, this is why you aren’t working for Apple. How do you think their Genius system will work for books? It will be based primarily on sales and matched to book Categories. It won’t recommend a Romance to someone who primarily buys Biographies — unless the underlying metadata is screwed up. And as we have just seen, in the case of Murder Piping Hot, it will be!

Mismatched Genius recommendations hurt writers and cast doubt on the entire Genius system (which isn’t exactly held in high esteem by people who use it for music, but let’s be a bit idealistic here, OK?).

Amazon winds up getting people to spend additional money with its recommendation system. Don’t you think Apple wants to do that too?

And what I’ve shown you is only the tip of the iceberg with metadata. Let me show you the metadata possibilities that also exist. These are the metadata fields from the ePub editing program SIGIL:



Note that in the following screensnaps I have redacted information that is private in nature:





And now your jaw will drop:



And that’s not even every possibility, either! It’s just a taste!

(Thanks to Moriah Jovan for those screensnaps.)

A book’s metadata is as important as the book itself now. Because none of us are going to be strolling through physical bookstores browsing shelves. We’re going to use virtual shelves, on screens. And when we’re using those, we really don’t want to browse — the desire to browse goes down in proportion to the number of possible items. Just ask yourself what your desire is to browse the 150,000-plus apps in the App Store and you’ll see the naked truth of that!

We won’t browse: we’ll search. We’ll want to find what we want, buy it, and start reading it.

But without the metadata to help us along, buying is not going to be a smooth process. Apple will lose money — and more importantly, writers will lose money.

Apple has the resources to do the right thing: setting metadata standards, hiring metadata specialist editors. But does Apple have the will?

Well, Apple better find the will. It has Google breathing down its neck — and Google has stolen the entire history of books (see all the backlinks there). Google is going to want to make lots of money off that investment theft — and Google understands the primacy of metadata for search.

Apple: Get to it. Steve Jobs: start caring about the little things again!

For more information, read Laura J. Dawson’s post: Metadata! More Important Than Ever!

Apple should be smart and hire her as a consultant. She knows metadata, period.

Apple needs to as well.

17 responses to “The ePub eBooks Metadata Mess”

  1. Some sterling detective work there. Again, the problem is largely iTunes being a one-size-fits-all portal for media. It’s needed fixing since anything other than music was included. My understanding is that Genius works by aggregating users’ libraries and building patterns of taste. I don’t think it uses metadata in the way you’re suggesting.

    All the same, agreed. Apple has had a problem with taxonomy since the App Store went crazy. Let’s hope the impending EXPLOSION of the digital book market ushers in the necessary standards. But I think the onus is with the publishers here and Apple needs to work with them to find a consistent system for metadata.

  2. Jo Paterson

    Metadata specialist = Librarian. We have been making kicka$$ metadata for yonks! :-) Libraries have invested a lot of time and money into creating good metadata, using controlled vocabularies, metadata standards and elegant classification schemes. We can help.

  3. I fixed your typo. Yes, I’ve called for Apple to put librarians in charge of the App Store at the very least!
    http://ebooktest.wordpress.com/2009/10/03/apples-app-store-needs-librarians/

    Now that Apple is selling books, the need is even more pressing!

  4. I was pretty shocked when I saw my first examples of really bad metadata. It’s not complicated stuff, it just takes time to do.

    Now, when we start talking about internal metadata, that’s when eBooks will get really exciting.

  5. Thank you for your interesting and very informative post. I’m the author of the Strange Future book that has the bizarre category of “23rd century”. I’d like to take a bit to explain that if you don’t mind…

    First off, to add some details that might explain things: the ePub copy of the book that you grabbed was from Smashwords, not one that I generated myself. I had considered doing multiple conversions and making them available on my site, but since I wanted to make it available to the widest audience possible, I had to pick a better distribution method. My site is getting found now, but the traffic to it is nowhere near what I get from the book’s posting on Smashwords.

    Smashwords doesn’t allow the author to upload their own versions for each format. Rather, authors upload a Word document (or similar) and the rest of the formats are automatically generated. As I quickly found out, the results are mixed, but in the end it seemed to work out pretty well. Obviously, though, it still has some problems, as you pointed out.

    The reason I picked those tags for my book is that I wanted to ensure that anyone interested in a particular subject (time travel for instance) could find it from one of those searches. I was under the impression that tagging your book with many tags based on the contents, title, thematic elements, etc would make it appear more easily when someone would perform a search on Smashwords’ site. I was NOT aware that the tags would become the basis for the category filtering in iTunes.

    Obviously, my book being filed as “23rd century” instead of “Science Fiction” on a basic category-browse in the iTunes bookstore would be a horrible thing, and apparently from your experiments, which of the tags would be chosen as the book’s category is a complete crapshoot.

    So now comes the decision: Does having the book tagged correctly in the iTunes store outweigh the benefits of having it tagged that way on the Smashwords site so that someone browsing another time travel story would quickly find my story, which also involves time travel?

    With no proof that having numerous tags for Smashwords browsers is beneficial, the answer, in my mind, is obvious: remove the extraneous tags and stick with a simple one: Science Fiction. Fortunately, the books from Smashwords haven’t shipped to Apple yet, so the fix I’ve just made in my author dashboard should go into effect for the copy Apple will get. Keyword there being “should”…

    I’m hoping that maybe Smashwords can work things out so that the category the book is placed into on the site will be the category placed into the “tags” field instead. I might send them an email about this myself, actually.

    At any rate, I’ve rambled FARRR too long. Thanks again for your post. Appreciate it.

  6. Pingback: Most Tweeted Articles by Books Experts

  7. Yes, I did a Smashwords experiment and know all about their gruesome Meatgrinder process. Back then, when Smashwords began, there was no hint of books on it winding up at B&N, Sony, and soon Apple. So the tags that worked for Smashwords, worked for Smashwords. But now that doesn’t work in the outside world Smashwords is sending books to.

  8. Pingback: iPad Preview: No Match for the Kindle, for Now : EBook2.0 Forum

  9. Pingback: The iPad: Obligatory Post on Impressions, Reading, and Wrist Strength | Booksquare

  10. Pingback: iPad link roundup | socialibrarian

  11. iPad-eBook-Download

    free ipad ebook downloads [URL redacted]

  12. I don’t know who you are and that site smells of illegality, so your URL — in both the Comment and your name — has been redacted. Do not try again or I will hit you with the Spam button and you’ll never get out of that shit for months.

  13. Well, I’ve cleaned up my metadata, got everything the way I want it, in Calibre at least. Unfortunately iTunes ignores the metadata.opf files and uses some of its own. Where it gets that from is anybody’s guess. Oh, BTW I’ve actually watched iTunes rename some of my music mp3 files, to match the one above it in the library. It’s a genuinely garbage piece of software, and this is the later version we’re talking about here, 10.6 on OSX Snow Leopard.

  14. Yeah, count on iTunes to do that. Let’s hope the upcoming revamp of the store and software will end that practice — or at least make it sensible.

  15. This METADATA editor MAY be “pro standard”, but I won’t know until I pay the 195 EURO asking price :-( Anyone tested BLUE GRIFFON ? Is it worth the money ?
    http://www.bluegriffon-epubedition.com/BGEE.html#contactus

    Would have used SIGIL, but doesn’t work on my “old” MAC 10.6.8
    I don’t need anything as fancy as Blue Griffon, so maybe need to upgrade my MAC ?

  16. Just to clarify one or two things…

    Simply explained, the reading apps grab their metadata from a file inside the ePub; the OPF file. If this file is not correctly filled, the app will display that incorrect metadata. So it’s really the fault of the people that made the ePub. For example, for the book where the ePub says “Philosophy, Theology.” a reading app will put it in the following category “Philosophy, Theology.” As simple as that. If the publisher inputs “hi, test” then that will be the category.

    The other thing you need to be aware of is that publishers, when sending a book to a retailer like Apple, Google or Amazon, also send a metadata file alongside the ePub file, usually in an XML format called ONIX. This ONIX file will contain the categories of the book (BISAC system for the USA, BIC for Spain, CLIL for France, etc.). So usually the web stores (iBooks Store, Google Play Books, Amazon website) use these categories to sort the books online, and not the information inside the OPF.

    I do agree 100% with you: publishers are still not as aware as they should be about the importance of metadata. I think Apple, Google and Amazon are, but they have to work with whatever the publishers send.
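
The OPF lookup described in comment 16 can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not the code any reading app actually runs. The function names read_opf_metadata and make_demo_epub are hypothetical, and the demo file is a stripped-down stand-in built in memory (title, author, and tags taken from the Murder Piping Hot example above, tag order assumed), not a complete, valid ePub. What Calibre shows as Tags is stored in the OPF as dc:subject elements.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}

CONTAINER_XML = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

# Demo metadata only; a real OPF also needs a manifest and a spine.
OPF_XML = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Murder Piping Hot</dc:title>
    <dc:creator>Ann Morven</dc:creator>
    <dc:subject>haggis</dc:subject>
    <dc:subject>mystery</dc:subject>
    <dc:subject>robertburns</dc:subject>
  </metadata>
</package>"""


def make_demo_epub():
    """Build a minimal ePub-like zip in memory so the sketch runs anywhere."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/epub+zip")
        z.writestr("META-INF/container.xml", CONTAINER_XML)
        z.writestr("content.opf", OPF_XML)
    buf.seek(0)
    return buf


def read_opf_metadata(epub_file):
    """Return titles, creators, and subjects from the OPF inside an ePub."""
    with zipfile.ZipFile(epub_file) as z:
        # container.xml says where the OPF file lives inside the zip.
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf = ET.fromstring(z.read(opf_path))

    def texts(tag):
        return [e.text.strip() for e in opf.findall(f".//dc:{tag}", NS) if e.text]

    return {
        "title": texts("title"),
        "creator": texts("creator"),
        "subject": texts("subject"),
    }


metadata = read_opf_metadata(make_demo_epub())
print(metadata["subject"])
```

To try it on a real book, pass a path to an .epub file instead of the in-memory demo. Whatever the file's maker typed into dc:subject is what comes back, "haggis" included.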