BookNet Canada

Tom Richardson
November 10, 2009
BiblioShare, ONIX, Standards & Metadata

Data Exchange Tip #4: Escaping Entities—Påvøl Breaches Checkpoint Charlie

Most publishers have more than 100 books on their list, and a few of them will be by non-European authors or reference some atypical symbol. And wasn’t the production manager proud when the cover copy spelled it right?

ONIX is a bibliographic data exchange standard, and it behooves us who toil in publishing to spell the author’s name, the book title or the review journal correctly. Non-Latin scripts will (probably) allow use of some transliteration system, so it’s unlikely you’ll need to use Chinese, Arabic or Cyrillic scripts, and I have arbitrarily decreed them to be beyond my scope (or knowledge). But “similar” alphabets like Hungarian or Romanized Slavic are not fully covered by iso-8859-1. Even for the Western European languages it does cover (French, Spanish and German) there are missing letters, such as the French œ. Common symbols such as the trademark sign (™) are missing too, and even the copyright sign (©), which iso-8859-1 does include, is outside plain ASCII. At some point you’ll need to put something in an ONIX file that’s outside of the common encoding schemes, and for that you’ll use an escaped entity.

I’m going to make one of my daring generalizations here to help you recognize an entity: It starts with an ampersand, “&”, has simple keyboard characters in between and ends with a semicolon, “;”. It’s recognized by XML software and rendered as characters by browsers. Here’s a link to one of my favorite sites with lists of entities:

http://htmlhelp.com/reference/html40/entities/

For example, an e with an acute accent (é) can be “escaped” as &eacute; or &#233; or &#xE9;. Further, an entity is special in XML because its ampersand should not itself be escaped. You should never see a “double escaped” entity like &amp;eacute; in an ONIX file.
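A minimal Python sketch of the point above, using only the standard library. It decodes the three escaped forms of é and flags anything that looks double escaped. The function name `find_double_escaped` and the sample string are illustrative, not part of any ONIX tooling, and the pattern is a heuristic: text that deliberately talks about markup could contain `&amp;lt;` legitimately.

```python
import html
import re

# The three ways of escaping "é": HTML named, decimal, and hex.
FORMS = ["&eacute;", "&#233;", "&#xE9;"]
decoded = [html.unescape(form) for form in FORMS]  # each one decodes to "é"

# An escaped ampersand followed by what looks like the rest of an entity.
DOUBLE_ESCAPED = re.compile(r"&amp;(#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def find_double_escaped(text):
    """Return every substring that looks like a double-escaped entity."""
    return [match.group(0) for match in DOUBLE_ESCAPED.finditer(text)]

sample = "Caf&amp;eacute; &amp; Bar"  # hypothetical title text
suspects = find_double_escaped(sample)
```

A plain `&amp;` followed by a space, as in "&amp; Bar", is a correctly escaped ampersand and is not flagged.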

A file limited to plain ASCII has everything that can’t be typed as a simple keyboard character escaped, while iso-8859-1 can hold characters like é, ç, à, è, ô, ö, û, ñ, etc. directly but not characters like ů, ũ, or š, etc. It also has no “smart” quotes or em and en dashes, although there are escaped versions of these. A file that is genuinely saved as utf-8 can hold all of these characters directly, but only if the bytes really match the declared encoding, and many producers play it safe and escape anyway. Where is the dividing line? It’s wherever the software complains, and just as in the previous post on file cleaning, the solution is simple substitution by find and replace. What? You haven’t kept track of how your characters are stored in your source file? Oh dear… understanding what’s in your source file is the first step. But basically: the XML software complains and you fix a problem, just like the encoding problems.
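The find-and-replace step can also be automated. As a rough sketch, assuming the text is already a correctly decoded Python string, the standard `xmlcharrefreplace` error handler leaves alone whatever the target encoding can hold and turns everything else into decimal entities. The function name `escape_for_encoding` is made up for this example.

```python
def escape_for_encoding(text, encoding="iso-8859-1"):
    """Replace characters the target encoding cannot hold with decimal entities."""
    # Encoding with xmlcharrefreplace substitutes &#NNN; for each character
    # that does not fit; decoding again gives back a normal string.
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)

name = "Hélènne Ővēn"
latin1_safe = escape_for_encoding(name)           # é and è stay as they are
ascii_safe = escape_for_encoding(name, "ascii")   # every accented letter is escaped
dash_safe = escape_for_encoding("1914\u20131918") # the en dash is not in iso-8859-1
```

This only helps once you know how the source file is stored. If the bytes were decoded with the wrong encoding in the first place, the escaped output will be consistently wrong too.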

There are considerations. First, while basic validation in most XML software accepts all entity types, XML schema languages are stricter and don’t accept the ‘html’ type like &eacute;. (The likely reason: apart from the five built into XML, named entities have to be declared in a DTD, and a file validated against a schema usually has no DTD to declare them.) So using html-style entities will cause you a problem with BookNet Canada’s BiblioShare and ONIX 3.0. Schema validation is the future, so the prudent administrator should avoid html entities. The ‘decimal’ style &#233; is the most common one supported by schema languages, and the one I recommend. I seldom see files using the ‘hex’ escaped entities like &#xE9;, so I suggest not using them, but I can’t defend that prejudice. Ideally you should use one system, consistently, in your file; then if you need to change it at some later point it won’t be hard.
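If a file already contains html-style entities, switching to the decimal style is a mechanical substitution. Here is one possible sketch using Python's standard `html.entities` table. The function name `named_to_decimal` is an assumption for this example, and the five entities predefined by XML are deliberately left untouched.

```python
import re
from html.entities import name2codepoint

# These five are built into XML itself and are safe under schema validation.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_decimal(text):
    """Rewrite HTML named entities such as &eacute; as decimal ones such as &#233;."""
    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own entities and unknown names alone
        return f"&#{name2codepoint[name]};"
    return NAMED_ENTITY.sub(swap, text)

converted = named_to_decimal("H&eacute;l&egrave;nne &amp; Co. &copy; &trade;")
```

Unknown names are passed through unchanged so that a typo in the source shows up at validation time instead of being silently altered.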

A second consideration is the on-line companies that get your data. In the descriptions and biographies you can certainly spell things correctly, as the entities will appear correctly in browsers, but what about “searchable” fields like Contributor and Title? What happens to your author Hélènne Ővēn if you submit her name ‘correctly’ (according to whom? me?) in encoding=’iso-8859-1’, escaped as “Hélènne &#336;v&#275;n”, or in utf-8 as “Hélènne Ővēn”? It may render properly in a browser, but will it affect how easily you can search for her name in Amazon or Indigo? Can a consumer search the obvious keyboard bastardization of “Oven” and find the book? It’s a problem, that’s about all I can tell you. The on-lines are way better about this than they were a few years ago, when anything outside of simple keyboard characters wasn’t acceptable in a searchable field, but there are no guidelines here. I’d say that “Hélènne”, with its pretty normal “special characters” within iso-8859-1, wouldn’t be much of a stretch, but “Ővēn” is likely to cause problems. It might matter if you submitted “H&eacute;l&egrave;nne”, “H&#233;l&#232;nne” or “H&#xE9;l&#xE8;nne”. They are all different, clearly, and programmers at every aggregator or on-line would have to set up to process all of these to index. Did they? Will your own website? Oh dear!
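To make the indexing problem concrete, here is a rough sketch of the kind of folding a search index might apply. It is not a description of what Amazon, Indigo or any aggregator actually does, and `search_key` is a hypothetical helper. It decodes entities first, then strips accents by Unicode decomposition so that all the submitted forms land on the same key.

```python
import html
import unicodedata

def search_key(name):
    """Fold a name to a plain lowercase key: decode entities, drop accents."""
    text = html.unescape(name)                       # &#233; and friends become real characters
    decomposed = unicodedata.normalize("NFKD", text) # "é" becomes "e" plus a combining accent
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

variants = ["H&eacute;l&egrave;nne", "H&#233;l&#232;nne", "H&#xE9;l&#xE8;nne", "Hélènne"]
keys = {search_key(v) for v in variants}             # all four collapse to one key
surname = search_key("&#336;v&#275;n")
```

Decomposition does not catch everything. A letter such as ø has no accent to strip and comes through unchanged, so a consumer typing "o" still would not match it without an extra substitution table.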

This is, sort of, what it means when BISG says iso-8859-1 is the recommended encoding for the US supply chain: Aggregators should accommodate at least the special characters in it. And maybe they do more, maybe they do less, but it’s reasonable to think they’ll do that much. And when I say Canada hasn’t made a recommendation it means, well, we haven’t gone that far.

If you really have a lot of special characters that are critical and you don’t yet know what you and your trading partners are doing, well, that’s beyond the scope of this blog. I’m trying here for practical help to largely English-language ONIX producers. But mostly I want to say: Take advantage of ONIX!! You can, and should, update your records. So spell the name right, submit your data as early as you can and then check the on-line records. If it doesn’t look right or the searches fail, then ask them about it or judiciously misspell the name to compensate and re-submit your data. You should have 6 months before publication to work it out. Maybe Amazon and Indigo will be OK but it’ll be wrong on Barnes & Noble. Maybe it’s only Walmart who can’t get it right. And maybe Walmart is the only one that matters to you. It’s your call, but the author will probably understand why you made your choice. Try again in 2 years and the answers will have changed.

And that advice should make anyone who cares even a little about the accuracy of their records cringe.

Tagged: xml, data exchange tips
