BookNet Canada Blog

Archive for the ‘Data Exchange Tips’ Category

Data Exchange Tips #8: Up(date) Your ONIX, or: What code should the White Rabbit use when late for Tea, and will his invitation change in response?

Wednesday, February 10th, 2010 by tom richardson

If your company uses EDI you probably understand the need to maintain consistent and accurate ‘transactional’ records. But EDI is a limited number of fields and the company comptroller probably has it all under his or her thumb and allocates resources to its maintenance because it makes money (or more importantly loses money when done badly). EDI doesn’t really require “updates” because of the question/response nature of the transaction flow. It’s all pretty simple unless you’re responsible for it, because an error means something is shipped or not, paid for or not. You and your trading partner notice quickly.

If only it were so for ONIX files! Just like EDI, an ONIX file is an electronic resource used to support business-to-business communication. While it’s not used to communicate transactions, retailers use the information in it to sell and expect the information in the ONIX file to be correct and updated. The price on their on-line retail site is probably sourced from your ONIX file, as well as the publication date and availability status (not to mention the author and title!), so if you cancel a title and don’t update or simply stop sending the title’s ONIX record, the retailer may be trying to sell a product that will never be available.

Consumers seem to think that the on-line record should match the book they order, shipping departments think that the book weight should too, and the buyer thinks the carton quantity should be accurate.

Well, duh! Like you didn’t know that? And you probably know why it’s not quite right, too. I mean I know some links on our website are down — I really need to fix them and that’s not happening either!

There aren’t really clear guidelines on what constitutes a proper update routine, and the answer changes radically between a poetry publisher with a slow-growth list and nothing OPed in 20 years and a multinational whose books can come in and out of print in a season, but here are some guidelines:

Retailers rely on book metadata be it for the initial buy (6 months before publication), transactions (active titles), and customer relations (accurate titles and descriptions). Each on-line page describing a book is like a little contract with the consumer – maybe not written quite in stone but it should be treated that way.

ONIX suppliers, by supplying book metadata, should make a commitment to accuracy, and be willing and able to update it if it changes. This is different from including enhanced data – that’s a separate matter and just as necessary to maximize sales. I mean the basics: title, author, subject, imprint, publisher, status, pub date, supplier, availability and expected ship date should all be accurate and maintained, fully utilizing the relevant ONIX composites so that name and title are parsed out, the publisher and supplier names are consistent, etc. If it’s wrong it should be fixed and re-sent.

Frequency should be as often as it’s needed – for big companies that might be weekly “deltas” (changes only) with monthly full files; medium companies can probably do monthly files and small companies quarterly. But everyone should make an effort to realistically maintain and update their ONIX records and resend them regularly. A full file of all active records should be submitted to your supply chain trading partners at least once a year – and more often probably makes sense too. It’s not enough to issue a record once and expect 5 years later that retailers know it’s still active. It’s not enough to never clean your file of the books you no longer support either.
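To make the difference between a “delta” and a full file concrete, here is a minimal Python sketch. The record layout (an ISBN, a last-changed date and an active flag) and the function names are assumptions invented for illustration; your own title database will look different.

```python
from datetime import date

# Hypothetical in-house records: ISBN, date of last change, and whether
# the title is still actively supported.
records = [
    {"isbn": "9780000000001", "changed": date(2010, 2, 1), "active": True},
    {"isbn": "9780000000002", "changed": date(2009, 6, 15), "active": True},
    {"isbn": "9780000000003", "changed": date(2010, 1, 20), "active": False},
]

def delta_file(records, last_sent):
    """ISBNs of records changed since the last send, active or not."""
    return [r["isbn"] for r in records if r["changed"] > last_sent]

def full_file(records):
    """ISBNs of every record you still actively support."""
    return [r["isbn"] for r in records if r["active"]]
```

The delta deliberately includes the record that just went inactive, so trading partners hear about the status change once before it drops out of the full file.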

And yes, when a book is out-of-stock and reprinting you really should tell retailers when it’s due to be available again. If you really don’t know then fake it: maybe give a date 3 months away and keep updating it – it’s better than saying nothing until the reprint is ordered and restocking is 2 weeks away.

When books go inactive – Out-of-Print or No-Longer-Carried, what-have-you – you should also maintain records and release them appropriately (aren’t I cunningly vague about that). There’s no need, once the supply chain knows their status, to continue to send them, but give everyone a chance to update their records before you drop them. And then maintain a file so that it’s available on request. Whoa! A lot of this information would be useful internally too!

It’s not easy to do and it takes thought to set up the internal communications – but it’s not necessarily that hard either. It’s a breeze in comparison to what agency pricing is going to be like. And how were you planning to communicate that? (New codes in ONIX should be in place by March by-the-way.) It’s only going to get faster.

The more astute of you might be thinking, I bet this harangue about the need to do the obvious well is a lead up to BNC BiblioShare. And you’re right!! There’s a webinar “Introduction to BiblioShare” on the 24th of this month… 2 to 3:
http://biblioshareintro.eventbrite.com/
and it’ll be available after the fact too.

Data Exchange Tips #7: Nic Boshart on Mac based XML solutions: Using oXygen

Friday, December 4th, 2009 by tom richardson

XML on Mac is a rare bird, or if not rare, seriously undercooked. There aren’t as many options as on a PC, and certainly not an ONIX-based editor such as ONIXEdit. There are a couple of free XML editors available; however, Smultron is no longer being updated and Serna Free XML Editor is not available for commercial use, instead deferring to Serna Enterprise for businesses. Still, Serna Free is a good tool for getting used to XML.

OXygen is a WYSIWYG XML editor with a lot of powerful tools to do complex XML development, but using it to validate ONIX files is simple. It’s Java based, so it’s able to run on Windows, Linux or Mac platforms and it edits files up to 70 MB. You can use the large file viewer in the Tools menu to look at larger files, but you won’t be able to edit them due to the constraints of Java.

Caveats: there aren’t many. OXygen will do a basic or “DTD” validation on an ONIX file with the standard declaration. To do a strict or “schema” validation, you’ll need to follow the same procedure detailed in the post about XML Notepad: the normal ONIX declaration needs to be replaced with declaration information set up for pulling schema information from your files.

But that’s quite easy. Follow the exact same steps as you would setting up XML Notepad. Download the schema, name it well, and replace the declaration with the same as in the “Create a Schema Specific File” portion of the BNC blog post Data Exchange Tips #6: A DIY Guide to Schema Validation on a PC: XML Notepad 2007. Now this part is a bit easier in oXygen, as you do not have to replace the last line of the declaration with the local address of your XML schema; the program will do that for you.

Setting Up to Use Schemas Using OXygen

Once you’ve downloaded the XML schema, open up your file in oXygen and replace the declaration. The hard part is over (well, depending on the quality of the metadata, anyway).

  • Top Menu Bar: Click on Document
  • From the dropdown: Choose XML Document
  • From the second dropdown: Choose Associate Schema

You should have opened a dialog box with several tabs at the top. XML Schema should be the first tab, already selected. The empty bar below is labeled URL — don’t be fooled, you want to open a local file. Click the folder to the right and find your schema accordingly.

Now you’re ready to validate. The declaration should have been changed accordingly. If not, it will tell you.

OXygen is a good system. It offers a lot of useful tools, including track changes to keep a record of who did what to which file. It’s also useful as an ePub editor as you can open the full file without extracting it. The blog Instant InDesign has a good article on this:
http://instantindesign.com/index.php?view=412

Nic Boshart is Research and Communications Coordinator at the ACP and one of the organizers of next week’s The Canadian Publisher’s Digital Workshop on December 9 – 10, 2009.


Data Exchange Tips #6: A DIY Guide to Schema Validation on a PC: XML Notepad 2007

Thursday, November 26th, 2009 by tom richardson

For the purist, those who want their XML validation without the added benefits of what some programmer thinks would improve their ONIX file, there is a lovely generic XML software product called XML Notepad 2007. It’s free and written by a Microsoft programmer, Chris Lovett, so the freeware is from a safe source. It’s easy to set up for a schema validation and robust with files as large as 20,000 records. The only problem is that you’ll need to use a file with its XML declaration information set up for a schema validation rather than using the normal ONIX declaration. That just means replacing the first few lines of the ONIX file with a different script: a simple cut and paste that only takes a few seconds. Just be sure to use the correct ONIX declaration on the file you send to trading partners.

The software requires that you have .NET Framework v2.0 or above installed (you’ll likely have it already on your computer, but it’s another Microsoft product) and you can download XML Notepad here:
http://www.codeplex.com/xmlnotepad
Just follow the links to the installer.

Getting the ONIX Schema

You’ll need the ONIX Schema on your computer, which is available from www.editeur.org:

  • Navigate through Standards to ONIX for Books, Previous releases (not ONIX 3.0)
  • Scroll down to “Download Release 2.1 XML Schema” and click on it.
  • Click on the “Release 2.1 (revision XX) XML Schema” and save the zip to your hard drive.

Unpacking the zip will give you a directory with 7 files in it, 6 xsd ’schema’ files and a ‘read me’. You’ll need to put a location reference to these files into your schema, so make it easy on yourself and store this on your computer in an easily named location — avoid spaces in your directory names (spaces can confuse the XML software’s ability to find the file). As an example, if you were to create a directory in your top level C drive named XML with a subdirectory XSD and put the contents of the zip there, then naming the local file reference would not only be easy but have a long tradition behind it:
C:\XML\XSD\
However you choose to name the location put the 7 files into that location.

The schema file ONIX_BookProduct_CodeLists.xsd includes the ONIX codes, so every time the code list is updated, this file needs to be updated as well. New code lists are announced and listed by BookNet, but if you add a new code to your ONIX and your file fails your validation process, an outdated XSD file might be the problem.

Create a Schema Specific File

The last hurdle is creating a schema specific file with your ONIX in it. The ONIX file you send to your trading partners has to have the “declaration” — the first lines before the Header tags — as defined by the ONIX for Books XML Message Specification. In order to schema validate using XML Notepad (and other XML software) you’ll need to replace that declaration with another one.

While you can modify your ONIX file with a new declaration, what I find easier is to create a file just for schema validation and then to copy the ONIX data section into it. Create a file, say “schema.xml,” with either the Reference or Short tag declaration as below. Then copy and paste the ONIX file in, using everything from the tag <Header> (Reference) or <header> (Short) to the bottom of the ONIX file (including the closing </ONIXMessage> or </ONIXmessage> tag).

Or alternatively, you can copy and paste a declaration from below into your ONIX file. You just have to be sure to replace it with the correct ONIX Message declaration before you send the file to your trading partners.

But one way or another, everything from the first line <?xml version… through to <ONIXMessage> (for reference tag files) or <ONIXmessage> (for short tag files), such as in this example:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXMessage SYSTEM "http://www.editeur.org/onix/2.1/02/reference/onix-international.dtd">
<ONIXMessage>

needs to be replaced with the following:

For ONIX files using Reference Tags:

<?xml version="1.0" encoding="utf-8"?>
<ONIXMessage refname="ONIXMessage" shortname="ONIXmessage" release="2.1"
xmlns="http://www.editeur.org/onix/2.1/reference"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.editeur.org/onix/2.1/reference
C:\XML\XSD\ONIX_BookProduct_Release2.1_reference.xsd">

For ONIX files using Short Tags:

<?xml version="1.0" encoding="utf-8"?>
<ONIXmessage refname="ONIXMessage" shortname="ONIXmessage" release="2.1"
xmlns="http://www.editeur.org/onix/2.1/short"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.editeur.org/onix/2.1/short
C:\XML\XSD\ONIX_BookProduct_Release2.1_short.xsd">

Note that the last line in the declaration contains the “local address” on your computer and one of the Schema files you downloaded from Editeur.org. So unless you’ve used the suggested directory naming proposed above you’ll need to modify the above script to use the local computer address you used. Note also that if there are spaces in the directory name (and possibly long directory names) that the address may not be read properly by the XML software. This is one of the most likely things to go wrong — keeping it simple and direct makes your life easier.

That’s it. Save the file (always save the file as the software works with the last saved version and won’t include unsaved changes) with your data and the new declarations.
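If you’d rather not cut and paste every time, the swap can be scripted. What follows is only a sketch in Python, not part of any ONIX toolkit: the function name and the SCHEMA_DIR constant are my own inventions, the directory is the C:\XML\XSD\ example used in this post, and it carries over whatever encoding your file already declares instead of forcing utf-8.

```python
import re

# Assumed local folder holding the XSD files; change it to match your own.
SCHEMA_DIR = "C:\\XML\\XSD\\"

def to_schema_declaration(onix_text, schema_dir=SCHEMA_DIR):
    """Swap the DTD-style opening lines for the schema-style ones."""
    # Find the opening message tag; the DOCTYPE line starts with "<!" so it
    # cannot match, and neither can the closing "</ONIXMessage>".
    match = re.search(r"<(ONIXMessage|ONIXmessage)\b[^>]*>", onix_text)
    if match is None:
        raise ValueError("no <ONIXMessage> or <ONIXmessage> tag found")
    root = match.group(1)
    style = "reference" if root == "ONIXMessage" else "short"

    # Keep the encoding the file already declares (utf-8 if none is given).
    declared = re.search(r"encoding=[\"']([^\"']+)[\"']", onix_text[:match.start()])
    encoding = declared.group(1) if declared else "utf-8"

    namespace = "http://www.editeur.org/onix/2.1/" + style
    xsd = schema_dir + "ONIX_BookProduct_Release2.1_" + style + ".xsd"
    opening = (
        '<?xml version="1.0" encoding="' + encoding + '"?>\n'
        "<" + root + ' refname="ONIXMessage" shortname="ONIXmessage" release="2.1"\n'
        'xmlns="' + namespace + '"\n'
        'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n'
        'xsi:schemaLocation="' + namespace + "\n"
        + xsd + '">'
    )
    # Everything after the original opening tag is copied through untouched.
    return opening + onix_text[match.end():]
```

Run it on the text of a copy of your ONIX file and save the result under a new name such as schema.xml; the file you send to trading partners keeps its normal declaration.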

Setting up to use Schemas using XML Notepad

Start up XML Notepad 2007. You’ll need to let the software know where to find the schema by creating an internal link to the XSD schema files from Editeur. So:

  • Top menu bar: Click on View
  • From the dropdown: Click on Schema
  • From the dialog box: Click on the box with the ellipsis (three dots) at the far right

This opens a standard Windows file finding user interface — use it to navigate to where the schemas are stored and select:

For Reference tags: ONIX_BookProduct_Release2.1_reference.xsd

For Short tags: ONIX_BookProduct_Release2.1_short.xsd

(You can do this step twice and put both reference and short links here. The program will work fine if you do, but it’s only necessary if you use both short and reference tag files.)

Note: Whenever you first start XML Notepad you’ll need to let it know where the local schemas are — it’s just a matter of opening the Schema dialog and clicking “OK.”

Your First Validation using XML Notepad

Using XML Notepad go to its top menu bar: Click on File / Open and from the dropdown choose the file you created using the schema declaration and your ONIX data in it and open it.

Assuming it opens (if not, skip back to previous blog posts on File Cleaning), in the upper left, you’ll find a tree that corresponds to the file’s XML structure — the composite and tag names are displayed.

In the upper right is the data within those tags

And on the bottom half — the Error List which may (or may not) be displaying errors.

First: Did you open the Schema dialogue and click OK? Are you sure? XML Notepad will display errors until you confirm the schema that it should use. So, just to be sure go to the top menu bar: Click on View, Choose Schema, (and assuming the ONIX schemas are there as described in the previous section) Click on OK.

In the error section double click on one of the errors. If you have a large file it may take a while to respond, but it will take you directly to the error identified. If you have no idea why it’s a problem (and why would you?) I’ll try to put some advice together in future blogs, but the problem will either be a violation of the XML Standard or a violation of the ONIX Standard.

There’s a webinar available on this:
http://connectpro95248216.acrobat.com/p33362696/
It’s a little rough, but we’re just learning the software (and truth be told, I’m not at Michael Tamblyn’s level of presentation), but if you find this sort of webinar helpful let us know and we’ll try to do more.

For your convenience here are two files set up for Schema validation. Limitations on our website prevent me from using files with a .xml extension, so you’ll have to change the file from .txt to .xml. There are two, one for Reference Tags and one for Short Tags. If your tags are largely in English then it’s Reference, and if the tags are largely a character plus 3 digit codes then it’s Short:

Reference Tag Schema

Short Tag Schema

Data Exchange Tips #5: Some Basics: Tools before validation

Monday, November 16th, 2009 by tom richardson

An XML file is simply text — nothing very special — except that in order for XML software to read and interpret it, everything needs to be just so. XML is, loosely, a computer formatting language, and as such is a low type of computer code — if not quite as finicky as a proper programming language it has much stricter rules than HTML.

Every part of the structure of the file, and aspects of the contents, must match two defining documents: An ONIX file is validated by using XML software to compare your file against the rules of the XML standard (www.W3.org) and the schema written by the ONIX developers at Editeur (www.editeur.org). So an ONIX validation is both something that applies to all XML documents and is specific to the ONIX data exchange standard — and validation errors might be from either. You shouldn’t confuse the XML validation process with the Certification report generated by BiblioShare. Every file accepted into BiblioShare, after it passes the XML validation discussed here, gets a quality assessment that looks for data issues. This is a distinct and separate process from XML validation.
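The first of those two levels, the rules of the XML standard itself, can be tested with nothing more than Python’s standard library. This is a sketch of a well-formedness check only: it does not compare the file against the ONIX DTD or schema, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return None if the text parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return (line, column, str(err))
    return None
```

A file that fails here will fail everywhere, so it is a cheap first test before you open a proper validator. The standard-library parser does not read external DTDs, so an html-style entity such as &eacute; is reported as an error too.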

You probably should research and try to understand as much about XML as time, energy and inclination allow you — you’ll be happier producing ONIX if you do, and possibly more comfortable using Epub too. There are good resources on the web, Wikipedia and www.w3schools.com/XML/ are recommended as a start.

What gets validated?

The ONIX file, the file you send to BiblioShare and your other trading partners is what gets validated. In solving validation problems you might make corrections to your original dataset and re-output the ONIX file, or you might just manually modify the file itself, but it’s the file, whatever.xml, that we’re working with here. Validation is always the last step — before sending XML files to anyone they should be checked.

Taking Stock

First off: Do you have any XML software — an XML editor or development suite like XML Spy or oXygen? If you’ve inherited this job, look at your program list, ask! It can’t hurt and you may as well use what you’ve got or paid for. I will be recommending some specific software and one is free, but there’s nothing special about it. You should consider getting more than one validation tool (you can never have enough validation).

You can find software through a web search on “XML editor” or look at the “XML Resources” at O’Reilly’s www.XML.com.

Text Editor

As noted above, an XML file is just text and XML files can be opened in a text editor. If you’ve got an XML editor you might use that as it’s designed for the work, but you absolutely can view and edit an ONIX file in a text editor. Your only concern is ensuring the editor does not change the file. For example you can use MS-Word to open an xml file — but don’t do it!! Word is set up to “help” you run XSLT transformation scripts and will make any number of assumptions and changes to the file content, none will be good for our purpose of using the XML standard to exchange data. (This warning about software changing files applies to a lot of XML software. Until you’re sure it doesn’t, assume any software might be making changes to an XML file.)

What you want to use is as simple a text editor as possible: on a PC, Notepad or WordPad; on a Mac, TextEdit or SimpleText. Make gentle changes to the text you can see and save it without rendering the document unreadable to XML software. Really, just use the keyboard or cut and paste text, and exit using the most straightforward options. If forced to choose format options on saving, first try the one labeled “ASCII US” or “ASCII text”.

There are a number of text editors available that are designed for programmers. They tend to have better “Go To Line” features, are usually tag sensitive (you’ll understand that when you see it; it’s very handy in XML), and they don’t muck with the code. I’m fond of Notepad2, http://www.flos-freeware.ch/notepad2.html, but Notepad++ http://notepad-plus.sourceforge.net/uk/site.htm might be worth checking out.

As always all work should be done on a copy of your ONIX file — experiment but don’t trash your work.

PC vs MAC

Macs are better for a lot of things but you have more options (and more free options) for XML software on a PC. If your Mac has a Windows emulation or operating system boot area any PC solution should work. Mac solutions are typically Java based — and there’s nothing wrong with that (PC software usually rely on the .NET Framework) — but they are more likely to have fees associated with them.

I would really appreciate feedback from Mac users as I’m not very familiar with what’s out there. oXygen seems to be the clear favorite but I’m sure there’s some good freeware too.

File Size

XML software is typically processor intensive and requires a lot of RAM. Some software fails at large file sizes, and almost all of it is harder to work with when handling large files. You’ll find it faster and easier to understand if you do this on a smaller file (below 1000 records, and below 100 records would be even better), at least while doing your first validations. When you’re familiar with the software and its responses try using larger sizes; most XML software has an upper limit at which it’s unresponsive. How would you know if you haven’t done it successfully?

How do you cut a file down to size? Use a text editor, open the ONIX file and remove individual product records by starting with the tag <Product> (or <product> for short tags) and including the corresponding closing tag </Product> (or </product>). So long as you remove whole product records (<Product> to </Product>) and leave the other tags alone you can take out as many as you want.
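The same trimming can be scripted. Here is a minimal Python sketch that keeps the first few whole product records, <Product> through </Product> for reference tags or <product> through </product> for short tags, and drops the rest. The function name is hypothetical, and it assumes the opening product tag carries no attributes.

```python
import re

# One whole product record, reference or short tags, plus trailing whitespace.
# The ">" right after the name keeps <ProductIdentifier> and similar tags
# from being mistaken for the start of a record.
PRODUCT = re.compile(r"<(Product|product)>.*?</\1>\s*", re.DOTALL)

def keep_first_products(onix_text, limit):
    """Keep the first `limit` product records and remove all later ones."""
    seen = 0

    def keep_or_drop(match):
        nonlocal seen
        seen += 1
        return match.group(0) if seen <= limit else ""

    return PRODUCT.sub(keep_or_drop, onix_text)
```

Everything outside the product records (the declaration, the header and the closing message tag) is left exactly as it was.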

Internet Access

XML software usually needs internet access to work — do this on a computer hooked up to it.

The ONIX Documentation

It’s big, it’s dull and you need it on your computer: www.editeur.org ONIX / ONIX for Books / Previous releases / Release 2.1 Downloads / Download Release 2.1 format specifications. You’ll need to get the current release so I’ve not provided a direct link. Having a copy of the Product Manual and the Message Specifications is invaluable. The PDF is linked to the code lists and it’s the easiest way to look up something.

Data Exchange Tips #4: Escaping Entities: Påvøl breaches Checkpoint Charlie

Tuesday, November 10th, 2009 by tom richardson

Most publishers have more than 100 books on their list, and a few of them will be by non-European authors or reference some atypical symbol. And wasn’t the production manager proud when the cover copy spelt it right?

ONIX is a bibliographic data exchange standard, and it behooves us who toil in publishing to spell the author’s name, book title or their review journal correctly. Really foreign scripts will (probably) allow use of some transliteration system, so it’s unlikely you’ll need to use Chinese, Arabic or Cyrillic scripts. I have arbitrarily decreed this to be beyond my scope (or knowledge) — but “similar” alphabets like Norwegian or Romanized Slavic are not part of iso-8859-1. And even for the Western European languages it does cover (French, Spanish and German) there are missing letters. There are common symbols missing like trademark and copyright… At some point you’ll need to put something in an ONIX file that’s outside of the common encoding schemes, and for that you’ll use an escaped entity.

I’m going to make one of my daring generalizations here to help you recognize an entity: It starts with an ampersand, “&”, has simple keyboard characters in between and ends with a semicolon, “;”. It’s recognized by XML software and rendered as characters by browsers. Here’s a link to one of my favorite sites with lists of entities:
http://htmlhelp.com/reference/html40/entities/

For example an e with an acute accent — é — can be “escaped” as &eacute; or &#233; or &#xE9; — and further, an entity is special in XML because the ampersand should not be itself escaped. You should never see a “double escaped” entity like &amp;#233; in an ONIX file.
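Double-escaped entities are easy to hunt for with a pattern search. This Python sketch uses a pattern of my own devising, so treat it as a rough heuristic rather than a rule from the ONIX documentation: it flags anything that looks like an entity whose ampersand has itself been escaped.

```python
import re

# "&amp;" followed by what would otherwise be a decimal, hex or named entity.
DOUBLE_ESCAPED = re.compile(r"&amp;(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def find_double_escaped(text):
    """List every suspected double-escaped entity, in order of appearance."""
    return [match.group(0) for match in DOUBLE_ESCAPED.finditer(text)]
```

An ordinary escaped ampersand, as in AT&amp;T, is left alone because nothing entity-like follows it, but check each hit by eye before replacing anything.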

In practice, treat a file declared as utf-8 as one where everything that can’t be typed as a simple keyboard character is escaped (true UTF-8 can carry any character, but only if your software really writes UTF-8 bytes), while iso-8859-1 can have characters like é, ç, à, è, ô, ö, û, ñ, etc. but not characters like ů, ũ, or š, etc. iso-8859-1 has no “smart” characters, m or n dashes, etc., and pasting them raw into a file declared as utf-8 will fail unless the file really is saved as UTF-8, although there are escaped versions of these. Where is the dividing line? When the software complains, and just like the previous post on file cleaning the solution is simple substitution by find and replace. What? You haven’t kept track of how your characters are stored in your source file? Oh dear… understanding what’s in your source file is the first step. But basically: the XML software complains and you fix a problem, just like the encoding problems.
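If your ONIX text passes through Python on its way out the door, escaping everything beyond the simple keyboard characters is a one-liner. This is a sketch of one way to do it (the function name is mine); it relies on the standard "xmlcharrefreplace" error handler, which writes the decimal style of entity.

```python
def escape_non_ascii(text):
    """Replace every non-ASCII character with its decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

For example, escape_non_ascii("Hélènne") gives "H&#233;l&#232;nne". Run it on the data values only: it leaves the ampersands and angle brackets of the XML itself untouched, so it is not a substitute for proper XML escaping of & and <.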

There are considerations: First, while a basic validation in most XML software accepts all entity types, XML schema languages are stricter (I don’t know why), and they don’t accept the ‘html’ type like &eacute;. So using html style entities will cause you a problem with BookNet Canada’s BiblioShare and ONIX 3.0. Schema validation is the future, so the prudent administrator should avoid html entities. The ‘decimal’ style &#233; is the most common one supported by schema languages, and the one I recommend. I seldom see files using the ‘hex’ escaped entities like &#xE9; so I suggest not using it but I can’t defend that prejudice. Ideally you should use one system, consistently, in your file, and if you need to change it at some later point it won’t be hard.
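Moving a file from html-style entities to the decimal style is a mechanical substitution. The Python sketch below leans on the standard library’s table of HTML 4 entity names; the function name is an assumption of mine, and names it doesn’t recognize are left alone for you to look at by hand.

```python
import re
from html.entities import name2codepoint

# The five entities XML itself defines must stay exactly as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_decimal(text):
    """Rewrite html-style entities like &eacute; as decimal ones like &#233;."""
    def convert(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it untouched
        return "&#%d;" % name2codepoint[name]

    return NAMED.sub(convert, text)
```

Because decimal and hex entities start with “&#”, the pattern never touches them, so the function is safe to run on a file that already mixes styles.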

A second consideration is the on-line companies that get your data. In the descriptions and biographies you can certainly spell things correctly as the entities will appear correctly in browsers, but what about “searchable” fields like Contributor and Title? What happens to your author Hélènne Ővēn if you submit her name ‘correctly’ (according to whom? me?) in encoding='iso-8859-1' and escaped as “Hélènne &#336;v&#275;n” or in utf-8 as “H&#233;l&#232;nne &#336;v&#275;n”? It may render properly in a browser, but will it affect how easily you can search for her name in Amazon or Indigo? Can a consumer search the obvious keyboard bastardization of “Oven” and find the book? It’s a problem, that’s about all I can tell you. The on-lines are way better about this than they were a few years ago, when anything outside of simple keyboard characters wasn’t acceptable in a searchable field, but there are no guidelines here. I’d say that “Hélènne” with its pretty normal “special characters” within iso-8859-1 wouldn’t be much of a stretch, but “Ővēn” is likely to cause problems. It might matter if you submitted “Hélènne”, “H&eacute;l&egrave;nne” or “H&#233;l&#232;nne”. They are all different, clearly, and programmers at every aggregator or on-line would have to set up to process all of these to index. Did they? Will your own website? Oh dear!
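To see what the “obvious keyboard” version of a name would be, you can strip the accents with Python’s standard unicodedata module. To be clear, this is only an illustration of what a consumer might type; I have no idea whether any retailer indexes names this way, and the function name is invented.

```python
import unicodedata

def keyboard_form(name):
    """Split accented letters into base letter plus accent, then drop the accents."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this, keyboard_form("Hélènne Ővēn") gives "Helenne Oven". Letters that are not a base letter plus an accent, such as ø, come through unchanged, which is one more reason to check the on-line records yourself.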

This is, sort of, what it means when BISG says iso-8859-1 is the recommended encoding for the US supply chain: Aggregators should accommodate at least the special characters in it. And maybe they do more, maybe they do less, but it’s reasonable to think they’ll do that much. And when I say Canada hasn’t made a recommendation it means, well, we haven’t gone that far.

If you really have a lot of special characters that are critical and you don’t yet know what you and your trading partners are doing, well, that’s beyond the scope of this blog. I’m trying here for practical help to largely English language ONIX producers. But mostly I want to say: Take advantage of ONIX!! You can, and should, update your records. So spell the name right, submit your data as early as you can and then check the on-line records. If it doesn’t look right or the searches fail, then ask them about it or judiciously misspell the name to compensate and re-submit your data. You should have 6 months before publication to work it out. Maybe Amazon and Indigo will be OK but it’ll be wrong on Barnes & Noble. Maybe it’s only Walmart who can’t get it right. And maybe Walmart is the only one that matters to you. It’s your call but the author will probably understand why you made your choice. Try again in 2 years and the answers will have changed.

And that advice should make anyone who cares even a little about the accuracy of their records cringe.

Data Exchange Tips #3: File cleaning, not just for your nails.

Tuesday, November 3rd, 2009 by tom richardson

In later posts I’ll look at and recommend XML software (if anyone has favorite software — particularly for Macs as I don’t use them — let me know), but for this I’m assuming you have some and you’ve loaded into it an ONIX file that uses one of the two most common XML encoding declarations:
<?xml version="1.0" encoding="utf-8"?> (the file contains only standard keyboard characters)
or
<?xml version="1.0" encoding="iso-8859-1"?> (the file contains only standard keyboard characters plus basic French, Spanish, or some German accented characters)
and this being XML, the software is giving back some sort of statement saying on Line X, position Y there’s an unrecognized character, or possibly showing some sort of box listing 5 or 6 gibberish values that it’s converted to an underscore. Or maybe the software just craps out and won’t load.

This is what XML software does when it looks at your file and finds something in it that doesn’t match the encoding declaration, and this is what will happen when the file is loaded at Bowker, Indigo or Amazon. The aggregators are probably fixing minor problems because it’s faster to do that than complain, but if there are a lot of problems your file may well get shuffled to one side and never loaded. So you can rely on the kindness of retail to fix and maybe load your file, or you can do what you can to make sure that the file loads properly. If you make the effort I can assure you retailers will know and will be much more likely to contact you if they have problems.

The first (but not only) step in file cleaning is finding and correcting encoding issues because they usually prevent XML software from working. Because not all XML software is the same it helps to use more than one piece of software when trying to clean files. File cleaning is pretty simple conceptually, and simple in practice too. An XML file is just a text file — the simplest type of computer output possible. XML software needs all the characters in the file to be recognized in order to work, so to fix problems the easiest thing to do is open the ONIX file in a simple text editor (Notepad, WordPad, SimpleText, etc.), or if it’s available in the “text” view of your XML software.

What you don’t want to do is open the file in something like MS-Word that will recognize it as an XML document and start modifying it based on what it thinks you’re doing. ONIX is a data exchange standard and Word will think you’re trying to do XSLT transformations.

Use the “Go to line” function (ctrl G) to go to that line specified and look around (if the Go to Line function isn’t available, I’ll have some suggested text editors in the software discussion). You’ll probably see some glitchy text, a “smart” character, or possibly an accented character. If it’s the latter and your encoding is UTF-8, change the declaration to iso-8859-1 and try loading it into the software again. The game you’re playing is matching the characters in your file to what the XML software expects, so changing the encoding statement to the appropriate one is allowed (but no aggregator accepts every possible encoding and the two recommended here are the most common). The next blog post on “Escaping Entities” will deal with leaving your encoding as UTF-8 or using special characters outside of iso-8859-1.
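If your XML software won’t say where the trouble is, a few lines of Python can find it by reading the raw bytes of the file. This is a sketch under my own assumptions: the function names are invented, it handles only the two encodings discussed here, and for utf-8 it stops at the first bad byte. For iso-8859-1 it flags bytes 0x80 to 0x9F, which are control characters in that encoding and usually mean a “smart” character from a Windows program.

```python
def _locate(data, offset, description):
    """Turn a byte offset into a 1-based (line, column, description) tuple."""
    line = data.count(b"\n", 0, offset) + 1
    column = offset - (data.rfind(b"\n", 0, offset) + 1) + 1
    return (line, column, description)

def find_encoding_problems(data, declared):
    """Check raw file bytes against the declared encoding."""
    problems = []
    declared = declared.lower()
    if declared == "utf-8":
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # Decoding stops at the first bad byte, so only one is reported.
            note = "byte 0x%02X is not valid UTF-8" % data[err.start]
            problems.append(_locate(data, err.start, note))
    elif declared == "iso-8859-1":
        for offset, byte in enumerate(data):
            if 0x80 <= byte <= 0x9F:
                note = "byte 0x%02X is a control character in iso-8859-1" % byte
                problems.append(_locate(data, offset, note))
    return problems
```

To use it, open a copy of your file in binary mode, for example open("whatever.xml", "rb").read(), and pass in the bytes along with the encoding named in the declaration.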

But let’s say it’s a glitchy character (incoherent text strings or symbols), or possibly a “smart” character: curly apostrophes, special dashes and the like that are pretty and work in their source software but are not part of the encoding. The first test is: can you copy and paste them into the “Find” dialog box? If you can’t, then whatever they are, they’re so not-text that the text editor is not willing to work with them (Bones might say: “It’s a letter Jim, but not as we know it.”). At a guess they are hex (witchcraft?) characters and you may be forced to clean such issues one at a time. I’ve never seen a file with a lot of this problem, but cleaning them in the source (what your ONIX was created from) is the way to go.

The second test is: does the character you’re searching for recur consistently in the file, and in each case does it represent the same thing? If not, you’re again looking at manual cleaning. There’s no easy way to do this, but if there’s too much to fix manually, maybe you need to go to the source of the character and do some tests there. This is why encoding is so important: it’s so fundamental to the file that everything hinges on it. You may need to change how you create your documents in order to prevent problems. But the point is that XML software won’t care about anything more than the XML file in question. What came before it doesn’t matter to the software, so only send files that match the encoding statement.

The most likely outcome when you copy and paste the problem into “Find” (ctrl F) is that it recurs numerous times in your file and is consistently the same problem. If it’s a big file, test at several spots in the file, because it’s just possible that data loaded at a different time will be different.
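If you’d rather count than search by hand, the same check can be sketched in Python. This assumes you’ve already managed to read the file into text (for example with open(path, encoding="windows-1252"), where the path and the encoding are whatever applies to you); the function names and the sample string are only illustrative.

```python
from collections import Counter
import unicodedata


def tally_non_ascii(text: str) -> Counter:
    """Count every character outside the plain English keyboard range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def describe(counts: Counter) -> list:
    """One readable line per character, most frequent first."""
    lines = []
    for ch, n in counts.most_common():
        name = unicodedata.name(ch, "UNNAMED")
        lines.append(f"U+{ord(ch):04X} {name}: {n}")
    return lines


# Hypothetical title text with curly quotes, a curly apostrophe and an en dash.
sample = "\u201cAlice\u2019s Adventures\u201d \u2013 a \u2018classic\u2019"

for line in describe(tally_non_ascii(sample)):
    print(line)
```

A short list with big counts is the good case: a handful of find-and-replace passes will deal with it. A long list of one-offs suggests the manual cleaning described above.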

This is a copy of your ONIX file, right? So no harm in experimenting — use find and replace to transform the glitch to what it should be — the simplest possible keyboard character or an escaped entity (next blog post). Make a copy of the two values for future reference. You’ll possibly find that there are hundreds or thousands of instances of the problem in your file. Save the file and go back to the XML software and attempt to load it again (remember, the software will load the file’s last saved state so be sure to save your work). You’ll probably get another problem. Repeat the process.

While this may seem futile, there are probably only a limited number of such problems in your file (5 to 10 types are normal): smart quotes and apostrophes (several types) and dashes. Depending on the encoding and sensitivity of the XML software you may also find accented characters, trademarks and other special characters similarly noted. My next post on Escaping Entities will be a fuller discussion of these.
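Once you know your handful of problem characters, the repeated find-and-replace can be written down once and reused every time you create the file. Here is a minimal sketch in Python. The replacement list is my assumption about the usual suspects, not a complete or official list, so build your own from what you actually find in your file.

```python
# Assumed mapping of common "smart" characters to plain keyboard equivalents.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts single-character keys and replacement strings of any length.
TABLE = str.maketrans(REPLACEMENTS)


def plain_text(text: str) -> str:
    """Swap every character in REPLACEMENTS for its plain equivalent."""
    return text.translate(TABLE)


print(plain_text("\u201cIt\u2019s late\u201d \u2014 said the Rabbit\u2026"))
# "It's late" -- said the Rabbit...
```

Keeping the mapping in one place is also your record of “the two values” for future reference. None of these replacements introduce an ampersand or angle bracket, which would need escaping in XML.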

You’re going to need to make a decision at this point. If your ONIX file is generated from a database or other source, you can go back to that source and fix the characters there, so that the problem won’t exist in any future ONIX output. Or you can just fix the problem in the ONIX file and do this as a step every time you create and send it. Which makes the most sense probably depends on the number of problems and how easily you can change the source. Some database software allows you to do find and replace on multiple entries, while other content management systems (CMS) only allow you to open up individual records.

It’s common for publishers to clean each ONIX file prior to sending (and the whole point of storing information as XML in text files is to make it easily transformed), but clearly having the source clean is preferable. And given the source probably supports other uses like your website and catalogues, it likely is worth the effort. The very best choice is to ensure material added to your source is clean; if over time you also clean up problems in the existing records, eventually the whole source will be. If your system only lets you edit one record at a time and there’s no way to convert every example of a bad character across records, understanding what you need might be enough to get the developer to help you clean the contents as a one-time project.

I do this with trepidation as such documents may lead you astray, but here are a couple of documents with the most common “smart” characters I see, with alternatives and a spreadsheet that lists the most common escaped characters. The problem is that because these documents list characters that are encoding problems they may well not render the same way on your computer as mine — you may be better off creating such lists in the environment that you work in. So with that caveat, that what you see may not be what I see, here are a couple of files that might be helpful:

sample_bad_characters.doc
special_character_list.xls

Data Exchange Tips #2: So what(’s) encoding, anyway?

Tuesday, October 27th, 2009 by tom richardson

In the first tip, I tried to establish why you, the ONIX file sender, have to test your file, and that’s simply to ensure that the file’s content (all the characters) would be recognized by the aggregator’s software. The “encoding” declaration in the first line of the file tells the recipient what to expect, and your job is to ensure that the file matches that.

If you’re trading files in English speaking North America you’ve got a choice of three encodings that will almost certainly be considered acceptable by aggregators. (There are lots of others, but my assumption is that you’re trading files largely in English, with some French and/or Spanish thrown in).

The default encoding in ONIX is UTF-8. It’s the most commonly used in English North America for XML and the most supported by XML software. For the English language keyboard characters it’s identical to what was called ASCII (but not extended ASCII); accented letters, curly quotes and other special characters are where it differs from the other encodings below. Any text document in English will almost certainly be largely in UTF-8 encoding without any work on your part.

The other common encoding is ISO-8859-1, what might be called ‘extended ASCII’ or Latin-1. It supports the common accented characters in French, Spanish and German. BISG has identified this as the preferred encoding for the US supply chain. We in Canada are more demure and think it slightly impolite to discuss, but are OK with it too.

And then there is “windows-1252.” This is what, in desperation, your trading partners will use when they hope you’re on the Windows operating system and your file is screwing up when they load it. It’s the Windows version of ISO-8859-1. I think. I don’t really know… Who could possibly care about this!!!!
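For anyone who does care, the difference is easy to see by decoding the same bytes three ways. This is a small Python sketch with an invented sample: the word Café wrapped in the bytes Windows uses for curly double quotes. The encoding names are the ones Python’s standard codecs recognize.

```python
def decode_each_way(raw: bytes) -> dict:
    """Decode the same bytes with each of the three encodings discussed here.
    A value of None means the bytes are not valid in that encoding."""
    results = {}
    for encoding in ("utf-8", "iso-8859-1", "windows-1252"):
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Hypothetical bytes as a Windows program might save them.
raw = b"\x93Caf\xe9\x94"

for encoding, text in decode_each_way(raw).items():
    print(f"{encoding:13} {text!r}")
```

UTF-8 rejects the bytes outright. ISO-8859-1 accepts them but turns the two quote bytes into invisible control characters, and windows-1252 gives you the curly quotes the author saw on screen. The two encodings agree on the accented letters and differ in the range 0x80 to 0x9F, which is where Windows keeps its smart quotes and dashes.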

Here’s the dummy version: When you hit a computer key some code is generated and interpreted and appears on your screen. There’re conventions and standards that control all this and when you bought your computer if the sales person was awfully knowledgeable, they might have been able to tell you what conventions your computer follows. If you’re on a PC with a number pad try this: Hold the ALT key down and on the number pad key 80. If you did that you made a big pee, and I’m really, really pleased with myself for getting you to do it. My only point is that there really isn’t a way to know what your computer is doing, except that:

  • If you bought your computer in English speaking North America;
  • and no one said it wasn’t a standard keyboard;
  • and you’ve not really thought much about it;

then what happens when you make simple keystrokes is almost certainly UTF-8 (unless some piece of software is screwing with what you type). Can you cut and paste into a text document or email and it (usually) doesn’t turn to gibberish? Then it’s more or less UTF-8.

XML software doesn’t care. It’s up to you to tell it what your characters are, and as a start assume that you’re typing largely in UTF-8. You don’t really have a choice. But here’s a quick fix to try when you’re testing your ONIX and it’s not loading properly because of unrecognized characters: change the encoding declaration to encoding="iso-8859-1" and hope. It may be all that you need, but more likely you’ll have a small number of unrecognized types of characters in your file.
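The “change it and hope” step can be tried out in a few lines of Python using the parser in the standard library. Treat this as a sketch under assumptions: the helper names and the sample record are invented, and the regular expression assumes the first encoding="..." in the file is the one in the XML declaration.

```python
import re
import xml.etree.ElementTree as ET

# Assumes the first encoding="..." in the file belongs to the XML declaration.
DECLARED = re.compile(rb'encoding=(["\'])[A-Za-z0-9._-]+\1')


def redeclare(data: bytes, encoding: str) -> bytes:
    """Swap the encoding named in the declaration; the content bytes are untouched."""
    return DECLARED.sub(b'encoding="' + encoding.encode("ascii") + b'"', data, count=1)


def loads(data: bytes) -> bool:
    """True if the standard-library XML parser accepts the file."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# Hypothetical record: declared UTF-8, but the accents were saved as ISO-8859-1 bytes.
record = b'<?xml version="1.0" encoding="utf-8"?>\n<Title>Caf\xe9 Soci\xe9t\xe9</Title>'

print(loads(record))                           # False
print(loads(redeclare(record, "iso-8859-1")))  # True
```

Only the label changes here, never the characters themselves. That’s why it works when the file really is ISO-8859-1 and why it’s only “hope” otherwise.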

To summarize: You must test all XML files before sending them, and the initial point of testing XML files is to ensure that the contents are recognized and defined. There are some secondary data quality and validation issues that will come up when the actual ONIX standard is discussed, but the first step is always a coherent recognized file acceptable to XML software.

The next post is some practical tips on cleaning files, and the one after that is on what to do with special characters outside of your encoding statement, so don’t worry about your weekly excitement just yet.

Data Exchange Tips #1: Why XML?

Wednesday, October 21st, 2009 by tom richardson


I’m going to do a series of blog posts on some of the very basic issues in file trading — what needs to be done before you submit an ONIX file (or an E-book if your e-book is in XML). In doing this I’m hoping that publishers will comment about software they like (and don’t), problems they have — and with any luck their successes.

So, for the first post: Why XML?

Any discussion about file exchange has to start with why XML works, which is because of its underlying assumptions and the software that supports them. The main assumption is that all the characters, line returns, visible and hidden content — all of it — are recognized in every file. XML software tests for this and it’s so important that information about it normally appears in the first line of an XML file as an encoding statement, right after you identify that this is an XML document:
<?xml version="1.0" encoding="utf-8"?>
or
<?xml version="1.0" encoding="iso-8859-1"?>
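If you want a script to read that first line for you, here is one way it might look in Python. It’s a sketch, not a full XML parser: the function name is made up, and it only looks for a declaration at the very start of the file.

```python
import re

# Matches an XML declaration at the start of the file and captures the encoding name.
XML_DECL = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration (lowercased),
    or None if no encoding is declared."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark if present
        data = data[3:]
    match = XML_DECL.match(data)
    if match:
        return match.group(2).decode("ascii").lower()
    return None  # XML software will normally assume UTF-8 in this case


print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><ONIXMessage/>'))
# iso-8859-1
print(declared_encoding(b'<?xml version="1.0"?><ONIXMessage/>'))
# None
```

Knowing what the file claims to be is the starting point; whether the contents live up to the claim is a separate test.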

Think about that for a moment: How obvious and how could it be otherwise? And then think about just how unlikely it is to be true about a publisher’s ONIX file, built up over long periods of time through cut and paste from who knows what source documents. You don’t really know where all the millions of characters in your ONIX file came from, do you? And that’s why trading delimited files or database files doesn’t work. None of these test the incoming data. But XML software does and it won’t work with less than “well encoded” data.

Publishers can think of it this way: You’ve probably heard of or published a book where an “incompetent freelance designer didn’t use the right font” (or used “outdated software,” or provided “bad thingies”) and the files screwed up when it went to the printer. And your production manager “fixed that file” with a lot of overtime and foul language. That’s an encoding problem: What you sent to someone else didn’t appear as you intended it to be. If you were trading files in XML and did it right that wouldn’t happen. All sorts of other things might — but not that.

The trick to the encoding statement is that it doesn’t really matter where the characters came from. It’s not about your ability to answer the Zen koan: “What is the encoding of the letter you’re typing now?” What matters is what happens when someone else loads the file. Does their software recognize all the characters? You may have software designed to create an ONIX file, but does it monitor what’s going into it? Does it prevent you from loading dashes from Word 97 or WP5.1 with an error message? Does it ask what you want the output encoding to be and prevent anything else going in? It would be surprising if it did.

So the first rule of data exchange is that you must test the ONIX output every time you create it. You test your data with XML software before you send it. The XML standard demands it. The ONIX standard depends on it.
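If you don’t have dedicated XML software to hand, even the parser that ships with Python will do a basic version of this test. The sketch below is one possible pre-send check, with an invented function name and sample data. It only tests that the file is readable XML, not that it’s valid ONIX.

```python
import xml.etree.ElementTree as ET


def check_xml(data: bytes) -> str:
    """Report whether the parser accepts the file, and where it gave up if not."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position  # where the parser stopped
        return f"FAIL line {line}, column {column}: {err}"
    return "OK"


# Hypothetical records: one clean, one with a byte that doesn't match its UTF-8 declaration.
good = b'<?xml version="1.0" encoding="utf-8"?>\n<Title>Alice</Title>'
bad = b'<?xml version="1.0" encoding="utf-8"?>\n<Title>Caf\xe9</Title>'

print(check_xml(good))
print(check_xml(bad))
```

Run something like this on every file you create. Different XML software can be stricter or more forgiving, so a pass here is a first step, not a guarantee.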

That’s why XML works. The XML standard and software are designed to enforce things like this. You may think you can trade data using Excel or delimited formats, but none of these will do a good job of ensuring that what you send can be read at the other end. XML does (somewhat; don’t think it’ll be perfect), and that’s the main reason it’s better for data transfer.