In later posts I’ll look at and recommend XML software (if anyone has favorite software—particularly for Macs as I don’t use them—let me know), but for this I’m assuming you have some and you’ve loaded into it an ONIX file that uses one of the two most common XML encoding declarations:

<?xml version="1.0" encoding="utf-8"?> (the file contains only standard keyboard characters)

or

<?xml version="1.0" encoding="iso-8859-1"?> (the file contains only standard keyboard characters plus basic French, Spanish, or some German accented characters)

and, this being XML, the software is giving back some sort of statement saying that on Line X, position Y there’s an unrecognized character, or possibly showing some sort of box listing 5 or 6 gibberish values that it’s converted to an underscore. Or maybe the software just craps out and won’t load.

This is what XML software does when it looks at your file and finds something in it that doesn’t match the encoding declaration, and this is what will happen when the file is loaded at Bowker, Indigo or Amazon. The aggregators are probably fixing minor problems because it’s faster to do that than complain, but if there are a lot of problems your file may well get shuffled to one side and never loaded. So you can rely on the kindness of retail to fix and maybe load your file, or you can do what you can to make sure that the file loads properly. If you make the effort I can assure you retailers will know and will be much more likely to contact you if they have problems.
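If you'd like to see the same kind of “Line X, position Y” report without relying on any particular XML package, here is a minimal Python sketch. The function name find_decode_errors is my own invention, and the position it reports is a byte offset within the line, which may differ slightly from what your XML software calls “position”. One caveat: Python's iso-8859-1 decoder accepts every possible byte, so this check is only informative for declarations like utf-8 or ascii.

```python
def find_decode_errors(data: bytes, encoding: str = "utf-8"):
    """Return (line, byte_position, offending_bytes) for each line that
    cannot be decoded with the given encoding.

    Only the first problem on each line is reported, which is roughly
    how XML software behaves: it stops at the first thing it dislikes.
    """
    problems = []
    # Splitting on the newline byte is safe for ASCII-compatible encodings.
    for line_no, raw_line in enumerate(data.split(b"\n"), start=1):
        try:
            raw_line.decode(encoding)
        except UnicodeDecodeError as exc:
            bad = raw_line[exc.start:exc.end]
            problems.append((line_no, exc.start + 1, bad))
    return problems


# A file that claims utf-8 but holds an iso-8859-1 "e acute" (byte E9).
sample = b'<?xml version="1.0" encoding="utf-8"?>\n<Title>Caf\xe9</Title>\n'
for line_no, position, bad in find_decode_errors(sample, "utf-8"):
    print(f"Line {line_no}, position {position}: unexpected bytes {bad!r}")
```

For a real file you would read the bytes with open(path, "rb").read() and pass in whatever encoding the file's declaration names.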

The first (but not only) step in file cleaning is finding and correcting encoding issues because they usually prevent XML software from working. Because not all XML software is the same, it helps to use more than one piece of software when trying to clean files. File cleaning is pretty simple conceptually, and simple in practice too. An XML file is just a text file, the simplest type of computer output possible. XML software needs all the characters in the file to be recognized in order to work, so the easiest way to fix problems is to open the ONIX file in a simple text editor (Notepad, WordPad, SimpleText, etc.) or, if it’s available, in the “text” view of your XML software.

What you don’t want to do is open the file in something like MS-Word that will recognize it as an XML document and start modifying it based on what it thinks you’re doing. ONIX is a data exchange standard and Word will think you’re trying to do XSLT transformations.

Use the “Go to line” function (ctrl G) to go to the line specified and look around (if the Go to Line function isn’t available, I’ll have some suggested text editors in the software discussion). You’ll probably see some glitchy text, a “smart” character, or possibly an accented character. If it’s the latter and your encoding is UTF-8, change the declaration to iso-8859-1 and try loading it into the software again. The game you’re playing is matching the characters in your file to what the XML software expects, so changing the encoding statement to the appropriate one is allowed (but no aggregator accepts every possible encoding, and the two recommended here are the most common). The next blog post on “Escaping Entities” will deal with leaving your encoding as UTF-8 or using special characters outside of iso-8859-1.
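Changing the declaration by hand in a text editor is a ten-second job, but if you do it often it can be scripted. The sketch below is one way to do it in Python under some assumptions: the helper names declared_encoding and set_declared_encoding are made up for illustration, and the pattern only expects a declaration shaped like the two shown at the top of this post. It changes the declaration only; the rest of the file's bytes are left exactly as they were.

```python
import re

# Matches the encoding part of an XML declaration, e.g.
# <?xml version="1.0" encoding="utf-8"?>
DECLARATION = re.compile(
    rb'(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z0-9._-]+)(["\'])'
)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = DECLARATION.search(data[:200])  # the declaration sits at the top
    return match.group(2).decode("ascii").lower() if match else None


def set_declared_encoding(data: bytes, new_encoding: str) -> bytes:
    """Rewrite only the declaration; every other byte is untouched."""
    return DECLARATION.sub(
        lambda m: m.group(1) + new_encoding.encode("ascii") + m.group(3),
        data,
        count=1,
    )


sample = b'<?xml version="1.0" encoding="utf-8"?>\n<Title>Caf\xe9</Title>'
print(declared_encoding(sample))
fixed = set_declared_encoding(sample, "iso-8859-1")
print(fixed.decode("iso-8859-1"))
```

As always, run it against a copy of the file and reload the result in your XML software to confirm the complaint has gone away.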

But let’s say it’s a glitchy character (incoherent text strings or symbols), or possibly a “smart” character: curly apostrophes, special dashes and the like that are pretty and work in their source software but are not part of the encoding. The first test is: can you copy and paste them into the “Find” dialog box? If you can’t, then whatever they are, they’re so not-text that the text editor is not willing to work with them (Bones might say: “It’s a letter Jim, but not as we know it.”). At a guess they are hex (witchcraft?) characters and you may be forced to clean such issues one at a time. I’ve never seen a file with a lot of this problem, but cleaning them in the source (what your ONIX was created from) is the way to go.
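One plausible candidate for characters a text editor refuses to copy is control characters. XML 1.0 does not allow most of them: of the characters below the space, only tab, line feed and carriage return are legal. As a rough sketch, assuming that is what you are facing, the Python below lists where such characters sit so you can go to each one. The name find_illegal_xml_chars is hypothetical, and the file is assumed to have been read into a string already.

```python
import re

# Characters XML 1.0 forbids: the low control characters other than
# tab (09), line feed (0A) and carriage return (0D), plus U+FFFE and U+FFFF.
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def find_illegal_xml_chars(text: str):
    """Return (line, position, code point) for each forbidden character."""
    hits = []
    # split("\n") is used instead of splitlines(), because splitlines()
    # treats some of these control characters as line breaks and hides them.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            hits.append((line_no, match.start() + 1, code_point))
    return hits


sample = "<a>ok</a>\n<b>bad\x0bchar</b>"
for line_no, position, code_point in find_illegal_xml_chars(sample):
    print(f"Line {line_no}, position {position}: {code_point}")
```

If the list is short, fixing each spot by hand in the text editor (or better, in the source) is probably quicker than automating anything.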

The second test is: does the character you’re searching for recur consistently in the file, and in each case does it represent the same thing? If not, you’re again looking at manual cleaning. There’s no easy way to do this, but if there’s too much to fix manually, maybe you need to go to the source of the character and do some tests there. This is why encoding is so important: it’s so fundamental to the file that everything hinges on it. You may need to change how you create your documents in order to prevent problems. But the point is that XML software won’t care about anything more than the XML file in question. What came before it doesn’t matter to the software, so only send files that match the encoding statement.

The most likely outcome when you copy and paste the problem into “Find” (ctrl F) is that it recurs numerous times in your file and that it’s consistently the same problem. If it’s a big file, try to test at several spots in the file because it’s just possible that data loaded at a different time will be different.
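The “test at several spots” advice can also be done with a few lines of Python rather than repeated presses of Find Next. This is only a sketch: sample_occurrences is a name I've made up, and it assumes the file has already been read into a string. It counts how often the problem text occurs and returns a handful of snippets spread from the start to the end of the file so you can eyeball whether it always stands for the same thing.

```python
def sample_occurrences(text: str, needle: str, samples: int = 3, context: int = 15):
    """Count occurrences of needle and return snippets spread across the text."""
    if not needle:
        raise ValueError("needle must not be empty")

    # Collect the starting position of every occurrence.
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + len(needle))

    if not positions:
        return 0, []

    # Pick occurrences spread evenly from the first to the last.
    n = len(positions)
    picks = sorted({i * (n - 1) // max(1, samples - 1) for i in range(samples)})
    snippets = [
        text[max(0, positions[i] - context): positions[i] + len(needle) + context]
        for i in picks
    ]
    return n, snippets


sample = "<Title>It\u2019s a Start</Title>\n" * 10
count, snippets = sample_occurrences(sample, "\u2019")
print(count, "occurrences")
for snippet in snippets:
    print(repr(snippet))
```

If every snippet shows the character doing the same job (an apostrophe, say), a single find and replace will probably do. If the snippets disagree, you are back to manual cleaning.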

This is a copy of your ONIX file, right? So no harm in experimenting—use find and replace to transform the glitch to what it should be—the simplest possible keyboard character or an escaped entity (next blog post). Make a copy of the two values for future reference. You’ll possibly find that there are hundreds or thousands of instances of the problem in your file. Save the file and go back to the XML software and attempt to load it again (remember, the software will load the file’s last saved state so be sure to save your work). You’ll probably get another problem. Repeat the process.
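Once you've built up your list of pairs, the repeated find and replace can be run in one pass. The Python below is a minimal sketch, not a finished tool: the REPLACEMENTS table is only an example of pairs you might collect, clean_text is a hypothetical name, and it assumes the copy of your file has already been read into a string. Note that swapping a curly double quote for a straight one is harmless in element text but could break an attribute value, so check how your file uses them.

```python
# Example pairs only: problem character -> plain keyboard replacement.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def clean_text(text: str, replacements=REPLACEMENTS):
    """Apply each replacement and report how many instances were changed."""
    counts = {}
    for bad, good in replacements.items():
        found = text.count(bad)
        if found:
            counts[bad] = found
            text = text.replace(bad, good)
    return text, counts


cleaned, counts = clean_text("\u201cIt\u2019s\u201d \u2013 ok")
print(cleaned)
for bad, found in counts.items():
    print(f"U+{ord(bad):04X} replaced {found} time(s)")
```

Keeping the table in a file of its own gives you the record of the two values for future reference, and the counts tell you how widespread each problem was.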

While this may seem futile, there are probably only a limited number of such problems in your file. 5 to 10 types are normal: smart quotes and apostrophes (several types) and dashes. Depending on the encoding and sensitivity of the XML software you may also find accented characters, trademarks and other special characters similarly noted. My next post on Escaping Entities will be a fuller discussion of these.
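Rather than discovering those 5 to 10 types one failed load at a time, you can list them all up front. Here is a short Python sketch, assuming the file has been read into a string; the function name inventory is my own. It reports every character outside the standard keyboard (ASCII) range, with its Unicode code point, its official name and how often it appears.

```python
import unicodedata
from collections import Counter


def inventory(text: str):
    """List each non-ASCII character with its code point, name and count."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), found)
        for ch, found in counts.most_common()  # most frequent first
    ]


for ch, code_point, name, found in inventory("It\u2019s a caf\u00e9\u2019s menu"):
    print(f"{code_point}  {name}  x{found}")
```

The names are handy when a character doesn't display properly on your screen: RIGHT SINGLE QUOTATION MARK tells you what it is even if all you see is a box.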

You’re going to need to make a decision at this point. If your ONIX file is generated from a database or other source, you can go back to that source and fix the characters there so that the problem won’t exist in any future ONIX output. Or you can just fix the problem in the ONIX file and do this as a step every time you create and send it. Which makes the most sense probably depends on the number of problems and how easily you can change the source. Some database software allows you to do find and replace on multiple entries, while other content management systems (CMS) only allow you to open up individual records.

It’s common for publishers to clean each ONIX file prior to sending (and the whole point of storing information as XML in text files is to make it easily transformed), but clearly having the source clean is preferable. And given the source probably supports other uses like your website and catalogues, it likely is worth the effort. The very best choice is to ensure material added to your source is clean; if over time you also clean up problems in the existing records, eventually the whole source will be. If your system only lets you edit one record at a time and there’s no way to convert every example of a bad character across records, understanding what you need might be enough to get the developer to help you clean the contents as a one-time project.

I do this with trepidation as such documents may lead you astray, but here are a couple of documents with the most common “smart” characters I see, with alternatives and a spreadsheet that lists the most common escaped characters. The problem is that because these documents list characters that are encoding problems they may well not render the same way on your computer as mine—you may be better off creating such lists in the environment that you work in. So with that caveat, that what you see may not be what I see, here are a couple of files that might be helpful:

sample_bad_characters.doc

special_character_list.xls

Data Exchange Tip #3: File Cleaning—Not Just for Your Nails
