BookNet Canada

Tom Richardson
November 3, 2009
BiblioShare, ONIX, Standards & Metadata

Data Exchange Tip #3: File Cleaning—Not Just for Your Nails

In later posts I’ll look at and recommend XML software (if anyone has favorite software, particularly for Macs as I don’t use them, let me know), but for this post I’m assuming you have some and you’ve loaded into it an ONIX file that uses one of the two most common XML encoding declarations:

<?xml version="1.0" encoding="utf-8"?> (the file contains only standard keyboard characters)

or

<?xml version="1.0" encoding="iso-8859-1"?> (the file contains only standard keyboard characters plus basic French, Spanish, or some German accented characters)

and, this being XML, the software is giving back some sort of statement saying that on Line X, position Y there’s an unrecognized character, or possibly showing some sort of box listing 5 or 6 gibberish values that it’s converted to an underscore. Or maybe the software just craps out and won’t load.

This is what XML software does when it looks at your file and finds something in it that doesn’t match the encoding declaration, and this is what will happen when the file is loaded at Bowker, Indigo or Amazon. The aggregators are probably fixing minor problems because it’s faster to do that than complain, but if there are a lot of problems your file may well get shuffled to one side and never loaded. So you can rely on the kindness of retail to fix and maybe load your file, or you can do what you can to make sure that the file loads properly. If you make the effort, I can assure you retailers will know and will be much more likely to contact you if they have problems.
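To make the "Line X, position Y" message less mysterious, here is a minimal Python sketch of the kind of check that produces it. It is not how any particular XML tool or aggregator works; the function name `first_decoding_error` and the sample data are made up for illustration. It simply tries to decode the raw bytes with the declared encoding and reports where the first mismatch sits.

```python
def first_decoding_error(data: bytes, encoding: str = "utf-8"):
    """Return (line, position, offending_bytes) for the first byte sequence
    that does not match the encoding, or None if the data decodes cleanly."""
    try:
        data.decode(encoding)
        return None
    except UnicodeDecodeError as err:
        before = data[:err.start]
        line = before.count(b"\n") + 1
        # Position is 1-based and counted in bytes from the start of the line.
        position = err.start - (before.rfind(b"\n") + 1) + 1
        return line, position, data[err.start:err.end]


# Hypothetical sample: the declaration says utf-8, but the accented
# character was saved as a single iso-8859-1 byte (0xE9).
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<Title>Caf\xe9 Society</Title>\n"
)
print(first_decoding_error(sample))  # (2, 11, b'\xe9')
```

The position here is counted in bytes, so it may differ slightly from the column your own software reports on lines that contain multi-byte characters.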

The first (but not only) step in file cleaning is finding and correcting encoding issues, because they usually prevent XML software from working. Because not all XML software is the same, it helps to use more than one piece of software when trying to clean files. File cleaning is pretty simple conceptually, and simple in practice too. An XML file is just a text file, the simplest type of computer output possible. XML software needs all the characters in the file to be recognized in order to work, so to fix problems the easiest thing to do is open the ONIX file in a simple text editor (Notepad, WordPad, SimpleText, etc.) or, if it’s available, in the “text” view of your XML software.

What you don’t want to do is open the file in something like MS-Word that will recognize it as an XML document and start modifying it based on what it thinks you’re doing. ONIX is a data exchange standard and Word will think you’re trying to do XSLT transformations.

Use the “Go to line” function (Ctrl+G) to go to the line specified and look around (if the Go to Line function isn’t available, I’ll have some suggested text editors in the software discussion). You’ll probably see some glitchy text, a “smart” character, or possibly an accented character. If it’s the latter and your encoding is UTF-8, change the declaration to iso-8859-1 and try loading it into the software again. The game you’re playing is matching the characters in your file to what the XML software expects, so changing the encoding statement to the appropriate one is allowed (but no aggregator accepts every possible encoding, and the two recommended here are the most common). The next blog post on “Escaping Entities” will deal with leaving your encoding as UTF-8 or using special characters outside of iso-8859-1.
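The matching game can be roughed out in code as well. The sketch below is only a heuristic, and the function names `fits_declaration` and `suggest_declaration` are invented for this example. One assumption needs stating loudly: every byte is technically decodable as iso-8859-1, so for that encoding the sketch treats bytes 0x80 to 0x9F as a warning sign, because that range holds control codes and in practice usually means Windows "smart" punctuation has slipped in.

```python
def fits_declaration(data: bytes, encoding: str) -> bool:
    """Rough check that the raw bytes are plausible for the declared encoding."""
    name = encoding.lower()
    if name == "utf-8":
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False
    if name == "iso-8859-1":
        # Decoding alone proves nothing here, so look for bytes in the
        # 0x80-0x9F control range instead (assumption: these are stray
        # smart quotes or dashes, not intentional control codes).
        return not any(0x80 <= byte <= 0x9F for byte in data)
    raise ValueError(f"encoding not covered by this sketch: {encoding}")


def suggest_declaration(data: bytes):
    """Return the first of the two common declarations the bytes fit, or None."""
    for encoding in ("utf-8", "iso-8859-1"):
        if fits_declaration(data, encoding):
            return encoding
    return None


print(suggest_declaration("café".encode("iso-8859-1")))  # iso-8859-1
print(suggest_declaration(b"it\x92s"))                   # None
```

A result of `None` means neither declaration fits as-is, which is the cue to start cleaning characters rather than swapping the declaration.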

But let’s say it’s a glitchy character (incoherent text strings or symbols), or possibly a “smart” character: curly apostrophes, special dashes and the like that are pretty and work in their source software but are not part of the encoding. The first test is: can you copy and paste them into the “Find” dialog box? If you can’t, then whatever they are, they’re so not-text that the text editor is not willing to work with them (Bones might say: “It’s a letter Jim, but not as we know it.”). At a guess they are hex (witchcraft?) characters and you may be forced to clean such issues one at a time. I’ve never seen a file with a lot of this problem, but cleaning them in the source (what your ONIX was created from) is the way to go.
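If the text editor refuses to copy a character, one plausible explanation (an assumption on my part, not something the error message tells you) is that it is a control character that XML 1.0 does not allow at all. A short Python sketch can list where such characters sit so you know which lines to fix by hand. The name `find_control_characters` is hypothetical, and the sketch only covers control characters below U+0020.

```python
import re

# XML 1.0 allows tab, line feed and carriage return, but no other
# control characters below U+0020.
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_control_characters(text: str):
    """Yield (line_number, column, code_point) for each illegal control character."""
    # Split on "\n" only: str.splitlines() would also break lines on some
    # of the very control characters we are trying to find.
    for line_number, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            yield line_number, match.start() + 1, f"U+{ord(match.group()):04X}"


# Hypothetical sample with a vertical tab on line 1 and a NUL on line 2.
sample_text = "<Title>Good\x0bBook</Title>\n<Subtitle>A\x00B</Subtitle>"
for hit in find_control_characters(sample_text):
    print(hit)
```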

The second test is: does the character you’re searching for recur consistently in the file, and in each case does it represent the same thing? If not, you’re again looking at manual cleaning. There’s no easy way to do this, but if there’s too much to fix manually, maybe you need to go to the source of the character and do some tests there. This is why encoding is so important: it’s so fundamental to the file that everything hinges on it. You may need to change how you create your documents in order to prevent problems. But the point is that XML software won’t care about anything more than the XML file in question. What came before doesn’t matter to it, so only send files that match the encoding statement.

The most likely outcome, if you copy and paste the problem into “Find” (Ctrl+F), is that it recurs numerous times in your file and that it’s consistently the same problem. If it’s a big file, try to test at several spots in the file, because it’s just possible that data loaded at a different time will be different.
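For a big file, spot checks with "Find" can be backed up with a quick inventory. The Python sketch below counts every character outside the standard keyboard (ASCII) range and notes the first line each one appears on. It assumes the file has already been read into a string, and `inventory_non_ascii` is a made-up name for illustration.

```python
from collections import Counter


def inventory_non_ascii(text: str):
    """Count each non-ASCII character and record the first line it appears on."""
    counts = Counter()
    first_line = {}
    for line_number, line in enumerate(text.split("\n"), start=1):
        for char in line:
            if ord(char) > 127:
                counts[char] += 1
                first_line.setdefault(char, line_number)
    # Most frequent characters first: (character, code point, count, first line).
    return [
        (char, f"U+{ord(char):04X}", count, first_line[char])
        for char, count in counts.most_common()
    ]


# Hypothetical sample with curly quotes and apostrophes.
sample_text = "It\u2019s a \u201csmart\u201d title\nDon\u2019t panic\n"
for row in inventory_non_ascii(sample_text):
    print(row)
```

The count tells you how widespread each problem is, but it cannot tell you whether every occurrence means the same thing, so checking a few spots by eye is still worthwhile.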

This is a copy of your ONIX file, right? So there’s no harm in experimenting: use find and replace to transform the glitch to what it should be, either the simplest possible keyboard character or an escaped entity (next blog post). Keep a note of the two values (the glitch and its replacement) for future reference. You’ll possibly find that there are hundreds or thousands of instances of the problem in your file. Save the file and go back to the XML software and attempt to load it again (remember, the software will load the file’s last saved state, so be sure to save your work). You’ll probably get another problem. Repeat the process.
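Once you have a list of glitches and their replacements, the find-and-replace loop can be scripted. This Python sketch is one possible approach, not a recommended standard mapping: the `REPLACEMENTS` table and the function name `clean_smart_characters` are illustrative assumptions, and you would build your own table from what you actually find in your file.

```python
# Illustrative mapping only: each "smart" character and the plain
# keyboard character(s) chosen to stand in for it.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def clean_smart_characters(text: str):
    """Return (cleaned_text, number_of_characters_replaced)."""
    replaced = sum(text.count(char) for char in REPLACEMENTS)
    cleaned = text.translate(str.maketrans(REPLACEMENTS))
    return cleaned, replaced


cleaned, replaced = clean_smart_characters("It\u2019s \u201cfine\u201d \u2014 really\u2026")
print(cleaned)   # It's "fine" -- really...
print(replaced)  # 5
```

Straight double quotes are safe inside element text, but they would need escaping inside an attribute value, so check where your replacements land before sending the file.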

While this may seem futile, there are probably only a limited number of such problems in your file. Five to 10 types is normal: smart quotes and apostrophes (several types) and dashes. Depending on the encoding and the sensitivity of the XML software, you may also find accented characters, trademark symbols and other special characters similarly flagged. My next post on Escaping Entities will be a fuller discussion of these.

You’re going to need to make a decision at this point. If your ONIX file is generated from a database or other source, you can go back to that source and fix the characters there so the problem won’t exist in any future ONIX output. Or you can just fix the problem in the ONIX file and do this as a step every time you create and send it. Which makes the most sense probably depends on the number of problems and how easily you can change the source. Some database software allows you to do find and replace on multiple entries, while other content management systems (CMS) only allow you to open up individual records.

It’s common for publishers to clean each ONIX file prior to sending (and the whole point of storing information as XML in text files is to make it easily transformed), but clearly having the source clean is preferable. And given that the source probably supports other uses like your website and catalogues, it likely is worth the effort. The very best choice is to ensure material added to your source is clean; if you also clean up problems in the existing records over time, eventually the whole source will be. If your system only lets you edit one record at a time and there’s no way to convert every example of a bad character across records, understanding what you need might be enough to get the developer to help you clean the contents as a one-time project.

I do this with trepidation as such documents may lead you astray, but here are a document with the most common “smart” characters I see, along with alternatives, and a spreadsheet that lists the most common escaped characters. The problem is that because these documents list characters that are encoding problems, they may well not render the same way on your computer as on mine; you may be better off creating such lists in the environment that you work in. So with that caveat, that what you see may not be what I see, here are a couple of files that might be helpful:

sample_bad_characters.doc

special_character_list.xls

Tagged: xml, data exchange tips
