
Adventures in Character Encoding

See also:

the Never Ending Story

(or alternatively, less titled but a
form for gazing[ mecca, or gaze seems seismic not so,]
small but Infinitesimal,
in Accessible i18n, if his him her and she an
extra spectral l10n thing waxing
too, but waning maybe by these
homological
transliterations

It’s a topic which I return to time and again:

  • Character Encoding
  • Numerical Character References
  • Named Entities (aka. Named Character References)
  • Unicode and UTF-8
  • the Character Set : and Proper Selection Thereof
  • and the list goes on… (and on)…

What Is so Complicated about &amp; and &nbsp;?

I’m not going to write another redundant entry about NCRs and HTML Entities, but I would like to share an experience in which I found a way to make a document do what I wanted it to do. To make it easier to follow, I’ve broken it into a short list of steps below.

I’m not going to say that I am the supreme intelligence on Character Encodings, but the subject does interest me. If you are searching for a way to publish documents with confidence that they will be properly decoded by the user-agent of your target audience, you may find the following information effective. Please feel free to contact me via: learn at novicenotes com

Western : ISO-8859-1 – a Formidable Faux

You may find your most comprehensive resource to be the What is Unicode? section at Unicode.org. Before you start worrying about character encodings, you should at least learn something about why your documents need the most appropriate Character Encoding. Once you realize the vast complexity of the issue, in terms of the global Internationalization of documents (and their readers), you might find it encouraging to pay more attention to the Character Encoding of all of your own documents.

Ideally, the document author will use Character Encodings (e.g. Numerical Character References, HTML Entities, etc.) which are common to a single Character Set, or charset. One difficulty central to working with Unicode and other Encoding Standards, such as ISO-8859-1, is that it’s not always easy to tell whether the charset declared in a document header disagrees with the Character Encodings actually used in the document. Unless you specialize in Unicode and Character Encodings, the realm of possibilities is so massive that the Charset issue is often overlooked, ignored, or simply left to whatever the author’s text editor of choice sets by default.
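
As a rough illustration of that mismatch, here is a minimal Python sketch that pulls the declared charset out of the raw bytes and checks whether the bytes actually decode under it. The helper name `check_declared_charset`, the regular expression, and the sample markup are all my own assumptions for the example, not part of any library or of the validator mentioned later.

```python
import re

# Hypothetical pattern: finds "charset=..." in a meta tag or XML declaration.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def check_declared_charset(raw: bytes):
    """Return (declared_charset, decodes_cleanly) for raw document bytes."""
    match = CHARSET_RE.search(raw)
    if match is None:
        return None, False
    declared = match.group(1).decode("ascii")
    try:
        raw.decode(declared)          # strict decoding: fails on invalid bytes
    except (UnicodeDecodeError, LookupError):
        return declared, False
    return declared, True

page = ('<meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8" /><p>caf\u00e9</p>')

print(check_declared_charset(page.encode("utf-8")))       # ('utf-8', True)
print(check_declared_charset(page.encode("iso-8859-1")))  # ('utf-8', False)
```

Note that this check only works in one direction: every byte sequence is valid ISO-8859-1, so a document wrongly declared as ISO-8859-1 will always "decode cleanly" and still display the wrong characters.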

Get to Know your Source Code Editor!

For all that you might research and learn about Character Encodings, it will all be for naught if you allow your editor to manipulate the glyphs or entities you’ve created in your document. If your editor has a configuration option which reads something like “…automatically convert Numerical to HTML Entities…”, I recommend you seriously consider disabling it, and rely instead on what you know to be true about the document rather than trust the guesswork of the software.
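
To see what such a conversion option is doing, it may help to confirm that a Numerical Character Reference and a Named Entity are just two spellings of the same character. The following Python sketch uses only the standard library; the sample strings are my own.

```python
import html

numeric = "caf&#233; &#169; na&#xEF;ve"        # decimal and hexadecimal NCRs
named = "caf&eacute; &copy; na&iuml;ve"        # the same text with named entities

decoded_numeric = html.unescape(numeric)
decoded_named = html.unescape(named)
print(decoded_numeric == decoded_named)        # True

# Going the other way: rewrite every non-ASCII character as a decimal NCR.
as_ncr = decoded_named.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(as_ncr)                                  # caf&#233; &#169; na&#239;ve
```

Since both forms decode identically, an editor that silently swaps one for the other is not changing the text, but it is changing the bytes you thought you had saved.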

To check that your document conforms to a Standard Character Set, and that you’ve properly declared the charset in the document head (XHTML), simply pay a visit to the FREE, on-line, Standards Compliant Validation service provided by the World Wide Web Consortium (W3C).


May Your On-Line Documents Decode, Happily Ever After

Step One : Your Document
Begin with your document. Some of the possible ways of coming up with a document might be the following:
  • You’ve authored your own document from scratch
  • For research, you have cut and pasted text or source-code from an existing published document
  • You’ve found an old document on your system which you want to re-use, but it appears to be using the wrong encoding

or for any other reason; there are many, of course!

Step Two : View Document in Browser
Do you see any strange characters when viewing the document, such as one which looks like a white question mark on a black square turned at a 45° angle? Do you see characters which look like a Latin capital letter “A”, but with some extra marks attached (aka. “squiggly” marks)?

If this sounds familiar to you, then you’ve probably got an encoding / decoding problem with your document’s Charset declaration vs. the encodings used within the document.
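
Both symptoms can be reproduced in a few lines of Python. This is only a sketch with a sample word of my own choosing, but it shows where each strange character comes from.

```python
original = "caf\u00e9"                           # "café"

# Symptom 1: UTF-8 bytes read as ISO-8859-1.
utf8_bytes = original.encode("utf-8")            # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)                                  # cafÃ©

# Symptom 2: ISO-8859-1 bytes read as UTF-8.
latin1_bytes = original.encode("iso-8859-1")     # b'caf\xe9'
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced == "caf\ufffd")                   # True: U+FFFD REPLACEMENT CHARACTER
```

In the first case the two bytes of the UTF-8 sequence are shown as two separate Latin-1 characters, the first being a capital “A” with a tilde. In the second case the lone byte is not valid UTF-8, so the decoder substitutes U+FFFD, which most browsers draw as the question mark in a diamond.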

If you’ve reached this point, and you’ve simply run out of ideas on how to solve it, try the following.

Step Three : BabelPad
Go to BabelStone Software and acquire BabelPad.

Once you’ve installed BabelPad and practiced using it, try the following steps for arriving at a proper encoding for your document(s).

  1. Ensure that your document contains any Numerical Character References, or Entities where appropriate.
  2. Copy and Paste the source code / markup into BabelPad.
  3. Highlight the entirety of the source-code, then select the following from the BabelPad menu:
    • Convert…
    • Normalization Form…
    • NFKD (canonical decomposition with compatibility characters replaced)
  4. Save the document as UTF-8 (unless you have a specific reason for saving it in another format) and be sure to un-check “Byte Order Mark”. Our knowledgeable friends at the W3C warn us: it is best not to save Web Documents with the Byte Order Mark until it is more widely accepted as a standard for Unicode declarations.
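
For readers without BabelPad, the last two steps of the list above can be approximated in Python with the standard `unicodedata` module. This is a sketch under my own assumptions, not a description of what BabelPad does internally; the helper name `normalize_for_web` and the sample string are hypothetical.

```python
import codecs
import unicodedata

def normalize_for_web(text: str) -> bytes:
    """Hypothetical helper: NFKD-normalize text, then encode as UTF-8 without a BOM."""
    normalized = unicodedata.normalize("NFKD", text)
    # Python's "utf-8" codec never writes a BOM; "utf-8-sig" is the variant that does.
    return normalized.encode("utf-8")

sample = "\ufb01anc\u00e9"   # "fi" ligature (U+FB01) + "anc" + e-acute (U+00E9)
data = normalize_for_web(sample)

print(data)                                # b'fiance\xcc\x81'
print(data.startswith(codecs.BOM_UTF8))    # False
```

The output shows what NFKD does: the ligature is replaced by plain “f” and “i”, and the accented letter is split into “e” plus a combining acute accent (U+0301). That replacement cannot be undone, so it is worth keeping a copy of the original file.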

Step Four : Verification
Now it’s just up to you to double-check for any extra lines or unusual characters in your output (such as [  ] , one of the tell-tale signs).
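
Double-checking by eye is tedious, so here is a small Python sketch that lists suspicious characters with their line and column. Which characters count as suspicious is my own assumption (the replacement character, a stray byte order mark, and control characters other than tab and carriage return), and `find_suspects` is a hypothetical name.

```python
import unicodedata

def find_suspects(text: str):
    """Return (line, column, code point) for characters that rarely belong in markup."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            is_marker = ch in ("\ufffd", "\ufeff")   # replacement char, BOM
            is_control = unicodedata.category(ch) == "Cc" and ch not in "\t\r"
            if is_marker or is_control:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits

sample = "\ufeff<p>ok</p>\ncaf\ufffd\x92s"
for hit in find_suspects(sample):
    print(hit)
# (1, 1, 'U+FEFF')
# (2, 4, 'U+FFFD')
# (2, 5, 'U+0092')
```

An empty result does not prove the encoding is right, but any hit is worth a look before you move on to the validator.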

Now go to the W3C Validator and verify that your markup is compliant, and you’ve properly declared your Character Set!

GOOD LUCK!


NOTES:

In my personal experience with Character Encodings, one of the documents I’ve found to pose a difficulty, if modified from its original form, is the official PHP.NET documentation (English). The documentation is available for download in HTML format, as separate HTML documents which you can copy to a folder under your own local server directory structure (IIS, Apache, etc.) for off-line reference. The first thing you may notice about the downloaded documentation, however, is that it does not come complete with an accompanying stylesheet (last I checked). Though the documentation is perfectly useful in its raw form, I have linked those sections which I review to my own local stylesheet copy.

When I open one of the HTML files, I find that the lines of code are quite difficult to read, due to a seemingly distorted interpretation (by the editor) of the source code. Being perhaps a bit overly conscious of this mis-wrapped code, I feel compelled to “Tidy” it, using HTML Tidy as a source-code re-wrapping tool in any of a number of the FREE editors on my system (such as Notepad++, or PSPad). After Tidy makes the code look all Pretty Print ready, I save the document using UTF-8, the character set declared in the PHP.NET documents. Strangely, I find that when the document is loaded from the server in a browser (e.g. http://localhost.localdomain/php_manual_en/… ), this procedure, attentive to the character encoding all the while, does not always produce the same glyphs, or character output, as the document did prior to my editing it. Why?

The quick answer is that the editor has (most likely as a default function of its Unicode interpretation) modified the characters within the document. In other words, when the editor opened the document, some character therein was not properly DE-coded by the editor, and that character was replaced with whatever character the software chose to put in its place. This result is evident when the document is viewed in a browser (on the local server) where, depending upon the browser, a black diamond with a question-mark symbol in the center, or maybe just a hollow square, appears. The unusual character is there because the browser doesn’t know what it’s supposed to do with the random characters which the text editor has used in place of the original encoding references (i.e. NCRs, Entities, or <control> characters).
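
I can’t say exactly what any particular editor does internally, but the general effect is easy to imitate in Python. The sketch below assumes an editor that opens a file as UTF-8, substitutes anything it cannot decode, and then saves; the sample bytes are my own.

```python
# Bytes as they sit on disk, actually ISO-8859-1: "naïve ©"
raw = "na\u00efve \u00a9".encode("iso-8859-1")   # b'na\xefve \xa9'

# "Open" the file the way a UTF-8-only editor might, substituting bad bytes.
opened = raw.decode("utf-8", errors="replace")

# "Save" it again as UTF-8.
saved = opened.encode("utf-8")

print(saved == raw)                     # False: the file has changed
print(saved.count(b"\xef\xbf\xbd"))     # 2 (each lost character is now U+FFFD)
```

Once the file is saved, the original bytes are gone; nothing in the saved file says which characters the two replacement marks used to be.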
[Figure: encodings]

As mentioned above, I’ve found that the editor BabelPad does NOT alter the original character encodings. However, it does not offer the Tidy, PrettyPrint code re-wrapping option. So, it becomes a mix-n-match operation for me to get the results I want in the source-code while maintaining the original, unaltered document character encodings! Ha! …and you thought it was difficult, learning to code XHTML by hand!

EditPlus: a superb text-editor with excellent Unicode support, full-featured integrated FTP client, and simple project management, provides the option to add, ignore, or remove the BOM upon saving the current document.

W3C I18N FAQ: Display problems caused by the UTF-8 BOM. Deborah Cawkwell, BBC World Service. Modified by: Richard Ishida, W3C. Rev. 2007-07-17. Available at: http://www.w3.org/International/questions/qa-utf8-bom ; Accessed: 2007-08-17
