Character Set & Character Encoding
Are they one and the same? How does Charset differ from a Character Set, for example, and how does either relate to the Character Encoding used in a document? Is this a lot of fuss over nothing, because the terms refer to precisely the same thing? Or are the terms similar because their meanings differ only somewhat?
Where do we most often encounter the terms character set and character encoding? Do they always appear together because they share something in common, or have they nothing in common at all?
Perhaps we might narrow it down by defining what each term does not mean, even if we remain uncertain of a proper definition. Reader, if you find the topic confusing, please don’t worry: you’re not alone. However, I must encourage you to pay attention to the various issues discussed here and, most of all, if you are uncertain, to study until you can describe in your own words the distinct meanings of Charset and Character Encoding. You’ll be glad you did.
i18n · NCR’s · entities · ASCII…
just what on Earth does it all mean!? Are they similar? Is it important to understand any differences between them? Is there a time and a place for one, but not the other? Who, or what is the leading authority on the subject?…
It is critical to keep separate the notion of a simple table of characters and their numbers, i.e. a coded character set, separate from the various algorithms to encoded sequences of characters, i.e. character encoding schemes. This separation allows a representation of a text entity which is consistent with both the MIME and SGML specifications.
(Dan Connolly, “Character Set” Considered Harmful)
Demystifying Unicode Character Encodings Markup
Not that Mr. Connolly’s explanation is at all ambiguous, but in case you didn’t get the gist of it from his own words, I’ll attempt to bring it down to Earth a bit for you. The quote asks the web content author, for example, to keep two things apart.
The first is the “Character Set”, or more exactly the coded character set. Imagine a toolbox containing only a particular set of numbered screws and nails; in this case the contents are characters (letters, digits, symbols), each with its own number. Note that the toolbox holds abstract characters, not shapes: the glyph your reader finally sees is supplied by a font.
The second is the “Character Encoding”. To maintain the analogy, this is another toolbox containing the screwdriver and hammer for making use of what came from the first one. It is the rule that turns each character’s number into the bytes stored in your file, and that the browser runs in reverse to turn those bytes back into characters.
A named entity in your HTML is a third thing again: a way of writing a character’s name or number in plain ASCII, which the browser resolves to the character and prints as the letter or other symbol you wish your reader to see when viewing your page.
Confused? Don’t worry, so are many others, including myself, by the very nature of its complexity. Hang in there, and you’ll get enough of it to stay afloat.
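To make the separation concrete, here is a minimal Python sketch (Python is my choice for illustration, and the helper name `describe` is hypothetical). It shows one character, its single number in the coded character set, and the different byte sequences that three encodings produce for that same number.

```python
# A coded character set maps each character to a number (a code point).
# A character encoding maps that number to a sequence of bytes.

def describe(char: str) -> dict:
    """Return the code point of `char` and its bytes (as hex) under a few encodings."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "utf-8": char.encode("utf-8").hex(),
        "utf-16-be": char.encode("utf-16-be").hex(),
        "iso-8859-1": char.encode("iso-8859-1").hex(),
    }

# "\u00e9" is LATIN SMALL LETTER E WITH ACUTE, written as an escape so the
# example does not depend on how this file itself was saved.
info = describe("\u00e9")
for key, value in info.items():
    print(f"{key:>10}: {value}")
```

The code point stays the same in every row; only the bytes change. That is the distinction the quote insists on.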
HTML Entities & Numerical Character Reference
The Web Developer Toolbar contains an ASCII Character Reference table.
PHP: Character Encodings
If you require a Character Encoding other than ‘Western European’, you might investigate this issue in more depth: according to various notes in the Manual at PHP.net, PHP defaults to the ISO 8859-1 Character Set in many functions, such as htmlentities(), when a script must process a string in which characters are encoded or decoded. (This describes PHP before version 5.4; from 5.4 onward the default for htmlentities() is UTF-8.)
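Why the assumed encoding matters can be shown with a small sketch. This is Python, not PHP, and `to_entities` is a hypothetical stand-in of my own, not PHP's htmlentities(): it only replaces non-ASCII characters and does not escape `&`, `<` or `>`. It converts the same bytes to entities twice, once with the right assumption about their encoding and once with the wrong one.

```python
from html.entities import codepoint2name

def to_entities(raw: bytes, assumed_encoding: str) -> str:
    """Decode raw bytes with the assumed encoding, then replace every
    non-ASCII character with a named entity, or a decimal NCR if no name exists."""
    text = raw.decode(assumed_encoding)
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                         # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &eacute;
        else:
            out.append(f"&#{cp};")                 # decimal NCR fallback
    return "".join(out)

raw = "caf\u00e9".encode("utf-8")       # the bytes as saved in a UTF-8 file
right = to_entities(raw, "utf-8")       # assumption matches the file
wrong = to_entities(raw, "iso-8859-1")  # assumption does not match the file
print(right)
print(wrong)
```

With the wrong assumption, the two UTF-8 bytes of one accented letter are treated as two separate Western European characters, and each gets its own entity.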
For issues of escape sequences (escaping special characters for proper decoding at runtime), it is important to know the differences between Single and Double Quotes.
Unicode in PHP
DTD: Charset Encoding Declaration
Issues of Internationalization (aka i18n) in Document Type Definitions (DTD).
How to Identify the Character Set Encoding
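Even when the author uses the most appropriate Entity or Numerical Character Reference (NCR) in his or her markup, the Character Encoding in which the document was saved must also be declared in the document (see this brief tutorial). A Named Entity or NCR is written in plain ASCII and always refers to a Unicode character number, so it survives most encoding mix-ups; any character typed directly into the file does not. For example, in HTML, you may have declared the encoding, called the Charset, in the document <head> with a META tag: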
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
It is through such a Charset declaration that the browser knows how to decode the bytes of the document into characters. Choosing the proper characters in your markup is not enough if the document does not also identify the Charset, which tells the browser which Character Encoding was used to save the original document. If the declared Charset matches the encoding the file was actually saved in, the browser will display the document as intended; otherwise it is a guessing game in which the browser may or may not display what you wanted your reader to see.
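The guessing game can be sketched in a few lines of Python. The helper `declared_charset` and its `windows-1252` fallback are assumptions of mine for illustration; real browsers follow a more elaborate detection procedure. The sketch reads the declaration out of the raw bytes, then decodes the page once as declared and once with a wrong guess.

```python
import re

def declared_charset(html_bytes: bytes, fallback: str = "windows-1252") -> str:
    """Look for charset=... near the start of the document.
    The declaration itself is ASCII, so it can be read before decoding the rest."""
    head = html_bytes[:1024].decode("ascii", errors="replace")
    match = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else fallback

# A tiny page saved as UTF-8, containing one directly typed non-ASCII letter.
page = (
    '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >'
    "<p>na\u00efve</p>"
).encode("utf-8")

charset = declared_charset(page)
decoded_ok = page.decode(charset)            # decoded as declared
decoded_guess = page.decode("windows-1252")  # decoded with a wrong guess

print(charset)
print(decoded_ok)
print(decoded_guess)
```

The wrong guess does not raise an error. It silently turns one letter into two unrelated characters, which is why the mismatch is easy to miss until a reader reports it.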
Web Standards and i18n: Charset in the Big Picture
One of the quintessential resources on the subject is the World Wide Web Consortium.
The W3C offers a very practical guide to i18n (internationalization) in document markup. Everyone who publishes HTML should be aware of the information in this concise tutorial. It covers what you need to ensure a properly declared Document Encoding.
HTML Entities and NCR Quick Reference:
Here I plan to maintain a reference of those characters which I’ve used, perhaps only rarely, but whose references I don’t want to lose now that I’ve found them. I hope that you are able to use them as well. Please feel free to copy and paste the following <dl> list into your own HTML if you find it helpful.
- U+00AF : MACRON : [ Decimal NCR - &#175; | Hexadecimal NCR - &#xAF; ]
- I like to think of this one as the ‘over¯score’, since it seems to be the exact opposite of our more popular Under_score. It ¯ looks ¯ like ¯ this ¯ !
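As a quick check on the entry above, this short Python sketch uses the standard library's `html.unescape` to confirm that the decimal NCR, the hexadecimal NCR and the named entity all resolve to the same character. It then rebuilds both NCR forms from the character's number. The variable names are mine.

```python
import html

# Three ways of writing U+00AF MACRON in markup, all in plain ASCII.
references = ["&#175;", "&#xAF;", "&macr;"]
decoded = [html.unescape(ref) for ref in references]

# Going the other way: build both NCR forms from the character's code point.
macron = "\u00af"
decimal_ncr = f"&#{ord(macron)};"
hex_ncr = f"&#x{ord(macron):X};"

print(decoded)
print(decimal_ncr, hex_ncr)
```

Because each reference names the character by number or by name, the result does not depend on which encoding the surrounding document was saved in.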
Character Encoding: Recommended Resources
Learn more about Escape Sequences, [Numerical] Character References (NCR’s), and Character Encodings at the following Recommended Resources:
- Essential Software: save your document using the proper encoding for your language, or codepage standard. Make a PayPal contribution to BabelStone software, and download the ‘Pad and the ‘Map! (Andrew West, benevolent author of BabelStone Software, is a Unicode Standard contributor and a recognized authority on the representation of Chinese and East-Asian glyphs in Unicode. He is also a seemingly all-around good guy!)
- Official Names of Character Sets : Recommended by the W3C!
- GNU dot org : Escape Sequences
- IT and Communication: Tutorial on Character Code issues
- Wikipedia: Character Encodings