Christopher Warner Studies and thoughts, usually in coherent fashion.

19 Feb 09

HTML Entities and Encodings

There seems to be some confusion around the web about HTML entities and encodings. HTML entities are not encodings; they are just a way of writing individual characters, and they have nothing to do with the overall encoding of an HTML document, or of text in general.

Oddly enough, Wikipedia explains it best:

Numeric references always refer to Universal Character Set code points, regardless of the page's encoding. Using numeric references that refer to UCS control code ranges is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately then HTML character references are usually only required for a few special characters (or not at all if a native Unicode encoding like UTF-8 is used).
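To make the distinction concrete, here is a small sketch in Python (my choice of language; nothing in the quote depends on it). It uses the standard library's html.unescape to show that character references resolve to the same code point however they are spelled, while the encoding only decides which bytes a literal character becomes. The variable names are made up for illustration.

```python
import html

# One character (e with acute accent, U+00E9) spelled three ways:
# a named entity, a decimal reference and a hexadecimal reference.
references = ["&eacute;", "&#233;", "&#xE9;"]
decoded = [html.unescape(ref) for ref in references]

# The references are plain ASCII text, so they look the same in any
# ASCII-compatible encoding. The literal character does not.
e_acute = "\u00e9"
utf8_bytes = e_acute.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = e_acute.encode("latin-1")  # one byte in ISO-8859-1

# The backward-compatibility quirk from the quote: a reference in the
# 80-9F range is read as the Windows-1252 character for that byte.
legacy = html.unescape("&#153;")
proper = html.unescape("&#8482;")  # the real code point for the trade mark sign

print(decoded, utf8_bytes, latin1_bytes, legacy == proper)
```

Python's html.unescape follows the browser behaviour here, so the forbidden "&#153;" still comes out as the trade mark sign rather than as a control character.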

Why do you want all your HTML documents in UTF-8?

IETF policy is that protocols for this here internet must be able to use UTF-8 for text. Here is a good outline of how to decide on a charset, from RFC 2277, "IETF Policy on Character Sets and Languages", a Best Current Practice document published in 1998:

3.2. How to decide a charset

When the protocol allows a choice of multiple charsets, someone must
make a decision on which charset to use.

In some cases, like HTTP, there is direct or semi-direct
communication between the producer and the consumer of data
containing text. In such cases, it may make sense to negotiate a
charset before sending data.

In other cases, like E-mail or stored data, there is no such
communication, and the best one can do is to make sure the charset is
clearly identified with the stored data, and choosing a charset that
is as widely known as possible.

Note that a charset is an absolute; text that is encoded in a charset
cannot be rendered comprehensibly without supporting that charset.

(This also applies to English texts; charsets like EBCDIC do NOT have
ASCII as a proper subset)

Negotiating a charset may be regarded as an interim mechanism that is
to be supported until support for interchange of UTF-8 is prevalent;
however, the timeframe of "interim" may be at least 50 years, so
there is every reason to think of it as permanent in practice.

So basically, if your text is a mix of different charsets, you'll usually have problems; if the whole document is encoded in UTF-8, every character you need to represent is available. In the web browser space, at least, all of the major browsers will be able to display your text. In other programs, or wherever you are storing your data, you may still have problems if you aren't doing your due diligence. In the end, making ALL of your text and data UTF-8 alleviates the problem completely. Hopefully, unlike what the RFC says above, it doesn't take 50 years for people to get this.
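Here is a rough sketch of what "intermingled charsets" looks like in practice, again in Python. The misread helper and the sample string are invented for the illustration: the text is written out as UTF-8 and then read back by something that assumes Windows-1252.

```python
def misread(text, written_as="utf-8", read_as="cp1252"):
    """Encode text in one charset and decode it in another (hypothetical helper)."""
    return text.encode(written_as).decode(read_as, errors="replace")

# Sample text with an accented letter, an en dash and a diaeresis.
original = "caf\u00e9 \u2013 na\u00efve"

# Writer and reader agree on UTF-8: the text survives untouched.
same = misread(original, written_as="utf-8", read_as="utf-8")

# Writer uses UTF-8, reader assumes Windows-1252: every non-ASCII
# character turns into two or three junk characters.
garbled = misread(original)

# The damage is reversible only if you know exactly which wrong
# charset was applied, which in real life you usually don't.
repaired = garbled.encode("cp1252").decode("utf-8")

print(same == original, garbled, repaired == original)
```

Nothing here is specific to Windows-1252; any pair of charsets that disagree about bytes above 7F will do the same kind of damage.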

This is pretty simple stuff. If you aren't encoding the text or data you display in a universal character set, it's simple: you are doing it wrong.

17 Feb 09

Article Search API from the New York Times

Welp, the New York Times is doing all of the cool stuff (in this case the Article Search API), and they have all of their articles in objects with associated metadata. They are the only, ONLY, dead-tree media company that gets exposing their content: allowing people to build tools, and reaching their customers the way their customers want to reach them. These APIs are the building blocks of their obvious future success.

UPDATE: Awesome; ahhh it hurts.. IT HURTS!!!
