File encoding (UTF-8, ASCII)

I hope this will shed some light, even though this entry was written quickly during a lunch break and is more of a rough draft! It may contain wrong or incomplete information.

Notes

While reading this entry, keep these notes in mind; they may correct mistakes.

Mikero: text files read by the Windows OS are assumed to be in the ‘local code page’ unless a BOM is specified. This includes UTF-8 encoded SBCS. Without the BOM, Windows assumes it’s cp1252, NOT UTF-8.
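To make Mikero's note concrete, here is a small sketch of what happens when UTF-8 bytes without a BOM are read as cp1252. The sample word is my own choice for illustration, and cp1252 is assumed to be the local code page, which depends on the Windows locale.

```python
# Assumption: the reader falls back to cp1252 because no BOM is present.
raw = "café".encode("utf-8")      # 'é' becomes the two bytes C3 A9

wrong = raw.decode("cp1252")      # each byte is mapped to its own character
right = raw.decode("utf-8")       # the two bytes are read as one character

print(wrong)   # cafÃ©
print(right)   # café
```

As long as the file only contains ASCII characters, both readings give the same text, so the problem only shows up with accented or other non-ASCII characters.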

Foreword

Yesterday, a community member (Kllrt) reported an issue to me about file encoding in Poseidon Tools. The question was:

How is it possible that Poseidon writes a file in ASCII when I saved it with UTF-8 encoding?

Every decision I make, every feature I write, is based on the “standards”, “your requests & feedback” and sometimes “RFCs”.

Here, I will skip the BOM (Byte Order Mark). One reason, among others, not to use the BOM is that software expecting pure ASCII won’t understand the BOM and will fail. For those who want to go deeper into this topic, I invite you to follow this link (RFC 3629).

For your information:
UCS = Universal Character Set, standard ISO-10646
UTF = UCS Transformation Format
UTF-8 = UCS Transformation Format over 8 bits
UTF-16 = UCS Transformation Format over 16 bits
UTF-32 = UCS Transformation Format over 32 bits

Basics of the text encoding

Each existing text encoding has its own reference table, used to convert a binary value to a character (whatever notation you use for the value: binary, hex, octal…). One of the first tables was ASCII (everybody knows that one). It’s a reference table composed of 128 entries, numbered 0 to 127, where each entry represents a character. The next question is, what’s the relation between the ASCII table and UTF-8? The answer is very simple: UTF-8 is an extension of ASCII, meaning that the first 128 characters of UTF-8 are exactly the same as in ASCII and are stored as the same single bytes.

Let’s take an example, the character A:

  • ASCII:
    • Output: A
    • Scalar value: 65 (%01000001)
    • Hexadecimal: $41 (U+0041)
  • UTF-8:
    • Output: A
    • Scalar value: 65 (%01000001)
    • Hexadecimal: $41

You can do this for each of the first 128 characters of both tables (e.g. the first one, [NULL] = %00000000 = $00, or the last one, [DEL] = %01111111 = $7F), so as long as you don’t use a character specific to the UTF-8 table, there is no difference.
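The comparison above can be checked for the whole ASCII range with a short sketch. The variable names are mine, and the 'é' at the end is just an example of a character outside ASCII.

```python
# Build a string with every ASCII code point (0..127) and encode it both ways.
ascii_text = "".join(chr(cp) for cp in range(128))
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(same_bytes)                  # True
print("A".encode("utf-8").hex())   # 41
print("é".encode("utf-8").hex())   # c3a9 (two bytes, both above $7F)
```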

What’s going on with Poseidon Tools?

Its behavior is:

As long as there is no non-ASCII character inside your file, it will be saved with the ASCII encoding because there is no need to use UTF-8.

Why was this behavior chosen?

The reason is simple: the ASCII table has only 128 entries and they are the very same as in UTF-8, so Poseidon Tools uses the file content to determine the encoding of the current file.

BTW: When the BOM is present, there is a zero-width no-break space (U+FEFF) at the very beginning of the file, stored as the bytes $EF $BB $BF, which signals that the following text is encoded with UTF-8. For example, if you open a UTF-8 file with a BOM as ISO-8859-1, you will see this invisible character as: ï»¿
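Here is a quick sketch of the BOM bytes and of how they look under different decoders. The "Hello" payload is only an example; `utf-8-sig` is Python's codec name for "UTF-8 with an optional BOM".

```python
import codecs

# U+FEFF encoded in UTF-8 is the three bytes EF BB BF.
data = codecs.BOM_UTF8 + "Hello".encode("utf-8")

as_latin1 = data.decode("iso-8859-1")   # BOM bytes become three visible characters
as_utf8_sig = data.decode("utf-8-sig")  # BOM is recognized and stripped
as_utf8 = data.decode("utf-8")          # BOM stays as an invisible U+FEFF

print(data[:3].hex())   # efbbbf
print(as_latin1)        # ï»¿Hello
print(as_utf8_sig)      # Hello
```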

As a consequence, until the text streamer reads a non-ASCII character, it sticks with the ASCII table because you cannot tell the difference.
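The idea can be sketched as follows. This is not the actual Poseidon Tools code, only a minimal illustration of content-based detection; the function name `guess_encoding` and the returned labels are hypothetical.

```python
def guess_encoding(data: bytes) -> str:
    """Guess an encoding label from the bytes alone (hypothetical helper)."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (BOM)"          # explicit signature at the start
    if all(byte < 0x80 for byte in data):
        return "ASCII"                # nothing above $7F: same bytes either way
    try:
        data.decode("utf-8")          # check that the bytes are valid UTF-8
    except UnicodeDecodeError:
        return "unknown"
    return "UTF-8"


print(guess_encoding(b"class CfgPatches {};"))       # ASCII
print(guess_encoding("déjà vu".encode("utf-8")))     # UTF-8
```

A file reported as ASCII here is also a perfectly valid UTF-8 file, which is why an editor showing "ASCII" after you chose "UTF-8" has not changed a single byte.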
