Another quick one: we had a text file containing accented words and had to figure out which character encoding it used. You see, for a while the "standard" text format for computers was ASCII, more precisely 7-bit ASCII (characters 0 to 127 in decimal), which was created in the 1960s and whose character set assumed the English language only. Before some of you get all excited, please note that ASCII stands for American Standard Code for Information Interchange, so it stands to reason they picked English. As this standard was adopted by other countries, it became clear that some of them used characters that could not be represented with only those 128 codes, and that led to many attempts to solve the problem. One of the earliest was to extend the original ASCII table with another 128 possible characters, which after a few adventures evolved into ISO-8859-1, a.k.a. ISO-Latin-1. A later and different approach is UTF-8, which keeps the ASCII codes as single bytes and uses sequences of two or more bytes for everything else. There are other character sets, but the principle is the same.
Thanks for the history lesson, but how about getting to the point? How do we identify the text encoding of a given file? Let's answer that by using a couple of examples.
Example 1
Let's say we got a text file that has a name, say Luis de La Peña, in it somewhere. Depending on how helpful your text viewer is, it might show either "ñ" or some garbled character; the latter happens if the viewer only knows 7-bit ASCII or expects a different encoding than the one the file uses. For instance, cat would spit out something like this on my Ubuntu laptop:
    bash-3.2$ cat text_test1
    Luis de La Pe�a
    bash-3.2$
Don't know about you, but that "�" does nothing for me; it's just a placeholder for a byte that could not be displayed as a character. Let's try something else: since the title of this article mentions hexdump, I propose to look at the file through that program (I am telling it to print the hex value of each byte and then the ASCII representation of those bytes):
    bash-3.2$ hexdump -Cv text_test1
    00000000  4c 75 69 73 20 64 65 20  4c 61 20 50 65 f1 61 20  |Luis de La Pe.a |
    00000010  0a                                                |.|
    00000011
    bash-3.2$
The first thing we notice is that it too does not know how to show "ñ", so it is using "." as a placeholder. That is a different character from the one cat showed on my Ubuntu box; just deal with it. What we really care about is that the hex side tells us that
0xF1 = ñ
That is very important: it tells us that "ñ" is stored as a single byte with a value above 127. UTF-8 never does that (it would use two bytes for "ñ", and a lone 0xF1 followed by a plain "a" is not a valid UTF-8 sequence), so UTF-8 is out. So, we need to look for an 8-bit charset. After hours of agonizing search and heavy drinking, we find that the extended ASCII and/or the ISO-8859-1 tables match all characters (don't believe me? Check the other characters in the text, including the space). Not bad at all: now we can read the text and convert it to a different character set.
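If you would rather not stare at tables, the same elimination can be sketched in a few lines of Python. This is only a rough check, assuming the file holds exactly the bytes shown in the hexdump output; the helper name try_decode is made up for this illustration.

```python
# Bytes of text_test1, copied from the hexdump output of Example 1.
raw = bytes.fromhex("4c 75 69 73 20 64 65 20 4c 61 20 50 65 f1 61 20 0a")


def try_decode(data, encoding):
    """Hypothetical helper: decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xf1 is above 127, so plain ASCII rejects it.
as_ascii = try_decode(raw, "ascii")
# In UTF-8, 0xf1 would have to be followed by continuation bytes, not by "a".
as_utf8 = try_decode(raw, "utf-8")
# ISO-8859-1 maps 0xf1 straight to "ñ".
as_latin1 = try_decode(raw, "iso-8859-1")

print(as_ascii, as_utf8, repr(as_latin1))
```

One caveat: decoding as ISO-8859-1 never fails in Python, because every byte value maps to some character. A successful decode proves nothing by itself; what matters is that the result reads as the name we expected.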
Example 2
So we feel all good about ourselves, and we need another example. This time, I will steal a real-life example from a previous article, where we had a text containing Italienisches Olivenöl which would cause DKIM email body authentication failures. Yeah, something as seemingly harmless as a character set can create some annoying problems.
As before, we begin by asking cat what it thinks about the text:
    bash-3.2$ cat text_test2
    Italienisches Olivenöl
    bash-3.2$
Hold on right there. Why is cat able to print that "ö" but could not print that "ñ" earlier? Now you begin to see some of the limitations of cat compared to hexdump for these probulations: cat just copies the bytes to the terminal, and what you see depends on which character set the terminal expects, so some encodings display fine and others do not. hexdump knows nothing about character sets: it only prints printable ASCII; anything else becomes a ".". Of course, it would suck to use hexdump all the time, so you need to know your tools and when to use each one. Since we talked about hexdump, let's see what it sees:
    bash-3.2$ hexdump -Cv text_test2
    00000000  49 74 61 6c 69 65 6e 69  73 63 68 65 73 20 4f 6c  |Italienisches Ol|
    00000010  69 76 65 6e c3 b6 6c 0a                           |iven..l.|
    00000018
    bash-3.2$
Some kind of funny business happening in the second row:
- All the English-looking characters are not only represented by one 8-bit value each, but those values are also the same ones we saw earlier in the ASCII example:
    0x69 = i
    0x76 = v
    0x65 = e
    0x6e = n
    0x6c = l
- There are two "." characters (0xc3 and 0xb6) where "ö" should be.
- There is a 0x0a after the "l".
If we look at any UTF-8 conversion table, we will see that 0xc3b6 is the UTF-8 hex for "ö" (the Unicode code point is U+00F6).
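If you do not have a conversion table handy, the mapping from 0xc3 0xb6 to U+00F6 can be worked out by hand. Here is a small Python sketch of that arithmetic; it only covers two-byte sequences, and the variable names are mine.

```python
# The two bytes hexdump showed where "ö" should be.
lead, cont = 0xC3, 0xB6

# A two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy.
assert lead >> 5 == 0b110  # lead byte announces a two-byte sequence
assert cont >> 6 == 0b10   # second byte is a continuation byte

# Strip the marker bits and join the payload bits: xxxxx yyyyyy.
code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)

print(f"U+{code_point:04X}", chr(code_point))  # U+00F6 ö

# Cross-check against Python's own UTF-8 codec.
assert bytes([lead, cont]).decode("utf-8") == chr(code_point)
```

As a side note, 0xF6 is also the single byte ISO-8859-1 uses for "ö", just like 0xF1 was the byte for "ñ" in Example 1.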
OK, smartypants, what about the 0x0a after the "l"? Yes, that. You might not have noticed it was also in text_test1 in the first example. That is the line feed character, which in Linux means end of line.
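Other systems mark the end of a line differently, so the same kind of byte-level peek tells you where a file probably came from. A minimal Python sketch, assuming we only care about the three common conventions; line_ending is a made-up helper, not something from a library.

```python
def line_ending(data):
    """Hypothetical helper: name the line-ending convention found in some bytes."""
    if b"\r\n" in data:
        return "CRLF (0x0d 0x0a), typical of Windows"
    if b"\n" in data:
        return "LF (0x0a), typical of Linux and other Unix-like systems"
    if b"\r" in data:
        return "CR (0x0d), used by classic Mac OS"
    return "no line ending found"


# Tail of text_test2 from the hexdump output: "l" followed by 0x0a.
print(line_ending(bytes.fromhex("6c 0a")))
```

The CRLF check has to come first, since a CRLF file also contains plain 0x0a bytes.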
Insert Boring Conclusion Here
I hope this was useful to you; I thought it was fun, and I even learned a few things while writing this. The thought process here is similar to what you would do when, say, examining an encrypted document: try to find known patterns to work with before going after the really unknown stuff.