Wednesday, January 11, 2017

Creating extended ASCII file in bash and maybe powershell

Does anyone remember extended ASCII (as opposed to UTF-8)? If you have never heard of it, we are not talking about a proper character set that supports Russian or Japanese. All we are dealing with here is iso-8859-1, whose table can be found here.

I have a document in that format that I need to convert to something else; if this reminds you of some 8bitmime issues we talked about, well, let's just say we could have used this to create the test file. With that said, the current situation is that I wrote a script to manipulate it, and the script is not preserving that format; that we can talk about in a future post.

Bottom line is I need to create a small test file, so that later I can throw both it and the script output at hexdump.

My test file will have only 3 lines,

Olivenöl
Bayerstraße 22
München

Nothing fancy; just enough to use one extended ASCII character per line. Now let's try to create the little file. Just to be different, instead of starting on Linux we will do most of the attempts in OSX. Once we have a working system, we can see if it also works on Linux.

Attempt #1

How about we do the lazy thing and just cut-n-paste the 3 lines above into a text file we opened using vim, notepad++, or some pico clone? Done. Now let's see what it looks like:

bash-3.2$ cat /tmp/chartest 
Olivenöl
Bayerstraße 22
München
bash-3.2$ 

That looks very promising. In fact, this might end up being a very short article. Before I publish it, should we see what hexdump thinks of it?

bash-3.2$ hexdump -Cv /tmp/chartest 
00000000  4f 6c 69 76 65 6e c3 b6  6c 0d 0a 42 61 79 65 72  |Oliven..l..Bayer|
00000010  73 74 72 61 c3 9f 65 20  32 32 0d 0a 4d c3 bc 6e  |stra..e 22..M..n|
00000020  63 68 65 6e 0d 0a                                 |chen..|
00000026
bash-3.2$ 

Correct me if I am wrong, but it seems each extended ASCII character is taking two bytes to be represented instead of just one. For instance, ö is being represented by the two bytes 0xC3 0xB6. That looks like UTF-8, not extended ASCII (if you want to know what to look for, all three of our characters start with 0xC3). Also, it is using carriage return (CR, 0x0D in hexadecimal) and line feed (LF, 0x0A) characters to separate the lines. That is very Windowsy, not OSX/Linux style, where lines are separated by the line feed (0x0A) character only.
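To make the size difference concrete, here is a minimal sketch that counts the bytes in both encodings of the first word. It uses printf with octal escapes so the snippet itself stays plain ASCII; the variable names are made up for illustration.

```shell
# "Olivenöl" as UTF-8: ö is the two bytes 0xC3 0xB6 (octal \303 \266).
utf8_bytes=$(printf 'Oliven\303\266l' | wc -c | tr -d '[:space:]')

# "Olivenöl" as ISO-8859-1: ö is the single byte 0xF6 (octal \366).
latin1_bytes=$(printf 'Oliven\366l' | wc -c | tr -d '[:space:]')

echo "UTF-8: $utf8_bytes bytes, ISO-8859-1: $latin1_bytes bytes"
```

The UTF-8 version should come out exactly one byte longer, since the word contains a single non-ASCII character.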

Attempt #2

What if we paste the lines onto the terminal and use echo to write that to the test file? Well, let's make a single-line test file and see what happens:

bash-3.2$ echo "Olivenöl" > /tmp/chartest 
bash-3.2$ hexdump -Cv /tmp/chartest 
00000000  4f 6c 69 76 65 6e c3 b6  6c 0a                    |Oliven..l.|
0000000a
bash-3.2$ 

Still using two bytes to represent ö; at least it did not add a CR. Now this is getting annoying; is there anything else we can try?

Attempt #3

It turns out there is, and we do not need any extra stuff. You see, echo has this -e option that allows you to pass a character by its hexadecimal code. From the extended ASCII table we know that ö = 0xF6 and ß = 0xDF (just to pick two examples). We also know that CRLF = 0x0D 0x0A. I know I whined about that before, but the reason is I want to be able to decide when I want to use those characters and when I do not, as opposed to having some program or script make the choice for me.

Let's try again, this time passing the extended ASCII characters explicitly:

bash-3.2$ echo -e "Oliven\xf6l\x0d\x0aBayerstra\xdfe 22\x0d\x0aM\xfcnchen\x0d" > /tmp/chartest
bash-3.2$ 
bash-3.2$ cat /tmp/chartest 
Oliven�l
Bayerstra�e 22
M�nchen
bash-3.2$ hexdump -Cv /tmp/chartest 
00000000  4f 6c 69 76 65 6e f6 6c  0d 0a 42 61 79 65 72 73  |Oliven.l..Bayers|
00000010  74 72 61 df 65 20 32 32  0d 0a 4d fc 6e 63 68 65  |tra.e 22..M.nche|
00000020  6e 0d 0a                                          |n..|
00000023
bash-3.2$ 

That's more like it: only one byte is used to represent each character in the file. Isn't it interesting that when we cat the file, the extended characters show up as �? Presumably the terminal expects UTF-8, and a lone byte such as 0xF6 is not valid UTF-8. But if hexdump says they are there, that is good enough for me.
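One caveat: echo -e is not portable. Some shells (dash, for instance) print the -e literally and do not understand \x escapes. As a hedged alternative, the sketch below writes the same three lines with printf and octal escapes, then dumps the bytes with od so they can be compared against the hexdump listing. It writes to a scratch file from mktemp rather than /tmp/chartest; that path and the variable names are only for illustration.

```shell
# Octal escapes: \366 = 0xF6 (ö), \337 = 0xDF (ß), \374 = 0xFC (ü).
# \r\n is CRLF (0x0D 0x0A), written explicitly at the end of every line.
testfile=$(mktemp)   # scratch file for this sketch, not the article's /tmp/chartest
printf 'Oliven\366l\r\nBayerstra\337e 22\r\nM\374nchen\r\n' > "$testfile"

# Dump every byte as two hex digits and squeeze the result onto one line.
hex=$(od -An -v -tx1 "$testfile" | tr -d ' \n')
echo "$hex"

rm -f "$testfile"
```

Unlike echo, printf does not append a newline on its own, so the final line feed has to be spelled out. The printed hex string should match the bytes in the hexdump listing one for one.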

What about powershell?

Even though the name of this blog implies Unix, we use enough powershell that we might as well see if we can do the same there. But we have to accept that we start with a bit of a handicap: powershell really wants to write UTF-8 or Unicode instead of extended ASCII/iso-8859-1. Let me show you what I mean by trying to create a small file with just one single word in it, Olivenöl. As we saw before, ö = 0xF6 = 246. And that should still be true in powershell; let's find out:

PS > 'Oliven' + [char]246 + 'l'
Olivenöl
PS >

Looks like we are getting somewhere, right? For our next trick, we will save that to a file (| out-file .\chartest.txt is equivalent to doing > .\chartest.txt):

PS > 'Oliven' + [char]246 + 'l' | out-file .\chartest.txt
PS > cat .\chartest.txt
Olivenöl
PS >

Hey chief! It seems to be working fine. Why are you making this huge drama about this? That is a very good question. I will let dear old hexdump do the talking:

$ hexdump -Cv chartest.txt
00000000  ff fe 4f 00 6c 00 69 00  76 00 65 00 6e 00 f6 00  |..O.l.i.v.e.n...|
00000010  6c 00 0d 00 0a 00                                 |l.....|
00000016

$

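Each character is now represented by 2 bytes, and the file starts with the two extra bytes ff fe. Smells like Unicode, right? (UTF-16 little-endian with a byte order mark, to be precise.) OK, smart guy. Now just force it to save as ASCII then. Will do: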

PS > 'Oliven' + [char]246 + 'l' | out-file -encoding ASCII .\chartest.txt
PS > cat .\chartest.txt
Oliven?l
PS >

And hexdump

$ hexdump -Cv chartest.txt
00000000  4f 6c 69 76 65 6e 3f 6c  0d 0a                    |Oliven?l..|
0000000a

$

It converted the ö into ?. Helpful, isn't it? The Microsoft Scripting Guys forum pretty much tells you to save the file as Unicode or UTF-8 and then convert it somehow. Far be it from me to disagree with them, at least in this article, since it makes for a great cliffhanger. In a future article we will talk about how to get extended ASCII properly in powershell, just like we did in bash. It will be a bit longer but doable.
