Showing posts with label text. Show all posts
Showing posts with label text. Show all posts

Monday, August 29, 2016

Poor Man's Downconverting UTF-8 to ASCII7 in powershell

I will warn you this is really lousy and inelegant code. But I needed something quick. I will also warn that this was going to be focus primarily on doing the UTF-8 to ASCII7 conversion, but it went a bit sideways with me learning something new about Powershell. If you feel a bit disappointed, oh well.

I will do my best take the high road and not bitch and make snide comments about a certain software company whose product caused me to write this script. However, after rereading this article I think I failed miserably. You have been warned!

The Problem

So, I am dealing with yet a not so bright piece of software. Specifically, one that has a database whose fields for usernames, addresses, and even phone numbers and such things assume that those users are all living in the United States and have names like Bob or Sue or John or Travis (OK, maybe not that one since it has two syllables) who lived in towns like Cleveland or Tampa. This vendor developed the code in early 2000(!) hoping it would never have to deal with umlauts and accents and other characters. Probably they did that to save space since an ASCII7-only DB field is smaller than a UTF-aware one. Imagine how much money they would make their customers save on storage!

Then someone told their customers that there are lands past the dragons and the turtles, where strange people with mysterious habits and incomprehensible languages roam.

Like Sweden.

Imagine how shocked they were when now their software had to deal with names like Фёдор, Ljungström, Simón, 宮崎, and Häyhä and towns like Malmö. To address that, in 1987 a formal work group, which later became the Unicode Consortium was formed and, anfter much head-smacking, released its first version in 1991. Unfortunately, someone failed to mention that to this company. Fear not, however, for they were not the first and will not be the last to develop their code in ASCII and then realize they have to support other languages.

This is all nice and all, but we feed data to this program as text files. Which is in UTF-8 format since it contains people and places with names using non-ASCII characters. And as a result the database gets very unhappy.

The Solution

I called them to see if they had plan on supporting Unicode in the future. And they replied. Bottom line is they will not change their code to handle Unicode. So, if we are going to use this program we need to massage the input data.

As many times before in many different operating systems, if we are going to do this, it better be doable using some kind of scripting language. Being this Windows my language of choice will once again be Powershell. Before we write a single line of code, let's do something dangerously subversive and think about what we need to do.

  1. We need to detect the non-ASCII characters. Specifically we are talking about the characters that are not within the original 7bit ASCII (characters 0 to 127 in decimal, or 0x00 and 0x7F respectively in hexadecimal) table; we talked a bit before about detecting different files formats. From there we know the UTF-8 characters we are looking for are represented by a pair: "ö" has a UTF8 Hex of 0xc3b6. Do note both characters > 0x7F.

  2. Once we detect said character pair, we need to do something with them. We are going to convert them into a single or a combination of characters, and that is where it gets interesting. Take our friend ö: should it become o or oe? And the answer is it depends.

    Some countries came up with their own representation in ASCI7 of their non-ASCI7 characters. And they are not really universal. Also, if the import file uses some kind of fixed column formatting, we have to convert it to one single character. This is really not a coding problem; we can easily create a lookup table to replace the character. So I will just leave this here in the open for you to ponder deeply about. In this article we will cheat, and badly.

Let's talk about that cheating. Instead of bothering to do some proper converting, all we will do is replace the non-ASCII7 char pair with a single question mark (63 or 0x3F). This way, whoever sees the data will know we did something there and can (if they are the affected user) replace that "?" with whatever character they feel like. Talk about passing the hot potato around...

So, if we find a character whose decimal value is > 127, we can replace it with a "?" (decimal 63). Of course, being it UTF-8 the next character in the pair is also > 127. So, we need to replace the pair of characters with a single "?".

With that in mind, let's try coding that:

param($infile) 

# Convert UTF-8 document into ASCII7
function UTF2ASCII($encoded)
{
   $oldi = ''
   $converted = ''

   foreach ($line in $encoded)
   {
      # Convert each character in $line into a decimal number
      foreach($i in [int[]][char[]]$line)
      {
         # 0 <= ASCII7 <= 127
         if ($i -gt 127)
         {
            # KLUDGE: only change to "?" if previous char isn't.
            if ($oldi -ne 63)
            {
               $i = 63
            }
            else
            { continue }
         }
         $converted = $converted + [char]$i
         $oldi = $i
      }

   }

   return $converted
}

# Read in the contents of $infile and feed them to our function
UTF2ASCII(get-content $infile)

The if statements are the lousiest part of the code, specially the part that assumes if the previous char is "?" we must be dealing with the two-bit representation of the UTF-8 character; that is a rather tall assumption but I was in a hurry and could not think of an edge case. Now that I admitted my kludge, let's test it. The way you run this script is

utf2ascii.ps1 infile

which will output the modified text to the screen. If we want to save it into a file,

utf2ascii.ps1 infile > outfile

We also need a test file. I created it by gabbing a few paragraphs from the wikipedia page on the Automatgevär m/42 plus an extra line:

Olivenöl

Ag m/42, kitaip AG42, AG-42 ar Ljungman (šved. Automatgevär m/42) – švediškas savitaisis šautuvas, nuo 1942 m. iki 7-ojo dešimtmečio naudotas Švedijos armijos.

Šautuvą 1941 m. sukonstravo Erikas Eklundas iš kompanijos AB C. J. Ljungmans Verkstäder (Malmė, Švedija). Juos gaminti 1942 m. ėmė kompanija Carl Gustafs Stads Gevärsfaktori (Eskilstuna, Švedija). Švedijos armijai buvo pagaminta apie 30 000 Ag m/42. Tačiau šis šautuvas nebuvo pagrindinis Švedijos armijos šautuvas, juo buvo 6,5 mm pertaisomas šautuvas Mauser m/96.

Po kurio laiko buvo aptikti kai kurie šautuvo trūkumai (pvz., dujų vamzdelio rūdijimas), todėl 1953–1956 Švedijos armijos šautuvai Ag m/42 buvo modernizuoti ir pavadinti Ag m/42B. Modernizaciniai pakeitimai:

cat utftest.txt | wc -l (yes it is Linux, sue me) tells me that file is 7 lines long. So, let's run it and see what happens.

$ utf2ascii.ps1 infile
Oliven?l

Ag m/42, kitaip AG42, AG-42 ar Ljungman (?ved. Automatgev?r m/42) ? ?vedi?kas s
avitaisis ?autuvas, nuo 1942 m. iki 7-ojo de?imtme?io naudotas ?vedijos armijos
.

?autuv? 1941 m. sukonstravo Erikas Eklundas i? kompanijos AB C. J. Ljungmans Ve
rkst?der (Malm?, ?vedija). Juos gaminti 1942 m. ?m? kompanija Carl Gustafs Stad
s Gev?rsfaktori (Eskilstuna, ?vedija). ?vedijos armijai buvo pagaminta apie 30
000 Ag m/42. Ta?iau ?is ?autuvas nebuvo pagrindinis ?vedijos armijos ?autuvas,
juo buvo 6,5 mm pertaisomas ?autuvas Mauser m/96.

Po kurio laiko buvo aptikti kai kurie ?autuvo tr?kumai (pvz., duj? vamzdelio r?
dijimas), tod?l 1953?1956 ?vedijos armijos ?autuvai Ag m/42 buvo modernizuoti i
r pavadinti Ag m/42B. Modernizaciniai pakeitimai:
$

Looks great, right? Not quite. Problem with the above code is the following line:

UTF2ASCII(get-content $infile)
It just tries to read the entire file into memory in one sitting and then passes all that to UTF2ASCII(). That works fine with small files, but we will be filling memory/swap space and maybe even running out of them if the file is big enough. And that is a bit of shooting ourselves in the foot. Let's rewrite it a bit by creating a second function, ConvertFile(), which reads the file line by line. This way we need not to care how large the file is.

param($infile)
$count = 0

function ConvertFile($encoded)
{
   $oldi = ''
   $converted = ''

   foreach ($line in $encoded)
   {
      ConvertLine($line)
   }
}

function ConvertLine($line)
{
   foreach($i in [int[]][char[]]$line)
   {
      if ($i -gt 127)
      {
         if ($oldi -ne 63)
         {
            $i = 63
            $global:count++
         }
         else
         { continue }
      }
      $converted = $converted + [char]$i
      $oldi = $i
   }

   $converted = $converted + "`r`n"
   return $converted
}

ConvertFile(get-content $infile)
"$($count) characters were converted"

The $count global variable is just there to count how many times the script found and converted UTF-8 characters.

When we test the script with a large file, and found out we did solve the problem in the original code we mentioned above; you can try it out if you do not believe in me.

So, let's now do a few more tests.

$ utf2ascii.ps1 infile > outfile
$ cat outfile
Oliven?l

Ag m/42, kitaip AG42, AG-42 ar Ljungman (?ved. Automatgev?r m/42) ? ?vedi?kas s
avitaisis ?autuvas, nuo 1942 m. iki 7-ojo de?imtme?io naudotas ?vedijos armijos
.

?autuv? 1941 m. sukonstravo Erikas Eklundas i? kompanijos AB C. J. Ljungmans Ve
rkst?der (Malm?, ?vedija). Juos gaminti 1942 m. ?m? kompanija Carl Gustafs Stad
s Gev?rsfaktori (Eskilstuna, ?vedija). ?vedijos armijai buvo pagaminta apie 30
000 Ag m/42. Ta?iau ?is ?autuvas nebuvo pagrindinis ?vedijos armijos ?autuvas,
juo buvo 6,5 mm pertaisomas ?autuvas Mauser m/96.

Po kurio laiko buvo aptikti kai kurie ?autuvo tr?kumai (pvz., duj? vamzdelio r?
dijimas), tod?l 1953?1956 ?vedijos armijos ?autuvai Ag m/42 buvo modernizuoti i
r pavadinti Ag m/42B. Modernizaciniai pakeitimai:
35 characters were converted
$

Yeah, the fact the line stating how many characters were converted is in the outfile is a drag, but I will survive. I really want to count the number of lines though:

$ cat outfile | wc -l
16

$

Now that was unexpected to me, specially coming from a Linux background. How did my 7 lines became 16? You see, we run the script as

utf2ascii.ps1 infile > outfile

Thing is, in Powershell the > will split a long line to the width of the console window we are using. I read the explanation quite a few times and still am not satisfied. But, since we need this code to work, time to change it one more time!

param($ifile, $ofile)
$count = 0

function ConvertFile($originalFile, $convertedFile)
{
   $oldi = ''
   $converted = ''
   rm $convertedFile -ea SilentlyContinue

   foreach ($line in (get-content -path $originalFile))
   {
      ConvertLine($line) | `
      Out-File -Append -Encoding ASCII $ofile
   }
}

function ConvertLine($line)
{
   foreach($i in [int[]][char[]]$line)
   {
      if ($i -gt 127)
      {
         if ($oldi -ne 63)
         {
            $i = 63
            $global:count++
         }
         else
         { continue }
      }
      $converted = $converted + [char]$i
      $oldi = $i
   }

   return $converted
}

if ( [string]::IsNullOrEmpty($ifile) )
{
   Write-Host "Usage: powershell -ExecutionPolicy ByPass -File", `
               $MyInvocation.MyCommand.Name, "infile [outfile]"
   exit
}

# Create a $ofile if one was not given
if ( [string]::IsNullOrEmpty($ofile) )
{
   $ofile = ([io.path]::GetFileNameWithoutExtension("$ifile")) + `
            (get-date -format yyyymmdd) + `
            ([io.path]::GetExtension("$ifile"))
}

ConvertFile $ifile $ofile
"$($count) characters were converted"

Notes:

  • Instead of sending output to console, which then would need to be redirected to a file, this script reads the input and output filenames. If an outputfilename is not provided, it will create one based on that of the inputfile.
  • Corrected the issue of splitting the processed lines before saving them.
  • Gives usage message if no parameters are given
  • Outputfile is overwritten without questioning.
  • Only output to the stdout is the number of conversions

I am bored, let's run it

$ utf2ascii.ps1 infile outfile
35 characters were converted

$

Note that it now wrote the converted character count message to the screen as promised. How about the number of lines?

$ cat outfile | wc -l
7

$

Much better! Now all we need to do is decide if we want to keep the "?" or make some conversion table, which I will leave the decision up to you.

Friday, July 31, 2015

Detecting text file format using hexdump

another quick one: so we had a text file that had text with accented words and we had to figure out which format they were. You see, for a while the "standard" text format for computers was ASCII, more precisely 7bit ASCII (characters 0 to 127 in decimal) which was created in the 1960s and whose character set aassumed English language only. Before some of you get all excited please note that ASCII stands for American Standard Code for Information Interchange, so it stands to reason they picked English. As this standard became adopted by other countries, it became clear that some of them used characters that were not representable with only those characters, and that let to many attempts to solve that. One of the earliest was to extend the original ASCII table, where another 128 possible characters were added, which after a few adventures evolved into ISO-8859-1 a.k.a. ISO-Latin-1, and UTF-8. There are other character sets, but the principle is the same.

Thanks for the history lesson, but how about getting to the point? How to identify the text format given a file? Let's answer that by using a couple of examples.

Example 1

Let's say we got a text file that has a name, say Luis de La Peña in it somewhere. Depending how you look -- how helpful your text viewer is -- at the file, it might either show "ñ" or some garbled character; the later happens if the text viewer only knows 7bit ASCII. For instance, cat would spit out something like this in my Ubuntu laptop:

bash-3.2$ cat text_test1 
Luis de La Pe�a 
bash-3.2$

Don't know about you, but that "�" does nothing to me; it's just cat's way of telling us it cannot represent the character so it is putting a placeholder. Let's try something else; since the title of this article mentions hexdump, I propose to look at it through that program (I am telling it to print the value of each character and then the ascii representation of those characters):

bash-3.2$ hexdump -Cv text_test1 
00000000  4c 75 69 73 20 64 65 20  4c 61 20 50 65 f1 61 20  |Luis de La Pe.a |
00000010  0a                                                |.|
00000011
bash-3.2$ 

first thing we notice is that it too does not know how to show "ñ", so it is using "." as placeholder. That is a different character than what cat in my ubuntu box used; just deal with it. What we really care about is the hex side tells us that

0xF1 = ñ

That is very important: it tells us that "ñ" is represented by one 8-Bit character, so UTF8 is out. So, we need to look for an 8-bit charset. After hours of agonizing search and heavy drinking, we find that the extended ASCII and/or the ISO-8859-1 tables match all characters (don't believe me? Check the other characters in the text including space). Not bad at all, so we can read the text and convert it to a different char set.

Example 2

So we feel all good about ourselves, and we need another example. This time, I will steal a real life example from a previous article, where we had a text containing Italienisches Olivenöl which would cause DKIM email body authentication failures. Yeah, something as seemly harmless as character set can create some annoying problems.

As before, we begin by asking cat what it thinks about the text:

bash-3.2$ cat text_test2
Italienisches Olivenöl
bash-3.2$

Hold on right there. Why is cat able to print that "ö" but could not print that "ñ" earlier? Now you begin to see some of the limitations of cat compared to hexdump for these probulations: depending on how cat was compiled, it will handle some character sets but not others. hexdump knows nothing about character sets: it only knows of ASCII; anything else becomes a ".". Of course, it would suck to use hexdump all the time, so you need to know your tools and when to use each one. Since we talked about hexdump, let's see what it sees:

bash-3.2$ hexdump -Cv text_test2
00000000  49 74 61 6c 69 65 6e 69  73 63 68 65 73 20 4f 6c  |Italienisches Ol|
00000010  69 76 65 6e c3 b6 6c 0a                           |iven..l.|
00000018
bash-3.2$

Some kind of funny business happening in the second row:

  1. all the English-looking characters not only seem to be represented by one 8bit value but also the same ones we saw earlier in the ASCII example:
    0x69 = i
    0x76 = v
    0x65 = e
    0x6e = n
    0x6c = l
  2. There are two "." characters (0xc3 and 0xb6) where "ö" should be.
  3. There is a 0x0a after the "l".
What's going on here? Hold onto that question and let's check the "ö". According to the above, the two "." are there to tell us "ö" is represented as two 8bit values:
0xc3b6
If we look at any UTF8 conversion table such as this one (picked at random), we will see that is the UTF8 HEX for "ö" (Unicode code would be U+00F6).

Ok, smartypants, what about the 0x0a after the "l"? Yes that. You might have not noticed it was also on text_test1 on the first example. That is the line feed character, which in Linux means end of line.

Insert Boring Conclusion Here

I hope this was useful to you; I thought it was fun and even learned a few things while writing this. the thought process here is similar to what, say you would do when you are examining an encrypted document: try to find known patterns to work with before going after the really unknown stuff.