I will warn you this is really lousy and inelegant code. But I needed something quick. I will also warn that this was going to be focus primarily on doing the UTF-8 to ASCII7 conversion, but it went a bit sideways with me learning something new about Powershell. If you feel a bit disappointed, oh well.
I will do my best take the high road and not bitch and make snide comments about a certain software company whose product caused me to write this script. However, after rereading this article I think I failed miserably. You have been warned!
The Problem
So, I am dealing with yet a not so bright piece of software. Specifically, one that has a database whose fields for usernames, addresses, and even phone numbers and such things assume that those users are all living in the United States and have names like Bob or Sue or John or Travis (OK, maybe not that one since it has two syllables) who lived in towns like Cleveland or Tampa. This vendor developed the code in early 2000(!) hoping it would never have to deal with umlauts and accents and other characters. Probably they did that to save space since an ASCII7-only DB field is smaller than a UTF-aware one. Imagine how much money they would make their customers save on storage!
Then someone told their customers that there are lands past the dragons and the turtles, where strange people with mysterious habits and incomprehensible languages roam.
Like Sweden.
Imagine how shocked they were when now their software had to deal with names like Фёдор, Ljungström, Simón, 宮崎, and Häyhä and towns like Malmö. To address that, in 1987 a formal work group, which later became the Unicode Consortium was formed and, anfter much head-smacking, released its first version in 1991. Unfortunately, someone failed to mention that to this company. Fear not, however, for they were not the first and will not be the last to develop their code in ASCII and then realize they have to support other languages.
This is all nice and all, but we feed data to this program as text files. Which is in UTF-8 format since it contains people and places with names using non-ASCII characters. And as a result the database gets very unhappy.
The Solution
I called them to see if they had plan on supporting Unicode in the future. And they replied. Bottom line is they will not change their code to handle Unicode. So, if we are going to use this program we need to massage the input data.
As many times before in many different operating systems, if we are going to do this, it better be doable using some kind of scripting language. Being this Windows my language of choice will once again be Powershell. Before we write a single line of code, let's do something dangerously subversive and think about what we need to do.
We need to detect the non-ASCII characters. Specifically we are talking about the characters that are not within the original 7bit ASCII (characters 0 to 127 in decimal, or 0x00 and 0x7F respectively in hexadecimal) table; we talked a bit before about detecting different files formats. From there we know the UTF-8 characters we are looking for are represented by a pair: "ö" has a UTF8 Hex of 0xc3b6. Do note both characters > 0x7F.
Once we detect said character pair, we need to do something with them. We are going to convert them into a single or a combination of characters, and that is where it gets interesting. Take our friend ö: should it become o or oe? And the answer is it depends.
Some countries came up with their own representation in ASCI7 of their non-ASCI7 characters. And they are not really universal. Also, if the import file uses some kind of fixed column formatting, we have to convert it to one single character. This is really not a coding problem; we can easily create a lookup table to replace the character. So I will just leave this here in the open for you to ponder deeply about. In this article we will cheat, and badly.
Let's talk about that cheating. Instead of bothering to do some proper converting, all we will do is replace the non-ASCII7 char pair with a single question mark (63 or 0x3F). This way, whoever sees the data will know we did something there and can (if they are the affected user) replace that "?" with whatever character they feel like. Talk about passing the hot potato around...
So, if we find a character whose decimal value is > 127, we can replace it with a "?" (decimal 63). Of course, being it UTF-8 the next character in the pair is also > 127. So, we need to replace the pair of characters with a single "?".
With that in mind, let's try coding that:
param($infile) # Convert UTF-8 document into ASCII7 function UTF2ASCII($encoded) { $oldi = '' $converted = '' foreach ($line in $encoded) { # Convert each character in $line into a decimal number foreach($i in [int[]][char[]]$line) { # 0 <= ASCII7 <= 127 if ($i -gt 127) { # KLUDGE: only change to "?" if previous char isn't. if ($oldi -ne 63) { $i = 63 } else { continue } } $converted = $converted + [char]$i $oldi = $i } } return $converted } # Read in the contents of $infile and feed them to our function UTF2ASCII(get-content $infile)
The if statements are the lousiest part of the code, specially the part that assumes if the previous char is "?" we must be dealing with the two-bit representation of the UTF-8 character; that is a rather tall assumption but I was in a hurry and could not think of an edge case. Now that I admitted my kludge, let's test it. The way you run this script is
utf2ascii.ps1 infile
which will output the modified text to the screen. If we want to save it into a file,
utf2ascii.ps1 infile > outfile
We also need a test file. I created it by gabbing a few paragraphs from the wikipedia page on the Automatgevär m/42 plus an extra line:
OlivenölAg m/42, kitaip AG42, AG-42 ar Ljungman (šved. Automatgevär m/42) – švediškas savitaisis šautuvas, nuo 1942 m. iki 7-ojo dešimtmečio naudotas Švedijos armijos.
Šautuvą 1941 m. sukonstravo Erikas Eklundas iš kompanijos AB C. J. Ljungmans Verkstäder (Malmė, Švedija). Juos gaminti 1942 m. ėmė kompanija Carl Gustafs Stads Gevärsfaktori (Eskilstuna, Švedija). Švedijos armijai buvo pagaminta apie 30 000 Ag m/42. Tačiau šis šautuvas nebuvo pagrindinis Švedijos armijos šautuvas, juo buvo 6,5 mm pertaisomas šautuvas Mauser m/96.
Po kurio laiko buvo aptikti kai kurie šautuvo trūkumai (pvz., dujų vamzdelio rūdijimas), todėl 1953–1956 Švedijos armijos šautuvai Ag m/42 buvo modernizuoti ir pavadinti Ag m/42B. Modernizaciniai pakeitimai:
cat utftest.txt | wc -l (yes it is Linux, sue me) tells me that file is 7 lines long. So, let's run it and see what happens.
$ utf2ascii.ps1 infile Oliven?l Ag m/42, kitaip AG42, AG-42 ar Ljungman (?ved. Automatgev?r m/42) ? ?vedi?kas s avitaisis ?autuvas, nuo 1942 m. iki 7-ojo de?imtme?io naudotas ?vedijos armijos . ?autuv? 1941 m. sukonstravo Erikas Eklundas i? kompanijos AB C. J. Ljungmans Ve rkst?der (Malm?, ?vedija). Juos gaminti 1942 m. ?m? kompanija Carl Gustafs Stad s Gev?rsfaktori (Eskilstuna, ?vedija). ?vedijos armijai buvo pagaminta apie 30 000 Ag m/42. Ta?iau ?is ?autuvas nebuvo pagrindinis ?vedijos armijos ?autuvas, juo buvo 6,5 mm pertaisomas ?autuvas Mauser m/96. Po kurio laiko buvo aptikti kai kurie ?autuvo tr?kumai (pvz., duj? vamzdelio r? dijimas), tod?l 1953?1956 ?vedijos armijos ?autuvai Ag m/42 buvo modernizuoti i r pavadinti Ag m/42B. Modernizaciniai pakeitimai: $
Looks great, right? Not quite. Problem with the above code is the following line:
UTF2ASCII(get-content $infile)It just tries to read the entire file into memory in one sitting and then passes all that to UTF2ASCII(). That works fine with small files, but we will be filling memory/swap space and maybe even running out of them if the file is big enough. And that is a bit of shooting ourselves in the foot. Let's rewrite it a bit by creating a second function, ConvertFile(), which reads the file line by line. This way we need not to care how large the file is.
param($infile) $count = 0 function ConvertFile($encoded) { $oldi = '' $converted = '' foreach ($line in $encoded) { ConvertLine($line) } } function ConvertLine($line) { foreach($i in [int[]][char[]]$line) { if ($i -gt 127) { if ($oldi -ne 63) { $i = 63 $global:count++ } else { continue } } $converted = $converted + [char]$i $oldi = $i } $converted = $converted + "`r`n" return $converted } ConvertFile(get-content $infile) "$($count) characters were converted"
The $count global variable is just there to count how many times the script found and converted UTF-8 characters.
When we test the script with a large file, and found out we did solve the problem in the original code we mentioned above; you can try it out if you do not believe in me.
So, let's now do a few more tests.
$ utf2ascii.ps1 infile > outfile $ cat outfile Oliven?l Ag m/42, kitaip AG42, AG-42 ar Ljungman (?ved. Automatgev?r m/42) ? ?vedi?kas s avitaisis ?autuvas, nuo 1942 m. iki 7-ojo de?imtme?io naudotas ?vedijos armijos . ?autuv? 1941 m. sukonstravo Erikas Eklundas i? kompanijos AB C. J. Ljungmans Ve rkst?der (Malm?, ?vedija). Juos gaminti 1942 m. ?m? kompanija Carl Gustafs Stad s Gev?rsfaktori (Eskilstuna, ?vedija). ?vedijos armijai buvo pagaminta apie 30 000 Ag m/42. Ta?iau ?is ?autuvas nebuvo pagrindinis ?vedijos armijos ?autuvas, juo buvo 6,5 mm pertaisomas ?autuvas Mauser m/96. Po kurio laiko buvo aptikti kai kurie ?autuvo tr?kumai (pvz., duj? vamzdelio r? dijimas), tod?l 1953?1956 ?vedijos armijos ?autuvai Ag m/42 buvo modernizuoti i r pavadinti Ag m/42B. Modernizaciniai pakeitimai: 35 characters were converted $
Yeah, the fact the line stating how many characters were converted is in the outfile is a drag, but I will survive. I really want to count the number of lines though:
$ cat outfile | wc -l 16 $
Now that was unexpected to me, specially coming from a Linux background. How did my 7 lines became 16? You see, we run the script as
utf2ascii.ps1 infile > outfile
Thing is, in Powershell the > will split a long line to the width of the console window we are using. I read the explanation quite a few times and still am not satisfied. But, since we need this code to work, time to change it one more time!
param($ifile, $ofile) $count = 0 function ConvertFile($originalFile, $convertedFile) { $oldi = '' $converted = '' rm $convertedFile -ea SilentlyContinue foreach ($line in (get-content -path $originalFile)) { ConvertLine($line) | ` Out-File -Append -Encoding ASCII $ofile } } function ConvertLine($line) { foreach($i in [int[]][char[]]$line) { if ($i -gt 127) { if ($oldi -ne 63) { $i = 63 $global:count++ } else { continue } } $converted = $converted + [char]$i $oldi = $i } return $converted } if ( [string]::IsNullOrEmpty($ifile) ) { Write-Host "Usage: powershell -ExecutionPolicy ByPass -File", ` $MyInvocation.MyCommand.Name, "infile [outfile]" exit } # Create a $ofile if one was not given if ( [string]::IsNullOrEmpty($ofile) ) { $ofile = ([io.path]::GetFileNameWithoutExtension("$ifile")) + ` (get-date -format yyyymmdd) + ` ([io.path]::GetExtension("$ifile")) } ConvertFile $ifile $ofile "$($count) characters were converted"
Notes:
- Instead of sending output to console, which then would need to be redirected to a file, this script reads the input and output filenames. If an outputfilename is not provided, it will create one based on that of the inputfile.
- Corrected the issue of splitting the processed lines before saving them.
- Gives usage message if no parameters are given
- Outputfile is overwritten without questioning.
- Only output to the stdout is the number of conversions
I am bored, let's run it
$ utf2ascii.ps1 infile outfile 35 characters were converted $
Note that it now wrote the converted character count message to the screen as promised. How about the number of lines?
$ cat outfile | wc -l 7 $
Much better! Now all we need to do is decide if we want to keep the "?" or make some conversion table, which I will leave the decision up to you.
1 comment:
I really loved reading your blog. It was very well authored and easy to understand. Unlike other blogs I have read which are really not that good.Thanks alot! charactercount.org
Post a Comment