There Ain't No Such Thing as Plain Text

January 8, 2005

Over the last few months, I've come to realize that I had an ugly American view of strings. I always wondered what those crazy foreigners were complaining about in their comments on my CodeProject articles, and now I know. As Joel Spolsky puts it, there ain't no such thing as plain text:

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly. Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page.
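As a rough illustration of reading the charset out of such a header, here is a minimal sketch in Python (not the VB.NET used elsewhere in this post). It leans on the standard library's `email.message.Message`, which already knows how to parse parameters such as a quoted `charset="UTF-8"`. The helper name `charset_from_content_type` is made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value.

    Hypothetical helper for illustration: it wraps the header value in a
    Message object so the standard library does the parameter parsing,
    including quoted values like charset="UTF-8".
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns `failobj` when no charset is present.
    return msg.get_content_charset(failobj=default)


quoted = charset_from_content_type('text/plain; charset="UTF-8"')
unquoted = charset_from_content_type("text/html; charset=us-ascii")
missing = charset_from_content_type("text/html")
```

The same function works for an HTTP response header, since the `Content-Type` syntax is shared between email and HTTP.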

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get that far on the HTML page without starting to use funny letters.
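To make both halves of that argument concrete, here is a small Python sketch (an illustration, not code from the original article). It decodes the same bytes under three encodings: the ASCII-range prefix comes out identical every time, while the one accented letter does not. The sample string and variable names are assumptions chosen for the example.

```python
# "caf\u00e9" is "cafe" with an acute accent on the e. In UTF-8 that last
# letter is stored as two bytes (C3 A9), both above 127.
raw = "<p>caf\u00e9</p>".encode("utf-8")

# cp1252 is Python's name for Windows-1252.
CANDIDATES = ["utf-8", "iso-8859-1", "cp1252"]


def decode_all(data):
    """Decode the same bytes once per candidate encoding."""
    return {name: data.decode(name) for name in CANDIDATES}


decoded = decode_all(raw)

# Everything before the first byte above 127 is plain ASCII, so every
# candidate encoding agrees on it.
first_high = next(i for i, b in enumerate(raw) if b > 127)
prefixes = {name: text[:first_high] for name, text in decoded.items()}

for name, text in decoded.items():
    # ascii() keeps the output printable on any console.
    print(name, ascii(text))
```

Only the UTF-8 decoding recovers the accented letter; the other two turn its two bytes into two unrelated characters, which is the classic "gibberish" symptom.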

In my case, this applies absolutely. I was doing a naive, blanket UTF-8 conversion of the downloaded bytes, assuming only that I got back something of type "text/*":

        Dim wc As New Net.WebClient
        wc.Headers.Add("User-Agent", _strHttpUserAgent)
        wc.Headers.Add("Accept-Encoding", _strAcceptedEncodings)
        Dim b() As Byte = wc.DownloadData(strUrl)

Clearly this isn't right. It is right most of the time, which can lull you into a false sense of correctness. A lot of things are like that in software; you think you have it right, but you just haven't hit the edge conditions yet. I found a code sample on Feroze Daud's blog that demonstrates how to semi-correctly detect the HTML encoding, as described by Joel. I thought it could be further improved. Here's my take:

    ''' <summary>
    ''' attempt to convert this charset string into a named .NET text encoding
    ''' </summary>
    Private Function CharsetToEncoding(ByVal Charset As String) _
        As System.Text.Encoding

        If Charset = "" Then Return Nothing
        Try
            Return System.Text.Encoding.GetEncoding(Charset)
        Catch ex As System.ArgumentException
            Return Nothing
        End Try
    End Function

    ''' <summary>
    ''' Given the Content-Type header, try to determine string encoding 
    ''' using header and raw content bytes
    ''' "Content-Type: text/html; charset=us-ascii"
    ''' <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    ''' </summary>
    Private Function GetEncoding(ByVal ContentTypeHeader As String, _
      ByVal ResponseBytes() As Byte) As System.Text.Encoding

        If Not _blnDetectEncoding Then
            Return _DefaultEncoding
        End If

        Dim strCharset As String
        Dim encoding As System.Text.Encoding

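        '-- first try the header; the value may be quoted, as in
        '-- charset="UTF-8", so allow an optional opening quote
        strCharset = Regex.Match(ContentTypeHeader, _
                "charset\s*=\s*[""']?([^;""'/>\s]+)", _
                RegexOptions.IgnoreCase).Groups(1).ToString.ToLower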
        encoding = CharsetToEncoding(strCharset)

        '-- if we can't get it from header, try the body bytes
        If encoding Is Nothing Then
            strCharset = Regex.Match( _
                System.Text.Encoding.ASCII.GetString(ResponseBytes), _
                "<meta[^>]+content-type[^>]+charset=([^;""'/>]+)", _
                RegexOptions.IgnoreCase).Groups(1).ToString.ToLower
            encoding = CharsetToEncoding(strCharset)
            If encoding Is Nothing Then
                Return _DefaultEncoding
            End If
        End If

        Return encoding
    End Function

Between the raw bytes from the HTTP response and the Content-Type HTTP header, we should be able to get something reasonable. I use UTF-8 as my default if no encoding can be determined, which as near as I can tell is a best practice with strings in .NET. I apologize to all the non-English-speaking users of my CodeProject articles; I'm fixing it!

Posted by Jeff Atwood
3 Comments

As it turns out, Windows-1252 *can* be a better default for web strings than UTF-8.

Microsoft's Mikhail Arkhipov describes some of the changes in VS.NET 2005 in this area:

"First, Visual Studio is a Unicode application and actually even supports Unicode Surrogates Pairs. Most of Web pages, however, are not stored in Unicode. Therefore when opening a Web page VS has to figure out how to convert document to Unicode and how to convert it back on save. Here is how Visual Studio does it: "

http://blogs.msdn.com/mikhailarkhipov/archive/2004/8/7.aspx

Jeff Atwood on April 23, 2005 4:41 AM

UTF-8 is good as a default. But here is a better rule:
* If the string is a valid UTF-8 encoded string, interpret it as UTF-8
* If the string is not valid UTF-8, interpret it as windows-1252.

This is because there are certain combinations of bits and bytes that are not allowed in UTF-8. So if it is valid UTF-8, then you can be pretty sure that it is UTF-8.
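The rule above can be sketched in a few lines of Python (an illustration only; the function name `decode_utf8_or_cp1252` is made up for this example). Strict UTF-8 decoding fails on byte sequences that are not valid UTF-8, and that failure triggers the Windows-1252 fallback. One assumption worth flagging: a handful of byte values are undefined in Python's `cp1252` codec, so the fallback uses `errors="replace"` rather than failing a second time.

```python
def decode_utf8_or_cp1252(data):
    """Decode as UTF-8 if the bytes are valid UTF-8, else as Windows-1252.

    Returns a (text, encoding_name) pair so the caller can see which
    branch was taken.
    """
    try:
        # Strict mode raises UnicodeDecodeError on any invalid sequence.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: undefined cp1252 bytes become U+FFFD instead of
        # raising a second error.
        return data.decode("cp1252", errors="replace"), "cp1252"


valid_utf8 = decode_utf8_or_cp1252("caf\u00e9".encode("utf-8"))
legacy = decode_utf8_or_cp1252(b"caf\xe9")  # lone E9 is not valid UTF-8
```

Pure ASCII input is valid UTF-8, so it always takes the first branch; that is harmless because both encodings agree on those bytes.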

Moreover, I want to comment on an earlier comment on this page: When talking about the "Unicode" encoding, it really means the UTF-16 encoding which has surrogate pairs. The spec says that a UTF-16 encoding has a Byte Order Mark at the beginning of the file, i.e. two bytes with the value FFFE or FEFF.
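Checking for a byte order mark can be sketched like this in Python (an illustration; `sniff_bom` is a hypothetical helper). It uses the BOM constants from the standard `codecs` module. Two assumptions: files are not guaranteed to carry a BOM, so a miss proves nothing, and UTF-32 is ignored here even though its little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# (BOM bytes, encoding name) pairs, checked in order.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # bytes FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # bytes FE FF
]


def sniff_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
name, skip = sniff_bom(sample)
# Decode the payload after stripping the BOM bytes.
text = sample[skip:].decode(name)
```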

Gunnar Vestergaard on April 1, 2008 3:53 AM

Actually, I think according to the spec, text/{something} === text/{something}; charset=US-ASCII

If no charset is defined on a text/{something} MIME type, then the bytes must be interpreted as US-ASCII. That is why the application/xml MIME type is preferred to text/xml: with application/xml, you can just pass the bytes to your parser, whereas with text/xml, you have to assume those bytes are ASCII.

sean on April 2, 2008 8:59 AM
