Coding Horror
programming and human factors
by Jeff Atwood

June 14, 2005

Formatting HTML code snippets with Ten Ton Wrecking Balls

If you've ever tried to cut and paste code from the VS.NET IDE, you may have noticed that the code generally comes across looking like crap. The root of this problem is that VS.NET copies code into your clipboard in the accursed Rich Text Format. If you were expecting something like standard HTML, think again, bucko!

Brad Abrams posted a quick and dirty workaround to convert the clipboard to HTML using Word. Cory Smith took that workaround and turned it into a VS.NET Macro. It works fairly well, but...

  • Using Word automation to color-code a code snippet in your clipboard is... not exactly lightweight. But my motto is, why use a hammer when you can use a frickin' ten ton wrecking ball?!
  • Word doesn't seem to pick up background colors, only foreground colors. That's kind of a bummer.
  • The resulting HTML is kinda nasty, even though we are specifically asking for Word's simplified “filtered” HTML. But it does work in Firefox and IE just fine.

I experimented with Cory's macro, simplifying it slightly, and forcing a standard font. (I normally use a custom font for programming, but not everyone will have that font installed.)

I knew Word's HTML wasn't going to be optimal, but after taking a closer look at it, I was profoundly unhappy with it. The fact that copying and pasting it back into VS.NET resulted in extra line breaks was kind of a showstopper, too. Here's a little taste:

<P class=MsoNormal style="MARGIN: 0in 0in 0pt">
<SPAN style="FONT-SIZE: 9pt; FONT-FAMILY: 'Courier New'; 
mso-bidi-font-size: 12.0pt"> <o:p></o:p>

If this is Word's idea of "filtered" HTML, I'd hate to see the unfiltered version. And what's up with those empty <o:p> tags all over the place? After I figured out the threading issue preventing me from accessing the clipboard in a macro, I added some code to post-process Word's crazy HTML into something resembling standard, basic HTML. This worked OK.
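To give a feel for that kind of cleanup, here is a rough sketch in Python (not the macro's VB.NET, and not the actual fix-up code from the macro). The function name `clean_word_html` and the three patterns are assumptions chosen to match the snippet above; real Word output will need more rules than this.

```python
import re

def clean_word_html(html):
    """Strip some Word-specific debris from 'filtered' HTML (illustrative only)."""
    # drop Office namespace tags such as <o:p> and </o:p>
    html = re.sub(r"</?o:[^>]*>", "", html, flags=re.IGNORECASE)
    # drop Word's class=MsoSomething attributes
    html = re.sub(r"\s+class=\"?Mso\w+\"?", "", html, flags=re.IGNORECASE)
    # drop mso-* declarations inside style attributes
    html = re.sub(r"\s*mso-[\w-]+\s*:[^;\"']*;?", "", html, flags=re.IGNORECASE)
    return html

# sample assembled from the Word output quoted above, with closing tags added
sample = ('<P class=MsoNormal style="MARGIN: 0in 0in 0pt">'
          '<SPAN style="FONT-SIZE: 9pt; FONT-FAMILY: \'Courier New\'; '
          'mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>')
cleaned = clean_word_html(sample)
print(cleaned)
```

The regular font and margin styles survive; only the Office-specific pieces are removed.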

But then I wondered-- why not convert the native RTF on the clipboard to HTML myself and cut out the middleman? I'm all for using ten ton wrecking balls, but not when they er.. wreck stuff! Fortunately, I've written RTF to HTML converters before, and even more fortunately, VS.NET only uses a tiny subset of RTF to place colored code on the clipboard. Here's the main conversion function:

    Private Function RtfToHtml(ByVal rtf As String) As String
        Const tabSpaces As String = "&nbsp;&nbsp;&nbsp;&nbsp;"

        '-- remove line breaks
        rtf = Regex.Replace(rtf, "[\n\r\f]", "")

        '-- parse RTF color table
        Dim colorTable As New Collections.Hashtable
        Dim i As Integer = 1
        For Each m As Match In Regex.Matches(rtf, _
            "\\red(?<red>\d+)\\green(?<green>\d+)\\blue(?<blue>\d+);")
            colorTable.Add(i, HtmlColor(m))
            i += 1
        Next

        '-- remove header and footer RTF tags
        rtf = Regex.Replace(rtf, "{\\rtf1[^\s]+\s", "")
        rtf = Regex.Replace(rtf, "}$", "")
        rtf = Regex.Replace(rtf, "\\deff0{\\fonttbl{\\f\d+[^}]+}}", "")
        rtf = Regex.Replace(rtf, "{\\colortbl;(\\red\d+\\green\d+\\blue\d+;)+}", "")

        '-- fix escaped C# brackets
        rtf = Regex.Replace(rtf, "\\{", "{")
        rtf = Regex.Replace(rtf, "\\}", "}")

        '-- replace any HTML-specific characters
        rtf = Web.HttpUtility.HtmlEncode(rtf)

        '-- convert RTF tags to HTML tags
        rtf = Regex.Replace(rtf, "\\tab\s", tabSpaces)
        rtf = Regex.Replace(rtf, "\\par\s", "<br/>" & Environment.NewLine)

        '-- remove unmapped RTF tags
        rtf = Regex.Replace(rtf, "\\fs(?<size>\d+)\s", "")
        rtf = Regex.Replace(rtf, "\\cb\d+\\highlight\d+\s", "")

        '-- map foreground color RTF tags to <span> tags with an inline color style
        rtf = Regex.Replace(rtf, "\\cf0\s", "</span><span style='color:black'>")
        For Each m As Match In Regex.Matches(rtf, "\\cf(?<num>\d+)\s")
            i = Convert.ToInt32(m.Groups("num").Value)
            rtf = Regex.Replace(rtf, "\\cf" & i & "\s", _
                "</span><span style='color:" & colorTable.Item(i) & "'>")
        Next
        '-- fix up orphaned spans at start and end
        rtf = Regex.Replace(rtf, "(^.*?)</span>", "$1")
        rtf = rtf & "</span>"

        '-- convert pairs of spaces to non-breaking HTML spaces
        rtf = Regex.Replace(rtf, "  ", "&nbsp;&nbsp;")

        '-- add wrapping div
        rtf = "<div style='font-family:" & CodeFontName & _
            "; font-size: " & CodeFontSize & "pt;'>" & _
            rtf & "</div>"
        Return rtf
    End Function
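The function above calls an HtmlColor helper that isn't shown here. As a hedged illustration of the color-table step, this Python sketch (a different language from the macro, chosen only so it runs standalone) parses the table and maps the \cf tags. `html_color` is a hypothetical stand-in for HtmlColor, assumed to produce #RRGGBB strings, and the sample RTF is invented for the example. It also swaps the macro's replace-in-a-loop for a single pass with a callback.

```python
import re

COLOR_ENTRY = re.compile(
    r"\\red(?P<red>\d+)\\green(?P<green>\d+)\\blue(?P<blue>\d+);")

def html_color(m):
    # hypothetical stand-in for HtmlColor: one color-table entry as #RRGGBB
    return "#{:02X}{:02X}{:02X}".format(
        int(m.group("red")), int(m.group("green")), int(m.group("blue")))

def parse_color_table(rtf):
    # entries are numbered from 1, matching the \cf1, \cf2, ... tags
    return {i: html_color(m)
            for i, m in enumerate(COLOR_ENTRY.finditer(rtf), start=1)}

def map_foreground_colors(rtf, colors):
    # replace every \cfN tag in one pass; unknown indexes (and \cf0) become black
    def to_span(m):
        color = colors.get(int(m.group(1)), "black")
        return "</span><span style='color:{}'>".format(color)
    return re.sub(r"\\cf(\d+)\s", to_span, rtf)

# invented sample: a two-entry color table followed by three \cf tags
sample = r"{\colortbl;\red0\green0\blue255;\red0\green128\blue0;}\cf1 Dim\cf0  x \cf2 'note"
colors = parse_color_table(sample)
mapped = map_foreground_colors(sample, colors)
print(colors)
print(mapped)
```

As in the macro, this leaves an orphaned closing span at the start and an unclosed one at the end, which is what the fix-up step in the function above takes care of.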

All this RTF spelunking revealed an interesting fact. I've always been disappointed that none of the copied code had background color highlighting. Well, that's because the RTF on the clipboard doesn't contain any of the background colors! The actual background formatting codes are there, but there are absolutely no entries in the RTF color table for them. Weird.
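One way to check for that mismatch yourself is to compare the indexes used by the background tags against the number of color-table entries. This is a Python sketch rather than part of the macro; the function name `missing_background_colors` and the sample RTF are made up for illustration, not captured from a real clipboard.

```python
import re

def missing_background_colors(rtf):
    # count the \redN\greenN\blueN; entries; they are numbered from 1
    entries = len(re.findall(r"\\red\d+\\green\d+\\blue\d+;", rtf))
    # collect every index referenced by a \cbN or \highlightN tag
    used = {int(n) for n in re.findall(r"\\(?:cb|highlight)(\d+)", rtf)}
    # anything numbered past the end of the table has no color defined
    return sorted(n for n in used if n > entries)

# invented sample: two table entries, but the background tags ask for color 3
sample = r"{\colortbl;\red0\green0\blue255;\red0\green128\blue0;}\cb3\highlight3 \cf1 Dim"
missing = missing_background_colors(sample)
print(missing)
```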

Update 4/2006: I have a much improved RTF conversion macro. The macro in this post is only interesting for historical reasons, or if you need the Word interop conversion.

Anyway, here's the full FormatToHtml macro (zip). It contains the direct RTF clipboard to HTML conversion, as well as the RTF clipboard to Word clipboard to HTML conversion. To get started:

  1. go to Tools - Macros - IDE
  2. create a new Module named "FormatToHtml" under "MyMacros"
  3. paste the downloaded code into the module
  4. add references to System.Drawing, System.Web, and System.Windows.Forms via the Add Reference menu
  5. save and close the macro IDE window
  6. go to Tools - Macros - Macro Explorer
  7. two new macros named "UsingWord" and "UsingRtfConversion" will be under "FormatToHtml"

    Double-click either macro to run it, then paste away.

Posted by Jeff Atwood

 


 

Comments

For an even more hard-core "convert the RTF to HTML our own damn selves" solution, try the excellent VS.NET add-in CopySourceAsHtml:

http://www.jtleigh.com/people/colin/software/CopySourceAsHtml/

This is far more sophisticated and feature-rich than my little lightweight RTF to HTML function.

Jeff Atwood on June 14, 2005 09:32 PM

Thanks for the plug! I'm glad to see you had the same epiphany I did. :)

Colin on June 14, 2005 10:32 PM

Yeah, it's really nice work! I was looking at the source the other day while working on this.

Jeff Atwood on June 14, 2005 11:23 PM

dasBlog includes an Insert Code toolbar button on its implementation of FreeTextBox that does the formatting for you. You'd probably have to contact Scott Hanselman to find out who wrote it or if you could get it, though (assuming you use FreeTextBox).

Chris Wallace on June 15, 2005 06:56 AM

I happen to be looking at DasBlog right now. Looks like a colorizing regex engine, on a popup form dedicated to that purpose.

It's using AylarSolutions.Highlight.Highlighter

http://weblogs.asp.net/tjohansen/archive/2003/08/17/24291.aspx

Jeff Atwood on June 15, 2005 03:06 PM

I use the squishySyntaxHighlighter. It's very nice, preserves collapsible regions and line numbers, and is free.

Scott Schecter on June 19, 2005 11:47 AM

Hello Jeff,
I have downloaded the zip but cannot open it (corrupted).
Please advise,
Mario

Mario on July 11, 2005 11:12 AM

Upgrade to the latest WinZip at http://www.winzip.com .. unfortunately I might have saved this with the "extreme compression" that is new to that version of WinZip.

I'll try to save it in the "compatible compression" and re-upload it.

Jeff Atwood on July 11, 2005 01:41 PM

I updated the macro tonight. Most of the improvements are in UsingRtfConversion:

- Works under VS.NET 2005 (Thread.ApartmentState.STA must be manually specified when accessing the clipboard)
- Wraps the code snippet in a DIV
- Sets the background color of the code
- Minor HTML formatting improvements
- Option to remove first TAB for heavily indented code (this could be automated, hmm..)

The Word functionality is unchanged!

Jeff Atwood on October 27, 2005 06:15 AM

I created a simpler version of this macro here:

http://www.codinghorror.com/blog/archives/000429.html

Jeff Atwood on October 27, 2005 05:28 PM

I had to restart my IDE before the macro would work. Dunno if that's something particular to my environment or what, but thought it was worth mentioning if someone else has any problems.

Scott Bellware on December 24, 2005 02:52 PM

Using the RTM of VS 2005, I found that I had to add a reference to System.Web in order to get this to work. After that point, it worked great. Thanks for the tool!

Rob Gillen on January 14, 2006 08:51 AM

Ok, I give up. I installed the macro but now how do I use it?

Thanks.

Dale on January 27, 2006 07:31 PM

First, you should be using the new, simpler macro here:

http://www.codinghorror.com/blog/archives/000429.html

Second, once you've installed the Macro, either map it to a keyboard key (Tools, Options, Keyboard) or just double-click on it in the Macro Explorer.

Jeff Atwood on January 27, 2006 07:43 PM

I'm unable to use your tool. It is full of errors. I'm getting Regex errors (around 21 of them).

Can you provide step-by-step installation instructions?

anand on February 1, 2006 07:21 AM

It is now working but there is an error in the code

rtf = Web.HttpUtility.HtmlEncode(rtf)

Error 1 'HttpUtility' is not a member of 'Web'.

Is there any way to fix this error?

anand on February 1, 2006 07:36 AM

Hi Anand, you need to add a reference to the System.Web assembly in the Macro IDE-- you can do this via the "add reference" menu.

Jeff Atwood on February 1, 2006 04:55 PM











Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.