If You Like Regular Expressions So Much, Why Don't You Marry Them?

March 22, 2005

All right... I will!

Pee-Wee likes fruit salad so much, he married it

I'm continually amazed how useful regular expressions are in my daily coding. I'm still working on the MhtBuilder refactoring, and I needed a function to convert all URLs in a page of HTML from relative to absolute:

''' <summary>
''' converts all relative url references
'''    href="myfolder/mypage.htm"
''' into absolute url references
'''    href="http://mywebsite/myfolder/mypage.htm"
''' </summary>
Private Function ConvertRelativeToAbsoluteRefs(ByVal html As String) As String
  Dim r As Regex

  Dim urlPattern As String = _
    "(?<attrib>\shref|\ssrc|\sbackground)\s*?=\s*?" & _
    "(?<delim1>[""'\\]{0,2})(?!#|http|ftp|mailto|javascript)" & _
    "/(?<url>[^""'>\\]+)(?<delim2>[""'\\]{0,2})"

  Dim cssPattern As String = _
    "@import\s+?(url)*['""(]{1,2}" & _
    "(?!http)\s*/(?<url>[^""')]+)['"")]{1,2}"

  '-- href="/anything" to href="http://www.web.com/anything"
  r = New Regex(urlPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  html = r.Replace(html, "${attrib}=${delim1}" & _HtmlFile.UrlRoot & "/${url}${delim2}")

  '-- href="anything" to href="http://www.web.com/folder/anything"
  r = New Regex(urlPattern.Replace("/", ""), _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  html = r.Replace(html, "${attrib}=${delim1}" & _HtmlFile.UrlFolder & "/${url}${delim2}")

  '-- @import(/anything) to @import url(http://www.web.com/anything)
  r = New Regex(cssPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  html = r.Replace(html, "@import url(" & _HtmlFile.UrlRoot & "/${url})")

  '-- @import(anything) to @import url(http://www.web.com/folder/anything)
  r = New Regex(cssPattern.Replace("/", ""), _
    RegexOptions.IgnoreCase Or RegexOptions.Multiline)
  html = r.Replace(html, "@import url(" & _HtmlFile.UrlFolder & "/${url})")

  Return html
End Function

Each regex is repeated because I have to resolve relative URLs starting with forward slashes to the webroot first--and then all remaining relative URLs to the current web folder.
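
To see why the order matters, here is a minimal sketch of the two-pass idea. It is not the MhtBuilder code: the pattern is cut down to double-quoted href attributes only, and UrlRoot, UrlFolder, and MakeAbsolute are hypothetical stand-ins for the _HtmlFile properties and the function above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TwoPassDemo
  ' Hypothetical values standing in for _HtmlFile.UrlRoot and _HtmlFile.UrlFolder
  Private Const UrlRoot As String = "http://www.web.com"
  Private Const UrlFolder As String = "http://www.web.com/folder"

  ' Simplified assumption: only double-quoted href attributes are handled.
  ' The single "/" after the lookahead marks a root-relative URL.
  Private Const RootPattern As String = _
    "href=""(?!#|http|ftp|mailto|javascript)/(?<url>[^""]+)"""

  Function MakeAbsolute(ByVal html As String) As String
    ' Pass 1: href="/anything" resolves against the web root
    html = Regex.Replace(html, RootPattern, _
      "href=""" & UrlRoot & "/${url}""")

    ' Pass 2: same pattern minus the slash; whatever is still relative
    ' resolves against the current folder. URLs fixed in pass 1 now start
    ' with "http", so the negative lookahead skips them.
    html = Regex.Replace(html, RootPattern.Replace("/", ""), _
      "href=""" & UrlFolder & "/${url}""")

    Return html
  End Function

  Sub Main()
    Console.WriteLine(MakeAbsolute("<a href=""/a.htm"">A</a> <a href=""b.htm"">B</a>"))
    ' Prints:
    ' <a href="http://www.web.com/a.htm">A</a> <a href="http://www.web.com/folder/b.htm">B</a>
  End Sub
End Module
```

If the passes ran in the opposite order, the slash-less pattern would also match href="/a.htm" and rewrite it to http://www.web.com/folder//a.htm, which is the wrong location.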

One of the BCL team recently recommended pretty-printing regular expressions, i.e., using whitespace to make regexes more readable with RegexOptions.IgnorePatternWhitespace. I agree completely. We do this all the time with SQL. I can think of a half-dozen tools that will take a block of SQL and pretty-format it--but I am not aware of any regex tools that offer this functionality. I guess I'll email the author of RegexBuddy and see what he has to say.
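
As a rough illustration of what that looks like, here is the urlPattern from the function above laid out one piece per line. This is a sketch, not a tested replacement for the original: UrlRoot and the Absolutize helper are hypothetical, and the comments inside the pattern are my own reading of each piece. The one thing to watch is that an unescaped # starts a comment in this mode, so the literal # in the lookahead has to become \#.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module PrettyRegexDemo
  ' Hypothetical value standing in for _HtmlFile.UrlRoot
  Private Const UrlRoot As String = "http://www.web.com"

  ' Compact form, copied from urlPattern in the function above
  Public ReadOnly CompactPattern As String = _
    "(?<attrib>\shref|\ssrc|\sbackground)\s*?=\s*?" & _
    "(?<delim1>[""'\\]{0,2})(?!#|http|ftp|mailto|javascript)" & _
    "/(?<url>[^""'>\\]+)(?<delim2>[""'\\]{0,2})"

  ' Same pattern, one piece per line. Needs RegexOptions.IgnorePatternWhitespace.
  ' Lines are joined with vbLf because a # comment runs to the end of its line.
  Public ReadOnly PrettyPattern As String = String.Join(vbLf, New String() { _
    "(?<attrib> \shref | \ssrc | \sbackground )  # attribute name", _
    "\s*? = \s*?                                 # equals sign", _
    "(?<delim1> [""'\\]{0,2} )                   # opening quote, maybe escaped", _
    "(?! \# | http | ftp | mailto | javascript ) # skip anchors and absolute URLs", _
    "/                                           # leading slash: root-relative", _
    "(?<url> [^""'>\\]+ )                        # the URL itself", _
    "(?<delim2> [""'\\]{0,2} )                   # closing quote" _
  })

  ' Hypothetical helper: runs the root-relative replacement with the given pattern
  Function Absolutize(ByVal html As String, ByVal pattern As String, _
                      ByVal options As RegexOptions) As String
    Dim r As New Regex(pattern, options)
    Return r.Replace(html, "${attrib}=${delim1}" & UrlRoot & "/${url}${delim2}")
  End Function

  Sub Main()
    Dim html As String = "<img src=""/logo.gif"">"
    Console.WriteLine(Absolutize(html, CompactPattern, RegexOptions.IgnoreCase))
    Console.WriteLine(Absolutize(html, PrettyPattern, _
      RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace))
    ' Both lines should read: <img src="http://www.web.com/logo.gif">
  End Sub
End Module
```

Whitespace inside a character class is still literal in this mode, which is why the bracketed parts are left exactly as they were.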

And here's an interesting bit of trivia: did you know that the ASP.NET page parser uses regular expressions?

Posted by Jeff Atwood
11 Comments

We use TOAD (only for ORACLE dbs), which has an entire module dedicated to this..

http://www.quest.com/toad/

Jeff Atwood on March 23, 2005 3:10 AM

[... but there are potential performance issues in recompiling the regex on each postback ...]

They could just embed the precompiled regex into their assembly, but at the framework level of the coding pyramid it probably makes more sense to hard-code it.

Jeff Atwood on March 23, 2005 3:11 AM

[... did you know that the ASP.NET page parser uses regular expressions? ...]

And another trivium is that the ValidateRequest logic that looks for HTML and other "potentially dangerous" markup in the postback does NOT use regexes. It could -- it's a natural application -- but there are potential performance issues in recompiling the regex on each postback.

I have a few details about that here, including the hypothetical regex that they would use:

http://mikepope.com/blog/DisplayBlog.aspx?permalink=441

mike on March 23, 2005 12:43 PM

I have been looking for a tool to beautify SQL for quite some time. Where could I find one?

KMF on March 23, 2005 12:53 PM

Anyone got a regex that converts an absolute path into a relative path? I need to convert "http://domain.com/folder/images/myimage.gif" to just "images/myimage.gif".

Erica on April 22, 2005 3:00 AM

Erica,

Here are a few.. starting with:

http://domain.com/folder/images/myimage.gif

Return the last filename in the url:

"[^/]+[^/]$" -- myimage.gif

Return the last folder in the url:

"[^/]+/(?=$|[^/]+$)" -- images/

Return the webroot:

"^\w+://[^/]+(/)*" -- http://domain.com/

Return the webroot plus the first subfolder:

"^\w+://([^/]+/){2}" -- http://domain.com/folder/

The problem you run into is that relative is.. uh.. relative to what?

Jeff Atwood on April 22, 2005 4:20 AM
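
The four patterns in the comment above can be tried out with a short sketch like the following. FirstMatch and StripBase are hypothetical helper names. StripBase assumes that "relative" means relative to the webroot plus the first subfolder, which is only one possible answer to the "relative to what?" question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UrlPartsDemo
  Public Const SampleUrl As String = "http://domain.com/folder/images/myimage.gif"

  ' Returns the text of the first match, or an empty string if nothing matches
  Function FirstMatch(ByVal input As String, ByVal pattern As String) As String
    Return Regex.Match(input, pattern).Value
  End Function

  ' Assumption: the base to strip is the webroot plus the first subfolder
  Function StripBase(ByVal input As String) As String
    Return Regex.Replace(input, "^\w+://([^/]+/){2}", "")
  End Function

  Sub Main()
    Console.WriteLine(FirstMatch(SampleUrl, "[^/]+[^/]$"))          ' myimage.gif
    Console.WriteLine(FirstMatch(SampleUrl, "[^/]+/(?=$|[^/]+$)"))  ' images/
    Console.WriteLine(FirstMatch(SampleUrl, "^\w+://[^/]+(/)*"))    ' http://domain.com/
    Console.WriteLine(FirstMatch(SampleUrl, "^\w+://([^/]+/){2}"))  ' http://domain.com/folder/
    Console.WriteLine(StripBase(SampleUrl))                         ' images/myimage.gif
  End Sub
End Module
```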

Sorry, I forgot to add this information. All the links are of the form:
<a href="javascript:mapWindow=window.open('/something.htm')">

I need to convert these relative URLs to absolute ones. I know that I have to use the Replace() method. How do I go about it?
I went through the code posted above and didn't understand this line:
html = r.Replace(html, "${attrib}=${delim1}" & _HtmlFile.UrlRoot & "/${url}${delim2}")

What does _HtmlFile.UrlRoot mean??


Samir on March 27, 2006 4:23 AM

Hi,
I have the same problem as Samir.
Samir, did you or anyone else find out what _HtmlFile.UrlRoot means?

S.Jan on June 14, 2006 5:43 AM

Hi guys,
I have a regular expression problem: I have a huge block of HTML text and I want to convert every <a href>text</a> to just text wherever the link does not start with http://
In short, I want to keep only the external links in my document and replace the others with their link text.

Please please help

Tina on July 9, 2007 5:59 AM

it's only about three years too late, but KMF, here's an awesome sql formatter:

http://www.sqlinform.com/

maybe others will find it useful even though this post is quite old.

cowgod on June 27, 2008 8:41 AM
