The Problem With URLs

October 29, 2008

URLs are simple things. Or so you'd think. Let's say you wanted to detect a URL in a block of text and convert it into a bona fide hyperlink. No problem, right?

Visit my website at http://www.example.com, it's awesome!

To locate the URL in the above text, a simple regular expression should suffice -- we'll look for a string at a word boundary beginning with http:// , followed by one or more non-space characters:

\bhttp://[^\s]+

Piece of cake. This seems to work. There's plenty of forum and discussion software out there which auto-links using exactly this approach. Although it mostly works, it's far from perfect. What if the text block looked like this?

My website (http://www.example.com) is awesome.

The matched URL will incorrectly include the final paren. This, by the way, is an extremely common way for everyday users to include URLs in their text.
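To make the failure concrete, here is a minimal Python sketch of the naive pattern applied to both sample sentences. The helper name find_naive is made up for illustration; only the pattern itself comes from the post.

```python
import re

# The naive pattern from the post: "http://" at a word boundary,
# followed by one or more non-space characters.
NAIVE = re.compile(r"\bhttp://[^\s]+")

def find_naive(text):
    """Return every substring the naive pattern treats as a URL."""
    return NAIVE.findall(text)

# The closing paren is swallowed into the match.
print(find_naive("My website (http://www.example.com) is awesome."))
# The trailing comma is swallowed too.
print(find_naive("Visit my website at http://www.example.com, it's awesome!"))
```

In both cases the character right after the real URL is not whitespace, so the pattern keeps it.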

What's truly aggravating is that parens in URLs are perfectly legal. They're part of the spec and everything:

only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

Certain sites, most notably Wikipedia and MSDN, love to generate URLs with parens. The sites are lousy with the damn things:

http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)
http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx

URLs with actual parens in them mean we can't take the easy way out and ignore the final paren. You could force users to escape the parens, but that's sort of draconian, and it's a little unreasonable to expect your users to know how to escape characters in the URL.

http://en.wikipedia.org/wiki/PC_Tools_%28Central_Point_Software%29
http://msdn.microsoft.com/en-us/library/aa752574%28VS.85%29.aspx
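For reference, the escaped forms above are ordinary percent-encoding. A minimal Python sketch, assuming the standard library's urllib.parse is acceptable and using a hypothetical helper named escape_path, shows the round trip:

```python
from urllib.parse import quote, unquote

def escape_path(path):
    """Percent-encode a URL path, leaving the '/' separators alone."""
    return quote(path, safe="/")

original = "/wiki/PC_Tools_(Central_Point_Software)"
escaped = escape_path(original)
print(escaped)           # parens become %28 and %29
print(unquote(escaped))  # decoding restores the original path
```

This is easy for code to do and unreasonable to ask of someone typing a comment.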

To detect URLs correctly in most cases, you have to come up with something more sophisticated. Granted, this isn't the toughest problem in computer science, but it's one that many coders get wrong. Even coders with years of experience, like, say, Paul Graham.

If we're more clever in constructing the regular expression, we can do a better job.

\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]

  1. The primary improvement here is that we're only accepting a whitelist of known good URL characters. Allowing arbitrary random characters in URLs is setting yourself up for XSS exploits, and I can tell you that from personal experience. Don't do it!
  2. We only allow certain characters to "end" the URL. Common trailing punctuation marks such as the period, exclamation point, and semicolon are treated as end-of-hyperlink characters and are not included in the URL.
  3. Parens, if present, are allowed in the URL -- and we absorb the leading paren, if it is there, too.

I couldn't come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (à la Wikipedia) and URLs that the user has enclosed in parens. Thus, there has to be a bit of post-processing code to detect and discard the user-enclosed parens from the matched URLs:

if (s.StartsWith("(") && s.EndsWith(")"))
{
    return s.Substring(1, s.Length - 2);
}
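Putting the two pieces together, here is a minimal Python sketch of the same approach: the whitelist regex followed by the paren-stripping step. It is a port for illustration, not the code running on any particular site, and the function name find_urls is hypothetical.

```python
import re

# The whitelist pattern from the post, including the optional leading paren.
URL = re.compile(
    r"\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]"
)

def find_urls(text):
    """Find URLs, discarding parens that wrap the whole match."""
    urls = []
    for match in URL.finditer(text):
        s = match.group(0)
        # Same check as the C# snippet: strip only when the match both
        # starts with "(" and ends with ")".
        if s.startswith("(") and s.endswith(")"):
            s = s[1:-1]
        urls.append(s)
    return urls

print(find_urls("My website (http://www.example.com) is awesome."))
print(find_urls("Visit my website at http://www.example.com, it's awesome!"))
print(find_urls("See http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software) now"))
```

The sketch inherits the limits of the approach: text such as "(at http://example.com)" still yields a URL with the closing paren attached, because the match does not begin with the opening paren.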

That's a whole lot of extra work, just because the URL spec allows parens. We can't fix Wikipedia or MSDN and we certainly can't change the URL spec. But we can ensure that our websites avoid becoming part of the problem. Avoid using parens (or any unusual characters, for that matter) in URLs you create. They're annoying to use, and rarely handled correctly by auto-linking code.

Posted by Jeff Atwood
137 Comments

/* As some people have already said, the "find URLs" problem is trivial and can be solved very easily. Notice that this solution uses no whitelists (Jeff's new favorite buzzword), so it can handle pretty much anything you throw at it: Unicode, ftp://, whatever. The only trouble spots are "Did you mean http://example.com?" and URLs containing brackets (but see http://en.wikipedia.org/wiki/Template:Bracketed). */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Compile as C++: start and length are reference out-parameters. */
void extractURL(const char *text, int &start, int &length)
{
#define RETURN(a,b) do { start = a; length = b; return; } while (0)
    const char *t = text;
find_next_colon:
    const char *colon = strstr(t, "://");
    if (colon == NULL)
        RETURN(-1,-1);
    /* Get the preceding protocol ID; e.g., "http" or "ed2k". */
    const char *s = colon;
    while (s > text && isalnum((unsigned char)s[-1])) --s;
    if (s == colon) {
        /* We have "$@#://http://example.com". Keep going. */
        t = colon+3;
        goto find_next_colon;
    }
    start = s-text;
    s = colon+3;
    /* URLs end with whitespace, <, >, ", or brackety things. Unbalanced
     * parentheses also end the URL; consider (at http://example.com)
     * as opposed to (http://example.com, for example). The end of the
     * string also stops the scan, because strchr matches the NUL. */
    int parens = 0;
    while (!isspace((unsigned char)*s) && !strchr("<>\"[]{}", *s)) {
        if (*s == '(') ++parens;
        if (*s == ')') --parens;
        if (parens < 0) break;
        ++s;
    }
    /* Consider http://en.wikipedia.org/wiki/Bang!. I've
     * decided arbitrarily that ! and ? may end a URL, but
     * we must correctly handle "I like http://example.com." */
    if (strchr(".,:;", s[-1])) --s;
    if (s == colon+3) {
        /* Reject "http://" with nothing following it. */
        t = colon+3;
        goto find_next_colon;
    }
    /* Accept the rest. */
    RETURN(start, (s-text) - start);
#undef RETURN
}


/* For testing: read lines from stdin and print each URL found. */
int main()
{
    char buffer[1000];
    while (fgets(buffer, sizeof buffer, stdin) != NULL) {
        const char *text = buffer;
        int start = 0;
        int len = 0;
        while (1) {
            extractURL(text, start, len);
            if (start == -1) break;
            printf("%d %d: \"%.*s\"\n", start, len, len, text+start);
            text += start+len;
        }
    }
    return 0;
}

Anonymous Cowherd on October 30, 2008 2:27 AM

What Barry Kelly said - you're *never* going to get it to be Completely Right with a regexp (especially, as others have said, with a non-Roman or even non-low-ASCII alphabet), and the regexp will rapidly, as you try, become completely write-only gibberish.

Use a *real* parser, in the form of an actual grammar.

(The 99% close enough solution linked by a previous commenter of course fails utterly on the non-Roman/low-ASCII case by pretending that a-zA-Z0-9 is sufficient to recognise a name.

If one is willing to live with assuming that users will never want to link anywhere that uses an umlaut, an accent, or a non-Roman character, one is certainly free to... but that's probably a bad idea.)

(Is the Anonymous Coward with the C code *deliberately* writing obfuscated code for some reason? Or is it long-term C exposure that makes people think that's good style?

But at least that has the advantage of being an actual parser - if an ugly one - rather than trying to fit everything into a regular expression.)

Sigivald on October 30, 2008 2:45 AM


You could do something like this: <http://website.com>


This URI is in perfect angle brackets and yet, the parser recognizes the closing angle bracket as part of the URI.

Perfect example for another broken parser. :D

Vinzent Hoefler on October 30, 2008 2:48 AM

Yup, I am doing a little project for myself. Looking at duplicate sites, http://www.google.com http://google.com google.com ftp://google.com etc. Regex, and indexof, substring, all needed to check the site.

Cheers, Sarkie.

Sarkie on October 30, 2008 3:34 AM

URL extracting indeed can be extremely troublesome.
Concerning your point 1: In times of IDNs, the whitelist-character approach is at least problematic.

Stecki on October 30, 2008 3:42 AM

Are you really advocating avoiding the use of perfectly valid characters in URLs, just because they make a URL difficult to identify in code?

Many websites use regular expressions to validate email addresses, and these too will often fail to correctly identify perfectly valid email addresses. Would you recommend the victims of these coding failures just change their email address?

Rob on October 30, 2008 3:42 AM

You tried to solve a problem using regular expressions...and then you had two problems.

Sorry - couldn't resist.

But why the dilemma of telling people to escape their parentheses? Square brackets _aren't_ legitimate characters in URLs from what you've stated, yes? So...

My website [http://www.example.com] is awesome.

...should work just fine.

I suppose this post should demonstrate whether your regex works as expected :-)

Mark on October 30, 2008 3:50 AM

Can't forget https, ftp and file URLs.

I display my latest Twitter entry on my homepage and decided to use the following to parse the text:

preg_replace('`\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]\b`', '<a href="\0">\0</a>', substr($item['title'], 10));

I wrote this only yesterday and completely forgot about parens.

Lloyd on October 30, 2008 3:51 AM

There are a few comments suggesting that you validate the URL by requesting it and checking if you get a 404. There are a few reasons against this:

1. Many dynamic sites written by newer coders won't give you a 404 if you request a bad page. e.g. http://example.com/page?id=4 , tack a bracket on the end, and you likely get a bad ID. The page would tell you, but you won't get a 404.

2. It could open both you and a poor target up to a DOS attack. Imagine someone submitting a post with 1,000,000 references to http://example.com .

Peter on October 30, 2008 3:54 AM

Umm, you are actually missing all the URLs out there that contain non-ASCII characters in their domain names, which are perfectly valid: http://en.wikipedia.org/wiki/Internationalized_domain_name

fs111 on October 30, 2008 3:56 AM

On a side note, comments posted here don't respect RFC2396.
Example: <http://www.example.com> -- the trailing angle-bracket gets included in the URL.

Peter on October 30, 2008 3:56 AM

while I usually agree with your points, this one leaves me baffled: by definition, automatically extracting semantics from text without a context-aware parser is not possible, so auto-linking will always be far from working exactly as the user intended

the crux of the problem is: as intended
users are not required to know how to format a perfect href tag, nor is it desirable to allow rendering html through custom text, but users should know how to play by the rules. if they want an autolink, they had better know that they can't use spaces, because spaces are treated as link boundaries and should be escaped as %20.

as a solution, I'd prefer a live or batched preview that lets users test their links before posting. enabling links while writing lets users see how the boundary system works and avoid mistakes

aaawww on October 30, 2008 3:59 AM

A better heuristic for extracting URLs would be to use a stronger pattern formalism than regular expressions, such as a context-free grammar. Since humans generally produce the format for URLs, you could expect that URLs are highly unlikely to include unbalanced parens. Regular grammars can't express this constraint, but a context-free grammar can.

Barry Kelly on October 30, 2008 4:05 AM

What about domains with umlauts?
They're perfectly legal by now and work in all modern browsers, as fs111 correctly stated. And they are already in use here in Germany.

Example (does not exist actually, but could and is valid):
http://www.mllrr.de/

titrat on October 30, 2008 4:05 AM
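To illustrate the IDN objection raised in several comments above, here is a minimal Python sketch. The domain www.müller.de and the helper to_ascii_host are made-up examples; the pattern is the post's whitelist regex without the optional leading paren. It shows the ASCII-only pattern cutting a Unicode host short, while the same host in its ASCII-compatible (punycode) form matches in full.

```python
import re

# The post's whitelist pattern, which only knows ASCII characters.
ASCII_URL = re.compile(
    r"\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]"
)

def to_ascii_host(host):
    """Convert an internationalized host name to its ASCII (punycode) form."""
    return host.encode("idna").decode("ascii")

unicode_url = "http://www.müller.de/"
ascii_url = "http://" + to_ascii_host("www.müller.de") + "/"

# The match stops just before the first non-ASCII character.
print(ASCII_URL.search(unicode_url).group(0))
# The punycode form consists of whitelisted characters only.
print(ASCII_URL.search(ascii_url).group(0))
```

A whitelist can therefore stay ASCII-only if hosts are converted first, but users paste the Unicode form, so the conversion would have to happen after a more permissive match.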

Since conscientious users use the preview feature, the URL detection can be minimal and we can propose a specific syntax for exceptional cases.

Well maybe we need a preview here also :-)

DomreiRoam on October 30, 2008 4:10 AM

Great, how does that work with the international characters allowed in domain names recently?

Xepol on October 30, 2008 4:14 AM

Actually, this problem can be solved with a single regular expression, although it's not an easy one. I have split the regex over several lines for clarity:

(?<=\()
\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
(?=\))
|
(?<=(?<wrap>[=~|_#]))
\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
(?=\k<wrap>)
|
\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]

This will match any URL that is surrounded by parentheses, but also by any of the following characters: '=','~','|','_','#'.
Of course, it will fail in some very borderline cases, but I think it matches 99.9% of URLs entered by users.

Philippe Leybaert on October 30, 2008 4:17 AM

I think this is one of those situations where as you've stated
you can't get a solution to fit all cases. Therefore you have
to take a pragmatic approach. The most pragmatic I think is
to not allow () in urls, and in the .1% that have (), people can easily
cut and paste the URL rather than clicking:

Here is the simple python snippet I use to auto link URLs:

r="((?:ftp|https?)://[^ \t\n\r()\"']+)"
comment=re.sub(r,r'<a rel="nofollow" href="\1">\1</a>',comment)

Pádraig Brady on October 30, 2008 4:28 AM

This is why I always write URLs on a separate line, not only because it's more likely that any automatic link creator will detect them but also because they're easier to select and copy. At the very least, I always put a space between the URL and any punctuation.

Visit my website at http://www.example.com, it's awesome!
is hard to select..

Visit my website at http://www.example.com , it's awesome!
is better but typographically wrong...

I prefer:
Visit my website, it's awesome:
http://www.example.com

Qvasi on October 30, 2008 4:29 AM

Nice, I'll have to update my JavaScript solution to include parens:

http://knol.google.com/k/adam-eivy/javascript-html-format-links-in-text/2a9qcf9a3ig0u/14#

Adam Eivy on October 30, 2008 4:31 AM

I would also suggest that everyone avoid the comma (,), legal though it is.

It is not uncommon to find http://www.example1.com/test.html, http://www.example2.com/test2.html, ...

dmajkic on October 30, 2008 4:32 AM

I recently tackled this problem myself, and came to a similar regular expression (minor differences, and I think that Jeff's is better) plus some additional parsing to handle edge-cases and prevent the regular expression from becoming a complete mess. I think that every single person in the comments assumed that this problem is related solely to message boards. It's not. There are plenty of reasons that you might want to linkify text. You might be writing a web-based e-mail client. You might be writing a client for a chat or IM protocol. You might be trying to turn flat text files into something slightly more presentable on the web. In none of these cases can you reasonably expect the text to contain easy-to-parse URIs with bbcode-style tags or spaces surrounding the link text.

In my case, it was a personal project relating to a web interface for searching and viewing IRC logs. Lots of links get posted in IRC. The IDN problem is a non-issue, but other problems of parsing certainly are not. And I'm neither capable of enforcing URI standards, nor would I want to if I could.

As others have pointed out, all we can do is get good enough. I had to accept that my algorithm was going to make mistakes, and move on. But what struck me most while reading through the comments were the people who either a) assumed that the problem is simple (discounting those edge cases that are becoming more and more common on the web) or b) were stuck in their own little world where the problem can be solved by waving a big stick at your users. On a coding forum, I was quite surprised at the number of assumptions that people made about the types of situations in which this becomes useful.

sancho on October 30, 2008 4:44 AM

RFC2396 has a Recommendations for Delimiting URI in Context section that talks about how URIs *should* be encased. Not everyone follows that part though.

Shadow on October 30, 2008 4:52 AM

Dude, that's so easy, just have your server try to access the URL with and without the final paren, and see which one actually works ;)

jwickers on October 30, 2008 4:53 AM

Would it be a good approach to have the autolinker request to the potential URL and see if it comes back with a 404?

Then you could spot bad URLs and ask the poster to fix them.

Graham Stewart on October 30, 2008 4:56 AM

Damn jwickers types faster.

Graham Stewart on October 30, 2008 5:07 AM

jwickers, Graham Stewart - please take a moment to consider the malicious uses for this idea.

Here's just two ideas for exploiting a 'validating autolinker':

1) Create a DoS condition on the host, or a third party site by passing in hundreds or thousands of URLs which need to get tested. Potentially executing expensive (in resource terms) requests.

2) Access and modify protected resources which are only accessible from 'inside' the firewall (management sites, router configuration settings, and many other things)

Always assume that whatever input is received is a deliberate attempt to exploit or subvert your application. Certainly validate that the input is legal, but you should not automatically request unknown third-party resources without significant constraints around it.

As for solving the issue Jeff is talking about - wouldn't a backref test in the regex be the easiest solution?

Will Hughes on October 30, 2008 5:50 AM

How about spaces in URLs?

http://www.google.com/codesearch?q=jeff atwood

Draconian to require the user to put a %20 instead?

Vinzent Hoefler on October 30, 2008 6:00 AM

To make matters worse, this doesn't correctly parse the following:
See my site (at http://example.com)

Dinah on October 30, 2008 6:06 AM

Avoid using parens (or any unusual characters, for that matter) in URLs you create

Fucking lazy programmers solution right there

BAWWWWWW DONT DO THAT IT MAKES MY LIFE HARD, fuck off back to your code you lazy twat

Trev on October 30, 2008 6:10 AM

Okay, so having spent a bit of time wrangling with testing a backref... I'll admit it's not so easy. I'm sure there's a way to test that a backref contains something - but maybe I'm getting my XSLT and regexes mixed up.

Will Hughes on October 30, 2008 6:16 AM

Useful regex resources for those not wanting to reinvent the wheel:

RegExLib web site - Very useful library of common regular expressions
http://regexlib.com/

Mastering Regular Expressions in case you really want to understand how regular expressions work
http://oreilly.com/catalog/9780596528126/index.html

David Sheeks on October 30, 2008 6:18 AM

Adium has a nice library for detecting hyperlinks: http://cloggedtubes.com/development/the_aihyperlinks_framework_or_how_adium_finds_links

Tim Trueman on October 30, 2008 6:22 AM

What I would like to see is a regular expression that will avoid any links that have already been enclosed in <a> tags.

That is, linkify this link: http://www.google.com

But do not re-linkify this link: <a href="http://www.google.com/">http://www.google.com/</a>

Chris Dary on October 30, 2008 6:24 AM
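One common way to approach the request in the comment above is to let a single pattern match either a complete existing anchor element or a bare URL, and rewrite only the latter. The following Python sketch assumes that technique; the names TOKEN and linkify are hypothetical, and the URL part reuses the post's whitelist pattern.

```python
import re
from html import escape

# Alternative 1: a whole existing <a ...>...</a> element (left untouched).
# Alternative 2: a bare URL, using the whitelist pattern from the post.
TOKEN = re.compile(
    r"(<a\b[^>]*>.*?</a>)"
    r"|(\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])",
    re.IGNORECASE | re.DOTALL,
)

def linkify(text):
    """Wrap bare URLs in anchor tags, skipping URLs already inside one."""
    def replace(match):
        if match.group(1) is not None:
            # Already an anchor element: return it unchanged.
            return match.group(1)
        url = escape(match.group(2), quote=True)  # "&" becomes "&amp;"
        return '<a href="{0}">{0}</a>'.format(url)
    return TOKEN.sub(replace, text)

print(linkify("linkify this link: http://www.google.com"))
print(linkify('<a href="http://www.google.com/">http://www.google.com/</a>'))
```

This only protects anchor elements. A URL inside some other tag's attribute would still be rewritten, so a real HTML parser is the safer choice when the input is arbitrary markup.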

first example does not work with first code example. There is a comma!

http://www.example.com,

David B on October 30, 2008 6:25 AM

IMHO the best approach would be to force your users to enter their text like this:

My website [url]http://www.example.com[/url] is awesome.

Dave Schenk on October 30, 2008 6:31 AM

URLs are hard, let's go shopping :)

atma on October 30, 2008 6:39 AM

I think that we are oversolving the problem.

First, Jeff, you have gone a little too far in suggesting that people change the URLs they enter because the poor little computer can't autolink correctly.

Second, the whole of the URL text is present even if not correctly autolinked. A savvy user will simply copy/paste the link. An unsavvy user shouldn't be on the internet anyway. So make a good effort, and then call it a day. You will catch 90% of everything.

-df5

drfloyd5 on October 30, 2008 6:40 AM

I agree with Dave Schenk; people aren't so stupid that they can't use simplified markup.

Or you could actually ping the url (assuming you only do this check once).

As for URL construction, I still like the way the PHP site does it.

leohorie on October 30, 2008 6:42 AM

Great post Jeff.

I can't say that I've ever thought about detecting parentheses in urls at all, much less the implications of parentheses surrounding a url. I am enlightened once again.

brad dunbar on October 30, 2008 6:45 AM

And unfortunately some sites' URLs even end in periods, which you called end-of-hyperlink characters.

e.g. http://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica/Aga_Khan_I.

yes the . is part of the URL.

The parentheses on Wikipedia pages are particularly annoying. I paste URLs into identi.ca and have trained myself to put %29 at the end because I know otherwise a URL with parentheses won't work.

pfctdayelise on October 30, 2008 6:48 AM

Can you use the regex balancing group technique to avoid matching a ending parenthesis when one is detected on the front before the http?

http://blog.stevenlevithan.com/archives/balancing-groups

Josh Bush on October 30, 2008 6:51 AM

There isn't a problem with URLs; the real problem is using regex for URL matching!

Eduardo Diaz on October 30, 2008 6:58 AM

By the way, you should be using s.Length - 2 to strip both the first and last parentheses. Using s.Substring(1, s.Length - 1) will have the same effect as s.Substring(1), since the remaining length after removing the first character is s.Length - 1.

Emperor XLII on October 30, 2008 7:00 AM

Shouldn't you be stripping the leading parenthesis and only removing the closing one if the leading one is missing? For example, it would seem that (http://example.com/ Example Site) would capture the leading parenthesis, which would never get stripped since the match doesn't have a closing parenthesis.

Jonathan Snook on October 30, 2008 7:03 AM

It's not just parens, it can be any characters surrounding the url. The first example shows a url followed by a comma. A comma is legal in urls, so is it in or out? There's no way to write a regex to correctly delimit a url in all cases, you have to know the grammar of the data. And in human communications the grammar is informal, a matter of convention in a particular group.

numerodix on October 30, 2008 7:09 AM

Rather than checking that first/last char are parentheses, I'd suggest removing any closing paren unless there's an unbalanced matching open paren in the URL itself. (I'm going on the assumption it's unlikely a programmer-type will construct a URL that intentionally has unbalanced parentheses.)

The trim off first/last strategy won't correctly deal with

Hey, try this (my friend's site at http://google.com)

This alternative strategy would handle that, as well as the following:

Here's a link (http://google.com)

Here's an ugly link: http://google.com/file(stuff)

Here's an ugly link (http://google.com/file(stuff))

Here's another one (with a comment http://google.com/file(stuff))

Anj on October 30, 2008 7:10 AM
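The balanced-paren idea from the comment above (also suggested by other commenters as a simple counter) is easy to sketch in Python. This is one possible reading of the strategy, not anyone's production code: grab a permissive candidate, then peel off trailing punctuation and any closing paren that has no matching opener inside the URL. The names CANDIDATE, TRAILING, trim_url, and find_urls are hypothetical.

```python
import re

# Permissive candidate: "http://" up to whitespace, angle bracket, or quote.
CANDIDATE = re.compile(r"\bhttp://[^\s<>\"]+")
# Punctuation assumed to belong to the sentence, not the URL.
TRAILING = ".,:;"

def trim_url(candidate):
    """Strip trailing punctuation and unbalanced closing parens."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More closers than openers: this one came from the prose.
            url = url[:-1]
        else:
            break
    return url

def find_urls(text):
    return [trim_url(m.group(0)) for m in CANDIDATE.finditer(text)]

print(find_urls("Hey, try this (my friend's site at http://google.com)"))
print(find_urls("Here's an ugly link (http://google.com/file(stuff))"))
```

The heuristic still guesses wrong for a URL that genuinely ends in an unbalanced paren or a period, which are the rare cases commenters describe elsewhere in the thread.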

The issue of international characters and other such things could probably be circumvented by using a well tested and long used existing regular expression to this problem.

http://search.cpan.org/~abigail/Regexp-Common-2.122/lib/Regexp/Common/URI/http.pm

This came up pretty quickly. The author of that is a pretty smart guy.

You'd potentially still have to wrap this regex inside another to apply your same approach with the parentheses. This is trivial.

Oh and yes, that's perl, but extracting the actual regex in use from that thing shouldn't be too difficult and most languages out there use PCRE or something very, very, very close to it.

Best tool for the job and all that.

Ben on October 30, 2008 7:15 AM

Just nitpicking, but shouldn't that code return s.Substring(1, s.Length - 2) if the idea is to remove both the opening and closing parens?

Lucas on October 30, 2008 7:16 AM

Oh, in recognizing you might not be familiar with how perl imports libraries, the regex linked earlier looks to be this:

my $http_uri = "(?k:(?k:http)://(?k:$host)(?::(?k:$port))?" .
"(?k:/(?k:(?k:$path_segments)(?:[?](?k:$query))?))?)";

(. is a concatenate operator) with the $ variables defined here
http://search.cpan.org/src/ABIGAIL/Regexp-Common-2.122/lib/Regexp/Common/URI/RFC2396.pm

As you can see, getting these regex right is harder than it would appear at first blush.

Ben on October 30, 2008 7:23 AM

Why not save some back-end processing time and just give the users a WYSIWYG editor?

You get easy-to-parse (X)HTML, the user clicks buttons.

Michael Thompson on October 30, 2008 7:37 AM


I've noticed that you have a certain tendency to see too many problems as nails that you can hit with your regex hammer :-)

The trouble is that regexes (provably) can only deal with very limited grammars.

As someone else pointed out, you're never going to get this perfect, as you are ultimately dealing with a human language, which no parsers yet written deal with perfectly. And what are you going to do if the URL is just in an example and not supposed to be a real one (in a code sample, for example)?

If this is just for markup purposes, just specify the format. People will learn that quicker than you can write code to parse English, or whatever.

Jim Cooper on October 30, 2008 7:46 AM

have you posted this at 2:30 in the morning?

you might need to look at this: {http://crazy-videoz.com/cool-stories/suggestions-for-sleeping-at-work/)}

Rus on October 30, 2008 7:52 AM

Why not just check for cases where it might be possible or likely the parse just got confused, and simply prompt the user before form submit? I know it's an additional step, but it's actually not that huge of an obstruction.

DW on October 30, 2008 7:59 AM

Jeff's on a roll lately.

AnonymousCoward on October 30, 2008 8:06 AM

This is why I prefer vb code for this purpose. Who was ever hurt by a little [url] [/url] ?

ProfessorTom on October 30, 2008 8:07 AM

I just force people to use [URL][/URL] if they want to include a URL. Then I don't have to worry about all these special cases... unless of course someone uses [URL] in their URL, but that's their own fault for having an absurd URL.

Kris on October 30, 2008 8:20 AM

Heh. Linkification in Firefox failed to handle most of the problematic URLs in the post and comments. The trailing paren in the first wikipedia URL doesn't get linkified, nor do the %28s. The one with the umlauts was somehow split into two URLs at the first u-umlaut. Guess it is harder than it looks.

Chris C. on October 30, 2008 8:27 AM

I guess this blog really has become focused entirely on web development. Sucks for me since I don't do web dev and couldn't care less about auto-linking URLs. When StackOverflow was started, CodingHorror jumped the shark :(

Kyle on October 30, 2008 8:43 AM

What it really comes down to is that parsing text for anything is one of the biggest pains in the ass when it comes to programming. Quite simply you never know what's coming.

HB on October 30, 2008 8:45 AM

http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)

is far more meaningful than
http://en.wikipedia.org/wiki/article?x4kp2

What if someone tries to make a URL like ( http://www.notethefirstspace.com)?

I think you should look for opening parens in the middle of the url like:

http://en.wikipedia.org/wiki/PC_Tools_(

Now you have found an opening paren, so you expect to find a closing paren at some point.

http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)
Once the closing paren is found, ignore any further closing parens until the end of the URL (unless you find another opening paren).

so if someone type:
looking further into (http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)) the subject I found out that parens suck
it would work

This assumes that for every opening paren there is a closing one, which is logical and probably true in almost all cases.

So the algorithm I'm proposing boils down to:
1. find the beginning of the URL (and ignore everything before it)
2. if you find a ( look for a matching )
3. look for the end of the URL (like a space or punctuation)
4. if no ) is found before the end of the URL, do nothing

So the only cases where this algorithm doesn't work are when the URL itself ends with an unmatched ), or when it has an unmatched ( and the user types the URL between parens without a space at the end.

Cases that it won't work:


check this out: http://www.example.com/finish?asxk)
the final parens would be left off of the url.


from my sources (http://www.example.com/finish(source)
the real url is http://www.example.com/finish(source
There is a ( in the middle but no closing one anywhere, and the user puts the URL inside parens without a space (or punctuation) at the end.
But the algorithm would include the final paren in the hyperlink.
If you really want to, you can handle this case by checking whether the URL started with a (, but then again, for something like from my sources (the excellent website example.com: http://www.example.com/finish(source) it would still not work.


Those 2 cases are probably very, very, very rare.

cases that would work:
from that old post (http://www.example.com/finish(asxk)) we found out...
from that old post (http://www.example.com/finish)asxk) we found out...
check this out: http://www.example.com/finish(asxk
check this out: http://www.example.com/finish)asxk

This algorithm is wikipedia safe.

If there is a thing we learn at the first years of college, it is string manipulation...

Hoffmann on October 30, 2008 8:48 AM

Correcting my above post:
What if someone tries to make a URL like ( http://www.notethefirstspace.com/note(space))?

Hoffmann on October 30, 2008 8:52 AM

I couldn't come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens.

Well, a closing paren at the end would probably mean an opening paren WITHIN the URL. How about a simple counter, counting up each opening paren in the url, counting down ending parens. If the assumed URL ends with a closing paren, and the counter is 0, that last paren is probably part of the URL.

mephane on October 30, 2008 8:53 AM

Kyle wrote:
I guess this blog really has become focused entirely on web development. Sucks for me since I don't do web dev and couldn't care less about auto-linking URLs. When StackOverflow was started, CodingHorror jumped the shark :(

This post in particular is about string manipulation applied to web development. Even if you do no web dev, you will still need to format strings sometimes...

Hoffmann on October 30, 2008 8:55 AM

Another one of your dangerous posts, where if all you have is a hammer (regex), everything needs to be a nail. You need to recall your finite automata course from college/university. A regex is equivalent to a FSM (finite state machine) which means that it cannot handle nesting, for that you need a stack automata, aka, a parser. If you used a LR(0) or LR(1) context free grammar or procedural code with a stack, you can quite easily handle URL syntax properly. A single stack automata is still a weak concept, it cannot parse all strings since it is not equivalent to a Turing machine, for that you need two stacks (that basically simulate an infinite tape).

old_timer on October 30, 2008 8:58 AM

I wrote about this in my book on regular expressions, having been responsible for plucking URLs out of financial news and press releases for years at Yahoo! Finance. The URL I used there is shown at the bottom of:

http://regex.info/listing.cgi?ed=3&p=207

This predated Wikipedia, and in any case, one wouldn't expect to find such URLs in the problem space (financial news). Still, I thought I'd mention it. The prose of the book, starting on page 206, discusses the approach taken to build the regex I ended up with (and indeed, it's full of heuristics).

Jeffrey Friedl on October 30, 2008 9:23 AM

I'm not sure that checking for start/end parentheses is bullet-proof.
What if the link is the first word, but not the whole content of the parentheses, AND it contains parentheses? Like this:

[...] as the many uses of the word Superman (http://en.wikipedia.org/wiki/Superman_(disambiguation) for a reference) demonstrate [...]

Your regex would crop the final ) from the Wiki link.

Filini on October 30, 2008 9:35 AM

This is one of those situations where I think it's OK to take a Worse-is-better approach. As a user I've learned to avoid putting punctuation like periods, commas, or close parens directly at the end of a URL.

It's a little surprising the first time you get that close paren tacked onto the link, but the reason why is fairly understandable for the user, so it's easy enough to learn to avoid. Why make an even more complicated heuristic which may still fail, but in a much less user-understandable way?

T.E.D. on October 30, 2008 9:36 AM

Yup, you certainly cannot completely solve this one since http://www.example.com, could either include the comma or not include it - no way of knowing for sure. I think your best bet is combining an 'easy' UI (similar to the popup used on SO for inserting links) with a powerful yet simple syntax - e.g. enclose URLs in []. Users who understand URLs (most of them don't, of course) are more than capable of using one of these approaches.

bobby on October 30, 2008 9:40 AM

Oops - I probably should have made that http://www.example.com/resource, - not sure about comma in a domain name. My previous post demonstrates that your current parser falls over, though :)

bobby on October 30, 2008 9:42 AM

Even your first example doesn't work. Your regular expression will include the comma in the link, when it's part of the surrounding text. While parentheses add complexity, it's not as much as you think, as you've oversimplified the no parentheses case.

This seems to be an example of every problem looking like a nail because the only tool you have is a hammer (er, regular expression matcher). You can do a much better job with a custom parser than you can with regular expressions. For example, you could check for balanced parentheses either around the URL and/or within it. (Nothing says paren in a URL must be balanced, but I'd wager that nearly all of them are.)

Using a regular expression to find the URL (possibly with some garbage at the end), and then some post-processing heuristics to trim off the garbage, has promise. But it feels kludgy.

There is no perfect solution; there will always be ambiguous cases. Nevertheless, a 99% solution is probably beyond pure regular expression matching. I'd write a custom scanner.

Adrian on October 30, 2008 9:43 AM

You really couldn't come up with a change to the regex to deal with delineated URLs? Huh, maybe this is unique to Perl, but back-references and some smart evaluation would appear to fix this problem.

For readability, I'm using Perl's /x modifier so that whitespace and comments in the expression are ignored; if you port this, you'll have to remove whitespace and comments.

my %pair = qw/( ) [ ] : : \/ \//;
my $left = quotemeta( join '', keys(%pair) );

s/\b # word border
([$left])? # optionally starts with ([:/, capture
([^\s]*?) # non-whitespace chars, non-greedy
($pair{$1})? # opposite of the pair we started with
\b/
$1<a href='$2'>$2<\/a>$3/x

Untested, but you get the idea. Backreferences are a major strength of Perl-style Regexes (the syntax in your language might vary slightly).

Darren on October 30, 2008 9:52 AM

Not to be nitpicky (okay, yes I am):

Jeff, you wrote an URL when it should be a URL.

Both the acronym and what it stands for (yoo/U and yoo-nuh-fawrm/Uniform) use the article an.

Pete on October 30, 2008 10:07 AM

Make that *use the article a. I'm on a roll this morning.

Pete on October 30, 2008 10:08 AM

... unless you pronounce it Earl, of course ;-)

bobby on October 30, 2008 10:11 AM

It should be -2, not -1:

return s.Substring(1, s.Length - 2);

Mike on October 30, 2008 10:46 AM

The reason you can't write a simple regular expression to capture every combination of parentheses is that the language of n opening parentheses followed by n closing parentheses is not a regular language...

Proof is in the Pumping Lemma, described here: http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages

caleb on October 30, 2008 10:50 AM

Linkification in Firefox failed to handle most of the problematic URLs in the post and comments. The trailing paren in the first wikipedia URL doesn't get linkified, nor do the %28s. The one with the umlauts was somehow split into two URLs at the first u-umlaut. Guess it is harder than it looks.

That's what I'm saying! It's a hard problem and almost nobody gets it right. Not even Paul Graham.

http://news.ycombinator.com/item?id=10889

Jeff Atwood on October 30, 2008 10:55 AM

I guess this blog really has become focused entirely on web development.

This post is relevant if you *USE* the web, IMO, since many (most?) web forums screw this algorithm up as well. As T.E.D. mentioned above, users have to learn to enter URLs the right way.

Of course it helps to be a programmer so you understand why this is happening.

Jeff Atwood on October 30, 2008 10:58 AM

What if you use some heuristics, like you suggest with your whitelist regex, plus some others (like including punctuation or not) to create an ordered list of link candidates and then try fetching them looking for the first one that returns 200 OK? Seems like something you could batch process offline a few times a day.

Nick Gerner on October 30, 2008 11:07 AM
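
A rough sketch of the candidate-list idea above, under stated assumptions: the candidates are ordered longest first (the comment does not say which order to use), and `is_ok` is a hypothetical callable standing in for whatever performs the real HTTP request and reports a 200. No network code is included, so the names `link_candidates` and `first_working_url` are illustrative only.

```python
# Assumption: these are the characters that might belong to the prose rather than the URL.
TRAILING = ").,;:!?"


def link_candidates(raw):
    """Return the raw match, then ever shorter versions with trailing punctuation removed."""
    candidates = [raw]
    url = raw
    while url and url[-1] in TRAILING:
        url = url[:-1]
        candidates.append(url)
    return candidates


def first_working_url(raw, is_ok):
    """Return the first candidate accepted by is_ok (hypothetical checker), or None."""
    for candidate in link_candidates(raw):
        if is_ok(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # A fake checker that pretends only one URL exists; a real one would issue a request.
    known = {"http://www.example.com"}
    print(first_working_url("http://www.example.com),", lambda url: url in known))
```

A server that answers 200 for any path would defeat this check, so it is best treated as one more heuristic rather than proof.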

the problem with urls?
nope, the problem is regex
first you have trouble with html tags, now with urls
just forget the regex and do some parsing work:

Barry Kelly knows that stuff:
A better heuristic for extracting URLs would be to use a stronger pattern formalism than regular expressions, such as a context-free grammar. Since humans generally produce the format for URLs, you could expect that URLs are highly unlikely to include unbalanced parens. Regular grammars can't express this constraint, but a context-free grammar can.


fane on October 30, 2008 11:17 AM

Jeff, you're right. Programming's hard. Let's just initiate a movement to cast shame on people who work within standards that we are ill-equipped to handle.

Or, we could just go shopping.

mbhunter on October 30, 2008 11:22 AM

Is anybody an AI student/hobbyist?

This problem seems just barely too difficult for simple rules,
and has many samples available from the wild.

Patrick on October 30, 2008 11:40 AM

Not all things are solvable.

Steve on October 30, 2008 11:47 AM

just checked my site and we do it wrong too...

Dan on October 30, 2008 11:48 AM

Jeff, you can't expect the users to learn to insert urls in the right way. They assume we are here to do the loading!

Saj on October 30, 2008 11:53 AM

What is most annoying is that spaces are valid URL characters. That can really jack up an auto-linker.

Billkamm on October 30, 2008 11:55 AM

I take the approach of ignoring preceding punctuation entirely and cutting off punctuation from the end of the URL entirely, taking care to leave balanced parentheses in. Examples:

http://example.com/something) - remove the close paren
http://example.com/something_(something) - leave alone
http://example.com/something,.; - remove the comma, dot and semicolon

This is not fool-proof, as some URLs may genuinely be formatted that way, but there is *no way* of knowing that while auto-linking, and the vast majority of URLs do *not* include punctuation at the end, with the exception of balanced parentheses, which can easily be accounted for.

Side note: I spend so much time on Stack Overflow now I've found myself wishing I could upvote comments here.

Trevor on October 30, 2008 11:55 AM
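
The trimming rule described above could look roughly like this. It is a sketch, not Trevor's actual code: the set of punctuation characters and the name `trim_url` are assumptions, and "balanced" is approximated by comparing the counts of opening and closing parens.

```python
# Assumption: punctuation that is always treated as part of the surrounding prose.
TRAILING_PUNCTUATION = ".,;:!?"


def trim_url(candidate):
    """Strip trailing punctuation and any closing paren that has no opening partner."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING_PUNCTUATION:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More ')' than '(' means this one is unbalanced, so drop it.
            url = url[:-1]
        else:
            break
    return url


if __name__ == "__main__":
    for sample in (
        "http://example.com/something)",
        "http://example.com/something_(something)",
        "http://example.com/something,.;",
    ):
        print(sample, "->", trim_url(sample))
```

As the comment itself admits, a URL that genuinely ends in a period or comma is cut short by this rule.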

Developers are simply too obsessed with borderline cases. What's wrong with good enough?

What's wrong with trying to solve a problem with regular expressions if you succeed to capture 99.99% of all cases?

I had a try:
http://www.blog.activa.be/2008/10/30/ExtractingURLsNotPerfectButQuotgoodEnoughquot.aspx

Philippe Leybaert on October 30, 2008 11:58 AM

As much as you might dislike it, some URLs do end in periods. I've been bitten by clever systems that assumed I didn't want the period in the link. (Some probably end in commas too, but I haven't encountered them.) I don't mind the heuristic to pick up URLs wrapped in parentheses, because if you really want to enter a URL with weird parentheses in it all you have to do is not precede it with an open paren (. However, by always excluding a trailing period, you make it impossible to enter some URLs. I think it's not unreasonable for a user to expect to be able to paste a URL alone on a blank line and expect it to be auto-linked correctly. I would rather have a system that gets it wrong when the user tries to do something complicated (append punctuation to the end of a URL and expect it not to be included) than one that gets it wrong when the user tries to do something simple (paste in a URL on its own).

Weeble on October 30, 2008 12:02 PM

Very informative post. If only the world was simple. Unfortunately complicated URLs are here to stay.

Matthew James Taylor on October 30, 2008 12:07 PM

That's a whole lot of work for trying to jam a square peg into a well-defined hole with an RFC spec that's been around for over a decade and everyone else has learned to live with.

You could do something like this: aaaa http://website.com bbbb

Or this: aaaa ( http://website.com ) bbbb

both are satisfactory.

Philihp on October 30, 2008 12:48 PM

PS, your URL detection thinks it's legal to put in a URL ;)

Philihp on October 30, 2008 12:49 PM

@Weeble: you may be right, but the best we can do is try to detect the majority of hyperlinks. If a website (-creator) is stupid enough to have pages ending with a punctuation mark, it's better to just ignore the site. Linking to it should be banned anyway. Using heuristic URL detection: mission accomplished :)

Philippe Leybaert on October 30, 2008 1:11 PM

Although this would require another column in the table, on initial input of text (or editing), find the links and test them for 404's. Mark the text as good if no 404's otherwise mark it as text with bad links.

Next you could use a cascading set of rules to auto-edit the text until all links return 200's, or have an admin tool and manually edit those links.

Guy Ellis on October 30, 2008 1:20 PM

In the case where you have to choose whether to include a trailing paren, why not just hit the URL and see if it exists?

In fact, you could do this in all cases and alert the user if he/she is referring to a link that goes to a non-registered domain or 404 page.

Brian on October 30, 2008 1:56 PM

It's simple: if you are trying to spot URLs in plain typed text, you will fail... unless you invent strong AI.

If someone is writing about URLs, then you will fail completely, as you turn text about URLs into URLs and they then have to attempt to escape it so you don't...

Why bother? Force them to delimit it if they want it turned into a link: [http://www.example.com]

Jaster on October 31, 2008 2:36 AM
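
A minimal sketch of the delimiter approach, assuming square brackets as the delimiters and assuming the URL itself contains no square brackets. The names `BRACKETED_URL` and `link_bracketed` are invented for the example, and the surrounding text is passed through untouched, which a real auto-linker would also need to escape.

```python
import html
import re

# Only text of the form [http://...] or [https://...] is treated as a link.
BRACKETED_URL = re.compile(r"\[(https?://[^\s\[\]]+)\]")


def link_bracketed(text):
    """Turn explicitly bracketed URLs into anchors and leave everything else alone."""
    def to_anchor(match):
        escaped = html.escape(match.group(1), quote=True)
        return '<a href="{0}">{0}</a>'.format(escaped)

    return BRACKETED_URL.sub(to_anchor, text)


if __name__ == "__main__":
    print(link_bracketed("See [http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)]."))
    print(link_bracketed("A bare http://www.example.com, is left as plain text."))
```

Because the user marks where the URL ends, parens, commas and periods inside the brackets need no guessing.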

It would be useful if you could 'ping' the intended URL and get back what the actual URL is, as a check... If nothing comes back, then it's wrong. If there is no site with the trailing paren but there is one without it, then assume you can drop it.
Trouble is, most unused URLs get forwarded to some hosting company.

Jonny on October 31, 2008 3:01 AM

This is the problem *with URLs*? Really!?!?

If you're going to delimit your URLs, just use characters that aren't valid in a URL.

PEBCAK.

Fred on October 31, 2008 6:42 AM

Maybe a post-expression solution to this is to look at the text that precedes the occurrence of the initial 'http://' etc. and look for an opening paren (. If you find an opening paren, then you disregard a closing paren if it is the last item in the URL...

Keith Jackson on October 31, 2008 7:22 AM
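
The look-behind check described above might be sketched like this. It assumes the naive `http://` pattern from the post, and `find_urls_with_context` is a made-up name for the illustration.

```python
import re

# Assumption: the same naive pattern as the post.
URL_CANDIDATE = re.compile(r"\bhttp://\S+")


def find_urls_with_context(text):
    """Drop a final ')' only when the URL is immediately preceded by '('."""
    urls = []
    for match in URL_CANDIDATE.finditer(text):
        url = match.group()
        start = match.start()
        preceded_by_paren = start > 0 and text[start - 1] == "("
        if preceded_by_paren and url.endswith(")"):
            url = url[:-1]
        urls.append(url)
    return urls


if __name__ == "__main__":
    print(find_urls_with_context("My website (http://www.example.com) is awesome."))
```

This still clips a URL that follows an opening paren and itself ends in a balanced paren, as in Filini's Superman example earlier in the thread.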

The comments to this entry are closed.