URLs are simple things. Or so you'd think. Let's say you wanted to detect an URL in a block of text and convert it into a bona fide hyperlink. No problem, right?
Visit my website at http://www.example.com, it's awesome!
To locate the URL in the above text, a simple regular expression should suffice -- we'll look for a string at a word boundary beginning with http:// , followed by one or more non-space characters:
\bhttp://[^\s]+
Piece of cake. This seems to work. There's plenty of forum and discussion software out there which auto-links using exactly this approach. Although it mostly works, it's far from perfect. What if the text block looked like this?
My website (http://www.example.com) is awesome.
This URL will be incorrectly encoded with the final paren. This, by the way, is an extremely common way average everyday users include URLs in their text.
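To see the failure concretely, here is a minimal Python sketch of the naive pattern run against both sample sentences. The name `NAIVE_URL` is made up for illustration; the pattern is the one shown above.

```python
import re

# The naive pattern from above: "http://" at a word boundary,
# followed by one or more non-space characters.
NAIVE_URL = re.compile(r"\bhttp://[^\s]+")

samples = [
    "Visit my website at http://www.example.com, it's awesome!",
    "My website (http://www.example.com) is awesome.",
]

for text in samples:
    # Each match drags along the punctuation that follows the URL:
    # the comma in the first sentence, the closing paren in the second.
    print(NAIVE_URL.findall(text))
```

In both cases the match includes a trailing character the writer never meant as part of the link.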
What's truly aggravating is that parens in URLs are perfectly legal. They're part of the spec and everything:
only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
Certain sites, most notably Wikipedia and MSDN, love to generate URLs with parens. The sites are lousy with the damn things:
http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software) http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx
URLs with actual parens in them means we can't take the easy way out and ignore the final paren. You could force users to escape the parens, but that's sort of draconian, and it's a little unreasonable to expect your users to know how to escape characters in the URL.
http://en.wikipedia.org/wiki/PC_Tools_%28Central_Point_Software%29 http://msdn.microsoft.com/en-us/library/aa752574%28VS.85%29.aspx
To detect URLs correctly in most cases, you have to come up with something more sophisticated. Granted, this isn't the toughest problem in computer science, but it's one that many coders get wrong. Even coders with years of experience, like, say, Paul Graham.
If we're more clever in constructing the regular expression, we can do a better job.
\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
I couldn't come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens. Thus, there has to be a handful of postfix code to detect and discard the user-enclosed parens from the matched URLs:
if (s.StartsWith("(") && s.EndsWith(")"))
{
return s.Substring(1, s.Length - 2);
}
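For readers who don't use C#, the same two-step idea (whitelist regex, then strip user-supplied parens) might look roughly like this in Python. It is a sketch, not the code this site runs; `URL_PATTERN`, `strip_enclosing_parens` and `find_urls` are names invented for the example.

```python
import re

# The whitelist pattern from above, with an optional leading "(" so that
# a fully parenthesized URL is captured together with its parens.
URL_PATTERN = re.compile(
    r"\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]"
)


def strip_enclosing_parens(s):
    """Remove one pair of parens only when the match both starts and ends with one."""
    if s.startswith("(") and s.endswith(")"):
        return s[1:-1]
    return s


def find_urls(text):
    """Return every URL found in text, with user-added enclosing parens removed."""
    return [strip_enclosing_parens(m.group(0)) for m in URL_PATTERN.finditer(text)]


print(find_urls("My website (http://www.example.com) is awesome."))
print(find_urls("See http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software) for more."))
```

Note that this only handles a URL that is the entire content of the parentheses; something like "(at http://example.com)" still keeps its closing paren.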
That's a whole lot of extra work, just because the URL spec allows parens. We can't fix Wikipedia or MSDN and we certainly can't change the URL spec. But we can ensure that our websites avoid becoming part of the problem. Avoid using parens (or any unusual characters, for that matter) in URLs you create. They're annoying to use, and rarely handled correctly by auto-linking code.
Posted by Jeff Atwood
Yup, I am doing a little project for myself. Looking at duplicate sites, http://www.google.com http://google.com google.com ftp://google.com etc. Regex, and indexof, substring, all needed to check the site.
Cheers, Sarkie.
Sarkie on October 30, 2008 02:34 AM
URL extracting indeed can be extremely troublesome.
Concerning your point 1: In times of IDNs the whitelist-character approach is at least problematic.
Are you really advocating avoiding the use of perfectly valid characters in URLs, just because they make a URL difficult to identify in code?
Many websites use regular expressions to validate email addresses, and these too will often fail to correctly identify perfectly valid email addresses. Would you recommend the victims of these coding failures just change their email address?
"You tried to solve a problem using regular expressions...and then you had two problems."
Sorry - couldn't resist.
But why the dilemma of telling people to "escape" their parentheses? Square brackets _aren't_ legitimate characters in URLs from what you've stated, yes? So...
"My website [http://www.example.com] is awesome."
...should work just fine.
I suppose this post should demonstrate whether your regex works as expected :-)
Mark on October 30, 2008 02:50 AM
Can't forget https, ftp and file URLs.
I display my latest Twitter entry on my homepage and decided to use the following to parse the text:
preg_replace("`\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]\b`", '<a href="\0">\0</a>', substr($item['title'], 10));
I wrote this only yesterday and completely forgot about parens.
Lloyd on October 30, 2008 02:51 AM
Umm, you are actually missing all URLs out there that contain non-ASCII characters in their domain names, which are perfectly valid: http://en.wikipedia.org/wiki/Internationalized_domain_name
fs111 on October 30, 2008 02:56 AM
while I usually agree with your points, this one leaves me baffled: by definition, automating semantics extraction from text without using context aware parser is not possible, so auto linking will always be far from perfectly working "as intended" by the user
the point of the problem there is: "as intended"
users are not required to know how to format a perfect href tag, nor is it desirable to allow rendering html through custom text, but users should know how to "play by the rules". if they want an autolink, they had better know that they can't use spaces, because spaces are treated as a link boundary and should be escaped as %20.
as a solution, I'd prefer a method to have a live or batched preview to allow user to test their link before posting. enabling links during writing permits user to see how the boundary system works, and to avoid mistakes
aaawww on October 30, 2008 02:59 AM
A better heuristic for extracting URLs would be to use a stronger pattern formalism than regular expressions, such as a context-free grammar. Since humans generally produce the format for URLs, you could expect that URLs are highly unlikely to include unbalanced parens. Regular grammars can't express this constraint, but a context-free grammar can.
Barry Kelly on October 30, 2008 03:05 AM
What's up with Domains with "Umlaut"?
They're by now perfectly legal and work in all modern browsers, as fs111 correctly stated. And they are already in use here in Germany.
Example (does not exist actually, but could and is valid):
http://www.müllärör.de/
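A quick Python sketch of what happens to such a domain: the ASCII-only whitelist from the post gives up at the first umlaut, while the host name itself can be turned into an ASCII (Punycode) form with Python's built-in `idna` codec. The variable names here are illustrative only.

```python
import re
from urllib.parse import urlsplit

# The ASCII whitelist pattern from the post.
ASCII_URL = re.compile(
    r"\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]"
)

url = "http://www.müllärör.de/"

# The whitelist match is cut off right before the first non-ASCII letter.
truncated = ASCII_URL.search(url).group(0)
print(truncated)

# The host name can be converted label by label to an ASCII-compatible
# encoding; non-ASCII labels come out with an "xn--" prefix.
host = urlsplit(url).hostname
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)
```

So a whitelist-based linkifier would need either a wider character class or a separate conversion step before it can treat these domains correctly.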
Since conscientious users use the preview feature, the url detection can be minimal and we can propose a specific syntax for exceptional case.
Well maybe we need a preview here also :-)
DomreiRoam on October 30, 2008 03:10 AM
Great, how does that work with the international characters allowed in domain names recently?
Xepol on October 30, 2008 03:14 AM
Actually, this problem can be solved with a single regular expression, although it's not an easy one. I have split the regex over several lines for clarity:
(?<=\()
\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
(?=\))
|
(?<=(?<wrap>[=~|_#]))
\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
(?=\k<wrap>)
|
\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
This will match any URL that is surrounded by parentheses, but also by any of the following characters: '=','~','|','_','#'.
Of course, it will fail in some very borderline cases, but I think it matches 99.9% of URLs entered by "users".
I think this is one of those situations where as you've stated
you can't get a solution to fit all cases. Therefore you have
to take a pragmatic approach. The most pragmatic I think is
to not allow () in urls, and in the .1% that have (), people can easily
cut and paste the URL rather than clicking:
Here is the simple python snippet I use to auto link URLs:
import re  # needed for re.sub below
r="((?:ftp|https?)://[^ \t\n\r()\"']+)"
comment=re.sub(r,r'<a rel="nofollow" href="\1">\1</a>',comment)
This is why I always write urls on a separate line, not only because it's more likely that any automatic link creator will detect it but also because it's easier to select and copy. At least I always have a space between the url and any punctuation
"Visit my website at http://www.example.com, it's awesome!"
is hard to select..
"Visit my website at http://www.example.com , it's awesome!"
is better but typographically wrong...
I prefer:
"Visit my website, it's awesome:
http://www.example.com
"
I would also suggest that everyone avoid the comma (,), legal though it is.
It is not uncommon to find http://www.example1.com/test.html, http://www.example2.com/test2.html, ...
dmajkic on October 30, 2008 03:32 AM
RFC2396 has a "Recommendations for Delimiting URI in Context" section that talks about how URIs *should* be encased. Not everyone follows that part though.
Shadow on October 30, 2008 03:52 AM
Dude, that's so easy, just have your server try to access the url with and without the final paren, and see which one actually works ;)
jwickers on October 30, 2008 03:53 AM
Would it be a good approach to have the autolinker request to the potential URL and see if it comes back with a 404?
Then you could spot bad URLs and ask the poster to fix them.
Damn jwickers types faster.
Graham Stewart on October 30, 2008 04:07 AM
jwickers, Graham Stewart - please take a moment to consider the malicious uses for this idea.
Here's just two ideas for exploiting a 'validating autolinker':
1) Create a DoS condition on the host, or a third party site by passing in hundreds or thousands of URLs which need to get tested. Potentially executing expensive (in resource terms) requests.
2) Access and modify protected resources which are only accessible from 'inside' the firewall (Management sites, router configuration settings, and many other things)
Always assume that whatever input is received is a deliberate attempt to exploit or subvert your application. Certainly validate that the input is legal, but you should not automatically request unknown third party resources without significant constraints around it.
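As a rough illustration of what "significant constraints" could mean, here is a Python sketch of a pre-flight check a validating autolinker might run before it requests anything. Everything in it is an assumption made for the example: the function names, the scheme whitelist, and the rule of refusing any host that resolves to a non-public address. It does not cover rate limiting, redirects, or DNS changing between the check and the request.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumed policy: web URLs only


def resolve_addresses(host):
    """Default resolver: the set of IP address strings the host maps to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_safe_to_probe(url, resolver=resolve_addresses):
    """Return True only for http(s) URLs whose host resolves solely to public IPs."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
            return False
        addresses = resolver(parts.hostname)
        # Reject if ANY address is private, loopback, link-local, etc.
        return bool(addresses) and all(
            ipaddress.ip_address(addr).is_global for addr in addresses
        )
    except (OSError, ValueError):
        # Unresolvable host or malformed URL/address: do not probe.
        return False
```

The resolver is passed in as a parameter so the check can be exercised without touching the network, for example `is_safe_to_probe("http://router.local/admin", resolver=lambda h: {"192.168.1.1"})`, which is refused.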
As for solving the issue Jeff is talking about - wouldn't a backref test in the regex be the easiest solution?
Will Hughes on October 30, 2008 04:50 AM
How about spaces in URLs?
http://www.google.com/codesearch?q=jeff atwood
"Draconian" to require the user to put a %20 instead?
Vinzent Hoefler on October 30, 2008 05:00 AM
To make matters worse, this doesn't correctly parse the following:
See my site (at http://example.com)
"Avoid using parens (or any unusual characters, for that matter) in URLs you create"
Fucking lazy programmers solution right there
BAWWWWWW DONT DO THAT IT MAKES MY LIFE HARD, fuck off back to your code you lazy twat
Trev on October 30, 2008 05:10 AM
okay, so having spent a bit of time trying to wrangle over testing a backref... I'll admit it's not so easy. I'm sure there's a way to test that a backref contains something - but maybe I'm getting my XSLT and Regexs mixed up.
Will Hughes on October 30, 2008 05:16 AM
What I would like to see is a regular expression that will avoid any links that have already been enclosed in <a> tags.
That is, linkify this link: http://www.google.com
But do not re-linkify this link: <a href="http://www.google.com/">http://www.google.com</a>
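One way to get that behaviour, sketched in Python, is to let a single pattern match either a complete existing anchor element or a bare URL, and only rewrite the second kind. This is an assumption-laden sketch (the names `LINK_OR_URL` and `linkify` are invented, and a regex is not a real HTML parser), but it shows the trick.

```python
import html
import re

LINK_OR_URL = re.compile(
    r"(<a\b[^>]*>.*?</a>)"          # group 1: an existing <a>...</a> element
    r"|(\bhttps?://[^\s<>\"']+)",   # group 2: a bare URL
    re.IGNORECASE | re.DOTALL,
)


def linkify(text):
    """Wrap bare URLs in anchors, leaving already-linked URLs untouched."""
    def replace(match):
        if match.group(1):
            # Already inside an anchor: emit it unchanged.
            return match.group(1)
        url = html.escape(match.group(2), quote=True)
        return '<a href="{0}">{0}</a>'.format(url)

    return LINK_OR_URL.sub(replace, text)


print(linkify("That is, linkify this link: http://www.google.com"))
print(linkify('<a href="http://www.google.com/">http://www.google.com</a>'))
```

Because the scan runs left to right, the whole anchor is consumed before the URL inside it ever gets a chance to match on its own.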
Chris Dary on October 30, 2008 05:24 AM
first example does not work with first code example. There is a comma!
David B on October 30, 2008 05:25 AM
IMHO the best approach would be to force your users to enter their text like this:
My website [url]http://www.example.com[/url] is awesome.
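With explicit tags the parsing side gets almost trivial, since square brackets mark exactly where the URL starts and stops. A minimal Python sketch, assuming http/https only and with invented names (`URL_TAG`, `render_url_tags`):

```python
import html
import re

# Everything between [url] and [/url] is the URL; parens need no special care.
URL_TAG = re.compile(r"\[url\](https?://[^\s\[\]]+)\[/url\]", re.IGNORECASE)


def render_url_tags(text):
    """Turn [url]...[/url] markup into HTML anchors."""
    def replace(match):
        url = html.escape(match.group(1), quote=True)
        return '<a href="{0}">{0}</a>'.format(url)

    return URL_TAG.sub(replace, text)


print(render_url_tags("My website [url]http://www.example.com[/url] is awesome."))
```

A real forum would also escape the rest of the text before output; that part is left out here to keep the example short.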
Dave Schenk on October 30, 2008 05:31 AM
URLs are hard, let's go shopping :)
atma on October 30, 2008 05:39 AM
I think that we are oversolving the problem.
First, Jeff, you have gone a little too far in suggesting that people change the URLs they enter because the poor little computer can't autolink correctly.
Second, the whole of the URL text is present even if not correctly autolinked. A savvy user will simply copy/paste the link. An unsavvy user shouldn't be on the internet anyway. So make a good effort, and then call it a day. You will catch 90% of everything.
-df5
drfloyd5 on October 30, 2008 05:40 AM
Your problem isn't about parsing URLs - it's about parsing natural(-ish) language (i.e. no formal syntax) to extract some information with a formal syntax.
To do this job properly, you need some level of context sensitivity and some heuristics looking at leading and trailing characters (if any) to decide what to do.
Gmail seems to do a pretty good job of that. Certainly better than just saying 'ooh, ooh - don't use parentheses in URLs' - that's just lame.
Stuart Dootson on October 30, 2008 05:41 AM
I agree with Dave Schenk; people aren't so stupid that they can't use simplified markup.
Or you could actually "ping" the url (assuming you only do this check once).
As for URL construction, I still like the way the PHP site does it.
leohorie on October 30, 2008 05:42 AM
Great post Jeff.
I can't say that I've ever thought about detecting parentheses in urls at all, much less the implications of parentheses surrounding a url. I am enlightened once again.
Can you use the regex balancing group technique to avoid matching a ending parenthesis when one is detected on the front before the http?
http://blog.stevenlevithan.com/archives/balancing-groups
Josh Bush on October 30, 2008 05:51 AM
there isn't a problem with URLs, the real problem is using regex for url matching!
Eduardo Diaz on October 30, 2008 05:58 AM
By the way, you should be using "s.Length - 2" to strip both the first and last parentheses. Using "s.Substring(1, s.Length - 1)" will have the same effect as "s.Substring(1)", since the remaining length after removing the first character is "s.Length - 1".
Emperor XLII on October 30, 2008 06:00 AM
Shouldn't you be stripping the leading parenthesis and only removing the closing one if the leading one is missing? For example, it would seem, "(http://example.com/ Example Site)" would capture the leading parenthesis and would never get stripped since it doesn't have a closing parenthesis.
Jonathan Snook on October 30, 2008 06:03 AM
It's not just parens, it can be any characters surrounding the url. The first example shows a url followed by a comma. A comma is legal in urls, so is it in or out? There's no way to write a regex to correctly delimit a url in all cases, you have to know the grammar of the data. And in human communications the grammar is informal, a matter of convention in a particular group.
numerodix on October 30, 2008 06:09 AM
Rather than checking that first/last char are parentheses, I'd suggest removing any closing paren unless there's an unbalanced matching open paren in the URL itself. (I'm going on the assumption it's unlikely a programmer-type will construct a URL that intentionally has unbalanced parentheses.)
The "trim off first/last" strategy won't correctly deal with
"Hey, try this (my friend's site at http://google.com)"
This alternative strategy would handle that, as well as the following:
"Here's a link (http://google.com)"
"Here's an ugly link: http://google.com/file(stuff)"
"Here's an ugly link (http://google.com/file(stuff))"
"Here's another one (with a comment http://google.com/file(stuff))"
The issue of international characters and other such things could probably be circumvented by using a well tested and long used existing regular expression to this problem.
http://search.cpan.org/~abigail/Regexp-Common-2.122/lib/Regexp/Common/URI/http.pm
This came up pretty quickly. The author of that is a pretty smart guy.
You'd potentially have to still wrap this regex inside another to apply your same approach with the parenthesis. This is trivial.
Oh and yes, that's perl, but extracting the actual regex in use from that thing shouldn't be too difficult and most languages out there use PCRE or something very, very, very close to it.
Best tool for the job and all that.
Ben on October 30, 2008 06:15 AM
Just nitpicking, but shouldn't that code return s.Substring(1, s.Length - 2) if the idea is to remove both the opening and closing parens?
Lucas on October 30, 2008 06:16 AM
Oh, in recognizing you might not be familiar with how perl imports libraries, the regex linked earlier looks to be this:
my $http_uri = "(?k:(?k:http)://(?k:$host)(?::(?k:$port))?" .
"(?k:/(?k:(?k:$path_segments)(?:[?](?k:$query))?))?)";
(. is a concatenate operator) with the $ variables defined here
http://search.cpan.org/src/ABIGAIL/Regexp-Common-2.122/lib/Regexp/Common/URI/RFC2396.pm
As you can see, getting these regex right is harder than it would appear at first blush.
Ben on October 30, 2008 06:23 AM
URIs are not the problem; the problem is that you cannot place URIs in freeform text and hope to precisely extract URIs from that text. Your options are to restrict the form of URIs (what you're proposing) or to restrict the context in which the URIs exist (the text). The former is defined in an RFC; your attempt to refine the URI syntax is hopeless: how will you enforce this restriction on the web? The latter problem has already been solved by not using freeform text and instead using a markup language. There are many; pick one.
If you can't control the context then you basically need to make a best effort and accept that you will imperfectly parse URIs.
Aaron Evans on October 30, 2008 06:24 AM
Take a look at appendix E of RFC 2396 <http://www.ietf.org/rfc/rfc2396.txt>, as it talks about how to include URLs in context (prose). It suggests the use of angle-brackets.
When a URL is discovered that is enclosed within angle brackets, the regex should be much more permissive in what characters it allows.
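A small Python sketch of that idea: once the writer has supplied angle brackets, the pattern can accept anything up to the closing bracket, parens and commas included. The name `BRACKETED_URL` and the list of schemes are assumptions for the example.

```python
import re

# Inside <...> the URL is explicitly delimited, so the only characters
# excluded are the brackets themselves and whitespace.
BRACKETED_URL = re.compile(r"<((?:https?|ftp)://[^<>\s]+)>")

text = (
    "Take a look at appendix E of RFC 2396 "
    "<http://www.ietf.org/rfc/rfc2396.txt>, and also at "
    "<http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)>."
)

# The trailing comma and full stop stay outside the captured URLs.
print(BRACKETED_URL.findall(text))
```

Text that doesn't use the brackets would still need the heuristic treatment discussed elsewhere in this thread.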
Why not save some back-end processing time and just give the users a WYSIWYG editor?
You get easy-to-parse (X)HTML, the user clicks buttons.
Michael Thompson on October 30, 2008 06:37 AM
Seems to be a common theme on this blog lately: If a problem can't be solved with a regex and 2 or 3 lines of code then it obviously can't be solved at all, and there must be something wrong with some *other* aspect of the problem.
Whatever happened to "always assume that the problem is in your own code?"
Most people simply choose not to bother with this "problem". Every kind of markup under the sun has provisions for links, and if some braindead user can't be bothered to link properly then he'll have to deal with the majority of people not following his link. As others have stated, link extraction with a regex is tantamount to auto repair with duct tape; be prepared for your solution to break easily and often, and don't blame the car.
Aaron G on October 30, 2008 06:41 AM
I've noticed that you have a certain tendency to see too many problems as nails that you can hit with your regex hammer :-)
The trouble is that regexes (provably) can only deal with very limited grammars.
As someone else pointed out, you're never going to get this perfect, as you are ultimately dealing with a human language, which no parsers yet written deal with perfectly. And what are you going to do if the URL is just in an example and not supposed to be a real one (in a code sample, for example)?
If this is just for markup purposes, just specify the format. People will learn that quicker than you can write code to parse English, or whatever.
Jim Cooper on October 30, 2008 06:46 AM
have you posted this at 2:30 in the morning?
you might need to look at this: {http://crazy-videoz.com/cool-stories/suggestions-for-sleeping-at-work/)}
Rus on October 30, 2008 06:52 AM
Why not just check for cases where it might be possible or likely the parse just got confused, and simply prompt the user before form submit? I know it's an additional step, but it's actually not that huge of an obstruction.
DW on October 30, 2008 06:59 AM
Jeff's on a roll lately.
AnonymousCoward on October 30, 2008 07:06 AM
This is why I prefer vb code for this purpose. Who was ever hurt by a little [url] [/url] ?
ProfessorTom on October 30, 2008 07:07 AM
I just force people to use [URL][/URL] if they want to include a URL. Then I don't have to worry about all these special cases... unless of course someone uses [URL] in their URL, but that's their own fault for having an absurd URL.
Kris on October 30, 2008 07:20 AM
Heh. Linkification in Firefox failed to handle most of the problematic URLs in the post and comments. The trailing paren in the first wikipedia URL doesn't get linkified, nor do the %28s. The one with the umlauts was somehow split into two URLs at the first u-umlaut. Guess it is harder than it looks.
Does FaceBook have a good solution? When you write messages on there, about a second after you've typed a URL, it not only grabs the URL and makes it appear as a hyperlink so you can see what the end result will be, but it grabs content from the URL and gives a preview with a "major" image and content. Of course, if they don't do a good job of grabbing valid URLs, it's a moot point. But if they do a good job... don't reinvent the wheel.
Jason Beck on October 30, 2008 07:38 AM
I guess this blog really has become focused entirely on web development. Sucks for me since I don't do web dev and couldn't care less about auto-linking URLs. When StackOverflow was started, CodingHorror jumped the shark :(
Kyle on October 30, 2008 07:43 AM
What it really comes down to is that parsing text for anything is one of the biggest pains in the ass when it comes to programming. Quite simply you never know what's coming.
HB on October 30, 2008 07:45 AM
http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)
is far more meaningful than
http://en.wikipedia.org/wiki/article?x4kp2
What if someone tries to make a url like "( http://www.notethefirstspace.com)"?
I think you should look for opening parens in the middle of the url like:
http://en.wikipedia.org/wiki/PC_Tools_(
then you found the parens, since you found it you expect to have a closing parens at some point
http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)
found the close parens, ignoring all the closing parens until the end of the url (unless you find another opening parens)
so if someone type:
"looking further into (http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)) the subject I found out that parens suck"
it would work
This assumes that for every opening parens there is a closing one which is logical and probably true for almost all the cases.
So the algorithm I'm proposing boils down to:
1. find the beginning of the url (and ignore everything before it)
2. if you find a "(", look for a matching ")"
3. look for the end of the url (like a space or punctuation)
4. if no ")" is found before the end of the url, do nothing
so the only cases where this algorithm doesn't work are when the url itself ends with a ")", or when it has a "(" and the user types the url between parens without a space at the end.
Cases that it won't work:
"check this out: http://www.example.com/finish?asxk) "
the final parens would be left off of the url.
"from my sources (http://www.example.com/finish(source)"
the real url is http://www.example.com/finish(source
there is a "(" in the middle but no closing one at any part and the user puts the url inside parens without space (or punctuation) at the end.
But the algorithm would get the final parens into the hyperlink.
If you really want to, you can eliminate this case by checking whether the url started with a "(", but then again if it was something like "from my sources (the excellent website example.com: http://www.example.com/finish(source)" it would still not work.
Those 2 cases are probably very, very, very rare.
cases that would work:
"from that old post (http://www.example.com/finish(asxk)) we found out..."
"from that old post (http://www.example.com/finish)asxk) we found out..."
"check this out: http://www.example.com/finish(asxk "
"check this out: http://www.example.com/finish)asxk "
This algorithm is wikipedia safe.
If there is a thing we learn at the first years of college, it is string manipulation...
Hoffmann on October 30, 2008 07:48 AM
Correcting my above post:
What if someone tries to make a url like "( http://www.notethefirstspace.com/note(space))"?
"I couldn't come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens."
Well, a closing paren at the end would probably mean an opening paren WITHIN the URL. How about a simple counter, counting up each opening paren in the url, counting down ending parens. If the assumed URL ends with a closing paren, and the counter is >0, that last paren is probably part of the URL.
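The counter described here is only a few lines of code. Below is a Python sketch under the assumption that a loose "anything up to whitespace" match is taken first and the counter then decides about a final closing paren; `CANDIDATE`, `trim_closing_paren` and `extract_urls` are names made up for the example.

```python
import re

# Step 1: grab a loose candidate (everything up to the next whitespace).
CANDIDATE = re.compile(r"\bhttps?://[^\s]+")


def trim_closing_paren(candidate):
    """Drop a final ')' unless an earlier '(' in the candidate is still open."""
    if not candidate.endswith(")"):
        return candidate
    depth = 0
    for ch in candidate[:-1]:
        if ch == "(":
            depth += 1          # count up for every opening paren
        elif ch == ")" and depth > 0:
            depth -= 1          # count down for every closing paren
    # depth > 0 means the final ')' closes a paren opened inside the URL.
    return candidate if depth > 0 else candidate[:-1]


def extract_urls(text):
    return [trim_closing_paren(m.group(0)) for m in CANDIDATE.finditer(text)]


print(extract_urls("Here's a link (http://google.com)"))
print(extract_urls("Here's an ugly link (http://google.com/file(stuff))"))
```

It handles the examples raised earlier in the thread, but it cannot know about a URL that genuinely ends in an unmatched paren; that case stays ambiguous.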
mephane on October 30, 2008 07:53 AM
Kyle wrote:
"I guess this blog really has become focused entirely on web development. Sucks for me since I don't do web dev and couldn't care less about auto-linking URLs. When StackOverflow was started, CodingHorror jumped the shark :("
This post in particular is about string manipulation applied to web development. Even if you do no web dev you still need to format strings sometimes...
Hoffmann on October 30, 2008 07:55 AM
Another one of your "dangerous" posts, where if all you have is a hammer (regex), everything needs to be a nail. You need to recall your finite automata course from college/university. A regex is equivalent to a FSM (finite state machine) which means that it cannot handle nesting, for that you need a stack automata, aka, a parser. If you used a LR(0) or LR(1) context free grammar or procedural code with a stack, you can quite easily handle URL syntax properly. A single stack automata is still a weak concept, it cannot parse all strings since it is not equivalent to a Turing machine, for that you need two stacks (that basically simulate an infinite tape).
I wrote about this in my book on regular expressions, having been responsible for plucking URLs out of financial news and press releases for years at Yahoo! Finance. The URL I used there is shown at the bottom of:
http://regex.info/listing.cgi?ed=3&p=207
This predated Wikipedia, and in any case, one wouldn't expect to find such URLs in the problem space (financial news). Still, I thought I'd mention it. The prose of the book, starting on page 206, discusses the approach taken to build the regex I ended up with (and indeed, it's full of heuristics).
Jeffrey Friedl on October 30, 2008 08:23 AM
I'm not sure that checking for start/end parentheses is bullet-proof.
What if the link is the first word, but not the whole content of the parentheses, AND it contains parenthesis? Like this:
[...] as the many uses of the word Superman (http://en.wikipedia.org/wiki/Superman_(disambiguation) for a reference) demonstrate [...]
Your RegExpr should crop the final ")" from the Wiki link.
Filini on October 30, 2008 08:35 AM
This is one of those situations where I think it's OK to take a Worse-is-better approach. As a user I've learned to avoid putting punctuation like periods, commas, or close parens directly at the end of a URL.
It's a little surprising the first time you get that close paren tacked onto the link, but the reason why is fairly understandable for the user, so easy enough to learn to avoid. Why make an even more complicated heuristic which may still fail, but in a much less user-understandable way?
T.E.D. on October 30, 2008 08:36 AM
Yup, you certainly cannot completely solve this one since "http://www.example.com," could either include the comma or not include it - no way of knowing for sure. I think your best bet is combining an 'easy' UI (similar to the popup used on SO for inserting links) with a powerful yet simple syntax - e.g. enclose URLs in []. Users who understand URLs (most of them don't, of course) are more than capable of using one of these approaches.
bobby on October 30, 2008 08:40 AM
Oops - I probably should have made that "http://www.example.com/resource," - not sure about comma in a domain name. My previous post demonstrates that your current parser falls over, though :)
bobby on October 30, 2008 08:42 AM
Even your first example doesn't work. Your regular expression will include the comma in the link, when it's part of the surrounding text. While parentheses add complexity, it's not as much as you think, as you've oversimplified the no parentheses case.
This seems to be an example of every problem looking like a nail because the only tool you have is a hammer (er, regular expression matcher). You can do a much better job with a custom parser than you can with regular expressions. For example, you could check for balanced parentheses either around the URL and/or within it. (Nothing says paren in a URL must be balanced, but I'd wager that nearly all of them are.)
Using a regular expression to find the URL (possibly with some garbage at the end), and then some post-processing heuristics to trim off the garbage, has promise. But it feels kludgy.
There is no perfect solution; there will always be ambiguous cases. Nevertheless, a 99% solution is probably beyond pure regular expression matching. I'd write a custom scanner.
Adrian on October 30, 2008 08:43 AM
You really couldn't come up with a change to the regex to deal with delineated URLs? Huh, maybe this is unique to Perl, but back-references and some smart evaluation would appear to fix this problem.
For readability, I'm using Perl's "/x" operator so that whitespace and comments in the expression are ignored; if you port this, you'll have to remove whitespace and comments.
my %pair = qw/( ) [ ] : : \/ \//;
my $left = quotemeta( join '', keys(%pair) );
s/\b # word border
([$left])? # optionally starts with ([:/, capture
([^\s]*?) # non-whitespace chars, non-greedy
($pair{$1})? # opposite of the pair we started with
\b/
$1<a href='$2'>$2</a>$3/x
Untested, but you get the idea. Backreferences are a major strength of Perl-style Regexes (the syntax in your language might vary slightly).
Not to be nitpicky (okay, yes I am):
Jeff, you wrote "an URL" when it should be "a URL".
Both the acronym and what it stands for ("yoo"/U and "yoo-nuh-fawrm"/Uniform) use the article "an".
Pete on October 30, 2008 09:07 AM
Make that *use the article "a". I'm on a roll this morning.
Pete on October 30, 2008 09:08 AM
... unless you pronounce it "Earl", of course ;-)
bobby on October 30, 2008 09:11 AM
It should be -2, not -1:
return s.Substring(1, s.Length - 2);
Mike on October 30, 2008 09:46 AM
The reason you can't write a simple regular expression to capture every combination of parentheses is that the inclusion of n opening parentheses followed by n closing parentheses is no longer a regular language...
Proof is in the Pumping Lemma, described here: http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages
caleb on October 30, 2008 09:50 AM
> Linkification in Firefox failed to handle most of the problematic URLs in the post and comments. The trailing paren in the first wikipedia URL doesn't get linkified, nor do the %28s. The one with the umlauts was somehow split into two URLs at the first u-umlaut. Guess it is harder than it looks.
That's what I'm saying! It's a hard problem and almost nobody gets it right. Not even Paul Graham.
http://news.ycombinator.com/item?id=10889
> I guess this blog really has become focused entirely on web development.
This post is relevant if you *USE* the web, IMO, since many (most?) web forums screw this algorithm up as well. As T.E.D. mentioned above, users have to learn to enter URLs "the right way".
Of course it helps to be a programmer so you understand why this is happening.
Jeff Atwood on October 30, 2008 09:58 AM
What if you use some heuristics, like you suggest with your whitelist regex, plus some others (like including punctuation or not) to create an ordered list of link candidates and then try fetching them looking for the first one that returns 200 OK? Seems like something you could batch process offline a few times a day.
Nick Gerner on October 30, 2008 10:07 AM
the problem with urls?
nope, the problem is regex
first you have trouble with html tags, now with urls
just forget the regex and do some parsing work:
Barry Kelly knows that stuff:
"A better heuristic for extracting URLs would be to use a stronger pattern formalism than regular expressions, such as a context-free grammar. Since humans generally produce the format for URLs, you could expect that URLs are highly unlikely to include unbalanced parens. Regular grammars can't express this constraint, but a context-free grammar can."
Jeff, you're right. Programming's hard. Let's just initiate a movement to cast shame on people who work within standards that we are ill-equipped to handle.
Or, we could just go shopping.
mbhunter on October 30, 2008 10:22 AM
Is anybody an AI student/hobbyist?
This problem seems just barely too difficult for simple rules,
and has many samples available from the wild.
Not all things are solvable.
Steve on October 30, 2008 10:47 AM
just checked my site and we do it wrong too...
Dan on October 30, 2008 10:48 AM
Jeff, you can't expect the users to learn to insert urls in the right way. They assume we are here to do the loading!
Saj on October 30, 2008 10:53 AM
What is most annoying is that spaces are valid URL characters. That can really jack up an auto-linker.
Billkamm on October 30, 2008 10:55 AM
I take the approach of ignoring preceding punctuation entirely and cutting off punctuation from the end of the URL, taking care to leave balanced parentheses in. Examples:
http://example.com/something) <- remove the close paren
http://example.com/something_(something) <- leave alone
http://example.com/something,.; <- remove the comma, dot and semicolon
This is not fool-proof, as some URLs may genuinely be formatted that way, but there is *no way* of knowing that while auto-linking, and the vast majority of URLs do *not* include punctuation at the end, with the exception of balanced parentheses, which can easily be accounted for.
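A rough Python sketch of the trimming rule described above. The function name and the exact set of punctuation characters are assumptions, taken only from the three examples. It keeps dropping trailing punctuation and any closing paren that has no opening partner inside the candidate.

```python
# Assumed set of characters that rarely end a real URL (from the examples above).
TRAILING_PUNCTUATION = ".,;"

def trim_candidate(candidate: str) -> str:
    """Strip trailing punctuation and unbalanced closing parens from a URL candidate."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING_PUNCTUATION:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More closers than openers: the final paren belongs to the prose.
            url = url[:-1]
        else:
            break
    return url
```

As noted above, this is a heuristic: a URL that genuinely ends in a period or an unmatched paren will be truncated.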
Side note: I spend so much time on Stack Overflow now I've found myself wishing I could upvote comments here.
Trevor on October 30, 2008 10:55 AM
Developers are simply too obsessed with borderline cases. What's wrong with "good enough"?
What's wrong with trying to solve a problem with regular expressions if you succeed in capturing 99.99% of all cases?
I had a try:
http://www.blog.activa.be/2008/10/30/ExtractingURLsNotPerfectButQuotgoodEnoughquot.aspx
As much as you might dislike it, some URLs do end in periods. I've been bitten by "clever" systems that assumed I didn't want the period in the link. (Some probably end in commas too, but I haven't encountered them.) I don't mind the heuristic to pick up URLs wrapped in parentheses, because if you really want to enter a URL with weird parentheses in it all you have to do is not precede it with an open paren "(". However, by always excluding a trailing period, you make it impossible to enter some URLs. I think it's not unreasonable for a user to expect to be able to paste a URL alone on a blank line and expect it to be auto-linked correctly. I would rather have a system that gets it wrong when the user tries to do something complicated (append punctuation to the end of a URL and expect it not to be included) than one that gets it wrong when the user tries to do something simple (paste in a URL on its own).
Weeble on October 30, 2008 11:02 AM
That's a whole lot of work for trying to jam a square peg into a well-defined hole with an RFC spec that's been around for over a decade and everyone else has learned to live with.
You could do something like this: aaaa <http://website.com> bbbb
Or this: aaaa ( http://website.com ) bbbb
both are satisfactory.
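If writers do delimit their links with angle brackets as suggested above, extraction becomes almost trivial. A small Python sketch, with a hypothetical pattern and function name:

```python
import re

# Assumed pattern: a scheme, "://", then anything up to the closing angle bracket.
# Angle brackets and whitespace cannot appear unencoded inside the URL itself.
ANGLE_URL = re.compile(r"<([a-zA-Z][a-zA-Z0-9+.-]*://[^<>\s]+)>")

def delimited_urls(text: str) -> list[str]:
    """Return every URL that the writer wrapped in <angle brackets>."""
    return ANGLE_URL.findall(text)
```

Parens inside the brackets need no special handling, because the closing bracket marks the end of the URL unambiguously. The cost is that undelimited URLs are not linked at all.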
Philihp on October 30, 2008 11:48 AM
PS, your URL detection thinks it's legal to put ">" in a URL ;)
Philihp on October 30, 2008 11:49 AM
@Weeble: you may be right, but the best we can do is try to detect the majority of hyperlinks. If a website (-creator) is stupid enough to have pages ending with a punctuation mark, it's better to just ignore the site. Linking to it should be banned anyway. Using "heuristic" URL detection: mission accomplished :)
Philippe Leybaert on October 30, 2008 12:11 PM
Although this would require another column in the table, on initial input of text (or editing), find the links and test them for 404s. Mark the text as good if there are no 404s; otherwise mark it as text with bad links.
Next you could use a cascading set of rules to auto-edit the text until all links return 200s, or have an admin tool and manually edit those links.
Guy Ellis on October 30, 2008 12:20 PM
In the case where you have to choose whether to include a trailing paren, why not just hit the URL and see if it exists?
In fact, you could do this in all cases and alert the user if he/she is referring to a link that goes to a non-registered domain or 404 page.
Brian on October 30, 2008 12:56 PM
/* As some people have already said, the "find URLs" problem is trivial and can be solved very easily. Notice that this solution uses no "whitelists" (Jeff's new favorite buzzword), so it can handle pretty much anything you throw at it: Unicode, ftp://, whatever. The only trouble spots are "Did you mean http://example.com?" and URLs containing brackets (but see http://en.wikipedia.org/wiki/Template:Bracketed). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
/* Note: the int& parameters make this C++, not C. */
void extractURL(const char *text, int &start, int &length)
{
#define RETURN(a,b) do { start = a; length = b; return; } while (0)
    const char *t = text;
find_next_colon:
    const char *colon = strstr(t, "://");
    if (colon == NULL)
        RETURN(-1,-1);
    /* Get the preceding protocol ID; e.g., "http" or "ed2k".
     * The unsigned char casts keep the <ctype.h> calls defined
     * for bytes outside the ASCII range. */
    const char *s = colon;
    while (s > text && isalnum((unsigned char)s[-1])) --s;
    if (s == colon) {
        /* We have "$@#://http://example.com". Keep going. */
        t = colon+3;
        goto find_next_colon;
    }
    start = s-text;
    s = colon+3;
    /* URLs end with whitespace, ", or brackety things. Unbalanced
     * parentheses also end the URL; consider "(at http://example.com)"
     * as opposed to "(http://example.com, for example)".
     * strchr also matches the terminating NUL, so the loop stops
     * at the end of the string. */
    int parens = 0;
    while (!isspace((unsigned char)*s) && !strchr("\"<>[]{}", *s)) {
        if (*s == '(') ++parens;
        if (*s == ')') --parens;
        if (parens < 0) break;
        ++s;
    }
    /* Consider "http://en.wikipedia.org/wiki/Bang!". I've
     * decided arbitrarily that ! and ? may end a URL, but
     * we must correctly handle "I like http://example.com." */
    if (strchr(".,:;", s[-1])) --s;
    if (s == colon+3) {
        /* Reject "http://" with nothing following it. */
        t = colon+3;
        goto find_next_colon;
    }
    /* Accept the rest. */
    RETURN(start, (s-text) - start);
#undef RETURN
}
/* For testing: reads lines from stdin and prints each URL found,
 * with its offset relative to the current scan position. */
int main()
{
    char buffer[1000];
    while (fgets(buffer, sizeof buffer, stdin) != NULL) {
        const char *text = buffer;
        int start = 0;
        int len = 0;
        while (1) {
            extractURL(text, start, len);
            if (start == -1) break;
            printf("%d %d: \"%.*s\"\n", start, len, len, text+start);
            text += start+len;
        }
    }
    return 0;
}
What Barry Kelly said - you're *never* going to get it to be Completely Right with a regexp (especially, as others have said, with a non-Roman or even non-low-ASCII alphabet), and the regexp will rapidly, as you try, become completely write-only gibberish.
Use a *real* parser, in the form of an actual grammar.
(The "99% close enough" solution linked by a previous commenter of course fails utterly on the "non-Roman/low-ASCII" case by pretending that a-zA-Z0-9 is sufficient to recognise a name.
If one is willing to live with assuming that users will never want to link anywhere that uses an umlaut, an accent, or a non-Roman character, one is certainly free to... but that's probably a bad idea.)
(Is the Anonymous Coward with the C code *deliberately* writing obfuscated code for some reason? Or is it long-term C exposure that makes people think that's good style?
But at least that has the advantage of being an actual parser - if an ugly one - rather than trying to fit everything into a regular expression.)
Sigivald on October 30, 2008 01:45 PM
>>>
You could do something like this: <http://website.com>
<<<
This URI is in perfect angle brackets and yet, the parser recognizes the closing angle bracket as part of the URI.
Perfect example for another broken parser. :D
Vinzent Hoefler on October 30, 2008 01:48 PM
There's a few comments suggesting that you validate the URL by requesting it and checking if you get a 404. There's a few reasons against this:
1. Many dynamic sites written by newer coders won't give you a 404 if you request a bad page. e.g. http://example.com/page?id=4 , tack a bracket on the end, and you likely get a bad ID. The page would tell you, but you won't get a 404.
2. It could open both you and a poor target up to a DOS attack. Imagine someone submitting a post with 1,000,000 references to http://example.com .
Peter on October 30, 2008 02:54 PM
On a side note, comments posted here don't respect RFC2396.
Example: <http://www.example.com> -- the trailing angle-bracket gets included in the URL.
Nice, I'll have to update my JavaScript solution to include parens:
http://knol.google.com/k/adam-eivy/javascript-html-format-links-in-text/2a9qcf9a3ig0u/14#
I recently tackled this problem myself, and came to a similar regular expression (minor differences, and I think that Jeff's is better) plus some additional parsing to handle edge-cases and prevent the regular expression from becoming a complete mess. I think that every single person in the comments assumed that this problem is related solely to message boards. It's not. There are plenty of reasons that you might want to linkify text. You might be writing a web-based e-mail client. You might be writing a client for a chat or IM protocol. You might be trying to turn flat text files into something slightly more presentable on the web. In none of these cases can you reasonably expect the text to contain easy-to-parse URIs with bbcode-style tags or spaces surrounding the link text.
In my case, it was a personal project relating to a web interface for searching and viewing IRC logs. Lots of links get posted in IRC. The IDN problem is a non-issue, but other problems of parsing certainly are not. And I'm neither capable of enforcing URI standards, nor would I want to if I could.
As others have pointed out, all we can do is get "good enough." I had to accept that my algorithm was going to make mistakes, and move on. But what struck me most while reading through the comments were the people who either a) assumed that the problem is simple (discounting those edge cases that are becoming more and more common on the web) or b) are stuck in their own little world where the problem can be solved by waving a big stick at your users. On a coding forum, I was quite surprised at the number of assumptions that people made about the types of situations in which this becomes useful.
sancho on October 30, 2008 03:44 PM
Useful regex resources for those not wanting to reinvent the wheel:
RegExLib web site - Very useful library of common regular expressions
http://regexlib.com/
"Mastering Regular Expressions" in case you really want to understand how regular expressions work
http://oreilly.com/catalog/9780596528126/index.html
Adium has a nice library for detecting hyperlinks: http://cloggedtubes.com/development/the_aihyperlinks_framework_or_how_adium_finds_links
Tim Trueman on October 30, 2008 05:22 PM
And unfortunately some sites' URLs even end in periods, which you called "end-of-hyperlink characters".
e.g. http://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica/Aga_Khan_I.
yes the "." is part of the URL.
The parentheses on Wikipedia pages are particularly annoying. I paste URLs into identi.ca and have trained myself to put %29 at the end because I know otherwise a URL with parentheses won't work.
pfctdayelise on October 30, 2008 05:48 PM
Very informative post. If only the world was simple. Unfortunately complicated URLs are here to stay.
Matthew James Taylor on October 30, 2008 11:07 PM
It's simple: if you are trying to spot URLs in plain typed text, you will fail... unless you invent strong AI.
If someone is writing about URLs, then you will fail completely, as you turn text about URLs into URLs and they then have to attempt to escape it so you don't...
Why not just force them to delimit it if they want it turned into a link: [http://www.example.com]
Jaster on October 31, 2008 01:36 AM
It would be useful if you could 'ping' the intended URL and get returned what the actual URL is, as a check... if nothing comes back, then it's wrong. If there is no site with the trailing paren but there is one without it, then assume you can drop it.
Trouble is, most unused URL get forwarded to some hosting company.
This is the problem *"with URLs"*? Really!?!?
If you're going to delimit your URLs, just use characters that aren't valid in a URL.
PEBCAK.
Fred on October 31, 2008 05:42 AM
Maybe a post-expression solution to this is to look at the text that exists before the occurrence of the initial 'http://' etc. and look for an opening paren (. If you find an opening paren, then you disregard a closing paren if it is the last item in the URL...
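A minimal Python sketch of that idea, assuming the naive pattern from the original post for the first pass and a hypothetical find_urls helper: a trailing paren is dropped only when the text before the match still has an unclosed opening paren.

```python
import re

# The naive pattern from the post: http:// followed by non-space characters.
NAIVE_URL = re.compile(r"\bhttp://[^\s]+")

def find_urls(text: str) -> list[str]:
    """Find URLs, dropping a final ')' that closes a paren opened in the prose."""
    urls = []
    for match in NAIVE_URL.finditer(text):
        url = match.group()
        before = text[:match.start()]
        unclosed = before.count("(") - before.count(")")
        if unclosed > 0 and url.endswith(")"):
            url = url[:-1]  # the paren belongs to the surrounding sentence
        urls.append(url)
    return urls
```

This handles the "(http://www.example.com)" case from the post and leaves a bare Wikipedia-style URL alone. Other trailing punctuation is still not handled.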
Keith Jackson on October 31, 2008 06:22 AM
What about urls without http:// ? Like www.example.com
alvin on October 31, 2008 06:23 AM
Run all urls through a url shortening service and let them deal with it? ;)
justice on October 31, 2008 06:34 AM
"This is the problem *"with URLs"*? Really!?!?
If you're going to delimit your URLs, just use characters that aren't valid in a URL.
PEBCAK"
So, either you change your links, or EVERY SINGLE BASIC WEB USER changes their natural grammar, brought about by about 100 years of English writing, so that YOUR links work. HETUBA. (Hostility Exists Towards Users By Author)
1) If you want people to go to your new site, you need to make getting there so simple that they just have to click a forum link out of curiosity, and they're there, enjoying your thoughts on pea-soup and racial tensions.
2) If the link doesn't work, because you've added a parenthesis, the chances that they'll copy and paste your link into their address bar is effectively 0. It's too much work for the gain.
3) Wikipedia gets away with it, but their site is not your site. And to be honest, even then, they probably get way more "link of mouth" traffic to the links that don't include brackets.
Just define your own standard. (That's what Microsoft does!)
Users will figure it out eventually.
Practicality on October 31, 2008 06:59 AM
Parsing text for things like URLs is a similar problem to trying to detect spam - the inputs are as varied as people can imagine them. So stop trying to deal with it using a series of fixed and universal laws!
Suggested algorithm:
1. Get the feasible string: from the beginning of what you think might be a URL to the first space (or illegal character), allowing for i18n - you can use your regex here - and a list of any open parenthetics in the paragraph ( (,[,< etc).
2. Generate a series of the possible URLs from it, by dropping each of the characters from the end that could be wrong:
" Give me a URL (like http://www.example.com/querya?b)(ideally)? " becomes:
a."http://www.example.com/querya?b)(ideally)?"
b."http://www.example.com/querya?b)(ideally)"
c."http://www.example.com/querya?b)"
d."http://www.example.com/querya?b"
e."http://www.example.com/querya"
and any other variations you find useful.
Assign each one a rating based on:
. whether there are unbalanced parentheses inside
. whether the parenthesis would balance open ones in the paragraph - in this example the open bracket would be balanced by this close bracket, so that lowers the scores for a. and b.
. whether the URL is sensible - "blah.com/)" is less sensible than "blah.com/"
. any other good/bad valuation you can think of
3. Rank the options
If the top two (or more) options are very close or equal in ranking, then test for the existence of each by just polling the URLs in ranked order until you find a real one. If you adjust the threshold of how close is close, you should only be testing in rare cases. If you don't like polling, just pick one, you can't out-unwit every idiot or mistake.
4. Finally, return the selected URL
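The steps above could be sketched roughly as follows in Python. This is a simplified illustration, not the full algorithm. Candidates are generated only by dropping doubtful trailing characters one at a time, the penalty weights are arbitrary assumptions, ties simply go to the longest candidate instead of polling, and all names are hypothetical.

```python
DOUBTFUL_ENDINGS = ".,;:!?)"  # assumed set of characters that may not belong to the URL

def parens_balanced(text: str) -> bool:
    """True if every ')' closes an earlier '(' and none stay open."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def candidates(raw: str) -> list[str]:
    """Step 2: the raw string plus each version with one more trailing character dropped."""
    out = [raw]
    while raw and raw[-1] in DOUBTFUL_ENDINGS:
        raw = raw[:-1]
        out.append(raw)
    return out

def score(candidate: str, open_parens_in_context: int) -> int:
    """Rate a candidate; higher is better. The weights are illustrative guesses."""
    points = 0
    if not parens_balanced(candidate):
        points -= 2  # unbalanced parens inside the URL
    if candidate.endswith(")") and open_parens_in_context > 0:
        points -= 2  # this paren would close one opened in the paragraph
    if candidate and candidate[-1] in ".,;:!?":
        points -= 1  # sentence punctuation rarely ends a URL
    return points

def pick_url(raw: str, open_parens_in_context: int = 0) -> str:
    """Steps 3 and 4: rank the candidates and return the best (longest wins ties)."""
    return max(candidates(raw), key=lambda c: score(c, open_parens_in_context))
```

Calling pick_url with the raw match and the number of unclosed parens before it returns the preferred candidate. Adding or re-weighting rules only changes score, which is the appeal of the ranking approach.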
There are endless ways to improve it beyond even that - you could even try balancing the parentheses such that your wikipedia article has its missing bracket fixed. At some point, perhaps, it becomes a bit pointless, but if this is all in a library and isn't too slow, nobody need rewrite it again, and the users are happy.
For me, the power of the method is in using ranking to allow unlikely options - unless you can separate all the possible inputs on a Venn diagram (which you can't here), then some rules will work for some sets of inputs, and others for others, and you'll never find a complete set that works for all of them.
Phil H on October 31, 2008 07:18 AM
"We can't fix Wikipedia or MSDN and we certainly can't change the URL spec"
Well, technically, you _can_ fix Wikipedia. But the issue here is not that Wikipedia is "broken" -- as you point out it's perfectly valid spec.
The fact is that the nerds who decided on these specs weren't really considering user-friendliness. One needs to look no further than HREF for that. (We couldn't have just "link" or even "URL"? we have to have an abbreviation that is used in basically no other context, an abbreviation of a term that most non-techies wouldn't be familiar with even if it wasn't abbreviated?)
This is what happens when you let engineers design the world by themselves -- you end up with a dystopia that caters exclusively to the OCD.
Shmork on October 31, 2008 08:19 AM
This is an annoying one I have come across before. I had a couple of ideas to work around it, none of which is especially pleasing... so I basically ignored the problem iirc.
1.) Use bracket-matching and some parser, ignoring anything that is "inside" a URL. If you have a leftover open parenthesis then the closing one is part of the text, else it's part of the URL. This fails if the user fails to match their brackets, or if the URL contains just one opening parenthesis at the end or an unmatched closing one.
2.) Force the user to confirm the url if a bracket or similar character is detected. This can be done with an input box to avoid ambiguity. This should never fail, only annoy the user.
3.) Check for matching brackets inside the url to decide. I think this will fail only if there is a trailing parenthesis or if a closed one appears that matches an opening parenthesis inside the URL, which is supposed to be unmatched. e.g. http://foo.com/ba)r would get only http://foo.com/ba
Jheriko on October 31, 2008 09:53 AM
Easy solution: allow the user to preview their posting. If something is wrong, it is their responsibility to resolve it. A blind "post" button—like the one on this site—invites more errors than an arbitrary but predictable algorithm.
Shmork on October 31, 2008 10:54 AM
Hah, funny you mention it. A couple weeks ago I filed a feature request for Firefox for this very problem involving parentheses (https://bugzilla.mozilla.org/show_bug.cgi?id=458565).
As a result, Firefox 3.1 will encode parentheses when you copy a URL from the location bar. I expect some improvement in this situation you describe once Firefox 3.1 launches and starts to gain popularity, and I'd be really glad if other browsers followed that.
Daniel Luz on October 31, 2008 09:38 PM
…and I pity those who claim this is a trivial problem, or who think that merely saying you shouldn't do this or that will magically change all existing content on the web AND make people follow their imaginary rules.
No algorithm can solve a human communication ambiguity problem. You might be able to make pretty decent guesses, but it's simply impossible to have a perfect solution when you can't read the mind of whoever wrote the ambiguous hyperlink to know what their intent was. Parentheses don't necessarily have to be paired in URLs, and even in human language a writer may fail to pair them, so you can't magically be sure whether one should be part of a URL or not. Periods at the end of a URL are in an even worse situation.
Daniel Luz on October 31, 2008 09:59 PM
Oh God. I just found a library catalogue that routinely ends URLs in hyphens:
http://catalogue.nla.gov.au/Author/Home?author=Spigelman,%20James,%201946-
Now I have to learn how to encode a hyphen... ugh!!
pfctdayelise on October 31, 2008 10:47 PM
When everyone became aware that the URL should contain relevant words in order to gain points on Google's page rank, URLs became more and more similar to regular text (as far as a computer can distinguish).
They now allow spaces, punctuation, and special characters like ń or ü.
I mean come on, even a real-life human being would not correctly recognize a URL with spaces in between and punctuation at the end when it is embedded in normal text!
I propose we (as devs) should only care about a starting protocol, subdomain.domain, optional extra subdomains, and then an optional / followed by pretty much everything that is not a space (even though spaces are valid URL characters). There are pretty good regexps for this on regexlib.
Before that, though, we should have taken care of matching parentheses, angle brackets, square brackets, curly brackets and enclosing punctuation like " " and ''. Another optional step would be cleaning up random HTML like enclosing <a>'s. Note that these enclosing nightmares should be separated by at least one space, in order not to screw up our URLs with enclosing things inside.
But the space is where I draw the line. Even then, we can make an exception when it's enclosed in angle brackets.
This heuristic cannot be done with a single regexp as far as I'm aware, but it should give the 95%-99% accuracy we all wish for in a smart system without blowing our brains out.
Oh, space separation between enclosing nightmares was a mistake.
But we can't safely remove enclosing things that are "between" words, nor can we remove them when they end words, since a URL can end in ")".
I think I rushed into this, and now I'm considering Jheriko's approach; mixed with some preview, it would be some relief to the user who's just trying to get his URL shown as a hyperlink.
It will get even more complicated. Since, the following is also a
valid URL:
http://192.168.0.1/page.tml (this should parse fine)
http://[fe80::1]/page.html
:)
Vimal on November 1, 2008 05:38 AM
Context.
The magical keyword here is *context*.
By definition, most human languages are context-dependent, ergo, they are not parseable by a Deterministic Finite Automaton, for which RegExps are a type of shorthand (well, to a subtype of them, actually).
So this can never be completely solved by using a DFA, at most you can get approximations for most of the cases. I'd try solving it with a non-deterministic automaton and, upon running out of input, would select from the list of valid end-states the one where found strings satisfy most rules - this would be more or less an equivalent of "makes most sense".
Does anyone know of a sample text containing a list of all difficult cases, to test a program with?
Joe on November 1, 2008 09:04 AM
Two thoughts:
a) How abt do the url-ifaction on the fly when one is typing (like word's spell check)?
Should be pretty easy to do that in JS.
That way the person can correct it (by explicitly making the text a link) if the super smart algo gets into an edge case.
b) If you are anal abt correctness, how abt actually check if the URL returns a 200, by doing a HTTP HEAD and then decide to urlify the string in question?
we could be talking abt URLs (like this blog post) and do not want ppl to actually click the url - like say "http://example.com is a phony url"
Looks like quotes need to be handled :p by ur regexp
ajcb on November 3, 2008 09:12 AM
I don't think this post is too specific to web software at all. Regular expressions in general are present in lots of software, from programs distributed with the unix shell all the way to non-compiled scripting languages. This is just a post using hyperlinks as a case study, and talking about how much programming we need to do and how much is user responsibility for their actions. I actually enjoyed the post.
One thing I would like to say though, is that even though regex is our weapon of choice here, I think the post-code that you mention is pretty much the preferred solution and there should be a whole lot more of it. The regex, in this case, should just be a way of narrowing down where in a given string a URL is, not precisely, mind you, but generally. And users have to take some responsibility for their actions. We can't ping every url, because eventually a perfectly good html document will be present on a non-pingable host. We can't go to every url because of exploits. So, users will just have to settle for our ability to interpret their links.
Hutch on November 3, 2008 01:43 PM
ROFLMAO ..
The point isn't about smart or stupid users. They're given a lame freetext field for something, told they can paste links, and then are berated by the programmer because the perfectly valid links they enter get mangled. What's worse is the links provided by the user don't get interpreted properly and are broken from the users perspective.
A - Some of you aren't old enough to remember that the only way to stop (old) versions of outlook and other email clients from chopping long urls at 70-odd chars was to enclose them in parens. So users were taught a (bad) way to escape their url's and you just forgot.
B - Don't berate wiki or anyone else for making human-readable links that are within the url spec.
Conclusion: You're finding that your quick-solution regex is not going to cover all the bases. In a dreamworld, the Url spec is different and comes with the perfect regex in the box.
Snap out of it and roll your own state machine. It's probably all of a days work to extract url's without a regex. It's likely faster and more maintainable anyways. I bet it could be made 1-pass, minimal backtracking, and be a factor faster than a regex. Hmm, something like Anonymous Cowherd did.
The point of your blog entry should have been '99% just doesn't cut it. Regular expressions can't be made sophisticated enough to cover all permutations'
Lets go shopping indeed.
Another Cowherd on November 3, 2008 01:59 PM
I would auto link all the "normal" links and let the user specify a special mark for complicated links, like the <angle brackets> already mentioned. A preview would be fine, so the user will know when his link was not identified, then he just encloses it with <>. Smart users will learn to enclose links already on step 1 and avoid problems with the parser.
Aurelio Jargas on November 3, 2008 05:22 PM
Every regexp is equal to a nondeterministic finite state machine.
If you start your design thinking "machine" and not "regexp",
you can see the solution should be easy.
But the regexp might be difficult to read and lengthy.
E.g. if your basic URL machine is [:URL:], you might
get something like this:
[:URL:] | \([:URL:]\) | ... | ... | ...
A nondeterministic machine allows to change state without
"eating" any input, similar to operators "?" and "*" in regexps.
They are the ones making the whole conversion algorithm so
complicated. The output - e.g. the generated regexp - might
be hard to read.
If "(http://www.mywebsite.com)" is so common, why not add the extra bracket at the beginning to the regexp and then afterwards filter the junk away so it boils down to the general case again?
chris on November 10, 2008 07:56 AM
If "(http://www.mywebsite.com)" is so common, why not add the extra bracket at the beginning to the regexp and then afterwards filter the junk away so it boils down to the general case again?
chris on November 10, 2008 08:13 AM
@chris it's because programmers think / want just the 1 regex to magically do it all for them without having to write any extra code.
Maybe the thinking is:
You either do it all yourself, or you write one big regex to do it for you. If you have to start writing code around your regex, then your regex is just wrong (even if it's already needlessly complicated).
Don't get me wrong, I love regex, it's a fantastic tool, but I appreciate it has limitations.
Trying to access a URL with and without the parenthesis already fails for Wikipedia, which in my experience accounts for many of the URLs with trailing parentheses you're likely to encounter. WP just sends 200 even if the page does not exist, so that approach doesn't work there.
The suggestion of including the paren when there is already one in the URL and dropping it in the other case should work for all parentheses-containing URLs I've encountered to date.
Johannes Rössel on November 20, 2008 01:26 AM
WordPress has a function to turn URLs in text into links, generally used on comments, called make_clickable(). Turns out it didn't deal well with these edge cases either. After some tweaks by myself and others, make_clickable() will work with those cases. I've extracted that code into a standalone PHP class called MakeItLink:
http://josephscott.org/archives/2008/11/makeitlink-detecting-urls-in-text-and-making-them-links/
Joseph Scott on November 28, 2008 10:46 AM
Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.