October 29, 2008
URLs are simple things. Or so you'd think. Let's say you wanted to detect a URL in a block of text and convert it into a bona fide hyperlink. No problem, right?
Visit my website at http://www.example.com, it's awesome!
To locate the URL in the above text, a simple regular expression should suffice -- we'll look for a string at a word boundary beginning with http://, followed by one or more non-space characters:
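The pattern might look something like this (a minimal sketch in C#, matching the description above):

    // naive: "http://" at a word boundary, then one or more non-space chars
    using System.Text.RegularExpressions;

    var naive = new Regex(@"\bhttp://\S+");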
Piece of cake. This seems to work. There's plenty of forum and discussion software out there which auto-links using exactly this approach. Although it mostly works, it's far from perfect. What if the text block looked like this?
My website (http://www.example.com) is awesome.
This URL will be incorrectly matched with the final paren included. This, by the way, is an extremely common way for average, everyday users to include URLs in their text.
What's truly aggravating is that parens in URLs are perfectly legal. They're part of the spec and everything:
only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
Certain sites, most notably Wikipedia and MSDN, love to generate URLs with parens. The sites are lousy with the damn things -- think of Wikipedia article URLs like http://en.wikipedia.org/wiki/Tree_(data_structure), or MSDN library pages with the documentation version in parens.
URLs with actual parens in them mean we can't take the easy way out and ignore the final paren. You could force users to escape the parens, but that's sort of draconian, and it's a little unreasonable to expect your users to know how to escape characters in a URL.
To detect URLs correctly in most cases, you have to come up with something more sophisticated. Granted, this isn't the toughest problem in computer science, but it's one that many coders get wrong. Even coders with years of experience, like, say, Paul Graham.
If we're more clever in constructing the regular expression, we can do a better job.
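Something along these lines (a sketch; this particular expression is an illustration implementing the three points below):

    // improved: optionally absorb a leading paren, accept only a whitelist
    // of known-good URL characters, and require the match to end on a
    // character that isn't trailing punctuation
    using System.Text.RegularExpressions;

    var better = new Regex(
        @"\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");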
- The primary improvement here is that we're only accepting a whitelist of known good URL characters. Allowing arbitrary random characters in URLs is setting yourself up for XSS exploits, and I can tell you that from personal experience. Don't do it!
- We only allow certain characters to "end" the URL. Common punctuation marks like the period, exclamation point, and semicolon are treated as end-of-hyperlink characters and are not included in the URL.
- Parens, if present, are allowed in the URL -- and we absorb the leading paren, if it is there, too.
I couldn't come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (à la Wikipedia) and URLs that the user has enclosed in parens. Thus, a bit of postfix code is needed to detect and discard the user-enclosed parens from the matched URLs:
    // strip the user's enclosing paren pair from the matched URL, if present
    if (s.StartsWith("(") && s.EndsWith(")"))
        return s.Substring(1, s.Length - 2);
    return s;
That's a whole lot of extra work, just because the URL spec allows parens. We can't fix Wikipedia or MSDN and we certainly can't change the URL spec. But we can ensure that our websites avoid becoming part of the problem. Avoid using parens (or any unusual characters, for that matter) in URLs you create. They're annoying to use, and rarely handled correctly by auto-linking code.
Posted by Jeff Atwood
What about URLs without http://? Like www.example.com.
Run all URLs through a URL-shortening service and let them deal with it? ;)
This is the problem *with URLs*? Really!?!?
If you're going to delimit your URLs, just use characters that aren't valid in a URL.
So, either you change your links, or EVERY SINGLE BASIC WEB USER changes their natural grammar, brought about by about 100 years of English writing, so that YOUR links work. HETUBA. (Hostility Exists Towards Users By Author)
1) If you want people to go to your new site, you need to make getting there so simple that they just have to click a forum link out of curiosity, and they're there, enjoying your thoughts on pea-soup and racial tensions.
2) If the link doesn't work, because you've added a parenthesis, the chances that they'll copy and paste your link into their address bar is effectively 0. It's too much work for the gain.
3) Wikipedia gets away with it, but their site is not your site. And to be honest, even then, they probably get way more link of mouth traffic to the links that don't include brackets.
Just define your own standard. (That's what Microsoft does!)
Users will figure it out eventually.
Parsing text for things like URLs is a similar problem to trying to detect spam - the inputs are as varied as people can imagine them. So stop trying to deal with it using a series of fixed and universal laws!
1. Get the feasible string: from the beginning of what you think might be a URL to the first space (or illegal character), allowing for i18n - you can use your regex here - and keep a list of any open parenthetics in the paragraph ( "(", "[", etc.).
2. Generate a series of the possible URLs from it, by dropping each of the characters from the end that could be wrong:
Give me a URL (like http://www.example.com/querya?b)(ideally)? becomes:
a. http://www.example.com/querya?b)(ideally)
b. http://www.example.com/querya?b)
c. http://www.example.com/querya?b
and any other variations you find useful.
Assign each one a rating based on:
- whether there are unbalanced parentheses inside
- whether the parenthesis would balance open ones in the paragraph - in this example the open bracket would be balanced by this close bracket, so that lowers the scores for a. and b.
- whether the URL is sensible - blah.com/) is less sensible than blah.com/
- any other good/bad valuation you can think of
3. Rank the options
If the top two (or more) options are very close or equal in ranking, then test for the existence of each by just polling the URLs in ranked order until you find a real one. If you adjust the threshold of how close is close, you should only be testing in rare cases. If you don't like polling, just pick one, you can't out-unwit every idiot or mistake.
4. Finally, return the selected URL
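A rough C# sketch of steps 2 and 3, purely illustrative - every name and scoring weight below is invented:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class UrlRanker
    {
        // step 2: generate candidates by dropping suspect trailing characters
        static IEnumerable<string> Candidates(string feasible)
        {
            yield return feasible;
            while (feasible.Length > 0 && ")].,!?;:".IndexOf(feasible[^1]) >= 0)
            {
                feasible = feasible[..^1];
                yield return feasible;
            }
        }

        // step 3: toy rating - penalize unbalanced parens and silly endings
        static int Score(string url, int openParensInParagraph)
        {
            int balance = url.Count(c => c == '(') - url.Count(c => c == ')');
            int score = 0;
            if (balance < 0 && openParensInParagraph > 0)
                score -= 2; // the trailing ')' probably closes a prose paren
            if (balance != 0)
                score -= 1; // unbalanced parens inside the URL
            if (url.EndsWith("/)") || url.EndsWith("("))
                score -= 1; // "blah.com/)" is less sensible than "blah.com/"
            return score;
        }

        public static string Best(string feasible, int openParensInParagraph) =>
            Candidates(feasible)
                .OrderByDescending(u => Score(u, openParensInParagraph))
                .First();
    }

Step 4's existence check (polling near-tied candidates with an HTTP request) would slot in between scoring and the final pick.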
There are endless ways to improve it beyond even that - you could even try balancing the parentheses such that your wikipedia article has its missing bracket fixed. At some point, perhaps, it becomes a bit pointless, but if this is all in a library and isn't too slow, nobody need rewrite it again, and the users are happy.
For me, the power of the method is in using ranking to allow unlikely options - unless you can separate all the possible inputs on a Venn diagram (which you can't here), then some rules will work for some sets of inputs, and others for others, and you'll never find a complete set that works for all of them.
We can't fix Wikipedia or MSDN and we certainly can't change the URL spec
Well, technically, you _can_ fix Wikipedia. But the issue here is not that Wikipedia is broken -- as you point out, it's perfectly valid per the spec.
The fact is that the nerds who decided on these specs weren't really considering user-friendliness. One needs to look no further than HREF for that. (We couldn't have just 'link' or even 'URL'? We have to have an abbreviation that is used in basically no other context, an abbreviation of a term that most non-techies wouldn't be familiar with even if it weren't abbreviated?)
This is what happens when you let engineers design the world by themselves -- you end up with a dystopia that caters exclusively to the OCD.
Hah, funny you mention it. A couple weeks ago I filed a feature request for Firefox for this very problem involving parentheses (https://bugzilla.mozilla.org/show_bug.cgi?id=458565).
As a result, Firefox 3.1 will encode parentheses when you copy a URL from the location bar. I expect some improvement in this situation you describe once Firefox 3.1 launches and starts to gain popularity, and I'd be really glad if other browsers followed that.
This is an annoying one I have come across before. I had a couple of ideas to work around it, none of which is especially pleasing... so I basically ignored the problem iirc.
1.) Use bracket-matching and some parser, ignoring anything that is inside a URL. If you have a leftover open parenthesis, then the closing one is part of the text; otherwise it's part of the URL. This fails if the user fails to match their brackets, or if the URL contains just one opening parenthesis at the end or an unmatched closing one.
2.) Force the user to confirm the url if a bracket or similar character is detected. This can be done with an input box to avoid ambiguity. This should never fail, only annoy the user.
3.) Check for matching brackets inside the URL to decide. I think this will fail only if there is a trailing parenthesis, or if a closing one appears that matches an opening parenthesis inside the URL which is supposed to be unmatched. E.g. http://foo.com/ba)r would get only http://foo.com/ba
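A sketch of idea 3 as code (the helper name is invented; it assumes the regex may have kept a trailing paren):

    // keep a trailing ')' only if it balances an '(' inside the URL itself;
    // otherwise assume it belongs to the surrounding prose and drop it
    static string TrimUnbalancedTrailingParen(string url)
    {
        if (!url.EndsWith(")"))
            return url;
        int balance = 0;
        foreach (char c in url)
        {
            if (c == '(') balance++;
            else if (c == ')') balance--;
        }
        return balance < 0 ? url.Substring(0, url.Length - 1) : url;
    }

This keeps http://en.wikipedia.org/wiki/Tree_(data_structure) intact while trimming the stray paren from (http://www.example.com).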
…and I pity those who claim this is a trivial problem, or who think that merely saying "you shouldn't do this or that" will magically change all existing content on the web AND make people follow their imaginary rules.
No algorithm can solve a human communication ambiguity problem. You might be able to make pretty decent guesses, but it's simply impossible to have a perfect solution when you can't read the mind of whoever wrote the ambiguous hyperlink to know what their intent was. Parentheses don't necessarily have to be paired in URLs, and even in human language a writer may fail to pair them, so you can't magically be sure whether a paren should be part of a URL or not. Periods at the end of a URL are in an even worse situation.
Easy solution: allow the user to preview their posting. If something is wrong, it is their responsibility to resolve it. A blind post button - like the one on this site - invites more errors than an arbitrary but predictable algorithm.
When everyone became aware that the URL should contain relevant words in order to gain points with Google's PageRank, URLs became more and more similar to regular text (as far as a computer can distinguish).
They now allow spaces, punctuation, and various special characters.
I mean, come on, even a real-life human being would not correctly recognize a URL with spaces in it and punctuation at the end when it is embedded in normal text!
I propose we (as devs) should only care about a starting protocol, subdomain.domain, optional extra subdomains, and then an optional / followed by pretty much everything that is not a space (even though spaces are valid URL characters). There are pretty good regexes for this on RegExLib.
Before that, though, we should have taken care of matching parentheses, angle brackets, square brackets, curly brackets, and enclosing punctuation like quotes. Another optional step would be cleaning up random HTML, like enclosing <a> tags. Note that these enclosing nightmares should be set apart by at least one space, in order not to screw up our URLs with enclosing characters inside.
But the space is where I draw the line. Even then, we can make an exception when it's enclosed in angle brackets.
This heuristic cannot be done with a single regexp as far as I'm aware, but it should give that 95%-99% accuracy we all wish for in a smart system, without shooting our brains out.
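One hypothetical regex along those lines (invented here, not taken from RegExLib):

    // scheme, dotted host with optional subdomains, then an optional path
    // of anything that isn't whitespace
    using System.Text.RegularExpressions;

    var proposal = new Regex(@"\b(?:https?|ftp)://(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?:/\S*)?");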
Oh, space separation between enclosing nightmares was a mistake.
But we can't safely remove enclosing characters that sit between words, nor can we remove them when they end words, since a URL can end in ")".
I think I rushed myself into this, and now I'm considering Jheriko's approach; mixed with some preview, it would be some relief to the user who's just trying to get his URL shown as a hyperlink.
The magical keyword here is *context*.
By definition, most human languages are context-dependent; ergo, they are not parseable by a Deterministic Finite Automaton, for which regexps are a type of shorthand (well, for a subtype of them, actually).
So this can never be completely solved by using a DFA; at most you can get approximations for most of the cases. I'd try solving it with a non-deterministic automaton and, upon running out of input, would select from the list of valid end-states the one whose found strings satisfy the most rules - this would be more or less the equivalent of "makes the most sense".
Does anyone know of a sample text containing a list of all difficult cases, to test a program with?
I don't think this post is too specific to web software at all. Regular expressions in general are present in lots of software, from programs distributed with the Unix shell all the way to non-compiled scripting languages. This is just a post using hyperlinks as a case study, and talking about how much programming we need to do and how much is the user's responsibility for their actions. I actually enjoyed the post.
One thing I would like to say, though, is that even though regex is our weapon of choice here, I think the post-code that you mention is pretty much the preferred solution and there should be a whole lot more of it. The regex, in this case, should just be a way of narrowing down where in a given string a URL is - not precisely, mind you, but generally. And users have to take some responsibility for their actions. We can't ping every URL, because eventually a perfectly good HTML document will be present on a non-pingable host. We can't go to every URL because of exploits. So users will just have to settle for our ability to interpret their links.
The point isn't about smart or stupid users. They're given a lame freetext field for something, told they can paste links, and then are berated by the programmer because the perfectly valid links they enter get mangled. What's worse, the links provided by the user don't get interpreted properly and are broken from the user's perspective.
A - Some of you aren't old enough to remember that the only way to stop (old) versions of Outlook and other email clients from chopping long URLs at 70-odd chars was to enclose them in parens. So users were taught a (bad) way to escape their URLs, and you just forgot.
B - Don't berate wiki or anyone else for making human-readable links that are within the url spec.
Conclusion: you're finding that your quick-solution regex is not going to cover all the bases. In a dreamworld, the URL spec is different and comes with the perfect regex in the box.
Snap out of it and roll your own state machine. It's probably all of a day's work to extract URLs without a regex. It's likely faster and more maintainable anyway. I bet it could be made one-pass, with minimal backtracking, and be a factor faster than a regex. Hmm, something like Anonymous Cowherd did.
The point of your blog entry should have been '99% just doesn't cut it. Regular expressions can't be made sophisticated enough to cover all permutations'
Let's go shopping indeed.
I would auto-link all the normal links and let the user specify a special mark for complicated links, like the angle brackets already mentioned. A preview would be fine, so the user will know when his link was not identified; then he just encloses it in angle brackets. Smart users will learn to enclose links already at step 1 and avoid problems with the parser.
a) How about doing the URL-ification on the fly while one is typing (like Word's spell check)?
Should be pretty easy to do that in JS.
That way the person can correct it (by explicitly making the text a link) if the super-smart algo hits an edge case.
b) If you are anal about correctness, how about actually checking whether the URL returns a 200 by doing an HTTP HEAD, and then deciding whether to URL-ify the string in question?
We could be talking about URLs (like this blog post does) and not want people to actually click them - like, say, when <a href="http://example.com">http://example.com</a> is a phony example URL.
Looks like quotes need to be handled by your regexp too :p
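A minimal sketch of idea (b), assuming .NET's HttpClient; the helper name and error handling are illustrative only:

    // probe a candidate URL with an HTTP HEAD request and only auto-link
    // it if the server answers with a success status
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    static async Task<bool> LooksAlive(HttpClient client, string url)
    {
        try
        {
            using var response = await client.SendAsync(
                new HttpRequestMessage(HttpMethod.Head, url));
            return response.IsSuccessStatusCode;
        }
        catch (HttpRequestException)
        {
            return false; // unreachable or refusing host: don't auto-link
        }
    }

One caveat: some hosts answer 200 even for pages that don't exist, so a successful HEAD is a hint, not proof.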
Every regexp is equal to a nondeterministic finite state machine. If you start your design thinking "machine" and not "regexp", you can see the solution should be easy. But the regexp might be difficult to read and lengthy. E.g. if your basic URL machine is [:URL:], you might get something like this:

[:URL:] | \([:URL:]\) | ... | ... | ...

A nondeterministic machine is allowed to change state without consuming any input, similar to the operators ? and * in regexps. Those are the ones making the whole conversion algorithm so complicated. The output - e.g. the generated regexp - might be hard to read.
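A hypothetical expansion of that [:URL:] | \([:URL:]\) alternation into a concrete pattern, reusing the whitelist idea from the post (UrlCore is an invented name):

    // try the parenthesized form first, then the bare form; the outer
    // parens can be stripped from the match afterwards
    using System.Text.RegularExpressions;

    const string UrlCore =
        @"http://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    var machine = new Regex($@"\({UrlCore}\)|{UrlCore}");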
@chris it's because programmers think / want just the 1 regex to magically do it all for them without having to write any extra code.
Maybe the thinking is:
You either do it all yourself, or you write one big regex to do it for you. If you have to start writing code around your regex, then your regex is just wrong (even if it's already needlessly complicated).
Don't get me wrong, I love regex, it's a fantastic tool, but I appreciate it has limitations.
If (http://www.mywebsite.com) is so common, why not add the extra bracket at the beginning to the regexp and then afterwards filter the junk away so it boils down to the general case again?
Trying to access a URL with and without parentheses already fails in the Wikipedia case, which in my experience accounts for many of the parenthesized URLs you're likely to encounter. WP just sends a 200 even if the page does not exist, so no way there.
The suggestion of including the paren when there is already one in the URL and dropping it in the other case should work for all parentheses-containing URLs I've encountered to date.
I know it's a bit late, but I found this post and I just want to answer with a handy regex I use for this purpose.
Thanks all. This site is really awesome, but I need a little bit more: on my site, users post www.google.com or google.com, but your code only links http://google.com or http://www.google.com.
Please help me out with this problem.
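One possible tweak for that case (a hypothetical sketch, not code from the post): also match bare www. URLs and prepend the missing scheme when building the link. Bare domains without www (like google.com) are much harder to detect reliably and are left out here.

    // match either "http://..." or bare "www...." and add the scheme
    // to the href when it's missing (all names here are invented)
    using System.Text.RegularExpressions;

    var rx = new Regex(
        @"\b(?:http://|www\.)[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    string AutoLink(string text) => rx.Replace(text, m =>
    {
        var url = m.Value.StartsWith("http") ? m.Value : "http://" + m.Value;
        return $"<a href=\"{url}\">{m.Value}</a>";
    });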
Your problem isn't about parsing URLs - it's about parsing natural(-ish) language (i.e. no formal syntax) to extract some information with a formal syntax.
To do this job properly, you need some level of context sensitivity and some heuristics looking at leading and trailing characters (if any) to decide what to do.
Gmail seems to do a pretty good job of that. Certainly better than just saying 'ooh, ooh - don't use parentheses in URLs' - that's just lame.
URIs are not the problem; the problem is that you cannot place URIs in freeform text and hope to precisely extract URIs from that text. Your options are to restrict the form of URIs (what you're proposing) or to restrict the context in which the URIs exist (the text). The former is defined in an RFC; your attempt to refine the URI syntax is hopeless: how will you enforce this restriction on the web? The latter problem has already been solved by not using freeform text and instead using a markup language. There are many; pick one.
If you can't control the context, then you basically need to make a best effort and accept that you will imperfectly parse URIs.
Take a look at appendix E of RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt), as it talks about how to include URLs in context (prose). It suggests the use of angle brackets.
When a URL is discovered that is enclosed within angle brackets, the regex should be much more permissive in what characters it allows.
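A sketch of that two-mode matching, with an invented pattern - permissive inside angle brackets, conservative whitelist elsewhere:

    // group 1 captures a permissive URL inside <...>; the alternative
    // falls back to the conservative whitelist for bare URLs
    using System.Text.RegularExpressions;

    var linkFinder = new Regex(
        @"<(http://[^>\s]+)>" +
        @"|\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");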
Seems to be a common theme on this blog lately: If a problem can't be solved with a regex and 2 or 3 lines of code then it obviously can't be solved at all, and there must be something wrong with some *other* aspect of the problem.
Whatever happened to always assume that the problem is in your own code?
Most people simply choose not to bother with this problem. Every kind of markup under the sun has provisions for links, and if some braindead user can't be bothered to link properly then he'll have to deal with the majority of people not following his link. As others have stated, link extraction with a regex is tantamount to auto repair with duct tape; be prepared for your solution to break easily and often, and don't blame the car.
Does Facebook have a good solution? When you write messages on there, about a second after you've typed a URL, it not only grabs the URL and makes it appear as a hyperlink so you can see what the end result will be, but it grabs content from the URL and gives a preview with a major image and content. Of course, if they don't do a good job of grabbing valid URLs, it's a moot point. But if they do a good job... don't reinvent the wheel.
Jeff, if you're interested: a regular expression with a positive lookahead eliminates the need for subsequent code to distinguish whether the link is inside brackets or not, and you can then easily reconstruct anchor links with a replace.
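A sketch of the lookahead idea, built on the whitelist core used earlier (an illustration, not the exact expression from the comment):

    // if the match is preceded by '(', a lookahead stops it just before the
    // ')' that closes the user's paren; otherwise match with the whitelist.
    // caveat: this simple form won't handle URLs that contain parens AND are
    // wrapped in user parens at the same time
    using System.Text.RegularExpressions;

    var rx = new Regex(
        @"(?<=\()http://[^\s)]+(?=\))" +
        @"|\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    // reconstructing anchor links from the matches:
    string Linkify(string text) =>
        rx.Replace(text, m => $"<a href=\"{m.Value}\">{m.Value}</a>");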
The regex is rather complex (but so is this problem, as it turns out). A RegexBuddy library file is included as part of the Github project if you are into that.
@Philippe Leybaert - What's wrong with good enough? For example, your post URL contains 'quot' at the end because WordPress's 'slug' generator is only good enough.