Markdown was one of the humane markup languages that we evaluated and adopted for Stack Overflow. I've been pretty happy with it, overall. So much so that I wanted to implement a tiny, lightweight subset of Markdown for comments as well.
I settled on these three commonly used elements:
*italic* or _italic_
**bold** or __bold__
`code`
I loves me some regular expressions and this is exactly the stuff regex was born to do! It doesn't look very tough. So I dusted off my copy of RegexBuddy and began.
I typed some test data in the test window, and whipped up a little regex in no time at all. This isn't my first time at the disco.
Bam! Yes! Done and done! By gum, I must be a genius programmer!
Despite my obvious genius, I began to have some small, nagging doubts. Is the test phrase...
I would like this to be *italic* please.
... really enough testing?
Sure it is! I can feel in my bones that this thing freakin' works! It's almost like I'm being pulled toward shipping this code by some inexorable, dark, testing ... force. It's so seductively easy!
But wait. I have this whole database of real world comments that people have entered on Stack Overflow. Shouldn't I perhaps try my awesome regular expression on that corpus of data to see what happens? Oh, fine. If we must. Just to humor you, nagging doubt. Let's run a query and see.
select Text from PostComments where dbo.RegexIsMatch(Text, '\*(.*?)\*') = 1
Which produced this list of matches, among others:
Interesting fact about math: x * 7 == x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2. Integer addition is usually pretty cheap.
Thanks. What I needed was to turn on Singleline mode too, and use .*? instead of .*.
yeah, see my edit - change select * to select RESULT.* one row - are sure you have more than one row item with the same InstanceGUID?
Not your main problem, but you are mix and matching wchar_t and TCHAR. mbstowcs() converts from char * to wchar_t *.
aawwwww.... Brainf**k is not valid. :/
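The same check can be sketched outside the database. This is a minimal Python illustration, not the code Stack Overflow ran: it applies the naive pattern from the query to the test phrase and a few of the comments quoted above, and the helper name `naive_matches` is made up for the example.

```python
import re

# The naive first-draft pattern from the query above:
# anything (lazily) between two asterisks.
NAIVE_ITALIC = re.compile(r"\*(.*?)\*")

# The original test phrase plus a few of the real comments quoted above.
comments = [
    "I would like this to be *italic* please.",
    "Interesting fact about math: x * 7 == x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2.",
    "What I needed was to turn on Singleline mode too, and use .*? instead of .*.",
    "mbstowcs() converts from char * to wchar_t *.",
    "aawwwww.... Brainf**k is not valid. :/",
]


def naive_matches(text):
    """Return every span the naive pattern would render as italic."""
    return [m.group(0) for m in NAIVE_ITALIC.finditer(text)]


for comment in comments:
    # Print what would be italicized next to the comment it came from.
    print(naive_matches(comment), "<-", comment)
```

Only the first line is an intended use of italics; every other hit is a false positive of the kind the query surfaced.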
Thank goodness I listened to my midichlorians and let the light side of the testing force prevail here!
So how do we fix this regex? We use the light side of the force -- brute force, that is, against a ton of test cases! My job here is relatively easy because I have over 20,000 test cases sitting in a database. You may not have that luxury. Maybe you'll need to go out and find a bunch of test data on the internet somewhere. Or write a function that generates random strings to feed to the routine, also known as fuzz testing.
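If you don't have a corpus, a crude fuzz test might look something like the sketch below. Everything in it is an assumption made for illustration: the tiny alphabet, the `looks_wrong` rule (italic text should not be empty or padded with whitespace), and the function names. It feeds seeded random strings to the naive pattern and collects the matches that break the rule.

```python
import random
import re

# The naive pattern under test.
NAIVE_ITALIC = re.compile(r"\*(.*?)\*")


def random_comment(rng, max_len=12, alphabet="ab *_`"):
    """Build a short random string from a small, markup-heavy alphabet."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))


def looks_wrong(inner):
    """Hypothetical sanity rule: italic text is non-empty and not space-padded."""
    return inner == "" or inner != inner.strip()


def fuzz(pattern, runs=2000, seed=0):
    """Return (input, match) pairs whose captured text breaks the sanity rule."""
    rng = random.Random(seed)  # seeded so a failing input can be reproduced
    failures = []
    for _ in range(runs):
        text = random_comment(rng)
        for m in pattern.finditer(text):
            if looks_wrong(m.group(1)):
                failures.append((text, m.group(0)))
    return failures


failures = fuzz(NAIVE_ITALIC)
print(len(failures), "suspicious matches; first few:", failures[:3])
```

Random strings are no substitute for real comments, but they are cheap, and a seeded generator makes any failure repeatable.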
I wanted to leave the rest of this regular expression as an exercise for the reader, as I'm a sick guy who finds that sort of thing entertaining. If you don't -- well, what the heck is wrong with you, man? But I digress. I've been criticized for not providing, you know, "the answer" in my blog posts. Let's walk through some improvements to our italic regex pattern.
First, let's make sure we have at least one non-whitespace character inside the asterisks. And more than one character in total so we don't match the ** case. We'll use positive lookahead and lookbehind to do that.
\*(?=\S)(.+?)(?<=\S)\*
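Here is a quick way to poke at this second version, sketched in Python rather than the tools used in the post. The helper name `italic_spans` is invented for the example, and the inputs are the test phrase, two of the comments from the corpus, and the p*q*r case.

```python
import re

# Second version: the text inside the asterisks must start and end
# with a non-whitespace character.
SECOND = re.compile(r"\*(?=\S)(.+?)(?<=\S)\*")


def italic_spans(text, pattern=SECOND):
    """Return the text captured between each matched pair of asterisks."""
    return [m.group(1) for m in pattern.finditer(text)]


samples = [
    "I would like this to be *italic* please.",
    "aawwwww.... Brainf**k is not valid. :/",
    "mbstowcs() converts from char * to wchar_t *.",
    "p*q*r",
]

for sample in samples:
    print(italic_spans(sample), "<-", sample)
```

The pointer and Brainf**k comments no longer match, but an asterisk pair buried inside other characters still does.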
That helps a lot, but we can test against our data to discover some other problems. We get into trouble when there are unexpected characters in front of or behind the asterisks, like, say, p*q*r. So let's specify that we only want certain characters outside the asterisks.
(?<=[\s^,(])\*(?=\S)(.+?)(?<=\S)\*(?=[\s$,.?!])
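A Python sketch of this version, with one caveat worth checking against your own engine: inside a character class, ^ (when not first) and $ are literal characters, not anchors, so emphasis at the very start or end of a comment is not matched. The `ANCHORED` variant below is an assumption added for illustration, not the pattern Stack Overflow shipped. It moves the start anchor outside the lookbehind because Python's `re` only accepts fixed-width lookbehinds.

```python
import re

# Third version exactly as written above. In the character classes,
# ^ and $ match the literal characters, not positions.
THIRD = re.compile(r"(?<=[\s^,(])\*(?=\S)(.+?)(?<=\S)\*(?=[\s$,.?!])")

# Hypothetical variant that also accepts start and end of string.
ANCHORED = re.compile(r"(?:^|(?<=[\s,(]))\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])")


def italic_spans(text, pattern):
    """Return the text captured between each matched pair of asterisks."""
    return [m.group(1) for m in pattern.finditer(text)]


samples = [
    "I would like this to be *italic* please.",
    "p*q*r",
    "*Testing* is good.",
    "this is *it*",
]

for sample in samples:
    print(italic_spans(sample, THIRD), italic_spans(sample, ANCHORED), "<-", sample)
```

Both patterns reject p*q*r; only the anchored variant picks up emphasis at the edges of the comment.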
Run this third version against the data corpus, and wow, that's starting to look pretty darn good! There are undoubtedly some edge conditions, particularly since we're unlucky enough to be talking about code in a lot of our comments, which has wacky asterisk use.
This regex doesn't have to be (and probably cannot be, given the huge possible number of human inputs) perfect, but running it against a large set of input test data gives me reasonable confidence that I'm not totally screwing up.
So by all means, test your code with the force -- brute force! It's good stuff! Just be careful not to get sloppy, and let the dark side of the testing force prevail. If you think one or two simple test cases cover it, that's taking the easy (and most likely, buggy and incorrect) way out.
Woaw, you're the king of regex !
Geo on July 8, 2009 7:06 AM
A while back there was a post somewhere out there in the tubes by a guy advocating test driven development. In his case he was building a sudoku solver, and began his development process, per TDD dictums, by building his tests.
He then pursued the development of the application in the most mindless, ridiculous manner possible, basically just believing that the best approach was to kind of randomly change things until the tests passed.
I feel the same way about this entry. Did you *really* have to "go to the tests" to realize that there were egregious gaps in your matching? I would say that it is close to one of the worst ways to approach the problem.
Dennis Forbes on July 8, 2009 7:10 AM
Yeah, gotta start reading that book you talked about, those regexp actually make Japanese appealing :-)
Dan on July 8, 2009 7:12 AM
If Dennis Forbes is angry with me, well ... it must be Wednesday!
I can't speak for you, but for me, it's *way* harder to sit around and dream up all the edge conditions than it is to, y'know, throw a bunch of data at it and see.
Jeff Atwood on July 8, 2009 7:15 AM
....OR....
Why not save the comments as HTML? E.g., make sure it's parsed to sane [X]HTML before you store it in the db field.
That way, your current comments won't get affected. Depending on what you do at the moment, you sanitize them first (escape the < and &).
This will also help if you ever change your format from Markdown to something else (or not) again.
Clinton on July 8, 2009 7:23 AM
But those comments have been entered by users who don't have any markup available. By introducing 'code' as possible markup don't most of those things cease to be a problem? Surely they'll just write '7*x'?
Angry? Not really, Jeff.
And I would argue that they aren't edge conditions, and that really is the crux of the point. Such scenarios are blatantly obvious, I think, if you just stop and contemplate for a few moments.
As an aside, make your implementation in F# and you'll have an interesting (and likely very fast) implementation to talk about. Regular expressions are one of the greatest solutions in software development, but you know that trite old saying -- now you have two problems. It isn't super ideal for parsing text in such a manner.
Dennis Forbes on July 8, 2009 7:30 AM
What's with all this testing crap? Remember: We Make Shitty Software.. With Bugs!
http://www.codinghorror.com/blog/archives/000099.html
Zack Peterson on July 8, 2009 7:31 AM
Unfortunately, one completely ambiguous case is things like identifiers surrounded by double underscores such as __FILE__ and __LINE__. If it were up to me, I would only use asterisks for bold/italics, and not underscores.
Of course, the easy workaround is to do `__FILE__`, but to do so, you need to be aware that double underscores mean bold.
Adam Rosenfield on July 8, 2009 7:32 AM
Great post! Brute-force testing makes a lot of sense, especially for small components of software. For larger components brute force might be a problem, either because it takes a lot of time or there is no possible way you can cover a lot of ground in a reasonable amount of time. Testing stuff at a lower level is a must.
Tathagata on July 8, 2009 7:35 AM
Is it just me, or is this particular blog not generating a lot of comments?
Don't get me wrong Jeff, if I have to search and replace more than one thing in any of my files, I do it with Codewright's version of regex - like most things in coding, no one cares but me.
Jamie on July 8, 2009 7:35 AM
While I see the point Dennis is making, I often go with Jeff's approach for testing.
It is the same thing as using the compiler as a test. Officially frowned upon, but if you're in an interpreted language or working on something that takes seconds to compile why not use the compiler as a sanity check?
Heck my IDE of choice, Netbeans, does that automatically now.
Some Random Internet Guy on July 8, 2009 7:43 AM
Ran into this yesterday unexpectedly on Server Fault. I was trying to *chortle* but I wound up doing it in italics which was very deceptive. I thought *'s around a word meant some sort of noise your body might make. At least that's how it was in mIRC in the mid-90's.
Peter Turner on July 8, 2009 7:43 AM
> Such scenarios are blatantly obvious, I think, if you just stop and contemplate for a few moments.
So what? If you've got a lot of test data available, why not run your code against it to see which, if any, blatantly obvious things you missed?
Dumb approaches, smart approaches, they can all help, and they can all miss things.
Paul D. Waite on July 8, 2009 7:50 AM
Erm, would just using a Markdown library not have made more sense?
Robert Synnott on July 8, 2009 7:52 AM
"First, let's make sure we have at least one non-whitespace character before and after each asterisk."
I don't think that's what you mean... what you wrote would highlight "my*string*s", but not "my *string* is"
Tordek on July 8, 2009 7:55 AM
*sigh*
http://www.googlefight.com/index.php?lang=en_GB&word1=%22Jeff+Atwood%22&word2=%22Dennis+Forbes%22
Juan on July 8, 2009 8:00 AM
@Paul-
>So what? If you've got a lot of test data available, why not run your code against it to see which, if any, blatantly obvious things you missed?
Tests are grrrrr-eat.
Tests are not a substitute for actually thinking about the problem space, however, and this post supports a theory I've long had that tests sometimes make people produce worse code, because they get used as a substitute for purposed coding.
I strongly suspect that Jeff took dramatic license in his post, and he didn't really go into it so blindly, but it served to demonstrate the use of test data.
@Juan-
Google fight? Dude, I'm an unknown nobody. Jeff smacks me down and calls me Sally when it comes to internet popularity.
Dennis Forbes on July 8, 2009 8:04 AM
I wonder how many real italics attempts will now be lost by the updated expression?
Joel Coehoorn on July 8, 2009 8:05 AM
Why is it easier and less confusing for the commenter to write *italics* rather than italics ?
Especially when we eventually have *italics* #bold# @underline@ %hidden% etc.
Being able to "dream up" edge cases demonstrates that you understand both the problem and the algorithm, along with limitations.
My first thought on seeing your initial regex, was the fact that a comment wouldn't be able to contain multiple italic blocks:
test this in *italic*
test this *other* thing in *italic*
With a greedy algorithm — (which most regex engines are) — the block "other* thing in *italic" would be italic.
Essentially what you're doing is trying to use a regular expression to match a context-free grammar, which is not possible. You can fake it to a limited extent using an "enhanced" regex engine (with back-references, etc) and some creative expressions, but the level of nesting that you can support will be finite, and the deeper you go, the more complex the expression becomes.
Either put a rich text field on your form (so it acts like a mini word processor and hides the markup in the background) or learn to live with the fact that users will have to enter markup into a plaintext field in order to have their plaintext appear to be rich text on the web page.
Oh, wait, **text** and __text__ is SOOO much less ambiguous and easier on the eyes of a layperson.
tb on July 8, 2009 8:11 AM
Could you just have commenters type stuff I want italicized?
Xami on July 8, 2009 8:15 AM
You should have looked at the WMD code! I could have sent you some snippets where this sort of thing is handled.
Dana on July 8, 2009 8:16 AM
tb: Stack Overflow commenters aren't laypeople.
anon on July 8, 2009 8:21 AM
If you're looking for a regex to do X you should start by looking at http://regexlib.com/ to see if someone already has done X. Most of these situations are non-unique, so reinventing the wheel can often be saved. And of course, if you don't find something that suits your needs, or that what's there is full of bugs and you know how to fix them, by all means submit the new-and-improved regex.
Steve Smith on July 8, 2009 8:25 AM
I know you love RegEx, but isn't this the wrong tool for the job? Shouldn't you be writing a parser for your reduced version of Markdown?
This shouldn't be too time consuming as:
You only want three elements of the complete language.
There are probably open source parsers for Markdown you can rip off/learn from.
And you still get to test it against heaps-o-data.
Win Win?
Sir Digby Chicken Caeser on July 8, 2009 8:29 AM
I guess the point of the article is to use real test data for tests, so, good on you.
But it would have been nice to hear a word about BNF grammars and actual, real text parsing, instead of hacking it together from regex. Even if Markdown isn't defined in a deterministic way, it still would have been nice to hear a little peep about parsing and how it would have solved all those nasty little edge cases, and why we can't use the traditional approach today.
Peter on July 8, 2009 8:30 AM
How about saving a "markup version" with each comment? Then you could render old submissions without any formatting or the old way... and new submissions with the current method...
Or... if you have a way of escaping your newly introduced control chars, escape all old comments...
Jim on July 8, 2009 8:30 AM
I hope you will find the problem one day you are trying to solve with this mindless post.
Regular expressions are useful. We get it. You are cool. Your web site is cool. You will make a lot of $$$ (or $#%@%^@, it remains to be seen).
securityhorror on July 8, 2009 8:33 AM
You probably will want to change your regex from:
\*(.+?)\*
to
\*([^*]+)\*
The problem with using the lazy operator (?) in your original is that you can run into some really bad backtracking which can kill performance. Using the negated character class ([^*]) will work fast all the time.
Jon von Gillern on July 8, 2009 8:33 AM
Testing is for pansies! Test this, test that...blah, blah, blah. Real codeslingers just throw their product out to the masses and deal with the - ahem - very small number of bugs that their exceptional code will somehow manage to generate. What's the worst that could happen, eh?
Kenneth on July 8, 2009 9:01 AM
You should have just posted a question on Stack Overflow, asking for help, I know I would have seen some of the problems with the first one, and I think there still may be some problems with the last one.
( The question should also have had a link to the Meta SO question where people could make comments about the idea )
Brad Gilbert on July 8, 2009 9:02 AM
clbuttic
Aaron Seet on July 8, 2009 9:05 AM
This won't match
*Testing* is good.
You may want:
(?<=^|[\s^,(])\*(?=\S)(.+?)(?<=\S)\*(?=[\s$,.?!])
Tim on July 8, 2009 9:09 AM
Well, here is an excellent example of what your stackoverflow dumps can be used for, TEST CASES FOR FREE :D
Kost on July 8, 2009 9:12 AM
"Especially when we eventually have *italics* #bold# @underline@ %hidden% etc."
Back from the old BBS days, I find /italics/ *bold* _underline_ much more intuitive.
Secure on July 8, 2009 9:14 AM
This will work until you decide to add some escaping. Oh wait... you have already added it: `code`!
So now you need to modify your regexp so that `text *some* text` would not match
LXj on July 8, 2009 9:16 AM
Don't forget to watch in which mode you currently are. E.g. within a `code block` you probably don't want * to be matched at all (people will most likely not use italic within a code block, will they?). This is probably the hardest thing to do -- if possible with only regex at all... I'm not sure about that (not unless you pack all three components into a single regex, code, italic and bold).
BTW, I don't understand why *xxx* is italic and **xxx** is bold. Ages ago, when nobody had ever heard of HTML, we styled our read-me's, mails and Usenet posts with: /italic/, *bold*, _underlined_ :-)
(apart from the fact that I think italic is rather useless. To me italic text is not emphasized at all, it is rather slimmer and less emphasized, de-emphasized so to say. Sometimes it's also just plain ugly or hardly even distinguishable from normal text, depending on the font being used. So I always use bold to emphasize and if / dies tomorrow, I will not really miss it)
Mecki on July 8, 2009 9:18 AM
Sure. The best use of your time is to stare at the code and ponder all the possible holes in it. Do this for hours so you can be SURE you thought of everything. Definitely do NOT run a simple quick query on a huge amount of real world test data. That would be stupid.
While you are at it, I think you should go back through your code base, ignore all profiling you have done so far, and start pre-optimizing code. That's always a good use of time.
Because as a team of 3, I'm sure you guys have enough diversity to think like every person in every culture out there and fully know exactly what they will click on, in what order, what type of data they will enter, etc.
After you then get your perfect code, release it. When people complain of problems, go back and fix your code (what? I thought I was so smart I'd found every possible bug! that's unpossible!).
But for Dennis' sake, definitely do NOT use your test data first!
Come on people. Using a huge amount of real world data as your first cut is the best possible use. You will see how people are using the *, and then, after you solve all those edge cases, you can THEN stare at your code and analyze each special case you had to consider. This will actually help you think of as of yet not encountered uses of the * character.
Some people are so incredibly arrogant it amazes me.
Matt on July 8, 2009 9:18 AM
While I agree with the original idea of "this is exactly the stuff regex was born to do" when the problem seemed trivial enough, I will have to agree with tb: in this case, regexes are not the best tool for the job (although I smiled at the fact that using "**" instead of "*" doesn't solve the problem, because things like "char **p;" could happen too).
On the other hand, maybe you are not looking for the *perfect* solution, maybe something that works right 99.98% of the time is good enough, in which case you have already (almost) solved your problem. Furthermore, maybe *not* spending two days writing some context-free grammar or a state machine could be one of those "bad for the software, good for the business" things, right?
Having nitpicked enough, +1 for the article.
Side note: you should consider some form of comment moderation. I'm not saying you should get rid of anyone who doesn't agree with you (Dennis Forbes, for instance, disagrees with you, but in a good way), but are insults and such really necessary (securityhorror, I'm looking at you)?
Martin on July 8, 2009 9:19 AM
The real solution would be to realize that regexes are a poor tool for this, and use a proper parser instead. Continue down this path and you have something like MediaWiki markup - an incredibly irregular markup language that can only be properly parsed by the one and only canonical implementation, using a horrid mishmash of regex and other functions.
Nick Johnson on July 8, 2009 9:20 AM
I really think this is one of those situations where regex is not the answer. By the time you've coaxed out all the tricksy situations, you've got a monster unmaintainable regex.
Much better to do it in "normal" code. Of course that normal code might use simple regexes.
John H on July 8, 2009 9:22 AM
Midichlorians?
*gag*
I thought we had all agreed that "The Phantom Menace" didn't really happen.
jeffH on July 8, 2009 9:26 AM
@Kenneth-
Who are you directing your sarcasm at, because no comment that I can see in here implies that testing is a bad thing. Testing is a great thing.
However Jeff is pursuing Test Driven-Off-A-Cliff Development here. Worse, it isn't even *really* tests at all because the tests "pass" based upon him eyeballing generated output to see if it's getting closer to expectations. It is classic hackery-in-the-bad-way sort of coding.
@Matt-
Groan. It isn't worth replying.
Dennis Forbes on July 8, 2009 9:29 AM
This is why I use HTML. It's slightly more time-consuming than Markdown, especially for unordered lists, but as someone very familiar with HTML I'm not really bothered.
Jonathan Drain | D20 Source on July 8, 2009 9:47 AM
Jeff, what kind of shoes do you wear? I'll buy the same because I want to be like you. And please tell me more about you.
code monkey on July 8, 2009 9:48 AM
Tim was right. When you put `^` or `$` inside a character class (like `[\s^,(]` or `[\s$,.?!]`), they no longer match positions, but those literal characters. `\s` may or may not mean "any whitespace" inside a character class, depending on your regex engine (some allow it, some interpret it as "either \ or s").
So I believe what you meant was:
`(?<=^|[\s,(])\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])`
(and this matches one or more characters inside the asterisks, not "more than one character in total").
However, it seems to be working well so far! Thanks for the new feature!
Noah on July 8, 2009 9:52 AM
It's funny, but with each new article discussing regex I seem to dislike it more and more. I've toyed with regex's before but to me writing code with my chosen languages native string functions is much easier to read, modify and maintain, especially as complexity grows.
Sure, often I do in ten lines of code what can be done in one, but hasn't C taught us that that isn't always the brightest idea?
HearWa on July 8, 2009 10:00 AM
If you're a nobody... how do you get attention?
Simple, just go against everything that a well-known person says even if they are completely right, there you go! you got your 15 minutes at last!
Guys let's take this advice and ignore this guy.
http://www.codinghorror.com/blog/archives/001271.html
PS: That guy is Forbes
"Some people are so incredibly arrogant it amazes me.
Matt on July 8, 2009 9:18 AM"
Excellent, excellent point Matt I couldn't have said it better myself.
o.s. on July 8, 2009 10:33 AM
Another example of why regex is shit.
That regex is pretty much impossible to read unless you carefully split it up in to smaller parts.
For this kind of thing you need a proper grammar.
In the grand tradition of honing in on something in a blog post that has nothing to do with the purpose of the post...
and are outdated. Instead we're all supposed to use and
Alex on July 8, 2009 10:45 AM
Sorry, I forgot- html doesn't fly in comments-
What I was saying before: we're not supposed to use [i] and [b] anymore, instead we're supposed to use [em] and [strong]
Alex on July 8, 2009 10:46 AM
>If you're a nobody... how do you get attention?
Disagree with Jeff on his own blog! GENIUS! Then you can gain the attention of a bunch of people who through survivorship-bias (in that they continued to read it) are going to likely be fans of Jeff's!
Somehow I don't think that strategy is a very good avenue to fame. Gosh, I'm going to have to rethink this.
>Simple, just go against everything that a well-known person says even if they are completely right
Sorry, friend, but I've trodden this ground half a decade ago - http://www.yafla.com/dforbes/The_Fallacy_of_Test_Driven_Development
I disagree with Jeff when I disagree with Jeff (somehow CodingHorror got on my iGoogle page, and I've been remiss to remove it. And every now and then I expand one of those nodes...). If this hurts your precious feelings, I would advise that you stop reading the comments.
>Guys let's take this advice and ignore this guy.
This is like those YouTube channels where people put a big notice at the top disclaiming that they don't care what anyone thinks, which of course means that they desperately care what everyone thinks.
Honestly I think Jeff should disable comments, because his biggest fans are his worst enemies, and they are the reason he gets often undeserved backlash. It's like some sort of weird little groupie festival.
Dennis Forbes on July 8, 2009 10:57 AM
*text* or _text_ (or double) are no good choices for markup in an environment that is full of bad C code. Go for of some sort and sanitize the database by escaping the old posts of course. These * and _ will just annoy everyone.
Someone
Someone on July 8, 2009 11:20 AM
Google fight is its own arch enemy: compare
http://www.googlefight.com/index.php?lang=en_GB&word1=%22Jeff+Atwood%22&word2=%22Dennis+Forbes%22
to
http://www.googlefight.com/index.php?lang=en_GB&word1=Jeff+Atwood&word2=Dennis+Forbes
> While I see the point Dennis is making, I often go with Jeff's approach for testing
I go with the 'Dennis' (smart) approach _always_ (mind the markup). *THEN* I always go and bruteforce test in as many ways possible. I'm always surprised at at least one edge case I missed. I try to make it a habit to _think_ why I missed the particular case initially. That way my reasoning + hitrate improves.
This post reminded me of this article: http://blog.dotnetwiki.org/2009/01/16/NamedFormatsPexTestimonium.aspx where he used Pex to automatically generate test cases where the two implementations differ. Perhaps something like that would be of use for you.
Kevin H on July 8, 2009 11:21 AM
ah my tags tag was deleted :-) fun
Someone on July 8, 2009 11:22 AM
I've never quite understood why simple HTML markup is considered "inhumane". What, really, is the difference between these:
*italic*
[i]italic[/i]
<i>italic</i>
Why come up with some complicated regex filter to convert some contrived markup to HTML, when the original HTML was designed to be simple and human readable to begin with?
In almost all cases where I've seen this "markdown" style of formatting, there's some big filter up front that automatically strips out all possible remnants of HTML as part of some cargo-cult security mechanism. Why not just modify the HTML filter to allow basic bold and italic tags through?
jasonmray on July 8, 2009 11:48 AM
I agree. This all seems way too complicated.
JM on July 8, 2009 11:50 AM
The basics of Markdown -- the parts that Jeff is trying to capture it seems -- do have a certain elegance, paying homage to a less advanced era: When all you had was ASCII, it was generally agreed that you could *emphasize* certain words, and draw _attention_ to others, with nothing more than appropriately placed characters. For those with such a habit, Markdown semantically draws from what they are used to.
I have seen a lot of sites that allow either Markdown, HTML, or some other bastardizations. The back-end process was always Markdown (where used) -> HTML -> correctness checker, so it is a concise set of code.
Dennis Forbes on July 8, 2009 12:05 PM
Why aren't you using the nice semantic element, instead of the old, presentational element?
John Topley on July 8, 2009 12:25 PM
Let's try again. Why aren't you using the nice semantic "em" element, instead of the old, presentational "i" element?
John Topley on July 8, 2009 12:26 PM
lol - you chose the Dark Side when you decided to use regex to parse markdown in order to solve a problem that didn't need either regex or markdown ;-)
but regression testing is always a good thing
What the heck should googlefights want to show?
try
http://www.googlefight.com/index.php?lang=en_GB&word1=Jeff+Atwood&word2=Adolf+Hitler
even a misspelled name can win
http://googlefight.com/index.php?lang=en_GB&word1=jeff+atwood&word2=joseph+stalin
And if you try to compare him against somebody in informatics who has really done something (like developing semaphores, winning the Turing Award and so on... guess who would win?)
http://googlefight.com/index.php?lang=en_GB&word1=jeff+atwood&word2=Edsger+Dijkstra
I usually enjoy your posts, and I am a fan, but this one was just a dumb waste of time. Seems like you were trying to fill a quota on this one.
The Skipper on July 8, 2009 1:04 PM
As this entry demonstrates nicely, Markdown IS the dark side:
-insufficiently unique tags
-closing tag is the same as the opening tag
...thus necessitating a regex that's ridiculous even by the standards of regexes.
"italic" is hardly onerous, as jasonmray notes above, and stripping out all HTML except , , and is a solved problem.
Paul on July 8, 2009 1:33 PM
Naturally, I meant "<i>" or "\" or whatever magic incantation it takes to make angle brackets show up here. I trust the irony is not lost on anyone.
Paul on July 8, 2009 1:36 PM
May I ask what the difference is between .* and .*?
I just used a regex today (not something too common in my work) and am now curious :)
Hmm, a mini-language. This seems like a good time to use the new M DSL-building language.
http://msdn.microsoft.com/en-us/library/dd285282.aspx
@Dennis Forbes:
"(somehow CodingHorror got on my iGoogle page, and I've been remiss to remove it. And every now and then I expand one of those nodes...)"
If I don't like the service in one place I never go there again, period. For someone who does not like this blog you have been here for a long time (http://www.codinghorror.com/blog/archives/000845.html). Maybe you are a stalker, maybe you are jealous, or maybe you have a crush on him. Either way it's kinda scary.
Yes. It's terribly scary, Juan. And from looking at the cadence of your sentences, I have to think that you like to injure small animals for fun.
As much as I enjoy ridiculous allusions and accusations, and as utterly in-awe I am of your amazing Google-fu, I find it humorous that you pointed out that I had commented on that particular blog entry, given that I did so after Jeff commented on one of my entries.
Though I'd been visiting Jeff's blog -- there have been some very enjoyable reads over the years -- since a lot earlier than 2007.
Maybe, just maybe my dear friend Juan, the online development community really isn't all that big, and we find ourselves in surprising intersections all the time. Strange how that works.
Dennis Forbes on July 8, 2009 2:38 PM
I'm with Dennis. While I certainly appreciate the value of testing in general, and testing against real data in particular, it's a poor approach to start with the most naïve of all solutions and proceed to iron out the "bugs" with brute-force testing.
Yes, it's often impossible to foresee every edge case, especially with free-form user-submitted content, but a more thoughtful approach would have enumerated at least the most obvious exceptions before writing a single line of code or regex: asterisks as code syntax or math symbols, underscores at the beginning of reserved words like __fastcall, multiple underscores within constants like MAX_BUFFER_SIZE, mismatched begin/end "tags", tags within other tags, and so on.
An even more thoughtful approach would be to examine the long and growing list of edge cases and consider that regular expressions are notoriously inefficient and inaccurate with so many edge cases, not to mention difficult to maintain, and that perhaps a regex is not the best implementation. One might still come to the conclusion that it's good /enough/ (see what I did there?), but the mentality at work here appears to be "It doesn't really matter if there's a better solution because I can just run tests, massage the expression, and keep iterating until the tests pass." That's what Dennis is criticizing, and it's a very valid criticism.
On the other hand, if Jeff had framed the initial "test" as more of a "domain analysis", it would convey a very different and perhaps more positive message, and for all I know, that's how Jeff really went about it. In other words: "I'm not sure yet how best to go about solving this problem, and I'd like to know more about the real-world data that it's going to be operating on. We already have reams of data, so I think I can automate some of this requirements-gathering phase with a dumb regex, and oh look, it turned up a few edge cases I hadn't really considered, like this one here where somebody actually posted a regular expression in the comments. Cool."
This type of thing, I do all the time. Somebody will ask me for a particular report, and instead of going out and immediately architecting a solution, I'll throw together a quick-and-dirty version, maybe just an inefficient and clunky SQL query, simply to verify that the results of the report actually tell the story that people expect it to tell, or even that the results are meaningful at all (often they're not). The difference is, I don't try to chisel away at that mess to implement the real solution; I throw it away, and get to work on a proper design based on the revised requirements.
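The "domain analysis" idea above can be sketched in a few lines. This is only an illustration in Python, not the code used on Stack Overflow: the pattern is the naive one from the post's SQL query, the sample strings are taken from the comments quoted in the post, and `SAMPLES` and `naive_matches` are hypothetical names.

```python
import re

# The naive pattern from the post: anything between two asterisks.
NAIVE_ITALIC = re.compile(r"\*(.*?)\*")

# A tiny stand-in for the real comment corpus (assumption: the real run
# would go over every stored comment, not four hand-picked ones).
SAMPLES = [
    "I would like this to be *italic* please.",
    "x * 7 == x + (x * 2) + (x * 4)",
    "yeah, see my edit - change select * to select RESULT.* one row",
    "aawwwww.... Brainf**k is not valid. :/",
]

def naive_matches(samples):
    """Map each sample that matches to the text the pattern would italicize."""
    return {s: NAIVE_ITALIC.findall(s) for s in samples if NAIVE_ITALIC.search(s)}

for sample, captured in naive_matches(SAMPLES).items():
    print(repr(sample), "->", captured)
```

All four samples match, but only the first capture is something a person meant to emphasize. The other three are the kind of edge cases this throwaway pass is meant to surface before any real design starts.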
Aaron G on July 8, 2009 2:51 PM
This does basically the same thing, covering a few more edge cases, using the more advanced features of the .NET regex engine:
(?(?(?${Phrase}
kbiel on July 8, 2009 3:00 PM
Mr. Forbes is not wrong at all, as he's not bashing Jeff's testing but criticizing HOW he is conducting the implementation of the feature.
And as for the googlefight results, the ones he's winning are probably because of links to Forbes magazine or related material.
Comment filtering failure. I'm not going to attempt to find out how to fix it without a preview option.
kbiel on July 8, 2009 3:05 PM
> x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2.
Aren't those shifts going the wrong way? Also, don't the "+" operators get evaluated before the ">>" operators?
Keith on July 8, 2009 4:40 PM
*Or* you could just make it optional, and then you don't need to worry about it ...
Too Much PHP on July 8, 2009 4:53 PM
Your regex implementation apparently doesn't have \b, which matches a word boundary. Rather useful :)
kd on July 8, 2009 6:12 PM
I, like most Unix shell hacks, am pretty good at regular expressions, but I'm not sure whether you want to use them here.
Besides, why are you writing your own parser when you have textile that does it for you? (See http://textile.thresholdstate.com/).
As an added benefit, you'll be using the same sort of syntax that other sites use. Italics? Use underscores. Bold? Use asterisks. Underline? Use plus signs. Crossout? Use minus signs. Why should I have to remember that your site uses double underscores when everyone else uses single ones?
And, if you really, really insist on using C#, you'll be happy to hear that there's a .NET version of Textile: http://www.codeplex.com/textilenet
David W. on July 8, 2009 6:25 PM
As I read the post, I expected a pithy moral about not trying to impose new semantics on old free-form data written without knowledge of those semantics. The question of how to (or whether to) write a regex for this should come after more important questions such as whether to do it in the first place (and for older comments written when a star was expected to be a star, I'd say no).
Of course, this all begs the question of why Yet Another Random Markup Language needs to be developed. Are html tags so foreign to your visitor base that you have to come up with something new?
Jeffrey Friedl on July 8, 2009 6:29 PM
If backticks signify code, how would I write shell code that contains backticks?
alt on July 8, 2009 7:17 PM
Great article Jeff. I love Star Wars, and I love the Star Wars/regex tie-in. I'll be back.
Cheap Websites - Josh on July 8, 2009 10:12 PM
@Dennis Forbes:
By the way, there are two Juan's in the room, I'm Juan Zamudio, the other one is just Juan.
While I agree that most of the recent posts are not that valuable, I keep coming back hoping I can read another great post like in the good old days. But I find it interesting that you come back for more, and that after almost two years you have not found the time to remove codinghorror from iGoogle, given that you dislike this blog. That's the point I wanted to make.
I also didn't see the point of that Google-fu you mention; it brings nothing to the table.
PS: I'm not a Jeff groupie. I found this blog by accident too (searching for something related to the Code Complete book; I'm a McConnell whore, I have to admit that).
PPS: If you don't find cadence in my sentences (I didn't know if that was for me or the other Juan), sorry, my English is not that good.
First, the folks debating how to make sure the '*' block is surrounded by either whitespace or the start/end of lines ... doesn't the regexp library being used support 'word boundary' matches? "\b" is usually it (http://www.regular-expressions.info/wordboundaries.html).
Second, I echo the concerns of trying to handle this as a regular expression problem, when it's quite obviously a language grammar parsing problem more likely to be satisfactorily solved using a BNF or PEG grammar.
Third, and most importantly, why are you eschewing libraries which are out there to do exactly this? I mean, one of the advantages of using a quasi-standard like markdown is that everyone and their mother has made a parser of some sort for it already. Don't waste time reinventing the wheel!
An example PEG grammar for Markdown: http://github.com/jgm/peg-markdown/blob/master/markdown_parser.leg
You'll need to use something like ANTLR to generate your C# parser code from that .leg file, but that should be a WHOLE lot easier than even what you've already done with regular expressions.
Fourth, I think the use of two different ways to do a very simple thing ('*' and '_', and '**' and '__') is Just Plain Wrong. Provide one way to make bold, and one to make italics. Makes it less likely we'll hit the other case by mistake. IMHO, the '*' is the most used one and least likely to cause problems.
Finally, I agree with other posters that markdown's choice of '*' for italics and '**' for bold is braindead (sorry, Gruber!). It should have been '/' and '*' instead. But, at this point, markdown is markdown, and you don't want an exception on your one site.
Tom Dibble on July 8, 2009 11:38 PM
Unless you changed/improved the original implementation at Stack Overflow, yesterday I noticed that `_` in identifiers causes some wacky italicizing if there are two identifiers on a single line in comments.
Unfortunately I don't remember the example, and I have not bookmarked it.
Jakub Narębski on July 9, 2009 12:33 AM
Please Jeff, just use BB code that everyone in the world knows how to use and can be implemented with a library. As for breaking existing comments, just don't run this ridiculous italicization code over posts < $Date.
Give the Regex a break and stop trying to reinvent the wheel, especially when the wheel isn't broken!!
fwgx on July 9, 2009 1:04 AM
Aren't you still missing the point about testing? Sure you've got a good data set to work from but you're still not actively seeking the conditions that may or may not break your code.
Also, unless you save your test data set, your tests are not repeatable. Without a consistent data set how do you know if a future change to your code has the desired effect?
In your previous blogs you were talking about polishing your code, but you don't seem to be practicing what you are preaching. That is, once you have a piece of code that you think is ready, take the time to develop a thorough test plan and execute it. Better yet, write the test plan when you write your designs. You do do designs and documentation prior to the implementation don't you? Perhaps if you did you wouldn't be so reliant on the Force. :op
Jackie on July 9, 2009 1:24 AM
"This will also help if you ever change your format from Markdown to something else (or not) again." ~ Clinton
But not if you want to change the presentation format. This is especially important since they've started giving away data dumps. It also doesn't work very well with the wiki editing.
"May I ask what the difference is between .* and .*? " ~ Sandro
http://www.ultraedit.com/support/tutorials_power_tips/ultraedit/non_greedy_expressions.html
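For anyone who prefers to see the difference rather than read about it, here is a small Python sketch (the sample sentence is made up for illustration). `.*` is greedy and grabs as much as it can while still letting the match succeed; `.*?` is lazy and stops at the first place the rest of the pattern can match.

```python
import re

text = "make *this* and *that* italic"

# Greedy: runs from the first asterisk to the LAST one on the line.
greedy = re.findall(r"\*(.*)\*", text)

# Lazy: stops at the nearest closing asterisk, so each pair is separate.
lazy = re.findall(r"\*(.*?)\*", text)

print(greedy)  # ['this* and *that']
print(lazy)    # ['this', 'that']
```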
Aaron G puts it very well - when viewed as a domain analysis exercise the technique can be very useful.
TDD is clearly ineffective for designing the architecture of your code, but can be good for small bits of code. Comparing someone who failed at the former to the latter confuses the scope of Jeff's suggestion, I feel.
Hacking code until tests pass is clearly bad. But using the domain knowledge and reasoning that people are claiming this technique circumvents is vital in understanding *why* the edge case occurs, else you are going to meet more edge cases hit by your solution and keep going around in circles.
However, with real world data (or failing that, generated expected data) and some smart reasoning and analysis of that you can create some very useful regression tests which, in a TDD fashion, can help to write well thought out, clean code (or blindly continue with regular expressions, depending on your level of fanboyism).
[ICR] on July 9, 2009 1:29 AM
@Jeff - Or, you can skip reinventing the wheel and check some Markdown source. For example, look at lines 1242++ in http://cpansearch.perl.org/src/BOBTFISH/Text-Markdown-1.0.24/lib/Text/Markdown.pm.
Berserk on July 9, 2009 1:45 AM
For those Star Wars references, I hereby dub you the Jar-Jar of using popular culture references in tech blogging.
Mere mention of midichlorians makes my skin crawl. It's like someone had a really bad hangover and decided to ruin a perfectly good trilogy with additional fluff co-produced with Disney.
> TDD is clearly ineffective for designing the architecture of your code
It's pretty good at testing whether you have a good architecture or not though. Of course, not everybody is good at design. Such people often don't do unit testing because it's too hard to test their code, and therefore unit testing sucks (because there can't be anything wrong with their code, right?)
Also, TDD should mean that you notice cases like Jeff's, where you have clearly taken the wrong approach, because your code looks terrible, and doesn't work all that well.
Someone should take Jeff's RegEx hammer away from him before he hits anything else with it.
Jim Cooper on July 9, 2009 2:51 AM
Perhaps a [spoiler] [/spoiler] tag would be appropriate for this blog where it only shows on mouseover or click? =)
Brandon D'Imperio on July 9, 2009 4:41 AM
I'd just put a comment version flag on the comments; comments need to be parsed according to whatever parsing rules were in place when they were input. It means that over time you have to support all your old parsing variants, unless you add something that never appeared in a single comment. But in practice it's what's needed given that we can't edit comments.
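A minimal sketch of the version-flag idea, in Python and under stated assumptions: each stored comment carries an integer format version, and the names `render_v1`, `render_v2`, `RENDERERS`, and `render_comment` are hypothetical, not anything from Stack Overflow's code. The version 2 rule is deliberately naive and HTML escaping is left out to keep the example short.

```python
import re

def render_v1(text):
    """Version 1: written before any markup existed, so show it verbatim."""
    return text

def render_v2(text):
    """Version 2: a deliberately naive *italic* rule, for illustration only."""
    return re.sub(r"\*(.+?)\*", r"<i>\1</i>", text)

# Every parsing variant that was ever live stays in this table.
RENDERERS = {1: render_v1, 2: render_v2}

def render_comment(text, version):
    """Render a comment with the rules that were in place when it was written."""
    return RENDERERS[version](text)

print(render_comment("x * 7 == x + (x * 2) + (x * 4)", 1))  # old comment, untouched
print(render_comment("this is *important*", 2))             # new comment, italicized
```

The cost is exactly the one described above: old renderers can never be deleted, only added to. A date cutoff instead of an explicit flag would work the same way, with the lookup keyed on the comment's timestamp.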
Mr. Shiny & New on July 9, 2009 8:41 AM
Did you _really_ need italics?
Amazing how complicated it is to interpret human intentions with a text parser, huh?
I spend a lot of time on it as well. It takes a little bit of artificial-intelligence-level logic to really do it well (at least, once you get past trivial cases).
Practicality on July 9, 2009 9:15 AM
You need to add a checkbox to the comment interface saying "enable markup in comment" which defaults to however I last set that field. Then users doing weird things with special chars can just uncheck the field and not worry about what happens. Similarly, all the existing comments would have that field set false and would render tomorrow like they did yesterday.
jmucchiello on July 9, 2009 9:29 AM
x + x >> 1 + x >> 2?
Wtf?
Surely that was supposed to be:
(x + (x << 1)) + (x << 2)
Am I missing something?
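A quick check of both points, sketched in Python (which, like C and C#, gives `+` higher precedence than the shift operators). The function names `times_seven` and `as_written` are made up for this example.

```python
def times_seven(x):
    """The corrected form: left shifts, with explicit parentheses."""
    return x + (x << 1) + (x << 2)

def as_written(x):
    """The expression from the quoted comment, exactly as typed.

    Because + binds tighter than >>, this parses as
    ((x + x) >> (1 + x)) >> 2, which is nothing like x * 7.
    """
    return x + x >> 1 + x >> 2

print(times_seven(3))  # 21
print(as_written(3))   # 0
```

So the answer to both questions is yes: the shifts point the wrong way, and without parentheses the additions are evaluated first.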
Nicholas Wright on July 9, 2009 9:33 AM
Questions regarding the utility of markdown-style formatting syntax aside, I find myself wondering (again) why Jeff has such a love for reinventing wheels. This isn't his first post where he takes a solved problem, tries to implement his own solution, and posts about the pitfalls he encounters while doing so (the encryption post comes to mind, offhand).
Here, he already knows of (and even uses!) an appropriate existing implementation, yet he still feels the need to roll his own.
I can understand wheel reinvention as a thought experiment or a learning exercise (or in the case where you actually can do something substantially better), but it seems like a poor choice if you're actually trying to get something done.
On the other hand, I guess it may make for a good subject of conversation in a blog post.
Jeremy T on July 9, 2009 10:09 AM
I agree with Jeffrey Friedl. I would have implemented the feature only for the new comments, as you will break at least some old comments no matter how smart your regexp is otherwise.
Paul-Gabriel Müller on July 9, 2009 1:23 PM
I'm sure IE isn't written as one gigantic regex parser. Oh wait, that could explain a few things.
I've always found that for parsing, while Regex is alluring, it always lets me down once things get real, so I only use it sparingly. If it takes me more than 20 minutes to come up with a regex (or find one on the internet) I'm probably wasting my time.
In the end I typically find it easier to create a parser for complicated things, because it's easier to optimize the internal state, I can do things that Regex can't, and it's much easier to maintain.
Regex is awesome, don't get me wrong, but it's just not for complex parsing.
I also understand the desire for simple markup, but why in Dennis' name do we have to reuse some of the most common characters, and then go write twisted code to deal with it?
Could we have avoided global warming and sped up technological progress by 20 years if we did things in a smarter way from the start? This industry is plagued with idiots who mean well.
Chuck Bartowski on July 9, 2009 3:03 PM
@jmucchiello that's a much better solution, as previous commenters wouldn't have expected their comments to be markdown-ified.
For the tenacious among the SO crew, you can always run queries on old comments and 'upgrade' ones that are obviously correct, and flag ones that never will be. It's a far easier way to hit 99.9% and keeps the code clean.
Chuck Bartowski on July 9, 2009 3:08 PM
Jeff, you have a regular expression problem. Using a regular expression in this situation is the wrong solution.
Are you just looking for an excuse to use regular expressions?
And how maintainable do you think that regular expression is?
*sigh* on July 9, 2009 4:07 PM
Couldn't you just pose certain requirements to the user, like required whitespace in front of the first asterisk, nothing but text allowed within the asterisks (no whitespace either) and a required whitespace after the closing asterisk? In addition you could ignore any asterisks within code markup (``). I think this would eliminate most edge cases. I don't think these requirements would even require any explanation to the user, as writing like *this* and not like* this* to emphasize something is the natural way.
Parsing that is dead easy and wouldn't even require regexes. But then, I'm afraid of regex. I think it's a genetic thing whether or not you get them.
Although you wouldn't be able to emphasize numbers, but when you do emphasize a number, you generally write it out for emphasis, like "It's 9 inches long. *Nine* inches!"
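Here is one way those rules could look, as a rough Python sketch and not a vetted implementation. It does use a regex for brevity, even though the comment suggests none is needed. The names `CODE_SPAN`, `EMPHASIS`, and `italicize` are hypothetical, "text" is read as "letters only", and the start or end of a chunk is treated the same as whitespace.

```python
import re

# Backtick code spans are captured so they can be skipped untouched.
CODE_SPAN = re.compile(r"(`[^`]*`)")

# Whitespace (or start of chunk) before the opening asterisk, letters only
# inside, whitespace (or end of chunk) after the closing asterisk.
EMPHASIS = re.compile(r"(?<!\S)\*([^\W\d_]+)\*(?!\S)")

def italicize(text):
    """Apply the strict *word* rule everywhere except inside code spans."""
    parts = CODE_SPAN.split(text)
    # With one capture group, split() puts code spans at the odd indexes.
    for i in range(0, len(parts), 2):
        parts[i] = EMPHASIS.sub(r"<i>\1</i>", parts[i])
    return "".join(parts)

print(italicize("I would like this to be *italic* please."))
print(italicize("x * 7 == x + (x * 2) + (x * 4)"))
print(italicize("use `a *b* c` here"))
```

Under these rules the math, SQL, and Brainf**k comments quoted in the post come through unchanged. The price is that a closing asterisk followed directly by punctuation, as in `*this*.`, is not emphasized either.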
Julian on July 9, 2009 5:13 PM
Not being a Computer Scientist or Software Engineer, I am not understanding why it requires a regex that searches for delimiters + whatever might come in between, when the actual strings you want are finite (5) and known.
What part am I missing?
mike on July 9, 2009 6:27 PM
I don't know if Jeff even reads this far into the comments, but I think this is missing an obvious fact.
You can't take this huge data you have and apply filtering on it when it was written with no filtering in mind. It will not work correctly, and it will break on that odd case where it will be found by that unlucky dude looking at an old entry.
One way to make your life easier when adding such a thing is to use versioning, any comments older than the date you decided to implement this get rendered without the filtering. That way old data is preserved, and someone posting a comment will notice that his formula got italicized and go back and edit it after posting.
This is way better than trying to paddle a boat upstream with a fork and a knife.
Chady on July 9, 2009 7:22 PM
Markdown is such a *cool* idea
Anonymous on July 10, 2009 1:03 AM
I just wanna say that using *real* data for testing is an awesome experience.
So often we let weasely managers convince us that we "can't use the real data" because of some bogus political excuse.
Real data is powerful!
And: you rock Jeff. Keep doin what you do.
secretGeek on July 10, 2009 5:03 AM
My God, someone got their coffee today!
Steve on July 10, 2009 7:23 AM
Chill out, guys - this is a fight you can't win :-P
Juan on July 10, 2009 12:05 PM
I'm all for brute-force testing with real-life data. That part of the topic is fine.
But not every parsing problem can be solved with regular expressions. In fact, only a small fraction of parsing problems can be solved with regular expressions. Checking for balanced punctuation is the poster-child for things you CANNOT do with regular expressions.
My guess is you'll reach some unmaintainable level of complexity with your expression and declare it "good enough", because you won't be able to make it any better without pitching the REs and using better-suited parsing techniques.
Please remember that not every parsing problem is a nail you can whack with a regular expression.
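To make the balanced-punctuation point concrete, here is a short Python sketch of the usual alternative: a stack that remembers which openers are still waiting for their closers. The names `PAIRS`, `OPENERS`, and `is_balanced` are made up for this example, and it only looks at brackets, not at Markdown delimiters.

```python
# Each closing bracket maps to the opener it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer with nothing open, or the wrong opener on top, fails.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack

print(is_balanced("x + (x * 2) + (x * 4)"))  # True
print(is_balanced("(a [b) c]"))              # False
```

The stack is the piece of state a classic regular expression has no way to keep, since nesting can be arbitrarily deep.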
Adrian on July 10, 2009 12:10 PM
RegEx...YUM!
Here is an XML to Array function built on RegEx
http://blog.chronofish.com/?p=111
-CF
Didn't we used to do italics like /this/ ?
And didn't this mean _underscore_ ?
And this *bold* ?
Do we really have to have a different syntax every time a new website comes along ?
Jeff Atwood, you put your mistakes on the internet for everyone to see. That takes brass balls! Just in case today was the day that you decided to let it get to you... don't. Thanks for doing this, I'm sure I'm not alone when I say I appreciate it!
Chris McCall on July 13, 2009 7:23 AM
It seems the Forbes/Atwood argument is a mirror of scientific research in general. The classic approach is to create a hypothesis first THEN test it with a bunch of data. Whereas the opposite (known rather unfairly as fishing) suggests you analyse a bunch of data first and THEN try and learn something from it.
Personally (and no offence to anyone) I find the classical approach a bit arrogant as it kind of implies that we know all the answers before we even start. If the size of the data is significant, and the method of analysing it is thorough and you then think very carefully about what it is telling you (ie true cause and effect relationships vs just natural correlations) then the test first process can be very powerful.
Glenn on July 13, 2009 7:15 PM
Hm... missing design and missing test knowledge lead to just a trial-and-error style of scripting... I think you can apply for a dark red light saber now.
@Glenn: That is not a test-first approach. A "test first" approach would have you define your test set first: what you want to parse and what not. Think of test-driven programming. You first write the tests, e.g. in NUnit. Then forget the tests completely. Then write your code. (Otherwise you only write code to fit your tests, and that's not good. Even better is if two distinct persons write the tests and the code.) But both the tests and the code are specified.
To not specify something and just let it run over some real-world data will most likely lead to problems later. You "test" against a fixed point in time, where some problems are still to occur. E.g. if you try to test an algorithm against a database of credit card numbers, it can be that you either have already-fixed data, or that you don't see some cases (like some users entering whitespace between each number block). Sure, you have real-world data. But real-world data ages like hell.
There is nothing wrong with testing against real-world data as a system test or system validation. But at the first test level, you should sit down and think about what your requirement is, what the good and bad cases are, and test against them.
It's like building a house. You can plan first, or just begin to put stones on top of each other. Maybe you build something stable by just putting stones together around equipped rooms (your real-world test scenario), maybe not.
offler on July 14, 2009 12:17 AM
Jeff ... *when you're adding on new functionality like that, I'd be very careful*. Perhaps you'd rather use a whitelist format that no one has used before, with HTML like you said in a previous post ... such as [B]foo[/B] or something similar :)
And when did you get rid of ORANGE? Too much spam now eh?
Greg Magarshak on July 17, 2009 9:26 AM
I was just reading the conversation here. It is a good post. I learned a few things about meta that I didn't know before. Thanks!
Thomas on July 18, 2009 11:21 PM
If you need to test it to see what the test will produce you probably didn't think it through enough. Where's your foresight, man?
BmB on July 25, 2009 7:22 AM
This is a very specific case in the life of a programmer, who is facing a particularly tricky problem.
I'll hazard to break-down the problem into two parts:
1) The data the programmer is working with, is user-generated (maybe with minimal input-side clean-up).
2) The programmer is attempting to blindly (meaning, without 'approving each change') manipulate the data.
The second part is what makes this exercise most dangerous. Of course, in this scenario, it makes sense to think up as many cases as possible, PLUS throw as much test data at it as possible.
But not all programmers go through this phase all the time. When you are building small, solid logic pieces, you can get away with traditional testing (note that I disagree with Jeff's first insinuation that it was enough testing -- most programmers at that point would have cringed and automatically thought up a lot of cases that weren't even programmed for, forget tested for).
But a simple rule of thumb should be that if you are manipulating data in any way, test until your grandchildren complain, and if you are storing the manipulated data, maintain backups of the original data for a long, long time.
Veer on August 3, 2009 1:24 AM
Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.