Markdown was one of the humane markup languages that we evaluated and adopted for Stack Overflow. I've been pretty happy with it, overall. So much so that I wanted to implement a tiny, lightweight subset of Markdown for comments as well.
I settled on these three commonly used elements:
*italic* or _italic_
**bold** or __bold__
`code`
I loves me some regular expressions and this is exactly the stuff regex was born to do! It doesn't look very tough. So I dusted off my copy of RegexBuddy and began.
I typed some test data in the test window, and whipped up a little regex in no time at all. This isn't my first time at the disco.
Bam! Yes! Done and done! By gum, I must be a genius programmer!
Despite my obvious genius, I began to have some small, nagging doubts. Is the test phrase...
I would like this to be *italic* please.
... really enough testing?
Sure it is! I can feel in my bones that this thing freakin' works! It's almost like I'm being pulled toward shipping this code by some inexorable, dark, testing ... force. It's so seductively easy!
But wait. I have this whole database of real world comments that people have entered on Stack Overflow. Shouldn't I perhaps try my awesome regular expression on that corpus of data to see what happens? Oh, fine. If we must. Just to humor you, nagging doubt. Let's run a query and see.
select Text from PostComments where dbo.RegexIsMatch(Text, '\*(.*?)\*') = 1
Which produced this list of matches, among others:
Interesting fact about math: x * 7 == x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2. Integer addition is usually pretty cheap.
Thanks. What I needed was to turn on Singleline mode too, and use .*? instead of .*.
yeah, see my edit - change select * to select RESULT.* one row - are sure you have more than one row item with the same InstanceGUID?
Not your main problem, but you are mix and matching wchar_t and TCHAR. mbstowcs() converts from char * to wchar_t *.
aawwwww.... Brainf**k is not valid. :/
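To see why these comments trip the pattern, here is a minimal sketch in Python rather than the SQL above. The `naive_matches` helper is a made-up name for illustration, and the sample strings are abridged from the matches listed above; Python's `re` module stands in for whatever engine sits behind `dbo.RegexIsMatch`.

```python
import re

# The naive pattern from the query above.
NAIVE_ITALIC = re.compile(r"\*(.*?)\*")

# Abridged copies of the comments listed above, plus the original test phrase.
comments = [
    "I would like this to be *italic* please.",
    "x * 7 == x + (x * 2) + (x * 4)",
    "mbstowcs() converts from char * to wchar_t *.",
    "aawwwww.... Brainf**k is not valid. :/",
]


def naive_matches(text):
    """Return every span (asterisks included) the naive pattern would italicize."""
    return [m.group(0) for m in NAIVE_ITALIC.finditer(text)]


for comment in comments:
    print(naive_matches(comment), "<-", comment)
```

Only the first string is a real italics request. The others match because the pattern is happy with spaces next to the asterisks, or with nothing at all between them.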
Thank goodness I listened to my midichlorians and let the light side of the testing force prevail here!
So how do we fix this regex? We use the light side of the force -- brute force, that is, against a ton of test cases! My job here is relatively easy because I have over 20,000 test cases sitting in a database. You may not have that luxury. Maybe you'll need to go out and find a bunch of test data on the internet somewhere. Or write a function that generates random strings to feed to the routine, also known as fuzz testing.
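If you don't have a corpus handy, a fuzz test might look something like the sketch below. Everything in it is an assumption made for illustration: the alphabet, the `<em>` output format, and the single invariant being checked (no empty or whitespace-only emphasis) are my choices, not anything Stack Overflow actually runs.

```python
import random
import re

# Hypothetical alphabet, deliberately heavy on markup characters.
ALPHABET = "ab *_`.,("

# Rendered output should never contain empty or whitespace-only emphasis.
EMPTY_EM = re.compile(r"<em>\s*</em>")


def random_comment(rng, max_len=12):
    """Build a short random string from the markup-heavy alphabet."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def naive_render(text):
    """Render italics using the naive pattern from the query above."""
    return re.sub(r"\*(.*?)\*", r"<em>\1</em>", text)


def looks_sane(html):
    """Check the one invariant this sketch cares about."""
    return EMPTY_EM.search(html) is None


def fuzz(render, trials=1000, seed=0):
    """Feed random comments to render() and collect inputs that break the invariant."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        text = random_comment(rng)
        if not looks_sane(render(text)):
            failures.append(text)
    return failures


if __name__ == "__main__":
    bad = fuzz(naive_render)
    print(len(bad), "failing inputs, for example:", bad[:5])
```

Random strings won't look like real comments, so this complements a real corpus rather than replacing it. The fixed seed keeps any failure reproducible.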
I wanted to leave the rest of this regular expression as an exercise for the reader, as I'm a sick guy who finds that sort of thing entertaining. If you don't -- well, what the heck is wrong with you, man? But I digress. I've been criticized for not providing, you know, "the answer" in my blog posts. Let's walk through some improvements to our italic regex pattern.
First, let's make sure we have at least one non-whitespace character inside the asterisks. And more than one character in total so we don't match the ** case. We'll use positive lookahead and lookbehind to do that.
\*(?=\S)(.+?)(?<=\S)\*
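Here is a small sketch of that second pattern in Python, whose `re` module supports the same lookahead and fixed-width lookbehind syntax. The `italic_spans` helper is a hypothetical name, and the sample strings are taken or abridged from this post.

```python
import re

# Second version: the text inside the asterisks must start and end
# with a non-whitespace character.
ITALIC_V2 = re.compile(r"\*(?=\S)(.+?)(?<=\S)\*")


def italic_spans(text):
    """Return the text between each pair of asterisks the pattern accepts."""
    return ITALIC_V2.findall(text)


samples = [
    "I would like this to be *italic* please.",
    "x * 7 == x + (x * 2) + (x * 4)",
    "aawwwww.... Brainf**k is not valid. :/",
    "p*q*r",
]

for sample in samples:
    print(italic_spans(sample), "<-", sample)
```

The math and Brainf**k comments no longer match, but a string like p*q*r still does, because nothing yet constrains what sits outside the asterisks.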
That helps a lot, but we can test against our data to discover some other problems. We get into trouble when there are unexpected characters in front of or behind the asterisks, like, say, p*q*r. So let's specify that we only want certain characters outside the asterisks.
(?<=[\s^,(])\*(?=\S)(.+?)(?<=\S)\*(?=[\s$,.?!])
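One thing worth checking with this version: inside a character class, `^` and `$` are literal characters, not anchors, so the pattern as written wants a real character on each side of the asterisks. The sketch below compares it with a variant that also accepts the start and end of the string. That variant and both helper names are my own assumptions for illustration, written for Python's `re` module, which requires fixed-width lookbehind (hence the alternation sitting outside the lookbehind).

```python
import re

# The third version exactly as written above.
AS_WRITTEN = re.compile(r"(?<=[\s^,(])\*(?=\S)(.+?)(?<=\S)\*(?=[\s$,.?!])")

# Hypothetical variant: ^ and $ moved out of the character classes so they
# act as anchors and italics can sit at the very start or end of a comment.
ANCHORED = re.compile(r"(?:^|(?<=[\s,(]))\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])")


def spans_as_written(text):
    """Italic spans found by the pattern as written."""
    return AS_WRITTEN.findall(text)


def spans_anchored(text):
    """Italic spans found by the anchored variant."""
    return ANCHORED.findall(text)


samples = [
    "I would like this to be *italic* please.",
    "p*q*r",
    "*Testing* is good.",
    "this is *important*",
]

for sample in samples:
    print(spans_as_written(sample), spans_anchored(sample), "<-", sample)
```

Both patterns reject p*q*r and accept the original test phrase. Only the anchored variant picks up italics at the very beginning or end of a comment, which is exactly the kind of gap a run against the corpus would surface.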
Run this third version against the data corpus, and wow, that's starting to look pretty darn good! There are undoubtedly some edge conditions, particularly since we're unlucky enough to be talking about code in a lot of our comments, which has wacky asterisk use.
This regex doesn't have to be (and probably cannot be, given the huge possible number of human inputs) perfect, but running it against a large set of input test data gives me reasonable confidence that I'm not totally screwing up.
So by all means, test your code with the force -- brute force! It's good stuff! Just be careful not to get sloppy and let the dark side of the testing force prevail. If you think one or two simple test cases cover it, that's taking the easy (and most likely, buggy and incorrect) way out.
What the heck should googlefights want to show?
try
http://www.googlefight.com/index.php?lang=en_GB&word1=Jeff+Atwood&word2=Adolf+Hitler
even a wrong spelled name can win
http://googlefight.com/index.php?lang=en_GB&word1=jeff+atwood&word2=joseph+stalin
And if you try to compare him against somebody in computer science who has really done something (like developing semaphores, winning the Turing Award and so on), guess who would win?
http://googlefight.com/index.php?lang=en_GB&word1=jeff+atwood&word2=Edsger+Dijkstra
I usually enjoy your posts, and I am a fan, but this one was just a dumb waste of time. Seems like you were trying to fill a quota on this one.
The Skipper on July 8, 2009 2:04 AM
As this entry demonstrates nicely, Markdown IS the dark side:
-insufficiently unique tags
-closing tag is the same as the opening tag
...thus necessitating a regex that's ridiculous even by the standards of regexes.
"italic" is hardly onerous, as jasonmray notes above, and stripping out all HTML except , , and is a solved problem.
Paul on July 8, 2009 2:33 AM
Naturally, I meant "i" or "\" or whatever magic incantation it takes to make angle brackets show up here. I trust the irony is not lost on anyone.
Paul on July 8, 2009 2:36 AM
May I ask what the difference is between .* and .*?
I just used a regex today (not something too common in my work) and am now curious :)
@Dennis Forbes:
"(somehow CodingHorror got on my iGoogle page, and I've been remiss to remove it. And every now and then I expand one of those nodes...)"
If I don't like the service in one place I never go there again, period. For someone who does not like this blog you have been here for a long time (http://www.codinghorror.com/blog/archives/000845.html). Maybe you are a stalker, maybe you are jealous, or maybe you have a crush on him. Either way it's kinda scary.
Yes. It's terribly scary, Juan. And from looking at the cadence of your sentences, I have to think that you like to injure small animals for fun.
As much as I enjoy ridiculous allusions and accusations, and as utterly in-awe I am of your amazing Google-fu, I find it humorous that you pointed out that I had commented on that particular blog entry, given that I did so after Jeff commented on one of my entries.
Though I'd been visiting Jeff's blog -- there have been some very enjoyable reads over the years -- since a lot earlier than 2007.
Maybe, just maybe my dear friend Juan, the online development community really isn't all that big, and we find ourselves in surprising intersections all the time. Strange how that works.
Dennis Forbes on July 8, 2009 3:38 AM
This does basically the same thing, covering a few more edge cases, using the more advanced features of the .NET regex engine:
(?(?(?${Phrase}
kbiel on July 8, 2009 4:00 AM
Mr. Forbes is not wrong at all, as he's not bashing Jeff's testing, but criticizing HOW he is conducting the implementation of the feature.
And as for the googlefight results, the ones he's winning are probably due to links for Forbes magazine or related material.
Comment filtering failure. I'm not going to attempt to find out how to fix it without a preview option.
kbiel on July 8, 2009 4:05 AM
>x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2.
Aren't those shifts going the wrong way? Also, don't the "+" operators get evaluated before the ">>" operators?
Keith on July 8, 2009 5:40 AM
*Or* You could just make it optional, and then you don't need to worry about it ...
Too Much PHP on July 8, 2009 5:53 AM
Your regex implementation doesn't have \b apparently - which is for word boundary. Rather useful :)
kd on July 8, 2009 7:12 AM
I, like most Unix shell hacks, am pretty good at regular expressions, but I'm not sure whether you want to use them here.
Besides, why are you writing your own parser when you have textile that does it for you? (See http://textile.thresholdstate.com/).
As an added benefit, you'll be using the same sort of syntax that other sites use. Italics? Use underscores. Bold? Use asterisks. Underscore? Use plus signs. Crossout? Use minus signs. Why should I have to remember your site uses double underlines when everyone else uses single underlines?
And, if you really, really insist on using C#, you'll be happy to hear that there's a .NET version of Textile: http://www.codeplex.com/textilenet
David W. on July 8, 2009 7:25 AM
As I read the post, I expected a pithy moral about not trying to impose new semantics on old free-form data written without knowledge of those semantics. The question of how to (or whether to) write a regex for this should come after more important questions such as whether to do it in the first place (and for older comments written when a star was expected to be a star, I'd say no).
Of course, this all begs the question of why Yet Another Random Markup Language needs to be developed. Are html tags so foreign to your visitor base that you have to come up with something new?
Jeffrey Friedl on July 8, 2009 7:29 AM
Woaw, you're the king of regex !
Geo on July 8, 2009 8:06 AM
A while back there was a post somewhere out there in the tubes by a guy advocating test driven development. In his case he was building a sudoku solver, and began his development process, per TDD dictums, by building his tests.
He then pursued the development of the application in the most mindless, ridiculous manner possible, basically just believing that the best approach was to kind of randomly change things until the tests passed.
I feel the same way about this entry. Did you *really* have to "go to the tests" to realize that there were egregious gaps in your matching? I would say that it is close to one of the worst ways to approach the problem.
Dennis Forbes on July 8, 2009 8:10 AM
Yeah, gotta start reading that book you talked about, those regexp actually make Japanese appealing :-)
Dan on July 8, 2009 8:12 AM
If Dennis Forbes is angry with me, well ... it must be Wednesday!
I can't speak for you, but for me, it's *way* harder to sit around and dream up all the edge conditions than it is to, y'know, throw a bunch of data at it and see.
Jeff Atwood on July 8, 2009 8:15 AM
If backticks signify code, how would I write shell code that contains backticks?
alt on July 8, 2009 8:17 AM
But those comments have been entered by users who don't have any markup available. By introducing 'code' as possible markup don't most of those things cease to be a problem? Surely they'll just write '7*x'?
Angry? Not really, Jeff.
And I would argue that they aren't edge conditions, and that really is the crux of the point. Such scenarios are blatantly obvious, I think, if you just stop and contemplate for a few moments.
As an aside, make your implementation in F# and you'll have an interesting (and likely very fast) implementation to talk about. Regular expressions are one of the greatest solutions in software development, but you know that trite old saying -- now you have two problems. It isn't super ideal for parsing text in such a manner.
Dennis Forbes on July 8, 2009 8:30 AM
What's with all this testing crap? Remember: We Make Shitty Software.. With Bugs!
http://www.codinghorror.com/blog/archives/000099.html
Zack Peterson on July 8, 2009 8:31 AM
Unfortunately, one completely ambiguous case is things like identifiers surrounded by double underscores such as __FILE__ and __LINE__. If it were up to me, I would only use asterisks for bold/italics, and not underscores.
Of course, the easy workaround is to do `__FILE__`, but to do so, you need to be aware that double underscores mean bold.
Adam Rosenfield on July 8, 2009 8:32 AM
Great post! Brute-force testing makes a lot of sense, especially for small components of software. For larger components brute force might be a problem, either because it takes a lot of time or there is no possible way you can cover a lot of ground in a reasonable amount of time. Testing stuff at a lower level is a must.
Tathagata on July 8, 2009 8:35 AM
Is it just me, or is this particular blog not generating a lot of comments?
Don't get me wrong Jeff, if I have to search and replace more than one thing in any of my files, I do it with Codewright's version of regex - like most things in coding, no one cares but me.
Jamie on July 8, 2009 8:35 AM
While I see the point Dennis is making, I often go with Jeff's approach for testing.
It is the same thing as using the compiler as a test. Officially frowned upon, but if you're in an interpreted language or working on something that takes seconds to compile why not use the compiler as a sanity check?
Heck my IDE of choice, Netbeans, does that automatically now.
Some Random Internet Guy on July 8, 2009 8:43 AM
Ran into this yesterday unexpectedly on Server Fault. I was trying to *chortle* but I wound up doing it in italics which was very deceptive. I thought *'s around a word meant some sort of noise your body might make. At least that's how it was in mIRC in the mid-90's.
Peter Turner on July 8, 2009 8:43 AM
> Such scenarios are blatantly obvious, I think, if you just stop and contemplate for a few moments.
So what? If you've got a lot of test data available, why not run your code against it to see which, if any, blatantly obvious things you missed?
Dumb approaches, smart approaches, they can all help, and they can all miss things.
Paul D. Waite on July 8, 2009 8:50 AM
Erm, would just using a Markdown library not have made more sense?
Robert Synnott on July 8, 2009 8:52 AM
"First, let's make sure we have at least one non-whitespace character before and after each asterisk."
I don't think that's what you mean... what you wrote would highlight "my*string*s", but not "my *string* is"
Tordek on July 8, 2009 8:55 AM
*sigh*
http://www.googlefight.com/index.php?lang=en_GB&word1=%22Jeff+Atwood%22&word2=%22Dennis+Forbes%22
Juan on July 8, 2009 9:00 AM
@Paul-
>So what? If you've got a lot of test data available, why not run your code against it to see which, if any, blatantly obvious things you missed?
Tests are grrrrr-eat.
Tests are not a substitute for actually thinking about the problem space, however, and this post supports a theory I've long had that tests sometimes make people produce worse code, because they get used as a substitute for purposed coding.
I strongly suspect that Jeff took dramatic license in his post, and he didn't really go into it so blindly, but it served to demonstrate the use of test data.
@Juan-
Google fight? Dude, I'm an unknown nobody. Jeff smacks me down and calls me Sally when it comes to internet popularity.
Dennis Forbes on July 8, 2009 9:04 AM
I wonder how many real italics attempts will now be lost by the updated expression?
Joel Coehoorn on July 8, 2009 9:05 AM
Why is it easier and less confusing for the commenter to write *italics* rather than italics ?
Especially when we eventually have *italics* #bold# @underline@ %hidden% etc.
Being able to "dream up" edge cases demonstrates that you understand both the problem and the algorithm, along with limitations.
My first thought on seeing your initial regex, was the fact that a comment wouldn't be able to contain multiple italic blocks:
test this in *italic*
test this *other* thing in *italic*
With a greedy algorithm — (which most regex engines are) — the block "other* thing in *italic" would be italic.
Essentially what you're doing is trying to use a regular expression to match a context-free grammar, which is not possible. You can fake it to a limited extent using an "enhanced" regex engine (with back-references, etc) and some creative expressions, but the level of nesting that you can support will be finite, and the deeper you go, the more complex the expression becomes.
Either put a rich text field on your form (so it acts like a mini word processor and hides the markup in the background) or learn to live with the fact that users will have to enter markup into a plaintext field in order to have their plaintext appear to be rich text on the web page.
Oh, wait, **text** and __text__ is SOOO much less ambiguous and easier on the eyes of a layperson.
tb on July 8, 2009 9:11 AM
Could you just have commenters type stuff I want italicized?
Xami on July 8, 2009 9:15 AM
You should have looked at the WMD code! I could have sent you some snippets where this sort of thing is handled.
Dana on July 8, 2009 9:16 AM
If you're looking for a regex to do X you should start by looking at http://regexlib.com/ to see if someone already has done X. Most of these situations are non-unique, so reinventing the wheel can often be avoided. And of course, if you don't find something that suits your needs, or what's there is full of bugs and you know how to fix them, by all means submit the new-and-improved regex.
Steve Smith on July 8, 2009 9:25 AM
I know you love RegEx, but isn't this the wrong tool for the job? Shouldn't you be writing a parser for your reduced version of Markdown?
This shouldn't be too time consuming as:
You only want three elements of the complete language.
There are probably open source parsers for Markdown you can rip off/learn from.
And you still get to test it against heaps-o-data.
Win Win?
Sir Digby Chicken Caeser on July 8, 2009 9:29 AM
I guess the point of the article is to use real test data for tests, so, good on you.
But it would have been nice to hear a word about BNF grammars and actual, real text parsing, instead of hacking it together from regex. Even if Markdown isn't defined in a deterministic way, it still would have been nice to hear a little peep about parsing and how it would have solved all those nasty little edge cases, and why we can't use the traditional approach today.
Peter on July 8, 2009 9:30 AM
How about saving a "markup version" with each comment? Then you could render old submissions without any formatting or the old way... and new submissions with the current method...
Or... if you have a way of escaping your newly introduced control chars, escape all old comments...
Jim on July 8, 2009 9:30 AM
I hope you will find the problem one day you are trying to solve with this mindless post.
Regular expressions are useful. We get it. You are a cool. Your web site is cool. You will make a lot of $$$ (or $#%@%^@, it remains to be seen).
securityhorror on July 8, 2009 9:33 AM
Testing is for pansies! Test this, test that...blah, blah, blah. Real codeslingers just throw their product out to the masses and deal with the - ahem - very small number of bugs that their exceptional code will somehow manage to generate. What's the worst that could happen, eh?
Kenneth on July 8, 2009 10:01 AM
You should have just posted a question on Stack Overflow, asking for help. I know I would have seen some of the problems with the first one, and I think there still may be some problems with the last one.
( The question should also have had a link to the Meta SO question where people could make comments about the idea )
Brad Gilbert on July 8, 2009 10:02 AM
This won't match
*Testing* is good.
You may want:
(?=^|[\s^,(])\*(?=\S)(.+?)(?=\S)\*(?=[\s$,.?!])
Tim on July 8, 2009 10:09 AM
Well, here is an excellent example of what your stackoverflow dumps can be used for, TEST CASES FOR FREE :D
Kost on July 8, 2009 10:12 AM
"Especially when we eventually have *italics* #bold# @underline@ %hidden% etc."
Back from the old BBS days, I find /italics/ *bold* _underline_ much more intuitive.
Secure on July 8, 2009 10:14 AM
This will work until you decide to add some escaping. Oh wait... you have already added it: `code`!
So now you need to modify your regexp so that `text *some* text` would not match
LXj on July 8, 2009 10:16 AM
Don't forget to watch in which mode you currently are. E.g. within a `code block` you probably don't want * to be matched at all (people will most likely not use italic within a code block, will they?). This is probably the hardest thing to do -- if possible with only regex at all... I'm not sure about that (not unless you pack all three components into a single regex, code, italic and bold).
BTW, I don't understand why *xxx* is italic and **xxx** is bold. Ages ago, when nobody had ever heard of HTML, we styled our read-me's, mails and Usenet posts with: /italic/, *bold*, _underlined_ :-)
(apart from the fact that I think italic is rather useless. To me italic text is not emphasized at all, it is rather slimmer and less emphasized, de-emphasized so to say. Sometimes it's also just plain ugly or hardly even distinguishable from normal text, depending on the font being used. So I always use bold to emphasize and if / dies tomorrow, I will not really miss it)
Mecki on July 8, 2009 10:18 AM
Sure. The best use of your time is to stare at the code and ponder all the possible holes in it. Do this for hours so you can be SURE you thought of everything. Definitely do NOT run a simple quick query on a huge amount of real world test data. That would be stupid.
While you are at it, I think you should go back through your code base, ignore all profiling you have done so far, and start pre-optimizing code. That's always a good use of time.
Because as a team of 3, I'm sure you guys have enough diversity to think like every person in every culture out there and fully know exactly what they will click on, in what order, what type of data they will enter, etc.
After you then get your perfect code, release it. When people complain of problems, go back and fix your code (what? I thought I was so smart I'd found every possible bug! that's unpossible!).
But for Dennis' sake, definitely do NOT use your test data first!
Come on people. Using a huge amount of real world data as your first cut is the best possible use. You will see how people are using the *, and then, after you solve all those edge cases, you can THEN stare at your code and analyze each special case you had to consider. This will actually help you think of as of yet not encountered uses of the * character.
Some people are so incredibly arrogant it amazes me.
Matt on July 8, 2009 10:18 AM
While I agree with the original idea of "this is exactly the stuff regex was born to do" when the problem seemed trivial enough, I will have to agree with tb: in this case, regexes are not the best tool for the job (although I smiled at the fact that using "**" instead of "*" doesn't solve the problem, because things like "char **p;" could happen too).
On the other hand, maybe you are not looking for the *perfect* solution, maybe something that works right 99.98% of the time is good enough, in which case you have already (almost) solved your problem. Furthermore, maybe *not* spending two days writing some context-free grammar or a state machine could be one of those "bad for the software, good for the business" things, right?
Having nitpicked enough, +1 for the article.
Side note: you should consider some form of comment moderation. I'm not saying you should get rid of anyone who doesn't agree with you (Dennis Forbes, for instance, disagrees with you, but in a good way), but are insults and such really necessary (securityhorror, I'm looking at you)?
Martin on July 8, 2009 10:19 AM
The real solution would be to realize that regexes are a poor tool for this, and use a proper parser instead. Continue down this path and you have something like MediaWiki markup - an incredibly irregular markup language that can only be properly parsed by the one and only canonical implementation, using a horrid mishmash of regex and other functions.
Nick Johnson on July 8, 2009 10:20 AM
I really think this is one of those situations where regex is not the answer. By the time you've coaxed out all the tricksy situations, you've got a monster unmaintainable regex.
Much better to do it in "normal" code. Of course that normal code might use simple regexes.
John H on July 8, 2009 10:22 AM
Midichlorians?
*gag*
I thought we had all agreed that "The Phantom Menace" didn't really happen.
jeffH on July 8, 2009 10:26 AM
@Kenneth-
Who are you directing your sarcasm at, because no comment that I can see in here implies that testing is a bad thing. Testing is a great thing.
However Jeff is pursuing Test Driven-Off-A-Cliff Development here. Worse, it isn't even *really* tests at all because the tests "pass" based upon him eyeballing generated output to see if it's getting closer to expectations. It is classic hackery-in-the-bad-way sort of coding.
@Matt-
Groan. It isn't worth replying.
Dennis Forbes on July 8, 2009 10:29 AM
This is why I use HTML. It's slightly more time-consuming than Markdown, especially for unordered lists, but as someone very familiar with HTML I'm not really bothered.
Jonathan Drain | D20 Source on July 8, 2009 10:47 AM
Jeff, what kind of shoes do you wear? I'll buy the same because I want to be like you. And please tell me more about you.
code monkey on July 8, 2009 10:48 AM
Tim was right. When you put `^` or `$` inside a character class (like `[\s^,(]` or `[\s$,.?!]`), they no longer match positions, but those literal characters. `\s` may or may not mean "any whitespace" inside a character class, depending on your regex engine (some allow it, some interpret it as "either \ or s").
So I believe what you meant was:
`(?=^|[\s,(])\*(?=\S)(.+?)(?=\S)\*(?=$|[\s,.?!])`
(and this matches one or more characters inside the asterisks, not "more than one character in total").
However, it seems to be working well so far! Thanks for the new feature!
Noah on July 8, 2009 10:52 AM
It's funny, but with each new article discussing regex I seem to dislike it more and more. I've toyed with regexes before but to me writing code with my chosen language's native string functions is much easier to read, modify and maintain, especially as complexity grows.
Sure, often I do in ten lines of code what can be done in one, but hasn't C taught us that that isn't always the brightest idea?
HearWa on July 8, 2009 11:00 AM
If you're a nobody... how do you get attention?
Simple, just go against everything that a well-known person says even if they are completely right, there you go! you got your 15 minutes at last!
Guys let's take this advice and ignore this guy.
http://www.codinghorror.com/blog/archives/001271.html
PS: That guy is Forbes
Great article Jeff. I love star wars, and I love the star wars/regex tie in. I'll be back.
Cheap Websites - Josh on July 8, 2009 11:12 AM
Great article Jeff. I love star wars, and I love the star wars/regex tie in. I'll be back.
Cheap Websites - Josh on July 8, 2009 11:13 AM
Whoops, sorry for the double post. Please delete.
Cheap Websites - Josh on July 8, 2009 11:13 AM
@Dennis Forbes:
By the way, there are two Juan's in the room, I'm Juan Zamudio, the other one is just Juan.
While I agree that most of the recent posts are not that valuable, I keep coming back hoping I can read another great post like in the good old days, but I find it interesting that you come back for more, and after almost two years you have not found the time to remove codinghorror from iGoogle given the fact that you dislike this blog. That's the point I wanted to make.
I also didn't see the point in that Google-fu that you mention, that brings nothing to the table.
PS: I'm not a Jeff Groupie, I found this blog by accident also (searching something related to the code complete book, I'm a McConnell whore, I have to admit that).
PSS: If you don't find cadence in my sentences (I didn't know if that was for me or the other Juan), sorry, my English is not that good.
"Some people are so incredibly arrogant it amazes me.
Matt on July 8, 2009 9:18 AM"
Excellent, excellent point Matt I couldn't have said it better myself.
o.s. on July 8, 2009 11:33 AM
Another example of why regex is shit.
That regex is pretty much impossible to read unless you carefully split it up in to smaller parts.
For this kind of thing you need a proper grammar.
In the grand tradition of honing in on something in a blog post that has nothing to do with the purpose of the post...
and are outdated. Instead we're all supposed to use and
Alex on July 8, 2009 11:45 AM
Sorry, I forgot- html doesn't fly in comments-
What I was saying before: we're not supposed to use [i] and [b] anymore, instead we're supposed to use [em] and [strong]
Alex on July 8, 2009 11:46 AM
>If you're a nobody... how do you get attention?
Disagree with Jeff on his own blog! GENIUS! Then you can gain the attention of a bunch of people who through survivorship-bias (in that they continued to read it) are going to likely be fans of Jeff's!
Somehow I don't think that strategy is a very good avenue to fame. Gosh, I'm going to have to rethink this.
>Simple, just go against everything that a well-known person says even if they are completely right
Sorry, friend, but I've trodden this ground half a decade ago - http://www.yafla.com/dforbes/The_Fallacy_of_Test_Driven_Development
I disagree with Jeff when I disagree with Jeff (somehow CodingHorror got on my iGoogle page, and I've been remiss to remove it. And every now and then I expand one of those nodes...). If this hurts your precious feelings, I would advise that you stop reading the comments.
>Guys let's take this advice and ignore this guy.
This is like those YouTube channels where people put a big notice at the top disclaiming that they don't care what anyone thinks, which of course means that they desperately care what everyone thinks.
Honestly I think Jeff should disable comments, because his biggest fans are his worst enemies, and they are the reason he gets often undeserved backlash. It's like some sort of weird little groupie festival.
Dennis Forbes on July 8, 2009 11:57 AM
*text* or _text_ (or double) are no good choices for markup in an environment that is full of bad C code. Go for of some sort and sanitize the database by escaping the old posts of course. These * and _ will just annoy everyone.
Someone
Someone on July 8, 2009 12:20 PM
Google fight is its own arch enemy: compare
http://www.googlefight.com/index.php?lang=en_GB&word1=%22Jeff+Atwood%22&word2=%22Dennis+Forbes%22
to
http://www.googlefight.com/index.php?lang=en_GB&word1=Jeff+Atwood&word2=Dennis+Forbes
> While I see the point Dennis is making, I often go with Jeff's approach for testing
I go with the 'Dennis' (smart) approach _always_ (mind the markup). *THEN* I always go and bruteforce test in as many ways possible. I'm always surprised at at least one edge case I missed. I try to make it a habit to _think_ why I missed the particular case initially. That way my reasoning + hitrate improves.
This post reminded me of this article: http://blog.dotnetwiki.org/2009/01/16/NamedFormatsPexTestimonium.aspx where he used Pex to automatically generate test cases where the two implementations differ. Perhaps something like that would be of use for you.
Kevin H on July 8, 2009 12:21 PM
ah my tags tag was deleted :-) fun
Someone on July 8, 2009 12:22 PM
First, the folks debating how to make sure the '*' block is surrounded by either whitespace or the start/end of lines ... doesn't the regexp library being used support 'word boundary' matches? "\b" is usually it (http://www.regular-expressions.info/wordboundaries.html).
Second, I echo the concerns of trying to handle this as a regular expression problem, when it's quite obviously a language grammar parsing problem more likely to be satisfactorily solved using BNF or PEG grammar.
Third, and most importantly, why are you eschewing libraries which are out there to do exactly this? I mean, one of the advantages of using a quasi-standard like markdown is that everyone and their mother has made a parser of some sort for it already. Don't waste time reinventing the wheel!
An example PEG grammar for Markdown: http://github.com/jgm/peg-markdown/blob/master/markdown_parser.leg
You'll need to use something like ANTLR to generate your C# parser code from that .leg file, but that should be a WHOLE lot easier than even what you've already done with regular expressions.
Fourth, I think the use of two different ways to do a very simple thing ('*' and '_', and '**' and '__') is Just Plain Wrong. Provide one way to make bold, and one to make italics. Makes it less likely we'll hit the other case by mistake. IMHO, the '*' is the most used one and least likely to cause problems.
Finally, I agree with other posters that markdown's choice of '*' for italics and '**' for bold is braindead (sorry, Gruber!). It should have been '/' and '*' instead. But, at this point, markdown is markdown, and you don't want an exception on your one site.
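On the first point, it may be worth checking how `\b` actually behaves next to an asterisk. `\b` matches between a word character and a non-word character, and both whitespace and `*` are non-word characters, so there is no boundary between a space and a following `*`. The sketch below uses Python's `re` module (an assumption; the post itself uses .NET regexes, where `\b` has the same basic meaning) and compares `\b` with a whitespace lookaround. The pattern names are made up for illustration.

```python
import re

# \b on both sides: requires a word character right next to each asterisk.
boundary = re.compile(r"\b\*(\w+)\*\b")

# Lookarounds: the opening * must not follow a non-space character,
# and the closing * must not be followed by one.
spaced = re.compile(r"(?<!\S)\*(\S+?)\*(?!\S)")

prose = "make *this* italic"
math = "2*3*4"

print(boundary.search(prose))          # None
print(boundary.search(math).group(1))  # 3
print(spaced.search(prose).group(1))   # this
print(spaced.search(math))             # None
```

So `\b` tends to match the arithmetic case and miss the prose case, which is the opposite of what is wanted here; the lookaround version only handles single words and is a starting point, not a full solution.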
Tom Dibble on July 8, 2009 12:38 PM
I've never quite understood why simple HTML markup is considered "inhumane". What, really, is the difference between these:
*italic*
[i]italic[/i]
<i>italic</i>
Why come up with some complicated regex filter to convert some contrived markup to HTML, when the original HTML was designed to be simple and human readable to begin with?
In almost all cases where I've seen this "markdown" style of formatting, there's some big filter up front that automatically strips out all possible remnants of HTML as part of some cargo-cult security mechanism. Why not just modify the HTML filter to allow basic bold and italic tags through?
jasonmray on July 8, 2009 12:48 PM
I agree. This all seems way too complicated.
JM on July 8, 2009 12:50 PM
The basics of Markdown -- the parts that Jeff is trying to capture it seems -- do have a certain elegance, paying homage to a less advanced era: When all you had was ASCII, it was generally agreed that you could *emphasize* certain words, and draw _attention_ to others, with nothing more than appropriately placed characters. For those with such a habit, Markdown semantically draws from what they are used to.
I have seen a lot of sites that allow either Markdown, HTML, or some other bastardizations. The back-end process was always Markdown (where used) -> HTML -> correctness checker, so it is a concise set of code.
Dennis Forbes on July 8, 2009 1:05 PM
Why aren't you using the nice semantic element, instead of the old, presentational element?
John Topley on July 8, 2009 1:25 PM
Let's try again. Why aren't you using the nice semantic "em" element, instead of the old, presentational "i" element?
John Topley on July 8, 2009 1:26 PM
lol - you chose the Dark Side when you decided to use regex to parse markdown in order to solve a problem that didn't need either regex or markdown ;-)
but regression testing is always a good thing
Please Jeff, just use BB code that everyone in the world knows how to use and can be implemented with a library. As for breaking existing comments, just don't run this ridiculous italicization code over posts $Date
give the Regex a break and stop trying to reinvent the wheel, especially when the wheel isn't broken!!
fwgx on July 9, 2009 2:04 AM
I agree with Jeffrey Friedl. I would have implemented the feature only for the new comments, as you will break at least some old comments no matter how smart your regexp is otherwise.
Paul-Gabriel Müller on July 9, 2009 2:23 AM
Aren't you still missing the point about testing? Sure you've got a good data set to work from but you're still not actively seeking the conditions that may or may not break your code.
Also, unless you save your test data set, your tests are not repeatable. Without a consistent data set how do you know if a future change to your code has the desired effect?
In your previous blogs you were talking about polishing your code, but you don't seem to be practicing what you are preaching. That is, once you have a piece of code that you think is ready, take the time to develop a thorough test plan and execute it. Better yet, write the test plan when you write your designs. You do do designs and documentation prior to the implementation don't you? Perhaps if you did you wouldn't be so reliant on the Force. :op
Jackie on July 9, 2009 2:24 AM
"This will also help if you ever change your format from Markdown to something else (or not) again." ~ Clinton
But not if you want to change the presentation format. This is especially important since they've started giving away data dumps. It also doesn't work very well with the wiki editing.
"May I ask what the difference is between .* and .*? " ~ Sandro
http://www.ultraedit.com/support/tutorials_power_tips/ultraedit/non_greedy_expressions.html
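A small example may make the difference concrete. This is a sketch in Python's `re` module (an assumption; the post uses .NET regexes, where `.*` and `.*?` behave the same way), with a made-up test string:

```python
import re

text = "make *this* and *that* italic"

# Greedy: .* grabs as much as possible, so one match spans both pairs.
greedy = re.findall(r"\*(.*)\*", text)

# Lazy: .*? grabs as little as possible, so each pair matches separately.
lazy = re.findall(r"\*(.*?)\*", text)

print(greedy)  # ['this* and *that']
print(lazy)    # ['this', 'that']
```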
Aaron G puts it very well - when viewed as a domain analysis exercise the technique can be very useful.
TDD is clearly ineffective for designing the architecture of your code, but can be good for small bits of code -- comparing someone who failed at the former to the latter confuses the scope of Jeff's suggestion I feel.
Hacking code until tests pass is clearly bad. But using the domain knowledge and reasoning that people are claiming this technique circumvents is vital in understanding *why* the edge case occurs, else you are going to meet more edge cases hit by your solution and keep going around in circles.
However, with real world data (or failing that, generated expected data) and some smart reasoning and analysis of that you can create some very useful regression tests which, in a TDD fashion, can help to write well thought out, clean code (or blindly continue with regular expressions, depending on your level of fanboyism).
[ICR] on July 9, 2009 2:29 AM
@Jeff - Or, you can skip reinventing the wheel and check some Markdown source. For example, look at lines 1242++ in http://cpansearch.perl.org/src/BOBTFISH/Text-Markdown-1.0.24/lib/Text/Markdown.pm.
Berserk on July 9, 2009 2:45 AM
For those Star Wars references, I hereby dub you the Jar-Jar of using popular culture references in tech blogging.
Mere mention of midichlorians makes my skin crawl. It's like someone had a really bad hangover and decided to ruin a perfectly good trilogy with additional fluff co-produced with Disney.
> TDD is clearly ineffective for designing the architecture of your code
It's pretty good at testing whether you have a good architecture or not though. Of course, not everybody is good at design. Such people often don't do unit testing because it's too hard to test their code, and therefore unit testing sucks (because there can't be anything wrong with their code, right?)
Also, TDD should mean that you notice cases like Jeff's, where you have clearly taken the wrong approach, because your code looks terrible, and doesn't work all that well.
Someone should take Jeff's RegEx hammer away from him before he hits anything else with it.
Jim Cooper on July 9, 2009 3:51 AM
Jeff, you have a regular expression problem. Using a regular expression in this situation is the wrong solution.
Are you just looking for an excuse to use regular expressions?
And how maintainable do you think that regular expression is?
*sigh* on July 9, 2009 5:07 AM
Couldn't you just impose certain requirements on the user, like required whitespace in front of the first asterisk, nothing but text allowed within the asterisks (no whitespace either) and a required whitespace after the closing asterisk? In addition you could ignore any asterisks within code markup (``). I think this would eliminate most edge cases. I don't think these requirements would even require any explanation to the user, as writing like *this* and not like* this* to emphasize something is the natural way.
Parsing that is dead easy and wouldn't even require regexes. But then, I'm afraid of regex. I think it's a genetic thing whether or not you get them.
You wouldn't be able to emphasize numbers, but when you do emphasize a number, you generally write it out for emphasis, like "It's 9 inches long. *Nine* inches!"
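The rules described above can be sketched without any regular expressions. This is only an illustration under stated assumptions: the function name `emphasize` is made up, tokens are split on single spaces only (tabs and newlines are ignored), "text" is taken to mean letters only, and no HTML escaping is done.

```python
def emphasize(comment: str) -> str:
    """Wrap *word* in <em> tags under the strict rules described above."""
    out = []
    in_code = False  # True while we are between a pair of backticks
    for token in comment.split(" "):
        inner = token[1:-1]
        if (not in_code and len(token) > 2
                and token.startswith("*") and token.endswith("*")
                and inner.isalpha()):
            token = f"<em>{inner}</em>"
        # An odd number of backticks in a token opens or closes a code span.
        if token.count("`") % 2 == 1:
            in_code = not in_code
        out.append(token)
    return " ".join(out)


print(emphasize("It's 9 inches long. *Nine* inches!"))
print(emphasize("x * 7 == x + (x * 2)"))
```

Because a closing asterisk must be followed by whitespace, something like `*this*.` with trailing punctuation is left alone, which is one consequence of rules this strict.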
Julian on July 9, 2009 6:13 AM
Not being a Computer Scientist or Software Engineer, I am not understanding why it requires a regex that searches for delimiters + whatever might come in between, when the actual strings you want are finite (5) and known.
What part am I missing?
mike on July 9, 2009 7:27 AM
I don't know if Jeff even reads this far into the comments, but I think this is missing an obvious fact.
You can't take this huge data you have and apply filtering on it when it was written with no filtering in mind. It will not work correctly, and it will break on that odd case where it will be found by that unlucky dude looking at an old entry.
One way to make your life easier when adding such a thing is to use versioning, any comments older than the date you decided to implement this get rendered without the filtering. That way old data is preserved, and someone posting a comment will notice that his formula got italicized and go back and edit it after posting.
This is way better than trying to paddle a boat upstream with a fork and a knife.
Chady on July 9, 2009 8:22 AM
I'd just put a comment version flag on the comments; comments need to be parsed according to whatever parsing rules were in place when they were input. It means that over time you have to support all your old parsing variants, unless you add something that never appeared in a single comment. But in practice it's what's needed given that we can't edit comments.
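A version flag like this could look roughly as follows. Everything here is hypothetical (the `Comment` class, the `parser_version` field, the renderer names), and the version 2 renderer simply reuses the naive pattern from the post as a stand-in for whatever rules are finally chosen.

```python
import html
import re
from dataclasses import dataclass
from typing import Callable, Dict


def render_v1(text: str) -> str:
    """Original behaviour: escape HTML, apply no markup at all."""
    return html.escape(text)


def render_v2(text: str) -> str:
    """Placeholder for the new rules: *text* becomes italic."""
    return re.sub(r"\*(.+?)\*", r"<em>\1</em>", html.escape(text))


# Every parsing variant ever shipped stays in this table.
RENDERERS: Dict[int, Callable[[str], str]] = {1: render_v1, 2: render_v2}


@dataclass
class Comment:
    text: str
    parser_version: int  # stored when the comment is saved, never changed


def render(comment: Comment) -> str:
    return RENDERERS[comment.parser_version](comment.text)


print(render(Comment("x * 7 == x + (x * 2)", parser_version=1)))
print(render(Comment("this is *italic*", parser_version=2)))
```

Old comments keep rendering exactly as they did, at the cost of carrying every old renderer forward.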
Mr. Shiny New on July 9, 2009 9:41 AM
did you _really_ need italics?
Amazing how complicated it is to interpret human intentions with a text parser, huh?
I spend a lot of time on it as well. It takes a little bit of artificial intelligence level logic to really do it well (at least, once you get past trivial cases).
Practicality on July 9, 2009 10:15 AM
You need to add a checkbox to the comment interface saying "enable markup in comment" which defaults to however I last set that field. Then users doing weird things with special chars can just uncheck the field and not worry about what happens. Similarly, all the existing comments would have that field set false and would render tomorrow like they did yesterday.
jmucchiello on July 9, 2009 10:29 AM
x + x >> 1 + x >> 2?
Wtf?
Surely that was supposed to be:
(x + (x << 1)) + (x << 2)
Am I missing something?
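A quick check, assuming the intended operators are left shifts (multiplying by 2 and by 4). Python is used here only for illustration; as in C, `+` binds more tightly than the shift operators, so the unparenthesized right-shift version from the quoted comment groups in a surprising way.

```python
x = 5

# Without parentheses, + binds tighter than >>, so this is
# ((x + x) >> (1 + x)) >> 2, which is not x * 7.
unparenthesized = x + x >> 1 + x >> 2

# Left shifts with explicit parentheses: x + 2x + 4x.
shifted = (x + (x << 1)) + (x << 2)

print(unparenthesized)   # 0
print(shifted)           # 35
print(shifted == x * 7)  # True
```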
Nicholas Wright on July 9, 2009 10:33 AM
Questions regarding the utility of markdown-style formatting syntax aside, I find myself wondering (again) why Jeff has such a love for reinventing wheels. This isn't his first post where he takes a solved problem, tries to implement his own solution, and posts about the pitfalls he encounters while doing so (the encryption post comes to mind, offhand).
Here, he already knows of (and even uses!) an appropriate existing implementation, yet he still feels the need to roll his own.
I can understand wheel reinvention as a thought experiment or a learning exercise (or in the case where you actually can do something substantially better), but it seems like a poor choice if you're actually trying to get something done.
On the other hand, I guess it may make for a good subject of conversation in a blog post.
Jeremy T on July 9, 2009 11:09 AM
The comments to this entry are closed.