July 7, 2009
Markdown was one of the humane markup languages that we evaluated and adopted for Stack Overflow. I've been pretty happy with it, overall. So much so that I wanted to implement a tiny, lightweight subset of Markdown for comments as well.
I settled on these three commonly used elements:
*italic* or _italic_
**bold** or __bold__
I loves me some regular expressions and this is exactly the stuff regex was born to do! It doesn't look very tough. So I dusted off my copy of RegexBuddy and began.
I typed some test data in the test window, and whipped up a little regex in no time at all. This isn't my first time at the disco.
Bam! Yes! Done and done! By gum, I must be a genius programmer!
Despite my obvious genius, I began to have some small, nagging doubts. Is the test phrase...
I would like this to be *italic* please.
... really enough testing?
Sure it is! I can feel in my bones that this thing freakin' works! It's almost like I'm being pulled toward shipping this code by some inexorable, dark, testing ... force. It's so seductively easy!
But wait. I have this whole database of real world comments that people have entered on Stack Overflow. Shouldn't I perhaps try my awesome regular expression on that corpus of data to see what happens? Oh, fine. If we must. Just to humor you, nagging doubt. Let's run a query and see.
select Text from PostComments
where dbo.RegexIsMatch(Text, '\*(.*?)\*') = 1
Which produced this list of matches, among others:
Interesting fact about math: x * 7 == x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2. Integer addition is usually pretty cheap.
Thanks. What I needed was to turn on Singleline mode too, and use .*? instead of .*.
yeah, see my edit - change select * to select RESULT.* one row - are sure you have more than one row item with the same InstanceGUID?
Not your main problem, but you are mix and matching wchar_t and TCHAR. mbstowcs() converts from char * to wchar_t *.
aawwwww.... Brainf**k is not valid. :/
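To make the failure concrete, here's a quick sketch of the naive pattern running over a few of those comments (Python purely for illustration; the actual implementation isn't shown here):

```python
import re

# The naive first attempt: anything between two asterisks, lazily matched.
naive = re.compile(r'\*(.*?)\*')

# A few of the problem comments from the corpus above.
comments = [
    "Interesting fact about math: x * 7 == x + (x * 2) + (x * 4)",
    "change select * to select RESULT.* one row",
    "aawwwww.... Brainf**k is not valid. :/",
]

# Every one of these is a false positive for italics.
false_positives = [c for c in comments if naive.search(c)]
```

Note that in the Brainf**k case the lazy `.*?` happily matches the empty string between the two adjacent asterisks.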
Thank goodness I listened to my midichlorians and let the light side of the testing force prevail here!
So how do we fix this regex? We use the light side of the force -- brute force, that is, against a ton of test cases! My job here is relatively easy because I have over 20,000 test cases sitting in a database. You may not have that luxury. Maybe you'll need to go out and find a bunch of test data on the internet somewhere. Or write a function that generates random strings to feed to the routine, also known as fuzz testing.
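A minimal fuzz harness along those lines might look like this (a sketch only; the alphabet, string lengths, and run count are arbitrary choices):

```python
import random
import re
import string

def fuzz(pattern, runs=10_000, max_len=40, seed=42):
    """Throw random strings at a compiled regex. We mostly care that it
    never raises and never hangs; surprising matches are worth eyeballing."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + " *_.(),"
    surprises = []
    for _ in range(runs):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        m = pattern.search(s)  # should never raise, however weird the input
        if m:
            surprises.append(m.group(0))
    return surprises

hits = fuzz(re.compile(r'\*(.*?)\*'))
```

Eyeballing `hits` turns up exactly the kind of accidental asterisk pairs the real corpus produced.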
I wanted to leave the rest of this regular expression as an exercise for the reader, as I'm a sick guy who finds that sort of thing entertaining. If you don't -- well, what the heck is wrong with you, man? But I digress. I've been criticized for not providing, you know, "the answer" in my blog posts. Let's walk through some improvements to our italic regex pattern.
First, let's make sure we have at least one non-whitespace character inside the asterisks. And more than one character in total so we don't match the ** case. We'll use positive lookahead and lookbehind to do that.
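The actual pattern isn't reproduced here, so take this as a reconstruction of that second version rather than the code that shipped:

```python
import re

# Assumed second version: lookahead/lookbehind demand a non-whitespace
# character just inside each asterisk, and .+? requires at least one
# character between them, so a bare ** can't match.
v2 = re.compile(r'\*(?=\S)(.+?)(?<=\S)\*')

assert v2.search("I would like this to be *italic* please.")
assert not v2.search("aawwwww.... Brainf**k is not valid. :/")
assert not v2.search("x * 7 == x + (x * 2) + (x * 4)")
assert v2.search("p*q*r")  # still fooled by this one, though
```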
That helps a lot, but we can test against our data to discover some other problems. We get into trouble when there are unexpected characters in front of or behind the asterisks, like, say,
p*q*r. So let's specify that we only want certain characters outside the asterisks.
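Again, the shipped pattern isn't shown, but a plausible third version adds negative lookarounds so that a word character (or another asterisk) immediately outside the delimiters kills the match:

```python
import re

# Assumed third version: additionally forbid word characters or another
# asterisk immediately outside the delimiters, so p*q*r and f**k don't match.
v3 = re.compile(r'(?<![\w*])\*(?=\S)(.+?)(?<=\S)\*(?![\w*])')

assert v3.search("I would like this to be *italic* please.").group(1) == "italic"
assert not v3.search("p*q*r")
assert not v3.search("aawwwww.... Brainf**k is not valid. :/")
```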
Run this third version against the data corpus, and wow, that's starting to look pretty darn good! There are undoubtedly some edge conditions, particularly since we're unlucky enough to be talking about code in a lot of our comments, which has wacky asterisk use.
This regex doesn't have to be (and probably cannot be, given the huge possible number of human inputs) perfect, but running it against a large set of input test data gives me reasonable confidence that I'm not totally screwing up.
So by all means, test your code with the force -- brute force! It's good stuff! Just be careful not to get sloppy, and let the dark side of the testing force prevail. If you think one or two simple test cases covers it, that's taking the easy (and most likely, buggy and incorrect) way out.
Posted by Jeff Atwood
Unless you've changed/improved the original implementation at StackOverflow: yesterday I noticed that `_` in identifiers causes some wacky italicizing when there are two identifiers on a single line in a comment.
Unfortunately I don't remember the example, and I didn't bookmark it.
markdown is such a *cool* idea
I just wanna say that using *real* data for testing is an awesome experience.
So often we let weasely managers convince us that we "can't use the real data" because of some bogus political excuse.
Real data is powerful!
And: you rock Jeff. Keep doin what you do.
my God someone got their coffee today!
I'm all for brute-force testing with real-life data. That part of the topic is fine.
But not every parsing problem can be solved with regular expressions. In fact, only a small fraction of parsing problems can be solved with regular expressions. Checking for balanced punctuation is the poster-child for things you CANNOT do with regular expressions.
My guess is you'll reach some unmaintainable level of complexity with your expression and declare it "good enough", because you won't be able to make it any better without pitching the REs and using better-suited parsing techniques.
Please remember that not every parsing problem is a nail you can whack with a regular expression.
Didn't we used to do italics like /this/ ?
And didn't this mean _underscore_ ?
And this *bold* ?
Do we really have to have a different syntax every time a new website comes along ?
It seems the Forbes/Atwood argument mirrors scientific research in general. The classic approach is to create a hypothesis first, THEN test it with a bunch of data. The opposite (known rather unfairly as fishing) suggests you analyse a bunch of data first and THEN try to learn something from it.
Personally (and no offence to anyone) I find the classical approach a bit arrogant, as it kind of implies that we know all the answers before we even start. If the size of the data is significant, the method of analysing it is thorough, and you then think very carefully about what it is telling you (i.e. true cause-and-effect relationships vs. mere natural correlations), then the data-first process can be very powerful.
Jeff Atwood, you put your mistakes on the internet for everyone to see. That takes brass balls! Just in case today was the day that you decided to let it get to you... don't. Thanks for doing this, I'm sure I'm not alone when I say I appreciate it!
Hm... missing design and missing test knowledge leads to just a trial-and-error style of scripting... I think you can apply for a dark red lightsaber now.
@Glenn: That is not a test-first approach. A test-first approach would have you define your test set first: what you want to parse and what you don't. Think of test-driven development: you first write the (e.g. NUnit) tests, then forget the tests completely, then write your code. (In other cases you only write code to fit your tests; that's not good. Even better is when two distinct people write the tests and the code.) Either way, both the tests and the code are specified.
To not specify anything and just let it run over some real-world data will most likely lead to problems later. You "test" against a fixed point in time, before some problems have even occurred yet. E.g. if you test an algorithm against a database of credit card numbers, you may have data that was already cleaned up, or you may simply never see some cases (like users entering whitespace between each number block). Sure, you have real-world data, but real-world data ages like hell.
There is nothing wrong with testing against real-world data as a system test or system validation. But at the first test level, you should sit down and think about what your requirement is and what the good and bad cases are, and test against those.
It's like building a house. You can plan first, or just begin to stack stones on top of each other. Maybe you end up with something stable by just putting stones together around furnished rooms (your real-world test scenario), maybe not.
Jeff ... *when you're adding on new functionality like that, I'd be very careful*. Perhaps you'd rather use a whitelist format that no one has used before, with HTML like you said in a previous post ... such as [B]foo[/B] or something similar :)
And when did you get rid of ORANGE? Too much spam now eh?
I was just reading the conversation here. It is a good post. I learned a few things about meta that I didn't know before. Thanks!
If you need to test it to see what the test will produce you probably didn't think it through enough. Where's your foresight man?
This is a very specific case in the life of a programmer, who is facing a particularly tricky problem.
I'll hazard to break-down the problem into two parts:
1) The data the programmer is working with, is user-generated (maybe with minimal input-side clean-up).
2) The programmer is attempting to blindly (meaning, without 'approving each change') manipulate the data.
The second part is what makes this exercise most dangerous. Of course, in this scenario, it makes sense to think up as many cases as possible, PLUS throw as much test data at it as possible.
But not all programmers go through this phase all the time. When you are building small, solid pieces of logic, you can get away with traditional testing (note that I disagree with Jeff's first insinuation that it was enough testing -- most programmers at that point would have cringed and automatically thought up a lot of cases that weren't even programmed for, forget tested for).
But a simple rule of thumb should be: if you are manipulating data in any way, test until your grandchildren complain, and if you are storing the manipulated data, maintain backups of the original data for a long, long time.
Why not save the comments as HTML? E.g. make sure it's parsed to sane [X]HTML before you store it in the db field.
That way, your current comments won't get affected. Depending on what you do at the moment, you sanitize them first (escape the angle brackets and the &).
This will also help if you ever change your format from Markdown to something else (or not) again.
tb: Stack Overflow commenters aren't laypeople.
You probably will want to change your regex from the lazy form `\*(.*?)\*` to a negated character class like `\*([^*]+)\*`.
The problem with your original is that using the lazy operator (?) can run into some really bad backtracking, which can kill performance. Using the negated character class ([^*]) will run fast all the time.
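A quick sketch of the difference, for illustration (Python here, though the point applies to any backtracking engine):

```python
import re

lazy = re.compile(r'\*(.*?)\*')       # lazy quantifier: backtracks
negated = re.compile(r'\*([^*]+)\*')  # negated class: fails fast

# On well-formed input the two behave identically.
s = "make *this* italic and *that* too"
assert [m.group(1) for m in lazy.finditer(s)] == ["this", "that"]
assert [m.group(1) for m in negated.finditer(s)] == ["this", "that"]

# One behavioral difference: . does not match newlines by default,
# while [^*] does -- which is why the Singleline/DOTALL issue mentioned
# in a comment above simply goes away with the negated class.
multiline = "a *two\nline* emphasis"
assert lazy.search(multiline) is None
assert negated.search(multiline) is not None
```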
I'm with Dennis. While I certainly appreciate the value of testing in general, and testing against real data in particular, it's a poor approach to start with the most naïve of all solutions and proceed to iron out the "bugs" with brute-force testing.
Yes, it's often impossible to foresee every edge case, especially with free-form user-submitted content, but a more thoughtful approach would have enumerated at least the most obvious exceptions before writing a single line of code or regex: asterisks as code syntax or math symbols, underscores at the beginning of reserved words like __fastcall, multiple underscores within constants like MAX_BUFFER_SIZE, mismatched begin/end "tags", tags within other tags, and so on.
An even more thoughtful approach would be to examine the long and growing list of edge cases and consider that regular expressions are notoriously inefficient and inaccurate with so many edge cases, not to mention difficult to maintain, and that perhaps a regex is not the best implementation. One might still come to the conclusion that it's good /enough/ (see what I did there?), but the mentality at work here appears to be "It doesn't really matter if there's a better solution because I can just run tests, massage the expression, and keep iterating until the tests pass." That's what Dennis is criticizing, and it's a very valid criticism.
On the other hand, if Jeff had framed the initial "test" as more of a "domain analysis", it would convey a very different and perhaps more positive message, and for all I know, that's how Jeff really went about it. In other words: "I'm not sure yet how best to go about solving this problem, and I'd like to know more about the real-world data that it's going to be operating on. We already have reams of data, so I think I can automate some of this requirements-gathering phase with a dumb regex, and oh look, it turned up a few edge cases I hadn't really considered, like this one here where somebody actually posted a regular expression in the comments. Cool."
This type of thing, I do all the time. Somebody will ask me for a particular report, and instead of going out and immediately architecting a solution, I'll throw together a quick-and-dirty version, maybe just an inefficient and clunky SQL query, simply to verify that the results of the report actually tell the story that people expect it to tell, or even that the results are meaningful at all (often they're not). The difference is, I don't try to chisel away at that mess to implement the real solution; I throw it away, and get to work on a proper design based on the revised requirements.
Perhaps a [spoiler] [/spoiler] tag would be appropriate for this blog where it only shows on mouseover or click? =)