July 31, 2009
The Paper Data Storage Option
As programmers, we regularly work with text encodings. But there's another sort of encoding at work here, one we process so often and so rapidly that it's invisible to us, and we forget about it. I'm talking about visual encoding -- translating the visual glyphs of the alphabet you're reading right now. The alphabet is no different than any other optical machine readable input, except the machines are us.
But how efficient is the alphabet at encoding information on a page? Consider some of the alternatives -- different visual representations of data you could print on a page, or display on a monitor:
5081 punch card: up to 80 alphanumeric characters
Maxicode: up to 93 alphanumeric characters
Data Matrix: up to 2,335 alphanumeric characters
QR Code: up to 4,296 alphanumeric characters
Aztec Code: up to 3,067 alphanumeric characters
High Capacity Color Barcode: varies by number of colors and density; up to 3,500 characters per square inch
Printed page: about 10,000 characters per page
Paper the way we typically use it is criminally inefficient. It has a ton of wasted data storage space. That's where programs like PaperBack come in:
PaperBack is a free application that allows you to back up your precious files on ordinary paper in the form of oversized bitmaps. If you have a good laser printer with the 600 dpi resolution, you can save up to 500,000 bytes of uncompressed data on a single sheet.
You may ask - why? Why, for heaven's sake, do I need to make paper backups, if there are so many alternative possibilities like CD-R's, DVD±R's, memory sticks, flash cards, hard disks, streaming tapes, ZIP drives, network storage, magneto-optical cartridges, and even 8-inch double-sided floppy disks formatted for DEC PDP-11? The answer is simple: you don't. However, by looking on CD or magnetic tape, you are not able to tell whether your data is readable or not. You must insert your medium into the drive, if you even have one, and try to read it.
Paper is different. Do you remember punched cards? For years, cards were the main storage medium for the source code. I agree that 100K+ programs were... inconvenient, but hey, only real programmers dared to write applications that large. And used cards were good as notepads, too. Punched tapes were also common. And even the most weird encodings, like CDC or EBCDIC, were readable by humans (I mean, by real programmers).
Of course, bitmaps produced by PaperBack are also human-readable (with the small help of any decent microscope). I'm joking. What you need is a scanner attached to your PC.
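As a rough sanity check on those figures, here is a back-of-the-envelope calculation in Python. The 500,000-byte and 10,000-character numbers come from the text above; the page size (US Letter, printed edge to edge with no margins) is purely an assumption for illustration, so treat the result as an order-of-magnitude estimate rather than a statement about how PaperBack actually lays out its bitmaps.

```python
# Back-of-the-envelope density check.
# Assumption: US Letter paper, printed edge to edge (no margins).
DPI = 600
PAGE_WIDTH_IN, PAGE_HEIGHT_IN = 8.5, 11.0

# Every addressable printer dot on the page.
dots_per_page = int(PAGE_WIDTH_IN * DPI) * int(PAGE_HEIGHT_IN * DPI)

# Figures quoted in the text.
paperback_bytes = 500_000
printed_chars = 10_000  # one byte per character in plain ASCII

dots_per_bit = dots_per_page / (paperback_bytes * 8)
gain_over_text = paperback_bytes / printed_chars

print(f"addressable dots per page: {dots_per_page:,}")
print(f"dots spent per stored bit: {dots_per_bit:.1f}")
print(f"gain over printed text:    {gain_over_text:.0f}x")
```

Under these assumptions, each stored bit gets several printer dots to itself, which leaves room for the framing and redundancy a scanner needs, and the page still holds far more than the same sheet filled with text.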
PaperBack, like many of the other visual encodings listed above, includes provisions for:
- compression -- to increase the amount of data stored in a given area.
- redundancy -- in case part of the image becomes damaged or is otherwise unreadable.
- encryption -- to prevent the image from being readable by anyone except the intended recipient.
Sure, it's still paper, but the digital "alphabet" you're putting on that paper is a far more sophisticated way to store the underlying data than traditional ASCII text.
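To make the compression and redundancy ideas concrete, here is a minimal Python sketch. It is not PaperBack's actual format: the block size, the 4-byte length prefix, and the single XOR parity block are all assumptions chosen for illustration, and encryption is left out because the standard library has no cipher to demonstrate it with. The point is only that a little extra data lets you rebuild a block the scanner could not read.

```python
import zlib
from functools import reduce
from typing import List, Optional

BLOCK_SIZE = 64  # hypothetical block size, chosen only for illustration


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes) -> List[bytes]:
    """Compress, split into fixed-size blocks, and append one XOR parity block."""
    packed = zlib.compress(data)
    # Prefix the compressed length so the padding can be stripped on decode.
    framed = len(packed).to_bytes(4, "big") + packed
    framed += b"\x00" * (-len(framed) % BLOCK_SIZE)
    blocks = [framed[i:i + BLOCK_SIZE] for i in range(0, len(framed), BLOCK_SIZE)]
    parity = reduce(xor_bytes, blocks)
    return blocks + [parity]


def decode(blocks: List[Optional[bytes]]) -> bytes:
    """Rebuild the data even if any single block (marked None) is unreadable."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("one parity block can only repair one lost block")
    if missing:
        # XOR of all surviving blocks (parity included) equals the lost block.
        survivors = [block for block in blocks if block is not None]
        blocks = list(blocks)
        blocks[missing[0]] = reduce(xor_bytes, survivors)
    framed = b"".join(blocks[:-1])  # drop the parity block
    length = int.from_bytes(framed[:4], "big")
    return zlib.decompress(framed[4:4 + length])


if __name__ == "__main__":
    original = bytes((i * 37 + i // 7) % 256 for i in range(2000))
    printed = encode(original)
    printed[1] = None  # simulate a coffee stain over one block
    assert decode(printed) == original
```

Real barcode formats use stronger error correction than a single parity block, which can only recover one loss, but the trade-off is the same: spend some of the page on redundancy to survive damage.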
This may all seem a bit fanciful, since the alphabet is about all us poor human machines can reasonably deal with, at least without the assistance of a computer and scanner. But there is at least one legitimate use for this stuff: the trusted paper key. There's even software for this purpose, PaperKey:
The goal with paper is not secure storage. There are countless ways to store something securely. A paper backup also isn't a replacement for the usual machine readable (tape, CD-R, DVD-R, etc) backups, but rather as an if-all-else-fails method of restoring a key. Most of the storage media in use today do not have particularly good long-term (measured in years to decades) retention of data. If and when the CD-R and/or tape cassette and/or USB key and/or hard drive the secret key is stored on becomes unusable, the paper copy can be used to restore the secret key.
For paper, on the other hand, to claim it will last for 100 years is not even vaguely impressive. High-quality paper with good ink regularly lasts many hundreds of years even under less than optimal conditions.
Another bonus is that ink on paper is readable by humans. Not all backup methods will be readable 50 years later, so even if you have the backup, you can't easily buy a drive to read it. I doubt this will happen anytime soon with CD-R as there are just so many of them out there, but the storage industry is littered with old now-dead ways of storing data.
Computer encoding formats and data storage schemes come and go. This is why so much archival material survives best in the simplest possible formats, like unadorned ASCII. Depending on what your goals are, a combination of simple digital encoding and the good old boring, reliable, really really old school technology of physical paper can still make sense.
July 29, 2009
Coding Horror: Movable Type Since 2004
When I started this blog, way back in the dark ages of 2004, the best of the options I had was Movable Type.
A Perl and MySQL based blogging platform may seem like an odd choice for a Windows-centric developer like me, but I felt it was the best of the available blog solutions at the time, and clearly ahead of the .NET blogging solutions.
Sure, I have areas of expertise that I like to stick to, but my attitude has always been to put religion aside and use what works, regardless of language or platform. That's much more of a reality today than it was five years ago. Today, we have embarrassing amounts of CPU power and memory in our servers, and a plethora of good virtualization solutions. Spinning up a Linux virtual machine to solve some problem is no big deal, and we do it every day on Stack Overflow.
In retrospect, my choice of Movable Type was a fortunate one. Although I also use and appreciate WordPress, it's a bit of a CPU hog. Given the viral highs and lows of my blogging career, there's no way this modest little server could have survived the onslaught of growth with WordPress. It would have been inexorably crushed under the weight of all those pageviews.
What's Movable Type's performance secret? For the longest time -- almost 5 years -- I used the version I started with, 2.66. That version of Movable Type writes each new blog entry out to disk as a single, static HTML file. In fact, every blog entry you see here is a physical HTML file, served up by IIS just like it would serve up any other HTML file sitting in a folder. It's lightning fast, and serving up hundreds of thousands of pageviews is no sweat. The one dynamic feature of the page, comments, is handled via a postback CGI which writes the page back to disk as each new comment is added. (This is also the source of the occasional comment disk write collision, when two commenters happen to leave a comment at the same time.) Yes, it's a little primitive, but it's also very much in the spirit of KISS: why not do the simplest possible thing that could work?
This static publishing mode precludes glitzy dynamic per-page widgets, but I am a minimalist who likes his pages austere. That restriction suits me fine. The other downside is that a site-wide change requires republishing hundreds or thousands of blog entries. Over time, that can get painful. Modern versions of Movable Type offer both static and dynamic publishing modes, which can give you the best of both worlds.
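The static publishing approach described above can be sketched in a few lines of Python. This is not Movable Type's code (which is Perl); the function names, the page template, and the in-memory comment list are all hypothetical. It only shows the shape of the idea: render the whole page once, write it to disk, and rewrite the file whenever a comment arrives.

```python
import html
import tempfile
from pathlib import Path
from typing import List


def render_entry(title: str, body_html: str, comments: List[str]) -> str:
    """Render one blog entry, comments included, as a complete HTML page."""
    items = "".join(f"<li>{html.escape(c)}</li>" for c in comments)
    safe_title = html.escape(title)
    return (
        f"<html><head><title>{safe_title}</title></head><body>"
        f"<h1>{safe_title}</h1><div>{body_html}</div>"
        f"<ol>{items}</ol></body></html>"
    )


def publish_entry(out_dir: Path, slug: str, title: str,
                  body_html: str, comments: List[str]) -> Path:
    """Write the page to disk; the web server then serves it as a plain file."""
    path = out_dir / f"{slug}.html"
    path.write_text(render_entry(title, body_html, comments), encoding="utf-8")
    return path


def add_comment(out_dir: Path, slug: str, title: str, body_html: str,
                comments: List[str], new_comment: str) -> Path:
    """The only 'dynamic' step: record the comment, then rewrite the whole page."""
    comments.append(new_comment)
    return publish_entry(out_dir, slug, title, body_html, comments)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        comments: List[str] = []
        page = publish_entry(Path(tmp), "paper-storage", "The Paper Data Storage Option",
                             "<p>Entry text here.</p>", comments)
        add_comment(Path(tmp), "paper-storage", "The Paper Data Storage Option",
                    "<p>Entry text here.</p>", comments, "First!")
        print(page.read_text(encoding="utf-8"))
```

Note that nothing in this sketch guards against two comments arriving at once, which is exactly the write collision mentioned earlier; a real implementation would need a lock or an atomic rename around the rewrite.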
Movable Type was created by Six Apart. Over the last few years, I've had the opportunity to meet Anil Dash, who is not only the chief evangelist for and first employee of Six Apart, but also an old-school blogger from way back in 1999. This is a guy who has been through the intertubes a time or two. That's why I sought out Anil's advice when we were struggling to come up with a decent name for this crazy website concept Joel Spolsky and I were working on -- and it was his excellent advice on naming that eventually guided us to the name Stack Overflow.
Anil isn't just a brilliant blogger and community evangelist, he's quite influential in his own humble way. And despite his well earned status as a lion of the Web 1.0 blogging era, he's also willing to go far, far out of his way to help a fellow blogger. Anil personally helped me drag Coding Horror from the dark ages of 2004-era Movable Type 2.66 to today's modern Movable Type 4.2x. And by that I mean he logged in himself and did the grunt work to make it happen, including following up with me personally and going through at least two rounds of my crazy demands to make everything as primitive and featureless as I need it to be.
In short, Anil's a mensch.
So, if you're considering a blogging platform, I can vouch for not only the Movable Type software, but the Six Apart team, and the community around it. In all honesty, blogging changed my life. I'm not sure that's directly attributable to me choosing Movable Type, exactly, but I can give it the highest praise I give any software I've used:
It Just Works.
July 26, 2009
Windows 7: The Best Vista Service Pack Ever
While I haven't been unhappy with Windows Vista, it had a lot of rough edges:
This is why the screenshot of the Windows 7 Calculator, although seemingly trivial, is so exciting to me. It's evidence that Microsoft is going to pay attention to the visible parts of the operating system this time around. I'm a fan of Vista, despite all the nerd rage on the topic, but I'll be the first to admit that Vista had all the polish of a particularly dull rock. Let's just say the overall user experience was.. uninspiring. This led many people to shrug, sigh "why bother?", and stick with crusty old XP.
Vista was like a solid B student who shows up at your doorstep reeking of body odor and dressed in shabby clothing from the local thrift shop. There's something decent at the core, but it's a real challenge to get past the obvious surface deficiencies.
Thus, I've been following the development of Windows 7 with cautious optimism. It's important to me not because I am an operating system fanboy, but mostly because I want the world to get the hell off Windows XP. A world where people regularly use 9-year-old operating systems is not a healthy computing ecosystem. Nobody is forcing anyone to use Windows, of course, but given the fundamental inertia in most people's computing choices, the lack of a compelling Windows upgrade path is a dangerous thing.
Now that Windows 7 has reached its "release to manufacturing" milestone, I had the opportunity to install it for myself and see.
Within 5 minutes of installation it was immediately obvious to me -- Windows 7 is the best Vista Service Pack ever!
The core of the operating system isn't that different, but the experience is absolutely what Vista should have been on day one. Microsoft took that B student, gave him a bath and a makeover, and even improved his grades ever so slightly.
It sounds like a subtle thing, but it's not. Sit down and use Windows 7 for even a few minutes and you'll find an operating system that is faster, cleaner looking, and filled with lots of little useful, thoughtful touches utterly lacking in Vista. Where Vista was half-implemented and often clunky, Windows 7 is competent bordering on pleasant. I won't bore you with all the details, as Windows 7 has been getting lots of positive press from all corners of the web. There's no need for me to add my voice to the chorus. But suffice it to say that Windows 7 finally offers a compelling upgrade path from Windows XP. So from my perspective, mission accomplished. Three years late, but hey, who's counting.
(Note that this is not an invitation to rekindle the eternal OS flame war, as I'm much more interested in the cool stuff you're creating than what OS you use to create it with. I'm sorry, but screwdrivers just aren't that sexy to me.)
I normally do clean installs for operating system upgrades, but I've been busy recently, and I don't have any new PC hardware builds scheduled. If you're already on Vista, the upgrade path is perhaps more compelling than it otherwise would be. All the breaking fundamental changes were in Vista, so if you've made it over the Vista hump, then an in-place Windows 7 upgrade is relatively painless -- or at least, it has been for me on the two machines I've tried so far.
I think Windows 7 works well as a de-facto Vista service pack. I guess that's not surprising if you compare the version numbers.
C:\Users\Jeff>ver
Microsoft Windows [Version 6.0.6002]
C:\Users\Jeff>ver
Microsoft Windows [Version 6.1.7600]
Here's to exactly 0.1.1598 worth of improvement for the Windows ecosystem. Now can we please get the hell off Windows XP already?
July 21, 2009
Nobody Hates Software More Than Software Developers
A few months ago we bought a new digital camera, all the better to take pictures of our new spawned process. My wife, who was in charge of this purchase, dutifully unboxed the camera, installed the batteries, and began testing it out for the first time. Like so many electronic gadgets, it came bundled with a CD of software. So she innocently ejected the DVD tray, and dropped the CD in.
I happened to notice out of the corner of my eye that this was happening. At which point, I -- now, try to imagine this in exaggerated slow motion, for full effect -- screamed "noooooooooooo", and frantically launched myself across the room in a desperate attempt to keep that CD from launching and installing its payload of software. It worked, but I nearly took out a cat in the process.
There's nothing wrong with the software that comes bundled with a digital camera. Or is there?
- It's probably unnecessary. Any modern operating system (and even Windows XP!) can see and automatically download pictures from a new digital camera. No extra software needed. But in a questionable attempt to add "value" and distinguish themselves from their many digital camera competitors, some executive at the camera company came up with a harebrained scheme to include software with a bunch of wacky, unique features that nobody else has.
- Hardware companies don't generally do software well. Digital camera companies excel at building digital camera hardware. Software, if it exists at all, is an afterthought, a side effect, a checkbox on some marketing weasel's clipboard.
- Software of unknown provenance is likely written by bad programmers. All other things being equal, the odds that new, random bit of software you're about to install will be pleasant, useful, and stress free are ... uh, low.
One of the (many) unfortunate side effects of choosing a career in software development is that, over time, you learn to hate software. I mean really hate it. With a passion. Take the angriest user you've ever met, multiply that by a thousand, and you still haven't come close to how we programmers feel about software. Nobody hates software more than software developers. Even now, writing about the stuff is making me physically angry.
Isn't that an odd attitude coming from people whose job it is to write software? How can we hate what we get paid to create every day?
David Parnas explained in an interview:
Q: What is the most often-overlooked risk in software engineering?
A: Incompetent programmers. There are estimates that the number of programmers needed in the U.S. exceeds 200,000. This is entirely misleading. It is not a quantity problem; we have a quality problem. One bad programmer can easily create two new jobs a year. Hiring more bad programmers will just increase our perceived need for them. If we had more good programmers, and could easily identify them, we would need fewer, not more.
How do I know, incontrovertibly, beyond the shadow of a doubt, that the world is full of incompetent programmers? Because I'm one of them!
We work at the sausage factory, so we know how this stuff is made. And it is not pretty. Most software is created by bad programmers like us (or worse!), which means that by definition, most software sucks. Let's refer to Scott Berkun's Why Software Sucks to nail down the definition:
When people say "this sucks" they mean one or more of the following:
- This doesn't do what I need
- I can't figure out how to do what I need
- This is unnecessarily frustrating and complex
- This breaks all the time
- It's so ugly I want to vomit just so I have something prettier to look at
- It doesn't map to my understanding of the universe
- I'm thinking about the tool, instead of my work
How many of those do you think would be true of the software on that CD bundled with the digital camera? I'm guessing all of them. That's why the best choice of software is often no software -- and barring that, as little software as you can possibly get away with, and even then, only from the most reputable and reliable sources.
I don't look forward to installing new software. On the contrary, I dread it.
Let me share a recurring nightmare I have with you. In this dream, I'm sitting down in front of a computer which boots up, running an operating system I've written. I then launch a web browser I've created from scratch, all by myself, and navigate to a website I've constructed. I click on the first link and the whole thing bluescreens. And the bluescreen itself bluescreens and begins to fold in on itself, collapsing into a massive explosion that destroys an entire city block.
That's the good version of the dream. In the other one, there's just … screaming. And darkness.
In short, I hate software -- most of all and especially my own -- because I know how hard it is to get it right. It may sound strange, but it's a natural and healthy attitude for a software developer. It's a bond, a rite of passage that you'll find all competent programmers share.
In fact, I think you can tell a competent software developer from an incompetent one with a single interview question:
What's the worst code you've seen recently?
If their answer isn't immediately and without any hesitation these two words:
My own.
Then you should end the interview immediately. Sorry, pal. You don't hate software enough yet. Maybe in a few more years. If you keep at it.
July 18, 2009
Software Engineering: Dead?
I was utterly floored when I read this new IEEE article by Tom DeMarco (pdf). See if you can tell why.
My early metrics book, Controlling Software Projects: Management, Measurement, and Estimates [1986], played a role in the way many budding software engineers quantified work and planned their projects. In my reflective mood, I'm wondering, was its advice correct at the time, is it still relevant, and do I still believe that metrics are a must for any successful software development effort? My answers are no, no, and no.
I'm gradually coming to the conclusion that software engineering is an idea whose time has come and gone.
Software development is and always will be somewhat experimental. The actual software construction isn't necessarily experimental, but its conception is. And this is where our focus ought to be. It's where our focus always ought to have been.
If your head just exploded, don't be alarmed. Mine did too. To somewhat reduce the migraine headache you might now be experiencing from reading the above summary, I highly recommend scanning the entire two-page article (pdf).
Tom DeMarco is one of the most deeply respected authority figures in the software industry, having coauthored the brilliant and seminal Peopleware as well as many other near-classic software project management books like Waltzing With Bears. For a guy of Tom's caliber, experience, and influence to come out and just flat out say that Software Engineering is Dead …
… well, as Keanu Reeves once said, whoa.
That's kind of a big deal. It's scary.
And yet, it's also a release. It's as if a crushing weight has been lifted from my chest. I can publicly acknowledge what I've slowly, gradually realized over the last 5 to 10 years of my career as a software developer: what we do is craftsmanship, not engineering. And I can say this proudly, unashamedly, with nary a shred of self-doubt.
I think Joel Spolsky, my business partner, recently had a similar epiphany. He wrote about it in How Hard Could It Be?: The Unproven Path:
I have pretty deeply held ideas about how to develop software, but I mostly kept them to myself. That turned out to be a good thing, because as the organization took shape, nearly all these principles were abandoned.
As for what this all means, I'm still trying to figure that out. I abandoned seven long-held principles about business and software engineering, and nothing terrible happened. Have I been too cautious in the past? Perhaps I was willing to be a little reckless because this was just a side project for me and not my main business. The experience is certainly a useful reminder that it's OK to throw caution to the wind when you're building something completely new and have no idea where it's going to take you.
Yes, I could add a lot of defensive software engineering caveats here about the particulars of the software project you're working on: its type (mission critical, of course), its size (Google scale, naturally), the audience (millions of daily users, obviously), and so forth.
But I'm not going to do that.
What DeMarco seems to be saying -- or at least, what I am definitely saying -- is that control is ultimately illusory on software development projects. If you want to move your project forward, the only reliable way to do that is to cultivate a deep sense of software craftsmanship and professionalism around it.
The guys and gals who show up every day eager to hone their craft, who are passionate about building stuff that matters to them, and perhaps in some small way, to the rest of the world -- those are the people and projects that will ultimately succeed.
Everything else is just noise.
July 13, 2009
Meta Is Murder
Are you familiar with the term "meta"? It permeates many concepts in programming, from metadata to the <meta> tag. But since we're on a blog, let's use blogging to explain what meta means. If you've read this blog for any length of time you've probably heard me rant about the evil of blogging about blogging, a.k.a. meta-blogging. As I said in Thirteen Blog Cliches:
I find meta-blogging -- blogging about blogging -- incredibly boring. I said as much in a recent interview on a site that's all about blogging (hence the title, Daily Blog Tips). I wasn't trying to offend or shock; I was just being honest. Sites that contain nothing but tips on how to blog more effectively bore me to tears.
If you accept the premise that most of your readers are not bloggers, then it's highly likely they won't be amused, entertained, or informed by a continual stream of blog entries on the art of blogging. Even if they're filled with extra bloggy goodness.
Meta-blogging is like masturbating. Everyone does it, and there's nothing wrong with it. But writers who regularly get out a little to explore other topics will be healthier, happier, and ultimately more interesting to be around -- regardless of audience.
Triple-meta alert! That blog entry was me blogging about blogging about blogging. See? Painful. I told you.
Generally speaking, I am not a fan of the meta. It's seductive in a way that is subtly but deeply dangerous. It's far easier to introspect and write about the process of, say .. blogging .. than it is to think up, research, and write about an interesting new topic on your blog. Meta-work becomes a reflex, a habit, an addiction, and ultimately a replacement for real productive work. It's something I think everyone should watch out for, whatever walk of life or career you happen to have. In fact, I've come up with a zingy little catch phrase to help people remind themselves, and their coworkers, how toxic this stuff can be -- meta is murder.
Yes, you read that right. Murder. I mean it. If enough productive work is replaced by navelgazing meta-work, then people will be killed. Or at least, the community will be.
Joel Spolsky had a great example of how meta-discussion can kill community in our latest podcast.
Let's say that you become a podcaster, so you get really interested in podcasting gear. You're going to buy some mixers, and want to know what kind of headphones to use, what kinds of microphones, when should I do the A/D conversions, all that kind of stuff.
So you find this awesome podcasting gear website. And you go on there, and the first subject of conversation is who's going to be elected to the podcasting gear website board of directors. And the second subject of conversation is whether the election that was done last year was orthodox, or was it slightly ... was there something suspicious about that whole thing. And you find a whole bunch of people arguing about that. And then you find a conversation about whether all the people who came in last year from South America and don't speak very good English should be allowed to hang around or should maybe be read-only users for the first six months.
That's all you find there, and you want to talk about mixers and mics. That's why you came to this site!
But they're bored talking about mixers and mics -- they've already had the full mixers and mics conversation all the way to the end, to its logical extreme. They all have, now, the perfect podcasting setup. Except for there's this one minor little thing about whether you should use Monster Cables that people still argue about.
So all they're talking about on this so-called "podcasting gear" website is the podcasting gear website itself.
If you don't control it, meta-discussion, like weeds run amok in a garden, will choke out a substantial part of the normal, natural growth of a healthy community.
The danger and peril of meta has been known for years. We had Josh Millard, a MetaFilter moderator, as a guest on the podcast last year. He described how quickly MetaFilter realized that meta-discussion, if not controlled, can destroy a community:
Millard: Matt set up MetaTalk sometime like 8 months after he started [MetaFilter], right about the beginning of 2000, because people were talking about MetaFilter on the front page. It's natural enough. People would say, hey what's with this, hey look at the post, hey this guy's a jerk. So he started up MetaTalk and directed stuff that was metacommentary to that part of the site. You could delete something and say, hey take it over there. If people wanted to have an extended argument that was derailing a thread, they could do it there.
A lot of people cite MetaTalk as a reason that MetaFilter works. If you talk to a regular from the site they'll tell you MetaTalk is key to the success of the site because it's a sort of release valve. Talk pages on Wikipedia are a similar thing. I had the same experience as you the first time I checked those out -- it's not necessarily comprehensible to the casual user what is going on there. But for the people who are regulars, the people who develop a certain amount of passionate attachment to the sites, or really, really need to make their voice heard out of day one beyond just normal participation, you have this safe place you can let people ... let their freak flag fly, as it were, without damaging the core function of the site. You don't have big messes on the front page.
So there's a pretty strong culture of regulars who hang out on MetaTalk. Insofar as you have the big contributors and the serious regulars at any given site that make up the core of the community, there's a strong correlation between those people and the people who actually spend time on MetaTalk dealing with policy stuff and talking about user issues.
Atwood: Right. I totally get that. This is one of the things about designing social software -- you don't really understand it until you've lived through it. For the longest time I couldn't understand why people couldn't respect the rule we had to not discuss this meta stuff on the site itself. I totally get this now.
We've dealt with our meta problem on Stack Overflow, finally. OK, I had to be dragged kicking and screaming to finally do what I should have done months ago, but what else is new?
Anyway, my point is that meta isn't just a social software problem. Meta is a social problem, period. It's applicable to everything you do in life.
Software developers are known for their introspection, and a certain amount of meta is healthy. It qualifies as sharpening the saw -- mindfulness of what you're doing, and how it can be improved. But it's amazing how rapidly that can devolve into a crutch, a sort of methadone for Getting Things Done™.
So sure, get meta when it makes sense to. But do be aware of what percentage of the time you're spending on meta. And consider: how is progress made in the world? By sitting around and debating the process of how things are done ad nauseam? Or by, y'know … doing stuff?
Allocate your time accordingly.
July 9, 2009
How Not to Advertise on the Internet
Games that run in your web browser are all the rage, and understandably so. Why not build your game for the largest audience in the world, using freely available technology, and pay zero licensing fees? One such game is Evony, formerly known as Civony -- a browser-based clone of the game Civilization with a buy-in mechanism.
There are also plentiful opportunities to 'pay money' now. In the end, Civony is still a business. And to be honest, it's probably better to give the option for some elite folks to finance the game for the masses than to make everyone pay a subscription or watch in-game ads. In addition to the old $0.30 per line world chat, you can spend money to speed up resource gathering, boost stats, and buy in-game artifacts. I'm sure there are other ways to pay money that I haven't discovered yet. But whenever you see a green plus-sign (+), you know the option exists to pay money for a perk.
The game is ostensibly free, but supported by a tiny fraction of players making cash payments for optional items (sometimes referred to as "freemium"). Thus, the player base needs to be quite large for the business of running the game to be sustainable, and the game's creators regularly purchase internet ad space to promote their game. The most interesting thing about Evony isn't the game, per se, but the game's advertising. Here's one of the early ads.
Totally reasonable advertisement. Gets the idea across that this is some sort of game set in medieval times, and emphasizes the free angle.
Apparently that ad didn't perform up to expectations at Evony world HQ, because the ads got progressively ... well, take a look for yourself. These are presented in chronological order of appearance on the internet.
(if this lady looks familiar, there's a reason.)
To be clear, these are real ads that were served on the internet. This is not a parody. Just to prove it, here's a screenshot of the last ad in context at The Elder Scrolls Nexus.
I've talked about advertising responsibly in the past. This is about as far in the opposite direction as I could possibly imagine. It's yet another way, sadly, the brilliant satire Idiocracy turned out to be right on the nose.
The dystopian future of Idiocracy predicted the reduction of advertising to the inevitable lowest common denominator of all, with Starbucks Exotic Coffee for Men, H.R. Block "Adult" Tax Return (home of the gentleman's rebate), and Pollo Loco chicken advertising a Bucket of Wings with "full release".
Evony, thanks for showing us what it means to take advertising on the internet to the absolute rock bottom ... then dig a sub-basement under that, and keep on digging until you reach the white-hot molten core of the Earth. I've always wondered what that would be like. I guess now I know.
July 7, 2009
Testing With "The Force"
Markdown was one of the humane markup languages that we evaluated and adopted for Stack Overflow. I've been pretty happy with it, overall. So much so that I wanted to implement a tiny, lightweight subset of Markdown for comments as well.
I settled on these three commonly used elements:
*italic* or _italic_
**bold** or __bold__
`code`
I loves me some regular expressions and this is exactly the stuff regex was born to do! It doesn't look very tough. So I dusted off my copy of RegexBuddy and began.
I typed some test data in the test window, and whipped up a little regex in no time at all. This isn't my first time at the disco.
Bam! Yes! Done and done! By gum, I must be a genius programmer!
Despite my obvious genius, I began to have some small, nagging doubts. Is the test phrase...
I would like this to be *italic* please.
... really enough testing?
Sure it is! I can feel in my bones that this thing freakin' works! It's almost like I'm being pulled toward shipping this code by some inexorable, dark, testing ... force. It's so seductively easy!
But wait. I have this whole database of real world comments that people have entered on Stack Overflow. Shouldn't I perhaps try my awesome regular expression on that corpus of data to see what happens? Oh, fine. If we must. Just to humor you, nagging doubt. Let's run a query and see.
select Text from PostComments where dbo.RegexIsMatch(Text, '\*(.*?)\*') = 1
Which produced this list of matches, among others:
Interesting fact about math: x * 7 == x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2. Integer addition is usually pretty cheap.
Thanks. What I needed was to turn on Singleline mode too, and use .*? instead of .*.
yeah, see my edit - change select * to select RESULT.* one row - are sure you have more than one row item with the same InstanceGUID?
Not your main problem, but you are mix and matching wchar_t and TCHAR. mbstowcs() converts from char * to wchar_t *.
aawwwww.... Brainf**k is not valid. :/
Thank goodness I listened to my midichlorians and let the light side of the testing force prevail here!
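If you don't have RegexBuddy or a SQL regex function handy, the same sanity check can be sketched in a few lines of Python. Treat this as an illustration only: the `comments` list is a hand-copied sample of the matches above rather than the real PostComments table, and Python's `re` module is standing in for the .NET regex engine.

```python
import re

# The first-draft pattern from the query above: anything between two asterisks.
NAIVE_ITALIC = re.compile(r"\*(.*?)\*")

# The original happy-path phrase, followed by samples of the real comments.
comments = [
    "I would like this to be *italic* please.",
    "Interesting fact about math: x * 7 == x + (x * 2) + (x * 4)",
    "change select * to select RESULT.*",
    "mbstowcs() converts from char * to wchar_t *.",
    "aawwwww.... Brainf**k is not valid. :/",
]

# Print what the pattern would have turned into italics for each comment.
for text in comments:
    captured = NAIVE_ITALIC.findall(text)
    print(f"{captured!r:30} <- {text}")
```

Every one of these comments matches, but only the first capture is something a human would call italic text. The rest are slices of arithmetic, SQL, and C pointer declarations.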
So how do we fix this regex? We use the light side of the force -- brute force, that is, against a ton of test cases! My job here is relatively easy because I have over 20,000 test cases sitting in a database. You may not have that luxury. Maybe you'll need to go out and find a bunch of test data on the internet somewhere. Or write a function that generates random strings to feed to the routine, also known as fuzz testing.
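For the no-database case, here's a rough sketch of what that random-string generator might look like in Python. Every name in it (`random_comment`, `fuzz`, `empty_or_space_padded`, the choice of alphabet) is hypothetical, and the "suspicious" rule is just one plausible guess at what a bad italic match looks like, not a real oracle.

```python
import random
import re

# Pattern under test; swap in a newer draft to compare.
ITALIC = re.compile(r"\*(.*?)\*")

def random_comment(rng, length=40, alphabet="ab *_`.(),\n"):
    """Build one random string, biased toward the characters that matter here."""
    return "".join(rng.choice(alphabet) for _ in range(length))

def fuzz(pattern, is_suspicious, runs=10_000, seed=0):
    """Feed random strings to the pattern and collect matches that look wrong."""
    rng = random.Random(seed)  # fixed seed so any failure can be replayed
    failures = []
    for _ in range(runs):
        text = random_comment(rng)
        for match in pattern.finditer(text):
            if is_suspicious(match):
                failures.append((text, match.group(0)))
    return failures

def empty_or_space_padded(match):
    """Assumed rule: italic text should be non-empty and not padded with whitespace."""
    inner = match.group(1)
    return inner == "" or inner != inner.strip()

bad = fuzz(ITALIC, empty_or_space_padded, runs=1000)
print(len(bad), "suspicious matches; first few:", bad[:3])
```

Random input is no substitute for real comments, since it will never produce anything as creative as an actual user does, but it is cheap and it finds the degenerate cases quickly.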
I wanted to leave the rest of this regular expression as an exercise for the reader, as I'm a sick guy who finds that sort of thing entertaining. If you don't -- well, what the heck is wrong with you, man? But I digress. I've been criticized for not providing, you know, "the answer" in my blog posts. Let's walk through some improvements to our italic regex pattern.
First, let's make sure we have at least one non-whitespace character inside the asterisks. And more than one character in total so we don't match the ** case. We'll use positive lookahead and lookbehind to do that.
\*(?=\S)(.+?)(?<=\S)\*
That helps a lot, but we can test against our data to discover some other problems. We get into trouble when there are unexpected characters in front of or behind the asterisks, like, say, p*q*r. So let's specify that we only want certain characters outside the asterisks.
(?<=^|[\s,(])\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])
Run this third version against the data corpus, and wow, that's starting to look pretty darn good! There are undoubtedly some edge conditions, particularly since we're unlucky enough to be talking about code in a lot of our comments, which has wacky asterisk use.
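To see the three drafts side by side, here is a small Python comparison harness. It is a sketch under a few assumptions: the `CASES` table is my own tiny stand-in for the real comment corpus, and the third pattern is adapted for Python, whose `re` module only accepts fixed-width lookbehind. Start and end of string are therefore written as explicit anchors outside the character classes, which is my reading of what the `^` and `$` were meant to do.

```python
import re

V1 = re.compile(r"\*(.*?)\*")
V2 = re.compile(r"\*(?=\S)(.+?)(?<=\S)\*")
# Python adaptation of the third draft: anchors sit outside the character classes.
V3 = re.compile(r"(?:^|(?<=[\s,(]))\*(?=\S)(.+?)(?<=\S)\*(?=$|[\s,.?!])")

# Input text -> the italic spans a human would expect (hypothetical test set).
CASES = {
    "I would like this to be *italic* please.": ["italic"],
    "*italic* at the very start": ["italic"],
    "x * 7 == x + (x * 2) + (x * 4)": [],
    "mbstowcs() converts from char * to wchar_t *.": [],
    "aawwwww.... Brainf**k is not valid. :/": [],
    "p*q*r": [],
}

def failures(pattern):
    """Return every input where the pattern disagrees with the expected spans."""
    return [text for text, want in CASES.items() if pattern.findall(text) != want]

for name, pattern in [("v1", V1), ("v2", V2), ("v3", V3)]:
    wrong = failures(pattern)
    print(f"{name}: wrong on {len(wrong)} of {len(CASES)} cases", wrong)
```

On this handful of cases the first draft gets four wrong, the second only trips on p*q*r, and the third passes them all. Six cases is exactly the dark-side amount of testing, of course, so the table is only useful as a regression list that grows every time the real corpus turns up a new surprise.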
This regex doesn't have to be (and probably cannot be, given the huge possible number of human inputs) perfect, but running it against a large set of input test data gives me reasonable confidence that I'm not totally screwing up.
So by all means, test your code with the force -- brute force! It's good stuff! Just be careful not to get sloppy, and let the dark side of the testing force prevail. If you think one or two simple test cases cover it, that's taking the easy (and most likely, buggy and incorrect) way out.
July 6, 2009
Code: It's Trivial
Remember that Stack Overflow thing we've been working on? Some commenters on a recent Hacker News article questioned the pricing of Stack Exchange -- essentially, a hosted Stack Overflow:
Seems really pricey for a relatively simple software like this. Someone write an open source alternative? it looks like something that can be thrown together in a weekend.
Ah, yes, the stereotypical programmer response to most projects: it's trivial! I could write that in a week!*
It's even easier than that. Open source alternatives to Stack Overflow already exist, so you've got a head start. Gentlemen, start your compilers! Er, I mean, interpreters!
No, I don't take this claim seriously. Not enough to write a response. And fortunately for me, now I don't need to, because Benjamin Pollack -- one of the few people outside our core team who has access to the Stack Overflow source code -- already wrote a response. Even if I had written a response, I doubt it would have been half as well written as Benjamin's.
Developers think cloning a site like StackOverflow is easy for the same reason that open-source software remains such a horrible pain in the ass to use. When you put a developer in front of StackOverflow, they don't really see StackOverflow. What they actually see is this:
create table QUESTION (ID identity primary key, TITLE varchar(255), BODY text, UPVOTES integer not null default 0, DOWNVOTES integer not null default 0, USER integer references USER(ID));
create table RESPONSE (ID identity primary key, BODY text, UPVOTES integer not null default 0, DOWNVOTES integer not null default 0, QUESTION integer references QUESTION(ID))
If you then tell a developer to replicate StackOverflow, what goes into his head are the above two SQL tables and enough HTML to display them without formatting, and that really is completely doable in a weekend. The smarter ones will realize that they need to implement login and logout, and comments, and that the votes need to be tied to a user, but that's still totally doable in a weekend; it's just a couple more tables in a SQL back-end, and the HTML to show their contents. Use a framework like Django, and you even get basic users and comments for free.
But that's not what StackOverflow is about. Regardless of what your feelings may be on StackOverflow in general, most visitors seem to agree that the user experience is smooth, from start to finish. They feel that they're interacting with a polished product. Even if I didn't know better, I would guess that very little of what actually makes StackOverflow a continuing success has to do with the database schema--and having had a chance to read through StackOverflow's source code, I know how little really does. There is a tremendous amount of spit and polish that goes into making a major website highly usable. A developer, asked how hard something will be to clone, simply does not think about the polish, because the polish is incidental to the implementation.
I have zero doubt that given enough time, open source clones will begin to approximate what we've created with Stack Overflow. It's as inevitable as evolution itself. Well, depending on what time scale you're willing to look at. With a smart, motivated team of closed-source dinosaurs, it is indeed possible to outrun those teeny tiny open-source mammals. For now, anyway. Let's say we're those speedy, clever Velociraptor types of dinosaurs -- those are cool, right?
Despite Benjamin's well-reasoned protests, the source code to Stack Overflow is, in fact, actually, kind of ... well, trivial. Although there is starting to be quite a lot of it, as we've been beating on this stuff for almost a year now. That doesn't mean our source code is good, by any means; as usual, we make crappy software, with bugs. But every day, our tiny little three-person team of speedy-but-doomed Velociraptors starts out with the same goal. Not to write the best Stack Overflow code possible, but to create the best Stack Overflow experience possible. That's our mission: make Stack Overflow better, in some small way, than it was the day before. We don't always succeed, but we try very, very hard not to suck -- and more importantly, we keep plugging away at it, day after day.
Building a better Stack Overflow experience does involve writing code and building cool features. But more often, it's anything but:
- synthesizing cleaner, saner HTML markup
- optimizing our pages for speed and load time efficiency
- simplifying or improving our site layout, CSS, and graphics
- responding to support and feedback emails
- writing a blog post explaining some aspect of the site engine or philosophy
- being customers of our own sites, asking our own programming questions and sysadmin questions
- interacting with the community on our dedicated meta-discussion site to help gauge what we should be working on, and where the rough edges are that need polishing
- electing community moderators and building moderation tools so the community can police and regulate itself as it scales
- producing Creative Commons dumps of our user-contributed questions and answers
- coming up with schemes for responsible advertising so we can all make a living
- producing the Stack Overflow podcast with Joel
- helping set up logistics for the Stack Overflow DevDays conferences
- setting up the next site in the trilogy, and figuring out where we go next
As programmers, as much as we might want to believe that
lots_of_awesome_code = success;
There's nothing particularly magical about the production of source code. In fact, writing code is a tiny proportion of what makes most businesses successful.
Code is meaningless if nobody knows about your product. Code is meaningless if the IRS comes and throws you in jail because you didn't do your taxes. Code is meaningless if you get sued because you didn't bother having a software license created by a lawyer.
Writing code is trivial. And fun. And something I continue to love doing. But if you really want your code to be successful, you'll stop coding long enough to do all that other, even more trivial stuff around the code that's necessary to make it successful.
* Although, to be fair, I really could write Twitter in a week. It's so ridiculously simple! Come on!
July 1, 2009
Oh, You Wanted "Awesome" Edition
We recently upgraded our database server to 48 GB of memory -- because hardware is cheap, and programmers are expensive.
Imagine our surprise, then, when we rebooted the server and saw only 32 GB of memory available in Windows Server 2008. Did we install the memory wrong? No, the BIOS screen reported the full 48 GB of memory. In fact, the system information applet even reports 48 GB of memory:
But there's only 32 GB of usable memory in the system, somehow.
Did you feel that? A great disturbance in the Force, as if 17 billion bytes simultaneously cried out in terror and were suddenly silenced. It's so profoundly sad.
That's when I began to suspect the real culprit: weasels.
No. Not the cute weasels. I'm referring to angry, evil marketing weasels.
That's more like it. Those marketing weasels are vicious.
We belatedly discovered post-upgrade that we are foolishly using Windows Server 2008 Standard edition. Which has been arbitrarily limited to 32 GB of memory. Why? So the marketing weasels can segment the market.
It's sort of like if you were all set to buy that new merino wool sweater, and you thought it was going to cost $70, which is well worth it, and when you got to Banana Republic it was on sale for only $50! Now you have an extra $20 in found money that you would have been perfectly happy to give to the Banana Republicans!
Yipes!
That bothers good capitalists. Gosh darn it, if you're willing to do without it, well, give it to me! I can put it to good use, buying a SUV or condo or Mooney or yacht or one of those other things capitalists buy!
In economist jargon, capitalists want to capture the consumer surplus.
Let's do this. Instead of charging $220, let's ask each of our customers if they are rich or if they are poor. If they say they're rich, we'll charge them $349. If they say they're poor, we'll charge them $220.
Now how much do we make? Back to Excel. Notice the quantities: we're still selling the same 233 copies, but the richest 42 customers, who were all willing to spend $349 or more, are being asked to spend $349. And our profits just went up! from $43K to about $48K! NICE!
Capture me some more of that consumer surplus stuff!
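The arithmetic behind those two profit figures is easy to check. Here's a quick Python sketch, with one loud caveat: the excerpt above never says what each copy costs to sell, so the $35 `UNIT_COST` is an assumption, picked because it makes the quoted $43K and $48K come out right.

```python
UNIT_COST = 35            # ASSUMED per-copy cost; not stated in the excerpt above
COPIES = 233              # total copies sold at the $220 price point
RICH = 42                 # customers willing to pay $349 or more
LOW_PRICE, HIGH_PRICE = 220, 349

def profit(sales):
    """Total profit for a list of (quantity, price) pairs."""
    return sum(qty * (price - UNIT_COST) for qty, price in sales)

# One price for everybody.
flat = profit([(COPIES, LOW_PRICE)])

# Same 233 copies, but the 42 richest customers pay the higher price.
segmented = profit([(RICH, HIGH_PRICE), (COPIES - RICH, LOW_PRICE)])

print(f"flat: ${flat:,}  segmented: ${segmented:,}  extra: ${segmented - flat:,}")
```

Under that assumed cost, the flat price yields $43,105 and the two-tier price yields $48,523. The extra $5,418 is just 42 customers each paying $129 more; the number of copies sold never changes.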
How many versions of Windows Server 2008 are there? I count at least six. They're capturing some serious consumer surplus, over there in Redmond.
- Datacenter Edition
- Enterprise Edition
- Standard Edition
- Foundation
- Web
- HPC
Already, I'm confused. Which one of these versions allows me to use all 48 GB of my server's memory? There are no fewer than six individual "compare" pages to slice and dice all the different features each version contains. Just try to make sense of it all. I dare you. No, I double dog dare you! Oh, and by the way, there's zero pricing information on any of these pages. So open another browser window and factor that into your decision making, too.
I don't mean to single out Microsoft here; lots of companies use this segmented pricing trick. Even Web 2.0 darlings 37signals.
Heck, our very own product segments the market.
37signals just does it ... prettier, that's all. They're still asking you if you're poor or rich, and charging you more if you're rich.
Eric Sink also advocates the same "rich customer, poor customer" software pricing policy:
In an ideal world, the price would be different for every customer. The "perfect" pricing scheme would charge every customer a different amount, extracting from each one the maximum amount they are willing to pay.
- The IT guy at Podunk Lutheran College has no money: Gratis.
- The IT guy at a medium-sized real estate agency has some money: $500.
- The IT guy at a Fortune 100 company has tons of money: $50,000.
You can never make your pricing "perfect," but you can do much better than simply setting one constant price for all situations. By carefully tuning all these details, you can find ways to charge more money from the people who are willing to pay more.
This sort of pricing seems exploitative, but it can also be an act of public good -- remember that the poorest customers are paying less; with a one-size-fits-all pricing policy, they might not be able to afford the product at all. Drug companies often follow the same pricing model when selling life-saving drugs to third-world countries. First-world countries end up subsidizing the massive costs of drug development, but the whole world benefits.
What I object to isn't the money involved, but the mental overhead. The whole thing runs so contrary to the spirit of Don't Make Me Think. Sure, don't make us customers think. Unless you want us to think about how much we'd like to pay you, that is.
And what are we paying for? The privilege of flipping the magic bits in the software that say "I am blah edition!" It's all so ... anticlimactic. All that effort, all that poring over complex feature charts and stressing out about pricing plans, and for what? Just to get the one simple, stupid thing I care about -- using all the memory in my server.
Perhaps these complaints, then, point to one unsung advantage of open source software:
Open source software only comes in one edition: awesome.
The money is irrelevant; the expensive resource here is my brain. If I choose open source, I don't have to think about licensing, feature matrices, or recurring billing. I know, I know, we don't use software that costs money here, but I'd almost be willing to pay for the privilege of not having to think about that stuff ever again.
Now if you'll excuse me, I'm having trouble deciding between Windows 7 Smoky Bacon Edition and Windows 7 Kenny Loggins Edition. Bacon is delicious, but I also love that Footloose song..
