December 28, 2007
An Inalienable Right to Privacy
Privacy has always been a concern on the internet. But as more and more people let it all hang out on the many social networking websites popping up like weeds all over the web, there's much more at risk. Every other week, it seems, I'm reading about some new privacy gaffe. Last month, it was Facebook's Beacon opt-out policy; this week, it's Google Reader sharing private data. The privacy problems just keep piling up as more people tune in and turn on.
Nearly a decade ago, Sun Microsystems CEO Scott McNealy snapped out a warning to the worriers of the Internet Age: "You don't have any privacy. Get over it." McNealy's words look more prescient every year. In 2006, AOL unwittingly divulged the personal lives of 650,000 customers by publishing their search histories as research data. Despite AOL's attempts to anonymize the info, the New York Times quickly outed a 62-year-old lady in Georgia whose searches revealed her dog was wetting the upholstery. The Justice Department has subpoenaed Google, Yahoo!, MSN, and AOL for lists of search queries. More recently, Facebook employees were caught reading the customer logs.
Nothing warms the cockles of a user's heart quite like the tender mercies of your friendly neighborhood CEO. That privacy stuff you're so worried about? Get over it! You might wonder if Mr. McNealy has the same glib attitude towards the privacy of himself and his own family. Only criminals have stuff to hide, right? Here's Bruce Schneier's take on the value of privacy:
Last week, revelation of yet another NSA surveillance effort against the American people has rekindled the privacy debate. Those in favor of these programs have trotted out the same rhetorical question we hear every time privacy advocates oppose ID checks, video cameras, massive databases, data mining, and other wholesale surveillance measures: "If you aren't doing anything wrong, what do you have to hide?"
Some clever answers: "If I'm not doing anything wrong, then you have no cause to watch me." "Because the government gets to define what's wrong, and they keep changing the definition." "Because you might do something wrong with my information." My problem with quips like these -- as right as they are -- is that they accept the premise that privacy is about hiding a wrong. It's not. Privacy is an inherent human right, and a requirement for maintaining the human condition with dignity and respect.
I promote openness and making things public. Not everything, of course; just the good and publicly useful sections you've culled from the repertoire of your life. If you don't consider any part of your life worthy of public consumption in any form, are you really doing anything?
Even as a proponent of selectively exhibiting parts of your life in public, there's a huge part of my life that's private. I didn't realize it, but I've relied on privacy through obscurity until now. My life is so utterly mundane that I can't imagine anyone caring what I do, what I buy, what I read, and who I talk to. I thought privacy was overrated. I certainly never considered privacy a basic human right, on par with life, liberty, and the pursuit of happiness. But it is.
Too many wrongly characterize the debate as "security versus privacy." The real choice is liberty versus control. Tyranny, whether it arises under threat of foreign physical attack or under constant domestic authoritative scrutiny, is still tyranny. Liberty requires security without intrusion, security plus privacy. Widespread police surveillance is the very definition of a police state. And that's why we should champion privacy even when we have nothing to hide.
If power corrupts, then access to a pure, unfettered stream of data on every American corrupts absolutely. The strategy of privacy through obscurity may have worked by default in the hodgepodge, sporadically digital worlds of the 80's and 90's. Not any more. Now that so much of the world is online or stored in a vast database somewhere, all those tiny digital artifacts of who you are and what you do can be woven into a complete tapestry of your life. And you better believe it will be, because it makes some people a lot of money.
So what can we do about it? Is privacy possible in the digital age?
The truth is, fighting to protect privacy is a quixotic venture. Sure, there are any number of technologies, techniques and work-arounds you can employ, all in the effort to protect your privacy. But such a quest is like trying to dig a hole in the middle of a fast flowing river. The rich and powerful gain some amount of privacy only because they can afford to gird their personal lives with a kind of digital body armor.
Garfinkel says we need to rethink privacy in the 21st Century. "It's not about the man who wants to watch pornography in complete anonymity over the Internet. It's about the woman who's afraid to use the Internet to organize her community against a proposed toxic dump - afraid because the dump's investors are sure to dig through her past if she becomes too much of a nuisance."
I'm with Bruce on this one. Demand privacy even if you don't think you need it. Consider that the next time you sign up for some new social networking service, or a grocery discount card, or give out your telephone or social security number for some trivial reason. Neglecting to protect our right to privacy is, in effect, giving up on privacy altogether. And that's not a world I want to live in. Openness is important-- but so is privacy, in equal measure. I believe we can have both, but not without active effort on our part.
December 26, 2007
Modern Logo
Leon recently posted a link to a great blog entry on rediscovering Logo. You know, Logo -- the one with the turtle.
I remember being exposed to Logo way back in high school. All I recall about Logo is the turtle graphics, and the primitive digital Etch-a-Sketch drawings you could create with it. What I didn't realize is that Logo is "an easier to read adaptation of the Lisp language.. [with] significant facilities for handling lists, files, I/O, and recursion", at least if the Wikipedia entry on Logo is to be believed.
Although I was eternally fascinated with programming, Logo held no interest for me. It seemed like a toy language, only useful for silly little graphical tricks and stunts with the turtle. But apparently there was a real language lurking underneath all that turtle graphics stuff. Brian Harvey is a Berkeley professor who not only co-wrote Berkeley Lisp, but authored three books that, amazingly, teach the whole of computer science using nothing but Logo.
- Computer Science Logo Style: Symbolic Computing -- concentrates on natural language processing rather than the graphics most people associate with Logo.
- Computer Science Logo Style: Advanced Techniques -- discussions of more advanced Logo features alternate with sample projects using those features, with commentary on the structure and style of each.
- Computer Science Logo Style: Beyond Programming -- a brief introduction to six college-level computer science topics.
If you have no time to skim the material, and you're still convinced Logo is a graphics language for little kids, check out a sample Logo program that Brian put together to impress us. I'm impressed, anyway.
Logo is much more than the thin wrapper over turtle graphics I thought it was in 1986. But turtle graphics still-- how shall I put this? -- suck. I took two new books with me over the holiday vacation, and both deal with something akin to the spiritual successor to Logo-- the Processing environment.
Both Processing: A Programming Handbook for Visual Designers and Artists and Visualizing Data paint a picture of the Processing environment that strongly reminds me of Logo. But Processing doesn't offer up a new Lisp syntax -- it sticks with good old-fashioned Java.
If we didn't care about speed, it might make sense to use Python, Ruby, or many other scripting languages. That is especially true on the education side. If we didn't care about making a transition to more advanced languages, we'd probably avoid a C++ or Java-style syntax. But Java is a nice starting point for a sketching language because it's far more forgiving than C/C++ and also allows users to export sketches for distribution via the Web.
The focus of the Processing environment is squarely on learning while doing, which is definitely one of the tenets of Logo.
If you're already familiar with programming, it's important to understand how Processing differs from other development environments and languages. The Processing project encourages a style of work that builds code quickly, understanding that either the code will be used as a quick sketch or that ideas are being tested before developing a final project. This could be misconstrued as software engineering heresy. Perhaps we're not far from "hacking", but this is more appropriate for the roles in which Processing is used. Why force students or casual programmers to learn about graphics contexts, threading, and event handling methods before they can show something on the screen that interacts with the mouse? The same goes for advanced developers: why should they always need to start with the same two pages of code whenever they begin a project?
In another scenario, if you're doing scientific visualization, the ability to try things out quickly is a far higher priority than sophisticated code structure. Usually you don't know what the outcome will be, so you might build something one week to try an initial hypothesis and build something new the next week based on what was learned in the first week.
It's an admirable philosophy, and it's especially appropriate for a domain-specific language. If you're interested in graphics and visualization -- if you're truly looking for a modern Logo-- leave the turtles behind and check out Processing instead.
December 23, 2007
Size Is The Enemy
Steve Yegge's latest, Code's Worst Enemy, is like all of his posts: rich, rewarding, and ridiculously freaking long. Steve doesn't write often, but when he does, it's a doozy. As I mentioned a year ago, I've started a cottage industry mining Steve's insanely great but I-hope-you-have-an-hour-to-kill writing and condensing it into shorter-form points. So let's begin:
- Steve began writing a multiplayer game in Java, Wyvern, around 1998. If you're curious what it looks like, see fan screenshots one and two.
- Over the last 9 years, Wyvern has grown to 500,000 lines of Java code.
- Steve realized that it is impossible for a single programmer to singlehandedly maintain and support half a million lines of code. Even if you're Steve Yegge.
There's much more, but I want to pause here for a moment. It is absolutely true that any programmer who personally maintains half a million lines of code is automatically in a pretty rarified club. Steve's right about this. Most developers will never have the superhuman privilege of personally maintaining 500k LOC or more. On any rational software development project, you'd have a team of developers working on it, or you'd open source the thing entirely to spread the effort across a community.
But here's what I don't understand:
I happen to hold a hard-won minority opinion about code bases. In particular I believe, quite staunchly I might add, that the worst thing that can happen to a code base is size.
So Steve believes the majority of developers, when encountering a code base approximately the size of the Death Star, would think:
I could totally build that.
It's a telling indicator of the impressively bearded computer scientist crowd that Steve runs with. They probably wear flip-flops to work, too. Amongst the programmers I know, the far more common-- and certainly more rational-- reaction to a code base that large would be to run away, screaming, as fast as they could. And I'd be right behind them.
I don't think you necessarily have to spend ten years writing 500k worth of fairly complicated Java code to independently reach the same conclusion. Size is the enemy. Simply going from 1k to 10k LOC-- assuming you're sufficiently self-aware as a programmer-- is more than enough of a glimpse into the maw of madness that lies beyond. Even if you've written zero lines of code, if you've ever read any Steve McConnell books, the size rule is pounded home, time and time again:
Project size is easily the most significant determinant of effort, cost and schedule [for a software project]. People naturally assume that a system that is 10 times as large as another system will require something like 10 times as much effort to build. But the effort for a 1,000,000 LOC system is more than 10 times as large as the effort for a 100,000 LOC system.
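To make that "more than 10 times" claim concrete, here is a minimal sketch that assumes effort grows as a power of code size. The power-law form and the exponent of 1.1 are illustrative assumptions of mine, not figures from the quote; any exponent above 1 produces the same qualitative result.

```python
def relative_effort(loc, exponent=1.1):
    """Effort in arbitrary units under an assumed power-law model.

    The exponent is a hypothetical value chosen for illustration;
    anything above 1.0 models a diseconomy of scale.
    """
    return loc ** exponent


small = relative_effort(100_000)
large = relative_effort(1_000_000)
ratio = large / small  # mathematically equal to 10 ** exponent

print(f"10x the code -> {ratio:.1f}x the effort")
```

Under that assumed exponent, ten times the code costs roughly 12.6 times the effort, and the gap widens as the exponent grows.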
One of the most fundamental and truly effective pieces of advice you can give a software development team-- any software development team-- is to write less code, by any means necessary. Break the project into smaller subprojects. Deliver it in complementary fragments. Try iterative development. Stop writing everything in assembly language and APL. Hire better programmers who naturally write less code. Buy code from a third party. Do absolutely whatever it takes to write as little code as possible, because the best code is no code at all.
We're not done yet. I warned you that this was a long post. Continuing from above:
- Because Java is a statically typed language, it requires lots of tedious, repetitive boilerplate code to get things done.
- That tedious, repetitive boilerplate code has been codified into Java faith as the seminal books "Design Patterns" and "Refactoring".
- Java developers fervently believe, almost to a man/woman, that IDEs can overcome the unavoidable LOC bloat of Java.
- A rewrite of Wyvern from Java into a dynamic language that runs on the JVM could reduce the raw code size by 50% to 75%.
Here's where Steve not-so-gently segues from "size is the problem" to "Java is the problem".
Bigger is just something you have to live with in Java. Growth is a fact of life. Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.
Going back to our crazed Tetris game, imagine that you have a tool that lets you manage huge Tetris screens that are hundreds of stories high. In this scenario, stacking the pieces isn't a problem, so there's no need to be able to eliminate pieces. This is the cultural problem: [Java programmers] don't realize they're not actually playing the right game anymore.
Steve singles out Martin Fowler, who recently "abandoned" the static-language Java fold in favor of the dynamically typed Ruby. Fowler quite literally wrote the book on refactoring, so perhaps there's some truth to Steve's claim that the rigid architecture of classic, statically typed languages ultimately prevents you from refactoring the code down as far as you need to go. If Fowler can't refactor the Java pieces to fit, who can?
Bruce Eckel is another notable Java personality who apparently reached many of the same conclusions about Java years ago.
I can't quantify [the cost of strong typing]. I haven't been able to come up with a from-first-principles mathematical proof, probably because it depends on human factors, like how much time it takes to remember how to open a file and put the try block in the right places and remember how to read lines and then remember what you were really trying to accomplish by reading that file. In Python, I can process each line in a file by saying:
    for line in file("FileName.txt"):
        # Process line
I didn't have to look that up, or to even think about it, because it's so natural. I always have to look up the way to open files and read lines in Java. I suppose you could argue that Java wasn't intended to do text processing and I'd agree with you, but unfortunately it seems like Java is mostly used on servers where a very common task is to process text.
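The `file()` builtin in Bruce's snippet only exists in Python 2. A rough present-day equivalent, as a sketch rather than anything Bruce wrote, uses `open()` inside a `with` block. The function name `count_nonblank_lines` and the "skip blank lines" processing step are made up for illustration.

```python
import tempfile
from pathlib import Path


def count_nonblank_lines(path):
    """Iterate over a text file line by line and count the non-blank lines."""
    count = 0
    # open() replaces the Python 2-only file() builtin; the with-block
    # closes the file even if processing raises an exception.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # stand-in for "process line"
                count += 1
    return count


if __name__ == "__main__":
    # Write a throwaway sample file so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "FileName.txt"
        sample.write_text("first\n\nsecond\n", encoding="utf-8")
        print(count_nonblank_lines(sample))
```

The loop itself is still the same one-liner from the quote; the only additions are the explicit encoding and the guaranteed close.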
Lines of code are, and always have been, the enemy. More lines of code means more to read, more to understand, more to troubleshoot, more to debug. But it is possible to go too far in the other direction as well. If you're not careful, you could end up playing yet another game entirely-- yes, you've cleverly avoided the trap of Java's infinitely tall Tetris, but have you slipped into Perl's Golf instead?
Perl "golf" is the pastime of reducing the number of characters used in a Perl program to the bare minimum, much as how golf players seek to take as few shots as possible in a round.
It originally focused on the JAPHs used in signatures in Usenet postings and elsewhere, but the use of Perl to write a program which performed RSA encryption prompted a widespread and practical interest in this pastime. In subsequent years, code golf has been taken up as a pastime in other languages besides Perl.
In our war on verbosity, there's an inevitable tradeoff between verbosity and understandability. Steve acknowledges this by hinging his JVM language choice on what is "syntactically mainstream": JRuby, Groovy, Rhino (JavaScript), and Jython. I'll spoil the not-so-surprise ending for you: Steve is rewriting Wyvern in Rhino, and in the process he'll help bring Rhino up to spec with the forthcoming EcmaScript Edition 4 update to JavaScript. It's no magic bullet, but it seems like a reasonable compromise based on his goals.
So ends the epic ten year tale of Stevey and his merry band of Wyverneers. But where does that leave us? I have my opinions, naturally:
- If you personally write 500,000 lines of code in any language, you are so totally screwed.
- If you personally rewrite 500,000 lines of static language code into 190,000 lines of dynamic language code, you are still pretty screwed. And you'll be out a year of your life, too.
- If you're starting a new project, consider using a dynamic language like Ruby, JavaScript, or Python. You may find you can write less code that means more. A lot of incredibly smart people like Steve present a compelling case that the grass really is greener on the dynamic side. At the very least, you'll learn how the other half lives, and maybe remove some blinders you didn't even know you were wearing.
- If you're stuck using exclusively static languages, ask yourself this: why do we have to write so much damn code to get anything done-- and how can this be changed? Simple things should be simple, complex things should be possible. It's healthy to question authority, particularly language authorities.
Remember: size really is the enemy. Right after ourselves, of course.
December 20, 2007
Digital Certificates: Do They Work?
The most obvious badge of internet security is the "lock" icon. The lock indicates that the website is backed by a digital certificate:
- This website is the real deal, not a fake set up by criminals to fool you.
- All data between your browser and that website is sent encrypted. Nobody in the middle can read any sensitive information you submit to that website, such as your credit card number.
Here's what PayPal looks like in Internet Explorer 7. The lock icon and green background of the address bar let us know that this website is backed by a digital certificate. Clicking on the lock provides additional detail about the certificate.
Here's PayPal in Firefox 2, which follows the same conventions. The address bar color changes, and the lock icon is present. Clicking on the lock produces a dialog with similar summary information.
The summary is reasonable enough. The certificate authority institution, VeriSign, vouches that this site is indeed PayPal. One question I've always had, though, is this: who decided VeriSign is a trusted authority? There's some kind of whitelist built into IE and Firefox that blesses these certificate authorities with "root" status. According to Wikipedia, a 2007 survey identified 6 major certificate authorities:
- VeriSign (57.6%)
- Comodo (8.3%)
- GoDaddy (6.4%)
- DigiCert (2.8%)
- Network Solutions (1.3%)
- Entrust (1.1%)
The certificate authority business has always struck me as an odd relationship, because it's completely commercial and superficial. Fork over your $300-$2,500, some nominal proof of your identity, and you're granted a certificate for a year. Does that imply trust? I'm not the only person to share these concerns; Bruce Schneier has an excellent whitepaper which examines the risks of certification authorities and public-key infrastructure:
Certificates provide an attractive business model. They cost almost nothing to make, and if you can convince someone to buy a certificate each year for $5, that times the population of the Internet is a big yearly income. If you can convince someone to purchase a private CA and pay you a fee for every certificate he issues, you're also in good shape. It's no wonder so many companies are trying to cash in on this potential market.
With that much money at stake, it is also no wonder that almost all the literature and lobbying on the subject is produced by PKI vendors. And this literature leaves some pretty basic questions unanswered: What good are certificates anyway? Are they secure? For what? In this essay, we hope to explore some of those questions.
The other problem with certificates is that, as an end user, it's nearly impossible to tell a good, valid certificate provided by a reputable certificate authority from a bad one. If we click through to examine the PayPal certificate details, we're presented with these three dense tabs:
I don't know about you, but none of that makes any sense to me. And I'm a programmer. Imagine the poor end user trying to make heads or tails of this. What does it all mean? Of course, most users simply won't pay attention -- it's questionable whether they'll even notice the presence of the lock icon and the color difference in the address bar.
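For the programmers in the audience, most of what those tabs show boils down to three questions: who was the certificate issued to, who issued it, and when does it expire. Here's a minimal sketch using Python's standard `ssl` module that pulls out just those fields. The helper names (`fetch_certificate`, `name_field`, `summarize_certificate`) and the sample certificate data are made up for illustration; only the `ssl` and `socket` calls are real library APIs.

```python
import socket
import ssl


def fetch_certificate(host, port=443, timeout=5.0):
    """Connect to host and return its certificate as a getpeercert() dict."""
    # The default context trusts the platform's built-in root CA list.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def name_field(cert, section, key):
    """Find one attribute (e.g. organizationName) in 'subject' or 'issuer'."""
    for rdn in cert.get(section, ()):
        for attr, value in rdn:
            if attr == key:
                return value
    return None


def summarize_certificate(cert):
    """Reduce a certificate dict to the three fields most people care about."""
    return {
        "issued_to": name_field(cert, "subject", "commonName"),
        "issued_by": name_field(cert, "issuer", "organizationName"),
        "expires": cert.get("notAfter"),
    }


# Made-up sample in the shape getpeercert() returns, so this runs offline.
SAMPLE_CERT = {
    "subject": ((("commonName", "www.example.test"),),),
    "issuer": ((("countryName", "US"),), (("organizationName", "Example CA"),)),
    "notAfter": "Jan  1 00:00:00 2030 GMT",
}

if __name__ == "__main__":
    print(summarize_certificate(SAMPLE_CERT))
```

Against a live site you would call `summarize_certificate(fetch_certificate(host))` with a hostname of your choosing. Note that the default context refuses to connect at all unless the certificate chains back to a root on the platform's trusted list, which is the same whitelist question in code form.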
Certificates aren't just for websites; they can be applied to executables, too. Here's what happens when I double-click on the Safari 3.0.4 beta installer. It's been signed by Apple using their digital certificate.
Clicking on the word "Apple" opens detailed information about the certificate. Again, what does all this mean? How can we tell if it is valid?
I understand the value of digital certificates in theory-- to definitively establish the identity of a program or website before entrusting your data to it. Consider a real-world analog. What if I walked up to you on the street and told you I was a policeman? You might check to see if I'm wearing an appropriate uniform. You might ask to see my badge. You might wonder where my partner or squad car is. We use all these things to judge the authenticity of human interactions.
However, I don't understand how the current digital certificate infrastructure prevents criminals from obtaining their own certificates with ease. Even though I could potentially fake a policeman's badge and uniform in the real world, that pales compared with how trivially easy it is to obtain a digital certificate for code signing from Tucows:
1. Create an account at Tucows
2. Buy a Cert ($300)
3. Email them your Drivers License
4. Download the Cert
5. Export your certificate from the machine and store in a safe place
6. Grab signtool.exe from the .NET 2.0 SDK
7. Sign your binary using the certificate from step 4
If the only validation is an emailed copy of a driver's license, that doesn't exactly give me the warm fuzzies. And even if we enhance that with (more expensive, naturally) "extended validation", I fail to see how this would prevent a determined, resourceful criminal from getting whatever certificate they need.
I suppose digital certificates are better than nothing. But I also worry that they're incredibly confusing for the end user, easy to game, and ultimately provide a false sense of security-- and that's the most dangerous risk of all.
December 19, 2007
The Great Browser JavaScript Showdown
In The Day Performance Didn't Matter Any More, I found that the performance of JavaScript improved a hundredfold between 1996 and 2006. If Web 2.0 is built on a backbone of JavaScript, it's largely possible only because of those crucial Moore's Law performance improvements.
But have we hit a performance wall? Is it possible for browsers to run JavaScript significantly faster than they do today? I've always thought that just-in-time optimizing (or even compiling) JavaScript was an unexplored frontier in browser technology. And now the landscape has shifted:
- Apple's WebKit team just announced a great new JavaScript benchmark, SunSpider.
- The browser market is more competitive than it has been in years, with Opera 9.5, Firefox 3, Safari 3, and IE 8 all vying for the coveted default browser position.
Perhaps browser teams will begin to consider JavaScript performance a competitive advantage. The last time I looked for common JavaScript benchmarks, I came away deeply disappointed. That's why I'm particularly excited by the SunSpider benchmark: it's remarkably well thought out, easy to run, and comprehensive.
It's based on real code that does interesting things; both things that the web apps of today are doing, and more advanced code of the sorts we can expect as web apps become more advanced. Very few of the tests could be classed as microbenchmarks.
It's balanced between different aspects of the JavaScript language -- not dominated by just a small handful of different things. In fact, we collected test cases from all over the web, including from other benchmarks. But at the same time, we avoided DOM tests and stuck to the core JavaScript language itself.
It's super easy to run in the browser or from the command line, so you can test both pure engine performance, and the results you actually get in the browser.
We included statistical analysis so you can see how stable the results you're getting really are.
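That last point, reporting how stable the numbers are, is worth stealing for any benchmark. Here's a minimal sketch of the idea in Python: time the same workload several times, then report the mean along with the standard error of the mean. This is not SunSpider's actual code or its exact statistical method; the function names and the sample workload are hypothetical.

```python
import math
import statistics
import time


def time_runs(func, runs=5):
    """Call func() several times; return elapsed milliseconds for each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize_timings(samples):
    """Return (mean, standard error of the mean) for a list of timings."""
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, std_err


if __name__ == "__main__":
    # Hypothetical workload standing in for one benchmark test.
    samples = time_runs(lambda: sum(i * i for i in range(100_000)))
    mean, std_err = summarize_timings(samples)
    print(f"{mean:.2f} ms +/- {std_err:.2f} ms (standard error, {len(samples)} runs)")
```

A large standard error relative to the mean is a hint that a result is too noisy to compare between browsers without more runs.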
Maciej Stachowiak, a member of Apple's WebKit team, graciously explained what each subsection of the benchmark does in the comments:
| Test | What it measures |
| --- | --- |
| 3d | Pure JavaScript computations of the kind you might use to do 3d rendering, but without the rendering. This ends up mostly hitting floating point math and array access. |
| access | Array, object property and variable access. |
| bitops | Bitwise operations, these can be useful for various things including games, mathematical computations, and various kinds of encoding/decoding. It's also the only kind of math in JavaScript that is done as integer, not floating point. |
| controlflow | Control flow constructs (looping, recursion, conditionals). Right now it mostly covers recursion, as the others are pretty well covered by other tests. |
| crypto | Real cryptography code, mostly covers bitwise operations and string operations. |
| date | Performance of JavaScript's "date" objects. |
| math | Various mathematical type computations. |
| regexp | Regular expressions. Pretty self-explanatory. |
| string | String processing, including code to generate a giant "tagcloud", extracting compressed JS code, etc. |
SunSpider is the best JavaScript benchmark I've seen, something we desperately need in an era where JavaScript is the lingua franca of the web. I was so excited, in fact, that I ran some quick benchmarks to compare the four major players in the browser market:
- Windows Vista 32-bit
- 4 GB RAM
- dual-core 3.0 GHz Core 2 Duo CPU
- all browser extensions disabled (clean install)
What surprised me here is that Firefox is substantially slower than IE, once you factor out that wildly anomalous string result. I had to use a beta version of Opera to get something other than invalid (NaN) results for this benchmark, which coincidentally summarizes my opinion of Opera. Great when it works! I expected Opera to do well; it was handily winning JavaScript benchmarks way back in 2005. The new kid on the block, Safari, shows extremely well, particularly considering that it is running outside its native OS X environment. Kudos to Apple. Well, except for that whole font thing.
If you're curious how each browser stacked up in each benchmark area, I broke that down, too:
If you need greater detail-- including variances-- you can download my complete set of SunSpider 0.9 results as a text file.
If I've learned anything from the computer industry, it's that competition benefits everyone. Here's hoping that a great JavaScript browser performance showdown spurs the browser teams on to better performance in this increasingly crucial area.
December 18, 2007
Nobody Cares What Your Code Looks Like
In The Problems of Perl: The Future of Bugzilla, Max Kanat-Alexander* laments the state of the Bugzilla codebase:
Once upon a time, Bugzilla was an internal application at Netscape, written in TCL. When it was open-sourced in 1998, Terry (the original programmer), decided to re-write Bugzilla in Perl. My understanding is that he re-wrote it in Perl because a lot of system administrators know Perl, so that would make it easier to get contributors.
In 1998, there were few advanced, object-oriented web scripting languages. In fact, Perl was pretty much it. PHP was at version 3.0, python was at version 1.5, Java was just starting to become well-known, ruby was almost unheard of, and some people were still writing their CGI scripts in C or C++.
Perl has many great features, most of all the number of libraries available and the extreme flexibility of the language. However, Perl would not be my first choice for writing or maintaining a large project such as Bugzilla. The same flexibility that makes Perl so powerful makes it very difficult to enforce code quality standards or to implement modern object-oriented designs.
Since 1998 there have been many advances in programming languages. PHP has decent object-oriented features, python has many libraries and excellent syntax, Java has matured a lot, and Ruby is coming up in the world quickly. Nowadays, almost all of our competitors have one advantage: they are not written in Perl. They can actually develop features more quickly than we can, not because of the number of contributors they have, but because the language they're using allows it. There are at least two bug-trackers that I can think of off the top of my head that didn't even exist in 1998 and were developed rapidly up to a point where they could compete with Bugzilla.
In 1998, Perl was the right choice for a language to re-write Bugzilla in. In 2007, though, having worked with Perl extensively for years on the Bugzilla project, I'd say the language itself is our greatest hindrance. Without taking some action, I'm not sure how many more years Bugzilla can stay alive as a product. Currently, our popularity is actually increasing, as far as I can see. So we shouldn't abandon what we're doing now. But I'm seeing more and more products come into the bug-tracking arena, and I'm not sure that we can stay competitive for more than a few more years if we stick with Perl.
It's a credit to Max that he cares enough about the future of his work to surface these important issues. Perhaps it would make sense to rewrite Bugzilla in a friendlier, more modern language.
Neither Perl nor the circa-1998 Bugzilla codebase has aged particularly well over the last 10 years. I don't think Bugzilla is anyone's favorite bug tracking product. It is utilitarian bordering on downright ugly. But-- and here's the important part-- Bugzilla works. It's actively used today by some of the largest and most famous open source projects on the planet, including the Linux Kernel, Mozilla, Apache, and many others.
I have a friend who works for an extremely popular open source database company, and he says their code is some of the absolute worst he's ever seen. This particular friend of mine is no stranger to bad code-- he's been in a position to see some horrifically bad codebases. Adoption of this open source database isn't slowing in the least because their codebase happens to be poorly written and difficult to troubleshoot and maintain. Users couldn't care less whether the underlying code is pretty. All they care about is whether or not it works. And it must work-- otherwise, why would all these people all over the world be running their businesses on it?
I gave Joel Spolsky a hard time for his Wasabi language boondoggle, but I'm now reconsidering that stance. Fog Creek Software isn't funded by the admiration of other programmers. It's funded by selling their software to customers. And to the customer, the user interface is the application. I might point and laugh at an application written in some crazy hand rolled in-house language. But language choice is completely invisible to potential customers. As long as the customers are happy with the delivered application and sales are solid, who gives a damn what I-- or any other programmers, for that matter-- think?
Sure, we programmers are paid to care what the code looks like. We worry about the guts of our applications. It's our job. We want to write code in friendly, modern languages that make our work easier and less error-prone. We'd love any opportunity to geek out and rewrite everything in the newest, sexiest possible language. It's all perfectly natural.
The next time you're knee deep in arcane language geekery, remember this: nobody cares what your code looks like. Except for us programmers. Yes, well-factored code written in a modern language is a laudable goal. But perhaps we should also focus a bit more on things the customer will see and care about, and less on the things they never will.
* I desperately want to provide full name attribution here, but I was unable to find Max's last name on any of his pages-- which drives me absolutely bonkers (see # 3).
December 17, 2007
Software Registration Keys
Software is digital through and through, and yet there's one unavoidable aspect of software installation that remains thoroughly analog: entering the registration key.
The aggravation is intentional. Unique registration keys exist only to prevent piracy. Like all anti-piracy solutions-- short of completely server hosted applications and games, where piracy means you'd have to host your own rogue server-- it's an incomplete client-side solution. How effective is it? One vendor implemented code to detect false registration keys and phone home with some basic information such as the IP address when these false keys are entered. Here's what they found:
| Software Connectivity | Ratio of pirated to legitimate keys |
| no internet connection required | 45 : 1 |
| occasional internet connection necessary | 60 : 1 |
| internet must be "always on" | 110 : 1 |
I have no idea how reliable this data is. The vendor is never named, and given that the URL is sharewarejustice.com/software-piracy.htm, I'd expect it to be biased. But it is data, and without the registration key concept (and pervasive internet connectivity), we'd have no data whatsoever to quantify how much piracy actually exists. The BSA estimated 35% of all software was pirated in 2006, but it is just that-- an estimate. I'll choose biased data over no data whatsoever, every time.
I don't have a problem with registration keys. You could, in fact, argue that registration key validation actually works. Microsoft recently stated that the piracy rate of Vista is half that of XP, largely due to improvements in their Windows Genuine Advantage program-- Microsoft's global registration key validation service.
As a software developer, I can empathize with Microsoft to a degree. Unless you oppose the very concept of commercial software, there has to be some kind of enforcement in place. The digital nature of software makes it both easy and impersonal for people to avoid paying (note that I did not say "steal"), which is an irresistible combination for many. Unless you provide some disincentives, that's exactly what people will do-- they'll pay nothing for your software.
Microsoft's history with piracy goes way, way back-- all the way back to the original microcomputers. Witness Bill Gates' Open Letter To Hobbyists, written in 1976.
Almost a year ago, Paul Allen and myself, expecting the hobby market to expand, hired Monte Davidoff and developed Altair BASIC. Though the initial work took only two months, the three of us have spent most of the last year documenting, improving and adding features to BASIC. Now we have 4K, 8K, EXTENDED, ROM and DISK BASIC. The value of the computer time we have used exceeds $40,000.
The feedback we have gotten from the hundreds of people who say they are using BASIC has all been positive. Two surprising things are apparent, however, 1) Most of these "users" never bought BASIC (less than 10% of all Altair owners have bought BASIC), and 2) The amount of royalties we have received from sales to hobbyists makes the time spent on Altair BASIC worth less than $2 an hour.
Why is this? As the majority of hobbyists must be aware, most of you steal your software. Hardware must be paid for, but software is something to share. Who cares if the people who worked on it get paid?
Is this fair? One thing you don't do by stealing software is get back at MITS for some problem you may have had. MITS doesn't make money selling software. The royalty paid to us, the manual, the tape and the overhead make it a break-even operation. One thing you do do is prevent good software from being written. Who can afford to do professional work for nothing? What hobbyist can put 3-man years into programming, finding all bugs, documenting his product and distribute for free? The fact is, no one besides us has invested a lot of money in hobby software. We have written 6800 BASIC, and are writing 8080 APL and 6800 APL, but there is very little incentive to make this software available to hobbyists. Most directly, the thing you do is theft.
Although computers have changed radically in the last thirty years, human behavior hasn't. (Alternately, you could argue that the economics of computing and the emergence of an ad-supported software ecosystem have fundamentally changed the rules of the game since 1976. But that's a topic for another blog post.)
I accept that software registration keys are a necessary evil for commercial software, and I resign myself to manually keeping track of them, and keying them in. But why do they have to be so painful? You do realize a human being has to type this stuff in, right? Here are some things that I've seen vendors get wrong with their registration key process:
- Using commonly mistaken characters in the key
Quick! Is that an 'O' or an '0'? A '6' or a 'G'? An 'I' or an 'l'? A 'B' or an '8'? At least have the courtesy to scour your registration key character set of those characters that are commonly mistaken for other characters. And please print the key in a font that minimizes the chances of confusion.
- Excessively long keys
The most rudimentary grasp of mathematics tells us that a conservative 10 character alphanumeric registration key is good for 197 trillion unique users. Even factoring in the birthday paradox, we can estimate about 14 million random registration key combinations before we have a 50 percent risk of a collision. So why, then, do software developers insist on 20+ character registration keys? It's ridiculous. Are they planning to sell licenses to every grain of sand on every beach?
- Not separating the key into blocks
Rather than smashing your key into one long string, break it into small groups of 4 to 5 characters, separated by a delimiter. It's the same reason phone numbers are listed as 404-555-1212 and not 4045551212: People have an easier time handling and remembering small chunks of information.
- Making it difficult to enter the key
Short of providing every customer a handy USB barcode scanner, at least make the registration key entry form as user friendly as possible:
- Let the user enter the key in any format. With dashes, without dashes, using spaces, whatever. Be flexible. Accept a variety of formats.
- Do not provide five input boxes that require us to tab through each one to enter the key. It's death by a thousand tiny textboxes.
- Tell me as soon as I've entered a bad value in the key. Why should I have to go back and pore over my entry to figure out which letter or number I've screwed up? You're the computer, remember? This is what you're good at.
- Accept pasting from the clipboard. Once we've installed the software, we'll probably install it again, and nobody likes keying these annoying registration keys in more than once. I've seen some clever software that proactively checks the clipboard and enters the key automatically if it finds it there. (Kudos to you, Beyond Compare.)
- Don't passive-aggressively inform me that "the key you entered appears to be valid." Is it? Or isn't it? What's the point of unique registration keys if you can't be sure? I guess paying customers can't be trusted.
- Where's the %*@# key?
The key is important. Without it we can't install or use the software. So why is it buried in the back of the manual, or on an easy-to-overlook interior edge of the package? Make it easy to find-- and difficult to lose. Provide multiple copies of the key in different locations, maybe even as a peelable sticker we can place somewhere useful. And if the software was delivered digitally, please keep track of our key for us. We're forgetful.
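Going back to key length for a moment: the arithmetic is easy to check yourself. Here's a minimal sketch that assumes keys are drawn uniformly at random and uses the standard birthday-problem approximation. The function names are mine, and the 197 trillion input is simply the figure quoted above, not something I've derived.

```python
import math

def keyspace(alphabet_size, length):
    """Number of distinct keys for a given alphabet size and key length."""
    return alphabet_size ** length

def keys_until_collision(space, probability=0.5):
    """Approximate count of random keys issued before the chance of at least
    one duplicate reaches `probability` (birthday-problem approximation)."""
    return math.sqrt(2 * space * math.log(1 / (1 - probability)))

quoted_space = 197 * 10**12  # the "197 trillion" figure from the text

# Simple rule of thumb: roughly sqrt(N) keys before collisions become likely.
rule_of_thumb = math.isqrt(quoted_space)

# Slightly more careful estimate for a 50 percent collision risk.
estimate = keys_until_collision(quoted_space)

print(f"rule of thumb: {rule_of_thumb:,}")
print(f"50% estimate:  {estimate:,.0f}")
print(f"20-character, 36-symbol keyspace: {keyspace(36, 20):.3e}")
```

Both estimates land in the low tens of millions, the same ballpark as the 14 million mentioned above. The exact totals shift with whatever alphabet you assume, but the conclusion doesn't: a 20+ character key buys far more room than any plausible customer base needs.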
Software registration keys are a disconcerting analog hoop we force users to jump through when using commercial software. Furthermore, registration keys are often the user's first experience with our software-- and first impressions matter. If you're delivering software that relies on registration keys, give that part of the experience some consideration. Any negative feelings generated by an unnecessarily onerous registration key entry process will tend to color users' perception of your software.
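As one way of giving it that consideration, here's a rough sketch of key generation and entry handling that follows the suggestions above: a reduced alphabet with the look-alike characters removed, keys split into blocks, forgiving input, and an error that says exactly where the bad character is. Everything here (the alphabet, the 3x4 block layout, the function names) is an assumption for illustration. It only handles formatting, too; checking that a key was actually issued to a paying customer is a separate problem.

```python
import secrets

# Assumed alphabet: digits and uppercase letters with look-alikes removed
# (0/O, 1/I/L, 6/G, 8/B). That leaves 27 symbols.
ALPHABET = "234579ACDEFHJKMNPQRSTUVWXYZ"

def format_key(cleaned, group_len=4):
    """Split an undelimited key into dash-separated blocks."""
    return "-".join(cleaned[i:i + group_len] for i in range(0, len(cleaned), group_len))

def generate_key(groups=3, group_len=4):
    """Return a random key such as 'ACDE-FH2K-9MNP' (hypothetical format)."""
    chars = [secrets.choice(ALPHABET) for _ in range(groups * group_len)]
    return format_key("".join(chars), group_len)

def normalize_key(raw, group_len=4):
    """Accept dashes, spaces, or lowercase input; report the first bad character."""
    cleaned = "".join(ch for ch in raw.upper() if not (ch.isspace() or ch == "-"))
    for position, ch in enumerate(cleaned, start=1):
        if ch not in ALPHABET:
            raise ValueError(
                f"character {ch!r} at position {position} cannot appear in a key"
            )
    return format_key(cleaned, group_len)

key = generate_key()
print(key)
# The same key typed in lowercase with spaces normalizes back to the canonical form.
print(normalize_key(key.replace("-", " ").lower()))
```

Because the look-alikes never appear in a valid key, a typed '0' or 'O' can be flagged immediately instead of leaving the user to hunt for it. A friendlier variant might quietly map '0' to 'O' and so on, but that only works if one member of each confusable pair stays in the alphabet.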
December 15, 2007
On The Meaning of "Coding Horror"
In a recent web search, I found the following comment in a programming.reddit.com thread from eight months ago, completely by accident:
I think prog.reddit will continue to move in phases... a couple of days ago, someone complained about a drop-off in Haskell articles, today there were 4 or 5 ... next time Django or Rails does something worth noting, there'll be a plethora of Python/Ruby stuff. Despite its limb-gnawing tedium, Coding Horror will continue to rank high.
I personally think describing what I do here as "limb-gnawing tedium" is a bit hyperbolic. But it made me laugh.
I can understand where the commenter is coming from; the web is chock full of content that absolutely bores me to tears. If I stopped and wrote a comment bemoaning every boring blog post or web page I've ever found, I'd scarcely have time to do anything else. Such comments would also be a bit of a downer for the author, as I'm sure someone is interested in that particular topic. The whole point of putting content on the internet is to find an audience, however tiny that audience might end up being. Maybe you're not a member of the audience, and that's OK.
I try to avoid blogging about blogging because it's such a cliche. And it's boring. However, after digging a bit deeper in the programming.reddit.com comments, I became concerned:
What I don't like about "Coding Horror": the title promises "Daily WTF" style entertainment, but doesn't deliver. "Coding Horror" ought to be about people coding dynamic web pages entirely in SQL, or having some mission critical system written in a cryptic version of csh.
This is a profound misunderstanding. If you're coming here looking for that sort of entertainment, you're bound to be disappointed. I'd like to think this site is the opposite of The Daily WTF.
I apologize for the confusion. Allow me to explain.
First, the literal explanation. The sidebar of Steve McConnell's seminal book, Code Complete, contains a series of icons denoting particular areas. There's a "Hard Data" icon, a "Key Point" icon, and a "Coding Horror" icon.
I have to talk a little bit about the influence this book had on me as a young developer.
I graduated from college in 1992, and entered the field of professional software development at that point, at least in terms of being paid to do so. I loved it, but I really had no idea what I was doing. I was a young, inexperienced developer working in small business, where there aren't a lot of other developers to look to as mentors. Nor was the internet a factor; the internet didn't really hit until '95 for most people. I was living in Denver at the time, and I frequented the Tattered Cover, a great independent bookstore. Code Complete was originally published in May 1993; I stumbled across it while browsing the computer book section at the Tattered Cover sometime in 1994. I was floored. Here's this entire book about becoming a professional software developer, written in this surprisingly friendly, humane voice. And it was backed by rational research and real data, not the typical developer "my brain is bigger than yours" chest-thumping.
I had found my muse. Reading Code Complete was a watershed event in my professional life. I read it three times in one week. It immediately became my Joy of Cooking. I didn't even know it existed, but it showed me that if you loved food enough, it was possible to go from being a mere cook to a real chef.
One of the most striking and memorable things about Code Complete, even to this day, is that Coding Horror illustration in the sidebar. Every time I saw it on the page, I would chuckle. Not because of other people's code, mind you. Because of my own code. That was the revelation. You're an amateur developer until you realize that everything you write sucks.
YOU are the Coding Horror.
The minute you realize that, you've crossed the threshold from being an amateur software developer into the realm of the professionals. Half of being a good, competent software developer is realizing that you're going to make tons of mistakes. You will be your own worst enemy almost all the time. It's a lifestyle. You're living it right now. You, me, all of us. The problems start with us. We're all coding horrors. This story from the Tao that Reginald Braithwaite posted is as good an explanation as any:
There was once a monk who would carry a mirror wherever he went. A priest noticed this one day and thought to himself, "This monk must be so preoccupied with the way he looks that he has to carry that mirror all the time. He should not worry about the way he looks on the outside. It's what's inside that counts." So the priest approached the monk and asked "Why do you always carry that mirror?", thinking this would surely prove his guilt.
The monk took the mirror from his bag and pointed it at the priest. He said, "I use it in times of trouble. I look into it and it shows me the source of my problems as well as the solution to my problems."
If you're horrified by what you see in the mirror, you are not alone.
I chose that title for my blog – with explicit permission from Steve – because it's a clever in-joke about becoming a humble professional programmer. That's what I try to do here. I write to learn and explore topics that deal with computers and programming, and because I'm easily bored, the topics I find most interesting tend to apply to a wide audience of programmers. Maybe even people who don't know they're programmers yet. To steal a phrase from the talented Rich Skrenta, I blog to help others and also to learn. As it turns out both are aided by getting folks to actually read the stuff.
But that's not the complete story. I'd be lying if I didn't admit that there's an element of selfishness at work here. I love computers and programming. I love it so much it borders on obsession. When I saw the movie Into The Wild, I was transfixed by the final note written into the margins of Dr. Zhivago by a doomed Christopher McCandless: "Happiness only real when shared."
I realized, that's it. That's it exactly. That is what is so intensely satisfying about writing here. My happiness only becomes real when I share it with all of you.
December 13, 2007
Our Fractured Online Identities
Anil Dash has been blogging since 1999. He's been a member of the Movable Type team from the earliest days. As you'd expect from a man who has lived in the trenches for so long, his blog is excellent. It's well worth a visit if you haven't been there already. I was recently reading through his 2002 blog recommendations and marvelling at the hardy few who survived through five long years of the internet. The way I figure, that's equivalent to thirty-five people years.
I also noticed something interesting lodged in the sidebar of his blog. A long list of Anil Dash's many online identities, spread across no less than 29 different websites:
Laurel Krahn created one of the first 30 weblogs back in 1998. Her home page paints a similarly fractured picture of her online identity. I count 21 different websites that represent some part of Laurel:
There's no way any one person could truly keep these 20 or 30 websites up to date. So which one of these websites represents the real Laurel Krahn, the real Anil Dash? Or do all these tiny fragments of identity cumulatively sum to a whole? Browsing around their sites, it's fairly easy to determine what is getting the lion's share of attention, and pare away the neglected parts. Still, it's unclear.
I suppose my online identity is similarly fractured, although somewhat less so than Anil and Laurel. I obviously have this primary blog, which represents me professionally. But I also have a twitter stream, which I alternately treat as my inner monologue, a link blog, and as a form of public instant messaging. Then there are my Vertigo blogs, a handful of online games I play semi-frequently, and various other online forums that I regularly participate in for particular special interests. All these things are me.
But which one is the real me? Is my online identity even a reasonable approximation of who I am? I think it could be. What you read here is mostly what you get, minus some corner-case peculiarities that probably aren't interesting to anyone but me (and my wife, but she's bound by law). It's reassuring to have a single central authoritative place that represents me online.
Mostly, I'm just amazed that these veteran bloggers feel they can actually maintain twenty or thirty different facets of their identity across all those disparate websites. I certainly can't. I struggle to write one lousy blog four to five times a week. I'm more interested in shrinking my focus into an ever narrower and sharper point than I am in diluting my effort across dozens of different websites.
There's no right or wrong answer here, of course. You should follow your interests wherever they lead you, and to as many different websites as necessary. But I do think building a strong online identity is an important strategy for distinguishing yourself in an increasingly online world. So choose carefully, and focus on those things that best represent you.
December 12, 2007
Sorting for Humans : Natural Sort Order
The default sort functions in almost every programming language are poorly suited for human consumption. What do I mean by that? Well, consider the difference between sorting filenames in Windows Explorer, and sorting those very same filenames via Array.Sort() code:
[Image: side-by-side file listings, Explorer shell sort vs. Array.Sort()]
Quite a difference.
I can say without the slightest hint of exaggeration that this exact sorting problem has been a sore point on every single project I've ever worked on. Users will inevitably complain that their items aren't sorting properly, and file bugs on these "errors". Being card-carrying members of the homo logicus club, we programmers produce a weary sigh, and try to keep any obvious eye-rolling in check as we patiently inform our users that this isn't an error. Items are sorting in proper order. Proper ASCII order, that is. As we're walking away, hopefully you won't hear us mutter under our breath what we're actually thinking-- "Stupid users! They don't even understand how sorting works!"
I always felt a pang of regret when rejecting these requests. Honestly, look at those two lists-- what sane person would want ASCII order? It's a completely nonsensical ordering to anyone who doesn't have the ASCII chart committed to memory (and by the way, uppercase A is decimal 65). I never really understood that there was another way to sort, even though natural sort has been right in front of us all along in the form of Mac Finder and Windows Explorer file listings. I had language-induced blinders on. If our built-in sort returns in ASCII order, then that must be correct. It was bequeathed upon us by the Language Gods. Can there be any other way?
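If you don't have the ASCII chart committed to memory either, a few lines of Python make the ordering visible. This is only an illustration; the filenames are made up for the example.

```python
# Hypothetical filenames, chosen to expose the two classic surprises.
names = ["file10.txt", "file9.txt", "File1.txt", "apple.txt"]

# Code points for a few landmark characters: uppercase letters come before
# lowercase ones, and digits come before both.
landmarks = {ch: ord(ch) for ch in "09AZaz"}
print(landmarks)

# The default sort compares strings one character at a time by code point.
ascii_sorted = sorted(names)
print(ascii_sorted)
```

Two things fall out of the comparison: "File1.txt" jumps ahead of "apple.txt" because 'F' (70) is less than 'a' (97), and "file10.txt" lands before "file9.txt" because the comparison stops at '1' versus '9' and never sees the number as a whole.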
Kate Rhodes is a bit up in arms about our collective ignorance of ASCIIbetical vs. Alphabetical. Can't say I blame her. I'm as guilty as anyone. Turns out the users weren't the stupid ones after all -- I was.
Silly me, I just figured that alphabetical sorting was such a common need (judging by the number of people asking how to do it I'm not wrong either) that I wouldn't have to write the damn thing. But I didn't count on the stupid factor. Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed. If any of you are using your language's default sort algorithm, which is almost guaranteed to be ASCIIbetical (for good reason) to get alphabetical sorting you proceed to the nearest mirror and slap yourself repeatedly before returning to your desks and fixing your unit tests that didn't catch this problem.
It isn't called "Alphabetical sort"; it's collectively known as natural sort. But she's right about one thing: it's hard to find information on natural sorting, and many programmers are completely ignorant of it. None of the common computer languages (that I know of) implement anything other than ASCIIbetical sorts. There are a few places you can find natural sort algorithms, however:
- Dave Koelle's The Alphanum Algorithm
- Martin Pool's Natural Order String Comparison
- Ian Griffiths' Natural Sorting in C#
- Ned Batchelder's Compact Python Human Sort, along with Jussi Salmela's internationalized version of same.
Don't let Ned's clever Python ten-liner fool you. Implementing a natural sort is more complex than it seems, and not just for the gnarly i18n issues I've hinted at, above. But the Python implementations are impressively succinct. One of Ned's commenters posted this version, which is even shorter:
    import re

    def sort_nicely(l):
        """Sort the given list in place, in the way that humans expect."""
        # Digit runs become ints, so "z2" sorts before "z10".
        convert = lambda text: int(text) if text.isdigit() else text
        # Split each string into alternating text and digit chunks.
        alphanum_key = lambda key: [convert(c) for c in re.split(r'([0-9]+)', key)]
        l.sort(key=alphanum_key)
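To see that kind of key function at work, here's a small self-contained sketch that sorts the same list both ways. It is a variation on the idea rather than anyone's published implementation: the case-insensitive comparison is my own addition, and the filenames are hypothetical.

```python
import re

_DIGITS = re.compile(r"([0-9]+)")

def natural_key(name):
    """Build a sort key: text chunks compared case-insensitively, digit runs as ints."""
    parts = _DIGITS.split(name)
    # re.split with a capturing group alternates text, digits, text, ...
    # so the odd indices always hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

filenames = ["z10.txt", "z2.txt", "Z1.txt", "z100.txt", "a1.txt"]  # hypothetical sample data

default_order = sorted(filenames)                   # code-point order
natural_order = sorted(filenames, key=natural_key)  # natural order

print(default_order)
print(natural_order)
```

The default order puts "Z1.txt" first and "z2.txt" last, while the natural order reads a1, Z1, z2, z10, z100. It still ignores locale-aware collation, negative numbers, and decimals, which is where the real complexity lives.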
I tried to come up with a clever, similarly succinct C# 3.0 natural sort implementation, but I failed. I'm not interested in a one-liner contest, necessarily, but it does seem to me that a basic natural sort shouldn't require the 40+ lines of code it takes in most languages.
As programmers, we'd do well to keep Kate's lesson in mind: ASCIIbetical does not equal alphabetical. ASCII sorting serves the needs of the computer and the compiler, but what about us human beings? Perhaps a more human-friendly natural sort option should be built into mainstream programming languages, too.
