I <3 Steve McConnell*
Coding Horror
programming and human factors
by Jeff Atwood

December 12, 2007

Sorting for Humans : Natural Sort Order

The default sort functions in almost every programming language are poorly suited for human consumption. What do I mean by that? Well, consider the difference between sorting filenames in Windows explorer, and sorting those very same filenames via Array.Sort() code:

Explorer shell sort Array.Sort()
alphabetical sort ASCIIbetical sort

Quite a difference.

I can say without the slightest hint of exaggeration that this exact sorting problem has been a sore point on every single project I've ever worked on. Users will inevitably complain that their items aren't sorting properly, and file bugs on these "errors". Being card-carrying members of the homo logicus club, we programmers produce a weary sigh, and try to keep any obvious eye-rolling in check as we patiently inform our users that this isn't an error. Items are sorting in proper order. Proper ASCII order, that is. As we're walking away, hopefully you won't hear us mutter under our breath what we're actually thinking-- "Stupid users! They don't even understand how sorting works!"

I always felt a pang of regret when rejecting these requests. Honestly, look at those two lists-- what sane person would want ASCII order? It's a completely nonsensical ordering to anyone who doesn't have the ASCII chart committed to memory (and by the way, uppercase A is decimal 65). I never really understood that there was another way to sort, even though natural sort has been right in front of us all along in the form of Mac Finder and Windows Explorer file listings. I had language-induced blinders on. If our built-in sort returns in ASCII order, then that must be correct. It was bequeathed upon us by the Language Gods. Can there be any other way?

Kate Rhodes is a bit up in arms about our collective ignorance of ASCIIbetical vs. Alphabetical. Can't say I blame her. I'm as guilty as anyone. Turns out the users weren't the stupid ones after all -- I was.

Silly me, I just figured that alphabetical sorting was such a common need (judging by the number of people asking how to do it I'm not wrong either) that I wouldn't have to write the damn thing. But I didn't count on the stupid factor. Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed. If any of you are using your language's default sort algorithm, which is almost guaranteed to be ASCIIbetical (for good reason) to get alphabetical sorting you proceed to the nearest mirror and slap yourself repeatedly before returning to your desks and fixing your unit tests that didn't catch this problem.

It isn't called "Alphabetical sort"; it's collectively known as natural sort. But she's right about one thing: it's hard to find information on natural sorting, and many programmers are completely ignorant of it. None of the common computer languages (that I know of) implement anything other than ASCIIbetical sorts. There are a few places you can find natural sort algorithms, however:

Don't let Ned's clever Python ten-liner fool you. Implementing a natural sort is more complex than it seems, and not just for the gnarly i18n issues I've hinted at, above. But the Python implementations are impressively succinct. One of Ned's commenters posted this version, which is even shorter:

import re 

def sort_nicely( l ): 
  """ Sort the given list in the way that humans expect. 
  """ 
  convert = lambda text: int(text) if text.isdigit() else text 
  alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
  l.sort( key=alphanum_key ) 

I tried to come up with a clever, similarly succinct C# 3.0 natural sort implementation, but I failed. I'm not interested in a one-liner contest, necessarily, but it does seem to me that a basic natural sort shouldn't require the 40+ lines of code it takes in most languages.

As programmers, we'd do well to keep Kate's lesson in mind: ASCIIbetical does not equal alphabetical. ASCII sorting serves the needs of the computer and the compiler, but what about us human beings? Perhaps a more human-friendly natural sort option should be built into mainstream programming languages, too.

Posted by Jeff Atwood    View blog reactions
« Blacklists Don't Work
Our Fractured Online Identities »
Comments

Leading zeros are your friends.

mpbk on December 13, 2007 3:09 AM

PHP has a built in natural sort function as well: http://be2.php.net/natsort

Would be nice to have it built in .NET as well.

David Cumps on December 13, 2007 3:10 AM

I don't see how this would actually be a "sorting" problem for .NET. Nothing should be wrong with how .NET sorts, but rather you would need to provide it a custom comparer. You can still use the same Sort method, just provide a comparer that can tell if "z2" is greater than "z10" or not.

John Chapman on December 13, 2007 3:18 AM

You're absolutely right, Jeff - there's no reason that there shouldn't, somewhere in the libraries for .NET or Java, be a natural sorting algorithm. I know that in Java there's more than enough locale specific functionality that the argument of different character sets or sorting rules (someone in one of the links mentions that, in Sweden, 'V' and 'W' sort together) doesn't really hold water. Besides, it's Sun and Microsoft - don't they have motivated interns to do that sort of work?

BrotherBeal on December 13, 2007 3:25 AM

Though I don't disagree with the theme that we should do what we can as programmers to reduce cognative friction, make lives easier, and exceed user expectations, I have to wonder, what exactly is coming out unsorted here? Obviously the file names are a contrived example, because let's be honest, no (rational) human would name files with such a scheme if they ever intended those files for human consumption.

I can think of only a few situations (though admittedly I may be missing some) where sorting ASCIIbetically produces counter-intuitive results. For example, numbers/dates tacked on to the beginning (or end) of file names as an ad hoc method of version control. Maybe, a user ID number (perhaps an integer) that's used as a primary key in a database might sort out of order. In answer to these "incorrect" sorts, I would argue that the users are not using the proper tools (perhaps because we as programmers haven't supplied them, or due to other reasons such as cost).

My point is that the types of things that tend to sort poorly in ASCIIbetical are the type of things that we as programmers should be abstracting away in our software so that users don't have to consider whether something's sorted "incorrectly." Certainly there will be situations where abstracting such things is infeasible, or unnecessary, and such a "natural" sort might be appropriate, but when that happens, write the sort, port it out to a library assembly (especially now with .NET 3.0 extension methods this becomes tractable), and get on with it. If you come across a situation where it's necessary later, link the library into your assembly.

Just don't put the cart before the horse Jeff.

Michael Starke on December 13, 2007 3:26 AM

Leading zeros are your friends.

because let's be honest, no (rational) human would name files with such a scheme if they ever intended those files for human consumption.

We are thinking as programmers here, ask any normal person to write a list of file names and he/she will write
photo1
photo2
photo3
photo..
photo10
photo11

Natural sort is the way!

Edddy on December 13, 2007 3:34 AM

If you want to sort like Explorer, you are apparently supposed to use StrCmpLogicalW. See: (http://msdn2.microsoft.com/en-us/library/bb759947.aspx) and (http://www.pinvoke.net/default.aspx/shlwapi/StrCmpLogicalW.html). I think it is only suited to sorting filenames though, because it may not take account of sorting properly for different languages/cultures.

Chris L on December 13, 2007 3:35 AM

I'm sorry, but I don't agree. Let's call it alphanumerical.

z100 should always go BEFORE z2, the same way that table goes before tar.

Also, anyone naming their files *name*-*number* and not putting in the extra digits should not order by name, but by date.

And I take offense at Kate Rhodes' comments. It's yet another case of stupid users.

paketep on December 13, 2007 3:38 AM

Off the top of my head...

Can't you just treat the whole "name" as a base-62* numbering system?

The number line would look something like:
0 1 2 3 4 5 6 7 8 9 a A b B c C d D e E f F g G h H i I ... etc

Convert to decimal in the usual way so:

z1 = (61 * 62) + (1) = 3783
z2 = 3784
z10 = (61 * 62^2) + (1 * 62) + 0 = 234546
z11 = 234547
z100 = (61 * 62^3) + (1 * 62^2) + (0 * 62) + 0 = 14541852
z101 = 14541853

That seems to work.

Downsides are that you need to be able to handle very large numbers to get support a useful filename length and base-62 only covers alphanumeric, you'd need a much bigger base to cover punctuation and unicode.

(Apologies if my maths is a bit wonky on the above. It's before lunch and my coffee isn't kicking in. But I think the principle is sound.)

Graham Stewart on December 13, 2007 3:39 AM

>> We are thinking as programmers here, ask any normal person to write a list of file names and he/she will write
>> photo1
>> photo2
>> photo3
>> photo..
>> photo10
>> photo11

Why are they choosing file names like that though? Could it be because we've failed as developers to provide them tools to give their files more descriptive names?
Why not name those same photos:
birthday cake.jpg
football cheerleader.jpg
Pam smiling.jpg
zoo animals.jpg

That certainly seems more like the kind of names a user would pick. I would wager if you asked a person if they'd rather have files named rationally with a description of their content, rather than some arbitrary numbering scheme, just about everyone would rather have descriptive file names.

Let's solve the root of the problem before so that we don't have to continually solve the symptoms.

Michael Starke on December 13, 2007 3:41 AM

My Ruby version of Dave Koelle's algorithm:
http://rikkus.info/arch/sensible_sort.rb

Rik Hemsley on December 13, 2007 3:43 AM

> Leading zeros are your friends

How many leading zeros?? 3, 4, 10??

Then you have to type them in to access whatever the thing is you've indexed with them - really unfriendly when there are more than about 3.

More thought required methinks.

> Why are they choosing file names like that though?

How many keyboards have you ever seen on a digital camera?

Francis Fish on December 13, 2007 3:46 AM

>> How many keyboards have you ever seen on a digital camera?

Well, duh! The camera has no way of knowing if you're taking a picture of a pumpkin or a polar bear, and without a keyboard you cannot tell it. But most digital cameras these days come with a suite of software that you use to sink the photos from the camera to the computer. Why is that software not gently prodding those users to provide descriptions of the content of those files so that an intelligible value can be used for the file name? If a user chooses not to provide those descriptions, then they have to remember that DSC01722.jpg is the picture of Pam from the office party. If that's what they want to do, you can't stop them. But you can put the right tools in the hands of users who do want the feature.

Michael Starke on December 13, 2007 3:57 AM

Cocoa has numerical comparison as follows: [someString compare:someOtherString options:NSNumericalSearch];.

It's trivial to implement a function for NSArray's sortUsingFunction:context: that uses it, or a category on NSString to add a -numericalCompare: method to be used with sortUsingSelector:.

millenomi on December 13, 2007 3:57 AM

Anyone coding for European languages will have certainly come across a variant of this where bugs crop up complaining that accented characters (like é) end up at the bottom of a sort instead of next to their a-z equivalents. It gets even more fun when you start considering compound letters like 'ß'.

To the poster who suggested treating the name in base-62, what you're essentially doing is sorting by string length first, then alphabetically - not addressing the core problem, e.g. if you filenames were all the same length (z1.., z10., z100) the system falls down.

Dino on December 13, 2007 3:58 AM

@Francis: most digicams I've seen actually use padding zeros. They mostly pad to 5 digits, since it's usually IMG*.JPG, and since we are still using useless FAT16 as a common denominator filesystem on the memory cards, we are not getting long file names. Smells like 1995, don't you think? And it's actually almost 2008 outside... That's just sad.

Nikolai on December 13, 2007 4:00 AM

(Side note: Cocoa calls numerical sorting what you refer to as natural sorting.)

millenomi on December 13, 2007 4:01 AM

> How many keyboards have you seen on a digital camera?

That made me laugh out loud. Good shit, Francis.

And i think the person who wrote about using leading zeros meant that the leading zeros would be stripped after the sort was done. Or atleast that's how i pictured it. That would be quite a task on bigger lists though.

Jazz on December 13, 2007 4:03 AM

paketep:
Sorry, but if your users are asking for something and your response is "no thats stupid", instead of "ok, I'll do it that way" then you have some problems with your logic. Users are called users for a reason: they _use_ the stuff.

John on December 13, 2007 4:08 AM

On the Michael Herf website, you can find that :
http://www.stereopsis.com/strcmp4humans.html

I must mention that on his website you can find lots of nice peaces of code.

Alex on December 13, 2007 4:08 AM

@Dino: not sure I understand your criticism?

In my example z100 and z101 are the same length and would be sorted correctly.

If you want to sort "z1..", "z10.", "z100" then it would still work, provided you assign the "." character a value on the number line to determine where it appears relative to other values.

On reflection I think where it falls down is on filenames with DIFFERENT lengths (e.g. apple, ball, cab would be sorted as cab, ball, apple - which isn't quite right!)

Graham Stewart on December 13, 2007 4:11 AM

The python sort above is still case sensitive, so 'abc' will be sorted after 'DEF'. It might be worth changing 'else text' to 'else text.upper()'. I can't remember if that is safe for languages other than english though...

Simon on December 13, 2007 4:16 AM

>> Why is that software not gently prodding those users to provide descriptions of the content of those files so that an intelligible value can be used for the file name?

Well duh, probably because it's kind of silly to rename the file if you're going to add tags to it later anyway? Besides, if I download 150 pics from my cam, I'm not really eager to have to name them all...

Gabri on December 13, 2007 4:33 AM

Natural sort mostly solves the problem of numbers in file names. What if the file names also contain dates? It would be fantastic for users if their file explorer would also recognise and sort by date. The environment's locale settings could help in resolving ambiguities (such as M-D-Y in USA but D-M-Y in Europe) and recognising month names and abbreviations. Users shouldn't have to use the ISO date format YYYY-MM-DD in their file names. For example, 1-jun-2007, 1.6.2007, 6-1-2007, june-6th-2007, 2007-6-june, jun07, 1-07, and 2007/6/1 are all possible candidates for a date in a file name (the last one would be illegal on Un*x-style file systems because slash "/" is the directory separator). There are some challenges in there, and it could never be as good as a human, but with some good heuristics and usability tests to iron out surprising or unexpected behaviour, I reckon it could be made to work very nicely.

Andrew Bettison on December 13, 2007 4:40 AM

@Andrew: Meta-data should be your friend, EXIF when talking about photos. Filenames are obsolete.

Nikolai on December 13, 2007 4:49 AM

Jeff - splitting hairs, I know, but it's all about your ordering predicate, not your sort algorithm. That's one thing (among many things, IMO) that the C++ Standard Library gets right - it correctly identifies the variability point of the sort algorithm (the ordering predicate) and exposes it nicely.

Michael Starke - users are rational? Nah. Anyway - when there is a mix of alpha & numeric in a name, users are going to presume that the numeric part should sort by numeric magnitude, while the alpha part sorts alphabetically. And you'll probably find that they'll think that punctuation characters should sort after letters. Or before letters. Or something. But *not* interspersed before, between (upper vs lower case) and after letters.

mpbk - leading zeroes are something programmers would think about, not users.

Stuart Dootson on December 13, 2007 4:54 AM

@Michael Starke
Because when I take my camera out for a day out I can easily take over 100 pictures. It is not a fun job naming them all, and is quite labor intensive, despite the tools we are given. Instead I just whack them in a folder with a name and date that sums up that day.

Ross on December 13, 2007 4:59 AM

Implement NaturalSortComparer as an extension method for IEnumerable<string>.

Job Done.

Stuart Cam on December 13, 2007 5:02 AM

If you want the same sort order as explorer just use the same function explorer use, it's already there.

Here is "natural sort" in 6 lines:
First, somewhere in your class import StrCmpLogicalW:

[DllImport("Shlwapi.dll", CharSet = CharSet.Unicode)]
private static extern int StrCmpLogicalW(string psz1, string psz2);

And then just use it with Array.Sort:

Array.Sort<string>(myStringArray, delegate(string l, string r)
{
return StrCmpLogicalW(l, r);
});

That's all.

NirD on December 13, 2007 5:04 AM

Sorting is also a problem with internationalization.
Hebrew is written right to left for example...

You also have to think about equality and comparison.
How do you sort between z1.txt and z01.txt?

Tsahi Levent-Levi on December 13, 2007 5:14 AM

If that's the best we can do for the internationalized version I don't even want to start. Basically it relies on having a dictionary of all possible special characters you might encounter. I think such a feature might actually exist in unicode (to find out if something is just A with an accent) but putting the dictionary together yourself is code worthy of posting on the daily WTF. :-)

wpp on December 13, 2007 5:25 AM

I think that you are correct in certain instances. Sure, it would be preferable to do a natural sort, but in most instances an alphabetical sort is just fine. SQL Server does all of its sorting by alphabetical order (they refer to it as Dictionary Order) and I would argue that this sorting is what is used quite often when displaying data, especially on the web. I honestly think the problem is nowhere near as big as you are stating it (since it only affect leading or trailing numbers), but in the case of a file manager, I can see where it would be a problem.

Justin Etheredge on December 13, 2007 5:27 AM

What's all this fuss about filenames, anyway? I think filenames are becoming increasingly irrelevant to users, you know, regular, user-type users. Users dont care about filenames, but about content. It's all tags and metadata and stuff.

I think natural sorting is an improvement for displaying filenames, but it's about a decade too late

Rik on December 13, 2007 5:28 AM

I think there is a much deeper problem here. They need to sort their file names... (Almost) Human like sorting of file names is a non-solution to a much deeper problem (very common actually).

Also, these functions will do an almost job as they will only work for us and with the numbers at the end and might or not work with extensions. It sucks and is not constant. Users will use it but will still have to adapt (albeit in an easier way).

So why do they need to sort those hundred and twenty files again?

JF on December 13, 2007 5:28 AM

The problem isn't well defined.

z1.txt and z01.txt were mentioned by Tsahi, but what about zz.txt and z1.txt?

With that Python code, I think the first are going to stay in the order in which they were, and the second order is undefined, as far as I know.

Remco Gerlich on December 13, 2007 5:33 AM

First of all, the problem is not ASCII sort order or whatever. It's called "sorting with lexicographical order". Calling it ASCII sort only muddles the issue (a different character encoding will no solve this issue).
Someone has already mentioned earlier, the problem is that you are conflating the sort operation with the element comparison operation.

Yes, users should not be expected to know or care. They want "natural sorting". But we as programmers should be expected to know the difference.
A programmer should know what comparison sort is (and the difference between it and other types), and how it's composed of discrete comparisons.
A programmer should know more than just to use "items.Sort()".

M on December 13, 2007 5:39 AM

Asks a bunch of users to sort physical objects (or a list of them). They will all come up with different schemes. There is only one that makes sense in the case that was described and its sorting by object number of a specific type.

. on December 13, 2007 5:42 AM

If you want to sort considering international alphabet you should add also a collation algorithm: http://en.wikipedia.org/wiki/Unicode_collation_algorithm

Eduardo Diaz on December 13, 2007 5:52 AM

This statement bugs me, and I'm sort of surprised that no one as brought this up:

"Perhaps a more human-friendly natural sort option should be built into mainstream programming languages, too"

The programming language is NOT the collection of standard libraries or framework that are generally provided with a compiler. This statement would lead one to believe C# and the .NET framework are one and the same, or that C is not C unless you have implementations of stdio, etc. Too many people already think like this, so please don't perpetuate this point of view.

Daniel on December 13, 2007 5:54 AM

The Python implementation use some methods, like isdigit(), split,...

if you implement a few routines its possible to write C, C++, C#
code to natural compare strings using a couple of lines!

Here is a simple algorithm:
//1., 'ax103cool' & 'ax2008hot' -> 'ax'
offset = GetCommonStringPrefix(strA, strB)
No(offset <=0): use the usual compare routine

//2., '103cool' & '2008hot' -> '103', '2008'
endOfsA = AreAnyNumbers( strA +offset )
endOfsB = AreAnyNumbers( strB +offset )

No: use the usual compare routine
Yes: CompareNumbers( strA +offset, strB +offset, endOfsA, endOfsB )

Nikos on December 13, 2007 5:54 AM

"Perhaps a more human-friendly natural sort option should be built into mainstream programming languages"

Enter Tcl's "lsort -dictionary". Not surprising from a language whose aim is to be as human-friendly as possible.

> lsort {a1 a10 a100 a2 a3 a20 a30 a200 a300} ; # default ASCII sort
a1 a10 a100 a2 a20 a200 a3 a30 a300

> lsort -dictionary {a1 a10 a100 a2 a3 a20 a30 a200 a300} ; natural sort
a1 a2 a3 a10 a20 a30 a100 a200 a300

See how life can be simple?

BTW, Tcl's license is BSD-style. Help yourself, grab the code.

El Fredo on December 13, 2007 5:58 AM

For differences in Human-sorting, consider the issue of surnames that start with Mc or Mac (e.g. McDonald, MacFarlane, McFarlane etc). A common problem here in Scotland.

Some organisations/individuals will sort "Mc" as if it reads "Mac" (I believe British Telecom do this in the phone book).
Others treat them separately.
Others place MC and Mac before all other names starting with M.
Others ignore the construct altogether and would file McFarlane under "F"!

Likewise, if we are sorting 'sensibly' then we also need to consider hyphenated and non-hyphenated words (a human would put "e-mail" in with the rest of the "email").

It is worth looking at Schwartzian Transforms for this:
http://en.wikipedia.org/wiki/Schwartzian_transform

Joe Cheng offers some examples of writing Schwartzian Transforms in C# 3.0:
http://jcheng.wordpress.com/2007/08/23/schwartzian-transform-in-c-30/

Graham Stewart on December 13, 2007 6:01 AM

Well windows 2000's explorer didn't sort file the natural way, so i've learned to add leading zeros..

Manu on December 13, 2007 6:09 AM

There are even thornier issues about sorting and searching when a user finds that Etienne is in the E section and Étienne is listed past the Z area.
Any special language characters require special consideration to deal with and if you have more than one language set possible as input values then get set for some fun.

Mistral on December 13, 2007 6:11 AM

Hey Now Jeff,
This is an issue we all see often just like the when my focus is stolen. (http://www.codinghorror.com/blog/archives/001011.html) I like both of these posts since it's an issue we all deal with & notice.
Coding Horror Fan,
Catto

Catto on December 13, 2007 6:14 AM

My solution for the number problem:

http://undev.spaces.live.com/blog/cns!6526A337A687E45D!142.entry

Of course realizing domain concerns may cause you to write your own alogorithm like Graham suggests. Most times you don't need that level of control.

Josh on December 13, 2007 6:15 AM

The filename isn't a good interface to photo tracking anyway. @Ross is right when he says it's no fun (re)naming the photos after a day of taking them. Even if you could name them as you were taking them, you wouldn't, it would interrupt the adhoc nature of taking photos. And names like birthday-cake.jpg, football-cheerleader.jpg and zoo-animals.jpg like @Michael Starke suggested arn't good either, those names are hardly descriptive enough to find one picture a group of files of any useful size (would you really take ONE picture that had MANY animals in it on a trip to the zoo anyway?)

The proper interface for sorting photos is VISUALLY, since photos are visual media. Displaying photo file icons as thumbnails in the file manager was a huge step in the right direction there. But it's still deficient. The photo file manager should work visually too, letting me move files around to "sort" them and remembering those locations (the big grid of photos in many photo management apps doesn't really help, except to browse all photos -- and improperly mirrors the real-world "contact sheet" concept which exists for different purposes).

But back to the issue: sorting. Users are ambiguous in what they want, or they don't even know what they want often times. I've had to handle support calls of the form (not recently, thank god):
"I can't find the file I saved yesterday, can you help me find it?"
"Sure, what did you name it?"
"I don't know."
The filename is THE interface to file management currently. I'm not sure there is much hope, unfortunately, for users who are unable to adjust to that by now (not saying it's perfect in 2007, but if you are going to be productive with certain tools (computers), you should know what's important, especially when the interface arguably sucks).

Our goals should be to make their jobs easier (but easier does not mean "requires no thinking"). This means, in many cases, providing MORE easily accessible options. Changing the sort algo should maybe be provided as part of the sorting interface. Click column headings to change the sort order (asc or desc), right click to select a sort algo for that column of data. Can't navigate your files when sorted ASCIIbetical? Change it to natural order. Still not finding it, then change to an algo that sorts by trying to interpret date formats within the filenames (prime candidate for providing after-market "sort algo plugin" functionality -- since you can't predict what users are going to want, so enable someone else to build it).

Some people think giving users FEWER options is better (uh, ahem, a lot of the Gnome interface), but all that does is make the user change their habits to do things a certain way. It's not about reducing the number of options or hiding options, it's about making them more easily and obviously accessible. It doesn't make sense to force someone to go to Edit | Preferences, then vgrep a huge list of options or checkboxes to change their default sort algo (if even it was an option), but it does make sense to build those options into the UI widget that is the common or accepted endpoint for that functionality.
Rather than debate which sort algo the users are expecting, we should enable users to exploit multiple sort algos and switch between them easily based on their mood or work style.

Axel on December 13, 2007 6:17 AM

If you dim the array as a string and use the sort method, it will produce results in the literal way, but if you dim the array as an integer, it usually works in the logical way. (In VB.NET)

Zak on December 13, 2007 6:17 AM

Yea for Ruby's sort_by method on Enumerable objects!

a = ["x1x10", "x10x1", "x5x5", "b55c5", "x3b4", "x3b2"]

p a.sort_by {|string| string.downcase.scan(/([a-z]+|\d+)/).flatten.collect {|s| s.match(/\d/) ? s.to_i : s } }

Ok, so it's a bit more complicated than that (of course). Wasn't taking into account non alphanumeric characters:

p a.sort_by {|string| string.downcase.scan(/([a-z]+|[^\d]+)/i).flatten.collect {|s| s.match(/\d/) ? s.to_i : s } }

It's probably still not perfect, but it's workable now, at least.

Brian on December 13, 2007 6:20 AM

An interesting side-effect of the Explorer "human" sort is that some programs create numbered files without the leading zeros because they know they'll be shown in the right order in Explorer. Then you bring that into another program, and you have the issues.

Scott on December 13, 2007 6:23 AM

@Michael Starke & Paketep:
There are no stupid users, only stupid developers. Forcing users to do things a specific way or refusing to recognize that a users way of seeing things could be right is arrogant and blatantly idiotic. If a user wants to name his or her files yyyyy1, yyyyy2, etc. who do you think you are to tell they are wrong? You should not gently be prodding anyone or anything except your own ignorance and arrogance.

Right, sorry about the flaming, but the above posts came across so mindnumbingly stupid that it couldn't be helped. I'll get my coat.

Fake

Fake51 on December 13, 2007 6:26 AM

>the gnarly i20n issues I've hinted at

I've seen (and created) more than my share of internationalization bugs, but this self-referential mis-shortening (should be i18n) is exceptionally sweet!

Mark Nelson on December 13, 2007 6:32 AM

Why all the drama about sorts? Sometimes it makes sense to sort it one way, sometimes the other. There's no way a programming language could intrinsically know when either is appropriate. So occasionally you need to insert a few lines of code. Oh horrors.

It's not like you need to re-invent the wheel each time either. Most shops have a pre-written function or at worst code to post in.

Next are we going to mull over the tragedy of lower/upper case conversion?

Karl on December 13, 2007 6:35 AM

>Why are they choosing file names like that though?

Because they feel like it. Because they are mad. Because they are masochists.
Since when do they have to explain why they want something?
They are the users. They want something. It is not your business to know why or to tell them what they need. If you think you know better (and many times you really do know better what they need) then by all means do it, implement it, talk to them, show it to them, but you have to convince them to accept your suggestion. They dont have to convince you for the necessity of their demand.

As programmers It is our business to offer the best but in respect of the limits users pose to us no matter how we judge these limits. We can always warn or suggest or even refuse to do something we dont want for whatever reason, but we can never force the user to accept our suggestion. In IT like in life, purpose DOES NOT justify the means.

If software was politics, you would be a 'fasist'. (No offense)

In fact every kind of programmer can be described as a politician. Jeff do it. Write about it. Please. It will be fun.

bill on December 13, 2007 6:37 AM

What's iinternationalizationn?

JJM on December 13, 2007 6:39 AM

IanG on Tap (http://www.interact-sw.co.uk/iangblog) just blogged about this this morning. His solution: http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting

Max on December 13, 2007 6:42 AM

I think a lot of you have missed the point here. The post was not about filenames, it was about sorting *strings*. Natural sort makes more sense than ASCIIbetical in almost all situations.

Our bug tracking system sends us emails about bugs with a subject like "[ProjectName 10] Something went wrong". A natural sort on subject gives me bug 9 next to bug 10, rather than seperated by 8990 other bug reports.

Making a room booking application? Why isn't F9 before F10 in the room list?

Also, the most important thing to remember when deciding wether to "fix" this sort of problem is not wether or not the user is being stupid, but wether making the change will make your users happier and more productive. Forget everything else.

Mat Scales on December 13, 2007 6:44 AM

In case anyone is interested: My opinion on the matter is sufficiently described in http://nikolai.prokoschenko.de/archives/161 (technorati is slow to update)

Nikolai on December 13, 2007 6:44 AM

Leading zeros have the problem that they limit the collection size at 10^(z+1) - 1, and people just don't like the idea of having such a limit, because they don't know what to do if they change their mind. It's like line numbers in BASIC: you shouldn't have to deal with it yourself.

Also, normal users usually don't deal with mass renaming and Windows Explorer is not everything and the kitchen sink for renaming (which is a design decision). It's much easier to solve this using a tool like Lupas Rename, which has (for most users) a daunting complexity, if users even know such a tool exists.

@ people calling it a case of "stupid users":
Little natural language tricks like extracting the address from an e-mail signature, the name below "regards, xxx" and understanding that "thursday next week" means Date.Now().AddDays(7) are what adds UI polish to your application. It's the "they thought of -that-, too!" that's sorely lacking in a lot of applications.

Rob Janssen on December 13, 2007 6:48 AM

*DO* let Ned's 10-liner tempt you. You know you want to use Python. Embrace the Python. It is your destiny.

Daniel 'Dang' Griffith on December 13, 2007 6:49 AM

The leading zeros perspective is interesting. Apartment and office buildings will commonly number the units using the hundreds place for the floor and the tens and ones place for the unit number (limiting the number of units per floor to 99, a seemingly arbitrary limit, albeit practical), making the first nine units on the third floor as 301 through 309. Why aren't these 31 through 39?

Axel on December 13, 2007 7:00 AM

As a data warehouse developer, I'm no stranger to the way business people expect to see things sorted, and it's never ASCII... One thing I do is to add extra columns to my databases and put in "corrected" versions of the keys I'm using, so my table might look like:

Customer Number|Customer|SortKey1
MS948|Fred Smith|MS00000948
MS9215|John Star|MS00009215

..which means I get to use SQL's standard ASCII sorting (indexed for speed, if need be) rather than muck around with anything in client applications. The client presents the real key to the user, but sorts by the pre-calculated ASCII-sort-safe key, and I only have to do the hard work once, up front.

Matt Gibson on December 13, 2007 7:04 AM

@Rob Janssen I wholeheartedly agree. I got an email the other day and was mightily impressed when Apple's Mail.app detected the words "7pm next Tuesday" in it and wrapped them in a subtle little mouseover drop-down that prompted me to create a new appointment in my calendar for the right time and date, with the title being the subject of the email. Now _that's_ impressively user-friendly and integrated...

Matt Gibson on December 13, 2007 7:08 AM

What about z1.2.txt? Do these natural sort algorithms handle decimals? :)

troll on December 13, 2007 7:16 AM

Shouldn't any user-facing sorts be done with the locale's collation function? Isn't this its responsibility?

Robert Fisher on December 13, 2007 7:16 AM

@Matt: I'm constantly pissed off by Google Mail, which considers any long number a FedEx tracking number -- in Germany. This whole heuristics is entirely context- and language-specific and it needs better means.

Nikolai on December 13, 2007 7:25 AM

known Problem, easy to avoid. If i list such a directory in shell, i just use ls|sort -g. My approach implementing this would be, to determine first, which sort is wanted (ask the user). Then sort ascii as long as there are non-number chars, then sort the numbers as ints.

allo on December 13, 2007 7:29 AM

Numeric collation in Cocoa, and the sort order in the Mac Finder is a result of the property UCOL_NUMERIC_COLLATION being set on an ICU collator. ICU (available in C/C++ and Java from icu-project.org) underlies MacOSX collation, and Apple contributed this functionality to ICU.

http://www.icu-project.org/apiref/icu4c/ucol_8h.html#583fbe7fc4a850e2fcc692e766d2826c2efca83794416797ef04abda570c6f5b

This will also work with any non-ASCII digits, arabic, indic, etc.

Srl on December 13, 2007 7:34 AM

As was mentioned by David Cumps early on, PHP has had natural sorting (natsort) for quite some time. The earliest comments for that function are from May of 2003.

More than enough time for Microsoft to have copied yet another idea... errr... have invented natural sort for .Net.

Same for Ruby and all those other lame languages that you know Jeff.
And I use the word lame because as you said:
"None of the common computer languages (that I know of) implement anything other than ASCIIbetical sorts."

And PHP cannot be called a non-common computer language.

rustyvz on December 13, 2007 7:35 AM

This is the implementation I always use, is very good IMO.

http://www.codeproject.com/KB/recipes/csnsort.aspx

Iceman on December 13, 2007 7:41 AM

BTW, some digital cameras are smarter than others. Read about it here: "http://www.bbspot.com/News/2005/11/sony_photo_sharing.html".

Daniel 'Dang' Griffith on December 13, 2007 7:43 AM

paketep you surely make a great programmer with that attitude hahahah

LOL on December 13, 2007 7:51 AM

More about StrCmpLogicalW on MichKa's blog:
http://blogs.msdn.com/michkap/archive/2005/01/05/346933.aspx

Serge Wautier on December 13, 2007 7:52 AM

Probably I'm the only person here who has this problem, but try using Windows Explorer to work with a long list of files whose file names are GUIDs. Whatever order Windows Explorer uses to "sort" those, it sure ain't human-intuitivable.

Programmer 1's solution: "Why are you using GUIDs for file names?"
Programmer 2's solution: "They're sorted, you just don't see how."
Programmer 3's solution: "Who needs sorting, anyway? You can just search the files."
Programmer 4's solution: "Windows Explorer is stupid."

Beep! Sorry, you are not a winner. Please play again.

mike on December 13, 2007 7:57 AM

Okay, we all agree that "z4.txt" should come before "z21.txt", but what about "z4foo.txt" and "z21foo.txt"?

We could make a general statement that files should be sorted numerically if they end in digits, but what about "z4-foo.txt" and "z21-foo.txt"?

Or a more complex example:

abc-23-foo.txt
abc-d43-foo.txt
abc-d2-foo.txt
abc-100a-foo.txt
abc-12-foo.txt
abc-12a-foo.txt

Be very careful with what you wish for. You might actually get it.

David on December 13, 2007 7:58 AM

Java has had locale-specific sorting for a long time, but the default is to use the Unicode equivalent of ASCII order. To sort an array of Strings in Java properly you have to say:

Arrays.sort(myStrings, Coallator.getInstance());

rather than:

Arrays.sort(myStrings);

The singleton Coallator.getInstance() yields the default comparator for the current locale. So running it in different countries yields a slightly different sort order.

David Leppik on December 13, 2007 8:03 AM

I think everyone, including the original ranter, has missed the programming issue.

Alphabetical sort order is legitimate, as is natural order, date order and lots of others.

Differing users will want a different sort order out of the exact same app, AND field. An ideal app would choose a default and give the user the option to change it.

You can sort by file Name in Explorer, and it sorts in natural order. But what if I want straight alphabetical?

An ideal Explorer would let me right click on the Name header, and alter the specific sort order for that column.

John Pirie on December 13, 2007 8:19 AM

Some memes are toxic.

Most of the commenters still consider this a sorting issue, or a file name issue, when in fact we know by that the root problem is the relative order between two strings.
To put it simply, we are looking at string comparison.

I also vote for the "not part of the programming language". Such a string comparison should (perhaps) be part of the standard library, or framework, but definitely not part of the language per se. Especially because it's just so ill defined.

By the way, I am interested. Do the various implementations mentioned support "natural order" for Arabic digits?

M on December 13, 2007 8:24 AM

Long lists of GUIDS hmmm. Now there's something that the natural sort barfs on. It doesn't know to sort them in hex like ASCII sort would.

Joshua on December 13, 2007 8:24 AM

I'm amused by some of the "one line program" examples listed. I have a one line program: SortListByLastNameAndPrintThenSortByFirstNameAndPrint(phoneBook); Some of us seem to be forgetting that implementations of function require lines of code. The source lines in some library functions can and should ignored in the LOC count. Other library functions should be included in the LOC. Where do we draw the line? Vendor supplied libraries? Partner company supplied libraries? Guy-in-the-next-cube libraries? My own libraries?

LOC = lines of code

Les on December 13, 2007 8:29 AM

In the old days, when we used to write our own solutions, libraries, and routines, we would devise a specific sort routine to suite the task. Now, now that we are being reduced to library “users” and block pushers, I’m too lazy to even consider writing one in c#.

Steve on December 13, 2007 8:33 AM

mmm php natsort()
http://us2.php.net/manual/en/function.natsort.php

phil f on December 13, 2007 8:35 AM

Ruby version similar to the mentioned Python short:

module Enumerable
def sensible_sort
convert = lambda {|text| Integer(text) rescue text }
alphanum = lambda {|key| key.split(/(\d+)/).map(&convert) }
sort_by(&alphanum)
end
end

%w{a1 a2 a3 a10 a11 A12 A13}.sensible_sort # => ["A12", "A13", "a1", "a2", "a3", "a10", "a11"]

The convert lambda would be better off comparing the text with a regexp or so to ensure it's a digit; exceptions are slow. But they're prettier here, and it's fine for short lists.

Thomas Hurst on December 13, 2007 8:37 AM

If you listen, normal people have novel ways of thinking about computers that are completely natural and that we've entirely lost sight of from long habit. When I was teaching my Mom to use a computer (a Mac) I had a hard time explaining to her why she couldn't name all her pictures of me "Jerry." "Well, if they were all named Jerry, how would the computer know which one you meant?" "Why, the one I just clicked on, of course," she said, giving me a funny look. After all, the images have icons that are thumbnails of their contents, so it's perfectly obvious to her which one she wants!

And you know, that sort of logic is very hard to refute, and it's perfectly feasible for a GUI-driven computer to work that way. Now, programmers understand why any chunk of data that needs to be accessed needs a unique key, and find that natural, but that key doesn't necessarily need to be the file's name, nor does the name need to be the primary way users identify files. Every file has an inode number or other unique file system identifier anyway that could be used behind the scenes, and that's actually a more persistent way to refer to the file programmatically than the name (which is why the Mac uses it for aliases).

My mom did eventually accept that each file in a folder needed a unique name and that if she wanted to merge two folders she might have to rename some files to avoid losing them entirely, which of course, if you think about it from her perspective, is a dumb outcome from a dumb requirement. So now she's one of us.

Naturally, it was also obvious to her how files should sort, and I was happy to install an extension that did that for her on her Mac OS 9. At least I could solve that problem.

Jerry Kindall on December 13, 2007 8:51 AM

I find that "natural" ordering painfully annoying.
Why?
It does not work with hexadecimal numbers
Each implementation is different, so I _never_ know where the file will end up. ASCII may be unintuitive, but at least it is predictable...
Yes, there are cases where it is clearly preferable (well actually for me the only case is utterly idotic programs that do not put leading 0 in file names) but overall it has been more of an annoyance to _me_.
Or to summarize: this is basically UI desgin. And in UI design, if you think something is "obviously" better then you did not think about it enough.

Reimar on December 13, 2007 8:59 AM

I have to half agree with the pro-ascii sentiments. If our users are adding a numbering scheme to their file listings, then we as programmers have failed to give them the provision to add that kind of metadata in the proper space. Attempting to cover every possible numbering scheme the user could come up with would provide very difficult, but a simple drag-and-drop provision that remembers the way the user sorted the files would remove that numbering need all together.

Bill on December 13, 2007 9:22 AM

First of... great blog. This post is genius and hits at the very problem with programmer. They are not designers. As a programmer you need to be an architect as well as designer. Its not what is easiest to program, but what is the best experience for the user. I completely agree that natural sorting should be a feature of all sorts - and default for the end user.

VelkyMX on December 13, 2007 9:44 AM

using System;
using System.Runtime.InteropServices;

class C
{
[DllImport("shlwapi.dll", CharSet=CharSet.Unicode)]
extern static int StrCmpLogicalW(string x, string y);

static void Main()
{
string[] names = new string[] { "string200", "string1", "string10", "string4", "string2", "string100", "string20", "string40" };
Array.Sort(names, StrCmpLogicalW);
}
}

c on December 13, 2007 9:44 AM

"Users will inevitably complain that their items aren't sorting properly, and file bugs on these "errors". Being card-carrying members of the homo logicus club, we programmers produce a weary sigh, and try to keep any obvious eye-rolling in check as we patiently inform our users that this isn't an error."

I think the problem really comes back to this. We're basing decisions about how to interface with the user on implementation details. Users have no knowledge whether we use built in sort or implement our own sorting/comparison algorithms. So, the requirement is on us to make sure that our software interfaces the correct and natural way.

Not that I always succeed, but I'm always trying to think like a user first, then find an implementation to provide that behavior. Too often, we tend to think like a developer, then provide a user interface to that implementation. This goes for natural sorting, but also the rest of the UI!

Steve Bush on December 13, 2007 9:53 AM

As usual, for any given problem there is a solution in python which is elegant, succinct and _fundamentally broken for corner cases_.

What does your four-liner do with numbers that don't fit into an int?

Hacking something together for a one-off script is one thing. Writing robust code that will stand the test of time and handle any case the user throws at it is rather more difficult. It should not be a surprise that such a beast needs 40+ lines of code to get it right.

Eric "purple guy" Lippert on December 13, 2007 9:59 AM

Thank you for that awesome post, Jerry. It really helps keep things in perspective.

A lot of programmers seem to get easily offended at the slightest implication that the "dumb" end-users actually have a greater awareness of their own computing needs than said programmers do.

Axel also makes a brilliant point about the filenames of photos (to use that example) being essentially unnecessary for users to deal with. How did our parents view, sort, and catalog their photos? The albums with my baby pictures have four photos per page, with the dates written underneath. And the photos are generally arrange in chronological order. The standard Windows Explorer list, details, or non-thumbnail icon views (or DOS "dir" listing) of photos is utterly useless for the overwhelming majority of tasks. Unless you're writing code that needs to know the disk-based filenames of photos, then the filenames are just extra clutter to have to deal with.

I long for the day when I don't need to be exposed to any file-system baggage to organize, view, or work with digital photos. The metaphor of the physical photo album is likely the most natural means of access for anyone who is not a programmer. Likewise, I long for the day when I can organize, copy to my iPod, and play all of my music and video media without ever having to think about the directory structures and filenames. iTunes doesn't really cut it yet, and neither does Windows Media Player.

And I'm not a "dumb" or "casual" user. I've been assembling and troubleshooting PCs for 20 years, I've worked doing systems integration and tech support, and I've been programming for the past 8 years (not including high school, as I didn't program for about 6-7 years in the middle). I used to be proud of my command-line DOS-fu. Now I just have better things to do with my time, and I think that computer programs should work the way that the majority of users expect them to.

Chris Lasting on December 13, 2007 10:03 AM

Using one of the string compare algorithms that have been mentioned here in a nice extension class can get you code like this without having to rewrite your comparer every time:

string[] testItems = { "z24", "z2", "z15", "z1", "z3", "z20", "z5", "z11" };
testItems.NaturalSort();
foreach (string item in testItems)
{
Console.WriteLine(item);
}

It's one line of code and what could be more natural than

testItems.NaturalSort();

Here's the extension code I played with:

[edit: a bit too much code for a single comment. Sorry. Removed. Please host on a blog entry if you're posting many lines of code! ]

Tyler Jensen on December 13, 2007 10:07 AM

Silly rabbit, 'sort' is in the eye of the beholder, not the coder.

Steve on December 13, 2007 10:08 AM

This happens to be one of the very first Perl golf challenges (see http://www.xs4all.nl/~thospel/golf/challenge.html). It can be done in 58 characters. Not that I'd recommend this approach in production code...

Stephen Turner on December 13, 2007 10:21 AM

It's worth noting you can take hundreds of files and rename them with leading 0's in less then a second.

An app I use for this purpose from time to time is FlashFileRenamer.

If you can get your code to do what you want, thats totally cool too. But renaming the files might sometimes be a good solution.

From their site:

"Add number sequences (001, 002, 003, ...)
Pad existing numbers with zeros to make them sort correctly. "

Screenshot of that function:

http://www.rlvision.com/flashren/screen_enum.gif

More Screenshots:

http://www.rlvision.com/flashren/screenshots.asp

Mike on December 13, 2007 10:37 AM

So using .NET 3.0 Method Extensions, instead of List.Sort you can use List.NaturalSort, hows that?

using System.Runtime.InteropServices;
public static class ListExensions
{
[DllImport("Shlwapi.dll", CharSet = CharSet.Unicode)]
private static extern int StrCmpLogicalW(string psz1, string psz2);

public static void NaturalSort(this List<string> list)
{
list.Sort(StrCmpLogicalW);
}

}

Eric Malamisura on December 13, 2007 10:45 AM

private static readonly Regex regex = new Regex("^[0-9]+", RegexOptions.Compiled | RegexOptions.CultureInvariant);
static int NatureSortComparer(string a, string b)
{
int indexA = 0, indexB = 0;
while(indexA<a.Length && indexB<b.Length)
{
if(a[indexA]>='0' && a[indexA]<='9')
{
if (b[indexB]<'0' || b[indexB]>'9') return -1;
string strNumA = regex.Match(a.Substring(indexA)).Value;
string strNumB = regex.Match(b.Substring(indexB)).Value;
int numA = int.Parse(strNumA);
int numB = int.Parse(strNumB);
if (numA > numB) return 1; else if (numA < numB) return -1;
indexA += strNumA.Length - 1;
indexB += strNumB.Length - 1;
} else if(b[indexB]>='0' && b[indexB]<='9')
{
return 1;
}
if (a[indexA] > b[indexB]) return 1; else if (a[indexA] < b[indexB]) return -1;
indexA++;
indexB++;
}
if (indexA > indexB) return 1; else if (indexA < indexB) return -1;
return 0;
}

http://rafb.net/p/C6odtX29.html

Sachman Bhatti on December 13, 2007 10:45 AM

Michael Starke: "I can think of only a few situations (though admittedly I may be missing some) where sorting ASCIIbetically produces counter-intuitive results. "


How about addresses? 9 Fake St. should come before 10 Fake St. shouldn't it?

Dealing with this right now, actually. Well... ignoring for the moment to be more accurate. ;)

Are we allowed to use the code people post here?

Telos on December 13, 2007 10:47 AM

The reason I posted the answer, and an explanation for why I just wasted 20 minutes on a problem I had no practical reason to solve:

http://xkcd.com/356/

Telos: You can use my code, anyone can - it's public domain. It can be improved though, and most importantly culture sensitive.

Sachman Bhatti on December 13, 2007 10:50 AM

When I started as a developer, I used to be proactive, rather than reactive. That is, giving the users what they want before they wanted it. Yes, it's not the actual definition of proactive, but it was my boss' definition of it, so there.

Eventually, I gave up on this for one very simple reason. When precognizing what the user wanted, 99% of the time I got it wrong. So, I started asking the users to tell me what they wanted and verified what they told me was correct, and I gave it to them. Guess what happened. They either didn't use what they asked for, or said it didn't work right and called the whole thing broken because of it. That was my epiphany; users are not stupid, they just don't really know what they want.

Later, we hired business analysts whose job was to talk to users and write up design specs that everyone involved signs off on. Then, we coded to those specs. I totally regret not coming up with that earlier in my career. It would have seriously reduced my stress level and improved my overall health.

Buck on December 13, 2007 10:50 AM

Jeff, you are mistaken on this issue. This has nothing to do with ASCII encoding. This is not an issue of "ASCII versus natural sort" - it is LEXICOGRAPHICAL ("dictionary-type") sort versus natural sort.

Of course, you don't have to believe me:
http://en.wikipedia.org/wiki/Lexicographical_order

The fact that letters and numbers are encoded in ASCII (or Unicode, or whatever) is irrelevant. Think about it - even in the "real world", where you don't use ASCII, letters and numbers have to be ordered SOMEHOW, otherwise how do you determine the order of 2 strings such as "foo1" and "fooA"?

The issue is that lexicographical sorting compares only *one character at a time*, whereas natural sort groups numerical substrings. For example, with lexicographical sorting, "foo123bar" would appear before "foo2bar", because "foo1" < "foo2" ("1" < "2"). See? Nothing to do with ASCII. Even a non-techie would agree that 1 is less than 2. With natural sorting, numerical substrings would be grouped, so the partial comparison would be "foo123" > "foo2"; therefore, "foo2" would appear before "foo123" in a natural sort.

Anyway, if anyone needs a natural string compare function for C, here you go:
http://sourcefrog.net/projects/natsort/

Will on December 13, 2007 11:05 AM

> I've seen (and created) more than my share of internationalization bugs, but this self-referential mis-shortening (should be i18n) is exceptionally sweet!

Yes, it's a good one. :) fixed.

> What does your four-liner do with numbers that don't fit into an int?

There are a lot of magical things about the four-liners that make me nervous..

> alphanum_key should be alphanum for this example?

The original commenter changed the name and I didn't catch it. Fixed.

Jeff Atwood on December 13, 2007 11:08 AM

Jeff, nobody here has yet pointed out that you messed up the definitions of "alphabetical", "ASCIIbetical", and "natural" sort orders.

"Natural sort order" is an empty phrase, essentially meaningless. Programmers often use it to mean "the order you get when you convert everything to an integer or a sequence of bytes, and sort those". Users often use it to mean "the order I'm thinking of right now".

"ASCIIbetical" sort order is the order you get when you treat a string as a sequence of ASCII bytes, and sort those. It is the order given by strcmp() on an ASCII machine, in which "100" < "25" < "Banana" < "apple" < "banana".

"Alphabetical" sort order is alphabetical sort order, in which "apple" < "Banana", and "Apple" and "apple" sort next to each other. It doesn't say what to do with numbers or other non-letter characters. It is a part of the order given by strcmpi() in libraries that have it (although strcmpi() has to do something consistent with all the non-letter byte values, too, of course).

Then there's the librarian's sort, in which "apple" < "Banana" < "eight" < "ninety-nine" < "25" < "twenty-four" < "zero"; it's easy to get computers to do this, but I don't know of any standard libraries that support it out of the box.

What you want, in which "2" < "100" < "Apple" < "ninety-nine" < "Z100", is a kind of unnatural hybrid of all these methods. I assume Perl supports it somehow (after all, Perl is the language in which "z99"+1 = "aa00"), but as other people have commented, it's not the sort of thing you ought to be trying to do in the first place.

Anonymous Cowherd on December 13, 2007 11:09 AM

Anonymous Cowherd, as I pointed out, the term "ASCII sort order" or "ASCIIbetical" is wrong. We are talking about lexicographical sort order (only 1 character is compared at a time), and it has nothing to do with ASCII.

You could use strncmp (case-insensitive compare) for your sort comparison function, and you would still get case-insensitive lexicographical order. e.g. "abc1, ABC234, abc3"

Natural sort order does mean something: it means "modify the lexicographical sorting algorithm so that numerical substrings are appropriately grouped and compared, rather than always comparing one character at a time".

Will on December 13, 2007 11:14 AM

It's not a sorting problem, it's an ordering problem. A sorting algorithm knows nothing about the data except a way to compare two elements. Most programming languages, of course, get this right. Having a natural_sort function is the braindead solution (no wonder PHP has one).

So why there isn't a 'natural order' compare operator for strings, you might ask? Simple answer, because lexicographical ordering (no, there is no such thing as ASCIIbetical ordering, thanks for letting us participate to your collective ignorance) is a well defined mathematical concept, 'natural order' is not. Yeah, we got that x-10 goes after x-9, where does x-cover go? Before x-1 or after the last element? What is the natural order of hexadecimal strings?
The 'natural order' depends on what you're using your strings for, so yes, you'll have to write it yourself, each time. Asking for 'natural order' is like asking for the 'do_the_what_the_users_thinking' function.

E[X] on December 13, 2007 11:18 AM

I donated code to the winvn source base (news reader started by nasa) that does this type of character ordered in C. The current maintainer cleaned up a bit for his needs (and to make up for my lack of C skills, I am primarilly a delphi developer, and my C gets mighty rusty) -> You can get the source for WinVN at http://www.marks-lab.com/

I also have the Delphi version I maintain for myself. I should probably post it for people somewhere.

Xepol on December 13, 2007 11:19 AM

This one hits home for me. We have to manage things like lists of addresses, and it really annoys the users when "Unit 1208" goes before "Unit 300". I probably could have solved this from scratch, but looking at some of the Java/Python/Perl implementations made it much clearer what had to be done.

So, for the rest of you, here's a C# implementation that actually works (i.e. produces results identical to Dave Koelle's Alphanum, which I thought was the best of the three), and doesn't rely on the quirky Windows filename sorter:

http://convolved.com/code/AlphanumComparer.cs

There's probably a more "compact" way to write it, using fewer lines of code, but I honestly don't care. I'd much rather have something that's easy to understand and make changes to.

If someone else already beat me to the punch, then I apologize. If not, enjoy!

Aaron G on December 13, 2007 11:21 AM

Jeff, to be clear, here is an example of lexicographical comparison, which illustrates why it is has nothing to do with ASCII (or Unicode, or any kind of computer encoding.)

Lexicographical comparison tells us to compare ONLY ONE PAIR OF CHARACTERS AT A TIME. When we hit a pair of character that is different, the ordering of those 2 characters determines the ordering of the 2 strings.

Consider the 2 strings "foo123" and "foo2". Which string "comes first", according to lexicographical sort order?

Compare 1st character: "f" = "f" (the strings are still the same, at this point)
Compare 2nd character: "o" = "o" (the strings are still the same, at this point)
Compare 3rd character: "o" = "o" (the strings are still the same, at this point)
Compare 4th character: "1" < "2" (therefore "foo123" < "foo2")

The only difference with "natural sort" algorithms is that numerical substrings are grouped and compared, so the 4th step becomes:
Compare 4th letter/numerical substring: "123" > "1" (therefore "foo123" > "foo2")

Wow, did you see how I made NO reference to ASCII (or even computers) in that example.

Will on December 13, 2007 11:26 AM

Also, the one that says don't need filenames. I'm with you users have no business with file names and directory structure. Unfortunately at this point this would require a major rewrite of almost every existent program, I doubt it will happen anytime soon.

E[X] on December 13, 2007 11:27 AM

By the way, for everyone who thinks "ASCIIbetical" (actually LEXICOGRAPHIC") sort order is stupid, how do you think dictionaries are sorted? Using case-insensitive lexicographical sort order.
http://en.wikipedia.org/wiki/Lexicographical_order

The funny thing is that lexicographical order is pretty much alphabetical order, despite what other posters here have claimed.

The only reason you don't see this issue with dictionaries is that there aren't a whole lot of english words with embedded numbers. For those of you who think "alphabetical order" can say nothing about where numbers should go: Now that "w00t" is apparently a new dictionary word (sadly), how do you think they'll put it in the dictionary?

Will on December 13, 2007 11:37 AM

"What does your four-liner do with numbers that don't fit into an int?"

Python can store arbitrarily long numbers into an "int".

Carlos Rodrigues on December 13, 2007 11:47 AM

@Chris Lasting:
"I long for the day when I don't need to be exposed to any file-system baggage to organize, view, or work with digital photos. The metaphor of the physical photo album is likely the most natural means of access for anyone who is not a programmer. Likewise, I long for the day when I can organize, copy to my iPod, and play all of my music and video media without ever having to think about the directory structures and filenames. iTunes doesn't really cut it yet, and neither does Windows Media Player."

Hmm. I do most of my photo work in Aperture, with a smattering of iPhoto (my wife and kids work in iPhoto). I couldn't tell you the file and/or folder names of any of our photos. I simply don't care about the file name (other than the root folder for backups). Talking with Windows friends about photo management ("but how do you organize pictures without meticulously named folders and files including the date and event name???") is like talking to a brick wall. But, for me, this is already a reality.

I'm not sure where iTunes fails you in this regard, though. I don't know or care what my music files are named (well, I suppose I "know" that they are named the same as the track name, and organized in my music folder by artist name then album name ... but on a specific basis, if I have two tracks in the same album with the same name, I don' know/care which one has a "-1" or whatever appended in the file system). When does iTunes make you look at the file system to deal with music?

Tom Dibble on December 13, 2007 11:51 AM

This is one of the typical programmer friendly features you find in open source languages all of the time, yet somehow escape mainstream closed source languages. Such features make so much sense that I want to punch my monitor when they are suddenly absent. For example, MySQL, in its collective, open-source wisdom, has the LIMIT feature. MSSQL(at least 2000, I don't know about newer versions) has a glaring lack of such a feature. Instead, anyone who wants to pull a subsection from a result set has to do a series of order flips using the TOP feature.

Mattkins on December 13, 2007 12:02 PM

Not that anyone will actually read this before posting, but *be very careful posting < and > here*. You'll need to encode them as &amp;lt; and &ampgt; otherwise they'll be interpreted as HTML tags. I know, it sucks, and I need to change it.

I've corrected all the above posts, at least the ones that contained angle brackets.

Jeff Atwood on December 13, 2007 12:07 PM

Cache (Mumps) means not having to sort.

SET SORTED("WILDFIRE")=""
SET SORTED("BANANA")=""
SET SORTED("DOGBREATH")=""
SET SORTED("ZAGNUT")=""
SET SORTED("JIM BOB")=""
SET SORTED("JIM BOB2")=""
SET SORTED("JI2 BOB3")=""

ZW SORTED

SORTED("BANANA")=""
SORTED("DOGBREATH")=""
SORTED("JI2 BOB3")=""
SORTED("JIM BOB")=""
SORTED("JIM BOB2")=""
SORTED("WILDFIRE")=""
SORTED("ZAGNUT")=""

ljfaraci on December 13, 2007 12:24 PM

Disagree.

First, my Windows Explorer (XP) sorts in the expected manner: z1.txt, z10.txt z2.txt, etc.

Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?

pwb on December 13, 2007 12:38 PM

"It does not work with hexadecimal numbers"

What? Really? Hex?
Not only do only only you and DEAD people know Hex, but most of them don't use it that regularly.
Seriously, designing a natural sort that caters to Hex is catering to a tiny little niche audience that can and will almost all rewrite the sort method themselves.

Tom on December 13, 2007 12:41 PM

"Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?"

Erm, every single person who graduated first grade, and then didn't go into the "wonderful world of edge-conditions". What's the matter with you?

Tom on December 13, 2007 12:42 PM

dwb wrote:
"Disagree.

First, my Windows Explorer (XP) sorts in the expected manner: z1.txt, z10.txt z2.txt, etc.

Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?"
-------------

XP defaults to "natural" sorting in Explorer. Someone must have used a registry editor or TweakXP to change that on your computer. It's called lexicographical ordering, "not ascii" sort, and I agree it's the most predictable. There are many ways to implement "natural sorting". (For example, should leading zeroes be ignored? How do you interpret groups of numbers separated by periods? etc., etc.)

Will on December 13, 2007 12:50 PM

Ruby:
module Enumerable
def sort_nicely
sort_by { |el| el.split(/(\d+)/).map { |s| s =~ /^\d+$/ ? s.to_i : s } }
end
end

nons on December 13, 2007 12:50 PM

Tom wrote:

(Quoting dwb) "Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?"

Erm, every single person who graduated first grade, and then didn't go into the "wonderful world of edge-conditions". What's the matter with you?

-------

Oh, really? Tell me, is "1.10" less than or greater than "1.2"?

If you are talking about decimal fractions, then "1.10" is less than 1.2".

If you are talking about SOFTWARE VERSION numbers than "1.10" is usually greater than "1.2"

What about "1.02.3" and "1.2.3"?

Natural sorting is ambiguous and ill-defined, JUST LIKE NATURAL LANGUAGE - I'm sure there are many implementations, just like there are many idiosyncratic forms of English grammar and spelling depending on what country you live in (e.g. U.S., Britain, Canada, etc.).

For example, the British always treat companies and sports teams as plural nouns (e.g. "Microsoft are an American software company."). Americans switch between plural and singular, depending on what "sounds right" (e.g. "The Miami Heat *are* an American professional basketball team. Miami *is* currently 2nd-last in the Eastern Conference standings. *They* have been struggling to win all season." I'm confused - is/are the Miami Heat singular or plural?)

Will on December 13, 2007 12:56 PM

This discussion is funny.

The problem isn't with sorting. The problem is with the data structure. If you want a hierarchy use one. Make a tree and sort the levels of the tree.

Take this shopping list for an example:

Wheat bread
White Bread
White Milk
Chocolate Milk


It's a list. If i sort the list i'll get:
Chocolate milk
Wheat Bread
White Bread
White Milk


Whaaah /cry the user says 'cause milk is seperated and not sorted propery!
Whaaah /cry the developer says "get used to it" this is the way computers have worked since the iron age.

the problem isnt the sort, the problem is the list. What we want is a Tree:

Bread:
Wheat
White
Milk:
Chocolate
White

Then when we sort we sort each level of the tree.

When you sort the strings:

dumbuser10
dumbuser2
dumbdeveloper10
dumbdeveloper2

If you expect the result:

dumbdeveloper2
dumbdeveloper10
dumbuser2
dumbuser10

Then you are expecting the tree:

levels of the tree.

Take this shopping list for an example:

Wheat bread
White Bread
White Milk
Chocolate Milk


It's a list. If i sort the list i'll get:
Chocolate milk
Wheat Bread
White Bread
White Milk


Whaaah /cry the user says 'cause milk is seperated and not sorted propery!
Whaaah /cry the developer says "get used to it" this is the way computers have worked since the iron age.

the problem isnt the sort, the problem is the list. What we want is a Tree:

Bread:
Wheat
White
Milk:
Chocolate
White

Then when we sort we sort each level of the tree.

When you sort the strings:

dumbuser10
dumbuser2
dumbdeveloper15
dumbdeveloper3
smartdeveloper1

If you expect the result:

dumbdeveloper3
dumbdeveloper15
dumbuser2
dumbuser10
smartdeveloper1


You are expecting a tree:
dumb:
developer:
3
15
user:
2
10

smart:
developer:
1


So fricken use a tree user and developer alike.

brian on December 13, 2007 1:03 PM

omg sorry about that post, i guess i'm dumb user 2 'cause i cut and pasted all fooked

brian on December 13, 2007 1:04 PM

As if you'd ever need to sort hexadecimal in a system with actual users. If you find yourself encountering this issue frequently, it's a strong indication of awful UI design and potentially terrible back-end design.

If some of you are really so concerned about this obscure edge case, put in an heuristic that checks for a string composed of only decimal digits and same-case letters A-F, with at least two characters from each category. One regex and a few lines of base conversion will take care of that. It'll be wrong sometimes, but again, if you're displaying hex to the users then you've already lost the battle.

And uh, brian, do you frequently see people organizing their shopping lists into tree structures? You've raised both the human and machine complexity exponentially, for no practical reason at all.

And Will: I really doubt that you'd often need to sort fixed- or floating-point numbers as a smaller part of an alphanumeric string. If you ever did need to do this, it would be much more likely to be something like a table of contents, where 1.10 actually *should* come after 1.9.

Jesus guys, it IS important to consider edge cases, but that doesn't mean you should scrap the whole idea if a few of those edge cases will produce incorrect results. You just make sure that you've defined this behaviour, and set expectations accordingly. I think what this really boils down to is that a lot of programmers are control freaks and can't entertain the notion that no answer is 100% correct. Instead of coming up with a 90% correct solution, they just bitch, moan, throw their hands up in the air, and give up on the problem altogether. Those people are going to get eaten alive in the real world.

Aaron G on December 13, 2007 1:34 PM

"natural sort groups numerical substrings"

Yes, but if "natural" means "the way humans do it", it has to do much more and handle other popular ordered substrings...

Days of week:
•tuesday
•wednesday
•thursday

Date/time:
•15-8-2004
•1-10-2004
•1-1-05

•5:00 AM
•14:00
•4:00 PM
(special tricky for 12:00-1:00 AM/PM)

Street adresses:
•9, James street
•10, James street
•10 bis, James street
•10 ter, James street
•5, Rudolph street

Product reference, or anywhere digits are not numerical, like phone "number":
-01K305
-1K205

Hexadecimal:
0001
F01A
f01b
(0-9 and A-Z/a-z in straight order)

Version:
•v1.9
•v1.10
•v1.10.1
(a period is not a decimal separator)

Sorting with or without leading article (Mr, Miss, St, "The" etc).

etc etc

And then there is the issue that some sorts depend on locale.
And then there is the issue that some strings were built with *other*, untold locale (like filenames).
And then there is the issue that some strings are ambiguous and require looking at the entire set.
And then there is the issue that some sets are fundamentally ambiguous.
And then there is the issue that since the sort depends on the set, it is not stable nor predictable.


If you want to make good human-like sort... good luck.

Musaran on December 13, 2007 1:35 PM

And in the case of the version numbering... thats a tree:

v:
--1:
----9:
------0
----10:
------0
------1:
--------0

If human want tree... human should use tree.

brian on December 13, 2007 1:42 PM

Windows Explorer in your example sorts by file name, not by file name with extension. Hence the difference.

If you apply Array.Sort() to file names only -- you would get result that is pleasant for users.

Dennis Gorelik on December 13, 2007 1:52 PM

I have both learned to use padding zeroes in the filenames, as well as written the´sorting a couple of times in my profession.

Hmm.. I bet there's an ISO standard for this somewhere. If people liked communicating across borders more, they'd be less exited about localization and more into internationalization. There is, seriously, no need to have a local way or writing dates. But then again, what do you expect from people who don't use the metric system...

Peterh on December 13, 2007 1:52 PM

Musaran, I agree with you totally - natural sorting is not well-defined. I was simply describing how most natural sort algorithms are implemented.

Aaron G, funny you should say nobody needs to sort decimal numbers as a sub-string. This natural sort algorithm actually supports sorting decimal sub-strings by NOT ignoring leading zeroes (although that behaviour can be disabled):
http://sourcefrog.net/projects/natsort/
Before you say "well, that's just some random guy's sort routine", he claims that his code has been incorporated into PHP:
http://us3.php.net/manual/en/function.strnatcmp.php

Actually, this whole discussion boils down to a central issue in human factors: How Software Works (well-defined algorithms) versus What Users Want (fuzzy, ill-defined specifications).

Once AI is smart enough to figure what users really want, we'll all be out of jobs. (The code will write itself).

Will on December 13, 2007 1:54 PM

Daniel 'Dang' Griffiths:

Didn't your teacher tell you you'd go blind if you continued to "Embrace the Python".

Maybe you could try "polishing the Ruby"?

On the topic of this post:

One sort approach is never going to work in all situations. Let your users try out the sort with various filenames etc, get feedback from them, change the sort to behave as close to their expectations as possible. Easy really.

Ash on December 13, 2007 2:00 PM

Ash wrote:
"One sort approach is never going to work in all situations. Let your users try out the sort with various filenames etc, get feedback from them, change the sort to behave as close to their expectations as possible. Easy really."

------

Yeah, I totally can see how that would work with Windows Explorer.
Each time you open Explorer, Clippy could pop out of nowhere and say "You seem to be viewing a list of files. How do you like the way I sorted them?"

Will on December 13, 2007 2:06 PM

Aaron G wrote:

"As if you'd ever need to sort hexadecimal in a system with actual users. If you find yourself encountering this issue frequently, it's a strong indication of awful UI design and potentially terrible back-end design."
---------

Are you sure nobody uses hexadecimal?

What if your users are developers (e.g. Visual Studio)? Or your application is highly technical (Modbus master/slave, network packet sniffer such as WireShark, debugger, etc.)?

Will on December 13, 2007 2:12 PM

Peterh wrote: "If people liked communicating across borders more, they'd be less exited about localization and more into internationalization."

------

The funny thing is, people do it all the time. It's called the Internet.

And what do you think happens when someone posts on an "American" message board using non-American spelling/grammar (e.g. British English treats companies and other groups as plural nouns) or non-American date conventions (e.g. DAY/MONTH/YEAR instead of the American MONTH/DAY/YEAR). They get flamed. People yell at them for using "bad grammar" or "not knowing how to write dates".

Of course, I'm sure this also happens when Americans post on non-American websites.

The point is that people hold on to their little cultural conventions and idiosyncracies as if they were authoritative and universal.

Good luck on shifting the focus from "localization" to "internationalization".

Will on December 13, 2007 2:25 PM

Hey Jeff, first time commenting. :)

Great blog, your recent blog about the 80/20 actually inspired me to start my own blog.

Saw this one and have to give it my own try:
Quick Java Comparator:
http://simplesql.tigris.org/servlets/ProjectDocumentList?folderID=0

Cheers!

Eric B on December 13, 2007 2:26 PM

@Tom:
"What? Really? Hex?
Not only do only only you and DEAD people know Hex, but most of them don't use it that regularly."

That's not the point. The point is what is the natural order for hex strings? The 'natural order' is under defined, you can't have it in the language (or in its standard library).
Only the programmer can decide what is the natural order. No, the user can't, the user sees only what she'd like you have to take into account all your users.

E[X] on December 13, 2007 2:35 PM

@ Aaron G:

"If some of you are really so concerned about this obscure edge case, put in an heuristic that checks for a string composed of only decimal digits and same-case letters A-F, with at least two characters from each category. "

Oh yes... right... lets put there an heuristcs so that we use two different ordering depending on the elements of the set. Wait... what appens if only some strings are hexadecimal?

"Jesus guys, it IS important to consider edge cases, but that doesn't mean you should scrap the whole idea if a few of those edge cases will produce incorrect results. You just make sure that you've defined this behaviour, and set expectations accordingly"

No, this is the problem. A lexicographical ordering is the same everywhere. The 'natural order' is not, you can't have it as a function in the programming language, it depends on the domain.

If you want to show a sorted list of strings to the user you have to choose one ordering. If, depending on the domain, there is an ordering that is *always*correct* you should implement that. If the list of strings is general enough you have to choose an ordering and screw some of your users. The lexicographical ordering has the undeniable advantage of screwing the users in a predictable fashion with a solid mathematical background.
The natural order is just something you pulled off your @ss, subject to change when you find a "bug".

E[X] on December 13, 2007 2:46 PM

E[X], well whether you like it or not, strnatcmp seems to be standard in PHP:
http://ca.php.net/strnatcmp

Just because "natural sorting" is ambiguous, doesn't mean someone can't make up a "standard" for it.

Will on December 13, 2007 2:47 PM

@Will
well PHP is also the language where "no one needs namespaces", calling that a *standard* is quite a bit far fetched. Is just the way they think it should be.

E[X] on December 13, 2007 2:52 PM

E[X] wrote: "well PHP is also the language where "no one needs namespaces", calling that a *standard* is quite a bit far fetched. Is just the way they think it should be."

-------------------

E[X], fair enough.

Natural sorting has its place (e.g. sorting lists of numbers, if the list is in text form, for some reason). Lexicographical ordering also makes sense in a lot of places. Unfortunately, as this blog post and many of its replies show, most people do not even understand the concept behind lexicographical (uh, sorry, should I say "ASCIIbetical") ordering, so good luck explaining to the end-user why his digital photos are listed in the "wrong order".

I agree with the sentiment that computer software should strive to do "what the user wants". I disagree with the knee-jerk reaction that developers and computer scientists are stupid because the language doesn't always help the developer do what the users want.

Will on December 13, 2007 2:57 PM

Valid point.

Natural sort would increase productivity across America. (White House paying attention?)

Rightnow, workaround is the best:

z001.txt
z011.txt
z100.txt

Kalman Toth on December 13, 2007 3:18 PM

Certainly computers should always do what the user wants, but in this circumstance I doubt that you would find a consensus amongst users at what the proper order should be anyhow.

Appending numbers to the end of filenames is a dead give away that the file was produced by a machine and in that case the name itself is not relevant. I mean if every single one of your digital camera pictures start with “PIC” and you had dozens of folders with PIC1.JPG then putting them together and sorting them will just end up with a nonsensical order anyhow.

Sorting them by the date would be much more meaningful to the user.

I also agree with one of the earliest posters who said this wasn’t a sorting problem at all but a compare routine problem that the sort program calls.

Davide on December 13, 2007 3:47 PM

Davide wrote: "Certainly computers should always do what the user wants, but in this circumstance I doubt that you would find a consensus amongst users at what the proper order should be anyhow."
--------

No kidding, that's sort of the point of this whole discussion.

The problem is that the "human factors" crowd is flaming the "stupid developers" because programming languages don't have a built-in "natural sort" function. Look at the excerpt of the blog Jeff posted:

"Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed."

This idiot doesn't even realize that computers DO sort in alphabetical order. The issue is that the ordering is lexicographical, not that it isn't alphabetical.

BTW, human beings produce numbered lists all the time. Even if you think they don't, your compare function should certainly handle that case, don't you think?

Will on December 13, 2007 3:58 PM

It's not as straightforward as it seems -- there are some really nasty corner cases. Can you think of an algorithm that will do "the right thing" in all of these cases?

1_1 vs 1_10

1.1 vs 1.10

1.1 vs 1.01

1.1.1 vs 1.10.1

12.25.2007 vs 9.9.08

Christmas 2007 vs My 32nd Birthday

If you can't get it right every time, you might be better off doing something that breaks simply and consistently.

So not giving numbers special handling is actually quite a good solution. Treating the digits at the *end* of the filename (photo1, photo2, photo10 ...) also makes sense, but going any further is just asking for trouble.

Peter on December 13, 2007 4:51 PM

I see a lot of posted code, but not a lot of test cases.

Jon Abaca on December 13, 2007 5:04 PM

Oh, and for those of you who still think this is an "ASCII sorting" problem, check this out:

CompareString()
http://msdn2.microsoft.com/en-us/library/ms647476.aspx

This is a Win32 string comparison function that does NOT use ASCII comparisons. It works with different locales and character sets (including double-byte characters), it supports multi-character elements and has options to "intelligently" sort words and it supports case-insensitive comparisons.

However, it will STILL produce the following sorted list, because it doesn't do "natural sort" with numerical grouping:
z1.txt
z100.txt
z1000.txt
z2.txt

Go ahead, try it with strings like "z1" and "z100". You will get the same result as plain old "ASCII-betical" strcmp().

And there IS a built-in natural sort function with numerical grouping in Win32:
StrCmpLogicalW()
http://msdn2.microsoft.com/en-us/library/bb759947.aspx
"Compares two Unicode strings. Digits in the strings are considered as numerical content rather than text. This test is not case sensitive."

While "ASCII-betical sorting" does have a meaning, it is not the problem here.


Will on December 13, 2007 5:05 PM

Oh, and for those of you who still think this is an "ASCII sorting" problem, check this out:

CompareString()
http://msdn2.microsoft.com/en-us/library/ms647476.aspx

This is a Win32 string comparison function that does NOT use ASCII comparisons. It works with different locales and character sets (including double-byte characters), it supports multi-character elements and has options to "intelligently" sort words and it supports case-insensitive comparisons.

However, it will STILL produce the following sorted list, because it doesn't do "natural sort" with numerical grouping:
z1.txt
z100.txt
z1000.txt
z2.txt

Go ahead, try it with strings like "z1" and "z100". You will get the same result as plain old "ASCII-betical" strcmp().

And there IS a built-in natural sort function with numerical grouping in Win32:
StrCmpLogicalW()
http://msdn2.microsoft.com/en-us/library/bb759947.aspx
"Compares two Unicode strings. Digits in the strings are considered as numerical content rather than text. This test is not case sensitive."

While "ASCII-betical sorting" does have a meaning, it is not the problem here.


Will on December 13, 2007 5:06 PM

Not all that hard in Objective-C (at least not for us, don't know how the people over at Apple implemented it): http://developer.apple.com/documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html#//apple_ref/doc/uid/20000149-SW1

Aron on December 13, 2007 5:19 PM

Will,
"Clippy could pop up out of nowhere", nice ridiculing of my point. But then you've called other posters idiots and the like so I think your level of class is obvious for all to see.

Also, you completely miss my point. It is simply that the user could ask for sorting by their favorite colour, the magnitude of the planets, dates closest to christmas etc etc. As a developer you must inform them that it is not necessarily the best approach (for example how will other users understand the sort). But if they still want it, it's then your job to give them what they expect.

You're not a Python zealot by any chance?

Ash on December 13, 2007 6:20 PM

Nice article, but perhaps a bit too pessimistic in saying that 40+ lines of code are necessary. Here's 10 lines of good old C code (code in the { ... } and not including the main()) that uses qsort(3). This assumes, of course, that people all agree that C is the lowest common denominator. If you agree with that premise, similar Java code should take the same or fewer lines of code. I know Jeff explicitly said he didn't want a 1 line code contest, but I find this may be a good exercise for interviewing people.

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

static int qsortComparatorHelper(const char *v1, const char *v2)
{
if (isdigit(*v1) || isdigit(*v2)) {
int val1 = 0, val2 = 0;
while (isdigit(*v1)) val1 = val1 * 10 + (*v1++ - '0');
while (isdigit(*v2)) val2 = val2 * 10 + (*v2++ - '0');
return (val1 == val2) ? qsortComparatorHelper(v1, v2) : val1 - val2;
} else {
return (*v1 && *v2 && *v1 == *v2) ? qsortComparatorHelper(++v1, ++v2) : (*v1 - *v2);
}
}

static int qsortComparator(const void *v1, const void *v2)
{
return qsortComparatorHelper(*(const char **) v1, *(const char **) v2);
}

int main()
{
char *str_array[6] = { "z1", "p2", "p100", "p200", "p3", "p100a" };
int i;

qsort(str_array, 6, sizeof(char *), qsortComparator);
for (i = 0; i < 6; i++) {
printf("%s\n", str_array[i]);
}
return 0;
}

Carleton on December 13, 2007 6:56 PM

Leading zeros breaks down every time you need to increase the number of leading zeros in your file names, so it's at best a temporary fix.

Even if the file names aren't computer generated, people may use temporary names with the intent of making them more meaningful later, only to end up with large numbers of files sitting around with nothing more meaningful than the date and order in which they were generated (and often that's not much use, either).

Many people like the idea of sorting by date, but that only works if your brain works well with dates in the first place, and/or you have a rough idea of when you generated the particular file you're looking for. The only time I've found dates useful (being a person that rarely knows what day of the week it is, never mind the actual date), was sorting photographs after a long trip. In that particular case at least one of the photos taken within a given week (Wednesday through Tuesday) had a relevant name in it, so I could rename the folder that contained all of the photos for that week with that name.

Other than that, I always sort my data by subject matter and revision information. Photographs are sorted similarly, dates simply help with grouping them together before sorting by subject in cases where I took a number of related photos within a given range of dates (or on a particular date).

I guess the point I'm getting to is that everyone has different expectations when they're storing and sorting data. They may also have different expectations when they're dealing with different types of data. On the other hand, the only people that expect data to be sorted the way most computer applications sort it are people that are used to computers doing this sort of thing. The majority of the people that understand why computers sort the data that way should be working to fix it (and obviously some of them have, as file managers wouldn't do any differently from array sorts otherwise, and for a long time they didn't).

Vizeroth on December 13, 2007 6:56 PM

Ash,

I am not a python zealot. I mostly code in C for Linux embedded systems and C++ (MFC) for Windows.

And I only called 1 person an idiot - the blogger (quoted by Jeff) who said:
"Silly me, I just figured that alphabetical sorting was such a common need (judging by the number of people asking how to do it I'm not wrong either) that I wouldn't have to write the damn thing. But I didn't count on the stupid factor. Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed. If any of you are using your language's default sort algorithm, which is almost guaranteed to be ASCIIbetical (for good reason) to get alphabetical sorting you proceed to the nearest mirror and slap yourself repeatedly before returning to your desks and fixing your unit tests that didn't catch this problem."

I guess you didn't find any of that offensive, rude, or low-class.

And my point was that, in some cases, you CAN'T adapt to your users needs. You can only try to please some of them, some of the time (which is what Explorer does).

Maybe some of my posts were strongly worded. Two reasons:
1) I dislike that implication that developers are too stupid to give the users what they want.
2) I'm disappointed that everyone thinks this issue is an artifact of "asciibetical" sorting, rather than lexicographical ordering. I don't see how you can solve a problem without fully understanding it first.

Will on December 13, 2007 7:18 PM

BTW, the fact that Win32, PHP, Objective-C have BUILT-IN support for "natural sort" ordering and that numerous 3rd-party libraries are available (including a C language library) proves that developers DO care about human factors and that we are not all stupid or hopelessly afflicted with Asperger's Syndrome, as Kate Rhodes and Jeff seem to think.

So I don't think we need to "proceed to the nearest mirror and slap [ourselves] repeatedly before returning to [our] desks", like Kate Rhodes suggested.

And I still have no problem with calling her an idiot. Especially when she thinks we don't know what the f--- "alphabetical order" means. I found that statement quite ironic, since "asciibetical order" is PRETTY MUCH the same as "alphabetical order", as long you're talking about case-insensitive comparisons. This problem has nothing to do with "alphabetical" versus "asciibetical". It is about "lexicographical order" (comparing one character at a time) versus "natural sort order" (i.e. treating groups of digits as numbers, rather than individual characters of text.).

Will on December 13, 2007 7:30 PM

One last comment. Many people have pointed out that no "natural sorting" algorithm can be perfect, or please everyone in all situations. Others have pointed out that "natural sorting" is an ambiguous spec, whereas lexicographical ordering is mathematically well-defined.

Maybe, just maybe, this is why we haven't seen "natural order" string comparisons built into every single popular high-level language out there? Could that be the reason, as opposed to the knee-jerk assumption by the human factors crowd that all developers are morons and are out of touch with normal people?

Will on December 13, 2007 7:36 PM

Okay, one last comment.

On the same page as the C library implementation of natural string comparison...
http://sourcefrog.net/projects/natsort/

...there are links for the following implementations:
Javascript
Java
Python
Ruby
Perl
Macintosh (as a system extension)

Hope that's enough platforms for everyone.

Will on December 13, 2007 7:48 PM

Damn good post! ...educational for me!

Bryan Wilhite on December 13, 2007 9:32 PM

>What's iinternationalizationn?

I believe it's a department of the Ministry of 'Ousinge.

Dave on December 13, 2007 11:05 PM

"One last comment. Many people have pointed out that no "natural sorting" algorithm can be perfect, or please everyone in all situations. Others have pointed out that "natural sorting" is an ambiguous spec, whereas lexicographical ordering is mathematically well-defined.

Maybe, just maybe, this is why we haven't seen "natural order" string comparisons built into every single popular high-level language out there? Could that be the reason, as opposed to the knee-jerk assumption by the human factors crowd that all developers are morons and are out of touch with normal people?"

Ironically, the fact that you're keener on a "mathematically well-defined" and "perfect" sort, rather than one which actually tries to meet the needs of the normal user, shows a certain level of "out of touchness".
The idea is not to produce the "wondersort", the Holy Grail of Sorting that will make every user think, "Ah, this man can sort. He sorts with such grace and prowess". The idea is to provide a sort that pleases the maximum number of people in your products' target audience.
And that is only ASCII in a *very* small number of cases.

Tom on December 14, 2007 12:33 AM

"That's not the point. The point is what is the natural order for hex strings? The 'natural order' is under defined, you can't have it in the language (or in its standard library)."
What, so there's no natural ordering on hex values? What about 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,10,11,12..., where 00012=12?
Hell, I've never actually needed to *sort* Hex, give me an example of when you'd be sorting Hex values, and how you sort them.

"Only the programmer can decide what is the natural order."
No, the programmer can only decide if the programmer is the end user. Otherwise, he's got no idea what the actual sort requirements are for the user.

"No, the user can't, the user sees only what she'd like you have to take into account all your users."
So, in order to take into account all of your users, instead of choosing a sort that benefits all of them as much as possible, you choose a sort which benefits *no-one* who isn't working with ASCII values.
I feel like I'm coming up against the traditional wall here of "What if one user came along who needed to view everything translated into Klingon, in mirror writing, and they were dyslexic?"

The quest for perfection is admirable, but all too often it can lead to: 'Well, ASCII sort isn't any better if the user's working in Hex, so it's good enough. Yes, I know less than 5% of the population of the world use Hex for this, and none of them will be using my product, but they might start, and then where would we be?'

Tom on December 14, 2007 1:00 AM

@Tom:
I think the problem that many of us have is that some people seem to think "how can you be so stupid and not use natural sorting in your application".
And this attitude is something I'd find stupid. It might be that one special application where users mostly use Hex numbers or GUIDs.
But even worse: where does that problem actually appear (except file names, and you should be using your system's functions to listing them most of the time)?
If it is some custom application, why does the user have to append numbers? Does the application maybe unreasonable expect something to be unique? Then your problem is not that you don't use natural sorting.
Or maybe your application incorrectly combined things into one field that should have been two?
In short: If you need natural sorting, this means that for some reason the user was forced to combine a name and a number into one thing, and natural sorting is treating the symptoms, not fixing the problem.
Sometimes fixing the symptoms is the best we programmers can do, and then we should do it. But we should not make ourselves believe we actually fixed the problem.

Reimar on December 14, 2007 2:50 AM

I think it is time to move to fuzzy logic. Take a few hunderd of thousand of users in all countries feed them random generated filenames and ask them to sort them. Use this as basis of the new algoritme.

Then let the program learn itself by observing the user that actually uses it.

Olaf on December 14, 2007 4:47 AM

@Reimar
"I think the problem that many of us have is that some people seem to think "how can you be so stupid and not use natural sorting in your application."

Actually, the original post was about not having natural sorting in your *language library*. I appreciate that non-Java and .NET coders won't understand this one, but most apps that *need* natural sort are written in those languages, and being able to rely on

And it's not stupidity, it's laziness. The real reason ASCIIsort, sorry "lexicographic" sort is used is because the comparisons are easy to do. Getting a natural sort to work without bugs is tricky, accurately assessing meet the sorting needs of the consumer is seriously tricky. But making lexicographic the *default* is lazy, because it's almost never the right answer.

What I was going to say is that if the custom app is for a specific purpose, of course you should use a more sensible sort, maybe lexicographic, maybe something highly specialised. But for mainstream apps, ASCIIsort is wrong.

"In short: If you need natural sorting, this means that for some reason the user was forced to combine a name and a number into one thing, and natural sorting is treating the symptoms, not fixing the problem."

I don't understand "forced". If you have a list of... well, here's an example (in the epilogue), which is better than anything I could come up with:
http://www.davekoelle.com/alphanum.html

Tom on December 14, 2007 7:40 AM

Quoting Tom: "But making lexicographic the *default* is lazy, because it's almost never the right answer."
...
"Ironically, the fact that you're keener on a "mathematically well-defined" and "perfect" sort, rather than one which actually tries to meet the needs of the normal user, shows a certain level of "out of touchness"."
-----------------------------

Lexicographic ordering (which is not LIMITED to ASCII) is not lazy. It's WELL-DEFINED. That's how computers work, but as we all know, it's not how people think. Like I said before, there's the fundamental tension between How Software Works and What People Want.

Who's being lazy? I see bloggers and posters flaming away that languages don't provide natural "sort" (actually natural string comparison) by default, where 3rd-party or built-in solutions are available for almost every platform.

I see people who don't understand that this is an issue of string comparison methods, not a sorting problem.

I see people who think this has something to do with ASCII comparisons. (See MSDN for the CompareString() function, which works with all kinds of different locales, multibyte character sets and word delimiters, but STILL sorts numbered lists in a user-unfriendly way: e.g. z100, z2, z30).

I see people how think the natural sorting issue is so simple, it can be solved with some one-liner progam.

To me the ironic part is I also see a lot of posters who say that natural sort should be so good, that it should use fuzzy logic, or it should adapt to meet the needs of all its users, or it should be designed to meet possibly changing or conflicting requirements in mind. Yet the language/library providers are LAZY, because they didn't build it in?

Guess what? Something that specialized really DOESN'T belong in a language spec or a built-in library. You want something like that for your users? Go write it yourself.

Or use Google, and find the multitude of 3rd party libraries/code that have already done it for you. Now who's being lazy? On a single web page, there are implementations linked for 7 different platforms.

So, I'm sorry, but it looks to me that some people are too lazy to try to understand the real problem or google for a readily available solution. See Kate Rhodes rant against people who "don't know what the f--- alphabetical means" (this has nothing to do with alphabetical order), and Jeff's post which describes the problem incorrectly (this has nothing to do with ASCII encoding, ASCII order, or anything ASCII)

Fine, call me a "keener" on mathematics or perfection. Guess what? All computer languages and computers are based on principles of mathematics. All software can be expressed as mathematical algorithms. You want fuzzy logic or ambiguous specs in your code? Write it yourself.

If you actually look at the natural sort implementations out there, you'll likely see - gasp - that they are more complex than a simple one-liner, and they all work slightly differently, because guess what? Nobody knows agrees on exactly "natural sort" is, just like nobody agrees on natural language (I gave the example of region differences between spelling, grammar and date formats in U.S., Canada and Britain).

So fine - call me old school, purist, pedantic, annoying, or whatever. I guess I'm part of a dying breed of coders who care about CS, mathematics, and implementation details, even though I finished school in this decade.

But despite what you think, I care about my users. That's why, when I NEEDED natural sort for my application, I used google and downloaded Martin Pool's implementation, which has been incorporated into PHP.
http://sourcefrog.net/projects/natsort/

Or maybe I should have wrote a rant on the internet about how "asciibetical != alphabetical" (completely irrelevant to the issue of natural sorting), how developers need to slap themselves (according to Kate Rhodes), and how language and standard library maintainers are "lazy" because they don't do all my work for me.

Will on December 14, 2007 9:03 AM

Tom, BTW, in the page you and Jeff linked:
http://www.davekoelle.com/alphanum.html
Read this:
"There is currently a glitch when it comes to periods/decimal points - specifically, periods are treated only as strings, not as decimal points. The solution to this glitch is to recognize a period surrounded by digits as a decimal point, and continue creating a numeric chunck that includes the decimal point. If a letter exists on either side of the period, or if the period is the first or last character in the string, it should be viewed as an actual period and included in an alphabetic chunk. While I have recently figured this out in theory, I have not yet implemented it into the algorithms. To be truly international, the solution shouldn't just consider periods, but should consider whatever decimal separator is used in the current language."

See? The author admits his algorithm doesn't do EVERYTHING the user might want. In other words, the algorithm isn't complete, and it doesn't meet the user's requirements.

Now look at the page I linked:
http://sourcefrog.net/projects/natsort/
"Leading zeros are not ignored, which tends to give more reasonable results on decimal fractions.

1.001 < 1.002 < 1.010 < 1.02 < 1.1 < 1.3

Some applications may wish to change this by modifying the test that calls isspace. "

See, the author admits he's made an ARBITRARY decision based on what the user might want, but as the application developer, you MIGHT want to change that behaviour.

This is not to criticize either author, who have obviously done good jobs, but show that NATURAL SORTING IS NOT AS SIMPLE AS YOU THINK.

And you wonder why this stuff isn't standardized?

It's the same reason ENGLISH isn't standardized. (If you think it is, you're dreaming).

Will on December 14, 2007 9:12 AM

"Actually, the original post was about not having natural sorting in your *language library*. I appreciate that non-Java and .NET coders won't understand this one, but most apps that *need* natural sort are written in those languages,"
------------
Oh well, I guess Win32's StrCmpLogicalW doesn't count because it's not native .NET code. Fine, here's some C# code for you:
http://www.codeproject.com/KB/recipes/csnsort.aspx

And notice how, ironically, in the codeproject article, the sorting order that everyone here hates is called "alphabetical sort". Because guess what? ASCIIbetical pretty much IS alphabetical. The only difference is that alphabetical order only implies "A-Z" (case insensitive), but ASCII order includes numbers, punctuation marks and other symbols. And alphabetical sort is a synonym for lexicographical ordering, which is still the ordering that everyone here hates. So when you say the problem is a question of ASCIIbetical versus alphabetical, I find that highly ironic.

Will on December 14, 2007 9:31 AM

@Tom: "The idea is not to produce the "wondersort", the Holy Grail of Sorting that will make every user think, "Ah, this man can sort. He sorts with such grace and prowess". The idea is to provide a sort that pleases the maximum number of people in your products' target audience."

Reg Braithwaite mentioned recently on raganwald that he admired Jeff for writing a blog that catered simultaneously to multiple audiences: the hardcore programmers and engineers, and the fuzzy UI/Web design types. But catering to several audiences means that you'll inevitably get comment flamewars when the audiences collide.

Tom, a Web-design type, claims that the purpose of a computer program is to "please the maximum number of people" and just generally make its users happy. (If it miscomputes once a week, that's not too bad, and the users will understand, because they know how flaky computers can be sometimes.)
Will, a Real Programmer type, claims that the purpose of a computer program is to implement an algorithm to solve a well-defined problem, with 100% reliability. (If some user has to write a five-line Perl script to tailor the output to his own preferences, that's not too bad, and the users will understand, because they know how inscrutable third-party utilities can be sometimes.)

Real Programmers will tend to prefer predictable old strcmp() over fuzzy-happy-sort, because we're thinking about trying to reuse the sort routine in a program that will crash the airplane if it ever does anything unexpected. Even if we're just sorting things for a GUI display, we prefer simple predictable results over complex fuzzy results, because we deal with predictability better than uncertainty. I want to be able to "ls | head" and know that if I see "z1" and "z2" consecutively, then "z10" isn't in the list. I don't want fuzzy-happy-sort messing with my output in unpredictable ways.

Anonymous Cowherd on December 14, 2007 12:11 PM

Anonymous Cowherd wrote: "Will, a Real Programmer type, claims that the purpose of a computer program is to implement an algorithm to solve a well-defined problem, with 100% reliability. (If some user has to write a five-line Perl script to tailor the output to his own preferences, that's not too bad, and the users will understand, because they know how inscrutable third-party utilities can be sometimes.)"
----

You are half right. Like I said, I do care about my users. For example, when I needed natural sorting for an application, I googled for the appropriate third-party code and incorporated it into my app.

No need to whine about lack of built-in language/library support or developer stupidity. Especially when there IS built-in support in several places (Win32, PHP, etc.) and lots of third parties have generously released their own code.

If natural sort wasn't available as a third-party add-on, I would've took
a stab at writing it myself.

Don't really see the big deal here.

Will on December 14, 2007 1:33 PM

BTW, I have done a lot of coding/design for UI and Web apps (for embedded devices, not for the Internet) myself. So I understand the concerns of the human factors crowd. However, I happen to code in "lower-level" languages such as C and C++, so I also understand the concerns of "Real Programmers".

I think a balance needs to be struck, instead of each side needlessly flaming the other.

Will on December 14, 2007 1:36 PM

"no (rational) human would name files with such a scheme if they ever intended those files for human consumption."

In fact I have gigs of MP3s named like this. I have about 200 CDs which I have ripped for the most part in the following schema: genre/artist/CD/trackname.mp3

How is the track name assigned? TrackNo - Name.mp3. For these to sort correctly, you either need a natural sort or leading zeros, otherwise "10 - Eclipse.mp3" will sort before "2 - Breathe.mp3".

BS that people don't name files like that! (Or at least have to compensate for "broken" sorting routines.) And that's just one example.

Evan on December 14, 2007 2:28 PM

Jeff,

As IanG pointed out (http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting), it seems Python has a rather "unfair advantage".

Python knows implicitly how to compare an array to another array. The verbosity of a C# solution comes almost totally from the fact that there is no "default comparison" for any of the provided collection types.

Another small leg-up Python has is that it is dynamically typed. In C# we spend a lot of time making sure that everything is the proper type. Type inference can do some work for us, but everything still has to have a definite type at compile time, so the type information has to be there somewhere. Dynamic typing enables Python to do things with in-line lambdas that C# just simply can't. In C#, if you want a lambda that will work for two types that are not implicitly inter-cast-able, but don't want to have to write two versions of it, then the best you can do is to use generics to write a "lambda generator": a function that creates your lambda in the appropriate type and then returns it. But this inherently eliminates the in-line option.

Waterbreath on December 14, 2007 2:32 PM

Will wrote:

"Of course, I'm sure this also happens when Americans post on non-American websites."

Yes and no. Why not take PirateBay as an example (swedish hosted and allows sv-SE and en)

----

"The point is that people hold on to their little cultural conventions and idiosyncracies as if they were authoritative and universal."


Exactly my point! Try using ISO, it's great. ;-)

If the engineers (thats the people posting here) won't use ISO, there is little chance we'll ever stop crashing things because a computer program mis-matched yards and meter.

Peterh on December 14, 2007 3:23 PM

"So fine - call me old school, purist, pedantic, annoying, or whatever. I guess I'm part of a dying breed of coders who care about CS, mathematics, and implementation details, even though I finished school in this decade."
Listen, I agree totally that Lexicographic ordering is the correct algorithm for a "well-defined" sort. Which is why I'd use lexicographic if I was working on something buried inside the internal computation. But placing correct definition over user-friendliness in anything the user is going to see is bad practice.


"NATURAL SORTING IS NOT AS SIMPLE AS YOU THINK."
*I* never said it was simple. *I* said it was "seriously tricky" (and I'm British, so I understate). And yes, arbitrary decisions may have to be made. But the alternative is to make no decision at all, and rely on the old "it's mathematically well-defined, so it must be the best".

"If it miscomputes once a week, that's not too bad, and the users will understand, because they know how flaky computers can be sometimes.
*****
If some user has to write a five-line Perl script to tailor the output to his own preferences, that's not too bad, and the users will understand, because they know how inscrutable third-party utilities can be sometimes."
That's actually a pretty good summary of the 2 polar positions. And of course, which is *correct* depends completely on the circumstances.
I prefer the term 'Core' to 'Real', because 'Real' somehow suggests I'm a 'Fake'. Also, I find the idea that every single user of your product will be able to knock off a correctly coded 5-line Perl script not entirely "Real".

"It's the same reason ENGLISH isn't standardized. (If you think it is, you're dreaming)."
Well, fek glorp frq mz dokx, then.

Tom on December 14, 2007 4:53 PM

Tom wrote:
'(Quoting me) "It's the same reason ENGLISH isn't standardized. (If you think it is, you're dreaming)."

Well, fek glorp frq mz dokx, then.'
------------

Wow, that's hilarious. I guess I'll have to give examples, or no one will believe me.

American English: color, defense, center
British/Canadian English: colour, defence, centre

American English: "Microsoft is a software company."
British English: "Microsoft are a software company."

American English: Lieutenant is pronounced LOO-teh-nant
Canadian English: Lieutenant is pronounced LEF-teh-nant

American date convention: MM/DD/YYYY
Canadian date conventions: DD/MM/YYYY, YYYY/MM/DD, etc.

American English: a car's trunk
British English: a car's boot

You think these differences are trivial, but I see non-Americans getting flamed on American message boards for using "incorrect" spelling and grammar all the time.

Will on December 14, 2007 5:02 PM

Oh, and if English is the same everywhere (i.e. STANDARDIZED), why are Harry Potter books "translated" into American English?
http://en.wikipedia.org/wiki/Harry_Potter_in_translation#Americanisation_as_translation

Why do people complain that Windows is not "localized" to "British English"?
http://blogs.msdn.com/michkap/archive/2006/09/12/750155.aspx

I'm so confused.

Will on December 14, 2007 5:12 PM

"I think a balance needs to be struck, instead of each side needlessly flaming the other."
Absolutely, and I'm sorry if I went off the handle a bit with the "10 vs. onezero" thing. It just seemed like such a daft question. But is lexicographic ordering any good there? I mean, 10.0 is still "less than" 2.5. And what about Version numbers that go 1.8, 1.9, 1.10, 1.11...
I assumed Natural Sort meant "natural for the medium you're working in". And I used the example given in the , not because it's a perfect sort, but because it's a perfect example of a situation where lexicographic is a very bad choice.
Another is:
DrWhoEp1;DrWhoEp10;DrWhoEp11;DrWhoEp12;DrWhoEp2... I mean, I know I could use leading zeroes, but still.

"I happen to code in "lower-level" languages such as C and C++, so I also understand the concerns of "Real Programmers"."
God Bless You for those "" marks. I've used Java, Visual Basic and I'm moving into the world of C++, with some previous experience. Frankly, I'm sure I'll be *exactly* as mediocre a programmer in C++ as I was in Java.

Tom on December 14, 2007 5:22 PM

"You think these differences are trivial, but I see non-Americans getting flamed on American message boards for using "incorrect" spelling and grammar all the time."
Now, to be fair, that *is* hilarious.
But do you get my point, that these minor variations don't actually *prevent* understanding. That's why Microsoft don't put as much work into creating a Britishy version of Windows as they do into Portuguese, you can work without it. (By the way, I wouldn't encourage them, we'd probably end up with a new Clippy "now in Cockney Rhyming slang")
"Oi, guvnor, looks like you're in a spot of Barney, need a Brass" *shudder*

Tom on December 14, 2007 5:35 PM

How should these filenames be sorted?

example1.txt
example2.txt
example3.txt
example4.txt
example5.txt
example-1.txt
example-2.txt
example-3.txt
example-4.txt
example-5.txt

If the numbers are being evaluated correctly, shouldn't they be listed in this order:


example-5.txt
example-4.txt
example-3.txt
example-2.txt
example-1.txt
example1.txt
example2.txt
example3.txt
example4.txt
example5.txt

Hmm.

Simon Wright on December 14, 2007 9:38 PM

Simon,

That's a good example. It obviously depends on whether "-" is meant to represent a negative sign or is simply used as a delimiter. Most people use the "-" character as a delimiter in file names, and I would guess that most "natural sort" implementations would not interpret those numbers as negative. Of course, the software has NO WAY of figuring out what the user means by "-", unless it has some fancy AI, and access to a representative set of data.

Just another example of how natural sorting is not well-defined. That's why I have no problem if it isn't built into every single language/library by default. Leave it to the 3rd-party libraries and application developers.

Will on December 14, 2007 10:08 PM

In Ruby:
sort_by { |item| item.to_s.split(/(\d+)/).map { |e| [e.to_i, e] } }

My understanding:
- We are going to compare arrays (map)
- Arrays will be made of chunks extracted from each 2 strings (split)
- chunk is anything before next number if any (\d+\)
- or else it is then that number
- [e.to_i, e] pairs compared eqv e compared when e not a number.

Seen on http://anarchaia.org/archive/2007/12/15.html

Now I feel stupid, and admiring.

JeanHuguesRobert on December 15, 2007 1:59 PM

I find it hard to believe that anyone would ever think z1, z10, z100 etc was the natural sort order, but then I do little programming and a lot of downloading of open source software these days. Often, 17-odd numbered versions of a package will be placed in a folder & Apache left to index them as it was programmed to - in simple ASCII order. The latest version usually ends up in the middle of the list, which is not just annoying but quite stressful after a while. I really can't believe some would say ASCII sort is the natural order, really can't.

eekee on December 15, 2007 3:41 PM

Oh also, thanks for the Python tip.

eekee on December 15, 2007 3:45 PM

One-liner contest:

def nsort(l): return sorted(l, key=lambda s: [map(int, e[1::2]) for e in re.split(r'(\d+)', s)])

Maxwell Terry on December 16, 2007 9:37 AM

yay python

yossi on December 16, 2007 6:45 PM

@will
I completely agree to the need of being "old school".Each information delivered/supplied is only valid in the context that the exchange is taking place. Therefore "natural sort order" is valid for an end-user who would count from one to 12 upward as
one
two
three
...
ten
eleven
twelve

without adding leading zeroes or whatever.... Therfore the Apple and MS decision in the finder/explorer would appear *obvious* to the user.

While analyzing version-numbers from CVS (1.1.6 , 1.1.005 , 1.1.5 etc) this "natural" approach will not fit, as for the user the approach is not *obvious*

So the "art of sorting" is not implementing the "right algorithm" but "finding the right context"


btw. There had been words about US-English / UK-English and NON-English. As long as there are still applications and libraries (mostly US-English) that don't accept or display European characters (Umlaute ÄÜÖ etc) natural sorting is the minor problem.
@see several implementations of isalpha() toupper() and feed them with "ÄÜÖ"

regards
mink

Mink on December 17, 2007 2:30 AM

I think it's hilarious how many programmers still think they are smarter than their users. Hubris, nothing but hubris.

JeffK on December 17, 2007 10:14 AM

"Hubris, nothing but hubris."

Which, of course, is one of the three great virtues of a programmer. (^_^)

Robert Fisher on December 18, 2007 9:01 AM

You can see my over-complex implementation in C# at http://tgmayfield.livejournal.com/396.html

Thomas G. Mayfield on December 18, 2007 9:04 AM

I love the comments above with people commenting on how users should be naming files. LOL. Try telling my dad that he shouldn't name his photos: kid1, kid2, kid3, etc...

Todd on December 20, 2007 5:41 AM

I took a stab at it: http://www.codeproject.com/KB/string/NaturalSortComparer.aspx

The Cowboy on December 28, 2007 5:22 PM

Any thoughts on implementing this in SQL? It seems like for web applications this is where this logic needs to exist. (Typically a SQL query both sorts and paginates the data before returning it to the app.)

Landon on January 8, 2008 2:02 PM

I tried the javascript version at sourcefrog, but found it to be buggy, even on simple test arrays. David Koelle's Alphanum worked well, but there was no javascript version yet. So I ended up porting the Perl version of Alphanum into javascript. Tested and works in Opera, Firefox, IE6 and IE7.

http://my.opera.com/GreyWyvern/blog/show.dml/1671288

GreyWyvern on January 17, 2008 9:42 AM

I sort huge amounts of data - not just filenames - including numbers in regular places - Statute names and citations for instance. A natural sort would put chapter numbers and years in an order that makes sense. Why isn't it just available as a UNIX utility, I wonder.

gypsies0 on March 13, 2008 7:43 AM

Jeff Atwood is right.

Programmers should do what users want (when they know what they want like in this particular case).

But what is the root cause of this sorting problem?

I'll tell you what -- filenames which must be unique. Filesystems which after 50 years _still_ use the filename as an access key.

That is so retarded!

I can have two identical cans of Coca-Cola in my fridge, and I can't have two files with the same name on my hard drive?

I can have two folders in a filing cabinet with the same person name on both of them, yet every filesystem is so stupid that it cannot differentiate those in electronic form.

Not to mention that merging two physical folder's contents doesn't destroy anything, try that on a computer for a change.

In my opinion computers won't be user-friendly until they start use whatever internal means they can (128-bit GUID for example) to differentiate between files, and until they drop file extensions alltoghether and start using MIME to tag each file with the proper type on creation.

When you get an option to create the files birthday_cake, birthday_cake, and birthday_cake and to recognize the right one later based on the preview instead of having to remember the sequence number (was it the birthday_cake2 I wanted?) only then we will have some progress.

Igor Levicki on March 13, 2008 4:58 PM

ls -v existed for quite a while and I've been using it. With some of the test case on the page, I get:

1.01
1.1
1.1.1
1.10
1.10.1
1_1
1_10
9.9.08
12.25.2007
25
100
Banana
abc-12-foo.txt
abc-12a-foo.txt
abc-23-foo.txt
abc-100a-foo.txt
abc-d2-foo.txt
abc-d43-foo.txt
apple
banana
dumbdeveloper3
dumbdeveloper15
dumbuser2
dumbuser10
example-1.txt
example-2.txt
example-3.txt
example-4.txt
example-5.txt
example1.txt
example2.txt
example3.txt
example4.txt
example5.txt
smartdeveloper1
z001.txt
z010.txt
z1.txt
z2.txt
z10.txt
z100.txt

The ordering seems human enough. It doesn't work for case-insensitivity, negatives, and dates. Since leading zero makes the number fractional, you get: z010.txt < z1.txt < z10.txt. As much as I love to reinvent the wheel to get it just right within each language, I just execute external commands.

Personally, and not to mention coincidentally, I have been using lowercase, non-negative, and iso-date-format for my filenames. That, of course, is not what typical users do. Sigh.

----

As for "filenames must be unique", that has been surpassed ever since NTFS in 1993. NTFS support multiple forks on a file. There was a time when I put different versions of the same file together as a single file before migrating to database. IE7 even use forks to store where you download each file from. I'm quite sure that someone somewhere is working on an image viewer that let you previews all forks on a file before loading the one you choose. Be patient.

Zodi on March 22, 2008 4:57 AM

when the poster started this thread, he use filename as example, whereas, the general idea behind is in the programming concept of sorting.

not just filenames, but data aswell.

i recently have this problem, i have house numbers which actually contais letters (yes and they call it house)

and they wanted to sort it naturally (the users)

(43, 43 A, 43B)

kris on April 2, 2008 1:08 AM

Strangeness with StrCmpLogicalW:
--------------------------------
In MSVC++ 2008, a test program I wrote using StrCmpLogicalW will sort strings thusly (which seems fine):

V2 v6.0 v6.0A v6.1 v7 v7.4 v10

but in VisualBasic 6, here's some debug output I get (it's in the middle of a real program that compares two things and throws away the lesser, so it's not a group sorter per se):

StrCmpLogicalW ( 'V2', 'v10' ) = -1 [correct]
StrCmpLogicalW ( 'v10', 'V2' ) = 1 [correct]

StrCmpLogicalW ( 'v6.0', 'v10' ) = 1 [wrong!]
StrCmpLogicalW ( 'v10', 'v6.0' ) = -1 [wrong!]

StrCmpLogicalW ( 'v6.0A', 'v6.0' ) = 1 [correct]
StrCmpLogicalW ( 'v6.0', 'v6.0A' ) = -1 [correct]

StrCmpLogicalW ( 'v6.1', 'v6.0A' ) = -1 [wrong!]
StrCmpLogicalW ( 'v6.0A', 'v6.1' ) = 1 [wrong!]

StrCmpLogicalW ( 'v7', 'v6.0A' ) = 1 [correct]
StrCmpLogicalW ( 'v6.0A', 'v7' ) = -1 [correct]

StrCmpLogicalW ( 'v7.4', 'v7' ) = 1 [correct]
StrCmpLogicalW ( 'v7', 'v7.4' ) = -1 [correct]

I haven't tried this in VB.NET. I don't want to upgrade the application just over this issue.

Jeff J. on June 19, 2008 8:10 AM

I think that most of the time, you shouldn't be attacking this problem in your sort algorithm, but at the data entry end of the question. If people are entering addresses, say, then you need to do what Outlook does - have a nice big box, let them type like it's a mailing label, and then analyse the result and chop it up into pieces appropriately.

Then you don't have to try to sort "10 Main Road" and "9 Main Road" because you've got the numbers into an appropriate field.

PS to everyone that thinks they've got this cracked:

221 Baker Street
221A Baker Street
221B Baker Street
223 Baker Street
225 Baker Street
.
.
.
335 Baker Street
284 Baker Street
282 Baker Street
.
.
.
2 Baker Street

You try getting that sort order right!

That's delivery order for a postie - walk along one side of the street and then come back along the other side.

Richard on August 7, 2008 7:05 PM

.NET sorts right include considering internationalization.
Hier the example:
class app {
public static void Main()
{
string[] list = new string[] { "Ooo", "ooo", "o12",
"o1", "o10", "o23", "o2", "o4", "ö3", "u4" };
System.Array.Sort(list);
foreach (string str in list)
System.Console.WriteLine(str);
}
}

And the result is:
o1
o10
o12
o2
o23
ö3
o4
ooo
Ooo
u4

Lars on August 25, 2008 4:08 PM

But what will this do for names with leading numerics? I have a lot of files dragged from my browser cache, names all start with either 0-9 or A-F. But using the Nautilus file manager, sort by name PARSES the leading numerics (up to the first alpha), then sorts on that substring as a decimal number!!!! So, for instance, 99xx.jpg sorts BEFORE 1111xx.jpg. To find a file, I have to figure out that whole decimal number (often up to 6 digits) and look for it. I ***WANT*** it to sort all the files starting with zero together, then all the ones, etc. so looking at the first 2 or 3 characters will get me to or very close to the right one. Will your "Natural Sort" do this? I doubt it. No, I think my complaint is that this app DOESN'T use ASCII sort!

Jim Hartley on January 20, 2009 4:03 PM

@Graham Stewart:

'Can't you just treat the whole "name" as a base-62* numbering system?'

An interesting idea, and it seems to work ok for the examples you give, but try English words and see what happens. E.g. "about" > "ape".

Also this order:
a1 a2 a3 z1 z2 z3 a10 a11 a12 z10 z11 z12 a100 a101 ...

Doesn't seem "natural".

On the other hand, this is a hard problem. Sorting a group of numbers numerically is easy. Detecting whether a set of values is numeric or not and sorting accordingly is easy.

Sorting a group of strings that might be partly numbers is complicated. When the sort order of two strings depends entirely on what other strings may be present in the sort set, you're bound to get an undesired result for certain purposes with some data sets. Kate, there is no single right sort for all string sets and all purposes, even if you ignore i18n issues.

That being said, yes in many specific cases application programmers and library programmers could do a better job of providing a greater variety of useful sort features. Javascript especially seems poor.

Lars H on July 2, 2009 3:16 PM






(no HTML)


Verification (needed to reduce spam):


Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.