The default sort functions in almost every programming language are poorly suited for human consumption. What do I mean by that? Well, consider the difference between sorting filenames in Windows explorer, and sorting those very same filenames via Array.Sort() code:
| Explorer shell sort | Array.Sort()
|
|
|
Quite a difference.
I can say without the slightest hint of exaggeration that this exact sorting problem has been a sore point on every single project I've ever worked on. Users will inevitably complain that their items aren't sorting properly, and file bugs on these "errors". Being card-carrying members of the homo logicus club, we programmers produce a weary sigh, and try to keep any obvious eye-rolling in check as we patiently inform our users that this isn't an error. Items are sorting in proper order. Proper ASCII order, that is. As we're walking away, hopefully you won't hear us mutter under our breath what we're actually thinking-- "Stupid users! They don't even understand how sorting works!"
I always felt a pang of regret when rejecting these requests. Honestly, look at those two lists-- what sane person would want ASCII order? It's a completely nonsensical ordering to anyone who doesn't have the ASCII chart committed to memory (and by the way, uppercase A is decimal 65). I never really understood that there was another way to sort, even though natural sort has been right in front of us all along in the form of Mac Finder and Windows Explorer file listings. I had language-induced blinders on. If our built-in sort returns in ASCII order, then that must be correct. It was bequeathed upon us by the Language Gods. Can there be any other way?
Kate Rhodes is a bit up in arms about our collective ignorance of ASCIIbetical vs. Alphabetical. Can't say I blame her. I'm as guilty as anyone. Turns out the users weren't the stupid ones after all -- I was.
Silly me, I just figured that alphabetical sorting was such a common need (judging by the number of people asking how to do it I'm not wrong either) that I wouldn't have to write the damn thing. But I didn't count on the stupid factor. Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed. If any of you are using your language's default sort algorithm, which is almost guaranteed to be ASCIIbetical (for good reason) to get alphabetical sorting you proceed to the nearest mirror and slap yourself repeatedly before returning to your desks and fixing your unit tests that didn't catch this problem.
It isn't called "Alphabetical sort"; it's collectively known as natural sort. But she's right about one thing: it's hard to find information on natural sorting, and many programmers are completely ignorant of it. None of the common computer languages (that I know of) implement anything other than ASCIIbetical sorts. There are a few places you can find natural sort algorithms, however:
Don't let Ned's clever Python ten-liner fool you. Implementing a natural sort is more complex than it seems, and not just for the gnarly i18n issues I've hinted at, above. But the Python implementations are impressively succinct. One of Ned's commenters posted this version, which is even shorter:
import re
def sort_nicely( l ):
""" Sort the given list in the way that humans expect.
"""
convert = lambda text: int(text) if text.isdigit() else text
alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
l.sort( key=alphanum_key )
I tried to come up with a clever, similarly succinct C# 3.0 natural sort implementation, but I failed. I'm not interested in a one-liner contest, necessarily, but it does seem to me that a basic natural sort shouldn't require the 40+ lines of code it takes in most languages.
As programmers, we'd do well to keep Kate's lesson in mind: ASCIIbetical does not equal alphabetical. ASCII sorting serves the needs of the computer and the compiler, but what about us human beings? Perhaps a more human-friendly natural sort option should be built into mainstream programming languages, too.
| [advertisement] Axosoft OnTime 2008 is four developer tools in one: bug tracking, project wiki, feature management, and help desk. It manages your development process so developers can focus on coding. Installed or Hosted – Free Single-user license -- Free 30-day team trial. |
Posted by Jeff Atwood View blog reactions
« Blacklists Don't Work Our Fractured Online Identities »
Leading zeros are your friends.
mpbk on December 13, 2007 03:09 AMPHP has a built in natural sort function as well: http://be2.php.net/natsort
Would be nice to have it built in .NET as well.
David Cumps on December 13, 2007 03:10 AMI don't see how this would actually be a "sorting" problem for .NET. Nothing should be wrong with how .NET sorts, but rather you would need to provide it a custom comparer. You can still use the same Sort method, just provide a comparer that can tell if "z2" is greater than "z10" or not.
John Chapman on December 13, 2007 03:18 AMYou're absolutely right, Jeff - there's no reason that there shouldn't, somewhere in the libraries for .NET or Java, be a natural sorting algorithm. I know that in Java there's more than enough locale specific functionality that the argument of different character sets or sorting rules (someone in one of the links mentions that, in Sweden, 'V' and 'W' sort together) doesn't really hold water. Besides, it's Sun and Microsoft - don't they have motivated interns to do that sort of work?
BrotherBeal on December 13, 2007 03:25 AMThough I don't disagree with the theme that we should do what we can as programmers to reduce cognative friction, make lives easier, and exceed user expectations, I have to wonder, what exactly is coming out unsorted here? Obviously the file names are a contrived example, because let's be honest, no (rational) human would name files with such a scheme if they ever intended those files for human consumption.
I can think of only a few situations (though admittedly I may be missing some) where sorting ASCIIbetically produces counter-intuitive results. For example, numbers/dates tacked on to the beginning (or end) of file names as an ad hoc method of version control. Maybe, a user ID number (perhaps an integer) that's used as a primary key in a database might sort out of order. In answer to these "incorrect" sorts, I would argue that the users are not using the proper tools (perhaps because we as programmers haven't supplied them, or due to other reasons such as cost).
My point is that the types of things that tend to sort poorly in ASCIIbetical are the type of things that we as programmers should be abstracting away in our software so that users don't have to consider whether something's sorted "incorrectly." Certainly there will be situations where abstracting such things is infeasible, or unnecessary, and such a "natural" sort might be appropriate, but when that happens, write the sort, port it out to a library assembly (especially now with .NET 3.0 extension methods this becomes tractable), and get on with it. If you come across a situation where it's necessary later, link the library into your assembly.
Just don't put the cart before the horse Jeff.
Michael Starke on December 13, 2007 03:26 AMLeading zeros are your friends.
because let's be honest, no (rational) human would name files with such a scheme if they ever intended those files for human consumption.
We are thinking as programmers here, ask any normal person to write a list of file names and he/she will write
photo1
photo2
photo3
photo..
photo10
photo11
Natural sort is the way!
Edddy on December 13, 2007 03:34 AMIf you want to sort like Explorer, you are apparently supposed to use StrCmpLogicalW. See: (http://msdn2.microsoft.com/en-us/library/bb759947.aspx) and (http://www.pinvoke.net/default.aspx/shlwapi/StrCmpLogicalW.html). I think it is only suited to sorting filenames though, because it may not take account of sorting properly for different languages/cultures.
Chris L on December 13, 2007 03:35 AMI'm sorry, but I don't agree. Let's call it alphanumerical.
z100 should always go BEFORE z2, the same way that table goes before tar.
Also, anyone naming their files *name*-*number* and not putting in the extra digits should not order by name, but by date.
And I take offense at Kate Rhodes' comments. It's yet another case of stupid users.
paketep on December 13, 2007 03:38 AMOff the top of my head...
Can't you just treat the whole "name" as a base-62* numbering system?
The number line would look something like:
0 1 2 3 4 5 6 7 8 9 a A b B c C d D e E f F g G h H i I ... etc
Convert to decimal in the usual way so:
z1 = (61 * 62) + (1) = 3783
z2 = 3784
z10 = (61 * 62^2) + (1 * 62) + 0 = 234546
z11 = 234547
z100 = (61 * 62^3) + (1 * 62^2) + (0 * 62) + 0 = 14541852
z101 = 14541853
That seems to work.
Downsides are that you need to be able to handle very large numbers to get support a useful filename length and base-62 only covers alphanumeric, you'd need a much bigger base to cover punctuation and unicode.
(Apologies if my maths is a bit wonky on the above. It's before lunch and my coffee isn't kicking in. But I think the principle is sound.)
Graham Stewart on December 13, 2007 03:39 AM>> We are thinking as programmers here, ask any normal person to write a list of file names and he/she will write
>> photo1
>> photo2
>> photo3
>> photo..
>> photo10
>> photo11
Why are they choosing file names like that though? Could it be because we've failed as developers to provide them tools to give their files more descriptive names?
Why not name those same photos:
birthday cake.jpg
football cheerleader.jpg
Pam smiling.jpg
zoo animals.jpg
That certainly seems more like the kind of names a user would pick. I would wager if you asked a person if they'd rather have files named rationally with a description of their content, rather than some arbitrary numbering scheme, just about everyone would rather have descriptive file names.
Let's solve the root of the problem before so that we don't have to continually solve the symptoms.
Michael Starke on December 13, 2007 03:41 AMMy Ruby version of Dave Koelle's algorithm:
http://rikkus.info/arch/sensible_sort.rb
> Leading zeros are your friends
How many leading zeros?? 3, 4, 10??
Then you have to type them in to access whatever the thing is you've indexed with them - really unfriendly when there are more than about 3.
More thought required methinks.
> Why are they choosing file names like that though?
How many keyboards have you ever seen on a digital camera?
Francis Fish on December 13, 2007 03:46 AM>> How many keyboards have you ever seen on a digital camera?
Well, duh! The camera has no way of knowing if you're taking a picture of a pumpkin or a polar bear, and without a keyboard you cannot tell it. But most digital cameras these days come with a suite of software that you use to sink the photos from the camera to the computer. Why is that software not gently prodding those users to provide descriptions of the content of those files so that an intelligible value can be used for the file name? If a user chooses not to provide those descriptions, then they have to remember that DSC01722.jpg is the picture of Pam from the office party. If that's what they want to do, you can't stop them. But you can put the right tools in the hands of users who do want the feature.
Michael Starke on December 13, 2007 03:57 AMCocoa has numerical comparison as follows: [someString compare:someOtherString options:NSNumericalSearch];.
It's trivial to implement a function for NSArray's sortUsingFunction:context: that uses it, or a category on NSString to add a -numericalCompare: method to be used with sortUsingSelector:.
millenomi on December 13, 2007 03:57 AMAnyone coding for European languages will have certainly come across a variant of this where bugs crop up complaining that accented characters (like é) end up at the bottom of a sort instead of next to their a-z equivalents. It gets even more fun when you start considering compound letters like 'ß'.
To the poster who suggested treating the name in base-62, what you're essentially doing is sorting by string length first, then alphabetically - not addressing the core problem, e.g. if you filenames were all the same length (z1.., z10., z100) the system falls down.
Dino on December 13, 2007 03:58 AM@Francis: most digicams I've seen actually use padding zeros. They mostly pad to 5 digits, since it's usually IMG*.JPG, and since we are still using useless FAT16 as a common denominator filesystem on the memory cards, we are not getting long file names. Smells like 1995, don't you think? And it's actually almost 2008 outside... That's just sad.
Nikolai on December 13, 2007 04:00 AM(Side note: Cocoa calls numerical sorting what you refer to as natural sorting.)
millenomi on December 13, 2007 04:01 AM> How many keyboards have you seen on a digital camera?
That made me laugh out loud. Good shit, Francis.
And i think the person who wrote about using leading zeros meant that the leading zeros would be stripped after the sort was done. Or atleast that's how i pictured it. That would be quite a task on bigger lists though.
Jazz on December 13, 2007 04:03 AMpaketep:
Sorry, but if your users are asking for something and your response is "no thats stupid", instead of "ok, I'll do it that way" then you have some problems with your logic. Users are called users for a reason: they _use_ the stuff.
On the Michael Herf website, you can find that :
http://www.stereopsis.com/strcmp4humans.html
I must mention that on his website you can find lots of nice peaces of code.
Alex on December 13, 2007 04:08 AM@Dino: not sure I understand your criticism?
In my example z100 and z101 are the same length and would be sorted correctly.
If you want to sort "z1..", "z10.", "z100" then it would still work, provided you assign the "." character a value on the number line to determine where it appears relative to other values.
On reflection I think where it falls down is on filenames with DIFFERENT lengths (e.g. apple, ball, cab would be sorted as cab, ball, apple - which isn't quite right!)
Graham Stewart on December 13, 2007 04:11 AMThe python sort above is still case sensitive, so 'abc' will be sorted after 'DEF'. It might be worth changing 'else text' to 'else text.upper()'. I can't remember if that is safe for languages other than english though...
Simon on December 13, 2007 04:16 AM>> Why is that software not gently prodding those users to provide descriptions of the content of those files so that an intelligible value can be used for the file name?
Well duh, probably because it's kind of silly to rename the file if you're going to add tags to it later anyway? Besides, if I download 150 pics from my cam, I'm not really eager to have to name them all...
Gabri on December 13, 2007 04:33 AMNatural sort mostly solves the problem of numbers in file names. What if the file names also contain dates? It would be fantastic for users if their file explorer would also recognise and sort by date. The environment's locale settings could help in resolving ambiguities (such as M-D-Y in USA but D-M-Y in Europe) and recognising month names and abbreviations. Users shouldn't have to use the ISO date format YYYY-MM-DD in their file names. For example, 1-jun-2007, 1.6.2007, 6-1-2007, june-6th-2007, 2007-6-june, jun07, 1-07, and 2007/6/1 are all possible candidates for a date in a file name (the last one would be illegal on Un*x-style file systems because slash "/" is the directory separator). There are some challenges in there, and it could never be as good as a human, but with some good heuristics and usability tests to iron out surprising or unexpected behaviour, I reckon it could be made to work very nicely.
Andrew Bettison on December 13, 2007 04:40 AM@Andrew: Meta-data should be your friend, EXIF when talking about photos. Filenames are obsolete.
Nikolai on December 13, 2007 04:49 AMJeff - splitting hairs, I know, but it's all about your ordering predicate, not your sort algorithm. That's one thing (among many things, IMO) that the C++ Standard Library gets right - it correctly identifies the variability point of the sort algorithm (the ordering predicate) and exposes it nicely.
Michael Starke - users are rational? Nah. Anyway - when there is a mix of alpha & numeric in a name, users are going to presume that the numeric part should sort by numeric magnitude, while the alpha part sorts alphabetically. And you'll probably find that they'll think that punctuation characters should sort after letters. Or before letters. Or something. But *not* interspersed before, between (upper vs lower case) and after letters.
mpbk - leading zeroes are something programmers would think about, not users.
Stuart Dootson on December 13, 2007 04:54 AM@Michael Starke
Because when I take my camera out for a day out I can easily take over 100 pictures. It is not a fun job naming them all, and is quite labor intensive, despite the tools we are given. Instead I just whack them in a folder with a name and date that sums up that day.
Implement NaturalSortComparer as an extension method for IEnumerable.
Job Done.
Stuart Cam on December 13, 2007 05:02 AMIf you want the same sort order as explorer just use the same function explorer use, it's already there.
Here is "natural sort" in 6 lines:
First, somewhere in your class import StrCmpLogicalW:
[DllImport("Shlwapi.dll", CharSet = CharSet.Unicode)]
private static extern int StrCmpLogicalW(string psz1, string psz2);
And then just use it with Array.Sort:
Array.Sort(myStringArray, delegate(string l, string r)
{
return StrCmpLogicalW(l, r);
});
That's all.
NirD on December 13, 2007 05:04 AMSorting is also a problem with internationalization.
Hebrew is written right to left for example...
You also have to think about equality and comparison.
How do you sort between z1.txt and z01.txt?
If that's the best we can do for the internationalized version I don't even want to start. Basically it relies on having a dictionary of all possible special characters you might encounter. I think such a feature might actually exist in unicode (to find out if something is just A with an accent) but putting the dictionary together yourself is code worthy of posting on the daily WTF. :-)
wpp on December 13, 2007 05:25 AMI think that you are correct in certain instances. Sure, it would be preferable to do a natural sort, but in most instances an alphabetical sort is just fine. SQL Server does all of its sorting by alphabetical order (they refer to it as Dictionary Order) and I would argue that this sorting is what is used quite often when displaying data, especially on the web. I honestly think the problem is nowhere near as big as you are stating it (since it only affect leading or trailing numbers), but in the case of a file manager, I can see where it would be a problem.
Justin Etheredge on December 13, 2007 05:27 AMWhat's all this fuss about filenames, anyway? I think filenames are becoming increasingly irrelevant to users, you know, regular, user-type users. Users dont care about filenames, but about content. It's all tags and metadata and stuff.
I think natural sorting is an improvement for displaying filenames, but it's about a decade too late
Rik on December 13, 2007 05:28 AMI think there is a much deeper problem here. They need to sort their file names... (Almost) Human like sorting of file names is a non-solution to a much deeper problem (very common actually).
Also, these functions will do an almost job as they will only work for us and with the numbers at the end and might or not work with extensions. It sucks and is not constant. Users will use it but will still have to adapt (albeit in an easier way).
So why do they need to sort those hundred and twenty files again?
JF on December 13, 2007 05:28 AMThe problem isn't well defined.
z1.txt and z01.txt were mentioned by Tsahi, but what about zz.txt and z1.txt?
With that Python code, I think the first are going to stay in the order in which they were, and the second order is undefined, as far as I know.
Remco Gerlich on December 13, 2007 05:33 AMFirst of all, the problem is not ASCII sort order or whatever. It's called "sorting with lexicographical order". Calling it ASCII sort only muddles the issue (a different character encoding will no solve this issue).
Someone has already mentioned earlier, the problem is that you are conflating the sort operation with the element comparison operation.
Yes, users should not be expected to know or care. They want "natural sorting". But we as programmers should be expected to know the difference.
A programmer should know what comparison sort is (and the difference between it and other types), and how it's composed of discrete comparisons.
A programmer should know more than just to use "items.Sort()".
Asks a bunch of users to sort physical objects (or a list of them). They will all come up with different schemes. There is only one that makes sense in the case that was described and its sorting by object number of a specific type.
. on December 13, 2007 05:42 AMIf you want to sort considering international alphabet you should add also a collation algorithm: http://en.wikipedia.org/wiki/Unicode_collation_algorithm
Eduardo Diaz on December 13, 2007 05:52 AMThis statement bugs me, and I'm sort of surprised that no one as brought this up:
"Perhaps a more human-friendly natural sort option should be built into mainstream programming languages, too"
The programming language is NOT the collection of standard libraries or framework that are generally provided with a compiler. This statement would lead one to believe C# and the .NET framework are one and the same, or that C is not C unless you have implementations of stdio, etc. Too many people already think like this, so please don't perpetuate this point of view.
Daniel on December 13, 2007 05:54 AMThe Python implementation use some methods, like isdigit(), split,...
if you implement a few routines its possible to write C, C++, C#
code to natural compare strings using a couple of lines!
Here is a simple algorithm:
//1., 'ax103cool' & 'ax2008hot' -> 'ax'
offset = GetCommonStringPrefix(strA, strB)
No(offset '103', '2008'
endOfsA = AreAnyNumbers( strA +offset )
endOfsB = AreAnyNumbers( strB +offset )
No: use the usual compare routine
Yes: CompareNumbers( strA +offset, strB +offset, endOfsA, endOfsB )
"Perhaps a more human-friendly natural sort option should be built into mainstream programming languages"
Enter Tcl's "lsort -dictionary". Not surprising from a language whose aim is to be as human-friendly as possible.
> lsort {a1 a10 a100 a2 a3 a20 a30 a200 a300} ; # default ASCII sort
a1 a10 a100 a2 a20 a200 a3 a30 a300
> lsort -dictionary {a1 a10 a100 a2 a3 a20 a30 a200 a300} ; natural sort
a1 a2 a3 a10 a20 a30 a100 a200 a300
See how life can be simple?
BTW, Tcl's license is BSD-style. Help yourself, grab the code.
El Fredo on December 13, 2007 05:58 AMFor differences in Human-sorting, consider the issue of surnames that start with Mc or Mac (e.g. McDonald, MacFarlane, McFarlane etc). A common problem here in Scotland.
Some organisations/individuals will sort "Mc" as if it reads "Mac" (I believe British Telecom do this in the phone book).
Others treat them separately.
Others place MC and Mac before all other names starting with M.
Others ignore the construct altogether and would file McFarlane under "F"!
Likewise, if we are sorting 'sensibly' then we also need to consider hyphenated and non-hyphenated words (a human would put "e-mail" in with the rest of the "email").
It is worth looking at Schwartzian Transforms for this:
http://en.wikipedia.org/wiki/Schwartzian_transform
Joe Cheng offers some examples of writing Schwartzian Transforms in C# 3.0:
http://jcheng.wordpress.com/2007/08/23/schwartzian-transform-in-c-30/
Well windows 2000's explorer didn't sort file the natural way, so i've learned to add leading zeros..
Manu on December 13, 2007 06:09 AMThere are even thornier issues about sorting and searching when a user finds that Etienne is in the E section and Étienne is listed past the Z area.
Any special language characters require special consideration to deal with and if you have more than one language set possible as input values then get set for some fun.
Hey Now Jeff,
This is an issue we all see often just like the when my focus is stolen. (http://www.codinghorror.com/blog/archives/001011.html) I like both of these posts since it's an issue we all deal with & notice.
Coding Horror Fan,
Catto
My solution for the number problem:
http://undev.spaces.live.com/blog/cns!6526A337A687E45D!142.entry
Of course realizing domain concerns may cause you to write your own alogorithm like Graham suggests. Most times you don't need that level of control.
Josh on December 13, 2007 06:15 AMThe filename isn't a good interface to photo tracking anyway. @Ross is right when he says it's no fun (re)naming the photos after a day of taking them. Even if you could name them as you were taking them, you wouldn't, it would interrupt the adhoc nature of taking photos. And names like birthday-cake.jpg, football-cheerleader.jpg and zoo-animals.jpg like @Michael Starke suggested arn't good either, those names are hardly descriptive enough to find one picture a group of files of any useful size (would you really take ONE picture that had MANY animals in it on a trip to the zoo anyway?)
The proper interface for sorting photos is VISUALLY, since photos are visual media. Displaying photo file icons as thumbnails in the file manager was a huge step in the right direction there. But it's still deficient. The photo file manager should work visually too, letting me move files around to "sort" them and remembering those locations (the big grid of photos in many photo management apps doesn't really help, except to browse all photos -- and improperly mirrors the real-world "contact sheet" concept which exists for different purposes).
But back to the issue: sorting. Users are ambiguous in what they want, or they don't even know what they want often times. I've had to handle support calls of the form (not recently, thank god):
"I can't find the file I saved yesterday, can you help me find it?"
"Sure, what did you name it?"
"I don't know."
The filename is THE interface to file management currently. I'm not sure there is much hope, unfortunately, for users who are unable to adjust to that by now (not saying it's perfect in 2007, but if you are going to be productive with certain tools (computers), you should know what's important, especially when the interface arguably sucks).
Our goals should be to make their jobs easier (but easier does not mean "requires no thinking"). This means, in many cases, providing MORE easily accessible options. Changing the sort algo should maybe be provided as part of the sorting interface. Click column headings to change the sort order (asc or desc), right click to select a sort algo for that column of data. Can't navigate your files when sorted ASCIIbetical? Change it to natural order. Still not finding it, then change to an algo that sorts by trying to interpret date formats within the filenames (prime candidate for providing after-market "sort algo plugin" functionality -- since you can't predict what users are going to want, so enable someone else to build it).
Some people think giving users FEWER options is better (uh, ahem, a lot of the Gnome interface), but all that does is make the user change their habits to do things a certain way. It's not about reducing the number of options or hiding options, it's about making them more easily and obviously accessible. It doesn't make sense to force someone to go to Edit | Preferences, then vgrep a huge list of options or checkboxes to change their default sort algo (if even it was an option), but it does make sense to build those options into the UI widget that is the common or accepted endpoint for that functionality.
Rather than debate which sort algo the users are expecting, we should enable users to exploit multiple sort algos and switch between them easily based on their mood or work style.
If you dim the array as a string and use the sort method, it will produce results in the literal way, but if you dim the array as an integer, it usually works in the logical way. (In VB.NET)
Zak on December 13, 2007 06:17 AMYea for Ruby's sort_by method on Enumerable objects!
a = ["x1x10", "x10x1", "x5x5", "b55c5", "x3b4", "x3b2"]
p a.sort_by {|string| string.downcase.scan(/([a-z]+|\d+)/).flatten.collect {|s| s.match(/\d/) ? s.to_i : s } }
Ok, so it's a bit more complicated than that (of course). Wasn't taking into account non alphanumeric characters:
p a.sort_by {|string| string.downcase.scan(/([a-z]+|[^\d]+)/i).flatten.collect {|s| s.match(/\d/) ? s.to_i : s } }
It's probably still not perfect, but it's workable now, at least.
An interesting side-effect of the Explorer "human" sort is that some programs create numbered files without the leading zeros because they know they'll be shown in the right order in Explorer. Then you bring that into another program, and you have the issues.
Scott on December 13, 2007 06:23 AM@Michael Starke & Paketep:
There are no stupid users, only stupid developers. Forcing users to do things a specific way or refusing to recognize that a users way of seeing things could be right is arrogant and blatantly idiotic. If a user wants to name his or her files yyyyy1, yyyyy2, etc. who do you think you are to tell they are wrong? You should not gently be prodding anyone or anything except your own ignorance and arrogance.
Right, sorry about the flaming, but the above posts came across so mindnumbingly stupid that it couldn't be helped. I'll get my coat.
Fake
Fake51 on December 13, 2007 06:26 AM>the gnarly i20n issues I've hinted at
I've seen (and created) more than my share of internationalization bugs, but this self-referential mis-shortening (should be i18n) is exceptionally sweet!
Mark Nelson on December 13, 2007 06:32 AMWhy all the drama about sorts? Sometimes it makes sense to sort it one way, sometimes the other. There's no way a programming language could intrinsically know when either is appropriate. So occasionally you need to insert a few lines of code. Oh horrors.
It's not like you need to re-invent the wheel each time either. Most shops have a pre-written function or at worst code to post in.
Next are we going to mull over the tragedy of lower/upper case conversion?
Karl on December 13, 2007 06:35 AM>Why are they choosing file names like that though?
Because they feel like it. Because they are mad. Because they are masochists.
Since when do they have to explain why they want something?
They are the users. They want something. It is not your business to know why or to tell them what they need. If you think you know better (and many times you really do know better what they need) then by all means do it, implement it, talk to them, show it to them, but you have to convince them to accept your suggestion. They dont have to convince you for the necessity of their demand.
As programmers It is our business to offer the best but in respect of the limits users pose to us no matter how we judge these limits. We can always warn or suggest or even refuse to do something we dont want for whatever reason, but we can never force the user to accept our suggestion. In IT like in life, purpose DOES NOT justify the means.
If software was politics, you would be a 'fasist'. (No offense)
In fact every kind of programmer can be described as a politician. Jeff do it. Write about it. Please. It will be fun.
What's iinternationalizationn?
JJM on December 13, 2007 06:39 AMIanG on Tap (http://www.interact-sw.co.uk/iangblog) just blogged about this this morning. His solution: http://www.interact-sw.co.uk/iangblog/2007/12/13/natural-sorting
Max on December 13, 2007 06:42 AMI think a lot of you have missed the point here. The post was not about filenames, it was about sorting *strings*. Natural sort makes more sense than ASCIIbetical in almost all situations.
Our bug tracking system sends us emails about bugs with a subject like "[ProjectName 10] Something went wrong". A natural sort on subject gives me bug 9 next to bug 10, rather than seperated by 8990 other bug reports.
Making a room booking application? Why isn't F9 before F10 in the room list?
Also, the most important thing to remember when deciding wether to "fix" this sort of problem is not wether or not the user is being stupid, but wether making the change will make your users happier and more productive. Forget everything else.
Mat Scales on December 13, 2007 06:44 AMIn case anyone is interested: My opinion on the matter is sufficiently described in http://nikolai.prokoschenko.de/archives/161 (technorati is slow to update)
Nikolai on December 13, 2007 06:44 AMLeading zeros have the problem that they limit the collection size at 10^(z+1) - 1, and people just don't like the idea of having such a limit, because they don't know what to do if they change their mind. It's like line numbers in BASIC: you shouldn't have to deal with it yourself.
Also, normal users usually don't deal with mass renaming and Windows Explorer is not everything and the kitchen sink for renaming (which is a design decision). It's much easier to solve this using a tool like Lupas Rename, which has (for most users) a daunting complexity, if users even know such a tool exists.
@ people calling it a case of "stupid users":
Little natural language tricks like extracting the address from an e-mail signature, the name below "regards, xxx" and understanding that "thursday next week" means Date.Now().AddDays(7) are what adds UI polish to your application. It's the "they thought of -that-, too!" that's sorely lacking in a lot of applications.
*DO* let Ned's 10-liner tempt you. You know you want to use Python. Embrace the Python. It is your destiny.
Daniel 'Dang' Griffith on December 13, 2007 06:49 AMThe leading zeros perspective is interesting. Apartment and office buildings will commonly number the units using the hundreds place for the floor and the tens and ones place for the unit number (limiting the number of units per floor to 99, a seemingly arbitrary limit, albeit practical), making the first nine units on the third floor as 301 through 309. Why aren't these 31 through 39?
Axel on December 13, 2007 07:00 AMAs a data warehouse developer, I'm no stranger to the way business people expect to see things sorted, and it's never ASCII... One thing I do is to add extra columns to my databases and put in "corrected" versions of the keys I'm using, so my table might look like:
Customer Number|Customer|SortKey1
MS948|Fred Smith|MS00000948
MS9215|John Star|MS00009215
..which means I get to use SQL's standard ASCII sorting (indexed for speed, if need be) rather than muck around with anything in client applications. The client presents the real key to the user, but sorts by the pre-calculated ASCII-sort-safe key, and I only have to do the hard work once, up front.
Matt Gibson on December 13, 2007 07:04 AM@Rob Janssen I wholeheartedly agree. I got an email the other day and was mightily impressed when Apple's Mail.app detected the words "7pm next Tuesday" in it and wrapped them in a subtle little mouseover drop-down that prompted me to create a new appointment in my calendar for the right time and date, with the title being the subject of the email. Now _that's_ impressively user-friendly and integrated...
Matt Gibson on December 13, 2007 07:08 AMWhat about z1.2.txt? Do these natural sort algorithms handle decimals? :)
troll on December 13, 2007 07:16 AMShouldn't any user-facing sorts be done with the locale's collation function? Isn't this its responsibility?
Robert Fisher on December 13, 2007 07:16 AM@Matt: I'm constantly pissed off by Google Mail, which considers any long number a FedEx tracking number -- in Germany. This whole heuristics is entirely context- and language-specific and it needs better means.
Nikolai on December 13, 2007 07:25 AMknown Problem, easy to avoid. If i list such a directory in shell, i just use ls|sort -g. My approach implementing this would be, to determine first, which sort is wanted (ask the user). Then sort ascii as long as there are non-number chars, then sort the numbers as ints.
allo on December 13, 2007 07:29 AMNumeric collation in Cocoa, and the sort order in the Mac Finder is a result of the property UCOL_NUMERIC_COLLATION being set on an ICU collator. ICU (available in C/C++ and Java from icu-project.org) underlies MacOSX collation, and Apple contributed this functionality to ICU.
This will also work with any non-ASCII digits, arabic, indic, etc.
Srl on December 13, 2007 07:34 AMAs was mentioned by David Cumps early on, PHP has had natural sorting (natsort) for quite some time. The earliest comments for that function are from May of 2003.
More than enough time for Microsoft to have copied yet another idea... errr... have invented natural sort for .Net.
Same for Ruby and all those other lame languages that you know Jeff.
And I use the word lame because as you said:
"None of the common computer languages (that I know of) implement anything other than ASCIIbetical sorts."
And PHP cannot be called a non-common computer language.
rustyvz on December 13, 2007 07:35 AMThis is the implementation I always use, is very good IMO.
http://www.codeproject.com/KB/recipes/csnsort.aspx
Iceman on December 13, 2007 07:41 AMBTW, some digital cameras are smarter than others. Read about it here: "http://www.bbspot.com/News/2005/11/sony_photo_sharing.html".
Daniel 'Dang' Griffith on December 13, 2007 07:43 AMpaketep you surely make a great programmer with that attitude hahahah
LOL on December 13, 2007 07:51 AMMore about StrCmpLogicalW on MichKa's blog:
http://blogs.msdn.com/michkap/archive/2005/01/05/346933.aspx
Probably I'm the only person here who has this problem, but try using Windows Explorer to work with a long list of files whose file names are GUIDs. Whatever order Windows Explorer uses to "sort" those, it sure ain't human-intuitivable.
Programmer 1's solution: "Why are you using GUIDs for file names?"
Programmer 2's solution: "They're sorted, you just don't see how."
Programmer 3's solution: "Who needs sorting, anyway? You can just search the files."
Programmer 4's solution: "Windows Explorer is stupid."
Beep! Sorry, you are not a winner. Please play again.
mike on December 13, 2007 07:57 AMOkay, we all agree that "z4.txt" should come before "z21.txt", but what about "z4foo.txt" and "z21foo.txt"?
We could make a general statement that files should be sorted numerically if they end in digits, but what about "z4-foo.txt" and "z21-foo.txt"?
Or a more complex example:
abc-23-foo.txt
abc-d43-foo.txt
abc-d2-foo.txt
abc-100a-foo.txt
abc-12-foo.txt
abc-12a-foo.txt
Be very careful with what you wish for. You might actually get it.
David on December 13, 2007 07:58 AMJava has had locale-specific sorting for a long time, but the default is to use the Unicode equivalent of ASCII order. To sort an array of Strings in Java properly you have to say:
Arrays.sort(myStrings, Coallator.getInstance());
rather than:
Arrays.sort(myStrings);
The singleton Coallator.getInstance() yields the default comparator for the current locale. So running it in different countries yields a slightly different sort order.
David Leppik on December 13, 2007 08:03 AMI think everyone, including the original ranter, has missed the programming issue.
Alphabetical sort order is legitimate, as is natural order, date order and lots of others.
Differing users will want a different sort order out of the exact same app, AND field. An ideal app would choose a default and give the user the option to change it.
You can sort by file Name in Explorer, and it sorts in natural order. But what if I want straight alphabetical?
An ideal Explorer would let me right click on the Name header, and alter the specific sort order for that column.
Some memes are toxic.
Most of the commenters still consider this a sorting issue, or a file name issue, when in fact we know by that the root problem is the relative order between two strings.
To put it simply, we are looking at string comparison.
I also vote for the "not part of the programming language". Such a string comparison should (perhaps) be part of the standard library, or framework, but definitely not part of the language per se. Especially because it's just so ill defined.
By the way, I am interested. Do the various implementations mentioned support "natural order" for Arabic digits?
M on December 13, 2007 08:24 AMLong lists of GUIDS hmmm. Now there's something that the natural sort barfs on. It doesn't know to sort them in hex like ASCII sort would.
Joshua on December 13, 2007 08:24 AMI'm amused by some of the "one line program" examples listed. I have a one line program: SortListByLastNameAndPrintThenSortByFirstNameAndPrint(phoneBook); Some of us seem to be forgetting that implementations of function require lines of code. The source lines in some library functions can and should ignored in the LOC count. Other library functions should be included in the LOC. Where do we draw the line? Vendor supplied libraries? Partner company supplied libraries? Guy-in-the-next-cube libraries? My own libraries?
LOC = lines of code
In the old days, when we used to write our own solutions, libraries, and routines, we would devise a specific sort routine to suite the task. Now, now that we are being reduced to library “users” and block pushers, I’m too lazy to even consider writing one in c#.
Steve on December 13, 2007 08:33 AMmmm php natsort()
http://us2.php.net/manual/en/function.natsort.php
Ruby version similar to the mentioned Python short:
module Enumerable
def sensible_sort
convert = lambda {|text| Integer(text) rescue text }
alphanum = lambda {|key| key.split(/(\d+)/).map(&convert) }
sort_by(&alphanum)
end
end
%w{a1 a2 a3 a10 a11 A12 A13}.sensible_sort # => ["A12", "A13", "a1", "a2", "a3", "a10", "a11"]
The convert lambda would be better off comparing the text with a regexp or so to ensure it's a digit; exceptions are slow. But they're prettier here, and it's fine for short lists.
Thomas Hurst on December 13, 2007 08:37 AMIf you listen, normal people have novel ways of thinking about computers that are completely natural and that we've entirely lost sight of from long habit. When I was teaching my Mom to use a computer (a Mac) I had a hard time explaining to her why she couldn't name all her pictures of me "Jerry." "Well, if they were all named Jerry, how would the computer know which one you meant?" "Why, the one I just clicked on, of course," she said, giving me a funny look. After all, the images have icons that are thumbnails of their contents, so it's perfectly obvious to her which one she wants!
And you know, that sort of logic is very hard to refute, and it's perfectly feasible for a GUI-driven computer to work that way. Now, programmers understand why any chunk of data that needs to be accessed needs a unique key, and find that natural, but that key doesn't necessarily need to be the file's name, nor does the name need to be the primary way users identify files. Every file has an inode number or other unique file system identifier anyway that could be used behind the scenes, and that's actually a more persistent way to refer to the file programmatically than the name (which is why the Mac uses it for aliases).
My mom did eventually accept that each file in a folder needed a unique name and that if she wanted to merge two folders she might have to rename some files to avoid losing them entirely, which of course, if you think about it from her perspective, is a dumb outcome from a dumb requirement. So now she's one of us.
Naturally, it was also obvious to her how files should sort, and I was happy to install an extension that did that for her on her Mac OS 9. At least I could solve that problem.
Jerry Kindall on December 13, 2007 08:51 AMI find that "natural" ordering painfully annoying.
Why?
It does not work with hexadecimal numbers
Each implementation is different, so I _never_ know where the file will end up. ASCII may be unintuitive, but at least it is predictable...
Yes, there are cases where it is clearly preferable (well actually for me the only case is utterly idotic programs that do not put leading 0 in file names) but overall it has been more of an annoyance to _me_.
Or to summarize: this is basically UI desgin. And in UI design, if you think something is "obviously" better then you did not think about it enough.
I have to half agree with the pro-ascii sentiments. If our users are adding a numbering scheme to their file listings, then we as programmers have failed to give them the provision to add that kind of metadata in the proper space. Attempting to cover every possible numbering scheme the user could come up with would provide very difficult, but a simple drag-and-drop provision that remembers the way the user sorted the files would remove that numbering need all together.
Bill on December 13, 2007 09:22 AMFirst of... great blog. This post is genius and hits at the very problem with programmer. They are not designers. As a programmer you need to be an architect as well as designer. Its not what is easiest to program, but what is the best experience for the user. I completely agree that natural sorting should be a feature of all sorts - and default for the end user.
VelkyMX on December 13, 2007 09:44 AMusing System;
using System.Runtime.InteropServices;
class C
{
[DllImport("shlwapi.dll", CharSet=CharSet.Unicode)]
extern static int StrCmpLogicalW(string x, string y);
static void Main()
{
string[] names = new string[] { "string200", "string1", "string10", "string4", "string2", "string100", "string20", "string40" };
Array.Sort(names, StrCmpLogicalW);
}
}
"Users will inevitably complain that their items aren't sorting properly, and file bugs on these "errors". Being card-carrying members of the homo logicus club, we programmers produce a weary sigh, and try to keep any obvious eye-rolling in check as we patiently inform our users that this isn't an error."
I think the problem really comes back to this. We're basing decisions about how to interface with the user on implementation details. Users have no knowledge whether we use built in sort or implement our own sorting/comparison algorithms. So, the requirement is on us to make sure that our software interfaces the correct and natural way.
Not that I always succeed, but I'm always trying to think like a user first, then find an implementation to provide that behavior. Too often, we tend to think like a developer, then provide a user interface to that implementation. This goes for natural sorting, but also the rest of the UI!
Steve Bush on December 13, 2007 09:53 AMAs usual, for any given problem there is a solution in python which is elegant, succinct and _fundamentally broken for corner cases_.
What does your four-liner do with numbers that don't fit into an int?
Hacking something together for a one-off script is one thing. Writing robust code that will stand the test of time and handle any case the user throws at it is rather more difficult. It should not be a surprise that such a beast needs 40+ lines of code to get it right.
Thank you for that awesome post, Jerry. It really helps keep things in perspective.
A lot of programmers seem to get easily offended at the slightest implication that the "dumb" end-users actually have a greater awareness of their own computing needs than said programmers do.
Axel also makes a brilliant point about the filenames of photos (to use that example) being essentially unnecessary for users to deal with. How did our parents view, sort, and catalog their photos? The albums with my baby pictures have four photos per page, with the dates written underneath. And the photos are generally arrange in chronological order. The standard Windows Explorer list, details, or non-thumbnail icon views (or DOS "dir" listing) of photos is utterly useless for the overwhelming majority of tasks. Unless you're writing code that needs to know the disk-based filenames of photos, then the filenames are just extra clutter to have to deal with.
I long for the day when I don't need to be exposed to any file-system baggage to organize, view, or work with digital photos. The metaphor of the physical photo album is likely the most natural means of access for anyone who is not a programmer. Likewise, I long for the day when I can organize, copy to my iPod, and play all of my music and video media without ever having to think about the directory structures and filenames. iTunes doesn't really cut it yet, and neither does Windows Media Player.
And I'm not a "dumb" or "casual" user. I've been assembling and troubleshooting PCs for 20 years, I've worked doing systems integration and tech support, and I've been programming for the past 8 years (not including high school, as I didn't program for about 6-7 years in the middle). I used to be proud of my command-line DOS-fu. Now I just have better things to do with my time, and I think that computer programs should work the way that the majority of users expect them to.
Chris Lasting on December 13, 2007 10:03 AMUsing one of the string compare algorithms that have been mentioned here in a nice extension class can get you code like this without having to rewrite your comparer every time:
string[] testItems = { "z24", "z2", "z15", "z1", "z3", "z20", "z5", "z11" };
testItems.NaturalSort();
foreach (string item in testItems)
{
Console.WriteLine(item);
}
It's one line of code and what could be more natural than
testItems.NaturalSort();
Here's the extension code I played with:
[edit: a bit too much code for a single comment. Sorry. Removed. Please host on a blog entry if you're posting many lines of code! ]
Tyler Jensen on December 13, 2007 10:07 AMSilly rabbit, 'sort' is in the eye of the beholder, not the coder.
Steve on December 13, 2007 10:08 AMThis happens to be one of the very first Perl golf challenges (see http://www.xs4all.nl/~thospel/golf/challenge.html). It can be done in 58 characters. Not that I'd recommend this approach in production code...
Stephen Turner on December 13, 2007 10:21 AMIt's worth noting you can take hundreds of files and rename them with leading 0's in less then a second.
An app I use for this purpose from time to time is FlashFileRenamer.
If you can get your code to do what you want, thats totally cool too. But renaming the files might sometimes be a good solution.
From their site:
"Add number sequences (001, 002, 003, ...)
Pad existing numbers with zeros to make them sort correctly. "
Screenshot of that function:
http://www.rlvision.com/flashren/screen_enum.gif
More Screenshots:
http://www.rlvision.com/flashren/screenshots.asp
Mike on December 13, 2007 10:37 AMSo using .NET 3.0 Method Extensions, instead of List.Sort you can use List.NaturalSort, hows that?
using System.Runtime.InteropServices;
public static class ListExensions
{
[DllImport("Shlwapi.dll", CharSet = CharSet.Unicode)]
private static extern int StrCmpLogicalW(string psz1, string psz2);
public static void NaturalSort(this List list)
{
list.Sort(StrCmpLogicalW);
}
}
Eric Malamisura on December 13, 2007 10:45 AMprivate static readonly Regex regex = new Regex("^[0-9]+", RegexOptions.Compiled | RegexOptions.CultureInvariant);
static int NatureSortComparer(string a, string b)
{
int indexA = 0, indexB = 0;
while(indexA<a.Length && indexB<b.Length)
{
if(a[indexA]>='0' && a[indexA]<='9')
{
if (b[indexB]<'0' || b[indexB]>'9') return -1;
string strNumA = regex.Match(a.Substring(indexA)).Value;
string strNumB = regex.Match(b.Substring(indexB)).Value;
int numA = int.Parse(strNumA);
int numB = int.Parse(strNumB);
if (numA > numB) return 1; else if (numA < numB) return -1;
indexA += strNumA.Length - 1;
indexB += strNumB.Length - 1;
} else if(b[indexB]>='0' && b[indexB]<='9')
{
return 1;
}
if (a[indexA] > b[indexB]) return 1; else if (a[indexA] < b[indexB]) return -1;
indexA++;
indexB++;
}
if (indexA > indexB) return 1; else if (indexA < indexB) return -1;
return 0;
}
http://rafb.net/p/C6odtX29.html
Sachman Bhatti on December 13, 2007 10:45 AMMichael Starke: "I can think of only a few situations (though admittedly I may be missing some) where sorting ASCIIbetically produces counter-intuitive results. "
How about addresses? 9 Fake St. should come before 10 Fake St. shouldn't it?
Dealing with this right now, actually. Well... ignoring for the moment to be more accurate. ;)
Are we allowed to use the code people post here?
The reason I posted the answer, and an explanation for why I just wasted 20 minutes on a problem I had no practical reason to solve:
Telos: You can use my code, anyone can - it's public domain. It can be improved though, and most importantly culture sensitive.
When I started as a developer, I used to be proactive, rather than reactive. That is, giving the users what they want before they wanted it. Yes, it's not the actual definition of proactive, but it was my boss' definition of it, so there.
Eventually, I gave up on this for one very simple reason. When precognizing what the user wanted, 99% of the time I got it wrong. So, I started asking the users to tell me what they wanted and verified what they told me was correct, and I gave it to them. Guess what happened. They either didn't use what they asked for, or said it didn't work right and called the whole thing broken because of it. That was my epiphany; users are not stupid, they just don't really know what they want.
Later, we hired business analysts whose job was to talk to users and write up design specs that everyone involved signs off on. Then, we coded to those specs. I totally regret not coming up with that earlier in my career. It would have seriously reduced my stress level and improved my overall health.
Buck on December 13, 2007 10:50 AMJeff, you are mistaken on this issue. This has nothing to do with ASCII encoding. This is not an issue of "ASCII versus natural sort" - it is LEXICOGRAPHICAL ("dictionary-type") sort versus natural sort.
Of course, you don't have to believe me:
http://en.wikipedia.org/wiki/Lexicographical_order
The fact that letters and numbers are encoded in ASCII (or Unicode, or whatever) is irrelevant. Think about it - even in the "real world", where you don't use ASCII, letters and numbers have to be ordered SOMEHOW, otherwise how do you determine the order of 2 strings such as "foo1" and "fooA"?
The issue is that lexicographical sorting compares only *one character at a time*, whereas natural sort groups numerical substrings. For example, with lexicographical sorting, "foo123bar" would appear before "foo2bar", because "foo1" < "foo2" ("1" < "2"). See? Nothing to do with ASCII. Even a non-techie would agree that 1 is less than 2. With natural sorting, numerical substrings would be grouped, so the partial comparison would be "foo123" > "foo2"; therefore, "foo2" would appear before "foo123" in a natural sort.
Anyway, if anyone needs a natural string compare function for C, here you go:
http://sourcefrog.net/projects/natsort/
> I've seen (and created) more than my share of internationalization bugs, but this self-referential mis-shortening (should be i18n) is exceptionally sweet!
Yes, it's a good one. :) fixed.
> What does your four-liner do with numbers that don't fit into an int?
There are a lot of magical things about the four-liners that make me nervous..
> alphanum_key should be alphanum for this example?
The original commenter changed the name and I didn't catch it. Fixed.
Jeff Atwood on December 13, 2007 11:08 AMJeff, nobody here has yet pointed out that you messed up the definitions of "alphabetical", "ASCIIbetical", and "natural" sort orders.
"Natural sort order" is an empty phrase, essentially meaningless. Programmers often use it to mean "the order you get when you convert everything to an integer or a sequence of bytes, and sort those". Users often use it to mean "the order I'm thinking of right now".
"ASCIIbetical" sort order is the order you get when you treat a string as a sequence of ASCII bytes, and sort those. It is the order given by strcmp() on an ASCII machine, in which "100" < "25" < "Banana" < "apple" < "banana".
"Alphabetical" sort order is alphabetical sort order, in which "apple" < "Banana", and "Apple" and "apple" sort next to each other. It doesn't say what to do with numbers or other non-letter characters. It is a part of the order given by strcmpi() in libraries that have it (although strcmpi() has to do something consistent with all the non-letter byte values, too, of course).
Then there's the librarian's sort, in which "apple" < "Banana" < "eight" < "ninety-nine" < "25" < "twenty-four" < "zero"; it's easy to get computers to do this, but I don't know of any standard libraries that support it out of the box.
What you want, in which "2" < "100" < "Apple" < "ninety-nine" < "Z100", is a kind of unnatural hybrid of all these methods. I assume Perl supports it somehow (after all, Perl is the language in which "z99"+1 = "aa00"), but as other people have commented, it's not the sort of thing you ought to be trying to do in the first place.
Anonymous Cowherd on December 13, 2007 11:09 AMAnonymous Cowherd, as I pointed out, the term "ASCII sort order" or "ASCIIbetical" is wrong. We are talking about lexicographical sort order (only 1 character is compared at a time), and it has nothing to do with ASCII.
You could use strncmp (case-insensitive compare) for your sort comparison function, and you would still get case-insensitive lexicographical order. e.g. "abc1, ABC234, abc3"
Natural sort order does mean something: it means "modify the lexicographical sorting algorithm so that numerical substrings are appropriately grouped and compared, rather than always comparing one character at a time".
Will on December 13, 2007 11:14 AMIt's not a sorting problem, it's an ordering problem. A sorting algorithm knows nothing about the data except a way to compare two elements. Most programming languages, of course, get this right. Having a natural_sort function is the braindead solution (no wonder PHP has one).
So why there isn't a 'natural order' compare operator for strings, you might ask? Simple answer, because lexicographical ordering (no, there is no such thing as ASCIIbetical ordering, thanks for letting us participate to your collective ignorance) is a well defined mathematical concept, 'natural order' is not. Yeah, we got that x-10 goes after x-9, where does x-cover go? Before x-1 or after the last element? What is the natural order of hexadecimal strings?
The 'natural order' depends on what you're using your strings for, so yes, you'll have to write it yourself, each time. Asking for 'natural order' is like asking for the 'do_the_what_the_users_thinking' function.
I donated code to the winvn source base (news reader started by nasa) that does this type of character ordered in C. The current maintainer cleaned up a bit for his needs (and to make up for my lack of C skills, I am primarilly a delphi developer, and my C gets mighty rusty) -> You can get the source for WinVN at http://www.marks-lab.com/
I also have the Delphi version I maintain for myself. I should probably post it for people somewhere.
Xepol on December 13, 2007 11:19 AMThis one hits home for me. We have to manage things like lists of addresses, and it really annoys the users when "Unit 1208" goes before "Unit 300". I probably could have solved this from scratch, but looking at some of the Java/Python/Perl implementations made it much clearer what had to be done.
So, for the rest of you, here's a C# implementation that actually works (i.e. produces results identical to Dave Koelle's Alphanum, which I thought was the best of the three), and doesn't rely on the quirky Windows filename sorter:
http://convolved.com/code/AlphanumComparer.cs
There's probably a more "compact" way to write it, using fewer lines of code, but I honestly don't care. I'd much rather have something that's easy to understand and make changes to.
If someone else already beat me to the punch, then I apologize. If not, enjoy!
Aaron G on December 13, 2007 11:21 AMJeff, to be clear, here is an example of lexicographical comparison, which illustrates why it is has nothing to do with ASCII (or Unicode, or any kind of computer encoding.)
Lexicographical comparison tells us to compare ONLY ONE PAIR OF CHARACTERS AT A TIME. When we hit a pair of character that is different, the ordering of those 2 characters determines the ordering of the 2 strings.
Consider the 2 strings "foo123" and "foo2". Which string "comes first", according to lexicographical sort order?
Compare 1st character: "f" = "f" (the strings are still the same, at this point)
Compare 2nd character: "o" = "o" (the strings are still the same, at this point)
Compare 3rd character: "o" = "o" (the strings are still the same, at this point)
Compare 4th character: "1" < "2" (therefore "foo123" < "foo2")
The only difference with "natural sort" algorithms is that numerical substrings are grouped and compared, so the 4th step becomes:
Compare 4th letter/numerical substring: "123" > "1" (therefore "foo123" > "foo2")
Wow, did you see how I made NO reference to ASCII (or even computers) in that example.
Will on December 13, 2007 11:26 AMAlso, the one that says don't need filenames. I'm with you users have no business with file names and directory structure. Unfortunately at this point this would require a major rewrite of almost every existent program, I doubt it will happen anytime soon.
E[X] on December 13, 2007 11:27 AMBy the way, for everyone who thinks "ASCIIbetical" (actually LEXICOGRAPHIC") sort order is stupid, how do you think dictionaries are sorted? Using case-insensitive lexicographical sort order.
http://en.wikipedia.org/wiki/Lexicographical_order
The funny thing is that lexicographical order is pretty much alphabetical order, despite what other posters here have claimed.
The only reason you don't see this issue with dictionaries is that there aren't a whole lot of english words with embedded numbers. For those of you who think "alphabetical order" can say nothing about where numbers should go: Now that "w00t" is apparently a new dictionary word (sadly), how do you think they'll put it in the dictionary?
"What does your four-liner do with numbers that don't fit into an int?"
Python can store arbitrarily long numbers into an "int".
Carlos Rodrigues on December 13, 2007 11:47 AM@Chris Lasting:
"I long for the day when I don't need to be exposed to any file-system baggage to organize, view, or work with digital photos. The metaphor of the physical photo album is likely the most natural means of access for anyone who is not a programmer. Likewise, I long for the day when I can organize, copy to my iPod, and play all of my music and video media without ever having to think about the directory structures and filenames. iTunes doesn't really cut it yet, and neither does Windows Media Player."
Hmm. I do most of my photo work in Aperture, with a smattering of iPhoto (my wife and kids work in iPhoto). I couldn't tell you the file and/or folder names of any of our photos. I simply don't care about the file name (other than the root folder for backups). Talking with Windows friends about photo management ("but how do you organize pictures without meticulously named folders and files including the date and event name???") is like talking to a brick wall. But, for me, this is already a reality.
I'm not sure where iTunes fails you in this regard, though. I don't know or care what my music files are named (well, I suppose I "know" that they are named the same as the track name, and organized in my music folder by artist name then album name ... but on a specific basis, if I have two tracks in the same album with the same name, I don' know/care which one has a "-1" or whatever appended in the file system). When does iTunes make you look at the file system to deal with music?
Tom Dibble on December 13, 2007 11:51 AMThis is one of the typical programmer friendly features you find in open source languages all of the time, yet somehow escape mainstream closed source languages. Such features make so much sense that I want to punch my monitor when they are suddenly absent. For example, MySQL, in its collective, open-source wisdom, has the LIMIT feature. MSSQL(at least 2000, I don't know about newer versions) has a glaring lack of such a feature. Instead, anyone who wants to pull a subsection from a result set has to do a series of order flips using the TOP feature.
Mattkins on December 13, 2007 12:02 PMNot that anyone will actually read this before posting, but *be very careful posting < and > here*. You'll need to encode them as < and &gt; otherwise they'll be interpreted as HTML tags. I know, it sucks, and I need to change it.
I've corrected all the above posts, at least the ones that contained angle brackets.
Jeff Atwood on December 13, 2007 12:07 PMCache (Mumps) means not having to sort.
SET SORTED("WILDFIRE")=""
SET SORTED("BANANA")=""
SET SORTED("DOGBREATH")=""
SET SORTED("ZAGNUT")=""
SET SORTED("JIM BOB")=""
SET SORTED("JIM BOB2")=""
SET SORTED("JI2 BOB3")=""
ZW SORTED
SORTED("BANANA")=""
SORTED("DOGBREATH")=""
SORTED("JI2 BOB3")=""
SORTED("JIM BOB")=""
SORTED("JIM BOB2")=""
SORTED("WILDFIRE")=""
SORTED("ZAGNUT")=""
Disagree.
First, my Windows Explorer (XP) sorts in the expected manner: z1.txt, z10.txt z2.txt, etc.
Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?
pwb on December 13, 2007 12:38 PM"It does not work with hexadecimal numbers"
What? Really? Hex?
Not only do only only you and DEAD people know Hex, but most of them don't use it that regularly.
Seriously, designing a natural sort that caters to Hex is catering to a tiny little niche audience that can and will almost all rewrite the sort method themselves.
"Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?"
Erm, every single person who graduated first grade, and then didn't go into the "wonderful world of edge-conditions". What's the matter with you?
Tom on December 13, 2007 12:42 PMdwb wrote:
"Disagree.
First, my Windows Explorer (XP) sorts in the expected manner: z1.txt, z10.txt z2.txt, etc.
Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?"
-------------
XP defaults to "natural" sorting in Explorer. Someone must have used a registry editor or TweakXP to change that on your computer. It's called lexicographical ordering, "not ascii" sort, and I agree it's the most predictable. There are many ways to implement "natural sorting". (For example, should leading zeroes be ignored? How do you interpret groups of numbers separated by periods? etc., etc.)
Will on December 13, 2007 12:50 PMRuby:
module Enumerable
def sort_nicely
sort_by { |el| el.split(/(\d+)/).map { |s| s =~ /^\d+$/ ? s.to_i : s } }
end
end
Tom wrote:
(Quoting dwb) "Ascii sort is the most predictable. Once you start playing around with that, it will lead to confusion. Who really knows if 10 should be one-zero or ten?"
Erm, every single person who graduated first grade, and then didn't go into the "wonderful world of edge-conditions". What's the matter with you?
-------
Oh, really? Tell me, is "1.10" less than or greater than "1.2"?
If you are talking about decimal fractions, then "1.10" is less than 1.2".
If you are talking about SOFTWARE VERSION numbers than "1.10" is usually greater than "1.2"
What about "1.02.3" and "1.2.3"?
Natural sorting is ambiguous and ill-defined, JUST LIKE NATURAL LANGUAGE - I'm sure there are many implementations, just like there are many idiosyncratic forms of English grammar and spelling depending on what country you live in (e.g. U.S., Britain, Canada, etc.).
For example, the British always treat companies and sports teams as plural nouns (e.g. "Microsoft are an American software company."). Americans switch between plural and singular, depending on what "sounds right" (e.g. "The Miami Heat *are* an American professional basketball team. Miami *is* currently 2nd-last in the Eastern Conference standings. *They* have been struggling to win all season." I'm confused - is/are the Miami Heat singular or plural?)
Will on December 13, 2007 12:56 PMThis discussion is funny.
The problem isn't with sorting. The problem is with the data structure. If you want a hierarchy use one. Make a tree and sort the levels of the tree.
Take this shopping list for an example:
Wheat bread
White Bread
White Milk
Chocolate Milk
It's a list. If i sort the list i'll get:
Chocolate milk
Wheat Bread
White Bread
White Milk
Whaaah /cry the user says 'cause milk is seperated and not sorted propery!
Whaaah /cry the developer says "get used to it" this is the way computers have worked since the iron age.
the problem isnt the sort, the problem is the list. What we want is a Tree:
Bread:
Wheat
White
Milk:
Chocolate
White
Then when we sort we sort each level of the tree.
When you sort the strings:
dumbuser10
dumbuser2
dumbdeveloper10
dumbdeveloper2
If you expect the result:
dumbdeveloper2
dumbdeveloper10
dumbuser2
dumbuser10
Then you are expecting the tree:
levels of the tree.
Take this shopping list for an example:
Wheat bread
White Bread
White Milk
Chocolate Milk
It's a list. If i sort the list i'll get:
Chocolate milk
Wheat Bread
White Bread
White Milk
Whaaah /cry the user says 'cause milk is seperated and not sorted propery!
Whaaah /cry the developer says "get used to it" this is the way computers have worked since the iron age.
the problem isnt the sort, the problem is the list. What we want is a Tree:
Bread:
Wheat
White
Milk:
Chocolate
White
Then when we sort we sort each level of the tree.
When you sort the strings:
dumbuser10
dumbuser2
dumbdeveloper15
dumbdeveloper3
smartdeveloper1
If you expect the result:
dumbdeveloper3
dumbdeveloper15
dumbuser2
dumbuser10
smartdeveloper1
You are expecting a tree:
dumb:
developer:
3
15
user:
2
10
smart:
developer:
1
So fricken use a tree user and developer alike.
omg sorry about that post, i guess i'm dumb user 2 'cause i cut and pasted all fooked
brian on December 13, 2007 01:04 PMAs if you'd ever need to sort hexadecimal in a system with actual users. If you find yourself encountering this issue frequently, it's a strong indication of awful UI design and potentially terrible back-end design.
If some of you are really so concerned about this obscure edge case, put in an heuristic that checks for a string composed of only decimal digits and same-case letters A-F, with at least two characters from each category. One regex and a few lines of base conversion will take care of that. It'll be wrong sometimes, but again, if you're displaying hex to the users then you've already lost the battle.
And uh, brian, do you frequently see people organizing their shopping lists into tree structures? You've raised both the human and machine complexity exponentially, for no practical reason at all.
And Will: I really doubt that you'd often need to sort fixed- or floating-point numbers as a smaller part of an alphanumeric string. If you ever did need to do this, it would be much more likely to be something like a table of contents, where 1.10 actually *should* come after 1.9.
Jesus guys, it IS important to consider edge cases, but that doesn't mean you should scrap the whole idea if a few of those edge cases will produce incorrect results. You just make sure that you've defined this behaviour, and set expectations accordingly. I think what this really boils down to is that a lot of programmers are control freaks and can't entertain the notion that no answer is 100% correct. Instead of coming up with a 90% correct solution, they just bitch, moan, throw their hands up in the air, and give up on the problem altogether. Those people are going to get eaten alive in the real world.
Aaron G on December 13, 2007 01:34 PM"natural sort groups numerical substrings"
Yes, but if "natural" means "the way humans do it", it has to do much more and handle other popular ordered substrings...
Days of week:
•tuesday
•wednesday
•thursday
Date/time:
•15-8-2004
•1-10-2004
•1-1-05
•5:00 AM
•14:00
•4:00 PM
(special tricky for 12:00-1:00 AM/PM)
Street adresses:
•9, James street
•10, James street
•10 bis, James street
•10 ter, James street
•5, Rudolph street
Product reference, or anywhere digits are not numerical, like phone "number":
-01K305
-1K205
Hexadecimal:
0001
F01A
f01b
(0-9 and A-Z/a-z in straight order)
Version:
•v1.9
•v1.10
•v1.10.1
(a period is not a decimal separator)
Sorting with or without leading article (Mr, Miss, St, "The" etc).
etc etc
And then there is the issue that some sorts depend on locale.
And then there is the issue that some strings were built with *other*, untold locale (like filenames).
And then there is the issue that some strings are ambiguous and require looking at the entire set.
And then there is the issue that some sets are fundamentally ambiguous.
And then there is the issue that since the sort depends on the set, it is not stable nor predictable.
If you want to make good human-like sort... good luck.
And in the case of the version numbering... thats a tree:
v:
--1:
----9:
------0
----10:
------0
------1:
--------0
If human want tree... human should use tree.
brian on December 13, 2007 01:42 PMWindows Explorer in your example sorts by file name, not by file name with extension. Hence the difference.
If you apply Array.Sort() to file names only -- you would get result that is pleasant for users.
Dennis Gorelik on December 13, 2007 01:52 PMI have both learned to use padding zeroes in the filenames, as well as written the´sorting a couple of times in my profession.
Hmm.. I bet there's an ISO standard for this somewhere. If people liked communicating across borders more, they'd be less exited about localization and more into internationalization. There is, seriously, no need to have a local way or writing dates. But then again, what do you expect from people who don't use the metric system...
Peterh on December 13, 2007 01:52 PMMusaran, I agree with you totally - natural sorting is not well-defined. I was simply describing how most natural sort algorithms are implemented.
Aaron G, funny you should say nobody needs to sort decimal numbers as a sub-string. This natural sort algorithm actually supports sorting decimal sub-strings by NOT ignoring leading zeroes (although that behaviour can be disabled):
http://sourcefrog.net/projects/natsort/
Before you say "well, that's just some random guy's sort routine", he claims that his code has been incorporated into PHP:
http://us3.php.net/manual/en/function.strnatcmp.php
Actually, this whole discussion boils down to a central issue in human factors: How Software Works (well-defined algorithms) versus What Users Want (fuzzy, ill-defined specifications).
Once AI is smart enough to figure what users really want, we'll all be out of jobs. (The code will write itself).
Will on December 13, 2007 01:54 PMDaniel 'Dang' Griffiths:
Didn't your teacher tell you you'd go blind if you continued to "Embrace the Python".
Maybe you could try "polishing the Ruby"?
On the topic of this post:
One sort approach is never going to work in all situations. Let your users try out the sort with various filenames etc, get feedback from them, change the sort to behave as close to their expectations as possible. Easy really.
Ash on December 13, 2007 02:00 PMAsh wrote:
"One sort approach is never going to work in all situations. Let your users try out the sort with various filenames etc, get feedback from them, change the sort to behave as close to their expectations as possible. Easy really."
------
Yeah, I totally can see how that would work with Windows Explorer.
Each time you open Explorer, Clippy could pop out of nowhere and say "You seem to be viewing a list of files. How do you like the way I sorted them?"
Aaron G wrote:
"As if you'd ever need to sort hexadecimal in a system with actual users. If you find yourself encountering this issue frequently, it's a strong indication of awful UI design and potentially terrible back-end design."
---------
Are you sure nobody uses hexadecimal?
What if your users are developers (e.g. Visual Studio)? Or your application is highly technical (Modbus master/slave, network packet sniffer such as WireShark, debugger, etc.)?
Will on December 13, 2007 02:12 PMPeterh wrote: "If people liked communicating across borders more, they'd be less exited about localization and more into internationalization."
------
The funny thing is, people do it all the time. It's called the Internet.
And what do you think happens when someone posts on an "American" message board using non-American spelling/grammar (e.g. British English treats companies and other groups as plural nouns) or non-American date conventions (e.g. DAY/MONTH/YEAR instead of the American MONTH/DAY/YEAR). They get flamed. People yell at them for using "bad grammar" or "not knowing how to write dates".
Of course, I'm sure this also happens when Americans post on non-American websites.
The point is that people hold on to their little cultural conventions and idiosyncracies as if they were authoritative and universal.
Good luck on shifting the focus from "localization" to "internationalization".
Will on December 13, 2007 02:25 PMHey Jeff, first time commenting. :)
Great blog, your recent blog about the 80/20 actually inspired me to start my own blog.
Saw this one and have to give it my own try:
Quick Java Comparator:
http://simplesql.tigris.org/servlets/ProjectDocumentList?folderID=0
Cheers!
Eric B on December 13, 2007 02:26 PM@Tom:
"What? Really? Hex?
Not only do only only you and DEAD people know Hex, but most of them don't use it that regularly."
That's not the point. The point is what is the natural order for hex strings? The 'natural order' is under defined, you can't have it in the language (or in its standard library).
Only the programmer can decide what is the natural order. No, the user can't, the user sees only what she'd like you have to take into account all your users.
@ Aaron G:
"If some of you are really so concerned about this obscure edge case, put in an heuristic that checks for a string composed of only decimal digits and same-case letters A-F, with at least two characters from each category. "
Oh yes... right... lets put there an heuristcs so that we use two different ordering depending on the elements of the set. Wait... what appens if only some strings are hexadecimal?
"Jesus guys, it IS important to consider edge cases, but that doesn't mean you should scrap the whole idea if a few of those edge cases will produce incorrect results. You just make sure that you've defined this behaviour, and set expectations accordingly"
No, this is the problem. A lexicographical ordering is the same everywhere. The 'natural order' is not, you can't have it as a function in the programming language, it depends on the domain.
If you want to show a sorted list of strings to the user you have to choose one ordering. If, depending on the domain, there is an ordering that is *always*correct* you should implement that. If the list of strings is general enough you have to choose an ordering and screw some of your users. The lexicographical ordering has the undeniable advantage of screwing the users in a predictable fashion with a solid mathematical background.
The natural order is just something you pulled off your @ss, subject to change when you find a "bug".
E[X], well whether you like it or not, strnatcmp seems to be standard in PHP:
http://ca.php.net/strnatcmp
Just because "natural sorting" is ambiguous, doesn't mean someone can't make up a "standard" for it.
@Will
well PHP is also the language where "no one needs namespaces", calling that a *standard* is quite a bit far fetched. Is just the way they think it should be.
E[X] wrote: "well PHP is also the language where "no one needs namespaces", calling that a *standard* is quite a bit far fetched. Is just the way they think it should be."
-------------------
E[X], fair enough.
Natural sorting has its place (e.g. sorting lists of numbers, if the list is in text form, for some reason). Lexicographical ordering also makes sense in a lot of places. Unfortunately, as this blog post and many of its replies show, most people do not even understand the concept behind lexicographical (uh, sorry, should I say "ASCIIbetical") ordering, so good luck explaining to the end-user why his digital photos are listed in the "wrong order".
I agree with the sentiment that computer software should strive to do "what the user wants". I disagree with the knee-jerk reaction that developers and computer scientists are stupid because the language doesn't always help the developer do what the users want.
Will on December 13, 2007 02:57 PMValid point.
Natural sort would increase productivity across America. (White House paying attention?)
Rightnow, workaround is the best:
z001.txt
z011.txt
z100.txt
Certainly computers should always do what the user wants, but in this circumstance I doubt that you would find a consensus amongst users at what the proper order should be anyhow.
Appending numbers to the end of filenames is a dead give away that the file was produced by a machine and in that case the name itself is not relevant. I mean if every single one of your digital camera pictures start with “PIC” and you had dozens of folders with PIC1.JPG then putting them together and sorting them will just end up with a nonsensical order anyhow.
Sorting them by the date would be much more meaningful to the user.
I also agree with one of the earliest posters who said this wasn’t a sorting problem at all but a compare routine problem that the sort program calls.
Davide on December 13, 2007 03:47 PMDavide wrote: "Certainly computers should always do what the user wants, but in this circumstance I doubt that you would find a consensus amongst users at what the proper order should be anyhow."
--------
No kidding, that's sort of the point of this whole discussion.
The problem is that the "human factors" crowd is flaming the "stupid developers" because programming languages don't have a built-in "natural sort" function. Look at the excerpt of the blog Jeff posted:
"Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed."
This idiot doesn't even realize that computers DO sort in alphabetical order. The issue is that the ordering is lexicographical, not that it isn't alphabetical.
BTW, human beings produce numbered lists all the time. Even if you think they don't, your compare function should certainly handle that case, don't you think?
Will on December 13, 2007 03:58 PMIt's not as straightforward as it seems -- there are some really nasty corner cases. Can you think of an algorithm that will do "the right thing" in all of these cases?
1_1 vs 1_10
1.1 vs 1.10
1.1 vs 1.01
1.1.1 vs 1.10.1
12.25.2007 vs 9.9.08
Christmas 2007 vs My 32nd Birthday
If you can't get it right every time, you might be better off doing something that breaks simply and consistently.
So not giving numbers special handling is actually quite a good solution. Treating the digits at the *end* of the filename (photo1, photo2, photo10 ...) also makes sense, but going any further is just asking for trouble.
Peter on December 13, 2007 04:51 PMI see a lot of posted code, but not a lot of test cases.
Jon Abaca on December 13, 2007 05:04 PMOh, and for those of you who still think this is an "ASCII sorting" problem, check this out:
CompareString()
http://msdn2.microsoft.com/en-us/library/ms647476.aspx
This is a Win32 string comparison function that does NOT use ASCII comparisons. It works with different locales and character sets (including double-byte characters), it supports multi-character elements and has options to "intelligently" sort words and it supports case-insensitive comparisons.
However, it will STILL produce the following sorted list, because it doesn't do "natural sort" with numerical grouping:
z1.txt
z100.txt
z1000.txt
z2.txt
Go ahead, try it with strings like "z1" and "z100". You will get the same result as plain old "ASCII-betical" strcmp().
And there IS a built-in natural sort function with numerical grouping in Win32:
StrCmpLogicalW()
http://msdn2.microsoft.com/en-us/library/bb759947.aspx
"Compares two Unicode strings. Digits in the strings are considered as numerical content rather than text. This test is not case sensitive."
While "ASCII-betical sorting" does have a meaning, it is not the problem here.
Oh, and for those of you who still think this is an "ASCII sorting" problem, check this out:
CompareString()
http://msdn2.microsoft.com/en-us/library/ms647476.aspx
This is a Win32 string comparison function that does NOT use ASCII comparisons. It works with different locales and character sets (including double-byte characters), it supports multi-character elements and has options to "intelligently" sort words and it supports case-insensitive comparisons.
However, it will STILL produce the following sorted list, because it doesn't do "natural sort" with numerical grouping:
z1.txt
z100.txt
z1000.txt
z2.txt
Go ahead, try it with strings like "z1" and "z100". You will get the same result as plain old "ASCII-betical" strcmp().
And there IS a built-in natural sort function with numerical grouping in Win32:
StrCmpLogicalW()
http://msdn2.microsoft.com/en-us/library/bb759947.aspx
"Compares two Unicode strings. Digits in the strings are considered as numerical content rather than text. This test is not case sensitive."
While "ASCII-betical sorting" does have a meaning, it is not the problem here.
Not all that hard in Objective-C (at least not for us, don't know how the people over at Apple implemented it): http://developer.apple.com/documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html#//apple_ref/doc/uid/20000149-SW1
Aron on December 13, 2007 05:19 PMWill,
"Clippy could pop up out of nowhere", nice ridiculing of my point. But then you've called other posters idiots and the like so I think your level of class is obvious for all to see.
Also, you completely miss my point. It is simply that the user could ask for sorting by their favorite colour, the magnitude of the planets, dates closest to christmas etc etc. As a developer you must inform them that it is not necessarily the best approach (for example how will other users understand the sort). But if they still want it, it's then your job to give them what they expect.
You're not a Python zealot by any chance?
Ash on December 13, 2007 06:20 PMNice article, but perhaps a bit too pessimistic in saying that 40+ lines of code are necessary. Here's 10 lines of good old C code (code in the { ... } and not including the main()) that uses qsort(3). This assumes, of course, that people all agree that C is the lowest common denominator. If you agree with that premise, similar Java code should take the same or fewer lines of code. I know Jeff explicitly said he didn't want a 1 line code contest, but I find this may be a good exercise for interviewing people.
#include
#include
#include
static int qsortComparatorHelper(const char *v1, const char *v2)
{
if (isdigit(*v1) || isdigit(*v2)) {
int val1 = 0, val2 = 0;
while (isdigit(*v1)) val1 = val1 * 10 + (*v1++ - '0');
while (isdigit(*v2)) val2 = val2 * 10 + (*v2++ - '0');
return (val1 == val2) ? qsortComparatorHelper(v1, v2) : val1 - val2;
} else {
return (*v1 && *v2 && *v1 == *v2) ? qsortComparatorHelper(++v1, ++v2) : (*v1 - *v2);
}
}
static int qsortComparator(const void *v1, const void *v2)
{
return qsortComparatorHelper(*(const char **) v1, *(const char **) v2);
}
int main()
{
char *str_array[6] = { "z1", "p2", "p100", "p200", "p3", "p100a" };
int i;
qsort(str_array, 6, sizeof(char *), qsortComparator);
for (i = 0; i < 6; i++) {
printf("%s\n", str_array[i]);
}
return 0;
}
Leading zeros breaks down every time you need to increase the number of leading zeros in your file names, so it's at best a temporary fix.
Even if the file names aren't computer generated, people may use temporary names with the intent of making them more meaningful later, only to end up with large numbers of files sitting around with nothing more meaningful than the date and order in which they were generated (and often that's not much use, either).
Many people like the idea of sorting by date, but that only works if your brain works well with dates in the first place, and/or you have a rough idea of when you generated the particular file you're looking for. The only time I've found dates useful (being a person that rarely knows what day of the week it is, never mind the actual date), was sorting photographs after a long trip. In that particular case at least one of the photos taken within a given week (Wednesday through Tuesday) had a relevant name in it, so I could rename the folder that contained all of the photos for that week with that name.
Other than that, I always sort my data by subject matter and revision information. Photographs are sorted similarly, dates simply help with grouping them together before sorting by subject in cases where I took a number of related photos within a given range of dates (or on a particular date).
I guess the point I'm getting to is that everyone has different expectations when they're storing and sorting data. They may also have different expectations when they're dealing with different types of data. On the other hand, the only people that expect data to be sorted the way most computer applications sort it are people that are used to computers doing this sort of thing. The majority of the people that understand why computers sort the data that way should be working to fix it (and obviously some of them have, as file managers wouldn't do any differently from array sorts otherwise, and for a long time they didn't).
Vizeroth on December 13, 2007 06:56 PMAsh,
I am not a python zealot. I mostly code in C for Linux embedded systems and C++ (MFC) for Windows.
And I only called 1 person an idiot - the blogger (quoted by Jeff) who said:
"Silly me, I just figured that alphabetical sorting was such a common need (judging by the number of people asking how to do it I'm not wrong either) that I wouldn't have to write the damn thing. But I didn't count on the stupid factor. Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed. If any of you are using your language's default sort algorithm, which is almost guaranteed to be ASCIIbetical (for good reason) to get alphabetical sorting you proceed to the nearest mirror and slap yourself repeatedly before returning to your desks and fixing your unit tests that didn't catch this problem."
I guess you didn't find any of that offensive, rude, or low-class.
And my point was that, in some cases, you CAN'T adapt to your users needs. You can only try to please some of them, some of the time (which is what Explorer does).
Maybe some of my posts were strongly worded. Two reasons:
1) I dislike that implication that developers are too stupid to give the users what they want.
2) I'm disappointed that everyone thinks this issue is an artifact of "asciibetical" sorting, rather than lexicographical ordering. I don't see how you can solve a problem without fully understanding it first.
BTW, the fact that Win32, PHP, Objective-C have BUILT-IN support for "natural sort" ordering and that numerous 3rd-party libraries are available (including a C language library) proves that developers DO care about human factors and that we are not all stupid or hopelessly afflicted with Asperger's Syndrome, as Kate Rhodes and Jeff seem to think.
So I don't think we need to "proceed to the nearest mirror and slap [ourselves] repeatedly before returning to [our] desks", like Kate Rhodes suggested.
And I still have no problem with calling her an idiot. Especially when she thinks we don't know what the f--- "alphabetical order" means. I found that statement quite ironic, since "asciibetical order" is PRETTY MUCH the same as "alphabetical order", as long you're talking about case-insensitive comparisons. This problem has nothing to do with "alphabetical" versus "asciibetical". It is about "lexicographical order" (comparing one character at a time) versus "natural sort order" (i.e. treating groups of digits as numbers, rather than individual characters of text.).
Will on December 13, 2007 07:30 PMOne last comment. Many people have pointed out that no "natural sorting" algorithm can be perfect, or please everyone in all situations. Others ha