April 26, 2006
I try to avoid using spaces in filenames and URLs. They're great for human readability, but they're remarkably inconvenient in computer resource locators:
- A filename with spaces has to be surrounded by quotes when referenced at the command line:
XCOPY "c:\test files\reference data.doc" d:\
XCOPY c:\test-files\reference-data.doc d:\
- Any spaces in URLs are converted to the encoded space character by the web browser:
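As a rough illustration of that encoding, Python's standard `urllib.parse.quote` percent-encodes a space as `%20` and leaves dashes alone. The paths below are invented for the example and simply mirror the XCOPY ones above.

```python
from urllib.parse import quote

# Hypothetical URL paths, chosen only for illustration.
spaced = "/test files/reference data.doc"
dashed = "/test-files/reference-data.doc"

encoded_spaced = quote(spaced)  # each space becomes %20
encoded_dashed = quote(dashed)  # dashes need no encoding

print(encoded_spaced)  # /test%20files/reference%20data.doc
print(encoded_dashed)  # /test-files/reference-data.doc
```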
So it behooves us to use something other than a space in file and folder names. Historically, I've used underscore, but I recently discovered that the correct character to substitute for space is the dash. Why?
The short answer is, that's what Google expects:
If you use an underscore '_' character, then Google will combine the two words on either side into one word. So bla.com/kw1_kw2.html wouldn't show up by itself for kw1 or kw2. You'd have to search for kw1_kw2 as a query term to bring up that page.
The slightly longer answer is, the underscore is traditionally considered a word character by the \w regex operator.
Here's RegexBuddy matching the \w operator against multiple ASCII character sets:
As you can see, the dash is not matched, but underscore is. This_is_a_single_word, but this-is-multiple-words.
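The same behavior is easy to reproduce without RegexBuddy. Here is a small sketch using Python's `re` module (my choice of tool, not the one used in the post), where `\w+` grabs maximal runs of word characters:

```python
import re

underscored = "This_is_a_single_word"
dashed = "this-is-multiple-words"

# \w matches letters, digits and the underscore, but not the dash.
underscore_tokens = re.findall(r"\w+", underscored)
dash_tokens = re.findall(r"\w+", dashed)

print(underscore_tokens)  # ['This_is_a_single_word']
print(dash_tokens)        # ['this', 'is', 'multiple', 'words']
```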
Like NutraSweet and Splenda, neither is really an acceptable substitute for a space, but we might as well follow the established convention instead of inventing our own. That's how we ended up with the backslash as a path separator.
Posted by Jeff Atwood
Although the second statement is true for almost all Windows applications, the first statement is NOT always true, especially for Microsoft Office products (Word, Outlook) and even Windows text box dialogs.
You can test this by using the above examples in a Word document, Outlook email, text box, etc., and searching for one individual token of the entire expression, e.g. "word". In these applications, the search will find "word" in both the hyphenated AND underscored versions of the expression.
Also, you can Ctrl-Left-Arrow/Ctrl-Right-Arrow to the first/last character of each token in both versions, whereas in applications that treat the underscored version as a single "word", Ctrl-Left-Arrow/Ctrl-Right-Arrow will advance only to the first/last character of the entire expression.
By my reading, this interpretation (dash as a word break character but not underscore) is continued in the Unicode world. It defines the rules for word boundaries here:
(Just for reference, in Unicode terminology, the underscore is called a "LOW LINE", and the dash is called a hyphen. There are several hyphen characters, see the link above.)
"the first statement is NOT always true, especially for Microsoft Office products (Word, Outlook) even Windows text box dialogs."
As a developer, this always drives me nuts. I expect a text box to treat underscores and dashes like my IDEs treat them: an underscore as a letter, and a dash as "punctuation". In programming of course, this makes sense. An underscore is not an "operator" in the context of many (most?) common programming languages, unlike the dash (minus), or whitespace (symbol separator).
However, an underscore really has no official place in the English language. It was created for the typewriter, as a way of underlining. Its word-separation usage only came with the advent of 2GLs. So confusion is understandable, if not really acceptable, when dealing with word-processing functionality.
At any rate, it would be nice if everyone treated it the same way. At least within a given context.
Underscores are also ugly and hard to type quickly, whereas dashes are easy and the same width as a space, or only slightly longer. Underscores at 2-3 spaces wide make words look disconnected, even if they're otherwise less obtrusive.
Of course you could simply use + and ban the use of it in filenames, with appropriate server support. (Or be prepared to use %2B.)
I tend to prefer the underscore because it's below the letters, as if in an underline. But I don't think there's a right answer. Just preference.
Another difference between underscores and dashes: very often, you can't see underscores when the words are underlined (as in most hyperlinks). Sometimes this is an advantage, but in most cases it isn't.
2 little notes from a unixish perspective (I thought I'd share because I'm reading your blog to get the windowsish perspective):
\w is not portable regex, but rather from the preg (perl regex) set.
Also, spaces in file names, esp. as arguments to shell scripts, are easily lost because once inside the script, evalling a string more than once makes it "fall apart", quoted or not.
The solution there is the magic variable "$@" (quotes included) which evals to all positional arguments, properly quoted one by one.
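The same failure can be sketched outside the shell. This is a Python analogy rather than shell code, and the file names are invented: keeping arguments as a list (roughly what "$@" preserves) survives spaces, while flattening them into one string and splitting again does not.

```python
import shlex

# Hypothetical arguments, as a script might receive them.
args = ["reference data.doc", "notes.txt"]

# Flattening to one string and re-splitting loses the argument boundaries,
# much like an unquoted string being evaluated a second time.
flattened = " ".join(args).split()

# Quoting each argument before joining keeps the boundaries intact.
round_tripped = shlex.split(shlex.join(args))

print(flattened)      # ['reference', 'data.doc', 'notes.txt']
print(round_tripped)  # ['reference data.doc', 'notes.txt']
```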
I like this one better:
I remember a quote somewhere, something like :
8 chars ought to be enough for anybody. :)
nah, sorry, too low .. :)
We switched from _ to - a few years ago for usability reasons: some folks don't see an _ in a URL and assume it's a space.
Here's another vote for '.' as a word separator in filenames (does UNICODE call that a FULL STOP? - actually, it does).
I have no in-depth analysis for why, it's just what I have always used to avoid the evils of spaces in filenames.
I still like it better.
This was really useful. I'm working on a file to convert search-friendly URLs into a keyword search on my website. I was going to use underscores but now I'm going to use dashes or underscores. Thanks.
I, too, hate the spaces! You don't know how badly I've wanted to change "Program Files" to ProgramFiles or Program_Files.
That being said, what about camel/pascal casing? Seems like that would be another option for this discussion. I wonder how the readability changes?
The space/dash/underscore self-deliberation comes up frequently as I'm ripping CDs to MP3s and trying to get a good filename. I've resolved to use _ for spaces and -- for a delimiter.
I guess I would argue that the _ makes more sense as a space, but maybe that's because that's what I've always used.
I agree that it's nice to adhere to a standard, but if there are good reasons to throw the standard out the window, why not consider it? Not that I've come up with good reasons...
renaming "Program Files" can break some archaic installers / updates so I wouldn't risk it
I personally simply add a Junction (Symbolic Link) that maps Program Files to Program_Files (so they point at the same place on a NTFS disc) and I can then refer to it in either way. Very handy :-)
Interestingly, I've heard Vista will be dropping all the two-word directories and moving from Program Files to Programs, My Documents to Documents, etc.
p.s. util to help create junctions: http://www.sysinternals.com/Utilities/Junction.html
Jeff, that's terrible advice. If programmers don't use the same characters in their filenames that their users use then how is software ever going to work properly?
W3C's CSS validator is broken because somebody didn't think to check it with a URL that has a %20 in it: http://www.kirit.com/W3C%27s%20CSS%20validation%20service
Microsoft's Response.Redirect() is broken because it can't work out which bit of a URL is which, also due to sloppy encoding practiced by most web developers: http://www.kirit.com/Response.Redirect%20and%20encoded%20URIs
And again a problem with encoding means that 404 handlers on IIS are broken too: http://www.kirit.com/Errors%20in%20IIS%27s%20custom%20404%20error%20handling
And because you've reverse engineered Google's software doesn't seem a good reason either. Sooner or later they'll change their implementation and then where will you be? If Google considers a hyphen as a space then Google is broken. Sooner or later they'll fix that (one hopes).
One comment suggests '+', like Technorati uses in tags? That's even more broken -- the '+' is used as a space substitute in query strings but NOT in file specifications, a bug in nearly every URL parser I've found.
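To make the path-versus-query distinction concrete, here is a sketch using Python's `urllib.parse`, which keeps the two conventions in separate functions. The sample name is made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

name = "C++ notes"  # hypothetical name containing literal plus signs

path_encoded = quote(name)        # path rules: space -> %20, + -> %2B
query_encoded = quote_plus(name)  # query-string rules: space -> +, + -> %2B

print(path_encoded)   # C%2B%2B%20notes
print(query_encoded)  # C%2B%2B+notes

# Decoding differs as well: only the query-string decoder reads + as a space.
path_decoded = unquote("a+b")
query_decoded = unquote_plus("a+b")

print(path_decoded)   # a+b
print(query_decoded)  # a b
```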
"If you use an underscore '_' character, then Google will combine the two words on either side into one word."
Personally, I feel this is Google's problem, not mine. But that's just my opinion. I will only go so far to accommodate Google. (And it shows; their cache of my site is a mess, because my CMS throws insane session cookies at Googlebot, in the URL's querystring. I haven't gotten around to fixing it yet... see previous comment.)
"Here's another vote for '.' as a word separator in filenames (does UNICODE call that a FULL STOP? - actually, it does)."
One problem with period is that it makes it difficult to figure out where the file extension begins, eg:
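One way to see the ambiguity, using invented file names and Python's `os.path.splitext`, which just treats whatever follows the last dot as the extension:

```python
import os.path

# Hypothetical file names that use '.' as the word separator.
names = ["reference.data.doc", "reference.data"]

# splitext cannot tell a separator dot from an extension dot.
results = [os.path.splitext(name) for name in names]

for name, (stem, ext) in zip(names, results):
    print(name, "->", (stem, ext))
# reference.data.doc -> ('reference.data', '.doc')
# reference.data -> ('reference', '.data')
```

In the second case the word "data" is taken for the extension, even though the name was meant to have none.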
Spaces in URLs are bad because they have to be replaced with that ugly unreadable %20 notation. But what exactly is wrong with spaces in file names? As long as you don't need to share a file with backwards Unix systems that don't understand file names with spaces I don't see the problem. Where do you ever enter a file name?
1. In a file selection dialog. Standard Windows file dialogs don't care about spaces. No quotes are necessary.
2. On the command line. Auto-completion automatically puts quotes around your file name as necessary.
3. In a program's source code. Strings must be surrounded by quotes anyway, ergo spaces are not a problem.
4. In some text storage facility, such as the registry or an XML file or whatever. Any well-designed storage format respects embedded spaces, so once again no problem.
The only problematic situation I can come up with (other than sharing with Unix systems) is batch files. That's the only time you have to consciously remember to use quotes for file names with spaces. And even that's no longer true when you upgrade to Windows PowerShell!
C'mon guys! Either computers work for you or you work for them.
Stick with spaces.
There_is_a_reason_that_we_don't_write_our_sentences_like_this. There-is-also-a-reason-that-we-don't-write-our-sentences-like-this. And.this.is.a.joke!
I'll let you hardcore filename worshipers use dashes, dots, and underscores. I'll use spaces.
You forgot the Wiki style: no spaces!
Which I suppose is what other posters were referring to with "Camel Case style", but I think of this as Wiki-style..
There is one downside to using '-' over '_'. The former is commonly used in language. What about double barrel names for example?
Now I've lost the difference between ' ' and '-'. I think Bit-Jockey makes a great point but sadly, I still use underscores. A habit I just can't break, mainly because spaces really suck in URLs.
File names and folder names are not sentences. I always laugh when people bring up the prose arguments. Apples and oranges. File names and procedures/functions in programming languages are short little blurbs of text. They are not English sentences. There is plenty of whitespace all over your screen and between windows/scrollbars/wasted space areas. In English text, there is not plenty of whitespace inside the paragraph.
So please, don't compare apples to oranges. The same arguments are used in programming languages. Cee programmers tend to argue that underscores are easier to read. I find that the underscores add more symbols to the already symbol infested obfuscated Cee code.
Let me repeat: file names and folder names are not English sentences.
Your first and last name is not an English sentence either. JohnDoe or John-Doe or John_Doe is just as easy to read. If we were discussing English sentences, then john_doe_went_to_the_park_to_see_trees would be easier to read. But we are not discussing English sentences.
If your file and folder names are becoming so long that they are English sentences, then you need to rename them. My Program Files In That Cool Windows Directory is not a good folder name.
I am not a programmer, so this issue doesn't impact me on that front. The reality is that encountering a name with substitutes for spaces offends all my esthetic sensibilities; it just looks hideous, especially the underscores. Usability also goes down the drain with anything that requires such frequent use of the shift key.
Probably this would be a non-issue except for stone-age, intolerant Unix. We can also blame Microsoft, Apple and all the rest of the computing world for implementing long filenames without carefully considering all the implications. This is part of the ongoing campaign to make computers "user friendly" so they can be operated by complete novices, but it has resulted in a giant tower of Babel.
As happens each time everybody sits in their corner and does their own thing, we now have a big mess which will probably never be resolved.
I agree with the above comment that computers are supposed to work for us; we shouldn't be required to adapt our lifestyle to suit them.
In this case, the number of space-intolerant scenarios is relatively small, so it is illogical that we all have to modify our behaviour to cater for them.
In situations like this, a de facto standard usually emerges and saves the day. Since none of the conventions being used today is acceptable, this unfortunately doesn't look like happening any time soon.
While I agree it's very annoying to put spaces in file paths and your Google dashes finding is very interesting and compelling, I just found out how to pass references to annoying file paths in Outlook 2003.
Enclose the path in angle brackets. It's fabulous. It can be done after pasting in the horrid path.
Wow. All I wanted was supporting arguments to choose a naming convention here at the office. Given that 99% of the lusers are ignorati, the space is here to stay. So I can now see, based on the above arguments, that it is actually historically the developers' fault for not making spaces in filenames easier for programmers to work with.
All those backup scripts, etc. need to be hand coded to handle names with spaces. Why? Because developers were so callous(sp?) as to select the language-text space char as the code separator. It should actually be some other funky (read: non-printing) thingy like chr(9). A tab, anyone?
Go figure, we always blame the luser when they do the most natural thing. In this case lusers are correct, and OS/dev languages are incorrect; they should find a different 'standard' to using the space (maybe keyboards should have two 'space' bars, eh? One for 'text' space and one for 'code' space. Hey, my new invention for the Unicode world!!!)
Sometimes you don't get to pick your tools, and they have bugs that make use of spaces problematic. Try calling a post-build script from Visual C++ 6 with an argument that contains a space, for example. Not possible.
A space in a filename is generally a clue that you need to add a level of directory hierarchy. Instead of "My Documents", use "Documents/Mine". Instead of "2008 Acme Invoices", try "Invoices/Acme/2008". Spaces are semantic separators in English. In directory hierarchies, the equivalent token is the slash/backslash.
Personally, I really hate spaces embedded in directory and file names.
I am a programmer so, granted, I am looking at it from that perspective.
uSoft cannot even get it right though. For example, HP mounts CF and SD cards under \SD Card and \CF Card on their versions of PocketPC. Try passing full PocketPC pathnames with embedded spaces as args to any of the CMD shell work-alikes in .bat files under PocketPC! It's a mess!! %1 becomes \SD and %2 becomes Card\... -- every .bat file you code has to deal with that crap. I gave up programming for PocketPC mostly because of that annoyance (and the lack of support for the most basic concept of PWD - current or present working dir).
Very interesting discussion about this topic here, though, lottsa varied ideas. The good news is that the HTML world hates looking at %20 as much as I hate seeing embedded spaces in my directory and file names. I generally use the underbar as a replacement for separation, but the Google discovery seems to suggest that maybe dashes or dots would be better.
Hey, dashes-to-make-a-long-identifier-is-lisp-style! (Or possibly Scheme.)
For myself, I have come to use dots as a handy compromise: they're not in the middle of the line like a dash, and they are not considered a word character like an underscore, nor are they escaped in URLs.
Since this area is far more personal preference than anything else, I feel quite justified in my choice. :)
For file and directory names or anything that might be used in a URL, I agree with Jeff about this, all-lower-case-hyphenated-words work best.
A friend of mine used to say "the shift key is cumbersome and often unnecessary". In the debate over underscores vs. hyphens for file names I definitely agree with his statement.
Hyphens are fewer keystrokes, you benefit from the Google/word separator issue and the underlined hyperlink confusion issue, and a side benefit of adding a lower-case convention is that you won't hit case sensitivity issues when moving files back and forth between Windows and Unix/Linux/Mac machines. I got hit by that once and vowed it wouldn't happen again.
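For anyone wanting to apply the all-lower-case-hyphenated convention mechanically, here is a minimal sketch. The helper name `to_hyphenated` is hypothetical, and the rule (lower-case everything, collapse runs of spaces or underscores into one hyphen) is just one reasonable reading of the convention described above.

```python
import re

def to_hyphenated(name: str) -> str:
    """Lower-case a name and replace runs of whitespace/underscores with one hyphen."""
    cleaned = name.strip().lower()
    return re.sub(r"[\s_]+", "-", cleaned)

print(to_hyphenated("Reference Data.doc"))  # reference-data.doc
print(to_hyphenated("Test_Files"))          # test-files
```

Names that are already lower-case and hyphenated pass through unchanged, so it is safe to run more than once.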
I didn't realize the regex connection before reading this post. Now I have even more defense of my hyphen-reasoning. Thanks!
@jeff, reading your comment,
I have searched in google with keyword web-site, but google considered it one word.
Great post. However, having the "-" versus the "_" is always best. We have experimented with many sites....
oval wall mirrors
are three links.. the dash in the 2nd and 3rd link has helped us with rankings.