September 10, 2007
Everyone who has ever purchased a hard drive finds out the hard way that there are
two ways to define a gigabyte.
When you buy a "500 Gigabyte" hard drive, the vendor defines it using the decimal
powers of ten definition of the "Giga" prefix.
500 * 10^9 bytes = 500,000,000,000 = 500 Gigabytes
But the operating system determines the size of the drive using the computer's
binary powers of two definition
of the "Giga" prefix:
465 * 2^30 bytes = 499,289,948,160 = 465 Gigabytes
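The arithmetic above is easy to check for yourself. A minimal Python sketch of the two conventions (the constant names are mine, not any standard):

```python
DECIMAL_GB = 10**9   # SI "Giga": what the drive vendor advertises
BINARY_GB = 2**30    # binary "Giga": what the operating system reports

advertised_bytes = 500 * DECIMAL_GB
print(f"{advertised_bytes:,} bytes")          # 500,000,000,000 bytes
print(f"{advertised_bytes // BINARY_GB} GB")  # 465 GB, as the OS sees it
```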
If you're wondering where 35 Gigabytes of your 500 Gigabyte drive just disappeared to, you're not alone. It's an old trick perpetuated by hard drive makers: they intentionally use the official SI definition of the Giga prefix so they can inflate the sizes of their hard drives, at least on paper. This was always an annoyance, but it's now much harder to ignore, because today's enormous hard drives produce much larger discrepancies. When is a Terabyte hard drive not a Terabyte? When it's 931 GB.
As Ned Batchelder notes, the hard drive manufacturers are technically conforming to the letter of the SI prefix definitions. It's us computer science types who are abusing the official prefixes:
| Prefix | Derived From |
|--------|--------------|
| Giga | Greek root for "giant" |
| Tera | Greek root for "monster" |
| Peta | Greek root for five, "penta" |
| Exa | Greek root for six, "hexa" |
| Zetta | Latin root for seven, "septem"; p dropped, first letter changed to Z to avoid confusion with other SI symbols |
| Yotta | Greek root for eight, "octo"; c dropped, y added to avoid having a symbol resembling the zero-like letter O |
As the size of the prefix grows, so does the gap between the official and informal
meaning of the prefix.
And yes, there are larger official
SI prefixes beyond these,
just in case someone needs more than 1000 yottabytes. Ned noted that
one of the SI proposals is for the prefix "luma", representing 10^63.
Speaking of impossibly large numbers, if you're like most people reading this article, then you probably arrived here through Google. Google is a
tragically, but forever, misspelled version of Googol:
A googol is 10^100, i.e. a 1 followed by 100 zeros. In official SI prefix terms, a googol is approximately a yotta squared, squared. Even larger is the googolplex, which is equal to 10 to the power of a googol (10^googol); this number is about the same size as the number of possible games of chess. Even larger numbers have been defined, such as
Skewes' number, Graham's number, and the
Moser, which I won't even try to describe.
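Python's arbitrary-precision integers make numbers like these easy to poke at directly:

```python
googol = 10**100
print(len(str(googol)))          # 101 digits: a 1 followed by 100 zeros
# "a yotta squared, squared" is (10^24)^4 = 10^96, roughly googol-sized:
print((10**24) ** 4 == 10**96)   # True
```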
But I digress. When we use gigabyte to mean 2^30, that's an inaccurate and informal
usage. Instead, we're
supposed to be using the more accurate and disambiguated
IEC prefixes. They were introduced in 1998 and formalized with
IEEE 1541 in 2000.
You occasionally see these more correct prefixes used in software, but
adoption has been slow at best. There are several problems:
- They sound ridiculous. I hear the metric system used more often in the United
States than I hear the words "kibibyte" or "mebibyte" uttered by anyone with a straight face. Which is to say, never.
- Hard drive manufacturers won't use them. Drive manufacturers don't
care about being correct. What they do care about is consumers buying their drives
because they have the largest possible number plastered on the front of the box.
If a big lawsuit wasn't enough to get them to mend their ways, I seriously doubt
that the recommendation of an international standards body is going to sway them.
- Tradition rules. It's hard to give up on the
rich binary history of kilobytes, megabytes, and gigabytes, particularly when
the alternatives are so questionable.
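For software that does want to report sizes unambiguously, the two conventions are at least easy to implement side by side. A rough sketch (the function names are mine, not any library's):

```python
def format_si(n):
    """Decimal (SI) prefixes: 1 kB = 1000 B."""
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1000

def format_iec(n):
    """Binary (IEC) prefixes: 1 KiB = 1024 B."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024

size = 500 * 10**9
print(format_si(size))   # 500.0 GB
print(format_iec(size))  # 465.7 GiB
```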
It's good to keep in mind the discrepancy between the decimal and binary meanings
of the SI prefixes. The difference can bite you if you're not careful. But I think
we're stuck with contextual, dual-use meanings of the SI prefixes for the foreseeable
future. Or perhaps we're all overthinking this, as Alan Green notes:
Whenever I try to discuss [this] with my friends, they say, "Yotta getta life".
Posted by Jeff Atwood
"I am a CS major. I am utterly unaware of the world outside my tiny little shell. EVERYBODY IN THE WHOLE WORLD thinks that the SI prefixes mean powers of 2, and there is SO much history behind this usage -- literally DOZENS of years! Nobody uses the peta- prefix except for people talking about HARD DRIVES!"
(I'm a computer engineer myself, see above, just seriously amused by all this.)
DOZENS == sixteens, right?
I don't know if this would be more clear.
I had made a comment to the effect that 1024 bytes is 0x400 bytes in hexadecimal and that 1000 bytes is 0x3E8 bytes in hexadecimal.
Maybe if I were to use binary it would be more obvious.
1024 bytes is 10000000000 bytes in binary.
1000 bytes is 01111101000 bytes in binary.
I call 1,048,576 bytes a megabyte, or 1 MB.
1,048,576 bytes is 100000000000000000000 bytes in binary.
1,000,000 bytes is 011110100001001000000 bytes in binary.
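The binary strings quoted above can be verified directly, e.g. in Python:

```python
# A power of two is a 1 followed by zeros in binary; 1000 and 1,000,000 are not.
assert format(1024, "b") == "1" + "0" * 10     # 2^10
assert format(1048576, "b") == "1" + "0" * 20  # 2^20
print(format(1000, "b"))      # 1111101000
print(format(1000000, "b"))   # 11110100001001000000
```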
Our computers run countless billions of binary operations all day long and only convert to decimal when we humans need to see the data. Often it will display in hexadecimal for a kind human willing to meet the computer halfway.
At the end of the day, the computer is the final judge, and it clearly prefers to think of KB, MB and GB in terms of a binary number in the form of a 1 followed by 10, 20 or 30 zeros respectively.
This lovely machine has been programmed to convert the binary to decimal when needed, so let's not force our "arbitrary" metric on it.
I know the poor marketing sod is a soulless bag of crap and is lying his/her booty off on the front of the box with a statement that 1,000,000,000 bytes is a GB.
I mean, really now! Are you telling me that you are not skeptical of advertising already? Don't we as a planet take it for granted that all marketing people on earth are liars? They have gotten degrees in deception and make a living distorting the truth for the financial gain of their employer, to the detriment of everyone else on earth.
There is no need for this debate. Computers will continue to use KB, MB and GB internally as powers of two. Marketing people will use powers of ten because it is a convenient lie. (They love the convenient ones, as they make their worthless lives easier.) Educated consumers already know the exchange rate of marketing to computer science MBs.
As a final note, I would like to suggest that all marketing professionals commit suicide.
Again, that is simply a suggestion.
It took me about a day to get over the retarded names of the SI units. Then I realized that they are not much worse than the base 10 units.
It's obvious that memory is addressed via bits and therefore the maximum theoretical addressable memory is always a power of two. Memory requirements don't necessarily scale like that.
Also it's funny that Knuth comments on this, where he basically agrees but says the names are too funny to be taken seriously.
Once we slam a spacecraft or two into something in space we'll probably think they are less funny.
I cannot believe the reaction from people decrying the new binary prefixes. Here is the issue.
Point #1. It can make sense to talk about both a decimal gigabyte and about a binary gigabyte. Can people accept that?
Result: We need two different units in order to talk about these things unambiguously. This gives us three options: Use the existing usage to mean the binary unit, use it to mean the decimal unit or define two new prefixes for each 'type' of gigabyte.
Two new units is rather foolish. And which sounds more reasonable: to state that the kilo prefix has an exception for certain units, but that there's this new prefix for that unit that maps to the more standard usage of kilo? Or to make it so that kilo always, always, always means 10^3, and create the new prefix to always mean 2^10?
I think you know the answer.
What we have:
•kB means 1000 B by official SI definition.
•kB means 1024 B in traditional memory-related computer domain.
•KiB means 1024 B by official IEEE 1541 definition (note the capital "K").
So kB is more or less ambiguous, depending how the context relates to memory:
-RAM uses k=1024, else we get holes in address space (yikes!).
-Bandwidth uses k=1000, because it has no power-of-2 constraint.
-Hard drive capacity uses k=1000, but chunk allocation uses k=1024 to fit nicely in RAM.
-Audio CD uses non-power-of-2 chunk allocation because it is a streaming medium (data rate being more important than addressing).
-Flash memory is treated like RAM if it holds a BIOS, like a hard drive if it's assembled into a USB pen.
Some suggestions to remove ambiguity:
•State how much your kB is (visible every time a size is displayed, not buried deep in the doc).
Verbose, but easy fix to add.
•Use "KdB" to mean 1000 B (d standing for decimal).
Non-standard, but as compact as can be.
•State both kB and KiB.
Might look bloated, but is also the most informative.
Use k=1024 only when really necessary.
Remember the rest of the world uses k=1000, and rightly so.
@Luc: "We have a problem when we have a 500GB (1000 based) and you need a real 500GB (1024 based).
Workers in computer (Admins, programmers...) can deal with that.
But ordinary people are very confused about that."
I think we're using a very odd value of "ordinary". If you need a 500GiB (base 2) hard drive (which are pretty rare), surely it's more sensible just to get a 600GB (base 10). Easier to find, and a touch more space.
I think you are right, and I think that we are being short changed. I also bought a 500GB hard drive and found the same problem.
When I buy a car, I expect to get it home with four seats - not three. I also expect to get four wheels - not three.
If computers are logical systems, then we should try to talk about them in the same way. In maths, it is acceptable to round figures up or down for simple notation. Thus if the drive is 465GB (rounding to suit), then that is what it should be called. If all manufacturers followed suit, there would be no problem, no exaggeration, and no public feeling of being short changed.
I think software should start incorporating units as if they were fonts. If you are European, you leave inches and feet out and you never encounter them anywhere on your system again! Same with bits, bytes and kilos. If you are a nerd, you install a 1024 kilo unit; if you are a stock trader, you insert 1000 as kilo; if you are a drug dealer, you insert K as kilo; and if you are blond, you insert "big" as kilo. And if this function is not useful enough to incorporate for the kilo-nerds, then please do it to get rid of that freakin' imperial shit. Oh, and make page sizes font-like too! Trash the letter and tabloid; A0-A6 is all we need.
"they intentionally use the official SI definitions of the Giga prefix so they can inflate the the sizes of their hard drives"
What a load of crap. Hard drives have been measured in powers of ten since they were first invented; long before the DOSes and MacOSes of the world started reporting sizes in powers of two.
The problem here is Microsoft, not marketing. What conceivable benefit is there to reporting a 100,000,000,000 byte drive as "93 GB" in one place and "95,367 MB" in another place? None. Microsoft's notation is stupid and useless.
Western Digital was absolutely correct in their response to getting sued:
'Surely Western Digital cannot be blamed for how software companies use the term “gigabyte”—a binary usage which, according to Plaintiff’s complaint, ignores both the historical meaning of the term and the teachings of the industry standards bodies. In describing its HDD’s, Western Digital uses the term properly. Western Digital cannot be expected to reform the software industry. ... Apparently, Plaintiff believes that he could sue an egg company for fraud for labeling a carton of 12 eggs a “dozen,” because some bakers would view a “dozen” as including 13 items.' http://paulhutch.com/wordpress/?p=214
Using "G-" to mean "1,073,741,824" is just wrong, plain and simple.
For those who haven't yet, you'll want to check XKCD for a definitive standard on the topic:
Well if it's tradition to always use binary prefixes then someone should change the Ethernet spec and other networking standards which have always used decimal prefixes, not binary.. the only thing that's naturally binary is memory (RAM).. hard drives shouldn't necessarily be.
What it comes down to it that consumers are STUPID! They believe Microsoft Windows when it tells them a file's size is 1 GB when really it's 1 GiB.. Maybe the dumbasses should try suing Microsoft for supplying faulty software instead of going after hard drive manufacturers.
The MBR disk format for hard drives has an upper limit of 2TiB per partition. If you have a disk that's more than 2TiB in size, you need to switch to the GPT format, which most OSes have only recently made available (e.g. Windows XP 32-bit doesn't support GPT).
That definitely is a case where the binary/decimal confusion arises; while 2TB drives are pretty rare, it's not hard to build a big RAID array over the 2TiB limit. You do have to keep track of the fact that it's a 2TiB limit, not a 2TB limit. Single HDDs are up to 1.5TB, so it really won't be long before HDD manufacturers make disks that don't work with Windows XP. That will really annoy the anti-Vista zealots.
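The 2 TiB figure falls out of the MBR format itself: partition sizes are stored as 32-bit sector counts, and sectors are traditionally 512 bytes. A quick sketch of the arithmetic:

```python
SECTOR_SIZE = 512      # bytes, the traditional sector size
MAX_SECTORS = 2**32    # MBR stores sector counts in 32-bit fields

limit = SECTOR_SIZE * MAX_SECTORS
print(f"{limit:,} bytes")      # 2,199,023,255,552 bytes
print(limit / 2**40, "TiB")    # 2.0 TiB exactly
print(limit / 10**12, "TB")    # ~2.2 TB in decimal terms
```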
Honestly, it never used to be a problem. SI prefixes have always been powers of two for binary quantities (which is only bytes) and powers of ten for decimal quantities.
It's well known by anyone that actually needs to know - I never get confused. Bytes are always powers of two. (Network communications speeds are number of *bits*, not *bytes* and take powers of ten prefixes.)
The worst was the "1.44MB" disks. These are actually 1044 kilobytes. (1044 * 2 ^ 10 bytes, in case people aren't keeping up.)
Hard drive manufacturers definitely used to be more generous. I remember a 20 MB hard disk drive that had a bit more than (from memory) 21 million bytes capacity - back in those days, they actually made sure they met what it says on the box and you actually got more, no matter how you measure your megabyte.
(And kibibyte? It's stupid. Sounds like the unit of food eaten by an ISO standard cat in a cat food eating time unit. And a bunch of nerds trying to be nerdy.)
I can picture future cyber punks and general underground hoodlums now...
"Hey homey, you bustin' some yo yo yo worth of yobibyte warez for me?" 8^D
Jeff, You're losing it.
The 1024 vs 1000 issue is so irrelevant. Every hard drive manufacturer uses the 1000 measurement so when you're deciding which drive to buy you can safely compare and know that you aren't getting one product with a smaller capacity than the other.
So what if you don't get a nice round free space number when you install the drive in your PC. The only time the issue might be a problem is if you have exactly 500GB (1024) of data and you try to buy a hard drive to hold it.
This article was just padding.
Back in the day of FAT16 (Windows 95), "large" (4 gig) hard drives suffered from inefficiency in cluster size. For example, I believe under FAT16 the smallest cluster size was 32K, so if you had a 1K file, it wasted 31K. FAT32 improved on this; I believe space was still lost, but not as much, since smaller clusters could be defined.
So, a 1 Terabyte drive is for "marketing" purposes by the hard drive manufacturer. You'll never physically store 1 Terabyte.
I'm sure someone can explain the gory details on this better than I can.
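The cluster-waste arithmetic is simple enough to sketch (FAT details vary; the 32K cluster size is the example above, and 4K is my own illustrative pick for a smaller cluster):

```python
import math

def slack(file_size, cluster_size):
    """Bytes wasted storing one file in fixed-size clusters."""
    clusters = math.ceil(file_size / cluster_size)
    return clusters * cluster_size - file_size

print(slack(1024, 32 * 1024))  # 31744 bytes: nearly the whole 32K cluster
print(slack(1024, 4 * 1024))   # 3072 bytes with smaller 4K clusters
```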
Gigabytes weren't an SI unit until the IEEE decided to make them one. It's rather insulting, actually - reminds me of the gritty cop shows where the FBI would step into a police investigation and say "You guys go home, let the experts handle this". Except that the FBI actually has that authority, whereas the IEEE just wishes it does.
The meaning of SI prefixes only applies to SI (metric) units, which bytes aren't. A byte is already 8 bits, and it isn't divisible into centibytes or millibytes, so it doesn't even make sense as a metric unit. The composite units were so named because they approximated metric units, not because they were equivalent. That's not "wrong"; it happens in every industry, it's just that the nerds in other industries don't kick and scream anal-retentively about it.
For those claiming that the inconsistency was always there because the network industry used kbps, think again. One of the reasons they used kiloBITS per second was to disambiguate it from storage units. The term was adapted from baud - slightly different meaning, but essentially equivalent by the time of 14400 baud modems, when baud was becoming an awkward measurement anyway. There was a legitimate need to compare bandwidth with storage (the Internet), but it also did not make sense to use powers of 2 because bandwidth was actually provided in powers of 10 (bits). There was no foul play here, just pragmatism.
Memory capacity, on the other hand, is always 2^n bytes. Hard drives are generally multiples of 512, too; when 500 gigabytes is used to mean 500 * 10^9 bytes, it is actually an approximation. The real number might be something like 499,289,948,160 bytes, though it could be more or less depending on the geometry. 500 GB is never quite accurate using ANY convention.
I think it's obvious that the units for memory and disk should be the same, since data is constantly being swapped from one to the other. So let's put the question about why the rules for memory should apply to hard drives to rest.
Of course I know what the proposed solution is. Just have everyone switch to the dorky "bi" prefixes! That's nice, except that every part of the industry EXCEPT for the hard drive manufacturers has been using the same convention for 50 years. You don't just stomp your foot, shake your fist and tell us to mend our evil non-standard ways. Standards should reflect conventions that are already widely used, not fight them. Frankly, I'd rather deal with the hard drive capacity gap than deal with the silly new SI units invented by academic suits with hardly any practical experience.
Whenever one inured to the inconsistent KB/MB/GB definitions used in some computing contexts first hears the kibibyte, mebibyte, gibibyte KiB/MiB/GiB construction, they think it silly. I did, too.
But after a few years of being bitten by related problems, and having to explain/argue the exceptions, and familiarity with the new words/abbreviations, it looks better.
The use of powers-of-2 internally by computers is an implementation detail that only insiders need to optimize for, in their minds and communications. For everyone else, base-10 works better. There's no reason for average users to understand or even see KiB, MiB, GiB names/numbers, in disk sizes, file sizes, bandwidths, clock speeds, etc. Everything can and should be in base-10, shift-units-at-a-glance SI. And the proportion of average users to insiders keeps growing. SI will win.
For Jeff's question about ever needing to use "petabytes": many workplaces are now dealing with petabytes of data. We have a few petabytes of spinning disks at the Internet Archive; I know commercial and big-science entities have far more.
And, regarding being "glad I won't have to deal with saying" zetta and yotta, why so pessimistic about the progress of technology and/or your own lifespan?
Sebastian: Err, yes. You're right - 1440, not 1044 KB. Sorry about that!
Sean: There are very good reasons why RAM is going to be in powers of two - it would be quite a lot of effort to allow for 1000 megabytes of RAM on one DIMM and 1000 on another, compared to 1024 on each. (RAM is addressed by a computer on an address bus. Each line of that address bus is a bit of the address; allocating addresses to RAM DIMMs thus naturally falls on the boundary of an address bus line. Which translates to a power of two in the address space. That's why 1 GB of RAM is always going to be 2^30; because 2*10^9 is not going to divide easily on an address bus. Doing divisions by 1000 is going to add an extra cycle or two to every RAM access plus some extra chips!) Hard drives aren't addressed this way, so that's why they can be sizes that aren't otherwise 'nice' for computers.
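The address-line argument can be made concrete: each extra line on the bus doubles the addressable range, so capacities land naturally on powers of two. A small sketch:

```python
# Each additional address line doubles the addressable space.
for lines in (10, 20, 30):
    print(f"{lines} address lines -> {2**lines:,} addressable bytes")
# 30 lines gives 1,073,741,824 bytes: exactly one binary gigabyte,
# which is why "1 GB of RAM" is always 2^30 and never 10^9.
```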
Will: Modems always were weird, mainly because they (usually) used 10-bit bytes. Yeah, I was aware of some confusion with communications people, but I got the impression they hardly dealt with bytes anyway.
(Meanwhile, we're skipping the octet vs byte debate? 8 bits wasn't always standard, you know! :-)
'Aaron G' writes: "A byte is already 8 bits, and it isn't divisible into centibytes or millibytes, so it doesn't even make sense as a metric unit."
In information theoretic contexts, even bits can be fractional. And when describing very slow links, it could be meaningful to speak of such exotic and peculiar things as centibytes or centibits per second.
Contrived and weird, yes, but not totally nonsensical.
I see no reason why there can't be an option for using both kilobytes (1000 bytes) and kibibytes (1024 bytes), like the labels here that say something like "1Gal. (3.8L)." Slowly people will begin to understand the relation between the two, like how many people learn that a yard is almost a meter.
As for sounding ridiculous, that's just ridiculous. They may sound funny, but so does the mole (mol) and the joule (J). In fact, my chemistry teacher in high school had us make a mole (the animal) for a grade!
Once again, the drives could use the metric standard and the binary standard, as in "500GB (465GiB)," allowing consumers to see the difference and keep them happier with the manufacturers because they knew the two possible measures that could be used, instead of feeling they were ripped off.
In the programming sense, using the standard 10^x is rather an annoying convention because of the nature of bits: 0 or 1. If they were to somehow come up with a 10-state bit (easily possible with quantum computers), then I could see the warrant for using the standard metric definitions, but until then, no thanks. This difference in systems (base 2 instead of base 10) led to the rise of other counting systems, such as octal and hexadecimal (hex). Personally, I like to count memory and the like in hex. In hex, this use of "strange" numbers tacked onto the end disappears. For example:
1024 B   = 0x00000400 bytes = 1 KiB
1024 KiB = 0x00100000 bytes = 1 MiB
1024 MiB = 0x40000000 bytes = 1 GiB
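That hex table is easy to regenerate, e.g.:

```python
# Powers of 1024 come out as clean round numbers in hexadecimal.
for exp, name in ((10, "KiB"), (20, "MiB"), (30, "GiB")):
    print(f"1 {name} = {2**exp:#010x} bytes")
# 1 KiB = 0x00000400 bytes
# 1 MiB = 0x00100000 bytes
# 1 GiB = 0x40000000 bytes
```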
It also comes in handy to use the KiB notation in small systems, where you need to know exactly how much memory you have left and if it's enough for a 4KiB image.
Oh yes, the reason we use binary measurements is because computers use binary! Addressing for both RAM and hard disk is done using the binary/hex system. That being the case, it makes sense to me that they use the binary versions of the prefixes, but that would confuse people. So once again, I think listing both notations on the package makes plenty of sense.
Not to mention, if a byte were a standard SI unit, then it would be made of 10 bits. Then you could have a real decibyte. But naturally, if there was such a change, all the software out there right now would wind up being pretty useless because it isn't built for 10 bit architectures (although that can fairly quickly be remedied).
In the end, I think placing both labels on products will help get people used to the relation of a GB and a GiB. I have started to be able to tell approximate size of large files from one system to another, similarly to the conversion of yards and meters.
Anyhow, that's what I think.