November 30, 2005
When was the last time you saw an HTML header like this?
<meta name="description" content="Everything you wanted to know about GUIDs but were afraid to ask">
<meta name="keywords" content="GUID, UUID, globally unique identifiers, 128-bit">
The web is a metadata-free zone. It's widely known that Google completely ignores metadata in its indexes. The <meta> tag has fallen so far out of favor that it drags the whole concept of metadata down with it. And perhaps rightfully so. Cory Doctorow viciously deconstructs metadata in Metacrap: Putting the torch to seven straw-men of the meta-utopia:
There are at least seven insurmountable obstacles between the world as we know it and meta-utopia. I'll enumerate them below:
1. People lie
Metadata exists in a competitive world. Suppliers compete to sell their goods, cranks compete to convey their crackpot theories (mea culpa), artists compete for audience. Attention-spans and wallets may not be zero-sum, but they're damned close. That's why:
- A search for any commonly referenced term at a search-engine like Altavista will often turn up at least one porn link in the first ten results.
- Your mailbox is full of spam with subject lines like "Re: The information you requested."
- Publisher's Clearing House sends out advertisements that holler "You may already be a winner!"
- Press-releases have gargantuan lists of empty buzzwords attached to them.
Meta-utopia is a world of reliable metadata. When poisoning the well confers benefits to the poisoners, the meta-waters get awfully toxic in short order.
The other six reasons are equally caustic, and all have a common theme: relying on users to create accurate metadata means you're betting on an optimistic view of human behavior. And we all know how well that works out.
Which brings me to the complete abandonment of the <meta> tag. Isn't it ironic that groups still advocate manually adding metadata to web pages? Who, exactly, is adding The Dublin Core Metadata Element Set to the <head> section of their web pages? Nobody, that's who.
Manual metadata may be suspect, but automated generation of metadata is practically the holy grail. Google's entire 450 zillion dollar market cap is predicated on one tiny, automatically generated piece of metadata on every web page they index: PageRank. Popularity rules the web. It's high school all over again: either you're popular and people link to you, or.. well, good luck on that whole prom thing.
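To make "popularity as metadata" concrete, here is a minimal sketch of the textbook PageRank idea: rank flows along links, so pages that get linked to end up with more of it. This is a toy power iteration, not Google's production algorithm; the function name `pagerank`, the damping value of 0.85, the iteration count, and the four-page `web` are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: everyone links to popular.html.
web = {
    "popular.html": ["fan1.html"],
    "fan1.html": ["popular.html"],
    "fan2.html": ["popular.html"],
    "loner.html": ["popular.html"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:14s} {score:.3f}")
```

Notice that nothing any page says about itself enters the calculation. The score depends only on who links to whom, which is exactly why it is so much harder to lie about than a keywords tag.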
But popularity has some limitations. For one thing, PageRank doesn't work on an intranet. Office documents are rarely HTML, rarely linked to each other, and you probably don't have a large enough sample set to do any fancy statistical analysis, either. That's why the Google Search Appliance not only actively indexes metadata in the <meta> tag, it requires metadata to return relevant results. It's right in the manual. Just try doing that with the capital-g Google.
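For a sense of what "indexing the <meta> tag" involves, here is a rough sketch that pulls name/content pairs out of a page's markup using Python's standard `html.parser` module. It is not how the Google Search Appliance does it, which I have no visibility into; the class name `MetaTagCollector`, the helper `extract_meta`, and the sample page are assumptions for illustration only.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta> tags into a dict keyed by name or http-equiv."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Descriptive tags use name=, functional ones often use http-equiv=.
        key = attributes.get("name") or attributes.get("http-equiv")
        content = attributes.get("content")
        if key and content is not None:
            self.meta[key.lower()] = content


def extract_meta(html):
    """Return a dict of the meta tags found in an HTML string."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta


# Sample page reusing the header from the top of this post.
page = """<html><head>
<meta name="description" content="Everything you wanted to know about GUIDs but were afraid to ask">
<meta name="keywords" content="GUID, UUID, globally unique identifiers, 128-bit">
</head><body>...</body></html>"""

meta = extract_meta(page)
print(meta["description"])
print(meta["keywords"].split(", "))
```

Reading the tags is the easy part. Whether you can trust what they say depends entirely on who wrote them, and on an intranet the authors at least have no incentive to stuff them with spam.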
Perhaps that's why Tim Bray steadfastly maintains that some form of metadata is necessary to improve search results.
One of the Web's distinguishing features is that there's a big gaping hole where the metadata ought to be. The Web has resources, identified by URI, and you can ask for "representations," which come with some metadata, but the metadata is about the representation, not the resource. Given a URI, the Web has no built-in way to ask questions about it, for example "What is this about?" or "When does it expire?" or "Is this suitable for children?" or "Is this good?"
I'm not an advocate of the utopian semantic web, mind you, but I sure would like something that can tell the difference between a Jaguar and a Jaguar instead of telling me which one is more popular.
Posted by Jeff Atwood
One clarification: the META description tag is indeed used by Google. Not to determine search result ranking, of course, but to determine which snippet of text to display next to the search result.
I verified this with a Google search. It's pretty handy, actually, since Google can make some really awkward decisions about what text to display next to a search result.
Who, exactly, is adding The Dublin Core Metadata Element Set to the <head> section of their web pages? Nobody, that's who.
Surprisingly many pages actually have DC metadata (compared to the desert I thought I would find).
If you're using Firefox, try installing the Dublin Core Viewer extension; you'll see the icon switch from gray to orange from time to time.
Well, there's metadata and there's metadata. Intelligently designed blog entries include an rdf island that promotes auto-trackback discovery. Purely optional. But given the limited number of people writing blog software, it seems to work out relatively well -- virtually any blog on a well-known platform includes the rdf, users are none the wiser, and stuff like auto-trackbacks works out nicely.
Not a huge big deal or anything, but a small example of where some purely optional metadata, defined for a limited audience, seems to work out well.
1) Why _does_ Google ignore meta tags? Shoot, if the biggest player blows off a nominal standard (well, suggestion), that's pretty much gonna be it for that tag, no?
2) I am under the (perhaps false) impression that cache-related meta tags are honored by, like, proxies and stuff. Is that true?
I have <meta http-equiv="Content-Language" content="en" /> at a minimum in my pages. You get XHTML validation warnings if you don't specify a default language.
Cache tags are used by proxies and browsers - their vocabularies are technical, and do not affect page-rank, so they don't get stuffed full of index spam. These tags are functional rather than descriptive.
Telling your caching proxy that you'd like the page cached for "Hot Jaguar Pics, Hot Jaguar Action" won't help. :-)
We use some of the Dublin Core metadata!
Okay, it's really for our own personal reference. We use content management, and wanted some way to be able to check the creation date of our pages via "View Source" ... lo and behold, the Dublin Core had a nicely standardized way of publishing "extended attributes" like dates.
Check it out for yourself: go to http://www.tronox.com/ and view source.
Why _does_ Google ignore meta tags? Shoot, if the biggest player blows off a nominal standard (well, suggestion), that's pretty much gonna be it for that tag, no?
For all the reasons Cory Doctorow listed: considered as a collective whole, people are stupid, lazy liars. And people certainly aren't objective when asked to describe themselves or things they have a financial stake in.
Metadata might work if an external entity was assigning it to the pages rather than the authors, but then you're almost back in automated metadata land.
I am under the (perhaps false) impression that cache-related meta tags are honored by, like, proxies and stuff. Is that true?
As Christian pointed out, these (language, caching, etc) aren't descriptive, they're functional. In other words, they can tell you what language the page is in, but they can't tell you if it's any good.