January 3, 2011
Let's look at where stackoverflow.com traffic came from for the year of 2010.
When 88.2% of all traffic for your website comes from a single source, criticizing that single source feels … risky. And perhaps a bit churlish, like looking a gift horse in the mouth, or saying something derogatory in public about your Valued Business Partner™.
Still, looking at the statistics, it's hard to avoid the obvious conclusion. I've been told many times that Google isn't a monopoly, but they apparently play one on the internet. You are perfectly free to switch to whichever non-viable alternative web search engine you want at any time. Just breathe in that sweet freedom, folks.
Sarcasm aside, I greatly admire Google. My goal is not to be acquired, because I'm in this thing for the long haul – but if I had to pick a company to be acquired by, it would probably be Google. I feel their emphasis on the information graph over the social graph aligns more closely with our mission than almost any other potential suitor I can think of. Anyway, we've been perfectly happy with Google as our de-facto traffic sugar daddy since the beginning. But last year, something strange happened: the content syndicators began to regularly outrank us in Google for our own content.
Syndicating our content is not a problem. In fact, it's encouraged. It would be deeply unfair of us to assert ownership over the content so generously contributed to our sites and create an underclass of digital sharecroppers. Anything posted to Stack Overflow, or any Stack Exchange Network site for that matter, is licensed back to the community in perpetuity under Creative Commons cc-by-sa. The community owns their contributions. We want the whole world to teach each other and learn from the questions and answers posted on our sites. Remix, reuse, share – and teach your peers! That's our mission. That's why I get up in the morning.
However, implicit in this strategy was the assumption that we, as the canonical source for the original questions and answers, would always rank first. Consider Wikipedia – when was the last time you clicked through to a page that was nothing more than a legally copied, properly attributed Wikipedia entry encrusted in advertisements? Never, right? But it is in theory a completely valid, albeit dumb, business model. That's why Joel Spolsky and I were confident in sharing content back to the community with almost no reservations – because Google mercilessly penalizes sites that attempt to game the system by unfairly profiting on copied content. Remixing and reusing is fine, but mass-producing cheap copies encrusted with ads … isn't.
I think of this as common sense, but it's also spelled out explicitly in Google's webmaster content guidelines.
However, some webmasters attempt to improve their page's ranking and attract visitors by creating pages with many words but little or no authentic content. Google will take action against domains that try to rank more highly by just showing scraped or other auto-generated pages that don't add any value to users. Examples include:
Scraped content. Some webmasters make use of content taken from other, more reputable sites on the assumption that increasing the volume of web pages with random, irrelevant content is a good long-term strategy. Purely scraped content, even from high-quality sources, may not provide any added value to your users without additional useful services or content provided by your site. It's worthwhile to take the time to create original content that sets your site apart. This will keep your visitors coming back and will provide useful search results.
In 2010, our mailboxes suddenly started overflowing with complaints from users – complaints that they were doing perfectly reasonable Google searches, and ending up on scraper sites that mirrored Stack Overflow content with added advertisements. Even worse, in some cases, the original Stack Overflow question was nowhere to be found in the search results! That's particularly odd because our attribution terms require linking directly back to us, the canonical source for the question, without nofollow. Google, in indexing the scraped page, cannot avoid seeing that the scraped page links back to the canonical source. This culminated in, of all things, a special browser plug-in that redirects to Stack Overflow from the ripoff sites. How totally depressing. Joel and I thought this was impossible. And I felt like I had personally failed all of you.
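The attribution requirement is mechanically checkable. As a hedged illustration (the URLs and HTML snippets below are hypothetical), a crawler could verify that a republished page links back to the canonical source without `rel="nofollow"` using nothing but Python's standard library:

```python
from html.parser import HTMLParser

class AttributionChecker(HTMLParser):
    """Collects <a> tags and notes whether each carries rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.links = []  # (href, is_nofollow) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        rel = (d.get("rel") or "").lower().split()
        self.links.append((d.get("href", ""), "nofollow" in rel))

def attributes_properly(html, canonical_prefix):
    """True if the page links to the canonical source WITHOUT nofollow."""
    p = AttributionChecker()
    p.feed(html)
    return any(href.startswith(canonical_prefix) and not nofollow
               for href, nofollow in p.links)

# Hypothetical scraped pages: one compliant, one that sneaks in nofollow.
good = '<a href="https://stackoverflow.com/questions/123">original</a>'
bad = '<a rel="nofollow" href="https://stackoverflow.com/questions/123">original</a>'
```

Since the scraped page must carry a followed link back, any indexer that parses the page sees the canonical source; that is what made the outranking so surprising.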
The idea that there could be something wrong with Google was inconceivable to me. Google is gravity on the web, an omnipresent constant; blaming Google would be like blaming gravity for my own clumsiness. It wasn't even an option. I started with the golden rule: it's always my fault. We did a ton of due diligence on webmasters.stackexchange.com to ensure we weren't doing anything overtly stupid, and uber-mensch Matt Cutts went out of his way to investigate the hand-vetted search examples contributed in response to my tweet asking for search terms where the scrapers dominated. Issues were found on both sides, and changes were made. Success!
Despite the semi-positive resolution, I was disturbed. If these dime-store scrapers were doing so well and generating so much traffic on the back of our content – how was the rest of the web faring? My enduring faith in the gravitational constant of Google had been shaken. Shaken to the very core.
Throughout my investigation I had nagging doubts that we were seeing serious cracks in the algorithmic search foundations of the house that Google built. But I was afraid to write an article about it for fear I'd be branded an incompetent kook. I wasn't comfortable sharing that opinion widely, because we might be doing something obviously wrong. Which we tend to do frequently. Gravity can't be wrong. We're just clumsy … right?
I can't help noticing that we're not the only site to have serious problems with Google search results in the last few months. In fact, the drum beat of deteriorating Google search quality has been practically deafening of late:
Anecdotally, my personal search results have also been noticeably worse lately. As part of Christmas shopping for my wife, I searched for "iPhone 4 case" in Google. I had to give up completely on the first two pages of search results as utterly useless, and searched Amazon instead.
People whose opinions I respect have all been echoing the same sentiment -- Google, the once essential tool, is somehow losing its edge. The spammers, scrapers, and SEO'ed-to-the-hilt content farms are winning.
Like any sane person, I'm rooting for Google in this battle, and I'd love nothing more than for Google to tweak a few algorithmic knobs and make this entire blog entry moot. Still, this is the first time since 2000 that I can recall Google search quality ever declining, and it has inspired some rather heretical thoughts in me -- are we seeing the first signs that algorithmic search has failed as a strategy? Is the next generation of search destined to be less algorithmic and more social?
It's a scary thing to even entertain, but maybe gravity really is broken.
Posted by Jeff Atwood
Out of the 88.2%, how much of the traffic was previously browsing stackoverflow, then went to Google to run a search, then came back to stackoverflow?
I know frequently when I'm looking for an answer to a question I use Google search rather than the stackoverflow search, because it works well (for me) and I'm familiar with how Google behaves. I think the SO search probably works fine, but it's not what I'm familiar with. Also, very often I don't know whether I want to search stackoverflow or serverfault, so again Google works better. I know each site has its intended audience/purpose, but very often there are nuggets of information that exist on the opposite site. By altering the "inurl:" parameter on Google I can quickly look at a particular site.
Also, within the 88% aren't there people that Googled "stackoverflow" rather than typing it into the address bar? Technically Google would be the traffic source on these, but I wouldn't categorize them the same.
Anecdotally I had the same problem the other day. Horror of horrors Bing was better in the end!
Hmm... Why "horror of horrors"? :confused: Shouldn't it rather be: Good thing Bing was better in the end?
I honestly wonder why Google has never implemented a rating system so far - where users could rate search results. Perhaps something like ranging from A ("exactly what I was looking for") to F ("Yuck! Obvious scraper/scammer/spammer! - never even show me anything from that domain again!")
Ok, I am a blekko.com employee (and a former Google employee :-) so take my comment with a grain of salt. But one of the reasons I joined up with Blekko is that I believe Rich Skrenta's fundamental tenet that "algorithmic" search was always a hoax. (He doesn't state it like that, but that is what the reasoning comes out to :-).
Basically, "algorithmic" search is code for looking for "signals" (which is code for an HTML construction that indicates an intent to confer value) and applying those signals to a list of possible results. Back in the way-back times, that meant everyone had a 'links' page of sites they thought were the coolest/best. Google could scrape those, infer the intent, and then rank based on linkage (the original BackRub algorithm).
The Achilles heel of algorithmic search is this: "What if you have people who lie?" Which is to say that an algorithm cannot tell, a priori, whether the web page it is scanning, which was written by a human for human consumption, was written "from the heart" (which is to say original content, original expression) or "from the wallet" (which is to say to specific keyword and phrase requirements). Since human labor on the Internet is cheap, an algorithm based on inferring human intent cannot discriminate between "good" humans and "bad" humans.
Blekko's premise is that people know good content, and that a small fraction of those people are willing to take a bit of time to identify the content that is "best" for a given category. Blekko enables that understanding to be codified into slash tags, which are a community resource. Thus a small fraction of people with good taste can create a much better search experience for everyone.
Of course, there remains the question of why evil humans can't do the same thing to Blekko, by creating their own slash tags filled primarily with their ad-revenue-generating content. The answer is that while such slash tags can be created, you as the user decide which (if any) slash tags you want to use to filter your results. If you try a user's slashtag and find it is full of spammy links, you don't have to use it; better still, you can use it as an "anti-filter", which is to say, exclude any sites this spammer has in their slashtag from the returned results.
Unlike the curated directory which Yahoo! pioneered in the 90's, Blekko crawls the web like an algorithmic search engine, and then sieves the result through what is a community constructed filter of quality. The goal is a scalable, robust, search engine with consistent high quality results. Content farms and content duplicators have to fool a human to get into a slashtag, which thankfully continues to be an unsolved problem.
Great discussion by the way on this.
"I greatly admire Google" - Except for that part where for political reasons they tamper with search results. OH! And their usage of data "accidentally" acquired by Street View cars. And how about how for years they collaborated with authoritarian governments like China to censor free-speech.
Definitely admirable. Definitely NOT evil.
I'm looking for useful information on baby monitors. Sadly, SEO devils know this, and have made it very difficult to find anything real.
If you want a nice tour of these crap sites, try searching for "The Summer Infant Best View color video baby monitor is one of the highest rated video monitors", and use the quotes so you get exact matching on the phrase.
It takes you on a nice tour of content replicators and scrapers. There are automatic variations as well -- "highest rated video monitors" becomes "best rated video monitors", and so forth.
If you break it down to searching for duplicate sentences, I bet you find enormous numbers of duplicators very quickly.
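The duplicate-sentence idea the comment describes can be sketched directly. Assuming we already have the plain text of two pages (real systems use shingling over crawled documents; the sentence splitting here is deliberately crude), the fraction of copied sentences makes a simple duplication score:

```python
import re

def sentences(text):
    # Crude sentence split; a real pipeline would use a proper tokenizer.
    return {s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()}

def copy_ratio(candidate, source):
    """Fraction of the candidate's sentences that appear verbatim in source."""
    cand, src = sentences(candidate), sentences(source)
    return len(cand & src) / len(cand) if cand else 0.0

# Hypothetical review text and a scraper page that lifted half of it.
original = ("The Summer Infant monitor is one of the highest rated "
            "video monitors. It has night vision.")
scraper = ("The Summer Infant monitor is one of the highest rated "
           "video monitors. Buy cheap cases here!")
```

A threshold on `copy_ratio` (say, above 0.8) would flag wholesale duplicators quickly, exactly as the comment predicts.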
I really admire Blekko. It's been a few years since I started complaining about the vertical shopping list that search engines have become. I always identified the lack of categorized searching as one of the main problems (exactly the ability to trim my search results based on categorized information as opposed to search terms), but always bumped into a wall when I was asked how it would ever be possible for a search engine to categorize the web. The answer was obvious, it wouldn't be the search engine doing it. It would be the users. I just couldn't imagine how. Your slashtag solution is tremendously elegant. It's essentially a categorized search in the hands of the user, allowing them to trust a community effort to categorize search results, but also create their own which can be made private, if they so wish, for the maximum search results customization possible.
This is the type of innovation that defined Google back in 1998. An innovation that I no longer expect from this company, which I predict will lose its dominance of the web search engine market sometime in the next 10 years, because it is falling prey to the exact same vices displayed by the companies it displaced back in 2000 when it became a phenomenon. Google's corporate nature slows down new developments, and its commitment to the current winning strategies clouds its vision of the future (and of the complaints that are starting to emerge in the present). Without competition, Google has been falling behind in user expectations and admiration, to the point of having become a common target of criticism.
I'm not saying, however, that you guys are the solution. I'd hope you are, because your current process really strikes a chord with how I personally see the web search requirements of the decade that is just starting. The SEO button and the /rank slashtag are also a boon that cannot be overstated. The fact that you folks chose the angel funding route also gives me some confidence in the ability of your project to stay afloat in bad weather. As far as I'm concerned, I'll do my part by using it as much as possible, creating slashtags when needed, and essentially being part of what I hope becomes a growing community. As I said, I don't trust the current solutions to be the ones that will take the web search engine to new heights. They are becoming old and disconnected from their users' requirements.
I'm really happy Matt Cutts is aware of this problem now, because it is HUGE.
I recently had to 301 redirect two domains to new domain names, due to a copyright issue with the domain name. The domains were trusted, 2 years old, with excellent rankings. Four weeks after the redirect, the rankings were completely gone. Nada. Zilch.
Here's the kicker. All those scraper sites that loved the old domains so much are now OUTRANKING the new 301'ed domains.
Everyone is talking about this on the Webmaster World forums. I can't find stuff easily on Google anymore. The long tail has been cut off, and the Google rabbit doesn't know its head from its tail(!) Finding anything complex just doesn't work like it used to. From looking for new drivers to finding solutions to PHP coding problems, it takes AGES to find what I need.
The relevancy signals are all screwed up. G, I hope you get this fixed, not for my sites but for everyone! Bing is very, very, VERY attractive right now!
@InsomniacGeek put it in very transactional terms, but let me back up a step and suggest that what's happening is what the world wants to happen, and you simply don't like it. (I'm not a fan, either, but let's see if we understand it.)
TV has democratized from the 3 networks, with programmers guessing what people would want to watch, out to today's 108 channels being overrun by YouTube.
The news of today is the same story. From a couple of national authorities plus a local rag that ran the wire stories plus the want ads, blogs such as this have arisen and are being monetized by scrapers and packagers. This is NOT because Google is evil (a separate question); it's because the efreedoms of the world actually collect repeat hits from viewers who indiscriminately are happy to see a story of interest, and don't give a damn about cultivating sources, depth, communities, etc. No repeat hits, no eyeballs for sale == no aggregators.
I'll assert (at least for discussion) that the aggregators are performing exactly the service that the lumpen proletariat wants. And Google is merely delivering on its promise of “relevancy” for the majority of its visitors. Just not what you or I think is “relevant.”
Unfortunately I get the feeling that Google has started to go the Microsoft root - forgetting about polishing the core functionality of the product and just adding bloat upon bloat. The awful site preview 'feature' is an example of this - you can't even turn it off.
I mean route of course ;)
@Coruskate: Wikipedia actually is in (a pretty big) part a "scraping" website, one that aggregates texts that were published elsewhere, indexes them, enriches them, updates them, and so on. This is perfectly legitimate when the original content was Creative Commons (which, by the way, is also not always true).
So, by many metrics, wikipedia is "better" for the consumer than the (often obscure, poorly laid out, poorly edited, maybe even advertising infested) websites that originally published the content. Being the original publisher doesn't automatically qualify you as "better" from the consumer point of view - even if I agree that original content producers should be rewarded.
Another way to look at this is as an incentive issue, rather than an algorithm issue. A large part of the incentive for spammers is Google itself, in the form of AdSense. They could easily reduce spam by having more stringent AdSense publisher guidelines, but of course they won't, because this affects their (Google's) bottom line.
Basically Google's incentives are as aligned with the MFA spammers as they are with the consumer (if not more so), especially given the lack of viable alternatives (although thankfully Bing and Blekko are gradually getting there).
I unfortunately have to second your perception of decreasing quality of Google results. Just yesterday I needed a SIP VOIP provider, but almost all the results were SPAM or sites looking very shoddy. I chose the one that seemed to be the least untruthful of them (fonosip.com), but it appears that I was ripped off. I bought credits and registered my SIP device in their network, but cannot make any calls. They do not reply to my e-mails, and their twitter account (@fonosip) seems to be purely automated spam to farm links... Guess I just lost 15 USD, but losing faith in Google was much worse than that.
Funnily enough - before Christmas I was hunting down some obscure jquery/ASP.net compatibility bugs, and I was delighted to find so many results on Google related to my problem. Imagine my surprise when it was just one SE post, scraped and reposted over and over and over. And the SE post was actually about 3/4 of the way down the page, so I only found it later on in my search. I then started noticing a LOT of scraper sites popping up higher than SE. Only in the last month or so have I noticed this though.
One reason that Google will produce "worse" SEO-spammed results than other search engines (which nobody seems to have noted in my quick scan) is the obvious one.
Spammers don't gain jack from trying to optimize their Bing or Yahoo results.
Google has the most impact, and thus it's the thing they'll put the vast majority of their effort into gaming.
Maybe we should outlaw the root of all this: affiliate programs. Linkshare, Amazon and other big sellers are causing this since they are giving content farms a raison d'etre.
I agree with Robert Osborne and Jasonharrop -- Google must use feedback from their users:
1) Google Toolbar data -- the more users visit a certain page or subdomain, the better.
2) Google Search result clicks -- the more users click a search result, the better the destination subdomain is.
3) Google Search result views -- the more users see a result without clicking it, the worse the subdomain is. Ideally, every view in the search results should end up in some actual web site view (if possible).
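The signals the comment lists reduce to a click-through style quality score. A toy sketch under invented numbers (this is not how Google actually weighs its signals):

```python
def subdomain_quality(clicks, views):
    """Naive quality signal: shown often but rarely clicked looks bad."""
    return clicks / views if views else 0.0

# Hypothetical toolbar/search-log counts per subdomain: (clicks, views).
stats = {
    "stackoverflow.com": (900, 1000),  # clicked 90% of the times it is shown
    "scraper.example": (30, 1000),     # shown just as often, rarely clicked
}
ranked = sorted(stats, key=lambda s: subdomain_quality(*stats[s]), reverse=True)
```

Even this naive ratio would demote a scraper that users consistently skip over in favor of the canonical result.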
The only reasonable explanation of why Google does not efficiently do that is ... deteriorating quality of management in their search team.
I also noticed that search results weren't quite what I expected lately, so what I did was to set the search to display 50 results per page instead of 10, to be able to quickly scan through more results.
With right search words and a quick scan I can find what I need, even if not in the first 4-5 results...
I always hope to see better.
I think the problem isn't the search engine; the problem is that advertisement is allowed on such sites. It generates value for no one, especially the advertising firm.
How about the ad networks get a hold of themselves and stop accepting any webpage as a potential ad spot? If there were a policy that said ads would not be put up on sites that scraped content off other sites just to get webhits, then this problem wouldn't occur. In fact, ads would become more valuable, so such a policy could be a win-win.
Seconded, but Bing was not much better, it seems all search engines are getting gamed starting back in November and still are.
Right or wrong my personal observation is it was not every day, just most days. But some days searching was totally useless.
Thanks to Atwood/Coding H/Stack O for pointing this out, and thanks for pointing it out to an audience that actually cares.
They should just divide your page rank by 2^(number of ads on the page). That ought to kill off all these spammy content aggregators.
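The formula the comment proposes is easy to state precisely. A tongue-in-cheek sketch (the scores and ad counts are made up):

```python
def ad_penalized_rank(pagerank, ad_count):
    """Halve the score for every ad on the page, as the comment proposes."""
    return pagerank / (2 ** ad_count)

# A modest ad load already crushes the score; an ad-free page is untouched.
spammy_score = ad_penalized_rank(8.0, 3)  # 8 / 2^3
clean_score = ad_penalized_rank(8.0, 0)   # 8 / 2^0
```

The exponential penalty is the point: a scraper plastered with ads loses to any clean copy almost immediately.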
I'm just confused, I'm defending Google on this one. Yes, search result quality has been affected in some ways, with some types of searches. Do I see this degradation in my general day-to-day use of Google as a search engine? Not much.
To vet my next statements: I'm a .NET developer. I spend 8 hours a day or more on my development workstation, and more time on my personal computers, and this is my observation.
Most of what I have noticed is centralized around product searching. If I'm looking for something relatively popular, or very general, I tend to get more of the "scraper" type sites. This is something I see on all search engines though.
Just an example... or a few:
Today, looking for a way to use the OnStar system in my vehicle via bluetooth (way cool by the way).
The Search: http://www.google.com/search?rlz=1C1GPCK_enUS410US410&sourceid=chrome&ie=UTF-8&q=onstar+bluetooth
Very good results, and I found exactly what I was looking for.
Next, looking for a way to connect to an X session on an android phone.
The Search: http://www.google.com/search?rlz=1C1GPCK_enUS410US410&sourceid=chrome&ie=UTF-8&q=android+display+x+via+ssh+tunnel
Found what I was looking for without trouble.
Some programming things: Looking for a reference on the "RenderBeginTag" for writing server controls in ASP.NET.
The Search: http://www.google.com/search?sourceid=chrome&client=ubuntu&channel=cs&ie=UTF-8&q=RenderBeginTag
Good results, found exactly what I was looking for.
Looking for a refresher on how to serialize an object to XML.
The Search: http://www.google.com/search?sourceid=chrome&client=ubuntu&channel=cs&ie=UTF-8&q=serializing+object+in+.net
You guessed it, found exactly what I was looking for.
Lastly, looking for a quote from the late and great George Carlin.
The Search: http://www.google.com/search?sourceid=chrome&client=ubuntu&channel=cs&ie=UTF-8&q=George+Carlin+The+Public+Sucks
Wow, found it.
And that's just a *very* small sample of my searches for the day.
Granted, there wasn't much in the way of product searching in my examples, but I don't think Google is all that great for that. There are better tools: epinions, Amazon, etc.
Is the OP correct? Yes. Google does need to do their best to promote original content and provide accurate results. Do scrapers suck? Yes, and we have a right to complain about them.
All I'm saying is: it's not bad. Google does what it can and they do a great job. Oh and the public sucks. :)
From a colonial outpost, excellent article and comments on an issue that has become as clear as daylight (over our veld at least) to ordinary users like myself. Google is now a problem and not the answer if you just want decent information.
I'm sure there's lots of useful info in the comments. Too technical for this user. One suggestion to Google: since their best results in my field are almost invariably parasitic on Wikipedia in the first instance, why don't they respond to Jimmy Wales's current funding appeal with a modest donation? Preferably confidential, and not too large.
I'm a researcher and web designer/PC repairer (not a programmer), but I have been doing web research for the past ten years, and for a book for the past two years. Specifically in the past year--hands down--Google search results are "dumber" and more shallow.
Results are more topical, less detailed, and often don't include any relevant results for the actual search you are doing. I find an astounding percentage of results going to Yahoo Answers and the like, where visitors on those sites post their own opinions (and are primarily uninformed teenagers or people getting paid through services like Digital Turk).
I ran a computer repair business for two years and was self taught using Google as my primary knowledge base--if I tried to do that today it would be incredibly more difficult.
On the book research, I am having difficulties at times finding more than headline grabs from news stories or spam/scraper sites with gimmick tags.
Google has apparently caved to the "refine results for us" crowd and lost a lot of utility for dedicated data miners such as myself.
Essentially what is happening is this: imagine you run a publishing company. You decided to publish your books under a special copyright that allows anyone to re-publish and redistribute them in any way they want. After a certain time has passed, you started getting very, very angry about your sales dropping. You're blaming all those who took advantage of that special license you so generously offered a while ago.
A somewhat similar situation happened to Google with Android. They created this great operating system and open sourced it. Suddenly, Verizon decided to use Android, but with it being open source, and Verizon being a business, they decided it was in their best interest to completely disable Google search on their phones, and also to disable Google Maps, because the free Google Maps competes with their own $10/month navigation service. It also happened that Microsoft paid them to replace the default search from Google to Bing.
But at least Google did not whine about it like you. They're moving forward with their own deals with Samsung and making their own phone where no one can block Google from it.
Here's one idea for how Google can strike back.
Beating the Content Farms: Google Can Automate the Like Button
Let's see. You figure out how Google works, piggy back off of their technology, "giving" people back their answers to their questions, and are perplexed that someone else is piggy backing off of you and Google?
Wake up and smell the web!
Spammy content farms are big business, $25 is a drop in the bucket for them.
I really like the comparison of Google to gravity, because it implies that Google is a critical component of the nature of the internet. The problem with Google is that it is only a service of the internet, not an integrated part of the internet's structure. We expect the internet to have a search feature because everything has a search feature, our operating systems, our apps, everything. We expect the pages on the internet to be indexed and organized because, hey, everything else is too. Unfortunately the internet doesn't have any of these features built in.
Google is just another website just like every other website, and it's limited by the same things that limit every other website application. Using Google to search the internet is like using a third party app to search your Outlook emails (if you have billions of emails).
What Google, or any search engine, does is ultimately flawed by the technology it's built upon. If the internet was created today from scratch it would no doubt be engineered to support indexing and all of the other things that we take for granted, so third parties don't have to hack their ways around billions of individual html files.
What is this page for? http://stackoverflow.com/questions-all/145
Seems like a link farm, apparently. I landed on those chronological lists twice today. They are absolutely unrelated to my Google search queries. Since the page numbers gradually increase, even the keywords were not there.
It would seem that Google's perfect system already exists in Gmail.
I've been using Gmail since it started in 2005. As of today I have over 50,000 archived emails. When I first started Gmail, I would get a couple of spam emails a month in my inbox and immediately report them as spam using the dedicated button.
In the past 3 years, spam emails in my inbox have completely disappeared. They still accumulate by the hundreds in my spam folder, but I never get them in my inbox.
Thanks to their reporting system, a minority of Gmail users provide cross-referenced data that allows the spam to be identified and properly categorized plus whatever else Google uses. I believe there was a white paper on their anti-spam technology a couple years ago showing their techniques.
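The cross-referenced reporting the comment describes can be sketched as a simple vote threshold. A minimal sketch, assuming made-up senders and an invented threshold (Gmail's real classifier is vastly more sophisticated):

```python
from collections import Counter

reports = Counter()  # sender -> number of independent "report spam" clicks

def report_spam(sender):
    reports[sender] += 1

def is_spam(sender, threshold=3):
    """Once enough independent users flag a sender, filter it for everyone."""
    return reports[sender] >= threshold

# Five users flag one sender; a lone accidental click flags another.
for _ in range(5):
    report_spam("junk@example.com")
report_spam("friend@example.com")
```

The key property is the one the comment identifies: a small minority of reporting users protects the inboxes of everyone else.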
In contrast, I have an old GMX.de free email account that is now nothing but hundreds of spam emails a month in the inbox.
Google is hardly a monopoly. I've been using DuckDuckGo for a while now, and it seems to work very well. I've also started playing with Blekko. If more people start leaving Google, maybe they'll make more of an effort to weed out spam sites.
At the rise of Search, the public couldn't believe that a single search box could ever return adequate results from all over the web. SEO grew enormously, but when it comes to results that reflect your needs more exactly, SEO is not the holy grail.
For example, business apps use other indexing mechanisms, like metadata combined with social distance, in addition to ranking. I believe the public's gut feeling that SEO will not provide you with the right answer will become real in the end, or at least a part of that feeling will. I also believe that the quality of results could improve by adding a social component to SEO.
So is it time to redefine gravity and implement a new model to assure that we still will find relevant and authentic information in the future? I do think it is time to innovate.
Great article. Thanks.
I personally find myself spending more and more time filtering out results that I know (or think) are scraped SO content.
Google should make it possible to flag other sites' content as duplicate of that on your own site (of course they need to verify that it actually is duplicate), so that sites going to extremes in terms of scraping will get degraded in the results.
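The verification step the comment asks for could be as simple as comparing normalized fingerprints of the two pages. A sketch (the normalization is deliberately crude, and the sample pages are hypothetical):

```python
import hashlib
import re

def content_fingerprint(html_text):
    """Strip markup and whitespace, then hash: identical bodies collide."""
    text = re.sub(r"<[^>]+>", " ", html_text)         # drop tags (crudely)
    text = re.sub(r"\s+", " ", text).strip().lower()  # normalize whitespace
    return hashlib.sha256(text.encode()).hexdigest()

# Same body, different markup and spacing: the fingerprints match.
mine = "<p>How do I serialize an object to XML?</p>"
theirs = "<div>How   do I serialize an object to XML?</div>"
```

Matching fingerprints would let Google confirm a duplication claim mechanically before demoting the flagged site.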
I guess Google has (long time ago) lost the human touch, and is increasingly becoming a system to beat.
Gravity is not broken, gravity is just gravity.
But yes, there needs to be an explicitly made catalog of things, because otherwise search results can contain whatever loosely related material.
I just tried today to search for information on how to pay a certain well-known company through a certain well-known bank. Page after page of spam, spam, spam, spam, spam, spam, spam, baked beans and spam. I must have tried 10 different searches with various synonyms, phrases and exclusions, and all it did was slightly change the order and keyword relevance of the spam.
The worst part is that the results from all of these sites are identical. It would be nice if Google had at least a modicum of intelligence to say, "Hey, if Mr. User here isn't interested in result #1, he's probably not going to be interested in all of these identical copies of it down below."
Three years ago, the content I was looking for often didn't exist at all or wasn't indexed, and I was okay with irrelevant results then. Today I'd gladly go back to a 50% chance of getting the results I wanted in exchange for losing the ocean of obvious, pathetic spam I seem to get 90% of the time now.
Here's a thought, Google: How about tossing all of these garbage copypasta spam sites into a "mirrors" link for the original result? Surely you can figure out which site is actually the original; you invented PageRank, so checking a few indexing dates should be practically a "hello world" level of difficulty.
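The "mirrors" idea above really is close to "hello world" in its naive form: group pages by a hash of their normalized content, then call the earliest-crawled page the original and the rest mirrors. This sketch is only an illustration of that comment's algorithm (the page data is invented, and real search engines are vastly more subtle):

```python
# Group identical pages by content hash, then pick the earliest crawl
# date in each group as the "original" and demote the rest to mirrors.
import hashlib
from collections import defaultdict
from datetime import date

pages = [
    {"url": "scraper-a.example/q123", "crawled": date(2010, 6, 2),
     "body": "Answer: use a dict."},
    {"url": "stackoverflow.com/q/123", "crawled": date(2009, 1, 5),
     "body": "Answer: use a dict."},
    {"url": "scraper-b.example/q123", "crawled": date(2010, 8, 9),
     "body": "Answer: use a dict."},
]

# Bucket pages whose normalized body hashes identically.
by_hash = defaultdict(list)
for page in pages:
    digest = hashlib.sha1(page["body"].strip().lower().encode()).hexdigest()
    by_hash[digest].append(page)

# Within each bucket, the earliest crawl wins.
for copies in by_hash.values():
    copies.sort(key=lambda p: p["crawled"])
    original, mirrors = copies[0], copies[1:]
    print("original:", original["url"])
    print("mirrors: ", [p["url"] for p in mirrors])
```

Of course, crawl date is a flawed proxy for authorship (a scraper could be crawled before the source), which is presumably part of why the real problem is harder than it looks from outside.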
I have been increasingly plagued by these auto-generated sites over the past months. As of this morning (Monday 17 January 2011, European time), my Google searches are almost entirely spam-free. Is this wishful thinking or has someone at Google been reading this thread?
I'm a little late reading this blog, but the example searches seem fine--searching for "dishwasher reviews" yields lots of legitimate looking dishwasher reviews. The first hit for "iphone 4 cases" had more case models for sale than I thought existed. StackOverflow continues to rank very (very!) high in my search results. I don't recall being annoyed at seeing republished copies of answers, so personalization may put the real SO higher than the scammers for me. (We all get personalized search results, so we really can't talk about a single Google search result anymore.)
I have noticed spammy-looking republished content in search results, of course. Does Google manually adjust search results as people point them out? Or was there a very recent change to the spam-recognition algorithms?
Here's an example of a broken Google:
There's a fairly popular YouTube video called Waffles by Julian Smith.
I used a quote from it for the title of a post on my blog. Now, if you type in the quote, or something like unto it ("i'm not retarded i ate a jellyfish") into Google, the very first result is that blog post. Which has absolutely nothing to do with that popular video.
One thing to consider here is: who is it that Google is failing? They're failing you, as the owner of Stack Overflow, because you want and expect (and have every reason to expect, based on the guidelines you quote) that your results should show up before scraped results. But are they failing their users, who are looking for content? I don't think so. If I'm searching for something that has a result on Stack Overflow, then I'm looking for the content of that page. If the scraped page has all that same content, then I can find what I'm looking for, whether I go to stackoverflow.com or hijackstackoverflowtraffic.com. As a random user who is not part of the SO community, why would I care which site I find it on?
I haven't personally observed this with SO, but I see it all the time with mailing list archives. I don't care whether the archive link that comes up first is the site that officially owns the mailing list or some other random site. As long as they have the messages in a readable format, I can find the answer I'm looking for.
In the case of Wikipedia, we noticed around 2005 that a search on a piece of Wikipedia text would get three pages of mirror sites before us. I believe some people contacted Google asking "hey, what's up with that?" and then a short while later it was fixed.
(It was about then Wikipedia started showing up as the top of every Google result for everything ...)
I think we assumed that this was a more general algorithm penalising duplicate content. But your example suggests it isn't.
There's always the option to limit Google searches to a particular site.
Ex. to search pink butterflies on SO:
"site:stackoverflow.com pink butterflies"
But, I agree. Google search is starting to suck and syndication sites are taking over the internet. I really wish Google could find a way to drop the value of purely syndicated/scraped sites so the good content could be allowed to float back to the surface.
I was considering writing something to the Stack Overflow team after this happened to me the first time but, obviously, plenty of other SO users beat me to it.
I just tried a typical search, "python algorithm sum", and Stack Overflow was in the top three results.
"Google Algorithm Change Targets Content Farms"
Finally!!!!! People are waking up to the Tholian web. Bill Joy (in some pop Java guide) had a pretty clear grasp of how useful this all really is. He didn't seem impressed. Lately I see Google as the incredible shrink-wrapping graph. You see, there is a nasty little self-referential dividend they get with your every return visit. Ultimately the 'game' plays itself and expands its mindshare like a brain slug, till you can't remember anything without them. It's going to be tough, but some part of me misses actual pages and words that don't move or flash. In the end I owe everything to books. The other thing is I'm very edgy about the 'underside', you know, those 'spooky' pages that seem to have died since Google. I have a vague memory of loving to hunt, and of having tons of links to keep track of. As for what can 'we' do? I think we may need to build tools for keeping track of sites, capable of analyzing and presenting the connections in useful ways.
Google should treat every user fairly; more and more of us depend on them for everything.