If you've ever searched for anything, you've probably run into stop words. Stop words are words so common they are typically ignored for search purposes. That is, if you type in a stop word as one of your search terms, the search engine will ignore that word (if it can). If you attempt to search using nothing but stop words, the search engine will throw up its hands and tell you to try again.
Seems straightforward enough. But there can be issues with stop words. Imagine, for example, you wanted to search for information on the band The The.
"The" is one of the most common words in the English language, so a naive search for "The The" rarely ends well.
Let's consider some typical English stopword lists.
[Table: default SQL Server stop words alongside default Oracle stop words; the table contents were not preserved.]
You'd think a pure count of frequency, how often the word occurs, would be enough to make a common group of words "stop words", but apparently not everyone agrees. The default SQL Server stop word list is much larger than the Oracle stop word list. What makes "many" a stop word to Microsoft, but not to Oracle? Who knows. And I'm not even going to show the MySQL full text search stop word list here, because it's enormous, easily double the size of the SQL Server stop word list.
These are just the default stop word lists; that doesn't mean you're stuck with them. You can edit the stop word list for any of these databases. Depending on what you're searching, you might decide to have different stop words entirely, or maybe no stop words at all.
Way back in 2004, I ran a little experiment with Google -- over a period of a week, I searched for an entire dictionary of ~110k individual English words and recorded how many hits Google returned for each.
Yes, this is probably a massive violation of the Google terms of service, but I tried to keep it polite and low impact -- I used Gzip compressed HTTP requests, specified only 10 search results should be returned per query (as all I needed was the count of hits), and I added a healthy delay between queries so I wasn't querying too rapidly. I'm not sure this kind of experiment would fly against today's Google, but it worked in 2004. At any rate, I ended up with a MySQL database of 110,000 English words and their frequency in Google as of late summer 2004. Here are the top results:
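[Table: the most frequent words and their 2004 Google hit counts; the table contents were not preserved.]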
Again, a very different list than what we saw from SQL Server or Oracle. I'm not sure why the results are so strikingly different. Also, the web (or at least Google's index of the web) is much bigger now than it was in 2004; a search for "the" returns 13.4 billion results -- that's 25 times larger than my 2004 result of 522 million.
On Stack Overflow, we warn users via an AJAX callback when they enter a title composed entirely of stop words. It's hard to imagine a good title consisting solely of stopwords, but maybe that's just because our technology stack isn't sufficiently advanced yet.
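Here is a minimal sketch of that kind of check. It assumes a tiny hypothetical stop word list and a naive tokenizer; it is not Stack Overflow's actual code, and the names `STOP_WORDS`, `strip_stop_words`, and `is_all_stop_words` are made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "be", "is", "not", "of", "or", "the", "to", "what"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def strip_stop_words(text):
    """Return the words of `text` that are not stop words, in order."""
    return [word for word in tokenize(text) if word not in STOP_WORDS]


def is_all_stop_words(text):
    """True when the text contains words, but every one is a stop word."""
    words = tokenize(text)
    return bool(words) and not strip_stop_words(text)


print(strip_stop_words("The history of the matrix"))  # ['history', 'matrix']
print(is_all_stop_words("What is the what"))          # True
print(is_all_stop_words("The The"))                   # True
```

A search engine that drops stop words would run something like `strip_stop_words` on the query; the warning case is simply the query or title for which nothing is left afterwards.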
Google doesn't seem to use stop words any more, as you can see from this search for "to be or not to be".
Indeed, I wonder if classic search stop words are relevant in modern computing; perhaps they're a relic of early '90s computing that we haven't quite left behind yet. We have server farms and computers perfectly capable of handling the extremely large result sets from querying common English words. A Google patent filed in 2004 and granted in 2008 seems to argue against the use of stop words.
Sometimes words and phrases that might be considered stopwords or stop-phrases may actually be meaningful or important. For example, the word "the" in the phrase "the matrix" could be considered a stopword, but someone searching for the term may be looking for information about the movie "The Matrix" instead of trying to find information about mathematical information contained in a table of rows and columns (a matrix). A search for "show me the money" might be looking for a movie where the phrase was an important line, repeated a few times in the movie. Or a search for "show me the way" might be a request to find songs using that phrase as a title from Peter Frampton or from the band Styx.
A Google patent granted this week explores how a search engine might look at queries that contain stopwords or stop-phrases, and determine whether or not the stopword or stop-phrase is meaningful enough to include in search results shown to a searcher.
Apparently, at least to Google, stop word warnings are a thing of the past.
I didn't notice that until I read this!
Alfonso Jiménez on November 13, 2008 1:08 AM
Oh, my, you don't know how many times I tried to find any The The album in various web stores... If I did not know exactly the album title, I ended either with no matches or too many matches.
zgoda on November 13, 2008 1:35 AM
I remember once having an enormously frustrating time trying to use the internet to discover the name of the album by the band 'A' that contained their hit single, 'Nothing'.
In that case:
If I ever have a band I'll call it It is. Our first album will be called No Matches.
And my stage name will be Orange.
Kramii on November 13, 2008 2:03 AM
An all-stop-words title that probably gets used fairly often: What might have been.
Daniel Franke on November 13, 2008 2:04 AM
I'm almost positive that Google didn't ignore stop words back in 2004 either. I remember searching for all-stop-word queries such as 'the the' or 'the who' (a matter of personal taste in music :) and getting nice results.
I think the differences in the stopword lists could be partly attributed to the different corpora used for frequency counts + some tuning of the number after checking the results. if oracle used their own corporate archive to decide the stopwords and mysql used the Wall Street journal archive - they'll probably get different lists (obviously, the most frequent ones will be the same).
btw, back at 2004 google were granting keys and API for 1000 automated queries a day. use 10 email accounts and obtain 10 keys... without violating their policy.
Stop words have always been a pain, because they lack granularity (they just don't exist). But going without any kind of search filtering is just as annoying because it brings up unwanted data, and even google still falls for this every so often.
What happens if you're looking for video games written in PHP? A search for php video game on google returns a vast heap of results that are results for video game which happen to have a php extension somewhere in their URL. The same happens quite frequently when you're looking for esoteric PHP-related concepts on the web. In this situation, neither stop-wording php nor avoiding any stop-words solves the problem: a more clever technique is required to eliminate or rate down some occurrences but not others.
Victor Nicollet on November 13, 2008 2:18 AM
Hard to believe someone in this day and age would search on the the and not "the the".
OG on November 13, 2008 2:21 AM
One key point is that the list Jeff gives (522,000,000 for 'the' and so on) is not the frequency of the word, but the *number of pages* containing that word.
The word itself may appear many times within the page, meaning that the relative frequency of 'the' and 'of' compared with 'reviews' is much greater than indicated. A typical 500-word page will probably have a few dozen 'the's.
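The distinction is easy to see on a toy example. This is only a sketch on three made-up "pages" (nothing here comes from the 2004 data): it counts how many pages contain a word at least once, which is what a hit count measures, and how many times the word occurs in total.

```python
from collections import Counter

# Three made-up "pages"; any real corpus would be far larger.
pages = [
    "the band the the released the album",
    "reviews of the album",
    "orange",
]


def document_frequency(pages):
    """Count the pages that contain each word at least once."""
    counts = Counter()
    for page in pages:
        counts.update(set(page.split()))  # set(): one vote per page
    return counts


def term_frequency(pages):
    """Count every occurrence of each word across all pages."""
    counts = Counter()
    for page in pages:
        counts.update(page.split())
    return counts


df = document_frequency(pages)
tf = term_frequency(pages)
print(df["the"], tf["the"])          # 2 5
print(df["reviews"], tf["reviews"])  # 1 1
```

By page count, 'the' is only twice as common as 'reviews' here; by occurrence count it is five times as common.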
I imagine Google has plenty of rules added by hand for these words, to reflect the fact that the semantics of 'review' are much more specific than those of 'of'. Plus they seem to give a higher ranking to results which use the search keywords as a phrase rather than separately; so 'the the' and 'to be or not to be' are probably handled in that way. If you rearrange the words in 'to be or not to be' you will get different results.
Leigh Caldwell on November 13, 2008 2:23 AM
We're missing out, potentially, on more great band names:
What the ...?
Who the ...?
Stop me, oh, stop me...
Stop me if you think that you've heard this one before
I'm getting only 11 billion results for the:
Results 1 - 10 of about 11,840,000,000 for the. (0.30 seconds)
The worst of all in that respect is the band Can. Searching for them has always been a real pain. At least that's not a terrible name. But the band Ours is both troublesome to search for *and* an annoying name. I avoid mentioning them to people even though I quite like their music.
Doko E on November 13, 2008 3:28 AM
Wikipedia, on the other hand, takes search terms literally, without parsing. Searching for The The in Wikipedia is a five second job.
graw on November 13, 2008 3:31 AM
How's it going with integrating Lucene into Stackoverflow?
Having worked with Lucene in the past, I can truly say that search is one of the most interesting technologies I've worked with.
http://bash.org/?514353
... but google has learned their lesson.
Doing a naive search for the the on google UK gives a more reasonable result.
http://www.google.co.uk/search?rlz=1C1GGLS_en-GBGB291sourceid=chromeie=UTF-8q=The+The
Half way down the first page it suggests the the band as an alternative search query.
Mark Allanson on November 13, 2008 4:32 AM
Out of interest, did the dictionary you used for your Google experiment contain any profanity?
Paul D. Waite on November 13, 2008 4:34 AM
I'm an internet search engine developer, and yes, we have to carefully handle stop words; we never ignore them.
Pavel on November 13, 2008 4:40 AM
Johnny - quit that.
Morrissey on November 13, 2008 5:00 AM
If you search for The The with quotes, Google returns the band's official website, their Wikipedia page, etc. Exactly what you're looking for. Specifying an exact phrase instead of just a set of words seems to solve a lot of the issues you identify.
David on November 13, 2008 5:08 AM
One key point is that the list Jeff gives (522,000,000 for 'the' and so on) is not the frequency of the word, but the *number of pages* containing that word. The word itself may appear many times within the page, meaning that the relative frequency of 'the' and 'of' compared with 'reviews' is much greater than indicated. A typical 500-word page will probably have a few dozen 'the's.
An excellent point, and of course you're right. An actual frequency count won't be the same as "appeared at least once on the page", which is what the Google results measure.
Jeff Atwood on November 13, 2008 5:10 AM
Specifying an exact phrase instead of just a set of words seems to solve a lot of the issues you identify
Right, you and I know that, but the average search user doesn't know to put things in quotes -- they just type stuff and expect it to work.
Jeff Atwood on November 13, 2008 5:11 AM
My earliest experience with stopwords was 8 or 9 years ago. I searched for the who (without quotes) on Google. It filtered both of the words out and told me it couldn't find any results. I didn't use it for 3 or 4 months after that, but we ALL eventually turn back to Google.
Jake Voytko on November 13, 2008 5:22 AM
There's a brand new remastered Smiths greatest hits out with some rarities. Both Marr and Moz endorsed it:
The Sound Of The Smiths
http://www.amazon.com/Sound-Smiths-Very-Best-Deluxe/dp/B001EX6DNK/ref=wl_itt_dp?ie=UTF8coliid=I3UFG84NDTRPWEcolid=1OFNT6W8VZ56T
Just a silly follow-up to the "appeared at least once on the page" nature of the discussion: is /that/ even true? If a billion pages link the word orange to a given page, won't that page turn up pretty high in searches for orange, even if it doesn't contain that word in the content?
Five Minute Argument on November 13, 2008 5:39 AM
I was reading about a project (Nutch I think) the other day where each stop word is combined with its following word to form a new, uncommon term. For example:
The band The The was a great band
would be analyzed and produce something like: band thethe great band
Phil on November 13, 2008 5:39 AM
(edit to last message)
well, I would suppose it would probably produce: theband band thethe wasa agreat great band
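A rough sketch of the idea described above, under loudly stated assumptions: this is not Nutch's actual implementation, the three-word stop list is hypothetical, and the rule used is simply "glue every stop word to the word that follows it, and emit every other word unchanged".

```python
# Hypothetical stop word list, just large enough for the example sentence.
STOP_WORDS = {"a", "the", "was"}


def stop_word_bigrams(text):
    """Replace each stop word with a term made of it plus the next word.

    Non-stop words are emitted unchanged. A stop word at the very end of
    the text has no following word and is dropped.
    """
    words = text.lower().split()
    terms = []
    for i, word in enumerate(words):
        if word in STOP_WORDS:
            if i + 1 < len(words):
                terms.append(word + words[i + 1])
        else:
            terms.append(word)
    return terms


print(" ".join(stop_word_bigrams("The band The The was a great band")))
# theband band thethe thewas wasa agreat great band
```

Applied mechanically, the rule also produces `thewas` (the second "the" followed by "was"), which the hand-worked output above leaves out. Either way, a query for The The becomes a search for the single, rare term `thethe`.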
I'm using Lucene too. I'm very satisfied with it.
The only problem I have is when an index is updated, inserted or deleted very often. I sometimes get an error message saying that an index file isn't readable.
I haven't found a real solution for that yet. I now store all the IDs to index or delete in a table. A cron job takes care of this table and does all the index work. So it's not a real live search, but one with a delay of roughly 15 minutes. Any suggestions for that?
Btw. I love your blog and read all the books you have recommended here...
Marco Schierhorn on November 13, 2008 5:49 AM
Actually the trick for searching common English words (at least with Google) is to search in another language
http://www.google.pt/search?hl=pt-PTq=the+the
or
http://www.google.es/search?hl=esq=the+the
will yield the correct results: The The, the band, is the first result
Sven on November 13, 2008 6:00 AM
My pet peeve is programming sites that don't let you search for things like c++, or stl::hash
a nony mouse on November 13, 2008 6:03 AM
The worst of all in that respect is the band Can
I can has search?
No, but seriously, the worst group to search for (at work) is the Barenaked Ladies.
Bill on November 13, 2008 6:06 AM
I remember trying to look up the TV series As If a few years ago with no success. It works in Google now - nice.
N. Velope on November 13, 2008 6:08 AM
Another thing to consider is that (I think) Google also uses N-Gram models (I seem to recall that they released a set of models up to 3-Gram or 5-Gram from their corpus).
http://en.wikipedia.org/wiki/N-gram
And in a weird bit of serendipity, Johnny Marr played in The The for several years.
Mike on November 13, 2008 6:23 AM
Stop words aren't a relic of early '90s computing, they're a relic of standard pre-web information retrieval systems (reaching much farther back than the '90s!). Stop words were an enhancement to the quality of search results, just like word stemming or tf-idf.
This is from a world that searched databases of scholarly or otherwise serious information--you wanted to get _everything_ relevant to your query (all 1000+ relevant documents perhaps), and nothing that wasn't relevant.
Stop words allowed you to avoid the situation of returning a document that said "the the are the the." to a query asking for "the white house", just because you included the word "the".
Google is in a whole new world. You will likely have several thousands of results for any query, and you want only the best few, so stop words are certainly less relevant than they were before. If you've got great results in your top ten, who cares if you return "the the are the the" as your 151st result?
Rudd on November 13, 2008 6:49 AM
Having been a web surfer since the days of NCSA Mosaic, I just got in the habit of not even bothering to type in stop words (which I generally define as any word that would not be capitalized in a title) on my search queries.
I guess I need to break that habit now. Thanks for the info.
T.E.D. on November 13, 2008 6:51 AM
Sites should do an exact phrase match unioned with a non-phrase match for any local search.
When I type something in to google I almost always use quotes. If google doesn't find the exact match it automatically falls back to dropping the quotes and executing the query again without my intervention.
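One way to read that suggestion, as a hedged sketch over an in-memory list of strings (the function names are hypothetical, and this says nothing about how Google actually does it): rank documents containing the exact phrase first, then append documents that merely contain all of the words.

```python
def contains_phrase(tokens, phrase):
    """True when `phrase` (a list of words) occurs as a consecutive run in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def search(query, documents):
    """Exact phrase matches first, then documents containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    exact, loose = [], []
    for doc in documents:
        tokens = doc.lower().split()
        if contains_phrase(tokens, words):
            exact.append(doc)
        elif all(word in tokens for word in words):
            loose.append(doc)
    return exact + loose


docs = [
    "the history of the band",
    "The The are a band from London",
    "a band called Can",
]
print(search("the the", docs))
# ['The The are a band from London', 'the history of the band']
```

Note that no stop word list is involved: the phrase match is what pushes the relevant document to the top, and the looser match only fills in behind it.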
Chris Lively on November 13, 2008 7:28 AM
The Smiths and The The?
Jeff - Your 80s are showing.
Ordinary Geek on November 13, 2008 7:45 AM
I am still wondering why 'sex' is not at the top of Google's list... :P
lontxo on November 13, 2008 7:48 AM
A search for the the on yahoo gives excellent results.
Vinod on November 13, 2008 8:10 AM
Searching for a phrase with the same stop word is weird! If I search for The in Google, it gives me 13,490,000,000 results; if I search for The The The The, I get 1,160,000,000 results, and so on! Indexing?
Saj on November 13, 2008 8:55 AM
It sure would be nice if google would stop trimming special characters out of a search even when it's a search string enclosed in quotation marks! ANY way at all to escape important symbols (my example last time I complained about this was wanting to quickly look up the syntax for the $get shorthand; man is that a useless search string once the $ gets trimmed out) would be great
Grank on November 13, 2008 9:06 AM
Saj,
I assume there are about 12.33 billion pages that have the word 'the' at least once, but not more than 3 times.
sobani on November 13, 2008 9:09 AM
Searching for The The? Jeff, if I didn't adore you before, I certainly do now!
Heather on November 13, 2008 9:10 AM
I didn't notice those words until I read this article.
xerafhica on November 13, 2008 9:18 AM
@Sobani:
Exactly. Which makes the indexing bizarre! Not as smart as it could be.
Saj on November 13, 2008 9:26 AM
Especially since stopwords differ so much from one language to another... THE is the French word for tea, OR is the French word for gold, THESE also means thesis, and so on... Dunno about other languages, but it's pretty hard to find a good golden thesis on tea...
Nicolas on November 13, 2008 9:38 AM
@Grank
Here is what you get if you use google's code search engine:
http://www.google.com/codesearch?hl=enlr=q=%22%24get%22
Thanks for posting this. I'm building a search application and the information from your MySQL link might make things considerably faster.
Thanks again for posting this topic!
David on November 13, 2008 9:53 AM
Apparently, at least to Google, stop word warnings are a thing of the past.
which is not to say that people should start implementing that into their applications, or that Databases need to change
this is Google you're talking about, the best (by far) search algorithm up to now
Eber Irigoyen on November 13, 2008 9:54 AM
Your example is funny ... I just started a vinyl record website and I have this comment in my code:
#TODO, ALLOW EXCEPTIONS ON STOP WORDS, FOR EXAMPLE, ARTIST The The
Rob Lambert on November 13, 2008 9:55 AM
The band Live is almost impossible to search for on Google. The first couple of results are relevant (wikipedia and the official website), but after that, it's all about live bands. 'Live music', and 'Live CDs' are equally worthless queries. Of course, this isn't so much about stopwords as a semantic failure.
sancho on November 13, 2008 10:02 AM
There's one band that's even harder to search for than A; it's the outfit that brilliantly decided to call themselves !!!:
http://en.wikipedia.org/wiki/!!!
(The article mentions that you can find them by searching for chk chk chk, like searching for love symbol to find Prince in his glyph period.)
Daniel Rutter on November 13, 2008 10:13 AM
And behold, the comment auto-link-highlighter can't believe ! could be in a URL :-).
Daniel Rutter on November 13, 2008 10:13 AM
No, but seriously, the worst group to search for (at work) is the Barenaked Ladies.
Anal Cunt is worse, trust me. If you don't get fired for searching those words, you'll certainly be fired for your eclectic taste in music.
I work for Barnes and Noble as a lowly bookselling drone, and get a kick every time that someone asks for What is the What[1], since all four words of the title are stop words in the internal search system!
[1]http://search.barnesandnoble.com/What-Is-the-What/Dave-Eggers/e/9780307385901
humblefool on November 13, 2008 11:14 AM
saj, on results for the the the the: What you've discovered is not a billion pages with that phrase, but a flaw in Google's page-count estimator that makes it think there are a billion. When you get the page with the first ten results, it doesn't actually go count how many pages there are with the phrase; it makes up an estimate based on word frequencies and such. That estimate is often way too high.
I'm not sure how accurate their page counts are for single individual words, but I wouldn't necessarily trust Jeff's 2004 results to be exact, either.
Brooks Moses on November 13, 2008 11:24 AM
Having stop words in Google's phrase queries is something that I actually miss, because there was a clever hack that involved them. If you searched for, say, "row the boat", what it would really do was a wildcard search for "row * boat", where * is any single word. That was occasionally quite useful when I could half-remember a phrase I wanted to search for, as I could just use "the" as a wildcard for the words I couldn't remember. And as far as I can tell there's no other way to do exactly that search.
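The behavior described in that comment can be mimicked in a few lines. This is a sketch of the described effect only, with a hypothetical stop word list; it is not a claim about how Google implemented phrase queries.

```python
# Hypothetical stop word list; each stop word in the query acts as a wildcard.
STOP_WORDS = {"a", "an", "of", "the", "to"}


def wildcard_phrase_match(query, text):
    """Phrase match in which every stop word in the query matches any single word."""
    pattern = [None if word in STOP_WORDS else word for word in query.lower().split()]
    tokens = text.lower().split()
    n = len(pattern)
    if n == 0:
        return False
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # None is the wildcard slot; every other slot must match exactly.
        if all(p is None or p == t for p, t in zip(pattern, window)):
            return True
    return False


print(wildcard_phrase_match("row the boat", "Row row row your boat gently"))  # True
print(wildcard_phrase_match("row the boat", "row the little boat"))           # False
```

The second call fails because the wildcard stands for exactly one word, which matches the "any single word" behavior described above.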
Brooks Moses on November 13, 2008 11:28 AM
My worst was trying to find information about COM (as in, Component Object Model, that ancient Microsoft technology). Most people use the acronym, not the expanded name, so I have to search for the acronym.
However, searching for COM gives me the most useless results: absolutely anything with a .com in it.
And also Comité Olímpico Mexicano (Mexican Olympic Committee)
----
I also HATE how Google tries to be smart and put Spanish results first because it notices Accept-Language: es, and probably that my IP is in Argentina. I want them ordered by RELEVANCE without caring about language. The most useful results for tech topics will be in English.
If a common word is essential to getting the results you want, you can include it by putting a + sign in front of it. (Be sure to include a space before the + sign.)
http://www.google.com/support/bin/static.py?page=searchguides.htmlctx=basicshl=en
I wrote a note the other day to a vendor. An ingredient in their product was rapeseed. I had copied the ingredient list from their web site and pasted it into a text box. The web application kept failing because it automatically detected profanity. I removed the ingredient and it worked. Their application probably did a plain old dictionary search.
De Morgan's laws: http://en.wikipedia.org/wiki/De_Morgan%27s_laws
!(p & q) == !p | !q
Therefore
!(!p & !q) == !!p | !!q
Canceling the double negative out front of each term gives the intended result of:
p | q
Use De Morgan's law to hack an OR query on Stack Overflow if you *really* need that OR ability.
But I didn't do so hot in discrete mathematics back in college so YMMV. Be sure to check over the logic. ;)
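Checking the logic takes only a truth table. A small sketch (the helper name `or_via_and_not` is made up) that verifies both identities over all four combinations of `p` and `q`:

```python
from itertools import product


def or_via_and_not(p, q):
    """Express `p or q` using only AND and NOT, via De Morgan's law."""
    return not ((not p) and (not q))


for p, q in product([False, True], repeat=2):
    # First law: !(p & q) == !p | !q
    assert (not (p and q)) == ((not p) or (not q))
    # Derived form: !(!p & !q) == p | q
    assert or_via_and_not(p, q) == (p or q)

print("Both identities hold for all four truth assignments")
```

So a search that supports only AND and NOT can express "p OR q" as NOT (NOT p AND NOT q), provided the search syntax lets you negate a whole group.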
Ness on November 13, 2008 12:08 PM
So when will you be adding the or operator on the stack overflow tag search?
Craig Francis on November 13, 2008 12:28 PM
This reminds me of a very nice quote on bash.org's top 100 list...
it was actually a friend of mine getting furious and frustrated since Google didn't return any search results for his query on the band... 'the who'
:)
So when will you be adding the or operator on the stack overflow tag search?
I have never used or in 10 years of using Google to search for things, quite successfully I might add. Why would I need it on SO? Is there some point in the far future where it becomes useful?
I do use not sometimes, and we support that on SO.
Jeff Atwood on November 13, 2008 12:35 PM
Check out http://www.tbray.org/ongoing/When/200x/2003/07/11/Stopwords
Tim on November 13, 2008 12:36 PM
My mistake... it has already been added:
http://stackoverflow.com/questions/tagged/html%20or%20mac
And the reason for the addition was because I wanted my own feed with unanswered questions, for all the areas (tags) I'm interested in.
Craig Francis on November 13, 2008 12:38 PM
In general, band names are just the worst. But Google handles the well known ones pretty accurately; surprisingly, X and The Band both turned up the bands right away. Lesser known bands like Hey Hey My My (from France), though, not so much.
tk. on November 13, 2008 12:38 PM
My mistake... it has already been added:
Oh, you meant for *tags*. Sorry, I misunderstood you. Sure -- you're right, AND and OR are necessary for tag browsing.
I don't really think of that as search, though.
Jeff Atwood on November 13, 2008 12:57 PM
http://www.google.com/search?hl=ruq=the+thestart=0sa=N
First link leads you to mp3 file of this band.
If you want a really good list of common words and phrases, you can buy Google's 6 DVD set of n-grams for $150: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
That would be a word/phrase frequency count from Google's crawl of a trillion web pages.
David Leppik on November 14, 2008 1:48 AM
Not that I've read the replies, but it's a security issue how you configure the stop words, no? As they're released and there's no standardization in databases, it is arduous at the least as a developer.
Re: Google, YouTube automatically wraps your input with now. Isn't the new workaround to put a + before a stop word? No sense in that?
-Kapital K +THE
Searching for http also gives very interesting results on Google.
@ Brooks Moses
Thanks. But that should not be the case.
Saj on November 14, 2008 4:54 AM
Actually, as a geek, the most frustrating thing is dealing with punctuation. How many times do I search for code or a specific error message that contains punctuation that is simply stripped out?
Or just search for C#, more likely than not, the hits you get back will be about C, not C#.
Which really makes you wonder, since it's us geeks that write search engines...
Dude on November 14, 2008 7:03 AM
What happened to the font? Ew.
Charles on November 14, 2008 9:38 AM
I'd just be happy if I could search for .NET and not get results for every URL ending in .net, or C# and not get ones for C. They are quite unique terms, nowadays.
Buck on November 14, 2008 10:02 AM
Love it! Good article Jeff.
I just tried a search for the in google and got this:
Results 1 - 10 of about 12,030,000,000 for the. (0.23 seconds)
That's over a billion more results than a posting above - I wonder why?
Sci-Fi Si on November 16, 2008 9:24 AM
It will be fun if Windows 7 is out ...
;-)
Never use single-letters as distinctive product names!
Jeff, I now have even more respect for you, The The are my favorite band of all time, seen them (him - Matt Johnson) in concert many times, not only do you maintain a great blog, and have proved yourself a fantastic web developer (stackoverflow is so cool), but you now have taste in music!
http://en.wikipedia.org/wiki/Matt_Johnson_(singer)
http://www.thethe.com/
How about the title:
Should I use as or is and ()?
The main reason for using stop words is for performance reasons. It is expensive to process huge hit lists with most search engines.
Note that Google just gives estimates for the total number of results. I'm sure you noticed the nice round numbers. Back in 2004 if you queried multiple times in a row you would occasionally get different top results and frequently get different total number of results.
Sean Timm on November 27, 2008 4:53 AM
The concept of stop words is a crude approximation to a word probability distribution. If your language model is good enough, then it should already discount common words in search, either through absolute frequency or inverse document frequency. In fact, the discounting of stop words is the perfect way to test your search model, which is fundamentally a language model.
Also, I am extremely tempted to do the same thing as you did with querying Google for page counts.
Pasha on January 10, 2009 6:11 AM
The The - I haven't heard from that band since college...
By the way, the first half dozen or so items returned in a search for the the in quotes directly related to the band.
Jon Peltier on February 6, 2010 11:13 PM
I think the traditional problem with needing stop words is when you are searching only based on match frequencies. It is of course the obvious, and naive, algorithm, and it can produce poor results when the search terms are common.
Better modern approaches don't use a fixed set of stop words. Instead they use a corpus and statistical methods. This is not too different from going from the original anti-spam filters based solely on the presence of bad words, which rarely work well, to the statistical-based Bayes classifiers and such. The corpus can either be a large body of known text, or better yet, the actual and entire collection of documents you're searching. This also means it's adaptive and automatic; nobody is trying to guess which terms are more meaningful than others.
In this approach there's nothing special about the word the. It is only that if it is very common in the corpus then it will have very little weight compared to other search terms; but, importantly, not zero weight! There is a whole gradient of term weights, not just a binary 0 or 1. Likewise the weights are not fixed; as the body of indexed work changes, the weights can change too.
Combine that with term proximity searching and you're going to suddenly get very good results for the the.
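To make the "gradient of weights" concrete, here is a small sketch that derives term weights from the documents themselves. It assumes a smoothed inverse document frequency, `log(1 + N / df)`, which is just one common variant chosen so that even a word found in every document keeps a small positive weight; the corpus and function name are made up.

```python
import math


def idf_weights(documents):
    """Weight each word by how rare it is across `documents`.

    Uses log(1 + N / df), so a word present in every document still gets
    a small positive weight instead of being zeroed out like a stop word.
    """
    n = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc.lower().split()):
            df[word] = df.get(word, 0) + 1
    return {word: math.log(1 + n / count) for word, count in df.items()}


corpus = [
    "the matrix is a movie",
    "the band the the",
    "the white house",
]
weights = idf_weights(corpus)
print(round(weights["the"], 3))     # 0.693
print(round(weights["matrix"], 3))  # 1.386
```

Re-running `idf_weights` after the corpus changes gives new weights, which is the adaptive part: nobody maintains a list by hand.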
BTW, about the poster mentioning Lucene: people might also want to check out Xapian. It's not marketed as well and is not as glossy on the outside, but it has very sophisticated and finely tuned guts. It does this type of searching very very well.
One of the things that impressed me about Google the first time I tried it was the results of a search for go game. Other search engines of the day wouldn't produce anything useful unless I added romanized versions of foreign names for the game.
Today you can search Google for go and 3 of the top results are related to the game.
Hi Jeff,
I work for Microsoft's Live Search (http://www.live.com) and I can tell you that stop words are alive and well. However, like Google, we know when to use them and when not to use them. Search for the the on both live and google and you will see relevant results. Even without quotes the engines know what you mean.
To see an example where stopwords are used, try typing in {the nintendo 64} in both live and google. You'll notice that the word the is not highlighted. It was treated as a stopword in that case.
Tanton Gibbs on February 6, 2010 11:13 PM
| Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |