I <3 Steve McConnell*
Coding Horror
programming and human factors
by Jeff Atwood

December 12, 2005

Getting Started with Indexing Service

Microsoft's ancient circa-1997 Indexing Service gets no respect. And that's a shame, because it's a surprisingly decent content indexing engine that supports arbitrary metadata. Sure, there may be better choices, but Indexing Service's saving grace is that it's completely free. It's a default component of Windows 2000, Windows XP Pro, and Windows Server 2003. And I'll show you how you can programmatically query it from .NET, too.

First, let's set up a little Indexing Service catalog to play with:

  1. In Computer Management, right click Indexing Service and select New, Catalog.
  2. Give the new catalog a name. You'll use this name in code to select the right catalog, so treat it like a variable name and use something that makes sense. I used test.
  3. Select a location path for the catalog. Note that this is NOT the location of the content you want to index, but the physical location of the hidden catalog.wci index folder. I know, it's confusing. I chose c:\test\ as my location.
  4. Expand the new Test catalog, and right click the Directories folder. Select New, Directory. You can ignore the UNC textbox unless you're indexing content on a remote computer. Enter the path to the content you wish to index. I chose c:\test\index-me\.

Click OK, restart Indexing Service, and you end up with something like this:

indexing-service-hidden-folder.png

Now, if you plop files in the index-me folder, Index Server will automatically index them -- assuming an appropriate IFilter is installed. Click on the Indexing Service node to watch it happen; the total indexed and unindexed document counts are shown in real time.

index-server-doc-count.png

After you've added some documents, bring up the integrated query page (click the Query the Catalog node under your catalog) and verify that you can, indeed, find specific words in your documents. OK, full-text search works. That's not very exciting... until you add some custom metadata to the mix! Plop this HTML file in the index-me folder:

<head>
<title>Html Test Page 2</title>

<!-- automatically maps to DocAuthor in Index Server -->
<meta name="author" content="John Doe" />
<!-- automatically maps to DocKeywords in Index Server -->
<meta name="keywords" content="giraffe, elephant, mouse, aardvark" />
<!-- automatically maps to DocSubject in Index Server -->
<meta name="subject" content="animal" />

<!-- custom meta tags -->
<meta name="testing" content="dos" />
<meta name="metacategory" content="awesome" />
<meta name="metanumber" content="two" />
<meta name="metainteger" content="222" />

</head>
<body>
Jackdaws love my big sphinx of quartz.
</body>

Once you do, you'll notice that the Properties folder for your Catalog contains some interesting new properties that correspond to our <meta> tags:

indexing-service-custom-properties.png

As noted in the HTML comments, a few of the <meta> tags automagically map directly to standard Index Server properties. You can query these properties directly using Indexing Service Query Language:

$DocAuthor John AND Doe
$DocKeywords mouse
$DocSubject animal
$DocTitle test

That's nice for free, but the truly custom properties require a bit more work. If you try "$testing dos", you'll get an unceremonious "No such property" message. The first thing we need to do is mark the property cacheable. Right click the "testing" property in the properties folder and check the "Cached" checkbox:

indexing-service-property-dialog.png

Take note of the storage level drop-down as well, because it has a special meaning: properties marked as primary storage can be returned in the search results. This can be a big deal for performance, since you can have as many bits of metadata as you want coming back in the initial search results.

You'll need to reindex once you mark a property cacheable. There are two ways to do that:

  1. The scorched earth way: stop the service, delete the hidden catalog.wci folder, then restart the service.
  2. The obscure UI way: right click the directory in the Directories folder of your Catalog, select "All Tasks", then select "Full Rescan" or "Incremental Rescan".
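
The scorched earth way can also be scripted. What follows is only a rough sketch, not a tested tool: it assumes the Indexing Service is registered under the service name "cisvc", that the code runs with administrator rights, and that the catalog location is the c:\test\ folder chosen earlier. The CatalogReset class and RebuildCatalog method are made-up names for illustration.

```csharp
using System;
using System.IO;
using System.ServiceProcess; // requires a reference to System.ServiceProcess.dll

public static class CatalogReset
{
    // Assumption: the Indexing Service runs under the service name "cisvc".
    private const string ServiceName = "cisvc";

    // Stops the service, deletes <catalogLocation>\catalog.wci, then restarts the service.
    public static void RebuildCatalog(string catalogLocation)
    {
        string indexFolder = Path.Combine(catalogLocation, "catalog.wci");
        TimeSpan timeout = TimeSpan.FromMinutes(2);

        using (ServiceController service = new ServiceController(ServiceName))
        {
            if (service.Status != ServiceControllerStatus.Stopped)
            {
                service.Stop();
                service.WaitForStatus(ServiceControllerStatus.Stopped, timeout);
            }

            if (Directory.Exists(indexFolder))
            {
                // true = also delete everything inside the folder
                Directory.Delete(indexFolder, true);
            }

            service.Start();
            service.WaitForStatus(ServiceControllerStatus.Running, timeout);
        }
    }
}
```

Calling CatalogReset.RebuildCatalog(@"c:\test\") would then throw away the index for the sample catalog; it is destructive, so the rescan options in the UI remain the gentler choice.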

Once we've cached the property and rescanned, we now need to map the friendly name "testing" to the GUID of the new property. You can either map friendly names with a manually edited text file (see MSKB 1, MSKB 2), or you can do it in code at query time. We'll do it through code.

Create a new ASP.NET project and add a project reference to the ixsso COM object:

add-ixsso-reference.png

Next, place a TextBox, Button, and DataGrid on the default webpage. Add using statements for System.Data.OleDb and Cisso. Then paste this code in the Button1_Click event:

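// ixsso COM objects: the query itself, plus a helper used to set the search scope
CissoQueryClass q = new CissoQueryClass();
CissoUtilClass util = new CissoUtilClass();

OleDbDataAdapter da = new OleDbDataAdapter();
DataSet ds = new DataSet("IndexServerResults");

q.Query = TextBox1.Text;
// map the friendly name "testing" to the GUID and name of our custom <meta> property
q.DefineColumn("testing = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 testing");
q.Catalog = "Test";
q.SortBy = "rank[d]";   // [d] = descending, so the best matches come first
q.Columns = "rank, path, size, testing";
//q.MaxRecords = 1000;
// search c:\test\index-me\ and everything beneath it
util.AddScopeToQuery(q, @"c:\test\index-me\", "deep");

// run the query, then copy the resulting recordset into the DataSet
object o = q.CreateRecordset("nonsequential");
da.Fill(ds, o, "IndexServerResults");

DataGrid1.DataSource = ds;
DataGrid1.DataBind();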

Entering a query of "$testing dos" returns the above HTML document, as we would expect. The ability to query arbitrary metadata along with full-text search makes Indexing Service much more flexible and powerful than I had ever realized. A set of custom-generated HTML files could easily make an entirely database-driven website searchable, in tandem with the rich Indexing Service query language (examples, more examples).
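
To make the custom-generated HTML idea a bit more concrete, here is a minimal sketch of a page generator. It is an illustration rather than anything from a real site: MetaPageWriter, BuildPage and WritePage are hypothetical names, and WebUtility.HtmlEncode assumes .NET 4.0 or later (on older frameworks, HttpUtility.HtmlEncode from System.Web does the same job).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;   // WebUtility.HtmlEncode
using System.Text;

public static class MetaPageWriter
{
    // Builds a minimal HTML page whose <meta> tags carry one record's metadata.
    public static string BuildPage(string title, string body, IDictionary<string, string> meta)
    {
        StringBuilder sb = new StringBuilder();
        sb.AppendLine("<html>");
        sb.AppendLine("<head>");
        sb.AppendLine("<title>" + WebUtility.HtmlEncode(title) + "</title>");
        foreach (KeyValuePair<string, string> pair in meta)
        {
            // each entry becomes <meta name="..." content="..." />
            sb.AppendLine("<meta name=\"" + WebUtility.HtmlEncode(pair.Key)
                + "\" content=\"" + WebUtility.HtmlEncode(pair.Value) + "\" />");
        }
        sb.AppendLine("</head>");
        sb.AppendLine("<body>");
        sb.AppendLine(WebUtility.HtmlEncode(body));
        sb.AppendLine("</body>");
        sb.AppendLine("</html>");
        return sb.ToString();
    }

    // Writes the page into a folder that the catalog is indexing.
    public static void WritePage(string folder, string fileName, string html)
    {
        File.WriteAllText(Path.Combine(folder, fileName), html);
    }
}
```

A batch job could call BuildPage once per database row, passing a dictionary such as { "author", "John Doe" }, { "testing", "dos" }, and then WritePage(@"c:\test\index-me\", "record-2.htm", html), where the file name is whatever key identifies the row.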

In general Indexing Service just works, but there are some non-obvious things I ran into while experimenting with it:

  • Custom properties set to primary storage can be returned in the search results, but they are always returned as datatype "object", which means the DataGrid can't bind to them automatically.
  • I couldn't get pagination to work using the CissoQueryClass object properties. This means you'll realistically need to limit the number of results with the MaxRecords property, which does work.
  • There's a hidden performance tuning option. Right click the Indexing Service node, then select "All Tasks" and "Tune Performance."
  • Be sure to turn on abstracts -- short textual summaries -- for your catalog. You can do this by right clicking the Catalog, selecting Properties, unchecking Inherit, then checking "Generate Abstracts".
  • The list of "stop words" for Indexing Service can be edited in c:\windows\system32\noise.enu. See this MSKB article for more details.
  • If you are querying Indexing Service from ASP.NET, bear in mind that your queries will only return documents that the ASP.NET process account has permission to read! Don't let this one bite you like it did me.
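
One possible workaround for the first two items, sketched under assumptions: copy the filled results table into a new table whose columns are all typed as string (which the DataGrid can auto-generate columns for), and do the paging in memory after the query returns. ResultTableHelper, ToStringTable and GetPage are hypothetical names, and in-memory paging only makes sense when MaxRecords keeps the result set small.

```csharp
using System;
using System.Data;

public static class ResultTableHelper
{
    // Returns a copy of 'source' in which every column is typed as string.
    public static DataTable ToStringTable(DataTable source)
    {
        DataTable copy = new DataTable(source.TableName);
        foreach (DataColumn col in source.Columns)
        {
            copy.Columns.Add(col.ColumnName, typeof(string));
        }

        foreach (DataRow row in source.Rows)
        {
            DataRow newRow = copy.NewRow();
            foreach (DataColumn col in source.Columns)
            {
                // keep nulls as nulls; turn everything else into its string form
                newRow[col.ColumnName] = row.IsNull(col)
                    ? (object)DBNull.Value
                    : Convert.ToString(row[col]);
            }
            copy.Rows.Add(newRow);
        }
        return copy;
    }

    // Returns one page of rows (pageIndex is zero-based) as a new table.
    public static DataTable GetPage(DataTable source, int pageIndex, int pageSize)
    {
        if (pageIndex < 0) throw new ArgumentOutOfRangeException("pageIndex");
        if (pageSize <= 0) throw new ArgumentOutOfRangeException("pageSize");

        DataTable page = source.Clone(); // same columns, no rows
        int start = pageIndex * pageSize;
        int end = Math.Min(start + pageSize, source.Rows.Count);
        for (int i = start; i < end; i++)
        {
            page.ImportRow(source.Rows[i]);
        }
        return page;
    }
}
```

In the button handler above, that would mean binding to something like ResultTableHelper.GetPage(ResultTableHelper.ToStringTable(ds.Tables["IndexServerResults"]), 0, 25) instead of binding to ds directly; the page size of 25 is arbitrary.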

Posted by Jeff Atwood

 


 

Comments

I have often wondered (and not yet found time to find out) whether the data gathered by the indexing service remains secure or whether it can be used to gain insight into documents you would otherwise have no access to.

One of the reasons why I never got into using the context menu properties dialog custom fields for documents in Windows Explorer is that the MRU list seems to be shared between all user accounts, so comments placed against files can be read by anyone regardless of permissions just by opening a combo box list (at least this was the case when I last tried it).* This then triggered the suspicion that perhaps the indexing service also needed some testing effort before trusting it to make sure that it didn't have the same problem, at which point it seemed too much bother. A good example of the sort of obvious vital questions that documentation typically never covers.

*The other reason was that many applications recreate their documents files when editing so anything you put in there has only the odd chance of being retained over time.

Paul Coddington on December 13, 2005 09:15 PM

A few notes.

1. A shout out to my homey David Truxall, who had the only decent (aka not Server.CreateObject) .NET code sample for querying Index Server:

http://www.dotnetjunkies.com/WebLog/davetrux/archive/2004/03/03/8345.aspx

2. I did some performance testing of Index Server using the BBS and MAGAZINE archives of http://www.textfiles.com . That's 16,164 textfiles in 358 folders (408 megabytes total). The catalog.wci index folder was 123 megabytes. With that corpus, I got query times of..

"bbs", 7262 rows, ~400 ms
"phreak", 848 rows, ~50 ms
"hack", 2602 rows, ~135 ms
"apple", 1910 rows, ~100 ms
"atwood", 16 rows, ~5 ms

The number of results returned is obviously critical to Index Server performance, which makes it even more of a shame that I couldn't get paging to work. Luckily the MaxRecords property works fine to restrict the total # of results.

Jeff Atwood on December 13, 2005 09:18 PM

I have an entirely dumb question here, which is how this compares with various desktop-search utilities (for example, the one from that search company with the big G, little o.)

mike on December 13, 2005 10:46 PM

Great post! I have another dumb question: how would you apply this to index a database-driven ASP.NET site, apart from having the site generate static HTML files?

Haacked on December 14, 2005 12:03 AM

> how this compares with various desktop-search utilities

I am totally unclear how Index Server cooperates with the standard Windows Search, if at all. I'm not sure it does. There's also that Advanced button in the file and folder properties dialog which contains a "For fast searching, allow Indexing Service to index this folder" checkbox.

In general, as a standard desktop search, it's pretty weak. But as a basic indexing solution for programmers, it's not bad.

> apart from having the site generate static html files?

You got it. Indexing Service can only index files in the filesystem. There would have to be some kind of background or batch process that periodically generated HTML files that represent your database to a series of folders. With the <META> tags properties, this is totally feasible. I have it on good authority that http://www.drugstore.com still indexes its site this way, for example.

I'm sure it's possible to come up with a fancier solution, but hey, this one is free!

Jeff Atwood on December 14, 2005 12:48 PM

There are free solutions out there that are a lot easier to deploy. Yes, this search engine ships as part of windows, but it is a huge pain to configure, it is opaque and generally confusing.

If you need freetext search in a program, I'd be happier with something simple like SWISH-E (http://swish-e.org/), for example.

Christian Mogensen on December 15, 2005 09:05 AM

> but it is a huge pain to configure, it is opaque and generally confusing

Well, I'm not so sure about that. Anyone can set up a quick Indexing Service demo app within 10-15 minutes using what I just posted.

It's kind of clunky, definitely, but it's not so complicated if you have sample code and guidance (eg, this post).

Jeff Atwood on December 15, 2005 01:42 PM

Very nice article. I have been wondering if there was a way to index and display MetaData. This is probably the best information on the Indexing Service that I have found from around the internet, especially related to .NET integration.

Very nice job.

Ryan Smith on December 15, 2005 02:08 PM

I have a question that is different from what you have posted here, but I thought you guys could probably answer it.
How do I change the default location of the catalog.wci directory? I have a web server with C and D drives, and the C drive is filling up. The catalog.wci folder is ~1.3 GB and growing, taking up space on the C drive. I want to change the path to the D drive, e.g. d:\inetpub instead of the default c:\inetpub. I have researched this and found that the location is set in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex

and specifically the key "ISAPIDefaultCatalogDirectory",
but when I look it up in the registry of the machine in question it just shows "Web".
Any idea how I can change this?
Are there any implications to this change?

Thanks
Vijay

Vijay on January 27, 2006 01:23 PM

> How do i change the default location of the catalog.wci directory

I would delete the current index and recreate it-- you select the physical location of the catalog in steps 1-3, above.

Jeff Atwood on January 27, 2006 01:36 PM

Is there some way to set the catalog to incrementally update itself on some timeframe (eg at 2300 daily) ?

Catherine on February 9, 2006 08:12 AM

The catalog has hooks in the filesystem. It should update more or less automatically as files are added, deleted, or changed in the underlying filesystem.

So the catalog should always be very close to updated with no interaction at all. However, if you copy 50,000 files into a folder in the catalog, it might take a little while to get up to date..

Jeff Atwood on February 9, 2006 02:10 PM

Hi, when I delete a file it still shows as a result of a search for a few hours. Is there a way to trigger it on each delete??

Diego on February 9, 2006 02:21 PM

It works programmatically, but I cannot make it work with those .idq files whatever I try.

Where should those be located?

I put one .idq file in a folder and put the path in the registry where it was suggested, and nothing. I cannot query the parameter.

I repeat: through C# I can do that.

What I really need is to be able to programmatically add catalogs (I know how to do that), remove catalogs (I know how to do that too), add properties (I know how to do that too) and make them cached (HOW IS THAT DONE IN C#?). Please help me.

thank you in advance.

Cristi Manole on March 20, 2006 10:47 AM

Does anybody know how to display the phrase around the hit in the page results?
I don't want to highlight the hits in the document, but to show the title and, instead of the abstract, the phrase around the hit.

Karina on October 6, 2006 12:46 AM

Is there a way i can highlight the search words in the searched document?

Parsh on November 19, 2006 10:15 AM

First of all, this is a very useful article! Thanks for sharing!

>Karina on October 6, 2006 12:46 AM
Is there a way i can highlight the search words in the searched document?

Yes, there is : http://www.nsftools.com/misc/SearchAndHighlight.htm
works with javascript.

Jeroenvw on December 28, 2006 03:09 AM

Hi, the article is really useful!

But I have performance problem when I have to load about 2000 results. The delay that I get is about 20 seconds and the problem seems to be in that row:

da.Fill(ds, o, "IndexServerResults");

Can you advise me what to do about it?

Jeronimo on February 1, 2007 05:42 AM

Hi, I keep having difficulties implementing this on our Windows 2003 SBS server :-(
Catalogs are being built from the filesystem, not from webpages...
Would there be a standard set of aspx pages (or other) that I can upload and that, with some changes (pointing to the correct catalog), would actually work? I have tried dozens, and none of them gives me a result in the browser.
When I query through the default Indexing Service interface, normal results appear...

Much appreciated & thanks in advance...
Regards,

De Special on March 7, 2007 12:56 PM

Great article Jeff. Thank you for your time and attention.

The footnote about the "BITING ASP.NET SERVICE SECURITY CREDENTIALS" really did save me - albeit after a day's "where the hell are the documents, the query runs fine and there are no errors!" frustration. Well, better late than never.

One quick note: the real power of Indexing Service is the IFilter API. Most companies develop filters for their proprietary document types, and with this API Indexing Service is able to index virtually all documents that you have filters for (pdf, raw image files (with XMP), dwg, mp3, doc, excel, jpg, gif etc). From this perspective, there is no alternative to this immensely powerful monster that is embedded in your server, and I'd go as far as to say that any serious web developer would be foolish not to try and master this free tool. Learn and use all the capabilities of your environment; if you don't, sooner or later someone who does will come and kick your a**.

Soksa.Icy on July 19, 2007 09:20 AM

Hi. I am trying to index a network share (ex. \\server\share) The only way I can get it to work is to plug in an admin account when creating the directory in indexing service. It will index the network share then. I want to be able to index our file server from our IIS server....

thanks,
Ben

Benjamin Cox on July 20, 2007 02:44 AM

How do you filter "unfiltered" documents? I have 3 Word documents that will not show up and I can't figure out why.

MJ on September 6, 2007 08:54 AM

Nice article! Really really nice!

Well, I have a question: I want to index a directory that is on another server, but when I use a mapped name like "\\directory" or "Z:/directory" it doesn't seem to work. Queries return no results, and Total Docs shows just 2 (I don't know which ones), even though there are hundreds of documents in that directory. Is there something I'm missing? What do I have to do?

Thanks in advance!

Benny.

Benny on March 5, 2008 01:59 PM

Those are very nice tips! Lucky I found your article. It saved me a lot of time fixing my problem.

Jane on April 21, 2008 06:03 PM

Can someone suggest what the query command would be to get the directory path? I tried "#direstory "TARGETNAME=kato" but received an error.

Mary Thomas on June 10, 2008 11:48 AM

And this service has been sitting there for all these years?!
<bangs head on table three times>

Thanks for the tip!

Speaking of indexing, do you have any opinion about Lucene.NET?

Markus on July 2, 2008 09:56 PM











Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.