In this article I will show you how to implement a basic Lucene indexer that loops through the nodes of an XML document. As an example I will use the Books.xml sample file from Microsoft. I assume you have created a new Console project and referenced the Lucene.Net library. I use version 2.0.4 in this example. Let's look at the code:

First reference some namespaces:

using System;
using System.Globalization;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

Nothing exciting going on here.

Add the following code to Program.Main():

private static void Main(string[] args)
{
    // create a directory to store the index in
    string indexPath = @"c:\LuceneSampleCatalog";
    Directory.CreateDirectory(indexPath);

    // index the books
    IndexBooks(indexPath);
}

Again, nothing exciting here: we just make sure the index directory exists and call the IndexBooks() function.

And here is the IndexBooks() function:

private static void IndexBooks(string indexPath)
{
    DateTime startIndexing = DateTime.Now;
    Console.WriteLine("start indexing at: " + startIndexing);

    // read in the books xml
    var booksXml = new XmlDocument();
    booksXml.Load("books.xml");

    // create the indexer with a standard analyzer (true = create a new index)
    var indexWriter = new IndexWriter(indexPath, new StandardAnalyzer(), true);
    int indexedDocuments;

    try
    {
        // loop through all the books in the books.xml
        foreach (XPathNavigator book in booksXml.CreateNavigator().Select("//book"))
        {
            // create a Lucene document for this book
            var bookDocument = new Document();

            // add the ID as stored but not indexed field, not used to query on
            bookDocument.Add(new Field("id", book.GetAttribute("id", string.Empty), Field.Store.YES, Field.Index.NO, Field.TermVector.NO));

            // add the author and genre as stored and untokenized fields, the value is indexed as is
            bookDocument.Add(new Field("author", book.SelectSingleNode("author").Value, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));
            bookDocument.Add(new Field("genre", book.SelectSingleNode("genre").Value, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

            // add the title and description as stored and tokenized fields, the analyzer processes the content
            bookDocument.Add(new Field("title", book.SelectSingleNode("title").Value, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
            bookDocument.Add(new Field("description", book.SelectSingleNode("description").Value, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));

            // add the publication date as stored and untokenized field, note the special date handling
            DateTime publicationDate = DateTime.Parse(book.SelectSingleNode("publish_date").Value, CultureInfo.InvariantCulture);
            bookDocument.Add(new Field("publicationDate", DateField.DateToString(publicationDate), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

            // add the document to the index
            indexWriter.AddDocument(bookDocument);
        }

        // merge the index segments to make searching faster
        indexWriter.Optimize();

        // read the document count while the writer is still open
        indexedDocuments = indexWriter.DocCount();
    }
    finally
    {
        // close the index writer
        indexWriter.Close();
    }

    DateTime endIndexing = DateTime.Now;
    Console.WriteLine("end indexing at: " + endIndexing);

    // TotalSeconds is the whole duration; Seconds is only the seconds component (0-59)
    Console.WriteLine("Duration: " + (endIndexing - startIndexing).TotalSeconds + " seconds");
    Console.WriteLine("Number of indexed documents: " + indexedDocuments);
}

In this function all the magic happens. First we load the XML file that we are going to index. Next we open the index writer, the class responsible for writing the Lucene index. Then we loop through each book element in the books XML, create a document from it and add that document to the index.
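
If you want to see the XML loop on its own, without Lucene, here is a minimal sketch. It uses the same XPath query as IndexBooks(), but the BookXmlSketch class, its ReadBooks() helper and the two sample books are made up for illustration; they are not part of the original program or of the real Books.xml.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper class, only meant to illustrate the XPath loop.
internal static class BookXmlSketch
{
    // Made-up sample data in the same shape as the books XML.
    internal const string SampleXml = @"<catalog>
  <book id=""bk101"">
    <author>Sample Author</author>
    <title>Sample Title</title>
  </book>
  <book id=""bk102"">
    <author>Another Author</author>
    <title>Another Title</title>
  </book>
</catalog>";

    // Returns one "id: title" line per book element in the given XML.
    internal static List<string> ReadBooks(string xml)
    {
        var document = new XmlDocument();
        document.LoadXml(xml);

        var lines = new List<string>();

        // same query as in IndexBooks(): select every book element
        foreach (XPathNavigator book in document.CreateNavigator().Select("//book"))
        {
            // attributes are read with GetAttribute, child elements with SelectSingleNode
            string id = book.GetAttribute("id", string.Empty);
            string title = book.SelectSingleNode("title").Value;
            lines.Add(id + ": " + title);
        }

        return lines;
    }

    // Prints the sample books to the console.
    internal static void PrintSampleBooks()
    {
        foreach (string line in ReadBooks(SampleXml))
        {
            Console.WriteLine(line);
        }
    }
}
```

Calling BookXmlSketch.PrintSampleBooks() from Main() prints one line per book. Note that SelectSingleNode() returns null when a child element is missing, so real input may need a null check before reading Value.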

Last, but certainly not least, we optimize the index. This merges the index segments into a single segment, which makes searching faster.
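
A quick way to check that the index was written is to open it again and read the stored values back. The sketch below assumes the Lucene.Net 2.x IndexReader API (Open, NumDocs, MaxDoc, IsDeleted, Document) and the "title" field created above; the IndexCheck class and its PrintIndexSummary() method are hypothetical names, not part of the original program.

```csharp
using System;
using Lucene.Net.Index;

// Hypothetical helper class for a quick sanity check of the index.
internal static class IndexCheck
{
    // Prints the document count and every stored title, and returns the count.
    internal static int PrintIndexSummary(string indexPath)
    {
        // assumption: IndexReader.Open accepts the index directory path, as in Lucene.Net 2.x
        IndexReader reader = IndexReader.Open(indexPath);

        try
        {
            int count = reader.NumDocs();
            Console.WriteLine("Documents in index: " + count);

            for (int i = 0; i < reader.MaxDoc(); i++)
            {
                // skip slots of deleted documents (there should be none in a freshly built index)
                if (reader.IsDeleted(i))
                {
                    continue;
                }

                // Get() returns the stored value of the field, or null if it was not stored
                Console.WriteLine(reader.Document(i).Get("title"));
            }

            return count;
        }
        finally
        {
            // always release the reader
            reader.Close();
        }
    }
}
```

You could call IndexCheck.PrintIndexSummary(indexPath) at the end of Main(), after IndexBooks() has finished. This only reads back stored fields; it does not run a search.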

That is it for this article. You can download the full source code here. In the next article I will show you how you can query the index.
