Dev @ Work

A day in the life of a developer

Lucene.NET articles

March 14th, 2009 by

I just published the Lucene.NET articles I have been working on. I hope you like them.

Here is an overview of the articles I have written:

  1. Introduction to Lucene
  2. Indexing basics
  3. Search basics
  4. Alternatives (did you mean …)
  5. Faceted search / Drill down
  6. Class reference

Cheers!

Class Reference – Lucene.NET

March 14th, 2009 by

This article contains an overview of the most important classes in Lucene.

Documents (Lucene.Net.Documents.Document)

Documents are the central entity in Lucene: they are what gets stored in the index. A document can represent any kind of information you want, for example database records, emails, HTML pages or Word documents. A document doesn’t have any required attributes; it is basically just a list of [fields]. Optionally, you can set a boost for a document.

Fields (Lucene.Net.Documents.Field)

Fields are used to describe a [document]. A field is basically a key/value pair containing the name of the field and its value. A field can behave in different ways: there are Keyword fields, UnIndexed fields, UnStored fields and Text fields. You can use the static factory methods listed below to instantiate a field.

Field types

Field method/type                | Analyzed | Indexed | Stored | Usage
Field.Keyword(string, string)    |          | x       | x      | URLs, nicknames, social security numbers
Field.Keyword(string, DateTime)  |          | x       | x      |
Field.UnIndexed(string, string)  |          |         | x      | Document type, when not used for search criteria
Field.UnStored(string, string)   | x        | x       |        | Document titles and content
Field.Text(string, string)       | x        | x       | x      | Document titles and content
Field.Text(string, TextReader)   | x        | x       |        | Document titles and content
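
The factory methods in the table come from the older Lucene API. The indexing article in this series builds its fields with the Field constructor and the Field.Store / Field.Index flags instead. As a rough sketch, assuming Lucene.NET 2.x is referenced, the four string-based behaviors from the table could be expressed as follows. The FieldExamples class and its method names are made up for this illustration and are not part of Lucene:

```csharp
using Lucene.Net.Documents;

public static class FieldExamples
{
    // "Keyword": stored and indexed as one single term, not analyzed
    public static Field Keyword(string name, string value)
    {
        return new Field(name, value, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO);
    }

    // "UnIndexed": stored only, cannot be searched on
    public static Field UnIndexed(string name, string value)
    {
        return new Field(name, value, Field.Store.YES, Field.Index.NO, Field.TermVector.NO);
    }

    // "UnStored": analyzed and indexed, but the original value is not kept
    public static Field UnStored(string name, string value)
    {
        return new Field(name, value, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO);
    }

    // "Text": analyzed, indexed and stored
    public static Field Text(string name, string value)
    {
        return new Field(name, value, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
    }
}
```

The three flags map directly onto the Stored, Indexed and Analyzed columns of the table, which makes it easy to pick the right combination per field.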

IndexWriter (Lucene.Net.Index.IndexWriter)

The index writer is responsible for writing [documents] to an index. The index can be stored either in a directory on disk (Lucene.Net.Store.FSDirectory) or in memory (Lucene.Net.Store.RAMDirectory). The writer uses an [analyzer] to break up text before a [document] is stored in the index. You can either create a new index or change an existing one.

Analyzer (Lucene.Net.Analysis.Analyzer)

This class, and its derivatives, is responsible for breaking text down into single words or terms and for processing them. For example, an analyzer can remove commonly used words such as ‘the’, ‘and’ and ‘a’, reduce words to their base form (walked -> walk) or even add synonyms.
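
To make the idea concrete, here is a toy analyzer in plain C#. It only illustrates tokenizing, lower-casing and stop-word removal; it is not how Lucene’s own analyzers are implemented, and the ToyAnalyzer class and its stop-word list are made up for this sketch:

```csharp
using System;
using System.Collections.Generic;

public static class ToyAnalyzer
{
    // Hypothetical stop-word list for this sketch; compared case-insensitively
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "and", "a" };

    // Characters that separate one token from the next
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')' };

    // Splits the text on whitespace and common punctuation, drops the stop words
    // and returns the remaining tokens in lower case.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        foreach (string token in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!StopWords.Contains(token))
            {
                terms.Add(token.ToLowerInvariant());
            }
        }
        return terms;
    }
}
```

Calling ToyAnalyzer.Analyze("The quick fox and a dog") yields the terms quick, fox and dog. A real analyzer does the same kind of work, just with far more care for languages, numbers and special tokens.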

Term (Lucene.Net.Index.Term)

A term is a key/value pair on which you want to search, where the key is the name of the [field] and the value is the text to search for in that field.

IndexSearcher (Lucene.Net.Search.IndexSearcher)

The index searcher performs the actual search: it opens the index, searches through it using the search [query] and returns the [hits] matching the [query].

Query (Lucene.Net.Search.Query)

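A query describes the results you want to get from a search. Query is the base class; concrete queries such as TermQuery are either built by hand or produced by a query parser such as MultiFieldQueryParser.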
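
As a small sketch of how terms and queries fit together, assuming Lucene.NET 2.x is referenced: the snippet below builds two TermQuery objects and combines them in a BooleanQuery. The QueryExamples class, the field names and the way they are combined are assumptions for this illustration (the field names match the books index used in the other articles of this series):

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class QueryExamples
{
    // Builds a query for documents whose untokenized "genre" field equals the given
    // genre exactly and which preferably contain the given word in their title.
    public static BooleanQuery GenreWithTitleWord(string genre, string titleWord)
    {
        var query = new BooleanQuery();

        // MUST: a document has to match this term to be a hit
        query.Add(new TermQuery(new Term("genre", genre)), BooleanClause.Occur.MUST);

        // SHOULD: matching is optional here but raises the score; the terms of an
        // analyzed field are lower-cased by StandardAnalyzer, so lower-case the word too
        query.Add(new TermQuery(new Term("title", titleWord.ToLowerInvariant())), BooleanClause.Occur.SHOULD);

        return query;
    }
}
```

Note that a TermQuery is not analyzed: the term has to match exactly what ended up in the index, which is why the title word is lower-cased by hand.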

Hits (Lucene.Net.Search.Hits)

A list of all the documents returned from a search operation performed by the [index searcher].

That is it for this article. I hope you learned how Lucene works and that you understand the basic concepts. If you have additional information, comments or questions, please don’t hesitate to respond.

In my next article I will show you how to implement a basic index application, which can index a bunch of plain text files from disk. No rocket science, but that is not the point.

Introduction to Lucene.NET – Lucene.NET

March 14th, 2009 by

This is my introduction to Lucene.NET as well as my first article on the subject. I will explain what it is and what you can do with it. First of all, let me unveil what Lucene.NET is:

Lucene.NET is an information retrieval library which allows you to search through content in the broadest sense of the word: if you want to index your mother-in-law, you can! Well, you have to find a way to digitize her first and I won’t cover that problem here. My point is that Lucene is agnostic to the format of the information.

Because Lucene.NET is just a library, meaning that it doesn’t do anything out of the box, it is used in a wide variety of products, ranging from Google-style web search engines and site-specific search engines to desktop applications, shrink-wrapped software and even T9-like applications for cell phones. The library is both powerful and scalable, meaning that it can be used to search through millions of documents and still be lightning fast. The origin of Lucene.NET is Lucene; in fact, it is a direct port from Java (Lucene) to C# (Lucene.NET) and the index formats are binary compatible. There are also ports available to other platforms like C++ and Python.

Faceted Search and Drill-Down – Lucene.NET

March 14th, 2009 by

In this article we will build faceted search and drill-down functionality into our application. Implementing faceted search is actually quite simple once you know the concepts. I will start with some theory behind the implementation and then I will show you the actual code. This article builds on my previous Lucene articles and I assume you have read the ones on indexing, searching and alternatives. If you didn’t read those, you can download the starting point source code here.

The theory

The most important thing to keep in mind is that every document in the Lucene index sits at a fixed position within the index, its document number. When you run a query through a filter, you basically get a BitSet (Java) or BitArray (.NET) back, with one bit per document in the index. Every bit that is turned on (1, true) represents a hit; the location of the bit in the BitArray corresponds with the location of the document in the index. For example: 00010100, read from right to left, means that the third and fifth documents are hits for the executed search query.

We can use this knowledge to compare two result sets and get their common results using a bitwise AND operation. For example: 00010100 AND 00000100 gives 00000100, which means that only the third document is present in both result sets. When we count the number of bits turned on, we know how many documents are present in both result sets.

Example:

We have a facet called content type and it contains the values: ‘news’, ‘articles’ and ‘vacancies’. We initially perform a search operation on each of those values and store their BitArray for later use.

Next we perform a search on a user given term, for example ‘lucene’ and we retrieve the BitArray for this result set.

Now we can get the number of news items in the result set by applying a bitwise AND operation on the BitArrays of the lucene term result set and the news facet value result set. Once we calculate the cardinality of the resulting BitArray we know how many documents in the lucene term result set are actually news documents.
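
The example can be tried out without Lucene, using nothing but System.Collections.BitArray. The following is a minimal sketch; the FacetCounter class is made up for this illustration, and it assumes that all BitArrays have the same length (one bit per document in the index), because BitArray.And throws an exception otherwise:

```csharp
using System.Collections;
using System.Collections.Generic;

public static class FacetCounter
{
    // Returns, for every facet value, how many documents of the search result
    // carry that facet value.
    public static Dictionary<string, int> Count(BitArray searchBits, IDictionary<string, BitArray> facetBits)
    {
        var counts = new Dictionary<string, int>();
        foreach (KeyValuePair<string, BitArray> facet in facetBits)
        {
            // BitArray.And modifies the instance it is called on, so AND into a copy
            // to keep the search result intact for the next facet value
            BitArray common = new BitArray(searchBits).And(facet.Value);

            // count the bits that are turned on in the combined result
            int count = 0;
            foreach (bool bit in common)
            {
                if (bit) count++;
            }
            counts[facet.Key] = count;
        }
        return counts;
    }
}
```

The facet BitArrays for ‘news’, ‘articles’ and ‘vacancies’ only have to be calculated once; after that, every user search costs one AND and one count per facet value.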

This covers the theory behind faceted search in Lucene. Let’s get down and dirty with some code.

The code

I assume that you went through the previous articles and have the indexer and search interface working. Optionally, you can download the starting point source code here.

First, let’s add a little helper function, GetCardinality(), which calculates the cardinality of a BitArray. I am not going to explain this function in depth; see my blog post on calculating the cardinality of a BitArray for more details.

Now let’s implement the magic:

private static void FacetedSearch(string indexPath, string genre, string term)
{
    // create searcher
    var searcher = new IndexSearcher(indexPath);

    // first get the BitArray result from the genre query
    var genreQuery = new TermQuery(new Term("genre", genre));
    var genreQueryFilter = new QueryFilter(genreQuery);
    BitArray genreBitArray = genreQueryFilter.Bits(searcher.GetIndexReader());
    Console.WriteLine("There are " + GetCardinality(genreBitArray) + " documents with the genre " + genre);

    // next perform a regular search and get its BitArray result
    Query searchQuery = MultiFieldQueryParser.Parse(term, new[] {"title", "description"}, new[] {BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD}, new StandardAnalyzer());
    var searchQueryFilter = new QueryFilter(searchQuery);
    BitArray searchBitArray = searchQueryFilter.Bits(searcher.GetIndexReader());
    Console.WriteLine("There are " + GetCardinality(searchBitArray) + " documents containing the term " + term);

    // now do the faceted search magic: combine the two bit arrays using a bitwise AND operation;
    // BitArray.And modifies the instance it is called on, so AND into a copy to keep searchBitArray intact
    BitArray combinedResults = new BitArray(searchBitArray).And(genreBitArray);
    Console.WriteLine("There are " + GetCardinality(combinedResults) + " documents containing the term " + term + " and which are in the genre " + genre);

    // close the searcher
    searcher.Close();
}

First we create the index searcher. We have done this plenty of times, so no news here.

Next we get the bit array of a term query on the genre term. This bit array contains one bit per document, each indicating whether that document is selected by the query or not.

Then we perform a regular search; I covered this kind of code in my previous article. However, instead of iterating through the result we retrieve the corresponding bit array.

Now that we have both bit arrays we can determine which documents are in both results by combining them using a bitwise AND operation. We can use the cardinality of the resulting bit array to offer the user a possible drill-down.

That is it for this article. You can download the full source code here.

Alternatives, did you mean… – Lucene.NET

March 14th, 2009 by

In this article I will explain how you can implement the auto-correct feature, commonly known as ‘did you mean …’. Google does it, so why wouldn’t you provide the same functionality for your users? With Lucene you can easily build this functionality into your own applications; in the next few sections I will show you how.

I assume you have read my previous articles and that you have the index and search application up and running. If not, you can download the source code of our starting point here.

Indexing Basics – Lucene.NET

March 14th, 2009 by

In this article I will show you how to implement a basic Lucene indexer which loops through the XML nodes in an XML document. In this example I will use the Books.xml from Microsoft. I assume you have created a new Console project and that you have referenced the Lucene library. I use version 2.0.4 in this example. Let’s look at the code:

First reference some namespaces:

using System;
using System.Globalization;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

Nothing exciting going on here.

Add the following code to Program.Main():

private static void Main(string[] args)
{
    // create a directory to store the index in
    string indexPath = @"c:\LuceneSampleCatalog";
    Directory.CreateDirectory(indexPath);

    // index the books
    IndexBooks(indexPath);
}

Again, nothing exciting here: we just make sure the index directory exists and call the IndexBooks() function.

And here is the IndexBooks() function:

private static void IndexBooks(string indexPath)
{
    DateTime startIndexing = DateTime.Now;
    Console.WriteLine("start indexing at: " + startIndexing);

    // read in the books xml
    var booksXml = new XmlDocument();
    booksXml.Load("books.xml");

    // create the indexer with a standard analyzer
    var indexWriter = new IndexWriter(indexPath, new StandardAnalyzer(), true);
    int documentCount = 0;

    try
    {
        // loop through all the books in the books.xml
        foreach (XPathNavigator book in booksXml.CreateNavigator().Select("//book"))
        {
            // create a Lucene document for this book
            var bookDocument = new Document();

            // add the ID as stored but not indexed field, not used to query on
            bookDocument.Add(new Field("id", book.GetAttribute("id", string.Empty), Field.Store.YES, Field.Index.NO, Field.TermVector.NO));

            // add the author and genre as stored and un tokenized fields, the value is stored as is
            bookDocument.Add(new Field("author", book.SelectSingleNode("author").Value, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));
            bookDocument.Add(new Field("genre", book.SelectSingleNode("genre").Value, Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

            // add the title and description as stored and tokenized fields, the analyzer processes the content
            bookDocument.Add(new Field("title", book.SelectSingleNode("title").Value, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
            bookDocument.Add(new Field("description", book.SelectSingleNode("description").Value, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));

            // add the publication date as stored and un tokenized field, note the special date handling
            DateTime publicationDate = DateTime.Parse(book.SelectSingleNode("publish_date").Value, CultureInfo.InvariantCulture);
            bookDocument.Add(new Field("publicationDate", DateField.DateToString(publicationDate), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

            // add the document to the index
            indexWriter.AddDocument(bookDocument);
        }

        // make lucene fast
        indexWriter.Optimize();

        // remember the number of documents while the writer is still open
        documentCount = indexWriter.DocCount();
    }
    finally
    {
        // close the index writer
        indexWriter.Close();
    }

    DateTime endIndexing = DateTime.Now;
    Console.WriteLine("end indexing at: " + endIndexing);

    // TotalSeconds is the whole duration; Seconds is only the seconds component (0-59)
    Console.WriteLine("Duration: " + (endIndexing - startIndexing).TotalSeconds + " seconds");
    Console.WriteLine("Number of indexed documents: " + documentCount);
}

In this function all the magic happens. First we load the XML file that we are going to index. Next we open the index writer, the class responsible for writing the Lucene index. Then we loop through each book element in the books XML, create a document from it and store that document in the index.

Last, but certainly not least, we optimize the index. This merges the segments of the index into a single one, which speeds up searching.

That is it for this article. You can download the full source code here. In the next article I will show you how you can query the index.

Search Basics – Lucene.NET

March 14th, 2009 by

In this article I will show you how to implement a basic Lucene searcher which searches through the index created in the previous article. I assume you have read my previous article and that you have the index application up and running. If not, you can download the source code of our starting point here.

First reference some additional namespaces:

using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

Add the following code to Program.Main(), under the call to IndexBooks():

// search the created index
Search(indexPath, "Sequel");

We pass the path to the index we want to search and the term we want to search on.

And here is the Search() function:

private static void Search(string indexPath, string term)
{
    // create searcher
    var searcher = new IndexSearcher(indexPath);

    // create a query which searches through the title and description, the term can be in the title or the description
    Query searchQuery = MultiFieldQueryParser.Parse(term, new[] {"title", "description"}, new[] {BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD}, new StandardAnalyzer());

    // perform the search
    Hits hits = searcher.Search(searchQuery);

    // loop through all the hits and show their title
    for (int hitIndex = 0; hitIndex < hits.Length(); hitIndex++)
    {
        // get the corresponding document
        Document hitDocument = hits.Doc(hitIndex);

        // write its title to the console
        Console.WriteLine(hitDocument.GetField("title").StringValue());
    }

    // close the searcher
    searcher.Close();
}

This function does the heavy lifting in our search implementation. First we create an index searcher using the specified index path. Then we create a query which searches through the title or description field in the index using the specified term. With the call to searcher.Search() we actually perform the search. Next we loop through all the hits returned by the searcher and write their title to the console.

That is it for this article. You can download the full source code here. In the next article I will show you how you can implement alternatives.

Lucene.NET

March 14th, 2009 by

Welcome to my Lucene.NET article page. On this page you can find various articles related to Lucene.NET. Let me first explain what Lucene.NET is: it is a .NET port of the popular Lucene library of the Apache Software Foundation. Lucene is a high-performance information retrieval library, also known as a search library.

Lucene is not a complete, out-of-the-box search application; it is an API. The advantage is that you can index and search everything you want, but don’t expect Lucene to index HTML or PDF files by itself: you have to write your own indexer.

Lucene is scalable: it is used by some of the busiest websites, such as Wikipedia, CNET, CodeCrawler and many more. Those are examples of websites, but the library is not limited to web applications; it can be used in any kind of application.

In this series of articles I want to show you how to build a Lucene implementation on the .NET framework using C#. I hope you like the articles I have written. If you have any questions, comments or suggestions, please let me know. Enjoy!

Table of Contents

  1. Introduction to Lucene
  2. Indexing basics
  3. Search basics
  4. Alternatives (did you mean …)
  5. Faceted search / Drill down
  6. Class reference

Click here for all articles and posts tagged with Lucene.NET.

BitArray Calculating Cardinality

February 20th, 2009 by

For my faceted search Lucene.NET implementation I needed to get the cardinality of a BitArray. The cardinality is the number of bits that are turned on. I found a VB snippet and converted it to C#.

Here is my implementation:

private static readonly byte[] _bitsSetArray256 = new byte[] { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 };

public static int GetCardinality( BitArray bitArray )
{
    // get the internal array of the BitArray through reflection; this relies on the private field being named m_array
    var array = (uint[]) bitArray.GetType().GetField( "m_array", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance ).GetValue( bitArray );
    int count = 0;

    // look up the number of set bits for each of the four bytes of every 32-bit element
    for ( int index = 0; index < array.Length; index ++ )
    {
        count += _bitsSetArray256[ array[ index ] & 0xFF ] + _bitsSetArray256[ ( array[ index ] >> 8 ) & 0xFF ] + _bitsSetArray256[ ( array[ index ] >> 16 ) & 0xFF ] + _bitsSetArray256[ ( array[ index ] >> 24 ) & 0xFF ];
    }

    return count;
}

The reflection access to the private field of BitArray isn’t exactly fast, but I will optimize it later and post the updated snippet right here.
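
One possible way to get rid of the reflection is the public BitArray.CopyTo method, which can copy the bits into an int[]. The following is only a sketch of that idea, not the optimized snippet promised above; I have not measured it, and the BitArrayCardinality class name is made up for this illustration. It counts the bits with a small loop instead of the lookup table:

```csharp
using System.Collections;

public static class BitArrayCardinality
{
    // Counts the bits that are turned on, using only public members of BitArray.
    public static int GetCardinality(BitArray bitArray)
    {
        // copy the bits into 32-bit words through the public CopyTo method
        var words = new int[(bitArray.Length + 31) / 32];
        bitArray.CopyTo(words, 0);

        // clear the unused bits of the last word, they are not part of the BitArray
        int extraBits = bitArray.Length % 32;
        if (extraBits != 0)
        {
            words[words.Length - 1] &= (1 << extraBits) - 1;
        }

        int count = 0;
        foreach (int word in words)
        {
            // every iteration clears the lowest bit that is turned on
            uint remaining = unchecked((uint) word);
            while (remaining != 0)
            {
                remaining &= remaining - 1;
                count++;
            }
        }
        return count;
    }
}
```

The signature is the same as the reflection-based version, so it can be used as a drop-in replacement. The explicit masking of the last word is a precaution for BitArrays whose length is not a multiple of 32.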

I hope this implementation is as useful for you as it is for me.

Lucene.NET and Facetted Search

February 20th, 2009 by

For my work at Liones I had to implement a Lucene.NET search solution, including drill-down/faceted search. This was pretty hard to accomplish because there wasn’t much reference material, at least not for C#, but I just finished the first implementation and it works like a charm. I am not going to discuss the implementation in detail here because I decided to write some articles about it.

For now, I am thinking about the following articles:

  1. Introduction to Lucene
  2. Indexing basics
  3. Search basics
  4. Alternatives (did you mean …)
  5. Faceted search / Drill down
  6. Class reference

Let me know if you have any more ideas for articles.