Faceted Search and Drill-Down – Lucene.NET
March 14th, 2009 by TrilobyteIn this article we will build faceted search and drill-down functionality into our application. Implementing faceted search is actually quiet simple, once you know the concepts. I will start with some theory behind the implementation and then I will show you the actual code. This article elaborates on my previous Lucene articles and I assume you have read indexing, searching and alternatives. If you didn’t read those you can download the starting point source code here.
The theory
The most important thing you should keep in mind is that the every document in the Lucene index is at a fixed position, called the index. When performing a search, you basically get a BitVector(Java) or BitArray(.NET) back from the IndexReader. Every bit that is turned on ( 1, true ) represents a hit, the location of the bit in the BitArray corresponds with the location of a document 3. For example: 00010100 means that the third and fifth documents are hits for the executed search query.
We can use these knowledge to compare two result sets to each other and get the common results from them using a bit wise AND operation. For example: 00010100 AND 00000100 means that only the third document is present in both result sets. When we calculate the number of bits turned on, we know how many documents are present in both result sets.
Example:
We have a facet called content type and it contains the values: ‘news’, ‘articles’ and ‘vacancies’. We initially perform a search operation on each of those values and store their BitArray for later use.
Next we perform a search on a user given term, for example ‘lucene’ and we retrieve the BitArray for this result set.
Now we can get the number of news items in the result set by applying a bit wise AND operation on the BitArrays of the lucene term result set and news facet value result set. Once we calculate the cardinality of the resulting BitArray we now how many documents in the lucene term result set are actually news documents.
This covers the theory behind the faceted search in Lucene, lets get down and dirty with some code.
The code
I assume that you went through the previous articles and have the indexer and search interface working. Optionaly, you can download the starting point source code here.
First lets add a little helper function which can calculate the cardinality of a BitArray. I am not going to explain this function in depth, see my blog post for more details.
Now lets implement the magic:
private static void FacetedSearch(string indexPath, string genre, string term)
{
// create searcher
var searcher = new IndexSearcher(indexPath);
// first get the BitArray result from the genre query
var genreQuery = new TermQuery(new Term("genre", genre));
var genreQueryFilter = new QueryFilter(genreQuery);
BitArray genreBitArray = genreQueryFilter.Bits(searcher.GetIndexReader());
Console.WriteLine("There are " + GetCardinality(genreBitArray) + " document with the genre " + genre);
// Next perform a regular search and get its BitArray result
Query searchQuery = MultiFieldQueryParser.Parse(term, new[] {"title", "description"}, new[] {BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD}, new StandardAnalyzer());
var searchQueryFilter = new QueryFilter(searchQuery);
BitArray searchBitArray = searchQueryFilter.Bits(searcher.GetIndexReader());
Console.WriteLine("There are " + GetCardinality(searchBitArray) + " document containing the term " + term);
// Now do the faceted search magic, combine the two bit arrays using a binary AND operation
BitArray combinedResults = searchBitArray.And(genreBitArray);
Console.WriteLine("There are " + GetCardinality(combinedResults) + " document containing the term " + term + " and which are in the genre " + genre);
}
First we open the index reader, we have done this plenty of times, no news here.
Next we get the bit array of a term query on the genre term. This bit array contains a list of bits each indicating whether a document is in selected by the query or not.
Then we perform a regular search, I covered this kind of code in my previous article. However, instead of iterating through the result we retrieve the corresponding bit array.
Now that we have both bit arrays we can determine which documents are in both results by combining them using a bitwise AND operation. We can use the cardinality of the resulting bit array to give use a possible drill down.
That is it for this article. You can download the full source code here.
11 Replies to “Faceted Search and Drill-Down – Lucene.NET”
August 7th, 2009 at 03:14
Out of interest, how is using the bitwise operations better than simply doing a search and getting back the hits and then calling hits.Length()?
August 9th, 2009 at 21:58
Good question, well you could do that but that quickly comes an expensive solutions when you have multiple facets. You have to run the query for each facet value. To optimize the performance you would rather store the bit vectors and use an binary AND operation to get the document count for each facet.
August 10th, 2009 at 02:51
Thanks for the reply. I agree that for the purposes of optimising performance you could store the bit vectors and do the faceted search that way. I’m at the beginning of a new project that will use Lucene and want to avoid premature optimisation. It would also mean I would have to do all possible faceted searches when [i]updating[/i] data, then persist the resultant bit vectors, which is not ideal (for me).
BTW, great set of articles on Lucene.Net! Very well written and the sample projects are great.
August 10th, 2009 at 09:18
Hi Frank,
You are right about the premature optimization but keep in mind that performance optimization takes a lot of time.
Thanks for the compliment!
October 12th, 2009 at 17:50
Hi,
thanks for this interesting and helpful tutorial. I have learned a lot.
Thank you very much!!!
jan
October 12th, 2009 at 17:55
Hello Jan,
Thanks for the compliment and you are welcome!
Best regards,
Bert
November 16th, 2009 at 00:16
Hi Bert,
As Jan said the tutorial is very very helpful, however I am confused about how the GetCardinality method actually works.
Mostly I’m interested in what you are doing on the first line of the method and the count += statement within the for loop.
Could you explain it a little more because for me at the moment this seems like a bit of black magic that can just pluck the correct number out of the air.
Many thanks
John
November 18th, 2009 at 06:14
Excellent! This is such a simple and fast method for facetted search with Lucene. With your help I got a proof of concept within my own project up and running within the day. Really love the And comparison with the BitArrays .. neat.
Like John, would love an explanation of the basics GetCardinality() method, particularly the theory behind the values in _bitsSetArray256.
Thanks mate!
November 18th, 2009 at 11:05
I’ve posted a question on stackoverflow and got a very good reply, here is the link.
http://stackoverflow.com/questions/1754560/can-someone-explain-to-me-what-this-getcardinality-method-is-doing
March 10th, 2010 at 04:07
Hello, very good article – is there any way to get the distinct values out from hits?
July 21st, 2010 at 15:26
Hi
Just wanted to leave a thank you for the great series of tutorials and sample code, very much appreciated
Regards
Kieran