I've recently posted:
- an introduction to apt-xapian-index;
- an example of how to query it;
- a way to add simple result filters to the query;
- a way to suggest keywords and tags to use to improve the query;
- a way to search for similar packages.
Note that I've rewritten all the old posts to only show the main code snippets: if you were put off by the large lumps of code, you may want to give it another go.
Today I'll show how to implement an adaptive cutoff to get rid of the worst results.
Recall that Xapian returns results in decreasing order of quality, and as we pull more and more results out, we reach a point where the matches are so approximate that they look random.
This can be a problem if we want to change the order of the results: for example, we may want to sort by package size, or by popcon popularity. In such cases a really bad match could end up at the top of the results.
In most cases, you just want to say "discard all results whose quality is less than 70%". But sometimes you have queries that OR lots of terms, and even your top result, while still being a very good result, may fall below the cutoff you chose.
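To see why a fixed threshold can backfire, here is a small sketch in plain Python, with no Xapian involved. The percentages are made up for illustration: they stand for a query that ORs many terms, where even the best match only scores 55%.

```python
# Made-up match percentages for a query that ORs many terms,
# already sorted by decreasing quality, as Xapian would return them.
percents = [55, 52, 48, 41, 30, 12, 9]

# Fixed cutoff: discard everything below 70%.
fixed = [p for p in percents if p >= 70]

# Adaptive cutoff: discard everything below 70% of the top result.
threshold = percents[0] * 0.7
adaptive = [p for p in percents if p >= threshold]

print(fixed)     # nothing survives the fixed cutoff
print(adaptive)  # the four best matches survive the adaptive one
```

With these numbers the fixed cutoff throws away every result, including the good ones, while the adaptive cutoff keeps the top group and still drops the tail.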
Implementing an adaptive cutoff is extremely simple: first, you get the quality estimate of the top result:
# Retrieve the first result, and check its relevance
matches = enquire.get_mset(0, 1)
topWeight = matches[0].weight
Then you tell Xapian that you want a cutoff value that is, for example, 70% of that:
# Tell Xapian that we only want results that are at least 70% as good as that
enquire.set_cutoff(0, topWeight * 0.7)
Finally, you repeat the query. If you want, you can ask for bigger result sets: with the cutoff in place, even a large result set is very likely to contain only good results:
matches = enquire.get_mset(0, 200)
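The three steps can be wrapped into one helper. What follows is only a sketch, not the code from the full example: the function name adaptive_mset and the parameters ratio and max_results are made up for illustration, and enquire is assumed to be a xapian.Enquire whose query has already been set. It also adds two things the snippets above leave out: a check for queries that match nothing, and a reset of any cutoff left over from a previous call.

```python
def adaptive_mset(enquire, ratio=0.7, max_results=200):
    """Return results that are at least `ratio` as good as the top one.

    `enquire` is assumed to be a xapian.Enquire with its query already set.
    """
    # Clear any cutoff left from an earlier call, so the probe below
    # sees the real top result for the current query
    enquire.set_cutoff(0, 0.0)

    # Retrieve only the first result, to learn the best weight
    top = enquire.get_mset(0, 1)
    if len(top) == 0:
        # Nothing matched: there is no top weight to scale from
        return top

    # Percent cutoff stays at 0 (disabled); the weight cutoff is
    # a fraction of the top result's weight
    enquire.set_cutoff(0, top[0].weight * ratio)

    # Repeat the query: the cutoff now drops the poor matches
    return enquire.get_mset(0, max_results)
```

Calling adaptive_mset(enquire) then gives the same matches as the three snippets above. The cutoff stays set on the enquire object afterwards, which is why the helper resets it at the start.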
This is it.
You can use the wsvn interface to get to the full source code and the module it uses.
Next in the series: a smart way of querying tags.