apt-xapian-index: searching for similar packages

Today I'll show how to abuse Xapian to show a list of packages similar to a given one.

This time I'll just link to the code in wsvn, and show only the most important bits in the blog.

So, we have a package name, and we want to show which packages are similar to it.

To do it, we simply build a big OR query with all the terms indexed for that package: Xapian will show us the packages whose terms are most similar, and that does the trick.

This works because Xapian gives us the best results first: even if no package except the given one matches every term, we still get the nearest matches first.
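To see why a big OR query behaves like a similarity search, here is a toy sketch in plain Python. It is not how Xapian scores documents (its weighting is more refined than a simple count), and the function name `rank_by_shared_terms` and the hand-made index are assumptions made up for illustration. It only shows the principle: documents that share more terms with the query come first.

```python
def rank_by_shared_terms(query_terms, index):
    """Rank documents by how many of the query terms they contain.

    `index` maps a document name to its set of terms. Documents sharing
    no term with the query are left out, as with an OR query.
    """
    query_terms = set(query_terms)
    scored = []
    for name, terms in index.items():
        shared = len(query_terms & set(terms))
        if shared > 0:
            scored.append((shared, name))
    # Highest overlap first; ties are broken alphabetically for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical index: these terms are invented, not taken from the real index
toy_index = {
    "debtags": {"tag", "package", "search", "debian"},
    "debtags-edit": {"tag", "package", "debian", "gui"},
    "tagcoll": {"tag", "collection"},
    "gimp": {"image", "editor"},
}

ranking = rank_by_shared_terms(toy_index["debtags"], toy_index)
print(ranking)  # ['debtags', 'debtags-edit', 'tagcoll']
```

The queried package itself comes out on top, since it is the only one matching every term; the others follow in order of overlap.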

In order to get the list of indexed terms given a package name we need to do two things:

  1. Get the Xapian document for the package.
  2. Get the termlist of the document.

To get the Xapian document we search for a term that only that document can have. In the index, the package name is indexed with the special prefix "XP", so we can search for that:

def docForPackage(pkgname):
    "Get the document corresponding to the package with the given name"
    # Query the term with the package name
    query = xapian.Query("XP"+pkgname)
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    # Get the top result only
    matches = enquire.get_mset(0, 1)
    if matches.size() == 0:
        return None
    # Return the document of the only match
    return matches[0].document

Then we build the big term list, by iterating the termlist of the document:

# Build a term list with all the terms in the given package
terms = []
# Get the document corresponding to the package name
doc = docForPackage(pkgname)
if doc is None:
    raise SystemExit("Package %s not found in the index" % pkgname)
# Retrieve all the terms in the document, except the XP package name term
for t in doc.termlist():
    if t.term[:2] != 'XP':
        terms.append(t.term)

Note that it's trivial to fetch terms from more than one document, if you want to query "all packages a bit like this one and a bit like that one", although that's a less useful feature.
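As a rough sketch of the multi-package variant, the loop can be wrapped in a function that takes several package names. The name `terms_for_packages` and the `doc_for_package` parameter are hypothetical; the parameter stands for any lookup function, such as `docForPackage`. Dropping duplicate terms is a choice made here, not something the original script is known to do.

```python
def terms_for_packages(pkgnames, doc_for_package):
    """Collect the indexed terms of several packages, without duplicates.

    `doc_for_package` is any callable returning a Xapian document
    (or None if the package is unknown) for a package name.
    """
    terms = []
    seen = set()
    for pkgname in pkgnames:
        doc = doc_for_package(pkgname)
        if doc is None:
            # Unknown package: skip it and go on with the others
            continue
        for t in doc.termlist():
            # Skip the XP terms, which hold the package names.
            # Terms may be str or bytes, depending on the bindings in use.
            if t.term[:2] in ('XP', b'XP') or t.term in seen:
                continue
            seen.add(t.term)
            terms.append(t.term)
    return terms
```

With more than one input package, the AND NOT side of the final query would need one "XP" term per package, so that none of the inputs shows up in the results.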

Lastly, we build the final query:

# Build the big OR query
query = xapian.Query(xapian.Query.OP_AND_NOT,
            # Terms we want
            xapian.Query(xapian.Query.OP_OR, terms),
            # AND NOT the input package
            xapian.Query("XP"+pkgname))

I add an AND_NOT part with the input package name, so that the package we asked for does not show up in the output.

This is it:

$ ./axi-query-similar.py debtags
20309 results found.
Results 1-20:
33% debtags-edit - GUI browser and editor for Debian Package Tags
27% tagcolledit - GUI editor for tagged collections
25% libtagcoll2-dev - Functions used to manipulate tagged collections (development version)
24% tagcoll - Commandline tool to perform operations on tagged collections
19% packagesearch - GUI for searching packages and viewing package information
18% doodle - Desktop Search Engine (client)
18% doodled - Desktop Search Engine (daemon)
18% libept0 - High-level library for managing Debian package information
18% upgrade-system - system upgrader from Konflux
18% libept-dev - High-level library for managing Debian package information
17% ept-cache - Commandline tool to search the package archive
16% tracker-utils - metadata database, indexer and search tool - commandline tools
[...]
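A listing like the one above could be produced along these lines. This is only a sketch: it assumes that each document's data holds the package name, and `summary_for` is a hypothetical callable that returns the short description of a package (the index lookup itself does not provide it here). The `mset` argument is what `enquire.get_mset(0, 20)` returns.

```python
def format_results(mset, summary_for):
    """Format an MSet as a list of 'NN% name - summary' lines."""
    lines = ["%i results found." % mset.get_matches_estimated()]
    lines.append("Results 1-%i:" % mset.size())
    for m in mset:
        # Assumption: the document data is the package name
        name = m.document.get_data()
        if isinstance(name, bytes):
            # Some versions of the bindings return bytes rather than str
            name = name.decode("utf-8")
        lines.append("%i%% %s - %s" % (m.percent, name, summary_for(name)))
    return lines
```

The percentage is the match score that Xapian attaches to each result, relative to a perfect match on all the query terms.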

You can use the wsvn interface to get to the full source code and the module it uses.

Next in the series: adaptive quality cutoff.