apt-xapian-index: dynamically generated tag clouds

About apt-xapian-index, I have already posted:

Today I'll show how to create tag clouds. Not only that, but I'll show how to implement tag clouds that change as the user types a query.

This example uses python-gtk, and was written together with Matteo Zandi.

Generating a tag cloud out of any Xapian query is simple: it is just a matter of presenting, as a tag cloud, the information that you get with the technique shown in a smart way of querying tags. You get the tags related to the query, and you lay out their names with a font size that grows with their Xapian weight.

For the presentation, we can load pretty names from the Debtags vocabulary in /var/lib/debtags/vocabulary:

# Newer python-debian releases ship this module as "debian":
#   from debian import deb822
from debian_bundle import deb822

# Facet name -> Short description
facets = dict()
# Tag name -> Short description
tags = dict()
for p in deb822.Deb822.iter_paragraphs(open("/var/lib/debtags/vocabulary", "r")):
    if "Description" not in p: continue
    desc = p["Description"].split("\n", 1)[0]
    if "Tag" in p:
        tags[p["Tag"]] = desc
    elif "Facet" in p:
        facets[p["Facet"]] = desc
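To make the shape of the two dictionaries concrete, here is a minimal sketch that parses a small vocabulary-like sample using only the standard library. The sample text and the parse_vocabulary helper are made up for illustration; the real file is much larger and is better read with deb822 as above.

```python
# Illustrative sample in the same paragraph layout as the vocabulary file.
# The entries are invented for this example.
SAMPLE = """\
Facet: devel
Description: Software Development
 Longer text describing the facet.

Tag: devel::lang:python
Description: Python Development

Tag: devel::undescribed
"""

def parse_vocabulary(text):
    """Return (facets, tags), each mapping a name to its short description.

    Hypothetical helper, not part of python-debian or debtags.
    """
    facets, tags = {}, {}
    for block in text.split("\n\n"):
        fields = {}
        for line in block.splitlines():
            # Continuation lines start with whitespace; skipping them keeps
            # only the first line of each field, i.e. the short description.
            if not line or line[0].isspace() or ":" not in line:
                continue
            key, value = line.split(":", 1)
            fields[key] = value.strip()
        # Paragraphs without a description are skipped, as in the loop above
        if "Description" not in fields:
            continue
        if "Tag" in fields:
            tags[fields["Tag"]] = fields["Description"]
        elif "Facet" in fields:
            facets[fields["Facet"]] = fields["Description"]
    return facets, tags

sample_facets, sample_tags = parse_vocabulary(SAMPLE)
```

Note that the tag name itself contains colons, which is why each line is split only at the first one.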

The query then goes on as usual, and when we get the tags from the eset we also record their score and normalise it between 0 and 1. I found that computing the logarithm of scores helps to avoid having a tag cloud with a few huge tags and a lot of tiny tiny tags:

import math
import xapian

class Filter(xapian.ExpandDecider):
    # Only accept the terms that represent Debtags tags (prefixed with "XT")
    def __call__(self, term):
        return term[:2] == "XT"

def format(k):
    if k in tags:
        facet = k.split("::", 1)[0]
        if facet in facets:
            return "<i>%s: %s</i>" % (facets[facet], tags[k])
        else:
            return "<i>%s</i>" % tags[k]
    else:
        return k

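taglist = []
maxscore = None
# The eset is sorted by decreasing weight, so the first result has the top score
for res in enquire.get_eset(15, rset, Filter()):
    # Normalise the score in the interval [0, 1]
    # (this assumes weights greater than 1, so that the logarithm is positive)
    weight = math.log(res.weight)
    if maxscore is None: maxscore = weight
    # Strip the "XT" prefix to get the tag name
    tag = res.term[2:]
    taglist.append(
        (tag, format(tag), float(weight) / maxscore)
    )
# Sort by tag name, so that the cloud is laid out alphabetically
taglist.sort(key=lambda x:x[0])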
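To see what the logarithm buys, here is a small sketch that compares plain and logarithmic normalisation. The weights are made up, not taken from a real query, and normalise_linear and normalise_log are names chosen only for this example.

```python
import math

# Made-up expansion weights, highest first, in the order an eset yields them
weights = [120.0, 40.0, 12.0, 6.0, 3.0]

def normalise_linear(ws):
    # Divide every weight by the top one
    top = ws[0]
    return [w / top for w in ws]

def normalise_log(ws):
    # Same idea on the logarithms; assumes every weight is greater than 1
    top = math.log(ws[0])
    return [math.log(w) / top for w in ws]

linear = normalise_linear(weights)
logged = normalise_log(weights)
for w, a, b in zip(weights, linear, logged):
    print("%6.1f  linear=%.2f  log=%.2f" % (w, a, b))
```

With the plain division the lower weights collapse towards 0, while the logarithmic scores stay more evenly spread, so the smaller tags remain readable. A weight of exactly 1 gives a logarithm of 0 and anything below 1 goes negative, so this sketch only makes sense for weights above 1.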

Finally, you mark up a gtkhtml2.Document to display in a gtkhtml2 widget:

import gtkhtml2

def mark_text_up(result_list):
    # result_list holds (tag name, description, score in [0, 1]) tuples
    document = gtkhtml2.Document()
    document.clear()
    document.open_stream("text/html")
    document.write_stream("""<html><head>
<style type="text/css">
a { text-decoration: none; color: black; }
</style>
</head><body>""")
    for tag, desc, score in result_list:
        document.write_stream('<a href="%s" style="font-size: %d%%">%s</a> ' % (tag, score*150, desc))
    document.write_stream("</body></html>")
    document.close_stream()
    return document
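With score*150 the lowest-scoring tags can end up with very small fonts. One possible variation, sketched here without gtkhtml2 so that it can be tried on its own, maps the score onto a range with a lower bound. The font_size_percent and cloud_html helpers, the 60 to 150 range, and the example descriptions are all assumptions made for this illustration, not part of the original program.

```python
def font_size_percent(score, smallest=60, largest=150):
    """Map a score in [0, 1] to a CSS font-size percentage (hypothetical helper)."""
    # Clamp first, so that out-of-range scores cannot produce odd sizes
    score = min(max(score, 0.0), 1.0)
    return int(round(smallest + score * (largest - smallest)))

def cloud_html(result_list):
    """Build the anchors of the cloud from (tag, description, score) tuples."""
    parts = []
    for tag, desc, score in result_list:
        parts.append('<a href="%s" style="font-size: %d%%">%s</a>'
                     % (tag, font_size_percent(score), desc))
    return " ".join(parts)

# Invented entries in the same shape as taglist
example = [
    ("devel::lang:python", "<i>Python Development</i>", 1.0),
    ("role::program", "<i>Program</i>", 0.0),
]
html = cloud_html(example)
```

The resulting string could be passed to document.write_stream in place of the loop above.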

That's it, try it out.

You can use the git web interface to get to the full source code and the module it uses.