About apt-xapian-index, I have already posted:
- an introduction to apt-xapian-index;
- an example of how to query it;
- a way to add simple result filters to the query;
- a way to suggest keywords and tags to use to improve the query;
- a way to search for similar packages;
- a way to implement an adaptive cutoff on result quality;
- a smart way of querying tags;
- how to implement search as you type.
Today I'll show how to create tag clouds. Not only that, but I'll show how to implement tag clouds that change as the user types a query.
This example uses python-gtk, and has been created together with Matteo Zandi.
Generating a tag cloud out of any Xapian query is simple: it is just a matter of presenting as a tag cloud the information that you get with the technique shown in "a smart way of querying tags". You get the tags related to the query, and you lay out their names with a font size proportional to their Xapian score.
For the presentation, we can load pretty names from the Debtags vocabulary in /var/lib/debtags/vocabulary:
# Newer python-debian releases ship this module as debian.deb822
from debian_bundle import deb822

# Facet name -> Short description
facets = dict()
# Tag name -> Short description
tags = dict()
for p in deb822.Deb822.iter_paragraphs(open("/var/lib/debtags/vocabulary", "r")):
    if "Description" not in p: continue
    # Keep only the first line of the description (the short one)
    desc = p["Description"].split("\n", 1)[0]
    if "Tag" in p:
        tags[p["Tag"]] = desc
    elif "Facet" in p:
        facets[p["Facet"]] = desc
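If python-debian is not at hand, the same two dictionaries can be built with the standard library alone. This is only a sketch: it assumes the vocabulary is made of blank-line separated paragraphs of `Field: value` lines, with continuation lines starting with whitespace. The function name `parse_vocabulary` and the sample text are made up for illustration, and the code is written for Python 3.

```python
def parse_vocabulary(text):
    """Return (facets, tags), each mapping a name to its short description."""
    facets, tags = {}, {}
    for block in text.split("\n\n"):
        fields = {}
        for line in block.splitlines():
            # Continuation lines start with whitespace; only the first
            # line of each field (the short description) is needed here.
            if not line.strip() or line[0].isspace() or ":" not in line:
                continue
            name, value = line.split(":", 1)
            fields[name.strip()] = value.strip()
        if "Description" not in fields:
            continue
        if "Tag" in fields:
            tags[fields["Tag"]] = fields["Description"]
        elif "Facet" in fields:
            facets[fields["Facet"]] = fields["Description"]
    return facets, tags

# Made-up sample in the assumed format, not copied from the real vocabulary
SAMPLE = """\
Facet: role
Description: Role
 Role performed by the package

Tag: role::program
Description: Program
 Executable computer program.

Tag: role::nodesc
"""

facets, tags = parse_vocabulary(SAMPLE)
```

The last paragraph of the sample has no Description field, so it is skipped, just as in the deb822 loop.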
The query then goes on as usual, and when we get the tags from the eset we also record their score and normalise it between 0 and 1. I found that computing the logarithm of scores helps to avoid having a tag cloud with a few huge tags and a lot of tiny, tiny tags:
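import math
import xapian

class Filter(xapian.ExpandDecider):
    # Only accept Debtags tags, which are indexed with the "XT" prefix
    def __call__(self, term):
        return term[:2] == "XT"

def format(k):
    # Turn a tag name into "facet description: tag description" markup,
    # falling back to the raw name for tags missing from the vocabulary
    if k in tags:
        facet = k.split("::", 1)[0]
        if facet in facets:
            return "<i>%s: %s</i>" % (facets[facet], tags[k])
        else:
            return "<i>%s</i>" % tags[k]
    else:
        return k

taglist = []
maxscore = None
# enquire and rset are set up as in the previous posts
for res in enquire.get_eset(15, rset, Filter()):
    # Normalise the score in the interval [0, 1]
    weight = math.log(res.weight)
    # The eset comes sorted by decreasing weight, so the first
    # result sets the maximum
    if maxscore is None: maxscore = weight
    tag = res.term[2:]
    taglist.append(
        (tag, format(tag), float(weight) / maxscore)
    )
# Lay the cloud out alphabetically by tag name
taglist.sort(key=lambda x:x[0])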
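To see what the logarithm buys, here is a small sketch that normalises the same list of weights both ways. The weights are made up, not real Xapian output, and the helper `normalise` is hypothetical. It assumes every weight is greater than 1, so that the logarithms stay positive.

```python
import math

def normalise(weights, use_log=True):
    """Scale weights so that the largest one maps to 1.0.

    Assumes every weight is greater than 1, so that its logarithm is
    positive; smaller weights would need an offset first.
    """
    values = [math.log(w) if use_log else float(w) for w in weights]
    top = max(values)
    return [v / top for v in values]

# Hypothetical expand weights, best first (not real Xapian output)
weights = [400.0, 60.0, 20.0, 8.0]

linear = normalise(weights, use_log=False)
logged = normalise(weights)

# Print the two normalisations side by side
for w, lin, lg in zip(weights, linear, logged):
    print("%6.1f  linear=%.2f  log=%.2f" % (w, lin, lg))
```

With linear scaling the smallest tag ends up at a tiny fraction of the largest one, while the logarithm keeps it at a readable size.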
Finally, you mark up a gtkhtml2.Document to display in a gtkhtml2 widget:
import gtkhtml2

def mark_text_up(result_list):
    # result_list holds (key (facet::tag), description, score) tuples,
    # with the score normalised between 0 and 1
    document = gtkhtml2.Document()
    document.clear()
    document.open_stream("text/html")
    document.write_stream("""<html><head>
<style type="text/css">
a { text-decoration: none; color: black; }
</style>
</head><body>""")
    for tag, desc, score in result_list:
        # The best tag is rendered at 150% of the base font size
        document.write_stream('<a href="%s" style="font-size: %d%%">%s</a> ' % (tag, score*150, desc))
    document.write_stream("</body></html>")
    document.close_stream()
    return document
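If you want to test the markup without a gtkhtml2 widget around, the HTML generation can be pulled out into a plain function. This is a sketch of a variation, not the code from the example: `render_cloud`, `min_pct` and `max_pct` are hypothetical names. Unlike `score*150`, it maps scores to a range with a lower bound, so the weakest tags stay legible. It also treats descriptions as plain text and escapes them, so it would not accept the `<i>` markup produced by `format` above. It is written for Python 3.

```python
from html import escape

def render_cloud(result_list, min_pct=60, max_pct=150):
    """Return an HTML fragment for (tag, description, score) triples.

    Scores are expected in [0, 1]; out-of-range values are clamped.
    """
    parts = []
    for tag, desc, score in result_list:
        score = min(max(score, 0.0), 1.0)
        # Linear interpolation between the smallest and largest font size
        size = min_pct + (max_pct - min_pct) * score
        parts.append('<a href="%s" style="font-size: %d%%">%s</a>'
                     % (escape(tag, quote=True), size, escape(desc)))
    return " ".join(parts)

# Made-up entries, only to exercise the function
cloud = render_cloud([
    ("role::program", "Role: Program", 1.0),
    ("use::editing", "Use: Editing <text>", 0.0),
])
```

The resulting string can then be wrapped in the same header and footer that `mark_text_up` writes to the document stream.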
That's it, try it out.
You can use the git web interface to get to the full source code and the module it uses.