Since having workable bindings for the C++ Debtags libraries seems to be still a bit in the future, I'm planning to build a bit of native infrastructure in some higher level language. First step is seeing what language I could start playing with.
The problem
At the most basic level, in Debtags we have a number of packages, each of which have a set of tags.
The way I usually save tags is a file with the format:
package1, package2: tag1, tag2, tag3 package3: tag1, tag2
That is, every line has a list of packages with the same tags, and the list of their tags.
Since any script I'm going to write has to at least be able to parse the data
into something like a package -> tags
hash, then print it out.
Let's see how perl, python and ruby perform.
Tests
C++
The reference point for the experiment will be the C++ implementation,
tagcoll
:
$ time tagcoll copy package-tags > /dev/null
real 0m0.421s
user 0m0.412s
sys 0m0.000s
Perl
First attempt is with Perl, creating the script that parses into a hash of
package => set of tags
and prints the result.
There are set modules for Perl on CPAN, but I have none handy at the moment. However, since they are implemented using hashes, I can approximate them by using a hash.
Note that I also want to have a different copy of the tag set for every package, so that I can manipulate them in the future without unwanted side effects.
Here is the code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | #!/usr/bin/perl -w use strict; my %db; # Read the tag database while (<>) { chop(); my ($pkgs, $tags) = split(': '); # Create the tagset using keys of a hash my %tags = map { $_ => undef } split(', ', $tags); for my $p (split(', ', $pkgs)) { # Make a copy of the tagset $db{$p} = {%tags}; } } # Write the tag database while (my ($pkg, $tags) = each %db) { print $pkg, join(', ', keys %$tags), "\n"; } |
Here is the running time:
$ time ./parse.pl package-tags > /dev/null
real 0m0.448s
user 0m0.436s
sys 0m0.008s
Not so bad, comparable with tagcoll.
Python
Then comes Python. I'm not much of a Python fancier, but I'm rather attracted
by the new set
native type introduced with Python 2.4, which seems to have
most of what I need nice and done.
Here is the script:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | #!/usr/bin/python import sys input = sys.stdin if len(sys.argv) > 1: input = open(sys.argv[1],"r") # Read the tag database db = {} for line in input: # Is there a way to remove the last character of a line that does not # make a copy of the entire line? line = line.rstrip("\n") pkgs, tags = line.split(": ") # Create the tag set using the native set tags = set(tags.split(", ")) for p in pkgs.split(", "): db[p] = tags.copy() # Write the tag database for pkg, tags in db.items(): # Using % here seems awkward to me, but if I use calls to # sys.stdout.write it becomes a bit slower print "%s:" % (pkg), ", ".join(tags) |
Here is the running time:
$ time ./parse.py package-tags > /dev/null
real 0m0.418s
user 0m0.376s
sys 0m0.036s
I'm pleased, very pleased. Using the native set seems to be not only handy, but efficient.
Ruby
Finally, Ruby. I like to use Ruby. In this case, however, it lacks a native set implementation, although it has a set module which is implemented using a hash.
Here is the script:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | #!/usr/bin/ruby require 'set' infile = ARGV[0] ? File.new(ARGV[0]) : $stdin # Read the tag database db = {} infile.each_line do |line| line.chop() pkgs, tags = line.split(": ") # Create the set using the Set module tags = Set.new(tags.split(", ")) pkgs.split(", ").each do |p| # Is this a copy or a reference? I need to find out. db[p] = tags end end # Write the tag database db.each do |key, tags| # Ouch, Set does not do join by itself print key, ": ", tags.to_a.join(", ") end |
Here is the running time:
$ time ./parse.rb package-tags > /dev/null
real 0m1.637s
user 0m1.572s
sys 0m0.052s
I hope I got something wrong in the script, but I can't see what.
Results
As much as I don't fancy Python, it looks like it's currently the best choice for playing around with Debtags. I hope the native sets will bring me joy.
If in the future I'll be asked "how come you chose Python for this Debtags thing?", I can point to this page.