Are the DPL platforms too long, and could you use a very, very short executive summary? No problem: I have the technology for it.
After the results you will find the kit to build yourself an extractor in the comfort of your own home.
The results
- 93sam: jobs, deal, dds, nms, asking
- aigarius: applications, aigarius, choose, trademarks, apps
- ajt: humbug, effective, neat, hoping, success
- hertzog: broadly, wouter, serve, represent, pushed
- sho: deadline, helps, excellence, freaks, tasks
- sjr: qb, published, xxxx, r, yet
- stratus: stable, websites, feature, submitter, involving
- svenl: unfair, protest, ban, publish, banning
- wouter: controversy, seem, background, delegates, therefore
Acquiring the data
for i in 93sam aigarius ajt hertzog sho sjr stratus svenl wouter
do
    wget http://www.debian.org/vote/2007/platforms/$i
done
Tokenizing
#!/bin/sh
# Turn each HTML platform into a flat list of lowercase word tokens, one per line
for file in "$@"
do
    lynx -dump -stdin < $file |
        tr -c '[a-zA-Z]' ' ' |
        tr '[A-Z]' '[a-z]' |
        sed -e 's/ /\n/g' |
        sed -e '/^$/d' > $file.tok
done
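To reproduce this at home, a run over the downloaded platforms might look like the following (tokenize.sh is just a hypothetical name for the script above; each input file gets a matching .tok file next to it):

sh tokenize.sh 93sam aigarius ajt hertzog sho sjr stratus svenl wouter
ls *.tok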
Extracting the most representative keywords
#!/usr/bin/python
import sys, math

def read_tokens(file):
    "Read all the tokens from one file"
    return [ line[:-1] for line in open(file) ]

# Read all the "documents"
docs = [ read_tokens(file) for file in sys.argv[1:] ]

# Aggregate token counts
aggregated = {}
for d in docs:
    for t in d:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1

def tfidf(doc, tok):
    "Compute TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(docs)) / aggregated[tok])

# Output the top 5 tokens by TFIDF for every document
for name, doc in zip(sys.argv[1:], docs):
    print name, sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
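Assuming the script above is saved as tfidf.py (another hypothetical name), the keyword lists in the results section come out of a run along these lines, one output line per tokenized platform:

python tfidf.py *.tok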
Errata
Jacobo suggests using lynx -dump -nolist or w3m -dump for a more tokenizer-friendly text dump.
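As a sketch of how that suggestion would fit in (an untested variant, not the original script), the tokenizing loop could drop the link list from the lynx output like this:

#!/bin/sh
# hypothetical variant of the tokenizer, with Jacobo's -nolist suggestion
for file in "$@"
do
    lynx -dump -nolist -stdin < $file |
        tr -c '[a-zA-Z]' ' ' |
        tr '[A-Z]' '[a-z]' |
        sed -e 's/ /\n/g' |
        sed -e '/^$/d' > $file.tok
done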