Latest posts for tag tips
A friend on IRC said: "I wish vim had a command to resolve all the IP addresses in a block of text".
But it does:
:<block>!perl -MSocket -pe 's/(\d+\.\d+\.\d+\.\d+)/gethostbyaddr(inet_aton($1), AF_INET)/ge'
If you use it often, put the perl command in a one-liner script and call it from the editor as a filter. It works in other editors, too, and even without an editor at all. And it can be scripted!
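The same filter can also be sketched outside Perl; here is a rough Python equivalent (a sketch; addresses that fail the reverse lookup are left untouched):

#!/usr/bin/python
# resolveips - replace IPv4 addresses on stdin with their reverse DNS names
# A rough equivalent of the perl one-liner above; failed lookups are kept as-is.
import re
import socket
import sys

def resolve(match):
    ip = match.group(0)
    try:
        return socket.gethostbyaddr(ip)[0]
    except socket.error:
        return ip

for line in sys.stdin:
    sys.stdout.write(re.sub(r"\d+\.\d+\.\d+\.\d+", resolve, line))

From vim it would be called the same way, with :<block>!resolveips.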
We live with the power of Unix every day, so much that we risk forgetting how awesome it is.
As everyone should know, by default MySQL is an embarrassingly stupid toy:
mysql> create table foo (val integer not null);
Query OK, 0 rows affected (0.03 sec)

mysql> insert into foo values (1/0);
ERROR 1048 (23000): Column 'val' cannot be null

mysql> insert into foo values (1);
Query OK, 1 row affected (0.00 sec)

mysql> update foo set val=1/0 where val=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 1

mysql> select * from foo;
+-----+
| val |
+-----+
|   0 |
+-----+
1 row in set (0.00 sec)
Luckily, you can tell it to stop being embarrassingly stupid:
mysql> set sql_mode="traditional";
Query OK, 0 rows affected (0.00 sec)

mysql> update foo set val=1/0 where val=0;
ERROR 1365 (22012): Division by 0
(There is an even better sql mode you can choose, though: it is called "Install PostgreSQL")
Unfortunately, I've been hired to work on a project that relies on the embarrassingly stupid behaviour of MySQL, so I cannot set sql_mode=traditional globally or the existing house of cards will collapse.
Here is how you set it session-wide with SQLAlchemy 0.6.x; it took me quite a while to find out:
import sqlalchemy.interfaces

# Without this, MySQL will silently insert invalid values in the
# database, causing very long debugging sessions in the long run
class DontBeSilly(sqlalchemy.interfaces.PoolListener):
    def connect(self, dbapi_con, connection_record):
        cur = dbapi_con.cursor()
        cur.execute("SET SESSION sql_mode='TRADITIONAL'")
        cur = None

engine = create_engine(..., listeners=[DontBeSilly()])
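(As an aside, PoolListener was deprecated after 0.6; on newer SQLAlchemy versions the same trick would presumably go through the event system, along these lines:)

import sqlalchemy
from sqlalchemy import event

# Sketch for SQLAlchemy >= 0.7: same SET SESSION trick via the event system
engine = sqlalchemy.create_engine("mysql://user:pass@host/db")  # hypothetical URL

@event.listens_for(engine, "connect")
def dont_be_silly(dbapi_con, connection_record):
    cur = dbapi_con.cursor()
    cur.execute("SET SESSION sql_mode='TRADITIONAL'")
    cur.close()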
Why it takes all that effort is beyond me. I'd have expected this to be turned on by default, possibly with a switch that insane people could use to turn it off.
Aruba has decided, out of the blue, to subscribe me to all of their newsletters.
The newsletters have no unsubscribe link. Or rather, maybe they do, but it only becomes visible if you decode the mail with software I have no intention of using. Unsubscribe link or not, why should I unsubscribe from newsletters I never subscribed to in the first place?
I sent this mail to abuse@staff.aruba.it, plus three more reports after it, all of which were of course ignored:
Good morning, I am reporting this spam sent by yourselves (attached is the mail with its headers intact). Could you please proceed with disciplinary measures against yourselves? Your behaviour on the internet violates the most basic rules of netiquette, and it is in your own interest, as a provider, to educate yourselves about them and to enforce them on yourselves. Kind regards, Enrico
What can I say: a third-world country deserves third-world ISPs.
Still, it is an excellent excuse to study postfix's header_checks: now the mails from Aruba's newsletters, which by the way are 300KB monsters each, meet a REJECT straight in the SMTP session:
550 5.7.1 Criminal third-world ISP spammers not accepted here.
To do that, I added to /etc/postfix/main.cf:
# Reject aruba spam right away
header_checks = pcre:/etc/postfix/known_idiots.pcre
Then I created the file /etc/postfix/known_idiots.pcre:
/^Received:.+smtpnewsletter[0-9]+.aruba.it/ REJECT Criminal third-world ISP spammers not accepted here.
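As a side note, the map can be tested without sending any mail: something along the lines of postmap -q "Received: from smtpnewsletter1.aruba.it (mail.example)" pcre:/etc/postfix/known_idiots.pcre should print the REJECT action (the Received header here is made up), and postfix needs a reload to pick up the main.cf change.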
In the meantime I have sent an email to the Garante Privacy (the Italian data protection authority) and one to AGCOM, more out of curiosity than anything else. I don't expect any reply, but if anything happens I'll gladly add it here.
Yuwei and I had a fun day at hhhmcr (#hhhmcr) and even managed to put together a prototype that won the first prize \o/
We played with the gmp24 data, kindly extracted from Twitter by Michael Brunton-Spall of the Guardian into a convenient JSON dataset. The idea was to find ways of making it easier to look at the data and make sense of it.
This is the story of what we did, including the code we wrote.
The original dataset has several JSON files, so the first task was to put them all together:
#!/usr/bin/python
# Merge the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import os

res = []

for f in os.listdir("."):
    if not f.startswith("gmp24"): continue
    data = open(f).read().strip()
    if data == "[]": continue
    parsed = simplejson.loads(data)
    res.extend(parsed)

print simplejson.dumps(res)
The results, however, were not ordered by date, as GMP had to use several accounts to tweet because Twitter kept putting Greater Manchester Police in jail for generating too much traffic. There would be quite a bit to write about that, but let's stick to our work.
Here is code to sort the JSON data by time:
#!/usr/bin/python
# Sort the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import sys
import datetime as dt

all_recs = simplejson.load(sys.stdin)
all_recs.sort(key=lambda x: dt.datetime.strptime(x["created_at"], "%a %b %d %H:%M:%S +0000 %Y"))
simplejson.dump(all_recs, sys.stdout)
I then wanted to play with Tf-idf for extracting the most important words of every tweet:
#!/usr/bin/python # tfifd - Annotate JSON elements with Tf-idf extracted keywords # # Copyright (C) 2010 Enrico Zini <enrico@enricozini.org> # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. import sys, math import simplejson import re # Read all the twits records = simplejson.load(sys.stdin) # All the twits by ID byid = dict(((x["id"], x) for x in records)) # Stopwords we ignore stopwords = set(["by", "it", "and", "of", "in", "a", "to"]) # Tokenising engine re_num = re.compile(r"^\d+$") re_word = re.compile(r"(\w+)") def tokenise(tweet): "Extract tokens from a tweet" for tok in tweet["text"].split(): tok = tok.strip().lower() if re_num.match(tok): continue mo = re_word.match(tok) if not mo: continue if mo.group(1) in stopwords: continue yield mo.group(1) # Extract tokens from tweets tokenised = dict(((x["id"], list(tokenise(x))) for x in records)) # Aggregate token counts aggregated = {} for d in byid.iterkeys(): for t in tokenised[d]: if t in aggregated: aggregated[t] += 1 else: aggregated[t] = 1 def tfidf(doc, tok): "Compute TFIDF score of a token in a document" return doc.count(tok) * math.log(float(len(byid)) / aggregated[tok]) # Annotate tweets with keywords res = [] for name, tweet in byid.iteritems(): doc = tokenised[name] keywords = sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5] tweet["keywords"] = keywords res.append(tweet) simplejson.dump(res, sys.stdout)
I thought this was producing a nice summary of every tweet but nobody was particularly interested, so we moved on to adding categories to tweets.
Thanks to Yuwei, who put together some useful keyword sets, we managed to annotate each tweet with a place name (e.g. "Stockport"), a social place name (e.g. "pub", "bank") and a social category (e.g. "man", "woman", "landlord"...).
The code is simple; the biggest part of the work was the dictionary of keywords:
#!/usr/bin/python # categorise - Annotate JSON elements with categories # # Copyright (C) 2010 Enrico Zini <enrico@enricozini.org> # Copyright (C) 2010 Yuwei Lin <yuwei@ylin.org> # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. import sys, math import simplejson import re # Electoral wards from http://en.wikipedia.org/wiki/List_of_electoral_wards_in_Greater_Manchester placenames = ["Altrincham", "Sale West", "Altrincham", "Ashton upon Mersey", "Bowdon", "Broadheath", "Hale Barns", "Hale Central", "St Mary", "Timperley", "Village", "Ashton-under-Lyne", "Ashton Hurst", "Ashton St Michael", "Ashton Waterloo", "Droylsden East", "Droylsden West", "Failsworth East", "Failsworth West", "St Peter", "Blackley", "Broughton", "Broughton", "Charlestown", "Cheetham", "Crumpsall", "Harpurhey", "Higher Blackley", "Kersal", "Bolton North East", "Astley Bridge", "Bradshaw", "Breightmet", "Bromley Cross", "Crompton", "Halliwell", "Tonge with the Haulgh", "Bolton South East", "Farnworth", "Great Lever", "Harper Green", "Hulton", "Kearsley", "Little Lever", "Darcy Lever", "Rumworth", "Bolton West", "Atherton", "Heaton", "Lostock", "Horwich", "Blackrod", "Horwich North East", "Smithills", "Westhoughton North", "Chew Moor", "Westhoughton South", "Bury North", "Church", "East", "Elton", "Moorside", "North Manor", "Ramsbottom", "Redvales", "Tottington", "Bury South", "Besses", "Holyrood", "Pilkington Park", "Radcliffe East", "Radcliffe North", "Radcliffe West", "St Mary", "Sedgley", "Unsworth", "Cheadle", "Bramhall North", "Bramhall South", "Cheadle", "Gatley", "Cheadle Hulme North", "Cheadle Hulme South", "Heald Green", "Stepping Hill", "Denton", "Reddish", "Audenshaw", "Denton North East", "Denton South", "Denton West", "Dukinfield", "Reddish North", "Reddish South", "Hazel Grove", "Bredbury", "Woodley", "Bredbury Green", "Romiley", "Hazel Grove", "Marple North", "Marple South", "Offerton", "Heywood", "Middleton", "Bamford", "Castleton", "East Middleton", "Hopwood Hall", "Norden", "North Heywood", "North Middleton", "South Middleton", "West Heywood", "West Middleton", "Leigh", "Astley Mosley Common", "Atherleigh", "Golborne", "Lowton West", "Leigh East", "Leigh South", "Leigh West", "Lowton East", "Tyldesley", "Makerfield", "Abram", "Ashton", "Bryn", "Hindley", "Hindley Green", "Orrell", "Winstanley", "Worsley Mesnes", "Manchester Central", "Ancoats", "Clayton", "Ardwick", "Bradford", "City Centre", "Hulme", "Miles Platting", "Newton Heath", "Moss Side", "Moston", "Manchester", "Gorton", "Fallowfield", "Gorton North", "Gorton South", "Levenshulme", "Longsight", "Rusholme", "Whalley Range", "Manchester", "Withington", "Burnage", "Chorlton", "Chorlton Park", "Didsbury East", "Didsbury West", "Old Moat", "Withington", "Oldham East", "Saddleworth", "Alexandra", "Crompton", "Saddleworth North", "Saddleworth South", "Saddleworth West", "Lees", "St James", "St Mary", "Shaw", "Waterhead", "Oldham West", "Royton", "Chadderton Central", 
"Chadderton North", "Chadderton South", "Coldhurst", "Hollinwood", "Medlock Vale", "Royton North", "Royton South", "Werneth", "Rochdale", "Balderstone", "Kirkholt", "Central Rochdale", "Healey", "Kingsway", "Littleborough Lakeside", "Milkstone", "Deeplish", "Milnrow", "Newhey", "Smallbridge", "Firgrove", "Spotland", "Falinge", "Wardle", "West Littleborough", "Salford", "Eccles", "Claremont", "Eccles", "Irwell Riverside", "Langworthy", "Ordsall", "Pendlebury", "Swinton North", "Swinton South", "Weaste", "Seedley", "Stalybridge", "Hyde", "Dukinfield Stalybridge", "Hyde Godley", "Hyde Newton", "Hyde Werneth", "Longdendale", "Mossley", "Stalybridge North", "Stalybridge South", "Stockport", "Brinnington", "Central", "Davenport", "Cale Green", "Edgeley", "Cheadle Heath", "Heatons North", "Heatons South", "Manor", "Stretford", "Urmston", "Bucklow-St Martins", "Clifford", "Davyhulme East", "Davyhulme West", "Flixton", "Gorse Hill", "Longford", "Stretford", "Urmston", "Wigan", "Aspull New Springs Whelley", "Douglas", "Ince", "Pemberton", "Shevington with Lower Ground", "Standish with Langtree", "Wigan Central", "Wigan West", "Worsley", "Eccles South", "Barton", "Boothstown", "Ellenbrook", "Cadishead", "Irlam", "Little Hulton", "Walkden North", "Walkden South", "Winton", "Worsley", "Wythenshawe", "Sale East", "Baguley", "Brooklands", "Northenden", "Priory", "Sale Moor", "Sharston", "Woodhouse Park"] # Manual coding from Yuwei placenames.extend(["City centre", "Tameside", "Oldham", "Bury", "Bolton", "Trafford", "Pendleton", "New Moston", "Denton", "Eccles", "Leigh", "Benchill", "Prestwich", "Sale", "Kearsley", ]) placenames.extend(["Trafford", "Bolton", "Stockport", "Levenshulme", "Gorton", "Tameside", "Blackley", "City centre", "Airport", "South Manchester", "Rochdale", "Chorlton", "Uppermill", "Castleton", "Stalybridge", "Ashton", "Chadderton", "Bury", "Ancoats", "Whalley Range", "West Yorkshire", "Fallowfield", "New Moston", "Denton", "Stretford", "Eccles", "Pendleton", "Leigh", "Altrincham", "Sale", "Prestwich", "Kearsley", "Hulme", "Withington", "Moss Side", "Milnrow", "outskirt of Manchester City Centre", "Newton Heath", "Wythenshawe", "Mancunian Way", "M60", "A6", "Droylesden", "M56", "Timperley", "Higher Ince", "Clayton", "Higher Blackley", "Lowton", "Droylsden", "Partington", "Cheetham Hill", "Benchill", "Longsight", "Didsbury", "Westhoughton"]) # Social categories from Yuwei soccat = ["man", "woman", "men", "women", "youth", "teenager", "elderly", "patient", "taxi driver", "neighbour", "male", "tenant", "landlord", "child", "children", "immigrant", "female", "workmen", "boy", "girl", "foster parents", "next of kin"] for i in range(100): soccat.append("%d-year-old" % i) soccat.append("%d-years-old" % i) # Types of social locations from Yuwei socloc = ["car park", "park", "pub", "club", "shop", "premises", "bus stop", "property", "credit card", "supermarket", "garden", "phone box", "theatre", "toilet", "building site", "Crown court", "hard shoulder", "telephone kiosk", "hotel", "restaurant", "cafe", "petrol station", "bank", "school", "university"] extras = { "placename": placenames, "soccat": soccat, "socloc": socloc } # Normalise keyword lists for k, v in extras.iteritems(): # Remove duplicates v = list(set(v)) # Sort by length v.sort(key=lambda x:len(x), reverse=True) # Add keywords def add_categories(tweet): text = tweet["text"].lower() for field, categories in extras.iteritems(): for cat in categories: if cat.lower() in text: tweet[field] = cat break return tweet # Read all the twits 
records = (add_categories(x) for x in simplejson.load(sys.stdin)) simplejson.dump(list(records), sys.stdout)
All these scripts form a nice processing chain: each script takes a list of JSON records, adds some bits and passes them on.
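Each stage follows the same contract: read a JSON list of records from standard input, annotate it, write it back to standard output. A minimal template for a new stage in the chain could look like this (a sketch; the "length" annotation is just a made-up example):

#!/usr/bin/python
# Template for a new stage in the JSON processing chain
import simplejson
import sys

records = simplejson.load(sys.stdin)
for rec in records:
    # Made-up example annotation: the length of the tweet text
    rec["length"] = len(rec["text"])
simplejson.dump(records, sys.stdout)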
In order to see what we have so far, here is a simple script to convert the JSON tweets to CSV so they can be viewed in a spreadsheet:
#!/usr/bin/python
# Convert the JSON twits to CSV
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import sys
import csv

rows = ["id", "created_at", "text", "keywords", "placename"]
writer = csv.writer(sys.stdout)
for rec in simplejson.load(sys.stdin):
    rec["keywords"] = " ".join(rec["keywords"])
    rec["placename"] = rec.get("placename", "")
    writer.writerow([rec[row] for row in rows])
At this point we were coming up with lots of questions: "were there more reports on women or men?", "which place had most incidents?", "what were the incidents involving animals?"... Time to bring Xapian into play.
This script reads all the JSON tweets and builds a Xapian index with them:
#!/usr/bin/python # toxapian - Index JSON tweets in Xapian # # Copyright (C) 2010 Enrico Zini <enrico@enricozini.org> # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. import simplejson import sys import os, os.path import xapian DBNAME = sys.argv[1] db = xapian.WritableDatabase(DBNAME, xapian.DB_CREATE_OR_OPEN) stemmer = xapian.Stem("english") indexer = xapian.TermGenerator() indexer.set_stemmer(stemmer) indexer.set_database(db) data = simplejson.load(sys.stdin) for rec in data: doc = xapian.Document() doc.set_data(str(rec["id"])) indexer.set_document(doc) indexer.index_text_without_positions(rec["text"]) # Index categories as categories if "placename" in rec: doc.add_boolean_term("XP" + rec["placename"].lower()) if "soccat" in rec: doc.add_boolean_term("XS" + rec["soccat"].lower()) if "socloc" in rec: doc.add_boolean_term("XL" + rec["socloc"].lower()) db.add_document(doc) db.flush() # Also save the whole dataset so we know where to find it later if we want to # show the details of an entry simplejson.dump(data, open(os.path.join(DBNAME, "all.json"), "w"))
And this is a simple command line tool to query the database:
#!/usr/bin/python # xgrep - Command line tool to query the GMP24 tweet Xapian database # # Copyright (C) 2010 Enrico Zini <enrico@enricozini.org> # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. import simplejson import sys import os, os.path import xapian DBNAME = sys.argv[1] db = xapian.Database(DBNAME) stem = xapian.Stem("english") qp = xapian.QueryParser() qp.set_default_op(xapian.Query.OP_AND) qp.set_database(db) qp.set_stemmer(stem) qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME) qp.add_boolean_prefix("place", "XP") qp.add_boolean_prefix("soc", "XS") qp.add_boolean_prefix("loc", "XL") query = qp.parse_query(sys.argv[2], xapian.QueryParser.FLAG_BOOLEAN | xapian.QueryParser.FLAG_LOVEHATE | xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE | xapian.QueryParser.FLAG_WILDCARD | xapian.QueryParser.FLAG_PURE_NOT | xapian.QueryParser.FLAG_SPELLING_CORRECTION | xapian.QueryParser.FLAG_AUTO_SYNONYMS) enquire = xapian.Enquire(db) enquire.set_query(query) count = 40 matches = enquire.get_mset(0, count) estimated = matches.get_matches_estimated() print "%d/%d results" % (matches.size(), estimated) data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json")))) for m in matches: rec = data[m.document.get_data()] print rec["text"] print "%d/%d results" % (matches.size(), matches.get_matches_estimated()) total = db.get_doccount() estimated = matches.get_matches_estimated() print "%d results over %d documents, %d%%" % (estimated, total, estimated * 100 / total)
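Invocation is not shown above; going by the code, it would be something like ./xgrep tweets.db "fountain place:wigan" (both the database directory name and the query are made-up examples), which prints the matching tweet texts followed by the match statistics.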
Neat! Now that we have a proper index that supports all sorts of cool things, like stemming, tag clouds, full-text search with complex queries, looking up similar documents, suggesting keywords and so on, it was only fair to put together a web service to share it with other people at the event.
It helped that I had already written similar code for apt-xapian-index and dde before.
Here is the server, quickly built on bottle. The very last line starts the server and it is where you can configure the listening interface and port.
#!/usr/bin/python # xserve - Make the GMP24 tweet Xapian database available on the web # # Copyright (C) 2010 Enrico Zini <enrico@enricozini.org> # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. import bottle from bottle import route, post from cStringIO import StringIO import cPickle as pickle import simplejson import sys import os, os.path import xapian import urllib import math bottle.debug(True) DBNAME = sys.argv[1] QUERYLOG = os.path.join(DBNAME, "queries.txt") data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json")))) prefixes = { "place": "XP", "soc": "XS", "loc": "XL" } prefix_desc = { "place": "Place name", "soc": "Social category", "loc": "Social location" } db = xapian.Database(DBNAME) stem = xapian.Stem("english") qp = xapian.QueryParser() qp.set_default_op(xapian.Query.OP_AND) qp.set_database(db) qp.set_stemmer(stem) qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME) for k, v in prefixes.iteritems(): qp.add_boolean_prefix(k, v) def make_query(qstring): return qp.parse_query(qstring, xapian.QueryParser.FLAG_BOOLEAN | xapian.QueryParser.FLAG_LOVEHATE | xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE | xapian.QueryParser.FLAG_WILDCARD | xapian.QueryParser.FLAG_PURE_NOT | xapian.QueryParser.FLAG_SPELLING_CORRECTION | xapian.QueryParser.FLAG_AUTO_SYNONYMS) @route("/") def index(): query = urllib.unquote_plus(bottle.request.GET.get("q", "")) out = StringIO() print >>out, ''' <html> <head> <title>Query</title> <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script> <script type="text/javascript"> $(function(){ $("#queryfield")[0].focus() }) </script> </head> <body> <h1>Search</h1> <form method="POST" action="/query"> Keywords: <input type="text" name="query" value="%s" id="queryfield"> <input type="submit"> <a href="http://xapian.org/docs/queryparser.html">Help</a> </form>''' % query print >>out, ''' <p>Example: "car place:wigan"</p> <p>Available prefixes:</p> <ul> ''' for pfx in prefixes.keys(): print >>out, "<li><a href='/catinfo/%s'>%s - %s</a></li>" % (pfx, pfx, prefix_desc[pfx]) print >>out, ''' </ul> ''' oldqueries = [] if os.path.exists(QUERYLOG): total = db.get_doccount() fd = open(QUERYLOG, "r") while True: try: q = pickle.load(fd) except EOFError: break oldqueries.append(q) fd.close() def print_query(q): count = q["count"] print >>out, "<li><a href='/query?query=%s'>%s (%d/%d %.2f%%)</a></li>" % (urllib.quote_plus(q["q"]), q["q"], count, total, count * 100.0 / total) print >>out, "<p>Last 10 queries:</p><ul>" for q in oldqueries[:-10:-1]: print_query(q) print >>out, "</ul>" # Remove duplicates oldqueries = dict(((x["q"], x) for x in oldqueries)).values() print >>out, "<table>" print >>out, "<tr><th>10 queries with most results</th><th>10 queries with least results</th></tr>" print >>out, "<tr><td>" print >>out, "<ul>" oldqueries.sort(key=lambda x:x["count"], reverse=True) for q in oldqueries[:10]: print_query(q) print >>out, "</ul>" 
print >>out, "</td><td>" print >>out, "<ul>" nonempty = [x for x in oldqueries if x["count"] > 0] nonempty.sort(key=lambda x:x["count"]) for q in nonempty[:10]: print_query(q) print >>out, "</ul>" print >>out, "</td></tr>" print >>out, "</table>" print >>out, ''' </body> </html>''' return out.getvalue() @route("/query") @route("/query/") @post("/query") @post("/query/") def query(): query = bottle.request.POST.get("query", bottle.request.GET.get("query", "")) enquire = xapian.Enquire(db) enquire.set_query(make_query(query)) count = 40 matches = enquire.get_mset(0, count) estimated = matches.get_matches_estimated() total = db.get_doccount() out = StringIO() print >>out, ''' <html> <head><title>Results</title></head> <body> <h1>Results for "<b>%s</b>"</h1> ''' % query if estimated == 0: print >>out, "No results found." else: # Give as results the first 30 documents; also use them as the key # ones to use to compute relevant terms rset = xapian.RSet() for m in enquire.get_mset(0, 30): rset.add_document(m.document.get_docid()) # Compute the tag cloud class NonTagFilter(xapian.ExpandDecider): def __call__(self, term): return not term[0].isupper() and not term[0].isdigit() cloud = [] maxscore = None for res in enquire.get_eset(40, rset, NonTagFilter()): # Normalise the score in the interval [0, 1] weight = math.log(res.weight) if maxscore == None: maxscore = weight tag = res.term cloud.append([tag, float(weight) / maxscore]) max_weight = cloud[0][1] min_weight = cloud[-1][1] cloud.sort(key=lambda x:x[0]) def mklink(query, term): return "/query?query=%s" % urllib.quote_plus(query + " and " + term) print >>out, "<h2>Tag cloud</h2>" print >>out, "<blockquote>" for term, weight in cloud: size = 100 + 100.0 * (weight - min_weight) / (max_weight - min_weight) print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(query, term), size, term) print >>out, "</blockquote>" print >>out, "<h2>Results</h2>" print >>out, "<p><a href='/'>Search again</a></p>" print >>out, "<p>%d results over %d documents, %.2f%%</p>" % (estimated, total, estimated * 100.0 / total) print >>out, "<p>%d/%d results</p>" % (matches.size(), estimated) print >>out, "<ul>" for m in matches: rec = data[m.document.get_data()] print >>out, "<li><a href='/item/%s'>%s</a></li>" % (rec["id"], rec["text"]) print >>out, "</ul>" fd = open(QUERYLOG, "a") qinfo = dict(q=query, count=estimated) pickle.dump(qinfo, fd) fd.close() print >>out, ''' <a href="/">Search again</a> </body> </html>''' return out.getvalue() @route("/item/:id") @route("/item/:id/") def show(id): rec = data[id] out = StringIO() print >>out, ''' <html> <head><title>Result %s</title></head> <body> <h1>Raw JSON record for twit %s</h1> <pre>''' % (rec["id"], rec["id"]) print >>out, simplejson.dumps(rec, indent=" ") print >>out, ''' </pre> </body> </html>''' return out.getvalue() @route("/catinfo/:name") @route("/catinfo/:name/") def catinfo(name): prefix = prefixes[name] out = StringIO() print >>out, ''' <html> <head><title>Values for %s</title></head> <body> ''' % name terms = [(x.term[len(prefix):], db.get_termfreq(x.term)) for x in db.allterms(prefix)] terms.sort(key=lambda x:x[1], reverse=True) freq_min = terms[0][1] freq_max = terms[-1][1] def mklink(name, term): return "/query?query=%s" % urllib.quote_plus(name + ":" + term) # Build tag cloud print >>out, "<h1>Tag cloud</h1>" print >>out, "<blockquote>" for term, freq in sorted(terms[:20], key=lambda x:x[0]): size = 100 + 100.0 * (freq - freq_min) / (freq_max - freq_min) print >>out, "<a href='%s' 
style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(name, term), size, term) print >>out, "</blockquote>" print >>out, "<h1>All terms</h1>" print >>out, "<table>" print >>out, "<tr><th>Occurrences</th><th>Name</th></tr>" for term, freq in terms: print >>out, "<tr><td>%d</td><td><a href='/query?query=%s'>%s</a></td></tr>" % (freq, urllib.quote_plus(name + ":" + term), term) print >>out, "</table>" print >>out, ''' </body> </html>''' return out.getvalue() # Change here for bind host and port bottle.run(host="0.0.0.0", port=8024)
...and then we presented our work and ended up winning the contest.
This was the story of how we wrote this set of award winning code.
I like the idea of matching photos to GPS traces. In Debian there is gpscorrelate but it's almost unusable to me because of bug #473362 and it has an awkward way of specifying time offsets.
Here at SoTM10 someone told me that exiftool gained -geosync and -geotag options. So it's just a matter of creating a little tool that shows a photo and asks you to type the GPS time you see in it.
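The intended workflow, as also noted in the comment at the top of the prototype below, is to compute the offset with the little tool and then run something along the lines of exiftool -geosync=+0:00:25 -geotag trace.gpx DIR (the offset value here is just an example).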
Apparently there are no bindings or GIR files for gtkimageview in Debian, so I'll have to use C.
Here is a C prototype:
/* * gpsoffset - Compute EXIF time offset from a photo of a gps display * * Use with exiftool -geosync=... -geotag trace.gpx DIR * * Copyright (C) 2009--2010 Enrico Zini <enrico@enricozini.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #define _XOPEN_SOURCE /* glibc2 needs this */ #include <time.h> #include <gtkimageview/gtkimageview.h> #include <libexif/exif-data.h> #include <stdio.h> #include <stdlib.h> static int load_time(const char* fname, struct tm* tm) { ExifData* exif_data = exif_data_new_from_file(fname); ExifEntry* exif_time = exif_data_get_entry(exif_data, EXIF_TAG_DATE_TIME); if (exif_time == NULL) { fprintf(stderr, "Cannot find EXIF timetamp\n"); return -1; } char buf[1024]; exif_entry_get_value(exif_time, buf, 1024); //printf("val2: %s\n", exif_entry_get_value(t2, buf, 1024)); if (strptime(buf, "%Y:%m:%d %H:%M:%S", tm) == NULL) { fprintf(stderr, "Cannot match EXIF timetamp\n"); return -1; } return 0; } static time_t exif_ts; static GtkWidget* res_lbl; void date_entry_changed(GtkEditable *editable, gpointer user_data) { const gchar* text = gtk_entry_get_text(GTK_ENTRY(editable)); struct tm parsed; if (strptime(text, "%Y-%m-%d %H:%M:%S", &parsed) == NULL) { gtk_label_set_text(GTK_LABEL(res_lbl), "Please enter a date as YYYY-MM-DD HH:MM:SS"); } else { time_t img_ts = mktime(&parsed); int c; int res; if (exif_ts < img_ts) { c = '+'; res = img_ts - exif_ts; } else { c = '-'; res = exif_ts - img_ts; } char buf[1024]; if (res > 3600) snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d:%02d:%02d", c, res, c, res / 3600, (res / 60) % 60, res % 60); else if (res > 60) snprintf(buf, 1024, "Result: %c%ds -geosync=%c%02d:%02d", c, res, c, (res / 60) % 60, res % 60); else snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d", c, res, c, res); gtk_label_set_text(GTK_LABEL(res_lbl), buf); } } int main (int argc, char *argv[]) { // Work in UTC to avoid mktime applying DST or timezones setenv("TZ", "UTC"); const char* filename = "/home/enrico/web-eddie/galleries/2010/04-05-Uppermill/P1080932.jpg"; gtk_init (&argc, &argv); struct tm exif_time; if (load_time(filename, &exif_time) != 0) return 1; printf("EXIF time: %s\n", asctime(&exif_time)); exif_ts = mktime(&exif_time); GtkWidget* window = gtk_window_new(GTK_WINDOW_TOPLEVEL); GtkWidget* vb = gtk_vbox_new(FALSE, 0); GtkWidget* hb = gtk_hbox_new(FALSE, 0); GtkWidget* lbl = gtk_label_new("Timestamp:"); GtkWidget* exif_lbl; { char buf[1024]; strftime(buf, 1024, "EXIF time: %Y-%m-%d %H:%M:%S", &exif_time); exif_lbl = gtk_label_new(buf); } GtkWidget* date_ent = gtk_entry_new(); res_lbl = gtk_label_new("Result:"); GtkWidget* view = gtk_image_view_new(); GdkPixbuf* pixbuf = gdk_pixbuf_new_from_file(filename, NULL); gtk_box_pack_start(GTK_BOX(hb), lbl, FALSE, TRUE, 0); gtk_box_pack_start(GTK_BOX(hb), date_ent, TRUE, TRUE, 0); gtk_signal_connect(GTK_OBJECT(date_ent), "changed", (GCallback)date_entry_changed, 
NULL); { char buf[1024]; strftime(buf, 1024, "%Y-%m-%d %H:%M:%S", &exif_time); gtk_entry_set_text(GTK_ENTRY(date_ent), buf); } gtk_widget_set_size_request(view, 500, 400); gtk_image_view_set_pixbuf(GTK_IMAGE_VIEW(view), pixbuf, TRUE); gtk_container_add(GTK_CONTAINER(window), vb); gtk_box_pack_start(GTK_BOX(vb), view, TRUE, TRUE, 0); gtk_box_pack_start(GTK_BOX(vb), hb, FALSE, TRUE, 0); gtk_box_pack_start(GTK_BOX(vb), exif_lbl, FALSE, TRUE, 0); gtk_box_pack_start(GTK_BOX(vb), res_lbl, FALSE, TRUE, 0); gtk_widget_show_all(window); gtk_main (); return 0; }
And here is its simple makefile:
CFLAGS=$(shell pkg-config --cflags gtkimageview libexif)
LDFLAGS=$(shell pkg-config --libs gtkimageview libexif)

gpsoffset: gpsoffset.c
It's a simple prototype but it's a working prototype and seems to do the job for me.
I currently cannot figure out why, after I click on the text box, there seems to be no way to give the focus back to the image viewer so that I can control it with the keyboard.
There is another nice algorithm for computing time offsets that could be implemented: you choose a photo taken at a known place and drag it onto that place on a map; the tool can then look for the nearest point on your GPX trace and compute the time offset from that.
I have seen that there are programs for geotagging photos that implement all such algorithms, and have a nice UI, but I haven't seen any in Debian.
Is there any such software that could be packaged?
If not, the interpolation and annotation tasks can now already be performed by exiftool, so it's just a matter of building a good UI, and I would love to see someone picking up the task.
Third step of my SoTM10 pet project: finding the POIs.
I put together a query to find all nodes with a given tag inside a bounding box, and also a query to find all the tag values for a given tag name inside a bounding box.
The result is this simple POI search engine:
#
# poisearch - simple geographical POI search engine
#
# Copyright (C) 2010 Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
import simplejson
from pysqlite2 import dbapi2 as sqlite
class PoiDB(object):
def __init__(self):
self.db = sqlite.connect("pois.db")
self.db.enable_load_extension(True)
self.db.execute("SELECT load_extension('libspatialite.so')")
self.oldsearch = []
self.bbox = None
def set_bbox(self, xmin, xmax, ymin, ymax):
'''Set bbox for searches'''
self.bbox = (xmin, xmax, ymin, ymax)
def tagid(self, name, val):
'''Get the database ID for a tag'''
c = self.db.cursor()
c.execute("SELECT id FROM tag WHERE name=? AND value=?", (name, val))
res = None
for row in c:
res = row[0]
return res
def tagnames(self):
'''Get all tag names'''
c = self.db.cursor()
c.execute("SELECT DISTINCT name FROM tag ORDER BY name")
for row in c:
yield row[0]
def tagvalues(self, name, use_bbox=False):
'''
Get all tag values for a given tag name,
optionally in the current bounding box
'''
c = self.db.cursor()
if self.bbox is None or not use_bbox:
c.execute("SELECT DISTINCT value FROM tag WHERE name=? ORDER BY value", (name,))
else:
c.execute("SELECT DISTINCT tag.value FROM poi, poitag, tag"
" WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
" xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
" AND poitag.tag = tag.id AND poitag.poi = poi.id"
" AND tag.name=?",
self.bbox + (name,))
for row in c:
yield row[0]
def search(self, name, val):
'''Get all name:val tags in the current bounding box'''
# First resolve the tagid
tagid = self.tagid(name, val)
if tagid is None: return
c = self.db.cursor()
c.execute("SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag"
" WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
" xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
" AND poitag.tag = ? AND poitag.poi = poi.id",
self.bbox + (tagid,))
self.oldsearch = []
for row in c:
self.oldsearch.append(row)
yield row[0], simplejson.loads(row[1]), row[2], row[3]
def count(self, name, val):
'''Count all name:val tags in the current bounding box'''
# First resolve the tagid
tagid = self.tagid(name, val)
if tagid is None: return
c = self.db.cursor()
c.execute("SELECT COUNT(*) FROM poi, poitag"
" WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
" xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
" AND poitag.tag = ? AND poitag.poi = poi.id",
self.bbox + (tagid,))
for row in c:
return row[0]
def replay(self):
for row in self.oldsearch:
yield row[0], simplejson.loads(row[1]), row[2], row[3]
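Driving the class from Python would look something like this (a hedged sketch; the bounding box and the tag are made-up examples):

# Usage sketch for PoiDB; coordinates and tag are made up
db = PoiDB()
db.set_bbox(2.56, 2.90, 41.84, 42.00)
print db.count("amenity", "fountain")
for name, data, x, y in db.search("amenity", "fountain"):
    print name, x, y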
Problem 3 solved: now on to the next step, building a user interface for it.
Second step of my SoTM10 pet project: creating a searchable database with the points. What a fantastic opportunity to learn Spatialite.
Learning Spatialite is easy. For example, you can use the two tutorials with catchy titles that assume your best wish in life is to create databases out of shapefiles using a pre-built, i386-only executable GUI binary downloaded over an insecure HTTP connection.
To be fair, the second of those tutorials is called "An almost Idiot's Guide", thus making explicit the requirement of being an almost idiot in order to happily acquire and run software in that way.
Alternatively, you can use A quick tutorial to SpatiaLite which is so quick it has examples that lead you to write SQL queries that trigger all sorts of vague exceptions at insert time. But at least it brought me a long way forward, at which point I could just cross reference things with PostGIS documentation to find out the right way of doing things.
So, here's the importer script, which will probably become my reference example for how to get started with Spatialite, and how to use Spatialite from Python:
#!/usr/bin/python
#
# poiimport - import nodes from OSM into a spatialite DB
#
# Copyright (C) 2010 Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
import xml.sax
import xml.sax.handler
from pysqlite2 import dbapi2 as sqlite
import simplejson
import sys
import os
class OSMPOIReader(xml.sax.handler.ContentHandler):
'''
Filter SAX events in a OSM XML file to keep only nodes with names
'''
def __init__(self, consumer):
self.consumer = consumer
def startElement(self, name, attrs):
if name == "node":
self.attrs = attrs
self.tags = dict()
elif name == "tag":
self.tags[attrs["k"]] = attrs["v"]
def endElement(self, name):
if name == "node":
lat = float(self.attrs["lat"])
lon = float(self.attrs["lon"])
id = int(self.attrs["id"])
#dt = parse(self.attrs["timestamp"])
uid = self.attrs.get("uid", None)
uid = int(uid) if uid is not None else None
user = self.attrs.get("user", None)
self.consumer(lat, lon, id, self.tags, user=user, uid=uid)
class Importer(object):
'''
Create the spatialite database and populate it
'''
TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])
def __init__(self, filename):
self.db = sqlite.connect(filename)
self.db.enable_load_extension(True)
self.db.execute("SELECT load_extension('libspatialite.so')")
self.db.execute("SELECT InitSpatialMetaData()")
self.db.execute("INSERT INTO spatial_ref_sys (srid, auth_name, auth_srid,"
" ref_sys_name, proj4text) VALUES (4326, 'epsg', 4326,"
" 'WGS 84', '+proj=longlat +ellps=WGS84 +datum=WGS84"
" +no_defs')")
self.db.execute("CREATE TABLE poi (id int not null unique primary key,"
" name char, data text)")
self.db.execute("SELECT AddGeometryColumn('poi', 'geom', 4326, 'POINT', 2)")
self.db.execute("SELECT CreateSpatialIndex('poi', 'geom')")
self.db.execute("CREATE TABLE tag (id integer primary key autoincrement,"
" name char, value char)")
self.db.execute("CREATE UNIQUE INDEX tagidx ON tag (name, value)")
self.db.execute("CREATE TABLE poitag (poi int not null, tag int not null)")
self.db.execute("CREATE UNIQUE INDEX poitagidx ON poitag (poi, tag)")
self.tagid_cache = dict()
def tagid(self, k, v):
key = (k, v)
res = self.tagid_cache.get(key, None)
if res is None:
c = self.db.cursor()
c.execute("SELECT id FROM tag WHERE name=? AND value=?", key)
for row in c:
self.tagid_cache[key] = row[0]
return row[0]
self.db.execute("INSERT INTO tag (id, name, value) VALUES (NULL, ?, ?)", key)
c.execute("SELECT last_insert_rowid()")
for row in c:
res = row[0]
self.tagid_cache[key] = res
return res
def __call__(self, lat, lon, id, tags, user=None, uid=None):
# Acquire tag IDs
tagids = []
for k, v in tags.iteritems():
if k not in self.TAG_WHITELIST: continue
for val in v.split(";"):
tagids.append(self.tagid(k, val))
# Skip elements that don't have the tags we want
if not tagids: return
geom = "POINT(%f %f)" % (lon, lat)
self.db.execute("INSERT INTO poi (id, geom, name, data)"
" VALUES (?, GeomFromText(?, 4326), ?, ?)",
(id, geom, tags["name"], simplejson.dumps(tags)))
for tid in tagids:
self.db.execute("INSERT INTO poitag (poi, tag) VALUES (?, ?)", (id, tid))
def done(self):
self.db.commit()
# Get the output file name
filename = sys.argv[1]
# Ensure we start from scratch
if os.path.exists(filename):
print >>sys.stderr, filename, "already exists"
sys.exit(1)
# Import
parser = xml.sax.make_parser()
importer = Importer(filename)
handler = OSMPOIReader(importer)
parser.setContentHandler(handler)
parser.parse(sys.stdin)
importer.done()
Let's run it:
$ ./poiimport pois.db < pois.osm
SpatiaLite version ..: 2.4.0 Supported Extensions:
- 'VirtualShape' [direct Shapefile access]
- 'VirtualDbf' [direct Dbf access]
- 'VirtualText' [direct CSV/TXT access]
- 'VirtualNetwork' [Dijkstra shortest path]
- 'RTree' [Spatial Index - R*Tree]
- 'MbrCache' [Spatial Index - MBR cache]
- 'VirtualFDO' [FDO-OGR interoperability]
- 'SpatiaLite' [Spatial SQL - OGC]
PROJ.4 Rel. 4.7.1, 23 September 2009
GEOS version 3.2.0-CAPI-1.6.0
$ ls -l --si pois*
-rw-r--r-- 1 enrico enrico 17M Jul 9 23:44 pois.db
-rw-r--r-- 1 enrico enrico 37M Jul 9 16:20 pois.osm
$ spatialite pois.db
SpatiaLite version ..: 2.4.0 Supported Extensions:
- 'VirtualShape' [direct Shapefile access]
- 'VirtualDbf' [direct DBF access]
- 'VirtualText' [direct CSV/TXT access]
- 'VirtualNetwork' [Dijkstra shortest path]
- 'RTree' [Spatial Index - R*Tree]
- 'MbrCache' [Spatial Index - MBR cache]
- 'VirtualFDO' [FDO-OGR interoperability]
- 'SpatiaLite' [Spatial SQL - OGC]
PROJ.4 version ......: Rel. 4.7.1, 23 September 2009
GEOS version ........: 3.2.0-CAPI-1.6.0
SQLite version ......: 3.6.23.1
Enter ".help" for instructions
spatialite> select id from tag where name="amenity" and value="fountain";
24
spatialite> SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.56 AND xmax <= 2.90 AND ymin >= 41.84 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
Font Picant de la Cellera|{"amenity": "fountain", "name": "Font Picant de la Cellera"}|2.616045|41.952449
Font de Can Pla|{"amenity": "fountain", "name": "Font de Can Pla"}|2.622354|41.974724
Font de Can Ribes|{"amenity": "fountain", "name": "Font de Can Ribes"}|2.62311|41.979193
It's impressive: I've got all sorts of useful information for the whole of Spain in just 17MB!
Let's put it to practice: I'm thirsty, is there any water fountain nearby?
spatialite> SELECT count(1) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.80 AND xmax <= 2.85 AND ymin >= 41.97 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
0
Ouch! No water fountains mapped in Girona... yet.
Problem 2 solved: now on to the next step, trying to show the results in some usable way.
I have a pet project here at SoTM10: create a tool for searching nearby POIs while offline.
The idea is to have something in my pocket (FreeRunner or N900), which doesn't require an internet connection, and which can point me at the nearest fountains, post offices, atm machines, bars and so on.
The first step is to obtain a list of POIs.
In theory one can use Xapi but all the known Xapi servers appear to be down at the moment.
Another attempt is to obtain it by filtering all nodes with the tags we want out of a planet OSM extract. I downloaded the Spanish one and set to work.
First I tried with xmlstarlet, but it ate all the RAM and crashed my laptop: for some reason, on my laptop the Linux kernels up to 2.6.32 (I don't know about later ones) like to swap out ALL running apps to cache I/O operations, which means that heavy I/O operations swap out the very programs performing them, so the system gets caught in some infinite I/O loop and dies. Or at least this is what I've figured out so far.
So, we need SAX. I put together this prototype in Python, which can process a nice 8MB/s of OSM data for quite some time with a constant, low RAM usage:
#!/usr/bin/python
#
# poifilter - extract interesting nodes from OSM XML files
#
# Copyright (C) 2010 Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
import xml.sax
import xml.sax.handler
import xml.sax.saxutils
import sys
class XMLSAXFilter(xml.sax.handler.ContentHandler):
'''
A SAX filter that is a ContentHandler.
There is xml.sax.saxutils.XMLFilterBase in the standard library but it is
undocumented, and most of the examples using it you find online are wrong.
You can look at its source code, and at that point you find out that it is
an offensive practical joke.
'''
def __init__(self, downstream):
self.downstream = downstream
# ContentHandler methods
def setDocumentLocator(self, locator):
self.downstream.setDocumentLocator(locator)
def startDocument(self):
self.downstream.startDocument()
def endDocument(self):
self.downstream.endDocument()
def startPrefixMapping(self, prefix, uri):
self.downstream.startPrefixMapping(prefix, uri)
def endPrefixMapping(self, prefix):
self.downstream.endPrefixMapping(prefix)
def startElement(self, name, attrs):
self.downstream.startElement(name, attrs)
def endElement(self, name):
self.downstream.endElement(name)
def startElementNS(self, name, qname, attrs):
self.downstream.startElementNS(name, qname, attrs)
def endElementNS(self, name, qname):
self.downstream.endElementNS(name, qname)
def characters(self, content):
self.downstream.characters(content)
def ignorableWhitespace(self, chars):
self.downstream.ignorableWhitespace(chars)
def processingInstruction(self, target, data):
self.downstream.processingInstruction(target, data)
def skippedEntity(self, name):
self.downstream.skippedEntity(name)
class OSMPOIHandler(XMLSAXFilter):
'''
Filter SAX events in a OSM XML file to keep only nodes with names
'''
PASSTHROUGH = ["osm", "bound"]
TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])
def startElement(self, name, attrs):
if name in self.PASSTHROUGH:
self.downstream.startElement(name, attrs)
elif name == "node":
self.attrs = attrs
self.tags = []
self.propagate = False
elif name == "tag":
if self.tags is not None:
self.tags.append(attrs)
if attrs["k"] in self.TAG_WHITELIST:
self.propagate = True
else:
self.tags = None
self.attrs = None
def endElement(self, name):
if name in self.PASSTHROUGH:
self.downstream.endElement(name)
elif name == "node":
if self.propagate:
self.downstream.startElement("node", self.attrs)
for attrs in self.tags:
self.downstream.startElement("tag", attrs)
self.downstream.endElement("tag")
self.downstream.endElement("node")
def ignorableWhitespace(self, chars):
pass
def characters(self, content):
pass
# Simple stdin->stdout XMl filter
parser = xml.sax.make_parser()
handler = OSMPOIHandler(xml.sax.saxutils.XMLGenerator(sys.stdout, "utf-8"))
parser.setContentHandler(handler)
parser.parse(sys.stdin)
Let's run it:
$ bzcat /store/osm/spain.osm.bz2 | pv | ./poifilter > pois.osm
[...]
$ ls -l --si pois.osm
-rw-r--r-- 1 enrico enrico 19M Jul 10 23:56 pois.osm
$ xmlstarlet val pois.osm
pois.osm - valid
Problem 1 solved: now on to the next step: importing the nodes in a database.
I sometimes meet Italian programmers who prefer their systems to be in English, so that they get untranslated manpages and error messages.
I sometimes notice that their solution often leaves them something to complain about.
Some set LANG=C and complain they can't see accented characters. Some set LANG=en_US.UTF-8 and complain that OpenOffice Calc wants dates in the format MM/DD/YYYY, which is an Abomination unto Nuggan, as well as unto Me.
But the locales system can do much better than that. In fact, most times people would be extremely happy with LC_MESSAGES in English, and everything else in Italian:
$ locale
LANG=it_IT.UTF-8
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=
A way to do this (is there a better one?) is to tell the display manager (GDM, KDM...) to use the Italian locale, and then override the right LC_* bits in ~/.xsessionrc:
$ cat ~/.xsessionrc
export LC_MESSAGES=en_US.UTF-8
That does the trick: English messages, with Proper currency, Proper dates, accented letters sort Properly, Proper A4 printer paper, Proper SI units. Even Nuggan would be happy.
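A quick way to see the split from a script (a sketch using Python's standard locale module; it assumes the it_IT.UTF-8 locale has been generated on the system):

#!/usr/bin/python
# Sketch: LC_TIME can be set independently of LC_MESSAGES
import locale
import time

locale.setlocale(locale.LC_TIME, "it_IT.UTF-8")
print time.strftime("%A %d %B %Y")   # prints the date with Italian day and month names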
Does it happen to you that you cp a big, big file (say, similar in order of magnitude to the amount of RAM) and the system becomes rather unusable?
It looks like Linux is saying "let's cache this", and as you copy it will try to free more and more RAM in order to cache the big file you're copying. In the end, all the RAM is full of file data that you are not going to need.
This varies according to how /proc/sys/vm/swappiness is set.
I learnt about posix_fadvise and I tried to play with it. The result is this preloadable library that hooks into open(2) and fadvises everything as POSIX_FADV_DONTNEED.
It is all rather awkward. Calling fadvise in that way will discard existing cache pages if the file is already cached, which is too much. Ideally one would like to say "don't cache this because of me" without stepping on the feet of other system activities.
Also, I found I need to also hook into write(2) and run fadvise after every single write, because you can't fadvise a file to be written in its entirety unless you pass fadvise the file size in advance. But the size of the output file cannot be known by the preloaded library, so meh.
So, now I can run nocache cp bigfile someplace/ without trashing the existing caches. I can also run nocache tar zxf foo.tar.gz and so on.
I wish, of course, that there were no need to do so in the first place.
Here is the nocache library source code, for reference:
/* * nocache - LD_PRELOAD library to fadvise written files to not be cached * * Copyright (C) 2009--2010 Enrico Zini <enrico@enricozini.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #define _XOPEN_SOURCE 600 #include <fcntl.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <dlfcn.h> #include <stdarg.h> #include <errno.h> #include <stdio.h> typedef int (*open_t)(const char*, int, ...); typedef int (*write_t)(int fd, const void *buf, size_t count); int open(const char *pathname, int flags, ...) { static open_t func = 0; int res; if (!func) func = (open_t)dlsym(RTLD_NEXT, "open"); // Note: I wanted to add O_DIRECT, but it imposes restriction on buffer // alignment if (flags & O_CREAT) { va_list ap; va_start(ap, flags); mode_t mode = va_arg(ap, mode_t); res = func(pathname, flags, mode); va_end(ap); } else res = func(pathname, flags); if (res >= 0) { int saved_errno = errno; int z = posix_fadvise(res, 0, 0, POSIX_FADV_DONTNEED); if (z != 0) fprintf(stderr, "Cannot fadvise on %s: %m\n", pathname); errno = saved_errno; } return res; } int write(int fd, const void *buf, size_t count) { static write_t func = 0; int res; if (!func) func = (write_t)dlsym(RTLD_NEXT, "write"); res = func(fd, buf, count); if (res > 0) { int saved_errno = errno; int z = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); if (z != 0) fprintf(stderr, "Cannot fadvise during write: %m\n"); errno = saved_errno; } return res; }
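The nocache wrapper itself is not shown; presumably it is little more than a script that sets LD_PRELOAD to the compiled library before exec'ing its arguments. Building the library should be something along the lines of gcc -shared -fPIC -o nocache.so nocache.c -ldl (and RTLD_NEXT may additionally need _GNU_SOURCE defined, depending on the libc).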
Updates
Steve Schnepp writes:
Robert Love did an O_STREAMING patch for 2.4. It wasn't merged in 2.6, since POSIX_FADV_NOREUSE should be used instead.
But unfortunately it's currently mapped either as WILLNEED or as a no-op.
It seems that a Google Code project has been spawned to address this.