WikiTaxi: use a local copy of Wikipedia

Accessing Wikipedia is easy when you're online, but have you ever wanted to take it along with you for off-line situations? I have, and there is a lovely program to do so for Windows: WikiTaxi. You don't have to install the program; just extract it from its 7-Zip archive and put it in a convenient location somewhere on your Windows drive.

After downloading one of the page dumps from Wikimedia, you use the included importer program to convert the compressed page dump file (e.g. enwiki-.....-pages-articles.xml.bz2) to WikiTaxi's format (SQLite). That takes a while: on my system the importer ran for just over an hour to translate the full Wikipedia compressed XML source (3.3 GB), resulting in a 5.89 GB file.


WikiTaxi is well documented, and it is fun to use. The only things missing are Wikipedia's images, but those are difficult to acquire.

The Berkeley DB Book

Berkeley DB (BDB) has certainly grown up over the years, from a simple key/value database to a key/value database with transactional capabilities, replication and more. I read The Berkeley DB Book by Himanshu Yadava (Apress), and I was fascinated. The book is very well written; it is always clear, and the author discusses all programming facets of BDB and, where appropriate, compares Berkeley DB to relational database systems, which is a great help if you are used to the latter.

The book is for programmers, and as such, it has a ton of code, which is great. The author writes "Using just C in my code examples would have addressed the largest subsection of programmers", and I agree: I was a bit disappointed that coding is mainly in C++, but chapter 12 shows you the C API (and if you want Java, its API is discussed in chapter 11).

From why and when to use Berkeley DB, through building simple applications and an in-depth discussion of data stores, the book goes on to advanced operations (cursors, duplicate keys, joins, etc.) with plenty of good examples. Chapter 8 covers replication: the architecture and APIs provided by BDB, with, as always, good code samples. Just before discussing the BDB utilities (the command-line tools: db_stat, db_checkpoint, etc.), Himanshu Yadava goes into great detail on building distributed transactions and discusses strategies you can apply.

Anyone interested in data storage should read this book, if only to remember that your multi-CPU, terabyte relational database is great, but that there are still a myriad applications that would greatly profit from a smaller, light-weight and blindingly fast database system like Berkeley DB. The Berkeley DB Book belongs on your bookshelf.

Migrate to PostgreSQL now!

After Sun's purchase of MySQL, I expected something like this:

MySQL will start offering some features only in MySQL Enterprise.

If you haven't yet, it is time to start migrating to a fully Open-Sourced RDBMS that also works on multiple platforms: PostgreSQL.

via /.

Every three or four years

Every three or four years, depending on how a company writes off its hardware, you have machines to replace. Now, replacing a box with a few cables on it isn't hard: you rip the old cords out of the wall and throw the lot on the dump. After that, you place the new boxes in your data centre and plug all the cords into their respective sockets.

But that isn't quite all there is to it, is it?

You then install a base operating system. If you are lucky, nothing much has changed and you load backup tapes (or whatever media you've used) and restore from that. If you aren't so lucky (as happened to me), the machine you are replacing wasn't quite, shall we say, up to date.

In that case, it is more or less a start from scratch kind of operation. Software has changed, you decide to use a different IMAP server, the MTA configuration needs tweaking, Apache's authentication modules have changed (for the, it must be, trillionth and a half time), etc., etc., etc.

Oh, well, I'm all done.

Well: not quite. There is still half a load of utilities and stuff that need recompiling (new version of GCC, you know), but I should be getting there soon.

I hope. :-)

MySQL UDF and LDAP

User-defined functions are compiled as shared object files and then added to and removed from the MySQL server dynamically.

I hacked up a small test to demonstrate their implementation, although I'll only show you the results here.


CREATE TABLE u ( username varchar(20) );
INSERT INTO u VALUES ('jpm');
SELECT * FROM u;
+----------+
| username |
+----------+
| jpm      |
+----------+

So far, nothing special, but now for something completely different:


CREATE FUNCTION ldapcn RETURNS STRING SONAME 'libudf_jp2.so';

The CREATE FUNCTION loads the shared object file into the server's address space, where it remains available until the function is dropped from the data dictionary.


SELECT username, LDAPCN(username) AS cn FROM u;
+----------+---------------+
| username | cn            |
+----------+---------------+
| jpm      | Jan-Piet Mens |
+----------+---------------+

During the SELECT, MySQL invokes my UDF, passing it the string argument. The function goes off and searches for the user in an LDAP directory tree and returns the user's Common Name, which MySQL then uses as the value of the cn column.

Powerful.

More on MySQL UDF.

Time for commitments

Spreading source code and configuration files all over the show is a Bad Thing™, which is why I'm setting up a Subversion repository. For those who don't know, Subversion is a version control system which runs either standalone or under an Apache web server (or for very poor folk, on a file system :-) ).

Subversion comes pre-packaged for a number of platforms. A pretty Windows-Explorer-integrated GUI is available in the form of TortoiseSVN for MS Windows platforms.

When running under Apache, Subversion can utilize any of the sundry authentication modules that Apache has to offer, and it can also be restricted to use SSL, including using SSL client certificates for access control.

Subversion's documentation is excellent, and includes the Subversion book in a variety of formats.

BIND DLZ

Browsing around in the source tree of ISC's BIND 9.4.1 name server, I noticed a directory called dlz/ in the contrib directory. It contains a patch named Bind DLZ, or Dynamically Loadable Zones, a feature-rich implementation sponsored by NLnet that allows data (including new zones!) served by a BIND name server to be modified without reloading or restarting it (something that many people who serve a large number of zones hate to do because of BIND's rather long startup time).

Bind DLZ supports a number of backends including Berkeley DB, PostgreSQL, MySQL and LDAP, and it doesn't impose a schema on the LDAP backend; theoretically I can use almost any schema, as long as I observe some rules. Quite interesting is the possibility to limit zone transfers (AXFR) by adding an object to the directory:

dn: dlzrecordid=0,dlzZoneName=mens.de,o=dns
dlzrecordid: 0
objectclass: dlzxfr
dlzIPAddr: 127.0.0.1
dlzIPAddr: 192.168.1.173
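
If several zones need such an object, the LDIF can be generated rather than typed. What follows is only a minimal Python sketch: the function name `dlzxfr_entry` and its parameters are hypothetical, and the attribute names are simply copied from the entry above, so they assume the same schema and the same o=dns base.

```python
import ipaddress


def dlzxfr_entry(zone, addresses, record_id=0, base="o=dns"):
    """Return LDIF text for one dlzxfr object listing the hosts allowed to AXFR."""
    lines = [
        f"dn: dlzrecordid={record_id},dlzZoneName={zone},{base}",
        f"dlzrecordid: {record_id}",
        "objectclass: dlzxfr",
    ]
    for addr in addresses:
        # Raises ValueError for anything that is not a valid IP address.
        ipaddress.ip_address(addr)
        lines.append(f"dlzIPAddr: {addr}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(dlzxfr_entry("mens.de", ["127.0.0.1", "192.168.1.173"]), end="")
```

Run as a script, this prints the same five lines as the hand-written entry above.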

Bind DLZ comes with an impressive set of performance tools including a data set with 2,697,736 domains which can be used to test the configuration. I used dnsCSVDataReader.pl to convert those to an LDIF with which I could load my slapd. This config file did the job:

inputfile: dns_data-1.0.csv
writer: binddlz::writers::ldap::file
file: dnsin.ldif
base: o=dns

I was aware of the LDAP SDB back-end patch for BIND 9, which works very well, but that only allows individual zones to be retrieved from an LDAP directory.

BIND DLZ looks very interesting indeed.

e-Mail, where art thou from?

Have you ever been curious as to which country your e-mail messages are sent from? Simply looking at the domain doesn't always help: the ubiquitous .com domains, for example, don't necessarily originate from the US. I am interested, so I have implemented a Geo-IP lookup on the sender's host address. To make the result visible in e-mail clients, I have the mail server add a custom header containing the two-letter ISO 3166 country code to incoming messages.

How is a country determined from a simple IP address? There are providers of databases which can map an IP address to a country with varying degrees of precision, depending on what I'm willing to pay for the service. One of these is MaxMind: they offer subscription services to their premium data, but they also have a database named GeoLite Country which is free to use. Together with their snazzy little C API, these two offerings form the basis with which to implement the lookup.

Download and install the C API to get the database and the library on your system. As soon as the library and its corresponding database are installed, make a point of retrieving the latest version of the database. Instructions on how to do this are contained in the file data/README. That is it. The system can be tested with the geoiplookup program.

$ geoiplookup  211.197.114.120
GeoIP Country Edition: KR, Korea, Republic of
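
If you want to use that output from a script, the line splits easily into a country code and a name. The following is a minimal Python sketch, not part of MaxMind's tooling: the names `parse_geoiplookup` and `lookup_country` are made up for illustration, and the pattern assumes the output format shown above.

```python
import re
import subprocess

# Matches lines such as "GeoIP Country Edition: KR, Korea, Republic of".
_LINE = re.compile(r"^GeoIP Country Edition: ([A-Z]{2}), (.+)$")


def parse_geoiplookup(line):
    """Return (code, name) from one line of geoiplookup output, or None."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def lookup_country(ip):
    """Run geoiplookup (assumed to be on the PATH) and parse its first line."""
    result = subprocess.run(
        ["geoiplookup", ip], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return parse_geoiplookup(lines[0]) if lines else None
```

Any line that does not carry a two-letter code followed by a name yields None, so the caller can decide what to do with addresses the database doesn't know.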

One way to determine the countries of origin of messages is to scan through mail logs to collect a list of IP addresses with which to feed geoiplookup. Boring! I want it done on the fly. :-)

Now to the MTA.

Since version 4.51, Exim supports calling external C functions from loadable modules. These are loaded dynamically by Exim and the result of the custom function can be used within a string expansion. Follow the instructions for enabling dlfunc in Local/Makefile to the letter, or else the loadable modules won't work. If your version of Exim has support for dlfunc, the loadable module is instantiated and invoked with a call to a string expansion of the form ${dlfunc{file.so}{f}{a0…}}, where file.so is the full path to your shared object, f the name of the C function to be called, and a0 an optional argument of which there can be a maximum of eight.
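
To make the shape of that expansion concrete, here is a small Python sketch that assembles such a string and enforces the eight-argument limit mentioned above. The helper name `dlfunc_expansion` is hypothetical; it only builds text for a configuration file and does not talk to Exim.

```python
def dlfunc_expansion(shared_object, function, *args):
    """Build an Exim ${dlfunc{file.so}{f}{a0}...} expansion string."""
    if len(args) > 8:
        raise ValueError("dlfunc accepts at most eight arguments")
    # Each part is wrapped in its own pair of braces, in order.
    parts = [shared_object, function, *args]
    return "${dlfunc" + "".join("{" + part + "}" for part in parts) + "}"


if __name__ == "__main__":
    print(dlfunc_expansion("/path/to/file.so", "f", "a0"))
```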

David Saez created a few of these custom functions, including one aptly named ip2country which utilizes MaxMind's C API to query the Country database. A small warning: don't blindly trust the Makefile's install target: you'll probably want to uncomment the second line.

I've inserted a call to the function in the smtp_data ACL, like so. I have a condition which ensures the function is only called on messages originating on the Internet and not from internal MXen, but I leave that as an exercise to the reader:

warn condition = ${if ...  }{0}{1}}
     add_header = X-senderGeoIP: ${dlfunc{/usr/exim/bin/exim-ext.so}{ip2country}{$sender_host_address}}

This causes the specified header to be added to incoming messages (in some older Exim versions, use message instead of add_header). In MUAs that support it, I can see the header nicely.

X-senderGeoIP: DE

Because I access most of my mail with Mutt, I want to view the country code in the index. Mutt has support for the non-standard X-Label header, which can be made visible with the %y switch in the $hdr_format variable. Procmail filters my messages, and I'll coerce it to add an X-Label: header with the value of X-senderGeoIP. First a wee bit o' .procmailrc:

# Grab the country (Geo-IP)
XCOUNTRY=`formail -cx "X-senderGeoIP"`

# If the message has an X-senderGeoIP header, set the X-Label
# header on it, so that Mutt can display it in its indices.

:0 fhw
* ^X-senderGeoIP:.*
| formail -i "X-Label: $XCOUNTRY"
 

Download this code: sundry/procmail/x-sendergeoip
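
For readers without Procmail, the same step can be sketched in Python with the standard email package. This is an illustration under assumptions, not the recipe above translated line for line: the function name `label_with_country` is made up, and renaming an existing X-Label to Old-X-Label imitates what formail -i does.

```python
from email import message_from_string


def label_with_country(raw_message):
    """Copy the X-senderGeoIP value into X-Label; return the message text."""
    msg = message_from_string(raw_message)
    country = msg.get("X-senderGeoIP")
    if country is None:
        # No country header: leave the message untouched.
        return raw_message
    # Keep any previous X-Label values under an Old- prefix.
    previous = msg.get_all("X-Label", [])
    del msg["X-Label"]
    for value in previous:
        msg["Old-X-Label"] = value
    msg["X-Label"] = country.strip()
    return msg.as_string()
```

Messages without an X-senderGeoIP header pass through unchanged, just as the recipe's condition line skips them.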

You'll be asking yourself why I didn't directly use the X-Label header when determining the country code in the MTA. Good question. I don't want to clobber a header that other MUAs might be using, preferring to clobber it only on my own mail.

I like the result: here is a partial index of my current spam shown in Mutt with a hdr_format setting of "%4C %Z %{%b %d} %-15.15F (%4c) [%?Y?%-3.3y&---?] %s":

[Screenshot: Mutt]

Now to Lotus Notes.

The Inbox folder's design needs to be updated to show the content of the header we inserted. Unfortunately, that appears to be easier said than done, as I failed miserably at it. I solved it in a rather convoluted way by having a LotusScript agent set the field before new mail arrives.

Sub Initialize
        ' XsenderGeoIP. Agent, before new mail arrives. Had to do this
        ' because I can't get at the original field with @Formula.
        Dim s As New NotesSession
        Dim db As NotesDatabase
        Dim doc As NotesDocument
        Dim mime As NotesMIMEEntity
        Set db = s.CurrentDatabase
        Dim geo As String
        s.ConvertMIME = False ' Do not convert MIME to rich text
        Set doc = s.DocumentContext
       
        With doc
                Set mime = doc.GetMIMEEntity
                If Not(mime Is Nothing) Then
                        Dim filter(1) As String
                        filter(0) = "X-senderGeoIP"
                       
                        geo = mime.GetSomeHeaders(filter, True)
                        geo = Right(geo, 2)
                        doc.xxgeo = geo
                        Call doc.save(True, False)
                End If
        End With
        s.ConvertMIME = True ' Restore MIME conversion
End Sub

Download this code: lotus/notes-xsendergeoip.lss

The final result is then:

[Screenshot: Lotus Notes]

E-mail with added value for three of us: me, myself and I. :-)