Last month I mentioned I wanted to import a bunch of notes from my old PIM into Tomboy, but expected a lot of copying and pasting busywork since I didn’t know how to do a mass import. Fortunately, a real live Tomboy developer dropped by to clue me in on the D-Bus interface, which I could use from Python to script something up. (Thanks, Sandy!)

With that starting point, my first search turned up a great Ars Technica article by Ryan Paul that gave me all the information I needed: “Using the Tomboy D-Bus interface.” Read it for an explanation of what D-Bus is about, and some good tips for using Tomboy’s API.

In this post I’m just going to focus on the simple task of loading a bunch of flat files into Tomboy, with some elaboration on character set issues I ran into along the way.

(Python was great for this stuff. I’m just getting started with learning the language, but was able to experiment and figure out a lot of things in the interactive shell on the way to the rewarding dbus.Boolean(True) in response to tomboy.SetNoteContents(note, s).)

Tomboy?

Tomboy is a popular (and awesome) GNOME note-taking application for GNU/Linux, by the way, which might have been helpful to mention earlier for the readers who have dropped out by now because they had no idea what I was talking about. (Although I guess they might have clicked on a link or two.) For those of you who remain, obviously you are familiar with the program and just want me to get on with it.

The D-Bus interface is available in version 0.8, which is included with Ubuntu 7.10/Gutsy Gibbon. I have 7.04/Feisty Fawn and Tomboy 0.6.3 on my main machine, but was able to do the import on my 7.10 laptop and then manually copy the *.note files into ~/.tomboy for the older version, with no apparent problems.

Get the magic tomboy object

From Ars:

import dbus, gobject, dbus.glib
import os

# get the d-bus session bus
bus = dbus.SessionBus()
# access the tomboy d-bus object
obj = bus.get_object("org.gnome.Tomboy", "/org/gnome/Tomboy/RemoteControl")
# access the tomboy remote control interface
tomboy = dbus.Interface(obj, "org.gnome.Tomboy.RemoteControl")

(I added import os myself, for the file system stuff below.)

Import your files

My meager contribution (which I’m placing in the Public Domain for simplicity’s sake):

# some directory/folder...
path = os.path.expanduser('~/Desktop/notable-files/')

dirlist = os.listdir(path)
dirlist.sort()

for fname in dirlist:
	print(fname)
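	f = open(os.path.join(path, fname))
	# d-bus complains if string params aren't valid UTF-8
	# (strip() drops the trailing \r\n or \n so it stays out of the title)
	title = unicode(f.readline(), 'iso8859_1').strip()

	# reset to start of file and read whole file
	f.seek(0)
	s = f.read()
	f.close()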

	# replace left and right curly single quotes with '
	s = s.replace('\x91', "'").replace('\x92', "'")
	# replace left and right curly double quotes with "
	s = s.replace('\x93', '"').replace('\x94', '"')
	# replace en and emdash with --
	s = s.replace('\x96', '--').replace('\x97', '--')

	s = unicode(s, 'iso8859_1')

	# creating named notes seems to prevent notes
	#   from showing up as "New Note NNN"
	note = tomboy.CreateNamedNote(title)
	tomboy.SetNoteContents(note, s)

Notes about the notes

My files happened to be in a state where I could use the first line of the file as a title. You might alternatively use the name of the file as the title, but make sure to prepend it to the data when setting the note contents, probably followed by \n\n.
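If your files have no usable first line, the file-name approach could look something like this minimal sketch. The helper name note_text_from_filename is hypothetical (it is not part of Tomboy or the Ars article), and dropping the file extension is just my assumption about what makes a nice title. It is written so that it also runs under Python 3.

```python
import os


def note_text_from_filename(fname, body):
    """Return (title, contents) for a note titled after its file name.

    Hypothetical helper: `body` is the already-decoded text of the file.
    """
    # "groceries.txt" -> "groceries" (assumption: the extension isn't wanted)
    title = os.path.splitext(os.path.basename(fname))[0]
    # SetNoteContents treats the first line as the title, so put it first,
    # followed by a blank line before the real content
    contents = title + u'\n\n' + body
    return title, contents


title, contents = note_text_from_filename('groceries.txt', u'milk\neggs\n')
print(repr(title))
print(repr(contents))
```

You would then pass title to tomboy.CreateNamedNote and contents to tomboy.SetNoteContents, as in the loop above.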

SetNoteContents will overwrite the title with the first line of the data passed to it, so it may seem redundant to first set the title with CreateNamedNote and then set contents whose first line is the same as the title. In my experience, though, setting note contents after CreateNote results in notes named something like “New Note 539,” and this doesn’t get corrected even after restarting Tomboy. I’ve seen other odd behavior in 0.6 with note titles, where a note shows up as “New Note #” in search results even though the first line is different. I’ve had to change the title to force the correct display in listings.

Character set stuff

So what’s the deal with the string replacements and unicode conversions?

When I first tried importing my four hundred files, I ran into an error like this:

>>> tomboy.SetNoteContents(note, s)
ERROR:dbus.connection:Unable to set arguments (dbus.String(u'note://tomboy/afe
70879-5b43-455d-8a28-352ff4c3d806'), 'char test \n\nI\x92m testing stuff.\n\n
\x93Blah blah blah\x94. \n') according to signature u'ss':
<type 'exceptions.UnicodeError'>: String parameters to be sent over D-Bus must be valid UTF-8
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/python-support/python2.5/dbus/proxies.py", line 135, in __call__
    **keywords)
  File "/var/lib/python-support/python2.5/dbus/connection.py", line 593, in call_blocking
    message.append(signature=signature, *args)
UnicodeError: String parameters to be sent over D-Bus must be valid UTF-8

That’s from a test file I created later, but the first instance of this had to do with a file that contained the “Registered” trademark symbol. These were files I had originally created in Windows, so I poked around and learned something about converting to Unicode. It seemed likely that my Windows files were ISO-8859-1. The R symbol showed up as a hex escape in a Python string: '\xae'. I could get it to print correctly with print u'\xae'. To convert a string variable holding the whole file, I found that the conversion unicode(s, 'iso8859_1') worked out for the Registered and Copyright symbols. (A table of Python standard encodings was helpful.)

The conversion to Unicode worked fine for the Registered and Copyright symbols, but not so great for curly quotes. They went through D-Bus without complaint, but turned into these funny little boxes when viewed in Tomboy:
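To make the failure and the fix concrete, here is a small sketch that doesn’t need Tomboy or D-Bus at all. The byte 0xAE on its own is not valid UTF-8, which is what D-Bus was objecting to; decoding it as ISO-8859-1 gives the Registered sign as a proper Unicode character. The sample bytes are made up for illustration, and raw.decode('iso8859_1') is the same conversion as unicode(raw, 'iso8859_1'), spelled so that it also runs under Python 3.

```python
# made-up sample, standing in for bytes read from one of the files
raw = b'Acme\xae Widgets \xa9 2007'

# The raw bytes are not valid UTF-8, hence the D-Bus complaint
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decoding as ISO-8859-1 maps each byte to the code point of the same number
text = raw.decode('iso8859_1')

# Encoded as UTF-8, each of the two symbols becomes a two-byte sequence
as_utf8 = text.encode('utf-8')

print(valid_utf8)
print(repr(text))
print(repr(as_utf8))
```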

[Screenshots: “Tomboy D-Bus Import Odd Chars” and “Tomboy D-Bus Import Odd Chars Enlarged”]

With the enlarged view, you can see the numbers associated with this character set mismatch. So, a single right curly quote (otherwise known as an apostrophe) is 92, whatever that means.

Let’s look at the Tomboy .note XML file.

  • With cat and Python interactive print, these characters show up as blanks.
  • In vi, the apostrophe shows up as <92>. (With the other squares following suit.)
  • In Python interactive mode “non” print (e.g. >>> s): \xc2\x92.
  • Out of curiosity, I later copied a curly apostrophe from a web page and pasted it into Tomboy (so, bypassing D-Bus and removing Windows-created files from the equation) and it shows up in Python as \xe2\x80\x99.

Rather than dig further into this behavior, I added the replace statements in the code above.
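For what it’s worth, the byte patterns in the list above are consistent with the files being Windows-1252 (Python’s cp1252 codec) rather than true ISO-8859-1. That is my reading of the symptoms, not something verified against the original files. In Windows-1252 the byte 0x92 is the right curly quote, while decoding it as ISO-8859-1 produces the invisible control character U+0092. A sketch of the difference:

```python
raw = b'I\x92m testing stuff.'  # the apostrophe byte from the test file

# ISO-8859-1: 0x92 becomes control character U+0092 (the little box)
as_latin1 = raw.decode('iso8859_1')
# Windows-1252: 0x92 becomes U+2019, RIGHT SINGLE QUOTATION MARK
as_cp1252 = raw.decode('cp1252')

# UTF-8 bytes of just the apostrophe character, for comparison with
# what showed up in the .note files
print(repr(as_latin1[1].encode('utf-8')))
print(repr(as_cp1252[1].encode('utf-8')))
```

The first result is the \xc2\x92 found in the imported note, and the second is the \xe2\x80\x99 from the pasted apostrophe. If the guess about the encoding is right, decoding with 'cp1252' would keep the curly quotes and dashes instead of flattening them. One caveat: cp1252 leaves a handful of bytes (0x81, for example) undefined, so decoding can raise UnicodeDecodeError on files that contain them.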

Something else to look out for: if your string (“s”) is already unicode, you may get an error like this:

>>> s.replace('\x92', 'YYZ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 0: ordinal not in range(128)

In that case, you want this instead: s.replace(u'\x92', 'YYZ').
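If you would rather do the replacements after the unicode conversion, one option is to keep the pairs in a table and use unicode literals throughout. This is a sketch under assumptions: the function name straighten_punctuation is made up, and it expects text that was already decoded as ISO-8859-1, so the Windows characters appear as code points U+0091 through U+0097.

```python
# Code points produced when Windows curly quotes and dashes
# are decoded as ISO-8859-1, mapped to plain ASCII stand-ins
REPLACEMENTS = [
    (u'\x91', u"'"),   # left single quote
    (u'\x92', u"'"),   # right single quote / apostrophe
    (u'\x93', u'"'),   # left double quote
    (u'\x94', u'"'),   # right double quote
    (u'\x96', u'--'),  # en dash
    (u'\x97', u'--'),  # em dash
]


def straighten_punctuation(text):
    """Replace the stray control characters in already-decoded text."""
    for bad, good in REPLACEMENTS:
        text = text.replace(bad, good)
    return text


sample = b'I\x92m testing stuff. \x93Blah blah blah\x94.'.decode('iso8859_1')
print(straighten_punctuation(sample))
```

Because both sides of every replace are unicode, Python 2 never has to guess at an encoding, which is what triggered the 'ascii' codec error above.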

Finally, here is another screenshot demonstrating what things might look like in Python interactive mode when experimenting with this stuff:

[Screenshot: “Character Set Mismatches, Unicode, ISO-8859-1”]

Question marks are a common placeholder for character set hiccups. I’ve also experienced some headaches with Windows filenames that didn’t cross over to GNU/Linux very well, with ??? as a symptom.

I don’t know if this post should properly be categorized in the internationalization (“i18n”) department, but I’m going to use those terms to potentially ensnare future searchers, in the hopes that this may be of some benefit to them/you. :-)