Archive for the 'Python' Category

Textual Informatics

Tuesday, April 14th, 2009

I’d like to talk a little about some of the techniques involved in extracting information from text, or apprehending text as information. I won’t be demonstrating anything terribly sophisticated – in fact, my examples to begin with will all be expressible using small bits of Javascript – but I’d like to expose some of the rude mechanics of textual informatics, and try to build up a vocabulary which later on we might use to talk about some more serious and significant issues. I’m interested in seeing whether and how Franco Moretti’s proposals for a “sociologisation” of literary studies can be distinguished from an “informationalisation” or “informaticisation” of the field, for example. I’d also like to get a better grasp on bioinformatics, which has been creeping up on everyone for a while now; my uninformed guess is that strings of unicode characters and strings of DNA codons are, from an informational perspective, not very significantly unalike. But this might turn out to be true only for quite large values of “not very significantly”.

A “rambler” is a text generator that treats a piece of input text as a series of linked pairs of words. In the preceding sentence, “A” and “rambler” make the first such pair, “rambler” and “is” make the second, and so on. The word “a” is paired in this way with the words “rambler”, “text”, “piece” and “series”, while the word “text” is paired with “generator” and “as”. The rambler starts on a random word somewhere in the text, and moves from that word to any of the words it is paired with. Here’s a (very boring) 100-word ramble around that sentence:

series of linked pairs of words. pairs of linked pairs of linked pairs of linked pairs of linked pairs of linked pairs of words. text as a series of input text generator that treats a series of words. input text as a text generator that treats a text as a text generator that treats a piece of input text as a text generator that treats a piece of words. a series of input text as a series of input text generator that treats a series of input text generator that treats a text generator that treats a series of input

(if the rambler hits the last word of its input text, and that word has no other word that it is paired with, then it just jumps randomly to any word it likes).

Rambling around the whole of this text so far, we get:

…expose some more serious and so on. The rambler hits the first such pair, “rambler” make the word of DNA codons are, from an “informationalisation” or apprehending text generator that treats a “sociologisation” of linked pairs of linked pairs of words. In the word it likes). just jumps randomly to get a little about some more serious and strings of DNA codons are, from an “informationalisation” or “informaticisation” of input (if the preceding sentence, “A” and “series”, while the techniques involved in this might turn out to get a “sociologisation” of unicode characters and so on. The rambler hits the field…

It’s nonsense, but bits of it hint in the direction of sense – the rambler is pulling apart and recombining its input text, following a graph of associations extracted from the text itself:

fool1

What the rambler code actually does is to scan the text once, constructing this graph as a data structure, and then perform a random walk around that structure – it decomposes the text into a dictionary of terms (labelled vertices in the graph), and a set of linked references (edges in the graph) between entries in that dictionary. The linear form of the source text is completely exploded – in fact, we could not faithfully reconstruct the original text from the rambler’s representation of it (although its random walk might accidentally output it once in a blue moon).

The rambler is a diverting toy – it “understands” nothing about the text it consumes, or the text it produces – but its output is more syntactically regular than a completely random jumbling of its input terms because it is informed by – or uses as information – the way that the rules of syntax mean that particular kinds of words tend to be paired together in that input. It is also, more subtly, informed by the frequency with which particular nouns and adjectives, or particular “flavours” of word, appear in combination – the nonsensical ramble over the text of Cold World in the preceding post retains quite a bit of the flavour of the original. Try it on something else – a good long LCC post, for instance – and you’ll see what I mean.

For the next one of these, I’ll be discussing an algorithm for finding the longest common substring of two strings…

All Reflections Drained

Thursday, April 9th, 2009

Xasthur’s latest release is in no readily justifiable sense a black metal album, yet it clearly belongs in a continuum with Malefic’s previous work. For much of the time it sounds like a collection of instrumental out-takes from the preceding full-length release, Defective Epitaph, performed in a ruined cathedral by an orchestra of sighing zombies. The subtracted element is Malefic’s ghastly, searing howl, which makes only occasional appearances: for the most part, the vocals range between indistinct hissing, tormented groans, eerie sussurations and a quavery baritone that calls to mind the Tindersticks’ Stuart Staples mumbling into the folds of his tuxedo. There’s a sample (“I perceived everything and everyone in existence to be dead – including myself…”) which sounds a lot like Gwyneth Paltrow doing Sylvia Plath (as heard on Wrest’s Lurker of Chalice) but turns out, if I’m not mistaken, to be from this oddity:

Otherwise the invariants of Xasthur’s musical scheme are all in place: dominant minor ninths, repetition, spooky organs, repetition, guitars treated to sound like rusty steel drums, repetition, everything out of tune with everything else. Malefic plays the drums throughout, with a sort of commendable doggedness – he’s clearly not one to give up an instrument just because he’s crap at it (as witness also his continued agonizing over the cello). The titles you could make up for yourself: “obfuscated in oblivion”, “masquerade of incisions”, “maze of oppression”. (Here’s some for the next album: “catacomb of splinters”, “shroud of desolation”, “cipher of emptiness”, “errancy of the void”). It could all hardly be any drearier, although if you play the title track at about twice the speed the vocals take on a sort of campy urgency which is quite enjoyable in its own way.

The System Of The World

Friday, January 23rd, 2009

A partial order is a relation that is:

  • Reflexive: Every element is “before” itself.
  • Transitive: if P is before Q, and Q is before R, then P is before R.
  • Antisymmetrical: If P is before Q, and Q is before P, then P is equal to Q.

A partial order is not a total order, because for any P and Q, it is not necessarily the case that either P is before Q or Q is before P.

We can represent a partial order with a directed, acyclic graph:

diag1

The circled letters in this graph are “elements”, and an arrow between two elements P and Q means that P is before Q. The transitive rule means that if you can get from P to S by going via R, then P is before S – we don’t show the arrow from P to S explicitly, and neither do we show the arrow from each element to itself.

The powerset of a set is the set of all the parts (subsets) of that set; for example, the powerset of {a, b, c} is {{}, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}}. The relation “is a part of” over the elements of this powerset is a partial order:

diag2

Now, suppose we assign a (prime) number to each of a, b and c: a=2, b=3, c=5. To each of the sets in the powerset of {a, b, c}, we can then assign a value by multiplying together the values assigned to its elements. The empty set {}=1, {a}=2, {a, b}=a * b = 6, {a, b, c}=a * b * c=30:

diag3

The relation “is a part of” can now be mapped to the relation “is a factor of”: 2 and 3 are factors of 6, 3 and 5 are factors of 15, 3 and 10 are factors of 30 and so on. Note that this wouldn’t have worked if we’d used non-prime numbers to number a, b and c: if we’d numbered them a=2, b=3 and c=4, that would have implied that {a} was a part of {c}, because 2 is a factor of 4.

Let’s call our “universe” set {a, b, c} U, and its powerset P, Now, we can find the complement in U of any set in P by dividing the value of U (30) by the value assigned to that set. For example, the complement in U=30 of {b, c}=15 is {a}=30 / 15 = 2. What’s more, the complement in {b, c}=15 of {c}=5 is {b}=15 / 5 = 3.

What about the intersection of two sets? Let U be {a, b, c, d}, and the “universal value” be 2 * 3 * 5 * 7=210. The intersection of {a, b, c}=30 and {b, c, d}=105 will be the set valued with the product of the common prime factors of 30 and 105, which are 3 and 5. 3 * 5 = 15, the value of {b, c}.

To find the union of {a, b, d}=42 and {a, c}=10, we can use the intersection and complement operations described above. The intersection of 42 and 10 is 2. The complement of 2 in 42 is 21, and the complement of 2 in 10 is 5. If we multiply them all together, we get 2 * 21 * 5 = 210 = {a, b, c, d}.

Now suppose that all we have is a set S containing various elements {a, b, c…}, and a function from S to a set O of values ranging from a minimum value 1 (which numbers the empty set) to some maximum “universal value” (which numbers a set of which all of the elements of S are parts). O is missing some numbers – 4 isn’t in it, and neither is 9 – since all it contains is prime numbers and the sums of sets of prime numbers, and there is no set {2, 2} or {3, 3}.

Given S and the function from S to O, we can say for any two elements of S whether one is a part of the other, whether they overlap at all, whether another element is a part of their overlapping part or not, and so on. We can say – without knowing what is “inside” a, b and c – how they stand in relation to each other, at least as far as relations of inclusion go. We can identify some elements as “primitive”, inasmuch as they are valued with prime numbers, and others as “composite” (although S may not actually contain the “primitive” elements out of which they are composed).

This concludes a toy example of how we can express relationships between objects by assigning them values from a partial order. What Badiou does with Heyting algebras in Logics of Worlds is rather more complex, needless to say. But what may surprise you is that this approach to ordering entities by numbering primitive elements with primes, and their composites with products of primes, was first developed by…wait for it…Leibniz!

Internet Violence

Tuesday, February 28th, 2006

So, I wrote this script:

import httplib
from time import sleep

def vote():
        conn = httplib.HTTPConnection("www.frontpagemag.com")
        conn.request("GET", "/survey/vote.asp")
        print "Cookie monster want cookie!"
        r1 = conn.getresponse()
        cookie = r1.getheader('set-cookie')
        cookie = cookie[:cookie.find(';')]
        conn.close()
        conn = httplib.HTTPConnection("www.frontpagemag.com")
        conn.request("POST", "/survey/Response.asp", "profVote=46&VOTE=Vote+Now%21",
                     {'Content-Type': 'application/x-www-form-urlencoded',
                     'User-Agent':'Cookie Monster',
                     'Cookie': cookie})
        r1 = conn.getresponse()
        conn.close()
        print "Cookie monster cast vote now!"
	sleep(5)

while 1:
    vote()

What does it do? Every 5 seconds, it adds another vote for Michael Bérubé at attention-seeking gimp-boy David Horowitz’s poll of America’s Worst Professor over at FrontPage magazine. Note that I have nothing against Michael Bérubé personally; quite the reverse. He asked readers of his weblog to vote early and often for him, in the hope of beating such luminaries as Top! Public! Intellectual! Noam Chomsky and professional Native American impersonator Ward Churchill. I got a bit tired with hitting the back button and deleting cookies from my browser, so I decided to automate the process.

Horowitz’s grim parody of a McCarthyite hate-site slowed to a crawl on Monday; he’s blaming “leftwing vandals” for what amounts to a slashdotting of the votes page by a combination of manual refreshing, browser-based Javascript-driven automated vote-stuffing, and possibly the odd instance of an earlier version of the bot above (which, I confess, didn’t have the decency to wait a few seconds between requests). I would imagine that the biggest culprit was browsers refreshing the results page, as doing so involves downloading not only the HTML of the current leaderboard but also a whole slew of other rather more bandwidth-intensive crap: cartoons of Dick Cheney and so on.

In a way, this is reassuring: the failure of the FrontPage site to deal with such a temporary rise in popularity indicates that its regular readership must be fairly tiny. Still, it must be said: Horowitz’s inability, or refusal, to distinguish between a few japing academics hitting the refresh button on their browsers and a malicious and co-ordinated DDos attack is entirely unsurprising, as is the tone of righteous outrage he feigns in his denunciation of the “leftwing vandals” who took such hearty and appropriate delight in buggering about with his creepy and idiotic poll.

Python List Monad

Thursday, August 18th, 2005

Another Python Cookbook Recipe, translating a little something from the world of Haskell into Python.

Tripoli on SourceForge

Thursday, May 19th, 2005

Tripoli now has a SourceForge project page.

The multiple-readers-one-writer locking class from the synchronization module was adapted and improved by Matthew Scott and turned into a handy Python Cookbook recipe.

Part of Matthew’s motivation for writing his own version of this code was that he needed to use it in a non-GPL project, and all of Tripoli is currently licensed under the GPL. I don’t in fact believe that library code – in particular the implementation of generic methods and techniques – requires the protections of the full GPL: once the technique is known, there is little that can be done to it in the way of proprietory modification. So the next release of Tripoli will probably be split into two packages, an application package under the GPL and a library package (including the concurrent Python classes) under the LGPL.

The cookbook recipe has two other significant advantages over my original code: it’s re-entrant, and it’s unit tested. I’ve been working on making my own implementation re-entrant, and creating some appropriate tests for it: these also will be in the next release. My approach to re-entrancy is slightly different to Matthew’s, in that the original non-reentrant class is kept as it is and a re-entrant wrapper provided for it. This makes for a cleaner design, in my view. I’m also looking at some ways of supporting some common patterns in testing multi-threading code.

Once the MrowLock updates are done, I’ll start on a proper HTTP-based Tripoli server – the current XML-RPC version was just a quick hack to get something working.

Tripoli + XMLRPC = Somewhat Useful?

Saturday, April 30th, 2005

I’ve wrapped the main Tripoli object, the impressively-named Manifold, in an XMLRPC server and created a client for it. The client needs to be connected to a TripleCallbackListener that will receive callbacks from the server when a triple matching a request pattern becomes available. This means that get and read operations are completely asynchronous.

Here’s a quick start guide to using the stuff. First, do this:

from xmlrpc import TripleCallbackListener, ManifoldServer, ManifoldClient

listener = TripleCallbackListener(8023)
server = ManifoldServer(8024)
client = ManifoldClient(server.url, listener)
server.start()
listener.start()

We’ll start by requesting some things that don’t exist yet, so we can see what happens when they do.

def printTriple(triple):
    print triple.subject, triple.predicate, triple.object

client.get("likes", "*", "*", "*", printTriple)
client.get("dislikes", "*", "*", "*", printTriple)

The first parameter is the name of the triple space to query, the next three are a pattern to match (all wildcards here, so any triple will do), and the last is a function to call back when a matching triple becomes available. Ours will just print whatever matching triple is returned.

Now we will create some triples in a triple space called “default”:

client.put("default", "Dominic", "likes", "Python")
client.put("default", "Dominic", "hates", "VB.Net")
client.put("default", "Python", "is", "cool")
client.put("default", "VB.Net", "is", "teh suck")

We’ll copy my likes into one triple space, “likes”, and my dislikes into another, “dislikes”. The pattern here is wildcarded for the subject and object of the triple, but requires an exact match on the predicate.

client.copy("default", "*", "likes", "*", "likes")
client.copy("default", "*", "hates", "*", "dislikes")

Our callbacks got called (hopefully)!

Note that copy returns an integer, which is the number of tuples copied. Now we’ll copy-collect the whole graph into another triple space:

count = client.copy_collect_graph("default", "Dominic", "dominic")

Let’s retrieve those triples:

for i in xrange(count):
    client.get("dominic", "*", "*", "*", printTriple)

That should give you the general idea.

Tripoli + HTTP = Really Useful?

Wednesday, April 27th, 2005

Tripoli needs an HTTP server to be really useful. You could run one on your desktop, and use it to tag files with arbitrary metadata, including their relationship to other files.

Trusted users on other machines in the network could access the server to request copies of those files. You could flag files for the attention of other users, or insert them into workflow processes by giving them a state attribute.

copy, copy-collect, copy-graph and copy-collect-graph could be used to move triples between triplespaces running on different servers.

The other things it needs are persistence and reliability. At the moment all the triples are held in memory, and will vanish when the triplestore server is shut down. Saving a bunch of triples to a file on disk, and restoring them again later, is not hard. Real-time logging to disk of additions and removals would make it possible to restore from the last good save in the event of unplanned shutdown.

The other other thing it needs is a short and simple tutorial somewhere explained what the heck it’s for, and how to use it…

Tripoli: A Python Triplespace Implementation

Tuesday, April 26th, 2005

I’ve just uploaded some rough and ready triplespace implementation code. This implements an indexed store for triples (3-tuples that represent binary relations, e.g. subject, predicate and object, as in RDF) and a tuple-space-like interface to the store with put, get, read, copy, copy-collect, copy-graph and copy-collect-graph operations.

The graph operations are the interesting ones, as they hook triples together into directed labelled graphs by treating the subject and object values in the triples as vertex labels, and the predicate values as edge labels. It’s possible to put a set of values like

Dominic, likes, Python
Python, is-cooler-than, Java
Haskell, is-cooler-than, Python

and then copy/collect the graph of triples connected to the node Dominic (Dominic likes Python, Python is cooler than Java) or to the node Haskell (Haskell is cooler than Python, Python is cooler than Java) into a separate triple space.

Much work left to do on this, but I had a sudden blaze of motivation today and thought I’d get as much done as possible while it lasted.

Concurrent Python

Saturday, February 5th, 2005

I’ve gathered together some of the recent Python code I’ve been playing with into a single file, concurrency.py. At present this contains support for MultiEvents, MultiQueues, AsyncResults, DataflowDicts and DataflowObjects.

MultiEvent is a class that lets you listen for several events at once, and tells you when any one of them has been set. MultiQueue lets you wait for an item to be available in any of several queues. When an item arrives, it tells you what the item is and what queue it was in. An AsyncResult represents a computation running on a separate thread that might not have finished yet (very like IAsyncResult in the .Net framework). The threaded function transformer/decorator causes a method to run in its own thread and return an AsyncResult.

DataflowDict provides a dictionary of single-assignment variables. If you try to read a variable that hasn’t been assigned yet, it will block until a value becomes available. DataflowObject wraps a DataflowDict, translating attribute gets and sets into dataflow dictionary lookups. Together with threaded, the dataflow classes make some interesting kinds of nondeterministic / declarative concurrency possible. Even with the scaffolding I’ve created, it’s still nowhere near as easy in Python as it is in Oz.

There is more to be done here (including, no doubt, some careful inspection of the code for subtle or egregious threading errors – race conditions, that kind of thing), but I think I’ve made a fairly good start.

Thread-safe MultiQueue in Python

Thursday, February 3rd, 2005

Another Python Cookbook recipe submission. This one is strikingly similar to the tuple space code I wrote a while back.

I actually wrote it because I needed a MultiQueue class to make a more efficient ActiveObject that would multiplex on several message queues using the threads in a thread pool.

The aim here is to support a large number of concurrent objects without having to have a large number of threads, based on the assumption that not all of the objects are going to be processing messages from each other all of the time. Every object still has its own dedicated message loop, but this loop is now activated by multiplexing thread-pool threads rather than owning its own thread. If the thread pool contains 5 threads, then you can have 5 actions in progress concurrently with the other ActiveObjects on stand-by for when a thread becomes available.

I have a feeling that Stackless Python makes these sorts of trade-offs moot…

Active Objects in Python

Tuesday, February 1st, 2005

My first submitted Python Cookbook Recipe, based on looking at a bit of Oz code in CTM and thinking, how would I do that in Python?

GvR gets ML-envy

Tuesday, January 4th, 2005

The madness of Guido van Rossum progresses apace. Did I say madness? I meant visionary blue-sky thinking, obviously.

I don’t know what approach to type-checking would be best for Python, but I would hope for something more Pythonic. Some standardisation of the various interfaces and adaptors offered by Zope, PEAK and Twisted could be good – they come from the Python culture, and reflect the needs of actual Python developers now.

Beyond that, it all depends what it’s for.

If it’s for making Python go faster, then much of the machinery provided by a sophisticated type system is probably besides the point (or on the wrong side of the 80/20 split). I’d guess that the biggest gains will come simply from knowing that foo is always a string and bar is always an int. If you want to know what a list is a list of, look at the type of the variable used to iterate over it: for foo::int in bar. If you’re going to have typed tuples, why not go the whole hog and make them structs/records?

If it’s for expressivity, then fine but it won’t really be Python any more. The Python school of expression is quite distinct from the ML school or the Java school. This is a good thing.

If it’s for providing static guarantees, then choose another language frankly.

Some more on Tuple Spaces

Thursday, November 18th, 2004

Francis asked:

I still don’t understand what this is for. Can you describe an example application from a user’s point of view?

Also, in what way is it selected which tuple is extracted from the bag when you extract them. Does it filter on them? Or select one at random? Or what?

Suppose we have a bag containing three tuples:

("foo", "bar", "baz")
(1, 2, 3)
(1, "bar", 3)

Tuple selection is done by pattern matching, so for example:

("foo", ANY, ANY)   ->  ("foo", "bar", "baz")
(ANY, "bar", ANY)   ->  either ("foo", "bar", "baz") or (1, "bar", 3)
(ANY, ANY, 3)       ->  either (1, 2, 3) or (1, "bar", 3)

In the original Linda system, patterns are typed so that we could ask for (string, “bar”, string) or (int, “bar”, int), and get tuples matching the types. An even more powerful system might enable matching with expressions, e.g. (int < 2, left(string, 2)=”ba”, int > 2).

If several tuples match the pattern supplied by the user, only one is returned (in the case of my implementation, the first one it finds; which one this will be is determined by the behaviour of the search algorithm).

Suppose we have placed 100 tuples into the tuple space, and we want to read them out in the order in which they were written in. The tuple space itself does not know or care about the order in which tuples were placed in it, so we must place an index value in each of the tuples and match on that value: (ANY, index). This incidentally means that we can read the values out in a specified order even if they were not written in in that order.

Consider a writer outputting the results of a breadth-first binary tree traversal, where the nodes are indexed. They might be written out in the order:

            5
      3              8
1       4     6
   2              7

A reader process would be able to pick up the nodes in order, as they became available. Hence, it would block until the writer had written (5, 3, 8, 1), then pick up (1), then block until the writer had written (4, 6, 2), then pick up (2, 3, 4, 5, 6), then block until the writer had written (7), then pick up (7, 8).

Now for an example application, something less theoretical. Suppose we wish to distribute multiple concurrent searches or computations over a bank of several machines. Think SETI at home, or Google. Tuple spaces provide a simple co-ordination language for many collaborating agents to participate in this process. A process (or group of processes) posts tuples and then waits for a response to arrive from somewhere. Another process (or group of processes) looks for tuples like the ones the first (group of) process(es) posted, and responds to them. None of the processes has to know about the existence of any other process for this to work.

Here is a good introduction to the Linda system, which explains the rationale and some typical uses at greater length.

More on tuple spaces

Wednesday, November 17th, 2004

First of all, the code’s had a slight polish, although it’s still some way from gleaming perfection. What it really needs is unit testing, and possibly some profiling as well. Oh, and some document strings. Stuff like that. I’m afraid that the more interested I am in something, the louder my inner cowboy shouts yeehaa!. Process is fine when the work you’re doing is essentially boring: at least then it can be boring and respectable.

I’ve based the design so far on a number of untested assumptions about how efficient certain operations are. In fact I’m pretty certain that a naive approach – just storing the tuples in a list and running along the list looking for matches – would perform better for smallish tuple sets. By indexing absolutely everything, I’m making tuple selection over large sets faster at the cost of making everything else slower (but slower by an amount that increases slowly relative to the size of the tuple set; at least, assuming my assumptions about dictionary lookup and tests for set membership in Python are correct).

It might be that a hybrid approach would be best – index on the first item in the tuple, then do a linear-time search through an unindexed list after that. Best of all would be to make this configurable.

Meanwhile, it works well enough for me to be able to try out some co-ordination scenarios. The new version of the code has three reader processes simultaneously trying to access items produced by a worker process. No synchronization primitives required: they just pull tuples out of the tuple space until there are none left. But how do they know when there are none left? Because they also update a shared count tuple.

What makes it all so much simpler than it otherwise might have to be is the atomicity of the tuple space operations. The cost of this, in my implementation at least, is that these operations take place sequentially, on a single thread. Calls from other threads are marshalled to this thread via a queue, and block until the worker thread calls them back. Again, this is almost certainly not the most efficient scheme. Simultaneous non-destructive reads should presumably be able to be processed concurrently.

Tuple Spaces in Python

Tuesday, November 16th, 2004

For the moment I’m pretending PyLinda doesn’t exist, but you don’t have to. If you want a working implementation of the ideas discussed here, it’s probably a good place to start.

I’ve been working on my own tuple space implementation in Python, and the results so far can be found here. Running the script will result in either (Bob, Alice) or (Alice, Bob) being printed to STDOUT. Not much to look at.

What does the script do? A high-level account would go something like this:

  • Create a tuple space – a common store for tuples, that can be searched using simple pattern-matching.
  • Put a tuple into the tuple space that consists of the word Users and an empty tuple: (Users, ())
  • Put a tuple in that consists of the words Userlist ready
  • Start two concurrent processes (let’s call them Bob and Alice).
  • Each process looks to see if there’s a tuple that matches the pattern (Userlist ready). It will block until there is such a tuple.
  • Each process then tries to take a tuple matching the pattern (Users, ANY) out of the tuple space. Because there is only one such tuple, only one will succeed. The other will block until some such tuple becomes available.
  • Whichever process successfully grabbed the Users tuple then puts a tuple into the tuple space that updates the Users list with its own name: (Users, (Bob, )) or (Users, (Alice, )) as may be.
  • Now that there is a tuple in the tuple space matching the pattern (Users, ANY), the second process can retrieve it. It does the same thing the first process did, writing a tuple back into the tuple space that updates the Users list. The list will now read either (Bob, Alice) or (Alice, Bob).
  • Just to round things off nicely, each process will put a tuple into the tuple space saying that it’s finished. The main thread will try to retrieve both of these tuples from the tuple space, and will block until both are available. It will then shut down the thread servicing tuple requests, and the script will terminate.

OK. Now, why?

Tuple spaces provide a co-ordination language that allow multiple processes to exchange information, synchronize with each other, maintain or service worklists and generally…co-ordinate their behaviour. The model I’ve implemented is very simple, and consists of three primitives: tupleOut which writes a tuple into the tuple space, tupleIn which removes a tuple from the tuple space, and tupleRead which gets a copy of a tuple leaving the original in place. There is a little magic involved in the pattern matching, which allows us to provide a template specifying the kind of tuple we want. Whatever tuples match the template provided, only one will be returned and we have no control over which one it will be. As the Alice and Bob example above shows, a degree of nondeterminism is characteristic of the system. At the same time, it’s possible to choreograph interactions between processes in entirely deterministic ways if so desired. We could with relatively little effort ensure that Bob could not add himself to the user list before Alice had done so.

My interest in tuple spaces was initially sparked by reading the discussion of dataflow variables in CTM, and thinking about how that approach to modelling concurrency could be applied in Python. Using tuple spaces for concurrency differs from a message-passing approach in that none of the tuples has an explicit recipient: any process that can access the space can read/remove any tuple in the space. The coupling between processes is thus extremely loose by default; and yet tighter coupling is possible without a great deal of extra effort.

As far as the implementation is concerned, my immediate priority is to improve the efficiency of the search algorithm, which currently finds all the tuples matching a query and then returns only one of them – very wasteful! Next up is improving the way blocking requests are matched with newly-added tuples. Once the core is robust, I want to start wrapping some protocols around it – it needs to start speaking HTTP, so that processes in separate locations can communicate with each other.

Calendula kicks off

Tuesday, October 26th, 2004

Calendula is the working title for an app I’m doing for my father-in-law. It’s a fairly simple database app for tracking which carers work which shifts in the Cavanagh household, and working out what wages are due to them.

I originally wanted to do it as a web app, using Twisted, but doing web apps the way I like to do them is a lot of different kinds of work all at once: database schema, queries, data access layer, business object code, Resource classes to service web requests, XML/RDF --> XSLT --> XHTML page-creation (I never met a template system I really liked), Javascript on the client. Great for larger-scale projects, but it means a lot of work up-front before there’s anything you can really play with. I found myself thinking about frameworks for pipelined request processing. That’s not the task at hand. Get on with the task at hand, moron!

So now I’m using wxPython instead.