Textual Informatics
Tuesday, April 14th, 2009
I’d like to talk a little about some of the techniques involved in extracting information from text, or apprehending text as information. I won’t be demonstrating anything terribly sophisticated – in fact, my examples to begin with will all be expressible using small bits of Javascript – but I’d like to expose some of the rude mechanics of textual informatics, and try to build up a vocabulary which later on we might use to talk about some more serious and significant issues. I’m interested in seeing whether and how Franco Moretti’s proposals for a “sociologisation” of literary studies can be distinguished from an “informationalisation” or “informaticisation” of the field, for example. I’d also like to get a better grasp on bioinformatics, which has been creeping up on everyone for a while now; my uninformed guess is that strings of unicode characters and strings of DNA codons are, from an informational perspective, not very significantly unalike. But this might turn out to be true only for quite large values of “not very significantly”.
A “rambler” is a text generator that treats a piece of input text as a series of linked pairs of words. In the preceding sentence, “A” and “rambler” make the first such pair, “rambler” and “is” make the second, and so on. The word “a” is paired in this way with the words “rambler”, “text”, “piece” and “series”, while the word “text” is paired with “generator” and “as”. The rambler starts on a random word somewhere in the text, and moves from that word to any of the words it is paired with. Here’s a (very boring) 100-word ramble around that sentence:
series of linked pairs of words. pairs of linked pairs of linked pairs of linked pairs of linked pairs of linked pairs of words. text as a series of input text generator that treats a series of words. input text as a text generator that treats a text as a text generator that treats a piece of input text as a text generator that treats a piece of words. a series of input text as a series of input text generator that treats a series of input text generator that treats a text generator that treats a series of input
(if the rambler hits the last word of its input text, and that word has no other word that it is paired with, then it just jumps randomly to any word it likes).
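The behaviour described so far can be sketched in a few lines of Javascript. This is only an illustration under stated assumptions, not the code that produced the rambles in this post: the function names (tokenize, buildPairs, pick, ramble) are made up for the example, and the tokenising rule (lowercase everything, drop quotation marks, split on whitespace, keep other punctuation attached to its word) is a guess chosen so that “A” and “a” count as the same word.

```javascript
// Split text into lowercased word tokens, dropping quotation marks.
// (This tokenising rule is an assumption, not taken from the original code.)
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/["“”]/g, "")
    .split(/\s+/)
    .filter(Boolean);
}

// Map each word to the list of words that follow it somewhere in the text.
// Repeats are kept, so a frequent pairing is proportionally more likely.
function buildPairs(words) {
  const pairs = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    if (!pairs.has(words[i])) pairs.set(words[i], []);
    pairs.get(words[i]).push(words[i + 1]);
  }
  return pairs;
}

// Choose one item; `random` must return a number in [0, 1).
function pick(items, random) {
  return items[Math.floor(random() * items.length)];
}

// Walk from word to paired word, jumping to a random word at a dead end.
function ramble(text, length, random = Math.random) {
  const words = tokenize(text);
  if (words.length === 0) return "";
  const pairs = buildPairs(words);
  const output = [];
  let current = pick(words, random);
  while (output.length < length) {
    output.push(current);
    const next = pairs.get(current);
    current = next ? pick(next, random) : pick(words, random);
  }
  return output.join(" ");
}

const sentence =
  "A “rambler” is a text generator that treats a piece of input text " +
  "as a series of linked pairs of words.";
console.log(buildPairs(tokenize(sentence)).get("a"));
console.log(ramble(sentence, 100));
```

With this tokenising rule, the first line logged lists the four words paired with “a” (“rambler”, “text”, “piece” and “series”), and the second is a fresh 100-word ramble that will differ on every run. Passing a different function as the third argument to ramble makes the walk repeatable.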
Rambling around the whole of this text so far, we get:
…expose some more serious and so on. The rambler hits the first such pair, “rambler” make the word of DNA codons are, from an “informationalisation” or apprehending text generator that treats a “sociologisation” of linked pairs of linked pairs of words. In the word it likes). just jumps randomly to get a little about some more serious and strings of DNA codons are, from an “informationalisation” or “informaticisation” of input (if the preceding sentence, “A” and “series”, while the techniques involved in this might turn out to get a “sociologisation” of unicode characters and so on. The rambler hits the field…
It’s nonsense, but bits of it hint in the direction of sense – the rambler is pulling apart and recombining its input text, following a graph of associations extracted from the text itself:

What the rambler code actually does is to scan the text once, constructing this graph as a data structure, and then perform a random walk around that structure – it decomposes the text into a dictionary of terms (labelled vertices in the graph), and a set of linked references (edges in the graph) between entries in that dictionary. The linear form of the source text is completely exploded – in fact, we could not faithfully reconstruct the original text from the rambler’s representation of it (although its random walk might accidentally output it once in a blue moon).
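To make the point about lost linear form concrete, here is a minimal sketch of such a representation: a dictionary of terms plus a set of edges between dictionary entries. The function name buildGraph, the "from>to" edge encoding and the plain whitespace splitting are all assumptions made for the example, and taking “set” literally (each distinct link stored once) may well differ from what the real rambler does.

```javascript
// Build a dictionary of distinct terms plus a set of directed edges between
// dictionary entries. Splitting on whitespace is an assumption for brevity.
function buildGraph(text) {
  const terms = [];          // vertex labels, in order of first appearance
  const index = new Map();   // term -> position in `terms`
  const edges = new Set();   // "from>to" index pairs, one per distinct link

  let previous = null;
  for (const word of text.split(/\s+/).filter(Boolean)) {
    if (!index.has(word)) {
      index.set(word, terms.length);
      terms.push(word);
    }
    const current = index.get(word);
    if (previous !== null) edges.add(`${previous}>${current}`);
    previous = current;
  }
  // Sort the edges so that two graphs can be compared directly.
  return { terms, edges: [...edges].sort() };
}

// Two different texts that collapse to exactly the same graph.
const once = buildGraph("to be or not to be");
const looped = buildGraph("to be or not to be or not to be");
console.log(JSON.stringify(once) === JSON.stringify(looped)); // true
```

Since both inputs yield the same terms and the same edges, nothing in the structure records which of them was read, which is one way of seeing why the original cannot be faithfully reconstructed.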
The rambler is a diverting toy – it “understands” nothing about the text it consumes, or the text it produces. But its output is more syntactically regular than a completely random jumbling of its input terms, because it is informed by – or uses as information – the way the rules of syntax cause particular kinds of words to be paired together in that input. It is also, more subtly, informed by the frequency with which particular nouns and adjectives, or particular “flavours” of word, appear in combination – the nonsensical ramble over the text of Cold World in the preceding post retains quite a bit of the flavour of the original. Try it on something else – a good long LCC post, for instance – and you’ll see what I mean.
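One rough way to put a number on “more syntactically regular” is to ask what fraction of the adjacent word pairs in some output also occur, in the same order, in the input. The sketch below is my own illustrative measure, not something the rambler itself computes; the names splitWords and attestedPairFraction are hypothetical, and the lowercase, whitespace-split tokenising is again an assumption.

```javascript
// Lowercase and split on whitespace (an assumed, deliberately crude rule).
function splitWords(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Fraction of adjacent word pairs in `candidate` that also occur,
// in the same order, somewhere in `source`.
function attestedPairFraction(source, candidate) {
  const sourceWords = splitWords(source);
  const seen = new Set();
  for (let i = 0; i < sourceWords.length - 1; i++) {
    seen.add(sourceWords[i] + " " + sourceWords[i + 1]);
  }
  const words = splitWords(candidate);
  if (words.length < 2) return 0;
  let hits = 0;
  for (let i = 0; i < words.length - 1; i++) {
    if (seen.has(words[i] + " " + words[i + 1])) hits++;
  }
  return hits / (words.length - 1);
}

const source = "the cat sat on the mat";
console.log(attestedPairFraction(source, "on the cat sat"));         // 1
console.log(attestedPairFraction(source, "mat the on sat cat the")); // 0
```

A ramble should score close to 1 on this measure, since the only pairs it can produce that are absent from the input come from its random jumps at dead ends, while a random shuffle of the same words will usually score much lower.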
For the next one of these, I’ll be discussing an algorithm for finding the longest common substring of two strings…



