Sketch: charNG

June 14, 2011

presenting…. charNG !!!
(pronounced “carnage”)
a character N-gram generator:

There are plenty of character n-gram generators out there: JanusNode, Dissociated Press, Travesty. But I wanted to code one myself, you know, to get a feel for it. Plus, I wanted it to be client-side JavaScript, so any y’all could hoard it in case the others disappeared like that eskimo/poormfa Travesty generator did. But WTF is a character n-gram generator, you ask? Well, here’s a brief description:

FAQ: WTF is a character n-gram generator?

In a netpoetic post I talk about building n-gram language models from sequences of words. However, we could also look at sequences of characters. For example, the following shows the number of times that each character occurs in Shakespeare’s sonnets:

a 4569,
b 1082,
c 1306,
d 2722,
e 9000,
f 1557,
g 1341,
h 4997,
i 4232,
j 66,
k 545,
l 3030,
m 1998,
n 4443,
o 5582,
p 986,
q 51,
r 4162,
s 4850,
t 6739,
u 2300,
v 925,
w 1642,
x 60,
y 1948,
z 21,
A 368,
B 145,
C 31,
D 43,
E 23,
F 106,
G 16,
H 65,
I 444,
J 3,
K 6,
L 64,
M 96,
N 77,
O 126,
P 25,
R 18,
S 140,
T 469,
U 21,
V 1,
W 255,
Y 34,
SPACE 16050,
COMMA 1803,
! 98,
‘ 647,
( 1,
) 1,
– 155,
. 363,
: 175,
; 302,
? 97

This is a character unigram language model. This is like having a lottery machine containing 4569 ping-pong balls with the character ‘a’, 1082 ping-pong balls with the character ‘b’, and so on.
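
The lottery machine can be sketched in a few lines of JavaScript. This is only an illustration, not charNG's actual source: the names `countChars`, `drawWeighted` and `generateUnigram` are made up here, and the `rng` parameter is an assumed hook so a fixed random source can be swapped in.

```javascript
// Count how often each character occurs in a text (a character unigram model).
function countChars(text) {
  const counts = new Map();
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  return counts;
}

// Draw one key from a Map of counts, with probability proportional to its
// count. `rng` (hypothetical hook) must return a number in [0, 1).
function drawWeighted(counts, rng = Math.random) {
  let total = 0;
  for (const c of counts.values()) total += c;
  let target = rng() * total; // a position somewhere among all the "balls"
  let last;
  for (const [key, c] of counts) {
    last = key;
    if (target < c) return key;
    target -= c;
  }
  return last; // floating-point safety net; undefined for an empty Map
}

// Generate `length` characters by repeated, independent draws.
function generateUnigram(text, length, rng = Math.random) {
  const counts = countChars(text);
  let out = "";
  for (let i = 0; i < length; i++) out += drawWeighted(counts, rng);
  return out;
}
```

With a tiny machine such as `new Map([["a", 1], ["b", 3]])`, `drawWeighted` should come back with "b" about three times out of four.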

As with word language models, we could write a program to generate poetry by selecting characters one at a time, giving us output that might look like:

elefosemTsiv tubeS
ve i t WCittf  ownr e a i uru rfurae wIseaaoiulv r  ,tcvioab . Tinh,r merm  tg
 Tmierawtncmg yd  p rdl  d oe
y    M  ohet ic kofitnu t d, rps''ymsyw eehvus,e dwstuhasr;mel oreu

Which is very incoherent. However, we could also build a character bigram language model, that is, one counting the frequencies of pairs of adjacent characters. A couple of segments of such a model for Shakespeare’s sonnets look like this:

Ab 1,
Ac 1,
Ad 3,
Af 2,
Ag 10,
Ah 5,
Al 18,
Am 3,
An 246,
Ap 5,

e[NEWLINE] 85,
e[SPACE] 2496,
e! 21,
e’ 84,
e, 530,
e- 19,
e. 125,
e: 73,

So the character bigram “Ab” occurs once, in “Above a mortal pitch…”, the bigram “Ac” occurs once, in “Accuse me thus…”, the bigram “An” occurs 246 times, and so on. Selecting from it in a manner analogous to word n-gram generation, we could get poems that look like the following. Note that we already start having recognizable sequences such as “If…”, “And…” and “That”.

Ifou thesgsoonase;
Andomid thirsld,
Anounene bid pl,
 we w
That ve-my gus an,
An akees o thave mees
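
Counting bigrams is just a sliding window over the text. Here is a rough sketch, assuming nothing about how charNG does it internally; `countNgrams` and `showGram` are hypothetical names, and `showGram` only mimics the [SPACE] and [NEWLINE] notation used in the listings.

```javascript
// Count every run of n adjacent characters in a text.
// n = 2 gives a bigram table like the "Ab", "Ac", ... listing.
function countNgrams(text, n) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const counts = new Map();
  for (let i = 0; i + n <= chars.length; i++) {
    const gram = chars.slice(i, i + n).join("");
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Make whitespace visible when printing an n-gram.
function showGram(gram) {
  return gram.replace(/\n/g, "[NEWLINE]").replace(/ /g, "[SPACE]");
}

// Example: print the bigram table for a short text, one entry per line.
for (const [gram, count] of countNgrams("An ant\nAnd an", 2)) {
  console.log(showGram(gram) + " " + count + ",");
}
```

The same function covers the higher-order models too: pass 3 for trigrams, 4 for 4-grams, and so on.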

Next we could look at a character trigram model, which looks like the following:

Abo 1,
Acc 1,
Adm 1,
Ado 1,
Adv 1,
Aft 2,
Aga 10,

me[NEWLINE] 9,
me[SPACE] 222,
me! 7,
me’ 16,
me, 65,
me- 1,
me. 19,
me: 9,

and which produces lines such as the following:

Maket on onct and one woud sougetur fablipen,
And I he forld suntleter prow;
But bart see worge welf and bind's I'll casucy st maishought,
Bout love tareind my hic conquich a fairaill onew.

And a character 4-gram model, which looks like the following:

a[SPACE]an 1,
a[SPACE]ba 4,
a[SPACE]be 5,
a[SPACE]br 1,
a[SPACE]ca 2,

or’d 4,
or’t 1,
or,[NEWLINE] 1,
or,[SPACE] 3,
or-l 1,

and produces the following:

Thou shor'd from vanish'd nor vain'd love work my live unt lost;
For thy dest, burdeserter loving, blushin oth loving the in eachemy,
Markling forsake do is a have stance lie, dream from advanish:
Of thou dost my losed longuest you defect,

with increasing levels of coherence. Furthermore, certain n-gram models produce interesting portmanteaux, such as “advanish”. This is due to the local coherence effects of n-grams. During 4-gram generation, the algorithm is only aware of the 3 characters it has most recently generated. When it has reached the point of generating “advan”, it is only aware of the previous sequence “van”, so it is consistent with the model to end with “vanish”.
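
To see how little the generator remembers, here is a minimal sketch of n-gram generation where each step looks only at the last n-1 characters. It is an illustration under assumptions, not charNG's code: `buildModel`, `drawWeighted` and `generate` are hypothetical names, `rng` is an assumed hook for the random source, and seeding the output with the corpus's first n-1 characters is just one convenient choice.

```javascript
// Map each (n-1)-character context to the counts of the characters
// that follow it in the corpus.
function buildModel(corpus, n) {
  const model = new Map();
  for (let i = 0; i + n <= corpus.length; i++) {
    const context = corpus.slice(i, i + n - 1);
    const next = corpus[i + n - 1];
    if (!model.has(context)) model.set(context, new Map());
    const followers = model.get(context);
    followers.set(next, (followers.get(next) || 0) + 1);
  }
  return model;
}

// Draw a key with probability proportional to its count.
function drawWeighted(counts, rng) {
  let total = 0;
  for (const c of counts.values()) total += c;
  let target = rng() * total;
  let last;
  for (const [key, c] of counts) {
    last = key;
    if (target < c) return key;
    target -= c;
  }
  return last; // floating-point safety net
}

// Generate up to `length` characters from an n-gram model of `corpus`.
function generate(corpus, n, length, rng = Math.random) {
  const model = buildModel(corpus, n);
  let out = corpus.slice(0, n - 1); // assumed seed: the corpus's opening
  while (out.length < length) {
    // The only "memory" is the last n-1 characters of the output.
    const context = n > 1 ? out.slice(-(n - 1)) : "";
    const followers = model.get(context);
    if (!followers) break; // context only occurs at the very end of the corpus
    out += drawWeighted(followers, rng);
  }
  return out;
}
```

On the toy corpus "advance vanish" with n = 4, the context “van” has two possible followers, “c” and “i”. If the draw happens to pick “i” (for instance `generate("advance vanish", 4, 20, () => 0.999)`), the output is “advanish”: the generator started inside “advance” and finished inside “vanish” without ever noticing.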

charNG features

Anyway, charNG has several features:

  • Generate from unigram to 10-gram models. (You should be able to generate from more, if you edit the HTML form.)
  • Various types of chaining. Chaining is how the generation algorithm determines what character comes next. For example, let’s say you’ve generated the characters “hello wor” so far. Then:
    • Markov: will look at the last n-1 characters for an n-gram model. For example, a 4-gram model will look at the last 3 characters “wor” to determine what single character should come next, based on the frequency of 4-grams in the corpus. In Shakespeare’s sonnets, here are the 4-grams that start with “wor”, with counts:

      word 13,
      work 8,
      worl 33,
      worm 4,
      worn 10,
      wors 14,
      wort 30,

      so it’s like getting a lottery machine, putting 13 balls with “word”, 8 balls with “work”, etc, then picking one of the balls and choosing “d” or “k” or etc. as the next character.

    • 1-char overlap: will only look at the last character to pick the next n-gram. In the “hello wor” example, it will pick an n-gram that begins with the character “r”.
    • Cento (cut-up): doesn’t look at the previous characters at all. It just generates sets of n-grams. This is inspired by centos, which date back to the 3rd century AD, and which were rediscovered in the 1900s as the cut-up technique, although as far as I know most centos/cut-ups were word-based, and Ausonius at least seemed to constrain centos by rhythm.
  • Amount of generation details to print. If you put it on “verbose” details, you’ll see a pretty comprehensive trace of how it works.
  • Randomly-inserted newlines and spaces, to make it look more eecummings-like. I got the idea from JanusNode! although this doesn’t use word-based rules like JanusNode’s eecummingsifier does…
  • Display the n-gram model, i.e. list all the n-grams that are in the corpus, from unigrams to 10-grams (though again you should be able to see higher-order n-grams by editing the HTML form)
  • Display any portmanteaux that have been output. I got this idea from Substance McGravitas… basically it looks at the text in the Output textarea, and prints out any words that it doesn’t see in the Corpus textarea. As far as I can tell, 4-grams have the best portmanteaux.
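
The three chaining modes from the list above differ only in how much of the already-generated text they consult. The sketch below is a guess at the general shape, not charNG's implementation: `nextPiece`, `startingWith` and the mode names "markov", "overlap" and "cento" are all made up, and I am assuming that the overlapping characters are not repeated in the output and that 1-char overlap is used with n of at least 2.

```javascript
// Draw a key with probability proportional to its count.
function drawWeighted(counts, rng) {
  let total = 0;
  for (const c of counts.values()) total += c;
  let target = rng() * total;
  let last;
  for (const [key, c] of counts) {
    last = key;
    if (target < c) return key;
    target -= c;
  }
  return last; // undefined if `counts` is empty
}

// Keep only the n-grams that begin with `prefix`.
function startingWith(counts, prefix) {
  const subset = new Map();
  for (const [gram, c] of counts) {
    if (gram.startsWith(prefix)) subset.set(gram, c);
  }
  return subset;
}

// Return the text to append to `soFar`, given n-gram counts and a mode.
function nextPiece(counts, n, soFar, mode, rng = Math.random) {
  if (mode === "cento") {
    return drawWeighted(counts, rng); // ignore history, append a whole n-gram
  }
  // Markov shares n-1 characters with the history, 1-char overlap shares one.
  const overlap = mode === "markov" ? n - 1 : 1;
  const prefix = overlap > 0 ? soFar.slice(-overlap) : "";
  const gram = drawWeighted(startingWith(counts, prefix), rng);
  // Append only the part of the n-gram that is not already in the output.
  return gram === undefined ? "" : gram.slice(prefix.length);
}

// The 4-grams starting with "wor", with the counts from the listing above.
const wor = new Map([
  ["word", 13], ["work", 8], ["worl", 33], ["worm", 4],
  ["worn", 10], ["wors", 14], ["wort", 30],
]);
console.log(nextPiece(wor, 4, "hello wor", "markov")); // one of d, k, l, m, n, s, t
```

In Markov mode this adds a single character per step, while 1-char overlap adds n-1 characters and cento adds n, which is part of why the latter two read as choppier.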

Anyway, it’s not an interactive generator, since you just set some settings and click go. But it’s still a lot of fun, trying different settings and corpora, generating a bunch of characters, and picking out good stuff that you see flow by. It’s like sitting by a cool stream, occasionally scooping out cupfuls of water! Of course, you can then put the output through JanusNode mappings, or hand-edit it, etc.
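
If you want to do some of that scooping automatically, the portmanteau display from the feature list boils down to a set difference between output words and corpus words. A rough sketch, with the caveat that `words` and `findPortmanteaux` are hypothetical names and the tokenization (lowercased runs of letters and apostrophes) is my assumption rather than what charNG actually does:

```javascript
// Split text into lowercase words. Assumption: a word is a run of
// ASCII letters and apostrophes (straight or curly).
function words(text) {
  return text.toLowerCase().match(/[a-z'’]+/g) || [];
}

// List each output word that never appears in the corpus.
function findPortmanteaux(output, corpus) {
  const known = new Set(words(corpus));
  const novel = new Set();
  for (const w of words(output)) {
    if (!known.has(w)) novel.add(w);
  }
  return [...novel];
}

console.log(findPortmanteaux(
  "dream from advanish",
  "advance vanish from a dream"
)); // [ 'advanish' ]
```

Not every unseen word is a proper portmanteau, of course. Truncated fragments and plain nonsense will show up in the list too, so you still get to pick out the good ones by hand.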

Thanks to Matthew for helping me test it, to Alex Rudnick for his efforts in n-gram education, and to Chris Funkhouser for general encouragement.
