cross-corpus Part-of-Speech substitution quatrain

January 25, 2011

when you need follow some fool that loves the knowledge an
so understand that richer soul shut down major envy
when you see the information thinkin smooth an
yet fat caps an all hit trick wishin since scrambling


Jan 22-24 2011, words selected from part-of-speech classes, using the Stanford Parser and Stanford Part-of-Speech tagger on Shakespeare’s Sonnets and the Notorious BIG lyrics.

So I’ve been fooling around with Parsers and Part-of-Speech (POS) Taggers recently. Recall that previously I ran Shakespeare’s sonnets through a part-of-speech tagger, producing lines that look like this:

When/WRB I/PRP do/VBP count/VB the/DT clock/NN that/WDT tells/VBZ the/DT time/NN ,/,

where WRB is a wh-adverb, PRP is a personal pronoun, VBP is a present-tense verb (non-third-person singular), etc.

Previously I then collected all words for a given class (for example, all the wh-adverbs that Shakespeare used) and built a JanusNode ruleset.

This time, I collected all words for a given class in a separate corpus, specifically the lyrics of the Notorious BIG. For example, here are the wh-adverbs used by Biggie:


(I used a parser rather than a tagger because… I dunno, I was just using it when I decided to do this. I think a parser is less accurate than a tagger for POS tagging, unfortunately.)
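
The collection step could be sketched roughly like this in Python. This is not the program I actually ran; it assumes the tagged lyrics are already available as whitespace-separated `word/TAG` tokens (the format shown above), and the function name `words_by_tag` and the sample string are made up for illustration.

```python
from collections import defaultdict

def words_by_tag(tagged_text):
    """Group words by POS tag from whitespace-separated 'word/TAG' tokens."""
    classes = defaultdict(set)
    for token in tagged_text.split():
        # Split on the LAST slash so a word that itself contains '/' stays whole.
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:
            # Lowercasing is a choice made for this sketch, to merge "When" and "when".
            classes[tag].add(word.lower())
    # Sorted lists are easier to read and to turn into drop-downs later.
    return {tag: sorted(words) for tag, words in classes.items()}

# Made-up tagged sample, not real tagger output.
sample = "When/WRB I/PRP die/VBP ,/, where/WRB you/PRP go/VBP"
biggie_classes = words_by_tag(sample)
print(biggie_classes["WRB"])  # ['when', 'where']
```

Using a set per tag drops repeated words, which is what you want for a list of choices.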

Now recall the Shakespeare sonnet line:

When/WRB I/PRP do/VBP count/VB the/DT clock/NN that/WDT tells/VBZ the/DT time/NN ,/,

Considering only the POS tags, we have:

WRB PRP VBP VB DT NN WDT VBZ DT NN ,
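
Stripping the words and keeping only the tags is a one-liner. A minimal sketch, assuming the same `word/TAG` format; `pos_template` is a hypothetical helper name, not something from the Stanford tools.

```python
def pos_template(tagged_line):
    """Return the POS tags of a 'word/TAG' line, in order, without the words."""
    # rpartition("/")[2] is everything after the last slash, i.e. the tag.
    return [tok.rpartition("/")[2] for tok in tagged_line.split() if "/" in tok]

line = ("When/WRB I/PRP do/VBP count/VB the/DT clock/NN "
        "that/WDT tells/VBZ the/DT time/NN ,/,")
print(" ".join(pos_template(line)))  # WRB PRP VBP VB DT NN WDT VBZ DT NN ,
```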


So then I wrote a program to create an HTML page of drop-downs that uses the words for given POS classes in Biggie’s lyrics, that looks like this:

So the first drop-down is all the WRB (wh-adverb) words used by Biggie, the second drop-down is all the PRP (personal pronouns) that Biggie used, etc.
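
Generating that kind of page might look something like the sketch below. It is an illustration under assumptions, not the original program: the template is a list of tags, the word classes are a dict from tag to word list, and the names `dropdown_page`, `template` and `classes` (plus the toy data) are invented here.

```python
from html import escape

def dropdown_page(template, classes):
    """Build an HTML page with one <select> per tag in the template."""
    selects = []
    for tag in template:
        # One <option> per word of this POS class; unknown tags get an empty list.
        options = "".join(
            f"<option>{escape(word)}</option>" for word in classes.get(tag, [])
        )
        # The title attribute shows the tag when hovering over the drop-down.
        selects.append(f'<select title="{escape(tag)}">{options}</select>')
    return "<html><body><p>\n" + "\n".join(selects) + "\n</p></body></html>"

# Toy data for illustration only.
template = ["WRB", "PRP", "VBP"]
classes = {"WRB": ["when", "where"], "PRP": ["i", "you"], "VBP": ["need", "see"]}
page = dropdown_page(template, classes)
print(page)
```

The words go through `html.escape` so that tokens like `&` or `<` in the lyrics don't break the markup.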

So the POS template is of Shakespeare’s sonnet 12, but the words per POS class are by Biggie, get it? And since the human has a chance to select the words, you get around the incoherency problem of class-based n-grams (i.e. that you tend to get adjacent words that the original text authors would never put together.)

At least that was the theory. In truth, it got kind of tedious, because there was no way to save the selections made so far (besides editing the HTML, which got boring). After the first line I also realized there must have been some bug in the program (or the tagging was really poor), because the word “an” was the only thing that came through as a comma. So I quit after the first quatrain, maybe to fix these problems and try again later if I don’t think of something more interesting.

(le update: je pense que this is fairly similar to “Rimbaudelaires” by ALAMO, if you’ll pardon my French.)

Anyways, this is Method 15a38f2c-ae43-45c6-a526-28c9cd5d8e00 (cross-corpus Part-of-Speech substitutions) and here is the html file with the drop-downs, if you want to mess with it for some reason. Later!
