
Presenting… jGnoetry!!!!

December 27, 2011

A Gnoetry implementation in JavaScript: try it out at http://www.eddeaddad.net/jGnoetry.

I resisted doing a Gnoetry implementation because, you know, I didn’t want to be labeled as a “gnoetry programmer” or whatever… but I found myself trying to help a couple of people get Gnoetry installed recently, and since I couldn’t even run it myself this was a massive pain in the rear. Plus, I have to admit, there are two really cool things about Gnoetry (as far as I remember and can tell from eRoGK7’s YouTube video): the “swipe and wipe” regeneration interface, and the use of weighted multiple corpora. And as it turns out I’d solved most of the implementation problems while doing charNG and WpN, so it was pretty straightforward… in fact, jGnoetry took me less than a month of spare-time coding.

Anyway, jGnoetry is a bigram generator that constrains its output to fit syllabic forms. You start out by clicking the “Generate” button, which then turns into the “Re-Generate” button.
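
Under the hood the generation step is simple enough to sketch. This is hypothetical code, not the actual jGnoetry source: it leans on a countSyllables function like the one described in the comments below, and the real thing mixes several weighted corpora rather than just one.

// Sketch of syllable-constrained bigram generation (hypothetical, not the jGnoetry source).
// Build a bigram table: word -> list of words that followed it in the corpus.
function buildBigrams(corpusText) {
  var words = corpusText.split(/\s+/).filter(function (w) { return w.length > 0; });
  var table = {};
  for (var i = 0; i < words.length - 1; i++) {
    (table[words[i]] = table[words[i]] || []).push(words[i + 1]);
  }
  return table;
}

// Generate one line of about `targetSyllables` syllables by walking the bigram table,
// accepting a candidate word only if it fits the remaining syllable budget.
function generateLine(table, startWord, targetSyllables, countSyllables) {
  var line = [startWord];
  var remaining = targetSyllables - countSyllables(startWord);
  var current = startWord;
  while (remaining > 0) {
    var candidates = (table[current] || []).filter(function (w) {
      return countSyllables(w) <= remaining;
    });
    if (candidates.length === 0) break; // dead end: stop the line early
    var next = candidates[Math.floor(Math.random() * candidates.length)];
    line.push(next);
    remaining -= countSyllables(next);
    current = next;
  }
  return line.join(' ');
}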

Next, you wave your mouse over the generated words to mark which ones will be re-generated and which will stay as they are.

Exactly how regeneration works is decided by the “options” you’ve set. I’m aware that my previous generators, like charNG and eDiastic, are pretty noisy in terms of options, so I’ve hidden most of jGnoetry’s parameterizations in DHTML visibility spans.

For example, clicking show options gives you this:

  • start with words selected or unselected – the selected words are the ones that will be re-generated
  • when you move over the words to be selected, either they can “toggle” between selected and unselected, or they can change one way (to either selected or unselected – the opposite of how the words are set to start out by default), or they can change only when you click on them, or (my favorite! and therefore the default) they can change one way AND when you click on them.
  • punctuation – when jGnoetry starts a line, how does it pick the first word? one option is “by punctuation”, meaning it tries to find where the sentences in the corpora began and uses those words – this might be useful if your corpora are made up of novels or news articles. The other option is to use the words that come after newlines – this might be useful if your corpora are made up of poems. (There’s a rough sketch of both approaches right after this list.)
  • whether you should use punctuation from the corpora or not. also, you could use most punctuation except for quotation marks, parentheses, and brackets, which are tricky to handle in bigram generation systems because you can’t be guaranteed that an open-bracket will have a matching close-bracket.
  • when your poem is done generating, a lot of times the last character won’t be a punctuation mark. one option is to automatically append one.
  • finally, you can either use the capitalization of the corpus, or not. or remove all capitalization and then capitalize sentences, lines, and “I” (so you don’t look like some emo-sounding little-i-using poet)
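
To make the line-start option concrete, here’s my guess at how the two kinds of start words could be collected – a sketch, not the actual implementation:

// Collect candidate line-start words from a corpus, either "by punctuation"
// (words that follow sentence-ending punctuation) or "by newline" (words that
// begin a line). A guess at the approach, not the jGnoetry source.
function collectStartWords(corpusText, byPunctuation) {
  var startWords = [];
  if (byPunctuation) {
    var re = /[.!?]\s+([A-Za-z'-]+)/g;
    var m;
    while ((m = re.exec(corpusText)) !== null) {
      startWords.push(m[1]);
    }
  } else {
    corpusText.split(/\n/).forEach(function (line) {
      var first = line.trim().split(/\s+/)[0];
      if (first) startWords.push(first);
    });
  }
  return startWords;
}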

Clicking on “corpora” gives you the chance to paste in new corpora, and to change the corpora weights.

Finally, you can click “status” to see some details of the generation process, or “template” to see details about the form you’re using.

If you don’t like the default templates, you can edit them. [s] stands for “syllable”, and [n] stands for “newline”, natch.
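
In case the notation isn’t obvious, here’s roughly how such a template could be turned into per-line syllable counts. This is just my reading of the [s]/[n] tokens, not the actual jGnoetry parser:

// Turn a template made of [s] (one syllable) and [n] (newline) tokens
// into an array of per-line syllable counts. My reading of the notation,
// not the actual parser.
function parseTemplate(template) {
  var lines = [];
  var count = 0;
  (template.match(/\[s\]|\[n\]/g) || []).forEach(function (tok) {
    if (tok === '[s]') {
      count += 1;
    } else {           // '[n]' closes the current line
      lines.push(count);
      count = 0;
    }
  });
  if (count > 0) lines.push(count);  // in case the template doesn't end with [n]
  return lines;
}

// e.g. a haiku-shaped template of five, seven, and five [s] tokens,
// each line ending in [n], parses to [5, 7, 5].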

One of the nuisances of the Python-on-Linux Gnoetry was that you couldn’t edit poems after you generated them, so I made sure that this was possible in jGnoetry. Just paste your poem into the template textarea and click “Generate”.

The words will be made editable in the area above, and the program will try to determine the poem structure, in case you want to edit it.
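
My guess is that the structure detection just counts syllables per pasted line, something like this (hypothetical, and it leans on a syllable counter like the one discussed in the comments below):

// Derive a per-line syllable template from an existing poem.
// A guess at the approach, not the actual jGnoetry code.
function poemToTemplate(poemText, countSyllables) {
  return poemText.split(/\n/)
    .filter(function (line) { return line.trim().length > 0; }) // skip blank lines
    .map(function (line) {
      return line.trim().split(/\s+/)
        .reduce(function (sum, w) { return sum + countSyllables(w); }, 0);
    });
}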

Anyway, try jGnoetry out, if it doesn’t work let me know and I’ll fix it if I have time and feel like it!

Update: the interface has since been edited a bit and looks somewhat different now.

6 Comments
  1. December 27, 2011 10:34 pm

    It falls off. I can’t keep. until . I can’t
    Keep. Resonating pain until it falls
    Off. She complies memory. They to me
    Off like & now my doesn’t belong to it falls.

  2. December 27, 2011 10:35 pm

    unsupervised, btw.

  3. January 1, 2012 3:42 am

    Just announcing that I put a bunch of my generation software on sourceforge:

    http://sourceforge.net/projects/poetrygen/

    Most of them were already accessible on the web, but the sourceforge “files” page has a zip file containing charNG, eDiastic, jGnoetry, and WpN all in one bundle for your downloading convenience!

    Happy New Year, everybody!!!1!!!

  4. January 3, 2012 5:03 am

    When I was coding jGnoetry I needed a function just to count syllables, and I thought about pasting in a JavaScript version of the relevant info from cmudict, but that would have greatly increased the jGnoetry file size. So instead I used a simple algorithm:

    definition: a contiguous vowel sequence is an unbroken run of vowels (counting “y” as a vowel). e.g. the word “function” has 2 contiguous vowel sequences: “u” and “io”

    given a word
    if there is more than one contiguous vowel sequence in the word {
    if there is a final e, remove it from the word
    }
    output the remaining number of contiguous vowel sequences
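
    In JavaScript that comes out to something like the sketch below, together with the exceptions-table lookup I get to a couple of paragraphs down. The names are mine, not necessarily what’s in the jGnoetry source:

    // Count syllables as the number of contiguous vowel sequences, dropping a
    // trailing "e" when the word has more than one sequence. `exceptions` is an
    // optional { word: syllableCount } table built from cmudict (see below).
    function countSyllables(word, exceptions) {
      var w = word.toLowerCase().replace(/[^a-z]/g, '');
      if (exceptions && exceptions.hasOwnProperty(w)) return exceptions[w];
      var seqs = w.match(/[aeiouy]+/g) || [];
      if (seqs.length > 1 && w.charAt(w.length - 1) === 'e') {
        seqs = w.slice(0, -1).match(/[aeiouy]+/g) || [];
      }
      return seqs.length;
    }

    // countSyllables('function') -> 2  ("u", "io")
    // countSyllables('cake')     -> 1  (final "e" dropped, leaving "a")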

    I wondered about the accuracy of this, so I cat’ted together all the texts I’d ever used for poetry generation (mostly gutenberg and wikinews stuff, also some lyrics) and compared the performance of this algorithm to full cmudict.

    It turns out that of the individual word types, 30060 of 64763 were not in cmudict. But if I look at word tokens (i.e. “the” is an individual word type that occurs as several thousand tokens in the corpus), only 6% of the corpus’ tokens are not in cmudict. Of the tokens that are in cmudict, the algorithm agrees with cmudict 93.7% of the time. So even if my simple algorithm got wrong all the words not in cmudict, it’d still be right 87.9% of the time.

    I decided to add a table of exceptions to the algorithm, based on cmudict syllable counts. Of course there’s a tradeoff: the larger the table, the higher the accuracy, but also the larger the file size. I decided to go with a table of 500 exceptions to the simple algorithm.

    is a word type in cmudict?  there: 34703, not there: 30060
    if the word type is in cmudict, do the algorithm and cmudict agree?  agree: 27861, not agree: 6842
    3290261 tokens seen in all.
       of these, 201778 tokens not in cmudict (6.132583402958 %)
       and 3088483 tokens are in cmudict (93.867416597042 %)
       of those tokens in cmudict,
           2892383 agree with the algorithm (93.6506045200832 %) which is (87.9074030905147 %) of all tokens total 
           196100 do not agree with the algorithm (6.34939547991684 %) which is (5.96001350652729 %) of all tokens total 
           of those that do not agree with the algorithm,
              191474 (97.6409994900561 %) are in the modified table top 4000 which is (5.81941675751559 %) of all tokens total 
              186833 (95.2743498215196 %) are in the modified table top 3000 which is (5.67836411761863 %) of all tokens total 
              177927 (90.7327893931667 %) are in the modified table top 2000 which is (5.4076865026817 %) of all tokens total 
              157150 (80.137684854666 %) are in the modified table top 1000 which is (4.77621684115637 %) of all tokens total 
              130800 (66.700662927078 %) are in the modified table top 500 which is (3.97536851939709 %) of all tokens total 
              73145 (37.2998470168281 %) are in the modified table top 100 which is (2.22307592011698 %) of all tokens total 
    

    So assuming cmudict as a gold standard, this gives me a final per-token accuracy of about 91.9% (87.9% from the bare algorithm plus 4.0% recovered by the 500-entry exceptions table), with roughly 2.0% of tokens known to be counted incorrectly and 6.1% unknown because they aren’t in cmudict.

    So it looks like a simple syllable-counting algorithm with a small exceptions table can get you pretty far. (I saw a reference to someone who did a dissertation including this issue, but I didn’t get a chance to track it down.)

  5. May 21, 2012 9:28 pm

    I’m sure this is well-mined ground, but your notes got me thinking about the problem of (not) matching brackets with n-gram generation, and I came up with some notes towards an implementation.

  6. May 23, 2012 2:00 am

    “…the problem of (not) matching-brackets with n-gram generation, and came up with some notes towards an implementation.”

    Great stuff. I dig the idea of increasing the probability of a close-token being picked once an open-token has been seen, especially if you tokenize punctuation.

    I think in the NLP community there’s a movement towards enhancing n-gram models in this way, though I’m not familiar with the details.
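
    A toy version of that bias, just to pin the idea down (my own sketch, not code from either set of notes):

    // Toy sketch: once an open bracket/quote is pending, boost the weight of a
    // matching close token so it's more likely to be picked next.
    function weightCandidates(candidates, openDepth, boost) {
      return candidates.map(function (token) {
        var weight = 1;
        if (openDepth > 0 && (token === ')' || token === ']' || token === '"')) {
          weight *= boost; // e.g. boost = 5
        }
        return { token: token, weight: weight };
      });
    }

    // Standard weighted random pick over the boosted candidates.
    function pickWeighted(weighted) {
      var total = weighted.reduce(function (sum, c) { return sum + c.weight; }, 0);
      var r = Math.random() * total;
      for (var i = 0; i < weighted.length; i++) {
        r -= weighted[i].weight;
        if (r <= 0) return weighted[i].token;
      }
      return weighted[weighted.length - 1].token;
    }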
