The Carnegie Mellon University Pronouncing Dictionary is an indispensable tool for the poet-programmer: a corpus of roughly 133,000 words and their phonemic representations in a machine-readable format. Some examples:
    POETIC P OW0 EH1 T IH0 K
    POETICALLY P OW0 EH1 T IH0 K L IY0
    POETRY P OW1 AH0 T R IY0
    POETS P OW1 AH0 T S
    POFFENBERGER P AO1 F AH0 N B ER0 G ER0
    POG P AA1 G
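Entries in this layout are easy to parse by hand. The following is a minimal sketch, not the code inside any particular library: `parseEntry` and `buildIndex` are hypothetical names, and the handling of `;;;` comment lines and `(1)`-style alternate-pronunciation suffixes is an assumption about the data file rather than something shown in the examples above.

```javascript
// Parse one dictionary line into { word, variant, phonemes }.
// Assumed layout: headword, whitespace, then space-separated phonemes.
// Assumption: lines starting with ';;;' are comments, and a headword ending
// in '(1)', '(2)', ... marks an alternate pronunciation of the same word.
function parseEntry(line) {
  var trimmed = line.trim();
  if (trimmed === '' || trimmed.indexOf(';;;') === 0) {
    return null; // blank line or comment: nothing to parse
  }
  var parts = trimmed.split(/\s+/);
  var match = /^(.+?)(?:\((\d+)\))?$/.exec(parts[0]);
  return {
    word: match[1],
    variant: match[2] ? Number(match[2]) : 0,
    phonemes: parts.slice(1)
  };
}

// Build a lookup from lowercased word to a list of phoneme strings.
function buildIndex(text) {
  var index = Object.create(null); // no inherited keys such as 'constructor'
  text.split('\n').forEach(function (line) {
    var entry = parseEntry(line);
    if (!entry) {
      return;
    }
    var key = entry.word.toLowerCase();
    index[key] = index[key] || [];
    index[key].push(entry.phonemes.join(' '));
  });
  return index;
}

// A tiny in-memory sample; the comment line is made up for illustration.
var sample = [
  ';;; hypothetical comment line',
  'POETIC P OW0 EH1 T IH0 K',
  'POETRY P OW1 AH0 T R IY0',
  'POG P AA1 G'
].join('\n');

var index = buildIndex(sample);
console.log(index.poetry); // [ 'P OW1 AH0 T R IY0' ]
```

Keeping a list per word, rather than a single string, leaves room for words that have more than one pronunciation.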
The phonemes are mostly used for text-to-speech, but they work just as well for rhyming, meter-counting, and syllable-counting. Refer to the CMU Pronouncing Dictionary homepage for a reference to the phoneme set.
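As a rough illustration of those uses, here is a sketch that works directly on phoneme strings. It relies on the dictionary's convention that every vowel phoneme ends in a stress digit (0 for unstressed, 1 for primary, 2 for secondary stress). The function names are hypothetical, and the rhyme test (identical phonemes from the last primary-stressed vowel onward) is one simple definition of a perfect rhyme, not the only possible one. The CAT and HAT pronunciations are written by hand in the same notation.

```javascript
// A phoneme is a vowel when it carries a stress digit (0, 1, or 2).
function isVowel(phoneme) {
  return /[012]$/.test(phoneme);
}

function vowelsOf(phonemeString) {
  return phonemeString.trim().split(/\s+/).filter(isVowel);
}

// One syllable per vowel phoneme.
function countSyllables(phonemeString) {
  return vowelsOf(phonemeString).length;
}

// Stress digits in order, e.g. '010' for unstressed-stressed-unstressed.
function stressPattern(phonemeString) {
  return vowelsOf(phonemeString).map(function (p) {
    return p.slice(-1);
  }).join('');
}

// Everything from the last primary-stressed vowel to the end of the word.
function rhymingPart(phonemeString) {
  var phonemes = phonemeString.trim().split(/\s+/);
  for (var i = phonemes.length - 1; i >= 0; i--) {
    if (/1$/.test(phonemes[i])) {
      return phonemes.slice(i).join(' ');
    }
  }
  return phonemes.join(' '); // no primary stress: fall back to the whole word
}

// Two different pronunciations rhyme when their rhyming parts match.
function rhymes(a, b) {
  return a.trim() !== b.trim() && rhymingPart(a) === rhymingPart(b);
}

console.log(countSyllables('P OW0 EH1 T IH0 K'));    // 3
console.log(stressPattern('P OW1 AH0 T R IY0'));     // 100
console.log(rhymes('K AE1 T', 'HH AE1 T'));          // true
```

Meter-counting follows from `stressPattern`: a line's meter is the concatenation of the patterns of its words.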
It’s generally helpful to have a library wrapper around these data files in whatever language you’re working in. I used to do Computational Poetry in Perl with Lingua::EN::Phoneme, but have since switched to NodeJS for its speed in processing many gigabytes of data asynchronously. This motivated node-cmudict.
Usage is straightforward:
    var CMUDict = require('cmudict').CMUDict;
    var cmudict = new CMUDict();
    var phoneme_string = cmudict.get('prosaic'); // 'P R OW0 Z EY1 IH0 K'
And installation is trivial with the lovely npm:
npm install cmudict
NodeJS does not have a rich NLP ecosystem yet, nothing like Perl’s Lingua namespace or Python’s NLTK. But its speed in processing large text files makes it ideal for some Computational Poetry applications. node-cmudict is a small step toward encouraging the growth of such an ecosystem.