node-cmudict

September 7, 2011

The Carnegie Mellon University Pronouncing Dictionary is an indispensable tool for the poet-programmer: it is a corpus of roughly 133,000 words and their phonemic representations in a machine-readable format. Some examples:

POETIC  P OW0 EH1 T IH0 K
POETICALLY  P OW0 EH1 T IH0 K L IY0
POETRY  P OW1 AH0 T R IY0
POETS  P OW1 AH0 T S
POFFENBERGER  P AO1 F AH0 N B ER0 G ER0
POG  P AA1 G

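The phonemes tend to be used for text-to-speech operations but can also be used for rhyming, meter-counting, and syllable-counting, because every vowel phoneme ends in a stress digit (0 for unstressed, 1 for primary stress, 2 for secondary stress). Refer to the CMU Pronouncing Dictionary homepage for a reference of the phoneme set.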
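To make the rhyming and counting uses concrete, here is a minimal sketch that works directly on phoneme strings like the ones above. The function names (syllableCount, stressPattern, rhymingPart, rhymes) are my own for illustration and are not part of any library, and the rhyme test is a deliberately simple assumption: two words rhyme if they match from the last primary-stressed vowel to the end.

```javascript
// Vowel phonemes end in a stress digit (0, 1, or 2); consonants do not.
function vowels(phonemes) {
  return phonemes.trim().split(/\s+/).filter(function (p) {
    return /[012]$/.test(p);
  });
}

// One vowel phoneme per syllable.
function syllableCount(phonemes) {
  return vowels(phonemes).length;
}

// The stress digits in order, useful for meter-counting.
function stressPattern(phonemes) {
  return vowels(phonemes).map(function (p) {
    return p.slice(-1);
  }).join('');
}

// Everything from the last primary-stressed vowel to the end of the word.
function rhymingPart(phonemes) {
  var parts = phonemes.trim().split(/\s+/);
  for (var i = parts.length - 1; i >= 0; i--) {
    if (/1$/.test(parts[i])) {
      return parts.slice(i).join(' ');
    }
  }
  return parts.join(' '); // no primary stress: fall back to the whole word
}

// Simplistic rhyme test (an assumption, not a linguistic authority).
function rhymes(a, b) {
  return a !== b && rhymingPart(a) === rhymingPart(b);
}

console.log(syllableCount('P OW0 EH1 T IH0 K')); // 3
console.log(stressPattern('P OW1 AH0 T R IY0')); // '100'
console.log(rhymingPart('P AA1 G'));             // 'AA1 G'
```

Keeping these as pure functions over strings means they work with whatever returns the phonemes, whether that is a library lookup or a line you parsed yourself.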

It’s generally helpful to have a library wrapper around these data files in whatever language you’re working in. I used to do Computational Poetry in Perl with Lingua::EN::Phoneme, but have since switched to NodeJS for its speed in processing many gigs of data asynchronously. That motivated node-cmudict.

Usage is basic and straightforward:

var CMUDict = require('cmudict').CMUDict;
var cmudict = new CMUDict();
var phoneme_string = cmudict.get('prosaic'); // 'P R OW0 Z EY1 IH0 K'

And installation is trivial with the lovely npm:

npm install cmudict

node-cmudict won’t read and parse the CMU dict file until .get() is first called; the whole dictionary is then cached in memory as a plain JavaScript object. Parsing the file takes about one second, and after that lookups are incredibly fast.
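The parse-on-first-get pattern can be sketched roughly like this. This is not node-cmudict's actual source: LazyDict, loadText, and SAMPLE are hypothetical names, and the sample stands in for the real dictionary file so the snippet runs on its own. I am also assuming lookups are case-insensitive and that lines starting with ;;; are comments.

```javascript
// Hypothetical stand-in for the dictionary file: WORD, whitespace, phonemes.
var SAMPLE = [
  ';;; lines starting with three semicolons are comments',
  'POETIC  P OW0 EH1 T IH0 K',
  'POETRY  P OW1 AH0 T R IY0',
  'POG  P AA1 G'
].join('\n');

// loadText is any function returning the raw dictionary text.
function LazyDict(loadText) {
  this._loadText = loadText;
  this._cache = null; // stays null until the first get()
}

// Parse every entry into a plain object keyed by the uppercase word.
LazyDict.prototype._parse = function () {
  var cache = Object.create(null); // no inherited keys such as 'constructor'
  this._loadText().split('\n').forEach(function (line) {
    if (!line || line.indexOf(';;;') === 0) return; // skip blanks and comments
    var m = line.match(/^(\S+)\s+(.+)$/);
    if (m) cache[m[1]] = m[2].trim();
  });
  return cache;
};

// The first call pays the parsing cost; later calls are property lookups.
LazyDict.prototype.get = function (word) {
  if (this._cache === null) {
    this._cache = this._parse();
  }
  return this._cache[word.toUpperCase()];
};

var dict = new LazyDict(function () { return SAMPLE; });
console.log(dict.get('poetry')); // 'P OW1 AH0 T R IY0'
```

To read the real file you would pass something like function () { return require('fs').readFileSync(path, 'utf8'); } instead, where path is wherever your copy of the dictionary lives. The trade-off is that the first lookup is slow and the whole dictionary sits in memory, which is fine for batch processing but worth knowing about.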

NodeJS does not have a rich NLP ecosystem yet, nothing like Perl’s Lingua namespace or Python’s NLTK. But its speed in processing large text files makes it ideal for some Computational Poetry applications. node-cmudict is a small step toward encouraging the growth of such an ecosystem.

node-cmudict on GitHub

One Comment
  1. September 9, 2011 10:20 pm

    Great project. I use CMU’s Pronouncing Dictionary for ePoGeeS, and I think I saw it in Gnoetry’s source code too, so it’s good to see standardized libraries for it.

    I was thinking of using the Pronouncing Dictionary in a client-side JavaScript project; do you have a sense of how adaptable your NodeJS code is to client-side programming? I imagine caching the dictionary in memory is handled differently. If nothing else, maybe I’ll adapt the code; it looks like the Creative Commons Public License allows that. Or, I have some server-side projects in mind.

    I look forward to seeing how this proceeds!
