Counting Syllables and Detecting Rhyme in PHP

I was looking at a Software Developer posting on the FreshBooks careers page the other day, and near the end of the "How to apply" instructions there was a curious sentence:

"If you want to prove you're really paying attention, include a verse of ottava rima and a link to your GitHub profile with your application and you're guaranteed to have your application reviewed by our Software Development Manager."

I'd never heard of the ottava rima rhyming stanza form before, but a quick trip to Wikipedia remedied that. Essentially, an ottava rima stanza must satisfy three rules:

Each stanza must have 8 lines,
The lines must be iambic pentameter,
The stanza must have the rhyming format a-b-a-b-a-b-c-c.

Here is an example of an ottava rima stanza by Frere (as given in the Wikipedia article):

But chiefly, when the shadowy moon had shed
O'er woods and waters her mysterious hue,
Their passive hearts and vacant fancies fed
With thoughts and aspirations strange and new,
Till their brute souls with inward working bred
Dark hints that in the depths of instinct grew
Subjection not from Locke's associations,
Nor David Hartley's doctrine of vibrations.

The simplicity of the rules got me thinking: how hard would it be to write a program to check if a poem stanza is ottava rima?

In this article, we will write a simple ottava rima detector in PHP.

(If you're not interested in the details, you can skip to the code at GitHub.)

The Detector Function

In our PHP program, the ottava rima detector will be a function named is_ottava_rima that will return true or false depending on whether the stanza string passed to it satisfies the 3 ottava rima rules. The general structure looks something like this:

function is_ottava_rima(
    $stanza, 
    $delimiter = "\n", 
    $syllable_tolerance = 2
) {
    $lines = extract_lines_from($stanza, $delimiter);
    
    // Rule 1.
    if (count($lines) !== 8) {
        return false;
    }

    // Rule 2.
    foreach ($lines as $line) {
        if (!is_iambic_pentameter($line, $syllable_tolerance)) {
            return false;
        }
    }

    // Rule 3.
    if (!is_abababcc_rhyme($lines)) {
        return false;
    }

    return true;
}

For the remainder of this article, we will explore how to implement each of the is_ottava_rima sub-functions in detail.

Extracting Lines

The first rule of our ottava rima detector states:

1. Each stanza must have 8 lines.

This is the most straightforward part of the detector. The extract_lines_from function separates the stanza into lines which are then counted. Here's what the code looks like:

function is_ottava_rima(
    $stanza, 
    $delimiter = "\n", 
    $syllable_tolerance = 2
) {
    $lines = extract_lines_from($stanza, $delimiter);
  
    // Rule 1.
    if (count($lines) !== 8) {
        return false;
    }

    ...
}

function extract_lines_from($stanza, $delimiter = "\n") {    
    // Separate the stanza into lines.
    return explode($delimiter, trim($stanza));
}
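As a quick check of the function above, the following sketch runs it on the first two lines of the Frere stanza. The "\r\n" normalization is an assumption added here for text pasted from Windows sources; it is not part of the original function.

```php
<?php
function extract_lines_from($stanza, $delimiter = "\n") {
    // Separate the stanza into lines.
    return explode($delimiter, trim($stanza));
}

// Assumption: the input may use Windows line endings, so normalize first.
$stanza = "But chiefly, when the shadowy moon had shed\r\n"
        . "O'er woods and waters her mysterious hue,\r\n";
$normalized = str_replace("\r\n", "\n", $stanza);

$lines = extract_lines_from($normalized);
echo count($lines), "\n"; // 2
```

Without the normalization step the count would still be 2, but the first line would keep a trailing "\r".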

Phonetic Transcription

Before we move on to the metre and rhyme rules, we need to briefly discuss the concept of phonetic transcription.

In order for a program to be able to accurately count how many syllables a word has, or compare two words for rhyme, it needs to first transform the words into a phonetic transcription. This transform is necessary because, in English, the written form of a word, the orthography, can differ from the pronunciation of it. For example, the words "bough" and "trough" do not rhyme in English, even though their spellings might suggest they do.

Our detector will use a combination of two phonetic transcriptions: Arpabet and Metaphone code, although Metaphone code will only be used as a last resort in rhyme detection.

Arpabet, developed by the Advanced Research Projects Agency (ARPA), represents each phoneme of General American English with a distinct sequence of ASCII characters. Here are a few of the phonemes:

Arpabet	Examples
AO	off (AO1 F); fall (F AO1 L); frost (F R AO1 S T)
EY	say (S EY1); eight (EY1 T)
P	pay (P EY1)

Unfortunately, there is no algorithm for taking an English word and converting it to an Arpabet representation. There is, however, the CMU Pronouncing Dictionary (CMUDict), which allows a program to look up the Arpabet equivalents of ~120,000 English words.

Metaphone is a phonetic algorithm that can transcribe any English word into Metaphone code. Unfortunately, the resulting transcriptions are not as accurate as Arpabet. There are 3 versions of Metaphone (Metaphone, Double Metaphone and Metaphone 3). In our detector, we will be using the original Metaphone algorithm that is included in PHP.
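PHP exposes the original Metaphone algorithm as the built-in metaphone() function. The short sketch below prints the codes for a few words so they can be inspected; the word list is arbitrary and the outputs are printed rather than stated here.

```php
<?php
// Print the Metaphone code for a few words using PHP's built-in function.
foreach (array('bough', 'trough', 'vibrations') as $word) {
    printf("%-12s %s\n", $word, metaphone($word));
}
```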

Detecting Iambic Pentameter

The second rule of our ottava rima detector states:

2. The lines must be iambic pentameter.

What does it mean for something to be iambic? In English, it refers to a metrical foot consisting of an unstressed syllable followed by a stressed syllable (i.e. "da-DUM"). Thus, in the case of iambic pentameter, there are 5 iambs: "da-DUM da-DUM da-DUM da-DUM da-DUM".
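For illustration only, a strict check of the "da-DUM" pattern could look like the sketch below. It assumes that a stress value per syllable is already available (for example from the trailing digits in Arpabet symbols such as EY1), and that any non-zero value counts as stressed. The helper name is hypothetical.

```php
<?php
// Hypothetical helper: $stresses holds one value per syllable,
// 0 = unstressed, anything greater than 0 = stressed (an assumption).
function is_iambic_stress_pattern(array $stresses, $feet = 5) {
    if (count($stresses) !== 2 * $feet) {
        return false;
    }
    foreach (array_values($stresses) as $i => $stress) {
        // In "da-DUM", every second syllable (odd index) is stressed.
        $expect_stressed = ($i % 2 === 1);
        if (($stress > 0) !== $expect_stressed) {
            return false;
        }
    }
    return true;
}

var_dump(is_iambic_stress_pattern(array(0, 1, 0, 1, 0, 1, 0, 1, 0, 1))); // bool(true)
var_dump(is_iambic_stress_pattern(array(1, 0, 1, 0, 1, 0, 1, 0, 1, 0))); // bool(false)
```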

To simplify our detector, we will be looking for a weaker version of iambic pentameter: we won't be worrying about the stresses ("da-DUM") and instead just be looking for 2 syllables per iamb ("da-da") for a total of 10 syllables per line. With this in mind, we can sketch out some code for an is_iambic_pentameter function:

function is_iambic_pentameter($line, $syllable_tolerance) {
    $syllables = 0;
    $words = explode(' ', $line);
    foreach ($words as $word) {
        $syllables += estimate_syllables($word);
    }
    return $syllables === 10;
}

Which brings us to the next question: how do we count the number of syllables in a word? Well, in order to answer that question, we need to ask a more elemental question: what is a syllable?

A syllable consists of two parts: a syllable nucleus (most often a vowel) with optional initial and final margins (typically, consonants). So, we can think of a syllable as a vowel with some optional consonant garnishes. That means that we can roughly estimate the number of syllables in an English word by counting the number of vowels in it. Let's try that:

function estimate_syllables($word) {
    return count_english_vowels($word);
}

function count_english_vowels($word) {
    static $english_vowels = array('A', 'E', 'I', 'O', 'U');
    $vowel_count = 0;
    $letters = str_split(strtoupper($word));
    foreach ($letters as $letter) {
        if (in_array($letter, $english_vowels)) {
            $vowel_count++;
        }
    }
    return $vowel_count;
}

This heuristic works, but there are cases where it can give incorrect results due to the peculiarities of the orthography. For example: "here" would yield 2 syllables instead of 1 and "my" would yield 0 syllables instead of 1. How can we improve on that?
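The miscounts are easy to reproduce. The sketch below repeats count_english_vowels so that it runs on its own, then applies it to one word the heuristic handles well ("vacant") and to the two problem words.

```php
<?php
function count_english_vowels($word) {
    static $english_vowels = array('A', 'E', 'I', 'O', 'U');
    $vowel_count = 0;
    foreach (str_split(strtoupper($word)) as $letter) {
        if (in_array($letter, $english_vowels)) {
            $vowel_count++;
        }
    }
    return $vowel_count;
}

foreach (array('vacant', 'here', 'my') as $word) {
    echo $word, ': ', count_english_vowels($word), "\n";
}
// vacant: 2  (correct)
// here: 2    (over-counted, the final "e" is silent)
// my: 0      (under-counted, "y" acts as the vowel)
```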

The answer lies in using the Arpabet phonetic transcription. Since a phonetic transcription provides a one-to-one mapping between symbols and sounds, we can avoid the over- and under-counting we get with words like "here" and "my". To demonstrate, here are the Arpabet transcriptions of "here" and "my" (the vowels are IY1 and AY1; the trailing digit marks stress):

English	Arpabet
here	HH IY1 R
my	M AY1

In this case, both words are correctly shown as having a single Arpabet vowel each and therefore a single syllable. From this somewhat contrived example, we can see that Arpabet reflects vowel sounds more accurately than English orthography does. However, using Arpabet is not without its pitfalls. First off, it requires more time and memory, as the CMUDict database needs to be loaded into memory. Second, there are only ~120,000 words in the CMUDict, whereas the English language has ~600,000 words. That means there are a lot of words that CMUDict cannot transcribe to Arpabet.

Thus, in order to be able to handle any English word, we need to take a hybrid approach: If the word is in the CMUDict, we use the Arpabet transcription to get an accurate syllable count. If it's not, we use the English vowel counting heuristic. Here's what the code looks like:

function estimate_syllables($word) {
    $syllable_count = count_arpabet_vowels($word);
    if ($syllable_count === null) {
        $syllable_count = count_english_vowels($word);
    }
    return $syllable_count;
}

function count_arpabet_vowels($word) {
    static $arpabet_vowels = array(
        'AO', 'AA', 'IY', 'UW', 'EH', // Monophthongs
        'IH', 'UH', 'AH', 'AX', 'AE',
        'EY', 'AY', 'OW', 'AW', 'OY', // Diphthongs
        'ER' // R-colored vowels
    );
    $cmu_dict = CMUDict::get();
    $phonemes = $cmu_dict->getPhonemes($word);
    if ($phonemes !== null) {
        $vowel_count = 0;
        foreach ($phonemes as $phoneme) {
            if (in_array($phoneme, $arpabet_vowels)) {
                $vowel_count++;
            }
        }
        return $vowel_count;
    } else {
        return null;
    }
}

function count_english_vowels($word) {
    static $english_vowels = array('A', 'E', 'I', 'O', 'U');
    $vowel_count = 0;
    $letters = str_split(strtoupper($word));
    foreach ($letters as $letter) {
        if (in_array($letter, $english_vowels)) {
            $vowel_count++;
        }
    }
    return $vowel_count;
}
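The listing above relies on CMUDict::get() and getPhonemes(), which are not shown in this article. The following is a minimal stand-in, not the actual class: it assumes dictionary lines of the form "WORD  PH1 PH2 ...", loads a tiny inline sample built from the transcriptions quoted earlier, and strips stress digits so that 'IY1' matches 'IY' in the vowel list. The load() method is hypothetical.

```php
<?php
// Hypothetical stand-in for the CMUDict class used in the article.
class CMUDict {
    private static $instance = null;
    private $entries = array();

    // Returns the shared instance, loading a small inline sample on first use.
    // A fuller version would read the complete dictionary file here instead.
    public static function get() {
        if (self::$instance === null) {
            self::$instance = new self();
            self::$instance->load(
                "HERE  HH IY1 R\nMY  M AY1\nFROST  F R AO1 S T"
            );
        }
        return self::$instance;
    }

    // Parses lines of the assumed form "WORD  PH1 PH2 ...".
    public function load($text) {
        foreach (explode("\n", $text) as $line) {
            $parts = preg_split('/\s+/', trim($line));
            if (count($parts) < 2) {
                continue; // Skip blank or malformed lines.
            }
            $word = array_shift($parts);
            // Strip the trailing stress digit from each phoneme.
            $this->entries[strtoupper($word)] = preg_replace('/\d$/', '', $parts);
        }
    }

    // Returns the phoneme array for $word, or null if it is not listed.
    public function getPhonemes($word) {
        $key = strtoupper($word);
        return isset($this->entries[$key]) ? $this->entries[$key] : null;
    }
}

echo implode(' ', CMUDict::get()->getPhonemes('here')), "\n"; // HH IY R
var_dump(CMUDict::get()->getPhonemes('zyzzyva'));             // NULL
```

Returning null for unknown words is what lets estimate_syllables fall back to the vowel-counting heuristic.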

As you might have noticed, there's a $syllable_tolerance argument for the is_iambic_pentameter function. This is to compensate for two facts:

the English vowel counting heuristic can under- or over-estimate the number of syllables,
some poems don't match iambic pentameter perfectly.

Thus, to give the detector a bit of flexibility in detecting ottava rima, there is a $syllable_tolerance variable. A value of 2 means that the detector will accept lines that have between 8 and 12 syllables. Here's the final code for is_iambic_pentameter:

function is_iambic_pentameter($line, $syllable_tolerance) {
    $syllable_count = 0;
    $words = get_words_from($line);
    foreach ($words as $word) {
        $syllable_count += estimate_syllables($word);
    }
    $min_syllable_count = 10 - $syllable_tolerance;
    $max_syllable_count = 10 + $syllable_tolerance;
    return $syllable_count >= $min_syllable_count && 
           $syllable_count <= $max_syllable_count;
}

function get_words_from($line) {
    $cleaned_line = trim(preg_replace("/[^A-Za-z' ]/", ' ', $line));
    return preg_split('/\s+/', $cleaned_line);
}

You might have noticed the get_words_from function. It replaces the superfluous punctuation that many poets are so fond of with spaces, and then splits the line string into an array of word strings. This logic has its own function because it is used again when detecting rhyme.
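To see what get_words_from produces, here is a small sketch that runs it on two lines from the Frere stanza. The function is repeated so the example runs on its own.

```php
<?php
function get_words_from($line) {
    $cleaned_line = trim(preg_replace("/[^A-Za-z' ]/", ' ', $line));
    return preg_split('/\s+/', $cleaned_line);
}

$words = get_words_from("O'er woods and waters her mysterious hue,");
echo implode('|', $words), "\n"; // O'er|woods|and|waters|her|mysterious|hue

// Apostrophes are kept, so a possessive stays a single word.
$words = get_words_from("Subjection not from Locke's associations,");
echo $words[3], "\n";            // Locke's
echo count($words), "\n";        // 5
```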

Detecting Rhyme

The third rule of our ottava rima detector states:

3. The stanza must have the rhyming format a-b-a-b-a-b-c-c.

Rhyme, according to Wikipedia, is:

"a repetition of similar sounds in two or more words and is most often used in poetry and songs."

There are several different types of rhymes, including perfect rhymes, general rhymes, and mirror rhymes, each with their own subcategories. In the case of ottava rima, we will be specifically looking for a type of general rhyme called syllabic rhyme, where the last syllable of the words sounds the same.

What this means from a programming perspective is that we need only look at the last syllable of the last word of every line. Since we already have the get_words_from function for breaking lines into an array of words, all we need to do is grab the last word of that array, like so:

function does_rhyme($line1, $line2) {
    $words1 = get_words_from($line1);
    $last_word1 = $words1[count($words1) - 1];
    $words2 = get_words_from($line2);
    $last_word2 = $words2[count($words2) - 1];

    ...
}

Next, we need to check if the words are in the CMUDict and, if they are, we retrieve their corresponding Arpabet phonemes.

function does_rhyme($line1, $line2) {
    ...

    $words_found = true;
    $cmu_dict = CMUDict::get();
    $phonemes1 = $cmu_dict->getPhonemes($last_word1);
    if ($phonemes1 === null) {
        $words_found = false;
    }
    $phonemes2 = $cmu_dict->getPhonemes($last_word2);
    if ($phonemes2 === null) {
        $words_found = false;
    }

    ...
}

Once we know whether or not the CMUDict contains the words, we can check whether or not they rhyme by comparing the last syllable of each word. If either word is missing from the CMUDict, we instead compare the last Metaphone code symbol of each word.

function does_rhyme($line1, $line2) {
    ...

    if ($words_found) {
        $last_syllable1 = get_last_syllable_of($phonemes1);
        $last_syllable2 = get_last_syllable_of($phonemes2);
        $rhymes = $last_syllable1 === $last_syllable2;
    } else {
        $metaphone1 = metaphone($last_word1);
        $metaphone2 = metaphone($last_word2);
        $rhymes = substr($metaphone1, -1) === substr($metaphone2, -1);
    }

    if (!$rhymes) {
        error_log("$last_word1 and $last_word2 don't rhyme.");
    }

    return $rhymes;
}
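The Metaphone fallback is deliberately loose. The sketch below isolates that comparison in a hypothetical helper (metaphone_endings_match is not part of the article's code) to show that it accepts any two words whose codes end in the same symbol, whether or not they actually rhyme.

```php
<?php
// Fallback comparison from does_rhyme, isolated for experimentation.
function metaphone_endings_match($word1, $word2) {
    return substr(metaphone($word1), -1) === substr(metaphone($word2), -1);
}

var_dump(metaphone_endings_match('cat', 'hat'));  // bool(true)
var_dump(metaphone_endings_match('cat', 'boot')); // bool(true), although they do not rhyme
var_dump(metaphone_endings_match('cat', 'dog'));  // bool(false)
```

This is one reason the fallback is only used as a last resort, when the CMUDict lookup fails.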

How do we get the last syllable of an Arpabet transcribed word? Well, as we know from earlier, the vowel is the nucleus of a syllable. So, if we scan an Arpabet transcription backwards for the last vowel then chop off everything before it, we can approximately get the last syllable.

function get_last_syllable_of($phonemes) {
    static $arpabet_vowels = array(
        'AO', 'AA', 'IY', 'UW', 'EH', // Monophthongs
        'IH', 'UH', 'AH', 'AX', 'AE',
        'EY', 'AY', 'OW', 'AW', 'OY', // Diphthongs
        'ER' // R-colored vowels
    );

    $reversed_syllable_phonemes = array();
    foreach (array_reverse($phonemes) as $phoneme) {
        $reversed_syllable_phonemes[] = $phoneme;
        if (in_array($phoneme, $arpabet_vowels)) {
            break;
        }
    }
    return implode('', array_reverse($reversed_syllable_phonemes));
}
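As a usage sketch, the function can be tried on the transcriptions of "say", "pay" and "eight" from the Arpabet table earlier. The stress digits are assumed to have been removed already, since the vowel list contains bare symbols such as 'EY'.

```php
<?php
function get_last_syllable_of($phonemes) {
    static $arpabet_vowels = array(
        'AO', 'AA', 'IY', 'UW', 'EH', // Monophthongs
        'IH', 'UH', 'AH', 'AX', 'AE',
        'EY', 'AY', 'OW', 'AW', 'OY', // Diphthongs
        'ER' // R-colored vowels
    );

    $reversed_syllable_phonemes = array();
    foreach (array_reverse($phonemes) as $phoneme) {
        $reversed_syllable_phonemes[] = $phoneme;
        if (in_array($phoneme, $arpabet_vowels)) {
            break;
        }
    }
    return implode('', array_reverse($reversed_syllable_phonemes));
}

echo get_last_syllable_of(array('S', 'EY')), "\n";  // EY  (say)
echo get_last_syllable_of(array('P', 'EY')), "\n";  // EY  (pay)
echo get_last_syllable_of(array('EY', 'T')), "\n";  // EYT (eight)
```

"say" and "pay" produce the same string and so count as rhyming, while "eight" does not match either of them.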

That completes the does_rhyme function for determining if two lines rhyme. Now, in order to determine if an array of lines has the a-b-a-b-a-b-c-c rhyming scheme, we simply need to apply that function to the appropriate lines:

function is_abababcc_rhyme($lines) {
    list($a1, $b1, $a2, $b2, $a3, $b3, $c1, $c2) = $lines;
    $a_rhyme = does_rhyme($a1, $a2) && 
               does_rhyme($a2, $a3) && 
               does_rhyme($a1, $a3);
    $b_rhyme = does_rhyme($b1, $b2) && 
               does_rhyme($b2, $b3) && 
               does_rhyme($b1, $b3);
    $c_rhyme = does_rhyme($c1, $c2);
    return $a_rhyme && $b_rhyme && $c_rhyme;
}
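The same pairwise idea can be generalized to other rhyme schemes. The sketch below is a hypothetical variant, not part of the article's code: it takes the scheme as a string and the rhyme test as a callable, and uses a toy comparison (matching last two letters) in place of does_rhyme so that it runs without the CMUDict.

```php
<?php
// Hypothetical generalization: check a rhyme scheme given as a string such
// as 'abababcc'. $rhymes is a callable that takes two lines.
function matches_rhyme_scheme(array $lines, $scheme, $rhymes) {
    if ($scheme === '' || count($lines) !== strlen($scheme)) {
        return false;
    }
    // Group lines by their scheme letter.
    $lines = array_values($lines);
    $groups = array();
    foreach (str_split($scheme) as $i => $letter) {
        $groups[$letter][] = $lines[$i];
    }
    // Every pair of lines within a group must rhyme.
    foreach ($groups as $group) {
        $size = count($group);
        for ($i = 0; $i < $size; $i++) {
            for ($j = $i + 1; $j < $size; $j++) {
                if (!call_user_func($rhymes, $group[$i], $group[$j])) {
                    return false;
                }
            }
        }
    }
    return true;
}

// Toy stand-in for does_rhyme: compares the last two letters of each line.
$toy_rhyme = function ($a, $b) {
    return substr($a, -2) === substr($b, -2);
};
$lines = array('shed', 'blue', 'fed', 'glue', 'bred', 'true', 'nations', 'rations');

var_dump(matches_rhyme_scheme($lines, 'abababcc', $toy_rhyme)); // bool(true)
var_dump(matches_rhyme_scheme($lines, 'aabbccdd', $toy_rhyme)); // bool(false)
```

Like is_abababcc_rhyme, this only checks that lines within a group rhyme; it does not check that different groups sound different.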

Bringing It All Together

That completes the simplified ottava rima detector. Here is what the final code looks like:

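function is_ottava_rima(
    $stanza, 
    $delimiter = "\n", 
    $syllable_tolerance = 2
) {
    $lines = extract_lines_from($stanza, $delimiter);
    
    // Rule 1.
    if (count($lines) !== 8) {
        return false;
    }

    // Rule 2.
    foreach ($lines as $line) {
        if (!is_iambic_pentameter($line, $syllable_tolerance)) {
            return false;
        }
    }

    // Rule 3.
    if (!is_abababcc_rhyme($lines)) {
        return false;
    }

    return true;
}

function extract_lines_from($stanza, $delimiter = "\n") {    
    // Separate the stanza into lines.
    return explode($delimiter, trim($stanza));
}

function is_iambic_pentameter($line, $syllable_tolerance) {
    $syllable_count = 0;
    $words = get_words_from($line);
    foreach ($words as $word) {
        $syllable_count += estimate_syllables($word);
    }
    $min_syllable_count = 10 - $syllable_tolerance;
    $max_syllable_count = 10 + $syllable_tolerance;
    return $syllable_count >= $min_syllable_count && 
           $syllable_count <= $max_syllable_count;
}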

function get_words_from($line) {
    $cleaned_line = trim(preg_replace("/[^A-Za-z' ]/", ' ', $line));
    return preg_split('/\s+/', $cleaned_line);
}

function estimate_syllables($word) {
    $syllable_count = count_arpabet_vowels($word);
    if ($syllable_count === null) {
        $syllable_count = count_english_vowels($word);
    }
    return $syllable_count;
}

function count_arpabet_vowels($word) {
    static $arpabet_vowels = array(
        'AO', 'AA', 'IY', 'UW', 'EH', // Monophthongs
        'IH', 'UH', 'AH', 'AX', 'AE',
        'EY', 'AY', 'OW', 'AW', 'OY', // Diphthongs
        'ER' // R-colored vowels
    );
    $cmu_dict = CMUDict::get();
    $phonemes = $cmu_dict->getPhonemes($word);
    if ($phonemes !== null) {
        $vowel_count = 0;
        foreach ($phonemes as $phoneme) {
            if (in_array($phoneme, $arpabet_vowels)) {
                $vowel_count++;
            }
        }
        return $vowel_count;
    } else {
        return null;
    }
}

function count_english_vowels($word) {
    static $english_vowels = array('A', 'E', 'I', 'O', 'U');
    $vowel_count = 0;
    $letters = str_split(strtoupper($word));
    foreach ($letters as $letter) {
        if (in_array($letter, $english_vowels)) {
            $vowel_count++;
        }
    }
    return $vowel_count;
}

function is_abababcc_rhyme($lines) {
    list($a1, $b1, $a2, $b2, $a3, $b3, $c1, $c2) = $lines;
    $a_rhymes = does_rhyme($a1, $a2) &&
                does_rhyme($a2, $a3) &&
                does_rhyme($a1, $a3);
    $b_rhymes = does_rhyme($b1, $b2) &&
                does_rhyme($b2, $b3) &&
                does_rhyme($b1, $b3);
    $c_rhymes = does_rhyme($c1, $c2);
    return $a_rhymes && $b_rhymes && $c_rhymes;
}

function does_rhyme($line1, $line2) {
    $words1 = get_words_from($line1);
    $last_word1 = $words1[count($words1) - 1];
    $words2 = get_words_from($line2);
    $last_word2 = $words2[count($words2) - 1];

    $words_found = true;
    $cmu_dict = CMUDict::get();
    $phonemes1 = $cmu_dict->getPhonemes($last_word1);
    if ($phonemes1 === null) {
        $words_found = false;
    }
    $phonemes2 = $cmu_dict->getPhonemes($last_word2);
    if ($phonemes2 === null) {
        $words_found = false;
    }

    if ($words_found) {
        $last_syllable1 = get_last_syllable_of($phonemes1);
        $last_syllable2 = get_last_syllable_of($phonemes2);
        $rhymes = $last_syllable1 === $last_syllable2;
    } else {
        $metaphone1 = metaphone($last_word1);
        $metaphone2 = metaphone($last_word2);
        $rhymes = substr($metaphone1, -1) === 
                  substr($metaphone2, -1);
    }

    if (!$rhymes) {
        error_log("$last_word1 and $last_word2 don't rhyme.");
    }

    return $rhymes;
}

function get_last_syllable_of($phonemes) {
    static $arpabet_vowels = array(
        'AO', 'AA', 'IY', 'UW', 'EH', // Monophthongs
        'IH', 'UH', 'AH', 'AX', 'AE',
        'EY', 'AY', 'OW', 'AW', 'OY', // Diphthongs
        'ER' // R-colored vowels
    );

    $reversed_syllable_phonemes = array();
    foreach (array_reverse($phonemes) as $phoneme) {
        $reversed_syllable_phonemes[] = $phoneme;
        if (in_array($phoneme, $arpabet_vowels)) {
            break;
        }
    }
    return implode('', array_reverse($reversed_syllable_phonemes));
}

Does It Work?

Of course, there's no point in writing all this code unless we test it. To that end, a test program was written that analyzed a smattering of ottava rima poems found on the University of Toronto Representative Poetry Online site. The test program analyzed ottava rima stanzas (for false negatives) and non ottava rima stanzas (for false positives). These were the results:

Rhyme	Stanzas	Detected
Ottava rima	23	52.17%
Garbage	15	0%

From this rather limited testing, we can conclude that, on this sample, the detector has a false-negative rate of 47.83% and a false-positive rate of 0%.
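The arithmetic behind these rates can be sketched as follows. The count of 12 detected stanzas is an inference from the reported 52.17% of 23, not a figure stated in the table.

```php
<?php
// Assumption: 12 detected stanzas, inferred from 52.17% of 23.
$ottava_total = 23;
$ottava_detected = 12;
$garbage_total = 15;
$garbage_detected = 0;

$detection_rate = 100 * $ottava_detected / $ottava_total;
$false_negative_rate = 100 - $detection_rate;
$false_positive_rate = 100 * $garbage_detected / $garbage_total;

printf(
    "%.2f%% %.2f%% %.2f%%\n",
    $detection_rate,
    $false_negative_rate,
    $false_positive_rate
);
// 52.17% 47.83% 0.00%
```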

So the detector works, but there is clearly room for improvement.

Conclusion

In this article we implemented a simple ottava rima detector in PHP. This was done by analyzing a string of text representing a poem stanza for the three characteristic properties of an ottava rima poem. The resulting program detected ottava rima stanzas correctly 52% of the time and never classified a non-ottava rima stanza as ottava rima.

The final code, including the test program and test poetry, is available for download at GitHub.

2 comments

Greg Bulmash Aug '12

Wow. From a programmer whose degree is in Creative Writing, a *major* tip of the hat. Of course, the next challenge is to extract meaning. ;-)

cdmckay Aug '12

@Greg Bulmash: The next step is to reorganize the function as a fitness function and start breeding poems using a genetic algorithm :)