Part 1 (JM 6.3): write a program to compute unigram probabilities.
Part 2 (JM 6.4): try your program on two corpora (for example /share/classes/nigelward/corpus and /www/), and note the top 10 unigrams in each corpus.
optional Part 3 (JM 6.3): extend the program to compute bigram probabilities and run it on a corpus.
optional Part 4 (JM 6.5): use the bigrams to generate random sentences, as explained on page 202.
Due 3:01 Thursday. Hand in a printout of the code and the interesting parts of its output.
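For Part 1, one possible approach is to tally token counts (much as the starter code below does) and then divide each count by the total number of tokens, the standard maximum-likelihood estimate P(w) = count(w)/N. This is only a sketch, not the required solution; the helper names (count_unigrams, unigram_probability) and the top-ten report format are my own choices, not part of the assignment.

```perl
#! /usr/bin/perl -w
use strict;

# count_unigrams: tally whitespace-separated tokens from a list of
# lines; returns a reference to the count hash and the token total.
sub count_unigrams {
    my %count;
    my $total = 0;
    foreach my $line (@_) {
        foreach my $token (split(' ', $line)) {
            $count{$token}++;
            $total++;
        }
    }
    return (\%count, $total);
}

# unigram_probability: maximum-likelihood estimate P(w) = count(w) / N
sub unigram_probability {
    my ($count, $total, $word) = @_;
    return 0 unless $total;
    return ($count->{$word} || 0) / $total;
}

# With a corpus file on the command line, print the ten most
# frequent unigrams with their counts and probabilities.
if (@ARGV) {
    my ($count, $total) = count_unigrams(<>);
    my $shown = 0;
    foreach my $word (sort { $count->{$b} <=> $count->{$a} } keys %$count) {
        last if $shown++ >= 10;
        printf "%-15s %6d  %.6f\n",
            $word, $count->{$word}, unigram_probability($count, $total, $word);
    }
}
```

Run it as, say, `perl unigram.pl corpusfile` (the script name is hypothetical); for Part 2 you would run it once per corpus and compare the two top-ten lists.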
#! /usr/bin/perl -w
use English;

print "Starting\n";

# process each line of input
while (defined($line = <>)) {
    print "line is $line";
    # process each word (whitespace-separated token)
    @words = split(' ', $line);
    foreach $token (@words) {
        $count{$token}++;
        print "so far \"$token\" seen $count{$token} times\n";
    }
    print "\n";
}

# scan the entire list of words, in alphabetical order
foreach $key (sort keys(%count)) {
    print "altogether, the word $key appeared $count{$key} times\n";
}

print "sorted by frequency\n";
# This is a really crude way to sort by count:
# first create a list with count and token concatenated...
foreach $key (sort keys(%count)) {
    push(@list, sprintf("%5d", $count{$key}) . " " . $key);
}
# ...then string-sort that list (the fixed-width count field
# makes the string order agree with the numeric order)
foreach $thing (sort(@list)) {
    print "$thing\n";
}

print "\nHere are some random coinflips: ";
for (my $i = 0; $i < 6; $i++) {
    if (rand() > .5) { print "H"; } else { print "T"; }
}
print "\n";
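For the optional Parts 3 and 4, the same counting idea extends to adjacent word pairs, and the bigram counts can then drive the Shannon-style random generation JM describe: start from a boundary marker, repeatedly sample the next word in proportion to its bigram count, and stop when the current word has no recorded followers. The sketch below is one possible shape, not the required solution; the "<s>" line-start marker, the subroutine names, and the per-line (rather than per-sentence) boundaries are my own simplifications.

```perl
#! /usr/bin/perl -w
use strict;

# count_bigrams: tally adjacent word pairs in each line, with "<s>"
# marking the start of a line (a crude stand-in for sentence starts).
sub count_bigrams {
    my %bigram;
    foreach my $line (@_) {
        my $prev = "<s>";
        foreach my $token (split(' ', $line)) {
            $bigram{$prev}{$token}++;
            $prev = $token;
        }
    }
    return \%bigram;
}

# generate: build a random word sequence by sampling each next word
# with probability proportional to its bigram count after the
# previous word; stop at $maxlen words or at a word with no followers.
sub generate {
    my ($bigram, $maxlen) = @_;
    my @sentence;
    my $prev = "<s>";
    while (@sentence < $maxlen and exists $bigram->{$prev}) {
        my $followers = $bigram->{$prev};
        my $total = 0;
        $total += $_ foreach values %$followers;
        my $r = rand($total);    # a point in [0, total)
        foreach my $word (keys %$followers) {
            $r -= $followers->{$word};
            if ($r < 0) {        # landed inside this word's share
                push @sentence, $word;
                $prev = $word;
                last;
            }
        }
    }
    return join(' ', @sentence);
}

# With a corpus file on the command line, print three random sentences.
if (@ARGV) {
    my $bigram = count_bigrams(<>);
    print generate($bigram, 20), "\n" for 1 .. 3;
}
```

Dividing each follower count by $total would give the bigram probabilities P(w | prev) for Part 3's output; the sampler above uses the raw counts directly, which comes to the same thing.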