Part 1 (JM 6.3): write a program to compute unigram probabilities.
Part 2 (JM 6.4): try your program on two corpora (for example /share/classes/nigelward/corpus and /www/), and note the top 10 unigrams in each corpus.
Optional Part 3 (JM 6.3): extend the program to compute bigram probabilities and run it on a corpus.
Optional Part 4 (JM 6.5): use the bigrams to generate random sentences, as explained on page 202.
Due 3:01 Thursday. Hand in a printout of the code and the interesting parts of its output.
#! /usr/bin/perl -w
use English;

print "Starting";

# process each line
while (defined($line = <>)) {
    print "line is $line";
    print " ";
    @words = split(' ', $line);
    # process each word
    foreach $token (@words) {
        $count{$token}++;
        print "so far \"$token\" seen $count{$token} times\n";
    }
    print "\n";
}

# scan the entire list of words, in alphabetical order
foreach $key (sort keys(%count)) {
    print "altogether, the word $key appeared $count{$key} times\n";
}

print "sorted by frequency\n";
# this is a really crude way to sort:
# first create a list with count and token concatenated
foreach $key (sort keys(%count)) {
    push(@list, sprintf("%5d", $count{$key}) . " " . $key);
}
# now sort that list
foreach $thing (sort(@list)) {
    print "$thing \n";
}

print "\nHere are some random coinflips: ";
my $i;
for ($i = 5; $i >= 0; $i--) {
    if (rand() > .5) { print "H"; }
    else             { print "T"; }
}
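The starter script above only counts tokens; Part 1 asks for probabilities, which are just each word's count divided by the total number of tokens. One minimal sketch (the hardcoded toy sentence is only for illustration; for Part 2 you would instead collect @tokens from the corpus files, e.g. with the same while (<>) loop as above):

```perl
#! /usr/bin/perl -w
use strict;

# Unigram probability: P(w) = count(w) / N, where N is the total
# number of tokens seen in the corpus.
sub unigram_probs {
    my @tokens = @_;
    my (%count, %prob);
    $count{$_}++ foreach @tokens;
    my $n = scalar @tokens;
    $prob{$_} = $count{$_} / $n foreach keys %count;
    return \%prob;
}

# Toy corpus for illustration only; replace with tokens read from
# the real corpus files for the assignment.
my @tokens = split ' ', "the cat sat on the mat the end";
my $prob   = unigram_probs(@tokens);

# Print the unigrams sorted by probability, most probable first
# (for Part 2, take the first 10 of these).
foreach my $w (sort { $prob->{$b} <=> $prob->{$a} } keys %$prob) {
    printf "%-8s P = %.4f\n", $w, $prob->{$w};
}
```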
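For the optional Parts 3 and 4, one possible sketch: count bigrams (using a made-up <s> marker for the start of each sentence), estimate P(w2 | w1) = count(w1 w2) / count(w1), and generate a sentence by repeatedly sampling the next word in proportion to its bigram count, in the Shannon style the JM reading describes. The two-sentence corpus here is only for illustration:

```perl
#! /usr/bin/perl -w
use strict;

my (%bigram, %unigram);

# Count the bigrams in one sentence, prefixing the <s> start marker.
sub count_bigrams {
    my @tokens = ('<s>', @_);
    for my $i (0 .. $#tokens - 1) {
        $bigram{ $tokens[$i] }{ $tokens[$i + 1] }++;
        $unigram{ $tokens[$i] }++;   # count of w1 as a bigram's left member
    }
}

# P(w2 | w1) = count(w1 w2) / count(w1)
sub bigram_prob {
    my ($w1, $w2) = @_;
    return 0 unless $unigram{$w1} and $bigram{$w1}{$w2};
    return $bigram{$w1}{$w2} / $unigram{$w1};
}

# Starting from <s>, repeatedly pick a next word with probability
# proportional to its bigram count; stop when the current word was
# never seen as a left member, or after 20 words so we always halt.
sub random_sentence {
    my $w = '<s>';
    my @out;
    for (1 .. 20) {
        last unless $unigram{$w};
        my $r = rand($unigram{$w});
        my $next;
        for my $cand (keys %{ $bigram{$w} }) {
            $r -= $bigram{$w}{$cand};
            if ($r < 0) { $next = $cand; last; }
        }
        last unless defined $next;
        push @out, $next;
        $w = $next;
    }
    return "@out";
}

# Toy corpus; for the assignment, call count_bigrams once per line
# of the real corpus instead.
count_bigrams(split ' ', $_) for ("the cat sat", "the cat ran");
print random_sentence(), "\n";
```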