15th November 2011
Use bigram language models built from the two separate article categories to estimate the probability that the following sentences would occur in articles about "politics" or articles about "business". You could use this approach as a way of telling whether part of a new article was more likely about politics or about business:
The company refused to protect the British public.
The Chinese government plans are met.
For this part, I don't mind what combination of programming, using standard utilities and working by hand you use. The important thing is that you get the right results (efficiency is irrelevant) and that I can tell that your method was appropriate. Please include in your submission all information that I would need to check on and replicate your work. Please don't make use of any specialised programs for counting ngrams etc that you might be able to find on the web.
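One way the bigram comparison could be set up is sketched below. It is only an illustration under stated assumptions: the two tiny training lists are invented stand-ins for the real article categories, the tokenizer is deliberately crude, and add-one (Laplace) smoothing is an assumption made here so that unseen bigrams do not get zero probability (the assignment itself does not prescribe a smoothing method). All function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, add sentence boundaries."""
    words = [w.strip('.,;:!?"()') for w in text.lower().split()]
    return ["<s>"] + [w for w in words if w] + ["</s>"]

def train_bigram_model(sentences):
    """Return (unigram counts, bigram counts) for a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, model, vocab_size):
    """Log probability of a sentence under a bigram model.

    Assumption: add-one smoothing, P(w | prev) = (c(prev, w) + 1) / (c(prev) + V).
    """
    unigrams, bigrams = model
    tokens = tokenize(sentence)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

def classify(sentence, models):
    """Return (best label, per-label log probabilities), using a shared vocabulary."""
    vocab = set()
    for unigrams, _ in models.values():
        vocab.update(unigrams)
    scores = {label: log_prob(sentence, model, len(vocab))
              for label, model in models.items()}
    return max(scores, key=scores.get), scores

# Invented toy data, standing in for the two article categories.
POLITICS = ["The government plans to protect the public.",
            "The British government refused the plans."]
BUSINESS = ["The company plans to raise prices.",
            "The company refused the offer."]
MODELS = {"politics": train_bigram_model(POLITICS),
          "business": train_bigram_model(BUSINESS)}

if __name__ == "__main__":
    label, scores = classify("The company refused to protect the British public.", MODELS)
    print(label, scores)
```

Working in log probabilities avoids underflow on longer sentences, and sharing one vocabulary size across both models keeps the two scores comparable.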
For this part, you will also have to solve some technical questions, for instance:
For this assignment, you don't need to download or run LT POS.
In the same way as in the last part, build unigram and bigram models for the compound words. Are there any important differences in the most frequent patterns? Where would you expect in advance that there would be differences, and can you find evidence of this happening?
As in the first part, use the language models to make a decision about those two sentences (you will have to assign the parts of speech to these sentences by hand).
( and ) are for grouping
* means any number of occurrences (including none)
+ means at least one occurrence
? means at most one occurrence
{a, b, c} represents a or b or c
@ matches any word or part of speech
a,b,c means a followed by b followed by c
For instance, the following pattern:
pushed_@, up_RP, @_DT, {@_JJ, @_NN}*, price_NN
matches the word "pushed" (with any part of speech), followed by the word "up" (as a particle), followed by any determiner, followed by any number of adjectives or nouns, followed by the word "price" (as a noun).
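To make the pattern notation concrete, here is a minimal sketch of one possible way to test a pattern against tagged text: it translates the pattern into a Python regular expression over a string of word_TAG tokens. The function names are hypothetical, and the sketch assumes that every pattern atom has the form word_TAG, that tokens contain no spaces, and that no token itself contains one of the pattern's special characters.

```python
import re

# Pattern tokens: single special characters, or an atom such as pushed_@.
TOKEN_RE = re.compile(r"[{}()*+?,]|[^\s{}()*+?,]+")
ANY = r"[^_ ]+"  # what @ stands for: any word or any tag

def compile_pattern(pattern):
    """Translate the pattern notation into a compiled regular expression."""
    parts = [r"(?<![^ ])"]  # a match must start at a token boundary
    stack = []              # tracks whether we are inside { } or ( )
    for tok in TOKEN_RE.findall(pattern):
        if tok in ("{", "("):
            stack.append(tok)
            parts.append("(?:")
        elif tok in ("}", ")"):
            stack.pop()
            parts.append(")")
        elif tok == ",":
            # A comma means "or" directly inside { }, and "followed by" elsewhere.
            if stack and stack[-1] == "{":
                parts.append("|")
        elif tok in ("*", "+", "?"):
            parts.append(tok)
        else:
            word, sep, tag = tok.rpartition("_")
            if not sep:
                raise ValueError(f"expected word_TAG, got {tok!r}")
            word_re = ANY if word == "@" else re.escape(word)
            tag_re = ANY if tag == "@" else re.escape(tag)
            parts.append(f"(?:{word_re}_{tag_re} )")  # each token ends with a space
    return re.compile("".join(parts))

def find_match(pattern, tagged_tokens):
    """Return the first matching phrase as a string, or None if there is none."""
    text = " ".join(tagged_tokens) + " "
    match = compile_pattern(pattern).search(text)
    return match.group().strip() if match else None

if __name__ == "__main__":
    pattern = "pushed_@, up_RP, @_DT, {@_JJ, @_NN}*, price_NN"
    sentence = "He_PRP pushed_VBD up_RP the_DT retail_JJ oil_NN price_NN".split()
    print(find_match(pattern, sentence))
```

Leaning on the regular expression engine means that backtracking over *, + and ? comes for free, at the cost of the token-format assumptions listed above.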
For this part of the exercise, do not include in your 10 patterns simple unigrams or bigrams of specific compound words. For each pattern, give an example of a phrase that would match it. Your aim here is to define general patterns that will be useful for diagnosing the category of any new article (even if its topic is rather different from any of those for the existing articles). So make sure that your patterns will match many examples. Make sure also that you exploit the part of speech tags in some of your patterns (and indeed specify some parts of the phrases matched using only parts of speech, not also with specific words).
For this, you need to set up Weka with training examples corresponding to the 20 articles you have been given. Each article is represented in terms of which of your selected patterns match in that article. Each of these patterns corresponds to a feature with values yes and no (indicating whether the pattern matches or not). So your data file might look something like this (with more pattern features, and probably with more mnemonic names):
@relation meaning
@attribute pattern1 {yes, no}
@attribute pattern2 {yes, no}
@attribute pattern3 {yes, no}
@attribute meaning {politics, business}
@data
yes, no, no, politics
no, no, no, business
yes, yes, no, politics
(the first article, a politics article, matches the first pattern only;
the second one, a business article, matches none of the patterns, etc).
Use whatever methods you wish (e.g. write a program, do it by hand, etc)
to test whether each pattern matches each article. Describe how you did
this in your report. Consider adding new training articles yourself.
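If you record the pattern matches programmatically, the data file can be generated rather than typed by hand. The sketch below is one possible way to do that; the function name and its arguments are hypothetical, and it writes attribute lines in the documented ARFF form "@attribute name {yes, no}". The three rows reproduce the example articles described above.

```python
def to_arff(relation, pattern_names, rows,
            class_name="meaning", classes=("politics", "business")):
    """Build the text of an ARFF file.

    rows is a list of (matches, label) pairs, where matches holds one
    boolean per pattern, in the same order as pattern_names.
    """
    lines = [f"@relation {relation}"]
    for name in pattern_names:
        lines.append(f"@attribute {name} {{yes, no}}")
    class_values = ", ".join(classes)
    lines.append(f"@attribute {class_name} {{{class_values}}}")
    lines.append("@data")
    for matches, label in rows:
        if len(matches) != len(pattern_names):
            raise ValueError("each row needs one value per pattern")
        values = ["yes" if matched else "no" for matched in matches]
        lines.append(", ".join(values + [label]))
    return "\n".join(lines) + "\n"

# The three example articles from the text above.
EXAMPLE_ROWS = [
    ([True, False, False], "politics"),
    ([False, False, False], "business"),
    ([True, True, False], "politics"),
]

if __name__ == "__main__":
    arff_text = to_arff("meaning", ["pattern1", "pattern2", "pattern3"], EXAMPLE_ROWS)
    print(arff_text)
```

The returned string can be written to a file with a .arff extension and opened in Weka.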
What rules does Weka produce? Is this actually
useful? Comment on the advantages and disadvantages of building classifiers
in this way.
Note that the strength of a system like Weka comes from being able to select from a large set of potentially relevant features (none of which are 100% correlated with a given classification decision) the combination that gives the best performance when taken together. So you will get the most out of this part if you consider a number of different patterns, even ones which sometimes match articles of the "other" category than the one they were designed for.