CS4025: Continuous Assessment, 2011

15th November 2011

Submission Instructions

What you should submit
A single .zip file, which must include a file report.pdf: your report on what you did for Tasks 1 to 4 below. Also include in the .zip file whatever other files (e.g. program source files) we need in order to understand in detail what you did.

Submissions in any other form will be rejected. Please do not submit floppy disks, CDs, or any other physical media.

How you should submit
  1. Email me (once) with your zip file as an attachment.
  2. Hand in two (2) hardcopies (printouts) of each of the files in your zip file, with one single cover sheet properly filled out and signed. You can get cover sheets at the Departmental Reception.

You should deposit your hardcopies and the cover sheet into one of the boxes opposite the departmental kitchen (the correct box will be identified).

When you should submit
The deadline is Friday 9th December, 12 noon. You should email me your files as well as deposit your hardcopy in the appropriate box by this deadline.

Lateness penalties: work submitted up to one day late attracts a penalty of 10%; up to one week late, 25%. Work handed in more than one week late is marked and returned but is counted as a 'no paper'.

Classifying news stories

This folder contains text from 40 different news stories, cut and pasted (in a rough way) from articles on the web. The news stories are classified as relating either to "politics" or to "business" according to the original source (the text was collected on 1st Nov 2005 from the BBC news and Yahoo news web sites, 10 articles of each category from each source). The goal of this assignment is to investigate ways in which one might automatically classify new articles into these two categories, taking inspiration from whatever patterns can be found in these examples.

Part 1: Ngram models (40%)

Find the frequencies of all unigrams and bigrams in both the "politics" articles and the "business" articles, and say how you have done this. Comment on some of the differences in the most common patterns between the two article categories. Which of these differences do you think are real indicators of article category, and which arise merely because the small set of articles selected happens to cover certain topics?

Use bigram language models built from the two separate article categories to estimate the probability that the following sentences would occur in articles about "politics" or articles about "business". You could use this approach as a way of telling whether part of a new article is more likely to be about politics or about business:

The company refused to protect the British public.
The Chinese government plans are met.
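
(As a reminder, a bigram model approximates the probability of a sentence w1 ... wn as P(w1) * P(w2 | w1) * ... * P(wn | w(n-1)), where each conditional probability P(wi | w(i-1)) is estimated from your counts as count(w(i-1) wi) / count(w(i-1)). Compare the value this gives under the "politics" model with the value under the "business" model.)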

For this part, I don't mind what combination of programming, standard utilities, and work by hand you use. The important thing is that you get the right results (efficiency is irrelevant) and that I can tell that your method was appropriate. Please include in your submission all the information I would need to check and replicate your work. Please don't make use of any specialised ngram-counting programs that you might be able to find on the web.
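
To illustrate, here is a minimal Python sketch of one possible approach. The lower-casing, the whitespace tokenisation, and the add-one smoothing are my assumptions, and exactly the kind of decisions you would need to make and justify yourself:

import math
from collections import Counter

def ngram_counts(text):
    # naive tokenisation: lower-case and split on whitespace
    tokens = text.lower().split()
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def sentence_logprob(sentence, unigrams, bigrams):
    tokens = sentence.lower().split()
    vocab_size = len(unigrams)
    logp = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        # add-one smoothed P(w2 | w1), so that an unseen bigram does not
        # force the probability of the whole sentence to zero
        logp += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
    return logp

Building one model per category and comparing the two log probabilities then gives the classification decision: the category whose model scores the sentence higher is the more likely source.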

For this part, you will also have to settle some technical questions, for instance: how should the text be tokenised (what do you do with punctuation and capitalisation)? And how should you estimate the probability of a sentence containing a bigram that never occurs in the training articles?

Explain what decisions you make for these and other issues that arise, and why you decide this way.

Part 2: Using part of speech information (20%)

The files in this folder are for the same news articles as in the last part, except that they have been part-of-speech tagged by a system called LT POS (developed at Edinburgh). The tagset used is documented in the accompanying file. The words in the files have been replaced by compound words of the form <word>_<tag>. Notice that for this to have happened, LT POS also had to decide how to tokenise the input; you can work out how it did this by looking at the example files.

For this assignment, you don't need to download or run LT POS.

In the same way as in the last part, build unigram and bigram models for the compound words. Are there any important differences in the most frequent patterns? Where would you expect in advance that there would be differences, and can you find evidence of this happening?
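
The counting from Part 1 carries over directly to the compound tokens. As one possible sketch, a way to probe whether any differences are syntactic rather than topical is to also count tag-only bigrams by stripping off the word part (the rsplit trick below is my assumption about the token format):

from collections import Counter

def tagged_counts(text):
    tokens = text.split()                          # tokens like "price_NN"
    tags = [t.rsplit("_", 1)[-1] for t in tokens]  # keep only the tag, e.g. "NN"
    compound_bigrams = Counter(zip(tokens, tokens[1:]))
    tag_bigrams = Counter(zip(tags, tags[1:]))
    return compound_bigrams, tag_bigrams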

As in the first part, use the language models to make a decision about those two sentences (you will have to assign the parts of speech to these sentences by hand).

Part 3: Defining relevant patterns (20%)

For each of the two article categories, define 5 patterns that you feel may be useful in detecting an article of this category. A pattern should be expressed as a regular expression involving sequences of compound words in the tagged version of the articles. Use the following notation for regular expressions:
( and ) are for grouping
* means any number of occurrences (including none)
+ means at least one occurrence
? means at most one occurrence
{a, b, c} represents a or b or c
@ matches any word or part of speech
a,b,c means a followed by b followed by c
For instance, the following pattern:
pushed_@, up_RP, @_DT, {@_JJ, @_NN}*, price_NN
matches the word "pushed" (with any part of speech), followed by the word "up" (tagged RP), followed by any determiner, followed by any number of adjectives or nouns, followed by the word "price" (as a noun).
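
If you choose to check your patterns by program rather than by hand, the notation above translates fairly directly into standard regular expressions over a space-separated token stream. Here is a sketch of that translation for the example pattern, in Python (my own rendering, assuming each tagged article is read in as a single string):

import re

# The example pattern  pushed_@, up_RP, @_DT, {@_JJ, @_NN}*, price_NN
# translated into a Python regular expression; "@" becomes \S+ here.
PATTERN = re.compile(
    r"(?<!\S)"                # start at a token boundary
    r"pushed_\S+\s+"          # pushed_@      : "pushed" with any tag
    r"up_RP\s+"               # up_RP         : "up" tagged RP
    r"\S+_DT\s+"              # @_DT          : any determiner
    r"(?:\S+_(?:JJ|NN)\s+)*"  # {@_JJ, @_NN}* : any number of adjectives/nouns
    r"price_NN(?!\S)"         # price_NN      : the noun "price"
)

tagged = "Demand_NN pushed_VBD up_RP the_DT average_JJ house_NN price_NN ._."
print(bool(PATTERN.search(tagged)))   # prints True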

For this part of the exercise, do not include in your 10 patterns simple unigrams or bigrams of specific compound words. For each pattern, give an example of a phrase that would match it. Your aim here is to define general patterns that will be useful for diagnosing the category of any new article (even if its topic is rather different from the topics of the existing articles). So make sure that your patterns will match many examples. Make sure also that you exploit the part-of-speech tags in some of your patterns (indeed, specify some parts of the matched phrases using only parts of speech, without specific words).

Part 4: Learning a classifier (20%)

Given the 10 patterns you produced above, together with any other patterns you choose to devise (e.g. ones expressing unigrams or bigrams that may be indicative of an article's category), use Weka (as in Practical 7) with the JRip machine-learning algorithm to learn a set of rules for classifying an article.

For this, you need to set up Weka with training examples corresponding to the 40 articles you have been given. Each article is represented in terms of which of your selected patterns match in that article. Each pattern corresponds to a feature with values yes and no (indicating whether the pattern matches or not). So your data file might look something like this (with more pattern features, and probably with more mnemonic names):

@relation meaning

@attribute pattern1 {yes, no}
@attribute pattern2 {yes, no}
@attribute pattern3 {yes, no}
@attribute meaning {politics, business}

@data
yes, no, no, politics
no, no, no, business
yes, yes, no, politics

(the first article, a politics article, matches the first pattern only; the second one, a business article, matches none of the patterns, etc). Use whatever methods you wish (e.g. write a program, do it by hand, etc) to test whether each pattern matches each article. Describe how you did this in your report. Consider adding new training articles yourself. What rules does Weka produce? Is this actually useful? Comment on the advantages and disadvantages of building classifiers in this way.
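
If you decide to automate this step, here is one possible sketch in Python; the politics/ and business/ folder names and the pattern dictionary are my own placeholders, not part of the assignment materials:

import re
from pathlib import Path

# Map a mnemonic feature name to a compiled regular expression over the
# tagged token stream (fill in your own Part 3 patterns here).
PATTERNS = {
    "pushed_up_price": re.compile(
        r"pushed_\S+\s+up_RP\s+\S+_DT\s+(?:\S+_(?:JJ|NN)\s+)*price_NN"),
    # ... the rest of your patterns
}

def write_arff(out_path="articles.arff"):
    lines = ["@relation meaning", ""]
    lines += [f"@attribute {name} {{yes, no}}" for name in PATTERNS]
    lines += ["@attribute meaning {politics, business}", "", "@data"]
    for category in ("politics", "business"):
        for article in sorted(Path(category).glob("*.txt")):
            text = article.read_text()
            row = ["yes" if p.search(text) else "no" for p in PATTERNS.values()]
            lines.append(", ".join(row + [category]))
    Path(out_path).write_text("\n".join(lines) + "\n")

write_arff()

The resulting articles.arff file can then be loaded directly into the Weka Explorer.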

Note that the strength of a system like Weka comes from being able to select, from a large set of potentially relevant features (none of which is 100% correlated with a given classification decision), the combination that gives the best performance when taken together. So you will get the most out of this part if you consider a number of different patterns, even ones that sometimes match articles of the category other than the one they were designed for.