Assignment 3

Posterior Probabilities and Nearest Neighbor Classifier

Max possible score:

Complete any one of the tasks. Complete both for 50 points of extra credit (EC).


Task 1 

50 points

The task in this part is to implement the following system. As in the slides provided here, there are five types of bags of candies. Each bag contains an infinite number of candies. We have one of those bags, and we are picking candies out of it. We do not know which type of bag we have, so we want to figure out the probability of each type based on the candies that we have picked.

The five possible hypotheses for our bag are:

NOTE: These numbers can be hardcoded into your program
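The quantities the program reports follow from Bayes' rule. Assuming, as in the slides, that candies are drawn independently given the bag type, the posterior after an observation sequence Q and the predictive probability for the next candy are:

P(h_i \mid Q) = \frac{P(Q \mid h_i)\, P(h_i)}{\sum_{j=1}^{5} P(Q \mid h_j)\, P(h_j)}, \qquad P(Q \mid h_i) = \prod_{t=1}^{|Q|} P(q_t \mid h_i)

P(\mathrm{next} = C \mid Q) = \sum_{i=1}^{5} P(C \mid h_i)\, P(h_i \mid Q)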

Command Line arguments:

The program takes a single command line argument, which is a string, for example CCLLLLLLCCCC. This string represents a sequence of observations, i.e., a sequence of candies that we have already picked. Each character is C if we picked a cherry candy, and L if we picked a lime candy. Assuming that characters in the string are numbered starting with 1, the i-th character of the string corresponds to the i-th observation. The program should be invoked from the command line as follows:
compute_posterior <observations>
For example:
compute_posterior CCLLLLLLCCCC
We also allow the case of no command line argument at all; this represents the case where we have made no observations yet.
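A minimal sketch of this argument handling in Python (one of the accepted languages) might look like the following; the empty string is simply how this sketch represents the case of no observations yet:

import sys

# The observation sequence; an empty string means no observations yet.
observations = sys.argv[1] if len(sys.argv) > 1 else ""

# Basic sanity check: only C (cherry) and L (lime) are valid observations.
if any(ch not in "CL" for ch in observations):
    sys.exit("error: observations must contain only the characters C and L")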

Output:

Your program should create a text file called "result.txt", formatted exactly as shown below. ??? is used where your program should print values that depend on its command line argument. At least five decimal places should appear for any floating-point number.
Observation sequence Q: ???
Length of Q: ???

Before Observations:

P(h1) = 0.1
P(h2) = 0.2
P(h3) = 0.4
P(h4) = 0.2
P(h5) = 0.1

Probability that the next candy we pick will be C, given Q: 0.5
Probability that the next candy we pick will be L, given Q: 0.5


After Observation ??? = ???: (This and all remaining lines are repeated for every observation)

P(h1 | Q) = ???
P(h2 | Q) = ???
P(h3 | Q) = ???
P(h4 | Q) = ???
P(h5 | Q) = ???

Probability that the next candy we pick will be C, given Q: ???
Probability that the next candy we pick will be L, given Q: ???

Sample output for CCLLLLLLCCCC is given here.
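To illustrate the computation, here is a rough Python sketch of the whole task. The priors match the "Before Observations" block above, but the per-hypothesis cherry proportions (P_CHERRY) are placeholder assumptions, since the actual numbers are given on the slides rather than in this write-up; take the exact output formatting from the specification and the sample output, not from this sketch.

# Sketch of the Task 1 computation; P_CHERRY holds ASSUMED placeholder
# likelihoods -- replace them with the numbers given on the slides.
import sys

PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5), as listed above
P_CHERRY = [1.0, 0.75, 0.5, 0.25, 0.0]   # ASSUMED P(C | h_i); check the slides

def predict(post):
    """Probability that the next candy is C (and L) given current posteriors."""
    p_c = sum(p * pc for p, pc in zip(post, P_CHERRY))
    return p_c, 1.0 - p_c

observations = sys.argv[1] if len(sys.argv) > 1 else ""
posteriors = PRIORS[:]

with open("result.txt", "w") as out:
    out.write("Observation sequence Q: %s\n" % observations)
    out.write("Length of Q: %d\n\n" % len(observations))
    out.write("Before Observations:\n\n")
    for i, p in enumerate(posteriors, start=1):
        out.write("P(h%d) = %.5f\n" % (i, p))
    p_c, p_l = predict(posteriors)
    out.write("\nProbability that the next candy we pick will be C, given Q: %.5f\n" % p_c)
    out.write("Probability that the next candy we pick will be L, given Q: %.5f\n\n" % p_l)

    for n, candy in enumerate(observations, start=1):
        # Multiply each posterior by the likelihood of this observation ...
        likelihood = P_CHERRY if candy == "C" else [1.0 - pc for pc in P_CHERRY]
        posteriors = [p * lk for p, lk in zip(posteriors, likelihood)]
        # ... and renormalize so the posteriors sum to 1.
        total = sum(posteriors)
        posteriors = [p / total for p in posteriors]

        out.write("After Observation %d = %s:\n\n" % (n, candy))
        for i, p in enumerate(posteriors, start=1):
            out.write("P(h%d | Q) = %.5f\n" % (i, p))
        p_c, p_l = predict(posteriors)
        out.write("\nProbability that the next candy we pick will be C, given Q: %.5f\n" % p_c)
        out.write("Probability that the next candy we pick will be L, given Q: %.5f\n\n" % p_l)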


Task 2

50 points

In this part you will implement nearest neighbor classifiers. Your program will classify test objects by comparing them to training objects and finding their nearest neighbor(s).

Command Line arguments:

The command line invocation will be as follows

nn_classify <training-file> <test-file> [<k>]

training-file: File with Training Dataset (More information below)
test-file: File with Testing Dataset (More information below)
k: Number of nearest neighbors to use (default of 1 if not given)

Both the training file and the test file are text files, containing data in tabular format. Each value is a number, and values are separated by white space. The i-th row and j-th column contain the value for the j-th feature of the i-th object. The only exception is the LAST column, which stores the class label for each object.
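For example, a minimal Python sketch of reading such a file could look like this (load_dataset is a hypothetical helper name, not a required interface):

def load_dataset(path):
    """Read a whitespace-separated table; return (features, labels) lists."""
    features, labels = [], []
    with open(path) as f:
        for line in f:
            values = line.split()
            if not values:            # skip blank lines
                continue
            *feats, label = values
            features.append([float(v) for v in feats])
            labels.append(float(label))   # last column is the class label
    return features, labels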

Example files that can be passed as command-line arguments are in the datasets directory. That directory contains three datasets, copied from the UCI repository of machine learning datasets:

NOTE: If your program has issues reading the regular files (*_training.txt and *_test.txt), try using the whitespace-adjusted versions (*_training_adj.txt and *_test_adj.txt).

For each dataset, a training file and a test file are provided. The name of each file indicates what dataset the file belongs to, and whether the file contains training or test data.

Output:

Create a file result.txt.
For each test object, you should print a line in it containing the following info:

Please also output the overall classification accuracy (the average of all the accuracy values from above) to standard output.

Sample output for the pendigits dataset with a 3-nearest neighbor classifier is given here.
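As a rough illustration, here is a Python sketch of the classification step, reusing the load_dataset helper sketched in the file-format section above. It assumes Euclidean distance and simple majority voting, does not handle ties in any particular way, and the per-object output line is only a placeholder; follow the exact line format and accuracy definition required above.

import sys
from collections import Counter

def classify(train_x, train_y, x, k):
    """Return the majority class among the k nearest training objects."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    training_file, test_file = sys.argv[1], sys.argv[2]
    k = int(sys.argv[3]) if len(sys.argv) > 3 else 1   # default k = 1

    train_x, train_y = load_dataset(training_file)     # helper sketched above
    test_x, test_y = load_dataset(test_file)

    correct = 0.0
    with open("result.txt", "w") as out:
        for i, (x, true_label) in enumerate(zip(test_x, test_y)):
            predicted = classify(train_x, train_y, x, k)
            accuracy = 1.0 if predicted == true_label else 0.0
            correct += accuracy
            # Placeholder per-object line; use the required format instead.
            out.write("ID=%d, predicted=%d, true=%d, accuracy=%.2f\n"
                      % (i, predicted, true_label, accuracy))
    print("classification accuracy = %.4f" % (correct / len(test_x)))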


How to submit

Implementations in C, C++, Java, and Python will be accepted. Points will be taken off for failure to comply with this requirement unless previously cleared with the Instructor.

Create a ZIPPED directory called <net-id>_assmt3.zip (no other forms of compression are accepted; contact the instructor or TA if you do not know how to produce .zip files).
The directory should contain the source code for each task in separate folders titled task1 and task2 (no need for any compiled binaries). Each folder should also contain a file called readme.txt, which should specify precisely: