Assignment 3
Posterior Probabilities and Nearest Neighbor Classifier
Max possible score:
- 4308: 50 Points [+50 Points possible EC]
- 5360: 50 Points [+50 Points possible EC]
Complete any one of the tasks. Complete both for 50 Points EC
Task 1
50 points
The task in this part is to implement a system that:
- Can determine the posterior probability of different hypotheses, given priors for these hypotheses, and given a sequence of observations.
- Can determine the probability that the next observation will be of a specific type, given priors for different hypotheses, and given a sequence of observations.
As in the slides provided here, there are five types of bags of candies. Each bag has an infinite number of candies. We have one of those bags, and we are picking candies out of it. We don't know what type of bag we have, so we want to figure out the probability of each type based on the candies that we have picked.
The five possible hypotheses for our bag are:
- h1 (prior: 10%): This type of bag contains 100% cherry candies.
- h2 (prior: 20%): This type of bag contains 75% cherry candies and 25% lime candies.
- h3 (prior: 40%): This type of bag contains 50% cherry candies and 50% lime candies.
- h4 (prior: 20%): This type of bag contains 25% cherry candies and 75% lime candies.
- h5 (prior: 10%): This type of bag contains 100% lime candies.
NOTE: These numbers can be hardcoded into your program.
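The posterior computation is a repeated application of Bayes' rule: after each candy, P(h | Q) is proportional to P(candy | h) times the previous P(h). A minimal Python sketch (not the full required program), with the priors and per-hypothesis cherry probabilities hardcoded as the note above allows:

```python
# Hardcoded from the assignment: priors and P(C | h) for h1..h5.
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) .. P(h5)
P_CHERRY = [1.0, 0.75, 0.5, 0.25, 0.0]  # P(C | h1) .. P(C | h5)

def posteriors(observations):
    """Bayes' rule per candy: P(h | Q) is proportional to P(candy | h) * P(h)."""
    probs = list(PRIORS)
    for candy in observations:
        likelihoods = [p if candy == 'C' else 1.0 - p for p in P_CHERRY]
        probs = [pr * lk for pr, lk in zip(probs, likelihoods)]
        total = sum(probs)
        probs = [p / total for p in probs]  # normalize so the five sum to 1
    return probs
```

With no observations the function simply returns the priors; after observing a single C, for example, h5 drops to zero because that bag contains no cherry candies.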
Command line arguments:
The program takes a single command line argument, which is a string, for example CCLLLLLLCCCC. This string represents a sequence of observations, i.e., a sequence of candies that we have already picked. Each character is C if we picked a cherry candy, and L if we picked a lime candy. Assuming that characters in the string are numbered starting with 1, the i-th character of the string corresponds to the i-th observation.
The program should be invoked from the command line as follows:
compute_posterior <observations>
For example:
compute_posterior CCLLLLLLCCCC
We also allow the case of not having a command line argument at all; this represents the case where we have made no observations yet.
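The missing-argument case can be handled with a simple length check; a sketch (the helper name `get_observations` is mine, not part of the spec):

```python
def get_observations(argv):
    """Given sys.argv, return the observation string; a missing
    argument means no observations yet (empty sequence Q)."""
    return argv[1] if len(argv) > 1 else ""
```

Typical usage would be `get_observations(sys.argv)`, so both `compute_posterior` and `compute_posterior CCLLLLLLCCCC` are accepted.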
Output:
Your program should create a text file called "result.txt" that is formatted exactly as shown below. ??? is used where your program should print values that depend on its command line argument. At least five digits after the decimal point should appear for any floating point number.
Observation sequence Q: ???
Length of Q: ???
Before Observations:
P(h1) = 0.1
P(h2) = 0.2
P(h3) = 0.4
P(h4) = 0.2
P(h5) = 0.1
Probability that the next candy we pick will be C, given Q: 0.5
Probability that the next candy we pick will be L, given Q: 0.5
After Observation ??? = ???: (This and all remaining lines are repeated for every observation)
P(h1 | Q) = ???
P(h2 | Q) = ???
P(h3 | Q) = ???
P(h4 | Q) = ???
P(h5 | Q) = ???
Probability that the next candy we pick will be C, given Q: ???
Probability that the next candy we pick will be L, given Q: ???
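The two "next candy" lines follow from the predictive distribution: P(next = C | Q) is the sum over hypotheses of P(C | h) times P(h | Q). A sketch, reusing the hardcoded cherry probabilities and showing one way to meet the five-decimal-digit requirement:

```python
# Hardcoded from the assignment: P(C | h1) .. P(C | h5).
P_CHERRY = [1.0, 0.75, 0.5, 0.25, 0.0]

def next_candy_probs(posteriors):
    """P(next = C | Q) = sum over i of P(C | h_i) * P(h_i | Q)."""
    p_c = sum(pc * ph for pc, ph in zip(P_CHERRY, posteriors))
    return p_c, 1.0 - p_c

# With the priors alone (no observations yet) the prediction is 0.5 / 0.5,
# matching the "Before Observations" block above.  "%.5f" prints the
# required five decimal digits.
p_c, p_l = next_candy_probs([0.1, 0.2, 0.4, 0.2, 0.1])
line = "Probability that the next candy we pick will be C, given Q: %.5f" % p_c
```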
Sample output for CCLLLLLLCCCC is given here.
Task 2
50 points
In this part you will implement nearest neighbor classifiers. Your program will classify test objects by comparing them to training objects and finding their nearest neighbor(s).
Command line arguments:
The command line invocation will be as follows:
nn_classify <training-file> <test-file> [<k>]
training-file: file with the training dataset (more information below)
test-file: file with the test dataset (more information below)
k: number of nearest neighbors to use (defaults to 1 if not given)
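The optional third argument with its default can be handled as in this sketch (the helper name `parse_args` is mine, not part of the spec):

```python
def parse_args(argv):
    """argv holds [training-file, test-file] and optionally k.
    k defaults to 1 when the third argument is absent."""
    training_file, test_file = argv[0], argv[1]
    k = int(argv[2]) if len(argv) > 2 else 1
    return training_file, test_file, k
```

Typical usage would be `parse_args(sys.argv[1:])`.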
Both the training file and the test file are text files containing data in tabular format. Each value is a number, and values are separated by white space. The i-th row and j-th column contain the value for the j-th feature of the i-th object. The only exception is the LAST column, which stores the class label for each object.
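Reading such a file in Python might look like the following sketch; `load_dataset` is an illustrative name, and labels are parsed as numbers since every value in the file is numeric:

```python
def load_dataset(path):
    """Read whitespace-separated rows; the last column is the class label."""
    features, labels = [], []
    with open(path) as f:
        for line in f:
            values = line.split()
            if not values:          # skip blank lines
                continue
            features.append([float(v) for v in values[:-1]])
            labels.append(float(values[-1]))
    return features, labels
```

Using `split()` with no argument handles any run of spaces or tabs, which should make the regular and whitespace-adjusted files read the same way.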
Example files that can be passed as command-line arguments are in the datasets directory. That directory contains three datasets, copied from the UCI repository of machine learning datasets:
NOTE: If your program has issues reading the regular files (*_training.txt and *_test.txt), try using the whitespace-adjusted versions (*_training_adj.txt and *_test_adj.txt).
- The pendigits dataset, containing data for pen-based recognition of handwritten digits.
- 7494 training objects.
- 3498 test objects.
- 16 attributes.
- 10 classes.
- The satellite dataset. The full name of this dataset is Statlog (Landsat Satellite) Data Set, and it contains data for classification of pixels in satellite images.
- 4435 training objects.
- 2000 test objects.
- 36 attributes.
- 6 classes.
- The yeast dataset, containing some biological data whose purpose I do not understand myself.
- 1000 training objects.
- 484 test objects.
- 8 attributes.
- 10 classes.
For each dataset, a training file and a test file are provided. The name of each file indicates what dataset the file belongs to, and whether the file contains training or test data.
Output:
Create a file result.txt. For each test object you should print a line in it containing the following info:
- object ID. This is the line number where that object occurs in the test file. Start with 0 in numbering the objects, not with 1.
- predicted class (the result of the classification. Note: If your classification result is a tie among two or more
classes, choose one of them randomly).
- true class (from the last column of the test file).
- accuracy. This is defined as follows:
- If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
- If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
- If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
- If there were ties in your classification result, and the correct class was NOT one of the classes that tied for best, the accuracy is 0.
Please also output the overall classification accuracy (the average of all the accuracy values above) to standard output.
Sample output for pendigits dataset with 3-nearest neighbor classifier is given here.
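Putting the classification and the tie-aware accuracy rule together, a sketch (assuming Euclidean distance, which is the usual choice when the assignment does not specify one; `math.dist` requires Python 3.8+):

```python
import math
import random
from collections import Counter

def classify(train_X, train_y, x, k):
    """Return (predicted class, list of classes tied for best) for point x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    best = max(votes.values())
    tied = sorted(c for c, v in votes.items() if v == best)
    return random.choice(tied), tied  # random pick on a tie, as required

def accuracy(tied, true_class):
    """Tie-aware accuracy as defined in the rules above."""
    return 1.0 / len(tied) if true_class in tied else 0.0
```

The overall accuracy printed to standard output would then be the mean of `accuracy(...)` over all test objects.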
How to submit
Implementations in C, C++, Java, and Python will be accepted. Points will be taken off for failure to comply with this requirement unless previously cleared with the Instructor.
Create a ZIPPED directory called <net-id>_assmt3.zip (no other forms of compression accepted; contact the instructor or TA if you do not know how to produce .zip files).
The directory should contain the source code for each task in separate folders titled task1 and task2 (no need for any compiled binaries). Each folder should also contain a file called readme.txt, which should specify precisely:
- Name and UTA ID of the student.
- What programming language is used for this task. (Make sure to also give version and subversion numbers.)
- How the code is structured.
- How to run the code, including very specific compilation instructions, if compilation is needed. Instructions such as "compile using g++" are NOT considered specific if the TA needs to do additional steps to get your code to run.
- If your code will run on the ACS Omega (not required), make a note of it in the readme file.
- Insufficient or unclear instructions will be penalized.
- Code that the Instructor cannot run gets AT MOST 75% credit (depending on whether the instructor is able to get it to run with minor edits).