Programming Assignment 6 (OPTIONAL)

Due dates:
Interim report: Monday 12/03/2012, 11:59pm
Full assignment: Thursday 12/06/2012, 11:59pm.

Summary

In this assignment you will implement decision trees, including learning decision trees from training data and applying decision trees to classify test objects.


Command-line Arguments

You must implement a program that learns a decision tree for a binary classification problem, given some training data. In particular, your program will run as follows:
dtree training_labels_file training_features_file test_features_file
The arguments provide the following information to the program:
  1. The first argument is the file name where the class labels of the training objects are stored. The i-th row is the class label of the i-th training object.
  2. The second argument is the file name where the features of the training objects are stored. The i-th row and j-th column contain the value for the j-th feature of the i-th training object. You can assume in your program that every object has seven features, and each feature is an integer between -2 and 3.
  3. The third argument is the file name where the features of the test objects are stored. The i-th row and j-th column contain the value for the j-th feature of the i-th test object. As before, you can assume in your program that every object has seven features, and each feature is an integer between -2 and 3.
Example files that can be passed as command-line arguments are training_labels.txt, training_features.txt, and test_features.txt. In case you are curious, the class labels of the test objects from test_features.txt can be seen at test_labels.txt.
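Since all three input files share the same layout (one object per row, whitespace-separated integer values), a single parsing routine can handle them. The following is a minimal sketch, assuming plain whitespace-separated values; the function name is illustrative, not prescribed by the assignment.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read a whitespace-separated file of integers into a matrix with one
// row per object. Works for both the labels file (one value per row)
// and the feature files (seven values per row).
std::vector<std::vector<int>> read_matrix(std::istream& in) {
    std::vector<std::vector<int>> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::vector<int> row;
        int v;
        while (ls >> v) row.push_back(v);
        if (!row.empty()) rows.push_back(row);
    }
    return rows;
}
```

Pass it an `std::ifstream` opened on each of the three command-line file names.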


Training

The first part of the program execution performs training of a decision tree using the training data. Your decision tree should be binary and have depth 4. In other words, any non-leaf node should have two children, and any leaf node should have depth 4 (the root is considered to have depth 1). This means that, at classification time (see description of the classification task later), each test object will be subjected to three tests. EXCEPTION: if for some node at depth < 4 you find that the node was only assigned training objects from one class (and no training objects from the other class), then you can make that node a leaf node.
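The stopping rule above can be isolated into a small helper: a node becomes a leaf either because it has reached depth 4 or because all training objects assigned to it carry the same label. A sketch, with an illustrative helper name and labels encoded as integers:

```cpp
#include <cassert>
#include <vector>

// Decide whether a node should become a leaf. depth counts from 1 at the
// root, as in the assignment; labels holds the class labels of the
// training objects assigned to this node.
bool should_be_leaf(int depth, const std::vector<int>& labels) {
    if (depth >= 4) return true;               // leaves must sit at depth 4
    if (labels.empty()) return true;           // nothing left to split on
    for (int l : labels)
        if (l != labels.front()) return false; // mixed classes: keep splitting
    return true;                               // pure node: early leaf allowed
}
```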

The question chosen at each node should simply test if a specific feature is LESS THAN a specific value. For full credit, for each node of the tree (starting at the root) you should identify the optimal question, i.e., the question leading to the highest information gain for that node, as specified in pages 659-660 of the textbook.
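One way to find the optimal question is to try every (feature, threshold) pair and keep the one with the highest gain; since each feature is an integer between -2 and 3, only a handful of thresholds matter per feature. The gain computation itself can be sketched as follows, assuming the two classes are encoded as labels 0 and 1 (the encoding and function names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Binary entropy of a set containing n1 objects of one class and n2 of
// the other. Returns 0 for empty or pure sets.
double entropy(int n1, int n2) {
    int n = n1 + n2;
    if (n == 0 || n1 == 0 || n2 == 0) return 0.0;
    double p1 = double(n1) / n, p2 = double(n2) / n;
    return -p1 * std::log2(p1) - p2 * std::log2(p2);
}

// Information gain of the question "is feature f < threshold t?".
// features[i][f] is the f-th feature of object i; labels[i] is 0 or 1.
double info_gain(const std::vector<std::vector<int>>& features,
                 const std::vector<int>& labels, int f, int t) {
    int l0 = 0, l1 = 0, r0 = 0, r1 = 0;   // class counts per branch
    for (std::size_t i = 0; i < labels.size(); ++i) {
        bool left = features[i][f] < t;
        if (labels[i] == 0) (left ? l0 : r0)++;
        else                (left ? l1 : r1)++;
    }
    int nl = l0 + l1, nr = r0 + r1, n = nl + nr;
    double h_parent = entropy(l0 + r0, l1 + r1);
    double h_split  = (double(nl) / n) * entropy(l0, l1)
                    + (double(nr) / n) * entropy(r0, r1);
    return h_parent - h_split;            // gain, per pages 659-660
}
```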

Your decision tree will have at most 7 non-leaf nodes. For each of them your program should print:


Classification

For each test object described in test_features_file, print out, on a separate line, the classification label (class1 or class2).
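Classification is a walk from the root to a leaf, answering the stored question at each internal node. A minimal sketch, assuming a node layout like the one below (the struct and field names are illustrative, not prescribed by the assignment):

```cpp
#include <cassert>
#include <vector>

// One tree node: internal nodes store the question "feature < threshold",
// leaves store a predicted class label.
struct Node {
    bool is_leaf = false;
    int feature = 0;              // index of the feature the question tests
    int threshold = 0;            // question: features[feature] < threshold?
    int label = 0;                // predicted class (for leaves)
    const Node* left = nullptr;   // "yes" branch
    const Node* right = nullptr;  // "no" branch
};

// Follow the questions from the root down to a leaf and return its label.
int classify(const Node* node, const std::vector<int>& features) {
    while (!node->is_leaf)
        node = (features[node->feature] < node->threshold) ? node->left
                                                           : node->right;
    return node->label;
}
```

For a depth-4 tree this performs the three tests mentioned in the training section (fewer if an early leaf was created).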


Interim report

The interim report should be submitted via e-mail to the instructor and the TA, and should contain the following: For purposes of grading, it is absolutely fine if your interim report simply states that you have done nothing so far (you still get the 10 points allocated for the interim report, AS LONG AS YOU SUBMIT THE REPORT ON TIME). At the same time, starting early and identifying potential bottlenecks by the deadline for the interim report is probably a good strategy for doing well in this assignment.

Grading

How to submit

Submissions should be made using Blackboard.

Implementations in C, C++, and Java will be accepted. If you would like to use another language, please first check with the instructor via e-mail. Points will be taken off for failure to comply with this requirement.

Submit a ZIPPED directory called programming6.zip (no other forms of compression accepted, contact the instructor or TA if you do not know how to produce .zip files). The directory should contain all source code. The directory should also contain a file called readme.txt, which should specify precisely:

Insufficient or unclear instructions will be penalized by up to 20 points. Code that does not run on omega machines gets AT MOST half credit (50 points).

Submission checklist