CSE 4308/5360 - Assignments
(Optional) Programming Assignment 6

Due date: Friday, August 13, 2010, 11:55pm.

Summary

In this assignment you will implement decision trees, including learning decision trees from training data and applying decision trees to classify test objects.


Command-line Arguments

You must write a program that learns a decision tree for a binary classification problem, given some training data. In particular, your program will be invoked as follows:
dtree training_labels_file training_features_file test_features_file
The arguments provide the following information to the program:
  1. The first argument is the file name where the class labels of the training objects are stored. The i-th row is the class label of the i-th training object.
  2. The second argument is the file name where the features of the training objects are stored. The i-th row and j-th column contain the value for the j-th feature of the i-th training object. You can assume in your program that every object has seven features, and each feature is an integer between -2 and 3.
  3. The third argument is the file name where the features of the test objects are stored. The i-th row and j-th column contain the value for the j-th feature of the i-th test object. As before, you can assume in your program that every object has seven features, and each feature is an integer between -2 and 3.
Example files that can be passed as command-line arguments are training_labels.txt, training_features.txt, and test_features.txt. In case you are curious, the class labels of the test objects from test_features.txt can be seen at test_labels.txt.


Training

The first part of the program execution performs training of a decision tree using the training data. Your decision tree should be binary and have depth 4. In other words, every non-leaf node should have two children, and every leaf node should be at depth 4 (the root is considered to be at depth 1). This means that, at classification time (see the description of the classification task below), each test object is subjected to three tests. EXCEPTION: if a node at depth < 4 was assigned training objects from only one class (and no training objects from the other class), you may make that node a leaf node.
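The recursive construction, including the pure-node exception, could be sketched as follows. This is Python used purely for illustration (the submission must be in C, C++, or Java), and the node fields, the `choose_test` callback, and the `default` fallback label are hypothetical names, not prescribed by the assignment; `choose_test` is assumed to return the (feature, value) test with the highest information gain:

```python
# Illustrative sketch only; actual submissions must be in C, C++, or Java.

class Node:
    def __init__(self):
        self.feature = None   # index of the feature tested at this node
        self.value = None     # the node asks: object[feature] == value ?
        self.label = None     # class label, set only at leaf nodes
        self.left = None      # child for objects where the test is true
        self.right = None     # child for objects where the test is false

def majority(labels):
    """Most frequent class label in the list."""
    return max(set(labels), key=labels.count)

def build(objects, labels, depth, choose_test, max_depth=4, default=1):
    node = Node()
    if not labels:                 # empty split: fall back to the parent's majority
        node.label = default
        return node
    # Leaf if the depth limit is reached, or (the EXCEPTION above) if every
    # training object assigned to this node has the same class.
    if depth == max_depth or len(set(labels)) == 1:
        node.label = majority(labels)
        return node
    node.feature, node.value = choose_test(objects, labels)
    t = [i for i, o in enumerate(objects) if o[node.feature] == node.value]
    f = [i for i, o in enumerate(objects) if o[node.feature] != node.value]
    fallback = majority(labels)
    node.left = build([objects[i] for i in t], [labels[i] for i in t],
                      depth + 1, choose_test, max_depth, fallback)
    node.right = build([objects[i] for i in f], [labels[i] for i in f],
                       depth + 1, choose_test, max_depth, fallback)
    return node
```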

The attribute chosen at each node should simply test if a specific feature is equal to a specific value. For full credit, for each node of the tree (starting at the root) you should identify the optimal attribute, i.e., the attribute leading to the highest information gain for that node, as specified in pages 659-660 of the textbook.
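One way to compute the gain of a candidate test "feature j equals value v" is sketched below; this is an illustration under the assumption of binary class labels (the textbook's formulation on pages 659-660 is the authoritative definition, and the actual submission must be in C, C++, or Java):

```python
# Illustrative sketch only; actual submissions must be in C, C++, or Java.
import math

def entropy(labels):
    """Entropy H of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def information_gain(objects, labels, feature, value):
    """Gain of splitting on the test object[feature] == value."""
    true_labels = [l for o, l in zip(objects, labels) if o[feature] == value]
    false_labels = [l for o, l in zip(objects, labels) if o[feature] != value]
    n = len(labels)
    remainder = (len(true_labels) / n) * entropy(true_labels) + \
                (len(false_labels) / n) * entropy(false_labels)
    return entropy(labels) - remainder

def choose_test(objects, labels):
    """Pick the (feature, value) test with the highest information gain.
    Exploits the stated guarantees: seven features, values in [-2, 3]."""
    return max(((f, v) for f in range(7) for v in range(-2, 4)),
               key=lambda fv: information_gain(objects, labels, fv[0], fv[1]))
```

Since every object has seven features and each feature value lies in [-2, 3], only 7 × 6 = 42 candidate tests need to be evaluated at each node.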

Your decision tree will have at most 7 non-leaf nodes. For each of them your program should print:


Classification

For each test object described in test_features_file print out:


Grading

How to submit

Implementations in C, C++, and Java will be accepted. If you would like to use another language, please first check with the instructor via e-mail. Points will be taken off for failure to comply with this requirement.

Submit a ZIPPED directory called programming6.zip (no other forms of compression accepted, contact the instructor or TA if you do not know how to produce .zip files). The directory should contain all source code. The directory should also contain a file called readme.txt, which should specify precisely:

Insufficient or unclear instructions will be penalized by up to 20 points. Code that does not run on omega machines gets AT MOST half credit (50 points).

Submission checklist