Programming Assignment 6

Summary

In this assignment you will implement decision trees and decision forests. Your program will learn decision trees from training data and will apply decision trees and decision forests to classify test objects.


Command-line Arguments

You must implement a program that learns a decision tree for a binary classification problem, given some training data. In particular, your program will run as follows:
dtree training_file test_file option
The arguments provide the program with the training file, the test file, and an option value (described in the Training Phase section below). Both the training file and the test file are text files containing data in tabular format. Each value is a number, and values are separated by white space. The i-th row and j-th column contain the value for the j-th feature of the i-th object. The only exception is the LAST column, which stores the class label for each object. Make sure you do not use data from the last column (i.e., the class labels) as attributes (features) in your decision tree.
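As an illustration of this file format only, here is a minimal sketch in Python that reads such a file and separates the feature columns from the class labels. The helper name load_dataset and the choice to store labels as integers are assumptions for the sketch, not requirements of the assignment.

def load_dataset(path):
    # Each line of the file is one object: whitespace-separated numbers,
    # where every column except the last is a feature and the last column
    # is the class label.
    features, labels = [], []
    with open(path) as f:
        for line in f:
            values = line.split()
            if not values:
                continue                        # skip blank lines
            row = [float(v) for v in values]
            features.append(row[:-1])           # attributes only
            labels.append(int(row[-1]))         # last column = class label
    return features, labels

For example, calling load_dataset on the first command-line argument would give you the training features and labels, and calling it on the second would give you the test data.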

Example files that can be passed as command-line arguments are in the datasets directory. That directory contains three datasets, copied from the UCI repository of machine learning datasets.

For each dataset, a training file and a test file are provided. The name of each file indicates what dataset the file belongs to, and whether the file contains training or test data.

Note that, for the purposes of your assignment, it does not matter at all where the data came from. One of the attractive properties of decision trees (and many other machine learning methods) is that they can be applied in the exact same way to many different types of data, and produce useful results.


Training Phase

The first thing that your program should do is learn a decision tree (or a decision forest) using the training data. What you train and how you do the training depend on the value of the third command-line argument, which we call "option". This option can take four possible values, all of which are described in more detail in the lecture slides titled Practical Issues with Decision Trees. Your program should follow the guidelines stated in those slides.
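As a rough illustration of the computation underlying tree training, the sketch below evaluates a candidate split using entropy-based information gain; this assumes the standard gain criterion from the slides, and the function names are illustrative. How candidate attributes and thresholds are generated and selected depends on the option you were given and must follow the slides; that part is not shown here.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a collection of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, left_labels, right_labels):
    # Entropy reduction obtained by splitting the parent set into the two
    # child sets produced by a candidate (feature, threshold) test.
    n = len(parent_labels)
    children = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - children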


Training Phase Output

After you learn your tree or forest, you should print it. Every node must be printed in breadth-first order, with left children before right children. For each node, you should print a line containing the required node information; to produce this output in a uniform manner, use the printing statements provided with the assignment.
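The sketch below only illustrates the required breadth-first order: a FIFO queue where the left child is enqueued before the right child, so left children are printed before right children at each depth. The node attributes (feature, threshold, gain, left, right), the 1-based node-numbering scheme, and the format string are placeholders; the actual fields and printing statements are the ones specified by the assignment.

from collections import deque

def print_tree(root, tree_id):
    # Breadth-first traversal of one tree; node ids here follow a common
    # 1-based scheme where the children of node n are 2n and 2n+1.
    queue = deque([(1, root)])
    while queue:
        node_id, node = queue.popleft()
        print("tree=%2d, node=%3d, feature=%2d, thr=%6.2f, gain=%f"
              % (tree_id, node_id, node.feature, node.threshold, node.gain))
        if node.left is not None:
            queue.append((2 * node_id, node.left))
        if node.right is not None:
            queue.append((2 * node_id + 1, node.right))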


Classification

For each test object (each line in the test file), print out, on a separate line, the classification label (class1 or class2). If your classification result is a tie among two or more classes, choose one of them randomly. For each test object, print a line containing the required information; to produce this output in a uniform manner, use the printing statements provided.

After you have printed the results for all test objects, print the overall classification accuracy, which is defined as the average of the per-object classification accuracies you printed out. To print the classification accuracy in a uniform manner, use the printing statements provided.
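As an illustration, the sketch below classifies each test object, breaks ties randomly, and reports the overall accuracy as the average of the per-object accuracies. It assumes a helper predict_distribution(x) that returns a score for each class, and it takes the per-object accuracy to be 1 for a correct prediction and 0 otherwise; the field names and format strings are placeholders for the printing statements specified by the assignment.

import random

def classify_and_report(predict_distribution, test_features, test_labels):
    # predict_distribution(x) is assumed to return a dict mapping each class
    # label to a score (e.g., votes from a forest or the distribution stored
    # at a leaf reached by the test object).
    accuracies = []
    for object_id, (x, true_label) in enumerate(zip(test_features, test_labels)):
        scores = predict_distribution(x)
        best = max(scores.values())
        tied = [label for label, score in scores.items() if score == best]
        predicted = random.choice(tied)           # break ties randomly
        accuracy = 1.0 if predicted == true_label else 0.0
        accuracies.append(accuracy)
        print("ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f"
              % (object_id, predicted, true_label, accuracy))
    # Overall accuracy = average of the per-object accuracies printed above.
    print("classification accuracy=%6.4f" % (sum(accuracies) / len(accuracies)))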


Grading


How to submit

Submissions should be made using Blackboard. Implementations in Python, C, C++, and Java will be accepted. If you would like to use another language, please first check with the instructor via e-mail. Points will be taken off for failure to comply with this requirement.

Submit a ZIPPED directory called programming-assignment6.zip (no other forms of compression will be accepted; contact the instructor or TA if you do not know how to produce .zip files). The directory should contain:

Insufficient or unclear instructions will be penalized by up to 20 points. Code that does not run on omega machines gets AT MOST half credit, unless you obtained prior written permission.