Assignment 3

Decision Trees and Decision Forests

Max possible score:



In this assignment you will implement decision trees and decision forests. Your program will learn decision trees from training data and will apply decision trees and decision forests to classify test objects.

Command-line Arguments

You must write a program that learns a decision tree for a binary classification problem, given some training data. In particular, your program will run as follows:
dtree training_file test_file option
The arguments specify, in order, the training file, the test file, and the training option (described in the Training Phase section below). Both the training file and the test file are text files containing data in tabular format. Each value is a number, and values are separated by whitespace. The i-th row and j-th column contain the value for the j-th feature of the i-th object. The only exception is the LAST column, which stores the class label for each object. Make sure you do not use data from the last column (i.e., the class labels) as attributes (features) in your decision tree.
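To make the input format concrete, here is a minimal sketch in Python (one of the accepted languages) of parsing the command line shown above and splitting each row into features and a class label. The names load_dataset, train_x, train_y, etc. are only illustrative, not required:

import sys

def load_dataset(path):
    """Load a whitespace-separated data file. Every column except the
    last holds a feature value; the last column holds the class label."""
    features, labels = [], []
    with open(path) as f:
        for line in f:
            values = line.split()
            if not values:
                continue                        # ignore blank lines
            row = [float(v) for v in values]
            features.append(row[:-1])           # features: all but the last column
            labels.append(int(row[-1]))         # class label: last column only
    return features, labels

if __name__ == "__main__":
    # dtree training_file test_file option
    training_file, test_file, option = sys.argv[1], sys.argv[2], sys.argv[3]
    train_x, train_y = load_dataset(training_file)
    test_x, test_y = load_dataset(test_file)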

Example files that can be passed as command-line arguments are in the datasets directory. That directory contains three datasets, copied from the UCI repository of machine learning datasets.

NOTE: If your program has issues reading the regular files (*_training.txt and *_test.txt), try using the whitespace-adjusted versions (*_training_adj.txt and *_test_adj.txt).

For each dataset, a training file and a test file are provided. The name of each file indicates what dataset the file belongs to, and whether the file contains training or test data.

Note that, for the purposes of your assignment, it does not matter at all where the data came from. One of the attractive properties of decision trees (and many other machine learning methods) is that they can be applied in the exact same way to many different types of data, and produce useful results.


Training Phase

The first thing that your program should do is train a decision tree (or decision forest, depending on the option) using the training data. What you train and how you do the training depends on the value of the third command-line argument, which we called "option". This option can take four possible values; all four options are described in more detail in the lecture slides titled Practical Issues with Decision Trees. Your program should follow the guidelines stated in those slides.
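To make the recursive structure of tree learning concrete, the following Python sketch builds a single tree using entropy-based information gain. It is only an illustration under simplifying assumptions (a single midpoint threshold per attribute, a fixed depth limit); the actual attribute and threshold selection, stopping criteria, and the randomized/forest variants must follow the lecture slides. All names (Node, build_tree, max_depth, ...) are hypothetical:

import math
from collections import Counter

class Node:
    """A decision-tree node: internal nodes test one attribute against a
    threshold; leaves store the class distribution of the training
    examples that reached them."""
    def __init__(self, attribute=None, threshold=None,
                 left=None, right=None, distribution=None):
        self.attribute = attribute
        self.threshold = threshold
        self.left = left
        self.right = right
        self.distribution = distribution

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attribute, threshold):
    left = [lab for ex, lab in zip(examples, labels) if ex[attribute] < threshold]
    right = [lab for ex, lab in zip(examples, labels) if ex[attribute] >= threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def build_tree(examples, labels, depth=0, max_depth=10):
    # Stop if the node is pure or the depth limit is reached; keep the
    # class distribution so ties can be detected at classification time.
    if len(set(labels)) == 1 or depth == max_depth:
        return Node(distribution=Counter(labels))
    best_gain, best_attr, best_thr = 0.0, None, None
    for attribute in range(len(examples[0])):
        values = [ex[attribute] for ex in examples]
        # Simplification: one midpoint threshold per attribute; the
        # assignment's option determines the real selection rule.
        threshold = (min(values) + max(values)) / 2.0
        gain = information_gain(examples, labels, attribute, threshold)
        if gain > best_gain:
            best_gain, best_attr, best_thr = gain, attribute, threshold
    if best_attr is None:
        return Node(distribution=Counter(labels))
    left_idx = [i for i, ex in enumerate(examples) if ex[best_attr] < best_thr]
    right_idx = [i for i, ex in enumerate(examples) if ex[best_attr] >= best_thr]
    left = build_tree([examples[i] for i in left_idx],
                      [labels[i] for i in left_idx], depth + 1, max_depth)
    right = build_tree([examples[i] for i in right_idx],
                       [labels[i] for i in right_idx], depth + 1, max_depth)
    return Node(attribute=best_attr, threshold=best_thr, left=left, right=right)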


Testing Phase

For each test object (each line in the test file), print the classification result on a separate line. If your classification result is a tie among two or more classes, choose one of them randomly. Use the following format for each line: Object Index = <index>, Result = <predicted class>, True Class = <true class>, Accuracy = <accuracy>

After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object.

Use the following format for this: Classification Accuracy = <average of all accuracies>
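For reference, here is a small Python sketch of a testing loop that produces both kinds of output lines. The predict callable (returning a score per class), the 0-based object index, and the simple 0/1 per-object accuracy are assumptions made for illustration; apply the tie-breaking and accuracy rules from the lecture slides:

import random

def test_phase(predict, test_x, test_y):
    """predict(obj) is assumed to return a dict mapping class -> score."""
    accuracies = []
    for index, (obj, true_class) in enumerate(zip(test_x, test_y)):
        scores = predict(obj)
        best = max(scores.values())
        tied = [c for c, s in scores.items() if s == best]
        predicted = random.choice(tied)                      # break ties randomly
        accuracy = 1.0 if predicted == true_class else 0.0   # assumption: 1 if correct, else 0
        accuracies.append(accuracy)
        print("Object Index = %d, Result = %d, True Class = %d, Accuracy = %.2f"
              % (index, predicted, true_class, accuracy))
    print("Classification Accuracy = %.4f" % (sum(accuracies) / len(accuracies)))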

Note: This output can be written either to the standard output stream or to a file titled output.txt.


Grading


How to submit

Implementations in C, C++, Java, and Python will be accepted. Points will be taken off for failure to comply with this requirement unless previously cleared with the Instructor.

Create a ZIPPED directory called <net-id>_assmt3.zip (no other forms of compression will be accepted; contact the instructor or TA if you do not know how to produce .zip files).
The directory should contain the source code for the task (no need to include any compiled binaries). It should also contain a file called readme.txt, which should specify precisely: