Programming Assignment 7

Summary

In this assignment you will implement naive Bayes classifiers. Your program will learn a naive Bayes classifier from training data and will apply that classifier to classify test objects.


Command-line Arguments

You must implement a program that learns a naive Bayes classifier for a classification problem, given some training data and some additional options. In particular, your program can be invoked as follows:
naive_bayes <training_file> <test_file> histograms <number>
naive_bayes <training_file> <test_file> gaussians
naive_bayes <training_file> <test_file> mixtures <number>
The arguments provide the following information to the program. Both the training file and the test file are text files containing data in tabular format. Each value is a number, and values are separated by white space. The i-th row and j-th column contain the value for the j-th feature of the i-th object. The only exception is the LAST column, which stores the class label for each object. Make sure you do not use data from the last column (i.e., the class labels) as attributes (features) in your classifier.
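The file format described above can be parsed in a few lines. The sketch below is one possible helper (the function name and the use of NumPy are our own choices, not mandated by the assignment); it separates the attribute columns from the class-label column so that the labels are never used as features.

```python
import numpy as np

def load_dataset(path):
    """Load a whitespace-separated data file whose LAST column is the class label.

    Hypothetical helper, not part of the assignment spec.
    """
    data = np.loadtxt(path)
    features = data[:, :-1]            # all columns except the last are attributes
    labels = data[:, -1].astype(int)   # the last column holds the class label
    return features, labels
```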

Example files that can be passed as command-line arguments are in the datasets directory. That directory contains three datasets, copied from the UCI repository of machine learning datasets:

For each dataset, a training file and a test file are provided. The name of each file indicates what dataset the file belongs to, and whether the file contains training or test data.

Note that, for the purposes of your assignment, it does not matter at all where the data came from. One of the attractive properties of naive Bayes classifiers (and many other machine learning methods) is that they can be applied in the exact same way to many different types of data, and produce useful results.


Training: Histograms

If the third command-line argument is histograms, then you should model P(x | class) as a histogram separately for each dimension of the data. The number of bins for each histogram is specified by the fourth command-line argument.

Suppose that you are building a histogram of N bins for the j-th dimension of the data. Let S be the smallest and L be the largest value in the j-th dimension among all training data. Let G = (L-S)/N. Then, your bins should have the following ranges:

The output of the training phase should be a sequence of lines like this:
Class %d, attribute %d, bin %d, P(bin | class) = %.2f
The output lines should be sorted by class number. Within the same class, lines should be sorted by attribute number. Within the same attribute, lines should be sorted by bin number.
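A sketch of the histogram training step is shown below. The exact bin ranges are specified in the assignment; here we assume equal-width bins of size G = (L - S) / N over [S, L], with out-of-range values clamped into the outer bins, so treat the bin-index computation as an assumption and follow the ranges given above.

```python
import numpy as np

def train_histograms(X, y, n_bins):
    """Estimate P(bin | class) per class and attribute with equal-width
    histograms. The bin edges (width G over [S, L], clamped) are an
    assumption; use the exact ranges specified in the assignment.
    """
    S = X.min(axis=0)              # smallest training value per attribute
    L = X.max(axis=0)              # largest training value per attribute
    G = (L - S) / n_bins           # bin width per attribute
    model = {}
    for c in sorted(set(y)):               # sorted by class number
        Xc = X[y == c]
        for j in range(X.shape[1]):        # then by attribute number
            if G[j] > 0:
                bins = np.clip(((Xc[:, j] - S[j]) / G[j]).astype(int),
                               0, n_bins - 1)
            else:                          # all training values identical
                bins = np.zeros(len(Xc), dtype=int)
            counts = np.bincount(bins, minlength=n_bins)
            probs = counts / counts.sum()
            model[(c, j)] = probs
            for b in range(n_bins):        # then by bin number
                print("Class %d, attribute %d, bin %d, P(bin | class) = %.2f"
                      % (c, j, b, probs[b]))
    return model
```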


Training: Gaussians

If the third command-line argument is gaussians, then you should model P(x | class) as a Gaussian separately for each dimension of the data. The output of the training phase should be a sequence of lines like this:
Class %d, attribute %d, mean = %.2f, std = %.2f
The output lines should be sorted by class number. Within the same class, lines should be sorted by attribute number.
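The Gaussian training step amounts to computing a mean and standard deviation per (class, attribute) pair. The sketch below floors the standard deviation at 0.01 to avoid zero-variance attributes; that floor is our own assumption, not something stated above.

```python
import numpy as np

def train_gaussians(X, y, min_std=0.01):
    """Fit one Gaussian per (class, attribute) pair. The 0.01 floor on the
    standard deviation is an assumption, added to avoid division by zero
    for attributes that are constant within a class.
    """
    model = {}
    for c in sorted(set(y)):            # sorted by class number
        Xc = X[y == c]
        for j in range(X.shape[1]):     # then by attribute number
            mean = Xc[:, j].mean()
            std = max(Xc[:, j].std(), min_std)   # population std deviation
            model[(c, j)] = (mean, std)
            print("Class %d, attribute %d, mean = %.2f, std = %.2f"
                  % (c, j, mean, std))
    return model
```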


Training: Mixtures of Gaussians

If the third command-line argument is mixtures, then you should model P(x | class) as a mixture of Gaussians separately for each dimension of the data. The number of Gaussians for each mixture is specified by the fourth command-line argument.

Suppose that you are building a mixture of N Gaussians for the i-th dimension of the data. Let S be the smallest and L be the largest value in the i-th dimension among all training data. Let G = (L-S)/N. Then, you should initialize all standard deviations of the mixture to 1, and you should initialize the means as follows:

You should repeat the main loop of the EM algorithm 50 times. There is no need for any other stopping criterion: your stopping criterion is simply that the loop has been executed 50 times.
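A 1-D EM loop for one (class, attribute) pair can be sketched as below. The even spacing of the initial means (S + (i + 0.5) * G) and the 0.01 floor on the standard deviations are assumptions made for this sketch; follow the exact initialization given in the assignment.

```python
import numpy as np

def train_mixture_1d(values, n_gauss, iters=50, min_std=0.01):
    """Fit a 1-D mixture of n_gauss Gaussians with EM. The initial mean
    spacing (S + (i + 0.5) * G) and the min_std floor are assumptions;
    the assignment specifies std devs initialized to 1 and 50 iterations.
    """
    S, L = values.min(), values.max()
    G = (L - S) / n_gauss
    means = np.array([S + (i + 0.5) * G for i in range(n_gauss)])
    stds = np.ones(n_gauss)                      # all std devs start at 1
    weights = np.full(n_gauss, 1.0 / n_gauss)    # uniform mixing weights

    for _ in range(iters):   # fixed iteration count, no other stopping test
        # E-step: responsibility of each Gaussian for each training value
        dens = np.array([w * np.exp(-(values - m) ** 2 / (2 * s ** 2))
                         / (s * np.sqrt(2 * np.pi))
                         for w, m, s in zip(weights, means, stds)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate weights, means, and standard deviations
        nk = resp.sum(axis=1)
        weights = nk / len(values)
        means = (resp * values).sum(axis=1) / nk
        stds = np.sqrt((resp * (values - means[:, None]) ** 2).sum(axis=1) / nk)
        stds = np.maximum(stds, min_std)         # assumed floor
    return weights, means, stds
```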

The output of the training phase should be a sequence of lines like this:

Class %d, attribute %d, Gaussian %d, mean = %.2f, std = %.2f
The output lines should be sorted by class number. Within the same class, lines should be sorted by attribute number. Within the same attribute, lines should be sorted by Gaussian number.


Classification

For each test object (each line in the test file) print out, on a separate line, the classification label. If your classification result is a tie among two or more classes, choose one of them randomly. For each test object you should print a line containing the following info:

To produce this output in a uniform manner, use these printing statements:

After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object. To print the classification accuracy in a uniform manner, use these printing statements:
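The classification step, sketched here for the Gaussian model, picks the class maximizing P(class) * prod_j P(x_j | class), working in log space for numerical stability. Estimating the prior from training-class frequencies is an assumption of this sketch, as is the `np.isclose` tie test.

```python
import numpy as np

def classify(x, model, classes, priors):
    """Return the most probable class for feature vector x under a
    per-attribute Gaussian model {(class, attr): (mean, std)}.
    Ties are broken at random, as the assignment requires.
    """
    scores = {}
    for c in classes:
        log_p = np.log(priors[c])           # assumed: prior from class frequency
        for j, v in enumerate(x):
            mean, std = model[(c, j)]
            log_p += (-((v - mean) ** 2) / (2 * std ** 2)
                      - np.log(std * np.sqrt(2 * np.pi)))
        scores[c] = log_p
    best = max(scores.values())
    ties = [c for c in classes if np.isclose(scores[c], best)]
    return int(np.random.choice(ties))      # random choice on a tie
```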

Grading


How to submit

Submissions should be made using Blackboard. Implementations in Python, C, C++, and Java will be accepted. If you would like to use another language, please first check with the instructor via e-mail. Points will be taken off for failure to comply with this requirement.

Submit a ZIPPED directory called programming-assignment7.zip (no other forms of compression accepted, contact the instructor or TA if you do not know how to produce .zip files). The directory should contain:

Insufficient or unclear instructions will be penalized by up to 20 points. Code that does not run on omega machines gets AT MOST half credit, unless you obtained prior written permission.