In
this assignment you will implement decision trees and decision forests.
Your program will learn decision trees from training data and will
apply decision trees and decision forests to classify test objects.
Command-line Arguments
You must implement a program that learns a decision tree for a binary
classification problem, given some training data. In particular, your
program will run as follows:
dtree training_file test_file option
The arguments provide the following information to the program:
The first argument is the name of the training file, where
the training data is stored.
The second argument is the name of the test file, where
the test data is stored.
The third argument can have four possible values: optimized,
randomized, forest3, or forest15.
It specifies how to train (learn) the decision tree, and will be
discussed later.
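As an illustration, the argument handling might look like the minimal Python sketch below (Python is one of the accepted languages; the function name parse_args is an illustrative choice, not part of the specification):

```python
import sys

def parse_args(argv):
    # Expect: program name, training file, test file, option.
    if len(argv) != 4:
        raise SystemExit("usage: dtree training_file test_file option")
    training_file, test_file, option = argv[1:4]
    if option not in ("optimized", "randomized", "forest3", "forest15"):
        raise SystemExit("option must be optimized, randomized, forest3, or forest15")
    return training_file, test_file, option

if __name__ == "__main__":
    print(parse_args(sys.argv))
```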
Both the training file and the test file are text files containing data
in tabular format. Each value is a number, and values are separated by
white space. The i-th row, j-th column contains the value for the j-th
feature of the i-th object. The only exception is the LAST column, which
stores the class label for each object. Make sure you do
not use data from the last column (i.e., the class labels) as
attributes (features) in your decision tree.
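Reading such a file takes only a few lines of code. The sketch below (the helper name load_dataset is illustrative) assumes the whitespace-separated format described above:

```python
def load_dataset(path):
    """Read whitespace-separated numeric rows; the last column is the class label."""
    features, labels = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:        # skip blank lines
                continue
            values = [float(v) for v in parts]
            features.append(values[:-1])    # all columns except the last
            labels.append(int(values[-1]))  # last column: class label
    return features, labels
```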
NOTE:
If your program has issues reading the regular files (*_training.txt
and *_test.txt), try using the whitespace-adjusted
versions (*_training_adj.txt and *_test_adj.txt).
The pendigits dataset, containing
data for pen-based recognition of handwritten digits.
7494 training objects.
3498 test objects.
16 attributes.
10 classes.
The satellite dataset. The full name
of this
dataset is Statlog (Landsat Satellite) Data Set, and it contains data
for classification of pixels in satellite images.
4435 training objects.
2000 test objects.
36 attributes.
6 classes.
The yeast dataset, containing some
biological data whose purpose I do not understand myself.
1000 training objects.
484 test objects.
8 attributes.
10 classes.
For each dataset, a training file and a test file are provided. The
name
of each file indicates what dataset the file belongs to, and whether
the file contains training or test data.
Note
that, for the purposes of your assignment, it does not matter at all
where the data came from. One of the attractive properties of decision
trees (and many other machine learning methods) is that they can be
applied in the exact same way to many different types of data, and
produce useful results.
Training Phase
The first thing that your program should do is learn a decision tree
using the training data. What you train and how you do the training
depends on the value of the third command-line argument, which we
called "option".
This option can take four possible values, as follows:
optimized: in this case, at each
non-leaf node of the
tree
(starting at the root) you should identify the optimal combination of
attribute (feature) and threshold, i.e., the combination that leads to
the highest information gain for that node.
randomized: in this case, at each
non-leaf node of
the tree (starting at the root) you should choose the attribute
(feature) randomly. The threshold should still be optimized, i.e., it
should be chosen so as to maximize the information gain for that node
and for that randomly chosen attribute.
forest3: in this case, your program
trains a random
forest containing three trees. Each of those trees should be trained as
discussed under the "randomized" option.
forest15: in this case, your program
trains a
random forest containing 15 trees. Each of those trees should be
trained as discussed under the "randomized" option.
All four options are described in more detail in the lecture slides
titled Practical
Issues with Decision Trees. Your program should follow the
guidelines stated in those slides.
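To make the optimized/randomized distinction concrete, here is a hedged Python sketch of the split search. The helper names (entropy, information_gain, choose_split) and the choice of 50 evenly spaced candidate thresholds are illustrative assumptions; follow whatever the lecture slides actually prescribe:

```python
import math
import random

def entropy(labels):
    # H(S) = -sum over classes c of p_c * log2(p_c)
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(examples, labels, attr, threshold):
    # Split on: value of attribute `attr` < threshold.
    left = [l for x, l in zip(examples, labels) if x[attr] < threshold]
    right = [l for x, l in zip(examples, labels) if x[attr] >= threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def choose_split(examples, labels, randomize):
    attrs = list(range(len(examples[0])))
    if randomize:
        attrs = [random.choice(attrs)]  # "randomized": one random attribute
    best_gain, best_attr, best_thr = -1.0, None, None
    for a in attrs:                     # "optimized": all attributes
        col = [x[a] for x in examples]
        lo, hi = min(col), max(col)
        # 50 candidate thresholds strictly between the min and max value.
        for k in range(1, 51):
            t = lo + k * (hi - lo) / 51
            g = information_gain(examples, labels, a, t)
            if g > best_gain:
                best_gain, best_attr, best_thr = g, a, t
    return best_gain, best_attr, best_thr
```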
Testing Phase
For each test object (each line in the test
file) print out, on a separate line, the classification result. If your classification result is a tie among two or more
classes, choose one of them randomly. For each test object you should
print a line containing the following information:
index. This is the index of the object (the line where it occurs) in the
test file. Start numbering the objects at 0, not 1.
predicted class (the result of the classification).
true class (from the last column of the test file).
accuracy. This is defined as follows:
If there were no ties in your classification result,
and the predicted class is correct, the accuracy is 1.
If there were no ties in your classification result,
and the predicted class is incorrect, the accuracy is 0.
If there were ties in your classification result, and
the
correct class was one of the classes that tied for best, the accuracy
is
1 divided by the number of classes that tied for best.
If there were ties in your classification result, and
the
correct class was NOT one of the classes that tied for best, the
accuracy is 0.
Use the following Format for each line: Object Index = <index>, Result = <predicted class>, True Class = <true class>, Accuracy = <accuracy>
After you have printed the results for all test objects, you
should print the overall classification accuracy, which is defined as
the average of the classification accuracies you printed out for each
test object.
Use the following Format for this: Classification Accuracy = <average of all accuracies>
Note: This output can either be put in standard output stream or in a file titled output.txt
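The tie-handling and accuracy rules above can be sketched as follows. Here predictions[i] is assumed to be a mapping from class label to vote score for test object i (e.g., leaf distributions summed over forest trees); the function name and the exact numeric formatting of the accuracy values are illustrative assumptions:

```python
import random

def classify_and_report(predictions, true_labels):
    total = 0.0
    for i, (scores, true_class) in enumerate(zip(predictions, true_labels)):
        best = max(scores.values())
        tied = [c for c, s in scores.items() if s == best]
        predicted = random.choice(tied)  # break ties randomly
        if true_class in tied:
            accuracy = 1.0 / len(tied)   # 1 / (number of tied classes)
        else:
            accuracy = 0.0
        total += accuracy
        print("Object Index = %d, Result = %d, True Class = %d, Accuracy = %.2f"
              % (i, predicted, true_class, accuracy))
    overall = total / len(true_labels)
    print("Classification Accuracy = %6.4f" % overall)
    return overall
```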
Grading
20 points: Correct processing of the optimized
option.
Identifying and choosing, for each node, the (feature, threshold) pair
with the highest information gain for that node, and correctly
computing
that information gain.
10 points: Correct processing of the randomized
option. In other words, identifying and choosing, for each node, an
appropriate (feature, threshold) pair, where the feature is chosen
randomly, and the threshold maximizes the information gain for that
feature.
15 points: Correctly directing training objects to the
left or
right child of each node, depending on the (feature, threshold) pair used
at that node.
5 points: Correct application of pruning, as specified in
the slides (if any).
15 points: Correctly applying decision trees to classify
test objects.
15 points: Correctly applying decision forests to classify
test objects.
20 points: Following the specifications in producing the
required output.
How to submit
Implementations in C, C++, Java, and Python will
be accepted. Points will be taken off
for failure to comply
with this requirement unless previously cleared with the Instructor.
Create a ZIPPED
directory called <net-id>_assmt3.zip (no other
forms of compression are
accepted; contact the instructor or TA if you do not know how to
produce .zip files).
The directory should contain the source code for the task (no need for any compiled binaries). Each
folder should also contain a file called readme.txt, which
should specify precisely:
Name and UTA ID of the student.
What programming language is used for this task (make sure to also give version and subversion numbers).
How the code is structured.
How to run the code, including very specific compilation
instructions,
if compilation is needed. Instructions such as "compile using g++" are
NOT considered specific if the TA needs to do additional steps to get
your code to run.
If your code will run on the ACS Omega (not required) make a note of it in the readme file.
Insufficient or unclear instructions will be penalized.
Code that
the Instructor cannot run gets AT MOST 75% credit (depending on the instructor's review of the code).