Programming Assignment 6 (OPTIONAL)
Due dates:
Interim report: Monday 12/03/2012, 11:59pm
Full assignment: Thursday 12/06/2012, 11:59pm.
Summary
In this assignment you will implement decision trees, including
learning decision trees from training data and applying decision trees
to classify test objects.
Command-line Arguments
You must implement a program that learns a
decision tree for a binary classification problem, given some training
data. In particular, your program will run as follows:
dtree training_labels_file training_features_file test_features_file
The arguments provide the program with the following information:
- The first argument is the file name where the class labels of the
training objects are stored. The i-th row is the class label of the
i-th training object.
- The second argument is the file name where the features of
the training objects are stored. The i-th row and j-th column contain
the value for the j-th feature of the i-th training object. You can
assume in your program that every object has seven features, and each
feature is an integer between -2 and 3.
- The third argument is the file name where the features of
the test objects are stored. The i-th row and j-th column contain the
value for the j-th feature of the i-th test object. As before, you can
assume in your program that every object has seven features, and each
feature is an integer between -2 and 3.
Example files that can be passed as command-line arguments are training_labels.txt, training_features.txt, and test_features.txt. In case you are curious, the class labels of the test objects from test_features.txt can be seen at test_labels.txt.
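As a starting point, reading a features file might look like the sketch below. It assumes that each row holds seven integers separated by whitespace; check the example files to confirm the actual layout. The names kNumFeatures, FeatureRow, parseFeatureRow, and readFeatures are illustrative and not part of the assignment.

```cpp
#include <array>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

constexpr int kNumFeatures = 7;  // fixed by the assignment
using FeatureRow = std::array<int, kNumFeatures>;

// Parses one line of kNumFeatures whitespace-separated integers
// (the separator is an assumption about the file layout).
// Returns false if the line holds fewer than kNumFeatures integers.
bool parseFeatureRow(const std::string& line, FeatureRow& row) {
    std::istringstream fields(line);
    for (int j = 0; j < kNumFeatures; ++j) {
        if (!(fields >> row[j])) return false;
    }
    return true;
}

// Reads every well-formed row from a stream; lines that do not parse,
// such as blank lines, are skipped.
std::vector<FeatureRow> readFeatures(std::istream& in) {
    std::vector<FeatureRow> rows;
    std::string line;
    FeatureRow row;
    while (std::getline(in, line)) {
        if (parseFeatureRow(line, row)) rows.push_back(row);
    }
    return rows;
}
```

In the real program the stream would typically be a std::ifstream opened on the second or third command-line argument.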
Training
The first part of the program execution performs
training of a decision tree using the training data. Your decision tree
should be binary and have depth 4. In other words, any non-leaf node
should have two children, and any leaf node should have depth 4 (the
root is considered to have depth 1). This means that, at classification
time (see description of classification task later), each test object
will be subjected to three tests. EXCEPTION: if for some node at depth
< 4 you find that that node was only assigned training objects from
one class (and no training objects from the other class), then you can
make that node a leaf node.
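The depth rule and its exception can be captured in one small predicate, sketched below. The names kLeafDepth and isLeaf are illustrative. Treating a node that received no training objects at all as a leaf is an assumption, since that case is not covered above.

```cpp
constexpr int kLeafDepth = 4;  // the root has depth 1, so leaves sit at depth 4

// Returns true if a node at the given depth, holding n1 training objects
// of class1 and n2 of class2, should not be split further.
bool isLeaf(int depth, int n1, int n2) {
    if (depth >= kLeafDepth) return true;  // maximum depth reached
    return n1 == 0 || n2 == 0;             // one class only (or an empty node)
}
```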
The question chosen at each node should simply test if a specific
feature is LESS THAN a specific value. For full credit, for each node
of the tree (starting at the root) you should identify the optimal
question, i.e., the question leading to the highest information gain
for that node, as specified in pages 659-660 of the textbook.
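The search for the optimal question can be sketched as follows. The sketch assumes the usual entropy-based definition, gain = H(node) - (n_left/n) H(left) - (n_right/n) H(right), with H measured in bits; verify this against the textbook pages cited above. It also assumes that labels are stored as the integers 1 and 2, and that ties are broken in favor of the first candidate found. Question, entropy, and bestQuestion are illustrative names.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kNumFeatures = 7;
using FeatureRow = std::array<int, kNumFeatures>;

struct Question {
    int featureId;  // 1..7
    int value;      // objects with feature < value go left
    double gain;    // information gain of this question
};

// Entropy (in bits) of a node with n1 objects of class1 and n2 of class2.
double entropy(int n1, int n2) {
    if (n1 == 0 || n2 == 0) return 0.0;  // pure or empty node
    const double n = n1 + n2;
    const double p1 = n1 / n, p2 = n2 / n;
    return -p1 * std::log2(p1) - p2 * std::log2(p2);
}

// Tries every (feature, value) pair on the objects assigned to a node.
// labels[i] is assumed to be 1 or 2.
Question bestQuestion(const std::vector<FeatureRow>& rows,
                      const std::vector<int>& labels) {
    int total1 = 0, total2 = 0;
    for (int label : labels) {
        if (label == 1) ++total1; else ++total2;
    }
    const int n = total1 + total2;
    if (n == 0) return Question{1, -1, 0.0};  // nothing to split
    const double parentEntropy = entropy(total1, total2);

    Question best{1, -1, -1.0};
    for (int f = 0; f < kNumFeatures; ++f) {
        // Features lie in [-2, 3], so "feature < v" can only separate
        // objects for v in [-1, 3].
        for (int v = -1; v <= 3; ++v) {
            int l1 = 0, l2 = 0;
            for (std::size_t i = 0; i < rows.size(); ++i) {
                if (rows[i][f] < v) {
                    if (labels[i] == 1) ++l1; else ++l2;
                }
            }
            const int r1 = total1 - l1, r2 = total2 - l2;
            const double wl = static_cast<double>(l1 + l2) / n;
            const double wr = static_cast<double>(r1 + r2) / n;
            const double gain =
                parentEntropy - wl * entropy(l1, l2) - wr * entropy(r1, r2);
            if (gain > best.gain) best = Question{f + 1, v, gain};
        }
    }
    return best;
}
```

Calling bestQuestion only on the rows and labels that reached a given node keeps the gain computation local to that node.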
Your decision tree will have at most 7 non-leaf nodes. For each of them your program should print:
- node ID (a number from one to seven, where the root ID is 1).
- left or right: this specifies if this node is the left or right child of its parent. For the root just print "root".
- node ID of the parent (0 for the parent of the root).
- a feature ID (a number from one to seven, specifying which of the seven features is used for the decision test).
- a value V. If feature ID = X and value = Y, then objects
whose X-th feature has a value LESS THAN Y are directed to the left
child of that node.
- number of training objects of class1 considered in this node.
- number of training objects of class2 considered in this node.
- information gain achieved by the question chosen for this node.
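One possible way to print these fields is a single labeled line per node, as sketched below. The exact layout is not prescribed by the assignment, so the format, the NodeReport struct, and the formatNode function are only an illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

struct NodeReport {
    int id;            // 1..7, root is 1
    std::string side;  // "root", "left", or "right"
    int parentId;      // 0 for the root
    int featureId;     // 1..7
    int value;         // threshold V of the test "feature < V"
    int class1Count;   // training objects of class1 at this node
    int class2Count;   // training objects of class2 at this node
    double gain;       // information gain of the chosen question
};

// Formats one node as a single line of labeled fields.
std::string formatNode(const NodeReport& r) {
    std::ostringstream out;
    out << "node=" << r.id
        << " side=" << r.side
        << " parent=" << r.parentId
        << " feature=" << r.featureId
        << " value=" << r.value
        << " class1=" << r.class1Count
        << " class2=" << r.class2Count
        << " gain=" << std::fixed << std::setprecision(4) << r.gain;
    return out.str();
}
```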
Classification
For each test object described in
test_features_file, print out, on a separate line, the classification
label (class1 or class2).
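Classifying a test object amounts to walking from the root to a leaf, as in the sketch below. It assumes the tree is stored as a vector of nodes with the root at index 0, and that each leaf already carries a label (for example, the majority class of its training objects, which is an assumption and not stated above). Node and classify are illustrative names.

```cpp
#include <array>
#include <vector>

struct Node {
    bool isLeaf;
    int label;      // 1 or 2; meaningful only for leaves
    int featureId;  // 1..7; meaningful only for non-leaf nodes
    int value;      // test is "feature < value"
    int left;       // index of the left child in the vector
    int right;      // index of the right child in the vector
};

// Follows the tests from the root (index 0) down to a leaf and
// returns that leaf's label.
int classify(const std::vector<Node>& tree, const std::array<int, 7>& x) {
    int i = 0;
    while (!tree[i].isLeaf) {
        const Node& node = tree[i];
        // featureId is 1-based, array indices are 0-based.
        i = (x[node.featureId - 1] < node.value) ? node.left : node.right;
    }
    return tree[i].label;
}
```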
Interim report
The interim report should be submitted via e-mail to the instructor and the TA, and should contain the following:
- On subject line: "CSE 4308/5360 - Programming Assignment 6 - Interim report".
- On body of message: Your name and UTA ID (all 10 digits, no spaces).
- On body of message, or as an attachment (in text, Word, PDF,
or OpenOffice format): a description (as brief or long as you want) of
what you have done so far for the assignment, and any
difficulties/bottlenecks you may have reached (in case you encounter
such difficulties, it is highly recommended to contact the instructor
and/or TA for help).
For purposes of grading, it is absolutely fine if your
interim report simply states that you have done nothing so far (you
still get the 10 points allocated for the interim report, AS LONG AS
YOU SUBMIT THE REPORT ON TIME). At the same time, starting early and
identifying potential bottlenecks by the deadline for the interim
report is probably a good strategy for doing well in this assignment.
Grading
- 20 points: Correctly computing the information gain of each
(feature, value) pair at each node, using the training data assigned to
that node.
- 20 points: Identifying and choosing, for each node, the
(feature, value) pair with the highest information gain for that node,
according to the training set.
- 20 points: Correctly directing training objects to the left
or right child of each node, depending on the (feature, value) pair
used at that node.
- 20 points: Correctly applying the decision tree to test objects.
- 20 points: Printing out the required information in an easy-to-read format, according to the instructions.
How to submit
Submissions should be made using Blackboard.
Implementations in C, C++, and Java will be accepted. If you would like
to use another language, please first check with the instructor via
e-mail. Points will be taken off for failure to comply with this
requirement.
Submit a ZIPPED directory called programming6.zip (no other
forms of compression accepted, contact the instructor or TA if you do
not know how to produce .zip files). The directory should
contain all source code. The directory should also contain a file called
readme.txt, which should specify precisely:
- Name and UTA ID of the student.
- What programming language is used.
- How the code is structured.
- How to run the code, including very specific compilation
instructions, if compilation is needed. Instructions such as
"compile using g++" are NOT considered specific. Providing all the
command lines that are needed to complete the compilation on omega is
specific.
Insufficient or unclear instructions will be penalized by up to 20 points.
Code that does not run on omega machines gets AT MOST half credit (50 points).
Submission checklist
- Is the code running on omega?
- Is the implementation in C, C++, or Java? If not, have you
obtained written consent from the instructor?
- Is the submitted zipped file called programming6.zip?
- Does the zipped directory include a readme.txt file, as specified?