Summary
In this assignment you will implement decision trees, including learning decision trees from training data and applying decision trees to classify test objects.
Command-line Arguments
You must implement a program that learns a decision tree for a binary classification problem, given some training data. In particular, your program will run as follows:
dtree training_labels_file training_features_file test_features_file
The arguments provide the following information to the program:
- The first argument is the file name where the class labels of the training objects are stored. The i-th row is the class label of the i-th training object.
- The second argument is the file name where the features of the training objects are stored. The i-th row and j-th column contain the value for the j-th feature of the i-th training object. You can assume in your program that every object has seven features, and each feature is an integer between -2 and 3.
- The third argument is the file name where the features of the test objects are stored. The i-th row and j-th column contain the value for the j-th feature of the i-th test object. As before, you can assume in your program that every object has seven features, and each feature is an integer between -2 and 3.
Example files that can be passed as command-line arguments are training_labels.txt, training_features.txt, and test_features.txt. In case you are curious, the class labels of the test objects from test_features.txt can be seen at test_labels.txt.
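Reading the feature files is a small first step. The sketch below is one way to do it in C++; readFeatures is a hypothetical name, and it assumes each row holds one object's whitespace-separated integer feature values (seven per object, as described above):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one object per row, with each row holding the
// object's whitespace-separated integer feature values.
std::vector<std::vector<int>> readFeatures(std::istream& in) {
    std::vector<std::vector<int>> objects;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        std::vector<int> features;
        int value;
        while (row >> value) features.push_back(value);
        if (!features.empty()) objects.push_back(features);
    }
    return objects;
}
```

The same loop works for both feature files; the training labels file can be read analogously, one label per row.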
Training
The first part of the program execution trains a decision tree using the training data. Your decision tree should be binary and of depth 4. In other words, every non-leaf node should have two children, and every leaf node should have depth 4 (the root is considered to have depth 1). This means that, at classification time (see the description of the classification task later), each test object is subjected to three tests. EXCEPTION: if a node at depth < 4 was assigned training objects from only one class (and no training objects from the other class), then you can make that node a leaf node.
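One possible shape for this tree, sketched in C++ (the Node fields and the names Node and isPure are assumptions, not required by the assignment): the tree can be built recursively, splitting until depth 4 is reached unless a node is already pure, which is exactly the exception above.

```cpp
#include <vector>

// Assumed node layout: a node is a leaf iff left == nullptr.
struct Node {
    int feature = -1;    // feature tested at this node (non-leaf only)
    int value = 0;       // objects with feature value < this go left
    int label = -1;      // class predicted at a leaf
    Node* left = nullptr;
    Node* right = nullptr;
};

// A node may stop early only when all of its training objects share a class.
bool isPure(const std::vector<int>& labels) {
    for (int label : labels)
        if (label != labels.front()) return false;
    return true;
}
```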
The question chosen at each node should simply test if a specific feature is LESS THAN a specific value. For full credit, for each node of the tree (starting at the root) you should identify the optimal question, i.e., the question leading to the highest information gain for that node, as specified in pages 659-660 of the textbook.
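Under the standard definition (entropy of the node minus the weighted entropy of the two children produced by the question), the gain of one (feature, value) question could be computed as below. This is an illustrative sketch: entropy and infoGain are made-up names, and class labels are assumed to be encoded as 0 for class1 and 1 for class2.

```cpp
#include <cmath>
#include <vector>

// Binary entropy of a node holding n1 objects of class1 and n2 of class2.
double entropy(int n1, int n2) {
    if (n1 == 0 || n2 == 0) return 0.0;
    double total = n1 + n2;
    double p1 = n1 / total, p2 = n2 / total;
    return -p1 * std::log2(p1) - p2 * std::log2(p2);
}

// Information gain of asking "is feature f < value v?" over a training
// subset, where labels[i] (0 or 1) is the class of object i.
double infoGain(const std::vector<std::vector<int>>& features,
                const std::vector<int>& labels, int f, int v) {
    int all[2] = {0, 0}, left[2] = {0, 0}, right[2] = {0, 0};
    for (size_t i = 0; i < labels.size(); ++i) {
        ++all[labels[i]];
        if (features[i][f] < v) ++left[labels[i]];
        else ++right[labels[i]];
    }
    double total = labels.size();
    int nl = left[0] + left[1], nr = right[0] + right[1];
    return entropy(all[0], all[1])
         - (nl / total) * entropy(left[0], left[1])
         - (nr / total) * entropy(right[0], right[1]);
}
```

Finding the optimal question at a node is then a loop over all seven features and all candidate values, keeping the pair with the highest gain.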
Your decision tree will have at most 7 non-leaf nodes. For each of them your program should print:
- node ID (a number from one to seven, where the root ID is 1).
- left or right: this specifies if this node is the left or right child of its parent. For the root just print "root".
- node ID of the parent (0 for the parent of the root).
- a feature ID (a number from one to seven, specifying which of the seven features is used for the decision test).
- a value V. If feature ID = X and value = Y, then objects whose X-th feature has a value LESS THAN Y are directed to the left child of that node.
- number of training objects of class1 considered in this node.
- number of training objects of class2 considered in this node.
- information gain achieved by the question chosen for this node.
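The exact layout is up to you, as long as it is easy to read and contains the fields listed above. One possible format, using a made-up helper formatNode (side is "left", "right", or "root"):

```cpp
#include <sstream>
#include <string>

// One readable line per non-leaf node, fields in the order listed above.
std::string formatNode(int id, const std::string& side, int parentId,
                       int featureId, int value, int nClass1, int nClass2,
                       double gain) {
    std::ostringstream out;
    out << "node=" << id << " " << side << " parent=" << parentId
        << " feature=" << featureId << " value=" << value
        << " class1=" << nClass1 << " class2=" << nClass2
        << " gain=" << gain;
    return out.str();
}
```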
Classification
For each test object described in test_features_file print out, in a separate line, the classification label (class1 or class2).
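Classifying a test object is a walk from the root down to a leaf, going left exactly when the tested feature's value is LESS THAN the node's threshold. A minimal sketch (the Node struct is redeclared here, with assumed fields, so the snippet stands alone; a node is treated as a leaf iff left == nullptr):

```cpp
#include <vector>

// Assumed node layout, matching the test rule of the assignment.
struct Node {
    int feature = -1;   // feature tested at this node (non-leaf only)
    int value = 0;      // go left when the tested feature's value < this
    int label = -1;     // class returned at a leaf (0 = class1, 1 = class2)
    Node* left = nullptr;
    Node* right = nullptr;
};

// Follow the tests from the root down to a leaf and return its label.
int classify(const Node* node, const std::vector<int>& object) {
    while (node->left != nullptr)
        node = (object[node->feature] < node->value) ? node->left : node->right;
    return node->label;
}
```

A full-depth tree applies three such tests per object; mapping the returned label to the string "class1" or "class2" before printing is then a one-line step.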
Interim report
The interim report should be submitted via e-mail to the instructor and the TA, and should contain the following:
- On subject line: "CSE 4308/5360 - Programming Assignment 6 - Interim report".
- On body of message: Your name and UTA ID (all 10 digits, no spaces).
- On body of message, or as an attachment (in text, Word, PDF, or OpenOffice format): a description (as brief or long as you want) of what you have done so far for the assignment, and any difficulties/bottlenecks you may have reached (in case you encounter such difficulties, it is highly recommended to contact the instructor and/or TA for help).
For purposes of grading, it is absolutely fine if your interim report simply states that you have done nothing so far (you still get the 10 points allocated for the interim report, AS LONG AS YOU SUBMIT THE REPORT ON TIME). At the same time, starting early and identifying potential bottlenecks by the deadline for the interim report is probably a good strategy for doing well in this assignment.
Grading
- 20 points: Correctly computing the information gain of each (feature, value) pair at each node, using the training data assigned to that node.
- 20 points: Identifying and choosing, for each node, the (feature, value) pair with the highest information gain for that node, according to the training set.
- 20 points: Correctly directing training objects to the left or right child of each node, depending on the (feature, value) pair used at that node.
- 20 points: Correctly applying the decision tree to test objects.
- 20 points: Printing out the required information in an easy-to-read format, according to the instructions.
How to submit
Submissions should be made using Blackboard.
Implementations in C, C++, and Java will be accepted. If you would like to use another language, please first check with the instructor via e-mail. Points will be taken off for failure to comply with this requirement.
Submit a ZIPPED directory called programming6.zip (no other forms of compression accepted; contact the instructor or TA if you do not know how to produce .zip files). The directory should contain all source code. The directory should also contain a file called readme.txt, which should specify precisely:
- Name and UTA ID of the student.
- What programming language is used.
- How the code is structured.
- How to run the code, including very specific compilation instructions, if compilation is needed. Instructions such as "compile using g++" are NOT considered specific. Providing all the command lines that are needed to complete the compilation on omega is specific.
Insufficient or unclear instructions will be penalized by up to 20 points.
Code that does not run on omega machines gets AT MOST half credit (50 points).
Submission checklist
- Is the code running on omega?
- Is the implementation in C, C++, or Java? If not, have you obtained written consent from the instructor?
- Is the submitted zipped file called programming6.zip?
- Does the attachment include a readme.txt file, as specified?