CSE 4309 - Assignments - Tentative Assignment 5


The assignment should be submitted via Canvas. Submit a file called assignment5.zip, containing the following two files:

These naming conventions are mandatory; non-adherence to these specifications can incur a penalty of up to 20 points.

Your name and UTA ID number should appear on the top line of both documents.


Task 1 (70 points, programming)

In this task you will implement decision trees and decision forests. Your program will learn decision trees from training data and will apply decision trees and decision forests to classify test objects.

Arguments

You must implement a Matlab function or a Python executable file called decision_tree. Your function should be invoked as follows:
decision_tree(<training_file>, <test_file>, <option>, <pruning_thr>)
If you use Python, just convert the Matlab function arguments shown above to command-line arguments.

The arguments provide the following information:

The training and test files will follow the same format as the text files in the UCI datasets directory. A description of the datasets and the file format can be found on this link. For each dataset, a training file and a test file are provided. The name of each file indicates what dataset the file belongs to, and whether the file contains training or test data. Your code should also work with ANY OTHER training and test files using the same format as the files in the UCI datasets directory.

As the description states, do NOT use data from the last column (i.e., the class labels) as features. In these files, all columns except for the last one contain example inputs. The last column contains the class label.
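For Python submissions, the argument handling and file reading described above might be sketched as follows. This is only an illustration under stated assumptions: the helper names parse_args, parse_rows, and load_dataset are hypothetical, the values on each line are assumed to be separated by whitespace, and the class labels and pruning threshold are assumed to be integers.

```python
import sys


def parse_args(argv):
    """Map command-line tokens onto the four arguments of decision_tree."""
    if len(argv) != 5:
        raise SystemExit(
            "usage: decision_tree <training_file> <test_file> <option> <pruning_thr>"
        )
    training_file, test_file, option = argv[1], argv[2], argv[3]
    # Assumption: the pruning threshold is an integer, as in the sample invocations.
    pruning_thr = int(argv[4])
    return training_file, test_file, option, pruning_thr


def parse_rows(lines):
    """Split each line into feature values (all but the last column) and a class label."""
    features, labels = [], []
    for line in lines:
        tokens = line.split()  # assumption: whitespace-separated values
        if not tokens:
            continue  # skip blank lines
        features.append([float(t) for t in tokens[:-1]])
        # Assumption: class labels are integers. The last column is never a feature.
        labels.append(int(float(tokens[-1])))
    return features, labels


def load_dataset(path):
    """Read one training or test file from disk."""
    with open(path) as handle:
        return parse_rows(handle)
```

A script could then call parse_args(sys.argv) and pass the two file names to load_dataset.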


Training Stage

The first thing that your program should do is train a decision tree or a decision forest using the training data. What you train and how you do the training depends on the value of the third argument, which we called <option>. This option can take four possible values, as follows:

Training Phase Output

After you learn your tree or forest, you should print it. Every node must be printed, in breadth-first order, with left children before right children. For each node you should print a line containing the tree ID, node ID, feature ID, threshold, and information gain. To produce this output in a uniform manner, use this printing statement:
fprintf('tree=%2d, node=%3d, feature=%2d, thr=%6.2f, gain=%f\n', tree_id, node_id, feature_id, threshold, gain);
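For Python submissions, the same line can be produced with the % formatting operator, which accepts the same conversion specifiers as fprintf. The sketch below also shows one way to visit nodes in breadth-first order with left children before right children. The Node class and its field names are hypothetical, and each node is assumed to already carry whatever node ID the assignment requires.

```python
from collections import deque


class Node:
    """Hypothetical tree node; the field names are illustrative only."""

    def __init__(self, node_id, feature_id, threshold, gain, left=None, right=None):
        self.node_id = node_id
        self.feature_id = feature_id
        self.threshold = threshold
        self.gain = gain
        self.left = left
        self.right = right


def format_node(tree_id, node):
    # Same format string as the Matlab fprintf statement, without the trailing newline.
    return "tree=%2d, node=%3d, feature=%2d, thr=%6.2f, gain=%f" % (
        tree_id, node.node_id, node.feature_id, node.threshold, node.gain)


def breadth_first_lines(tree_id, root):
    """Return one formatted line per node, level by level, left before right."""
    lines = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        lines.append(format_node(tree_id, node))
        if node.left is not None:
            queue.append(node.left)
        if node.right is not None:
            queue.append(node.right)
    return lines
```

Printing each returned line with print() adds the newline that the fprintf format string contains.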

Classification Stage

For each test object you should print a line containing the object ID, predicted class, true class, and accuracy. To produce this output in a uniform manner, use this printing statement:
fprintf('ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f\n', object_id, predicted_class, true_class, accuracy);
After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object. To print the classification accuracy in a uniform manner, use this printing statement:
fprintf('classification accuracy=%6.4f\n', classification_accuracy);
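For Python submissions, the two classification-stage statements could be mirrored as in the sketch below. The function names are hypothetical, and the per-object accuracy values are assumed to be computed elsewhere according to the assignment's rules; the sketch only formats them and averages them as described above.

```python
def format_result(object_id, predicted_class, true_class, accuracy):
    # Same format string as the per-object fprintf statement, without the newline.
    return "ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f" % (
        object_id, predicted_class, true_class, accuracy)


def format_summary(accuracies):
    # Overall accuracy is the average of the per-object accuracies.
    classification_accuracy = sum(accuracies) / len(accuracies)
    return "classification accuracy=%6.4f" % classification_accuracy
```

For example, print(format_result(1, 3, 3, 1.0)) emits one per-object line, and print(format_summary(accuracies)) emits the final line once all test objects have been processed.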

Output for answers.pdf

In your answers.pdf document, you need to provide parts of the output for some invocations of your program listed below. For each invocation, provide:

Include this output for the following invocations of your program:
decision_tree('pendigits_training', 'pendigits_test', 'optimized', 50)
decision_tree('pendigits_training', 'pendigits_test', 'randomized', 50)
decision_tree('pendigits_training', 'pendigits_test', 'forest3', 50)

Grading Rubric


Task 1b (Extra Credit, maximum 10 points).

In this task, you are free to change any implementation options that you are not free to change in Task 1. Examples of such options include the number of trees in the random forest, using entropy to choose thresholds, how you do pruning, or any other relevant option for decision trees. You can submit a Matlab function or Python executable called decision_tree_opt that implements your modifications. A maximum of 10 points will be given to the submission or submissions that, according to the instructor and GTA, achieve the best improvements (on any of the three datasets) compared to the specifications in Task 1. In your answers.pdf document, under a clear "Task 1b" heading, explain what modifications you made, what results you achieved, and what command-line arguments to provide to your program in order to obtain those results.


Task 2 (10 points).

Figure 1: A decision tree for estimating whether the patron will be willing to wait for a table at a restaurant.

Part a (2 points): Suppose that, on the entire set of training samples available for constructing the decision tree of Figure 1, 80 people decided to wait, and 20 people decided not to wait. What is the initial entropy at node A (before the test is applied)?
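As a reminder of how entropy is computed from a set of class labels, here is a minimal sketch. The function name entropy is hypothetical, and base-2 logarithms are assumed, so the result is in bits. The usage example deliberately uses a different class distribution from the one in the question.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy, in bits, of the class distribution found in labels."""
    counts = Counter(labels)
    total = len(labels)
    # Sum of p * log2(1 / p) over the classes that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Two equally likely classes give one bit of entropy.
print(entropy(["wait", "leave", "wait", "leave"]))
```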

Part b (2 points): As mentioned in the previous part, at node A 80 people decided to wait, and 20 people decided not to wait.

What is the information gain for the weekend test at node A?

Part c (2 points): In the decision tree of Figure 1, node E uses the exact same test (whether it is weekend or not) as node A. What is the information gain, at node E, of using the weekend test?

Part d (2 points): We have a test case of a hungry patron who came in on a rainy Tuesday. Which leaf node does this test case end up in? What does the decision tree output for that case?

Part e (2 points): We have a test case of a not hungry patron who came in on a sunny Saturday. Which leaf node does this test case end up in? What does the decision tree output for that case?


Task 3 (10 points)

Class  A  B  C
X      1  2  1
X      2  1  2
X      3  2  2
X      1  3  3
X      1  2  2
Y      2  1  1
Y      3  1  1
Y      2  2  2
Y      3  3  1
Y      2  1  1

We want to build a decision tree that determines whether a certain pattern is of type X or type Y. The decision tree can only use tests that are based on attributes A, B, and C. Each attribute has 3 possible values: 1, 2, 3 (we do not apply any thresholding). The 10 training examples are shown in the table above (each row corresponds to a training example).

What is the information gain of each attribute at the root? Which attribute achieves the highest information gain at the root?
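For an attribute with several discrete values and no thresholding, information gain is the entropy before the split minus the weighted average entropy of the subsets produced by the split. The sketch below illustrates that computation. The function names are hypothetical, base-2 logarithms are assumed, and the usage example uses toy data rather than the table above.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy, in bits, of the class distribution found in labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Gain from splitting labels into one subset per distinct attribute value."""
    subsets = defaultdict(list)
    for label, value in zip(labels, attribute_values):
        subsets[value].append(label)
    total = len(labels)
    # Weighted average entropy of the subsets after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data: the attribute separates the two classes perfectly.
print(information_gain(["p", "p", "n", "n"], [1, 1, 2, 2]))
```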


Task 4 (5 points)

Suppose that, at a node N of a decision tree, we have 1000 training examples. There are four possible class labels (A, B, C, D) for each of these training examples.

Part a: What are the highest possible and lowest possible entropy values at node N?

Part b: Suppose that, at node N, we choose an attribute K. What are the highest possible and lowest possible information gains for that attribute?


Task 5 (5 points)

Your boss at a software company gives you a binary classifier (i.e., a classifier with only two possible output values) that predicts, for any basketball game, whether the home team will win or not. This classifier has a 28% accuracy, and your boss assigns you the task of improving that classifier, so that you get an accuracy that is better than 60%. How do you achieve that task? Can you guarantee achieving better than 60% accuracy?

