The assignment should be submitted via Canvas. Submit a file called assignment4.zip, containing the following files:
- answers.pdf, for the output that each programming task asks you to include. Only PDF files will be accepted. All text should be typed, and any figures should be computer-generated. Scans of handwritten answers will NOT be accepted.
- All source code files needed to run your solutions for the programming tasks. Your Python code should run on the versions specified in the syllabus, unless permission is obtained via e-mail from the instructor or the teaching assistant. Also submit any other files that are needed in order to document or run your code (for example, additional source code files).
These naming conventions are mandatory; non-adherence to these specifications can incur a penalty of up to 20 points.
Your name and UTA ID number should appear on the top line of both documents.
For all programming tasks, feel free to reuse, adapt, or modify any code that is posted on the slides or the class website.
Task 1 (40 points, programming)
File nn_keras_base.py contains incomplete code that implements, using Keras, the training and evaluation process for a dense neural network that works on UCI datasets. To complete the code, you must create a file called nn_keras_solution.py, where you implement the following Python functions:
model = create_and_train_model(training_inputs, training_labels, layers, units_per_layer, epochs, hidden_activations)
test_accuracy = test_model(model, test_inputs, test_labels, ints_to_labels)
Function create_and_train_model creates and trains a model using the parameter values specified in the arguments. It returns the model that it has trained.
These are the specifications for create_and_train_model:
- Argument training_inputs is a 2D numpy array, where each row is a training input vector.
- Argument training_labels is a numpy column vector. That means it is a 2D numpy array, with a single column. The value stored in training_labels[i,0] is the class label for the vector stored at training_inputs[i].
- Argument layers specifies the total number of layers in the network, including the input layer and the output layer (so layers = 2 means a network with no hidden layers).
- Argument units_per_layer is a list specifying the number of units in each hidden layer. The length of this list should be layers - 2. For example, units_per_layer[0] is the number of units in the first hidden layer (assuming layers >= 3), units_per_layer[1] is the number of units in the second hidden layer (assuming layers >= 4), and so on. If your network has no hidden layers (i.e., if layers is 2) then this argument is ignored.
- Argument epochs specifies the number of epochs (i.e., number of training rounds) for the training process.
- Argument hidden_activations is a list of strings specifying the activation function in each hidden layer. Each string can have one of three values: "sigmoid", "tanh", or "relu". The length of this list should be layers - 2. For example, hidden_activations[0] specifies the activation in the first hidden layer (assuming layers >= 3), hidden_activations[1] specifies the activation in the second hidden layer (assuming layers >= 4), and so on. If your network has no hidden layers (i.e., if layers is 2) then this argument is ignored.
- Return value model is the model that the function has created and trained.
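The specifications above can be sketched roughly as follows, assuming TensorFlow's Keras API (the exact structure of your solution is up to you; the number of classes being inferred as max label + 1 is an assumption that holds for the zero-based integer labels produced by the provided loading code):

```python
# Rough sketch of create_and_train_model, assuming TensorFlow's Keras API.
# The input dimensionality and the number of classes are inferred from the
# training data; num_classes = max label + 1 assumes zero-based int labels.
import numpy as np
import tensorflow as tf

def create_and_train_model(training_inputs, training_labels, layers,
                           units_per_layer, epochs, hidden_activations):
    num_classes = int(np.max(training_labels)) + 1
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(training_inputs.shape[1],)))
    # One dense hidden layer per entry in units_per_layer (layers - 2 of them);
    # if layers == 2 these lists are ignored and no hidden layers are added.
    if layers > 2:
        for units, activation in zip(units_per_layer, hidden_activations):
            model.add(tf.keras.layers.Dense(units, activation=activation))
    # Output layer: fully connected with softmax, as the spec requires.
    model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
    # The sparse version of categorical cross entropy lets us pass the
    # integer labels to fit() directly, without one-hot encoding them.
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(training_inputs, training_labels, epochs=epochs)
    return model
```

If you prefer the non-sparse loss, convert the labels with a one-hot encoding before calling fit(); either choice is allowed by the guidelines below.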
Function test_model evaluates the trained model on the test set.
These are the specifications:
- Argument model is the trained model returned by create_and_train_model.
- Argument test_inputs is a 2D numpy array, where each row is a test input vector.
- Argument test_labels is a numpy column vector. That means it is a 2D numpy array, with a single column. The value stored in test_labels[i,0] is the class label for the vector stored at test_inputs[i].
- Argument ints_to_labels is a Python dictionary that maps integer labels (the labels that the original class labels were mapped to for the purposes of training) back to the original class labels (which can be ints or strings). This is useful when your code prints test results, so that it can print the original class label rather than the integer that it was mapped to.
- Return value test_accuracy is the classification accuracy on the test set, represented as a floating point number between 0 and 1. Ties on classification outputs for any test input should be handled appropriately, using the same rules as in Assignment 3.
The training and test files will follow the same format as the text files in the UCI datasets directory and the synthetic directory that you are familiar with from Assignment 3. Function read_uci_file in uci_load.py handles reading these files and extracting the appropriate vectors and class labels. A description of the datasets and the file format can be found at this link. Your code should also work with ANY OTHER training and test files that use the same format.
Training
In your implementation, you should use these guidelines:
- Let Keras use its default method for initialization of all weights (i.e., your code should not address this issue at all).
- For the optimizer, use "adam" with default settings.
- All hidden layers should be fully connected ("dense", in Keras terminology).
- The output layer should also be fully connected, and it should use the softmax activation function.
- The loss function should be Categorical Cross Entropy (it is up to you if you use the sparse version or not, but make sure that the class labels that you pass to the fit() method are appropriate for your choice of loss function).
- For any Keras option that is not explicitly discussed here (for example, the batch size), you should let Keras use its default values.
There is no need to output anything for the training phase. It is OK if Keras prints out various things per epoch.
Testing
For each test object, your test_model function should print a line containing the following info:
- Object ID. This is the line number where that object occurs in the test file. Start with 0 in numbering the objects, not with 1.
- Predicted class (the result of the classification). If your classification result is a tie among two or more classes, choose one of them randomly. Note that the predicted class should be one of the labels used in the original dataset, and not one of the integer labels that were used for training. Argument ints_to_labels is a dictionary mapping integers to original class labels. For example, if you use the yeast_string data set, the label printed here should be a string and not a number.
- True class (from the last column of the test file). As was the case for the predicted class, the true class should be the class label as defined in the original data set, and not the integer label that it was mapped to for the purposes of training.
- Accuracy. This is defined as follows:
- If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
- If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
- If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
- If there were ties in your classification result, and the correct class was NOT one of the classes that tied for best, the accuracy is 0.
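These four cases collapse into a single rule: if the true class is among the classes tied for best, the accuracy is 1 divided by the number of tied classes, otherwise 0 (the no-tie cases are just the special case where one class is "tied" with itself). A minimal helper illustrating this (the function name is hypothetical):

```python
# Per-object accuracy under the tie rules: tied_classes is the list of
# classes tied for the best output, true_class is the true label.
def object_accuracy(tied_classes, true_class):
    if true_class in tied_classes:
        return 1.0 / len(tied_classes)
    return 0.0
```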
To produce this output in Python in a uniform manner, use:
print('ID=%5d, predicted=%10s, true=%10s, accuracy=%4.2f' %
(test_index, predicted_class, actual_class, accuracy))
After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object. To print the classification accuracy in a uniform manner, use:
print('Classification accuracy on test set: %.2f%%' % (test_accuracy * 100))
In your answers.pdf document, please provide ONLY THE LAST LINE (the line printing the classification accuracy) of the output of the test stage, for the following test cases:
- Training and testing on the yeast_string dataset, with 4 layers, 50 units for the first hidden layer, 20 units for the second hidden layer, 60 epochs, "tanh" activation for the first hidden layer, and "sigmoid" activation for the second hidden layer.
- Training and testing on the synth5 dataset, with 4 layers, 20 units for the first hidden layer, 10 units for the second hidden layer, 80 epochs, "tanh" activation for the first hidden layer, and "sigmoid" activation for the second hidden layer.
Expected Classification Accuracy
You may get different classification accuracies when you run your code multiple times with the same input arguments. This is due to the fact that weights are initialized randomly. These are some results I got with my implementation:
- I ran my solution 20 times for the yeast_string test case described above.
The classification accuracy on the test set was between 57.02% and 58.68%. On my computer, the training process took about 2 seconds and the test process took about 20 seconds.
- I ran my solution 20 times for the synth5 test case described above.
The classification accuracy on the test set was between 84.50% and 92.75%. Out of the 20 tries, 19 produced test accuracies greater than or equal to 91.75%. One out of those 20 times produced an accuracy of 84.5%. On my computer, the training process took between 1 and 2 seconds and the test process took about 12 to 14 seconds.
Task 1b (Optional, extra credit, maximum 10 points).
A maximum of 10 extra credit points will be given to the submission or submissions that identify the function arguments achieving the best test accuracy for any of the three UCI datasets (pendigits, yeast, satellite). These function arguments, and the attained accuracy, should be reported in answers.pdf, under a clear "Task 1b" heading. These results should be achievable using the code that you submit for Task 1. You should achieve the reported accuracy at least five out of 10 times when you run your code.
Task 1c (Optional, extra credit, maximum 10 points).
In this task, you are free to change implementation options that were fixed in Task 1.
Examples of such options include different optimizers and settings for those optimizers, different types of layers, different batch sizes, etc. You can submit a solution called nn_keras_opt.py that implements your modifications. A maximum of 10 points will be given to the submission or submissions that, according to the instructor and GTA, achieve the best improvements on any of the three UCI datasets (pendigits, yeast, satellite), compared to the specifications in Task 1. In your answers.pdf document, under a clear "Task 1c" heading, explain:
- What modifications you made.
- What results you achieved. You should achieve the reported accuracy at least five out of 10 times when you run your code.
- What parameters to provide your code with in order to obtain those results.
- How long it took your program to run with those arguments.
Task 2 (30 points, programming)
This task is identical to the previous task, except that the model will be trained and tested on the MNIST dataset. File dense_mnist_base.py contains incomplete code that implements, using Keras, the training and evaluation process for a dense neural network that works on the MNIST dataset. To complete the code, you must create a file called dense_mnist_solution.py, where you implement the following Python functions:
(training_inputs, training_labels, test_inputs, test_labels) = load_mnist()
model = create_and_train_model(training_inputs, training_labels, layers, units_per_layer, epochs, hidden_activations)
Function load_mnist loads the MNIST dataset. It does not take any arguments. It returns training inputs, training labels, test inputs, and test labels, which have the same meaning as in the previous task. It is the responsibility of this function to make sure that the return values comply with the following requirements:
- The inputs should have the appropriate shape so that function create_and_train_model behaves appropriately.
- Input values are normalized to be floating point numbers between 0 and 1.
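One way to satisfy these requirements, assuming the dataset is fetched through tf.keras.datasets.mnist (an assumption; any loading mechanism that produces the same arrays would do), is to factor the reshaping and normalization into a small helper. Here the inputs are flattened to 2D row vectors, which is one of the allowed choices for this task:

```python
# Rough sketch of load_mnist. The preprocessing is factored into a helper
# (a hypothetical name) so the reshape/normalization logic is easy to check.
import numpy as np

def preprocess(images, labels):
    # Flatten each 28x28 image into a 784-dimensional row vector, and
    # normalize pixel values from [0, 255] to floats in [0, 1].
    inputs = images.reshape(images.shape[0], -1).astype('float32') / 255.0
    # Labels as a column vector (2D numpy array with a single column).
    return inputs, labels.reshape(-1, 1)

def load_mnist():
    import tensorflow as tf
    (train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
    training_inputs, training_labels = preprocess(train_x, train_y)
    test_inputs, test_labels = preprocess(test_x, test_y)
    return training_inputs, training_labels, test_inputs, test_labels
```

If you instead keep the images as 28x28 arrays (for example, to reuse the function in Task 3), adjust create_and_train_model accordingly, as discussed below.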
Function create_and_train_model has mostly the same specifications as in the previous task. The only exception is that here it is up to you to decide whether training_inputs and test_inputs should be two-dimensional arrays or have a different shape. Depending on exactly how you implement this function for the previous task, and how you implement load_mnist for this task, the create_and_train_model implementation from the previous task may or may not work here. It is up to you to decide whether you can reuse the implementation from the previous task, or whether you need to make modifications.
There is no need to output anything for the training phase. It is OK if Keras prints out various things per epoch.
Unlike the previous task, you do not have to write a function that prints a line showing the result for each test object.
At the end, the code should print the overall classification accuracy on the test set. The last two lines in dense_mnist_base.py take care of that, so you do not need to do anything more to satisfy this requirement.
In your answers.pdf document, please provide ONLY THE LAST LINE (the line printing the classification accuracy) of the output of the test stage, for the following test case:
- 4 layers.
- 500 units for the first hidden layer, 400 units for the second hidden layer.
- 20 epochs.
- "tanh" activation for the first hidden layer, "sigmoid" activation for the second hidden layer.
You may get different classification accuracies when you run your code multiple times with the same input arguments. This is due to the fact that weights are initialized randomly. I ran my solution 5 times for the test case described above. The classification accuracy on the test set was between 97.40% and 98.08%. On my computer, the training process took about 4 minutes and the test process took about 2 seconds. Note that the test process can be much faster here compared to the previous task, because here we do not print out information for each individual test object.
Task 2b (Optional, extra credit, maximum 10 points).
A maximum of 10 extra credit points will be given to the submission or submissions that identify the function arguments achieving the best test accuracy. These function arguments, and the attained accuracy, should be reported in answers.pdf, under a clear "Task 2b" heading. These results should be achievable using the code you submit for Task 2. You should achieve the reported accuracy at least five out of 10 times when you run your code.
Task 2c (Optional, extra credit, maximum 10 points).
In this task, you are free to change implementation options that were fixed in Task 2.
Examples of such options include different optimizers and settings for those optimizers, different types of layers, different batch sizes, etc. You can submit a solution called dense_mnist_opt.py that implements your modifications. A maximum of 10 points will be given to the submission or submissions that, according to the instructor and GTA, achieve the best improvements compared to the specifications in Task 2. In your answers.pdf document, under a clear "Task 2c" heading, explain:
- What modifications you made.
- What results you achieved. You should achieve the reported accuracy at least five out of 10 times when you run your code.
- What parameters to provide your program with in order to obtain those results.
- How long it took your program to run with those arguments.
Task 3 (30 points, programming)
This task is similar to the previous task, except that here you will use a convolutional neural network (CNN). File cnn_mnist_base.py contains incomplete code that implements, using Keras, the training and evaluation process for a CNN that works on the MNIST dataset. To complete the code, you must create a file called cnn_mnist_solution.py, where you implement the following Python functions:
(training_inputs, training_labels, test_inputs, test_labels) = load_mnist()
model = create_and_train_model(training_inputs, training_labels, blocks,
filter_size, filter_number, region_size,
epochs, cnn_activation)
Function load_mnist has the same specifications as in the previous task. Note that, depending on exactly how you implement this function for the previous task, and how you implement create_and_train_model for this task, the load_mnist implementation from the previous task may or may not work here. It is up to you to decide if you can reuse the implementation from the previous task, or if you need to make modifications.
These are the specifications for create_and_train_model:
- Argument training_inputs is a numpy array (you decide the shape), where training_inputs[i] is the i-th training input pattern.
- Argument training_labels is a numpy column vector. That means it is a 2D numpy array, with a single column. The value stored in training_labels[i,0] is the class label for the input pattern stored at training_inputs[i].
- Argument blocks specifies how many convolutional layers your model will have. Every convolutional layer must be followed by a max pool layer. The total number of layers should be 2*blocks + 2, to account for the input layer and the output layer. The output layer should be fully connected and use the softmax activation function. Except for the output layer, there should be no other fully connected layers.
- Argument filter_size specifies the number of rows of each 2D convolutional filter. The number of columns should be equal to the number of rows, so it is also specified by filter_size. For example, if filter_size = 3, each 2D filter has 3 rows and 3 columns.
- Argument filter_number specifies the number of 2D convolutional filters used at each convolutional layer. For example, if filter_number = 5, each convolutional layer applies 5 2D filters.
- Argument region_size specifies the size of the region for the max pool layer. For example, if region_size = 2, each output of a max pool layer should be the max value of a 2x2 region (i.e., 2 rows and 2 columns).
- Argument epochs specifies the number of epochs (i.e., number of training rounds) for the training process.
- Argument cnn_activation specifies the activation function that should be used in convolutional layers. This argument is a single string (not a list), that can be "sigmoid", "tanh", or "relu". All convolutional layers will have the same activation function.
- Return value model is the model that the function has created and trained.
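These specifications can be sketched as follows, assuming the inputs are shaped (num_samples, rows, columns, 1), which is one valid choice for load_mnist in this task (inferring the number of classes as max label + 1 is again an assumption that holds for zero-based integer labels):

```python
# Rough sketch of create_and_train_model for the CNN task, assuming inputs
# shaped (num_samples, rows, columns, 1). Each "block" is a Conv2D layer
# followed by a MaxPooling2D layer; the only dense layer is the output.
import numpy as np
import tensorflow as tf

def create_and_train_model(training_inputs, training_labels, blocks,
                           filter_size, filter_number, region_size,
                           epochs, cnn_activation):
    num_classes = int(np.max(training_labels)) + 1
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=training_inputs.shape[1:]))
    for _ in range(blocks):
        # filter_number square filters of size filter_size x filter_size,
        # all convolutional layers sharing the same activation function.
        model.add(tf.keras.layers.Conv2D(filter_number,
                                         (filter_size, filter_size),
                                         activation=cnn_activation))
        # Max pooling over region_size x region_size regions.
        model.add(tf.keras.layers.MaxPooling2D((region_size, region_size)))
    # Flatten is a reshaping step, not an extra fully connected layer.
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(training_inputs, training_labels, epochs=epochs)
    return model
```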
There is no need to output anything for the training phase. It is OK if Keras prints out various things per epoch.
As in Task 2, you do not have to write a function that prints a line showing the result for each test object. At the end, the code should print the overall classification accuracy on the test set. The last two lines in cnn_mnist_base.py take care of that, so you do not need to do anything more to satisfy this requirement.
In your answers.pdf document, please provide ONLY THE LAST LINE (the line printing the classification accuracy) of the output of the test stage, for the following test case:
- blocks = 2
- filter_size = 3
- filter_number = 32
- region_size = 2
- epochs = 20
- cnn_activation = 'relu'
You may get different classification accuracies when you run your code multiple times with the same input arguments. This is due to the fact that weights are initialized randomly. I ran my solution 5 times for the test case described above. The classification accuracy on the test set was between 98.71% and 98.94%. On my computer, the training process took about 6 minutes and the test process took about 6 seconds.
Task 3b (Optional, extra credit, maximum 10 points).
A maximum of 10 extra credit points will be given to the submission or submissions that identify the function arguments achieving the best test accuracy. These function arguments, and the attained accuracy, should be reported in answers.pdf, under a clear "Task 3b" heading. These results should be achievable by calling the code you submit for Task 3. You should achieve the reported accuracy at least five out of 10 times when you run your code.
Task 3c (Optional, extra credit, maximum 10 points).
In this task, you are free to change implementation options that were fixed in Task 3.
Examples of such options include different optimizers and settings for those optimizers, different types of layers, different batch sizes, etc. You can submit a solution called cnn_mnist_opt.py that implements your modifications. A maximum of 10 points will be given to the submission or submissions that, according to the instructor and GTA, achieve the best improvements compared to the specifications in Task 3. In your answers.pdf document, under a clear "Task 3c" heading, explain:
- What modifications you made.
- What results you achieved. You should achieve the reported accuracy at least five out of 10 times when you run your code.
- What parameters to provide your program with in order to obtain those results.
- How long it took your program to run with those arguments.
Task 4 (Optional, extra credit, maximum 10 points).
Design your own dataset that satisfies the following requirements:
- The dataset is computer-generated, using some randomized process. You should submit the code (in any language that you like) that generates the data.
- Every input is two-dimensional.
- The dataset has at least 200 training objects, and at least 200 test objects.
- There are two classes; half of the training objects belong to each class, and half of the test objects belong to each class.
- In the training set, the standard deviation for each class, along each dimension, is at least 2.
- In the training set, the Euclidean distance between the means of the two classes is at most 1.
- In the test set, the standard deviation for each class, along each dimension, is at least 2.
- In the test set, the Euclidean distance between the means of the two classes is at most 1.
- For each class, the Euclidean distance between the mean of the training objects belonging to that class and the mean of the test objects belonging to that class is at least 2.
- A 4-layer neural network (that is, with two hidden layers) achieves 100% accuracy on the test set. Your solution for Task 1 should be sufficient for achieving the accuracy. In your answers.pdf file, you can specify the parameters that your network uses.
You should submit the training and test files, formatted the same way as the data sets for Task 1. You should also submit the code (in any language that you like) that was used to generate the dataset. In answers.pdf, specify the file or files where the code and data are stored.
CSE 4311 - Assignments - Assignment 4