ELEC9741 – Electrical Engineering Data Science Part II Assignment

Assignment Task

1. Data Modelling 

A firm produces solid metal cylinders and needs to manufacture them to extremely tight tolerances in both dimensions (diameter and height) and weight. The firm has 10 factories of equal capacity running around the clock in 10 different cities. However, the machines in one of the factories have been adversely affected by the environment and consequently produce cylinders of slightly different dimensions (referred to as ‘affected cylinders’ in the equations below). The quality assurance team has an automatic measurement and classification system at the firm’s only warehouse (supplied by all the factories), which inspects each cylinder to decide whether it can be shipped or should be recycled for not meeting specifications. The measured data can be modelled as:

[Equations: model of the measured data, supplied as images in the original and not reproduced here]

a. Comment on the shape of the decision boundary. Explain your answer with appropriate equations describing the underlying models of the data distribution and their implications for the decision boundary.

b. The quality assurance team does not know how many factories are producing affected cylinders, but they suspect it might be one or two. Give the equations for the optimal decision surface under both assumptions (one factory is producing affected cylinders; two factories are producing affected cylinders) and, based on these, provide a classification rule of the form:

[Equations: form of the classification rule, supplied as images in the original and not reproduced here]

c. Implement a suitable simulation (of the measured data) in MATLAB and use it to demonstrate that your classification rule works. 

2. Machine Learning Pipeline

You have been provided with the health records of 58,976 patients who were admitted to critical care units (CCU). Each patient’s record is represented by 16 numerical values and stored in a CSV file called “MedicalRecords.csv” (available for download from the course webpage). The first line of the file contains the header, while each subsequent line holds one patient record with its values separated by commas.

Among the 16 numerical values, “LOSdays” represents the number of days each patient stayed in the CCU, from admission to discharge. The other 15 values represent the daily average counts of various medical events, such as callouts, diagnoses, procedures, CPT events, input events, labs, microbiology labs, notes, output events, medical prescriptions, procedural events, transfers between care units, chart events, and the summary of all the daily averages.

Part 1: Regression Problem

Your task is to design a machine-learning pipeline for regression to predict the number of days a patient is expected to stay in the CCU (LOSdays) based on the other 15 values.

a. Design and implement the pipeline. You may use the following steps as a guideline:

  • Prepare your data for model training. Define your target and features and split the data into training and testing subsets.
  • Choose an appropriate regression model to train. You may select any regression model, such as Linear Regression, Support Vector Regression, Random Forest Regression, etc. You can utilize MATLAB built-in functions for these models. Train the model on the training subset.
  • Choose the appropriate evaluation metric(s) for the task and evaluate the performance of your trained model on the test subset. Note that you are expected to implement them with your own code. The use of MATLAB built-in functions for evaluation metrics is not allowed.
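
As an illustration of hand-written evaluation metrics, the following is a minimal Python sketch (the assignment itself is to be completed in MATLAB, so treat this as a description of the logic only). The choice of MAE, RMSE, and R² is an assumption about which metrics suit the task; the function names and the toy values are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values (made up for illustration, not taken from MedicalRecords.csv).
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

The same three formulas translate directly into element-wise MATLAB expressions on the test-set vectors.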

b. Compute the mean, median, standard deviation, minimum, and maximum of all features. What do you observe?
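
The five statistics can be computed per feature column as in the minimal Python sketch below (illustrative only; the assignment is to be done in MATLAB). The function name and toy column are hypothetical, and the sample standard deviation (dividing by n - 1, which is also MATLAB's default for `std`) is an assumed convention.

```python
import math

def summary_stats(values):
    """Return mean, median, sample std, min and max of one feature column."""
    n = len(values)
    ordered = sorted(values)
    mean = sum(values) / n
    mid = n // 2
    # Median: middle element for odd n, average of the two middle ones for even n.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # Sample variance (divide by n - 1); requires at least two values.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {
        "mean": mean,
        "median": median,
        "std": math.sqrt(variance),
        "min": ordered[0],
        "max": ordered[-1],
    }

# Toy column with one large value (made up for illustration).
stats = summary_stats([1.0, 2.0, 3.0, 4.0, 100.0])
print(stats)
```

In the toy column the mean (22) sits far above the median (3), which is the kind of gap that signals a skewed feature.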

c. Based on the statistics computed in (b), is a feature normalisation step needed?

d. Train and evaluate the model with and without the feature normalisation step. Analyse the influence on performance.
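
One common form of feature normalisation is z-scoring, sketched below in Python for illustration (the assignment itself uses MATLAB). Z-scoring is an assumed choice, not one prescribed by the task, and the function names are hypothetical. The sketch fits the mean and standard deviation on the training subset only and reuses them on the test subset, so that no test information leaks into training.

```python
import math

def fit_zscore(train_column):
    """Estimate mean and sample standard deviation from the training data."""
    n = len(train_column)
    mean = sum(train_column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_column) / (n - 1))
    return mean, std

def apply_zscore(column, mean, std):
    """Scale a column with previously fitted statistics."""
    if std == 0:
        # Constant feature: carries no information, so map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

# Toy feature column (made up for illustration).
train = [10.0, 20.0, 30.0]
test = [20.0, 40.0]
mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
test_z = apply_zscore(test, mu, sigma)
print(train_z, test_z)
```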

e. Compute the correlation between each input feature and the target output (LOSdays). What do you observe? Note: use your own code to compute the correlation. Using MATLAB’s built-in functions is not allowed.
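
Assuming the intended measure is the Pearson correlation coefficient (the task does not name one), a hand-written version might look like the Python sketch below. It is illustrative only, since the assignment is to be done in MATLAB, and the function name and toy vectors are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined if either vector is constant (spread of zero).
    return cov / (spread_x * spread_y)

# Toy vectors (made up for illustration).
feature = [1.0, 2.0, 3.0, 4.0]
r_pos = pearson(feature, [2.0, 4.0, 6.0, 8.0])  # perfectly increasing
r_neg = pearson(feature, [8.0, 6.0, 4.0, 2.0])  # perfectly decreasing
print(r_pos, r_neg)
```

The 1/(n - 1) factors of the covariance and the standard deviations cancel, so they are omitted.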

f. Based on the correlation values computed in (e), reduce the number of features to 10, 4, and 2. Retrain the model for each feature-set size and compare the performance. What do you observe?
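
One way to turn correlation values into a reduced feature set is to keep the k features with the largest absolute correlation, as in the Python sketch below (illustrative only; the assignment uses MATLAB). Ranking by absolute value is an assumption, and the feature names and correlation values are placeholders, not results from the dataset.

```python
def top_k_features(correlations, k):
    """Return the k feature names with the largest |correlation| to the target."""
    ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
    return ranked[:k]

# Placeholder names and values (made up for illustration).
corr = {"f1": 0.10, "f2": -0.65, "f3": 0.40, "f4": -0.05, "f5": 0.30}
selected = top_k_features(corr, 2)
print(selected)
```

A strongly negative correlation is as informative as a strongly positive one, which is why the sketch ranks by magnitude rather than by signed value.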

Part 2: Classification Problem

In this part, your objective is to design a machine-learning pipeline for classification to predict whether a patient will have a long, medium, or short stay based on the 15 numerical values used in Part 1.

The duration classifications are defined as follows:

  • Short period: When the patient is expected to stay for less than 6 days.
  • Medium period: When the patient is expected to stay for 6 days or more but less than 12 days.
  • Long period: When the patient is expected to stay for 12 days or more.
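
The three definitions above map LOSdays to a class label as in the following minimal Python sketch (illustrative only; the assignment uses MATLAB). The function name and the label strings are hypothetical.

```python
def stay_class(los_days):
    """Map a length of stay in days to 'short', 'medium' or 'long'."""
    if los_days < 6:
        return "short"   # less than 6 days
    if los_days < 12:
        return "medium"  # 6 days or more, but less than 12
    return "long"        # 12 days or more

# Boundary values fall into the higher class because the lower bounds are inclusive.
labels = [stay_class(d) for d in [0.5, 5.9, 6.0, 11.9, 12.0, 30.0]]
print(labels)
```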

g. Prepare your data for the classification task and replace the regression model from Part 1 with a classification model, utilizing any suitable method like Logistic Regression, Support Vector Machine classifier, Random Forest classifier, etc. You can use MATLAB’s built-in functions for training the classification model. 

h. Justify whether the same training/testing split used in Part 1 is valid for training the classification model. If not, re-split the data appropriately.
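
If a re-split is judged necessary, one option is a stratified split that preserves the class proportions in both subsets. The Python sketch below is illustrative only (the assignment uses MATLAB); the function name, the 20% test fraction, the seed, and the toy label counts are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomise within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy imbalanced labels (made up for illustration).
labels = ["short"] * 70 + ["medium"] * 20 + ["long"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```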

i. Determine whether the evaluation metric used in Part 1 is still valid for assessing the classification model. If not, suggest and implement other evaluation metrics.

j. Perform binary classification between patients who are expected to stay one day or less and patients who are expected to stay more than one day. Re-split the data into training/testing subsets, retrain the model, and compute the model’s performance.

k. Compute the confusion matrix of the test set for the binary model in (j). How many samples are there in each class? Analyse the observations.
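
A confusion matrix can be tallied directly from the true and predicted labels, as in the minimal Python sketch below (illustrative only; the assignment uses MATLAB). The convention of rows for true classes and columns for predicted classes, the 0/1 label encoding, and the toy labels are assumptions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count samples per (true class, predicted class) pair; rows = true, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Toy labels (made up for illustration): 0 = one day or less, 1 = more than one day.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
class_counts = [sum(row) for row in cm]  # number of test samples per true class
print(cm, class_counts)
```

The row sums give the number of test samples in each class, which makes any class imbalance visible alongside the errors.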

l. For the system in (j), compute accuracy, balanced accuracy, and F1. Which metric would you use to report the system performance in this case?
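
The three metrics follow from the four cells of a binary confusion matrix, as in the Python sketch below (illustrative only; the assignment uses MATLAB). Which class is treated as positive is an assumption that must be stated when reporting F1; the function name and toy counts are hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, balanced accuracy and F1 from confusion-matrix counts.

    Assumes both classes occur in the test set and at least one positive
    prediction is made; otherwise some ratios are undefined.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, balanced_accuracy, f1

# Toy counts for an imbalanced test set (made up for illustration).
acc, bal_acc, f1 = binary_metrics(tp=1, fp=5, fn=4, tn=90)
print(acc, bal_acc, f1)
```

With these toy counts accuracy is 0.91 while balanced accuracy is about 0.57 and F1 about 0.18, showing how the three metrics can disagree when one class dominates.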