Advanced Analytics Techniques Assignment

Published by: Admin
Last Updated: 16-Sep-24

Assignment Task

Problem 1. Reading the dataset

Q1. Read the first 10,000 rows from the credit card dataset provided in the assignment_data folder

  • Name your DataFrame df
  • Rename the column `PAY_0` to `PAY_1` and the column `default payment next month` to `payment_default`
  • Delete the `ID` column
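
The three steps above can be sketched as follows. This is only an illustration: the real file name and format in the assignment_data folder are not given here, so a small in-memory CSV (a hypothetical stand-in with a few of the expected column names) is read instead, and `nrows=3` plays the role of `nrows=10000`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file in assignment_data.
csv_text = """ID,LIMIT_BAL,SEX,PAY_0,default payment next month
1,20000,2,2,1
2,120000,2,-1,1
3,90000,1,0,0
4,50000,2,0,0
"""

# nrows caps the number of data rows read (10,000 in the assignment; 3 here).
df = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Rename the two columns, then drop the identifier column.
df = df.rename(columns={"PAY_0": "PAY_1",
                        "default payment next month": "payment_default"})
df = df.drop(columns="ID")

print(df.columns.tolist())
```

If the real file turns out to be an Excel workbook rather than a CSV, `pd.read_excel` accepts `nrows` in the same way.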

Q2. List which features are numeric, ordinal, and nominal variables, and how many features of each kind there are in the dataset. To answer this question:

  • Find the definitions of the variables provided elsewhere in the course material (hint: make sure you do weekly tutorials)
  • Find the definitions of numeric, ordinal and nominal variables
  • Carefully consider the values of the data itself as well as the output of df.info().
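
As a starting point for this inspection, the sketch below prints `df.info()` together with the number of distinct values per column. The toy DataFrame is an assumption for illustration only; a column stored as an integer can still be nominal or ordinal, so the counts are a hint rather than an answer.

```python
import pandas as pd

# Toy data standing in for the real df (values are illustrative only).
df = pd.DataFrame({"LIMIT_BAL": [20000, 120000, 90000, 50000],
                   "SEX": [2, 2, 1, 2],
                   "EDUCATION": [2, 2, 1, 3]})

# Storage types and non-null counts per column.
df.info()

# Few distinct values often signals a categorical (nominal or ordinal) variable.
unique_counts = df.nunique()
print(unique_counts)
```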

Q3. Missing Values.

  • Print out the number of missing values for each variable in the dataset and comment on your findings.
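
One common way to count missing values per column is `isnull().sum()`. The sketch below uses a small made-up DataFrame, not the assignment data.

```python
import numpy as np
import pandas as pd

# Toy data with a few deliberately missing entries.
df = pd.DataFrame({"LIMIT_BAL": [20000.0, np.nan, 90000.0],
                   "EDUCATION": [2, 2, 1],
                   "AGE": [24.0, np.nan, np.nan]})

# isnull() marks each missing cell as True; sum() counts them per column.
missing = df.isnull().sum()
print(missing)
```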

Problem 2. Cleaning data and dealing with categorical features

Q1.

  • Use an appropriate pandas function to impute missing values using one of the following two strategies: mean and mode.
    – Take into consideration the type of each variable and the best practices we discussed in class and in the lecture notes
  • Explain what data imputation is, how you have done it here, and what decisions you had to make.
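
A minimal sketch of mean and mode imputation with `fillna()` follows. Which columns count as numeric and which as categorical is an assumption made here for the toy data; for the real dataset that choice is part of the question.

```python
import numpy as np
import pandas as pd

# Toy data; the column roles below are assumed for illustration.
df = pd.DataFrame({"LIMIT_BAL": [10000.0, np.nan, 30000.0, 20000.0],
                   "EDUCATION": [2.0, 2.0, np.nan, 1.0]})

numeric_cols = ["LIMIT_BAL"]      # imputed with the column mean
categorical_cols = ["EDUCATION"]  # imputed with the most frequent value

for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

for col in categorical_cols:
    # mode() returns a Series; [0] takes the first (smallest) most frequent value.
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```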

Q2.

  • Print value_counts() of the `SEX` column and add a dummy variable named `SEX_FEMALE` to df using get_dummies()
  • Carefully explain what the values of the new variable `SEX_FEMALE` mean
  • Make sure the variable `SEX` is deleted from df
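
The steps above might look like the sketch below. It assumes the coding 1 = male, 2 = female, which should be checked against the variable definitions in the course material, and it uses toy data.

```python
import pandas as pd

# Toy data; assumes SEX is coded 1 = male, 2 = female.
df = pd.DataFrame({"SEX": [2, 2, 1, 2], "AGE": [24, 26, 34, 37]})

print(df["SEX"].value_counts())

# get_dummies creates one indicator column per level: SEX_1 and SEX_2.
dummies = pd.get_dummies(df["SEX"], prefix="SEX")

# Keep only the indicator for level 2 and store it as 0/1 integers.
df["SEX_FEMALE"] = dummies["SEX_2"].astype(int)

# Remove the original column.
df = df.drop(columns="SEX")
print(df)
```

Under that assumed coding, `SEX_FEMALE` equals 1 for female customers and 0 for male customers.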

Q3. Print value_counts() of the `MARRIAGE` column and carefully comment on what you notice in relation to the definition of this variable.

Q4.

  • Apply get_dummies() to the `MARRIAGE` feature and add dummy variables `MARRIAGE_MARRIED`, `MARRIAGE_SINGLE`, `MARRIAGE_OTHER` to df.
  • Carefully consider how to allocate all the values of `MARRIAGE` across these 3 newly created features
  • Explain what decisions you had to make
  • Make sure that `MARRIAGE` is deleted from df
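
One possible allocation is sketched below on toy data. It assumes 1 = married and 2 = single, and it sends every other value (including any undocumented code) to `MARRIAGE_OTHER`. That mapping is a decision to justify, not the only valid one.

```python
import pandas as pd

# Toy data; assumes 1 = married, 2 = single, anything else = other.
df = pd.DataFrame({"MARRIAGE": [1, 2, 3, 0, 2]})

print(df["MARRIAGE"].value_counts())

# Map known codes to labels; unmapped codes become NaN and are set to OTHER.
labels = df["MARRIAGE"].map({1: "MARRIED", 2: "SINGLE"}).fillna("OTHER")

dummies = pd.get_dummies(labels, prefix="MARRIAGE").astype(int)

# reindex guarantees all three columns exist even if a category is absent.
dummies = dummies.reindex(
    columns=["MARRIAGE_MARRIED", "MARRIAGE_SINGLE", "MARRIAGE_OTHER"],
    fill_value=0,
)

# Replace the original column with the three indicators.
df = pd.concat([df.drop(columns="MARRIAGE"), dummies], axis=1)
print(df)
```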

Q5. In the column `EDUCATION`, convert the values {0, 5, 6} to the value 4.
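
A short sketch of this recoding with `replace()`, shown on toy data:

```python
import pandas as pd

# Toy column containing every code of interest.
df = pd.DataFrame({"EDUCATION": [0, 1, 2, 3, 4, 5, 6]})

# Collapse the values 0, 5 and 6 into 4; all other values are left unchanged.
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})

print(df["EDUCATION"].tolist())
```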

Problem 3. Preparing X and y arrays

Q1.

  • Create a numpy array y from the first 8,000 observations of the `payment_default` column from df
  • Create a numpy array X from the first 8,000 observations of all the remaining variables in df
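
The slicing can be sketched as below. The DataFrame is randomly generated toy data, and `n = 8` stands in for the 8,000 observations of the assignment.

```python
import numpy as np
import pandas as pd

# Toy data: 12 rows, two features and the target column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"LIMIT_BAL": rng.integers(10, 500, size=12) * 1000,
                   "AGE": rng.integers(21, 70, size=12),
                   "payment_default": rng.integers(0, 2, size=12)})

n = 8  # 8,000 in the assignment

# Target: first n values of payment_default.
y = df["payment_default"].iloc[:n].to_numpy()

# Features: first n rows of every other column.
X = df.drop(columns="payment_default").iloc[:n].to_numpy()

print(X.shape, y.shape)
```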

Q2.

  • Use an appropriate sklearn library we used in class to create y_train, y_test, X_train and X_test by splitting the data into 75% train and 25% test datasets
    – Set random_state to 4 and stratify the subsamples so that train and test datasets have roughly equal proportions of the target's class labels
  • Standardise the data to mean zero and variance one using an appropriate sklearn library
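
A minimal sketch of the split and standardisation, assuming `train_test_split` and `StandardScaler` are the sklearn tools meant here. The arrays are synthetic. Fitting the scaler on the training data only, then applying it to the test data, is a common practice adopted in this sketch rather than something the brief states.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X and y: 200 rows, 20% positive class.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = np.array([0] * 160 + [1] * 40)

# 75% train / 25% test, stratified on the class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

# Learn mean and standard deviation from the training set only.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
print(y_train.mean(), y_test.mean())
```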

Problem 4. Support Vector Classifier and Accuracies

Q1.

  • Train a Support Vector Classifier on the standardised data
    – Use the rbf kernel and set random_state to 3 (don't change any other parameters)
  • Compute and print training and test dataset accuracies
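
The training and scoring steps might be sketched as follows. The data comes from `make_classification`, a synthetic generator used here only so the example runs on its own; the accuracies it prints say nothing about the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data standing in for the prepared X and y.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# RBF-kernel support vector classifier; all other parameters left at defaults.
svc = SVC(kernel="rbf", random_state=3)
svc.fit(X_train_std, y_train)

# score() returns the mean accuracy on the given data.
train_acc = svc.score(X_train_std, y_train)
test_acc = svc.score(X_test_std, y_test)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```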

Q2.

  • Extract 2 linear principal components from the standardised features using an appropriate sklearn library
  • Train a Support Vector Classifier on the 2 principal components computed above
    – Use the rbf kernel and set random_state to 3 (don't change any other parameters)
  • Compute and print training and test dataset accuracies
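
A sketch of the same pipeline with a PCA step added, assuming `sklearn.decomposition.PCA` is the intended tool. The data is again synthetic, and the PCA is fitted on the training features only, a choice made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data standing in for the prepared X and y.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Project the standardised features onto the first 2 principal components.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Same classifier settings as before, now trained on the 2 components.
svc_pca = SVC(kernel="rbf", random_state=3)
svc_pca.fit(X_train_pca, y_train)

train_acc_pca = svc_pca.score(X_train_pca, y_train)
test_acc_pca = svc_pca.score(X_test_pca, y_test)
print(f"Training accuracy (2 PCs): {train_acc_pca:.3f}")
print(f"Test accuracy (2 PCs): {test_acc_pca:.3f}")
```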

Q3.

  • Comment on the suitability of the two classifiers to predict credit card defaults by commenting on (and comparing) the computed accuracies from the last two questions.
  • Make comparisons both within and across the two questions