Advanced Analytics Techniques Assignment
Assignment Task
Problem 1. Reading the dataset
Q1. Read the first 10,000 rows from the credit card dataset provided in the assignment_data folder
- Name your DataFrame df
- Rename the column `PAY_0` to `PAY_1` and the column `default payment next month` to `payment_default`
- Delete the `ID` column
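The three steps above can be sketched as follows. This is only an illustration on a tiny in-memory CSV: the real file name and format in assignment_data are not stated here, so `csv_text` and its values are hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; replace io.StringIO(csv_text)
# with the path to the dataset in assignment_data.
csv_text = """ID,LIMIT_BAL,SEX,PAY_0,default payment next month
1,20000,2,2,1
2,120000,2,-1,1
3,90000,1,0,0
"""

# nrows caps the number of data rows that are read
df = pd.read_csv(io.StringIO(csv_text), nrows=10_000)

# Rename the two columns, then drop the identifier column
df = df.rename(columns={"PAY_0": "PAY_1",
                        "default payment next month": "payment_default"})
df = df.drop(columns="ID")

print(df.columns.tolist())
```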
Q2. List which features are numeric, ordinal, and nominal variables, and state how many features of each kind the dataset contains. To answer this question:
- Find the definitions of the variables provided elsewhere in the course material (hint: make sure you do weekly tutorials)
- Find the definitions of numeric, ordinal and nominal variables
- Carefully consider the values of the data itself as well as the output of df.info().
Q3. Missing Values.
- Print out the number of missing values for each variable in the dataset and comment on your findings.
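One common way to count missing values per column is `isnull().sum()`. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the assignment data).

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing entries
toy = pd.DataFrame({
    "AGE": [24, np.nan, 35],
    "SEX": [1, 2, np.nan],
    "LIMIT_BAL": [20000, 50000, 90000],
})

# isnull() marks missing cells as True; sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```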
Problem 2. Cleaning data and dealing with categorical features
Q1.
- Use an appropriate pandas function to impute missing values using one of the following two strategies: mean or mode.
  - Take into consideration the type of each variable and the best practices we discussed in class and in the lecture notes
- Explain what data imputation is, how you have done it here, and what decisions you had to make.
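As a rough illustration of the two strategies, the sketch below fills a continuous column with its mean and a categorical column with its mode using `fillna()`. The data are made up, and which strategy suits which real variable is a decision you still need to make and justify.

```python
import numpy as np
import pandas as pd

# Made-up data: AGE is treated as numeric, EDUCATION as categorical
toy = pd.DataFrame({
    "AGE": [20.0, np.nan, 40.0, 30.0],
    "EDUCATION": [2.0, 2.0, np.nan, 1.0],
})

# Mean imputation for the numeric column
toy["AGE"] = toy["AGE"].fillna(toy["AGE"].mean())

# Mode imputation for the categorical column; mode() returns a Series,
# so take its first entry
toy["EDUCATION"] = toy["EDUCATION"].fillna(toy["EDUCATION"].mode()[0])

print(toy)
```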
Q2.
- Print value_counts() of the `SEX` column and add a dummy variable named `SEX_FEMALE` to df using get_dummies()
- Carefully explain what the values of the new variable `SEX_FEMALE` mean
- Make sure the variable `SEX` is deleted from df
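A minimal sketch of the dummy-variable step on made-up data. It assumes `SEX` is coded 1 = male and 2 = female; check this against the variable definitions in the course material before relying on it.

```python
import pandas as pd

# Made-up data; the coding 1 = male, 2 = female is an assumption
toy = pd.DataFrame({"SEX": [1, 2, 2, 1]})
print(toy["SEX"].value_counts())

# get_dummies creates one indicator column per distinct value: SEX_1, SEX_2
dummies = pd.get_dummies(toy["SEX"], prefix="SEX")

# Keep only the indicator for the value assumed to mean "female"
toy["SEX_FEMALE"] = dummies["SEX_2"].astype(int)

# Remove the original column
toy = toy.drop(columns="SEX")
print(toy)
```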
Q3. Print value_counts() of the `MARRIAGE` column and carefully comment on what you notice in relation to the definition of this variable.
Q4.
- Apply get_dummies() to the `MARRIAGE` feature and add the dummy variables `MARRIAGE_MARRIED`, `MARRIAGE_SINGLE`, `MARRIAGE_OTHER` to df.
- Carefully consider how to allocate all the values of `MARRIAGE` across these 3 newly created features
- Explain what decisions you had to make
- Make sure that `MARRIAGE` is deleted from df
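One possible way to build the three dummies is to map the raw codes to labels first and then call get_dummies(). The sketch assumes the coding 1 = married, 2 = single, 3 = other, and it places the value 0 under "other"; both the coding and that allocation are assumptions made for illustration, and your own allocation needs its own justification.

```python
import pandas as pd

# Made-up data containing every value discussed above
toy = pd.DataFrame({"MARRIAGE": [1, 2, 3, 0, 2]})

# Assumed mapping; sending 0 to OTHER is one choice among several
code_to_label = {1: "MARRIED", 2: "SINGLE", 3: "OTHER", 0: "OTHER"}
labels = toy["MARRIAGE"].map(code_to_label)

# One indicator column per label, named MARRIAGE_<label>
dummies = pd.get_dummies(labels, prefix="MARRIAGE").astype(int)

# Replace the original column with the dummies
toy = pd.concat([toy.drop(columns="MARRIAGE"), dummies], axis=1)
print(toy)
```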
Q5. In the column `EDUCATION`, convert the values {0, 5, 6} to the value 4.
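The recoding can be done with `replace()`, sketched here on made-up values.

```python
import pandas as pd

# Made-up data covering the values to be recoded
toy = pd.DataFrame({"EDUCATION": [0, 1, 2, 3, 4, 5, 6]})

# Map each of 0, 5 and 6 to 4; all other values are left unchanged
toy["EDUCATION"] = toy["EDUCATION"].replace([0, 5, 6], 4)
print(toy["EDUCATION"].tolist())
```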
Problem 3. Preparing X and y arrays
Q1.
- Create a numpy array y from the first 8,000 observations of `payment_default` column from df
- Create a numpy array X from the first 8,000 observations of all the remaining variables in df
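A small sketch of the slicing, using a made-up five-row DataFrame and `n_obs = 3` in place of the 8,000 observations required here.

```python
import pandas as pd

# Made-up data; only the column name payment_default comes from the task
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 30000],
    "AGE": [24, 26, 34, 37, 57],
    "payment_default": [1, 0, 0, 1, 0],
})

n_obs = 3  # the assignment uses 8,000
subset = toy.iloc[:n_obs]

# Target vector and feature matrix as numpy arrays
y = subset["payment_default"].to_numpy()
X = subset.drop(columns="payment_default").to_numpy()

print(X.shape, y.shape)
```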
Q2.
- Use the appropriate sklearn function we used in class to create y_train, y_test, X_train and X_test by splitting the data into 75% train and 25% test datasets
  - Set random_state to 4 and stratify the subsamples so that train and test datasets have roughly equal proportions of the target's class labels
- Standardise the data to mean zero and variance one using an appropriate sklearn class
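The split and standardisation could look roughly like the sketch below, which uses `train_test_split` and `StandardScaler` on synthetic data. Fitting the scaler on the training set only and reusing it on the test set is an assumption about good practice rather than something stated in the task.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 observations, 5 features, 25% positives
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(200, 5))
y = np.array([0] * 150 + [1] * 50)

# 75/25 split, stratified on y so both parts keep similar class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

# Fit the scaler on the training data, then apply it to both parts
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())
```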
Problem 4. Support Vector Classifier and Accuracies
Q1.
- Train a Support Vector Classifier on the standardised data
  - Use the rbf kernel and set random_state to 3 (don't change any other parameters)
- Compute and print training and test dataset accuracies
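A hedged sketch of this step on synthetic data generated with `make_classification` (the data and the resulting accuracies say nothing about the credit card dataset). Accuracy is taken from the classifier's `score()` method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared X and y
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# RBF-kernel SVC with all other parameters left at their defaults
svm = SVC(kernel="rbf", random_state=3)
svm.fit(X_train_std, y_train)

# score() returns the mean accuracy on the given data
train_acc = svm.score(X_train_std, y_train)
test_acc = svm.score(X_test_std, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```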
Q2.
- Extract 2 linear principal components from the standardised features using an appropriate sklearn class
- Train a Support Vector Classifier on the 2 principal components computed above
  - Use the rbf kernel and set random_state to 3 (don't change any other parameters)
- Compute and print training and test dataset accuracies
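The PCA variant might be sketched as below, again on synthetic data. It assumes `PCA` is fitted on the standardised training features only and then applied to the test features; the printed accuracies are not indicative of the results on the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared X and y
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Project the standardised features onto the first 2 principal components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Same classifier settings as before, now trained on 2 components
svm_pca = SVC(kernel="rbf", random_state=3)
svm_pca.fit(X_train_pca, y_train)

train_acc_pca = svm_pca.score(X_train_pca, y_train)
test_acc_pca = svm_pca.score(X_test_pca, y_test)
print(f"train accuracy: {train_acc_pca:.3f}, test accuracy: {test_acc_pca:.3f}")
```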
Q3.
- Comment on the suitability of the two classifiers to predict credit card defaults by commenting on (and comparing) the computed accuracies from the last two questions.
- Make comparisons both within and across the two questions