Advanced Data Analytics in Eco Predictive Modeling R Programming Task
---
title: "Predictive Modeling"
subtitle:
date:
output: html_document
---
```{r include=FALSE}
knitr::opts_chunk$set(
  cache = TRUE,
  warning = FALSE
)
```
***
Run this code to load a dataset on prices and characteristics of houses in Ames, Iowa (the `ames` data from `modeldata`, which is attached when you load `tidymodels`), and extract a subset of variables:
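```
# Install tidymodels if it is missing, then load it
if (!require("tidymodels")) install.packages("tidymodels")
library(tidymodels)

data(ames)

houses = ames %>%
  mutate(
    id = factor(row_number()),        # unique identifier for each house
    SalePrice = Sale_Price / 1000,    # sale price in thousands of dollars
    age = Year_Sold - Year_Built,     # age of the house at the time of sale
    area = Gr_Liv_Area                # above-ground living area
  ) %>%
  select(id, SalePrice, age, area, Lot_Area, TotRms_AbvGrd, Neighborhood)
```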
### Part A: Regressions and Prediction Errors
1. Estimate the following regression. (To keep things simple, you can just use the base `lm()` function.)
$$ SalePrice_i = \beta_0 + \beta_1 age_i + \beta_2 area_i + \beta_3 age_i \times area_i + \beta_4 age_i^2 + \beta_5 area_i^2 + \epsilon_i $$
2. Calculate predicted values for the outcome variable and add them to the `houses` dataframe as a new column (hint: remember `augment()`). Now calculate the root mean squared error (i.e., the square root of the MSE) of the prediction from this first regression. (Note that comparing models by RMSE produces the same ranking as comparing them by MSE: the loss function is the same, and we're just taking a monotonic transformation of the end result.)
3. Now beat my model! Estimate your own regression and calculate its RMSE. This regression should include both transformations and interactions of quantitative variables, but DO NOT yet include any categorical or indicator variables. (If you want to create polynomial terms more easily, you can try the `poly()` function with the `raw=TRUE` option.)
4. Estimate one more regression and calculate its RMSE. This time you’re allowed to use categorical variables. Try to see how low you can possibly get your RMSE.
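
The mechanics of the steps above (a regression with squared and interaction terms, predicted values stored as a new column, and an RMSE) can be sketched on a toy example. This is only an illustration under stated assumptions: it uses R's built-in `mtcars` data rather than the Ames data, with `mpg`, `wt` and `hp` standing in for the outcome and the two regressors, and it uses base-R `predict()` as a stand-in for `augment()`. The names `cars`, `fit`, `fit_poly` and `rmse` are hypothetical.

```r
# Toy illustration on the built-in mtcars data (NOT the Ames data).
cars <- mtcars

# wt * hp expands to wt + hp + wt:hp; I() protects the squared terms.
fit <- lm(mpg ~ wt * hp + I(wt^2) + I(hp^2), data = cars)

# Same model written with poly(..., raw = TRUE) for the polynomial terms.
fit_poly <- lm(
  mpg ~ poly(wt, 2, raw = TRUE) + poly(hp, 2, raw = TRUE) + wt:hp,
  data = cars
)

# Store predicted values as a new column (base-R analogue of augment()).
cars$.fitted <- predict(fit, newdata = cars)

# RMSE: square root of the mean squared prediction error.
rmse <- sqrt(mean((cars$mpg - cars$.fitted)^2))
rmse
```

Both formulas describe the same regression, so `fitted(fit)` and `fitted(fit_poly)` should agree up to numerical precision; `poly()` is only a more compact way of writing the polynomial terms.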
### Part B: Training and Testing
Now, run the code below to split your dataset into two datasets: 75% for training, and 25% for testing. Choose your own integer to insert into the `set.seed()` function below. (The purpose of `set.seed()` is to make sure you get the same results each time you run the same code. We're going to be drawing “random” samples, but computers are never truly random, only pseudo-random.)
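```
# Replace NUMBER with an integer of your choice
set.seed(NUMBER)

# Randomly assign 75% of the rows to training and 25% to testing
data_split = initial_split(houses, prop = 0.75)
houses_train = training(data_split)
houses_test = testing(data_split)
```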
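
To see what `set.seed()` does, here is a minimal base-R sketch. The seed value `123` is arbitrary, and `draw_1` and `draw_2` are hypothetical names used only for this illustration.

```r
# The seed value 123 is arbitrary; any integer works.
set.seed(123)
draw_1 <- sample(1:100, size = 5)

# Resetting the same seed restarts the generator from the same state,
# so the "random" draw is repeated exactly.
set.seed(123)
draw_2 <- sample(1:100, size = 5)

identical(draw_1, draw_2)  # TRUE
```

Without the second `set.seed()` call, `draw_2` would generally differ from `draw_1`, which is why a fixed seed is needed for a reproducible split.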
1. Estimate each of your 3 regression models from Part A on *only* your **training** dataset. Use these newly estimated models to generate predicted values for *only* your **testing** dataset. What is the RMSE for each model?
2. How do these RMSE values compare to each other? How do they compare to what you got in Part A? Why? Explain what’s going on.
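
The train-then-test workflow can be sketched as follows. This is an illustration under stated assumptions, not a solution: it again uses the built-in `mtcars` data instead of the Ames data, a manual base-R split in place of `initial_split()`, `training()` and `testing()`, and an arbitrary seed. The helper `rmse()` and the names `cars_train`, `cars_test` and `fit_train` are hypothetical.

```r
# Toy illustration on the built-in mtcars data (NOT the Ames data).
set.seed(1)  # arbitrary seed

# Manual 75/25 split (base-R stand-in for initial_split()).
n <- nrow(mtcars)
train_rows <- sample(n, size = floor(0.75 * n))
cars_train <- mtcars[train_rows, ]
cars_test  <- mtcars[-train_rows, ]

# Estimate the model on the training rows only.
fit_train <- lm(mpg ~ wt * hp + I(wt^2) + I(hp^2), data = cars_train)

# Helper: root mean squared error of a set of predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# In-sample RMSE (training rows) and out-of-sample RMSE (testing rows),
# both computed from the model that was fit on the training rows.
rmse_train <- rmse(cars_train$mpg, predict(fit_train, newdata = cars_train))
rmse_test  <- rmse(cars_test$mpg,  predict(fit_train, newdata = cars_test))

c(train = rmse_train, test = rmse_test)
```

The key detail is that `predict()` receives the testing rows through `newdata`, while the coefficients come only from the training rows.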