Variable Specifications and Preprocessing

Variable specification defines the relationship between response and predictor variables as well as the data used to estimate the relationship. Four main types of specifications are supported by the fit and resample functions: traditional formula, design matrix, model frame, and preprocessing recipe.

Analysis Dataset

The different variable specifications are illustrated with the Iowa City home prices dataset (ICHomes) from the MachineShop package. To start, a formula is defined relating sale amount to home characteristics. Two equivalent definitions are given: one in which the predictors on the right-hand side of the formula are listed explicitly, and another in which . notation is used to indicate that all remaining variables not already in the formula should be included on the right-hand side.

## Analysis library
library(MachineShop)

## Iowa City home prices dataset
str(ICHomes)
#> 'data.frame':    753 obs. of  17 variables:
#>  $ sale_amount : int  90000 168500 205000 121000 215000 278000 170000 290000 185000 109900 ...
#>  $ sale_year   : int  2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
#>  $ sale_month  : int  1 1 1 1 1 2 2 2 2 2 ...
#>  $ built       : int  2001 1976 1995 2001 1974 1991 1977 1920 1993 1955 ...
#>  $ style       : Factor w/ 2 levels "Home","Condo": 2 1 1 2 1 1 1 1 1 1 ...
#>  $ construction: Factor w/ 9 levels "1 1/2 Story Frame",..: 4 8 8 3 7 7 8 7 7 4 ...
#>  $ base_size   : int  878 1236 1466 1150 936 936 1220 985 914 864 ...
#>  $ add_size    : int  0 0 0 0 376 384 0 356 0 0 ...
#>  $ garage1_size: int  0 576 0 0 572 528 0 0 440 240 ...
#>  $ garage2_size: int  264 0 0 528 0 0 0 0 0 0 ...
#>  $ lot_size    : int  3718 8800 16720 3427 10455 13680 7205 9800 8960 8375 ...
#>  $ bedrooms    : int  2 4 3 2 5 4 3 4 3 2 ...
#>  $ basement    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 2 2 ...
#>  $ ac          : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ attic       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ lon         : num  -91.5 -91.5 -91.6 -91.6 -91.6 ...
#>  $ lat         : num  41.7 41.7 41.6 41.7 41.7 ...

## Formula: explicit inclusion of predictor variables
fo <- sale_amount ~ sale_year + sale_month + built + style + construction +
  base_size + add_size + garage1_size + garage2_size + lot_size +
  bedrooms + basement + ac + attic + lon + lat

## Formula: implicit inclusion of predictor (. = remaining) variables
fo <- sale_amount ~ .
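
The equivalence of the two definitions can be checked by comparing the terms each formula expands to. The following is a minimal sketch using only base R; the small data frame toy_homes is hypothetical (a three-column stand-in for ICHomes), so the snippet runs without the package installed.

```r
## Toy stand-in for ICHomes (hypothetical; three columns only)
toy_homes <- data.frame(
  sale_amount = c(90000, 168500, 205000),
  built = c(2001, 1976, 1995),
  bedrooms = c(2, 4, 3)
)

## Explicit and implicit (. = remaining variables) definitions
explicit <- sale_amount ~ built + bedrooms
implicit <- sale_amount ~ .

## The . is expanded against the columns of the supplied data
labels_explicit <- attr(terms(explicit), "term.labels")
labels_implicit <- attr(terms(implicit, data = toy_homes), "term.labels")

identical(labels_explicit, labels_implicit)
#> [1] TRUE
```

Note that the . expansion depends on the dataset supplied, so the two forms stay equivalent only as long as the data contain exactly the listed predictors.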

Traditional Formula

Traditional formula calls to the fitting functions consist of a formula and dataset pair. This specification additionally allows for crossing (*), interaction (:), and removal (-) of predictors in the formula; in-line functions of response variables; and some in-line functions of predictors.

model_fit <- fit(fo, ICHomes, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(fo, ICHomes, model = tuned_model)
summary(model_res)
#>       Statistic
#> Metric         Mean       Median           SD          Min          Max NA
#>   RMSE 5.309297e+04 5.953998e+04 1.438662e+04 2.782945e+04 6.558972e+04  0
#>   R2   6.461119e-01 6.725713e-01 1.156861e-01 4.180951e-01 8.101114e-01  0
#>   MAE  2.885963e+04 3.036286e+04 4.774341e+03 2.109884e+04 3.600989e+04  0
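
The formula operators mentioned above follow standard R formula semantics. As a hedged illustration, the sketch below uses base R only to show which predictor terms each operator produces; the data frame toy and the helper term_labels are hypothetical names introduced for this example and are not part of MachineShop.

```r
## Toy data frame (hypothetical; not part of ICHomes)
toy <- data.frame(
  y = c(1, 2, 4, 8),
  a = c(1, 2, 3, 4),
  b = c(0, 1, 0, 1),
  d = c(5, 3, 2, 1)
)

## Hypothetical helper: predictor terms that a formula expands to
term_labels <- function(formula, data = NULL) {
  attr(terms(formula, data = data), "term.labels")
}

## Crossing (*): main effects plus their interaction
crossed <- term_labels(y ~ a * b)           # "a" "b" "a:b"

## Interaction (:): the interaction term only
interacted <- term_labels(y ~ a:b)          # "a:b"

## Removal (-): all remaining variables except d
removed <- term_labels(y ~ . - d, data = toy)  # "a" "b"

## In-line function of the response: log-transformed outcome
mf_log <- model.frame(log(y) ~ a, data = toy)
log_response <- model.response(mf_log)
```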

Design Matrix

Support is provided for calls with a numeric design matrix and response object pair. The design matrix approach has lower computational overhead than the others and can thus enable a larger number of predictors to be included in an analysis.

x <- as.matrix(ICHomes[c("built", "base_size", "lot_size", "bedrooms")])
y <- ICHomes$sale_amount

model_fit <- fit(x, y, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(x, y, model = tuned_model)
summary(model_res)
#>       Statistic
#> Metric         Mean       Median           SD          Min          Max NA
#>   RMSE 6.139280e+04 6.047216e+04 1.169974e+04 4.573839e+04 7.742124e+04  0
#>   R2   5.393985e-01 5.380906e-01 4.670476e-02 4.572501e-01 6.225863e-01  0
#>   MAE  3.439138e+04 3.436936e+04 4.772329e+03 2.789822e+04 4.225911e+04  0
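
Because the design matrix must be numeric, factor predictors such as style have to be encoded before they can be included. One possible approach is sketched below with base R's model.matrix() on a small made-up data frame. The name toy_homes is hypothetical, and dropping the intercept column is an assumption about what a given model expects rather than a documented requirement.

```r
## Toy data with one numeric and one factor predictor (values are made up)
toy_homes <- data.frame(
  base_size = c(878, 1236, 1466, 1150),
  style = factor(c("Condo", "Home", "Home", "Condo"),
                 levels = c("Home", "Condo"))
)

## Expand the factor into a 0/1 indicator column and drop the intercept
x_toy <- model.matrix(~ base_size + style, data = toy_homes)[, -1, drop = FALSE]

colnames(x_toy)
#> [1] "base_size"  "styleCondo"
is.numeric(x_toy)
#> [1] TRUE
```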

Model Frame

Model frames are created with the traditional formula or design matrix syntax described above and then passed to the fitting functions. They allow for the specification of variables for stratified resampling or for weighting of cases in model fitting.

mf <- ModelFrame(fo, data = ICHomes, strata = ICHomes$sale_amount)

model_fit <- fit(mf, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(mf, model = tuned_model)
summary(model_res)
#>       Statistic
#> Metric         Mean       Median           SD          Min          Max NA
#>   RMSE 5.453330e+04 5.523876e+04 1.058467e+04 3.457570e+04 72582.433998  0
#>   R2   6.294583e-01 6.538984e-01 1.297982e-01 3.574595e-01     0.791075  0
#>   MAE  2.910045e+04 2.976941e+04 3.013686e+03 2.335728e+04 32718.650705  0
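
To give a sense of what stratifying on a numeric variable such as sale_amount can mean, the following base R sketch bins a simulated response into quartile strata and assigns cross-validation folds within each stratum. It is a conceptual illustration only, under the assumption of quartile binning; it is not a description of how MachineShop forms strata internally, and all names in it are hypothetical.

```r
set.seed(123)

## Simulated skewed response (stand-in for a sale amount)
y <- c(rnorm(40, mean = 100, sd = 10), rnorm(10, mean = 300, sd = 20))

## Bin the response into quartile strata
breaks <- quantile(y, probs = seq(0, 1, by = 0.25))
strata <- cut(y, breaks = breaks, include.lowest = TRUE)

## Assign folds separately within each stratum so that every fold
## receives a similar share of low, middle, and high values
n_folds <- 5
folds <- integer(length(y))
for (s in levels(strata)) {
  idx <- which(strata == s)
  folds[idx] <- sample(rep_len(seq_len(n_folds), length(idx)))
}

## Fold sizes within each stratum differ by at most one case
counts <- table(strata, folds)
```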

Preprocessing Recipe

Preprocessing recipes provide a flexible framework for defining predictor and response variables as well as preprocessing steps to be applied to them prior to model fitting. Using recipes helps ensure that estimation of predictive performance accounts for all modeling steps. As with model frames, the recipes approach allows for specification of case strata and weights.

library(recipes)

rec <- recipe(fo, data = ICHomes) %>%
  role_case(stratum = sale_amount) %>%
  step_center(base_size, add_size, garage1_size, garage2_size, lot_size) %>%
  step_pca(base_size, add_size, garage1_size, garage2_size, lot_size, num_comp = 2) %>%
  step_dummy(all_nominal_predictors())

model_fit <- fit(rec, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(rec, model = tuned_model)
summary(model_res)
#>       Statistic
#> Metric         Mean       Median           SD          Min          Max NA
#>   RMSE 5.493730e+04 4.411786e+04 1.943057e+04 3.061942e+04 8.405026e+04  0
#>   R2   6.264021e-01 6.203221e-01 1.368953e-01 3.905121e-01 8.146512e-01  0
#>   MAE  3.103498e+04 2.919259e+04 5.826028e+03 2.316113e+04 4.101513e+04  0
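
The reason preprocessing belongs inside the resampling loop is that steps such as centering and PCA estimate parameters from data. The base R sketch below illustrates the idea on simulated data with a single hypothetical train/test split: the centering means and PCA rotation are estimated on the training rows only and then applied unchanged to the test rows. It mirrors the concept behind step_center and step_pca but does not use the recipes package.

```r
set.seed(42)

## Simulated predictors (stand-ins for the size variables)
x <- matrix(rnorm(100 * 3), ncol = 3,
            dimnames = list(NULL, c("a", "b", "c")))

## Hypothetical split: first 20 rows held out for testing
test_idx <- 1:20
train <- x[-test_idx, ]
test <- x[test_idx, ]

## Estimate centering means and PCA rotation from the training rows only
pca <- prcomp(train, center = TRUE, scale. = FALSE)
train_pc <- pca$x[, 1:2]

## Apply the training-derived centering and rotation to the test rows
test_pc <- predict(pca, newdata = test)[, 1:2]

## Equivalent manual projection, shown for clarity
manual_pc <- scale(test, center = pca$center, scale = FALSE) %*%
  pca$rotation[, 1:2]
```

Estimating the means or components on the full dataset before splitting would let information from the test rows leak into the fitted preprocessing, which can make performance estimates overly optimistic.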