Variable specification defines the relationship between response and
predictor variables as well as the data used to estimate the
relationship. Four main types of specifications are supported by the
fit() and resample() functions: traditional formula, design matrix,
model frame, and preprocessing recipe.
Different variable specifications will be illustrated with the Iowa
City home prices dataset from the MachineShop package. To start, a
formula is defined relating sale amount to home characteristics. Two
equivalent definitions are given: one in which the predictors on the
right-hand side of the equation are listed explicitly, and another in
which . notation is used to indicate that all remaining variables not
already in the model be included on the right-hand side.
## Analysis library
library(MachineShop)
## Iowa City home prices dataset
str(ICHomes)
#> 'data.frame': 753 obs. of 17 variables:
#> $ sale_amount : int 90000 168500 205000 121000 215000 278000 170000 290000 185000 109900 ...
#> $ sale_year : int 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005 ...
#> $ sale_month : int 1 1 1 1 1 2 2 2 2 2 ...
#> $ built : int 2001 1976 1995 2001 1974 1991 1977 1920 1993 1955 ...
#> $ style : Factor w/ 2 levels "Home","Condo": 2 1 1 2 1 1 1 1 1 1 ...
#> $ construction: Factor w/ 9 levels "1 1/2 Story Frame",..: 4 8 8 3 7 7 8 7 7 4 ...
#> $ base_size : int 878 1236 1466 1150 936 936 1220 985 914 864 ...
#> $ add_size : int 0 0 0 0 376 384 0 356 0 0 ...
#> $ garage1_size: int 0 576 0 0 572 528 0 0 440 240 ...
#> $ garage2_size: int 264 0 0 528 0 0 0 0 0 0 ...
#> $ lot_size : int 3718 8800 16720 3427 10455 13680 7205 9800 8960 8375 ...
#> $ bedrooms : int 2 4 3 2 5 4 3 4 3 2 ...
#> $ basement : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 2 2 ...
#> $ ac : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
#> $ attic : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ lon : num -91.5 -91.5 -91.6 -91.6 -91.6 ...
#> $ lat : num 41.7 41.7 41.6 41.7 41.7 ...
## Formula: explicit inclusion of predictor variables
fo <- sale_amount ~ sale_year + sale_month + built + style + construction +
  base_size + add_size + garage1_size + garage2_size + lot_size +
  bedrooms + basement + ac + attic + lon + lat
## Formula: implicit inclusion of predictor (. = remaining) variables
fo <- sale_amount ~ .
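As a small check of this equivalence, the sketch below expands both formula styles with base R's terms() and compares the resulting predictor labels. The data frame homes is a hypothetical three-column stand-in for ICHomes, introduced only for illustration.

```r
## Hypothetical toy data standing in for ICHomes (assumption, not package data)
homes <- data.frame(
  sale_amount = c(90000, 168500, 205000),
  built = c(2001, 1976, 1995),
  base_size = c(878, 1236, 1466)
)

## Explicit and implicit (.) definitions of the same model
fo_explicit <- sale_amount ~ built + base_size
fo_dot <- sale_amount ~ .

## terms() needs the data to know which variables . stands for
labels_explicit <- attr(terms(fo_explicit), "term.labels")
labels_dot <- attr(terms(fo_dot, data = homes), "term.labels")

## TRUE when both definitions name the same predictors
identical(labels_explicit, labels_dot)
```

Note that the meaning of . depends on the dataset supplied, so the two definitions stay equivalent only as long as the columns of the data do not change.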
Traditional formula calls to the fitting functions consist of a
formula and dataset pair. This specification additionally allows for
crossing (*), interaction (:), and removal (-) of predictors in the
formula; in-line functions of response variables; and some in-line
functions of predictors.
model_fit <- fit(fo, ICHomes, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(fo, ICHomes, model = tuned_model)
summary(model_res)
#> Statistic
#> Metric Mean Median SD Min Max NA
#> RMSE 5.309297e+04 5.953998e+04 1.438662e+04 2.782945e+04 6.558972e+04 0
#> R2 6.461119e-01 6.725713e-01 1.156861e-01 4.180951e-01 8.101114e-01 0
#> MAE 2.885963e+04 3.036286e+04 4.774341e+03 2.109884e+04 3.600989e+04 0
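The crossing, interaction, and removal operators mentioned above can be inspected with base R's terms() before any model is fit. The following is a minimal sketch on a hypothetical toy data frame (homes) with a hypothetical helper (term_labels); it shows only how R expands the operators, not how MachineShop processes them internally.

```r
## Hypothetical toy data standing in for ICHomes (assumption, not package data)
homes <- data.frame(
  sale_amount = c(90000, 168500, 205000, 121000),
  built = c(2001, 1976, 1995, 2001),
  base_size = c(878, 1236, 1466, 1150),
  bedrooms = c(2, 4, 3, 2)
)

## Hypothetical helper returning the predictor terms of a formula
term_labels <- function(formula) {
  attr(terms(formula, data = homes), "term.labels")
}

## Crossing: main effects plus their interaction
crossed <- term_labels(sale_amount ~ built * base_size)
crossed
#> [1] "built"           "base_size"       "built:base_size"

## Interaction only: no main effects
interacted <- term_labels(sale_amount ~ built:base_size)
interacted
#> [1] "built:base_size"

## Removal: all remaining variables except bedrooms
removed <- term_labels(sale_amount ~ . - bedrooms)
removed
#> [1] "built"     "base_size"
```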
Support is provided for calls with a numeric design matrix and response object pair. The design matrix approach has lower computational overhead than the others and can thus enable a larger number of predictors to be included in an analysis.
x <- as.matrix(ICHomes[c("built", "base_size", "lot_size", "bedrooms")])
y <- ICHomes$sale_amount
model_fit <- fit(x, y, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(x, y, model = tuned_model)
summary(model_res)
#> Statistic
#> Metric Mean Median SD Min Max NA
#> RMSE 6.139280e+04 6.047216e+04 1.169974e+04 4.573839e+04 7.742124e+04 0
#> R2 5.393985e-01 5.380906e-01 4.670476e-02 4.572501e-01 6.225863e-01 0
#> MAE 3.439138e+04 3.436936e+04 4.772329e+03 2.789822e+04 4.225911e+04 0
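The example above selects only numeric columns, which matters because as.matrix() applied to a data frame containing factors returns a character matrix. One way to obtain a numeric design matrix from mixed columns is base R's model.matrix(), sketched here on a hypothetical toy data frame (homes). Dropping the intercept column is an assumption made for this illustration, not a documented requirement of the package.

```r
## Hypothetical toy data with one factor column (assumption, not package data)
homes <- data.frame(
  built = c(2001, 1976, 1995, 2001),
  base_size = c(878, 1236, 1466, 1150),
  style = factor(c("Condo", "Home", "Home", "Condo"),
                 levels = c("Home", "Condo"))
)

## Direct conversion yields a character matrix because of the factor
x_chr <- as.matrix(homes)
is.numeric(x_chr)
#> [1] FALSE

## model.matrix() dummy-codes the factor; the intercept column is dropped here
x_num <- model.matrix(~ ., data = homes)[, -1, drop = FALSE]
is.numeric(x_num)
#> [1] TRUE
colnames(x_num)
#> [1] "built"      "base_size"  "styleCondo"
```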
Model frames are created with the traditional formula or design matrix syntax described above and then passed to the fitting functions. They allow for the specification of variables for stratified resampling or for weighting of cases in model fitting.
mf <- ModelFrame(fo, data = ICHomes, strata = ICHomes$sale_amount)
model_fit <- fit(mf, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(mf, model = tuned_model)
summary(model_res)
#> Statistic
#> Metric Mean Median SD Min Max NA
#> RMSE 5.453330e+04 5.523876e+04 1.058467e+04 3.457570e+04 72582.433998 0
#> R2 6.294583e-01 6.538984e-01 1.297982e-01 3.574595e-01 0.791075 0
#> MAE 2.910045e+04 2.976941e+04 3.013686e+03 2.335728e+04 32718.650705 0
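To give a sense of what stratified resampling on a numeric variable involves, the sketch below bins a hypothetical vector of sale amounts into quartiles and assigns cross-validation folds within each bin, so that every fold covers the full price range. It is a conceptual base R illustration only; the quartile binning, the number of folds, and the variable names are assumptions and do not describe the package's internal algorithm.

```r
## Hypothetical sale amounts (assumption, not package data)
sale_amount <- seq(60000, by = 5000, length.out = 40)

## Define strata as quartile bins of the response
breaks <- quantile(sale_amount, probs = seq(0, 1, by = 0.25))
strata <- cut(sale_amount, breaks = breaks, include.lowest = TRUE,
              labels = FALSE)

## Assign fold labels separately within each stratum
k <- 5
set.seed(123)
folds <- integer(length(sale_amount))
for (s in unique(strata)) {
  idx <- which(strata == s)
  folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

## Each fold receives the same number of cases from every stratum
counts <- table(strata, folds)
counts
```

With 10 cases per stratum and 5 folds, every cell of the table holds 2 cases regardless of the random seed.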
Preprocessing recipes provide a flexible framework for defining predictor and response variables as well as preprocessing steps to be applied to them prior to model fitting. Using recipes helps ensure that estimation of predictive performance accounts for all modeling steps. As with model frames, the recipes approach allows for specification of case strata and weights.
library(recipes)
rec <- recipe(fo, data = ICHomes) %>%
  role_case(stratum = sale_amount) %>%
  step_center(base_size, add_size, garage1_size, garage2_size, lot_size) %>%
  step_pca(base_size, add_size, garage1_size, garage2_size, lot_size, num_comp = 2) %>%
  step_dummy(all_nominal_predictors())
model_fit <- fit(rec, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(rec, model = tuned_model)
summary(model_res)
#> Statistic
#> Metric Mean Median SD Min Max NA
#> RMSE 5.493730e+04 4.411786e+04 1.943057e+04 3.061942e+04 8.405026e+04 0
#> R2 6.264021e-01 6.203221e-01 1.368953e-01 3.905121e-01 8.146512e-01 0
#> MAE 3.103498e+04 2.919259e+04 5.826028e+03 2.316113e+04 4.101513e+04 0
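The point about accounting for all modeling steps can be illustrated outside the package: preprocessing parameters such as centering means and principal component loadings should be estimated on the training cases only and then applied unchanged to the held-out cases. The sketch below does this with base R's prcomp() on simulated, hypothetical size variables. It mirrors the centering and PCA steps conceptually and is not the recipes or MachineShop implementation.

```r
## Simulated size variables (hypothetical values, not package data)
set.seed(42)
sizes <- data.frame(
  base_size = rnorm(30, mean = 1100, sd = 200),
  add_size = rnorm(30, mean = 150, sd = 100),
  lot_size = rnorm(30, mean = 9000, sd = 3000)
)

## Split into training and held-out cases
train <- sizes[1:20, ]
test <- sizes[21:30, ]

## Estimate centering and PCA loadings from the training cases only
pca <- prcomp(train, center = TRUE, scale. = FALSE)

## Keep the first two components, as with num_comp = 2
train_scores <- pca$x[, 1:2]

## Apply the training-derived transformation to the held-out cases
test_scores <- predict(pca, newdata = test)[, 1:2]

dim(test_scores)
#> [1] 10  2
```

Estimating the transformation on all 30 cases before splitting would let information from the held-out cases leak into the training step, which is the kind of optimism that resampling a recipe is meant to avoid.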