smooth.spline {stats}R Documentation

Fit a Smoothing Spline

Description

Fits a cubic smoothing spline to the supplied data.

Usage

smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
              cv = FALSE, all.knots = FALSE, nknots = NULL,
              df.offset = 0, penalty = 1, control.spar = list())

Arguments

x a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.
y responses. If y is missing, the responses are assumed to be specified by x.
w optional vector of weights of the same length as x; defaults to all 1.
df the desired equivalent number of degrees of freedom (trace of the smoother matrix).
spar smoothing parameter, typically (but not necessarily) in (0,1]. The coefficient λ of the integral of the squared second derivative in the fit (penalized log likelihood) criterion is a monotone function of spar, see the details below.
cv ordinary (TRUE) or “generalized” cross-validation (GCV) when FALSE.
all.knots if TRUE, all distinct points in x are used as knots. If FALSE (default), a subset of x[] is used, specifically x[j] where the nknots indices are evenly spaced in 1:n, see also the next argument nknots.
nknots integer giving the number of knots to use when all.knots=FALSE. Per default, this is less than n, the number of unique x values for n > 49.
df.offset allows the degrees of freedom to be increased by df.offset in the GCV criterion.
penalty the coefficient of the penalty for degrees of freedom in the GCV criterion.
control.spar optional list with named components controlling the root finding when the smoothing parameter spar is computed, i.e., missing or NULL, see below.
Note that this is partly experimental and may change with general spar computation improvements!
low:
lower bound for spar; defaults to -1.5 (used to implicitly default to 0 in R versions earlier than 1.4).
high:
upper bound for spar; defaults to +1.5.
tol:
the absolute precision (tolerance) used; defaults to 1e-4 (formerly 1e-3).
eps:
the relative precision used; defaults to 2e-8 (formerly 0.00244).
trace:
logical indicating if iterations should be traced.
maxit:
integer giving the maximal number of iterations; defaults to 500.
Note that spar is only searched for in the interval [low, high].

Details

The x vector should contain at least four distinct values. Distinct here means “distinct after rounding to 6 significant digits”, i.e., x will be transformed to unique(sort(signif(x, 6))), and y and w are pooled accordingly.

The computational λ used (as a function of spar) is lambda = r * 256^(3*spar - 1) where r = tr(X' W X) / tr(Σ), Σ is the matrix given by Sigma[i,j] = Integral B''[i](t) B''[j](t) dt, X is given by X[i,j] = B[j](x[i]), W is the diagonal matrix of weights (scaled such that its trace is n, the original number of observations) and B[k](.) is the k-th B-spline.

Note that with these definitions, f_i = f(x_i), and the B-spline basis representation f = X c (i.e., c is the vector of spline coefficients), the penalized log likelihood is L = (y - f)' W (y - f) + λ c' Σ c, and hence c is the solution of the (ridge regression) (X' W X + λ Σ) c = X' W y.

If spar is missing or NULL, the value of df is used to determine the degree of smoothing. If both are missing, leave-one-out cross-validation (ordinary or “generalized” as determined by cv) is used to determine λ. Note that from the above relation, spar is spar = s0 + 0.0601 * log(lambda), which is intentionally different from the S-plus implementation of smooth.spline (where spar is proportional to λ). In R's (log λ) scale, it makes more sense to vary spar linearly.

Note however that currently the results may become very unreliable for spar values smaller than about -1 or -2. The same may happen for values larger than 2 or so. Don't think of setting spar or the controls low and high outside such a safe range, unless you know what you are doing!

The “generalized” cross-validation method will work correctly when there are duplicated points in x. However, it is ambiguous what leave-one-out cross-validation means with duplicated points, and the internal code uses an approximation that involves leaving out groups of duplicated points. cv=TRUE is best avoided in that case.

Value

An object of class "smooth.spline" with components

x the distinct x values in increasing order, see the Details above.
y the fitted values corresponding to x.
w the weights used at the unique values of x.
yin the y values used at the unique y values.
lev leverages, the diagonal values of the smoother matrix.
cv.crit cross-validation score, “generalized” or true, depending on cv.
pen.crit penalized criterion
crit the criterion value minimized in the underlying .Fortran routine ‘sslvrg’.
df equivalent degrees of freedom used. Note that (currently) this value may become quite unprecise when the true df is between and 1 and 2.
spar the value of spar computed or given.
lambda the value of λ corresponding to spar, see the details above.
iparms named integer(3) vector where ..$ipars["iter"] gives number of spar computing iterations used.
fit list for use by predict.smooth.spline, with components
knot:
the knot sequence (including the repeated boundary knots).
nk:
number of coefficients or number of “proper” knots plus 2.
coef:
coefficients for the spline basis used.
min, range:
numbers giving the corresponding quantities of x.
call the matched call.

Note

The default all.knots = FALSE and nknots = NULL entails using only O(n^{0.2}) knots instead of n for n > 49. This cuts speed and memory requirements, but not drastically anymore since R version 1.5.1 where it is only O(nk) + O(n) where nk is the number of knots. In this case where not all unique x values are used as knots, the result is not a smoothing spline in the strict sense, but very close unless a small smoothing parameter (or large df) is used.

Author(s)

R implementation by B. D. Ripley and Martin Maechler (spar/lambda, etc).

This function is based on code in the GAMFIT Fortran program by T. Hastie and R. Tibshirani (http://lib.stat.cmu.edu/general/), which makes use of spline code by Finbarr O'Sullivan. Its design parallels the smooth.spline function of Chambers & Hastie (1992).

References

Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.

Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.

Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.

See Also

predict.smooth.spline for evaluating the spline and its derivatives.

Examples

data(cars)
attach(cars)
plot(speed, dist, main = "data(cars)  &  smoothing splines")
cars.spl <- smooth.spline(speed, dist)
(cars.spl)
## This example has duplicate points, so avoid cv=TRUE

lines(cars.spl, col = "blue")
lines(smooth.spline(speed, dist, df=10), lty=2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
               "s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
       bg='bisque')
detach()

##-- artificial example
y18 <- c(1:3,5,4,7:3,2*(2:5),rep(10,4))
xx  <- seq(1,length(y18), len=201)
(s2  <- smooth.spline(y18)) # GCV
(s02 <- smooth.spline(y18, spar = 0.2))
plot(y18, main=deparse(s2$call), col.main=2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)

## The following shows the problematic behavior of 'spar' searching:
(s2  <- smooth.spline(y18,          con=list(trace=TRUE,tol=1e-6, low= -1.5)))
(s2m <- smooth.spline(y18, cv=TRUE, con=list(trace=TRUE,tol=1e-6, low= -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)



[Package Contents]