ssden {gss} | R Documentation |
Estimate probability densities using smoothing spline ANOVA models.
The symbolic model specification via formula
follows the same
rules as in lm
, but with the response missing.
ssden(formula, type=NULL, data=list(), alpha=1.4, weights=NULL, subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL, domain=as.list(NULL), quad=NULL, qdsz.depth=NULL, bias=NULL, prec=1e-7, maxiter=30, skip.iter=FALSE) ssden1(formula, type=NULL, data=list(), alpha=1.4, weights=NULL, subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL, domain=as.list(NULL), quad=NULL, prec=1e-7, maxiter=30)
formula |
Symbolic description of the model to be fit. |
type |
List specifying the type of spline for each variable.
See |
data |
Optional data frame containing the variables in the model. |
alpha |
Parameter defining cross-validation score for smoothing parameter selection. |
weights |
Optional vector of bin-counts for histogram data. |
subset |
Optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
Function which indicates what should happen when the data contain NAs. |
id.basis |
Index of observations to be used as "knots." |
nbasis |
Number of "knots" to be used. Ignored when
|
seed |
Seed to be used for the random generation of "knots."
Ignored when |
domain |
Data frame specifying marginal support of density. |
quad |
Quadrature for calculating integral. Mandatory if variables other than factors or numerical vectors are involved. |
qdsz.depth |
Depth to be used in |
bias |
Input for sampling bias. |
prec |
Precision requirement for internal iterations. |
maxiter |
Maximum number of iterations allowed for internal iterations. |
skip.iter |
Flag indicating whether to use initial values of
theta and skip theta iteration. See |
The model specification via formula
is for the log density.
For example, ~x1*x2
prescribes a model of the form
log f(x1,x2) = g_{1}(x1) + g_{2}(x2) + g_{12}(x1,x2) + C
with the terms denoted by "x1"
, "x2"
, and
"x1:x2"
; the constant is determined by the fact that a
density integrates to one.
The selective term elimination may characterize (conditional)
independence structures between variables. For example,
~x1*x2+x1*x3
yields the conditional independence of x2 and x3
given x1.
Parallel to those in a ssanova
object, the model terms
are sums of unpenalized and penalized terms. Attached to every
penalized term there is a smoothing parameter, and the model
complexity is largely determined by the number of smoothing
parameters.
The selection of smoothing parameters is through a cross-validation
mechanism described in the references, with a parameter
alpha
; alpha=1
is "unbiased" for the minimization of
Kullback-Leibler loss but may yield severe undersmoothing, whereas
larger alpha
yields smoother estimates.
A subset of the observations are selected as "knots." Unless
specified via id.basis
or nbasis
, the number of
"knots" q is determined by max(30,10n^{2/9}), which is
appropriate for the default cubic splines for numerical vectors.
ssden
returns a list object of class "ssden"
.
ssden1
returns a list object of class
c("ssden1","ssden")
.
dssden
and cdssden
can be used to
evaluate the estimated joint density and conditional density;
pssden
, qssden
, cpssden
,
and cqssden
can be used to evaluate (conditional) cdf
and quantiles.
The method project.ssden
can be used to calculate the
Kullback-Leibler projection of "ssden"
objects for model
selection; project.ssden1
can be used to calculate the
square error projection of "ssden1"
objects.
In ssden
, default quadrature will be constructed for
numerical vectors on a hyper cube, then outer product with factor
levels will be taken if factors are involved. The sides of the
hyper cube are specified by domain
; for domain$x
missing, the default is
c(min(x),max(x))+c(-1,1)*(max(x)-mimn(x))*.05
. In 1-D, the
quadrature is the 200-point Gauss-Legendre formula returned from
gauss.quad
. In multi-D, delayed Smolyak cubatures
from smolyak.quad
are used on cubes with the marginals
properly transformed; see Gu and Wang (2003) for the marginal
transformations.
For reasonable execution time in higher dimensions, set
skip.iter=TRUE
in call to ssden
.
ssden1
does not involve multi-D quadrature but does not
perform as well as ssden
. It can be used in very high
dimensions where ssden
is infeasible.
The results may vary from run to run. For consistency, specify
id.basis
or set seed
.
Chong Gu, chong@stat.purdue.edu
Gu, C. (2002), Smoothing Spline ANOVA Models. New York: Springer-Verlag.
Gu, C. and Wang, J. (2003), Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statistica Sinica, 13, 811–826.
## 1-D estimate: Buffalo snowfall data(buffalo) buff.fit <- ssden(~buffalo,domain=data.frame(buffalo=c(0,150))) plot(xx<-seq(0,150,len=101),dssden(buff.fit,xx),type="l") plot(xx,pssden(buff.fit,xx),type="l") plot(qq<-seq(0,1,len=51),qssden(buff.fit,qq),type="l") ## Clean up ## Not run: rm(buffalo,buff.fit,xx,qq) dev.off() ## End(Not run) ## 2-D with triangular domain: AIDS incubation data(aids) ## rectangular quadrature quad.pt <- expand.grid(incu=((1:40)-.5)/40*100,infe=((1:40)-.5)/40*100) quad.pt <- quad.pt[quad.pt$incu<=quad.pt$infe,] quad.wt <- rep(1,nrow(quad.pt)) quad.wt[quad.pt$incu==quad.pt$infe] <- .5 quad.wt <- quad.wt/sum(quad.wt)*5e3 ## additive model (pre-truncation independence) aids.fit <- ssden(~incu+infe,data=aids,subset=age>=60, domain=data.frame(incu=c(0,100),infe=c(0,100)), quad=list(pt=quad.pt,wt=quad.wt)) ## conditional (marginal) density of infe jk <- cdssden(aids.fit,xx<-seq(0,100,len=51),data.frame(incu=50)) plot(xx,jk$pdf,type="l") ## conditional (marginal) quantiles of infe (TIME-CONSUMING) ## Not run: cqssden(aids.fit,c(.05,.25,.5,.75,.95),data.frame(incu=50)) ## End(Not run) ## Clean up ## Not run: rm(aids,quad.pt,quad.wt,aids.fit,jk,xx) dev.off() ## End(Not run) ## One factor plus one vector data(gastric) gastric$trt fit <- ssden(~futime*trt,data=gastric) ## conditional density cdssden(fit,c("1","2"),cond=data.frame(futime=150)) ## conditional quantiles cqssden(fit,c(.05,.25,.5,.75,.95),data.frame(trt="1")) ## Clean up ## Not run: rm(gastric,fit) ## Sampling bias ## (X,T) is truncated to T<X<1 for T~U(0,1), so X is length-biased rbias <- function(n) { t <- runif(n) x <- rnorm(n,.5,.15) ok <- (x>t)&(x<1) while(m<-sum(!ok)) { t[!ok] <- runif(m) x[!ok] <- rnorm(m,.5,.15) ok <- (x>t)&(x<1) } cbind(x,t) } xt <- rbias(100) x <- xt[,1]; t <- xt[,2] ## length-biased bias1 <- list(t=1,wt=1,fun=function(t,x){x[,]}) fit1 <- ssden(~x,domain=list(x=c(0,1)),bias=bias1) plot(xx<-seq(0,1,len=101),dssden(fit1,xx),type="l") ## truncated bias2 <- list(t=t,wt=rep(1/100,100),fun=function(t,x){x[,]>t}) fit2 <- ssden(~x,domain=list(x=c(0,1)),bias=bias2) plot(xx,dssden(fit2,xx),type="l") ## Clean up ## Not run: rm(rbias,xt,x,t,bias1,fit1,bias2,fit2)