SFB Workshop
The behavior of natural dynamic systems such as the cardiovascular system is characterized by non-stationarities and phase shifts, making conventional methods of frequency analysis not fully reliable. We propose a novel technology (Phase-Rectified Signal Averaging, PRSA) that allows assessment of (quasi-)periodic system dynamics despite phase resettings and (1/f-)noise. The method aligns segments of the signal according to pre-defined anchor points and then averages the aligned segments.
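The align-and-average step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that an anchor is any sample larger than its predecessor (for heartbeat intervals, a lengthening interval, i.e. a deceleration). The function names `prsa` and `deceleration_capacity` are hypothetical, and the deceleration capacity formula follows the commonly quoted definition, which is not given in the abstract.

```python
def prsa(signal, half_window):
    """Average all windows of length 2 * half_window centred on anchor points.

    Assumption: an anchor is an index i with signal[i] > signal[i - 1].
    Returns (averaged_window, number_of_anchors); the anchor sits at
    position half_window inside the averaged window.
    """
    L = half_window
    n = len(signal)
    # Keep only anchors whose full window [a - L, a + L - 1] lies in the signal.
    anchors = [i for i in range(max(L, 1), n - L + 1)
               if signal[i] > signal[i - 1]]
    if not anchors:
        raise ValueError("no anchor points found")
    avg = [0.0] * (2 * L)
    for a in anchors:
        for k in range(2 * L):
            avg[k] += signal[a - L + k]
    return [v / len(anchors) for v in avg], len(anchors)


def deceleration_capacity(avg, half_window):
    """(X(0) + X(1) - X(-1) - X(-2)) / 4 around the anchor position.

    Assumption: commonly quoted definition; requires half_window >= 2.
    """
    L = half_window
    return (avg[L] + avg[L + 1] - avg[L - 1] - avg[L - 2]) / 4
```

Because every window is aligned at an anchor, oscillations that are phase-locked to the anchors survive the averaging while unsynchronized components tend to cancel.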
In a first clinical application, PRSA was used to specifically quantify deceleration-related oscillations embedded in long-term recordings of heartbeat intervals. In patients surviving the acute phase of myocardial infarction, the so-called deceleration capacity index was a significantly better predictor of mortality than both standard heart-rate variability and left ventricular ejection fraction (the latter is the current "gold standard" in risk prediction). We present the development of the method in 1,455 post-infarction patients (Munich cohort) as well as its validation in 1,256 post-infarction patients (London and Oulu cohorts).
Since non-stationarities and 1/f-noise are ubiquitous in nature, the novel method may also be proposed for a variety of other applications.
We provide an empirical framework for assessing the distributional properties of daily speculative returns within the context of the continuous-time modeling paradigm traditionally used in asset pricing finance. Our approach builds directly on recently developed realized variation measures and non-parametric jump detection statistics constructed from high-frequency intraday data. A sequence of relatively simple-to-implement moment-based tests involving various transforms of the daily returns speaks directly to the import of different features of the underlying continuous-time processes that might have generated the data. As such, the tests may serve as a useful diagnostic tool in the specification of empirically more realistic asset pricing models. Our results are also directly related to the popular mixture-of-distributions hypothesis and the role of the corresponding latent information arrival process. On applying our sequential test procedure to the thirty individual stocks in the Dow Jones Industrial Average index, the data suggest that it is important to allow for time-varying diffusive volatility, jumps, and leverage effects in order to satisfactorily describe the daily stock price dynamics. At a broader level, the empirical results also illustrate how the realized variation measures and high-frequency sampling schemes may be used in eliciting important distributional features and asset pricing implications more generally.
This is joint work with Torben G. Andersen, Tim Bollerslev, Per H. Frederiksen and Morten O. Nielsen.
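As a rough illustration of the realized variation measures mentioned above, the following Python sketch computes realized variance and bipower variation from one day of intraday returns. The abstract does not say which jump statistic is used. Comparing realized variance with bipower variation is one standard choice and is an assumption here, as is the helper name `jump_share`.

```python
import math


def realized_variance(returns):
    """Sum of squared intraday returns."""
    return sum(r * r for r in returns)


def bipower_variation(returns):
    """(pi / 2) * sum of |r_i| * |r_{i-1}|; robust to rare large jumps."""
    pairs = zip(returns[1:], returns[:-1])
    return (math.pi / 2.0) * sum(abs(a) * abs(b) for a, b in pairs)


def jump_share(returns):
    """Fraction of realized variance not explained by bipower variation.

    Hypothetical helper: truncated at zero, with no formal test attached.
    """
    rv = realized_variance(returns)
    if rv == 0:
        return 0.0
    return max(rv - bipower_variation(returns), 0.0) / rv
```

A day with one large return among many small ones yields a jump share close to one, whereas a day of evenly sized returns yields a share of zero.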
Motivated by flexible regression and classification applications, we propose a new class of priors for collections of dependent random probability measures indexed by predictors. The priors are formulated as adaptive kernel-weighted mixtures of Dirichlet processes (DP). In particular, an unknown distribution at an arbitrary location in the predictor space can be formulated as a mixture of DP basis distributions placed at random locations. The weights depend on probability masses on the different bases and on the distance from the basis location. This structure is shown to have a number of useful theoretical properties. The practical utility of the approach is illustrated through application to a density regression problem in which one wants to study how the conditional density of a response variable changes as multiple predictors change. An efficient MCMC algorithm is developed for posterior computation relying on retrospective sampling. The methods are illustrated using simulated data and epidemiologic examples.
(This is joint work with Ju-Hyun Park.)
P-splines combine a basis of (many) identical B-splines with a difference penalty. Intentionally the basis is too flexible, but smoothness is tuned with the penalty. This approach can be easily generalized to two dimensions, using tensor products of B-splines and penalties in two directions. Very fast algorithms for weighted spatial P-spline smoothing are available when the data are on a grid.
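The one-dimensional construction described above (a rich B-spline basis plus a difference penalty on the coefficients) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: uniform cubic B-splines, a second-order difference penalty by default, and a plain Gaussian elimination solver. The function names are hypothetical and none of this is the fast grid algorithm mentioned in the abstract.

```python
def bspline_row(x, xmin, xmax, nseg):
    """Values of the nseg + 3 uniform cubic B-splines at x."""
    h = (xmax - xmin) / nseg
    j = min(int((x - xmin) / h), nseg - 1)   # segment containing x
    t = (x - xmin) / h - j                   # local coordinate in [0, 1]
    row = [0.0] * (nseg + 3)
    row[j] = (1 - t) ** 3 / 6
    row[j + 1] = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
    row[j + 2] = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6
    row[j + 3] = t ** 3 / 6
    return row


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def pspline_fit(xs, ys, nseg=10, lam=1.0, order=2):
    """Minimise |y - B a|^2 + lam * |D a|^2 and return (a, fitted values)."""
    xmin, xmax = min(xs), max(xs)
    n = nseg + 3
    B = [bspline_row(x, xmin, xmax, nseg) for x in xs]
    # Difference matrix D: difference the rows of the identity `order` times.
    D = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(order):
        D = [[D[i + 1][j] - D[i][j] for j in range(n)]
             for i in range(len(D) - 1)]
    lhs = [[sum(b[i] * b[j] for b in B) + lam * sum(d[i] * d[j] for d in D)
            for j in range(n)] for i in range(n)]
    rhs = [sum(b[i] * y for b, y in zip(B, ys)) for i in range(n)]
    coef = solve(lhs, rhs)
    fitted = [sum(c * v for c, v in zip(coef, b)) for b in B]
    return coef, fitted
```

With a second-order penalty, increasing `lam` pulls the fit towards a straight line, so exactly linear data are reproduced for any value of `lam`.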
In two dimensions, there may be barriers or gaps, real or artificial, between parts of the domain, as Tim Ramsay (JRSS-B, 2001) has shown. One of his examples is a U-shaped domain where one leg of the U slopes up and the other leg down. Without special precautions, a smoother would happily bridge the gap, drawing the inner edges of the smoothed legs towards each other. To prevent such undesirable effects, Ramsay proposed the use of finite elements, after detailed triangulation of the domain.
It would be attractive if P-splines could still be used on such "difficult domains". I present three ways to approach this problem. 1) Eliminate the penalties in well-chosen places with a weighting scheme. 2) Transform the domain, using mathematical insight, to move parts near gaps far away from each other. This may not be easy for complicated domains. 3) Use the Schwarz-Christoffel transform (numerical conformal mapping) to reshape (in principle) arbitrary domains into rectangles.
We will discuss the use of differential equation models for the analysis of two types of space-time data. In one case we look at soil moisture observed at a collection of locations in two-hour increments over the course of several months. Soil moisture is a complex process that is driven by inputs of precipitation along with transpiration and drainage. Infinitesimal change in soil moisture can be expressed using a differential equation involving forms for transpiration and drainage as a function of soil moisture. In a second case we consider space-time point patterns driven by a latent space-time intensity surface which is a realization of a stochastic process; that is, our model is an example of a Cox process. The intensity surface is formulated as a growth process characterized through a stochastic differential equation. Our motivating example is to understand annual urban growth through single-family home construction.
We show how both of these examples can be handled using hierarchical modelling specifications. In each case, we discretize time, replacing integrals by sums. However, each introduces further computational wrinkles. The former works with empirically specified transpiration and drainage functions while the latter treats a very large number of points (roughly 12,000 houses). Results of the analyses will be presented.
We investigate the use of wavelets as basis functions for non-parametric regression within the field of diffusion tensor magnetic resonance imaging (DT-MRI). The focus is on a unified approach for estimation, regularization and interpolation of the diffusion tensor from recorded human brain data. For this purpose the elements of the local diffusion tensors, which can be seen as 3d covariance matrices, are modelled jointly as spatially varying coefficient surfaces. We present the necessary transformations of these surfaces, enabling the independent application of 3d wavelet transforms. Different thresholding mechanisms are discussed by means of a simulation study. A basic introduction to wavelet theory is provided.
Hidden Markov models (HMMs) provide a powerful tool widely used in bioinformatics, and they have been successfully applied to the segmentation of DNA sequences. Here, the objective is to locate, within individual DNA sequences, homogeneous segments that are compositionally different from the rest of the sequence. The hidden states represent the homogeneous segments to be detected, which are characterized by their distribution of nucleotides, or by their first-order Markovian transition probabilities between nucleotides. This method allows, in principle, an explorative search for "interesting" and potentially functionally important regions in a DNA sequence.
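The basic segmentation idea can be illustrated with a small Viterbi decoder in Python. This is a sketch only: it assumes hidden states that differ in their nucleotide distributions (the simpler of the two variants mentioned above), and all state names and probabilities in the usage note are made up for illustration.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Most probable hidden state path for seq, computed in log space."""
    logp = [{s: math.log(start[s]) + math.log(emit[s][seq[0]])
             for s in states}]
    back = []
    for ch in seq[1:]:
        prev = logp[-1]
        cur, ptr = {}, {}
        for s in states:
            # Best predecessor state for being in state s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            cur[s] = (prev[best] + math.log(trans[best][s])
                      + math.log(emit[s][ch]))
            ptr[s] = best
        logp.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: logp[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

For example, with two hypothetical states "AT" and "GC", a self-transition probability of 0.9, and emission probabilities of 0.4 for the favoured nucleotides and 0.1 for the others, the sequence "ATATATATGCGCGCGC" is split into an AT-rich segment followed by a GC-rich segment.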
However, a shortcoming of this approach is that it does not take the evolutionary context of a DNA sequence into account. Modern approaches to bioinformatics show an increasing interest in this context. This is based on the rationale that biological systems have not been designed, but have evolved, and that integrating this fact into the modelling framework is likely to increase the accuracy or biological consistency of the results.
In my talk, I will discuss an extended phylogenetic factorial HMM applied to alignments of homologous DNA sequences, which aims to approach the sequence segmentation problem within a phylogenetic context. The approach is based on combining two types of probabilistic models: a phylogenetic tree representing the vertical relationships between the sequences, and various hidden Markov chains representing horizontal dependencies between different sites in the alignment. One chain of hidden states is associated with the topology of the phylogenetic tree, while two parallel chains are associated with the overall probability of nucleotide substitutions and changes in the nature of the mutation processes. Inference is carried out within the Bayesian paradigm, using a stochastic version of dynamic programming, Gibbs sampling, and reversible jump Markov chain Monte Carlo. The talk will conclude with a discussion of various applications of the presented method, including the detection of recombination in HIV-1, finding evidence for gene conversion in crop plants, and monitoring rate heterogeneity along a bacterial DNA sequence alignment.
Penalized spline smoothing may be seen as one of the 'en vogue' smoothing methods of these days. The method was originally suggested by O'Sullivan (SIAM, 1986), but it was Eilers and Marx (Stat. Sci., 1996) who made the procedure known under the name P-spline smoothing. The benefits of the routine became apparent when it was linked to (generalized) linear mixed models, as convincingly exposed in the book by Ruppert, Wand & Carroll (2003, Cambridge Uni Press). In this case, the spline coefficient is treated as a priori normally distributed so that the estimates result from posterior prediction.
The generalized linear mixed model formulation uncovers a strong connection to Bayesian models, and in fact, a penalized spline smoother can be easily formulated using the Bayesian paradigm. In the case of a non-normal response, analytic posterior estimates are no longer available, simply because the integrals in the marginal likelihood are not analytically tractable. To circumvent heavy numerical routines, the use of a Laplace approximation seems plausible, as also suggested in Breslow & Clayton (1993, JASA) as Penalized Quasi Likelihood (PQL). Nonetheless, PQL was brought into disrepute, as was the Laplace approach in the Bayesian field.
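For orientation, the generic form of the Laplace approximation to an integral of this kind is sketched below. The notation is an assumption and is not taken from the talk: b denotes the q-dimensional vector of spline coefficients and h(b) the log of the integrand, i.e. the conditional log-likelihood plus the log prior density of b.

```latex
\int_{\mathbb{R}^q} \exp\{h(b)\}\,\mathrm{d}b
  \;\approx\; (2\pi)^{q/2}\,\bigl|{-H(\hat b)}\bigr|^{-1/2}\,\exp\{h(\hat b)\},
\qquad
\hat b = \arg\max_b h(b),
\quad
H(\hat b) = \left.\frac{\partial^2 h(b)}{\partial b\,\partial b^{\top}}\right|_{b=\hat b}.
```

The approximation replaces h by its second-order Taylor expansion around the mode, so the integral reduces to a Gaussian one and only a maximisation and a determinant are needed.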
In the talk, we revive the Laplace idea and show that it guarantees consistent and well-behaved estimates while keeping the computation small. This is demonstrated in three aspects. First, we discuss locally adaptive smoothing using the Laplace idea. Second, we investigate asymptotic results showing the Laplace approximation to work. Finally, we show the whole concept in complex duration time modeling.
In this paper we study the socio-economic and spatial determinants of sex differences in childhood undernutrition in India. We apply a geo-additive semiparametric Bayesian modeling approach to micro data from the 1998/99 National Family Health Survey in India. Among the most important findings is that girls fare worse in situations where there is intense competition for household resources. Also, with our approach we are able to explain a significant share of the pronounced North-South gradient in undernutrition (with the North having much higher rates than the South for both sexes). But even after accounting for our covariates, girls in South India are significantly better nourished than elsewhere.
Multi-state models provide a unified framework for the description of time-continuous stochastic processes with discrete state space. One particular example is the class of Markov processes, which can be characterised by a set of time-constant transition intensities between the states. In this talk, we will extend such parametric approaches to semiparametric models with flexible transition intensities based on Bayesian versions of penalised splines. In particular, the transition intensities will be modelled as a function of time and can further be related to parametric as well as nonparametric covariate effects. Covariates with time-varying effects and frailty terms can be included in addition. Inference may be conducted either fully Bayesian using Markov chain Monte Carlo simulation techniques or empirically Bayesian based on a mixed model representation. A counting process formulation of semiparametric multi-state models provides a formula for the likelihood and also forms the basis for model validation via standardised martingale residual processes. As an application we will consider data on the process of human sleep with a discrete set of possible sleep states such as REM and Non-REM phases. In this case simple parametric approaches are clearly inappropriate since the dynamics underlying human sleep change strongly throughout the night. In addition, the transition intensities will be related to covariates such as nocturnal secretion of certain hormones.
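To make the parametric starting point concrete, the sketch below simulates a Markov multi-state process with time-constant transition intensities. It illustrates only the simple baseline model, not the semiparametric extension of the talk. The function names, the state labels and the rates in the usage note are hypothetical.

```python
import random


def simulate_ctmc(rates, start, horizon, seed=0):
    """Simulate a path as a list of (entry_time, state) pairs.

    rates[i][j] is the constant intensity of a transition from i to j.
    """
    rng = random.Random(seed)
    t, state = 0.0, start
    path = [(0.0, start)]
    while True:
        out = rates[state]
        total = sum(out.values())
        if total == 0:            # absorbing state
            break
        t += rng.expovariate(total)   # exponential holding time
        if t >= horizon:
            break
        # Choose the next state with probability proportional to its rate.
        u, acc = rng.random() * total, 0.0
        for nxt, q in out.items():
            acc += q
            if u < acc:
                break
        state = nxt
        path.append((t, state))
    return path


def occupancy(path, horizon):
    """Fraction of [0, horizon] spent in each visited state."""
    time_in = {}
    ends = path[1:] + [(horizon, None)]
    for (t0, s), (t1, _) in zip(path, ends):
        time_in[s] = time_in.get(s, 0.0) + (t1 - t0)
    return {s: v / horizon for s, v in time_in.items()}
```

With two states and rates `{"REM": {"NonREM": 1.0}, "NonREM": {"REM": 3.0}}`, the long-run share of time spent in "REM" should be close to 3 / (1 + 3) = 0.75. Constant rates of this kind are exactly what the abstract argues is too rigid for sleep data.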
Ratings play a prominent role in the credit industry. Their key purpose is to provide a simple qualitative classification of the solidity, solvency and prospects of a debt issuer. The importance of credit ratings has increased significantly with the introduction of the new regulatory framework known as Basel II. In this framework, ratings can be used directly to determine the size of a bank's capital buffer. As capital constitutes a relatively costly source of funding for a bank, ratings and rating changes directly affect the banks' willingness to grant credit to individual firms. Moreover, if ratings and thus capital requirements co-vary with the business cycle, economic fluctuations may be exacerbated by capital becoming increasingly scarce in adverse economic conditions, precisely when it is needed most. A good understanding of the dynamic behavior of ratings and rating changes is therefore important from both a regulatory and a financial industry perspective. In this talk, different models are discussed for different rating data sets. The motivation of all models is to extract the credit cycle directly from the time series at hand. The first class of models that we consider is a set of standard Gaussian time series models. Since the model contains latent factors (credit risk factors or credit cycles), the model needs to be represented as a state space model. The second class of models is a multivariate panel time series model for binomial data that represents the numbers of upgrades and downgrades in predefined periods. The non-Gaussian feature of such a data set complicates inference. Monte Carlo maximum likelihood methods are used for this purpose. The third class of models is new and designed for micro-data on rating transitions. The main novelty of this model class is that rating transitions are modeled continuously in event time rather than calendar time and are subject to common dynamic latent factors.
Although the model is relatively complex, we show that it can be estimated efficiently using modern importance sampling techniques for non-Gaussian models in state space form.
GARCH option pricing models have the advantage of a well-established econometric foundation. However, multiple states need to be introduced as single state GARCH and even Lévy processes are unable to explain the term structure of the moments of financial data. We show that the continuous time version of the Markov switching GARCH(1,1) process is a stochastic model where the volatility follows a switching process. The continuous time switching GARCH model derived in this paper, where the variance process jumps between two or more GARCH volatility states, promises to capture the features of implied volatilities in an intuitive and tractable framework.
In 2001, Davies and Kovac conjectured that the so-called taut-string algorithm of Hartigan and Hartigan yields an estimate with the lowest number of modes in a Kolmogoroff tube around the e.c.d.f. We can show that this is true.
A related optimisation problem is to obtain a histogram with the smallest number of (unequal length) bins in such a tube. Here, the taut string does not yield an optimal solution. However, it can be used as a first step in determining a solution of this second optimisation problem.
We consider nonparametric and semiparametric regression estimation for longitudinal/clustered data and multi-dimensional data. The first half of the talk focuses on nonparametric regression estimation for clustered/longitudinal data using kernel and spline methods. We show that, unlike for independent data, common kernels and splines are not asymptotically equivalent for clustered/longitudinal data. Conventional kernel extensions of GEEs fail to account for the within-cluster correlation, while spline methods are able to account for this correlation. We identify an asymptotically equivalent kernel for the smoothing spline for clustered/longitudinal data. The second half of the talk considers semiparametric regression models for multi-dimensional data, where an outcome depends on some covariates parametrically and on some multi-dimensional gene expressions within a pathway nonparametrically. Estimation proceeds with kernel machine techniques that are used in machine learning. We show that there is a close connection between the least-squares kernel machine and a linear mixed effects model, and estimation can proceed within a unified linear mixed model framework. Variable selection methods within this framework are discussed. The results are illustrated using simulation studies and a prostate cancer microarray data example.
Starting from the integral representation of fractional Brownian motion (FBM) we introduce the class of fractional Lévy processes (FLP) by replacing the Brownian motion by a general Lévy process with no Brownian component. We study the second order properties and sample path properties and introduce an integration theory for integrals with respect to FLPs. In particular we are interested in moving average (MA) processes with the long memory property. Our main result states that the Lévy-driven MA process with fractionally integrated kernel coincides with the MA process with the corresponding (not fractionally integrated) kernel and driven by the corresponding FLP. This result proves useful for the simulation of long memory MA processes. As an example we consider (fractionally integrated) CARMA processes.
Count data often exhibit overdispersion and/or require an adjustment for zero outcomes with respect to a Poisson model. Zero-inflated Poisson (ZIP) and zero-inflated generalized Poisson (ZIGP) regression models are found to be useful classes to model such data. The talk will focus on an extension of ZIGP regression models for count data by allowing for regression on zero-inflation and overdispersion parameters. The model parameters are fitted by maximum likelihood (ML). Asymptotic normality of the ML estimates in this non-exponential family setting is proven. These extended ZIGP models are applied to data dealing with outsourcing of patent filing processes. A model comparison using AIC statistics and Vuong tests (see Vuong (1989)) is carried out. For the given data, our extended ZIGP regression model will prove to be superior to GP and ZIP models and even to ZIGP models with constant overall dispersion and zero-inflation parameters, demonstrating the usefulness of our proposed extensions. The talk is based on joint work with Claudia Czado and Vinzenz Erhardt.
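As a simple point of reference, the sketch below implements the zero-inflated Poisson (ZIP) probability function and a log-likelihood in which both the Poisson mean and the zero-inflation probability depend on covariates. It deliberately covers only the ZIP case, not the generalized Poisson extension of the talk. The log and logit links, the function names and the argument layout are assumptions made for illustration.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a ZIP model with Poisson mean lam and zero inflation pi."""
    pois = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return (pi if y == 0 else 0.0) + (1.0 - pi) * pois


def zip_regression_loglik(counts, X, Z, beta, gamma):
    """Log-likelihood with lam_i = exp(x_i' beta), pi_i = logistic(z_i' gamma).

    Assumption: log link for the mean, logit link for the zero inflation.
    """
    total = 0.0
    for y, x, z in zip(counts, X, Z):
        lam = math.exp(sum(a * b for a, b in zip(x, beta)))
        pi = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(z, gamma))))
        total += math.log(zip_pmf(y, lam, pi))
    return total
```

Under this model the mean is (1 - pi) * lam and the variance is (1 - pi) * lam * (1 + pi * lam), so zero inflation alone already induces overdispersion relative to the Poisson model.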
This talk is about estimation procedures for the COGARCH(1,1) model, a continuous-time GARCH model introduced by Klueppelberg, Lindner and Maller (2004). Like the discrete-time GARCH process, it has only one source of uncertainty: a driving Lévy process. The COGARCH model is of particular interest for finance, since it can capture well-known facts of volatility such as heavy tails or clustering on high levels. After summarizing some properties of the COGARCH(1,1) model, we discuss three different ways to estimate the parameters. The first strategy is estimation by the method of moments. The second can be considered a QML method and uses an approximation of the COGARCH process. The third approach is a Markov chain Monte Carlo estimation procedure. We compare these three methods via simulation studies. Finally, we illustrate the QML method in an application to data from the ASX200.
Reference: Klueppelberg, C., Lindner, A., Maller, R. (2004). A continuous time GARCH process driven by a Levy process: stationarity and second order behaviour. J. Appl. Prob. 41, no. 3, 601-622.
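For readers less familiar with the discrete-time counterpart, the sketch below simulates a standard GARCH(1,1) process. It is not the COGARCH model itself and not one of the estimation procedures discussed in the talk. The Gaussian innovations, the function name and the parameter names are assumptions. Note that a single innovation sequence drives both the returns and the variance, which is the "one source of uncertainty" feature mentioned above.

```python
import math
import random


def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate n returns and conditional variances of a GARCH(1,1) process.

    sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]
    eps[t]    = sqrt(sigma2[t]) * z[t],  z[t] standard normal (assumed)
    """
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha, beta >= 0, alpha + beta < 1")
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)   # start at unconditional variance
    returns, variances = [], []
    for _ in range(n):
        eps = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        returns.append(eps)
        variances.append(sigma2)
        # The same shock eps feeds the next conditional variance.
        sigma2 = omega + alpha * eps * eps + beta * sigma2
    return returns, variances
```

Large shocks raise the next conditional variance, which produces the volatility clustering that both the discrete-time and the continuous-time models aim to capture.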
Modeling the dynamics of stock prices is central to asset pricing and risk management. Accordingly, a large amount of research has been conducted in order to find an adequate description of the dynamics of asset prices. Within the continuous-time stochastic volatility literature, however, most of the empirical studies based on daily or coarser frequency data do not allow for a very clear distinction between pure diffusive multi-factor stochastic volatility models and lower-order models with jumps. In view of the often large intraday price movements we therefore consider high frequency data. Since the direct modeling of high frequency returns is complicated by intraday volatility patterns and market microstructure effects, we make use of realized variation measures summarizing the information contained in the high frequency returns. In particular, we adopt the highly accurate realized variation model of Bollerslev, Kretschmer, Pigorsch and Tauchen (2005) to estimate different continuous-time stochastic volatility models using the general scientific modeling method recently proposed by Gallant and McCulloch (2005). The Bayesian characteristic of this estimation method allows us to assess in detail the adequacy and empirical properties of our various models.
(This is joint work with Tim Bollerslev, Ron Gallant, Uta Pigorsch and George Tauchen.)
Common approaches to the fitting of additive mixed models are based on the representation of additive models as mixed models. We propose an alternative approach based on boosting techniques. Boosting originates in the machine learning community, where it has been developed as a technique to improve classification procedures by combining estimates with reweighted observations. In linear mixed models as well as in additive mixed models, the advantage of the proposed componentwise boosting technique is that it is suitable for high-dimensional settings where many influence variables are present. It allows models with many covariates to be fitted with implicit selection of relevant variables.
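The idea of componentwise boosting with implicit variable selection can be illustrated for an ordinary linear model. The sketch below uses squared-error loss and simple linear base learners. It omits the random effects and smooth terms of the models in the talk, and the function name, the step length `nu` and the fixed number of steps are assumptions.

```python
def componentwise_boost(X, y, steps=100, nu=0.1):
    """Componentwise L2 boosting with one linear base learner per covariate.

    Returns (intercept, coef) on the original covariate scale.
    Covariates that are never selected keep a coefficient of exactly 0.
    """
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    means = [sum(c) / n for c in cols]
    cols = [[v - m for v in c] for c, m in zip(cols, means)]   # centre
    ss = [sum(v * v for v in c) for c in cols]
    ybar = sum(y) / n
    resid = [yi - ybar for yi in y]
    coef = [0.0] * p
    for _ in range(steps):
        # Least-squares slope of the current residuals on each covariate.
        slopes = [sum(v * r for v, r in zip(c, resid)) / s if s > 0 else 0.0
                  for c, s in zip(cols, ss)]
        # Select the covariate giving the largest drop in residual sum of squares.
        j = max(range(p), key=lambda k: slopes[k] ** 2 * ss[k])
        coef[j] += nu * slopes[j]
        resid = [r - nu * slopes[j] * v for r, v in zip(resid, cols[j])]
    intercept = ybar - sum(b * m for b, m in zip(coef, means))
    return intercept, coef
```

Only one coefficient is updated per step, so stopping early leaves uninformative covariates at zero. In practice the number of steps would be chosen by a criterion such as cross-validation rather than fixed in advance.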
The talk will survey recent work on Monte Carlo methodology designed for simulation and inference for diffusion models which involves no error due to time discretisation. The methods include techniques for Monte Carlo maximum likelihood estimation, a Monte Carlo EM approach, and a fully Bayesian analysis for completely but discretely observed processes. Filtering methodology for partially observed systems will also be briefly described. As well as "exactness", avoiding the need for discretisation schemes leads to substantial computational advantages.
In the process of developing risk prediction models, various steps of model building and model selection are involved. If this process is not adequately controlled, overfitting may result in serious overoptimism, leading to potentially erroneous conclusions. For time-to-event data, we will introduce suitable measures of prediction error for assessing the performance of a risk prediction model. Resampling methods will be used to adjust the estimates of prediction error and to detect overfitting and the resulting overoptimism; in particular, we generalize the well-known bootstrap cross-validation and .632+ estimators of the prediction error for application to time-to-event data. The concepts will be illustrated by means of data from prognostic studies in oncology. Finally, we will explore to what extent the methodology can be used in situations characterized by a large number of potential predictor variables.
(This is joint work with Thomas Gerds.)
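For reference, the .632 and .632+ estimators in their standard form for uncensored data combine the apparent error with the bootstrap cross-validation error as sketched below. The notation is an assumption: err-bar is the apparent (resubstitution) error, Err-hat(bcv) the bootstrap cross-validation error, and gamma-hat the no-information error. The abstract does not specify how these quantities are defined for time-to-event data.

```latex
\widehat{\mathrm{Err}}^{(.632)}
  = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(\mathrm{bcv})},
\qquad
\widehat{\mathrm{Err}}^{(.632+)}
  = (1-\hat w)\,\overline{\mathrm{err}} + \hat w\,\widehat{\mathrm{Err}}^{(\mathrm{bcv})},
\qquad
\hat w = \frac{0.632}{1 - 0.368\,\hat R},
\quad
\hat R = \frac{\widehat{\mathrm{Err}}^{(\mathrm{bcv})} - \overline{\mathrm{err}}}
              {\hat\gamma - \overline{\mathrm{err}}}.
```

The relative overfitting rate R-hat lies between 0 and 1, so the weight on the bootstrap cross-validation error grows from 0.632 towards 1 as overfitting becomes more severe.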
Variable importance measures for random forests are receiving increasing attention as a means of variable selection in many classification tasks, e.g. in statistical genomics to select a subset of genes or genetic markers relevant for the prediction of a certain disease. We show that the random forest variable importance measure is a sensible means for variable selection in many applications, but is not reliable in situations where potential predictor variables vary in their scale level or their number of categories, e.g. when both genetic and environmental variables are considered as potential predictors. Simulation studies are presented, illustrating that, when the random forest variable importance measure is used in such data situations, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other. An alternative implementation of random forests that provides unbiased variable selection in the individual classification trees is presented. When this method is applied with subsampling without replacement, the resulting variable importance measure can be used reliably for variable selection even in data situations where the potential predictor variables vary in their scale level or their number of categories.
(This is joint work with Achim Zeileis, Anne-Laure Boulesteix and Torsten Hothorn.)