Statistical Learning, Data Mining & Knowledge Discovery


Modelling of unobserved heterogeneity in marketing research

In light of increasing customer demands and the drawbacks of undifferentiated marketing actions, segmentation and targeting efforts have gained importance in recent years. However, segmentation and targeting require the identification of homogeneous customer segments that exhibit a similar response structure with respect to attitudinal or behavioral variables. A vast number of traditional clustering algorithms exist, which have recently been complemented by model-based procedures. These methods allow for simultaneous estimation of model parameters and segment membership and have been shown to perform favorably in simulation studies and various application contexts. This subproject aims at advancing existing and developing novel clustering procedures for marketing and tourism research. In particular, novel cluster analysis procedures should be developed that account for variable importance in the course of the clustering process. Furthermore, the development of suitable visualization techniques could enhance the applicability and diffusion of existing clustering procedures.
In the context of latent class analysis procedures, the project's main focus is on finite mixtures of generalized linear regressions. Furthermore, the identification and treatment of unobserved heterogeneity in partial least squares (PLS) path modeling using novel response-based segmentation techniques is researched. This includes the evaluation of the effects of unobserved heterogeneity in marketing models and applications.
Since clustering algorithms are commonly applied to survey data, the assessment of data quality is likewise of research interest. For example, past research has shown that individuals differ in how they use ordinal scales, which are common in surveys. Accounting for this type of scale usage heterogeneity usually involves sequential rescaling procedures, which may lead to considerable bias when there is a pronounced interaction between scale usage and segment affiliation. Therefore, research effort is geared towards the development of procedures that can account for these influencing factors simultaneously. Lastly, the project aims at evaluating the performance of single-item scales in different contexts.
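As a rough illustration of what simultaneous estimation of model parameters and segment membership can look like, the following is a minimal sketch of an EM algorithm for a finite mixture of Gaussian linear regressions. It is not the project's method: the function name `fit_mixture_of_regressions`, the Gaussian error assumption, the random initialisation and the simulated two-segment data are all assumptions made for illustration only.

```python
import numpy as np


def fit_mixture_of_regressions(X, y, n_segments=2, n_iter=200, seed=0):
    """EM for a finite mixture of Gaussian linear regressions (illustrative)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Start from random soft assignments of observations to segments.
    resp = rng.dirichlet(np.ones(n_segments), size=n)
    betas = np.zeros((n_segments, p))
    sigmas = np.ones(n_segments)
    weights = np.full(n_segments, 1.0 / n_segments)

    for _ in range(n_iter):
        # M-step: weighted least squares, error scale and size per segment.
        for k in range(n_segments):
            w = resp[:, k]
            Xw = X * w[:, None]
            # Tiny ridge term keeps the normal equations solvable.
            betas[k] = np.linalg.solve(X.T @ Xw + 1e-8 * np.eye(p), Xw.T @ y)
            resid = y - X @ betas[k]
            sigmas[k] = np.sqrt((w * resid**2).sum() / w.sum()) + 1e-6
            weights[k] = w.mean()

        # E-step: posterior probability of each segment for each observation.
        log_dens = np.empty((n, n_segments))
        for k in range(n_segments):
            resid = y - X @ betas[k]
            log_dens[:, k] = (
                np.log(weights[k])
                - np.log(sigmas[k])
                - 0.5 * np.log(2.0 * np.pi)
                - 0.5 * (resid / sigmas[k]) ** 2
            )
        log_dens -= log_dens.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)

    return betas, sigmas, weights, resp


# Simulated example (hypothetical data): two segments with opposite slopes.
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(1.0, 4.0, size=n)
segment = rng.integers(0, 2, size=n)
y = np.where(segment == 0, 2.0 * x, -2.0 * x) + rng.normal(0.0, 0.3, size=n)
X = np.column_stack([np.ones(n), x])  # intercept and one predictor

betas, sigmas, weights, resp = fit_mixture_of_regressions(X, y, n_segments=2)
print("segment coefficients (intercept, slope):")
print(betas)
print("segment sizes:", weights)
```

The matrix `resp` holds the posterior segment memberships, so segment assignment and regression coefficients come out of the same fit instead of a cluster-then-regress sequence. Generalized linear mixtures would replace the weighted least-squares step with a weighted GLM fit.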
Leisch, Schwaiger, Sarstedt, Kriegel

Intangible Assets and Firm Value

Marketing has been challenged to provide evidence for the incremental value and long-term financial impact of marketing-driven intangible assets such as corporate reputation, brand value, or customer satisfaction. Hence, marketing has to make use of theories and methods that have their origins in, for instance, finance, controlling, econometrics, and time series analysis. As the availability of marketing data has improved in terms of units of analysis (product, brand, or corporation level) and temporal aggregation (e.g., daily, monthly, or yearly level), this trend is amplified and creates new challenges.

Against this backdrop, this project's main research focus is on the short- and long-term effects of marketing on financial performance, measured by, for instance, the level and volatility of cash flows, profitability, and market capitalization. Moreover, research efforts are geared towards asymmetric effects of shocks, interactions between types of intangible assets, the Lucas critique, and endogeneity of model variables. So far, our studies and collaborations have focused on corporate reputation as a strategic intangible asset.
Schwaiger, Sarstedt, Raithel

Identification of similarities and relatedness in patent data

Background, objectives and methods used:
The increasing importance of intellectual property (IP) rights creates huge challenges for users of patent and trademark systems. Searching for property rights that either might conflict with one's own applications or are relevant for business operations is becoming increasingly difficult. As a result, all users of IP rights systems have to bear high transaction costs.

The major objective of this project is to develop methods that allow the identification of patents with similar contents. The scientific literature offers many algorithms that compare textual information. However, in the field of patent data, no studies are available that compare those algorithms and draw conclusions about the advantages and disadvantages of each text processing methodology. Because patent language is highly structured (similar to scientific publications), applying proven methods in this area seems plausible. Beginning with simple keyword analysis, more advanced methods (eTBLAST; Errami et al. 2008) should be applied to identify relevant documents. The resulting measures of similarity can easily be evaluated against the information provided in Patent Search Reports, in which patent examiners explicitly name similar and relevant documents. In addition, further sets of similar or related documents can be formed from the patent family and the inventors of the focal patent. The basic objective of these examinations is to provide methodologies that can also be applied by non-experts (e.g., in SMEs or in science) to identify prior art. Moreover, those metrics should be used to analyze the intensity of competition from alternative technical approaches in a specific field of patenting.
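The simple keyword analysis mentioned as a starting point could look roughly like the sketch below, which ranks documents by the cosine similarity of their TF-IDF keyword vectors. TF-IDF weighting is an assumption chosen for illustration; the text does not say which keyword measure the project uses. The helper names and the three short example texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic keywords."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    counts_per_doc = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    # Smoothed inverse document frequency: rare terms get higher weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in counts.items()}
        for counts in counts_per_doc
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar(query_index, documents):
    """Rank all other documents by similarity to documents[query_index]."""
    vectors = tfidf_vectors(documents)
    scores = [
        (i, cosine(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical abstracts, invented for illustration only.
abstracts = [
    "A rotor blade for a wind turbine with a reinforced blade root",
    "Wind turbine rotor blade having a reinforced root section",
    "Pharmaceutical composition comprising an antibody",
]
ranking = rank_similar(0, abstracts)
for index, score in ranking:
    print(index, round(score, 3))
```

A ranking like this could then be compared with the documents that examiners cite in the search report for the focal patent, which is the evaluation idea described above.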
Harhoff, Natterer

Empirical identification of complementarities in economic decisions

Organizational complementarities have been a central issue in empirical research on organizational design over the last decade. A central concern when measuring complementarities is the potential of unobserved third factors to bias the results. First, potentially observable contingencies that may affect the strength and direction of organizational complementarities are often disregarded. For example, information technology and the decentralization of organizational decision rights have been found to be complementary, although information technology may also be complementary with the centralization of decision authority, contingent on a firm's strategy. Second, organizational elements could be found to be complementary although their co-variation actually results from unobservable third factors such as managerial preferences. The project pursues two strategies to identify organizational complementarities more reliably. First, employing an innovative telephone interview technique, we build datasets containing information on organizational complements and a large variety of contingencies that possibly affect these complementarities. Second, when testing for complementarities, we apply new empirical approaches that try to mitigate the bias from remaining unobserved factors.
Kretschmer, Dr. Ferdinand Mahr

Complex Search in large Databases of digital Images

Similarity search is an important query type for image retrieval that is used in the medical domain. However, there is usually no generally valid method for computing the similarity between two images. Instead, the degree of similarity may depend on the given query image, the application considered, or even the user posing the query. Thus, to offer a satisfying result, a search engine has to adapt the underlying similarity measure to these contextual influences. In this project, we plan to develop new methods for adaptable similarity search that can dynamically adapt a similarity measure to various types of query contexts. A major challenge of our adaptable similarity model is to define a context model. The goal of this model is to analyze the current user input and map it to a suitable parameterization of the above similarity model. Statistical concepts play an important role in the intended system. First of all, the current application context cannot be defined in a straightforward way. Instead, it is necessary to derive the context from a sample of previously collected user feedback. Furthermore, statistical methods are needed to handle the uncertainty implied by the underlying feature descriptors. Since identical feature descriptors may be derived from semantically dissimilar images, the feature-based similarity should be considered a likelihood rather than a fact. So far, a novel method for similarity search in medical image data has been realized within the project. The method analyzes computed tomography scans and will be integrated into a novel computer-aided decision and support system for clinicians.
Kriegel, Schubert, Graf

Bayesian Regularisation for Regression and Latent Variable Models

In many empirical studies and diverse fields of application, data collected or available in scientific or public data sources are increasingly high-dimensional and complex in structure. Bayesian regularisation makes it possible to approach the resulting methodological challenges in a unified framework through appropriate priors that enforce shrinkage, selection and smoothing in a broad class of structured additive regression and latent variable models. The general aim of the project is the development, implementation and application of models and inferential methods, based on funding from the DFG and in cooperation with partners from substantive sciences within the Center and beyond, such as the Munich Center of Health and the LMU Center of International Health.
Fahrmeir

Global Energy Governance: Bilateral Trade and the Diffusion of International Organizations

The overarching research question defining this project is: what are the main drivers of the proliferation of international organizations (henceforth IOs) that regulate the energy sector? The vital role of oil and gas in the modern global economy makes the understanding of energy governance one of the most important issues in both economics and political science. Our core argument is that countries join IOs regulating the energy sector in response to the membership previously gained by trade partners. Thus, we argue that by analyzing the network of bilateral trade flows among states it is possible to predict the diffusion of memberships in such IOs. Furthermore, we expect this diffusion effect to be particularly strong if two countries have large trade flows in the energy sector, e.g. mineral fuels, or in energy-related sectors, e.g. chemicals. Finally, in explaining the sequence in which IOs were founded, our study also takes competing arguments into account. In particular, we control for the possibility that their diffusion is driven by emulation, i.e. states learn specific policies from other states that have similar institutions or economic features, and by security concerns. Building our theoretical framework upon the policy diffusion literature and using a newly compiled dataset, our study employs cutting-edge tools of both network analysis and spatial econometrics. To the best of our knowledge, few previous studies have combined these two methods, so this is a further important contribution of this project. In sum, we provide the most detailed empirical analysis in the field on the relationship between economic integration and global energy governance.
Thurner, Baccini, M.A. Manuel Munz

Analysis of Chronobiological Timeseries

The daily structure of organisms, from bacteria to humans, is controlled by an endogenous (circadian) clock. Understanding the underlying mechanisms of this fundamental biological function makes it possible, for example, to optimise medical diagnostics and therapies or to better align the biological and social times of individuals (e.g., in shift work). Time series of individual daily rhythms, which can be measured at levels ranging from gene expression to behaviour, as well as the interactions between different circadian time series, lend themselves readily to statistical and mathematical analysis and modelling. In this subproject, the structures of chronobiological time series data will be investigated with newly developed statistical and mathematical methods, specifically similarity search and data mining.

The results produced by this approach will be applied to the investigation of daily structures in different chronobiological types (genetically determined chronotypes whose circadian clocks embed themselves differently into the day, either earlier or later). The sources for these investigations are both detailed parameters of individual sleep-wake behaviour stored in large databases (MCTQ; n > 80,000) and measured time series (temperature, activity, etc.). Our aim is to investigate the influence of different effectors (season, latitude, work and free time, or shift work) on the daily structures of chronotypes. The results allow insights into how different genotypes of the circadian system adapt their phenotypes to different temporal changes (search for relevant sub-spaces). The analysis methods for chronobiological time series, developed by the Kriegel group, will be iteratively applied to the data produced by the Roenneberg group.
Roenneberg, Kriegel, Renz

Data Mining and Searching Uncertain Data

The problem of searching and mining in uncertain databases has become very popular in recent years. The increasing availability of novel data-collection devices makes it possible to accumulate large amounts of information at unprecedented rates and with unprecedented variability. On the other hand, the collected information is often noisy, incomplete or rendered uncertain due to anonymization. As a consequence, novel data models representing uncertain data have been developed and integrated into modern database systems. These representations call for new, more advanced definitions of and algorithms for similarity queries. As query evaluation becomes more complex, efficiency issues arise, calling for novel search methodologies that cope with the special nature of uncertain data. In this subproject, new probability-based algorithms for similarity search, cluster analysis and further data mining applications will be elaborated. An important goal of the subproject is the design and implementation of novel methods for probabilistic similarity queries on uncertain data that are both effective and scalable. In particular, we will study several distance-based query types for multi-dimensional data, such as probabilistic top-k queries (k-nearest neighbor queries, reverse k-nearest neighbor queries, ranking queries and inverse ranking queries), probabilistic distance-range queries, probabilistic join queries and probabilistic skyline queries. Another important item to be addressed within this subproject is the efficient management of uncertain data, in particular the development of appropriate index structures for probabilistic search in uncertain databases. The techniques required to index uncertain data mainly depend on the underlying uncertainty model. In this project we will investigate indexing methods suitable for the most prominent models, the tuple uncertainty model and the attribute uncertainty model, with priority given to the second one.
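To make the idea of a probabilistic similarity query concrete, here is a minimal sketch of a probabilistic nearest-neighbor query (the k = 1 case) under a simple attribute uncertainty model. It assumes that each object is described by a finite list of alternative positions with probabilities and that objects are mutually independent. This is a naive exact computation without any index structure, not one of the project's algorithms; the function name and the example data are hypothetical.

```python
import math


def nn_probabilities(query, objects):
    """Probability that each uncertain object is the nearest neighbor of query.

    objects maps a name to a list of (point, probability) alternatives whose
    probabilities sum to 1. Objects are assumed to be independent.
    """
    # Turn each object's position distribution into a distance distribution.
    distances = {
        name: [(math.dist(query, point), prob) for point, prob in alternatives]
        for name, alternatives in objects.items()
    }
    result = {}
    for name, alternatives in distances.items():
        total = 0.0
        for dist, prob in alternatives:
            # Probability that every other object lies strictly farther away.
            all_farther = 1.0
            for other, other_alternatives in distances.items():
                if other != name:
                    all_farther *= sum(
                        p for d, p in other_alternatives if d > dist
                    )
            total += prob * all_farther
        result[name] = total
    return result


# Hypothetical uncertain objects on a line, query at the origin.
uncertain_objects = {
    "a": [((1.0, 0.0), 0.5), ((3.0, 0.0), 0.5)],
    "b": [((2.0, 0.0), 1.0)],
    "c": [((4.0, 0.0), 1.0)],
}
probs = nn_probabilities((0.0, 0.0), uncertain_objects)
for name, p in probs.items():
    print(name, p)
```

Object "a" is the nearest neighbor only when it takes its closer position, so "a" and "b" each receive probability 0.5 and "c" receives 0. The cost of this direct evaluation grows quickly with the number of objects and alternatives, which is the kind of efficiency problem that index structures for uncertain data are meant to address.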
Kriegel, Renz