PhD Defense S. Yuan
Novel Clustering Methods For Complex Cluster Structures In Behavioral Sciences
- Location: Cobbenhagen building, Aula
- Supervisor: Prof. J.K. Vermunt
- Co-supervisors: Dr. K. Van Deun, Dr. K. De Roover
Large-scale data sets with a large number of variables are becoming increasingly available in behavioral research. Encompassing a wide range of measurements and indicators, they provide behavioral scientists with unprecedented opportunities to synthesize different pieces of information so that novel, and sometimes subtle, subgroups (also called clusters) of populations can be identified. The successful detection of clusters is of great practical significance for a wide range of social and behavioral research topics. For example, in treating depressed patients, the first step in generating personalized recommendations is to accurately link the patients to the many subtypes of depression. In the organizational context, it is highly problematic to assume that all leaders should follow the same developmental paths; in fact, tailoring training programs to the unique strengths of different leadership subgroups (e.g., the down-to-earth leaders and the excessively charismatic leaders) is always more effective than general developmental programs. When trying to understand the cognitive process underlying one’s voting behavior, once again, a one-size-fits-all approach likely produces erroneous descriptions. The broad social context as well as the surrounding environment in which a person grows up likely yields clusters of voters; only those belonging to the same cluster share a similar decision-making process for voting.
To provide behavioral researchers with the best tool for accurately recovering the clusters hidden in large, complex data sets, this dissertation developed new statistical models and computational tools and implemented these novel approaches in publicly accessible software. The novel methods developed here advance previous approaches by addressing three major challenges. First, as noise is ubiquitous in psychological measures, a considerable number of the variables collected may be completely irrelevant to the hidden clusters. These irrelevant variables have to be completely and automatically filtered out during data analysis. Second, when integrating variables from diverse data sources (for example, questionnaires, genetic information, GPS coordinates, and social media footprints), it is desirable to capture both the unique characteristics pertaining to each data source and the shared or connected characteristics across the many data sources. Third, when translating data analytics results into substantive conclusions so as to inform critical decisions (e.g., medical decisions and personnel selection), effective and accurate communication is vital yet not necessarily easy to achieve. The two most prominent difficulties are communicating the confidence and (un)certainty in the clusters recovered and visualizing the results through accessible graphs.
Using a variety of computer-simulated data and empirical behavioral data covering topics in clinical, social, personality, and organizational psychology, we were able to conclude that the various methods developed in the dissertation are more versatile, effective, and accurate in identifying subtle clusters in complex data sets, provide rich and unique insights for interpreting these clusters, and, thanks to the software developed alongside them, can be readily accessed without many technical barriers. These methods can therefore help behavioral researchers navigate an increasingly digitized world and recognize structure in massive amounts of information.