I co-lead the development of Candor (Conversation: A Naturalistic Dataset of Online Recordings), which is among the largest publicly available multimodal datasets of naturalistic conversation.

In the manuscript that accompanied the public release of the data, published in Science Advances, we explore the corpus by embracing three principles:

  1. Conversation is constructed around a highly cooperative system of turn-taking.
  2. Understanding the full complexity of conversation requires insights from a variety of disciplines that, although they examine the same phenomenon, often remain siloed in their research questions.
  3. Examining conversation computationally and at scale is an enduring challenge, but new technologies, particularly advances in machine learning, promise to unlock many aspects of conversation that were previously inaccessible.

Key Features

  • The full data download includes the raw audio and video files, along with hundreds of measures that capture everything from people’s overall enjoyment and their impressions of their conversation partner to millisecond-by-millisecond measures of turn-taking, emotion expression, and vocal prosody.
  • We qualitatively reviewed every conversation to ensure data quality and, in the process, identified conversations that were particularly high in rapport – often among conversation partners who needed to bridge significant demographic and sociocultural divides.
  • The corpus consists of Americans, aged 19-66, who talked during the year 2020, thus offering a unique lens into how the national discourse shifted in one of the most tumultuous years in recent memory, including the onset of a global pandemic and a hotly contested presidential election.

