GUM Corpus V12 - new documents and annotations

Amir.Zeldes@georgetown.edu
Wed, Mar 4, 2026 2:29 PM

(Apologies for cross-postings)

*** The GUM Corpus - Release 12.0.0 ***

*** Georgetown University Multilayer corpus ***

The Corpling Lab (https://gucorpling.org/corpling/) at Georgetown University is happy to announce the first release of series 12 of the Georgetown University Multilayer corpus (GUM V12.0.0):

https://gucorpling.org/gum/

New in this version:

  • New documents – the corpus now contains 291,056 tokens

  • Completely reworked GUMBridge annotation scheme for bridging anaphora (work led by Lauren Levine):

      • Manual re-annotation effort of the entire corpus

      • Much more densely and consistently annotated using new guidelines

      • 11 subtypes of bridging anaphora

      • Multiple concurrent bridging subtypes are now possible

GUM is an open source corpus of richly annotated English texts from 24 genres:

  • Main genres (available in train/dev/test):

      • academic writing

      • biographies

      • courtroom transcripts

      • essays

      • fiction

      • how-to guides

      • interviews

      • letters

      • news

      • online forum discussions

      • podcasts

      • political speeches

      • spontaneous face-to-face conversations

      • textbooks

      • travel guides

      • vlogs

  • Out-of-domain test genres (test2, aka the GENTLE partition):

      • dictionary entries

      • live esports commentary

      • legal documents

      • medical notes

      • poetry

      • mathematical proofs

      • course syllabuses

      • threat letters

The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 12, containing 301 documents annotated for:

  • Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
  • Manually corrected lemmatization and morphological segmentation
  • Sentence segmentation and rough speech act (manual)
  • Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
  • Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)
  • Construction Grammar annotations following UCxn
  • Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
  • Entity type, graded salience (0-5) and coreference annotation (including non-named entities, singletons, appositions, cataphora and discourse deixis), as well as Centering Theory annotations
  • Bridging anaphora classified into 11 subtypes (multiple concurrent types are possible)
  • Entity linking (Wikification) of all named entities with Wikipedia articles and Wikidata, including their non-named and pronominal mentions
  • Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies, including multiple concurrent and non-projective relations
  • Discourse signal annotations classified into 9 major and 46 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)
  • Shallow discourse relations following the PDTB v3 scheme
  • Five abstractive summaries for each document following strict, comparable guidelines across genres

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website: https://gucorpling.org/gum/

Best wishes,

The GUM team

PS – if you like GUM, also check out AMALGUM, our automatically annotated corpus: https://github.com/gucorpling/amalgum/
