
TalkBank Batchalign CHATUtterance

CHATUtterance is a series of BERT-derivative models designed for the task of utterance segmentation, released by the TalkBank project. This is the Mandarin model, trained on the utterance diarization samples given by the CHILDES Mandarin corpora: ZhouAssessment, Zhang Personal Narrative, and Li Shared Reading.

Usage

The models can be used directly as a BERT-class token classification model following the instructions from Hugging Face. Feel free to inspect this file for a sense of what the classes mean. Alternatively, to get the full analysis possible with the model, it is best combined with the TalkBank Batchalign suite of analysis software, available here, using transcribe mode.

Target labels:

  • 0: regular form
  • 1: start of utterance/capitalized word
  • 2: end of declarative utterance (end this utterance with a .)
  • 3: end of interrogative utterance (end this utterance with a ?)
  • 4: end of exclamatory utterance (end this utterance with a !)
  • 5: break in the utterance; depending on orthography, one can insert a ,
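As a minimal sketch of how these labels might be turned back into punctuated text, the function below rebuilds utterances from a token sequence and per-token label predictions. It is an illustrative decoder, not part of Batchalign; the label meanings follow the list above, and the token/label inputs are hypothetical.

```python
# Utterance-final labels mapped to their terminators, per the label list above.
PUNCT = {2: ".", 3: "?", 4: "!"}

def decode(tokens, labels):
    """Rebuild punctuated utterances from tokens and predicted labels."""
    utterances, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1:              # start of utterance: capitalize the word
            token = token.capitalize()
        current.append(token)
        if label == 5:              # utterance-internal break: insert a comma
            current[-1] += ","
        elif label in PUNCT:        # utterance end: attach terminator and flush
            current[-1] += PUNCT[label]
            utterances.append(" ".join(current))
            current = []
    if current:                     # trailing tokens without a terminator
        utterances.append(" ".join(current))
    return utterances

print(decode(["you", "like", "this", "yes", "I", "do"],
             [1, 0, 3, 1, 0, 2]))
# → ['You like this?', 'Yes I do.']
```

For Mandarin output the space-joining and capitalization steps would differ, but the label-to-punctuation mapping is the same.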
Model details: 102M parameters, F32 tensors, Safetensors format.