Task 1: Punctuation prediction from conversational language
Restore punctuation marks from the output of an ASR system.
Authors
Piotr Pęzik, Adam Wawrzyński, Michał Adamczyk, Sylwia Karasińska, Wojciech Janowski, Agnieszka Mikołajczyk, Filip Żarnecki, Patryk Neubauer and Anna Cichosz
Video introduction
Motivation
Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do not contain any punctuation or capitalization. In longer stretches of automatically recognized speech, the lack of punctuation affects the general clarity of the output text [1]. The primary purpose of punctuation restoration (PR), punctuation prediction (PP), and capitalization restoration (CR) as a distinct natural language processing (NLP) task is to improve the legibility of ASR-generated text and possibly other types of texts without punctuation. For the purposes of this task, we define PR as the restoration of originally available punctuation in read speech transcripts (which was the goal of a separate task in the PolEval 2021 competition) [2] and PP as the prediction of possible punctuation in transcripts of spoken/conversational language. Aside from their intrinsic value, PR, PP, and CR may improve the performance of other NLP tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic parsing, or spoken dialog segmentation [3, 4].
One of the challenges in developing PP models for conversational language is the limited availability of consistently annotated datasets. The very nature of naturally occurring spoken language makes it difficult to identify exact phrase and sentence boundaries [5, 6], which means that dedicated guidelines are required to train and evaluate punctuation models.
The goal of the present task is to provide a solution for predicting punctuation in the test set collated for this task. The test set consists of time-aligned ASR dialogue transcriptions from three sources:
- CBIZ, a subset of DiaBiz, a corpus of phone-based customer support line dialogs (https://clarin-pl.eu/dspace/handle/11321/887)
- VC, a subset of transcribed video-communicator recordings
- SPOKES, a subset of the Spokes corpus [7]
Table 1 below summarizes the size of the three subsets in terms of dialogs, words and duration of recordings.
Subset | Corpus | Files | Words | Audio [s] | Speakers | License |
---|---|---|---|---|---|---|
CBIZ | DiaBiz | 69 | 36 250 | 16 916 | 14 | CC-BY-SA-NC-ND |
VC | Video conversations | 8 | 44 656 | 17 123 | 20 | CC-BY-NC |
Spokes | Casual conversations | 13 | 42 730 | 20 583 | 19 | CC-BY-NC |
Table 1. Overall statistics of the corpus.
The full dataset has been split into three subsets as summarized in Table 2 below.
Set | Files | Words | Audio [s] | License |
---|---|---|---|---|
Train | 69 | 98 095 | 44 030 | CC-BY-SA-NC-ND |
Dev | 11 | 12 563 | 4 718 | CC-BY-NC |
Test | 10 | 12 978 | 5 874 | CC-BY-NC |
Table 2. Training / development / test set statistics.
The punctuation annotation guidelines were developed in the CLARIN-BIZ project by Karasińska et al. [10].
Participants are encouraged to use both text-based and speech-derived features to identify punctuation symbols (e.g. in a multimodal framework [8]) or to predict casing along with punctuation [9]. Participants may also use the punctuation dataset available from PolEval 2021, Task 1: Punctuation restoration from read text [2].
Task description
The workflow of this task is illustrated in Fig. 1 below. Given raw ASR output, the goal is to predict punctuation marks in ASR transcripts of conversational speech.
Fig. 1 Overview of the punctuation prediction task.
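To illustrate the task, consider a hypothetical example (not taken from the dataset): given a raw, unpunctuated ASR segment, a system should return the same words with predicted punctuation marks inserted:

```
raw ASR output:        dzień dobry w czym mogę pomóc
predicted punctuation: dzień dobry, w czym mogę pomóc?
```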
The punctuation marks evaluated as part of the task are listed in Table 3 below. Positions with no punctuation (blanks) are marked as spaces. The distribution of explicit punctuation symbols in the training and development portions of the dataset is shown in Tables 3-6.
Mark | Symbol | Mean per dialog | Median | Max | Sum |
---|---|---|---|---|---|
fullstop | . | 111.15 | 59 | 1 157 | 8 892 |
comma | , | 161.51 | 69 | 1 738 | 12 921 |
question_mark | ? | 24.36 | 11 | 229 | 1 949 |
exclamation_mark | ! | 3.46 | 4 | 45 | 277 |
hyphen | - | 0.64 | 25 | 50 | 51 |
ellipsis | … | 63.28 | 11 | 1 833 | 5 062 |
words | | 1 383.23 | 569 | 16 528 | 110 658 |
Table 3. Punctuation for raw text (all subcorpora)
Mark | Symbol | Mean per dialog | Median | Max | Sum |
---|---|---|---|---|---|
fullstop | . | 58.06 | 54 | 213 | 3 600 |
comma | , | 70.61 | 59 | 388 | 4 378 |
question_mark | ? | 11.26 | 10 | 35 | 698 |
exclamation_mark | ! | 0.34 | 1 | 5 | 21 |
hyphen | - | 0.02 | 1 | 1 | 1 |
ellipsis | … | 12.29 | 9 | 54 | 762 |
words | | 528.74 | 483 | 2 180 | 32 782 |
Table 4. Punctuation for raw text (CBIZ)
Mark | Symbol | Mean per dialog | Median | Max | Sum |
---|---|---|---|---|---|
fullstop | . | 411.86 | 384 | 1 157 | 2 883 |
comma | , | 737.86 | 577 | 1 738 | 5 165 |
question_mark | ? | 85.29 | 41 | 229 | 597 |
exclamation_mark | ! | 10.43 | 5 | 43 | 73 |
hyphen | - | / | / | / | / |
ellipsis | … | 514.00 | 365 | 1 833 | 3 598 |
words | | 5 704.14 | 4 398 | 9 469 | 39 929 |
Table 5. Punctuation for raw text (VC)
Mark | Symbol | Mean per dialog | Median | Max | Sum |
---|---|---|---|---|---|
fullstop | . | 219.00 | 193 | 607 | 2 409 |
comma | , | 307.09 | 313 | 614 | 3 378 |
question_mark | ? | 59.45 | 39 | 150 | 654 |
exclamation_mark | ! | 16.64 | 10 | 45 | 183 |
hyphen | - | 4.55 | 50 | 50 | 50 |
ellipsis | … | 63.82 | 45 | 186 | 702 |
words | | 3 449.73 | 1 966 | 16 528 | 37 947 |
Table 6. Punctuation for raw text (Spokes)
Data format
We provide two types of data: text and audio. Text data is provided in TSV format. Audio data consists of WAV files and transcripts with force-aligned timestamps.
WAVE files are provided as a separate download:
- train and dev audio files: http://poleval.pl/task1/train-dev-audio.zip
- test audio files: http://poleval.pl/task1/test-audio.zip
Text data
The datasets are encoded in the TSV format.
Field descriptions:
- column 1: name of the audio file
- column 2: unique segment id
- column 3: segment text, where each word is separated by a single space
Each token in the segment text (column 3) has the following format (see the parsing sketch below):
- word:start-end, where start and end are the word's start and end timestamps in milliseconds
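A minimal parsing sketch in Python, assuming the three-column TSV layout and the word:start-end token encoding described above (the function and variable names are illustrative and not part of any official tooling):

```python
from typing import Iterator, List, Tuple

def parse_segment(text: str) -> List[Tuple[str, int, int]]:
    """Split a segment into (word, start_ms, end_ms) tuples.

    Assumes every token has the form "word:start-end", with the start and end
    timestamps given in milliseconds, as described above.
    """
    words = []
    for token in text.split(" "):
        word, _, span = token.rpartition(":")
        start, _, end = span.partition("-")
        words.append((word, int(start), int(end)))
    return words

def read_tsv(path: str) -> Iterator[Tuple[str, str, List[Tuple[str, int, int]]]]:
    """Yield (audio_file, segment_id, words) triples from a task TSV file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            audio_file, segment_id, text = line.rstrip("\n").split("\t")
            yield audio_file, segment_id, parse_segment(text)
```

The word-level timestamps can then be combined with the corresponding WAV files to derive speech-based cues such as inter-word pauses for multimodal models.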
Evaluation procedure
Baseline results will be provided as part of the final evaluation.
Submission format
Results are to be submitted as a plain text file in which each line corresponds to a single segment. The text of each line should include the predicted punctuation marks.
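For illustration (a hypothetical segment, not taken from the test set), if the words of a segment are no dobrze to dziękuję za rozmowę, the corresponding line of the submission file could read:

```
no dobrze, to dziękuję za rozmowę.
```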
Metrics
Final results are evaluated in terms of precision, recall, and F1 scores for predicting each punctuation mark separately. Submissions are compared with respect to the weighted average of F1 scores over all punctuation signs. The method of evaluation is similar to the one used in the previous PolEval competition (PolEval 2021, Task 1: Punctuation restoration from read text) [2].
Per-document score for punctuation sign $p$ in document $d$:

$$P_{d,p} = \frac{TP_{d,p}}{TP_{d,p} + FP_{d,p}}, \qquad R_{d,p} = \frac{TP_{d,p}}{TP_{d,p} + FN_{d,p}}, \qquad F1_{d,p} = \frac{2 \cdot P_{d,p} \cdot R_{d,p}}{P_{d,p} + R_{d,p}}$$

Global score per punctuation sign $p$ (aggregated over the set of documents $D$):

$$F1_{p} = \frac{1}{|D|} \sum_{d \in D} F1_{d,p}$$

Final scoring metric, calculated as the weighted average of global scores per punctuation sign:

$$\mathrm{Score} = \frac{\sum_{p} w_{p} \cdot F1_{p}}{\sum_{p} w_{p}}$$

where $w_{p}$ is the number of occurrences of sign $p$ in the reference data.
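The sketch below approximates this scoring scheme in Python. It assumes that the predicted and reference punctuation have already been aligned position by position, computes per-sign scores over the concatenated data rather than averaging per document, and weights each sign by its reference count; the official evaluation script may differ in these details.

```python
from collections import Counter

PUNCTUATION = [".", ",", "?", "!", "-", "…"]

def weighted_f1(reference, predicted):
    """Weighted average of per-sign F1 scores.

    `reference` and `predicted` are aligned lists holding, for every word
    position, the punctuation sign that follows the word ("" if none).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(reference, predicted):
        if ref == pred:
            if ref in PUNCTUATION:
                tp[ref] += 1
        else:
            if pred in PUNCTUATION:
                fp[pred] += 1
            if ref in PUNCTUATION:
                fn[ref] += 1

    weighted_sum, total_weight = 0.0, 0
    for sign in PUNCTUATION:
        support = tp[sign] + fn[sign]  # occurrences of the sign in the reference
        if support == 0:
            continue
        predicted_count = tp[sign] + fp[sign]
        precision = tp[sign] / predicted_count if predicted_count else 0.0
        recall = tp[sign] / support
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_sum += support * f1
        total_weight += support
    return weighted_sum / total_weight if total_weight else 0.0

# Example: reference has three marks (",", ".", "?"); two are predicted, one is missed.
# weighted_f1(["", ",", ".", "?"], ["", ",", "", "?"])  -> approximately 0.67
```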
We would like to invite participants to a discussion about evaluation metrics, taking into account such factors as:
- ASR and forced-alignment errors,
- inconsistencies among annotators,
- the impact of slight displacements of punctuation marks,
- assigning different weights to different types of errors.
Metadata
Tags: poleval-2022, asr
References
- Yi, J., Tao, J., Bai, Y., Tian, Z., & Fan, C. (2020). Adversarial transfer learning for punctuation restoration. arXiv preprint arXiv:2004.00248.
- Mikołajczyk, A., Wawrzyński, A., Pęzik, P., Adamczyk, M., Kaczmarek, A., & Janowski, W. PolEval 2021 Task 1: Punctuation Restoration from Read Text. Proceedings of the PolEval 2021 Workshop, 21.
- Nguyen, Thai Binh, et al. “Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models.” Proc. Interspeech 2020 (2020): 4263-4267.
- Hlubík, Pavel, et al. “Inserting Punctuation to ASR Output in a Real-Time Production Environment.” International Conference on Text, Speech, and Dialogue. Springer, Cham, 2020.
- Sirts, Kairit, and Kairit Peekman. “Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts.” Human Language Technologies–The Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020. Vol. 328. IOS Press, 2020.
- Wang, Xueyujie. “Analysis of Sentence Boundary of the Host’s Spoken Language Based on Semantic Orientation Pointwise Mutual Information Algorithm.” 2020 12th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA). IEEE, 2020.
- Pęzik, P. (2014, October). Spokes - a search and exploration service for conversational corpus data. In Selected papers from the CLARIN 2014 Conference (pp. 99-109).
- Sunkara, Monica, et al. “Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech.” arXiv preprint arXiv:2008.00702 (2020).
- Pappagari, R., Żelasko, P., Mikołajczyk, A., Pęzik, P., & Dehak, N. (2021). Joint prediction of truecasing and punctuation for conversational speech in low-resource scenarios. arXiv preprint arXiv:2109.06103.
- Karasińska, S., Cichosz, A., Adamczyk, M., & Pęzik, P. Evaluating punctuation prediction in conversational language. Forthcoming.