Artificial Intelligence and Machine Learning

Japanese Broadcast News Transcription and Information Extraction

Developing a system for close-captioning and automatic information extraction for Japanese broadcast news speech.

By Sadaoki Furui, Katsutoshi Ohtsuki, and Zhi-Peng Zhang

Posted Feb 1 2000

Article
References
Authors
Footnotes
Figures

Inspired by the activities within the DARPA research community, we have been developing a large-vocabulary, continuous-speech recognition (LVCSR) system for Japanese broadcast news speech transcription [4]. This is a part of a joint research project with NHK broadcasting whose goal is the closed-captioning of TV programs. While some of the problems that we have investigated are Japanese-specific, others are language independent.

The broadcast news manuscripts used for constructing our language models were taken from NHK news broadcasts over a period between July 1992 and May 1996, and comprised roughly 500,000 sentences and 22 million words. To calculate word n-gram language models, we segmented the broadcast news manuscripts into words by using a morphological analyzer since Japanese sentences are written without spaces between words. A word-frequency list was derived for the news manuscripts, and the 20,000 most frequently used words were selected as vocabulary words. This 20,000-word vocabulary covered approximately 98% of the words in the broadcast news manuscripts. We calculated bigrams and trigrams and estimated unseen n-grams using Katz’s back-off smoothing method.

The feature vector consisted of 16 cepstral coefficients, normalized logarithmic power, and their delta features (derivatives). The total number of parameters in each vector was 34. Cepstral coefficients were normalized by the cepstral mean subtraction (CMS) method.

The acoustic models were gender-dependent shared-state triphone hidden Markov models (HMMs) and were designed using tree-based clustering. They were trained using phonetically balanced sentences and dialogues read by 53 male speakers and 56 female speakers. The total number of training utterances was 13,270 for male and 13,367 for female, and the total length of the training data was approximately 20 hours for each gender. The total number of HMM states was approximately 2,000 for each gender, and the number of Gaussian mixture components per state was four.

News speech data, from TV broadcasts in July 1996, were divided into two parts, a clean part and a noisy part, and were separately evaluated. The clean part consisted of utterances with no background noise, and the noisy part consisted of utterances with background noise. The noisy part included spontaneous speech such as reports by correspondents. We extracted 50 male utterances and 50 female utterances for each part. Each set included utterances by five or six speakers. All utterances were manually segmented into sentences. Due to space limitations, we report only the results for the clean part here.

Reading-dependent language modeling. Japanese text includes a mixture of three kinds of characters: Chinese characters (Kanji) and two kinds of Japanese characters (Hira-gana and Kata-kana). Each Kanji has multiple readings, and correct readings can only be decided according to context. Conventional language models are built using the written forms of words, and usually equal probability is assigned to all possible readings of each word. This method causes recognition errors in Japanese speech recognition because the assigned probability is sometimes very different from the true probability. We therefore constructed a language model that depends on the readings of words in order to take into account the frequency and context-dependency of the readings. This model reduced the word error rate by 4.4%, averaged over male and female, relative to the results of the conventional language model.

We constructed a language model that depends on the readings of words in order to take into account the frequency and context-dependency of the readings.

Filled-pause modeling. Broadcast news speech includes filled pauses at the beginning and in the middle of sentences, which cause recognition errors in our previous language models that use news manuscripts written prior to broadcasting. To cope with this problem, we introduced filled-pause modeling into the language model. This model reduced the word error rate by 7.9% relative to the results of the reading-dependent language model.

Online incremental speaker adaptation. Since each speaker in broadcast news utters several sentences in succession, the recognition error rate can be reduced by incrementally adapting the acoustic models within a segment that contains only one speaker. We applied online, unsupervised, instantaneous and incremental speaker adaptation combined with automatic detection of speaker changes. The maximum likelihood linear regression (MLLR) [1] and vector-field smoothing (VFS) [2] methods were instantaneously and incrementally carried out for each utterance.

We tried to determine the appropriate number of transformation matrices, or clusters, for MLLR. MLLR with seven clusters (silence, consonants, and five Japanese vowels) achieved the best performance in the experiment. This method reduced the word error rate by 11.8% relative to the results for the speaker-independent models.

By incorporating all the preceding methods, we achieved an 11.9% word error rate averaged over males and females for clean parts of broadcast news speech, which was a 25.1% reduction in word error rate over the baseline results.

Topic extraction. Transcribed broadcast news speech can also be used for automatic indexing or summarizing of the news. We have been investigating methods for automatically extracting a set of topic words expressing the content of each news broadcast [3]. Since most of the topic words were nouns, topic words are extracted from nouns in the speech recognition results on the basis of a significance measure for each word. Many of the measures that have been used in information retrieval from text databases were tried in a preliminary experiment, and the measure shown in Figure 1 was chosen. Figure 1 gives the amount of information borne by the word wi in the particular news article. In Figure 1, N is the vocabulary size (nouns only), gi is the frequency of word Wi in a news article, Gi is the frequency of word wi in all news articles, and Ga is the summation of all Gi‘s (as shown in Figure 2).

The Nikkei newspaper articles over a five-year period were used for calculating Gi and Ga values. Twenty-nine broadcast news articles comprising 142 utterances by 15 male speakers (8 anchors and 7 others) were used for evaluation. Each news article had 214 utterances (5 utterances on the average). True topic-words were given by three subjects; 410 phrases were given for each news article by each subject. A true topic-word set was constructed for each article from topic-words given by at least one subject (35.7 words on average). Supplementary experiments were also conducted by giving correct texts instead of transcription results as input. Eighty-nine percent of the true topic-word set were nouns. Figure 3 shows the results averaged over the 29 news articles. It is observed that, if transcription results are used as input, precision as well as recall are reduced by roughly 10% in comparison with that obtained by using the correct texts as input. When five topic-words were chosen from speech recognition results (recall = 13%), 87% of them were correct on average.

Figures

Figure 1. Determining the amount of information borne by the word w, in a particular new article.

Figure 2. Summation of word frequency.

Figure 3. Topic-word extraction from broadcast news speech or news text.

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Japanese Broadcast News Transcription and Information Extraction

View in the ACM Digital Library

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI

10.1145/328236.328150

February 2000 Issue

Published: February 1, 2000

Vol. 43 No. 2

Pages: 71-73

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

News May 2 2024

3D Printing Finds a Home

Samuel Greengard

Architecture and Hardware

Credit: Shutterstock 3D printer printing a structure

BLOG@CACM May 1 2024

HiPEAC’s Vision for the Future

Tullio Vardanega and Marc Duranton

Computing Profession

Credit: Roger Castro, Monzón HiPEAC Vision 2024 Next Computing Paradigm

BLOG@CACM Apr 29 2024

A Brief History of Embodied Artificial Intelligence, and Its Future Outlook

Shaoshan Liu and Shuang Wu

Architecture and Hardware

Credit: Getty Images Figure floating amidst vertical beams of light, illustration

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

Figures

Japanese Broadcast News Transcription and Information Extraction

DOI

February 2000 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.