Abstract
Speech keyword spotting is a critical component of human-computer interfaces, and connectionist temporal classification (CTC) has proven to be an effective tool for the task. However, standard training for keyword spotting suffers from data imbalance: positive samples are usually far fewer than negative samples. Numerous easy-to-train negative examples overwhelm training and result in a degenerate model. To address this, this paper reshapes the standard CTC loss and proposes a novel re-weighted CTC loss. The loss evaluates each sample's importance by its number of detection errors during training and automatically down-weights the contribution of easy examples, most of which are negatives, so that training focuses on the samples that need it most. The proposed method alleviates the imbalance naturally and makes efficient use of all available data. In experiments on several sets of keywords selected from AISHELL-1 and AISHELL-2, it achieves 16%-38% relative reductions in false rejection rate over the standard CTC loss at 0.5 false alarms per keyword per hour.
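The re-weighting idea can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything below is an assumption made for illustration only: the linear mapping from error counts to weights, the `floor` parameter, and the function names `error_based_weights` and `reweighted_loss` are all hypothetical. The per-sample CTC losses and detection-error counts are taken as already computed.

```python
from typing import List, Sequence


def error_based_weights(error_counts: Sequence[int], floor: float = 0.1) -> List[float]:
    """Map per-sample detection-error counts to weights in [floor, 1].

    Hypothetical linear scheme (not the paper's formula): samples with
    more detection errors receive larger weights, and error-free (easy)
    samples are down-weighted to `floor`.
    """
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must lie in [0, 1]")
    max_err = max(error_counts, default=0)
    if max_err == 0:
        # No errors anywhere: fall back to uniform weights.
        return [1.0] * len(error_counts)
    return [floor + (1.0 - floor) * e / max_err for e in error_counts]


def reweighted_loss(
    ctc_losses: Sequence[float],
    error_counts: Sequence[int],
    floor: float = 0.1,
) -> float:
    """Weighted mean of per-sample CTC losses using error-based weights."""
    if len(ctc_losses) != len(error_counts):
        raise ValueError("ctc_losses and error_counts must have equal length")
    if len(ctc_losses) == 0:
        raise ValueError("at least one sample is required")
    weights = error_based_weights(error_counts, floor)
    total = sum(weights)
    if total == 0.0:
        # Only reachable when floor == 0 and every weight is zero.
        raise ValueError("all weights are zero")
    return sum(w * l for w, l in zip(weights, ctc_losses)) / total
```

Under this assumed scheme, a batch dominated by easy negatives with zero detection errors contributes little to the weighted loss, while the few samples the model still gets wrong dominate it. A nonzero `floor` keeps easy samples from being ignored entirely.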