Forensic Automatic Speaker Recognition (FASR): Problems and prospects
Today, speaker recognition receives particular attention and substantial funding in forensics and in other applications (banking technology, voice call centres, voice search, etc.). In many cases, this is due to the use of audio recording devices to record crimes and, in particular, the widespread use of mobile technologies, as well as the use of various state-of-the-art technologies in the fight against crime and international terrorism. In the ROXANNE project, speaker recognition is one of the most important elements of the system. However, direct application of automatic speaker recognition (ASR) systems in forensics raises a number of issues. In general, ASR methods work well only under controlled conditions, with sufficiently good signal quality and relatively long recordings.
There are currently several ASR systems in the world that are intended for forensic use (both in forensics and in criminal search and operational work). It should be emphasized that, in forensic speaker recognition, the automatic identification method is used in conjunction with traditional speaker recognition methods and is one of the instrumental methods or an integral part of the combined method. However, traditional auditory-instrumental methods are very labour-intensive and require a lot of manual processing of audio recordings. Conversion of speech signals into text and automatic segmentation and assignment of speech signals to speakers (diarisation) particularly require a lot of time and effort. Therefore, automating such examinations and introducing this automation into expert practice would significantly simplify data processing and speed up the examinations.
The main problem in the application of ASR systems in forensics is the accuracy and reliability of the results of such systems. Therefore, the effective use of automated speaker recognition systems in conjunction with traditional instrumental methods requires an assessment of their accuracy under a variety of operating conditions. In general, the accuracy of identification methods depends on a number of factors that cannot always be assessed. However, the main factors that determine the accuracy of speaker recognition are as follows: recording duration; recording quality; overlap/mismatch of the conditions for making investigative and comparative audio recordings; and quality of training of the recognition system. Since it is very difficult to assess the impact of all the factors encountered in forensic speaker examinations, the performance of such systems can best be determined using voice databases developed on the basis of audio recordings submitted for examinations. The accuracy of identification generally depends on the duration of the audio recordings used for the purpose of training, the conditions under which investigative and comparative voice recordings are made, the emotional state of speakers, coding methods, etc. Despite the variety of created voice databases that attempt to record voices under a variety of conditions, forensic investigations still encounter factors whose impact on an automated speaker recognition system is often unknown.
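As a rough illustration of what assessing accuracy under a variety of operating conditions can look like, the sketch below computes an approximate equal error rate (EER) separately for each recording condition. The EER is a common summary metric in speaker recognition, but it is our choice here and is not prescribed by the text above; the function names, the condition labels and all scores are made up for the example.

```python
from typing import Dict, Sequence, Tuple


def error_rates(target: Sequence[float], nontarget: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false rejection rate, false acceptance rate) at a threshold.

    Scores at or above the threshold are treated as "same speaker".
    """
    frr = sum(s < threshold for s in target) / len(target)
    far = sum(s >= threshold for s in nontarget) / len(nontarget)
    return frr, far


def equal_error_rate(target: Sequence[float],
                     nontarget: Sequence[float]) -> float:
    """Approximate the EER by sweeping thresholds over the observed scores.

    The threshold where FRR and FAR are closest is selected, and the mean
    of the two rates at that threshold is returned.
    """
    best_gap = None
    best_eer = None
    for t in sorted(set(target) | set(nontarget)):
        frr, far = error_rates(target, nontarget, t)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2
    return best_eer


# Hypothetical comparison scores, grouped by recording condition.
# Each entry holds (same-speaker scores, different-speaker scores).
conditions: Dict[str, Tuple[Sequence[float], Sequence[float]]] = {
    "long_clean": ([2.1, 1.8, 2.5, 1.9], [-1.2, -0.8, -1.5, -2.0]),
    "short_noisy": ([0.9, -0.2, 1.1, 0.3], [-0.5, 0.4, -1.0, 0.6]),
}

eer_by_condition = {
    name: equal_error_rate(tgt, non) for name, (tgt, non) in conditions.items()
}

for name, eer in eer_by_condition.items():
    print(f"{name}: EER = {eer:.2f}")
```

With these invented scores, the clean, long-duration condition separates the speakers perfectly (EER of 0), while the short, noisy condition gives an EER of 0.5. A real evaluation would use scores produced by an actual system on a database built from casework-like recordings, with far more trials per condition.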
Depending on the purpose for which speaker recognition technology is applied in forensic science, investigations using ASR systems can currently be divided, conditionally, into forensic investigations and criminal investigations (forensic identification and investigatory speaker recognition using automated methods). Since the same ASR technologies are used in both cases, the adequacy of their application is sometimes assessed inaccurately: ASR systems are used by special services for initial voice search in voice databases and for other special investigations (operational work), as well as for forensic investigations. Because this field is developing very rapidly and lacks settled terminology and established practices, forensics is often confused with searching for a person in voice databases, identification of a person and other investigations performed using ASR. In forensics, semiautomatic or combined methods are used, where ASR is one of the methods of instrumental analysis. Therefore, at present, the court is not provided with speaker recognition conclusions based solely on the results of an automated system [2]. The main reason is that all automated systems produce errors, regardless of how well the system has been trained. That is why further investigations (auditory analysis, linguistic-phonetic investigations, etc.) and expert evaluation of the results are necessary, and the final conclusion must be made by an expert who assesses all the investigations carried out. Thus, at present, automatic speaker recognition can be used in forensics only as one of the methods of instrumental examination.
To date, there are several automated speaker recognition systems in the world developed specifically for law enforcement agencies (LEA). According to a global survey (190 countries) on the use of speaker identification by law enforcement agencies [3], most countries use a combined method for speaker identification in operational work, i.e. auditory analysis and acoustic-statistical analysis are used together with ASR systems.
In 2015, the ENFSI Expert Working Group Forensic Speech and Audio Analysis (ENFSI FSAAWG) adopted a good practice guide for conducting forensic examinations using forensic automatic speaker recognition (FASR) and forensic semiautomatic speaker recognition (FSASR) methods [1]. This document emphasizes that, in forensics, it is not enough to present the result of an automated system alone, since additional research is needed to ensure the reliability of the results. FSASR is a method based on auditory (linguistic) voice and speech analysis conforming to the laws of psychoacoustics and linguistics, as well as on acoustic-instrumental analysis. Auditory (linguistic) analysis evaluates general voice features (pitch, strength, clarity), speech timbre, melody, rhythm, pauses, etc. During acoustic-instrumental analysis, voice identification features are extracted from speech signals and then appropriate statistical calculations are performed. The expert draws conclusions only after carrying out a full analysis, knowing the physical meaning of these features and the limits of their statistical distribution, and taking into account the results of the auditory analysis. The main advantage of the FSASR method is that it is a comprehensive and objective examination: with the same audio recordings, the examination can be repeated and its results verified, and the conclusions drawn can be objectively supported.
References:
- Methodological guidelines for best practice in forensic semiautomatic and automatic speaker recognition. ENFSI Forensic Speech and Audio Analysis Working Group Meeting. Warsaw, 21 September 2015.
- Erica Gold and Peter French. International practices in forensic speaker comparisons: second survey. International Journal of Speech, Language and the Law, Vol. 26.1, 2019, pp. 1-20.
- Interpol survey of the use of speaker identification by Law Enforcement Agencies, https://doi.org/10.1016/j.forsciint.2016.03.044. Speaker Identification Project. 2015, https://www.idiap.ch/en/scientific-research/projects/SIIP
Author:
Dr. Bernardas Šalna
Forensic Science Centre of Lithuania (LTEC)