Technology to enhance forensic speaker analysis

Law enforcement agencies investigate criminal networks to find participants, understand their role in the network and, eventually, collect evidence for prosecution. Opportunities to identify criminals often lie in the traces that their communication leaves behind. A cell phone leaves many different traces, such as speech, text messages and location data, but the availability and usability of these modalities vary. It is therefore worthwhile to invest in multiple modalities, increasing the chance of a useful result.

Introduction

Nowadays, people are part of multiple social networks and own multiple communication devices, such as telephones and laptops. A person can use any of the devices at his or her disposal to communicate with others in a network. Moreover, the relation between persons and devices is many-to-many: a person may use several devices, and a device can have multiple users, e.g., all members of a family may share a single laptop.

Criminals are no exception: they form their own networks and have access to multiple communication devices. They use cryptophones to maintain their “business relations”, dedicated action devices to coordinate specific events, and a private phone to contact family and friends. To intervene adequately, it is important to have insight into the structure and division of roles within a criminal network. To disrupt an organization, it is sometimes more effective to remove the so-called facilitators, such as suppliers of cars or weapons or the person with technical knowledge, rather than the management. Law enforcement agencies therefore investigate such networks to find participants, understand their role in the network and, eventually, collect evidence for prosecution.

Opportunities to identify criminals and criminal networks often lie in the traces that their communication leaves behind, and many forensic techniques are aimed at exploiting them. For instance, in forensic speech investigation at the NFI, experts listen carefully to the sound of a voice and the use of language. Who is speaking on a recording? What is being said? Who is in contact with whom? Forensic experts listen to audio clips to determine whether two recordings are from the same speaker; when recording conditions allow, they are also assisted by automatic speaker verification systems. This makes the process more transparent and sometimes allows for stronger conclusions.

However, speech is not the only source. Evidence that may help identify a phone's owner includes intercepted speech, but also other kinds of traces, such as text messages and cell tower locations. If a second device is involved whose ownership is not disputed, such as the suspect’s “private” phone, location traces may provide crucial information to link a crime-related phone to that private phone [1].
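To make the idea of linking two phones through location traces more concrete, here is a toy similarity statistic. It is not the method of [1], which compares mobility patterns in a more elaborate way; it is a minimal sketch that assumes each trace is available as a mapping from a time slot (e.g. an hour index) to a cell-tower ID. The function name and trace format are hypothetical.

```python
def colocation_rate(trace_a, trace_b):
    """Fraction of shared time slots in which both phones used the same cell tower.

    Each trace is a dict {time_slot: tower_id}; slots missing from either
    trace are ignored rather than counted as a mismatch.
    """
    shared_slots = set(trace_a) & set(trace_b)
    if not shared_slots:
        return 0.0  # no overlapping observations: nothing to compare
    same_tower = sum(1 for slot in shared_slots if trace_a[slot] == trace_b[slot])
    return same_tower / len(shared_slots)


# Made-up traces for illustration only.
private_phone = {0: "T1", 1: "T2", 2: "T3", 3: "T3"}
crime_phone = {1: "T2", 2: "T3", 3: "T7", 5: "T1"}
print(colocation_rate(private_phone, crime_phone))  # 2 of 3 shared slots match
```

A raw rate like this is not yet evidence: a forensic analysis would still have to express it as a likelihood ratio, taking into account how often unrelated phones in the same area happen to co-locate.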


Forensic practice

To address the question of whether a disputed speech fragment from the (unknown) offender and the reference material from a given suspect were produced by the same speaker, a forensic speaker comparison practitioner usually formulates two competing hypotheses:

  • Hypothesis 1: The disputed speech fragment and the reference speech fragment are from the same speaker
  • Hypothesis 2: The disputed speech fragment and the reference speech fragment are from different speakers

The first hypothesis is often called the prosecutor’s hypothesis and the second the defense’s hypothesis. The practitioner reports the “strength” of the evidence as a likelihood ratio: the probability of the observations under the first hypothesis divided by their probability under the second. A likelihood ratio above one supports the first hypothesis, a ratio below one supports the second, and the further the ratio is from one, the stronger that support.
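When an automatic system produces a similarity score rather than a likelihood ratio, one common route is a score-based likelihood ratio: model the distribution of scores for known same-speaker pairs and for known different-speaker pairs, and evaluate both densities at the observed score. The sketch below assumes both score distributions are roughly normal, which is our simplifying assumption; calibration methods used in practice are more careful. The function name and the reference scores are hypothetical.

```python
from statistics import NormalDist


def score_to_lr(score, same_speaker_scores, different_speaker_scores):
    """Likelihood ratio p(score | H1) / p(score | H2).

    Assumes the comparison scores under each hypothesis are roughly normally
    distributed; the reference scores come from pairs with known ground truth.
    """
    h1 = NormalDist.from_samples(same_speaker_scores)       # model under H1 (same speaker)
    h2 = NormalDist.from_samples(different_speaker_scores)  # model under H2 (different speakers)
    return h1.pdf(score) / h2.pdf(score)


# Made-up reference scores, for illustration only.
same = [0.70, 0.80, 0.90, 0.75, 0.85]
diff = [0.10, 0.20, 0.30, 0.25, 0.15]
print(score_to_lr(0.72, same, diff))  # far above 1: supports H1
print(score_to_lr(0.28, same, diff))  # far below 1: supports H2
```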


The practitioner may choose to employ an automatic speaker verification system. Such systems are used increasingly because they perform well under good recording conditions; however, they are always used alongside expert judgement [2]. One reason may be that their performance degrades when the speech fragments are short, when there is background noise, or when the sound quality is poor, all of which are common in practice. Moreover, the current generation of speaker verification systems measures only some aspects of a speaker’s voice and ignores other potential clues. Improving the performance of such systems is a non-trivial but crucial goal.
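Many current speaker verification systems map each recording to a fixed-length speaker embedding and score a pair of recordings by comparing their embeddings, for example with cosine similarity. The sketch below shows only that comparison step; the embedding extractor is assumed and not shown, and the short vectors are toy values. The resulting score is not a likelihood ratio and still has to be calibrated before it can be reported as one.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


# Toy embeddings, for illustration only (real ones have hundreds of dimensions).
disputed = [0.9, 0.1, 0.4]
reference = [0.8, 0.2, 0.5]
print(cosine_score(disputed, reference))
```

Short or noisy fragments yield less reliable embeddings, so the same threshold or calibration cannot simply be reused across recording conditions.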


Current research

A cell phone leaves many different traces, but the availability and usability of these modalities vary from case to case. It is therefore worthwhile to invest in multiple modalities, increasing the chance of a useful result. The evidential strength of a single modality may not be sufficient to establish the identity of a phone user. In such cases, the modalities may be combined into a single likelihood ratio.
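If the modalities are independent given each hypothesis, the combined likelihood ratio is simply the product of the per-modality ratios, or equivalently the sum of their logarithms. The sketch below works in log10, a scale often used to report likelihood ratios. The function name and the example values are hypothetical, and the independence assumption must be checked rather than taken for granted.

```python
import math


def combined_log10_lr(lrs):
    """Combine per-modality likelihood ratios into one log10 likelihood ratio.

    Assumes the modalities are conditionally independent given each hypothesis;
    if they are not, shared information is counted more than once.
    Returning the log keeps very large or very small products representable.
    """
    if any(lr <= 0 for lr in lrs):
        raise ValueError("likelihood ratios must be positive")
    return sum(math.log10(lr) for lr in lrs)


# Illustrative values only: voice, text and location LRs for one comparison.
log_lr = combined_log10_lr([20.0, 5.0, 3.0])
print(10 ** log_lr)  # approximately 300
```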

Both estimating the evidential strength of individual modalities and combining the results are complex tasks in which computers already outperform humans in some cases. As a result, the role of forensic experts is shifting from doing the analysis itself to supervising the process and interpreting its results, while the actual analyses are largely automated. This should improve not only the average strength of conclusions but also their transparency and consistency.

One of our current research areas in speaker verification is incorporating high-level features of speech, such as a speaker’s personal lexicon, into the likelihood ratio framework used in forensics. Specifically, we are developing authorship analysis techniques tailored to transcripts of phone calls in the forensic domain. We pay particular attention to explainability, a topic that has received little attention in authorship analysis so far.
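As an illustration of what a lexical feature can look like, the sketch below uses a classic stylometric idea: represent each transcript by the relative frequencies of a few frequent, topic-independent words and compare two transcripts by cosine similarity. This is a generic textbook approach, not the technique described above; the word list and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of frequent, topic-independent words.
FUNCTION_WORDS = ["the", "a", "and", "i", "you", "to", "of", "yeah", "like", "so"]


def lexical_profile(transcript):
    """Relative frequency of each function word in a transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty transcripts
    return [counts[w] / total for w in FUNCTION_WORDS]


def profile_similarity(text_a, text_b):
    """Cosine similarity between the function-word profiles of two transcripts."""
    pa, pb = lexical_profile(text_a), lexical_profile(text_b)
    dot = sum(x * y for x, y in zip(pa, pb))
    norm_a = math.sqrt(sum(x * x for x in pa))
    norm_b = math.sqrt(sum(y * y for y in pb))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no function words in at least one transcript
    return dot / (norm_a * norm_b)


print(profile_similarity("yeah so like I told you", "so yeah, like you know"))
```

A practical attraction of hand-picked features like these is that each word frequency can be inspected directly, which relates to the explainability concern; like a voice score, the similarity would still need to be turned into a likelihood ratio.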

To combine these features with voice, the two modalities should be statistically independent; otherwise, there is a risk of counting the same information twice. We validate this assumption and test whether combining results improves the performance of the system. Several variables that may affect the outcome are taken into account, for example the quality of the recording (telephone call vs. interview), the duration of the phone calls, background noise and the speaker’s dialect. The aim is to identify the conditions under which the speaker verification system benefits most from adding authorship analysis.
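Both checks can be sketched on a set of comparisons with known ground truth. A widely used performance measure for likelihood-ratio systems in forensic speaker recognition is the log-likelihood-ratio cost (Cllr), where lower is better and a system that always reports LR = 1 scores exactly 1. For dependence, a Pearson correlation between the two modalities' log-LRs, computed separately within same-speaker and within different-speaker pairs, is a rough first check only: independence implies zero correlation, but not the other way round. All numbers below are made up, and the function names are ours.

```python
import math


def cllr(same_lrs, diff_lrs):
    """Log-likelihood-ratio cost (Cllr); lower is better.
    A system that always reports LR = 1 gets exactly 1.0."""
    same_cost = sum(math.log2(1 + 1 / lr) for lr in same_lrs) / len(same_lrs)
    diff_cost = sum(math.log2(1 + lr) for lr in diff_lrs) / len(diff_lrs)
    return 0.5 * (same_cost + diff_cost)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up LRs for illustration only; index i refers to the same comparison in both lists.
voice_same, text_same = [20.0, 5.0, 0.8, 50.0], [3.0, 2.0, 4.0, 1.5]
voice_diff, text_diff = [0.1, 0.5, 2.0, 0.02], [0.5, 0.3, 1.2, 0.8]

# Rough dependence check, done separately under each hypothesis (on log10 LRs).
r_same = pearson([math.log10(v) for v in voice_same], [math.log10(t) for t in text_same])
r_diff = pearson([math.log10(v) for v in voice_diff], [math.log10(t) for t in text_diff])

# Compare voice alone with voice x text (the product assumes independence).
cllr_voice = cllr(voice_same, voice_diff)
cllr_combined = cllr([v * t for v, t in zip(voice_same, text_same)],
                     [v * t for v, t in zip(voice_diff, text_diff)])
print(f"r_same={r_same:.2f} r_diff={r_diff:.2f}")
print(f"Cllr voice={cllr_voice:.3f} combined={cllr_combined:.3f}")
```

To see in which conditions the combination helps most, the same comparison can be repeated on subsets of the data, e.g. per recording type or per range of call duration.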

There is still much room for improvement in this field. We expect the ROXANNE project to boost capabilities in forensic practice.


References

[1] W. Bosma, S. Dalm, E. Van Eijk, R. El Harchaoui, E. Rijgersberg, H. T. Tops, A. Veenstra, and R. Ypma. Establishing phone-pair co-usage by comparing mobility patterns. Science and Justice, 2019.

[2] E. Gold and P. French. International practices in forensic speaker comparisons: second survey. International Journal of Speech, Language and the Law, 26(1): 1–20, June 2019.