Combination of Speech and Text Technologies with Criminal Network Analysis: Steps Toward First Field-Test Event of ROXANNE Project
Overview
Serious and organised crime knows no borders, and many criminals operate as part of large networks spanning multiple countries. Criminals are leveraging technology to distribute child exploitation content, sell drugs, and hack into national infrastructure around the world with far greater ease [1]. In recent years, organised crime groups have become more complex and sophisticated, and they increasingly use new and evolving technology to commit crimes and to communicate with other criminal groups [2].
To keep up in this new environment, law enforcement agencies (LEAs) must change their approach to criminal investigations, because relying mainly on physical evidence and witness statements is no longer sufficient in many cases [3]. Training in research-based investigative procedures, together with access to related tools and resources, can help law enforcement officers carry out successful investigations [4], solve cases more swiftly, and in some cases even prevent crimes.
Acquiring real investigation data for developing and testing speech, text, and video technologies raises multiple legal and ethical issues. While those issues are being addressed, the ROXANNE project currently uses the following datasets, chosen for their nature and suitability:
- Crime Scene Investigation (CSI) data
- National Institute of Standards and Technology (NIST) data
- ENRON data
Dataset Description
Crime Scene Investigation (CSI): CSI is a popular American criminal investigation television series. Each episode comes with a video of about 40 minutes, an audio file, and a transcript. The audio and video are extracted from the DVD of the show, and the transcripts were published by the University of Edinburgh. The transcripts also annotate the role of each speaker (Suspect, Killer, or Other). Each episode involves a team of investigators, journalists, suspects, and a killer.
National Institute of Standards and Technology (NIST): NIST has run speaker recognition evaluations (SRE) for more than two decades. The data from these evaluations are attractive because they consist of telephone calls and involve a large number of speakers.
ENRON: ENRON is a company that remains infamous for willful corporate fraud and corruption. Two years after the company's bankruptcy, emails from 150 managers were made public by the Federal Energy Regulatory Commission. In total, over 500,000 emails were collected, each with its content, the email addresses of its recipients, and the sender's email address and full name. Some of these emails exposed integrity issues involving some of the managers. Several laboratories, including SRI International, worked on removing the names of employees involved in such activities. In 2004, telephone recordings of several managers were also made public [5]. There are 64 recordings of 5 minutes on average, each containing several phone calls, for a total of 6 hours of audio. All transcripts were also made public as PDF images, with the first name of each speaker.
Technologies Involved
The technologies used on these datasets are those already prepared for the first field-test event of ROXANNE:
- Automatic Speech Recognition (ASR)
- Speaker Identification (SID)
- Gender Identification (GID)
- Keyword and Topic Detection
- Named Entity Recognition (NER)
- Network Analysis
- Automatic Speech Recognition (ASR)
ASR technology allows people to speak to a computer interface using their voice in a way that, in its most sophisticated variations, resembles normal human conversation [6]. To measure ASR performance, a set of segments was extracted from the dataset and transcribed automatically, and the resulting transcripts were compared to a human-generated reference.
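The text does not name the metric used for this comparison. One widely used choice is word error rate (WER): the word-level edit distance between the automatic transcript and the human reference, divided by the number of reference words. The sketch below is a minimal illustration under that assumption; the function name and the example sentences are hypothetical and are not part of the ROXANNE tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical segment: the ASR output drops one word of the reference.
    human = "the suspect left the building"
    automatic = "the suspect left building"
    print(f"WER: {word_error_rate(human, automatic):.2f}")
```

Lower values are better; a WER of 0 means the automatic transcript matches the reference word for word. Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.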
- Speaker Identification (SID)
Because the aforementioned datasets contain little data, training a speaker identification system on them is not feasible. The speaker identification technology therefore relies on pre-trained systems, trained on other available datasets. In the speaker identification pipeline, the audio is first divided into enrolment audio (recordings of known speakers) and test audio (recordings whose speaker is to be identified).
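The enrolment/test split can be sketched as follows. This is an illustration only, not the ROXANNE implementation: it assumes a pre-trained model has already turned each audio segment into a fixed-length embedding vector (represented here as plain lists of floats), and it scores a test segment against each enrolled speaker with cosine similarity. The function names, the averaging step, and the threshold value are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def enrol(embeddings_by_speaker):
    """Average each speaker's enrolment embeddings into one reference vector."""
    models = {}
    for speaker, embeddings in embeddings_by_speaker.items():
        dim = len(embeddings[0])
        models[speaker] = [
            sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)
        ]
    return models


def identify(test_embedding, models, threshold=0.5):
    """Return (speaker, score) for the best match, or (None, score) if the
    best score falls below the (hypothetical) decision threshold."""
    best_speaker, best_score = max(
        ((spk, cosine(test_embedding, ref)) for spk, ref in models.items()),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return None, best_score
    return best_speaker, best_score


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real systems use hundreds of dimensions.
    models = enrol({"speaker_A": [[1.0, 0.0], [0.9, 0.1]], "speaker_B": [[0.0, 1.0]]})
    print(identify([1.0, 0.05], models))
```

Returning `None` below the threshold matters in an investigative setting: a test recording may come from someone who was never enrolled, and forcing a match would be misleading.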
- Gender Identification (GID)
This technology automatically determines whether a female or a male is speaking in the segment or recording being examined. Because the classifier is trained on a large number of recordings of spontaneous speech in different languages, it can be considered independent of both the language and the text (speech content). For LEAs, gender detection can narrow down the search space when the gender of a suspect or person of interest is known.
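As a small illustration of narrowing the search space, the sketch below filters a list of segments by predicted gender. The `Segment` record, its `confidence` field, and the `min_confidence` cut-off are hypothetical; the text does not describe the output format of the GID system. Low-confidence segments are deliberately kept so that a classifier error does not silently remove a relevant recording.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: str
    predicted_gender: str  # "female" or "male", as predicted by a GID system
    confidence: float      # hypothetical classifier confidence in [0, 1]


def narrow_by_gender(segments, known_gender, min_confidence=0.8):
    """Keep segments whose predicted gender matches the known gender,
    plus any segment whose prediction is too uncertain to rule out."""
    return [
        s
        for s in segments
        if s.predicted_gender == known_gender or s.confidence < min_confidence
    ]


if __name__ == "__main__":
    segments = [
        Segment("s1", "female", 0.95),
        Segment("s2", "male", 0.97),
        Segment("s3", "male", 0.55),
    ]
    for s in narrow_by_gender(segments, "female"):
        print(s.segment_id)
```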
- Keyword and Topic Detection
The topic detection technology takes transcript text as input and uses Concise Semantic Analysis (CSA) to infer word representations. Once the underlying semantics have been inferred, a small set of concepts is used to represent the input data. The intuition behind this approach is that highly abstract semantic elements (concepts) are good discriminators for clustering very short transcript texts that come from a narrow (and noisy) domain.
- Named Entity Recognition (NER)
This technology extracts targeted named entities (person names, locations, organizations, etc.) from a given text. Our NER technology is language-agnostic in the sense that it only requires training data for the target language; no language-specific rules are necessary. Furthermore, it can support new entity types on demand. For example, recognizing drug or weapon names appearing in the text could be important for LEAs analyzing the data.
- Network Analysis
The results of ASR, SID, and GID were integrated into a network analysis tool that displays, for each node in the network, the identity predicted by the speaker identification system and the predicted gender. The tool supports LEAs in identifying speakers involved in criminal investigations.
The visualization of the network analysis involves technology fusion and the prototyping of web applications. The technology fusion combines the output of each of the following technologies:
- Speaker Identification
- Automatic Speech Recognition
- Keyword and Topic Detection
- Named Entity Recognition
- Gender Prediction
- Network Analysis
This tool makes it easier to listen to a conversation, identify the speakers along with their names and gender, check their connections with other characters, extract the text of the conversation, highlight the named entities in it, and identify its topic. We also provide some Social Network Analysis (SNA) features, such as identifying communities and central characters within the network.
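The two SNA features mentioned above can be illustrated with a toy call graph. This is a rough sketch, not the method used in the ROXANNE tool, which the text does not specify. It assumes an undirected graph built from (caller, callee) pairs, uses degree centrality as one simple notion of a "central character", and uses connected components as a crude stand-in for communities (real community detection usually splits densely connected groups further). All names and data are hypothetical.

```python
from collections import defaultdict


def build_graph(calls):
    """Build an undirected adjacency map from (caller, callee) pairs."""
    graph = defaultdict(set)
    for a, b in calls:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {
        node: (len(neighbours) / (n - 1) if n > 1 else 0.0)
        for node, neighbours in graph.items()
    }


def connected_components(graph):
    """Group nodes that can reach each other; a crude proxy for communities."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


if __name__ == "__main__":
    calls = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("E", "F")]
    graph = build_graph(calls)
    centrality = degree_centrality(graph)
    print("most central:", max(centrality, key=centrality.get))
    print("groups:", [sorted(c) for c in connected_components(graph)])
```

In a fused view, each node would additionally carry the identity and gender predicted by SID and GID, so that central nodes and groups can be inspected together with those attributes.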
Conclusion
The analysis performed on existing datasets can be seen as a set of trial experiments that allow an objective evaluation of the ROXANNE platform currently under development, which integrates several types of processing modules and takes speech and text as input modalities.
References
1. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/752850/SOC-2018-web.pdf
2. https://www.publicsafety.gc.ca/cnt/rsrcs/pblctns/cmbtng-rgnzd-crm/index-en.aspx#a1
3. https://perf.memberclicks.net/assets/ChangingNatureofCrime.pdf
4. https://nij.ojp.gov/topics/law-enforcement/investigations
5. https://web.archive.org/web/20070219025955/http://www.enrontapes.com/files.html
6. https://usabilitygeek.com/automatic-speech-recognition-asr-software-an-introduction/