Project extended abstract:
Discovering criminal and terrorist networks is a primary task of law enforcement and intelligence agencies across Europe. Criminals and terrorists use voice communication over different media. While personal communication within their networks usually takes place over standard mobile networks and VoIP services (Skype, Google Hangouts and others), a significant share of their voice data may appear in public media, typically as hate or propaganda speech on YouTube, Facebook or other social media channels. Determining and tracking target identities across such channels is extremely difficult, and speaker identification (SID) techniques (such as those investigated in the European SiiP project) might not be effective in such challenging environments when they consider isolated data from one speaker only.
Link analysis (data-analysis techniques used to evaluate relationships or connections between nodes) has long been used in both intelligence and investigation work. Ultimately, law enforcement agencies (LEAs) are not interested in isolated individuals, but in whole criminal or terrorist networks. The situation can be compared to the early days of Internet search: AltaVista, Excite and others had some results and market uptake, but the whole domain changed when Google started to exploit the relations between web pages (link-based ranking such as PageRank). In this project, we expect a similar breakthrough.
Consider the following example: current state-of-the-art SID technology, operating at an equal error rate of 1%, produces about 5 false alarms and 5 misses on a set of 500 analyzed recordings. These can be audited by human experts. When millions of recordings from different media are analyzed, the same error rate yields tens of thousands of errors, far too many to audit manually. Such techniques alone are therefore not powerful enough, and we are convinced that the field can make a significant leap forward only with link analysis.
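The arithmetic behind this example can be sketched as follows. This is only an illustration under stated assumptions: the system is taken to operate exactly at its equal-error-rate point (miss rate = false-alarm rate = EER), and the "500 recordings" are read as 500 target and 500 non-target trials. The function name `expected_errors` is hypothetical.

```python
def expected_errors(n_target, n_nontarget, eer):
    """Expected (misses, false_alarms) at the EER operating point.

    Assumption: at this operating point the miss rate and the
    false-alarm rate are both equal to `eer`.
    """
    misses = eer * n_target            # target trials wrongly rejected
    false_alarms = eer * n_nontarget   # non-target trials wrongly accepted
    return misses, false_alarms


# Compare the small audited set with a large-scale scenario.
for n in (500, 1_000_000):
    misses, false_alarms = expected_errors(n, n, eer=0.01)
    print(f"{n:>9} trials per class: "
          f"~{misses:.0f} misses, ~{false_alarms:.0f} false alarms")
```

The number of errors grows linearly with the number of trials, so a rate that is manageable for hundreds of recordings becomes unmanageable for millions.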
This project proposes to combine the strengths of speaker data mining and link analysis to provide LEAs with an efficient tool to track and uncover criminals and terrorists. The project will not process speaker data in isolation, but will:
- Make extensive use of the conversational nature of speech data: if we know that A often speaks to B, then detecting A on one side of a call automatically increases the prior probability of B on the other side, even if the acoustic evidence is not reliable (due, for example, to illness, channel change or noise). Reliable diarization (determining who spoke when in the conversation) will be developed in the project as a crucial component for this analysis.
- Use call content. Standard text-independent speaker identification ignores the content of the call, while a simple sentence such as “Peter speaking” heard on two different calls can completely change the game. Developing perfect speech-to-text (S2T) engines for all possible languages is out of the scope of the project, but commercially available engines will be deployed to generate relevant content information and combine it with acoustic speaker information. For languages without S2T support, language-independent techniques such as universal phoneme sets or automatically determined acoustic units (AUD) will be used.
- Exploit meta-information, which is crucial for link analysis. Some of it is readily available (phone and IMEI numbers, geographical information, time-stamps), but the targets are aware that such information is collected and have developed ways to falsify or obscure it (one-shot usage of prepaid SIM cards, use of Internet anonymization services, etc.). A significant amount of meta-information can, however, be extracted automatically from the speech signal, for example the age, gender and accent of call participants. For instance, identifying a pimp in a child sexual exploitation network can be helped by the fact that his calls are predominantly to persons in the under-20 age category. Another interesting piece of meta-information is the environment: even if the speaker changes his cell phone number every day, he is not likely to change his favorite car. Detecting that a call took place in a given car can help the investigation.
- Apply time-relation analysis. A classical problem of speaker recognition (a speaker saying very little in a call) can be turned into an advantage, as the fact that this speaker says little is itself an identifying cue. Hierarchy and trust can also be partially inferred from this analysis.
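The first point above, raising the prior probability of B once A has been detected on the other side of the call, can be illustrated with Bayes' rule in odds form: posterior odds = likelihood ratio × prior odds. The sketch below is not the project's method; the function name `posterior_probability` and all numeric values are hypothetical and chosen only to show the effect.

```python
def posterior_probability(likelihood_ratio, prior):
    """Combine an acoustic likelihood ratio with a prior via Bayes' rule.

    likelihood_ratio: P(audio | speaker is B) / P(audio | speaker is not B)
    prior:            P(speaker is B) before listening, strictly in (0, 1)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical numbers: weak acoustic evidence (noisy channel).
weak_lr = 2.0

# Without link information, B is one of many possible speakers.
p_no_link = posterior_probability(weak_lr, prior=0.001)

# With link information: A was detected on the other side of the call,
# and A is known to speak often to B, so the prior for B is much higher.
p_with_link = posterior_probability(weak_lr, prior=0.2)

print(f"posterior without link prior: {p_no_link:.4f}")
print(f"posterior with link prior:    {p_with_link:.4f}")
```

With identical acoustic evidence, the link-derived prior moves the hypothesis "the speaker is B" from negligible to plausible, which is the effect the project intends to exploit.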
Data will be crucial for the success of the project. As the project cannot count on large amounts of real investigation or wire-tap data, most of the R&D work will be done on data from public resources: media and social networks. However, we count on exercises performed on real data by the LEAs participating in the consortium, which will provide the developers with valuable feedback.
The result of the project will be a prototype system capable of:
- ingesting a significant amount of voice data from different media, along with meta-information;
- analyzing this data in an unsupervised or lightly supervised way;
- presenting the resulting network analysis and converting it to forms that can be integrated with standard investigation software, such as IBM i2 Analyst's Notebook.