ROXSD: a Simulated Dataset of Communication in Organized Crime


First version from 2021:

The first description of ROXSD was released in 2021 as a submission to the SPSC symposium. The paper is available here.


Latest version (status 5/2023):

The latest version of the ROXSD data is described in the D4.3 document (which will be available here once the deliverables are accepted).


A short description:

ROXSD audio:

In its latest version, v3.0, the ROXSD calls subset contains 432 intercepted telephone conversations recorded into 481 audio files, encoded in 8 kHz, 16-bit, stereo WAV format. The dataset is composed of different types of calls: standard phone calls, in which the caller calls the receiver’s telephone number; teleconference calls, in which the caller calls a third person while already talking to the receiver; and calls made to a web conferencing service (Zoom, Webex), where the callers dial a common telephone number (the service’s dial-in number) in order to talk to each other.
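As a minimal sketch of how the stated encoding could be checked, the following uses Python's standard `wave` module to test whether a file is 8 kHz, 16-bit, two-channel audio. The helper names `matches_roxsd_format` and `make_silent_wav` are hypothetical, and the silent test signal is synthetic; nothing here is taken from the ROXSD tooling.

```python
import io
import wave


def matches_roxsd_format(wav_file) -> bool:
    """Return True if the WAV stream is 8 kHz, 16-bit, two-channel audio."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == 8000    # 8 kHz sampling rate
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16-bit
            and wav.getnchannels() == 2   # stereo
        )


def make_silent_wav(rate: int, channels: int, seconds: int = 1) -> io.BytesIO:
    """Build an in-memory 16-bit WAV of silence (synthetic, for illustration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (2 * channels * rate * seconds))
    buf.seek(0)
    return buf


print(matches_roxsd_format(make_silent_wav(8000, 2)))   # expected format
print(matches_roxsd_format(make_silent_wav(16000, 1)))  # different rate and channel count
```

The same function also accepts a file path instead of an in-memory buffer, since `wave.open` takes either.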

The difference between the number of calls (432) and the number of recordings (481) arises because some calls were intercepted more than once, on different sides of the conversation, which is a consequence of the variety in call types: 270 calls are intercepted only on the caller’s side, 111 only on the receiver’s side, and 45 on both sides. There is an additional teleconference call which was intercepted a total of 10 times. As a result, some of the recordings are very similar in content. However, they are not exact copies of each other, for the following reasons:

(i) The interception begins on the caller’s side as soon as the caller finishes dialing the receiver’s telephone number. Hence, the ringing tone, as well as any sounds or speech which the caller’s phone picks up before the connection is established, is captured in the recording intercepted on the caller’s side. For the same reason, the receiver’s intercepted recording is a few seconds shorter than the caller’s. There are also cases where, although both sides are intercepted, the receiver’s phone is not reachable, so there is no recording from the receiver (in such cases, either the receiver’s voicemail message or the operator’s out-of-reach message can be heard in the caller’s recording).
(ii) For teleconference calls involving three (or more) parties, a new interception is initiated when the caller calls a third (fourth, ...) person in order to connect them into the existing conversation.
(iii) For web conferencing, where multiple parties call the same (operator) telephone number, each party’s interception begins when they join the conference room.
(iv) The audibility of speech can differ between the recordings because of background or microphone noise introduced by one of the parties, or because of issues with their interception equipment.

These inexact copies of the same phone conversation are intentionally left in the dataset, as these artefacts closely reflect the nature of interception in the real world.
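The recording count can be reproduced from the interception figures above. The short tally below is only a sanity check; it assumes that each call intercepted on both sides contributes exactly two recordings, which the text implies but does not state outright. The variable names are illustrative.

```python
# Interception counts as given in the description above.
caller_only = 270                 # calls intercepted on the caller's side only
receiver_only = 111               # calls intercepted on the receiver's side only
both_sides = 45                   # calls intercepted on both sides
teleconference_recordings = 10    # one teleconference call intercepted 10 times

# Assumption: a call intercepted on both sides yields two recordings.
recordings = (
    caller_only
    + receiver_only
    + 2 * both_sides
    + teleconference_recordings
)
print(recordings)  # 481
```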

ROXSD video:

To illustrate the value of exploiting the image modality, ROXSD was complemented with images and videos representative of files which may be found on a seized smartphone or computer, or retrieved from the internet. These are mainly selfie images or videos in which various people are heard and/or seen while observing certain objects or locations. The captured images and videos enable the evaluation of the face and scene matching technologies used in the Autocrime platform to enrich the speaker network with additional nodes and edges (for instance, an edge is added between two speakers’ nodes when both persons are found, either through their voice or their face, in the same video).
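The edge-enrichment rule described above can be sketched as follows. This is an illustrative sketch only, not the Autocrime implementation: the function `cooccurrence_edges`, the video identifiers, and the speaker labels are all hypothetical, and the detections are assumed to have already been produced by voice or face matching.

```python
from itertools import combinations


def cooccurrence_edges(detections):
    """Return undirected edges between persons detected in the same video.

    detections maps a video id to the set of person ids found in that
    video, by voice or by face (hypothetical input format).
    """
    edges = set()
    for persons in detections.values():
        # Sorting gives each undirected edge a single canonical form.
        for a, b in combinations(sorted(persons), 2):
            edges.add((a, b))
    return edges


# Hypothetical detections for three videos.
detections = {
    "video_001": {"speaker_A", "speaker_B"},
    "video_002": {"speaker_B"},
    "video_003": {"speaker_A", "speaker_B", "speaker_C"},
}

edges = cooccurrence_edges(detections)
print(sorted(edges))
```

A video with a single detected person adds no edge, and a pair seen together in several videos still yields only one edge; weighting edges by the number of shared videos would be a natural extension.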

ROXHOOD - social media:

The ROXHOOD dataset extends ROXSD by adding social media communications.



Modalities covered: audio (voice), video, text (including social media), and metadata (speakers, devices, telephone numbers, location, network of people).


Technologies that can benefit from the ROXSD data:

Audio: multilingual speech recognition, speaker identification (open set), speaker clustering, language recognition, voice activity detection, word boosting

Text: multilingual entity recognition, multilingual topic detection, co-reference resolution, relation extraction

Video: face characterisation, scene characterisation

Network analysis: social influence, outlier detection, community detection, link prediction, cross-network analysis


Accessing ROXSD database:

The ROXSD dataset is part of the foreground of the project and will therefore be made available to other bodies and institutions (as required by the Grant Agreement) for further research and development in security-related areas.

For more information, please contact: petr dot motlicek at idiap dot ch