Combining computer vision techniques with speech technologies for the fight against crime

ROXANNE combines multiple analytics on various modalities (audio, text, image, metadata) to support law enforcement agencies (LEAs) in their investigations. In this post, we review how analysts and investigators can benefit from the significant progress achieved in computer vision in recent years, especially when it is combined with speech or network analysis.

 The ROXANNE project combines advanced audio, text, and image processing technologies to support LEAs in their daily work and to accelerate their investigations by focusing their attention on potentially relevant information hidden in large volumes of data.

Although the project focuses mainly on speech technologies, ROXANNE also studies whether additional modalities can improve the performance of the targeted capabilities. In this context, image and video processing technologies can extract valuable information from the sets of images or videos analyzed in criminal investigations related to drug trafficking, child sexual abuse, or other cases of interest to end users.

Over the last ten years, computer vision has seen very significant performance gains thanks to the availability of large annotated datasets (e.g. ImageNet) and of powerful GPU computing, which enables faster training of deep learning-based algorithms for most computer vision tasks, including:

  • Image classification (assigning a set of keywords to images)
  • Object detection and recognition in images or videos (localizing the position of specific objects)
  • Image semantic segmentation (assigning a class to each pixel)
  • Image content-based indexing and retrieval
  • Action recognition in videos
  • Manipulation detection in images or videos (e.g. deepfake detection)



Figure 1: Example of keyword extraction from image for filtering purposes


Computer vision is commonly used to support forensic investigators in many ways: person clustering and diarization (i.e. identifying where the same person appears in a video or across videos), gait analysis, camera identification (through camera fingerprinting: recognizing the specific noise patterns of a given camera model or of an individual camera [1]), and the identification of objects of interest in large sets of images or videos (e.g. weapons, drugs, flags of terrorist groups). Traditional computer vision techniques can also be used to estimate the underlying geometry of an image (e.g. using automatically estimated vanishing points) in order to derive measurements from 2D images (for instance, the height of a person).

In ROXANNE we are interested in assessing how computer vision can discover or confirm links between entities (e.g. devices, persons) belonging to a network built either from the processing of image or video data, or from another modality. Such networks can be further exploited by investigators through advanced network analysis tools. Typically such networks include:

  • Networks built from Call Detail Records (CDRs) between phone numbers (each node represents a phone number, each edge represents calls between two phones, and edge strength may be proportional to the number of calls).
  • Networks built from sets of video clips (each node may represent a cluster of similar voice signatures, also called voiceprints, presumably belonging to the same person; an edge between two clusters indicates the joint presence of two voices in the same video clip).
  • Networks built from combined exploitation of CDRs and lawfully intercepted calls in which each node represents a voice cluster (built from voiceprint clustering) and edges represent calls (from CDRs) between the voice clusters.
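As an illustration of the first kind of network, the sketch below builds an undirected call graph from a handful of CDR-like records, with an edge strength proportional to the number of calls. The record format, the phone numbers, and the function names are hypothetical and are not taken from the ROXANNE Platform; real CDRs carry many more fields (timestamps, durations, and so on) that are ignored here.

```python
from collections import Counter

# Hypothetical CDR records reduced to (caller, callee) pairs.
cdrs = [
    ("+100", "+200"),
    ("+200", "+100"),
    ("+100", "+300"),
    ("+100", "+200"),
]


def build_call_network(records):
    """Return {frozenset({a, b}): call_count} for an undirected call graph."""
    edges = Counter()
    for caller, callee in records:
        if caller == callee:
            continue  # ignore self-calls in this sketch
        # frozenset makes (a, b) and (b, a) count towards the same edge
        edges[frozenset((caller, callee))] += 1
    return edges


def edge_strengths(edges):
    """Scale call counts so that the busiest edge has strength 1.0."""
    top = max(edges.values(), default=0)
    if top == 0:
        return {}
    return {pair: count / top for pair, count in edges.items()}


network = build_call_network(cdrs)
strengths = edge_strengths(network)
```

Each key of `network` is a pair of phone numbers (a node pair) and each value is the number of calls between them. Normalizing by the busiest edge is only one possible choice of edge strength.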

Most of the computer vision capabilities described above can support investigators through enhanced speed and accuracy in identifying information of interest among large volumes of data. When sets of images and videos are found on seized phones or computers related to a case, the following capabilities can help investigators to focus their attention on a specific suspect or device or reveal some hidden links between persons or devices involved in the case:

  • Highlight images of potential interest (e.g. showing drugs, weapons, sexual abuse/exploitation material…)
  • Find the same image on two devices (so-called near-duplicate image search: finding two instances of the same image even if the images have been cropped, resized, compressed, or even manipulated)
  • Evaluate weighted links between videos found on different devices where the weights depend on the level of joint occurrence of the following detections:
      • The same place observed in both videos
      • The same person observed in both videos
      • The same voice heard in both videos
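The weighted links between videos can be sketched as follows, assuming that upstream components have already produced, for each video, identifiers for the recognized places, face clusters, and voice clusters. The video names, the identifiers, and the per-cue weights are all hypothetical values chosen for illustration, and summing one weight per shared cue is only one possible scoring scheme.

```python
# Hypothetical detections per video, keyed by "device/clip".
videos = {
    "deviceA/clip1": {"places": {"p1"}, "faces": {"f1", "f2"}, "voices": {"v1"}},
    "deviceB/clip7": {"places": {"p1"}, "faces": {"f2"}, "voices": {"v3"}},
    "deviceB/clip9": {"places": {"p4"}, "faces": {"f9"}, "voices": {"v1"}},
}

# Assumed contribution of each cue to the link weight (illustrative values).
CUE_WEIGHTS = {"places": 1.0, "faces": 2.0, "voices": 2.0}


def link_weight(a, b, cue_weights=CUE_WEIGHTS):
    """Sum the weights of the cues for which two videos share a detection."""
    return sum(w for cue, w in cue_weights.items() if a[cue] & b[cue])


def device_of(video_id):
    """Extract the device name from a 'device/clip' identifier."""
    return video_id.split("/", 1)[0]


def cross_device_links(all_videos):
    """Return {(video_u, video_v): weight} for videos on different devices."""
    links = {}
    ids = sorted(all_videos)
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            if device_of(u) == device_of(v):
                continue  # only links across devices are of interest here
            weight = link_weight(all_videos[u], all_videos[v])
            if weight > 0:
                links[(u, v)] = weight
    return links


links = cross_device_links(videos)
```

In this toy example, the first two clips share a place and a person, while the first and third share only a voice, so the first link receives the higher weight.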

Finally, the joint use of speech and visual modalities can disambiguate diarization results obtained from a single modality or improve their accuracy.

In ROXANNE, we will study how combining some of these computer vision techniques with speech processing (e.g. voiceprint clustering) can improve the relevance of highlighted devices, persons, or documents (e.g. videos) and thus accelerate investigations. The selected techniques will first be evaluated and demonstrated on representative but non-sensitive data built either from open academic datasets (e.g. the Open Images dataset [2]) or from video clips extracted from the CSI TV show, which has already been used for forensic science in the domain of Natural Language Processing [4], [5]. Then, the LEAs involved in the project will be able to assess the capabilities integrated in the ROXANNE Platform on their own data and to provide feedback on the added value of the proposed computer vision capabilities.

With these computer vision capabilities, the ROXANNE Platform will give investigators additional tools to browse large amounts of data efficiently and to focus more rapidly on documents of potential interest.



[1] D. Cozzolino and L. Verdoliva, "Noiseprint: A CNN-Based Camera Model Fingerprint," in IEEE Transactions on Information Forensics and Security, vol. 15, pp. 144-159, 2020, doi: 10.1109/TIFS.2019.2916364. 

[2] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari, "The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale," IJCV, 2020.


[4] L. Frermann, S. B. Cohen, and M. Lapata, "Whodunnit? Crime Drama as a Case for Natural Language Understanding," Transactions of the Association for Computational Linguistics (TACL), 2017.

[5] M. Fabien, S. S. Sarfjoo, P. Motlicek, and S. Madikeri, "Graph2Speak: Improving Speaker Identification using Network Knowledge in Criminal Conversational Data."