Multimodal network integration improves criminal entity network construction with the help of deep neural networks
In criminological studies, gathering complete information to build a global view of an entire network structure is difficult. In practice, the models built to depict criminal networks are usually partially uncertain and are constructed from heterogeneous data sources. To establish a better view of the whole criminal network structure, we propose the cross-network identity technique. Cross-network identity, also referred to as “Cross-Domain Entity Resolution” or “Entity Linkage”, is the problem of finding nodes in different networks that refer to the same entity. Such networks are usually constructed from different data sources. The cross-network identity algorithm is therefore employed to join different networks based on information about their entities. Such information includes, but is not restricted to, properties observed and recorded across datasets (location, name, time), such as the investigation cases in ROXANNE.
A broad view of the user identity linkage problem
Source: Wang, Y., Feng, C., Chen, L. et al. User identity linkage across social networks via linked heterogeneous network embedding. World Wide Web 22, 2611–2632 (2019). https://doi.org/10.1007/s11280-018-0572-3
Traditional methods:
Cross-network identity methods can be roughly separated into two eras: the classical one, with more traditional approaches, and the modern one, dominated by deep neural networks. The former emerged from the need to identify accounts belonging to the same user across different social media networks such as Twitter or Facebook. A natural person leaves different user data, exhibits different behaviour patterns and interacts with different user groups on various social media networks. Finding the correspondence among a user’s different accounts helps to build a more comprehensive understanding of that user's personality and interests. It also addresses problems that cannot be solved with data from a single site, such as the cold-start and data sparsity problems.
In traditional approaches, data such as usernames, locations, user-generated content and other interactive entities are leveraged to support entity linkage. Such raw data are usually converted into machine-readable entity features, and the resulting feature space defines a distance between entities. Generally speaking, the shorter the distance between two entities, the more likely they belong to the same natural person. A wide variety of distance metrics is available: string similarities such as Jaro-Winkler distance, Jaccard similarity or Levenshtein distance for text-based features; mean squared error or peak signal-to-noise ratio for image data; and neighbourhood-based or embedding-based models for the topological structure around an entity in the graph.
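As an illustration of the text-based metrics named above, the following is a minimal sketch in plain Python. The function names (levenshtein, jaccard, name_distance), the use of character bigrams for the Jaccard similarity and the normalisation by the longer string are assumptions made for this example; they are not taken from the ROXANNE implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over sets of character n-grams (bigrams by default)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def name_distance(a: str, b: str) -> float:
    """Case-insensitive Levenshtein distance scaled to [0, 1] (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a.lower(), b.lower()) / longest if longest else 0.0
```

With this sketch, two hypothetical usernames such as "john_smith" and "Jon_Smith" are one edit apart, giving a normalised distance of 0.1, which would count as a strong hint that both accounts belong to the same person.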
Based on the extracted features mentioned above, a whole zoo of supervised, semi-supervised and unsupervised models can be constructed. Whether the networks come with ground truth (supervised), a general understanding of the network structure matters more than labelled matches (unsupervised), or ground truth is available for only part of the data (semi-supervised), there is a corresponding modelling scheme to help.
Modern (DNN) methods:
In the modern era of the user linkage problem, a growing number of embedding-based approaches have emerged and now dominate the field. The main idea behind them is to learn network structures and to map feature vectors into a latent representation space automatically. This is a major step forward from the empirical, highly expensive feature engineering of traditional approaches. The outputs of such embedding-based methods, known as embedding vectors, not only improve the performance of cross-network entity matching; they can also be used as feedback from the network analysis components to improve the performance of the speech and text analysis components within ROXANNE’s pipeline.
Source: Junshuang Wu, Richong Zhang, Yongyi Mao, Hongyu Guo, Masoumeh Soflaei, and Jinpeng Huai. 2020. Dynamic Graph Convolutional Networks for Entity Linking. Proceedings of The Web Conference 2020. Association for Computing Machinery, New York, NY, USA, 1149–1159. DOI:https://doi.org/10.1145/3366423.3380192
The proximity-based and the feature-based approaches are two distinct solutions that both meet the ROXANNE project's requirements. A proximity-based approach works by first calculating entity proximity within each individual network, then returning the most similar entities from the other networks. Such entity proximity can be described using, for example, either the probability of observing an edge connecting the two entities (the so-called first-order proximity), or a similarity that takes into account not only the two entities but also their respective neighbours (the so-called second-order proximity). Unlike the proximity-based approach, a feature-based approach treats the source network and the target network as a single hyper-graph and learns latent (hyper-)network features from it. For an entity in the source network, the approach ranks all entities in the target network according to their probability of being the counterpart of the source entity. The highest-ranked entity is taken as the match. Matching pairs in such a method share the same latent representation space, so the representations are more compatible with subsequent tasks.
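The two notions of proximity and the ranking step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the method used in ROXANNE: first-order proximity is modelled as the sigmoid of the dot product of two embedding vectors (a common choice in embedding methods), second-order proximity is approximated by the Jaccard overlap of the neighbour sets, and candidates are ranked by cosine similarity. All function names and the toy data are hypothetical.

```python
import math


def first_order_proximity(u, v):
    """Modelled probability of an edge between two nodes: sigmoid of the dot product."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 / (1.0 + math.exp(-dot))


def second_order_proximity(adj, a, b):
    """Neighbourhood similarity, approximated here as Jaccard overlap of neighbour sets."""
    na, nb = adj.get(a, set()), adj.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def rank_candidates(source_vec, target_vecs):
    """Return (target_id, score) pairs sorted from most to least similar."""
    scored = [(tid, cosine(source_vec, vec)) for tid, vec in target_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings for one source entity and three target entities (made-up values).
targets = {"t1": [0.0, 1.0], "t2": [2.0, 0.0], "t3": [1.0, 1.0]}
best_id, best_score = rank_candidates([1.0, 0.0], targets)[0]
```

In this toy example the top-ranked candidate is "t2", because its vector points in the same direction as the source vector. In practice the vectors would come from a trained embedding model, and a score threshold would typically decide whether the top-ranked candidate is accepted as a match at all.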
Conclusion
To conclude, the user identity linkage technique fuses heterogeneous partial information and helps to construct better models of network structure. DNN-based approaches raise the accuracy of this linkage further. Integrating a DNN-based user identity linkage module into the ROXANNE project will give us a better understanding of the heterogeneous data and therefore enable more insightful analysis of real-world criminal networks.