NLP technologies against online crime
The Big Data landscape is ever-expanding. The many technologies behind this term are establishing their presence and strive to solve everyday yet chronic issues in areas such as administration, education, healthcare, and security. In the latter area, the focus has recently shifted primarily to cybersecurity. It is common practice for criminal offenders to use web services to organize and commit crimes. More than ever before, methods and approaches based on Big Data are needed to prevent, predict, and investigate criminal cases. This can be achieved through extensive quantitative and qualitative analysis of the available crime-related information, while trying to establish strong relations between cause and effect.
It can be said that a police officer with access to Big Data technologies, and to the algorithms built on them, is well equipped for crime prevention. These tools maximize the output of an investigation while minimizing the effort required from the person performing the job. In some cases, law enforcement agencies succeed in detecting crime before it happens. In other cases, when a huge amount of data must be analyzed, an algorithmic approach can be extremely beneficial for identifying and investigating a crime that has already been committed.
The usage of Big Data in Artificial Intelligence (AI) and Machine Learning (ML) technologies offers new opportunities. The principle, which holds beyond the scenario of fighting crime, is that ML models require an extensive amount of data to deliver the best results. In the ROXANNE project, the technical development aims to significantly enhance criminal network analysis based on text, speech, language, and video technologies.
For instance, some of the most prominent crime-related social problems are the detection of predatory communications, online offenders, child abuse, and cyber grooming in online conversations. Natural Language Processing (NLP) is currently one of the dominant AI techniques; it treats natural language as a link between humans and computers. The goal of NLP is to read, understand, decipher, and derive insights from extensive analysis of human-language input. Web and social media applications and platforms form a complex, multidimensional data landscape where NLP can be used.
A brief description follows of the technical background of NLP techniques applied to the identification and detection of crime offenders or, in other words, to the never-ending fight against cybercrime. Several NLP techniques have been developed and can be used for broad text analysis. The ones NLP experts use most often are: (i) Bag-of-Words (BoW), (ii) Word2Vec (W2V) and Word Embeddings (WE), (iii) Term Frequency-Inverse Document Frequency (TF-IDF), and (iv) Rule-Based (RB) approaches.
Each one has different characteristics and suits different text analysis tasks. BoW is a method that weights words by counting their occurrences in a text dataset. It is used to extract features based on word frequency, for example when comparing texts with similar content. In the W2V and WE techniques, the high-level approach is to replace words with encoded vectors. The vectors that represent the encoded document can then be used for classification. TF-IDF extends the BoW method by also taking into account how frequently a term occurs across the texts of the whole collection, not only in the text being examined. RB approaches, among the oldest in NLP, focus on patterns that match or parse text, and are often used to fill in the blanks. In addition to the above-mentioned NLP techniques, Machine Learning (ML) classifiers can support the analysis needed to predict, identify, or solve a criminal case. Algorithmic approaches that support such classification tasks include (i) Logistic Regression, (ii) Ridge, (iii) Naive Bayes, (iv) Support Vector Machine (SVM), and (v) Neural Networks (NNs).
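To make the difference between BoW counts and TF-IDF weights concrete, here is a minimal sketch in plain Python. It is an illustration only and not the ROXANNE implementation: the function names, the simple regular-expression tokenizer, the particular TF-IDF variant (term frequency normalized by text length, unsmoothed logarithmic inverse document frequency), and the three toy messages are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Bag-of-Words: map each word to the number of times it occurs in the text."""
    return Counter(tokenize(text))


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / text length, idf = log(N / document frequency).
    """
    bags = [bag_of_words(doc) for doc in documents]
    n_docs = len(bags)

    # Document frequency: in how many texts does each term appear at least once?
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Hypothetical toy messages, invented for illustration only.
messages = [
    "meet me at the park after school",
    "the package is at the park",
    "send the money after the meeting",
]

if __name__ == "__main__":
    print(bag_of_words(messages[1]))
    for doc_weights in tf_idf(messages):
        print(sorted(doc_weights.items(), key=lambda kv: -kv[1])[:3])
```

In this sketch, a word such as "the", which occurs in every message, receives a TF-IDF weight of zero, while a word that occurs in only one message is ranked highest. This is the property that makes TF-IDF features more discriminative than raw counts when they are passed on to a classifier.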
The ROXANNE consortium is carefully exploring its contribution to the extremely sensitive task of crime prevention, prediction, and identification. During the first 18 months of the project, technical partners worked on several NLP sub-tasks using the ROXANNE simulated dataset (ROXSD) and the CSI dataset. Both datasets, the former produced through simulation with volunteers and a screenplay and the latter based on several episodes of the famous TV series, are meant to be close enough to real-world data, which is currently not available to the consortium. Several NLP experiments related to topic detection, named entity recognition (NER), authorship attribution, semantic keyword extraction, and relation analysis based on the extracted entities have been carried out during the project. At the same time, the connection between the NLP sub-tasks and Network Analysis is continuously examined, in order to identify hidden criminal networks in addition to individual crime offenders.