Automatic Classification of Illicit Marketing
Sexual abuse is a horrible reality for many children around the world. As technology progresses and encryption and anonymity tools become more widely available on the internet, the perpetrators of these acts are increasingly hard to track. Several recent advances have aimed to automate this work, and image recognition in particular has shown great promise. While image recognition is a natural approach, since many abuses are documented and shared between perpetrators, many potential leads go unexplored if the focus is only on images and videos. This study aims to evaluate several state-of-the-art methods and models within natural language processing (NLP) for classifying text from forums dedicated to the distribution of child sexual abuse material (CSAM). Feature representation techniques such as word vectors, paragraph vectors, and the FastText algorithm were used in conjunction with the deep learning methods of multilayer perceptrons, convolutional neural networks, and long short-term memory (LSTM) networks. The models were trained and evaluated on a dataset based on forums from a Dark Net leak from last year. It was found that all models performed approximately equally, and that all of them outperformed the benchmark set by traditional logistic regression. It was also found that a clear problem is the lack of samples connected to the subject and that, if further progress is to be made, a large annotated dataset needs to be developed.