Abstract. Despite the rise of alternative communication channels, email has retained its importance and its serious, professional character. As the number of Internet users grows, however, so does the volume of spam. Spam is any unsolicited and unwanted communication; it wastes resources and overloads networks. Most spam comes from advertisers promoting their products, but some is malicious, such as phishing emails that trick recipients into revealing confidential information like website credentials or credit card details. In this work, we aim to improve spam detection by combining BERT (Bidirectional Encoder Representations from Transformers) with GraphSAGE. The embedding vectors generated by BERT represent the nodes of a graph, one node per email, and nodes are linked according to the cosine similarity of their embeddings. GraphSAGE then exploits this graph structure: rather than memorizing a fixed embedding for each node, it learns an inductive function for generating embeddings. This lets it generalize to unseen emails by sampling and aggregating the features of neighboring emails to produce robust node representations. We evaluated our model on three benchmark datasets: ENRON, SpamAssassin, and LingSpam. It achieved 98.87% accuracy, 99.81% precision, and 99.98% AUC on ENRON; 96.44% accuracy, 94.43% precision, and 98.86% AUC on SpamAssassin; and 99.20% accuracy, 96.98% precision, and 99.55% AUC on LingSpam, outperforming several state-of-the-art baselines. These results support the robustness of our approach in distinguishing spam from legitimate emails.
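The graph construction and neighbor aggregation described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the two-dimensional vectors stand in for real BERT embeddings, and the similarity threshold of 0.8, the sample size, and the function names (`build_graph`, `mean_aggregate`) are assumptions chosen for illustration. The aggregation step shows only the sample-and-average idea behind GraphSAGE; the learned weight matrices and nonlinearity of a trained model are omitted.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(embeddings, threshold=0.8):
    """Link two emails when their embeddings are similar enough.

    The threshold value is an assumption for this sketch.
    """
    n = len(embeddings)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def mean_aggregate(embeddings, neighbors, max_samples=2, seed=0):
    """One GraphSAGE-style step without learned weights.

    For each node, sample up to max_samples neighbors, average their
    vectors, and concatenate that mean to the node's own vector.
    """
    rng = random.Random(seed)
    out = []
    for i, h in enumerate(embeddings):
        sampled = rng.sample(neighbors[i], min(max_samples, len(neighbors[i])))
        if sampled:
            agg = [
                sum(embeddings[j][d] for j in sampled) / len(sampled)
                for d in range(len(h))
            ]
        else:
            agg = [0.0] * len(h)  # isolated node: no neighbor signal
        out.append(list(h) + agg)
    return out


# Toy vectors standing in for BERT embeddings of four emails.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = build_graph(toy_embeddings)
reps = mean_aggregate(toy_embeddings, graph)
```

Because the aggregation depends only on a node's neighbors and not on a per-node lookup table, a previously unseen email can be embedded by linking it to similar emails and applying the same function, which is the inductive property the abstract refers to.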