Proceeding Paper
Real-Time Phishing URL Detection Using Machine Learning †
Atta Ur Rehman 1, * , Irsa Imtiaz 1 , Sabeen Javaid 1 and Muhamad Muslih 2
1 Department of Software Engineering, University of Sialkot, Sialkot 51040, Pakistan; [email protected] (I.I.); [email protected] (S.J.)
2 Department of Information System, Nusa Putra University, Sukabumi 43155, Indonesia; [email protected]
* Correspondence: [email protected]
† Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.
Abstract
The study investigates the use of powerful machine learning approaches to the real-time
detection of phishing URLs, addressing a critical cybersecurity concern. The dataset we
utilized in this research work was collected from the University of California Irvine (UCI)
Machine Learning Repository. It has 235,795 instances with fifty-four distinct parameters.
The label class is of binomial type and has only two target classes. We used a range of
complex algorithms, including k-nearest neighbor, naive Bayes, decision trees, random
forests, and random tree, to assess the discriminative characteristics retrieved from URLs.
The random forest classifier beat the other classifiers, reaching the greatest accuracy of
99.99%. The study demonstrates that these models achieve superior accuracy in identifying
phishing attempts, significantly outperforming traditional detection methodologies. The
findings underscore the potential of machine learning to provide a scalable, efficient, and
robust solution for real-time phishing detection. Integrating these models into
existing security solutions can play a critical role in sustaining defenses
against continuously evolving and persistent phishing schemes.
Keywords: phishing detection; machine learning; real-time security; URL classification;
random forest
Academic Editors: Debopriyo Roy,
George F. Fragulis and Peter Ilic
Published: 25 September 2025
Citation: Rehman, A.U.; Imtiaz, I.;
Javaid, S.; Muslih, M. Real-Time
Phishing URL Detection Using
Machine Learning. Eng. Proc. 2025,
107, 108. https://doi.org/
10.3390/engproc2025107108
Copyright: © 2025 by the authors.
Licensee MDPI, Basel, Switzerland.
This article is an open access article
distributed under the terms and
conditions of the Creative Commons
Attribution (CC BY) license
(https://creativecommons.org/
licenses/by/4.0/).
Eng. Proc. 2025, 107, 108
1. Introduction
In today's digital-first world, a website is a central Internet-based tool for reaching an
audience. It is an important part of an organization's strategic plan, helping to boost brand
recognition and to put the business in front of prospective clients, partners, and investors.
Phishing is a modern type of web scam in which criminals try to deceive a person into
submitting personal details, account numbers, credit card numbers, or passwords by
imitating the image of authentic businesses, such as banks or even restaurants. Attackers
often send fake email messages that look authentic in order to direct recipients to a link
containing malicious code or to make them reveal their personal details. These strategies
exploit victims' trust for personal gain, which results in monetary fraud, identity theft, or
unauthorized control of personal accounts. Phishing attacks are a constant threat in
cyberspace, as they can be aimed at anyone, regardless of age or tech-savviness.
One specific area of phishing is the attack carried out through a website. Phishing URLs
represent an ever-growing problem since people can be easily manipulated and tricked into
doing something they would not normally do. A cybercriminal mimics a legitimate website by
making minor changes in spelling or adding slight differences that create an
authentic-looking copy. Such URLs, crafted to resemble those of a well-designed legitimate
website, trick users into providing personal details that are then used to defraud them.
Because of the anonymity that the internet affords, attackers rarely face legal action and
can carry out phishing attacks freely. No measure fully removes the risk of becoming a
victim; the best protection is to be aware of the signs of phishing and to exercise caution
and skepticism when entering unfamiliar web addresses. Phishing campaigns worldwide take
various forms, such as malicious advertisements, fraudulent emails, messages, and posts.
According to the 2024 Security Risk Report [1], phishing URLs have had a dramatic worldwide
impact, with 94% of the surveyed firms falling victim to such attacks. These incidents
have serious effects, with 96% of the impacted organizations reporting financial losses,
57% seeing revenue declines owing to client attrition, and 40% suffering reputational harm.
A total of 51% of data breaches resulted in disciplinary action against employees, with
67% of individuals implicated experiencing personal consequences, underlining the critical
necessity for strong information security defenses. This demonstrates the broad impact of
phishing URLs and emphasizes the crucial need for enhanced digital security measures.
Phishing URLs continue to pose a substantial danger to internet security, and proactive
measures are required to reduce their impact and protect customers from future
abuse. One solution is to check every link before using it to log in to a website or to
enter personal information, bank account details, or any other sensitive information.
However, manually checking each website URL is an ineffective and unsophisticated
method of phishing detection. A more frequently used strategy is database
comparison, which compares a requested URL to a list of known phishing sites. If a match
is identified, access to the site is restricted, and the user is notified. Despite its utility, this
strategy fails if the phishing URL has not already been reported.
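The database comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation of any particular blocklist service: the function names, the normalization rule, and the blocklist entry are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Lower-case the host and drop scheme and fragment so trivial variants match."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def is_blocked(url: str, blocklist: set[str]) -> bool:
    """Return True when the normalized URL is a known (reported) phishing URL."""
    return normalize_url(url) in blocklist


# Hypothetical blocklist entry, for illustration only.
BLOCKLIST = {normalize_url("http://examp1e-bank.test/login")}

print(is_blocked("HTTP://Examp1e-Bank.test/login#top", BLOCKLIST))  # reported URL
print(is_blocked("http://examp1e-bank.test/login2", BLOCKLIST))     # unreported variant
```

The second call illustrates the weakness noted above: a slightly altered URL that has not yet been reported is not caught by the lookup.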
Keeping such a database updated with the latest phishing URLs requires continuous effort
because many of these sites are deactivated daily, and the URLs are removed from the
report after seven days. A flaw of this scheme is that attackers can reuse the same sites
once they are delisted. Because of these disadvantages, researchers have turned to
machine learning to help identify phishing URLs. The machine learning approach has
attracted considerable interest because it can enhance the detection of phishing URLs and
overcome the drawbacks of the database approach. Classification algorithms and frameworks
are important in identifying phishing websites since several attributes are often
concealed and attackers use varied patterns. These algorithms work by analyzing the
context, content, and patterns of URLs to flag potential threats. The classifiers are
trained to differentiate between genuine and phishing URLs based on parameters such as
the presence of certain keywords, misspellings, special characters, and domain blacklist
status.
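As a rough illustration of the kind of URL parameters mentioned above, the following Python sketch derives a few lexical features from a URL string. The feature names loosely echo attributes of the dataset used in this paper, but the keyword list and the exact definitions are assumptions made for this example, not the dataset's own extraction rules.

```python
from urllib.parse import urlsplit

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def lexical_features(url: str) -> dict[str, float]:
    """Compute simple lexical features from a URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    length = len(url)
    digits = sum(ch.isdigit() for ch in url)
    letters = sum(ch.isalpha() for ch in url)
    specials = sum(not ch.isalnum() for ch in url)
    return {
        "url_length": length,
        "domain_length": len(host),
        "is_https": float(parts.scheme == "https"),
        # Dots beyond the one separating the domain from its TLD.
        "no_of_subdomains": max(host.count(".") - 1, 0),
        "letter_ratio": letters / length if length else 0.0,
        "digit_ratio": digits / length if length else 0.0,
        "special_char_ratio": specials / length if length else 0.0,
        "keyword_hits": sum(k in url.lower() for k in SUSPICIOUS_KEYWORDS),
    }


features = lexical_features("https://secure-login.example.com/verify?id=123")
print(features)
```

A classifier would then be trained on vectors of such features rather than on the raw URL text.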
In our experiments, machine learning classifiers such as random forests performed
remarkably well in filtering phishing URLs. Because random forests handle large datasets
with many attributes well, they can be used to assess a wide range of URL features. They
separate different kinds of URLs by combining the votes of many decision trees, each of
which splits the data on feature thresholds. When trained on a range of well-labeled
datasets, these classifiers have significantly enhanced the accuracy and efficiency of
phishing URL detection systems. In our research, we offer an integrated machine learning
framework with the goal of improving the detection and prevention of phishing site
attacks. This study uses a collection of phishing URLs that were obtained from
trusted resources. We then assess several machine learning methods to identify the most
accurate approach. We trained our models using the phishing sites dataset from the
University of California Irvine (UCI) Machine Learning Repository. The Phishing URL
dataset is among the largest available, including 100,945 phishing sites and 134,850
legitimate sites. Most of the URLs examined during the dataset's construction are recent.
The source code of the webpage and the URL are examined to extract various
characteristics. We use RapidMiner to train our models for effective and accurate
phishing detection, which ensures consistent results.
2. Literature Review
Phishing attacks are a huge global danger to digital security, targeting both individuals
and companies in order to obtain sensitive information, including passwords, credit card
numbers, and personal details. These assaults usually utilize illegal emails, websites, or
communications that replicate trustworthy sources, necessitating early notice and prevention. Machine learning techniques have demonstrated considerable promise in enhancing
the detection and prevention of phishing URL attacks by analyzing various data attributes
to identify fundamental patterns. Supervised machine learning algorithms are trained on
datasets containing features such as URL characters and metadata. By accurately
characterizing these features, a model can better distinguish fake URLs from real ones,
which reduces the likelihood of successful phishing attacks. This literature review
examines contemporary research on integrated machine learning frameworks and models for
phishing attack detection and prevention, emphasizing their techniques, performance, and
contribution to the field.
Machine learning, particularly that based on supervised learning, is widely used for
phishing attack detection. These algorithms are trained on datasets that include a
variety of URL characteristics, which the ML model uses to identify patterns across
different phishing URLs. A large amount of work has already been carried out in this
field, and some of it is presented here. Advances in machine learning (ML) have enabled
the construction of new frameworks to detect these forms of scams. These frameworks
implement different techniques to enhance the accuracy and sensitivity of detection
systems. Yogendra Kumar and Basant Subba [2]
have contributed by proposing a security framework that involves several machine learning
algorithms, namely random forest (RF), Neural Network (NN), Support Vector Machine
(SVM), Logistic Regression (LR), and K-Nearest Neighbor (KNN). Their approach, carried
out in a Google Colab Jupyter notebook and written in Python, showed 99.72% accuracy.
Gupta, Krishna Yadav, and Imran Razzak [3] outlined a different method: lexical-based
real-time identification of phishing URLs using machine learning. Their system achieved
99.57% accuracy with the RF method and was also evaluated with KNN, LR, and SVM.
The drawbacks of their solution, which was constructed using Python (3.10), ML, and DL,
are a longer reaction time, higher dependence on third parties, and the inability to track
newly launched websites. This study identifies areas for improvement in reaction time
and flexibility while also highlighting that machine learning has the capacity to identify
phishing in real time. Lizhen Tang and Qusay H. Mahmoud [4] investigated a variety of
anti-phishing strategies, including list-based, heuristic, and ML approaches, as well as
natural language processing.
Their study demonstrated the need for increased accuracy and covered algorithms such as
SVM, decision tree, RF, KNN, and Bagging, reporting 99.57% accuracy with a real-time
system that required very little processing time. This study highlights important areas
where accuracy may be improved and gives a broad picture of the current state of ML-based
phishing detection. Kumar and Subba [5] proposed an automated real-time system to detect
phishing URLs based on the NB, SVM, LR, ADB, decision tree (DT), gradient boosted trees
(GDT), perceptron (PE), KNN, and RF algorithms. In a real-time configuration, they
achieved 99.72% accuracy and pointed out the need to include time-varying characteristics
of URLs in the future in order to enhance detection. This research shows that
incorporating dynamic characteristics is useful for enhancing real-time performance and
that there is value in combining multiple approaches. Mehmet Korkmaz, Ozgur Koray
Sahingoz, and Banu Diri [6] presented a machine learning-based phishing detection system
that uses LR, KNN, SVM, DT, NB, XGBoost, RF, and ANN. They achieved an accuracy of 94.59%
with the RF classifier and planned to improve the system's accuracy and response time in
the future by incorporating deep learning models and hybrid algorithms. As their paper
shows, deep learning and hybrid models may help to enhance phishing detection systems.
Ammara Zamir,
Hikmat Ullah Khan, and Tassawar Iqbal constructed a phishing website detection model
with 97.3% accuracy using several ML techniques and feature selection algorithms [7].
They recommend evaluating the approach in real-time mode and complementing it with other
feature extraction algorithms. This work establishes the effectiveness of feature
selection techniques and shows that several extraction models can be used together to
improve phishing detection. Ali Aljofey, Qingshan Jiang, Qiang Qu, Mingqing Huang, and
Jean-Pierre Niyigena [8] proposed a phishing detection model based on a CNN applied to
URLs, which achieved an accuracy of 98.58%. They noted some limitations, however,
including a long training time and the likelihood that some websites are misclassified
because they contain registration and login pages. This work shows that deep learning
models are viable and identifies further areas that could be optimized to enhance
performance. Subsequently, by combining the CASE feature architecture, Dong-Jie Liu,
Guang-Gang Geng, Xiao-Bo Jin, and Wei Wang [9] established an efficient multistage
phishing website detection model using CNN and LSTM for deep learning, along with ML
algorithms such as NB, DT, and RF. They achieved a TPR of 94.36% and suggested future
work on feature augmentation and model layer fusion. This study indicates that additional
multistage models and more complex feature extraction methodologies should be used to
enhance the accuracy of phishing detection. Amani Alswailem [10] used DT and other
machine learning models to develop a method of detecting phishing sites with 98.8%
accuracy. Different combinations of the dataset's features were tried, but the accuracy
remained the same with only minor variations. This suggests that the dataset relies on a
subset of its features, which can cause inconsistent accuracy in phishing URL detection.
Shouq Alnemari and Majid Alshammari [11] addressed the identification of phishing domains
by employing ANN, SVM, DT, and RF. They reported accuracies of 97.3% for RF, 96% for DT,
95% for ANN, and 94% for SVM. They suggested that future research examine a larger number
of ML algorithms for the analysis of phishing domains. This study shows that traditional
machine learning approaches remain effective and that new approaches must be developed
regularly in order to counteract phishing attacks. These investigations show that
machine learning algorithms are helpful in detecting phishing attempts, and they also draw
attention to ongoing research on enhancing live applicability, precision, and response
time [12,13]. Addressing these gaps in future research will help overcome existing
shortcomings. Several of the reviewed studies rely on decision trees. A decision tree is
a tree structure in which each internal node is a test on an attribute, each branch is an
outcome of that test, and each terminal node, called a leaf, holds a class label or a
numerical value. The technique is useful for analyzing a decision process since decision
trees are simple and easy to present graphically.
3. Proposed Methodology
The proposed technique for this research study uses machine learning (ML) classifiers
inside an integrated framework to detect phishing URLs. The study started with gathering
a dataset, the PhiUSIIL Phishing URL (Website) dataset from the UC Irvine Machine Learning
Repository. Data cleaning, normalization, and feature extraction followed. A complete
detection model was built by training ML classifiers such as KNN, NB, RF, DT, and GBM on
a merged dataset. Individual classifiers were used along with ensemble learning
techniques that combine the predictions of several classifiers to enhance the efficiency
of the model. Standard ethical procedures were observed throughout. In brief, this
research aims to develop a practical and explainable phishing URL identification system
using the ML approach that empowers cybersecurity professionals to quickly pinpoint and
eliminate phishing threats. RapidMiner was used to implement the machine learning
techniques.
3.1. Framework
A machine learning framework, as illustrated in Figure 1, provides an interface that
enables developers to build and apply machine learning models efficiently. First, we
selected a single dataset from the UC Irvine Machine Learning Repository. After
collecting the data, we began the pre-processing stage, during which we cleaned the data
and replaced any missing values. Following data cleaning, we performed feature selection
so that we could evaluate just the parameters that were necessary for the experiment.
SMOTE was used to generate samples for the minority class. We divided the dataset into
training, validation, and test sets. To avoid overfitting, we trained the models on the
training set and tuned the hyperparameters on the validation set. We then applied several
machine learning classifiers to compare accuracy. To improve performance, we combined
the predictions of several models using ensemble learning methods such as boosting and
stacking, along with k-fold cross-validation. Figure 1 shows the flow diagram of phishing
URL detection.
Figure 1. Flow diagram of the phishing URL detection framework.
3.2. Dataset
We used the phishing URL dataset from the repository. It has 235,795 instances
with 54 attributes of different types (integer, categorical, and real). There are two
values for the target class label: 1 "phishing" and 0 "non-phishing". The dataset has
100,945 phishing URLs and 134,850 legitimate URLs. Most of the URLs are recent, providing
current information for efficient categorization. The dataset enhances model performance
in machine learning for phishing URL detection by offering a variety of training cases,
strengthening the model, and reducing its propensity for overfitting. Table 1 shows all
the attributes of the PhiUSIIL Phishing URL dataset.
Table 1. All relevant attributes of a phishing website URL.
File Name | URL | URL Length | Domain
Domain Length | Is Domain IP | TLD | URL Similarity Index
Char Continuation Rate | TLD Legitimate Prob | URL Char Prob | TLD Length
No Of Sub Domain | Has Obfuscation | No Of Obfuscated Char | Obfuscation Ratio
No Of Letters In URL | Letter Ratio In URL | No Of Digits In URL | Digit Ratio In URL
No Of Equal Sign In URL | No Of Q Mark In URL | No Of Ampersand In URL | No Of Other Special Chars In URL
Special Char Ratio In URL | Is HTTPS | Line Of Code | Largest Line Length
Has Title | Title | Domain Title Match Score | URL Title Match Score
Has Favicon | Robots | Is Responsive | No Of URL Redirect
No Of Self Redirect | Has Description | No Of Pop up | No Of Frame
Has External Form Submit | Has Social Net | Has Submit Button | Has Hidden Fields
Has Password Field | Bank, Pay | Crypto | Has Copyright Info
No Of Image | No Of CSS | No Of JS | No Of Self Ref
No Of Empty Ref | No Of External Ref | label |
3.3. Replacing Missing Values
In machine learning, replacing missing values refers to the process of filling in
missing data points inside a dataset. There are several possible causes of missing
values, such as intentional omission, equipment malfunctions, and mistakes in data
collection. Since many machine learning algorithms struggle to handle missing data,
imputed values are important. Ignoring missing values might lead to biased or incorrect
conclusions. We applied this process to avoid errors and poor performance. It is
important to investigate the implications of imputing missing data and to examine
the influence of alternative imputation techniques on machine learning model performance.
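One common way to replace missing values is to fill numeric columns with their mean and categorical columns with their most frequent value. The paper does not state which imputation rule was used, so the sketch below is only an assumed example of the general idea; the function name is hypothetical.

```python
from statistics import mean, mode


def impute_column(values):
    """Replace None with the column mean (numeric) or mode (categorical).

    Assumed strategy for illustration; the study does not specify its rule.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = mode(present)
    return [fill if v is None else v for v in values]


print(impute_column([10, None, 20]))
print(impute_column(["com", None, "com", "net"]))
```

Comparing model performance under different fill rules (mean, median, mode, or a model-based estimate) is one way to examine the influence of the imputation technique.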
3.4. Feature Selection
Feature selection in machine learning is the process of selecting a subset of important
characteristics from a larger collection to build a model that avoids overfitting while
enhancing interpretability and performance. Selecting the right attributes is essential
since using redundant or irrelevant ones may produce disappointing results. Feature
selection improves generalization and model interpretation.
3.5. Split Data
Dividing data into subsets is a crucial step in machine learning to ensure that
the model can generalize to new input. The data are commonly divided into training,
validation, and test sets. Most of the data, known as the training set, is used to train
the model; it normally contains 70–80% of the data. The validation set is used to tune
hyperparameters during training. The goal of the test set is to assess the model's final
performance following training and validation. It should resemble the data the model
would encounter in the real world. Typically, it contains 10 to 15% of the data.
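A minimal sketch of such a split is shown below. It assumes a 70/15/15 partition and splits each class separately (stratification) so that the phishing-to-legitimate ratio is preserved in every subset; the exact proportions, the seed, and the function name are assumptions for illustration, not the study's reported settings.

```python
import random
from collections import defaultdict


def stratified_split(labels, train=0.70, val=0.15, seed=42):
    """Return index lists for train/val/test, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train)
        n_val = round(len(indices) * val)
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


labels = [1] * 40 + [0] * 60  # toy labels: 1 = phishing, 0 = legitimate
parts = stratified_split(labels)
print({name: len(idx) for name, idx in parts.items()})
```

Because every index lands in exactly one subset, no example used for training can leak into the validation or test sets.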
3.6. SMOTE
The Synthetic Minority Oversampling Technique (SMOTE) is used to correct class imbalances
in datasets by creating synthetic samples for the minority class. It creates new instances
by interpolating between existing minority samples, which helps to balance the class
distribution. This strategy enhances model performance and reduces bias toward the
majority class, especially in domains with underrepresented instances, such as fraud
detection, medical diagnosis, and phishing detection. To address the class imbalance, we
used SMOTE on our dataset, which contains 100,945 phishing URLs and 134,850 legitimate URLs.
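The interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified, pure-Python illustration rather than the RapidMiner operator used in the study: for each synthetic sample it picks a minority point, picks one of its k nearest minority neighbors, and places a new point at a random position on the line segment between them. The function name, k, and the toy data are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating minority samples.

    Simplified sketch; requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen base point
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote(minority, n_new=5))
```

For the class counts reported above, fully balancing the two classes would require 134,850 − 100,945 = 33,905 synthetic phishing samples; whether the study balanced the classes completely is not stated.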
3.7. Filter Examples
The Filter Examples operator enabled us to selectively delete unnecessary or anomalous
rows from the dataset using particular criteria, ensuring the dataset's relevance and
quality. Filtering out these cases increased the quality and efficiency of our machine
learning models, resulting in more precise and dependable phishing detection findings.
3.8. Machine Learning Models
In machine learning model training, the most important step is to select the algorithms
that perform best for a given dataset. Because our dataset is large, we applied various
machine learning classifiers, namely KNN, DT, NB, NB kernel, RT, and RF. We designed a
framework and then tested each classifier on the dataset. All the classifiers achieved
high accuracy. Their results are given below.
3.8.1. Random Forest
Random forest is an ensemble learning method that uses many decision trees to
increase prediction accuracy while reducing overfitting. Combining the output
of many trees results in a more reliable and robust model. Random forest is known for
its high accuracy, resistance to overfitting, and ability to manage large, complex
datasets. When applied to our dataset, this method delivered an accuracy of 99.99%. The
random forest algorithm predicts the target label by building a forest, that is, a
collection of decision trees each trained on a random sample of the data and features,
and aggregating their outputs. The random forest classifier copes well with uneven class
distributions and handles big datasets with ease.
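The two ingredients of a random forest, bootstrap sampling and majority voting, can be sketched as below. The "trees" here are hypothetical hand-written rules standing in for trained decision trees, and their thresholds are invented for illustration; a real forest would learn each tree from a bootstrap sample of the training data.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as used to train each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, features):
    """Aggregate the predictions of all trees by majority vote."""
    return majority_vote([tree(features) for tree in trees])


# Hypothetical stand-ins for trained trees (1 = phishing, 0 = legitimate).
TOY_TREES = [
    lambda f: int(f["url_length"] > 75),
    lambda f: int(not f["is_https"]),
    lambda f: int(f["no_of_subdomains"] > 2),
]

sample = {"url_length": 120, "is_https": True, "no_of_subdomains": 4}
print(forest_predict(TOY_TREES, sample))
```

Using an odd number of trees avoids ties in a two-class vote.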
3.8.2. Naïve Bayes and Naïve Bayes Kernel
The NB classifier is a probabilistic classifier that applies Bayes' theorem. It assumes
that features are conditionally independent given the class and computes the likelihood
of each class when making predictions. Naive Bayes is a rapid and efficient approach that
is widely used for classification, particularly with high-dimensional data such as text.
Under the independence assumption, the method estimates the probability that a data point
belongs to each class from the likelihood of its feature values within that class. The
formula for Bayes' theorem is as follows:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

where $P(c \mid x)$ is the posterior probability of class $c$ given the features $x$,
$P(x \mid c)$ is the likelihood, $P(c)$ is the class prior, and $P(x)$ is the evidence.
The naive Bayes kernel classifier is a variation that uses kernel density estimation
to estimate the probability density function of each feature. This can improve accuracy
when the features do not follow a simple parametric distribution, and it allows the
approach to deal with more complicated data distributions.
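A small worked example may make the Bayes computation concrete. The class priors below come from the dataset's class counts (100,945 phishing and 134,850 legitimate URLs out of 235,795), while the per-feature likelihoods are invented purely for illustration and are not estimates from the dataset.

```python
# Class priors from the dataset's class counts.
PRIORS = {
    "phishing": 100945 / 235795,
    "legitimate": 134850 / 235795,
}

# Hypothetical likelihoods P(feature = value | class), for illustration only.
LIKELIHOODS = {
    "phishing": {
        "is_https": {True: 0.5, False: 0.5},
        "has_password_field": {True: 0.6, False: 0.4},
    },
    "legitimate": {
        "is_https": {True: 0.9, False: 0.1},
        "has_password_field": {True: 0.2, False: 0.8},
    },
}


def naive_bayes_posterior(priors, likelihoods, observation):
    """Posterior P(c | x), assuming features are independent given the class."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for feature, value in observation.items():
            p *= likelihoods[c][feature][value]
        joint[c] = p  # P(x | c) * P(c)
    evidence = sum(joint.values())  # P(x)
    return {c: p / evidence for c, p in joint.items()}


posterior = naive_bayes_posterior(
    PRIORS, LIKELIHOODS, {"is_https": False, "has_password_field": True}
)
print(posterior)
```

With these assumed likelihoods, a non-HTTPS page that contains a password field is assigned a high posterior probability of being phishing, even though legitimate URLs have the larger prior.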
3.8.3. KNN
The K-Nearest Neighbor (KNN) classifier is a non-parametric, instance-based learning
technique that categorizes data points according to the classes of their nearest
neighbors. It makes no assumptions regarding feature independence and is suitable for
both classification and regression problems. KNN is attractive for tasks such as text
classification due to its simplicity and ease of implementation. It estimates the class
or value of a new data item by taking the most frequently occurring class among its
closest neighbors (classification) or by averaging their values (regression). Closeness
is measured with the Euclidean distance:

$$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
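The KNN rule with Euclidean distance can be sketched in a few lines. The training points and the choice of k = 3 are toy assumptions for illustration, not the study's configuration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data: two well-separated clusters (0 = legitimate, 1 = phishing).
train = [
    ((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
    ((5, 5), 1), ((5, 6), 1), ((6, 5), 1),
]
print(knn_predict(train, (0.5, 0.5)))
print(knn_predict(train, (5.5, 5.5)))
```

Because every prediction scans the whole training set, KNN becomes costly on datasets with hundreds of thousands of instances unless an index structure is used.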
3.8.4. Decision Tree
A DT is a tree-structured classifier that separates data into smaller groups based
on the values of input characteristics, resulting in a sequence of decisions that lead to
a final classification. Decision trees have the advantage of being readable, easy to
grasp, and capable of handling both numerical and categorical data. A decision tree is a
supervised machine learning approach that can handle both regression and classification.
Before a decision tree is built, the entropy must first be determined; it measures the
uncertainty of the dataset by examining the class label. Entropy can be calculated using
the following formula:

$$H_i = -\sum_{k=1}^{n} p(i,k)\,\log_2 p(i,k)$$

where $p(i,k)$ is the proportion of samples at node $i$ that belong to class $k$, and $n$
is the number of classes. After determining the entropy, Information Gain can be computed
using the formula given below:

$$\text{Information Gain}(H, A) = \text{Entropy}(H) - \sum_{v \in \text{Values}(A)} \frac{|H_v|}{|H|}\,\text{Entropy}(H_v)$$

where $H$ is the set of samples, $A$ is the attribute being tested, and $H_v$ is the
subset of $H$ for which $A$ takes the value $v$.
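Entropy and information gain, as used to choose decision tree splits, can be computed as in the sketch below. The standard definitions are assumed: entropy over class proportions, and gain as the parent entropy minus the size-weighted entropy of the subsets produced by an attribute. The toy rows and the attribute name are illustrative only.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute`."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attribute]].append(y)
    # Size-weighted entropy of the subsets after the split
    remainder = sum(
        len(g) / len(labels) * entropy(g) for g in groups.values()
    )
    return entropy(labels) - remainder


rows = [
    {"is_https": False}, {"is_https": False},
    {"is_https": True}, {"is_https": True},
]
labels = [1, 1, 0, 0]  # 1 = phishing, 0 = legitimate
print(entropy(labels))
print(information_gain(rows, labels, "is_https"))
```

In this toy case the attribute separates the classes perfectly, so the gain equals the full parent entropy of one bit.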
3.8.5. Random Tree
Random tree is like random forest, but instead of merging many trees, it uses a single
tree with a randomly chosen group of characteristics at each split. Random tree is simpler
and faster than random forest.
4. Results
This section presents the experimental results for the proposed system. The
model we propose uses a single dataset to increase the ML framework’s flexibility. This
study investigates our dataset using a variety of machine learning classifiers and approaches. Using a large dataset for phishing detection decreases the possibility of error
while increasing applicability across various internet user communities. Using this dataset
enables rigorous feature selection, resulting in more accurate and reliable detection findings.
The accuracy score for the different classifiers is shown in Table 2.
Table 2. Performance of classifiers on phishing website URLs with all relevant attributes.
Classifier | Accuracy | Classification Error | Recall | Precision | Kappa
KNN | 99.77% | 0.23% | 99.56% | 99.89% | 0.995
Naive Bayes | 99.96% | 0.04% | 99.99% | 99.92% | 0.999
Naïve Bayes kernel | 99.97% | 0.03% | 100% | 99.93% | 0.999
Random forest | 99.99% | 0.00% | 99% | 99% | 0.999
Random tree | 95.38% | 4.62% | 90.45% | 98.59% | 0.905
Decision tree | 99.99% | 0.01% | 99.97% | 100% | 1.00
Table 2 compares the machine learning classifiers that we applied to our dataset. In our
experiment, we used one dataset for phishing URL detection, and every classifier
performed well on it. KNN, naïve Bayes, naïve Bayes kernel, random forest, and decision
tree all exceeded 99.7% accuracy, while random tree was the weakest at 95.38%. Figure 2
illustrates the comparison of accuracy obtained with the different classifiers.
Figure 2. Performance metrics of classifiers on phishing URLs.
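The metrics reported in Table 2 can all be derived from a binary confusion matrix. The sketch below uses the standard definitions of accuracy, classification error, precision, recall, and Cohen's kappa; the counts passed in are hypothetical and are not taken from the study's experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Agreement expected by chance, from the row and column marginals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "classification_error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=90, fp=10, fn=10, tn=90))
```

Kappa is a chance-corrected agreement coefficient with a maximum of 1 rather than a percentage, which is why it is read on a different scale from the other columns.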
Table 3 presents the performance of the various classifiers used for phishing URL
detection across multiple studies. Random forest (RF) is the most frequently employed
classifier, consistently achieving high accuracy scores ranging from 94.36% to 99.99%. Other
models, such as Convolutional Neural Networks (CNNs), Logistic Regression (LR), and
Computer Vision-based approaches (CV), also show competitive performance, though with
slightly lower accuracy in some cases. The table highlights the effectiveness of ensemble
and deep learning methods in identifying phishing URLs.
Table 3. Accuracy of different classifiers on phishing website URLs reported in prior studies.
Author | Year | Classifier | Accuracy
Kumar Y [2] | 2021 | RF | 99.72%
Gupta B [3] | 2021 | RF | 99.57%
Tang L [4] | 2021 | RF | 99.57%
Sadique F [5] | 2020 | CV | 86.6%
Korkmaz [6] | 2020 | RF | 94.59%
Zamir A [7] | 2019 | RF | 97.3%
Aljofey A [8] | 2020 | CNN | 98.58%
Liu D [9] | 2021 | RF | 94.36%
Amani A [10] | 2019 | DT | 98.8%
Alnemari S [11] | 2023 | RF | 97.3%
Atta Ur Rehman | 2024 | RF | 99.99%
5. Conclusions and Future Work
This study on real-time phishing detection shows that several machine learning
algorithms, such as KNN, random forest, and decision tree, can accurately detect phishing
URLs using characteristics extracted from URLs. Random forest achieved the
maximum accuracy of 99.99%. The study demonstrated substantial effectiveness in
real-time detection. Future research might concentrate on increasing scalability and
efficiency for real-world applications, including advanced deep learning techniques, and expanding
feature sets to incorporate real-time user behavior monitoring. Creating adaptive models
that constantly learn from new phishing strategies and connecting these systems with
wider security structures will be critical for staying ahead of developing attacks.
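One way to picture an adaptive model is a classifier that is updated one labeled URL at a time instead of being retrained from scratch. The sketch below is an assumption-laden illustration of that idea and is not the method evaluated in this study: it swaps in online logistic regression with stochastic gradient updates over hashed character trigrams. The names `url_features` and `OnlineLogisticRegression`, the bucket count, the learning rate, and the example URLs are all hypothetical.

```python
import math
import zlib

N_BUCKETS = 2 ** 12  # size of the hashed feature space (assumed value)


def url_features(url, n=3):
    """Map a URL to hashed character n-gram counts {bucket index: count}."""
    text = url.lower()
    feats = {}
    for i in range(max(len(text) - n + 1, 1)):
        gram = text[i:i + n]
        # crc32 is stable across runs, unlike Python's built-in str hash.
        bucket = zlib.crc32(gram.encode("utf-8")) % N_BUCKETS
        feats[bucket] = feats.get(bucket, 0.0) + 1.0
    return feats


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time (SGD on log loss)."""

    def __init__(self, learning_rate=0.05):
        self.learning_rate = learning_rate
        self.weights = [0.0] * N_BUCKETS
        self.bias = 0.0

    def predict_proba(self, url):
        """Return the estimated probability that the URL is phishing."""
        feats = url_features(url)
        z = self.bias + sum(self.weights[b] * v for b, v in feats.items())
        z = max(min(z, 35.0), -35.0)  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, url, label):
        """Take one gradient step; label is 1 (phishing) or 0 (legitimate)."""
        error = label - self.predict_proba(url)
        for b, v in url_features(url).items():
            self.weights[b] += self.learning_rate * error * v
        self.bias += self.learning_rate * error


# Hypothetical labeled stream; the URLs and labels are invented.
stream = [
    ("http://login-verify.example.test/account/update", 1),
    ("https://docs.example.org/guide/install", 0),
]
model = OnlineLogisticRegression()
for url, label in stream:
    model.update(url, label)
    print(url, round(model.predict_proba(url), 3))
```

Each update costs time proportional to the URL length, which suits streaming use. A production system would also need held-out evaluation and protection against mislabeled or adversarial feedback.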
Author Contributions: A.U.R. conceptualized the study and supervised the project; I.I. and S.J.
performed data collection, preprocessing, and analysis; M.M. contributed to methodology design
and model validation. All authors have read and agreed to the published version of the manuscript.
Funding: The authors received no funding for this research work.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data available upon request from the corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Egress. Phishing Statistics for 2024. Available online: https://www.egress.com/blog/phishing/phishing-statistics-round-up (accessed on 31 July 2024).
2. Kumar, Y.; Subb, B. A lightweight machine learning based security framework for detecting phishing attacks. In Proceedings of the 2021 International Conference on Communication Systems and Networks (COMSNETS), Bangalore, India, 5–9 January 2021.
3. Gupta, B.B.; Yadav, K.; Razzak, I.; Psannis, K.; Castiglione, A.; Chang, X. A novel approach for phishing URLs detection using lexical based machine learning in a real-time environment. Comput. Commun. 2021, 175, 1–22.
4. Tang, L.; Mahmoud, Q.H. A survey of machine learning-based solutions for phishing website detection. Machines 2021, 3, 34.
5. Sadique, F.; Kaul, R.; Badsha, S.; Sengupta, S. An automated framework for real-time phishing URL detection. In Proceedings of the 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2020.
6. Korkmaz, M.; Sahingoz, O.K.; Diri, B. Detection of phishing websites by using machine learning-based URL analysis. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020.
7. Zamir, A.; Khan, H.U.; Iqbal, T.; Yousaf, N.; Aslam, F.; Anjum, A.; Hamdani, M. Phishing website detection using diverse machine learning algorithms. Electron. Libr. 2020, 38, 1.
8. Aljofey, N.; Jiang, Q.; Qu, Q.; Huang, M.; Niyigena, J.P. An effective phishing detection model based on character-level convolutional neural network from URL. Electronics 2020, 9, 1514.
9. Liu, D.J.; Geng, G.G.; Jin, X.B.; Wang, W. An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment. Comput. Secur. 2021, 110, 102421.
10. Alswailem, A.; Alabdullah, B.; Alrumayh, N.; Alsedrani, A. Detecting Phishing Websites Using Machine Learning. In Proceedings of the 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 1–3 May 2019; pp. 1–6.
11. Alnemari, S.; Alshammari, M. Detecting phishing domains using machine learning. Appl. Sci. 2023, 13, 84649.
12. Ashfaq, F.; Jhanjhi, N.; Khan, N.; Muzafar, S.; Das, S. CrimeScene2Graph: Generating Scene Graphs from Crime Scene Descriptions Using BERT NER. In Proceedings of the International Conference On Computational Intelligence In Pattern Recognition, Sonepat, India, 19–20 April 2024; pp. 183–201.
13. Aldughayfiq, B.; Ashfaq, F.; Jhanjhi, N.; Humayun, M. Capturing semantic relationships in electronic health records using knowledge graphs: An implementation using mimic iii dataset and graphdb. Healthcare 2023, 11, 1762.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.