Neural Network for Malicious Url Predicting (In development) | Python, TensorFlow, Deep Learning, Data Engineering
Updated 2/17/2025
This neural network, built and trained from scratch using TensorFlow, predicts whether URLs gathered from the wild are malicious or benign. The model was initially trained on Kaggle’s Malicious URL Dataset, and I’m currently working on enhancing its capabilities by incorporating learning rate decay and an F1 score to better assess its performance. This project is therefore still in development, but I would say it is about 85% complete and already usable for predicting malicious URLs from “the wild”.
I have fixed the issues with the imbalanced Kaggle dataset! The initial problem was its strong skew towards the ‘benign’ label, so as a temporary fix I added class weights to the model. However, today (2/17/2025) I implemented a more effective solution by leveraging the pandas Python module: I doubled the size of the database by adding known malicious URLs. The result is a much more balanced dataset, which should provide even better results!
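As a rough illustration of the rebalancing idea, here is a minimal sketch that assumes the fix amounts to appending extra known-malicious rows to the original table. It uses plain Python lists and invented URLs instead of pandas and the real CSV, and the names `rows`, `extra_malicious`, and `label_ratio` are hypothetical.

```python
from collections import Counter

# Toy stand-in for the dataset: (url, label) pairs. All URLs are invented.
rows = [
    ("https://example.org/home", "benign"),
    ("https://example.org/about", "benign"),
    ("https://example.com/blog", "benign"),
    ("https://example.net/docs", "benign"),
    ("http://198.51.100.7/bin.sh", "malicious"),
]

# Hypothetical extra samples collected from a feed of known malicious URLs.
extra_malicious = [
    ("http://198.51.100.8/payload.exe", "malicious"),
    ("http://203.0.113.5:8888/update.dll", "malicious"),
    ("http://203.0.113.9/mirai.arm7", "malicious"),
]

def label_ratio(samples):
    """Return how many benign samples exist per malicious sample."""
    counts = Counter(label for _, label in samples)
    return counts["benign"] / counts["malicious"]

# Appending the extra rows is the list equivalent of concatenating two tables.
balanced_rows = rows + extra_malicious

print("before:", label_ratio(rows))           # 4 benign per malicious
print("after: ", label_ratio(balanced_rows))  # 1 benign per malicious
```

Checking the ratio before and after makes it easy to confirm that the added rows actually moved the dataset towards balance.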
The idea behind this project stems from the growing sophistication of phishing attacks and malicious websites, as well as from how poorly we humans usually perform when faced with social engineering. Also, traditional blacklist approaches often fail to detect new or modified malicious URLs, creating a need for more dynamic, intelligent detection methods. Without them, it’s going to be an endless game of cat and mouse.
As for results, the model has demonstrated acceptable performance metrics on the test set (about 92% accuracy, 83% precision, and 83% recall in the best run transcribed further down), which suggests real-world applicability. Indeed, this neural network could serve as a valuable component in a company’s security infrastructure, particularly for real-time threat detection and prevention. After all, in cybersecurity:
“Prevention is better than cure.”
However, this kind of app already exists, so I’m not claiming to reinvent the wheel, as the saying goes. The primary objective here is to demonstrate a way to leverage machine learning to engineer more robust, adaptive cybersecurity solutions.
With that being said, the project is divided into 5 phases and is being developed in a Jupyter Notebook, with the neural network split into individual components using custom functions, each one commented and explained. I made this choice because it makes the code easier to understand and debug, and lets me tweak individual components (such as activation functions, model architecture, or hyperparameters) without risking the entire NN. Furthermore, that modular structure streamlines repurposing parts of the code in other machine learning projects, especially those using similar preprocessing or model evaluation steps. This is particularly helpful in cybersecurity, where some components of this neural network can be reused to build similar models for other tasks like phishing detection, malware classification, or anomaly detection. Last but not least, this approach makes it possible to call the complete neural network training and prediction pipeline in just a few lines!
Phase 0: Importing Modules
In this phase I imported the necessary modules for the project:
- pandas: handles the csv data of the database.
- numpy: helps in calculating class weights.
- sklearn: splits the data into training and test sets (the validation set is carved out later by Keras) and computes class weights.
- tensorflow: builds, trains, evaluates, and tests the model. It also tokenizes the data before feeding it to the model.
- matplotlib: illustrates evaluations of the model’s performance with graphics.
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense,Dropout,BatchNormalization,Activation
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
Phase 1: Data Loading, Data Splitting and Url Preprocessing Functions
In this phase I define the custom functions to load the data, split it and preprocess the urls using tokenization:
1. Data loading using pandas
def load_data(filepath):
    """
    Load URL data from a CSV file.
    Args:
        filepath: Location of the csv file.
    Returns:
        Features X (URLs) and y (labels).
    """
    #Use the pandas read_csv function to load urldata.csv
    data=pd.read_csv(filepath)
    #Define X and y from "data"
    #Extract the values of the 'url' column as a numpy array
    X=data['url'].values
    #Use .map to replace labels with 0s or 1s, and .values to turn the pandas Series into a numpy array
    y=data['label'].map({'benign':0,'malicious':1}).values
    #Return the values of X and y
    return X,y
2. Data Splitting using sklearn
def split_data(X,y,test_size,random_state):
    '''
    Splits data into training and test sets.
    Args:
        X (raw urls), y (labels), test_size (fraction of the data reserved for testing, e.g. 0.1),
        random_state (seed that controls how the data is shuffled before splitting).
    Returns:
        X_train,X_test,y_train,y_test (training urls, test urls, train labels, test labels).
    '''
    #Use sklearn's train_test_split to create a test data set
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=test_size,random_state=random_state)
    #Return train and test data sets
    return X_train,X_test,y_train,y_test
3. Preprocessing the data (urls) using tensorflow.keras.Tokenizer
tokenizer=None # defined the tokenizer as a global variable to avoid fitting it many times and botching the model's results
def preprocessor_urls(urls):
    """
    Convert URLs to numerical features using tokenization.
    Args:
        Urls extracted from the data set in variable X.
    Returns:
        Urls preprocessed with tokenization (X_preprocessed).
    """
    #Use the global tokenizer so that it is fitted just once, on the training data
    global tokenizer
    #Keras tokenizer to tokenize each character in the URLs
    #"num_words" set to 1000 so that the model only focuses on the most frequent unique tokens, thus avoiding slow training and overfitting.
    #"char_level" set to "True" so that the preprocessor represents urls with individual characters.
    if tokenizer is None:
        tokenizer=tf.keras.preprocessing.text.Tokenizer(num_words=1000,char_level=True)
        #Fit the tokenizer on the URLs (first call only, i.e. the training URLs)
        tokenizer.fit_on_texts(urls)
    #Convert the tokenized urls into a binary matrix
    X_preprocessed=tokenizer.texts_to_matrix(urls, mode='binary')
    #Return preprocessed data
    return X_preprocessed
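To make the binary-matrix representation more concrete, here is a simplified, dependency-free sketch of what character-level tokenization followed by `texts_to_matrix(..., mode='binary')` produces. It is an approximation written for illustration, not the Keras implementation; `fit_char_vocab` and `urls_to_binary_matrix` are hypothetical helper names, and the URLs are invented.

```python
from collections import Counter

def fit_char_vocab(urls):
    """Rank characters by frequency; index 1 is the most frequent (0 stays unused)."""
    counts = Counter(ch for url in urls for ch in url.lower())
    return {ch: rank for rank, (ch, _) in enumerate(counts.most_common(), start=1)}

def urls_to_binary_matrix(urls, vocab, num_words=1000):
    """One row per URL; column i is 1 if the character with index i occurs in the URL."""
    matrix = []
    for url in urls:
        row = [0] * num_words
        for ch in url.lower():
            idx = vocab.get(ch)
            # Unknown characters and indices beyond num_words are dropped.
            if idx is not None and idx < num_words:
                row[idx] = 1
        matrix.append(row)
    return matrix

train_urls = ["http://a.com", "http://b.org"]  # invented examples
vocab = fit_char_vocab(train_urls)
matrix = urls_to_binary_matrix(["http://a.com"], vocab, num_words=50)

print(len(matrix[0]))   # row length equals num_words
print(sum(matrix[0]))   # number of distinct characters present in the URL
```

One consequence worth keeping in mind: a binary row records only which characters are present, not their order or how often they occur, so two URLs made of the same set of characters map to the same feature vector.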
Phase 2: Creating, Training, and Testing model Functions
In this phase I define the custom functions to create, train and test the model.
4. Creating the Model using TensorFlow
This function actually contains two models. The first is the shallower initial architecture, which is commented out because its results were not acceptable. So I created a second, deeper architecture that adds more Dense layers, along with Dropout to avoid over-relying on particular units, and BatchNormalization for more efficient training and a lower risk of overfitting to the training data.
Another note concerning the architecture: I chose to use mainly relu activations for a number of reasons:
- First, relu is non-linear yet simple, which keeps computation cheap in contrast to tanh, for example.
- It also helps mitigate vanishing gradients, as it doesn’t cap positive values.
- Furthermore, since relu outputs zero for negative values, it produces sparse activations, which reduce interdependency between units and make learning features more efficient.
- When combined with BatchNormalization, it is less prone to “dead ReLUs” (neurons that always output 0), because the values fed into the activation are normalized to be centered around 0, and it tolerates more aggressive learning rates, thus speeding up convergence.
Finally, I chose Dense layers because they connect each neuron to every neuron in the previous layer, making them very effective for learning complex representations, such as those of malicious urls.
def create_model(input_shape): #Call: model=create_model(X_train_preprocessed.shape[1:])
    """
    Create a neural network for URL classification
    Args:
        input_shape: Shape of a single input sample, e.g. (1000,) (X_train_preprocessed.shape[1:])
    Returns:
        tf.keras.Model (uncompiled neural network model; it is compiled in train_model)
    """
    #Create the NN using a Sequential model of Dense and Dropout layers
    #MODEL 1 (shallower initial architecture, kept commented out for reference)
    #model=Sequential([
    #    #First Dense Layer
    #    Dense(units=128,activation='relu',input_shape=input_shape,name='Dense_1'),
    #    #First Dropout Layer
    #    Dropout(0.3,name='Dropout_1'),
    #    #Second Dense Layer
    #    Dense(units=64,activation='relu',name='Dense_2'),
    #    #Second Dropout Layer
    #    Dropout(0.3,name='Dropout_2'),
    #    #Third Dense Layer
    #    Dense(units=32,activation='relu',name='Dense_3'),
    #    #Third Dropout Layer
    #    Dropout(0.3,name='Dropout_3'),
    #    #Dense Output Layer
    #    Dense(1,activation='sigmoid',name='Output')
    #],name='Neural_Network_for_Malicious_Url_Detection')
    #MODEL 2
    model=Sequential([
        #Input layer
        Input(shape=input_shape),
        # First Dense Layer
        Dense(units=128, name='Dense_1'), #128
        BatchNormalization(name='BatchNorm_1'),
        Activation('relu', name='Activation_1'),
        Dropout(0.3, name='Dropout_1'),
        # Second Dense Layer
        Dense(units=256, name='Dense_2'), #64
        BatchNormalization(name='BatchNorm_2'),
        Activation('relu', name='Activation_2'),
        Dropout(0.3, name='Dropout_2'),
        # Third Dense Layer
        Dense(units=128, name='Dense_3'), #32
        BatchNormalization(name='BatchNorm_3'),
        Activation('relu', name='Activation_3'),
        Dropout(0.3, name='Dropout_3'),
        # ADDED Fourth Dense Layer
        Dense(units=64, name='Dense_4'),
        BatchNormalization(name='BatchNorm_4'),
        Activation('relu', name='Activation_4'),
        Dropout(0.3, name='Dropout_4'),
        # ADDED Fifth Dense Layer
        Dense(units=32, name='Dense_5'),
        BatchNormalization(name='BatchNorm_5'),
        Activation('relu', name='Activation_5'),
        Dropout(0.3, name='Dropout_5'),
        # Dense Output Layer
        Dense(1, activation='sigmoid', name='Output')
    ], name='Neural_Network_for_Malicious_Url_Detection')
    #Compilation was moved to train_model; the old call is kept commented out for reference
    #model.compile(
    #    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    #    loss=tf.keras.losses.BinaryCrossentropy(),
    #    metrics=['accuracy', 'precision', 'recall']
    #)
    return model
5. Training the Model using TensorFlow
This function trains the model using the preprocessed data. Things to note:
- Class weights have been added using sklearn to balance the training data set, because the Kaggle data over-represents non-malicious urls by a ratio of roughly 4 to 1.
- It defines early stopping, so as to stop training if the validation loss doesn’t improve for a set number of epochs.
- The function returns both the trained model and the history of training, and it trains in a verbose (verbose=1) manner to get the whole picture of the training.
- The .fit() function uses the argument validation_split to reserve a small amount of training data (0.1 or 10%), so as to evaluate how well the model does when confronted with new, unseen data. At the end of each epoch the results are validated against said data to detect overfitting, track performance at each step, and help determine which hyperparameters to tune.
- The custom function takes batch_size and epochs as arguments, so as to experiment with different values.
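To show what the class weights look like in practice, here is a small dependency-free sketch of the "balanced" heuristic that sklearn's `compute_class_weight` applies: each class gets `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the 80/20 toy label split are assumptions for illustration, not the real dataset proportions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Toy labels: 80 benign (0) and 20 malicious (1), i.e. a 4-to-1 imbalance.
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)

print(weights)  # {0: 0.625, 1: 2.5}
```

With a 4-to-1 imbalance, every malicious example counts four times as much as a benign one in the loss, which pushes the model towards higher recall on the minority class.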
def train_model(X_train_preprocessed,y_train,batch_size,epochs): #Call: trained_model,history=train_model(X_train_preprocessed,y_train,batch_size,epochs)
    """
    Trains a neural network for URL classification
    Note: relies on the global variable "model" returned by create_model.
    Args:
        X_train_preprocessed (tokenized training urls), y_train (training labels),
        batch_size (size of the training batches), epochs (maximum number of epochs).
    Returns:
        model (trained tf.keras.Model) and history (record of the training metrics per epoch).
    """
    #Added class weights to increase recall
    class_weights=compute_class_weight('balanced',classes=np.unique(y_train),y=y_train)
    class_weight_dict=dict(enumerate(class_weights))
    #Compile the Model to configure it for training
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=['accuracy', 'precision', 'recall']
    )
    #Define early stopping
    early_stopping=EarlyStopping(
        monitor='val_loss',
        patience=100,
        restore_best_weights=True)
    #Use the .fit method to train the model and save the history of training
    history=model.fit(
        x=X_train_preprocessed,
        y=y_train,
        batch_size=batch_size, #Size of training data batches
        epochs=epochs, #Times it will pass over the whole training set (forward and backward propagation)
        callbacks=[early_stopping], #Added early stopping
        class_weight=class_weight_dict, #Added class weights
        verbose=1,
        validation_split=0.1) #Save a bit of training data as validation data
    return model,history
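The learning rate decay mentioned in the introduction is not implemented in the function above. As one possible approach, the sketch below shows plain exponential decay, where the rate shrinks by a fixed factor every `decay_steps` optimizer steps. The starting value 1e-3 comes from the `Adam` call above and 1425 is the number of steps per epoch in the training log; the decay rate of 0.96 is an assumption chosen only for illustration. Keras offers a schedule based on the same formula, `tf.keras.optimizers.schedules.ExponentialDecay`, which can be passed as the optimizer's `learning_rate`.

```python
def exponential_decay(step, initial_lr=1e-3, decay_steps=1425, decay_rate=0.96):
    """Learning rate after `step` optimizer steps: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# With decay_steps equal to the steps per epoch, the rate drops by 4% per epoch.
for epoch in (0, 1, 10, 50, 110):
    step = epoch * 1425
    print(f"epoch {epoch:>3}: lr = {exponential_decay(step):.6f}")
```

Larger steps early on help the model move quickly, while smaller steps later on let it settle instead of oscillating around a minimum.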
6. Evaluating the model using TensorFlow
This code evaluates the model using test data and the following metrics: loss (measurement of the difference between predictions and actual labels of the test set), accuracy (percentage of times the model predicted correctly), precision (the share of urls flagged as malicious that really are malicious, indicating reliability in malicious URL identification), and recall (sensitivity or true positive rate, showing effectiveness in finding actual threats). As I mentioned far above, I am currently working on adding an F1 score to better assess the model.
def evaluate_model(X_test_preprocessed,y_test):
    #Note: relies on the global variable "trained_model" returned by train_model
    #Define evaluation records to generate
    test_loss,test_accuracy,test_precision,test_recall=trained_model.evaluate(X_test_preprocessed,y_test,verbose=1)
    #Return the metrics
    return f'Loss: {test_loss}',f'Accuracy: {test_accuracy}',f'Precision: {test_precision}',f'Recall:{test_recall}'
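Since the F1 score is still pending, here is a minimal sketch of how it could be derived from the precision and recall that `evaluate_model` already returns. The helper names `f1_score` and `precision_recall_f1` are hypothetical; the two input values are the test precision and recall transcribed in Phase 4.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(tp, fp, fn):
    """Derive all three metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_score(precision, recall)

# Test-set precision and recall reported for the best run.
reported_f1 = f1_score(0.8312860131263733, 0.8282654285430908)
print(f"F1: {reported_f1:.4f}")  # roughly 0.83
```

Because the harmonic mean is dominated by the smaller of the two values, F1 drops sharply when either precision or recall collapses, which makes it more informative than accuracy on an imbalanced dataset.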
Phase 3: Predicting Maliciousness for Previously Unseen URL
In this phase I define a function to predict whether a url found in the wild is likely Malicious or Benign, using the trained model. Things to note:
- The prediction is a float between 0 (Benign) and 1 (Malicious).
- I defined a single url argument so that any url can be used. This implies applying the global tokenizer to preprocess the url before feeding it to the trained model.
- I defined a threshold argument so that the value can be adjusted according to needs. It can be set lower so as to classify more urls as malicious, or higher to do the opposite. In principle the answer is a balanced threshold, but in the context of cybersecurity it is better to err on the side of caution, so it is advisable to set it low (around 0.3). Moreover, the model learns patterns very well, so the float is often very close to 0 or to 1.
7. Predicting with the trained model using TensorFlow
def predict_url(url,trained_model,threshold:float,tokenizer):
    #Preprocess the new url using the fitted tf.keras Tokenizer
    preprocessed_url=tokenizer.texts_to_matrix([url],mode='binary') #Removed dependence on the preprocessor function
    #Use the tf.keras.Model.predict method on the new url
    prediction=trained_model.predict(preprocessed_url,verbose=0)[0][0]
    #Compare the prediction against the threshold
    if prediction>threshold:
        return f'{url} is likely MALICIOUS!: {prediction}' #close to 1
    else:
        return f'{url} is likely BENINGN!: {prediction}' #close to 0
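To illustrate how much the threshold matters, the sketch below applies the same comparison as `predict_url` to a handful of scores, rounded from the outputs transcribed in Phase 4, at two different thresholds. The helper `count_flagged` is a hypothetical name introduced for this example.

```python
def count_flagged(scores, threshold):
    """Count how many scores would be labelled malicious (score > threshold)."""
    return sum(score > threshold for score in scores)

# A few model outputs, rounded from the predictions transcribed in Phase 4.
scores = [0.9999619, 0.5233876, 0.3690704, 0.2990720,
          0.2596791, 0.2229555, 0.1741470, 0.0400127]

for threshold in (0.25, 0.5):
    flagged = count_flagged(scores, threshold)
    print(f"threshold={threshold}: {flagged} of {len(scores)} flagged as malicious")
```

Lowering the threshold catches more borderline urls at the cost of more false alarms, which is the trade-off behind erring on the side of caution.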
Phase 4: Implementing the Model
In this phase I call each function to actually build, train, and evaluate the model, and to predict urls gathered “from the wild”.
8. Building, Training, Evaluating, Predicting using the custom functions
Something to note:
- This was implemented using a Jupyter Notebook, so each part of the code was run in a separate cell for easier implementation and debugging if needed. However, as I can’t represent that functionality here, I will transcribe the last results I got.
#Load data
X,y=load_data('urldata.csv')
#Split data into training and test data
X_train,X_test,y_train,y_test=split_data(X,y,0.1,42)
print(X_train.shape) #Print the shape so as to ensure everything is working fine
(405158,)
#Preprocess data
X_train_preprocessed=preprocessor_urls(X_train)
X_test_preprocessed=preprocessor_urls(X_test)
print(X_train_preprocessed.shape[1:])
(1000,)
#Create the model
model=create_model((X_train_preprocessed.shape[1:])) #Shape (1000,) representing the 1000 features of each example
#Visualize model
model.summary()
Model: "Neural_Network_for_Malicious_Url_Detection"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Dense_1 (Dense) │ (None, 128) │ 128,128 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ BatchNorm_1 (BatchNormalization) │ (None, 128) │ 512 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Activation_1 (Activation) │ (None, 128) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dropout_1 (Dropout) │ (None, 128) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dense_2 (Dense) │ (None, 256) │ 33,024 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ BatchNorm_2 (BatchNormalization) │ (None, 256) │ 1,024 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Activation_2 (Activation) │ (None, 256) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dropout_2 (Dropout) │ (None, 256) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dense_3 (Dense) │ (None, 128) │ 32,896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ BatchNorm_3 (BatchNormalization) │ (None, 128) │ 512 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Activation_3 (Activation) │ (None, 128) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dropout_3 (Dropout) │ (None, 128) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dense_4 (Dense) │ (None, 64) │ 8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ BatchNorm_4 (BatchNormalization) │ (None, 64) │ 256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Activation_4 (Activation) │ (None, 64) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dropout_4 (Dropout) │ (None, 64) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dense_5 (Dense) │ (None, 32) │ 2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ BatchNorm_5 (BatchNormalization) │ (None, 32) │ 128 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Activation_5 (Activation) │ (None, 32) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Dropout_5 (Dropout) │ (None, 32) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ Output (Dense) │ (None, 1) │ 33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 206,849 (808.00 KB)
Trainable params: 205,633 (803.25 KB)
Non-trainable params: 1,216 (4.75 KB)
Model input shape: (None, 1000)
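As a sanity check on the summary above, the parameter counts can be reproduced by hand: a Dense layer has `inputs * units + units` parameters (weights plus biases), and a BatchNormalization layer has four per unit (scale and shift are trainable, the moving mean and variance are not). The sketch below is a small worked calculation; the helper names are hypothetical.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

def batchnorm_params(units):
    """Scale, shift, moving mean and moving variance for each unit."""
    return 4 * units

hidden_units = [128, 256, 128, 64, 32]  # Dense_1 to Dense_5
inputs = 1000                            # features per tokenized url

total = 0
non_trainable = 0
for units in hidden_units:
    total += dense_params(inputs, units) + batchnorm_params(units)
    non_trainable += 2 * units           # moving mean and variance are not trained
    inputs = units
total += dense_params(inputs, 1)         # sigmoid output layer

print(f"Total params: {total:,}")
print(f"Trainable params: {total - non_trainable:,}")
print(f"Non-trainable params: {non_trainable:,}")
```

The three printed figures agree with the totals in the summary, and the first Dense layer alone (128,128 parameters) accounts for more than half of the network.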
#Train the model (batch size 256, 110 epochs)
trained_model,history=train_model(X_train_preprocessed,y_train,256,110) #110 epochs visualization, new deeper NN, class weights
(Last 5 epochs of training)
Epoch 105/110
1425/1425 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.9098 - loss: 0.2540 - precision: 0.7828 - recall: 0.8463 - val_accuracy: 0.9184 - val_loss: 0.2194 - val_precision: 0.8154 - val_recall: 0.8388
Epoch 106/110
1425/1425 ━━━━━━━━━━━━━━━━━━━━ 11s 8ms/step - accuracy: 0.9115 - loss: 0.2508 - precision: 0.7874 - recall: 0.8453 - val_accuracy: 0.9204 - val_loss: 0.2160 - val_precision: 0.8263 - val_recall: 0.8323
Epoch 107/110
1425/1425 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.9107 - loss: 0.2518 - precision: 0.7845 - recall: 0.8453 - val_accuracy: 0.9186 - val_loss: 0.2179 - val_precision: 0.8167 - val_recall: 0.8381
Epoch 108/110
1425/1425 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.9089 - loss: 0.2533 - precision: 0.7793 - recall: 0.8465 - val_accuracy: 0.9208 - val_loss: 0.2159 - val_precision: 0.8295 - val_recall: 0.8300
Epoch 109/110
1425/1425 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.9099 - loss: 0.2516 - precision: 0.7814 - recall: 0.8486 - val_accuracy: 0.9230 - val_loss: 0.2180 - val_precision: 0.8392 - val_recall: 0.8271
Epoch 110/110
1425/1425 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.9109 - loss: 0.2507 - precision: 0.7843 - recall: 0.8468 - val_accuracy: 0.9207 - val_loss: 0.2174 - val_precision: 0.8266 - val_recall: 0.8336
#Evaluate the Model results on the test set (loss, accuracy, precision, recall)
print(f'Results:\n{evaluate_model(X_test_preprocessed,y_test)}')
('Loss: 0.2092394232749939', 'Accuracy: 0.9210538268089294', 'Precision: 0.8312860131263733', 'Recall:0.8282654285430908')
These balanced results are the best I got, on October 10, 2024. The predictions with ‘wild urls’ are consistent with these results, but I lowered the threshold to compensate for false negatives. However, on November 15, 2024 I retrained the neural network and the results were a bit more unbalanced:
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 1s 990us/step - accuracy: 0.9254 - loss: 0.2075 - precision: 0.8614 - recall: 0.8101
Results: ('Loss: 0.20970836281776428', 'Accuracy: 0.924430251121521', 'Precision: 0.8572296500205994', 'Recall:0.8095238208770752')
#Predict unknown URLs from the wild (https://urlhaus.abuse.ch/browse/)
threshold=0.25
url='http://59.88.229.17:60855/bin.sh'
predict_url(url,trained_model,threshold,tokenizer)
'http://59.88.229.17:60855/bin.sh is likely MALICIOUS!: 0.9999619126319885'
url='http://117.235.251.101:40567/Mozi.a'
predict_url(url,trained_model,threshold,tokenizer)
'http://117.235.251.101:40567/Mozi.a is likely MALICIOUS!: 0.9999975562095642'
url='https://www.claude.ai/'
predict_url(url,trained_model,threshold,tokenizer)
'https://www.claude.ai/ is likely BENINGN!: 0.05178939923644066'
url='https://www.semanariolacivilizacion.blogspot.com'
predict_url(url,trained_model,threshold,tokenizer)
'https://www.semanariolacivilizacion.blogspot.com is likely BENINGN!: 0.24499544501304626'
#Test a huge list of fresh urls
url_list=[
'http://64.235.37.148/bins/k.mips',
'http://64.235.37.148/bins/k.m68k',
'http://64.235.37.148/bins/k.x86',
'http://31.172.80.237/qkdjdjj22.i586',
'http://182.117.69.207:40583/i',
'http://31.172.83.15/main_arm6',
'http://31.162.21.98:40024/Mozi.m',
'http://apitestlabs.com:8888/113681416431447.dll',
'http://cloudslimit.com:8888/113681416431447.dll',
'http://dailywebstats.com:8888/225761669829717.dll',
'https://www.discord.com',
'http://94.159.113.48:8888/113681416431447.dll',
'http://87.120.114.132/mirai.arm7',
'https://chxr.rooms.fierceatfifty.com/orderReview',
'http://87.120.114.132/mirai.ppc',
'http://whimar.com/wp-admin/maint/XjoPqhzc228.bin',
'http://whimar.com/wp-admin/maint/Verificerbarheden.mso',
'http://185.215.113.16/mine/random.exe',
'https://wall5tghf6fdg.api.opensourcesaas.org/FcPJXgYD/mine.png',
'http://87.120.112.102/roze.i586',
'http://103.72.57.120/TGIF/Jodozocw.dat',
'https://www.semanariolacivilizacion.blogspot.com',
'http://172.245.123.25/302/taskhostws.exe',
'http://157.173.104.153/up/bb.ps1',
'http://107.170.34.159/morsec/Invoke-Shellcode.ps1',
'http://157.173.104.153/up/Tool/ChromePass.exe',
'http://101.99.94.195/mZlaoZbpEVWPJcG210.bin',
'http://124.248.65.242:8899/sys/20230120_3.bin',
'http://invictaindia.com/sty/iTSqHIazA174.bin',
'http://invictaindia.com/sty1/Kajanlggenes.u32',
'http://8.138.96.41:10050/demon.x64.bin',
'http://169.1.16.29/swift-bypass-breakpoints.exe',
'https://www.mincultura.gov.co',
'http://169.1.16.29/demon.x641.exe',
'http://169.1.16.29/BidvestBank-Swift-DNS-Tunnel.exe',
'http://169.1.16.29/BidvestBank-Swift--DNS-evasion-encrypted-no-cloudflare.exe',
'http://169.1.16.29/LOUD_EYE',
'http://169.1.16.29/Swift-Beacon-Encrypted.exe',
'http://178.215.238.13/r.sh',
'http://91.218.67.59/wget.sh',
'http://87.120.112.102/update.sh',
'http://120.25.157.131/qz1.exe',
'http://185.121.233.82/tt/mips64',
'https://www.gamebooks.org',
'http://github.com/vizian123/msfvenomz/raw/main/reddit.exe',
'https://pastebin.com/raw/FYu4F1YR',
'http://120.25.157.131/fsx.exe',
'https://www.ydray.com/get/t/u17290663674746gFwb38bd70be00c5oQ',
'https://bitbucket.org/awgwrtwa/asss/downloads/1654-INICIO_DEMANDA_LABORAL_JUZGADO_CIVIL_DEL_CIRCUITO_DE_RAMA_JUDICIAL.CAB',
'http://47.236.122.191/Geek.exe',
'http://176.111.174.140/ywx.exe',
'http://web.johnmccrea.com/downloads/67065227a0640_rrrrrrrr.exe'
]
#Start a count
count=0
#Parse urls
for u in url_list:
#Call custom predict function in every url
count+=1
print(count,predict_url(u,trained_model,threshold,tokenizer))
1 http://64.235.37.148/bins/k.mips is likely MALICIOUS!: 0.9999939203262329
2 http://64.235.37.148/bins/k.m68k is likely MALICIOUS!: 0.9999939203262329
3 http://64.235.37.148/bins/k.x86 is likely MALICIOUS!: 0.9999952912330627
4 http://31.172.80.237/qkdjdjj22.i586 is likely MALICIOUS!: 0.9999898076057434
5 http://182.117.69.207:40583/i is likely MALICIOUS!: 0.9999951720237732
6 http://31.172.83.15/main_arm6 is likely MALICIOUS!: 0.9999966621398926
7 http://31.162.21.98:40024/Mozi.m is likely MALICIOUS!: 0.9999939799308777
8 http://apitestlabs.com:8888/113681416431447.dll is likely MALICIOUS!: 0.9994522929191589
9 http://cloudslimit.com:8888/113681416431447.dll is likely MALICIOUS!: 0.9998060464859009
10 http://dailywebstats.com:8888/225761669829717.dll is likely MALICIOUS!: 0.026539182290434837
11 https://www.discord.com is likely BENINGN!: 0.07051538676023483
12 http://94.159.113.48:8888/113681416431447.dll is likely MALICIOUS!: 0.9999892711639404
13 http://87.120.114.132/mirai.arm7 is likely MALICIOUS!: 0.9999964237213135
14 https://chxr.rooms.fierceatfifty.com/orderReview is likely MALICIOUS!: 0.523387610912323
15 http://87.120.114.132/mirai.ppc is likely MALICIOUS!: 0.9999939799308777
16 http://whimar.com/wp-admin/maint/XjoPqhzc228.bin is likely MALICIOUS!: 0.9999373555183411
17 http://whimar.com/wp-admin/maint/Verificerbarheden.mso is likely MALICIOUS!: 0.362288236618042
18 http://185.215.113.16/mine/random.exe is likely MALICIOUS!: 0.9999970197677612
19 https://wall5tghf6fdg.api.opensourcesaas.org/FcPJXgYD/mine.png is likely MALICIOUS!: 0.5353929996490479
20 http://87.120.112.102/roze.i586 is likely MALICIOUS!: 0.9999966621398926
21 http://103.72.57.120/TGIF/Jodozocw.dat is likely MALICIOUS!: 0.9996687173843384
22 https://www.semanariolacivilizacion.blogspot.com is likely BENINGN!: 0.24499544501304626
23 http://172.245.123.25/302/taskhostws.exe is likely MALICIOUS!: 0.7746655344963074
24 http://157.173.104.153/up/bb.ps1 is likely MALICIOUS!: 0.9997329711914062
25 http://107.170.34.159/morsec/Invoke-Shellcode.ps1 is likely MALICIOUS!: 0.99895179271698
26 http://157.173.104.153/up/Tool/ChromePass.exe is likely MALICIOUS!: 0.9998794794082642
27 http://101.99.94.195/mZlaoZbpEVWPJcG210.bin is likely MALICIOUS!: 0.9999423027038574
28 http://124.248.65.242:8899/sys/20230120_3.bin is likely MALICIOUS!: 0.9999881386756897
29 http://invictaindia.com/sty/iTSqHIazA174.bin is likely MALICIOUS!: 0.9993544816970825
30 http://invictaindia.com/sty1/Kajanlggenes.u32 is likely MALICIOUS!: 0.9999808669090271
31 http://8.138.96.41:10050/demon.x64.bin is likely MALICIOUS!: 0.9999973773956299
32 http://169.1.16.29/swift-bypass-breakpoints.exe is likely MALICIOUS!: 0.2990719676017761
33 https://www.mincultura.gov.co is likely BENINGN!: 0.16868330538272858
34 http://169.1.16.29/demon.x641.exe is likely MALICIOUS!: 0.9999975562095642
35 http://169.1.16.29/BidvestBank-Swift-DNS-Tunnel.exe is likely MALICIOUS!: 0.369070440530777
36 http://169.1.16.29/BidvestBank-Swift--DNS-evasion-encrypted-no-cloudflare.exe is likely BENINGN!: 0.17414702475070953
37 http://169.1.16.29/LOUD_EYE is likely MALICIOUS!: 0.9999887943267822
38 http://169.1.16.29/Swift-Beacon-Encrypted.exe is likely BENINGN!: 0.22295552492141724
39 http://178.215.238.13/r.sh is likely MALICIOUS!: 0.9999259114265442
40 http://91.218.67.59/wget.sh is likely MALICIOUS!: 0.8114745616912842
41 http://87.120.112.102/update.sh is likely MALICIOUS!: 0.9991742968559265
42 http://120.25.157.131/qz1.exe is likely MALICIOUS!: 0.9999975562095642
43 http://185.121.233.82/tt/mips64 is likely MALICIOUS!: 0.9999828338623047
44 https://www.gamebooks.org is likely BENINGN!: 0.07161882519721985
45 http://github.com/vizian123/msfvenomz/raw/main/reddit.exe is likely MALICIOUS!: 0.9517421722412109
46 https://pastebin.com/raw/FYu4F1YR is likely MALICIOUS!: 0.44059497117996216
47 http://120.25.157.131/fsx.exe is likely MALICIOUS!: 0.9999184012413025
48 https://www.ydray.com/get/t/u17290663674746gFwb38bd70be00c5oQ is likely MALICIOUS!: 0.9254165291786194
49 https://bitbucket.org/awgwrtwa/asss/downloads/1654-INICIO_DEMANDA_LABORAL_JUZGADO_CIVIL_DEL_CIRCUITO_DE_RAMA_JUDICIAL.CAB is likely MALICIOUS!: 0.2596791386604309
50 http://47.236.122.191/Geek.exe is likely MALICIOUS!: 0.9999968409538269
51 http://176.111.174.140/ywx.exe is likely MALICIOUS!: 0.9999595284461975
52 http://web.johnmccrea.com/downloads/67065227a0640_rrrrrrrr.exe is likely BENINGN!: 0.04001269489526749
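Analysis of the results: These results illustrate the model’s performance very clearly. It is mostly good, although a few cases show room for improvement: for example, the executable downloads in entries 36, 38, and 52 scored below the 0.25 threshold, and entries 32 and 49 cleared it only narrowly.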