Malware Analysis Behavioral Detection and Prevention on Bayesian Network Using Machine Learning

Abstract

Signature-based analysis is no longer sufficient to counter the stealthy and polymorphic nature of malware attacks. As a result, behavioral or anomaly-based analysis offers a more adaptive and effective approach. Recent research, however, has found that current network-level behavioral analysis approaches have numerous issues; these have been classified according to their common characteristics, which include diminished parameters and a lack of prior knowledge. Our research therefore attempts to identify the optimal feature selection and distribution density model characteristics to create a Bayesian Network-based Predictive Analytics Model. The aim is to improve the predictive accuracy of the analysis and to evaluate the model's detection, precision, and false alarm rate against a subject matter expert model, SVM, k-NN, and Least Squares on standard and ground-truth data.

I. INTRODUCTION  

A. What are Bayesian Networks?  

Bayesian networks are a popular type of graphical model for representing uncertainty. They consist of two main components: structure and parameters. The structure is a directed acyclic graph (DAG) that represents the relationships between random variables by defining their dependencies and conditional independence. The parameters are the probability distributions attached to each node.  

A Bayesian network can be used to describe a probability distribution in a compact, flexible, and comprehensible way. It is a probabilistic graphical structure that can be built from data and/or knowledge from experts. Bayesian networks can be used for a variety of tasks, including prediction, anomaly detection, diagnostics, automated insight, reasoning, statistical prediction, and decision-making under uncertainty. 

An Example of a Bayesian Network:  

 


 

Figure 1.1 Bayesian Network 

B. How does it work?  

Consider this scenario: someone tries to turn on a fan, but it fails to spin. The fan is connected to an extension socket, so the fault may lie in a failed plug or socket rather than in the fan itself. Bayesian networks are probabilistic graphical models that can be used to reason about this kind of uncertainty. 

 


 Figure 1.2 System Crash Bayesian Network  

Suppose that, during severe weather, a lightning strike hits a power line, causing it to fail. As a result, the Operating System (OS) could fail, leading to a system crash. A large power spike, on the other hand, could result in hardware failure and a system crash. However, a system crash could occur even if there is no storm: malware can disrupt the operating system, which also leads to system crashes. A malware infection, OS failure, and/or hardware failure are therefore all potential causes of a computer crash. The directed graph is advantageous because it allows us to rapidly determine which variables are independent of the others. 
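The system crash example can be sketched as a small discrete Bayesian network and queried by brute-force enumeration. This is only an illustrative sketch: the network shape is a simplified reading of the scenario above, and every probability value is a made-up assumption, not a measurement from this study.

```python
from itertools import product

# All numbers below are hypothetical, chosen only for illustration.
P_STORM = 0.10                                    # P(storm)
P_MALWARE = 0.05                                  # P(malware)
P_POWER_GIVEN_STORM = {True: 0.40, False: 0.01}   # P(power failure | storm)
P_HW_GIVEN_POWER = {True: 0.30, False: 0.01}      # P(hardware failure | power failure)
# P(OS failure | power failure, malware)
P_OS = {(True, True): 0.95, (True, False): 0.60,
        (False, True): 0.70, (False, False): 0.01}
# P(crash | OS failure, hardware failure)
P_CRASH = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.95, (False, False): 0.001}

NAMES = ("storm", "malware", "power", "os_fail", "hw_fail", "crash")


def bernoulli(p_true, value):
    """Probability that a binary variable takes `value` given P(True)."""
    return p_true if value else 1.0 - p_true


def joint(storm, malware, power, os_fail, hw_fail, crash):
    """Joint probability, factorized along the edges of the DAG."""
    return (bernoulli(P_STORM, storm)
            * bernoulli(P_MALWARE, malware)
            * bernoulli(P_POWER_GIVEN_STORM[storm], power)
            * bernoulli(P_OS[(power, malware)], os_fail)
            * bernoulli(P_HW_GIVEN_POWER[power], hw_fail)
            * bernoulli(P_CRASH[(os_fail, hw_fail)], crash))


def probability(query, evidence=None):
    """P(query | evidence), summing the joint over all 2^6 assignments."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if all(world[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_crash = probability({"crash": True})
p_malware_given_crash = probability({"malware": True}, {"crash": True})
p_malware_given_storm = probability({"malware": True}, {"storm": True})

print(f"P(crash)           = {p_crash:.4f}")
print(f"P(malware | crash) = {p_malware_given_crash:.4f}")
print(f"P(malware | storm) = {p_malware_given_storm:.4f}")
```

In this sketch, observing a crash raises the probability of malware above its prior, while observing a storm leaves it unchanged, because the graph has no path that makes storm and malware dependent until a common effect is observed.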

  

  1. What is Malware?  

Malware, or "malicious software," is a broad term for any harmful program or code intended to harm or impair computers. Malware is destructive and intrusive software designed to infiltrate, damage, or disable networked machines and mobile phones, often by seizing partial control of their operations. It disrupts regular functioning, much as the flu does in humans. The motives behind malware vary: it can be used for financial gain, interrupting work, making political statements, or gaining bragging rights. Malware generally cannot physically damage system hardware or network equipment, but it can steal, encrypt, or erase data, change or hijack fundamental system operations, and track user activities. 

Organizations can leverage advanced behavioral analytics and real-time detection mechanisms to counteract such evolving threats. 

C. What Defines the Category of Malware Families? 

A malware family is a collection of related programs that share a significant amount of code. A single piece of malware evolves over time, and each new piece of malware that retains the family's specific characteristics is grouped into the same family.  

Malware variants are a subset of a malware family: they are notable derivations of the family's code base. All samples with the same provenance are contained in one malware variant. Petya is an example of a malware family, with variants such as Golden Petya and Green Petya. The color of the ransom note text is the most noticeable difference between the Petya variants. The terms "malware family" and "malware variant" should not be confused with the components Family and Variant in detection names. Detection names are human-readable names that correspond to specific detection signatures or technologies, and they are assigned by antimalware products and vendors. The purpose of a detection name is not identification, so it cannot be relied on to identify a malware family.  

  

D. KNN Classifier Algorithm in Bayesian Network 

The idea is to learn a local Naïve Bayes model: the k nearest neighbors of a test instance are selected from the training data, and the model built on them is used to classify that instance. A major challenge in combining the KNN and Naïve Bayes algorithms is that very little training data is available to the local model when k is small. 

 

Steps to Implement KNN with Naïve Bayes: 

  1. Import the k-nearest neighbor (KNN) algorithm via the Scikit-learn library
  2. Define the feature set and target variables
  3. Divide the dataset into two subgroups: testing and training
  4. Set up the K-NN model by specifying the number of neighbors (k)
  5. Train the model by fitting the data
  6. Make predictions for future instances
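The six steps above can be sketched with scikit-learn as follows. This covers only the plain KNN part of the listed steps, not the Naïve Bayes combination. The synthetic dataset from `make_classification`, the three classes, the 80/20 split, and `k = 5` are all assumptions made for illustration; they stand in for the real feature set and labels, which are not given in the article.

```python
# Step 1: import the KNN classifier and helpers from scikit-learn
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 2: define the feature set X and the target variable y.
# Synthetic placeholder data; a real run would load extracted malware features.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# Step 3: divide the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

# Step 4: set up the K-NN model by specifying the number of neighbors (k)
knn = KNeighborsClassifier(n_neighbors=5)

# Step 5: train the model by fitting the data
knn.fit(X_train, y_train)

# Step 6: make predictions for unseen instances
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

The value of `k` is a hyperparameter; in practice it would usually be chosen by cross-validation rather than fixed in advance.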

 

E. CNN Classifier Algorithm in Bayesian Network 

Let us start with the basics, such as what an image is and how it is represented, before moving on to how a CNN operates. A grayscale image is a matrix of pixel values with a single plane, whereas an RGB image is a matrix of pixel values with three planes. 
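The difference between the two representations can be shown with plain nested lists. This is a minimal sketch: the 2×2 pixel values are arbitrary, and the conversion uses the common ITU-R BT.601 luminance weights, which are an assumption here since the article does not say how grayscale images were produced.

```python
def shape(array):
    """Return the dimensions of a regular nested list, e.g. (rows, cols, channels)."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)


# A 2x2 RGB image: each pixel holds three values (red, green, blue planes).
rgb_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

# Collapse the three planes into one using BT.601 luminance weights.
gray_image = [
    [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
    for row in rgb_image
]

print(shape(rgb_image))   # three dimensions: rows, columns, channels
print(shape(gray_image))  # two dimensions: rows, columns
print(gray_image)
```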

 

Steps to Build and Train a CNN Model 

  1. Build the CNN model: Establish a Convolutional Neural Network (CNN) structure 
  2. Add hidden layers and features: 
     • Convolutional Layer 1: 30 filters, (3×3) kernel size 
     • Max Pooling Layer 1: 2×2 pooling size 
     • Convolutional Layer 2: 15 filters, (3×3) kernel size 
     • Max Pooling Layer 2: 2×2 pooling size 
     • Dropout Layer: Drops 25% of neurons to prevent overfitting 
     • Flatten Layer: Converts data into a 1D vector 
     • Dense Layer 1: 128 neurons with ReLU activation 
     • Dropout Layer: Drops 50% of neurons 
     • Dense Layer 2: 50 neurons with SoftMax activation 
     • Output Layer: Num_class neurons with SoftMax activation 
  3. Split the dataset: Divide data into training and testing sets 
  4. Visualize the data: Plot sample images to understand the dataset 
  5. Train the model on the sample dataset 
  6. Make predictions for future inputs using the trained model 
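As a sanity check on the layer list above, the sketch below traces tensor shapes and counts learnable parameters without any deep learning framework. It is not the author's code: the 64×64 single-channel input, the 25 output classes, unpadded stride-1 convolutions, and non-overlapping pooling are all assumptions, since the article does not state them.

```python
INPUT_SHAPE = (64, 64, 1)  # assumed: 64x64 grayscale image
NUM_CLASSES = 25           # assumed number of malware families


def conv2d(shape, filters, kernel=3):
    """Unpadded, stride-1 convolution: returns (output shape, parameter count)."""
    h, w, c = shape
    params = (kernel * kernel * c + 1) * filters  # weights plus one bias per filter
    return (h - kernel + 1, w - kernel + 1, filters), params


def max_pool(shape, size=2):
    """Non-overlapping max pooling: shrinks height and width, no parameters."""
    h, w, c = shape
    return (h // size, w // size, c), 0


def dense(n_in, n_out):
    """Fully connected layer: returns (output size, parameter count)."""
    return n_out, (n_in + 1) * n_out


summary = []  # (layer name, output shape or size, parameter count)

shape, p = conv2d(INPUT_SHAPE, filters=30)
summary.append(("Conv 1 (30 filters, 3x3)", shape, p))
shape, p = max_pool(shape)
summary.append(("Max pool 1 (2x2)", shape, p))
shape, p = conv2d(shape, filters=15)
summary.append(("Conv 2 (15 filters, 3x3)", shape, p))
shape, p = max_pool(shape)
summary.append(("Max pool 2 (2x2)", shape, p))
summary.append(("Dropout 25%", shape, 0))  # dropout changes neither shape nor parameters

flat = shape[0] * shape[1] * shape[2]
summary.append(("Flatten", flat, 0))
size, p = dense(flat, 128)
summary.append(("Dense 1 (128, ReLU)", size, p))
summary.append(("Dropout 50%", size, 0))
size, p = dense(size, 50)
summary.append(("Dense 2 (50)", size, p))
size, p = dense(size, NUM_CLASSES)
summary.append(("Output (SoftMax)", size, p))

total_params = sum(p for _, _, p in summary)
for name, out, p in summary:
    print(f"{name:<26} output={out!s:<14} params={p}")
print("Total learnable parameters:", total_params)
```

Under these assumptions, almost all parameters sit in the first dense layer, which is why the flatten size (and therefore the input resolution) dominates the model's footprint.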

  

II. LITERATURE REVIEW  

The three major issues identified in the literature are the inability to anticipate, high-level assumptions, and non-inferential analysis. This paper describes a system for providing instantaneous network access to a distributed pool of configurable technological assets that may be quickly allotted and released with no administrative activity or operator interaction. It also describes previous studies that evaluate malware properties using static, dynamic, and hybrid analysis and that perform automatic malware detection using Machine Learning techniques [2].  

Zero-day vulnerabilities cannot be identified because of the high-level assumptions caused by p(θ), as well as non-inferential analyses [4]. The ground-truth dataset (KDD-Cup99) was used to evaluate Bayesian models with the K-NN classifier, focusing on the confusion matrix and on accuracy, precision, recall, and F1-measure. The approach also categorizes malware into Spyware, Trojans, and Rootkits. Classification time on the testing dataset and prediction time on the training dataset are measured in milliseconds [1].  
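The four metrics named above all derive from the confusion matrix. The sketch below computes them for a binary case; the label vectors are invented purely for illustration and are not results from KDD-Cup99 or from this study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = malware, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a multi-class task such as Spyware/Trojan/Rootkit, the same counts are typically computed per class (one class versus the rest) and then averaged.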

 

A CNN model was developed to detect low-profile malware. Focusing on pre-process level performance indicators, the technique achieved 90% accuracy using resource consumption features and was able to identify various low-profile infections. A disadvantage of this study is that the authors only employed a shallow CNN model and did not compare different CNN models [3].  

 

Bayesian Networks (also known as Belief Networks) are an Artificial Intelligence paradigm for managing uncertainty that differs from deterministic approaches to understanding events [9]. Although Bayes' theorem was first published in 1763, its application in health management and medical decision-support systems is recent [10], and it is now frequently used in clinical decision support [9].  

 

 III. OBJECTIVES  

[1] The main aim of this paper is to survey the available methods and analyze network behavior using Bayesian networks.  

[2] It provides a detailed overview of models for measuring security using a Bayesian Network with an attack graph. 

[3] It gives an understanding of how to construct a qualitative Bayesian Network model for distinguishing attacks from technical failures, and introduces a model to detect cybersecurity threats based on a Bayesian Network.  

[4] It concentrates on obtaining results using additional models from image classification competition leaderboards.  

Table 1: Comparative Analysis  

| Sr no | Paper title | Study Parameters | Statistical Findings | Parameter Limitations |
|---|---|---|---|---|
| 1 | Network-Level Behavioral Malware Analysis Model based on the Bayesian Network [1] | Hybrid detection. Decreased attributes θ and insufficient earlier data p(θ). Naive Bayes | This paper achieves 98.6% accuracy for training and testing and 90.2% prediction accuracy on 488,754 records. | Behavioral-based analysis is highly related to the heuristic approach to speeding up the process of finding satisfactory solutions, especially when dealing with real-time traffic. |
| 2 | Measuring network security using Bayesian Network-based Attack Graphs [2] | Taxonomy, hybrid detection, source and spoof trace IP detection | The optimized hybrid detection framework maintained a high detection accuracy of 92.71% | The framework uses a reduced feature set and multiple threads. Consider multiple attacks rather than merely classifying the data in a packet. |
| 3 | Bayesian network model to distinguish between intentional attacks and accidental technical failures: a case study of floodgates [3] | Prediction algorithms. Cost and attack awareness resource allocation | The robust minimum support level is 0.0015 | The proposed (ICS) architecture has shown promising values of accuracy. Not working on detection of undefined/unclassified attacks |
| 4 | Bayesian Networks for Online Cybersecurity threat detection [4] | Attack Tree, M2M transformation | Increases the expected threat occurrence probability, even though no BN node will be set to 'True' | The highest result obtained for the model, through IP traceback, was 91.51%; it can be increased. |
| 5 | Two-stage hybrid malware detection using Deep Learning [5] | Two-stage hybrid malware detection using a 2-MaD scheme. | At the lowest validation loss value of 0.1545, the accuracy was 93.80%. In addition, the test accuracy was 93.89% | Among 37,216 datasets, 31,310 were used as training datasets and 4,476 were used as validation datasets. |

 

IV. RESEARCH METHODOLOGY 

This section begins with a descriptive analysis of the PCAP files extracted from the baseline TCP/UDP traffic of the live production network of Malaysia's healthcare provider. This research activity is stage 1 of the research approach. The graph below shows the baseline traffic distribution, including window size, frame length, delta time, and source and destination traffic.  

Figure 1.3 Class Load Dataset 

Tang replaced the SoftMax layer with a linear SVM and tested the approach on the MNIST and CIFAR-10 datasets and on the expression recognition challenge of the ICML Representation Learning Workshop. The SVM is a linear, maximum-margin classifier. In CNN-SVM, the CNN extracts features from the input images and a linear SVM performs the classification. Agarap and Pepito applied CNN-SVM to the Malimg dataset and reported an accuracy of 97.22%. 

  

 


Figure 1.4 Plotting Images  

First, import the necessary packages. Matplotlib is used for plotting, while the os module allows easy reading and loading of files from your computer. Additionally, cv2 is the Python module of OpenCV, a library commonly used for solving computer vision problems. To install it, run pip install opencv-python in your terminal. Similarly, install Keras, which serves as an API for building neural networks, using pip install keras.  
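Before any plotting, the os module mentioned above can be used to check how many image files each class contains. The sketch below assumes a layout with one sub-directory per malware family containing PNG files; that layout, the function name `count_samples_per_class`, and the family names are hypothetical. It builds a small temporary directory so that it runs without the real dataset.

```python
import os
import tempfile


def count_samples_per_class(root):
    """Count .png files in each immediate sub-directory of `root`."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for f in os.listdir(class_dir) if f.lower().endswith(".png")
            )
    return counts


# Demonstration on a throwaway directory with placeholder (empty) files.
with tempfile.TemporaryDirectory() as root:
    for family, n_files in (("family_a", 3), ("family_b", 1)):
        os.makedirs(os.path.join(root, family))
        for i in range(n_files):
            path = os.path.join(root, family, f"sample_{i}.png")
            open(path, "wb").close()
    demo_counts = count_samples_per_class(root)

print(demo_counts)
```

With the real dataset, the returned dictionary could be passed to a Matplotlib bar chart to visualize class imbalance before training.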


Figure 1.5 CNN Model  

  

In this step, a Convolutional Neural Network (CNN) is created to classify and detect malware families, focusing on the key features found during feature extraction. The model incorporates two main layer types: Conv2D for feature detection and Pooling for dimensionality reduction. A pooling layer performs down-sampling, reducing the spatial dimensions of the feature maps. This helps introduce translation invariance to small shifts and distortions while also minimizing the number of learnable parameters in subsequent layers. It is important to remember that pooling layers do not contain any learnable parameters. However, like convolution operations, pooling operations have hyperparameters such as the filter size, stride, and padding, which influence how the pooling is performed.  
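The pooling operation described above can be written in a few lines of plain Python. This is a minimal sketch without padding; the 4×4 feature map is an arbitrary example. Note that the function has only hyperparameters (`size`, `stride`) and nothing that would be learned during training.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Slide a size x size window over the map and keep the maximum of each window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

pooled = max_pool2d(feature_map, size=2, stride=2)      # 4x4 -> 2x2
overlapping = max_pool2d(feature_map, size=2, stride=1)  # 4x4 -> 3x3
print(pooled)
print(overlapping)
```

With a stride equal to the window size, each dimension is halved; a smaller stride produces overlapping windows and a larger output.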


 

Figure 1.6 Comparison of KNN and CNN model  

In this step, the K-Nearest Neighbor and Convolutional Neural Network models are compared in terms of accuracy, precision, recall, and F1-measure. CNN is identified as the most accurate model, as it has the best performance on this particular dataset.  

  

V. EXPERIMENTAL RESULTS  

All experiments were carried out in the Python programming language. The table presents the findings of the analysis, which include precision, recall, and accuracy, as measured on the assessment dataset from the results of the machine learning algorithms. Various machine learning methods are assessed using the phishing dataset, which is split into two mutually exclusive sets: 80% for training and 20% for testing. Different classifiers and approaches exist for the same machine learning algorithm. Based on the analysis results, the Gradient Boosting Classifier emerges as the most effective machine learning methodology. 

  

VI. SUMMARY  

According to this paper's findings, the KNN method can be applied to malware classification, but it is ineffective at detecting malware in these circumstances. To achieve the highest accuracy for malware detection, it is recommended to employ the CNN (Convolutional Neural Network) algorithm.  

  

VII. CONCLUSION AND FUTURE WORK  

In this work, we applied the K-nearest neighbor algorithm to classify and predict the malware type, then computed the accuracy and confusion matrix. The accuracy obtained with the KNN algorithm is 46%. To address this, a feature selection technique is used to find the best accuracy in the Bayesian model. In addition to developing the Bayesian network model through feature selection and distribution function modeling, the approach will continue to be refined using Dynamic Bayesian Networks to incorporate temporal domain measurements defined in the CVSS framework. The descriptive analysis was based on PCAP files extracted from the baseline TCP/UDP traffic of the live production network of Malaysia's healthcare provider. 

 

Future work includes defending devices from malware. Several antivirus systems now employ machine learning techniques, and deep learning architectures have been shown to be effective at detecting malware in Windows PE (Portable Executable) binaries. We compared the performance of two classifiers on a malware image dataset. We have also used three more CNN models from the ImageNet Large Scale Visual Recognition Challenge as grayscale malware image classification models. The two models were successfully trained on the Malimg dataset [10], and the findings show that the CNN model outperforms previous efforts, achieving the best classification results on grayscale malware images.  

 

VIII. REFERENCES   

  1. M. H. Mohd Yusof, M. R. Mokhtar, A. Mohd. Zin, C. Maple (2018). Embedded feature selection method for a network-level behavioral analysis detection model. International Journal of Advanced Computer Science and Applications, 9(12), 509-517.  

  2. J. Singh and J. Singh, "A survey on machine learning-based malware detection in executable files," Journal of Systems Architecture, vol. 112, article no. 101861, 2021.  

  3. Lingyu Wang, Anoop Singhal, Sushil Jajodia, "Toward Measuring Network Security Using Attack Graphs," Proc. QoP 2007, Oct 29, 2007.  

  4. Elmrabit N, Yang S-H, Yang L, Zhou H (2020) Insider threat risk prediction based on Bayesian network. Computers & Security 96:101908. https://doi.org/10.1016/j.cose.2020.101908 

  5. L. Naranjo, C. J. Perez, Y. Campos-Roca, & J. Martin. 2016. Addressing voice recording replications for Parkinson's disease detection. Expert Systems with Applications 2016.  

  6. P. Wang and Y. Z. Wang. Malware behavioral detection and vaccine development by using a support vector model classifier. Journal of Computer and System Sciences, 81(6), 1012-1026. doi: 10.1016/j.jcss.2014.12.014.  

  7. B. Rahbarinia, R. Perdisci & M. Antonakakis. 2015. Efficient Behavior-Based Tracking of Malware-Control Domains in Large ISP Networks. 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks; 403-414. doi:10.1109/dsn.2015.35  

  8. S. G. Nari. "Automated Malware Classification based on Network Behavior." 2013 International Conference on Computing, Networking and Communications, Communications and Information Security Symposium.  

  9. Anshul Tayal, Nishchol Mishra, and Sanjeev Sharma. Active monitoring and postmortem forensic analysis of network threats: A survey, 2017.  

  10. Malware families found in the MALIMG dataset [12]. (n.d.) Retrieved from https://www.researchgate.net/figure/Malwarefamilies-found-in-the-Malimg-Dataset12_tbl3_322221656 

  11. Agarap A.F.M. A Neural Network Architecture Combining Gated Recurrent Unit (GRU) And Support Vector Machine (SVM) For Intrusion Detection in Network Traffic Data. 10th International Conference on Machine Learning and Computing, pages 26-30, 2018.  

  12. Abdel Bellili, Michel Gilloux, and Patrick Gallinari. A Hybrid MLP-SVM Handwritten Digit Recognizer. In Proceedings of Sixth International Conference on Document Analysis and Recognition, pages 28-32, 2001.  

  13. Viet Tra, Sheraz Khan, and Jongmyon Kim. Diagnosis of bearing defects under variable speed conditions using energy distribution maps of acoustic emission spectra and convolutional neural networks. The Journal of the Acoustical Society of America, 144:EL322-EL327, 10 2018.  

  14. Stolfo, S. J., Wei, F., Wenke, L., Prodromidis, A. & Chan, P. K. 2000. Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project. Proceedings DARPA Information Survivability Conference and Exposition. DOI: 10.1109/DISCEX.2000.821515.  

  15. Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, and Jianfei Cai. Recent Advances in Convolutional Neural Networks. Pattern Recognition, 77:354-377, 2018.  

 

Author: 

Priyanka Jadav 

Priyanka Jadav is an Engineer at eInfochips in the Cybersecurity domain. She specializes in IoT/Cyber Security. She has expertise in Malware Analysis, Web Application & Mobile Vulnerability Assessment and Penetration Testing (VAPT) and Machine Learning. She holds a Master's degree in Cyber Security from Gujarat Technological University. 

 



© Copyright nasscom. All Rights Reserved.