Machine learning for network traffic classification under labeled data and training time constraints

Hussey, Jason P.

Advisor

Camp, Tracy
Stone, Kerri A.

Date Issued

2023

Abstract

This thesis investigates machine learning (including deep learning) for network traffic classification under two constraints: too little labeled data and too little time to train models from scratch. Network traffic classification is essential to network security, network management, and application identification. Labeling network traffic data, however, is often time-consuming and expensive, which limits the amount of labeled data available for training machine learning models. To address the lack of labeled data, this thesis investigates a semi-supervised learning approach that leverages positively labeled and unlabeled data to improve classification performance. The method combines bootstrap aggregation with tree-based classifiers to successfully classify unlabeled network traffic flows that belong to the same class as the positively labeled flows. The same semi-supervised approach also successfully detects zero-day (i.e., never-before-seen) encrypted messaging applications for which no training data is available. Additionally, this thesis investigates deep transfer learning from a state-of-the-art computer vision model for network traffic image classification. Representing network traffic flows as grayscale images allows highly sophisticated image classification models to be transferred to the task of network traffic classification. Using these advanced models as the source for transfer dramatically reduces the time needed to train new models, addressing the constraint of having too little time for training. To investigate whether deep transfer learning succeeds in network traffic image classification, this work used our network flow capture system (which creates a volume of unlabeled data) and commercial appliances (to turn the unlabeled dataset into a real-world labeled dataset).
Experimental results in this thesis demonstrate that positive and unlabeled learning, a semi-supervised technique, is highly effective at detecting hidden positives among unlabeled data. Furthermore, this thesis shows that representing network traffic flows as grayscale images allows state-of-the-art image classification models (e.g., ResNet) to transfer effectively to the domain of network traffic classification.
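
As an illustration of the grayscale representation described in the abstract, the following minimal Python sketch maps the raw bytes of a flow to a fixed-size grid of pixel intensities. It is not taken from the thesis: the function name `flow_to_grayscale`, the 28 x 28 image size, and the truncate-and-zero-pad policy are all assumptions chosen for illustration, and the thesis may use a different size or preprocessing.

```python
from typing import List

# Assumed image width/height; the abstract does not state the size used.
IMAGE_SIDE = 28


def flow_to_grayscale(payload: bytes, side: int = IMAGE_SIDE) -> List[List[int]]:
    """Map the first side*side bytes of a flow to a side x side grid of 0-255 pixels.

    Hypothetical helper: truncation and zero-padding are assumed policies.
    """
    n_pixels = side * side
    # Truncate long flows and zero-pad short ones so every image has the same shape.
    fixed = payload[:n_pixels].ljust(n_pixels, b"\x00")
    # Each byte is already an integer in 0..255, i.e. one grayscale intensity.
    return [list(fixed[row * side:(row + 1) * side]) for row in range(side)]


if __name__ == "__main__":
    # Made-up byte string standing in for captured flow bytes.
    image = flow_to_grayscale(b"\x16\x03\x01\x02\x00" * 10)
    print(len(image), len(image[0]))
```

A grid like this could then be scaled or replicated across channels to match the input expected by an image model such as ResNet; that step depends on the model and is left out here.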

Rights

Copyright of the original work is retained by the author.
