add infos about experiments
This commit is contained in:
parent
5b975fc333
commit
1fb7210797
1 changed files with 37 additions and 26 deletions
@@ -214,6 +214,9 @@ In the context of \gls{ids} for network equipment, we considered power consumpti
After initial tests, power consumption proved to provide the most information about the system state relative to the practicality of its measurement.
In our setup, the power consumption of the device is measured in two different ways: measurement at the \gls{ac} line (between the device's \gls{psu} and the power outlet); and measurement at the \gls{dc} power (from the \gls{psu} to the motherboard of the device).
For both \gls{ac} and \gls{dc}, a power measurement box is placed in series with the main power cable.
The box measures the voltage drop generated by the current flowing through a shunt resistor.
This box samples the voltage at one megasample per second (1~MSPS).
During every operation of the device, the different instructions have an impact on the overall power consumption \cite{727070} and are detectable in both the \gls{ac} and \gls{dc} power consumption.
\gls{ac} power traces are less intrusive to capture than \gls{dc} power traces and offer the most transparent way to retrofit the proposed system to different devices.
However, their \gls{snr} is lower compared to the \gls{dc} measurement because the \gls{ac}/\gls{dc} switching converter introduces a buffering of electrical energy, thus hiding some of the fine-grained details.
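The shunt-based measurement above can be sketched in a few lines. The shunt resistance and nominal supply voltage below are illustrative assumptions (the text does not state their values); Ohm's law then converts the sampled voltage drop into instantaneous power.

```python
import numpy as np

# Assumed values for illustration only; not the setup's actual components.
R_SHUNT = 0.05    # shunt resistance in ohms (assumption)
V_SUPPLY = 12.0   # nominal DC supply voltage in volts (assumption)

def shunt_to_power(v_drop: np.ndarray) -> np.ndarray:
    """Convert voltage drops sampled across the shunt into power draw.

    Ohm's law gives the current through the shunt (I = V / R);
    multiplying by the supply voltage yields instantaneous power.
    """
    current = v_drop / R_SHUNT   # amperes
    return current * V_SUPPLY    # watts

# One second of samples at 1 MSPS with a constant 25 mV drop.
v_drop = np.full(1_000_000, 0.025)
power = shunt_to_power(v_drop)   # 0.5 A * 12 V = 6 W per sample
```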
@@ -276,8 +279,11 @@ This process enables training accurate machine-learning models (see Table~\ref{t
Both temporal and frequency domains are investigated.
For the temporal domain, the preprocessed \gls{ac} or \gls{dc} time series are considered.
For the frequency domain, the \gls{psd} of the \gls{dc} data serves as input data.
A total of 800 boot-up traces were collected using an automated script to control the state of the switch.
Out of these 800 traces, 14 were discarded due to errors in the capture procedure (cropped measurements).
For all evaluations, the data are divided into training and test sets with a ratio of 70:30.
Figure~\ref{fig:firmwares} illustrates the captured data for two different firmware versions in the temporal and frequency domains.

\begin{figure}
\begin{subfigure}{0.49\textwidth}
@@ -298,8 +304,10 @@ Figure~\ref{fig:firmwares} illustrates the influence on the boot-up sequence in
\subsection{Experiment 1: Classifying Firmware Versions}
\label{Classifying-Firmware-Versions}
Given a power trace during boot-up and boot-up traces for ten different firmware versions, the goal of Experiment I.1 is to predict which firmware is currently installed on the device.
The result can be used to confirm successful firmware updates and to check whether the device reports the correct version.

Two classification models were evaluated to solve this task: \gls{rfc} and \gls{svm}.
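As a sketch of this setup, the snippet below trains a random forest and an SVM on synthetic stand-in traces; the class offsets, trace length, sample counts, and hyperparameters are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for preprocessed boot-up traces: 10 firmware
# classes, each a noisy trace with a class-dependent offset (assumption).
n_per_class, trace_len = 60, 200
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_class, trace_len))
               for c in range(10)])
y = np.repeat(np.arange(10), n_per_class)

# 70:30 train/test split, matching the ratio used for the evaluations.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)

print(f"RFC accuracy: {rfc.score(X_te, y_te):.2f}")
print(f"SVM accuracy: {svm.score(X_te, y_te):.2f}")
```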
\subsubsection{Results}
The \gls{rfc} delivers the best results of all tested \gls{ml} algorithms across both the time and frequency domains.
@@ -324,17 +332,15 @@ The \gls{rfc} delivers the best results of all tested \gls{ml} algorithms over b
& \gls{svm} & \numprint[\%]{95.3} & \numprint[\%]{96.0} \tabularnewline
\bottomrule
\end{tabular}
\caption{Comparison between the different algorithms for firmware classification.}
\label{tab:fw-results}
\end{table}

\subsection{Experiment 2: Detecting Firmware Change}
Given the most recently collected power trace during boot-up and the power trace collected before it, the goal of Experiment I.2 is to predict whether the firmware has been altered between these two traces.
The model combines \gls{dtw} with a training procedure on the collected traces; training derives a distance threshold that serves as a model parameter for deciding whether the firmware version has changed.

\subsubsection{Model}
The model uses a windowed \gls{dtw} to compute the distance between the current and the previous power traces collected during the boot-up.
The resulting \gls{dtw} distance is then compared against the model parameters.
The model has the parameter $D_{\max}$ (maximum distance).

@@ -345,7 +351,6 @@ For each class $j$ from the training set, the training computes the \gls{dtw} di
The results determine the parameters of the model.
The maximum distance for class $j$ is defined as $d_{j_{\max}}$ and the variance of the distances as $\sigma_j$.

The decision for each class can be made using the class's distance and variance.
A global decision, combining all classes, is achieved by introducing the parameter $D_{\max}$.
This parameter is the mean of $d_{j_{\max}}$ across all classes.
@@ -365,6 +370,7 @@ where $a$ and $b$ denote two boot-up samples.
The equivalence case denotes that there is no change in the firmware.
Because $D_{\max}$ is the average of all $d_{j_{\max}}$ values, it falls within the range of observed distances.
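The change-detection rule can be sketched as follows. The dynamic-programming DTW below uses a Sakoe-Chiba band as the windowing scheme; the band width is an assumption, and the threshold comparison mirrors the $D_{\max}$ decision described above.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray, window: int = 50) -> float:
    """Windowed (Sakoe-Chiba band) DTW distance between two 1-D traces.

    The band width of 50 is an assumed value for illustration.
    """
    n, m = len(a), len(b)
    w = max(window, abs(n - m))          # band must cover the length gap
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def firmware_changed(current: np.ndarray, previous: np.ndarray,
                     d_max: float) -> bool:
    """Flag a firmware change when the DTW distance exceeds D_max."""
    return dtw_distance(current, previous) > d_max
```

A distance at or below $D_{\max}$ is treated as the equivalence case, i.e., no change in the firmware.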
\subsubsection{Results}
Training and test results indicate that the model achieves \numprint[\%]{99} accuracy when $D_{\max}$ is \numprint{27.16}.
The test data were collected under the same conditions as the training data and include both firmware versions that were present during training and versions that were not.
The test data were never used during training; the evaluation applies the notation above with the parameters set during the training process.
Based on the model accuracy, the model generalizes with the introduced $D_{\max}$ without requiring any input from the firmware classification model.
@@ -372,7 +378,7 @@ Based on the model accuracy, a generalization of the model is possible with the
\section{Experiment Family II: Run-Time Monitoring}
\label{RunTime}
\gls{ssh} allows users to securely access a remote device even if the network is insecure.
Systems that enable \gls{ssh} access maintain logs of login attempts.
However, maintaining a log of the login attempt history proves futile since an attacker with system control can forge these log entries.
Since side-channel \glspl{ids} focus only on external properties and are independent of the systems they monitor, they can defend against an attacker forging log entries.
@@ -381,7 +387,8 @@ Since side-channel \glspl{ids} only focus on external properties and are indepen
\subsection{Experiment 1: Detecting SSH Login Attempts}
\label{detect_ssh}
Given an arbitrary power trace sample from the monitored device, the goal of Experiment II.1 is to predict whether the sample includes an SSH login attempt.

\subsubsection{Feature Engineering}
The signal collected from the network switch is a time series $T_1 \triangleq \{x \in \mathbb{R}\}$ sampled at \numprint[MHz]{1} and then downsampled by a factor of \numprint{1000}, which results in one sample per millisecond.
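The downsampling step might be implemented as below; block averaging is an assumption, since the text states only the factor of 1000.

```python
import numpy as np

def downsample(trace: np.ndarray, factor: int = 1000) -> np.ndarray:
    """Downsample by block-averaging (an assumed scheme; the text only
    states the factor). 1 MHz input becomes 1 sample per millisecond."""
    usable = len(trace) - len(trace) % factor   # drop the ragged tail
    return trace[:usable].reshape(-1, factor).mean(axis=1)

# Two seconds at 1 MSPS become 2000 samples (1 per millisecond).
raw = np.random.default_rng(1).normal(size=2_000_000)
ds = downsample(raw)
```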
@@ -416,7 +423,6 @@ The skew in the dataset makes the model more certain while predicting a positive
The \gls{svm} model trained on data in the temporal domain using the Gaussian kernel configured with $C = 1$ and $\gamma = 0.1$ achieved an accuracy of \numprint[\%]{98}.
The \gls{rfc}, configured with 500 trees and a maximum depth of 50, performed equally well and achieved an accuracy of \numprint[\%]{97}, also in the temporal domain.

Lastly, a \gls{1dcnn} trained on a mix of data from both the time and frequency domains achieves an accuracy of \numprint[\%]{95} and minimizes the \gls{fpr} to \numprint[\%]{1}.
However, it has the highest \gls{fnr}.
@@ -465,16 +471,22 @@ Thus, \gls{svm} had the best accuracy rates along with the lowest \gls{fnr} and
\end{table}

\subsection{Experiment 2: Classifying SSH Login Attempts}
Given a window of power trace containing an SSH login attempt, the goal of Experiment II.2 is to classify the login attempt as successful or unsuccessful.
\subsubsection{Feature Engineering}
This experiment builds on top of Experiment II.1 and classifies the detected \gls{ssh} login attempts as successful or failed.
The experiment considers the data only in the temporal domain.
The matrix representation for this experiment is a slight modification of the previous one: $Z = \{ S_{1}, S_{2}, \ldots , S_{L}\}$ with rows $S_i$ and $\forall i,j: |S_i|=|S_j|$, and the accompanying set of labels $Y_{Z} = \{ y_Z \in \{-1,1\}^{L}\}$, where $L$ is the total number of windows, $S_i$ is a window of \numprint{500} samples in the time domain, and every window corresponds to either a successful or a failed SSH login attempt.
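Building $Z$ and its labels can be sketched as follows; the attempt start indices and labels are hypothetical inputs standing in for the detections produced by Experiment II.1.

```python
import numpy as np

WINDOW = 500   # samples per window, as in the experiment

def build_windows(trace: np.ndarray, attempt_starts, labels):
    """Cut fixed-length windows around detected SSH login attempts.

    `attempt_starts` holds the sample index where each attempt begins and
    `labels` holds +1 (successful) or -1 (failed); both are hypothetical
    inputs for illustration.
    """
    Z, y = [], []
    for start, label in zip(attempt_starts, labels):
        window = trace[start:start + WINDOW]
        if len(window) == WINDOW:        # skip attempts cut off at the end
            Z.append(window)
            y.append(label)
    return np.asarray(Z), np.asarray(y)
```

Every row of the returned matrix has the same length, matching the constraint $\forall i,j: |S_i|=|S_j|$ above.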
\subsubsection{Results}
Models trained using \glspl{svm} and \gls{1dcnn} gave the best results for the classification, along with the lowest \gls{fpr} and \gls{fnr}.
Optimizing the parameters of the \gls{rfc} (250 trees), the \gls{svm} ($C = 100$, $\gamma = 10$, Gaussian kernel), and the \gls{1dcnn}, the accuracy scores reached \numprint[\%]{96.7}, \numprint[\%]{98.5}, and \numprint[\%]{98.6}, respectively.
Table~\ref{tab:ssh-classification-precision-comparison} details all the results.

The experiment uses the 4095 samples extracted from experiment~\ref{detect_ssh} that include only successful and unsuccessful SSH attempts.
Of all the samples, 65\% form the training set, 15\% the validation set, and the remaining 20\% the test set.
Testing is done over roughly 800 samples of 500 features each.
The \gls{svm} model performed the best and had the lowest \gls{fpr} and \gls{fnr}.

\begin{table}[ht]
@@ -514,7 +526,7 @@ Hence, currently, the only way to identify unauthorized hardware modification is
\subsection{Experiment 1: Identifying Number of Expansion Modules}
\label{expe:hardware-1}
Given an arbitrary power trace sample from the monitored device, the goal of Experiment III.1 is to predict the number of expansion modules installed on the device.
In this experiment, modules were installed or removed only between captures, never during a capture.

\subsubsection{Feature Engineering}
@@ -527,23 +539,22 @@ Within each trace, the program picked ten snippets of five values.
Those values for the number and length of snippets correspond to the minimum training time needed to achieve \numprint[\%]{100} accuracy with a stratified 10-fold cross-validation setup on the data used in this experiment.
The average value of each snippet is then computed.
The final training dataset is a 1D array of shape $(\numprint{1380},1)$.
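The snippet-averaging step might look like the sketch below. The random placement of the snippets and the use of 138 traces (10 snippets each, giving the 1380 rows) are assumptions for illustration.

```python
import numpy as np

SNIPPETS_PER_TRACE = 10
SNIPPET_LEN = 5

def snippet_averages(trace: np.ndarray, rng) -> np.ndarray:
    """Pick ten snippets of five values from a trace and average each.

    Random snippet placement is an assumption; the text states only the
    number and length of the snippets.
    """
    starts = rng.integers(0, len(trace) - SNIPPET_LEN,
                          size=SNIPPETS_PER_TRACE)
    return np.array([trace[s:s + SNIPPET_LEN].mean() for s in starts])

rng = np.random.default_rng(0)
traces = rng.normal(size=(138, 10_000))   # 138 hypothetical DC traces
dataset = np.concatenate([snippet_averages(t, rng) for t in traces])
dataset = dataset.reshape(-1, 1)          # 1D feature column, shape (1380, 1)
```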
\textbf{\gls{ac} data:} Each number of expansion modules will cause a different pattern in the fundamental \numprint[Hz]{60} wave of the \gls{ac} power consumption.
To create the training dataset, the preprocessing program extracted $N$ periods of the fundamental wave by detecting consecutive local minima in the trace.
%Depending on the number $N$, the model achieved different results (see Table \ref{tab:periods_ac}).
The extracted periods of \numprint{3333} data points (one period of the \numprint[Hz]{60} wave, captured at 1~MSPS and decimated by a factor of 5) constitute the training set of shape $(\numprint{4320},\numprint{3333})$.
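Period extraction by consecutive local minima can be sketched as follows; the exact-length filter is a simplifying assumption (real traces would need a tolerance around 3333 samples).

```python
import numpy as np

PERIOD = 3333   # samples per 60 Hz period at 1 MSPS decimated by 5 (200 kSPS)

def extract_periods(trace: np.ndarray) -> np.ndarray:
    """Slice a decimated AC trace into fundamental-wave periods by
    detecting consecutive local minima (sketch of the described scheme)."""
    # Local minima: interior points strictly lower than both neighbours.
    minima = np.where((trace[1:-1] < trace[:-2]) &
                      (trace[1:-1] < trace[2:]))[0] + 1
    periods = []
    for lo, hi in zip(minima[:-1], minima[1:]):
        if hi - lo == PERIOD:            # keep only full-length periods
            periods.append(trace[lo:hi])
    return np.asarray(periods)
```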
\subsubsection{Results}
The average \gls{dc} values do not overlap for different numbers of expansion modules installed.
This property enables both \gls{svm} and \gls{knn} to perfectly classify the number of modules installed.
The \gls{svm} model trained with a linear kernel performed the same as the \gls{knn} model with $K=1$.
Both methods classify the traces with a \numprint[\%]{100} accuracy on \gls{dc} data.

The \gls{ac} periods exhibit different patterns depending on the number of modules but remain similar at some points and do not separate as clearly as the \gls{dc} averages.
The \gls{svm} model was able to identify the number of modules installed with an accuracy of \numprint[\%]{99}.

Results from Table~\ref{tab:hardware-results} show that \gls{dc} data yields the best results.
These high accuracy and recall figures result from the non-overlapping grouping of the average \gls{dc} consumption values.
The results presented are produced with a stratified 10-fold cross-validation setup.

\begin{table}[ht]
\begin{center}