% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.21 of 2022/01/12
%
\documentclass[runningheads]{llncs}
%
%\usepackage[T1]{fontenc}
% T1 fonts will be used to generate the final print and online PDFs,
% so please use T1 fonts in your manuscript whenever possible.
% Other font encondings may result in incorrect characters.
%
%\usepackage{graphicx}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
\usepackage[toc,acronym,abbreviations,nonumberlist,nogroupskip]{glossaries-extra}
\usepackage{numprint}
\usepackage{tabularx}
\usepackage{booktabs}
\usepackage{cite}
\usepackage{amsmath,amssymb,amstext}
\usepackage{xcolor}
\usepackage{textcomp}% http://ctan.org/pkg/amssymb
\usepackage{pifont}% http://ctan.org/pkg/pifont
\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\textbullet}%
\usepackage[hidelinks]{hyperref}
%\usepackage{flushend}
\usepackage[pdftex]{graphicx}
\usepackage{subcaption}
\usepackage{multirow}
%\usepackage{adjustbox}

% If you use the hyperref package, please uncomment the following two lines
% to display URLs in blue roman font according to Springer's eBook style:
%\usepackage{color}
%\renewcommand\UrlFont{\color{blue}\rmfamily}

\input{acronyms}

\newcommand\agd[1]{{\color{red}$\bigstar$}\footnote{agd: #1}}
\newcommand\cn[0]{{\color{purple}$^\texttt{[citation needed]}$}}
\begin{document}
%
\title{Side-channel Based Runtime Intrusion Detection for Network Equipment}
% REAL AUTHORS and CONTACT ==============================
%\author{Arthur Grisel-Davy\and
%Julian Dickert\and
%Sebastian Fischmeister\and
%Goksen U. Guler\and
%Waleed Khan\and
%Carlos Moreno\and
%Jack Morgan\and
%Shikhar Sakhuja\and
%Philippe Vibien.
%}
%\authorrunning{Grisel-Davy et al.}
%
%\institute{University of Waterloo, Canada. \\
%agriseld@uwaterloo.ca}

% FAKES/ANONYMOUS
%
\author{
\phantom{
    \begin{minipage}{\textwidth}
Anon. Anonymous\and
Anon. Anonymous\and
Anon. Anonymous\and
Anon. Anonymous\and
Anon. Anonymous\and
Anon. Anonymous\and
Anon. Anonymous\and
Anon. Anonymous\and
Anon. Anonymous
    \end{minipage}
}
}
\authorrunning{ }

\institute{ ~\\ 
 }
%
\maketitle              % typeset the header of the contribution
%
\begin{abstract}
    Current security protection mechanisms for embedded systems often include running a \gls{hids} on the system itself.
    \glspl{hids} cover a wide attack surface but still present some blind side and vulnerabilities.
    In the case of a compromized device, the detection capability of its \gls{hids} becomes untrustworthy.
    In this context, embedded systems such as network equipment remain vulnerable to firmware and hardware tampering, as well as log manipulation.

    Side-channel emissions provide an independent and extrinsic source of information about the system, purely based on the physical by-product of its activities.
    Leveraging side-channel information, we propose a physics-based \gls{ids} as an additional layer of protection for embedded systems.
    The physics-based \gls{ids} uses machine-learning-based power analysis to monitor and assess the behaviour and integrity of network equipment. 

    The \gls{ids} successfully detects three different classes of attacks on an HP Procurve Network Switch 5406zl: (i)~firmware manipulation with \numprint[\%]{99} accuracy, (ii)~brute-force SSH login attempts with \numprint[\%]{98} accuracy, and (iii)~hardware tampering with \numprint[\%]{100} accuracy.
    The machine-learning models require a small number of power traces for training and still achieve a high accuracy for attack detection.
    The concepts and techniques discussed in the paper can also extend to offer intrusion detection for embedded systems in general.

\keywords{side-channel\and power analysis\and intrusion detection.}
\end{abstract}
%
%
\glsresetall % reset all acronyms to be expanded on first use.

\section{Introduction}
Data centers are experiencing unprecedented growth~\cite{osti_1372902} because of the increased reliance on cloud services with cloud-based attacks representing \numprint[\%]{45} of data breaches in 2022~\cite{datacenterbreach}.
The downtime of data centers costs companies hundreds of thousands of dollars per hour~\cite{6848725}.

All data centers use network equipment such as network switches and routers.
A successful attack on a network switch can have devastating effects on the integrity of the data center.
To deter cases of cyberattacks, data centers often use \gls{ids}.
Current \glspl{ids} use different approaches to detect intrusions. 
\glspl{hids} are implemented directly on the monitored device and leverage information provided by the system to detect intrusions.
\glspl{nids} leverage network information to detect intrusions at the network level.
Although \glspl{hids} and \glspl{nids} offer intrusion detection capabilities, they are still ineffective against attacks such as firmware modification~\cite{cisco_trust,thomson_2019}, bypassing secure boot-up~\cite{Cui2013WhenFM, hau_2015}, log tampering~\cite{koch2010security}, or hardware tampering\cite{rohatgi2009electromagnetic}.

The literature shows promising work in improving the state-of-the-art in security by analyzing side-channel emissions from embedded systems.
Systems generate side-channel emissions, which usually reflect their activity in the form of power consumption \cite{kocher1999differential, brier2004correlation, Moreno2018}, electromagnetic waves \cite{khan2019malware, sehatbakhsh2019remote}, acoustic emissions \cite{genkin2014rsa, liuacoustic}, etc. 
Side-channel-based \glspl{ids} analyze side-channel emissions and can complement state-of-art \glspl{ids}, as shown in this paper.
The \gls{ids} uses \gls{dsp} and \gls{ml} to detect anomalies or recognize patterns of previously detected intrusions. 
Thus, using this \gls{ids} would improve the security of the embedded system by detecting attacks that regular \glspl{ids} fail to identify.

\subsection{Contributions}

This paper proposes a side-channel-based \gls{ids} --- also called physics-based \gls{ids} --- that can complement existing \glspl{ids} and improve security for embedded systems. 
The side-channel-based \gls{ids} can potentially protect any embedded system treated as a black box and detect a range of attacks against it.
Our \gls{ids} is deployed on an HP Procurve 5406zl network switch as a black box.
The experiments in the paper illustrate the \gls{ids} capabilities of detecting firmware manipulation and hardware tampering attacks against the switch and defending against log entry forging through log verification.

The side-channel-based \gls{ids} achieves near-perfect accuracy scores despite using simple \gls{dsp} methods and \gls{ml} algorithms. The algorithms analyze \gls{ac} and \gls{dc} power consumption of the network switch to detect these attacks. 
%The experiments use a relatively small dataset that contains roughly \numprint{1000} power traces.

\subsection{Paper Organization}

The paper is organized as follows:
Section~\ref{sec:Overview} provides an overview of the motivation for the experiments and threat model. 
Section~\ref{Related Work} describes other side-channel-based approaches for runtime monitoring and integrity assessment.
Section~\ref{Firmware} describes experiments related to firmware manipulation,
Section~\ref{RunTime} describes log verification and auditing,
and Section~\ref{Hardware} describes hardware tampering.
The paper concludes in Sections~\ref{Discussion} and~\ref{Conclusion}. 

\section{Overview}
\label{sec:Overview}

All embedded systems leak information about their operation through side channel emissions.
Side-channel-based \glspl{ids} use \gls{dsp} methods and \gls{ml} algorithms to model the side-channel data and learn patterns that correlate to the system activity.
An important part of designing a reliable side-channel \gls{ids} is identifying appropriate side-channel emissions among temperature, vibration, ultrasound, \gls{em}, power consumption, etc.
Our experiments focus on the power consumption.
Power consumption is reasonably easy to non-intrusively and reliably measure.

Side-channel-based \gls{ids} can complement \gls{hids} and \gls{nids} in offering runtime monitoring and integrity assessment for embedded systems, as shown in Table~\ref{tab:example}.
Side-channel-based \glspl{ids} run independently from the system they monitor, which makes them more difficult to circumvent compared to \gls{ids} hosted by the system.
This independence is also beneficial in case of a malfunction of the \gls{ids}, which cannot disrupt the regular operation of the system.


\begin{table}[htb]
    \centering
    \begin{tabularx}{\columnwidth}{X>{\hsize=.4\hsize}cccc}
        \toprule
        \textbf{Attack Scenarios} & \textbf{Reference} & \textbf{\gls{hids}} & \textbf{\gls{nids}} & \textbf{SCIDS}\tabularnewline
        \midrule
%       The attacker can:  & & & \tabularnewline
%        \addlinespace[1em]
        Run unapproved executable through backdoor &  \small{\cite{cve-2018-0150,cve-2018-0151,cve-2018-0222}}
        & \cmark & \xmark & \cmark\tabularnewline
        \addlinespace[1em]
        Exploit existing executable & \small{\cite{kovacs_2019,CVE-2019-12649}}& \cmark & \xmark & \cmark\tabularnewline
        \addlinespace[1em]
        Spy on the network & \small{\cite{Hernandez2014SmartNT,router_hacking_slingshot}}&  \xmark & \cmark & \cmark \tabularnewline
        \addlinespace[1em]
        Pivot/proxy for network attack & \small{\cite{router_hacking_slingshot,symantec_security_response}} & \xmark & \cmark & \cmark\tabularnewline
        \addlinespace[1em]
        Bypass secure boot & \small{\cite{cisco_trust,thomson_2019}} & \xmark & \xmark & \cmark\tabularnewline
        \addlinespace[1em]
        Change firmware & \small{\cite{Cui2013WhenFM,hau_2015}}& \xmark & \xmark & \cmark\tabularnewline
        
        \bottomrule
       

    \end{tabularx}
    \caption{Attack scenarios that side-channel-based \gls{ids} can detect.}
    \label{tab:example}
\end{table}

\subsection{Threat Model}
\label{subsec:threat-model}

In the context of this paper, we consider active attackers that can tamper with the execution of network devices.
These attackers can accomplish their goal by assuming different roles and exploiting several mechanisms, as summarized below:

\textbf{Remote Code Execution:}
A remote attacker may take advantage of known or zero-day remote exploits in categories such as remote code injection, privilege escalation, etc.
The outcome could be to temporarily tamper with the device's execution, possibly persistently.

\textbf{Brute-Force or Dictionary-Based Password Guessing:}
A remote attacker could attempt to log in through password guessing, with the objective of tampering with the device's execution once logged in.

\textbf{Unauthorized Firmware Reprogramming (or Failure to Apply a Scheduled Firmware Upgrade):}
Either through physical access to the device or upon successful administrative login, the attacker can reprogram the firmware of the device.
The applied firmware can be an older version to reactivate a specific vulnerability, or it could be a custom firmware that contains backdoors.

\textbf{Unauthorized Hardware Configuration Changes:}
An attacker with physical access to the device could apply undocumented changes to the configuration of the device to its advantage.

\textbf{Tampering with Administrative/Maintenance Logs:}
The attacker's goal may be to mislead the operators through actions such as failing to apply a firmware upgrade while reporting that the firmware has been upgraded.
This could be done with the purpose of keeping a particular vulnerability in the device while the administrators assume that such vulnerability has been addressed.  


\subsection{Analysis of Side-channels}
Electronic systems, including embedded devices, involuntarily leak information through different side channels.
Due to each side channel's specific nature, some are better for different applications.
In the context of \gls{ids} for network equipment, we considered power consumption, ultrasound and \gls{em} emissions.
After initial tests, power consumption proved to provide the most information about the system state relative to the practicality of measurement.

In our setup, the power consumption of the device is measured in two different ways: measurement at the \gls{ac} line (between the device's \gls{psu} and the power outlet); and measurement at the \gls{dc} power (from the \gls{psu} to the motherboard of the device).
For both \gls{ac} and \gls{dc}, a power measurement device is placed in series with the main power cable.
The box measures the voltage drop generated by the current flowing through a shunt resistor.
This box samples the voltage at one mega sample per seconds (1MSPS).
During every operation of the device, the different instructions influence the overall power consumption \cite{727070} and will be detectable in either \gls{ac} and \gls{dc} power consumption.
\gls{ac} powertraces are less intrusive to capture than \gls{dc} power consumption and offer the most transparent way to retrofit the proposed system for different devices.
However, its \gls{snr} is lower compared to the \gls{dc} measurement because the \gls{ac}/\gls{dc} switching converter introduces a buffering of electrical energy, thus hiding some of the fine-grained details.
%Work by Moreno~et~al.~\cite{Moreno2018} uses the power consumption of embedded systems for non-intrusive online runtime monitoring through reconstruction of the program's execution trace.

\section{Related Work}
\label{Related Work}
The idea of side-channel-based \gls{ids} traces back to the seminal work in side-channel analysis by Paul C. Kocher.
He introduced Differential Power Analysis to find secret keys used by cryptographic protocols in tamper-resistant devices~\cite{kocher1999differential}.
This led to a field of research focusing on side-channel analysis that has been growing since. Power analysis is the most common and widely studied side-channel analysis technique~\cite{brier2004correlation,mangard2008power}. %new citations% 
Cagalj et al.~\cite{vcagalj2014timing} show a successful passive side-channel timing attack on U.S. patent Mod 10 method and Hopper-Blum (HB) protocol.
%Quisquater et al.~\cite{quisquater2002automatic} present an approach to identify executed instructions with the use of self-organizing maps, power analysis and analysis of electromagnetic traces. %new citations%
Zhai et al.~\cite{zhai2015method} propose a self-organizing maps approach that uses features extracted from an embedded processor to detect abnormal behaviour in embedded devices.
%Eisenbarth et al.~\cite{eisenbarth2010building} propose a methodology for recovering the instruction flow of microcontrollers using its power consumption.
Goldack et al.~\cite{goldack2008side} propose a solution to identify individual instructions on a PIC microcontroller by mapping each instruction type to a power consumption template.
However, the attack focused side-channel analysis can offer non-intrusive runtime monitoring, as well. \\
\indent
Literature shows promising work in assessing integrity through power monitoring.%~\cite{10.1145/2976749.2978299}.
Works by Moreno et al. offer two building blocks for this work.
In~\cite{moreno2013non}, the team proposes a solution for non-intrusive debugging and program tracing using side-channel analysis.
In this work, they use the power consumption of a given embedded system to identify the code block the embedded system was executing at the time.
The team builds on their previous technique and presents a new one~\cite{Moreno2018} using the power consumption of embedded systems for non-intrusive online runtime monitoring through anomaly detection.
%They use a signals and systems analysis approach to identify anomalies using the power consumption of a system and showcase this by identifying buffer overflow attacks on their system. 
Msgna et al.~\cite{msgna2014verifying} propose a technique for using the instruction-level power consumption of a system to verify the integrity of the software components of a system with no prior knowledge of the software code.
Grisel-Davy et al.~\cite{grisel2022work} propose the verification of the boot process of various embedded systems using their power consumption signature.

In more recent literature, there is a trend towards the use of \gls{ml} for side-channel analysis to enhance the security of systems.
Michele Giovanni Calvi~\cite{calvi2019runtime} offers a solution for runtime monitoring of an entire cyberphysical system treated as a black box.
They collect data from a self-driving car during operations such as steering and acceleration.
Using this data, they train a Long Short Term Memory~\cite{hochreiter1997long} deep learning model and use it to verify the safety of the vehicle. %new citations% 
Zhengbing et al. \cite{4488501} suggest the use of forensic techniques for profiling user behaviour to detect intrusions and propose an intelligent lightweight \gls{ids}. Hanilçi et al.~\cite{hanilci2011recognition} use recorded speech from a cell phone to ascertain the cell phone brand and model through using vector quantization and \gls{svm} models on the \gls{mfcc} of the audio.
In~\cite{khan2019malware}, Khan et al. propose a technique to identify malware in critical embedded and cyberphysical systems using \gls{em} side channel signals.
Their technique uses deep learning on EM emanation to model the behaviour of an uncompromised system.
The system flags an activity as anomalous when the emanations differ from the normal ones used to train the neural network.
Sehatbakhsh et al.~\cite{sehatbakhsh2019remote} also use EM emanations and detect malware code injection into a known application without any prior knowledge of the malware signature.
They use HDBSCAN clustering method to identify anomalous behaviour exhibited by the malicious code.
Yilmaz et al.~\cite{yilmaz2019detecting} implement K-Nearest Neighbors clustering methods along with PCA dimensionality reduction method to model EM emanations from a phone with the different operational status of front/rear camera.
Using the ML methods, they can determine the state of cellphone cameras.

Some work also investigated the possibility of forging power consumption for defense purposes.
Raghavendra et al.~\cite{Pradyumna_Pothukuchi_2021} proposed a simple control method to mask the power consumption pattern of any application.
However, this kind of method does not enable masking into an arbitrary pattern as it is meant for obfuscation, not impersonation.
Thus an attacker could not leverage this method to make an activity (malware) impersonate another one (legit activity) from a power consumption point of view.

%The work that this paper proposes builds on top of the aforementioned works.
%An HP network switch, treated as a black box, generates side-channel leaks in the form of its power consumption.
%The experiments treat this power consumption as an output of the system when the inputs are certain attacks/stimuli that triggers the switch.
%The data train the \gls{ml} models, which, in turn, successfully identify the attacks/stimuli on the switch.

\section{Experiment Family I: Firmware Manipulation}
\label{Firmware}
Embedded systems require firmware updates for a range of reasons including the addition of features or security patches.
Attacks on these systems commonly target the firmware update process~\cite{hau_2015}.
The ability to modify the firmware enables attackers to perform a range of other attacks, such as Communication Channel Manipulation [CAPEC 216], Protocol Manipulation [CAPEC 272], Functionality Bypass [CAPEC 554], and Software Integrity Attack [CAPEC 184].

The following two experiments were conducted with ten official firmware versions using the same device configuration.
Starting from the pre-installed version K.15.06.008, we performed upgrades to the next ten higher release versions (K.15.07 to K.15.17) and picked the final build for each release.

\subsubsection{Feature Engineering}\label{FE-Firmware}

With the HP Procurve Switch 5406zl taking around 120 seconds to complete its boot-up sequence, this experiment family produces the largest dataset of this case study.
Therefore, several preprocessing steps were applied to reduce the size of the dataset and remove noise.
A combination of downsampling and a sliding median filter yields the best results at a minimal size per training set.
Given a power trace with a length of \numprint{120e6} datapoints, downsampling with a factor of \numprint{1e6} results in a sample size of 120 and provides an overall accuracy of \numprint[\%]{99} for this experiment.
This process enables training accurate machine-learning models (see Table~\ref{tab:fw-results}) with less than \numprint{1000} training samples, each consisting of 120 datapoints (See Figure~\ref{fig:firmwares-samples}).

Both temporal and frequency domains are investigated.
For the temporal domain, the preprocessed \gls{ac} or \gls{dc} time series are considered.
For the frequency domain, the \gls{psd} of the \gls{dc} data are considered.
800 bootup traces were collected using an automated script to controle the state of the switch.
Out of these 800 traces, 14 were discarded du to errors in the capture procedure (cropped measurements).
For all evaluations, the data are divided into training and test data with a ration of 70:30.
Figure~\ref{fig:firmwares} illustrates the captured data for two different firmware versions in temporal and frequency domain.

\begin{figure}
    \begin{subfigure}{0.49\textwidth}
        \centering
        \includegraphics[width=\linewidth]{images/Firmware_Comparison_TD_direct.pdf}
        \caption{Median-filtered power traces of boot-up sequences for two different firmware versions (ten captures each).}
        \label{fig:firmwares-samples}
    \end{subfigure}
    \begin{subfigure}{0.45\textwidth}
        \centering
        \includegraphics[width=\linewidth]{images/psd.pdf}
        \caption{PSD of power traces of boot-up sequences for two different firmware versions (two traces for each version).}
        \label{fig:firmwares-psd}
    \end{subfigure}
\caption{Influence of different firmware versions on the power consumption at boot time.}
\label{fig:firmwares}
\end{figure}

\subsection{Experiment 1: Classifying Firmware Versions}
\label{Classifying-Firmware-Versions}                                                                                                                                                                                    
Given a power trace during boot-up and boot-up traces for ten different firmware versions, the goal of Experiment I.1 is to predict which firmware is currently installed on the device.
The result can be used to confirm successful firmware updates and to check whether the device reports the correct version.

Two classification models were evaluated to solve this task: \gls{rfc} and \gls{svm}.

\subsubsection{Results}
The \gls{rfc} delivers the best results of all tested \gls{ml} algorithms over both time and requency domains.
%A \gls{rfc} model trained on 786 samples achieved an accuracy of over \numprint[\%]{99} on an independently collected set of \gls{dc} data.

%\paragraph{\textbf{Frequency Domain}}
%Among the various \gls{ml} models trained on frequency-domain data, \gls{rfc} model has the best results with \numprint[\%]{99} accuracy. The \gls{rfc} model when tested on an independently collected validation set presents the same results, verifying the integrity of the model. The details of trained models and their performances can be found on the Table~\ref{tab:fw-change-fd-precision-comparison}.

\begin{table}[ht]
    \centering
    \begin{tabular}{lccc}
        \toprule
        \textbf{Data} & \textbf{Model} & \textbf{Macro F1 Score} & \textbf{Accuracy} \tabularnewline
        \midrule
        \multirow{2}*{DC Time Domain} & \gls{rfc} & \numprint[\%]{100}  & \numprint[\%]{100} \tabularnewline
        & \gls{svm} & \numprint[\%]{96.8} & \numprint[\%]{99.3}\tabularnewline
        \midrule
        \multirow{2}*{AC Time Domain}& \gls{rfc} & \numprint[\%]{87.4}  & \numprint[\%]{98.9} \tabularnewline
        & \gls{svm} & \numprint[\%]{75.8} & \numprint[\%]{95.5} \tabularnewline
        \midrule
        \multirow{2}*{DC Frequency Domain} & \gls{rfc} & \numprint[\%]{97.6} & \numprint[\%]{99.8} \tabularnewline
        & \gls{svm} & \numprint[\%]{95.3} & \numprint[\%]{96.0} \tabularnewline
        \bottomrule
    \end{tabular}
    \caption{Comparison between the different algorithms for firmware classification.}
    \label{tab:fw-results}
\end{table}


\subsection{Experiment 2: Detecting Firmware Change}
Given the most recently collected power trace during boot-up and the power trace collected before it, the goal of Experiment I.2 is to predict whether the firmware has been altered between these two traces.
The model uses \gls{dtw} and a training procedure on the collected traces, which implements a distance value as a parameter to the model to provide a decision, on whether there is a change in the firmware version.

The model uses a windowed \gls{dtw} to compute the distance between the current and the previous power traces collected during the boot-up.
The distance that results from \gls{dtw} is then subjected to a comparison with the model parameters.
The model has the parameter $D_{\max}$ (maximum distance).
Optimization of the parameter involves training on the data collected for the firmware classification experiment.

Given a pseudo-random sample trace of class $j$ from the training set, the selected sample acts as the baseline for the class $j$ model.
For each class $j$ from the training set, the training computes the \gls{dtw} distance between any sample of class $j$ and all the samples in the training set.
The results determine the parameters of the model.
The maximum distance is for class $j$ is defined as $d_{j_{\max}}$ and the variance of the distances as $\sigma_j$. 

The decision for each class can be made using the class's distance and variance. 
A global decision, combining them all, is achieved by introducing the parameter, $D_{\max}$.
This parameter is the mean of the parameter $d_{n_{\max}}$ across all the models belonging to a class.
Getting the average instead of the maximum of $d_{n_{\max}}$ is valid because the distance results obtained from \gls{dtw} are roughly similar on all classes, and any bias that may occur towards a single class is removed from the model.
Following the same idea, $\sigma_{\mathrm{all}}$ is defined as the average of each $\sigma_j$.

The general model uses $D_{\max}$ and $\sigma_{\mathrm{all}}$, as follows:

\begin{equation}
    \text{decision} = \left\{\begin{matrix}
        0 & \text{ if } D_{\mathrm{max}} \sigma_{\mathrm{all}} \geqslant DTW(a,b) (1+\sigma_{\mathrm{all}})\\ 
        1 & \text{ if } D_{\mathrm{max}} \sigma_{\mathrm{all}} < DTW(a,b) (1+\sigma_{\mathrm{all}})\\ 
\end{matrix}\right.
\end{equation}
where $a$ and $b$ denote two boot-up samples.

The equivalence case denotes that there is no change in the firmware.
Because $D_{\max}$ is the average of all $d_{j_{\max}}$ values, thus it falls into the range of observed values.

\subsubsection{Results}
Training and test results indicate that the model achieves \numprint[\%]{99} accuracy when the $D_{\max}$ is \numprint{27.16}.
The test data is data collected under the same conditions as the training data and includes firmware versions that are present during the training process and firmware versions that are not present. The test data has never been subjected to the training process, and the training procedure applies the above notations with the described parameters set during the training process.
Based on the model accuracy, a generalization of the model is possible with the introduced $D_{\max}$ without requiring any input from the firmware classification model.    


\section{Experiment Family II: Run-Time Monitoring}
\label{RunTime}
\gls{ssh} allows users to securely access a remote device even if the network is insecure.
Systems that enable SSH access maintain logs of SSH login attempts.
However, maintaining a log of the login attempt history proves futile since an attacker with system control can forge these log entries.
Since side-channel \glspl{ids} only focus on external properties and are independent of the system it monitors, they can defend against an attacker forging log entries.


\subsection{Experiment 1: Detecting SSH Login Attempts}
\label{detect_ssh}

%This experiment aims to identify instances of SSH login attempts in the power trace collected from a network switch during its regular operation.
Given an arbitrary power trace sample from the monitored device, the goal of Experiment II.1 is to predict whether the sample includes an SSH login attempt.

\subsubsection{Feature Engineering}
The signal collected from the network switch is a time series $T_1 \triangleq \{x \in \mathbb{R}\}$ sampled \numprint[MHz]{1} then downsamples by a factor of \numprint{1000} which results in 1 sample per millisecond.
Each sample has a corresponding label that is either 1 (\gls{ssh} login attempt) or 0 (no \gls{ssh} attempt) represented as $ T_2 \triangleq \{y \in \mathbb\{0,1\}\}$.


Figure~\ref{fig:ssh_time_window} shows power consumption increases during each login attempt.
The data acquisition process saves these timestamps of the connections while capturing the power traces.
To create training samples for the \gls{ml} algorithms, a sliding window of \numprint{500} datapoints and step size of \numprint{250} datapoints divides the powertrace into multiple samples with $S \triangleq \{ x \in \mathbb{R}\}$ with $|S| = 500$ and $S \subseteq T_1$.

Every data point in the sample is a feature of the model. If ${S \in [1]^{500}}$, then the sample is indicative of an SSH attempt otherwise, the feature indicates no SSH attempt. A matrix representation of $Z = \{ S_{1}, S_{2}, ... , S_{L}\}$ with rows of $S$ and $\forall i,j: |S_i|=|S_j|$, and the accompanying set of labels $Y_{Z} = \{ y_Z \in \{0,1\}^{L}\}$ where $L$ is the total number of samples.

\begin{figure}[htp]
    \centering
    \includegraphics[width=\linewidth]{images/time_domain_ssh.pdf}
    \caption{Downsampled and scaled DC power traces during SSH login attempts and the corresponding labels.}
    \label{fig:ssh_time_window}
\end{figure}

The samples created while applying sliding windows to the power trace exist in the time domain. 
Application of \gls{fft} can convert the data from temporal domain to frequency domain. The \gls{fft} calculates the frequency spectrum for windows of 500 features. The spectrum is labeled 0 or 1, corresponding to their original labels from the temporal domain. 

\subsubsection{Results}

A test set with \numprint{4095} samples consisting of \numprint{500} features each led to the results in Table~\ref{tab:ssh-precision-comparison}.
The feature engineering step extracts the samples from 20 power traces (each 50 seconds long).                                                                                                                                                                                     
In total, there were 120 power traces and the model trained over 85 of them and validated over 15.
\gls{ssh} attempts comprised \numprint[\%]{30} of the data, and the rest represented the idle behaviour of the system.
The skew in the dataset makes the model more certain while predicting a positive class and helps lower the number of false positives.

The \gls{svm} model trained on data in temporal domain using the Gaussian Kernel configured with $C = 1$ and $\gamma = 0.1$ achieved an accuracy of \numprint[\%]{98}.
\gls{rfc}, configured with 500 trees and a maximum depth of 50, performed equally well and achieved an accuracy of \numprint[\%]{97}, also on temporal domain.

Lastly, a \gls{1dcnn} trained on a mix of data from both time and frequency domain achieves an accuracy rate of \numprint[\%]{95} and minimizes \gls{fpr} to \numprint[\%]{1}.
However, it has the highest \gls{fnr}.
    
Thus, \gls{svm} had the best accuracy rates along with the lowest \gls{fnr} and the second lowest \gls{fpr}. \gls{rfc} trained on time-domain data, on the other hand, has the lowest \gls{fpr} but has a much higher \gls{fnr}.


\begin{table}[ht]
    \begin{center}
    
    \begin{tabular}{ccccccc}
        \toprule
        \textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F1 Score} & \textbf{Accuracy} & \textbf{FPR} & \textbf{FNR} \tabularnewline
        \midrule
        %& \multicolumn{5}{>{\hsize=\dimexpr5\hsize+5\tabcolsep+\arrayrulewidth\relax}Y}{\textbf{Time Domain}} & \tabularnewline
        \midrule
        \gls{rfc} & \numprint[\%]{95} & \numprint[\%]{97} & \numprint[\%]{95} & \numprint[\%]{97} & \numprint[\%]{0.6} & \numprint[\%]{14} \tabularnewline
        SVM & \numprint[\%]{95} & \numprint[\%]{97} & \numprint[\%]{96} & \numprint[\%]{98} & \numprint[\%]{0.8} & \numprint[\%]{8} \tabularnewline
        1D~CNN & \numprint[\%]{94} & \numprint[\%]{93} & \numprint[\%]{93} & \numprint[\%]{96} & \numprint[\%]{2} & \numprint[\%]{9} \tabularnewline
        \midrule
        %& \multicolumn{5}{>{\hsize=\dimexpr5\hsize+5\tabcolsep+\arrayrulewidth\relax}Y}{\textbf{Frequency Domain}} & \tabularnewline
        \midrule
        \gls{rfc} & \numprint[\%]{89} & \numprint[\%]{67} & \numprint[\%]{72} & 
        \numprint[\%]{88} & 
        \numprint[\%]{12} & 
        \numprint[\%]{8} \tabularnewline
        SVM & -- & -- & -- & -- & -- & --  \tabularnewline
        1D~CNN & 
        \numprint[\%]{90} & \numprint[\%]{90} & \numprint[\%]{90} & \numprint[\%]{94} & 
        \numprint[\%]{3} & 
        \numprint[\%]{17} \tabularnewline
        \midrule
        %& \multicolumn{5}{>{\hsize=\dimexpr5\hsize+5\tabcolsep+\arrayrulewidth\relax}Y}{\textbf{Time + Frequency Domain}} & \tabularnewline
        \midrule
        1D~CNN & \numprint[\%]{89} & 
        \numprint[\%]{95} & 
        \numprint[\%]{92} & 
        \numprint[\%]{95} & 
        \numprint[\%]{1} & 
        \numprint[\%]{20} \tabularnewline
        \bottomrule
    \end{tabular}

    \end{center}
    \caption{Comparison between the different algorithms for detecting SSH login attempts.}
    \label{tab:ssh-precision-comparison}
\end{table}

\subsection{Experiment 2: Classifying SSH Login Attempts}
Given a window of power trace with an SSH login attempt, the goal of Experiment~II.2 to classify the login attempt as successful or unsuccessful.

\subsubsection{Feature Engineering}
This experiment builds on top of experiment II.1 and classifies the \gls{ssh} login attempts detected as successful or failed.
The experiment considers the data only in the temporal domain.
The matrix representation for this experiment is a slight modification of the previous one: $Z = \{ S_{1}, S_{2}, ... , S_{L}\}$ with rows of $S$ and $\forall i,j: |S_i|=|S_j|$, and the accompanying set of labels $Y_{Z} = \{ y_Z \in \{-1,1\}^{L}\}$ where $L$ is the total number of windows, $S$ is a window of \numprint{500} samples in time-domain, and all the windows correspond to either a successful or a failed SSH login attempt.

\subsubsection{Results}

Models trained using \glspl{svm} and \gls{1dcnn} gave the best results for the classification along with the lowest \gls{fpr} and \gls{fnr}.
Optimizing the parameters of the \gls{rfc} with 250 trees, \glspl{svm} with $C = 100$, $\gamma = 10$, and Gaussian Kernel, and \gls{1dcnn}, the accuracy score reached \numprint[\%]{96.7}, \numprint[\%]{98.5} and \numprint[\%]{98.6} respectively. Table~\ref{tab:ssh-classification-precision-comparison} details all the results.

The experiment uses the 4095 samples extracted from Experiment~II.1 that includes only successful and unsuccessful SSH attempts.
65\% of all the samples form the training set, 15\% contribute to the validation set, and the test set includes 20\% of all the samples.
Testing is done over roughly 1000 samples of 500 features.
The \gls{svm} model performed the best and had the lowest \gls{fpr} and \gls{fnr}.


\begin{table}[ht]
    \begin{center}
    \begin{tabular}{ccccccc}
        \toprule
        \textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F1 Score} & \textbf{Accuracy} & \textbf{FPR} & \textbf{FNR} \tabularnewline
        \midrule
        & \multicolumn{5}{>{\hsize=\dimexpr5\hsize+5\tabcolsep+\arrayrulewidth\relax}c}{\textbf{Time Domain}} & \tabularnewline
        \midrule
        \gls{rfc} & \numprint[\%]{97} & \numprint[\%]{97} & \numprint[\%]{97} & \numprint[\%]{96.7} & \numprint[\%]{12} & \numprint[\%]{8} \tabularnewline
        SVM & \numprint[\%]{99} & \numprint[\%]{99} & \numprint[\%]{99} & \numprint[\%]{98.5} & 
        \numprint[\%]{1} & 
        \numprint[\%]{1.5}  \tabularnewline
        1D~CNN & \numprint[\%]{98.5} & 
        \numprint[\%]{98} & \numprint[\%]{98} & \numprint[\%]{98} & \numprint[\%]{1} & \numprint[\%]{2}  \tabularnewline
        \bottomrule
    \end{tabular}
    \end{center}
    \caption{Comparison between the different algorithms for classifying SSH login attempts.}
    \label{tab:ssh-classification-precision-comparison}
\end{table}


\section{Experiment Family III: Hardware Tampering}
\label{Hardware}

The HP Procurve Switch 5406zl supports the on-the-fly installation of networking modules to modify the number of ports available.
This capability exposes the switch to a Hardware Integrity Attack [CAPEC 440].
An attacker with physical access to the front panel of the network equipment can tamper with the modules and potentially install unauthorized ones.
Installing new modules could offer an attacker a way to gain access to the machine by leveraging a poor default configuration of the ports.
For example, on network equipment where the default configuration does not include a limit for the number of MAC addresses per port, installing an extension module could allow an attacker to perform a MAC Flood attack [CAPEC 125].
Existing \glspl{ids} and security software do not yet offer functionality to detect the installation of unauthorized modules.
Hence, currently, the only way to identify unauthorized hardware modification is through the use of the network equipment's involuntary emissions.

\subsection{Experiment 1: Identifying Number of Expansion Modules}
\label{expe:hardware-1}

Given an arbitrary power trace sample from the monitored device, the goal of Experiment III.1 is to predict the number of expansion modules installed on the device.
In this experiment, there was no on-the-fly installation or removal of the module during the capture, only in between captures.

\subsubsection{Feature Engineering}
The installation or removal of an expansion module increases or decreases the average \gls{dc} and \gls{ac} power consumption of the device.
By analyzing the power consumption, it is possible to identify the number of expansion modules installed at any time.

\textbf{\gls{dc} data:} To create the training dataset, the prepossessing program extracted snippets of data randomly picked from \numprint{138} 20 seconds long \gls{dc} power consumption traces.
Each trace is 20 seconds long to avoid any outlier condition that, for a few seconds, could affect the average power consumption and cause biased training.
Within each trace, the program picked ten snippets of five values.
Those values of the number and length of snippets correspond to the minimum training time needed to achieve a \numprint[\%]{100} accuracy with a stratified 10-fold cross-validation setup with the data used in this experiment. The average value of each snippet is then computed. The final training dataset is a 1D array of shape $(\numprint{1380},1)$.

\textbf{\gls{ac} data:} Each number of expansion modules will cause a different pattern in the fundamental \numprint[Hz]{60} wave of the \gls{ac} power consumption.
To create the training dataset, the prepossessing program extracted $N$ periods of the fundamental wave by detecting consecutive local minima in the trace. 
The extracted periods of \numprint{3333} data points (one period of the \numprint[Hz]{60} captured at 1MSPS and decimated by a factor of 5) constitute the training set of shape $(\numprint{4320},\numprint{3333})$.

\subsubsection{Results}

The average \gls{dc} values does not overlap for different number of expansion modules installed.
This property enables both \gls{svm} and \gls{knn} to perfectly classify the number of modules installed.
The \gls{svm} model trained with a linear kernel performed the same as the \gls{knn} model with $K=1$.
Both methods classify the traces with a \numprint[\%]{100} accuracy on \gls{dc} data.

The \gls{ac} periods do present different patterns depending on the number of modules but remain similar at some points and do not present a separation as clear as the \gls{dc} averages.
The \gls{svm} model was able to identify the number of modules installed with an accuracy of \numprint[\%]{99}.

Results from Table~\ref{tab:hardware-results} show that \gls{dc} data yields the best results.
These high accuracy and recall performances are the result of the non-overlapping grouping of the averages \gls{dc} consumptions.
The results presented are produced with a stratified 10-fold cross-validation setup.

\begin{table}[ht]
    \begin{center}
    \begin{tabular}{ccccc}
        \toprule
        \textbf{Input Data} & \textbf{Model} & \textbf{Accuracy} & \textbf{Recall}\tabularnewline
        \midrule
         \gls{dc} & SVM & \numprint[\%]{100} & \numprint[\%]{100}\tabularnewline
         \gls{dc} & KNN & \numprint[\%]{100} & \numprint[\%]{100}\tabularnewline
         \gls{ac} & SVM & \numprint[\%]{99.5} & \numprint[\%]{99.45}\tabularnewline
        \bottomrule
    \end{tabular}
    \end{center}
    \caption{Comparison between the different models for hardware detection with a stratified 10-fold cross validation setup.}
    \label{tab:hardware-results}
\end{table}


\section{Discussion}
\label{Discussion}
This section highlights important aspects of this study.

\subsection{Influence of Traffic on the Results}
The data used for training the models did not include traffic and were collected in a laboratory environment.
Because the production equipment is used by actual users, it is impossible to perform attack that would disrupt to connection quality or lower the security of the device. 
However, complementary experiments were conducted to verify whether traffic would have a significant influence on the results of the experiment.
For Experiment Family I (Section~\ref{Firmware}), the traffic cannot influence the results as there is no traffic possible during the boot-up sequence and the experiment uses only the boot-up sequences to perform the classification.
For Experiment Family II (Section~\ref{RunTime}) and III (Section~\ref{Hardware}), we captured data containing real traffic (captures on the identical production switch) and simulated traffic (connections between multiples pairs of machines at around 1Gbps in the laboratory environment).
Traffic data does not show any significant influence on \gls{dc} or \gls{ac} in both time and frequency domain. 
From these results, it is possible to conclude that traffic should not affect the results from the presented experiments.

\subsection{Obtainable Datasets}
As presented in this paper, the trained models can successfully detect attacks executed on the network equipment. 
Those results are especially interesting as the model training step relies on an obtainable number of training samples to achieve near perfect accuracy scores.
This is a success, because (1)~our models achieve similar accuracy as some of the most successful experiments involving \gls{ml}~\cite{szegedy2017inception,xie2017aggregated} but (2)~use only a small sample size compared to image libraries with millions of image samples as training data \cite{sun2017revisiting}.
Our experiments use a maximum of \numprint{4100} power trace samples.
The obtainable number of training samples makes this approach adaptable to a range of different systems and domains because it solves the issue of collecting large amounts of data usually required to enable \gls{ml} approaches.
The models trained are relatively lightweight owing to the small number of samples along with the heavy downsampling performed on data for the experiments. 
The lightweight nature of the models allows for fast online run-time monitoring and integrity assessment of embedded systems.

\section{Conclusion}
\label{Conclusion}

This paper introduces a physics-based \gls{ids} that offers a novel and complementary type of runtime monitoring and integrity assessment for network equipment.
The proposed \gls{ids} leverages side-channel information generated by the system at the physical level and infers the system's state and activities to detect attacks.
This paper presents en evaluation of the performances against hardware tampering, firmware manipulation, and log tampering.
The results show that the used methods achieve near perfect accuracy on all experiments with only a small training set.
Overall, the introduced techniques provide a glimpse on a general concept that is extensible to other real-time and embedded systems.
Future work can investigate additional side channels and how the interaction can even further reduce the required sample size and improve the accuracy.


\bibliographystyle{splncs04}
\bibliography{bibliography}


\end{document}