\chapter{Exploratory Work on Physics-Based Security}\label{chap:pastwork}

The \gls{esg} has a history of power side-channel analysis. In 2017, the \gls{eet} project started with the aim of exploring the intrusion detection capabilities of side-channel analysis on a wide range of embedded systems. A series of exploratory works on the topic of physics-based defense followed, each illustrating a different capability.

\section{Electromechanical Emission Tripwire}

The \gls{eet} project marked the start of physics-based security at the \gls{esg} lab. The project aimed at evaluating the capabilities of physics-based security and providing a proof of concept. The initial target was a network switch.

Network switches are a core component of any data center. As powerful as computers can be, if they are not interconnected, their computing power remains useless. In a data center with hundreds of machines, communication is as essential as computing power. The failure of a network switch can have devastating consequences for data center operations. Every minute of downtime costs the data center and its clients dearly and must be prevented.

\gls{hids} are often not a perfect solution for network switches. Their \gls{os} typically does not support the installation of additional software and may not offer built-in \gls{ids} capabilities. When it does, the security solutions may be weak or rapidly out of date and fail to protect against attacks such as firmware modification~\cite{cisco_trust,thomson_2019} and bypassing secure boot-up~\cite{Cui2013WhenFM, hau_2015}. They also fail to offer effective run-time monitoring through auditing and verifying log entries~\cite{koch2010security}. For these reasons, network switches are prime candidates for side-channel security. The installation of a side-channel monitoring system is often minimally invasive and can even be performed without downtime if the machine supports redundant power supplies.

The aim of the project was to leverage side-channel analysis to detect anomalous activities that can be related to attacks on a network switch. The goal was not to create a complete \gls{ids} suite from physics-based security but to offer a complementary detection mechanism for the cases where traditional \gls{ids} fail.

\subsection{Attack Scenario}

Network switches have a large attack surface. Every manageable switch has a management system that enables changing the parameters of the machine. This management system is typically accessible remotely via \gls{ssh}, telnet, or HTTP, or locally through a serial connection. At least one of these interfaces should be available, and they are typically protected with a username/password pair --- although certificate or key authentication may be available for modern interfaces like \gls{ssh}. On top of these intended interfaces, a network switch is also at risk of attacks from the connected clients. A malicious client connected to the switch can run a \gls{mac} flooding attack or a VLAN hopping attack. An attacker who gains physical access to the machine can also tamper with the firmware (upgrading/downgrading the firmware, uploading malicious firmware) or the hardware configuration of the machine.

We considered the following intrusions: a remote connection via \gls{ssh}, a firmware change, and a hardware change. A remote connection via \gls{ssh} does not always imply an intrusion. However, this operation can be the first step of a more complex attack.
The network switch logs the connections for later forensics, but the \gls{os} does not typically offer a mechanism to trigger actions upon a remote connection. The capability of detecting a remote connection independently of the \gls{os} is valuable in a security pipeline. Moreover, an attacker who gains access to the machine could wipe the logs to cover their tracks. With the detection mechanism isolated from the target machine, the attacker can neither bypass the detection nor erase their tracks.

A firmware change can also be a legitimate operation. Updating the firmware is now a common capability on many embedded systems. However, if the firmware change was not approved by the system administrator, then it represents a threat. Downgrading the firmware can re-open older security flaws that have since been documented. Upgrading the firmware without approval can cause disruptions in the machine's operation. Loading a modified version of the firmware can also enable an attacker to forge the firmware version and remain undetected by remote security or monitoring solutions.

Finally, a hardware change is also a security threat. The machine that we considered for the experiment --- an HP Procurve Network Switch 5406zl --- allows for the installation of additional port modules. Each module expands the port capacity of the machine. Modules can be \textit{hot-plugged} and receive the default configuration of the machine. Installing a new blade on a machine with a poor default configuration allows an attacker to set up various attacks. For example, if the default configuration does not limit the number of \gls{mac} addresses, an attacker can perform a \gls{mac} flooding attack to access restricted traffic. This last scenario requires physical access to the machine.

\subsection{Host Independence}

One important aspect of the \gls{eet} technology resides in the independence between the host and the detection machine. In a similar way to a \gls{nids}, the detection system is remote and never requires cooperation from the host to collect data. This independence provides compelling features in both directions.

First, an attacker with access to the host does not have access to the detection system, which is important for the reliability of the results. In the case of a \gls{hids}, the data are collected by software on the host. Whether these data are analyzed locally or sent to a remote machine makes no difference, as a compromised machine cannot be trusted to send genuine measurements. An attacker with access to the machine can tamper with the measurement process to report nominal values and stay under the radar. The \gls{eet} system addresses this problem, as the power consumption of software running on a machine cannot be faked and is difficult to hide.

Second, a failure of the detection system does not induce any perturbation of the host --- as long as the measurement system can safely fail into a passive component --- which is crucial for critical systems.

The \gls{eet} system attempts to close the gap between host-\gls{ids} independence and access to relevant information about the machine's activities.

\subsection{Side-Channels}

Two side-channels were initially considered: power consumption and ultrasound emissions. The ultrasound emissions were quickly discarded for multiple reasons. When working with sound, the placement of the microphone is important and should be consistent.
This is a problem for deploying this technology to a variety of machines, as finding the best position for the microphone is difficult. Moreover, the ultrasound measurements did not show the same level of detail as the power consumption.

Power consumption is a popular side-channel for many reasons: it is easy to capture reliably with low-cost equipment, the placement of the capture device has little impact on the results, adding a capture device is often as easy as plugging it in series with the main power cable of the machine, and it provides a good level of detail about the activity. We measured the power consumption in the form of power traces (time series of power measurements). The capture device was a shunt resistor placed in series with the main power cable that generated a voltage drop proportional to the current (see Figure~\ref{fig:overview-eet1}). We measured this voltage drop at a high frequency, from 10 kilosamples per second (10\,kSPS) to 1 megasample per second (1\,MSPS).

\begin{figure} \centering \includegraphics[width=0.9\textwidth]{images/overview_eet1.pdf} \caption{Overview of the EET setup.} \label{fig:overview-eet1} \end{figure}
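As an illustration of the capture principle, converting the raw voltage-drop samples into power samples only requires Ohm's law. The following minimal Python sketch assumes a $0.1\,\Omega$ shunt and a $12\,\mathrm{V}$ supply; these values and all names are illustrative, not the exact components used in the project.

\begin{verbatim}
def trace_to_power(shunt_voltages, r_shunt=0.1, v_supply=12.0):
    """Convert shunt voltage-drop samples (V) into power samples (W).

    The shunt is in series with the supply, so I = V_shunt / R_shunt
    and P = V_supply * I (neglecting the small drop across the shunt).
    """
    return [v_supply * (v / r_shunt) for v in shunt_voltages]

# A 50 mV drop across a 0.1 ohm shunt on a 12 V rail is about 6 W.
power_trace = trace_to_power([0.050, 0.052, 0.049])
\end{verbatim}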
\subsection{Results}

The proposed method successfully detected remote connections, firmware changes, and hardware changes. Firmware change detection showed the most promising results. The power consumption during the boot process is more stable and less noisy than during runtime. Thanks to this consistency, changes between two firmware versions (see Figure~\ref{fig:eet1_firmware}) are easy to detect with simple distance-based methods like \gls{knn}, \gls{svm}, and \gls{rfc}. All these methods yield good results for the detection of abnormal firmware.

\begin{figure} \centering \includegraphics[width=0.8\textwidth]{images/eet1_firmware.pdf} \caption{Boot-up sequences for two different firmware versions} \label{fig:eet1_firmware} \end{figure}

This first exploration of the capabilities of physics-based \gls{ids} led to the publication of an article~\cite{eet1_mlcs} at the workshop on Machine Learning for Cyber Security at the ECML-PKDD conference.

\newpage
\section{xPSU}\label{sec:xpsu}

The xPSU project continued the exploratory work started with the \gls{eet} project. One important observation from the \gls{eet} project was that the global power consumption could be too noisy to extract all the relevant information in some cases. One solution to this issue is to measure the power consumption at a lower level, on specific components of interest (see Figure~\ref{fig:xpsu}). The xPSU project aimed at placing a power consumption probe and pre-processing system inside a regular \gls{pc}'s \gls{psu}. The \gls{psu} is a prime location for monitoring power as it generates the different power sources for the components of the \gls{pc}. Integrating the measurement device in a \gls{psu} enables a \textit{drop-in} installation of the monitoring system in most \glspl{pc}.

The capture mechanism consisted of a shunt resistor for generating the voltage drop, an \gls{adc} for measuring the value, and an \gls{sbc} for processing the measurements. The xPSU system measures and analyses the power consumption without any communication with the host device to ensure independence. The xPSU was an early proof of concept, and not all the components could fit in the \gls{psu}. The fan of the \gls{psu} was moved outside of the enclosure, modifying the form factor of the \gls{psu}. For this reason, the xPSU was not a perfect \textit{drop-in} replacement for a regular power supply, but the final form factor was encouraging. A more compact form factor is possible with a better design of the capture system and a more appropriate choice of components.

\begin{figure} \centering \includegraphics[width=0.8\textwidth]{images/xpsu_illustration} \caption{The xPSU focuses on a granular measure of each component.} \label{fig:xpsu} \end{figure}

\subsection{Results}

We evaluated the performance of the xPSU on the task of detecting changes in hard drive firmware. Although it is not an ordinary operation, it is possible to update the firmware of a hard drive. Updates enable attackers to modify the firmware and potentially implant malware that is impossible to detect or remove without changing the drive~\cite{hdd_malware}. We selected drives with a pending firmware update for the experiment and measured their boot power trace before and after the update. We also measured the trace of multiple drives of the same model and storage capacity to evaluate the detection of a drive replacement. The results were satisfactory and illustrated the possibility of detecting a firmware change or a drive replacement from the boot power consumption of the drive captured from within the \gls{psu}. The standard power cable supplying a SATA hard drive carries three voltage levels: 3.3\,V, 5\,V, and 12\,V. After some tests, it appeared that the 5\,V cables --- grouped on the same shunt resistor --- carried the most information about the drive activity. The shunt resistor generated the voltage drop on the 5\,V cables of the hard drive. \agd{find back results and add them here}

\newpage
\section{Boot Process Verifier}\label{sec:bpv}

The good results of the \gls{eet} and xPSU projects paved the way for the development of a robust and versatile solution for verifying the boot process of a machine. From the \gls{eet} project, we learned that modelling the expected trace --- based on a number of known good traces --- enabled the detection of anomalous firmware. From the xPSU project, we learned that most embedded systems requiring firmware exhibit a firmware signature in the power consumption during boot-up.

The core idea of the \gls{bpv} is to leverage a small number of known good firmware traces to build a model of normal boot power consumption. The model automatically computes a threshold to describe the acceptable range within which a new boot trace should fall to qualify as normal. If a new boot trace falls outside of this range, the system labels it abnormal and raises an alert. The \gls{bpv} is not a tool for finding the root cause of an anomaly, but only for detecting one. The anomaly can result from malicious firmware, a firmware upgrade/downgrade, or a change in firmware settings. The \gls{eet} project also illustrated the potential of simple distance-based models. A distance-based model was preferred for the \gls{bpv} to maintain the explainability of the model's decisions.

The \gls{bpv} is an approach to the following problem statement.

\begin{problem-statement}[Boot Process Verification] Given a set of known-valid time series samples $S=\{s_1,\dots, s_n\}$ and a new unlabeled time series $t$, assign to $t$ the label \textit{valid} or \textit{anomalous} with the condition that the \textit{valid} label should only be assigned to new traces originating from the same distribution as the training samples from $S$.
\end{problem-statement}

The samples in $S$ and the unlabeled input $t$ are all discretized, real-valued time series of the same length. The training samples in $S$ all belong to the \textit{valid} class. No example of the \textit{anomalous} class is accessible to the algorithm for building the model or choosing the threshold. All samples in $S$ originate from the same distribution, as they are different occurrences of boot sequences from the same machine with the same firmware and configuration.

The proposed solution was a distance-based detector with a threshold based on the \gls{iqr}. The distance between two time series of the same length $N$ is the Euclidean distance, computed as $d(a,b) = \sqrt{\sum_{i=0}^{N-1}(a[i]-b[i])^2}$. The \gls{iqr} is a measure of the dispersion of samples that is more robust to outliers than the variance. It is based on the quartiles and defined as $IQR = Q_3 - Q_1$, with $Q_3$ the third quartile and $Q_1$ the first quartile. This value is commonly used~\cite{han2011data} to detect outliers as a more robust alternative to the $3\sigma$ interval of a Gaussian distribution. The training phase begins by computing the pointwise average trace. Then, the \gls{iqr} of the distances from each trace to the average trace is computed. Finally, the distance threshold takes the value $Q_3 + 1.5\times IQR$. In the detection phase, the distance of each new trace to the reference average is computed and compared to the threshold. If the distance is above the pre-computed threshold, the new trace is considered anomalous.
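The following minimal Python sketch illustrates this training and detection logic; the function and variable names are introduced here for illustration only.

\begin{verbatim}
import numpy as np

def train_bpv(samples):
    """Build the model from known-valid boot traces (n_traces x n_points)."""
    samples = np.asarray(samples, dtype=float)
    reference = samples.mean(axis=0)                     # pointwise average trace
    dists = np.linalg.norm(samples - reference, axis=1)  # Euclidean distances
    q1, q3 = np.percentile(dists, [25, 75])
    threshold = q3 + 1.5 * (q3 - q1)                     # Q3 + 1.5 * IQR
    return reference, threshold

def is_valid_boot(trace, reference, threshold):
    """Label a new boot trace: True = valid, False = anomalous."""
    return np.linalg.norm(np.asarray(trace, dtype=float) - reference) <= threshold
\end{verbatim}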
\subsection{Results}

We evaluated the \gls{bpv} on two occasions. First, we assembled a panel of relevant devices, including switches, \glspl{wap}, and \glspl{pc}. The evaluations revealed that the \gls{bpv} performed better on simpler devices like switches and \glspl{wap} compared to general-purpose computers. This is mainly due to the reduced variability and noise in the traces captured from simpler devices, which produce a more robust model. This first study led to the publication of a work-in-progress paper at the EMSOFT 2022 conference~\cite{grisel2022work} that describes the design and capabilities of the \gls{bpv} in its first version.

%Then, we performed a case study with an industry partner on \gls{rtu}.
%The \gls{rtu} was composed of one low-complexity embedded system and one main general-purpose computer.
%The computer's activity overtook most of the other information in the trace and made it more difficult to detect subtle variations.
%However, the \gls{bpv} could still detect intrusions in the computer from the global trace.
%For example, a user modifying some settings through the \gls{bios} or booting into a different \gls{os} was detected.
%This case study revealed that some systems could have multiple valid modes of the boot sequence.
%This discovery enabled us to rethink the model of the \gls{bpv} to allow such variations.

We performed the second evaluation on a drone. A drone is a prime machine for the \gls{bpv}, as its low complexity allows for consistent boot traces. We successfully detected different firmware versions by leveraging the lessons from the two previous experiments. Throughout the evaluations, the \gls{bpv} capabilities have been extended to adapt to specific cases and support anomalous training samples, multi-model evaluations, and autonomous learning. This expansion of the work on the \gls{bpv} led to the publication of a paper~\cite{bpv_qrs} at the QRS conference. Table~\ref{tab:fw-results} summarizes the detection results.
\begin{table}[ht]
\centering
\caption{F1 scores obtained by the \gls{bpv} on each test case.}
\label{tab:fw-results}
\begin{tabular}{lcc}
\toprule
\textbf{Test Case} & \textbf{Experiment} & \textbf{F1 Score} \tabularnewline
\midrule
\multirow{4}{*}{Network Devices} & TP-Link switch & 0.87\tabularnewline
& HP switch & 0.98 \tabularnewline
& Asus Router & 1.00\tabularnewline
& Linksys Router & 0.92\tabularnewline
\midrule
\multirow{4}{*}{Drone} & Original & 1.00\tabularnewline
& Compiled & 1.00\tabularnewline
& Low Battery & 1.00\tabularnewline
& Bootloader Bug & 1.00\tabularnewline
\bottomrule
\end{tabular}
\end{table}

\newpage
\section{State Detection and Segmentation}

Section~\ref{sec:bpv} mentioned the use of distance models on boot power traces to evaluate their validity. However, we never mentioned how these traces were detected, extracted, and synchronized. This problem of pattern detection in a time series is complex, as the boot sequence may not be known in advance, can take multiple forms, and the solution must still detect an anomalous boot that radically changes the pattern.

The \gls{sds} algorithm was a first attempt at detecting and extracting boot sequences for the \gls{bpv} to analyse. The algorithm leverages two features common to all boot sequences: a sharp spike in power consumption and a sustained increase in the average power consumption. Two thresholds are manually set for the detection. First, the \textit{off\_threshold}, which is the power consumption under which the machine is considered off. Second, the \textit{bios\_time}, which represents the time span of the boot procedure. The algorithm considers each sample iteratively to decide on its state among \textit{OFF}, \textit{BOOT}, and \textit{ON}.

\begin{figure}[H] \centering \includegraphics[width=\textwidth]{images/sds_illustration} \caption{SDS detection mechanism using the $y$-offset (\textit{off\_threshold}) and the $x$-offset (\textit{bios\_time})} \label{fig:sds_illustration} \end{figure}

\begin{algorithm}[H]
\caption{SDS}
\label{alg:sds}
\begin{algorithmic}[1]
\Require $trace$, a time series of $N$ samples; $off\_threshold$; $bios\_time$.
\State $states \gets array(N)$
\State $boot\_start \gets None$
\For{$i \in [0,\dots, N-1]$}
\State $s \gets trace[i].value$
\State $t \gets trace[i].time$
\If{$s < off\_threshold$}
\State $states[i] \gets OFF$
\Else
\If{$i=0$}
\State $states[i] \gets ON$
\ElsIf{$states[i-1] = OFF$}
\State $states[i] \gets BOOT$
\State $boot\_start \gets t$
\ElsIf{$states[i-1] = ON$}
\State $states[i] \gets ON$
\Else
\If{$t - boot\_start < bios\_time$}
\State $states[i] \gets BOOT$
\Else
\State $states[i] \gets ON$
\EndIf
\EndIf
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
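For concreteness, the following Python transcription of Algorithm~\ref{alg:sds} is runnable as-is; the input format (a list of \texttt{(time, value)} pairs) and the function name are illustrative choices, not the project's exact implementation.

\begin{verbatim}
OFF, BOOT, ON = "OFF", "BOOT", "ON"

def sds(trace, off_threshold, bios_time):
    """Label each (time, value) sample as OFF, BOOT, or ON."""
    states = []
    boot_start = None
    for i, (t, v) in enumerate(trace):
        if v < off_threshold:
            state = OFF                  # below the y-offset: machine is off
        elif i == 0 or states[-1] == ON:
            state = ON
        elif states[-1] == OFF:
            state = BOOT                 # rising edge: the boot starts here
            boot_start = t
        else:                            # previous state was BOOT
            state = BOOT if t - boot_start < bios_time else ON
        states.append(state)
    return states

# e.g. a machine drawing ~2 W when off, ~40 W while booting for 30 s:
labels = sds([(0.0, 1.8), (1.0, 41.2), (35.0, 38.5)], 5.0, 30.0)
\end{verbatim}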
This simple algorithm makes the \gls{sds} robust and reliable, but also limited. The \gls{sds} is an appropriate solution for detecting states that exhibit a change in average consumption and have a stable, pre-defined duration. The detection of consistent and synchronized boot-up sequences fits perfectly in this use case. This consistency and synchrony of the instances are essential for distance-based detectors that compare these instances. However, for states that cannot be described by a change in average consumption and a duration, the \gls{sds} is ineffective. For example, if a machine can perform two runtime operations that generate the same consumption pattern but with different frequencies, then the \gls{sds} cannot distinguish these two states reliably.

These limitations make the \gls{sds} a preliminary work, not a final solution. It outlines that state detection is a complex problem and that the properties of the output need to be taken into account during the design. If the desired output is only information about the state occurrence, then the perfect consistency and synchronization of the detected segments are superfluous. If a follow-up algorithm processes the output --- and especially if it is distance-based --- then the output needs to be consistent and synchronized. These considerations reveal a tradeoff between training data and capabilities. The \gls{sds} requires no training data except for the two threshold values. This is interesting from a deployment perspective, where labeled data can be scarce. However, it also limits the detection capability, as the \gls{sds} does not look for actual patterns but for independent threshold crossings.

\newpage
\section{Device State Detector}

The \gls{dsd} is the continuation of the \gls{sds}. The algorithm's goal remains the same: detect the machine's state. However, the detection process and the outputs are fundamentally different. The \gls{sds} was built with robustness, ease of training, and consistency in mind. The synchronization and consistency of the output are not the main focus of the \gls{dsd}; they are traded for a greater versatility of the state detection at the cost of more training data.

The \gls{dsd} is an approach to a family of problems similar to the one addressed by the \gls{sds}, but differing in the nature of the leveraged data. Until now, we only took into account the case of the power consumption of a single machine --- or single source --- captured at a single point --- single measure. Other variations of the same problem --- combinations of single/multi sources and single/multi measures --- will follow in the next chapter. The \gls{dsd} algorithm is an approach to the following problem statement.

\begin{problem-statement}[Single-Source Single-Measure] Given a discretized time series $t$ and a set of patterns $P=\{\chi, P_1,\dots, P_n\}$, identify a mapping $m_{SSSM}:\mathbb{N}\longrightarrow P$ such that every sample $t[i]$ maps to a pattern in $P$ with the condition that the sample matches an occurrence of the pattern in $t$.
\end{problem-statement}

The time series $t: \mathbb{N} \longrightarrow \mathbb{R}$ is a discretized, mono-variate, real-valued time series. The patterns $P_j \in P\setminus \{\chi\}$ are of the same type as $t$. A sample $t[i]$ \textit{matches} a pattern $P_j \in P\setminus \{\chi\}$ if there exists a subsequence of $t$ of the length of $P_j$ that includes the sample, such that a similarity measure between this subsequence and $P_j$ is below a pre-defined threshold. The pattern $\chi$ is the unknown pattern, assigned to the samples in $t$ that do not match any of the $P_j$ patterns.
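Writing $|P_j|$ for the length of a pattern, $t[a:a+|P_j|-1]$ for the subsequence of $t$ starting at index $a$, $d$ for the similarity measure, and $\tau$ for the pre-defined threshold (notation introduced here for readability), the matching condition can be stated as

\begin{equation}
t[i] \text{ matches } P_j \iff \exists\, a \in \{i-|P_j|+1, \dots, i\} \;:\; d\left(t[a:a+|P_j|-1],\, P_j\right) < \tau.
\end{equation}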
\begin{figure} \centering \includegraphics[width=0.9\textwidth]{images/dsd_illustration} \caption{Illustration of the DSD input and output.} \label{fig:dsd_illustration} \end{figure}

The core of the algorithm is a \gls{knn} classification. This algorithm is a proven and robust way of labelling new samples based on their relative similarity to the training samples (\cn ?). Although it is a good algorithm for many classification problems, its application to time series for state detection is not trivial for multiple reasons. First, a single time series can contain multiple different states, making it a multi-label classification problem. Second, extracting windows to perform the classification introduces parameters --- window size, window placement around the sample to classify, number of samples to classify per window, stride --- that are potentially difficult to tune or justify. Finally, there is no single commonly accepted distance metric between time series of different lengths. The \gls{dsd} addresses these design choices to perform state detection with minimal training data.

For the rest of the explanation of the \gls{dsd}, we will suppose that the training data consists of one time series per state. These time series each represent one occurrence of a state to detect. One important detail is that each training sample can have a different length, as the states are likely not all of the same duration.

The common way of applying a \gls{knn} classifier for detecting multiple patterns in a time series is to iteratively consider windows of the trace corresponding to each length of the training samples. The classifier then evaluates the distance of each slice to the corresponding training sample and normalizes this distance by the length to generate comparable values. The state of the closest training sample is assigned to every sample of the slice, and the next slice is taken without overlap. The results of this method are sub-optimal for two reasons. First, if the stride between each window is too large, crucial patterns can be overlooked in the trace. If it is too small, the detection accuracy can suffer as the state of each sample is evaluated multiple times --- due to window overlap. Second, the whole window is assigned one label, which causes the edges of the states to be inaccurate --- especially when state patterns share similarities. \agd{find a theoretical setup where the middle-sample knn is worse than dsd. Consider cases of a bootup with a description that include some OFF portion.}

The \gls{dsd} uses a better metric for evaluating the distance between a sample and each state. For each sample and for each state, every window of the length of the state containing the sample is considered. The first window contains the sample at the last position, and the last window contains the sample at the first position (see Figure~\ref{fig:windows_dsd}).

\begin{figure} \centering \includegraphics[width=0.9\textwidth]{images/windows_dsd.pdf} \caption{Windows considered around the sample $i$ for comparison with the pattern of size $k$.} \label{fig:windows_dsd} \end{figure}

The algorithm computes the distance between each window and the state pattern and normalizes it by the length of the pattern. After all the distances are computed, the sample is assigned the state that is the closest. This method naturally segments the state space into areas whose borders represent a mid-point between two states. We refined the method by introducing a coefficient to shrink the capture area of each state. The emerging empty area, corresponding to no state, allows for the detection of unseen states. This method retains the low complexity of a distance-based \gls{knn} algorithm while yielding better accuracy, especially around state transitions. The \gls{dsd} was designed for one-shot classification, but the multi-shot version is naturally accessible by adding more training examples and going from a 1-NN to a K-NN.
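A naive Python sketch of this per-sample distance follows. The per-state acceptance thresholds, the exact role of the shrink coefficient, and all names are illustrative assumptions, not the precise \gls{dsd} implementation.

\begin{verbatim}
import numpy as np

def dsd_label(trace, patterns, thresholds, shrink=0.8):
    """Label each sample with the closest state pattern, or UNKNOWN.

    patterns:   dict state -> 1-D reference array (one example per state)
    thresholds: dict state -> maximum accepted normalized distance
    shrink:     coefficient < 1 shrinking each state's capture area; the
                uncovered region is labelled UNKNOWN (the chi pattern).
    """
    trace = np.asarray(trace, dtype=float)
    n, labels = len(trace), []
    for i in range(n):
        best_state, best_dist = "UNKNOWN", np.inf
        for state, pat in patterns.items():
            k = len(pat)
            # every window of length k that contains sample i
            for a in range(max(0, i - k + 1), min(i, n - k) + 1):
                d = np.linalg.norm(trace[a:a + k] - pat) / k  # length-normalized
                if d < best_dist:
                    best_state, best_dist = state, d
        if best_state == "UNKNOWN" or best_dist > shrink * thresholds[best_state]:
            labels.append("UNKNOWN")
        else:
            labels.append(best_state)
    return labels
\end{verbatim}

This direct transcription recomputes every window from scratch for each sample; a practical implementation would reuse the distance computations shared by overlapping windows.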
Two metrics represent the performance of the \gls{dsd} and of any other algorithm for the same problem. The accuracy is computed as the number of correct labels over the total number of labels to predict. This metric is common and gives an overview of the performance comparable with a random baseline.

The reason for labeling each sample is to make the time series actionable. Other algorithms --- further down in the processing pipeline --- can evaluate the sequence of states detected for a machine in order to decide on the integrity of the machine. In this regard, a labeling error can have a different impact depending on its location. A single error at the transition between two states would result in a slight timing error for the state transition detection. However, a single error in the middle of a series of identical labels would result in the detection of a new, incorrect state, potentially triggering actions down the line. These two errors have the same impact on the accuracy but a radically different impact on the downstream processing. This illustrates that the accuracy does not give the complete picture. To evaluate the state detection at a higher level, the Levenshtein distance of the reduced labels is defined. The reduced labels are the vector of labels with every sequence of identical labels represented as a single symbol. The normalized state edit distance is defined as
\begin{equation}
nsed(truth,preds) = \dfrac{Lev(reduced(truth),reduced(preds))}{\max(|reduced(truth)|,|reduced(preds)|)}
\end{equation}
with $Lev$ the Levenshtein distance. This metric is complementary to the accuracy and will be computed for every evaluation of the state detection algorithms.

This work on the detection of machine activity from power consumption information led to the publication of an article~\cite{dsd_qrs} at the QRS conference.

\newpage
\section{Conclusion on Past Work}

The project of physics-based security at a global level, with complete independence from the protected machine, is not trivial. The main hurdle is the collection of information under a dual constraint. First, the collected data are always unlabeled and often partial --- i.e., they may not contain all possible activities. Second, the independence from the machine implies not having any control over the machine's activity to collect specific data. However, these constraints are also the strengths of this approach. The power consumption is a limited but reliable source of information that is difficult to forge. The challenge is for the solution to extract as much information as possible from it. The independence is also important, as it guarantees that an attacker cannot bypass the detection mechanism.

With these constraints in mind, the results of the preliminary works are encouraging. The \gls{bpv} and \gls{dsd} algorithms propose approaches to the problems of boot process integrity and runtime activity monitoring. These two complementary aspects cover a large area of the attack surface of a typical embedded system. The unique properties of host independence and unforgeability of the input data make the physics-based \gls{ids} a promising complement to any security suite.

More work is obviously required. The main point of interest is to evaluate the performance of the \gls{dsd} to make it as versatile and reliable as possible. From the xPSU project, we understood that a more granular measurement of the power consumption could be beneficial for detecting specific attacks and enabling root cause analysis instead of basic anomaly detection. The continuation of the research work will focus on runtime monitoring and investigate different data measurement scales and their respective benefits.