Arthur Grisel-Davy 2023-07-31 12:43:56 -04:00
parent 7f45c2e7f2
commit 6c413308a1


@ -75,22 +75,28 @@ Results of state detection with MAD enable the definition and verification of hi
\section{Introduction}
\gls{ids}s leverage different types of data to detect intrusions.
On one side, most solutions use labelled and actionable data, often provided by the system to protect.
This data can be the resource usage \cite{1702202}, program source code \cite{9491765} or network traffic \cite{10.1145/2940343.2940348} leveraged by an \gls{hids} or \gls{nids}.
On the other side, some methods consider only information that the system did not intentionally provide.
The system emits these activity by-products through physical mediums called side channels.
Common side-channel information for an embedded system includes power consumption \cite{yang2016power} or electromagnetic fields \cite{chawla2021machine}.
Side-channel information offers compelling advantages over agent-collected information.
First, the information is difficult to forge.
Because the monitored system is not involved in the data retrieval process, there is no risk that an attacker who compromised the system could easily send forged information.
For example, if an attacker performs any computation on the system --- which is the case for most attacks --- it will unavoidably affect a variety of side channels.
Some studies focus on altering the power consumption profile of software, but their goal is to mask the consumption pattern to avoid leaking side-channel information.
These solutions \cite{1253591,6918465} do not offer to change the pattern to an arbitrary target but to make all activities indistinguishable.
These methods still induce changes in the consumption pattern that make them identifiable by the detection system.
Second, the side-channel information retrieval process is often non-intrusive and non-disruptive for the monitored system.
Measuring the power consumption of a computer does not involve the cooperation or modification of the system \cite{10.1145/2976749.2978353}.
This host independence is crucial for safety-critical or high-availability applications, as the failure of either system --- monitored or monitoring --- does not affect the other.
These two properties --- reliable data and host independence --- set physics-based monitoring solutions apart with distinct advantages and use cases.
It is interesting to notice that leveraging side-channel analysis to detect malfunctions is not limited to software.
For production machines with high availability requirements, many side channels provide useful information about the state of the machine.
Common sources of information are vibrations \cite{zhang2019numerical}, the chemical composition of various fluids \cite{4393062}, the shape of a gear \cite{wang2015measurement}, or performance metrics like the throughput of a pump \cite{gupta2021novel}.
It is important to keep in mind that domains outside of software can also benefit from side-channel analysis tools tailored for security enforcement.
However, using side-channel data introduces new challenges.
One obstacle to overcome when designing a physics-based solution is the interpretation of the data.
@ -103,27 +109,29 @@ The state of a machine is often represented by a specific pattern.
This pattern could be, for example, a succession of specific amplitudes or a frequency/average pair for periodic processes.
These patterns are impossible to detect reliably with a simple threshold method.
Identifying the occurrence and position of these patterns makes the data actionable and enables higher-level security and monitoring policies --- i.e., policies that work at a higher level of abstraction \cite{tongaonkar2007inferring}.
For example, a computer starting at night or rebooting multiple times in a row should raise an alert for a possible intrusion or malfunction.
Rule-based \gls{ids}s using side-channel information require an accurate and practical pattern detection solution.
Many data-mining algorithms assume that training data is cheap, meaning that acquiring large --- labelled --- datasets is achievable without significant expense.
Unfortunately, collecting labelled data requires following a procedure and induces downtime for the machine, which can be expensive.
Collecting many training samples during normal operations of the machine is more time-consuming as the machine's activity cannot be controlled.
A more convenient data requirement would be a single sample of each pattern to detect.
Such a sample can be collected immediately after the installation of the measurement equipment, during normal operations of the machine.
This paper presents \gls{mad}, a distance-based, one-shot pattern detection method for time series.
\gls{mad} focuses on providing pre-defined state detection from only one training sample per class.
This approach enables the analysis of side-channel information in contexts where the collection of large datasets is impractical.
A window selection algorithm lies at the core of \gls{mad} and yields a stable classification of individual samples, essential for the robustness of high-level security rules.
In experiments, \gls{mad} outperforms other approaches in accuracy and the reduced Levenshtein distance on various simulated, lab-captured, and public time-series datasets.
We will present the current related work on physics-based security and time series pattern detection in Section~\ref{sec:related}.
Then we will introduce the formal and practical definitions of the solution in Sections~\ref{sec:statement} and~\ref{sec:solution}.
The two case studies presented in Sections~\ref{sec:cs1} and~\ref{sec:cs2} illustrate the performance of the solution in various situations.
Finally, we will discuss some important aspects of the proposed solution in Section~\ref{sec:discussion}.
\section{Related Work}\label{sec:related}
\agd{add something about STL}
Side-channel analysis focuses on extracting information from the involuntary emissions of a system.
This topic traces back to the seminal work of Paul C. Kocher.
He introduced power side-channel analysis to extract secrets from several cryptographic protocols \cite{kocher1996timing}.
This led to the new field of side-channel analysis \cite{randolph2020power}.
@ -144,9 +152,9 @@ To apply security policies to side-channel information, it is necessary to first
The problem of identifying pre-defined patterns in unlabelled time series is referenced under various names in the literature.
The terms \textit{activity segmentation} or \textit{activity detection} are the most relevant for the problem we are interested in.
The state-of-the-art methods in this domain focus on human activities and leverage various sensors such as smartphones \cite{wannenburg2016physical}, cameras \cite{bodor2003vision} or wearable sensors \cite{uddin2018activity}.
These methods rely on large labelled datasets to train classification models and detect activities \cite{micucci2017unimib}.
For real-life applications, access to large labelled datasets may not be possible.
Another approach, more general than activity detection, uses \gls{cpd}.
\gls{cpd} is a sub-topic of time series analysis that focuses on detecting abrupt changes in a time series \cite{truong2020selective}.
In many cases, these change points are assumed to represent state transitions of the observed system.
@ -154,7 +162,7 @@ However, \gls{cpd} is only the first step in state detection as classification o
Moreover, not all state transitions trigger abrupt changes in time series statistics, and some states include abrupt changes.
Overall, \gls{cpd} only fits a specific type of problem with stable states and abrupt transitions.
Neural networks rose in popularity for time series analysis with \gls{rnn}s.
Large \gls{cnn}s can perform pattern extraction in long time series, for example, in the context of \gls{nilm} \cite{8598355}.
\gls{nilm} focuses on the problem of signal disaggregation.
In this problem, the signal comprises an aggregate of multiple signals, each with its own patterns \cite{angelis2022nilm}.
This problem shares many terms and core techniques with this paper, but the nature of the input data makes \gls{nilm} a distinct area of research.
@ -173,10 +181,10 @@ However, they evaluate their work on the recognition of handwritten numerals, wh
\section{Problem Statement}\label{sec:statement}
%\gls{mad} focuses on detecting the state of a time series at any point in time.
We consider the problem from the point of view of a multi-class, mono-label classification problem \cite{aly2005survey} for every sample in a time series.
The problem is multi-class because multiple states can occur in one time series, and therefore any sample is assigned one of multiple states.
The problem is mono-label because only one state is assigned to each sample.
The classification is a mapping from the sample space to the state space.
\begin{problem-statement}[\gls{mad}]
Given a discretized time series $t$ and a set of patterns $P=\{P_1,\dots, P_n\}$, identify a mapping $m:\mathbb{N}\longrightarrow P\cup \lambda$ such that every sample $t[i]$
@ -197,10 +205,10 @@ The pattern $\lambda$ is the \textit{unknown} pattern assigned to the samples in
\end{figure}
\section{Proposed Solution: MAD}\label{sec:solution}
\gls{mad}'s core idea separates it from other traditional sliding window algorithms.
In \gls{mad}, the sample window around the sample to classify dynamically adapts for optimal context selection.
This principle influences the design of the detector and requires the definition of new distance metrics.
Because the lengths of the patterns may differ, our approach requires distance metrics robust to length variations.
%For the following explanation, the pattern set $P$ refers to the provided patterns only $\{P\setminus \lambda\}$ --- unless specified otherwise.
We first define the fundamental distance metric as the normalized Euclidean distance between two time series $a$ and $b$ of the same length $N_a=N_b$
\begin{equation}
@ -225,7 +233,7 @@ The sample receives the label $j$ associated with the pattern $P_j$ that results
The minimum distance from the pattern $P_j$ to all other patterns $P_l$ with $l\neq j$ --- denoted $ID_j$ --- forms the basis of the threshold $T_j$.
Intuitively, the patterns in $P$ represent most of the patterns expected in the trace.
Thus, to decide that a substring matches a pattern $P_j$, it must match $P_j$ better than any other pattern $P_l$ with $l\neq j$ does.
Otherwise, the distance metric justifies assigning the label of $P_j$ to a pattern of another label instead of the substring, which is counter-intuitive.\agd{make explanation better}
The inter-distance between $P_j$ and $P_l$, defined as
\begin{equation}
ID(P_j,P_l) = \min_{i\in[0,N_l-N_j]} nd(P_j,P_l[i:i+N_j])
@ -243,7 +251,7 @@ The shrinkage coefficient $\alpha$ provides some control over the confidence of
A small value shrinks the range of capture of each label more and will leave more samples classified as \textit{unknown}.
A large value leaves less area for the \textit{unknown} state and forces the detector to choose a label, even for samples unlike any pattern.
The \textit{unknown} label enables the detector to carry over the information of novelty to the output.
In cases where a substring does not resemble any pattern --- for example, in cases of anomalies or unforeseen activities --- the ability to inform of novel patterns enables a more granular definition of security policies.
Finally, we assign to each sample the label of the closest pattern with a distance lower than its threshold.
\begin{equation}
@ -272,7 +280,7 @@ The efficient implementation follows the operations:
\item For every sample in the substring, store the minimum of the previously stored and newly computed normalized distance as the sample distance.
\item Select the label by comparing the sample distances to the thresholds.
\end{enumerate}
This results in the same final value for the sample distance $sd(i,P_j)$ with fewer computations of the normalized distance, at the expense of additional --- but cheaper --- comparison operations.
Algorithm~\ref{alg:code} presents the implementation's pseudo-code.
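To make the three parts concrete, the following minimal Python sketch mirrors the structure described above. The exact normalization of the Euclidean distance, the threshold formula $T_j = \alpha \cdot ID_j$, the handling of patterns of unequal length in the inter-distance, a window step of one sample, and the use of $-1$ for the \textit{unknown} label are illustrative assumptions rather than values taken from the algorithm listing.
\begin{verbatim}
import numpy as np

def nd(a, b):
    # Normalized Euclidean distance between two equal-length windows
    # (normalizing by the window length is an assumption).
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.linalg.norm(a - b) / len(a)

def inter_distance(pa, pb):
    # ID(pa, pb): slide the shorter pattern inside the longer one and keep
    # the minimum normalized distance (the direction handling is assumed).
    short, long_ = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    return min(nd(short, long_[i:i + len(short)])
               for i in range(len(long_) - len(short) + 1))

def mad(t, patterns, alpha=1.0):
    t = np.asarray(t, dtype=float)
    n = len(t)
    # Part 1: one threshold per pattern, derived from its minimum
    # inter-distance to the other patterns (T_j = alpha * ID_j assumed).
    thresholds = [alpha * min(inter_distance(pj, pl)
                              for l, pl in enumerate(patterns) if l != j)
                  for j, pj in enumerate(patterns)]
    # Part 2: per-sample distance to each pattern, kept as the minimum over
    # all windows of that pattern's length containing the sample.
    sd = np.full((len(patterns), n), np.inf)
    for j, pj in enumerate(patterns):
        w = len(pj)
        for i in range(n - w + 1):
            d = nd(t[i:i + w], pj)
            sd[j, i:i + w] = np.minimum(sd[j, i:i + w], d)
    # Part 3: each sample takes the label of its closest pattern if that
    # distance is below the pattern's threshold; -1 encodes "unknown".
    labels = np.full(n, -1, dtype=int)
    for i in range(n):
        j = int(np.argmin(sd[:, i]))
        if sd[j, i] <= thresholds[j]:
            labels[i] = j
    return labels
\end{verbatim}
The per-sample minimum in the second part is what makes the window selection adaptive: each sample inherits the best-matching window that contains it.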
\begin{algorithm}
@ -337,9 +345,9 @@ Algorithm~\ref{alg:code} presents the implementation's pseudo-code.
\textbf{Time-Efficiency:}
\agd{Better time efficiency analysis and comparison with the efficiency of \gls{1nn}}
The time efficiency of the algorithm is expressed as a function of the number of normalized distance computations and the number of comparison operations.
Each part of the algorithm has its own time-efficiency expression, with Algorithm~\ref{alg:code} showing each of the three parts.
The first part, dedicated to the threshold computation, is polynomial in the number of patterns and linear in the length of each pattern.
The second part, in charge of computing the distances, is linear in the number of patterns, the length of the time series, and the length of each pattern.
Finally, the third part, focusing on the final label selection, is linear in both the length of the time series and the number of patterns.
Overall, the actual detection computation --- second and third parts --- is linear in all input sizes.
Adding an additional value to the time series triggers the computation of one more distance value per pattern, hence the linear relationship.
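Counting elementary operations rather than whole distance evaluations, these observations can be summarized by the rough order-of-magnitude estimate
\begin{equation}
C_{\text{detect}} \approx \underbrace{N_t \sum_{j=1}^{n} N_j}_{\text{distance computations}} + \underbrace{n \, N_t}_{\text{label selection}},
\end{equation}
with $N_t$ the length of the time series, $n$ the number of patterns, and $N_j$ the length of pattern $P_j$; this estimate is given for illustration only and covers the detection parts, not the threshold computation.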
@ -354,7 +362,7 @@ Thus the loops always terminate.
The second part iterates over the patterns and the time series with two nested loops.
Similarly to the first part, the time series is finite and never altered.
Thus the second part also terminates.
Finally, the third part uses the same loops as the second and also terminates.
Overall, \gls{mad} always terminates for any finite time series and finite set of finite patterns.
\textbf{Influence of $\alpha$}
@ -366,15 +374,15 @@ The default value for $\alpha$ is one.
This value follows the intuitive reasoning presented in Section~\ref{sec:solution}.
To better understand the influence of the shrink coefficient, the algorithm can be perceived as a 2D area segmentation problem.
Let us consider the 2D plane where each pattern has a position based on its shape (see Figure~\ref{fig:overview}).
A substring to classify also has a position in the plane and a distance to each pattern.
During classification, the substring takes the label of the closest pattern.
For any pattern $P_j$, the set of positions in the plane assigned to $P_j$ --- i.e., the set of positions for which $P_j$ is the closest pattern --- is called the area of attraction of $P_j$.
In a classic \gls{1nn} context, every point in the plane is in the area of attraction of one pattern.
This infinite area of attraction is not a desirable feature in this context.
Let us now consider a time series exhibiting anomalous or unforeseen behavior.
Some substrings in this time series do not resemble any of the provided patterns.
In an infinite area of attraction context, the anomalous points are assigned to a pattern, even if they poorly match it.
As a result, the behavior of the security rule can become unpredictable as anomalous points can receive a seemingly random label.
@ -384,13 +392,13 @@ The shrink coefficient $\alpha$ --- through the modification of the threshold $T
The lower the value of $\alpha$, the smaller the areas of attraction around each sample.
Applying a coefficient to the thresholds produces a reduction of the radius of the area of attraction, not a homothety of the initial areas.
In other words, the shrink does not preserve the shape of the area.
For a value $\alpha < 0.5$, all areas become disks --- in the 2D representation --- and all shape information is lost.
Figure~\ref{fig:areas} illustrates the areas of capture around the patterns for different values of $\alpha$.
\begin{figure}
\centering
\includegraphics[width=0.49\textwidth]{images/areas.pdf}
\caption{2D visualization of the areas of capture around each pattern as $\alpha$ changes. When $\alpha \ggg 2$, the areas of capture tend to equal those of a classic \gls{1nn}.}
\label{fig:areas}
\end{figure}
\agd{Increase font size}
@ -400,7 +408,7 @@ The influence of the $\alpha$ coefficient on the classification is monotonic and
Because $\alpha$ influences the thresholds, changing $\alpha$ results in moving the transitions in the detected labels.
In other words, a lower value of $\alpha$ expands the unknown segments while a higher value shrinks them until they disappear.
Figure~\ref{fig:alpha_impact} illustrates the impact of $\alpha$ on the width of unknown segments.
The impact of $\alpha$ on the number of unknown samples is also monotonic.
\begin{proof}
We prove the monotonicity of the number of unknown samples as a function of $\alpha$ by induction.
@ -419,7 +427,7 @@ When a threshold increases from $T_0$ to $T_1$, all the samples in $S_{T_0}$ als
It is also possible for samples to belong to $S_{T_1}$ but not to $S_{T_0}$ if their distance falls between $T_0$ and $T_1$.
Hence, $S_{T_0}$ is a subset of $S_{T_1}$ and the cardinality of $S_T$ as a function of $T$ is monotonically non-decreasing.
We conclude that the number of unknown samples as a function of $\alpha$ is monotonically non-increasing.
\end{proof}
@ -441,9 +449,9 @@ Figure~\ref{fig:alpha} presents the number of unknown samples in the classificat
\end{figure}
\section{Case Study 1: Comparison with Other Methods}\label{sec:cs1}
The first evaluation of \gls{mad} consists in detecting the states in time series from various machines.
We evaluate the performance of the proposed solution against other traditional methods to illustrate the capabilities and advantages of \gls{mad}.
\subsection{Performance Metrics}
We considered two metrics to illustrate the performance of \gls{mad}.
@ -460,13 +468,13 @@ However, the raw label list embeds state detection time information, which the L
We first reduce the ground truth and the detected labels by removing immediate duplicates of labels.
This reduction removes timing information yet conserves the global order of state occurrences.
The Levenshtein distance between the ground truth and the detected labels is low if every state occurrence is correctly detected.
Similarly, the metric is high if state occurrences are missed, added, or misdetected.
To remove length bias and make the metric comparable across datasets, we normalize the raw Levenshtein distance and define it as
\begin{equation}
levacc = \dfrac{Levenshtein(rgtruth,rlabels)}{max(rN_t,rN_l)}
\end{equation}
with $rgtruth$ and $rlabels$ respectively the reduced ground truth and reduced labels and $rN_t$ and $rN_l$ their lengths.
The Levenshtein distance provides complementary insights into the quality of the detection in this specific use case.
Figure~\ref{fig:metrics} illustrates the impact of an error on both metrics.
It is important to notice that zero represents the best Levenshtein distance and one the worst --- contrary to the accuracy.
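Concretely, the metric can be computed as in the following Python sketch: the label sequences are reduced by collapsing immediate repetitions, the Levenshtein distance is obtained with a standard dynamic-programming recurrence, and the result is normalized by the length of the longer reduced sequence. The helper names are illustrative, not taken from our implementation.
\begin{verbatim}
def reduce_labels(labels):
    # Collapse immediate repetitions, e.g. [0, 0, 1, 1, 0] -> [0, 1, 0].
    reduced = []
    for lab in labels:
        if not reduced or reduced[-1] != lab:
            reduced.append(lab)
    return reduced

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def levacc(ground_truth, detected):
    rgt, rdet = reduce_labels(ground_truth), reduce_labels(detected)
    return levenshtein(rgt, rdet) / max(len(rgt), len(rdet))
\end{verbatim}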
@ -480,7 +488,7 @@ It is important to notice that zero represents the best Levenshtein distance and
\subsection{Dataset}\label{sec:dataset}
\agd{include more datasets from REFIT. One per house would be perfect but simply more is already good. Add in annexe why other are rejected.}
We evaluate the performance of \gls{mad} against eight time series.
One is a simulated signal composed of sine waves of varying frequency and average.
Four were captured in a lab environment on consumer-available machines (two NUC PCs and two wireless routers).
Finally, two were extracted from the REFIT dataset \cite{278e1df91d22494f9be2adfca2559f92} and correspond to home appliances during real-life use.
@ -504,7 +512,7 @@ Table~\ref{tab:dataset} presents the times series and their characteristics.
\label{tab:dataset}
\end{table}
The dataset aims to provide diverse machines and state patterns to evaluate the performance.
For each time series, we generated the ground truth by manually labelling all sections of the time series using a custom-made range selection tool based on a Matplotlib \cite{Hunter:2007} application.
The dataset is publicly available \cite{zenodo}.
@ -518,29 +526,29 @@ The states to detect on these computing machines are \textit{powered off}, \text
With these states, it is possible to set up many security rules such as: \textit{"machine on after office hours"}, \textit{"X reboots in a row"} or \textit{"Coincident shutdown of Y machines within Z minutes"}.
\textbf{GENERATED:}
An algorithm generated the GENERATED time series following three steps.
First, the algorithm randomly selects multiple frequency/average pairs.
Second, the algorithm generates 18 segments by selecting a pair and a random length.
Finally, the algorithm concatenates the segments to form the complete time series.
The patterns correspond to a minimal-length example of each pair.
This time series illustrates the capabilities of the proposed solution in a case where a simple threshold would fail.
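The following short Python sketch illustrates this generation procedure; the number of frequency/average pairs, the amplitude, the segment-length range, and the sampling rate are arbitrary illustrative choices, and only the three-step structure follows the description above.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Step 1: randomly select frequency/average pairs (values are illustrative).
pairs = [(rng.uniform(0.5, 5.0), rng.uniform(0.0, 10.0)) for _ in range(4)]

# Step 2: generate 18 segments, each from a random pair and a random length.
fs = 100  # samples per second, arbitrary
segments, labels = [], []
for _ in range(18):
    freq, avg = pairs[rng.integers(len(pairs))]
    length = rng.integers(2 * fs, 10 * fs)  # random duration
    t = np.arange(length) / fs
    segments.append(avg + np.sin(2 * np.pi * freq * t))
    labels.append((freq, avg))

# Step 3: concatenate the segments to form the complete time series.
generated = np.concatenate(segments)
\end{verbatim}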
\textbf{REFIT:}
D. Murray et al. \cite{278e1df91d22494f9be2adfca2559f92} created the REFIT dataset for \gls{nilm} research.
This dataset is now widely used in this research area.
REFIT is composed of the global consumption of 20 houses, along with the specific consumption of nine appliances per house.
The global house consumption does not fit the problem statement of this paper as multiple patterns overlap.
However, the individual consumption of some appliances fits the problem statement, and two were selected.
The REFIT-H4A1 is the first appliance of the fourth house and corresponds to a fridge.
The REFIT-H4A4 is the fourth appliance of the fourth house and corresponds to a washing machine.
The activity in this second time series was sparse, with long periods without consumption.
The no-consumption sections are not challenging --- i.e., all detectors perform well on this type of pattern --- yet they make the manual labelling more difficult and level all results up.
For this reason, we removed large sections of inactivity between active segments to make the time series more challenging without tampering with the ranking of detector performances.
\subsection{Alternative Methods}
\agd{explain how the svm and mlp are trained.}
We implemented three alternative methods to compare with \gls{mad}.
The alternative methods are chosen to be well-established and of comparable complexity.
The methods are: a \gls{1nn} detector, an \gls{svm} classifier, and an \gls{mlp} classifier.
More complex solutions like \gls{rnn}s or \gls{cnn}s perform well on time series analysis but require too much data to be applicable to one-shot classification.
All alternative methods rely on a sliding window to extract substrings to classify.
@ -553,19 +561,19 @@ For the \gls{svm} and \gls{mlp} detectors, the window size is shorter than the s
The training sample extraction algorithm slides the window along all patterns to extract all possible substrings.
These substrings constitute the training dataset with multiple samples per pattern.
The \gls{mlp} is implemented using \cite{keras} and composed of a single layer with 100 neurons.
The number of neurons was chosen after evaluating the accuracy of the \gls{mlp} on one of the datasets (NUCPC\_1) with varying numbers of neurons.
Similarly, the \gls{svm} detector is implemented using \cite{sklearn} with the default parameters.
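A minimal sketch of this training procedure is given below, using scikit-learn's \texttt{MLPClassifier} as a stand-in for the Keras model (same single hidden layer of 100 neurons) so that a single library suffices; the window size value and the helper name are illustrative assumptions.
\begin{verbatim}
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def extract_training_windows(patterns, window_size):
    # Slide a fixed-size window along every pattern to build the training
    # set, yielding multiple samples per pattern (the pattern index is the
    # class label).
    X, y = [], []
    for label, pattern in enumerate(patterns):
        for i in range(len(pattern) - window_size + 1):
            X.append(pattern[i:i + window_size])
            y.append(label)
    return np.asarray(X), np.asarray(y)

def train_detectors(patterns, window_size):
    # window_size must be shorter than the shortest pattern (see text above).
    X, y = extract_training_windows(patterns, window_size)
    svm = SVC().fit(X, y)  # scikit-learn defaults, as described
    mlp = MLPClassifier(hidden_layer_sizes=(100,)).fit(X, y)  # 100 neurons
    return svm, mlp
\end{verbatim}
At detection time, the same window size slides over the time series and each window is classified independently.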
The \gls{1nn} considers one window per pattern length around each sample.
Every window is compared to its pattern, and the normalized Euclidean distance is considered for the decision.
Overall, it is possible to adapt the methods to work with variable-length patterns, but \gls{mad} is the only pattern-length-agnostic method by design.
\subsection{Results}\label{sec:results}
The benchmark consists in detecting the label of every sample for each time series with each method and computing the performance metrics.
The detectors that require training (\gls{svm} and \gls{mlp}) were re-trained for every evaluation.
Figure~\ref{fig:res} presents the results.
\gls{mad} is consistently as accurate as or more accurate than the alternative methods.
The Levenshtein distance illustrates how \gls{mad} provides smoother and less noisy labelling.
This stability introduces fewer state detection errors that could falsely trigger security rules.
With both performance metrics combined, \gls{mad} outperforms the other methods.
\begin{figure*}
@ -576,22 +584,22 @@ With both performances metrics combined, \gls{mad} outperforms the other methods
\end{figure*}
\section{Case Study 2: Attack Scenarios}\label{sec:cs2}
The second case study focuses on a realistic production scenario.
This case study aims to illustrate how \gls{mad} enables rules at a high level of abstraction by converting the low-level power consumption signal into a labelled and actionable state sequence.
\subsection{Overview}
This second case study illustrates the performance of the \gls{mad} detector on more realistic data.
To this end, a machine was set up to perform tasks on a typical office work schedule composed of work hours, sleep hours, and maintenance hours.
The scenario comprises four phases:
\begin{itemize}
\item 1 Night Sleep: During the night and until the worker begins the day, the machine is asleep in the S3 sleep state~\cite{sleep_state}. Any state other than sleep is considered anomalous during this time.
\item 2 Work Hours: During work hours, little restriction is applied to the activity. Only a long period with the machine asleep is considered anomalous.
\item 3 Maintenance: During the night, the machine wakes up as part of an automated maintenance schedule. During maintenance, updates are fetched, and a reboot is performed.
\item 4 No Long High Load: At no point should there be a sustained high load on the machine. Given the scenario of classic office work, having all cores of a machine maxed out is suspicious. Violations of this rule are generated by running the program xmrig for more than 30 seconds. Xmrig is legitimate crypto-mining software, but it is commonly abused by criminals to build crypto-mining malware.
\end{itemize}
\begin{figure}
@ -601,32 +609,32 @@ The scenario comprises 4 phases:
\label{fig:2w_experiment}
\end{figure}
In order to reduce the experimentation and processing time, the daily scenario is compressed into four hours, allowing six runs per day and a processing time of only $\approx 4$ minutes per run.
Note that this compression of the experiment time does not influence the results (the patterns are kept uncompressed) and is only for convenience and better confidence in the results.
Figure~\ref{fig:2w_experiment} illustrates the experiment scenario with both the real and compressed time.
The data capture follows the same setup as presented in the first case study.
A power measurement device is placed in series with the main power cable of the machine (a NUC micro-PC).
The measurement device captures the power consumption at 10 kilo-samples per second.
The pre-processing step downsamples the trace to 20 samples per second using a median filter.
This step greatly reduces the measurement noise and the processing time, and increases the consistency of the results.
The final sampling rate of 20 samples per second was selected empirically to be around one order of magnitude higher than the typical length of the patterns to detect (around five seconds).
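Assuming the downsampling is implemented as a block-wise median over non-overlapping blocks --- the exact filter implementation is an assumption made for illustration --- this pre-processing step amounts to the following sketch:
\begin{verbatim}
import numpy as np

def median_downsample(trace, fs_in=10_000, fs_out=20):
    # Reduce the sampling rate by taking the median of consecutive,
    # non-overlapping blocks (10 kS/s -> 20 S/s gives blocks of 500 samples).
    block = fs_in // fs_out
    n_blocks = len(trace) // block
    trimmed = np.asarray(trace[:n_blocks * block], dtype=float)
    return np.median(trimmed.reshape(n_blocks, block), axis=1)
\end{verbatim}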
For each compressed day of the experiment (four-hour segments, hereafter referred to as days), \gls{mad} performs state detection and returns a label vector.
This label vector associates a label with each sample of the power trace following the mapping: $-1$ is UNKNOWN, 0 is SLEEP, 1 is IDLE, 2 is HIGH, and 3 is REBOOT.
The training dataset comprises one sample per state, captured during the run of a benchmark script that interactively places the machine in each state to detect.
\agd{make dataset available}
The script on the machine generates logs that serve as ground truth to verify the results of rule checking.
Figure~\ref{fig:preds} presents an illustration of the results.
The main graph line in the middle is the power consumption over time.
The line's colour represents the machine's predicted state based on the power consumption pattern.
Below the graph, two lines illustrate the label vectors.
The top line shows the predicted labels and can be interpreted as a projection of the power consumption line on the x-axis.
The bottom line shows the ground-truth labels, generated from the scenario logs.
We can already notice in this figure that the prediction is correct most of the time, except for some noise around state transitions and uncertainty between idle and generic activities (represented as UNKNOWN).
The errors at transitions are explained by the training samples, which focus on stable states and do not provide labels for transition patterns.
A simple solution to avoid this issue would be to provide training patterns for state transitions.
This type of error foreshadows the good capabilities of this method for rule verification, presented in more detail in Section~\ref{2wexp-results}.
\begin{figure}
\centering
@ -637,10 +645,10 @@ The type of error foreshadows the good capabilities of this method for rules ver
\subsection{Security Rules}
Many rules can be imagined to describe the expected and unwanted behavior of a machine.
System administrators can define sophisticated rules to detect specific attacks or to match the typical activities of their infrastructure.
We selected four rules (see Table~\ref{tab:rules}) that are representative of common threats on \gls{it} infrastructures.
These rules are not exhaustive and are merely an example of the potential of converting power consumption traces to actionable data.
The rules are formally defined using the \gls{stl} syntax, which is designed for describing variable patterns with temporal components.
\begin{table*}
\centering
@ -650,7 +658,7 @@ The rules are formaly defined using the \gls{stl} syntax which is bespoke for de
\toprule \toprule
1 & "SLEEP" state only & $R_1 := \square_{[0,1h]}(s[t]=0)$ & Machine takeover, Botnet\cite{mitre_botnet}, Rogue Employee\\ 1 & "SLEEP" state only & $R_1 := \square_{[0,1h]}(s[t]=0)$ & Machine takeover, Botnet\cite{mitre_botnet}, Rogue Employee\\
2 & No "SLEEP" for more than 8m. & $R_4 := \square_{[1h,2h40]} (s[t_0]=0 \rightarrow \lozenge_{[t_0,t_0+1h]}(s[t_0]=0))$ & System Malfunction\\ 2 & No "SLEEP" for more than 8m. & $R_4 := \square_{[1h,2h40]} (s[t_0]=0 \rightarrow \lozenge_{[t_0,t_0+1h]}(s[t_0]=0))$ & System Malfunction\\
3 & Exactly one occurence of "REBOOT" & $R_2 := \lozenge(s[t_0]=3) \cup (\neg \square_{[t_0,t_0+2h40]}(s[t]=3)$ & \gls{apt}\cite{mitre_prevent}, Backdoors\\ 3 & Exactly one occurrence of "REBOOT" & $R_2 := \lozenge(s[t_0]=3) \cup (\neg \square_{[t_0,t_0+2h40]}(s[t]=3)$ & \gls{apt}\cite{mitre_prevent}, Backdoors\\
4 & No "HIGH" state for more than 30s. & $R_3 := \square (s[t_0]=2 \rightarrow \lozenge_{[t_0,t_0+30s]}(s[t]=2))$ & CryptoMining Malware \cite{mitre_crypto}, Ransomware\cite{mitre_ransomware}, BotNet\cite{mitre_botnet}\\ 4 & No "HIGH" state for more than 30s. & $R_3 := \square (s[t_0]=2 \rightarrow \lozenge_{[t_0,t_0+30s]}(s[t]=2))$ & CryptoMining Malware \cite{mitre_crypto}, Ransomware\cite{mitre_ransomware}, BotNet\cite{mitre_botnet}\\
\bottomrule
\end{tabular}
\end{table*}
\subsection{Results}\label{2wexp-results}
The performance measures reflect the ability of the complete pipeline (\gls{mad} and rule checking) to detect anomalous behavior.
The main metrics are the micro and macro $F_1$ scores of the rule-violation detection.
The macro-$F_1$ score is defined as the arithmetic mean over the individual $F_1$ scores, which gives a more robust evaluation of the global performance, as described in \cite{opitz2021macro}.
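For reference, the snippet below is a minimal sketch of how both scores can be computed with scikit-learn (the per-run violation labels shown are made up for illustration; the actual evaluation code and data are not reproduced here):
\begin{verbatim}
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions, one entry per (run, rule) pair:
# 1 = rule violated, 0 = rule satisfied.
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

micro = f1_score(y_true, y_pred, average="micro")  # pools every decision
macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(micro, macro)
\end{verbatim}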
Table~\ref{tab:rules-results} presents the detection performance for each rule.
The detection is perfect for this scenario, with no false positives or false negatives over XX\agd{updates} runs.
The perfect detection of more complex patterns like REBOOT illustrates the need for a system capable of matching arbitrary states.
Many common states of an embedded system appear as flat lines at varying average levels.
If the only states to detect were OFF, ON and HIGH, a simple threshold method would work wonders.
However, the REBOOT pattern is more complex.
It resembles generic activities and crosses most of the same thresholds.
To recognize it consistently, the classifier must have, at its core, a pattern-matching mechanism.
This illustrates that \gls{mad} balances the tradeoff between being simple, explainable, and efficient on one side and capable, complete, and versatile on the other.
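To make this point concrete, the toy example below (made-up traces, not data from the experiments) shows that a level threshold barely separates a REBOOT-like burst from generic activity of similar amplitude, while the distance to a reference pattern does:
\begin{verbatim}
import numpy as np

t = np.linspace(0, 1, 100)
reboot_ref   = np.concatenate([np.full(40, 2.0), np.full(20, 0.2), np.full(40, 2.0)])
generic      = np.full(100, 1.4) + 0.1 * np.sin(20 * t)
reboot_trace = reboot_ref + 0.05 * np.random.randn(100)

# Threshold view: both segments spend most of their time above a 1.0 "activity" level.
print((generic > 1.0).mean(), (reboot_trace > 1.0).mean())

# Pattern-matching view: Euclidean distance to the REBOOT template separates them.
print(np.linalg.norm(generic - reboot_ref), np.linalg.norm(reboot_trace - reboot_ref))
\end{verbatim}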
\begin{table}
\centering
\section{Discussion}\label{sec:discussion}
In this section, we highlight specific aspects of the proposed solution.
\textbf{Dynamic Window vs Fixed Windows: }
One of the core mechanisms of \gls{mad} is the ability to choose the best-fitting window to classify each sample.
This mechanism is crucial to overcome some of the shortcomings of a traditional \gls{1nn}.
It is essential to understand the advantages of this dynamic window placement to fully appreciate the performance of \gls{mad}.
Figure~\ref{fig:proof} illustrates a test case that focuses on the comparison between the two methods.
In this figure, the top graph represents the near-perfect classification of the trace into different classes by \gls{mad}.
To make the results more comparable, the $\alpha$ parameter of \gls{mad} was set to $\infty$ to disable the distance-threshold mechanism and focus on the dynamic window placement.
The middle graph represents the classification by a \gls{1nn}, and it illustrates the three types of errors that \gls{mad} aims to overcome.
The bottom graph represents the state predicted for each sample by each method, with $-1$ the UNKNOWN state and $[0-4]$ the possible states of the trace.
\begin{itemize}
\item Transition Bleeding Error: Around transitions, the \gls{1nn} tends to miss the exact transition timing and misclassifies samples.
This is explained by the rigidity of the window around the sample.
At the transition time, the two halves of the window compete to match different states.
Depending on the shape of the states involved, much more than half of the window may be needed before the new state is preferred, leading to misclassified samples around the transition.
In contrast, \gls{mad} will always choose a window that fully matches either of the states and does not straddle the transition, avoiding the transition error.
\item Out-of-phase Error: When a state is described by multiple iterations of a periodic pattern, the match between a window and the trace varies dramatically every half-period. When the window is in phase with the pattern, the match is maximal and \gls{1nn} perfectly fills its role. However, when the window and the pattern are out of phase, the match is minimal, and the nearest neighbor may be a flat pattern at the average level of the pattern. This error manifests itself through predictions switching between two values at half the period of the pattern. \gls{mad} avoids this error by moving the window by, at most, half a period to ensure a perfect match with the periodic pattern.
\item Unknown-Edges Error: Because of the fixed nature of the window of a \gls{1nn}, every sample that is less than half a window away from either end cannot be classified. This error matters little in most cases, where edge samples are less important, and many solutions are available to address it. However, \gls{mad} naturally solves this issue by shifting the window within the valid range up to the edge.
\end{itemize}
There are other methods than \gls{mad} to solve these issues, such as the \gls{dtw} distance metric, padding, or label post-processing.
However, this illustrates how \gls{mad} leverages, at its core, the dynamic window placement to dramatically improve the accuracy of the classification.
Dynamic window placement is a simple mechanism that does not involve complex and computationally expensive distance metrics like \gls{dtw} to improve matches.
This leaves the choice of the distance metric open for specific applications.
The dynamic window placement also avoids increased complexity by requiring the same number of distance computations as a \gls{1nn}.
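The snippet below is a minimal sketch of the dynamic-window idea only (Euclidean distance, one reference pattern per state, exhaustive search over the window offsets that still contain the sample); it illustrates the principle described above and is not the authors' implementation:
\begin{verbatim}
import numpy as np

def classify_sample(trace, i, refs, window):
    """Label sample i by trying every window position that contains i and
    stays inside the trace, keeping the best-matching reference pattern."""
    best_label, best_dist = None, np.inf
    for start in range(max(0, i - window + 1), min(i, len(trace) - window) + 1):
        segment = trace[start:start + window]
        for label, ref in refs.items():
            d = np.linalg.norm(segment - ref)  # any distance metric could be plugged in here
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label

# Toy example: state 0 is a flat low level, state 1 a flat high level.
refs = {0: np.full(8, 0.0), 1: np.full(8, 1.0)}
trace = np.concatenate([np.zeros(20), np.ones(20)]) + 0.01 * np.random.randn(40)
labels = [classify_sample(trace, i, refs, window=8) for i in range(len(trace))]
\end{verbatim}
A fixed-window \gls{1nn} would instead always use the window centered on $i$, which is precisely the configuration that produces the transition and edge errors listed above.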
\begin{figure*}
\centering
\includegraphics[width=0.9\textwidth]{images/proof.pdf}
\caption{Classification comparison between MAD and 1-NN with examples of prediction errors from 1-NN highlighted. The top graph is \gls{mad}, the middle graph is 1-NN, and the bottom graph is the prediction vector of both methods.}
\label{fig:proof}
\end{figure*}
\textbf{Limitations: }
The proposed method has some limitations that are important to acknowledge.
The current version of \gls{mad} is tailored for a specific use case.
The goal is to enable high-level security policies through secure and reliable detection of a machine's state from a time series.
The purpose of the state detection is not anomaly or novelty detection at the time-series level.
For this reason, the patterns to be detected by \gls{mad} bear some limitations.
First, the patterns must be distinct.
If two patterns share a significant portion of the time series, \gls{mad} will struggle to separate them, leading to unstable results.
Second, the states must be hand-selected.
The data requirement is extremely low --- only one sample per pattern --- so the selected samples must be reliable.
For now, a human expert decides on the best patterns to select.
While nothing is particularly difficult in this selection, it is still a highly manual process that we hope to automate in future iterations.
Finally, the states must be consistent.
If a state has an unpredictable signature --- i.e., each occurrence displays a significantly different pattern --- \gls{mad} will not be able to detect the occurrences reliably.
If a state has different patterns, it is possible to capture each variation as a distinct training sample to enable better detection.
The proposed solution is trivial to adapt for multi-shot detection, but the design decisions and implementation details are outside the scope of this paper.
\textbf{Extension to Multi-shot Classification: }
\gls{mad} is not limited to one-shot cases and can leverage more labeled data.
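One natural way to use the additional labeled data (a sketch of a possible multi-shot extension, not necessarily the design the authors intend) is to keep several reference patterns per state and score a window against the closest one:
\begin{verbatim}
import numpy as np

def multi_shot_distance(segment, refs_for_state):
    """Distance of a window to a state that has several reference patterns:
    keep the best match among all references for that state."""
    return min(np.linalg.norm(segment - ref) for ref in refs_for_state)

# Each state now maps to a list of reference patterns instead of a single one.
refs = {
    0: [np.full(8, 0.0)],                                                 # SLEEP: one flat pattern
    2: [np.full(8, 2.0), np.full(8, 2.0) + 0.3 * np.sin(np.arange(8))],   # HIGH: two variants
}
segment = np.full(8, 2.1)
print({state: multi_shot_distance(segment, patterns) for state, patterns in refs.items()})
\end{verbatim}
The rest of the pipeline (dynamic window placement, the $\alpha$ threshold, and rule checking) would remain unchanged.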
\section{Conclusion}
We present \gls{mad} and its associated rule-verification pipeline, a novel solution to enable high-level security policy enforcement from side-channel information.
Leveraging side-channel information requires labeling samples to discover the state of the monitored system.
Additionally, in the use cases where side channels are leveraged, collecting large labeled datasets can be challenging.
\gls{mad} is designed around three core features: low data requirement, flexibility of the detection capabilities, and stability of the results.
Built as a variation of a traditional \gls{1nn}, \gls{mad} uses a dynamic window placement that always provides the most relevant context for sample classification.
One hyper-parameter, $\alpha$, controls the confidence of the detector and the tradeoff between unclassified and misclassified samples.
The comparison to traditional state detection methods highlights the potential of \gls{mad} for the pre-processing of raw data for security applications.
\bibliographystyle{plain}