\documentclass[conference]{IEEEconf}
%\input epsf
\usepackage{graphicx}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{algpseudocodex}
\usepackage{algorithm}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage[toc,acronym,abbreviations,nonumberlist,nogroupskip]{glossaries-extra}

\renewcommand\thesection{\arabic{section}} % arabic numerals for the sections
\renewcommand\thesubsectiondis{\thesection.\arabic{subsection}.}% arabic numerals for the subsections
\renewcommand\thesubsubsectiondis{\thesubsectiondis.\arabic{subsubsection}.}% arabic numerals for the subsubsections

\newtheorem{problem-statement}{Problem Statement}
\newcommand\agd[1]{{\color{red}$\bigstar$}\footnote{agd: #1}}
\newcommand\SF[1]{{\color{blue}$\bigstar$}\footnote{sf: #1}}
\newcommand{\cn}{{\color{purple}[citation needed]}}
\newcommand{\pv}{{\color{orange}[passive voice]}}
\newcommand{\wv}{{\color{orange}[weak verb]}}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor IEEEconf}

\begin{document}
\input{acronyms}

\title{\textbf{\Large MAD: One-Shot Machine Activity Detector for Physics-Based Cyber Security\\}}

\author{Arthur Grisel-Davy$^{1,*}$, Sebastian Fischmeister$^{2}$\\
\normalsize $^{1}$University of Waterloo, Ontario, Canada\\
\normalsize agriseld@uwaterloo.ca, sfishme@uwaterloo.ca\\
\normalsize *corresponding author
}
%+++++++++++++++++++++++++++++++++++++++++++

% use only for invited papers
%\specialpapernotice{(Invited Paper)}

% make the title area
\maketitle

\begin{abstract}
Side-channel analysis offers several advantages over traditional machine monitoring methods. Its low intrusiveness, independence from the host, data reliability, and resistance to bypass are compelling arguments for using involuntary emissions as input for security policies. However, side-channel information often comes in the form of unlabeled time series representing a proxy variable of the activity.
Enabling the definition and enforcement of high-level security policies requires extracting the state or activity of the system. In this paper, we present a novel one-shot time series classifier called \gls{mad}, specifically designed and evaluated for side-channel analysis. \gls{mad} outperforms other traditional state detection solutions in terms of accuracy and, just as importantly, Levenshtein distance of the state sequence.
\end{abstract}

\IEEEoverridecommandlockouts
\vspace{1.5ex}
\begin{keywords}
\itshape component; formatting; style; styling; insert (key words)
\end{keywords}
% no keywords

% For peer review papers, you can put extra information on the cover
% page as needed:
% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
%
% for peerreview papers, inserts a page break and creates the second title.
% Will be ignored for other modes.
\IEEEpeerreviewmaketitle

\agd{reset acronyms}

\section{Introduction}
\glspl{ids} leverage different types of data to detect intrusions. On the one hand, most solutions use labeled and actionable data, often provided by the system being protected. In the software world, this data can be the resource usage \cite{1702202}, program source code \cite{9491765} or network traffic \cite{10.1145/2940343.2940348} leveraged by an \gls{hids} or \gls{nids}. In the machine monitoring world, the input data can be the shape of a gear \cite{wang2015measurement} or the throughput of a pump \cite{gupta2021novel}. On the other hand, some methods consider only information that the system did not intentionally provide. The system emits this information as a by-product of its activity through physical media called side channels. Common side-channel information for an embedded system includes power consumption \cite{yang2016power} or electromagnetic fields \cite{chawla2021machine}. For a production machine, common side-channel information includes vibrations \cite{zhang2019numerical} or the chemical composition of fluids \cite{4393062}.

Side-channel information offers compelling advantages over agent-collected information. First, the information is difficult to forge. Because the monitored system is not involved in the data retrieval process, an attacker who has compromised the system cannot easily send forged information. For example, if an attacker performs any computation on the system, which is the case for most attacks, that computation unavoidably affects a variety of side channels. Second, the side-channel information retrieval process is often non-intrusive and non-disruptive for the monitored system. Measuring the power consumption of a computer or the vibrations of a machine does not require the cooperation or modification of the system \cite{10.1145/2976749.2978353}. This host-independence property is crucial for safety-critical or high-availability applications, as the failure of one of the two systems (monitored or monitoring) does not affect the other. These two properties, reliable data and host independence, set physics-based monitoring solutions apart with distinct advantages and use cases.

However, using side-channel data introduces new challenges. One obstacle to overcome when designing a physics-based solution is the interpretation of the data. Because the data collection consists of measuring a physical phenomenon, the input data is often a discrete time series. The values in these time series are not directly actionable. In some cases, a threshold value is enough to assess the integrity of the system, and comparing each value of the time series to the threshold is sufficient \cite{jelali2013statistical}. However, whenever a simple threshold is not a reliable factor for the decision, a more advanced analysis of the time series is required to make it actionable. The state of a machine is often represented by a specific pattern.
This pattern could be, for example, a succession of specific amplitudes or a frequency/average pair for periodic processes. These patterns cannot be reliably detected with a simple threshold method. Identifying the occurrence and position of these patterns makes the data actionable and enables higher-level security and monitoring policies, i.e., policies that work at a higher level of abstraction \cite{tongaonkar2007inferring}. For example, a computer starting in the middle of the night or rebooting multiple times in a row should raise an alert for a possible intrusion or malfunction. Rule-based \glspl{ids} using side-channel information require an accurate and practical pattern detection solution.

Many data-mining algorithms assume that training data is cheap, meaning that acquiring large labeled datasets is achievable without major expense. Unfortunately, collecting labeled data requires following a procedure and induces downtime for the machine, which can be expensive. Collecting many training samples during normal operation of the machine is more time-consuming, as the machine's activity cannot be controlled. A single sample of each pattern to be detected in the time series is a more convenient data requirement. Collecting such a sample is possible immediately after installing the measurement equipment, during normal operation of the machine.

In this paper, we present \gls{mad}, a distance-based, one-shot pattern detection method for time series. \gls{mad} focuses on detecting pre-defined states from only one training sample per class. This approach enables the analysis of side-channel information in contexts where the collection of large datasets is impractical. A context selection algorithm lies at the core of \gls{mad} and yields stable classification of individual samples, which is important for the robustness of high-level security rules.
In experiments, \gls{mad} outperforms other approaches in accuracy and Levenshtein distance on various simulated, lab-captured, and public time series datasets.

Section~\ref{sec:related} reviews related work on physics-based security and time series pattern detection. Sections~\ref{sec:statement} and~\ref{sec:solution} introduce the formal and practical definitions of our solution. Section~\ref{sec:dataset} presents the datasets considered, Section~\ref{sec:results} reports the results, and Section~\ref{sec:discussion} discusses the solution.

\section{Related Work}\label{sec:related}
Side-channel analysis focuses on extracting information from involuntary emissions of a system. This topic traces back to the seminal work of Paul C. Kocher, who showed that timing side channels can reveal secrets from implementations of several cryptosystems \cite{kocher1996timing}. This led to the new field of side-channel analysis \cite{randolph2020power}. However, the potential of leveraging side-channel information for defense and security purposes remains mostly untapped. Information leaked through involuntary emissions on different channels provides insight into the activities of a machine. Acoustic emissions \cite{belikovetsky2018digital}, heat pattern signatures \cite{al2016forensics} or power consumption \cite{10.1145/3571288, gatlin2019detecting, CHOU2014400} can, among other side channels, reveal information about a machine's activity.

Side-channel information collection generally results in time series to analyze. A variety of methods exists for analyzing time series. For signature-based solutions, a specific extract of the data is compared to known-good references to assess the integrity of the host \cite{9934955, 9061783}.
This signature comparison enables the verification of expected and specific sections, but it requires that the sections of interest can be extracted and synchronized. Another solution for detecting intrusions is the definition of security policies. Security policies are sets of rules that describe wanted or unwanted behavior. These rules are built on input data accessible to the \gls{ids}, such as user activity \cite{ilgun1995state} or network traffic \cite{5563714, kumar2020integrated}. However, applying a rule places requirements on the input data, which must already be in a form that the rule can evaluate. This illustrates the gap between side-channel analysis methods and rule-based intrusion detection methods. To apply security policies to side-channel information, it is necessary to first label the data.

The problem of identifying pre-defined patterns in unlabeled time series is referred to under various names in the literature. The terms \textit{activity segmentation} and \textit{activity detection} are the most relevant for the problem we are interested in. The state-of-the-art methods in this domain focus on human activities and leverage various sensors such as smartphones \cite{wannenburg2016physical}, cameras \cite{bodor2003vision} or wearable sensors \cite{uddin2018activity}. These methods rely on large labeled datasets to train classification models and detect activities \cite{micucci2017unimib}. For real-life applications, access to large labeled datasets may not be possible.

Another approach, more general than activity detection, uses \gls{cpd}. \gls{cpd} is a sub-topic of time series analysis that focuses on detecting abrupt changes in a time series \cite{truong2020selective}. In many cases, these change points are assumed to represent state transitions of the observed system. However, \gls{cpd} is only the first step in state detection, as classification of the detected segments remains necessary \cite{aminikhanghahi2017survey}.
Moreover, not all state transitions trigger abrupt changes in time series statistics, and some states include abrupt changes. Overall, \gls{cpd} only fits a specific type of problem with stable states and abrupt transitions.

Neural networks have risen in popularity for time series analysis with \glspl{rnn}. Large \glspl{cnn} can perform pattern extraction in long time series, for example in the context of \gls{nilm} \cite{8598355}. \gls{nilm} focuses on the problem of signal disaggregation. In this problem, the signal is an aggregate of multiple signals, each with its own patterns \cite{angelis2022nilm}. This problem shares many terms and core techniques with this paper, but the nature of the input data makes \gls{nilm} a distinct area of research.

The specific problem of classification with only one example of each class is called one-shot classification, a special case of few-shot classification. This topic focuses on the classification of pre-extracted time series with few training samples, often using multi-level neural networks \cite{10.1145/3371158.3371162, 9647357}. However, in the context of side-channel analysis, a time series contains many patterns that are not extracted. Moreover, neural-based approaches lack interpretability, which can cause issues in the case of unforeseen time series patterns. Simpler approaches with novelty detection capabilities are required when the output serves as input for rule-based processing.

Finally, Duin et al. investigate the problem of distance-based few-shot classification \cite{duin1997experiments}. They present an approach based on the similarity between new objects and a dissimilarity matrix between items of the training set. The similarities are evaluated with Nearest-Neighbor rules or \gls{svm}. Their approach bears some interesting similarities with the one presented in this paper. However, they evaluate their work on the recognition of handwritten numerals, which is far from the use case we are interested in.

\section{Problem Statement}\label{sec:statement}
%\gls{mad} focuses on detecting the state of a time series at any point in time.
We treat the problem as a multi-class, mono-label classification problem \cite{aly2005survey} for every sample in a time series. The problem is multi-class because multiple states can occur in one time series, and therefore any sample is assigned one of multiple states. The problem is mono-label because only one state is assigned to each sample. The classification is a mapping from the sample space to the state space.

\begin{problem-statement}[\gls{mad}]
Given a discretized time series $t$ and a set of patterns $P=\{P_1,\dots, P_n\}$, identify a mapping $m:\mathbb{N}\longrightarrow P\cup \{\lambda\}$ such that every sample $t[i]$ maps to a pattern in $P\cup \{\lambda\}$ with the condition that the sample matches an occurrence of the pattern in $t$.
\end{problem-statement}

The time series $t: \mathbb{N} \longrightarrow \mathbb{R}$ is a finite, discretized, mono-variate, real-valued time series. The patterns (also called training samples) $P_j \in P$ are of the same type as $t$. Each pattern $P_j$ can take any length, denoted $N_j$. A sample $t[i]$ \textit{matches} a pattern $P_j \in P$ if there exists a substring of $t$ of length $N_j$ that includes the sample and whose distance to $P_j$ is below a pre-defined threshold. The pattern $\lambda$ is the \textit{unknown} pattern assigned to the samples in $t$ that do not match any of the patterns in $P$.

\begin{figure}
\centering
\includegraphics[width=0.45\textwidth]{images/overview.pdf}
\caption{Illustration of the sample distance from one sample to each training example in a 2D space.}
\label{fig:overview}
\end{figure}

\section{Proposed Solution: MAD}\label{sec:solution}
\gls{mad}'s core idea separates it from traditional sliding window algorithms.
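As an informal point of comparison, a fixed sliding window classifier can be sketched as follows. This sketch is an illustration only and does not reproduce a specific published method; the window $W_j(i)$, the generic distance $d$, and the name $m_{\mathrm{fixed}}$ are notation assumed here for the example. For each pattern $P_j$ of length $N_j$, the window is always centered on the sample to classify,
\[
W_j(i) = \big(t[i-\lfloor N_j/2 \rfloor], \dots, t[i-\lfloor N_j/2 \rfloor + N_j - 1]\big),
\]
and the sample receives the label of the closest pattern,
\[
m_{\mathrm{fixed}}(i) = \arg\min_{j \in \{1,\dots,n\}} d\big(W_j(i), P_j\big).
\]
Under such a rule, the position of the window relative to the sample never changes, so a sample near the start or the end of an occurrence is compared using a window that overlaps the neighboring activity.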
In \gls{mad}, the sample window around the sample to classify dynamically adapts for optimal context selection. This principle influences the design of the detector and requires the definition of new distance metrics. Because the pattern lengths may differ, our approach requires distance metrics that are robust to length variations.
%For the following explanation, the pattern set $P$ refers to the provided patterns only $\{P\setminus \lambda\}$ --- unless specified otherwise.

We first define the fundamental distance metric as the normalized Euclidean distance between two time series $a$ and $b$ of the same length $N_a=N_b$
\begin{equation}
nd(a,b) = \dfrac{\mathrm{EuclideanDist}(a,b)}{N_a}
\end{equation}
Using this normalized distance $nd$, we define the distance from a sample $t[i]$ to a pattern $P_j \in P$. This is the sample distance $sd$ defined as
\begin{equation}\label{eq:sd}
sd(i,P_j) = \min_{k\in [i-N_j+1,\,i]} nd(t[k:k+N_j],P_j)
\end{equation}
%with $P_j$ the training sample corresponding to the state $j$, and $t$ the complete time series.
where $t[k:k+N_j]$ denotes the $N_j$ consecutive samples of $t$ starting at index $k$. Computing the distance $sd(i,P_j)$ requires three steps: (1) selecting every substring of $t$ of length $N_j$ that contains the sample $t[i]$, (2) evaluating the normalized distance of each substring to the pattern $P_j$, and (3) taking $sd(i,P_j)$ as the smallest of these distances. For simplicity, Equation~\ref{eq:sd} omits the border conditions for the range of $k$. When the sample position $i$ is less than $N_j$ or greater than $N_t-N_j$, where $N_t$ is the length of $t$, the range adapts to only consider valid substrings.

Our approach uses a threshold-based method to decide what label to assign to a sample. For each sample in $t$, the algorithm compares the distance $sd(i,P_j)$ to the threshold $T_j$. The sample receives the label $j$ associated with the pattern $P_j$ that results in the smallest distance $sd(i,P_j)$ with $sd(i,P_j)N_j$. If $N_l