#import "utils.typ": *
#import "tablex.typ": tablex, hlinex, vlinex, colspanx, rowspanx
#import "@preview/acrostiche:0.2.0": *

#import "template.typ": *
#show: ieee.with(
  title: "Independent Boot Process Verification using Side-Channel Power Analysis",
  abstract: [
  Firmware attacks on embedded systems can have disastrous security implications.
  Through the firmware update mechanism, an attacker can tamper with the firmware to open known vulnerabilities, change security settings, or deploy custom backdoors, to pave the way for subsequent attacks or gain complete machine control.
  Firmware protection solutions often share the flaw of requiring the cooperation of the machine they aim to protect.
  If the machine gets compromised, the results from the protection mechanism become untrustworthy.
  
  One solution to this problem is to leverage an independent source of information to assess the integrity of the firmware and the boot-up sequence.
  In this paper, we propose a physics-based Intrusion Detection System called the Boot Process Verifier that only relies on side-channel power consumption measurement to verify the integrity of the boot-up sequence.
  The BPV works in complete independence from the machine to protect and requires only a few nominal training samples to establish a baseline of nominal behaviour.
  The range of application of this approach potentially extends to any embedded systems.
  We present three test cases that illustrate the performances of the BPV on micro-PC, network equipment (switches and wireless access points), and a drone.
  ],
  authors: (
    (
      name: "Arthur Grisel-Davy*",
      department: "Electrical and Computer Engineering",
      organization: "University of Waterloo",
      location: "Waterloo, Canada",
      email: "agriseld@uwaterloo.ca",
    ),
    (
      name: "Sebastian Fischmeister",
      department: "",
      organization: "",
      location: "",
      email: "sfischme@uwaterloo.ca",
    ),
  ),
  anon: true,
  index-terms: (),
  bibliography-file: "bibli.bib",
)

#init-acronyms((
  "IoT": ("Internet of Things",),
  "BPV": ("Boot Process Verifier",),
  "IDS": ("Intrusion Detection System",),
  "SVM": ("Support Vector Machine",),
  "PLC": ("Programable Logic Controler",),
  "DC": ("Direct Current",),
  "AC": ("Alternating Current",),
  "APT": ("Advanced Persistent Threat",),
  "PDU": ("Power Distribution Unit",),
  "VLAN": ("Virtual Local Area Network",),
  "VPN": ("Virtual Private Network",),
  "IQR": ("Inter-Quartile Range",),
  "IT":  ("Information Technology",),
  "OEM": ("Original Equipment Manufacturer",),
  "SCA": ("Side-Channel Analysis",),
  "ROM": ("Read Only Memory",),
  "AIM": ("Anomaly-Infused Model",),
  "RFC": ("Random Forest Classifier",),
  "BIOS": ("Basic Input/Output System",),
  "OS": ("Operating System",),
))


// add spaces around lists and tables
#show enum: l =>{v(5pt)
  l
}

#show list: l =>{v(5pt)
  l
}

#show table: t=>{v(10pt)
  t
  v(5pt)
}

#reset-all-acronyms()

= Introduction
The firmware of any embedded system is susceptible to attacks. Since firmware provides many security features, it is always of major interest to attackers.
Every year, new firmware vulnerabilities are discovered. Any device that requires firmware, such as computers @185175, #acr("PLC") @BASNIGHT201376, or #acr("IoT") devices @rieck2016attacks, is vulnerable to these attacks.
There are multiple ways to leverage a firmware attack. Reverting firmware to an older version allows an attacker to reopen discovered and documented flaws.
Canceling an update can ensure that previously deployed attacks remain available. Finally, implementing custom firmware enables full access to the machine.

The issue of malicious firmware is not recent.
The oldest firmware vulnerability recorded on #link("cve.mitre.org") related to firmware dates back to 1999.
Over the years, many solutions have been proposed to mitigate this issue.
The first and most common countermeasure is verifying the integrity of the firmware before applying an update.
The methods to verify a firmware typically include but are not limited to cryptography @firmware_crypto, blockchain technology @firmware_blockchain @firmware_blockchain_2 or direct data comparison @firmware_data. Depending on the complexity, the manufacturer can provide a tag @firmware_sign of the firmware or encrypt it to provide trust that it is genuine.
The integrity verification can also be performed at run-time as part of the firmware itself or with dedicated hardware @trustanchor.

The above solutions to firmware attacks share the common flaw of being applied to the same machine they are installed on.
This allows an attacker to bypass these countermeasures after infecting the machine.
An attacker that could avoid triggering a verification, tamper with the verification mechanism, feed forged data to the verification mechanism, or falsify the verification report could render any defense useless.
In order to avoid this flaw, the #acr("IDS") must leverage data that can be trusted even from a compromised machine.
#acr("IDS") are then subject to a trade-off between having access to relevant and meaningful information and keeping the detection mechanism separated from the target machine.
Our solution addresses this trade-off by leveraging side-channel information.

== Contributions
This paper presents a novel solution for firmware verification using side-channel analysis.
Building on the assumption that every security mechanism requiring the host cooperation is vulnerable to being bypassed, we propose using the device's power consumption signature during the firmware execution to assess its integrity.
Because of the intrinsic properties of side-channel information, the integrity evaluation does not involve any communication with the host and is based on trustworthy data.
A distance-based outlier detector that uses power traces of a nominal boot-up sequence can learn the expected pattern and detect any variation in a new boot-up sequence.
This novel solution can detect various attacks centered around manipulating firmware.
In addition to its versatility of detection, it is also easily retrofittable to almost any embedded system.
It requires minimal training examples and minor hardware modifications in most cases, especially for #acr("DC")-powered devices.

== Paper Organization
We elaborate on the type of attacks that our method aims to mitigate in the threat model @threat.
@bpv describes the proposed solution.
@exp-network,~@exp-drone, and~@aim present test cases that illustrate applications and variations of the #acr("BPV").
Finally, the paper ends with @discussion that provides more insights on specific aspects of the proposed solution and @conclusion for the conclusion.


= Related Work
Historically, the firmware was written on a #acr("ROM") and was impossible to change, thus preventing most attacks.
With the growing complexity of embedded systems, manufacturers developed procedures to enable remote firmware upgrades.
Firmware upgrades can address performances or security flaws or add features.
Unfortunately, attackers can leverage these firmware upgrade mechanisms to implement unauthorized or malicious pieces of software in the machine.
Almost all embedded systems are vulnerable to firmware attacks.
In industrial applications, studies proposed firmware attacks on control systems such as power systems field devices @power-devices, #acr("PLC") @plc_firmware, or any other industrial embedded system @santamarta2012here.
Safety-critical environments are also prime targets, including medical devices @health_review @pacemaker @medical_case_study, railway systems @railway or automotive systems @cars.

// Manufacturers try to protect firmware updates with cryptography but each solution interract with the host and cannot be trusted.
Manufacturers have implemented different security mechanisms to prevent firmware attacks.
The most common protection is code signing @8726545 @4531926.
The firmware code is cryptographically signed, or a checksum is computed.
This method suffers many possible bypasses.
First, an attack can modify the firmware at the manufacturer level @BASNIGHT201376, generating a trusted signature of the modified firmware.
Second, malware can bypass the verification @9065145.
Finally, an attacker can forge the result of the test to report valid firmware, even with dedicated hardware @thrangrycats.
Blockchain technology is also considered for guaranteeing firmware integrity @blockchain1.
A blockchain is a cryptographic chain of trust where each link is integrated into the next to guarantee that the information in the chain has not been modified.
This technology could provide software integrity verification at each point where a supply chain attack is possible.
However, the blockchain still needs to be verified at some point, and this verification can still be bypassed or forged.
Overall, no security mechanism that requires interacting with the host machine can guarantee firmware integrity as a compromised machine can produce forged results.

// SCA provides a way to verify the integrity without interacting with the host.
Historically, attackers leveraged #acr("SCA") in general and power analysis in particular @sca_attack.
Power consumption generally leaks execution information about the software activity that enables attacks.
Clark et al. proposed a method to identify web page visits from #acr("AC") power @clark_current_2013.
Conti et al. developed a method for identifying laptop-user pairs from power consumption @10.1145-2899007.2899009.
Seemingly harmless power consumption data from a mobile phone can even leak position data @michalevsky2015powerspy.
All these methods illustrate the potential of power side channels for attacks, but a well-intention program could also leverage them for defense.
After all, the lack of interaction required with the machine benefits the defense mechanism by increasing bypasses difficulty.
Following this idea, Clark et al. @wud proposed in 2013 a power consumption-based malware detector for medical devices.
Hernandez et al. included power consumption with network data for malware detection @8855288.
Electrical power consumption is especially appropriate for inferring the machine's activity for different reasons.
First, it is easy to measure in a reproducible manner.
Second, it can be easy to get access to relevant power cables with little tampering from the machine when the power conversion from #acr("AC") to #acr("DC") power is performed outside the machine.
It is also a common side channel to all embedded systems as they all consume electricity.
Finally, from the developer's point of view, forging power consumption to impoersonate other programms is difficult, especially at the firmware level. Because of the multiple abstraction layers between the code of a program and its implementation at the hardware level, changes in the code will result in a different power consumption pattern.
This is especially true when considering firmware or machines with low computation capabilities or highly specialized devices that have deterministic and stable execution patterns at boot-up.

However, to the best of our knowledge, no work leveraged the same data or method for firmware integrity verification.
Boot-ups are a natural target for defensive purposes are they are notoriously hard to protect, and host-based #acr("IDS") are not yet active to defend the machine.
Moreover, boot-ups produces significantly more consistent power consumption than normal operation on general-purpose machines as it follows a pre-defined process.

In light of the potential of side-channel attacks, some work proposed manipulating power consumption patterns.
Defense mechanisms like Maya @pothukuchi2021maya propose to obfuscate specific activity patterns by applying a control method to target a pre-defined mask.
If changing the power consumption pattern of software to impersonate another is possible, that could decrease the potential of side-channel-based #acr("IDS").
However, the current work is designed for defense and aims at obfuscating the patterns by applying masks with the goal of making all power signatures similar, not impersonating a specific one.
Thus, power consumption remains a trustworthy source of information as a different set of instructions necessarily generates a different power consumption.

= Threat Model<threat>
Many attacks are enabled by tampering with the firmware.
Because the firmware is responsible for the initialization of the components, the low-level communications, and some security features, executing adversary code in place of the expected firmware is a powerful capability @mitre @capec.
If given enough time, information or access, an attacker could take complete control of the machine and pave the way to future #acr("APT").

A firmware modification is defined as implementing any change in the firmware code.
Modifications include implementing custom functions, removing security features, or changing the firmware for a different version (downgrade or upgrade).
As long as the firmware is different from the one expected by the system administrator, we consider that it has been modified.
Downgrading the firmware to an older version (also called firmware rollback) is an efficient way to render a machine vulnerable to attacks.
Opposite to writing custom firmware, it requires little information about the machine.
All the documentation and resources are easily accessible online from the manufacturer.
Even the exploits are likely to be documented as they are the reason for the firmware upgrade.
An attacker would only need to wait for vulnerabilities to be discovered and then revert the firmware to an older version.
These properties make the firmware downgrade a powerful first step to performing more elaborate attacks.
Manufacturers sometimes implement firmware anti-rollback mechanisms to prevent this type of attack, but they are also vulnerable to bypass. 
Custom firmware may be required for more subtle or advanced attacks.
This requires more work and information as firmware codes are usually not open-source and are challenging to reverse engineer.
Moreover, the firmware is tailored for a specific machine, and it can be difficult for an attacker to perform a custom firmware attack.
Although, the successful implementation of custom firmware can lead to performing almost any attack.
Finally, a firmware upgrade could also be used to open a newly discovered vulnerability.

A complete firmware change is another form of firmware manipulation.
The manufacturer's firmware is replaced by another available firmware that supports the same machine.
Such alternatives can be found for computers @coreboot, routers @owrt @ddwrt @freshtomato, but also video game consoles or various embedded machines.
These alternative firmware are often open-source and provide more features, capabilities and performances as they are updated and optimized by their community.
Although these firmware are typically not malicious, implementing alternative firmware on a machine could allow an attacker to gain control of it without necessarily alerting the end user.

// = Side Channel Analysis<sca>
// @SCA leverages the emissions of a system to gain information about its operations.
// Side channels are defined as any involuntary emission from a system.
// Historically, the main side channels are sound, power consumption or electromagnetic.
// The common use of side-channel is in the context of attacks.
// The machine state is leveraged to extract critical information, allowing powerful and difficult-to-mitigate attacks.
// @SCA attacks are commonly applied to cryptography @cryptoreview @curveattack, keyboard typing @keyboard, printers @printer and many more.
// Side-channel attacks can be easy to implement depending on the chosen channel.
// Power consumption is a reliable source of information but requires physical access to the machine.
// Sound and electromagnetic fields can be measured for a distance but are also typically more susceptible to measurement location @iot_anoamly_sca.

= Boot Process Verifier<bpv>
Verifying the firmware of the machine using its power consumption represents a time series classification problem described in the problem statement:

#ps(title: "Boot Process Verification")[
Given a set of N time series $T=(t_1,...,t_N)$ corresponding to the power consumption during nominal boot sequences and a new unlabeled time series $u$, predict whether $u$ was generated by a nominal boot sequence. 
]

The training time series in $T$ are discretized, mono-variate, and real-valued time series of length $L$.
The length of the captured time series is a parameter of the detector, tuned for each machine.
The number of training time series $N$ is considered small relative to the usual size of training datasets in time series classification problems @sun2017revisiting.
All time series considered in this problem ($T union u$) are all of the same length and synchronized at capture time; see @sds for more details about the synchronization process.

== Detection Models<detector>
The #acr("BPV") performs classification of the boot traces using a distance-based detector and a threshold.
The core of the detection is the computation of the distance between the new trace $u$, and the training traces $T$.
If this distance is greater than the pre-computed threshold, then the #acr("BPV") classifies the new trace as anomalous.
Otherwise, the new trace is nominal.

The training phase consists in computing the threshold based on the known good training traces.
Two main specificities of this problem make the computation of the threshold difficult.
First, the training dataset only contains nominal traces.
This assumption is important as there are a nearly infinite number of ways how a boot sequence can be altered to create a malicious or malfunctioning device.
The #acr("BPV") aims at fingerprinting the nominal sequence, not recognizing the possible abnormal sequences.
Thus, the model can only describe the nominal traces statistically, based on the available examples, and assume that outliers to this statistical model correspond to abnormal boot sequences.

Second, the number of training samples is small.
In this case, small is relative to the usual number of training samples leveraged for time series classification (see @discussion for more details).
We assume that the training dataset contains between 10 and 100 samples.
This assumption is important for realism.
To keep the detector non-disruptive, the nominal boot sequences are captured during the normal operation of the device.
However, the boot-up of a machine is a rare event, and thus the training dataset must remain small.

The training sequence of the #acr("BPV") computes the distance threshold based on a statistical description of the distribution of the distance between each pair of normal traces.
The training sequence follows two steps.
+ The sequence computes the Euclidean distance between all pairs of training traces $D = {d(t_i,t_j) forall i,j in [1,...,N]^2; i eq.not j }$.
+ The sequence computes the threshold as $"thresh" = 1.5 dot "IQR"(D)$.

The #acr("IQR") is a measure of the dispersion of samples.
It is based on the first and third quartiles and defined as $ "IQR" = Q_3 - Q_1$ with $Q_3$ being the third quartile and $Q_1$ being the first quartile.
This value is common @han2011data for detecting outliers as a similar but more robust alternative to the $3 times sigma$ interval of a Gaussian distribution.
To apply the #acr("IQR") to the times series, we first compute the average of the nominal traces.
This average serves as a reference for computing the distance of each trace.
The Euclidean distance is computed between each trace and the reference, and the #acr("IQR") of these distances is computed.
The distance threshold takes the value $1.5 times "IQR"$. For the detection, the distance of each new trace to the reference is computed and compared to the threshold.
The new trace is considered anomalous if the distance is above the threshold.

=== Support For Multi-modal Boot-up Sequences<multi-modal>
Some machines can boot following multiple different boot-up sequences that are considered normal.
There exist various reasons for such behaviour.
For example, a machine can perform recovery operations if the power is interrupted while the machine is off or perform health checks on components that may pass or fail and trigger deeper inspections procedure.
Because the machines are treated as black boxes, it is important for the #acr("BPV") to deal with these multiple modes during training.
See @online for more details about how the online training procedure deals with multi-modal models.
Our approach is to develop one model per mode following the same procedure as for a single mode, presented in~@detector.
With multiple models available, the detection logic evolves to consider the new trace nominal if it matches any of the models.
If the new trace does not match any model, then it does not follow any of the nominal modes and is considered abnormal.
@fig-modes illustrates the trained #acr("BPV") models when two modes are present in the boot-up sequence.
The top part of the figure represents the average power trace for each mode. The x-axis is the time in milliseconds, and the y-axis is the amplitude in a unit proportional to the ampere (the absolute value of the consumption is unimportant for this study, only the global pattern matters).
The bottom part of the figure represents the distances and the threshold.
Each colour represents one mode.
Each point represents the distance from one training sample to the average trace of its mode.
The vertical dashed lines represent the distance threshold.

#figure(
  image("images/training.svg", width:100%),
  caption: [BPV model trained with two modes.]
)<fig-modes>

= Test Case 0: General Purpose Computer

This test case illustrates the first application of the #acr("BPV") and follows a slightly different setup and assumptions.
First, the power consumption measurement does not only contain the consumption of the machine to protect.
In some cases, capturing only the power consumption of the machine to protect is impossible.
This is the case if the power connections follow proprietary designs or if the machine to protect is inaccessible (for practical or security reasons).
In this case, the available data is an aggregate of the machine to protect and a second machine.
The second machine does not perform any task, and its contribution to the aggregated power consumption is constant.
Second, anomalous examples of boot-up sequences are available.
This test case was designed with an industry partner for the detection of two specific attacks: boot-up on an external USB drive and access to the machine's #acr("BIOS").

Because the machine and the expected attacks are known in advance, it is possible to tailor the #acr("BPV")'s parameters to maximize the performance at detecting the attacks.
Because of these two specificities, this test case should be regarded as a first iteration to demonstrate the potential of the #acr("BPV") in a more restrictive environment.
The following test cases in @exp-network and @exp-drone present other applications in more challenging environments.

== Experimental Setup
This test case was conducted on a micro PC running Windows 8.
The available power consumption was an aggregate of two micro-pc, one being the machine to protect.
The second machine remained idle for the duration of the experiment.
// @l3-setup illustrates the setup for the data capture.

// #figure(
//   image("images/l3-setup.svg", width:100%),
//   caption: [Overview of the experiment setup for test case 0.]
// )<l3-setup>

From these samples representing nominal boot-ups, it appears that the machine presents multiple boot-up modes.
Hence, the model is multi-modal with three modes.
See @multi-modal for more details about how multi-modal models are defined.
@l3-training illustrates the power traces associated with each mode.

#figure(
  image("images/l3-training.svg", width:100%),
  caption: [Multi-Modal BPV model after training.]
)<l3-training>

After collecting training traces, the distribution of samples in each model was $(0.31,0.06,0.62)$.
This distribution remains purely circumstantial from the point of view of the detector that considers the machine to protect as a black box.
The root causes for the appearance of one boot-up mode, or another is outside the scope of this work.
The final training dataset comprises 93 training samples divided into three models following the above distribution.

Abnormal boot-up traces are also collected.
The abnormal boot sequences are composed of sequences where an operator went into the #acr("BIOS") and then continued booting into the #acr("OS").

== Results
The models are manually tuned to obtain 100% accuracy in the classification of nominal and abnormal boot sequences.
Obtaining 100% accuracy illustrates that there is a clear separation between nominal and abnormal boot sequences for this type of attack.

Although this test case represents an unrealistic situation (mainly because the anomalous samples are accessible during training), it is still a valuable first evaluation of the #acr("BPV").
This test case serves as a proof-of-concept and indicates that there is a potential for the detection of firmware-level attacks with power consumption.
The #acr("BPV") detected the pre-defined attack with complete independence from the machine and with a perfect success rate.
Having access to anomalous samples enabled us to optimize the threshold placement to minimize false-positive (nominal boot-ups detected as anomalous) by relaxing the threshold value. 


= Test Case 1: Network Devices<exp-network>
To verify the performance of the proposed detector, we design an experiment that aims at detecting firmware modifications on different devices.
Networking devices are a vital component of any organization, from individual houses to complete data centers.
A network failure can result in significant downtime that is extremely expensive for data centers @downtime.
Compromised network devices can also result in data breaches and #acr("APT").
These devices are generally highly specialized in processing and transmitting information as fast as possible.
We consider four machines that represent consumer-available products for different prices and performance ranges.

- Asus Router RT-N12 D1. This router is a low-end product that provides switch, router and wireless access point capabilities for home usage.
- Linksys Router MR8300 v1.1. This router is a mid-range product that offers the same capabilities as the Asus router with better performance at a higher price.
- TP-Link Switch T1500G-10PS. This 8-port switch offers some security features for low-load usage.
- HP Switch Procurve 2650 J4899B. This product is enterprise-oriented and provides more performance than the TP-Link switch. This is the only product of the selection that required hardware modification, as the power supply is internal to the machine. The modification consists in cutting the 5V cables to install the capture system.

None of the selected devices support the installation of host-based #acr("IDS") or firmware integrity verification. 
The firmware is verified only during updates with a proprietary mechanism.
This experiment illustrates the firmware verification capability of a side-channel #acr("IDS") for these machines where common #acr("IDS") may not be applicable.

== Experimental Setup<setup>
Although this experiment is conducted in a controlled environment, the setup is representative of a real deployment (see @capture for more details).
We gather data from the four networking equipment, which are connected to a managed #acr("PDU") (see @capture for more details).
This #acr("PDU")'s output can be controlled by sending instructions on a telnet interface and enables turning each machine on or off automatically.
Each machine will undergo a firmware change or version change to represent a firmware attack.
The changes are listed in @tab-machines.

#v(10pt)
#figure(
  tablex(
    columns: (25%,25%,25%,25%),
    align: (left+horizon,right+horizon,right+horizon,right+horizon),
    auto-vlines: false,
    repeat-header: false,
    [*Device*], [*Original*], [*Change 1*], [*Change 2*],
    [TP-Link\ Switch], [20200805], [20200109], align(center, [X]),
    [HP Procurve\ Switch], [H.10.119], [H.10.117], align(center, [X]),
    [Asus Router], [Latest EOM], [OpenWrt\ v21.02.2], [OpenWrt\ v21.02.0],
    [Linksys\ Router], [Latest EOM], [OpenWrt\ v21.02.2], [OpenWrt\ v21.02.0],
  ),
  supplement: [Table],
  kind: "table",
  caption: [Machines used for the experiment and the changes applied.],
)<tab-machines>


This experiment aims at simulating an attack situation by performing firmware modifications on the target devices and recording the boot-up power trace data for each version.
For the switches, we flash different firmware versions provided by the #acr("OEM").
For wireless routers, their firmware is changed from the #acr("OEM") to different versions of #link("https://openwrt.org/")[OpenWrt].
In this study, we consider the latest #acr("OEM") firmware version to be the nominal version, expected to be installed on the machine by default.
Any other version or firmware represents an attack and is considered anomalous.

== Experiment procedure
To account for randomness and gather representative boot-up sequences of the device, we performed 500 boot iterations for each machine.
This cannot reasonably be performed manually with consistency.
Therefore, an automation script controls the #acr("PDU") with precise timings to perform the boots without human intervention.

The exact experimental procedure for each target has minor variations depending on the target's boot-up requirements and timings.
Overall, they all follow the same template:

+ Turn ON the power to the machine.
+ Wait for a predetermined time for the target to boot up completely.
+ Turn OFF the power to the machine and wait for a few seconds to ensure proper shutdown of the machine.

== Results<results>
We obtain the result per machine and per model.
The evaluation is performed on actual anomalous traces collected in a controlled environment.
For each evaluation, a random set of $10$ consecutive traces are selected from the nominal label to serve as the seed for the anomaly generation.
The anomaly generator returns a training dataset composed of normal traces on one side and anomalous artificial traces on the other.
The models train using this dataset and are evaluated against a balanced dataset combining $M in [20,50]$ consecutive anomalous traces selected at random across all abnormal classes and as many nominal traces.
The testing set is balanced between nominal and abnormal traces.
//The training requires only a few nominal traces.
This evaluation is repeated $50$ times, and the $F_1$ score is computed for each iteration.
The final score is the average of these $F_1$ scores.
The results are presented in @tab-results.


#figure(
  tablex(
    columns: (40%,40%),
    auto-vlines: false,
    align: (left, right),
    [*Machine*], [*BPV*],
    [TP-Link switch], [0.87],
    [HP switch], [0.98],
    [Asus router], [1.00],
    [Linksys router], [0.92]
  ),
  supplement: [Table],
  kind: "table",
  caption: [Results of detection.]
)<tab-results>

There are two hyper-parameters to tune to obtain the best performances.
First, the length of the trace considered is important.
The trace needs to cover the whole boot-up sequence to be sure to detect any possible change.
It is better to avoid extending the trace too much after the firmware sequence is done, as the typical operation of the machine can produce noisy power consumption that interferes with the optimal placement of the threshold by diluting important features.
Second, the number of training traces can be optimized.
A minimum of four traces is required for the #acr("IQR") method (for the computation of quartiles).
We confirmed empirically that a minimum of ten traces produces better results than four as it enables the #acr("IQR") to work on quartiles that are actually robust to outliers.
Collecting additional traces after these lower boundaries offers marginal performance improvements as the number of traces has little impact on the threshold placement of both models.
Moreover, collecting many boot-up sequences can be difficult to achieve in practice.
Finally, tuning the sampling rate is important to ensure the best performances.
A machine boot-up in two seconds will require a higher sampling rate than a machine booting in thirty seconds.
All these parameters are machine-specific and need manual tuning before deployment of the #acr("BPV").

= Test Case 2: Drone<exp-drone>
In this case study, we demonstrate the potential of physics-based #acr("IDS") for drones.
Drones are not new, but their usage both in the consumer and professional sectors increased significantly in recent years @droneincrease.
The core component of consumer-available drones is usually a microcontroller, also called a flight controller.
As with any other microcontrollers, the flight controller of a drone and its main program (we call the main program firmware in this paper) are subject to updates and attacks @8326960 @8433205.
Some of these attacks leverage firmware manipulations @8556480.
With custom firmware uploaded to a drone, many attack possibilities become accessible to the attacker, such as geofencing an area, recovering video feed, or damaging the drone.
Moreover, flight controllers are specialized devices that usually do not support the installation of third-party security software nor provide advanced security features such as cryptographic verification of the firmware.
With drone usage soaring and the lack of security solutions, the problem of verifying their firmware against anomalies becomes important.

== Experimental Setup
The experimental setup for this case study is similar to the one presented in @exp-network.
The experiment focuses on the Spiri Mu drone #footnote[#link("https://spirirobotics.com/products/spiri-mu/")] flashed with the PX4 Drone Autopilot firmware #footnote[#link("https://px4.io/")].
The firmware for the flight controller consists of a microprocessor-specific bootloader, a second-stage bootloader common to all supported flight controllers, and the operating system composed of different modules.

The battery of the drone is replaced with a laboratory power supply to ensure reproducible results.
The power consumption measurement device (see @capture for more details) is installed in series with the main power cable that provides an 11V #acr("DC") current to the drone.
A controllable relay is placed in series with the main cable to enable scripted boot-up and shutdown scenarios.
The experiment scenarios are:

- *Nominal:* The first two versions consisted of unmodified firmware provided by the PX4 project, the first one was a pre-compiled version, and the second one was locally compiled. Although both versions should be identical, some differences appeared in their consumption patterns and required the training of a dual-mode model.
- *Low Battery:* When the drone starts with a low battery level, its behaviour changes to signal the user of the issue. Any battery level below 11V is considered low. In this scenario, a nominal firmware is loaded, and the drone starts with 10V, triggering the low battery behaviour.
- *Malfunctioning Firmware:* Two malfunctioning firmware versions were compiled. The first introduces a _division-by-zero_ bug in the second stage bootloader. The second introduces the same bug but in the battery management module (in the OS part of the firmware). The second scenario should not introduce measurable anomalous patterns in the boot-up sequence as it only affects the OS stage.

#figure(
  image("images/drone-overlaps.svg", width: 100%),
  caption: [Overlap of boot-up traces for different scenarios and their average. Green = Low Battery (8 traces + average), Purple = Battery Module Bug (8 traces + average), Orange = Bootloader Bug (6 traces + average).]
)

The experiment procedure consists in starting the drone flight controller multiple times while capturing the power consumption.
The experiment consists in repeating each scenario between 40 and 100 times.
The experiment procedure automatically captures boot-up traces for better reproducibility (see @sds for more details).

#block(breakable:false)[
== Results

#figure(
  tablex(
    auto-vlines: false,
    align: (left, right, right),
    columns: (40%,auto,auto),
    [*Scenario*],[*Accuracy*], [*Nbr. of Samples*],
    [Original],[1],[98],
    [Compiled],[1],[49],
    [Low Battery],[1],[44],
    [Bootloader Bug],[1],[50],
    [Battery Module Bug], [0.082],[39],
  ),
  supplement: [Table],
  kind: "table",
  caption: [Results of the intrusion detection on the drone.]
)<drone-results>
]

@drone-results presents the results of the detection.
Both Original and Compiled represent nominal firmware versions.

Each scenario introduces disturbances in the boot-up sequence power consumption.
The model correctly identifies the anomalous firmware.
One interesting scenario is the Battery Module Bug that is mostly detected as nominal.
This result is expected as the bug affects the operations of the firmware after the boot-up sequence.
Hence, the power consumption in the first second of activity remains nominal.

It is interesting to notice that the differences in power consumption patterns among the different firmware are visible immediately after the initial power spike.
This suggests that future work could achieve an even lower time-to-decision, likely as low as 200ms depending on the anomaly.

 
// #figure(
//   image("images/l3-setup.svg", width: 100%),
//   caption: [Setup for BPV with an aggregated power measurement.]
// )<l3-setup>

// == Results


= Specific Case Study: Anomaly Infused Model<aim>
#reset-acronym("AIM")
When training a model to detect outliers, it is often expected to have examples of possible anomalies.
In some cases, gathering anomalies can be difficult, costly, or impossible.
In the context of this study, it would be impractical to measure power consumption patterns for a wide range of firmware anomalies.
Such data collection would require modifying firmware parameters, suspending equipment usage, or infecting production machines with malicious firmware.
These modifications are impossible for production equipment and would still lead to an incomplete training dataset.
To circumvent this limitation, we propose a variation of the training process called #acr("AIM").
#acr("AIM") leverages the specificity of distance-based detectors.
Distance-based detectors produce results based solely on the distance between two traces and a learned threshold.
The threshold is chosen to separate normal and anomalous traces as well as possible.
The actual pattern of the traces is not important for this type of detector as only the aggregated distance of each sample matters.
This implies that a distance-based detector that relies on a distance threshold can be trained the same way with either real anomalous traces or with artificial traces that present the same distance to the reference.
The idea behind an #acr("AIM") is to leverage this property and generate artificial anomalous traces to form the training set.
The additional anomalous traces are generated using only normal traces, which circumvents the need for extensive data collection.

== Anomaly Generation
The generation of anomalies from normal traces is based on the modification of the boot-up pattern.
Data augmentation can leverage different time series modification methods to help a model generalize.
The kind of modification applied to a trace is highly dependent on the application and the model @zimmering2021generating and requires domain knowledge about the system.
In this case, we want to generate anomalous traces with patterns similar to actual anomalous traces from a machine.
The first step of this process is to extract domain knowledge from all the traces collected.
The type of modification an anomalous trace presents compared to a normal trace help us design anomaly generation functions that apply the same type of transformation to normal traces with varying parameters.
The goal is not to reproduce exact anomalous traces but to generate a wide variety of possible anomalous traces given a small set of normal traces.

#figure(
  image("images/Bootup_traces_TPLINK.svg", width: 100%),
  caption: [
    Example of TP-Link switch boot-up traces for different firmware versions. The anomalous firmware (FIRMWARE V2) presents both a $y$ and $x$ shift.
  ],
)<fig-boot-up_traces_TPLINK>

@fig-boot-up_traces_TPLINK illustrates the domain knowledge extracted from the traces.
The anomalies that the power traces exhibit are a combination of types of transformations.

- The trace is shifted along the $y$ axis. In this case, the anomalous firmware consumes significantly more or less power than the normal one. This shift can affect the whole trace or only a part of it. This can be the result of different usage of the machine's components or a significant change in the firmware instructions.
- The trace is delayed or in advance along the $x$ axis. The anomalous trace presents the same patterns and amplitude as the normal trace but at different points in time. This shift can occur when parts of the firmware are added or removed by updates.

The anomaly generation function combines the domain knowledge observations and applies transformations to generate examples of anomalous traces from normal traces.
The possible transformations are:

- Shifting the time domain. The shift direction can be forward (introducing a delay) or backward (removing a delay). The parameters are the amplitude and the start time. Both parameters are random for each new trace. The boundaries of these values do not include very large shifts, as these would not contribute to the threshold placement. The missing parts of the trace after shifting are recreated based on the average and standard deviation value of the previous 0.5s, assuming a Gaussian noise.
    
- Shifting the $y$ axis. The direction of the shift can be upward (more energy consumed) or downward (less energy consumed). The amplitude is chosen between $4$ and $5$ times the standard deviation for each sample. These values ensure not creating an anomalous trace that conflicts with the normal traces and removing any shift too large that would not contribute to the threshold placement. The start time is random.
    
- Shifting both the $x$ and $y$ axis. Anomalous traces always presents a combination of $x$ shift, $y$ shift, or both.

#figure(
  image("images/schematic.svg", width: 100%),
  caption: [Overview of the #acr("BPV") model training and evaluation.],  
)<fig-overview>

@fig-overview presents an overview of the model's data flow.
The resulting dataset does not exactly resemble the anomalous traces that are collected but presents traces with the same range of distances to normal traces (see  @fig-Synthetic_vs_Normal_TPLINK).
To avoid introducing training biases, the dataset is balanced by generating new normal traces using the average and standard deviation if required.


#figure(
  image("images/Synthetic_vs_Normal_TPLINK.svg", width: 100%),
  caption: [Example of generated anomalous traces compared with captured normal traces for TP-Link switch.],
)<fig-Synthetic_vs_Normal_TPLINK>

== Results
A benchmarking algorithm evaluates the performances of #acr("AIM") against the performances of the original #acr("BPV") trained with only normal traces.
#acr("AIM") places the threshold to maximize the margins to the closest normal distance and abnormal distance in the same way a 1D-#acr("SVM") would.
This is a natural extension of the #acr("BPV") when abnormal samples are available.

Two main parameters are important to tune for the #acr("AIM").
First, the range for the length of the $x$ shift, and especially its lower bound, has an important influence on the generated anomalies.
A small lower bound allows for the generation of anomalous traces that closely resemble the nominal traces, which can result in a sub-optimal threshold placement.
Second, the range parameter for the y-shift affects the results in the same way.
The values for these parameters are chosen as part of the domain knowledge extraction, and they affect the transferability of the model (see @aim-conclusion).

The performances are evaluated on the same dataset as for Test Case 1 (see~@exp-network).
//The performance metric is the F1 score.
The final performance measure is the average F1 score (and its standard deviation) over 30 independent runs. 
Each run selects five random normal traces as seeds for the dataset generation.
The training dataset is composed of 100 training traces and 100 evaluation traces.
The results are presented in @tab-aim

#figure(
  tablex(
    auto-vlines: false,
    align: (left, right, right),
    columns:(33%,33%,33%),
    [*Machine*], [*BPV*], [*AIM+BPV*],
    [HP-SWITCH],[$0.895 plus.minus 0.094$],[$0.657 plus.minus 0.394$],
    [TPLINK-SWITCH], [$0.9 plus.minus 0.084$],[$0.985 plus.minus 0.035$],
    [WAP-ASUS], [$1.0 plus.minus 0.0$],[$0.987 plus.minus 0.041$],
    [WAP-LINKSYS],[$0.882 plus.minus 0.099$],[$0.867 plus.minus 0.098$],
  ),
  supplement: [Table],
  kind: "table",
  caption: [Performances of the #acr("AIM")+#acr("BPV") model compared with the original #acr("BPV") model (average F1 score #sym.plus.minus std.).]
)<tab-aim>

== Conclusion on the #acr("AIM") Model<aim-conclusion>
The #acr("AIM") model produces mixed results.
The model was tuned for the TPLINK-SWITCH machine and produced significantly better results for this machine.
However, the results did not transfer well to the other machines.
Experiments reveal that the values of parameters that produce the best results can differ significantly from one machine to the other, even for the same type of machine.
The idea of introducing artificial anomalous examples in the training dataset is valid and can enable the creation of a better model.
This artificial augmentation of the training set is especially interesting in the context of rare events where creating an extensive dataset is expensive.
However, the lack of transferability of the proposed methods indicates that further work is required to evolve #acr("AIM") into an undeniably better solution compared to #acr("BPV").


= Discussion<discussion>
This section elaborates on some important aspects of this study.

== Capture Process<capture>
We use a hardware device referred to as the capture box @hidden placed in series with the primary power cable of the target device.
The technology for measuring the current differs depending on the capture box's version.
For test cases 0 and 3, the box's shunt resistor generates a voltage drop representative of the global power consumption of the machine.
For test cases 1 and 2, a Hall effect sensor returns a voltage proportional to the current.
For both versions, the voltage value is sampled at 10 KSPS.
These samples are packaged in small fixed-size chunks and sent to a data aggregation server on a private #acr("VLAN").
The data aggregation server is responsible for gathering data from all of our capture boxes and sending it via a #acr("VPN") tunnel to a storage server.
Each file on the server contains 10s of power consumption data.

== Extraction of Synchronized Bootup Traces<sds>
A threshold-based algorithm extracts the boot-up sequences from the complete trace.
The extraction is not performed manually because of the large number of samples and to ensure a consistent detection of the boot-up pattern and a precise alignment of the different sequences extracted.
Because the boot-up sequence usually begins with a sharp increase in power consumption from the device, the algorithm leverages this rising edge to detect the start time accurately.
Two parameters control the extraction.
$T$ is the consumption threshold, and $L$ is the length of the boot-up sequence.
To extract all the boot-up sequences in a power trace, the algorithm evaluates consecutive samples against $T$.
If sample $s_(i-1)<T$ and $s_i>T$, then $s_i$ is the first sample of a boot-up sequence, and the next $L$ samples are extracted.
The power trace is resampled at 50ms using a median aggregating function to avoid any incorrect detections.
This pre-processing removes most of the impulse noise that could falsely trigger the detection method.
The final step of the detection is to store all the boot sequences under the same label for evaluation.
// The complete dataset corresponding to this experiment is available online @dataset.

== Support for Online Training<online>
In order to integrate the #acr("BPV") in a realistic environment, the training procedure takes the rareness of the boot-up event into account.
Once the measurement device is set up on the machine to protect, the streaming time series representing the power consumption serves as input for the boot-up detection algorithm (see @sds).
Each boot-up event is extracted and added to a dataset of boot-up traces.
Once the dataset reaches the expected number of samples, the #acr("BPV") computes the threshold and is ready for validation of the next boot-up.
The complete training and validation procedures require no human interactions.

In the case of a multi-modal model, the training procedure requires one human interaction.
Presented with the boot-up samples, an operator can transform the model into a multi-modal model by separating the training samples into multiple modes.
Once the separation is performed, the training procedure resumes without interaction, and the next boot-up samples are assigned to the closest mode.

Thanks to its low complexity and support for multi-modes, the #acr("BPV") can adapt during training to changes in the training data and support switching between single and multi-modes.

= Conclusion<conclusion>
This study illustrates the applicability of side-channel analysis to detect firmware attacks.
The proposed side-channel-based #acr("IDS") can detect firmware tampering from the power consumption trace.
Moreover, distance-based models leveraged in this study allow minimal training data requirements.
On a per-machine basis, anomaly generation can enhance the training set without additional anomalous data capture.
Finally, deploying this technology to production networking equipment requires minimal downtime and hardware intrusion, and it is applicable to clientless equipment.
This study illustrates the potential of independent, side-channel-based #acr("IDS") for the detection of low-level attacks that can compromise machines even before the operating system gets loaded.