#import "utils.typ": *
#import "template.typ": *
#show: ieee.with(
title: "Independent Few-shot Firmware Integrity Verification with Side-Channel Power Analysis",
abstract: [
],
authors: (
(
name: "Arthur Grisel-Davy",
department: "Electrical and Computer Engineering",
organization: "University of Waterloo",
location: [Waterloo, Canada],
email: "agriseld@uwaterloo.ca"
),
(
name: "Sebastian Fischmeister",
department: "Electrical and Computer Engineering",
organization: "University of Waterloo",
location: [Waterloo, Canada],
email: "sfischme@uwaterloo.ca"
),
),
index-terms: (),
bibliography-file: "bibli.bib",
)
#let acronyms = (
"BPV": "Boot Process Verifier",
"IDS": "Intrusion Detection System",
"SVM": "Support Vector Machine",
"PLC": "Programable Logic Controlers",
"DC": "Direct Current",
"AC": "Alternating Current",
"APT": "Advanced Persistent Threats",
"PDU": "Power Distribution Unit",
"VLAN": "Virtual Local Area Network",
"VPN": "Virtual Private Network",
"IQR": "Inter-Quartile Range",
"IT": "Information Technology",
"OEM": "Original equipment manufacturer",
"SCA": "Side-Channel Analysis",
"ROM": "Read Only Memory",
"AIM": "Anomaly-Infused Model",
"RFC": "Random Forest Classifier"
)
#show ref: r =>{// Overload the reference definition
// Grab the term, target of the reference
let term = if type(r.target) == "label"{
str(r.target)
}
else{
// Targets that are not labels (e.g., citations) fall through to the default branch below
none
}
if term in acronyms{
// Grab definition of the term
let definition = acronyms.at(term)
// Generate the key associated with this term
let state-key = "acronym-state-" + term
// Create a state to keep track of the expansion of this acronym
state(state-key,false).display(seen => {if seen{term}else{[#definition (#term)]}})
// Update state to true as it has just been defined
state(state-key, false).update(true)
}
else{
r
}
}
// add spaces around lists and tables
#show enum: l =>{v(5pt)
l
//v(5pt)
}
#show list: l =>{v(5pt)
l
//v(5pt)
}
#show table: t=>{v(10pt)
t
v(5pt)}
= Introduction
The firmware of any embedded system is susceptible to attacks. Since firmware provides many security features, it is always of major interest to attackers.
Every year, a steady stream of new vulnerabilities is discovered. Any device that requires firmware, such as computers @185175, @PLC @BASNIGHT201376, or IoT devices @rieck2016attacks, is vulnerable to these attacks.
There are multiple ways to leverage a firmware attack. Reverting firmware to an older version allows an attacker to reopen discovered and documented flaws.
Cancelling an update can ensure that previously deployed attacks remain available. Finally, implementing custom firmware enables full access to the machine.
The issue of malicious firmware is not recent.
The oldest firmware-related vulnerability recorded on #link("https://cve.mitre.org")[cve.mitre.org] dates back to 1999.
Over the years, many solutions have been proposed to mitigate this issue.
The first and most common countermeasure is verifying the integrity of the firmware before applying an update.
Methods to verify firmware typically include, but are not limited to, cryptography @firmware_crypto, blockchain technology @firmware_blockchain @firmware_blockchain_2, or direct data comparison @firmware_data. Depending on the complexity, the manufacturer can provide a signature tag @firmware_sign for the firmware or encrypt it to establish trust that it is genuine.
The integrity verification can also be performed at run-time as part of the firmware itself or with dedicated hardware @trustanchor.
The above solutions to firmware attacks share the common flaw of being applied to the same machine they are installed on.
This allows an attacker to bypass these countermeasures after infecting the machine.
An attacker who can avoid triggering a verification, tamper with the verification mechanism, feed it forged data, or falsify the verification report can render any such defense useless.
@IDS face a trade-off between having access to relevant and meaningful information and keeping the detection mechanism separate from the target machine.
Our solution addresses this trade-off by leveraging side-channel information.
== Contributions
This paper presents a novel solution for firmware verification using side-channel analysis.
Building on the assumption that every security mechanism operating on a host is vulnerable to being bypassed, we propose using the device's power consumption signature during the firmware execution to assess its integrity.
Because of the intrinsic properties of side-channel information, the integrity evaluation does not involve any communication with the host and is based on data that is difficult to forge.
A distance-based outlier detector that uses power traces of a nominal boot-up sequence can learn the expected pattern and detect any variation in a new boot-up sequence.
This novel solution can detect various attacks centred around manipulating firmware.
In addition to its versatility of detection, it is also easily retrofittable to almost any embedded system with @DC input and a consistent boot sequence.
It requires few training examples and, in most cases, only minor hardware modification, especially for @DC powered devices.
== Paper Organization
We elaborate on the types of attacks our method aims to mitigate in the threat model (Section~@threat); the technology we leverage to capture relevant side-channel information is described in Section~@capture.
Section~@bpv describes the proposed solution.
Sections~@exp-network, @exp-drone and~@aim present test cases that illustrate applications and variations of the @BPV.
Finally, Section~@discussion provides more insight into specific aspects of the proposed solution, and Section~@conclusion concludes the paper.
= Related Work
// All devices uses firmware and firmwares ar at risk of attacks
Historically, firmware was written to @ROM and could not be changed.
With the growing complexity of embedded systems, manufacturers developed procedures to allow remote firmware upgrades.
Firmware upgrades can address performance or security flaws or, less frequently, add features.
Unfortunately, attackers can leverage these firmware upgrade mechanisms to implement unauthorized or malicious pieces of software in the machine.
Almost all embedded systems are vulnerable to firmware attacks.
In industrial applications, studies have proposed firmware attacks on control systems such as power-system field devices @power-devices, @PLC @plc_firmware, and other industrial embedded systems @santamarta2012here.
Safety-critical environments are also prime targets, including medical devices @health_review @pacemaker @medical_case_study, railway systems @railway or automotive systems @cars.
// Manufacturers try to protect firmware updates with cryptography but each solution interract with the host and cannot be trusted.
Manufacturers have implemented different security mechanisms to prevent firmware attacks.
The most common protection is code signing @8726545 @4531926.
The firmware code is cryptographically signed, or a checksum is computed.
This method suffers from many possible bypasses.
First, the firmware can be modified at the manufacturer level @BASNIGHT201376, generating a trusted signature for the modified firmware.
Second, the verification can be bypassed @9065145.
Finally, the result of the test can be forged to report valid firmware, even with dedicated hardware @thrangrycats.
Blockchain technology is also considered for guaranteeing firmware integrity @blockchain1.
Blockchain is a cryptographic chain of trust where each link is integrated into the next to guarantee that the information in the chain has not been modified.
This technology could provide software integrity verification at each point where a supply chain attack is possible.
However, the blockchain still needs to be verified at some point, and this verification can still be bypassed or forged.
Overall, no security mechanism that requires interacting with the host machine can guarantee firmware integrity as the host machine can already be compromised and thus produce forged results.
// SCA provides a way to verify the integrity without interacting with the host.
Historically, attackers leveraged @SCA in general and power analysis in particular @sca_attack.
Power consumption generally leaks information about software activity, and this leakage enables attacks.
Clark et al. proposed a method to identify web page visits from @AC power @clark_current_2013.
Conti et al. developed a method for identifying laptop-user pairs from power consumption @10.1145-2899007.2899009.
Seemingly harmless power consumption data from a mobile phone can even leak position data @michalevsky2015powerspy.
All these methods illustrate the potential of power side channels for attacks, but a well-intentioned program could also leverage them for defense.
After all, the absence of required interaction with the machine benefits the defense mechanism by making bypasses more difficult.
Following this idea, Clark et al. @wud proposed in 2013 a power consumption-based malware detector for medical devices.
Hernandez et al. included power consumption with network data for malware detection @8855288.
Electrical power consumption is especially appropriate for inferring machine activity for two reasons.
First, it is easy to measure in a reproducible manner.
Access to the relevant power cables requires little tampering with the machine when the @AC to @DC conversion is performed outside the machine, and electricity consumption is a side channel common to all embedded systems.
Second, it is hard to fake from the developer's point of view. Because of the multiple abstraction layers between the code of a program and its implementation at the hardware level, changes in the code result in a different power consumption pattern.
This is especially true for firmware, machines with low computational capabilities, or highly specialized devices with deterministic and stable execution patterns at boot-up.
However, to the best of our knowledge, no work leveraged the same data or method for firmware integrity verification.
Bootups are a natural target for defensive purposes, as they are notoriously hard to protect and host-based @IDS are not yet active in defending the machine.
Moreover, bootup produces significantly more consistent power consumption than normal operation on general-purpose machines as it follows a pre-defined process.
In light of the potential of side-channel attacks, some work proposed manipulating power consumption patterns.
Defense mechanisms like Maya @pothukuchi2021maya propose to obfuscate specific activity patterns by applying a control method that targets a pre-defined mask.
If changing the power consumption pattern of one piece of software to impersonate another were possible, that could decrease the potential of side-channel-based @IDS.
However, that work is designed for defense and aims at obfuscating patterns by applying masks that make all power signatures similar, not at impersonating a specific one.
Thus, power consumption remains a trustworthy source of information as a different set of instructions necessarily generates a different power consumption.
= Threat Model<threat>
Many attacks are enabled by tampering with the firmware.
Because the firmware is responsible for the initialization of the components, the low-level communications, and some security features, executing adversary code in place of the expected firmware is a powerful capability @mitre @capec.
Given enough time, information, or access, an attacker could take complete control of the machine and pave the way for future @APT.
A firmware modification is defined as implementing any change in the firmware code.
Modifications include implementing custom functions, removing security features, or changing the firmware for a different version (downgrade or upgrade).
As long as the firmware is different from the one expected by the system administrator, we consider that it has been modified.
Downgrading the firmware to an older version is an efficient way to render a machine vulnerable to attacks.
In contrast to writing custom firmware, it requires little information about the machine: all the documentation and resources are easily accessible online from the manufacturer.
Even the exploits are likely to be documented as they are the reason for the firmware upgrade.
An attacker would only need to wait for vulnerabilities to be discovered and then revert the firmware to an older version.
These properties make the firmware downgrade a powerful first step to performing more elaborate attacks.
Custom firmware may be required for more subtle or advanced attacks.
This requires more work and information, as firmware code is usually not open source and is challenging to reverse engineer.
Moreover, firmware is tailored to a specific machine, which makes a custom firmware attack difficult to carry out.
However, if custom firmware can be successfully installed, almost any attack can be performed.
Finally, a firmware upgrade could also be used to expose the machine to a newly discovered vulnerability.
A complete firmware change is another form of firmware manipulation.
The manufacturer's firmware is replaced by another available firmware that supports the same machine.
Such alternatives can be found for computers @coreboot, routers @owrt @ddwrt @freshtomato, but also video game consoles or various embedded machines.
These alternative firmware projects are often open source and provide more features, capabilities, and performance, as they are updated and optimized by their communities.
Implementing alternative firmware on a machine could allow an attacker to gain control of it without necessarily alerting the end user.
// = Side Channel Analysis<sca>
// @SCA leverages the emissions of a system to gain information about its operations.
// Side channels are defined as any involuntary emission from a system.
// Historically, the main side channels are sound, power consumption or electromagnetic.
// The common use of side-channel is in the context of attacks.
// The machine state is leveraged to extract critical information, allowing powerful and difficult-to-mitigate attacks.
// @SCA attacks are commonly applied to cryptography @cryptoreview @curveattack, keyboard typing @keyboard, printers @printer and many more.
// Side-channel attacks can be easy to implement depending on the chosen channel.
// Power consumption is a reliable source of information but requires physical access to the machine.
// Sound and electromagnetic fields can be measured for a distance but are also typically more susceptible to measurement location @iot_anoamly_sca.
= Boot Process Verification<bpv>
Verifying the firmware of the machine using its power consumption represents a time series classification problem described in the problem statement:
#ps(title: "Boot Process Verification")[
Given a set of N time series $T=(t_1,...,t_N)$ corresponding to the power consumption during nominal boot sequences and a new unlabeled time series $u$, predict whether $u$ was generated by a nominal boot sequence.
]
The training time series in $T$ are discretized, univariate, real-valued time series of length $L$.
The length of the captured time series is a parameter of the detector, tuned for each machine.
The number of training time series $N$ is considered small relative to the usual size of training datasets in time series classification problems @sun2017revisiting.
All time series considered in this problem ($T union {u}$) are of the same length and synchronized at capture time; see Section~@sds for more details about the synchronization process.
== Detection Models<detector>
The @BPV performs classification of the boot traces using a distance-based detector and a threshold.
The core of the detection is the computation of the distance between the new trace $u$, and the training traces $T$.
If this distance is greater than the pre-computed threshold, then the @BPV classifies the new trace as anomalous.
Otherwise, the new trace is nominal.
The training phase consists of computing the threshold based on the known-good training traces.
Two specific properties of this problem make the computation of the threshold difficult.
First, the training dataset only contains nominal traces.
This assumption is important as there is a nearly infinite number of ways in which a boot sequence can be altered to create a malicious or malfunctioning device.
The @BPV aims at fingerprinting the nominal sequence, not recognizing the possible abnormal sequences.
Thus, the model can only describe the nominal traces statistically, based on the available examples, and assume that outliers to this statistical model correspond to abnormal boot sequences.
Second, the number of training samples is small.
In this case, small is relative to the usual number of training samples leveraged for time series classification (see @discussion for more details).
We assume that the training dataset contains between ten and 100 samples.
This assumption is important for realism.
To keep the detector non-disruptive, the nominal boot sequences are captured during the normal operation of the device.
However, the bootup of a machine is a rare event, and thus the training dataset must remain small.
The training sequence of the @BPV computes the distance threshold based on a statistical description of the distribution of the distance between each pair of normal traces.
The training sequence follows two steps.
+ The sequence computes the distance between all pairs of training traces: $D = {d(t_i,t_j) : (i,j) in {1,...,N}^2, i eq.not j}$.
+ The sequence computes the threshold as $"thresh" = 1.5 dot "IQR"(D)$, with IQR the Inter-Quartile Range of the distance set $D$.
The @IQR is a measure of the dispersion of samples.
It is based on the first and third quartiles and defined as $ "IQR" = Q_3 - Q_1$ with $Q_3$ the third quartile and $Q_1$ the first quartile.
This value is commonly used @han2011data to detect outliers as a similar but more robust alternative to the $3"sigma"$ interval of a Gaussian distribution.
To apply the @IQR to the time series, we first compute the average of the nominal traces.
This average serves as a reference for computing the distance of each trace.
The Euclidean distance is computed between each trace and the reference, and the @IQR of these distances is computed.
The distance threshold takes the value $1.5 dot "IQR"$. For the detection, the distance of each new trace to the reference is computed and compared to the threshold.
If the distance is above the threshold, the new trace is considered anomalous.
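
The following minimal sketch illustrates the training and detection steps described above, assuming equal-length, synchronized traces stored as rows of a matrix; the function names are illustrative, not the production implementation.

```python
# Minimal sketch of the distance-based detector described above, assuming
# equal-length, synchronized traces stored as rows of a NumPy array.
import numpy as np

def train_bpv(traces: np.ndarray) -> tuple[np.ndarray, float]:
    """Compute the reference trace and the IQR-based distance threshold."""
    reference = traces.mean(axis=0)                         # average nominal trace
    distances = np.linalg.norm(traces - reference, axis=1)  # Euclidean distance per trace
    q1, q3 = np.percentile(distances, [25, 75])
    threshold = 1.5 * (q3 - q1)                             # thresh = 1.5 * IQR
    return reference, threshold

def is_anomalous(trace: np.ndarray, reference: np.ndarray, threshold: float) -> bool:
    """A new trace is anomalous if its distance to the reference exceeds the threshold."""
    return bool(np.linalg.norm(trace - reference) > threshold)
```

Training returns the pair used by the detector, so a deployment only needs to persist the reference trace and one scalar threshold.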
=== Support For Multi-modal Boot-up Sequences
Some machines can boot following multiple different bootup sequences that are considered normal.
There can exist various reasons for such behaviour.
For example, a machine can perform recovery operations if the power is interrupted while the machine is off, or perform health checks on components that may pass or fail and trigger deeper inspection procedures.
Because the machines are treated as black boxes, it is important for the @BPV to deal with these multiple modes during training.
Our approach is to develop one model per mode following the same procedure as for a single mode, presented in Section~@detector.
Then, the detection logic evolves to consider the new trace nominal if it matches any of the models.
If the new trace does not match any model, then it does not follow any of the nominal modes and is considered abnormal.
@fig-modes illustrates the trained @BPV models when two modes are present in the bootup sequence.
#figure(
image("images/training.svg", width:100%),
caption: [BPV model trained with two modes.]
)<fig-modes>
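
Under the same assumptions, multi-modal support reduces to training one (reference, threshold) pair per mode and accepting a trace as soon as any mode accepts it, as in this sketch that reuses the functions above:

```python
# Multi-modal extension of the sketch above: one single-mode model per
# nominal boot mode (function names remain illustrative).
def train_multimodal(modes: list) -> list:
    """Train one (reference, threshold) model per nominal boot mode."""
    return [train_bpv(mode_traces) for mode_traces in modes]

def is_anomalous_multimodal(trace, models) -> bool:
    """A trace is anomalous only if every mode's model rejects it."""
    return all(is_anomalous(trace, ref, thr) for ref, thr in models)
```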
= Test Case 1: Network Devices<exp-network>
To verify the performance of the proposed detector, we design an experiment that aims at detecting firmware modifications on different devices.
Networking devices are a vital component of any organization, from individual houses to complete data centers @downtime.
A network failure can result in significant downtime that is extremely expensive for data centers.
Compromised network devices can also result in data breaches and @APT.
These devices are generally highly specialized in processing and transmitting information as fast as possible.
We consider four machines that represent consumer-available products for different prices and performance ranges.
- Asus Router RT-N12 D1. This router is a low-end product that provides switch, router and wireless access point capabilities for home usage.
- Linksys Router MR8300 v1.1. This router is a mid-range product that offers the same capabilities as the Asus router with better performance at a higher price.
- TP-Link Switch T1500G-10PS. This 8-port switch offers some security features for low-load usage.
- HP Switch Procurve 2650 J4899B. This product is enterprise-oriented and provides more performance than the TP-Link switch. This is the only product of the selection that required hardware modification, as the power supply is internal to the machine. The modification consists of cutting the 5V cables to implement the capture system.
None of the selected devices supports the installation of host-based @IDS or firmware integrity verification.
The firmware is verified only during updates with a proprietary mechanism.
This experiment illustrates the firmware verification capability of a side-channel @IDS for these machines where common @IDS may not be applicable.
== Experimental Setup<setup>
Although this experiment is conducted in a controlled environment, the setup is representative of a real deployment.
We gather data from the four networking devices, which are connected to a managed @PDU (see Section~@capture for more details).
This @PDU's output can be controlled by sending instructions on a telnet interface and enables turning each machine on or off automatically.
Each machine undergoes a firmware change or version change to represent a firmware attack.
The changes are listed in @tab-machines.
#figure(
table(
columns: (auto,auto,auto,auto),
align: horizon,
[*Equipment*], [*Original \ Firmware*], [*Modification 1*], [*Modification 2*],
[TP-Link\ Switch], [20200805], [20200109], align(center, [X]),
[HP Procurve\ Switch], [H.10.119], [H.10.117], align(center, [X]),
[Asus Router], [Latest OEM], [OpenWrt\ v21.02.2], [OpenWrt\ v21.02.0],
[Linksys\ Router], [Latest OEM], [OpenWrt\ v21.02.2], [OpenWrt\ v21.02.0]
),
caption: [Machines used for the experiments and their modifications.],
)<tab-machines>
This experiment aims at simulating an attack situation by performing firmware modifications on the target devices and recording the boot-up power trace data for each version.
For the switches, we flash different firmware versions provided by the @OEM.
For wireless routers, their firmware is changed from the @OEM to different versions of #link("https://openwrt.org/")[OpenWrt].
In this study, we consider the latest @OEM firmware version to be the nominal version, expected to be installed on the machine by default.
Any other version or firmware represents an attack and is considered anomalous.
== Experiment procedure
To account for randomness and gather representative boot-up sequences of the device, we performed 500 boot iterations for each machine.
This cannot reasonably be performed manually with consistency.
Therefore, an automation script controls the @PDU with precise timings to perform the boots without human intervention; a minimal sketch of this loop follows the template below.
The exact experimental procedure followed for each target has minor variations depending on the target's boot-up requirements and timings.
Overall, they all follow the same template:
+ Turn ON the power to the machine.
+ Wait for a predetermined time for the target to boot up completely.
+ Turn OFF the power to the machine and wait for a few seconds to ensure proper shutdown of the machine.
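
The following illustrative loop automates this template, assuming a @PDU that accepts single-line commands over telnet; the host address, outlet commands, and timings are hypothetical placeholders for the vendor-specific dialect.

```python
# Illustrative automation loop for the boot/shutdown procedure. The PDU
# address, commands, and timings are hypothetical placeholders; real managed
# PDUs each use a vendor-specific telnet dialect.
import socket
import time

def pdu_send(host: str, command: str, port: int = 23) -> None:
    """Send one command line to the PDU's telnet interface."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall((command + "\r\n").encode())

for _ in range(500):                        # 500 boot iterations per machine
    pdu_send("10.0.0.5", "outlet 1 on")     # 1. turn ON the power
    time.sleep(60)                          # 2. wait for a complete boot-up
    pdu_send("10.0.0.5", "outlet 1 off")    # 3. turn OFF the power ...
    time.sleep(5)                           #    ... and let the machine shut down
```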
== Results<results>
We obtain the results per machine and per model.
The training dataset is generated by injecting artificial anomalies, but the evaluation is performed on actual anomalous traces collected in a controlled environment.
For each evaluation, a random set of $10$ consecutive traces is selected from the NORMAL label to serve as the seed for the anomaly generation.
The anomaly generator returns a training dataset composed of normal traces on one side and anomalous artificial traces on the other.
The models train on this dataset and are evaluated against a balanced dataset combining $M in [20,50]$ consecutive anomalous traces selected at random across all abnormal classes and as many nominal traces.
The testing set is balanced between nominal and abnormal traces.
The training requires only a few nominal traces.
This evaluation is repeated $50$ times, and the $F_1$ score is computed for each iteration.
The final score is the average of these $F_1$ scores.
The results are presented in @tab-results.
#figure(
table(
columns: 2,
[*Machine*], [*BPV*],
[TP-Link switch], [0.866],
[HP switch], [0.983],
[Asus router], [1],
[Linksys router], [0.921]
),
caption: [Average $F_1$ score of the detection for each machine.]
)<tab-results>
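
For concreteness, the evaluation protocol can be sketched as follows, reusing `train_bpv` and `is_anomalous` from the earlier sketch; the placeholder arrays stand in for captured traces, and the index handling is illustrative:

```python
# Sketch of the repeated evaluation protocol. Placeholder arrays stand in for
# captured traces; train_bpv/is_anomalous come from the earlier sketch.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
nominal = rng.normal(1.0, 0.05, (200, 1000))     # placeholder nominal traces
anomalous = rng.normal(1.3, 0.05, (100, 1000))   # placeholder anomalous traces

scores = []
for _ in range(50):                                      # 50 repetitions
    s = int(rng.integers(0, len(nominal) - 10))
    reference, threshold = train_bpv(nominal[s:s + 10])  # 10 consecutive seeds
    m = int(rng.integers(20, 51))                        # M in [20, 50]
    a = int(rng.integers(0, len(anomalous) - m))
    n = int(rng.integers(0, len(nominal) - m))
    test = np.vstack([anomalous[a:a + m], nominal[n:n + m]])
    y_true = np.r_[np.ones(m, int), np.zeros(m, int)]    # 1 = anomalous
    y_pred = [int(is_anomalous(t, reference, threshold)) for t in test]
    scores.append(f1_score(y_true, y_pred))
print(np.mean(scores))                                   # final score: average F1
```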
Several hyper-parameters must be tuned to obtain the best performance.
First, the length of the trace considered is essential.
The trace needs to cover the whole boot-up sequence to be sure to detect any possible change.
It is better to avoid extending the trace far past the end of the firmware sequence, as the typical operation of the machine can produce noisy power consumption that interferes with the optimal placement of the threshold.
Second, the number of training traces can be optimized.
A minimum of four traces is required for the @IQR method based on quartiles.
A minimum of two traces is necessary for the @SVM Threshold method, as anomalous traces need to be generated based on the average and standard deviation of the normal dataset.
Collecting additional traces after these lower boundaries offers marginal performance improvements as the number of traces has little impact on the threshold placement of both models.
Moreover, collecting many boot-up sequences can be difficult to achieve in practice.
Finally, tuning the sampling rate is important to ensure the best performances.
A machine that boots in two seconds requires a higher sampling rate than a machine that boots in thirty seconds.
All these parameters are machine-specific and need manual tuning before deployment of the side-channel @IDS.
= Test Case 2: Drone<exp-drone>
In this case study, we demonstrate the potential of physics-based @IDS for drones.
Drones are not new, but their usage in both the consumer and professional sectors has increased significantly in recent years @droneincrease.
The core component of consumer-available drones is usually a microcontroller, also called a flight controller.
As with any other microcontroller, the flight controller of a drone and its main program (which we call the firmware in this paper) are subject to updates and attacks @8326960 @8433205.
Some of these attacks leverage firmware manipulations @8556480.
With custom firmware uploaded to a drone, many attack possibilities become accessible to the attacker, such as geofencing an area, recovering video feed, or damaging the drone.
Moreover, flight controllers are specialized devices that usually neither support the installation of third-party security software nor provide advanced security features such as cryptographic verification of the firmware.
With drone usage soaring and security solutions lacking, verifying drone firmware against anomalies becomes an important problem.
== Experimental Setup
The experimental setup for this case study is similar to the one presented in @exp-network.
The experiment focuses on the Spiri Mu drone #footnote[#link("https://spirirobotics.com/products/spiri-mu/")] flashed with the PX4 Drone Autopilot firmware #footnote[#link("https://px4.io/")].
The firmware for the flight controller consists of a microprocessor-specific bootloader, a second-stage bootloader common to all supported flight controllers, and the operating system.
The battery of the drone is disconnected to ensure reproducible results and replaced with a laboratory power supply.
The power consumption measurement device (see Section~@capture for more details) is attached in series with the main power cable, which supplies 11V @DC to the drone.
A controllable relay is placed in series with the main cable to enable scripted bootup and shutdown scenarios.
The experiment scenarios are:
- *Nominal*: The first two versions consisted of unmodified firmware provided by the PX4 project, the first one was a pre-compiled version, and the second one was locally compiled. Although both versions should be identical, some differences appeared in their consumption pattern and required the training of a dual-mode model.
- *Low Battery*: When the drone starts with a low battery level, its behaviour changes to signal the user of the issue. Any battery level below 11V is considered low. In this scenario, a nominal firmware is loaded, and the drone starts with 10V, triggering the low battery behaviour.
- *Malfunctioning Firmware*: Two malfunctioning firmware versions were compiled. The first introduces a _division-by-zero_ bug in the second stage bootloader. The second introduces the same bug but in the battery management module (in the OS part of the firmware). The second scenario should not introduce measurable anomalous patterns in the boot-up sequence as it only affects the OS stage.
#figure(
image("images/drone-overlaps.svg", width: 100%),
caption: [Overlap of bootup traces for different scenarios and their average. Green = Low Battery (8 traces + average), Purple = Battery Module Bug (8 traces + average), Orange = Bootloader Bug (6 traces + average).]
)
== Results
The experimental procedure consists of starting the drone flight controller multiple times while capturing the power consumption; each scenario is repeated between 40 and 100 times.
The procedure captures boot-up traces automatically for better reproducibility (see Section~@sds for more details).
@drone-results presents the results of the detection.
Both Original and Compiled represent nominal firmware versions.
#figure(
table(
columns: (40%,20%,40%),
[*Scenario*],[*Accuracy*], [*Nbr. of Samples*],
[Original],[1],[98],
[Compiled],[1],[49],
[Low Battery],[1],[44],
[Bootloader Bug],[1],[50],
[Battery Module Bug], [0.082],[39],
),
caption: [Results of the intrusion detection on the drone.]
)<drone-results>
Most scenarios introduce disturbances in the boot-up sequence power consumption, and the model correctly identifies them as anomalous.
One interesting scenario is the Battery Module Bug, which is mostly detected as nominal.
This result is expected, as the bug affects the operation of the firmware after the bootup sequence.
Hence, the power consumption in the first second of activity remains nominal.
It is interesting to note that the differences in power consumption patterns among the different firmware versions are visible immediately after the initial power spike.
This suggests that future work could achieve an even lower time-to-decision, likely as low as 200ms depending on the anomaly.
// = Test Case 3: Aggregated Power Measurements
// In some cases, capturing only the power consumption of the machine to protect is impossible.
// For example, if the power connections follow proprietary designs, or if the machine to protect is innaccessible (for practical or security reasons).
// In this case, the data available may be an aggregate of the consumption of multiple machines or components.
// This power global power consumption measurement is still suitable for boot process verification.
// This test case was conducted with an industry partner to protect a micro-pc running Windows 10.
// The available power consumption was an aggregate of two micro-pc, one being the machine to protect.
// The second machine remained idle for the duration of the experiment.
// @l3-setup illustrate the setup for the data capture.
// Although this setup can seem simplistic or ideal, it is a first approache to evaluating the applicability of @BPV in a more complexe environement.
// The data presented in this test case come from a real installation, not from a controlled laboratory environment.
// #figure(
// image("images/l3-setup.svg", width: 100%),
// caption: [Setup for BPV with an aggregated power measurement.]
// )<l3-setup>
// == Results
= Specific Case Study: @AIM <aim>
When training a model to detect outliers, examples of possible anomalies are often expected to be available.
In some cases, gathering anomalies can be difficult, costly, or impossible.
In the context of this study, it would be impractical to measure power consumption patterns for a wide range of firmware anomalies.
Such data collection would require modifying firmware parameters, suspending equipment usage, or infecting production machines with malicious firmware.
These modifications are impossible for production equipment and would still lead to an incomplete training dataset.
To circumvent this limitation, we propose a variation of the training process called @AIM.
@AIM leverages the specificity of distance-based detectors.
Distance-based detectors produce results based solely on the distance between two traces and a learned threshold.
The threshold is chosen to separate normal and anomalous traces as well as possible.
The actual pattern of the traces is not important for this type of detector as only the aggregated distance of each sample matters.
This implies that a distance-based detector that relies on a distance threshold can be trained the same way with either real anomalous traces or with artificial traces that present the same distance to the reference.
The idea behind an @AIM is to leverage this property and generate artificial anomalous traces to form the training set.
The additional anomalous traces are generated using only normal traces, which circumvents the need for extensive data collection.
== Anomaly Generation
The generation of anomalies from normal traces is based on the modification of the boot-up pattern.
Data augmentation can leverage different time series modification methods to help a model generalize.
The kind of modification applied to a trace is highly dependent on the application and the model @zimmering2021generating and requires domain knowledge about the system.
In this case, we want to generate anomalous traces with patterns similar to actual anomalous traces from a machine.
The first step of this process is to extract domain knowledge from all the traces collected.
The types of modification an anomalous trace presents compared to a normal trace help us design anomaly generation functions that apply the same types of transformation to normal traces with varying parameters.
The goal is not to reproduce exact anomalous traces but to generate a wide variety of possible anomalous traces given a small set of normal traces.
#figure(
image("images/Bootup_traces_TPLINK.svg", width: 100%),
caption: [
Example of TP-Link switch boot-up traces for different firmware versions. The anomalous firmware (FIRMWARE V2) presents both a $y$ and an $x$ shift.
],
)<fig-boot-up_traces_TPLINK>
@fig-boot-up_traces_TPLINK illustrates the domain knowledge extracted from this machine.
The anomalies that the power traces exhibit are a combination of transformation types.
- The trace is shifted along the $y$ axis. In this case, the anomalous firmware consumes significantly more or less power than the normal one. This shift can affect the whole trace or only a part of it. This can be the result of different usage of the machine's components or a significant change in the firmware instructions.
- The trace is delayed or in advance along the $x$ axis. The anomalous trace presents the same patterns and amplitude as the normal trace but at different points in time. This shift can occur when parts of the firmware are added or removed by updates.
The anomaly generation function combines the domain knowledge observations and applies transformations to generate examples of anomalous traces from normal traces.
The possible transformations are:
- Shifting the time domain. The direction of the shift can be forward (introducing a delay) or backward (removing a delay). The parameters of the shift are the amplitude and the start time. Both parameters are randomly selected for each new trace. The boundaries of these values do not include very large shifts as these would not contribute to the threshold placement for the models selected. The missing parts of the trace after shifting are recreated based on the average and standard deviation value of the previous 0.5s, assuming a Gaussian noise.
- Shifting the $y$ axis. The direction of the shift can be upward (more energy consumed) or downward (less energy consumed). The amplitude is chosen between $4$ and $5$ times the standard deviation for each sample. These values ensure not creating an anomalous trace that conflicts with the normal traces and removing any shift too large that would not contribute to the threshold placement. The start time is chosen randomly in the trace.
- Shifting both the $x$ and $y$ axes. Anomalous traces always present an $x$ shift, a $y$ shift, or both. A sketch of these generation functions follows below.
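
The sketch below illustrates the two transformations; the helper names, default sampling rate, and the fill strategy for the backward shift are illustrative simplifications of the description above.

```python
# Illustrative sketch of the two anomaly-generation transformations (helper
# names and bounds are illustrative; per-machine ranges apply in practice).
import numpy as np

rng = np.random.default_rng()

def x_shift(trace: np.ndarray, rate: int = 10_000) -> np.ndarray:
    """Delay (forward) or advance (backward) part of the trace, filling the
    gap with Gaussian noise fitted on the 0.5s preceding the shift point."""
    start = int(rng.integers(rate, len(trace) - rate))        # random start time
    amp = int(rng.integers(1, rate // 2)) * int(rng.choice([-1, 1]))
    window = trace[start - rate // 2:start]                   # previous 0.5s
    fill = rng.normal(window.mean(), window.std(), abs(amp))
    out = trace.copy()
    if amp > 0:                                               # introduce a delay
        out[start + amp:] = trace[start:-amp]
        out[start:start + amp] = fill
    else:                                                     # remove a delay
        out[start:amp] = trace[start - amp:]
        out[amp:] = fill
    return out

def y_shift(trace: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Raise or lower the trace after a random start time by 4 to 5 times the
    per-sample standard deviation `sigma` (same length as the trace)."""
    start = int(rng.integers(0, len(trace)))
    offset = rng.uniform(4, 5) * sigma * rng.choice([-1, 1])
    out = trace.copy()
    out[start:] += offset[start:]
    return out
```

In use, `sigma` would be the per-sample standard deviation of the seed normal traces (for example `nominal.std(axis=0)`), and each generated anomaly applies one or both transformations.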
@fig-overview presents an overview of the model's data flow.
#figure(
image("images/schematic.svg", width: 100%),
caption: [Overview of the @BPV model training and evaluation.],
)<fig-overview>
The resulting dataset does not exactly resemble the anomalous traces that are collected but presents traces with the same range of distance to normal traces (see @fig-Synthetic_vs_Normal_TPLINK).
To avoid introducing training biases, the dataset is balanced by generating new normal traces using the average and standard deviation if required.
#figure(
image("images/Synthetic_vs_Normal_TPLINK.svg", width: 100%),
caption: [Example of generated synthetic anomalous traces vs normal traces for TP-Link switch.],
)<fig-Synthetic_vs_Normal_TPLINK>
== Results
A benchmarking algorithm evaluates the performance of @AIM against the performance of the original @BPV trained with only normal traces.
@AIM places the threshold to maximize the margins to the closest normal and abnormal distances, in the same way a 1D-@SVM would.
This is a natural extension of the @BPV when abnormal samples are available.
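
With both distance populations available, this max-margin placement reduces to a midpoint computation, as in this one-function sketch (assuming the two distance sets are separable):

```python
# 1D max-margin threshold placement used when artificial anomalous distances
# are available (sketch; assumes the two distance sets are separable).
import numpy as np

def aim_threshold(nominal_dist: np.ndarray, anomalous_dist: np.ndarray) -> float:
    """Midpoint between the largest nominal distance and the smallest
    anomalous distance, maximizing the margin to both classes."""
    return float(nominal_dist.max() + anomalous_dist.min()) / 2
```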
Two main parameters are important to tune for the @AIM.
First, the range for the length of the x shift, and especially its lower bound, has an important influence on the generated anomalies.
A small lower bound allows for the generation of anomalous traces that closely resemble the nominal traces, which can result in a sub-optimal threshold placement.
Second, the range parameter for the y-shift affects the results in the same way.
The values for these parameters are chosen as part of the domain knowledge extraction, and they affect the transferability of the model (see @aim-conclusion).
The performance is evaluated on the same dataset as the initial @BPV evaluation (see~@exp-network).
The performance metric is the $F_1$ score.
The final performance measure is the average $F_1$ score (and its standard deviation) over 30 independent runs.
Each run selects five random normal traces as the seed for the dataset generation.
The training dataset is composed of 100 training traces, and 100 traces are used for evaluation.
The results are presented in @tab-aim.
#figure(
table(
columns:(33%,33%,33%),
[*Machine*], [*BPV*], [*AIM*],
[HP-SWITCH],[$0.895 plus.minus 0.094$],[$0.657 plus.minus 0.394$],
[TPLINK-SWITCH], [$0.9 plus.minus 0.084$],[$0.985 plus.minus 0.035$],
[WAP-ASUS], [$1.0 plus.minus 0.0$],[$0.987 plus.minus 0.041$],
[WAP-LINKSYS],[$0.882 plus.minus 0.099$],[$0.867 plus.minus 0.098$],
),
caption: [Performance of the @AIM model compared with the original @BPV model (average $F_1$ score #sym.plus.minus standard deviation).]
)<tab-aim>
== Conclusion on the @AIM Model<aim-conclusion>
The @AIM model produces mixed results.
The model was tuned for the TPLINK-SWITCH machine and produced significantly better results for this machine.
However, the results did not transfer well to the other machines.
Experiments reveal that the parameter values that produce the best results can differ significantly from one machine to another, even for the same type of machine.
The idea of introducing artificial anomalous examples in the training dataset is valid and can indeed enable the creation of a better model.
This artificial augmentation of the training set is especially interesting in the context of rare events where creating an extensive dataset is expensive.
However, the lack of transferability of the proposed method indicates that further work is required to evolve @AIM into a solution that is consistently better than the @BPV.
= Discussion<discussion>
This section elaborates on some important aspects of this study.
== Capture Process<capture>
We use a hardware device referred to as the capture box @hidden, placed in series with the primary power cable of the target device.
The technology for measuring the current differs depending on the capture box's version.
For test cases 1 and 3, the box's shunt resistor generates a voltage drop representative of the global power consumption of the machine.
For test case 2, a Hall-effect sensor returns a voltage proportional to the current.
For both versions, the voltage is sampled at 10 kSPS.
These samples are packaged in small fixed-size chunks and sent to a data aggregation server on a private @VLAN.
The data aggregation server is responsible for gathering data from all of our capture boxes and sending it via a @VPN tunnel to a storage server.
Each file on the server contains 10 seconds of power consumption data.
== Extraction of Synchronized Bootup Traces<sds>
A threshold-based algorithm extracts the boot-up sequences from the complete trace.
The extraction is not performed manually because of the large number of samples and to ensure consistent detection of the boot-up pattern and precise alignment of the extracted sequences.
Because the boot-up sequence usually begins with a sharp increase in power consumption from the device, the algorithm leverages this rising edge to detect the start time accurately.
Two parameters control the extraction.
$T$ is the consumption threshold, and $L$ is the length of the boot-up sequence.
To extract all the boot-up sequences in a power trace, the algorithm evaluates consecutive samples against $T$.
If sample $s_(i-1) < T$ and $s_i > T$, then $s_i$ is the first sample of a boot-up sequence, and the next $L$ samples are extracted.
The power trace is first resampled to 50ms bins using a median aggregating function to avoid incorrect detections.
This pre-processing removes most of the impulse noise that could falsely trigger the detection method.
The final step of the detection is to store all the boot sequences under the same label for evaluation.
The complete dataset corresponding to this experiment is available online @dataset.
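
A compact sketch of this extraction follows; detection runs on the median-resampled trace, and the window is cut from the raw samples (the bin-to-sample mapping is an illustrative simplification):

```python
# Sketch of the threshold-based boot-up extraction (rate and bin size follow
# the text; the bin-to-sample mapping is an illustrative simplification).
import numpy as np

def extract_bootups(trace: np.ndarray, T: float, L: int,
                    rate: int = 10_000, bin_s: float = 0.05) -> list:
    """Return every L-sample window that starts at a rising edge through T."""
    bin_len = int(rate * bin_s)                       # 50 ms of raw samples
    n_bins = len(trace) // bin_len
    smoothed = np.median(trace[:n_bins * bin_len].reshape(n_bins, bin_len), axis=1)
    sequences = []
    for i in range(1, n_bins):
        if smoothed[i - 1] < T <= smoothed[i]:        # rising edge: s_(i-1) < T, s_i > T
            start = i * bin_len                       # map the bin back to raw samples
            if start + L <= len(trace):
                sequences.append(trace[start:start + L])
    return sequences
```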
== Support for Online Training
For the @BPV to integrate into a realistic environment, the training procedure takes the rarity of the boot-up event into account.
Once the measurement device is set up on the machine to protect, the streaming time series representing the power consumption serves as input for the bootup detection algorithm (see Section~@sds).
Each bootup event is extracted and added to a dataset of bootup traces.
Once the dataset reaches the expected number of samples, the @BPV computes the threshold and is ready for validation of the next bootup.
The complete training and validation procedures require no human interaction.
In the case of a multi-modal model, the training procedure requires one human interaction.
Presented with the bootup samples, an operator can transform the model into a multi-modal model by separating the training samples into multiple modes.
Once the separation is performed, the training procedure resumes without interaction, and subsequent bootup samples are assigned to the closest mode.
Thanks to its low complexity and support for multiple modes, the @BPV can adapt during training to changes in the training data and supports switching between single and multiple modes.
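
This flow can be summarized in a few lines, reusing `train_bpv` and `is_anomalous` from the earlier sketch; the class and method names are illustrative:

```python
# Sketch of the online training flow: accumulate boot-ups until the dataset
# is complete, then switch to validation (reuses the earlier sketches).
import numpy as np

class OnlineBPV:
    def __init__(self, n_train: int = 10):
        self.n_train = n_train    # boot-ups to accumulate before training
        self.traces = []          # training dataset built from live boot-ups
        self.model = None         # (reference, threshold) once trained

    def on_bootup(self, trace: np.ndarray):
        """Consume one extracted boot-up trace from the power stream."""
        if self.model is None:                        # still collecting
            self.traces.append(trace)
            if len(self.traces) == self.n_train:      # dataset complete: train
                self.model = train_bpv(np.vstack(self.traces))
            return None                               # no verdict yet
        return is_anomalous(trace, *self.model)       # validate the new boot-up
```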
= Conclusion<conclusion>
This study illustrates the applicability of side-channel analysis to detect firmware attacks.
The proposed side-channel-based @IDS can detect firmware tampering from the power consumption trace.
Moreover, the distance-based models leveraged in this study keep training data and training time requirements minimal.
On a per-machine basis, anomaly generation can enhance the training set without additional anomalous data capture.
Finally, deploying this technology to production networking equipment requires minimal downtime and hardware intrusion, and it is applicable to clientless equipment.
This study illustrates the potential of independent, side-channel-based @IDS for the detection of low-level attacks that can compromise machines even before the operating system is loaded.