spring cleanup

This commit is contained in:
grizzly 2025-06-08 18:47:32 -04:00
parent f5712a3a73
commit 848f8cb57d
95 changed files with 46734 additions and 2311 deletions

/*#set text(font: "Essays1743")*/
= Redundancy Check
To cut through the noise in the logs, I am bringing back from retirement the log verification algorithm.
This algorithm applies hierarchical clustering to the logs using a custom distance metric.
The main thing to remember is that the only parameter is the ~distance threshold~, expressed in seconds.
This threshold is the maximum average time two logs can be apart and still be considered redundant.
A value around 1s should be a pretty good value for this type of data.
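The threshold-cut clustering described above can be sketched with scipy's hierarchical clustering tools. This is a minimal illustration, not the actual algorithm: the pairwise distances below are made up, and the real custom distance metric is not shown.

```python
# Sketch: cluster logs from a precomputed pairwise distance matrix
# (average time gap in seconds between two logs), cutting the dendrogram
# at a fixed distance threshold. The gap values here are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical average-gap matrix for 4 logs (seconds); symmetric, zero diagonal.
gaps = np.array([
    [0.0, 0.4, 5.0, 6.0],
    [0.4, 0.0, 5.5, 6.2],
    [5.0, 5.5, 0.0, 0.8],
    [6.0, 6.2, 0.8, 0.0],
])

Z = linkage(squareform(gaps), method="average")
# The only parameter: the distance threshold, here 1 s.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)  # logs 0-1 and logs 2-3 form two separate clusters
```

Everything closer than the threshold (on average) merges into one cluster; everything farther apart stays separate.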
Quick stats: as of April 16, the machine b-06 generated 9002 log entries across 170 unique logs.
This number is misleading because the uniqueness of the logs is determined by the path of the application that starts or stops.
However, this path changes for two main reasons:
- When the user changes, the same application starts from a different user directory. It is not clear whether the same application running from two different user directories should be considered different applications or not.
- Sometimes, an application's name changes capitalization. For example, `C:\\[...]\BackgroundTaskHost.exe` also exists as `C:\\[...]\backgroundTaskHost.exe`, or even simply as `backgroundTaskHost.exe` without an absolute path. This makes it difficult to merge the different versions of the same app back together.
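One plausible way to collapse these variants is to canonicalize the path before counting unique logs. A minimal sketch (the example path is hypothetical, and merging across user directories is the open question raised above, so this sketch simply merges everything):

```python
def canonical_app(path: str) -> str:
    """Reduce an application path to a lowercase file name.

    "C:\\Some\\Dir\\BackgroundTaskHost.exe", "backgroundTaskHost.exe", and the
    same binary under another user directory all collapse to one key.
    Whether merging across user directories is actually desired is an open
    question; this sketch merges them.
    """
    norm = path.replace("\\", "/").lower()  # paths are case-insensitive on Windows
    return norm.rsplit("/", 1)[-1]          # keep only the file name

print(canonical_app(r"C:\Some\Dir\BackgroundTaskHost.exe"))  # backgroundtaskhost.exe
```

If user directories should instead be kept distinct, the function would preserve the directory part and only normalize case.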
#figure(caption: "Initial redundancy graph with threshold 2s",
image("graph_init_2.svg", width:100%)
)<graph_init>
The initial exploration does not try to clean the logs.
The log selection algorithm runs with a distance threshold of 2 seconds and produces the result in @graph_init.
There are some good results in this graph.
Most clusters seem to make sense. For example, cluster Z groups maple-related apps, cluster L groups chrome-related apps, and cluster Q groups things I don't know about but that have similar names.
However, we can identify a number of issues, mainly:
- Some app names appear twice or more. This is likely due to inconsistencies in the path or user sessions.
- The clusters do not take into account the type of log --- starting or stopping the app.
The initial idea would be to clean the logs to get a better or more meaningful clustering.
However, I have made this mistake too many times, so I first need to assess whether the clustering is good or not.
No need to leak info into the data if the clustering is good.
If it is not, I need a quantitative way of evaluating if I make it better or not.
How do you evaluate if a clustering is good? No idea.
Let's start by visualizing what these clusters look like as a time series.
= Lag
If only this were lab data, we would have made sure the timings were synced up before the experiment.
However, this is the real world and in the real world the university does not consult me when setting up their log collection system.
As a result, the logs seem to lag behind the power traces (or the other way around, but let's say it's them).
Given that we are considering patterns of logs on the order of < 1s, any lag of more than one second is game-breaking.
Visually, we can estimate that the lag is around 1h (so the game is so broken there is no game anymore).
We can also hypothesize that, for each machine, the lag is constant in time (but may not be constant across machines).
#figure(caption: "Example of log vs power lag",
image("example_logs_lag.svg", width:100%)
)<graph_lag>
To estimate a constant lag, the cross-correlation is an obvious choice.
The cross-correlation gives the correlation of two signals for all possible values of lag between them.
With the strong assumption that the density of logs should be exactly correlated with the power consumption, we can compute the cross-correlation between the log series and the power trace.
#align(center)[~Hold up a minute, the log series is not a time series, so it cannot be compared with the power consumption data trivially.~]
That is true. We first need to convert the log series into a time series to compute the cross-correlation.
There are many ways to do that so I started with the simplest:
1. Create an array of all zeros with the same shape as the power trace.
2. Place a one at each timestamp where a log happens; add them up if multiple logs share a timestamp.
3. Apply a moving average filter multiple times to smooth the data.
4. Compute the cross correlation between the power trace and the logs.
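The four steps above can be sketched in numpy. The signals here are synthetic with a known lag injected, and both series are assumed to share the same sampling grid; the smoothing kernel width and number of passes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_lag = 500, 5
power = rng.normal(0, 0.1, n)
power[100:110] += 1.0                      # activity burst in the power trace

# 1-2. array of zeros shaped like the power trace, ones at log timestamps
logs = np.zeros(n)
logs[np.arange(100, 110) + true_lag] += 1.0  # logs trail the power trace

# 3. repeated moving-average passes to smooth the log indicator
kernel = np.ones(5) / 5
for _ in range(3):
    logs = np.convolve(logs, kernel, mode="same")

# 4. full cross-correlation; the argmax position gives the lag estimate
xcorr = np.correlate(logs, power, mode="full")
lag = np.argmax(xcorr) - (n - 1)           # should land near true_lag
print(lag)
```

A positive estimate means the logs trail the power trace; since the hypothesis is one constant lag per machine, this estimate can then be subtracted from all of that machine's log timestamps.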
#figure(caption: "Lag computation using cross correlation.",
image("lag.svg", width:100%)
)<graph_xcorr>