spring cleanup

This commit is contained in:
grizzly 2025-06-08 18:47:32 -04:00
parent f5712a3a73
commit 848f8cb57d
95 changed files with 46734 additions and 2311 deletions

/*#set text(font: "Essays1743")*/
= Redundancy Check
To cut through the noise in the logs, I am bringing back from retirement the log verification algorithm.
This algorithm applies hierarchical clustering to the logs using a custom distance metric.
The main thing to remember is that the only parameter is the ~distance threshold~, expressed in seconds.
This threshold is the maximum average time two logs can be apart and still be considered redundant.
A value around 1s should be a pretty good value for this type of data.
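The threshold-cut clustering described above can be sketched with scipy's hierarchical clustering tools. This is a minimal illustration, not the actual algorithm: the pairwise distances below are made up, and the real custom distance metric is not shown.

```python
# Sketch: cluster logs from a precomputed pairwise distance matrix
# (average time gap in seconds between two logs), cutting the dendrogram
# at a fixed distance threshold. The gap values here are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical average-gap matrix for 4 logs (seconds); symmetric, zero diagonal.
gaps = np.array([
    [0.0, 0.4, 5.0, 6.0],
    [0.4, 0.0, 5.5, 6.2],
    [5.0, 5.5, 0.0, 0.8],
    [6.0, 6.2, 0.8, 0.0],
])

Z = linkage(squareform(gaps), method="average")
# The only parameter: the distance threshold, here 1 s.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)  # logs 0-1 and logs 2-3 form two separate clusters
```

Everything closer than the threshold (on average) merges into one cluster; everything farther apart stays separate.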
Quick stats: as of April 16, the machine b-06 generated 9002 log entries across 170 unique logs.
This number is misleading because the uniqueness of the logs is determined by the path of the application that starts or stops.
However, this path changes for two main reasons:
- When the user changes, the same application starts from a different user directory. It is not clear whether the same application running from two different user directories should be considered different applications or not.
- Sometimes, an application's name changes capitalization. For example, `C:\\[...]\BackgroundTaskHost.exe` also exists as `C:\\[...]\backgroundTaskHost.exe`, or even simply as `backgroundTaskHost.exe` without an absolute path. This makes it difficult to merge the different versions of the same app back together.
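One plausible way to collapse these variants is to canonicalize the path before counting unique logs. A minimal sketch (the example path is hypothetical, and merging across user directories is the open question raised above, so this sketch simply merges everything):

```python
def canonical_app(path: str) -> str:
    """Reduce an application path to a lowercase file name.

    "C:\\Some\\Dir\\BackgroundTaskHost.exe", "backgroundTaskHost.exe", and the
    same binary under another user directory all collapse to one key.
    Whether merging across user directories is actually desired is an open
    question; this sketch merges them.
    """
    norm = path.replace("\\", "/").lower()  # paths are case-insensitive on Windows
    return norm.rsplit("/", 1)[-1]          # keep only the file name

print(canonical_app(r"C:\Some\Dir\BackgroundTaskHost.exe"))  # backgroundtaskhost.exe
```

If user directories should instead be kept distinct, the function would preserve the directory part and only normalize case.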
#figure(caption: "Initial redundancy graph with threshold 2s",
image("graph_init_2.svg", width:100%)
)<graph_init>
The initial exploration does not try to clean the logs.
The log selection algorithm runs with a distance threshold of 2 seconds and produces the result in @graph_init.
There are some good results in this graph.
Most clusters seem to make sense. For example, cluster Z groups maple-related apps, cluster L groups chrome-related apps, and cluster Q groups things I don't know about but that have similar names.
However, we can identify a number of issues, mainly:
- Some app names appear twice or more. This is likely due to inconsistencies in the path or user sessions.
- The clusters do not take into account the type of log --- starting or stopping the app.
The initial idea would be to clean the logs to get a better or more meaningful clustering.
However, I have made this mistake too many times, so I first need to assess whether the clustering is good or not.
No need to leak info into the data if the clustering is good.
If it is not, I need a quantitative way of evaluating if I make it better or not.
How do you evaluate if a clustering is good? No idea.
Let's start by visualizing what these clusters look like as a time series.
= Lag
If only this were lab data, we would have made sure the timings were synced up before the experiment.
However, this is the real world and in the real world the university does not consult me when setting up their log collection system.
As a result, the logs seem to lag behind the power traces (or the other way around, but let's say it's them).
Given that we are considering patterns of logs on the order of < 1s, any lag of more than one second is game-breaking.
Visually, we can estimate that the lag is around 1h (so the game is so broken there is no game anymore).
We can also hypothesize that, for each machine, the lag is constant in time (but may not be constant across machines).
#figure(caption: "Example of log vs power lag",
image("example_logs_lag.svg", width:100%)
)<graph_lag>
To estimate a constant lag, the cross-correlation is an obvious choice.
The cross-correlation gives the correlation of two signals for all possible values of lag between them.
With the strong assumption that the density of logs should be exactly correlated with the power consumption, we can compute the cross-correlation between the log series and the power trace.
#align(center)[~Hold up a minute, the log series is not a time series, so it cannot be compared with the power consumption data trivially.~]
That is true. We first need to convert the log series into a time series to compute the cross-correlation.
There are many ways to do that so I started with the simplest:
1. Create an array of all zeros with the same shape as the power trace.
2. Place a one at each timestamp where a log happens; add them up if multiple logs share a timestamp.
3. Apply a moving average filter multiple times to smooth the data.
4. Compute the cross correlation between the power trace and the logs.
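The four steps above can be sketched in numpy. The signals here are synthetic with a known lag injected, and both series are assumed to share the same sampling grid; the smoothing kernel width and number of passes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_lag = 500, 5
power = rng.normal(0, 0.1, n)
power[100:110] += 1.0                      # activity burst in the power trace

# 1-2. array of zeros shaped like the power trace, ones at log timestamps
logs = np.zeros(n)
logs[np.arange(100, 110) + true_lag] += 1.0  # logs trail the power trace

# 3. repeated moving-average passes to smooth the log indicator
kernel = np.ones(5) / 5
for _ in range(3):
    logs = np.convolve(logs, kernel, mode="same")

# 4. full cross-correlation; the argmax position gives the lag estimate
xcorr = np.correlate(logs, power, mode="full")
lag = np.argmax(xcorr) - (n - 1)           # should land near true_lag
print(lag)
```

A positive estimate means the logs trail the power trace; since the hypothesis is one constant lag per machine, this estimate can then be subtracted from all of that machine's log timestamps.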
#figure(caption: "Lag computation using cross correlation.",
image("lag.svg", width:100%)
)<graph_xcorr>