spring cleanup
Commit 848f8cb57d (parent f5712a3a73): 95 changed files with 46734 additions and 2311 deletions
New files (diffs suppressed because they are too large):
- procver/lv/notes_logs/example_logs_lag.svg (35079 lines, 987 KiB)
- procver/lv/notes_logs/graph_init_2.svg (1519 lines, 77 KiB)
- procver/lv/notes_logs/lag.svg (10070 lines, 249 KiB)
procver/lv/notes_logs/notes.typ (new file, 66 lines):
/*#set text(font: "Essays1743")*/
= Redundancy Check

To cut through the noise in the logs, I am bringing back from retirement the log verification algorithm.
This algorithm applies hierarchical clustering to the logs using a custom distance metric.
The main thing to remember is that the only parameter is the _distance threshold_, expressed as a number of seconds.
This threshold represents the maximum average time two logs can be apart and still be considered redundant.
A value around 1s should be a good starting point for this type of data.
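
For my own reference, a minimal sketch of the mechanism (the distances below are made up, not the real custom metric): average-linkage hierarchical clustering with a flat cut at the threshold.

```python
# Sketch: hierarchical clustering cut at a 1 s distance threshold.
# The 4x4 distance matrix is a toy stand-in for the custom metric.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy pairwise distances (seconds) between 4 logs; symmetric, zero diagonal.
D = np.array([
    [0.0, 0.4, 5.0, 6.0],
    [0.4, 0.0, 5.2, 6.1],
    [5.0, 5.2, 0.0, 0.3],
    [6.0, 6.1, 0.3, 0.0],
])

# Average linkage, so the threshold reads as "maximum average time apart".
Z = linkage(squareform(D), method="average")
clusters = fcluster(Z, t=1.0, criterion="distance")  # 1 s threshold
print(clusters)  # logs 0-1 and logs 2-3 fall into two separate clusters
```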

Quick stats: as of April 16, the machine b-06 generated 9002 log entries across 170 unique logs.
This number is misleading because the uniqueness of a log is determined by the path of the application that starts or stops.
However, this path changes for two main reasons:

- When the user changes, the same application starts from a different user directory. It is not clear whether the same application running from two different user directories should be considered two different applications or not.
- Sometimes, an application will have a change of capitalization in its name. For example, `C:\\[...]\BackgroundTaskHost.exe` also exists as `C:\\[...]\backgroundTaskHost.exe` or even simply `backgroundTaskHost.exe` without an absolute path. This makes it difficult to merge the different versions of the same app back together.
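
One possible normalization, sketched below with hypothetical paths: compare only the case-folded file name. Note that this also merges the per-user variants, which the first point above leaves as an open question.

```python
# Sketch: collapse capitalization and path variants of the same app name.
# The paths below are hypothetical examples, not real log entries.
from pathlib import PureWindowsPath

def normalize_app(path: str) -> str:
    # Keep only the file name and ignore capitalization and user directories.
    return PureWindowsPath(path).name.lower()

variants = [
    r"C:\Users\alice\BackgroundTaskHost.exe",
    r"C:\Users\bob\backgroundTaskHost.exe",
    "backgroundTaskHost.exe",
]
print({normalize_app(p) for p in variants})  # collapses to a single name
```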

#figure(caption: "Initial redundancy graph with threshold 2s",
image("graph_init_2.svg", width:100%)
)<graph_init>

The initial exploration does not try to clean the logs.
The log selection algorithm runs with a distance threshold of 2 seconds and produces the result in @graph_init.
There are some good results in this graph.
Most clusters seem to make sense. For example, cluster Z groups together maple-related apps, cluster L groups chrome-related apps, and cluster Q groups things I don't know about but that have similar names.
However, we can identify a number of issues, the main ones being:

- Some app names appear twice or more. This is likely due to inconsistencies in the path or user sessions.
- The clusters do not take into account the type of log --- starting or stopping the app.

The initial idea would be to clean the logs to get a better or tighter clustering.
However, I have made this mistake too many times, so I first need to assess whether the clustering is good or not.
No need to leak info into the data if the clustering is already good.
If it is not, I need a quantitative way of evaluating whether I am making it better.
How do you evaluate if a clustering is good? No idea.
Let's start by visualizing what these clusters look like as a time series.

= Lag

If only this were lab data, we would have made sure the timings were synced up before the experiment.
However, this is the real world, and in the real world the university does not consult me when setting up its log collection system.
As a result, the logs seem to lag behind the power traces (or the other way around, but let's say it's them).
Given that we are considering patterns of logs on the order of < 1s, any lag of more than one second is game-breaking.
Visually, we can estimate that the lag is around 1h (the game is so broken there is no game anymore).
We can also hypothesize that, for each machine, the lag is constant in time (but may not be constant across machines).

#figure(caption: "Example of log vs power lag",
image("example_logs_lag.svg", width:100%)
)<logs_lag>

To estimate a constant lag, the cross-correlation is an obvious choice.
The cross-correlation gives the correlation of two signals for all possible values of the lag between them.
With the strong assumption that the density of logs should be exactly correlated with the power consumption, we can compute the cross-correlation between the log series and the power trace.
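
Concretely, writing $p$ for the power trace and $l$ for the log time series built below, the quantity in question and the lag estimate are:

$ c[k] = sum_n p[n] dot l[n + k], quad hat(k) = "argmax"_k c[k] $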

#align(center)[_Hold up a minute: the log series is not a time series, so it cannot be trivially compared with the power consumption data._]

That is true. We first need to convert the log series into a time series to compute the cross-correlation.
There are many ways to do that, so I started with the simplest:
1. Create an array of all zeros with the same shape as the power trace.
2. Place a one at each timestamp where a log happens; add them up if multiple logs share a timestamp.
3. Apply a moving average filter multiple times to smooth the data.
4. Compute the cross-correlation between the power trace and the log signal.
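
The four steps above, run end to end on synthetic data (everything here, including the 600 s lag and all constants, is made up for illustration; the real traces are not reproduced):

```python
# Sketch of steps 1-4 on a synthetic power trace and synthetic log times.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000                  # samples, one per second (assumption)
true_lag = 600             # toy lag: logs arrive 600 s late

# Toy power trace: noise plus spikes at 200 random event times.
power = rng.normal(size=n)
events = rng.choice(n - true_lag, size=200, replace=False)
power[events] += 20.0

# Steps 1-2: zeros array with a one at every (lagged) log timestamp.
logs = np.zeros(n)
logs[events + true_lag] = 1.0

# Step 3: smooth the log signal with repeated moving averages.
kernel = np.ones(5) / 5
for _ in range(3):
    logs = np.convolve(logs, kernel, mode="same")

# Step 4: full cross-correlation; the peak offset estimates the lag.
xcorr = np.correlate(power - power.mean(), logs - logs.mean(), mode="full")
lag = (n - 1) - np.argmax(xcorr)  # positive: logs trail the power trace
print(lag)  # close to the 600 s toy lag
```

Under this sign convention a positive lag means the logs trail the power trace; worth re-checking the sign on the real data.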

#figure(caption: "Lag computation using cross-correlation.",
image("lag.svg", width:100%)
)<lag_xcorr>