spring cleanup
Commit 848f8cb57d (parent f5712a3a73): 95 changed files with 46734 additions and 2311 deletions
New files (diffs suppressed because they are too large):
- procver/lv/notes_logs/example_logs_lag.svg (35079 lines, 987 KiB)
- procver/lv/notes_logs/graph_init_2.svg (1519 lines, 77 KiB)
- procver/lv/notes_logs/lag.svg (10070 lines, 249 KiB)
procver/lv/notes_logs/notes.typ (new file, 66 lines):
/*#set text(font: "Essays1743")*/
= Redundancy Check

To cut through the noise in the logs, I am bringing back from retirement the log verification algorithm.
This algorithm applies hierarchical clustering to the logs using a custom distance metric.
The main thing to remember is that the only parameter is the _distance threshold_, expressed as a number of seconds.
This threshold represents the maximum average time two logs can be apart and still be considered redundant.
A value around 1s should be a good starting point for this type of data.
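
For my own reference, a minimal sketch of the mechanism (the distances below are made up, not the real custom metric): average-linkage hierarchical clustering with a flat cut at the threshold.

```python
# Sketch: hierarchical clustering cut at a 1 s distance threshold.
# The 4x4 distance matrix is a toy stand-in for the custom metric.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy pairwise distances (seconds) between 4 logs; symmetric, zero diagonal.
D = np.array([
    [0.0, 0.4, 5.0, 6.0],
    [0.4, 0.0, 5.2, 6.1],
    [5.0, 5.2, 0.0, 0.3],
    [6.0, 6.1, 0.3, 0.0],
])

# Average linkage, so the threshold reads as "maximum average time apart".
Z = linkage(squareform(D), method="average")
clusters = fcluster(Z, t=1.0, criterion="distance")  # 1 s threshold
print(clusters)  # logs 0-1 and logs 2-3 fall into two separate clusters
```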

Quick stats: as of April 16, the machine b-06 generated 9002 log entries across 170 unique logs.
This number is misleading because the uniqueness of a log is determined by the path of the application that starts or stops.
However, this path changes for two main reasons:

- When the user changes, the same application starts from a different user directory. It is not clear whether the same application running from two different user directories should be considered two different applications or not.
- Sometimes, an application will have a change of capitalization in its name. For example, `C:\\[...]\BackgroundTaskHost.exe` also exists as `C:\\[...]\backgroundTaskHost.exe` or even simply `backgroundTaskHost.exe` without an absolute path. This makes it difficult to merge the different versions of the same app back together.
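
One possible normalization, sketched below with hypothetical paths: compare only the case-folded file name. Note that this also merges the per-user variants, which the first point above leaves as an open question.

```python
# Sketch: collapse capitalization and path variants of the same app name.
# The paths below are hypothetical examples, not real log entries.
from pathlib import PureWindowsPath

def normalize_app(path: str) -> str:
    # Keep only the file name and ignore capitalization and user directories.
    return PureWindowsPath(path).name.lower()

variants = [
    r"C:\Users\alice\BackgroundTaskHost.exe",
    r"C:\Users\bob\backgroundTaskHost.exe",
    "backgroundTaskHost.exe",
]
print({normalize_app(p) for p in variants})  # collapses to a single name
```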

#figure(caption: "Initial redundancy graph with threshold 2s",
image("graph_init_2.svg", width:100%)
)<graph_init>

The initial exploration does not try to clean the logs.
The log selection algorithm runs with a distance threshold of 2 seconds and produces the result in @graph_init.
There are some good results in this graph.
Most clusters seem to make sense. For example, cluster Z groups together maple-related apps, cluster L groups chrome-related apps, and cluster Q groups things I don't know about but that have similar names.
However, we can identify a number of issues, the main ones being:

- Some app names appear twice or more. This is likely due to inconsistencies in the path or user sessions.
- The clusters do not take into account the type of log --- starting or stopping the app.

The initial idea would be to clean the logs to get a better or tighter clustering.
However, I have made this mistake too many times, so I first need to assess whether the clustering is good or not.
No need to leak info into the data if the clustering is already good.
If it is not, I need a quantitative way of evaluating whether I am making it better.
How do you evaluate if a clustering is good? No idea.
Let's start by visualizing what these clusters look like as a time series.

= Lag

If only this were lab data, we would have made sure the timings were synced up before the experiment.
However, this is the real world, and in the real world the university does not consult me when setting up its log collection system.
As a result, the logs seem to lag behind the power traces (or the other way around, but let's say it's them).
Given that we are considering patterns of logs on the order of < 1s, any lag of more than one second is game-breaking.
Visually, we can estimate that the lag is around 1h (the game is so broken there is no game anymore).
We can also hypothesize that, for each machine, the lag is constant in time (but may not be constant across machines).

#figure(caption: "Example of log vs power lag",
image("example_logs_lag.svg", width:100%)
)<logs_lag>

To estimate a constant lag, the cross-correlation is an obvious choice.
The cross-correlation gives the correlation of two signals for all possible values of the lag between them.
With the strong assumption that the density of logs should be exactly correlated with the power consumption, we can compute the cross-correlation between the log series and the power trace.
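
Concretely, writing $p$ for the power trace and $l$ for the log time series built below, the quantity in question and the lag estimate are:

$ c[k] = sum_n p[n] dot l[n + k], quad hat(k) = "argmax"_k c[k] $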

#align(center)[_Hold up a minute: the log series is not a time series, so it cannot be trivially compared with the power consumption data._]

That is true. We first need to convert the log series into a time series to compute the cross-correlation.
There are many ways to do that, so I started with the simplest:
1. Create an array of all zeros with the same shape as the power trace.
2. Place a one at each timestamp where a log happens; add them up if multiple logs share a timestamp.
3. Apply a moving average filter multiple times to smooth the data.
4. Compute the cross-correlation between the power trace and the log signal.
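
The four steps above, run end to end on synthetic data (everything here, including the 600 s lag and all constants, is made up for illustration; the real traces are not reproduced):

```python
# Sketch of steps 1-4 on a synthetic power trace and synthetic log times.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000                  # samples, one per second (assumption)
true_lag = 600             # toy lag: logs arrive 600 s late

# Toy power trace: noise plus spikes at 200 random event times.
power = rng.normal(size=n)
events = rng.choice(n - true_lag, size=200, replace=False)
power[events] += 20.0

# Steps 1-2: zeros array with a one at every (lagged) log timestamp.
logs = np.zeros(n)
logs[events + true_lag] = 1.0

# Step 3: smooth the log signal with repeated moving averages.
kernel = np.ones(5) / 5
for _ in range(3):
    logs = np.convolve(logs, kernel, mode="same")

# Step 4: full cross-correlation; the peak offset estimates the lag.
xcorr = np.correlate(power - power.mean(), logs - logs.mean(), mode="full")
lag = (n - 1) - np.argmax(xcorr)  # positive: logs trail the power trace
print(lag)  # close to the 600 s toy lag
```

Under this sign convention a positive lag means the logs trail the power trace; worth re-checking the sign on the real data.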

#figure(caption: "Lag computation using cross-correlation.",
image("lag.svg", width:100%)
)<lag_xcorr>