As you rightly caution, the correlation between any one metric and time-to-interactive tells us very little on its own. This is where "lies, damned lies, and statistics" comes from.
The fact that you've assembled the dataset and made it open source is the real news here.
The next step is to do a multiple regression, where more than one measurement is used to develop a predictive model. And of course the chosen metrics should have some underlying bearing on performance.
A multiple regression would tell us whether or not the takeaways truly have good empirical backing. Throw all the variables mentioned into the model:
1) number of requests,
2) number of kilobytes transferred,
3) resource HTTP protocol version,
4) async fetch versus render-blocking download.
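To make the suggestion concrete, here is a rough sketch of such a model in plain Python. Everything in it is an assumption for illustration: the data is synthetic (generated from made-up coefficients, not taken from your dataset), the protocol version is dummy-coded as HTTP/2 or not, async versus render-blocking is reduced to a count of render-blocking resources, and `fit_ols` is a hypothetical helper name.

```python
import random

def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, b1, b2, ...], one slope per predictor column.
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# SYNTHETIC data: these "true" effects are invented for the demo only.
# Order: intercept, requests, kilobytes, is_http2, render-blocking count
TRUE_COEFFS = [1.0, 0.02, 0.001, -0.5, 0.3]

rng = random.Random(0)  # fixed seed so the run is repeatable
rows, tti = [], []
for _ in range(500):
    requests = rng.randint(10, 200)
    kilobytes = rng.uniform(100, 5000)
    is_http2 = rng.randint(0, 1)   # dummy variable: 1 = HTTP/2, 0 = older
    blocking = rng.randint(0, 10)  # number of render-blocking resources
    row = [requests, kilobytes, is_http2, blocking]
    noise = rng.gauss(0, 0.1)
    tti.append(TRUE_COEFFS[0]
               + sum(c * v for c, v in zip(TRUE_COEFFS[1:], row))
               + noise)
    rows.append(row)

coefficients = fit_ols(rows, tti)
names = ["intercept", "requests", "kilobytes", "is_http2", "blocking"]
for name, est in zip(names, coefficients):
    print(f"{name:>10}: {est:.4f}")
```

Each fitted coefficient estimates the effect of one variable with the others held fixed, which is exactly what the single-variable plots cannot show. On the real dataset a statistics package would be the better tool, since it would also report standard errors and significance for each term.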
As it stands now, all I can surmise from the simple linear regressions shown is that old websites load fewer resources.