Hi Bruno, thanks for reading!
To answer your question, it helps to distinguish between predictor variables that are correlated with each other (known as multicollinearity) and data points that are correlated with each other.
When variables are correlated, two or more variables in your model measure similar things. For example, if I have one variable for participant weight and one for participant BMI, those variables will probably be highly correlated, and I shouldn’t include both in my model. However, this doesn’t raise a concern about the independence of your observations: measures of BMI and weight for one participant are independent of measures taken from another participant.
In your modeling process, you would check for multicollinearity by creating a correlation matrix, which shows the pairwise correlation between the variables you’re thinking of including.
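As a rough illustration of that check, here is a minimal sketch in plain Python. The data, the helper names (`pearson`, `correlation_matrix`, `flag_pairs`) and the 0.8 cutoff are all assumptions made up for this example; the cutoff is only a common rule of thumb, not a fixed standard. In practice a library routine such as pandas' `DataFrame.corr` would give you the same matrix.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(columns):
    """Pairwise correlations as a nested dict: matrix[a][b]."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def flag_pairs(matrix, threshold=0.8):
    """List each pair of variables whose |correlation| meets the threshold."""
    names = list(matrix)
    return [
        (a, b, matrix[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[a][b]) >= threshold
    ]


# Made-up candidate predictors, one value per participant.
predictors = {
    "weight_kg": [60, 72, 85, 95, 110, 68],
    "bmi": [21.0, 24.5, 27.8, 31.0, 35.2, 23.1],
    "age": [34, 51, 29, 62, 45, 38],
}

matrix = correlation_matrix(predictors)
for a, b, r in flag_pairs(matrix):
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```

With this toy data, only the weight and BMI pair is flagged, which is the situation where you would keep just one of the two.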
When data points are correlated, however, it means that you do not have independent observations, and the independence assumption of simple logistic and linear regression has been violated. In my examples, data points were correlated because they were taken from the same family over time or because they were taken from students who could be grouped into countries. When you do not have independent data points, simple logistic and linear regression aren’t appropriate, regardless of the variables included in the model.
As I described above, you would assess correlation among your data points through knowledge of your data collection process, and/or by calculating the intraclass correlation coefficient (ICC) and using the Durbin-Watson test for autocorrelation in the residuals.
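To make those two diagnostics a little more concrete, here is a small sketch in plain Python. It assumes balanced groups (the same number of measurements per family) and uses the one-way ANOVA form of the ICC, which is only one of several ICC definitions; the function names and the data are invented for illustration. The Durbin-Watson statistic is computed on regression residuals in time order, with values near 2 suggesting no first-order autocorrelation and values well below 2 suggesting positive autocorrelation.

```python
def icc_oneway(groups):
    """One-way ANOVA ICC for a list of equal-sized groups (assumption: balanced data)."""
    n_groups = len(groups)
    k = len(groups[0])  # measurements per group
    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]
    # Mean square between groups and mean square within groups.
    msb = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    msw = sum(
        (value - m) ** 2 for g, m in zip(groups, group_means) for value in g
    ) / (n_groups * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up outcome values: three families, three repeated measurements each.
family_scores = [[10, 11, 12], [20, 21, 19], [30, 29, 31]]
icc = icc_oneway(family_scores)
print(f"ICC = {icc:.3f}")  # close to 1: measurements cluster strongly within families

# Made-up residuals that drift slowly, as correlated repeated measures often do.
residuals = [1.0, 0.8, 0.9, 0.7, -0.6, -0.8, -0.9, -0.7]
dw = durbin_watson(residuals)
print(f"Durbin-Watson = {dw:.2f}")  # well below 2: positive autocorrelation
```

A high ICC or a Durbin-Watson value far from 2 would both point toward correlated data points, though neither replaces knowing how the data were collected.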
If two variables are highly correlated, you are correct that you probably want to use just one of them. When data points are correlated, though, as I discuss in this article, GEE and MLM are useful techniques.