Lecture 9: Assessing correlations
-be able to explain why identifying correlations is useful for data wrangling/analysis
- Discover relationships between features
- One step towards discovering causality
-understand what correlation between a pair of features is
- Correlation is used to detect pairs of variables that might have some relationship
- Correlation does not necessarily imply causality
- Purpose:
- Discover relationships
- Feature ranking: select the best features for building better predictive models
-understand how correlation can be identified using visualisation
- Scatterplot
-understand the concept of a linear relation versus a non-linear relation for a pair of features
-understand why the concept of correlation is important, where it is used and understand why correlation is not the same as causation
- Discover relationships
- One step towards discovering causality
A causes B
- Examples:
- Gene A causes lung cancer
- Feature ranking: select the best features for building better predictive models
- A good feature to use is one that has a high correlation with the outcome one is trying to predict
- Can hint at potential causal relationships
- Does not imply causality!
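Feature ranking by correlation can be sketched as follows. This is a minimal illustration, not a prescribed method: the feature names, the toy values and the `rank_features` helper are all hypothetical, and features are ranked by the absolute value of their Pearson coefficient with the outcome so that strong negative correlations also count.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_features(features, outcome):
    """Sort features from strongest to weakest |r| with the outcome."""
    scores = {name: pearson(values, outcome) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data: three candidate features and one outcome to predict.
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "hours_gaming": [6, 5, 5, 3, 2, 1],
    "shoe_size": [7, 9, 8, 7, 9, 8],
}
exam_score = [52, 58, 61, 70, 74, 83]

ranking = rank_features(features, exam_score)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

In this toy data the two study-related features come out far ahead of `shoe_size`. A high rank only says the feature moves with the outcome; it does not show that the feature causes it.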
-understand the use of Euclidean distance for computing correlation between two features and its advantages/ disadvantages
- Problems with Euclidean distance
- The two features can be measured on different scales, which inflates the distance even when they behave alike
- Does not give a clear intuition about how well variables are correlated
- Cannot discover variables with similar behaviour / dynamics but at different scale
- Cannot discover variables with similar behaviour / dynamics but in the opposite direction (negative correlation)
- Advantages: easy to implement
- Disadvantages: features can be measured on different scales, and Euclidean distance does not give a clear intuition about how well variables are correlated.
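The problems with Euclidean distance can be seen on a small example. The sketch below uses made-up height values (all variable names are hypothetical) and compares one feature against a rescaled copy of itself, an unrelated feature, and a feature that moves in the opposite direction.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two features measured on the same samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical toy features over five samples.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
height_cm = [150, 160, 170, 180, 190]        # same information, different scale
unrelated = [1.8, 1.5, 1.9, 1.6, 1.7]        # no clear trend against height_m
falling = [1.90, 1.80, 1.70, 1.60, 1.50]     # same dynamics, opposite direction

d_rescaled = euclidean_distance(height_m, height_cm)
d_unrelated = euclidean_distance(height_m, unrelated)
d_opposite = euclidean_distance(height_m, falling)

print(f"rescaled copy: {d_rescaled:.3f}")
print(f"unrelated:     {d_unrelated:.3f}")
print(f"opposite:      {d_opposite:.3f}")
```

The rescaled copy is perfectly related to the original yet gets by far the largest distance, while the unrelated feature gets the smallest. The distance for the perfectly negatively related feature is just another positive number with no hint of the relationship.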
-understand the use of Pearson correlation coefficient for computing correlation between two features and its advantages/ disadvantages
- Advantage: the result is easy to interpret (the sign gives the direction of the relationship, the magnitude gives its strength)
- Disadvantage: can only detect linear relationships
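The linear-only limitation can be checked directly. In this small sketch (toy data chosen for illustration), y is completely determined by x in both cases, but only the linear case produces a coefficient far from zero.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]     # straight line: y = 2x + 1
y_quadratic = [v ** 2 for v in x]     # parabola: y = x^2, symmetric around 0

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

A coefficient near zero therefore means "no linear relationship", not "no relationship"; a scatterplot would show the parabola immediately.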
-understand the meaning of the variables in the Pearson correlation coefficient formula and how they can be calculated. Be able to compute this coefficient on a simple pair of features. The formula for this coefficient will be provided on the exam.
- rxy in range [-1, 1]
- 1 = perfect positive linear correlation
- -1 = perfect negative linear correlation
- 0 = no linear correlation
- |r| = strength of linear correlation
Example
- Degree
- > 0.5 = large
- 0.3 ~ 0.5 = moderate
- 0.1 ~ 0.3 = small
- < 0.1 = trivial
- Scale invariant: r(x, y) = r(x, K × y) for any constant K > 0 (a negative K flips the sign of r)
- Location invariant: r(x, y) = r(x, K + y) for any constant K
- Can only detect linear relationships: y = a × x + b + noise
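The properties above can be checked on a pair of features small enough to compute by hand. The sketch below is one possible implementation of the standard sample formula; the toy data and the `strength` helper are hypothetical, and how the band boundaries (exactly 0.1, 0.3, 0.5) are assigned is an assumption, since the notes do not say.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| with the degree bands from the notes (boundary handling assumed)."""
    size = abs(r)
    if size > 0.5:
        return "large"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "small"
    return "trivial"

# Hypothetical toy data: means are 3 and 4, so deviations are whole numbers.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson(x, y)                              # 6 / sqrt(10 * 6)
r_scaled = pearson(x, [3 * v for v in y])      # scale: y multiplied by K = 3
r_shifted = pearson(x, [v + 10 for v in y])    # location: y shifted by K = 10
r_flipped = pearson(x, [-v for v in y])        # negative K flips the sign

print(f"r = {r:.4f} ({strength(r)})")
print(f"scaled:  {r_scaled:.4f}")
print(f"shifted: {r_shifted:.4f}")
print(f"flipped: {r_flipped:.4f}")
```

By hand: the deviations of x are -2, -1, 0, 1, 2 and of y are -2, 0, 1, 0, 1, so the numerator is 6 and the denominator is sqrt(10 × 6), giving r ≈ 0.775. Note that the coefficient is undefined when either feature is constant, because the denominator is then zero.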
-be able to interpret the meaning of a computed Pearson correlation coefficient
We will define a correlation measure rxy, assessing samples from two features x and y
- Assess how close their scatter plot is to a straight line (a linear relationship)
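These notes refer to the formula without reproducing it. For reference, the standard sample form is given below, assuming n samples (x_i, y_i) with means x̄ and ȳ; the notation used in the lecture and on the exam sheet may differ.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The numerator is positive when x and y tend to be above (or below) their means together, and negative when one tends to be above while the other is below. Dividing by the two square roots rescales the result into [-1, 1].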
-understand the advantages and disadvantages of using the Pearson correlation coefficient for assessing the degree of relationship between two features
- Same as the advantages/disadvantages of the Pearson correlation coefficient listed above