Assessing correlations

Lecture 9: Assessing correlations

-be able to explain why identifying correlations is useful for data wrangling/analysis

Discover relationships between features
One step towards discovering causality

-understand what correlation between a pair of features is

Correlation is used to detect pairs of variables that might have some relationship
Correlation does not necessarily imply causality
Purpose:
- discover relationships
- feature ranking: select the best features for building better predictive models

-understand how correlation can be identified using visualisation

Scatterplot

-understand the concept of a linear relation versus a non-linear relation for a pair of features

-understand why the concept of correlation is important, where it is used and understand why correlation is not the same as causation

Discover relationships
- One step towards discovering causality (A causes B)
- Example: Gene A causes lung cancer
Feature ranking: select the best features for building better predictive models
- A good feature to use is one that has a high correlation with the outcome one is trying to predict
Correlation can hint at potential causal relationships, but it does not imply causality!
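
Feature ranking by correlation can be sketched as below. This is only a minimal sketch: it assumes the Pearson coefficient as the correlation measure and ranks by absolute value, and the feature names and numbers are hypothetical, not from the lecture.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical outcome and candidate features (illustrative values only).
outcome = [1, 2, 3, 4, 5]
features = {
    "feature_a": [2, 4, 6, 8, 11],   # rises with the outcome
    "feature_b": [5, 3, 4, 1, 2],    # falls as the outcome rises
    "feature_c": [3, 1, 4, 1, 5],    # only loosely related
}

# Score each feature by |r|: a strong negative correlation is as useful
# for prediction as a strong positive one.
scores = {name: abs(pearson(values, outcome)) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: |r| = {scores[name]:.3f}")
```

A high rank only says the feature moves together with the outcome in this sample; it does not show that the feature causes the outcome.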

-understand the use of Euclidean distance for computing correlation between two features and its advantages/ disadvantages

Problems of Euclidean distance
- Features can be measured on different scales
- Does not give a clear intuition about how well variables are correlated
- Cannot discover variables with similar behaviour / dynamics but at a different scale
- Cannot discover variables with similar behaviour / dynamics but in the opposite direction (negative correlation)
Advantages: easy to implement
Disadvantages: features can be measured on different scales, and Euclidean distance does not give a clear intuition about how well variables are correlated.
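
The scale and direction problems listed above can be seen in a small experiment. This is a minimal sketch with made-up height values (the variable names and data are hypothetical): the same feature expressed in different units ends up far apart, while a feature that moves in the opposite direction ends up close.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equally long feature vectors."""
    if len(x) != len(y):
        raise ValueError("features must have the same number of samples")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


height_m = [1.5, 1.6, 1.7, 1.8]            # hypothetical feature, in metres
height_cm = [150.0, 160.0, 170.0, 180.0]   # the same feature, in centimetres
falling = [1.8, 1.7, 1.6, 1.5]             # same values, opposite direction

d_same_units = euclidean_distance(height_m, height_m)
d_rescaled = euclidean_distance(height_m, height_cm)
d_opposite = euclidean_distance(height_m, falling)

# The rescaled copy is perfectly related to the original but has a huge
# distance; the opposite-direction feature has a small distance even though
# it is perfectly negatively related.
print(d_same_units, d_rescaled, d_opposite)
```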

-understand the use of Pearson correlation coefficient for computing correlation between two features and its advantages/ disadvantages

Advantage: the result is easy to interpret, since its sign and magnitude show how the two features are related
Disadvantage: can only detect linear relationships

-understand the meaning of the variables in the Pearson correlation coefficient formula and how they can be calculated. Be able to compute this coefficient on a simple pair of features. The formula for this coefficient will be provided on the exam.
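
The formula itself is not reproduced in these notes. The standard textbook definition, which is presumably the one provided on the exam, is written out below, followed by a small hand computation. The data x = (1, 2, 3) and y = (1, 3, 2) are illustrative values, not taken from the lecture.

```latex
% n = number of samples, x_i and y_i = i-th sample of each feature,
% \bar{x} and \bar{y} = sample means of x and y.
\[
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

% Worked example with x = (1, 2, 3), y = (1, 3, 2):
\begin{align*}
\bar{x} &= 2, \qquad \bar{y} = 2 \\
\sum_i (x_i - \bar{x})(y_i - \bar{y}) &= (-1)(-1) + (0)(1) + (1)(0) = 1 \\
\sum_i (x_i - \bar{x})^2 &= 1 + 0 + 1 = 2 \\
\sum_i (y_i - \bar{y})^2 &= 1 + 1 + 0 = 2 \\
r_{xy} &= \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5
\end{align*}
```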

rxy lies in the range [-1, 1]
- 1 = perfect positive linear correlation
- -1 = perfect negative linear correlation
- 0 = no linear correlation
- |r| = strength of linear correlation

Example
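
As one possible worked example, the coefficient can be computed directly from its standard definition. This is a minimal sketch: the function name `pearson` and the sample values are assumptions chosen for illustration, not data from the lecture.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: how x and y vary together around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: how much each feature varies on its own.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_positive = pearson(x, [2, 4, 6, 8, 10])   # y = 2x, points on a rising line
r_negative = pearson(x, [10, 8, 6, 4, 2])   # points on a falling line
r_noisy = pearson(x, [2, 1, 4, 3, 5])       # rising trend with some scatter

print(r_positive, r_negative, r_noisy)
```

The first two pairs lie exactly on a straight line, so they give the extreme values 1 and -1; the third pair only follows a rising trend and gives a value strictly between 0 and 1.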

Degree (judged by |r|)
- > 0.5 = large
- 0.3 ~ 0.5 = moderate
- 0.1 ~ 0.3 = small
- < 0.1 = trivial
Scale invariant: r(x, y) = r(x, Ky) for a constant K > 0
Location invariant: r(x, y) = r(x, K + y)
Can only detect linear relationships: y = a × x + b + noise
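
The invariance properties and the linear-only limitation can be checked numerically. This is a minimal sketch with made-up values; the constants 3 and 100 and the quadratic feature are assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


x = [-2, -1, 0, 1, 2]
y = [1.0, 2.5, 2.0, 4.5, 5.0]   # roughly linear in x, with some noise

r_base = pearson(x, y)
r_scaled = pearson(x, [3 * v for v in y])      # scale: multiply y by K = 3
r_shifted = pearson(x, [v + 100 for v in y])   # location: add K = 100 to y
r_flipped = pearson(x, [-3 * v for v in y])    # a negative K flips the sign

# y = x^2 is a perfect (non-linear) relationship, yet on this symmetric
# range the linear correlation is zero.
r_quadratic = pearson(x, [v ** 2 for v in x])

print(r_base, r_scaled, r_shifted, r_flipped, r_quadratic)
```

Scaling by a negative constant keeps |r| unchanged but reverses the sign, which is why the scale-invariance statement is normally restricted to positive constants.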

-be able to interpret the meaning of a computed Pearson correlation coefficient

We will define a correlation measure rxy, assessing samples from two features x and y
- Assess how close their scatter plot is to a straight line (a linear relationship)
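
Interpreting a computed coefficient can be sketched as a small helper that reports direction and degree. This is a hedged sketch: the function name `describe_correlation` is hypothetical, the thresholds are the rule-of-thumb degrees from these notes, and how the exact boundary values 0.3 and 0.1 are assigned is an assumption.

```python
def describe_correlation(r):
    """Return (degree, direction) for a Pearson coefficient r in [-1, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a Pearson coefficient must lie in [-1, 1]")
    strength = abs(r)
    # Rule-of-thumb degrees; boundary handling is an assumption.
    if strength > 0.5:
        degree = "large"
    elif strength >= 0.3:
        degree = "moderate"
    elif strength >= 0.1:
        degree = "small"
    else:
        degree = "trivial"
    if degree == "trivial":
        direction = "none"   # too weak to speak of a direction
    else:
        direction = "positive" if r > 0 else "negative"
    return degree, direction


for value in (0.8, -0.4, 0.05):
    print(value, describe_correlation(value))
```

A "trivial" result only means there is no clear linear relationship; the two features may still be related in a non-linear way.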

-understand the advantages and disadvantages of using the Pearson correlation coefficient for assessing the degree of relationship between two features

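Same as for the Pearson correlation coefficient above: the result is easy to interpret, but only linear relationships can be detected.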
