Assessing correlations

2021-05-19 13:35:44

Lecture 9: Assessing correlations

-be able to explain why identifying correlations is useful for data wrangling/analysis

  • Discover relationships
  • One step towards discovering causality

-understand what correlation between a pair of features is

  • Correlation is used to detect pairs of variables that might have some relationship
  • Correlation does not necessarily imply causality
  • Purpose:
    • discover relationships
    • feature ranking : select the best features for building better predictive models

-understand how correlation can be identified using visualisation

  • Scatterplot

-understand the concept of a linear relation versus a non-linear relation for a pair of features

-understand why the concept of correlation is important, where it is used and understand why correlation is not the same as causation

  • Discover relationships
  • One step towards discovering causality

A causes B

  • Examples:
    • Gene A causes lung cancer
  • Feature ranking: select the best features for building better predictive models
    • A good feature to use is one that has a high correlation with the outcome one is trying to predict
  • Can hint at potential causal relationships
  • Does not imply causality!
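
The feature-ranking idea above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes the Pearson coefficient (covered later in these notes) as the correlation measure, and the names `pearson`, `rank_features`, `f_up`, `f_down`, `f_noise` and all data values are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def rank_features(features, outcome):
    """Return (name, r) pairs sorted by |r| with the outcome, strongest first."""
    scores = {name: pearson(values, outcome) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data: three candidate features and one outcome.
outcome = [1, 2, 3, 4, 5]
features = {
    "f_noise": [3, 1, 4, 1, 5],   # weakly related to the outcome
    "f_down": [5, 4, 3, 2, 2],    # strongly related, opposite direction
    "f_up": [2, 4, 6, 8, 10],     # perfectly linear in the outcome
}

ranking = rank_features(features, outcome)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
# f_up comes first (r = 1), then f_down (r ≈ -0.970), then f_noise (r ≈ 0.354)
```

Ranking by |r| rather than r keeps strongly negative features near the top, since they can be just as predictive as strongly positive ones. A high rank still says nothing about causality.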

-understand the use of Euclidean distance for computing correlation between two features and its advantages/ disadvantages

  • Problem of Euclidean Distance
    • Objects can be represented with different measurement scales
    • Does not give a clear intuition about how well variables are correlated
    • Cannot discover variables with similar behaviour / dynamics but at different scale
    • Cannot discover variables with similar behaviour / dynamics but in the opposite direction (negative correlation)
  • Advantages: easy to implement
  • Disadvantages: objects can be represented with different measurement scales, and Euclidean distance does not give a clear intuition about how well variables are correlated.
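
The problems listed above are easy to reproduce on a toy example. The sketch below is illustrative only; the function name `euclidean_distance` and the data values are assumptions made for the example.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1, 2, 3, 4, 5]
y_scaled = [10 * v for v in x]    # same behaviour as x, different scale
y_opposite = [6 - v for v in x]   # same behaviour as x, opposite direction
y_flat = [3, 3, 3, 3, 3]          # constant: carries no information about x

d_flat = euclidean_distance(x, y_flat)          # ≈ 3.16
d_opposite = euclidean_distance(x, y_opposite)  # ≈ 6.32
d_scaled = euclidean_distance(x, y_scaled)      # ≈ 66.75
print(d_flat, d_opposite, d_scaled)
```

The constant feature ends up "closest" to x even though it is unrelated to it, while the two features that follow x exactly (rescaled, or reversed) appear far away. The raw distance also has no fixed range, so a value such as 6.32 has no direct interpretation.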

-understand the use of Pearson correlation coefficient for computing correlation between two features and its advantages/ disadvantages

  • Advantage: the result is easy to interpret (its sign and size show how the data are related)
  • Disadvantage: can only detect linear relationships

-understand the meaning of the variables in the Pearson correlation coefficient formula and how they can be calculated. Be able to compute this coefficient on a simple pair of features. The formula for this coefficient will be provided on the exam.
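
The formula itself does not appear in these notes. For reference, the standard sample form of the Pearson correlation coefficient, which is presumably the one the lecture refers to, is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here n is the number of samples, x_i and y_i are the i-th values of features x and y, and \bar{x} and \bar{y} are their sample means.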

  • rxy in range [-1, 1]
    • 1 = perfect positive linear correlation
    • -1 = perfect negative linear correlation
    • 0 = no linear correlation
    • |r| = strength of linear correlation
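
A hand computation on a small pair of features can be checked step by step. The sketch below assumes the standard sample form of the coefficient (the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations); the data values are made up.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Step 1: sample means.
mean_x = sum(x) / n   # 3.0
mean_y = sum(y) / n   # 4.0

# Step 2: deviations from the means.
dx = [xi - mean_x for xi in x]   # [-2, -1, 0, 1, 2]
dy = [yi - mean_y for yi in y]   # [-2, 0, 1, 0, 1]

# Step 3: numerator = sum of products of deviations.
numerator = sum(a * b for a, b in zip(dx, dy))   # 4 + 0 + 0 + 0 + 2 = 6

# Step 4: sums of squared deviations for each feature.
sum_sq_x = sum(a * a for a in dx)   # 10
sum_sq_y = sum(b * b for b in dy)   # 6

# Step 5: combine.
r_xy = numerator / math.sqrt(sum_sq_x * sum_sq_y)   # 6 / sqrt(60)
print(round(r_xy, 3))   # 0.775
```

The result is positive and fairly close to 1, so the scatter plot of x against y lies reasonably close to an upward-sloping straight line.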

Example

  • Degree
    • > 0.5 = large
    • 0.3 ~ 0.5 = moderate
    • 0.1 ~ 0.3 = small
    • < 0.1 = trivial
  • Scale invariant: r(x, y) = r(x, Ky) for any constant K > 0
  • Location invariant: r(x, y) = r(x, K + y)
  • Can only detect linear relationships: y = a × x + b + noise
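
The properties in this list can be checked numerically. This is a minimal sketch under stated assumptions: `pearson` and `strength_label` are hypothetical helper names, the data are made up, and the handling of values that fall exactly on a threshold (0.1, 0.3, 0.5) is a choice made here, not something the notes specify.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def strength_label(r):
    """Map |r| to the degree labels used above (boundary handling assumed)."""
    a = abs(r)
    if a > 0.5:
        return "large"
    if a >= 0.3:
        return "moderate"
    if a >= 0.1:
        return "small"
    return "trivial"

x = [-2, -1, 0, 1, 2]
y = [1, 3, 2, 5, 4]

r = pearson(x, y)                              # 0.8
r_scaled = pearson(x, [3 * v for v in y])      # scale invariance: same as r
r_shifted = pearson(x, [v + 7 for v in y])     # location invariance: same as r
r_flipped = pearson(x, [-3 * v for v in y])    # negative factor flips the sign
r_parabola = pearson(x, [v ** 2 for v in x])   # y = x^2: exact but non-linear

print(r, r_scaled, r_shifted, r_flipped, r_parabola)
print(strength_label(r), strength_label(r_parabola))
```

The parabola case gives r = 0 even though y is completely determined by x, which is the practical meaning of "can only detect linear relationships". The flipped case shows why scale invariance is stated for a positive constant: a negative factor keeps |r| but reverses the sign.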

-be able to interpret the meaning of a computed Pearson correlation coefficient

We will define a correlation measure rxy, assessing samples from two features x and y

  • Assess how close their scatter plot is to a straight line (a linear relationship)

-understand the advantages and disadvantages of using the Pearson correlation coefficient for assessing the degree of relationship between two features

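Same as for the Pearson correlation coefficient above: the result is easy to interpret, but it can only detect linear relationships.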
