Correlation and Data Transformations - Majestic Blog
Measuring Linear Association: Correlation

The correlation coefficient measures the strength of the linear association between two variables. As its magnitude increases, the observations group more closely around a straight line: +1 indicates a perfect positive linear relationship, in which one variable increases at a constant rate as the other increases. If the pattern of the data appears to be nonlinear, then the correlation coefficient is not a useful measure. Rank correlation coefficients, such as Spearman's rank correlation, measure the extent to which one variable increases as the other increases without requiring that increase to be represented by a linear relationship; they also make the coefficient less sensitive to non-normality in the distributions.
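To make the distinction concrete, here is a minimal sketch (invented data, assuming NumPy and SciPy are available) comparing Pearson and Spearman correlation on a relationship that is monotonic but nonlinear:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Invented data: y grows exponentially with x -- monotonic but nonlinear.
x = np.linspace(1, 10, 50)
y = np.exp(x)

r_pearson, _ = pearsonr(x, y)      # assumes linearity: noticeably below 1
rho_spearman, _ = spearmanr(x, y)  # assumes only monotonicity: exactly 1

print(f"Pearson r    = {r_pearson:.3f}")
print(f"Spearman rho = {rho_spearman:.3f}")
```

Because y increases strictly with x, the Spearman coefficient is exactly 1, while the Pearson coefficient is pulled well below 1 by the curvature.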
By Neep Hazarika, September 26

In this article, we will show how data transformations can be an important tool for the proper statistical analysis of data.
The association, or correlation, between two variables can be visualised by creating a scatterplot of the data. In certain instances, it may appear that the relationship between the two variables is not linear; in such a case, a linear correlation analysis may not be appropriate. Transformations, however, can often significantly improve the fit between X and Y.
Figure 1: (a) Raw Data and (b) Transformed Data

The main focus of this article will be the study of special types of transformations used to achieve linearity. We will apply the methodology to a real-world problem, in which we compare correlations between the online popularity rankings of the most visited websites provided by Ranking. Data transformation essentially entails applying a mathematical function to change the measurement scale of a variable so as to optimize the linear correlation between the data.
In general, two kinds of transformations can be found in the literature:

- A linear transformation preserves linear relationships between variables.
- A nonlinear transformation alters (either strengthens or weakens) the linear relationships between variables and thus modifies the correlation between them.

There are an infinite number of transformations that one could use to achieve linearity for correlation analysis, so it is important to decide which transformation to apply before proceeding with the statistical calculations.
If a cause-and-effect relationship is being tested, the variable that causes the association is called the independent variable and is plotted on the X axis, while the effect is called the dependent variable and is plotted on the Y axis. Some common methods of transforming variables to achieve linearity are described below. If the raw data show a trend as in Figure 2(a), then it would be appropriate to apply a logarithmic transformation to the dependent variable y, as shown in Figure 2(b).
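As a small illustration of the logarithmic case (synthetic data with invented parameters, assuming NumPy and SciPy), an exponential trend in y becomes linear after taking logs, and the Pearson correlation improves accordingly:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
# Exponential trend with mild multiplicative noise (for illustration only).
y = 2.0 * np.exp(1.3 * x) * rng.lognormal(0.0, 0.1, x.size)

r_raw, _ = pearsonr(x, y)          # curvature depresses the linear correlation
r_log, _ = pearsonr(x, np.log(y))  # log(y) is linear in x, up to small noise

print(f"raw data:        r = {r_raw:.3f}")
print(f"log-transformed: r = {r_log:.3f}")
```

The transformed correlation is close to 1 because log(y) = log(2) + 1.3x plus a small noise term.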
Note that the transformed points lie on a straight line. A trend in the raw data as shown in Figure 4(a) would suggest a reciprocal transformation, i.e. replacing y with 1/y. In all of the cases described above, the transformation was applied only to the dependent variable y.
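As a quick sanity check of the reciprocal case (invented data, NumPy and SciPy assumed), transforming y to 1/y turns a hyperbolic trend into an exact straight line:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.linspace(1, 10, 50)
y = 1.0 / (0.5 + 2.0 * x)          # hyperbolic trend: 1/y = 0.5 + 2x

r_raw, _ = pearsonr(x, y)          # curved relationship: |r| below 1
r_recip, _ = pearsonr(x, 1.0 / y)  # exactly linear: r = 1

print(f"raw:        r = {r_raw:.3f}")
print(f"reciprocal: r = {r_recip:.3f}")
```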
In some cases, however, it may be necessary to transform the independent variable x, as described below. If the raw data follow a trend as shown in Figure 5(a), a logarithmic transformation can be applied to the independent variable x. Sometimes, transformations may have to be applied to both the dependent and independent variables, as shown in Figure 6 below.
In this case, a logarithmic transformation has been applied to both the x and y variables. This method is often referred to as the power model of data transformation.

Figure 6: (a) Raw Data and (b) Transformed Data

The Box-Cox transformation [1] constitutes another particularly useful family of transformations, which is applied to the independent variable in most cases.
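Under the power model, taking logs of both variables turns y = a * x^b into the straight line log y = log a + b * log x, so the exponent can be read off as a slope. A small sketch with synthetic, noise-free data (NumPy assumed; the parameter values are invented):

```python
import numpy as np

# Synthetic power-law data: y = a * x**b with a = 3 and b = 2.5.
x = np.linspace(1.0, 100.0, 200)
y = 3.0 * x ** 2.5

# Fitting a straight line to (log x, log y) recovers b as the slope
# and log(a) as the intercept.
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope, np.exp(intercept))  # approximately 2.5 and 3.0
```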
The transformation is defined as x(λ) = (x^λ − 1)/λ for λ ≠ 0, and x(λ) = ln(x) for λ = 0, where the parameter λ is chosen to optimize the linearity of the fit. It is hoped that transforming X in this way can provide a sizeable improvement to the fit.
Non-linear relationships are much more complicated than linear ones
The Box-Cox transformation can also be applied to the Y variable, but that aspect will not be discussed here. One of the advantages of the Box-Cox linearity plot is that it provides a convenient methodology for determining an appropriate transformation without a lot of trial-and-error testing of alternatives.
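The idea behind the linearity plot can be sketched directly: sweep λ over a grid, transform x at each value, and record the correlation with y; the λ that maximizes the correlation is the one to use. A minimal version with synthetic, noise-free data (NumPy/SciPy assumed; the data are constructed so that λ = 0.5 should win):

```python
import numpy as np
from scipy.stats import pearsonr

def boxcox(x, lam):
    """Box-Cox transform: (x**lam - 1)/lam, or log(x) when lam is 0."""
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

# Synthetic data in which y is exactly linear in sqrt(x),
# so the best lambda should come out at 0.5.
x = np.linspace(1, 100, 200)
y = 4.0 * np.sqrt(x)

# Box-Cox linearity plot: correlation between transformed x and y,
# evaluated over a grid of candidate lambda values.
lams = np.linspace(-2, 2, 81)
corrs = [pearsonr(boxcox(x, lam), y)[0] for lam in lams]
best = lams[int(np.argmax(corrs))]
print(best)
```

Plotting `corrs` against `lams` gives the linearity plot itself; here we just take the argmax of the curve.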
These are commonly occurring relationships between variables. For example, the pressure and volume of nitrogen during an isentropic expansion are related as PV^1.4 = constant. Moreover, many non-linear relationships are monotonic in nature.
This means they steadily increase or decrease without oscillating. Such relationships are worth studying because in many cases they behave qualitatively like linear relationships.

Approximations

A linear relationship is the simplest to understand and can therefore serve as a first approximation of a non-linear relationship, although the limits of validity of such an approximation need to be noted carefully. In fact, a number of phenomena were once thought to be linear, but scientists later realized that this was only true as an approximation.
Consider the special theory of relativity, which redefined our perceptions of space and time. It gives the full non-linear relationship between the variables, which can be well approximated as linear, in the manner of Newtonian mechanics, at low speeds.
Consider momentum: in Newtonian mechanics it depends linearly on velocity, so if you double the velocity, the momentum doubles. At speeds approaching that of light, however, the relationship becomes highly non-linear.
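The breakdown of the linear approximation can be seen numerically. The sketch below compares Newtonian momentum p = mv with the relativistic expression p = mv / sqrt(1 - v²/c²); their ratio is the Lorentz factor, which is nearly 1 at everyday speeds and diverges as v approaches c (the sample speeds are chosen arbitrarily for illustration):

```python
import math

C = 299_792_458.0  # speed of light in m/s

def p_newton(m, v):
    """Newtonian momentum: linear in v."""
    return m * v

def p_relativistic(m, v):
    """Relativistic momentum: p = m*v / sqrt(1 - v^2/c^2)."""
    return m * v / math.sqrt(1.0 - (v / C) ** 2)

# Ratio of relativistic to Newtonian momentum (the Lorentz factor):
# indistinguishable from 1 at low speed, blowing up near c.
for frac in (0.001, 0.5, 0.9, 0.99):
    v = frac * C
    gamma = p_relativistic(1.0, v) / p_newton(1.0, v)
    print(f"v = {frac:>5}c  ratio = {gamma:.4f}")
```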