How to Compute the Sample Correlation Coefficient

The sample correlation coefficient, denoted as \(r\), is a descriptive statistic used to quantify the relationship between two sets of observed data, such as height and weight or hours studied and test scores. This single numerical value measures the linear association between two variables, indicating both the strength and the direction of this link.

Defining Linear Relationships

A linear relationship exists when a change in one variable is consistently associated with a proportional change in the other variable. This association can manifest as either a positive or a negative relationship.

In a positive correlation, the two variables change in the same direction; as the first variable increases, the second variable also tends to increase. Conversely, a negative correlation describes an inverse relationship where an increase in one variable is consistently accompanied by a decrease in the other. For instance, hours spent exercising and body fat might display a negative correlation.

The strength of the relationship refers to how closely the data points cluster around an imaginary straight line drawn through the center of the data. A strong relationship implies that the data points are tightly grouped, exhibiting a clear pattern. A weak relationship, on the other hand, shows data points scattered more loosely, suggesting a less predictable association between the two variables. The sample correlation coefficient is specifically designed to measure linear relationships. Data sets that follow a curved or non-linear pattern can yield misleading correlation coefficients, meaning a low \(r\) value does not necessarily mean no relationship exists, but rather that no straight-line relationship is present.
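As a brief illustration of this caveat, the following Python sketch (the data values are made up for demonstration) computes \(r\) for a perfectly deterministic but non-linear relationship, \(y = x^2\), over \(x\) values symmetric about zero; the coefficient comes out at essentially zero even though the two variables are clearly related:

```python
import numpy as np

# Hypothetical x values symmetric about zero; y depends on x exactly, but not linearly.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry.
r = np.corrcoef(x, y)[0, 1]
print(r)  # approximately 0: no straight-line relationship, even though y is a function of x
```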

Essential Data Preparation Steps

The calculation of the sample correlation coefficient requires that the data for the two variables be collected in paired observations. Each data point must consist of a value for the first variable, \(X\), and a corresponding value for the second variable, \(Y\), creating paired coordinates \((x_i, y_i)\).

The initial step involves determining the arithmetic mean for each variable. The mean of \(X\), denoted as \(\bar{x}\), is found by summing all \(x\) values and dividing by the total number of observations, \(n\). The same procedure is applied to the \(Y\) values to find \(\bar{y}\).

Once the means are established, the next step is to calculate the deviation of every data point from its respective mean. For each \(x_i\) and \(y_i\), one must compute the differences \((x_i - \bar{x})\) and \((y_i - \bar{y})\). These deviation scores indicate how far each observation lies from the average value of its variable.

These deviation scores are then used to calculate the sum of the squared deviations for both \(X\) and \(Y\). Squaring each deviation, \((x_i - \bar{x})^2\), and summing these values across all observations yields the sum of squares for \(X\), \(\sum (x_i - \bar{x})^2\). The same procedure is completed for the \(Y\) variable, resulting in \(\sum (y_i - \bar{y})^2\).
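The preparation steps described above can be sketched in plain Python. The data set below is a small made-up example, and variable names such as `x_bar` and `ss_x` are purely illustrative:

```python
# Hypothetical paired observations (x_i, y_i).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(x)           # number of paired observations
x_bar = sum(x) / n   # mean of the X values
y_bar = sum(y) / n   # mean of the Y values

# Deviation of each observation from its variable's mean.
dev_x = [xi - x_bar for xi in x]
dev_y = [yi - y_bar for yi in y]

# Sums of squared deviations for X and for Y.
ss_x = sum(d ** 2 for d in dev_x)
ss_y = sum(d ** 2 for d in dev_y)

print(x_bar, y_bar, ss_x, ss_y)
```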

Calculating the Correlation Coefficient

The formula for the sample correlation coefficient, \(r\), provides a standardized measure of the linear relationship between \(X\) and \(Y\). It divides a measure of how the variables vary together by a measure of how much each varies on its own, which makes the resulting coefficient unitless.

The mathematical expression for the sample correlation coefficient is presented as:

\[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
\]

The numerator, \(\sum (x_i - \bar{x})(y_i - \bar{y})\), is the sum of the products of the deviations, measuring the degree to which the variables vary together (it is proportional to the sample covariance). A positive sum indicates that when an \(X\) value is above its mean, the corresponding \(Y\) value also tends to be above its mean, contributing to a positive relationship. Conversely, if an \(X\) value above the mean is paired with a \(Y\) value below the mean, the product of the deviations will be negative, contributing to a negative relationship.

The denominator, \(\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}\), standardizes the covariance measurement. This term is the geometric mean of the total variation within \(X\) and \(Y\). Dividing the numerator by this factor constrains the resulting \(r\) value to a range between \(-1\) and \(+1\).
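Putting the pieces together, a minimal sketch of the full calculation might look like the following; the function name `sample_correlation` is chosen here for illustration, and the body mirrors the formula above term by term:

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation coefficient r for paired observations x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired sequences with at least two observations")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of deviations (how X and Y vary together).
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the two sums of squared deviations.
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)
```

Note that if either variable shows no variation at all, the denominator is zero and \(r\) is undefined; the sketch above does not guard against that case.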

To illustrate the assembly of these components, consider a minimal data set of three paired observations: \((x_1=1, y_1=3)\), \((x_2=2, y_2=5)\), and \((x_3=3, y_3=7)\). First, the means must be calculated: \(\bar{x} = 2\) and \(\bar{y} = 5\). The deviations are then calculated for each point. For the first point, the deviations are \((1-2) = -1\) and \((3-5) = -2\), yielding a product of \((-1)(-2) = 2\). The second point has deviations of \((2-2) = 0\) and \((5-5) = 0\), with a product of \(0\). The third point has deviations of \((3-2) = 1\) and \((7-5) = 2\), with a product of \(2\). Summing the products of the deviations gives the numerator: \(2 + 0 + 2 = 4\). For the denominator, the sum of the squared \(X\) deviations is \((-1)^2 + (0)^2 + (1)^2 = 2\), and the sum of the squared \(Y\) deviations is \((-2)^2 + (0)^2 + (2)^2 = 8\). Plugging these values into the formula yields \(r = \frac{4}{\sqrt{2 \times 8}} = \frac{4}{\sqrt{16}} = \frac{4}{4} = 1\).
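The worked example can also be checked numerically. The short sketch below runs the same three-point data set through NumPy's built-in `corrcoef` routine, which should report the same result as the hand calculation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

# The off-diagonal entry of the 2x2 correlation matrix is r.
r = np.corrcoef(x, y)[0, 1]
print(r)  # 1.0 (up to floating-point rounding), matching the hand calculation
```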

Understanding the Resulting Value

The final sample correlation coefficient, \(r\), always falls between \(-1\) and \(+1\). This range allows for a direct interpretation of the strength and direction of the linear association, with the sign communicating the direction of the relationship. A coefficient of \(r = +1\) represents a perfect positive linear correlation, while \(r = -1\) signifies a perfect negative linear correlation. A result of \(r = 0\) means that there is no linear relationship between the two variables.

Values between these extremes describe the strength of the association. The closer the absolute value of \(r\) is to \(1\), the stronger the linear relationship. For example, a coefficient of \(r = 0.75\) indicates a strong positive relationship, suggesting a clear trend where both variables increase together. A coefficient of \(r = -0.30\), however, would suggest a weak negative relationship, implying that the association is not particularly strong or predictable. Generally, values closer to \(\pm 1\) demonstrate a more reliable and tightly clustered linear pattern in the data.