Varclus Detailed Explanation

Venkatsai
6 min read · Jul 8, 2024


Hi All,

In this article, I’ll break down Varclus (Variable Clustering) step by step in a simple, clear way. You’ll not only understand what Varclus is, but also how to interpret it. Plus, we’ll dive into a hands-on example using Excel to run Varclus on sample data.

Variable clustering (Varclus) is a multivariate-statistics technique that groups variables into clusters based on their similarity. It is often used for variable/feature reduction.

Below are the basic steps for calculating Varclus. I’ll walk through each step in detail with an example:
1. Calculate Correlation Matrix
2. Initial Clustering
3. Decide the Number of Clusters
4. Calculate R-squared

Here’s a breakdown of each step to help you understand Varclus better. To follow along, I’ve created a sample dataset with 6 columns and 6 rows. You can access this data using the link below. Preferably, open the spreadsheet in MS Excel.

Step 1: Calculate Correlation Matrix
To assess the similarity between variables in our dataset, we can calculate a correlation matrix. This matrix shows the correlation coefficient between each pair of variables. A correlation coefficient closer to 1 indicates a strong positive relationship, while a value closer to -1 indicates a strong negative relationship. We can use the CORREL function in Excel to calculate these coefficients. For instance, the correlation between Hours_of_Study (column C) and Student (column B) can be found using the formula: =CORREL($B$2:$B$7,C2:C7)
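If you’d rather check the Excel result outside the spreadsheet, the same coefficient can be computed in Python. This is a minimal sketch with made-up stand-in numbers, not the actual values from the attached sheet:

```python
import numpy as np

# Hypothetical stand-in data (6 rows, mirroring the sample sheet's columns);
# these numbers are illustrative, not the article's actual spreadsheet values.
student = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])             # column B
hours_of_study = np.array([10.0, 12.0, 8.0, 15.0, 9.0, 14.0])  # column C

# Pearson correlation coefficient -- equivalent to Excel's
# =CORREL($B$2:$B$7, C2:C7)
r = np.corrcoef(student, hours_of_study)[0, 1]
print(f"CORREL(Student, Hours_of_Study) = {r:.4f}")
```

Applying `np.corrcoef` to the whole data matrix at once gives the full correlation matrix in one call.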

Step 2: Initial Clustering
Clustering is a powerful machine-learning technique that groups similar data points together based on their characteristics. Its goal is to organize data into clusters where points within a cluster share more similarities with each other compared to points in different clusters. There are many clustering algorithms available, but for now, let’s focus on Hierarchical Clustering.

While Microsoft Excel doesn’t offer built-in functionality for Hierarchical Clustering, I’ve used the XLSTAT add-on to perform this analysis in the attached Excel sheet. The resulting dendrogram, which is a tree-like structure, visually represents the hierarchical relationships between the data points.

A dendrogram plot is a visual representation of hierarchical clustering that shows the arrangement of clusters formed at each step of the algorithm. Here’s a short interpretation of a dendrogram plot and what the height of the tree represents:

Dendrogram Interpretation:

  • Height of the Tree: The height at which two branches merge represents the distance or dissimilarity between the clusters being joined. A lower height means the clusters are more similar, while a higher height indicates greater dissimilarity.
  • Cutting the Dendrogram: By drawing a horizontal line across the dendrogram at a specific height, you can determine the number of clusters. The number of vertical lines intersected by the horizontal line indicates the number of clusters at that dissimilarity level.

Step 3: Decide the number of clusters
XLSTAT offers an initial suggestion for the optimal number of clusters based on the dendrogram’s structure. However, we can also use the dendrogram to define our own cluster boundaries visually. In this case, we’ll accept XLSTAT’s suggestion of four clusters:
Cluster 1: GPA and Hours_of_Study
Cluster 2: SAT_score and ACT_score
Cluster 3: Extracurricular Activities
Cluster 4: Student

One way to summarize the values within each cluster is to average the member variables’ values, which provides a central tendency for each cluster. Below are the cluster values.
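Averaging the member variables into a cluster centroid can be sketched as follows. The GPA and Hours_of_Study values here are hypothetical placeholders, not the article’s actual data:

```python
import numpy as np

# Hypothetical values for the two variables in cluster 1 (6 observations each).
gpa = np.array([3.2, 3.8, 2.9, 4.0, 3.0, 3.9])
hours_of_study = np.array([10.0, 12.0, 8.0, 15.0, 9.0, 14.0])

# Cluster centroid: the row-wise average of the member variables.
cluster1_centroid = (gpa + hours_of_study) / 2
print(cluster1_centroid)
```

Note that GPA and study hours live on very different scales, so in practice you may want to standardize each variable (subtract the mean, divide by the standard deviation) before averaging.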

Step 4: Calculate R-squared
Because variables within the same cluster are correlated, we might want to select a single representative variable from each cluster. Ideally, this representative should be highly correlated with its own cluster’s centroid and weakly correlated with the variables in the nearest neighboring cluster.

One way to identify such a representative variable is by calculating the R-squared value. R-squared represents the proportion of variance in one variable explained by another variable. We can calculate the R-squared value between a variable and its own cluster (centroid) and the R-squared value between the same variable and the nearest neighboring cluster.

The variable with a high R-squared value for its own cluster and a low R-squared value for the nearest neighboring cluster is a good candidate for the representative variable.

Here’s the formula for calculating R-squared: for two series, R-squared is simply the square of their Pearson correlation coefficient, R² = r², which is exactly what Excel’s RSQ function returns.
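A quick sanity check of that relationship in Python, again with hypothetical values (the centroid series is the cluster-1 average sketched earlier):

```python
import numpy as np

# Hypothetical variable and its cluster centroid (6 observations each).
gpa = np.array([3.2, 3.8, 2.9, 4.0, 3.0, 3.9])
centroid = np.array([6.6, 7.9, 5.45, 9.5, 6.0, 8.95])

# R-squared between two series = square of their Pearson correlation;
# this matches Excel's RSQ(known_ys, known_xs).
r = np.corrcoef(gpa, centroid)[0, 1]
r_squared = r ** 2
print(f"R-squared = {r_squared:.4f}")
```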

Step 4.1: Calculating the Nearest Cluster
Before calculating the R-squared values, we need to determine the nearest cluster for each cluster in our analysis. Euclidean distance is a common method for measuring the similarity between data points.

In the context of Varclus, we can calculate the Euclidean distance between the centroids (representative points) of each cluster. The cluster with the smallest Euclidean distance to a given cluster is considered its nearest neighbor.
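Finding the nearest neighboring cluster from the centroids can be sketched like this. The centroid series are hypothetical placeholders:

```python
import numpy as np

# Hypothetical centroids: one 6-observation series per cluster.
centroids = {
    "cluster1": np.array([6.6, 7.9, 5.45, 9.5, 6.0, 8.95]),
    "cluster2": np.array([1300.0, 1450.0, 1200.0, 1500.0, 1250.0, 1480.0]),
    "cluster3": np.array([2.0, 3.0, 1.0, 4.0, 2.0, 3.0]),
}

# Nearest neighbor of cluster1 = the other cluster whose centroid is
# closest in Euclidean distance.
target = centroids["cluster1"]
nearest = min(
    (name for name in centroids if name != "cluster1"),
    key=lambda name: np.linalg.norm(target - centroids[name]),
)
print(nearest)  # cluster3 is far closer to cluster1 than cluster2 is
```

As with the centroids themselves, Euclidean distance is scale-sensitive, so standardized variables give more meaningful distances than raw ones.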

Step 4.2: Calculating R-squared
Thankfully, Excel offers a convenient function named RSQ to calculate R-squared values. We can use it to compute the R-squared value between each variable and its own cluster centroid, as well as between the same variable and the nearest neighboring cluster’s centroid.

Step 4.3: Calculating the 1 - R-squared Ratio
Recall that we’re aiming to select a representative variable for each cluster. This variable should be highly correlated with the other variables in its own cluster (centroid) and weakly correlated with the variables in the nearest neighboring cluster.

To identify such a variable, we can use the 1 - R-squared ratio, defined as (1 - R² with own cluster) / (1 - R² with nearest cluster). The variable with the lowest ratio is likely the best representative for the cluster: a low ratio means maximum correlation with its own cluster and minimum correlation with the nearest cluster.
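The selection rule can be sketched in a few lines. The R-squared inputs below are hypothetical, chosen only to illustrate the comparison:

```python
def one_minus_r2_ratio(r2_own: float, r2_nearest: float) -> float:
    """Varclus selection ratio: (1 - R^2 own cluster) / (1 - R^2 nearest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_nearest)

# Hypothetical R-squared values for the two variables in cluster 1.
ratios = {
    "GPA": one_minus_r2_ratio(r2_own=0.95, r2_nearest=0.30),
    "Hours_of_Study": one_minus_r2_ratio(r2_own=0.85, r2_nearest=0.40),
}

# Lowest ratio -> best representative of the cluster.
representative = min(ratios, key=ratios.get)
print(representative, round(ratios[representative], 4))
```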

By analyzing the 1-R-squared ratios, we can see that within cluster 1, GPA has a stronger correlation with the other variables in the cluster compared to Hours_of_Study. This suggests that GPA is more representative of the commonalities within cluster 1 and exhibits a weaker relationship with the variables in the nearest neighboring cluster. Therefore, based on this analysis, we can choose GPA as the representative variable for cluster 1, effectively reducing the number of variables from two to one.

Notes
A common question that arises is why we haven’t used Principal Component Analysis (PCA) in this explanation of Varclus. While PCA can be a valuable tool for dimensionality reduction, it’s important to note that it’s an optional step in the Varclus process. Additionally, understanding how Varclus works in conjunction with PCA can be more complex.

Comment in case of any doubts, and I will try to respond ASAP.
Thanks for Reading.
Venkat Sai

Written by Venkatsai

AIR 42 graduate from IIT Bombay with 5+ years of experience in Risk Analytics.