3.5 Variable Selection
There is no requirement that a modeler use every variable in a dataset when building a clustering model. No statistical test or function can determine which variables to include; that decision is simply up to the modeler.
As far as possible, the variables selected for inclusion should align with the model’s business purpose. In most cases, the best way to make this determination is to iterate: try different combinations of variables and assess the results.
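The iteration described above can be sketched in a few lines. This is only an illustration under stated assumptions: the variable names (income, spend, noise), the synthetic records, the tiny `kmeans` helper, and the use of a mean silhouette score as the assessment are all hypothetical choices made for this example. The score is one input to the modeler's judgment, not a test that makes the decision, and scores computed on different variable subsets are not strictly comparable because each subset defines a different distance space.

```python
import itertools
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Very small k-means (Lloyd's algorithm); returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen records
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette score: near 1 = tight, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the other members of each cluster.
        mean_dist = {}
        for c in set(labels):
            d = [math.dist(p, q)
                 for j, (q, lab) in enumerate(zip(points, labels))
                 if lab == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        own = labels[i]
        if own not in mean_dist or len(mean_dist) < 2:
            scores.append(0.0)  # singleton cluster, or only one cluster
            continue
        a = mean_dist[own]
        b = min(v for c, v in mean_dist.items() if c != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two customer groups that differ on income and spend,
# plus a "noise" variable unrelated to the grouping.
rng = random.Random(42)
records = []
for group_center in (0.0, 5.0):
    for _ in range(30):
        records.append({
            "income": rng.gauss(group_center, 1.0),
            "spend": rng.gauss(group_center, 1.0),
            "noise": rng.gauss(0.0, 3.0),
        })

variables = ["income", "spend", "noise"]
results = {}
# Try every non-empty combination of variables and score the clustering.
for size in range(1, len(variables) + 1):
    for combo in itertools.combinations(variables, size):
        points = [tuple(r[v] for v in combo) for r in records]
        labels = kmeans(points, k=2)
        results[combo] = mean_silhouette(points, labels)

for combo, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{', '.join(combo):<22} silhouette = {score:.2f}")
```

With more than a handful of variables, the number of combinations grows quickly, so in practice a modeler would usually try a short list of candidate subsets chosen with the business purpose in mind rather than every possible one.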
When thinking about the total number of variables to keep, bear in mind that as Euclidean distance space takes on more and more dimensions, it can become harder to identify records that are similar to one another. This is sometimes referred to as the “Curse of Dimensionality.” It can be mitigated by reducing the number of features included, by combining features, or by using linear combinations of features rather than their original values.
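One way to see this effect is to measure how much the nearest and farthest neighbors of a record differ as dimensions are added. The sketch below is a minimal illustration, assuming uniformly random data in a unit hypercube; the function name `distance_contrast` and the sample sizes are hypothetical choices for this example. It computes the relative contrast (farthest distance minus nearest distance, divided by nearest distance) from one query point to a set of random points.

```python
import math
import random


def distance_contrast(n_dims, n_points=200, seed=0):
    """Relative gap between the farthest and nearest of n_points random
    points, measured by Euclidean distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Compare the contrast at increasing numbers of dimensions.
contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

When the contrast is large, some records are clearly much closer to the query than others. As it shrinks toward zero, every record looks roughly equally far away, which is what makes similar records harder to pick out in high-dimensional data.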