Implementing Hierarchical and KMeans Clustering on Principal Components

September 18, 2022

Suppose you want to identify groups of customers so that you can treat each group with a different strategy. If you have a dataset with features describing various aspects of customers, such as age, occupation, living area, and spending behavior, clustering methods can help you find those groups. Among the many clustering methods, hierarchical clustering and KMeans clustering, often combined with PCA (Principal Component Analysis), are widely used. In this post, I will show how to implement hierarchical clustering, KMeans clustering, and both on PCA features, using the SciPy and scikit-learn libraries in Python.
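As a rough illustration of the workflow named above, here is a minimal sketch that runs hierarchical clustering (SciPy) and KMeans (scikit-learn) on PCA features. The synthetic data, the choice of two components, Ward linkage, and two clusters are all assumptions made for the example, not settings taken from the post.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for customer features: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 5)),
    rng.normal(loc=6.0, scale=1.0, size=(50, 5)),
])

# Scale the features, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Hierarchical clustering on the PCA features (Ward linkage is an assumption).
Z = linkage(X_pca, method="ward")
hier_labels = fcluster(Z, t=2, criterion="maxclust")

# KMeans on the same PCA features.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
```

In practice the number of components and clusters would be chosen from the data, for example with an explained-variance plot, a dendrogram, or an elbow plot.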

Feature Scaling, Min-Max and Z-Score Normalization

September 9, 2022

Feature scaling means bringing features into similar ranges of values. In this post, I will discuss when it is necessary and then look into the two most frequently mentioned methods, min-max normalization and z-score normalization. Finally, I will briefly introduce several other techniques, including mean normalization, the robust scaler, and scaling to unit length. Each method is accompanied by examples, made with Python and the scikit-learn API, that illustrate its effect.
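To make the two main methods concrete, the following sketch applies min-max normalization and z-score normalization to a tiny made-up array with scikit-learn. The data values are hypothetical; the manual formulas in the comments are the standard definitions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max normalization: (x - min) / (max - min), per feature -> range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score normalization: (x - mean) / std, per feature -> mean 0, std 1.
# StandardScaler uses the population standard deviation (ddof=0).
X_zscore = StandardScaler().fit_transform(X)
```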

Sample size formulas

July 25, 2022

Although online tools are available for obtaining a sample size, I make a note here of the formulas for sample size calculation. Being aware of them may help you use online calculators more confidently. The formulas come in several variants, but some are more commonly used than others. I will list those after reviewing the fundamentals of sample size determination.
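This excerpt does not say which formulas the post covers, so the sketch below uses one widely used formula as an assumed example: the sample size for estimating a proportion, n = z² · p(1 − p) / e², where z is the normal quantile for the confidence level, p the expected proportion, and e the margin of error. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_proportion(margin_of_error, confidence=0.95, p=0.5):
    """Sample size for estimating a proportion (infinite population assumed)."""
    # Two-sided z value for the given confidence level, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin_of_error ** 2
    # Round up, since a sample size must be a whole number.
    return math.ceil(n)


n_required = sample_size_proportion(margin_of_error=0.05)
```

Using p = 0.5 is the conservative choice, because p(1 − p) is largest there.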

Hacker Statistics, simulation of data acquisition

July 15, 2022

I will share the concept and applications of “hacker statistics”, an approach to statistical inference. I first try to pin down its definition and then present several examples that apply it. It is considered quite a versatile tool, although a number of questions remain to be studied further.
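Assuming "hacker statistics" here refers to simulation-based inference, one common instance is the bootstrap: resample the observed data with replacement many times and read a confidence interval off the simulated statistics. The sketch below shows that idea for a sample mean; the data and the number of replicates are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed sample (stand-in for real measurements).
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Simulate repeated data acquisition by resampling with replacement.
n_replicates = 5000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_replicates)
])

# Percentile bootstrap 95% confidence interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```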

What is associated with profitable movies?

June 1, 2022

Using a TMDb movie data set containing properties of movies released between 1960 and 2015, I explored, with visualizations, what is related to the financial success of movies.