Distribution difference on real-world imbalanced data

Siya
3 min read · Apr 29, 2021

On a day-to-day basis we often need to determine whether two distributions are the same or not. This becomes even more significant when the data is imbalanced.

Let's figure out ways of determining whether two distributions differ. This article mostly shows what not to do.

First, I should check the mean and standard deviation of both distributions.

Here, I am trying to determine whether my variable's distribution differs across my target values.

```python
# df = my dataframe
# var_1 = my variable
import pandas as pd

df.loc[df["target_column"] == 1, "var_1"].describe()
```

```
count       78.000000
mean      1381.820513
std       2361.359551
min        360.000000
25%        582.500000
50%        765.000000
75%       1142.500000
max      20000.000000
Name: var_1, dtype: float64
```

```python
df.loc[df["target_column"] == 0, "var_1"].describe()
```

```
count    100608.000000
mean        747.289480
std         770.431889
min           0.000000
25%         500.000000
50%         650.000000
75%         880.000000
max       80790.000000
Name: var_1, dtype: float64
```

So, mean and standard deviation are quite different.
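As an aside, the two describe() calls above can be collapsed into a single groupby, which puts both classes in one table. A minimal sketch, using a tiny synthetic stand-in for the real df (the column names target_column and var_1 are the same assumed names as above):

```python
import pandas as pd

# Synthetic stand-in for the real dataframe, using the same
# (assumed) column names as in the article.
df = pd.DataFrame({
    "target_column": [1, 1, 0, 0, 0],
    "var_1": [1000.0, 2000.0, 500.0, 650.0, 700.0],
})

# One summary table with a row per target value
summary = df.groupby("target_column")["var_1"].describe()
print(summary)
```

This makes eyeballing the per-class means and spreads side by side a bit easier.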

Let's try to visualize it using the seaborn library:

```python
import seaborn as sns

sns.displot(df, x="var_1", col="target_column",
            multiple="dodge", stat="density", common_norm=False)
```

Does this help at all? Nope. All it tells us is that for target_value = 0 there are slightly more values near zero and the distribution is a bit more spread out.

Will reducing bin size help?

Reduced bin size to 100

Again, very little difference.

Let's try another way to visualize it so that the difference is actually visible.

What about plotting both values in a single histogram?

```python
import numpy as np
from matplotlib import pyplot

bins = np.linspace(-5, 5, 100)
pyplot.hist(x, bins, alpha=0.5, label='x')  # with target_col = 1
pyplot.hist(y, bins, alpha=0.5, label='y')  # with target_col = 0
pyplot.legend(loc='upper right')
pyplot.show()
```

Ugh! Ugly! The x variable is completely invisible.

Let's resort to a density plot:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(x, hist=False, kde=True, label='1')
sns.distplot(y, hist=False, kde=True, label='0')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Var_1 vs target')
plt.xlabel('Var_1')
plt.ylabel('Density')
```

This is better.

Let's see what a significance test says about it.

When comparing two continuous-valued distributions, the first test that comes to mind is the

Kolmogorov-Smirnov test

Let's perform this test:

```python
# importing the package
from scipy.stats import ks_2samp

# Checking if the feature is significantly different across the target,
# since I want to see whether this feature contributes to my target or not.
x = df[df['target_column'] == 1]['var_1'].values  # for target value 1
y = df[df['target_column'] == 0]['var_1'].values  # for target value 0

ks_2samp(x, y)
```

Results:

```
Ks_2sampResult(statistic=0.20255232799960848, pvalue=0.0033386921251179658)
```

Since the p-value is less than 0.05, it looks like there is indeed a difference.
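As a sanity check on the test itself, here is ks_2samp on synthetic data: two samples from the same distribution versus one shifted by a standard deviation (the seeded data is an illustration, not from the article's dataset):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
a = rng.normal(0, 1, size=500)
b = rng.normal(0, 1, size=500)   # same distribution as a
c = rng.normal(1, 1, size=500)   # shifted by one standard deviation

same = ks_2samp(a, b)   # p-value for identical distributions
diff = ks_2samp(a, c)   # p-value for shifted distributions
```

With identical distributions the p-value is typically large, while the shifted sample produces a vanishingly small one.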

Let's try another test, the Mann-Whitney U test:

```python
from scipy.stats import mannwhitneyu

mannwhitneyu(x, y)
```

```
MannwhitneyuResult(statistic=2899555.5, pvalue=3.2691203453568055e-05)
```

The p-value is very small, hence we can safely reject the null hypothesis!

Looks like this variable is useful in predicting my target.

Let's see how narrow the difference is by plotting the ECDF:

```python
sns.histplot(data=df, x="var_1", hue="target_column",
             log_scale=False, element="step", fill=False,
             cumulative=True, stat="density", common_norm=False)
```

Well, we will have to make do with a small difference.

Time to wrap all this in a loop and run it on all the real-valued variables to analyze further!
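That loop might look something like this sketch. It assumes the same df/target_column names as above; the helper name ks_by_target and the synthetic demo data are mine, not from the article:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_by_target(df, target_column="target_column"):
    """Run the two-sample KS test on each numeric column, split by target."""
    results = {}
    numeric_cols = df.select_dtypes(include="number").columns
    for col in numeric_cols:
        if col == target_column:
            continue
        x = df.loc[df[target_column] == 1, col].dropna().values
        y = df.loc[df[target_column] == 0, col].dropna().values
        stat, pvalue = ks_2samp(x, y)
        results[col] = {"statistic": stat, "pvalue": pvalue}
    # Most significant (smallest p-value) columns first
    return pd.DataFrame(results).T.sort_values("pvalue")

# Tiny synthetic demo: one column that differs by class, one that doesn't.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"target_column": rng.integers(0, 2, size=400)})
demo["shifted"] = rng.normal(demo["target_column"] * 2.0, 1.0)
demo["noise"] = rng.normal(0, 1, size=400)

report = ks_by_target(demo)
print(report)
```

Sorting by p-value surfaces the variables whose distributions differ most between the two target classes.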
