The Data Mining Trilogy: III. Analysis
Overview
Finally we have come to the last part of fundamental data mining. This is where one's analytical power shines through. However, we also highlight some cautions engineers should exercise during exploratory analysis.
While data analysis is fascinating, I feel that building models on top of the analysis to facilitate business decisions is even more exciting. This relies heavily on machine learning models and artificial intelligence toolkits. I've written (and will write more in the future) blog posts on these topics. A word of reminder: these models require some level of statistical and mathematical foundation, so it really depends on one's interest in developing them.
A general view of the dataset
One can always use an easy trick: `YourDataFrameName.describe()` shows the details of your data entries and gives a good view of your data's properties. A sample call looks like:
```python
df.describe(include='all')
```
Next, let's look into the numerical data's patterns.
Numerical data distributions
1. Generate a comprehensive view of the numerical features in the dataset
```python
list(set(df.dtypes.tolist()))
```
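The snippet above only lists the distinct dtypes. Here is a minimal sketch of how the distribution graphs referenced below could be generated (assuming `df` is the working DataFrame; the figure size and bin count are illustrative choices):

```python
import matplotlib.pyplot as plt

# Keep only the numeric columns for distribution plots
df_num = df.select_dtypes(include=['float64', 'int64'])

# Draw one histogram per numeric feature
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)
plt.tight_layout()
plt.show()
```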
Key steps:
(i) From the graphs, find which features have similar distributions (a quick numerical check is sketched after these steps);
(ii) Document the discoveries for further investigation;
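To back the visual comparison with numbers, here is a small sketch (my own addition) that ranks the numeric features by skewness:

```python
# Features with similar skewness are candidates for similar distributions;
# highly skewed ones may warrant a transform later
print(df_num.skew().sort_values(ascending=False).head(10))
```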
2. Correlation (correlation is affected by outliers)
Find the features strongly correlated with the output. Call this list of features `golden_features_list`.
```python
df_num_corr = df_num.corr()['SalePrice'][:-1]  # [:-1] drops the last entry, which is SalePrice itself
```
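One plausible way to derive `golden_features_list` from these correlations (the 0.5 cutoff is an illustrative choice, not a rule):

```python
# Keep features whose absolute correlation with SalePrice exceeds 0.5
golden_features_list = df_num_corr[abs(df_num_corr) > 0.5].sort_values(ascending=False)
print(f"{len(golden_features_list)} strongly correlated features:\n{golden_features_list}")
```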
3. Correlation (outlier removal)
- Plot the numerical features and see which ones have very few or explainable outliers
- Remove the outliers from these features and see which ones show a good correlation without their outliers (a sketch follows after the key steps below)
```python
# Loop body reconstructed (the original shows only the header): pairplot each
# batch of five numeric features against SalePrice to eyeball outliers
for i in range(0, len(df_num.columns), 5):
    sns.pairplot(data=df_num, x_vars=df_num.columns[i:i+5], y_vars=['SalePrice'])
```
Key steps:
(i) Spot any clear outliers and document them. Consider each outlier's plausibility and whether to remove it, and document the decision;
(ii) Spot any clearly linear/non-linear relationships and document them;
(iii) Spot any distribution with a lot of 0's: do Correlation (0 Removal);
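As promised above, a sketch of the outlier-removal step. The feature name and cutoff are hypothetical, the kind of values one would pick by eye from the plots:

```python
# Suppose GarageArea showed a handful of extreme points in its pairplot.
# Trim them with an illustrative cutoff and recheck the correlation.
trimmed = df_num[df_num['GarageArea'] < 1200]
print(trimmed['GarageArea'].corr(trimmed['SalePrice']))
```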
4. Correlation (0 Removal)
Remove all 0's in some columns, regenerate `golden_features_list`, and see if any new features are added.
```python
import operator
```
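The `import operator` hints at sorting with `operator.itemgetter`; a plausible sketch of the recomputation (the variable names are my own):

```python
# Recompute each feature's correlation with SalePrice after dropping its 0 rows
zero_removed_corr = {}
for col in df_num.columns[:-1]:          # the last column is SalePrice
    non_zero = df_num[df_num[col] != 0]  # keep rows where this feature is non-zero
    zero_removed_corr[col] = non_zero[col].corr(non_zero['SalePrice'])

# Rank features by the recomputed correlations; itemgetter(1) sorts on the value
ranked = sorted(zero_removed_corr.items(), key=operator.itemgetter(1), reverse=True)
print(ranked[:10])
```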
Finally, we can draw conclusions with respect to the numerical data distribution analysis.
Feature to feature (numerical) correlation analysis:
- Heat map for features:
Steps of Analysis for the Heatmap:
- First of all, remove all trivial correlations (easy to explain and not that relevant)
- Next, identify the relationships that are pertinent to the question/task
- Lastly, flag features that are similar enough to be combined, that need further investigation, or that are clearly helpful to the task
- Document the analysis;
```python
corr = df_num.drop('SalePrice', axis=1).corr()  # we already examined the SalePrice correlations
```
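The plotting call itself isn't shown; a minimal sketch, assuming seaborn and an illustrative |r| >= 0.5 filter to keep the heatmap readable:

```python
plt.figure(figsize=(12, 10))
# Cells below the threshold become NaN and render blank, leaving only strong pairs
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)],
            cmap='viridis', vmax=1.0, vmin=-1.0,
            annot=True, annot_kws={"size": 8}, square=True)
plt.show()
```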
Q –> Q
"Q to Q" stands for "Quantitative to Quantitative relationship", which is typical of a purely numeric dataset. For qualitative data, tricks like counting and sorting can be used to transform it into numeric form so the same Q to Q analysis applies.
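As an example of such a trick (my own illustration), a qualitative column can be frequency-encoded so it participates in numeric correlation analysis:

```python
# Replace each category with its frequency ("counting" as a crude numeric encoding)
df['SaleCondition_count'] = df['SaleCondition'].map(df['SaleCondition'].value_counts())
```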
- Extract strongly correlated quantitative features
```python
features_to_analyse = [x for x in quantitative_features_list if x in golden_features_list]
features_to_analyse.append('SalePrice')
display(features_to_analyse)
```
- Plot the distribution:
```python
fig, ax = plt.subplots(round(len(features_to_analyse) / 3), 3, figsize=(18, 12))
for i, ax in enumerate(fig.axes):
    if i < len(features_to_analyse) - 1:
        sns.regplot(x=features_to_analyse[i], y='SalePrice', data=df[features_to_analyse], ax=ax)
```
- Analysis of the distribution:
- Since a linear regression fit is given in each plot, we focus on analyzing the spread of the data in each graph
C –> Q (Categorical to Quantitative relationship)
"C to Q" stands for "Categorical to Quantitative relationship". This differs from the quantitative case: we cannot compare the degree of a desired attribute based on the category codes themselves. Thus we should see how these attributes can be manipulated to make the relationship interpretable.
- Extract Categorical features
```python
categorical_features = [a for a in quantitative_features_list[:-1] + df.columns.tolist()
                        if (a not in quantitative_features_list[:-1]) or (a not in df.columns.tolist())]
df_categ = df[categorical_features]
df_not_num = df_categ.select_dtypes(include=['O'])  # keep the non-numerical (object) features
```
- Apply Boxplot
```python
plt.figure(figsize=(12, 6))
ax = sns.boxplot(x='SaleCondition', y='SalePrice', data=df_categ)
# "SaleCondition" can be replaced with other categorical features
```
- Apply Distribution plot
```python
fig, axes = plt.subplots(round(len(df_not_num.columns) / 3), 3, figsize=(12, 30))
for i, ax in enumerate(fig.axes):
    if i < len(df_not_num.columns):
        ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45)
        sns.countplot(x=df_not_num.columns[i], alpha=0.7, data=df_not_num, ax=ax)
fig.tight_layout()
```
Through these plots, we can see that some categories are predominant for certain features, such as Utilities, Heating, GarageCond, Functional… These features may not be relevant for our predictive model.
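A small sketch (my own addition; the 95% threshold is illustrative) to flag such dominated features programmatically:

```python
# A feature dominated by a single category carries little signal for the model
dominant = [c for c in df_not_num.columns
            if df_not_num[c].value_counts(normalize=True).iloc[0] > 0.95]
print(dominant)  # features like Utilities or Heating would likely appear here
```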
Conclusion
The methods above cover a wide range of tools used in data analytics. There are definitely many more directions in EDA, and I'll update this post whenever I find something interesting.
The Data Mining Trilogy: III. Analysis
https://criss-wang.github.io/post/blogs/data/exploratory-data-analysis/