Exploratory data analysis with Python

In this post, I cover a few simple Python commands for exploratory data analysis that help you understand the structure of your data.

First, we load the Iris dataset, then inspect its attributes, and finally plot a scatterplot matrix to observe correlations between features.

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()
type(iris)
sklearn.datasets.base.Bunch

The Bunch type means that iris behaves like a dictionary: it contains key-value pairs. The keys are as follows:

print(iris.keys())
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
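
Because a Bunch behaves like a dictionary, each entry can be read either with a key or as an attribute. The following is a minimal sketch of that equivalence; the variable names data_by_key and data_by_attr are mine and are only there for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Dictionary-style access, using one of the keys listed above
data_by_key = iris['data']

# Attribute-style access to the same entry
data_by_attr = iris.data

# Both forms give the same feature array
print(np.array_equal(data_by_key, data_by_attr))
```

The rest of this post uses the attribute form (iris.data, iris.target), which is a little shorter to type.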

What are the dimensions of the dataset?

iris.data.shape
(150, 4)

The target names, feature names, and integer-coded target labels are:

iris.target_names
array(['setosa', 'versicolor', 'virginica'], 
      dtype='<U10')
iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
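
Each integer in iris.target is an index into iris.target_names, so the codes can be turned back into readable labels and counted. This is a small sketch of one way to do that with NumPy; the names species and class_counts are my own.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Index the array of class names with the integer codes: one label per row
species = iris.target_names[iris.target]

# Count how many samples fall in each class
names, counts = np.unique(species, return_counts=True)
class_counts = dict(zip(names.tolist(), counts.tolist()))
print(class_counts)
```

The three classes are balanced, with 50 samples each, which matches the target array printed above.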

Assign feature and target data:

x = iris.data
y = iris.target
df = pd.DataFrame(x, columns=iris.feature_names)
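
If you would like the class label to travel with the features, one option is to add it as a categorical column. The sketch below builds a separate DataFrame so that df itself stays unchanged; the column name species and the variable df_labeled are assumptions of mine, not part of the dataset.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Build a categorical label from the integer codes and the class names
species = pd.Categorical.from_codes(iris.target, iris.target_names)

# assign() returns a new DataFrame, so df keeps only the four features
df_labeled = df.assign(species=species)
print(df_labeled.head())
```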

View the head of the DataFrame:

df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
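
Beyond the first few rows, it is often useful to look at summary statistics and check for missing values. This is a short sketch using two standard pandas methods, describe() and isnull(); the variable names summary and missing are mine.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Count, mean, standard deviation, min, quartiles, and max for each feature
summary = df.describe()
print(summary)

# Number of missing values in each column
missing = df.isnull().sum()
print(missing)
```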

Now we plot the scatterplot matrix of the four features, with the class of the flower as the fill color:

_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8,8], s=150, marker = 'D') # c = color
plt.show()

Note that petal length and petal width are highly correlated. The classes are also well clustered across the various plots.
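
The visual impression of correlation can be checked numerically. The sketch below computes the Pearson correlation matrix with the pandas corr() method; the variable names corr and petal_corr are my own.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlation between the four features
corr = df.corr()
print(corr.round(2))

# Correlation between the two petal measurements
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
print(petal_corr)
```

A value close to 1 here supports what the scatterplot matrix suggests about the two petal measurements.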
