Analyzing Decision Tree and K-means Clustering using Iris dataset

Decision trees and K-means clustering are fundamental machine learning algorithms for pattern discovery and classification. This article demonstrates how to apply both techniques to the famous Iris dataset, comparing their performance and visualizing the results.

Iris Dataset Overview

The Iris dataset, introduced by Ronald Fisher in 1936, contains 150 samples of iris flowers from three species: Iris setosa, Iris versicolor, and Iris virginica. Each sample has four features measured in centimeters:

  • Sepal Length
  • Sepal Width
  • Petal Length
  • Petal Width

This dataset is ideal for testing classification and clustering algorithms due to its clear species separation and well-balanced classes.
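The balance can be checked directly; each species contributes exactly 50 samples:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count samples per class label (0, 1, 2) - the species are perfectly balanced
print(np.bincount(iris.target))  # [50 50 50]
```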

Loading and Exploring the Dataset

import pandas as pd
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = datasets.load_iris()
print("Species:", iris.target_names)
print("First 10 targets:", iris.target[:10])

# Create DataFrames for easier manipulation
X = pd.DataFrame(iris.data, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
y = pd.DataFrame(iris.target, columns=['Target'])

print("\nFirst 5 rows of features:")
print(X.head())
print("\nFirst 5 rows of targets:")
print(y.head())
Species: ['setosa' 'versicolor' 'virginica']
First 10 targets: [0 0 0 0 0 0 0 0 0 0]

First 5 rows of features:
   Sepal Length  Sepal Width  Petal Length  Petal Width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

First 5 rows of targets:
   Target
0       0
1       0
2       0
3       0
4       0

Data Visualization

Let's visualize the dataset to understand the natural clustering of different iris species:

plt.figure(figsize=(12,5))

# Define colors and legend patches
colors = np.array(['red', 'green', 'blue'])
red_patch = mpatches.Patch(color='red', label='Setosa')
green_patch = mpatches.Patch(color='green', label='Versicolor')
blue_patch = mpatches.Patch(color='blue', label='Virginica')

# Plot sepal measurements
plt.subplot(1, 2, 1)
plt.scatter(X['Sepal Length'], X['Sepal Width'], c=colors[y['Target']])
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Sepal Width')
plt.legend(handles=[red_patch, green_patch, blue_patch])

# Plot petal measurements
plt.subplot(1, 2, 2)
plt.scatter(X['Petal Length'], X['Petal Width'], c=colors[y['Target']])
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Petal Length vs Petal Width')
plt.legend(handles=[red_patch, green_patch, blue_patch])

plt.tight_layout()
plt.show()
[Plot showing two scatter plots side by side - sepal and petal measurements colored by species]

K-means Clustering Analysis

K-means clustering partitions data into k clusters by minimizing within-cluster variance. Let's apply it to the Iris dataset:

# Apply K-means clustering
kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_model.fit(X)

print("Cluster centers:")
print(kmeans_model.cluster_centers_)
print(f"\nCluster labels (first 10): {kmeans_model.labels_[:10]}")

# Map cluster labels to the species order; this [1, 0, 2] permutation was
# found by inspecting the cluster centers and holds for this random_state
predicted_labels = np.choose(kmeans_model.labels_, [1, 0, 2]).astype(np.int64)

# Calculate accuracy
kmeans_accuracy = accuracy_score(y['Target'], predicted_labels)
print(f"K-means clustering accuracy: {kmeans_accuracy:.3f}")
Cluster centers:
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]]

Cluster labels (first 10): [1 1 1 1 1 1 1 1 1 1]
K-means clustering accuracy: 0.893
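The hard-coded [1, 0, 2] permutation used above only holds for one particular run. A more robust sketch, assuming the true labels are available for evaluation, assigns each cluster the majority species among its members:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)

# For each cluster, pick the most common true label among its members
mapping = {}
for cluster in range(3):
    members = iris.target[kmeans.labels_ == cluster]
    mapping[cluster] = np.bincount(members).argmax()

predicted = np.array([mapping[c] for c in kmeans.labels_])
accuracy = (predicted == iris.target).mean()
print(f"K-means accuracy with majority-vote mapping: {accuracy:.3f}")
```

This avoids having to re-derive the permutation by hand whenever the clustering seed changes.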

Decision Tree Classification

Decision trees create a hierarchical model that makes decisions based on feature values. Let's train and evaluate a decision tree classifier:

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

# Create and train decision tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions
y_train_pred = dt_classifier.predict(X_train)
y_test_pred = dt_classifier.predict(X_test)

# Calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Decision Tree Train Accuracy: {train_accuracy:.3f}")
print(f"Decision Tree Test Accuracy: {test_accuracy:.3f}")
Decision Tree Train Accuracy: 1.000
Decision Tree Test Accuracy: 1.000
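To see which measurements drive the splits, we can inspect the fitted tree's feature importances and print its decision rules; as the scatter plots suggest, the petal measurements typically dominate:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# How much each feature contributes to the tree's splits (sums to 1)
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")

# Human-readable view of the learned decision rules
print(export_text(clf, feature_names=iris.feature_names))
```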

Comparison of Results

Let's visualize the clustering results compared to the true species labels:

plt.figure(figsize=(12,5))

# Plot original classification
plt.subplot(1, 2, 1)
plt.scatter(X['Petal Length'], X['Petal Width'], c=colors[y['Target']])
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('True Species Classification')
plt.legend(handles=[red_patch, green_patch, blue_patch])

# Plot K-means clustering results
plt.subplot(1, 2, 2)
plt.scatter(X['Petal Length'], X['Petal Width'], c=colors[predicted_labels])
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('K-means Clustering Results')
plt.legend(handles=[red_patch, green_patch, blue_patch])

plt.tight_layout()
plt.show()
[Plot showing comparison between true species labels and K-means clustering results]

Algorithm Comparison

Algorithm            Type          Accuracy  Best For
K-means Clustering   Unsupervised  89.3%     Pattern discovery without labels
Decision Tree        Supervised    100%      Classification with labeled data

Conclusion

Both algorithms successfully analyzed the Iris dataset. The Decision Tree, using supervised learning, achieved perfect accuracy on this particular train/test split, while K-means clustering reached 89.3% accuracy without ever seeing the target labels, demonstrating the natural separability of the iris species. These results highlight the effectiveness of both approaches for different machine learning scenarios.

Updated on: 2026-03-27T07:05:14+05:30
