Analyzing Decision Tree and K-means Clustering using Iris dataset
Decision trees and K-means clustering are fundamental machine learning algorithms for pattern discovery and classification. This article demonstrates how to apply both techniques to the famous Iris dataset, comparing their performance and visualizing the results.
Iris Dataset Overview
The Iris dataset, introduced by Ronald Fisher in 1936, contains 150 samples of iris flowers from three species: Iris setosa, Iris versicolor, and Iris virginica. Each sample has four features measured in centimeters:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
This dataset is ideal for testing classification and clustering algorithms due to its clear species separation and well-balanced classes.
Loading and Exploring the Dataset
import pandas as pd
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = datasets.load_iris()
print("Species:", iris.target_names)
print("First 10 targets:", iris.target[:10])
# Create DataFrames for easier manipulation
X = pd.DataFrame(iris.data, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
y = pd.DataFrame(iris.target, columns=['Target'])
print("\nFirst 5 rows of features:")
print(X.head())
print("\nFirst 5 rows of targets:")
print(y.head())
Species: ['setosa' 'versicolor' 'virginica']
First 10 targets: [0 0 0 0 0 0 0 0 0 0]

First 5 rows of features:
   Sepal Length  Sepal Width  Petal Length  Petal Width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

First 5 rows of targets:
   Target
0       0
1       0
2       0
3       0
4       0
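Before modeling, it is worth confirming the dataset's shape, summary statistics, and class balance. A minimal, self-contained sketch (independent of the code above):

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=['Sepal Length', 'Sepal Width',
                                     'Petal Length', 'Petal Width'])

# Per-feature summary statistics (count, mean, std, min, quartiles, max)
print(X.describe())

# Class balance: the dataset contains exactly 50 samples of each species
print(pd.Series(iris.target).value_counts())
```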
Data Visualization
Let's visualize the dataset to understand the natural clustering of different iris species:
plt.figure(figsize=(12,5))
# Define colors and legend patches
colors = np.array(['red', 'green', 'blue'])
red_patch = mpatches.Patch(color='red', label='Setosa')
green_patch = mpatches.Patch(color='green', label='Versicolor')
blue_patch = mpatches.Patch(color='blue', label='Virginica')
# Plot sepal measurements
plt.subplot(1, 2, 1)
plt.scatter(X['Sepal Length'], X['Sepal Width'], c=colors[y['Target']])
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Sepal Width')
plt.legend(handles=[red_patch, green_patch, blue_patch])
# Plot petal measurements
plt.subplot(1, 2, 2)
plt.scatter(X['Petal Length'], X['Petal Width'], c=colors[y['Target']])
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Petal Length vs Petal Width')
plt.legend(handles=[red_patch, green_patch, blue_patch])
plt.tight_layout()
plt.show()
[Plot showing two scatter plots side by side - sepal and petal measurements colored by species]
K-means Clustering Analysis
K-means clustering partitions data into k clusters by minimizing within-cluster variance. Let's apply it to the Iris dataset:
# Apply K-means clustering
kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_model.fit(X)
print("Cluster centers:")
print(kmeans_model.cluster_centers_)
print(f"\nCluster labels (first 10): {kmeans_model.labels_[:10]}")
# Map cluster labels to species labels. The mapping [1, 0, 2] is read off from
# the fitted cluster centers and depends on random_state; with a different seed,
# inspect the centers and adjust the mapping accordingly.
predicted_labels = np.choose(kmeans_model.labels_, [1, 0, 2]).astype(np.int64)
# Calculate accuracy
kmeans_accuracy = accuracy_score(y['Target'], predicted_labels)
print(f"K-means clustering accuracy: {kmeans_accuracy:.3f}")
Cluster centers:
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]]

Cluster labels (first 10): [1 1 1 1 1 1 1 1 1 1]
K-means clustering accuracy: 0.893
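Here k = 3 matches the known number of species, but in a genuinely unsupervised setting the number of clusters must be chosen from the data. A common heuristic is the elbow method: plot the inertia (within-cluster sum of squares) against k and look for the point where further clusters stop paying off. A minimal sketch (`n_init` is set explicitly for reproducibility across scikit-learn versions):

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data

# Fit K-means for k = 1..6 and record the inertia of each fit
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

On the Iris data, inertia drops steeply up to k = 3 and flattens afterwards, which is consistent with the three species.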
Decision Tree Classification
Decision trees create a hierarchical model that makes decisions based on feature values. Let's train and evaluate a decision tree classifier:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.25, random_state=42
)
# Create and train decision tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
# Make predictions
y_train_pred = dt_classifier.predict(X_train)
y_test_pred = dt_classifier.predict(X_test)
# Calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Decision Tree Train Accuracy: {train_accuracy:.3f}")
print(f"Decision Tree Test Accuracy: {test_accuracy:.3f}")
Decision Tree Train Accuracy: 1.000
Decision Tree Test Accuracy: 1.000
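Beyond accuracy, a decision tree tells us which features drive its splits. A quick sketch that refits the same model and inspects the learned feature importances and tree depth (self-contained, using the same split parameters as above):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Importances sum to 1; higher values mean the feature contributes more splits
for name, importance in zip(iris.feature_names, dt.feature_importances_):
    print(f"{name}: {importance:.3f}")
print("Tree depth:", dt.get_depth())
```

Typically the petal measurements dominate the importances, which matches the scatter plots above: petal length and width separate the three species far more cleanly than the sepal measurements.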
Comparison of Results
Let's visualize the clustering results compared to the true species labels:
plt.figure(figsize=(12,5))
# Plot original classification
plt.subplot(1, 2, 1)
plt.scatter(X['Petal Length'], X['Petal Width'], c=colors[y['Target']])
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('True Species Classification')
plt.legend(handles=[red_patch, green_patch, blue_patch])
# Plot K-means clustering results
plt.subplot(1, 2, 2)
plt.scatter(X['Petal Length'], X['Petal Width'], c=colors[predicted_labels])
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('K-means Clustering Results')
plt.legend(handles=[red_patch, green_patch, blue_patch])
plt.tight_layout()
plt.show()
[Plot showing comparison between true species labels and K-means clustering results]
Algorithm Comparison
| Algorithm | Type | Accuracy | Best For |
|---|---|---|---|
| K-means Clustering | Unsupervised | 89.3% | Pattern discovery without labels |
| Decision Tree | Supervised | 100% | Classification with labeled data |
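Note that the 89.3% clustering "accuracy" depends on the manual cluster-to-species mapping used earlier. A permutation-invariant metric such as the adjusted Rand index (ARI) evaluates the clustering without any mapping; a sketch:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)

# ARI is 1.0 for a perfect match with the species labels, ~0.0 for random labels
ari = adjusted_rand_score(iris.target, km.labels_)
print(f"Adjusted Rand Index: {ari:.3f}")
```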
Conclusion
Both algorithms successfully analyzed the Iris dataset. The Decision Tree achieved perfect accuracy using supervised learning, though a perfect score on a test set of only 38 samples from a well-separated dataset should not be expected to generalize to harder problems. K-means clustering achieved 89.3% accuracy without using target labels, demonstrating the natural separability of the iris species. These results highlight the effectiveness of both approaches for different machine learning scenarios.
