How can Keras be used to download and explore the dataset associated with predicting the tag for a Stack Overflow question in Python?
TensorFlow is an open-source machine learning framework from Google, used with Python to implement algorithms, deep learning applications, and much more. Keras, the high-level deep learning API within TensorFlow, provides easy access to datasets and preprocessing utilities.
The Stack Overflow dataset contains question titles and their corresponding tags, making it perfect for multi-class text classification tasks. We can use Keras utilities to download and explore this dataset efficiently.
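Since text_dataset_from_directory (used later in this article) assigns integer labels to class folders in alphabetical order, the four tags map to class indices as in this minimal pure-Python illustration −

```python
# Illustration only: how the four tag folders map to integer class labels,
# mirroring text_dataset_from_directory's alphabetical ordering.
tags = ['python', 'java', 'csharp', 'javascript']
label_for = {tag: i for i, tag in enumerate(sorted(tags))}
print(label_for)  # {'csharp': 0, 'java': 1, 'javascript': 2, 'python': 3}
```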
Installation
First, install the required packages −
pip install tensorflow
pip install tensorflow-text
Downloading the Stack Overflow Dataset
Keras provides the utils.get_file() function to download datasets directly from URLs −
import pathlib
import tensorflow as tf
from tensorflow.keras import utils
print("Downloading Stack Overflow dataset...")
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
   'stack_overflow_16k.tar.gz',
   data_url,
   untar=True,
   cache_dir='stack_overflow',
   cache_subdir=''
)
dataset_dir = pathlib.Path(dataset).parent
print(f"Dataset downloaded to: {dataset_dir}")
Downloading Stack Overflow dataset...
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
6053888/6053168 [==============================] - 2s 0us/step
Dataset downloaded to: stack_overflow
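As a side note on where the files end up: with untar=True, get_file() returns the path of the extracted archive (the file name minus its .tar.gz suffix) under cache_dir/cache_subdir. A small pathlib sketch of the layout this tutorial assumes −

```python
from pathlib import Path

# Sketch of the assumed cache layout: the archive is saved under
# cache_dir/cache_subdir, and untar=True extracts it alongside, so
# get_file() effectively returns the archive path minus '.tar.gz'.
cache_dir = 'stack_overflow'
fname = 'stack_overflow_16k.tar.gz'
extracted = Path(cache_dir) / fname[:-len('.tar.gz')]
print(extracted)         # stack_overflow/stack_overflow_16k
print(extracted.parent)  # stack_overflow
```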
Exploring the Dataset Structure
Let's examine the downloaded dataset structure −
import pathlib
dataset_dir = pathlib.Path('stack_overflow/stack_overflow_16k')
# List all directories
print("Dataset structure:")
for item in dataset_dir.iterdir():
   if item.is_dir():
      print(f"Directory: {item.name}")
      # Count files in each directory (recursively - the .txt files sit inside tag subfolders)
      file_count = len(list(item.rglob('*.txt')))
      print(f"   Files: {file_count}")
Dataset structure:
Directory: train
   Files: 8000
Directory: test
   Files: 8000
Loading and Examining Sample Data
The dataset contains question titles organized by programming language tags −
import pathlib
dataset_dir = pathlib.Path('stack_overflow/stack_overflow_16k')
train_dir = dataset_dir / 'train'
# List available tags (subdirectories)
tags = [item.name for item in train_dir.iterdir() if item.is_dir()]
print(f"Available tags: {sorted(tags)}")
# Read a sample question from each tag
print("\nSample questions:")
for tag in sorted(tags)[:3]:  # Show first 3 tags
   tag_dir = train_dir / tag
   sample_file = list(tag_dir.glob('*.txt'))[0]
   with open(sample_file, 'r', encoding='utf-8') as f:
      question = f.read().strip()
   print(f"\nTag: {tag}")
   print(f"Question: {question[:100]}...")
Available tags: ['csharp', 'java', 'javascript', 'python']

Sample questions:

Tag: csharp
Question: How to add a reference to a type in another assembly/namespace in C#?...

Tag: java
Question: How can I convert a stack trace to a string?...

Tag: javascript
Question: How to check if a string contains a substring in JavaScript?...
Dataset Statistics
Let's analyze the distribution of questions across different tags −
import pathlib
dataset_dir = pathlib.Path('stack_overflow/stack_overflow_16k')
for split in ['train', 'test']:
   split_dir = dataset_dir / split
   print(f"\n{split.upper()} set statistics:")
   total_questions = 0
   for tag_dir in split_dir.iterdir():
      if tag_dir.is_dir():
         count = len(list(tag_dir.glob('*.txt')))
         total_questions += count
         print(f"   {tag_dir.name}: {count} questions")
   print(f"   Total: {total_questions} questions")
TRAIN set statistics:
   csharp: 2000 questions
   java: 2000 questions
   javascript: 2000 questions
   python: 2000 questions
   Total: 8000 questions

TEST set statistics:
   csharp: 2000 questions
   java: 2000 questions
   javascript: 2000 questions
   python: 2000 questions
   Total: 8000 questions
Creating TensorFlow Dataset
Convert the downloaded files into a TensorFlow dataset for training −
import tensorflow as tf
dataset_dir = 'stack_overflow/stack_overflow_16k'
batch_size = 32
# Create training dataset
train_ds = tf.keras.utils.text_dataset_from_directory(
   dataset_dir + '/train',
   batch_size=batch_size,
   validation_split=0.2,
   subset='training',
   seed=42
)
# Create validation dataset
val_ds = tf.keras.utils.text_dataset_from_directory(
   dataset_dir + '/train',
   batch_size=batch_size,
   validation_split=0.2,
   subset='validation',
   seed=42
)
print(f"Training batches: {tf.data.experimental.cardinality(train_ds)}")
print(f"Validation batches: {tf.data.experimental.cardinality(val_ds)}")
print(f"Class names: {train_ds.class_names}")
Found 8000 files belonging to 4 classes.
Using 6400 files for training.
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
Training batches: 200
Validation batches: 50
Class names: ['csharp', 'java', 'javascript', 'python']
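The batch counts reported above follow directly from the 80/20 split arithmetic; a quick sanity check in plain Python (no TensorFlow needed) −

```python
import math

# Sketch of where the batch counts come from, assuming a clean 80/20
# validation split of the 8,000 training files and batch_size = 32.
total_files = 8000
batch_size = 32
val_files = int(total_files * 0.2)     # 1600 files for validation
train_files = total_files - val_files  # 6400 files for training
train_batches = math.ceil(train_files / batch_size)
val_batches = math.ceil(val_files / batch_size)
print(train_batches, val_batches)  # 200 50
```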
Conclusion
Keras makes it simple to download and explore datasets: utils.get_file() handles downloading and extraction, while text_dataset_from_directory() turns the files into a TensorFlow dataset ready for training. The Stack Overflow dataset provides 16,000 balanced examples across four programming languages, making it well suited for text classification experiments.
