How to Utilize Time Series in Pandas?
Time series data represents observations recorded over time intervals and is crucial for analyzing trends, patterns, and temporal relationships. Pandas provides comprehensive functionality for working with time series data, from basic manipulation to advanced analysis and visualization.
Creating Sample Time Series Data
Let's start by creating sample time series data to demonstrate the concepts:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create sample time series data
dates = pd.date_range('2023-01-01', periods=100, freq='D')
values = np.random.randn(100).cumsum() + 100
data = pd.DataFrame({
    'value': values
}, index=dates)
print(data.head())
print(f"\nData shape: {data.shape}")
print(f"Index type: {type(data.index)}")
value
2023-01-01 99.496714
2023-01-02 98.358148
2023-01-03 99.706946
2023-01-04 98.302979
2023-01-05 99.146402
Data shape: (100, 1)
Index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
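The 'D' alias used above is only one of several frequency codes that pd.date_range accepts. As a quick sketch, here are a few other common aliases (the specific dates chosen are arbitrary examples):

```python
import pandas as pd

# A few common frequency aliases accepted by pd.date_range
hourly = pd.date_range('2023-01-01', periods=24, freq='h')    # every hour
business = pd.date_range('2023-01-02', periods=5, freq='B')   # business days (Mon-Fri)
weekly = pd.date_range('2023-01-01', periods=4, freq='W')     # week ends (Sundays)

print(len(hourly))          # 24
print(business[-1].date())  # 2023-01-06 (the fifth business day)
print(len(weekly))          # 4
```

Note that starting with pandas 2.2 some uppercase aliases (such as 'M' for month end) are deprecated in favor of new spellings like 'ME', so check the documentation for the version you are running.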
Converting Strings to DateTime Index
When working with time series data from files, you often need to convert date columns to proper datetime format:
# Create DataFrame with string dates
data_str = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'sales': [100, 120, 95, 110]
})
print("Before conversion:")
print(data_str.dtypes)
# Convert date column to datetime and set as index
data_str['date'] = pd.to_datetime(data_str['date'])
data_str.set_index('date', inplace=True)
print("\nAfter conversion:")
print(data_str.dtypes)
print(data_str.head())
Before conversion:
date object
sales int64
dtype: object
After conversion:
sales int64
dtype: object
sales
date
2023-01-01 100
2023-01-02 120
2023-01-03 95
2023-01-04 110
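When the data actually comes from a CSV file, read_csv can perform the same conversion at load time via its parse_dates and index_col parameters. A minimal sketch, using an in-memory string to stand in for a real file path:

```python
import io
import pandas as pd

# Simulate a CSV file in memory; in practice you would pass a file path
csv_text = "date,sales\n2023-01-01,100\n2023-01-02,120\n2023-01-03,95\n"

# parse_dates converts the column to datetime; index_col makes it the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')

print(df.index.dtype)                 # datetime64[ns]
print(df.loc['2023-01-02', 'sales'])  # 120
```

This avoids the separate to_datetime and set_index steps shown above.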
Indexing and Slicing Time Series
Pandas provides intuitive ways to select data based on time periods:
# Create larger sample dataset
dates = pd.date_range('2022-01-01', '2023-12-31', freq='D')
ts_data = pd.DataFrame({
    'value': np.random.randn(len(dates)).cumsum() + 100
}, index=dates)
# Select data for specific year
year_2023 = ts_data.loc['2023']  # partial-string selection via [] was removed in pandas 2.0; use .loc
print(f"Year 2023 data points: {len(year_2023)}")
# Select specific date range
jan_2023 = ts_data['2023-01-01':'2023-01-31']
print(f"January 2023 data points: {len(jan_2023)}")
# Select by month across all years
march_data = ts_data[ts_data.index.month == 3]
print(f"All March data points: {len(march_data)}")
print("\nFirst few rows of January 2023:")
print(jan_2023.head())
Year 2023 data points: 365
January 2023 data points: 31
All March data points: 62
First few rows of January 2023:
value
2023-01-01 98.875932
2023-01-02 99.543839
2023-01-03 99.178784
2023-01-04 99.177369
2023-01-05 98.871758
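Beyond .month, the DatetimeIndex exposes other calendar attributes (dayofweek, quarter, year, and so on) that can be combined with boolean masks in the same way. A short sketch over one year of daily data:

```python
import pandas as pd
import numpy as np

dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
ts = pd.Series(np.arange(len(dates)), index=dates)

# Calendar attributes of the index support boolean filtering
mondays = ts[ts.index.dayofweek == 0]  # Monday is 0, Sunday is 6
q2 = ts[ts.index.quarter == 2]         # April through June

print(len(mondays))  # 52 Mondays in 2023
print(len(q2))       # 91 days in Q2 (30 + 31 + 30)
```
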
Handling Missing Values in Time Series
Time series often have missing values that need proper handling:
# Create time series with missing values
dates = pd.date_range('2023-01-01', periods=10, freq='D')
values_with_nan = [10, np.nan, 12, 13, np.nan, np.nan, 16, 17, np.nan, 19]
ts_missing = pd.DataFrame({
    'value': values_with_nan
}, index=dates)
print("Original data with missing values:")
print(ts_missing)
# Forward fill
ts_ffill = ts_missing.copy()
ts_ffill['value'] = ts_ffill['value'].ffill()
print("\nAfter forward fill:")
print(ts_ffill)
# Interpolation
ts_interp = ts_missing.copy()
ts_interp['value'] = ts_interp['value'].interpolate()
print("\nAfter interpolation:")
print(ts_interp)
Original data with missing values:
value
2023-01-01 10.0
2023-01-02 NaN
2023-01-03 12.0
2023-01-04 13.0
2023-01-05 NaN
2023-01-06 NaN
2023-01-07 16.0
2023-01-08 17.0
2023-01-09 NaN
2023-01-10 19.0
After forward fill:
value
2023-01-01 10.0
2023-01-02 10.0
2023-01-03 12.0
2023-01-04 13.0
2023-01-05 13.0
2023-01-06 13.0
2023-01-07 16.0
2023-01-08 17.0
2023-01-09 17.0
2023-01-10 19.0
After interpolation:
value
2023-01-01 10.0
2023-01-02 11.0
2023-01-03 12.0
2023-01-04 13.0
2023-01-05 14.0
2023-01-06 15.0
2023-01-07 16.0
2023-01-08 17.0
2023-01-09 18.0
2023-01-10 19.0
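The default interpolation above is linear in row position, which is fine for evenly spaced data. When observations are irregularly spaced, interpolate(method='time') weights by the actual elapsed time between timestamps instead. A small sketch with an uneven index:

```python
import pandas as pd
import numpy as np

# Irregularly spaced observations: a one-day gap, then a three-day gap
idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05'])
s = pd.Series([10.0, np.nan, 19.0], index=idx)

# 'time' interpolation weights by elapsed time, not row position:
# Jan 2 is 1/4 of the way from Jan 1 to Jan 5, so 10 + 9 * 0.25 = 12.25
filled = s.interpolate(method='time')
print(filled)
```

The default method='linear' would instead place the midpoint value 14.5 at Jan 2, ignoring the uneven gaps.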
Resampling Time Series Data
Resampling allows you to change the frequency of your time series data:
# Create daily data
daily_dates = pd.date_range('2023-01-01', periods=30, freq='D')
daily_data = pd.DataFrame({
    'sales': np.random.randint(50, 150, 30)
}, index=daily_dates)
print("Daily data (first 7 days):")
print(daily_data.head(7))
# Resample to weekly frequency
weekly_sum = daily_data.resample('W').sum()
weekly_mean = daily_data.resample('W').mean()
print("\nWeekly sum:")
print(weekly_sum)
print("\nWeekly mean:")
print(weekly_mean.round(2))
# Resample to monthly frequency
monthly_stats = daily_data.resample('M').agg({
    'sales': ['sum', 'mean', 'max', 'min']
})
print("\nMonthly statistics:")
print(monthly_stats.round(2))
Daily data (first 7 days):
sales
2023-01-01 106
2023-01-02 71
2023-01-03 89
2023-01-04 73
2023-01-05 146
2023-01-06 50
2023-01-07 88
Weekly sum:
sales
2023-01-08 623
2023-01-15 722
2023-01-22 685
2023-01-29 659
2023-02-05 88
Weekly mean:
sales
2023-01-08 89.00
2023-01-15 103.14
2023-01-22 97.86
2023-01-29 94.14
2023-02-05 88.00
Monthly statistics:
sales
sum mean max min
2023-01-31 2777 89.6 149 50
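Resampling also works in the other direction: upsampling to a higher frequency creates new rows in the gaps, which then need to be filled. A minimal sketch going from weekly to daily observations:

```python
import pandas as pd

# Two weekly observations, one week apart
weekly = pd.Series([10, 20], index=pd.to_datetime(['2023-01-01', '2023-01-08']))

# Upsampling to daily frequency inserts rows for the missing days;
# ffill carries the last known value forward into the gap
daily = weekly.resample('D').ffill()
print(daily)  # 8 daily rows: Jan 1-7 hold 10, Jan 8 holds 20
```

Using .asfreq() instead of .ffill() would leave the new rows as NaN, which can then be filled with interpolate() or bfill() as appropriate.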
Basic Time Series Analysis
Perform basic statistical analysis on time series data:
# Create sample time series
dates = pd.date_range('2023-01-01', periods=365, freq='D')
trend = np.linspace(100, 120, 365)
seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 365.25 * 4)
noise = np.random.normal(0, 2, 365)
ts_values = trend + seasonal + noise
ts_analysis = pd.DataFrame({
    'value': ts_values
}, index=dates)
# Calculate rolling statistics
ts_analysis['rolling_mean_7'] = ts_analysis['value'].rolling(window=7).mean()
ts_analysis['rolling_std_7'] = ts_analysis['value'].rolling(window=7).std()
print("Time series with rolling statistics (first 10 rows):")
print(ts_analysis.head(10).round(2))
# Calculate monthly aggregates
monthly_agg = ts_analysis.resample('M').agg({
    'value': ['mean', 'std', 'min', 'max']
}).round(2)
print("\nMonthly aggregates:")
print(monthly_agg.head())
Time series with rolling statistics (first 10 rows):
value rolling_mean_7 rolling_std_7
2023-01-01 102.48 NaN NaN
2023-01-02 95.61 NaN NaN
2023-01-03 96.70 NaN NaN
2023-01-04 102.58 NaN NaN
2023-01-05 96.45 NaN NaN
2023-01-06 101.59 NaN NaN
2023-01-07 99.14 99.22 3.11
2023-01-08 98.53 98.67 2.49
2023-01-09 95.42 98.54 2.83
2023-01-10 99.33 99.04 2.38
Monthly aggregates:
value
mean std min max
2023-01-31 99.71 4.04 88.92 108.59
2023-02-28 99.30 3.86 91.14 108.12
2023-03-31 101.42 4.21 91.05 111.76
2023-04-30 107.49 3.83 97.85 115.48
2023-05-31 110.15 3.67 100.43 118.73
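Rolling statistics are often paired with lag-based features. shift() looks back a fixed number of periods, while diff() and pct_change() compute period-over-period changes; a short sketch on a small price series (the numbers are illustrative):

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 101.0],
                   index=pd.date_range('2023-01-01', periods=4, freq='D'))

prev = prices.shift(1)         # value from the previous day (NaN on day one)
change = prices.diff()         # day-over-day difference
returns = prices.pct_change()  # day-over-day fractional change

print(change.iloc[1])   # 2.0  (102 - 100)
print(returns.iloc[2])  # about -0.0294  ((99 - 102) / 102)
```

These derived columns are common building blocks for returns analysis and for feature engineering on temporal data.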
Comparison of Time Series Methods
| Operation | Method | Use Case | Result |
|---|---|---|---|
| Missing Values | ffill() | Forward fill missing values | Uses last valid observation |
| Missing Values | interpolate() | Estimate missing values | Linear interpolation between points |
| Frequency Change | resample('W') | Convert to weekly data | Aggregated weekly values |
| Frequency Change | resample('M') | Convert to monthly data | Aggregated monthly values |
Conclusion
Pandas provides powerful tools for time series analysis, from basic indexing and slicing to advanced resampling and statistical operations. These capabilities make it easy to analyze temporal data, handle missing values, and extract meaningful insights from time-based datasets.
