Working with Timezones¶
Proper timezone handling is critical when working with ICU data from multiple sources. This guide explains how CLIFpy manages timezones and best practices for your data.
Overview¶
CLIFpy ensures all datetime columns are timezone-aware to: - Prevent ambiguity in timestamp interpretation - Enable accurate time-based calculations - Support data from multiple time zones - Maintain consistency across tables
Timezone Specification¶
When Loading Data¶
Always specify the timezone when loading data:
# Specify source data timezone
table = TableClass.from_file(
data_directory='/data',
filetype='parquet',
timezone='US/Central' # Source data timezone
)
# Common US timezones
# 'US/Eastern', 'US/Central', 'US/Mountain', 'US/Pacific'
# 'America/New_York', 'America/Chicago', 'America/Denver', 'America/Los_Angeles'
Using Orchestrator¶
The orchestrator ensures consistent timezone across all tables:
orchestrator = ClifOrchestrator(
data_directory='/data',
filetype='parquet',
timezone='US/Central' # Applied to all tables
)
Timezone Conversion¶
During Loading¶
CLIFpy automatically converts datetime columns to the specified timezone:
# Original data in UTC
table = TableClass.from_file('/data', 'parquet', timezone='UTC')
# Convert to Central time during loading
table = TableClass.from_file('/data', 'parquet', timezone='US/Central')
After Loading¶
Convert between timezones using pandas:
# Convert to different timezone
df = table.df.copy()
df['lab_datetime'] = df['lab_datetime'].dt.tz_convert('US/Eastern')
# Localize timezone-naive data (not recommended)
# df['datetime'] = df['datetime'].dt.tz_localize('US/Central')
Common Timezone Issues¶
Issue 1: Timezone-Naive Data¶
Problem: Source data lacks timezone information
# This will fail validation
table.validate()
# Error: "Datetime column 'admission_date' is not timezone-aware"
Solution: Specify timezone during loading
# CLIFpy will localize to specified timezone
table = TableClass.from_file(
'/data',
'parquet',
timezone='US/Central' # Assumes data is in Central time
)
Issue 2: Mixed Timezones¶
Problem: Different tables from different timezones
# Hospital A in Eastern time
labs_a = Labs.from_file('/hospital_a/data', 'parquet', timezone='US/Eastern')
# Hospital B in Pacific time
labs_b = Labs.from_file('/hospital_b/data', 'parquet', timezone='US/Pacific')
Solution: Convert to common timezone
# Convert both to UTC for analysis
labs_a.df['lab_datetime'] = labs_a.df['lab_datetime'].dt.tz_convert('UTC')
labs_b.df['lab_datetime'] = labs_b.df['lab_datetime'].dt.tz_convert('UTC')
# Combine datasets
combined_labs = pd.concat([labs_a.df, labs_b.df])
Issue 3: Daylight Saving Time¶
Problem: Ambiguous times during DST transitions
Solution: Use pytz-aware timezone names
# Good - handles DST automatically
table = TableClass.from_file('/data', 'parquet', timezone='US/Central')
# Avoid - doesn't handle DST
# table = TableClass.from_file('/data', 'parquet', timezone='CST6CDT')
Best Practices¶
1. Know Your Source Timezone¶
# Document source timezone
"""
Data extracted from Hospital EHR
Timezone: US/Central (America/Chicago)
Includes DST adjustments
"""
table = TableClass.from_file('/data', 'parquet', timezone='US/Central')
2. Use Consistent Timezones¶
# Use orchestrator for consistency
orchestrator = ClifOrchestrator('/data', 'parquet', timezone='US/Central')
orchestrator.initialize(tables=['labs', 'vitals', 'medications'])
# All tables now use same timezone
3. Validate Timezone Handling¶
# Check timezone after loading
print(f"Lab datetime timezone: {table.df['lab_datetime'].dt.tz}")
# Verify reasonable time ranges
print(f"Earliest: {table.df['lab_datetime'].min()}")
print(f"Latest: {table.df['lab_datetime'].max()}")
4. Document Timezone Conversions¶
# Keep audit trail of conversions
metadata = {
'original_timezone': 'US/Eastern',
'converted_timezone': 'UTC',
'conversion_date': datetime.now(),
'conversion_method': 'pandas dt.tz_convert'
}
Time-based Calculations¶
Duration Calculations¶
Timezone-aware datetimes ensure accurate duration calculations:
# Calculate length of stay
los = adt.df['out_dttm'] - adt.df['in_dttm']
los_hours = los.dt.total_seconds() / 3600
# Time since admission
current_time = pd.Timestamp.now(tz='US/Central')
time_since = current_time - hosp.df['admission_dttm']
Filtering by Time¶
# Get data from last 24 hours
cutoff = pd.Timestamp.now(tz='US/Central') - pd.Timedelta(hours=24)
recent = table.df[table.df['datetime_column'] >= cutoff]
# Filter to specific date (timezone-aware)
date = pd.Timestamp('2023-01-01', tz='US/Central')
day_data = table.df[table.df['datetime_column'].dt.date == date.date()]
Aggregating by Time¶
# Hourly aggregation
hourly = table.df.set_index('datetime_column').resample('H').mean()
# Daily aggregation (timezone affects day boundaries!)
daily = table.df.set_index('datetime_column').resample('D').count()
Multi-site Considerations¶
When combining data from multiple sites:
# Strategy 1: Convert all to UTC
sites = ['site_a', 'site_b', 'site_c']
site_timezones = {
'site_a': 'US/Eastern',
'site_b': 'US/Central',
'site_c': 'US/Pacific'
}
all_data = []
for site in sites:
table = Labs.from_file(f'/data/{site}', 'parquet',
timezone=site_timezones[site])
# Convert to UTC
table.df['lab_datetime'] = table.df['lab_datetime'].dt.tz_convert('UTC')
table.df['site'] = site
all_data.append(table.df)
combined = pd.concat(all_data)
# Strategy 2: Use site's local time with site column
# Keep original timezone but track source
for site in sites:
table = Labs.from_file(f'/data/{site}', 'parquet',
timezone=site_timezones[site])
table.df['site'] = site
table.df['source_timezone'] = site_timezones[site]
Timezone Reference¶
Common medical facility timezones:
US_TIMEZONES = {
'Eastern': 'US/Eastern', # NYC, Boston, Atlanta
'Central': 'US/Central', # Chicago, Houston, Dallas
'Mountain': 'US/Mountain', # Denver, Phoenix
'Pacific': 'US/Pacific', # LA, Seattle, San Francisco
'Alaska': 'US/Alaska', # Anchorage
'Hawaii': 'US/Hawaii' # Honolulu
}
# International
INTL_TIMEZONES = {
'London': 'Europe/London',
'Paris': 'Europe/Paris',
'Tokyo': 'Asia/Tokyo',
'Sydney': 'Australia/Sydney',
'Toronto': 'America/Toronto'
}
Troubleshooting¶
Check Current Timezone¶
# For a datetime column
print(table.df['datetime_column'].dt.tz)
# For a single timestamp
print(table.df['datetime_column'].iloc[0].tzinfo)
Fix Timezone Issues¶
# If validation fails due to timezone
if not table.isvalid():
tz_errors = [e for e in table.errors if 'timezone' in str(e)]
if tz_errors:
# Reload with proper timezone
table = TableClass.from_file('/data', 'parquet',
timezone='US/Central')
Next Steps¶
- Review validation guide for timezone validation
- See examples of timezone handling
- Learn about multi-site analysis