Skip to content

Working with Timezones

Proper timezone handling is critical when working with ICU data from multiple sources. This guide explains how CLIFpy manages timezones and best practices for your data.

Overview

CLIFpy ensures all datetime columns are timezone-aware to: - Prevent ambiguity in timestamp interpretation - Enable accurate time-based calculations - Support data from multiple time zones - Maintain consistency across tables

Timezone Specification

When Loading Data

Always specify the timezone when loading data:

# Specify source data timezone
table = TableClass.from_file(
    data_directory='/data',
    filetype='parquet',
    timezone='US/Central'  # Source data timezone
)

# Common US timezones
# 'US/Eastern', 'US/Central', 'US/Mountain', 'US/Pacific'
# 'America/New_York', 'America/Chicago', 'America/Denver', 'America/Los_Angeles'

Using Orchestrator

The orchestrator ensures consistent timezone across all tables:

orchestrator = ClifOrchestrator(
    data_directory='/data',
    filetype='parquet',
    timezone='US/Central'  # Applied to all tables
)

Timezone Conversion

During Loading

CLIFpy automatically converts datetime columns to the specified timezone:

# Original data in UTC
table = TableClass.from_file('/data', 'parquet', timezone='UTC')

# Convert to Central time during loading
table = TableClass.from_file('/data', 'parquet', timezone='US/Central')

After Loading

Convert between timezones using pandas:

# Convert to different timezone
df = table.df.copy()
df['lab_datetime'] = df['lab_datetime'].dt.tz_convert('US/Eastern')

# Localize timezone-naive data (not recommended)
# df['datetime'] = df['datetime'].dt.tz_localize('US/Central')

Common Timezone Issues

Issue 1: Timezone-Naive Data

Problem: Source data lacks timezone information

# This will fail validation
table.validate()
# Error: "Datetime column 'admission_date' is not timezone-aware"

Solution: Specify timezone during loading

# CLIFpy will localize to specified timezone
table = TableClass.from_file(
    '/data', 
    'parquet', 
    timezone='US/Central'  # Assumes data is in Central time
)

Issue 2: Mixed Timezones

Problem: Different tables from different timezones

# Hospital A in Eastern time
labs_a = Labs.from_file('/hospital_a/data', 'parquet', timezone='US/Eastern')

# Hospital B in Pacific time  
labs_b = Labs.from_file('/hospital_b/data', 'parquet', timezone='US/Pacific')

Solution: Convert to common timezone

# Convert both to UTC for analysis
labs_a.df['lab_datetime'] = labs_a.df['lab_datetime'].dt.tz_convert('UTC')
labs_b.df['lab_datetime'] = labs_b.df['lab_datetime'].dt.tz_convert('UTC')

# Combine datasets
combined_labs = pd.concat([labs_a.df, labs_b.df])

Issue 3: Daylight Saving Time

Problem: Ambiguous times during DST transitions

# Fall back: 2:00 AM occurs twice
# Spring forward: 2:00 AM doesn't exist

Solution: Use pytz-aware timezone names

# Good - handles DST automatically
table = TableClass.from_file('/data', 'parquet', timezone='US/Central')

# Avoid - doesn't handle DST
# table = TableClass.from_file('/data', 'parquet', timezone='CST6CDT')

Best Practices

1. Know Your Source Timezone

# Document source timezone
"""
Data extracted from Hospital EHR
Timezone: US/Central (America/Chicago)
Includes DST adjustments
"""
table = TableClass.from_file('/data', 'parquet', timezone='US/Central')

2. Use Consistent Timezones

# Use orchestrator for consistency
orchestrator = ClifOrchestrator('/data', 'parquet', timezone='US/Central')
orchestrator.initialize(tables=['labs', 'vitals', 'medications'])

# All tables now use same timezone

3. Validate Timezone Handling

# Check timezone after loading
print(f"Lab datetime timezone: {table.df['lab_datetime'].dt.tz}")

# Verify reasonable time ranges
print(f"Earliest: {table.df['lab_datetime'].min()}")
print(f"Latest: {table.df['lab_datetime'].max()}")

4. Document Timezone Conversions

# Keep audit trail of conversions
metadata = {
    'original_timezone': 'US/Eastern',
    'converted_timezone': 'UTC',
    'conversion_date': datetime.now(),
    'conversion_method': 'pandas dt.tz_convert'
}

Time-based Calculations

Duration Calculations

Timezone-aware datetimes ensure accurate duration calculations:

# Calculate length of stay
los = adt.df['out_dttm'] - adt.df['in_dttm']
los_hours = los.dt.total_seconds() / 3600

# Time since admission
current_time = pd.Timestamp.now(tz='US/Central')
time_since = current_time - hosp.df['admission_dttm']

Filtering by Time

# Get data from last 24 hours
cutoff = pd.Timestamp.now(tz='US/Central') - pd.Timedelta(hours=24)
recent = table.df[table.df['datetime_column'] >= cutoff]

# Filter to specific date (timezone-aware)
date = pd.Timestamp('2023-01-01', tz='US/Central')
day_data = table.df[table.df['datetime_column'].dt.date == date.date()]

Aggregating by Time

# Hourly aggregation
hourly = table.df.set_index('datetime_column').resample('H').mean()

# Daily aggregation (timezone affects day boundaries!)
daily = table.df.set_index('datetime_column').resample('D').count()

Multi-site Considerations

When combining data from multiple sites:

# Strategy 1: Convert all to UTC
sites = ['site_a', 'site_b', 'site_c']
site_timezones = {
    'site_a': 'US/Eastern',
    'site_b': 'US/Central', 
    'site_c': 'US/Pacific'
}

all_data = []
for site in sites:
    table = Labs.from_file(f'/data/{site}', 'parquet', 
                          timezone=site_timezones[site])
    # Convert to UTC
    table.df['lab_datetime'] = table.df['lab_datetime'].dt.tz_convert('UTC')
    table.df['site'] = site
    all_data.append(table.df)

combined = pd.concat(all_data)
# Strategy 2: Use site's local time with site column
# Keep original timezone but track source
for site in sites:
    table = Labs.from_file(f'/data/{site}', 'parquet',
                          timezone=site_timezones[site])
    table.df['site'] = site
    table.df['source_timezone'] = site_timezones[site]

Timezone Reference

Common medical facility timezones:

US_TIMEZONES = {
    'Eastern': 'US/Eastern',     # NYC, Boston, Atlanta
    'Central': 'US/Central',     # Chicago, Houston, Dallas
    'Mountain': 'US/Mountain',   # Denver, Phoenix
    'Pacific': 'US/Pacific',     # LA, Seattle, San Francisco
    'Alaska': 'US/Alaska',       # Anchorage
    'Hawaii': 'US/Hawaii'        # Honolulu
}

# International
INTL_TIMEZONES = {
    'London': 'Europe/London',
    'Paris': 'Europe/Paris',
    'Tokyo': 'Asia/Tokyo',
    'Sydney': 'Australia/Sydney',
    'Toronto': 'America/Toronto'
}

Troubleshooting

Check Current Timezone

# For a datetime column
print(table.df['datetime_column'].dt.tz)

# For a single timestamp
print(table.df['datetime_column'].iloc[0].tzinfo)

Fix Timezone Issues

# If validation fails due to timezone
if not table.isvalid():
    tz_errors = [e for e in table.errors if 'timezone' in str(e)]
    if tz_errors:
        # Reload with proper timezone
        table = TableClass.from_file('/data', 'parquet', 
                                   timezone='US/Central')

Next Steps