# Basic Usage

This guide covers the fundamental patterns for working with CLIFpy.
## Core Concepts

### Table Classes

Each CLIF table is represented by a Python class that inherits from `BaseTable`:

- `Patient` - Demographics and patient identification
- `Adt` - Admission, discharge, and transfer events
- `Hospitalization` - Hospital stay information
- `Labs` - Laboratory test results
- `Vitals` - Vital signs measurements
- `RespiratorySupport` - Ventilation and oxygen therapy
- `MedicationAdminContinuous` - Continuous infusions
- `PatientAssessments` - Clinical assessment scores
- `Position` - Patient positioning
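Because every table class inherits from `BaseTable`, they all share one interface: a `df` attribute, `validate()`, `isvalid()`, and an `errors` list. A toy sketch of that pattern, for orientation only (this is not CLIFpy's actual implementation):

```python
import pandas as pd

class BaseTableSketch:
    """Illustrative stand-in for the shared BaseTable interface."""

    def __init__(self, data: pd.DataFrame):
        self.df = data            # underlying DataFrame
        self.errors: list = []    # populated by validate()

    def validate(self) -> None:
        # Real CLIFpy tables check required columns, dtypes,
        # categories, and more; this sketch checks one column.
        self.errors = []
        if 'patient_id' not in self.df.columns:
            self.errors.append({
                'type': 'missing_column',
                'message': "Required column 'patient_id' not found",
            })

    def isvalid(self) -> bool:
        return not self.errors

table = BaseTableSketch(pd.DataFrame({'patient_id': ['P001']}))
table.validate()
print(table.isvalid())  # True: the required column is present
```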
### Data Loading

All tables support two loading methods:

```python
# Method 1: From files
table = TableClass.from_file(
    data_directory='/path/to/data',
    filetype='parquet',  # or 'csv'
    timezone='US/Central'
)

# Method 2: From an existing DataFrame
table = TableClass(
    data=existing_dataframe,
    timezone='US/Central'
)
```
### Validation

Every table includes built-in validation:

```python
# Run validation
table.validate()

# Check whether it passed
if table.isvalid():
    print("Validation passed!")
else:
    # Review the first few errors
    for error in table.errors[:5]:
        print(f"{error['type']}: {error['message']}")
```
## Working with DataFrames

All table data is accessible via the `df` attribute:

```python
# Access the underlying DataFrame
df = table.df

# Use standard pandas operations
print(df.shape)
print(df.columns.tolist())
print(df.dtypes)

# Filter data
filtered = df[df['some_column'] > threshold]
```
## Common Operations

### Date Range Filtering

Most tables with datetime columns support date range filtering:

```python
from datetime import datetime

# Filter by date range
start = datetime(2023, 1, 1)
end = datetime(2023, 12, 31)

# For tables with custom methods
filtered = table.filter_by_date_range(start, end)

# Or using pandas directly
mask = (df['datetime_column'] >= start) & (df['datetime_column'] <= end)
filtered = df[mask]
```
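One pitfall with the raw pandas approach: CLIFpy's datetime columns are timezone-aware, and pandas will not compare them against naive `datetime` objects. A minimal sketch using timezone-aware `pd.Timestamp` bounds on a toy column (the column contents here are made up):

```python
import pandas as pd

# Toy timezone-aware datetime column, standing in for table.df
df = pd.DataFrame({
    'datetime_column': pd.to_datetime(
        ['2022-12-15', '2023-03-01', '2024-02-01']
    ).tz_localize('US/Central'),
    'value': [1, 2, 3],
})

# Timezone-aware bounds; naive datetimes would raise a TypeError here
start = pd.Timestamp('2023-01-01', tz='US/Central')
end = pd.Timestamp('2023-12-31', tz='US/Central')

mask = (df['datetime_column'] >= start) & (df['datetime_column'] <= end)
in_range = df[mask]
print(in_range['value'].tolist())  # [2]: only the 2023 row falls in range
```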
### Category Filtering

Tables with standardized categories provide filtering methods:

```python
# Labs by category
chemistry = labs.filter_by_category('chemistry')
hematology = labs.filter_by_category('hematology')

# ADT by location
icu_stays = adt.filter_by_location_category('icu')
ed_visits = adt.filter_by_location_category('ed')

# Medications by group
vasopressors = meds.filter_by_med_group('vasopressor')
sedatives = meds.filter_by_med_group('sedative')
```
### Patient-specific Data

```python
# Single patient
patient_id = 'P12345'
patient_labs = labs.df[labs.df['patient_id'] == patient_id]

# Multiple patients
patient_ids = ['P001', 'P002', 'P003']
cohort_data = vitals.df[vitals.df['patient_id'].isin(patient_ids)]
```
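Once a cohort frame is extracted this way, ordinary pandas grouping applies to it. A small sketch on made-up vitals-style data (the column names follow the examples above; the values are invented):

```python
import pandas as pd

# Toy frame shaped like vitals.df in the example above
vitals_df = pd.DataFrame({
    'patient_id': ['P001', 'P001', 'P002', 'P003', 'P004'],
    'value': [72.0, 75.0, 80.0, 65.0, 90.0],
})

patient_ids = ['P001', 'P002', 'P003']
cohort_data = vitals_df[vitals_df['patient_id'].isin(patient_ids)]

# Records per patient and per-patient means for the cohort
counts = cohort_data['patient_id'].value_counts()
means = cohort_data.groupby('patient_id')['value'].mean()
print(counts['P001'], means['P001'])  # 2 73.5
```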
## Output and Reporting

### Summary Statistics

```python
# Get table summary
summary = table.get_summary()
print(f"Rows: {summary['num_rows']}")
print(f"Columns: {summary['num_columns']}")
print(f"Memory usage: {summary['memory_usage_mb']:.2f} MB")

# Save the summary to a file
table.save_summary()
```
### Validation Reports

Validation results are automatically saved to the output directory:

```python
# Set a custom output directory
table = TableClass.from_file(
    data_directory='/path/to/data',
    filetype='parquet',
    output_directory='/path/to/reports'
)

# After validation, check the output files:
# - validation_log_[table_name].log
# - validation_errors_[table_name].csv
# - missing_data_stats_[table_name].csv
```
## Timezone Handling

CLIFpy ensures consistent timezone handling:

```python
# Specify the timezone when loading
table = TableClass.from_file(
    data_directory='/path/to/data',
    filetype='parquet',
    timezone='US/Central'  # all datetime columns are converted to this timezone
)

# Datetime columns are timezone-aware
print(table.df['datetime_column'].dt.tz)
```
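If downstream analysis needs a different timezone, pandas can convert a timezone-aware column after loading without changing the underlying instants. A minimal sketch on a toy Series (not a CLIFpy call):

```python
import pandas as pd

# Toy tz-aware column, like what CLIFpy produces after loading
ts = pd.Series(pd.to_datetime(['2023-06-01 10:00'])).dt.tz_localize('US/Central')
print(str(ts.dt.tz))  # US/Central

# Convert the display timezone; the instant in time is unchanged
ts_utc = ts.dt.tz_convert('UTC')
print(ts_utc.iloc[0].hour)  # 15 (10:00 CDT is 15:00 UTC)
```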
## Memory Management

For large datasets:

```python
# Load only specific columns
table = TableClass.from_file(
    data_directory='/path/to/data',
    filetype='parquet',
    columns=['patient_id', 'datetime', 'value']
)

# Load a sample of rows
table = TableClass.from_file(
    data_directory='/path/to/data',
    filetype='parquet',
    sample_size=10000
)

# Apply filters during loading
table = TableClass.from_file(
    data_directory='/path/to/data',
    filetype='parquet',
    filters={'patient_id': patient_list}
)
```
## Error Handling

```python
import pandas as pd

try:
    table = TableClass.from_file('/path/to/data', 'parquet')
    table.validate()
    if not table.isvalid():
        # Persist validation errors for review
        error_df = pd.DataFrame(table.errors)
        error_df.to_csv('validation_errors.csv', index=False)
except FileNotFoundError:
    print("Data files not found")
except Exception as e:
    print(f"Error loading data: {e}")
```
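Since each entry in `table.errors` is a dict with at least `type` and `message` keys (as the validation example earlier shows), the saved CSV can be triaged with plain pandas. A sketch on hand-written error records (the specific types and messages here are invented):

```python
import pandas as pd

# Hand-written error records shaped like table.errors entries
errors = [
    {'type': 'missing_column', 'message': "Required column 'patient_id' not found"},
    {'type': 'datatype_mismatch', 'message': "Column 'value' should be numeric"},
    {'type': 'datatype_mismatch', 'message': "Column 'lab_order_dttm' should be datetime"},
]

error_df = pd.DataFrame(errors)

# Count errors by type to prioritize fixes
by_type = error_df['type'].value_counts()
print(by_type.to_dict())  # {'datatype_mismatch': 2, 'missing_column': 1}
```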