Loading CLIF Data

Learn how to efficiently load and explore CLIF tables using clifpy.

The ClifOrchestrator

The ClifOrchestrator is your central hub for working with CLIF data. It handles:

Loading all 18 CLIF tables with consistent configuration
Schema validation against mCIDE vocabularies
Encounter stitching (linking related ICU stays)
Wide dataset creation
Clinical score computation (SOFA, comorbidities)

Basic Setup

from clifpy import ClifOrchestrator

# Point to your CLIF data directory
clif = ClifOrchestrator(
    data_dir="path/to/clif/data",
    file_format="parquet"  # or "csv"
)

# Load specific tables
clif.load_tables(["patient", "hospitalization", "vitals", "labs", "adt"])

# Access loaded data
patients = clif.patient
vitals = clif.vitals

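clifpy uses DuckDB and Polars under the hood. Both engines can push filters and column selection down to the file scan, so queries over Parquet files can work on datasets larger than available RAM, provided the final result fits in memory.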

CLIF Table Overview

CLIF organizes ICU data into logical categories:

Demographics

Location & Movement

| Table | Description |
|:------|:------------|
| adt | Admission-Discharge-Transfer events with location details |

Clinical Data

Medications

Specialized

Loading Individual Tables

For more control, load tables individually:

from clifpy import VitalsTable, LabsTable

# Load vitals with validation
vitals = VitalsTable(
    data_dir="path/to/clif/data",
    validate=True  # Check against mCIDE vocabularies
)
vitals.load()

# Explore the data
print(f"Records: {len(vitals.df):,}")
print(f"Unique encounters: {vitals.df['hospitalization_id'].n_unique():,}")
print(f"Date range: {vitals.df['recorded_dttm'].min()} to {vitals.df['recorded_dttm'].max()}")

Filtering Data

By Time Window

import polars as pl

# Filter to 2024 data
vitals_2024 = vitals.df.filter(
    pl.col("recorded_dttm").dt.year() == 2024
)

By Encounter

# Get vitals for specific encounters
encounter_ids = ["ENC001", "ENC002", "ENC003"]
subset = vitals.df.filter(
    pl.col("hospitalization_id").is_in(encounter_ids)
)

By Vital Category

# Get only heart rate and blood pressure
hr_bp = vitals.df.filter(
    pl.col("vital_category").is_in(["heart_rate", "sbp", "dbp"])
)

Handling Large Datasets

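clifpy is designed for large datasets. The tips below use Polars' lazy API directly on the Parquet files, so filters and aggregations are planned before any data is materialized in memory.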

Lazy Loading

# Use lazy evaluation for large files
vitals_lazy = pl.scan_parquet("path/to/vitals.parquet")

# Apply filters before collecting
filtered = (
    vitals_lazy
    .filter(pl.col("vital_category") == "heart_rate")
    .filter(pl.col("vital_value").is_between(40, 200))
    .collect()  # Only now does it execute
)

Memory-Efficient Aggregation

# Aggregate lazily, reusing vitals_lazy from the previous example;
# only the aggregated result is materialized by collect()
hourly_means = (
    vitals_lazy
    .group_by([
        "hospitalization_id",
        pl.col("recorded_dttm").dt.truncate("1h").alias("hour"),
        "vital_category"
    ])
    .agg(pl.col("vital_value").mean().alias("mean_value"))
    .collect()
)

Next Steps

Now that you can load data, learn how to:

Validate your data against CLIF schemas
Perform common analyses with CLIF
Use the project template for reproducible research