Best Practices
A practical guide to building CLIF projects — from cohort identification to optimized analysis.
Author: Kaveri Chhikara
Table of contents
Code Workflow
Step 1: Load Core Tables to Identify the Cohort
Output: A list of hospitalization_ids
- Start with
hospitalization— filter by age, dates, or other static criteria - Join with
adt— sanity check that hospitalizations have location data - Join with
patient— get demographics and other static info - Extract the list of
hospitalization_ids — this is your starting cohort - Optional: Stitch encounters — use
stitching_encounterslogic or thehospitalization_joined_idcolumn to identify linked hospitalizations (available in clifpy)
Step 2: Refine the Cohort
Output: A filtered list of hospitalization_ids meeting your inclusion/exclusion criteria
Example cohort definitions:
| Criteria | Approach |
|---|---|
| ICU stay ≥24 hours | Load adt for Step 1 cohort → calculate ICU duration → filter |
| On IMV at least once | Load respiratory_support → filter device_category == "IMV" → get unique IDs |
| Exclude trach patients | Load respiratory_support → identify tracheostomy == True → exclude from cohort |
Step 3: Load Required Tables for Final Cohort
Once the cohort is finalized:
- Use clifpy’s
load_data()function to filter CLIF tables to your cohort - Specify only the required fields and mCIDE categories
- This keeps memory usage manageable
Step 4: Build Patient Trajectories
For time-series data like respiratory support or CRRT:
- Use
waterfall()for respiratory support or CRRT tables - Apply appropriate filling logic to create complete event-based patient trajectories
Step 5: Optimization Tips
| Tip | Why |
|---|---|
| Use vectorized operations | Loops are slow; Polars/pandas vectorized ops are fast |
| Use efficient dtypes | Int8 for flags/dummies instead of Int64 saves memory |
Load *_category as lowercase | Consistent casing prevents matching bugs |
Never use *_name fields | Always use *_category — these are harmonized and standardized in the CLIF schema |
| Use try/except blocks | Handle errors gracefully across sites with different data quirks |
Data Security
Never Commit Patient Data
# Your .gitignore should ALWAYS include:
data/
*.csv
*.parquet
Only Share Aggregates
# ❌ Bad - sharing patient-level data
results = df.select("patient_id", "outcome")
# ✅ Good - sharing only aggregates
results = df.group_by("site").agg([
pl.count().alias("n"),
pl.col("outcome").mean().alias("mortality_rate")
])
Minimum Cell Sizes
For any summary statistics, ensure minimum cell sizes (typically n ≥ 10) to prevent re-identification.
Common Errors
| Error | Cause | Fix |
|---|---|---|
| “Column not found” | Wrong column name | Check the CLIF data dictionary |
| DateTime parsing errors | Column not parsed as datetime | Use pl.col("dttm").str.to_datetime() |
| Memory errors | Loading too much data | Use lazy evaluation: pl.scan_parquet() then .collect() |
| Validation failures | Categories don’t match mCIDE | Check df["category"].unique() against mCIDE |
Getting Help
- #clif-code-ecosystem — Slack channel for coding questions
- #clifpy — Slack channel for clifpy-specific issues
- GitHub Issues — For bugs in clifpy, project repos
- Weekly CLIF Calls — Thursdays 2-3 PM CT
Have improvements? PR to this repo or let us know in #clif-code-ecosystem!