SOFA Score Computation¶
Compute Sequential Organ Failure Assessment (SOFA) scores from CLIF data.
Quick Start¶
from clifpy.clif_orchestrator import ClifOrchestrator
co = ClifOrchestrator(config_path='config/config.yaml')
sofa_scores = co.compute_sofa_scores()
Parameters¶
- wide_df: Optional pre-computed wide dataset
- cohort_df: Optional time windows for filtering
- id_name: Grouping column (default: 'encounter_block')
- extremal_type: 'worst' (default) or 'latest' (future)
- fill_na_scores_with_zero: Handle missing data (default: True)
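For example, to score each hospitalization separately and leave missing components as NaN rather than zero, an illustrative call using only the parameters listed above might look like this:
# Worst-over-stay scores per hospitalization, keeping missing components as NaN
sofa_by_hosp = co.compute_sofa_scores(
    id_name='hospitalization_id',
    extremal_type='worst',
    fill_na_scores_with_zero=False
)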
Encounter Block vs Hospitalization ID¶
By default, SOFA scores are computed per encounter_block, which groups related hospitalizations:
# Initialize with encounter stitching
co = ClifOrchestrator(
config_path='config/config.yaml',
stitch_encounter=True,
stitch_time_interval=6 # hours between admissions
)
# Default: scores per encounter block (may span multiple hospitalizations)
sofa_by_encounter = co.compute_sofa_scores() # uses encounter_block
# Alternative: scores per individual hospitalization
sofa_by_hosp = co.compute_sofa_scores(id_name='hospitalization_id')
What happens when using encounter_block:
- If the encounter mapping doesn't exist, it's created automatically via run_stitch_encounters()
- Multiple hospitalizations within the time interval are grouped as one encounter
- The SOFA score represents the worst values across the entire encounter
- The result has one row per encounter_block instead of per hospitalization
Example encounter mapping:
hospitalization_id | encounter_block
-------------------|----------------
12345 | E001
12346 | E001 # Same encounter (readmit < 6 hours)
12347 | E002 # Different encounter
Required Data¶
SOFA requires these variables:
- Labs: creatinine, platelet_count, po2_arterial, bilirubin_total
- Vitals: map, spo2
- Assessments: gcs_total
- Medications: norepinephrine, epinephrine, dopamine, dobutamine (pre-converted to mcg/kg/min)
- Respiratory: device_category, fio2_set
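These requirements can be kept in a plain mapping for pre-flight checks on your data. The dictionary and helper below are an illustrative sketch, not part of the CLIFpy API:
# Illustrative only: the category values SOFA needs, per data domain
REQUIRED_SOFA_VARIABLES = {
    'labs': ['creatinine', 'platelet_count', 'po2_arterial', 'bilirubin_total'],
    'vitals': ['map', 'spo2'],
    'assessments': ['gcs_total'],
    'medications': ['norepinephrine', 'epinephrine', 'dopamine', 'dobutamine'],
    'respiratory': ['device_category', 'fio2_set'],
}

def missing_categories(observed, domain):
    """Return the required categories for a domain that are absent from `observed`."""
    return sorted(set(REQUIRED_SOFA_VARIABLES[domain]) - set(observed))

# e.g. missing_categories(labs_df['lab_category'].unique(), 'labs')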
Missing Data¶
- Missing values default to score of 0
- P/F ratio uses PaO2 or imputed from SpO2
- Medications must be pre-converted to standard units
Example with Time Filtering¶
import pandas as pd
# Define cohort with time windows
cohort_df = pd.DataFrame({
'encounter_block': ['E001', 'E002'], # or 'hospitalization_id'
'start_time': pd.to_datetime(['2024-01-01', '2024-01-02']),
'end_time': pd.to_datetime(['2024-01-03', '2024-01-04'])
})
sofa_scores = co.compute_sofa_scores(
cohort_df=cohort_df,
id_name='encounter_block' # must match cohort_df column
)
Output¶
Returns DataFrame with:
- One row per id_name (encounter_block or hospitalization_id)
- Individual component scores (sofa_cv_97, sofa_coag, sofa_liver, sofa_resp, sofa_cns, sofa_renal)
- Total SOFA score (sofa_total)
- Intermediate calculations (p_f, p_f_imputed)
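Assuming the grouping column is returned as a regular column, a quick look at the result might be:
# Peek at one component score and the total per encounter
print(sofa_scores[['encounter_block', 'sofa_resp', 'sofa_total']].head())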
SOFA Components¶
| Component | Based on | Score Range |
|---|---|---|
| Cardiovascular | Vasopressor doses, MAP | 0-4 |
| Coagulation | Platelet count | 0-4 |
| Liver | Bilirubin levels | 0-4 |
| Respiratory | P/F ratio, respiratory support | 0-4 |
| CNS | GCS score | 0-4 |
| Renal | Creatinine levels | 0-4 |
Higher scores indicate worse organ dysfunction. The total score ranges from 0 to 24.
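If sofa_total is the plain sum of the six components (each 0-4, consistent with the 0-24 range above), a quick consistency check on the output is:
# Sanity check: total equals the sum of component scores (illustrative)
components = ['sofa_cv_97', 'sofa_coag', 'sofa_liver', 'sofa_resp', 'sofa_cns', 'sofa_renal']
assert (sofa_scores[components].sum(axis=1) == sofa_scores['sofa_total']).all()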
Notes¶
- Medication units: Ensure medications are pre-converted to mcg/kg/min using the unit converter
- PaO2 imputation: When PaO2 is missing but SpO2 < 97%, PaO2 is estimated using the Severinghaus equation
- Missing data philosophy: Absence of monitoring data suggests the organ wasn't failing enough to warrant close observation (score = 0)
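For reference, the SpO2-based estimate can be written as a standalone function. This is a sketch of the Ellis closed-form inversion of the Severinghaus equation, not CLIFpy's internal code; the SpO2 < 97% guard mirrors the note above:
import math

def impute_pao2_from_spo2(spo2_percent):
    """Estimate PaO2 (mmHg) from SpO2 (%) via the Severinghaus equation,
    inverted in closed form (Ellis 1989). Returns None for SpO2 >= 97%,
    where the oxyhemoglobin curve is too flat for a reliable estimate."""
    if spo2_percent >= 97:
        return None
    s = spo2_percent / 100.0
    # Solve x^3 + 150x = 23400*s/(1-s) for PaO2 via Cardano's formula
    a = 11700.0 / (1.0 / s - 1.0)
    b = math.sqrt(50.0 ** 3 + a ** 2)
    return (a + b) ** (1.0 / 3.0) - (b - a) ** (1.0 / 3.0)

# e.g. impute_pao2_from_spo2(92) -> ~64 mmHg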
High-Performance SOFA with Polars¶
For large datasets or performance-critical applications, CLIFpy provides compute_sofa_polars(), an optimized implementation using Polars that loads data directly from files.
Quick Start (Polars)¶
import polars as pl
from datetime import datetime
from clifpy import compute_sofa_polars
# Define cohort with time windows
cohort_df = pl.DataFrame({
'hospitalization_id': ['H001', 'H002', 'H003'],
'start_dttm': [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
'end_dttm': [datetime(2024, 1, 2), datetime(2024, 1, 3), datetime(2024, 1, 4)]
})
# Compute SOFA scores
sofa_scores = compute_sofa_polars(
data_directory='/path/to/clif/data',
cohort_df=cohort_df,
filetype='parquet',
timezone='US/Central'
)
Parameters (Polars)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_directory | str | required | Path to directory containing CLIF data files |
| cohort_df | pl.DataFrame | required | Cohort with hospitalization_id, start_dttm, end_dttm |
| filetype | str | 'parquet' | File format ('parquet' or 'csv') |
| id_name | str | 'hospitalization_id' | Column name for grouping scores |
| extremal_type | str | 'worst' | Aggregation type ('worst' for min/max) |
| fill_na_scores_with_zero | bool | True | Fill missing component scores with 0 |
| remove_outliers | bool | True | Remove physiologically implausible values |
| timezone | str | None | Target timezone (e.g., 'US/Central') |
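Several of these can be combined in one call; the paths and values below are illustrative:
# Keep missing components as null and skip outlier removal
sofa_scores = compute_sofa_polars(
    data_directory='/path/to/clif/data',
    cohort_df=cohort_df,
    filetype='csv',
    fill_na_scores_with_zero=False,
    remove_outliers=False,
    timezone='US/Central'
)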
With Encounter Blocks¶
import polars as pl
from datetime import datetime
from clifpy import compute_sofa_polars
# Cohort with encounter blocks
cohort_df = pl.DataFrame({
'hospitalization_id': ['H001', 'H002', 'H003'],
'encounter_block': [1, 1, 2], # H001 and H002 are same encounter
'start_dttm': [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 5)],
'end_dttm': [datetime(2024, 1, 2), datetime(2024, 1, 3), datetime(2024, 1, 6)]
})
# Group by encounter_block instead of hospitalization_id
sofa_scores = compute_sofa_polars(
data_directory='/path/to/clif/data',
cohort_df=cohort_df,
filetype='parquet',
id_name='encounter_block',
timezone='US/Central'
)
Integration with Pandas Workflow¶
If you have a pandas cohort DataFrame, convert it to Polars:
import pandas as pd
import polars as pl
from clifpy import compute_sofa_polars
# Pandas cohort
cohort_pd = pd.DataFrame({
'hospitalization_id': ['H001', 'H002'],
'start_dttm': pd.to_datetime(['2024-01-01', '2024-01-02']),
'end_dttm': pd.to_datetime(['2024-01-02', '2024-01-03'])
})
# Convert to Polars
cohort_pl = pl.from_pandas(cohort_pd)
# Compute SOFA
sofa_scores_pl = compute_sofa_polars(
data_directory='/path/to/clif/data',
cohort_df=cohort_pl,
timezone='US/Central'
)
# Convert result back to pandas if needed
sofa_scores_pd = sofa_scores_pl.to_pandas()
Performance Benefits¶
The Polars implementation offers significant performance improvements:
- Lazy evaluation: Uses scan_parquet() for memory-efficient loading
- Predicate pushdown: Filters are applied at the file level
- Parallel execution: Polars automatically parallelizes operations
- Memory efficiency: Processes data in chunks, avoiding memory exhaustion
Recommended for:
- Large cohorts (>10,000 hospitalizations)
- Memory-constrained environments
- Production pipelines requiring fast execution
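The pattern behind these benefits looks roughly like the sketch below. File and column names are illustrative; this is not CLIFpy's actual internals:
import polars as pl

# Build a lazy query: nothing is read from disk yet
labs_lazy = (
    pl.scan_parquet('/path/to/clif/data/labs.parquet')
    .filter(pl.col('hospitalization_id').is_in(['H001', 'H002']))  # pushed down to the scan
    .select(['hospitalization_id', 'lab_category', 'lab_value_numeric'])
)

# collect() runs one optimized, parallel query over only the needed rows and columns
labs = labs_lazy.collect()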
Polars vs Orchestrator Comparison¶
| Feature | ClifOrchestrator.compute_sofa_scores() | compute_sofa_polars() |
|---|---|---|
| Backend | Pandas + DuckDB | Polars |
| Data loading | Requires pre-loaded tables | Loads directly from files |
| Memory usage | Higher (full tables in memory) | Lower (lazy evaluation) |
| Speed | Good | Faster for large datasets |
| Integration | Works with orchestrator workflow | Standalone function |
| Output | pandas DataFrame | polars DataFrame |
Additional Polars Utilities¶
CLIFpy also exports Polars-based utilities for loading and datetime handling:
from clifpy import (
load_data_polars,
load_clif_table_polars,
standardize_datetime_columns_polars,
)
# Load any CLIF table as Polars LazyFrame
labs = load_data_polars(
table_name='labs',
table_path='/path/to/clif/data',
table_format_type='parquet',
site_tz='US/Central',
lazy=True # Returns LazyFrame for deferred execution
)
# Convenience function with filtering
vitals = load_clif_table_polars(
data_directory='/path/to/clif/data',
table_name='vitals',
hospitalization_ids=['H001', 'H002'],
site_tz='US/Central'
)
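Because load_data_polars returned a LazyFrame here (lazy=True), compose any further filters lazily and materialize once with collect(). The filter below assumes the standard CLIF labs schema:
import polars as pl

# Narrow to one lab category, then materialize the result
creatinine = (
    labs
    .filter(pl.col('lab_category') == 'creatinine')
    .collect()
)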