Utilities API Reference¶
CLIFpy provides several utility modules to support data processing and analysis tasks.
Med Unit Converter¶
The unit converter module provides comprehensive medication dose unit conversion functionality.
clifpy.utils.unit_converter.convert_dose_units_by_med_category
¶
convert_dose_units_by_med_category(med_df, vitals_df=None, preferred_units=None, show_intermediate=False, override=False)
Convert medication dose units to user-defined preferred units for each med_category.
This function performs a two-step conversion process:
1. Standardizes all dose units to a base set of standard units (mcg/min, ml/min, u/min for rates)
2. Converts from base units to medication-specific preferred units if provided
The conversion maintains unit class consistency (rates stay rates, amounts stay amounts) and handles weight-based dosing appropriately using patient weights.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `med_df` | DataFrame | Medication DataFrame with required dose columns (`med_dose`, `med_dose_unit`). | required |
| `vitals_df` | DataFrame | Vitals DataFrame for extracting patient weights if not in `med_df`; required when `weight_kg` is missing from `med_df`. | `None` |
| `preferred_units` | dict | Dictionary mapping medication categories to their preferred units. Keys are medication category names, values are target unit strings, e.g. `{'propofol': 'mcg/kg/min', 'fentanyl': 'mcg/hr', 'insulin': 'u/hr'}`. If `None`, uses base units (mcg/min, ml/min, u/min) as defaults. | `None` |
| `show_intermediate` | bool | If False, excludes intermediate calculation columns (multipliers) from output. If True, retains all columns including conversion multipliers for debugging. | `False` |
| `override` | bool | If True, prints warning messages for unacceptable preferred units but continues processing. If False, raises ValueError when encountering unacceptable preferred units. | `False` |
Returns:
| Type | Description |
|---|---|
| Tuple[DataFrame, DataFrame] | A tuple containing the converted medication DataFrame and a summary DataFrame of conversion counts. |
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns (`med_dose_unit`, `med_dose`) are missing from `med_df`, if standardization to base units fails, or if conversion to preferred units fails. |
Examples:
>>> import pandas as pd
>>> med_df = pd.DataFrame({
... 'med_category': ['propofol', 'fentanyl', 'insulin'],
... 'med_dose': [200, 2, 5],
... 'med_dose_unit': ['MCG/KG/MIN', 'mcg/kg/hr', 'units/hr'],
... 'weight_kg': [70, 80, 75]
... })
>>> preferred = {
... 'propofol': 'mcg/kg/min',
... 'fentanyl': 'mcg/hr',
... 'insulin': 'u/hr'
... }
>>> result_df, counts_df = convert_dose_units_by_med_category(med_df, preferred_units=preferred)
Notes
The function handles various unit formats including:
- Weight-based dosing: /kg, /lb (uses patient weight for conversion)
- Time conversions: /hr to /min
- Volume conversions: L to mL
- Mass conversions: mg, ng, g to mcg
- Unit conversions: milli-units (mu) to units (u)
Unrecognized units are preserved but flagged in the _unit_class column.
Todo
Implement config file parsing for default preferred_units.
Source code in clifpy/utils/unit_converter.py
clifpy.utils.unit_converter.standardize_dose_to_base_units
¶
Standardize medication dose units to a base set of standard units.
Main public API function that performs complete dose unit standardization pipeline: format cleaning, name cleaning, and unit conversion. Returns both base data and a summary table of conversions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `med_df` | DataFrame | Medication DataFrame with required dose columns. Additional columns are preserved in output. | required |
| `vitals_df` | DataFrame | Vitals DataFrame for extracting patient weights if not in `med_df`; required when `weight_kg` is missing from `med_df`. | `None` |
Returns:
| Type | Description |
|---|---|
| Tuple[DataFrame, DataFrame] | A tuple containing the standardized medication DataFrame and a summary DataFrame of conversion counts. |
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing from `med_df`. |
Examples:
>>> import pandas as pd
>>> med_df = pd.DataFrame({
... 'med_dose': [6, 100, 500],
... 'med_dose_unit': ['MCG/KG/HR', 'mL / hr', 'mg'],
... 'weight_kg': [70, 80, 75]
... })
>>> base_df, counts_df = standardize_dose_to_base_units(med_df)
>>> '_base_unit' in base_df.columns
True
>>> 'count' in counts_df.columns
True
Notes
Standard units for conversion:
- Rate units: mcg/min, ml/min, u/min (all per minute)
- Amount units: mcg, ml, u (base units)
The function automatically handles:
- Weight-based dosing (/kg, /lb) using patient weights
- Time conversions (per hour to per minute)
- Volume conversions (L to mL)
- Mass conversions (mg, ng, g to mcg)
- Unit conversions (milli-units to units)
Unrecognized units are flagged but preserved in the output.
Source code in clifpy/utils/unit_converter.py
Constants and Data Structures¶
Acceptable Units¶
clifpy.utils.unit_converter.ACCEPTABLE_AMOUNT_UNITS
module-attribute
¶
clifpy.utils.unit_converter.ACCEPTABLE_RATE_UNITS
module-attribute
¶
clifpy.utils.unit_converter.ALL_ACCEPTABLE_UNITS
module-attribute
¶
Unit Patterns¶
The following constants define regex patterns for unit classification:
clifpy.utils.unit_converter.MASS_REGEX
module-attribute
¶
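The actual regexes live in `clifpy/utils/unit_converter.py`; as a hedged sketch of how pattern-based classification works (the simplified `MASS` and `RATE` patterns below are illustrative, not the real `MASS_REGEX`):

```python
import re

# Illustrative, simplified patterns -- the library's real regexes are
# more thorough and cover more unit spellings.
MASS = re.compile(r'^(mcg|mg|ng|g)\b')
RATE = re.compile(r'/(min|hr)$')

def classify(unit: str) -> str:
    """Classify a cleaned dose unit by substance class and rate/amount."""
    kind = 'mass' if MASS.match(unit) else 'other'
    return f"{kind} {'rate' if RATE.search(unit) else 'amount'}"

classify('mcg/kg/min')   # 'mass rate'
classify('mg')           # 'mass amount'
```

The same two-axis classification (unit class plus rate vs. amount) is what the converter uses to decide which conversions are permissible.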
Conversion Mappings¶
clifpy.utils.unit_converter.UNIT_NAMING_VARIANTS
module-attribute
¶
UNIT_NAMING_VARIANTS = {'/hr': '/h(r|our)?$', '/min': '/m(in|inute)?$', 'u': 'u(nits|nit)?', 'm': 'milli-?', 'l': 'l(iters|itres|itre|iter)?', 'mcg': '^(u|µ|μ)g', 'g': '^g(rams|ram)?'}
clifpy.utils.unit_converter.REGEX_TO_FACTOR_MAPPER
module-attribute
¶
REGEX_TO_FACTOR_MAPPER = {HR_REGEX: '1/60', L_REGEX: '1000', MU_REGEX: '1/1000', MG_REGEX: '1000', NG_REGEX: '1/1000', G_REGEX: '1000000', KG_REGEX: 'weight_kg', LB_REGEX: 'weight_kg * 2.20462'}
Internal Functions¶
The following functions are used internally by the main conversion functions. They are documented here for completeness and advanced usage.
clifpy.utils.unit_converter._clean_dose_unit_formats
¶
Clean dose unit formatting by removing spaces and converting to lowercase.
This is the first step in the cleaning pipeline. It standardizes the basic formatting of dose units before applying name cleaning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `s` | Series | Series containing dose unit strings to clean. | required |
Returns:
| Type | Description |
|---|---|
| Series | Series with cleaned formatting (no spaces, lowercase). |
Examples:
>>> import pandas as pd
>>> s = pd.Series(['mL / hr', 'MCG/KG/MIN', ' Mg/Hr '])
>>> result = _clean_dose_unit_formats(s)
>>> list(result)
['ml/hr', 'mcg/kg/min', 'mg/hr']
Notes
This function is typically used as the first step in the cleaning pipeline, followed by _clean_dose_unit_names().
Source code in clifpy/utils/unit_converter.py
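The doctest above can be reproduced with a minimal sketch (an assumed reimplementation for illustration, not the library code):

```python
import pandas as pd

def clean_dose_unit_formats(s: pd.Series) -> pd.Series:
    # drop all whitespace (including internal spaces), then lowercase
    return s.str.replace(r'\s+', '', regex=True).str.lower()

units = pd.Series(['mL / hr', 'MCG/KG/MIN', ' Mg/Hr '])
list(clean_dose_unit_formats(units))   # ['ml/hr', 'mcg/kg/min', 'mg/hr']
```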
clifpy.utils.unit_converter._clean_dose_unit_names
¶
Clean dose unit name variants to standard abbreviations.
Applies regex patterns to convert various unit name variants to their standard abbreviated forms (e.g., 'milliliter' -> 'ml', 'hour' -> 'hr').
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `s` | Series | Series containing dose unit strings with name variants. Should already be format-cleaned (lowercase, no spaces). | required |
Returns:
| Type | Description |
|---|---|
| Series | Series with clean unit names. |
Examples:
>>> import pandas as pd
>>> s = pd.Series(['milliliter/hour', 'units/minute', 'µg/kg/h'])
>>> result = _clean_dose_unit_names(s)
>>> list(result)
['ml/hr', 'u/min', 'mcg/kg/hr']
Notes
Handles conversions including:
- Time: hour/h -> hr, minute/m -> min
- Volume: liter/liters/litre/litres -> l
- Units: units/unit -> u, milli-units -> mu
- Mass: µg/ug -> mcg, gram -> g
This function should be applied after _clean_dose_unit_formats().
Source code in clifpy/utils/unit_converter.py
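A hedged sketch of the variant-normalization idea, using a hypothetical variant table in the spirit of UNIT_NAMING_VARIANTS (the library's actual regexes and application order differ):

```python
import re
import pandas as pd

# Hypothetical variant table; each '/'-separated component is normalized
# independently against these illustrative patterns.
VARIANTS = [
    (r'^h(ours?|rs?)?$', 'hr'),
    (r'^m(inutes?|ins?)?$', 'min'),
    (r'^units?$|^u$', 'u'),
    (r'^millilit(er|re)s?$|^ml$', 'ml'),
    (r'^(u|µ|μ)g$|^mcg$', 'mcg'),
]

def clean_component(token: str) -> str:
    for pattern, abbrev in VARIANTS:
        if re.match(pattern, token):
            return abbrev
    return token            # unrecognized tokens (e.g. 'kg') pass through

def clean_dose_unit_names(s: pd.Series) -> pd.Series:
    return s.map(lambda u: '/'.join(clean_component(t) for t in u.split('/')))
```

Per-component normalization is what lets compound units like `µg/kg/h` come out as `mcg/kg/hr` without a dedicated pattern for every combination.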
clifpy.utils.unit_converter._convert_clean_dose_units_to_base_units
¶
Convert clean dose units to base units.
Core conversion function that transforms various dose units into a base set of standard units (mcg/min, ml/min, u/min for rates; mcg, ml, u for amounts). Uses DuckDB for efficient SQL-based transformations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `med_df` | DataFrame | DataFrame containing medication data with required columns (`med_dose`, `_clean_unit`, and `weight_kg` for weight-based units). | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Original DataFrame with additional conversion columns, including `_base_unit`. |
Examples:
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'med_dose': [6, 100],
... '_clean_unit': ['mcg/kg/hr', 'ml/hr'],
... 'weight_kg': [70, 80]
... })
>>> result = _convert_clean_dose_units_to_base_units(df)
>>> 'mcg/min' in result['_base_unit'].values
True
>>> 'ml/min' in result['_base_unit'].values
True
Notes
Conversion targets:
- Rate units: mcg/min, ml/min, u/min
- Amount units: mcg, ml, u
- Unrecognized units: original dose and (cleaned) unit will be preserved
Weight-based conversions use patient weight from weight_kg column. Time conversions: /hr -> /min (divide by 60).
Source code in clifpy/utils/unit_converter.py
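A hedged pandas sketch of the rate logic (the library uses DuckDB SQL; the input column names follow the doctest above, and `_base_dose` is an assumed output column name):

```python
import pandas as pd

df = pd.DataFrame({
    'med_dose': [6.0, 120.0],
    '_clean_unit': ['mcg/kg/hr', 'ml/hr'],
    'weight_kg': [70.0, 80.0],
})

def to_base_rate(row):
    dose, unit = row['med_dose'], row['_clean_unit']
    if '/kg' in unit:
        dose *= row['weight_kg']        # weight-based dosing uses patient kg
        unit = unit.replace('/kg', '')
    if unit.endswith('/hr'):
        dose /= 60                      # per hour -> per minute
        unit = unit[:-3] + '/min'
    return pd.Series({'_base_dose': dose, '_base_unit': unit})

result = df.join(df.apply(to_base_rate, axis=1))
# 6 mcg/kg/hr at 70 kg -> 6 * 70 / 60 = 7.0 mcg/min; 120 ml/hr -> 2.0 ml/min
```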
clifpy.utils.unit_converter._convert_base_units_to_preferred_units
¶
Convert base standardized units to user-preferred units.
Performs the second stage of unit conversion, transforming from standardized base units (mcg/min, ml/min, u/min) to medication-specific preferred units while maintaining unit class consistency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `med_df` | DataFrame | DataFrame with required columns from first-stage conversion. | required |
| `override` | bool | If True, prints warnings but continues when encountering unacceptable preferred units; if False, raises ValueError for these conditions. | `False` |
Returns:
| Type | Description |
|---|---|
| DataFrame | Original DataFrame with additional conversion columns. |
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing from `med_df` or if preferred units are not in ALL_ACCEPTABLE_UNITS (when `override=False`). |
Notes
Conversion rules enforced:
- Conversions only allowed within same unit class (rate→rate, amount→amount)
- Cannot convert between incompatible subclasses (e.g., mass→volume)
- When conversion fails, falls back to base units and dose values
- Missing units (NULL) are handled with 'original unit is missing' status
The function uses DuckDB SQL for efficient processing and applies regex pattern matching to classify units and calculate conversion factors.
See Also
_convert_clean_dose_units_to_base_units : First-stage conversion convert_dose_units_by_med_category : Public API for complete conversion pipeline
Source code in clifpy/utils/unit_converter.py
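A hedged sketch of the inverse step, with hypothetical values: a base dose in mcg/min becomes weight-based by dividing by the patient's weight, the inverse of the multiplier applied in the base conversion.

```python
# Base units carry per-minute, total-body doses, so converting to a
# weight-based preferred unit divides by kg, and a per-hour preferred
# unit would multiply by 60 (inverse of the 1/60 time factor).
base_dose, weight_kg = 7.0, 70.0     # 7 mcg/min for a 70 kg patient
preferred_unit = 'mcg/kg/min'
dose = base_dose
if '/kg' in preferred_unit:
    dose /= weight_kg                # inverse of the base-stage multiplier
if preferred_unit.endswith('/hr'):
    dose *= 60                       # inverse of the /hr -> /min factor
# dose is now 0.1 mcg/kg/min
```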
clifpy.utils.unit_converter._create_unit_conversion_counts_table
¶
Create summary table of unit conversion counts.
Generates a grouped summary showing the frequency of each unit conversion pattern, useful for data quality assessment and identifying common or problematic unit patterns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `med_df` | DataFrame | DataFrame with required columns from the conversion process. | required |
| `group_by` | List[str] | List of columns to group by. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Summary DataFrame with the grouping columns and a `count` column. |
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing from the input DataFrame. |
Examples:
>>> import pandas as pd
>>> # df_base = standardize_dose_to_base_units(med_df)[0]
>>> # counts = _create_unit_conversion_counts_table(df_base, ['med_dose_unit'])
>>> # 'count' in counts.columns  # True
Notes
This table is particularly useful for:
- Identifying unrecognized units that need handling
- Understanding the distribution of unit types in your data
- Quality control and validation of conversions
Source code in clifpy/utils/unit_converter.py
clifpy.utils.unit_converter._convert_set_to_str_for_sql
¶
Convert a set of strings to SQL IN clause format.
Transforms a Python set into a comma-separated string suitable for use in SQL IN clauses within DuckDB queries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `s` | Set[str] | Set of strings to be formatted for SQL. | required |
Returns:
| Type | Description |
|---|---|
| str | Comma-separated string with items separated by `','`. Does not include outer quotes; those are added in the SQL query. |
Examples:
>>> units = {'ml/hr', 'mcg/min', 'u/hr'}
>>> _convert_set_to_str_for_sql(units)
"ml/hr','mcg/min','u/hr"
Notes
This is a helper function for building DuckDB SQL queries that need to check if values are in a set of acceptable units.
Source code in clifpy/utils/unit_converter.py
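A sketch of how a helper like this feeds a DuckDB IN clause (an assumed reimplementation; it sorts the set for deterministic output, whereas the real function's ordering follows set iteration):

```python
def convert_set_to_str_for_sql(s):
    # join items with ',' so the SQL query wraps the whole thing in quotes
    return "','".join(sorted(s))

units = {'ml/hr', 'mcg/min', 'u/hr'}
clause = f"med_dose_unit IN ('{convert_set_to_str_for_sql(units)}')"
# -> "med_dose_unit IN ('mcg/min','ml/hr','u/hr')"
```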
clifpy.utils.unit_converter._concat_builders_by_patterns
¶
Concatenate multiple SQL CASE WHEN statements from patterns.
Helper function that combines multiple regex pattern builders into a single SQL CASE statement for DuckDB queries. Used internally to build conversion factor calculations for different unit components (amount, time, weight).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `builder` | callable | Function that generates CASE WHEN clauses from regex patterns. Should accept a pattern string and return a WHEN...THEN clause. | required |
| `patterns` | list | List of regex patterns to process with the builder function. | required |
| `else_case` | str | Value to use in the ELSE clause when no patterns match. Default is `'1'` (no conversion factor). | `'1'` |
Returns:
| Type | Description |
|---|---|
| str | Complete SQL CASE statement with all pattern conditions. |
Examples:
>>> patterns = ['/hr$', '/min$']
>>> builder = lambda p: f"WHEN regexp_matches(col, '{p}') THEN factor"
>>> result = _concat_builders_by_patterns(builder, patterns)
>>> 'CASE WHEN' in result and 'ELSE 1 END' in result
True
Notes
This function is used internally by conversion functions to build SQL queries that apply different conversion factors based on unit patterns.
Source code in clifpy/utils/unit_converter.py
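Extending the doctest above into a runnable sketch (the builder and pattern are illustrative; the real factor mapping comes from REGEX_TO_FACTOR_MAPPER):

```python
def concat_builders_by_patterns(builder, patterns, else_case='1'):
    # stitch the builder's WHEN clauses into a single CASE expression
    whens = ' '.join(builder(p) for p in patterns)
    return f"CASE {whens} ELSE {else_case} END"

hr_builder = lambda p: f"WHEN regexp_matches(_clean_unit, '{p}') THEN 1/60"
sql = concat_builders_by_patterns(hr_builder, ['/h(r|our)?$'])
# -> "CASE WHEN regexp_matches(_clean_unit, '/h(r|our)?$') THEN 1/60 ELSE 1 END"
```

The `ELSE 1` default means any unit matching no pattern gets a conversion factor of 1, i.e. its dose passes through unchanged.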
clifpy.utils.unit_converter._pattern_to_factor_builder_for_base
¶
Build SQL CASE WHEN statement for regex pattern matching.
Helper function that generates SQL CASE WHEN clauses for DuckDB queries based on regex patterns and their corresponding conversion factors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pattern` | str | Regex pattern to match (must exist in REGEX_TO_FACTOR_MAPPER). | required |
Returns:
| Type | Description |
|---|---|
| str | SQL CASE WHEN clause string. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the pattern is not found in REGEX_TO_FACTOR_MAPPER. |
Examples:
>>> clause = _pattern_to_factor_builder_for_base(HR_REGEX)
>>> 'WHEN regexp_matches' in clause and 'THEN' in clause
True
Notes
This function is used internally by _convert_clean_dose_units_to_base_units to build the SQL query for unit conversion.
Source code in clifpy/utils/unit_converter.py
clifpy.utils.unit_converter._pattern_to_factor_builder_for_preferred
¶
Build SQL CASE WHEN statement for preferred unit conversion.
Generates SQL clauses for converting from base units back to preferred units by applying the inverse of the original conversion factor. Used when converting from standardized base units to medication-specific preferred units.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pattern` | str | Regex pattern to match in the `_preferred_unit` column. Must exist in the REGEX_TO_FACTOR_MAPPER dictionary. | required |
Returns:
| Type | Description |
|---|---|
| str | SQL CASE WHEN clause with inverse conversion factor. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the pattern is not found in REGEX_TO_FACTOR_MAPPER. |
Examples:
>>> clause = _pattern_to_factor_builder_for_preferred('/hr$')
>>> 'WHEN regexp_matches(_preferred_unit' in clause and 'THEN 1/' in clause
True
Notes
This function applies the inverse of the factor used in _pattern_to_factor_builder_for_base, allowing bidirectional conversion between unit systems. The inverse is calculated as 1/(original_factor).
See Also
_pattern_to_factor_builder_for_base : Builds patterns for base unit conversion
Source code in clifpy/utils/unit_converter.py
This section documents the utility functions available in CLIFpy for data processing, validation, and specialized operations.
Core Data Processing¶
Encounter Stitching¶
Stitch together hospital encounters that occur within a specified time window, useful for treating rapid readmissions as a single continuous encounter.
clifpy.utils.stitching_encounters.stitch_encounters
¶
Stitches together related hospital encounters that occur within a specified time interval.
This function identifies and groups hospitalizations that occur within a specified time window of each other (default 6 hours), treating them as a single continuous encounter. This is useful for handling cases where patients are discharged and readmitted quickly (e.g., ED to inpatient transfers).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hospitalization` | DataFrame | Hospitalization table with required columns: `patient_id`, `hospitalization_id`, `admission_dttm`, `discharge_dttm`, `age_at_admission`, `admission_type_category`, `discharge_category`. | required |
| `adt` | DataFrame | ADT (Admission/Discharge/Transfer) table with required columns: `hospitalization_id`, `in_dttm`, `out_dttm`, `location_category`, `hospital_id`. | required |
| `time_interval` | int | Number of hours between discharge and next admission to consider encounters linked. If a patient is readmitted within this window, the encounters are stitched together. | `6` |
Returns:
| Type | Description |
|---|---|
| Tuple[DataFrame, DataFrame, DataFrame] | `hospitalization_stitched`: hospitalization data with an `encounter_block` column; `adt_stitched`: ADT data with an `encounter_block` column; `encounter_mapping`: mapping of `hospitalization_id` to `encounter_block`. |
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing from input DataFrames. |
Examples:
>>> hosp_stitched, adt_stitched, mapping = stitch_encounters(
... hospitalization_df,
... adt_df,
... time_interval=12 # 12-hour window
... )
Source code in clifpy/utils/stitching_encounters.py
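The stitching rule can be illustrated with a toy sketch (assumed logic, simplified from the library's implementation): a readmission within `time_interval` hours of the prior discharge stays in the same `encounter_block`.

```python
import pandas as pd

hosp = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p1'],
    'hospitalization_id': ['h1', 'h2', 'h3'],
    'admission_dttm': pd.to_datetime(
        ['2024-01-01 00:00', '2024-01-02 03:00', '2024-01-10 00:00']),
    'discharge_dttm': pd.to_datetime(
        ['2024-01-02 00:00', '2024-01-04 00:00', '2024-01-11 00:00']),
}).sort_values(['patient_id', 'admission_dttm'])

time_interval = 6  # hours
prev_discharge = hosp.groupby('patient_id')['discharge_dttm'].shift()
gap_hours = (hosp['admission_dttm'] - prev_discharge).dt.total_seconds() / 3600
# a new block starts at each patient's first stay or after a gap > time_interval
hosp['encounter_block'] = (gap_hours.isna() | (gap_hours > time_interval)).cumsum()
```

Here `h2` is readmitted 3 hours after `h1`'s discharge, so both land in block 1, while `h3` (days later) starts block 2.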
Wide Dataset Creation¶
Transform CLIF tables into wide format for analysis, with automatic pivoting and high-performance processing.
clifpy.utils.wide_dataset.create_wide_dataset
¶
create_wide_dataset(clif_instance, optional_tables=None, category_filters=None, sample=False, hospitalization_ids=None, cohort_df=None, output_format='dataframe', save_to_data_location=False, output_filename=None, return_dataframe=True, base_table_columns=None, batch_size=1000, memory_limit=None, threads=None, show_progress=True)
Create a wide dataset by joining multiple CLIF tables with pivoting support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `clif_instance` | | CLIF object with loaded data. | required |
| `optional_tables` | List[str] | DEPRECATED; use `category_filters` to specify tables. | `None` |
| `category_filters` | Dict[str, List[str]] | Dict specifying filtering/selection for each table; a table's presence in this dict determines whether it is loaded. For pivot tables (narrow-to-wide conversion), values are category values to filter and pivot into columns, e.g. `{'vitals': ['heart_rate', 'sbp', 'spo2'], 'labs': ['hemoglobin', 'sodium', 'creatinine']}`; acceptable values come from the category column's permissible values in each table's schema file (`clifpy/schemas/*_schema.yaml`). For wide tables (already in wide format), values are column names to keep, e.g. `{'respiratory_support': ['device_category', 'fio2_set', 'peep_set']}`; acceptable values are any column names from the table schema. Supported tables and their types are defined in `clifpy/schemas/wide_tables_config.yaml`. | `None` |
| `sample` | bool | If True, randomly select 20 hospitalizations. | `False` |
| `hospitalization_ids` | List[str] | List of specific hospitalization IDs to filter. | `None` |
| `cohort_df` | DataFrame | DataFrame with columns `['hospitalization_id', 'start_time', 'end_time']`. If provided, data will be filtered to only include events within the specified time windows for each hospitalization. | `None` |
| `output_format` | str | `'dataframe'`, `'csv'`, or `'parquet'`. | `'dataframe'` |
| `save_to_data_location` | bool | Save output to the data directory. | `False` |
| `output_filename` | str | Custom filename (default: `'wide_dataset_YYYYMMDD_HHMMSS'`). | `None` |
| `return_dataframe` | bool | Return DataFrame even when saving to file. | `True` |
| `base_table_columns` | Dict[str, List[str]] | DEPRECATED; columns are selected automatically. | `None` |
| `batch_size` | int | Number of hospitalizations to process in each batch. | `1000` |
| `memory_limit` | str | DuckDB memory limit (e.g., `'8GB'`). | `None` |
| `threads` | int | Number of threads for DuckDB to use. | `None` |
| `show_progress` | bool | Show progress bars for long operations. | `True` |
Returns:
| Type | Description |
|---|---|
| DataFrame or None | DataFrame if `return_dataframe=True`, None otherwise. |
Source code in clifpy/utils/wide_dataset.py
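A hedged usage sketch; the `clif` instance and the commented call are assumed context, and the category values are taken from the parameter description above:

```python
# The keys of category_filters determine which tables are loaded.
category_filters = {
    'vitals': ['heart_rate', 'sbp', 'spo2'],          # pivot table: category values
    'labs': ['hemoglobin', 'sodium', 'creatinine'],   # pivot table: category values
    'respiratory_support': ['device_category', 'fio2_set', 'peep_set'],  # wide table: column names
}
# wide_df = create_wide_dataset(clif, category_filters=category_filters, sample=True)
```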
clifpy.utils.wide_dataset.convert_wide_to_hourly
¶
convert_wide_to_hourly(wide_df, aggregation_config, id_name='hospitalization_id', hourly_window=1, fill_gaps=False, memory_limit='4GB', temp_directory=None, batch_size=None, timezone='UTC')
Convert a wide dataset to temporal aggregation with user-defined aggregation methods.
This function uses DuckDB for high-performance aggregation with event-based windowing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `wide_df` | DataFrame | Wide dataset DataFrame from `create_wide_dataset()`. | required |
| `aggregation_config` | Dict[str, List[str]] | Dict mapping aggregation methods to lists of columns, e.g. `{'max': ['map', 'temp_c', 'sbp'], 'mean': ['heart_rate', 'respiratory_rate'], 'min': ['spo2'], 'median': ['glucose'], 'first': ['gcs_total', 'rass'], 'last': ['assessment_value'], 'boolean': ['norepinephrine', 'propofol'], 'one_hot_encode': ['medication_name', 'assessment_category']}`. | required |
| `id_name` | str | Column name to use for grouping aggregation: `'hospitalization_id'` groups by individual hospitalizations (default), `'encounter_block'` groups by encounter blocks (after encounter stitching), or any other ID column present in the wide dataset. | `'hospitalization_id'` |
| `hourly_window` | int | Time window for aggregation in hours (1-72). Windows are event-based (relative to each group's first event): window 0 is `[first_event, first_event + hourly_window)`, window 1 is `[first_event + hourly_window, first_event + 2*hourly_window)`, and window N is `[first_event + N*hourly_window, ...)`. Common values: 1 (hourly), 2 (bi-hourly), 6 (quarter-day), 12 (half-day), 24 (daily), 72 (3-day, the maximum). | `1` |
| `fill_gaps` | bool | Whether to create rows for time windows with no data. With events at windows 0, 1, and 5: `fill_gaps=False` outputs 3 rows (windows 0, 1, 5); `fill_gaps=True` outputs 6 rows (windows 0-5), where windows 2, 3, and 4 have NaN for all aggregated columns. | `False` |
| `memory_limit` | str | DuckDB memory limit (e.g., `'4GB'`, `'8GB'`). | `'4GB'` |
| `temp_directory` | str | Directory for temporary files (default: system temp). | `None` |
| `batch_size` | int | Process in batches if dataset is large (auto-determined if None). | `None` |
| `timezone` | str | Timezone for datetime operations in DuckDB (e.g., `'UTC'`, `'America/New_York'`). | `'UTC'` |
Returns:
| Type | Description |
|---|---|
| DataFrame | Aggregated dataset. Group and window identifiers: `{id_name}` (group identifier), `window_number` (sequential window index, 0-indexed per group), `window_start_dttm` (inclusive), `window_end_dttm` (exclusive). Context columns: `patient_id` and `day_number` (day number within hospitalization). Aggregated columns: all columns in `aggregation_config` with appropriate suffixes (`_max`, `_min`, `_mean`, `_median`, `_first`, `_last`, `_boolean`, one-hot encoded). Windows are relative to each group's first event, not calendar boundaries; `window_end_dttm - window_start_dttm` always equals `hourly_window` hours. With `fill_gaps=True`, gap windows contain NaN (not forward-filled); with `fill_gaps=False`, only windows with data appear (sparse output). |
Source code in clifpy/utils/wide_dataset.py
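The event-based window numbering can be sketched in pandas (assumed arithmetic matching the parameter description; the library does this in DuckDB). The example times produce events in windows 0, 1, and 5, mirroring the `fill_gaps` example above.

```python
import pandas as pd

events = pd.DataFrame({
    'hospitalization_id': ['h1'] * 3,
    'event_time': pd.to_datetime(
        ['2024-01-01 00:10', '2024-01-01 01:30', '2024-01-01 05:45']),
})
hourly_window = 1
# window_number = floor(hours since the group's first event / hourly_window)
first = events.groupby('hospitalization_id')['event_time'].transform('min')
elapsed_s = (events['event_time'] - first).dt.total_seconds()
events['window_number'] = (elapsed_s // (3600 * hourly_window)).astype(int)
```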
Respiratory Support Processing¶
Waterfall Processing¶
Apply sophisticated data cleaning and imputation to respiratory support data for complete ventilator timelines.
clifpy.utils.waterfall.process_resp_support_waterfall
¶
process_resp_support_waterfall(resp_support, *, id_col='hospitalization_id', bfill=False, verbose=True)
Clean and waterfall-fill the CLIF resp_support table (Python port of Nick's reference R pipeline).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `resp_support` | DataFrame | Raw CLIF respiratory-support table, already in UTC. | required |
| `id_col` | str | Encounter-level identifier column. | `'hospitalization_id'` |
| `bfill` | bool | If True, numeric setters are back-filled after forward-fill. If False (default), only forward-fill is used. | `False` |
| `verbose` | bool | Prints progress banners when True. | `True` |
Returns:
| Type | Description |
|---|---|
| DataFrame | Fully processed respiratory-support table. |
Notes
The function does not change time-zones; convert before calling if needed.
Source code in clifpy/utils/waterfall.py
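A minimal sketch of the waterfall idea (assumed simplification; the full pipeline also cleans device/mode hierarchies): numeric setter columns are forward-filled within each encounter, and back-filled only when `bfill=True`.

```python
import pandas as pd

rs = pd.DataFrame({
    'hospitalization_id': ['h1'] * 4,
    'fio2_set': [0.6, None, None, 0.4],
    'peep_set': [None, 5.0, None, None],
})
# carry each setting forward within the encounter until a new value appears
filled = rs.groupby('hospitalization_id')[['fio2_set', 'peep_set']].ffill()
```

With `bfill=True` a subsequent `.bfill()` would also fill the leading gap in `peep_set`; by default that gap stays NaN.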
Clinical Calculations¶
Comorbidity Indices¶
Calculate Charlson and Elixhauser comorbidity indices from diagnosis data.
clifpy.utils.comorbidity.calculate_cci
¶
Calculate Charlson Comorbidity Index (CCI) for hospitalizations.
This function processes hospital diagnosis data to calculate CCI scores using the Quan (2011) adaptation with ICD-10-CM codes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hospital_diagnosis` | HospitalDiagnosis object, pandas DataFrame, or polars DataFrame | Diagnosis data with columns `hospitalization_id`, `diagnosis_code`, and `diagnosis_code_format`. | required |
| `hierarchy` | bool | Apply assign0 logic to prevent double counting of conditions when both mild and severe forms are present. | `True` |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with `hospitalization_id` (index), 17 binary condition columns (0/1), and `cci_score` (weighted sum). |
Source code in clifpy/utils/comorbidity.py
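As a sketch, the expected input shape and the effect of the hierarchy flag look like this. The data and condition names are hypothetical; the real function returns one scored row per hospitalization_id:

```python
import pandas as pd

# Hypothetical diagnosis rows shaped per the required columns above.
diag = pd.DataFrame({
    "hospitalization_id": ["H001", "H001", "H002"],
    "diagnosis_code": ["E11.9", "E11.52", "I50.9"],
    "diagnosis_code_format": ["icd10cm"] * 3,
})

# Sketch of the assign0 hierarchy rule: when both the mild and severe form of a
# condition are flagged, the mild flag is zeroed so it is not counted twice.
flags = {"diabetes_mild": 1, "diabetes_severe": 1}
if flags["diabetes_severe"]:
    flags["diabetes_mild"] = 0
```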
Data Quality Management¶
Outlier Handling¶
Detect and handle physiologically implausible values using configurable ranges.
clifpy.utils.outlier_handler.apply_outlier_handling
¶
Apply outlier handling to a table object's dataframe.
This function identifies numeric values that fall outside acceptable ranges and converts them to NaN. For category-dependent columns (vitals, labs, medications, assessments), ranges are applied based on the category value.
Uses a fast Polars implementation with progress tracking.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_obj | | A pyCLIF table object with .df (DataFrame) and .table_name attributes | required |
| outlier_config_path | str | Path to custom outlier configuration YAML. If None, uses the internal CLIF standard config. | None |

Returns:

| Type | Description |
|---|---|
| None | Modifies table_obj.df in place |
Source code in clifpy/utils/outlier_handler.py
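The range-to-NaN behavior can be sketched in plain pandas. The range values and column names here are illustrative, not the shipped CLIF standard config:

```python
import numpy as np
import pandas as pd

# Illustrative category-dependent ranges (not the shipped CLIF config).
ranges = {"heart_rate": (0, 300), "temp_c": (25, 45)}

vitals = pd.DataFrame({
    "vital_category": ["heart_rate", "heart_rate", "temp_c"],
    "vital_value": [72.0, 9000.0, 37.1],
})

# Out-of-range values are converted to NaN, per category.
for category, (lo, hi) in ranges.items():
    in_cat = vitals["vital_category"].eq(category)
    out_of_range = ~vitals["vital_value"].between(lo, hi)
    vitals.loc[in_cat & out_of_range, "vital_value"] = np.nan
```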
clifpy.utils.outlier_handler.get_outlier_summary
¶
Get a summary of potential outliers without modifying the data.
This is a convenience wrapper around validate_numeric_ranges_from_config() for interactive use with table objects; it reports actual outlier counts and percentages.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_obj | | A pyCLIF table object with .df, .table_name, and .schema attributes | required |
| outlier_config_path | str | Path to custom outlier configuration. If None, uses the CLIF standard config. | None |

Returns:

| Type | Description |
|---|---|
| dict | Summary of outliers with keys: table_name (name of the table), total_rows (total number of rows), config_source ("CLIF standard" or "Custom"), and outliers (list of outlier validation results with counts and percentages) |
See Also
clifpy.utils.validator.validate_numeric_ranges_from_config : Core validation function
Examples:
>>> from clifpy.tables.vitals import Vitals
>>> from clifpy.utils.outlier_handler import get_outlier_summary
>>>
>>> vitals = Vitals.from_file()
>>> summary = get_outlier_summary(vitals)
>>> print(f"Found {len(summary['outliers'])} outlier patterns")
Source code in clifpy/utils/outlier_handler.py
Data Validation¶
Comprehensive validation functions for ensuring data quality and CLIF compliance.
clifpy.utils.validator.validate_dataframe
¶
Validate df against spec.
Returns a list of error dictionaries. An empty list means success.
For datatype validation:
- If a column doesn't match the expected type exactly, the validator checks if the data can be cast to the correct type
- Castable type mismatches return warnings with type "datatype_castable"
- Non-castable type mismatches return errors with type "datatype_mismatch"
- Both include descriptive messages about the casting capability
Source code in clifpy/utils/validator.py
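The castable-versus-mismatch distinction above can be sketched as follows; this is a simplified illustration, not the validator's actual code:

```python
import pandas as pd

# A column stored as strings but expected to be numeric: castable -> warning.
column = pd.Series(["1", "2", "3"])
try:
    pd.to_numeric(column)
    verdict = "datatype_castable"   # values can be cast to the expected type
except (ValueError, TypeError):
    verdict = "datatype_mismatch"   # values cannot be cast
```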
clifpy.utils.validator.validate_table
¶
Validate df using the JSON spec for table_name.
Convenience wrapper combining _load_spec and validate_dataframe.
Source code in clifpy/utils/validator.py
clifpy.utils.validator.check_required_columns
¶
Validate that required columns are present in the dataframe.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The dataframe to validate | required |
| column_names | List[str] | List of required column names | required |
| table_name | str | Name of the table being validated | required |

Returns:

| Type | Description |
|---|---|
| dict | Dictionary with validation results, including any missing columns |
Source code in clifpy/utils/validator.py
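A minimal sketch of the documented contract; the exact keys of the returned dictionary are an assumption:

```python
import pandas as pd

def check_required_columns_sketch(df, column_names, table_name):
    # Report any required columns that are absent from the dataframe.
    # The result keys here are illustrative, not clifpy's actual keys.
    missing = [c for c in column_names if c not in df.columns]
    return {"table_name": table_name, "missing_columns": missing, "valid": not missing}

df = pd.DataFrame({"hospitalization_id": ["H001"], "diagnosis_code": ["I21.3"]})
result = check_required_columns_sketch(
    df,
    ["hospitalization_id", "diagnosis_code", "diagnosis_code_format"],
    "hospital_diagnosis",
)
```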
clifpy.utils.validator.verify_column_dtypes
¶
Ensure columns have correct data types per schema.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The dataframe to validate | required |
| schema | dict | Schema containing column definitions | required |

Returns:

| Type | Description |
|---|---|
| List[dict] | List of datatype mismatch errors |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.validate_categorical_values
¶
Check values against permitted categories.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The dataframe to validate | required |
| schema | dict | Schema containing category definitions | required |

Returns:

| Type | Description |
|---|---|
| List[dict] | List of invalid category value errors |
Source code in clifpy/utils/validator.py
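The core check can be sketched with a simple membership test; the permitted set below is an illustrative schema fragment, not a real CLIF schema:

```python
import pandas as pd

permitted = {"med_route": ["iv", "oral", "im"]}   # illustrative, not a CLIF schema
df = pd.DataFrame({"med_route": ["iv", "topical", "oral"]})

# Values outside the permitted set would be reported as errors.
invalid = df.loc[~df["med_route"].isin(permitted["med_route"]), "med_route"].tolist()
```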
Configuration and I/O¶
Configuration Management¶
Load and manage CLIF configuration files for consistent settings.
clifpy.utils.config.load_config
¶
Load CLIF configuration from JSON or YAML file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str | Path to the configuration file. If None, looks for 'config.json' or 'config.yaml' in the current directory. | None |

Returns:

| Type | Description |
|---|---|
| dict | Configuration dictionary with required fields validated |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the config file doesn't exist |
| ValueError | If required fields are missing or invalid |
| JSONDecodeError | If a JSON config file is not valid |
| YAMLError | If a YAML config file is not valid |
Source code in clifpy/utils/config.py
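The JSON path can be sketched with the standard library; the required-field list below is an assumption for illustration:

```python
import json
import tempfile
from pathlib import Path

# Write a small JSON config to a temporary directory.
cfg_path = Path(tempfile.mkdtemp()) / "config.json"
cfg_path.write_text(json.dumps({
    "data_directory": "/data/clif",
    "filetype": "parquet",
    "timezone": "America/Chicago",
}))

def load_config_sketch(path, required=("data_directory", "filetype", "timezone")):
    # Parse the file, then validate that the required fields are present.
    config = json.loads(Path(path).read_text())
    missing = [k for k in required if k not in config]
    if missing:
        raise ValueError(f"Missing required config fields: {missing}")
    return config

config = load_config_sketch(cfg_path)
```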
clifpy.utils.config.get_config_or_params
¶
get_config_or_params(config_path=None, data_directory=None, filetype=None, timezone=None, output_directory=None)
Get configuration from either config file or direct parameters.
Loading priority:
- If all required params provided directly → use them
- If config_path provided → load from that path, allow param overrides
- If no params and no config_path → auto-detect config.json/yaml/yml
- Parameters override config file values when both are provided
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str | Path to configuration file | None |
| data_directory | str | Direct parameter; overrides the config file value if both are provided | None |
| filetype | str | Direct parameter; overrides the config file value if both are provided | None |
| timezone | str | Direct parameter; overrides the config file value if both are provided | None |
| output_directory | str | Direct parameter; overrides the config file value if both are provided | None |

Returns:

| Type | Description |
|---|---|
| dict | Final configuration dictionary |

Raises:

| Type | Description |
|---|---|
| ValueError | If neither a config file nor the required parameters are provided |
Source code in clifpy/utils/config.py
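The loading priority can be sketched as follows. This is a simplified stand-in: the real function reads and parses config_path from disk, while the sketch substitutes a fixed dictionary:

```python
def resolve_config_sketch(config_path=None, **params):
    # Priority: if all required params are given directly, use them;
    # otherwise load a config file and let provided params override it.
    required = ("data_directory", "filetype", "timezone")
    provided = {k: v for k, v in params.items() if v is not None}
    if all(k in provided for k in required):
        return provided
    if config_path is None:
        raise ValueError("Provide a config file or all required parameters")
    # Stand-in for loading config_path from disk (illustrative values).
    file_cfg = {"data_directory": "/data", "filetype": "parquet", "timezone": "UTC"}
    return {**file_cfg, **provided}

cfg = resolve_config_sketch("config.yaml", timezone="America/Chicago")
```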
Data Loading¶
Core data loading functionality with timezone and filtering support.
clifpy.utils.io.load_data
¶
load_data(table_name, table_path, table_format_type, sample_size=None, columns=None, filters=None, site_tz=None, verbose=False)
Load data from a file in the specified directory with the option to select specific columns and apply filters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_name | str | The name of the table to load. | required |
| table_path | str | Path to the directory containing the data file. | required |
| table_format_type | str | Format of the data file (e.g., 'csv', 'parquet'). | required |
| sample_size | int | Number of rows to load. | None |
| columns | list of str | List of column names to load. | None |
| filters | dict | Dictionary of filters to apply. | None |
| site_tz | str | Timezone string for datetime conversion, e.g., "America/New_York". | None |
| verbose | bool | If True, show detailed loading messages. | False |

Returns:

| Type | Description |
|---|---|
| DataFrame | DataFrame containing the requested data. |
Source code in clifpy/utils/io.py
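A sketch of the loading behavior using CSV and pandas; the {table_name}.{format} naming convention and the value-membership filter semantics are assumptions for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a small table to a temporary directory.
table_dir = Path(tempfile.mkdtemp())
pd.DataFrame({
    "lab_category": ["sodium", "lactate"],
    "lab_value": [140.0, 2.1],
}).to_csv(table_dir / "labs.csv", index=False)

def load_data_sketch(table_name, table_path, table_format_type,
                     columns=None, filters=None):
    # Read {table_path}/{table_name}.{format}, keep selected columns,
    # then apply simple value-membership filters.
    df = pd.read_csv(Path(table_path) / f"{table_name}.{table_format_type}",
                     usecols=columns)
    for col, allowed in (filters or {}).items():
        allowed = allowed if isinstance(allowed, list) else [allowed]
        df = df[df[col].isin(allowed)]
    return df

labs = load_data_sketch("labs", table_dir, "csv",
                        filters={"lab_category": ["lactate"]})
```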
Simplified Import Paths¶
As of version 0.0.1, commonly used utilities are available directly from the clifpy package:
# Direct imports from clifpy
import clifpy
# Encounter stitching
hospitalization_stitched, adt_stitched, mapping = clifpy.stitch_encounters(
hospitalization_df,
adt_df,
time_interval=6
)
# Wide dataset creation
wide_df = clifpy.create_wide_dataset(
clif_instance=orchestrator,
optional_tables=['vitals', 'labs'],
category_filters={'vitals': ['heart_rate', 'sbp']}
)
# Calculate comorbidity index
cci_scores = clifpy.calculate_cci(
hospital_diagnosis_df,
hospitalization_df
)
# Apply outlier handling
clifpy.apply_outlier_handling(table_object)
For backward compatibility, the original import paths (clifpy.utils.module.function) remain available.