DQA (Validation) API Reference
CLIFpy's Data Quality Assessment (DQA) module provides comprehensive validation organized around three pillars: Conformance, Completeness, and Plausibility. All checks support dual backends (Polars and DuckDB) and return structured result objects.
For a user-guide introduction, see Data Quality Assessment (DQA).
Result Classes
clifpy.utils.validator.DQAConformanceResult
Container for DQA conformance check results.
Source code in clifpy/utils/validator.py
clifpy.utils.validator.DQACompletenessResult
Container for DQA completeness check results.
Source code in clifpy/utils/validator.py
clifpy.utils.validator.DQAPlausibilityResult
Container for DQA plausibility check results.
Source code in clifpy/utils/validator.py
Conformance Checks
Conformance checks verify that data matches expected structure, schema, types, and allowed values.
A.1 — Table Presence
clifpy.utils.validator.check_table_exists
Check if a table file exists at the specified path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table_path | str or Path | Directory containing the table files | required |
| table_name | str | Name of the table to check | required |
| filetype | str | File extension (parquet, csv, etc.) | 'parquet' |

Returns:
| Type | Description |
|---|---|
| DQAConformanceResult | Result object with check status |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.check_table_presence
Check that a loaded DataFrame has rows and columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| table_name | str | Name of the table | required |
Source code in clifpy/utils/validator.py
A.2 — Required Columns
clifpy.utils.validator.check_required_columns
Check if all required columns are present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| schema | dict | Table schema containing required_columns | required |
| table_name | str | Name of the table | required |
Source code in clifpy/utils/validator.py
B.1 — Data Types
clifpy.utils.validator.check_column_dtypes
Check if columns have correct data types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| schema | dict | Table schema | required |
| table_name | str | Name of the table | required |
Source code in clifpy/utils/validator.py
B.2 — Datetime Format
clifpy.utils.validator.check_datetime_format
Validate datetime columns are in correct format.
Source code in clifpy/utils/validator.py
B.3 — Lab Reference Units
clifpy.utils.validator.check_lab_reference_units
Check if lab reference units match schema definitions.
Source code in clifpy/utils/validator.py
B.4 — Categorical Values
clifpy.utils.validator.check_categorical_values
Check if categorical values match mCIDE permissible values.
Source code in clifpy/utils/validator.py
B.5 — Category-to-Group Mapping
clifpy.utils.validator.check_category_group_mapping
Check if category-to-group mappings match schema definitions.
Source code in clifpy/utils/validator.py
Completeness Checks
Completeness checks evaluate missing data, conditional requirements, and referential coverage.
A.1 — Missingness
clifpy.utils.validator.check_missingness
Check missingness in required columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| schema | dict | Table schema containing required_columns | required |
| table_name | str | Name of the table | required |
| error_threshold | float | Percent missing above which an error is raised | 50.0 |
| warning_threshold | float | Percent missing above which a warning is raised | 10.0 |

Returns:
| Type | Description |
|---|---|
| DQACompletenessResult | Result containing missingness statistics |
Source code in clifpy/utils/validator.py
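As a rough illustration of how the two thresholds interact, the sketch below classifies one column's missingness percentage. It is a plain-Python approximation and not the clifpy implementation: the helper name `classify_missingness`, the severity labels, and the handling of empty columns are assumptions. Only the default thresholds (10.0 and 50.0) come from the parameter table above.

```python
def classify_missingness(values, warning_threshold=10.0, error_threshold=50.0):
    """Return (percent_missing, severity) for one column (hypothetical helper)."""
    total = len(values)
    if total == 0:
        # Assumption: an empty column is reported as 0% missing.
        return 0.0, "ok"
    missing = sum(1 for value in values if value is None)
    pct = 100.0 * missing / total
    if pct > error_threshold:
        return pct, "error"
    if pct > warning_threshold:
        return pct, "warning"
    return pct, "ok"


# One null out of four values is 25% missing: above the warning threshold,
# below the error threshold.
pct, severity = classify_missingness([1, None, 3, 4])
```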
A.2 — Conditional Requirements
clifpy.utils.validator.check_conditional_requirements
Check conditional required fields.
Source code in clifpy/utils/validator.py
B — mCIDE Value Coverage
clifpy.utils.validator.check_mcide_value_coverage
Check if all mCIDE standardized values are present in the data.
Source code in clifpy/utils/validator.py
C.1 — Relational Integrity
clifpy.utils.validator.check_relational_integrity
Check bidirectional relational integrity between tables.
Runs the backend-specific check in both directions:
- Forward (reference → target): What percentage of reference IDs appear in the target table? (e.g., "what % of hospitalizations have labs?")
- Reverse (target → reference): What percentage of target IDs exist in the reference table? (e.g., "what % of lab hosp_ids are valid?")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target_df | DataFrame | The target table (e.g., labs). | required |
| reference_df | DataFrame | The reference table (e.g., hospitalization). | required |
| target_table | str | Name of the target table. | required |
| reference_table | str | Name of the reference table. | required |
| key_column | str | The shared key column (e.g., | required |

Returns:
| Type | Description |
|---|---|
| DQACompletenessResult | Consolidated result with forward/reverse coverage metrics. |
Source code in clifpy/utils/validator.py
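The forward and reverse directions described above can be sketched with plain Python sets. This is an illustrative approximation, not the clifpy backend code: the function name `coverage_both_directions` and the returned dictionary keys are hypothetical, and it assumes coverage is measured over unique IDs.

```python
def coverage_both_directions(reference_ids, target_ids):
    """Compute forward/reverse ID coverage between two tables (hypothetical helper)."""
    reference = set(reference_ids)
    target = set(target_ids)
    shared = reference & target
    # Forward: share of reference IDs that also appear in the target table.
    forward = 100.0 * len(shared) / len(reference) if reference else 0.0
    # Reverse: share of target IDs that exist in the reference table.
    reverse = 100.0 * len(shared) / len(target) if target else 0.0
    return {
        "forward_pct": forward,
        "reverse_pct": reverse,
        "orphan_target_ids": target - reference,
    }


hospitalization_ids = ["H1", "H2", "H3", "H4"]
lab_hospitalization_ids = ["H1", "H1", "H2", "H9"]
result = coverage_both_directions(hospitalization_ids, lab_hospitalization_ids)
```

In this example two of the four hospitalizations have labs, and one lab ID ("H9") has no matching hospitalization.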
Plausibility Checks
Plausibility checks validate logical consistency, chronological order, and clinical reasonableness.
A.1 — Chronological Order
clifpy.utils.validator.check_chronological_order
check_chronological_order(df, table_name, chronological_rules=None, warning_threshold=0.0, error_threshold=10.0)
Check that datetime pairs follow expected chronological order.
Source code in clifpy/utils/validator.py
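To make the idea of a datetime-pair rule concrete, here is a minimal sketch that measures how often an "earlier" column falls after a "later" column. It does not call clifpy and does not reproduce the format of `chronological_rules`. The helper name, the row-as-dict representation, and the column names in the example are assumptions for illustration.

```python
from datetime import datetime


def chronological_violation_pct(rows, earlier_col, later_col):
    """Percent of comparable rows where earlier_col falls after later_col."""
    checked = 0
    violations = 0
    for row in rows:
        earlier, later = row.get(earlier_col), row.get(later_col)
        if earlier is None or later is None:
            continue  # assumption: rows with a missing timestamp are skipped
        checked += 1
        if earlier > later:
            violations += 1
    return 100.0 * violations / checked if checked else 0.0


stays = [
    {"admission_dttm": datetime(2024, 1, 1), "discharge_dttm": datetime(2024, 1, 5)},
    {"admission_dttm": datetime(2024, 2, 10), "discharge_dttm": datetime(2024, 2, 1)},
    {"admission_dttm": datetime(2024, 3, 1), "discharge_dttm": None},
]
violation_pct = chronological_violation_pct(stays, "admission_dttm", "discharge_dttm")
```

A percentage like this could then be compared against `warning_threshold` and `error_threshold` from the signature above.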
A.2 — Numeric Range Plausibility
clifpy.utils.validator.check_numeric_range_plausibility
check_numeric_range_plausibility(df, table_name, outlier_config=None, warning_threshold=0.0, error_threshold=10.0)
Check numeric values are within plausible ranges.
Source code in clifpy/utils/validator.py
A.3 — Field-Level Plausibility
clifpy.utils.validator.check_field_plausibility
Check field-level plausibility constraints.
Source code in clifpy/utils/validator.py
A.4 — Medication Dose Unit Consistency
clifpy.utils.validator.check_medication_dose_unit_consistency
Check medication dose unit consistency.
Source code in clifpy/utils/validator.py
B.1 — Cross-Table Temporal Plausibility
clifpy.utils.validator.check_cross_table_temporal_plausibility
check_cross_table_temporal_plausibility(target_df, hospitalization_df, target_table, time_columns, warning_threshold=0.0, error_threshold=10.0)
Check that datetime values fall within hospitalization bounds.
Source code in clifpy/utils/validator.py
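The bounds test behind this check can be sketched as follows. This is a simplified stand-in rather than the clifpy implementation: the function name `events_outside_stay`, the dictionary of (admission, discharge) bounds, and the `recorded_dttm` column are assumptions, and rows without a matching hospitalization are simply skipped here.

```python
from datetime import datetime


def events_outside_stay(events, stay_bounds, time_col):
    """Return events whose time_col lies outside their hospitalization window."""
    flagged = []
    for event in events:
        bounds = stay_bounds.get(event["hospitalization_id"])
        timestamp = event.get(time_col)
        if bounds is None or timestamp is None:
            continue  # assumption: unmatched or null rows are left to other checks
        admitted, discharged = bounds
        if timestamp < admitted or timestamp > discharged:
            flagged.append(event)
    return flagged


stay_bounds = {"H1": (datetime(2024, 1, 1), datetime(2024, 1, 5))}
events = [
    {"hospitalization_id": "H1", "recorded_dttm": datetime(2024, 1, 3)},
    {"hospitalization_id": "H1", "recorded_dttm": datetime(2024, 1, 9)},
    {"hospitalization_id": "H2", "recorded_dttm": datetime(2024, 1, 2)},
]
flagged = events_outside_stay(events, stay_bounds, "recorded_dttm")
```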
C.1 — Overlapping Periods
clifpy.utils.validator.check_overlapping_periods
check_overlapping_periods(df, table_name, entity_col='hospitalization_id', start_col='in_dttm', end_col='out_dttm')
Check for overlapping time periods within entities.
Source code in clifpy/utils/validator.py
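One common way to detect overlapping periods within an entity is to sort each entity's periods by start time and compare every start against the latest end seen so far. The sketch below follows that approach. It is not taken from clifpy: the function name `find_overlaps` is hypothetical, only the default column names mirror the signature above, and it assumes that a period starting exactly when the previous one ends does not count as an overlap.

```python
from collections import defaultdict
from datetime import datetime


def find_overlaps(rows, entity_col="hospitalization_id",
                  start_col="in_dttm", end_col="out_dttm"):
    """Return (entity, start) pairs for periods that begin before an earlier one ends."""
    by_entity = defaultdict(list)
    for row in rows:
        by_entity[row[entity_col]].append((row[start_col], row[end_col]))

    overlaps = []
    for entity, periods in by_entity.items():
        periods.sort()  # order by start time, then end time
        latest_end = None
        for start, end in periods:
            if latest_end is not None and start < latest_end:
                overlaps.append((entity, start))
            latest_end = end if latest_end is None else max(latest_end, end)
    return overlaps


def hour(h):
    """Build a timestamp on an arbitrary example day."""
    return datetime(2024, 1, 1, h)


adt_rows = [
    {"hospitalization_id": "A", "in_dttm": hour(1), "out_dttm": hour(5)},
    {"hospitalization_id": "A", "in_dttm": hour(4), "out_dttm": hour(8)},
    {"hospitalization_id": "A", "in_dttm": hour(8), "out_dttm": hour(10)},
    {"hospitalization_id": "B", "in_dttm": hour(1), "out_dttm": hour(2)},
]
overlaps = find_overlaps(adt_rows)
```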
C.2 — Category Temporal Consistency
clifpy.utils.validator.check_category_temporal_consistency
Check category distribution consistency over time.
Source code in clifpy/utils/validator.py
D.1 — Duplicate Composite Keys
clifpy.utils.validator.check_duplicate_composite_keys
check_duplicate_composite_keys(df, table_name, composite_keys=None, schema=None, warning_threshold=0.0, error_threshold=10.0)
Check for duplicate composite keys.
Source code in clifpy/utils/validator.py
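A duplicate composite key means two or more rows share the same combination of key columns. The sketch below counts such rows with `collections.Counter`. It is illustrative only: the helper name `duplicate_key_pct` and the example column names are assumptions, and it counts every row that shares a key, which may differ from how clifpy reports duplicates.

```python
from collections import Counter


def duplicate_key_pct(rows, composite_keys):
    """Percent of rows whose composite key occurs more than once."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row[key] for key in composite_keys) for row in rows)
    duplicate_rows = sum(n for n in counts.values() if n > 1)
    return 100.0 * duplicate_rows / len(rows)


vitals = [
    {"hospitalization_id": "H1", "recorded_dttm": "2024-01-01T08:00", "vital_category": "heart_rate"},
    {"hospitalization_id": "H1", "recorded_dttm": "2024-01-01T08:00", "vital_category": "heart_rate"},
    {"hospitalization_id": "H1", "recorded_dttm": "2024-01-01T09:00", "vital_category": "heart_rate"},
    {"hospitalization_id": "H2", "recorded_dttm": "2024-01-01T08:00", "vital_category": "heart_rate"},
]
pct = duplicate_key_pct(vitals, ["hospitalization_id", "recorded_dttm", "vital_category"])
```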
Cross-Table Checks
These checks operate across multiple loaded tables to validate relational and temporal consistency.
clifpy.utils.validator.run_relational_integrity_checks
Auto-detect and run relational integrity checks for loaded tables.
Reads FK rules from validation_rules.yaml and runs check_relational_integrity for every applicable (table, fk_column) pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tables | list | Objects with | required |

Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_cross_table_completeness_checks
Run cross-table conditional completeness checks (K.5) on full DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tables | list | Objects with | required |

Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | Results keyed by target table name. |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_cross_table_plausibility_checks
Run cross-table plausibility checks (B.1).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tables | list | Objects with | required |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |

Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQAPlausibilityResult]] | |
Source code in clifpy/utils/validator.py
Orchestration
High-level functions that run groups of checks or the full DQA suite.
Single-Table Orchestration
clifpy.utils.validator.run_conformance_checks
Run all conformance checks on a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate | required |
| schema | dict | Schema for the table | required |
| table_name | str | Name of the table | required |

Returns:
| Type | Description |
|---|---|
| Dict[str, DQAConformanceResult] | Dictionary of check results keyed by check type |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_completeness_checks
Run all completeness checks on a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate | required |
| schema | dict | Schema for the table | required |
| table_name | str | Name of the table | required |
| error_threshold | float | Percent missing above which an error is raised | 50.0 |
| warning_threshold | float | Percent missing above which a warning is raised | 10.0 |

Returns:
| Type | Description |
|---|---|
| Dict[str, DQACompletenessResult] | Dictionary of check results keyed by check type |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_plausibility_checks
Run all single-table plausibility checks on a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate | required |
| schema | dict | Schema for the table | required |
| table_name | str | Name of the table | required |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |

Returns:
| Type | Description |
|---|---|
| Dict[str, DQAPlausibilityResult] | Dictionary of check results keyed by check type |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_full_dqa
run_full_dqa(df, schema=None, table_name='', tables=None, error_threshold=50.0, warning_threshold=10.0, hosp_years=None, plausibility_thresholds=None, clif_version=DEFAULT_CLIF_VERSION)
Run the complete DQA suite on a single table.
Orchestrates conformance checks, completeness checks, plausibility checks, and — when tables is provided — auto-detected relational integrity and cross-table plausibility checks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate. | required |
| schema | dict | Schema for the table. When None (the default), the schema is loaded automatically from the built-in schemas using table_name. | None |
| table_name | str | Name of the table. | '' |
| tables | list | Objects with | None |
| error_threshold | float | Percent missing above which an error is raised (default 50). | 50.0 |
| warning_threshold | float | Percent missing above which a warning is raised (default 10). | 10.0 |
| hosp_years | set | Pre-extracted hospitalization years for P.6 temporal consistency. When provided, skips scanning the hospitalization table to extract years. | None |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |
| clif_version | str | CLIF schema version to auto-load when schema is None (e.g. "2.1", "3.0"). Ignored when schema is passed explicitly. Defaults to the package default (2.1). | DEFAULT_CLIF_VERSION |

Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Keys: |
Source code in clifpy/utils/validator.py
Cache-Based Cross-Table Pipeline
For memory-optimized cross-table validation, extract lightweight caches and run checks without keeping full DataFrames in memory.
clifpy.utils.validator.extract_cross_table_cache
Extract a lightweight cache from a single table object.
Used by the optimised pipeline in CLIF-TableOne's runner to avoid keeping full DataFrames in memory for cross-table checks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table_obj | BaseTable | Object with | required |

Returns:
| Type | Description |
|---|---|
| dict | Keys: |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_relational_integrity_checks_from_cache
Run relational integrity checks using pre-extracted caches.
Equivalent to run_relational_integrity_checks but operates on Python set objects (FK ID sets) instead of scanning full DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| caches | dict | | required |

Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | Same structure as :func: |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_cross_table_completeness_checks_from_cache
Run cross-table conditional completeness checks (K.5) from caches.
For each YAML rule in cross_table_conditional_requirements, computes the set of join-column IDs that satisfy the source condition but are missing the required target column value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| caches | dict | | required |

Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | Results keyed by target table name, then by a descriptive check key. |
Source code in clifpy/utils/validator.py
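The set computation described above amounts to a set difference over join-column IDs. The sketch below shows that step in isolation. It does not reflect the actual cache layout or YAML rule format, and the function name and example IDs are hypothetical.

```python
def ids_missing_required_value(source_condition_ids, target_ids_with_value):
    """IDs that meet the source condition but lack the required target value."""
    return set(source_condition_ids) - set(target_ids_with_value)


# Hypothetical example: hospitalizations that satisfy some source condition,
# versus hospitalizations where the required target column is populated.
satisfies_condition = {"H1", "H2", "H3"}
has_required_value = {"H2", "H4"}
missing = ids_missing_required_value(satisfies_condition, has_required_value)
missing_pct = 100.0 * len(missing) / len(satisfies_condition)
```

Working on ID sets rather than full DataFrames is what keeps this step lightweight in memory.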
clifpy.utils.validator.run_cross_table_plausibility_checks_from_cache
Run cross-table plausibility checks using pre-extracted caches.
Equivalent to run_cross_table_plausibility_checks but uses cached temporal subset DataFrames and hospitalization bounds instead of full DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| caches | dict | | required |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |

Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQAPlausibilityResult]] | Same structure as :func: |
Source code in clifpy/utils/validator.py
Report Generation
clifpy.utils.report_generator.collect_dqa_issues
Collect errors, warnings, and info messages from run_full_dqa output.
Returns (category_scores, all_issues) where each issue is a dict with category, check_type, severity ('error'/'warning'/'info'), message, details, plus enriched fields: rule_code, rule_description, column_field.
Scoring reads atomic_total/atomic_passed on each check's result. Both fields are required on every DQAResult producer; the check's "natural atomic unit" decides the granularity (per-column, per-rule, per-permissible-value, or 1 for binary checks). If a check result is missing atomic counts, this function raises ValueError rather than silently approximating the score from message counts; populate the fields in the check itself (see clifpy/utils/validator.py).
Source code in clifpy/utils/report_generator.py
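The fail-fast behaviour around atomic counts can be illustrated with a small sketch. It is not the report generator's code: results are modelled as plain dictionaries, the function name `category_score` is hypothetical, and expressing the score as a pass percentage is an assumption. Only the field names atomic_total and atomic_passed and the ValueError on missing counts come from the description above.

```python
def category_score(check_results):
    """Aggregate atomic pass counts into a percentage, refusing to guess."""
    total = 0
    passed = 0
    for name, result in check_results.items():
        if result.get("atomic_total") is None or result.get("atomic_passed") is None:
            # Mirror the documented behaviour: never approximate from messages.
            raise ValueError(f"check {name!r} is missing atomic counts")
        total += result["atomic_total"]
        passed += result["atomic_passed"]
    # Assumption: a category with no atomic units scores 100.
    return 100.0 * passed / total if total else 100.0


conformance = {
    "required_columns": {"atomic_total": 4, "atomic_passed": 3},  # per-column
    "table_presence": {"atomic_total": 1, "atomic_passed": 1},    # binary check
}
score = category_score(conformance)
```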
clifpy.utils.report_generator.generate_validation_pdf
Generate a PDF report from DQA validation results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| validation_data | dict | Output from run_full_dqa (keys: conformance, completeness, relational, plausibility). | required |
| table_name | str | Name of the table. | required |
| output_path | str | Path where PDF should be saved. | required |
| site_name | str | Name of the site/hospital. | None |
| feedback | dict | User feedback with 'user_decisions' keyed by error_id. | None |

Returns:
| Type | Description |
|---|---|
| str | Path to generated PDF file. |
Source code in clifpy/utils/report_generator.py
clifpy.utils.report_generator.generate_text_report
Generate a plain-text DQA report.
Parameters match generate_validation_pdf.
Source code in clifpy/utils/report_generator.py
Backward Compatibility
clifpy.utils.validator.validate_dataframe
Validate a dataframe against schema and return list of errors.
This function provides compatibility with CLIF-TableOne's expected validation interface.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate | required |
| schema | dict | Table schema containing columns, required_columns, etc. | required |
| table_name | str | Name of the table (inferred from schema if not provided) | None |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |

Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of error dictionaries with keys: type (str, error type/check name); description (str, human-readable error description); details (dict, additional error details); category (str, 'schema' or 'data_quality') |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.format_clifpy_error
Format a validation error for display.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error | dict | Error dictionary from validate_dataframe() | required |
| row_count | int | Total row count of the table | required |
| table_name | str | Name of the table | required |

Returns:
| Type | Description |
|---|---|
| dict | Formatted error with type, description, category, and details |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.determine_validation_status
Determine validation status based on errors.
Status Logic:
- INCOMPLETE (red): Missing required columns OR non-castable datatype errors OR 100% null in required columns
- PARTIAL (yellow): Required columns present but has data quality issues (missing categorical values, high missingness, etc.)
- COMPLETE (green): All required columns present, no critical issues
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| errors | list | List of formatted error dictionaries | required |
| required_columns | list | List of required column names | None |
| table_name | str | Name of the table (for table-specific logic) | None |

Returns:
| Type | Description |
|---|---|
| str | 'complete', 'partial', or 'incomplete' |
Source code in clifpy/utils/validator.py
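The three-way status logic can be condensed into a small decision function. The sketch below is a simplification for illustration: the real function inspects a list of formatted error dictionaries, whereas this hypothetical helper, `sketch_validation_status`, takes pre-computed flags whose names are assumptions.

```python
def sketch_validation_status(missing_required_columns,
                             non_castable_dtype_errors,
                             fully_null_required_columns,
                             data_quality_issues):
    """Map pre-classified findings to 'incomplete', 'partial', or 'complete'."""
    # Any critical finding makes the table INCOMPLETE (red).
    if (missing_required_columns
            or non_castable_dtype_errors
            or fully_null_required_columns):
        return "incomplete"
    # Otherwise, remaining data quality issues make it PARTIAL (yellow).
    if data_quality_issues:
        return "partial"
    # No critical issues and nothing else outstanding: COMPLETE (green).
    return "complete"


status = sketch_validation_status(
    missing_required_columns=[],
    non_castable_dtype_errors=[],
    fully_null_required_columns=[],
    data_quality_issues=["high missingness in lab_value_numeric"],
)
```

The critical conditions are tested first, so a table with both a missing required column and high missingness is reported as incomplete rather than partial.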
clifpy.utils.validator.classify_errors_by_status_impact
Classify errors into status-affecting and informational categories.
Used by PDF/report generators to separate critical errors from informational messages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| errors | dict | Dictionary with keys 'schema_errors', 'data_quality_issues', 'other_errors' | required |
| required_columns | list | List of required column names | required |
| table_name | str | Name of the table | required |
| config_timezone | str | Configured timezone (to filter timezone-related errors) | None |

Returns:
| Type | Description |
|---|---|
| dict | Dictionary with 'status_affecting' and 'informational', each containing 'schema_errors', 'data_quality_issues', and 'other_errors' lists |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.get_validation_summary
Generate a text summary of validation results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| validation_results | dict | Validation results from validate() method | required |

Returns:
| Type | Description |
|---|---|
| str | Human-readable summary string |