DQA (Validation) API Reference
CLIFpy's Data Quality Assessment (DQA) module provides comprehensive validation organized around three pillars: Conformance, Completeness, and Plausibility. All checks support dual backends (Polars and DuckDB) and return structured result objects.
For a user-guide introduction, see Data Quality Assessment (DQA).
Result Classes
clifpy.utils.validator.DQAConformanceResult
Container for DQA conformance check results.
Source code in clifpy/utils/validator.py
clifpy.utils.validator.DQACompletenessResult
Container for DQA completeness check results.
Source code in clifpy/utils/validator.py
clifpy.utils.validator.DQAPlausibilityResult
Container for DQA plausibility check results.
Source code in clifpy/utils/validator.py
Conformance Checks
Conformance checks verify that data matches expected structure, schema, types, and allowed values.
A.1 — Table Presence
clifpy.utils.validator.check_table_exists
Check if a table file exists at the specified path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table_path | str or Path | Directory containing the table files | required |
| table_name | str | Name of the table to check | required |
| filetype | str | File extension (parquet, csv, etc.) | 'parquet' |
Returns:
| Type | Description |
|---|---|
| DQAConformanceResult | Result object with check status |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.check_table_presence
Check that a loaded DataFrame has rows and columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| table_name | str | Name of the table | required |
Source code in clifpy/utils/validator.py
A.2 — Required Columns
clifpy.utils.validator.check_required_columns
Check if all required columns are present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| schema | dict | Table schema containing required_columns | required |
| table_name | str | Name of the table | required |
Source code in clifpy/utils/validator.py
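The idea behind a required-columns check can be sketched in a few lines of plain Python. This is an illustration only, not clifpy's implementation: the helper name `missing_required_columns`, the toy schema, and the column names are all invented for the example; only the `required_columns` schema key comes from the parameter description above.

```python
def missing_required_columns(columns, schema):
    """Return the required column names that are absent from `columns`.

    Order follows schema["required_columns"]; a schema without that key is
    treated as having no required columns (an assumption of this sketch).
    """
    present = set(columns)
    return [name for name in schema.get("required_columns", []) if name not in present]


# Toy schema and column list, invented for this example.
schema = {"required_columns": ["hospitalization_id", "lab_category", "lab_value"]}
columns = ["hospitalization_id", "lab_value", "extra_col"]
missing = missing_required_columns(columns, schema)
```

Extra columns that the schema does not mention are ignored here; whether the real check reports them is not stated in this reference.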
B.1 — Data Types
clifpy.utils.validator.check_column_dtypes
Check if columns have correct data types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| schema | dict | Table schema | required |
| table_name | str | Name of the table | required |
Source code in clifpy/utils/validator.py
B.2 — Datetime Format
clifpy.utils.validator.check_datetime_format
Validate that datetime columns are in the correct format.
Source code in clifpy/utils/validator.py
B.3 — Lab Reference Units
clifpy.utils.validator.check_lab_reference_units
Check if lab reference units match schema definitions.
Source code in clifpy/utils/validator.py
B.4 — Categorical Values
clifpy.utils.validator.check_categorical_values
Check if categorical values match mCIDE permissible values.
Source code in clifpy/utils/validator.py
B.5 — Category-to-Group Mapping
clifpy.utils.validator.check_category_group_mapping
Check if category-to-group mappings match schema definitions.
Source code in clifpy/utils/validator.py
Completeness Checks
Completeness checks evaluate missing data, conditional requirements, and referential coverage.
A.1 — Missingness
clifpy.utils.validator.check_missingness
Check missingness in required columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate (already loaded) | required |
| schema | dict | Table schema containing required_columns | required |
| table_name | str | Name of the table | required |
| error_threshold | float | Percent missing above which an error is raised | 50.0 |
| warning_threshold | float | Percent missing above which a warning is raised | 10.0 |
Returns:
| Type | Description |
|---|---|
| DQACompletenessResult | Result containing missingness statistics |
Source code in clifpy/utils/validator.py
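To make the two thresholds concrete, the sketch below classifies a single column's missingness using the default values from the table above. It is a hypothetical stand-alone helper (`classify_missingness` is not a clifpy function), and the strict "greater than" comparison is an assumption drawn from the wording "above which".

```python
def classify_missingness(n_missing, n_rows, warning_threshold=10.0, error_threshold=50.0):
    """Return (percent_missing, severity) for one column.

    Severity is 'error' above error_threshold, 'warning' above
    warning_threshold, and 'ok' otherwise. Using strict '>' is an
    assumption of this sketch.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    pct = 100.0 * n_missing / n_rows
    if pct > error_threshold:
        return pct, "error"
    if pct > warning_threshold:
        return pct, "warning"
    return pct, "ok"
```

Under these assumptions a column that is exactly 10% missing is still reported as 'ok', and one that is exactly 50% missing is a warning rather than an error.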
A.2 — Conditional Requirements
clifpy.utils.validator.check_conditional_requirements
Check conditional required fields.
Source code in clifpy/utils/validator.py
B — mCIDE Value Coverage
clifpy.utils.validator.check_mcide_value_coverage
Check if all mCIDE standardized values are present in the data.
Source code in clifpy/utils/validator.py
C.1 — Relational Integrity
clifpy.utils.validator.check_relational_integrity
Check bidirectional relational integrity between tables.
Runs the backend-specific check in both directions:
- Forward (reference → target): What percentage of reference IDs appear in the target table? (e.g., "what % of hospitalizations have labs?")
- Reverse (target → reference): What percentage of target IDs exist in the reference table? (e.g., "what % of lab hosp_ids are valid?")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target_df | DataFrame | The target table (e.g., labs). | required |
| reference_df | DataFrame | The reference table (e.g., hospitalization). | required |
| target_table | str | Name of the target table. | required |
| reference_table | str | Name of the reference table. | required |
| key_column | str | The shared key column (e.g., | required |
Returns:
| Type | Description |
|---|---|
| DQACompletenessResult | Consolidated result with forward/reverse coverage metrics. |
Source code in clifpy/utils/validator.py
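The forward and reverse directions of a relational integrity check reduce to set arithmetic on the key column. The sketch below is a plain-Python illustration, not the Polars or DuckDB backend code: `coverage_both_directions` and the result keys are hypothetical, and computing percentages over unique IDs is an assumption.

```python
def coverage_both_directions(reference_ids, target_ids):
    """Compute forward and reverse coverage over unique key values."""
    reference, target = set(reference_ids), set(target_ids)
    shared = reference & target
    # Forward: share of reference IDs that also appear in the target table.
    forward = 100.0 * len(shared) / len(reference) if reference else 0.0
    # Reverse: share of target IDs that exist in the reference table.
    reverse = 100.0 * len(shared) / len(target) if target else 0.0
    return {
        "forward_pct": forward,
        "reverse_pct": reverse,
        "orphan_target_ids": sorted(target - reference),
    }


# Invented IDs: four hospitalizations, lab rows for H1 (twice), H2, and an unknown H9.
hospitalization_ids = ["H1", "H2", "H3", "H4"]
lab_hospitalization_ids = ["H1", "H1", "H2", "H9"]
coverage = coverage_both_directions(hospitalization_ids, lab_hospitalization_ids)
```

In this toy data half of the hospitalizations have labs (forward), while one of the three distinct lab IDs has no matching hospitalization (reverse).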
Plausibility Checks
Plausibility checks validate logical consistency, temporal ordering, and clinical reasonableness.
A.1 — Temporal Ordering
clifpy.utils.validator.check_temporal_ordering
check_temporal_ordering(df, table_name, temporal_rules=None, warning_threshold=0.0, error_threshold=10.0)
Check that datetime pairs follow expected temporal ordering.
Source code in clifpy/utils/validator.py
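A temporal ordering rule boils down to "column A must not be later than column B", with the violation rate then compared against the warning and error thresholds. The following is a minimal sketch of that rate calculation; the helper name, the `start_dttm`/`end_dttm` column names, and the choices to skip rows with a missing value and to allow equal timestamps are all assumptions, not documented clifpy behavior.

```python
from datetime import datetime


def ordering_violation_pct(rows, earlier_col, later_col):
    """Percent of comparable rows where earlier_col falls after later_col."""
    comparable = [
        (row[earlier_col], row[later_col])
        for row in rows
        if row.get(earlier_col) is not None and row.get(later_col) is not None
    ]
    if not comparable:
        return 0.0
    violations = sum(1 for earlier, later in comparable if earlier > later)
    return 100.0 * violations / len(comparable)


# Invented rows; column names are placeholders.
rows = [
    {"start_dttm": datetime(2024, 1, 1, 8), "end_dttm": datetime(2024, 1, 3, 8)},
    {"start_dttm": datetime(2024, 1, 5, 8), "end_dttm": datetime(2024, 1, 4, 8)},  # violation
    {"start_dttm": datetime(2024, 1, 6, 8), "end_dttm": datetime(2024, 1, 6, 8)},  # equal: allowed
    {"start_dttm": datetime(2024, 1, 7, 8), "end_dttm": datetime(2024, 1, 9, 8)},
    {"start_dttm": datetime(2024, 1, 8, 8), "end_dttm": None},  # skipped
]
pct = ordering_violation_pct(rows, "start_dttm", "end_dttm")
```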
A.2 — Numeric Range Plausibility
clifpy.utils.validator.check_numeric_range_plausibility
check_numeric_range_plausibility(df, table_name, outlier_config=None, warning_threshold=0.0, error_threshold=10.0)
Check that numeric values are within plausible ranges.
Source code in clifpy/utils/validator.py
A.3 — Field-Level Plausibility
clifpy.utils.validator.check_field_plausibility
Check field-level plausibility constraints.
Source code in clifpy/utils/validator.py
A.4 — Medication Dose Unit Consistency
clifpy.utils.validator.check_medication_dose_unit_consistency
Check medication dose unit consistency.
Source code in clifpy/utils/validator.py
B.1 — Cross-Table Temporal Plausibility
clifpy.utils.validator.check_cross_table_temporal_plausibility
check_cross_table_temporal_plausibility(target_df, hospitalization_df, target_table, time_columns, warning_threshold=0.0, error_threshold=10.0)
Check that datetime values fall within hospitalization bounds.
Source code in clifpy/utils/validator.py
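Conceptually, this check looks up each event's hospitalization window and flags timestamps that fall outside it. The sketch below shows that lookup with plain dictionaries and tuples; it is not clifpy's implementation, and the function name, the `(start, end)` bounds layout, the inclusive comparison, and the decision to skip events with no matching hospitalization are assumptions made for illustration.

```python
from datetime import datetime


def out_of_bounds_events(events, bounds):
    """Return (hospitalization_id, timestamp) pairs outside their window.

    events: iterable of (hospitalization_id, timestamp)
    bounds: dict mapping hospitalization_id -> (start, end), inclusive
    """
    flagged = []
    for hosp_id, timestamp in events:
        window = bounds.get(hosp_id)
        if window is None or timestamp is None:
            continue  # unmatched IDs are left to the relational integrity check
        start, end = window
        if timestamp < start or timestamp > end:
            flagged.append((hosp_id, timestamp))
    return flagged


# Invented data for one hospitalization.
bounds = {"H1": (datetime(2024, 3, 1), datetime(2024, 3, 5))}
events = [
    ("H1", datetime(2024, 3, 2, 12)),  # inside the window
    ("H1", datetime(2024, 3, 7)),      # after the end
    ("H1", datetime(2024, 2, 28)),     # before the start
    ("H2", datetime(2024, 3, 2)),      # no bounds available: skipped
]
flagged = out_of_bounds_events(events, bounds)
```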
C.1 — Overlapping Periods
clifpy.utils.validator.check_overlapping_periods
check_overlapping_periods(df, table_name, entity_col='hospitalization_id', start_col='in_dttm', end_col='out_dttm')
Check for overlapping time periods within entities.
Source code in clifpy/utils/validator.py
C.2 — Category Temporal Consistency
clifpy.utils.validator.check_category_temporal_consistency
Check category distribution consistency over time.
Source code in clifpy/utils/validator.py
D.1 — Duplicate Composite Keys
clifpy.utils.validator.check_duplicate_composite_keys
check_duplicate_composite_keys(df, table_name, composite_keys=None, schema=None, warning_threshold=0.0, error_threshold=10.0)
Check for duplicate composite keys.
Source code in clifpy/utils/validator.py
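Duplicate detection on a composite key amounts to counting how often each key tuple occurs. The sketch below does this with `collections.Counter`; the helper name and the column names are invented, and defining the duplicate percentage as "surplus rows divided by total rows" is an assumption, since this reference does not say how the library computes the rate it compares against the thresholds.

```python
from collections import Counter


def duplicate_key_stats(rows, composite_keys):
    """Return ({key_tuple: count} for repeated keys, percent of surplus rows)."""
    counts = Counter(tuple(row[key] for key in composite_keys) for row in rows)
    duplicates = {key: n for key, n in counts.items() if n > 1}
    surplus_rows = sum(n - 1 for n in duplicates.values())
    total_rows = sum(counts.values())
    pct = 100.0 * surplus_rows / total_rows if total_rows else 0.0
    return duplicates, pct


# Invented rows; the first two share the same composite key.
rows = [
    {"hospitalization_id": "H1", "recorded_dttm": "2024-01-01T08:00"},
    {"hospitalization_id": "H1", "recorded_dttm": "2024-01-01T08:00"},
    {"hospitalization_id": "H1", "recorded_dttm": "2024-01-01T09:00"},
    {"hospitalization_id": "H2", "recorded_dttm": "2024-01-01T08:00"},
]
duplicates, pct = duplicate_key_stats(rows, ["hospitalization_id", "recorded_dttm"])
```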
Cross-Table Checks
These checks operate across multiple loaded tables to validate relational and temporal consistency.
clifpy.utils.validator.run_relational_integrity_checks
Auto-detect and run relational integrity checks for loaded tables.
Reads FK rules from validation_rules.yaml and runs check_relational_integrity for every applicable (table, fk_column) pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tables | list | Objects with | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_cross_table_completeness_checks
Run cross-table conditional completeness checks (K.5) on full DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tables | list | Objects with | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | Results keyed by target table name. |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_cross_table_plausibility_checks
Run cross-table plausibility checks (B.1).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tables | list | Objects with | required |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQAPlausibilityResult]] | |
Source code in clifpy/utils/validator.py
Orchestration
High-level functions that run groups of checks or the full DQA suite.
Single-Table Orchestration
clifpy.utils.validator.run_conformance_checks
Run all conformance checks on a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate | required |
| schema | dict | Schema for the table | required |
| table_name | str | Name of the table | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, DQAConformanceResult] | Dictionary of check results keyed by check type |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_completeness_checks
Run all completeness checks on a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate | required |
| schema | dict | Schema for the table | required |
| table_name | str | Name of the table | required |
| error_threshold | float | Percent missing above which an error is raised | 50.0 |
| warning_threshold | float | Percent missing above which a warning is raised | 10.0 |
Returns:
| Type | Description |
|---|---|
| Dict[str, DQACompletenessResult] | Dictionary of check results keyed by check type |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_plausibility_checks
Run all single-table plausibility checks on a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate | required |
| schema | dict | Schema for the table | required |
| table_name | str | Name of the table | required |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, DQAPlausibilityResult] | Dictionary of check results keyed by check type |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_full_dqa
run_full_dqa(df, schema=None, table_name='', tables=None, error_threshold=50.0, warning_threshold=10.0, hosp_years=None, plausibility_thresholds=None)
Run the complete DQA suite on a single table.
Orchestrates conformance checks, completeness checks, plausibility checks, and, when tables is provided, auto-detected relational integrity and cross-table plausibility checks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The data to validate. | required |
| schema | dict | Schema for the table. When None (the default), the schema is loaded automatically from the built-in schemas using table_name. | None |
| table_name | str | Name of the table. | '' |
| tables | list | Objects with | None |
| error_threshold | float | Percent missing above which an error is raised (default 50). | 50.0 |
| warning_threshold | float | Percent missing above which a warning is raised (default 10). | 10.0 |
| hosp_years | set | Pre-extracted hospitalization years for P.6 temporal consistency. When provided, skips scanning the hospitalization table to extract years. | None |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Keys: |
Source code in clifpy/utils/validator.py
Cache-Based Cross-Table Pipeline
For memory-optimized cross-table validation, extract lightweight caches and run checks without keeping full DataFrames in memory.
clifpy.utils.validator.extract_cross_table_cache
Extract a lightweight cache from a single table object.
Used by the optimised pipeline in CLIF-TableOne's runner to avoid keeping full DataFrames in memory for cross-table checks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table_obj | BaseTable | Object with | required |
Returns:
| Type | Description |
|---|---|
| dict | Keys: |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_relational_integrity_checks_from_cache
Run relational integrity checks using pre-extracted caches.
Equivalent to run_relational_integrity_checks, but operates on Python set objects (FK ID sets) instead of scanning full DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| caches | dict | | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | Same structure as :func: |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_cross_table_completeness_checks_from_cache
Run cross-table conditional completeness checks (K.5) from caches.
For each YAML rule in cross_table_conditional_requirements, computes the set of join-column IDs that satisfy the source condition but are missing the required target column value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| caches | dict | | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQACompletenessResult]] | Results keyed by target table name, then by a descriptive check key. |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.run_cross_table_plausibility_checks_from_cache
Run cross-table plausibility checks using pre-extracted caches.
Equivalent to run_cross_table_plausibility_checks, but uses cached temporal subset DataFrames and hospitalization bounds instead of full DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| caches | dict | | required |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, DQAPlausibilityResult]] | Same structure as :func: |
Source code in clifpy/utils/validator.py
Report Generation
clifpy.utils.report_generator.collect_dqa_issues
Collect errors, warnings, and info messages from run_full_dqa output.
Returns (category_scores, all_issues), where each issue is a dict with category, check_type, severity ('error'/'warning'/'info'), message, and details, plus the enriched fields rule_code, rule_description, and column_field.
Source code in clifpy/utils/report_generator.py
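Because each issue is a plain dict with a `severity` field, downstream code can summarize or reorder the list without any clifpy helpers. The sketch below shows one way to do that. The two functions and the sample issues are hypothetical and hand-written for the example (they are not real `collect_dqa_issues` output), and only a subset of the documented keys is filled in.

```python
from collections import Counter


def severity_counts(all_issues):
    """Count issues per severity ('error', 'warning', 'info')."""
    return Counter(issue["severity"] for issue in all_issues)


def errors_first(all_issues):
    """Sort issues error -> warning -> info, keeping input order within a level."""
    rank = {"error": 0, "warning": 1, "info": 2}
    return sorted(all_issues, key=lambda issue: rank.get(issue["severity"], len(rank)))


# Hand-written sample issues; check_type values and messages are invented.
sample_issues = [
    {"category": "completeness", "check_type": "example_check_a",
     "severity": "warning", "message": "column x is 12% missing", "details": {}},
    {"category": "conformance", "check_type": "example_check_b",
     "severity": "error", "message": "required column y is absent", "details": {}},
    {"category": "plausibility", "check_type": "example_check_c",
     "severity": "warning", "message": "2% of rows out of order", "details": {}},
]
counts = severity_counts(sample_issues)
ordered = errors_first(sample_issues)
```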
clifpy.utils.report_generator.generate_validation_pdf
Generate a PDF report from DQA validation results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| validation_data | dict | Output from run_full_dqa (keys: conformance, completeness, relational, plausibility). | required |
| table_name | str | Name of the table. | required |
| output_path | str | Path where PDF should be saved. | required |
| site_name | str | Name of the site/hospital. | None |
| feedback | dict | User feedback with 'user_decisions' keyed by error_id. | None |
Returns:
| Type | Description |
|---|---|
| str | Path to generated PDF file. |
Source code in clifpy/utils/report_generator.py
clifpy.utils.report_generator.generate_text_report
Generate a plain-text DQA report.
Parameters match generate_validation_pdf.
Source code in clifpy/utils/report_generator.py
Backward Compatibility
clifpy.utils.validator.validate_dataframe
Validate a dataframe against schema and return list of errors.
This function provides compatibility with CLIF-TableOne's expected validation interface.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame, pl.DataFrame, or pl.LazyFrame | Data to validate | required |
| schema | dict | Table schema containing columns, required_columns, etc. | required |
| table_name | str | Name of the table (inferred from schema if not provided) | None |
| plausibility_thresholds | dict | Override default plausibility thresholds per check. | None |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of error dictionaries with keys: type (str), the error type/check name; description (str), a human-readable error description; details (dict), additional error details; category (str), 'schema' or 'data_quality'. |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.format_clifpy_error
Format a validation error for display.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error | dict | Error dictionary from validate_dataframe() | required |
| row_count | int | Total row count of the table | required |
| table_name | str | Name of the table | required |
Returns:
| Type | Description |
|---|---|
| dict | Formatted error with type, description, category, and details |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.determine_validation_status
Determine validation status based on errors.
Status logic:
- INCOMPLETE (red): Missing required columns, non-castable datatype errors, or 100% null in required columns
- PARTIAL (yellow): Required columns present but with data quality issues (missing categorical values, high missingness, etc.)
- COMPLETE (green): All required columns present, no critical issues
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| errors | list | List of formatted error dictionaries | required |
| required_columns | list | List of required column names | None |
| table_name | str | Name of the table (for table-specific logic) | None |
Returns:
| Type | Description |
|---|---|
| str | 'complete', 'partial', or 'incomplete' |
Source code in clifpy/utils/validator.py
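The three-way status described for determine_validation_status implies a precedence: any critical condition forces 'incomplete', otherwise any data quality issue yields 'partial', and only a clean table is 'complete'. The sketch below captures just that precedence with boolean flags. It is a simplification under stated assumptions: the function name and the flag arguments are hypothetical, and how the library derives these conditions from the formatted error dictionaries is not shown in this reference.

```python
def status_from_flags(missing_required_columns, non_castable_dtype_errors,
                      fully_null_required_columns, has_data_quality_issues):
    """Map boolean conditions to 'incomplete', 'partial', or 'complete'."""
    # Any critical condition takes precedence over data quality issues.
    if missing_required_columns or non_castable_dtype_errors or fully_null_required_columns:
        return "incomplete"
    if has_data_quality_issues:
        return "partial"
    return "complete"
```

For example, a table with high missingness but all required columns present and castable would come out as 'partial' under this sketch, while the same table with one required column absent would be 'incomplete'.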
clifpy.utils.validator.classify_errors_by_status_impact
Classify errors into status-affecting and informational categories.
Used by PDF/report generators to separate critical errors from informational messages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| errors | dict | Dictionary with keys 'schema_errors', 'data_quality_issues', 'other_errors' | required |
| required_columns | list | List of required column names | required |
| table_name | str | Name of the table | required |
| config_timezone | str | Configured timezone (to filter timezone-related errors) | None |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with 'status_affecting' and 'informational', each containing 'schema_errors', 'data_quality_issues', and 'other_errors' lists |
Source code in clifpy/utils/validator.py
clifpy.utils.validator.get_validation_summary
Generate a text summary of validation results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| validation_results | dict | Validation results from validate() method | required |
Returns:
| Type | Description |
|---|---|
| str | Human-readable summary string |