# Data Validation
CLIFpy provides comprehensive validation to ensure your data conforms to CLIF standards. This guide explains the validation process and how to interpret results.
## Overview
Validation in CLIFpy operates at multiple levels:
- Schema Validation - Ensures required columns exist with correct data types
- Category Validation - Verifies values match standardized categories
- Range Validation - Checks values fall within clinically reasonable ranges
- Timezone Validation - Ensures datetime columns are timezone-aware
- Duplicate Detection - Identifies duplicate records based on composite keys
- Completeness Checks - Analyzes missing data patterns
## Running Validation

### Basic Validation

```python
# Load and validate a table
table = TableClass.from_file('/data', 'parquet')
table.validate()

# Check if valid
if table.isvalid():
    print("Validation passed!")
else:
    print(f"Found {len(table.errors)} validation errors")
```
### Bulk Validation with Orchestrator

```python
from clifpy.clif_orchestrator import ClifOrchestrator

orchestrator = ClifOrchestrator('/data', 'parquet')
orchestrator.initialize(tables=['patient', 'labs', 'vitals'])

# Validate all tables
orchestrator.validate_all()
```
## Understanding Validation Results

### Error Types

Validation errors are stored in the `errors` attribute:

```python
# Review errors
for error in table.errors[:10]:  # First 10 errors
    print(f"Type: {error['type']}")
    print(f"Message: {error['message']}")
    print(f"Details: {error.get('details', 'N/A')}")
    print("-" * 50)
```
Common error types:

- `missing_column` - Required column not found
- `invalid_category` - Value not in the permissible list
- `out_of_range` - Value outside the acceptable range
- `invalid_timezone` - Datetime column not timezone-aware
- `duplicate_rows` - Duplicate records found
### Validation Reports

Validation results are automatically saved to the output directory:

```python
# Set custom output directory
table = TableClass.from_file(
    data_directory='/data',
    filetype='parquet',
    output_directory='/path/to/reports'
)

# After validation, these files are created:
# - validation_log_[table_name].log
# - validation_errors_[table_name].csv
# - missing_data_stats_[table_name].csv
```
## Schema Validation

Each table has a YAML schema defining its structure:

```yaml
# Example from patient_schema.yaml
columns:
  - name: patient_id
    data_type: VARCHAR
    required: true
    is_category_column: false
  - name: sex_category
    data_type: VARCHAR
    required: true
    is_category_column: true
    permissible_values:
      - Male
      - Female
      - Unknown
```
### Required Columns

```python
# Check which required columns are missing
if not table.isvalid():
    missing_cols = [e for e in table.errors if e['type'] == 'missing_column']
    for error in missing_cols:
        print(f"Missing required column: {error['column']}")
```
### Data Types

CLIFpy validates that columns have appropriate data types:

- `VARCHAR` - String/text data
- `DATETIME` - Timezone-aware datetime
- `NUMERIC` - Numeric values (int or float)
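These types correspond roughly to pandas dtypes, so a quick way to preview what the validator will see is to inspect `df.dtypes` before validating. A minimal sketch on a toy frame (the column names here are illustrative, not the full patient schema):

```python
import pandas as pd

# Toy frame with illustrative CLIF-style column names
df = pd.DataFrame({
    'patient_id': ['p1', 'p2'],                 # VARCHAR  -> object
    'admission_dttm': pd.to_datetime(
        ['2023-01-01 08:00', '2023-01-02 09:30']).tz_localize('UTC'),
    'heart_rate': [72.0, 88.0],                 # NUMERIC  -> float64
})

# DATETIME columns should show a timezone, e.g. datetime64[ns, UTC]
print(df.dtypes)
```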
## Category Validation

Standardized categories ensure consistency across institutions:

```python
# Example: Validating location categories in ADT
valid_locations = ['ed', 'ward', 'stepdown', 'icu', 'procedural',
                   'l&d', 'hospice', 'psych', 'rehab', 'radiology',
                   'dialysis', 'other']

# Check for invalid categories
category_errors = [e for e in table.errors
                   if e['type'] == 'invalid_category']
```
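To see the offending rows themselves, not just the error entries, the same membership test can be run directly with pandas. A sketch on a toy frame (in practice the check operates on `table.df`):

```python
import pandas as pd

# Toy ADT frame standing in for table.df
adt = pd.DataFrame({'location_category': ['icu', 'ward', 'ICU', 'er']})

valid_locations = ['ed', 'ward', 'stepdown', 'icu', 'procedural',
                   'l&d', 'hospice', 'psych', 'rehab', 'radiology',
                   'dialysis', 'other']

# Membership is case-sensitive, so 'ICU' is flagged along with 'er'
invalid = adt[~adt['location_category'].isin(valid_locations)]
print(invalid['location_category'].tolist())  # ['ICU', 'er']
```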
## Range Validation

Clinical values are checked against reasonable ranges:

```python
# Example: Vital signs ranges
ranges = {
    'heart_rate': (0, 300),
    'sbp': (0, 300),
    'dbp': (0, 200),
    'temp_c': (25, 44),
    'spo2': (50, 100)
}

# Identify out-of-range values
range_errors = [e for e in table.errors
                if e['type'] == 'out_of_range']
```
## Timezone Validation

All datetime columns must be timezone-aware:

```python
# Check timezone issues
tz_errors = [e for e in table.errors
             if 'timezone' in e.get('message', '').lower()]

if tz_errors:
    print("Datetime columns must be timezone-aware")
    print("Consider reloading with explicit timezone:")
    print("table = TableClass.from_file('/data', 'parquet', timezone='US/Central')")
```
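If source data arrives with naive timestamps, one way to make them timezone-aware before building a table is pandas' `tz_localize`. A minimal sketch (the timezone here is illustrative):

```python
import pandas as pd

# Naive timestamps, e.g. straight out of a CSV extract
ts = pd.Series(pd.to_datetime(['2023-01-01 08:00', '2023-01-01 20:00']))
print(ts.dt.tz)  # None -> would fail timezone validation

# Localize to the site's timezone before constructing the table
ts_aware = ts.dt.tz_localize('US/Central')
print(ts_aware.dt.tz)
```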
## Duplicate Detection

Duplicates are identified based on composite keys:

```python
# Check for duplicates
duplicate_errors = [e for e in table.errors
                    if e['type'] == 'duplicate_rows']

if duplicate_errors:
    for error in duplicate_errors:
        print(f"Found {error['count']} duplicate rows")
        print(f"Composite keys: {error['keys']}")
```
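To pull the offending rows themselves, pandas' `duplicated` can be applied with the same composite keys. A sketch on a toy frame (the key columns are illustrative CLIF-style names):

```python
import pandas as pd

# Toy labs frame with one duplicated composite key
labs = pd.DataFrame({
    'hospitalization_id': ['h1', 'h1', 'h2'],
    'lab_result_dttm': ['2023-01-01 08:00', '2023-01-01 08:00',
                        '2023-01-01 09:00'],
    'lab_category': ['sodium', 'sodium', 'sodium'],
})

keys = ['hospitalization_id', 'lab_result_dttm', 'lab_category']

# keep=False marks every row in a duplicated group, not just the extras
dupes = labs[labs.duplicated(subset=keys, keep=False)]
print(len(dupes))  # 2
```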
## Missing Data Analysis

CLIFpy analyzes missing data patterns:

```python
# Get missing data statistics
summary = table.get_summary()

if 'missing_data' in summary:
    print("Columns with missing data:")
    for col, count in summary['missing_data'].items():
        pct = (count / summary['num_rows']) * 100
        print(f"  {col}: {count} ({pct:.1f}%)")
```
## Custom Validation

Some tables include validation logic specific to their clinical content:

```python
# Example: Labs table validates reference ranges
# Example: Medications validates dose units match drug
# Example: Respiratory support validates device/mode combinations
```
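As an illustration of what such a cross-field check might look like, here is a toy device/mode consistency test. This is not CLIFpy's actual logic; the allow-list values are invented for the sketch:

```python
import pandas as pd

# Hypothetical allow-list: which modes make sense for which device
allowed_modes = {
    'imv': {'simv', 'assist control-volume control'},
    'nippv': {'pressure support/cpap'},
}

resp = pd.DataFrame({
    'device_category': ['imv', 'nippv'],
    'mode_category': ['simv', 'simv'],
})

def mode_ok(row):
    # A mode is valid only if listed for that device
    return row['mode_category'] in allowed_modes.get(row['device_category'], set())

resp['mode_ok'] = resp.apply(mode_ok, axis=1)
print(resp['mode_ok'].tolist())  # [True, False]
```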
## Best Practices
- Always validate after loading - Catch issues early
- Review all error types - Don't just check if valid
- Save validation reports - Keep audit trail
- Fix data at source - Update extraction/ETL process
- Document exceptions - Some errors may be acceptable
## Handling Validation Errors

### Option 1: Fix and Reload

```python
import pandas as pd

# Identify issues
table.validate()
errors_df = pd.DataFrame(table.errors)
errors_df.to_csv('validation_errors.csv', index=False)

# Fix source data based on errors, then reload
table = TableClass.from_file('/fixed_data', 'parquet')
table.validate()
```
### Option 2: Filter Invalid Records

```python
# Remove records with invalid categories
valid_categories = ['Male', 'Female', 'Unknown']
cleaned_df = table.df[table.df['sex_category'].isin(valid_categories)]

# Create new table instance with cleaned data
table = TableClass(data=cleaned_df, timezone='US/Central')
```
### Option 3: Document and Proceed

```python
# For acceptable validation errors
if not table.isvalid():
    # Document why you are proceeding despite errors
    with open('validation_notes.txt', 'w') as f:
        f.write(f"Proceeding with {len(table.errors)} known issues:\n")
        f.write("- Missing optional columns\n")
        f.write("- Historical data outside current ranges\n")
```
## Next Steps
- Learn about timezone handling
- Explore table-specific guides
- See practical examples