Find, explain, and fix common data problems before reports or models use them. — Claude Skill
Un Skill de Claude para Claude Code por masterkram — ejecutar /data-quality en Claude·Actualizado el 12 jun 2026·vmain@8b32590
Profiles datasets for missing values, duplicates, outliers, invalid values, and rule violations, then verifies whether the repair actually improved the data.
- Finds missing values, duplicate records, outliers, invalid domains, and broken constraints.
- Explains whether a problem is harmless, needs cleanup, or should block reporting.
- Suggests repair strategies instead of blindly deleting suspicious rows.
- Re-runs checks after cleanup so teams can see what improved.
An analyst eyeballs a dataset, deletes obvious bad rows, and hopes downstream analysis is still valid.
Run /data-quality to define quality rules, quantify defects, repair systematically, and verify improvements.
Para quién es
Qué hace
Decide whether data is reliable enough for a report, migration, or model.
Find duplicate customers, accounts, products, or transactions and choose how to merge them.
Check required fields, allowed values, unique keys, and relationships between tables.
Cómo funciona
Define the dataset, business rules, and why the data matters.
Profile completeness, uniqueness, validity, consistency, and outliers.
Choose a repair strategy for missing values, duplicates, strings, or constraints.
Verify the cleaned data against the original rules and document remaining risk.
Opciones de entrada
CSV, spreadsheet, SQL extract, dataframe, schema, or sample rows.
Ejemplo
Dataset: customer_export.csv, 52,000 rows. Use case: migrate accounts into a new CRM next week. Rules: - email is required for contacts, - account_id must be unique, - country must be a valid ISO country, - annual_revenue should not be negative. Known concern: duplicate company names and missing emails. Need: decide whether migration is safe and what to fix first.
The skill treats this as migration readiness, not generic cleanup. It focuses on problems that would block or corrupt the CRM migration.
1,148 rows are missing email, 392 duplicate account_id values exist, 86 country values are invalid, and 14 rows have negative revenue.
Duplicate account_id is blocking because it can merge the wrong accounts. Missing emails may be acceptable for account records but not contact records.
Fix duplicate account IDs first, map invalid countries second, flag missing emails for enrichment, and investigate negative revenue instead of deleting those rows.
Rerun the same checks and show before/after counts so the migration owner can sign off.
Métricas que mejora
Funciona con
¿Quieres usar Data Quality?
Elige cómo empezar.
Instala y ejecuta este skill localmente en tu computadora.
Abre una terminal en tu computadora y pega este comando:
Esto descarga el skill con todos sus archivos en tu computadora:
Añade -g al final para tenerlo disponible en todos tus proyectos.
Inicia Claude Code, luego escribe el comando:
Data Quality Skill
Systematic approach to diagnosing and fixing data quality problems.
Data Quality Process
Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
- Define: Understand data context, business rules, quality requirements
- Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
- Clean: Apply appropriate repair strategies
- Measure: Validate repairs, quantify improvement
Quick Reference
| Problem | Script | Key Function |
|---|---|---|
| Data overview | data_profiling.py | profile_dataframe(df) |
| Find quality issues | data_profiling.py | detect_glitches(df) |
| Missing values | missing_data.py | analyze_missing(df) |
| Imputation | missing_data.py | impute_mean/median/regression() |
| Duplicates | duplicate_detection.py | find_duplicates(df, cols) |
| Deduplication | duplicate_detection.py | deduplicate(df, cols) |
| Outliers | anomaly_detection.py | detect_anomalies(df) |
| Constraint check | constraint_checking.py | validate_constraints(df, rules) |
| String matching | similarity_metrics.py | jaro_winkler_similarity() |
Workflow
Step 1: Profile the Data
from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report
# Quick overview
print(generate_quality_report(df))
# Detailed profile
profile = profile_dataframe(df)
# Find issues
glitches = detect_glitches(df)
Step 2: Analyze Specific Issues
Missing Data:
from scripts.missing_data import analyze_missing, test_mcar
analysis = analyze_missing(df)
# Check if safe to delete rows
mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])
Duplicates:
from scripts.duplicate_detection import find_duplicates, cluster_duplicates
matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
clusters = cluster_duplicates(matches)
Outliers:
from scripts.anomaly_detection import detect_anomalies, iqr_outliers
# Multi-column summary
anomalies = detect_anomalies(df, method='iqr')
# Single column detail
result = iqr_outliers(df, 'price', multiplier=1.5)
Constraints:
from scripts.constraint_checking import validate_constraints
constraints = [
{'type': 'unique', 'columns': ['id']},
{'type': 'not_null', 'columns': ['name', 'email']},
{'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
{'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
]
results = validate_constraints(df, constraints)
Step 3: Clean the Data
Handle Missing:
from scripts.missing_data import impute_median, impute_regression, listwise_deletion
# Simple: median for numeric
df_clean = impute_median(df, 'age')
# Better: regression-based
df_clean = impute_regression(df, 'income', ['age', 'education'])
# If MCAR confirmed
df_clean = listwise_deletion(df)
Remove Duplicates:
from scripts.duplicate_detection import deduplicate
df_clean, summary = deduplicate(
df,
columns=['name', 'email', 'address'],
threshold=0.8,
merge_strategy='most_complete'
)
print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")
Handle Outliers:
# Cap extreme values
q01, q99 = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(q01, q99)
# Or remove
df_clean = df[~detect_anomalies(df)['col']['outlier_indices']]
Step 4: Validate
Re-run profiling and constraint checks on cleaned data to verify improvements.
References
For deeper understanding:
- references/dimensions.md: Data quality dimensions (accuracy, completeness, etc.)
- references/glitch_taxonomy.md: Types of data glitches and detection approaches
- references/repair_strategies.md: Detailed repair and cleaning strategies
Key Concepts
Data Quality = Fit for Use
- Free of defects
- Has features needed for the task
- Right information, right place, right time
Missing Data Mechanisms:
- MCAR: Missing Completely At Random (safe to delete)
- MAR: Missing At Random (imputation may work)
- MNAR: Missing Not At Random (most problematic)
Constraints:
- Functional Dependency:
X → Ymeans X uniquely determines Y - Referential Integrity: foreign keys reference valid primary keys
- Domain Constraints: values within allowed set/range
Entity Resolution:
- Blocking reduces O(n²) to O(n·window)
- Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
- Cluster by transitive closure, merge by strategy
Similarity Metrics Comparison
| Metric | Best For | Example |
|---|---|---|
| Jaro-Winkler | Names, short strings | "Robert" vs "Rupert" |
| Levenshtein | Typos, edit distance | "recieve" vs "receive" |
| Jaccard | Token/word comparison | "John Doe" vs "Doe, John" |
| Q-gram | Fuzzy substring matching | Partial matches |
Documentos de referencia
name: data-quality description: Diagnose and fix data quality problems in datasets. Use when working with dirty data, finding duplicates, handling missing values, detecting outliers/anomalies, validating constraints (functional dependencies, referential integrity), profiling datasets, or cleaning data for analysis or ML. Covers the full data quality lifecycle - define, detect, clean, measure.
Data Quality Skill
Systematic approach to diagnosing and fixing data quality problems.
Data Quality Process
Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
- Define: Understand data context, business rules, quality requirements
- Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
- Clean: Apply appropriate repair strategies
- Measure: Validate repairs, quantify improvement
Quick Reference
| Problem | Script | Key Function |
|---|---|---|
| Data overview | data_profiling.py | profile_dataframe(df) |
| Find quality issues | data_profiling.py | detect_glitches(df) |
| Missing values | missing_data.py | analyze_missing(df) |
| Imputation | missing_data.py | impute_mean/median/regression() |
| Duplicates | duplicate_detection.py | find_duplicates(df, cols) |
| Deduplication | duplicate_detection.py | deduplicate(df, cols) |
| Outliers | anomaly_detection.py | detect_anomalies(df) |
| Constraint check | constraint_checking.py | validate_constraints(df, rules) |
| String matching | similarity_metrics.py | jaro_winkler_similarity() |
Workflow
Step 1: Profile the Data
from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report
# Quick overview
print(generate_quality_report(df))
# Detailed profile
profile = profile_dataframe(df)
# Find issues
glitches = detect_glitches(df)
Step 2: Analyze Specific Issues
Missing Data:
from scripts.missing_data import analyze_missing, test_mcar
analysis = analyze_missing(df)
# Check if safe to delete rows
mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])
Duplicates:
from scripts.duplicate_detection import find_duplicates, cluster_duplicates
matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
clusters = cluster_duplicates(matches)
Outliers:
from scripts.anomaly_detection import detect_anomalies, iqr_outliers
# Multi-column summary
anomalies = detect_anomalies(df, method='iqr')
# Single column detail
result = iqr_outliers(df, 'price', multiplier=1.5)
Constraints:
from scripts.constraint_checking import validate_constraints
constraints = [
{'type': 'unique', 'columns': ['id']},
{'type': 'not_null', 'columns': ['name', 'email']},
{'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
{'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
]
results = validate_constraints(df, constraints)
Step 3: Clean the Data
Handle Missing:
from scripts.missing_data import impute_median, impute_regression, listwise_deletion
# Simple: median for numeric
df_clean = impute_median(df, 'age')
# Better: regression-based
df_clean = impute_regression(df, 'income', ['age', 'education'])
# If MCAR confirmed
df_clean = listwise_deletion(df)
Remove Duplicates:
from scripts.duplicate_detection import deduplicate
df_clean, summary = deduplicate(
df,
columns=['name', 'email', 'address'],
threshold=0.8,
merge_strategy='most_complete'
)
print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")
Handle Outliers:
# Cap extreme values
q01, q99 = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(q01, q99)
# Or remove
df_clean = df[~detect_anomalies(df)['col']['outlier_indices']]
Step 4: Validate
Re-run profiling and constraint checks on cleaned data to verify improvements.
References
For deeper understanding:
- references/dimensions.md: Data quality dimensions (accuracy, completeness, etc.)
- references/glitch_taxonomy.md: Types of data glitches and detection approaches
- references/repair_strategies.md: Detailed repair and cleaning strategies
Key Concepts
Data Quality = Fit for Use
- Free of defects
- Has features needed for the task
- Right information, right place, right time
Missing Data Mechanisms:
- MCAR: Missing Completely At Random (safe to delete)
- MAR: Missing At Random (imputation may work)
- MNAR: Missing Not At Random (most problematic)
Constraints:
- Functional Dependency:
X → Ymeans X uniquely determines Y - Referential Integrity: foreign keys reference valid primary keys
- Domain Constraints: values within allowed set/range
Entity Resolution:
- Blocking reduces O(n²) to O(n·window)
- Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
- Cluster by transitive closure, merge by strategy
Similarity Metrics Comparison
| Metric | Best For | Example |
|---|---|---|
| Jaro-Winkler | Names, short strings | "Robert" vs "Rupert" |
| Levenshtein | Typos, edit distance | "recieve" vs "receive" |
| Jaccard | Token/word comparison | "John Doe" vs "Doe, John" |
| Q-gram | Fuzzy substring matching | Partial matches |
Data Quality Dimensions
Data quality is determined by "fitness for use" - the capability of data to meet requirements for a given context.
Core Dimensions
Accuracy
Data correctly represents the real-world entity or event.
Measurement:
- Compare against authoritative source (gold standard)
- Expert review sampling
- Cross-reference validation
Common Issues:
- Typos and transcription errors
- Outdated information
- Measurement errors
Completeness
All required data values are present.
Levels:
- Schema completeness: all expected attributes exist
- Column completeness: % non-null values per column
- Population completeness: all expected records present
Measurement:
completeness = (non_null_count / total_count) * 100
Consistency
Data values don't contradict each other across the dataset or systems.
Types:
- Intra-record: values within same record consistent (age matches birth date)
- Inter-record: values across records consistent (no duplicate IDs)
- Cross-system: same entity has consistent values in different databases
Measurement:
- Constraint violation counts
- Cross-reference mismatches
Timeliness
Data is current enough for the intended use.
Aspects:
- Currency: when data was last updated
- Volatility: how often data changes
- Latency: delay between real-world change and data update
Measurement:
timeliness_score = 1 - (current_time - last_update) / max_acceptable_age
Validity
Data conforms to defined formats, types, and ranges.
Checks:
- Data type validation
- Format validation (dates, emails, phones)
- Range/domain validation
- Pattern matching
Uniqueness
No unintended duplicates exist.
Levels:
- Primary key uniqueness
- Natural key uniqueness
- Entity-level deduplication
Extended Dimensions
Relevance
Data is applicable and useful for the task.
Interpretability
Data meaning is clear and unambiguous (well-documented).
Accessibility
Authorized users can easily obtain the data.
Credibility
Data comes from trustworthy sources.
Business Context
Different uses require different quality priorities:
| Use Case | Priority Dimensions |
|---|---|
| Financial reporting | Accuracy, Completeness, Consistency |
| Real-time analytics | Timeliness, Availability |
| Customer communications | Accuracy, Completeness |
| ML model training | Completeness, Consistency, Validity |
| Regulatory compliance | Accuracy, Completeness, Auditability |
Measuring Overall Quality
Weighted composite score:
quality_score = sum(weight[dim] * score[dim] for dim in dimensions) / sum(weights)
Quality thresholds:
- Critical: < 70% - Immediate attention required
- Warning: 70-90% - Improvement needed
- Acceptable: > 90% - Monitor for degradation
Data Glitch Taxonomy
Data glitches are defects that compromise data quality. Understanding glitch types guides detection and repair strategies.
Glitch Categories
1. Missing Data
Value-level: Individual cells are NULL/empty
- Easy to detect:
df.isnull().sum()
Record-level: Entire rows missing from expected population
- Hard to detect: requires external reference
Attribute-level: Expected columns missing from schema
- Moderate: compare against schema documentation
Missing Data Mechanisms:
| Mechanism | Description | Implication |
|---|---|---|
| MCAR | Missing Completely At Random - no correlation | Safe to delete rows |
| MAR | Missing At Random - correlated with observed data | Imputation may work |
| MNAR | Missing Not At Random - correlated with missing value itself | Most problematic |
2. Inconsistent/Erroneous Data
Syntactic errors:
- Typos: "Jhon" instead of "John"
- Formatting: inconsistent date formats, phone formats
- Encoding: character encoding issues
Semantic errors:
- Wrong values: age = 250
- Contradictions: birth_date > current_date
- Constraint violations: duplicate primary keys
Detection:
- Constraint checking (FDs, referential integrity)
- Domain validation
- Pattern analysis
3. Anomalies and Outliers
Point anomalies: Single data points far from normal
- Age = -5 or Age = 200
Contextual anomalies: Abnormal in specific context
- Temperature = 90°F normal in summer, anomaly in winter
Collective anomalies: Groups of related points anomalous together
- Sudden spike in all sensor readings
Detection methods:
- Statistical: z-score, IQR, modified z-score
- ML-based: isolation forest, autoencoders, k-NN
4. Semantic Duplicates
Records referring to the same real-world entity with different representations:
Row 1: "John Smith", "123 Main St", "NYC"
Row 2: "J. Smith", "123 Main Street", "New York City"
Detection:
- Blocking to reduce comparison space
- Fuzzy matching with similarity metrics
- Clustering via transitive closure
5. Undocumented Data
Data without adequate metadata:
- Unknown column meanings
- Missing data dictionaries
- Unclear units of measurement
- No lineage/provenance information
Symptoms:
- Column names like "col1", "field_a", "x"
- No documentation about allowed values
- Ambiguous semantics
Glitch Complexes
Real data often has compound glitch patterns:
Multi-type Glitch
Single value has multiple glitch types:
- Value is both an outlier AND inconsistent with constraint
Concomitant Glitches
Same record has glitches in multiple columns:
- Missing name AND invalid email AND outlier age
Multi-occurrent Glitches
Same glitch type appears across many records:
- 1000 records all missing the same field
Detection Complexity Factors
Relevance
Severity varies by domain:
- Missing email: critical for marketing, irrelevant for shipping
Ambiguity
Boundary between valid/invalid unclear:
- Is age 120 an error or just rare?
Complex Dependencies
One glitch can mask another:
- Missing value hides what would be a constraint violation
Dynamic Nature
Glitch types evolve:
- New data sources introduce new error patterns
Glitch Quantification
Per-value scoring
glitch_signature = [has_missing, has_outlier, has_format_error, ...]
glitch_score = sum(weight[i] * signature[i] for i in range(len(signature)))
Global scoring
total_glitch_score = sum(all_value_scores) / total_values
Detection Priority
- Start broad: Profile entire dataset (profiling script)
- Identify patterns: Find common glitch types
- Prioritize: Focus on high-impact, high-frequency issues
- Deep dive: Investigate root causes of priority glitches
Data Repair Strategies
Repairing data quality issues involves trade-offs between data loss, accuracy, and computational cost.
Fundamental Concept: Minimal Repair
Minimal repair = smallest change that removes constraint violations.
Key insight: When a constraint is violated, it's ambiguous which value is wrong. Minimal repair preserves as much original data as possible.
Missing Data Strategies
Deletion Methods
Listwise deletion: Remove entire row if any value missing
df_clean = df.dropna()
- Pros: Simple, preserves relationships in remaining data
- Cons: Can lose significant data, only valid under MCAR
- Use when: < 5% missing, confirmed MCAR
Pairwise deletion: Use available data for each calculation
df.dropna(subset=['col1', 'col2']) # Only for this analysis
- Pros: Preserves more data
- Cons: Different N for different analyses
- Use when: Analysis-specific completeness needed
Attribute deletion: Remove columns with too much missing
df.drop(columns=[col for col in df if df[col].isnull().mean() > 0.5])
- Use when: Column > 50% missing, not critical
Imputation Methods
Simple imputation:
| Method | Formula | Best for |
|---|---|---|
| Mean | df[col].fillna(df[col].mean()) | Numeric, normal distribution |
| Median | df[col].fillna(df[col].median()) | Numeric, skewed/outliers |
| Mode | df[col].fillna(df[col].mode()[0]) | Categorical |
| Constant | df[col].fillna(value) | Domain-specific defaults |
| Forward fill | df[col].ffill() | Time series |
Warnings:
- Mean/median underestimates variance
- Weakens correlations between variables
- Can introduce bias if not MCAR
Model-based imputation:
# Regression imputation
from scripts.missing_data import impute_regression
df_imputed = impute_regression(df, target='income', predictors=['age', 'education'])
- Preserves relationships better
- Still underestimates variance unless noise added
Multiple imputation:
- Impute multiple times with random variation
- Analyze each imputed dataset
- Pool results accounting for imputation uncertainty
Indicator Method
Add binary flag for missingness (useful for ML):
df['col_missing'] = df['col'].isnull().astype(int)
df['col'] = df['col'].fillna(df['col'].median())
Duplicate Repair Strategies
Record Selection
Survivor strategy: Keep most complete record
from scripts.duplicate_detection import merge_records
survivor = merge_records(df, cluster, strategy='most_complete')
First/Last: Keep chronologically first or latest
- Use when: temporal precedence matters
Record Fusion
Attribute-based: Create composite record
merged = merge_records(df, cluster, strategy='combine')
# Takes first non-null value for each attribute
Aggregation rules by data type:
- Numeric: max, min, avg, sum (depends on semantics)
- String: longest, most recent, most frequent
- Date: earliest, latest
Post-merge Actions
Update foreign keys in related tables:
# After merging records 2,3,4 into record 1
other_table['fk_col'] = other_table['fk_col'].replace({2: 1, 3: 1, 4: 1})
Constraint Violation Repair
Deletion vs. Modification
Tuple deletion: Remove violating rows entirely
- Simple but loses information
- May cascade to dependent tables
Value modification: Update values to satisfy constraints
- Preserves more data
- Risk of introducing other errors
Consistent Query Answering (CQA)
Instead of repairing data, return query results consistent across ALL possible minimal repairs.
-- Original: SELECT * FROM Students
-- CQA version: Only return students with no FD conflicts
SELECT * FROM Students s1
WHERE NOT EXISTS (
SELECT * FROM Students s2
WHERE s1.id = s2.id AND s1.name != s2.name
)
Active Integrity Constraints
Specify repair actions with constraints:
IF violation(student_id -> name) THEN keep_most_recent
IF violation(age BETWEEN 0 AND 150) THEN set_null
Anomaly/Outlier Repair
Correction strategies
Capping/Winsorization: Replace extremes with boundary values
lower, upper = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(lower, upper)
Deletion: Remove outlier rows
- Risk of bias if outliers are systematic
Imputation: Treat as missing and impute
df.loc[outlier_mask, 'col'] = np.nan
df = impute_median(df, 'col')
Investigation: Verify if genuine extreme or error
- Sometimes outliers are real and important!
Repair Decision Framework
1. Assess impact
- How critical is this data?
- What's the cost of wrong repair vs. no repair?
2. Understand cause
- Systematic error → fix at source
- Random error → statistical repair
3. Choose strategy
- Low risk: aggressive cleaning
- High risk: conservative (CQA, flagging)
4. Validate
- Check repair didn't introduce new issues
- Compare distributions before/after
- Verify business logic still holds
5. Document
- Record what was changed and why
- Enable audit trail
Best Practices
- Never destroy original data - keep backups
- Document all transformations - reproducibility
- Prefer prevention - fix data entry, not downstream
- Validate repairs - check for unintended consequences
- Consider context - repair strategy depends on use case