ElasticFlow
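Hub · All Skills · By Department · By Role · By Tool · By Metric · MCPs · Publishers
Main site · Log in · Sign up
ElasticFlow

Transform your business with AI-powered workflow automation. A unified platform for every enterprise need.

Follow

Platform

  • Features
  • Benefits
  • Use cases
  • Workflow library

Use cases

  • Sales
  • Marketing
  • Finance & Legal
  • HR

Catalog

  • Departments
  • Roles
  • Tools
  • Metrics
  • Platforms

Growth

  • Referral program
  • Partners

Legal

  • Privacy policy
  • Terms of service
  • Cookie policy
  • Acceptable use
  • Security
  • SLA

© 2026 ElasticFlow. All rights reserved.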

AI Skill · Audit data quality · Product & Engineering

Find, explain, and fix common data problems before reports or models use them. — Claude Skill

Claude Skill for Claude Code · By: masterkram · Run: /data-quality (in Claude) · Updated: June 12, 2026 · vmain@8b32590

Compatible with: Claude

Profiles datasets for missing values, duplicates, outliers, invalid values, and rule violations, then verifies whether the repair actually improved the data.

  • Finds missing values, duplicate records, outliers, invalid domains, and broken constraints.
  • Explains whether a problem is harmless, needs cleanup, or should block reporting.
  • Suggests repair strategies instead of blindly deleting suspicious rows.
  • Re-runs checks after cleanup so teams can see what improved.
Users today

An analyst eyeballs a dataset, deletes obvious bad rows, and hopes downstream analysis is still valid.

With /data-quality

Run /data-quality to define quality rules, quantify defects, repair systematically, and verify improvements.

1. Profile → 2. Detect → 3. Repair → 4. Verify

Who it's for

Data Engineer

Profile, clean, and verify datasets before analysis, migration, or ML.

Analytics Engineer

Make analytical datasets fit for reporting with measured quality checks.

Features

Dataset readiness check

Decide whether data is reliable enough for a report, migration, or model.

Duplicate cleanup

Find duplicate customers, accounts, products, or transactions and choose how to merge them.

Rule validation

Check required fields, allowed values, unique keys, and relationships between tables.

How it works

  1. Define the dataset, business rules, and why the data matters.
  2. Profile completeness, uniqueness, validity, consistency, and outliers.
  3. Choose a repair strategy for missing values, duplicates, strings, or constraints.
  4. Verify the cleaned data against the original rules and document remaining risk.

Input options

Dataset

CSV, spreadsheet, SQL extract, dataframe, schema, or sample rows.

Example

What the user pastes:

Dataset: customer_export.csv, 52,000 rows
Use case: migrate accounts into new CRM next week
Rules:
- email is required
- account_id must be unique
- country must be a valid ISO country
- annual_revenue should not be negative
Known concern: duplicate company names and missing emails

Useful result:

Findings: 1,148 rows missing email; 392 duplicate account_id values; 86 invalid country values; 14 negative revenue values.

Migration risk: Duplicate account_id is blocking. Missing emails may be acceptable only if the CRM allows account records without contacts.

Repair plan: Deduplicate account_id first, map country values, flag missing emails for enrichment, and investigate negative revenue instead of deleting it.

Proof after cleanup: Rerun the same checks and show before/after counts so the migration owner can sign off.
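The four rules in this example can be counted with a few lines of plain Python. This is a minimal sketch and not the skill's own implementation: the sample rows, the tiny `VALID_COUNTRIES` allow-list, and the `count_violations` helper are all hypothetical, and a real check would use the full ISO 3166 country list.

```python
from collections import Counter

# Hypothetical, deliberately tiny allow-list; a real check would load every ISO 3166-1 alpha-2 code.
VALID_COUNTRIES = {"US", "DE", "KR", "NL"}

def count_violations(rows):
    """Count violations of the four example rules (illustrative helper)."""
    id_counts = Counter(row["account_id"] for row in rows)
    return {
        # Empty strings are treated as missing, which is an assumption.
        "missing_email": sum(1 for r in rows if not r.get("email")),
        # Number of account_id values that occur more than once.
        "duplicate_account_id": sum(1 for c in id_counts.values() if c > 1),
        "invalid_country": sum(1 for r in rows if r.get("country") not in VALID_COUNTRIES),
        "negative_revenue": sum(
            1 for r in rows
            if r.get("annual_revenue") is not None and r["annual_revenue"] < 0
        ),
    }

rows = [
    {"account_id": 1, "email": "a@example.com", "country": "US", "annual_revenue": 1200.0},
    {"account_id": 2, "email": None, "country": "Germany", "annual_revenue": 300.0},
    {"account_id": 2, "email": "b@example.com", "country": "DE", "annual_revenue": -50.0},
    {"account_id": 3, "email": "", "country": "KR", "annual_revenue": None},
]

report = count_violations(rows)
print(report)
```

Running the same function on the cleaned rows gives the "after" counts for the sign-off comparison.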

Metrics improved

  • Data Quality: +20% to +40% (Product & Engineering)
  • Assertion Pass Rate: +10% to +25% (Product & Engineering)
  • Data Quality Incident Rate: -10% to -25% (Product & Engineering)

Supported tools

Google Sheets (manual)

Use spreadsheet datasets as inputs for lightweight profiling and cleanup.

Snowflake (manual)

Run warehouse-backed data quality checks and repairs against Snowflake extracts.

SQL (manual)

Use SQL extracts and constraints as inputs for data quality profiling and verification.

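Similar skills

Recommended automatically based on attribute overlap. A side-by-side comparison shows the differences. In each comparison below, the first value appears to be Data Quality's and the second the similar skill's.

A/B Test Design

By: Corey Haines
  • What you must provide: text, file-upload vs text, tool-access
  • Output format: markdown, csv vs markdown
  • Human review: review-required vs approval-required

Campaign Brief Generator

By: Gooseworks
  • What you must provide: text, file-upload vs text
  • Output format: markdown, csv vs markdown
  • Data sensitivity: confidential vs internal

Competitor Intelligence

By: Gooseworks
  • What you must provide: text, file-upload vs text, api-credentials
  • Output format: markdown, csv vs markdown, email
  • Human review: review-required vs none

Sorted by attribute overlap × differentiation. Data Quality shares 12 or more attributes with each item.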

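Want to try Data Quality?

Choose how to get started.

Run in Claude Code
Free. Open source.

Install and run this skill locally on your computer.

  1. Install Claude Code. Open a terminal on your computer and paste this command:

  2. Install the skill. This command downloads the skill and all of its files to your computer:

     Add -g at the end to use it in every project.

  3. Run it. Start Claude Code, then enter the command /data-quality.

Next: view the source on GitHub.

Use in ElasticFlow
Team and collaboration features

Run skills in the browser. Share results, manage access, and collaborate with your team. No terminal required.

14-day free trial. Cancel anytime.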

Data Quality Skill

Systematic approach to diagnosing and fixing data quality problems.

Data Quality Process

Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
  1. Define: Understand data context, business rules, quality requirements
  2. Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
  3. Clean: Apply appropriate repair strategies
  4. Measure: Validate repairs, quantify improvement

Quick Reference

| Problem | Script | Key Function |
|---|---|---|
| Data overview | data_profiling.py | profile_dataframe(df) |
| Find quality issues | data_profiling.py | detect_glitches(df) |
| Missing values | missing_data.py | analyze_missing(df) |
| Imputation | missing_data.py | impute_mean/median/regression() |
| Duplicates | duplicate_detection.py | find_duplicates(df, cols) |
| Deduplication | duplicate_detection.py | deduplicate(df, cols) |
| Outliers | anomaly_detection.py | detect_anomalies(df) |
| Constraint check | constraint_checking.py | validate_constraints(df, rules) |
| String matching | similarity_metrics.py | jaro_winkler_similarity() |

Workflow

Step 1: Profile the Data

from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report

# Quick overview
print(generate_quality_report(df))

# Detailed profile
profile = profile_dataframe(df)

# Find issues
glitches = detect_glitches(df)

Step 2: Analyze Specific Issues

Missing Data:

from scripts.missing_data import analyze_missing, test_mcar

analysis = analyze_missing(df)
# Check if safe to delete rows
mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])

Duplicates:

from scripts.duplicate_detection import find_duplicates, cluster_duplicates

matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
clusters = cluster_duplicates(matches)

Outliers:

from scripts.anomaly_detection import detect_anomalies, iqr_outliers

# Multi-column summary
anomalies = detect_anomalies(df, method='iqr')

# Single column detail
result = iqr_outliers(df, 'price', multiplier=1.5)

Constraints:

from scripts.constraint_checking import validate_constraints

constraints = [
    {'type': 'unique', 'columns': ['id']},
    {'type': 'not_null', 'columns': ['name', 'email']},
    {'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
    {'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
]
results = validate_constraints(df, constraints)

Step 3: Clean the Data

Handle Missing:

from scripts.missing_data import impute_median, impute_regression, listwise_deletion

# Simple: median for numeric
df_clean = impute_median(df, 'age')

# Better: regression-based
df_clean = impute_regression(df, 'income', ['age', 'education'])

# If MCAR confirmed
df_clean = listwise_deletion(df)

Remove Duplicates:

from scripts.duplicate_detection import deduplicate

df_clean, summary = deduplicate(
    df, 
    columns=['name', 'email', 'address'],
    threshold=0.8,
    merge_strategy='most_complete'
)
print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")

Handle Outliers:

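from scripts.anomaly_detection import detect_anomalies

# Cap extreme values at the 1st and 99th percentiles
q01, q99 = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(q01, q99)

# Or remove outlier rows (assumes 'outlier_indices' holds row index labels)
outlier_idx = detect_anomalies(df)['col']['outlier_indices']
df_clean = df.drop(index=outlier_idx)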

Step 4: Validate

Re-run profiling and constraint checks on cleaned data to verify improvements.
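One way to make the verification concrete is to diff the defect counts from the two runs. The sketch below is only an illustration: `compare_reports` is a hypothetical helper, the "before" counts are taken from the customer export example on this page, and the "after" counts are invented for the demonstration.

```python
def compare_reports(before, after):
    """Return per-check (before, after, change) tuples; a negative change means fewer defects."""
    return {
        check: (before[check], after.get(check, 0), after.get(check, 0) - before[check])
        for check in before
    }

# "before" reuses the example findings; "after" is hypothetical.
before = {"missing_email": 1148, "duplicate_account_id": 392, "invalid_country": 86}
after = {"missing_email": 1148, "duplicate_account_id": 0, "invalid_country": 3}

comparison = compare_reports(before, after)
for check, (b, a, delta) in comparison.items():
    print(f"{check}: {b} -> {a} ({delta:+d})")

# Checks that did not improve deserve a second look before sign-off.
not_improved = [check for check, (b, a, _) in comparison.items() if a >= b and b > 0]
print(not_improved)
```

Keeping the check definitions identical between the two runs is what makes the before/after numbers comparable.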

References

For deeper understanding:

  • references/dimensions.md: Data quality dimensions (accuracy, completeness, etc.)
  • references/glitch_taxonomy.md: Types of data glitches and detection approaches
  • references/repair_strategies.md: Detailed repair and cleaning strategies

Key Concepts

Data Quality = Fit for Use

  • Free of defects
  • Has features needed for the task
  • Right information, right place, right time

Missing Data Mechanisms:

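  • MCAR: Missing Completely At Random (deleting rows does not bias results, though it shrinks the sample)
  • MAR: Missing At Random (missingness depends only on observed values, so imputation may work)
  • MNAR: Missing Not At Random (missingness depends on the unobserved value itself; most problematic)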

Constraints:

  • Functional Dependency: X → Y means X uniquely determines Y
  • Referential Integrity: foreign keys reference valid primary keys
  • Domain Constraints: values within allowed set/range
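A functional dependency X → Y can be checked by grouping rows on X and looking for groups with more than one distinct Y. The following is a minimal standard-library sketch; `fd_violations` and the sample rows are hypothetical and are not the skill's `validate_constraints` function.

```python
from collections import defaultdict

def fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in determinant)
        seen[key].add(tuple(row[col] for col in dependent))
    return {key: values for key, values in seen.items() if len(values) > 1}

rows = [
    {"id": 1, "name": "Ada"},
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 2, "name": "Grase"},  # conflicts with the row above: id -> name is violated
]

violations = fd_violations(rows, determinant=["id"], dependent=["name"])
print(violations)
```

The check only reports which keys are in conflict; deciding which of the conflicting values is correct is a separate repair step.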

Entity Resolution:

  • Blocking (for example, a sliding window over sorted records) reduces pairwise comparisons from O(n²) to roughly O(n·window)
  • Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
  • Cluster by transitive closure, merge by strategy
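The blocking and clustering ideas above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the skill's `find_duplicates` or `cluster_duplicates`: it uses sorted-neighborhood blocking, substitutes `difflib.SequenceMatcher` for Jaro-Winkler, and the `cluster_names` helper, window size, and threshold are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in metric from the standard library, not Jaro-Winkler.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def cluster_names(names, window=3, threshold=0.8):
    """Sorted-neighborhood blocking followed by transitive-closure clustering."""
    order = sorted(range(len(names)), key=lambda i: names[i].lower())
    parent = list(range(len(names)))
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if similarity(names[i], names[j]) >= threshold:
                parent[find(parent, i)] = find(parent, j)
    clusters = {}
    for i in range(len(names)):
        clusters.setdefault(find(parent, i), []).append(names[i])
    return sorted(sorted(group) for group in clusters.values())

names = ["John Smith", "Jon Smith", "Alice Wong", "Alice Wang", "Bob Lee"]
clusters = cluster_names(names)
print(clusters)
```

The window trades recall for speed: two duplicates that sort far apart (for example, because of a typo in the first letter) are never compared, which is why practical pipelines often run several passes with different sort keys.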

Similarity Metrics Comparison

| Metric | Best For | Example |
|---|---|---|
| Jaro-Winkler | Names, short strings | "Robert" vs "Rupert" |
| Levenshtein | Typos, edit distance | "recieve" vs "receive" |
| Jaccard | Token/word comparison | "John Doe" vs "Doe, John" |
| Q-gram | Fuzzy substring matching | Partial matches |
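Two of these metrics are short enough to write out. The sketch below is a plain-Python illustration using the examples from the comparison; the function names are hypothetical and are not the ones exported by similarity_metrics.py.

```python
import re

def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        previous = current
    return previous[-1]

def jaccard_tokens(a, b):
    """Jaccard similarity over lowercase word tokens, ignoring punctuation."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(levenshtein("recieve", "receive"))        # 2: the swapped "ie" costs two substitutions
print(jaccard_tokens("John Doe", "Doe, John"))  # 1.0: same tokens, different order
```

Levenshtein is sensitive to word order, so it would score "John Doe" and "Doe, John" as far apart, while token-level Jaccard treats them as identical.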

Reference documents


name: data-quality
description: Diagnose and fix data quality problems in datasets. Use when working with dirty data, finding duplicates, handling missing values, detecting outliers/anomalies, validating constraints (functional dependencies, referential integrity), profiling datasets, or cleaning data for analysis or ML. Covers the full data quality lifecycle - define, detect, clean, measure.

Data Quality Dimensions

Data quality is determined by "fitness for use" - the capability of data to meet requirements for a given context.

Core Dimensions

Accuracy

Data correctly represents the real-world entity or event.

Measurement:

  • Compare against authoritative source (gold standard)
  • Expert review sampling
  • Cross-reference validation

Common Issues:

  • Typos and transcription errors
  • Outdated information
  • Measurement errors

Completeness

All required data values are present.

Levels:

  • Schema completeness: all expected attributes exist
  • Column completeness: % non-null values per column
  • Population completeness: all expected records present

Measurement:

completeness = (non_null_count / total_count) * 100
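As a small worked instance of the formula, the sketch below computes column completeness over a list of records. The `column_completeness` helper is hypothetical, and counting empty strings as missing is an assumption that may not suit every dataset.

```python
def column_completeness(rows, column):
    """Percentage of rows whose value for `column` is present (not None and not empty)."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) not in (None, ""))
    return non_null / len(rows) * 100

rows = [
    {"email": "a@example.com"},
    {"email": None},
    {"email": ""},
    {"email": "b@example.com"},
]
print(column_completeness(rows, "email"))  # 50.0
```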

Consistency

Data values don't contradict each other across the dataset or systems.

Types:

  • Intra-record: values within same record consistent (age matches birth date)
  • Inter-record: values across records consistent (no duplicate IDs)
  • Cross-system: same entity has consistent values in different databases

Measurement:

  • Constraint violation counts
  • Cross-reference mismatches

Timeliness

Data is current enough for the intended use.

Aspects:

  • Currency: when data was last updated
  • Volatility: how often data changes
  • Latency: delay between real-world change and data update

Measurement:

timeliness_score = 1 - (current_time - last_update) / max_acceptable_age
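The raw formula goes negative once data is older than `max_acceptable_age`. A minimal sketch that clamps the result to the range 0 to 1 is shown below; the clamping and the `timeliness_score` function signature are assumptions added for illustration, not part of the original formula.

```python
from datetime import datetime, timedelta

def timeliness_score(last_update, now, max_acceptable_age):
    """1.0 for data updated just now, falling linearly to 0.0 at max_acceptable_age."""
    age = now - last_update
    score = 1 - age / max_acceptable_age  # timedelta / timedelta yields a float
    return max(0.0, min(1.0, score))      # clamp: stale data scores 0.0 instead of going negative

now = datetime(2026, 6, 12, 12, 0)
print(timeliness_score(datetime(2026, 6, 12, 6, 0), now, timedelta(hours=24)))  # 0.75
print(timeliness_score(datetime(2026, 6, 1, 0, 0), now, timedelta(hours=24)))   # 0.0
```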

Validity

Data conforms to defined formats, types, and ranges.

Checks:

  • Data type validation
  • Format validation (dates, emails, phones)
  • Range/domain validation
  • Pattern matching

Uniqueness

No unintended duplicates exist.

Levels:

  • Primary key uniqueness
  • Natural key uniqueness
  • Entity-level deduplication

Extended Dimensions

Relevance

Data is applicable and useful for the task.

Interpretability

Data meaning is clear and unambiguous (well-documented).

Accessibility

Authorized users can easily obtain the data.

Credibility

Data comes from trustworthy sources.

Business Context

Different uses require different quality priorities:

| Use Case | Priority Dimensions |
|---|---|
| Financial reporting | Accuracy, Completeness, Consistency |
| Real-time analytics | Timeliness, Availability |
| Customer communications | Accuracy, Completeness |
| ML model training | Completeness, Consistency, Validity |
| Regulatory compliance | Accuracy, Completeness, Auditability |

Measuring Overall Quality

Weighted composite score:

quality_score = sum(weight[dim] * score[dim] for dim in dimensions) / sum(weight[dim] for dim in dimensions)

Quality thresholds:

  • Critical: < 70% - Immediate attention required
  • Warning: 70-90% - Improvement needed
  • Acceptable: > 90% - Monitor for degradation
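The composite score and the thresholds can be combined in a short worked example. The dimension scores and weights below are hypothetical, and treating the boundary values 70 and 90 as "Warning" is an assumption, since the thresholds do not say which band they belong to.

```python
def quality_score(scores, weights):
    """Weighted average of per-dimension scores (each score in percent)."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(weights[dim] * scores[dim] for dim in scores) / total_weight

def classify(score):
    """Map a percentage to the Critical / Warning / Acceptable bands."""
    if score < 70:
        return "Critical"
    if score <= 90:
        return "Warning"
    return "Acceptable"

# Hypothetical per-dimension scores and weights for a reporting dataset.
scores = {"accuracy": 95.0, "completeness": 80.0, "consistency": 65.0}
weights = {"accuracy": 2.0, "completeness": 1.0, "consistency": 1.0}

overall = quality_score(scores, weights)
print(overall, classify(overall))  # 83.75 Warning
```

A composite can hide a failing dimension: here consistency alone would be "Critical", so it is worth reporting the per-dimension bands alongside the overall score.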

Data Glitch Taxonomy

Data glitches are defects that compromise data quality. Understanding glitch types guides detection and repair strategies.

Glitch Categories

1. Missing Data

Value-level: Individual cells are NULL/empty

  • Easy to detect: df.isnull().sum()

Record-level: Entire rows missing from expected population

  • Hard to detect: requires external reference

Attribute-level: Expected columns missing from schema

  • Moderate: compare against schema documentation

Missing Data Mechanisms:

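| Mechanism | Description | Implication |
|---|---|---|
| MCAR | Missing Completely At Random: missingness is unrelated to any values | Deleting rows introduces no bias, only a smaller sample |
| MAR | Missing At Random: missingness is correlated with observed data | Imputation may work |
| MNAR | Missing Not At Random: missingness is correlated with the missing value itself | Most problematic |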

2. Inconsistent/Erroneous Data

Syntactic errors:

  • Typos: "Jhon" instead of "John"
  • Formatting: inconsistent date formats, phone formats
  • Encoding: character encoding issues

Semantic errors:

  • Wrong values: age = 250
  • Contradictions: birth_date > current_date
  • Constraint violations: duplicate primary keys

Detection:

  • Constraint checking (FDs, referential integrity)
  • Domain validation
  • Pattern analysis

3. Anomalies and Outliers

Point anomalies: Single data points far from normal

  • Age = -5 or Age = 200

Contextual anomalies: Abnormal in specific context

  • Temperature = 90°F normal in summer, anomaly in winter

Collective anomalies: Groups of related points anomalous together

  • Sudden spike in all sensor readings

Detection methods:

  • Statistical: z-score, IQR, modified z-score
  • ML-based: isolation forest, autoencoders, k-NN
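Two of the statistical detectors can be sketched with the `statistics` module. The helper names below are hypothetical (the skill ships its own `iqr_outliers` for DataFrames), and the constants 0.6745 and 3.5 are the commonly cited defaults for the modified z-score rather than values taken from this document.

```python
import statistics

def iqr_outlier_values(values, multiplier=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

def modified_z_outlier_values(values, threshold=3.5):
    """Values whose modified z-score, based on the median and MAD, exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # the score is undefined when at least half the values equal the median
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]

ages = [23, 25, 27, 29, 31, 33, 35, 200]
print(iqr_outlier_values(ages))         # [200]
print(modified_z_outlier_values(ages))  # [200]
```

Both methods rely on quartiles or medians, so a single extreme value does not distort the cut-offs the way it would distort a mean and standard deviation in a plain z-score.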

4. Semantic Duplicates

Records referring to the same real-world entity with different representations:

Row 1: "John Smith", "123 Main St", "NYC"
Row 2: "J. Smith", "123 Main Street", "New York City"

Detection:

  • Blocking to reduce comparison space
  • Fuzzy matching with similarity metrics
  • Clustering via transitive closure

5. Undocumented Data

Data without adequate metadata:

  • Unknown column meanings
  • Missing data dictionaries
  • Unclear units of measurement
  • No lineage/provenance information

Symptoms:

  • Column names like "col1", "field_a", "x"
  • No documentation about allowed values
  • Ambiguous semantics

Glitch Complexes

Real data often has compound glitch patterns:

Multi-type Glitch

Single value has multiple glitch types:

  • Value is both an outlier AND inconsistent with constraint

Concomitant Glitches

Same record has glitches in multiple columns:

  • Missing name AND invalid email AND outlier age

Multi-occurrent Glitches

Same glitch type appears across many records:

  • 1000 records all missing the same field

Detection Complexity Factors

Relevance

Severity varies by domain:

  • Missing email: critical for marketing, irrelevant for shipping

Ambiguity

Boundary between valid/invalid unclear:

  • Is age 120 an error or just rare?

Complex Dependencies

One glitch can mask another:

  • Missing value hides what would be a constraint violation

Dynamic Nature

Glitch types evolve:

  • New data sources introduce new error patterns

Glitch Quantification

Per-value scoring

glitch_signature = [has_missing, has_outlier, has_format_error, ...]
glitch_score = sum(weight[i] * glitch_signature[i] for i in range(len(glitch_signature)))

Global scoring

total_glitch_score = sum(all_value_scores) / total_values
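The two scoring formulas can be made concrete as follows. This is a sketch under stated assumptions: the glitch types, the `WEIGHTS` values, and the use of dictionaries instead of positional lists are all hypothetical choices made for readability.

```python
# Hypothetical weights: how much each glitch type matters for the use case at hand.
WEIGHTS = {"missing": 1.0, "outlier": 0.5, "format_error": 0.75}

def glitch_score(signature):
    """Weighted sum of the glitch flags raised for a single value."""
    return sum(WEIGHTS[kind] for kind, present in signature.items() if present)

def total_glitch_score(signatures):
    """Average per-value score across all inspected values."""
    return sum(glitch_score(s) for s in signatures) / len(signatures)

signatures = [
    {"missing": False, "outlier": False, "format_error": False},  # clean value
    {"missing": True, "outlier": False, "format_error": False},
    {"missing": False, "outlier": True, "format_error": True},    # multi-type glitch
    {"missing": False, "outlier": False, "format_error": False},
]

print([glitch_score(s) for s in signatures])  # [0, 1.0, 1.25, 0]
print(total_glitch_score(signatures))         # 0.5625
```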

Detection Priority

  1. Start broad: Profile entire dataset (profiling script)
  2. Identify patterns: Find common glitch types
  3. Prioritize: Focus on high-impact, high-frequency issues
  4. Deep dive: Investigate root causes of priority glitches

Data Repair Strategies

Repairing data quality issues involves trade-offs between data loss, accuracy, and computational cost.

Fundamental Concept: Minimal Repair

Minimal repair = smallest change that removes constraint violations.

Key insight: When a constraint is violated, it's ambiguous which value is wrong. Minimal repair preserves as much original data as possible.

Missing Data Strategies

Deletion Methods

Listwise deletion: Remove entire row if any value missing

df_clean = df.dropna()
  • Pros: Simple, preserves relationships in remaining data
  • Cons: Can lose significant data, only valid under MCAR
  • Use when: < 5% missing, confirmed MCAR

Pairwise deletion: Use available data for each calculation

df.dropna(subset=['col1', 'col2'])  # Only for this analysis
  • Pros: Preserves more data
  • Cons: Different N for different analyses
  • Use when: Analysis-specific completeness needed

Attribute deletion: Remove columns with too much missing

df.drop(columns=[col for col in df if df[col].isnull().mean() > 0.5])
  • Use when: Column > 50% missing, not critical

Imputation Methods

Simple imputation:

| Method | Formula | Best for |
|---|---|---|
| Mean | df[col].fillna(df[col].mean()) | Numeric, normal distribution |
| Median | df[col].fillna(df[col].median()) | Numeric, skewed/outliers |
| Mode | df[col].fillna(df[col].mode()[0]) | Categorical |
| Constant | df[col].fillna(value) | Domain-specific defaults |
| Forward fill | df[col].ffill() | Time series |

Warnings:

  • Mean/median imputation underestimates variance
  • Weakens correlations between variables
  • Can introduce bias if not MCAR
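The variance warning is easy to see with a small worked example. The numbers below are invented for illustration: four observed values and two missing ones that are filled with the observed mean.

```python
import statistics

observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 2  # hypothetical: two values are missing from this column

mean = statistics.fmean(observed)
imputed = observed + [mean] * n_missing  # mean imputation adds values with zero deviation

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)

print(var_observed)           # 5.0
print(round(var_imputed, 3))  # 3.333
```

The sum of squared deviations stays at 20 while the count grows from 4 to 6, so the variance shrinks by exactly the fraction of values that were imputed.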

Model-based imputation:

# Regression imputation
from scripts.missing_data import impute_regression
df_imputed = impute_regression(df, target='income', predictors=['age', 'education'])
  • Preserves relationships better
  • Still underestimates variance unless noise added

Multiple imputation:

  1. Impute multiple times with random variation
  2. Analyze each imputed dataset
  3. Pool results accounting for imputation uncertainty

Indicator Method

Add binary flag for missingness (useful for ML):

df['col_missing'] = df['col'].isnull().astype(int)
df['col'] = df['col'].fillna(df['col'].median())

Duplicate Repair Strategies

Record Selection

Survivor strategy: Keep most complete record

from scripts.duplicate_detection import merge_records
survivor = merge_records(df, cluster, strategy='most_complete')

First/Last: Keep chronologically first or latest

  • Use when: temporal precedence matters

Record Fusion

Attribute-based: Create composite record

merged = merge_records(df, cluster, strategy='combine')
# Takes first non-null value for each attribute

Aggregation rules by data type:

  • Numeric: max, min, avg, sum (depends on semantics)
  • String: longest, most recent, most frequent
  • Date: earliest, latest

Post-merge Actions

Update foreign keys in related tables:

# After merging records 2,3,4 into record 1
other_table['fk_col'] = other_table['fk_col'].replace({2: 1, 3: 1, 4: 1})

Constraint Violation Repair

Deletion vs. Modification

Tuple deletion: Remove violating rows entirely

  • Simple but loses information
  • May cascade to dependent tables

Value modification: Update values to satisfy constraints

  • Preserves more data
  • Risk of introducing other errors

Consistent Query Answering (CQA)

Instead of repairing data, return query results consistent across ALL possible minimal repairs.

-- Original: SELECT * FROM Students
-- CQA version: Only return students with no FD conflicts
SELECT * FROM Students s1
WHERE NOT EXISTS (
    SELECT * FROM Students s2
    WHERE s1.id = s2.id AND s1.name != s2.name
)

Active Integrity Constraints

Specify repair actions with constraints:

IF violation(student_id -> name) THEN keep_most_recent
IF violation(age BETWEEN 0 AND 150) THEN set_null
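The two rules above can be approximated in plain Python by pairing each constraint check with its repair action. This is a rough sketch, not an implementation of any particular active-integrity-constraint system: the helper names, the `updated_at` column used to decide which row is most recent, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def keep_most_recent(rows, key, dependent, timestamp):
    """Where key -> dependent is violated, keep only the latest row for that key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = []
    for group in groups.values():
        if len({row[dependent] for row in group}) > 1:  # FD violated for this key
            result.append(max(group, key=lambda row: row[timestamp]))
        else:
            result.extend(group)
    return result

def null_out_of_range(row, column, low, high):
    """Where the domain rule is violated, set the value to None."""
    value = row.get(column)
    if value is not None and not (low <= value <= high):
        return {**row, column: None}
    return row

rows = [
    {"student_id": 1, "name": "Ada", "age": 20, "updated_at": "2026-01-01"},
    {"student_id": 1, "name": "Ada L.", "age": 21, "updated_at": "2026-03-01"},
    {"student_id": 2, "name": "Bob", "age": 999, "updated_at": "2026-02-01"},
]

deduped = keep_most_recent(rows, key="student_id", dependent="name", timestamp="updated_at")
repaired = [null_out_of_range(row, "age", 0, 150) for row in deduped]
print(repaired)
```

Both helpers return new rows instead of mutating their input, which keeps the original data available for an audit trail.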

Anomaly/Outlier Repair

Correction strategies

Capping/Winsorization: Replace extremes with boundary values

lower, upper = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(lower, upper)

Deletion: Remove outlier rows

  • Risk of bias if outliers are systematic

Imputation: Treat as missing and impute

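import numpy as np
from scripts.missing_data import impute_median
df.loc[outlier_mask, 'col'] = np.nan  # outlier_mask: boolean Series marking outlier rows
df = impute_median(df, 'col')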

Investigation: Verify if genuine extreme or error

  • Sometimes outliers are real and important!

Repair Decision Framework

1. Assess impact
   - How critical is this data?
   - What's the cost of wrong repair vs. no repair?

2. Understand cause
   - Systematic error → fix at source
   - Random error → statistical repair

3. Choose strategy
   - Low risk: aggressive cleaning
   - High risk: conservative (CQA, flagging)

4. Validate
   - Check repair didn't introduce new issues
   - Compare distributions before/after
   - Verify business logic still holds

5. Document
   - Record what was changed and why
   - Enable audit trail

Best Practices

  1. Never destroy original data - keep backups
  2. Document all transformations - reproducibility
  3. Prefer prevention - fix data entry, not downstream
  4. Validate repairs - check for unintended consequences
  5. Consider context - repair strategy depends on use case