ElasticFlow
HubAll SkillsBy DepartmentBy RoleBy ToolBy MetricMCPsPublishers
メインサイトログイン登録
ElasticFlow

AI搭載のワークフロー自動化でビジネスを変革。エンタープライズのあらゆるニーズを満たす統合プラットフォーム。

フォローする

プラットフォーム

  • 機能
  • メリット
  • ユースケース
  • ワークフローライブラリ

ユースケース

  • 営業
  • マーケティング
  • 財務・法務
  • 人事

カタログ

  • 部門
  • ロール
  • ツール
  • メトリクス
  • プラットフォーム

成長

  • 紹介プログラム
  • パートナー

法務

  • プライバシーポリシー
  • 利用規約
  • Cookieポリシー
  • 許容される利用
  • セキュリティ
  • SLA

© 2026 ElasticFlow. All rights reserved.

ElasticFlow
HubAll SkillsBy DepartmentBy RoleBy ToolBy MetricMCPsPublishers
メインサイトログイン登録
ElasticFlow

AI搭載のワークフロー自動化でビジネスを変革。エンタープライズのあらゆるニーズを満たす統合プラットフォーム。

フォローする

プラットフォーム

  • 機能
  • メリット
  • ユースケース
  • ワークフローライブラリ

ユースケース

  • 営業
  • マーケティング
  • 財務・法務
  • 人事

カタログ

  • 部門
  • ロール
  • ツール
  • メトリクス
  • プラットフォーム

成長

  • 紹介プログラム
  • パートナー

法務

  • プライバシーポリシー
  • 利用規約
  • Cookieポリシー
  • 許容される利用
  • セキュリティ
  • SLA

© 2026 ElasticFlow. All rights reserved.

ElasticFlow
HubAll SkillsBy DepartmentBy RoleBy ToolBy MetricMCPsPublishers
メインサイトログイン登録
  1. ホーム
  2. スキル
  3. Data Quality
AIスキルAudit data qualityProduct & Engineering

Find, explain, and fix common data problems before reports or models use them. — Claude Skill

Claude Code向けClaudeスキル · 提供:masterkram · 実行:/data-quality(Claude内)·更新日:2026年6月12日·vmain@8b32590

対応ChatGPTChatGPTClaudeClaudeCCClaude CodeCDClaude DesktopXCodex / Codex CLICursorCursorGeminiGeminiHHermes (via Continue / Cline)OpenClawOpenClawWindsurfWindsurf

Profiles datasets for missing values, duplicates, outliers, invalid values, and rule violations, then verifies whether the repair actually improved the data.

  • Finds missing values, duplicate records, outliers, invalid domains, and broken constraints.
  • Explains whether a problem is harmless, needs cleanup, or should block reporting.
  • Suggests repair strategies instead of blindly deleting suspicious rows.
  • Re-runs checks after cleanup so teams can see what improved.
あなた今日

An analyst eyeballs a dataset, deletes obvious bad rows, and hopes downstream analysis is still valid.

/data-quality使用時

Run /data-quality to define quality rules, quantify defects, repair systematically, and verify improvements.

1 Profile2 Detect3 Repair4 Verify

対象ユーザー

Data Engineer

Profile, clean, and verify datasets before analysis, migration, or ML.

この役職のスキルを見る
Analytics Engineer

Make analytical datasets fit for reporting with measured quality checks.

この役職のスキルを見る

機能

Dataset readiness check

Decide whether data is reliable enough for a report, migration, or model.

Duplicate cleanup

Find duplicate customers, accounts, products, or transactions and choose how to merge them.

Rule validation

Check required fields, allowed values, unique keys, and relationships between tables.

仕組み

1

Define the dataset, business rules, and why the data matters.

2

Profile completeness, uniqueness, validity, consistency, and outliers.

3

Choose a repair strategy for missing values, duplicates, strings, or constraints.

4

Verify the cleaned data against the original rules and document remaining risk.

入力オプション

Dataset

CSV, spreadsheet, SQL extract, dataframe, schema, or sample rows.

例

Example input
Dataset: customer_export.csv, 52,000 rows.
Use case: migrate accounts into a new CRM next week.

Rules:
- email is required for contacts,
- account_id must be unique,
- country must be a valid ISO country,
- annual_revenue should not be negative.

Known concern: duplicate company names and missing emails.
Need: decide whether migration is safe and what to fix first.
What the skill returns
How it reads the request
The skill treats this as migration readiness, not generic cleanup. It focuses on problems that would block or corrupt the CRM migration.
Findings
1,148 rows are missing email, 392 duplicate account_id values exist, 86 country values are invalid, and 14 rows have negative revenue.
Migration risk
Duplicate account_id is blocking because it can merge the wrong accounts. Missing emails may be acceptable for account records but not contact records.
Fix order
Fix duplicate account IDs first, map invalid countries second, flag missing emails for enrichment, and investigate negative revenue instead of deleting those rows.
How to prove it is ready
Rerun the same checks and show before/after counts so the migration owner can sign off.

改善される指標

Data Quality
+20-40%
Product & Engineering
Assertion Pass Rate
+10-25%
Product & Engineering
Data Quality Incident Rate
-10-25%
Product & Engineering

対応ツール

Google Sheets
手動

Use spreadsheet datasets as inputs for lightweight profiling and cleanup.

Snowflake
手動

Run warehouse-backed data quality checks and repairs against Snowflake extracts.

SQL
手動

Use SQL extracts and constraints as inputs for data quality profiling and verification.

Data Qualityを使ってみますか?

始め方を選択してください。

Claude Codeで実行
無料・オープンソース

このスキルをコンピュータにローカルでインストールして実行します。

1
Claude Codeをインストール

コンピュータでターミナルを開き、このコマンドを貼り付けます:

2
スキルをインストール

このコマンドでスキルとすべてのファイルをコンピュータにダウンロードします:

末尾に-gを追加すると、すべてのプロジェクトで利用可能になります。

3
実行する

Claude Codeを起動し、コマンドを入力します:

次に
GitHubでソースを見る
ElasticFlowで利用
チームとコラボレーション機能

ブラウザからスキルを実行。結果を共有し、アクセス管理、チームで協力。ターミナル不要。

14日間無料トライアル。いつでもキャンセル可能。

GitHubで見る

Data Quality Skill

Systematic approach to diagnosing and fixing data quality problems.

Data Quality Process

Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
  1. Define: Understand data context, business rules, quality requirements
  2. Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
  3. Clean: Apply appropriate repair strategies
  4. Measure: Validate repairs, quantify improvement

Quick Reference

ProblemScriptKey Function
Data overviewdata_profiling.pyprofile_dataframe(df)
Find quality issuesdata_profiling.pydetect_glitches(df)
Missing valuesmissing_data.pyanalyze_missing(df)
Imputationmissing_data.pyimpute_mean/median/regression()
Duplicatesduplicate_detection.pyfind_duplicates(df, cols)
Deduplicationduplicate_detection.pydeduplicate(df, cols)
Outliersanomaly_detection.pydetect_anomalies(df)
Constraint checkconstraint_checking.pyvalidate_constraints(df, rules)
String matchingsimilarity_metrics.pyjaro_winkler_similarity()

Workflow

Step 1: Profile the Data

from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report

# Quick overview
print(generate_quality_report(df))

# Detailed profile
profile = profile_dataframe(df)

# Find issues
glitches = detect_glitches(df)

Step 2: Analyze Specific Issues

Missing Data:

from scripts.missing_data import analyze_missing, test_mcar

analysis = analyze_missing(df)
# Check if safe to delete rows
mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])

Duplicates:

from scripts.duplicate_detection import find_duplicates, cluster_duplicates

matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
clusters = cluster_duplicates(matches)

Outliers:

from scripts.anomaly_detection import detect_anomalies, iqr_outliers

# Multi-column summary
anomalies = detect_anomalies(df, method='iqr')

# Single column detail
result = iqr_outliers(df, 'price', multiplier=1.5)

Constraints:

from scripts.constraint_checking import validate_constraints

constraints = [
    {'type': 'unique', 'columns': ['id']},
    {'type': 'not_null', 'columns': ['name', 'email']},
    {'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
    {'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
]
results = validate_constraints(df, constraints)

Step 3: Clean the Data

Handle Missing:

from scripts.missing_data import impute_median, impute_regression, listwise_deletion

# Simple: median for numeric
df_clean = impute_median(df, 'age')

# Better: regression-based
df_clean = impute_regression(df, 'income', ['age', 'education'])

# If MCAR confirmed
df_clean = listwise_deletion(df)

Remove Duplicates:

from scripts.duplicate_detection import deduplicate

df_clean, summary = deduplicate(
    df, 
    columns=['name', 'email', 'address'],
    threshold=0.8,
    merge_strategy='most_complete'
)
print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")

Handle Outliers:

# Cap extreme values
q01, q99 = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(q01, q99)

# Or remove
df_clean = df[~detect_anomalies(df)['col']['outlier_indices']]

Step 4: Validate

Re-run profiling and constraint checks on cleaned data to verify improvements.

References

For deeper understanding:

  • references/dimensions.md: Data quality dimensions (accuracy, completeness, etc.)
  • references/glitch_taxonomy.md: Types of data glitches and detection approaches
  • references/repair_strategies.md: Detailed repair and cleaning strategies

Key Concepts

Data Quality = Fit for Use

  • Free of defects
  • Has features needed for the task
  • Right information, right place, right time

Missing Data Mechanisms:

  • MCAR: Missing Completely At Random (safe to delete)
  • MAR: Missing At Random (imputation may work)
  • MNAR: Missing Not At Random (most problematic)

Constraints:

  • Functional Dependency: X → Y means X uniquely determines Y
  • Referential Integrity: foreign keys reference valid primary keys
  • Domain Constraints: values within allowed set/range

Entity Resolution:

  • Blocking reduces O(n²) to O(n·window)
  • Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
  • Cluster by transitive closure, merge by strategy

Similarity Metrics Comparison

MetricBest ForExample
Jaro-WinklerNames, short strings"Robert" vs "Rupert"
LevenshteinTypos, edit distance"recieve" vs "receive"
JaccardToken/word comparison"John Doe" vs "Doe, John"
Q-gramFuzzy substring matchingPartial matches

参照ドキュメント


name: data-quality description: Diagnose and fix data quality problems in datasets. Use when working with dirty data, finding duplicates, handling missing values, detecting outliers/anomalies, validating constraints (functional dependencies, referential integrity), profiling datasets, or cleaning data for analysis or ML. Covers the full data quality lifecycle - define, detect, clean, measure.

Data Quality Skill

Systematic approach to diagnosing and fixing data quality problems.

Data Quality Process

Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
  1. Define: Understand data context, business rules, quality requirements
  2. Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
  3. Clean: Apply appropriate repair strategies
  4. Measure: Validate repairs, quantify improvement

Quick Reference

ProblemScriptKey Function
Data overviewdata_profiling.pyprofile_dataframe(df)
Find quality issuesdata_profiling.pydetect_glitches(df)
Missing valuesmissing_data.pyanalyze_missing(df)
Imputationmissing_data.pyimpute_mean/median/regression()
Duplicatesduplicate_detection.pyfind_duplicates(df, cols)
Deduplicationduplicate_detection.pydeduplicate(df, cols)
Outliersanomaly_detection.pydetect_anomalies(df)
Constraint checkconstraint_checking.pyvalidate_constraints(df, rules)
String matchingsimilarity_metrics.pyjaro_winkler_similarity()

Workflow

Step 1: Profile the Data

from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report

# Quick overview
print(generate_quality_report(df))

# Detailed profile
profile = profile_dataframe(df)

# Find issues
glitches = detect_glitches(df)

Step 2: Analyze Specific Issues

Missing Data:

from scripts.missing_data import analyze_missing, test_mcar

analysis = analyze_missing(df)
# Check if safe to delete rows
mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])

Duplicates:

from scripts.duplicate_detection import find_duplicates, cluster_duplicates

matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
clusters = cluster_duplicates(matches)

Outliers:

from scripts.anomaly_detection import detect_anomalies, iqr_outliers

# Multi-column summary
anomalies = detect_anomalies(df, method='iqr')

# Single column detail
result = iqr_outliers(df, 'price', multiplier=1.5)

Constraints:

from scripts.constraint_checking import validate_constraints

constraints = [
    {'type': 'unique', 'columns': ['id']},
    {'type': 'not_null', 'columns': ['name', 'email']},
    {'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
    {'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
]
results = validate_constraints(df, constraints)

Step 3: Clean the Data

Handle Missing:

from scripts.missing_data import impute_median, impute_regression, listwise_deletion

# Simple: median for numeric
df_clean = impute_median(df, 'age')

# Better: regression-based
df_clean = impute_regression(df, 'income', ['age', 'education'])

# If MCAR confirmed
df_clean = listwise_deletion(df)

Remove Duplicates:

from scripts.duplicate_detection import deduplicate

df_clean, summary = deduplicate(
    df, 
    columns=['name', 'email', 'address'],
    threshold=0.8,
    merge_strategy='most_complete'
)
print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")

Handle Outliers:

# Cap extreme values
q01, q99 = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(q01, q99)

# Or remove
df_clean = df[~detect_anomalies(df)['col']['outlier_indices']]

Step 4: Validate

Re-run profiling and constraint checks on cleaned data to verify improvements.

References

For deeper understanding:

  • references/dimensions.md: Data quality dimensions (accuracy, completeness, etc.)
  • references/glitch_taxonomy.md: Types of data glitches and detection approaches
  • references/repair_strategies.md: Detailed repair and cleaning strategies

Key Concepts

Data Quality = Fit for Use

  • Free of defects
  • Has features needed for the task
  • Right information, right place, right time

Missing Data Mechanisms:

  • MCAR: Missing Completely At Random (safe to delete)
  • MAR: Missing At Random (imputation may work)
  • MNAR: Missing Not At Random (most problematic)

Constraints:

  • Functional Dependency: X → Y means X uniquely determines Y
  • Referential Integrity: foreign keys reference valid primary keys
  • Domain Constraints: values within allowed set/range

Entity Resolution:

  • Blocking reduces O(n²) to O(n·window)
  • Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
  • Cluster by transitive closure, merge by strategy

Similarity Metrics Comparison

MetricBest ForExample
Jaro-WinklerNames, short strings"Robert" vs "Rupert"
LevenshteinTypos, edit distance"recieve" vs "receive"
JaccardToken/word comparison"John Doe" vs "Doe, John"
Q-gramFuzzy substring matchingPartial matches

Data Quality Dimensions

Data quality is determined by "fitness for use" - the capability of data to meet requirements for a given context.

Core Dimensions

Accuracy

Data correctly represents the real-world entity or event.

Measurement:

  • Compare against authoritative source (gold standard)
  • Expert review sampling
  • Cross-reference validation

Common Issues:

  • Typos and transcription errors
  • Outdated information
  • Measurement errors

Completeness

All required data values are present.

Levels:

  • Schema completeness: all expected attributes exist
  • Column completeness: % non-null values per column
  • Population completeness: all expected records present

Measurement:

completeness = (non_null_count / total_count) * 100

Consistency

Data values don't contradict each other across the dataset or systems.

Types:

  • Intra-record: values within same record consistent (age matches birth date)
  • Inter-record: values across records consistent (no duplicate IDs)
  • Cross-system: same entity has consistent values in different databases

Measurement:

  • Constraint violation counts
  • Cross-reference mismatches

Timeliness

Data is current enough for the intended use.

Aspects:

  • Currency: when data was last updated
  • Volatility: how often data changes
  • Latency: delay between real-world change and data update

Measurement:

timeliness_score = 1 - (current_time - last_update) / max_acceptable_age

Validity

Data conforms to defined formats, types, and ranges.

Checks:

  • Data type validation
  • Format validation (dates, emails, phones)
  • Range/domain validation
  • Pattern matching

Uniqueness

No unintended duplicates exist.

Levels:

  • Primary key uniqueness
  • Natural key uniqueness
  • Entity-level deduplication

Extended Dimensions

Relevance

Data is applicable and useful for the task.

Interpretability

Data meaning is clear and unambiguous (well-documented).

Accessibility

Authorized users can easily obtain the data.

Credibility

Data comes from trustworthy sources.

Business Context

Different uses require different quality priorities:

Use CasePriority Dimensions
Financial reportingAccuracy, Completeness, Consistency
Real-time analyticsTimeliness, Availability
Customer communicationsAccuracy, Completeness
ML model trainingCompleteness, Consistency, Validity
Regulatory complianceAccuracy, Completeness, Auditability

Measuring Overall Quality

Weighted composite score:

quality_score = sum(weight[dim] * score[dim] for dim in dimensions) / sum(weights)

Quality thresholds:

  • Critical: < 70% - Immediate attention required
  • Warning: 70-90% - Improvement needed
  • Acceptable: > 90% - Monitor for degradation

Data Glitch Taxonomy

Data glitches are defects that compromise data quality. Understanding glitch types guides detection and repair strategies.

Glitch Categories

1. Missing Data

Value-level: Individual cells are NULL/empty

  • Easy to detect: df.isnull().sum()

Record-level: Entire rows missing from expected population

  • Hard to detect: requires external reference

Attribute-level: Expected columns missing from schema

  • Moderate: compare against schema documentation

Missing Data Mechanisms:

MechanismDescriptionImplication
MCARMissing Completely At Random - no correlationSafe to delete rows
MARMissing At Random - correlated with observed dataImputation may work
MNARMissing Not At Random - correlated with missing value itselfMost problematic

2. Inconsistent/Erroneous Data

Syntactic errors:

  • Typos: "Jhon" instead of "John"
  • Formatting: inconsistent date formats, phone formats
  • Encoding: character encoding issues

Semantic errors:

  • Wrong values: age = 250
  • Contradictions: birth_date > current_date
  • Constraint violations: duplicate primary keys

Detection:

  • Constraint checking (FDs, referential integrity)
  • Domain validation
  • Pattern analysis

3. Anomalies and Outliers

Point anomalies: Single data points far from normal

  • Age = -5 or Age = 200

Contextual anomalies: Abnormal in specific context

  • Temperature = 90°F normal in summer, anomaly in winter

Collective anomalies: Groups of related points anomalous together

  • Sudden spike in all sensor readings

Detection methods:

  • Statistical: z-score, IQR, modified z-score
  • ML-based: isolation forest, autoencoders, k-NN

4. Semantic Duplicates

Records referring to the same real-world entity with different representations:

Row 1: "John Smith", "123 Main St", "NYC"
Row 2: "J. Smith", "123 Main Street", "New York City"

Detection:

  • Blocking to reduce comparison space
  • Fuzzy matching with similarity metrics
  • Clustering via transitive closure

5. Undocumented Data

Data without adequate metadata:

  • Unknown column meanings
  • Missing data dictionaries
  • Unclear units of measurement
  • No lineage/provenance information

Symptoms:

  • Column names like "col1", "field_a", "x"
  • No documentation about allowed values
  • Ambiguous semantics

Glitch Complexes

Real data often has compound glitch patterns:

Multi-type Glitch

Single value has multiple glitch types:

  • Value is both an outlier AND inconsistent with constraint

Concomitant Glitches

Same record has glitches in multiple columns:

  • Missing name AND invalid email AND outlier age

Multi-occurrent Glitches

Same glitch type appears across many records:

  • 1000 records all missing the same field

Detection Complexity Factors

Relevance

Severity varies by domain:

  • Missing email: critical for marketing, irrelevant for shipping

Ambiguity

Boundary between valid/invalid unclear:

  • Is age 120 an error or just rare?

Complex Dependencies

One glitch can mask another:

  • Missing value hides what would be a constraint violation

Dynamic Nature

Glitch types evolve:

  • New data sources introduce new error patterns

Glitch Quantification

Per-value scoring

glitch_signature = [has_missing, has_outlier, has_format_error, ...]
glitch_score = sum(weight[i] * signature[i] for i in range(len(signature)))

Global scoring

total_glitch_score = sum(all_value_scores) / total_values

Detection Priority

  1. Start broad: Profile entire dataset (profiling script)
  2. Identify patterns: Find common glitch types
  3. Prioritize: Focus on high-impact, high-frequency issues
  4. Deep dive: Investigate root causes of priority glitches

Data Repair Strategies

Repairing data quality issues involves trade-offs between data loss, accuracy, and computational cost.

Fundamental Concept: Minimal Repair

Minimal repair = smallest change that removes constraint violations.

Key insight: When a constraint is violated, it's ambiguous which value is wrong. Minimal repair preserves as much original data as possible.

Missing Data Strategies

Deletion Methods

Listwise deletion: Remove entire row if any value missing

df_clean = df.dropna()
  • Pros: Simple, preserves relationships in remaining data
  • Cons: Can lose significant data, only valid under MCAR
  • Use when: < 5% missing, confirmed MCAR

Pairwise deletion: Use available data for each calculation

df.dropna(subset=['col1', 'col2'])  # Only for this analysis
  • Pros: Preserves more data
  • Cons: Different N for different analyses
  • Use when: Analysis-specific completeness needed

Attribute deletion: Remove columns with too much missing

df.drop(columns=[col for col in df if df[col].isnull().mean() > 0.5])
  • Use when: Column > 50% missing, not critical

Imputation Methods

Simple imputation:

MethodFormulaBest for
Meandf[col].fillna(df[col].mean())Numeric, normal distribution
Mediandf[col].fillna(df[col].median())Numeric, skewed/outliers
Modedf[col].fillna(df[col].mode()[0])Categorical
Constantdf[col].fillna(value)Domain-specific defaults
Forward filldf[col].ffill()Time series

Warnings:

  • Mean/median underestimates variance
  • Weakens correlations between variables
  • Can introduce bias if not MCAR

Model-based imputation:

# Regression imputation
from scripts.missing_data import impute_regression
df_imputed = impute_regression(df, target='income', predictors=['age', 'education'])
  • Preserves relationships better
  • Still underestimates variance unless noise added

Multiple imputation:

  1. Impute multiple times with random variation
  2. Analyze each imputed dataset
  3. Pool results accounting for imputation uncertainty

Indicator Method

Add binary flag for missingness (useful for ML):

df['col_missing'] = df['col'].isnull().astype(int)
df['col'] = df['col'].fillna(df['col'].median())

Duplicate Repair Strategies

Record Selection

Survivor strategy: Keep most complete record

from scripts.duplicate_detection import merge_records
survivor = merge_records(df, cluster, strategy='most_complete')

First/Last: Keep chronologically first or latest

  • Use when: temporal precedence matters

Record Fusion

Attribute-based: Create composite record

merged = merge_records(df, cluster, strategy='combine')
# Takes first non-null value for each attribute

Aggregation rules by data type:

  • Numeric: max, min, avg, sum (depends on semantics)
  • String: longest, most recent, most frequent
  • Date: earliest, latest

Post-merge Actions

Update foreign keys in related tables:

# After merging records 2,3,4 into record 1
other_table['fk_col'] = other_table['fk_col'].replace({2: 1, 3: 1, 4: 1})

Constraint Violation Repair

Deletion vs. Modification

Tuple deletion: Remove violating rows entirely

  • Simple but loses information
  • May cascade to dependent tables

Value modification: Update values to satisfy constraints

  • Preserves more data
  • Risk of introducing other errors

Consistent Query Answering (CQA)

Instead of repairing data, return query results consistent across ALL possible minimal repairs.

-- Original: SELECT * FROM Students
-- CQA version: Only return students with no FD conflicts
SELECT * FROM Students s1
WHERE NOT EXISTS (
    SELECT * FROM Students s2
    WHERE s1.id = s2.id AND s1.name != s2.name
)

Active Integrity Constraints

Specify repair actions with constraints:

IF violation(student_id -> name) THEN keep_most_recent
IF violation(age BETWEEN 0 AND 150) THEN set_null

Anomaly/Outlier Repair

Correction strategies

Capping/Winsorization: Replace extremes with boundary values

lower, upper = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(lower, upper)

Deletion: Remove outlier rows

  • Risk of bias if outliers are systematic

Imputation: Treat as missing and impute

df.loc[outlier_mask, 'col'] = np.nan
df = impute_median(df, 'col')

Investigation: Verify if genuine extreme or error

  • Sometimes outliers are real and important!

Repair Decision Framework

1. Assess impact
   - How critical is this data?
   - What's the cost of wrong repair vs. no repair?

2. Understand cause
   - Systematic error → fix at source
   - Random error → statistical repair

3. Choose strategy
   - Low risk: aggressive cleaning
   - High risk: conservative (CQA, flagging)

4. Validate
   - Check repair didn't introduce new issues
   - Compare distributions before/after
   - Verify business logic still holds

5. Document
   - Record what was changed and why
   - Enable audit trail

Best Practices

  1. Never destroy original data - keep backups
  2. Document all transformations - reproducibility
  3. Prefer prevention - fix data entry, not downstream
  4. Validate repairs - check for unintended consequences
  5. Consider context - repair strategy depends on use case
ElasticFlow

AI搭載のワークフロー自動化でビジネスを変革。エンタープライズのあらゆるニーズを満たす統合プラットフォーム。

フォローする

プラットフォーム

  • 機能
  • メリット
  • ユースケース
  • ワークフローライブラリ

ユースケース

  • 営業
  • マーケティング
  • 財務・法務
  • 人事

カタログ

  • 部門
  • ロール
  • ツール
  • メトリクス
  • プラットフォーム

成長

  • 紹介プログラム
  • パートナー

法務

  • プライバシーポリシー
  • 利用規約
  • Cookieポリシー
  • 許容される利用
  • セキュリティ
  • SLA

© 2026 ElasticFlow. All rights reserved.