ElasticFlow
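Hub · All Skills · By Department · By Role · By Tool · By Metric · MCPs · Publishers
Main site · Log in · Sign up
ElasticFlow

Transform your business with AI-powered workflow automation. A unified platform for every enterprise need.

Follow

Platform

  • Features
  • Benefits
  • Use cases
  • Workflow library

Use cases

  • Sales
  • Marketing
  • Finance & Legal
  • HR

Catalog

  • Departments
  • Roles
  • Tools
  • Metrics
  • Platforms

Growth

  • Referral program
  • Partners

Legal

  • Privacy policy
  • Terms of service
  • Cookie policy
  • Acceptable use
  • Security
  • SLA

© 2026 ElasticFlow. All rights reserved.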


See what data depends on a table before you change it. — Claude Skill

Claude skill for Claude Code · Provided by: DataHub Project ✓ · Run: /datahub-lineage (in Claude) · Updated: June 12, 2026 · version main@68585b1

Compatible with: Claude

Finds upstream sources, downstream dashboards, owners, and risk in DataHub so teams can avoid breaking reports, pipelines, or customer-facing data.

  • Shows what feeds a dataset and what depends on it downstream.
  • Finds dashboards, tables, pipelines, owners, and platforms affected by a change.
  • Supports impact analysis, root-cause tracing, cross-platform maps, and specific source-to-target paths.
  • Turns raw DataHub lineage into a readable impact report for data and business teams.
Users today

An analyst manually clicks through lineage views and exports partial lists of dependencies.

With /datahub-lineage

Run /datahub-lineage to resolve the entity, traverse the graph, enrich results, and produce a reusable impact report.

  1. Find entity
  2. Choose traversal mode
  3. Run lineage query
  4. Enrich and summarize impact

Who it's for

Data Engineer

Trace upstream/downstream dependencies and change impact in DataHub.

Analytics Engineer

Map how analytical datasets feed dashboards, models, and downstream reports.

Features

Change impact

Find who and what is affected before a table, column, or pipeline changes.

Root cause

Trace upstream sources when a dashboard or dataset looks wrong.

Ownership map

Show which teams own the data assets involved in a flow.

How it works

  1. Start with a DataHub entity name or URN.
  2. Choose the question: what breaks downstream, where did bad data come from, or how does data flow across platforms.
  3. Traverse the lineage graph and enrich results with owner, platform, type, and metadata context.
  4. Summarize affected assets, risk level, owners, and recommended next actions.

Input options

DataHub entity

A dataset, chart, dashboard, pipeline, or URN.

Example

What the user pastes
Planned change: rename column analytics.orders.discount_code to promo_code
DataHub entity: Snowflake analytics.orders
Deadline: deploy Friday
Concern: revenue dashboards and finance exports may depend on this column
Need: downstream owners, affected assets, and who to notify before deploy

Useful result
Affected assets: Revenue dashboard, weekly bookings export, finance close model, and customer cohort notebook depend on analytics.orders.
Highest risk: Finance close model uses discount_code directly and has no alternate path. Changing the column before month close is risky.
Owners to notify: Finance Analytics owns the close model; RevOps owns bookings export; Growth owns cohort notebook.
Recommended action: Add promo_code as a new column first, keep discount_code for one release, and remove it after downstream owners confirm migration.

Metrics improved

| Metric | Expected change | Department |
| --- | --- | --- |
| Lineage Coverage | +25-50% | Product & Engineering |
| Data Freshness | Faster impact review | Product & Engineering |
| Metric Trust | +10-20% | Product & Engineering |

Supported tools

DataHub (manual)

Primary system for lineage graph traversal, entity lookup, ownership, and metadata enrichment.

Snowflake (manual)

Common source or target platform in DataHub lineage graphs.

SQL (manual)

Use SQL context to explain transformations and lineage paths.


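Ready to try DataHub Lineage?

Choose how to get started.

Run in Claude Code
Free. Open source.

Install and run this skill locally on your computer.

1
Install Claude Code

Open a terminal on your computer and paste this command:

2
Install the skill

This command downloads the skill and all of its files to your computer:

Add -g at the end to use it in every project.

3
Run it

Start Claude Code, then enter the command:

Next
View source on GitHub
Use in ElasticFlow
Team and collaboration features

Run the skill in your browser. Share results, manage access, and collaborate with your team. No terminal required.

14-day free trial. Cancel anytime.

View on GitHub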

DataHub Lineage

You are an expert DataHub lineage analyst. Your role is to help the user understand how data flows through their systems — tracing upstream sources, downstream consumers, cross-platform dependencies, and assessing the impact of changes.


Multi-Agent Compatibility

This skill is designed to work across multiple coding agents (Claude Code, Cursor, Codex, Copilot, Gemini CLI, Windsurf, and others).

What works everywhere:

  • The full lineage exploration workflow
  • All traversal modes (impact analysis, root cause, dependency mapping)
  • Lineage visualization via MCP tools or DataHub CLI

Claude Code-specific features (other agents can safely ignore these):

  • allowed-tools in the skill's YAML frontmatter
  • Task(subagent_type="datahub-skills:metadata-searcher") for delegated entity lookup — only when multiple complex searches are needed to resolve and enrich a large lineage graph. For simple entity lookups, execute inline. Fallback instructions are provided inline for agents without sub-agent dispatch.

Reference file paths: Shared references are in ../shared-references/ relative to this skill's directory. Skill-specific references are in references/ and templates in templates/.


Not This Skill

| If the user wants to... | Use this instead |
| --- | --- |
| Search for entities by keyword or metadata | /datahub-search |
| Answer "who owns X?" or "what is X?" | /datahub-search (metadata lookup, not lineage) |
| Add or update metadata (descriptions, tags, owners) | /datahub-enrich |
| Create assertions, run quality checks, manage incidents | /datahub-quality |

Key boundary: Lineage handles lineage and dependency questions ("what feeds into X?", "what breaks if I change X?"). Search handles metadata questions ("who owns X?"). Enrich handles metadata updates ("set owner", "tag this").


Step 1: Identify Target Entity

Find the entity the user wants to trace.

  1. If the user provides a URN, use it directly
  2. If they provide a name, search for it: datahub search "<name>" --where "entity_type = dataset" --limit 5
  3. If multiple matches, present options and ask the user to choose
  4. Confirm: show entity name, URN, platform, type

Input validation: Reject shell metacharacters in search queries and URNs before passing to CLI.
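
The validation rule above can be sketched as a small check. This is a minimal sketch, not the skill's actual implementation: the skill does not list which characters count as metacharacters, so the `FORBIDDEN_CHARS` set and the function name `is_safe_cli_argument` are assumptions. Parentheses and commas are deliberately allowed because dataset URNs contain them.

```python
# Assumption: an illustrative set of shell metacharacters to reject.
# Parentheses and commas are allowed because dataset URNs contain them.
FORBIDDEN_CHARS = set(";&|`$<>\\\"'\n\r")


def is_safe_cli_argument(value: str) -> bool:
    """Return True if value is non-empty and contains no forbidden character."""
    return bool(value) and not (set(value) & FORBIDDEN_CHARS)


if __name__ == "__main__":
    print(is_safe_cli_argument("analytics.orders"))
    print(is_safe_cli_argument("orders; rm -rf /"))
```

Passing arguments as a list to a process launcher (rather than through a shell string) is a complementary safeguard, since it avoids shell interpretation altogether.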


Step 2: Determine Traversal Mode

Traversal modes

| Mode | Direction | Use Case | User Says |
| --- | --- | --- | --- |
| Impact analysis | Downstream | "What breaks if I change this?" | "impact of X", "what depends on X", "downstream" |
| Root cause | Upstream | "Where does this data come from?" | "root cause", "what feeds X", "upstream", "source of" |
| Full pipeline | Both | "Show the complete data flow" | "full lineage", "end to end", "trace the pipeline" |
| Cross-platform | Both | "How does data flow between systems?" | "from Snowflake to Looker", "cross-platform" |
| Specific path | Directed | "How does X reach Y?" | "path from X to Y", "how does X connect to Y" |
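
As a rough illustration of how the trigger phrases above could map to a mode, here is a minimal keyword-matching sketch. The phrase lists come from the "User Says" column plus the mode names; the matching order, the mode identifiers, and the function name `choose_traversal_mode` are assumptions. An agent would normally interpret the request in context rather than rely on substring matching.

```python
from typing import Optional

# Phrases from the "User Says" column plus the mode names.
# Assumption: more specific modes are checked first.
MODE_TRIGGERS = [
    ("specific_path", ["specific path", "path from", "connect to"]),
    ("cross_platform", ["cross-platform"]),
    ("full_pipeline", ["full pipeline", "full lineage", "end to end", "trace the pipeline"]),
    ("root_cause", ["root cause", "what feeds", "upstream", "source of"]),
    ("impact_analysis", ["impact analysis", "impact of", "what depends on", "downstream"]),
]


def choose_traversal_mode(request: str) -> Optional[str]:
    """Return a mode identifier, or None when the agent should ask the user."""
    text = request.lower()
    for mode, phrases in MODE_TRIGGERS:
        if any(phrase in text for phrase in phrases):
            return mode
    return None


if __name__ == "__main__":
    print(choose_traversal_mode("what feeds into the Revenue Dashboard?"))
```

Returning `None` instead of guessing keeps the ambiguous case explicit, so the agent can ask which question the user wants answered.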

Depth configuration

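| Depth | When to Use |
| --- | --- |
| 1 hop | Default for this skill: immediate upstream/downstream. Pass --hops 1 explicitly, because the CLI itself defaults to --hops 3 |
| 2-3 hops | User asks for "full" lineage or cross-platform tracing |
| More than 3 hops | Only with user confirmation, because result size can grow exponentially with depth |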

Ask about depth if the user doesn't specify: "How many hops should I trace? (default: 1, or specify 'full')"


Step 3: Execute Lineage Queries

Choosing your tool: MCP vs. CLI

|  | MCP tools | DataHub CLI |
| --- | --- | --- |
| When available | Preferred for simple traversals | Use for path, column-level lineage, --format json metadata |
| Lineage | get_lineage(urn=..., direction=..., depth=...) | datahub lineage --urn "..." --direction upstream |
| Enrich results | get_entities(urns=[...]) | datahub search "*" --where 'urn IN (...)' with --projection |

MCP provides structured lineage graphs without shell overhead. MCP tools are self-documenting, so check their schemas for parameter details. Fall back to the CLI for features MCP may not support: path tracing between two entities, column-level lineage, and output format control.

Using the datahub lineage CLI command

# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream

# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream

# Limit depth
datahub lineage --urn "<URN>" --direction downstream --hops 1

# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream

# JSON output (includes metadata with hints about capped/truncated results)
datahub lineage --urn "<URN>" --direction downstream --format json

# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"

The command returns a summary line indicating how many entities were found, the maximum hop depth, and whether results were capped. Use --format json for structured output with a metadata object the agent can inspect.

Defaults: --hops 3 (full transitive lineage), --count 100. Increase --count if the summary indicates results were capped.

Output formats: Use --format json for structured processing (includes a metadata object with capped/truncated hints). Default table output is best for quick display to the user.
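
For agents that launch the CLI from a script, the flags shown above can be assembled as an argument list. This is a sketch under stated assumptions: the helper name `build_lineage_command` and its parameters are hypothetical, and only flags that appear in this document (--urn, --direction, --hops, --column, --count, --format json) are used.

```python
from typing import List, Optional


def build_lineage_command(
    urn: str,
    direction: str,
    hops: Optional[int] = None,
    column: Optional[str] = None,
    count: Optional[int] = None,
    as_json: bool = False,
) -> List[str]:
    """Build an argv list for `datahub lineage` (hypothetical helper)."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    cmd = ["datahub", "lineage", "--urn", urn, "--direction", direction]
    if hops is not None:
        cmd += ["--hops", str(hops)]
    if column is not None:
        cmd += ["--column", column]
    if count is not None:
        cmd += ["--count", str(count)]
    if as_json:
        cmd += ["--format", "json"]
    return cmd


if __name__ == "__main__":
    print(build_lineage_command("<URN>", "downstream", hops=1, as_json=True))
```

The list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` without `shell=True`, so the URN's parentheses and commas need no shell quoting.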

What lineage returns vs. what needs follow-up

datahub lineage returns basic fields for each entity: URN, name, type, platform, and hop distance. It does not support --projection and does not return ownership, descriptions, tags, or other rich metadata.

To enrich lineage results with richer metadata, use search with a urn filter to batch multiple URNs in a single call with --projection:

# Batch-enrich lineage results — quote URNs (they contain parentheses and commas)
datahub search "*" \
  --where 'urn IN ("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)", "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)")' \
  --projection "urn type
    ... on Dataset { properties { name description } platform { name }
      ownership { owners { owner type } }
      siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } } }
    }"

This avoids N+1 calls — collect the URNs from lineage output and resolve them all in one search. The urn field is not a named filter but works via custom passthrough to Elasticsearch.
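
Building the `urn IN (...)` filter from a list of lineage results might look like the following. It is a minimal sketch: the helper name `build_urn_filter` is hypothetical, and the output format simply mirrors the quoted, comma-separated form used in the command above.

```python
from typing import Iterable


def build_urn_filter(urns: Iterable[str]) -> str:
    """Return a `urn IN ("...", "...")` clause for a batch of URNs.

    Duplicates are dropped while preserving order, so each entity is
    requested once.
    """
    unique = list(dict.fromkeys(urns))
    if not unique:
        raise ValueError("at least one URN is required")
    for urn in unique:
        if '"' in urn:
            raise ValueError("URN must not contain a double quote")
    quoted = ", ".join(f'"{urn}"' for urn in unique)
    return f"urn IN ({quoted})"


if __name__ == "__main__":
    print(build_urn_filter([
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)",
    ]))
```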

MCP alternative: If MCP is available, get_entities(urns=["<URN_1>", "<URN_2>"]) also supports batch lookup.

Siblings in lineage results

Lineage may return a dbt model URN when the user is thinking of the warehouse table (or vice versa). These are linked via the siblings aspect. When presenting lineage results, note when an entity has a sibling on a different platform — e.g., "dbt model stg_orders (sibling: Snowflake analytics.stg_orders)". See the entity model reference for sibling resolution details.

Specific path tracing

Use the CLI command first:

datahub lineage path --from "<URN_A>" --to "<URN_B>"

If path is unavailable, fall back to manual BFS: get downstream from A incrementing depth, check for B at each hop, and stop after 5 hops.
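
The manual fallback described above is a breadth-first search with a hop limit. The sketch below assumes a hypothetical `get_downstream` callable that returns the one-hop downstream URNs of an entity (for example, by wrapping a `--hops 1` lineage query); that callable and the function name `find_path` are not part of the skill.

```python
from typing import Callable, Dict, Iterable, List, Optional


def find_path(
    start: str,
    target: str,
    get_downstream: Callable[[str], Iterable[str]],
    max_hops: int = 5,
) -> Optional[List[str]]:
    """Breadth-first search from start to target, at most max_hops deep."""
    if start == target:
        return [start]
    parent: Dict[str, Optional[str]] = {start: None}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier: List[str] = []
        for node in frontier:
            for child in get_downstream(node):
                if child in parent:
                    continue  # already visited via a path that is no longer
                parent[child] = node
                if child == target:
                    # Walk parent links back to the start, then reverse.
                    path = [child]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                next_frontier.append(child)
        if not next_frontier:
            return None  # graph exhausted before reaching target
        frontier = next_frontier
    return None  # hop limit reached


if __name__ == "__main__":
    graph = {"a": ["b", "x"], "b": ["c"], "c": ["d"]}
    print(find_path("a", "d", lambda n: graph.get(n, [])))
```

Because the search expands one hop at a time, the first path found is also a shortest one, and the `max_hops` cap matches the five-hop stop condition.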


Step 4: Visualize Lineage

ASCII flow diagram

For simple lineage (up to ~10 entities):

[source_table_1] ──→ [staging_table] ──→ [analytics_table] ──→ [Revenue Dashboard]
[source_table_2] ──┘                                        └──→ [daily_export]

Structured list

For larger or more complex lineage:

### Upstream (sources for analytics_table)

| Hop | Entity         | Type    | Platform   | Relationship |
| --- | -------------- | ------- | ---------- | ------------ |
| 1   | staging_table  | dataset | Snowflake  | TRANSFORMED  |
| 2   | source_table_1 | dataset | PostgreSQL | TRANSFORMED  |
| 2   | source_table_2 | dataset | PostgreSQL | TRANSFORMED  |

### Downstream (consumers of analytics_table)

| Hop | Entity            | Type      | Platform | Relationship |
| --- | ----------------- | --------- | -------- | ------------ |
| 1   | Revenue Dashboard | dashboard | Looker   | —            |
| 1   | daily_export      | dataset   | S3       | TRANSFORMED  |

Impact analysis format

For impact analysis, group by entity type, identify critical paths (single-dependency chains), and list affected owners. See templates/impact-analysis.template.md for the full template.

Cross-platform view

Group by platform when lineage crosses systems:

PostgreSQL           Snowflake              Looker
─────────           ─────────              ──────
[raw_orders] ──→ [stg_orders] ──→ [fct_orders] ──→ [Orders Dashboard]
[raw_customers] ──→ [stg_customers] ──┘

Suggesting Next Steps

After presenting lineage:

  • "Want to see metadata details for any of these?" → fetch with datahub search using --projection with ownership, descriptions, siblings
  • "Want to update metadata along this pipeline? Use /datahub-enrich"
  • "Want to run an impact audit? Use /datahub-audit"

Reference Documents

| Document | Path | Purpose |
| --- | --- | --- |
| Lineage patterns reference | references/lineage-patterns-reference.md | Traversal strategies and patterns |
| Impact analysis template | templates/impact-analysis.template.md | Impact analysis report template |
| Lineage map template | templates/lineage-map.template.md | Lineage visualization template |
| CLI reference (shared) | ../shared-references/datahub-cli-reference.md | CLI commands |

Common Mistakes

  • Using datahub get --aspect upstreamLineage instead of datahub lineage. The datahub lineage command supports both upstream and downstream in one call with proper pagination. Use it instead of the raw aspect fetch.
  • Showing only URNs. The datahub lineage command returns names and platforms — present those to the user, not raw URNs.
  • Answering metadata questions instead of tracing. "Who owns X?" is a Search question, not a Lineage question. Lineage is for relationships between entities, not entity properties.

Red Flags

  • User input contains shell metacharacters → reject, do not pass to CLI.
  • Traversal depth > 3 hops → confirm with user before proceeding.
  • Lineage returns 0 edges → entity may not have lineage ingested. Note this rather than saying "no dependencies."
  • User asks about metadata, not lineage ("who owns X?", "add a tag") → redirect to /datahub-search or /datahub-enrich.

URN Parsing

Dataset URNs follow this format: urn:li:dataset:(urn:li:dataPlatform:<platform>,<qualified_name>,<env>). Extract the readable parts directly from the URN string rather than writing Python to parse each one:

  • Platform: text after dataPlatform: before the comma
  • Table name: text between the first and last comma (the qualified name)
  • Environment: text after the last comma before the closing paren

For dashboard/chart URNs: urn:li:<type>:(<platform>,<id>).

Present lineage results using names extracted from URNs directly. Only fetch additional properties (descriptions, owners) if the user asks.
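
Purely to illustrate which positions in the string the three bullets refer to (not as a suggestion to script URN parsing at run time), the dataset rule can be written out as below. The function name `parse_dataset_urn` is hypothetical.

```python
from typing import Dict

DATASET_PREFIX = "urn:li:dataset:(urn:li:dataPlatform:"


def parse_dataset_urn(urn: str) -> Dict[str, str]:
    """Split a dataset URN into platform, qualified name, and environment."""
    if not (urn.startswith(DATASET_PREFIX) and urn.endswith(")")):
        raise ValueError("not a dataset URN")
    body = urn[len(DATASET_PREFIX):-1]
    # Platform: text before the first comma.
    platform, rest = body.split(",", 1)
    # Name: everything up to the last comma; environment: text after it.
    name, env = rest.rsplit(",", 1)
    return {"platform": platform, "name": name, "env": env}


if __name__ == "__main__":
    print(parse_dataset_urn(
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)"
    ))
```

Splitting on the first comma and then the last comma keeps a qualified name intact even if it contains commas itself.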

Remember

  • Show the flow visually. ASCII diagrams are more intuitive than tables for small graphs.
  • Check siblings. Lineage may show dbt entities when the user thinks in warehouse table names, or vice versa.
  • Enrich when asked. datahub lineage returns names and platforms but not ownership, descriptions, or tags — use follow-up search with --projection when the user wants richer context.
  • Check for capped results. If the summary indicates truncation, increase --count.

Reference documents


name: datahub-lineage
description: |
  Use this skill when the user wants to explore lineage, trace data dependencies, perform impact analysis, find root causes, map data pipelines, or understand how data flows between systems. Triggers on: "what feeds into X", "what depends on X", "show lineage for X", "impact analysis", "trace the pipeline", "root cause", "upstream of X", "downstream of X", or any request involving data lineage and dependency tracking.
user-invocable: true
min-cli-version: 1.5.0.1rc1
allowed-tools: Bash(datahub *)

DataHub Lineage

Explore lineage, trace data dependencies, and perform impact analysis using DataHub's lineage graph.

What it does

  1. Identifies the target entity
  2. Determines traversal direction and depth
  3. Executes lineage queries via MCP tools or CLI
  4. Visualizes the lineage graph with ASCII flow diagrams

Capabilities

  • Impact analysis — What breaks if I change this table?
  • Root cause — Where does this data come from?
  • Full pipeline — End-to-end data flow mapping
  • Cross-platform — Trace data across Snowflake, dbt, Looker, etc.
  • Path finding — How does entity A connect to entity B?

Usage

/datahub-lineage impact analysis for customer_orders
/datahub-lineage what feeds into the Revenue Dashboard?
/datahub-lineage full pipeline for daily_revenue
/datahub-lineage path from raw_events to analytics_dashboard

Lineage Patterns Reference

Common lineage traversal strategies and patterns.

Traversal Strategies

Impact Analysis (Downstream)

Goal: Determine what breaks if an entity changes.

Strategy:

  1. Get all downstream entities (start with depth 1, expand as needed)
  2. Classify by type (datasets, dashboards, jobs)
  3. Identify critical paths (entities with single upstream dependency)
  4. List affected owners for notification

Key question: "Which downstream entities have no alternative data source?"
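
The classification steps in this strategy can be sketched as a single pass over enriched downstream entities. The record shape used here (`name`, `type`, `upstreams`, `owners`) and the function name `summarize_impact` are assumptions for illustration; real lineage and search results would need to be mapped into it first.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_impact(entities: List[dict]) -> Dict[str, object]:
    """Group downstream entities by type and owner and flag critical paths.

    An entity is treated as critical when it has exactly one upstream,
    meaning it has no alternative data source.
    """
    by_type = defaultdict(list)
    by_owner = defaultdict(list)
    critical = []
    for entity in entities:
        by_type[entity["type"]].append(entity["name"])
        if len(entity["upstreams"]) == 1:
            critical.append(entity["name"])
        for owner in entity.get("owners", []):
            by_owner[owner].append(entity["name"])
    return {
        "by_type": dict(by_type),
        "critical": critical,
        "by_owner": dict(by_owner),
    }


if __name__ == "__main__":
    sample = [
        {"name": "finance_close_model", "type": "dataset",
         "upstreams": ["analytics.orders"], "owners": ["Finance Analytics"]},
        {"name": "Revenue Dashboard", "type": "dashboard",
         "upstreams": ["analytics.orders", "analytics.refunds"], "owners": []},
    ]
    print(summarize_impact(sample))
```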

Root Cause (Upstream)

Goal: Trace where data originates and how it's transformed.

Strategy:

  1. Get all upstream entities (depth 1-3)
  2. Follow until reaching source-of-record systems (databases, APIs, files)
  3. Note transformation types at each hop (TRANSFORMED, VIEW, COPY)
  4. Identify the original data source

Key question: "Where does this data ultimately come from?"

Full Pipeline (Both Directions)

Goal: Map the complete data flow from source to consumption.

Strategy:

  1. Get upstream to source (root cause)
  2. Get downstream to consumers (impact)
  3. Merge into a single directed graph
  4. Present as end-to-end flow

Cross-Platform Tracing

Goal: Understand how data moves between systems.

Strategy:

  1. Trace lineage in both directions
  2. Group entities by platform
  3. Identify cross-platform edges (e.g., PostgreSQL → Snowflake via dbt)
  4. Highlight the integration points
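
Steps 2 and 3 amount to comparing the platform on each side of an edge. In this minimal sketch the edge list, the `platform_of` mapping, and both function names are hypothetical inputs that an agent would assemble from lineage results.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (upstream entity, downstream entity)


def group_by_platform(platform_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Return entity names grouped under their platform."""
    groups = defaultdict(list)
    for entity, platform in platform_of.items():
        groups[platform].append(entity)
    return dict(groups)


def cross_platform_edges(edges: List[Edge], platform_of: Dict[str, str]) -> List[Edge]:
    """Return the edges whose two endpoints sit on different platforms."""
    return [(src, dst) for src, dst in edges if platform_of[src] != platform_of[dst]]


if __name__ == "__main__":
    platforms = {
        "raw_orders": "PostgreSQL",
        "stg_orders": "Snowflake",
        "fct_orders": "Snowflake",
        "Orders Dashboard": "Looker",
    }
    edges = [
        ("raw_orders", "stg_orders"),
        ("stg_orders", "fct_orders"),
        ("fct_orders", "Orders Dashboard"),
    ]
    print(cross_platform_edges(edges, platforms))
```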

Path Finding

Goal: Determine if and how entity A connects to entity B.

Strategy:

  1. Start BFS from entity A downstream
  2. At each hop, check if entity B appears
  3. If found, return the path
  4. Max depth: 5 hops (ask user before going deeper)

Lineage Edge Types

| Type | Meaning |
| --- | --- |
| TRANSFORMED | Data was transformed (e.g., SQL query, dbt model) |
| VIEW | Entity is a view over the source |
| COPY | Data was copied without transformation |

Platform-Specific Lineage Notes

| Platform | Lineage Source | Notes |
| --- | --- | --- |
| dbt | dbt manifest | Model-level lineage, often the richest |
| Airflow | Task dependencies | Job-level lineage |
| Snowflake | Query logs | Column-level lineage possible |
| BigQuery | Audit logs | Table-level lineage |
| Looker | LookML explores | Dashboard → dataset lineage |
| Tableau | Workbook metadata | Dashboard → dataset lineage |

Choosing the Right Command

| Need | Command | Why |
| --- | --- | --- |
| Unfiltered upstream/downstream | datahub lineage | Simple, returns names and platforms |
| Column-level lineage | datahub lineage --column <field> | Only command that supports column tracing |
| Filter by type, platform, tags | searchAcrossLineage via datahub graphql | Server-side filtering avoids fetching full graph |
| Time-windowed lineage | searchAcrossLineage with lineageFlags | Only way to scope by edge update time |
| Large result sets (300+) | scrollAcrossLineage via datahub graphql | Cursor-based pagination for large graphs |

Lineage Limitations

  • Use datahub lineage for both upstream and downstream traversal. Supports --hops, --column, and --format json with metadata hints.
  • Use searchAcrossLineage when filtering is needed. datahub lineage has no filter support — use the GraphQL query via datahub graphql to filter by entity type, platform, tags, domain, or time window.
  • Depth: Deep lineage graphs (5+ hops) can be very large. Always cap and ask.
  • Staleness: Lineage reflects the last ingestion. It may not reflect recent pipeline changes.
  • Column-level: Not all sources provide column-level lineage. Note when unavailable.

Impact Analysis

Target Entity

Name: <!-- entity name --> URN: <!-- urn --> Platform: <!-- platform --> Type: <!-- dataset / dashboard / etc. -->

Impact Summary

Direct dependents (1 hop): <!-- count -->
Transitive dependents (all hops): <!-- count -->
Depth traced: <!-- hops -->

Affected Entities

By Type

| Type | Count | Entities |
| --- | --- | --- |
| Datasets | <!-- n --> | <!-- list --> |
| Dashboards | <!-- n --> | <!-- list --> |
| Data Jobs | <!-- n --> | <!-- list --> |
| Charts | <!-- n --> | <!-- list --> |

By Platform

| Platform | Count |
| --- | --- |
| <!-- platform --> | <!-- n --> |

Critical Paths

<!-- Entities with single upstream dependency on target -->
| Entity | Type | Risk |
| --- | --- | --- |
| <!-- name --> | <!-- type --> | Single dependency, no alternative source |

Lineage Graph

<!-- ASCII flow diagram -->

Affected Owners

| Owner | Entities Affected |
| --- | --- |
| <!-- owner --> | <!-- count and list --> |

Recommendations

  1. <!-- Notification actions -->
  2. <!-- Migration/update suggestions -->
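
The counts in the Impact Analysis template above could be derived from a flat list of affected entities. The following is only a sketch: the record shape (`name`, `type`, `platform`, `hops`, `owners`) and the `summarize_impact` helper are assumptions for illustration, not a format that DataHub returns.

```python
from collections import Counter, defaultdict

# Hypothetical, already-enriched lineage results; the field names are assumptions.
affected = [
    {"name": "orders_daily", "type": "DATASET", "platform": "snowflake",
     "hops": 1, "owners": ["alice"]},
    {"name": "revenue_dash", "type": "DASHBOARD", "platform": "looker",
     "hops": 2, "owners": ["bob"]},
    {"name": "orders_by_region", "type": "DATASET", "platform": "snowflake",
     "hops": 2, "owners": ["alice", "bob"]},
]

def summarize_impact(entities):
    """Aggregate affected entities into the counts the report template asks for."""
    by_owner = defaultdict(list)
    for entity in entities:
        for owner in entity["owners"]:
            by_owner[owner].append(entity["name"])
    return {
        "direct": sum(1 for e in entities if e["hops"] == 1),
        "transitive": len(entities),
        "depth": max((e["hops"] for e in entities), default=0),
        "by_type": dict(Counter(e["type"] for e in entities)),
        "by_platform": dict(Counter(e["platform"] for e in entities)),
        "by_owner": dict(by_owner),
    }

summary = summarize_impact(affected)
print(summary["by_type"], summary["by_platform"])
```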

Lineage Map

Target Entity

Name: <!-- entity name -->
URN: <!-- urn -->

Flow Diagram

<!-- ASCII lineage diagram -->

Upstream (Sources)

| Hop | Entity | Type | Platform | Relationship |
| --- | --- | --- | --- | --- |
| 1 | <!-- name --> | <!-- type --> | <!-- platform --> | <!-- TRANSFORMED/VIEW/COPY --> |

Downstream (Consumers)

| Hop | Entity | Type | Platform | Relationship |
| --- | --- | --- | --- | --- |
| 1 | <!-- name --> | <!-- type --> | <!-- platform --> | <!-- type --> |

Cross-Platform Boundaries

| From | To | Edge |
| --- | --- | --- |
| <!-- platform A --> | <!-- platform B --> | <!-- entity A → entity B --> |

DataHub CLI Reference

Commands verified against DataHub CLI v1.4.0. Install via pip install acryl-datahub.


Tool Detection

Before running any DataHub commands, determine which tools are available:

  1. MCP tools available — If tools like datahub_search, datahub_get_entity, datahub_get_lineage are in your tool list, use them directly. They are the preferred path — no CLI installation needed.
  2. CLI available — If you have a Bash tool, check: which datahub. If found, use the CLI commands documented below.
  3. Neither — Suggest the user set up a DataHub connection using /datahub-setup.

MCP takes priority over CLI when both are available — MCP tools are purpose-built for agent use with structured inputs/outputs and no shell overhead.
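
The detection order above could be sketched as a small decision function. The name `detect_datahub_access` and its return labels are hypothetical; the `which` parameter defaults to the standard library's `shutil.which` and is injectable only so the logic can be exercised without a real installation.

```python
import shutil

# Tool names taken from the detection list above.
MCP_TOOL_HINTS = ("datahub_search", "datahub_get_entity", "datahub_get_lineage")

def detect_datahub_access(available_tools, which=shutil.which):
    """Return "mcp", "cli", or "setup", with MCP taking priority over the CLI."""
    if any(name in available_tools for name in MCP_TOOL_HINTS):
        return "mcp"
    if which("datahub"):        # same check as `which datahub` in a shell
        return "cli"
    return "setup"              # neither path exists: suggest /datahub-setup

print(detect_datahub_access(["datahub_search", "Bash"]))
```

This sketch matches exact tool names only; prefixed MCP names need the suffix matching described in the next section.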

CLI ↔ MCP Equivalents

| Operation | CLI Command | MCP Tool |
| --- | --- | --- |
| Search | datahub search "query" --where "..." | search(query="...", filter="...") |
| Get entity | datahub get --urn "..." --aspect ownership | get_entities(urns=["..."]) |
| Upstream lineage | datahub lineage --urn "..." --direction upstream | get_lineage(urn="...", upstream=true) |
| Downstream lineage | datahub lineage --urn "..." --direction downstream | get_lineage(urn="...", upstream=false) |
| GraphQL | datahub graphql --query '...' | execute_graphql(query="...") |
| Server config | datahub check server-config | Not needed (MCP server handles config) |

MCP tool names may be prefixed (e.g. mcp__datahub-cloud__search). Match by the function name suffix, not the full prefixed name. MCP tools are self-documenting — check their schemas for parameter details rather than relying on static documentation.
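
Suffix matching could look like the sketch below. `find_mcp_tool` is a hypothetical helper, and the double-underscore separator is an assumption inferred from the single example name above; other clients may prefix differently.

```python
from typing import Iterable, Optional

def find_mcp_tool(tool_names: Iterable[str], function_name: str) -> Optional[str]:
    """Return the first tool named function_name, bare or behind a "__" prefix."""
    for name in tool_names:
        if name == function_name or name.endswith("__" + function_name):
            return name
    return None

tools = ["mcp__datahub-cloud__search", "mcp__datahub-cloud__get_lineage", "Bash"]
print(find_mcp_tool(tools, "get_lineage"))
# → mcp__datahub-cloud__get_lineage
```

Requiring the separator before the suffix avoids false matches such as a tool called `web_search` when looking for `search`. It does not tell two MCP servers apart if both expose a `search` tool.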

The rest of this document covers the CLI path.


Authentication

The CLI reads connection settings from ~/.datahubenv:

gms:
  server: "http://localhost:8080"
  token: "<personal-access-token>"

Or via environment variables:

export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_GMS_TOKEN="<token>"

Version Check

Before running commands, check the installed CLI version:

datahub version

If a skill requires a minimum version and the installed version is older, upgrade:

pip install --upgrade acryl-datahub --pre

The --pre flag ensures pre-release versions (e.g. 1.5.0rc1) are included, which may be required for newer features.
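
A minimum-version check could be sketched as follows. The helper names are hypothetical, and the sample string is an assumed shape for the `datahub version` output: the sketch simply takes the first `X.Y.Z` number it finds, so verify that the CLI version comes first in the real output.

```python
import re

def parse_version(text):
    """Extract (major, minor, patch) from text such as '1.4.0' or '1.5.0rc1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())

def meets_minimum(installed, minimum):
    """Compare numerically, so 1.10.0 counts as newer than 1.9.0."""
    return parse_version(installed) >= parse_version(minimum)

# Assumed output shape, for illustration only.
print(meets_minimum("DataHub CLI version: 1.3.2", "1.4.0"))
```

Pre-release suffixes are ignored here, so `1.5.0rc1` is treated as `1.5.0`. That is a simplification, not how pip orders versions.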

Server Detection

Detect whether you're connected to DataHub Cloud or OSS:

datahub check server-config
  • serverEnv: 'cloud' → DataHub Cloud (supports popularity sorting, dataset features)
  • serverEnv: 'core' or other → OSS / self-hosted (feature fields not available)

Cache this result for the session — don't re-check on every command. Some features marked (Cloud only) below require serverEnv: cloud.
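
Caching the server check for a session could be sketched like this. The exact output format of `datahub check server-config` is an assumption here: the regular expression only looks for a `serverEnv` key followed by a value, quoted or not. All function names are hypothetical.

```python
import re
import subprocess
from functools import lru_cache

def parse_server_env(config_text):
    """Pull the serverEnv value out of server-config output; None if absent."""
    match = re.search(r"""['"]?serverEnv['"]?\s*[:=]\s*['"]?([\w-]+)""", config_text)
    return match.group(1) if match else None

def is_cloud(config_text):
    """True only for serverEnv 'cloud'; 'core' or anything else counts as OSS."""
    return parse_server_env(config_text) == "cloud"

@lru_cache(maxsize=1)
def server_config_text():
    """Run the CLI once per process and reuse the captured output afterwards."""
    result = subprocess.run(
        ["datahub", "check", "server-config"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Parsing demo on a literal string; no CLI call is made here.
print(is_cloud("serverEnv: 'cloud'"))
```

In a real session, `is_cloud(server_config_text())` would invoke the CLI on the first call only.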

Context

Pass context on CLI commands using -C key=value so commands can be correlated:

datahub -C skill=datahub-audit search "revenue"
datahub -C skill=datahub-audit -C caller=claude-code get --urn "..."

The -C flag goes on the root datahub command (before the subcommand). Use the skill's own name from its YAML frontmatter as the skill value. If the flag is not recognized, omit it — the command works the same without it.
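
When commands are assembled programmatically, the flag placement described above could be enforced by a helper like this one. `datahub_command` is a hypothetical name; it only builds an argument list and does not run anything.

```python
def datahub_command(subcommand_args, context=None):
    """Build an argv list with -C key=value pairs placed before the subcommand."""
    argv = ["datahub"]
    for key, value in (context or {}).items():
        argv += ["-C", f"{key}={value}"]
    return argv + list(subcommand_args)

print(datahub_command(["search", "revenue"], {"skill": "datahub-lineage"}))
# → ['datahub', '-C', 'skill=datahub-lineage', 'search', 'revenue']
```

If the installed CLI rejects `-C`, calling the helper without a context yields the plain command.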


Search & Discovery

The search CLI uses a positional query argument — not --query.

# Basic keyword search
datahub search "revenue"

# Search with limit
datahub search "customers" --limit 20

# Filter by platform (simple filter)
datahub search "*" --filter platform=snowflake

# Filter by entity type
datahub search "*" --where "entity_type = dataset"

# SQL-like WHERE expressions (recommended for agents)
datahub search "*" --where "platform = snowflake AND env = PROD"
datahub search "*" --where "platform IN (snowflake, bigquery)"
datahub search "*" --where "entity_type = dataset AND platform = snowflake"

# Multiple simple filters (AND between fields, comma = OR within field)
datahub search "*" --filter platform=snowflake --filter env=PROD
datahub search "*" --filter platform=snowflake,bigquery

# Output formats
datahub search "revenue" --table          # Human-readable table
datahub search "revenue" --urns-only      # URNs only, one per line
datahub search "revenue" --format json    # JSON (default)

# Pagination (max 50 per page)
datahub search "customers" --limit 50 --offset 0     # page 1
datahub search "customers" --limit 50 --offset 50    # page 2

# Facets only (counts by type/platform/etc.)
datahub search "*" --facets-only --format json

# Dry run (preview query without executing)
datahub search "revenue" --where "platform = snowflake" --dry-run

# Projection (limit returned fields — reduces token cost)
datahub search "customers" --projection "urn type"

# Column-level search (find datasets containing a specific field)
datahub search "*" --where "entity_type = dataset AND fieldPaths = customer_id"

# Sorting
datahub search "*" --sort-by lastModifiedAt --sort-order desc --limit 10
datahub search "*" --sort-by _entityName --sort-order asc --limit 10

# Popularity / usage sorting (Cloud only — check serverEnv first)
# Most queried datasets
datahub search "*" --where "entity_type = dataset" \
  --sort-by queryCountLast30DaysFeature --sort-order desc --limit 10 \
  --projection "urn type ... on Dataset { properties { name } platform { name } statsSummary { queryCountLast30Days uniqueUserCountLast30Days } }"

# Most updated datasets
datahub search "*" --where "entity_type = dataset" --sort-by writeCountLast30DaysFeature --sort-order desc --limit 10

# Largest tables (by row count or bytes)
datahub search "*" --where "entity_type = dataset" --sort-by rowCountFeature --sort-order desc --limit 10
datahub search "*" --where "entity_type = dataset" --sort-by sizeInBytesFeature --sort-order desc --limit 10

# Existence filters (IS NULL / IS NOT NULL)
datahub search "*" --where "entity_type = dataset AND description IS NULL AND editableDescription IS NULL"
datahub search "*" --where "entity_type = dataset AND glossary_term IS NOT NULL"

# Sibling-aware description audit (single query, no N+1 fetches)
# Step 1: Find datasets missing both ingestion and user-edited descriptions
# Step 2: Project siblings with their descriptions to compute effective coverage
datahub search "*" \
  --where "entity_type = dataset AND platform = snowflake AND description IS NULL AND editableDescription IS NULL" \
  --projection "urn type ... on Dataset { siblings { isPrimary siblings { urn ... on Dataset { properties { name description } editableProperties { description } } } } }" \
  --format json --limit 50

# URN resolution for filters
# Tag, domain, and glossary_term filters require full URNs — not display names.
# Always resolve the name to a URN first, then use the URN in the filter.

# Step 1: Find tag URN by name
datahub search "large table" --where "entity_type = tag" --urns-only --limit 1
# → urn:li:tag:sample_data___default_large_table

# Step 2: Use the URN in a filter
datahub search "*" --where "entity_type = dataset AND tags = 'urn:li:tag:sample_data___default_large_table'"

# Same pattern for domains:
datahub search "ecommerce" --where "entity_type = domain" --urns-only --limit 1
# → urn:li:domain:91994180-...
datahub search "*" --where "entity_type = dataset AND domain = 'urn:li:domain:91994180-...'"

# And glossary terms:
datahub search "PII" --where "entity_type = glossaryTerm" --urns-only --limit 1
datahub search "*" --where "entity_type = dataset AND glossary_term = 'urn:li:glossaryTerm:...'"

# Discover available filters
datahub search list-filters
datahub search describe-filter platform

# Agent best practices
datahub search --agent-context
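
The pagination pattern in the search commands above (at most 50 results per page, advancing `--offset`) could be generated with a small helper. `page_offsets` is a hypothetical name, and the total of 120 is an arbitrary illustration rather than a value from a real query.

```python
def page_offsets(total, page_size=50):
    """Yield (limit, offset) pairs covering `total` results, at most 50 per page."""
    if not 1 <= page_size <= 50:
        raise ValueError("page_size must be between 1 and 50")
    for offset in range(0, total, page_size):
        yield min(page_size, total - offset), offset

for limit, offset in page_offsets(120):
    print(f'datahub search "customers" --limit {limit} --offset {offset}')
```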

Entity Retrieval

# Get full entity metadata
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,table_name,PROD)"

# Get specific aspect
datahub get --urn "<URN>" --aspect schemaMetadata
datahub get --urn "<URN>" --aspect ownership
datahub get --urn "<URN>" --aspect globalTags

Lineage

# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream

# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream

# Limit to immediate neighbors
datahub lineage --urn "<URN>" --direction upstream --hops 1

# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream

# JSON output (includes metadata with capped/hint info)
datahub lineage --urn "<URN>" --direction downstream --format json

# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"

# Agent best practices
datahub lineage --agent-context

Timeline (Change History)

# Schema changes
datahub timeline --urn "<URN>" --category technical_schema

# Ownership changes
datahub timeline --urn "<URN>" --category owner

# Tag changes
datahub timeline --urn "<URN>" --category tag

# With time range
datahub timeline --urn "<URN>" --category technical_schema --start 7daysago

Categories: tag, glossary_term, technical_schema, documentation, owner


Write Operations (via GraphQL Mutations)

Write operations use datahub graphql --query 'mutation { ... }'. The CLI does not have dedicated tag, glossary, or inline put commands for these operations.

Important rules for GraphQL mutations:

  • Return field subselections required. Mutations returning objects (not scalars like Boolean) need { urn } or similar after the mutation. Without it: SubselectionRequired error.
  • Long queries must use temp files. Long inline --query strings get misinterpreted as file paths on macOS (File name too long). Write to a .graphql file and pass the path: datahub graphql --query /tmp/my-mutation.graphql --format json.
  • Short mutations can be inline. Simple mutations like addTag, removeTag, addOwner are short enough to pass inline.
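
The inline-versus-file rule above could be sketched as follows. `INLINE_LIMIT` is a made-up threshold, since the source gives no exact length, and `query_argument` is a hypothetical helper. The long query is filler text used only to trigger the file path.

```python
import os
import tempfile

# Hypothetical cut-off; the source does not state an exact length limit.
INLINE_LIMIT = 200

def query_argument(query, tmp_dir=None):
    """Return (value to pass to --query, temp file path or None).

    Short queries are passed inline. Longer ones are written to a
    .graphql temp file, which the caller should delete after the CLI call.
    """
    if len(query) <= INLINE_LIMIT:
        return query, None
    fd, path = tempfile.mkstemp(suffix=".graphql", dir=tmp_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(query)
    return path, path

long_query = "query { " + "me { corpUser { urn } } " * 20 + "}"  # filler
value, tmp_path = query_argument(long_query)
argv = ["datahub", "graphql", "--query", value, "--format", "json"]
print("file" if tmp_path else "inline")
if tmp_path:
    os.unlink(tmp_path)  # clean up once the CLI call has finished
```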

Tags

# Create a tag
# With id: name-based URN (human-readable, but ID is immutable — can't rename later)
# Without id: GUID-based URN (opaque, but display name can change freely)
# When unsure, ask the user which they prefer.
datahub graphql --query 'mutation {
  createTag(input: { id: "pii", name: "PII", description: "Contains PII data" })
}' --format json
# → returns urn:li:tag:pii

# Add tag to entity (tag must exist first)
datahub graphql --query 'mutation {
  addTag(input: { tagUrn: "urn:li:tag:<TAG_URN>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Add tag to a specific field
datahub graphql --query 'mutation {
  addTag(input: {
    tagUrn: "urn:li:tag:<TAG_URN>",
    resourceUrn: "<ENTITY_URN>",
    subResourceType: DATASET_FIELD,
    subResource: "<FIELD_PATH>"
  })
}' --format json

# Remove tag
datahub graphql --query 'mutation {
  removeTag(input: { tagUrn: "urn:li:tag:<TAG_URN>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Batch add tags
datahub graphql --query 'mutation {
  batchAddTags(input: {
    tagUrns: ["urn:li:tag:<TAG1>", "urn:li:tag:<TAG2>"],
    resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
  })
}' --format json

Glossary Terms

# Add term to entity
datahub graphql --query 'mutation {
  addTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Remove term
datahub graphql --query 'mutation {
  removeTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json

Ownership

# Add owner (appends — does not replace existing owners)
datahub graphql --query 'mutation {
  addOwner(input: {
    ownerUrn: "urn:li:corpuser:<USER>",
    resourceUrn: "<ENTITY_URN>",
    ownerEntityType: CORP_USER,
    type: TECHNICAL_OWNER
  })
}' --format json

# Remove owner
datahub graphql --query 'mutation {
  removeOwner(input: { ownerUrn: "urn:li:corpuser:<USER>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Batch add owners
datahub graphql --query 'mutation {
  batchAddOwners(input: {
    owners: [{ ownerUrn: "urn:li:corpuser:<USER>", ownerEntityType: CORP_USER }],
    resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
  })
}' --format json

Owner types: TECHNICAL_OWNER, BUSINESS_OWNER, DATA_STEWARD, NONE

Deprecation

# Deprecate
datahub graphql --query 'mutation {
  updateDeprecation(input: { urn: "<URN>", deprecated: true, note: "Replaced by new_table" })
}' --format json

# Un-deprecate
datahub graphql --query 'mutation {
  updateDeprecation(input: { urn: "<URN>", deprecated: false })
}' --format json

Domains

# Create domain
datahub graphql --query 'mutation {
  createDomain(input: { name: "Marketing", description: "Marketing data" })
}' --format json

# Assign entity to domain (domain must exist)
datahub graphql --query 'mutation {
  setDomain(entityUrn: "<ENTITY_URN>", domainUrn: "urn:li:domain:<DOMAIN_ID>")
}' --format json

# Remove from domain
datahub graphql --query 'mutation {
  unsetDomain(entityUrn: "<ENTITY_URN>")
}' --format json

# Batch assign
datahub graphql --query 'mutation {
  batchSetDomain(input: {
    domainUrn: "urn:li:domain:<ID>",
    resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
  })
}' --format json

Description

datahub graphql --query 'mutation {
  updateDescription(input: {
    description: "New description text",
    resourceUrn: "<ENTITY_URN>"
  })
}' --format json

Data Products

Note: domainUrn is required — every data product must belong to a domain. Use datahub graphql --describe createDataProduct --recurse to verify the schema.

# Create (domainUrn is REQUIRED)
datahub graphql --query 'mutation {
  createDataProduct(input: {
    domainUrn: "urn:li:domain:<DOMAIN_ID>",
    properties: { name: "Revenue Analytics", description: "Revenue pipeline" }
  }) { urn }
}' --format json

# Add assets to data product
datahub graphql --query 'mutation {
  batchSetDataProduct(input: {
    dataProductUrn: "urn:li:dataProduct:<ID>",
    resourceUrns: ["<URN1>", "<URN2>"]
  })
}' --format json

Verification & Health

# Check CLI version
datahub version

# Verify connectivity (this entity always exists)
datahub get --urn "urn:li:corpuser:datahub"

# Test search (confirms search index works)
datahub search "*" --limit 1

# Server configuration
datahub check server-config

Note: datahub check server-health does not exist. Use datahub get --urn "urn:li:corpuser:datahub" to verify connectivity.


GraphQL Discovery

# List all available operations
datahub graphql --list-operations --format json

# List mutations only
datahub graphql --list-mutations --format json

# Describe a specific operation
datahub graphql --describe addTag --format json

# Describe with full type expansion
datahub graphql --describe addTag --recurse --format json

# Dry run (preview without executing)
datahub graphql --query '{ me { corpUser { urn } } }' --dry-run

# Agent best practices
datahub graphql --agent-context

Batch Mutation Pattern (Python)

Shell loops with dataset URNs are fragile due to quoting issues with parentheses. For multi-entity mutations, use a Python script with temp files:

import subprocess, json, tempfile, os

def run_graphql_mutation(query, variables):
    """Run a GraphQL mutation with variables via temp file. Returns parsed JSON or None."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
        json.dump(variables, f)
        vf = f.name
    try:
        result = subprocess.run(
            ["datahub", "graphql", "-q", query, "-v", vf, "--format", "json", "--no-pretty"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            print(f"ERROR: {result.stderr.strip()[:120]}")
            return None
    finally:
        os.unlink(vf)

# Example: batch update descriptions
query = "mutation updateDataset($urn: String!, $input: DatasetUpdateInput!) { updateDataset(urn: $urn, input: $input) { urn } }"

datasets = {
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)": "Description for table1",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)": "Description for table2",
}

for urn, desc in datasets.items():
    variables = {"urn": urn, "input": {"editableProperties": {"description": desc}}}
    result = run_graphql_mutation(query, variables)
    status = "OK" if result else "FAIL"
    print(f"  {urn.split(',')[1]}: {status}")
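
The script above sidesteps the quoting problem by passing an argument list to `subprocess.run`, so no shell ever parses the URN. If a shell command string is unavoidable, the standard library's `shlex.quote` is one way to protect the parentheses. This is a small illustration, not part of the original pattern.

```python
import shlex

urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)"

# Quote the URN so a POSIX shell treats it as one literal argument.
command = "datahub get --urn " + shlex.quote(urn)
print(command)

# Round trip: splitting the string recovers the original URN unchanged.
print(shlex.split(command)[-1] == urn)
```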

Output Processing

# Pipe search URNs to get for batch retrieval
datahub search "customers" --urns-only | xargs -I{} datahub get --urn {}

# Extract field names from schema
datahub get --urn "<URN>" --aspect schemaMetadata | python3 -c "
import sys, json
data = json.load(sys.stdin)
for f in data.get('schemaMetadata', {}).get('fields', []):
    print(f['fieldPath'])
"