See what data depends on a table before you change it. — Claude Skill

Name: DataHub Lineage
Author: DataHub Project

Claude skill for Claude Code · Provided by: DataHub Project ✓ · Run: /datahub-lineage (in Claude) · Updated: June 12, 2026 · vmain@68585b1

Compatible with: ChatGPT, Claude, Claude Code, Claude Desktop, Codex / Codex CLI, Cursor, Gemini, Hermes (via Continue / Cline), OpenClaw, Windsurf

Finds upstream sources, downstream dashboards, owners, and risk in DataHub so teams can avoid breaking reports, pipelines, or customer-facing data.

Shows what feeds a dataset and what depends on it downstream.
Finds dashboards, tables, pipelines, owners, and platforms affected by a change.
Supports impact analysis, root-cause tracing, cross-platform maps, and specific source-to-target paths.
Turns raw DataHub lineage into a readable impact report for data and business teams.

Today

An analyst manually clicks through lineage views and exports partial lists of dependencies.

With /datahub-lineage

Run /datahub-lineage to resolve the entity, traverse the graph, enrich results, and produce a reusable impact report.

1. Find entity
2. Choose traversal mode
3. Run lineage query
4. Enrich and summarize impact

Who it's for

Data Engineer

Trace upstream/downstream dependencies and change impact in DataHub.

Analytics Engineer

Map how analytical datasets feed dashboards, models, and downstream reports.

Features

Change impact

Find who and what is affected before a table, column, or pipeline changes.

Root cause

Trace upstream sources when a dashboard or dataset looks wrong.

Ownership map

Show which teams own the data assets involved in a flow.

How it works

Start with a DataHub entity name or URN.

Choose the question: what breaks downstream, where did bad data come from, or how does data flow across platforms.

Traverse the lineage graph and enrich results with owner, platform, type, and metadata context.

Summarize affected assets, risk level, owners, and recommended next actions.

Input options

DataHub entity

A dataset, chart, dashboard, pipeline, or URN.

Example

Example input

Planned change: rename analytics.orders.discount_code to promo_code.

DataHub asset: Snowflake table analytics.orders.
Deploy target: Friday.

Concern:
- Finance close reports may use this table.
- Revenue dashboard may use this column.
- Customer cohort notebook may read from the downstream model.

Need: affected assets, owners, risk level, and who to notify before deploy.

What the skill returns

How it reads the request

The skill treats the table like a dependency map: it checks what feeds analytics.orders and what depends on it downstream.

Affected assets

Revenue dashboard, weekly bookings export, finance close model, and customer cohort notebook depend on this table or downstream models.

Highest risk

The finance close model uses discount_code directly and has no fallback. Changing the column before month close could break close reporting.

Owners to notify

Finance Analytics owns the close model, RevOps Analytics owns the bookings export, and Growth owns the cohort notebook.

Recommended rollout

Add promo_code first, keep discount_code for one release, notify owners, and remove the old column only after downstream teams confirm migration.

Metrics improved

Lineage Coverage

+25-50%

Product & Engineering

Data Freshness

Faster impact review

Product & Engineering

Metric Trust

+10-20%

Product & Engineering

Supported tools

DataHub

Manual

Primary system for lineage graph traversal, entity lookup, ownership, and metadata enrichment.

Snowflake

Manual

Common source or target platform in DataHub lineage graphs.

SQL

Manual

Use SQL context to explain transformations and lineage paths.

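Ready to try DataHub Lineage?

Choose how to get started.

Run in Claude Code

Free and open source

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This command downloads the skill and all of its files to your computer:

Append -g to make it available in all projects.

Run it

Start Claude Code and enter the command: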

DataHub Lineage

You are an expert DataHub lineage analyst. Your role is to help the user understand how data flows through their systems — tracing upstream sources, downstream consumers, cross-platform dependencies, and assessing the impact of changes.

Multi-Agent Compatibility

This skill is designed to work across multiple coding agents (Claude Code, Cursor, Codex, Copilot, Gemini CLI, Windsurf, and others).

What works everywhere:

The full lineage exploration workflow
All traversal modes (impact analysis, root cause, dependency mapping)
Lineage visualization via MCP tools or DataHub CLI

Claude Code-specific features (other agents can safely ignore these):

allowed-tools in the YAML frontmatter above
Task(subagent_type="datahub-skills:metadata-searcher") for delegated entity lookup — only when multiple complex searches are needed to resolve and enrich a large lineage graph. For simple entity lookups, execute inline. Fallback instructions are provided inline for agents without sub-agent dispatch.

Reference file paths: Shared references are in ../shared-references/ relative to this skill's directory. Skill-specific references are in references/ and templates in templates/.

Not This Skill

If the user wants to...	Use this instead
Search for entities by keyword or metadata	`/datahub-search`
Answer "who owns X?" or "what is X?"	`/datahub-search` (metadata lookup, not lineage)
Add or update metadata (descriptions, tags, owners)	`/datahub-enrich`
Create assertions, run quality checks, manage incidents	`/datahub-quality`

Key boundary: Lineage handles lineage and dependency questions ("what feeds into X?", "what breaks if I change X?"). Search handles metadata questions ("who owns X?"). Enrich handles metadata updates ("set owner", "tag this").

Step 1: Identify Target Entity

Find the entity the user wants to trace.

If the user provides a URN, use it directly
If they provide a name, search for it: datahub search "<name>" --where "entity_type = dataset" --limit 5
If multiple matches, present options and ask the user to choose
Confirm: show entity name, URN, platform, type

Input validation: Reject shell metacharacters in search queries and URNs before passing to CLI.
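
The validation rule above can be sketched as a small deny-list check. This is a minimal illustration, not the skill's actual implementation: the function name `is_safe_cli_input` and the exact character set are assumptions. Parentheses and commas are deliberately allowed because dataset URNs contain them.

```python
# Assumed deny-list: characters that let a shell chain commands, substitute
# values, redirect output, or break out of quoting. Not an official list.
FORBIDDEN_CHARACTERS = set(";&|`$<>\\\"'\n\r")


def is_safe_cli_input(value: str) -> bool:
    """Return True when value is non-empty and contains no deny-listed character."""
    return bool(value) and not (set(value) & FORBIDDEN_CHARACTERS)


if __name__ == "__main__":
    urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)"
    print(is_safe_cli_input(urn))                 # parentheses and commas are allowed
    print(is_safe_cli_input("orders; rm -rf /"))  # contains ';'
```

As a second line of defence, passing arguments to `subprocess.run` as a list (without `shell=True`) keeps the value from being interpreted by a shell at all.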

Step 2: Determine Traversal Mode

Traversal modes

Mode	Direction	Use Case	User Says
Impact analysis	Downstream	"What breaks if I change this?"	"impact of X", "what depends on X", "downstream"
Root cause	Upstream	"Where does this data come from?"	"root cause", "what feeds X", "upstream", "source of"
Full pipeline	Both	"Show the complete data flow"	"full lineage", "end to end", "trace the pipeline"
Cross-platform	Both	"How does data flow between systems?"	"from Snowflake to Looker", "cross-platform"
Specific path	Directed	"How does X reach Y?"	"path from X to Y", "how does X connect to Y"
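
One way to read the table above is as a phrase lookup. The sketch below is only an illustration: the phrase lists come from the "User Says" column, while the function name `pick_mode`, the mode identifiers, and the matching order (most specific mode first) are assumptions. The lists are not exhaustive, so a result of `None` means the agent should ask the user.

```python
# Phrases from the "User Says" column; the order of checks is an assumption.
MODE_PHRASES = [
    ("specific_path", ["path from", "connect to"]),
    ("cross_platform", ["cross-platform"]),
    ("full_pipeline", ["full lineage", "end to end", "trace the pipeline"]),
    ("root_cause", ["root cause", "what feeds", "upstream", "source of"]),
    ("impact_analysis", ["impact of", "what depends on", "downstream"]),
]


def pick_mode(request: str):
    """Return the first mode whose phrase appears in the request, else None."""
    text = request.lower()
    for mode, phrases in MODE_PHRASES:
        if any(phrase in text for phrase in phrases):
            return mode
    return None
```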

Depth configuration

Depth	When to Use
1 hop	Default — immediate upstream/downstream
2-3 hops	User asks for "full" lineage or cross-platform tracing
More than 3 hops	Only with user confirmation, because the number of results can grow very quickly with each extra hop

Ask about depth if the user doesn't specify: "How many hops should I trace? (default: 1, or specify 'full')"
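
The depth rules can be expressed as a small policy function. This is a sketch under stated assumptions: `resolve_depth` is a hypothetical name, "full" is mapped to 3 hops (the top of the 2-3 range), and confirmation is required only above 3 hops, matching the red flag "Traversal depth > 3 hops".

```python
def resolve_depth(requested=None):
    """Map a user's depth request to (hops, needs_confirmation).

    None   -> the 1-hop default
    "full" -> 3 hops (assumed upper end of the 2-3 hop range)
    int    -> that many hops, flagged for confirmation when above 3
    """
    if requested is None:
        return 1, False
    if requested == "full":
        return 3, False
    hops = int(requested)
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return hops, hops > 3
```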

Step 3: Execute Lineage Queries

Choosing your tool: MCP vs. CLI

	MCP tools	DataHub CLI
When available	Preferred for simple traversals	Use for `path`, column-level lineage, `--format json` metadata
Lineage	`get_lineage(urn=..., direction=..., depth=...)`	`datahub lineage --urn "..." --direction upstream`
Enrich results	`get_entities(urns=[...])`	`datahub search "*" --where 'urn IN (...)'` with `--projection`

MCP provides structured lineage graphs without shell overhead — MCP tools are self-documenting, so check their schemas for parameter details. Fall back to CLI for features MCP may not support — path tracing between two entities, column-level lineage, and output format control.

Using the `datahub lineage` CLI command

# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream

# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream

# Limit depth
datahub lineage --urn "<URN>" --direction downstream --hops 1

# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream

# JSON output (includes metadata with hints about capped/truncated results)
datahub lineage --urn "<URN>" --direction downstream --format json

# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"

The command returns a summary line indicating how many entities were found, the maximum hop depth, and whether results were capped. Use --format json for structured output with a metadata object the agent can inspect.

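Defaults: --hops 3 (full transitive lineage), --count 100. This CLI default is deeper than the 1-hop default in Step 2, so pass --hops 1 explicitly for a default traversal. Increase --count if the summary indicates results were capped.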

Output formats: Use --format json for structured processing (includes a metadata object with capped/truncated hints). Default table output is best for quick display to the user.
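
For agents that call the CLI from code, the commands above can be assembled as an argument list rather than a shell string. The helper below is a hypothetical sketch (`build_lineage_argv` is not part of DataHub); it uses only the flags shown in this section and does not run anything.

```python
def build_lineage_argv(urn, direction, hops=None, column=None, count=None, as_json=False):
    """Build the argv list for `datahub lineage` from the flags documented above."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    argv = ["datahub", "lineage", "--urn", urn, "--direction", direction]
    if hops is not None:
        argv += ["--hops", str(hops)]
    if column is not None:
        argv += ["--column", column]  # column-level lineage applies to datasets only
    if count is not None:
        argv += ["--count", str(count)]
    if as_json:
        argv += ["--format", "json"]
    return argv
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)`, which avoids shell quoting problems with the parentheses and commas in URNs.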

What lineage returns vs. what needs follow-up

datahub lineage returns basic fields for each entity: URN, name, type, platform, and hop distance. It does not support --projection and does not return ownership, descriptions, tags, or other rich metadata.

To enrich lineage results with richer metadata, use search with a urn filter to batch multiple URNs in a single call with --projection:

# Batch-enrich lineage results — quote URNs (they contain parentheses and commas)
datahub search "*" \
  --where 'urn IN ("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)", "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)")' \
  --projection "urn type
    ... on Dataset { properties { name description } platform { name }
      ownership { owners { owner type } }
      siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } } }
    }"

This avoids N+1 calls — collect the URNs from lineage output and resolve them all in one search. The urn field is not a named filter but works via custom passthrough to Elasticsearch.
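
Building the `urn IN (...)` expression by hand is error-prone once there are more than a few URNs. A minimal sketch of a helper follows; `build_urn_filter` is a hypothetical name, and the quoting style simply mirrors the example above.

```python
def build_urn_filter(urns):
    """Return a --where expression that matches every URN in one search call."""
    unique = list(dict.fromkeys(urns))  # drop duplicates, keep first-seen order
    if not unique:
        raise ValueError("at least one URN is required")
    for urn in unique:
        if '"' in urn:
            raise ValueError("URN must not contain a double quote")
    quoted = ", ".join('"' + urn + '"' for urn in unique)
    return "urn IN (" + quoted + ")"
```

The returned string is intended as the value of `--where`; the URNs themselves would come from the lineage output of the previous step.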

MCP alternative: If MCP is available, get_entities(urns=["<URN_1>", "<URN_2>"]) also supports batch lookup.

Siblings in lineage results

Lineage may return a dbt model URN when the user is thinking of the warehouse table (or vice versa). These are linked via the siblings aspect. When presenting lineage results, note when an entity has a sibling on a different platform — e.g., "dbt model stg_orders (sibling: Snowflake analytics.stg_orders)". See the entity model reference for sibling resolution details.

Specific path tracing

Use the CLI command first:

datahub lineage path --from "<URN_A>" --to "<URN_B>"

If path is unavailable, fall back to a manual breadth-first search (BFS): fetch downstream lineage from A one hop at a time, check whether B appears at each hop, and stop after 5 hops unless the user asks to go deeper.
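
The manual BFS fallback can be sketched as follows. This is an illustration under assumptions: `find_path` is a hypothetical helper, and `get_downstream` stands in for whatever call returns the 1-hop downstream URNs of an entity (for example a wrapper around the lineage query with a depth of 1).

```python
from collections import deque


def find_path(start, target, get_downstream, max_hops=5):
    """Breadth-first search from start to target, giving up after max_hops.

    get_downstream(urn) must return an iterable of directly downstream URNs.
    Returns the path as a list of URNs, or None when no path is found in range.
    """
    if start == target:
        return [start]
    parents = {start: None}          # also serves as the visited set
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue                 # do not expand beyond the hop limit
        for neighbour in get_downstream(node):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == target:
                path = [neighbour]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            frontier.append((neighbour, depth + 1))
    return None
```

Because BFS explores hop by hop, the first path found is also a shortest one, which keeps the reported chain as short as possible.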

Step 4: Visualize Lineage

ASCII flow diagram

For simple lineage (up to ~10 entities):

[source_table_1] ──→ [staging_table] ──→ [analytics_table] ──→ [Revenue Dashboard]
[source_table_2] ──┘                                        └──→ [daily_export]

Structured list

For larger or more complex lineage:

### Upstream (sources for analytics_table)

| Hop | Entity         | Type    | Platform   | Relationship |
| --- | -------------- | ------- | ---------- | ------------ |
| 1   | staging_table  | dataset | Snowflake  | TRANSFORMED  |
| 2   | source_table_1 | dataset | PostgreSQL | TRANSFORMED  |
| 2   | source_table_2 | dataset | PostgreSQL | TRANSFORMED  |

### Downstream (consumers of analytics_table)

| Hop | Entity            | Type      | Platform | Relationship |
| --- | ----------------- | --------- | -------- | ------------ |
| 1   | Revenue Dashboard | dashboard | Looker   | —            |
| 1   | daily_export      | dataset   | S3       | TRANSFORMED  |

Impact analysis format

For impact analysis, group by entity type, identify critical paths (single-dependency chains), and list affected owners. See templates/impact-analysis.template.md for the full template.

Cross-platform view

Group by platform when lineage crosses systems:

PostgreSQL           Snowflake              Looker
─────────           ─────────              ──────
[raw_orders] ──→ [stg_orders] ──→ [fct_orders] ──→ [Orders Dashboard]
[raw_customers] ──→ [stg_customers] ──┘

Suggesting Next Steps

After presenting lineage:

"Want to see metadata details for any of these?" → fetch with datahub search using --projection with ownership, descriptions, siblings
"Want to update metadata along this pipeline? Use /datahub-enrich"
"Want to run an impact audit? Use /datahub-audit"

Reference Documents

Document	Path	Purpose
Lineage patterns reference	`references/lineage-patterns-reference.md`	Traversal strategies and patterns
Impact analysis template	`templates/impact-analysis.template.md`	Impact analysis report template
Lineage map template	`templates/lineage-map.template.md`	Lineage visualization template
CLI reference (shared)	`../shared-references/datahub-cli-reference.md`	CLI commands

Common Mistakes

Using datahub get --aspect upstreamLineage instead of datahub lineage. The datahub lineage command supports both upstream and downstream in one call with proper pagination. Use it instead of the raw aspect fetch.
Showing only URNs. The datahub lineage command returns names and platforms — present those to the user, not raw URNs.
Answering metadata questions instead of tracing. "Who owns X?" is a Search question, not a Lineage question. Lineage is for relationships between entities, not entity properties.

Red Flags

User input contains shell metacharacters → reject, do not pass to CLI.
Traversal depth > 3 hops → confirm with user before proceeding.
Lineage returns 0 edges → entity may not have lineage ingested. Note this rather than saying "no dependencies."
User asks about metadata, not lineage ("who owns X?", "add a tag") → redirect to /datahub-search or /datahub-enrich.

URN Parsing

Dataset URNs follow this format: urn:li:dataset:(urn:li:dataPlatform:<platform>,<qualified_name>,<env>). Extract the readable parts directly from the URN string rather than writing Python to parse each one:

Platform: text after dataPlatform: before the comma
Table name: text between the first and last comma (the qualified name)
Environment: text after the last comma before the closing paren

For dashboard/chart URNs: urn:li:<type>:(<platform>,<id>).

Present lineage results using names extracted from URNs directly. Only fetch additional properties (descriptions, owners) if the user asks.
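
To make the splitting rule concrete, here is the same extraction written out as code. It is only an illustration of the rule (the skill itself is meant to read these parts straight off the string), and `parse_dataset_urn` is a hypothetical name.

```python
def parse_dataset_urn(urn):
    """Split a dataset URN into platform, qualified name, and environment."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError("not a dataset URN")
    body = urn[len(prefix):-1]
    platform_urn, rest = body.split(",", 1)   # first comma ends the platform URN
    name, env = rest.rsplit(",", 1)           # last comma starts the environment
    platform = platform_urn.rsplit(":", 1)[-1]
    return {"platform": platform, "name": name, "env": env}
```

Splitting on the first and last comma, rather than on every comma, keeps a qualified name intact even if it happens to contain a comma itself.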

Remember

Show the flow visually. ASCII diagrams are more intuitive than tables for small graphs.
Check siblings. Lineage may show dbt entities when the user thinks in warehouse table names, or vice versa.
Enrich when asked. datahub lineage returns names and platforms but not ownership, descriptions, or tags — use follow-up search with --projection when the user wants richer context.
Check for capped results. If the summary indicates truncation, increase --count.

Reference documents

name: datahub-lineage
description: |
  Use this skill when the user wants to explore lineage, trace data dependencies, perform impact analysis, find root causes, map data pipelines, or understand how data flows between systems. Triggers on: "what feeds into X", "what depends on X", "show lineage for X", "impact analysis", "trace the pipeline", "root cause", "upstream of X", "downstream of X", or any request involving data lineage and dependency tracking.
user-invocable: true
min-cli-version: 1.5.0.1rc1
allowed-tools: Bash(datahub *)

DataHub Lineage

Explore lineage, trace data dependencies, and perform impact analysis using DataHub's lineage graph.

What it does

Identifies the target entity
Determines traversal direction and depth
Executes lineage queries via MCP tools or CLI
Visualizes the lineage graph with ASCII flow diagrams

Capabilities

Impact analysis — What breaks if I change this table?
Root cause — Where does this data come from?
Full pipeline — End-to-end data flow mapping
Cross-platform — Trace data across Snowflake, dbt, Looker, etc.
Path finding — How does entity A connect to entity B?

Usage

/datahub-lineage impact analysis for customer_orders
/datahub-lineage what feeds into the Revenue Dashboard?
/datahub-lineage full pipeline for daily_revenue
/datahub-lineage path from raw_events to analytics_dashboard

Lineage Patterns Reference

Common lineage traversal strategies and patterns.

Traversal Strategies

Impact Analysis (Downstream)

Goal: Determine what breaks if an entity changes.

Strategy:

Get all downstream entities (start with depth 1, expand as needed)
Classify by type (datasets, dashboards, jobs)
Identify critical paths (entities with single upstream dependency)
List affected owners for notification

Key question: "Which downstream entities have no alternative data source?"
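
The classification and critical-path steps can be sketched over already-fetched results. The input shapes here are assumptions made for illustration: `downstream` is a list of dicts with `name` and `type`, and `upstream_counts` maps each entity name to its number of direct upstream sources. `summarize_impact` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_impact(downstream, upstream_counts):
    """Group downstream entities by type and flag single-dependency entities."""
    by_type = defaultdict(list)
    for entity in downstream:
        by_type[entity["type"]].append(entity["name"])
    # An entity with exactly one upstream has no alternative data source.
    critical = [e["name"] for e in downstream if upstream_counts.get(e["name"]) == 1]
    return dict(by_type), critical
```

Entities in the `critical` list answer the key question directly: they are the ones to call out first in the impact report.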

Root Cause (Upstream)

Goal: Trace where data originates and how it's transformed.

Strategy:

Get all upstream entities (depth 1-3)
Follow until reaching source-of-record systems (databases, APIs, files)
Note transformation types at each hop (TRANSFORMED, VIEW, COPY)
Identify the original data source

Key question: "Where does this data ultimately come from?"

Full Pipeline (Both Directions)

Goal: Map the complete data flow from source to consumption.

Strategy:

Get upstream to source (root cause)
Get downstream to consumers (impact)
Merge into a single directed graph
Present as end-to-end flow

Cross-Platform Tracing

Goal: Understand how data moves between systems.

Strategy:

Trace lineage in both directions
Group entities by platform
Identify cross-platform edges (e.g., PostgreSQL → Snowflake via dbt)
Highlight the integration points
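
Grouping by platform and finding the edges that cross a platform boundary is a small computation once the edges are known. The sketch below assumes edges are `(source, target)` name pairs and that `platform_of` maps each name to its platform; both shapes and the function names are illustrative, not part of DataHub.

```python
from collections import defaultdict


def group_by_platform(platform_of):
    """Return platform -> sorted list of entity names."""
    groups = defaultdict(list)
    for name, platform in platform_of.items():
        groups[platform].append(name)
    return {platform: sorted(names) for platform, names in groups.items()}


def cross_platform_edges(edges, platform_of):
    """Keep only edges whose endpoints live on different platforms."""
    return [(src, dst) for src, dst in edges if platform_of[src] != platform_of[dst]]
```

The edges returned by `cross_platform_edges` are the integration points to highlight in the cross-platform view.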

Path Finding

Goal: Determine if and how entity A connects to entity B.

Strategy:

Start BFS from entity A downstream
At each hop, check if entity B appears
If found, return the path
Max depth: 5 hops (ask user before going deeper)

Lineage Edge Types

Type	Meaning
`TRANSFORMED`	Data was transformed (e.g., SQL query, dbt model)
`VIEW`	Entity is a view over the source
`COPY`	Data was copied without transformation

Platform-Specific Lineage Notes

Platform	Lineage Source	Notes
dbt	dbt manifest	Model-level lineage, often the richest
Airflow	Task dependencies	Job-level lineage
Snowflake	Query logs	Column-level lineage possible
BigQuery	Audit logs	Table-level lineage
Looker	LookML explores	Dashboard → dataset lineage
Tableau	Workbook metadata	Dashboard → dataset lineage

Choosing the Right Command

Need	Command	Why
Unfiltered upstream/downstream	`datahub lineage`	Simple, returns names and platforms
Column-level lineage	`datahub lineage --column <field>`	Only command that supports column tracing
Filter by type, platform, tags	`searchAcrossLineage` via `datahub graphql`	Server-side filtering avoids fetching full graph
Time-windowed lineage	`searchAcrossLineage` with `lineageFlags`	Only way to scope by edge update time
Large result sets (300+)	`scrollAcrossLineage` via `datahub graphql`	Cursor-based pagination for large graphs

Lineage Limitations

Use datahub lineage for both upstream and downstream traversal. Supports --hops, --column, and --format json with metadata hints.
Use searchAcrossLineage when filtering is needed. datahub lineage has no filter support — use the GraphQL query via datahub graphql to filter by entity type, platform, tags, domain, or time window.
Depth: Deep lineage graphs (5+ hops) can be very large. Always cap and ask.
Staleness: Lineage reflects the last ingestion. It may not reflect recent pipeline changes.
Column-level: Not all sources provide column-level lineage. Note when unavailable.

Impact Analysis

Target Entity

Name:
URN:
Platform:
Type:

Impact Summary

Direct dependents (1 hop):
Transitive dependents (all hops):
Depth traced:

Affected Entities

By Type

Type	Count	Entities
Datasets	<!-- n -->	<!-- list -->
Dashboards	<!-- n -->	<!-- list -->
Data Jobs	<!-- n -->	<!-- list -->
Charts	<!-- n -->	<!-- list -->

By Platform

Platform	Count
<!-- platform -->	<!-- n -->

Critical Paths

Entity	Type	Risk
<!-- name -->	<!-- type -->	Single dependency — no alternative source

Lineage Graph

<!-- ASCII flow diagram -->

Affected Owners

Owner	Entities Affected
<!-- owner -->	<!-- count and list -->

Recommendations

Lineage Map

Target Entity

Name:
URN:

Flow Diagram

<!-- ASCII lineage diagram -->

Upstream (Sources)

Hop	Entity	Type	Platform	Relationship
1	<!-- name -->	<!-- type -->	<!-- platform -->	<!-- TRANSFORMED/VIEW/COPY -->

Downstream (Consumers)

Hop	Entity	Type	Platform	Relationship
1	<!-- name -->	<!-- type -->	<!-- platform -->	<!-- type -->

Cross-Platform Boundaries

From	To	Edge
<!-- platform A -->	<!-- platform B -->	<!-- entity A → entity B -->

DataHub CLI Reference

Commands verified against DataHub CLI v1.4.0. Install via pip install acryl-datahub.

Tool Detection

Before running any DataHub commands, determine which tools are available:

MCP tools available — If tools like datahub_search, datahub_get_entity, datahub_get_lineage are in your tool list, use them directly. They are the preferred path — no CLI installation needed.
CLI available — If you have a Bash tool, check: which datahub. If found, use the CLI commands documented below.
Neither — Suggest the user set up a DataHub connection using /datahub-setup.

MCP takes priority over CLI when both are available — MCP tools are purpose-built for agent use with structured inputs/outputs and no shell overhead.

CLI ↔ MCP Equivalents

Operation	CLI Command	MCP Tool
Search	`datahub search "query" --where "..."`	`search(query="...", filter="...")`
Get entity	`datahub get --urn "..." --aspect ownership`	`get_entities(urns=["..."])`
Upstream lineage	`datahub lineage --urn "..." --direction upstream`	`get_lineage(urn="...", upstream=true)`
Downstream lineage	`datahub lineage --urn "..." --direction downstream`	`get_lineage(urn="...", upstream=false)`
GraphQL	`datahub graphql --query '...'`	`execute_graphql(query="...")`
Server config	`datahub check server-config`	Not needed (MCP server handles config)

MCP tool names may be prefixed (e.g. mcp__datahub-cloud__search). Match by the function name suffix, not the full prefixed name. MCP tools are self-documenting — check their schemas for parameter details rather than relying on static documentation.

The rest of this document covers the CLI path.

Authentication

The CLI reads connection settings from ~/.datahubenv:

gms:
  server: "http://localhost:8080"
  token: "<personal-access-token>"

Or via environment variables:

export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_GMS_TOKEN="<token>"

Version Check

Before running commands, check the installed CLI version:

datahub version

If a skill requires a minimum version and the installed version is older, upgrade:

pip install --upgrade acryl-datahub --pre

The --pre flag ensures pre-release versions (e.g. 1.5.0rc1) are included, which may be required for newer features.
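A rough way to compare an installed version against a minimum is sketched below. It assumes the version string contains a dotted numeric release (for example `1.4.0` or `1.5.0rc1`) and deliberately ignores pre-release suffixes, so `1.5.0rc1` and `1.5.0` compare as equal. `release_tuple` and `needs_upgrade` are hypothetical helpers, not CLI features.

```python
import re


def release_tuple(version_text):
    """Extract the first dotted numeric release, e.g. '1.5.0rc1' -> (1, 5, 0)."""
    match = re.search(r"\d+(?:\.\d+)*", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_upgrade(installed, minimum):
    """True when the installed release is older than the required minimum."""
    have, want = release_tuple(installed), release_tuple(minimum)
    width = max(len(have), len(want))
    # Pad with zeros so (1, 5) and (1, 5, 0) compare as equal.
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have < want
```

For strict pre-release ordering, a dedicated version-parsing library would be a better fit than this simplification.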

Server Detection

Detect whether you're connected to DataHub Cloud or OSS:

datahub check server-config

serverEnv: 'cloud' → DataHub Cloud (supports popularity sorting, dataset features)
serverEnv: 'core' or other → OSS / self-hosted (feature fields not available)

Cache this result for the session — don't re-check on every command. Some features marked (Cloud only) below require serverEnv: cloud.
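One way to cache the server type for a session is sketched here. The exact output layout of `datahub check server-config` is an assumption: the parser only looks for a `serverEnv` key followed by a value, quoted or not. `parse_server_env` and `server_env` are hypothetical helper names.

```python
import functools
import re
import subprocess
from typing import Optional


def parse_server_env(config_text: str) -> Optional[str]:
    """Pull the serverEnv value out of the command output, if present."""
    match = re.search(r"serverEnv['\"]?\s*[:=]\s*['\"]?([A-Za-z0-9_-]+)", config_text)
    return match.group(1) if match else None


@functools.lru_cache(maxsize=1)
def server_env() -> Optional[str]:
    """Run the check once; later calls reuse the cached answer."""
    result = subprocess.run(
        ["datahub", "check", "server-config"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return parse_server_env(result.stdout)


def is_cloud(env: Optional[str]) -> bool:
    """Only 'cloud' enables the Cloud-only features; anything else is treated as OSS."""
    return env == "cloud"
```

Note that `lru_cache` also caches a failed lookup (`None`); call `server_env.cache_clear()` to retry after fixing connectivity.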

Context

Pass context on CLI commands using -C key=value so commands can be correlated:

datahub -C skill=datahub-audit search "revenue"
datahub -C skill=datahub-audit -C caller=claude-code get --urn "..."

The -C flag goes on the root datahub command (before the subcommand). Use the skill's own name from its YAML frontmatter as the skill value. If the flag is not recognized, omit it — the command works the same without it.
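When commands are assembled programmatically, the placement rule can be enforced in one place. The helper below is a hypothetical sketch that puts every `-C key=value` pair directly after `datahub` and before the subcommand.

```python
def datahub_argv(subcommand, context=None):
    """Build an argv list with -C pairs placed before the subcommand."""
    argv = ["datahub"]
    for key, value in (context or {}).items():
        argv += ["-C", f"{key}={value}"]
    return argv + list(subcommand)


# Example: correlate a lineage call with this skill's name.
print(datahub_argv(
    ["lineage", "--urn", "<URN>", "--direction", "downstream"],
    {"skill": "datahub-lineage"},
))
```

Calling it with no context yields the plain command, which matches the advice to omit the flag when it is not recognized.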

Search & Discovery

The search CLI uses a positional query argument — not --query.

# Basic keyword search
datahub search "revenue"

# Search with limit
datahub search "customers" --limit 20

# Filter by platform (simple filter)
datahub search "*" --filter platform=snowflake

# Filter by entity type
datahub search "*" --where "entity_type = dataset"

# SQL-like WHERE expressions (recommended for agents)
datahub search "*" --where "platform = snowflake AND env = PROD"
datahub search "*" --where "platform IN (snowflake, bigquery)"
datahub search "*" --where "entity_type = dataset AND platform = snowflake"

# Multiple simple filters (AND between fields, comma = OR within field)
datahub search "*" --filter platform=snowflake --filter env=PROD
datahub search "*" --filter platform=snowflake,bigquery

# Output formats
datahub search "revenue" --table          # Human-readable table
datahub search "revenue" --urns-only      # URNs only, one per line
datahub search "revenue" --format json    # JSON (default)

# Pagination (max 50 per page)
datahub search "customers" --limit 50 --offset 0     # page 1
datahub search "customers" --limit 50 --offset 50    # page 2

# Facets only (counts by type/platform/etc.)
datahub search "*" --facets-only --format json

# Dry run (preview query without executing)
datahub search "revenue" --where "platform = snowflake" --dry-run

# Projection (limit returned fields — reduces token cost)
datahub search "customers" --projection "urn type"

# Column-level search (find datasets containing a specific field)
datahub search "*" --where "entity_type = dataset AND fieldPaths = customer_id"

# Sorting
datahub search "*" --sort-by lastModifiedAt --sort-order desc --limit 10
datahub search "*" --sort-by _entityName --sort-order asc --limit 10

# Popularity / usage sorting (Cloud only — check serverEnv first)
# Most queried datasets
datahub search "*" --where "entity_type = dataset" \
  --sort-by queryCountLast30DaysFeature --sort-order desc --limit 10 \
  --projection "urn type ... on Dataset { properties { name } platform { name } statsSummary { queryCountLast30Days uniqueUserCountLast30Days } }"

# Most updated datasets
datahub search "*" --where "entity_type = dataset" --sort-by writeCountLast30DaysFeature --sort-order desc --limit 10

# Largest tables (by row count or bytes)
datahub search "*" --where "entity_type = dataset" --sort-by rowCountFeature --sort-order desc --limit 10
datahub search "*" --where "entity_type = dataset" --sort-by sizeInBytesFeature --sort-order desc --limit 10

# Existence filters (IS NULL / IS NOT NULL)
datahub search "*" --where "entity_type = dataset AND description IS NULL AND editableDescription IS NULL"
datahub search "*" --where "entity_type = dataset AND glossary_term IS NOT NULL"

# Sibling-aware description audit (single query, no N+1 fetches)
# Step 1: Find datasets missing both ingestion and user-edited descriptions
# Step 2: Project siblings with their descriptions to compute effective coverage
datahub search "*" \
  --where "entity_type = dataset AND platform = snowflake AND description IS NULL AND editableDescription IS NULL" \
  --projection "urn type ... on Dataset { siblings { isPrimary siblings { urn ... on Dataset { properties { name description } editableProperties { description } } } } }" \
  --format json --limit 50

# URN resolution for filters
# Tag, domain, and glossary_term filters require full URNs — not display names.
# Always resolve the name to a URN first, then use the URN in the filter.

# Step 1: Find tag URN by name
datahub search "large table" --where "entity_type = tag" --urns-only --limit 1
# → urn:li:tag:sample_data___default_large_table

# Step 2: Use the URN in a filter
datahub search "*" --where "entity_type = dataset AND tags = 'urn:li:tag:sample_data___default_large_table'"

# Same pattern for domains:
datahub search "ecommerce" --where "entity_type = domain" --urns-only --limit 1
# → urn:li:domain:91994180-...
datahub search "*" --where "entity_type = dataset AND domain = 'urn:li:domain:91994180-...'"

# And glossary terms:
datahub search "PII" --where "entity_type = glossaryTerm" --urns-only --limit 1
datahub search "*" --where "entity_type = dataset AND glossary_term = 'urn:li:glossaryTerm:...'"

# Discover available filters
datahub search list-filters
datahub search describe-filter platform

# Agent best practices
datahub search --agent-context
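The pagination pattern above (50 results per page, advancing `--offset`) can be wrapped in a small loop. This is a sketch under assumptions: `collect_urns` and `run_search_page` are hypothetical helpers, `max_pages` is an arbitrary safety limit, and the loop stops as soon as a page comes back shorter than the page size.

```python
import subprocess

PAGE_SIZE = 50  # per-page maximum stated in the reference above


def run_search_page(query, offset, limit=PAGE_SIZE):
    """Fetch one page of URNs using --urns-only (one URN per line)."""
    result = subprocess.run(
        ["datahub", "search", query, "--urns-only",
         "--limit", str(limit), "--offset", str(offset)],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def collect_urns(query, fetch_page=run_search_page, max_pages=20):
    """Walk pages until one is short, or until max_pages is reached."""
    urns = []
    for page in range(max_pages):
        batch = fetch_page(query, page * PAGE_SIZE)
        urns.extend(batch)
        if len(batch) < PAGE_SIZE:
            break
    return urns
```

The `fetch_page` parameter exists so the loop can be exercised without a live DataHub connection.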

Entity Retrieval

# Get full entity metadata
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,table_name,PROD)"

# Get specific aspect
datahub get --urn "<URN>" --aspect schemaMetadata
datahub get --urn "<URN>" --aspect ownership
datahub get --urn "<URN>" --aspect globalTags

Lineage

# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream

# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream

# Limit to immediate neighbors
datahub lineage --urn "<URN>" --direction upstream --hops 1

# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream

# JSON output (includes metadata with capped/hint info)
datahub lineage --urn "<URN>" --direction downstream --format json

# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"

# Agent best practices
datahub lineage --agent-context

Timeline (Change History)

# Schema changes
datahub timeline --urn "<URN>" --category technical_schema

# Ownership changes
datahub timeline --urn "<URN>" --category owner

# Tag changes
datahub timeline --urn "<URN>" --category tag

# With time range
datahub timeline --urn "<URN>" --category technical_schema --start 7daysago

Categories: tag, glossary_term, technical_schema, documentation, owner

Write Operations (via GraphQL Mutations)

Write operations use datahub graphql --query 'mutation { ... }'. The CLI does not have dedicated tag, glossary, or inline put commands for these operations.

Important rules for GraphQL mutations:

Return field subselections required. Mutations returning objects (not scalars like Boolean) need { urn } or similar after the mutation. Without it: SubselectionRequired error.
Long queries must use temp files. Long inline --query strings get misinterpreted as file paths on macOS (File name too long). Write to a .graphql file and pass the path: datahub graphql --query /tmp/my-mutation.graphql --format json.
Short mutations can be inline. Simple mutations like addTag, removeTag, addOwner are short enough to pass inline.
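The inline-versus-file rule could be automated roughly as follows. The 200-character threshold is a made-up value for illustration, since the reference does not say where "long" begins, and `run_graphql` is a hypothetical wrapper rather than part of the CLI.

```python
import os
import subprocess
import tempfile

INLINE_LIMIT = 200  # hypothetical cutoff; tune to whatever is safe on your platform


def run_graphql(query, run=subprocess.run):
    """Pass short queries inline and long ones through a temporary .graphql file."""
    if len(query) <= INLINE_LIMIT:
        return run(
            ["datahub", "graphql", "--query", query, "--format", "json"],
            capture_output=True, text=True,
        )
    with tempfile.NamedTemporaryFile("w", suffix=".graphql", delete=False) as handle:
        handle.write(query)
        path = handle.name
    try:
        return run(
            ["datahub", "graphql", "--query", path, "--format", "json"],
            capture_output=True, text=True,
        )
    finally:
        # Remove the temp file whether or not the command succeeded.
        os.unlink(path)
```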

Glossary Terms

# Add term to entity
datahub graphql --query 'mutation {
  addTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Remove term
datahub graphql --query 'mutation {
  removeTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json

Ownership

# Add owner (appends — does not replace existing owners)
datahub graphql --query 'mutation {
  addOwner(input: {
    ownerUrn: "urn:li:corpuser:<USER>",
    resourceUrn: "<ENTITY_URN>",
    ownerEntityType: CORP_USER,
    type: TECHNICAL_OWNER
  })
}' --format json

# Remove owner
datahub graphql --query 'mutation {
  removeOwner(input: { ownerUrn: "urn:li:corpuser:<USER>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Batch add owners
datahub graphql --query 'mutation {
  batchAddOwners(input: {
    owners: [{ ownerUrn: "urn:li:corpuser:<USER>", ownerEntityType: CORP_USER }],
    resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
  })
}' --format json

Owner types: TECHNICAL_OWNER, BUSINESS_OWNER, DATA_STEWARD, NONE

Deprecation

# Deprecate
datahub graphql --query 'mutation {
  updateDeprecation(input: { urn: "<URN>", deprecated: true, note: "Replaced by new_table" })
}' --format json

# Un-deprecate
datahub graphql --query 'mutation {
  updateDeprecation(input: { urn: "<URN>", deprecated: false })
}' --format json

Domains

# Create domain
datahub graphql --query 'mutation {
  createDomain(input: { name: "Marketing", description: "Marketing data" })
}' --format json

# Assign entity to domain (domain must exist)
datahub graphql --query 'mutation {
  setDomain(entityUrn: "<ENTITY_URN>", domainUrn: "urn:li:domain:<DOMAIN_ID>")
}' --format json

# Remove from domain
datahub graphql --query 'mutation {
  unsetDomain(entityUrn: "<ENTITY_URN>")
}' --format json

# Batch assign
datahub graphql --query 'mutation {
  batchSetDomain(input: {
    domainUrn: "urn:li:domain:<ID>",
    resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
  })
}' --format json

Description

datahub graphql --query 'mutation {
  updateDescription(input: {
    description: "New description text",
    resourceUrn: "<ENTITY_URN>"
  })
}' --format json

Data Products

Note: domainUrn is required — every data product must belong to a domain. Use datahub graphql --describe createDataProduct --recurse to verify the schema.

# Create (domainUrn is REQUIRED)
datahub graphql --query 'mutation {
  createDataProduct(input: {
    domainUrn: "urn:li:domain:<DOMAIN_ID>",
    properties: { name: "Revenue Analytics", description: "Revenue pipeline" }
  }) { urn }
}' --format json

# Add assets to data product
datahub graphql --query 'mutation {
  batchSetDataProduct(input: {
    dataProductUrn: "urn:li:dataProduct:<ID>",
    resourceUrns: ["<URN1>", "<URN2>"]
  })
}' --format json

Verification & Health

# Check CLI version
datahub version

# Verify connectivity (this entity always exists)
datahub get --urn "urn:li:corpuser:datahub"

# Test search (confirms search index works)
datahub search "*" --limit 1

# Server configuration
datahub check server-config

Note: datahub check server-health does not exist. Use datahub get --urn "urn:li:corpuser:datahub" to verify connectivity.

GraphQL Discovery

# List all available operations
datahub graphql --list-operations --format json

# List mutations only
datahub graphql --list-mutations --format json

# Describe a specific operation
datahub graphql --describe addTag --format json

# Describe with full type expansion
datahub graphql --describe addTag --recurse --format json

# Dry run (preview without executing)
datahub graphql --query '{ me { corpUser { urn } } }' --dry-run

# Agent best practices
datahub graphql --agent-context

Batch Mutation Pattern (Python)

Shell loops with dataset URNs are fragile due to quoting issues with parentheses. For multi-entity mutations, use a Python script with temp files:

import subprocess, json, tempfile, os

def run_graphql_mutation(query, variables):
    """Run a GraphQL mutation with variables via temp file. Returns parsed JSON or None."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
        json.dump(variables, f)
        vf = f.name
    try:
        result = subprocess.run(
            ["datahub", "graphql", "-q", query, "-v", vf, "--format", "json", "--no-pretty"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            print(f"ERROR: {result.stderr.strip()[:120]}")
            return None
    finally:
        os.unlink(vf)

# Example: batch update descriptions
query = "mutation updateDataset($urn: String!, $input: DatasetUpdateInput!) { updateDataset(urn: $urn, input: $input) { urn } }"

datasets = {
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)": "Description for table1",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)": "Description for table2",
}

for urn, desc in datasets.items():
    variables = {"urn": urn, "input": {"editableProperties": {"description": desc}}}
    result = run_graphql_mutation(query, variables)
    status = "OK" if result else "FAIL"
    print(f"  {urn.split(',')[1]}: {status}")

Output Processing

# Pipe search URNs to get for batch retrieval
datahub search "customers" --urns-only | xargs -I{} datahub get --urn {}

# Extract field names from schema
datahub get --urn "<URN>" --aspect schemaMetadata | python3 -c "
import sys, json
data = json.load(sys.stdin)
for f in data.get('schemaMetadata', {}).get('fields', []):
    print(f['fieldPath'])
"


Input options

DataHub entity

A dataset, chart, dashboard, pipeline, or URN.

Example

Example input

Planned change: rename analytics.orders.discount_code to promo_code.

DataHub asset: Snowflake table analytics.orders.
Deploy target: Friday.

Concern:
- Finance close reports may use this table.
- Revenue dashboard may use this column.
- Customer cohort notebook may read from the downstream model.

Need: affected assets, owners, risk level, and who to notify before deploy.

What the skill returns

How it reads the request

The skill treats the table like a dependency map: it checks what feeds analytics.orders and what depends on it downstream.

Affected assets

Revenue dashboard, weekly bookings export, finance close model, and customer cohort notebook depend on this table or downstream models.

Highest risk

The finance close model uses discount_code directly and has no fallback. Changing the column before month close could break close reporting.

Owners to notify

Finance Analytics owns the close model, RevOps Analytics owns the bookings export, and Growth owns the cohort notebook.

Recommended rollout

Add promo_code first, keep discount_code for one release, notify owners, and remove the old column only after downstream teams confirm migration.

Metrics improved

Lineage Coverage

+25-50%

Product & Engineering

Data Freshness

Faster impact review

Product & Engineering

Metric Trust

+10-20%

Product & Engineering

Supported tools

DataHub

Manual

Primary system for lineage graph traversal, entity lookup, ownership, and metadata enrichment.

Snowflake

Manual

Common source or target platform in DataHub lineage graphs.

SQL

Manual

Use SQL context to explain transformations and lineage paths.


DataHub Lineage

Multi-Agent Compatibility

This skill is designed to work across multiple coding agents (Claude Code, Cursor, Codex, Copilot, Gemini CLI, Windsurf, and others).

What works everywhere:

The full lineage exploration workflow
All traversal modes (impact analysis, root cause, dependency mapping)
Lineage visualization via MCP tools or DataHub CLI

Claude Code-specific features (other agents can safely ignore these):

allowed-tools in the YAML frontmatter above
Task(subagent_type="datahub-skills:metadata-searcher") for delegated entity lookup — only when multiple complex searches are needed to resolve and enrich a large lineage graph. For simple entity lookups, execute inline. Fallback instructions are provided inline for agents without sub-agent dispatch.

Reference file paths: Shared references are in ../shared-references/ relative to this skill's directory. Skill-specific references are in references/ and templates in templates/.

Not This Skill

If the user wants to...	Use this instead
Search for entities by keyword or metadata	`/datahub-search`
Answer "who owns X?" or "what is X?"	`/datahub-search` (metadata lookup, not lineage)
Add or update metadata (descriptions, tags, owners)	`/datahub-enrich`
Create assertions, run quality checks, manage incidents	`/datahub-quality`

Step 1: Identify Target Entity

Find the entity the user wants to trace.

If the user provides a URN, use it directly
If they provide a name, search for it: datahub search "<name>" --where "entity_type = dataset" --limit 5
If multiple matches, present options and ask the user to choose
Confirm: show entity name, URN, platform, type

Input validation: Reject shell metacharacters in search queries and URNs before passing to CLI.
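A minimal validation sketch is shown below. The exact character set is an assumption, since the skill does not enumerate which metacharacters to reject. Parentheses need special handling because dataset URNs legitimately contain them, so this sketch allows them only in values that start with `urn:li:`. `validate_cli_input` is a hypothetical helper.

```python
# Assumption: one reasonable reading of "shell metacharacters".
FORBIDDEN_CHARS = set(";&|`$<>\\\"'\n\r")


def validate_cli_input(value):
    """Return the value unchanged if it looks safe, otherwise raise ValueError."""
    if not value:
        raise ValueError("empty input")
    if any(ch in FORBIDDEN_CHARS for ch in value):
        raise ValueError(f"shell metacharacter in input: {value!r}")
    if ("(" in value or ")" in value) and not value.startswith("urn:li:"):
        raise ValueError(f"parentheses are only accepted inside URNs: {value!r}")
    return value
```

Passing arguments to `subprocess.run` as a list (no `shell=True`) is a useful second layer, because the shell never interprets the value.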

Step 2: Determine Traversal Mode

Traversal modes

Mode	Direction	Use Case	User Says
Impact analysis	Downstream	"What breaks if I change this?"	"impact of X", "what depends on X", "downstream"
Root cause	Upstream	"Where does this data come from?"	"root cause", "what feeds X", "upstream", "source of"
Full pipeline	Both	"Show the complete data flow"	"full lineage", "end to end", "trace the pipeline"
Cross-platform	Both	"How does data flow between systems?"	"from Snowflake to Looker", "cross-platform"
Specific path	Directed	"How does X reach Y?"	"path from X to Y", "how does X connect to Y"
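The table can be read as a phrase-to-mode lookup. The sketch below uses the quoted trigger phrases literally (the platform-specific example "from Snowflake to Looker" is left out because it does not generalize) and returns `None` when nothing matches, in which case the agent should ask the user. The mode identifiers and the check order are illustrative choices, not part of the skill.

```python
# Checked in order, so the more specific modes win over plain direction words.
MODE_PHRASES = [
    ("specific_path", ("path from", "connect to")),
    ("cross_platform", ("cross-platform",)),
    ("full_pipeline", ("full lineage", "end to end", "trace the pipeline")),
    ("impact_analysis", ("impact of", "what depends on", "downstream")),
    ("root_cause", ("root cause", "what feeds", "upstream", "source of")),
]


def choose_mode(request):
    """Return a traversal mode for the request, or None if it is ambiguous."""
    text = request.lower()
    for mode, phrases in MODE_PHRASES:
        if any(phrase in text for phrase in phrases):
            return mode
    return None
```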

Depth configuration

Depth	When to Use
1 hop	Default — immediate upstream/downstream
2-3 hops	User asks for "full" lineage or cross-platform tracing
More than 3 hops	Only with user confirmation; results grow exponentially

Ask about depth if the user doesn't specify: "How many hops should I trace? (default: 1, or specify 'full')"
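The depth rules can be condensed into a small helper. This sketch maps a missing answer to 1 hop and "full" to 3 hops, and flags anything deeper than 3 for confirmation; treating "full" as exactly 3 is an interpretation of the table, and `plan_depth` is a hypothetical name.

```python
def plan_depth(requested=None):
    """Return (hops, needs_confirmation) for a user's depth request."""
    if requested is None:
        return 1, False          # default: immediate neighbors only
    if requested == "full":
        return 3, False          # assumption: "full" maps to the 3-hop tier
    hops = int(requested)
    if hops < 1:
        raise ValueError("depth must be at least 1 hop")
    return hops, hops > 3        # deeper than 3 hops requires confirmation
```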

Step 3: Execute Lineage Queries

Choosing your tool: MCP vs. CLI

	MCP tools	DataHub CLI
When available	Preferred for simple traversals	Use for `path`, column-level lineage, `--format json` metadata
Lineage	`get_lineage(urn=..., direction=..., depth=...)`	`datahub lineage --urn "..." --direction upstream`
Enrich results	`get_entities(urns=[...])`	`datahub search "*" --where 'urn IN (...)'` with `--projection`

Using the `datahub lineage` CLI command

# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream

# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream

# Limit depth
datahub lineage --urn "<URN>" --direction downstream --hops 1

# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream

# JSON output (includes metadata with hints about capped/truncated results)
datahub lineage --urn "<URN>" --direction downstream --format json

# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"

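Defaults: --hops 3 (full transitive lineage), --count 100. Because the skill's own default depth is 1 hop, pass --hops 1 explicitly when only immediate neighbors are wanted. Increase --count if the summary indicates results were capped.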

Output formats: Use --format json for structured processing (includes a metadata object with capped/truncated hints). Default table output is best for quick display to the user.

What lineage returns vs. what needs follow-up

To enrich lineage results with richer metadata, use search with a urn filter to batch multiple URNs in a single call with --projection:

# Batch-enrich lineage results — quote URNs (they contain parentheses and commas)
datahub search "*" \
  --where 'urn IN ("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)", "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)")' \
  --projection "urn type
    ... on Dataset { properties { name description } platform { name }
      ownership { owners { owner type } }
      siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } } }
    }"

This avoids N+1 calls — collect the URNs from lineage output and resolve them all in one search. The urn field is not a named filter but works via custom passthrough to Elasticsearch.

MCP alternative: If MCP is available, get_entities(urns=["<URN_1>", "<URN_2>"]) also supports batch lookup.
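Building the `urn IN (...)` expression by hand is error-prone, so a small helper can do the quoting. This sketch only constructs the argument list; it assumes URNs never contain double quotes and rejects any that do. `urn_in_clause` and `enrich_argv` are hypothetical names.

```python
def urn_in_clause(urns):
    """Build a where expression such as: urn IN ("<urn1>", "<urn2>")."""
    unique = list(dict.fromkeys(urns))  # drop duplicates, keep order
    if not unique:
        raise ValueError("at least one URN is required")
    for urn in unique:
        if '"' in urn:
            raise ValueError(f"unexpected double quote in URN: {urn!r}")
    quoted = ", ".join(f'"{urn}"' for urn in unique)
    return f"urn IN ({quoted})"


def enrich_argv(urns, projection):
    """Argument list for a single batch-enrichment search call."""
    return ["datahub", "search", "*",
            "--where", urn_in_clause(urns),
            "--projection", projection]
```

Because the result is an argument list rather than a shell string, the parentheses and commas inside URNs need no extra escaping when it is passed to `subprocess.run`.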

Siblings in lineage results

Specific path tracing

Use the CLI command first:

datahub lineage path --from "<URN_A>" --to "<URN_B>"

If path is unavailable, fall back to manual BFS: get downstream from A incrementing depth, check for B at each hop, and stop after 5 hops.
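The manual fallback amounts to a breadth-first search with a hop limit. In this sketch, `get_downstream` is a caller-supplied function that returns the immediate (1-hop) downstream URNs of an entity, for example by wrapping `datahub lineage --direction downstream --hops 1`; that wrapper, and the name `find_path`, are assumptions for illustration.

```python
def find_path(start, target, get_downstream, max_hops=5):
    """Return the list of URNs from start to target, or None if not found."""
    if start == target:
        return [start]
    parents = {start: None}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for urn in frontier:
            for child in get_downstream(urn):
                if child in parents:
                    continue  # already reached by a path that is no longer
                parents[child] = urn
                if child == target:
                    # Walk the parent links back to the start.
                    path = [child]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                next_frontier.append(child)
        if not next_frontier:
            return None  # graph exhausted before reaching the target
        frontier = next_frontier
    return None  # hop limit reached
```

Expanding one hop at a time means each entity is queried at most once, and the first path found is one with the fewest hops.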

Step 4: Visualize Lineage

ASCII flow diagram

For simple lineage (up to ~10 entities):

[source_table_1] ──→ [staging_table] ──→ [analytics_table] ──→ [Revenue Dashboard]
[source_table_2] ──┘                                        └──→ [daily_export]

Structured list

For larger or more complex lineage:

### Upstream (sources for analytics_table)

| Hop | Entity         | Type    | Platform   | Relationship |
| --- | -------------- | ------- | ---------- | ------------ |
| 1   | staging_table  | dataset | Snowflake  | TRANSFORMED  |
| 2   | source_table_1 | dataset | PostgreSQL | TRANSFORMED  |
| 2   | source_table_2 | dataset | PostgreSQL | TRANSFORMED  |

### Downstream (consumers of analytics_table)

| Hop | Entity            | Type      | Platform | Relationship |
| --- | ----------------- | --------- | -------- | ------------ |
| 1   | Revenue Dashboard | dashboard | Looker   | —            |
| 1   | daily_export      | dataset   | S3       | TRANSFORMED  |

Impact analysis format

For impact analysis, group by entity type, identify critical paths (single-dependency chains), and list affected owners. See templates/impact-analysis.template.md for the full template.
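As a rough illustration of the grouping step, the sketch below takes enriched lineage results and groups them by entity type while collecting the distinct owners. The record keys (`name`, `type`, `owners`) are hypothetical, and critical-path detection is not covered here.

```python
from collections import defaultdict


def summarize_impact(entities):
    """Group entity names by type and list the distinct owners to notify."""
    by_type = defaultdict(list)
    owners = set()
    for entity in entities:
        by_type[entity["type"]].append(entity["name"])
        owners.update(entity.get("owners", []))
    return {
        "by_type": {kind: sorted(names) for kind, names in sorted(by_type.items())},
        "owners": sorted(owners),
    }
```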

Cross-platform view

Group by platform when lineage crosses systems:

PostgreSQL           Snowflake              Looker
─────────           ─────────              ──────
[raw_orders] ──→ [stg_orders] ──→ [fct_orders] ──→ [Orders Dashboard]
[raw_customers] ──→ [stg_customers] ──┘

Suggesting Next Steps

After presenting lineage:

"Want to see metadata details for any of these?" → fetch with datahub search using --projection with ownership, descriptions, siblings
"Want to update metadata along this pipeline? Use /datahub-enrich"
"Want to run an impact audit? Use /datahub-audit"

Reference Documents

Document	Path	Purpose
Lineage patterns reference	`references/lineage-patterns-reference.md`	Traversal strategies and patterns
Impact analysis template	`templates/impact-analysis.template.md`	Impact analysis report template
Lineage map template	`templates/lineage-map.template.md`	Lineage visualization template
CLI reference (shared)	`../shared-references/datahub-cli-reference.md`	CLI commands

Common Mistakes

Using datahub get --aspect upstreamLineage instead of datahub lineage. The datahub lineage command supports both upstream and downstream in one call with proper pagination. Use it instead of the raw aspect fetch.
Showing only URNs. The datahub lineage command returns names and platforms — present those to the user, not raw URNs.
Answering metadata questions instead of tracing. "Who owns X?" is a Search question, not a Lineage question. Lineage is for relationships between entities, not entity properties.

Red Flags

User input contains shell metacharacters → reject, do not pass to CLI.
Traversal depth > 3 hops → confirm with user before proceeding.
Lineage returns 0 edges → entity may not have lineage ingested. Note this rather than saying "no dependencies."
User asks about metadata, not lineage ("who owns X?", "add a tag") → redirect to /datahub-search or /datahub-enrich.

URN Parsing

Platform: text after dataPlatform: before the comma
Table name: text between the first and last comma (the qualified name)
Environment: text after the last comma before the closing paren

For dashboard/chart URNs: urn:li:<type>:(<platform>,<id>).

Present lineage results using names extracted from URNs directly. Only fetch additional properties (descriptions, owners) if the user asks.
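The three dataset rules above translate directly into string slicing. This sketch handles only dataset URNs of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`; `parse_dataset_urn` is a hypothetical helper name.

```python
def parse_dataset_urn(urn):
    """Split a dataset URN into platform, qualified name, and environment."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn!r}")
    body = urn[len(prefix):-1]
    first, last = body.find(","), body.rfind(",")
    if first == -1 or first == last:
        raise ValueError(f"expected platform, name, and environment: {urn!r}")
    marker = "dataPlatform:"
    platform_part = body[:first]
    if marker not in platform_part:
        raise ValueError(f"missing platform in URN: {urn!r}")
    return {
        "platform": platform_part.split(marker, 1)[1],  # after dataPlatform:
        "name": body[first + 1:last],                   # between first and last comma
        "env": body[last + 1:],                         # after the last comma
    }


print(parse_dataset_urn(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)"
))
```

Using the first and last comma as boundaries keeps a qualified name intact even if it contains a comma itself.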

Remember

Show the flow visually. ASCII diagrams are more intuitive than tables for small graphs.
Check siblings. Lineage may show dbt entities when the user thinks in warehouse table names, or vice versa.
Enrich when asked. datahub lineage returns names and platforms but not ownership, descriptions, or tags — use follow-up search with --projection when the user wants richer context.
Check for capped results. If the summary indicates truncation, increase --count.

Reference documents

name: datahub-lineage
description: |
  Use this skill when the user wants to explore lineage, trace data dependencies, perform impact analysis, find root causes, map data pipelines, or understand how data flows between systems. Triggers on: "what feeds into X", "what depends on X", "show lineage for X", "impact analysis", "trace the pipeline", "root cause", "upstream of X", "downstream of X", or any request involving data lineage and dependency tracking.
user-invocable: true
min-cli-version: 1.5.0.1rc1
allowed-tools: Bash(datahub *)

DataHub Lineage

Multi-Agent Compatibility

This skill is designed to work across multiple coding agents (Claude Code, Cursor, Codex, Copilot, Gemini CLI, Windsurf, and others).

What works everywhere:

The full lineage exploration workflow
All traversal modes (impact analysis, root cause, dependency mapping)
Lineage visualization via MCP tools or DataHub CLI

Claude Code-specific features (other agents can safely ignore these):

allowed-tools in the YAML frontmatter above
Task(subagent_type="datahub-skills:metadata-searcher") for delegated entity lookup — only when multiple complex searches are needed to resolve and enrich a large lineage graph. For simple entity lookups, execute inline. Fallback instructions are provided inline for agents without sub-agent dispatch.

Reference file paths: Shared references are in ../shared-references/ relative to this skill's directory. Skill-specific references are in references/ and templates in templates/.

Not This Skill

If the user wants to...	Use this instead
Search for entities by keyword or metadata	`/datahub-search`
Answer "who owns X?" or "what is X?"	`/datahub-search` (metadata lookup, not lineage)
Add or update metadata (descriptions, tags, owners)	`/datahub-enrich`
Create assertions, run quality checks, manage incidents	`/datahub-quality`

Step 1: Identify Target Entity

Find the entity the user wants to trace.

If the user provides a URN, use it directly
If they provide a name, search for it: datahub search "<name>" --where "entity_type = dataset" --limit 5
If multiple matches, present options and ask the user to choose
Confirm: show entity name, URN, platform, type

Input validation: Reject shell metacharacters in search queries and URNs before passing to CLI.

Step 2: Determine Traversal Mode

Traversal modes

Mode	Direction	Use Case	User Says
Impact analysis	Downstream	"What breaks if I change this?"	"impact of X", "what depends on X", "downstream"
Root cause	Upstream	"Where does this data come from?"	"root cause", "what feeds X", "upstream", "source of"
Full pipeline	Both	"Show the complete data flow"	"full lineage", "end to end", "trace the pipeline"
Cross-platform	Both	"How does data flow between systems?"	"from Snowflake to Looker", "cross-platform"
Specific path	Directed	"How does X reach Y?"	"path from X to Y", "how does X connect to Y"

Depth configuration

Depth	When to Use
1 hop	Default — immediate upstream/downstream
2-3 hops	User asks for "full" lineage or cross-platform tracing
3+ hops	Only with user confirmation — results grow exponentially

Ask about depth if the user doesn't specify: "How many hops should I trace? (default: 1, or specify 'full')"

Step 3: Execute Lineage Queries

Choosing your tool: MCP vs. CLI

	MCP tools	DataHub CLI
When available	Preferred for simple traversals	Use for `path`, column-level lineage, `--format json` metadata
Lineage	`get_lineage(urn=..., direction=..., depth=...)`	`datahub lineage --urn "..." --direction upstream`
Enrich results	`get_entities(urns=[...])`	`datahub search "*" --where 'urn IN (...)'` with `--projection`

Using the `datahub lineage` CLI command

# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream

# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream

# Limit depth
datahub lineage --urn "<URN>" --direction downstream --hops 1

# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream

# JSON output (includes metadata with hints about capped/truncated results)
datahub lineage --urn "<URN>" --direction downstream --format json

# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"

Defaults: --hops 3 (full transitive lineage), --count 100. Increase --count if the summary indicates results were capped.

Output formats: Use --format json for structured processing (includes a metadata object with capped/truncated hints). Default table output is best for quick display to the user.

What lineage returns vs. what needs follow-up

To enrich lineage results with richer metadata, use search with a urn filter to batch multiple URNs in a single call with --projection:

# Batch-enrich lineage results — quote URNs (they contain parentheses and commas)
datahub search "*" \
  --where 'urn IN ("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)", "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)")' \
  --projection "urn type
    ... on Dataset { properties { name description } platform { name }
      ownership { owners { owner type } }
      siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } } }
    }"

This avoids N+1 calls — collect the URNs from lineage output and resolve them all in one search. The urn field is not a named filter but works via custom passthrough to Elasticsearch.

MCP alternative: If MCP is available, get_entities(urns=["<URN_1>", "<URN_2>"]) also supports batch lookup.

Siblings in lineage results

Specific path tracing

Use the CLI command first:

datahub lineage path --from "<URN_A>" --to "<URN_B>"

If path is unavailable, fall back to a manual breadth-first search (BFS): fetch downstream lineage from A one hop at a time, check for B at each hop, and stop after 5 hops.
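The manual fallback can be sketched as a bounded breadth-first search. This is an illustration under assumptions: `get_downstream` is a hypothetical callable that returns the one-hop downstream URNs of an entity (for example, a wrapper around a `--hops 1` lineage call), and the toy graph stands in for real lineage data.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def find_path(start: str, target: str,
              get_downstream: Callable[[str], Iterable[str]],
              max_hops: int = 5) -> Optional[list]:
    """Return the shortest downstream path from start to target, or None."""
    if start == target:
        return [start]
    parents = {start: None}          # child -> parent, also serves as the visited set
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue                 # hop limit reached, do not expand further
        for child in get_downstream(node):
            if child in parents:
                continue
            parents[child] = node
            if child == target:
                path = [child]       # walk parent links back to the start
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append((child, hops + 1))
    return None


# Toy graph standing in for real one-hop lineage lookups
graph = {
    "raw_events": ["stg_events"],
    "stg_events": ["fct_events"],
    "fct_events": ["analytics_dashboard"],
}
path = find_path("raw_events", "analytics_dashboard", lambda urn: graph.get(urn, []))
print(" -> ".join(path))
```

A `None` result only means no path was found within the hop limit, so it is worth saying so rather than reporting that the entities are unconnected.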

Step 4: Visualize Lineage

ASCII flow diagram

For simple lineage (up to ~10 entities):

[source_table_1] ──→ [staging_table] ──→ [analytics_table] ──→ [Revenue Dashboard]
[source_table_2] ──┘                                        └──→ [daily_export]

Structured list

For larger or more complex lineage:

### Upstream (sources for analytics_table)

| Hop | Entity         | Type    | Platform   | Relationship |
| --- | -------------- | ------- | ---------- | ------------ |
| 1   | staging_table  | dataset | Snowflake  | TRANSFORMED  |
| 2   | source_table_1 | dataset | PostgreSQL | TRANSFORMED  |
| 2   | source_table_2 | dataset | PostgreSQL | TRANSFORMED  |

### Downstream (consumers of analytics_table)

| Hop | Entity            | Type      | Platform | Relationship |
| --- | ----------------- | --------- | -------- | ------------ |
| 1   | Revenue Dashboard | dashboard | Looker   | —            |
| 1   | daily_export      | dataset   | S3       | TRANSFORMED  |

Impact analysis format

For impact analysis, group by entity type, identify critical paths (single-dependency chains), and list affected owners. See templates/impact-analysis.template.md for the full template.

Cross-platform view

Group by platform when lineage crosses systems:

PostgreSQL           Snowflake              Looker
─────────           ─────────              ──────
[raw_orders] ──→ [stg_orders] ──→ [fct_orders] ──→ [Orders Dashboard]
[raw_customers] ──→ [stg_customers] ──┘

Suggesting Next Steps

After presenting lineage:

"Want to see metadata details for any of these?" → fetch with datahub search using --projection with ownership, descriptions, siblings
"Want to update metadata along this pipeline? Use /datahub-enrich"
"Want to run an impact audit? Use /datahub-audit"

Reference Documents

Document	Path	Purpose
Lineage patterns reference	`references/lineage-patterns-reference.md`	Traversal strategies and patterns
Impact analysis template	`templates/impact-analysis.template.md`	Impact analysis report template
Lineage map template	`templates/lineage-map.template.md`	Lineage visualization template
CLI reference (shared)	`../shared-references/datahub-cli-reference.md`	CLI commands

Common Mistakes

Using datahub get --aspect upstreamLineage instead of datahub lineage. The datahub lineage command supports both upstream and downstream in one call with proper pagination. Use it instead of the raw aspect fetch.
Showing only URNs. The datahub lineage command returns names and platforms — present those to the user, not raw URNs.
Answering metadata questions instead of tracing. "Who owns X?" is a Search question, not a Lineage question. Lineage is for relationships between entities, not entity properties.

Red Flags

User input contains shell metacharacters → reject, do not pass to CLI.
Traversal depth > 3 hops → confirm with user before proceeding.
Lineage returns 0 edges → entity may not have lineage ingested. Note this rather than saying "no dependencies."
User asks about metadata, not lineage ("who owns X?", "add a tag") → redirect to /datahub-search or /datahub-enrich.
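The first red flag can be sketched as a check that runs before any CLI call is built. The deny-list below is an assumption, not an official list: parentheses and commas are deliberately left out because DataHub URNs contain them. The helper name `build_lineage_argv` is hypothetical; the flags it emits are the ones shown earlier.

```python
# Assumed deny-list of characters a shell would interpret
SHELL_METACHARACTERS = set(";&|`$<>\\\n\r'\"")


def has_shell_metacharacters(text):
    """True if the text contains any character from the deny-list."""
    return any(ch in SHELL_METACHARACTERS for ch in text)


def build_lineage_argv(urn, direction):
    """Validate inputs, then return an argument list (never a shell string)."""
    if has_shell_metacharacters(urn):
        raise ValueError("input contains shell metacharacters")
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    return ["datahub", "lineage", "--urn", urn, "--direction", direction]


argv = build_lineage_argv(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)",
    "downstream",
)
print(argv)
```

Running the list with `subprocess.run(argv)` and no `shell=True` means the URN is never parsed by a shell, which is a second layer of protection on top of the rejection.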

URN Parsing

Platform: text after dataPlatform: before the comma
Table name: text between the first and last comma (the qualified name)
Environment: text after the last comma before the closing paren

For dashboard/chart URNs: urn:li:<type>:(<platform>,<id>).

Present lineage results using names extracted from URNs directly. Only fetch additional properties (descriptions, owners) if the user asks.
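The parsing rules above can be sketched as a small function. The name `parse_urn` and the returned dictionary keys are hypothetical, and the dashboard URN in the example is made up for illustration.

```python
def parse_urn(urn):
    """Split a tuple-style DataHub URN into type, platform, name, and env."""
    prefix, sep, body = urn.partition(":(")
    if not sep or not body.endswith(")"):
        raise ValueError(f"not a tuple-style URN: {urn}")
    entity_type = prefix.rsplit(":", 1)[-1]
    body = body[:-1]                      # drop the closing paren
    first = body.find(",")
    if first == -1:
        raise ValueError(f"no comma found in URN body: {urn}")
    # Platform: text after "dataPlatform:" (if present) and before the first comma
    platform = body[:first].rsplit("dataPlatform:", 1)[-1]
    if entity_type == "dataset":
        last = body.rfind(",")
        if last == first:
            raise ValueError(f"dataset URN is missing an environment: {urn}")
        # Name sits between the first and last comma, env after the last comma
        return {"type": entity_type, "platform": platform,
                "name": body[first + 1:last], "env": body[last + 1:]}
    # Dashboard/chart form: (<platform>,<id>)
    return {"type": entity_type, "platform": platform,
            "name": body[first + 1:], "env": None}


dataset = parse_urn("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)")
dashboard = parse_urn("urn:li:dashboard:(looker,dashboards.42)")  # made-up example ID
print(dataset["platform"], dataset["name"], dataset["env"])
```

Using the first and last comma, rather than splitting on every comma, keeps qualified names that themselves contain commas intact.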

Remember

Show the flow visually. ASCII diagrams are more intuitive than tables for small graphs.
Check siblings. Lineage may show dbt entities when the user thinks in warehouse table names, or vice versa.
Enrich when asked. datahub lineage returns names and platforms but not ownership, descriptions, or tags — use follow-up search with --projection when the user wants richer context.
Check for capped results. If the summary indicates truncation, increase --count.

DataHub Lineage

Explore lineage, trace data dependencies, and perform impact analysis using DataHub's lineage graph.

What it does

Identifies the target entity
Determines traversal direction and depth
Executes lineage queries via MCP tools or CLI
Visualizes the lineage graph with ASCII flow diagrams

Capabilities

Impact analysis — What breaks if I change this table?
Root cause — Where does this data come from?
Full pipeline — End-to-end data flow mapping
Cross-platform — Trace data across Snowflake, dbt, Looker, etc.
Path finding — How does entity A connect to entity B?

Usage

/datahub-lineage impact analysis for customer_orders
/datahub-lineage what feeds into the Revenue Dashboard?
/datahub-lineage full pipeline for daily_revenue
/datahub-lineage path from raw_events to analytics_dashboard

Lineage Patterns Reference

Common lineage traversal strategies and patterns.

Traversal Strategies

Impact Analysis (Downstream)

Goal: Determine what breaks if an entity changes.

Strategy:

Get all downstream entities (start with depth 1, expand as needed)
Classify by type (datasets, dashboards, jobs)
Identify critical paths (entities with single upstream dependency)
List affected owners for notification

Key question: "Which downstream entities have no alternative data source?"
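A minimal sketch of the classification steps, assuming the downstream entities have already been fetched and enriched. The record shape (`name`, `type`, `owners`, `upstream_count`) is hypothetical and is not the CLI's output format.

```python
from collections import defaultdict


def summarize_impact(entities):
    """Group downstream entities by type and owner, and flag single-upstream ones."""
    by_type = defaultdict(list)
    by_owner = defaultdict(list)
    critical = []
    for entity in entities:
        by_type[entity["type"]].append(entity["name"])
        # Exactly one upstream dependency means no alternative data source
        if entity.get("upstream_count") == 1:
            critical.append(entity["name"])
        for owner in entity.get("owners", []):
            by_owner[owner].append(entity["name"])
    return {"by_type": dict(by_type), "critical": critical, "by_owner": dict(by_owner)}


# Illustrative records only
downstream = [
    {"name": "Revenue Dashboard", "type": "dashboard", "owners": ["bi-team"], "upstream_count": 1},
    {"name": "daily_export", "type": "dataset", "owners": ["data-eng"], "upstream_count": 2},
    {"name": "fct_orders", "type": "dataset", "owners": ["data-eng"], "upstream_count": 1},
]
summary = summarize_impact(downstream)
print(summary["critical"])
```

The `critical` list answers the key question directly, and `by_owner` gives the notification list.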

Root Cause (Upstream)

Goal: Trace where data originates and how it's transformed.

Strategy:

Get all upstream entities (depth 1-3)
Follow until reaching source-of-record systems (databases, APIs, files)
Note transformation types at each hop (TRANSFORMED, VIEW, COPY)
Identify the original data source

Key question: "Where does this data ultimately come from?"

Full Pipeline (Both Directions)

Goal: Map the complete data flow from source to consumption.

Strategy:

Get upstream to source (root cause)
Get downstream to consumers (impact)
Merge into a single directed graph
Present as end-to-end flow
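The merge step might look like the sketch below, assuming both traversals have been reduced to `(source, destination)` edge pairs. That edge representation is an assumption, not the CLI's output format. The standard-library `graphlib.TopologicalSorter` then yields a source-to-consumer ordering for the end-to-end view.

```python
from graphlib import TopologicalSorter


def merge_lineage(upstream_edges, downstream_edges):
    """Union two edge lists into a node -> set-of-predecessors mapping."""
    predecessors = {}
    for src, dst in [*upstream_edges, *downstream_edges]:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    return predecessors


def flow_order(predecessors):
    """Order nodes so every source appears before its consumers."""
    return list(TopologicalSorter(predecessors).static_order())


# Illustrative edges around a target table fct_orders
upstream = [("raw_orders", "stg_orders"), ("stg_orders", "fct_orders")]
downstream = [("fct_orders", "Orders Dashboard"), ("fct_orders", "daily_export")]

merged = merge_lineage(upstream, downstream)
order = flow_order(merged)
print(order)
```

`static_order` raises `graphlib.CycleError` if the merged graph contains a cycle, which is worth surfacing to the user rather than hiding.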

Cross-Platform Tracing

Goal: Understand how data moves between systems.

Strategy:

Trace lineage in both directions
Group entities by platform
Identify cross-platform edges (e.g., PostgreSQL → Snowflake via dbt)
Highlight the integration points
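As a sketch of the grouping and edge-detection steps, the functions below take a hypothetical `platform_of` mapping (entity name to platform) and a list of `(source, destination)` edges. Both input shapes, and the platform assigned to each entity, are assumptions for illustration.

```python
def group_by_platform(platform_of):
    """Invert an entity -> platform mapping into platform -> [entities]."""
    groups = {}
    for entity, platform in platform_of.items():
        groups.setdefault(platform, []).append(entity)
    return groups


def cross_platform_edges(edges, platform_of):
    """Keep only edges whose endpoints live on different platforms."""
    return [(src, dst) for src, dst in edges if platform_of[src] != platform_of[dst]]


# Illustrative data
platform_of = {
    "raw_orders": "PostgreSQL",
    "stg_orders": "Snowflake",
    "fct_orders": "Snowflake",
    "Orders Dashboard": "Looker",
}
edges = [
    ("raw_orders", "stg_orders"),
    ("stg_orders", "fct_orders"),
    ("fct_orders", "Orders Dashboard"),
]
boundaries = cross_platform_edges(edges, platform_of)
print(boundaries)
```

The edges returned by `cross_platform_edges` are the integration points to highlight.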

Path Finding

Goal: Determine if and how entity A connects to entity B.

Strategy:

Start BFS from entity A downstream
At each hop, check if entity B appears
If found, return the path
Max depth: 5 hops (ask user before going deeper)

Lineage Edge Types

Type	Meaning
`TRANSFORMED`	Data was transformed (e.g., SQL query, dbt model)
`VIEW`	Entity is a view over the source
`COPY`	Data was copied without transformation

Platform-Specific Lineage Notes

Platform	Lineage Source	Notes
dbt	dbt manifest	Model-level lineage, often the richest
Airflow	Task dependencies	Job-level lineage
Snowflake	Query logs	Column-level lineage possible
BigQuery	Audit logs	Table-level lineage
Looker	LookML explores	Dashboard → dataset lineage
Tableau	Workbook metadata	Dashboard → dataset lineage

Choosing the Right Command

Need	Command	Why
Unfiltered upstream/downstream	`datahub lineage`	Simple, returns names and platforms
Column-level lineage	`datahub lineage --column <field>`	Only command that supports column tracing
Filter by type, platform, tags	`searchAcrossLineage` via `datahub graphql`	Server-side filtering avoids fetching full graph
Time-windowed lineage	`searchAcrossLineage` with `lineageFlags`	Only way to scope by edge update time
Large result sets (300+)	`scrollAcrossLineage` via `datahub graphql`	Cursor-based pagination for large graphs

Lineage Limitations

Use datahub lineage for both upstream and downstream traversal. Supports --hops, --column, and --format json with metadata hints.
Use searchAcrossLineage when filtering is needed. datahub lineage has no filter support — use the GraphQL query via datahub graphql to filter by entity type, platform, tags, domain, or time window.
Depth: Deep lineage graphs (5+ hops) can be very large. Always cap and ask.
Staleness: Lineage reflects the last ingestion. It may not reflect recent pipeline changes.
Column-level: Not all sources provide column-level lineage. Note when unavailable.

Impact Analysis

Target Entity

Name:
URN:
Platform:
Type:

Impact Summary

Direct dependents (1 hop):
Transitive dependents (all hops):
Depth traced:

Affected Entities

By Type

Type	Count	Entities
Datasets	<!-- n -->	<!-- list -->
Dashboards	<!-- n -->	<!-- list -->
Data Jobs	<!-- n -->	<!-- list -->
Charts	<!-- n -->	<!-- list -->

By Platform

Platform	Count
<!-- platform -->	<!-- n -->

Critical Paths

Entity	Type	Risk
<!-- name -->	<!-- type -->	Single dependency — no alternative source

Lineage Graph

<!-- ASCII flow diagram -->

Affected Owners

Owner	Entities Affected
<!-- owner -->	<!-- count and list -->

Recommendations

Lineage Map

Target Entity

Name:
URN:

Flow Diagram

<!-- ASCII lineage diagram -->

Upstream (Sources)

Hop	Entity	Type	Platform	Relationship
1	<!-- name -->	<!-- type -->	<!-- platform -->	<!-- TRANSFORMED/VIEW/COPY -->

Downstream (Consumers)

Hop	Entity	Type	Platform	Relationship
1	<!-- name -->	<!-- type -->	<!-- platform -->	<!-- type -->

Cross-Platform Boundaries

From	To	Edge
<!-- platform A -->	<!-- platform B -->	<!-- entity A → entity B -->

DataHub CLI Reference

Commands verified against DataHub CLI v1.4.0. Install via pip install acryl-datahub.

Tool Detection

Before running any DataHub commands, determine which tools are available:

MCP tools available — If tools like datahub_search, datahub_get_entity, datahub_get_lineage are in your tool list, use them directly. They are the preferred path — no CLI installation needed.
CLI available — If you have a Bash tool, check: which datahub. If found, use the CLI commands documented below.
Neither — Suggest the user set up a DataHub connection using /datahub-setup.

MCP takes priority over CLI when both are available — MCP tools are purpose-built for agent use with structured inputs/outputs and no shell overhead.
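The detection order can be sketched as a single function. The tool names in `MCP_TOOL_NAMES` are the ones listed above; the function name, its return labels, and the `cli_path` override are hypothetical.

```python
import shutil

# Tool names taken from the detection rules above
MCP_TOOL_NAMES = {"datahub_search", "datahub_get_entity", "datahub_get_lineage"}


def choose_tool_path(available_tools, cli_path=None):
    """Return 'mcp', 'cli', or 'setup' following the priority order."""
    if MCP_TOOL_NAMES & set(available_tools):
        return "mcp"                      # MCP wins whenever it is present
    if cli_path is None:
        cli_path = shutil.which("datahub")  # same check as `which datahub`
    if cli_path:
        return "cli"
    return "setup"                        # neither found: suggest /datahub-setup


print(choose_tool_path(["datahub_search", "bash"]))
```

Passing `cli_path` explicitly makes the function easy to exercise without a DataHub installation; leaving it as `None` performs the real lookup on `PATH`.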

CLI ↔ MCP Equivalents

Operation	CLI Command	MCP Tool
Search	`datahub search "query" --where "..."`	`search(query="...", filter="...")`
Get entity	`datahub get --urn "..." --aspect ownership`	`get_entities(urns=["..."])`
Upstream lineage	`datahub lineage --urn "..." --direction upstream`	`get_lineage(urn="...", upstream=true)`
Downstream lineage	`datahub lineage --urn "..." --direction downstream`	`get_lineage(urn="...", upstream=false)`
GraphQL	`datahub graphql --query '...'`	`execute_graphql(query="...")`
Server config	`datahub check server-config`	Not needed (MCP server handles config)

The rest of this document covers the CLI path.

Authentication

The CLI reads connection settings from ~/.datahubenv:

gms:
  server: "http://localhost:8080"
  token: "<personal-access-token>"

Or via environment variables:

export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_GMS_TOKEN="<token>"

Version Check

Before running commands, check the installed CLI version:

datahub version

If a skill requires a minimum version and the installed version is older, upgrade:

pip install --upgrade acryl-datahub --pre

The --pre flag ensures pre-release versions (e.g. 1.5.0rc1) are included, which may be required for newer features.

Server Detection

Detect whether you're connected to DataHub Cloud or OSS:

datahub check server-config

serverEnv: 'cloud' → DataHub Cloud (supports popularity sorting, dataset features)
serverEnv: 'core' or other → OSS / self-hosted (feature fields not available)

Cache this result for the session — don't re-check on every command. Some features marked (Cloud only) below require serverEnv: cloud.
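One way to cache the result is sketched below. It assumes the command prints text containing a `serverEnv` key in roughly the form shown above (`serverEnv: 'cloud'`); the exact output format is not confirmed here, and the function names are hypothetical.

```python
import re
import subprocess
from functools import lru_cache


def parse_server_env(config_text):
    """Pull the serverEnv value out of the command output, or None if absent."""
    match = re.search(r"serverEnv['\"]?\s*:\s*['\"]?([A-Za-z0-9_-]+)", config_text)
    return match.group(1) if match else None


@lru_cache(maxsize=1)
def server_env():
    """Run the check once per process; later calls reuse the cached value."""
    result = subprocess.run(
        ["datahub", "check", "server-config"],
        capture_output=True, text=True,
    )
    return parse_server_env(result.stdout)


def is_cloud(env):
    """Only 'cloud' enables the Cloud-only features; anything else is OSS."""
    return env == "cloud"


print(parse_server_env("serverEnv: 'cloud'"))
```

Separating parsing from the subprocess call keeps the parsing logic testable without a live server.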

Context

Pass context on CLI commands using -C key=value so commands can be correlated:

datahub -C skill=datahub-audit search "revenue"
datahub -C skill=datahub-audit -C caller=claude-code get --urn "..."

Search & Discovery

The search CLI uses a positional query argument — not --query.

# Basic keyword search
datahub search "revenue"

# Search with limit
datahub search "customers" --limit 20

# Filter by platform (simple filter)
datahub search "*" --filter platform=snowflake

# Filter by entity type
datahub search "*" --where "entity_type = dataset"

# SQL-like WHERE expressions (recommended for agents)
datahub search "*" --where "platform = snowflake AND env = PROD"
datahub search "*" --where "platform IN (snowflake, bigquery)"
datahub search "*" --where "entity_type = dataset AND platform = snowflake"

# Multiple simple filters (AND between fields, comma = OR within field)
datahub search "*" --filter platform=snowflake --filter env=PROD
datahub search "*" --filter platform=snowflake,bigquery

# Output formats
datahub search "revenue" --table          # Human-readable table
datahub search "revenue" --urns-only      # URNs only, one per line
datahub search "revenue" --format json    # JSON (default)

# Pagination (max 50 per page)
datahub search "customers" --limit 50 --offset 0     # page 1
datahub search "customers" --limit 50 --offset 50    # page 2

# Facets only (counts by type/platform/etc.)
datahub search "*" --facets-only --format json

# Dry run (preview query without executing)
datahub search "revenue" --where "platform = snowflake" --dry-run

# Projection (limit returned fields — reduces token cost)
datahub search "customers" --projection "urn type"

# Column-level search (find datasets containing a specific field)
datahub search "*" --where "entity_type = dataset AND fieldPaths = customer_id"

# Sorting
datahub search "*" --sort-by lastModifiedAt --sort-order desc --limit 10
datahub search "*" --sort-by _entityName --sort-order asc --limit 10

# Popularity / usage sorting (Cloud only — check serverEnv first)
# Most queried datasets
datahub search "*" --where "entity_type = dataset" \
  --sort-by queryCountLast30DaysFeature --sort-order desc --limit 10 \
  --projection "urn type ... on Dataset { properties { name } platform { name } statsSummary { queryCountLast30Days uniqueUserCountLast30Days } }"

# Most updated datasets
datahub search "*" --where "entity_type = dataset" --sort-by writeCountLast30DaysFeature --sort-order desc --limit 10

# Largest tables (by row count or bytes)
datahub search "*" --where "entity_type = dataset" --sort-by rowCountFeature --sort-order desc --limit 10
datahub search "*" --where "entity_type = dataset" --sort-by sizeInBytesFeature --sort-order desc --limit 10

# Existence filters (IS NULL / IS NOT NULL)
datahub search "*" --where "entity_type = dataset AND description IS NULL AND editableDescription IS NULL"
datahub search "*" --where "entity_type = dataset AND glossary_term IS NOT NULL"

# Sibling-aware description audit (single query, no N+1 fetches)
# Step 1: Find datasets missing both ingestion and user-edited descriptions
# Step 2: Project siblings with their descriptions to compute effective coverage
datahub search "*" \
  --where "entity_type = dataset AND platform = snowflake AND description IS NULL AND editableDescription IS NULL" \
  --projection "urn type ... on Dataset { siblings { isPrimary siblings { urn ... on Dataset { properties { name description } editableProperties { description } } } } }" \
  --format json --limit 50

# URN resolution for filters
# Tag, domain, and glossary_term filters require full URNs — not display names.
# Always resolve the name to a URN first, then use the URN in the filter.

# Step 1: Find tag URN by name
datahub search "large table" --where "entity_type = tag" --urns-only --limit 1
# → urn:li:tag:sample_data___default_large_table

# Step 2: Use the URN in a filter
datahub search "*" --where "entity_type = dataset AND tags = 'urn:li:tag:sample_data___default_large_table'"

# Same pattern for domains:
datahub search "ecommerce" --where "entity_type = domain" --urns-only --limit 1
# → urn:li:domain:91994180-...
datahub search "*" --where "entity_type = dataset AND domain = 'urn:li:domain:91994180-...'"

# And glossary terms:
datahub search "PII" --where "entity_type = glossaryTerm" --urns-only --limit 1
datahub search "*" --where "entity_type = dataset AND glossary_term = 'urn:li:glossaryTerm:...'"

# Discover available filters
datahub search list-filters
datahub search describe-filter platform

# Agent best practices
datahub search --agent-context

Entity Retrieval

# Get full entity metadata
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,table_name,PROD)"

# Get specific aspect
datahub get --urn "<URN>" --aspect schemaMetadata
datahub get --urn "<URN>" --aspect ownership
datahub get --urn "<URN>" --aspect globalTags

Lineage

# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream

# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream

# Limit to immediate neighbors
datahub lineage --urn "<URN>" --direction upstream --hops 1

# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream

# JSON output (includes metadata with capped/hint info)
datahub lineage --urn "<URN>" --direction downstream --format json

# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"

# Agent best practices
datahub lineage --agent-context

Timeline (Change History)

# Schema changes
datahub timeline --urn "<URN>" --category technical_schema

# Ownership changes
datahub timeline --urn "<URN>" --category owner

# Tag changes
datahub timeline --urn "<URN>" --category tag

# With time range
datahub timeline --urn "<URN>" --category technical_schema --start 7daysago

Categories: tag, glossary_term, technical_schema, documentation, owner

Write Operations (via GraphQL Mutations)

Write operations use datahub graphql --query 'mutation { ... }'. The CLI does not have dedicated tag, glossary, or inline put commands for these operations.

Important rules for GraphQL mutations:

Return field subselections required. Mutations returning objects (not scalars like Boolean) need { urn } or similar after the mutation. Without it: SubselectionRequired error.
Long queries must use temp files. Long inline --query strings get misinterpreted as file paths on macOS (File name too long). Write to a .graphql file and pass the path: datahub graphql --query /tmp/my-mutation.graphql --format json.
Short mutations can be inline. Simple mutations like addTag, removeTag, addOwner are short enough to pass inline.
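The temp-file rule can be sketched as a helper that decides between inline and file-based invocation. The 200-character threshold is an arbitrary assumption, not a documented CLI limit, and `graphql_argv` is a hypothetical name.

```python
import os
import tempfile

INLINE_LIMIT = 200  # assumed cutoff; tune to taste


def graphql_argv(query, inline_limit=INLINE_LIMIT):
    """Return (argv, temp_path). temp_path is None when the query is passed inline."""
    if len(query) <= inline_limit:
        return ["datahub", "graphql", "--query", query, "--format", "json"], None
    # Long query: write it to a .graphql file and pass the path instead
    with tempfile.NamedTemporaryFile("w", suffix=".graphql", delete=False) as handle:
        handle.write(query)
        path = handle.name
    return ["datahub", "graphql", "--query", path, "--format", "json"], path


short_query = 'mutation { unsetDomain(entityUrn: "<ENTITY_URN>") }'
argv, temp_path = graphql_argv(short_query)
print(argv[3] == short_query, temp_path)
```

The caller is responsible for deleting `temp_path` (for example with `os.unlink`) once the command has run.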

Glossary Terms

# Add term to entity
datahub graphql --query 'mutation {
  addTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Remove term
datahub graphql --query 'mutation {
  removeTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json

Ownership

# Add owner (appends — does not replace existing owners)
datahub graphql --query 'mutation {
  addOwner(input: {
    ownerUrn: "urn:li:corpuser:<USER>",
    resourceUrn: "<ENTITY_URN>",
    ownerEntityType: CORP_USER,
    type: TECHNICAL_OWNER
  })
}' --format json

# Remove owner
datahub graphql --query 'mutation {
  removeOwner(input: { ownerUrn: "urn:li:corpuser:<USER>", resourceUrn: "<ENTITY_URN>" })
}' --format json

# Batch add owners
datahub graphql --query 'mutation {
  batchAddOwners(input: {
    owners: [{ ownerUrn: "urn:li:corpuser:<USER>", ownerEntityType: CORP_USER }],
    resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
  })
}' --format json

Owner types: TECHNICAL_OWNER, BUSINESS_OWNER, DATA_STEWARD, NONE

Deprecation

# Deprecate
datahub graphql --query 'mutation {
  updateDeprecation(input: { urn: "<URN>", deprecated: true, note: "Replaced by new_table" })
}' --format json

# Un-deprecate
datahub graphql --query 'mutation {
  updateDeprecation(input: { urn: "<URN>", deprecated: false })
}' --format json

Domains

# Create domain
datahub graphql --query 'mutation {
  createDomain(input: { name: "Marketing", description: "Marketing data" })
}' --format json

# Assign entity to domain (domain must exist)
datahub graphql --query 'mutation {
  setDomain(entityUrn: "<ENTITY_URN>", domainUrn: "urn:li:domain:<DOMAIN_ID>")
}' --format json

# Remove from domain
datahub graphql --query 'mutation {
  unsetDomain(entityUrn: "<ENTITY_URN>")
}' --format json

# Batch assign
datahub graphql --query 'mutation {
  batchSetDomain(input: {
    domainUrn: "urn:li:domain:<ID>",
    resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
  })
}' --format json

Description

datahub graphql --query 'mutation {
  updateDescription(input: {
    description: "New description text",
    resourceUrn: "<ENTITY_URN>"
  })
}' --format json

Data Products

Note: domainUrn is required — every data product must belong to a domain. Use datahub graphql --describe createDataProduct --recurse to verify the schema.

# Create (domainUrn is REQUIRED)
datahub graphql --query 'mutation {
  createDataProduct(input: {
    domainUrn: "urn:li:domain:<DOMAIN_ID>",
    properties: { name: "Revenue Analytics", description: "Revenue pipeline" }
  }) { urn }
}' --format json

# Add assets to data product
datahub graphql --query 'mutation {
  batchSetDataProduct(input: {
    dataProductUrn: "urn:li:dataProduct:<ID>",
    resourceUrns: ["<URN1>", "<URN2>"]
  })
}' --format json

Verification & Health

# Check CLI version
datahub version

# Verify connectivity (this entity always exists)
datahub get --urn "urn:li:corpuser:datahub"

# Test search (confirms search index works)
datahub search "*" --limit 1

# Server configuration
datahub check server-config

Note: datahub check server-health does not exist. Use datahub get --urn "urn:li:corpuser:datahub" to verify connectivity.

GraphQL Discovery

# List all available operations
datahub graphql --list-operations --format json

# List mutations only
datahub graphql --list-mutations --format json

# Describe a specific operation
datahub graphql --describe addTag --format json

# Describe with full type expansion
datahub graphql --describe addTag --recurse --format json

# Dry run (preview without executing)
datahub graphql --query '{ me { corpUser { urn } } }' --dry-run

# Agent best practices
datahub graphql --agent-context

Batch Mutation Pattern (Python)

Shell loops with dataset URNs are fragile due to quoting issues with parentheses. For multi-entity mutations, use a Python script with temp files:

import subprocess, json, tempfile, os

def run_graphql_mutation(query, variables):
    """Run a GraphQL mutation with variables via temp file. Returns parsed JSON or None."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
        json.dump(variables, f)
        vf = f.name
    try:
        result = subprocess.run(
            ["datahub", "graphql", "-q", query, "-v", vf, "--format", "json", "--no-pretty"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            print(f"ERROR: {result.stderr.strip()[:120]}")
            return None
    finally:
        os.unlink(vf)

# Example: batch update descriptions
query = "mutation updateDataset($urn: String!, $input: DatasetUpdateInput!) { updateDataset(urn: $urn, input: $input) { urn } }"

datasets = {
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)": "Description for table1",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)": "Description for table2",
}

for urn, desc in datasets.items():
    variables = {"urn": urn, "input": {"editableProperties": {"description": desc}}}
    result = run_graphql_mutation(query, variables)
    status = "OK" if result else "FAIL"
    print(f"  {urn.split(',')[1]}: {status}")

Output Processing

# Pipe search URNs to get for batch retrieval
datahub search "customers" --urns-only | xargs -I{} datahub get --urn {}

# Extract field names from schema
datahub get --urn "<URN>" --aspect schemaMetadata | python3 -c "
import sys, json
data = json.load(sys.stdin)
for f in data.get('schemaMetadata', {}).get('fields', []):
    print(f['fieldPath'])
"
