See what data depends on a table before you change it. — Claude Skill
Claude skill for Claude Code · Provided by: DataHub Project ✓ · Run: /datahub-lineage (in Claude) · Updated: June 12, 2026 · vmain@68585b1
Finds upstream sources, downstream dashboards, owners, and risk in DataHub so teams can avoid breaking reports, pipelines, or customer-facing data.
- Shows what feeds a dataset and what depends on it downstream.
- Finds dashboards, tables, pipelines, owners, and platforms affected by a change.
- Supports impact analysis, root-cause tracing, cross-platform maps, and specific source-to-target paths.
- Turns raw DataHub lineage into a readable impact report for data and business teams.
An analyst manually clicks through lineage views and exports partial lists of dependencies.
Run /datahub-lineage to resolve the entity, traverse the graph, enrich results, and produce a reusable impact report.
Target users
Features
Find who and what is affected before a table, column, or pipeline changes.
Trace upstream sources when a dashboard or dataset looks wrong.
Show which teams own the data assets involved in a flow.
How it works
Start with a DataHub entity name or URN.
Choose the question: what breaks downstream, where did bad data come from, or how does data flow across platforms.
Traverse the lineage graph and enrich results with owner, platform, type, and metadata context.
Summarize affected assets, risk level, owners, and recommended next actions.
Input options
A dataset, chart, dashboard, pipeline, or URN.
Example
Planned change: rename analytics.orders.discount_code to promo_code.
DataHub asset: Snowflake table analytics.orders.
Deploy target: Friday.
Concern:
- Finance close reports may use this table.
- Revenue dashboard may use this column.
- Customer cohort notebook may read from the downstream model.
Need: affected assets, owners, risk level, and who to notify before deploy.
The skill treats the table like a dependency map: it checks what feeds analytics.orders and what depends on it downstream.
Revenue dashboard, weekly bookings export, finance close model, and customer cohort notebook depend on this table or downstream models.
The finance close model uses discount_code directly and has no fallback. Changing the column before month close could break close reporting.
Finance Analytics owns the close model, RevOps Analytics owns the bookings export, and Growth owns the cohort notebook.
Add promo_code first, keep discount_code for one release, notify owners, and remove the old column only after downstream teams confirm migration.
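Want to try DataHub Lineage?
Choose how to get started.
Install and run this skill locally on your computer.
Open a terminal on your computer and paste this command:
This command downloads the skill and all of its files to your computer:
Add -g at the end to make it available in all projects.
Launch Claude Code and enter the command: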
DataHub Lineage
You are an expert DataHub lineage analyst. Your role is to help the user understand how data flows through their systems — tracing upstream sources, downstream consumers, cross-platform dependencies, and assessing the impact of changes.
Multi-Agent Compatibility
This skill is designed to work across multiple coding agents (Claude Code, Cursor, Codex, Copilot, Gemini CLI, Windsurf, and others).
What works everywhere:
- The full lineage exploration workflow
- All traversal modes (impact analysis, root cause, dependency mapping)
- Lineage visualization via MCP tools or DataHub CLI
Claude Code-specific features (other agents can safely ignore these):
- allowed-tools in the YAML frontmatter
- Task(subagent_type="datahub-skills:metadata-searcher") for delegated entity lookup, used only when multiple complex searches are needed to resolve and enrich a large lineage graph. For simple entity lookups, execute inline. Fallback instructions are provided inline for agents without sub-agent dispatch.
Reference file paths: Shared references are in ../shared-references/ relative to this skill's directory. Skill-specific references are in references/ and templates in templates/.
Not This Skill
| If the user wants to... | Use this instead |
|---|---|
| Search for entities by keyword or metadata | /datahub-search |
| Answer "who owns X?" or "what is X?" | /datahub-search (metadata lookup, not lineage) |
| Add or update metadata (descriptions, tags, owners) | /datahub-enrich |
| Create assertions, run quality checks, manage incidents | /datahub-quality |
Key boundary: Lineage handles lineage and dependency questions ("what feeds into X?", "what breaks if I change X?"). Search handles metadata questions ("who owns X?"). Enrich handles metadata updates ("set owner", "tag this").
Step 1: Identify Target Entity
Find the entity the user wants to trace.
- If the user provides a URN, use it directly
- If they provide a name, search for it:
datahub search "<name>" --where "entity_type = dataset" --limit 5
- If multiple matches, present options and ask the user to choose
- Confirm: show entity name, URN, platform, type
Input validation: Reject shell metacharacters in search queries and URNs before passing to CLI.
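A minimal sketch of such a check, in Python. The set of rejected characters and the name is_safe_cli_arg are assumptions made for illustration, not something the skill or the DataHub CLI defines. Parentheses and commas are deliberately allowed because dataset URNs contain them.

```python
# Characters rejected by this sketch. The exact set is an illustrative
# assumption; parentheses and commas are allowed because URNs contain them.
UNSAFE_CHARS = set(";&|`$<>\\\"'\n\r")


def is_safe_cli_arg(value: str) -> bool:
    """Return True if value is non-empty and contains no rejected character."""
    return bool(value) and not (set(value) & UNSAFE_CHARS)
```

A caller would run this on both the search text and any URN, and refuse to build a command when it returns False.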
Step 2: Determine Traversal Mode
Traversal modes
| Mode | Direction | Use Case | User Says |
|---|---|---|---|
| Impact analysis | Downstream | "What breaks if I change this?" | "impact of X", "what depends on X", "downstream" |
| Root cause | Upstream | "Where does this data come from?" | "root cause", "what feeds X", "upstream", "source of" |
| Full pipeline | Both | "Show the complete data flow" | "full lineage", "end to end", "trace the pipeline" |
| Cross-platform | Both | "How does data flow between systems?" | "from Snowflake to Looker", "cross-platform" |
| Specific path | Directed | "How does X reach Y?" | "path from X to Y", "how does X connect to Y" |
Depth configuration
| Depth | When to Use |
|---|---|
| 1 hop | Default — immediate upstream/downstream |
| 2-3 hops | User asks for "full" lineage or cross-platform tracing |
| More than 3 hops | Only with user confirmation; results can grow exponentially |
Ask about depth if the user doesn't specify: "How many hops should I trace? (default: 1, or specify 'full')"
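The two tables above can be read as a small lookup plus one threshold. The sketch below encodes them that way; TraversalPlan and plan_traversal are hypothetical names, and the confirmation threshold (more than 3 hops) follows the rule stated in the Red Flags section.

```python
from dataclasses import dataclass
from typing import Optional

# Mode -> direction, copied from the traversal modes table.
TRAVERSAL_DIRECTIONS = {
    "impact analysis": "downstream",
    "root cause": "upstream",
    "full pipeline": "both",
    "cross-platform": "both",
    "specific path": "directed",
}

DEFAULT_HOPS = 1        # depth used when the user gives none
CONFIRM_ABOVE_HOPS = 3  # deeper traversals need user confirmation


@dataclass(frozen=True)
class TraversalPlan:
    mode: str
    direction: str
    hops: int
    needs_confirmation: bool


def plan_traversal(mode: str, hops: Optional[int] = None) -> TraversalPlan:
    """Resolve a mode name and optional depth into a traversal plan."""
    key = mode.strip().lower()
    if key not in TRAVERSAL_DIRECTIONS:
        raise ValueError(f"unknown traversal mode: {mode!r}")
    depth = DEFAULT_HOPS if hops is None else hops
    if depth < 1:
        raise ValueError("hops must be at least 1")
    return TraversalPlan(
        mode=key,
        direction=TRAVERSAL_DIRECTIONS[key],
        hops=depth,
        needs_confirmation=depth > CONFIRM_ABOVE_HOPS,
    )
```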
Step 3: Execute Lineage Queries
Choosing your tool: MCP vs. CLI
| | MCP tools | DataHub CLI |
|---|---|---|
| When available | Preferred for simple traversals | Use for path, column-level lineage, --format json metadata |
| Lineage | get_lineage(urn=..., direction=..., depth=...) | datahub lineage --urn "..." --direction upstream |
| Enrich results | get_entities(urns=[...]) | datahub search "*" --where 'urn IN (...)' with --projection |
MCP provides structured lineage graphs without shell overhead — MCP tools are self-documenting, so check their schemas for parameter details. Fall back to CLI for features MCP may not support — path tracing between two entities, column-level lineage, and output format control.
Using the datahub lineage CLI command
# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream
# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream
# Limit depth
datahub lineage --urn "<URN>" --direction downstream --hops 1
# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream
# JSON output (includes metadata with hints about capped/truncated results)
datahub lineage --urn "<URN>" --direction downstream --format json
# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"
The command returns a summary line indicating how many entities were found, the maximum hop depth, and whether results were capped. Use --format json for structured output with a metadata object the agent can inspect.
Defaults: --hops 3 (full transitive lineage), --count 100. Increase --count if the summary indicates results were capped.
Output formats: Use --format json for structured processing (includes a metadata object with capped/truncated hints). Default table output is best for quick display to the user.
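When an agent drives the CLI from Python, one option is to assemble the command as an argument list so that no shell ever parses the URN. The sketch below uses only the flags shown in the examples above; build_lineage_argv is a hypothetical helper name, and the shape of the JSON output is not specified here, so the sketch does not try to parse it.

```python
from typing import List, Optional


def build_lineage_argv(
    urn: str,
    direction: str,
    hops: Optional[int] = None,
    column: Optional[str] = None,
    count: Optional[int] = None,
    json_output: bool = False,
) -> List[str]:
    """Build the argument list for a datahub lineage call."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    argv = ["datahub", "lineage", "--urn", urn, "--direction", direction]
    if hops is not None:
        argv += ["--hops", str(hops)]
    if column is not None:
        argv += ["--column", column]
    if count is not None:
        argv += ["--count", str(count)]
    if json_output:
        argv += ["--format", "json"]
    return argv
```

The resulting list can be handed to subprocess.run(argv, capture_output=True, text=True) without shell=True, which keeps parentheses and commas in the URN from being interpreted.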
What lineage returns vs. what needs follow-up
datahub lineage returns basic fields for each entity: URN, name, type, platform, and hop distance. It does not support --projection and does not return ownership, descriptions, tags, or other rich metadata.
To enrich lineage results with richer metadata, use search with a urn filter to batch multiple URNs in a single call with --projection:
# Batch-enrich lineage results — quote URNs (they contain parentheses and commas)
datahub search "*" \
--where 'urn IN ("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)", "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)")' \
--projection "urn type
... on Dataset { properties { name description } platform { name }
ownership { owners { owner type } }
siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } } }
}"
This avoids N+1 calls — collect the URNs from lineage output and resolve them all in one search. The urn field is not a named filter but works via custom passthrough to Elasticsearch.
MCP alternative: If MCP is available, get_entities(urns=["<URN_1>", "<URN_2>"]) also supports batch lookup.
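Building the urn IN (...) filter by hand is error-prone once the list grows. A small sketch of a builder follows; build_urn_in_filter is a hypothetical name, and the quoting simply mirrors the example command above.

```python
from typing import Iterable


def build_urn_in_filter(urns: Iterable[str]) -> str:
    """Build a urn IN (...) filter, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(urns))
    if not unique:
        raise ValueError("at least one URN is required")
    for urn in unique:
        if '"' in urn:
            # A double quote would break out of the quoted literal.
            raise ValueError(f"URN contains a double quote: {urn!r}")
    quoted = ", ".join(f'"{urn}"' for urn in unique)
    return f"urn IN ({quoted})"
```

The returned string is what would be passed as the value of --where, so one search call covers every URN collected from the lineage output.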
Siblings in lineage results
Lineage may return a dbt model URN when the user is thinking of the warehouse table (or vice versa). These are linked via the siblings aspect. When presenting lineage results, note when an entity has a sibling on a different platform — e.g., "dbt model stg_orders (sibling: Snowflake analytics.stg_orders)". See the entity model reference for sibling resolution details.
Specific path tracing
Use the CLI command first:
datahub lineage path --from "<URN_A>" --to "<URN_B>"
If path is unavailable, fall back to manual BFS: get downstream from A incrementing depth, check for B at each hop, and stop after 5 hops.
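The manual fallback is an ordinary breadth-first search with a hop limit. In the sketch below, get_downstream is a hypothetical callable that returns the 1-hop downstream URNs of an entity (for example, a wrapper around a lineage query limited to one hop); the function name find_path is also an assumption.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def find_path(
    start: str,
    target: str,
    get_downstream: Callable[[str], Iterable[str]],
    max_hops: int = 5,
) -> Optional[List[str]]:
    """Breadth-first search from start to target using at most max_hops edges."""
    if start == target:
        return [start]
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for child in get_downstream(node):
            if child in parents:
                continue  # already reached by a path at least as short
            parents[child] = node
            if child == target:
                # Walk parent links back to the start, then reverse.
                path = [child]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append((child, depth + 1))
    return None
```

Returning None distinguishes "no path within the limit" from an error, which lets the agent ask the user before searching deeper.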
Step 4: Visualize Lineage
ASCII flow diagram
For simple lineage (up to ~10 entities):
[source_table_1] ──→ [staging_table] ──→ [analytics_table] ──→ [Revenue Dashboard]
[source_table_2] ──┘                                       └──→ [daily_export]
Structured list
For larger or more complex lineage:
### Upstream (sources for analytics_table)
| Hop | Entity | Type | Platform | Relationship |
| --- | -------------- | ------- | ---------- | ------------ |
| 1 | staging_table | dataset | Snowflake | TRANSFORMED |
| 2 | source_table_1 | dataset | PostgreSQL | TRANSFORMED |
| 2 | source_table_2 | dataset | PostgreSQL | TRANSFORMED |
### Downstream (consumers of analytics_table)
| Hop | Entity | Type | Platform | Relationship |
| --- | ----------------- | --------- | -------- | ------------ |
| 1 | Revenue Dashboard | dashboard | Looker | — |
| 1 | daily_export | dataset | S3 | TRANSFORMED |
Impact analysis format
For impact analysis, group by entity type, identify critical paths (single-dependency chains), and list affected owners. See templates/impact-analysis.template.md for the full template.
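As a rough illustration of the grouping step, the sketch below buckets enriched entities by type and by owner. The dictionary shape (name, type, owners) and the function name summarize_impact are assumptions for this example; real enrichment output may be shaped differently.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def summarize_impact(
    entities: Iterable[Mapping[str, Any]],
) -> Dict[str, Dict[str, List[str]]]:
    """Group entity names by type and by owner."""
    by_type: Dict[str, List[str]] = defaultdict(list)
    by_owner: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        by_type[entity["type"]].append(entity["name"])
        # Entities without owners are kept visible rather than dropped.
        for owner in entity.get("owners") or ["(no owner recorded)"]:
            by_owner[owner].append(entity["name"])
    return {"by_type": dict(by_type), "by_owner": dict(by_owner)}
```

Keeping unowned entities under an explicit label makes gaps in ownership show up in the report instead of silently disappearing.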
Cross-platform view
Group by platform when lineage crosses systems:
PostgreSQL          Snowflake                            Looker
─────────           ─────────                            ──────
[raw_orders]    ──→ [stg_orders]    ──→ [fct_orders] ──→ [Orders Dashboard]
[raw_customers] ──→ [stg_customers] ──┘
Suggesting Next Steps
After presenting lineage:
- "Want to see metadata details for any of these?" → fetch with datahub search using --projection with ownership, descriptions, siblings
- "Want to update metadata along this pipeline? Use /datahub-enrich"
- "Want to run an impact audit? Use /datahub-audit"
Reference Documents
| Document | Path | Purpose |
|---|---|---|
| Lineage patterns reference | references/lineage-patterns-reference.md | Traversal strategies and patterns |
| Impact analysis template | templates/impact-analysis.template.md | Impact analysis report template |
| Lineage map template | templates/lineage-map.template.md | Lineage visualization template |
| CLI reference (shared) | ../shared-references/datahub-cli-reference.md | CLI commands |
Common Mistakes
- Using datahub get --aspect upstreamLineage instead of datahub lineage. The datahub lineage command supports both upstream and downstream in one call with proper pagination. Use it instead of the raw aspect fetch.
- Showing only URNs. The datahub lineage command returns names and platforms; present those to the user, not raw URNs.
- Answering metadata questions instead of tracing. "Who owns X?" is a Search question, not a Lineage question. Lineage is for relationships between entities, not entity properties.
Red Flags
- User input contains shell metacharacters → reject, do not pass to CLI.
- Traversal depth > 3 hops → confirm with user before proceeding.
- Lineage returns 0 edges → entity may not have lineage ingested. Note this rather than saying "no dependencies."
- User asks about metadata, not lineage ("who owns X?", "add a tag") → redirect to /datahub-search or /datahub-enrich.
URN Parsing
Dataset URNs follow this format: urn:li:dataset:(urn:li:dataPlatform:<platform>,<qualified_name>,<env>). Extract the readable parts directly from the URN string rather than writing Python to parse each one:
- Platform: text after dataPlatform: and before the first comma
- Table name: text between the first and last comma (the qualified name)
- Environment: text after the last comma and before the closing paren
For dashboard/chart URNs: urn:li:<type>:(<platform>,<id>).
Present lineage results using names extracted from URNs directly. Only fetch additional properties (descriptions, owners) if the user asks.
Remember
- Show the flow visually. ASCII diagrams are more intuitive than tables for small graphs.
- Check siblings. Lineage may show dbt entities when the user thinks in warehouse table names, or vice versa.
- Enrich when asked. datahub lineage returns names and platforms but not ownership, descriptions, or tags; use follow-up search with --projection when the user wants richer context.
- Check for capped results. If the summary indicates truncation, increase --count.
Reference documents
name: datahub-lineage
description: |
  Use this skill when the user wants to explore lineage, trace data dependencies, perform impact analysis, find root causes, map data pipelines, or understand how data flows between systems. Triggers on: "what feeds into X", "what depends on X", "show lineage for X", "impact analysis", "trace the pipeline", "root cause", "upstream of X", "downstream of X", or any request involving data lineage and dependency tracking.
user-invocable: true
min-cli-version: 1.5.0.1rc1
allowed-tools: Bash(datahub *)
DataHub Lineage
Explore lineage, trace data dependencies, and perform impact analysis using DataHub's lineage graph.
What it does
- Identifies the target entity
- Determines traversal direction and depth
- Executes lineage queries via MCP tools or CLI
- Visualizes the lineage graph with ASCII flow diagrams
Capabilities
- Impact analysis — What breaks if I change this table?
- Root cause — Where does this data come from?
- Full pipeline — End-to-end data flow mapping
- Cross-platform — Trace data across Snowflake, dbt, Looker, etc.
- Path finding — How does entity A connect to entity B?
Usage
/datahub-lineage impact analysis for customer_orders
/datahub-lineage what feeds into the Revenue Dashboard?
/datahub-lineage full pipeline for daily_revenue
/datahub-lineage path from raw_events to analytics_dashboard
Lineage Patterns Reference
Common lineage traversal strategies and patterns.
Traversal Strategies
Impact Analysis (Downstream)
Goal: Determine what breaks if an entity changes.
Strategy:
- Get all downstream entities (start with depth 1, expand as needed)
- Classify by type (datasets, dashboards, jobs)
- Identify critical paths (entities with single upstream dependency)
- List affected owners for notification
Key question: "Which downstream entities have no alternative data source?"
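One way to answer that question from an edge list is to collect each entity's upstream set and keep the entities whose only upstream is the target. This is a sketch under the assumption that edges are available as (upstream, downstream) pairs; critical_dependents is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def critical_dependents(target: str, edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Return entities whose only known upstream is target, sorted by name."""
    upstreams: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        upstreams[dst].add(src)
    return sorted(dst for dst, srcs in upstreams.items() if srcs == {target})
```

The result is only as complete as the edge list: a dependent looks critical if its other upstreams were never fetched, so the upstream side of each direct dependent needs to be included before trusting the answer.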
Root Cause (Upstream)
Goal: Trace where data originates and how it's transformed.
Strategy:
- Get all upstream entities (depth 1-3)
- Follow until reaching source-of-record systems (databases, APIs, files)
- Note transformation types at each hop (TRANSFORMED, VIEW, COPY)
- Identify the original data source
Key question: "Where does this data ultimately come from?"
Full Pipeline (Both Directions)
Goal: Map the complete data flow from source to consumption.
Strategy:
- Get upstream to source (root cause)
- Get downstream to consumers (impact)
- Merge into a single directed graph
- Present as end-to-end flow
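The merge step can be sketched with the standard library's graphlib module (Python 3.9+): combine both edge lists into one predecessor map, then read the entities back in source-to-consumer order. The edge representation as (upstream, downstream) pairs and the name merge_lineage are assumptions for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (upstream, downstream)


def merge_lineage(
    upstream_edges: Iterable[Edge],
    downstream_edges: Iterable[Edge],
) -> List[str]:
    """Merge both traversals and return entities ordered from source to consumer."""
    predecessors: Dict[str, Set[str]] = {}
    for src, dst in list(upstream_edges) + list(downstream_edges):
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    # static_order() yields each node after all of its predecessors and
    # raises graphlib.CycleError if the merged edges contain a cycle.
    return list(TopologicalSorter(predecessors).static_order())
```

Duplicate edges from the two traversals collapse naturally because predecessors are stored in sets.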
Cross-Platform Tracing
Goal: Understand how data moves between systems.
Strategy:
- Trace lineage in both directions
- Group entities by platform
- Identify cross-platform edges (e.g., PostgreSQL → Snowflake via dbt)
- Highlight the integration points
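Finding the integration points amounts to filtering edges whose two ends sit on different platforms. A minimal sketch, assuming edges as (upstream, downstream) pairs and a platform_of mapping from entity name to platform; both the shape and the name cross_platform_edges are illustrative.

```python
from typing import Iterable, List, Mapping, Tuple


def cross_platform_edges(
    edges: Iterable[Tuple[str, str]],
    platform_of: Mapping[str, str],
) -> List[Tuple[str, str, str, str]]:
    """Return (src, dst, src_platform, dst_platform) for edges that change platform."""
    result = []
    for src, dst in edges:
        src_platform, dst_platform = platform_of[src], platform_of[dst]
        if src_platform != dst_platform:
            result.append((src, dst, src_platform, dst_platform))
    return result
```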
Path Finding
Goal: Determine if and how entity A connects to entity B.
Strategy:
- Start BFS from entity A downstream
- At each hop, check if entity B appears
- If found, return the path
- Max depth: 5 hops (ask user before going deeper)
Lineage Edge Types
| Type | Meaning |
|---|---|
| TRANSFORMED | Data was transformed (e.g., SQL query, dbt model) |
| VIEW | Entity is a view over the source |
| COPY | Data was copied without transformation |
Platform-Specific Lineage Notes
| Platform | Lineage Source | Notes |
|---|---|---|
| dbt | dbt manifest | Model-level lineage, often the richest |
| Airflow | Task dependencies | Job-level lineage |
| Snowflake | Query logs | Column-level lineage possible |
| BigQuery | Audit logs | Table-level lineage |
| Looker | LookML explores | Dashboard → dataset lineage |
| Tableau | Workbook metadata | Dashboard → dataset lineage |
Choosing the Right Command
| Need | Command | Why |
|---|---|---|
| Unfiltered upstream/downstream | datahub lineage | Simple, returns names and platforms |
| Column-level lineage | datahub lineage --column <field> | Only command that supports column tracing |
| Filter by type, platform, tags | searchAcrossLineage via datahub graphql | Server-side filtering avoids fetching full graph |
| Time-windowed lineage | searchAcrossLineage with lineageFlags | Only way to scope by edge update time |
| Large result sets (300+) | scrollAcrossLineage via datahub graphql | Cursor-based pagination for large graphs |
Lineage Limitations
- Use datahub lineage for both upstream and downstream traversal. Supports --hops, --column, and --format json with metadata hints.
- Use searchAcrossLineage when filtering is needed. datahub lineage has no filter support; use the GraphQL query via datahub graphql to filter by entity type, platform, tags, domain, or time window.
- Depth: Deep lineage graphs (5+ hops) can be very large. Always cap and ask.
- Staleness: Lineage reflects the last ingestion. It may not reflect recent pipeline changes.
- Column-level: Not all sources provide column-level lineage. Note when unavailable.
Impact Analysis
Target Entity
Name: <!-- entity name -->
URN: <!-- urn -->
Platform: <!-- platform -->
Type: <!-- dataset / dashboard / etc. -->
Impact Summary
Direct dependents (1 hop): <!-- count -->
Transitive dependents (all hops): <!-- count -->
Depth traced: <!-- hops -->
Affected Entities
By Type
| Type | Count | Entities |
|---|---|---|
| Datasets | <!-- n --> | <!-- list --> |
| Dashboards | <!-- n --> | <!-- list --> |
| Data Jobs | <!-- n --> | <!-- list --> |
| Charts | <!-- n --> | <!-- list --> |
By Platform
| Platform | Count |
|---|---|
| <!-- platform --> | <!-- n --> |
Critical Paths
<!-- Entities with single upstream dependency on target -->
| Entity | Type | Risk |
|---|---|---|
| <!-- name --> | <!-- type --> | Single dependency — no alternative source |
Lineage Graph
<!-- ASCII flow diagram -->
Affected Owners
| Owner | Entities Affected |
|---|---|
| <!-- owner --> | <!-- count and list --> |
Recommendations
- <!-- Notification actions -->
- <!-- Migration/update suggestions -->
Lineage Map
Target Entity
Name: <!-- entity name -->
URN: <!-- urn -->
Flow Diagram
<!-- ASCII lineage diagram -->
Upstream (Sources)
| Hop | Entity | Type | Platform | Relationship |
|---|---|---|---|---|
| 1 | <!-- name --> | <!-- type --> | <!-- platform --> | <!-- TRANSFORMED/VIEW/COPY --> |
Downstream (Consumers)
| Hop | Entity | Type | Platform | Relationship |
|---|---|---|---|---|
| 1 | <!-- name --> | <!-- type --> | <!-- platform --> | <!-- type --> |
Cross-Platform Boundaries
| From | To | Edge |
|---|---|---|
| <!-- platform A --> | <!-- platform B --> | <!-- entity A → entity B --> |
DataHub CLI Reference
Commands verified against DataHub CLI v1.4.0. Install via pip install acryl-datahub.
Tool Detection
Before running any DataHub commands, determine which tools are available:
- MCP tools available: If tools like datahub_search, datahub_get_entity, datahub_get_lineage are in your tool list, use them directly. They are the preferred path; no CLI installation is needed.
- CLI available: If you have a Bash tool, check: which datahub. If found, use the CLI commands documented below.
- Neither: Suggest the user set up a DataHub connection using /datahub-setup.
MCP takes priority over CLI when both are available — MCP tools are purpose-built for agent use with structured inputs/outputs and no shell overhead.
CLI ↔ MCP Equivalents
| Operation | CLI Command | MCP Tool |
|---|---|---|
| Search | datahub search "query" --where "..." | search(query="...", filter="...") |
| Get entity | datahub get --urn "..." --aspect ownership | get_entities(urns=["..."]) |
| Upstream lineage | datahub lineage --urn "..." --direction upstream | get_lineage(urn="...", upstream=true) |
| Downstream lineage | datahub lineage --urn "..." --direction downstream | get_lineage(urn="...", upstream=false) |
| GraphQL | datahub graphql --query '...' | execute_graphql(query="...") |
| Server config | datahub check server-config | Not needed (MCP server handles config) |
MCP tool names may be prefixed (e.g. mcp__datahub-cloud__search). Match by the function name suffix, not the full prefixed name. MCP tools are self-documenting — check their schemas for parameter details rather than relying on static documentation.
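Matching by suffix might look like the sketch below. It assumes prefixes are joined with a double underscore, as in the `mcp__datahub-cloud__search` example; `find_mcp_tool` is a hypothetical helper name.

```python
def find_mcp_tool(tool_names, function_name):
    """Return the first tool whose last '__'-separated segment equals function_name, else None."""
    for name in tool_names:
        # rsplit keeps unprefixed names intact: "search".rsplit("__", 1) == ["search"]
        if name.rsplit("__", 1)[-1] == function_name:
            return name
    return None


tools = ["mcp__datahub-cloud__search", "mcp__datahub-cloud__get_lineage"]
lineage_tool = find_mcp_tool(tools, "get_lineage")
```

Comparing the whole final segment, rather than using `endswith`, avoids false matches such as a tool named `research` when looking for `search`.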
The rest of this document covers the CLI path.
Authentication
The CLI reads connection settings from ~/.datahubenv:
gms:
  server: "http://localhost:8080"
  token: "<personal-access-token>"
Or via environment variables:
export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_GMS_TOKEN="<token>"
Version Check
Before running commands, check the installed CLI version:
datahub version
If a skill requires a minimum version and the installed version is older, upgrade:
pip install --upgrade acryl-datahub --pre
The --pre flag ensures pre-release versions (e.g. 1.5.0rc1) are included, which may be required for newer features.
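The minimum-version check can be sketched as follows. This assumes the version string has already been extracted from the `datahub version` output (its exact format is not shown here) and that pre-releases use an `rcN` suffix as in `1.5.0rc1`. `parse_version` and `needs_upgrade` are hypothetical helpers; a production script would more likely rely on the `packaging` library.

```python
import re


def parse_version(text):
    """Parse '1.4.0' or '1.5.0rc1' into a sortable key; pre-releases sort before the final release."""
    m = re.fullmatch(r"(\d+(?:\.\d+)*)(?:rc(\d+))?", text.strip())
    if m is None:
        raise ValueError(f"unrecognized version: {text!r}")
    release = [int(part) for part in m.group(1).split(".")]
    release += [0] * (4 - len(release))  # pad so 1.4.0 and 1.4.0.0 compare equal
    # A final release gets +inf in the rc slot so that 1.5.0rc1 < 1.5.0.
    rc = int(m.group(2)) if m.group(2) else float("inf")
    return (tuple(release), rc)


def needs_upgrade(installed, minimum):
    """True when the installed version is older than the required minimum."""
    return parse_version(installed) < parse_version(minimum)
```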
Server Detection
Detect whether you're connected to DataHub Cloud or OSS:
datahub check server-config
- serverEnv: 'cloud' → DataHub Cloud (supports popularity sorting, dataset features)
- serverEnv: 'core' or other → OSS / self-hosted (feature fields not available)
Cache this result for the session — don't re-check on every command. Some features marked (Cloud only) below require serverEnv: cloud.
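Session-level caching of the server type could be sketched like this. The `ServerEnv` class and its `fetch` callable are hypothetical: `fetch` stands in for whatever code runs `datahub check server-config` and extracts the `serverEnv` value, which is not shown because the output format is not specified here.

```python
class ServerEnv:
    """Caches the serverEnv value so the server is queried at most once per session."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable returning the serverEnv string, e.g. "cloud" or "core"
        self._value = None

    def is_cloud(self):
        if self._value is None:  # first call only
            self._value = self._fetch()
        return self._value == "cloud"
```

Commands marked (Cloud only) can then be guarded with `if env.is_cloud():` without triggering another server round trip.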
Context
Pass context on CLI commands using -C key=value so commands can be correlated:
datahub -C skill=datahub-audit search "revenue"
datahub -C skill=datahub-audit -C caller=claude-code get --urn "..."
The -C flag goes on the root datahub command (before the subcommand). Use the skill's own name from its YAML frontmatter as the skill value. If the flag is not recognized, omit it — the command works the same without it.
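When commands are launched from a script, the flag placement can be enforced by building the argument list in one place. The `with_context` helper below is a hypothetical sketch; it uses only the `-C key=value` form described above.

```python
def with_context(subcommand_args, context=None):
    """Build an argv list with -C pairs placed on the root command, before the subcommand."""
    argv = ["datahub"]
    for key, value in (context or {}).items():
        argv += ["-C", f"{key}={value}"]
    return argv + list(subcommand_args)


cmd = with_context(["search", "revenue"], {"skill": "datahub-lineage", "caller": "claude-code"})
# If the installed CLI rejects -C, rebuild without context: with_context(["search", "revenue"])
```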
Search & Discovery
The search CLI uses a positional query argument — not --query.
# Basic keyword search
datahub search "revenue"
# Search with limit
datahub search "customers" --limit 20
# Filter by platform (simple filter)
datahub search "*" --filter platform=snowflake
# Filter by entity type
datahub search "*" --where "entity_type = dataset"
# SQL-like WHERE expressions (recommended for agents)
datahub search "*" --where "platform = snowflake AND env = PROD"
datahub search "*" --where "platform IN (snowflake, bigquery)"
datahub search "*" --where "entity_type = dataset AND platform = snowflake"
# Multiple simple filters (AND between fields, comma = OR within field)
datahub search "*" --filter platform=snowflake --filter env=PROD
datahub search "*" --filter platform=snowflake,bigquery
# Output formats
datahub search "revenue" --table # Human-readable table
datahub search "revenue" --urns-only # URNs only, one per line
datahub search "revenue" --format json # JSON (default)
# Pagination (max 50 per page)
datahub search "customers" --limit 50 --offset 0 # page 1
datahub search "customers" --limit 50 --offset 50 # page 2
# Facets only (counts by type/platform/etc.)
datahub search "*" --facets-only --format json
# Dry run (preview query without executing)
datahub search "revenue" --where "platform = snowflake" --dry-run
# Projection (limit returned fields — reduces token cost)
datahub search "customers" --projection "urn type"
# Column-level search (find datasets containing a specific field)
datahub search "*" --where "entity_type = dataset AND fieldPaths = customer_id"
# Sorting
datahub search "*" --sort-by lastModifiedAt --sort-order desc --limit 10
datahub search "*" --sort-by _entityName --sort-order asc --limit 10
# Popularity / usage sorting (Cloud only — check serverEnv first)
# Most queried datasets
datahub search "*" --where "entity_type = dataset" \
--sort-by queryCountLast30DaysFeature --sort-order desc --limit 10 \
--projection "urn type ... on Dataset { properties { name } platform { name } statsSummary { queryCountLast30Days uniqueUserCountLast30Days } }"
# Most updated datasets
datahub search "*" --where "entity_type = dataset" --sort-by writeCountLast30DaysFeature --sort-order desc --limit 10
# Largest tables (by row count or bytes)
datahub search "*" --where "entity_type = dataset" --sort-by rowCountFeature --sort-order desc --limit 10
datahub search "*" --where "entity_type = dataset" --sort-by sizeInBytesFeature --sort-order desc --limit 10
# Existence filters (IS NULL / IS NOT NULL)
datahub search "*" --where "entity_type = dataset AND description IS NULL AND editableDescription IS NULL"
datahub search "*" --where "entity_type = dataset AND glossary_term IS NOT NULL"
# Sibling-aware description audit (single query, no N+1 fetches)
# Step 1: Find datasets missing both ingestion and user-edited descriptions
# Step 2: Project siblings with their descriptions to compute effective coverage
datahub search "*" \
--where "entity_type = dataset AND platform = snowflake AND description IS NULL AND editableDescription IS NULL" \
--projection "urn type ... on Dataset { siblings { isPrimary siblings { urn ... on Dataset { properties { name description } editableProperties { description } } } } }" \
--format json --limit 50
# URN resolution for filters
# Tag, domain, and glossary_term filters require full URNs — not display names.
# Always resolve the name to a URN first, then use the URN in the filter.
# Step 1: Find tag URN by name
datahub search "large table" --where "entity_type = tag" --urns-only --limit 1
# → urn:li:tag:sample_data___default_large_table
# Step 2: Use the URN in a filter
datahub search "*" --where "entity_type = dataset AND tags = 'urn:li:tag:sample_data___default_large_table'"
# Same pattern for domains:
datahub search "ecommerce" --where "entity_type = domain" --urns-only --limit 1
# → urn:li:domain:91994180-...
datahub search "*" --where "entity_type = dataset AND domain = 'urn:li:domain:91994180-...'"
# And glossary terms:
datahub search "PII" --where "entity_type = glossaryTerm" --urns-only --limit 1
datahub search "*" --where "entity_type = dataset AND glossary_term = 'urn:li:glossaryTerm:...'"
# Discover available filters
datahub search list-filters
datahub search describe-filter platform
# Agent best practices
datahub search --agent-context
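Because each page is capped at 50 results, collecting a larger result set means issuing several commands with increasing offsets. A minimal sketch of that loop, assuming the caller knows how many results it wants (`total` and `search_pages` are hypothetical names):

```python
def search_pages(query, total, page_size=50):
    """Yield one argv list per page until `total` results are covered."""
    if not 1 <= page_size <= 50:
        raise ValueError("page_size must be between 1 and 50")
    for offset in range(0, total, page_size):
        limit = min(page_size, total - offset)  # the last page may be shorter
        yield ["datahub", "search", query, "--limit", str(limit), "--offset", str(offset)]


pages = list(search_pages("customers", total=120))
```

For 120 results this produces three commands with offsets 0, 50, and 100, the last one requesting only 20 rows.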
Entity Retrieval
# Get full entity metadata
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,table_name,PROD)"
# Get specific aspect
datahub get --urn "<URN>" --aspect schemaMetadata
datahub get --urn "<URN>" --aspect ownership
datahub get --urn "<URN>" --aspect globalTags
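Dataset URNs follow the pattern visible in the first command above: platform, name, and environment wrapped in parentheses. A small helper can assemble them and avoid hand-typed mistakes. `dataset_urn` and `get_aspect_argv` are hypothetical names, and passing the arguments as a list (rather than a shell string) is a design choice that sidesteps quoting of the parentheses.

```python
def dataset_urn(platform, name, env="PROD"):
    """Assemble a dataset URN in the form used by the commands above."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def get_aspect_argv(urn, aspect=None):
    """Build the argv for `datahub get`, optionally restricted to one aspect."""
    argv = ["datahub", "get", "--urn", urn]
    if aspect is not None:
        argv += ["--aspect", aspect]
    return argv


urn = dataset_urn("hive", "table_name")
ownership_cmd = get_aspect_argv(urn, "ownership")
```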
Lineage
# Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream
# Downstream dependents
datahub lineage --urn "<URN>" --direction downstream
# Limit to immediate neighbors
datahub lineage --urn "<URN>" --direction upstream --hops 1
# Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream
# JSON output (includes metadata with capped/hint info)
datahub lineage --urn "<URN>" --direction downstream --format json
# Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"
# Agent best practices
datahub lineage --agent-context
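The options above can be combined behind one function that also enforces the datasets-only rule for column-level lineage. This is a sketch under assumptions: `lineage_argv` is a hypothetical helper, and combining `--hops`, `--column`, and `--format json` in a single call is assumed to work since each is only shown separately above.

```python
def lineage_argv(urn, direction, hops=None, column=None, as_json=False):
    """Build the argv for `datahub lineage` from the flags documented above."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    if column is not None and not urn.startswith("urn:li:dataset:"):
        raise ValueError("column-level lineage is available for datasets only")
    argv = ["datahub", "lineage", "--urn", urn, "--direction", direction]
    if hops is not None:
        argv += ["--hops", str(hops)]
    if column is not None:
        argv += ["--column", column]
    if as_json:
        argv += ["--format", "json"]
    return argv


# Illustrative URN for the analytics.orders rename scenario.
orders = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)"
impact_cmd = lineage_argv(orders, "downstream", column="discount_code", as_json=True)
```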
Timeline (Change History)
# Schema changes
datahub timeline --urn "<URN>" --category technical_schema
# Ownership changes
datahub timeline --urn "<URN>" --category owner
# Tag changes
datahub timeline --urn "<URN>" --category tag
# With time range
datahub timeline --urn "<URN>" --category technical_schema --start 7daysago
Categories: tag, glossary_term, technical_schema, documentation, owner
Write Operations (via GraphQL Mutations)
Write operations use datahub graphql --query 'mutation { ... }'. The CLI does not have dedicated tag, glossary, or inline put commands for these operations.
Important rules for GraphQL mutations:
- Return field subselections required. Mutations returning objects (not scalars like Boolean) need { urn } or similar after the mutation. Without it: SubselectionRequired error.
- Long queries must use temp files. Long inline --query strings get misinterpreted as file paths on macOS (File name too long). Write to a .graphql file and pass the path: datahub graphql --query /tmp/my-mutation.graphql --format json.
- Short mutations can be inline. Simple mutations like addTag, removeTag, addOwner are short enough to pass inline.
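The inline-versus-file rule can be automated when mutations are generated by a script. The sketch below is illustrative only: `graphql_argv` is a hypothetical helper, and the 200-character cutoff is an assumed value, not a documented DataHub limit.

```python
import os
import tempfile

INLINE_LIMIT = 200  # assumed cutoff in characters; tune for your environment


def graphql_argv(query, inline_limit=INLINE_LIMIT):
    """Return (argv, temp_path); temp_path is None when the query is passed inline."""
    if len(query) <= inline_limit:
        return ["datahub", "graphql", "--query", query, "--format", "json"], None
    # Long query: write it to a .graphql file and pass the path instead.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".graphql", delete=False) as f:
        f.write(query)
        path = f.name
    return ["datahub", "graphql", "--query", path, "--format", "json"], path


argv, temp_path = graphql_argv("{ me { corpUser { urn } } }")
```

The caller is responsible for removing `temp_path` with `os.unlink` once the command has finished.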
Tags
# Create a tag
# With id: name-based URN (human-readable, but ID is immutable — can't rename later)
# Without id: GUID-based URN (opaque, but display name can change freely)
# When unsure, ask the user which they prefer.
datahub graphql --query 'mutation {
createTag(input: { id: "pii", name: "PII", description: "Contains PII data" })
}' --format json
# → returns urn:li:tag:pii
# Add tag to entity (tag must exist first)
datahub graphql --query 'mutation {
addTag(input: { tagUrn: "urn:li:tag:<TAG_URN>", resourceUrn: "<ENTITY_URN>" })
}' --format json
# Add tag to a specific field
datahub graphql --query 'mutation {
addTag(input: {
tagUrn: "urn:li:tag:<TAG_URN>",
resourceUrn: "<ENTITY_URN>",
subResourceType: DATASET_FIELD,
subResource: "<FIELD_PATH>"
})
}' --format json
# Remove tag
datahub graphql --query 'mutation {
removeTag(input: { tagUrn: "urn:li:tag:<TAG_URN>", resourceUrn: "<ENTITY_URN>" })
}' --format json
# Batch add tags
datahub graphql --query 'mutation {
batchAddTags(input: {
tagUrns: ["urn:li:tag:<TAG1>", "urn:li:tag:<TAG2>"],
resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
})
}' --format json
Glossary Terms
# Add term to entity
datahub graphql --query 'mutation {
addTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json
# Remove term
datahub graphql --query 'mutation {
removeTerm(input: { termUrn: "urn:li:glossaryTerm:<TERM>", resourceUrn: "<ENTITY_URN>" })
}' --format json
Ownership
# Add owner (appends — does not replace existing owners)
datahub graphql --query 'mutation {
addOwner(input: {
ownerUrn: "urn:li:corpuser:<USER>",
resourceUrn: "<ENTITY_URN>",
ownerEntityType: CORP_USER,
type: TECHNICAL_OWNER
})
}' --format json
# Remove owner
datahub graphql --query 'mutation {
removeOwner(input: { ownerUrn: "urn:li:corpuser:<USER>", resourceUrn: "<ENTITY_URN>" })
}' --format json
# Batch add owners
datahub graphql --query 'mutation {
batchAddOwners(input: {
owners: [{ ownerUrn: "urn:li:corpuser:<USER>", ownerEntityType: CORP_USER }],
resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
})
}' --format json
Owner types: TECHNICAL_OWNER, BUSINESS_OWNER, DATA_STEWARD, NONE
Deprecation
# Deprecate
datahub graphql --query 'mutation {
updateDeprecation(input: { urn: "<URN>", deprecated: true, note: "Replaced by new_table" })
}' --format json
# Un-deprecate
datahub graphql --query 'mutation {
updateDeprecation(input: { urn: "<URN>", deprecated: false })
}' --format json
Domains
# Create domain
datahub graphql --query 'mutation {
createDomain(input: { name: "Marketing", description: "Marketing data" })
}' --format json
# Assign entity to domain (domain must exist)
datahub graphql --query 'mutation {
setDomain(entityUrn: "<ENTITY_URN>", domainUrn: "urn:li:domain:<DOMAIN_ID>")
}' --format json
# Remove from domain
datahub graphql --query 'mutation {
unsetDomain(entityUrn: "<ENTITY_URN>")
}' --format json
# Batch assign
datahub graphql --query 'mutation {
batchSetDomain(input: {
domainUrn: "urn:li:domain:<ID>",
resources: [{ resourceUrn: "<URN1>" }, { resourceUrn: "<URN2>" }]
})
}' --format json
Description
datahub graphql --query 'mutation {
updateDescription(input: {
description: "New description text",
resourceUrn: "<ENTITY_URN>"
})
}' --format json
Data Products
Note: domainUrn is required — every data product must belong to a domain. Use datahub graphql --describe createDataProduct --recurse to verify the schema.
# Create (domainUrn is REQUIRED)
datahub graphql --query 'mutation {
createDataProduct(input: {
domainUrn: "urn:li:domain:<DOMAIN_ID>",
properties: { name: "Revenue Analytics", description: "Revenue pipeline" }
}) { urn }
}' --format json
# Add assets to data product
datahub graphql --query 'mutation {
batchSetDataProduct(input: {
dataProductUrn: "urn:li:dataProduct:<ID>",
resourceUrns: ["<URN1>", "<URN2>"]
})
}' --format json
Verification & Health
# Check CLI version
datahub version
# Verify connectivity (this entity always exists)
datahub get --urn "urn:li:corpuser:datahub"
# Test search (confirms search index works)
datahub search "*" --limit 1
# Server configuration
datahub check server-config
Note: datahub check server-health does not exist. Use datahub get --urn "urn:li:corpuser:datahub" to verify connectivity.
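The checks above can be run in sequence and stopped at the first failure. In this sketch `first_failure` and the `run` callable are hypothetical; `run` takes an argv list and returns an exit code, which keeps the ordering logic separate from process execution.

```python
CHECKS = [
    ("cli", ["datahub", "version"]),
    ("connectivity", ["datahub", "get", "--urn", "urn:li:corpuser:datahub"]),
    ("search", ["datahub", "search", "*", "--limit", "1"]),
    ("server-config", ["datahub", "check", "server-config"]),
]


def first_failure(run):
    """Run each check in order; return the name of the first one that fails, or None."""
    for name, argv in CHECKS:
        if run(argv) != 0:
            return name
    return None
```

A real runner could be `lambda argv: subprocess.run(argv, capture_output=True).returncode`; note that `subprocess.run` raises `FileNotFoundError` when `datahub` is not installed, so that case needs its own handling.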
GraphQL Discovery
# List all available operations
datahub graphql --list-operations --format json
# List mutations only
datahub graphql --list-mutations --format json
# Describe a specific operation
datahub graphql --describe addTag --format json
# Describe with full type expansion
datahub graphql --describe addTag --recurse --format json
# Dry run (preview without executing)
datahub graphql --query '{ me { corpUser { urn } } }' --dry-run
# Agent best practices
datahub graphql --agent-context
Batch Mutation Pattern (Python)
Shell loops with dataset URNs are fragile due to quoting issues with parentheses. For multi-entity mutations, use a Python script with temp files:
import subprocess, json, tempfile, os

def run_graphql_mutation(query, variables):
    """Run a GraphQL mutation with variables via temp file. Returns parsed JSON or None."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
        json.dump(variables, f)
        vf = f.name
    try:
        result = subprocess.run(
            ["datahub", "graphql", "-q", query, "-v", vf, "--format", "json", "--no-pretty"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            print(f"ERROR: {result.stderr.strip()[:120]}")
            return None
    finally:
        os.unlink(vf)

# Example: batch update descriptions
query = "mutation updateDataset($urn: String!, $input: DatasetUpdateInput!) { updateDataset(urn: $urn, input: $input) { urn } }"
datasets = {
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)": "Description for table1",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)": "Description for table2",
}
for urn, desc in datasets.items():
    variables = {"urn": urn, "input": {"editableProperties": {"description": desc}}}
    result = run_graphql_mutation(query, variables)
    status = "OK" if result else "FAIL"
    print(f" {urn.split(',')[1]}: {status}")
Output Processing
# Pipe search URNs to get for batch retrieval
datahub search "customers" --urns-only | xargs -I{} datahub get --urn {}
# Extract field names from schema
datahub get --urn "<URN>" --aspect schemaMetadata | python3 -c "
import sys, json
data = json.load(sys.stdin)
for f in data.get('schemaMetadata', {}).get('fields', []):
    print(f['fieldPath'])
"