Available in: English 한국어 Português Türkçe

AI SkillQuery data lakeOperations

Ask business questions against a data lake with Athena-style safety and cost checks. — Claude Skill

Name: Querying Data Lake
Author: AWS

A Claude Skill for Claude Code by AWS✓ — run /querying-data-lake in Claude·Updated Jun 14, 2026·vmain@7cd875e

Compatible withChatGPT

ClaudeClaude CodeCodex / Codex CLI

Cursor

Gemini

Helps teams choose the right data source, workgroup, and SQL pattern, then query large data lake tables with cost awareness, result interpretation, and safe limits.

Turns a business question into a safe data lake query plan before scanning large tables.
Chooses appropriate workgroup, tables, partitions, limits, and query pattern.
Explains data scanned, cost risk, and whether the result is reliable enough to use.
Separates read-only analysis from destructive or unsafe actions.

YouToday

A team asks a data engineer for a number and gets an expensive or unclear one-off query.

With /querying-data-lake

Run /querying-data-lake to make the question, source, SQL pattern, cost, and caveats explicit before using the answer.

1 Clarify the metric2 Choose source and workgroup3 Limit scanned data4 Explain results and caveats

Who this is for

Data Engineer

Make ad hoc data lake questions safer, cheaper, and easier to review.

See skills for this role

Analytics Engineer

Turn stakeholder questions into query plans with metric definitions and caveats.

See skills for this role

What it does

Ad hoc metric answer

Answer a business question from lake data without creating a permanent dashboard first.

Cost-aware exploration

Explore large tables with partition and workgroup checks before scanning too much data.

Data source validation

Confirm whether data lake tables can support a metric or competitive benchmark.

How it works

Clarify the business question, date range, metric, filters, and output needed.

Identify the likely catalog, database, table, partition, and workgroup.

Create a query plan that limits scanned data and avoids unsafe operations.

Run or draft SQL with preview limits and validate result reasonableness.

Explain the answer, caveats, and whether the query should become a repeatable report.

Input options

Business question

The decision, metric, segment, and date range the query must answer.

Example

What the user pastes

Business question: did activation improve after the new onboarding checklist?
Date range: May 1 to June 14.
Segments: SMB, mid-market, enterprise.
Activation definition: workspace has invited at least two teammates and completed first project within 7 days.
Known tables:
- product_events partitioned by event_date
- accounts partitioned by created_month
- workspaces
Need: query plan, safe SQL outline, result table, and caveats before sharing with Product and Finance.

Useful result

Query plan

Use read-only Athena SQL. Limit product_events to event_date between May 1 and June 14. Join accounts only on workspace_id and created_month. Start with a 100-row preview before the aggregate query. Use the analytics workgroup if available because this is a business analysis query.

SQL outline

1. Find workspaces created in the date range.
2. Count invited_teammate and first_project_completed events within 7 days of workspace creation.
3. Mark activated when both conditions are true.
4. Group by account segment and pre/post checklist launch date.
5. Return workspace count, activated count, activation rate, and confidence caveat.

Result format

| Segment | Period | Workspaces | Activated | Activation rate | Readout |
|---|---|---:|---:|---:|---|
| SMB | Before | 420 | 151 | 36% | Baseline |
| SMB | After | 390 | 171 | 44% | Improvement visible |
| Mid-market | Before | 180 | 76 | 42% | Baseline |
| Mid-market | After | 165 | 84 | 51% | Stronger movement |
| Enterprise | Before | 52 | 19 | 37% | Small sample |
| Enterprise | After | 48 | 20 | 42% | Treat cautiously |

Caveats

This does not prove causality. Check whether acquisition source changed, whether event instrumentation was stable, and whether enterprise sample size is too small for a confident claim.

Metrics this improves

Query Performance

Reduces unnecessary full-table scans and inefficient exploratory SQL.

Operations

Warehouse Cost

Keeps data scanned and cost risk visible before queries run.

Operations

Metric Trust

Makes definitions, sources, and caveats explicit.

Operations

Works with

Google Sheets

manual

Share result tables and analysis caveats with stakeholders.

Slack

manual

Coordinate data owner review and stakeholder readouts.

SQL

manual

Draft or review SQL patterns for data lake analysis.

Want to use Querying Data Lake?

Choose how to get started.

Run in Claude Code

Free. Open source.

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

Run it

Start Claude Code, then type the command:

then

View source on GitHub

Use on ElasticFlow

Team and collaboration features

Run skills from your browser. Share results, manage access, collaborate with your team. No terminal needed.

Free 14-day trial. Cancel anytime.

View on GitHub

Query Data Lake

Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.

Overview

Executes and manages Athena SQL queries across default and federated catalogs. Selects a workgroup, resolves target assets (delegating fuzzy references to finding-data-lake-assets), classifies statements for safety, and reports cost and data scanned. Use the AWS MCP server for sandboxed execution and audit logging; the same AWS CLI commands work directly when the MCP server is not available.

Constraints for parameter acquisition:

You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or profile TABLE_NAME
You MUST accept the argument as direct text or a pointer to a file containing SQL
You MUST ask the user for the target AWS region if not already set
You MUST confirm the output S3 location before executing any non-trivial query
You MUST respect the user's decision to abort at any step

Common Tasks

1. Verify Dependencies

Check for required tools and AWS access before running queries.

Constraints:

You MUST verify AWS MCP server tools are available (aws___call_aws) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or aws athena CLI so output location and cost are tracked
You MUST confirm credentials with aws sts get-caller-identity and inform the user about any missing tools

2. Resolve Workgroup

Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).

Constraints:

You MUST select a workgroup before submitting any query (prevents output-location errors)
You MUST present the selected workgroup and its output location to the user
You MUST NOT auto-escalate to a different workgroup on failure without user confirmation

3. Resolve the Target Asset

If the user refers to a table by name, by business concept ("our quarterly report", "the sales data"), by S3 path, or by catalog without specifying the table, delegate to finding-data-lake-assets to return the concrete database.table (and catalog if non-default).

Constraints:

You MUST NOT attempt to resolve fuzzy asset references with athena list-data-catalogs or by iterating get-tables — those miss federated catalogs and waste tokens
You SHOULD skip this step only when the user provides a fully-qualified reference (exact database.table) or raw SQL they want executed as-is
You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or finding-data-lake-assets returns a different catalog

4. Discover Schema

For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (SELECT ... LIMIT 5) as part of profiling.

5. Build Query

Table addressing depends on catalog type:

Default Glue catalog: database.table (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with "awsdatacatalog".database.table.
Registered data source: datasource.database.table
Unregistered Glue catalog: "catalog/subcatalog".database.table

6. Classify and Execute

Classify the SQL statement before executing:

Statement	Behavior
`SELECT`, `SHOW`, `DESCRIBE`, `EXPLAIN`	Safe — execute
`INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `CREATE`, `TRUNCATE`, `MERGE`	Destructive — warn the user and require explicit confirmation
Unsure	Treat as destructive; confirm

Example tool call (via AWS MCP server):

aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")

For federated or S3 Tables catalogs, also set Catalog=<CATALOG_PATH> in the execution context (e.g. Catalog=s3tablescatalog/<BUCKET_NAME>).

Constraints:

You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")
You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")
You MUST confirm the output S3 location before executing
You MUST explain which tool is being called before executing
You MUST respect the user's decision to abort

7. Present and Recover

Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.

Argument Routing

Resolve in this order; stop at the first match:

Contains SQL keywords (SELECT, SHOW, DESCRIBE, INSERT, etc.) — SQL text, execute directly
profile TABLE_NAME — run comprehensive table profiling (see query-patterns.md)
Matches a known named query — look up and execute
Matches a known workgroup — show workgroup status and recent queries
Matches a known catalog — delegate to exploring-data-catalog to enumerate databases and tables
No args — show recent query activity and available tables

Principles

Always select workgroup before executing (prevents output-location errors)
Profile unfamiliar tables before running analytical queries
Present cost alongside results so users build cost awareness
Suggest LIMIT for exploratory queries on large tables
Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)

Troubleshooting

Error	Cause	Fix
Redshift identifier error with mixed case	Redshift-federated names are lowercase only	Lowercase the identifier
`CatalogId` validation failure	ARN passed instead of catalog name	Pass the catalog name, not the ARN
Cross-catalog `information_schema` returns nothing	Missing catalog qualifier	Use catalog-qualified path: `"catalog".information_schema.tables`
Query fails with output-location error	Workgroup has no output location configured	Select a different workgroup with an output location, or configure one
Destructive statement executed without confirmation	Statement classification skipped	Always classify `INSERT`/`UPDATE`/`DELETE`/`DROP`/`ALTER`/`CREATE`/`TRUNCATE`/`MERGE` and confirm with the user

Additional Resources

Reference documents

name: querying-data-lake description: >- Execute and manage Athena SQL queries across default and federated catalogs (Glue, S3 Tables, Redshift). Triggers on phrases like: query data, run SQL, athena query, analyze table, SQL query, workgroup status, profile table, query Redshift catalog, query S3 Tables. Do NOT use for finding specific data assets (use finding-data-lake-assets), full catalog audits (use exploring-data-catalog), importing data (use ingesting-into-data-lake). version: 1 argument-hint: '[SQL-query|query-name|workgroup-name|catalog-name|''profile TABLE_NAME'']'

Query Data Lake

Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.

Overview

Constraints for parameter acquisition:

You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or profile TABLE_NAME
You MUST accept the argument as direct text or a pointer to a file containing SQL
You MUST ask the user for the target AWS region if not already set
You MUST confirm the output S3 location before executing any non-trivial query
You MUST respect the user's decision to abort at any step

Common Tasks

1. Verify Dependencies

Check for required tools and AWS access before running queries.

Constraints:

You MUST verify AWS MCP server tools are available (aws___call_aws) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or aws athena CLI so output location and cost are tracked
You MUST confirm credentials with aws sts get-caller-identity and inform the user about any missing tools

2. Resolve Workgroup

Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).

Constraints:

You MUST select a workgroup before submitting any query (prevents output-location errors)
You MUST present the selected workgroup and its output location to the user
You MUST NOT auto-escalate to a different workgroup on failure without user confirmation

3. Resolve the Target Asset

Constraints:

You MUST NOT attempt to resolve fuzzy asset references with athena list-data-catalogs or by iterating get-tables — those miss federated catalogs and waste tokens
You SHOULD skip this step only when the user provides a fully-qualified reference (exact database.table) or raw SQL they want executed as-is
You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or finding-data-lake-assets returns a different catalog

4. Discover Schema

For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (SELECT ... LIMIT 5) as part of profiling.

5. Build Query

Table addressing depends on catalog type:

Default Glue catalog: database.table (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with "awsdatacatalog".database.table.
Registered data source: datasource.database.table
Unregistered Glue catalog: "catalog/subcatalog".database.table

6. Classify and Execute

Classify the SQL statement before executing:

Statement	Behavior
`SELECT`, `SHOW`, `DESCRIBE`, `EXPLAIN`	Safe — execute
`INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `CREATE`, `TRUNCATE`, `MERGE`	Destructive — warn the user and require explicit confirmation
Unsure	Treat as destructive; confirm

Example tool call (via AWS MCP server):

aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")

For federated or S3 Tables catalogs, also set Catalog=<CATALOG_PATH> in the execution context (e.g. Catalog=s3tablescatalog/<BUCKET_NAME>).

Constraints:

You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")
You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")
You MUST confirm the output S3 location before executing
You MUST explain which tool is being called before executing
You MUST respect the user's decision to abort

7. Present and Recover

Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.

Argument Routing

Resolve in this order; stop at the first match:

Contains SQL keywords (SELECT, SHOW, DESCRIBE, INSERT, etc.) — SQL text, execute directly
profile TABLE_NAME — run comprehensive table profiling (see query-patterns.md)
Matches a known named query — look up and execute
Matches a known workgroup — show workgroup status and recent queries
Matches a known catalog — delegate to exploring-data-catalog to enumerate databases and tables
No args — show recent query activity and available tables

Principles

Always select workgroup before executing (prevents output-location errors)
Profile unfamiliar tables before running analytical queries
Present cost alongside results so users build cost awareness
Suggest LIMIT for exploratory queries on large tables
Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)

Troubleshooting

Error	Cause	Fix
Redshift identifier error with mixed case	Redshift-federated names are lowercase only	Lowercase the identifier
`CatalogId` validation failure	ARN passed instead of catalog name	Pass the catalog name, not the ARN
Cross-catalog `information_schema` returns nothing	Missing catalog qualifier	Use catalog-qualified path: `"catalog".information_schema.tables`
Query fails with output-location error	Workgroup has no output location configured	Select a different workgroup with an output location, or configure one
Destructive statement executed without confirmation	Statement classification skipped	Always classify `INSERT`/`UPDATE`/`DELETE`/`DROP`/`ALTER`/`CREATE`/`TRUNCATE`/`MERGE` and confirm with the user

Additional Resources

Common Query Patterns (Presto/Athena SQL)

Table Profiling

-- Schema discovery
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = '<database>' AND table_name = '<table>';

-- Quick row count and date range
SELECT COUNT(*) as total_rows,
       MIN(created_at) as earliest,
       MAX(created_at) as latest
FROM <table>;

-- Sample data (always do this before analytical queries)
SELECT * FROM <table> LIMIT 5;

-- Null analysis
SELECT
    '<column>' as field,
    COUNT(*) - COUNT(<column>) as null_count,
    ROUND((COUNT(*) - COUNT(<column>)) * 100.0 / COUNT(*), 2) as null_pct
FROM <table>;

Cohort Retention

WITH cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', first_activity_date) as cohort_month
    FROM users
),
activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', activity_date) as activity_month
    FROM user_activity
)
SELECT
    c.cohort_month,
    COUNT(DISTINCT c.user_id) as cohort_size,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = c.cohort_month THEN a.user_id
    END) as month_0,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = DATE_ADD('month', 1, c.cohort_month) THEN a.user_id
    END) as month_1,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = DATE_ADD('month', 3, c.cohort_month) THEN a.user_id
    END) as month_3,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = DATE_ADD('month', 6, c.cohort_month) THEN a.user_id
    END) as month_6
FROM cohorts c
LEFT JOIN activity a ON c.user_id = a.user_id
GROUP BY c.cohort_month
ORDER BY c.cohort_month;

Funnel Analysis

WITH funnel AS (
    SELECT
        user_id,
        MAX(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) as step_1_view,
        MAX(CASE WHEN event = 'signup_start' THEN 1 ELSE 0 END) as step_2_start,
        MAX(CASE WHEN event = 'signup_complete' THEN 1 ELSE 0 END) as step_3_complete,
        MAX(CASE WHEN event = 'first_purchase' THEN 1 ELSE 0 END) as step_4_purchase
    FROM events
    WHERE event_date >= DATE_ADD('day', -30, CURRENT_DATE)
    GROUP BY user_id
)
SELECT
    COUNT(*) as total_users,
    SUM(step_1_view) as viewed,
    SUM(step_2_start) as started_signup,
    SUM(step_3_complete) as completed_signup,
    SUM(step_4_purchase) as purchased,
    ROUND(100.0 * SUM(step_2_start) / NULLIF(SUM(step_1_view), 0), 1) as view_to_start_pct,
    ROUND(100.0 * SUM(step_3_complete) / NULLIF(SUM(step_2_start), 0), 1) as start_to_complete_pct,
    ROUND(100.0 * SUM(step_4_purchase) / NULLIF(SUM(step_3_complete), 0), 1) as complete_to_purchase_pct
FROM funnel;

Deduplication

-- Keep the most recent record per key (Presto/Athena syntax)
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY entity_id
            ORDER BY updated_at DESC
        ) as rn
    FROM source_table
)
SELECT * FROM ranked WHERE rn = 1;

Window Functions

-- Running total
SUM(revenue) OVER (ORDER BY event_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total

-- 7-day moving average
AVG(revenue) OVER (ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg_7d

-- Period-over-period comparison
LAG(value, 1) OVER (PARTITION BY entity ORDER BY event_date) as prev_value

-- Percent of total
revenue / SUM(revenue) OVER () as pct_of_total
revenue / SUM(revenue) OVER (PARTITION BY category) as pct_of_category

-- Ranking
ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rank_in_category

Period Comparison / Growth

When the user asks for "growth", "change", or "comparison" between periods, compute the delta — not raw totals.

WITH quarterly AS (
    SELECT
        category,
        QUARTER(order_date) as q,
        SUM(amount) as revenue
    FROM orders
    WHERE YEAR(order_date) = 2025
    GROUP BY category, QUARTER(order_date)
)
SELECT
    curr.category,
    prev.revenue as prev_period,
    curr.revenue as curr_period,
    ROUND((curr.revenue - prev.revenue) / prev.revenue * 100, 1) as growth_pct
FROM quarterly curr
JOIN quarterly prev ON curr.category = prev.category AND curr.q = prev.q + 1
ORDER BY growth_pct DESC;

Performance-Aware Patterns

-- Always filter on partition keys to reduce scan cost
SELECT region, COUNT(*)
FROM sales
WHERE year = '2024' AND month = '02'
GROUP BY region;

-- Use LIMIT for exploratory queries
SELECT * FROM large_table LIMIT 100;

-- Use approximate functions for large-scale cardinality
SELECT APPROX_DISTINCT(user_id) as approx_unique_users
FROM events;

Data Quality Checks

-- Distinct value counts per column
SELECT
    COUNT(DISTINCT col1) as col1_unique,
    COUNT(DISTINCT col2) as col2_unique
FROM <table>;

-- Detect unexpected values
SELECT column_name, COUNT(*) as cnt
FROM <table>
GROUP BY column_name
ORDER BY cnt DESC
LIMIT 20;

-- Check for join explosion
SELECT COUNT(*) as pre_join FROM table_a;
SELECT COUNT(*) as post_join FROM table_a a JOIN table_b b ON a.id = b.a_id;

Workgroup Selection

Always list workgroups first before executing any query.

Detect Execution Context

Before selecting a workgroup, determine the current IAM identity:

aws sts get-caller-identity --query Arn --output text

The ARN pattern reveals the execution context:

ARN Pattern	Context	Workgroup Strategy
`arn:aws:sts::*:assumed-role/AmazonDataZone-<project-id>-<suffix>/<session>`	SageMaker Unified Studio project role	Use the project-scoped workgroup (see below)
`arn:aws:sts::*:assumed-role/SageMakerUnifiedStudio-<project-id>-<suffix>/<session>`	SageMaker Unified Studio project role	Use the project-scoped workgroup (see below)
`arn:aws:sts:::assumed-role/AmazonSageMaker-ExecutionRole-`	SageMaker notebook/studio role	Prefer `sagemaker-studio-workgroup-*`
Anything else	Standard IAM user/role	Follow general priority order

SageMaker Project Role Selection

When running as a SageMaker project role (AmazonDataZone-* or SageMakerUnifiedStudio-*):

List all workgroups the role can access:

aws athena list-work-groups --query 'WorkGroups[].Name' --output json

Extract the project ID from the role ARN. Split the role name on -. The first segment is the prefix (e.g., AmazonDataZone), the second segment is the project ID (e.g., abc123def), and subsequent segments form the suffix (e.g., DataLakeAccess). Take the second segment. The project ID is an alphanumeric string (no hyphens). Known suffixes that follow the project ID: DataLakeAccess, SparkAccess, QueryAccess, IngestionAccess. Example:
```
arn:aws:sts::123456789012:assumed-role/AmazonDataZone-abc123def-DataLakeAccess/session
                                                      ^^^^^^^^^
                                                      project ID = abc123def
```
Match the workgroup to the project. Project workgroups follow the pattern sagemaker-studio-workgroup-<project-id> or contain the project ID.
If exactly one sagemaker-studio-workgroup-* exists, verify its suffix contains the project ID extracted in step 2. If it matches, use it. If it does not match, fall through to step 6.
If multiple exist, pick the one whose suffix matches the project ID extracted from the role ARN. Optionally check environment variables SAGEMAKER_PROJECT_ID or SAGEMAKER_PROJECT_NAME if the ARN extraction is ambiguous.
If no sagemaker-studio-workgroup-* exists, do not fall back to other workgroups. Inform the user that no project-scoped workgroup was found and ask them to verify their project configuration or IAM permissions.

Project roles typically have IAM permissions scoped to their own workgroup. Attempting to use primary or another project's workgroup will fail with AccessDeniedException. Do not retry with primary in this context.

General Priority Order (Non-Project Roles)

sagemaker-studio-workgroup-* workgroups -- most reliable, always have output locations configured
Workgroups with explicitly configured output locations
primary workgroup (use with caution, may lack output location)

Error Recovery

Error	Context	Action
No output location	Any	Retry with the next workgroup in priority order
AccessDeniedException on workgroup	Project role	Do not retry with other workgroups. Inform the user their project role lacks access.
AccessDeniedException on workgroup	Standard role	Retry with the next workgroup in priority order
No workgroups found	Any	Ask the user to configure a workgroup or check IAM permissions

Anti-patterns

Never default to primary workgroup without checking others first
Never hardcode a workgroup name across sessions
Never retry with primary when running as a SageMaker project role -- it will fail with AccessDeniedException

Available in: English 한국어 Português Türkçe

AI SkillQuery data lakeOperations

Ask business questions against a data lake with Athena-style safety and cost checks. — Claude Skill

A Claude Skill for Claude Code by AWS✓ — run /querying-data-lake in Claude·Updated Jun 14, 2026·vmain@7cd875e

Compatible withChatGPT

ClaudeClaude CodeCodex / Codex CLI

Cursor

Gemini

Helps teams choose the right data source, workgroup, and SQL pattern, then query large data lake tables with cost awareness, result interpretation, and safe limits.

Turns a business question into a safe data lake query plan before scanning large tables.
Chooses appropriate workgroup, tables, partitions, limits, and query pattern.
Explains data scanned, cost risk, and whether the result is reliable enough to use.
Separates read-only analysis from destructive or unsafe actions.

YouToday

A team asks a data engineer for a number and gets an expensive or unclear one-off query.

With /querying-data-lake

Run /querying-data-lake to make the question, source, SQL pattern, cost, and caveats explicit before using the answer.

1 Clarify the metric2 Choose source and workgroup3 Limit scanned data4 Explain results and caveats

Who this is for

Data Engineer

Make ad hoc data lake questions safer, cheaper, and easier to review.

See skills for this role

Analytics Engineer

Turn stakeholder questions into query plans with metric definitions and caveats.

See skills for this role

What it does

Ad hoc metric answer

Answer a business question from lake data without creating a permanent dashboard first.

Cost-aware exploration

Explore large tables with partition and workgroup checks before scanning too much data.

Data source validation

Confirm whether data lake tables can support a metric or competitive benchmark.

How it works

Clarify the business question, date range, metric, filters, and output needed.

Identify the likely catalog, database, table, partition, and workgroup.

Create a query plan that limits scanned data and avoids unsafe operations.

Run or draft SQL with preview limits and validate result reasonableness.

Explain the answer, caveats, and whether the query should become a repeatable report.

Input options

Business question

The decision, metric, segment, and date range the query must answer.

Example

What the user pastes

Business question: did activation improve after the new onboarding checklist?
Date range: May 1 to June 14.
Segments: SMB, mid-market, enterprise.
Activation definition: workspace has invited at least two teammates and completed first project within 7 days.
Known tables:
- product_events partitioned by event_date
- accounts partitioned by created_month
- workspaces
Need: query plan, safe SQL outline, result table, and caveats before sharing with Product and Finance.

Useful result

Query plan

Use read-only Athena SQL. Limit product_events to event_date between May 1 and June 14. Join accounts only on workspace_id and created_month. Start with a 100-row preview before the aggregate query. Use the analytics workgroup if available because this is a business analysis query.

SQL outline

1. Find workspaces created in the date range.
2. Count invited_teammate and first_project_completed events within 7 days of workspace creation.
3. Mark activated when both conditions are true.
4. Group by account segment and pre/post checklist launch date.
5. Return workspace count, activated count, activation rate, and confidence caveat.

Result format

| Segment | Period | Workspaces | Activated | Activation rate | Readout |
|---|---|---:|---:|---:|---|
| SMB | Before | 420 | 151 | 36% | Baseline |
| SMB | After | 390 | 171 | 44% | Improvement visible |
| Mid-market | Before | 180 | 76 | 42% | Baseline |
| Mid-market | After | 165 | 84 | 51% | Stronger movement |
| Enterprise | Before | 52 | 19 | 37% | Small sample |
| Enterprise | After | 48 | 20 | 42% | Treat cautiously |

Caveats

This does not prove causality. Check whether acquisition source changed, whether event instrumentation was stable, and whether enterprise sample size is too small for a confident claim.

Metrics this improves

Query Performance

Reduces unnecessary full-table scans and inefficient exploratory SQL.

Operations

Warehouse Cost

Keeps data scanned and cost risk visible before queries run.

Operations

Metric Trust

Makes definitions, sources, and caveats explicit.

Operations

Works with

Google Sheets

manual

Share result tables and analysis caveats with stakeholders.

Slack

manual

Coordinate data owner review and stakeholder readouts.

SQL

manual

Draft or review SQL patterns for data lake analysis.

Want to use Querying Data Lake?

Choose how to get started.

Run in Claude Code

Free. Open source.

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

Run it

Start Claude Code, then type the command:

then

View source on GitHub

Use on ElasticFlow

Team and collaboration features

Run skills from your browser. Share results, manage access, collaborate with your team. No terminal needed.

Free 14-day trial. Cancel anytime.

View on GitHub

Query Data Lake

Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.

Overview

Constraints for parameter acquisition:

You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or profile TABLE_NAME
You MUST accept the argument as direct text or a pointer to a file containing SQL
You MUST ask the user for the target AWS region if not already set
You MUST confirm the output S3 location before executing any non-trivial query
You MUST respect the user's decision to abort at any step

Common Tasks

1. Verify Dependencies

Check for required tools and AWS access before running queries.

Constraints:

You MUST verify AWS MCP server tools are available (aws___call_aws) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or aws athena CLI so output location and cost are tracked
You MUST confirm credentials with aws sts get-caller-identity and inform the user about any missing tools

2. Resolve Workgroup

Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).

Constraints:

You MUST select a workgroup before submitting any query (prevents output-location errors)
You MUST present the selected workgroup and its output location to the user
You MUST NOT auto-escalate to a different workgroup on failure without user confirmation

3. Resolve the Target Asset

Constraints:

You MUST NOT attempt to resolve fuzzy asset references with athena list-data-catalogs or by iterating get-tables — those miss federated catalogs and waste tokens
You SHOULD skip this step only when the user provides a fully-qualified reference (exact database.table) or raw SQL they want executed as-is
You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or finding-data-lake-assets returns a different catalog

4. Discover Schema

For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (SELECT ... LIMIT 5) as part of profiling.

5. Build Query

Table addressing depends on catalog type:

Default Glue catalog: database.table (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with "awsdatacatalog".database.table.
Registered data source: datasource.database.table
Unregistered Glue catalog: "catalog/subcatalog".database.table

6. Classify and Execute

Classify the SQL statement before executing:

Statement	Behavior
`SELECT`, `SHOW`, `DESCRIBE`, `EXPLAIN`	Safe — execute
`INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `CREATE`, `TRUNCATE`, `MERGE`	Destructive — warn the user and require explicit confirmation
Unsure	Treat as destructive; confirm

Example tool call (via AWS MCP server):

aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")

For federated or S3 Tables catalogs, also set Catalog=<CATALOG_PATH> in the execution context (e.g. Catalog=s3tablescatalog/<BUCKET_NAME>).

Constraints:

You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")
You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")
You MUST confirm the output S3 location before executing
You MUST explain which tool is being called before executing
You MUST respect the user's decision to abort

7. Present and Recover

Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.

Argument Routing

Resolve in this order; stop at the first match:

Contains SQL keywords (SELECT, SHOW, DESCRIBE, INSERT, etc.) — SQL text, execute directly
profile TABLE_NAME — run comprehensive table profiling (see query-patterns.md)
Matches a known named query — look up and execute
Matches a known workgroup — show workgroup status and recent queries
Matches a known catalog — delegate to exploring-data-catalog to enumerate databases and tables
No args — show recent query activity and available tables

Principles

Always select workgroup before executing (prevents output-location errors)
Profile unfamiliar tables before running analytical queries
Present cost alongside results so users build cost awareness
Suggest LIMIT for exploratory queries on large tables
Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)

Troubleshooting

Error	Cause	Fix
Redshift identifier error with mixed case	Redshift-federated names are lowercase only	Lowercase the identifier
`CatalogId` validation failure	ARN passed instead of catalog name	Pass the catalog name, not the ARN
Cross-catalog `information_schema` returns nothing	Missing catalog qualifier	Use catalog-qualified path: `"catalog".information_schema.tables`
Query fails with output-location error	Workgroup has no output location configured	Select a different workgroup with an output location, or configure one
Destructive statement executed without confirmation	Statement classification skipped	Always classify `INSERT`/`UPDATE`/`DELETE`/`DROP`/`ALTER`/`CREATE`/`TRUNCATE`/`MERGE` and confirm with the user

Additional Resources

Reference documents

name: querying-data-lake description: >- Execute and manage Athena SQL queries across default and federated catalogs (Glue, S3 Tables, Redshift). Triggers on phrases like: query data, run SQL, athena query, analyze table, SQL query, workgroup status, profile table, query Redshift catalog, query S3 Tables. Do NOT use for finding specific data assets (use finding-data-lake-assets), full catalog audits (use exploring-data-catalog), importing data (use ingesting-into-data-lake). version: 1 argument-hint: '[SQL-query|query-name|workgroup-name|catalog-name|''profile TABLE_NAME'']'

Query Data Lake

Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.

Overview

Constraints for parameter acquisition:

You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or profile TABLE_NAME
You MUST accept the argument as direct text or a pointer to a file containing SQL
You MUST ask the user for the target AWS region if not already set
You MUST confirm the output S3 location before executing any non-trivial query
You MUST respect the user's decision to abort at any step

Common Tasks

1. Verify Dependencies

Check for required tools and AWS access before running queries.

Constraints:

You MUST verify AWS MCP server tools are available (aws___call_aws) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or aws athena CLI so output location and cost are tracked
You MUST confirm credentials with aws sts get-caller-identity and inform the user about any missing tools

2. Resolve Workgroup

Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).

Constraints:

You MUST select a workgroup before submitting any query (prevents output-location errors)
You MUST present the selected workgroup and its output location to the user
You MUST NOT auto-escalate to a different workgroup on failure without user confirmation

3. Resolve the Target Asset

Constraints:

You MUST NOT attempt to resolve fuzzy asset references with athena list-data-catalogs or by iterating get-tables — those miss federated catalogs and waste tokens
You SHOULD skip this step only when the user provides a fully-qualified reference (exact database.table) or raw SQL they want executed as-is
You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or finding-data-lake-assets returns a different catalog

4. Discover Schema

For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (SELECT ... LIMIT 5) as part of profiling.

5. Build Query

Table addressing depends on catalog type:

Default Glue catalog: database.table (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with "awsdatacatalog".database.table.
Registered data source: datasource.database.table
Unregistered Glue catalog: "catalog/subcatalog".database.table

6. Classify and Execute

Classify the SQL statement before executing:

Statement	Behavior
`SELECT`, `SHOW`, `DESCRIBE`, `EXPLAIN`	Safe — execute
`INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `CREATE`, `TRUNCATE`, `MERGE`	Destructive — warn the user and require explicit confirmation
Unsure	Treat as destructive; confirm

Example tool call (via AWS MCP server):

aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")

For federated or S3 Tables catalogs, also set Catalog=<CATALOG_PATH> in the execution context (e.g. Catalog=s3tablescatalog/<BUCKET_NAME>).

Constraints:

You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")
You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")
You MUST confirm the output S3 location before executing
You MUST explain which tool is being called before executing
You MUST respect the user's decision to abort

7. Present and Recover

Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.

Argument Routing

Resolve in this order; stop at the first match:

Contains SQL keywords (SELECT, SHOW, DESCRIBE, INSERT, etc.) — SQL text, execute directly
profile TABLE_NAME — run comprehensive table profiling (see query-patterns.md)
Matches a known named query — look up and execute
Matches a known workgroup — show workgroup status and recent queries
Matches a known catalog — delegate to exploring-data-catalog to enumerate databases and tables
No args — show recent query activity and available tables

Principles

Always select workgroup before executing (prevents output-location errors)
Profile unfamiliar tables before running analytical queries
Present cost alongside results so users build cost awareness
Suggest LIMIT for exploratory queries on large tables
Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)

Troubleshooting

Error	Cause	Fix
Redshift identifier error with mixed case	Redshift-federated names are lowercase only	Lowercase the identifier
`CatalogId` validation failure	ARN passed instead of catalog name	Pass the catalog name, not the ARN
Cross-catalog `information_schema` returns nothing	Missing catalog qualifier	Use catalog-qualified path: `"catalog".information_schema.tables`
Query fails with output-location error	Workgroup has no output location configured	Select a different workgroup with an output location, or configure one
Destructive statement executed without confirmation	Statement classification skipped	Always classify `INSERT`/`UPDATE`/`DELETE`/`DROP`/`ALTER`/`CREATE`/`TRUNCATE`/`MERGE` and confirm with the user

Additional Resources

Common Query Patterns (Presto/Athena SQL)

Table Profiling

-- Schema discovery
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = '<database>' AND table_name = '<table>';

-- Quick row count and date range
SELECT COUNT(*) as total_rows,
       MIN(created_at) as earliest,
       MAX(created_at) as latest
FROM <table>;

-- Sample data (always do this before analytical queries)
SELECT * FROM <table> LIMIT 5;

-- Null analysis
SELECT
    '<column>' as field,
    COUNT(*) - COUNT(<column>) as null_count,
    ROUND((COUNT(*) - COUNT(<column>)) * 100.0 / COUNT(*), 2) as null_pct
FROM <table>;

Cohort Retention

WITH cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', first_activity_date) as cohort_month
    FROM users
),
activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', activity_date) as activity_month
    FROM user_activity
)
SELECT
    c.cohort_month,
    COUNT(DISTINCT c.user_id) as cohort_size,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = c.cohort_month THEN a.user_id
    END) as month_0,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = DATE_ADD('month', 1, c.cohort_month) THEN a.user_id
    END) as month_1,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = DATE_ADD('month', 3, c.cohort_month) THEN a.user_id
    END) as month_3,
    COUNT(DISTINCT CASE
        WHEN a.activity_month = DATE_ADD('month', 6, c.cohort_month) THEN a.user_id
    END) as month_6
FROM cohorts c
LEFT JOIN activity a ON c.user_id = a.user_id
GROUP BY c.cohort_month
ORDER BY c.cohort_month;

Funnel Analysis

WITH funnel AS (
    SELECT
        user_id,
        MAX(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) as step_1_view,
        MAX(CASE WHEN event = 'signup_start' THEN 1 ELSE 0 END) as step_2_start,
        MAX(CASE WHEN event = 'signup_complete' THEN 1 ELSE 0 END) as step_3_complete,
        MAX(CASE WHEN event = 'first_purchase' THEN 1 ELSE 0 END) as step_4_purchase
    FROM events
    WHERE event_date >= DATE_ADD('day', -30, CURRENT_DATE)
    GROUP BY user_id
)
SELECT
    COUNT(*) as total_users,
    SUM(step_1_view) as viewed,
    SUM(step_2_start) as started_signup,
    SUM(step_3_complete) as completed_signup,
    SUM(step_4_purchase) as purchased,
    ROUND(100.0 * SUM(step_2_start) / NULLIF(SUM(step_1_view), 0), 1) as view_to_start_pct,
    ROUND(100.0 * SUM(step_3_complete) / NULLIF(SUM(step_2_start), 0), 1) as start_to_complete_pct,
    ROUND(100.0 * SUM(step_4_purchase) / NULLIF(SUM(step_3_complete), 0), 1) as complete_to_purchase_pct
FROM funnel;

Deduplication

-- Keep the most recent record per key (Presto/Athena syntax)
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY entity_id
            ORDER BY updated_at DESC
        ) as rn
    FROM source_table
)
SELECT * FROM ranked WHERE rn = 1;

Window Functions

-- Running total
SUM(revenue) OVER (ORDER BY event_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total

-- 7-day moving average
AVG(revenue) OVER (ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg_7d

-- Period-over-period comparison
LAG(value, 1) OVER (PARTITION BY entity ORDER BY event_date) as prev_value

-- Percent of total
revenue / SUM(revenue) OVER () as pct_of_total
revenue / SUM(revenue) OVER (PARTITION BY category) as pct_of_category

-- Ranking
ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rank_in_category

Period Comparison / Growth

When the user asks for "growth", "change", or "comparison" between periods, compute the delta — not raw totals.

WITH quarterly AS (
    SELECT
        category,
        QUARTER(order_date) as q,
        SUM(amount) as revenue
    FROM orders
    WHERE YEAR(order_date) = 2025
    GROUP BY category, QUARTER(order_date)
)
SELECT
    curr.category,
    prev.revenue as prev_period,
    curr.revenue as curr_period,
    ROUND((curr.revenue - prev.revenue) / prev.revenue * 100, 1) as growth_pct
FROM quarterly curr
JOIN quarterly prev ON curr.category = prev.category AND curr.q = prev.q + 1
ORDER BY growth_pct DESC;

Performance-Aware Patterns

-- Always filter on partition keys to reduce scan cost
SELECT region, COUNT(*)
FROM sales
WHERE year = '2024' AND month = '02'
GROUP BY region;

-- Use LIMIT for exploratory queries
SELECT * FROM large_table LIMIT 100;

-- Use approximate functions for large-scale cardinality
SELECT APPROX_DISTINCT(user_id) as approx_unique_users
FROM events;

Data Quality Checks

-- Distinct value counts per column
SELECT
    COUNT(DISTINCT col1) as col1_unique,
    COUNT(DISTINCT col2) as col2_unique
FROM <table>;

-- Detect unexpected values
SELECT column_name, COUNT(*) as cnt
FROM <table>
GROUP BY column_name
ORDER BY cnt DESC
LIMIT 20;

-- Check for join explosion
SELECT COUNT(*) as pre_join FROM table_a;
SELECT COUNT(*) as post_join FROM table_a a JOIN table_b b ON a.id = b.a_id;

Workgroup Selection

Always list workgroups first before executing any query.

Detect Execution Context

Before selecting a workgroup, determine the current IAM identity:

aws sts get-caller-identity --query Arn --output text

The ARN pattern reveals the execution context:

ARN Pattern	Context	Workgroup Strategy
`arn:aws:sts::*:assumed-role/AmazonDataZone-<project-id>-<suffix>/<session>`	SageMaker Unified Studio project role	Use the project-scoped workgroup (see below)
`arn:aws:sts::*:assumed-role/SageMakerUnifiedStudio-<project-id>-<suffix>/<session>`	SageMaker Unified Studio project role	Use the project-scoped workgroup (see below)
`arn:aws:sts:::assumed-role/AmazonSageMaker-ExecutionRole-`	SageMaker notebook/studio role	Prefer `sagemaker-studio-workgroup-*`
Anything else	Standard IAM user/role	Follow general priority order

SageMaker Project Role Selection

When running as a SageMaker project role (AmazonDataZone-* or SageMakerUnifiedStudio-*):

List all workgroups the role can access:

aws athena list-work-groups --query 'WorkGroups[].Name' --output json

Extract the project ID from the role ARN. Split the role name on -. The first segment is the prefix (e.g., AmazonDataZone), the second segment is the project ID (e.g., abc123def), and subsequent segments form the suffix (e.g., DataLakeAccess). Take the second segment. The project ID is an alphanumeric string (no hyphens). Known suffixes that follow the project ID: DataLakeAccess, SparkAccess, QueryAccess, IngestionAccess. Example:
```
arn:aws:sts::123456789012:assumed-role/AmazonDataZone-abc123def-DataLakeAccess/session
                                                      ^^^^^^^^^
                                                      project ID = abc123def
```
Match the workgroup to the project. Project workgroups follow the pattern sagemaker-studio-workgroup-<project-id> or contain the project ID.
If exactly one sagemaker-studio-workgroup-* exists, verify its suffix contains the project ID extracted in step 2. If it matches, use it. If it does not match, fall through to step 6.
If multiple exist, pick the one whose suffix matches the project ID extracted from the role ARN. Optionally check environment variables SAGEMAKER_PROJECT_ID or SAGEMAKER_PROJECT_NAME if the ARN extraction is ambiguous.
If no sagemaker-studio-workgroup-* exists, do not fall back to other workgroups. Inform the user that no project-scoped workgroup was found and ask them to verify their project configuration or IAM permissions.

General Priority Order (Non-Project Roles)

sagemaker-studio-workgroup-* workgroups -- most reliable, always have output locations configured
Workgroups with explicitly configured output locations
primary workgroup (use with caution, may lack output location)

Error Recovery

Error	Context	Action
No output location	Any	Retry with the next workgroup in priority order
AccessDeniedException on workgroup	Project role	Do not retry with other workgroups. Inform the user their project role lacks access.
AccessDeniedException on workgroup	Standard role	Retry with the next workgroup in priority order
No workgroups found	Any	Ask the user to configure a workgroup or check IAM permissions

Anti-patterns

Never default to primary workgroup without checking others first
Never hardcode a workgroup name across sessions
Never retry with primary when running as a SageMaker project role -- it will fail with AccessDeniedException

Ask business questions against a data lake with Athena-style safety and cost checks. — Claude Skill

Who this is for

What it does

How it works

Input options

Example

Metrics this improves

Works with

Want to use Querying Data Lake?

Skill instructions

Query Data Lake

Overview

Common Tasks

1. Verify Dependencies

2. Resolve Workgroup

3. Resolve the Target Asset

4. Discover Schema

5. Build Query

6. Classify and Execute

7. Present and Recover

Argument Routing

Principles

Troubleshooting

Additional Resources

Reference documents

Query Data Lake

Overview

Common Tasks

1. Verify Dependencies

2. Resolve Workgroup

3. Resolve the Target Asset

4. Discover Schema

5. Build Query

6. Classify and Execute

7. Present and Recover

Argument Routing

Principles

Troubleshooting

Additional Resources

Common Query Patterns (Presto/Athena SQL)

Table Profiling

Cohort Retention

Funnel Analysis

Deduplication

Window Functions

Period Comparison / Growth

Performance-Aware Patterns

Data Quality Checks

Workgroup Selection

Detect Execution Context

SageMaker Project Role Selection

General Priority Order (Non-Project Roles)

Error Recovery

Anti-patterns

Ask business questions against a data lake with Athena-style safety and cost checks. — Claude Skill

Who this is for

What it does

How it works

Input options

Example

Metrics this improves

Works with

Want to use Querying Data Lake?

Skill instructions

Query Data Lake

Overview

Common Tasks

1. Verify Dependencies

2. Resolve Workgroup

3. Resolve the Target Asset

4. Discover Schema

5. Build Query

6. Classify and Execute

7. Present and Recover

Argument Routing

Principles

Troubleshooting

Additional Resources

Reference documents

Query Data Lake