Extract text, tables, layout, and evidence from business documents (Claude Skill)

Name: Document Parser
Author: Claude Office Skills

A Claude Skill for Claude Code by Claude Office Skills. Run /doc-parser in Claude. Updated June 18, 2026. Version: main@9c4c7d5

Compatible with: ChatGPT, Claude, Claude Code, Codex / Codex CLI, Cursor, Gemini

Converts PDFs, DOCX files, images, and scanned documents into structured text and tables for analysis, reporting, and workflows.

Extracts text and tables from PDFs, Word documents, images, and scanned files.
Keeps page, section, and table context so evidence is easier to audit.
Converts messy documents into structured inputs for analysis, comparison, and reports.
Flags uncertain extraction where human review is needed.

Today

People copy numbers manually from PDFs and lose page references or table context.

With /doc-parser

Run /doc-parser to extract structured fields, tables, evidence notes, and review flags from documents.

1. Provide the document and the fields needed
2. Extract tables and text
3. Keep page references
4. Review uncertain fields

Who it's for

Project Manager

Convert source documents into structured fields and reviewable evidence.

Analytics Engineer

Prepare document-based tables and fields for analysis workflows.

What it does

Competitor document extraction

Extract pricing, packaging, claims, and proof points from competitor PDFs.

Report source parsing

Turn PDFs and DOCX files into source tables for an analysis report.

Operations intake

Convert submitted documents into structured review fields.

How it works

Provide the document or extracted file content and state what information is needed.

Parse text, headings, tables, forms, and layout cues.

Return structured fields, tables, quotes, and page references.

Highlight missing, ambiguous, or low-confidence extraction results.

Prepare the parsed content for analysis, reporting, or CRM/data entry.
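The steps above can be sketched as a small record type plus a review rule. This is a minimal illustration, not the skill's actual implementation: the `ExtractedField` shape, the `needs_review` rule, and the "Discount terms" row are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedField:
    """One extracted value plus the evidence needed to audit it (hypothetical shape)."""
    name: str
    value: Optional[str]   # None when the field was not found
    page: Optional[int]    # page the value was read from, if known
    confidence: str        # "High", "Medium", or "Low"


def needs_review(field: ExtractedField) -> bool:
    """Flag fields that are missing, lack a page reference, or are not high confidence."""
    return field.value is None or field.page is None or field.confidence != "High"


fields = [
    ExtractedField("Seat price", "$31/user/month", 3, "High"),
    ExtractedField("SSO", "Included for Enterprise Plus", 4, "Medium"),
    ExtractedField("Discount terms", None, None, "Low"),  # hypothetical missing field
]

review_queue = [f.name for f in fields if needs_review(f)]
print(review_queue)  # ['SSO', 'Discount terms']
```

Keeping the page number and confidence on every record is what lets a reviewer go straight to the evidence instead of re-reading the whole document.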

Input options

Document

PDF, DOCX, screenshot, scan, or extracted text.

Example

What the user pastes

Document: LearnPro Enterprise pricing PDF from buyer procurement packet.
Need to extract:
- tier names
- per-user pricing
- implementation fee
- SSO and admin control terms
- contract minimums
- page references
Important: do not summarize away exact prices; keep page references and uncertain fields.

Useful output

Extracted fields

| Field | Extracted value | Page | Confidence |
|---|---|---:|---|
| Product tier | Enterprise Plus | 2 | High |
| Seat price | $31/user/month, annual contract | 3 | High |
| Implementation fee | $18,000 one-time | 3 | High |
| SSO | Included for Enterprise Plus | 4 | Medium |
| Admin controls | Advanced admin controls included | 4 | Medium |
| Contract minimum | 250 seats | 3 | High |

Evidence notes

The SSO and admin-control terms appear in a feature table rather than the pricing table. Treat them as included only after procurement confirms the PDF applies to the buyer's region and contract version.

Structured output for analysis

Use price, implementation fee, and seat minimum in the pricing comparison. Keep exact language from pages 3-4 attached as source evidence for product marketing and finance review.
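As a worked illustration of how those three values could feed a pricing comparison, the sketch below computes a first-year minimum commitment. It assumes list prices, a 12-month term, and no discounts, none of which the extracted document confirms; the function name is hypothetical.

```python
def first_year_minimum(seat_price_per_month, seat_minimum, implementation_fee):
    """Return (annual seat cost, first-year total) at the contract minimum.

    Assumes a 12-month term at list price with no discounts applied.
    """
    annual_seat_cost = seat_price_per_month * 12 * seat_minimum
    return annual_seat_cost, annual_seat_cost + implementation_fee


# Values taken from the extracted fields table (page 3)
annual_seats, first_year_total = first_year_minimum(
    seat_price_per_month=31,
    seat_minimum=250,
    implementation_fee=18_000,
)
print(annual_seats)      # 93000
print(first_year_total)  # 111000
```

The SSO and admin-control rows are left out of the calculation on purpose, since they are only medium confidence and carry no price.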

Human review

Check whether the PDF is current, whether discounts are excluded, and whether any handwritten or scanned annotations were missed.

Metrics it improves

Data quality

Reduces manual copy errors from document-based evidence.

Operations

Metric confidence

Keeps source references and extraction confidence visible.

Operations

Works with

Google Sheets

manual

Review extracted tables and structured fields.

Google Drive

manual

Store and access source documents for parsing.

Confluence

manual

Publish parsed evidence and review notes.

Want to use Document Parser?

Choose how to get started.

Run in Claude Code

Free. Open source.

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

Run

Start Claude Code, then type the command:

Use in ElasticFlow

Team and collaboration features

Run skills from your browser. Share results, manage access, and collaborate with your team. No terminal needed.

14-day free trial. Cancel anytime.

Document Parser Skill

Overview

This skill enables advanced document parsing using docling, IBM's state-of-the-art document understanding library. It parses complex PDFs, Word documents, and images while preserving structure, extracting tables and figures, and handling multi-column layouts.

How to Use

Provide the document to parse
Specify what you want to extract (text, tables, figures, etc.)
I'll parse it and return structured data

Example prompts:

"Parse this PDF and extract all tables"
"Convert this academic paper to structured markdown"
"Extract figures and captions from this document"
"Parse this report preserving the document structure"

Domain Knowledge

docling Fundamentals

from docling.document_converter import DocumentConverter

# Initialize the converter
converter = DocumentConverter()

# Convert the document
result = converter.convert("document.pdf")

# Access the parsed content
doc = result.document
print(doc.export_to_markdown())

Supported Formats

| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Native and scanned |
| Word | .docx | Full structure preserved |
| PowerPoint | .pptx | Slides as sections |
| Images | .png, .jpg | OCR + layout analysis |
| HTML | .html | Structure preserved |
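Before handing files to the parser, it can help to separate supported inputs from everything else. The sketch below mirrors the table above as a lookup; the `SUPPORTED_EXTENSIONS` mapping and both helper names are assumptions for illustration and are not part of docling.

```python
from pathlib import Path

# Extension-to-format lookup copied from the supported formats table
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".png": "Images",
    ".jpg": "Images",
    ".html": "HTML",
}


def format_for(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


def split_supported(paths):
    """Split paths into (supported, skipped) lists, preserving input order."""
    supported, skipped = [], []
    for p in paths:
        (supported if format_for(p) else skipped).append(p)
    return supported, skipped


supported, skipped = split_supported(["a.pdf", "b.txt", "c.JPG"])
print(supported)  # ['a.pdf', 'c.JPG']
print(skipped)    # ['b.txt']
```

Lower-casing the suffix means `Report.PDF` and `report.pdf` are treated the same.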

Basic Usage

from docling.document_converter import DocumentConverter

# Create the converter
converter = DocumentConverter()

# Convert a single document
result = converter.convert("report.pdf")

# Access the document
doc = result.document

# Export options
markdown = doc.export_to_markdown()
text = doc.export_to_text()
json_doc = doc.export_to_dict()

Advanced Configuration

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure the PDF pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Create a converter with options; pipeline options are passed per input format
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    },
)

result = converter.convert("document.pdf")

Document Structure

# Document hierarchy
doc = result.document

# Access metadata
print(doc.name)
print(doc.origin)

# Iterate through the content; iterate_items() yields (item, level) pairs
for element, _level in doc.iterate_items():
    print(f"Type: {element.label}")
    # Tables and pictures have no .text attribute, so fall back to an empty string
    print(f"Text: {getattr(element, 'text', '')}")

    if element.label == "table":
        print(f"Rows: {element.data.num_rows}")

Extracting Tables

from docling.document_converter import DocumentConverter
import pandas as pd  # export_to_dataframe() returns pandas DataFrames

def extract_tables(doc_path):
    """Extract all tables from a document."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    tables = []

    # iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        if element.label == "table":
            # Get the table data as a DataFrame
            table_data = element.export_to_dataframe()
            tables.append({
                'page': element.prov[0].page_no if element.prov else None,
                'dataframe': table_data
            })

    return tables

# Usage
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i+1} on page {table['page']}:")
    print(table['dataframe'])

Extracting Figures

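import os

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

def extract_figures(doc_path, output_dir):
    """Extract figures with captions."""
    # Picture images are only kept when the pipeline is asked to generate them
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(doc_path)
    doc = result.document

    figures = []
    os.makedirs(output_dir, exist_ok=True)

    for element, _level in doc.iterate_items():
        if element.label == "picture":
            figure_info = {
                # caption_text() returns an empty string when there is no caption
                'caption': element.caption_text(doc) or None,
                'page': element.prov[0].page_no if element.prov else None,
            }

            # Save the image if available
            image = element.get_image(doc)
            if image is not None:
                img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
                image.save(img_path)
                figure_info['path'] = img_path

            figures.append(figure_info)

    return figures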

Handling Multi-column Layouts

from docling.document_converter import DocumentConverter

def parse_multicolumn(doc_path):
    """Parse a document with a multi-column layout."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    # docling automatically handles column detection
    # Text is returned in reading order

    structured_content = []

    for element, _level in doc.iterate_items():
        content_item = {
            'type': element.label,
            'text': element.text if hasattr(element, 'text') else None,
            # Only section headers carry a heading level
            'level': element.level if hasattr(element, 'level') else None,
        }

        # Add the bounding box if available
        if element.prov:
            content_item['bbox'] = element.prov[0].bbox
            content_item['page'] = element.prov[0].page_no

        structured_content.append(content_item)

    return structured_content

Export Formats

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Markdown export
markdown = doc.export_to_markdown()
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

# Plain text
text = doc.export_to_text()

# JSON/dict format
json_doc = doc.export_to_dict()

# HTML format (if supported)
# html = doc.export_to_html()
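The dictionary export can be persisted as JSON for later reuse. The sketch below uses a stand-in dictionary rather than real parser output and assumes the exported dictionary is JSON-serializable; `save_json` is a hypothetical helper, not a docling function.

```python
import json
import tempfile
from pathlib import Path


def save_json(doc_dict, out_path):
    """Write a parsed-document dictionary to disk as UTF-8 JSON."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(doc_dict, f, ensure_ascii=False, indent=2)
    return out_path


# Stand-in for the dictionary returned by export_to_dict()
sample = {"name": "relatório", "texts": [{"text": "Preço: $31"}]}

with tempfile.TemporaryDirectory() as tmp:
    path = save_json(sample, Path(tmp) / "out" / "document.json")
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored == sample)  # True
```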

Batch Processing

from docling.document_converter import DocumentConverter
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def batch_parse(input_dir, output_dir, max_workers=4):
    """Parse multiple documents in parallel."""

    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    converter = DocumentConverter()

    def process_single(doc_path):
        # Convert one file and record success or the error message
        try:
            result = converter.convert(str(doc_path))
            md = result.document.export_to_markdown()

            out_file = output_path / f"{doc_path.stem}.md"
            with open(out_file, 'w', encoding='utf-8') as f:
                f.write(md)

            return {'file': str(doc_path), 'status': 'success'}
        except Exception as e:
            return {'file': str(doc_path), 'status': 'error', 'error': str(e)}

    docs = list(input_path.glob('*.pdf')) + list(input_path.glob('*.docx'))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_single, docs))

    return results
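A batch run is easier to act on with a short summary of what failed. The sketch below works on result dictionaries shaped like the ones returned above; `summarize_results` and the sample entries are assumptions for illustration.

```python
from collections import Counter


def summarize_results(results):
    """Count successes and errors and list each failed file with its message."""
    counts = Counter(r["status"] for r in results)
    failures = [
        (r["file"], r.get("error", ""))
        for r in results
        if r["status"] == "error"
    ]
    return {
        "success": counts.get("success", 0),
        "error": counts.get("error", 0),
        "failures": failures,
    }


# Made-up results in the same shape as the batch output
results = [
    {"file": "a.pdf", "status": "success"},
    {"file": "b.docx", "status": "error", "error": "unsupported encryption"},
    {"file": "c.pdf", "status": "success"},
]

summary = summarize_results(results)
print(summary["success"], summary["error"])  # 2 1
print(summary["failures"])  # [('b.docx', 'unsupported encryption')]
```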

Best Practices

Use an Appropriate Pipeline: Configure it for your document type
Handle Large Documents: Process in chunks if needed
Verify Table Extraction: Complex tables may need review
Check OCR Quality: Enable OCR for scanned documents
Cache Results: Store parsed documents for reuse
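One way to cache results is to key the stored output on a hash of the file contents, so a renamed but unchanged file still hits the cache. This is a sketch under assumptions: `parse_with_cache` and `file_digest` are hypothetical helpers, the parser is passed in as a plain function, and its output is assumed to be JSON-serializable.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def parse_with_cache(doc_path, cache_dir, parse_fn):
    """Return cached output for doc_path, calling parse_fn only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(doc_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    parsed = parse_fn(doc_path)
    cache_file.write_text(json.dumps(parsed, ensure_ascii=False), encoding="utf-8")
    return parsed


# Demo with a stand-in parser that records how often it is called
calls = []


def fake_parse(path):
    calls.append(str(path))
    return {"markdown": "# Sample"}


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sample.pdf"
    source.write_bytes(b"not a real PDF, just bytes to hash")
    first = parse_with_cache(source, Path(tmp) / "cache", fake_parse)
    second = parse_with_cache(source, Path(tmp) / "cache", fake_parse)

print(len(calls), first == second)  # 1 True
```

The second call is served from disk, so the stand-in parser runs only once.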

Common Patterns

Academic Paper Parser

from docling.document_converter import DocumentConverter

def parse_academic_paper(pdf_path):
    """Parse academic paper structure."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'references': [],
        'tables': [],
        'figures': []
    }

    current_section = None

    # iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        text = element.text if hasattr(element, 'text') else ''

        if element.label == 'title':
            paper['title'] = text

        elif element.label == 'section_header':
            if 'abstract' in text.lower():
                current_section = 'abstract'
            elif 'reference' in text.lower():
                current_section = 'references'
            else:
                paper['sections'].append({
                    'title': text,
                    'content': ''
                })
                current_section = 'section'

        # Body text is usually labelled "text"; "paragraph" is accepted as well
        elif element.label in ('text', 'paragraph'):
            if current_section == 'abstract':
                paper['abstract'] = text
            elif current_section == 'references':
                paper['references'].append(text)
            elif current_section == 'section' and paper['sections']:
                paper['sections'][-1]['content'] += text + '\n'

        elif element.label == 'table':
            paper['tables'].append({
                'caption': element.caption_text(doc) or None,
                'data': element.export_to_dataframe()
            })

    return paper

Report to Structured Data

def parse_business_report(doc_path):
    """Parse a business report into a structured format."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    report = {
        'metadata': {
            'title': None,
            'date': None,
            'author': None
        },
        'executive_summary': None,
        'sections': [],
        'key_metrics': [],
        'recommendations': []
    }

    # Parse document structure
    for element, _level in doc.iterate_items():
        # Implement parsing logic based on document structure
        pass

    return report
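The placeholder loop above could be filled in along the following lines. To stay runnable without a parsed document, this sketch works on plain `(label, text)` pairs; the label names (`title`, `section_header`, `text`, `paragraph`), the `build_report` helper, and the sample items are assumptions for illustration.

```python
def build_report(items):
    """Group (label, text) pairs into a title, sections, and an executive summary."""
    report = {"metadata": {"title": None}, "executive_summary": None, "sections": []}
    current = None

    for label, text in items:
        if label == "title":
            report["metadata"]["title"] = text
        elif label == "section_header":
            # Start a new section; following body text is attached to it
            current = {"title": text, "content": ""}
            report["sections"].append(current)
        elif label in ("text", "paragraph") and current is not None:
            current["content"] += text + "\n"

    # Promote the section whose heading names the executive summary
    for section in report["sections"]:
        if "executive summary" in section["title"].lower():
            report["executive_summary"] = section["content"].strip()

    return report


# Made-up items standing in for parsed document elements
items = [
    ("title", "Q3 Operations Review"),
    ("section_header", "Executive Summary"),
    ("text", "Revenue grew."),
    ("text", "Costs were flat."),
    ("section_header", "Risks"),
    ("text", "Vendor delays."),
]

report = build_report(items)
print(report["metadata"]["title"])     # Q3 Operations Review
print(len(report["sections"]))         # 2
```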

Examples

Example 1: Parse Financial Report

from docling.document_converter import DocumentConverter

def parse_financial_report(pdf_path):
    """Extract structured data from a financial report."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': [],
        'notes': []
    }

    # Extract tables
    for element, _level in doc.iterate_items():
        if element.label == 'table':
            table_df = element.export_to_dataframe()
            table_text = str(table_df).lower()

            # Identify the table type by keyword; the first matching rule wins
            if 'revenue' in table_text or 'income' in table_text:
                financial_data['income_statement'] = table_df
            elif 'asset' in table_text or 'liabilities' in table_text:
                financial_data['balance_sheet'] = table_df
            elif 'cash' in table_text:
                financial_data['cash_flow'] = table_df
            else:
                # Keep unclassified tables instead of discarding them
                financial_data['other_tables'].append(table_df)

    # Export markdown for notes
    financial_data['markdown'] = doc.export_to_markdown()

    return financial_data

report = parse_financial_report('annual_report.pdf')
print("Income Statement:")
print(report['income_statement'])
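The keyword rules in this example are order-dependent: a cash flow table that mentions "net income" is claimed by the income statement rule first. A scoring variant is one possible alternative, sketched below on plain text; the keyword lists and `classify_table` are assumptions for illustration, not a tested classifier.

```python
# Hypothetical keyword lists; tune them to the reports being parsed
KEYWORDS = {
    "income_statement": ("revenue", "income", "expenses"),
    "balance_sheet": ("asset", "liabilities", "equity"),
    "cash_flow": ("cash flow", "operating activities", "financing activities"),
}


def classify_table(table_text):
    """Return the statement type with the most keyword hits, or None if there are none."""
    text = table_text.lower()
    scores = {
        kind: sum(text.count(keyword) for keyword in keywords)
        for kind, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify_table("Cash flow from operating activities; net income 120"))  # cash_flow
print(classify_table("Total assets and liabilities"))                         # balance_sheet
print(classify_table("Headcount by region"))                                  # None
```

Ties go to whichever type is listed first in `KEYWORDS`, so ambiguous tables are still worth a manual check.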

Example 2: Technical Documentation Parser

from docling.document_converter import DocumentConverter

def parse_technical_docs(doc_path):
    """Parse technical documentation."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    documentation = {
        'title': None,
        'version': None,
        'sections': [],
        'code_blocks': [],
        'diagrams': []
    }

    current_section = None

    # iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        if element.label == 'title':
            documentation['title'] = element.text

        elif element.label == 'section_header':
            current_section = {
                'title': element.text,
                'level': element.level if hasattr(element, 'level') else 1,
                'content': []
            }
            documentation['sections'].append(current_section)

        elif element.label == 'code':
            # Attach the code block to the current section, if any
            if current_section:
                current_section['content'].append({
                    'type': 'code',
                    'content': element.text
                })
            documentation['code_blocks'].append(element.text)

        elif element.label == 'picture':
            documentation['diagrams'].append({
                'page': element.prov[0].page_no if element.prov else None,
                'caption': element.caption_text(doc) or None
            })

    return documentation

docs = parse_technical_docs('api_documentation.pdf')
print(f"Title: {docs['title']}")
print(f"Sections: {len(docs['sections'])}")

Example 3: Contract Analysis

import re

from docling.document_converter import DocumentConverter

def analyze_contract(pdf_path):
    """Parse a contract document for key clauses."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    contract = {
        'parties': [],
        'clauses': [],
        'dates': [],
        'amounts': [],
        'full_text': doc.export_to_text()
    }

    # Extract dates (numeric dates and "Month D, YYYY" forms)
    date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
    contract['dates'] = re.findall(date_pattern, contract['full_text'], re.IGNORECASE)

    # Extract monetary amounts
    amount_pattern = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
    contract['amounts'] = re.findall(amount_pattern, contract['full_text'], re.IGNORECASE)

    # Parse sections as clauses
    for element, _level in doc.iterate_items():
        if element.label == 'section_header':
            contract['clauses'].append({
                'title': element.text,
                'content': ''
            })
        elif element.label in ('text', 'paragraph') and contract['clauses']:
            contract['clauses'][-1]['content'] += element.text + '\n'

    return contract

contract_data = analyze_contract('agreement.pdf')
print(f"Key dates: {contract_data['dates']}")
print(f"Amounts: {contract_data['amounts']}")
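The two regular expressions can be checked on a short sample without parsing a document. The sketch below runs them on made-up contract text and also shows one caveat: the `[\d,]+` part of the amount pattern keeps a trailing comma. The stricter variant is a suggested alternative, not part of the original example.

```python
import re

DATE_PATTERN = (
    r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'
    r'|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
)
AMOUNT_PATTERN = (
    r'\$[\d,]+(?:\.\d{2})?'
    r'|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
)
# Suggested variant: commas are only accepted as thousands separators
STRICT_AMOUNT_PATTERN = (
    r'\$\d+(?:,\d{3})*(?:\.\d{2})?'
    r'|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
)

# Made-up sample text
sample = (
    "This agreement is effective 01/15/2025 and renews on March 3, 2026. "
    "The fee is $18,000.00 plus 2,500 USD per quarter, "
    "capped at $40,000, payable annually."
)

dates = re.findall(DATE_PATTERN, sample, re.IGNORECASE)
amounts = re.findall(AMOUNT_PATTERN, sample, re.IGNORECASE)
strict_amounts = re.findall(STRICT_AMOUNT_PATTERN, sample, re.IGNORECASE)

print(dates)           # ['01/15/2025', 'March 3, 2026']
print(amounts)         # ['$18,000.00', '2,500 USD', '$40,000,']
print(strict_amounts)  # ['$18,000.00', '2,500 USD', '$40,000']
```

Regex hits like these are candidates only; each should still be tied back to its clause and page before it is used.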

Limitations

Very large documents may require chunking
Handwritten content needs OCR preprocessing
Complex nested tables may need manual review
Some PDF types (encrypted) are not supported
A GPU is recommended for best performance
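For the chunking mentioned in the first limitation, one simple approach is to split a document into fixed-size page ranges and process each range separately. The helper below only computes the ranges; `page_chunks` is a hypothetical name, pages are assumed to be numbered from 1, and how each range is handed to the parser is left out.

```python
def page_chunks(total_pages, chunk_size):
    """Return inclusive (start, end) page ranges covering pages 1..total_pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


print(page_chunks(10, 4))  # [(1, 4), (5, 8), (9, 10)]
print(page_chunks(0, 4))   # []
```

Tables or paragraphs that span a chunk boundary may be split, so results near the boundaries deserve a second look.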

Installation

pip install docling

# For full functionality
pip install docling[all]

# For OCR support
pip install docling[ocr]

Resources

Reference documents

═══════════════════════════════════════════════════════════════════════════════

CLAUDE OFFICE SKILL - Enhanced Metadata v2.0

═══════════════════════════════════════════════════════════════════════════════

Basic Information

name: doc-parser
description: ">"
version: "1.0"
author: claude-office-skills
license: MIT

Categorization

category: parsing
tags:
  - parsing
  - extraction
  - layout
  - docling
department: All

AI Model Compatibility

models:
  recommended:
    - claude-sonnet-4
    - claude-opus-4
  compatible:
    - claude-3-5-sonnet
    - gpt-4
    - gpt-4o

MCP Tools Integration

mcp:
  server: office-mcp
  tools:
    - analyze_document_structure
    - extract_text_from_pdf

Skill Capabilities

capabilities:
  - document_parsing
  - layout_analysis

Language Support

languages:

Disponível em: English 한국어 Português

Skill de IAParse documentOperações

extrair text, tables, layout, e evidência a partir de business documents. — Claude Skill

Um Skill Claude para Claude Code por Claude Office Skills — executar /doc-parser no Claude·Atualizado em 18 de jun. de 2026·vmain@9c4c7d5

Compatível comChatGPT

ClaudeClaude CodeCodex / Codex CLI

Cursor

Gemini

Converte PDFs, DOCX, imagens e documentos digitalizados em texto e tabelas estruturadas para análise, reporting e workflows.

extrai text e tables a partir de PDFs, Word documents, images, e scanned files.
Keeps página, section, e table contexto so evidência is easier para auditoria.
converte messy documents em structured inputs para analysis, comparison, e relatórios.
assinala uncertain extraction where human rever is needed.

VocêHoje

People copy numbers manually a partir de PDFs e lose página references ou table contexto.

Com /doc-parser

Run /doc-parser para extrair structured campos, tables, evidência notes, e rever assinala a partir de documents.

1 Provide document e campos needed2 extrair tables e text3 Keep página references4 rever uncertain campos

Para quem é

Gestor de Projeto

converter fonte documents em structured campos e reviewable evidência.

Ver skills para esta função

Analytics Engineer

preparar document-based tables e campos para analysis workflows.

Ver skills para esta função

O que faz

Concorrente document extraction

extrair pricing, packaging, claims, e proof points a partir de competitor PDFs.

Report source parsing

transformar PDFs e DOCX files em fonte tables para an analysis relatório.

Operações intake

converter submitted documents em structured rever campos.

How it works

Provide the document or extracted file content and state what information is needed.

Parse text, headings, tables, forms, and layout cues.

Return structured fields, tables, quotes, and page references.

Highlight missing, ambiguous, or low-confidence extraction results.

Prepare the parsed content for analysis, reporting, or CRM/data entry.
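
The steps above can be pictured as producing one record per requested field, with the low-confidence and missing ones routed to review. A minimal sketch, assuming a hypothetical `ExtractedField` record and a three-level confidence scale; neither is prescribed by the skill, and the sample values are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record shape; the skill does not define one.
@dataclass
class ExtractedField:
    name: str
    value: Optional[str]   # None when the field was not found
    page: Optional[int]    # page reference kept as evidence
    confidence: str        # "high", "medium", or "low" (assumed scale)

def needs_review(fields):
    """Return fields that are missing or not high-confidence."""
    return [f for f in fields if f.value is None or f.confidence != "high"]

# Invented sample data
fields = [
    ExtractedField("invoice_total", "1,250.00", 2, "high"),
    ExtractedField("payment_terms", "Net 30", 5, "medium"),
    ExtractedField("discount_terms", None, None, "low"),
]

for f in needs_review(fields):
    print(f"review: {f.name} (page {f.page}, confidence {f.confidence})")
```

Keeping the page number on every record is what lets a reviewer go straight to the source when a field is flagged.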

Input options

Document

PDF, DOCX, screenshot, scan, or extracted text.

Example

What the user pastes

Document: LearnPro Enterprise pricing PDF from buyer procurement packet.
Need to extract:
- tier names
- per-user pricing
- implementation fee
- SSO and admin control terms
- contract minimums
- page references
Important: do not summarize away exact prices; keep page references and uncertain fields.

Useful result

Extracted fields

| Field | Extracted value | Page | Confidence |
|---|---|---:|---|
| Product tier | Enterprise Plus | 2 | High |
| Seat price | $31/user/month, annual contract | 3 | High |
| Implementation fee | $18,000 one-time | 3 | High |
| SSO | Included for Enterprise Plus | 4 | Medium |
| Admin controls | Advanced admin controls included | 4 | Medium |
| Contract minimum | 250 seats | 3 | High |

Evidence notes

The SSO and admin-control terms appear in a feature table rather than the pricing table. Treat them as included only after procurement confirms the PDF applies to the buyer's region and contract version.

Structured output for analysis

Use price, implementation fee, and seat minimum in the pricing comparison. Keep the exact language from pages 3-4 attached as source evidence for product marketing and finance review.
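
As a worked example of how the extracted values could feed a pricing comparison, the sketch below computes a first-year cost at the seat minimum. It assumes the $31 seat price is billed for 12 months at exactly 250 seats and that no discounts apply; the function name is hypothetical and the calculation is not part of the skill's output.

```python
def first_year_cost(seat_price_per_month, seats, implementation_fee):
    """Annual seat cost plus the one-time implementation fee."""
    return seat_price_per_month * 12 * seats + implementation_fee

# Values taken from the extracted fields table (page 3)
total = first_year_cost(seat_price_per_month=31, seats=250, implementation_fee=18_000)

# 31 * 12 * 250 = 93,000 in seat fees, plus 18,000 one-time = 111,000
print(f"First-year cost at the seat minimum: ${total:,}")
```

The SSO and admin-control fields are left out of the calculation because they carry Medium confidence and no price.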

Human review

Verify whether the PDF is current, whether discounts are excluded, and whether any handwritten or scanned annotations were missed.

Metrics it improves

Data quality

Reduces manual copy errors from document-based evidence.

Operations

Metric confidence

Keeps source references and extraction confidence visible.

Operations

Works with

Google Sheets

manual

Review extracted tables and structured fields.

Google Drive

manual

Store and access source documents for parsing.

Confluence

manual

Publish parsed evidence and review notes.

Want to use Document Parser?

Choose how to get started.

Run in Claude Code

Free. Open source.

Install and run this skill locally on your computer.

Install Claude Code

Open a terminal on your computer and paste this command:

Install the skill

This downloads the skill with all its files to your computer:

Add -g at the end to make it available in all your projects.

Run

Start Claude Code, then type the command:


Document Parser Skill

Overview

How to use

Provide the document to parse
Specify what you want to extract (text, tables, figures, etc.)
I'll parse it and return structured data

Example prompts:

"Parse this PDF and extract all tables"
"Convert this academic paper to structured markdown"
"Extract figures and captions from this document"
"Parse this report preserving the document structure"

Domain Knowledge

docling Fundamentals

from docling.document_converter import DocumentConverter

# Initialize the converter
converter = DocumentConverter()

# Convert a document
result = converter.convert("document.pdf")

# Access the parsed content
doc = result.document
print(doc.export_to_markdown())

Supported Formats

| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Native and scanned |
| Word | .docx | Full structure preserved |
| PowerPoint | .pptx | Slides as sections |
| Images | .png, .jpg | OCR + layout analysis |
| HTML | .html | Structure preserved |
Basic Usage

from docling.document_converter import DocumentConverter

# Create the converter
converter = DocumentConverter()

# Convert a single document
result = converter.convert("report.pdf")

# Access the document
doc = result.document

# Export options
markdown = doc.export_to_markdown()
text = doc.export_to_text()
json_doc = doc.export_to_dict()

Advanced Configuration

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Configure the PDF pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Create the converter with options; PDF pipeline options are passed per input format
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    },
)

result = converter.convert("document.pdf")

Document Structure

# Document hierarchy
doc = result.document

# Access metadata
print(doc.name)
print(doc.origin)

# Iterate through the content
for element in doc.iterate_items():
    print(f"Type: {element.type}")
    print(f"Text: {element.text}")

    if element.type == "table":
        print(f"Rows: {len(element.data.table_cells)}")

Extracting Tables

from docling.document_converter import DocumentConverter
import pandas as pd

def extract_tables(doc_path):
    """Extract all tables from a document."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    tables = []

    for element in doc.iterate_items():
        if element.type == "table":
            # Get the table data as a DataFrame
            table_data = element.export_to_dataframe()
            tables.append({
                'page': element.prov[0].page_no if element.prov else None,
                'dataframe': table_data
            })

    return tables

# Usage
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i+1} on page {table['page']}:")
    print(table['dataframe'])

Extracting Figures

import os

from docling.document_converter import DocumentConverter

def extract_figures(doc_path, output_dir):
    """Extract figures with captions."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    figures = []
    os.makedirs(output_dir, exist_ok=True)

    for element in doc.iterate_items():
        if element.type == "picture":
            figure_info = {
                'caption': element.caption if hasattr(element, 'caption') else None,
                'page': element.prov[0].page_no if element.prov else None,
            }

            # Save the image if available
            if hasattr(element, 'image'):
                img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
                element.image.save(img_path)
                figure_info['path'] = img_path

            figures.append(figure_info)

    return figures

Handling Multi-column Layouts

from docling.document_converter import DocumentConverter

def parse_multicolumn(doc_path):
    """Parse a document with a multi-column layout."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    # docling automatically handles column detection
    # Text is returned in reading order

    structured_content = []

    for element in doc.iterate_items():
        content_item = {
            'type': element.type,
            'text': element.text if hasattr(element, 'text') else None,
            'level': element.level if hasattr(element, 'level') else None,
        }

        # Add the bounding box if available
        if element.prov:
            content_item['bbox'] = element.prov[0].bbox
            content_item['page'] = element.prov[0].page_no

        structured_content.append(content_item)

    return structured_content

Export Formats

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Markdown export
markdown = doc.export_to_markdown()
with open("output.md", "w") as f:
    f.write(markdown)

# Plain text
text = doc.export_to_text()

# JSON/dict format
json_doc = doc.export_to_dict()

# HTML format (if supported)
# html = doc.export_to_html()
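
The dict export is not written to disk in the snippet above. A minimal sketch of one way to persist it, assuming the dict is JSON-serializable; `save_json` is a hypothetical helper and the `sample` dict is an invented stand-in for the result of `doc.export_to_dict()`:

```python
import json
import tempfile
from pathlib import Path

def save_json(doc_dict, out_path):
    """Write a dict export as UTF-8 JSON, keeping non-ASCII text readable."""
    Path(out_path).write_text(
        json.dumps(doc_dict, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )

# Invented stand-in for doc.export_to_dict()
sample = {"name": "report", "texts": [{"text": "Résumé", "page": 1}]}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "output.json"
    save_json(sample, out_file)
    restored = json.loads(out_file.read_text(encoding="utf-8"))
    print(restored == sample)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default, which matters for documents containing accented or non-Latin text.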

Batch Processing

from docling.document_converter import DocumentConverter
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def batch_parse(input_dir, output_dir, max_workers=4):
    """Parse multiple documents in parallel."""

    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    converter = DocumentConverter()

    def process_single(doc_path):
        try:
            result = converter.convert(str(doc_path))
            md = result.document.export_to_markdown()

            out_file = output_path / f"{doc_path.stem}.md"
            with open(out_file, 'w') as f:
                f.write(md)

            return {'file': str(doc_path), 'status': 'success'}
        except Exception as e:
            return {'file': str(doc_path), 'status': 'error', 'error': str(e)}

    docs = list(input_path.glob('*.pdf')) + list(input_path.glob('*.docx'))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_single, docs))

    return results
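
Because `batch_parse` returns one status dict per file instead of raising, the caller has to inspect the list to find failures. A minimal sketch of a summary step; `summarize` is a hypothetical helper and the sample results are invented, using the same keys as the function above:

```python
from collections import Counter

def summarize(results):
    """Count statuses and collect (file, error) pairs for failed documents."""
    counts = Counter(r["status"] for r in results)
    failed = [(r["file"], r["error"]) for r in results if r["status"] == "error"]
    return counts, failed

# Invented results in the shape returned by batch_parse
results = [
    {"file": "a.pdf", "status": "success"},
    {"file": "b.pdf", "status": "error", "error": "encrypted"},
]

counts, failed = summarize(results)
print(dict(counts))
for path, err in failed:
    print(f"failed: {path}: {err}")
```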

Best Practices

Use Appropriate Pipeline: Configure for your document type
Handle Large Documents: Process in chunks if needed
Verify Table Extraction: Complex tables may need review
Check OCR Quality: Enable OCR for scanned documents
Cache Results: Store parsed documents for reuse
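
The caching practice can be sketched with a content hash as the cache key, so that an edited file is parsed again while an unchanged one is not. This is an illustration under assumptions: `cache_key`, `parse_with_cache`, the `parse_cache` directory name, and the `parse_fn` callback are all hypothetical, and only markdown output is cached.

```python
import hashlib
from pathlib import Path

def cache_key(doc_path):
    """SHA-256 of the file bytes, so edited files get a new key."""
    return hashlib.sha256(Path(doc_path).read_bytes()).hexdigest()

def parse_with_cache(doc_path, parse_fn, cache_dir="parse_cache"):
    """Return cached markdown for doc_path, calling parse_fn only on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{cache_key(doc_path)}.md"
    if entry.exists():
        return entry.read_text(encoding="utf-8")
    markdown = parse_fn(doc_path)  # parse_fn takes a path and returns markdown text
    entry.write_text(markdown, encoding="utf-8")
    return markdown
```

Here `parse_fn` would be any function that takes a path and returns markdown, for example a small wrapper around the converter calls shown under Basic Usage.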

Common Patterns

Academic Paper Parser

from docling.document_converter import DocumentConverter

def parse_academic_paper(pdf_path):
    """Parse academic paper structure."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'references': [],
        'tables': [],
        'figures': []
    }

    current_section = None

    for element in doc.iterate_items():
        text = element.text if hasattr(element, 'text') else ''

        if element.type == 'title':
            paper['title'] = text

        elif element.type == 'heading':
            if 'abstract' in text.lower():
                current_section = 'abstract'
            elif 'reference' in text.lower():
                current_section = 'references'
            else:
                paper['sections'].append({
                    'title': text,
                    'content': ''
                })
                current_section = 'section'

        elif element.type == 'paragraph':
            if current_section == 'abstract':
                paper['abstract'] = text
            elif current_section == 'section' and paper['sections']:
                paper['sections'][-1]['content'] += text + '\n'

        elif element.type == 'table':
            paper['tables'].append({
                'caption': element.caption if hasattr(element, 'caption') else None,
                'data': element.export_to_dataframe() if hasattr(element, 'export_to_dataframe') else None
            })

    return paper

Report to Structured Data

from docling.document_converter import DocumentConverter

def parse_business_report(doc_path):
    """Parse a business report into a structured format."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    report = {
        'metadata': {
            'title': None,
            'date': None,
            'author': None
        },
        'executive_summary': None,
        'sections': [],
        'key_metrics': [],
        'recommendations': []
    }

    # Parse document structure
    for element in doc.iterate_items():
        # Implement parsing logic based on document structure
        pass

    return report

Examples

Example 1: Parse Financial Report

from docling.document_converter import DocumentConverter

def parse_financial_report(pdf_path):
    """Extract structured data from a financial report."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'notes': []
    }

    # Extract tables
    tables = []
    for element in doc.iterate_items():
        if element.type == 'table':
            table_df = element.export_to_dataframe()

            # Identify the table type by keyword
            if 'revenue' in str(table_df).lower() or 'income' in str(table_df).lower():
                financial_data['income_statement'] = table_df
            elif 'asset' in str(table_df).lower() or 'liabilities' in str(table_df).lower():
                financial_data['balance_sheet'] = table_df
            elif 'cash' in str(table_df).lower():
                financial_data['cash_flow'] = table_df
            else:
                tables.append(table_df)

    # Export markdown for notes
    financial_data['markdown'] = doc.export_to_markdown()

    return financial_data

report = parse_financial_report('annual_report.pdf')
print("Income Statement:")
print(report['income_statement'])
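
The keyword checks in this example run in a fixed order, so a table is assigned to the first statement type whose keyword appears anywhere in it. The sketch below isolates that logic on plain strings to show the effect; `classify_table` is a hypothetical helper and the sample table text is invented.

```python
def classify_table(table_text):
    """Keyword classification in the same order as parse_financial_report."""
    text = table_text.lower()
    if 'revenue' in text or 'income' in text:
        return 'income_statement'
    elif 'asset' in text or 'liabilities' in text:
        return 'balance_sheet'
    elif 'cash' in text:
        return 'cash_flow'
    return 'other'

# A balance sheet is recognized as expected
print(classify_table("Total assets | Total liabilities"))

# A cash flow table that starts from net income matches 'income' first
print(classify_table("Net income | Cash from operating activities"))
```

The second call returns `income_statement`, which is why keyword-classified tables are worth a manual check before they are used in analysis.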

Example 2: Technical Documentation Parser

from docling.document_converter import DocumentConverter

def parse_technical_docs(doc_path):
    """Parse technical documentation."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    documentation = {
        'title': None,
        'version': None,
        'sections': [],
        'code_blocks': [],
        'diagrams': []
    }

    current_section = None

    for element in doc.iterate_items():
        if element.type == 'title':
            documentation['title'] = element.text

        elif element.type == 'heading':
            current_section = {
                'title': element.text,
                'level': element.level if hasattr(element, 'level') else 1,
                'content': []
            }
            documentation['sections'].append(current_section)

        elif element.type == 'code':
            if current_section:
                current_section['content'].append({
                    'type': 'code',
                    'content': element.text
                })
            documentation['code_blocks'].append(element.text)

        elif element.type == 'picture':
            documentation['diagrams'].append({
                'page': element.prov[0].page_no if element.prov else None,
                'caption': element.caption if hasattr(element, 'caption') else None
            })

    return documentation

docs = parse_technical_docs('api_documentation.pdf')
print(f"Title: {docs['title']}")
print(f"Sections: {len(docs['sections'])}")

Example 3: Contract Analysis

import re

from docling.document_converter import DocumentConverter

def analyze_contract(pdf_path):
    """Parse a contract document for key clauses."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    contract = {
        'parties': [],
        'clauses': [],
        'dates': [],
        'amounts': [],
        'full_text': doc.export_to_text()
    }

    # Extract dates (numeric such as 01/15/2025, or month-name such as March 3, 2026)
    date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
    contract['dates'] = re.findall(date_pattern, contract['full_text'], re.IGNORECASE)

    # Extract monetary amounts ($-prefixed, or followed by USD/dollars)
    amount_pattern = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
    contract['amounts'] = re.findall(amount_pattern, contract['full_text'], re.IGNORECASE)

    # Parse sections as clauses
    for element in doc.iterate_items():
        if element.type == 'heading':
            contract['clauses'].append({
                'title': element.text,
                'content': ''
            })
        elif element.type == 'paragraph' and contract['clauses']:
            contract['clauses'][-1]['content'] += element.text + '\n'

    return contract

contract_data = analyze_contract('agreement.pdf')
print(f"Key dates: {contract_data['dates']}")
print(f"Amounts: {contract_data['amounts']}")
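
The two regular expressions in this example can be tried on a plain string without parsing a PDF. The sketch below copies the patterns unchanged and runs them on an invented sentence; the constant names and the sample text are illustrative only.

```python
import re

# Patterns copied from analyze_contract
DATE_PATTERN = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
AMOUNT_PATTERN = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'

# Invented sample text
sample = (
    "This Agreement is effective 01/15/2025 and renews on March 3, 2026. "
    "Fees are $18,000 one-time plus 250 USD per seat."
)

dates = re.findall(DATE_PATTERN, sample, re.IGNORECASE)
amounts = re.findall(AMOUNT_PATTERN, sample, re.IGNORECASE)

print(dates)    # ['01/15/2025', 'March 3, 2026']
print(amounts)  # ['$18,000', '250 USD']
```

One behavior to watch for: because `[\d,]+` accepts commas anywhere, an amount followed directly by a comma, as in "$18,000, payable", is captured as `$18,000,` with the trailing comma included. Stripping trailing punctuation from each match is a simple guard.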

Limitations

Very large documents may require chunking
Handwritten content needs OCR preprocessing
Complex nested tables may need manual review
Some PDF types (encrypted) are not supported
GPU recommended para best performance
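
For the chunking limitation, one simple approach is to split a large document into fixed-size page ranges and parse each range separately. A minimal sketch of the range calculation only; `page_chunks` is a hypothetical helper, and how each range is handed to the parser (for example by splitting the PDF beforehand) is left open.

```python
def page_chunks(total_pages, chunk_size):
    """Yield inclusive 1-based (start, end) page ranges covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total_pages)

# A 25-page document in chunks of 10 pages
print(list(page_chunks(total_pages=25, chunk_size=10)))
# [(1, 10), (11, 20), (21, 25)]
```

Tables or paragraphs that span a chunk boundary may be split, so boundary pages are good candidates for manual review.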

Installation

pip install docling

# For full functionality
pip install docling[all]

# For OCR support
pip install docling[ocr]

Resources

Reference documents

═══════════════════════════════════════════════════════════════════════════════

CLAUDE OFFICE SKILL - Enhanced Metadata v2.0

═══════════════════════════════════════════════════════════════════════════════

Basic Information

name: doc-parser
description: ">"
version: "1.0"
author: claude-office-skills
license: MIT

Categorization

category: parsing
tags:
- parsing
- extraction
- layout
- docling
department: All

AI Model Compatibility

models:
  recommended:
  - claude-sonnet-4
  - claude-opus-4
  compatible:
  - claude-3-5-sonnet
  - gpt-4
  - gpt-4o

MCP Tools Integration

mcp:
  server: office-mcp
  tools:
  - analyze_document_structure
  - extract_text_from_pdf

Skill Capabilities

capabilities:
- document_parsing
- layout_analysis

Language Support

languages:

