---
name: openakita/skills@summarizer
description: Summarize content from any source, including URLs, local files, YouTube videos, and raw text. Use when the user asks to summarize a webpage, PDF, document, article, video, or any content. Supports multiple output formats (bullet points, executive summary, detailed notes) and configurable length. Can also extract raw content without summarization.
license: MIT
metadata:
  author: openakita
  version: "1.0.0"
  based_on: moltbot/moltbot/summarize
---

# Universal Content Summarizer

Summarize content from any source: URLs, local files, YouTube videos, clipboard text, and more. Flexible output formats with configurable depth and style.

## When to Use This Skill

- User says "summarize this" and provides a URL, file, or text
- User shares a link to a webpage/article and wants a quick overview
- User has a PDF or document they want condensed
- User wants to extract content from a URL without summarizing (extract-only mode)
- User needs different summary formats for different audiences (executive vs. technical)
- User wants to summarize multiple sources and combine insights
- User asks for a TL;DR of any content

## Prerequisites

### Core Dependencies

Basic text summarization has no mandatory external dependencies; the AI model handles it directly.

### For URL Content Extraction

The agent should use available web browsing/fetching tools to retrieve URL content. If running in an environment with shell access:

```bash
# For advanced HTML parsing (optional)
pip install beautifulsoup4 requests

# For PDF text extraction (optional)
pip install PyPDF2
# or
pip install pdfplumber
```

### For YouTube Videos

If the content source is a YouTube or Bilibili URL, this skill delegates to the youtube-summarizer or bilibili-watcher skill, respectively, when available. Otherwise, for YouTube it uses:

```bash
pip install youtube-transcript-api
```

### Supported Input Types

| Input Type | How to Provide | Notes |
|---|---|---|
| URL (webpage) | Paste the URL | HTML content extracted automatically |
| URL (YouTube) | Paste YouTube link | Transcript extracted via API |
| Local file (text) | File path | `.txt`, `.md`, `.rst`, `.csv` |
| Local file (PDF) | File path | Requires PyPDF2 or pdfplumber |
| Local file (HTML) | File path | Parsed with BeautifulSoup |
| Local file (DOCX) | File path | Requires python-docx |
| Raw text | Paste directly | Any length |
| Clipboard | "Summarize my clipboard" | If clipboard access available |

---

## Instructions

### Step 1: Identify the Content Source

Determine what the user wants summarized and how to access it:

```
Input Analysis:
1. Is it a URL? → Fetch the content
2. Is it a file path? → Read the file
3. Is it raw text? → Use directly
4. Is it a YouTube link? → Extract transcript
5. Is it multiple sources? → Process each, then combine
```

**URL Detection Patterns:**

```python
import re

def classify_input(text: str) -> str:
    """Classify the input type."""
    text = text.strip()

    # YouTube URLs (youtube.com already covers /shorts links)
    youtube_pattern = r'(youtube\.com|youtu\.be)'
    if re.search(youtube_pattern, text):
        return 'youtube'

    # Bilibili URLs
    if 'bilibili.com' in text or 'b23.tv' in text:
        return 'bilibili'

    # General URLs
    if re.match(r'https?://', text):
        return 'url'

    # File paths (case-insensitive extension match)
    if text.lower().endswith(('.pdf', '.txt', '.md', '.html', '.docx', '.rst', '.csv')):
        return 'file'

    # Raw text
    return 'text'
```
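
Item 5 of the input analysis (multiple sources) needs a way to find every source in one message before classifying each. The following is a minimal sketch, not part of the original skill: `find_sources` and `URL_RE` are hypothetical names, and it only handles `http(s)` URLs, not file paths.

```python
import re

# Hypothetical helper (an assumption, not defined by the skill):
# collect every http(s) URL mentioned in a user message.
URL_RE = re.compile(r'https?://\S+')

def find_sources(message: str) -> list[str]:
    """Return the distinct URLs in a message, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in URL_RE.finditer(message):
        # Drop punctuation that belongs to the sentence, not the URL.
        url = match.group(0).rstrip('.,;:!?)]')
        seen.setdefault(url, None)
    return list(seen)

msg = "Compare https://example.com/a and https://example.org/b, then https://example.com/a."
sources = find_sources(msg)
is_multi_source = len(sources) > 1
```

Each returned URL can then be classified on its own before extraction.
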

### Step 2: Extract Content

#### From URLs (Webpages)

Use the available web fetching tools to retrieve and parse HTML content. Extract the main article text, removing navigation, ads, footers, and other boilerplate.

**Key extraction goals:**
- Article title and author
- Publication date if available
- Main body text with structure preserved
- Images and captions (noted but not downloaded)
- Any embedded data tables

```python
from bs4 import BeautifulSoup
import requests

def extract_url_content(url: str) -> dict:
    """Extract main content from a URL."""
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (compatible; ContentSummarizer/1.0)'
    }, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    # Remove scripts, styles, and page chrome (nav, footer, header, aside)
    for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        tag.decompose()

    # Try to find the main article content
    article = soup.find('article') or soup.find('main') or soup.find('body')

    title = soup.find('title')
    title_text = title.get_text().strip() if title else 'Untitled'

    return {
        'title': title_text,
        'text': article.get_text(separator='\n', strip=True) if article else '',
        'url': url
    }
```
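
The extraction goals above also list author and publication date, which the function does not return. As a hedged sketch using only the standard library, the snippet below reads them from `<meta>` tags. The names `MetaCollector` and `extract_metadata` are hypothetical, and the keys it checks (`author`, `article:author`, `article:published_time`, `date`) are common conventions that many sites do not follow.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect <meta> name/property -> content pairs from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attr = dict(attrs)
        key = attr.get('name') or attr.get('property')
        content = attr.get('content')
        if key and content:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content.strip())

def extract_metadata(html: str) -> dict:
    """Best-effort author and publication date; missing fields are None."""
    parser = MetaCollector()
    parser.feed(html)
    meta = parser.meta
    return {
        'author': meta.get('author') or meta.get('article:author'),
        'published': meta.get('article:published_time') or meta.get('date'),
    }
```

The result can be merged into the dictionary returned by the URL extractor when the fields are present.
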

#### From Local Files

```python
from pathlib import Path

from bs4 import BeautifulSoup  # needed for the HTML branch

def extract_file_content(filepath: str) -> dict:
    """Extract text from various file formats."""
    path = Path(filepath)
    suffix = path.suffix.lower()

    if suffix in ('.txt', '.md', '.rst', '.csv'):
        text = path.read_text(encoding='utf-8')
        # Report the format without the leading dot, like the other branches
        return {'title': path.name, 'text': text, 'format': suffix.lstrip('.')}

    elif suffix == '.pdf':
        return extract_pdf(filepath)

    elif suffix == '.html':
        text = path.read_text(encoding='utf-8')
        soup = BeautifulSoup(text, 'html.parser')
        for tag in soup(['script', 'style']):
            tag.decompose()
        return {
            'title': path.name,
            'text': soup.get_text(separator='\n', strip=True),
            'format': 'html'
        }

    elif suffix == '.docx':
        return extract_docx(filepath)

    else:
        # Try reading as plain text
        try:
            text = path.read_text(encoding='utf-8')
            return {'title': path.name, 'text': text, 'format': 'unknown'}
        except UnicodeDecodeError:
            raise ValueError(f"Cannot read binary file: {filepath}")


def extract_pdf(filepath: str) -> dict:
    """Extract text from PDF using available libraries."""
    # Prefer pdfplumber when it is installed
    try:
        import pdfplumber
        with pdfplumber.open(filepath) as pdf:
            pages = [page.extract_text() or '' for page in pdf.pages]
            return {
                'title': Path(filepath).name,
                'text': '\n\n'.join(pages),
                'format': 'pdf',
                'pages': len(pdf.pages)
            }
    except ImportError:
        pass

    # Fall back to PyPDF2
    try:
        from PyPDF2 import PdfReader
        reader = PdfReader(filepath)
        pages = [page.extract_text() or '' for page in reader.pages]
        return {
            'title': Path(filepath).name,
            'text': '\n\n'.join(pages),
            'format': 'pdf',
            'pages': len(reader.pages)
        }
    except ImportError:
        raise RuntimeError("Install pdfplumber or PyPDF2 to read PDFs: pip install pdfplumber")


def extract_docx(filepath: str) -> dict:
    """Extract text from DOCX files."""
    try:
        from docx import Document
        doc = Document(filepath)
        # Skip empty paragraphs
        paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
        return {
            'title': Path(filepath).name,
            'text': '\n\n'.join(paragraphs),
            'format': 'docx'
        }
    except ImportError:
        raise RuntimeError("Install python-docx to read DOCX files: pip install python-docx")
```

#### From YouTube Videos

Delegate to the youtube-summarizer skill or use youtube-transcript-api directly:

```python
from youtube_transcript_api import YouTubeTranscriptApi

def extract_youtube_content(url: str) -> dict:
    """Extract transcript from YouTube video."""
    video_id = extract_video_id(url)  # See youtube-summarizer skill
    # Written against the static get_transcript API; newer releases of the
    # library changed this interface, so check the installed version.
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en', 'zh-Hans', 'ja'])
    # Join the transcript segments into one block of text
    text = ' '.join(entry['text'] for entry in transcript)
    return {
        'title': f'YouTube Video {video_id}',
        'text': text,
        'format': 'youtube',
        'segments': transcript
    }
```
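
The code above calls `extract_video_id`, which this document does not define. The sketch below is one possible stand-in, not the youtube-summarizer skill's implementation: it assumes the common `watch?v=`, `youtu.be/`, `shorts/`, `embed/`, and `live/` URL shapes and returns `None` for anything else.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

YOUTUBE_HOSTS = ('youtube.com', 'www.youtube.com', 'm.youtube.com')

def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID from common YouTube URL shapes."""
    # Accept URLs pasted without a scheme.
    parsed = urlparse(url if '://' in url else 'https://' + url)
    host = (parsed.hostname or '').lower()
    parts = [p for p in parsed.path.split('/') if p]

    if host == 'youtu.be':
        return parts[0] if parts else None
    if host in YOUTUBE_HOSTS:
        if parsed.path == '/watch':
            return parse_qs(parsed.query).get('v', [None])[0]
        if len(parts) >= 2 and parts[0] in ('shorts', 'embed', 'live'):
            return parts[1]
    return None
```

Returning `None` rather than raising lets the caller fall back to treating the link as a general URL.
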

### Step 3: Generate the Summary

Choose the output format based on the user's request; default to bullet points when none is specified.
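
One way to make that choice explicit is a keyword lookup over the user's wording. This is a minimal sketch under assumptions: the keyword lists and the `choose_format` name are illustrative, and the format names follow the options listed elsewhere in this document (bullet, executive, detailed, extract-only).

```python
# Assumed keyword lists; tune them to the phrases your users actually use.
FORMAT_KEYWORDS = {
    'extract-only': ('just extract', "don't summarize", 'do not summarize', 'extract only'),
    'executive': ('executive summary', 'exec summary', 'for leadership'),
    'detailed': ('detailed', 'comprehensive', 'in depth'),
    'bullet': ('bullet', 'key points', 'tl;dr'),
}

def choose_format(request: str, default: str = 'bullet') -> str:
    """Pick an output format from the user's wording, else the default."""
    lowered = request.lower()
    # Dicts keep insertion order, so extract-only is checked first.
    for fmt, keywords in FORMAT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return fmt
    return default
```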

---

## Output Formats

### Format 1: Bullet Points (Default)

Best for: Quick scanning, team sharing, Slack/email updates.

```
# Summary: [Title]

**Source**: [URL or filename]
**Length**: ~X words / X pages / X minutes

## Key Points
• [Most important finding/conclusion]
• [Second key point]
• [Third key point]
• [Fourth key point: include specific numbers/data if available]
• [Fifth key point]

## Notable Details
• [Interesting data point or quote]
• [Counter-argument or limitation mentioned]
```

**Prompt template:**
```
Summarize the following content into 5-8 bullet points. Each bullet should:
- Be self-contained (understandable without reading the full text)
- Include specific numbers, names, or dates when relevant
- Be ordered by importance (most important first)
- Be concise (1-2 sentences max)

Content:
{content}
```

### Format 2: Executive Summary

Best for: Leadership updates, decision-making, meeting prep.

```
# Executive Summary: [Title]

**Source**: [URL/file] | **Date**: [if available] | **Read time**: ~X min

## Bottom Line
[1-2 sentences: the single most important takeaway]

## Context
[2-3 sentences: why this matters, background]

## Key Findings
1. [Finding with supporting data]
2. [Finding with supporting data]
3. [Finding with supporting data]

## Implications
[What this means for the reader/team/organization]

## Recommended Actions
1. [Action item]
2. [Action item]
```

**Prompt template:**
```
Write an executive summary of the following content. Target audience: busy decision-makers
who need to understand the core message in under 2 minutes.

Structure:
1. Bottom Line (1-2 sentences: what's the one thing they need to know?)
2. Context (2-3 sentences: why does this matter?)
3. Key Findings (3-5 numbered points with data)
4. Implications (what this means going forward)
5. Recommended Actions (concrete next steps)

Content:
{content}
```

### Format 3: Detailed Notes

Best for: Research, studying, reference material.

```
# Detailed Notes: [Title]

**Source**: [URL/file]
**Summary date**: [today]
**Original length**: ~X words

## Overview
[3-5 sentence comprehensive overview]

## Section 1: [Topic]
[Detailed notes preserving key information, quotes, data]
- Sub-point with specifics
- Sub-point with specifics

## Section 2: [Topic]
[Detailed notes]

## Section 3: [Topic]
[Detailed notes]

## Key Quotes
> "[Exact quote]" - [Source/Author]
> "[Exact quote]" - [Source/Author]

## Data & Statistics
| Metric | Value | Context |
|---|---|---|
| [metric] | [value] | [context] |

## References & Links
- [Reference mentioned in the content]
```

### Format 4: Extract Only (No Summarization)

Best for: Content extraction for downstream processing.

When the user says "just extract" or "don't summarize", return the raw extracted text in clean markdown format without any summarization or analysis:

```
# Extracted Content: [Title]

**Source**: [URL/file]
**Extracted**: [timestamp]
**Word count**: X

---

[Full extracted text in clean markdown]
```

---

## Workflows

### Workflow 1: Quick URL Summary

User says: "Summarize https://example.com/article"

1. Detect input type: URL
2. Fetch and parse the webpage content
3. Generate bullet-point summary (default format)
4. Present with source attribution

### Workflow 2: PDF Summary

User says: "Summarize this PDF: /path/to/document.pdf"

1. Detect input type: file (PDF)
2. Extract text from all pages
3. Note total page count
4. Generate summary in requested format
5. Flag any extraction issues (scanned PDFs, images, etc.)

### Workflow 3: Custom Format Summary

User says: "Give me an executive summary of this article"

1. Detect input type and extract content
2. Use executive summary format
3. Include bottom line, key findings, and action items

### Workflow 4: Multi-Source Synthesis

User provides multiple URLs/files:

1. Extract content from each source
2. Summarize each independently
3. Create a synthesis section highlighting:
   - Common themes across sources
   - Contradictions or differing perspectives
   - Unique insights from each source
4. Present combined analysis

### Workflow 5: Configurable Length

User says: "Give me a 3-sentence summary" or "detailed 2000-word summary"

1. Extract content
2. Adjust summary length based on user specification:
   - "brief" / "TL;DR" → 2-3 sentences
   - "short" → 5-8 bullet points
   - "medium" (default) → Full structured summary
   - "detailed" / "comprehensive" → Detailed notes format with all specifics

### Workflow 6: Content Extraction Only

User says: "Just extract the text from this URL, don't summarize"

1. Fetch and parse the content
2. Clean up HTML/formatting artifacts
3. Return raw text in clean markdown
4. No summarization applied

### Workflow 7: YouTube/Video Summary

User shares a YouTube or Bilibili link:

1. Detect as video URL
2. Extract transcript (delegate to youtube-summarizer or bilibili-watcher if available)
3. Summarize transcript with timestamps
4. Format output appropriate to video content

---

## Configurable Options

When processing a summarization request, consider these adjustable parameters:

| Parameter | Options | Default |
|---|---|---|
| **Format** | bullet, executive, detailed, extract-only | bullet |
| **Length** | brief, short, medium, detailed | medium |
| **Language** | Output language code | Same as source |
| **Focus** | Specific topic/aspect to emphasize | None (general) |
| **Audience** | technical, general, executive, academic | general |
| **Include quotes** | yes/no | yes for detailed |
| **Include data** | yes/no | yes |
| **Max points** | Number of bullet points | 8 |

Users can specify these naturally:
- "Summarize in Chinese" → language: zh
- "Technical summary for engineers" → audience: technical
- "Just the top 3 points" → max_points: 3, length: brief
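
The parameters above can be carried through a request as one small settings object. The sketch below is illustrative only: `SummaryOptions` and its field names are assumptions, the defaults are copied from the table, and "yes for detailed" is read here as applying when either the format or the length is `detailed`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SummaryOptions:
    """Defaults mirror the options table; field names are illustrative."""
    format: str = 'bullet'
    length: str = 'medium'
    language: Optional[str] = None         # None means "same as source"
    focus: Optional[str] = None            # None means general
    audience: str = 'general'
    include_quotes: Optional[bool] = None  # None means "decide from format/length"
    include_data: bool = True
    max_points: int = 8

    def quotes_enabled(self) -> bool:
        """Resolve the quote setting, defaulting to yes only for detailed output."""
        if self.include_quotes is None:
            return 'detailed' in (self.format, self.length)
        return self.include_quotes

# "Just the top 3 points" from the examples above:
top_three = SummaryOptions(max_points=3, length='brief')
```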

---

## Common Pitfalls

### 1. Paywalled or Login-Required Content

**Problem**: Many news sites and platforms require subscriptions or login.

**Solutions**:
- Try the URL first; many sites allow limited free access
- Check for cached versions or alternative URLs
- Inform the user if content is inaccessible and suggest alternatives
- Never attempt to bypass paywalls

### 2. JavaScript-Rendered Content

**Problem**: Some pages load content dynamically via JavaScript, so a plain HTTP request returns an empty shell.

**Solutions**:
- Use browser-based fetching tools when available
- Try a text-only variant of the URL if the site offers one (for example, a `?format=text` parameter)
- Look for RSS feeds or API endpoints that serve the same content
- For SPAs, check if there's a server-rendered version

### 3. Very Long Content

**Problem**: Documents over 50,000 words may exceed model context limits.

**Solutions**:
- For PDFs: summarize page-by-page or chapter-by-chapter, then combine
- For webpages: extract only the main article content, skip comments and sidebars
- Use chunked processing:

```python
def chunk_text(text: str, max_chars: int = 30000) -> list[str]:
    """Split text into manageable chunks at paragraph boundaries."""
    paragraphs = text.split('\n\n')
    chunks = []
    current = []
    current_len = 0

    for para in paragraphs:
        # Flush the current chunk before it would exceed max_chars.
        # A single paragraph longer than max_chars still becomes its own chunk.
        if current_len + len(para) > max_chars and current:
            chunks.append('\n\n'.join(current))
            current = []
            current_len = 0
        current.append(para)
        current_len += len(para)

    if current:
        chunks.append('\n\n'.join(current))

    return chunks
```
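
Once the text is split, the "summarize each part, then combine" step can be sketched as below. This is an assumption-laden outline rather than the skill's actual method: `summarize_in_chunks` is a hypothetical name, and `summarize` stands for whatever call sends text to the model and returns its summary.

```python
from typing import Callable

def summarize_in_chunks(chunks: list[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    if not chunks:
        return ''
    if len(chunks) == 1:
        return summarize(chunks[0])
    partials = [summarize(chunk) for chunk in chunks]
    # Label each partial summary so the merge step can keep the original order.
    numbered = '\n\n'.join(f'[Part {i}] {p}' for i, p in enumerate(partials, 1))
    return summarize(numbered)
```

If the joined partial summaries are themselves too long, the same function can be applied again to a chunked version of them.
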

### 4. Non-Text Content

**Problem**: User provides a file that's primarily images, charts, or scanned documents.

**Solutions**:
- For scanned PDFs: inform user that OCR is needed (beyond basic scope)
- For image-heavy articles: note that visual content is not captured in the summary
- Suggest tools like Tesseract for OCR if needed
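
A scanned PDF usually shows up as pages that yield little or no extractable text. The heuristic below is a rough sketch: `looks_scanned`, the 50-character threshold, and the "more than half the pages" rule are all assumptions to adjust, and `pages` is expected to be a list of per-page text strings.

```python
def looks_scanned(pages: list[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic: True when most pages yielded almost no extractable text."""
    if not pages:
        return True
    sparse = sum(1 for text in pages if len(text.strip()) < min_chars_per_page)
    return sparse / len(pages) > 0.5
```

When it returns `True`, tell the user that OCR is needed instead of summarizing near-empty text.
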

### 5. Encoding Issues

**Problem**: Files with unusual encodings (GB2312, Shift-JIS, etc.) may not parse correctly.

**Solutions**:
- Try common encodings in order: UTF-8 (with or without BOM), GB2312, GBK, GB18030, Shift-JIS, Latin-1
- Use `chardet` library for automatic detection if available

```python
def read_with_fallback(filepath: str) -> str:
    """Read file trying multiple encodings."""
    # utf-8-sig goes first: it reads plain UTF-8 too and strips a BOM if present.
    # latin-1 goes last: it accepts any byte sequence, so it is a lossy catch-all.
    encodings = ['utf-8-sig', 'gb2312', 'gbk', 'gb18030', 'shift-jis', 'latin-1']
    for enc in encodings:
        try:
            with open(filepath, 'r', encoding=enc) as f:
                return f.read()
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"Cannot decode {filepath} with any known encoding")
```

### 6. Summarization Quality

**Problem**: Summaries may miss nuance, oversimplify, or hallucinate details.

**Solutions**:
- Always attribute the summary to the source
- For critical use cases, recommend the user verify key claims
- When uncertain about content interpretation, flag it explicitly
- Preserve specific numbers, dates, and names rather than generalizing
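
One cheap check for invented figures is to confirm that every number in the summary also appears in the source. The following is a heuristic sketch, not a guarantee of accuracy: `unsupported_numbers` is a hypothetical helper, and it compares numbers as written, so "12.5%" and "12.5 percent" count as different.

```python
import re

# Digits with optional thousands/decimal separators and an optional percent sign.
NUMBER_RE = re.compile(r'\d+(?:[.,]\d+)*%?')

def unsupported_numbers(summary: str, source: str) -> list[str]:
    """Return numbers that appear in the summary but nowhere in the source."""
    source_numbers = set(NUMBER_RE.findall(source))
    flagged: list[str] = []
    for num in NUMBER_RE.findall(summary):
        if num not in source_numbers and num not in flagged:
            flagged.append(num)
    return flagged

source = "Revenue grew 12.5% to 3,400 units in 2023."
summary = "Revenue grew 12.5% in 2023, reaching 4,300 units."
suspect = unsupported_numbers(summary, source)
```

Any flagged value is a prompt to re-read the source, not proof of an error.
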

### 7. Rate Limits on URL Fetching

**Problem**: Fetching many URLs quickly may trigger rate limits or blocks.

**Solutions**:
- Add delays between requests (1-2 seconds)
- Respect robots.txt directives
- Use appropriate User-Agent headers
- Cache fetched content to avoid re-fetching
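
The delay and cache points can be combined in a small wrapper around whatever fetch function is in use. This is a minimal sketch under assumptions: `PoliteFetcher` is a hypothetical class, the cache is in-memory only, and robots.txt handling is left out (the standard library's `urllib.robotparser` is one option for that).

```python
import time
from typing import Callable, Optional

class PoliteFetcher:
    """Wrap a fetch function with a minimum delay and an in-memory cache."""

    def __init__(self, fetch: Callable[[str], str], min_delay: float = 1.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fetch = fetch
        self._min_delay = min_delay
        self._clock = clock    # injectable so the delay logic can be tested
        self._sleep = sleep
        self._cache: dict = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        if url in self._cache:
            return self._cache[url]  # cached: no request, no delay
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        self._cache[url] = self._fetch(url)
        return self._cache[url]
```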

---

## Multi-AI Model Support

This skill works with any AI model capable of text summarization. The prompts and workflows are model-agnostic. For best results:

| Model Capability | Recommended Use |
|---|---|
| Large context window (100K+) | Full document summarization in one pass |
| Standard context (8K-32K) | Chunked processing with merge step |
| Fast inference | Batch processing of multiple sources |
| Multi-language | Cross-language summary generation |

The skill automatically adapts to the available model's capabilities:
- For large context models: send full content in one request
- For smaller context models: chunk, summarize each, then synthesize
- For multi-modal models: include image descriptions when available
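
The choice between one pass and chunking can be sketched as a size estimate against the model's context window. Everything here is an assumption to calibrate: `choose_strategy` is a hypothetical helper, four characters per token is only a rough rule of thumb for English text, and the reserve leaves room for the prompt and the model's output.

```python
def choose_strategy(text: str, context_tokens: int, reserve_tokens: int = 2000,
                    chars_per_token: float = 4.0) -> str:
    """Return 'single-pass' if the text likely fits the context, else 'chunked'."""
    estimated_tokens = len(text) / chars_per_token
    budget = context_tokens - reserve_tokens  # leave room for prompt and output
    return 'single-pass' if estimated_tokens <= budget else 'chunked'
```

For text in other languages, a model-specific tokenizer gives a more reliable count than the character estimate.
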