first-update

This commit is contained in:
2026-03-17 14:36:31 +08:00
parent 72f08aee7c
commit 4eddf05e79
516 changed files with 115270 additions and 1 deletions

View File

@@ -0,0 +1,79 @@
---
name: backend-algorithm-developer
description: "Use this agent when you need to develop backend services, implement algorithms, or build system components using Java, Python, or Go. Examples include: designing and implementing RESTful APIs, writing efficient algorithms for data processing, creating microservices, optimizing database queries, or building high-performance server applications."
model: sonnet
color: red
memory: user
---
You are an expert backend algorithm development engineer with deep proficiency in Java, Python, and Go. You specialize in designing and implementing efficient, scalable backend services and solving complex algorithmic problems.
**Core Responsibilities:**
- Design and implement robust backend services and APIs
- Write efficient algorithms optimized for performance and scalability
- Choose the appropriate language (Java/Python/Go) based on use case requirements
- Ensure code quality through proper testing and optimization
- Handle database design, caching, and performance tuning
**Language-Specific Expertise:**
- **Java**: Spring Boot, Spring Cloud, Maven/Gradle, concurrency handling, JVM optimization
- **Python**: FastAPI/Flask/Django, asyncio, data processing libraries, ML integration
- **Go**: Goroutines, channels, Gin/Echo frameworks, microservices patterns
**Development Approach:**
1. Understand requirements thoroughly before writing code
2. Choose the most appropriate technology stack for the specific use case
3. Write clean, well-documented, and maintainable code
4. Implement proper error handling and logging
5. Consider scalability, performance, and security at every step
6. Write unit tests and integration tests
7. Optimize critical code paths using appropriate data structures and algorithms
**Quality Standards:**
- Follow language-specific best practices and coding conventions
- Use appropriate design patterns
- Implement proper input validation and security measures
- Ensure code is testable and documented
- Consider edge cases and failure scenarios
**When to use each language:**
- Use **Java** for enterprise-scale applications, complex transaction systems, and when strong typing and ecosystem libraries are needed
- Use **Python** for rapid prototyping, data processing, ML integration, and scripts
- Use **Go** for high-concurrency services, microservices, and performance-critical components
Provide well-structured, production-ready code with clear explanations. Always consider the trade-offs of your technical choices.
# Persistent Agent Memory
You have a persistent Persistent Agent Memory directory at `C:\Users\caoxiaozhu\.claude\agent-memory\backend-algorithm-developer\`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence). Its contents persist across conversations.
As you work, consult your memory files to build on previous experience. When you encounter a mistake that seems like it could be common, check your Persistent Agent Memory for relevant notes — and if nothing is written yet, record what you learned.
Guidelines:
- `MEMORY.md` is always loaded into your system prompt — lines after 200 will be truncated, so keep it concise
- Create separate topic files (e.g., `debugging.md`, `patterns.md`) for detailed notes and link to them from MEMORY.md
- Update or remove memories that turn out to be wrong or outdated
- Organize memory semantically by topic, not chronologically
- Use the Write and Edit tools to update your memory files
What to save:
- Stable patterns and conventions confirmed across multiple interactions
- Key architectural decisions, important file paths, and project structure
- User preferences for workflow, tools, and communication style
- Solutions to recurring problems and debugging insights
What NOT to save:
- Session-specific context (current task details, in-progress work, temporary state)
- Information that might be incomplete — verify against project docs before writing
- Anything that duplicates or contradicts existing CLAUDE.md instructions
- Speculative or unverified conclusions from reading a single file
Explicit user requests:
- When the user asks you to remember something across sessions (e.g., "always use bun", "never auto-commit"), save it — no need to wait for multiple interactions
- When the user asks to forget or stop remembering something, find and remove the relevant entries from your memory files
- When the user corrects you on something you stated from memory, you MUST update or remove the incorrect entry. A correction means the stored memory is wrong — fix it at the source before continuing, so the same mistake does not repeat in future conversations.
- Since this memory is user-scope, keep learnings general since they apply across all projects
## MEMORY.md
Your MEMORY.md is currently empty. When you notice a pattern worth preserving across sessions, save it here. Anything in MEMORY.md will be included in your system prompt next time.

View File

@@ -0,0 +1,98 @@
---
name: elegant-frontend-designer
description: "Use this agent when you need to create elegant, visually stunning front-end designs for products. Examples include: designing a new landing page, creating a component library, improving existing UI/UX, building a design system, or crafting a complete product interface with modern, sophisticated aesthetics."
model: sonnet
color: purple
memory: project
---
You are an elite front-end designer with deep expertise in creating elegant, sophisticated user interfaces. You have mastered the art of combining aesthetics with functionality, understanding that true elegance lies in the balance between visual beauty and seamless user experience.
**Your Design Philosophy:**
- Embrace minimalism: Less is more. Every element must serve a purpose.
- Typography is paramount: Choose fonts that communicate personality while ensuring readability.
- Color should be intentional: Use restrained palettes with purposeful accent colors.
-Whitespace is your friend: Generous spacing creates breath and sophistication.
- Motion should feel natural: Animations should enhance, not distract.
- Consistency builds trust: A cohesive design system ensures harmony across the product.
**Technical Expertise:**
You are proficient in:
- Modern CSS (Flexbox, Grid, CSS Variables, Subgrid)
- CSS frameworks (Tailwind CSS, UnoCSS,styled-components)
- Design systems and component libraries
- Responsive and mobile-first design
- Micro-interactions and transitions
- CSS animations and keyframes
- Dark mode and theme switching
- Accessibility standards (WCAG)
**Design Style References:**
- Apple's human interface guidelines
- Material Design 3
- Minimalist Japanese design aesthetics
- Swiss design principles
- Modern neumorphism and glassmorphism (when appropriate)
- Subtle gradients and frosted glass effects
**When designing, you will:**
1. Analyze the requirements and determine the optimal design approach
2. Choose appropriate color palettes, typography, and spacing systems
3. Create responsive, mobile-first layouts
4. Implement elegant micro-interactions and transitions
5. Ensure accessibility and semantic HTML
6. Provide clean, well-structured code
7. Consider performance implications of visual effects
**Output Format:**
When presenting designs, provide:
- Conceptual overview and design rationale
- Color palette with hex codes
- Typography choices with font families and sizes
- Layout structure (can use ASCII or describe flex/grid)
- Component designs with states
- Animation specifications
- Code implementation (HTML/CSS/JS as appropriate)
**You will proactively ask clarifying questions when:**
- The target audience or use case is unclear
- Brand guidelines or existing design language conflict with elegant design suggestions
- Technical constraints might limit design choices
- The scope is too broad to provide focused recommendations
Be confident in your design decisions while remaining open to feedback and iteration.
# Persistent Agent Memory
You have a persistent Persistent Agent Memory directory at `D:\Code\Project\YG-Datasets\.claude\agent-memory\elegant-frontend-designer\`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence). Its contents persist across conversations.
As you work, consult your memory files to build on previous experience. When you encounter a mistake that seems like it could be common, check your Persistent Agent Memory for relevant notes — and if nothing is written yet, record what you learned.
Guidelines:
- `MEMORY.md` is always loaded into your system prompt — lines after 200 will be truncated, so keep it concise
- Create separate topic files (e.g., `debugging.md`, `patterns.md`) for detailed notes and link to them from MEMORY.md
- Update or remove memories that turn out to be wrong or outdated
- Organize memory semantically by topic, not chronologically
- Use the Write and Edit tools to update your memory files
What to save:
- Stable patterns and conventions confirmed across multiple interactions
- Key architectural decisions, important file paths, and project structure
- User preferences for workflow, tools, and communication style
- Solutions to recurring problems and debugging insights
What NOT to save:
- Session-specific context (current task details, in-progress work, temporary state)
- Information that might be incomplete — verify against project docs before writing
- Anything that duplicates or contradicts existing CLAUDE.md instructions
- Speculative or unverified conclusions from reading a single file
Explicit user requests:
- When the user asks you to remember something across sessions (e.g., "always use bun", "never auto-commit"), save it — no need to wait for multiple interactions
- When the user asks to forget or stop remembering something, find and remove the relevant entries from your memory files
- When the user corrects you on something you stated from memory, you MUST update or remove the incorrect entry. A correction means the stored memory is wrong — fix it at the source before continuing, so the same mistake does not repeat in future conversations.
- Since this memory is project-scope and shared with your team via version control, tailor your memories to this project
## MEMORY.md
Your MEMORY.md is currently empty. When you notice a pattern worth preserving across sessions, save it here. Anything in MEMORY.md will be included in your system prompt next time.

View File

@@ -0,0 +1,94 @@
---
name: robustness-tester-submitter
description: "Use this agent when you need to validate code quality before submission, including testing robustness, error handling, edge cases, and submitting code to repositories. Examples:\\n- <example>After writing a new function, use this agent to test boundary conditions, invalid inputs, and error scenarios to ensure the code handles them gracefully.</example>\\n- <example>Before committing code to the repository, use this agent to run comprehensive robustness tests and submit the validated code.</example>\\n- <example>When refactoring code, use this agent to verify the changes don't introduce new vulnerabilities or failure points.</example>"
tools: Glob, Grep, Read, WebFetch, WebSearch
model: opus
color: yellow
memory: project
---
You are a senior QA engineer and code robustness expert specializing in testing software reliability and handling code submission workflows.
**Core Responsibilities:**
1. **Robustness Testing**: Evaluate code for resilience against:
- Edge cases and boundary conditions
- Invalid or unexpected inputs
- Race conditions and concurrency issues
- Resource exhaustion (memory, CPU, file handles)
- Network failures and timeouts
- Error handling completeness
2. **Code Submission**: Handle the process of committing and pushing code to repositories, including:
- Running pre-submission checks
- Creating meaningful commit messages
- Following repository conventions
- Handling merge conflicts if needed
**Testing Methodologies:**
- **Boundary Value Analysis**: Test at and beyond input limits
- **Equivalence Partitioning**: Group inputs into valid/invalid partitions
- **Fault Injection**: Introduce failures to test recovery mechanisms
- **Stress Testing**: Push code beyond normal operational limits
- **Negative Testing**: Verify proper handling of invalid scenarios
**Quality Standards:**
- All critical paths must have proper error handling
- Input validation must occur at entry points
- Resource cleanup must be guaranteed (use defer, finally, etc.)
- Concurrent code must have proper synchronization
- External dependencies should have appropriate timeouts and fallbacks
**Submission Process:**
1. Run all existing tests to ensure no regressions
2. Execute robustness test suite
3. Verify code passes linting and formatting standards
4. Stage changes with appropriate git commands
5. Create descriptive commit messages following conventional commits format
6. Push to remote repository
**Output Expectations:**
- Provide detailed test results with pass/fail status
- Document any robustness issues found with severity levels
- Suggest specific fixes for identified problems
- Confirm successful submission with commit hash
**Update your agent memory** as you discover common robustness patterns, testing strategies, and code submission workflows. Record:
- Common failure modes in different code patterns
- Effective test cases that catch edge case bugs
- Repository-specific submission conventions
- Successful robustness testing approaches
# Persistent Agent Memory
You have a persistent Persistent Agent Memory directory at `D:\Code\Project\YG-Datasets\.claude\agent-memory\robustness-tester-submitter\`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence). Its contents persist across conversations.
As you work, consult your memory files to build on previous experience. When you encounter a mistake that seems like it could be common, check your Persistent Agent Memory for relevant notes — and if nothing is written yet, record what you learned.
Guidelines:
- `MEMORY.md` is always loaded into your system prompt — lines after 200 will be truncated, so keep it concise
- Create separate topic files (e.g., `debugging.md`, `patterns.md`) for detailed notes and link to them from MEMORY.md
- Update or remove memories that turn out to be wrong or outdated
- Organize memory semantically by topic, not chronologically
- Use the Write and Edit tools to update your memory files
What to save:
- Stable patterns and conventions confirmed across multiple interactions
- Key architectural decisions, important file paths, and project structure
- User preferences for workflow, tools, and communication style
- Solutions to recurring problems and debugging insights
What NOT to save:
- Session-specific context (current task details, in-progress work, temporary state)
- Information that might be incomplete — verify against project docs before writing
- Anything that duplicates or contradicts existing CLAUDE.md instructions
- Speculative or unverified conclusions from reading a single file
Explicit user requests:
- When the user asks you to remember something across sessions (e.g., "always use bun", "never auto-commit"), save it — no need to wait for multiple interactions
- When the user asks to forget or stop remembering something, find and remove the relevant entries from your memory files
- When the user corrects you on something you stated from memory, you MUST update or remove the incorrect entry. A correction means the stored memory is wrong — fix it at the source before continuing, so the same mistake does not repeat in future conversations.
- Since this memory is project-scope and shared with your team via version control, tailor your memories to this project
## MEMORY.md
Your MEMORY.md is currently empty. When you notice a pattern worth preserving across sessions, save it here. Anything in MEMORY.md will be included in your system prompt next time.

View File

@@ -0,0 +1,95 @@
---
name: ux-ui-requirements-analyst
description: "Use this agent when you need to analyze user requirements, evaluate UX/UI design quality, assess interface reasonableness, provide recommendations for improving user experience, or review design consistency and usability in a project."
tools: Glob, Grep, Read, WebFetch, WebSearch
model: sonnet
color: blue
memory: project
---
You are an expert Requirements Analyst specializing in UX/UI evaluation and interface design analysis. Your role is to help projects thoroughly analyze user requirements, evaluate the quality and reasonableness of UX/UI designs, and provide actionable recommendations for improvement.
**Your expertise includes:**
- User experience (UX) analysis and best practices
- User interface (UI) design principles and standards
- Interface usability and reasonableness evaluation
- User requirements gathering and analysis
- Design consistency and coherence assessment
- Accessibility considerations (WCAG guidelines)
- User flow and journey mapping
- Information architecture evaluation
**Your approach to analysis:**
1. Examine the design or requirements from multiple perspectives:
- Visual hierarchy and layout structure
- Color scheme, typography, and visual consistency
- Interactive elements and feedback mechanisms
- Navigation and information architecture
- Consistency across different screens/pages
- Accessibility and inclusivity
- Overall user satisfaction and task efficiency
2. For each analysis, identify:
- Strengths and good practices
- Issues, pain points, or potential improvements
- Specific, actionable recommendations
- Priority of improvements based on user impact
3. Provide rationale for your recommendations, referencing established UX/UI principles and best practices when possible.
**When analyzing interface reasonableness:**
- Evaluate if the interface aligns with user expectations and mental models
- Check if workflows are intuitive and efficient
- Assess if error prevention and recovery mechanisms are adequate
- Verify that key features are easily discoverable
- Consider the learning curve for new users
**Important guidelines:**
- Ask clarifying questions when project context, target users, or business objectives are unclear
- Consider both user needs and technical feasibility in recommendations
- Provide concrete examples or references to design patterns when helpful
- Be constructive and solution-oriented in your feedback
- When analyzing existing designs, be specific about what works and what doesn't
**Output format:**
Structure your analysis clearly with:
- Summary of findings
- Strengths identified
- Issues/areas for improvement (prioritized)
- Specific recommendations with rationale
- Optional: Questions for further clarification
# Persistent Agent Memory
You have a persistent Persistent Agent Memory directory at `D:\Code\Project\YG-Datasets\.claude\agent-memory\ux-ui-requirements-analyst\`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence). Its contents persist across conversations.
As you work, consult your memory files to build on previous experience. When you encounter a mistake that seems like it could be common, check your Persistent Agent Memory for relevant notes — and if nothing is written yet, record what you learned.
Guidelines:
- `MEMORY.md` is always loaded into your system prompt — lines after 200 will be truncated, so keep it concise
- Create separate topic files (e.g., `debugging.md`, `patterns.md`) for detailed notes and link to them from MEMORY.md
- Update or remove memories that turn out to be wrong or outdated
- Organize memory semantically by topic, not chronologically
- Use the Write and Edit tools to update your memory files
What to save:
- Stable patterns and conventions confirmed across multiple interactions
- Key architectural decisions, important file paths, and project structure
- User preferences for workflow, tools, and communication style
- Solutions to recurring problems and debugging insights
What NOT to save:
- Session-specific context (current task details, in-progress work, temporary state)
- Information that might be incomplete — verify against project docs before writing
- Anything that duplicates or contradicts existing CLAUDE.md instructions
- Speculative or unverified conclusions from reading a single file
Explicit user requests:
- When the user asks you to remember something across sessions (e.g., "always use bun", "never auto-commit"), save it — no need to wait for multiple interactions
- When the user asks to forget or stop remembering something, find and remove the relevant entries from your memory files
- When the user corrects you on something you stated from memory, you MUST update or remove the incorrect entry. A correction means the stored memory is wrong — fix it at the source before continuing, so the same mistake does not repeat in future conversations.
- Since this memory is project-scope and shared with your team via version control, tailor your memories to this project
## MEMORY.md
Your MEMORY.md is currently empty. When you notice a pattern worth preserving across sessions, save it here. Anything in MEMORY.md will be included in your system prompt next time.

12
.gitignore vendored
View File

@@ -1,3 +1,15 @@
# Node.js
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
# Package lock files (optional - uncomment if you want to ignore them)
# package-lock.json
# yarn.lock
# pnpm-lock.yaml
# ---> Python
# Byte-compiled / optimized / DLL files
__pycache__/

View File

@@ -1,2 +1,62 @@
# YG-Datasets
# YG-Dataset 本地启动指南
## 快速启动
### 1. 安装后端依赖
```bash
cd backend
pip install -r requirements.txt
```
### 2. 启动后端
```bash
cd backend
uvicorn app.main:app --reload --port 8000
```
后端地址: http://localhost:8000
API 文档: http://localhost:8000/docs
### 3. 安装前端依赖
```bash
cd frontend
npm install
```
### 4. 启动前端
```bash
npm run dev
```
前端地址: http://localhost:3000
---
## 目录结构
```
YG-Datasets/
├── backend/ # FastAPI 后端
│ ├── app/
│ │ ├── api/v1/ # API 路由
│ │ ├── models/ # 数据库模型
│ │ └── services/ # 业务逻辑
│ └── requirements.txt
├── frontend/ # Vue 3 前端
│ ├── src/
│ │ ├── views/ # 页面
│ │ └── api/ # API 封装
│ └── package.json
└── uploads/ # 上传文件存储目录
```
## 默认配置
- 数据库: SQLite (`backend/ygdataset.db`)
- 上传目录: `backend/uploads/`
- 后端端口: 8000
- 前端端口: 3000

27
backend/Dockerfile Normal file
View File

@@ -0,0 +1,27 @@
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
libpq-dev \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Create uploads directory
RUN mkdir -p uploads
# Expose port
EXPOSE 8000
# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]

View File

@@ -0,0 +1,3 @@
"""
API module initialization
"""

View File

@@ -0,0 +1,17 @@
"""
API v1 Router
"""
from fastapi import APIRouter
from app.api.v1 import files, projects, chunks, questions, datasets, eval
api_router = APIRouter()
# Include sub-routers
api_router.include_router(projects.router, prefix="/projects", tags=["projects"])
api_router.include_router(files.router, prefix="/files", tags=["files"])
api_router.include_router(chunks.router, prefix="/chunks", tags=["chunks"])
api_router.include_router(questions.router, prefix="/questions", tags=["questions"])
api_router.include_router(datasets.router, prefix="/datasets", tags=["datasets"])
api_router.include_router(eval.router, prefix="/eval", tags=["eval"])

View File

@@ -0,0 +1,182 @@
"""
Chunks API Router
"""
from typing import List, Optional
from uuid import UUID
from pydantic import BaseModel
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from app.core.database import get_db
from app.models.models import Chunk, File
from app.schemas.base import ChunkCreate, ChunkResponse
from app.services.text_splitter.splitter import get_splitter
from app.services.file_processor.pdf_processor import process_pdf
from app.services.file_processor.docx_processor import process_docx
from app.services.file_processor.excel_processor import process_csv, process_excel
router = APIRouter()
class SplitRequest(BaseModel):
"""Request model for splitting text"""
file_id: Optional[UUID] = None
method: str = "recursive"
chunk_size: int = 500
overlap: int = 50
separator: Optional[str] = None
class ChunkListResponse(BaseModel):
"""Response for chunk list"""
chunks: List[ChunkResponse]
total: int
def process_file_by_type(file: File) -> str:
"""Process file based on its type"""
if not file.file_path:
raise HTTPException(status_code=400, detail="File path not found")
processors = {
"pdf": process_pdf,
"docx": process_docx,
"xlsx": process_excel,
"csv": process_csv,
}
processor = processors.get(file.file_type)
if not processor:
# Return raw text for txt, md files
with open(file.file_path, 'r', encoding='utf-8') as f:
return f.read()
return processor(file.file_path)
@router.post("/split", response_model=dict)
async def split_text(
project_id: UUID,
request: SplitRequest,
db: AsyncSession = Depends(get_db)
):
"""Split text into chunks"""
# Get file
if request.file_id:
result = await db.execute(
select(File).where(File.id == request.file_id, File.project_id == project_id)
)
file = result.scalar_one_or_none()
if not file:
raise HTTPException(status_code=404, detail="File not found")
# Process file
text = process_file_by_type(file)
# Update file status
file.status = "processing"
await db.commit()
else:
raise HTTPException(status_code=400, detail="file_id is required")
# Split text
kwargs = {"chunk_size": request.chunk_size, "overlap": request.overlap}
if request.method == "custom" and request.separator:
kwargs["separator"] = request.separator
splitter = get_splitter(request.method, **kwargs)
split_results = splitter.split(text)
# Save chunks
chunks = []
for chunk_data in split_results:
db_chunk = Chunk(
project_id=project_id,
file_id=file.id,
name=chunk_data.get("name", f"Chunk {chunk_data['index'] + 1}"),
content=chunk_data["content"],
word_count=chunk_data.get("word_count", len(chunk_data["content"].split()))
)
db.add(db_chunk)
chunks.append(db_chunk)
await db.commit()
# Update file status
file.status = "completed"
await db.commit()
return {"chunks": len(chunks), "message": f"Successfully split into {len(chunks)} chunks"}
@router.get("/", response_model=dict)
async def list_chunks(
project_id: UUID,
file_id: Optional[UUID] = Query(None),
db: AsyncSession = Depends(get_db)
):
"""List chunks for a project"""
query = select(Chunk).where(Chunk.project_id == project_id)
if file_id:
query = query.where(Chunk.file_id == file_id)
query = query.order_by(Chunk.created_at.desc())
result = await db.execute(query)
chunks = result.scalars().all()
return {
"chunks": [ChunkResponse.model_validate(c) for c in chunks],
"total": len(chunks)
}
@router.get("/{chunk_id}", response_model=dict)
async def get_chunk(project_id: UUID, chunk_id: UUID, db: AsyncSession = Depends(get_db)):
"""Get chunk by ID"""
result = await db.execute(
select(Chunk).where(Chunk.id == chunk_id, Chunk.project_id == project_id)
)
chunk = result.scalar_one_or_none()
if not chunk:
raise HTTPException(status_code=404, detail="Chunk not found")
return ChunkResponse.model_validate(chunk)
@router.put("/{chunk_id}", response_model=dict)
async def update_chunk(
project_id: UUID,
chunk_id: UUID,
chunk: ChunkCreate,
db: AsyncSession = Depends(get_db)
):
"""Update chunk"""
result = await db.execute(
select(Chunk).where(Chunk.id == chunk_id, Chunk.project_id == project_id)
)
db_chunk = result.scalar_one_or_none()
if not db_chunk:
raise HTTPException(status_code=404, detail="Chunk not found")
for key, value in chunk.model_dump(exclude_unset=True).items():
setattr(db_chunk, key, value)
await db.commit()
await db.refresh(db_chunk)
return ChunkResponse.model_validate(db_chunk)
@router.delete("/{chunk_id}", response_model=dict)
async def delete_chunk(project_id: UUID, chunk_id: UUID, db: AsyncSession = Depends(get_db)):
"""Delete chunk"""
result = await db.execute(
select(Chunk).where(Chunk.id == chunk_id, Chunk.project_id == project_id)
)
chunk = result.scalar_one_or_none()
if not chunk:
raise HTTPException(status_code=404, detail="Chunk not found")
await db.delete(chunk)
await db.commit()
return {"message": "Chunk deleted successfully"}

View File

@@ -0,0 +1,126 @@
"""
Datasets API Router
"""
from typing import List, Optional
from uuid import UUID
from pydantic import BaseModel
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, func
from app.core.database import get_db
from app.models.models import Dataset, Question
from app.schemas.base import DatasetCreate, DatasetResponse
router = APIRouter()
class ExportRequest(BaseModel):
"""Export request schema"""
format: str = "alpaca" # alpaca, sharegpt, llama_factory, json
@router.get("/", response_model=dict)
async def list_datasets(project_id: UUID, db: AsyncSession = Depends(get_db)):
"""List datasets for a project"""
result = await db.execute(
select(Dataset).where(Dataset.project_id == project_id).order_by(Dataset.created_at.desc())
)
datasets = result.scalars().all()
# Get question count for each dataset
dataset_list = []
for dataset in datasets:
dataset_data = DatasetResponse.model_validate(dataset)
# TODO: Count questions in dataset
dataset_data.question_count = 0
dataset_list.append(dataset_data)
return {"datasets": dataset_list}
@router.post("/", response_model=dict)
async def create_dataset(
project_id: UUID,
dataset: DatasetCreate,
db: AsyncSession = Depends(get_db)
):
"""Create a new dataset"""
db_dataset = Dataset(project_id=project_id, **dataset.model_dump())
db.add(db_dataset)
await db.commit()
await db.refresh(db_dataset)
return {"id": str(db_dataset.id)}
@router.get("/{dataset_id}", response_model=dict)
async def get_dataset(
project_id: UUID,
dataset_id: UUID,
db: AsyncSession = Depends(get_db)
):
"""Get dataset by ID"""
result = await db.execute(
select(Dataset).where(Dataset.id == dataset_id, Dataset.project_id == project_id)
)
dataset = result.scalar_one_or_none()
if not dataset:
raise HTTPException(status_code=404, detail="Dataset not found")
return DatasetResponse.model_validate(dataset)
@router.delete("/{dataset_id}", response_model=dict)
async def delete_dataset(
project_id: UUID,
dataset_id: UUID,
db: AsyncSession = Depends(get_db)
):
"""Delete dataset"""
result = await db.execute(
select(Dataset).where(Dataset.id == dataset_id, Dataset.project_id == project_id)
)
dataset = result.scalar_one_or_none()
if not dataset:
raise HTTPException(status_code=404, detail="Dataset not found")
await db.delete(dataset)
await db.commit()
return {"message": "Dataset deleted successfully"}
@router.post("/{dataset_id}/export")
async def export_dataset(
project_id: UUID,
dataset_id: UUID,
request: ExportRequest,
db: AsyncSession = Depends(get_db)
):
"""Export dataset in specified format"""
# TODO: Implement actual export logic
# Get dataset
result = await db.execute(
select(Dataset).where(Dataset.id == dataset_id, Dataset.project_id == project_id)
)
dataset = result.scalar_one_or_none()
if not dataset:
raise HTTPException(status_code=404, detail="Dataset not found")
# Get questions for this dataset (placeholder)
# In real implementation, would link questions to datasets
# Return sample data based on format
sample_data = [
{
"instruction": "这是一个示例指令",
"input": "",
"output": "这是一个示例输出"
}
]
if request.format == "json":
return sample_data
return {"data": sample_data, "format": request.format}

View File

@@ -0,0 +1,100 @@
"""
Evaluation API Router
"""
from typing import List, Optional
from uuid import UUID
from pydantic import BaseModel
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from app.core.database import get_db
from app.models.models import EvalDataset, Task
from app.schemas.base import EvalDatasetCreate, EvalDatasetResponse, TaskResponse
router = APIRouter()
class GenerateEvalRequest(BaseModel):
"""Request for generating evaluation dataset"""
name: str
question_type: str = "mixed"
count: int = 50
class RunEvalRequest(BaseModel):
"""Request for running evaluation"""
model_config_id: Optional[UUID] = None
@router.get("/", response_model=dict)
async def list_eval_datasets(project_id: UUID, db: AsyncSession = Depends(get_db)):
"""List evaluation datasets"""
result = await db.execute(
select(EvalDataset).where(EvalDataset.project_id == project_id).order_by(EvalDataset.created_at.desc())
)
datasets = result.scalars().all()
return {"datasets": [EvalDatasetResponse.model_validate(d) for d in datasets]}
@router.post("/", response_model=dict)
async def create_eval_dataset(
project_id: UUID,
request: GenerateEvalRequest,
db: AsyncSession = Depends(get_db)
):
"""Create evaluation dataset"""
db_dataset = EvalDataset(
project_id=project_id,
name=request.name,
question_type=request.question_type
)
db.add(db_dataset)
await db.commit()
await db.refresh(db_dataset)
return {"id": str(db_dataset.id)}
@router.post("/{eval_id}/evaluate", response_model=dict)
async def run_evaluation(
project_id: UUID,
eval_id: UUID,
request: RunEvalRequest,
db: AsyncSession = Depends(get_db)
):
"""Run evaluation on dataset"""
# Check dataset exists
result = await db.execute(
select(EvalDataset).where(EvalDataset.id == eval_id, EvalDataset.project_id == project_id)
)
dataset = result.scalar_one_or_none()
if not dataset:
raise HTTPException(status_code=404, detail="Evaluation dataset not found")
# Create evaluation task
task = Task(
project_id=project_id,
task_type="eval",
status="pending"
)
db.add(task)
await db.commit()
await db.refresh(task)
# TODO: Start evaluation in background
return {"task_id": str(task.id), "message": "Evaluation task started"}
@router.get("/results", response_model=dict)
async def get_eval_results(project_id: UUID, task_id: UUID, db: AsyncSession = Depends(get_db)):
"""Get evaluation results"""
result = await db.execute(
select(Task).where(Task.id == task_id, Task.project_id == project_id)
)
task = result.scalar_one_or_none()
if not task:
raise HTTPException(status_code=404, detail="Task not found")
return TaskResponse.model_validate(task)

View File

@@ -0,0 +1,110 @@
"""
Files API Router
"""
import os
import aiofiles
from pathlib import Path
from typing import List
from uuid import UUID
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, Form
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from app.core.database import get_db
from app.core.config import get_settings
from app.models.models import File
from app.schemas.base import FileResponse
settings = get_settings()
router = APIRouter()
# Ensure upload directory exists
UPLOAD_DIR = Path(settings.UPLOAD_DIR)
UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
def get_file_type(filename: str) -> str:
"""Get file type from extension"""
ext = filename.rsplit('.', 1)[-1].lower() if '.' in filename else ''
type_map = {
'pdf': 'pdf',
'docx': 'docx',
'doc': 'docx',
'xlsx': 'xlsx',
'xls': 'xlsx',
'csv': 'csv',
'epub': 'epub',
'md': 'md',
'markdown': 'md',
'txt': 'txt'
}
return type_map.get(ext, 'txt')
@router.post("/upload", response_model=dict)
async def upload_file(
project_id: UUID,
file: UploadFile = File(...),
db: AsyncSession = Depends(get_db)
):
"""Upload a file"""
# Save file to disk
file_path = UPLOAD_DIR / f"{project_id}_{file.filename}"
async with aiofiles.open(file_path, 'wb') as f:
content = await file.read()
await f.write(content)
# Create file record
db_file = File(
project_id=project_id,
filename=file.filename,
file_type=get_file_type(file.filename),
file_path=str(file_path),
size=len(content),
status="pending"
)
db.add(db_file)
await db.commit()
await db.refresh(db_file)
return {"id": str(db_file.id), "filename": db_file.filename, "status": db_file.status}
@router.get("/", response_model=dict)
async def list_files(project_id: UUID, db: AsyncSession = Depends(get_db)):
"""List files for a project"""
result = await db.execute(
select(File).where(File.project_id == project_id).order_by(File.created_at.desc())
)
files = result.scalars().all()
return {"files": [FileResponse.model_validate(f) for f in files]}
@router.get("/{file_id}", response_model=dict)
async def get_file(project_id: UUID, file_id: UUID, db: AsyncSession = Depends(get_db)):
"""Get file by ID"""
result = await db.execute(
select(File).where(File.id == file_id, File.project_id == project_id)
)
file = result.scalar_one_or_none()
if not file:
raise HTTPException(status_code=404, detail="File not found")
return FileResponse.model_validate(file)
@router.delete("/{file_id}", response_model=dict)
async def delete_file(project_id: UUID, file_id: UUID, db: AsyncSession = Depends(get_db)):
"""Delete file"""
result = await db.execute(
select(File).where(File.id == file_id, File.project_id == project_id)
)
file = result.scalar_one_or_none()
if not file:
raise HTTPException(status_code=404, detail="File not found")
# Delete file from disk
if file.file_path and os.path.exists(file.file_path):
os.remove(file.file_path)
await db.delete(file)
await db.commit()
return {"message": "File deleted successfully"}

View File

@@ -0,0 +1,74 @@
"""
Projects API Router
"""
from typing import List
from uuid import UUID
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from app.core.database import get_db
from app.models.models import Project
from app.schemas.base import (
ProjectCreate,
ProjectUpdate,
ProjectResponse
)
router = APIRouter()
@router.get("/", response_model=dict)
async def list_projects(db: AsyncSession = Depends(get_db)):
"""List all projects"""
result = await db.execute(select(Project).order_by(Project.created_at.desc()))
projects = result.scalars().all()
return {"projects": [ProjectResponse.model_validate(p) for p in projects]}
@router.post("/", response_model=dict)
async def create_project(project: ProjectCreate, db: AsyncSession = Depends(get_db)):
"""Create a new project"""
db_project = Project(**project.model_dump())
db.add(db_project)
await db.commit()
await db.refresh(db_project)
return {"id": str(db_project.id)}
@router.get("/{project_id}", response_model=dict)
async def get_project(project_id: UUID, db: AsyncSession = Depends(get_db)):
"""Get project by ID"""
result = await db.execute(select(Project).where(Project.id == project_id))
project = result.scalar_one_or_none()
if not project:
raise HTTPException(status_code=404, detail="Project not found")
return ProjectResponse.model_validate(project)
@router.put("/{project_id}", response_model=dict)
async def update_project(project_id: UUID, project: ProjectUpdate, db: AsyncSession = Depends(get_db)):
"""Update project"""
result = await db.execute(select(Project).where(Project.id == project_id))
db_project = result.scalar_one_or_none()
if not db_project:
raise HTTPException(status_code=404, detail="Project not found")
for key, value in project.model_dump(exclude_unset=True).items():
setattr(db_project, key, value)
await db.commit()
await db.refresh(db_project)
return ProjectResponse.model_validate(db_project)
@router.delete("/{project_id}", response_model=dict)
async def delete_project(project_id: UUID, db: AsyncSession = Depends(get_db)):
"""Delete project"""
result = await db.execute(select(Project).where(Project.id == project_id))
project = result.scalar_one_or_none()
if not project:
raise HTTPException(status_code=404, detail="Project not found")
await db.delete(project)
await db.commit()
return {"message": "Project deleted successfully"}

View File

@@ -0,0 +1,122 @@
"""
Questions API Router
"""
from typing import List, Optional
from uuid import UUID
from pydantic import BaseModel
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from app.core.database import get_db
from app.models.models import Question, Chunk
from app.schemas.base import QuestionCreate, QuestionResponse
router = APIRouter()
class GenerateRequest(BaseModel):
"""Request model for generating questions"""
chunk_ids: List[UUID] = []
count: int = 5
question_types: List[str] = ["fact", "summary"]
@router.post("/generate", response_model=dict)
async def generate_questions(
project_id: UUID,
request: GenerateRequest,
db: AsyncSession = Depends(get_db)
):
"""Generate questions from chunks using LLM"""
# TODO: Implement LLM-based question generation
# This is a placeholder that creates sample questions
if not request.chunk_ids:
raise HTTPException(status_code=400, detail="chunk_ids is required")
# Get chunks
result = await db.execute(
select(Chunk).where(Chunk.id.in_(request.chunk_ids), Chunk.project_id == project_id)
)
chunks = result.scalars().all()
if not chunks:
raise HTTPException(status_code=404, detail="No chunks found")
# Create sample questions (placeholder)
created_questions = []
for chunk in chunks:
for i in range(request.count):
question = Question(
project_id=project_id,
chunk_id=chunk.id,
content=f"这是关于「{chunk.name}」的问题 {i+1}",
answer=f"这是问题 {i+1} 的答案。",
question_type=request.question_types[0] if request.question_types else "fact",
source="generated"
)
db.add(question)
created_questions.append(question)
await db.commit()
return {
"questions": len(created_questions),
"message": f"Successfully generated {len(created_questions)} questions"
}
@router.get("/", response_model=dict)
async def list_questions(
project_id: UUID,
chunk_id: Optional[UUID] = Query(None),
db: AsyncSession = Depends(get_db)
):
"""List questions for a project"""
query = select(Question).where(Question.project_id == project_id)
if chunk_id:
query = query.where(Question.chunk_id == chunk_id)
result = await db.execute(query)
questions = result.scalars().all()
return {"questions": [QuestionResponse.model_validate(q) for q in questions]}
@router.put("/{question_id}", response_model=dict)
async def update_question(
project_id: UUID,
question_id: UUID,
question: QuestionCreate,
db: AsyncSession = Depends(get_db)
):
"""Update question"""
result = await db.execute(
select(Question).where(Question.id == question_id, Question.project_id == project_id)
)
db_question = result.scalar_one_or_none()
if not db_question:
raise HTTPException(status_code=404, detail="Question not found")
for key, value in question.model_dump(exclude_unset=True).items():
setattr(db_question, key, value)
await db.commit()
await db.refresh(db_question)
return QuestionResponse.model_validate(db_question)
@router.delete("/{question_id}", response_model=dict)
async def delete_question(project_id: UUID, question_id: UUID, db: AsyncSession = Depends(get_db)):
"""Delete question"""
result = await db.execute(
select(Question).where(Question.id == question_id, Question.project_id == project_id)
)
question = result.scalar_one_or_none()
if not question:
raise HTTPException(status_code=404, detail="Question not found")
await db.delete(question)
await db.commit()
return {"message": "Question deleted successfully"}

View File

@@ -0,0 +1,3 @@
"""
Core module initialization
"""

View File

@@ -0,0 +1,49 @@
"""
Application Configuration
"""
from functools import lru_cache
from pydantic_settings import BaseSettings
from pydantic import Field
class Settings(BaseSettings):
"""Application settings"""
# App
APP_NAME: str = "YG-Dataset"
DEBUG: bool = True
HOST: str = "0.0.0.0"
PORT: int = 8000
# Database - 使用 SQLite 进行开发/测试
# 生产环境可切换为 PostgreSQL
DATABASE_URL: str = Field(
default="sqlite:///./ygdataset.db",
description="Database connection URL (sqlite:// or postgresql+asyncpg://)"
)
DATABASE_URL_SYNC: str = Field(
default="sqlite:///./ygdataset.db",
description="Synchronous database connection URL"
)
# Redis
REDIS_URL: str = "redis://localhost:6379/0"
# File Storage
UPLOAD_DIR: str = "./uploads"
MAX_FILE_SIZE: int = 100 * 1024 * 1024 # 100MB
# LLM Settings
DEFAULT_MODEL_PROVIDER: str = "openai"
DEFAULT_MODEL_NAME: str = "gpt-4o-mini"
class Config:
env_file = ".env"
extra = "allow"
@lru_cache()
def get_settings() -> Settings:
"""Get cached settings"""
return Settings()

View File

@@ -0,0 +1,68 @@
"""
Database Configuration and Session Management
支持 SQLite 和 PostgreSQL
"""
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker
from sqlalchemy.orm import DeclarativeBase
from sqlalchemy import create_engine
from app.core.config import get_settings
settings = get_settings()
def get_engine_config():
"""根据数据库类型返回引擎配置"""
if settings.DATABASE_URL.startswith("sqlite"):
return {"echo": settings.DEBUG}
else:
return {
"echo": settings.DEBUG,
"pool_pre_ping": True,
"pool_size": 10,
"max_overflow": 20,
}
# Async engine for FastAPI
async_engine = create_async_engine(
settings.DATABASE_URL,
**get_engine_config()
)
# Sync engine for migrations
sync_engine = create_engine(
settings.DATABASE_URL_SYNC,
echo=settings.DEBUG,
pool_pre_ping=True,
)
# Async session factory
AsyncSessionLocal = async_sessionmaker(
async_engine,
class_=AsyncSession,
expire_on_commit=False,
autocommit=False,
autoflush=False,
)
class Base(DeclarativeBase):
"""Base class for all models"""
pass
async def init_db():
"""Initialize database tables"""
async with async_engine.begin() as conn:
await conn.run_sync(Base.metadata.create_all)
async def get_db() -> AsyncSession:
"""Dependency for getting database session"""
async with AsyncSessionLocal() as session:
try:
yield session
finally:
await session.close()

58
backend/app/main.py Normal file
View File

@@ -0,0 +1,58 @@
"""
YG-Dataset Backend Application
FastAPI-based API server for dataset generation platform
"""
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.api.v1 import api_router
from app.core.config import settings
from app.core.database import init_db
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Application lifespan events"""
# Startup
await init_db()
yield
# Shutdown
pass
app = FastAPI(
title="YG-Dataset API",
description="Dataset Generation Platform API",
version="1.0.0",
lifespan=lifespan,
)
# CORS
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Include API routes
app.include_router(api_router, prefix="/api/v1")
@app.get("/health")
async def health_check():
"""Health check endpoint"""
return {"status": "healthy", "version": "1.0.0"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(
"app.main:app",
host=settings.HOST,
port=settings.PORT,
reload=settings.DEBUG,
)

View File

@@ -0,0 +1,3 @@
"""
Database Models
"""

View File

@@ -0,0 +1,19 @@
"""
Base Model with UUID support
"""
import uuid
from datetime import datetime
from sqlalchemy import Column, DateTime
from sqlalchemy.dialects.postgresql import UUID
from app.core.database import Base
class TimestampMixin:
"""Mixin for created_at and updated_at timestamps"""
created_at = Column(DateTime, default=datetime.utcnow, nullable=False)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow, nullable=False)
class UUIDMixin:
"""Mixin for UUID primary key"""
id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4, index=True)

View File

@@ -0,0 +1,161 @@
"""
Database Models for YG-Dataset
"""
from sqlalchemy import Column, String, Text, Integer, BigInteger, ForeignKey, JSON
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship
from app.core.database import Base
from app.models.base import UUIDMixin, TimestampMixin
class Project(Base, UUIDMixin, TimestampMixin):
"""Project model"""
__tablename__ = "projects"
name = Column(String(255), nullable=False)
description = Column(Text)
# Relationships
files = relationship("File", back_populates="project", cascade="all, delete-orphan")
chunks = relationship("Chunk", back_populates="project", cascade="all, delete-orphan")
tags = relationship("Tag", back_populates="project", cascade="all, delete-orphan")
datasets = relationship("Dataset", back_populates="project", cascade="all, delete-orphan")
eval_datasets = relationship("EvalDataset", back_populates="project", cascade="all, delete-orphan")
model_configs = relationship("ModelConfig", back_populates="project", cascade="all, delete-orphan")
tasks = relationship("Task", back_populates="project", cascade="all, delete-orphan")
class File(Base, UUIDMixin, TimestampMixin):
"""File model for uploaded documents"""
__tablename__ = "files"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"), nullable=False)
filename = Column(String(255), nullable=False)
file_type = Column(String(50), nullable=False) # pdf, docx, xlsx, csv, epub, md, txt
file_path = Column(String(500))
size = Column(BigInteger) # file size in bytes
status = Column(String(20), default="pending") # pending, processing, completed, failed
# Relationships
project = relationship("Project", back_populates="files")
chunks = relationship("Chunk", back_populates="file", cascade="all, delete-orphan")
class Chunk(Base, UUIDMixin, TimestampMixin):
"""Text chunk model after splitting"""
__tablename__ = "chunks"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"), nullable=False)
file_id = Column(UUID(as_uuid=True), ForeignKey("files.id", ondelete="CASCADE"))
name = Column(String(255))
content = Column(Text, nullable=False)
summary = Column(Text)
word_count = Column(Integer)
metadata = Column(JSON) # store additional info like headings, page numbers
# Relationships
project = relationship("Project", back_populates="chunks")
file = relationship("File", back_populates="chunks")
questions = relationship("Question", back_populates="chunk", cascade="all, delete-orphan")
chunk_tags = relationship("ChunkTag", back_populates="chunk", cascade="all, delete-orphan")
class Tag(Base, UUIDMixin, TimestampMixin):
"""Tag/Label model for categorizing content"""
__tablename__ = "tags"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"), nullable=False)
label = Column(String(255), nullable=False)
parent_id = Column(UUID(as_uuid=True), ForeignKey("tags.id", ondelete="CASCADE"))
color = Column(String(20)) # hex color code
# Relationships
project = relationship("Project", back_populates="tags")
parent = relationship("Tag", remote_side="Tag.id", back_populates="children")
children = relationship("Tag", back_populates="parent")
chunk_tags = relationship("ChunkTag", back_populates="tag")
class ChunkTag(Base, UUIDMixin):
"""Many-to-many relationship between chunks and tags"""
__tablename__ = "chunk_tags"
chunk_id = Column(UUID(as_uuid=True), ForeignKey("chunks.id", ondelete="CASCADE"), nullable=False)
tag_id = Column(UUID(as_uuid=True), ForeignKey("tags.id", ondelete="CASCADE"), nullable=False)
# Relationships
chunk = relationship("Chunk", back_populates="chunk_tags")
tag = relationship("Tag", back_populates="chunk_tags")
class Question(Base, UUIDMixin, TimestampMixin):
"""Question/QA pair model"""
__tablename__ = "questions"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"), nullable=False)
chunk_id = Column(UUID(as_uuid=True), ForeignKey("chunks.id", ondelete="CASCADE"))
content = Column(Text, nullable=False) # question content
answer = Column(Text) # answer content
question_type = Column(String(50)) # fact, summary, reasoning, etc.
source = Column(String(50), default="manual") # manual, generated
# Relationships
project = relationship("Project")
chunk = relationship("Chunk", back_populates="questions")
class Dataset(Base, UUIDMixin, TimestampMixin):
"""Dataset model"""
__tablename__ = "datasets"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"), nullable=False)
name = Column(String(255), nullable=False)
description = Column(Text)
dataset_type = Column(String(50)) # qa, conversation, instruction
metadata = Column(JSON)
# Relationships
project = relationship("Project", back_populates="datasets")
class EvalDataset(Base, UUIDMixin, TimestampMixin):
"""Evaluation dataset model"""
__tablename__ = "eval_datasets"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"), nullable=False)
name = Column(String(255), nullable=False)
question_type = Column(String(50)) # mixed, fact, reasoning
metadata = Column(JSON)
# Relationships
project = relationship("Project", back_populates="eval_datasets")
class ModelConfig(Base, UUIDMixin, TimestampMixin):
"""Model configuration for LLM providers"""
__tablename__ = "model_configs"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"), nullable=False)
provider = Column(String(50), nullable=False) # openai, anthropic, ollama, custom
model_name = Column(String(100))
api_key = Column(String(500))
api_base = Column(String(500))
is_default = Column(String(10), default="false")
# Relationships
project = relationship("Project", back_populates="model_configs")
class Task(Base, UUIDMixin, TimestampMixin):
"""Task model for background jobs"""
__tablename__ = "tasks"
project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE"))
task_type = Column(String(50)) # split, generate, eval, export
status = Column(String(20), default="pending") # pending, running, completed, failed
progress = Column(Integer, default=0) # 0-100
result = Column(JSON)
error = Column(Text)
# Relationships
project = relationship("Project", back_populates="tasks")

View File

@@ -0,0 +1,3 @@
"""
Pydantic Schemas
"""

170
backend/app/schemas/base.py Normal file
View File

@@ -0,0 +1,170 @@
"""
Base Pydantic schemas
"""
from datetime import datetime
from typing import Optional, Any
from uuid import UUID
from pydantic import BaseModel, ConfigDict
class TimestampMixin(BaseModel):
"""Mixin for timestamps"""
created_at: Optional[datetime] = None
updated_at: Optional[datetime] = None
class UUIDMixin(BaseModel):
"""Mixin for UUID"""
model_config = ConfigDict(from_attributes=True)
id: UUID
class ProjectBase(BaseModel):
"""Base project schema"""
name: str
description: Optional[str] = None
class ProjectCreate(ProjectBase):
"""Project create schema"""
pass
class ProjectUpdate(ProjectBase):
"""Project update schema"""
pass
class ProjectResponse(ProjectBase, UUIDMixin, TimestampMixin):
"""Project response schema"""
pass
class FileBase(BaseModel):
"""Base file schema"""
filename: str
file_type: str
size: Optional[int] = None
class FileResponse(FileBase, UUIDMixin, TimestampMixin):
"""File response schema"""
status: str
class ChunkBase(BaseModel):
"""Base chunk schema"""
name: Optional[str] = None
content: str
summary: Optional[str] = None
word_count: Optional[int] = None
class ChunkCreate(ChunkBase):
"""Chunk create schema"""
file_id: Optional[UUID] = None
class ChunkResponse(ChunkBase, UUIDMixin, TimestampMixin):
"""Chunk response schema"""
pass
class QuestionBase(BaseModel):
"""Base question schema"""
content: str
answer: Optional[str] = None
question_type: Optional[str] = None
class QuestionCreate(QuestionBase):
"""Question create schema"""
chunk_id: Optional[UUID] = None
class QuestionResponse(QuestionBase, UUIDMixin, TimestampMixin):
"""Question response schema"""
source: str
class DatasetBase(BaseModel):
"""Base dataset schema"""
name: str
description: Optional[str] = None
dataset_type: Optional[str] = None
class DatasetCreate(DatasetBase):
"""Dataset create schema"""
pass
class DatasetResponse(DatasetBase, UUIDMixin, TimestampMixin):
"""Dataset response schema"""
question_count: Optional[int] = None
class EvalDatasetBase(BaseModel):
"""Base eval dataset schema"""
name: str
question_type: Optional[str] = None
class EvalDatasetCreate(EvalDatasetBase):
"""Eval dataset create schema"""
pass
class EvalDatasetResponse(EvalDatasetBase, UUIDMixin, TimestampMixin):
"""Eval dataset response schema"""
pass
class TagBase(BaseModel):
"""Base tag schema"""
label: str
parent_id: Optional[UUID] = None
color: Optional[str] = None
class TagCreate(TagBase):
"""Tag create schema"""
pass
class TagResponse(TagBase, UUIDMixin, TimestampMixin):
"""Tag response schema"""
pass
class ModelConfigBase(BaseModel):
"""Base model config schema"""
provider: str
model_name: Optional[str] = None
api_key: Optional[str] = None
api_base: Optional[str] = None
is_default: Optional[str] = "false"
class ModelConfigCreate(ModelConfigBase):
"""Model config create schema"""
pass
class ModelConfigResponse(ModelConfigBase, UUIDMixin, TimestampMixin):
"""Model config response schema"""
pass
class TaskBase(BaseModel):
"""Base task schema"""
task_type: str
status: Optional[str] = "pending"
progress: Optional[int] = 0
class TaskResponse(TaskBase, UUIDMixin, TimestampMixin):
"""Task response schema"""
result: Optional[Any] = None
error: Optional[str] = None

View File

@@ -0,0 +1,3 @@
"""
Services module
"""

View File

@@ -0,0 +1,3 @@
"""
File Processing Services
"""

View File

@@ -0,0 +1,53 @@
"""
DOCX Text Extractor
"""
from docx import Document
from typing import Dict, List
class DOCXProcessor:
"""Extract text from DOCX files"""
def extract_text(self, file_path: str) -> str:
"""Extract all text from DOCX"""
doc = Document(file_path)
text_parts = []
for para in doc.paragraphs:
if para.text.strip():
text_parts.append(para.text)
# Also extract text from tables
for table in doc.tables:
for row in table.rows:
for cell in row.cells:
if cell.text.strip():
text_parts.append(cell.text)
return "\n\n".join(text_parts)
def extract_with_metadata(self, file_path: str) -> Dict:
"""Extract text with DOCX metadata"""
doc = Document(file_path)
result = {
"text": self.extract_text(file_path),
"paragraphs": len(doc.paragraphs),
"tables": len(doc.tables),
"sections": len(doc.sections),
"metadata": {
"author": doc.core_properties.author,
"title": doc.core_properties.title,
"subject": doc.core_properties.subject,
"created": doc.core_properties.created,
"modified": doc.core_properties.modified
}
}
return result
def process_docx(file_path: str) -> str:
"""Process DOCX file and return text"""
processor = DOCXProcessor()
return processor.extract_text(file_path)

View File

@@ -0,0 +1,66 @@
"""
Excel/CSV Text Extractor
"""
import pandas as pd
from typing import Dict, List
class ExcelProcessor:
"""Extract text from Excel and CSV files"""
def extract_csv(self, file_path: str) -> str:
"""Extract text from CSV file"""
df = pd.read_csv(file_path)
return self._dataframe_to_text(df)
def extract_excel(self, file_path: str, sheet_name: str = None) -> str:
"""Extract text from Excel file"""
if sheet_name:
df = pd.read_excel(file_path, sheet_name=sheet_name)
return self._dataframe_to_text(df)
else:
# Read all sheets
sheets = pd.read_excel(file_path, sheet_name=None)
text_parts = []
for sheet_name, df in sheets.items():
text_parts.append(f"=== Sheet: {sheet_name} ===\n")
text_parts.append(self._dataframe_to_text(df))
return "\n\n".join(text_parts)
def _dataframe_to_text(self, df: pd.DataFrame) -> str:
"""Convert DataFrame to readable text"""
text_parts = []
# Add column headers
if not df.empty:
text_parts.append(" | ".join(str(col) for col in df.columns))
text_parts.append("-" * len(text_parts[-1]))
# Add rows
for _, row in df.iterrows():
row_text = " | ".join(str(val) for val in row.values)
text_parts.append(row_text)
return "\n".join(text_parts)
def extract_all_sheets(self, file_path: str) -> Dict[str, str]:
"""Extract all sheets from Excel file"""
sheets = pd.read_excel(file_path, sheet_name=None)
return {name: self._dataframe_to_text(df) for name, df in sheets.items()}
def get_sheet_names(self, file_path: str) -> List[str]:
"""Get all sheet names from Excel file"""
xl = pd.ExcelFile(file_path)
return xl.sheet_names
def process_csv(file_path: str) -> str:
"""Process CSV file and return text"""
processor = ExcelProcessor()
return processor.extract_csv(file_path)
def process_excel(file_path: str) -> str:
"""Process Excel file and return text"""
processor = ExcelProcessor()
return processor.extract_excel(file_path)

View File

@@ -0,0 +1,65 @@
"""
PDF Text Extractor
"""
import pdfplumber
from typing import Dict, List, Optional
class PDFProcessor:
"""Extract text from PDF files"""
def extract_text(self, file_path: str) -> str:
"""Extract all text from PDF"""
text_parts = []
with pdfplumber.open(file_path) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
text = page.extract_text()
if text:
text_parts.append(f"--- Page {page_num} ---\n{text}")
return "\n\n".join(text_parts)
def extract_pages(self, file_path: str) -> List[Dict]:
"""Extract text page by page with metadata"""
pages = []
with pdfplumber.open(file_path) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
text = page.extract_text()
if text:
pages.append({
"page_number": page_num,
"text": text.strip(),
"word_count": len(text.split())
})
return pages
def extract_with_metadata(self, file_path: str) -> Dict:
"""Extract text with PDF metadata"""
result = {
"text": "",
"pages": [],
"metadata": {}
}
with pdfplumber.open(file_path) as pdf:
# Get metadata
result["metadata"] = {
"page_count": len(pdf.pages),
"metadata": pdf.metadata
}
# Extract pages
pages = self.extract_pages(file_path)
result["pages"] = pages
result["text"] = "\n\n".join([p["text"] for p in pages])
return result
def process_pdf(file_path: str) -> str:
"""Process PDF file and return text"""
processor = PDFProcessor()
return processor.extract_with_metadata(file_path)["text"]

View File

@@ -0,0 +1,3 @@
"""
Text Splitter Services
"""

View File

@@ -0,0 +1,248 @@
"""
Text Splitter
"""
import re
from typing import List, Dict, Optional
class TextSplitter:
"""Base text splitter"""
def __init__(self, chunk_size: int = 500, overlap: int = 50):
self.chunk_size = chunk_size
self.overlap = overlap
def split(self, text: str) -> List[Dict]:
"""Split text into chunks"""
raise NotImplementedError
class RecursiveTextSplitter(TextSplitter):
"""Recursive character text splitter"""
def __init__(self, chunk_size: int = 500, overlap: int = 50, separators: List[str] = None):
super().__init__(chunk_size, overlap)
self.separators = separators or ["\n\n", "\n", ". ", " ", ""]
def split(self, text: str) -> List[Dict]:
"""Split text recursively"""
chunks = []
current_chunk = ""
chunk_index = 0
for separator in self.separators:
if separator in text:
parts = text.split(separator)
for part in parts:
if len(current_chunk) + len(part) > self.chunk_size:
if current_chunk:
chunks.append({
"index": chunk_index,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
chunk_index += 1
# Handle overlap
if self.overlap > 0 and chunks:
overlap_text = " ".join(chunks[-1]["content"].split()[-self.overlap:])
current_chunk = overlap_text + separator + part
else:
current_chunk = part
else:
current_chunk += separator + part if current_chunk else part
if current_chunk:
chunks.append({
"index": chunk_index,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
break
else:
continue
return chunks
class MarkdownStructureSplitter(TextSplitter):
"""Split text based on Markdown structure (headings)"""
def __init__(self, chunk_size: int = 2000, overlap: int = 100):
super().__init__(chunk_size, overlap)
def split(self, text: str) -> List[Dict]:
"""Split text by Markdown headings"""
# Find all heading patterns
heading_pattern = r'^(#{1,6})\s+(.+)$'
lines = text.split('\n')
chunks = []
current_chunk = ""
current_heading = "文档开头"
chunk_index = 0
for line in lines:
heading_match = re.match(heading_pattern, line.strip())
if heading_match:
# Save previous chunk if exists
if current_chunk.strip():
chunks.append({
"index": chunk_index,
"name": current_heading,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
chunk_index += 1
current_heading = heading_match.group(2).strip()
current_chunk = line + "\n"
else:
# Check chunk size
if len(current_chunk) > self.chunk_size:
chunks.append({
"index": chunk_index,
"name": current_heading,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
chunk_index += 1
# Handle overlap
if self.overlap > 0:
overlap_lines = current_chunk.split('\n')[-self.overlap:]
current_chunk = '\n'.join(overlap_lines) + '\n'
else:
current_chunk = ""
current_chunk += line + "\n"
# Add last chunk
if current_chunk.strip():
chunks.append({
"index": chunk_index,
"name": current_heading,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
return chunks
class TokenSplitter(TextSplitter):
"""Split text by token count"""
def __init__(self, chunk_size: int = 500, overlap: int = 50):
super().__init__(chunk_size, overlap)
def split(self, text: str) -> List[Dict]:
"""Split text by approximate token count"""
words = text.split()
chunks = []
chunk_index = 0
for i in range(0, len(words), self.chunk_size - self.overlap):
chunk_words = words[i:i + self.chunk_size]
chunk_text = " ".join(chunk_words)
chunks.append({
"index": chunk_index,
"content": chunk_text,
"word_count": len(chunk_words),
"token_estimate": len(chunk_words) * 1.3 # rough token estimate
})
chunk_index += 1
return chunks
class CodeSplitter(TextSplitter):
"""Split text with code awareness"""
def __init__(self, chunk_size: int = 500, overlap: int = 50):
super().__init__(chunk_size, overlap)
def split(self, text: str) -> List[Dict]:
"""Split text preserving code blocks"""
# Split by code blocks first
code_pattern = r'```[\s\S]*?```'
parts = re.split(code_pattern, text)
chunks = []
chunk_index = 0
current_chunk = ""
for part in parts:
if len(current_chunk) + len(part) > self.chunk_size:
if current_chunk.strip():
chunks.append({
"index": chunk_index,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
chunk_index += 1
current_chunk = part
else:
current_chunk += part
if current_chunk.strip():
chunks.append({
"index": chunk_index,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
return chunks
class CustomSplitter(TextSplitter):
"""Custom separator splitter"""
def __init__(self, separator: str = "\n\n", chunk_size: int = 500):
super().__init__(chunk_size, 0)
self.separator = separator
def split(self, text: str) -> List[Dict]:
"""Split by custom separator"""
parts = text.split(self.separator)
chunks = []
current_chunk = ""
chunk_index = 0
for part in parts:
if len(current_chunk) + len(part) > self.chunk_size:
if current_chunk.strip():
chunks.append({
"index": chunk_index,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
chunk_index += 1
current_chunk = part
else:
current_chunk += self.separator + part if current_chunk else part
if current_chunk.strip():
chunks.append({
"index": chunk_index,
"content": current_chunk.strip(),
"word_count": len(current_chunk.split())
})
return chunks
def get_splitter(method: str, **kwargs) -> TextSplitter:
"""Get text splitter by method name"""
splitters = {
"recursive": RecursiveTextSplitter,
"markdown_structure": MarkdownStructureSplitter,
"token": TokenSplitter,
"code": CodeSplitter,
"custom": CustomSplitter
}
splitter_class = splitters.get(method, RecursiveTextSplitter)
return splitter_class(**kwargs)

37
backend/requirements.txt Normal file
View File

@@ -0,0 +1,37 @@
# FastAPI
fastapi>=0.115.0
uvicorn[standard]>=0.30.0
python-multipart>=0.0.9
# Database - SQLite (默认), PostgreSQL 可选
sqlalchemy>=2.0.0
alembic>=1.13.0
# asyncpg>=0.29.0 # PostgreSQL 异步驱动(生产环境使用)
# psycopg2-binary>=2.9.9 # PostgreSQL 同步驱动
# Pydantic
pydantic>=2.0.0
pydantic-settings>=2.0.0
# Redis - 可选,用于缓存/队列(开发环境可省略)
# redis>=5.0.0
# File Processing
pdfplumber>=0.10.4
python-docx>=1.1.0
openpyxl>=3.1.2
pandas>=2.2.0
ebooklib>=0.5
PyMuPDF>=1.24.0
# LLM & Text
langchain>=0.3.0
langchain-community>=0.2.0
langchain-openai>=0.1.0
tiktoken>=0.7.0
python-dotenv>=1.0.0
# Utils
python-dateutil>=2.8.2
httpx>=0.27.0
aiofiles>=23.2.1

20
bug修改.md Normal file
View File

@@ -0,0 +1,20 @@
# Bug 修改记录
## 2026-03-17
### 初始项目创建
- 创建 YG-Dataset 重构项目
- 搭建 FastAPI + Vue 3 基础架构
---
## 修复记录格式
### 日期
**问题描述:**
**原因:**
**修复方案:**
---
*持续更新中...*

52
docker-compose.yml Normal file
View File

@@ -0,0 +1,52 @@
version: '3.8'
services:
# FastAPI 后端 (SQLite 数据库,随项目文件存储)
backend:
build:
context: ./backend
dockerfile: Dockerfile
container_name: ygdataset-backend
ports:
- "8000:8000"
environment:
- DATABASE_URL=sqlite:///./ygdataset.db
- DEBUG=true
volumes:
- ./backend:/app
- uploads:/app/uploads
restart: unless-stopped
# Vue 前端
frontend:
build:
context: ./frontend
dockerfile: Dockerfile
container_name: ygdataset-frontend
ports:
- "3000:80"
volumes:
- ./frontend:/app
- /app/node_modules
depends_on:
- backend
restart: unless-stopped
volumes:
uploads:
# 如需 PostgreSQL取消注释以下配置
# services:
# postgres:
# image: postgres:15
# environment:
# POSTGRES_USER: ygdataset
# POSTGRES_PASSWORD: your_password
# POSTGRES_DB: ygdataset
# ports:
# - "5432:5432"
# volumes:
# - postgres_data:/var/lib/postgresql/data
# volumes:
# postgres_data:

View File

@@ -0,0 +1,306 @@
# Easy Dataset 项目架构分析报告
## 一、项目概述
**Easy Dataset** 是一个功能强大的大模型微调数据集创建工具,由 ConardLi 开发维护。该应用提供直观的界面和强大的内置文档解析、智能分割、数据清洗和增强功能可将各种格式的领域文档转换为高质量的结构化数据集适用于模型微调、RAG检索增强生成和模型性能评估等场景。
**项目地址**: https://github.com/ConardLi/easy-dataset
**当前版本**: 1.7.2
**许可证**: AGPL 3.0
---
## 二、技术栈分析
### 2.1 核心框架
| 类别 | 技术选型 | 说明 |
|------|----------|------|
| 前端框架 | Next.js 14 | App Router 架构 |
| UI 框架 | Material-UI (MUI) | v5.16.14 |
| 状态管理 | Jotai | 轻量级原子化状态管理 |
| 数据库 | Prisma + SQLite | 使用 Prisma ORM |
| 开发语言 | JavaScript | 全栈 JavaScript |
### 2.2 关键依赖
| 类别 | 库名称 | 用途 |
|------|--------|------|
| AI/ML | ai SDK, langchain | 大模型集成 |
| LLM 提供商 | @ai-sdk/openai, ollama-ai-provider, zhipu-ai-provider | 多模型支持 |
| 国际化 | i18next, react-i18next | 多语言支持 |
| 文档处理 | @opendocsg/pdf2md, mammoth, pdf2md-js | PDF/DOCX 解析 |
| 桌面应用 | Electron | 跨平台桌面客户端 |
| 数据处理 | xlsx, adm-zip, jszip | 文件处理 |
### 2.3 开发工具
- **包管理器**: pnpm
- **代码规范**: ESLint + Prettier
- **Git Hooks**: Husky + lint-staged
- **构建工具**: electron-builder (桌面应用打包)
---
## 三、目录结构
```
easy-dataset-main/
├── app/ # Next.js 应用目录 (App Router)
│ ├── api/ # API 路由 (150+ 个路由)
│ │ ├── check-update/ # 版本检查
│ │ ├── llm/ # LLM 模型相关 API
│ │ │ ├── fetch-models/ # 获取模型列表
│ │ │ ├── model/ # 模型配置
│ │ │ ├── ollama/ # Ollama 本地模型
│ │ │ └── providers/ # LLM 提供商
│ │ ├── monitoring/ # 监控 API
│ │ │ ├── logs/ # 日志
│ │ │ ├── stats/ # 统计
│ │ │ └── summary/ # 摘要
│ │ └── projects/ # 项目相关 API
│ │ └── [projectId]/ # 动态项目路由
│ │ ├── chunks/ # 文本分块
│ │ ├── datasets/ # 数据集
│ │ ├── eval-datasets/ # 评估数据集
│ │ ├── eval-tasks/ # 评估任务
│ │ ├── files/ # 文件管理
│ │ ├── images/ # 图片处理
│ │ ├── questions/ # 问题生成
│ │ ├── distill/ # 数据蒸馏
│ │ ├── blind-test-tasks/ # 盲测任务
│ │ ├── playground/ # 模型测试场
│ │ └── ...
│ └── (页面路由)
├── components/ # React 组件 (100+ 组件)
│ ├── common/ # 通用组件
│ ├── home/ # 首页组件
│ ├── Navbar/ # 导航栏
│ ├── dataset-square/ # 数据集广场
│ ├── datasets/ # 数据集组件
│ ├── distill/ # 数据蒸馏组件
│ ├── export/ # 导出组件
│ ├── questions/ # 问题组件
│ ├── text-split/ # 文本分割组件
│ ├── tasks/ # 任务管理组件
│ ├── playground/ # 测试场组件
│ └── settings/ # 设置组件
├── prisma/ # 数据库 schema
│ ├── schema.prisma # Prisma 数据模型
│ ├── sql.json # SQL 模板
│ └── generate-template.js # 模板生成
├── locales/ # 国际化资源
│ ├── en/ # 英文
│ ├── zh-CN/ # 简体中文
│ └── pt-BR/ # 葡萄牙语
├── electron/ # Electron 桌面应用
│ ├── main.js # 主进程
│ └── preload.js # 预加载脚本
├── public/ # 静态资源
├── desktop/ # 桌面端入口
└── package.json # 项目配置
```
---
## 四、核心模块设计
### 4.1 数据模型 (Prisma Schema)
项目使用 Prisma ORM 管理数据,主要数据模型包括:
- **Project**: 项目
- **File**: 上传的文件
- **Chunk**: 文本分块
- **Question**: 生成的问题
- **Dataset**: 微调数据集
- **EvalDataset**: 评估数据集
- **EvalTask**: 评估任务
- **BlindTestTask**: 盲测任务
- **ModelConfig**: 模型配置
- **Tag**: 标签
- **Conversation**: 对话记录
- **Image**: 图片数据
- **Task**: 后台任务
### 4.2 核心功能模块
#### 4.2.1 文档处理模块 (Text Split)
- 支持 PDF、Markdown、DOCX、TXT、EPUB 格式
- 多种分割算法Markdown结构、递归分隔符、固定长度、代码感知分块
- 目录结构提取
- PDF 转 Markdown
#### 4.2.2 问题生成模块 (Question Generation)
- 自动从文本片段提取相关问题
- 问题模板管理
- 批量生成
- 标签树自动构建
#### 4.2.3 数据集生成模块 (Dataset Generation)
- 单轮问答数据集
- 多轮对话数据集
- 图片问答数据集
- 数据蒸馏(无需上传文档)
#### 4.2.4 评估模块 (Evaluation)
- 评估数据集生成(判断题、单选、多选、简答、开放题)
- 自动化模型评估Judge Model
- 人类盲测系统Arena
- AI 质量评估
#### 4.2.5 LLM 集成模块
支持的模型提供商:
- OpenAI
- Ollama (本地模型)
- 智谱 AI
- 阿里百炼
- OpenRouter
- Google Gemini
- Anthropic Claude
---
## 五、API 架构
### 5.1 API 设计原则
- RESTful 风格路由
- 基于 Next.js App Router 的 Route Handlers
- 使用 Zod 进行请求/响应验证
### 5.2 主要 API 分组
| API 分组 | 路由前缀 | 功能 |
|----------|----------|------|
| 项目管理 | `/api/projects` | 项目 CRUD |
| 文件管理 | `/api/projects/[id]/files` | 文件上传/处理 |
| 文本分块 | `/api/projects/[id]/chunks` | 文本分割 |
| 问题生成 | `/api/projects/[id]/questions` | 问题生成/管理 |
| 数据集 | `/api/projects/[id]/datasets` | 数据集管理 |
| 评估 | `/api/projects/[id]/eval-*` | 评估相关 |
| 盲测 | `/api/projects/[id]/blind-test-tasks` | 盲测系统 |
| LLM | `/api/llm/*` | 模型配置/调用 |
| 监控 | `/api/monitoring/*` | 日志/统计 |
---
## 六、前端架构
### 6.1 组件设计模式
- **Jotai 状态管理**: 使用原子化状态管理,便于细粒度更新
- **MUI 组件库**: 统一的 UI 组件
- **Framer Motion**: 动画效果
### 6.2 主要页面
1. **首页** (`/`): 项目列表、创建项目、统计卡片
2. **项目页** (`/projects/[id]`):
- 文本分割 (`/text-split`)
- 问题列表 (`/questions`)
- 数据集 (`/datasets`)
- 评估 (`/eval-datasets`)
- 盲测 Arena (`/arena`)
- 设置 (`/settings`)
3. **模型测试场** (`/playground`)
4. **数据集广场** (`/datasets-square`)
---
## 七、部署架构
### 7.1 多平台支持
- **Web 应用**: Next.js 生产构建
- **桌面应用**: Electron
- Windows (NSIS 安装包)
- macOS (DMG)
- Linux (AppImage)
- **Docker**: 支持 Docker 部署
### 7.2 开发命令
```bash
# 开发
pnpm dev # 启动开发服务器 (端口 1717)
# 构建
pnpm build # 构建 Next.js 生产版本
pnpm electron-build # 构建桌面应用
# 数据库
pnpm db:push # 推送 schema 到数据库
pnpm db:studio # 打开 Prisma Studio
```
---
## 八、数据流设计
### 8.1 核心业务流程
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ 上传文档 │ -> │ 文本分割 │ -> │ 问题生成 │ -> │ 数据集生成 │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
PDF/DOCX Chunk Question Dataset
Markdown 目录结构 标签树 导出格式
```
### 8.2 评估流程
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ 评估数据集 │ -> │ 评估任务 │ -> │ 模型评估 │ -> │ 结果分析 │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
生成题目 批量处理 Judge Model Arena盲测
```
---
## 九、国际化
- **技术选型**: i18next + react-i18next
- **支持语言**:
- 英文 (en)
- 简体中文 (zh-CN)
- 土耳其语 (tr)
- 葡萄牙语 (pt-BR)
- **语言检测**: i18next-browser-languagedetector
---
## 十、特性亮点
1. **智能文档处理**: 支持多种格式,智能识别
2. **多种分割算法**: 灵活适应不同文档结构
3. **自动标签树**: 基于文档结构智能构建
4. **多类型数据集**: 单轮问答、多轮对话、图片问答
5. **完整评估体系**: 自动化评估 + 人类盲测
6. **多模型支持**: 兼容 OpenAI 格式的所有 API
7. **一键导出**: 支持多种格式和 LLaMA Factory 集成
8. **桌面客户端**: 跨平台支持
---
## 十一、扩展方向
根据项目发展路线,未来可能扩展的方向包括:
1. 更多文件格式支持
2. 数据集版本管理
3. 团队协作功能
4. 更多导出格式
5. 更强大的数据分析功能
---
*报告生成时间: 2026-03-17*
*基于 easy-dataset-main 项目源码分析*

View File

@@ -0,0 +1,16 @@
node_modules
.next
.git
.github
README.md
README.zh-CN.md
.gitignore
.env.local
.env.development.local
.env.test.local
.env.production.local
/test
/local-db
/video
/prisma/*.sqlite
/prisma/*.sqlite-*

6
easy-dataset-main/.gitattributes vendored Normal file
View File

@@ -0,0 +1,6 @@
# Ensure shell scripts always use LF line endings
*.sh text eol=lf
docker-entrypoint.sh text eol=lf
# Ensure Dockerfile uses LF
Dockerfile text eol=lf

View File

@@ -0,0 +1,40 @@
---
name: Bug report
about: Create a report to help us improve
title: '[Bug]'
labels: bug
assignees: ''
---
**注意:请务必按照此模版填写 ISSUES 信息,否则 ISSUE 将不会得到回复**
**问题描述**
清晰、简洁地描述该问题的具体情况。
**桌面设备(请完善以下信息)**
- 操作系统:[例如、Window、MAC]
- 浏览器:[例如谷歌浏览器Chrome苹果浏览器Safari]
- Easy Dataset 版本:[例如1.2.2]
**使用模型**
- 模型提供商:例如火山引擎
- 模型名称:例如 DeepSeek R1
**复现步骤**
重现该问题的操作步骤:
1. 进入“……”页面。
2. 点击“……”。
3. 向下滚动到“……”。
4. 这时会看到错误提示。
**预期结果**
清晰、简洁地描述你原本期望出现的情况。
**截图**
如果有必要,请附上截图,以便更好地说明你的问题。
**其他相关信息**
在此处添加关于该问题的其他任何相关背景信息。

View File

@@ -0,0 +1,19 @@
---
name: 'Feature or enhancement '
about: Suggest an idea for this project
title: '[Feature]'
labels: enhancement
assignees: ''
---
**你的功能请求是否与某个问题相关?请描述。**
清晰、简洁地描述一下存在的问题是什么。例如:当我[具体情况]时,我总是感到很沮丧。
**描述你期望的解决方案**
清晰、简洁地描述你希望实现的情况。
**描述你考虑过的替代方案**
清晰、简洁地描述你所考虑过的任何其他解决方案或功能。
**其他相关信息**
在此处添加与该功能请求相关的其他任何背景信息或截图。

View File

@@ -0,0 +1,40 @@
---
name: Question
about: Ask questions you want to know
title: '[Question]'
labels: question
assignees: ''
---
**注意:请务必按照此模版填写 ISSUES 信息,否则 ISSUE 将不会得到回复**
**问题描述**
清晰、简洁地描述该问题的具体情况。
**桌面设备(请完善以下信息)**
- 操作系统:[例如、Window、MAC]
- 浏览器:[例如谷歌浏览器Chrome苹果浏览器Safari]
- Easy Dataset 版本:[例如1.2.2]
**使用模型**
- 模型提供商:例如火山引擎
- 模型名称:例如 DeepSeek R1
**复现步骤**
重现该问题的操作步骤:
1. 进入“……”页面。
2. 点击“……”。
3. 向下滚动到“……”。
4. 这时会看到错误提示。
**预期结果**
清晰、简洁地描述你原本期望出现的情况。
**截图**
如果有必要,请附上截图,以便更好地说明你的问题。
**其他相关信息**
在此处添加关于该问题的其他任何相关背景信息。

View File

@@ -0,0 +1,12 @@
### 变更类型- [ ] 新功能feat
- [ ] 修复fix
- [ ] 文档docs
- [ ] 重构refactor
### 变更描述- 简要说明修改内容关联Issue#123
### 文档更新- [ ] README.md
- [ ] 贡献指南
- [ ] 接口文档(如有)

View File

@@ -0,0 +1,48 @@
name: Build and Push Docker image on Tag
on:
push:
tags:
- '*'
jobs:
docker-image-release:
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata for Docker
id: meta
uses: docker/metadata-action@v5
with:
images: ghcr.io/${{ github.repository_owner }}/easy-dataset
tags: |
type=ref,event=tag
type=raw,value=latest,enable={{is_default_branch}}
- name: Build and push Docker image
uses: docker/build-push-action@v5
with:
context: .
push: true
platforms: linux/amd64,linux/arm64
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max

22
easy-dataset-main/.gitignore vendored Normal file
View File

@@ -0,0 +1,22 @@
node_modules
build
.vscode
website-local.json
ai-local.json
.next
.DS_Store
tsconfig.tsbuildinfo
mock-login-callback.ts
.env.local
/src/test/crawler
/src/test/mock
/test
/dist
/prisma/*.sqlite
.idea
!local-db/empty.txt
/local-db
prisma/local-db/db.sqlite
/local-db2
.trae
opencode.json

View File

@@ -0,0 +1,3 @@
#!/usr/bin/env sh
npx commitlint --edit "$1"

View File

@@ -0,0 +1 @@
npx lint-staged

3
easy-dataset-main/.npmrc Normal file
View File

@@ -0,0 +1,3 @@
# 国内用户可使用淘宝源加速 (Chinese users can use Taobao registry for faster downloads)
# registry=https://registry.npmmirror.com
registry=https://registry.npmjs.org

View File

@@ -0,0 +1,13 @@
module.exports = {
semi: true,
trailingComma: 'none',
singleQuote: true,
tabWidth: 2,
useTabs: false,
bracketSpacing: true,
arrowParens: 'avoid',
proseWrap: 'preserve',
jsxBracketSameLine: true,
printWidth: 120,
endOfLine: 'auto'
};

View File

@@ -0,0 +1,124 @@
# Easy DataSet 项目架构设计
## 项目概述
Easy DataSet 是一个用于创建大模型微调数据集的应用程序。用户可以上传文本文件,系统会自动分割文本并生成问题,最终生成用于微调的数据集。
## 技术栈
- **前端框架**: Next.js 14 (App Router)
- **UI 框架**: Material-UI (MUI)
- **数据存储**: fs 文件系统模拟数据库
- **开发语言**: JavaScript
- **依赖管理**: pnpm
## 目录结构
```
easy-dataset/
├── app/ # Next.js 应用目录
│ ├── api/ # API 路由
│ │ └── projects/ # 项目相关 API
│ ├── projects/ # 项目相关页面
│ │ ├── [projectId]/ # 项目详情页面
│ └── page.js # 主页
├── components/ # React 组件
│ ├── home/ # 主页相关组件
│ │ ├── HeroSection.js
│ │ ├── ProjectList.js
│ │ └── StatsCard.js
│ ├── Navbar.js # 导航栏组件
│ └── CreateProjectDialog.js
├── lib/ # 工具库
│ └── db/ # 数据库模块
│ ├── base.js # 基础工具函数
│ ├── projects.js # 项目管理
│ ├── texts.js # 文本处理
│ ├── datasets.js # 数据集管理
│ └── index.js # 模块导出
├── styles/ # 样式文件
│ └── home.js # 主页样式
└── local-db/ # 本地数据库目录
```
## 核心模块设计
### 1. 数据库模块 (`lib/db/`)
#### base.js
- 提供基础的文件操作功能
- 确保数据库目录存在
- 读写 JSON 文件的工具函数
#### projects.js
- 项目的 CRUD 操作
- 项目配置管理
- 项目目录结构维护
#### texts.js
- 文献处理功能
- 文本片段存储和检索
- 文件上传处理
#### datasets.js
- 数据集生成和管理
- 问题列表管理
- 标签树管理
### 2. 前端组件 (`components/`)
#### Navbar.js
- 顶部导航栏
- 项目切换
- 模型选择
- 主题切换
#### home/ 目录组件
- HeroSection.js: 主页顶部展示区
- ProjectList.js: 项目列表展示
- StatsCard.js: 数据统计展示
- CreateProjectDialog.js: 创建项目的对话框
### 3. 页面路由 (`app/`)
#### 主页 (`page.js`)
- 项目列表展示
- 创建项目入口
- 数据统计展示
#### 项目详情页 (`projects/[projectId]/`)
- text-split/: 文献处理页面
- questions/: 问题列表页面
- datasets/: 数据集页面
- settings/: 项目设置页面
#### API 路由 (`api/`)
- projects/: 项目管理 API
- texts/: 文本处理 API
- questions/: 问题生成 API
- datasets/: 数据集管理 API
## 数据流设计
### 项目创建流程
1. 用户通过主页或导航栏创建新项目
2. 填写项目基本信息(名称、描述)
3. 系统创建项目目录和初始配置文件
4. 重定向到项目详情页
### 文献处理流程
1. 用户上传 Markdown 文件
2. 系统保存原始文件到项目目录
3. 调用文本分割服务,生成片段和目录结构
4. 展示分割结果和提取的目录
### 问题生成流程
1. 用户选择需要生成问题的文本片段
2. 系统调用大模型API生成问题
3. 保存问题到问题列表和标签树
### 数据集生成流程
1. 用户选择需要生成答案的问题
2. 系统调用大模型API生成答案
3. 保存数据集结果
4. 提供导出功能

254
easy-dataset-main/AGENTS.md Normal file
View File

@@ -0,0 +1,254 @@
# Easy Dataset Agent 指南
## 项目概述
Easy Dataset 是一个专为大型语言模型LLM微调数据集创建而设计的应用程序。它提供完整的workflow从文档处理到数据集导出支持多种文件格式和AI模型。
## 技术栈
- **前端**: Next.js 14 (App Router), React 18, Material-UI v5
- **后端**: Node.js, Prisma ORM, SQLite
- **AI集成**: OpenAI API, Ollama, 智谱AI, OpenRouter
- **桌面应用**: Electron
- **国际化**: i18next
- **构建工具**: npm/pnpm, Electron Builder
## 核心架构
### 1. 数据流架构
```
文档上传 → 文本分割 → 问题生成 → 答案生成 → 数据集导出
↓ ↓ ↓ ↓ ↓
文件处理 智能分块 LLM生成 LLM生成 格式转换
```
### 2. 模块结构
```
lib/
├── api/ # API接口层
├── db/ # 数据访问层
├── file/ # 文件处理模块
├── llm/ # AI模型集成
├── services/ # 业务逻辑层
└── util/ # 工具函数
```
## 开发指南
### 环境设置
```bash
# 安装依赖
npm install
# 数据库初始化
npm run db:push
# 开发模式
npm run dev
# 构建
npm run build
```
### 代码规范
- 使用ES6+语法
- 模块化开发
- 异步操作使用async/await
- 错误处理使用try/catch
- 注释使用JSDoc格式
### 重要文件路径
- **主入口**: `app/page.js`
- **项目路由**: `app/projects/[projectId]/`
- **API路由**: `app/api/`
- **LLM核心**: `lib/llm/core/index.js`
- **任务处理**: `lib/services/tasks/`
## 功能模块详解
### 1. 文档处理模块 (`lib/file/`)
- **支持的格式**: PDF, Markdown, DOCX, EPUB, TXT
- **核心功能**:
- 智能文本分割
- 目录结构提取
- 自定义分隔符分块
- 多语言支持
### 2. AI模型集成 (`lib/llm/`)
- **支持的提供商**:
- OpenAI (GPT系列)
- Ollama (本地模型)
- 智谱AI (GLM系列)
- OpenRouter (多模型聚合)
- **功能特性**:
- 统一API接口
- 流式输出支持
- 多语言提示词
- 错误重试机制
### 3. 任务系统 (`lib/services/tasks/`)
- **任务类型**:
- 文件处理任务
- 问题生成任务
- 答案生成任务
- 数据清洗任务
- **状态管理**: 待处理、处理中、完成、失败
### 4. 数据管理 (`lib/db/`)
- **数据模型**:
- Project (项目)
- Text/Chunk (文本块)
- Question (问题)
- Dataset (数据集)
- Tag (标签)
## 常用开发任务
### 添加新的AI模型提供商
1.`lib/llm/core/providers/` 创建新的provider文件
2. 实现基础接口 (generate, streamGenerate)
3.`lib/llm/core/index.js` 中注册provider
4. 更新配置文件和UI界面
### 添加新的文件格式支持
1.`lib/file/file-process/` 创建格式处理器
2. 实现内容提取和文本转换逻辑
3. 更新文件类型检测和验证
4. 添加相应的UI组件
### 自定义提示词模板
1.`lib/llm/prompts/` 创建新的提示词文件
2. 使用i18n支持多语言
3. 在设置界面添加配置选项
4. 测试不同模型的效果
### 添加新的导出格式
1.`components/export/` 创建新的导出组件
2. 实现数据格式转换逻辑
3. 更新导出对话框界面
4. 添加格式验证和错误处理
## 调试技巧
### 1. 数据库调试
```bash
# 打开Prisma Studio
npm run db:studio
# 查看数据库文件
sqlite3 prisma/db.sqlite
```
### 2. LLM API调试
```javascript
// 在lib/llm/core/index.js中添加日志
console.log('LLM Request:', { provider, model, prompt });
console.log('LLM Response:', response);
```
### 3. 文件处理调试
```javascript
// 在lib/file/中添加调试信息
console.log('File processing:', fileName, fileType);
console.log('Text chunks:', chunks.length, chunks[0]);
```
## 性能优化建议
### 1. 文件处理优化
- 大文件分片处理
- 异步并发处理
- 内存使用监控
- 进度条显示
### 2. LLM调用优化
- 请求缓存机制
- 批量处理请求
- 重试策略优化
- 并发数控制
### 3. 前端性能优化
- 组件懒加载
- 虚拟滚动列表
- 图片懒加载
- 代码分割
## 常见问题解决
### 1. 数据库相关问题
- **问题**: 数据库连接失败
- **解决**: 检查prisma配置确保数据库文件存在
### 2. LLM API相关问题
- **问题**: API调用超时
- **解决**: 调整超时时间,检查网络连接,增加重试机制
### 3. 文件处理问题
- **问题**: 大文件处理内存溢出
- **解决**: 使用流式处理,分块读取,增加内存限制
### 4. Electron打包问题
- **问题**: 打包后应用无法启动
- **解决**: 检查依赖项配置确保native模块正确打包
## 部署指南
### Docker部署
```bash
# 构建镜像
docker build -t easy-dataset .
# 运行容器
docker run -d -p 1717:1717 -v ./local-db:/app/local-db easy-dataset
```
### 桌面应用构建
```bash
# 构建各平台安装包
npm run electron-build-mac # macOS
npm run electron-build-win # Windows
npm run electron-build-linux # Linux
```
## 贡献指南
### 提交规范
- 使用conventional commits格式
- 提交前运行lint检查
- 更新相关文档
- 添加测试用例
### 分支策略
- `main`: 主分支,稳定版本
- `dev`: 开发分支,集成新功能
- `feature/*`: 功能分支
- `fix/*`: 修复分支
---

View File

@@ -0,0 +1,183 @@
# Easy DataSet 项目架构设计
## 项目概述
Easy DataSet 是一个用于创建大模型微调数据集的应用程序。用户可以上传文本文件,系统会自动分割文本并生成问题,最终生成用于微调的数据集。
## 技术栈
- **前端框架**: Next.js 14 (App Router)
- **UI 框架**: Material-UI (MUI)
- **数据存储**: fs 文件系统模拟数据库
- **开发语言**: JavaScript
## 目录结构
```
easy-dataset/
├── app/ # Next.js 应用目录
│ ├── api/ # API 路由
│ │ └── projects/ # 项目相关 API
│ ├── projects/ # 项目相关页面
│ │ ├── [projectId]/ # 项目详情页面
│ └── page.js # 主页
├── components/ # React 组件
│ ├── home/ # 主页相关组件
│ │ ├── HeroSection.js
│ │ ├── ProjectList.js
│ │ └── StatsCard.js
│ ├── Navbar.js # 导航栏组件
│ └── CreateProjectDialog.js
├── lib/ # 工具库
│ └── db/ # 数据库模块
│ ├── base.js # 基础工具函数
│ ├── projects.js # 项目管理
│ ├── texts.js # 文本处理
│ ├── datasets.js # 数据集管理
│ └── index.js # 模块导出
├── styles/ # 样式文件
│ └── home.js # 主页样式
└── local-db/ # 本地数据库目录
```
## 核心模块设计
### 1. 数据库模块 (`lib/db/`)
#### base.js
- 提供基础的文件操作功能
- 确保数据库目录存在
- 读写 JSON 文件的工具函数
#### projects.js
- 项目的 CRUD 操作
- 项目配置管理
- 项目目录结构维护
#### texts.js
- 文献处理功能
- 文本片段存储和检索
- 文件上传处理
#### datasets.js
- 数据集生成和管理
- 问题列表管理
- 标签树管理
### 2. 前端组件 (`components/`)
#### Navbar.js
- 顶部导航栏
- 项目切换
- 模型选择
- 主题切换
#### home/ 目录组件
- HeroSection.js: 主页顶部展示区
- ProjectList.js: 项目列表展示
- StatsCard.js: 数据统计展示
- CreateProjectDialog.js: 创建项目的对话框
### 3. 页面路由 (`app/`)
#### 主页 (`page.js`)
- 项目列表展示
- 创建项目入口
- 数据统计展示
#### 项目详情页 (`projects/[projectId]/`)
- text-split/: 文献处理页面
- questions/: 问题列表页面
- datasets/: 数据集页面
- settings/: 项目设置页面
#### API 路由 (`api/`)
- projects/: 项目管理 API
- texts/: 文本处理 API
- questions/: 问题生成 API
- datasets/: 数据集管理 API
## 数据流设计
### 项目创建流程
1. 用户通过主页或导航栏创建新项目
2. 填写项目基本信息(名称、描述)
3. 系统创建项目目录和初始配置文件
4. 重定向到项目详情页
### 文献处理流程
1. 用户上传 Markdown 文件
2. 系统保存原始文件到项目目录
3. 调用文本分割服务,生成片段和目录结构
4. 展示分割结果和提取的目录
### 问题生成流程
1. 用户选择需要生成问题的文本片段
2. 系统调用大模型API生成问题
3. 保存问题到问题列表和标签树
### 数据集生成流程
1. 用户选择需要生成答案的问题
2. 系统调用大模型API生成答案
3. 保存数据集结果
4. 提供导出功能
## 模型配置
支持多种大模型提供商配置:
- Ollama
- OpenAI
- 硅基流动
- 深度求索
- 智谱AI
每个提供商支持配置:
- API 地址
- API 密钥
- 模型名称
## 未来扩展方向
1. 支持更多文件格式PDF、DOC等
2. 增加数据集质量评估功能
3. 添加数据集版本管理
4. 实现团队协作功能
5. 增加更多数据集导出格式
## 国际化处理
### 技术选型
- **国际化库**: i18next + react-i18next
- **语言检测**: i18next-browser-languagedetector
- **支持语言**: 英文(en)、简体中文(zh-CN)
### 目录结构
```
easy-dataset/
├── locales/ # 国际化资源目录
│ ├── en/ # 英文翻译
│ │ └── translation.json
│ ├── zh-CN/ # 中文翻译
│ │ └── translation.json
│ └── pt-BR/ # 中文翻译
│ └── translation.json
├── lib/
│ └── i18n.js # i18next 配置
```

View File

@@ -0,0 +1,86 @@
# 创建包含pnpm的基础镜像
FROM node:20-alpine AS pnpm-base
RUN npm install -g pnpm@9
# 构建阶段
FROM pnpm-base AS builder
WORKDIR /app
# 添加构建参数,用于识别目标平台
ARG TARGETPLATFORM
# 安装构建依赖
RUN apk add --no-cache --virtual .build-deps \
python3 \
make \
g++ \
cairo-dev \
pango-dev \
jpeg-dev \
giflib-dev \
librsvg-dev \
build-base \
pixman-dev \
pkgconfig
# 复制依赖文件和npm配置并安装(.npmrc中可配置国内源加速)
COPY package.json pnpm-lock.yaml .npmrc ./
RUN pnpm install
# 复制源代码
COPY . .
# 根据目标平台设置Prisma二进制目标并构建应用
RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
echo "Configuring for ARM64 platform"; \
sed -i 's/binaryTargets = \[.*\]/binaryTargets = \["linux-musl-arm64-openssl-3.0.x"\]/' prisma/schema.prisma; \
PRISMA_CLI_BINARY_TARGETS="linux-musl-arm64-openssl-3.0.x" pnpm build; \
else \
echo "Configuring for AMD64 platform (default)"; \
sed -i 's/binaryTargets = \[.*\]/binaryTargets = \["linux-musl-openssl-3.0.x"\]/' prisma/schema.prisma; \
PRISMA_CLI_BINARY_TARGETS="linux-musl-openssl-3.0.x" pnpm build; \
fi
# 构建完成后移除开发依赖,只保留生产依赖
RUN pnpm prune --prod
# 运行阶段
FROM pnpm-base AS runner
WORKDIR /app
# 只安装运行时依赖
RUN apk add --no-cache \
cairo \
pango \
jpeg \
giflib \
librsvg \
pixman
# 复制package.json和.env文件
COPY package.json .env ./
# 从构建阶段复制精简后的node_modules只包含生产依赖
COPY --from=builder /app/node_modules ./node_modules
# 从构建阶段复制构建产物
COPY --from=builder /app/.next ./.next
COPY --from=builder /app/public ./public
COPY --from=builder /app/electron ./electron
# 复制 prisma 到模板目录(用于自动初始化)
COPY --from=builder /app/prisma /app/prisma-template
# 复制并设置 entrypoint 脚本sed 去除 Windows 换行符 \r防止 CRLF 导致 "no such file or directory"
COPY docker-entrypoint.sh /usr/local/bin/
RUN sed -i 's/\r$//' /usr/local/bin/docker-entrypoint.sh && \
chmod +x /usr/local/bin/docker-entrypoint.sh
# 设置生产环境
ENV NODE_ENV=production
EXPOSE 1717
# 使用 entrypoint 脚本
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
CMD ["pnpm", "start"]

40
easy-dataset-main/LICENSE Normal file
View File

@@ -0,0 +1,40 @@
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2025 Easy Dataset Project
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see https://www.gnu.org/licenses/.
Additional Terms for Easy Dataset:
1. Contact Information
If you wish to use Easy Dataset under different terms, please contact the
copyright holders at: 1009903985@qq.com
2. Branding Restrictions
You may not use the names "Easy Dataset" or "EasyDataset" to endorse or
promote products derived from this software without prior written permission.
3. Disclaimer of Warranty
The software is provided "as is", without warranty of any kind, express or
implied, including but not limited to the warranties of merchantability,
fitness for a particular purpose and noninfringement. In no event shall the
authors or copyright holders be liable for any claim, damages or other
liability, whether in an action of contract, tort or otherwise, arising from,
out of or in connection with the software or the use or other dealings in the
software.
4. Compliance with Laws
You are responsible for ensuring your use of the software complies with all
applicable laws, including but not limited to export control regulations.

294
easy-dataset-main/README.md Normal file
View File

@@ -0,0 +1,294 @@
<div align="center">
![](./public//imgs/bg2.png)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ConardLi/easy-dataset">
<img alt="GitHub Downloads (all assets, all releases)" src="https://img.shields.io/github/downloads/ConardLi/easy-dataset/total">
<img alt="GitHub Release" src="https://img.shields.io/github/v/release/ConardLi/easy-dataset">
<img src="https://img.shields.io/badge/license-AGPL--3.0-green.svg" alt="AGPL 3.0 License"/>
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors/ConardLi/easy-dataset">
<img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/ConardLi/easy-dataset">
<a href="https://arxiv.org/abs/2507.04009v1" target="_blank">
<img src="https://img.shields.io/badge/arXiv-2507.04009-b31b1b.svg" alt="arXiv:2507.04009">
</a>
<a href="https://trendshift.io/repositories/13944" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13944" alt="ConardLi%2Feasy-dataset | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
**A powerful tool for creating fine-tuning datasets for Large Language Models**
[简体中文](./README.zh-CN.md) | [English](./README.md) | [Türkçe](./README.tr.md)
[Features](#features) • [Quick Start](#local-run) • [Documentation](https://docs.easy-dataset.com/ed/en) • [Contributing](#contributing) • [License](#license)
If you like this project, please give it a Star⭐, or buy the author a coffee => [Donate](./public/imgs/aw.jpg) ❤️!
</div>
## Overview
Easy Dataset is an application specifically designed for building large language model (LLM) datasets. It features an intuitive interface, along with built-in powerful document parsing tools, intelligent segmentation algorithms, data cleaning and augmentation capabilities. The application can convert domain-specific documents in various formats into high-quality structured datasets, which are applicable to scenarios such as model fine-tuning, retrieval-augmented generation (RAG), and model performance evaluation.
![](./public/imgs/arc3.png)
## News
🎉🎉 Easy Dataset Version 1.7.0 launches brand-new evaluation capabilities! You can effortlessly convert domain-specific documents into evaluation datasets (test sets) and automatically run multi-dimensional evaluation tasks. Additionally, it comes with a human blind test system, enabling you to easily meet needs such as vertical domain model evaluation, post-fine-tuning model performance assessment, and RAG recall rate evaluation. Tutorial: [https://www.bilibili.com/video/BV1CRrVB7Eb4/](https://www.bilibili.com/video/BV1CRrVB7Eb4/)
## Features
### 📄 Document Processing & Data Generation
- **Intelligent Document Processing**: Supports PDF, Markdown, DOCX, TXT, EPUB and more formats with intelligent recognition
- **Intelligent Text Splitting**: Multiple splitting algorithms (Markdown structure, recursive separators, fixed length, code-aware chunking), with customizable visual segmentation
- **Intelligent Question Generation**: Auto-extract relevant questions from text segments, with question templates and batch generation
- **Domain Label Tree**: Intelligently builds global domain label trees based on document structure, with auto-tagging capabilities
- **Answer Generation**: Uses LLM API to generate comprehensive answers and Chain of Thought (COT), with AI optimization
- **Data Cleaning**: Intelligent text cleaning to remove noise and improve data quality
### 🔄 Multiple Dataset Types
- **Single-Turn QA Datasets**: Standard question-answer pairs for basic fine-tuning
- **Multi-Turn Dialogue Datasets**: Customizable roles and scenarios for conversational format
- **Image QA Datasets**: Generate visual QA data from images, with multiple import methods (directory, PDF, ZIP)
- **Data Distillation**: Generate label trees and questions directly from domain topics without uploading documents
### 📊 Model Evaluation System
- **Evaluation Datasets**: Generate true/false, single-choice, multiple-choice, short-answer, and open-ended questions
- **Automated Model Evaluation**: Use Judge Model to automatically evaluate model answer quality with customizable scoring rules
- **Human Blind Test (Arena)**: Double-blind comparison of two models' answers for unbiased evaluation
- **AI Quality Assessment**: Automatic quality scoring and filtering of generated datasets
### 🛠️ Advanced Features
- **Custom Prompts**: Project-level customization of all prompt templates (question generation, answer generation, data cleaning, etc.)
- **GA Pair Generation**: Genre-Audience pair generation to enrich data diversity
- **Task Management Center**: Background batch task processing with monitoring and interruption support
- **Resource Monitoring Dashboard**: Token consumption statistics, API call tracking, model performance analysis
- **Model Testing Playground**: Compare up to 3 models simultaneously
### 📤 Export & Integration
- **Multiple Export Formats**: Alpaca, ShareGPT, Multilingual-Thinking formats with JSON/JSONL file types
- **Balanced Export**: Configure export counts per tag for dataset balancing
- **LLaMA Factory Integration**: One-click LLaMA Factory configuration file generation
- **Hugging Face Upload**: Direct upload datasets to Hugging Face Hub
### 🤖 Model Support
- **Wide Model Compatibility**: Compatible with all LLM APIs that follow the OpenAI format
- **Multi-Provider Support**: OpenAI, Ollama (local models), Zhipu AI, Alibaba Bailian, OpenRouter, and more
- **Vision Models**: Support Gemini, Claude, etc. for PDF parsing and image QA
### 🌐 User Experience
- **User-Friendly Interface**: Modern, intuitive UI designed for both technical and non-technical users
- **Multi-Language Support**: Complete Chinese, English, Turkish and Portuguese language support 🇹🇷
- **Dataset Square**: Discover and explore public dataset resources
- **Desktop Clients**: Available for Windows, macOS, and Linux
## Quick Demo
https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8
## Local Run
### Download Client
<table style="width: 100%">
<tr>
<td width="20%" align="center">
<b>Windows</b>
</td>
<td width="30%" align="center" colspan="2">
<b>MacOS</b>
</td>
<td width="20%" align="center">
<b>Linux</b>
</td>
</tr>
<tr style="text-align: center">
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/windows.png' style="height:24px; width: 24px" />
<br />
<b>Setup.exe</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/mac.png' style="height:24px; width: 24px" />
<br />
<b>Intel</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/mac.png' style="height:24px; width: 24px" />
<br />
<b>M</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/linux.png' style="height:24px; width: 24px" />
<br />
<b>AppImage</b>
</a>
</td>
</tr>
</table>
### Install with NPM
1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Install dependencies:
```bash
npm install
```
3. Start the development server:
```bash
npm run build
npm run start
```
4. Open your browser and visit `http://localhost:1717`
### Using the Official Docker Image
1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Modify the `docker-compose.yml` file:
```yml
services:
easy-dataset:
image: ghcr.io/conardli/easy-dataset
container_name: easy-dataset
ports:
- '1717:1717'
volumes:
- ./local-db:/app/local-db
- ./prisma:/app/prisma
restart: unless-stopped
```
> **Note:** It is recommended to use the `local-db` and `prisma` folders in the current code repository directory as mount paths to maintain consistency with the database paths when starting via NPM.
> **Note:** The database file will be automatically initialized on first startup, no need to manually run `npm run db:push`.
3. Start with docker-compose:
```bash
docker-compose up -d
```
4. Open a browser and visit `http://localhost:1717`
### Building with a Local Dockerfile
If you want to build the image yourself, use the Dockerfile in the project root directory:
1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Build the Docker image:
```bash
docker build -t easy-dataset .
```
3. Run the container:
```bash
docker run -d \
-p 1717:1717 \
-v ./local-db:/app/local-db \
-v ./prisma:/app/prisma \
--name easy-dataset \
easy-dataset
```
> **Note:** It is recommended to use the `local-db` and `prisma` folders in the current code repository directory as mount paths to maintain consistency with the database paths when starting via NPM.
> **Note:** The database file will be automatically initialized on first startup, no need to manually run `npm run db:push`.
4. Open a browser and visit `http://localhost:1717`
## Documentation
- View the demo video of this project: [Easy Dataset Demo Video](https://www.bilibili.com/video/BV1y8QpYGE57/)
- For detailed documentation on all features and APIs, visit our [Documentation Site](https://docs.easy-dataset.com/ed/en)
- View the paper of this project: [Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents](https://arxiv.org/abs/2507.04009v1)
## Community Practice
- [Complete test set generation and model evaluation with Easy Dataset](https://www.bilibili.com/video/BV1CRrVB7Eb4/)
- [Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge](https://buaa-act.feishu.cn/wiki/GVzlwYcRFiR8OLkHbL6cQpYin7g)
- [Easy Dataset Practical Guide: How to Build High-Quality Datasets?](https://www.bilibili.com/video/BV1MRMnz1EGW)
- [Interpretation of Key Feature Updates in Easy Dataset](https://www.bilibili.com/video/BV1fyJhzHEb7/)
- [Foundation Models Fine-tuning Datasets: Basic Knowledge Popularization](https://docs.easy-dataset.com/zhi-shi-ke-pu)
## Contributing
We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:
1. Fork the repository
2. Create a new branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request (submit to the DEV branch)
Please ensure that tests are appropriately updated and adhere to the existing coding style.
## Join Discussion Group & Contact the Author
https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men
## License
This project is licensed under the AGPL 3.0 License - see the [LICENSE](LICENSE) file for details.
## Citation
If this work is helpful, please kindly cite as:
```bibtex
@misc{miao2025easydataset,
title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
year={2025},
eprint={2507.04009},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.04009}
}
```
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=ConardLi/easy-dataset&type=Date)](https://www.star-history.com/#ConardLi/easy-dataset&Date)
<div align="center">
<sub>Built with ❤️ by <a href="https://github.com/ConardLi">ConardLi</a> • Follow me: <a href="./public/imgs/weichat.jpg">WeChat Official Account</a><a href="https://space.bilibili.com/474921808">Bilibili</a><a href="https://juejin.cn/user/3949101466785709">Juejin</a><a href="https://www.zhihu.com/people/wen-ti-chao-ji-duo-de-xiao-qi">Zhihu</a><a href="https://www.youtube.com/@garden-conard">Youtube</a></sub>
</div>

View File

@@ -0,0 +1,319 @@
<div align="center">
![](./public//imgs/bg2.png)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ConardLi/easy-dataset">
<img alt="GitHub Downloads (all assets, all releases)" src="https://img.shields.io/github/downloads/ConardLi/easy-dataset/total">
<img alt="GitHub Release" src="https://img.shields.io/github/v/release/ConardLi/easy-dataset">
<img src="https://img.shields.io/badge/license-AGPL--3.0-green.svg" alt="AGPL 3.0 License"/>
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors/ConardLi/easy-dataset">
<img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/ConardLi/easy-dataset">
<a href="https://arxiv.org/abs/2507.04009v1" target="_blank">
<img src="https://img.shields.io/badge/arXiv-2507.04009-b31b1b.svg" alt="arXiv:2507.04009">
</a>
<a href="https://trendshift.io/repositories/13944" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13944" alt="ConardLi%2Feasy-dataset | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
**Büyük Dil Modelleri için ince ayar veri setleri oluşturmak için güçlü bir araç**
[简体中文](./README.zh-CN.md) | [English](./README.md) | [Türkçe](./README.tr.md)
[Özellikler](#özellikler) • [Hızlı Başlangıç](#yerel-çalıştırma) • [Dokümantasyon](https://docs.easy-dataset.com/ed/en) • [Katkıda Bulunma](#katkıda-bulunma) • [Lisans](#lisans)
Bu projeyi beğendiyseniz, lütfen bir Yıldız⭐ verin veya yazara bir kahve ısmarlayın => [Bağış](./public/imgs/aw.jpg) ❤️!
</div>
## Genel Bakış
Easy Dataset, Büyük Dil Modelleri (LLM'ler) için özel olarak tasarlanmış ince ayar veri setleri oluşturmak için bir uygulamadır. Alana özgü dosyaları yüklemek, içeriği akıllıca bölmek, sorular oluşturmak ve model ince ayarı için yüksek kaliteli eğitim verileri üretmek için sezgisel bir arayüz sağlar.
Easy Dataset ile alan bilgisini yapılandırılmış veri setlerine dönüştürebilir, OpenAI formatını takip eden tüm LLM API'leriyle uyumlu çalışabilir ve ince ayar sürecini basit ve verimli hale getirebilirsiniz.
![](./public/imgs/arc3.png)
## Özellikler
- **Akıllı Belge İşleme**: PDF, Markdown, DOCX dahil birden fazla formatın akıllı tanınması ve işlenmesi desteği
- **Akıllı Metin Bölme**: Birden fazla akıllı metin bölme algoritması ve özelleştirilebilir görsel segmentasyon desteği
- **Akıllı Soru Üretimi**: Her metin bölümünden ilgili soruları çıkarır
- **Alan Etiketleri**: Veri setleri için global alan etiketlerini akıllıca oluşturur, küresel anlama yeteneklerine sahiptir
- **Cevap Üretimi**: Kapsamlı cevaplar ve Düşünce Zinciri (COT) oluşturmak için LLM API kullanır
- **Esnek Düzenleme**: Sürecin herhangi bir aşamasında soruları, cevapları ve veri setlerini düzenleyin
- **Çoklu Dışa Aktarma Formatları**: Veri setlerini çeşitli formatlarda (Alpaca, ShareGPT, çok dilli düşünme) ve dosya türlerinde (JSON, JSONL) dışa aktarın
- **Geniş Model Desteği**: OpenAI formatını takip eden tüm LLM API'leriyle uyumlu
- **Tam Türkçe Dil Desteği**: Tüm arayüz ve AI işlemleri için eksiksiz Türkçe çeviriler 🇹🇷
- **Kullanıcı Dostu Arayüz**: Hem teknik hem de teknik olmayan kullanıcılar için tasarlanmış sezgisel kullanıcı arayüzü
- **Özel Sistem İstemleri**: Model yanıtlarını yönlendirmek için özel sistem istemleri ekleyin
## Hızlı Demo
https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8
## Yerel Çalıştırma
### İstemciyi İndirin
<table style="width: 100%">
<tr>
<td width="20%" align="center">
<b>Windows</b>
</td>
<td width="30%" align="center" colspan="2">
<b>MacOS</b>
</td>
<td width="20%" align="center">
<b>Linux</b>
</td>
</tr>
<tr style="text-align: center">
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/windows.png' style="height:24px; width: 24px" />
<br />
<b>Setup.exe</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/mac.png' style="height:24px; width: 24px" />
<br />
<b>Intel</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/mac.png' style="height:24px; width: 24px" />
<br />
<b>M</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/linux.png' style="height:24px; width: 24px" />
<br />
<b>AppImage</b>
</a>
</td>
</tr>
</table>
### NPM ile Kurulum
```bash
npm install
npm run db:push
npm run dev
```
### Docker ile Kurulum
```bash
docker-compose up -d
```
Ardından `http://localhost:1717` adresine gidin.
## Desteklenen AI Sağlayıcıları
Easy Dataset, aşağıdakiler dahil olmak üzere birden fazla AI sağlayıcısını destekler:
- **OpenAI**: GPT-4, GPT-3.5-turbo ve diğer modeller
- **Ollama**: Yerel model çalıştırma
- **智谱AI (GLM)**: Çince modeller
- **OpenRouter**: Çoklu model aggregatör
- **Özel API Uç Noktaları**: OpenAI formatını takip eden herhangi bir API
## Proje Yapısı
```
easy-dataset/
├── app/ # Next.js uygulama yönlendiricisi
│ ├── api/ # API rotaları
│ ├── projects/ # Proje sayfaları
│ └── dataset-square/ # Veri seti galerisi
├── components/ # React bileşenleri
├── lib/ # Temel kütüphaneler
│ ├── llm/ # LLM entegrasyonu
│ ├── db/ # Veritabanı erişimi
│ ├── file/ # Dosya işleme
│ └── services/ # İş mantığı
├── locales/ # i18n çevirileri
│ ├── en/ # İngilizce
│ ├── zh-CN/ # Basitleştirilmiş Çince
│ └── tr/ # Türkçe
├── prisma/ # Veritabanı şeması
└── electron/ # Electron masaüstü uygulaması
```
## Kullanım Rehberi
### 1. Proje Oluşturma
İlk olarak, yeni bir proje oluşturun ve proje adını, açıklamasını ve diğer temel bilgileri yapılandırın.
### 2. Dosya Yükleme
Alana özgü belgelerinizi yükleyin. Desteklenen formatlar:
- PDF
- Markdown (.md)
- Microsoft Word (.docx)
- EPUB
- Düz metin (.txt)
### 3. Metin Bölme
Dosyalar aşağıdaki yöntemlerle akıllıca bölünebilir:
- Doğal dil işleme tabanlı semantik bölme
- Özel ayırıcılara dayalı bölme
- Karakter sayısına dayalı sabit boyutlu bölme
- Manuel görsel bölme
### 4. Alan Etiketleri Oluşturma
Sistem, belge içeriğine dayalı olarak otomatik olarak hiyerarşik alan etiketleri oluşturabilir ve iki seviyeyi destekler.
### 5. Soru Üretimi
Her metin bloğu için sistem:
- İçeriğe dayalı alakalı sorular oluşturur
- Tür ve hedef kitle perspektifi sorgulamayı destekler
- Soru sayısını özelleştirme seçeneği sunar
### 6. Cevap Üretimi
Yapılandırılmış LLM API'si kullanarak:
- Her soru için kapsamlı cevaplar oluşturur
- Düşünce Zinciri (COT) üretimini destekler
- Farklı cevap şablonları destekler
### 7. Veri Seti Dışa Aktarma
Veri setinizi çeşitli formatlarda dışa aktarın:
- **Alpaca Format**: Basit talimat-takip formatı
- **ShareGPT Format**: Çok turlu konuşma formatı
- **Çok Dilli Düşünme**: COT ile genişletilmiş format
- **Özel Format**: Kendi JSON yapınızı tanımlayın
Dışa aktarma hedefleri:
- Yerel dosya sistemi
- Hugging Face Hub
- LLaMA Factory uyumluluğu
## Gelişmiş Özellikler
### Veri Damıtma
Mevcut veri setlerinden yeni eğitim örnekleri oluşturun:
- Soru damıtma: Mevcut soru-cevap çiftlerinden yeni sorular oluşturun
- Etiket damıtma: Otomatik etiket ve kategorizasyon oluşturma
### Tür-Hedef Kitle (GA) Çiftleri
Spesifik içerik stilleri ve hedef kitleler için veri setlerini uyarlayın:
- Tür: Akademik, teknik, yaratıcı yazma, vb.
- Hedef Kitle: Yeni başlayanlar, uzmanlar, öğrenciler, vb.
### Toplu İşlemler
Birden fazla öğeye verimli bir şekilde işlem:
- Toplu soru üretimi
- Toplu cevap üretimi
- Toplu veri seti dışa aktarma
### Görev Yönetimi
Tüm arka plan görevlerini izleyin ve yönetin:
- Dosya işleme görevleri
- Soru üretim görevleri
- Cevap üretim görevleri
- Dışa aktarma görevleri
## Yapılandırma
### LLM API Yapılandırması
Ayarlar sayfasında LLM API'nizi yapılandırın:
1. **Sağlayıcı**: OpenAI, Ollama, 智谱AI veya özel seçin
2. **API Anahtarı**: API anahtarınızı girin (gerekirse)
3. **Model**: Kullanılacak modeli seçin
4. **Temel URL**: Özel API'ler için temel URL'yi ayarlayın
### Görev Ayarları
Görev yürütme parametrelerini özelleştirin:
- Soru üretimi için eşzamanlılık
- Cevap üretimi için eşzamanlılık
- Varsayılan soru sayısı
- Varsayılan cevap şablonu
### Özel İstemler
Her görev türü için özel sistem istemleri ekleyin:
- Soru üretim istemi
- Cevap üretim istemi
- Etiket üretim istemi
- Damıtma istemi
## Katkıda Bulunma
Katkılara hoş geldiniz! Lütfen şu adımları izleyin:
1. Repo'yu fork edin
2. Bir özellik dalı oluşturun (`git checkout -b feature/amazing-feature`)
3. Değişikliklerinizi commit edin (`git commit -m 'Add some amazing feature'`)
4. Dala push edin (`git push origin feature/amazing-feature`)
5. Bir Pull Request açın
## Lisans
Bu proje AGPL-3.0 Lisansı altında lisanslanmıştır. Detaylar için [LICENSE](./LICENSE) dosyasına bakın.
## İletişim
- **GitHub Issues**: [Yeni bir sorun oluşturun](https://github.com/ConardLi/easy-dataset/issues)
- **Email**: lhj19950927@gmail.com
- **WeChat Grubu**: README'deki QR koduna bakın
## Alıntı
Bu aracı araştırmanızda kullanırsanız, lütfen şu şekilde alıntı yapın:
```bibtex
@misc{easy-dataset-2025,
title={Easy Dataset: A Tool for Creating Fine-tuning Datasets for Large Language Models},
author={Conard Li},
year={2025},
publisher={GitHub},
howpublished={\url{https://github.com/ConardLi/easy-dataset}}
}
```
## Teşekkürler
Bu proje aşağıdaki harika açık kaynak projelerini kullanır:
- [Next.js](https://nextjs.org/)
- [React](https://reactjs.org/)
- [Material-UI](https://mui.com/)
- [Prisma](https://www.prisma.io/)
- [Electron](https://www.electronjs.org/)
---
<div align="center">
⭐️ Bu projeyi beğendiyseniz, lütfen bir yıldız verin! ⭐️
</div>

View File

@@ -0,0 +1,300 @@
<div align="center">
![](./public//imgs/bg2.png)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ConardLi/easy-dataset">
<img alt="GitHub Downloads (all assets, all releases)" src="https://img.shields.io/github/downloads/ConardLi/easy-dataset/total">
<img alt="GitHub Release" src="https://img.shields.io/github/v/release/ConardLi/easy-dataset">
<img src="https://img.shields.io/badge/license-AGPL--3.0-green.svg" alt="AGPL 3.0 License"/>
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors/ConardLi/easy-dataset">
<img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/ConardLi/easy-dataset">
<a href="https://arxiv.org/abs/2507.04009v1" target="_blank">
<img src="https://img.shields.io/badge/arXiv-2507.04009-b31b1b.svg" alt="arXiv:2507.04009">
</a>
<a href="https://trendshift.io/repositories/13944" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13944" alt="ConardLi%2Feasy-dataset | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
**一个强大的大型语言模型微调数据集创建工具**
[简体中文](./README.zh-CN.md) | [English](./README.md)
[功能特点](#功能特点) • [快速开始](#本地运行) • [使用文档](https://docs.easy-dataset.com/) • [贡献](#贡献) • [许可证](#许可证)
如果喜欢本项目,请给本项目留下 Star⭐或者请作者喝杯咖啡呀 => [打赏作者](./public/imgs/aw.jpg) ❤️!
</div>
## 概述
Easy Dataset 是一个专为创建大型语言模型数据集而设计的应用程序。它提供了直观的界面内置了强大的文档解析工具、智能分割算法、数据清洗和数据增强能力可以将各种格式的领域文献转化为高质量结构化数据集可用于模型微调、RAG、模型效果评估等场景。
![Easy Dataset 产品架构图](./public/imgs/arc3.png)
## 新闻
🎉🎉 Easy Dataset 1.7.0 版本上线全新的评估能力你可以轻松将领域文献转换为评估数据集测试集并且可以自动执行多维度评估任务另外还配备人工盲测系统可以轻松助你完成垂直领域模型评估、模型微调后效果评估、RAG 召回率评估等需求,使用教程: [https://www.bilibili.com/video/BV1CRrVB7Eb4/](https://www.bilibili.com/video/BV1CRrVB7Eb4/)
## 功能特点
### 📄 文档处理与数据生成
- **智能文档处理**:支持 PDF、Markdown、DOCX、TXT、EPUB 等多种格式智能识别和处理
- **智能文本分割**支持多种智能文本分割算法Markdown 结构、递归分隔符、固定长度、代码智能分块等),支持自定义可视化分段
- **智能问题生成**:从每个文本片段中自动提取相关问题,支持问题模板和批量生成
- **领域标签树**:基于文档目录智能构建全局领域标签树,具备全局理解和自动打标能力
- **答案生成**:使用 LLM API 为每个问题生成全面的答案和思维链COT支持 AI 智能优化
- **数据清洗**:智能清洗文本块内容,去除噪音数据,提升数据质量
### 🔄 多种数据集类型
- **单轮问答数据集**:标准的问答对格式,适合基础微调
- **多轮对话数据集**:支持自定义角色和场景的多轮对话格式
- **图片问答数据集**基于图片生成视觉问答数据支持多种导入方式目录、PDF、压缩包
- **数据蒸馏**:无需上传文档,直接从领域主题自动生成标签树和问题
### 📊 模型评估体系
- **评估数据集**:支持生成判断题、单选题、多选题、简答题、开放题等多种题型的评估测试集
- **模型自动评估**使用教师模型Judge Model自动评估模型回答质量支持自定义评分规则
- **人工盲测 (Arena)**:双盲对比两个模型的回答质量,消除偏见进行公正评判
- **AI 质量评估**:对生成的数据集进行自动质量评分和筛选
### 🛠️ 高级功能
- **自定义提示词**:项目级自定义各类提示词模板(问题生成、答案生成、数据清洗等)
- **GA 组合生成**:文体-受众对生成,丰富数据多样性
- **任务管理中心**:后台批量任务处理,支持任务监控和中断
- **资源监控看板**Token 消耗统计、调用次数追踪、模型性能分析
- **模型测试 Playground**:支持最多 3 个模型同时对比测试
### 📤 导出与集成
- **多种导出格式**:支持 Alpaca、ShareGPT、Multilingual-Thinking 等格式JSON/JSONL 文件类型
- **平衡导出**:按标签配置导出数量,实现数据集均衡
- **LLaMA Factory 集成**:一键生成 LLaMA Factory 配置文件
- **Hugging Face 上传**:直接将数据集上传至 Hugging Face Hub
### 🤖 模型支持
- **广泛的模型兼容**:兼容所有遵循 OpenAI 格式的 LLM API
- **多提供商支持**OpenAI、Ollama本地模型、智谱 AI、阿里百炼、OpenRouter 等
- **视觉模型**:支持 Gemini、Claude 等视觉模型用于 PDF 解析和图片问答
### 🌐 用户体验
- **用户友好界面**:为技术和非技术用户设计的现代化直观 UI
- **多语言支持**:完整的中英文界面支持
- **数据集广场**:发现和探索各种公开数据集资源
- **桌面客户端**:提供 Windows、macOS、Linux 桌面应用
## 快速演示
https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8
## 本地运行
### 下载客户端
<table style="width: 100%">
<tr>
<td width="20%" align="center">
<b>Windows</b>
</td>
<td width="30%" align="center" colspan="2">
<b>MacOS</b>
</td>
<td width="20%" align="center">
<b>Linux</b>
</td>
</tr>
<tr style="text-align: center">
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/windows.png' style="height:24px; width: 24px" />
<br />
<b>Setup.exe</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/mac.png' style="height:24px; width: 24px" />
<br />
<b>Intel</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/mac.png' style="height:24px; width: 24px" />
<br />
<b>M</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<img src='./public/imgs/linux.png' style="height:24px; width: 24px" />
<br />
<b>AppImage</b>
</a>
</td>
</tr>
</table>
### 使用 NPM 安装
1. 克隆仓库:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. 安装依赖:
```bash
npm install
```
3. 启动开发服务器:
```bash
npm run build
npm run start
```
4. 打开浏览器并访问 `http://localhost:1717`
### 使用官方 Docker 镜像
1. 克隆仓库:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. 更改 `docker-compose.yml` 文件:
```yml
services:
easy-dataset:
image: ghcr.io/conardli/easy-dataset
container_name: easy-dataset
ports:
- '1717:1717'
volumes:
- ./local-db:/app/local-db
- ./prisma:/app/prisma
restart: unless-stopped
```
> **注意:** 建议直接使用当前代码仓库目录下的 `local-db` 和 `prisma` 文件夹作为挂载路径,这样可以和 NPM 启动时的数据库路径保持一致。
> **注意:** 数据库文件会在首次启动时自动初始化,无需手动执行 `npm run db:push`。
3. 使用 docker-compose 启动
```bash
docker-compose up -d
```
4. 打开浏览器并访问 `http://localhost:1717`
### 使用本地 Dockerfile 构建
如果你想自行构建镜像,可以使用项目根目录中的 Dockerfile
1. 克隆仓库:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. 构建 Docker 镜像:
```bash
docker build -t easy-dataset .
```
3. 运行容器:
```bash
docker run -d \
-p 1717:1717 \
-v ./local-db:/app/local-db \
-v ./prisma:/app/prisma \
--name easy-dataset \
easy-dataset
```
> **注意:** 建议直接使用当前代码仓库目录下的 `local-db` 和 `prisma` 文件夹作为挂载路径,这样可以和 NPM 启动时的数据库路径保持一致。
> **注意:** 数据库文件会在首次启动时自动初始化,无需手动执行 `npm run db:push`。
4. 打开浏览器,访问 `http://localhost:1717`
## 文档
- 有关所有功能和 API 的详细文档,请访问我们的 [文档站点](https://docs.easy-dataset.com/)
- 查看本项目的演示视频:[Easy Dataset 演示视频](https://www.bilibili.com/video/BV1y8QpYGE57/)
- 查看本项目的论文:[Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents](https://arxiv.org/abs/2507.04009v1)
## 社区教程
- [使用 Easy Dataset 完成测试集生成和模型评估](https://www.bilibili.com/video/BV1CRrVB7Eb4/)
- [Easy Dataset × LLaMA Factory: 让大模型高效学习领域知识](https://buaa-act.feishu.cn/wiki/KY9xwTGs1iqHrRkjXBwcZP9WnL9)
- [Easy Dataset 使用实战: 如何构建高质量数据集?](https://www.bilibili.com/video/BV1MRMnz1EGW)
- [Easy Dataset 1.4 重点功能更新解读](https://www.bilibili.com/video/BV1fyJhzHEb7/)
- [Easy Dataset 1.6 重点功能更新解读](https://www.bilibili.com/video/BV1Rq1hBtEJa/)
- [大模型微调数据集: 基础知识科普](https://docs.easy-dataset.com/zhi-shi-ke-pu)
- [实战案例1生成汽车图片识别数据集](https://docs.easy-dataset.com/bo-ke/shi-zhan-an-li/an-li-1-sheng-cheng-qi-che-tu-pian-shi-bie-shu-ju-ji)
- [实战案例2评论情感分类数据集](https://docs.easy-dataset.com/bo-ke/shi-zhan-an-li/an-li-2-ping-lun-qing-gan-fen-lei-shu-ju-ji)
- [实战案例3物理学多轮对话数据集](https://docs.easy-dataset.com/bo-ke/shi-zhan-an-li/an-li-3-wu-li-xue-duo-lun-dui-hua-shu-ju-ji)
- [实战案例4AI 智能体安全数据集](https://docs.easy-dataset.com/bo-ke/shi-zhan-an-li/an-li-4ai-zhi-neng-ti-an-quan-shu-ju-ji)
- [实战案例5从图文 PPT 中提取数据集](https://docs.easy-dataset.com/bo-ke/shi-zhan-an-li/an-li-5-cong-tu-wen-ppt-zhong-ti-qu-shu-ju-ji)
## 贡献
我们欢迎社区的贡献!如果您想为 Easy Dataset 做出贡献,请按照以下步骤操作:
1. Fork 仓库
2. 创建新分支(`git checkout -b feature/amazing-feature`
3. 进行更改
4. 提交更改(`git commit -m '添加一些惊人的功能'`
5. 推送到分支(`git push origin feature/amazing-feature`
6. 打开 Pull Request提交至 DEV 分支)
请确保适当更新测试并遵守现有的编码风格。
## 加交流群 & 联系作者
https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men
## 许可证
本项目采用 AGPL 3.0 许可证 - 有关详细信息,请参阅 [LICENSE](LICENSE) 文件。
## 引用
如果您觉得此项目有帮助,请考虑以下列格式引用
```bibtex
@misc{miao2025easydataset,
title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
year={2025},
eprint={2507.04009},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.04009}
}
```
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=ConardLi/easy-dataset&type=Date)](https://www.star-history.com/#ConardLi/easy-dataset&Date)
<div align="center">
<sub>由 <a href="https://github.com/ConardLi">ConardLi</a> 用 ❤️ 构建 • 关注我:<a href="./public/imgs/weichat.jpg">公众号</a><a href="https://space.bilibili.com/474921808">B站</a><a href="https://juejin.cn/user/3949101466785709">掘金</a><a href="https://www.zhihu.com/people/wen-ti-chao-ji-duo-de-xiao-qi">知乎</a><a href="https://www.youtube.com/@garden-conard">Youtube</a></sub>
</div>

View File

@@ -0,0 +1,86 @@
import { NextResponse } from 'next/server';
import path from 'path';
import fs from 'fs';
// Get current version
function getCurrentVersion() {
try {
const packageJsonPath = path.join(process.cwd(), 'package.json');
const packageJson = JSON.parse(fs.readFileSync(packageJsonPath, 'utf8'));
return packageJson.version;
} catch (error) {
console.error('Failed to read version from package.json:', String(error));
return '1.0.0';
}
}
// Get latest version from GitHub
async function getLatestVersion() {
try {
const owner = 'ConardLi';
const repo = 'easy-dataset';
const response = await fetch(`https://api.github.com/repos/${owner}/${repo}/releases/latest`);
if (!response.ok) {
throw new Error(`GitHub API request failed: ${response.status}`);
}
const data = await response.json();
return data.tag_name.replace('v', '');
} catch (error) {
console.error('Failed to fetch latest version:', String(error));
return null;
}
}
// Check for updates
export async function GET() {
try {
const currentVersion = getCurrentVersion();
const latestVersion = await getLatestVersion();
if (!latestVersion) {
return NextResponse.json({
hasUpdate: false,
currentVersion,
latestVersion: null,
error: 'Failed to fetch latest version'
});
}
// Simple semver-like comparison
const hasUpdate = compareVersions(latestVersion, currentVersion) > 0;
return NextResponse.json({
hasUpdate,
currentVersion,
latestVersion,
releaseUrl: hasUpdate ? `https://github.com/ConardLi/easy-dataset/releases/tag/v${latestVersion}` : null
});
} catch (error) {
console.error('Failed to check for updates:', String(error));
return NextResponse.json(
{
hasUpdate: false,
error: 'Failed to check for updates'
},
{ status: 500 }
);
}
}
// Simple version comparison
function compareVersions(a, b) {
const partsA = a.split('.').map(Number);
const partsB = b.split('.').map(Number);
for (let i = 0; i < Math.max(partsA.length, partsB.length); i++) {
const numA = i < partsA.length ? partsA[i] : 0;
const numB = i < partsB.length ? partsB[i] : 0;
if (numA > numB) return 1;
if (numA < numB) return -1;
}
return 0;
}

View File

@@ -0,0 +1,75 @@
import { NextResponse } from 'next/server';
import axios from 'axios';
// Fetch model list from provider
export async function POST(request) {
try {
const { endpoint, providerId, apiKey } = await request.json();
if (!endpoint) {
return NextResponse.json({ error: 'Missing required parameter: endpoint' }, { status: 400 });
}
let url = endpoint.replace(/\/$/, ''); // Remove trailing slash
// Handle Ollama endpoint
if (providerId === 'ollama') {
// Remove possible /v1 or other version suffix
url = url.replace(/\/v\d+$/, '');
// Append /api if missing
if (!url.includes('/api')) {
url += '/api';
}
url += '/tags';
} else {
url += '/models';
}
const headers = {};
if (apiKey) {
headers.Authorization = `Bearer ${apiKey}`;
}
const response = await axios.get(url, { headers });
// Format response per provider
let formattedModels = [];
if (providerId === 'ollama') {
// Ollama /api/tags format: { models: [{ name: 'model-name', ... }] }
if (response.data.models && Array.isArray(response.data.models)) {
formattedModels = response.data.models.map(item => ({
modelId: item.name,
modelName: item.name,
providerId
}));
}
} else {
// Default handling (OpenAI-compatible)
if (response.data.data && Array.isArray(response.data.data)) {
formattedModels = response.data.data.map(item => ({
modelId: item.id,
modelName: item.id,
providerId
}));
}
}
return NextResponse.json(formattedModels);
} catch (error) {
console.error('Failed to fetch model list:', String(error));
// Handle known error shapes
if (error.response) {
if (error.response.status === 401) {
return NextResponse.json({ error: 'Invalid API key' }, { status: 401 });
}
return NextResponse.json(
{ error: `Failed to fetch model list: ${error.response.statusText}` },
{ status: error.response.status }
);
}
return NextResponse.json({ error: `Failed to fetch model list: ${error.message}` }, { status: 500 });
}
}

View File

@@ -0,0 +1,39 @@
import { NextResponse } from 'next/server';
import { getLlmModelsByProviderId } from '@/lib/db/llm-models';
// Get LLM models
export async function GET(request) {
try {
const searchParams = request.nextUrl.searchParams;
let providerId = searchParams.get('providerId');
if (!providerId) {
return NextResponse.json({ error: 'Invalid parameters' }, { status: 400 });
}
const models = await getLlmModelsByProviderId(providerId);
if (!models) {
return NextResponse.json({ error: 'LLM provider not found' }, { status: 404 });
}
return NextResponse.json(models);
} catch (error) {
console.error('Database query error:', String(error));
return NextResponse.json({ error: 'Database query failed' }, { status: 500 });
}
}
// Sync latest model list
export async function POST(request) {
try {
const { newModels, providerId } = await request.json();
const models = await getLlmModelsByProviderId(providerId);
const existingModelIds = models.map(model => model.modelId);
const diffModels = newModels.filter(item => !existingModelIds.includes(item.modelId));
if (diffModels.length > 0) {
// return NextResponse.json(await createLlmModels(diffModels));
return NextResponse.json({ message: 'No new models to insert' }, { status: 200 });
} else {
return NextResponse.json({ message: 'No new models to insert' }, { status: 200 });
}
} catch (error) {
return NextResponse.json({ error: 'Database insert failed' }, { status: 500 });
}
}

View File

@@ -0,0 +1,26 @@
import { NextResponse } from 'next/server';
const OllamaClient = require('@/lib/llm/core/providers/ollama');
// Force dynamic route to prevent static generation
export const dynamic = 'force-dynamic';
export async function GET(request) {
try {
// Read host and port from query params
const { searchParams } = new URL(request.url);
const host = searchParams.get('host') || '127.0.0.1';
const port = searchParams.get('port') || '11434';
// Create Ollama API client
const ollama = new OllamaClient({
endpoint: `http://${host}:${port}/api`
});
// Fetch model list
const models = await ollama.getModels();
return NextResponse.json(models);
} catch (error) {
// console.error('fetch Ollama models error:', error);
return NextResponse.json({ error: 'fetch Models failed' }, { status: 500 });
}
}

View File

@@ -0,0 +1,14 @@
import { NextResponse } from 'next/server';
import { getLlmProviders } from '@/lib/db/llm-providers';
import { sortProvidersByPriority } from '@/lib/util/providerLogo';
// Get LLM provider data
export async function GET() {
try {
const result = await getLlmProviders();
return NextResponse.json(sortProvidersByPriority(result, item => item.id));
} catch (error) {
console.error('Database query error:', String(error));
return NextResponse.json({ error: 'Database query failed' }, { status: 500 });
}
}

View File

@@ -0,0 +1,107 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db';
export const dynamic = 'force-dynamic';
export async function GET(request) {
try {
const { searchParams } = new URL(request.url);
const timeRange = searchParams.get('timeRange') || '7d';
const projectId = searchParams.get('projectId');
const provider = searchParams.get('provider');
const status = searchParams.get('status');
const page = parseInt(searchParams.get('page') || '1', 10);
const pageSize = parseInt(searchParams.get('pageSize') || '10', 10);
const searchTerm = searchParams.get('search') || '';
let startDate = new Date();
if (timeRange === '24h') {
startDate.setHours(startDate.getHours() - 24);
} else if (timeRange === '30d') {
startDate.setDate(startDate.getDate() - 30);
} else {
startDate.setDate(startDate.getDate() - 7);
}
const where = {
createAt: {
gte: startDate
}
};
if (projectId && projectId !== 'all') {
where.projectId = projectId;
}
if (provider && provider !== 'all') {
where.provider = provider;
}
if (status && status !== 'all') {
where.status = status;
}
if (searchTerm) {
where.OR = [{ model: { contains: searchTerm } }, { errorMessage: { contains: searchTerm } }];
}
const total = await db.llmUsageLogs.count({ where });
const logs = await db.llmUsageLogs.findMany({
where,
select: {
id: true,
projectId: true,
provider: true,
model: true,
inputTokens: true,
outputTokens: true,
totalTokens: true,
latency: true,
status: true,
errorMessage: true,
createAt: true
},
orderBy: {
createAt: 'desc'
},
skip: (page - 1) * pageSize,
take: pageSize
});
const projectIds = [...new Set(logs.map(log => log.projectId))];
const projects = await db.projects.findMany({
where: { id: { in: projectIds } },
select: { id: true, name: true }
});
const projectMap = projects.reduce((acc, p) => {
acc[p.id] = p.name;
return acc;
}, {});
const details = logs.map(log => ({
id: log.id,
projectId: log.projectId,
projectName: projectMap[log.projectId] || 'Unknown Project',
provider: log.provider,
model: log.model,
status: log.status,
failureReason: log.errorMessage,
inputTokens: log.inputTokens,
outputTokens: log.outputTokens,
totalTokens: log.totalTokens,
calls: 1, // Single record
avgLatency: log.status === 'SUCCESS' ? (log.latency / 1000).toFixed(2) + 's' : '-',
createAt: log.createAt
}));
return NextResponse.json({
details,
total,
page,
pageSize,
totalPages: Math.ceil(total / pageSize)
});
} catch (error) {
console.error('Failed to fetch monitoring logs:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,188 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db';
export const dynamic = 'force-dynamic';
export async function GET(request) {
try {
const { searchParams } = new URL(request.url);
const timeRange = searchParams.get('timeRange') || '7d'; // 24h, 7d, 30d
const projectId = searchParams.get('projectId');
const provider = searchParams.get('provider');
const status = searchParams.get('status');
let startDate = new Date();
if (timeRange === '24h') {
startDate.setHours(startDate.getHours() - 24);
} else if (timeRange === '30d') {
startDate.setDate(startDate.getDate() - 30);
} else {
startDate.setDate(startDate.getDate() - 7);
}
const where = {
createAt: {
gte: startDate
}
};
if (projectId && projectId !== 'all') {
where.projectId = projectId;
}
if (provider && provider !== 'all') {
where.provider = provider;
}
if (status && status !== 'all') {
where.status = status;
}
// 1. Fetch data for aggregation
// Note: Prisma aggregation can be slow on very large datasets. If needed, optimize with pre-aggregated tables.
const logs = await db.llmUsageLogs.findMany({
where,
select: {
id: true,
projectId: true,
provider: true,
model: true,
inputTokens: true,
outputTokens: true,
totalTokens: true,
latency: true,
status: true,
errorMessage: true,
createAt: true,
dateString: true
},
orderBy: {
createAt: 'desc'
}
});
// Build project name map
const projects = await db.projects.findMany({
select: { id: true, name: true }
});
const projectMap = projects.reduce((acc, p) => {
acc[p.id] = p.name;
return acc;
}, {});
// 2. Process and aggregate
const summary = {
totalTokens: 0,
inputTokens: 0,
outputTokens: 0,
totalCalls: logs.length,
successCalls: 0,
failedCalls: 0,
totalLatency: 0,
avgLatency: 0
};
const trendMap = {};
const modelStats = {};
const detailedStatsMap = {}; // Key: projectId-model-status-errorMessage
logs.forEach(log => {
// Summary
summary.totalTokens += log.totalTokens;
summary.inputTokens += log.inputTokens;
summary.outputTokens += log.outputTokens;
if (log.status === 'SUCCESS') {
summary.successCalls++;
summary.totalLatency += log.latency;
} else {
summary.failedCalls++;
}
// Trend (by day or hour)
let timeKey;
if (timeRange === '24h') {
const date = new Date(log.createAt);
timeKey = `${String(date.getHours()).padStart(2, '0')}:00`;
} else {
timeKey = log.dateString.slice(5); // MM-DD
}
if (!trendMap[timeKey]) {
trendMap[timeKey] = { name: timeKey, input: 0, output: 0 };
}
trendMap[timeKey].input += log.inputTokens;
trendMap[timeKey].output += log.outputTokens;
// Model Distribution
const modelKey = log.model;
if (!modelStats[modelKey]) {
modelStats[modelKey] = { name: modelKey, value: 0 };
}
modelStats[modelKey].value += log.totalTokens;
// Detailed Table Aggregation
// Key: projectId + model + status + (errorMessage || '')
const errorKey = log.errorMessage || '';
const detailKey = `${log.projectId}|${log.model}|${log.status}|${errorKey}`;
if (!detailedStatsMap[detailKey]) {
detailedStatsMap[detailKey] = {
projectId: log.projectId,
projectName: projectMap[log.projectId] || 'Unknown Project',
provider: log.provider,
model: log.model,
status: log.status,
failureReason: log.errorMessage,
inputTokens: 0,
outputTokens: 0,
totalTokens: 0,
calls: 0,
totalLatency: 0
};
}
const detailItem = detailedStatsMap[detailKey];
detailItem.inputTokens += log.inputTokens;
detailItem.outputTokens += log.outputTokens;
detailItem.totalTokens += log.totalTokens;
detailItem.calls += 1;
if (log.status === 'SUCCESS') {
detailItem.totalLatency += log.latency;
}
});
// Calculate averages
if (summary.successCalls > 0) {
summary.avgLatency = Math.round(summary.totalLatency / summary.successCalls);
}
summary.avgTokensPerCall = summary.totalCalls > 0 ? Math.round(summary.totalTokens / summary.totalCalls) : 0;
summary.failureRate = summary.totalCalls > 0 ? summary.failedCalls / summary.totalCalls : 0;
// Format chart data
const trend = Object.values(trendMap).sort((a, b) => {
// Simple sorting; for production use, consider stricter time ordering.
return a.name.localeCompare(b.name);
});
const modelDistribution = Object.values(modelStats).sort((a, b) => b.value - a.value);
// Format detailed table data
const details = Object.values(detailedStatsMap)
.map(item => ({
...item,
avgLatency:
item.status === 'SUCCESS' && item.calls > 0 ? (item.totalLatency / item.calls / 1000).toFixed(2) + 's' : '-'
}))
.sort((a, b) => b.totalTokens - a.totalTokens); // Default sorting by token usage
return NextResponse.json({
summary,
trend,
modelDistribution,
details,
projects
});
} catch (error) {
console.error('Failed to fetch monitoring stats:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,132 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db';
export const dynamic = 'force-dynamic';
export async function GET(request) {
try {
const { searchParams } = new URL(request.url);
const timeRange = searchParams.get('timeRange') || '7d';
const projectId = searchParams.get('projectId');
const provider = searchParams.get('provider');
const status = searchParams.get('status');
let startDate = new Date();
if (timeRange === '24h') {
startDate.setHours(startDate.getHours() - 24);
} else if (timeRange === '30d') {
startDate.setDate(startDate.getDate() - 30);
} else {
startDate.setDate(startDate.getDate() - 7);
}
const where = {
createAt: {
gte: startDate
}
};
if (projectId && projectId !== 'all') {
where.projectId = projectId;
}
if (provider && provider !== 'all') {
where.provider = provider;
}
if (status && status !== 'all') {
where.status = status;
}
const logs = await db.llmUsageLogs.findMany({
where,
select: {
inputTokens: true,
outputTokens: true,
totalTokens: true,
latency: true,
status: true,
createAt: true,
dateString: true,
model: true
}
});
const summary = {
totalTokens: 0,
inputTokens: 0,
outputTokens: 0,
totalCalls: logs.length,
successCalls: 0,
failedCalls: 0,
totalLatency: 0,
avgLatency: 0
};
const trendMap = {};
const modelStats = {};
logs.forEach(log => {
summary.totalTokens += log.totalTokens;
summary.inputTokens += log.inputTokens;
summary.outputTokens += log.outputTokens;
if (log.status === 'SUCCESS') {
summary.successCalls++;
summary.totalLatency += log.latency;
} else {
summary.failedCalls++;
}
let timeKey;
if (timeRange === '24h') {
const date = new Date(log.createAt);
timeKey = `${String(date.getHours()).padStart(2, '0')}:00`;
} else {
timeKey = log.dateString.slice(5);
}
if (!trendMap[timeKey]) {
trendMap[timeKey] = { name: timeKey, input: 0, output: 0 };
}
trendMap[timeKey].input += log.inputTokens;
trendMap[timeKey].output += log.outputTokens;
const modelKey = log.model;
if (!modelStats[modelKey]) {
modelStats[modelKey] = { name: modelKey, value: 0 };
}
modelStats[modelKey].value += log.totalTokens;
});
if (summary.successCalls > 0) {
summary.avgLatency = Math.round(summary.totalLatency / summary.successCalls);
}
summary.avgTokensPerCall = summary.totalCalls > 0 ? Math.round(summary.totalTokens / summary.totalCalls) : 0;
summary.failureRate = summary.totalCalls > 0 ? summary.failedCalls / summary.totalCalls : 0;
const trend = Object.values(trendMap).sort((a, b) => a.name.localeCompare(b.name));
const modelDistribution = Object.values(modelStats).sort((a, b) => b.value - a.value);
const projects = await db.projects.findMany({
select: { id: true, name: true },
orderBy: { createAt: 'desc' }
});
const allLogs = await db.llmUsageLogs.findMany({
select: { provider: true },
distinct: ['provider']
});
const providers = allLogs.map(log => log.provider).filter(Boolean);
return NextResponse.json({
summary,
trend,
modelDistribution,
projects,
providers
});
} catch (error) {
console.error('Failed to fetch monitoring summary:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,176 @@
import { NextResponse } from 'next/server';
import { getUploadFileInfoById } from '@/lib/db/upload-files';
import { createGaPairs, getGaPairsByFileId } from '@/lib/db/ga-pairs';
/**
* 批量手动添加 GA 对到多个文件
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const body = await request.json();
if (!projectId) {
return NextResponse.json({ error: 'Project ID is required' }, { status: 400 });
}
const { fileIds, gaPair, appendMode = false } = body;
if (!fileIds || !Array.isArray(fileIds) || fileIds.length === 0) {
return NextResponse.json({ error: 'File IDs array is required' }, { status: 400 });
}
if (!gaPair || !gaPair.genreTitle || !gaPair.audienceTitle) {
return NextResponse.json({ error: 'GA pair with genreTitle and audienceTitle is required' }, { status: 400 });
}
console.log('开始处理批量手动添加GA对请求');
console.log('项目ID:', projectId);
console.log('请求的文件IDs:', fileIds);
console.log('GA对:', gaPair);
// 使用 getUploadFileInfoById 逐个验证文件
const validFiles = [];
const invalidFileIds = [];
for (const fileId of fileIds) {
try {
console.log(`正在验证文件: ${fileId}`);
const fileInfo = await getUploadFileInfoById(fileId);
if (fileInfo && fileInfo.projectId === projectId) {
console.log(`文件验证成功: ${fileInfo.fileName}`);
validFiles.push(fileInfo);
} else if (fileInfo) {
console.log(`文件属于其他项目: ${fileInfo.projectId} != ${projectId}`);
invalidFileIds.push(fileId);
} else {
console.log(`文件不存在: ${fileId}`);
invalidFileIds.push(fileId);
}
} catch (error) {
console.error(`验证文件 ${fileId} 时出错:`, String(error));
invalidFileIds.push(fileId);
}
}
console.log(`文件验证完成: 有效${validFiles.length}个, 无效${invalidFileIds.length}`);
if (validFiles.length === 0) {
return NextResponse.json(
{
error: 'No valid files found',
debug: {
projectId,
requestedIds: fileIds,
invalidIds: invalidFileIds,
message: 'None of the requested files belong to this project or exist in the database'
}
},
{ status: 404 }
);
}
// 批量手动添加 GA 对
console.log('开始批量手动添加GA对...');
console.log('追加模式:', appendMode);
const results = [];
for (const file of validFiles) {
try {
console.log(`处理文件: ${file.fileName}`);
// 检查是否已存在 GA 对
const existingPairs = await getGaPairsByFileId(file.id);
let pairNumber = 1;
if (appendMode && existingPairs && existingPairs.length > 0) {
// 追加模式:在现有 GA 对后面添加
pairNumber = existingPairs.length + 1;
} else if (!appendMode && existingPairs && existingPairs.length > 0) {
// 非追加模式:如果已存在 GA 对则跳过
console.log(`文件 ${file.fileName} 已存在GA对跳过`);
results.push({
fileId: file.id,
fileName: file.fileName,
success: true,
skipped: true,
message: 'GA pairs already exist'
});
continue;
}
// 创建 GA 对数据
const gaPairData = [
{
projectId,
fileId: file.id,
pairNumber,
genreTitle: gaPair.genreTitle.trim(),
genreDesc: gaPair.genreDesc?.trim() || '',
audienceTitle: gaPair.audienceTitle.trim(),
audienceDesc: gaPair.audienceDesc?.trim() || '',
isActive: true
}
];
// 保存 GA 对
if (appendMode) {
// 追加模式:只创建新的 GA 对
await createGaPairs(gaPairData);
} else {
// 非追加模式:使用 saveGaPairs 替换现有的
const { saveGaPairs } = await import('@/lib/db/ga-pairs');
await saveGaPairs(projectId, file.id, [
{
genre: { title: gaPair.genreTitle.trim(), description: gaPair.genreDesc?.trim() || '' },
audience: { title: gaPair.audienceTitle.trim(), description: gaPair.audienceDesc?.trim() || '' }
}
]);
}
results.push({
fileId: file.id,
fileName: file.fileName,
success: true,
skipped: false,
message: 'GA pair added successfully'
});
console.log(`成功为文件 ${file.fileName} 添加GA对`);
} catch (error) {
console.error(`为文件 ${file.fileName} 添加GA对失败:`, error);
results.push({
fileId: file.id,
fileName: file.fileName,
success: false,
skipped: false,
error: error.message,
message: `Failed: ${error.message}`
});
}
}
// 统计结果
const successCount = results.filter(r => r.success).length;
const failureCount = results.filter(r => !r.success).length;
console.log(`批量手动添加完成: 成功${successCount}个, 失败${failureCount}`);
return NextResponse.json({
success: true,
data: results,
summary: {
total: results.length,
success: successCount,
failure: failureCount,
processed: validFiles.length,
skipped: invalidFileIds.length
},
message: `Added GA pairs to ${successCount} files, ${failureCount} failed, ${invalidFileIds.length} files not found`
});
} catch (error) {
console.error('Error batch adding manual GA pairs:', String(error));
return NextResponse.json({ error: String(error) || 'Failed to batch add manual GA pairs' }, { status: 500 });
}
}

View File

@@ -0,0 +1,196 @@
import { NextResponse } from 'next/server';
import { getUploadFileInfoById, delUploadFileInfoById } from '@/lib/db/upload-files';
import { getProject } from '@/lib/db/projects';
import { getProjectChunks, getProjectTocByName } from '@/lib/file/text-splitter';
import { batchSaveTags } from '@/lib/db/tags';
import { handleDomainTree } from '@/lib/util/domain-tree';
import path from 'path';
import { getProjectRoot } from '@/lib/db/base';
import { promises as fs } from 'fs';
/**
* 批量删除文件
* 复用单个文件删除的完整逻辑,包括领域树修订
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const body = await request.json();
if (!projectId) {
return NextResponse.json({ error: 'Project ID is required' }, { status: 400 });
}
const { fileIds, domainTreeAction = 'keep', model, language = '中文' } = body;
if (!fileIds || !Array.isArray(fileIds) || fileIds.length === 0) {
return NextResponse.json({ error: 'File IDs array is required' }, { status: 400 });
}
console.log('开始处理批量删除文件请求');
console.log('项目ID:', projectId);
console.log('请求的文件IDs:', fileIds);
console.log('领域树操作:', domainTreeAction);
// 获取项目信息
const project = await getProject(projectId);
if (!project) {
return NextResponse.json({ error: 'The project does not exist' }, { status: 404 });
}
// 验证文件并删除
const results = [];
const deletedTocs = [];
let deletedCount = 0;
let failedCount = 0;
let totalStats = {
deletedChunks: 0,
deletedQuestions: 0,
deletedDatasets: 0
};
for (const fileId of fileIds) {
try {
console.log(`正在验证文件: ${fileId}`);
const fileInfo = await getUploadFileInfoById(fileId);
if (!fileInfo) {
console.log(`文件不存在: ${fileId}`);
results.push({
fileId,
success: false,
error: 'File not found'
});
failedCount++;
continue;
}
if (fileInfo.projectId !== projectId) {
console.log(`文件属于其他项目: ${fileInfo.projectId} != ${projectId}`);
results.push({
fileId,
success: false,
error: 'File belongs to another project'
});
failedCount++;
continue;
}
// 删除文件及其相关的文本块、问题和数据集
console.log(`删除文件: ${fileInfo.fileName}`);
const { stats, fileName } = await delUploadFileInfoById(fileId);
// 累计统计信息
totalStats.deletedChunks += stats.deletedChunks || 0;
totalStats.deletedQuestions += stats.deletedQuestions || 0;
totalStats.deletedDatasets += stats.deletedDatasets || 0;
// 获取并保存删除的 TOC 信息
const deleteToc = await getProjectTocByName(projectId, fileName);
if (deleteToc) {
deletedTocs.push(deleteToc);
}
// 删除 TOC 文件
try {
const projectRoot = await getProjectRoot();
const projectPath = path.join(projectRoot, projectId);
const tocDir = path.join(projectPath, 'toc');
const baseName = path.basename(fileInfo.fileName, path.extname(fileInfo.fileName));
const tocPath = path.join(tocDir, `${baseName}-toc.json`);
await fs.unlink(tocPath);
console.log(`成功删除 TOC 文件: ${tocPath}`);
} catch (error) {
console.error(`删除 TOC 文件失败:`, String(error));
}
results.push({
fileId,
fileName: fileInfo.fileName,
success: true,
stats
});
deletedCount++;
console.log(`成功删除文件: ${fileInfo.fileName}`);
} catch (error) {
console.error(`删除文件 ${fileId} 时出错:`, error);
results.push({
fileId,
success: false,
error: error.message
});
failedCount++;
}
}
console.log(`批量删除完成: 成功${deletedCount}个, 失败${failedCount}`);
// 如果选择了保持领域树不变,直接返回删除结果
if (domainTreeAction === 'keep') {
return NextResponse.json({
success: true,
deletedCount,
failedCount,
total: fileIds.length,
results,
stats: totalStats,
domainTreeAction: 'keep',
message: `Successfully deleted ${deletedCount} files, ${failedCount} failed`
});
}
// 处理领域树更新
try {
// 获取项目的所有文件
const { chunks, toc } = await getProjectChunks(projectId);
// 如果不存在文本块,说明项目已经没有文件了
if (!chunks || chunks.length === 0) {
// 清空领域树
await batchSaveTags(projectId, []);
return NextResponse.json({
success: true,
deletedCount,
failedCount,
total: fileIds.length,
results,
stats: totalStats,
domainTreeAction,
message: `Successfully deleted ${deletedCount} files, domain tree cleared`,
domainTreeCleared: true
});
}
// 调用领域树处理模块
await handleDomainTree({
projectId,
action: domainTreeAction,
allToc: toc,
model: model,
language,
deleteToc: deletedTocs.length > 0 ? deletedTocs : undefined,
project
});
console.log('领域树更新成功');
} catch (error) {
console.error('Error updating domain tree after batch deletion:', String(error));
// 即使领域树更新失败,也不影响文件删除的结果
}
return NextResponse.json({
success: true,
deletedCount,
failedCount,
total: fileIds.length,
results,
stats: totalStats,
domainTreeAction,
message: `Successfully deleted ${deletedCount} files, ${failedCount} failed`
});
} catch (error) {
console.error('Error batch deleting files:', String(error));
return NextResponse.json({ error: String(error) || 'Failed to batch delete files' }, { status: 500 });
}
}

View File

@@ -0,0 +1,106 @@
import { NextResponse } from 'next/server';
import { batchGenerateGaPairs } from '@/lib/services/ga/ga-pairs';
import { getUploadFileInfoById } from '@/lib/db/upload-files'; // 导入单个文件查询函数
/**
* 批量生成多个文件的 GA 对
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const body = await request.json();
if (!projectId) {
return NextResponse.json({ error: 'Project ID is required' }, { status: 400 });
}
const { fileIds, modelConfigId, language = '中文', appendMode = false } = body;
if (!fileIds || !Array.isArray(fileIds) || fileIds.length === 0) {
return NextResponse.json({ error: 'File IDs array is required' }, { status: 400 });
}
if (!modelConfigId) {
return NextResponse.json({ error: 'Model configuration ID is required' }, { status: 400 });
}
console.log('开始处理批量生成GA对请求');
console.log('项目ID:', projectId);
console.log('请求的文件IDs:', fileIds);
// 使用 getUploadFileInfoById 逐个验证文件
const validFiles = [];
const invalidFileIds = [];
for (const fileId of fileIds) {
try {
console.log(`正在验证文件: ${fileId}`);
const fileInfo = await getUploadFileInfoById(fileId);
if (fileInfo && fileInfo.projectId === projectId) {
console.log(`文件验证成功: ${fileInfo.fileName}`);
validFiles.push(fileInfo);
} else if (fileInfo) {
console.log(`文件属于其他项目: ${fileInfo.projectId} != ${projectId}`);
invalidFileIds.push(fileId);
} else {
console.log(`文件不存在: ${fileId}`);
invalidFileIds.push(fileId);
}
} catch (error) {
console.error(`验证文件 ${fileId} 时出错:`, String(error));
invalidFileIds.push(fileId);
}
}
console.log(`文件验证完成: 有效${validFiles.length}个, 无效${invalidFileIds.length}`);
if (validFiles.length === 0) {
return NextResponse.json(
{
error: 'No valid files found',
debug: {
projectId,
requestedIds: fileIds,
invalidIds: invalidFileIds,
message: 'None of the requested files belong to this project or exist in the database'
}
},
{ status: 404 }
);
}
// 批量生成 GA 对
console.log('开始批量生成GA对...');
console.log('追加模式:', appendMode);
const results = await batchGenerateGaPairs(
projectId,
validFiles,
modelConfigId,
language,
appendMode // 传递追加模式参数
);
// 统计结果
const successCount = results.filter(r => r.success).length;
const failureCount = results.filter(r => !r.success).length;
console.log(`批量生成完成: 成功${successCount}个, 失败${failureCount}`);
return NextResponse.json({
success: true,
data: results,
summary: {
total: results.length,
success: successCount,
failure: failureCount,
processed: validFiles.length,
skipped: invalidFileIds.length
},
message: `Generated GA pairs for ${successCount} files, ${failureCount} failed, ${invalidFileIds.length} files not found`
});
} catch (error) {
console.error('Error batch generating GA pairs:', String(error));
return NextResponse.json({ error: String(error) || 'Failed to batch generate GA pairs' }, { status: 500 });
}
}

View File

@@ -0,0 +1,161 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
import LLMClient from '@/lib/llm/core/index';
import { getModelConfigById } from '@/lib/db/model-config';
/**
* Get current question and generate answers from two models
*/
export async function GET(request, { params }) {
try {
const { projectId, taskId } = params;
const task = await db.task.findFirst({
where: {
id: taskId,
projectId,
taskType: 'blind-test'
}
});
if (!task) {
return NextResponse.json({ code: 404, error: 'Task not found' }, { status: 404 });
}
if (task.status !== 0) {
return NextResponse.json({ code: 400, error: 'Task has ended' }, { status: 400 });
}
// Parse task detail
let detail = {};
let modelInfo = {};
try {
detail = task.detail ? JSON.parse(task.detail) : {};
modelInfo = task.modelInfo ? JSON.parse(task.modelInfo) : {};
} catch (e) {
console.error('Failed to parse task detail:', e);
}
const questionIds = detail.questionIds || detail.evalDatasetIds || [];
const currentIndex = detail.currentIndex || 0;
// Check if all questions are completed
if (questionIds.length === 0 || currentIndex >= questionIds.length) {
return NextResponse.json({
code: 0,
data: {
completed: true,
message: 'All questions completed'
}
});
}
// Fetch current question
const currentQuestionId = questionIds[currentIndex];
const currentQuestion = await db.evalDatasets.findUnique({
where: { id: currentQuestionId },
select: {
id: true,
question: true,
questionType: true,
correctAnswer: true,
tags: true
}
});
if (!currentQuestion) {
return NextResponse.json({ code: 404, error: 'Question not found' }, { status: 404 });
}
// Fetch both model configs
const [modelConfigA, modelConfigB] = await Promise.all([
getModelConfigById(modelInfo.modelA.providerId),
getModelConfigById(modelInfo.modelB.providerId)
]);
if (!modelConfigA || !modelConfigB) {
return NextResponse.json({ code: 400, error: 'Model configuration not found' }, { status: 400 });
}
// Build prompts
const systemPrompt = "You are a helpful assistant. Provide detailed and accurate answers to the user's question.";
const userPrompt = currentQuestion.question;
// Call both models in parallel
const startTimeA = Date.now();
const startTimeB = Date.now();
let answerA = '';
let answerB = '';
let errorA = null;
let errorB = null;
let durationA = 0;
let durationB = 0;
try {
// Call model A
const clientA = new LLMClient(modelConfigA);
const resultA = await clientA.chat([
{ role: 'system', content: systemPrompt },
{ role: 'user', content: userPrompt }
]);
answerA = resultA.text || '';
durationA = Date.now() - startTimeA;
} catch (err) {
console.error('Model A call failed:', err);
errorA = err.message;
durationA = Date.now() - startTimeA;
}
try {
// Call model B
const clientB = new LLMClient(modelConfigB);
const resultB = await clientB.chat([
{ role: 'system', content: systemPrompt },
{ role: 'user', content: userPrompt }
]);
answerB = resultB.text || '';
durationB = Date.now() - startTimeB;
} catch (err) {
console.error('Model B call failed:', err);
errorB = err.message;
durationB = Date.now() - startTimeB;
}
// Randomly swap positions (core blind-test behavior)
const isSwapped = Math.random() > 0.5;
return NextResponse.json({
code: 0,
data: {
completed: false,
currentIndex,
totalCount: evalDatasetIds.length,
question: currentQuestion,
// Blind test: do not reveal which model is which
leftAnswer: {
content: isSwapped ? answerB : answerA,
error: isSwapped ? errorB : errorA,
duration: isSwapped ? durationB : durationA
},
rightAnswer: {
content: isSwapped ? answerA : answerB,
error: isSwapped ? errorA : errorB,
duration: isSwapped ? durationA : durationB
},
// Server stores the actual mapping for scoring
_swap: isSwapped
}
});
} catch (error) {
console.error('Failed to fetch current question:', error);
return NextResponse.json(
{ code: 500, error: 'Failed to fetch current question', message: error.message },
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,64 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
/**
* Get current question info (including random swap info)
*/
export async function GET(request, { params }) {
const { projectId, taskId } = params;
try {
if (!projectId || !taskId) {
return NextResponse.json({ error: 'Missing required parameters' }, { status: 400 });
}
// Fetch task
const task = await db.task.findUnique({
where: { id: taskId }
});
if (!task || task.taskType !== 'blind-test') {
return NextResponse.json({ error: 'Task not found' }, { status: 404 });
}
// Parse task detail
const detail = JSON.parse(task.detail || '{}');
// Support both evalDatasetIds and questionIds
const questionIds = detail.questionIds || detail.evalDatasetIds || [];
const currentIndex = detail.currentIndex || 0;
// Check if task is completed
if (questionIds.length === 0 || currentIndex >= questionIds.length) {
return NextResponse.json({
completed: true,
currentIndex,
totalQuestions: questionIds.length
});
}
// Fetch current question
const currentQuestionId = questionIds[currentIndex];
const currentQuestion = await db.evalDatasets.findUnique({
where: { id: currentQuestionId }
});
if (!currentQuestion) {
return NextResponse.json({ error: 'Question not found' }, { status: 404 });
}
// Randomly decide whether to swap (core blind-test behavior)
const isSwapped = Math.random() > 0.5;
return NextResponse.json({
questionId: currentQuestion.id,
question: currentQuestion.question,
answer: currentQuestion.correctAnswer || '',
questionIndex: currentIndex + 1,
totalQuestions: questionIds.length,
isSwapped
});
} catch (error) {
console.error('Failed to fetch question info:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,190 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
/**
* Get blind-test task details
* Results are fetched from EvalResults table
*/
export async function GET(request, { params }) {
try {
const { projectId, taskId } = params;
const task = await db.task.findFirst({
where: {
id: taskId,
projectId,
taskType: 'blind-test'
}
});
if (!task) {
return NextResponse.json({ code: 404, error: 'Task not found' }, { status: 404 });
}
let detail = {};
let modelInfo = {};
try {
detail = task.detail ? JSON.parse(task.detail) : {};
modelInfo = task.modelInfo ? JSON.parse(task.modelInfo) : {};
} catch (e) {
console.error('Failed to parse task detail:', e);
}
// Fetch all related evaluation questions
const evalDatasetIds = detail.evalDatasetIds || [];
const evalDatasets = await db.evalDatasets.findMany({
where: {
id: { in: evalDatasetIds }
},
select: {
id: true,
question: true,
questionType: true,
correctAnswer: true,
tags: true
}
});
// Sort by evalDatasetIds order
const orderedDatasets = evalDatasetIds.map(id => evalDatasets.find(d => d.id === id)).filter(Boolean);
// Fetch results from EvalResults table
const evalResults = await db.evalResults.findMany({
where: { taskId },
orderBy: { createAt: 'asc' }
});
// Parse results into the format expected by frontend
const results = evalResults.map(r => {
let modelAnswer = {};
let judgeData = {};
try {
modelAnswer = JSON.parse(r.modelAnswer || '{}');
judgeData = JSON.parse(r.judgeResponse || '{}');
} catch (e) {
// Ignore parse errors
}
return {
questionId: r.evalDatasetId,
vote: judgeData.vote,
isSwapped: judgeData.isSwapped,
modelAScore: judgeData.modelAScore || 0,
modelBScore: judgeData.modelBScore || 0,
leftAnswer: modelAnswer.leftAnswer || '',
rightAnswer: modelAnswer.rightAnswer || '',
timestamp: r.createAt
};
});
return NextResponse.json({
code: 0,
data: {
...task,
detail: {
...detail,
results // Include results from EvalResults table
},
modelInfo,
evalDatasets: orderedDatasets
}
});
} catch (error) {
console.error('Failed to fetch blind-test task details:', error);
return NextResponse.json(
{ code: 500, error: 'Failed to fetch blind-test task details', message: error.message },
{ status: 500 }
);
}
}
/**
* Update blind-test task (interrupt/stop)
*/
export async function PUT(request, { params }) {
try {
const { projectId, taskId } = params;
const { action } = await request.json();
const task = await db.task.findFirst({
where: {
id: taskId,
projectId,
taskType: 'blind-test'
}
});
if (!task) {
return NextResponse.json({ code: 404, error: 'Task not found' }, { status: 404 });
}
if (action === 'interrupt') {
if (task.status !== 0) {
return NextResponse.json({ code: 400, error: 'Only running tasks can be interrupted' }, { status: 400 });
}
const updatedTask = await db.task.update({
where: { id: taskId },
data: {
status: 3, // Interrupted
endTime: new Date()
}
});
return NextResponse.json({
code: 0,
data: updatedTask,
message: 'Task interrupted'
});
}
return NextResponse.json({ code: 400, error: 'Unknown action' }, { status: 400 });
} catch (error) {
console.error('Failed to update blind-test task:', error);
return NextResponse.json(
{ code: 500, error: 'Failed to update blind-test task', message: error.message },
{ status: 500 }
);
}
}
/**
* Delete blind-test task and its results
*/
export async function DELETE(request, { params }) {
try {
const { projectId, taskId } = params;
const task = await db.task.findFirst({
where: {
id: taskId,
projectId,
taskType: 'blind-test'
}
});
if (!task) {
return NextResponse.json({ code: 404, error: 'Task not found' }, { status: 404 });
}
// Delete related EvalResults first
await db.evalResults.deleteMany({
where: { taskId }
});
// Then delete the task
await db.task.delete({
where: { id: taskId }
});
return NextResponse.json({
code: 0,
message: 'Task deleted'
});
} catch (error) {
console.error('Failed to delete blind-test task:', error);
return NextResponse.json(
{ code: 500, error: 'Failed to delete blind-test task', message: error.message },
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,92 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
import LLMClient from '@/lib/llm/core/index';
import { getModelConfigById } from '@/lib/db/model-config';
/**
* Stream answer for a specified model
* Query param: model=A or model=B
*/
export async function GET(request, { params }) {
const { projectId, taskId } = params;
const { searchParams } = new URL(request.url);
const modelType = searchParams.get('model'); // 'A' or 'B'
try {
if (!projectId || !taskId) {
return NextResponse.json({ error: 'Missing required parameters' }, { status: 400 });
}
if (!modelType || !['A', 'B'].includes(modelType)) {
return NextResponse.json({ error: 'Model type must be specified (A or B)' }, { status: 400 });
}
// Fetch task
const task = await db.task.findUnique({
where: { id: taskId }
});
if (!task || task.taskType !== 'blind-test') {
return NextResponse.json({ error: 'Task not found' }, { status: 404 });
}
// Parse task detail
const detail = JSON.parse(task.detail || '{}');
const modelInfo = JSON.parse(task.modelInfo || '{}');
// Support both evalDatasetIds and questionIds
const questionIds = detail.questionIds || detail.evalDatasetIds || [];
const currentIndex = detail.currentIndex || 0;
// Check if task is completed
if (questionIds.length === 0 || currentIndex >= questionIds.length) {
return NextResponse.json({ completed: true });
}
// Fetch current question
const currentQuestionId = questionIds[currentIndex];
const currentQuestion = await db.evalDatasets.findUnique({
where: { id: currentQuestionId }
});
if (!currentQuestion) {
return NextResponse.json({ error: 'Question not found' }, { status: 404 });
}
// Resolve model config based on modelType
const modelConfigKey = modelType === 'A' ? 'modelA' : 'modelB';
const modelConfig = await getModelConfigById(modelInfo[modelConfigKey].id);
if (!modelConfig) {
return NextResponse.json({ error: 'Model configuration not found' }, { status: 400 });
}
// Prepare messages
const messages = [
{
role: 'system',
content: "You are a helpful assistant. Provide detailed and accurate answers to the user's question."
},
{ role: 'user', content: currentQuestion.question }
];
// Create LLM client
const client = new LLMClient({
projectId,
...modelConfig
});
// Call streaming API and return response directly
const response = await client.chatStreamAPI(messages);
return new Response(response.body, {
headers: {
'Content-Type': 'text/plain; charset=utf-8',
'Cache-Control': 'no-cache',
Connection: 'keep-alive'
}
});
} catch (error) {
console.error(`Model ${modelType} streaming call failed:`, error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,213 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
import LLMClient from '@/lib/llm/core/index';
import { getModelConfigById } from '@/lib/db/model-config';
/**
* Stream answers from two models for the current question
*/
export async function GET(request, { params }) {
const { projectId, taskId } = params;
try {
if (!projectId || !taskId) {
return NextResponse.json({ error: 'Missing required parameters' }, { status: 400 });
}
// Fetch task
const task = await db.task.findUnique({
where: { id: taskId }
});
if (!task || task.taskType !== 'blind-test') {
return NextResponse.json({ error: 'Task not found' }, { status: 404 });
}
// Parse task detail
const detail = JSON.parse(task.detail || '{}');
const modelInfo = JSON.parse(task.modelInfo || '{}');
const { questionIds = [], currentIndex = 0 } = detail;
// Check if task is completed
if (currentIndex >= questionIds.length) {
return NextResponse.json({ completed: true });
}
// Fetch current question
const currentQuestionId = questionIds[currentIndex];
const currentQuestion = await db.evalDatasets.findUnique({
where: { id: currentQuestionId }
});
if (!currentQuestion) {
return NextResponse.json({ error: 'Question not found' }, { status: 404 });
}
// Fetch model configs
const [modelConfigA, modelConfigB] = await Promise.all([
getModelConfigById(modelInfo.modelA.providerId),
getModelConfigById(modelInfo.modelB.providerId)
]);
if (!modelConfigA || !modelConfigB) {
return NextResponse.json({ error: 'Model configuration not found' }, { status: 400 });
}
// Randomly swap positions (core blind-test behavior)
const isSwapped = Math.random() > 0.5;
// Create streaming response
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
// Send init message
controller.enqueue(
encoder.encode(
JSON.stringify({
type: 'init',
question: currentQuestion.question,
questionId: currentQuestion.id,
questionIndex: currentIndex + 1,
totalQuestions: questionIds.length,
isSwapped
}) + '\n'
)
);
// Prepare messages
const messages = [
{
role: 'system',
content: "You are a helpful assistant. Provide detailed and accurate answers to the user's question."
},
{ role: 'user', content: currentQuestion.question }
];
// Create LLM clients
const clientA = new LLMClient({
projectId,
...modelConfigA
});
const clientB = new LLMClient({
projectId,
...modelConfigB
});
let answerA = '';
let answerB = '';
const startTime = Date.now();
// Call both models in parallel (streaming)
await Promise.all([
(async () => {
try {
const response = await clientA.chatStreamAPI(messages);
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value, { stream: true });
answerA += chunk;
// Send chunk update
controller.enqueue(
encoder.encode(
JSON.stringify({
type: 'chunk',
model: isSwapped ? 'B' : 'A',
content: chunk
}) + '\n'
)
);
}
} catch (err) {
console.error('Model A call failed:', err);
controller.enqueue(
encoder.encode(
JSON.stringify({
type: 'error',
model: isSwapped ? 'B' : 'A',
error: err.message
}) + '\n'
)
);
}
})(),
(async () => {
try {
const response = await clientB.chatStreamAPI(messages);
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value, { stream: true });
answerB += chunk;
// Send chunk update
controller.enqueue(
encoder.encode(
JSON.stringify({
type: 'chunk',
model: isSwapped ? 'A' : 'B',
content: chunk
}) + '\n'
)
);
}
} catch (err) {
console.error('Model B call failed:', err);
controller.enqueue(
encoder.encode(
JSON.stringify({
type: 'error',
model: isSwapped ? 'A' : 'B',
error: err.message
}) + '\n'
)
);
}
})()
]);
const duration = Date.now() - startTime;
// Send done message
controller.enqueue(
encoder.encode(
JSON.stringify({
type: 'done',
duration,
answerA: isSwapped ? answerB : answerA,
answerB: isSwapped ? answerA : answerB
}) + '\n'
)
);
controller.close();
} catch (error) {
console.error('Streaming handler failed:', error);
controller.error(error);
}
}
});
return new Response(stream, {
headers: {
'Content-Type': 'text/plain; charset=utf-8',
'Cache-Control': 'no-cache',
Connection: 'keep-alive'
}
});
} catch (error) {
console.error('API error:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,154 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
/**
* Submit vote result
* vote: 'left' | 'right' | 'both_good' | 'both_bad'
* Results are stored in EvalResults table
*/
export async function POST(request, { params }) {
try {
const { projectId, taskId } = params;
const { vote, questionId, isSwapped, leftAnswer, rightAnswer } = await request.json();
// Validate vote option
const validVotes = ['left', 'right', 'both_good', 'both_bad'];
if (!validVotes.includes(vote)) {
return NextResponse.json({ code: 400, error: 'Invalid vote option' }, { status: 400 });
}
if (!questionId) {
return NextResponse.json({ code: 400, error: 'Question ID is required' }, { status: 400 });
}
const task = await db.task.findFirst({
where: {
id: taskId,
projectId,
taskType: 'blind-test'
}
});
if (!task) {
return NextResponse.json({ code: 404, error: 'Task not found' }, { status: 404 });
}
if (task.status !== 0) {
return NextResponse.json({ code: 400, error: 'Task has ended' }, { status: 400 });
}
// Parse task details
let detail = {};
try {
detail = task.detail ? JSON.parse(task.detail) : {};
} catch (e) {
console.error('Failed to parse task detail:', e);
}
// Calculate scores
// isSwapped: true means left is model B and right is model A
// isSwapped: false means left is model A and right is model B
let modelAScore = 0;
let modelBScore = 0;
if (vote === 'left') {
if (isSwapped) {
modelBScore = 1; // Left is B
} else {
modelAScore = 1; // Left is A
}
} else if (vote === 'right') {
if (isSwapped) {
modelAScore = 1; // Right is A
} else {
modelBScore = 1; // Right is B
}
} else if (vote === 'both_good') {
modelAScore = 0.5;
modelBScore = 0.5;
}
// both_bad: both scores remain 0
// Store result in EvalResults table
const evalResult = await db.evalResults.create({
data: {
projectId,
taskId,
evalDatasetId: questionId,
modelAnswer: JSON.stringify({
leftAnswer: leftAnswer || '',
rightAnswer: rightAnswer || ''
}),
score: modelAScore, // Store modelA score for sorting/aggregation
isCorrect: false, // Not applicable for blind-test
judgeResponse: JSON.stringify({
vote,
isSwapped,
modelAScore,
modelBScore
}),
duration: 0,
status: 0
}
});
// Update task progress
const evalDatasetIds = detail.evalDatasetIds || [];
const newCurrentIndex = (detail.currentIndex || 0) + 1;
const isCompleted = newCurrentIndex >= evalDatasetIds.length;
const updatedDetail = {
...detail,
currentIndex: newCurrentIndex
};
await db.task.update({
where: { id: taskId },
data: {
detail: JSON.stringify(updatedDetail),
completedCount: newCurrentIndex,
status: isCompleted ? 1 : 0, // 1-completed, 0-running
endTime: isCompleted ? new Date() : null
}
});
// Calculate current total scores from EvalResults
const allResults = await db.evalResults.findMany({
where: { taskId },
select: { judgeResponse: true }
});
let totalModelAScore = 0;
let totalModelBScore = 0;
for (const r of allResults) {
try {
const judge = JSON.parse(r.judgeResponse || '{}');
totalModelAScore += judge.modelAScore || 0;
totalModelBScore += judge.modelBScore || 0;
} catch (e) {
// Ignore parse errors
}
}
return NextResponse.json({
code: 0,
data: {
success: true,
isCompleted,
currentIndex: newCurrentIndex,
totalCount: evalDatasetIds.length,
scores: {
modelA: totalModelAScore,
modelB: totalModelBScore
}
},
message: isCompleted ? 'Blind-test task completed' : 'Vote recorded'
});
} catch (error) {
console.error('Failed to submit vote result:', error);
return NextResponse.json(
{ code: 500, error: 'Failed to submit vote result', message: error.message },
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,226 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
/**
* Get all blind-test tasks for a project
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
const { searchParams } = new URL(request.url);
const page = parseInt(searchParams.get('page') || '1');
const pageSize = parseInt(searchParams.get('pageSize') || '20');
if (!projectId) {
return NextResponse.json({ error: 'Project ID is required' }, { status: 400 });
}
const skip = (page - 1) * pageSize;
// Fetch task list and total count
const [tasks, total] = await Promise.all([
db.task.findMany({
where: {
projectId,
taskType: 'blind-test'
},
orderBy: { createAt: 'desc' },
skip,
take: pageSize
}),
db.task.count({
where: {
projectId,
taskType: 'blind-test'
}
})
]);
// Fetch evaluation results for all tasks to calculate scores
const taskIds = tasks.map(t => t.id);
const allEvalResults = await db.evalResults.findMany({
where: { taskId: { in: taskIds } },
select: {
taskId: true,
judgeResponse: true
}
});
// Group results by taskId and calculate scores
const taskScores = {};
for (const result of allEvalResults) {
if (!taskScores[result.taskId]) {
taskScores[result.taskId] = { modelAScore: 0, modelBScore: 0 };
}
try {
const judge = JSON.parse(result.judgeResponse || '{}');
taskScores[result.taskId].modelAScore += judge.modelAScore || 0;
taskScores[result.taskId].modelBScore += judge.modelBScore || 0;
} catch (e) {
// Ignore parse errors
}
}
// Parse task detail fields and attach scores
const tasksWithDetails = tasks.map(task => {
let detail = {};
let modelInfo = {};
try {
detail = task.detail ? JSON.parse(task.detail) : {};
modelInfo = task.modelInfo ? JSON.parse(task.modelInfo) : {};
} catch (e) {
console.error('Failed to parse task detail:', e);
}
// Attach calculated scores as results array
const scores = taskScores[task.id] || { modelAScore: 0, modelBScore: 0 };
const results = [
{
modelAScore: scores.modelAScore,
modelBScore: scores.modelBScore
}
];
return {
...task,
detail: {
...detail,
results // Attach results for display in task card
},
modelInfo
};
});
return NextResponse.json({
code: 0,
data: {
items: tasksWithDetails,
total,
page,
pageSize,
totalPages: Math.ceil(total / pageSize)
}
});
} catch (error) {
console.error('Failed to fetch blind-test task list:', error);
return NextResponse.json(
{ code: 500, error: 'Failed to fetch blind-test task list', message: error.message },
{ status: 500 }
);
}
}
/**
* Create a blind-test task
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const data = await request.json();
const { modelA, modelB, evalDatasetIds, language = 'zh-CN' } = data;
if (!modelA || !modelA.modelId || !modelA.providerId) {
return NextResponse.json({ code: 400, error: 'Please select model A' }, { status: 400 });
}
if (!modelB || !modelB.modelId || !modelB.providerId) {
return NextResponse.json({ code: 400, error: 'Please select model B' }, { status: 400 });
}
if (modelA.modelId === modelB.modelId && modelA.providerId === modelB.providerId) {
return NextResponse.json({ code: 400, error: 'The two models must be different' }, { status: 400 });
}
if (!evalDatasetIds || evalDatasetIds.length === 0) {
return NextResponse.json({ code: 400, error: 'Please select questions to evaluate' }, { status: 400 });
}
const evalDatasets = await db.evalDatasets.findMany({
where: {
id: { in: evalDatasetIds },
projectId
},
select: { id: true, questionType: true }
});
const invalidQuestions = evalDatasets.filter(
q => q.questionType !== 'short_answer' && q.questionType !== 'open_ended'
);
if (invalidQuestions.length > 0) {
return NextResponse.json(
{
code: 400,
error: 'Blind-test tasks only support short-answer and open-ended questions'
},
{ status: 400 }
);
}
// Fetch model config info
const [modelConfigA, modelConfigB] = await Promise.all([
db.modelConfig.findFirst({
where: { projectId, providerId: modelA.providerId, modelId: modelA.modelId }
}),
db.modelConfig.findFirst({
where: { projectId, providerId: modelB.providerId, modelId: modelB.modelId }
})
]);
// Build model info (two models)
const modelInfo = {
modelA: {
id: modelConfigA?.id,
modelId: modelA.modelId,
modelName: modelConfigA?.modelName || modelA.modelId,
providerId: modelA.providerId,
providerName: modelConfigA?.providerName || modelA.providerId
},
modelB: {
id: modelConfigB?.id,
modelId: modelB.modelId,
modelName: modelConfigB?.modelName || modelB.modelId,
providerId: modelB.providerId,
providerName: modelConfigB?.providerName || modelB.providerId
}
};
// Build task detail (only store evalDatasetIds and currentIndex)
const taskDetail = {
evalDatasetIds,
currentIndex: 0 // Current question index
};
// Create task
const newTask = await db.task.create({
data: {
projectId,
taskType: 'blind-test',
status: 0, // Running
modelInfo: JSON.stringify(modelInfo),
language,
detail: JSON.stringify(taskDetail),
totalCount: evalDatasetIds.length,
completedCount: 0,
note: ''
}
});
return NextResponse.json({
code: 0,
data: {
...newTask,
detail: taskDetail,
modelInfo
},
message: 'Blind-test task created'
});
} catch (error) {
console.error('Failed to create blind-test task:', error);
return NextResponse.json(
{ code: 500, error: 'Failed to create blind-test task', message: error.message },
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,40 @@
import { NextResponse } from 'next/server';
import logger from '@/lib/util/logger';
import cleanService from '@/lib/services/clean';
// 为指定文本块进行数据清洗
export async function POST(request, { params }) {
try {
const { projectId, chunkId } = params;
// 验证项目ID和文本块ID
if (!projectId || !chunkId) {
return NextResponse.json({ error: 'Project ID or text block ID cannot be empty' }, { status: 400 });
}
// 获取请求体
const { model, language = '中文' } = await request.json();
if (!model) {
return NextResponse.json({ error: 'Model cannot be empty' }, { status: 400 });
}
// 使用数据清洗服务
const result = await cleanService.cleanDataForChunk(projectId, chunkId, {
model,
language
});
// 返回清洗结果
return NextResponse.json({
chunkId,
originalLength: result.originalLength,
cleanedLength: result.cleanedLength,
success: result.success,
message: '数据清洗完成'
});
} catch (error) {
logger.error('Error cleaning data:', error);
return NextResponse.json({ error: error.message || 'Error cleaning data' }, { status: 500 });
}
}

View File

@@ -0,0 +1,35 @@
import { NextResponse } from 'next/server';
import { generateEvalQuestionsForChunk } from '@/lib/services/eval';
import logger from '@/lib/util/logger';
/**
* 为指定文本块生成测评题目
*/
export async function POST(request, { params }) {
try {
const { projectId, chunkId } = params;
// 验证参数
if (!projectId || !chunkId) {
return NextResponse.json({ error: 'Project ID and Chunk ID are required' }, { status: 400 });
}
// 获取请求体
const { model, language = 'zh-CN' } = await request.json();
if (!model) {
return NextResponse.json({ error: 'Model configuration is required' }, { status: 400 });
}
// 调用服务层生成测评题目
const result = await generateEvalQuestionsForChunk(projectId, chunkId, {
model,
language
});
return NextResponse.json(result);
} catch (error) {
logger.error('Error generating eval questions:', error);
return NextResponse.json({ error: error.message || 'Failed to generate eval questions' }, { status: 500 });
}
}

View File

@@ -0,0 +1,73 @@
import { NextResponse } from 'next/server';
import { getQuestionsForChunk } from '@/lib/db/questions';
import logger from '@/lib/util/logger';
import questionService from '@/lib/services/questions';
// 为指定文本块生成问题
export async function POST(request, { params }) {
try {
const { projectId, chunkId } = params;
// 验证项目ID和文本块ID
if (!projectId || !chunkId) {
return NextResponse.json({ error: 'Project ID or text block ID cannot be empty' }, { status: 400 });
} // 获取请求体
const { model, language = '中文', number, enableGaExpansion = false } = await request.json();
if (!model) {
return NextResponse.json({ error: 'Model cannot be empty' }, { status: 400 });
}
// 后续会根据是否有GA对来选择是否启用GA扩展选择服务函数
const serviceFunc = questionService.generateQuestionsForChunkWithGA;
// 使用问题生成服务
const result = await serviceFunc(projectId, chunkId, {
model,
language,
number,
enableGaExpansion
});
// 统一返回格式确保包含GA扩展信息
const response = {
chunkId,
questions: result.questions || result.labelQuestions || [],
total: result.total || (result.questions || result.labelQuestions || []).length,
gaExpansionUsed: result.gaExpansionUsed || false,
gaPairsCount: result.gaPairsCount || 0,
expectedTotal: result.expectedTotal || result.total
};
// 返回生成的问题
return NextResponse.json(response);
} catch (error) {
logger.error('Error generating questions:', error);
return NextResponse.json({ error: error.message || 'Error generating questions' }, { status: 500 });
}
}
// 获取指定文本块的问题
export async function GET(request, { params }) {
try {
const { projectId, chunkId } = params;
// 验证项目ID和文本块ID
if (!projectId || !chunkId) {
return NextResponse.json({ error: 'The item ID or text block ID cannot be empty' }, { status: 400 });
}
// 获取文本块的问题
const questions = await getQuestionsForChunk(projectId, chunkId);
// 返回问题列表
return NextResponse.json({
chunkId,
questions,
total: questions.length
});
} catch (error) {
console.error('Error getting questions:', String(error));
return NextResponse.json({ error: error.message || 'Error getting questions' }, { status: 500 });
}
}

View File

@@ -0,0 +1,73 @@
import { NextResponse } from 'next/server';
import { deleteChunkById, getChunkById, updateChunkById } from '@/lib/db/chunks';
// 获取文本块内容
export async function GET(request, { params }) {
try {
const { projectId, chunkId } = params;
// 验证参数
if (!projectId) {
return NextResponse.json({ error: 'Project ID cannot be empty' }, { status: 400 });
}
if (!chunkId) {
return NextResponse.json({ error: 'Text block ID cannot be empty' }, { status: 400 });
}
// 获取文本块内容
const chunk = await getChunkById(chunkId);
return NextResponse.json(chunk);
} catch (error) {
console.error('Failed to get text block content:', String(error));
return NextResponse.json({ error: error.message || 'Failed to get text block content' }, { status: 500 });
}
}
// 删除文本块
export async function DELETE(request, { params }) {
try {
const { projectId, chunkId } = params;
// 验证参数
if (!projectId) {
return NextResponse.json({ error: 'Project ID cannot be empty' }, { status: 400 });
}
if (!chunkId) {
return NextResponse.json({ error: 'Text block ID cannot be empty' }, { status: 400 });
}
await deleteChunkById(chunkId);
return NextResponse.json({ message: 'Text block deleted successfully' });
} catch (error) {
console.error('Failed to delete text block:', String(error));
return NextResponse.json({ error: error.message || 'Failed to delete text block' }, { status: 500 });
}
}
// 编辑文本块内容
export async function PATCH(request, { params }) {
try {
const { projectId, chunkId } = params;
// 验证参数
if (!projectId) {
return NextResponse.json({ error: '项目ID不能为空' }, { status: 400 });
}
if (!chunkId) {
return NextResponse.json({ error: '文本块ID不能为空' }, { status: 400 });
}
// 解析请求体获取新内容
const requestData = await request.json();
const { content } = requestData;
if (!content) {
return NextResponse.json({ error: '内容不能为空' }, { status: 400 });
}
let res = await updateChunkById(chunkId, { content });
return NextResponse.json(res);
} catch (error) {
console.error('编辑文本块失败:', String(error));
return NextResponse.json({ error: error.message || '编辑文本块失败' }, { status: 500 });
}
}

View File

@@ -0,0 +1,20 @@
import { getChunkContentsByNames } from '@/lib/db/chunks';
import { NextResponse } from 'next/server';
export async function POST(request, { params }) {
try {
const { projectId } = params;
const { chunkNames } = await request.json();
if (!chunkNames || !Array.isArray(chunkNames)) {
return NextResponse.json({ error: 'chunkNames 参数必须是数组' }, { status: 400 });
}
const chunkContentMap = await getChunkContentsByNames(projectId, chunkNames);
return NextResponse.json(chunkContentMap);
} catch (error) {
console.error('批量获取文本块内容失败:', error);
return NextResponse.json({ error: '批量获取文本块内容失败' }, { status: 500 });
}
}

View File

@@ -0,0 +1,102 @@
import { NextRequest, NextResponse } from 'next/server';
import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient();
/**
* 批量编辑文本块内容
* POST /api/projects/[projectId]/chunks/batch-edit
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const body = await request.json();
const { position, content, chunkIds } = body;
// 验证参数
if (!position || !content || !chunkIds || !Array.isArray(chunkIds) || chunkIds.length === 0) {
return NextResponse.json({ error: 'Missing required parameters: position, content, chunkIds' }, { status: 400 });
}
if (!['start', 'end'].includes(position)) {
return NextResponse.json({ error: 'Position must be "start" or "end"' }, { status: 400 });
}
// 验证项目权限(获取要编辑的文本块)
const chunksToUpdate = await prisma.chunks.findMany({
where: {
id: { in: chunkIds },
projectId: projectId
},
select: {
id: true,
content: true,
name: true
}
});
if (chunksToUpdate.length === 0) {
return NextResponse.json({ error: 'Not found' }, { status: 404 });
}
if (chunksToUpdate.length !== chunkIds.length) {
return NextResponse.json({ error: 'Some chunks not found' }, { status: 400 });
}
// 准备更新数据
const updates = chunksToUpdate.map(chunk => {
let newContent;
if (position === 'start') {
// 在开头添加内容
newContent = content + '\n\n' + chunk.content;
} else {
// 在结尾添加内容
newContent = chunk.content + '\n\n' + content;
}
return {
where: { id: chunk.id },
data: {
content: newContent,
size: newContent.length,
updateAt: new Date()
}
};
});
async function processBatches(items, batchSize, processFn) {
const results = [];
for (let i = 0; i < items.length; i += batchSize) {
const batch = items.slice(i, i + batchSize);
const batchResults = await Promise.all(batch.map(processFn));
results.push(...batchResults);
}
return results;
}
const BATCH_SIZE = 50; // 每批处理 50 个
await processBatches(updates, BATCH_SIZE, update => prisma.chunks.update(update));
// 记录操作日志(可选)
console.log(`Successfully updated ${chunksToUpdate.length} chunks`);
return NextResponse.json({
success: true,
updatedCount: chunksToUpdate.length,
message: `Successfully updated ${chunksToUpdate.length} chunks`
});
} catch (error) {
console.error('批量编辑文本块失败:', error);
return NextResponse.json(
{
error: 'Batch edit chunks failed',
details: error.message
},
{ status: 500 }
);
} finally {
await prisma.$disconnect();
}
}

View File

@@ -0,0 +1,35 @@
import { NextResponse } from 'next/server';
import { getChunkByName } from '@/lib/db/chunks';
/**
* 根据文本块名称获取文本块
* @param {Request} request 请求对象
* @param {object} context 上下文,包含路径参数
* @returns {Promise<NextResponse>} 响应对象
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
// 从查询参数中获取 chunkName
const { searchParams } = new URL(request.url);
const chunkName = searchParams.get('chunkName');
if (!chunkName) {
return NextResponse.json({ error: '文本块名称不能为空' }, { status: 400 });
}
// 根据名称和项目ID查询文本块
const chunk = await getChunkByName(projectId, chunkName);
if (!chunk) {
return NextResponse.json({ error: '未找到指定的文本块' }, { status: 404 });
}
// 返回文本块信息
return NextResponse.json(chunk);
} catch (error) {
console.error('根据名称获取文本块失败:', String(error));
return NextResponse.json({ error: '获取文本块失败: ' + error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,21 @@
import { NextResponse } from 'next/server';
import { deleteChunkById, getChunkByFileIds, getChunkById, getChunksByFileIds, updateChunkById } from '@/lib/db/chunks';
// 获取文本块内容
export async function POST(request, { params }) {
try {
const { projectId } = params;
// 验证参数
if (!projectId) {
return NextResponse.json({ error: 'Project ID cannot be empty' }, { status: 400 });
}
const { array } = await request.json();
// 获取文本块内容
const chunk = await getChunksByFileIds(array);
return NextResponse.json(chunk);
} catch (error) {
console.error('Failed to get text block content:', String(error));
return NextResponse.json({ error: String(error) || 'Failed to get text block content' }, { status: 500 });
}
}

View File

@@ -0,0 +1,36 @@
import { NextResponse } from 'next/server';
import { getProject, updateProject, getTaskConfig } from '@/lib/db/projects';
// 获取项目配置
export async function GET(request, { params }) {
try {
const projectId = params.projectId;
const config = await getProject(projectId);
const taskConfig = await getTaskConfig(projectId);
return NextResponse.json({ ...config, ...taskConfig });
} catch (error) {
console.error('获取项目配置失败:', String(error));
return NextResponse.json({ error: error.message }, { status: 500 });
}
}
// 更新项目配置
export async function PUT(request, { params }) {
try {
const projectId = params.projectId;
const newConfig = await request.json();
const currentConfig = await getProject(projectId);
// 只更新 prompts 部分
const updatedConfig = {
...currentConfig,
...newConfig.prompts
};
const config = await updateProject(projectId, updatedConfig);
return NextResponse.json(config);
} catch (error) {
console.error('更新项目配置失败:', String(error));
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,105 @@
import { NextResponse } from 'next/server';
import {
getCustomPrompts,
getCustomPrompt,
saveCustomPrompt,
deleteCustomPrompt,
batchSaveCustomPrompts,
toggleCustomPrompt,
getPromptTemplates
} from '@/lib/db/custom-prompts';
// 获取项目的自定义提示词
export async function GET(request, { params }) {
try {
const { projectId } = params;
const { searchParams } = new URL(request.url);
const promptType = searchParams.get('promptType');
const language = searchParams.get('language');
if (!projectId) {
return NextResponse.json({ error: 'Project ID is required' }, { status: 400 });
}
const customPrompts = await getCustomPrompts(projectId, promptType, language);
const templates = await getPromptTemplates();
return NextResponse.json({
success: true,
customPrompts,
templates
});
} catch (error) {
console.error('获取自定义提示词失败:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}
// 保存自定义提示词
export async function POST(request, { params }) {
try {
const { projectId } = params;
const body = await request.json();
if (!projectId) {
return NextResponse.json({ error: 'Project ID is required' }, { status: 400 });
}
// 批量保存
if (body.prompts && Array.isArray(body.prompts)) {
const results = await batchSaveCustomPrompts(projectId, body.prompts);
return NextResponse.json({
success: true,
results
});
}
// 单个保存
const { promptType, promptKey, language, content } = body;
if (!promptType || !promptKey || !language || content === undefined) {
return NextResponse.json(
{
error: 'promptType, promptKey, language and content are required'
},
{ status: 400 }
);
}
const result = await saveCustomPrompt(projectId, promptType, promptKey, language, content);
return NextResponse.json({
success: true,
result
});
} catch (error) {
console.error('保存自定义提示词失败:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}
// 删除自定义提示词
export async function DELETE(request, { params }) {
try {
const { projectId } = params;
const { searchParams } = new URL(request.url);
const promptType = searchParams.get('promptType');
const promptKey = searchParams.get('promptKey');
const language = searchParams.get('language');
if (!projectId || !promptType || !promptKey || !language) {
return NextResponse.json(
{
error: 'projectId, promptType, promptKey and language are required'
},
{ status: 400 }
);
}
const success = await deleteCustomPrompt(projectId, promptType, promptKey, language);
return NextResponse.json({
success
});
} catch (error) {
console.error('删除自定义提示词失败:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,116 @@
import { NextResponse } from 'next/server';
import { saveChunks, deleteChunksByFileId } from '@/lib/db/chunks';
import path from 'path';
import fs from 'fs/promises';
import { getProjectRoot } from '@/lib/db/base';
/**
* 处理自定义分块请求
* @param {Request} request - 请求对象
* @param {Object} params - 路由参数
* @returns {Promise<Response>} - 响应对象
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const { fileId, fileName, content, splitPoints } = await request.json();
// 参数验证
if (!projectId || !fileId || !fileName || !content || !splitPoints) {
return NextResponse.json({ error: 'Missing required parameters' }, { status: 400 });
}
// 获取项目根目录
const projectRoot = await getProjectRoot();
const projectPath = path.join(projectRoot, projectId);
// 检查项目是否存在
try {
await fs.access(projectPath);
} catch (error) {
return NextResponse.json({ error: 'Project does not exist' }, { status: 404 });
}
// 先删除该文件已有的文本块
await deleteChunksByFileId(projectId, fileId);
// 根据分块点将文件内容分割成多个块
const customChunks = generateCustomChunks(projectId, fileId, fileName, content, splitPoints);
// 保存新的文本块
await saveChunks(customChunks);
return NextResponse.json({
success: true,
message: 'Custom chunks saved successfully',
totalChunks: customChunks.length
});
} catch (error) {
console.error('自定义分块处理出错:', String(error));
return NextResponse.json({ error: error.message || 'Failed to process custom split request' }, { status: 500 });
}
}
/**
* 根据分块点生成自定义文本块
* @param {string} projectId - 项目ID
* @param {string} fileId - 文件ID
* @param {string} fileName - 文件名
* @param {string} content - 文件内容
* @param {Array} splitPoints - 分块点数组
* @returns {Array} - 生成的文本块数组
*/
function generateCustomChunks(projectId, fileId, fileName, content, splitPoints) {
// 按位置排序分块点
const sortedPoints = [...splitPoints].sort((a, b) => a.position - b.position);
// 创建分块
const chunks = [];
let startPos = 0;
// 处理每个分块点
for (let i = 0; i < sortedPoints.length; i++) {
const endPos = sortedPoints[i].position;
// 提取当前分块内容
const chunkContent = content.substring(startPos, endPos);
// 跳过空白分块
if (chunkContent.trim().length === 0) {
startPos = endPos;
continue;
}
// 创建分块对象
const chunk = {
projectId,
name: `${path.basename(fileName, path.extname(fileName))}-part-${i + 1}`,
fileId,
fileName,
content: chunkContent,
summary: `${fileName} 自定义分块 ${i + 1}/${sortedPoints.length + 1}`,
size: chunkContent.length
};
chunks.push(chunk);
startPos = endPos;
}
// 添加最后一个分块(如果有内容)
const lastChunkContent = content.substring(startPos);
if (lastChunkContent.trim().length > 0) {
const lastChunk = {
projectId,
name: `${path.basename(fileName, path.extname(fileName))}-part-${sortedPoints.length + 1}`,
fileId,
fileName,
content: lastChunkContent,
summary: `${fileName} 自定义分块 ${sortedPoints.length + 1}/${sortedPoints.length + 1}`,
size: lastChunkContent.length
};
chunks.push(lastChunk);
}
return chunks;
}

View File

@@ -0,0 +1,183 @@
/**
* 单个多轮对话数据集操作API
*/
import { NextResponse } from 'next/server';
import {
getDatasetConversationById,
updateDatasetConversation,
deleteDatasetConversation,
getConversationNavigationItems
} from '@/lib/db/dataset-conversations';
/**
* 获取单个多轮对话数据集详情
*/
export async function GET(request, { params }) {
try {
const { projectId, conversationId } = params;
const { searchParams } = new URL(request.url);
const operateType = searchParams.get('operateType');
// 如果是导航操作,返回导航项
if (operateType !== null) {
const data = await getConversationNavigationItems(projectId, conversationId, operateType);
return NextResponse.json(data);
}
const conversation = await getDatasetConversationById(conversationId);
if (!conversation) {
return NextResponse.json(
{
success: false,
message: '对话数据集不存在'
},
{ status: 404 }
);
}
if (conversation.projectId !== projectId) {
return NextResponse.json(
{
success: false,
message: '对话数据集不属于指定项目'
},
{ status: 403 }
);
}
return NextResponse.json(conversation);
} catch (error) {
console.error('获取多轮对话数据集详情失败:', error);
return NextResponse.json(
{
success: false,
message: error.message
},
{ status: 500 }
);
}
}
/**
* 更新多轮对话数据集
*/
export async function PUT(request, { params }) {
try {
const { projectId, conversationId } = params;
const body = await request.json();
// 验证对话数据集是否存在且属于项目
const conversation = await getDatasetConversationById(conversationId);
if (!conversation) {
return NextResponse.json(
{
success: false,
message: '对话数据集不存在'
},
{ status: 404 }
);
}
if (conversation.projectId !== projectId) {
return NextResponse.json(
{
success: false,
message: '对话数据集不属于指定项目'
},
{ status: 403 }
);
}
// 只允许更新特定字段
const allowedFields = ['score', 'tags', 'note', 'confirmed', 'aiEvaluation', 'messages'];
const updateData = {};
allowedFields.forEach(field => {
if (body.hasOwnProperty(field)) {
if (field === 'messages') {
// 将messages数组转换为rawMessages字符串存储
updateData['rawMessages'] = JSON.stringify(body[field]);
} else {
updateData[field] = body[field];
}
}
});
if (Object.keys(updateData).length === 0) {
return NextResponse.json(
{
success: false,
message: '没有有效的更新字段'
},
{ status: 400 }
);
}
const updatedConversation = await updateDatasetConversation(conversationId, updateData);
return NextResponse.json({
success: true,
data: updatedConversation
});
} catch (error) {
console.error('更新多轮对话数据集失败:', error);
return NextResponse.json(
{
success: false,
message: error.message
},
{ status: 500 }
);
}
}
/**
* 删除多轮对话数据集
*/
export async function DELETE(request, { params }) {
try {
const { projectId, conversationId } = params;
// 验证对话数据集是否存在且属于项目
const conversation = await getDatasetConversationById(conversationId);
if (!conversation) {
return NextResponse.json(
{
success: false,
message: '对话数据集不存在'
},
{ status: 404 }
);
}
if (conversation.projectId !== projectId) {
return NextResponse.json(
{
success: false,
message: '对话数据集不属于指定项目'
},
{ status: 403 }
);
}
await deleteDatasetConversation(conversationId);
return NextResponse.json({
success: true,
message: '删除成功'
});
} catch (error) {
console.error('删除多轮对话数据集失败:', error);
return NextResponse.json(
{
success: false,
message: error.message
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,68 @@
/**
* 多轮对话数据集导出API
* 直接导出原始的 ShareGPT 格式数据集
*/
import { NextResponse } from 'next/server';
import { getAllDatasetConversations } from '@/lib/db/dataset-conversations';
/**
* 导出多轮对话数据集
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
const { searchParams } = new URL(request.url);
// 筛选条件
const filters = {
confirmed: searchParams.get('confirmed')
};
// 清除空值
Object.keys(filters).forEach(key => {
if (!filters[key]) delete filters[key];
});
// 获取所有对话数据集
const conversations = await getAllDatasetConversations(projectId, filters);
if (conversations.length === 0) {
return NextResponse.json([]);
}
// 转换为 ShareGPT 格式数组
const shareGptData = [];
for (const conversation of conversations) {
try {
// 解析 rawMessages
const messages = JSON.parse(conversation.rawMessages || '[]');
if (messages.length > 0) {
// 构建 ShareGPT 格式对象
const shareGptItem = {
messages: messages
};
shareGptData.push(shareGptItem);
}
} catch (error) {
console.error(`解析对话消息失败 ${conversation.id}:`, error);
// 跳过解析失败的对话,继续处理其他对话
continue;
}
}
return NextResponse.json(shareGptData);
} catch (error) {
console.error('导出多轮对话数据集失败:', error);
return NextResponse.json(
{
success: false,
message: error.message
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,135 @@
/**
* 多轮对话数据集管理API
*/
import { NextResponse } from 'next/server';
import {
getDatasetConversationsByPagination,
getAllDatasetConversationIds,
createDatasetConversation
} from '@/lib/db/dataset-conversations';
import { generateMultiTurnConversation } from '@/lib/services/multi-turn/index';
/**
* 获取多轮对话数据集列表(支持分页和筛选)
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
const { searchParams } = new URL(request.url);
const getAllIds = searchParams.get('getAllIds') === 'true'; // 新增获取所有对话ID的标志
// 筛选条件
const filters = {
keyword: searchParams.get('keyword'),
roleA: searchParams.get('roleA'),
roleB: searchParams.get('roleB'),
scenario: searchParams.get('scenario'),
scoreMin: searchParams.get('scoreMin'),
scoreMax: searchParams.get('scoreMax'),
confirmed: searchParams.get('confirmed')
};
// 清除空值
Object.keys(filters).forEach(key => {
if (!filters[key]) delete filters[key];
});
// 如果请求获取所有ID
if (getAllIds) {
const allConversationIds = await getAllDatasetConversationIds(projectId, filters);
return NextResponse.json({ allConversationIds });
}
// 正常分页查询
const page = parseInt(searchParams.get('page') || '1');
const pageSize = parseInt(searchParams.get('pageSize') || '20');
const result = await getDatasetConversationsByPagination(projectId, page, pageSize, filters);
return NextResponse.json({
success: true,
...result
});
} catch (error) {
console.error('获取多轮对话数据集失败:', error);
return NextResponse.json(
{
success: false,
message: error.message
},
{ status: 500 }
);
}
}
/**
* 创建多轮对话数据集
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const body = await request.json();
const { questionId, systemPrompt, scenario, rounds, roleA, roleB, model, language = '中文' } = body;
if (!questionId) {
return NextResponse.json(
{
success: false,
message: '问题ID不能为空'
},
{ status: 400 }
);
}
if (!model || !model.modelId) {
return NextResponse.json(
{
success: false,
message: '模型配置不能为空'
},
{ status: 400 }
);
}
// 构建配置
const config = {
systemPrompt: systemPrompt || '',
scenario: scenario || '',
rounds: rounds || 3,
roleA: roleA || '用户',
roleB: roleB || '助手',
model,
language
};
// 生成多轮对话
const result = await generateMultiTurnConversation(projectId, questionId, config);
if (!result.success) {
return NextResponse.json(
{
success: false,
message: result.error
},
{ status: 500 }
);
}
return NextResponse.json({
success: true,
data: result.data
});
} catch (error) {
console.error('创建多轮对话数据集失败:', error);
return NextResponse.json(
{
success: false,
message: error.message
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,42 @@
import { NextResponse } from 'next/server';
import { getAllDatasetConversations } from '@/lib/db/dataset-conversations';
/**
* 获取项目中多轮对话数据集的所有标签
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
if (!projectId) {
return NextResponse.json({ error: '项目ID不能为空' }, { status: 400 });
}
// 获取项目所有对话数据集
const conversations = await getAllDatasetConversations(projectId);
// 提取所有标签
const allTags = new Set();
conversations.forEach(conversation => {
if (conversation.tags && typeof conversation.tags === 'string') {
const tags = conversation.tags.split(/\s+/).filter(tag => tag.trim().length > 0);
tags.forEach(tag => allTags.add(tag.trim()));
}
});
return NextResponse.json({
success: true,
tags: Array.from(allTags).sort()
});
} catch (error) {
console.error('获取对话标签失败:', error);
return NextResponse.json(
{
success: false,
message: error.message
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,77 @@
import { NextResponse } from 'next/server';
import { db } from '@/lib/db';
export async function POST(req, { params }) {
try {
const { projectId, datasetId } = params;
// 1. 获取数据集详情
const dataset = await db.datasets.findUnique({
where: { id: datasetId, projectId }
});
if (!dataset) {
return NextResponse.json({ error: 'Dataset not found' }, { status: 404 });
}
// 2. 尝试通过 questionId 查找关联的 chunkId
let chunkId = null;
if (dataset.questionId) {
const question = await db.questions.findUnique({
where: { id: dataset.questionId }
});
if (question) {
chunkId = question.chunkId;
}
}
// 3. 创建评估数据集记录
// 默认使用 open_ended 类型,因为通常数据集是问答对,适合作为评估
let evalTags = [];
try {
evalTags = JSON.parse(dataset.tags || '[]');
if (!Array.isArray(evalTags)) evalTags = [];
} catch (e) {
evalTags = [];
}
// 排除 'Eval' 标签,并将数组转为逗号分隔的字符串
const evalTagsString = evalTags.filter(tag => tag !== 'Eval').join(',');
const evalDataset = await db.evalDatasets.create({
data: {
projectId,
question: dataset.question,
questionType: 'open_ended',
correctAnswer: dataset.answer,
tags: evalTagsString,
note: dataset.note,
chunkId: chunkId,
options: '' // 开放题不需要选项
}
});
// 4. 更新原数据集,添加 'Eval' 标签
let currentTags = [];
try {
currentTags = JSON.parse(dataset.tags || '[]');
} catch (e) {
// ignore error
}
if (!currentTags.includes('Eval')) {
currentTags.push('Eval');
await db.datasets.update({
where: { id: datasetId },
data: {
tags: JSON.stringify(currentTags)
}
});
}
return NextResponse.json({ success: true, evalDataset });
} catch (error) {
console.error('Failed to copy dataset to eval:', error);
return NextResponse.json({ error: 'Internal Server Error' }, { status: 500 });
}
}

View File

@@ -0,0 +1,36 @@
import { NextResponse } from 'next/server';
import { evaluateDataset } from '@/lib/services/datasets/evaluation';
/**
* 评估单个数据集的质量
*/
export async function POST(request, { params }) {
try {
const { projectId, datasetId } = params;
const { model, language = 'zh-CN' } = await request.json();
if (!projectId || !datasetId) {
return NextResponse.json({ success: false, message: '项目ID和数据集ID不能为空' }, { status: 400 });
}
if (!model) {
return NextResponse.json({ success: false, message: '模型配置不能为空' }, { status: 400 });
}
// 使用评估服务进行数据集评估
const result = await evaluateDataset(projectId, datasetId, model, language);
if (!result.success) {
return NextResponse.json({ success: false, message: result.error }, { status: 500 });
}
return NextResponse.json({
success: true,
message: '数据集评估完成',
data: result.data
});
} catch (error) {
console.error('数据集评估失败:', error);
return NextResponse.json({ success: false, message: `评估失败: ${error.message}` }, { status: 500 });
}
}

View File

@@ -0,0 +1,82 @@
import { NextResponse } from 'next/server';
import { getDatasetsById, getDatasetsCounts, getNavigationItems, updateDatasetMetadata } from '@/lib/db/datasets';
/**
* 获取项目的所有数据集
*/
export async function GET(request, { params }) {
try {
const { projectId, datasetId } = params;
// 验证项目ID
if (!projectId) {
return NextResponse.json({ error: '项目ID不能为空' }, { status: 400 });
}
if (!datasetId) {
return NextResponse.json({ error: '数据集ID不能为空' }, { status: 400 });
}
const { searchParams } = new URL(request.url);
const operateType = searchParams.get('operateType');
if (operateType !== null) {
const data = await getNavigationItems(projectId, datasetId, operateType);
return NextResponse.json(data);
}
const datasets = await getDatasetsById(datasetId);
let counts = await getDatasetsCounts(projectId);
return NextResponse.json({ datasets, ...counts });
} catch (error) {
console.error('获取数据集详情失败:', String(error));
return NextResponse.json(
{
error: error.message || '获取数据集详情失败'
},
{ status: 500 }
);
}
}
/**
* 更新数据集元数据(评分、标签、备注)
*/
export async function PATCH(request, { params }) {
try {
const { projectId, datasetId } = params;
// 验证参数
if (!projectId) {
return NextResponse.json({ error: '项目ID不能为空' }, { status: 400 });
}
if (!datasetId) {
return NextResponse.json({ error: '数据集ID不能为空' }, { status: 400 });
}
const body = await request.json();
const { score, tags, note } = body;
// 验证评分范围
if (score !== undefined && (score < 0 || score > 5)) {
return NextResponse.json({ error: '评分必须在0-5之间' }, { status: 400 });
}
// 验证标签格式
if (tags !== undefined && !Array.isArray(tags)) {
return NextResponse.json({ error: '标签必须是数组格式' }, { status: 400 });
}
// 更新数据集元数据
const updatedDataset = await updateDatasetMetadata(datasetId, { score, tags, note });
return NextResponse.json({
success: true,
dataset: updatedDataset
});
} catch (error) {
console.error('更新数据集元数据失败:', String(error));
return NextResponse.json(
{
error: error.message || '更新数据集元数据失败'
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,52 @@
import { NextResponse } from 'next/server';
import { getDatasetsById } from '@/lib/db/datasets';
import { getEncoding } from '@langchain/core/utils/tiktoken';
/**
* 异步计算数据集文本的Token数量
*/
export async function GET(request, { params }) {
try {
const { projectId, datasetId } = params;
if (!datasetId) {
return NextResponse.json({ error: '数据集ID不能为空' }, { status: 400 });
}
const datasets = await getDatasetsById(datasetId);
const tokenCounts = {
answerTokens: 0,
cotTokens: 0
};
try {
if (datasets.answer || datasets.cot) {
// 使用 cl100k_base 编码,适用于 gpt-3.5-turbo 和 gpt-4
const encoding = await getEncoding('cl100k_base');
if (datasets.answer) {
const tokens = encoding.encode(datasets.answer);
tokenCounts.answerTokens = tokens.length;
}
if (datasets.cot) {
const tokens = encoding.encode(datasets.cot);
tokenCounts.cotTokens = tokens.length;
}
}
} catch (error) {
console.error('计算Token数量失败:', String(error));
return NextResponse.json({ error: '计算Token数量失败' }, { status: 500 });
}
return NextResponse.json(tokenCounts);
} catch (error) {
console.error('获取Token计数失败:', String(error));
return NextResponse.json(
{
error: error.message || '获取Token计数失败'
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,55 @@
/**
* 批量数据集评估任务API
* 创建批量评估数据集质量的异步任务
*/
import { NextResponse } from 'next/server';
import { db } from '@/lib/db/index';
import { processTask } from '@/lib/services/tasks/index';
/**
* 创建批量数据集评估任务
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const { model, language = 'zh-CN' } = await request.json();
if (!projectId) {
return NextResponse.json({ success: false, message: '项目ID不能为空' }, { status: 400 });
}
if (!model || !model.modelId) {
return NextResponse.json({ success: false, message: '模型配置不能为空' }, { status: 400 });
}
// 创建批量评估任务
const newTask = await db.task.create({
data: {
projectId,
taskType: 'dataset-evaluation',
status: 0, // 初始状态: 处理中
modelInfo: JSON.stringify(model),
language: language || 'zh-CN',
detail: '',
totalCount: 0,
note: '准备开始批量评估数据集质量...',
completedCount: 0
}
});
// 异步处理任务
processTask(newTask.id).catch(err => {
console.error(`批量评估任务启动失败: ${newTask.id}`, String(err));
});
return NextResponse.json({
success: true,
message: '批量评估任务已创建',
data: { taskId: newTask.id }
});
} catch (error) {
console.error('创建批量评估任务失败:', error);
return NextResponse.json({ success: false, message: `创建任务失败: ${error.message}` }, { status: 500 });
}
}

View File

@@ -0,0 +1,128 @@
import { NextResponse } from 'next/server';
import {
getDatasets,
getBalancedDatasetsByTags,
getTagsWithDatasetCounts,
getDatasetsBatch,
getBalancedDatasetsByTagsBatch,
getDatasetsByIds,
getDatasetsByIdsBatch
} from '@/lib/db/datasets';
/**
* 获取导出数据集
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
const { searchParams } = new URL(request.url);
// 验证项目ID
if (!projectId) {
return NextResponse.json({ error: 'Project ID cannot be empty' }, { status: 400 });
}
const confirmedParam = searchParams.get('confirmed');
const confirmed = confirmedParam === null ? undefined : confirmedParam === 'true';
// 获取标签统计信息
const tagStats = await getTagsWithDatasetCounts(projectId, confirmed);
return NextResponse.json(tagStats);
} catch (error) {
console.error('Failed to get tag statistics:', String(error));
return NextResponse.json(
{
error: error.message || 'Failed to get tag statistics'
},
{ status: 500 }
);
}
}
/**
* 获取标签统计信息
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const body = await request.json();
// 验证项目ID
if (!projectId) {
return NextResponse.json({ error: 'Project ID cannot be empty' }, { status: 400 });
}
let status = body.status;
let confirmed = undefined;
if (status === 'confirmed') confirmed = true;
if (status === 'unconfirmed') confirmed = false;
// 检查是否是分批导出模式
const batchMode = body.batchMode ? 'true' : 'false';
const offset = body.offset ?? 0;
const batchSize = body.batchSize ?? 1000;
// 检查是否是平衡导出
const balanceMode = body.balanceMode ? 'true' : 'false';
const balanceConfig = body.balanceConfig;
// 检查是否有选中的数据集 ID
const selectedIds = Array.isArray(body.selectedIds) ? body.selectedIds : null;
if (batchMode === 'true') {
// 分批导出模式
if (selectedIds && selectedIds.length > 0) {
// 按选中 ID 分批导出
const datasets = await getDatasetsByIdsBatch(projectId, selectedIds, offset, batchSize);
const hasMore = datasets.length === batchSize;
return NextResponse.json({
data: datasets,
hasMore,
offset: offset + datasets.length
});
} else if (balanceMode === 'true' && balanceConfig) {
// 平衡分批导出
const parsedConfig = typeof balanceConfig === 'string' ? JSON.parse(balanceConfig) : balanceConfig;
const result = await getBalancedDatasetsByTagsBatch(projectId, parsedConfig, confirmed, offset, batchSize);
return NextResponse.json({
data: result.data,
hasMore: result.hasMore,
offset: offset + result.data.length
});
} else {
// 常规分批导出
const datasets = await getDatasetsBatch(projectId, confirmed, offset, batchSize);
const hasMore = datasets.length === batchSize;
return NextResponse.json({
data: datasets,
hasMore,
offset: offset + datasets.length
});
}
} else {
// 传统一次性导出模式(保持向后兼容)
if (selectedIds && selectedIds.length > 0) {
// 按选中 ID 导出
const datasets = await getDatasetsByIds(projectId, selectedIds);
return NextResponse.json(datasets);
} else if (balanceMode === 'true' && balanceConfig) {
// 平衡导出模式
const parsedConfig = typeof balanceConfig === 'string' ? JSON.parse(balanceConfig) : balanceConfig;
const datasets = await getBalancedDatasetsByTags(projectId, parsedConfig, confirmed);
return NextResponse.json(datasets);
} else {
// 常规导出模式
const datasets = await getDatasets(projectId, confirmed);
return NextResponse.json(datasets);
}
}
} catch (error) {
console.error('Failed to get datasets:', String(error));
return NextResponse.json(
{
error: error.message || 'Failed to get datasets'
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,44 @@
import { NextResponse } from 'next/server';
import { getDatasetsById } from '@/lib/db/datasets';
import LLMClient from '@/lib/llm/core/index';
import { getEvalQuestionPrompt } from '@/lib/llm/prompts/evalQuestion';
import { extractJsonFromLLMOutput } from '@/lib/llm/common/util';
export async function POST(request, { params }) {
try {
const { projectId } = params;
const { datasetId, model, language, questionType = 'open_ended', count = 1 } = await request.json();
if (!datasetId || !model) {
return NextResponse.json({ error: 'Missing required parameters' }, { status: 400 });
}
// 1. 获取原数据集
const dataset = await getDatasetsById(datasetId);
if (!dataset) {
return NextResponse.json({ error: 'Dataset not found' }, { status: 404 });
}
// 2. 构建提示词
// 将原问题和答案合并作为上下文文本
const text = `Question: ${dataset.question}\nAnswer: ${dataset.answer}`;
const prompt = await getEvalQuestionPrompt(language || 'zh-CN', questionType, { text, number: count }, projectId);
// 3. 调用 LLM
const client = new LLMClient(model);
const response = await client.getResponse(prompt);
const result = extractJsonFromLLMOutput(response);
// 结果应该是一个数组
if (!result || !Array.isArray(result)) {
throw new Error('Failed to parse LLM output or output is not an array');
}
return NextResponse.json({ success: true, data: result });
} catch (error) {
console.error('Generate eval variant failed:', error);
return NextResponse.json({ error: error.message || 'Internal Server Error' }, { status: 500 });
}
}

View File

@@ -0,0 +1,109 @@
import { NextResponse } from 'next/server';
import { createDataset } from '@/lib/db/datasets';
import { nanoid } from 'nanoid';
export async function POST(request, { params }) {
try {
const { projectId } = params;
const { datasets, sourceInfo } = await request.json();
if (!datasets || !Array.isArray(datasets)) {
return NextResponse.json({ error: 'Invalid datasets data' }, { status: 400 });
}
const results = [];
const errors = [];
let successCount = 0;
let skippedCount = 0;
for (let i = 0; i < datasets.length; i++) {
try {
const dataset = datasets[i];
// 安全获取与清洗字段
const q = typeof dataset?.question === 'string' ? dataset.question.trim() : '';
const a = typeof dataset?.answer === 'string' ? dataset.answer.trim() : '';
// 验证必填字段:缺失则跳过
if (!q || !a) {
errors.push(`${i + 1} 条记录缺少必填字段(question/answer),已跳过`);
skippedCount++;
continue;
}
// 规范化可选字段
const chunkName = dataset?.chunkName || 'Imported Data';
const chunkContent = dataset?.chunkContent || 'Imported from external source';
const model = dataset?.model || 'imported';
const questionLabel = dataset?.questionLabel || '';
const cot = typeof dataset?.cot === 'string' ? dataset.cot : '';
const confirmed = typeof dataset?.confirmed === 'boolean' ? dataset.confirmed : false;
const score = typeof dataset?.score === 'number' ? dataset.score : 0;
// tags: 支持数组/字符串/对象
let tags = '[]';
if (Array.isArray(dataset?.tags)) {
try {
tags = JSON.stringify(dataset.tags);
} catch {
tags = '[]';
}
} else if (typeof dataset?.tags === 'string') {
tags = dataset.tags;
} else if (dataset?.tags && typeof dataset.tags === 'object') {
try {
tags = JSON.stringify(dataset.tags);
} catch {
tags = '[]';
}
}
// other: 对象或字符串
let other = '{}';
if (typeof dataset?.other === 'string') {
other = dataset.other;
} else if (dataset?.other && typeof dataset.other === 'object') {
try {
other = JSON.stringify(dataset.other);
} catch {
other = '{}';
}
}
const note = typeof dataset?.note === 'string' ? dataset.note : '';
// 创建数据集记录
const newDataset = await createDataset({
projectId,
questionId: nanoid(), // 生成唯一的问题ID
question: q,
answer: a,
chunkName,
chunkContent,
model,
questionLabel,
cot,
confirmed,
score,
tags,
note,
other
});
results.push(newDataset);
successCount++;
} catch (error) {
errors.push(`${i + 1} 条记录: ${error.message}`);
}
}
return NextResponse.json({
success: successCount,
total: datasets.length,
failed: errors.length,
skipped: skippedCount,
errors,
sourceInfo
});
} catch (error) {
console.error('Import datasets error:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

View File

@@ -0,0 +1,89 @@
import { NextResponse } from 'next/server';
import { getDatasetsById, updateDataset } from '@/lib/db/datasets';
import { getQuestionById } from '@/lib/db/questions';
import { getChunkById } from '@/lib/db/chunks';
import LLMClient from '@/lib/llm/core/index';
import { getNewAnswerPrompt } from '@/lib/llm/prompts/newAnswer';
import { extractJsonFromLLMOutput } from '@/lib/llm/common/util';
// 优化数据集答案
export async function POST(request, { params }) {
try {
const { projectId } = params;
// 验证项目ID
if (!projectId) {
return NextResponse.json({ error: 'Project ID cannot be empty' }, { status: 400 });
}
// 获取请求体
const { datasetId, model, advice, language } = await request.json();
if (!datasetId) {
return NextResponse.json({ error: 'Dataset ID cannot be empty' }, { status: 400 });
}
if (!model) {
return NextResponse.json({ error: 'Model cannot be empty' }, { status: 400 });
}
if (!advice) {
return NextResponse.json({ error: 'Please provide optimization suggestions' }, { status: 400 });
}
// 获取数据集内容
const dataset = await getDatasetsById(datasetId);
if (!dataset) {
return NextResponse.json({ error: 'Dataset does not exist' }, { status: 404 });
}
// 创建LLM客户端
const llmClient = new LLMClient(model);
const { question, answer, cot, chunkContent: storedChunkContent, questionId } = dataset;
let chunkContent = storedChunkContent || '';
if (!chunkContent && questionId) {
try {
const questionRecord = await getQuestionById(questionId);
if (questionRecord?.chunkId) {
const chunkRecord = await getChunkById(questionRecord.chunkId);
chunkContent = chunkRecord?.content || '';
}
} catch (error) {
console.error('Failed to load chunk content by questionId:', error);
}
}
// 生成优化后的答案和思维链
const prompt = await getNewAnswerPrompt(language, { question, answer, cot, advice, chunkContent }, projectId);
const response = await llmClient.getResponse(prompt);
// 从LLM输出中提取JSON格式的优化结果
const optimizedResult = extractJsonFromLLMOutput(response);
if (!optimizedResult || !optimizedResult.answer) {
return NextResponse.json({ error: 'Failed to optimize answer, please try again' }, { status: 500 });
}
// 更新数据集
const updatedDataset = {
...dataset,
answer: optimizedResult.answer,
cot: cot ? optimizedResult.cot || cot : '' // 如果没有提供思考过程,则不更新
};
await updateDataset(updatedDataset);
// 返回优化后的数据集
return NextResponse.json({
success: true,
dataset: updatedDataset
});
} catch (error) {
console.error('Failed to optimize answer:', String(error));
return NextResponse.json({ error: error.message || 'Failed to optimize answer' }, { status: 500 });
}
}

View File

@@ -0,0 +1,193 @@
import { NextResponse } from 'next/server';
import {
deleteDataset,
getDatasetsByPagination,
getDatasetsIds,
getDatasetsById,
updateDataset
} from '@/lib/db/datasets';
import datasetService from '@/lib/services/datasets';
// 优化思维链函数已移至服务层
/**
* 生成数据集(为单个问题生成答案)
*/
export async function POST(request, { params }) {
try {
const { projectId } = params;
const { questionId, model, language } = await request.json();
// 使用数据集生成服务
const result = await datasetService.generateDatasetForQuestion(projectId, questionId, {
model,
language
});
return NextResponse.json(result);
} catch (error) {
console.error('Failed to generate dataset:', String(error));
return NextResponse.json(
{
error: error.message || 'Failed to generate dataset'
},
{ status: 500 }
);
}
}
/**
* 获取项目的所有数据集
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
const { searchParams } = new URL(request.url);
// 验证项目ID
if (!projectId) {
return NextResponse.json({ error: '项目ID不能为空' }, { status: 400 });
}
const page = parseInt(searchParams.get('page')) || 1;
const size = parseInt(searchParams.get('size')) || 10;
const input = searchParams.get('input');
const field = searchParams.get('field') || 'question';
const status = searchParams.get('status');
const hasCot = searchParams.get('hasCot');
const isDistill = searchParams.get('isDistill');
const scoreRange = searchParams.get('scoreRange');
const customTag = searchParams.get('customTag');
const noteKeyword = searchParams.get('noteKeyword');
const chunkName = searchParams.get('chunkName');
let confirmed = undefined;
if (status === 'confirmed') confirmed = true;
if (status === 'unconfirmed') confirmed = false;
let selectedAll = searchParams.get('selectedAll');
if (selectedAll) {
let data = await getDatasetsIds(
projectId,
confirmed,
input,
field,
hasCot,
isDistill,
scoreRange,
customTag,
noteKeyword,
chunkName
);
return NextResponse.json(data);
}
// 获取数据集
const datasets = await getDatasetsByPagination(
projectId,
page,
size,
confirmed,
input,
field, // 传递搜索字段参数
hasCot, // 传递思维链筛选参数
isDistill, // 传递蒸馏数据集筛选参数
scoreRange, // 传递评分范围筛选参数
customTag, // 传递自定义标签筛选参数
noteKeyword, // 传递备注关键字筛选参数
chunkName // 传递文本块名称筛选参数
);
return NextResponse.json(datasets);
} catch (error) {
console.error('获取数据集失败:', String(error));
return NextResponse.json(
{
error: error.message || '获取数据集失败'
},
{ status: 500 }
);
}
}
/**
* 删除数据集
*/
export async function DELETE(request) {
try {
const { searchParams } = new URL(request.url);
const datasetId = searchParams.get('id');
if (!datasetId) {
return NextResponse.json(
{
error: 'Dataset ID cannot be empty'
},
{ status: 400 }
);
}
await deleteDataset(datasetId);
return NextResponse.json({
success: true,
message: 'Dataset deleted successfully'
});
} catch (error) {
console.error('Failed to delete dataset:', error);
return NextResponse.json(
{
error: error.message || 'Failed to delete dataset'
},
{ status: 500 }
);
}
}
/**
* 编辑数据集
*/
export async function PATCH(request) {
try {
const { searchParams } = new URL(request.url);
const datasetId = searchParams.get('id');
const { answer, cot, question, confirmed } = await request.json();
if (!datasetId) {
return NextResponse.json(
{
error: 'Dataset ID cannot be empty'
},
{ status: 400 }
);
}
// 获取所有数据集
let dataset = await getDatasetsById(datasetId);
if (!dataset) {
return NextResponse.json(
{
error: 'Dataset does not exist'
},
{ status: 404 }
);
}
let data = { id: datasetId };
if (confirmed !== undefined) data.confirmed = confirmed;
if (answer) data.answer = answer;
if (cot) data.cot = cot;
if (question) data.question = question;
// 保存更新后的数据集列表
await updateDataset(data);
return NextResponse.json({
success: true,
message: 'Dataset updated successfully',
dataset: dataset
});
} catch (error) {
console.error('Failed to update dataset:', String(error));
return NextResponse.json(
{
error: error.message || 'Failed to update dataset'
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,28 @@
import { NextResponse } from 'next/server';
import { getUsedCustomTags } from '@/lib/db/datasets';
/**
* 获取项目中使用过的自定义标签
*/
export async function GET(request, { params }) {
try {
const { projectId } = params;
// 验证项目ID
if (!projectId) {
return NextResponse.json({ error: '项目ID不能为空' }, { status: 400 });
}
const tags = await getUsedCustomTags(projectId);
return NextResponse.json({ tags });
} catch (error) {
console.error('获取自定义标签失败:', String(error));
return NextResponse.json(
{
error: error.message || '获取自定义标签失败'
},
{ status: 500 }
);
}
}

View File

@@ -0,0 +1,38 @@
import { NextResponse } from 'next/server';
// 获取默认提示词内容
export async function GET(request, { params }) {
try {
const { searchParams } = new URL(request.url);
const promptType = searchParams.get('promptType');
const promptKey = searchParams.get('promptKey');
if (!promptType || !promptKey) {
return NextResponse.json({ error: 'promptType and promptKey are required' }, { status: 400 });
}
// 动态导入对应的提示词模块
let promptModule;
try {
promptModule = await import(`@/lib/llm/prompts/${promptType}`);
} catch (error) {
return NextResponse.json({ error: `Prompt module ${promptType} not found` }, { status: 404 });
}
// 获取指定的提示词常量
const promptContent = promptModule[promptKey];
if (!promptContent) {
return NextResponse.json({ error: `Prompt key ${promptKey} not found in module ${promptType}` }, { status: 404 });
}
return NextResponse.json({
success: true,
content: promptContent,
promptType,
promptKey
});
} catch (error) {
console.error('获取默认提示词失败:', error);
return NextResponse.json({ error: error.message }, { status: 500 });
}
}

Some files were not shown because too many files have changed in this diff Show More