# Chunk Module API Usage Guide
The Compileo Chunk Module provides a flexible API for document chunking with multiple strategies. This guide covers how to use the chunking API endpoints programmatically.
## Chunking Strategy Options

```mermaid
graph TD
    A[Choose Chunking Strategy] --> B{Strategy Type}
    B --> C[Character]
    B --> D[Semantic]
    B --> E[Schema]
    B --> F[Delimiter]
    B --> G[Token]
    C --> C1[chunk_size: int]
    C --> C2[overlap: int]
    D --> D1[semantic_prompt: string]
    E --> E1[schema_definition: string]
    F --> F1[delimiter: string]
    G --> G1[chunk_size: int]
    G --> G2[overlap: int]
```
## API Endpoints

### Chunk Retrieval

`GET /api/v1/chunks/document/{document_id}`
Retrieve all chunks for a specific document.
Response:

```json
{
  "document_id": 1,
  "chunks": [
    {
      "id": 1,
      "chunk_index": 1,
      "token_count": 278,
      "file_path": "storage/chunks/1/1/chunk_1.md",
      "content_preview": "# PROGNOSIS\nMost patients respond well...",
      "chunk_strategy": "schema"
    }
  ],
  "total": 5
}
```
`GET /api/v1/chunks/project/{project_id}`
Retrieve chunks for all documents associated with a specific project. Useful for validating project-wide processing status.
Query Parameters:
- limit: Maximum number of chunks to return (default: 100)
Response:

```json
{
  "project_id": "b357e573-89a5-4b40-8e1b-4c075a1835a6",
  "chunks": [
    {
      "id": "uuid-1",
      "document_id": "uuid-doc-1",
      "chunk_index": 1,
      "chunk_strategy": "semantic",
      "file_path": "storage/chunks/..."
    }
  ],
  "total": 1
}
```
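For example, a minimal Python call using the `limit` query parameter might look like this (a sketch assuming the API is served at `http://localhost:8000`, as in the client examples later in this guide; the project ID is a placeholder):

```python
import requests

# Project ID from your deployment (placeholder value shown here).
project_id = "b357e573-89a5-4b40-8e1b-4c075a1835a6"

# Retrieve up to 50 chunks across all documents in the project.
response = requests.get(
    f"http://localhost:8000/api/v1/chunks/project/{project_id}",
    params={"limit": 50},
)
response.raise_for_status()

data = response.json()
print(f"Retrieved {len(data['chunks'])} of {data['total']} chunks")
```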
### Chunk Deletion

`DELETE /api/v1/chunks/{chunk_id}`
Delete a specific chunk.
Response:
`DELETE /api/v1/chunks/batch`
Delete multiple chunks by their IDs. Supports flexible input formats for bulk operations.
Request Body:
Response:
`DELETE /api/v1/chunks/document/{document_id}`
Delete all chunks for a document.
Response:
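A minimal Python sketch of the three deletion endpoints follows. The batch request body shown (`{"chunk_ids": [...]}`) is an assumption for illustration only, since the exact accepted formats and response bodies are not listed here; the chunk and document IDs are placeholders.

```python
import requests

BASE = "http://localhost:8000/api/v1"  # assumed base URL, matching the client examples below

# Delete a single chunk by ID (placeholder ID).
requests.delete(f"{BASE}/chunks/123")

# Batch delete. NOTE: the {"chunk_ids": [...]} body shape is an illustrative
# assumption; the endpoint is documented as accepting flexible input formats.
requests.delete(f"{BASE}/chunks/batch", json={"chunk_ids": [101, 102, 103]})

# Delete all chunks belonging to a document (placeholder document ID).
requests.delete(f"{BASE}/chunks/document/45")
```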
### Document Processing

`POST /api/v1/documents/process`

Process documents with the specified chunking strategy and parameters.
### AI-Assisted Analysis

`POST /api/v1/documents/analyze-chunking`

Get AI recommendations for the optimal chunking strategy based on document analysis.
#### AI Analysis Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| document_id | integer | Yes | ID of document to analyze |
| goal | string | Yes | Description of chunking objective |
| examples | array | No | List of example strings from document |
| model | string | No | AI model for analysis (gemini, grok, ollama) |
#### AI Analysis Example Request

```json
{
  "document_id": 101,
  "goal": "Split the document at every chapter, but each chapter has a different name and format",
  "examples": [
    "Page 1: Headers: # Chapter 1: Introduction",
    "Page 3: Section: This chapter provides an overview...",
    "Page 5: Selected: Each new chapter starts with a level 1 header"
  ],
  "model": "gemini"
}
```
#### AI Analysis Response

```json
{
  "recommended_strategy": "schema",
  "parameters": {
    "json_schema": "{\"rules\": [{\"type\": \"pattern\", \"value\": \"^# \"}, {\"type\": \"delimiter\", \"value\": \"\\n\\n\"}], \"combine\": \"any\"}",
    "explanation": "Schema-based chunking recommended for consistent chapter header patterns"
  },
  "confidence": 0.85,
  "alternative_strategies": [
    {
      "strategy": "semantic",
      "parameters": {"custom_prompt": "Split at chapter boundaries..."},
      "confidence": 0.72
    }
  ]
}
```
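For reference, a minimal Python call to the analysis endpoint using the parameters documented above (a sketch assuming the API is served at `http://localhost:8000`, as in the client examples later in this guide):

```python
import requests

# Ask the API to recommend a chunking strategy for document 101.
response = requests.post(
    "http://localhost:8000/api/v1/documents/analyze-chunking",
    json={
        "document_id": 101,
        "goal": "Split the document at every chapter",
        "examples": ["Page 1: Headers: # Chapter 1: Introduction"],
        "model": "gemini",
    },
)
response.raise_for_status()

analysis = response.json()
print(analysis["recommended_strategy"], analysis["confidence"])
```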
#### Request Parameters (POST /api/v1/documents/process)
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | integer | Yes | ID of the project containing documents |
| document_ids | array | Yes | List of document IDs to process |
| parser | string | No | Document parser (gemini, grok, ollama, pypdf, unstructured, huggingface, novlm) |
| chunk_strategy | string | No | Chunking strategy (token, character, semantic, delimiter, schema) |
| chunk_size | integer | No | Chunk size (tokens for token strategy, characters for character strategy) |
| overlap | integer | No | Overlap between chunks |
| num_ctx | integer | No | Context window size for Ollama models (overrides default setting) |
| semantic_prompt | string | No | Custom prompt for semantic chunking |
| schema_definition | string | No | JSON schema for schema-based chunking |
| character_chunk_size | integer | No | Character chunk size (overrides chunk_size) |
| character_overlap | integer | No | Character overlap (overrides overlap) |
| sliding_window | boolean | No | Enable sliding window chunking for multi-file documents (auto-enabled for multi-file documents) |
| system_instruction | string | No | System-level instructions to guide the model's behavior, especially for Gemini |
## Character-Based Chunking

Split documents by character count with configurable overlap. Fast and deterministic.

### Example Request
```json
{
  "project_id": 1,
  "document_ids": [101, 102],
  "parser": "pypdf",
  "chunk_strategy": "character",
  "character_chunk_size": 500,
  "character_overlap": 50
}
```
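To make the numbers concrete: assuming the overlap means each chunk repeats the trailing characters of the previous one, a 1,400-character document processed with the request above would yield chunks covering roughly characters 0 to 500, 450 to 950, and 900 to 1,400.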
### Use Cases
- Fixed-size text processing
- Memory-constrained environments
- Deterministic chunking results
- Simple document structures
## Semantic Chunking
Use AI to intelligently split documents based on meaning and context. Supports multi-file documents with dynamic cross-file chunking for semantic coherence.
### Simplified Universal Cross-File Document Support
The API automatically handles multi-file documents using universal forwarding logic that ensures semantic coherence across file boundaries:
- **Universal Forwarding Rules**: All chunking strategies use the same simple rule - if content remains at the end, forward it to the next file
- **Strategy-Agnostic Detection**: Removed complex per-strategy incomplete chunk detection code
- **Automatic Content Spacing**: Intelligent space insertion between forwarded content and main content prevents word concatenation
- **Memory-Based State Management**: Simplified ChunkState object maintains forwarded content between file processing

**Automatic Processing**: Cross-file chunking is automatically applied to multi-file documents. The system dynamically forwards incomplete chunks as overlap content to subsequent files.

**Benefits**:

- Improved semantic chunking quality at file boundaries
- Better search results with reduced duplication
- More coherent chunks for AI processing
- Simplified architecture with universal forwarding rules
- All 5 chunking strategies (character, token, semantic, schema, delimiter) use identical logic
### Example Request

```json
{
  "project_id": 1,
  "document_ids": [101],
  "parser": "ollama",
  "chunk_strategy": "semantic",
  "semantic_prompt": "This is a medical textbook that is structured as follows: disease / condition and discussion about it, then another disease / condition and discussion about it. Split should occur at the end of each discussion and before next disease / condition title.",
  "num_ctx": 4096
}
```
### Prompt Examples
**General Purpose (Recommended for all models including Gemini):**
```
User Instruction:

Output Requirements:
- Return ONLY a comma-separated list of the exact heading strings that should start a new chunk.
- Do not include any other text, explanations, or formatting.
- Each heading should be exactly as it appears in the document.

Example: If the instruction is "Split by chapter" and the text contains "# Chapter 1" and "# Chapter 2", your output should be:
# Chapter 1,# Chapter 2

Document to analyze:
```

Example user instructions:

- "This is a medical textbook that is structured as follows: disease / condition and discussion about it, then another disease / condition and discussion about it. Split should occur at the end of each discussion and before next disease / condition title."
- "Split this medical document at natural section boundaries, ensuring each chunk contains complete clinical information about a single condition, symptom, or treatment."
- "Divide this legal document at section boundaries, keeping each complete legal clause, definition, or contractual obligation in a single chunk."
- "Split this technical document at logical boundaries, ensuring each chunk contains complete explanations of single concepts, algorithms, or procedures."

### Use Cases
- Complex document structures
- Meaning preservation
- Context-aware splitting
- Domain-specific requirements
- Medical textbooks and clinical documents
- Multi-file document processing
### Ollama Context Window Configuration
When using Ollama models for semantic chunking, you can control the context window size:
- **`num_ctx`**: Specifies the maximum context length in tokens for Ollama models
- **Default**: Uses the value configured in GUI settings (typically 60000 tokens)
- **Override**: API parameter takes precedence over settings default
- **Performance**: Smaller values reduce memory usage but may limit complex analysis
- **Compatibility**: Only applies to Ollama models; ignored for Gemini/Grok
### Recent Improvements
**Simplified Universal Cross-File Chunking Architecture:**
- **Universal Forwarding Rules**: All chunking strategies use the same simple rule - if content remains at end, forward it to next file
- **Strategy-Agnostic Detection**: Removed 50+ lines of complex per-strategy incomplete chunk detection code
- **Automatic Content Spacing**: Intelligent space insertion between forwarded content and main content prevents word concatenation
- **Memory-Based State Management**: Simplified ChunkState object maintains forwarded content between file processing
- **All Strategies Unified**: Character, token, semantic, schema, and delimiter strategies use identical cross-file logic
**Enhanced Content Processing:**
- **Intelligent Spacing**: Automatic space insertion prevents issues like "glutendamages" → "gluten damages"
- **Simplified Architecture**: Single forwarding mechanism instead of strategy-specific code
- **Memory Efficient**: No duplicate content storage or complex overlap calculations
- **Universal Compatibility**: Works identically across all 5 chunking strategies
**Streamlined Implementation:**
- **Removed Strategy-Specific Code**: Eliminated complex per-strategy incomplete chunk detection logic
- **Dynamic Overlap Generation**: Overlap created naturally during chunking, not pre-computed
- **Simplified Data Structures**: Clean content processing with automatic forwarding
- **Improved Performance**: Reduced complexity and memory usage in cross-file processing
**Enhanced Quality Assurance:**
- **Comprehensive Testing**: Verified all chunking strategies work correctly with cross-file processing
- **Real-World Validation**: Tested on medical documents with proper semantic coherence
- **Spacing Integrity**: Automatic prevention of word concatenation across file boundaries
- **Universal Logic**: Same forwarding rules apply to all strategies regardless of complexity
## Dynamic Cross-File Chunking
Advanced chunking method that dynamically generates overlap content during processing, ensuring semantic coherence across file boundaries. Automatically applied to multi-file documents with guaranteed boundary integrity.
### How It Works
1. **Sequential File Processing**: Files are processed one by one in order
2. **Universal Forwarding Logic**: All chunking strategies use the same simple rule - if content remains at the end, forward it to the next file
3. **Automatic Content Spacing**: Intelligent space insertion between forwarded content and main content prevents word concatenation
4. **Memory-Based State Management**: Simplified ChunkState object maintains forwarded content between file processing
5. **Strategy Transparency**: Chunking engines are unaware of the cross-file logic; they simply process complete content (see the sketch below)
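The following is a minimal sketch of that forwarding loop under stated assumptions: the `ChunkState` fields, the toy `character_chunks` engine, and the single-space separator are simplified illustrations, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ChunkState:
    """Simplified illustration of the state carried between files."""
    forwarded_content: str = ""

def character_chunks(text: str, size: int = 500) -> tuple[list[str], str]:
    """Toy stand-in for a chunking engine: fixed-size character chunks,
    with any trailing remainder returned as leftover content."""
    complete = [text[i:i + size] for i in range(0, len(text) - size + 1, size)]
    leftover = text[len(complete) * size:]
    return complete, leftover

def process_files(files: list[str]) -> list[str]:
    state = ChunkState()
    chunks: list[str] = []
    for file_text in files:
        # Automatic content spacing: a separator keeps forwarded content from
        # concatenating with the next file's first word ("glutendamages").
        if state.forwarded_content:
            combined = state.forwarded_content + " " + file_text
        else:
            combined = file_text
        # The chunking engine only ever sees complete content; it is unaware
        # of the cross-file logic (strategy transparency).
        complete, leftover = character_chunks(combined)
        chunks.extend(complete)
        # Universal forwarding rule: whatever remains goes to the next file.
        state.forwarded_content = leftover
    # After the last file, any remaining content becomes the final chunk.
    if state.forwarded_content:
        chunks.append(state.forwarded_content)
    return chunks
```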
### Processing Flow
```
File 2 Processing:
├── Combine: overlap_from_file1 + separator + main_content
├── Apply chunking strategy to combined content
├── Create complete chunks + new leftover content
└── Forward new leftover → File 3's overlap_content

File N Processing:
├── Combine: overlap_from_prev + separator + main_content
├── Apply chunking strategy
└── Create final complete chunks
```
### Universal Forwarding Mechanism
The architecture ensures **100% boundary integrity** with simplified logic:
- **Parsing**: Creates clean `main_content` without overlap assumptions
- **First File**: Processes from natural document start
- **Subsequent Files**: Receive overlap as guaranteed start boundary
- **All Strategies**: Use identical forwarding rules regardless of chunking method
- **Automatic Spacing**: Prevents word concatenation (e.g., "glutendamages" → "gluten damages")
### Example Request
```json
{
"project_id": 1,
"document_ids": [101, 102, 103],
"parser": "gemini",
"chunk_strategy": "semantic",
"sliding_window": true
}
```

### Window Processing Structure
Each window contains structured content for AI processing:
```json
{
  "content_type": "sliding_window_chunk",
  "main_content": "# Chapter 3: Advanced Topics\n\nThis chapter covers...",
  "overlap_content": "# Chapter 2 Conclusion\n\nIn summary, the basic concepts...",
  "metadata": {
    "window_size": 2500,
    "overlap_tokens": 400,
    "total_tokens": 2900
  }
}
```
### Use Cases

- **Large Multi-File Documents**: PDFs split into multiple parts
- **Cross-File Continuity**: Topics spanning artificial file boundaries
- **Semantic Coherence**: Maintaining context across pagination breaks
- **Quality Improvement**: Better search results and AI processing
### Configuration Options

- **Automatic Detection**: Applied automatically to documents with multiple parsed files
## Schema-Based Chunking

Apply custom rules combining patterns and delimiters for precise control.

### Schema Format

The API automatically attempts to fix common JSON syntax errors in regex patterns (e.g., unescaped `\s`, `\n`) and literal control characters, but it is best practice to provide a fully escaped JSON string.

Regex Support: Schema strategies support `re.MULTILINE`, allowing the use of `^` anchors to match the start of lines within a document.
```json
{
  "rules": [
    {
      "type": "pattern",
      "value": "# [A-Z\\s]+"
    },
    {
      "type": "delimiter",
      "value": "\n\n"
    }
  ],
  "combine": "any"
}
```
### Combine Options

- "any": Split when any rule matches
- "all": Split only when all rules match at the same position
### Example Request

```json
{
  "project_id": 1,
  "document_ids": [101],
  "parser": "unstructured",
  "chunk_strategy": "schema",
  "schema_definition": "{\"rules\": [{\"type\": \"pattern\", \"value\": \"^## \"}, {\"type\": \"delimiter\", \"value\": \"\\n\\n\"}], \"combine\": \"any\"}"
}
```
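Because `schema_definition` is itself a JSON string embedded in the request, it is easiest to build it with `json.dumps` rather than escaping it by hand, as in this short sketch:

```python
import json

schema = {
    "rules": [
        {"type": "pattern", "value": "^## "},
        {"type": "delimiter", "value": "\n\n"},
    ],
    "combine": "any",
}

request_body = {
    "project_id": 1,
    "document_ids": [101],
    "parser": "unstructured",
    "chunk_strategy": "schema",
    # json.dumps produces the escaped string shown in the example above.
    "schema_definition": json.dumps(schema),
}
```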
### Rule Types

**Pattern Rules:**

- Use regex patterns to match specific text structures
- Supports `re.MULTILINE` mode (use `^` to match the start of a line)
- Examples: `"^# "`, `"^[0-9]+\."`, `"<chapter>"`

**Delimiter Rules:**

- Split on exact string matches
- Examples: `"\n\n"`, `"<hr>"`, `"---"`
### Use Cases
- Structured documents with known patterns
- Custom document formats
- Precise control requirements
- Multi-criteria splitting
## Delimiter-Based Chunking

Simple splitting on specified delimiter strings, with support for multiple delimiters.

### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| delimiters | array | No | List of delimiter strings to split on (default: ["\n\n", "\n"]) |
| chunk_size | integer | No | Maximum chunk size in characters |
| overlap | integer | No | Overlap between chunks in characters |
### Example Request

```json
{
  "project_id": 1,
  "document_ids": [101],
  "parser": "pypdf",
  "chunk_strategy": "delimiter",
  "delimiters": ["#", "\n\n", "---"],
  "chunk_size": 1000,
  "overlap": 100
}
```
### Delimiter Examples

- Markdown headers: for example, `["#", "##", "###"]`
- Mixed delimiters: for example, `["#", "\n\n", "---"]`
- Custom patterns: any literal strings, for example `["<hr>", "---"]`
### Use Cases
- Simple document structures
- Known separator patterns
- Quick processing needs
- Markdown document chunking
- Custom delimiter patterns
## Token-Based Chunking

Precise token counting using the tiktoken library, with configurable overlap. Requires the tiktoken package to be installed.

### Example Request
```json
{
  "project_id": 1,
  "document_ids": [101],
  "parser": "grok",
  "chunk_strategy": "token",
  "chunk_size": 512,
  "overlap": 50
}
```
### Error Handling
Token chunking will fail explicitly if:
- tiktoken library is not installed
- Invalid tokenizer model specified
- Strategy creation fails for any reason
Error Response:

```json
{
  "detail": "Failed to create token-based chunking strategy: No module named 'tiktoken'. Token-based chunking requires tiktoken library."
}
```
### Use Cases
- LLM input preparation with exact token limits
- Token-aware processing for API constraints
- Precise semantic chunking based on token boundaries
## Response Format

### Success Response
```json
{
  "job_id": "chunk_job_12345",
  "message": "Successfully processed 1 documents, created 5 chunks",
  "processed_documents": 1,
  "total_chunks": 5,
  "estimated_duration": "Completed",
  "debug_info": {
    "total_requested": 1,
    "project_id": 1,
    "parser": "pypdf",
    "chunk_strategy": "character",
    "character_chunk_size": 500,
    "character_overlap": 50
  }
}
```
### Job Status Checking

`GET /api/v1/documents/process/{job_id}/status`
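A simple polling loop in Python (a sketch; the status payload fields are not documented here, so the example just prints the raw response):

```python
import time
import requests

job_id = "chunk_job_12345"  # returned by POST /api/v1/documents/process

# Poll the status endpoint a fixed number of times; adjust to your needs.
for _ in range(30):
    response = requests.get(
        f"http://localhost:8000/api/v1/documents/process/{job_id}/status"
    )
    response.raise_for_status()
    print(response.json())  # inspect the status payload
    time.sleep(2)
```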
## Error Handling

### Common Errors

**400 Bad Request:**
```json
{
  "detail": "Invalid chunk_strategy. Must be one of: token, character, semantic, delimiter, schema"
}
```
**422 Validation Error:**

```json
{
  "detail": [
    {
      "loc": ["body", "character_chunk_size"],
      "msg": "ensure this value is greater than 0",
      "type": "value_error.const"
    }
  ]
}
```
## Best Practices

### Strategy Selection

- **Character**: For simple, fast processing
- **Semantic**: For complex documents requiring AI understanding
- **Schema**: For structured documents with known patterns
- **Delimiter**: For simple separator-based splitting
- **Token**: For LLM-specific token limits
### Performance Optimization
- Use character strategy for large volumes
- Batch multiple documents together
- Choose appropriate parsers for document types
- Monitor job status for long-running processes
### Error Prevention

- Validate schema JSON before submission (see the sketch after this list)
- Test prompts with sample documents
- Use appropriate chunk sizes for your use case
- Monitor API rate limits for AI strategies
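For instance, a quick client-side check that a `schema_definition` string is valid JSON before submission (a minimal sketch; it checks JSON syntax and the documented top-level keys, not full rule semantics):

```python
import json

schema_definition = '{"rules": [{"type": "pattern", "value": "^## "}], "combine": "any"}'

try:
    schema = json.loads(schema_definition)
except json.JSONDecodeError as err:
    raise SystemExit(f"schema_definition is not valid JSON: {err}")

# Basic structural sanity checks before calling the API.
assert isinstance(schema.get("rules"), list), "rules must be a list"
assert schema.get("combine") in ("any", "all"), "combine must be 'any' or 'all'"
```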
## Integration Examples

### Python Client
```python
import requests

# Character chunking
response = requests.post("http://localhost:8000/api/v1/documents/process", json={
    "project_id": 1,
    "document_ids": [101],
    "chunk_strategy": "character",
    "character_chunk_size": 500,
    "character_overlap": 50
})

# Semantic chunking
response = requests.post("http://localhost:8000/api/v1/documents/process", json={
    "project_id": 1,
    "document_ids": [101],
    "chunk_strategy": "semantic",
    "semantic_prompt": "Split at natural topic boundaries..."
})
```
### JavaScript Client
```javascript
// Character chunking
const response = await fetch('/api/v1/documents/process', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    project_id: 1,
    document_ids: [101],
    chunk_strategy: 'character',
    character_chunk_size: 500,
    character_overlap: 50
  })
});
```
This API provides comprehensive document chunking capabilities with multiple strategies to handle diverse document processing needs.