Chunk Module API Usage Guide

The Compileo Chunk Module provides a flexible API for document chunking with multiple strategies. This guide covers how to use the chunking API endpoints programmatically.

Chunking Strategy Options

```mermaid
graph TD
    A[Choose Chunking Strategy] --> B{Strategy Type}
    B --> C[Character]
    B --> D[Semantic]
    B --> E[Schema]
    B --> F[Delimiter]
    B --> G[Token]

    C --> C1[chunk_size: int]
    C --> C2[overlap: int]

    D --> D1[semantic_prompt: string]

    E --> E1[schema_definition: string]

    F --> F1[delimiter: string]

    G --> G1[chunk_size: int]
    G --> G2[overlap: int]
```

API Endpoints

Chunk Retrieval

GET /api/v1/chunks/document/{document_id}

Retrieve all chunks for a specific document.

Response:

{
  "document_id": 1,
  "chunks": [
    {
      "id": 1,
      "chunk_index": 1,
      "token_count": 278,
      "file_path": "storage/chunks/1/1/chunk_1.md",
      "content_preview": "# PROGNOSIS\nMost patients respond well...",
      "chunk_strategy": "schema"
    }
  ],
  "total": 5
}

GET /api/v1/chunks/project/{project_id}

Retrieve chunks for all documents associated with a specific project. Useful for validating project-wide processing status.

Query Parameters:

  • limit: Maximum number of chunks to return (default: 100)

Response:

{
  "project_id": "b357e573-89a5-4b40-8e1b-4c075a1835a6",
  "chunks": [
    {
      "id": "uuid-1",
      "document_id": "uuid-doc-1",
      "chunk_index": 1,
      "chunk_strategy": "semantic",
      "file_path": "storage/chunks/..."
    }
  ],
  "total": 1
}
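
Both retrieval endpoints can be called with any HTTP client. A minimal sketch using Python's requests library; the base URL and IDs are placeholders for your own deployment:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder; adjust to your deployment

# Fetch all chunks for one document
doc_resp = requests.get(f"{BASE_URL}/api/v1/chunks/document/1")
doc_resp.raise_for_status()
print(doc_resp.json()["total"], "chunks for document 1")

# Fetch chunks across a whole project, capped by the limit query parameter
proj_resp = requests.get(
    f"{BASE_URL}/api/v1/chunks/project/b357e573-89a5-4b40-8e1b-4c075a1835a6",
    params={"limit": 50},
)
proj_resp.raise_for_status()
for chunk in proj_resp.json()["chunks"]:
    print(chunk["document_id"], chunk["chunk_index"], chunk["chunk_strategy"])
```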

Chunk Deletion

DELETE /api/v1/chunks/{chunk_id}

Delete a specific chunk.

Response:

{
  "message": "Chunk 1 deleted successfully"
}

DELETE /api/v1/chunks/batch

Delete multiple chunks by their IDs. The request body carries a list of chunk IDs, making it suitable for bulk operations.

Request Body:

{
  "chunk_ids": [1, 2, 3, 5, 7, 10]
}

Response:

{
  "message": "Successfully deleted 6 chunks",
  "deleted_count": 6
}
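
A minimal sketch of a batch deletion call with Python's requests library, assuming a local deployment (note that the chunk IDs travel in the DELETE request body, which requests passes through via the json keyword):

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder base URL

resp = requests.delete(
    f"{BASE_URL}/api/v1/chunks/batch",
    json={"chunk_ids": [1, 2, 3, 5, 7, 10]},  # IDs to remove in one call
)
resp.raise_for_status()
print(resp.json()["deleted_count"], "chunks deleted")
```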

DELETE /api/v1/chunks/document/{document_id}

Delete all chunks for a document.

Response:

{
  "message": "Deleted 5 chunks for document 1",
  "deleted_count": 5
}

Document Processing

POST /api/v1/documents/process

Process documents with specified chunking strategy and parameters.

AI-Assisted Analysis

POST /api/v1/documents/analyze-chunking

Get AI recommendations for optimal chunking strategy based on document analysis.

AI Analysis Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| document_id | integer | Yes | ID of document to analyze |
| goal | string | Yes | Description of chunking objective |
| examples | array | No | List of example strings from document |
| model | string | No | AI model for analysis (gemini, grok, ollama) |

AI Analysis Example Request

{
  "document_id": 101,
  "goal": "Split the document at every chapter, but each chapter has a different name and format",
  "examples": [
    "Page 1: Headers: # Chapter 1: Introduction",
    "Page 3: Section: This chapter provides an overview...",
    "Page 5: Selected: Each new chapter starts with a level 1 header"
  ],
  "model": "gemini"
}

AI Analysis Response

{
  "recommended_strategy": "schema",
  "parameters": {
    "json_schema": "{\"rules\": [{\"type\": \"pattern\", \"value\": \"^# \"}, {\"type\": \"delimiter\", \"value\": \"\\n\\n\"}], \"combine\": \"any\"}",
    "explanation": "Schema-based chunking recommended for consistent chapter header patterns"
  },
  "confidence": 0.85,
  "alternative_strategies": [
    {
      "strategy": "semantic",
      "parameters": {"custom_prompt": "Split at chapter boundaries..."},
      "confidence": 0.72
    }
  ]
}
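
One way to use the analysis endpoint is to feed its recommendation straight into a processing request. The sketch below assumes the recommended parameters map onto the /documents/process fields (for a "schema" recommendation, the json_schema value is passed as schema_definition; this mapping is an assumption, not documented behavior):

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder

analysis = requests.post(
    f"{BASE_URL}/api/v1/documents/analyze-chunking",
    json={
        "document_id": 101,
        "goal": "Split the document at every chapter",
        "model": "gemini",
    },
).json()

# Build a processing request from the recommendation.
process_request = {
    "project_id": 1,
    "document_ids": [101],
    "chunk_strategy": analysis["recommended_strategy"],
}
if analysis["recommended_strategy"] == "schema":
    # Assumed mapping: json_schema from the analysis -> schema_definition in the process request
    process_request["schema_definition"] = analysis["parameters"]["json_schema"]

job = requests.post(f"{BASE_URL}/api/v1/documents/process", json=process_request).json()
print(job["job_id"], job["message"])
```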

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| project_id | integer | Yes | ID of the project containing documents |
| document_ids | array | Yes | List of document IDs to process |
| parser | string | No | Document parser (gemini, grok, ollama, pypdf, unstructured, huggingface, novlm) |
| chunk_strategy | string | No | Chunking strategy (token, character, semantic, delimiter, schema) |
| chunk_size | integer | No | Chunk size (tokens for token strategy, characters for character strategy) |
| overlap | integer | No | Overlap between chunks |
| num_ctx | integer | No | Context window size for Ollama models (overrides default setting) |
| semantic_prompt | string | No | Custom prompt for semantic chunking |
| schema_definition | string | No | JSON schema for schema-based chunking |
| character_chunk_size | integer | No | Character chunk size (overrides chunk_size) |
| character_overlap | integer | No | Character overlap (overrides overlap) |
| sliding_window | boolean | No | Enable sliding window chunking for multi-file documents (auto-enabled for multi-file docs) |
| system_instruction | string | No | System-level instructions to guide the model's behavior, especially for Gemini |

Character-Based Chunking

Split documents by character count with configurable overlap. Fast and deterministic.

Example Request

{
  "project_id": 1,
  "document_ids": [101, 102],
  "parser": "pypdf",
  "chunk_strategy": "character",
  "character_chunk_size": 500,
  "character_overlap": 50
}
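
To make the chunk_size and overlap semantics concrete, here is a local illustration of fixed-size character splitting with overlap. It is a sketch of the idea only, not the server's implementation:

```python
def character_chunks(text: str, chunk_size: int = 500, overlap: int = 50):
    """Yield fixed-size character windows, each overlapping the previous one."""
    step = chunk_size - overlap  # assumes overlap < chunk_size
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]

sample = "A" * 1200
chunks = list(character_chunks(sample, chunk_size=500, overlap=50))
print([len(c) for c in chunks])  # [500, 500, 300]; consecutive chunks share 50 characters
```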

Use Cases

  • Fixed-size text processing
  • Memory-constrained environments
  • Deterministic chunking results
  • Simple document structures

Semantic Chunking

Use AI to intelligently split documents based on meaning and context. Supports multi-file documents with dynamic cross-file chunking for semantic coherence.

Simplified Universal Cross-File Document Support

The API automatically handles multi-file documents using universal forwarding logic that ensures semantic coherence across file boundaries:

  • Universal Forwarding Rules: All chunking strategies use the same simple rule - if content remains at the end, forward it to the next file
  • Strategy-Agnostic Detection: Removed complex per-strategy incomplete chunk detection code
  • Automatic Content Spacing: Intelligent space insertion between forwarded content and main content prevents word concatenation
  • Memory-Based State Management: Simplified ChunkState object maintains forwarded content between file processing

Automatic Processing: Cross-file chunking is automatically applied to multi-file documents. The system dynamically forwards incomplete chunks as overlap content to subsequent files.

Benefits:

  • Improved semantic chunking quality at file boundaries
  • Better search results with reduced duplication
  • More coherent chunks for AI processing
  • Simplified architecture with universal forwarding rules
  • All 5 chunking strategies (character, token, semantic, schema, delimiter) use identical logic

Example Request

```json
{
  "project_id": 1,
  "document_ids": [101],
  "parser": "ollama",
  "chunk_strategy": "semantic",
  "semantic_prompt": "This is a medical textbook that is structured as follows: disease / condition and discussion about it, then another disease / condition and discussion about it. Split should occur at the end of each discussion and before next disease / condition title.",
  "num_ctx": 4096
}
```

### Prompt Examples

**General Purpose (Recommended for all models including Gemini):**

```
You are an expert document analysis tool. Your task is to split a document into logical chunks based on the user's instruction. You will be given an instruction and the document text. You must identify the exact headings or titles that mark the beginning of a new chunk according to the instruction.

User Instruction:

Output Requirements:
- Return ONLY a comma-separated list of the exact heading strings that should start a new chunk.
- Do not include any other text, explanations, or formatting.
- Each heading should be exactly as it appears in the document.

Example: If the instruction is "Split by chapter" and the text contains "# Chapter 1" and "# Chapter 2", your output should be:

# Chapter 1,# Chapter 2

Document to analyze:
```

**Medical Textbooks (Example of specific user_instruction):**

This is a medical textbook that is structured as follows: disease / condition and discussion about it, then another disease / condition and discussion about it. Split should occur at the end of each discussion and before next disease / condition title.

**General Medical Documents (Example of specific user_instruction):**

Split this medical document at natural section boundaries, ensuring each chunk contains complete clinical information about a single condition, symptom, or treatment.

**Legal Documents (Example of specific user_instruction):**

Divide this legal document at section boundaries, keeping each complete legal clause, definition, or contractual obligation in a single chunk.

**Technical Documentation (Example of specific user_instruction):**

Split this technical document at logical boundaries, ensuring each chunk contains complete explanations of single concepts, algorithms, or procedures.
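
Assuming the model answers with the comma-separated heading list described in the general-purpose prompt, turning that answer into chunks can be sketched as follows. This is illustrative only; the server performs this step internally, and the function name is hypothetical:

```python
def split_at_headings(document: str, model_output: str) -> list[str]:
    """Split a document into chunks, each beginning at one of the returned headings."""
    headings = [h.strip() for h in model_output.split(",") if h.strip()]
    positions = []
    for heading in headings:
        idx = document.find(heading)  # first occurrence only; enough for a sketch
        if idx != -1:
            positions.append(idx)
    positions.sort()
    if not positions or positions[0] != 0:
        positions = [0] + positions  # keep any preamble before the first heading
    boundaries = positions + [len(document)]
    return [document[a:b] for a, b in zip(boundaries, boundaries[1:])]

doc = "# Chapter 1\nIntro text...\n# Chapter 2\nMore text..."
for chunk in split_at_headings(doc, "# Chapter 1,# Chapter 2"):
    print(repr(chunk[:20]))
```
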
### Use Cases
- Complex document structures
- Meaning preservation
- Context-aware splitting
- Domain-specific requirements
- Medical textbooks and clinical documents
- Multi-file document processing

### Ollama Context Window Configuration

When using Ollama models for semantic chunking, you can control the context window size:

- **`num_ctx`**: Specifies the maximum context length in tokens for Ollama models
- **Default**: Uses the value configured in GUI settings (typically 60000 tokens)
- **Override**: API parameter takes precedence over settings default
- **Performance**: Smaller values reduce memory usage but may limit complex analysis
- **Compatibility**: Only applies to Ollama models; ignored for Gemini/Grok

### Recent Improvements

**Simplified Universal Cross-File Chunking Architecture:**
- **Universal Forwarding Rules**: All chunking strategies use the same simple rule - if content remains at end, forward it to next file
- **Strategy-Agnostic Detection**: Removed 50+ lines of complex per-strategy incomplete chunk detection code
- **Automatic Content Spacing**: Intelligent space insertion between forwarded content and main content prevents word concatenation
- **Memory-Based State Management**: Simplified ChunkState object maintains forwarded content between file processing
- **All Strategies Unified**: Character, token, semantic, schema, and delimiter strategies use identical cross-file logic

**Enhanced Content Processing:**
- **Intelligent Spacing**: Automatic space insertion prevents issues like "glutendamages" → "gluten damages"
- **Simplified Architecture**: Single forwarding mechanism instead of strategy-specific code
- **Memory Efficient**: No duplicate content storage or complex overlap calculations
- **Universal Compatibility**: Works identically across all 5 chunking strategies

**Streamlined Implementation:**
- **Removed Strategy-Specific Code**: Eliminated complex per-strategy incomplete chunk detection logic
- **Dynamic Overlap Generation**: Overlap created naturally during chunking, not pre-computed
- **Simplified Data Structures**: Clean content processing with automatic forwarding
- **Improved Performance**: Reduced complexity and memory usage in cross-file processing

**Enhanced Quality Assurance:**
- **Comprehensive Testing**: Verified all chunking strategies work correctly with cross-file processing
- **Real-World Validation**: Tested on medical documents with proper semantic coherence
- **Spacing Integrity**: Automatic prevention of word concatenation across file boundaries
- **Universal Logic**: Same forwarding rules apply to all strategies regardless of complexity

## Dynamic Cross-File Chunking

Advanced chunking method that dynamically generates overlap content during processing, ensuring semantic coherence across file boundaries. Automatically applied to multi-file documents with guaranteed boundary integrity.

### How It Works

1. **Sequential File Processing**: Files are processed one by one in order
2. **Universal Forwarding Logic**: All chunking strategies use the same simple rule - if content remains at the end, forward it to the next file
3. **Automatic Content Spacing**: Intelligent space insertion between forwarded content and main content prevents word concatenation
4. **Memory-Based State Management**: Simplified ChunkState object maintains forwarded content between file processing
5. **Strategy Transparency**: Chunking engines unaware of cross-file logic - they process complete content

### Processing Flow
```
File 1 Processing:
├── Apply chunking strategy to main_content
├── Create complete chunks + leftover content
├── Forward leftover → File 2's overlap_content

File 2 Processing:
├── Combine: overlap_from_file1 + separator + main_content
├── Apply chunking strategy to combined content
├── Create complete chunks + new leftover content
├── Forward new leftover → File 3's overlap_content

File N Processing:
├── Combine: overlap_from_prev + separator + main_content
├── Apply chunking strategy
├── Create final complete chunks
```

### Universal Forwarding Mechanism

The architecture ensures **100% boundary integrity** with simplified logic:

- **Parsing**: Creates clean `main_content` without overlap assumptions
- **First File**: Processes from natural document start
- **Subsequent Files**: Receive overlap as guaranteed start boundary
- **All Strategies**: Use identical forwarding rules regardless of chunking method
- **Automatic Spacing**: Prevents word concatenation (e.g., "glutendamages" → "gluten damages")
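
The forwarding loop itself is simple. Below is a sketch of the idea: ChunkState is named in this guide, but its field, the helper names, and the choice of a single space as separator are illustrative assumptions, and treating each file's final chunk as the leftover to forward is a simplification of "content remaining at the end".

```python
from dataclasses import dataclass

@dataclass
class ChunkState:
    forwarded: str = ""  # leftover content carried into the next file (assumed field name)

def chunk_multi_file(files: list[str], chunk_fn) -> list[str]:
    """Apply any chunking strategy across file boundaries with universal forwarding."""
    state = ChunkState()
    all_chunks: list[str] = []
    for i, main_content in enumerate(files):
        # Prepend forwarded content; the separator prevents word concatenation
        # (e.g. "gluten" + "damages" must not become "glutendamages").
        text = (state.forwarded + " " + main_content) if state.forwarded else main_content
        chunks = chunk_fn(text)
        if i < len(files) - 1 and chunks:
            # Universal rule: whatever remains at the end is forwarded, not emitted yet.
            state.forwarded = chunks.pop()
        else:
            state.forwarded = ""
        all_chunks.extend(chunks)
    return all_chunks

# Example: paragraph-delimiter chunking applied across two files
files = [
    "First topic.\n\nSecond topic starts here and",
    "continues in the next file.\n\nThird topic.",
]
print(chunk_multi_file(files, lambda t: t.split("\n\n")))
```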

### Example Request

```json
{
  "project_id": 1,
  "document_ids": [101, 102, 103],
  "parser": "gemini",
  "chunk_strategy": "semantic",
  "sliding_window": true
}
```

Window Processing Structure

Each window contains structured content for AI processing:

{
  "content_type": "sliding_window_chunk",
  "main_content": "# Chapter 3: Advanced Topics\n\nThis chapter covers...",
  "overlap_content": "# Chapter 2 Conclusion\n\nIn summary, the basic concepts...",
  "metadata": {
    "window_size": 2500,
    "overlap_tokens": 400,
    "total_tokens": 2900
  }
}

Use Cases

  • Large Multi-File Documents: PDFs split into multiple parts
  • Cross-File Continuity: Topics spanning artificial file boundaries
  • Semantic Coherence: Maintaining context across pagination breaks
  • Quality Improvement: Better search results and AI processing

Configuration Options

  • Automatic Detection: Applied automatically to documents with multiple parsed files

Schema-Based Chunking

Apply custom rules combining patterns and delimiters for precise control.

Schema Format

The API automatically attempts to fix common JSON syntax errors in regex patterns (e.g., unescaped \s, \n) and literal control characters, but it is best practice to provide a fully escaped JSON string.

Regex Support: Schema strategies now support re.MULTILINE, allowing the use of ^ anchors to match the start of lines within a document.

{
  "rules": [
    {
      "type": "pattern",
      "value": "# [A-Z\\\\s]+"
    },
    {
      "type": "delimiter",
      "value": "\\\\n\\\\n"
    }
  ],
  "combine": "any"
}

Combine Options

  • "any": Split when any rule matches
  • "all": Split only when all rules match at the same position

Example Request

{
  "project_id": 1,
  "document_ids": [101],
  "parser": "unstructured",
  "chunk_strategy": "schema",
  "schema_definition": "{\"rules\": [{\"type\": \"pattern\", \"value\": \"^## \"}, {\"type\": \"delimiter\", \"value\": \"\\n\\n\"}], \"combine\": \"any\"}"
}

Rule Types

Pattern Rules:

  • Use regex patterns to match specific text structures
  • Supports re.MULTILINE mode (use ^ to match start of line)
  • Examples: "^# ", "^[0-9]+\.", "<chapter>"

Delimiter Rules:

  • Split on exact string matches
  • Examples: "\n\n", "<hr>", "---"

Use Cases

  • Structured documents with known patterns
  • Custom document formats
  • Precise control requirements
  • Multi-criteria splitting

Delimiter-Based Chunking

Simple splitting on specified delimiter strings with enhanced flexibility for multiple delimiters.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| delimiters | array | No | List of delimiter strings to split on (default: ["\n\n", "\n"]) |
| chunk_size | integer | No | Maximum chunk size in characters |
| overlap | integer | No | Overlap between chunks in characters |

Example Request

{
  "project_id": 1,
  "document_ids": [101],
  "parser": "pypdf",
  "chunk_strategy": "delimiter",
  "delimiters": ["#", "\n\n", "---"],
  "chunk_size": 1000,
  "overlap": 100
}
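
A local sketch of what splitting on multiple delimiters means. The server additionally enforces chunk_size and overlap; this shows only the delimiter pass, and the helper name is hypothetical:

```python
import re

def split_on_delimiters(text: str, delimiters: list[str]) -> list[str]:
    """Split text wherever any of the delimiter strings occurs."""
    pattern = "|".join(re.escape(d) for d in delimiters)
    parts = re.split(pattern, text)
    return [p.strip() for p in parts if p.strip()]

text = "# Title\nIntro paragraph.\n\nDetails follow.\n---\nAppendix."
print(split_on_delimiters(text, ["#", "\n\n", "---"]))
# ['Title\nIntro paragraph.', 'Details follow.', 'Appendix.']
```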

Delimiter Examples

Markdown Headers:

{
  "delimiters": ["#", "##", "###"]
}

Mixed Delimiters:

{
  "delimiters": ["\n\n", "---", "<hr>"]
}

Custom Patterns:

{
  "delimiters": ["SECTION:", "CHAPTER", "<div class=\"chapter\">"]
}

Use Cases

  • Simple document structures
  • Known separator patterns
  • Quick processing needs
  • Markdown document chunking
  • Custom delimiter patterns

Token-Based Chunking

Precise token counting using the tiktoken library, with configurable overlap between chunks. Requires the tiktoken package to be installed.

Example Request

{
  "project_id": 1,
  "document_ids": [101],
  "parser": "grok",
  "chunk_strategy": "token",
  "chunk_size": 512,
  "overlap": 50
}
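
A sketch of token-window chunking with tiktoken, assuming the cl100k_base encoding (the module may use a different tokenizer):

```python
import tiktoken

def token_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into chunk_size-token windows, each overlapping the previous by `overlap` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    step = chunk_size - overlap  # assumes overlap < chunk_size
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

chunks = token_chunks("Some long document text ... " * 200)
print(len(chunks), "chunks")
```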

Error Handling

Token chunking will fail explicitly if:

  • tiktoken library is not installed
  • Invalid tokenizer model specified
  • Strategy creation fails for any reason

Error Response:

{
  "detail": "Failed to create token-based chunking strategy: No module named 'tiktoken'. Token-based chunking requires tiktoken library."
}

Use Cases

  • LLM input preparation with exact token limits
  • Token-aware processing for API constraints
  • Precise semantic chunking based on token boundaries

Response Format

Success Response

{
  "job_id": "chunk_job_12345",
  "message": "Successfully processed 1 documents, created 5 chunks",
  "processed_documents": 1,
  "total_chunks": 5,
  "estimated_duration": "Completed",
  "debug_info": {
    "total_requested": 1,
    "project_id": 1,
    "parser": "pypdf",
    "chunk_strategy": "character",
    "character_chunk_size": 500,
    "character_overlap": 50
  }
}

Job Status Checking

GET /api/v1/documents/process/{job_id}/status

{
  "status": "completed",
  "result": {
    "processed_documents": 1,
    "total_chunks": 5
  }
}
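
Because processing runs as a job, clients typically poll the status endpoint until it finishes. A minimal polling sketch; the "completed" status comes from the example above, while a "failed" terminal state is an assumption:

```python
import time
import requests

BASE_URL = "http://localhost:8000"  # placeholder

def wait_for_job(job_id: str, poll_seconds: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll the job status endpoint until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = requests.get(f"{BASE_URL}/api/v1/documents/process/{job_id}/status").json()
        if status["status"] in ("completed", "failed"):  # "failed" is an assumed terminal state
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")

result = wait_for_job("chunk_job_12345")
print(result["status"], result.get("result"))
```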

Error Handling

Common Errors

400 Bad Request:

{
  "detail": "Invalid chunk_strategy. Must be one of: token, character, semantic, delimiter, schema"
}

422 Validation Error:

{
  "detail": [
    {
      "loc": ["body", "character_chunk_size"],
      "msg": "ensure this value is greater than 0",
      "type": "value_error.const"
    }
  ]
}

Best Practices

Strategy Selection

  • Character: For simple, fast processing
  • Semantic: For complex documents requiring AI understanding
  • Schema: For structured documents with known patterns
  • Delimiter: For simple separator-based splitting
  • Token: For LLM-specific token limits

Performance Optimization

  • Use character strategy for large volumes
  • Batch multiple documents together
  • Choose appropriate parsers for document types
  • Monitor job status for long-running processes

Error Prevention

  • Validate schema JSON before submission (a quick check is sketched below)
  • Test prompts with sample documents
  • Use appropriate chunk sizes for your use case
  • Monitor API rate limits for AI strategies
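
For the first point above, a quick client-side check that the schema_definition string is valid JSON before submitting it. This is a sketch; the API also applies its own fixes and validation:

```python
import json

schema_definition = "{\"rules\": [{\"type\": \"pattern\", \"value\": \"^## \"}], \"combine\": \"any\"}"

try:
    schema = json.loads(schema_definition)
    assert isinstance(schema.get("rules"), list), "schema must contain a 'rules' list"
except (json.JSONDecodeError, AssertionError) as err:
    raise SystemExit(f"Invalid schema_definition: {err}")

print("schema OK:", schema["combine"])
```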

Integration Examples

Python Client

import requests

# Character chunking
response = requests.post("http://localhost:8000/api/v1/documents/process", json={
    "project_id": 1,
    "document_ids": [101],
    "chunk_strategy": "character",
    "character_chunk_size": 500,
    "character_overlap": 50
})

# Semantic chunking
response = requests.post("http://localhost:8000/api/v1/documents/process", json={
    "project_id": 1,
    "document_ids": [101],
    "chunk_strategy": "semantic",
    "semantic_prompt": "Split at natural topic boundaries..."
})

JavaScript Client

// Character chunking
const response = await fetch('/api/v1/documents/process', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    project_id: 1,
    document_ids: [101],
    chunk_strategy: 'character',
    character_chunk_size: 500,
    character_overlap: 50
  })
});

This API provides comprehensive document chunking capabilities with multiple strategies to handle diverse document processing needs.