Chunk Module GUI Usage Guide
The Compileo GUI provides an intuitive web interface for document chunking with multiple strategies. This guide covers how to use the chunking features through the Streamlit-based web application.
Chunking Strategy Options
graph TD
A[Document Processing Tab] --> B[Select Chunk Strategy]
B --> C{Strategy Choice}
C --> D[Character]
C --> E[Semantic]
C --> F[Schema]
C --> G[Delimiter]
C --> H[Token]
D --> D1[Chunk Size Slider<br/>500-2000 chars]
D --> D2[Overlap Slider<br/>0-500 chars]
E --> E1[Prompt Text Area<br/>Custom AI instructions]
F --> F1[Schema Text Area<br/>JSON rule definition]
G --> G1[Delimiter Multi-Select<br/>Common separators]
H --> H1[Chunk Size Slider<br/>256-2048 tokens]
H --> H2[Overlap Slider<br/>0-512 tokens]
Accessing Document Processing
Navigation
- Launch the Compileo GUI:
streamlit run src/compileo/features/gui/main.py - Navigate to the Document Processing tab in the sidebar
- Select your target project from the dropdown
Chunk Management
The chunk management interface is available in the Document Processing tab under the "๐ View & Manage Chunks" tab.
Features:
- View all chunks for a document with detailed metadata (ID, index, token count)
- Expandable sections with full content preview, displaying chunk indices and token counts (e.g., Chunk 1 (Tokens: ~150)) instead of internal IDs
- Flexible chunk deletion using simple index numbers (no UUIDs required):
- Single chunks: 1
- Multiple chunks: 1, 3, 5
- Ranges: 1-10
- Combinations: 1-5, 8, 10-12
- Bulk document chunk deletion with confirmation
- Refresh button for data reloading
Interface Overview
The Document Processing tab provides three main sections: - Parse Documents Tab: Upload and parse documents to markdown - Configure & Chunk Documents Tab: Configure chunking strategies and process documents - View & Manage Chunks Tab: View, inspect, and delete chunks
Configuration Modes
The Configure & Chunk Documents tab offers two configuration approaches:
Manual Configuration (Default)
- Strategy Selection: Choose from character, semantic, schema, delimiter, or token strategies
- Parameter Configuration: Set strategy-specific parameters manually
- Direct Processing: Immediate chunking with your chosen settings
AI-Assisted Configuration
- Goal Description: Describe your chunking objective (required)
- Document Preview: Browse document content with pagination
- Page-Aware Copy: Extract examples from specific pages
- AI Recommendations: Get intelligent strategy suggestions
- Refined Processing: Apply AI-recommended parameters
AI-Assisted Configuration Workflow
The AI-assisted configuration provides intelligent chunking recommendations based on your document structure and goals.
Streamlined Design Principles
- No Redundancy: Single document preview serves both browsing and example extraction
- Compact Layout: Example collection shown as editable summary, not verbose expandable sections
- Progressive Disclosure: Simple workflow with optional advanced features
- Contextual Actions: Copy buttons work on current page without complex page selection
- Side-by-Side Selection: Document and parsed file selectors in compact single row for efficient navigation
- File-Specific Content: Full parsed file content loading with GUI pagination for large document chunks
- Seamless Switching: Instant content updates when changing between parsed files within a document
Step 1: Describe Your Chunking Goal
Required Field: Describe what you want to achieve with chunking.
Examples of Good Goals: - "Split medical records at patient visit boundaries" - "Divide legal documents at clause and section breaks" - "Chunk research papers at methodology, results, and discussion sections" - "Split technical documentation at feature and API boundaries"
Step 2: Choose Document and Parsed File for Analysis
- Select Document: Choose a parsed document from the dropdown in the compact selection row
- Select Parsed File: Choose a specific parsed file chunk from the automatically loaded dropdown
- Browse Content: Use pagination controls to navigate through the full parsed file content (10,000 characters per page)
- View Metrics: See total characters, words, and lines for the selected parsed file
- Switch Files: Change between different parsed files within the same document seamlessly
- Select Target Documents: Choose which documents to apply the final chunking configuration to
Step 3: Extract Examples with Text Selection
Use the interactive text selection to extract relevant examples from specific pages:
Direct Text Selection
- Click and drag to select any portion of text in the content preview area
- Real-time feedback shows your current selection in a dedicated field
- Header highlighting automatically emphasizes potential headers (markdown # headers, ALL CAPS lines)
- Flexible selection allows choosing any text portion (headers, sections, samples, or custom selections)
Adding Selected Text to Examples
- "Copy Selected Text to Examples" button adds your current selection to the example collection
- Character count shows the size of your selection
- Multiple selections can be added from different pages
- Clear selection button removes current selection without adding to examples
Selection Tips
- Headers first: Start by selecting document headers as they provide the best guidance for AI chunking
- Multiple examples: Collect examples from different parts of the document for better AI understanding
- Precise selection: Select exactly the text you want the AI to learn from
- Context matters: Include surrounding context when selecting examples to show boundary patterns
Step 4: Build Example Collection
- Navigate Pages: Use pagination to browse different sections of the document
- Select Text: Click and drag to select meaningful text portions (focus on headers and boundary patterns)
- Add to Examples: Click "Copy Selected Text to Examples" to add your selection
- Review Collection: See all collected examples in an expandable list
- Remove if Needed: Use individual remove buttons to refine your example collection
Example Collection Summary:
๐ Collected Examples (3 items)
โข Example 1 (47 chars): # Introduction
This chapter covers the basic concepts
โข Example 2 (38 chars): ## Methods
The methodology section explains our approach
โข Example 3 (52 chars): Each chapter begins with a level 1 header followed by content
Step 5: Get AI Recommendations
- Click "Analyze & Recommend": AI analyzes your goal, document content, and collected examples
- Review Recommendations: AI suggests optimal chunking strategy and parameters
- Accept or Modify: Use AI-recommended settings or make adjustments
- Process Documents: Apply the configuration to your selected target documents
AI Recommendation Examples
For Chapter-Based Splitting:
- Strategy: Schema-based
- Rules: [{"type": "pattern", "value": "^# "}, {"type": "delimiter", "value": "\n\n"}]
- Combine: "any"
For Semantic Document Splitting: - Strategy: Semantic - Prompt: Custom prompt based on your goal and examples - Model: Recommended AI model for your content type
Tips for Better AI Recommendations
Goal Description Tips
- Be Specific: "Split at chapter boundaries" vs "Split at every chapter, where chapters start with '# Chapter X:'"
- Include Context: Mention document type (medical, legal, technical)
- Specify Patterns: Reference actual formatting patterns you observe
Example Quality Tips
- Focus on Headers: Select document headers first as they provide the clearest chunking guidance
- Include Context: When selecting text, include surrounding context to show boundary patterns
- Multiple Examples: Collect examples from different parts of the document for comprehensive AI understanding
- Precise Selection: Select exactly the text that demonstrates your desired chunking behavior
- Variety: Show different types of content boundaries and structural elements
Document Selection Tips
- Representative Sample: Choose a document that represents your typical content
- Complete Parsing: Ensure documents are fully parsed before AI analysis
- Content Quality: Use well-structured source documents for better analysis
Character-Based Chunking
Split documents by character count with configurable overlap. Perfect for simple, fast processing.
How to Use
- Select Strategy: Choose "Character" from the chunk strategy dropdown
- Configure Parameters:
- Chunk Size: Use the slider to set characters per chunk (500-2000 recommended)
- Overlap: Set character overlap between chunks (10-20% of chunk size)
- Upload Documents: Drag and drop PDF, DOCX, TXT, MD, CSV, JSON, or XML files
- Choose Parser: Select appropriate document parser (pypdf for PDFs, unstructured for Office docs)
- Process: Click the "Process Documents" button
Example Configuration
Chunk Strategy: Character
Chunk Size: 1000 characters
Overlap: 100 characters
Parser: pypdf
Files: medical_report.pdf
When to Use
- Large Document Sets: Process many documents quickly
- Memory Constraints: Predictable memory usage
- Simple Requirements: When semantic understanding isn't needed
- Testing: Deterministic results for validation
Visual Feedback
- Progress bar shows upload and processing status
- Real-time chunk count updates
- Success confirmation with processing statistics
Semantic Chunking
Use AI to intelligently split documents based on meaning and context with custom prompts. Supports multi-file documents with dynamic cross-file chunking for guaranteed boundary integrity.
โ Advanced Universal Cross-File Support
The GUI automatically handles multi-file documents using universal forwarding logic that ensures semantic coherence across file boundaries:
- Universal Forwarding Rules: All chunking strategies use the same simple rule - if content remains at the end, forward it to the next file
- Strategy-Agnostic Detection: Removed complex per-strategy incomplete chunk detection code
- Automatic Content Spacing: Intelligent space insertion between forwarded content and main content prevents word concatenation
- Memory-Based State Management: Simplified ChunkState object maintains forwarded content between file processing
Automatic Processing: Cross-file chunking is automatically applied to multi-file documents. The system dynamically forwards incomplete chunks as overlap content to subsequent files.
Benefits: - Improved semantic chunking quality at file boundaries - Better search results with reduced duplication - More coherent chunks for AI processing - Simplified architecture with universal forwarding rules - All 5 chunking strategies (character, token, semantic, schema, delimiter) use identical logic
How to Use
- Select Strategy: Choose "Semantic" from the chunk strategy dropdown
- Craft Your Prompt: Use the text area to provide custom instructions for the AI
- Upload Documents: Select documents requiring semantic understanding
- Choose AI Model: Select from available models (Gemini, Grok, Ollama)
- Process: Click "Process Documents"
Prompt Examples
Medical Textbooks (Recommended):
This is a medical textbook that is structured as follows: disease / condition and discussion about it, then another disease / condition and discussion about it. Split should occur at the end of each discussion and before next disease / condition title.
Medical Documents:
Split this medical document at natural section boundaries, ensuring each chunk contains complete clinical information about a single condition, symptom, or treatment. Pay special attention to maintaining the integrity of diagnostic criteria and treatment protocols.
Legal Documents:
Divide this legal document at section boundaries, keeping each complete legal clause, definition, or contractual obligation in a single chunk. Preserve the logical flow of legal arguments and obligations.
Technical Documentation:
Split this technical document at logical boundaries, ensuring each chunk contains complete explanations of single concepts, algorithms, or procedures. Keep code examples with their explanations.
Research Papers:
Divide this research paper at meaningful section boundaries, keeping complete methodologies, results, and discussions intact. Ensure each chunk represents a coherent scientific contribution.
Advanced Prompt Techniques
Role-Playing:
You are a medical editor. Split this clinical document at natural breaks where one medical topic ends and another begins. Ensure each chunk tells a complete clinical story.
Chain-of-Thought:
Think step-by-step: 1) Identify topic changes, 2) Find natural break points, 3) Ensure content completeness, 4) Split at optimal boundaries. Split this document accordingly.
Specific Instructions:
Split at section headers, but also consider content flow. If a section header appears mid-explanation, keep the explanation with its header. Prioritize content coherence over strict header boundaries.
When to Use
- Complex Documents: Content requiring AI understanding
- Domain Expertise: Specialized knowledge needed for splitting
- Quality Requirements: When semantic coherence is critical
- Custom Logic: Unique splitting requirements
Tips for Better Results
- Be Specific: Include domain context and desired behavior
- Provide Examples: Reference specific patterns in your documents
- Test Iteratively: Try different prompts and refine based on results
- Balance Detail: Too vague = poor splits, too specific = inflexible
Schema-Based Chunking
Apply custom rules combining patterns and delimiters for precise document splitting control.
How to Use
- Select Strategy: Choose "Schema" from the chunk strategy dropdown
- Define Schema: Use the text area to create JSON rules for splitting
- Upload Documents: Select documents matching your schema patterns
- Choose Parser: Select parser that preserves your target patterns
- Process: Click "Process Documents"
Schema Structure
{
"rules": [
{
"type": "pattern",
"value": "# [A-Z]+"
},
{
"type": "delimiter",
"value": "\\n\\n"
}
],
"combine": "any"
}
Rule Types
Pattern Rules (Regex)
Match specific text patterns using regular expressions. Note: Regex now supports multiline mode! Use ^ to match the start of any line, not just the start of the document.
^#: Markdown headers at the start of a line
- ^[0-9]+\.: Numbered sections at the start of a line
- <chapter>: XML/HTML tags
- ^[A-Z][A-Z\s]+$: ALL CAPS section headers
Delimiter Rules
Split on exact string matches:
-\n\n: Double line breaks
- ---: Horizontal rules
- <hr>: HTML horizontal rules
- \f: Page breaks
Combine Options
"any": Split when any rule matches
{
"rules": [
{"type": "pattern", "value": "^# "},
{"type": "delimiter", "value": "\n\n"}
],
"combine": "any"
}
"all": Split only when all rules match at the same position
{
"rules": [
{"type": "pattern", "value": "^# "},
{"type": "delimiter", "value": "\n\n"}
],
"combine": "all"
}
Schema Examples
Markdown Documents:
{
"rules": [
{"type": "pattern", "value": "^## "},
{"type": "delimiter", "value": "\n\n"}
],
"combine": "any"
}
Structured Text:
{
"rules": [
{"type": "pattern", "value": "^SECTION [0-9]+"},
{"type": "delimiter", "value": "---"}
],
"combine": "any"
}```
**XML/HTML:**
```json
{
"rules": [
{"type": "pattern", "value": "<div class=\"chapter\">"},
{"type": "pattern", "value": "<section>"}
],
"combine": "any"
}
When to Use
- Structured Documents: Known formatting patterns
- Custom Formats: Proprietary document structures
- Precise Control: Exact splitting behavior required
- Multi-Criteria: Complex rules combining patterns and delimiters
Schema Validation and Auto-Correction
The GUI provides robust validation and automatic correction for schema definitions:
- JSON Syntax: Checks for valid JSON structure.
- Auto-Correction: Automatically fixes common JSON syntax errors in regex patterns (e.g., unescaped \s, \n) and literal control characters, ensuring the schema is always parseable.
- Required Fields: Ensures rules and combine are present.
- Rule Validation: Verifies rule types and values.
- Error Messages: Clear feedback for fixing truly invalid schema issues.
Delimiter-Based Chunking
Simple splitting on specified delimiter strings with size controls. Enhanced with dual-input interface for maximum flexibility.
How to Use
- Select Strategy: Choose "Delimiter" from the chunk strategy dropdown
- Choose Delimiters: Use the dual-input interface:
- Quick Select: Choose from common delimiters with descriptive labels
- Custom Delimiter: Enter any custom delimiter pattern
- Configure Size: Set chunk size and overlap parameters
- Upload Documents: Select documents with clear delimiter patterns
- Process: Click "Process Documents"
Enhanced Delimiter Interface
The delimiter strategy now provides two input methods for maximum flexibility:
Quick Select (Common Delimiters)
Choose from pre-configured delimiters with clear descriptions: - # (Markdown headers): Split at markdown headers (default) - --- (Horizontal rule): Split at horizontal rules - ### (H3 headers): Split at level 3 markdown headers - ## (H2 headers): Split at level 2 markdown headers - \n\n (Double newline): Split at paragraph breaks - \n (Single newline): Split at line breaks - . : Split at sentence endings - ! : Split at exclamation marks - ? : Split at question marks
Custom Delimiter Field
Enter any delimiter pattern you need:
- Markdown headers: #, ##, ###
- HTML/XML tags: <div>, </section>, <chapter>
- Custom patterns: SECTION:, CHAPTER, ---
- Regex patterns: Any string pattern
Smart Combination
- Automatic merging: Combines quick-select and custom delimiters
- Duplicate removal: Eliminates duplicate delimiters while preserving order
- User feedback: Shows selected delimiter count and validation messages
Configuration Options
Chunk Strategy: Delimiter
Quick Select: # (Markdown headers), --- (Horizontal rule)
Custom Delimiter: <section>
Combined Delimiters: ['#', '---', '<section>']
Chunk Size: 1000 characters
Overlap: 100 characters
When to Use
- Simple Structures: Documents with clear separator patterns
- Known Formats: Consistent delimiter usage
- Quick Processing: Fast chunking for basic requirements
- Text Files: Plain text with obvious breaks
Token-Based Chunking
Traditional token counting with overlap for LLM compatibility.
How to Use
- Select Strategy: Choose "Token" from the chunk strategy dropdown
- Configure Parameters:
- Chunk Size: Set token count per chunk (256-2048 recommended)
- Overlap: Set token overlap between chunks (10-25% of chunk size)
- Upload Documents: Select documents for LLM processing
- Choose Parser: Select appropriate document parser
- Process: Click "Process Documents"
Token Estimation
- English Text: ~4 characters per token
- Code: ~2-3 characters per token
- Mathematical: ~1-2 characters per token
Example Configuration
Chunk Strategy: Token
Chunk Size: 512 tokens
Overlap: 50 tokens
Parser: gemini
Files: research_paper.pdf
When to Use
- LLM Input: Prepare documents for language model processing
- API Limits: Respect token limits of AI services
- Embeddings: Create chunks for vector embeddings
- Legacy Compatibility: Existing token-based workflows
Processing Workflow
Step-by-Step Process
- Project Selection
- Choose existing project or create new one
-
Verify project permissions and space
-
Document Upload
- Drag and drop multiple files
- Automatic file type detection
- Progress indicators for large uploads
-
Support for batch processing
-
Strategy Configuration
- Select chunking strategy from dropdown
- Dynamic form fields appear based on selection
- Real-time validation and helpful hints
-
Preview of expected behavior
-
Parser Selection
- Choose appropriate document parser
- Consider document type and complexity
-
Balance speed vs. accuracy
-
Processing Execution
- Click "Process Documents" to start
- Real-time progress monitoring
- Job status updates and estimated completion
-
Error handling and retry options
-
Results Review
- View processing statistics
- Check chunk quality and count
- Download results or proceed to next steps
- Error logs and troubleshooting information
Progress Monitoring
The GUI provides comprehensive progress tracking:
- Upload Progress: File-by-file upload status
- Processing Status: Current operation and completion percentage
- Job Management: Background job tracking with unique IDs
- Error Reporting: Clear error messages with suggested fixes
- Results Summary: Chunk counts, processing time, and quality metrics
Error Handling
Common Issues and Solutions:
- Invalid Schema: Check JSON syntax and required fields
- Unsupported File Type: Verify file format compatibility
- API Key Missing: Configure AI service credentials in Settings
- Large File Timeout: Use character strategy or increase timeout
- Memory Issues: Reduce chunk size or process in batches
Advanced Features
Batch Processing
Process multiple documents simultaneously:
- Upload several files at once
- Configure chunking strategy once
- Process all documents with same settings
- Monitor collective progress
- Review batch results
Strategy Comparison
Test different strategies on the same document:
- Process same document with different strategies
- Compare chunk counts and quality
- Analyze processing time differences
- Choose optimal strategy for your use case
Custom Workflows
Combine chunking with other Compileo features:
- Chunk โ Extract: Use chunks for selective taxonomy extraction
- Chunk โ Generate: Feed chunks to dataset generation
- Chunk โ Quality: Analyze chunk quality metrics
- Chunk โ Benchmark: Test chunking impact on model performance
Best Practices
Strategy Selection Guide
| Document Type | Recommended Strategy | Reasoning |
|---|---|---|
| Simple text files | Character | Fast, predictable, no AI required |
| Markdown docs | Schema | Precise control with header patterns |
| Legal documents | Semantic | Complex clause and obligation boundaries |
| Medical records | Semantic | Clinical context and diagnosis integrity |
| Research papers | Semantic | Methodological and discussion coherence |
| Structured data | Delimiter | Known separators and formatting |
| Code repositories | Token | LLM processing compatibility |
| Mixed content | Schema | Flexible rule combinations |
Performance Optimization
- Large Documents: Use character strategy for speed
- Many Documents: Process in batches to manage memory
- AI Strategies: Choose appropriate models for your needs
- Quality vs Speed: Balance requirements with processing time
Quality Assurance
- Test Small First: Try strategies on sample documents
- Validate Results: Check chunk boundaries and content integrity
- Iterate Prompts: Refine semantic prompts based on results
- Monitor Metrics: Track processing time and chunk quality
Integration with Other Features
Dataset Generation Workflow
- Chunk Documents โ Select appropriate chunking strategy
- Review Chunks โ Verify quality and boundaries
- Generate Dataset โ Use chunks as input for dataset creation
- Quality Check โ Analyze dataset quality metrics
- Iterate โ Refine chunking strategy if needed
Taxonomy Integration
- Process Documents โ Create high-quality chunks
- Generate Taxonomy โ Use chunks for AI taxonomy creation
- Selective Extraction โ Apply taxonomy to chunks
- Refine Taxonomy โ Improve based on extraction results
Benchmarking Connection
- Chunk Documents โ Prepare data for model evaluation
- Generate Datasets โ Create evaluation datasets
- Run Benchmarks โ Test model performance on chunked data
- Analyze Results โ Compare chunking strategy impact
This GUI provides comprehensive document chunking capabilities with an intuitive interface, making advanced chunking strategies accessible to users of all technical levels while maintaining the power and flexibility needed for complex document processing workflows.