Advanced Entity Extraction & Dataset Generation in Compileo GUI
Overview
The Compileo GUI provides a unified interface for entity extraction, relationship inference, and automated Q&A dataset generation. The extraction system turns unstructured text into structured datasets using AI-powered entity recognition and relationship discovery.
Key Capabilities
Entity Extraction
- Dual Extraction Modes: Supports both Named Entity Recognition (NER) and Whole Text Extraction.
- NER: Extracts specific entities (names, terms, concepts) from text chunks.
- Whole Text: Extracts complete relevant text portions classified into taxonomy categories.
- High-Precision Validation: A strict subtractive validation stage programmatically filters out hallucinations and discovery errors.
- Snippet Deduplication: Programmatic deduplication of extracted segments to ensure unique and clean results.
- Flexible Extraction Modes: Choose between Contextual and Document-Wide processing.
- Contextual Extraction: Only extracts from child categories when parent context is present in the text, preventing false positives.
- Document-Wide Extraction: Processes all chunks for selected categories regardless of contextual relevance, maximizing coverage.
- AI-Powered Recognition: Extract specific entities from text using Grok, Gemini, or Ollama models
- Taxonomy Integration: Work with hierarchical taxonomies for precise categorization
- Multi-Category Support: Extract entities across multiple taxonomy categories simultaneously
- Confidence Scoring: Quality assessment for all extracted entities
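The difference between Contextual and Document-Wide extraction can be pictured as a gating step in front of the classifier. The snippet below is a minimal sketch, not Compileo's implementation: the `Category` class, the keyword-based `mentions` check (a crude stand-in for the AI classifier), and the mode strings `"contextual"` and `"document_wide"` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """Hypothetical taxonomy node: a name, trigger keywords, and child categories."""
    name: str
    keywords: list[str]
    children: list["Category"] = field(default_factory=list)


def mentions(category: Category, chunk: str) -> bool:
    """Stand-in for the AI classifier: case-insensitive keyword match."""
    text = chunk.lower()
    return any(keyword.lower() in text for keyword in category.keywords)


def categories_to_extract(parent: Category, chunk: str, mode: str) -> list[str]:
    """Return the child category names to run extraction for on this chunk."""
    if mode == "document_wide":
        # Document-wide: every child is processed for every chunk.
        return [child.name for child in parent.children]
    if mode == "contextual":
        # Contextual: children are considered only when the parent context is present.
        if not mentions(parent, chunk):
            return []
        return [child.name for child in parent.children]
    raise ValueError(f"unknown extraction mode: {mode}")
```

With a parent category such as "Cardiology", a chunk about budgets yields no child categories in contextual mode but all of them in document-wide mode, which is the trade-off between fewer false positives and maximum coverage.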
Relationship Inference
- Automatic Discovery: Identify relationships between co-occurring entities
- Domain Agnostic: Works across medical, business, legal, and technical domains
- Confidence Weighting: Quality scoring for relationship strength
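One simple way to picture relationship discovery between co-occurring entities is pair counting across chunks. The sketch below is an assumption-laden illustration rather than the method Compileo uses: the function name `infer_relationships`, the input shape (one set of entity names per chunk), and the Jaccard-style confidence score are all hypothetical choices.

```python
from collections import Counter
from itertools import combinations


def infer_relationships(chunk_entities: list[set[str]], min_confidence: float = 0.5) -> list[dict]:
    """Score entity pairs by how often they appear in the same chunk.

    Confidence is a Jaccard-style ratio (an illustrative choice):
    chunks containing both entities / chunks containing either entity.
    """
    entity_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for entities in chunk_entities:
        entity_counts.update(entities)
        # Sorting makes (a, b) and (b, a) count as the same pair.
        pair_counts.update(combinations(sorted(entities), 2))

    relationships = []
    for (a, b), both in pair_counts.items():
        either = entity_counts[a] + entity_counts[b] - both
        confidence = both / either
        if confidence >= min_confidence:
            relationships.append({"source": a, "target": b, "confidence": round(confidence, 2)})
    return sorted(relationships, key=lambda rel: rel["confidence"], reverse=True)
```

Because the score depends only on co-occurrence counts, the same approach applies to any domain; what changes is the taxonomy that decides which entities are extracted in the first place.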
Q&A Dataset Generation
- Template-Based Creation: Generate question-answer pairs from entity relationships
- Multiple Formats: Export in JSONL, JSON, and CSV formats
- Context Preservation: Maintain source relationships and metadata
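Template-based creation can be sketched as filling a question template from each relationship and carrying the source metadata along. Everything named below is hypothetical: the `TEMPLATES` wording, the relationship dictionary keys (`source`, `target`, `type`, `confidence`, `chunk_id`), and the `generate_qa_pairs` function are assumptions for illustration, not Compileo's actual templates or schema.

```python
# Hypothetical templates keyed by relationship type.
TEMPLATES = {
    "causal": "What can {source} cause?",
    "associative": "What is {source} associated with?",
}


def generate_qa_pairs(relationships: list[dict]) -> list[dict]:
    """Build one question-answer pair per relationship that has a matching template."""
    pairs = []
    for rel in relationships:
        template = TEMPLATES.get(rel["type"])
        if template is None:
            continue  # No template for this relationship type: skip it.
        pairs.append(
            {
                "question": template.format(source=rel["source"]),
                "answer": rel["target"],
                # Context preservation: keep where the pair came from.
                "metadata": {
                    "relationship_type": rel["type"],
                    "confidence": rel["confidence"],
                    "source_chunk_id": rel["chunk_id"],
                },
            }
        )
    return pairs
```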
Accessing Extraction
- Navigate to the Application: Open Compileo in your web browser
- Select Extraction: Click on "Extraction" in the sidebar
- Choose Operation: Use the three-tab interface for different workflows
Interface Components
Three-Tab Structure
The extraction interface is organized into three main tabs:
Run Extraction Tab
Project & Taxonomy Selection
- Project Selector: Choose from available projects in "Project Name (ID: 123)" format
- Taxonomy Selector: Pick the taxonomy to use for entity categorization
AI Model Configuration
- Extraction Type: Select the type of extraction: Named Entity Recognition (NER) or Whole Text Extraction.
- Extraction Mode: Choose the extraction mode: Contextual Extraction (default) or Document-Wide Extraction.
- Primary Classifier: Select the AI model (Grok, Gemini, Ollama) for the initial classification stage.
- Validation Stage: Optionally enable a second AI model for result validation. This stage is subtractive, meaning it verifies existing findings and removes errors but is strictly forbidden from discovering new categories.
- Validation Classifier: Choose a different AI model for the validation stage if enabled.
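The subtractive property of the validation stage amounts to a set intersection that preserves the primary classifier's output: the validator can remove findings but cannot contribute new ones. The sketch below assumes findings are represented as `(category, entity)` tuples and that the function is called `subtractive_validate`; both are illustrative choices, not the product's internal representation.

```python
Finding = tuple[str, str]  # (category, entity); a hypothetical representation


def subtractive_validate(primary: list[Finding], validator: list[Finding]) -> list[Finding]:
    """Keep only primary findings the validator confirms.

    Anything the validator reports that the primary stage did not find is
    discarded, so validation can shrink the result set but never grow it.
    """
    confirmed = set(validator)
    return [finding for finding in primary if finding in confirmed]
```

In this sketch, a validator that reports a brand-new category has no effect on the output, which mirrors the rule that the validation stage may not discover new categories.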
Taxonomy Tree Selection
- Interactive Tree: Expandable taxonomy hierarchy with checkboxes
- Search Functionality: Filter categories by name or description
- Batch Controls: Expand/collapse all, select/deselect all visible categories
- Selection Summary: Shows count of selected categories for extraction
Extraction Parameters
- Depth Control: How deep in taxonomy hierarchy to extract (1-5 levels)
- Confidence Threshold: Minimum confidence score for results (0.0-1.0)
- Batch Size: Number of chunks to process per batch (1-100)
- Max Chunks: Maximum chunks to process (1-10,000)
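The documented ranges can be captured in a small parameter object, and the interaction between Batch Size and Max Chunks can be shown as a batching plan. This is a sketch under assumptions: the `ExtractionParams` class, its field names, its default values, and the `plan_batches` helper are hypothetical; only the ranges come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionParams:
    """Hypothetical container for the extraction parameters; defaults are assumptions."""
    depth: int = 3
    confidence_threshold: float = 0.5
    batch_size: int = 10
    max_chunks: int = 1000

    def __post_init__(self) -> None:
        # Ranges as documented: depth 1-5, confidence 0.0-1.0,
        # batch size 1-100, max chunks 1-10,000.
        checks = {
            "depth": (self.depth, 1, 5),
            "confidence_threshold": (self.confidence_threshold, 0.0, 1.0),
            "batch_size": (self.batch_size, 1, 100),
            "max_chunks": (self.max_chunks, 1, 10_000),
        }
        for name, (value, low, high) in checks.items():
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def plan_batches(total_chunks: int, params: ExtractionParams) -> list[tuple[int, int]]:
    """Split the chunks to process into half-open (start, end) index ranges."""
    limit = min(total_chunks, params.max_chunks)
    return [
        (start, min(start + params.batch_size, limit))
        for start in range(0, limit, params.batch_size)
    ]
```

For a project with 1,000 chunks, a batch size of 100 and a max-chunks limit of 250 would produce three batches in this sketch, the last one covering only 50 chunks.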
Job Initiation
- Start Extraction: Launch extraction job with configured parameters. The interface will provide in-view synchronous monitoring, allowing you to follow the job's progress through various stages without leaving the page.
- Progress Feedback: Real-time stage-based updates (e.g., "Initializing", "Analyzing Chunks", "Storing Results") ensure you know exactly what the system is doing.
Monitor Jobs Tab
Project Selection
- Project Filter: View jobs for specific projects
Job Dashboard
- Status Overview: Summary metrics for total, running, completed, and failed jobs
- Job Cards: Individual cards for each extraction job with:
- Status Indicators: Color-coded status badges and progress bars
- Timing Information: Created, started, completed timestamps
- Expandable Details: Parameters, error messages, and timing breakdowns
Job Management Actions
- Restart Failed Jobs: Green "Restart" button for failed/cancelled jobs
- Cancel Running Jobs: Red "Cancel" button for pending/running jobs
- View Completed Results: Blue "View Results" button for completed jobs
Browse & Manage Extractions Tab
Enhanced Management Interface
- Search & Filter Controls: Find specific extraction jobs by name, status, or date range
- Bulk Selection: Select multiple jobs for batch operations
- Advanced Filtering: Filter by project, taxonomy, AI model, or extraction type
- Sorting Options: Sort by creation date, completion time, or status
Job Management Actions
- View Results: Access detailed extraction results for completed jobs
- Delete Jobs: Permanently remove extraction jobs and associated data
- Restart Failed Jobs: Retry failed extractions with the same parameters
- Cancel Running Jobs: Stop active extraction processes
Results Organization
- Optimized Data Structure: Results are deduplicated and optimized for downstream processing, focusing on unique text snippets and entities.
- Category-Based Display: Results grouped by taxonomy categories
- Expandable Sections: Each category shows extracted entities with:
- Entity Frequency: How many chunks contain each entity
- Confidence Scores: Quality indicators for extractions
- Source References: Links to original text chunks
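Deduplication and the per-entity frequency shown in the results can be illustrated with a small aggregation step. The record shape (`entity`, `chunk_id`, `confidence`), the whitespace and case normalization, and the choice to keep the highest confidence per entity are all assumptions made for this sketch, not a description of Compileo's storage format.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different spellings match."""
    return " ".join(text.lower().split())


def summarize_entities(extractions: list[dict]) -> list[dict]:
    """Merge duplicate entities and count how many distinct chunks contain each one."""
    grouped: dict[str, dict] = {}
    for item in extractions:
        key = normalize(item["entity"])
        entry = grouped.setdefault(
            key, {"entity": item["entity"].strip(), "chunk_ids": set(), "confidence": 0.0}
        )
        entry["chunk_ids"].add(item["chunk_id"])
        # Illustrative choice: report the best confidence seen for this entity.
        entry["confidence"] = max(entry["confidence"], item["confidence"])
    return [
        {
            "entity": entry["entity"],
            "frequency": len(entry["chunk_ids"]),
            "confidence": entry["confidence"],
            "chunk_ids": sorted(entry["chunk_ids"]),  # source references
        }
        for entry in grouped.values()
    ]
```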
Relationship Analysis
- Relationship Summary: Overview of discovered entity relationships
- Type Distribution: Breakdown by relationship types (associative, causal, etc.)
- Confidence Metrics: Quality assessment of relationship inferences
Q&A Dataset Generation
- One-Click Generation: Automatic Q&A pair creation from relationships
- Sample Preview: View generated questions and answers before export
- Export Options: Download in JSONL, JSON, or CSV formats
- Statistics Display: Generation metrics and quality indicators
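The three export formats differ mainly in layout: JSONL writes one JSON object per line, JSON writes a single array, and CSV writes a flat table. The sketch below uses only the Python standard library; the function names and the `question`/`answer` field names are assumptions for illustration, and the real export schema may contain additional fields.

```python
import csv
import io
import json


def to_jsonl(pairs: list[dict]) -> str:
    """One JSON object per line, convenient for streaming into training pipelines."""
    return "".join(json.dumps(pair, ensure_ascii=False) + "\n" for pair in pairs)


def to_json(pairs: list[dict]) -> str:
    """A single JSON array containing every pair."""
    return json.dumps(pairs, indent=2, ensure_ascii=False)


def to_csv(pairs: list[dict]) -> str:
    """A flat table; nested metadata is dropped because CSV has no nesting."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["question", "answer"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(pairs)
    return buffer.getvalue()
```

Note that in this sketch the CSV export loses any nested metadata, which is one reason to prefer JSONL or JSON when relationship context needs to be preserved.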
Job Status Types
Pending
- Appearance: Status shows "Pending", progress at 0%
- Actions Available: Cancel
- Description: Job is queued and waiting for processing resources
Running
- Appearance: Status shows "Running", progress bar updates in real-time
- Actions Available: Cancel
- Description: Job is actively being processed with live progress updates
Completed
- Appearance: Status shows "Completed", progress at 100%
- Actions Available: View Results
- Description: Job finished successfully with results available
Failed
- Appearance: Status shows "Failed", progress bar shows last completed percentage
- Actions Available: Restart
- Description: Job encountered an error and stopped
Cancelled
- Appearance: Status shows "Cancelled", progress shows completion at cancellation time
- Actions Available: Restart
- Description: Job was manually stopped by user
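The status-to-action rules above form a small lookup table. The sketch below encodes exactly the actions listed for each status; the `JobStatus` enum, the action strings, and the `can_perform` helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Actions available per status, as listed in the status descriptions.
ALLOWED_ACTIONS = {
    JobStatus.PENDING: {"cancel"},
    JobStatus.RUNNING: {"cancel"},
    JobStatus.COMPLETED: {"view_results"},
    JobStatus.FAILED: {"restart"},
    JobStatus.CANCELLED: {"restart"},
}


def can_perform(status: JobStatus, action: str) -> bool:
    """Return True if the action button would be offered for a job in this status."""
    return action in ALLOWED_ACTIONS[status]
```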
Error Handling
The GUI provides intelligent error handling with user-friendly messages:
Common Error Types
- Connection Error: Network issues preventing communication with the server
- Timeout Error: Operations taking longer than expected
- Authentication Error: Session expired, requiring re-login
- Not Found: Requested job or resource doesn't exist
- Rate Limit Exceeded: Too many requests, need to wait
Specific Extraction Errors
- Taxonomy not found: Selected taxonomy doesn't exist
- AI model unavailable: Selected AI service issues
- Database connection failed: Backend database problems
- API rate limit exceeded: External AI service limits reached
- Entity extraction failed: AI parsing or response issues
Workflow Examples
Complete Entity Extraction Workflow
- Setup Project & Taxonomy:
  - Select project containing your documents
  - Choose appropriate taxonomy for entity categorization
- Configure AI Models:
  - Select primary AI model (Grok recommended for accuracy)
  - Optionally enable validation with different AI model
  - Adjust confidence threshold based on use case
- Select Categories:
  - Use taxonomy tree to select relevant categories
  - Search and filter categories as needed
  - Review selection summary before proceeding
- Run Extraction:
  - Set extraction parameters (depth, batch size, max chunks)
  - Click "Start Extraction" to launch job
  - Monitor progress in Monitor Jobs tab
- Review Results:
  - Switch to Browse & Manage Extractions tab
  - Explore extracted entities by category
  - Review relationship discoveries
  - Generate Q&A datasets if needed
Advanced Dataset Generation
- Complete Entity Extraction: Ensure extraction job finishes successfully
- Review Relationships: Check discovered entity associations
- Generate Q&A Pairs: Use one-click generation from relationships
- Preview & Export: Review samples and download in preferred format
- Integration: Use exported datasets in ML training pipelines
Best Practices
Entity Extraction Optimization
- AI Model Selection: Use Grok for highest accuracy, Gemini for speed
- Confidence Thresholds: Start with 0.5, adjust based on domain requirements
- Category Selection: Be specific - fewer, well-chosen categories yield better results
- Batch Size Tuning: Larger batches for homogeneous content, smaller for diverse content
Relationship Inference
- Domain Understanding: Ensure taxonomy reflects real-world relationships
- Quality Validation: Review relationship confidence scores
- Iterative Refinement: Adjust taxonomy based on extraction results
Q&A Dataset Generation
- Template Customization: Review generated Q&A pairs for quality
- Format Selection: Choose JSONL for most ML frameworks
- Context Preservation: Include relationship metadata when possible
- Quality Assurance: Manually review samples before large-scale generation
Integration with Other Features
Document Processing Pipeline
- Upload Documents: Use Documents section to ingest and chunk files
- Create Taxonomy: Build domain-specific taxonomies for categorization
- Run Extraction: Use unified extraction interface for entity discovery
- Generate Datasets: Create training data from extracted entities and relationships
- Quality Analysis: Review extraction metrics and relationship quality
Advanced Analytics
- Monitor Performance: Track extraction accuracy across different domains
- Relationship Mining: Discover patterns in entity associations
- Dataset Quality: Assess generated Q&A pair quality and diversity
- Model Improvement: Use extraction results to refine AI prompts and taxonomies
Troubleshooting
No Entities Found
- Check AI Model: Ensure selected AI model has valid API keys
- Verify Taxonomy: Confirm taxonomy categories are relevant to content
- Adjust Confidence: Lower confidence threshold if too restrictive
- Review Content: Ensure documents contain extractable entities
Poor Relationship Quality
- Taxonomy Refinement: Improve category definitions and relationships
- Content Quality: Ensure documents have clear entity associations
- AI Model Selection: Try different models for better relationship inference
Q&A Generation Issues
- Relationship Quality: Ensure high-confidence relationships exist
- Template Relevance: Customize templates for your domain
- Content Coverage: Verify sufficient entity pairs for generation
Advanced Features
Multi-Model Validation
- Cross-Model Agreement: Use different AI models for validation
- Confidence Boosting: Higher confidence for agreed-upon extractions
- Error Detection: Identify inconsistent AI responses
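Cross-model agreement can be sketched as comparing the entity sets returned by two models: entities both models report receive a boosted confidence, and entities only one model reports are flagged as inconsistent. The function name `cross_validate`, the input shape (entity name mapped to confidence), and the fixed `boost` of 0.1 are assumptions for illustration, not Compileo's scoring rule.

```python
def cross_validate(
    model_a: dict[str, float],
    model_b: dict[str, float],
    boost: float = 0.1,
) -> tuple[dict[str, float], list[str]]:
    """Return (agreed entities with boosted confidence, entities only one model found)."""
    agreed = {}
    for entity in model_a.keys() & model_b.keys():
        mean_confidence = (model_a[entity] + model_b[entity]) / 2
        # Boost agreed-upon extractions, capped at 1.0.
        agreed[entity] = min(1.0, round(mean_confidence + boost, 2))
    # Symmetric difference: reported by exactly one of the two models.
    disputed = sorted(model_a.keys() ^ model_b.keys())
    return agreed, disputed
```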
Batch Processing
- Large-Scale Operations: Process thousands of documents efficiently
- Progress Monitoring: Real-time updates for long-running jobs
- Resource Management: Automatic load balancing across available resources
Custom Templates
- Domain-Specific Q&A: Create templates for specialized domains
- Question Variety: Generate multiple question types from same relationships
- Context Enhancement: Include domain-specific context in generated pairs
This comprehensive GUI transforms complex entity extraction and relationship inference into an accessible, powerful tool for creating high-quality structured datasets from unstructured text. The unified interface eliminates the need for technical expertise while providing advanced capabilities for expert users.
Workflow Examples
Monitoring Active Extractions
- Select Project: Choose the project with active extraction jobs
- View Dashboard: See all jobs with their current status and progress
- Monitor Progress: Watch progress bars update in real-time for running jobs
- Check Details: Expand job cards to see parameters and timing information
Handling Failed Jobs
- Identify Failed Jobs: Look for jobs with "Failed" status
- Review Error Details: Expand the job card to see specific error messages
- Assess Feasibility: Determine if the error is recoverable
- Restart if Appropriate: Click "Restart" button for transient failures
- Monitor Restart: Watch the restarted job progress through the dashboard
Managing Long-Running Jobs
- Identify Long Jobs: Look for jobs running for extended periods
- Check Progress: Monitor if progress is still advancing
- Cancel if Stuck: Use "Cancel" button for jobs that appear stuck
- Restart After Review: Restart cancelled jobs after investigating issues
Best Practices
Job Monitoring
- Regular Checks: Periodically refresh the page to see latest job status
- Progress Tracking: Use progress bars to estimate completion time
- Error Review: Always check error details before restarting failed jobs
- Resource Management: Cancel unnecessary jobs to free up system resources
Error Recovery
- Understand Errors: Read error messages carefully to identify root causes
- Retry Strategically: Only restart jobs where the error appears transient
- Contact Support: For persistent errors, gather error details before contacting support
- Prevent Recurrence: Note patterns in failures to avoid similar issues
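The advice to retry only transient failures can be expressed as a retry loop with exponential backoff. This is a minimal sketch: `TransientJobError`, `restart_with_backoff`, and the `restart_job` callable are hypothetical, standing in for whatever mechanism actually restarts a job and however transient errors (such as rate limits or timeouts) are distinguished from permanent ones.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientJobError(Exception):
    """Hypothetical marker for errors worth retrying (rate limits, timeouts)."""


def restart_with_backoff(
    restart_job: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call restart_job, retrying transient failures with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return restart_job()
        except TransientJobError:
            if attempt == max_attempts:
                raise  # Persistent failure: stop retrying and surface the error.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Any other exception type propagates immediately in this sketch, which matches the guidance to investigate non-transient errors instead of restarting blindly.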
Performance Optimization
- Batch Monitoring: Use the dashboard to monitor multiple jobs simultaneously
- Priority Management: Cancel lower-priority jobs if high-priority work is blocked
- Timing Awareness: Note typical completion times for different job types
- Load Balancing: Distribute large extraction workloads across multiple projects
Integration with Other Features
Document Processing Workflow
- Upload Documents: Use the Documents section to upload and process files
- Create Taxonomy: Build or generate taxonomies for content categorization
- Start Extraction: Initiate extraction jobs from the Run Extraction tab, or through the API or other interfaces
- Monitor Progress: Use the Extraction GUI to track job completion
- View Results: Access extraction results through the completed job interface
Dataset Generation Follow-up
- Complete Extractions: Ensure extraction jobs finish successfully
- Generate Datasets: Use the Dataset section to create training data
- Quality Analysis: Review extraction quality metrics
- Iterate: Restart failed extractions or adjust parameters as needed
Troubleshooting
Jobs Not Appearing
- Check Project Selection: Ensure correct project is selected
- Refresh Page: Use browser refresh to reload job list
- Verify Permissions: Ensure you have access to view jobs in the selected project
Progress Not Updating
- Check Connection: Verify internet connectivity
- Refresh Browser: Hard refresh (Ctrl+F5) to clear cache
- Contact Support: If progress consistently doesn't update
Action Buttons Not Working
- Check Job Status: Buttons only appear for valid state transitions
- Verify Permissions: Ensure you have permissions to modify jobs
- Network Issues: Check connection and retry
Advanced Features
Real-time Updates
The Job Management dashboard provides comprehensive visibility into all active and completed jobs across the platform.
- Manual Refresh: Use the "Refresh Dashboard" button for a clean, non-flickering update of all job metrics and lists.
- In-Place Monitoring: Jobs started from the extraction tabs are monitored synchronously, keeping you informed of their specific status in real-time.
- Accurate Progress: Monitoring focuses on stage-based status messages from the backend, providing a more reliable indicator of work completed than simple percentage bars.
Bulk Operations
While individual job control is available, consider:
- Project Organization: Group related jobs in dedicated projects
- Batch Processing: Use API or CLI for bulk job operations
- Automation: Set up monitoring scripts for large-scale operations
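A monitoring script for large-scale operations usually reduces to a polling loop that waits for a terminal status. Because this guide does not document the API or CLI, the sketch below takes the status lookup as an injected `fetch_status` callable; that callable, the lowercase status strings, and the `wait_for_job` name are assumptions rather than real Compileo interfaces.

```python
import time
from typing import Callable

# Statuses after which a job no longer changes (assumed lowercase spellings).
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[int], str],
    job_id: int,
    poll_interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still '{status}' after {timeout} seconds")
        sleep(poll_interval)
```

In practice `fetch_status` would wrap whichever API or CLI call returns a job's status; keeping it as a parameter lets the loop be tested without a running server.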
Performance Metrics
Track job performance over time:
- Completion Times: Note typical duration for different job types
- Success Rates: Monitor failure frequency and types
- Resource Usage: Observe system impact during peak processing
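Completion times and success rates can be computed from job records such as the timestamps shown on job cards. The record shape below (`status`, `started_at`, `completed_at`) and the decision to count only completed and failed jobs toward the success rate are assumptions made for this sketch.

```python
from datetime import datetime
from statistics import mean
from typing import Optional


def summarize_jobs(jobs: list[dict]) -> dict[str, Optional[float]]:
    """Compute the success rate and mean duration of completed jobs."""
    # Assumption: cancelled, pending, and running jobs are excluded from the rate.
    finished = [job for job in jobs if job["status"] in {"completed", "failed"}]
    completed = [job for job in finished if job["status"] == "completed"]
    durations = [
        (job["completed_at"] - job["started_at"]).total_seconds() for job in completed
    ]
    return {
        "success_rate": len(completed) / len(finished) if finished else None,
        "mean_duration_seconds": mean(durations) if durations else None,
    }
```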
This GUI provides a user-friendly way to manage the complex asynchronous nature of extraction jobs while maintaining full visibility into the processing pipeline.