Skip to content

Extraction CLI Commands

This guide covers the command-line interface for creating and managing extraction jobs in Compileo.

Overview

The extraction CLI provides complete parity with the extraction API, allowing users to create extraction jobs through command-line interface with identical functionality and parameters.

Commands

extraction create

Creates a new extraction job using a taxonomy to extract content from processed documents.

Syntax

compileo extraction create --project-id <project_id> --taxonomy-id <taxonomy_id> --selected-categories "category1,category2" --effective-classifier <model> [options]

Required Parameters

  • --project-id (integer): The ID of the project containing the data
  • --taxonomy-id (integer): The ID of the taxonomy to use for extraction
  • --selected-categories (string): Comma-separated list of category names to extract
  • --initial-classifier (string): The AI model for the initial classification stage (e.g., grok, gemini, ollama, openai).
  • --extraction-type (string, default: "ner"): Type of extraction: 'ner' for named entities, 'whole_text' for complete text portions.

Optional Parameters

  • --extraction-mode (string, default: "contextual"): Extraction mode: 'contextual' or 'document-wide'.
  • --enable-validation-stage: Enable validation stage for improved accuracy
  • --validation-classifier (string): Separate classifier for validation phase
  • --confidence-threshold (float, default: 0.5): Minimum confidence score (0.0-1.0)
  • --max-chunks (integer): Maximum number of chunks to process

Available AI Models

  • Gemini: gemini/gemini-1.5-flash, gemini/gemini-1.5-pro
  • Ollama: ollama/llama3.1, ollama/llama3.2, ollama/mistral
  • Grok: grok/grok-1, grok/grok-beta
  • OpenAI: openai/gpt-4o, openai/gpt-4-turbo, openai/gpt-3.5-turbo

Examples

Basic extraction with Gemini:

compileo extraction create \
  --project-id 1 \
  --taxonomy-id 5 \
  --selected-categories "diagnosis,treatment" \
  --effective-classifier "gemini/gemini-1.5-flash"

Advanced extraction with validation:

compileo extraction create \
  --project-id 1 \
  --taxonomy-id 5 \
  --selected-categories "symptoms,causes,effects" \
  --effective-classifier "ollama/llama3.1" \
  --enable-validation-stage \
  --validation-classifier "gemini/gemini-1.5-pro" \
  --confidence-threshold 0.7 \
  --max-chunks 1000

Limited processing for testing:

compileo extraction create \
  --project-id 1 \
  --taxonomy-id 5 \
  --selected-categories "all" \
  --effective-classifier "gemini/gemini-1.5-flash" \
  --max-chunks 100

Document-wide extraction for maximum coverage:

compileo extraction create \
  --project-id 1 \
  --taxonomy-id 5 \
  --selected-categories "diagnosis,treatment" \
  --effective-classifier "ollama/llama3.1" \
  --extraction-mode document-wide \
  --confidence-threshold 0.3

Output

Success Response:

๐Ÿš€ Creating extraction job for project 1
๐Ÿ“‹ Taxonomy ID: 5
๐Ÿท๏ธ Categories: diagnosis, treatment
๐Ÿค– Classifier: gemini/gemini-1.5-flash

โœ… Extraction job created successfully!
๐Ÿ“‹ Job ID: 12345
๐Ÿ” Monitor progress: compileo jobs poll 12345
๐Ÿ“Š Check status: compileo jobs status 12345

Job Monitoring

After creating an extraction job, you can monitor its progress using the job management commands:

# Check job status
compileo jobs status 12345

# Poll for completion (long-running)
compileo jobs poll 12345

# Stream real-time updates
compileo jobs stream 12345

Error Handling

The CLI provides clear error messages for common issues:

  • Invalid project ID: "Project with ID X not found"
  • Invalid taxonomy ID: "Taxonomy with ID Y not found"
  • Invalid categories: "Category 'invalid' not found in taxonomy"
  • Model unavailable: "AI model 'invalid/model' is not available"
  • Connection errors: "Failed to connect to API server"

Integration with Other Commands

Complete Workflow Example

# 1. Create project
compileo projects create --name "Medical Study" --description "Clinical trial data"

# 2. Upload documents
compileo documents upload --project-id 1 --files "study1.pdf,study2.pdf"

# 3. Parse documents
compileo documents parse --project-id 1 --document-ids "1,2"

# 4. Chunk documents
compileo documents chunk --project-id 1 --document-ids "1,2" --chunker "gemini"

# 5. Generate taxonomy
compileo taxonomy generate --project-id 1 --name "Medical Conditions" --documents "1,2"

# 6. Create extraction job
compileo extraction create \
  --project-id 1 \
  --taxonomy-id 1 \
  --selected-categories "diagnosis,treatment,outcome" \
  --effective-classifier "gemini/gemini-1.5-flash"

# 7. Monitor and retrieve results
compileo jobs status <job_id>
compileo extraction results <job_id>

Configuration

The CLI uses the same configuration as the API and GUI:

  • API Keys: Retrieved from GUI settings database
  • Model Selection: Based on available AI providers
  • Validation: Same parameter validation as API endpoints

Troubleshooting

Common Issues

"Model not available" - Check that the AI model is properly configured in settings - Verify API keys are set for the selected provider - Try a different model from the same provider

"Taxonomy not found" - Verify the taxonomy ID exists for the project - Check that taxonomy generation completed successfully - Use compileo taxonomy list --project-id X to see available taxonomies

"No chunks found" - Ensure documents have been parsed and chunked - Check that chunking completed successfully - Verify project contains processed documents

Getting Help

# Show command help
compileo extraction create --help

# List all extraction commands
compileo extraction --help

# General CLI help
compileo --help

API Parity

The CLI provides complete feature parity with the /api/v1/extraction/ endpoint:

  • Same parameters: All API parameters are supported
  • Same validation: Identical input validation rules
  • Same processing: Uses the same backend extraction pipeline
  • Same results: Produces identical extraction results

This ensures consistent behavior whether using the API, GUI, or CLI interfaces.