Dataset Generation in Compileo API
Overview
The Compileo API provides comprehensive endpoints for generating high-quality datasets from processed document chunks. The dataset generation process supports multiple AI models, taxonomy integration, and various output formats with optional quality analysis and benchmarking.
Base URL: /api/v1
Key Features
- Multi-Model Support: Choose from Gemini, Grok, or Ollama for parsing, chunking, and classification
- Taxonomy Integration: Generate datasets aligned with existing taxonomies
- Flexible Output: JSONL and Parquet formats supported
- Quality Assurance: Optional quality analysis with configurable thresholds
- Performance Benchmarking: Optional AI model benchmarking after generation
- Asynchronous Processing: All generation runs as background jobs with real-time monitoring
API Endpoints
1. Generate Dataset
Submits a dataset generation job with comprehensive configuration options.
- Endpoint: POST /datasets/generate
- Description: Initiates dataset generation from processed chunks with full configuration control
- Request Body:
{ "project_id": 1, "data_source": "Chunks Only", "prompt_name": "default", "custom_prompt": "Optional custom prompt content", "selected_categories": ["category1", "category2"], "generation_mode": "default", "format_type": "jsonl", "concurrency": 1, "batch_size": 50, "include_evaluation_sets": false, "taxonomy_project": null, "taxonomy_name": null, "output_dir": ".", "analyze_quality": true, "quality_threshold": 0.7, "enable_versioning": false, "dataset_name": null, "run_benchmarks": false, "benchmark_suite": "glue", "parsing_model": "gemini", "chunking_model": "gemini", "classification_model": "gemini", "datasets_per_chunk": 3, "only_validated": false } -
- Parameters:
  - project_id (integer, required): Project containing processed chunks
  - data_source (string): Data source mode - "Chunks Only", "Taxonomy", or "Extract" (default: "Chunks Only")
  - extraction_file_id (string, optional): Specific extraction job ID (UUID) when data_source is "Extract". Use /api/v1/datasets/extraction-files/{project_id} to list available extraction jobs.
  - selected_categories (array of strings, optional): Category names to filter by when data_source is "Extract"
  - prompt_name (string): Name of the prompt template to use (default: "default")
  - custom_prompt (string, optional): Custom prompt content for generation
  - generation_mode (string): Generation mode - "default", "instruction following", "question and answer", "question", "answer", or "summarization"
  - format_type (string): Output format - "jsonl", "parquet", or a plugin format (e.g., "anki")
  - concurrency (integer): Number of parallel processing threads
  - batch_size (integer): Number of chunks to process per batch (0 = all at once)
  - include_evaluation_sets (boolean): Generate train/validation/test splits
  - taxonomy_project (string, optional): Project name containing the taxonomy
  - taxonomy_name (string, optional): Name of the taxonomy to align with
  - output_dir (string): Output directory path
  - analyze_quality (boolean): Enable quality analysis
  - quality_threshold (float): Quality threshold for pass/fail (0-1)
  - enable_versioning (boolean): Enable dataset versioning
  - dataset_name (string, optional): Name for versioned datasets
  - run_benchmarks (boolean): Run AI model benchmarks after generation
  - benchmark_suite (string): Benchmark suite - "glue", "superglue", "mmlu", or "medical"
  - parsing_model (string): AI model for document parsing
  - chunking_model (string): AI model for text chunking
  - classification_model (string): AI model for content classification
  - datasets_per_chunk (integer): Maximum datasets generated per text chunk
  - only_validated (boolean): Only include extraction results that have passed validation
Data Source Modes:
- "Chunks Only": Uses raw text chunks directly from processed documents. No taxonomy or extraction filtering required. Best for basic dataset generation from any content.
- "Taxonomy": Applies taxonomy definitions to enhance generation prompts. Works with all chunks in the project (no extraction dependency). Adds domain-specific context and terminology. This mode strictly bypasses extraction data loading for efficiency.
- "Extract": Uses extracted entities as the primary content source. Generates datasets focused on specific concepts/entities. Creates educational content about extracted terms.
Batch Processing:
- batch_size controls memory usage by processing chunks in smaller groups
- Set to 0 to process all chunks at once (legacy behavior)
- Smaller batches reduce memory usage but may take longer
- Each batch creates a separate output file: dataset_[job_id]_batch_[N].[format]. Parquet exports are saved as binary .parquet files, while JSON/JSONL are saved as UTF-8 text files.
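For example, a minimal sketch (Python; the helper itself is illustrative, only the naming pattern comes from the documentation) that reconstructs the expected batch output file names for a job:

# Illustrative helper following the documented pattern
# dataset_[job_id]_batch_[N].[format]; not part of the API itself.
def batch_file_names(job_id: str, total_batches: int, fmt: str = "jsonl") -> list[str]:
    return [f"dataset_{job_id}_batch_{n}.{fmt}" for n in range(total_batches)]

print(batch_file_names("dataset-gen-uuid-123", 3))
# ['dataset_dataset-gen-uuid-123_batch_0.jsonl',
#  'dataset_dataset-gen-uuid-123_batch_1.jsonl',
#  'dataset_dataset-gen-uuid-123_batch_2.jsonl']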
Generation Mode Options:
- "default": Generates content based on the provided prompt name or custom prompt.
- "instruction following" (Recommended): Generates modern instruction-response pairs following Alpaca/Dolly standards for advanced AI fine-tuning
- "question and answer": Creates traditional Q&A pairs from document content
- "question": Generates questions without corresponding answers
- "answer": Generates answers without explicit questions
- "summarization": Creates concise summaries of document content
- Success Response (200 OK):
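On success, the API returns the job identifier used for status monitoring (the same shape shown in the Integration with Job Management section below):

{
  "job_id": "dataset-gen-uuid-789",
  "message": "Dataset generation started"
}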
2. Generate Evaluation Dataset
Creates comprehensive evaluation datasets with train/validation/test splits.
- Endpoint: POST /datasets/generate-evaluation
- Description: Generates evaluation-ready datasets with multiple splits and quality analysis
- Request Body: Same as /datasets/generate, but optimized for evaluation
- Success Response: Same format as dataset generation
3. Get Dataset Generation Status
Monitors the progress of dataset generation jobs.
- Endpoint: GET /datasets/generate/{job_id}/status
- Description: Returns current status and progress of dataset generation. This endpoint implements a persistent database fallback, ensuring status is available even after server restarts or worker instance changes.
- Path Parameters:
  - job_id (string, required): Dataset generation job ID
- Success Response (200 OK):
{ "job_id": "dataset-gen-uuid-123", "status": "running", "progress": 65, "current_step": "Processing batch 2/3 (25 chunks)", "estimated_completion": "2024-01-21T14:30:00Z", "result": { "batch_files": [ { "batch_index": 0, "file_path": "/storage/datasets/1/dataset_dataset-gen-uuid-123_batch_0.json", "entries_count": 15, "chunks_processed": 5 }, { "batch_index": 1, "file_path": "/storage/datasets/1/dataset_dataset-gen-uuid-123_batch_1.json", "entries_count": 18, "chunks_processed": 5 } ], "completed_batches": 2, "total_batches": 3, "total_entries_so_far": 33 } }
4. Get Dataset
Retrieves generated dataset details and entries.
- Endpoint: GET /datasets/{dataset_id}
- Description: Returns dataset metadata and summary information
- Path Parameters:
  - dataset_id (string, required): Dataset identifier
- Success Response (200 OK):
{ "id": "dataset_123", "name": "Medical Diagnosis Dataset", "entries": [], "total_entries": 1500, "created_at": "2024-01-21T12:00:00Z", "format_type": "jsonl", "batch_files": [ { "batch_index": 0, "file_path": "/storage/datasets/1/dataset_123_batch_0.json", "entries_count": 500, "chunks_processed": 50 }, { "batch_index": 1, "file_path": "/storage/datasets/1/dataset_123_batch_1.json", "entries_count": 500, "chunks_processed": 50 }, { "batch_index": 2, "file_path": "/storage/datasets/1/dataset_123_batch_2.json", "entries_count": 500, "chunks_processed": 50 } ], "total_batches": 3, "quality_summary": { "overall_score": 0.85, "diversity_score": 0.82, "bias_score": 0.88 } }
5. Get Dataset Entries
Retrieves dataset entries with pagination and filtering.
- Endpoint: GET /datasets/{dataset_id}/entries
- Description: Returns paginated dataset entries with optional filtering
- Path Parameters:
  - dataset_id (string, required): Dataset identifier
- Query Parameters:
  - page (integer): Page number (default: 1)
  - per_page (integer): Entries per page (default: 50)
  - filter_quality (float, optional): Minimum quality score filter
  - sort_by (string): Sort field - "quality", "category", or "difficulty"
- Success Response (200 OK):
{ "entries": [ { "id": "entry_1", "question": "What are the symptoms of myocardial infarction?", "answer": "Chest pain, shortness of breath, diaphoresis...", "category": "cardiology", "quality_score": 0.92, "difficulty": "intermediate", "source_chunk": "chunk_45", "metadata": { "model": "gemini", "taxonomy_level": 2 } } ], "total": 1500, "page": 1, "per_page": 50, "quality_summary": { "average_score": 0.85, "distribution": {"high": 1200, "medium": 250, "low": 50} } }
6. Update Dataset Entry
Updates individual dataset entries for refinement.
- Endpoint: PUT /datasets/{dataset_id}/entries/{entry_id}
- Description: Modifies question, answer, or metadata for dataset entries
- Path Parameters:
  - dataset_id (string, required): Dataset identifier
  - entry_id (string, required): Entry identifier
- Request Body:
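The exact request schema is not specified here; a plausible call might look like the sketch below, where the payload field names are assumptions based on the entry structure shown in section 5:

import requests

BASE_URL = "http://localhost:8000/api/v1"  # assumed host

# Hypothetical payload: field names mirror the entry structure returned by
# GET /datasets/{dataset_id}/entries but are not confirmed by this document.
payload = {
    "question": "What are the early symptoms of myocardial infarction?",
    "answer": "Chest pain, shortness of breath, diaphoresis...",
    "metadata": {"reviewed": True},
}
requests.put(f"{BASE_URL}/datasets/dataset_123/entries/entry_1", json=payload)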
7. Submit Dataset Feedback
Collects user feedback on dataset quality and relevance.
- Endpoint: POST /datasets/{dataset_id}/feedback
- Description: Submits bulk feedback for multiple dataset entries
- Request Body:
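The feedback schema is not specified here; a plausible bulk submission might look like this sketch (host URL and all payload field names are assumptions):

import requests

BASE_URL = "http://localhost:8000/api/v1"  # assumed host

# Hypothetical bulk-feedback payload; field names are illustrative only.
payload = {
    "feedback": [
        {"entry_id": "entry_1", "rating": 5, "comment": "Accurate and well-phrased"},
        {"entry_id": "entry_2", "rating": 2, "comment": "Answer is too vague"},
    ]
}
requests.post(f"{BASE_URL}/datasets/dataset_123/feedback", json=payload)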
8. Get Extraction Files
Retrieves available extraction jobs for dataset generation.
- Endpoint: GET /datasets/extraction-files/{project_id}
- Description: Returns the list of completed extraction jobs for a project, used to populate the extraction file selection dropdown
- Path Parameters:
  - project_id (string, required): Project ID (UUID) to get extraction files for. Accepts both integer and string formats for compatibility.
- Success Response (200 OK):
{ "extraction_files": [ { "id": 123, "job_id": 123, "status": "completed", "created_at": "2024-01-15T10:30:00Z", "extraction_type": "ner", "entity_count": 150, "display_name": "Job 123 - ner (150 entities)" }, { "id": 124, "job_id": 124, "status": "completed", "created_at": "2024-01-16T14:20:00Z", "extraction_type": "whole_text", "entity_count": 89, "display_name": "Job 124 - whole_text (89 entities)" } ] }
9. Regenerate Dataset Entries
Triggers regeneration of specific dataset entries.
- Endpoint: POST /datasets/{dataset_id}/regenerate
- Description: Regenerates entries with updated parameters or models
- Request Body:
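The regeneration schema is not specified here; a plausible request might look like this sketch (host URL and payload field names are assumptions, though "grok" is a documented model option):

import requests

BASE_URL = "http://localhost:8000/api/v1"  # assumed host

# Hypothetical payload: entry_ids and per-stage model overrides are
# illustrative; only the endpoint itself is documented.
payload = {
    "entry_ids": ["entry_1", "entry_2"],
    "classification_model": "grok",
}
requests.post(f"{BASE_URL}/datasets/dataset_123/regenerate", json=payload)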
10. Download Dataset
Downloads generated datasets in the specified format.
- Endpoint: GET /datasets/{dataset_id}/download
- Description: Downloads the dataset file (JSONL/Parquet)
- Path Parameters:
  - dataset_id (string, required): Dataset identifier
- Success Response: File download with the appropriate content type
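A minimal download sketch (host URL and local filename are assumptions; the output extension should match the dataset's format):

import requests

BASE_URL = "http://localhost:8000/api/v1"  # assumed host

# Stream the file to disk; the server sets the content type based on the
# dataset's format (JSONL or Parquet).
with requests.get(f"{BASE_URL}/datasets/dataset_123/download", stream=True) as resp:
    resp.raise_for_status()
    with open("dataset_123.jsonl", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)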
Integration with Job Management
All dataset generation operations return a job_id and run asynchronously. Dataset generation jobs are stored in the database and persist across API server restarts. Use the dedicated dataset generation status endpoint to monitor progress:
POST /api/v1/datasets/generate
Content-Type: application/json
{
"project_id": 1,
"classification_model": "gemini",
"analyze_quality": true
}
Response:
{
"job_id": "dataset-gen-uuid-789",
"message": "Dataset generation started"
}
# Monitor progress (dataset jobs persist across server restarts)
GET /api/v1/datasets/generate/dataset-gen-uuid-789/status
Model Selection Guidelines
Parsing Models
- Gemini: Best for complex document understanding and multi-format support
- Grok: Good for technical and structured content
- Ollama: Local processing, no API keys required
Classification Models
- Gemini: Superior taxonomy alignment and category classification
- Grok: Better for nuanced medical and technical categorization
- Ollama: Cost-effective for high-volume processing
Chunking Models
- Gemini: Intelligent semantic chunking with context awareness
- Grok: Good for technical document segmentation
- Ollama: Fast processing for simple chunking strategies
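Because the three model choices are independent request fields, a single generation request can mix providers per stage, for example:

{
  "parsing_model": "gemini",
  "chunking_model": "ollama",
  "classification_model": "grok"
}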
Error Handling
Common Error Responses
- 400 Bad Request: Invalid parameters or missing required fields
- 404 Not Found: Project, taxonomy, or dataset not found
- 503 Service Unavailable: Job queue manager not initialized (Redis unavailable)
Job-Specific Errors
Dataset generation jobs may fail with specific error messages:
- "No chunks found for project": Project has no processed chunks
- "API key not configured": Selected model requires API key
- "Taxonomy validation failed": Specified taxonomy not found or invalid
- "Quality threshold not met": Generated dataset failed quality checks
Best Practices
- Model Selection: Choose appropriate models based on content type and requirements
- Chunk Preparation: Ensure high-quality chunking before generation
- Quality Analysis: Enable quality analysis for production datasets
- Versioning: Use versioning for iterative dataset improvement
- Monitoring: Always monitor job progress using the dataset generation status endpoint
- Resource Planning: Consider concurrency limits and processing time for large datasets