Compileo: The Ultimate AI-Powered Document Processing & Dataset Engineering Suite
Compileo is an enterprise-grade, modular platform designed to transform raw, unstructured data into high-quality, structured intelligence. Whether you're processing 1,000-page medical PDFs, scraping JavaScript-heavy websites, or engineering datasets for LLM fine-tuning and personal study, Compileo provides a unified, AI-driven lifecycle for the modern data era.
What can Compileo do?
Compileo isn't just a parser; it's a comprehensive data engineering ecosystem. It automates the complex journey from "Messy Source" to "Validated, Categorized Intelligence."
Multi-Source Knowledge Consolidation
Imagine you have several thick textbooks and want to create a specialized dataset focused only on disease treatments. Compileo can:
1. Ingest all books simultaneously (PDF, DOCX, etc.).
2. Discover a unified "Treatment" taxonomy across all sources automatically.
3. Extract every mention of treatments, dosages, and contraindications with high precision.
4. Consolidate this into one unified, high-quality Q&A dataset for training or study.
Three Ways to Work
Compileo is designed for every workflow:
* Web GUI: A user-friendly Streamlit interface with a 7-step guided wizard.
* REST API: Seamlessly integrate dataset generation into your own applications.
* CLI: Automate heavy-duty processing with powerful command-line parameters.
Features
Intelligent Document Processing & AI-Assisted Chunking
- Massive PDF Autonomy: Automatically splits 1,000+ page documents into manageable segments with semantic overlaps, keeping each segment within LLM token limits while preserving context across segment boundaries.
- Two-Pass VLM Parsing: Employs a "Skim and Extract" methodology using Vision-Language Models (Grok, Gemini, Ollama) to first understand document layout and then extract high-fidelity Markdown.
- AI-Assisted Strategy Recommendation: Don't guess how to split your data. Describe your goal (e.g., "I want to extract detailed surgical procedures"), and Compileo's AI will analyze your documents to recommend the optimal Semantic, Token, or Schema-based chunking strategy.
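The segment-splitting idea in the first bullet can be illustrated with a minimal sketch. This is not Compileo's implementation: the function name `split_pages` and its parameters are hypothetical, and it uses a fixed page-count overlap as a simple stand-in for the semantic overlaps described above.

```python
from typing import List, Tuple


def split_pages(total_pages: int, segment_size: int = 50, overlap: int = 2) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) page ranges.

    Consecutive ranges share `overlap` pages so that context at a
    segment boundary appears in both neighbouring segments.
    Hypothetical helper for illustration only.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_size")

    ranges: List[Tuple[int, int]] = []
    start = 1
    while start <= total_pages:
        end = min(start + segment_size - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:
            break
        # Step back by `overlap` pages so the next segment repeats them.
        start = end - overlap + 1
    return ranges


if __name__ == "__main__":
    for start, end in split_pages(1000, segment_size=300, overlap=10):
        print(f"pages {start}-{end}")
```

A larger overlap preserves more boundary context at the cost of processing some pages twice; the right trade-off depends on the document and the model's token budget.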
Semantic Data Engineering
- AI-Assisted Taxonomy: Don't waste weeks defining categories. Compileo's Smart Sampling selects representative content to suggest and build hierarchical knowledge trees automatically.
- Multi-Stage Extraction: Performs Hierarchical Classification, moving from coarse-grained categories to fine-grained entities based on your custom or generated taxonomy.
- Context-Aware NER: Uses parent context during extraction to disambiguate entities and discover deep relationships between concepts.
Advanced Quality Control & Evaluation
- AI Confidence Scoring: Every extracted entity and relationship is assigned an AI confidence level (0.0 - 1.0), allowing you to filter for only the most reliable data.
- Deep Quality Metrics: Automated scoring for Lexical Diversity, Demographic Bias, Answer Coherence, and Target Audience Alignment via the datasetqual module.
- Fine-Tuned Model Testing: Use the benchmarking module to evaluate how your fine-tuned models perform on your custom datasets using industry-standard metrics (Accuracy, F1, BLEU, ROUGE) and suites like GLUE or MMLU.
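As a rough illustration of how the 0.0 to 1.0 confidence scores could be used downstream, the sketch below filters extracted entities by a threshold. The `Entity` record and `filter_by_confidence` function are hypothetical names chosen for this example; Compileo's actual output schema may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Entity:
    """Hypothetical record for one extracted entity."""
    text: str
    label: str
    confidence: float  # AI confidence level in [0.0, 1.0]


def filter_by_confidence(entities: Iterable[Entity], threshold: float = 0.8) -> List[Entity]:
    """Keep only entities whose confidence is at or above `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [e for e in entities if e.confidence >= threshold]


if __name__ == "__main__":
    sample = [
        Entity("metformin", "treatment", 0.95),
        Entity("500 mg", "dosage", 0.62),
        Entity("renal impairment", "contraindication", 0.88),
    ]
    for entity in filter_by_confidence(sample, threshold=0.8):
        print(entity.label, entity.text, entity.confidence)
```

Raising the threshold trades recall for reliability, so it is worth inspecting what gets dropped before settling on a value.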
Developer Extensibility
- Robust Plugin System: Effortlessly extend Compileo by adding custom Ingestion Handlers, Dataset Formatters, or API Routers via a simple .zip package architecture.
- Custom Exports: Out-of-the-box support for Anki export, allowing you to turn any technical document into a high-quality study deck.
System Requirements
- CPU: 4-core processor minimum (8-core recommended).
- RAM: 8GB minimum (16GB recommended for heavy processing).
- GPU (Optional): NVIDIA GPU with 8GB+ VRAM. Required for HuggingFace local inference and advanced system performance monitoring.
- Storage: 25GB free disk space.
- Operating System: Linux, macOS, or Windows.
Installation
Option 1: Docker
The fastest way to deploy the full stack (API, GUI, and Redis).
- Clone & Prepare:
- Launch:
- Access:
  - Web GUI: http://localhost:8501
  - API Docs: http://localhost:8000/docs
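Once the stack is running, a quick way to confirm that both endpoints respond is a small standard-library check like the sketch below. The two URLs are the ones listed above; the `is_up` helper is a hypothetical name for this example and is not part of Compileo.

```python
import urllib.error
import urllib.request

# URLs taken from the Access list above.
SERVICES = {
    "Web GUI": "http://localhost:8501",
    "API Docs": "http://localhost:8000/docs",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if something answers HTTP at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied, just with an error status (e.g. 401 or 404).
        return True
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "reachable" if is_up(url) else "not reachable"
        print(f"{name}: {url} is {status}")
```

An HTTP error status still counts as reachable here, because a secured instance may reject unauthenticated requests while running normally.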
API Authentication & Security
Compileo implements an "Auto-Lock" security model designed for zero-config startup without sacrificing security.
- Unsecured Mode (Default): If no API keys are defined, Compileo allows all requests. This is ideal for first-time setup and local experimentation.
- Secured Mode: As soon as you define an API key, the system "locks" and strictly requires that key for all API and GUI operations.
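The two modes amount to a simple decision rule, sketched below. This is an illustration of the described behavior under stated assumptions, not Compileo's actual code: `is_request_allowed` is a hypothetical function, and how keys are stored or transmitted is left out.

```python
import hmac
from typing import Iterable, Optional


def is_request_allowed(configured_keys: Iterable[str], presented_key: Optional[str]) -> bool:
    """Sketch of the "Auto-Lock" rule (hypothetical helper).

    - No keys configured: unsecured mode, every request is allowed.
    - At least one key configured: the request must present a matching key.
    """
    keys = [k for k in configured_keys if k]  # ignore empty entries
    if not keys:
        return True  # unsecured mode (default)
    if not presented_key:
        return False  # secured mode, but no key was supplied
    # compare_digest avoids leaking key contents through comparison timing.
    return any(
        hmac.compare_digest(presented_key.encode("utf-8"), k.encode("utf-8"))
        for k in keys
    )


if __name__ == "__main__":
    print(is_request_allowed([], None))               # unsecured mode
    print(is_request_allowed(["s3cret"], None))       # locked, no key sent
    print(is_request_allowed(["s3cret"], "s3cret"))   # locked, valid key
```

Because the default is open, an instance exposed beyond localhost should have a key defined before it is reachable by others.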
How to Secure Your Instance (Choose One):
- GUI (Recommended): Launch Compileo, go to Settings > API Configuration, enter one or more API Keys, and click Save. The system locks instantly.
- CLI: Start the API with the --api-key flag.
- Environment: Define COMPILEO_API_KEY=your_secret_key in your .env or Docker configuration.
How to Connect to a Secured Instance:
All API requests must include the following header:
Option 2: Python Environment
Ideal for local development, CLI automation, or custom integrations.
Prerequisites: A running Redis server (on Debian/Ubuntu: sudo apt install redis-server).
- Setup Environment:
- Install Dependencies:
- Start Services:
Note: For API security, set API keys via the GUI Settings after startup.
# Option A: Start services separately (recommended for production)
# 1. Start the API server (in one terminal)
uvicorn src.compileo.api.main:app --host 0.0.0.0 --port 8000
# 2. Start the Web GUI (in another terminal)
streamlit run src/compileo/features/gui/main.py --server.port 8501 --server.address 0.0.0.0

# Option B: Start both in same terminal (API in background)
uvicorn src.compileo.api.main:app --host 0.0.0.0 --port 8000 &
streamlit run src/compileo/features/gui/main.py --server.port 8501 --server.address 0.0.0.0
License
Apache 2.0