Create Structured Datasets for LLM Fine-tuning

Transform your documents into perfectly formatted training datasets for LLM fine-tuning. Our API helps you process, structure, and store training data efficiently. It is ideal for building custom AI models and adapting them to specialized domains.

Fine-tuning Data Processing Features

Data Processing

  • Automatic text cleaning and normalization
  • Custom data extraction patterns
  • Multi-format document support
  • Batch processing capabilities

Dataset Creation

  • Instruction-response pair generation
  • Quality scoring and filtering
  • Automated data validation
  • MongoDB-ready JSON output

Integration Features

  • Direct MongoDB integration
  • RESTful API access
  • Bulk export capabilities
  • Version control support

API Integration Guide

Step 1: Upload Documents

curl -X POST \
  -H "Authorization: Token YOUR_TOKEN" \
  -F "file=@/path/to/documents.pdf" \
  https://monkt.com/api/transformations/

Response:

{
  "success": true,
  "result": {
    "transformations": [{
      "uuid": "0974aa22-bc2b-4994-bc2f-0b4020bbe846",
      "status": "completed",
      "created": "2024-12-30 06:57:20"
    }]
  }
}

Step 2: Define Training Data Schema

curl -X POST \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Fine-tuning Schema",
    "schema": {
      "type": "object",
      "properties": {
        "instruction": {"type": "string"},
        "input": {"type": "string"},
        "output": {"type": "string"}
      }
    }
  }' \
  https://monkt.com/api/schemas/

Response:

{
  "success": true,
  "result": {
    "uuid": "1b5baa64-52ea-4f3d-8792-4852e24eacce",
    "name": "Fine-tuning Schema"
  }
}

Step 3: Get Training Dataset

curl -X POST \
  -H "Authorization: Token YOUR_TOKEN" \
  https://monkt.com/api/transformations/{transformation_uuid}/json/{schema_uuid}/

Response:

{
  "success": true,
  "result": {
    "training_examples": [
      {
        "instruction": "Summarize the key points",
        "input": "Document content...",
        "output": "Key points summary..."
      },
      {
        "instruction": "Extract relevant information",
        "input": "Document text...",
        "output": "Extracted information..."
      }
    ]
  }
}
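
Because the output is MongoDB-ready JSON, the returned examples can be stored directly. Below is a minimal Python sketch using the requests and pymongo libraries; the database and collection names are placeholders, and the UUIDs are the ones returned in steps 1 and 2.

# Minimal sketch: fetch the generated dataset and store it in MongoDB.
# Database and collection names are placeholders, not part of the API.
import requests
from pymongo import MongoClient

API_TOKEN = "YOUR_TOKEN"
transformation_uuid = "0974aa22-bc2b-4994-bc2f-0b4020bbe846"  # from step 1
schema_uuid = "1b5baa64-52ea-4f3d-8792-4852e24eacce"          # from step 2

url = (f"https://monkt.com/api/transformations/"
       f"{transformation_uuid}/json/{schema_uuid}/")
resp = requests.post(url, headers={"Authorization": f"Token {API_TOKEN}"})
resp.raise_for_status()

# Each instruction/input/output record becomes its own MongoDB document.
examples = resp.json()["result"]["training_examples"]
client = MongoClient("mongodb://localhost:27017")
client["finetuning"]["examples"].insert_many(examples)
print(f"Stored {len(examples)} training examples")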

Usage Information

Track your API usage and limits in the response headers or info object:
• Transformations Used: 203/250
• Days Until Reset: 27
• Plan: Free

Fine-tuning Use Cases

Domain Adaptation

Create specialized training datasets from your industry documents:

  • Legal document processing
  • Medical research analysis
  • Financial report understanding
  • Technical documentation

Task-Specific Training

Prepare datasets for specific AI tasks:

  • Question answering systems
  • Document summarization
  • Information extraction
  • Content classification

Understanding LLM Fine-tuning Data Processing

Creating high-quality training datasets for LLM fine-tuning requires careful processing and structuring of source documents. Our system automates this process, transforming raw content into perfectly formatted training examples that meet the specific requirements of different LLM frameworks.

Instruction-Response Pair Generation

The system automatically generates natural instruction-response pairs from your content. Advanced NLP algorithms identify suitable content segments and create appropriate instructions that capture the intended task. This results in training examples that help models learn specific behaviors and domain knowledge.
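
As a toy illustration of the underlying idea (not the service's actual NLP pipeline), pairing each section heading with its body text already yields usable instruction-response examples:

# Toy illustration: derive instruction-response pairs by pairing each
# section heading with its body text. The real system uses NLP models to
# pick segments and phrase instructions; this heuristic is only a sketch.
def pairs_from_sections(sections):
    examples = []
    for heading, body in sections:
        examples.append({
            "instruction": f"Explain what the document says about {heading.lower()}.",
            "input": "",
            "output": body,
        })
    return examples

sections = [("Termination Clauses",
             "Either party may terminate this agreement with 30 days notice.")]
print(pairs_from_sections(sections))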

Data Quality Assessment

Each generated training example undergoes rigorous quality checks. The system evaluates instruction clarity, response completeness, and overall example coherence. Quality scores help filter out suboptimal examples, ensuring your training dataset maintains high standards for model learning.
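
A minimal sketch of score-based filtering; the scoring heuristic and threshold below are illustrative assumptions, not the service's actual checks:

# Illustrative filter: keep examples whose heuristic quality score clears
# a threshold. The real system also evaluates instruction clarity,
# response completeness, and overall coherence.
def quality_score(example):
    score = 1.0
    if len(example["output"].split()) < 5:   # near-empty response
        score -= 0.5
    if not example["instruction"].strip():   # missing instruction
        score -= 0.5
    return score

examples = [
    {"instruction": "Summarize the key points",
     "output": "A summary covering the document's main findings."},
    {"instruction": "", "output": "Orphan response"},
]
filtered = [ex for ex in examples if quality_score(ex) >= 0.8]
print(f"Kept {len(filtered)} of {len(examples)} examples")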

Format Compatibility

Training data is automatically formatted to match popular LLM frameworks like OpenAI's fine-tuning format, Anthropic's Constitutional AI format, or custom schemas. The system handles proper JSON structure, escape characters, and format-specific requirements, making the dataset immediately ready for training.
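
For instance, converting the instruction/input/output records from step 3 into OpenAI's chat fine-tuning format (a JSONL file with one messages array per line) looks roughly like this:

# Convert instruction/input/output records into OpenAI's chat fine-tuning
# JSONL format: one {"messages": [...]} object per line.
import json

def to_openai_jsonl(examples, path="train.jsonl"):
    with open(path, "w") as f:
        for ex in examples:
            user = ex["instruction"]
            if ex.get("input"):
                user += "\n\n" + ex["input"]
            record = {"messages": [
                {"role": "user", "content": user},
                {"role": "assistant", "content": ex["output"]},
            ]}
            f.write(json.dumps(record) + "\n")

to_openai_jsonl([{"instruction": "Summarize the key points",
                  "input": "Document content...",
                  "output": "Key points summary..."}])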

Content Diversification

The converter ensures training data diversity by varying instruction types, complexity levels, and response styles. This helps prevent overfitting and enables models to handle a broader range of inputs. The system can also generate multiple variations of similar examples to improve model robustness.
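
A toy sketch of the instruction-variation idea; the template wording is illustrative, not the service's actual rewriter:

# Generate phrasing variants of one example to add instruction diversity.
import random

TEMPLATES = [
    "Summarize the key points of the following text.",
    "Give a concise summary of this passage.",
    "What are the main takeaways from the text below?",
]

def diversify(example, n=2):
    return [{**example, "instruction": t}
            for t in random.sample(TEMPLATES, n)]

print(diversify({"instruction": "Summarize the key points",
                 "input": "Document content...",
                 "output": "Key points summary..."}))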

Metadata Enhancement

Training examples are enriched with relevant metadata such as task categories, difficulty levels, and domain tags. This additional context helps in dataset organization, selective training, and performance analysis. The metadata also enables targeted fine-tuning for specific use cases.
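
The field names below are hypothetical, shown only to illustrate what such enrichment might look like:

# Illustrative enrichment: attach metadata tags to a training example.
# The field names ("meta", "domain", "task", "difficulty") are assumptions,
# not a documented output schema.
def enrich(example, domain, task):
    return {
        **example,
        "meta": {
            "domain": domain,
            "task": task,
            "difficulty": "easy" if len(example["input"]) < 500 else "hard",
        },
    }

print(enrich({"instruction": "Summarize the key points",
              "input": "Document content...",
              "output": "Key points summary..."},
             domain="legal", task="summarization"))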

Validation and Testing

The system automatically splits processed data into training and validation sets. It performs statistical analysis to ensure balanced representation across different instruction types and content categories. This helps in monitoring training progress and preventing overfitting during model fine-tuning.
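
A minimal sketch of a shuffled train/validation split; the real system additionally balances instruction types and content categories across the two sets:

# Shuffle with a fixed seed, then cut off the last fraction as validation.
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:cut], shuffled[cut:]

examples = [{"instruction": f"Task {i}", "output": f"Answer {i}"}
            for i in range(100)]
train, val = train_val_split(examples)
print(len(train), len(val))  # 90 10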

Batch Processing and Scaling

For large-scale training needs, the system supports efficient batch processing of multiple documents. It maintains consistent formatting and quality standards while handling large volumes of content. The distributed processing architecture ensures quick turnaround times for dataset creation.
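
On the client side, batching can be as simple as looping over files against the upload endpoint from step 1; the directory path below is an assumption:

# Submit every PDF in a directory for processing and collect the returned
# transformation UUIDs (response shape matches step 1 above).
import pathlib
import requests

API_TOKEN = "YOUR_TOKEN"
headers = {"Authorization": f"Token {API_TOKEN}"}

uuids = []
for pdf in pathlib.Path("documents/").glob("*.pdf"):
    with open(pdf, "rb") as f:
        resp = requests.post("https://monkt.com/api/transformations/",
                             headers=headers, files={"file": f})
    resp.raise_for_status()
    uuids.append(resp.json()["result"]["transformations"][0]["uuid"])

print(f"Submitted {len(uuids)} documents for processing")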

Whether you're fine-tuning models for specific domains, creating instruction-following assistants, or developing specialized AI applications, our system provides the structured training data needed for successful model adaptation. The combination of intelligent processing, quality control, and format compatibility ensures your fine-tuning datasets meet the highest standards for AI model training.

Frequently Asked Questions

What types of articles can I convert to JSON?

We support various content sources including news websites, blog posts, RSS feeds, and content management systems. Our system can handle content in multiple languages and from different publishing platforms.

How accurate is the article extraction?

Our AI-powered system achieves over 95% accuracy in content extraction. We use advanced NLP techniques to identify key information like titles, authors, dates, and main content while filtering out ads and irrelevant elements.

Can I customize the JSON output format?

Yes! You can define custom schemas to extract only the fields you need. This allows you to match your existing system's requirements and integrate seamlessly with your content workflow.

How do I integrate with my content system?

Our API provides standardized JSON output that can be integrated with popular CMS platforms and content aggregators. We provide SDKs and detailed integration guides for major programming languages.

What about data security and privacy?

We take security seriously. All requests are encrypted using TLS 1.3, and we process data in isolated environments. We are GDPR compliant and automatically delete processed content after 24 hours.

What are the API rate limits?

API access is available starting with our Pro plan, which includes up to 1,000 transformations per month. Enterprise plans offer higher limits up to 5 million transformations monthly with custom integration support. Free accounts can test the service through our web interface with limited transformations.

Can I use both UI and API for article conversion?

Yes! You can choose what works best for you. Our user-friendly dashboard provides a simple interface for manual conversions and monitoring, while our API enables automated integration into your systems. Both methods support the same features and conversion quality.