Create Structured Datasets for LLM Fine-tuning

Transform your documents into consistently formatted training datasets for LLM fine-tuning. Our API processes, structures, and stores your training data, which makes it well suited to building custom AI models and adapting existing models to a specific domain.

Fine-tuning Pipeline

LLM Fine-tuning Process

Example Training Data Format

{
  "training_examples": [
    {
      "instruction": "Summarize the key points of this document",
      "input": "Original document text...",
      "output": "Structured summary of main points..."
    },
    {
      "instruction": "Extract relevant information",
      "input": "Document content...",
      "output": "Extracted structured data..."
    }
  ]
}
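
Many fine-tuning toolchains read one JSON object per line (JSONL) rather than a single wrapped array. The following is a minimal sketch of that conversion, assuming the payload has exactly the shape shown above. The function name `to_jsonl` and the `REQUIRED_KEYS` check are illustrative and not part of the API.

```python
import json

# Field names taken from the example format above.
REQUIRED_KEYS = ("instruction", "input", "output")


def to_jsonl(dataset: dict) -> str:
    """Flatten {"training_examples": [...]} into one JSON object per line."""
    lines = []
    for i, example in enumerate(dataset.get("training_examples", [])):
        # Reject records where a required field is absent or not a string.
        missing = [k for k in REQUIRED_KEYS if not isinstance(example.get(k), str)]
        if missing:
            raise ValueError(f"example {i} lacks string field(s): {missing}")
        record = {k: example[k] for k in REQUIRED_KEYS}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


dataset = {
    "training_examples": [
        {
            "instruction": "Summarize the key points of this document",
            "input": "Original document text...",
            "output": "Structured summary of main points...",
        }
    ]
}
print(to_jsonl(dataset))
```

Validating before writing keeps a malformed record from silently ending up in the training file.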

Fine-tuning Data Processing Features

Data Processing

  • Automatic text cleaning and normalization
  • Custom data extraction patterns
  • Multi-format document support
  • Batch processing capabilities

Dataset Creation

  • Instruction-response pair generation
  • Quality scoring and filtering
  • Automated data validation
  • MongoDB-ready JSON output

Integration Features

  • Direct MongoDB integration
  • RESTful API access
  • Bulk export capabilities
  • Version control support

API Integration Guide

1. Upload Documents

curl -X POST \
  -H "Authorization: Token YOUR_TOKEN" \
  -F "file=@/path/to/documents.pdf" \
  https://monkt.com/api/transformations/

Response:

{
  "success": true,
  "result": {
    "transformations": [{
      "uuid": "0974aa22-bc2b-4994-bc2f-0b4020bbe846",
      "status": "completed",
      "created": "2024-12-30 06:57:20"
    }]
  }
}
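
The next steps need the `uuid` of each uploaded document. A minimal sketch of pulling it out of the response body shown above follows. The helper name `completed_uuids` is hypothetical, and since the only status value shown here is "completed", the filter on that value is an assumption about how unfinished uploads would be reported.

```python
import json


def completed_uuids(response_text: str) -> list[str]:
    """Return the UUIDs of transformations reported as completed."""
    body = json.loads(response_text)
    if not body.get("success"):
        raise RuntimeError("upload request was not successful")
    transformations = body.get("result", {}).get("transformations", [])
    # Keep only entries whose status matches the value seen in the sample response.
    return [t["uuid"] for t in transformations if t.get("status") == "completed"]


sample = """
{
  "success": true,
  "result": {
    "transformations": [{
      "uuid": "0974aa22-bc2b-4994-bc2f-0b4020bbe846",
      "status": "completed",
      "created": "2024-12-30 06:57:20"
    }]
  }
}
"""
print(completed_uuids(sample))
```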

2. Define Training Data Schema

curl -X POST \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Fine-tuning Schema",
    "schema": {
      "type": "object",
      "properties": {
        "instruction": {"type": "string"},
        "input": {"type": "string"},
        "output": {"type": "string"}
      }
    }
  }' \
  https://monkt.com/api/schemas/

Response:

{
  "success": true,
  "result": {
    "uuid": "1b5baa64-52ea-4f3d-8792-4852e24eacce",
    "name": "Fine-tuning Schema"
  }
}
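
The same schema request can be prepared from a script. The sketch below builds it with Python's standard library and mirrors the curl call above (endpoint, headers, and payload). The helper name `build_schema_request` is illustrative, and the request is only constructed, not sent, so the example runs offline.

```python
import json
import urllib.request

SCHEMAS_URL = "https://monkt.com/api/schemas/"  # endpoint from the curl example


def build_schema_request(token: str, name: str, fields: list[str]) -> urllib.request.Request:
    """Build a POST request that defines a schema of string-typed fields."""
    payload = {
        "name": name,
        "schema": {
            "type": "object",
            "properties": {field: {"type": "string"} for field in fields},
        },
    }
    return urllib.request.Request(
        SCHEMAS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_schema_request(
    "YOUR_TOKEN", "Fine-tuning Schema", ["instruction", "input", "output"]
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```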

3. Get Training Dataset

curl -X POST \
  -H "Authorization: Token YOUR_TOKEN" \
  https://monkt.com/api/transformations/{transformation_uuid}/json/{schema_uuid}/

Response:

{
  "success": true,
  "result": {
    "training_examples": [
      {
        "instruction": "Summarize the key points",
        "input": "Document content...",
        "output": "Key points summary..."
      },
      {
        "instruction": "Extract relevant information",
        "input": "Document text...",
        "output": "Extracted information..."
      }
    ]
  }
}
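
The dataset URL combines the transformation UUID from step 1 with the schema UUID from step 2. A small sketch of assembling it, with a sanity check on both identifiers, is shown below. The helper name `dataset_url` is hypothetical; the path layout is copied from the curl command above.

```python
import uuid

BASE_URL = "https://monkt.com/api/transformations"  # from the curl example


def dataset_url(transformation_uuid: str, schema_uuid: str) -> str:
    """Build the dataset endpoint URL for a transformation and a schema."""
    for value in (transformation_uuid, schema_uuid):
        uuid.UUID(value)  # raises ValueError if the string is not a valid UUID
    return f"{BASE_URL}/{transformation_uuid}/json/{schema_uuid}/"


print(
    dataset_url(
        "0974aa22-bc2b-4994-bc2f-0b4020bbe846",  # uuid from the upload response
        "1b5baa64-52ea-4f3d-8792-4852e24eacce",  # uuid from the schema response
    )
)
```

Checking the identifiers first catches a leftover `{transformation_uuid}` placeholder before any request is made.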

Usage Information

Track your API usage and limits in the response headers or the info object, for example:
• Transformations Used: 203/250
• Days Until Reset: 27
• Plan: Free
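
A script can use these figures to stop before the quota runs out. The sketch below assumes the usage value arrives as a "used/limit" string like the one displayed above. The exact header or field name is not given here, so the function takes the raw string, and `parse_usage` is an illustrative name.

```python
def parse_usage(used: str) -> tuple[int, int, int]:
    """Split a "used/limit" string into (used, limit, remaining)."""
    count, limit = (int(part) for part in used.split("/"))
    # Clamp at zero in case usage is reported above the limit.
    return count, limit, max(limit - count, 0)


count, limit, remaining = parse_usage("203/250")
print(f"{remaining} transformations left of {limit}")  # 47 transformations left of 250
```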

Fine-tuning Use Cases

Domain Adaptation

Create specialized training datasets from your industry documents:

  • Legal document processing
  • Medical research analysis
  • Financial report understanding
  • Technical documentation

Task-Specific Training

Prepare datasets for specific AI tasks:

  • Question answering systems
  • Document summarization
  • Information extraction
  • Content classification

Frequently Asked Questions

What types of articles can I convert to JSON?

We support various content sources including news websites, blog posts, RSS feeds, and content management systems. Our system can handle content in multiple languages and from different publishing platforms.

How accurate is the article extraction?

Our AI-powered system achieves over 95% accuracy in content extraction. We use advanced NLP techniques to identify key information like titles, authors, dates, and main content while filtering out ads and irrelevant elements.

Can I customize the JSON output format?

Yes! You can define custom schemas to extract only the fields you need. This allows you to match your existing system's requirements and integrate seamlessly with your content workflow.

How do I integrate with my content system?

Our API provides standardized JSON output that can be integrated with popular CMS platforms and content aggregators. We provide SDKs and detailed integration guides for major programming languages.

What about data security and privacy?

We take security seriously. All requests are encrypted using TLS 1.3, and we process data in isolated environments. We are GDPR compliant and automatically delete processed content after 24 hours.

What are the API rate limits?

API access is available starting with our Pro plan, which includes up to 1,000 transformations per month. Enterprise plans offer higher limits up to 5 million transformations monthly with custom integration support. Free accounts can test the service through our web interface with limited transformations.

Can I use both UI and API for article conversion?

Yes! You can choose what works best for you. Our user-friendly dashboard provides a simple interface for manual conversions and monitoring, while our API enables automated integration into your systems. Both methods support the same features and conversion quality.

What file formats do you support?

We support a wide range of document formats including PDF, Word (DOC, DOCX), PowerPoint (PPT, PPTX), Excel (XLS, XLSX), HTML, and plain text files. Our system can process both text and embedded images within these documents.

How does the JSON schema customization work?

Pro users can define custom JSON schemas to specify exactly how they want their data structured. You can either use our automated schema detection or provide your own schema definition. This ensures your output data matches your exact requirements.

How do you handle document storage and security?

All documents are encrypted both in transit and at rest. We maintain secure storage for your processed documents, allowing you to access them anytime. Documents are automatically deleted after 30 days unless you specify otherwise.

What's included in the API access?

Pro and Enterprise users get full API access with comprehensive documentation. You can integrate our document processing directly into your workflow, automate batch processing, and retrieve transformed documents programmatically.

How does batch processing work?

You can upload multiple documents at once through our interface or API. Our system processes them in parallel, maintaining consistent formatting across all outputs. Progress tracking and notifications are available for batch jobs.

How do you handle images in documents?

Our system automatically detects and processes images within documents. We can extract image content, generate descriptive text, and include them in your markdown or JSON output in a format suitable for AI/LLM processing.

What kind of support do you offer?

All users get access to our documentation and email support. Pro users receive priority support with faster response times. Enterprise customers get dedicated support teams and custom SLAs to meet their specific needs.

Can I try before subscribing?

Yes! You can try our service with a sample document to see the quality of our markdown and JSON outputs. This helps you understand how our system handles document formatting and structure before committing to a subscription.