Dataset Engineering in Practice: Building a Fine-tuning Dataset from Real Company Data

The previous article argued that Dataset Engineering is the next skill in the AI stack. This one gets hands-on.

The scenario: a Sales team uses AI to help write emails, but has to fix the tone, format, and closer every time. RAG retrieves the right product information, but the way the email is written is not the company's own. The result is that staff spend time reshaping the output several times a day.

This article walks through building a fine-tuning dataset from 20 real sales emails, expanding it to 1,000+ synthetic examples (about 840 after filtering and deduplication), and training with QLoRA + SFT on Qwen2.5-7B. The whole pipeline uses HuggingFace TRL directly, with no abstraction hiding the steps.

1. Task Formulation: Define the Problem Before Collecting Any Data

This step matters most and gets skipped most often. Before collecting any data, you need to be able to answer: what will the model do, in what format, and how will success be measured?

For the sales email use case:

Question	Answer
Task	Write sales emails in the company's tone and format
Input format	Email type + context (customer, product, status)
Output format	A ready-to-send email with subject, body, and closer
How to measure	Compare output against the team's real emails + human eval score

Python

# task_definition.py
from dataclasses import dataclass


@dataclass
class SalesEmailTask:
    """
    Structure of a single training example.
    Define this clearly before collecting any data.
    """
    email_type: str          # follow_up | intro | proposal | thank_you | check_in
    customer_context: str    # customer, product, and deal status
    expected_output: str     # the ready-to-send email
    quality_score: float     # 1-5, human rated
    source: str              # "real" | "synthetic"

    def to_messages(self, system_prompt: str) -> list[dict]:
        """Convert to chat format for SFT."""
        return [
            {"role": "system", "content": system_prompt},
            # "ประเภท" means "type"
            {"role": "user", "content": f"ประเภท: {self.email_type}\n\n{self.customer_context}"},
            {"role": "assistant", "content": self.expected_output}
        ]

The system prompt used for training should be short, clear, and reflect the real role:

Python

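# English gloss: "You are the company's Sales Communication Specialist.
# Write emails in the style and format the Sales team actually uses.
# Use a professional but friendly tone and close with a clear call-to-action."
SYSTEM_PROMPT = """คุณเป็น Sales Communication Specialist ของบริษัท
เขียนอีเมลในสไตล์และ format ที่ทีม Sales ใช้จริง
ใช้ tone มืออาชีพแต่เป็นมิตร ปิดด้วย call-to-action ที่ชัดเจน"""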
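To make the target format concrete, the sketch below builds a single training record from the task definition and serializes it as one JSON line. The customer context, the email text, and the shortened system prompt are invented placeholders, not real company data.

```python
import json
from dataclasses import dataclass


@dataclass
class SalesEmailTask:
    email_type: str
    customer_context: str
    expected_output: str
    quality_score: float
    source: str

    def to_messages(self, system_prompt: str) -> list[dict]:
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"ประเภท: {self.email_type}\n\n{self.customer_context}"},
            {"role": "assistant", "content": self.expected_output},
        ]


# Shortened placeholder; the real prompt is the Thai one shown above.
SYSTEM_PROMPT = "You write sales emails in the company's style."

example = SalesEmailTask(
    email_type="follow_up",
    customer_context="Customer: ACME Co. Product: CRM add-on. Status: demo done last week.",
    expected_output="Subject: Next steps after the demo\n\nเรียน คุณลูกค้า\n...",
    quality_score=4.5,
    source="real",
)

record = {"messages": example.to_messages(SYSTEM_PROMPT)}
# ensure_ascii=False keeps Thai text readable in the JSONL file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Every record ends up with the same three roles in the same order, which is what makes the dataset easy to validate before training.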

2. Collection: Gathering Seed Data from the Organization

You don't need thousands of records. 10-20 good examples are enough to start. What matters is that the seed examples truly represent the behavior you want.

Usable data sources:

Source	What you get	Caveats
Email archive (sent folder)	Emails that were actually sent, with the real tone	Customer names must be anonymized
CRM notes/templates	Templates the team uses	May be bare templates with no context
Sales playbook	Standard format, closers	Usually the ideal case, not the real case
Top performer emails	Examples from high performers	May carry too much personal style
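The anonymization caveat is worth a concrete illustration. The sketch below masks email addresses, Thai-style phone numbers, and a caller-supplied list of names. The regex patterns, the placeholder tokens, and the anonymize helper are all assumptions made for this example; real archives usually need a reviewed name list or an NER pass on top.

```python
import re

# Illustrative patterns only, not a complete PII solution.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"\b0\d{1,2}[- ]?\d{3}[- ]?\d{4}\b")  # e.g. 081-234-5678


def anonymize(text: str, known_names: list[str]) -> str:
    """Replace emails, phone numbers, and listed names with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Replace longer names first so a full name wins over a first name.
    for name in sorted(known_names, key=len, reverse=True):
        text = text.replace(name, "[CUSTOMER]")
    return text


raw = "เรียน คุณSomchai Jaidee ติดต่อ somchai@acme.co.th หรือ 081-234-5678"
clean = anonymize(raw, known_names=["Somchai Jaidee", "Somchai"])
print(clean)
```

Running this before the emails are written to the seed JSON files keeps customer identifiers out of both the training set and any prompt sent to an external API.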

Python

# collect_seed_data.py
import json
from pathlib import Path

from task_definition import SalesEmailTask


def load_seed_emails(data_dir: str) -> list[SalesEmailTask]:
    """
    Load seed examples from JSON files prepared by the Sales team.
    Format: one file = one email example.
    """
    seeds = []
    for file in Path(data_dir).glob("*.json"):
        data = json.loads(file.read_text(encoding="utf-8"))
        seeds.append(SalesEmailTask(
            email_type=data["type"],
            customer_context=data["context"],
            expected_output=data["email"],
            quality_score=data.get("score", 4.0),
            source="real",  # assumption: every seed file holds a real email
        ))
    return seeds

Author's note: this is the step where a Data Engineer's role is clearest. They know which system the email data lives in, how to export it, which fields are clean, and which are noisy. Someone who doesn't know the organization's data can lose weeks here.

3. Filtering: Keep Only What Is Actually Usable

Not every email that was sent makes a good training example. You need explicit quality criteria.

Python

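# filter_data.py
from dataclasses import dataclass

from task_definition import SalesEmailTask


@dataclass
class QualityFilter:
    """Filtering criteria; tune per domain."""
    min_output_length: int = 100    # very short emails are usually incomplete
    max_output_length: int = 2000   # very long ones are usually threads, not single emails
    required_fields: tuple = ("Subject:", "เรียน")  # required format markers ("เรียน" = "Dear")
    min_quality_score: float = 3.5


def filter_examples(
    examples: list[SalesEmailTask],
    criteria: QualityFilter = QualityFilter()
) -> list[SalesEmailTask]:
    """
    Filter by the quality criteria.
    Returns the examples that pass and prints how many were rejected.
    """
    passed, rejected = [], []

    # The loop body is a reconstruction from the criteria above, not the author's code.
    for ex in examples:
        text = ex.expected_output
        ok = (
            criteria.min_output_length <= len(text) <= criteria.max_output_length
            and all(field in text for field in criteria.required_fields)
            and ex.quality_score >= criteria.min_quality_score
        )
        (passed if ok else rejected).append(ex)

    print(f"Passed: {len(passed)}, rejected: {len(rejected)}")
    return passed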

Rule of thumb: with 20 seed examples, at least 15 should pass the filter. If more are rejected, the criteria may be too strict, or the raw data may need more curation.
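When the pass rate falls below that rule of thumb, it helps to know which criterion is responsible. The sketch below tallies rejection reasons per example. The rejection_reasons helper, its reason labels, and the three sample emails are hypothetical; the thresholds mirror the QualityFilter defaults.

```python
from collections import Counter


def rejection_reasons(email: str, score: float,
                      min_len: int = 100, max_len: int = 2000,
                      required: tuple = ("Subject:", "เรียน"),
                      min_score: float = 3.5) -> list[str]:
    """Return every criterion an example fails (empty list means it passes)."""
    reasons = []
    if len(email) < min_len:
        reasons.append("too_short")
    if len(email) > max_len:
        reasons.append("too_long")
    for marker in required:
        if marker not in email:
            reasons.append(f"missing:{marker}")
    if score < min_score:
        reasons.append("low_score")
    return reasons


# Hypothetical seeds as (email_text, human_score) pairs.
seeds = [
    ("Subject: Demo follow-up\n\nเรียน คุณลูกค้า\n" + "x" * 120, 4.5),
    ("Thanks!", 4.0),
    ("Subject: Proposal\n\nHello,\n" + "x" * 120, 3.0),
]

tally = Counter(r for text, score in seeds for r in rejection_reasons(text, score))
pass_rate = sum(not rejection_reasons(t, s) for t, s in seeds) / len(seeds)
print(tally, f"pass rate = {pass_rate:.0%}")
```

If one reason dominates the tally, that points at either an overly strict criterion or a systematic gap in the raw data.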

4. Synthetic Generation: From 20 Examples → 1,000+ Examples

This is where Self-Instruct comes in: an LLM generates variations from the seed examples that already passed the filter.

Python

# synthetic_generation.py
import anthropic
import json
import random

from task_definition import SalesEmailTask

client = anthropic.Anthropic()


def generate_synthetic_examples(
    seeds: list[SalesEmailTask],
    target_count: int = 1000,
    batch_size: int = 5
) -> list[SalesEmailTask]:
    """
    Self-Instruct pipeline:
    1. Sample 3 seeds as few-shot examples
    2. Ask Claude to generate new variations
    3. Filter with the same quality criteria
    """
    synthetic = []
    email_types = ["follow_up", "intro", "proposal", "thank_you", "check_in"]

    while len(synthetic) < target_count:
        # sample seed examples as few-shot
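One way the body of this loop could look is sketched below. It is not the author's implementation: the prompt wording, the dict-based records, and the generate_fn parameter (any function that takes a prompt and returns text) are assumptions, chosen so the sketch runs offline with a stand-in generator.

```python
import random
from typing import Callable, Optional


def generate_synthetic(
    seeds: list[dict],
    generate_fn: Callable[[str], str],
    target_count: int,
    email_types: list[str],
    n_few_shot: int = 3,
    rng: Optional[random.Random] = None,
) -> list[dict]:
    """Sample few-shot seeds, build a prompt, and collect generated emails."""
    rng = rng or random.Random(0)
    synthetic = []
    while len(synthetic) < target_count:
        # Step 1: pick a few seeds at random as few-shot examples.
        few_shot = rng.sample(seeds, k=min(n_few_shot, len(seeds)))
        email_type = rng.choice(email_types)
        shots = "\n\n---\n\n".join(s["email"] for s in few_shot)
        # Step 2: ask the generator for one new variation.
        prompt = (
            f"Here are example sales emails from our team:\n\n{shots}\n\n"
            f"Write one new '{email_type}' email in the same tone and format."
        )
        synthetic.append({
            "type": email_type,
            "email": generate_fn(prompt),
            "source": "synthetic",
        })
    return synthetic


# Stand-in generator so the sketch runs without an API key.
def fake_llm(prompt: str) -> str:
    return f"Subject: (generated)\n\nเรียน คุณลูกค้า\n[{len(prompt)} prompt chars]"


seeds = [{"email": f"Subject: seed {i}\n\nเรียน ..."} for i in range(5)]
out = generate_synthetic(seeds, fake_llm, target_count=4,
                         email_types=["follow_up", "intro"])
print(len(out), out[0]["type"])
```

In a real run, generate_fn would wrap a call to the Claude API (for example client.messages.create) and return the text of the reply. Keeping the generator injectable also makes the sampling logic easy to test without spending API credits.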

After generating synthetic data, filter it with the same criteria as the seed data. Don't assume every LLM-generated example is good:

Python

# filter right after generation
raw_synthetic = generate_synthetic_examples(filtered_seeds, target_count=1200)
filtered_synthetic = filter_examples(raw_synthetic)  # same filter as the seeds

# combine seed + synthetic
full_dataset = filtered_seeds + filtered_synthetic
print(f"Final dataset: {len(full_dataset)} examples")
print(f"  Real: {sum(1 for x in full_dataset if x.source == 'real')}")
print(f"  Synthetic: {sum(1 for x in full_dataset if x.source == 'synthetic')}")

Numbers to expect: of 1,200 generated synthetic examples, roughly 800-1,000 should pass the filter. If fewer than 60% pass, adjust the generation prompt.

5. Deduplication: Remove Duplicates Before Training

Synthetic data often contains near-duplicates because the LLM produces variations that are very similar. If they are not removed, the model memorizes patterns instead of learning the behavior.

Python

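# deduplication.py
from datasketch import MinHash, MinHashLSH

from task_definition import SalesEmailTask


def dedup_dataset(
    examples: list[SalesEmailTask],
    threshold: float = 0.7
) -> list[SalesEmailTask]:
    """
    MinHash LSH deduplication.
    threshold 0.7 = drop examples whose estimated Jaccard similarity
    to an already-kept example is about 0.7 or higher.
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []

    for i, ex in enumerate(examples):
        m = MinHash(num_perm=128)
        # hash 3-grams of whitespace-separated tokens from the output
        tokens = ex.expected_output.split()
        for j in range(len(tokens) - 2):
            m.update(" ".join(tokens[j:j + 3]).encode("utf-8"))

        # Reconstruction, not the author's code: keep the example only if
        # LSH finds no near-duplicate among those already kept.
        if not lsh.query(m):
            lsh.insert(str(i), m)
            unique.append(ex)

    return unique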
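The threshold in MinHash LSH refers to the Jaccard similarity between shingle sets, which MinHash estimates. The sketch below computes the exact Jaccard value for two near-identical sentences, and also shows why whitespace 3-grams can be a weak signal for Thai text, which is written without spaces between words. The shingles and jaccard helpers and the sample sentences are illustrative assumptions; character-level shingles are one possible alternative, not something the original pipeline prescribes.

```python
def shingles(text: str, n: int = 3, by_char: bool = False) -> set[str]:
    """Return the set of n-gram shingles over words (default) or characters."""
    units = list(text) if by_char else text.split()
    sep = "" if by_char else " "
    return {sep.join(units[i:i + n]) for i in range(len(units) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


a = "thank you for your time today we will send the proposal tomorrow"
b = "thank you for your time today we will send the quotation tomorrow"

# One changed word touches 2 of the 10 word 3-grams on each side:
# 8 shared shingles out of 12 in the union.
sim_words = jaccard(shingles(a), shingles(b))
print(f"word 3-gram Jaccard = {sim_words:.2f}")

# A Thai phrase with no spaces is a single token, so it has no word 3-grams.
thai = "ขอบคุณที่สละเวลาในวันนี้"
print(len(shingles(thai)), len(shingles(thai, by_char=True)))
```

At about 0.67, this pair sits just under the 0.7 threshold and would most likely be kept, which gives a feel for how aggressive that setting is.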

6. Format: Preparing the Data for SFT

TRL's SFTTrainer accepts a dataset in chat format and applies the tokenizer's chat template via apply_chat_template, so the examples need to be converted into a 'messages' structure the tokenizer understands.

Python

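# prepare_dataset.py
from datasets import Dataset

from task_definition import SalesEmailTask


def to_hf_dataset(
    examples: list[SalesEmailTask],
    system_prompt: str,
    test_ratio: float = 0.1
) -> tuple[Dataset, Dataset]:
    """
    Convert to HuggingFace Datasets with a 'messages' column.
    Split into train/test, with real examples also held out in the test set.
    """
    records = []
    for ex in examples:
        records.append({
            "messages": ex.to_messages(system_prompt),
            "email_type": ex.email_type,
            "source": ex.source,
        })

    ds = Dataset.from_list(records)
    # Note: a random split alone does not guarantee that real examples land in the test set.
    split = ds.train_test_split(test_size=test_ratio, seed=42)

    # Reconstruction, not the author's code: report split sizes and return both parts.
    print(f"Train: {len(split['train'])}, Test: {len(split['test'])}")
    return split["train"], split["test"]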

Stage	Count	Notes
Seed (real emails)	20	From the Sales team
After filter	17	3 dropped for incomplete format
Synthetic generated	1,200	Claude Sonnet
After filtering synthetic	950	~20% dropped as low quality
Seed + synthetic combined	967
After dedup	840	Near-duplicates removed
Train set	756	90%
Test set	84	10% (includes real examples)
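The test set is meant to include real examples, but a purely random 90/10 split does not guarantee that. A minimal sketch of a split that forces a few real emails into the test set follows. The split_with_real_holdout function, its real_in_test parameter, and the assumed mix of 17 real records inside the 840 are hypothetical choices for illustration.

```python
import random


def split_with_real_holdout(records, test_ratio=0.1, real_in_test=5, seed=42):
    """Random train/test split that forces some real examples into the test set."""
    rng = random.Random(seed)
    real = [r for r in records if r["source"] == "real"]
    synthetic = [r for r in records if r["source"] != "real"]
    rng.shuffle(real)
    rng.shuffle(synthetic)

    n_test = round(len(records) * test_ratio)
    # Reserve up to real_in_test real examples for the test set.
    held_out = real[:min(real_in_test, len(real), n_test)]
    rest = real[len(held_out):] + synthetic
    rng.shuffle(rest)

    # Fill the remainder of the test set from everything else.
    n_extra = n_test - len(held_out)
    test = held_out + rest[:n_extra]
    train = rest[n_extra:]
    return train, test


# 840 examples after dedup, as in the table above; the real/synthetic
# mix inside those 840 is assumed here.
records = ([{"id": i, "source": "real"} for i in range(17)]
           + [{"id": i, "source": "synthetic"} for i in range(17, 840)])
train, test = split_with_real_holdout(records)
print(len(train), len(test), sum(r["source"] == "real" for r in test))
```

The sizes match the table (756 train, 84 test), and at least five of the test examples are real emails, so the evaluation is not scored only against synthetic references.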

7. Training: QLoRA + SFT with TRL

The part everyone has been waiting for. But if the data above is broken, nothing here can save it.

Python

# train.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
 
# --- 1. Model & Tokenizer ---
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # good Thai support, open-weight
 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
 
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=

Hyperparameter reference:

Parameter	Value used	Rationale
LoRA rank (r)	16	For a dataset under 5K examples, r=16 is enough; above 10K, try r=32
Learning rate	2e-4	A common starting point for LoRA/QLoRA
Epochs	3	SFT usually doesn't need many; if loss drops quickly, reduce to 2
Max seq length	2048	Emails are short, so 2048 is more than enough
Effective batch	16	4 per device × 4 accumulation
Precision	bf16	If the GPU doesn't support it, use fp16 instead

Hardware: an RTX 4090 24GB handles this comfortably. Training takes roughly 30-60 minutes for 756 examples × 3 epochs. Cloud cost on RunPod/Lambda is roughly 50-100 baht.
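These numbers translate into a fairly small number of optimizer steps. The arithmetic below assumes a single GPU and rounds the last partial batch up; the exact step count reported by the trainer can differ slightly depending on how it handles that final batch.

```python
import math

train_examples = 756
per_device_batch = 4
grad_accum = 4
epochs = 3

effective_batch = per_device_batch * grad_accum                 # 16
steps_per_epoch = math.ceil(train_examples / effective_batch)   # 47.25 rounded up
total_steps = steps_per_epoch * epochs

# Implied time per optimizer step for the quoted 30-60 minute range.
fast = 30 * 60 / total_steps
slow = 60 * 60 / total_steps
print(effective_batch, steps_per_epoch, total_steps, fast, slow)
```

With only about 144 updates in total, settings such as warmup length and logging interval should be chosen in steps rather than copied from configurations meant for much larger runs.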

8. Evaluation: Before vs. After

A result you can't measure is a result you can't sell. Compare outputs before and after fine-tuning on the same test set.

Python

# evaluate.py
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# bnb_config is the same BitsAndBytesConfig object defined in train.py.


def load_finetuned(base_model_id: str, adapter_path: str):
    """Load the base model plus the LoRA adapter."""
    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, adapter_path)
    # Assumes the tokenizer was saved alongside the adapter;
    # otherwise load it from base_model_id.
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    return model, tokenizer


def compare_outputs(
    test_examples: list[dict],
    base_model_id: str,
    adapter_path: str,
    system_prompt: str
):
    """
    Compare output from the base model vs. the fine-tuned model

What you should see after fine-tuning:

Dimension	Base Model	Fine-tuned
Tone	Generic professional	Matches the company's tone
Format	Inconsistent	Consistent Subject → เรียน (Dear) → Body → Closer
Closer/CTA	Broad, not specific	Follows the playbook (book a call, send a proposal)
Length	Often too long	Within the range the team actually uses

For quantitative evaluation, use LLM-as-Judge against the reference email:

Python

# llm_judge.py
import anthropic


def judge_email_quality(
    generated: str,
    reference: str,
    client: anthropic.Anthropic
) -> dict:
    """
    Ask Claude to score the generated email against the reference email.
    Returns: scores from 1-5 for each dimension.
    """
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"ให้คะแนน 1-5 สำหรับ sales email นี้เทียบกับ reference:\n\n"
                f"Reference:\n{reference}\n\n"
                f"Generated:\n{generated}\n\n"
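Once per-dimension scores come back from the judge, the base and fine-tuned runs can be compared dimension by dimension. The sketch below assumes the judge is asked to reply with a JSON object; the dimension names (tone, format, cta, length) mirror the comparison table above but are otherwise hypothetical, and the sample replies are made up for illustration.

```python
import json
from statistics import mean

# Assumed dimension names, mirroring the comparison table above.
DIMENSIONS = ("tone", "format", "cta", "length")


def parse_scores(raw: str) -> dict:
    """Parse a JSON object of 1-5 scores; fail loudly on missing or bad values."""
    scores = json.loads(raw)
    for dim in DIMENSIONS:
        value = scores[dim]  # KeyError if the judge skipped a dimension
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} score out of range: {value}")
    return {dim: float(scores[dim]) for dim in DIMENSIONS}


def average_by_dimension(all_scores: list[dict]) -> dict:
    """Mean score per dimension across the test set."""
    return {dim: mean(s[dim] for s in all_scores) for dim in DIMENSIONS}


# Made-up judge replies, for illustration only.
base_raw = ['{"tone": 3, "format": 2, "cta": 2, "length": 3}',
            '{"tone": 3, "format": 3, "cta": 3, "length": 2}']
tuned_raw = ['{"tone": 4, "format": 5, "cta": 4, "length": 4}',
             '{"tone": 5, "format": 5, "cta": 4, "length": 4}']

base_avg = average_by_dimension([parse_scores(r) for r in base_raw])
tuned_avg = average_by_dimension([parse_scores(r) for r in tuned_raw])
delta = {dim: tuned_avg[dim] - base_avg[dim] for dim in DIMENSIONS}
print(delta)
```

Reporting the per-dimension difference, rather than a single overall score, shows whether fine-tuning fixed the things the Sales team was correcting by hand: tone, format, and closer.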

Architecture Overview

Text

Seed Data (20 emails)
       |
  [Quality Filter]  ← criteria: length, format, score
       |
  17 clean seeds
       |
  [Synthetic Generation]  ← Claude API + Self-Instruct
       |
  1,200 synthetic
       |
  [Quality Filter]  ← same criteria as seeds
       |
  950 passed
       |
  [Deduplication]  ← MinHash LSH (threshold=0.7)
       |
  840 unique
       |
  [Format → HF Dataset]  ← messages format + train/test split
       |
  756 train / 84 test
       |
  [SFT + QLoRA]  ← TRL SFTTrainer + Qwen2.5-7B
       |
  LoRA Adapter (~50MB)
       |
  [Evaluation]  ← base vs fine-tuned on test set

Tech Stack

Python

# requirements.txt
 
# Training
transformers>=4.46.0       # Model loading, tokenizer
trl>=0.12.0                # SFTTrainer, SFTConfig
peft>=0.13.0               # LoRA, QLoRA
bitsandbytes>=0.44.0       # 4-bit quantization
accelerate>=1.0.0          # Multi-GPU, mixed precision
torch>=2.4.0               # PyTorch (CUDA 12.1+)
 
# Dataset
datasets>=3.0.0            # HuggingFace Datasets
datasketch>=1.6.0          # MinHash LSH deduplication
 
# Synthetic Generation
anthropic>=0.40.0          # Claude API for Self-Instruct
 
# Evaluation
scikit-learn>=1.5.0

Model options for Thai:

Model	Size	Thai Support	Notes
Qwen/Qwen2.5-7B-Instruct	7B	Good	Tokenizer handles Thai well; used in this article
scb10x/typhoon2-8b-instruct	8B	Very good	Thai-focused model from SCB 10X
meta-llama/Llama-3.1-8B-Instruct	8B	Fair	Popular, but Thai tokenization is weaker
google/gemma-2-9b-it	9B	Fair	Lightweight; good for English-heavy use cases

Summary: Start Small and Make It Measurable

The whole pipeline, from 20 seed examples to a deployable fine-tuned model, takes less than a week. The main costs are the Claude API for synthetic generation, roughly 200-500 baht, plus roughly 50-100 baht for training. That is under 1,000 baht in total for a model that writes emails in the company's tone on its own.

Priorities for getting started:

Task Formulation: get the problem definition right first; a mistake here wastes the whole pipeline
Seed Quality: 20 good examples matter more than 200 mediocre ones
Filter before training: real and synthetic data must pass the same criteria
Evaluate seriously: base vs. fine-tuned on the same test set, not vibes
Iterate: after looking at results, iterate on the dataset, not the hyperparameters

References:

Dataset Engineering: The Next Skill in the AI Stack (Part 1 of this series)
AI Engineering, Ch.7-8: Fine-tuning & Dataset Engineering, decodeai.in
HuggingFace TRL Documentation: SFTTrainer
QLoRA: Efficient Finetuning of Quantized LLMs (2023)
LIMA: Less Is More for Alignment (2023)
Qwen2.5 Technical Report
