Rethinking RAG: The Document Ingestion Step Many People Overlook

AI / Automation Consulting

After building RAG agents for a while, I kept hitting the same question: "Wait, why does this break in production when the prototype looked so good?" Numbers that used to be accurate start to drift, and context the system should know goes missing.

Looking back, the problem was in the ingestion step. Not the LLM, not the embedding model, but the moment we "stuff the data in".

This article collects the patterns that materially affect retrieval quality, along with code and an actual tech stack.

1. Naive Chunking: Simple, but It Really Does Break

Everyone starts with fixed-size chunking: slice the PDF by token count and throw the pieces into a vector store. It is simple and fast, but on complex documents it is close to unusable, because it cuts mid-sentence and mid-paragraph with no regard for meaning.

Recursive chunking, which tries to keep paragraphs intact, is better, but it still only treats the symptom: it has no idea how the information in the document is structured.

What you should do instead is analyze the document type first and then pick a suitable strategy, rather than applying a fixed chunk_size=500 to every document.
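To make the failure mode concrete, here is a minimal sketch rather than a production splitter. The helper names `fixed_size_chunks` and `paragraph_chunks` are hypothetical, the paragraph packer is only a simplified stand-in for recursive chunking, and sizes are counted in characters instead of tokens to avoid a tokenizer dependency.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring sentence and paragraph boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_size: int) -> list[str]:
    """Greedily pack whole paragraphs; hard-cut only when a single
    paragraph is longer than max_size."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(para) <= max_size:
            current = para
        else:
            # Oversized paragraph: fall back to a hard cut as a last resort.
            pieces = fixed_size_chunks(para, max_size)
            chunks.extend(pieces[:-1])
            current = pieces[-1]
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Refunds are processed within 14 days.\n\n"
    "Shipping is free for orders above 500 THB.\n\n"
    "Support is available on weekdays."
)
fixed = fixed_size_chunks(doc, 40)         # the first chunk ends mid-word
by_paragraph = paragraph_chunks(doc, 60)   # one intact paragraph per chunk
```

Both splitters keep every character of the input; the difference is only where the boundaries fall, which is exactly what decides whether a retrieved chunk is readable on its own.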

2. Layout-Aware Parsing: A Requirement, Not an Option

This is the most painful pitfall. When a PDF contains tables or a multi-column layout, conventional chunking scatters the table data into fragments that make no sense.

Picture a financial report table like this (years are in the Thai Buddhist Era):

Item	2566	2567
Total revenue	1.2B	1.5B

With naive chunking you might end up with "revenue 2566" and "total 1.2B" as separate fragments, and the LLM has no way to connect them.

Layout-aware parsing is therefore no longer optional. The system has to know what is a heading, what is a table, and what the reading order is before it embeds anything.

Recommended tools: Docling from IBM for PDFs with complex tables and columns, and unstructured.io for mixed file types.
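Once a parser has recovered the table structure, one way to keep it intact is to serialize each row together with its column headers. The sketch below assumes the table has already been parsed into a header list and row lists. The function name `table_rows_to_chunks` and the caption argument are hypothetical, and the parser's own API is deliberately left out.

```python
def table_rows_to_chunks(
    header: list[str],
    rows: list[list[str]],
    caption: str = "",
) -> list[str]:
    """Serialize each row with its column headers so every chunk is self-describing."""
    label_col, value_cols = header[0], header[1:]
    prefix = f"{caption} | " if caption else ""
    chunks = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row has {len(row)} cells, expected {len(header)}")
        cells = ", ".join(f"{col}: {val}" for col, val in zip(value_cols, row[1:]))
        chunks.append(f"{prefix}{label_col}: {row[0]} | {cells}")
    return chunks


# The financial table from the text, hard-coded for the example.
header = ["Item", "2566", "2567"]
rows = [["Total revenue", "1.2B", "1.5B"]]
chunks = table_rows_to_chunks(header, rows, caption="Financial report")
# chunks[0] == "Financial report | Item: Total revenue | 2566: 1.2B, 2567: 1.5B"
```

Each chunk now carries the row label, the year, and the value together, so a query about revenue in a given year can match a single self-contained string.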

3. Hierarchical Chunking: Resolving Search vs. Context

The classic RAG trade-off: large chunks make search less precise, but chunks that are too small leave the LLM without enough context to answer.

A good fit is parent-child chunking. Store both a small "child" (~128 tokens) and a large "parent" (~1024 tokens). Vector search runs over the children because they match precisely, but the LLM receives the parent instead because it carries the full context.

Rule of thumb:

Level	Size	Used for
Child	64-128 tokens	Embedding for vector search
Parent	512-1024 tokens	Context passed to the LLM
Overlap	10-15% of parent size	Prevents information loss at chunk boundaries

Python

# hierarchical_chunking.py
import hashlib
from dataclasses import dataclass, field

import numpy as np


@dataclass
class HierarchicalChunk:
    """
    Parent-child structure:
    - search over the small child text -> high precision
    - pass the large parent text to the LLM -> full context
    """
    parent_id: str
    child_id: str
    parent_text: str      # passed to the LLM (target: 512-1024 tokens)
    child_text: str       # embedded for search (target: 64-128 tokens)
    metadata: dict = field(default_factory=dict)


def chunk_text(text: str, size: int, overlap: int) -> list[tuple[int, str]]:
    """Sliding window; returns (start_offset, chunk_text) pairs."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append((start, text[start:end]))
        if end == len(text):
            break  # otherwise the tail would be emitted again as a redundant chunk
        start += size - overlap
    return chunks


def create_hierarchical_chunks(
    text: str,
    parent_size: int = 1024,
    child_size: int = 128,
    overlap: int = 32,
) -> list[HierarchicalChunk]:
    """
    Build parent-child chunks (1 parent -> many children).

    Sizes are counted in characters to keep this example dependency-free;
    tokenize first if the limits must be exact token counts.
    """
    # Parent overlap is 4x the child overlap: 128 of 1024 = 12.5% by default.
    parents = chunk_text(text, parent_size, overlap * 4)
    result = []

    for p_idx, (p_start, parent_text) in enumerate(parents):
        # Short, deterministic ID derived from position and leading content.
        digest = hashlib.md5(f"parent_{p_idx}_{parent_text[:50]}".encode())
        parent_id = digest.hexdigest()[:8]

        for c_idx, (_, child_text) in enumerate(chunk_text(parent_text, child_size, overlap)):
            result.append(HierarchicalChunk(
                parent_id=parent_id,
                child_id=f"{parent_id}_c{c_idx}",
                parent_text=parent_text,
                child_text=child_text,
                metadata={
                    "parent_index": p_idx,
                    "child_index": c_idx,
                    "parent_char_start": p_start,
                },
            ))

    return result


# Retrieval: search over children -> pass parents to the LLM
def retrieve_with_context(
    query: str,
    chunks: list[HierarchicalChunk],
    embedder,
    top_k: int = 3,
) -> list[str]:
    """
    Search over child chunks (precise) but return parent chunks (context).

    Child embeddings are recomputed on every call to keep the example short;
    in production, compute them once at ingestion and store them.
    """
    if not chunks:
        return []
    by_child_id = {c.child_id: c for c in chunks}
    query_emb = np.asarray(embedder(query), dtype=float)

    # Cosine similarity between the query and every child chunk.
    scores = {}
    for c in chunks:
        emb = np.asarray(embedder(c.child_text), dtype=float)
        denom = np.linalg.norm(query_emb) * np.linalg.norm(emb)
        scores[c.child_id] = float(query_emb @ emb / denom) if denom else 0.0
    top_children = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # Several top children may share a parent; return each parent only once.
    seen_parents, contexts = set(), []
    for child_id in top_children:
        chunk = by_child_id[child_id]
        if chunk.parent_id not in seen_parents:
            seen_parents.add(chunk.parent_id)
            contexts.append(chunk.parent_text)

    return contexts

4. Semantic Chunking: Split by Meaning, Not Size

For specialized documents such as research papers or legal texts, splitting by ordinary structure may not be enough. Semantic chunking detects a "topic shift" from the meaning of the text itself, so each chunk holds content that actually belongs together.

Use it when:

The topic changes without clear headings
Research papers with cross-references between sections
Legal texts where one clause depends on another

The cost is higher because the chunking step itself has to call a model (the LLM, or an embedding model in similarity-based variants), but it is worth it when the use case demands that level of accuracy.
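One common way to detect a topic shift is to compare the embeddings of neighbouring sentences and start a new chunk when their similarity drops. The sketch below assumes that approach; the article does not prescribe a specific method, and an LLM-based variant would replace the similarity check with a model call. `semantic_chunks`, the `threshold` value, and `toy_embed` are all hypothetical, and the toy embedder exists only so the example runs without an embedding API.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.3,
) -> list[list[str]]:
    """Start a new chunk whenever the similarity between neighbouring
    sentences falls below `threshold` (treated as a topic shift)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks


# Toy embedder for the demo only: counts words from a tiny fixed vocabulary.
VOCAB = ["contract", "termination", "notice", "payment", "invoice", "fee"]


def toy_embed(sentence: str) -> list[float]:
    words = sentence.lower().replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]


sentences = [
    "Either party may end the contract with written notice.",
    "Termination notice must be sent 30 days before the contract ends.",
    "Each invoice is due within 15 days.",
    "A late payment fee applies to every overdue invoice.",
]
chunks = semantic_chunks(sentences, toy_embed)
# The two contract sentences form one chunk, the two billing sentences another.
```

With a real embedding model, the threshold usually needs tuning per corpus, and every sentence costs one embedding call, which is where the extra ingestion cost comes from.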

5. Metadata-Enriched Ingestion: Not Something You Can Skip

If metadata is not embedded in each chunk from the start, there is a very high chance the agent will answer from outdated data without anyone noticing.

Metadata every chunk should carry:

ingestion_ts, doc_version: for filtering out stale data
source, page_number: for citations
doc_type, category: for pre-filtering before vector search
chunk_index, parent_id: for tracing lineage

An example of the metadata structure to attach to every chunk:

JSON

{
  "content": "...text...",
  "metadata": {
    "source": "policy_v3.pdf",
    "doc_version": "3.0",
    "effective_date": "2025-01-01",
    "ingestion_ts": "2025-03-01T10:00:00Z",
    "chunk_index": 12,
    "parent_id": "a3f9b2c1"
  }
}

A real-world example: when a user asks about company policy, filter down to the latest version before searching, so an outdated policy cannot be retrieved.
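The version pre-filter can be sketched in plain Python. The sketch assumes each chunk also carries a `doc_id` field that groups all versions of the same document; that field is hypothetical and not part of the metadata example above. It also assumes `doc_version` is a dotted numeric string. In a real vector store the same filter would typically be expressed as a metadata condition on the search request.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Turn "3.0" into (3, 0) so that "10.0" sorts after "9.0"."""
    return tuple(int(part) for part in version.split("."))


def latest_version_only(chunks: list[dict]) -> list[dict]:
    """Keep, for each doc_id, only the chunks from its highest doc_version."""
    latest: dict[str, tuple[int, ...]] = {}
    for chunk in chunks:
        meta = chunk["metadata"]
        key = version_key(meta["doc_version"])
        if key > latest.get(meta["doc_id"], ()):
            latest[meta["doc_id"]] = key
    return [
        c for c in chunks
        if version_key(c["metadata"]["doc_version"]) == latest[c["metadata"]["doc_id"]]
    ]


chunks = [
    {"content": "Remote work: 2 days per week.",
     "metadata": {"doc_id": "hr-policy", "source": "policy_v2.pdf", "doc_version": "2.0"}},
    {"content": "Remote work: 3 days per week.",
     "metadata": {"doc_id": "hr-policy", "source": "policy_v3.pdf", "doc_version": "3.0"}},
    {"content": "Expense claims need a receipt.",
     "metadata": {"doc_id": "finance-policy", "source": "expenses_v1.pdf", "doc_version": "1.0"}},
]
candidates = latest_version_only(chunks)
# Only policy_v3.pdf and expenses_v1.pdf remain as search candidates.
```

Running the filter before the vector search, rather than after it, means an old chunk can never crowd a current one out of the top results.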

6. Adaptive Ingestion: The Future That Is Arriving

The most interesting trend is using an agent to "analyze the document first" and then choose a suitable strategy, instead of treating everything the same way.

Document types and suitable strategies:

Document Type	Strategy
Financial Report	Layout-Aware + Table extraction
Legal Document	Semantic + Hierarchical
Research Paper	Semantic + Citation linking
Chat Log	Recursive + Temporal metadata
Unknown	Recursive (safe default)
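The table above can be read as a lookup with a safe fallback. The sketch below is one possible shape for that lookup, not the article's implementation. `STRATEGIES`, `select_strategy`, and `keyword_classifier` are hypothetical names, and the keyword classifier is only a stand-in so the example runs offline; a real classifier would call an LLM.

```python
from typing import Callable

# Mapping taken from the table above; method names are illustrative labels.
STRATEGIES: dict[str, list[str]] = {
    "financial_report": ["layout_aware", "table_extraction"],
    "legal_doc": ["semantic", "hierarchical"],
    "research_paper": ["semantic", "citation_linking"],
    "chat_log": ["recursive", "temporal_metadata"],
    "unknown": ["recursive"],
}


def select_strategy(
    sample: str,
    classify: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Classify a text sample and fall back to the safe default when the
    classifier fails or returns a label outside the known set."""
    try:
        label = classify(sample).strip().lower()
    except Exception:
        label = "unknown"
    if label not in STRATEGIES:
        label = "unknown"
    return label, STRATEGIES[label]


# Stand-in classifier for the demo only.
def keyword_classifier(sample: str) -> str:
    text = sample.lower()
    if "balance sheet" in text or "revenue" in text:
        return "financial_report"
    if "hereinafter" in text or "clause" in text:
        return "legal_doc"
    return "something_else"


label, methods = select_strategy("Total revenue grew 25% year over year.", keyword_classifier)
fallback_label, fallback_methods = select_strategy("hello there", keyword_classifier)
```

Validating the label against a fixed set matters when the classifier is an LLM, because free-form model output can drift from the expected labels, and an unrecognized label should degrade to recursive chunking rather than crash the pipeline.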

Python

# adaptive_ingestion.py
import anthropic
from pathlib import Path
from typing import Literal
from dataclasses import dataclass
 
DocumentType = Literal["financial_report", "legal_doc", "research_paper", "chat_log", "unknown"]
 
@dataclass
class IngestionStrategy:
    doc_type: DocumentType
    chunking_method: str
    chunk_size: int
    overlap: int
    use_layout_parsing: bool
    metadata_fields: list[str]
 
class AdaptiveIngestionAgent:
    """
    Uses an LLM to analyze each document first, then picks a suitable strategy
    instead of applying the same method to every document.
    """

Architecture Overview

Putting all the patterns together, the pipeline looks roughly like this:

Text

Input Files (PDF/DOCX/TXT)
         |
[Adaptive Ingestion Agent]  <-- Claude API (Document Classifier)
         | classify
  +----------------------------+
  | Layout Parser  |  Semantic |
  |  (Docling)     |  Chunker  |
  +----------------------------+
         | chunks + metadata
  [Hierarchical Index]
  Child chunks -> Qdrant (vector search)
  Parent chunks -> stored with parent_id
         | query time
  Search child -> retrieve parent -> LLM
         |
  Answer (with full context)
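The "Hierarchical Index" box can be sketched with two in-memory stores: one for child vectors and one for parent texts keyed by parent_id, so each parent is stored once instead of being copied into every child. This is a minimal stand-in under stated assumptions. The class `HierarchicalIndex` and the toy embedder are hypothetical, and a list plus a dict take the place of the vector store and the document store.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HierarchicalIndex:
    """`children` plays the role of the vector store,
    `parents` the role of a key-value document store."""

    def __init__(self, embed):
        self.embed = embed
        self.children = []   # (vector, parent_id) pairs
        self.parents = {}    # parent_id -> parent text, stored once

    def add(self, parent_id, parent_text, child_texts):
        self.parents[parent_id] = parent_text
        for child in child_texts:
            self.children.append((self.embed(child), parent_id))

    def search(self, query, top_k=3):
        q = self.embed(query)
        ranked = sorted(self.children, key=lambda item: cosine(q, item[0]), reverse=True)
        parent_ids = []
        for _, parent_id in ranked[:top_k]:
            if parent_id not in parent_ids:   # several children may share a parent
                parent_ids.append(parent_id)
        return [self.parents[pid] for pid in parent_ids]


# Toy embedder for the demo only: counts words from a tiny fixed vocabulary.
VOCAB = ["refund", "shipping", "days", "free"]


def toy_embed(text):
    words = text.lower().replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]


index = HierarchicalIndex(toy_embed)
index.add("p1", "Refund policy. Refund requests are handled within 14 days.",
          ["Refund policy.", "Refund requests are handled within 14 days."])
index.add("p2", "Shipping policy. Shipping is free above 500 THB.",
          ["Shipping policy.", "Shipping is free above 500 THB."])
contexts = index.search("how many days for a refund", top_k=2)
# Both top children belong to p1, so the refund parent is returned once.
```

Keeping parents in a separate store keyed by parent_id also keeps the vector store payload small, since each child only needs to carry the ID.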

Tech Stack for Production

Python

# requirements.txt
 
# Core LLM & Embeddings
anthropic>=0.40.0          # Claude API (LLM + Document Analysis)
openai>=1.50.0             # OpenAI Embeddings (text-embedding-3-large)
 
# Document Parsing (Layout-Aware)
docling>=2.0.0             # IBM Layout-Aware Parser (PDF tables, columns)
unstructured[pdf]>=0.15.0  # Fallback parser + file type detection
pymupdf>=1.24.0            # Fast PDF text extraction (fitz)
python-docx>=1.1.0         # Word document parsing
 
# Vector Store
qdrant-client>=1.10.0      # Qdrant (self-hosted / cloud)
 
# Orchestration / Pipeline
langchain>=0.3.0           # Pipeline orchestration
langchain-community>=0.3.0 # Community integrations for LangChain

Summary: Ingestion Is the Backbone of RAG

A RAG system that works in practice does not depend on whether you use the #1 embedding model on the leaderboard. It depends on how well you handle ingestion.

If the data going into the system is broken, then no matter how smart the LLM is, it is just an expert reading a torn-up book.

Recommended priorities for production:

Layout-Aware Parsing: do this first if you have PDFs or tables
Metadata Enrichment: every chunk needs it
Hierarchical Chunking: resolves search vs. context
Semantic Chunking: for domain-specific docs
Adaptive Ingestion: when you are ready to scale
