Week 6 · 🟢 5 assignments

Hybrid RAG in Practice

PDF embedding · BM25+Vector hybrid · Cohere Rerank · Q&A with source citations

Goals

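This week's goal: raise accuracy

The plain vector search from W5 retrieves semantically similar passages but tends to miss exact numbers and dates. W6 addresses this with BM25 + vector hybrid retrieval, Cohere Rerank, and answers that cite their sources.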

Why Hybrid

Why hybrid search is needed

Scenario | Vector only | Keyword only | Hybrid
"How did revenue change?" (semantic) | ✅ Handles it well | ❌ Misses synonyms | ✅
"What was operating profit in Q4 2024?" (exact value) | ❌ Mixes in similar quarters | ✅ Exact | ✅
"HBM3E" (proper noun) | ❌ Gets generalized | ✅ Exact | ✅
Workflow

Workflow (query)

🔴 Trigger → Webhook: natural-language query + ticker filter
🟣 Embed → OpenAI query embedding
🟢 Vector → Supabase top-20 cosine search (filter: ticker)
🟢 BM25 → Supabase ts_rank: keyword matching, top-20 (runs in parallel)
⚫ Merge → Code node: dedup the two result sets, union of ~30 items
🔵 Rerank → Cohere Rerank: rerank down to the top 5
🟣 Answer → Claude Sonnet: cited answer using the top 5 as context
🟢 Reply → Slack: answer + [source 1, source 2] links
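The Merge step can be sketched as follows. This is a minimal illustration, assuming each hit is a dict carrying the `id` column from the `disclosures` table; the function name `merge_results` and the `sources` field are hypothetical and not part of the original workflow.

```python
from typing import Any


def merge_results(vector_hits: list[dict[str, Any]],
                  keyword_hits: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Union two result lists, keeping one entry per chunk id.

    Each merged entry records which retriever(s) returned it in a
    hypothetical "sources" field, which is handy when debugging recall.
    """
    merged: dict[Any, dict[str, Any]] = {}
    for source, hits in (("vector", vector_hits), ("keyword", keyword_hits)):
        for hit in hits:
            # First occurrence wins; later duplicates only add their source tag.
            entry = merged.setdefault(hit["id"], {**hit, "sources": []})
            entry["sources"].append(source)
    return list(merged.values())


vector_hits = [{"id": 1, "content": "a"}, {"id": 2, "content": "b"}]
keyword_hits = [{"id": 2, "content": "b"}, {"id": 3, "content": "c"}]
merged = merge_results(vector_hits, keyword_hits)
```

With two top-20 lists that overlap by about ten chunks, the union comes out near the ~30 candidates that are then handed to the reranker.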
Schema

Supabase table (extends W5)

create table disclosures (
  id bigserial primary key,
  ticker text not null,
  doc_type text,           -- '사업보고서' (business report) | '분기' (quarterly) | '주요공시' (material disclosure)
  doc_year int,
  page int,
  section text,
  content text,
  embedding vector(1536),
  fts tsvector              -- for Korean keyword search (the BM25 leg)
);

-- Indexes
create index on disclosures using ivfflat (embedding vector_cosine_ops);
create index on disclosures using gin (fts);

-- Populate fts automatically with a trigger
create function disclosures_fts_trigger() returns trigger as $$
begin
  new.fts := to_tsvector('simple', new.content);
  return new;
end $$ language plpgsql;

create trigger disclosures_fts_update before insert or update
on disclosures for each row execute function disclosures_fts_trigger();
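The two retrieval legs against this table could look roughly like the queries below. This is a sketch, not the course's exact SQL: the helper names are hypothetical, and the `%(name)s` placeholders assume a psycopg-style driver. `<=>` is pgvector's cosine-distance operator, and `plainto_tsquery` ANDs the query terms together. Because the `'simple'` configuration does no stemming, Korean words with particles attached will only match when they appear in the same form in the text.

```python
def vector_search_sql(top_k: int = 20) -> str:
    """Cosine-similarity search over the embedding column (pgvector)."""
    return (
        "select id, page, section, content, "
        "1 - (embedding <=> %(query_embedding)s::vector) as score "
        "from disclosures "
        "where ticker = %(ticker)s "
        "order by embedding <=> %(query_embedding)s::vector "
        f"limit {int(top_k)}"
    )


def keyword_search_sql(top_k: int = 20) -> str:
    """Keyword search over the fts column, ranked with ts_rank."""
    return (
        "select id, page, section, content, "
        "ts_rank(fts, plainto_tsquery('simple', %(query)s)) as score "
        "from disclosures "
        "where ticker = %(ticker)s "
        "and fts @@ plainto_tsquery('simple', %(query)s) "
        "order by score desc "
        f"limit {int(top_k)}"
    )


# Parameters that would be passed alongside the SQL (values are illustrative).
params = {"ticker": "005930", "query": "R&D", "query_embedding": "[0.12, -0.05]"}
```

Both queries filter on `ticker` first, so the top-20 cutoff applies within a single company's filings.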
Cohere

Calling Cohere Rerank

POST https://api.cohere.com/v1/rerank
Authorization: Bearer {COHERE_KEY}
{
  "model": "rerank-multilingual-v3.0",
  "query": "{{user question}}",
  "documents": [
    "{{chunk text 1}}", "{{chunk text 2}}", ...
  ],
  "top_n": 5,
  "return_documents": true
}
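Building the request body and mapping the response back onto the merged chunks might be sketched like this. It assumes the response carries a `results` list whose items have `index` (position in the submitted `documents`) and `relevance_score`; the helper names are hypothetical.

```python
from typing import Any


def build_rerank_body(query: str, chunks: list[dict[str, Any]],
                      top_n: int = 5) -> dict[str, Any]:
    """Request body for the rerank call; documents are the chunk texts."""
    return {
        "model": "rerank-multilingual-v3.0",
        "query": query,
        "documents": [c["content"] for c in chunks],
        "top_n": top_n,
    }


def apply_rerank(chunks: list[dict[str, Any]],
                 response: dict[str, Any]) -> list[dict[str, Any]]:
    """Reorder chunks by the reranker's results, attaching the score."""
    ranked = []
    for result in response["results"]:
        chunk = dict(chunks[result["index"]])  # copy so the input is untouched
        chunk["relevance_score"] = result["relevance_score"]
        ranked.append(chunk)
    return ranked


chunks = [
    {"id": 10, "page": 3, "content": "overview"},
    {"id": 11, "page": 187, "content": "R&D spending"},
]
body = build_rerank_body("R&D cost?", chunks, top_n=1)
# Illustrative response shape, not real API output.
fake_response = {"results": [{"index": 1, "relevance_score": 0.97}]}
top = apply_rerank(chunks, fake_response)
```

Mapping by `index` keeps the `page` and `ticker` metadata attached to each chunk, which the citation step needs later.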
Answer Prompt

Prompt that enforces citations

You are an analyst of IR materials. Answer using only the search results below, and always cite your sources.

Question: {{user question}}

Search results (top 5, in order of confidence):
[1] {ticker} {doc_type} p.{page}: {content}
[2] ...

Response rules:
- Use only facts stated in the search results. No outside knowledge.
- Cite a source in [number] format for every claim.
- If the search results cannot answer the question, state "근거 부족" (insufficient evidence).

Answer:
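Filling the search-results block of the prompt, and recovering which sources the model actually cited, can be sketched as below. The chunk fields follow the `disclosures` columns; the helper names `format_context` and `cited_chunks` are hypothetical.

```python
import re
from typing import Any


def format_context(chunks: list[dict[str, Any]]) -> str:
    """Render chunks as numbered lines: [n] ticker doc_type p.page: content."""
    return "\n".join(
        f"[{i}] {c['ticker']} {c['doc_type']} p.{c['page']}: {c['content']}"
        for i, c in enumerate(chunks, start=1)
    )


def cited_chunks(answer: str, chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the chunks referenced as [n] in the answer, in first-cited order.

    Citation numbers outside the provided range are ignored.
    """
    numbers = dict.fromkeys(int(m) for m in re.findall(r"\[(\d+)\]", answer))
    return [chunks[n - 1] for n in numbers if 1 <= n <= len(chunks)]


chunks = [
    {"ticker": "005930", "doc_type": "annual", "page": 12, "content": "Revenue..."},
    {"ticker": "005930", "doc_type": "annual", "page": 187, "content": "R&D..."},
]
context = format_context(chunks)
sources = cited_chunks("R&D spending rose [2]. See also [2] and [7].", chunks)
```

The list returned by `cited_chunks` is what the Reply step would turn into source links; an answer that cites nothing yields an empty list, which is a cheap signal to check against the "근거 부족" rule.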
Missions

🟢 5 assignments

1
Embed the business report for one watchlist stock
Download the business report PDF from DART → run it through the workflow → check the chunk count in Supabase
2
Verify vector search on its own
Query "What were R&D costs in 2024?" → check the page of each top-5 result
3
Add BM25 search
Run the same query through ts_rank and compare the results → which side is more accurate?
4
Measure the effect of Rerank
Grade the answers with and without Rerank yourself; how many out of 5 improved?
5
Query for information that does not exist
Verify that a query about content missing from the PDF returns "근거 부족" (insufficient evidence), confirming that hallucination is blocked
💎 Bonus Session

Qdrant Native Hybrid + Finance Application Lab

The main part of this week implemented hybrid search with Supabase pgvector + ts_rank, but Qdrant 1.10+ offers native hybrid search that completes the same flow in a single API call. Let's apply it directly to searching business reports, disclosures, and analyst reports.

Reference videos: Hybrid Search in Legal AI with Qdrant & n8n (official Qdrant) · Qdrant Hybrid Search Tutorial · How to Set Up and Use Qdrant for Hybrid Search
This session is an applied lab that swaps the Legal domain of the videos above for the finance domain (business reports, disclosures, IR).
Why Qdrant Hybrid?

1. Supabase approach vs Qdrant Native

Item | Supabase pgvector + ts_rank | Qdrant Native Hybrid
Number of calls | 2 (vector + BM25 separately) | 1 (Query API prefetch)
RRF fusion | Manual, in an n8n Code node | Automatic, server-side
Sparse model | Limited by Postgres's Korean morphology support | Choice of BM25 / SPLADE
Korean handling | tsvector limitations (particle handling) | Multilingual possible with fastembed
Scale | 500MB free, single PG | 1GB free, can be distributed
Learning curve | Easy if you know SQL | Requires the named-vector concept
Architecture

2. Named Vectors: one collection, two vectors

collection: "invest_disclosures"
  named vectors:
    - name: "dense"
      size: 1536
      distance: Cosine          # OpenAI text-embedding-3-small
    - name: "sparse"
      modifier: idf              # built-in IDF for BM25
      # sparse vectors use the indices/values format
  payload:
    - ticker, doc_type, doc_year, page, section, content

At embedding time, each chunk is stored with both a dense and a sparse vector. Search then queries both in a single call.

Workflow

3. n8n workflow

(A) Embedding (once per PDF upload)

🔴 Trigger → Drive Trigger: business report PDF uploaded
🔵 Extract → PDF → text → 1000-character chunks (overlap 200)
🟣 Dense → OpenAI text-embedding-3-small (1536d)
🟣 Sparse → FastEmbed BM25 (server function or HTTP)
🟢 Upsert → Qdrant /points (vectors: {dense, sparse} together)
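The Extract step's chunking (1000 characters with an overlap of 200) can be sketched as a simple sliding window. This is a minimal character-based version under that assumption; the function name `chunk_text` is hypothetical, and a real pipeline might prefer to split on sentence or section boundaries.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample)
```

For a 2000-character input this yields windows starting at offsets 0, 800, and 1600, so a figure that straddles a chunk boundary still appears whole in at least one chunk.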

(B) Search (query)

🔴 Trigger → Webhook: natural-language query
🟣 Embed × 2 → Dense (OpenAI) + Sparse (BM25) generated together
🟢 Hybrid Q → Qdrant /points/query: prefetch [dense, sparse] + RRF fusion in one call
🟣 Answer → Claude: cited answer from the top-K
🟢 Reply → Slack / Webhook response
API

4. Qdrant Hybrid Query API (a single call)

Create the collection

PUT https://YOUR-CLUSTER.qdrant.tech/collections/invest_disclosures
{
  "vectors": {
    "dense": { "size": 1536, "distance": "Cosine" }
  },
  "sparse_vectors": {
    "sparse": { "modifier": "idf" }
  }
}

Upsert embeddings

PUT /collections/invest_disclosures/points
{
  "points": [{
    "id": 1,
    "vector": {
      "dense":  [0.12, -0.05, ... (1536 values)],
      "sparse": { "indices": [42, 1024, ...], "values": [0.7, 0.3, ...] }
    },
    "payload": {
      "ticker": "005930",
      "doc_type": "사업보고서", "doc_year": 2024, "page": 187,
      "section": "연구개발 활동", "content": "..."
    }
  }]
}

Hybrid Search (RRF in one call)

POST /collections/invest_disclosures/points/query
{
  "prefetch": [
    { "query": [0.12, -0.05, ...], "using": "dense",  "limit": 20 },
    { "query": {"indices":[...], "values":[...]}, "using": "sparse", "limit": 20 }
  ],
  "query": { "fusion": "rrf" },
  "limit": 5,
  "with_payload": true,
  "filter": { "must": [{"key": "ticker", "match": {"value": "005930"}}] }
}

The response is the top 5 after RRF fusion of the two searches. Setting "fusion": "dbsf" selects distribution-based score fusion (DBSF) instead, which Qdrant added in v1.11.
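To make clear what the server-side fusion computes, here is a small sketch of Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is the conventional value and is an assumption here; Qdrant's internal constant and rank offset may differ, so treat this as an illustration of the idea rather than a reimplementation.

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60,
             limit: int = 5) -> list[tuple[int, float]]:
    """Fuse ranked id lists with Reciprocal Rank Fusion.

    rankings: one list of document ids per retriever, best first.
    Returns (doc_id, fused_score) pairs, best first, cut to `limit`.
    """
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:limit]


dense_ids = [1, 2, 3]    # ids ranked by the dense prefetch (illustrative)
sparse_ids = [3, 1, 4]   # ids ranked by the sparse prefetch (illustrative)
fused = rrf_fuse([dense_ids, sparse_ids])
```

RRF uses only rank positions, never raw scores, which is why it can combine cosine similarities and BM25 scores without normalizing them first.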

Finance applications

5. Which approach is strong for which question

Finance query | Dense only | BM25 only | Hybrid
"What was Samsung Electronics' R&D investment in 2024?" (exact number) | △ Mixes in similar quarters | ✅ Exact page | ✅
"AI memory cycle outlook" (semantic) | ✅ Catches synonyms | ❌ Limited to keywords | ✅
"HBM3E sole supply" (proper noun) | △ Gets generalized | ✅ Exact | ✅
"Governance ESG risk" (concept) | ✅ Semantic search | △ Wording varies | ✅
"Disclosure number 20240315000123" (identifier) | ❌ | ✅ exact | ✅
Key insight: IR materials and disclosures are dense with numbers, proper nouns, and identifiers, where BM25 is strong, while strategy shifts, risks, and market outlook favor semantic search. Hybrid captures both.
Bonus Missions

💎 5 bonus assignments

1
Create a Qdrant collection
Create a collection with named vectors (dense 1536 + sparse idf) on cloud.qdrant.io. Check the cluster info
2
Call FastEmbed BM25
Generate sparse vectors for Korean tokens with a Qdrant FastEmbed server or a Python function. Extract indices/values in an n8n Code node
3
Embed PDFs for 3 watchlist stocks
Upsert the business report PDFs of Samsung Electronics, SK hynix, and NAVER with dense + sparse vectors together. Check the total chunk count
4
Comparison experiment with 5 queries
Run each of the 5 queries in the table above in three modes (Dense only / BM25 only / Hybrid) and compare the page of each top-5 result. Measure response time as well
5
RRF vs DBSF comparison
Call the same query with fusion=rrf and fusion=dbsf and analyze how the results differ. Build a table of accuracy scores that you graded yourself
Tools

Helpful resources