Hello! This is Noguchi from SCSK.

In the previous article, I laid out the overall picture of RAG (Indexing / Retrieval / Augmentation / Generation) and the point that quality tends to be decided by the design of the upstream stages rather than by the raw performance of the LLM.

Previous article: (Series 1: RAG Basics / Part 1) What Is RAG? The Overall Picture, Why It Is Needed, the Basic Flow, and Design Assumptions (blog.usize-tech.com, 2026.01.27)

This time, as Part 2 of Series 1 (the basic elements of RAG), I cover chunking.

Let me start with a question. Have you ever had search results come back, yet the answer misses the point or comes out fragmentary?

This situation is common in practice. It looks like a Retrieval problem, but in many cases the real cause is how the evidence was cut out and stored at Indexing time.

RAG passes the retrieved chunks (fragments) to the LLM as context. If the unit of a chunk is poorly chosen, the information needed for the answer does not come together even when the search itself hits. Retrieval succeeds, yet the quality of what was retrieved is low: this is a classic RAG pitfall.

In this article, I first show where chunking sits in the overall RAG picture and what role it plays, and then cover size, overlap, how to choose a strategy, and a simple verification demo.

What this article covers

- Where chunking sits: how chunking affects retrieval quality within the RAG Indexing process
- Design parameters and strategies: initial values for chunk size / overlap, and when to use each of the representative chunking strategies
- How to run the verification: a demo that makes the difference between strategies visible as "retrieved chunks", using LangChain + Vertex AI Embeddings (Google)

Note: evaluation (quantitative evaluation with Ragas and similar tools) is important, so I touch on it, but the details are left for the next article (the evaluation part).

The RAG Indexing process

Chunking is the core of the RAG Indexing (index creation) process. The design decisions made here directly determine the quality of the later Retrieval stage.

Basic Indexing flow

1. Ingest the documents (Parsing / cleanup)
2. Split the documents into chunks (Chunking)
3. Convert the chunks into embeddings (Embeddings)
4. Store them in a vector DB (or other search backend) (Indexing)
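To make the four steps concrete, here is a minimal sketch of the flow in plain Python. Everything in it is an assumption for illustration: `parse`, `chunk`, `toy_embed`, and `build_index` are hypothetical names, the "embedding" is a character-code histogram rather than a real model, and a Python list stands in for the vector DB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexedChunk:
    text: str
    vector: List[float]


def parse(raw: str) -> str:
    """Step 1: ingest and clean up (here: only collapse whitespace)."""
    return " ".join(raw.split())


def chunk(text: str, size: int) -> List[str]:
    """Step 2: split into pieces of at most `size` characters (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Step 3: stand-in for an embedding model (character-code histogram)."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def build_index(raw: str, size: int = 40) -> List[IndexedChunk]:
    """Step 4: store (text, vector) pairs; a list stands in for the vector DB."""
    return [IndexedChunk(c, toy_embed(c)) for c in chunk(parse(raw), size)]


raw_doc = "RAG retrieves chunks.   Chunk design decides\nwhat the LLM gets to see."
index = build_index(raw_doc)
for item in index:
    print(len(item.text), repr(item.text))
```

The point of the sketch is only the ordering: whatever the chunking step produces is exactly what gets embedded and stored, so later stages can never retrieve a unit that chunking did not create.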
For the basic flow, please see the slides I presented, "The Overall Picture of RAG and Where Chunking Fits". (Note: the figure there omits the Parsing step.)

January 2026, Toyosu-kai (presentation slides)

The flow of RAG is also described in the following AWS blog post:
Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock | Amazon Web Services (aws.amazon.com)

What is chunking?

Chunking is the process of splitting a long document into units that are easy to handle for search and generation, and then converting each chunk into an embedding and storing it.

The key point is that chunking is not just "cutting up text". It is an important task that controls retrieval accuracy, context retention, cost, and latency. Put bluntly, no matter how capable the LLM is, if the evidence it picks up is off, it will simply answer cleverly while staying off.

As I also mentioned in the slides above, an inappropriate chunk is a case of "Garbage In, Garbage Out".

First things first: size and overlap (the most important parameters)

The basics of chunking design are chunk size and chunk overlap. If these are set carelessly, Retrieval quality will not stabilize no matter how much effort goes into the choice of strategy (splitter) later on.

Terminology: chunk / chunk size / chunk overlap

A chunk here is a block of text split out for use in search and generation. The maximum length of that block is the chunk size, and the length shared between adjacent chunks is the chunk overlap.

- chunk size: the amount of text in one chunk (upper limit). The unit is tokens (recommended) or characters.
- chunk overlap: the amount shared between adjacent chunks. Its role is to reduce the loss of information at chunk boundaries.
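Illustration: what happens with size=500 and overlap=100?

For example, with chunk size = 500 and overlap = 100, the first chunk covers 0-500 and the second covers 400-900, so adjacent chunks share 100 units. (Note: the n-th chunk starts at (n-1) × (size - overlap), which is a sliding window.)

Figure: Example of the relationship between size and overlap

Impact on accuracy, context, and cost

Chunk size and overlap affect retrieval accuracy (noise), context retention (resistance to fragmentation), and cost/latency. The points below are organized to help you build an intuition for how an answer is breaking down.

1) What chunk size affects (noise vs. context)

- Too large: unrelated information is easily mixed into one chunk, which adds noise to search (the vector gets "averaged out" and matches the query less sharply). On the generation side, input tokens increase, and so do cost and latency.
- Too small: conditions, exceptions, and references (subjects and premises) tend to be separated at chunk boundaries, and answers tend to become fragmentary. Because the number of chunks grows, the search load (Top-k / rerank) also tends to grow.

2) What chunk overlap affects (boundary loss vs. redundancy)

With fixed-length splitting, text is easily cut in the middle of a sentence or at a conditional clause such as "however, ...". The answer then breaks down in ways like "the chunk was retrieved, but the exception condition dropped out" or "the subject disappeared". Overlap mitigates this boundary loss.

- Increasing overlap: more resistant to fragmentation (the necessary evidence is more likely to stay in the same chunk).
- Increasing overlap too much: the same content lands in multiple chunks, search results become redundant, and cost increases (index size, duplicated retrieved chunks).

3) Tuning chunk size / overlap

The solid approach is to start with fixed-length + overlap as the baseline and tune while observing how the answer breaks down (fragmentation, noise contamination, and so on).

- Answers are fragmentary → increase overlap, or make size a little larger
- Unrelated sentences are mixed in (noise) → decrease overlap, decrease size, and if necessary use structure-aware splitting and metadata

As a rule of thumb, starting with an overlap of roughly 10-20% of the chunk size should contain boundary problems without increasing cost very much.

Why it matters to think in tokens

If you cut chunks by character count, the token count can balloon beyond what you expected because of differences in the model's tokenizer. For that reason, it is important to manage size on a token basis rather than choosing chunk size by character count (the difference is especially pronounced in Japanese).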
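The size/overlap arithmetic above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation any particular library uses; `chunk_starts` and `sliding_window` are hypothetical helper names, and the "units" can be tokens or characters depending on how you count.

```python
from typing import List


def chunk_starts(total_len: int, size: int, overlap: int) -> List[int]:
    """Start offsets of a sliding window: the n-th start is (n-1) * (size - overlap)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if total_len <= 0:
        return []
    step = size - overlap
    starts = []
    pos = 0
    while True:
        starts.append(pos)
        if pos + size >= total_len:  # this window already reaches the end
            break
        pos += step
    return starts


def sliding_window(units: List[str], size: int, overlap: int) -> List[List[str]]:
    """Cut a list of units (tokens or characters) into overlapping chunks."""
    return [units[s:s + size] for s in chunk_starts(len(units), size, overlap)]


starts = chunk_starts(total_len=1300, size=500, overlap=100)
print(starts)  # [0, 400, 800]
```

With size=500 and overlap=100 the step is 400, which is why the second chunk begins at 400 and shares its first 100 units with the end of the first chunk. Raising overlap shrinks the step, so the same document produces more chunks and a larger index.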
The official resources below are useful references.

- Azure AI Search: how to think about chunking and a recommended starting point (for example, 512 tokens + 25% overlap). "Chunk documents - Azure AI Search" (learn.microsoft.com)
- Google Cloud: chunk_size / chunk_overlap at ingestion time, and layout-parser integration (RAG Engine). "Use Document AI layout parser with Vertex AI RAG Engine" (cloud.google.com)
- Weaviate: an overview of chunking baselines and advanced methods, including overlap guidelines. "Chunking Strategies to Improve Your RAG Performance" (weaviate.io)
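The big picture of chunking strategies (6 representative patterns + 2 advanced)

From here, I go through the chunking strategies.

When choosing a chunking strategy, rather than jumping straight to an advanced one, I think the following order keeps verification cost low:

1. Build a baseline with fixed-length or recursive splitting
2. Infer the cause from how the answer breaks down (fragmentation / noise / broken tables)
3. Change the chunking strategy only where needed (structure-aware / semantic / hierarchical / context enrichment)

Below I briefly describe each strategy together with implementation code in LangChain.

(1) Fixed-length (tokens) + overlap

- Position: the baseline you should build first. It is easy to tune (size/overlap) and to observe in logs, so it becomes the starting point of the improvement cycle.
- Strengths: simple to implement. Speed and cost are easy to estimate, and differences are easy to see in comparison experiments (A/B).
- Weaknesses: it tends to cut in the middle of sentences and to ignore tables, code, and chapter/section structure (so quality tends to drop on structured documents).

Minimal LangChain implementation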
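    from langchain_text_splitters import CharacterTextSplitter

    # Fixed-length splitting; chunk_size and chunk_overlap are counted in tiktoken tokens.
    splitter = CharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=512,
        chunk_overlap=128,
        separator="",
        keep_separator=False,
    )
    chunks = splitter.split_text(text)  # text: str

(2) Recursive splitting (priority order: paragraph → line break → whitespace ...)

- Mechanism: fit within the size limit while giving priority to natural separators (paragraphs, line breaks).
- Strengths: chunks tend to be more readable than with fixed-length splitting, and search tends to be more stable.
- Weaknesses: it can break down on data that carries structure, such as tables and code (preprocessing matters).
- Suitable documents: meeting minutes, blogs, general documents, and other material that is mostly natural language.

Minimal LangChain implementation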
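    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=100)
    chunks = splitter.split_text(text)

    # If needed, state the preferred separators explicitly, as below.
    # "。" is the Japanese full stop, so sentence ends are tried before spaces.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1200,
        chunk_overlap=100,
        separators=["\n\n", "\n", "。", " ", ""],
    )
    chunks = splitter.split_text(text)

(3) Structure-aware (headings, tables, lists, layout)

- Mechanism: parse the heading hierarchy, bullet lists, tables, HTML tags, PDF layout, and so on, and split by logical unit.
- Strengths: it helps suppress the "broken tables" and "severed chapters/sections" that tend to occur with specifications and PDFs. Metadata (such as chapter titles) is also easy to attach.
- Weaknesses: the quality of preprocessing (parsing) is the bottleneck. Adoption cost tends to rise.
- Suitable documents: Markdown / HTML / PDF / Office documents (especially material with many tables).

Minimal LangChain implementation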
The implementation of structure-aware splitting depends on the input format. Here I show examples that split HTML and Markdown while turning the headings into metadata.

HTML (split by tag)

    from langchain_text_splitters import HTMLHeaderTextSplitter

    headers_to_split_on = [
        ("h1", "Header 1"),
        ("h2", "Header 2"),
        ("h3", "Header 3"),
    ]
    splitter = HTMLHeaderTextSplitter(headers_to_split_on)
    docs = splitter.split_text(html_text)  # html_text: str

Markdown (split by heading)

    from langchain_text_splitters import MarkdownHeaderTextSplitter

    # Each pair is (heading marker, metadata key stored on the resulting chunk).
    headers_to_split_on = [
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(markdown_text)  # markdown_text: str

(4) Semantic splitting (cut where the meaning changes)

- Mechanism: split at breakpoints where the embedding similarity between adjacent sentences drops.
- Strengths: it captures topic boundaries well and tends to preserve conceptual continuity in long texts and papers.
- Weaknesses: preprocessing cost increases. The threshold (where to cut) needs tuning.
- Suitable documents: long articles, papers, manuals (material where the topic changes frequently).

Minimal LangChain implementation
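Here is an example that implements semantic splitting by placing breakpoints based on embedding similarity.

    import numpy as np
    from langchain_google_vertexai import VertexAIEmbeddings

    emb = VertexAIEmbeddings(model_name="gemini-embedding-001")

    # Rough sentence split on the Japanese full stop (split more carefully in practice).
    sents = text.split("。")
    vecs = np.array(emb.embed_documents(sents))

    # Cosine similarity between each sentence and the next one.
    sim = (vecs[:-1] * vecs[1:]).sum(axis=1) / (
        np.linalg.norm(vecs[:-1], axis=1) * np.linalg.norm(vecs[1:], axis=1)
    )
    breaks = np.where(sim < 0.75)[0]  # the threshold needs tuning
    # Assemble chunks using `breaks` as the boundaries (omitted here for brevity).

The example above uses VertexAIEmbeddings. However, according to the official LangChain documentation, VertexAIEmbeddings is deprecated (to be removed in a future release).

VertexAIEmbeddings - Docs by LangChain (docs.langchain.com)

As the official documentation states, use GoogleGenerativeAIEmbeddings instead.
https://docs.langchain.com/oss/python/integrations/text_embedding/google_generative_ai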
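The assembly step can be sketched without any embedding model. The following is a minimal illustration under stated assumptions: `assemble_chunks` is a hypothetical helper, the similarity values are made up, and a chunk is closed whenever the similarity to the next sentence falls below the threshold.

```python
from typing import List, Sequence


def assemble_chunks(
    sentences: Sequence[str],
    similarities: Sequence[float],
    threshold: float,
    joiner: str = "",
) -> List[str]:
    """Group sentences into chunks, cutting after sentence i when
    similarities[i] (similarity between sentence i and i+1) is below threshold."""
    if len(similarities) != max(len(sentences) - 1, 0):
        raise ValueError("need exactly one similarity per adjacent sentence pair")
    chunks: List[str] = []
    current: List[str] = []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        is_last = i == len(sentences) - 1
        if is_last or similarities[i] < threshold:
            chunks.append(joiner.join(current))
            current = []
    return chunks


sentences = ["Overlap basics. ", "Overlap tuning. ", "Billing rules. ", "Billing exceptions. "]
similarities = [0.9, 0.4, 0.8]  # made-up values; 0.4 marks the topic change
print(assemble_chunks(sentences, similarities, threshold=0.75))
```

Because the cut is driven entirely by the threshold, a value that is too high fragments the text into single sentences and a value that is too low returns the whole document as one chunk, which is the tuning burden listed under weaknesses.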
(5) Hierarchical

- Mechanism: search with small chunks, and pass the parent chunk (a larger context) at generation time.
- Strengths: background such as conditions, exceptions, and premises is easier to carry into the answer, so it is resistant to fragmentation.
- Weaknesses: making the parent too large increases cost. The parent-child design (size ratio, how to choose the parent size) is the key point.
- Suitable documents: terms and rules, design documents, specifications, research material (material with strong cross-references).

Minimal LangChain implementation
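The following shows the whole flow of "search with the child, pass the parent" in a minimal configuration.

    # Note: depending on the LangChain version, these two imports may live in a
    # different package (for example langchain_classic); check your installed version.
    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Child: small unit used for search
    child = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
    # Parent: large unit passed to generation
    parent = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

    store = InMemoryStore()
    retriever = ParentDocumentRetriever(
        vectorstore=vs,  # vs: an existing VectorStore instance
        docstore=store,
        child_splitter=child,
        parent_splitter=parent,
    )
    retriever.add_documents(docs)  # docs: List[Document]

(6) Metadata-driven (filter / split / reorder)

- Mechanism: attach metadata such as chapter/section, date, system name, and component name, and use it for filtering and prioritization at search time.
- Strengths: it helps suppress false hits and noise even in domains with heavy jargon. It also helps with accountability in operations.
- Weaknesses: a careless metadata design backfires (filters that do not work, inconsistent metadata, and so on).
- Suitable documents: internal documents in general (application platform documentation is an especially good fit).

Minimal LangChain implementation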
The point is to do the splitting itself with recursive or structure-aware splitting, attach metadata, and filter at search time (this part depends on the capabilities of the VectorStore).

    from langchain_core.documents import Document

    docs = [
        Document(page_content="...", metadata={"system": "AP platform", "version": "v1"}),
        Document(page_content="...", metadata={"system": "AP platform", "version": "v2"}),
    ]
    vectorstore.add_documents(docs)  # vectorstore: an existing VectorStore instance

    retriever = vectorstore.as_retriever(
        search_kwargs={"k": 5, "filter": {"system": "AP platform"}}
    )
    hits = retriever.invoke("What is the default setting value?")

The example above filters with filter=. Whether this filtering is effective depends on the VectorStore implementation (for example, Pinecone and Weaviate have strong filtering support, while FAISS is weak here).

[Advanced] Context enrichment (add a "where this sits" description to each chunk)

When a chunk on its own tends to lose its subject or premises, one advanced approach is to attach a short description (its position within the document) to the chunk before embedding it. It works mainly for documents with many demonstrative pronouns or many premises, but indexing cost increases.

Minimal LangChain implementation
Here is an example that prepends a short preamble (title, chapter, purpose, and so on) to the chunk body before embedding.

    from langchain_core.documents import Document

    enriched = []
    for d in docs:  # docs: list[Document], e.g. the output of a header-based splitter
        # Build a "[h2/h3] " prefix from the heading metadata (empty if missing).
        prefix = f"[{d.metadata.get('h2', '')}/{d.metadata.get('h3', '')}] "
        enriched.append(
            Document(page_content=prefix + d.page_content, metadata=d.metadata)
        )
    vectorstore.add_documents(enriched)
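[Advanced] Late Chunking (encode the whole document first → split afterwards)

The usual order is "chunk → embed". Late Chunking is an advanced idea that reverses it: first run the whole document through the model to obtain vector representations that capture the context, and only then split. The context of the entire document is reflected in each chunk, but the conditions under which it applies and its cost need to be examined.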
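The "embed first, split afterwards" order can be illustrated with a toy sketch. It assumes you already have one contextualized vector per token for the whole document (which requires a model that exposes token-level embeddings) and that each chunk vector is obtained by mean pooling over its token span, which is how the technique is commonly described. The names `mean_pool` and `late_chunk` and all the numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def late_chunk(
    token_vectors: Sequence[Vector],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool token vectors (computed over the WHOLE document) per chunk span.

    spans are (start, end) token indices, end exclusive.
    """
    return [mean_pool(token_vectors[start:end]) for start, end in spans]


# Toy token-level vectors for a 4-token document (made-up numbers).
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # two chunks of two tokens each
chunk_vectors = late_chunk(token_vectors, spans)
print(chunk_vectors)  # [[2.0, 0.0], [0.0, 3.0]]
```

The difference from "chunk → embed" is where the context comes from: here every token vector was computed with the full document in view, so the pooled chunk vector can reflect subjects and premises that sit outside the chunk's own text.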
References

- LangChain: Text Splitters (concepts and implementation). "LangChain overview - Docs by LangChain" (python.langchain.com)
- Google Cloud: layout parser integration (useful as an entry point to structure-aware splitting). "Use Document AI layout parser with Vertex AI RAG Engine" (cloud.google.com)
- Pinecone: an overview of strategies, including semantic/contextual chunking. "Chunking Strategies for LLM Applications" (www.pinecone.io)
- Weaviate: an overview of chunking strategies plus advanced methods. "Chunking Strategies to Improve Your RAG Performance" (weaviate.io)
- IBM watsonx: LangChain-compatible chunker / adjacent-chunk expansion (window search). "RAG - IBM watsonx.ai" (ibm.github.io)
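Strategy comparison table: trade-offs among accuracy, cost, and implementation difficulty

No strategy is a silver bullet. You need to check the trade-offs among accuracy (Precision), noise tolerance, implementation difficulty, cost, and latency, and decide which strategy to use. The table below summarizes the characteristics of each chunking strategy.

Table: Comparison of chunking strategies

| Strategy | Accuracy | Noise tolerance | Implementation difficulty | Cost | Latency |
|---|---|---|---|---|---|
| Fixed-length + overlap | Low to medium | Low | Low | Low | Low |
| Recursive splitting | Medium | Medium | Low | Low | Low |
| Structure-aware | Medium to high | High | Medium | Medium | Medium |
| Semantic | High | High | High | High | High |
| Hierarchical (small-to-big) | Medium to high | Medium | Medium | Medium | Medium |
| Context enrichment / advanced | Medium to high | High | Medium to high | Medium to high | Medium to high |

This table is not meant to decide which strategy is "the strongest". Each chunking strategy has document structures it handles well, so you need to take those into account when choosing. If the strategy you chose first does not deliver much accuracy, some trial and error is needed, such as trying another chunking strategy.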
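How to choose a chunking strategy

If a strategy you adopted does not give the accuracy you hoped for, it helps to look at the answer pattern. The table below gives examples of answer patterns with their causes and countermeasures. It is not the one correct answer, but I hope it serves as a reference.

Table: Answer patterns, their causes, and countermeasures

| How the answer breaks down (common patterns) | Typical cause | Countermeasures to try first |
|---|---|---|
| Answer is fragmentary (exception conditions drop out) | Size too small / insufficient overlap | Increase overlap / hierarchical (small-to-big) |
| Unrelated sentences are mixed in (a lot of noise) | Size too large / insufficient preprocessing | Reduce size / structure-aware / metadata filter |
| Numbers in tables are corrupted | Broken parsing of PDFs/tables | Structure-aware (layout parser, etc.) / improve ingestion preprocessing |
| A different document is hit for the same term | Insufficient metadata / no filter | Attach metadata (system / component / version) + filter |
| Search hits, but the subject is unclear | Many references / context drops out | Increase overlap / context enrichment |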
Verification demo: LangChain + Vertex AI

From here on is the demo part. The goal is to make it quick to compare, with LangChain, how the evidence that search can pick up changes with the chunking strategy.

A more detailed version of this demo is published on GitHub, so please take a look.

GitHub - HiaHia1969/chunking_demo_public (github.com)

Configuration: TextSplitter (strategy) → Embeddings (Vertex AI) → VectorStore (local) → Retriever → comparison of retrieved chunks

Prerequisite: environment setup

This time I use uv to set up the environment.

    # Prepare the working directory
    mkdir langchain_demo && cd langchain_demo

    # Initialize uv
    uv init

    # Add the libraries
    uv add langchain \
      langchain-community \
      langchain-text-splitters \
      langchain-google-genai \
      faiss-cpu \
      python-dotenv \
      numpy \
      tiktoken

    # If you are working from the GitHub repository,
    # the following command resolves the dependencies.
    uv sync

Figure: Directory structure

Figure: Contents of pyproject.toml

Environment variables

In the .env file, configure the settings for calling Google models via Vertex AI. The API key must be issued in advance.

    GOOGLE_API_KEY=<your API key>
    GOOGLE_CLOUD_PROJECT=<Google Cloud project name>
    GOOGLE_CLOUD_LOCATION=<region name>
    GOOGLE_GENAI_USE_VERTEXAI=true
    EMBEDDING_MODEL=gemini-embedding-001

Figure: Environment variable settings
Common: utilities for vectorization and search

    import os
    from dataclasses import dataclass
    from typing import List

    from dotenv import load_dotenv
    from langchain_google_genai import GoogleGenerativeAIEmbeddings
    from langchain_community.vectorstores import FAISS

    # LangChain splitters
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Load environment variables from .env
    load_dotenv()


    @dataclass
    class SearchResult:
        label: str
        docs: List[str]


    def build_vs(chunks: List[str], embeddings: GoogleGenerativeAIEmbeddings) -> FAISS:
        """Build a local FAISS vector store from plain text chunks."""
        return FAISS.from_texts(chunks, embedding=embeddings)


    def topk_texts(vs: FAISS, query: str, k: int = 3) -> List[str]:
        """Return the text of the top-k chunks for the query."""
        docs = vs.similarity_search(query, k=k)
        return [d.page_content for d in docs]


    def show(title: str, texts: List[str]) -> None:
        """Print retrieved chunks under a title."""
        print(f"\n===== {title} =====")
        for i, t in enumerate(texts, 1):
            print(f"\n--- top{i} ---\n{t}")


    # Embeddings (Google Generative AI)
    # This article uses gemini-embedding-001; check the official documentation
    # for the models that are available.
    embeddings = GoogleGenerativeAIEmbeddings(
        model=os.getenv("EMBEDDING_MODEL", "gemini-embedding-001"),
        api_key=os.getenv("GOOGLE_API_KEY"),
        project=os.getenv("GOOGLE_CLOUD_PROJECT"),
        location=os.getenv("GOOGLE_CLOUD_LOCATION"),
        vertexai=os.getenv("GOOGLE_GENAI_USE_VERTEXAI", "true").lower() == "true",
    )

Demo 1: Reproducing "the exception condition drops out" with and without overlap

Corresponding source: demos/demo1_overlap_effect.py

Purpose: confirm the effect of overlap on Top1 evidence retrieval, not only for a single case but across multiple cases.

What this demo checks

- Purpose: when a boundary split occurs, check how far overlap can mitigate missing evidence in Top1
- Settings: chunk_size=120, compare overlap=0 with overlap=20, search with k=1 (Top1)
- Expected difference: with overlap, "basic + exception" is more likely to stay in the same chunk, so Top1 misses decrease
- How to read it: look at the `判定` (verdict) line and at `Top1で基本+例外を同時取得できた件数` (the number of cases where basic + exception were both retrieved in Top1, i.e. the reproduction rate)

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # The common utilities (build_vs / topk_texts / show) and `embeddings`
    # come from the previous section.

    # The test document is kept in Japanese because chunk boundaries depend on
    # its exact length. Meaning: filler "Background explanation." sentences,
    # then "The setting policy for component A is as follows. By default X=ON."
    # and "However, only in B mode, X=OFF as an exception."
    def make_doc(noise_repeat: int) -> str:
        return (
            "背景説明。" * noise_repeat
            + "A部品の設定方針は次の通り。基本はX=ONとする。"
            + "ただしBモード時のみ例外でX=OFFとする。"
        )

    # "Tell me the setting policy for component A. Include both the basic
    # setting (X=ON) and the exception setting (X=OFF)."
    query = "A部品の設定方針を教えてください。基本設定(X=ON)と例外設定(X=OFF)を両方含めてください。"

    # Chunking: settings under which the X=ON sentence gets split at a boundary
    chunk_size = 120
    overlap0 = 0
    overlap1 = 20
    split0 = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap0)
    split1 = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap1)

    # Representative case (noise_repeat=20)
    doc = make_doc(20)
    chunks0 = split0.split_text(doc)
    chunks1 = split1.split_text(doc)
    show("overlap=0 (the exception tends to drop at the boundary)",
         topk_texts(build_vs(chunks0, embeddings), query, k=1))
    show("overlap=20 (the exception tends to stay in the same chunk)",
         topk_texts(build_vs(chunks1, embeddings), query, k=1))

    # Multiple cases
    for r in [16, 18, 20, 22, 24]:
        d = make_doc(r)
        c0 = split0.split_text(d)
        c1 = split1.split_text(d)
        t0 = topk_texts(build_vs(c0, embeddings), query, k=1)
        t1 = topk_texts(build_vs(c1, embeddings), query, k=1)
        # OK only if the Top1 chunk contains both the basic and the exception setting.
        ok0 = any("X=ON" in t and "X=OFF" in t for t in t0)
        ok1 = any("X=ON" in t and "X=OFF" in t for t in t1)
        print(f"noise_repeat={r}: overlap=0 -> {'○' if ok0 else '×'}, "
              f"overlap=20 -> {'○' if ok1 else '×'}")

Execution results

Figure: Run command

Output (summary; ○ = both retrieved, × = missing)

    [Settings] chunk_size=120

    [Representative case] noise_repeat=20
    overlap=0  : verdict ×  the exception setting (X=OFF) is missing
    overlap=20 : verdict ○  both the basic and the exception setting are included

    [Additional check] reproduction rate across multiple cases (Top1)
    noise_repeat=16: overlap=0 -> ×, overlap=20 -> ×
    noise_repeat=18: overlap=0 -> ×, overlap=20 -> ×
    noise_repeat=20: overlap=0 -> ×, overlap=20 -> ○
    noise_repeat=22: overlap=0 -> ○, overlap=20 -> ○
    noise_repeat=24: overlap=0 -> ○, overlap=20 -> ○

    Cases where basic + exception were both retrieved in Top1
    overlap=0:  2/5
    overlap=20: 3/5

Discussion

- In the representative case, the demo reproduced that overlap=0 misses the exception and overlap=20 recovers it.
- Across multiple cases as well, overlap=20 gathered the evidence in Top1 more often (3/5 vs 2/5), confirming a tendency toward improvement.
- The difference depends on where the boundary falls, so in practice it is reasonable to tune chunk_size and k together rather than overlap alone.
- The difference in this mini demo is limited, but in real projects with long documents and many conditions, boundary splits increase, so the effect of overlap generally becomes larger.

Observation points

- Overlap is not magic that always works; it is a means of mitigating boundary-dependent problems
- In Top1 operation, it tends to work well as insurance that preserves information at the boundary
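If you want to check the boundary effect without an API key, the splitting side can be inspected on its own. The sketch below is a simplification and rests on assumptions: it uses a plain fixed-width character window (not the RecursiveCharacterTextSplitter used in the demo), it checks the chunks directly instead of running retrieval, and `fixed_windows` and `keeps_together` are hypothetical helpers. The test string is constructed so that position 120 falls between the two facts.

```python
from typing import List, Sequence


def fixed_windows(text: str, size: int, overlap: int) -> List[str]:
    """Plain fixed-width character windows with the given overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reaches the end
            break
        start += step
    return chunks


def keeps_together(chunks: Sequence[str], required: Sequence[str]) -> bool:
    """True if at least one chunk contains every required string."""
    return any(all(term in chunk for term in required) for chunk in chunks)


# 110 filler characters, the basic rule, 10 more filler characters, the exception.
doc = "a" * 110 + "X=ON;" + "b" * 10 + "X=OFF"
for overlap in (0, 20):
    chunks = fixed_windows(doc, size=120, overlap=overlap)
    print(overlap, len(chunks), keeps_together(chunks, ["X=ON", "X=OFF"]))
```

A check like this only tells you whether a chunk holding both facts exists; whether that chunk is actually ranked first still depends on the embedding model and the query, which is what the demo above measures.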
Demo 2: Comparing "readable chunks" with fixed-length (token) vs recursive splitting

Corresponding source: demos/demo2_token_vs_recursive.py

Purpose: show that fixed-length splitting chops sentences into pieces that are hard to understand even for a human reader (and therefore hard for an LLM as well).

Note: on the token side I use `token_splitter()`, which is less prone to garbled characters in Japanese. It is built on LangChain's CharacterTextSplitter.

When I first wrote this demo, I used TokenTextSplitter. However, the chunk strings came out garbled when chunking Japanese text, like this:

    ...制御する� �設計判断です。

It appears that TokenTextSplitter can produce garbled characters after splitting when the string contains multi-byte characters such as Japanese, presumably because a token boundary can fall in the middle of a character's byte sequence. For that reason, this demo uses CharacterTextSplitter instead of TokenTextSplitter.

Official LangChain documentation: Text splitter integrations - Docs by LangChain (docs.langchain.com)

What this demo checks

- Purpose: compare how the readability and semantic cohesion of chunks change between fixed-length and recursive splitting
- Settings: chunk_size=25 on the token side, chunk_size=120 on the recursive side, overlap=0 for both
- Expected difference: token splitting tends to cut mid-sentence, while recursive splitting tends to keep natural sentence boundaries
- How to read it: compare `[NG] 文の途中で分断` (cut mid-sentence) on the token side with `[OK] 自然な区切り` (natural break) on the recursive side

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from src.splitters import token_splitter  # helper from the demo repository

    # Kept in Japanese because the split positions depend on the exact text.
    # Meaning: "Chunking in RAG is not mere splitting. It is a design decision
    # that controls the trade-off between retrieval accuracy and context
    # retention, and also cost and latency. For example, in specifications with
    # many conditions, exceptions, and references, fragmentation of context is fatal."
    text = """
    RAGのチャンキングは単なる分割ではありません。
    検索精度と文脈保持、さらにコストとレイテンシのトレードオフを制御する設計判断です。
    例えば、条件・例外・参照が多い仕様書では、文脈の断片化が致命的になります。
    """

    token_split = token_splitter(chunk_size=25, chunk_overlap=0)
    rec_split = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=0)

    token_chunks = token_split.split_text(text)
    rec_chunks = rec_split.split_text(text)

    print("\n===== token split (fixed-length) =====")
    for c in token_chunks:
        print("-", c)

    print("\n===== recursive split (natural units) =====")
    for c in rec_chunks:
        print("-", c)

Execution results

Figure: Run command

Output (summary)

    [Pattern 1] Token splitting (chunk_size=25 tokens)
    Result: split into 7 chunks
    Examples:
    - 「RAGのチャンキングは単なる分割ではありま」
    - 「せん。検索精度と文脈保」

    [Pattern 2] Recursive splitting (chunk_size=120 characters)
    Result: split into 1 chunk
    Example:
    - 「RAGのチャンキングは単なる分割ではありません。...(full text)」
Discussion

- Token splitting is strong at controlling length, but the demo confirmed that mid-sentence cuts occur one after another and semantic cohesion breaks down easily.
- Recursive splitting fit this text into a single chunk and preserved the consistency of the context.
- For Japanese, even with a token splitter that does not garble characters, recursive splitting tends to have the advantage from the viewpoint of context retention.

Demo 3: What do you gain by moving toward structure awareness (layout analysis)?

Corresponding source: demos/demo3_semantic_breakpoints.py

Purpose: compare how the semantic cohesion of chunks changes between splitting without structure and splitting that uses the heading structure.

What this demo checks

- Purpose: compare plain-text splitting and heading-based splitting in terms of topic completeness and whether search-oriented metadata is present
- Settings: RecursiveCharacterTextSplitter for plain text, MarkdownHeaderTextSplitter (Header 1 to 3) for structured text
- Expected difference: heading-based splitting groups content by section and attaches Header metadata
- How to read it: check the `メタデータ` (metadata) lines and whether topics are mixed on the plain-text side

    # Helpers from the demo repository
    from src.splitters import markdown_header_splitter, recursive_splitter

    # Document text is kept in Japanese. Meaning: "System Settings Guide" /
    # "Settings for component A" / "Basic settings: the policy is as follows,
    # by default X=ON" / "Exception settings: however, in B mode, X=OFF."
    plain_doc = """
    システム設定ガイド
    A部品の設定
    基本設定
    A部品の設定方針は次の通りです。
    基本は「X=ON」とする。
    例外設定
    ただし、Bモードの場合は例外で「X=OFFとする。
    """

    markdown_doc = """
    # システム設定ガイド
    ## 1. A部品の設定
    ### 基本設定
    A部品の設定方針は次の通りです。
    基本は「X=ON」とする。
    ### 例外設定
    ただし、Bモードの場合は例外で「X=OFFとする。
    """

    # Pattern 1: no structure (Recursive)
    plain_chunks = recursive_splitter(chunk_size=100, chunk_overlap=0).split_text(plain_doc)

    # Pattern 2: structure-aware (Markdown headers)
    headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
    md_docs = markdown_header_splitter(headers_to_split_on).split_text(markdown_doc)

    print("plain chunks:", len(plain_chunks))
    print("markdown header chunks:", len(md_docs))
    for d in md_docs:
        print(d.metadata, d.page_content[:40])

Execution results

Figure: Run command

Output (summary)

    [Pattern 1] No structure (Recursive)
    Result: 3 chunks
    - Chunk 2 holds both 「例外設定」 (exception settings) and 「認証設定」
      (authentication settings), so topics are mixed

    [Pattern 2] Markdown header splitting
    Result: 4 chunks (one per heading)
    - Chunk 1 metadata: {'Header 1': 'システム設定ガイド', 'Header 2': '1. A部品の設定', 'Header 3': '基本設定'}
    - Chunk 2 metadata: {'Header 1': 'システム設定ガイド', 'Header 2': '1. A部品の設定', 'Header 3': '例外設定'}

Discussion

- With structure-less splitting, chunks appear that contain only a heading, or that hold different sections together, and interpretation at search time becomes unstable.
- With heading-based splitting, chunk boundaries coincide with the document structure, and both topic completeness and the usability of metadata improve significantly.
- For structured documents such as specifications, procedure manuals, and operations documents, prioritizing header-based splitting first is the practical choice.

Observation points

- With structure-less splitting, headings and body text mix easily and topics tend to scatter
- With heading-based splitting, content tends to be grouped by topic, with Header metadata attached
References

- LangChain: Text Splitters (concepts and implementation). "LangChain overview - Docs by LangChain" (python.langchain.com)
- LangChain: Vertex AI embeddings integration. "Google Vertex AI integration - Docs by LangChain" (python.langchain.com)
- Google Cloud: layout parser integration (useful as an entry point to structure-aware splitting). "Use Document AI layout parser with Vertex AI RAG Engine" (cloud.google.com)

Evaluation (next article): how do you measure chunking improvements?

Chunking can easily be "improved" in a way that merely looks plausible, and relying on subjective evaluation tends to send you in circles. At a minimum, it is safest to be able to measure "better / worse" quantitatively with the following metrics (details in the next article).

- Context Recall: is the evidence needed for the correct answer included in the Top-k?
- Context Precision: is the Top-k free of excessive noise?
- Faithfulness: is the answer grounded in the retrieved evidence?
- Answer Relevancy: does the answer actually address the question?

The evaluation and improvement loop I recommend is:

1. Prepare 50 representative queries (mix fact-type, analysis-type, and procedure-type)
2. Save the Top-k logs for the baseline (fixed-length + overlap, or recursive)
3. Change only one condition at a time and compare (size only, overlap only, structure-aware only, ...)

This alone is enough to get a sense of the direction of improvement.

(Supplement) Thinking in terms of Amazon Bedrock Knowledge Bases

I plan to verify this properly from Series 2 onward, but here is a summary for those who want to take the easy route with a managed service.

AWS provides a managed service called Amazon Bedrock Knowledge Bases, which makes it easy to build a RAG environment. As of February 2026, the chunking strategies available in Amazon Bedrock Knowledge Bases (Bedrock KB) are listed below. Mapped onto the chunking strategies described so far, they correspond roughly as follows (details will be verified in the Tips series).

Table: Chunking strategies available in Amazon Bedrock Knowledge Bases

| Bedrock KB | Equivalent general strategy | In a word |
|---|---|---|
| Default | Baseline | Start here if in doubt |
| Fixed-size | Fixed-length + overlap | Prioritizes speed and cost |
| Hierarchical | Hierarchical | For complex context |
| Semantic | Semantic | Leans toward high accuracy (watch the cost increase) |
| None | No splitting | For preprocessed content / FAQs |

Summary

This article explained chunking strategies in RAG.

- First, build a baseline with fixed-length + overlap or recursive splitting
- Infer the cause from how the answer breaks down (fragmentation, noise, broken tables, and so on), and go advanced only where needed
- As in the demos, compare the retrieved chunks and observe where things are breaking
- Drive improvements with evaluation metrics (Recall / Precision / Faithfulness, etc.) so that they can be measured quantitatively

Next time, I cover RAG evaluation (quantitative evaluation) in order to judge whether these improvements are really working. Let's get to a state where "better / worse" can be measured with evaluation metrics such as those in Ragas.

I hope you will read the next article as well.

Related article: I Want to End the "That Question Is Answered in the Docs" Problem! Starting a RAG Series (blog.usize-tech.com, 2026.01.27)