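Hello, I'm Takahiro Sato (佐藤隆広, @T) from Fintech SRE.

This article is the Day 11 entry of Merpay & Mercoin Tech Openness Month 2025.

SRE reliability management became widely known through Google's Site Reliability Engineering Book. It redefined the relationship between development and operations, starting from SLIs/SLOs and error budgets, and has since been reinforced with a range of metrics such as availability, latency, error rate, traffic, resource saturation, and durability.

In recent years, however, large language models (LLMs) have advanced remarkably. As more services use LLMs, we increasingly run into events that conventional metrics tend to miss:

- Changing a few lines of a prompt changes the answer quality
- Hallucinations surge even though latency and error rate look healthy
- A minor model update drastically changes the answer style

In other words, protecting the "reliability of an LLM service" requires monitoring quality metrics specific to LLM services on top of the classic infrastructure metrics.

This article walks through everything from selecting the metrics that are essential for evaluating the reliability of an LLM service to concrete measurement and evaluation methods, with a demo using the DeepEval library.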

1. General evaluation metrics for LLM services

Which metrics should we focus on when measuring the reliability of an LLM service?

LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide lists the following as representative evaluation perspectives.

| Metric | Description |
| --- | --- |
| Answer Relevancy | How appropriately the response answers the question |
| Task Completion | How accurately the given task was carried through |
| Correctness | How closely the output matches a ground truth prepared in advance |
| Hallucination | Whether the output contains content that is not based on fact, or made-up information |
| Tool Correctness | Whether the right tools were selected and executed to accomplish the task |
| Contextual Relevancy | How relevant the retrieved information is to the question |
| Responsible Metrics | Whether the output contains discriminatory or offensive content, or shows bias against particular attributes |
| Task-Specific Metrics | How well the LLM performs on a "specific task" such as summarization or translation |

For conventional services, monitoring infrastructure SLIs such as availability and latency, and tying them to user journeys, was enough to understand customer satisfaction.

In LLM services, by contrast, generation quality itself ("does the response follow the intent and rest on facts?", "was the task completed correctly?") is directly linked to customer satisfaction.

Therefore, in addition to availability and latency, we need to design SLIs that capture the generation quality specific to LLM services, and build a metric system that shows quantitatively whether customers can quickly obtain the correct answer they intended.

So, when designing metrics for an LLM service, which metrics should we actually choose?

1.1. The pitfall of general evaluation metrics

The general perspectives in the table above, such as answer relevancy, correctness, and hallucination, form the skeleton, but they do not necessarily capture the success conditions specific to every LLM use case.

For example, a summarization service needs "coverage" and "absence of contradictions", and a RAG system needs "relevance of the retrieved context". Without such service-specific metrics, it is often impossible to fully measure the value customers receive.

The article The Accuracy Trap: Why Your Model's 90 % Might Mean Nothing describes a customer churn prediction model that reached 92% test accuracy, yet in production it produced false alarms and missed cases instead of preventing cancellations, and the churn rate went up as a result.

The lesson appears to be: put end-to-end evaluation from the customer's point of view first.

LLM services have complex internal structures such as RAG and agent mechanisms, but no matter how much the intermediate components improve, ROI does not rise unless the answers customers receive get better.

The metric to choose for an LLM service is therefore one that measures the final output as a black box, end to end, and whose results correlate with outcomes such as reduced support effort or increased revenue.

1.2. What makes a good evaluation metric?

The Complete LLM Evaluation Playbook: How To Run LLM Evals That Matter gives the following three conditions for a good evaluation metric.

- Quantitative: the evaluation produces a numeric score. With a number, you can set a passing threshold and track the score over time to measure the effect of model improvements.
- Reliable: the evaluation always gives stable results. LLM output already fluctuates unpredictably, so an unstable metric on top of that is a problem. For example, LLM-based evaluation (such as LLM-as-a-judge, described later) is more accurate than conventional methods but tends to produce more variable results, so care is needed.
- Accurate: the metric reflects the model's performance by a standard close to actual human evaluation. Ideally, a high score means an output that humans find good, which requires evaluating against criteria aligned with human expectations.

Also, no matter how high a metric scores, it is meaningless unless it leads to business outcomes such as revenue or customer satisfaction.

The same article calls this Metric Outcome Fit (the connection between metrics and outcomes), and goes as far as to say that 95% of the LLM metric evaluations performed in the field lack this connection and create no value. It presents continuously checking and adjusting whether the metric reliably judges as "good" the cases that the business regards as "good outcomes" as the only way to keep metrics on target.
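
As a rough illustration of checking Metric Outcome Fit, the sketch below measures how often a metric's pass/fail verdict agrees with a business label. The function name `outcome_agreement`, the threshold, and the sample data are all hypothetical and not taken from the referenced article.

```python
from typing import Sequence


def outcome_agreement(
    scores: Sequence[float],
    good_outcome: Sequence[bool],
    threshold: float,
) -> float:
    """Fraction of cases where (score >= threshold) matches the business label."""
    if len(scores) != len(good_outcome) or not scores:
        raise ValueError("scores and good_outcome must be non-empty and equal in length")
    matches = sum((s >= threshold) == g for s, g in zip(scores, good_outcome))
    return matches / len(scores)


# Hypothetical data: metric scores and whether the business judged each case "good".
scores = [0.92, 0.88, 0.40, 0.95, 0.55]
good_outcome = [True, False, False, True, True]

agreement = outcome_agreement(scores, good_outcome, threshold=0.8)
print(f"agreement with business outcome: {agreement:.2f}")
```

A low agreement rate would suggest that the metric, or its threshold, is not yet aligned with what the business considers a good result.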
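
2. Overview of methods for evaluating metrics

Next, let's look at the kinds of methods used to actually evaluate metrics. Broadly, there are four, each with strengths and weaknesses.

- Statistical methods (string-based / n-gram based / surface based)
- Methods using non-LLM models (classifier / learned metrics / small-LM metrics)
- Hybrid methods that combine statistical methods with non-LLM models (embedding-based metrics)
- Methods using an LLM itself (LLM based / generative evaluator)

2.1. Statistical methods

These methods compare the output text with manually created ground-truth data at the string level and evaluate by measuring similarity.

- BLEU: combines the 1- to 4-gram precision of the model output against the expected reference and multiplies the result by a brevity penalty. It is a precision-based score that also penalizes outputs that are too short.
- ROUGE: often used for evaluating summaries. ROUGE-L takes the F1 of recall and precision based on the LCS (longest common subsequence), while ROUGE-1/2 use n-gram recall to measure how much of the source document the summary covers.
- METEOR: evaluates both precision and recall, and also accounts for word-order differences and synonym matching. (The final score is the harmonic mean of precision and recall, discounted by a word-order penalty.)
- Edit distance (Levenshtein distance): measures the string difference between the output and the ground truth directly. In practice it is rarely applied as-is to multi-sentence texts, and it seems to be used less often than the effort of learning it would suggest.

ref: LLM evaluation metrics - BLEU, ROUGE and METEOR explained

These statistical metrics are simple to compute and highly reproducible (consistent), but they ignore the meaning and context of the text, so they are poorly suited to evaluating long-form answers or outputs requiring advanced reasoning of the kind LLMs generate. Purely statistical methods cannot assess the logical consistency or semantic correctness of an output, and are considered insufficiently accurate for complex outputs.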
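
To make the surface-level nature of these metrics concrete, here is a minimal sketch of two building blocks: Levenshtein distance and clipped n-gram precision (the quantity BLEU combines across n = 1 to 4). It is a simplified illustration written for this article, not a full BLEU or ROUGE implementation, and the sample sentences are made up.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ngram_precision(candidate: list, reference: list, n: int) -> float:
    """Clipped n-gram precision of the candidate tokens against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


candidate = "the wolf ate the girl".split()
reference = "the wolf swallowed the girl".split()
print(levenshtein("kitten", "sitting"))
print(ngram_precision(candidate, reference, 1), ngram_precision(candidate, reference, 2))
```

Note that "ate" and "swallowed" count as a plain mismatch here even though they mean nearly the same thing, which is exactly the limitation described above.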

2.2. Methods using non-LLM models

These methods evaluate with dedicated machine learning models, using relatively lightweight NLP models such as classifiers and embedding models.

- NLI (natural language inference) models: classify whether the LLM output is consistent with (Entailment), contradicts (Contradiction), or is unrelated to (Neutral) a given reference text such as factual information. In this case the model's output score is a probability between 0.0 and 1.0 that expresses how logically consistent the output is.
- Dedicated models trained on Transformer-based language models (NLI, BLEURT, etc.): score the similarity between the LLM output and the expected answer.

Model-based methods allow evaluation that takes the meaning of the text into account to some degree, but because the evaluation model itself is uncertain, the scores can lack consistency (stability). For example, NLI models may fail to judge well when the input text becomes very long, and BLEURT has been pointed out as potentially biased by skew in its training data.
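
2.3. Hybrid methods combining statistical methods and non-LLM models

These methods sit between the two above: they vectorize text with the embeddings of a pretrained language model and combine those values with statistical distance calculations.

- BERTScore: computes the cosine similarity between the contextual vectors of each word obtained with BERT or a similar model, and measures the degree of semantic overlap between the output and the reference.
- MoverScore: builds a distribution from word embeddings for the output and for the reference, then computes the Earth Mover's Distance (optimal transport distance) between them to measure their difference.

These methods are better than BLEU and the other statistical methods in that they capture semantic closeness beyond the word and surface level, but their weakness is that they are ultimately affected by the performance and biases of the underlying embedding model (BERT, etc.). For example, if the pretrained model lacks suitable vector representations for a specialized domain or for recent knowledge, accurate evaluation is not possible. There is also a risk that social biases contained in the evaluation model show up in the scores.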
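
The greedy matching idea behind BERTScore can be sketched with plain Python. The token vectors below are hand-written toy values standing in for real contextual embeddings (an assumption made so the example stays self-contained), and `greedy_match_f1` is a hypothetical name, not the BERTScore library API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_match_f1(cand_vecs, ref_vecs):
    """F1 of greedy token matching: each token pairs with its most similar counterpart."""
    if not cand_vecs or not ref_vecs:
        return 0.0
    # Precision: how well each candidate token is supported by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 2-dimensional "embeddings": the candidate covers only one of two reference tokens.
cand_vecs = [[1.0, 0.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_match_f1(cand_vecs, ref_vecs))
```

With real embeddings, near-synonyms would receive a high cosine similarity instead of a flat mismatch, which is where the gain over n-gram metrics comes from.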
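
2.4. Methods using an LLM (LLM-as-a-judge)

Among evaluation methods, the one drawing the most attention recently is LLM-as-a-judge, in which an LLM itself measures and evaluates output quality.

The approach is to instruct an advanced LLM, for example "Evaluate whether the given answer meets the criteria", and draw an evaluation score or verdict out of the model.

Because LLMs can understand the meaning of text and make complex judgments, a major advantage is that evaluation close to human subjective judgment can be automated.

In fact, the article G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation notes that with G-Eval, a method that uses GPT-4 as the evaluator, the correlation between evaluation scores and human evaluation improved substantially over earlier automatic evaluation.

On the other hand, the result of LLM-based evaluation can vary with the model's response, so score stability (reliability) is an issue.

There is no guarantee that an LLM returns exactly the same score every time it re-evaluates the same answer, because the model's random elements and output fluctuations also affect the evaluation result.

Below are some representative LLM-as-a-judge methods.

- G-Eval: scores the evaluation criteria on a 1 to 5 scale, with the LLM returning both the score and the reasons for it (the result of Chain of Thought).
- QAG Score: automatically generates QA pairs (Yes/No/Unknown) from the output, answers the same questions against the source text, and uses the agreement rate between the two as the score.
- SelfCheckGPT: samples N times with the same prompt and estimates factuality by measuring the consistency among the generated texts (several comparison modes, e.g. n-gram, QA, BERTScore). The larger the variation, the more likely a hallucination.
- DAG (deep acyclic graph): a decision-tree-style metric provided by DeepEval. Each node is an LLM judgment (Yes/No), and a fixed score can be returned depending on the path. Although it is LLM-as-a-judge, it bundles rule-like judgment nodes into a decision tree, so even partial credit becomes deterministic.
- Prometheus2 Model: 7B / 8x7B evaluation models distilled from the feedback of high-quality judges including GPT-4 and a large number of evaluation traces. Reported agreement with humans/GPT-4 is 0.6 to 0.7 for direct scoring and 72 to 85% for pairwise comparison.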
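
Finally, the table below summarizes the measurement and evaluation methods covered so far.

| Type | Concrete methods | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Statistical methods | BLEU / ROUGE / METEOR / edit distance (Levenshtein distance) | Simple and fast to compute, highly reproducible; no additional training needed and easy to implement | Ignores meaning and context and evaluates surface matching only; unsuitable for outputs requiring logical consistency or advanced reasoning |
| Methods using non-LLM models | NLI (natural language inference) models / BLEURT / dedicated Transformer-based evaluation models | Can evaluate semantic understanding and logical consistency to some degree; lower compute cost than an LLM and can be fine-tuned in-house | Uncertainty of the evaluation model itself; dependent on its biases; accuracy tends to drop on long texts and specialized domains |
| Hybrid methods | BERTScore / MoverScore | Captures semantic closeness through embeddings, more accurate than statistical metrics; deterministic, so reproducibility is easy to keep | Depends on the training coverage and biases of the underlying embedding model; hard to fit to recent knowledge or narrow specialized domains |
| Methods using an LLM (LLM-as-a-judge) | G-Eval / QAG Score / SelfCheckGPT / DAG (Deep Acyclic Graph) / Prometheus2 Model | Can automate complex judgments close to human evaluation; can evaluate multiple aspects of answer quality at once | Output is probabilistic, so scores tend to fluctuate; model usage cost is high and results are sensitive to the prompt |

To actually measure with these evaluation methods, we need a tool that lets us do so efficiently.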

So this time I would like to introduce DeepEval, one of the LLM evaluation libraries, which I had come across in the reference articles.

3. DeepEval

DeepEval is a Python library for evaluating LLM services. It provides a framework for creating test cases, defining evaluation metrics, and running evaluations.

DeepEval supports metrics that evaluate various aspects such as response relevancy, faithfulness, and contextual precision, and it also supports custom metrics, automatic generation of evaluation datasets, and integration with test frameworks such as Pytest.

The official documentation explains in detail the installation steps, basic usage, how to configure the various metrics, and how to create custom metrics.

Here, I will walk through the evaluation procedure using a simple summarization service.

3.1. Practical example: deciding and measuring metrics for a summarization service

The summarization service assumed here takes long text such as articles and documents as input and generates a short summary of the content. Given how LLMs work, it is probably the first service that comes to mind as something they are good at.

This time, let's consider a service that summarizes Grimm's fairy tales in writing that even children can understand.

3.2. Selecting metrics

From the standpoint of summarization, the general metrics that come to mind are Answer Relevancy, Correctness, and Hallucination.

DeepEval's G-Eval can cover all three, but we need to check whether, for this case, it meets the conditions in "1.2. What makes a good evaluation metric?".

- Quantitative: G-Eval returns a continuous score between 0 and 1, so it does produce a numeric score as its evaluation result.
- Reliable: G-Eval is inherently probabilistic, but the same input can be made to reproduce nearly the same score by doing three things: calling the LLM with the temperature option set to 0, fixing evaluation_steps so that CoT generation is skipped, and specifying a Rubric so that scoring stays consistent. Stable results therefore seem achievable. (Strictly speaking, sampling noise and system randomness remain on the OpenAI side, so full reproducibility is not reached. Either use an API/backend that allows top_p=0 and a fixed seed, or ultimately rely on majority-vote/ensemble evaluation, which is the recommended approach.)
- Accurate: when G-Eval evaluates against a reference (expected_output; in this case the original text of the Grimm fairy tale is the ground truth), it has been shown in both papers and production use to correlate highly with human judgment on tasks centered on fact checking.

For this case, then, it seems reasonable to use DeepEval's G-Eval to evaluate Answer Relevancy, Correctness, and Hallucination.
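
Since some run-to-run variation can remain even with temperature 0, one simple way to realize the ensemble idea is to judge the same output several times and aggregate. The sketch below uses the median plus a spread check; the function name, the `max_spread` tolerance, and the sample scores are assumptions for illustration and are not part of DeepEval.

```python
import statistics


def aggregate_judge_scores(scores, max_spread=0.1):
    """Combine repeated judge scores for one output into a single, steadier value."""
    if not scores:
        raise ValueError("at least one score is required")
    spread = max(scores) - min(scores)
    return {
        "score": statistics.median(scores),  # robust to a single outlier run
        "spread": spread,                    # how much the judge disagreed with itself
        "stable": spread <= max_spread,      # False suggests the result needs a closer look
    }


# Hypothetical scores from judging the same summary five times.
repeated_scores = [0.90, 0.91, 0.88, 0.92, 0.90]
result = aggregate_judge_scores(repeated_scores)
print(result)
```

The median is used here instead of a literal majority vote because G-Eval scores are continuous; for discrete verdicts a vote count would be the closer analogue.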

3.3. Breaking down the evaluation perspectives

Next, we list the perspectives and steps needed to evaluate each selected metric, that is, the procedure by which each one is evaluated.

Fortunately, Google Cloud's Vertex AI documentation (Metric prompt templates for model-based evaluation) is a useful reference for breaking down evaluation perspectives, so I used it as the basis here.

Answer Relevancy

- STEP1. Identify user intent - List the explicit and implicit requirements in the prompt.
- STEP2. Extract answer points - Summarize the key claims or pieces of information in the response.
- STEP3. Check coverage - Map answer points to each requirement; note any gaps.
- STEP4. Detect off-topic content - Flag irrelevant or distracting segments.
- STEP5. Assign score - Choose 1-5 from the rubric and briefly justify the choice.

Correctness

- STEP1. Review reference answer (ground truth).
- STEP2. Isolate factual claims in the model response.
- STEP3. Cross-check each claim against the reference or authoritative sources.
- STEP4. Record discrepancies - classify as omissions, factual errors, or contradictions.
- STEP5. Assign score using the rubric, citing the most significant discrepancies.

Hallucination

- STEP1. Highlight factual statements - names, dates, statistics, citations, etc.
- STEP2. Compare with provided context and known reliable data.
- STEP3. Label claims as verified, unverifiable, or false.
- STEP4. Estimate hallucination impact - proportion and importance of unsupported content.
- STEP5. Assign score following the rubric and list specific hallucinated elements.

3.4. Calculating the evaluation scores

Now let's actually run the evaluation and calculate the scores.

First, we prepare the material to summarize and the prompt. For the Grimm fairy tale I used the original text of Little Red Riding Hood (赤ずきん), and I prepared the following prompt (originally written in Japanese).

    Please write a summary of the following Grimm fairy tale.

    Requirements:
    1. Identify and include the main characters and important elements
    2. Organize the flow of the content logically
    3. Include important events and turning points
    4. Be faithful to the content of the original text
    5. Keep the summary within 500 characters

    Content of the Grimm fairy tale:
    {original text of Little Red Riding Hood}

    Summary:
å®¹ïŒ {èµ€ãããã®åæ} èŠçŽïŒ""" 䜿çšããè©äŸ¡ã¹ã¯ãªããã¯äžèšã«ãªããŸãã import asyncio import openai from deepeval.metrics.g_eval.g_eval import GEval from deepeval.metrics.g_eval.utils import Rubric from deepeval.test_case.llm_test_case import LLMTestCase, LLMTestCaseParams async def evaluate_comprehensive_metrics(client: openai.AsyncOpenAI, test_case: LLMTestCase, prompt_name: str, original_text: str) -> dict: """G-Evalã¡ããªã¯ã¹è©äŸ¡ãå®è¡""" # åçã®é¢é£æ§è©äŸ¡ (Answer Relevancy) geval_answer_relevancy = GEval( name="Answer Relevancy", evaluation_steps=[ "STEP1. **Identify user intent** â List the explicit and implicit requirements in the prompt.", "STEP2. **Extract answer points** â Summarize the key claims or pieces of information in the response.", "STEP3. **Check coverage** â Map answer points to each requirement; note any gaps.", "STEP4. **Detect off-topic content** â Flag irrelevant or distracting segments.", "STEP5. **Assign score** â Choose 1-5 from the rubric and briefly justify the choice.", ], rubric=[ Rubric(score_range=(0, 2), expected_outcome="Largely unrelated or fails to answer the question at all."), Rubric(score_range=(3, 4), expected_outcome="Misunderstands the main intent or covers it only marginally; most content is off-topic."), Rubric(score_range=(5, 6), expected_outcome="Answers the question only partially or dilutes focus with surrounding details; relevance is acceptable but not strong."), Rubric(score_range=(7, 8), expected_outcome="Covers all major points; minor omissions or slight digressions that donât harm overall relevance."), Rubric(score_range=(9, 10), expected_outcome="Fully addresses every aspect of the user question; no missing or extraneous information and a clear, logical focus."), ], evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT], model="gpt-4o" ) # æ£ç¢ºãè©äŸ¡ (Correctness) geval_correctness = GEval( name="Correctness", evaluation_steps=[ "STEP1. 
**Review reference answer** (ground truth).", "STEP2. **Isolate factual claims** in the model response.", "STEP3. **Cross-check** each claim against the reference or authoritative sources.", "STEP4. **Record discrepancies** â classify as omissions, factual errors, or contradictions.", "STEP5. **Assign score** using the rubric, citing the most significant discrepancies.", ], rubric=[ Rubric(score_range=(0, 2), expected_outcome="Nearly everything is incorrect or contradictory to the reference."), Rubric(score_range=(3, 4), expected_outcome="Substantial divergence from the reference; multiple errors but some truths remain."), Rubric(score_range=(5, 6), expected_outcome="Partially correct; at least one important element is wrong or missing."), Rubric(score_range=(7, 8), expected_outcome="Main facts are correct; only minor inaccuracies or ambiguities."), Rubric(score_range=(9, 10), expected_outcome="All statements align perfectly with the provided ground-truth reference or verifiable facts; zero errors.") ], evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT], model="gpt-4o" ) # å¹»èŠã®æç¡è©äŸ¡ (Hallucination) geval_hallucination = GEval( name="Hallucination", evaluation_steps=[ "STEP1. **Highlight factual statements** â names, dates, statistics, citations, etc.", "STEP2. **Compare with provided context** and known reliable data.", "STEP3. **Label claims** as verified, unverifiable, or false.", "STEP4. **Estimate hallucination impact** â proportion and importance of unsupported content.", "STEP5. 
**Assign score** following the rubric and list specific hallucinated elements.", ], rubric=[ Rubric(score_range=(0, 2), expected_outcome="Response is dominated by fabricated or clearly false content."), Rubric(score_range=(3, 4), expected_outcome="Key parts rely on invented or unverifiable information."), Rubric(score_range=(5, 6), expected_outcome="Some unverified or source-less details appear, but core content is factual."), Rubric(score_range=(7, 8), expected_outcome="Contains minor speculative language that remains verifiable or harmless."), Rubric(score_range=(9, 10), expected_outcome="All content is grounded in the given context or universally accepted facts; no unsupported claims.") ], evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT], model="gpt-4o" ) await asyncio.to_thread(geval_answer_relevancy.measure, test_case) await asyncio.to_thread(geval_correctness.measure, test_case) await asyncio.to_thread(geval_hallucination.measure, test_case) # Rubricã¹ã³ã¢ãæšå®ãã颿°(衚瀺çš) def extract_rubric_score_from_normalized(normalized_score, rubric_list): """æ£èŠåãããã¹ã³ã¢(0.0-1.0)ããRubricã®ç¯å²ãç¹å®""" scaled_score = normalized_score * 10 for rubric_item in rubric_list: score_range = rubric_item.score_range if score_range[0] <= scaled_score <= score_range[1]: return { 'scaled_score': scaled_score, 'rubric_range': score_range, 'expected_outcome': rubric_item.expected_outcome } return None answer_relevancy_rubric_info = extract_rubric_score_from_normalized( geval_answer_relevancy.score, geval_answer_relevancy.rubric ) correctness_rubric_info = extract_rubric_score_from_normalized( geval_correctness.score, geval_correctness.rubric ) hallucination_rubric_info = extract_rubric_score_from_normalized( geval_hallucination.score, geval_hallucination.rubric ) return { "answer_relevancy_score": geval_answer_relevancy.score, "answer_relevancy_rubric_info": answer_relevancy_rubric_info, "answer_relevancy_reason": geval_answer_relevancy.reason, 
"correctness_score": geval_correctness.score, "correctness_rubric_info": correctness_rubric_info, "correctness_reason": geval_correctness.reason, "hallucination_score": geval_hallucination.score, "hallucination_rubric_info": hallucination_rubric_info, "hallucination_reason": geval_hallucination.reason, } async def generate_summary(client: openai.AsyncOpenAI, prompt_template: str, full_story: str, model: str = "gpt-4o") -> str: """LLMã䜿ã£ãŠèŠçŽãçæ""" prompt = prompt_template.format(context=full_story) try: response = await client.chat.completions.create( model=model, messages=[{"role": "user", "content": prompt}], max_tokens=300, temperature=0.0, top_p=0, logit_bias={} ) content = response.choices[0].message.content return content.strip() if content else "" except Exception as e: return f"Error: {str(e)}" async def process_prompt(client: openai.AsyncOpenAI, prompt_info: dict, full_story: str, context: list) -> dict: model = prompt_info.get("model", "gpt-4o") # èŠçŽçæ summary = await generate_summary(client, prompt_info["template"], full_story, model) # ãã¹ãã±ãŒã¹äœæ test_case = LLMTestCase( input=prompt_info["template"], # ããã³ãã actual_output=summary, # èŠçŽçµæ retrieval_context=context # èŠçŽå¯Ÿè±¡(童話)ã®åæ ) # è©äŸ¡å®è¡ metrics_result = await evaluate_comprehensive_metrics(client, test_case, prompt_info['name'], full_story) return { "prompt_name": prompt_info['name'], "model": model, "summary": summary, **metrics_result } async def main(): # 童話ã®åæãèªã¿èŸŒã¿ with open('little_red_riding_hood.txt', 'r', encoding='utf-8') as f: full_story = f.read().strip() context = [full_story] prompts = [ { "name": "prompt-01", "template": """以äžã®ããã¹ããèªãã§ãå
容ã®èŠçŽãäœæããŠãã ããã èŠä»¶ïŒ 1. äž»èŠãªç»å Žäººç©ãéèŠãªèŠçŽ ãç¹å®ããŠå«ãã 2. å
å®¹ã®æµããè«ççã«æŽçãã 3. éèŠãªåºæ¥äºã転æç¹ãå«ãã 4. åæã®å
容ã«å¿ å®ã§ããããš 5. èŠçŽã®é·ãã¯250æå以å
ã«åãã ããã¹ãïŒ {context} èŠçŽïŒ""", "model": "gpt-4o" }, ] async with openai.AsyncOpenAI() as client: tasks = [ process_prompt(client, prompt_info, full_story, context) for prompt_info in prompts ] all_results = await asyncio.gather(*tasks) # çµæè¡šç€ºåŠç ... if __name__ == "__main__": asyncio.run(main()) å®è¡ããèŠçŽçµæã¯äžèšã«ãªããŸããã æãèµ€ãããã¡ãããšããæããã女ã®åãããŸããã圌女ã¯ãã°ãããããèµ€ãããããããããããããã€ããã¶ã£ãŠããŸããã ããæ¥ãç

The resulting summary was as follows (translated from the Japanese output).

    Once upon a time there was a lovely little girl called Little Red Riding Hood. She always wore the red hood her grandmother had given her.
    One day, she heads through the forest to her grandmother's house to bring cakes and wine to her sick grandmother.
    On the way she meets a wolf and tells him where she is going. The wolf gets there first and swallows the grandmother, then tricks Little Red Riding Hood and swallows her too.
    However, a passing hunter cuts open the wolf's belly and rescues Little Red Riding Hood and her grandmother. Little Red Riding Hood learns her lesson and vows never to stray from the path in the forest again.

The results of the G-Eval evaluation were as follows (excerpt from the first run).

- Answer Relevancy: 0.912
  - Expected Outcome: Fully addresses every aspect of the user question; no missing or extraneous information and a clear, logical focus.
  - Reason: The summary includes key characters like Little Red Riding Hood, her grandmother, the wolf, and the hunter. It logically organizes the flow of events, such as the journey through the forest, the encounter with the wolf, and the rescue. Important events like the wolf's deception and the rescue by the hunter are covered. The summary is faithful to the original text and concise, with no extraneous information.
- Correctness: 0.901
  - Expected Outcome: All statements align perfectly with the provided ground-truth reference or verifiable facts; zero errors.
  - Reason: The main facts in the Actual Output align well with the Retrieval Context, including the characters, events, and moral of the story. Minor details like the specific dialogue and actions are slightly condensed but do not affect the overall accuracy.
- Hallucination: 0.903
  - Expected Outcome: All content is grounded in the given context or universally accepted facts; no unsupported claims.
  - Reason: The output closely follows the context with accurate details about Little Red Riding Hood, her grandmother, the wolf, and the hunter. The sequence of events and character actions are consistent with the context, with no unsupported claims.

Looking at the reasons behind the scores, each metric appears to have been evaluated accurately.

In "3.2. Selecting metrics" I noted that G-Eval's evaluation can fluctuate, so I ran the script above 50 times and plotted the measured scores as a scatter plot.

In the end, the scores for all metrics were roughly 0.9 or higher. Does that mean we can treat the SLI value of each metric as roughly 0.9 and set an SLO target of 0.9 or higher?
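
If one did want to turn such scores into an SLI, a conventional formulation is "good events divided by all events", where an event is one evaluated response. The sketch below assumes that formulation; the threshold, the SLO target, and the score list are hypothetical values, not the 50 measurements from the run above.

```python
def quality_sli(scores, good_threshold):
    """SLI as the fraction of evaluated responses scoring at or above the threshold."""
    if not scores:
        raise ValueError("at least one score is required")
    good = sum(1 for s in scores if s >= good_threshold)
    return good / len(scores)


def meets_slo(sli, slo_target):
    """True when the measured SLI satisfies the SLO target."""
    return sli >= slo_target


# Hypothetical per-response scores for one metric (e.g. Correctness).
sample_scores = [0.91, 0.90, 0.93, 0.87, 0.92, 0.90, 0.89, 0.94, 0.90, 0.91]

sli = quality_sli(sample_scores, good_threshold=0.9)
print(f"SLI = {sli:.2f}, SLO met: {meets_slo(sli, slo_target=0.95)}")
```

Whether such an SLI is meaningful still depends on the metric itself reflecting what customers value, which is the question taken up next.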
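
3.5. Reviewing the evaluation metrics

As introduced above, this is a service that summarizes Grimm's fairy tales in writing that even children can understand.

To make the summary above understandable to children, the following metrics would also need to be considered.

- Readability: does it avoid difficult kanji and expressions that children cannot read? For example 騙して (trick), 教訓 (moral lesson), ぶどう酒 (wine).
- Toxicity / Safety: measured against modern compliance standards, does it avoid expressions that are too extreme for children? For example お腹を切り開く (cutting open the belly).

Evaluation metrics need to be selected with a conscious effort to tie them closely to customer value and business KPIs.

For this summarization service, the metrics above should take priority over the general evaluation metrics: with the target audience in mind, they are the task-specific metrics to consider first.

The prompt will also need to be revised accordingly.

That said, building a perfect metric set on the first attempt is difficult. The Complete LLM Evaluation Playbook: How To Run LLM Evals That Matter says it is preferable to start with a single metric and eventually narrow the design down to five.

We need to select, measure, and evaluate metrics while staying aware of how well the metric scores match the Metric Outcome Fit (here, the outcome being that children use the service frequently).

(If this were a real service, providing images rather than text might well produce better results for the business KPI.)

3.6. Exploring the possibility of automation

In this example, humans selected the metrics, broke down the evaluation perspectives, calculated the evaluation scores, and reviewed the metrics.

G-Eval is described as a mechanism that has a GPT-4-class model break down the evaluation procedure on its own and return only the final score, so in place of a human it can automate applying the evaluation criteria, scoring, and aggregation in one shot.

An example procedure is as follows.

1. Present the evaluation task: give the evaluator LLM a task description such as "Score the generated text that follows from 1 to 5 according to the given evaluation criteria". At the same time, state the definition of the criteria explicitly to give the LLM context (for example, present the list of metrics from "1. General evaluation metrics for LLM services").
2. Break down the evaluation perspectives: have the LLM itself list the perspectives and steps needed for the metrics it selected in step 1.
3. Calculate the score: then have the model evaluate the actual input and output following the evaluation steps it just generated.

As a caveat, an LLM evaluator tends to overrate "LLM-like" output and is vulnerable to having its score manipulated by planting just a few words. Even if you try to mitigate this by evaluating with a different family of LLM, by pairwise comparison (placing two answers side by side and asking which is better), or by anomaly detection, complete neutrality cannot be guaranteed.

Also, as mentioned in "3.2. Selecting metrics", because G-Eval is a probabilistic evaluation method it has a reproducibility problem: the evaluation of the same answer can fluctuate. Measures such as fixing the evaluation prompt and the seed are needed.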
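
One way the automated stage could support human review is to flag the cases that most need attention, for example when two judges from different model families disagree or when a score sits close to the pass line. This is only a sketch of that idea: the function name and all three tolerances are hypothetical, and it is not a feature of DeepEval or G-Eval.

```python
def needs_human_review(
    primary_score,
    secondary_score,
    pass_threshold=0.9,
    max_disagreement=0.1,
    margin=0.05,
):
    """Flag an evaluation for priority human review.

    primary_score / secondary_score: scores for the same output from two
    different judge models (assumed to be on the same 0-1 scale).
    """
    judges_disagree = abs(primary_score - secondary_score) > max_disagreement
    borderline = abs(primary_score - pass_threshold) < margin
    return judges_disagree or borderline


print(needs_human_review(0.95, 0.60))  # judges disagree strongly
print(needs_human_review(0.97, 0.96))  # judges agree and score is clearly above the line
print(needs_human_review(0.91, 0.90))  # score is close to the pass threshold
```

Flagging does not replace review of the remaining cases; it only helps order the human workload.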

For these reasons, it is essential to take a two-stage approach in which the final judgment is always corrected and verified together with human review.

4. Summary

This article covered everything from selecting the metrics that are essential for evaluating the reliability of an LLM service to concrete measurement and evaluation methods, with a demo using the DeepEval library.

The reliability of an LLM service cannot be fully measured with conventional metrics such as availability and latency alone. How to define such metrics as SLIs is, I think, a new field for SREs as well.

The approach of evaluation tools such as DeepEval, which I tried in this article, is only one of many options. LLM evaluation metrics are still an area of very active research, and there is not yet a single correct answer to the question of how to measure the reliability of an LLM service.

Still, even if new evaluation metrics and new measurement methods are discovered in the future, I think the question "Does this metric really represent customer satisfaction?" will remain an essential one that does not change.

As the technology advances, I hope to keep that question in mind in my day-to-day SRE work.

Tomorrow's article is "We held an AI Hackathon at the Mercari Mobile Dev Offsite" by @k_kinukawa. Please look forward to it.

References

- Site Reliability Engineering Book: https://sre.google/books/
- LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
- The Accuracy Trap: Why Your Model's 90 % Might Mean Nothing: https://medium.com/%40edgar_muyale/the-accuracy-trap-why-your-models-90-might-mean-nothing-f3243fce6fe8
- The Complete LLM Evaluation Playbook: How To Run LLM Evals That Matter: https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook
- Levenshtein distance (レーベンシュタイン距離): https://note.com/noa813/n/nb7ffd5a8f5e9
- LLM evaluation metrics - BLEU, ROUGE and METEOR explained: https://avinashselvam.medium.com/llm-evaluation-metrics-bleu-rogue-and-meteor-explained-a5d2b129e87f
- BERTScore: https://openreview.net/pdf?id=SkeHuCVFDr
- BERT: https://en.wikipedia.org/wiki/BERT_(language_model)
- Cosine similarity (コサイン類似度): https://atmarkit.itmedia.co.jp/ait/articles/2112/08/news020.html
- MoverScore: https://arxiv.org/abs/1909.02622
- Earth Mover's Distance (optimal transport distance): https://zenn.dev/derwind/articles/dwd-optimal-transport01#%E6%9C%80%E9%81%A9%E8%BC%B8%E9%80%81%E8%B7%9D%E9%9B%A2
- G-Eval (Paper): https://arxiv.org/abs/2303.16634
- G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation: https://www.confident-ai.com/blog/g-eval-the-definitive-guide
- QAG Score: https://arxiv.org/abs/2210.04320
- SelfCheckGPT: https://arxiv.org/abs/2303.08896
- DAG (deep acyclic graph): https://deepeval.com/docs/metrics-dag
- Prometheus2 Model: https://arxiv.org/abs/2405.01535
- DeepEval: https://deepeval.com/docs/getting-started
- Vertex AI: Metric prompt templates for model-based evaluation: https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates
- Little Red Riding Hood (赤ずきん): https://ja.wikipedia.org/wiki/%E8%B5%A4%E3%81%9A%E3%81%8D%E3%82%93