The standard Japanese analyzer for Elasticsearch is Kuromoji, but other analyzers for Japanese exist as well. This article compares Sudachi, MeCab, the Python library Janome, and an LLM (GPT-4) as alternatives, and examines which one to use in which situation.

Note that for Elasticsearch 9.x, officially supported plugins for Sudachi and MeCab have not been released yet. For that reason, this article performs morphological analysis in a Python environment first and only then indexes the result into Elasticsearch. Janome, on the other hand, is a pure Python implementation and can be used without any additional plugin.

Table of contents

- Analyzers compared
- Step 0: Preparation
- Python environment and text setup
- Token normalization and cleaning function
- Comparison strategy
- Definition of the tokenization functions (common output format)
- Comprehensive tokenizer comparison
- Kuromoji (baseline) output
- Sudachi A/B/C, MeCab, Janome, and GPT-4 output
- Analysis summary and statistics
- 1. Token count comparison table
- 2. Unique token analysis
- Discussion
- Summary

Analyzers compared

1. Elasticsearch plugin
- Kuromoji: the standard Japanese analyzer for Elasticsearch

2. Python libraries (analysis done in advance on the application side)
- SudachiPy: supports three split granularities (A/B/C)
- MeCab: a fast morphological analyzer with a long track record
- Janome: pure Python and easy to install

3. LLM (external API)
- OpenAI GPT-4o: flexible segmentation that takes context into account

Step 0: Preparation

1. Create a virtual environment (using uv)

    curl -LsSf https://astral.sh/uv/install.sh | sh
    mkdir analyzer-project && cd analyzer-project
    uv venv
    source .venv/bin/activate

2. Install the required Python libraries (python-dotenv is included because the code below imports load_dotenv from dotenv)

    uv pip install elasticsearch janome openai sudachipy sudachidict_core mecab-python3 python-dotenv

3. Check the Elasticsearch plugin (Kuromoji)

    bin/elasticsearch-plugin install analysis-kuromoji

4. Install MeCab itself (macOS)

    brew install mecab mecab-ipadic

5. Set the OpenAI API key

    export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Python environment and text setup

    import os
    import sys
    import platform
    import re
    from dotenv import load_dotenv

    # Import the working elasticsearch connection utility
    # (elastic_conection is a local helper module; it is not listed in this article)
    from elastic_conection import es, test_connection

    # Japanese text analysis libraries
    import MeCab
    from janome.tokenizer import Tokenizer
    from sudachipy import tokenizer as sudachi_tokenizer
    from sudachipy import dictionary as sudachi_dictionary

    # OpenAI for LLM comparison
    from openai import OpenAI

    # Environment and target text setup
    print("Python environment info")
    print("="*30)
    print(f"Python Version: {sys.version.split()[0]}")
    print(f"Platform: {platform.platform()}")
    print(f"Architecture: {platform.machine()}")

    # Check if in virtual environment
    if hasattr(sys, 'real_prefix') or (hasattr(sys, 'base_prefix') and sys.base_prefix != sys.prefix):
        print("Running inside a virtual environment")
    else:
        print("Not running inside a virtual environment")
    print(f"Interpreter: {sys.executable}")

    # Define the target text for analysis
    TARGET_TEXT = """ロキソニン錠の説明書ですね。ロキソニンは、解熱鎮痛作用のある非ステロイド性抗炎症薬（NSAIDs）で、
    痛みや発熱、炎症を抑える効果があります。具体的には、関節リウマチ、変形性関節症、腰痛症、肩こり、
    歯痛、手術後や外傷後の炎症や痛み、風邪による熱や痛みなどに用いられます。"""

    print(f"\nTarget text: {TARGET_TEXT}")
    print(f"Character count: {len(TARGET_TEXT)}")
    print()

In the actual script the continuation lines of TARGET_TEXT start at column 0; they are indented here only as part of the listing.
Output:

    Python environment info
    ==============================
    Python Version: 3.13.3
    Platform: macOS-15.5-arm64-arm-64bit-Mach-O
    Architecture: arm64
    Running inside a virtual environment
    Interpreter: /Users/*****/es-analyzer-project/.venv/bin/python

    Target text: ロキソニン錠の説明書ですね。ロキソニンは、解熱鎮痛作用のある非ステロイド性抗炎症薬（NSAIDs）で、
    痛みや発熱、炎症を抑える効果があります。具体的には、関節リウマチ、変形性関節症、腰痛症、肩こり、
    歯痛、手術後や外傷後の炎症や痛み、風邪による熱や痛みなどに用いられます。
    Character count: 137

Token normalization and cleaning function

Kuromoji is used as the reference (baseline), and the results of the other analyzers are cleaned down to Kuromoji's level before they are compared.

Comparison strategy

1. Kuromoji = baseline
- No cleaning: Kuromoji's raw output is used as is
- Reason: it is already optimized for Elasticsearch and removes stop words by itself
- Role: serves as the target level for the other analyzers

2. Other analyzers = cleaned to Kuromoji's level
- MeCab, SudachiPy, Janome, OpenAI: the cleaning function is applied
- Goal: bring the output to a quality level equivalent to Kuromoji's
- Comparison: similarity to Kuromoji is measured after cleaning

Cleaning steps (analyzers other than Kuromoji)

1. Removal of particles and auxiliary verbs: function words such as 「は」「が」「に」「を」 are deleted
2. Removal of punctuation and symbols: 「、」「。」「（」「）」 and similar are removed
3. Removal of stop words: words that carry little meaning for search are deleted
4. Removal of whitespace and numbers: whitespace and purely numeric tokens are removed. The patterns in the code also drop purely alphabetic tokens, which is why NSAIDs disappears from the cleaned lists further below.

Expected effects

- Fair comparison: all analyzers are compared at the same quality level
- Practicality: the evaluation assumes real use in a search engine
- Optimization effect: shows how much each analyzer improves after cleaning
- Kuromoji's advantage: shows the benefit of its optimization as an Elasticsearch plugin

    def clean_tokens_like_kuromoji(tokens):
        """
        Clean tokens to match Kuromoji's behavior by removing stop words and punctuation.

        Args:
            tokens (list): List of token strings

        Returns:
            list: Cleaned tokens with stop words and punctuation removed
        """
        # Japanese stop words and particles commonly filtered out
        stop_words = {
            'は', 'の', 'が', 'に', 'を', 'で', 'と', 'から', 'まで', 'より', 'へ', 'や', 'か', 'も',
            'て', 'た', 'だ', 'である', 'です', 'ます', 'した', 'する', 'いる', 'ある', 'なる',
            'この', 'その', 'あの', 'どの', 'これ', 'それ', 'あれ', 'どれ',
            'ここ', 'そこ', 'あそこ', 'どこ', 'こう', 'そう', 'ああ', 'どう',
            'という', 'といった', 'による', 'において', 'について', 'に関して', 'に対して', 'に関する',
            # Honorifics and fillers (four two-kana entries in this group were
            # unreadable in the source listing and are left out)
            'さん', 'ちゃん', 'くん', 'さま',
            'はい', 'いいえ', 'うん', 'ううん', 'えー', 'あー', 'うー', 'おー'
        }

        # Punctuation and symbols to remove
        punctuation_patterns = [
            r'^[、。，．.,!?;:()（）\[\]「」『』【】〈〉《》〔〕…‥・]+$',  # Pure punctuation
            r'^[ー\-~〜]+$',   # Long vowel marks and dashes
            r'^[　\s]+$',      # Whitespace (including full-width)
            r'^\d+$',          # Pure numbers
            r'^[a-zA-Z]+$',    # Pure alphabet
            r'^[０-９]+$',     # Full-width numbers
            r'^[ａ-ｚＡ-Ｚ]+$'  # Full-width alphabet
        ]

        cleaned = []
        for token in tokens:
            # Skip empty and whitespace-only tokens (e.g. '\n')
            if not token or not token.strip():
                continue

            # Remove stop words
            if token in stop_words:
                continue

            # Remove punctuation and unwanted patterns
            is_punctuation = False
            for pattern in punctuation_patterns:
                if re.match(pattern, token):
                    is_punctuation = True
                    break

            if not is_punctuation:
                cleaned.append(token)

        return cleaned

    print("Token cleaning function defined")

Tokenizer initialization and connection check

Each tokenizer is initialized and checked for correct operation.

Initialization targets:

1. Elasticsearch + Kuromoji
- Connect to the local Elasticsearch server (localhost:9200)
- Confirm that the Kuromoji analyzer is available

2. MeCab
- Set the IPA dictionary path for a Homebrew environment
- Initialize in wakati (word-splitting) mode (-O wakati)
3. SudachiPy
- Load the standard dictionary
- Create a tokenizer object that supports modes A/B/C

4. Janome
- Initialize the pure Python tokenizer
- Simple setup with no external dependencies

5. OpenAI GPT-4o
- Read the API key from an environment variable
- Confirm the connection to the GPT-4 model

Note: if a tool cannot be initialized, an error message is printed.

    # === System connection check and tokenizer initialization ===
    print("=== System connection check ===")

    # Elasticsearch connection and initialization
    print("Connecting to Elasticsearch...")
    try:
        # Import the ES client from elastic_conection.py
        from elastic_conection import es, test_connection

        # Connection test
        if test_connection():
            es_available = True
            print("Elasticsearch connection succeeded")
        else:
            es_available = False
            print("Elasticsearch connection failed")
            print("Fix: start the Elasticsearch server")
            print("  brew services start elasticsearch")
    except Exception as e:
        es_available = False
        print(f"Elasticsearch connection failed: {e}")
        print("Fix: start the Elasticsearch server")
        print("  brew services start elasticsearch")

    # Initialize each tokenizer
    print("\nInitializing tokenizers...")

    # MeCab initialization (Homebrew environment)
    tagger = None
    try:
        import MeCab

        # Configurations to try for a Homebrew environment (including the correct mecabrc path).
        # Note: tokenize_with_mecab() expects wakati output; if one of the non-wakati
        # fallbacks is selected, its output would need extra parsing.
        configs_to_try = [
            "-r /opt/homebrew/etc/mecabrc -Owakati",  # Homebrew mecabrc + wakati mode
            "-r /opt/homebrew/etc/mecabrc",           # Homebrew mecabrc only
            "-Owakati",                               # wakati mode only
            "",                                       # default settings
            "-d /opt/homebrew/lib/mecab/dic/ipadic",  # ipadic dictionary path (Homebrew)
            "-d /usr/local/lib/mecab/dic/ipadic",     # ipadic dictionary path (legacy location)
        ]

        for config in configs_to_try:
            try:
                print(f"  Trying MeCab config: '{config if config else 'default'}'")
                tagger = MeCab.Tagger(config)
                # Run a test parse to confirm that it works
                test_result = tagger.parse("テスト")
                if test_result and len(test_result.strip()) > 0:
                    mecab_config = config if config else "default settings"
                    print(f"MeCab initialized (config: {mecab_config})")
                    print(f"  Test result: {test_result.strip()}")
                    break
            except Exception as e:
                print(f"  Config failed: {e}")
                tagger = None  # do not keep a tagger from an earlier failed attempt
                continue

        if not tagger:
            print("MeCab initialization failed: MeCab could not be initialized with any configuration")
            print("Fix: brew install mecab mecab-ipadic")
    except ImportError:
        print("MeCab initialization failed: MeCab is not installed")
        print("Fix: brew install mecab mecab-ipadic")

    # SudachiPy initialization
    tokenizer_obj = None
    try:
        from sudachipy import tokenizer
        from sudachipy import dictionary
        tokenizer_obj = dictionary.Dictionary().create()
        print("SudachiPy initialized")
    except Exception as e:
        print(f"SudachiPy initialization failed: {e}")

    # Janome initialization
    janome_tokenizer = None
    try:
        from janome.tokenizer import Tokenizer
        janome_tokenizer = Tokenizer()
        print("Janome initialized")
    except Exception as e:
        print(f"Janome initialization failed: {e}")

    # OpenAI API initialization
    openai_client = None
    try:
        import openai
        from dotenv import load_dotenv
        import os
        load_dotenv()
        api_key = os.getenv("OPENAI_API_KEY")
        if api_key:
            openai_client = openai.OpenAI(api_key=api_key)
            print("OpenAI API initialized")
        else:
            print("OpenAI API initialization failed: API key is not set")
            print("Fix: set OPENAI_API_KEY in the .env file")
    except Exception as e:
        print(f"OpenAI API initialization failed: {e}")

    # Initialization summary
    print("\nInitialization summary:")
    print(f"  Elasticsearch: {'OK' if es_available else 'NG'}")
    print(f"  MeCab: {'OK' if tagger else 'NG'}")
    print(f"  SudachiPy: {'OK' if tokenizer_obj else 'NG'}")
    print(f"  Janome: {'OK' if janome_tokenizer else 'NG'}")
    print(f"  OpenAI: {'OK' if openai_client else 'NG'}")
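The setup and initialization code imports `es` and `test_connection` from a local module named `elastic_conection`, which the article does not list. The following is a minimal sketch of what such a helper might look like, not the author's actual file. The `ES_URL` environment variable, the default URL, and the optional `client` parameter are assumptions made for illustration; a secured cluster would additionally need credentials or an API key.

```python
# elastic_conection.py: hypothetical reconstruction, not the author's actual file
import os

try:
    from elasticsearch import Elasticsearch
except ImportError:
    # Allow the module to be imported even when the client library is missing
    Elasticsearch = None

# Assumed environment variable; falls back to the local server named in the article
ES_URL = os.getenv("ES_URL", "http://localhost:9200")

# Constructing the client does not contact the server yet
es = Elasticsearch(ES_URL) if Elasticsearch is not None else None


def test_connection(client=None):
    """Return True if the cluster answers a ping, False otherwise."""
    client = client if client is not None else es
    if client is None:
        return False
    try:
        return bool(client.ping())
    except Exception:
        return False
```

Keeping the ping inside a `try` block means the notebook can report "NG" for Elasticsearch and still run the Python-side tokenizers.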
Definition of the tokenization functions (common output format)

- tokenize_with_kuromoji(text)
- tokenize_with_mecab(text)
- tokenize_with_sudachi(text, mode)
- tokenize_with_janome(text)
- tokenize_with_openai(text)

Every function implements error handling and returns an empty list on failure. The article refers to the LLM as both GPT-4 and GPT-4o; the code shown here requests the gpt-4 model.

    def tokenize_with_kuromoji(text):
        """Tokenize text using Elasticsearch Kuromoji analyzer"""
        if not es_available:
            print("Kuromoji tokenization skipped: Elasticsearch not available")
            return []
        try:
            response = es.indices.analyze(
                body={
                    "analyzer": "kuromoji",
                    "text": text
                }
            )
            return [token['token'] for token in response['tokens']]
        except Exception as e:
            print(f"Kuromoji tokenization error: {e}")
            return []

    def tokenize_with_mecab(text):
        """Tokenize text using MeCab"""
        if not tagger:
            print("MeCab tokenization skipped: MeCab not available")
            return []
        try:
            # Assumes wakati output: tokens separated by whitespace
            result = tagger.parse(text).strip().split()
            return [token for token in result if token]
        except Exception as e:
            print(f"MeCab tokenization error: {e}")
            return []

    def tokenize_with_sudachi(text, mode='C'):
        """Tokenize text using SudachiPy with specified mode (A, B, or C)"""
        if not tokenizer_obj:
            print("SudachiPy tokenization skipped: SudachiPy not available")
            return []
        try:
            mode_map = {'A': sudachi_tokenizer.Tokenizer.SplitMode.A,
                        'B': sudachi_tokenizer.Tokenizer.SplitMode.B,
                        'C': sudachi_tokenizer.Tokenizer.SplitMode.C}
            tokens = tokenizer_obj.tokenize(text, mode_map[mode])
            return [token.surface() for token in tokens]
        except Exception as e:
            print(f"SudachiPy tokenization error: {e}")
            return []

    def tokenize_with_janome(text):
        """Tokenize text using Janome"""
        if not janome_tokenizer:
            print("Janome tokenization skipped: Janome not available")
            return []
        try:
            tokens = janome_tokenizer.tokenize(text, wakati=True)
            return list(tokens)
        except Exception as e:
            print(f"Janome tokenization error: {e}")
            return []

    def tokenize_with_openai(text):
        """Tokenize text using OpenAI GPT-4"""
        if not openai_client:
            print("OpenAI tokenization skipped: OpenAI client not available")
            return []
        try:
            prompt = f"""
    Please tokenize the following Japanese text into meaningful segments.
    Return only a comma-separated list of tokens, no explanations.

    Text: {text}

    Tokens:"""
            response = openai_client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
                temperature=0
            )
            result = response.choices[0].message.content.strip()
            tokens = [token.strip() for token in result.split(',')]
            return [token for token in tokens if token]
        except Exception as e:
            print(f"OpenAI tokenization error: {e}")
            return []

    print("All tokenization functions defined (with availability checks)")
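All five functions return a plain list of token strings. Because the approach described at the start of the article is to analyze text in Python first and index it afterwards, one possible way to hand such a list to Elasticsearch is to join the tokens with spaces and map the field to the built-in `whitespace` analyzer, so that Elasticsearch splits exactly where the Python tokenizer did. The sketch below only builds the request bodies. The helper name and the field names `body` and `body_tokens` are hypothetical, and this is not necessarily how the author indexed the data.

```python
def build_pretokenized_index(text, tokens):
    """Build (mappings, document) for text that was tokenized in Python.

    The field names are illustrative assumptions, not taken from the article.
    """
    mappings = {
        "properties": {
            # Original text, stored for display only
            "body": {"type": "text", "index": False},
            # Space-joined tokens, split again by the built-in whitespace analyzer
            "body_tokens": {"type": "text", "analyzer": "whitespace"},
        }
    }
    # Drop empty or whitespace-only tokens (e.g. '\n') so they cannot add stray splits
    kept = [t for t in tokens if t.strip()]
    document = {"body": text, "body_tokens": " ".join(kept)}
    return mappings, document


mappings, document = build_pretokenized_index(
    "痛みや発熱", ["痛み", "や", "\n", "発熱"]
)
print(document["body_tokens"])  # 痛み や 発熱
```

With the official Python client, these two dictionaries could then be passed along the lines of `es.indices.create(index=..., mappings=mappings)` and `es.index(index=..., document=document)`. Queries would have to be tokenized with the same Python tokenizer before searching, otherwise the query terms would not line up with the indexed tokens.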
Comprehensive tokenizer comparison

    def compare_all_tokenizers(text):
        """Compare all tokenizers on the given text - using Kuromoji as baseline (no cleaning)"""
        print(f"Target text: {text}")
        print("=" * 80)

        results = {}

        # 1. Kuromoji (Elasticsearch) - BASELINE (no cleaning applied)
        print("\n1. Kuromoji (Elasticsearch) - baseline")
        kuromoji_tokens = tokenize_with_kuromoji(text)
        results['kuromoji'] = {
            'raw': kuromoji_tokens,
            'cleaned': kuromoji_tokens  # No cleaning - use as baseline
        }
        print(f"Raw/Baseline ({len(kuromoji_tokens)}): {kuromoji_tokens}")
        print("Kuromoji is used as the baseline (no cleaning)")

        # 2. SudachiPy (all modes) - cleaned to match Kuromoji behavior
        for mode in ['A', 'B', 'C']:
            print(f"\n2. SudachiPy Mode {mode} - adjusted to the Kuromoji baseline")
            sudachi_tokens = tokenize_with_sudachi(text, mode)
            sudachi_cleaned = clean_tokens_like_kuromoji(sudachi_tokens)
            results[f'sudachi_{mode}'] = {
                'raw': sudachi_tokens,
                'cleaned': sudachi_cleaned
            }
            print(f"Raw ({len(sudachi_tokens)}): {sudachi_tokens}")
            print(f"Cleaned ({len(sudachi_cleaned)}): {sudachi_cleaned}")

        # 3. MeCab - cleaned to match Kuromoji behavior
        print(f"\n3. MeCab - adjusted to the Kuromoji baseline")
        mecab_tokens = tokenize_with_mecab(text)
        mecab_cleaned = clean_tokens_like_kuromoji(mecab_tokens)
        results['mecab'] = {
            'raw': mecab_tokens,
            'cleaned': mecab_cleaned
        }
        print(f"Raw ({len(mecab_tokens)}): {mecab_tokens}")
        print(f"Cleaned ({len(mecab_cleaned)}): {mecab_cleaned}")

        # 4. Janome - cleaned to match Kuromoji behavior
        print(f"\n4. Janome - adjusted to the Kuromoji baseline")
        janome_tokens = tokenize_with_janome(text)
        janome_cleaned = clean_tokens_like_kuromoji(janome_tokens)
        results['janome'] = {
            'raw': janome_tokens,
            'cleaned': janome_cleaned
        }
        print(f"Raw ({len(janome_tokens)}): {janome_tokens}")
        print(f"Cleaned ({len(janome_cleaned)}): {janome_cleaned}")

        # 5. OpenAI GPT-4 - cleaned to match Kuromoji behavior
        if openai_client:
            print(f"\n5. OpenAI GPT-4 - adjusted to the Kuromoji baseline")
            openai_tokens = tokenize_with_openai(text)
            openai_cleaned = clean_tokens_like_kuromoji(openai_tokens)
            results['openai'] = {
                'raw': openai_tokens,
                'cleaned': openai_cleaned
            }
            print(f"Raw ({len(openai_tokens)}): {openai_tokens}")
            print(f"Cleaned ({len(openai_cleaned)}): {openai_cleaned}")
        else:
            print(f"\n5. OpenAI GPT-4 (skipped - API key not available)")

        # Comparison summary with Kuromoji as baseline
        print(f"\nComparison against the Kuromoji baseline:")
        kuromoji_baseline = set(results['kuromoji']['cleaned'])

        for name, data in results.items():
            if name != 'kuromoji':
                cleaned_tokens = set(data['cleaned'])
                overlap = len(kuromoji_baseline & cleaned_tokens)
                total_unique = len(kuromoji_baseline | cleaned_tokens)
                similarity = (overlap / total_unique * 100) if total_unique > 0 else 0
                print(f"  {name:<12}: {similarity:.1f}% similarity to Kuromoji baseline")

        return results

    # Run the comparison
    analysis_results = compare_all_tokenizers(TARGET_TEXT)

Kuromoji (baseline) output

- Token count: 42
- Characteristics: fine-grained segmentation, well suited to indexing in Elasticsearch

Sudachi A/B/C, MeCab, Janome, and GPT-4 output

- Sudachi: granularity changes with the mode
- GPT-4o: segments differently, favoring lexical units
- MeCab and Janome: split finely and show high similarity to Kuromoji

    1. Kuromoji (Elasticsearch) - baseline
    Raw/Baseline (42): ['ロキソニン', '錠', '説明', '書', 'ロキソニン', '解熱', '鎮痛', '作用', '非', 'ステロイド', '性', '抗', '炎症', '薬', 'nsaids', '痛み', '発熱', '炎症', '抑える', '効果', '具体', '的', '関節', 'リウマチ', '変形', '性', '関節', '症', '腰痛', '症', '肩こり', '歯痛', '手術', '後', '外傷', '後', '炎症', '痛む', '風邪', '熱', '痛み', '用いる']
    Kuromoji is used as the baseline (no cleaning)

    2. SudachiPy Mode A
    Raw (86): ['ロキソニン', '錠', 'の', '説明', '書', 'です', 'ね', '。', 'ロキソニン', 'は', '、', '解熱', '鎮痛', '作用', 'の', 'ある', '非', 'ステロイド', '性', '抗', '炎症', '薬', '（', 'NSAIDs', '）', 'で', '、', '\n', '痛', 'み', 'や', '発熱', '、', '炎症', 'を', '抑える', '効果', 'が', 'あり', 'ます', '。', '具体', '的', 'に', 'は', '、', '関節', 'リウマチ', '、', '変形', '性', '関節', '症', '、', '腰痛', '症', '、', '肩こり', '、', '\n', '歯痛', '、', '手術', '後', 'や', '外傷', '後', 'の', '炎症', 'や', '痛', 'み', '、', '風邪', 'に', 'よる', '熱', 'や', '痛', 'み', 'など', 'に', '用い', 'られ', 'ます', '。']
    Cleaned (49): ['ロキソニン', '錠', '説明', '書', 'ね', 'ロキソニン', '解熱', '鎮痛', '作用', '非', 'ステロイド', '性', '抗', '炎症', '薬', '痛', 'み', '発熱', '炎症', '抑える', '効果', 'あり', '具体', '的', '関節', 'リウマチ', '変形', '性', '関節', '症', '腰痛', '症', '肩こり', '歯痛', '手術', '後', '外傷', '後', '炎症', '痛', 'み', '風邪', 'よる', '熱', '痛', 'み', 'など', '用い', 'られ']

    2. SudachiPy Mode B
    Raw (78): ['ロキソニン', '錠', 'の', '説明書', 'です', 'ね', '。', 'ロキソニン', 'は', '、', '解熱', '鎮痛', '作用', 'の', 'ある', '非', 'ステロイド', '性', '抗', '炎症', '薬', '（', 'NSAIDs', '）', 'で', '、', '\n', '痛み', 'や', '発熱', '、', '炎症', 'を', '抑える', '効果', 'が', 'あり', 'ます', '。', '具体的', 'に', 'は', '、', '関節', 'リウマチ', '、', '変形性', '関節症', '、', '腰痛症', '、', '肩こり', '、', '\n', '歯痛', '、', '手術', '後', 'や', '外傷', '後', 'の', '炎症', 'や', '痛み', '、', '風邪', 'に', 'よる', '熱', 'や', '痛み', 'など', 'に', '用い', 'られ', 'ます', '。']
    Cleaned (41): ['ロキソニン', '錠', '説明書', 'ね', 'ロキソニン', '解熱', '鎮痛', '作用', '非', 'ステロイド', '性', '抗', '炎症', '薬', '痛み', '発熱', '炎症', '抑える', '効果', 'あり', '具体的', '関節', 'リウマチ', '変形性', '関節症', '腰痛症', '肩こり', '歯痛', '手術', '後', '外傷', '後', '炎症', '痛み', '風邪', 'よる', '熱', '痛み', 'など', '用い', 'られ']

    2. SudachiPy Mode C
    Raw (78): ['ロキソニン', '錠', 'の', '説明書', 'です', 'ね', '。', 'ロキソニン', 'は', '、', '解熱', '鎮痛', '作用', 'の', 'ある', '非', 'ステロイド', '性', '抗', '炎症', '薬', '（', 'NSAIDs', '）', 'で', '、', '\n', '痛み', 'や', '発熱', '、', '炎症', 'を', '抑える', '効果', 'が', 'あり', 'ます', '。', '具体的', 'に', 'は', '、', '関節', 'リウマチ', '、', '変形性', '関節症', '、', '腰痛症', '、', '肩こり', '、', '\n', '歯痛', '、', '手術', '後', 'や', '外傷', '後', 'の', '炎症', 'や', '痛み', '、', '風邪', 'に', 'よる', '熱', 'や', '痛み', 'など', 'に', '用い', 'られ', 'ます', '。']
    Cleaned (41): ['ロキソニン', '錠', '説明書', 'ね', 'ロキソニン', '解熱', '鎮痛', '作用', '非', 'ステロイド', '性', '抗', '炎症', '薬', '痛み', '発熱', '炎症', '抑える', '効果', 'あり', '具体的', '関節', 'リウマチ', '変形性', '関節症', '腰痛症', '肩こり', '歯痛', '手術', '後', '外傷', '後', '炎症', '痛み', '風邪', 'よる', '熱', '痛み', 'など', '用い', 'られ']

    3. MeCab
    Raw (80): ['ロキソニン', '錠', 'の', '説明', '書', 'です', 'ね', '。', 'ロキソニン', 'は', '、', '解熱', '鎮痛', '作用', 'の', 'ある', '非', 'ステロイド', '性', '抗', '炎症', '薬', '（', 'NSAIDs', '）', 'で', '、', '痛み', 'や', '発熱', '、', '炎症', 'を', '抑える', '効果', 'が', 'あり', 'ます', '。', '具体', '的', 'に', 'は', '、', '関節', 'リウマチ', '、', '変形', '性', '関節', '症', '、', '腰痛', '症', '、', '肩こり', '、', '歯痛', '、', '手術', '後', 'や', '外傷', '後', 'の', '炎症', 'や', '痛み', '、', '風邪', 'による', '熱', 'や', '痛み', 'など', 'に', '用い', 'られ', 'ます', '。']
    Cleaned (45): ['ロキソニン', '錠', '説明', '書', 'ね', 'ロキソニン', '解熱', '鎮痛', '作用', '非', 'ステロイド', '性', '抗', '炎症', '薬', '痛み', '発熱', '炎症', '抑える', '効果', 'あり', '具体', '的', '関節', 'リウマチ', '変形', '性', '関節', '症', '腰痛', '症', '肩こり', '歯痛', '手術', '後', '外傷', '後', '炎症', '痛み', '風邪', '熱', '痛み', 'など', '用い', 'られ']

    4. Janome
    Raw (82): ['ロキソニン', '錠', 'の', '説明', '書', 'です', 'ね', '。', 'ロキソニン', 'は', '、', '解熱', '鎮痛', '作用', 'の', 'ある', '非', 'ステロイド', '性', '抗', '炎症', '薬', '（', 'NSAIDs', '）', 'で', '、', '\n', '痛み', 'や', '発熱', '、', '炎症', 'を', '抑える', '効果', 'が', 'あり', 'ます', '。', '具体', '的', 'に', 'は', '、', '関節', 'リウマチ', '、', '変形', '性', '関節', '症', '、', '腰痛', '症', '、', '肩こり', '、', '\n', '歯痛', '、', '手術', '後', 'や', '外傷', '後', 'の', '炎症', 'や', '痛み', '、', '風邪', 'による', '熱', 'や', '痛み', 'など', 'に', '用い', 'られ', 'ます', '。']
    Cleaned (45): ['ロキソニン', '錠', '説明', '書', 'ね', 'ロキソニン', '解熱', '鎮痛', '作用', '非', 'ステロイド', '性', '抗', '炎症', '薬', '痛み', '発熱', '炎症', '抑える', '効果', 'あり', '具体', '的', '関節', 'リウマチ', '変形', '性', '関節', '症', '腰痛', '症', '肩こり', '歯痛', '手術', '後', '外傷', '後', '炎症', '痛み', '風邪', '熱', '痛み', 'など', '用い', 'られ']

    5. OpenAI GPT-4o
    Raw (40): ['ロキソニン錠', 'の', '説明書', 'です', 'ね', '。', 'ロキソニン', 'は', '、', '解熱鎮痛作用', 'の', 'ある', '非ステロイド性抗炎症薬', '（', 'NSAIDs', '）', 'で', '、', '痛み', 'や', '発熱', '、', '炎症', 'を', '抑える', '効果', 'が', 'あります', '。', '具体的', 'に', 'は', '、', '関節リウマチ', '、', '変形性関節症', '、', '腰痛症', '、', '肩こり']
    Cleaned (17): ['ロキソニン錠', '説明書', 'ね', 'ロキソニン', '解熱鎮痛作用', '非ステロイド性抗炎症薬', '痛み', '発熱', '炎症', '抑える', '効果', 'あります', '具体的', '関節リウマチ', '変形性関節症', '腰痛症', '肩こり']

Two points are worth keeping in mind when reading these lists. Sudachi modes B and C produce identical output for this text. The GPT list stops at 肩こり, so it covers only the first two of the three lines of the target text (possibly because the response hit the max_tokens=200 limit); its token counts and similarity figures are therefore not directly comparable with those of the other analyzers.

Analysis summary and statistics

The tokenization results are analyzed quantitatively to clarify the characteristics of each analyzer.
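The similarity percentages printed at the end of `compare_all_tokenizers` are Jaccard coefficients over token sets, so duplicate tokens are ignored. As a minimal sketch of that calculation, the following recomputes the GPT figure from the Kuromoji baseline list and the cleaned GPT list shown in the output above. The function name `jaccard_percent` is introduced here for illustration only.

```python
def jaccard_percent(tokens_a, tokens_b):
    """Jaccard coefficient of two token lists, as a percentage of the union size."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union) * 100


# Kuromoji baseline output (42 tokens, 34 distinct)
kuromoji = [
    'ロキソニン', '錠', '説明', '書', 'ロキソニン', '解熱', '鎮痛', '作用', '非',
    'ステロイド', '性', '抗', '炎症', '薬', 'nsaids', '痛み', '発熱', '炎症',
    '抑える', '効果', '具体', '的', '関節', 'リウマチ', '変形', '性', '関節', '症',
    '腰痛', '症', '肩こり', '歯痛', '手術', '後', '外傷', '後', '炎症', '痛む',
    '風邪', '熱', '痛み', '用いる',
]

# Cleaned GPT output (17 tokens)
openai_cleaned = [
    'ロキソニン錠', '説明書', 'ね', 'ロキソニン', '解熱鎮痛作用',
    '非ステロイド性抗炎症薬', '痛み', '発熱', '炎症', '抑える', '効果',
    'あります', '具体的', '関節リウマチ', '変形性関節症', '腰痛症', '肩こり',
]

print(f"{jaccard_percent(kuromoji, openai_cleaned):.1f}%")  # 15.9%
```

Seven tokens are shared and the union holds 44 distinct tokens, which gives 7 / 44 = 15.9%.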
容 : 1. ããŒã¯ã³æ°æ¯èŒè¡š Raw Tokens : åã¢ãã©ã€ã¶ãŒã®çããŒã¯ã³æ° Cleaned Tokens : ã¯ãªãŒãã³ã°åŸã®ããŒã¯ã³æ° Effectiveness : ã¯ãªãŒãã³ã°å¹æçïŒãã€ãºé€å»çïŒ 2. ãŠããŒã¯ããŒã¯ã³åæ åã¢ãã©ã€ã¶ãŒåºæã®ããŒã¯ã³ : ä»ã§ã¯æ€åºãããªãç¬èªããŒã¯ã³ å
šã¢ãã©ã€ã¶ãŒå
±éããŒã¯ã³ : ãã¹ãŠã§äžèŽããåºæ¬ããŒã¯ã³ ããŒã¯ã³å€æ§æ§ : å
šäœã§ã®èªåœã«ãã¬ããž åæææš : ããŒã¯ã³ç·æ° : åææ³ã®åå²ç²åºŠã®éã å
±é床 : ã¢ãã©ã€ã¶ãŒéã®äžèŽç ç¬èªæ§ : åææ³ã®ç¹åŸŽçãªåå²ãã¿ãŒã³ æŽ»çšæ¹æ³ : æ€çŽ¢ã·ã¹ãã : å
±éããŒã¯ã³ã¯æ€çŽ¢ç²ŸåºŠåäžã«å¯äž NLPåŠç : çšéã«å¿ããæé©ã¢ãã©ã€ã¶ãŒéžæ å質è©äŸ¡ : ããŒã¯ã³åã®äžè²«æ§ã»ä¿¡é Œæ§è©äŸ¡ Tokenization Results Summary - Kuromoji Baseline ========================================================================================== Tokenizer Raw Tokens Final Tokens vs Kuromoji Note ------------------------------------------------------------------------------------------ kuromoji 42 42 100.0% ããŒã¹ã©ã€ã³ sudachi_A 86 49 71.4% ã¯ãªãŒãã³ã°æžã¿ sudachi_B 78 41 53.3% ã¯ãªãŒãã³ã°æžã¿ sudachi_C 78 41 53.3% ã¯ãªãŒãã³ã°æžã¿ mecab 80 45 79.5% ã¯ãªãŒãã³ã°æžã¿ janome 82 45 79.5% ã¯ãªãŒãã³ã°æžã¿ openai 40 17 15.9% ã¯ãªãŒãã³ã°æžã¿ äžã®è¡šã®èªã¿è§£ãæ¹ïŒ Raw Tokens (çããŒã¯ã³æ°) ã¢ãã©ã€ã¶ãŒãããã¹ããæåã«åå²ããçŽåŸã®ããŒã¯ã³ïŒåèªïŒã®ç·æ°ã§ãããã®æ°å€ã倧ããã»ã©ããã现ããåèªãåå²ããŠããããšã瀺ããŸãã Final Tokens (æçµããŒã¯ã³æ°) ãçããŒã¯ã³ãããå©è©ïŒãã¯ãããããªã©ïŒãå¥èªç¹ãèšå·ãšãã£ãæ€çŽ¢ãã€ãºã«ãªããããäžèŠãªããŒã¯ã³ãåãé€ããïŒã¯ãªãŒãã³ã°ããïŒåŸã®æ°ã§ãã ãã®ã¯ãªãŒãã³ã°åŠçã«ãããåã¢ãã©ã€ã¶ãŒãå
¬å¹³ãªåä¿µã§æ¯èŒã§ããããã«ãªããŸãã vs Kuromoji (Kuromojiãšã®é¡äŒŒåºŠ) ã¯ãªãŒãã³ã°åŸã®ããŒã¯ã³ã»ããããåºæºã§ããKuromojiã®ããŒã¯ã³ã»ãããšã©ãã ã䌌ãŠãããã瀺ãå²åïŒJaccardä¿æ°ïŒã§ãã èšç®åŒ : (äž¡è
ã«å
±éããããŒã¯ã³æ°) ÷ (ã©ã¡ããäžæ¹ã«ã§ãååšãããŠããŒã¯ãªããŒã¯ã³ç·æ°) ãã®ããŒã»ã³ããŒãžãé«ãã»ã©ããã®ã¢ãã©ã€ã¶ãŒã®åå²çµæãKuromojiãšäŒŒãŠããããšãæå³ããŸããäŸãã°ãMeCabãšJanomeã¯ã¯ãªãŒãã³ã°åŸã«80%ã®é¡äŒŒåºŠãšãªããKuromojiãšéåžžã«è¿ãçµæãåºããŠããããšãããããŸãã ð Unique Tokens Analysis (Cleaned Results vs Kuromoji Baseline) ====================================================================== sudachi_A: ['ãã', 'ãªã©', 'ã', 'ã¿', 'ãã', 'ãã', 'çšã', 'ç'] sudachi_B: ['ãã', 'ãªã©', 'ã', 'ãã', 'ãã', 'å
·äœç', 'å€åœ¢æ§', 'çšã', 'è
°çç', 'èª¬ææž', 'é¢ç¯ç'] sudachi_C: ['ãã', 'ãªã©', 'ã', 'ãã', 'ãã', 'å
·äœç', 'å€åœ¢æ§', 'çšã', 'è
°çç', 'èª¬ææž', 'é¢ç¯ç'] mecab: ['ãã', 'ãªã©', 'ã', 'ãã', 'çšã'] janome: ['ãã', 'ãªã©', 'ã', 'ãã', 'çšã'] openai: ['ãããŸã', 'ã', 'ãããœãã³é ', 'å
·äœç', 'å€åœ¢æ§é¢ç¯ç', 'è
°çç', 'è§£ç±é®çäœçš', 'èª¬ææž', 'é¢ç¯ãªãŠãã', 'éã¹ããã€ãæ§æççè¬'] Kuromoji unique tokens (not found in cleaned versions of others): kuromoji: ['nsaids', 'çšãã', 'çã'] Common tokens across all tokenizers (after cleaning): ['ãããœãã³', '广', 'æãã', 'çç', 'çºç±', 'è©ãã'] ð çµ±èšãµããªãŒ: ð KuromojiããŒã¹ã©ã€ã³ç·ããŒã¯ã³æ°: 34 ð å
šã¢ãã©ã€ã¶ãŒå
±éããŒã¯ã³æ°: 6 ð å
±é床: 17.6% äžèšã¯ãåã¢ãã©ã€ã¶ãŒã®ã¯ãªãŒãã³ã°åŸããŒã¯ã³ããããKuromojiã®ããŒã¯ã³ããåŒããæ®ãã®ãªã¹ãã§ããã€ãŸãã Kuromojiã«ã¯ãªããããã®ã¢ãã©ã€ã¶ãŒã ããçæãããŠããŒã¯ãªããŒã¯ã³ ã§ããåããŒã«ã®èŸæžãåå²ã«ãŒã«ã®éããèŠãŠãšããŸãã Kuromojiã ããçæããããŒã¯ã³ ã®åŸã«ãã¯ãªãŒãã³ã°åŸã« ãã¹ãŠã®ã¢ãã©ã€ã¶ãŒãå
±éããŠçæããããŒã¯ã³ ã§ãããããã¯ãã©ã®ããŒã«ã䜿ã£ãŠãåå²çµæãå€ãããªããæç« ã®æ žãšãªãéèŠãªåèªãšèšããŸãã èå¯ Kuromoji / MeCab / Janome ãå°éè·ãâãå°éããè·ãããçè·åž«ãâãçè·ããåž«ãã®ããã«ãåèªã现ããåå²ããŸãã ããã«ããæ€çŽ¢ãããçãåäžããéšåäžèŽæ€çŽ¢ã匷調衚瀺ã«é©ããŠããŸãã Sudachi ãå°éè·ãããçè·åž«ããªã©ã®è€åèªã1ããŒã¯ã³ãšããŠä¿æããŸãã æå³ã®ãŸãšãŸããéèŠããåæã«åããŠãããã¢ãŒãåãæ¿ãïŒA/B/CïŒã§ç²åºŠã調æŽã§ããŸãã Cã¢ãŒã : ãããè¯ãæ€çŽ¢äœéšãæäŸããããã1èªæ±ãã«ãªããããç¹å®ã®è€åèªã§ã®æ€çŽ¢ã«ã¯äžåããªå ŽåããããŸãã Bã¢ãŒã : å®çšé¢ã§ãã©ã³ã¹ã®åããç²åºŠãæäŸããŸãã LLMïŒGPT-4oïŒ æèçè§£ã«åºã¥ããåãã¡æžããå¯èœã§ãã ãä»è·çŠç¥å£«ãããèªç¥çããªã©ãèªåœãšããŠèªç¶ãªãŸãšãŸãã§åºåãããŸãã ããŒã¯ã³ã®äžè²«æ§ããªããããElasticsearchã®ã€ã³ããã¯ã¹çšéã«ã¯äžåãã§ãããæå³çè§£ã質åå¿çã«æé©ã§ãã ãŸãšã æ¥æ¬èªæ€çŽ¢ã«ãããŠãã©ã®ã¢ãã©ã€ã¶ãŒã䜿ããã¯ãæ€çŽ¢ãããå
容ããšãæ±ããç²åºŠãã«ãã£ãŠå€ãããŸãã 现ããäžèŽãããããªã Kuromoji æ€çŽ¢æããã®ãŸãŸäžèŽãããããªã Sudachi Cã¢ãŒã ãã©ã³ã¹éèŠãªã Sudachi Bã¢ãŒã The post æ¥æ¬èªã¢ãã©ã€ã¶ãŒã®æ¯èŒïŒKuromojiã»Sudachiã»MeCabã»Janomeã»LLMã®æ§èœæ€èšŒ first appeared on Elastic Portal .