This article is a translation of "Measuring the accuracy of rule or ML-based matching in AWS Entity Resolution" (published September 29, 2025).

How do you determine whether an entity matching rule set or model is actually accurate enough? Whether they are evaluating multiple identity providers or building their own matching rules, companies need to establish a clear baseline for the accuracy level they want to achieve, along with a framework for objectively measuring and comparing different approaches. Companies that do not measure their identity process objectively can extend implementation time by weeks or even months, repeating steps or making costly changes to how they measure accuracy.

In this post, we explain and demonstrate an approach for testing model accuracy using the proprietary machine learning (ML) algorithm that is an existing feature of AWS Entity Resolution. AWS Entity Resolution helps you match, link, and enhance related customer, product, business, or healthcare records stored across multiple applications, channels, and data stores. We also provide everything you need to reproduce the results using your own data or a synthetic open source dataset.

This framework gives you a way to evaluate matching accuracy quickly. The process can be applied to any entity matching process you are trying to benchmark.

Why accuracy matters

First, what do we mean by accuracy? In this post, accuracy refers to how often a process correctly identifies records that belong to the same person without incorrectly matching records that belong to different people. In other words, a perfectly accurate solution matches every duplicate record belonging to the same person, misses no fragments, and never pulls in extra data from records that belong to someone else.

This is an intuitive concept, but it is hard to measure consistently. When many companies embark on customer data deduplication and unification projects, they have no consistent methodology or metrics for accurately measuring true negatives and true positives. Many also lack a reliable dataset of personal information that captures the complex edge cases that occur in customer data.

It is possible to score 100% on accuracy metrics with 100% clean data or a sufficiently small sample set. With real-world data and larger data volumes, however, there are countless edge cases with genuine ambiguity, so matching with 100% accuracy is not realistic. Companies therefore need to set measurable accuracy thresholds so that they do not fall into endless implementation cycles chasing an impossible goal.

Today, companies receive more fragmented data than ever. Mobile app taps, online clicks, and authenticated sessions all generate data that helps companies understand consumer behavior, personalize experiences, and optimize operations. Companies that can bring this data together into a unified view of the customer can use those insights to deliver better personalized experiences and to make more informed product, marketing, and sales decisions.

Companies focused on getting more value from their data have a wide range of entity matching tools and services to choose from. Yet they often stall for months, or even years, while evaluating and implementing a solution.

One of the first obstacles is the lack of a robust, consistent framework for evaluating identity matching frameworks and approaches. How should a company decide which identity matching method best suits its own data? The stakes around accuracy keep rising. Customers expect personalized experiences from the brands and companies they use frequently. Accuracy benchmarks also differ from company to company, depending on industry and use case.

For a company to trust that its identity resolution process meets its specific needs, it must build a benchmark of trusted data that includes the edge cases actually seen in production data, and it must define accuracy based on the kinds of records it receives from every system that collects customer data.

Ground truth dataset
The most widely accepted way to evaluate the accuracy of a matching process is to compare its results against a manually annotated ground truth dataset (also called a truth set). In AI, a ground truth dataset is data that is known to be factual and that represents the expected outcome for the use case of the system being modeled.

For this use case, the ground truth dataset is a small subset of record pairs that humans have manually reviewed and annotated to indicate whether they should match. It does not need to be large, but it must be large enough to include a representative set of the cases that occur frequently in your data.

However, because ground truth datasets require personally identifiable information (PII), companies must be careful about sharing or using them in a proof of concept, and must confirm that all necessary security protocols are in place. Compared with other benchmark datasets, a ground truth dataset yields better, reproducible results.

Challenges in measuring identity resolution accuracy

The value of a single view of the customer depends almost entirely on how faithfully it reflects the real world that the data is trying to represent.
However, when you look only at individual records, the criteria for judging accuracy can be unstable and can shift over time. Companies need a clear accuracy threshold when they build identity matching rules or use algorithms; otherwise they fall into an endless process of fixes and changes. These thresholds differ from customer to customer.

For example, consider two companies in the same industry that need to set different accuracy thresholds because of the context in which they collect data: two different retailers, Company A and Company B.

Company A is a brick-and-mortar grocery retailer that collects customer data from transaction events and a loyalty program. These transactions carry no data when cash is used, or they use credit card tokens that may be shared within a household or used by a business. In addition, when loyalty data is collected through a card-based loyalty program, cards may have blank or incomplete data, or many shared addresses and landline phone numbers associated with several different people. Because customers can receive loyalty benefits by presenting a new or shared card, they have no incentive to share accurate data.

Company A needs to test matching on record pairs that contain incomplete name data, credit card tokens, and a high proportion of shared data. Moreover, Company A is interested only in household-level matching for grouping, because that is the most accurate level of resolution it can achieve given how its customers share data.

Contrast this with Company B, an online retailer that conducts all transactions on its website. Nearly all of its customers authenticate with an email address and have a profile tied to their browsing behavior. Customers must share accurate address, name, and email values to actually receive what they purchased. Even individuals in the same household are likely to purchase under their own name and email address, because it is faster to receive receipts and initiate returns through a personal account and email address.

Unlike brick-and-mortar Company A, Company B can match users at the individual level. Because shipping addresses and phone numbers may be shared, it can require a higher proportion of shared attributes before declaring a match. It also has plenty of other reliable data with which to distinguish household members and users with overlapping data.

Both retailers would get the best results by creating their own ground truth dataset that reflects the scenarios present in their data, with their own threshold for acceptable matches. This set must include test cases for both fragments of a record that should be brought together (true positives) and records that share many features but must be kept separate (true negatives). To be used for evaluating matching, these test cases must be annotated as a ground truth dataset that indicates whether the records should match.

In the data science community, the most standard way to measure accuracy is a metric called the F1 score. The F1 score averages precision and recall, two key aspects of model performance against a ground truth dataset.

In the context of an entity matching model, precision refers to how well the model avoids false positive matches, which incorrectly bring together two records that are not matched in the ground truth dataset. Recall, in this context, refers to how well the model correctly brings together records that are grouped in the ground truth dataset. The ground truth dataset must therefore contain both record pairs that should be brought together and record pairs that share similarities but do not belong together (see Figure 1).

Figure 1: Table defining recall and precision

The F1 score is defined as the harmonic mean of precision and recall, and is calculated as follows:

F1 score = 2 × [(precision × recall) / (precision + recall)]

Precision is the ratio of true positives (correct matches) to the sum of true positives and false positives (incorrect matches). Recall is the ratio of true positives to the sum of true positives and false negatives (missed matches). The F1 score ranges from 0 to 1, and a higher value indicates a better balance of precision and recall. This balance matters because different industries prioritize precision and recall differently. For example, the healthcare industry often aims to minimize false positives (favoring precision), while the advertising industry focuses on maximizing true positives (favoring recall).
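As a quick worked illustration of these definitions, the following minimal sketch computes precision, recall, and the F1 score from raw counts. The function name `precision_recall_f1` and the example counts are hypothetical and chosen only for illustration; they do not come from a real benchmark run.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true positive, false positive,
    and false negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * (precision * recall) / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct matches, 10 incorrect matches, 30 missed matches.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.900 recall=0.750 f1=0.818
```

True negatives do not appear in any of the three formulas, so the F1 score is not inflated by a large number of easy non-matching pairs.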
Data evaluation walkthrough

There are not many public datasets that customers can use to test accuracy. One dataset often used for this kind of analysis is the Ohio Voter File.

The Ohio Voter File is a well-known public dataset for person matching. It contains 105,000 records drawn from Ohio voter information, including name, address, and date of birth.

Because it contains real data, the Ohio Voter File is the ground truth dataset most commonly used by developers across many entity matching solutions. However, it has several drawbacks that limit its usefulness as a proxy for real customer data: it lacks phone number and email fields, it has none of the common data quality problems such as unnormalized postal addresses, and its records tend to be very complete.

To cover these more complex, lower-quality examples, the AWS Entity Resolution Data Science team developed and open sourced a new synthetic dataset that more faithfully reproduces these difficult entity resolution scenarios. It is called BPID: A Benchmark for Personal Identity Deduplication. BPID is a far more challenging dataset, containing 20,000 synthetic records with complex patterns across the name, email, phone, address, and date of birth fields. BPID was presented at Empirical Methods in Natural Language Processing (EMNLP) 2024, one of the world's leading natural language processing conferences.

The following example demonstrates the steps for measuring the accuracy of the machine learning-based matching model that is a feature of AWS Entity Resolution. It uses the open source ground truth dataset from BPID.

Prerequisites

- An AWS account
- A basic understanding of data matching concepts
- Jupyter Notebook or a similar environment for running the analysis
- Knowledge of Python and data processing libraries

1. Download the data

First, download the data used for the test. Download and unzip the BPID dataset. The accuracy evaluation uses matching_dataset.jsonl. The following are example pairs from the BPID dataset:

    {"profile1": {"fullname": "corrie arreola", "email": ["c_orrie@bizdev.org", "c0rri3@gov.us"], "phone": ["03 1284418523"], "addr": [], "dob": "1953 11 09"}, "profile2": {"fullname": "arreola corrie", "email": ["arreola_2023@gmail.com"], "phone": [], "addr": ["45434 11478 jenny road tx 75155 0411 falconer", "100209 57 summer drive hollywood"], "dob": "09 nov 1953"}, "match": "True"}
    {"profile1": {"fullname": "elroy warner", "email": ["e.l.roy@private-domain.info"], "phone": [], "addr": ["21480 miser road seal cove tx 75109 0784 united states of america"], "dob": "2007 09 26"}, "profile2": {"fullname": "charlee warner", "email": ["charlee.smith@biz-tech.com"], "phone": [], "addr": ["21480 miser rd j646 seal cove tx 75109"], "dob": "09 2007"}, "match": "False"}
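Each line of the file is one JSON object holding two profiles and a string label. As a minimal sketch of reading one such line, the helper below (the name `parse_pair` is hypothetical) returns the two profiles and a boolean label. The sample line is the second example pair from the dataset.

```python
import json


def parse_pair(line: str) -> tuple[dict, dict, bool]:
    """Parse one JSONL line into (profile1, profile2, is_match)."""
    record = json.loads(line)
    # The label is the string "True" or "False". bool("False") would be True,
    # so compare against the literal string instead of casting.
    is_match = record["match"] == "True"
    return record["profile1"], record["profile2"], is_match


sample_line = (
    '{"profile1": {"fullname": "elroy warner", "email": ["e.l.roy@private-domain.info"], '
    '"phone": [], "addr": ["21480 miser road seal cove tx 75109 0784 united states of america"], '
    '"dob": "2007 09 26"}, '
    '"profile2": {"fullname": "charlee warner", "email": ["charlee.smith@biz-tech.com"], '
    '"phone": [], "addr": ["21480 miser rd j646 seal cove tx 75109"], "dob": "09 2007"}, '
    '"match": "False"}'
)

profile1, profile2, is_match = parse_pair(sample_line)
print(profile1["fullname"], "|", profile2["fullname"], "|", is_match)
# prints: elroy warner | charlee warner | False
```

This pair shares a surname and a street address but is labeled as a non-match, which is the kind of household-level overlap the benchmark is designed to exercise.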
2. Preprocess the test and ground truth datasets

Prepare two datasets from the test data: one to use as input, and one to use as the ground truth dataset for measuring accuracy after matching. matching_dataset.jsonl must be converted to the schema the processor requires. To prepare this data for use with AWS Entity Resolution, first load it into a local or virtual environment.

    import json
    import pandas as pd

    # The BPID dataset is published at zenodo.org/records/13932202
    raw_data_path = "./data_release/matching_dataset.jsonl"
    with open(raw_data_path) as f_in:
        raw_data = [json.loads(line) for line in f_in]

Next, flatten and transform the input records into a format that AWS Entity Resolution can read. The labels are saved to a separate file, as outlined below:

    # Find the largest number of values any profile has for each multi-valued field
    max_length_mapping = {"email": 0, "phone": 0, "addr": 0}
    for data_i in raw_data:
        for field in max_length_mapping:
            max_length_mapping[field] = max(
                max_length_mapping[field],
                len(data_i["profile1"][field]),
                len(data_i["profile2"][field]),
            )
    print(f"max_email_length={max_length_mapping['email']}")
    print(f"max_phone_length={max_length_mapping['phone']}")
    print(f"max_addr_length={max_length_mapping['addr']}")

    profile_list = []

Next, run the following script to separate and prepare the labels for calculating the accuracy score:

    label_list = []
    for i, data_i in enumerate(raw_data):
        p1, p2 = data_i["profile1"], data_i["profile2"]
        # Save the label for the pair
        label_list.append({"profile_id_1": f"pair{i}_0", "profile_id_2": f"pair{i}_1", "label": data_i["match"]})
        # Add both profiles; the suffix _0 or _1 identifies the side of the pair
        for side, p in enumerate([p1, p2]):
            p_json = {"profileid": f"pair{i}_{side}", "fullname": p["fullname"], "dob": p["dob"]}
            for attr in ["email", "phone", "addr"]:
                # Pad every multi-valued field to a fixed number of columns;
                # column names are 1-based, list indexes are 0-based
                for j in range(max_length_mapping[attr]):
                    p_json[f"{attr}_{j + 1}"] = p[attr][j] if j < len(p[attr]) else ""
            profile_list.append(p_json)

Next, save the processed profiles and labels as JSON files:

    # Save the profiles to a JSON Lines file
    with open("./data_release/BPID_matching_profiles_processed.jsonl", "w") as f_out:
        for p in profile_list:
            f_out.write(json.dumps(p) + "\n")

    # Save the labels to a JSON Lines file
    with open("./data_release/BPID_matching_label.jsonl", "w") as f_out:
        for label_pair in label_list:
            f_out.write(json.dumps(label_pair) + "\n")

After this preprocessing runs, the profile data (BPID_matching_profiles_processed.jsonl) looks like this:

    {"profileid": "pair0_0", "fullname": "corrie arreola", "dob": "1953 11 09", "email_1": "c_orrie@bizdev.org", "email_2": "c0rri3@gov.us", "email_3": "", "phone_1": "03 1284418523", "phone_2": "", "phone_3": "", "addr_1": "", "addr_2": "", "addr_3": "", "addr_4": "", "addr_5": ""}
    {"profileid": "pair0_1", "fullname": "arreola corrie", "dob": "09 nov 1953", "email_1": "arreola_2023@gmail.com", "email_2": "", "email_3": "", "phone_1": "", "phone_2": "", "phone_3": "", "addr_1": "45434 11478 jenny road tx 75155 0411 falconer", "addr_2": "100209 57 summer drive hollywood", "addr_3": "", "addr_4": "", "addr_5": ""}

The accompanying label file (BPID_matching_label.jsonl) looks like this:

    {"profile_id_1": "pair0_0", "profile_id_2": "pair0_1", "label": "True"}
    {"profile_id_1": "pair1_0", "profile_id_2": "pair1_1", "label": "False"}

3. Run the matching workflow

Once the test data has been transformed, run a matching workflow for the workflow you plan to evaluate.

The goal is to obtain matching results that indicate, positively or negatively, whether any two records in the input dataset belong to the same person. The steps for this vary by service.

Finally, run the AWS Entity Resolution matching workflow. The following is sample output from AWS Entity Resolution:

    InputSourceARN,ConfidenceLevel,addr1,addr2,addr3,addr4,addr5,dob,email1,email2,email3,fullname,phone1,phone2,phone3,profileid,RecordId,MatchID
    arn:aws:glue:us-west-2::table/yefan-bpid-benchmark/input,0.75296247,cookson tx 75110,,,,,2003 7,,,,,26806124715,3236000026,,pair1622_1,pair1622_1,6d08ce607181460584e2436e66660b2300003566935683247
    arn:aws:glue:us-west-2::table/yefan-bpid-benchmark/input,0.75296247,cookson tx 75110,,,,,2003 07 10,yong123@business-domain.co.uk,,,yong stearns,6807159172,6806124715,,pair1622_0,pair1622_0,6d08ce607181460584e2436e66660b2300003566935683247
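In the sample output, both pair1622 records carry the same MatchID, which is how the workflow signals that it grouped them. A minimal sketch of turning such an output file into a lookup from profile ID to MatchID could look like the following. The helper name `load_match_ids` is hypothetical, and the inline sample is a trimmed, made-up extract that keeps only the `profileid` and `MatchID` columns seen in the output above plus a few others.

```python
import csv
import io
from typing import TextIO


def load_match_ids(handle: TextIO) -> dict[str, str]:
    """Map each profileid to the MatchID assigned by the matching workflow."""
    reader = csv.DictReader(handle)
    return {row["profileid"]: row["MatchID"] for row in reader}


# Hypothetical, shortened output used only to exercise the helper.
sample_output = """InputSourceARN,ConfidenceLevel,profileid,RecordId,MatchID
hypothetical-source,0.75296247,pair1622_1,pair1622_1,match-a
hypothetical-source,0.75296247,pair1622_0,pair1622_0,match-a
hypothetical-source,,pair7_0,pair7_0,match-b
"""

match_ids = load_match_ids(io.StringIO(sample_output))
same_group = match_ids["pair1622_0"] == match_ids["pair1622_1"]
print(same_group)  # prints: True
```

With a real output file, the same helper would be called on a handle opened with `open(path, newline="")`, which is the form the `csv` module documentation recommends.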
4. Calculate the F1 score

After you obtain the matching results, you can calculate the F1 score metric from the label information in the raw data and the matching results. Each pair in the dataset matching_dataset.jsonl has a label of either "match" or "no match". For each pair, check whether the matching result agrees with the label, then assign the pair to one of four categories:

- True positive (TP): both the label and the matching result indicate "match"
- False positive (FP): the label is "no match" but the matching result is "match"
- True negative (TN): both the label and the matching result indicate "no match"
- False negative (FN): the label is "match" but the matching result is "no match"

Once you have the counts for these four types, you can calculate the following:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = 2 × [(precision × recall) / (precision + recall)]
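The categorization above can be sketched as follows. This is an illustrative sketch, not the notebook's own code: it assumes two records count as a predicted match when they share the same MatchID, and that a record missing from the output is unmatched. The function name `confusion_counts`, the tiny label list, and the MatchID values are all hypothetical.

```python
from collections import Counter


def confusion_counts(labels: list[dict], match_ids: dict[str, str]) -> Counter:
    """Count TP, FP, TN, and FN for labeled pairs given profileid -> MatchID."""
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for pair in labels:
        id_1 = match_ids.get(pair["profile_id_1"])
        id_2 = match_ids.get(pair["profile_id_2"])
        # Assumption: a shared MatchID means the workflow matched the records.
        predicted = id_1 is not None and id_1 == id_2
        actual = pair["label"] == "True"
        if actual and predicted:
            counts["TP"] += 1
        elif predicted:
            counts["FP"] += 1
        elif actual:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical labels in the format of BPID_matching_label.jsonl.
labels = [
    {"profile_id_1": f"pair{i}_0", "profile_id_2": f"pair{i}_1", "label": label}
    for i, label in enumerate(["True", "False", "True", "False", "True"])
]
# Hypothetical workflow output: pairs 0, 1, and 4 were grouped together.
match_ids = {
    "pair0_0": "a", "pair0_1": "a",
    "pair1_0": "b", "pair1_1": "b",
    "pair2_0": "c", "pair2_1": "d",
    "pair3_0": "e", "pair3_1": "f",
    "pair4_0": "g", "pair4_1": "g",
}

counts = confusion_counts(labels, match_ids)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
f1 = 2 * (precision * recall) / (precision + recall)
print(dict(counts), round(f1, 3))
```

On real data the divisions should be guarded against zero denominators, which occur when the workflow predicts no matches or the labels contain no matching pairs.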
Running your own benchmark test

You can reproduce these results by running the benchmark process yourself. The following is an overview of the steps and the notebook needed to run this process with AWS Entity Resolution for machine learning-based matching. The steps and data can be reused to evaluate the accuracy of rule-based matching workflows or of matching processes from other providers. Because the BPID data contains no real customer PII, it can also be used to evaluate providers that use an underlying reference graph.

For teams that want to improve their identity resolution process, we recommend the following:

- Download the BPID dataset for your own evaluation
- Explore the machine learning-based matching capability of AWS Entity Resolution
- Consider the F1 score as the primary metric when evaluating vendors

Conclusion

Measuring the accuracy of the rules and algorithms that companies use to unify customer data is very difficult. Most companies may have no annotated ground truth dataset to serve as a benchmark, and may lack a consistent methodology for measurement.

We demonstrated how to carry out a comprehensive accuracy evaluation of an identity resolution service using AWS Entity Resolution. We provided a benchmarking method, an open source dataset, and a step-by-step guide that lets readers reproduce the accuracy evaluation.

Contact your AWS representative to find out how we can help accelerate your business.

Further reading

- Resolve imperfect data with advanced rule-based fuzzy matching in AWS Entity Resolution
- How to build a unified view of the customer
- Guidance for rule recommendation generation for entity resolution on AWS

Travis Barnes

Travis is a Senior Product Manager (Technical) for AWS Entity Resolution, helping customers maximize the value of their data through advanced identity resolution technology. He has more than 10 years of experience building innovative products in identity and ad tech, and is passionate about solving complex data challenges that lead to real business outcomes.

Yefan Tao

Yefan is a Senior Applied Scientist specializing in large-scale entity resolution and information retrieval systems. Yefan develops robust, effective machine learning algorithms in natural language processing (NLP) and related fields. With many years of experience bridging research and practical application, Yefan focuses on solving complex data governance and identity challenges that push the limits of both efficiency and accuracy.

This post was translated by Solutions Architect Takahashi. The original post is here.