I'm Shuhei Kato, and I work on speech processing research and development at RevComm. Have you ever struggled to hear the person on the other end of a phone call because they were somewhere noisy, such as outdoors or in a café? This article presents experiments toward a phone that stays easy to hear even when the other party is in a noisy environment. It uses NELE-GAN, a model that improves how well speech is heard in noise (speech intelligibility) while leaving the physical volume unchanged. By training the model on a large amount of call audio from our service MiiTel, we substantially improved performance over the baseline.

Note: This article is based on work the authors presented at the 2022 Spring Meeting of the Acoustical Society of Japan (Kato & Hashimoto, 2022).

Shuhei Kato
Senior Research Engineer. Joined RevComm in 2019 and is responsible for research and development centered on speech processing. A father of two who manages ADHD alongside his work.

Summary

We trained NELE-GAN on combinations of a very large amount of MiiTel call audio and a variety of noises (up to 33 times the amount of training data used in the experiments of the NELE-GAN paper). Training on this data kept speech intelligibility at the same level or improved it further, while substantially improving speech quality.

Background

In a loud environment, the same speech is harder to make out than in a quiet one. Speech intelligibility is a measure of how accurately the words and sentences conveyed by speech reach the listener, and it is known to drop in noisy environments.

The mechanism by which speech intelligibility drops is not yet understood. It is known, however, that changing properties of the speech such as its frequency characteristics can sometimes improve intelligibility. In fact, we humans do this unconsciously in noisy environments, a phenomenon known as the Lombard effect (Lombard, 1911) *1.

Transforming speech (performing speech enhancement) so as to improve intelligibility in noise, that is, intelligibility when the listener is in a noisy environment, is called near-end listening enhancement (NELE) *2. This article reports experiments with NELE-GAN (Li & Yamagishi, 2021), which was proposed only in 2021. The method uses a generative adversarial network (GAN) to learn a transformation *3 that raises several objective measures of speech intelligibility.

Li & Yamagishi (2021) trained on read English speech from one male and one female speaker (600 sentences each, 1,200 in total). In this article, to adapt the model better to telephone speech and at the same time improve its generalization performance *4, we train the model on a large amount of MiiTel call audio and compare its performance with a baseline (a model equivalent to the one published by the authors of Li & Yamagishi (2021)).

Example of speech enhanced with this method

(Audio samples: before enhancement, after enhancement)

The samples are speech reading out random digits, with noise added. The SNR *5 is the same before and after enhancement, yet the enhanced speech is easier to make out.

Method

For the details of NELE-GAN we refer the reader to Li & Yamagishi (2021). The model structure is shown in Figures 1 and 2.

Figure 1: Structure of the discriminator
Figure 2: Structure of the generator

NELE-GAN takes clean speech (speech containing almost no background noise) and noise as input. The discriminator outputs estimates of the objective metrics, and the generator produces a filter that changes the frequency characteristics of the speech.

The discriminator is trained to minimize the mean squared error loss (MSE loss) between the true values of the objective metrics, which are the output of the Q function (a function that computes several objective scores), and the estimates of those metrics output by the discriminator. The generator, in turn, is trained to minimize the MSE loss between the maximum attainable values of the objective metrics and the discriminator's estimates (the discriminator's weights are frozen during this step). Alternating these two steps makes it possible to learn a filter that maximizes the objective scores.

The Q function uses three objective metrics of speech intelligibility, plus two objective metrics that safeguard speech quality.

Objective metrics of speech intelligibility
- Speech intelligibility in bits (SIIB) (Kuyk et al., 2018)
- Hearing-aid speech perception index (HASPI) (Kates & Arehart, 2014)
- Extended short-time objective intelligibility (ESTOI) (Jensen & Taal, 2016)

Objective metrics of speech quality
- Perceptual evaluation of speech quality (PESQ) (Rix et al., 2001)
- Virtual speech quality objective listener (ViSQOL) (Hines et al., 2015)

Experiments

Speech data and noise data

For speech data, we took calls made through MiiTel in the course of our company's business and extracted only the speech segments from the channel of the party receiving calls placed by our company. Speakers on the receiving side are unlikely to overlap much, so the number of speakers can be regarded as roughly equal to the number of calls. The audio is not clean speech in the strict sense, but for this experiment we selected speech that was as clean as possible.

For training and validation, we prepared three speech datasets:
- S: 2,269 speech segments (466 calls, 4.2 hours)
- M: 9,004 speech segments (2,089 calls, 16.7 hours)
- L: 34,962 speech segments (7,322 calls, 66.7 hours)

Each larger dataset contains the smaller ones. For evaluation, we prepared a dataset of 116 speech segments (18 calls, 0.56 hours) that was not used for training or validation.

For noise data we used the Microsoft Scalable Noisy Speech Dataset (MS-SNSD) (Reddy et al., 2019), as in Li & Yamagishi (2021). For training and validation we prepared two noise sets:
- a: three types of speech-type noise (airport, babble, neighbor speaking)
- b: set a (three types of speech-type noise) plus two types of transport-type noise (traffic, station)

Noise was superimposed on the speech at SNR = −10 dB, −5 dB, and 0 dB. For evaluation we prepared four noise sets:
- closed: the same noise types as training/validation set a (three types of speech-type noise, but different samples)
- acoust: two types of speech-type noise not included in training/validation set a (bus, cafe)
- crowd: two types of transport-type noise not included in training/validation set b (field, metro)
- office: three types of office-type noise (air conditioner, copy machine, typing)

Here noise was superimposed on the speech at SNR = −12 dB, −9 dB, −6 dB, −3 dB, 0 dB, and +3 dB.

As a result, we used six datasets, S_a through L_b, for training and validation, and four datasets for evaluation (Tables 1 and 2). Each sample count is the number of speech segments multiplied by the number of noise types and SNR levels (for example, 2,269 × 3 × 3 = 20,421 for S_a). S_a yields the model closest to the baseline (the model published by the authors of Li & Yamagishi (2021)). For each of the six training/validation datasets S_a through L_b, 320 randomly selected samples were used for validation and the remaining samples for training *6.

Table 1: Details of the training/validation datasets

| Dataset | Speech dataset | Noise dataset | SNR | Samples | Duration [h] |
| Li & Yamagishi (2021) | 1,320 samples (duration unknown) | (4 types) | (3 levels) | 15,840 | (unknown) |
| S_a | S (4.2 hours) | a (3 types) | −10 dB, −5 dB, 0 dB (3 levels) | 20,421 | 37.5 |
| S_b | S (4.2 hours) | b (5 types) | −10 dB, −5 dB, 0 dB (3 levels) | 34,035 | 62.5 |
| M_a | M (16.7 hours) | a (3 types) | −10 dB, −5 dB, 0 dB (3 levels) | 81,036 | 150 |
| M_b | M (16.7 hours) | b (5 types) | −10 dB, −5 dB, 0 dB (3 levels) | 135,060 | 250 |
| L_a | L (66.7 hours) | a (3 types) | −10 dB, −5 dB, 0 dB (3 levels) | 314,658 | 600 |
| L_b | L (66.7 hours) | b (5 types) | −10 dB, −5 dB, 0 dB (3 levels) | 524,430 | 1,000 |

Table 2: Details of the evaluation sets

| Dataset | Speech dataset | Noise dataset | SNR | Samples | Duration [h] |
| T_closed | (0.56 hours) | closed (3 types) | −12 dB, −9 dB, −6 dB, −3 dB, 0 dB, +3 dB (6 levels) | 2,088 | 10 |
| T_acoust | (0.56 hours) | acoust (2 types) | same 6 levels | 1,392 | 6.7 |
| T_crowd | (0.56 hours) | crowd (2 types) | same 6 levels | 1,392 | 6.7 |
| T_office | (0.56 hours) | office (3 types) | same 6 levels | 2,088 | 10 |

Training and experimental conditions

The sampling frequency of the speech and noise was 8 kHz *7, and the mini-batch size was 32. To stabilize and speed up training, the first epoch used only three metrics (SIIB, ESTOI, and PESQ), and all five metrics were used from the second epoch onward. From the third epoch onward, training was stopped once the objective scores on the validation set at the end of an epoch were lower than those at the end of the previous epoch, or had risen by 1% or less. The model with the highest mean objective score was used for evaluation.

In evaluation, the objective score for a set was taken to be the mean of the objective scores over the samples in that set.

Results

The results are shown in Figure 3. On every evaluation set (T_closed, T_acoust, T_crowd, T_office), compared with the speech before enhancement (unmodified), the enhanced speech (S_a through L_b) tends to score higher on the objective metrics of speech intelligibility (SIIB, HASPI, ESTOI) and lower on the objective metrics of speech quality (PESQ, ViSQOL).

Figure 3: Objective scores for each combination of condition and evaluation set

Next we look at the correlation of objective scores between evaluation sets (Tables 3 to 5). Taken separately for the intelligibility metrics (SIIB, HASPI, ESTOI) and the quality metrics (PESQ, ViSQOL), the correlation coefficients between evaluation sets are around 0.9 or higher, which is a very strong correlation.

Table 3: Correlation coefficients between evaluation sets, over all objective scores

| | T_acoust | T_crowd | T_office |
| T_closed | 0.569 | 0.511 | 0.946 |
| T_acoust | - | 0.994 | 0.769 |
| T_crowd | - | - | 0.720 |

Table 4: Correlation coefficients between evaluation sets, over the speech intelligibility scores

| | T_acoust | T_crowd | T_office |
| T_closed | 0.929 | 0.891 | 0.969 |
| T_acoust | - | 0.994 | 0.977 |
| T_crowd | - | - | 0.959 |

Table 5: Correlation coefficients between evaluation sets, over the speech quality scores

| | T_acoust | T_crowd | T_office |
| T_closed | 0.931 | 0.932 | 0.968 |
| T_acoust | - | 0.997 | 0.978 |
| T_crowd | - | - | 0.970 |

Change in objective scores with the amount and diversity of speech data used for training

How do the objective scores change when the amount and diversity of the training speech data change? To observe this, we compared the results on each evaluation set across the combinations (S_a, M_a, L_a) and (S_b, M_b, L_b) (Figures 4 and 5). Among the intelligibility metrics, SIIB and HASPI dropped slightly when the amount and diversity of speech data grew from S_a / S_b to M_a / M_b, but recovered to about the same level when they grew further to L_a / L_b. ESTOI rose monotonically as the amount and diversity of speech data grew. The quality metrics (PESQ, ViSQOL), on the other hand, rose from S_a / S_b to M_a / M_b, and then dropped slightly or stayed about the same at L_a / L_b.

Figure 4: Objective scores for the combination (S_a, M_a, L_a) under each condition
Figure 5: Objective scores for the combination (S_b, M_b, L_b) under each condition

Change in objective scores with the diversity of noise used for training

What about the change in objective scores when the diversity of the training noise changes? To observe this, we compared the results on each evaluation set across the combinations (S_a, S_b), (M_a, M_b), and (L_a, L_b) (Figures 6 to 8). The intelligibility metrics (SIIB, HASPI, ESTOI) dropped slightly when the noise was made more diverse from S_a / M_a to S_b / M_b, whereas from L_a to L_b they stayed about the same or rose only slightly. The quality metrics (PESQ, ViSQOL) rose sharply when the noise was made more diverse from S_a to S_b, but when going from M_a / L_a to M_b / L_b they stayed at roughly the same level, with some variation by evaluation set and metric.

Figure 6: Objective scores for the combination (S_a, S_b) under each condition
Figure 7: Objective scores for the combination (M_a, M_b) under each condition
Figure 8: Objective scores for the combination (L_a, L_b) under each condition

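The correlation coefficients between evaluation sets reported in Tables 3 to 5 can be illustrated with a minimal sketch. It assumes each evaluation set is represented by a vector of objective scores, one entry per (training condition, metric) pair, and that the coefficient is the Pearson correlation between those vectors. The article does not spell out the exact procedure, and the numbers below are hypothetical placeholders, not the experimental data.

```python
import numpy as np

# Hypothetical objective scores (NOT the article's data): one row per
# evaluation set, one column per (training condition, metric) pair.
eval_sets = ["T_closed", "T_acoust", "T_crowd", "T_office"]
scores = np.array([
    [0.42, 0.55, 0.61, 0.58, 0.63, 0.66],
    [0.40, 0.57, 0.60, 0.59, 0.65, 0.67],
    [0.38, 0.54, 0.59, 0.57, 0.62, 0.66],
    [0.45, 0.58, 0.64, 0.60, 0.66, 0.69],
])

# np.corrcoef treats each row as one variable and returns the matrix of
# Pearson correlation coefficients between rows.
corr = np.corrcoef(scores)


def upper_triangle(names, matrix):
    """Return {(name_i, name_j): r} for i < j, the layout used in Tables 3 to 5."""
    return {
        (names[i], names[j]): float(matrix[i, j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }


pairs = upper_triangle(eval_sets, corr)
for (first, second), r in pairs.items():
    print(f"{first} vs {second}: r = {r:.3f}")
```

Restricting the columns to the intelligibility metrics or to the quality metrics before calling `np.corrcoef` would correspond to the split between Tables 4 and 5, under the same assumption about how the vectors are built.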
Discussion

Generalization performance

The objective scores were very strongly correlated between evaluation sets. This shows that changes in the amount and diversity of speech data, and in the diversity of noise, affect model performance in a similar way across different families of noise. Considering also that the objective scores on the evaluation sets with unseen noise types (T_acoust, T_crowd, T_office) exceeded those on the evaluation set with seen noise types (T_closed), the results suggest that NELE-GAN generalizes well across noise types.

Change in speech intelligibility and speech quality with the amount and diversity of speech data

An interesting trend appeared when the amount and diversity of speech data were changed. Even when training on relatively small speech data (S_a / S_b), speech intelligibility improved sufficiently, but speech quality degraded considerably. When the data was scaled up (M_a / M_b), intelligibility dropped slightly at first while quality recovered substantially. When the data was scaled up further (L_a / L_b), intelligibility recovered or improved further while quality was largely maintained. From this we can say that training NELE-GAN on a very large and diverse set of speech data improves model performance compared with training on a relatively small amount of data.

Change in speech intelligibility and speech quality with the diversity of noise

The effect of noise diversity on speech intelligibility and speech quality was pronounced when the speech data was relatively small, and shrank as the speech data grew larger. The reason is difficult to infer from the results of this experiment alone and requires further study.

Conclusion

In this article, we examined how model performance changes when NELE-GAN is trained on a larger and more diverse set of speech data. We also examined how model performance changes when the diversity of the noise superimposed on the speech is varied. The results show that using a very large and diverse set of speech data, compared with a relatively small amount of data, keeps speech intelligibility at the same level or improves it further while substantially improving speech quality.

How the model would perform with even larger speech data is an open question that we leave for future work.

Publication

Kato, S., & Hashimoto, T. (2022). An investigation of the effects of the amount and diversity of speech data used to train NELE-GAN [in Japanese]. Proceedings of the 2022 Spring Meeting of the Acoustical Society of Japan, 1025–1028.

References

Glasberg, B. R., & Moore, B. C. J. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1–2), 103–138. https://doi.org/10.1016/0378-5955(90)90170-T

Hines, A., Skoglund, J., Kokaram, A. C., & Harte, N. (2015). ViSQOL: An Objective Speech Quality Model. EURASIP Journal on Audio, Speech, and Music Processing, 2015(13), 1–18.

Jensen, J., & Taal, C. H. (2016). An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11), 2009–2022. https://doi.org/10.1109/TASLP.2016.2585878

Kates, J. M., & Arehart, K. H. (2014). The Hearing-Aid Speech Perception Index (HASPI). Speech Communication, 65, 75–93. https://doi.org/10.1016/j.specom.2014.06.002

Kuyk, S., Kleijn, W. B., & Hendriks, R. C. (2018). An Instrumental Intelligibility Metric Based on Information Theory. IEEE Signal Processing Letters, 25(1), 115–119. https://doi.org/10.1109/LSP.2017.2774250

Li, H., & Yamagishi, J. (2021). Multi-Metric Optimization using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3000–3011. https://doi.org/10.1109/TASLP.2021.3111566

Lombard, É. (1911). Le signe de l'élévation de la voix. Annales des Maladies de L'Oreille et du Larynx, XXXVII(2), 101–109.

Reddy, C. K. A., Beyrami, E., Pool, J., Cutler, R., Srinivasan, S., & Gehrke, J. (2019). A Scalable Noisy Speech Dataset and Online Subjective Test Framework. Proc. INTERSPEECH, 1816–1820. https://doi.org/10.21437/Interspeech.2019-3087

Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual Evaluation of Speech Quality (PESQ) - A New Method for Speech Quality Assessment of Telephone Networks and Codecs. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), II, 749–752. https://doi.org/10.1109/ICASSP.2001.941023

Shiba, S., Tachibana, R., & Okanoya, K. (2015). Lombard effect and changes in fundamental frequency in the song vocalizations of Bengalese finches [in Japanese]. IPSJ SIG Technical Report, 2015-MUS-107(37), 1–3.

*1: Going further, the Lombard effect is known to be observed in some non-human animals as well (Shiba et al., 2015).
*2: The term "speech enhancement" more often refers to suppressing noise or reverberation in a speech signal that is mixed with them.
*3: Specifically, it applies a weight to each bin of a filter bank based on the equivalent rectangular bandwidth (ERB) scale (Glasberg & Moore, 1990), with the number of bins fixed in advance.
*4: The ability of a model to perform well across a wide range of situations rather than only in specific ones. Here it means being able to convert speech into an easy-to-hear voice regardless of the speaker's voice or way of speaking.
*5: Signal-to-noise ratio (S/N ratio). The ratio of the power of the signal (here, the speech signal) to the power of the noise; the smaller the value, the louder the noise.
*6: The speech training/validation set S, with the noise training/validation set (a or b) superimposed, was sorted by noise type and then by speech sample (each assigned a random identifier), and the first 320 samples were used as the validation set shared by S / M / L.
*7: Li & Yamagishi (2021) used 16 kHz, but because the speech in this experiment is telephone call audio, the sampling frequency is limited to 8 kHz, the rate used when encoding telephone speech signals.
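The SNR definition in footnote *5, together with the step of superimposing noise on speech at a specified SNR, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used in the experiments: the function names `mix_at_snr` and `measure_snr_db` are hypothetical, power is taken as the mean squared amplitude over the whole signal, and noise shorter than the speech is simply repeated.

```python
import numpy as np


def measure_snr_db(speech, noise):
    """SNR in dB: ratio of speech power to noise power (footnote *5)."""
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it to `speech`.

    Returns (mixture, scaled_noise). Assumption: noise is tiled or trimmed
    to the length of the speech before scaling.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise if it is shorter than the speech, then trim to length.
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")

    # Noise power needed so that 10*log10(speech_power / noise_power) == target.
    wanted_noise_power = speech_power / (10.0 ** (target_snr_db / 10.0))
    scaled_noise = noise * np.sqrt(wanted_noise_power / noise_power)
    return speech + scaled_noise, scaled_noise


# Usage with synthetic stand-ins for one second of 8 kHz audio.
rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)
noise = rng.standard_normal(3000)
for level in (-10.0, -5.0, 0.0):  # the training SNRs listed in the article
    mixture, scaled = mix_at_snr(speech, noise, level)
    print(f"target {level:+.0f} dB, measured {measure_snr_db(speech, scaled):+.2f} dB")
```

A lower target SNR gives the noise a larger gain, which matches the footnote's remark that smaller values mean louder noise.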