2024 幎 2 æã« æŽæ°ãããåæãæ¥æ¬èªçãšã㊠9 æã«åæ ããŸããïŒ ãã®èšäºã¯ãã³ã¹ãããŒã¹ã®æé©åãšã¯ãšãªçµæã®åå©çšãå«ã Amazon Athena ãšã³ãžã³ããŒãžã§ã³ 3 ã®å€æŽãåæ ããããã«ç¢ºèªããã³æŽæ°ãããŸããã Amazon Athena ã¯ããªãŒãã³ãœãŒã¹ã®ãã¬ãŒã ã¯ãŒã¯ã«åºã¥ãã察話ååæãµãŒãã¹ã§ãæšæºã® SQL ã䜿ã£ãŠ Amazon Simple Storage Service (Amazon S3) ã«æ ŒçŽããããªãŒãã³ããŒãã«ããã³ãã¡ã€ã«åœ¢åŒã®ããŒã¿ãç°¡åã«åæã§ããŸããAthena ã¯ãµãŒããŒã¬ã¹ãªã®ã§ãã€ã³ãã©ã¹ãã©ã¯ãã£ã®ç®¡çã¯äžèŠã§ãå®è¡ããã¯ãšãªã«å¯ŸããŠã®ã¿æéãæ¯æããŸããAthena ã¯äœ¿ãããããAmazon S3 å
ã®ããŒã¿ãæå®ããã¹ããŒããå®çŸ©ããã°ãæšæº SQL ã䜿ã£ãŠã¯ãšãªãå®è¡ã§ããŸãã ãã®æçš¿ã§ã¯ãã¯ãšãªã®ããã©ãŒãã³ã¹ãåäžãããããã®ãã³ãã®ããã10ã玹ä»ããŸããAmazon S3 ãžã® ããŒã¿ä¿å ãšã¯ãšãªç¹æã® ãã¥ãŒãã³ã° ã«é¢é£ããåŽé¢ã«çŠç¹ãåœãŠãŸãã ã¹ãã¬ãŒãž æ¬ã»ã¯ã·ã§ã³ã§ã¯ãAthena ã§æå€§éã®å¹æãåŸãããã®ããŒã¿ã®æ§é åæ¹æ³ã«ã€ããŠç޹ä»ããŸããAmazon S3 ã«ããŒã¿ãä¿åããå Žåãåãå®è·µæ¹æ³ã Amazon EMR ã®ããŒã¿åŠçã¢ããªã±ãŒã·ã§ã³ (SparkãTrinoãPrestoãHive ãªã©) ã«ãé©çšã§ããŸãã以äžã®ãã¹ããã©ã¯ãã£ã¹ã«ã€ããŠèª¬æããŸãã ããŒã¿ã®ããŒãã£ã·ã§ã³åå² ããŒã¿ã®ãã±ããå å§çž®ã®å©çš ãã¡ã€ã«ãµã€ãºã®æé©å åæåã®ãã¡ã€ã«åœ¢åŒã®å©çš 1. ããŒã¿ã®ããŒãã£ã·ã§ã³åå² ããŒãã£ã·ã§ã³åå²ã¯ãããŒãã«ãè«ççãªããŒãã£ã·ã§ã³ã«åå²ããæ¥ä»ãåœãå°åãªã©ã®ã«ã©ã å€ã«åºã¥ããŠé¢é£ããããŒã¿ãåãããŒãã£ã·ã§ã³ã«æ ŒçŽããŸããããŒãã£ã·ã§ã³ã¯ä»®æ³çãªåãšããŠæ©èœããŸããããŒãã£ã·ã§ã³ã¯ããŒãã£ã·ã§ã³å€ã«åºã¥ããŠããŒã¿ãè«ççã«åºåãããŸããããŒãã«äœææã«ããŒãã£ã·ã§ã³ãå®çŸ©ããã¯ãšãªããšã«ã¹ãã£ã³ããããŒã¿éãæžããããšã§ãããã©ãŒãã³ã¹ãåäžããŸããããŒãã£ã·ã§ã³ã«åºã¥ãããã£ã«ã¿ãæå®ããããšã§ãã¯ãšãªã§ã¹ãã£ã³ããããŒã¿éãå¶éã§ããŸãã詳现ã¯ã Partitioning data in Athena ãã芧ãã ããã æ¬¡ã®äŸã¯ãS3 ãã±ããã«å¹Žããšã«ããŒãã£ã·ã§ã³åå²ãããããŒã¿ã»ããã瀺ããŠããŸãã $ aws s3 ls s3://athena-examples/flight/parquet/ PRE year=1987/ PRE year=1988/ PRE year=1989/ PRE year=1990/ PRE year=1991/ PRE year=1992/ PRE year=1993/ ãã®ããŒã¿ã»ããã®å ŽåãããŒãã«ã«ã¯ PARTITIONED BY (year STRING) å¥ãå«ããAthena ã«å¹Žåäœã§ããŒãã£ã·ã§ã³åå²ãããŠããããšãäŒããå¿
èŠããããŸããããŒãã«äœæåŸãããšãã° ALTER TABLE ADD PARTITION ã䜿çšãããã AWS Glue ã¯ããŒã©ãŒã䜿çšãããã MSCK REPAIR TABLE ãå®è¡ããŠãåããŒãã£ã·ã§ã³ã远å ããå¿
èŠããããŸã( Iceberg ããŒãã«ã§ã¯ ããŒãã«ã¬ã€ã¢ãŠãæ
å ±ã远跡ãããããã MSCK REPAIR TABLE ã®å®è¡ã¯å¿
èŠãªãããµããŒãããããŠããŸãã)ããŸããããŒãã«ã¯ããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ã䜿çšããããã«æ§æããããšãã§ããŸã (èšå®æ¹æ³ã«ã€ããŠã¯ãããŒãã¹ãããã®ã»ã¯ã·ã§ã³ãåç
§ããŠãã ãã)ã ããŒãã£ã·ã§ã³åãããããŒãã«ãåãåãããéã WHERE å¥ã§ããŒãã£ã·ã§ã³ããŒã䜿çšããããšã§ãã¹ãã£ã³å¯Ÿè±¡ãšãªãããŒãã£ã·ã§ã³ãéå®ã§ããŸã: SELECT dest, origin FROM flights WHERE year = '1991' ãã®ã¯ãšãªãå®è¡ãããšãAthena ã¯å¹Žã®ããŒãã£ã·ã§ã³ããŒã«ãã¬ãã£ã±ãŒã (ãã£ã«ã¿ãŒ) ãããããšãèªèããäžèŽããããŒãã£ã·ã§ã³ããã®ã¿ããŒã¿ãèªã¿èŸŒã¿ãŸãããã®å Žåã s3://athena-examples/flight/parquet/year=1991/ ã®ããŒã¿ã®ã¿ãèªã¿èŸŒãŸããŸãã ããŒã¿ã»ããã«ã¯è€æ°ã®ããŒãã£ã·ã§ã³ããŒãæã€ããšãã§ããŸãã以äžã¯ã AWS Open Data Registry ã® NOAA Global Historical Climatology Network ããŒã¿ã»ãã ããã®äŸã§ãããã®ããŒã¿ã»ããã¯ãSTATION ãš ELEMENT ã§ããŒãã£ã·ã§ã³åå²ãããŠãããã³ãã³ããå®è¡ãããšãç¹å®ã® STATION(=ASN00023351) ã® ELEMENT ããŒãã£ã·ã§ã³ã®ãªã¹ããååŸããããšãã§ããŸãã $ aws s3 ls --no-sign-request s3://noaa-ghcn-pds/parquet/by_station/ STATION=ASN00023351 / PRE ELEMENT=DAPR/ PRE ELEMENT=DWPR/ PRE ELEMENT=MDPR/ PRE ELEMENT=PRCP/ ãã®ããŒã¿ã»ããã®ããŒãã«ã«ã¯ã PARTITIONED BY (station STRING, element STRING) å¥ãå«ãŸããAthena ã«ãã®ããã«ããŒãã£ã·ã§ã³åå²ãããŠããããšãäŒããŸãã ããŒãã£ã·ã§ã³åå²ããã«ã©ã ãæ±ºå®ããéã¯ã以äžã®ç¹ãèæ
®ããŠãã ããã ã¯ãšãªã«é©ããããŒãã£ã·ã§ã³ããŒãéžæããŸããã¯ãšãªããéç®ããŠãããŒã¿ã»ããããã£ã«ã¿ãªã³ã°ããã®ã«ãã䜿ããããã£ãŒã«ããèŠã€ããŠãã ããã ããŒãã£ã·ã§ã³ããŒã®ã«ãŒãã£ããªãã£(ããŒãã£ã·ã§ã³ããŒãåããããŠããŒã¯ãªå€ã®æ°ãæããŸã)ã¯æ¯èŒçäœãæ¹ãè¯ãã§ããããŒãã«å
ã®ããŒãã£ã·ã§ã³æ°ãå¢ããã»ã©ãããŒãã£ã·ã§ã³ã¡ã¿ããŒã¿ã®ååŸãšåŠçã®ãªãŒããŒããããé«ããªãããã¡ã€ã«ãµã€ãºãå°ãããªããŸããããŒãã£ã·ã§ã³ããŒã®ã«ãŒãã£ããªãã£ãé«ããããšãããŒãã£ã·ã§ã³åå²ã®ã¡ãªããã倱ãããå¯èœæ§ããããŸãã ããŒã¿ãããç¹å®ã®ããŒãã£ã·ã§ã³å€ã«åã£ãŠããŠãã»ãšãã©ã®ã¯ãšãªããã®å€ã䜿çšããå ŽåãããŒãã£ã·ã§ã³åå²ã®ã¡ãªããããªãŒããŒãããã§æã¡æ¶ãããå¯èœæ§ããããŸãã æéã®çµéãšãšãã«å¢å€§ããããŒã¿ã»ããã¯ãäžè¬çã«æ¥ä»ã§ããŒãã£ã·ã§ã³ãããã¹ãã§ããç¹å®ã®æéãäŸãã°éå» 1 é±éãéå» 1 ãæãèŠãã¯ãšãªã¯äžè¬çãªãã¿ãŒã³ã§ããæ¥ä»ã§ããŒãã£ã·ã§ã³ããããšã§ãå
šäœã®ããŒã¿ã»ããã®ãµã€ãºãæéãšãšãã«å¢å€§ããŠãããããã®ã¯ãšãªãèªã¿åãããŒã¿éã¯äžå®ã«ä¿ãããŸãã æ¬¡ã®è¡šã¯ãããŒãã£ã·ã§ã³åå²ãããããŒãã«ãšããŒãã£ã·ã§ã³åå²ãããŠããªãããŒãã«ã®ã¯ãšãªå®è¡æéãæ¯èŒããŠããŸãããã®ããŒãã«ã¯ãæ¥çæšæºã®ãã³ãããŒã¯ããŒã¿ã»ãã TPC-H ããååŸãããŠããŸããããŒãã«ã®äž¡ããŒãžã§ã³ã«ã¯ã74 GB ã®éå§çž®ããã¹ãããŒã¿ãæ ŒçŽãããŠããŸããããŒãã£ã·ã§ã³åå²ãããããŒãã«ã¯ã l_shipdate åã§ããŒãã£ã·ã§ã³åå²ããã2,526 ã®ããŒãã£ã·ã§ã³ããããŸãã ã¯ãšãª ããŒãã£ã·ã§ã³åå²ãããŠããªãããŒãã« ã³ã¹ã ããŒãã£ã·ã§ã³åå²ãããããŒãã« ã³ã¹ã ç¯çŽå . å®è¡æé ã¹ãã£ã³ãããããŒã¿ . å®è¡æé ã¹ãã£ã³ãããããŒã¿ . . SELECT COUNT(*) FROM lineitem WHERE l_shipdate = '1996-09-01' 4.8 ç§ 74.1 GB $0.36 0.7 ç§ 29.96 MB $0.0001 99% å®äŸ¡ 85% é«é SELECT COUNT(*) FROM lineitem WHERE l_shipdate >= '1996-09-01' AND l_shipdate < '1996-10-01' 4.4 ç§ 74.1 GB $0.36 2.0 ç§ 898.58 MB $0.004 98% å®äŸ¡ 54% é«é EXPLAIN ã³ãã³ãã䜿çšãããšãã¯ãšãªã«ãã£ãŠã©ã®ããŒãã£ã·ã§ã³ãèªã¿åããããã確èªã§ããŸã: EXPLAIN SELECT COUNT(*) FROM lineitem WHERE l_shipdate = '1996-09-01' åºåã§ SOURCE é
ç®ãæ¢ãããã®äžã® PARTITION_KEY ã©ãã«ã確èªããŠãã ããããã®è¡ã¯ãã¯ãšãªã«ãã£ãŠèªã¿åãããããŒãã£ã·ã§ã³ã瀺ããŠããŸãã ⊠Fragment 1 [SOURCE] Output layout: [count_0] Output partitioning: SINGLE [] Aggregate[type = PARTIAL] â Layout: [count_0:bigint] â Estimates: {rows: 1 (9B), cpu: 0, memory: 9B, network: 0B} â count_0 := count(*) ââ TableScan[table = awsdatacatalog:tpc_h:lineitem] Layout: [] Estimates: {rows: ? (0B), cpu: 0, memory: 0B, network: 0B} l_shipdate:string:PARTITION_KEY :: [[1996-09-01]] ããŒãã«ã«è€æ°ã®ããŒãã£ã·ã§ã³ããŒãããå Žåãããããã®ããŒå€ã«ã€ããŠè¡ãåºåãããŸããã¯ãšãªãããŒãã£ã·ã§ã³ããŒã®è€æ°ã®å€ã«äžèŽããå Žåãåºåã«ã¯ããããã®å€ãå«ãŸããŸããããšãã°ãã¯ãšãªã倿Žã㊠l_shipdate ã®å€ã®ç¯å²ãéžæããå ŽåãæåŸã® 2 è¡ã¯æ¬¡ã®ããã«ãªããŸãã l_shipdate:string:PARTITION_KEY :: [[1996-09-01], [1996-09-02], [1996-09-03], [1996-09-04], [1996-09-05]] 次ã®è¡šã«ç€ºãããã«ãããŒãã£ã·ã§ã³åå²ã«ããã¯ãšãªã§ããŒãã£ã·ã§ã³ãã£ã«ã¿ãŒã䜿çšãããªãå Žåã«ããã©ãŒãã³ã¹äœäžãçããŸããããŒã¿ãéå°ã«ããŒãã£ã·ã§ã³åå²ããªãããæ³šæããŠãã ãããéå°ãªããŒãã£ã·ã§ã³åå²ã¯ã倿°ã®å°ããªãã¡ã€ã«ãçæãããããããã©ãŒãã³ã¹ãäœäžããŸãããã®ç¹ã«ã€ããŠã¯ããã®èšäºã®åŸåã§è©³ãã説æããŸãã ã¯ãšãª ããŒãã£ã·ã§ã³åå²ãããŠããªãããŒãã« ããŒãã£ã·ã§ã³åå²ãããããŒãã« ç¯çŽ . å®è¡æé ã¹ãã£ã³ãããããŒã¿ å®è¡æé ã¹ãã£ã³ãããããŒã¿ . SELECT COUNT(*) FROM lineitem 3.4 ç§ 74.1 GB 8.9 ç§ 74.1 GB 62% é
ã ããŒãã£ã·ã§ã³åå²ã®ããäžã€ã®ããã«ãã£ã¯ãã¯ãšãªã«äžèŽããããŒãã£ã·ã§ã³ãèŠã€ããã®ã«ãããæéã§ãããã®åé¡ã軜æžããæ¹æ³ã® 1 ã€ã¯ãããŒãã«ã«ããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ãæå¹ã«ããããšã§ããããã«ãããããŒãã«ã«ããŒãã£ã·ã§ã³ãæ°äžå以äžããå Žåã®ããã©ãŒãã³ã¹ãåäžããå¯èœæ§ããããŸããããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ã䜿çšãããšããã¹ãŠã®ããŒãã£ã·ã§ã³ã®ã¡ã¿ããŒã¿ãååŸããã®ã§ã¯ãªããã¯ãšãªã®ãã£ã«ã¿å
ã®ããŒãã£ã·ã§ã³å€ã®ã¡ã¿ããŒã¿ã®ã¿ãã«ã¿ãã°ããååŸãããŸãããã®çµæããã®ããã«ããŒãã£ã·ã§ã³ã倿°ããããŒãã«ã«å¯Ÿããã¯ãšãªãé«éåãããŸããæ¬¡ã®è¡šã¯ãããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ããªãå Žåãšããå Žåã®ããŒãã£ã·ã§ã³åå²ããŒãã«ã®ã¯ãšãªå®è¡æéãæ¯èŒããŠããŸãããã®ããŒãã«ã«ã¯çŽ 10 äžåã®ããŒãã£ã·ã§ã³ãšéå§çž®ã®ããã¹ãããŒã¿ãå«ãŸããŠããŸããorders ããŒãã«ã¯ o_custkey åã§ããŒãã£ã·ã§ã³åå²ãããŠããŸãã ã¯ãšãª ããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ = ç¡å¹ ããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ = æå¹ é«éå . å®è¡æé å®è¡æé . SELECT COUNT(*) FROM orders WHERE o_custkey BETWEEN 1 AND 100 19.5 ç§ 1.2 ç§ 16 å Athena ã§ã® AWS Glue Data Catalog ã®ããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ã®å©ç¹ã®è©³çްã«ã€ããŠã¯ã AWS Glue Data Catalog ããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ã䜿çšã㊠Amazon Athena ã¯ãšãªã®ããã©ãŒãã³ã¹ãåäž ãåç
§ããŠãã ããã ããŒã¿ãããŒãã£ã·ã§ã³åå²ããæ¹æ³ã®äŸã«ã€ããŠã¯ããã®èšäºã®åŸåã«ããæé©åãããããŒã¿ã»ããã®äœæã«é¢ããã»ã¯ã·ã§ã³ãåç
§ããŠãã ããã 2. ããŒã¿ã®ãã±ããå ã¯ãšãªãèªã¿åããªããã°ãªããªãããŒã¿éãæžããå¥ã®æ¹æ³ã¯ãåããŒãã£ã·ã§ã³å
ã®ããŒã¿ããã±ããåããããšã§ãã ãã±ããå ãšã¯ãããåã®å€ã«åºã¥ããŠã¬ã³ãŒããå¥ã
ã®ãã¡ã€ã«ã«åæ£ãããææ³ã§ããããã«ãããåãå€ãæã€ãã¹ãŠã®ã¬ã³ãŒããåããã¡ã€ã«ã«å
¥ããŸãããã±ããåã¯ãé«ãã«ãŒãã£ããªãã£ãæã€åããããå€ãã®ã¯ãšãªããã®åã®ç¹å®ã®å€ãæ€çŽ¢ããå Žåã«æçšã§ãããã±ããåã®è¯ãåè£ã¯ããŠãŒã¶ãŒãããã€ã¹ã® ID ãªã©ã®åã§ãã Athena ã§æ¢ã«ãã±ããåãããããŒã¿ã»ãããããå Žåã CREATE TABLE ã¹ããŒãã¡ã³ãã®äžã§ CLUSTERED BY ( <ãã±ããåãããã«ã©ã > ) INTO <ãã±ããæ°> BUCKETS ãšæå®ããããšã§ããã±ããåãããã«ã©ã ãæå®ã§ããŸããAthena 㯠Hive ãŸã㯠Spark ã§ãã±ããåãããããŒã¿ã»ããããµããŒãããŠãããAthena ã® CREATE TABLE AS (CTAS) ã§ãã±ããåãããããŒã¿ã»ãããäœæã§ããŸãã æ¬¡ã®è¡šã¯ã c_custkey åã䜿çšã㊠32 åã®ãã±ãããäœæããå Žåã®é¡§å®¢ããŒãã«ã®éãã瀺ããŠããŸãã顧客ããŒãã«ã®ãµã€ãºã¯ 2.29 GB ã§ãã ã¯ãšãª éãã±ããåããŒãã« ã³ã¹ã c_custkey ãã¯ã©ã¹ã¿ãŒåã«ã©ã ãšããŠãã±ããåããããŒãã« ã³ã¹ã ç¯çŽå . å®è¡æé ã¹ãã£ã³ããããŒã¿ . å®è¡æé ã¹ãã£ã³ããããŒã¿ . . SELECT COUNT(*) FROM customer WHERE c_custkey = 12677856 1.3 ç§ 2.29 GB $0.01145 0.82 ç§ 72.94 MB $0.0003645 97% å®ã 37% é«é åè¿°ã®ã¯ãšãªã«å¯Ÿã㊠EXPLAIN ANALYZE ãå®è¡ãããšããã±ããåã customer ããŒãã«ãã Amazon S3 ããã®ããŒã¿èªã¿åãéãæžããã®ã«åœ¹ç«ã£ãããšãããããŸãããã±ããåãããŠããªãããŒãã«ãšãã±ããåãããããŒãã«ã®ã¯ãšãªã«å¯Ÿãã EXPLAIN ANALYZE åºåã®æç²ãããå
¥åè¡æ°ãšããŒã¿ãµã€ãºã®éããããããŸãã 以äžã¯ãã±ããåãããŠããªãããŒãã«ã®åºåã§ã: ⊠â ScanFilterProject[table = awsdatacatalog:tpc_h:customer, filterPredicate = ("c_custkey" = 12677856), projectLocality = LOCAL, protectedBarrier = NONE] Layout: [] Estimates: {rows: ? (0B), cpu: ?, memory: 0B, network: 0B}/{rows: ? (0B), cpu: ?, memory: 0B, network: 0B}/{rows: ? (0B), cpu: 0, memory: 0B, network: 0B} CPU: 19.48s (99.94%), Scheduled: 37.43s (99.97%), Blocked: 0.00ns (0.00%), Output: 1 row (0B) Input avg.: 202702.70 rows , Input std.dev.: 4.83% c_custkey := c_custkey:int:REGULAR Input: 15000000 rows (2.29GB), Filtered: 100.00%, Physical input: 2.29GB, Physical input time: 0.00ms 以äžã¯ãã±ããåãããããŒãã«ã®åºåã§ã: ⊠â ScanFilterProject[table = awsdatacatalog:tpc_h:customer buckets=32, filterPredicate = ("c_custkey" = 12677856), projectLocality = LOCAL, protectedBarrier = NONE] Layout: [] Estimates: {rows: ? (0B), cpu: ?, memory: 0B, network: 0B}/{rows: ? (0B), cpu: ?, memory: 0B, network: 0B}/{rows: ? (0B), cpu: 0, memory: 0B, network: 0B} CPU: 654.00ms (100.00%), Scheduled: 1.13s (100.00%), Blocked: 0.00ns (0.00%), Output: 1 row (0B) Input avg.: 156250.00 rows , Input std.dev.: 22.35% c_custkey := c_custkey:int:REGULAR Input: 468750 rows (72.94MB), Filtered: 100.00%, Physical input: 72.94MB, Physical input time: 0.00ns Athena ã§ãã±ããåãããããŒã¿ãæ±ãæ¹æ³ã®è©³çްã¯ã以äžã®ãªãœãŒã¹ãåç
§ããŠãã ããã Athena ã§ã®ããŒãã£ã·ã§ã³åå²ãšãã±ããåå² CTAS ã¯ãšãªã®äŸ: ãã±ããåå²ãšããŒãã£ã·ã§ã³åå²ãããããŒãã«ã®äœæ ããŒã¿ããã±ããã«åå²ããæ¹æ³ã®äŸã«ã€ããŠã¯ããã®èšäºã®åŸåã«ããæé©åãããããŒã¿ã»ããã®äœæã«é¢ããã»ã¯ã·ã§ã³ãåç
§ããŠãã ããã 3. å§çž®ã®å©çš ããŒã¿ãå§çž®ãããšãã¯ãšãªã®å®è¡é床ã倧å¹
ã«åäžããŸããããŒã¿ãµã€ãºãå°ãããªãããšã§ãAmazon S3 ããã¹ãã£ã³ããããŒã¿éãæžããããã¯ãšãªã®å®è¡ã³ã¹ããäžãããŸãããŸããAmazon S3 ãã Athena ãžã®ãããã¯ãŒã¯ãã©ãã£ãã¯ãæžå°ãããããšãã§ããŸãã Athena 㯠gzipãSnappyãzstd ãªã©ã®äžè¬çãªåœ¢åŒãå«ããããŸããŸãªå§çž®åœ¢åŒããµããŒãããŠããŸãããµããŒããããŠãã圢åŒã®äžèЧã«ã€ããŠã¯ã Athena ã®å§çž®ãµããŒã ãåç
§ããŠãã ããã JSON ã CSV ãªã©ã®å§çž®ãããããã¹ãããŒã¿ãåãåãããã«ã¯ãç¹å¥ãªèæ
®ãå¿
èŠã§ããAthena ãããŒã¿ãèªã¿åãéãããŒã¿ã®äžŠååŠçãæå€§åããããã«ããã¡ã€ã«ã®ç°ãªãç¯å²ãããŸããŸãªããŒãã«å²ãåœãŠãŸããåç¯å²ã¯ ã¹ããªãã ãšåŒã°ãã䞊åã§èªã¿åãããã¡ã€ã«ã¯ ã¹ããªããå¯èœ ãšåŒã°ããŸããäžè¬çãªå§çž®åœ¢åŒã®ã»ãšãã©ã¯ã¹ããªããäžå¯èœã§ããã€ãŸãããªãŒããŒã¯å§çž®ãã¡ã€ã«ã®å
é ããèªã¿åãå¿
èŠããããŸããããã¯ãããŒã¿ã»ãããåäžã®å§çž®ããã CSV ãã¡ã€ã«ã§ããå Žåãã¯ãšãªåŠçã«äœ¿çšã§ããã®ã¯ 1 ã€ã®ããŒãã ãã§ããããšãæå³ããŸãã ããã¹ããã¡ã€ã«ãå§çž®ããããŒã¿ã»ãããäœæããéã¯ããã¡ã€ã«æ°ãšãã¡ã€ã«ãµã€ãºã®ãã©ã³ã¹ãå¿ãããŠãã ããããã¡ã€ã«ãµã€ãºã®æé©åã«ã€ããŠã¯ã次ã®ã»ã¯ã·ã§ã³ã§è©³ãã説æããŸãã Parquet ãã¡ã€ã«ãš ORC ãã¡ã€ã«ã¯ããããã®åœ¢åŒããã¡ã€ã«ãæ§æããè€æ°ã®ã»ã¯ã·ã§ã³ãåå¥ã«å§çž®ãããã¡ã€ã«å
ã®åã»ã¯ã·ã§ã³ã®å Žæãå«ãã¡ã¿ããŒã¿ãæã€ãããåå²ãå¯èœã§ã(ãã¡ã€ã«å
šäœã1ã€ã®ã»ã¯ã·ã§ã³ã§æ§æãããŠããå Žåã¯åå²ã§ããªãã®ã§æ³šæãå¿
èŠã§ã)ã gzip 圢åŒã¯é«ãå§çž®çãæã¡ãä»ã®ããŒã«ããµãŒãã¹ã§ãå¹
åºããµããŒããããŠããŸãã zstd (Zstandard) 圢åŒã¯ãããã©ãŒãã³ã¹ãšå§çž®çã®ãã©ã³ã¹ãè¯ãæ°ããå§çž®åœ¢åŒã§ããbzip2 ãš LZO å§çž®åœ¢åŒã¯åå²å¯èœã§ãããããã©ãŒãã³ã¹ãšäºææ§ãéèŠããå Žåã¯æšå¥šãããŸããã ããŒã¿ãå§çž®ããæ¹æ³ã®äŸã«ã€ããŠã¯ããã®èšäºã®åŸåã«ããæé©åãããããŒã¿ã»ããã®äœæã«é¢ããã»ã¯ã·ã§ã³ãåç
§ããŠãã ããã 4. ãã¡ã€ã«ãµã€ãºã®æé©å ããŒã¿ã䞊åã§èªã¿åãã1åã®èªã¿åãèŠæ±ã§å¯èœãªéãå€ãã®ããŒã¿ãèªã¿åãããšãã«ãã¯ãšãªã¯å¹ççã«å®è¡ãããŸããåãã¡ã€ã«ã®èªã¿åãã«ã¯ãªãŒããŒãããããããäŸãã°ã¡ã¿ããŒã¿ã®ååŸãAmazon S3 ãžã®èŠæ±ãå§çž®èŸæžã®ã»ããã¢ãããªã©ãçºçããŸããããã¯éåžžã¯æ°ä»ãããŸãããããã¡ã€ã«æ°ãå¢ããã«ã€ããŠããªãŒããŒããããèç©ãããŸããã¯ãšãªã®ããã©ãŒãã³ã¹ãæå€§åããã«ã¯ããã¡ã€ã«æ°ãšãµã€ãºã®ãã©ã³ã¹ãåãå¿
èŠããããŸãã äžè¬çãªã¬ã€ãã©ã€ã³ãšããŠã128 MB çšåºŠã®åå²ãç®æãããšããå§ãããŸããåå²ãšã¯ããã¡ã€ã«ã®äžéšãæããŸããããšãã°ãéå§çž®ããã¹ããã¡ã€ã«ã®ãã€ãç¯å²ããŸã㯠Parquet ãã¡ã€ã«ã®ããŒãžã§ããå§çž®ã®ã»ã¯ã·ã§ã³ã§èª¬æããããã«ãã»ãšãã©ã®å§çž®ããã¹ããã¡ã€ã«ã¯åå²ã§ããªããããäžæ¬ããŠåŠçãããŸããParquet ã ORC ãªã©ã®åæçšã«æé©åããã圢åŒã¯ããã€ã§ãåå²å¯èœã§ãã å°ããªãã¡ã€ã«ã倿°çæãããçç±ã® 1 ã€ã¯ãåã®ã»ã¯ã·ã§ã³ã§èª¬æããããŒãã£ã·ã§ã³åå²ã®éå°åå²ã§ããã¯ãšãªã®ããã©ãŒãã³ã¹ãå°ããªãã¡ã€ã«ãå€ãããããã«äœäžããŠããããšã瀺ãå
åã¯ãã¯ãšãªçµ±èšã®èšç»ãã§ãŒãºãå
šå®è¡æéã®æ°ããŒã»ã³ããè¶
ããããšã§ããææªã®å Žåãã¯ãšãªããPlease reduce your request rate.ããšãã Amazon S3 ãšã©ãŒã§å€±æããå¯èœæ§ããããŸããããã¯ã倧éã®ãã¡ã€ã«ã«å¯ŸããAthena ã Amazon S3 ã®ãµãŒãã¹ã¯ã©ãŒã¿ãè¶
ããæ°ã®åç
§ãªã¯ãšã¹ããå®è¡ããå Žåã«çºçããŸãã 詳现ã«ã€ããŠã¯ã ãã¹ããã©ã¯ãã£ã¹ãã¶ã€ã³ãã¿ãŒã³: Amazon S3 ã®ããã©ãŒãã³ã¹æé©å ãåç
§ããŠãã ããã å°ãããã¡ã€ã«ã®åé¡ã解決ãã 1 ã€ã®æ¹æ³ã¯ãAmazon EMR ã® S3DistCP ãŠãŒãã£ãªãã£ã䜿ãããšã§ããããã䜿ã£ãŠãå°ããªãã¡ã€ã«ã倧ããªãªããžã§ã¯ãã«ãŸãšããããšãã§ããŸãããŸããS3DistCP ã䜿ãã°ãHDFS ãã Amazon S3 ãžãAmazon S3 ãã Amazon S3 ãžãAmazon S3 ãã HDFS ãžãšã倧éã®ããŒã¿ãæé©åãããæ¹æ³ã§ç§»åã§ããŸãããã®ã»ã¯ã·ã§ã³ã®æåŸã§ãAthena Spark ã䜿ã£ãŠããŒã¿ãååŠçããå¥ã®æ¹æ³ã«ã€ããŠèª¬æããŸãã ãã¡ã€ã«æ°ãå°ãªãããµã€ãºã倧ããæ¹ãããã£ã¬ã¯ããªå
ã®ãã¡ã€ã«ãªã¹ãååŸãéããªããAmazon S3 ãžã®ãªã¯ãšã¹ãæ°ãæžãã管çããã¡ã¿ããŒã¿ãå°ãªããªããšããã¡ãªããããããŸãã äŸãã°ã次ã®è¡šã¯ã100,000 åã®ãã¡ã€ã«ãèªã¿åãå¿
èŠãããã¯ãšãªãšãåãããŒã¿ã»ãããåäžã®ãã¡ã€ã«ãšããŠæ ŒçŽããã¯ãšãªã®éããæ¯èŒããŠããŸããäž¡æ¹ã®ãã¡ã€ã«ã»ããã«ã¯åãããŒã¿ããå§çž®ãããŠããªãããã¹ããã¡ã€ã«ãšããŠæ ŒçŽãããŠããŸããããŒã¿ã®ç·é㯠74 GB ã§ãã ã¯ãšãª ãã¡ã€ã«æ° å®è¡æé SELECT COUNT(*) FROM lineitem 100,000 ãã¡ã€ã« 11.5 ç§ SELECT COUNT(*) FROM lineitem 1 ãã¡ã€ã« 4.3 ç§ åŠçé床åäžç . ~ 62% åæ§ã«ã次ã®è¡šã¯ãããŒã¿ãå§çž®ãããå Žåã«ãã¡ã€ã«æ°ã®éããã©ã®ãããªåœ±é¿ãäžããããæ¯èŒããŠããŸããããŒã¿ã¯åã®äŸãšåãã§ããããããã 10 ãã¡ã€ã«ãš 100 ãã¡ã€ã«ã« gzip å§çž®ãããŠããŸãã ã¯ãšãª ãã¡ã€ã«æ° å¹³åãã¡ã€ã«ãµã€ãº å®è¡æé SELECT COUNT(*) FROM lineitem 10 ãã¡ã€ã« 2.4 GB 60 ç§ SELECT COUNT(*) FROM lineitem 100 ãã¡ã€ã« 243 MB 6.8 ç§ åŠçé床åäžç . . çŽ 88% ãã¡ã€ã«ãµã€ãºããã¡ã€ã«æ°ããã¡ã€ã«ãå§çž®ãããŠãããã©ããã«ãã£ãŠãã¯ãšãªã®ããã©ãŒãã³ã¹ã«å€§ããªéããçããŸããããŒã¿ãå§çž®ãããŠããªãå ŽåãAthena ã¯æé©ãªãµã€ãºã§ãã¡ã€ã«ã䞊ååŠçã§ããŸããããã«ãããåäžã®éå§çž®ããã¹ããã¡ã€ã«ã®ã»ãã 10 äžåã®éå§çž®ãã¡ã€ã«ãåŠçãããããå¹ççã§ããããŒã¿ãå§çž®ãããŠããå Žåããã¡ã€ã«æ°ãšãµã€ãºã®åœ±é¿ã¯ããã«å€§ãããªããŸããããã®å ŽåãAthena ãããŒã¿ã»ããã䞊ååŠçã§ããããã«ååãªæ°ã®ãã¡ã€ã«ãå¿
èŠã«ãªããŸãã ããŒã¿ã»ãããæé©åããæ¹æ³ã«ã€ããŠã¯ããã®èšäºã®åŸåã«ããå°ããªãã¡ã€ã«ãçµåããŠããŒã¿ã»ãããæžãæããäŸãåç
§ããŠãã ããã 5. åæåã®ãã¡ã€ã«åœ¢åŒã®å©çš Apache Parquet ãš Apache ORC ã¯ãåæã¯ãŒã¯ããŒãã§äººæ°ã®ãããã¡ã€ã«åœ¢åŒã§ãããããã¯ãè¡ããšã§ã¯ãªãåããšã«ããŒã¿ãæ ŒçŽããããšããã åæå ã®ãã¡ã€ã«åœ¢åŒãšåŒã°ããããšãå€ãã§ãããŸããèªã¿èŸŒãå¿
èŠã®ããããŒã¿éãæ§ã
ãªæ¹æ³ã§åæžã§ããæ©èœãåããŠããŸããããšãã°ãåãåå¥ã«æ ŒçŽã»å§çž®ããããšã§ãããé«ãå§çž®çãå®çŸã§ããã¯ãšãªã§åç
§ãããåã®ã¿ãèªã¿èŸŒããŸãã åæåã®ãã¡ã€ã«åœ¢åŒã§ã¯ãããŒã¿ã«å¯ŸããŠè€æ°ã®å§çž®æŠç¥ã䜿çšãããŸããããšãã°ãå€ãã®ç¹°ãè¿ãå€ãå«ãåã¯ãå€ã 1 åæ ŒçŽããç¹°ãè¿ãåæ°ãä»å ããã©ã³ã¬ã³ã°ã¹å§çž®ã䜿çšããŠãšã³ã³ãŒãããããšããåå€ãæ€çŽ¢ããŒãã«ãžã®ãã€ã³ã¿ã«çœ®ãæããèŸæžç¬Šå·åã䜿çšããŠãšã³ã³ãŒãããããšãã§ããŸããããã¹ãããŒã¿ã¯ãgzipãSnappyãzstd ãªã©ã®æšæºçãªå§çž®åœ¢åŒã§å§çž®ã§ããŸãã詳现ã«ã€ããŠã¯ã Athena ã®å§çž®ãµããŒã ãåç
§ããŠãã ããã Parquet ãš ORC ã¯ãããŸããŸãªããŒã¿ã»ããã«åãããŠèª¿æŽã§ããŸããããšãã°ãããã㯠(Parquet) ãŸãã¯ã¹ãã©ã€ã (ORC) ãµã€ãºã倧ãããããšãç¶æ³ã«ãã£ãŠã¯æçãªå ŽåããããŸããããŒã¿ã»ããã«å€æ°ã®ã«ã©ã ãããå Žåã¯ãParquet ã®ããã©ã«ã 128MBãORC ã®ããã©ã«ã 64MB ãã倧ããããããšããå§ãããŸããããã«ãããåã«ã©ã ã®ååãªå€ãäžç·ã«æ ŒçŽãããèªã¿åãã®åæ°ãæžããŸãã åæåã®ãã¡ã€ã«åœ¢åŒã䜿ã£ãŠããŒã¿ã»ããã調æŽããå¥ã®æ¹æ³ã¯ãã¯ãšãªã«é »ç¹ã«å«ãŸããåã§ããŒã¿ãäžŠã¹æ¿ããŠããããšã§ããParquet ãš ORC ã¯ãåããŒã¿ãããã¯ã®åã®æå°å€ãšæå€§å€ãªã©ã®ã¡ã¿ããŒã¿ãæ ŒçŽããŠããŸããã€ãŸããã¯ãšãªãšã³ãžã³ã¯ããã®ãããã¯ã«å«ãŸããå€ãã¯ãšãªã«äžèŽããªããšå€æããå Žåããã®ãããã¯ã®ããŒã¿ãèªã¿èŸŒãå¿
èŠããããŸããããã㯠述èªããã·ã¥ããŠã³ ãšåŒã°ããŸããããšãã°ãããŒã¿ã«ã¯ã¿ã€ã ã¹ã¿ã³ãã®ãããªãã®ãããããã®å±æ§ã§ãã¡ã€ã«å
ã®ããŒã¿ãäžŠã¹æ¿ããŠãããšãç¹å®ã®æéç¯å²ãæ¢ãã¯ãšãªã¯ãã¿ã€ã ã¹ã¿ã³ãã®ååŸã®ãããã¯ã®ããŒã¿ãèªã¿èŸŒãå¿
èŠããªããªããŸãã ã¿ã€ã ã¹ã¿ã³ãã«ããäžŠã¹æ¿ããšããŒãã£ã·ã§ã³åå²ãçµã¿åããããšãããã«ããã©ãŒãã³ã¹ãåäžããã³ã¹ããç¯çŽã§ããŸããããšãã°ãæ°æéã®æéæ ã§éèšãè¡ãããšãå€ãå ŽåãèããŠã¿ãŸããããæéåäœã§ããŒãã£ã·ã§ã³åå²ããããšã¯å¯èœã§ããããã¡ã€ã«ãå€ããããå°ããããããããªã¹ã¯ããããŸãã代ããã«ãæ¥åäœã§ããŒãã£ã·ã§ã³åå²ããããŒã¿ãã¿ã€ã ã¹ã¿ã³ãã§äžŠã¹æ¿ããããšãã§ããŸãããã®ããã«ãããšãç²ç²åºŠã®ããŒãã£ã·ã§ã³åå²ã䜿çšããŠã¯ãšãªã«å«ãŸãããã¡ã€ã«ã»ãããäžèŽããããŒãã£ã·ã§ã³ã®ã¿ã«æžãããäžŠã¹æ¿ãã䜿çšããŠæ®ãã®ãã¡ã€ã«å
ã®ãããã¯ãã¹ãããã§ããŸããããŒãã£ã·ã§ã³ããŒãšã¿ã€ã ã¹ã¿ã³ãåã®äž¡æ¹ã§ãã£ã«ã¿ãªã³ã°ãè¡ãããšãå¿ããªãã§ãã ããã æ¬¡ã®è¡šã¯ãããã¹ã gzipããœãŒãç¡ãã® Parquet gzipãããã³ l_partkey ã§ãœãŒãããã Parquet gzip ã®åãããŒã¿ã»ããã«å¯Ÿããå®è¡æéãšã¹ãã£ã³ãããããŒã¿ãæ¯èŒããŠããŸãã ã¯ãšãª SELECT l_orderkey FROM lineitem WHERE l_partkey = 17766770 ããã¹ã圢åŒãšæ¯èŒããç¯çŽå ããã¹ã gzip ããŒã¿ å®è¡æé 11.9 ç§ . ã¹ãã£ã³ããããŒã¿ 23.7 GB . ã³ã¹ã $0.1 . ãœãŒãç¡ãã® Parquet gzip ããŒã¿ å®è¡æé 2.1 ç§ ~ 82% é«éå ã¹ãã£ã³ããããŒã¿ 2.0 GB . ã³ã¹ã $0.009 ~ 91% å®äŸ¡ l_partkey ã§ãœãŒãæžã¿ã® Parquet gzip ããŒã¿ å®è¡æé 1.1 ç§ ~ 90% é«éå ã¹ãã£ã³ããããŒã¿ 38.8 MB . ã³ã¹ã $0.0001 ~ 99.9% å®äŸ¡ æé©åããŒã¿ã»ããã®äœæ ãã®ã»ã¯ã·ã§ã³ã§ã¯ãAthena Spark ã䜿çšããŠããŒã¿ã»ããã倿ããåã®ã»ã¯ã·ã§ã³ã§èª¬æããæé©åãé©çšããæ¹æ³ã瀺ããŸãããã®ã³ãŒãã¯ã Amazon EMR Serverless ã AWS Glue ETL ãªã©ãã»ãšãã©ã®ä»ã® Spark ã©ã³ã¿ã€ã ã§ã䜿çšã§ããŸããAthena SQL ã䜿ã£ãŠããŒã¿ã倿ãããã®èšäºã§èª¬æãããŠããå€ãã®æé©åãé©çšããããšãã§ããŸãããAthena Spark ã®æ¹ãããã»ã¹ã«å¯Ÿããèšå®ãªãã·ã§ã³ãšã³ã³ãããŒã«ãå€ããªããŸãã æ¬¡ã®ã³ãŒãã¯æåã« tpc_h ããŒã¿ããŒã¹ãããã©ã«ãã«èšå®ããŸãããã®ããŒã¿ããŒã¹ã®å Žæã¯ãããŒã¿ãæžã蟌ãŸããå Žæã決å®ããããã«äœ¿çšãããŸããæ¬¡ã«ã以äžã®æäœãå®è¡ã㊠customer_optimized ãšããæ°ããããŒãã«ãäœæããŸã: customer ããŒãã«ã®ãã¹ãŠã®è¡ãèªã¿èŸŒã¿ãŸã coalesce ã䜿çšããŠããã±ããããšã«ããŒãã£ã·ã§ã³ããšã«æžã蟌ãŸãããã¡ã€ã«æ°ã 4 ã«æžãããŸã sortWithinPartitions ã§ c_name ã«ãã£ãŠã¬ã³ãŒããäžŠã¹æ¿ããŸã partitionBy ã§ c_mktsegment ãš c_nationkey ã«ãã£ãŠããŒãã£ã·ã§ã³åå²ãã bucketBy ã§ c_custkey ã«ãã£ãŠ 32 ã®ãã±ããã«åå²ãã zstd ã§å§çž®ããã Parquet ãã¡ã€ã«ã«æžã蟌ã¿ãŸã 次ã®ã³ãŒããåç
§ããŠãã ãã: spark.sql("use tpc_h") spark\ .read.table("customer")\ .coalesce(4)\ .sortWithinPartitions("c_name")\ .write\ .partitionBy("c_mktsegment", "c_nationkey")\ .bucketBy(32, "c_custkey") .saveAsTable("customer_optimized", format="parquet", compression="gzip") ãã®äŸã§ã¯ããã¹ãŠã®æé©åãåæã«ç€ºããŠããŸãã䜿çšã±ãŒã¹ã«ãã£ãŠã¯ã1 ã€ãŸãã¯ããã€ãã®æé©åã ããå¿
èŠãªå ŽåããããŸããããããã®æé©åããã£ãšã圹ç«ã€å Žåã«ã€ããŠã¯ããã®èšäºã®åã®ã»ã¯ã·ã§ã³ãåç
§ããŠãã ããã Amazon EMRãEMR ServerlessãAWS Glue ETLããŸã㯠Athena SQL ã䜿çšããŠããŒã¿ãåŠçããæ¹æ³ã®è©³çްã«ã€ããŠã¯ã以äžã®ãªãœãŒã¹ãåç
§ããŠãã ããã Amazon Athena ã® CTAS ãš INSERT INTO ã¹ããŒãã¡ã³ãã䜿çšããŠãS3 ããŒã¿ã¬ã€ã¯ã«ããŒã¿ãæœåºã倿ãããŒããã Amazon Athena ã® UNLOAD æ©èœã䜿çšã㊠ETL ãš ML ãã€ãã©ã€ã³ãç°¡çŽ åãã AWS Glue ãš Amazon S3 ã䜿çšããŠããŒã¿ã¬ã€ã¯ã®åºç€ãæ§ç¯ãã å€§èŠæš¡ããŒã¿ã»ããã Parquet 圢åŒã«å€æãã åæåã®ãã¡ã€ã«åœ¢åŒãžã®å€æ ã¯ãšãªãã¥ãŒãã³ã° Athena SQL ãšã³ãžã³ã¯ããªãŒãã³ãœãŒã¹ã®åæ£ã¯ãšãªãšã³ãžã³ Trino ãš Presto äžã«æ§ç¯ãããŠããŸãããã®åäœãçè§£ããããšã§ãã¯ãšãªãå®è¡ããéã®æé©åã®æããããåŸãããŸãããã®ã»ã¯ã·ã§ã³ã§ã¯ã以äžã®ãã¹ããã©ã¯ãã£ã¹ã«ã€ããŠè§£èª¬ããŸãã ORDER BY ã®æé©å çµåã®æé©å GROUP BY ã®æé©å è¿äŒŒé¢æ°ã®å©çš å¿
èŠãªã«ã©ã ã®ã¿ãå«ãã 6. ORDER BY ã®æé©å ORDER BY å¥ã¯ãã¯ãšãªã®çµæãäžŠã¹æ¿ããé åºã§è¿ããŸããAthena ã¯åæ£ãœãŒãã䜿çšããŠãè€æ°ã®ããŒãã§ãœãŒãæäœã䞊åã«å®è¡ããŸããäžäœãŸãã¯äžäœ N ä»¶ã®å€ãèŠãããã« ORDER BY å¥ã䜿çšããå Žåã¯ã LIMIT å¥ã䜿çšããŠãœãŒãã®ã³ã¹ããåæžããã¯ãšãªã®å®è¡æéãççž®ã§ããŸãã äŸãã°ã次ã®è¡šã¯ããµã€ãºã 7.25 GB ã®çŽ 6,000 äžè¡ã®ããã¹ã圢åŒéå§çž®ããŒãã«ã䜿çšããããŒã¿ã»ããã«å¯Ÿããã¯ãšãªå®è¡æéããŸãšãããã®ã§ãã ã¯ãšãª å®è¡æé SELECT * FROM lineitem ORDER BY l_shipdate 274 ç§ SELECT * FROM lineitem ORDER BY l_shipdate LIMIT 10000 4.6 ç§ é«éå 98% é«é 7. çµåã®æé©å é©åãªçµåé åºãéžã¶ããšã¯ãã¯ãšãªã®ããã©ãŒãã³ã¹ãåäžãããããã«éèŠã§ãã2 ã€ã®ããŒãã«ãçµåããéã¯ã倧ããæ¹ã®ããŒãã«ãçµåã®å·ŠåŽã«ãå°ããæ¹ã®ããŒãã«ãå³åŽã«æå®ããŠãã ãããç倿¡ä»¶ã䜿ãäžè¬çãªçµåã®å ŽåãAthena ã¯å³åŽã®ããŒãã«ããæ€çŽ¢ããŒãã«ãæ§ç¯ããã¯ãŒã«ãŒããŒãã«é
åžããŸããæ¬¡ã«å·ŠåŽã®ããŒãã«ãã¹ããªãŒãã³ã°ããæ€çŽ¢ããŒãã«å
ã®äžèŽããå€ãæ¢ããŠè¡ãçµåããŸããããã¯ åæ£ããã·ã¥çµå ãšåŒã°ããŸããå³åŽã®ããŒãã«ããæ§ç¯ãããæ€çŽ¢ããŒãã«ã¯ã¡ã¢ãªå
ã«ä¿æãããããããã®ããŒãã«ãå°ããã»ã©ã䜿çšããã¡ã¢ãªãå°ãªããçµåã®å®è¡é床ãéããªããŸãã ããã·ã¥ããŒãã«ãã¯ãŒã«ãŒããŒãéã«åæ£ãããŠãããããããŒã¿ã¹ãã¥ãŒ(ããŒãã«å
ã®ããŒã¿ãã¯ã©ã¹ã¿ãŒå
ã§åã£ãŠä¿æãããç¶æ
)ãããã©ãŒãã³ã¹ã«åœ±é¿ãäžããå¯èœæ§ããããŸããçµåæ¡ä»¶ã§äœ¿çšãããã«ã©ã ã®å€ã«åããããå Žåã1 ã€ã®ããŒããçµåã®å€§éšåãåŠçããªããã°ãªãããä»ã®ããŒãã¯éäŒç¶æ
ã«ãªã£ãŠããŸããŸããæé«ã®ããã©ãŒãã³ã¹ãåŸãã«ã¯ãçµåæ¡ä»¶ã®ã«ã©ã ã«å€ãåäžã«ååžããããã«ããŠãã ããã æ¬¡ã®è¡šã¯ãããã¹ã圢åŒã§å§çž®ãããŠããªãåèš 74 GB ã®ããŒã¿ã»ããã«å¯Ÿããå®è¡æéã瀺ããŠããŸããlineitem ããŒãã«ã«ã¯çŽ 6 åè¡ãpart ããŒãã«ã«ã¯çŽ 2,000 äžè¡ããããŸãã ã¯ãšãª å®è¡æé SELECT COUNT(*) FROM lineitem, part WHERE l_partkey = p_partkey 6.4 ç§ SELECT COUNT(*) FROM part, lineitem WHERE p_partkey = l_partkey 8.1 ç§ é«éå çŽ 20% é«é Athena ã³ã³ãœãŒã«ã®å®è¡è©³çްããžã¥ã¢ã©ã€ã¶ãŒã䜿çšãããšãçµåã®å®è¡é åºã確èªã§ããŸããããžã¥ã¢ã©ã€ã¶ãŒã«ã¯ãåããŒãã«ããçµåãããè¡æ°ã衚瀺ãããŸããããžã¥ã¢ã©ã€ã¶ãŒã®äœ¿ç𿹿³ã®è©³çްã«ã€ããŠã¯ã å®äºããã¯ãšãªã®çµ±èšãšå®è¡è©³çްã®è¡šç€º ãåç
§ããŠãã ããã 3 ã€ä»¥äžã®ããŒãã«ãçµåããå Žåã¯ãäžéçµæãæžããããã«ãæåã«å€§ããªããŒãã«ãšæãå°ããªããŒãã«ãçµåããæ¬¡ã«ä»ã®ããŒãã«ãšçµåããããšãæ€èšããŠãã ããã ã³ã¹ãããŒã¹ã®çµåæé©å ããŒãã«ã« AWS Glue Data Catalog ã§ããŒãã«çµ±èšæ
å ±ãããå ŽåãAthena ã¯ãããã䜿çšããŠãã³ã¹ãããŒã¹ã®æé©å (ããã§ã®ãã³ã¹ããã¯èšç®ã³ã¹ããæããŸã) ã«ãããžã§ã€ã³é åºã®å€æŽãšéçŽããã·ã¥ããŠã³ãå®è¡ããŸããããŒãã«ã«çµ±èšæ
å ±ãããå Žåãã¯ãšãªæé©åãšã³ãžã³ã¯ããŒãã«ã®é åºã®ã©ããæãå¹ççããææ¡ã§ããèªåçã«æé©åãè¡ããŸããã€ãŸãã倧ããªããŒãã«ãæåã§ãžã§ã€ã³ã®å·ŠåŽã«é
眮ããå¿
èŠããªããªããŸãã ã³ã¹ãããŒã¹ã®æé©åæ©èœã®äœ¿ç𿹿³ã®è©³çްã«ã€ããŠã¯ã Amazon Athena ã§ã³ã¹ãããŒã¹ã®æé©åæ©èœã䜿ã£ãŠã¯ãšãªãé«éåãã ããã³ ã³ã¹ãããŒã¹ã®æé©åæ©èœã®äœ¿çš ãåç
§ããŠãã ããã ããŒãã£ã·ã§ã³ããŒãã«ã®çµå ããŒãã£ã·ã§ã³ããŒãã«ãåãåãããéã¯ããã¹ãŠã®ããŒãã«ã®ãã¹ãŠã®ããŒãã£ã·ã§ã³ããŒã§ãã£ã«ã¿ãå«ããããšãæé©ã§ããããã«ãããã¯ãšãªãã©ã³ããŒã¯ãã¡ã€ã«ã®ãªã¹ããšèªã¿åããã§ããã ãçç¥ã§ããããã«ãªããŸãã æ¬¡ã®äŸã§ã¯ã orders ããŒãã«ãš lineitem ããŒãã«ãæ³šææ¥ ( orders ã® o_orderdate ã lineitem ã® l_orderdate ) ã§ããŒãã£ã·ã§ã³ãããŠããŸããæåã®ã¯ãšãªã§ã¯ l_orderdate ã«æ¡ä»¶ããªãããããšã³ãžã³ã¯ lineitem ããŒãã«ã®ãã¹ãŠã®ããŒãã£ã·ã§ã³ãã¹ãã£ã³ããå¿
èŠããããŸããæ³šææ¥ãçµåæ¡ä»¶ã«è¿œå ãããšãã¯ãšãªãã©ã³ããŒã¯ 2 ã€ã®ããŒãã«ã«å¯ŸããŠ1ã€ã®ããŒãã£ã·ã§ã³ã®ã¿ãèªã¿èŸŒãã°è¯ãããšãããããã¹ãã£ã³ãããããŒã¿éã倧å¹
ã«æžå°ããŸãã ã¯ãšãª ã¹ãã£ã³ãããããŒã¿ å®è¡æé SELECT AVG(l_extendedprice) FROM lineitem JOIN orders ON (l_orderkey = o_orderkey) AND o_orderdate = '1993-07-08' 68.1 GB 106 ç§ SELECT AVG(l_extendedprice) FROM lineitem JOIN orders ON (l_orderkey = o_orderkey AND l_orderdate = o_orderdate ) AND o_orderdate = '1993-07-08' 35.4 MB 2 ç§ åè¿°ã®ããã«ã EXPLAIN ã䜿çšããŠã¯ãšãªã«ãã£ãŠã©ã®ããŒãã£ã·ã§ã³ãèªã¿åããããã確èªã§ããŸããããã¯ãè€æ°ã®ããŒãã£ã·ã§ã³åãããããŒãã«ãçµåããéã«ç¹ã«éèŠã§ãã ã¯ãšãªã«é¢ããããŒãã«ã® 1 ã€ä»¥äžã®ããŒãã£ã·ã§ã³ããã¯ãšãªã®å®è¡æã«çºèŠãããæ
å ±ã«äŸåããããšããããŸããææªã®å Žåãã¯ãšãªãã©ã³ããŒãã¯ãšãªãåæããŠããŒãã£ã·ã§ã³ã決å®ã§ããªãããããã¹ãŠã®ããŒãã£ã·ã§ã³ãèªã¿åãå¿
èŠããããŸãããããããã®ãããªå Žåã§ããAthena 㯠åçããŒãã£ã·ã§ã³ãã«ãŒãã³ã° ãšåŒã°ããã¡ã«ããºã ã䜿çšããŠãã¯ãšãªã®å®è¡äžã«ããŒãã£ã·ã§ã³ã®èªã¿åããã¹ãããã§ããããšããããããŸããããšãã°ããšã³ãžã³ãçµåæ¡ä»¶ã«ããŒãã£ã·ã§ã³ããŒãé¢äžããŠããããšãèªèããå³åŽã®å€ã®æ°ãå°ãªãå Žåãã¯ãŒã«ãŒããŒãéã§ãã®æ
å ±ããããŒããã£ã¹ãã§ããŸãããã®åŸãã¯ãŒã«ãŒããŒãã¯ãã®æ
å ±ã䜿çšããŠãçµåæ¡ä»¶ã§åŸããé€å€ãããããŒãã£ã·ã§ã³ã®ãã¡ã€ã«èªã¿åããã¹ãããã§ããŸãã æ¬¡ã®äŸã§ã¯ã orders ããŒãã«ãš lineitem ããŒãã«ãæ³šææ¥ ( orders ã® o_orderdate ã lineitem ã® l_orderdate ) ã§ããŒãã£ã·ã§ã³åå²ãããŠããŸãã lineitem ããŒãã«ã¯åèšã§çŽ 75 GB ã® CSV ã§ãorders ã¯çŽ 16 GB ã§ãããã®ã¯ãšãªã¯ãããŒãã£ã·ã§ã³ã®çŽ 10% ã«åºçŸããç¹å®ã®æ¡ä»¶ã»ããã§æ³šæãããåç®ã®å¹³åäŸ¡æ Œãèšç®ããŸãããããã®ããŒãã£ã·ã§ã³ã¯äºåã«åãããªããããææªã®å Žå㯠90 GB ã®ããŒã¿ãã¹ãã£ã³ããå¿
èŠããããŸãããå®éã«ã¯ããã 26.5 GB ããã¹ãã£ã³ããŸããã SELECT AVG(l_extendedprice) FROM lineitem JOIN orders ON (l_orderkey = o_orderkey AND l_orderdate = o_orderdate) WHERE o_clerk = 'Clerk#000094772' AND o_orderpriority = '1-URGENT' AND o_orderstatus = 'F' ã¯ãšãªãå®è¡ããããšãAthena ã¯æ¡ä»¶ãæºããè¡ã® o_orderdate ã®å€ãåéããŸãããããã®å€ãã¯ã©ã¹ã¿ãŒå
šäœã«ãããŒããã£ã¹ãããããŒãã lineitem ããŒãã«ã®äžèŽããªãããŒãã£ã·ã§ã³ã®èªã¿èŸŒã¿ãã¹ãããã§ããããã«ããŸãã EXPLAIN ã³ãã³ãã䜿çšãããšãAthena ãåçããŒãã£ã·ã§ã³ãã«ãŒãã³ã°ãå®è¡ãããã©ããã確èªã§ããŸããåºåã§ dynamicFilterAssignment ãæ¢ããŠãã ããããã®äŸã®ã¯ãšãªã§ã¯ãEXPLAIN ãã©ã³ã¯æ¬¡ã®ããã«ãªããŸãã ⊠Fragment 1 [HASH] Output layout: [avg_4] Output partitioning: SINGLE [] Aggregate[type = PARTIAL] â Layout: [avg_4:row(double, bigint)] â Estimates: {rows: 1 (55B), cpu: ?, memory: 55B, network: 0B} â avg_4 := avg("l_extendedprice") ââ InnerJoin[criteria = ("l_orderkey" = "o_orderkey") AND ("l_orderdate" = "o_orderdate"), hash = [$hashvalue, $hashvalue_6], distribution = PARTITIONED] â Layout: [l_extendedprice:double] â Estimates: {rows: ? (?), cpu: ?, memory: ?, network: 0B} â Distribution: PARTITIONED â dynamicFilterAssignments = {o_orderkey -> #df_562, o_orderdate -> #df_563} ââ RemoteSource[sourceFragmentIds = [2]] â Layout: [l_orderkey:integer, l_extendedprice:double, l_orderdate:varchar, $hashvalue:bigint] ââ LocalExchange[partitioning = HASH, hashColumn = [$hashvalue_6], arguments = ["o_orderkey", "o_orderdate"]] â Layout: [o_orderkey:integer, o_orderdate:varchar, $hashvalue_6:bigint] â Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B} ââ RemoteSource[sourceFragmentIds = [3]] Layout: [o_orderkey:integer, o_orderdate:varchar, $hashvalue_7:bigint] ⊠åçããŒãã£ã·ã§ã³ãã«ãŒãã³ã°ã®è©³çްã«ã€ããŠã¯ãTrino ããã¥ã¡ã³ãã® åçãã£ã«ã¿ãªã³ã° ãã芧ãã ããã ã¯ãã¹ãžã§ã€ã³ã«æ³šæ çµåæ¡ä»¶ã¯çµåã®ããã©ãŒãã³ã¹ã«ã倧ããªåœ±é¿ãäžããŸããçµåæ¡ä»¶ãè€éãªå ŽåãäŸãã° LIKE ã > ã䜿çšããå Žåãèšç®ã³ã¹ããã¯ããã«é«ããªããŸããææªã®å Žåãäžæ¹ã®çµååŽã®ãã¹ãŠã®ã¬ã³ãŒãããããäžæ¹ã®çµååŽã®ãã¹ãŠã®ã¬ã³ãŒããšæ¯èŒããå¿
èŠããããŸãããã㯠ã¯ãã¹çµå ãšåŒã°ããŸããå¯èœãªéããç倿¡ä»¶ã䜿çšããŠãã ããã EXPLAIN ã³ãã³ãã䜿çšãããšãAthena ãã©ã®ãããªãžã§ã€ã³ãå®è¡ãããã確èªã§ããŸããããšãã°ã ON t1.name LIKE (t2.prefix || '%') ã®ãããªãžã§ã€ã³æ¡ä»¶ã§ã¯ãšãªã«å¯Ÿã㊠EXPLAIN ãå®è¡ãããšã次ã®ããã«åºåãããŸãã Fragment 1 [HASH] ⊠ââ CrossJoin[] ã¯ãšãªãã©ã³ã«ã¯ãã¹ãžã§ã€ã³ã衚瀺ãããå Žåã¯ã代ããã«ç䟡æ¡ä»¶ã䜿çšããããã«ã¯ãšãªãæžãæããããšãæ€èšããŠãã ãããããŒãã«ã® 1 ã€ãéåžžã«å°ããå Žåãé€ããã¯ãã¹ãžã§ã€ã³ãå«ãã¯ãšãªã¯ã¯ãšãªã¿ã€ã ã¢ãŠãå¶éãè¶
ãããªã¹ã¯ãé«ããªããŸãã 8. GROUP BY ã®æé©å éçŽãå®è¡ããéã¯ã GROUP BY å¥ã«å«ããã«ã©ã ãå¯èœãªéãå°ãªãããŠãå¿
èŠãª CPU ãšã¡ã¢ãªã®éãæžããã¹ãã§ããããã«ãå€ã®ååžãå¯èœãªéãåäžãªã«ã©ã ã§ã°ã«ãŒãåããããšã確èªããŠãã ããã éçŽã¯ãšãªã®ããã©ãŒãã³ã¹åé¡ã®äžå ã¯ããŒã¿ã¹ãã¥ãŒã§ããããã¯ãå€ãã®è¡ã GROUP BY å¥ã®ã«ã©ã ã«åãå€ãæã£ãŠããå Žåã«çºçããå¯èœæ§ããããŸããéçŽäžãè¡ã¯ GROUP BY å¥ã®ã«ã©ã ã®ããã·ã¥ã«åºã¥ããŠã¯ãŒã«ãŒããŒãã«åæ£ãããŸããããŒã¿ã«ã¹ãã¥ãŒãããå Žåã1 ã€ã®ããŒããéçŽã®å€§éšåãåŠçããªããã°ãªãããä»ã®ããŒãã¯ã¢ã€ãã«ç¶æ
ã«ãªãå¯èœæ§ããããŸãã åé·ãªã«ã©ã ã¯ãSQL èšèªãåŒã GROUP BY å¥ã«å«ãããéçŽé¢æ°ã䜿çšããããšãèŠæ±ããããããã°ãã° GROUP BY å¥ã«è¿œå ãããŸããããšãã°ã customer_id ãš customer_name ã«ã©ã ãããããŒãã«ã®å Žåã1 ã€ã® customer_id ã«å¯Ÿã㊠1 ã€ã®ååãããªãã«ããããããã顧客ããšã«éèšããå Žå㯠GROUP BY c_custkey, c_name ãšæžãå¿
èŠããããŸãã GROUP BY å¥ã«å€ãã®åé·ãªã«ã©ã ãããå Žåã«ã¯ãšãªãé«éåããæ¹æ³ã® 1 ã€ã ARBITRARY 颿°ã§ãããã®é¢æ°ã¯ãååã瀺ãããã«ããã®ã°ã«ãŒãããä»»æã®å€ãè¿ãéçŽé¢æ°ã§ãã ãã®äŸã§ã¯ã顧客ããšã®æ³šææ°ãç¥ããããšèããŠããŸãã顧客ããŒãã«ãš orders ããŒãã«ãçµåãããšã泚æããšã« 1 è¡ã«ãªãã®ã§ã GROUP BY c_custkey ã䜿ã£ãŠé¡§å®¢ããšã«éçŽããŸããçµæã«é¡§å®¢åãå«ãããã®ã§ã GROUP BY å¥ã« c_name åã远å ãã代ããã«ã ARBITRARY(c_name) ã䜿çšããŸãã SELECT c_custkey, ARBITRARY(c_name) AS c_name, COUNT(*) AS order_count FROM customer JOIN orders ON (customer.c_custkey = orders.o_custkey) GROUP BY c_custkey å¯èœãªéãã GROUP BY å¥ããäžèŠãªã«ã©ã ãåé€ããå¿
èŠããããŸããåã®äŸã®ããã« 1 ã€ã®ã«ã©ã ãããªãå Žåãããã©ãŒãã³ã¹ã®åäžã¯ç®ç«ã¡ãŸããããããã GROUP BY å¥ã«å€æ°ã®åãããå€§èŠæš¡ãªããŒã¿ã»ããã«å¯Ÿããã¯ãšãªã®ããã©ãŒãã³ã¹ã«ãšã£ãŠã¯éåžžã«éèŠã§ãã 9. è¿äŒŒé¢æ°ã®å©çš å€§èŠæš¡ãªããŒã¿ã»ãããæ¢çŽ¢ããéã®äžè¬çãªãŠãŒã¹ã±ãŒã¹ã¯ã COUNT(DISTINCT column) ã䜿çšããŠç¹å®ã®åã®éè€ã®ãªãå€ã®æ°ãã«ãŠã³ãããããšã§ããäŸãšããŠã¯ãWeb ããŒãžã«ã¢ã¯ã»ã¹ãããŠããŒã¯ãŠãŒã¶ãŒã®æ°ã調ã¹ãããšãæããããŸãã æ£ç¢ºãªæ°å€ãå¿
èŠãªãå Žå (äŸãã°ãã©ã®ãŠã§ãããŒãžã詳ãã調ã¹ãããæ¢ããŠããå Žå)ã approx_distinct(column) ã®äœ¿çšãæ€èšããŠãã ããããã®é¢æ°ã¯ãå®å
šãªæååã§ã¯ãªããå€ã®ãŠããŒã¯ãªããã·ã¥ãã«ãŠã³ãããããšã§ãã¡ã¢ãªäœ¿çšéãæå°éã«æããããšããŸããæ³šæç¹ã¯ãæšæºèª€å·®ã 2.3% ããããšã§ãã æ¬¡ã®è¡šã¯ãããã¹ã圢åŒã§å§çž®ãããŠããªãçŽ 6 åè¡ã® 74 GB ã®ããŒãã«ã䜿çšãããšãã®é«éåã®æŠèŠããŸãšãããã®ã§ãã ã¯ãšãª å®è¡æé SELECT COUNT(DISTINCT l_comment) FROM lineitem 7.7 ç§ SELECT approx_distinct(l_comment) FROM lineitem 4.6 ç§ é«éå çŽ 40% é«é 詳现ã«ã€ããŠã¯ãTrino ããã¥ã¡ã³ãã® æŠç®éçŽé¢æ° ãåç
§ããŠãã ããã 10. å¿
èŠãªã«ã©ã ã®ã¿ãå«ãã ã¯ãšãªãå®è¡ããéã¯ã SELECT ã¹ããŒãã¡ã³ãã§å¿
èŠãªã«ã©ã ã®ã¿ãéžæãããã¹ãŠã®ã«ã©ã ãéžæããªãããã«ããŠãã ãããã«ã©ã æ°ãåæžããããšã§ãã¯ãšãªå
šäœã®ãã€ãã©ã€ã³ãéããŠåŠçããå¿
èŠãããããŒã¿éãæžããæçµçãªçµæã«æžã蟌ãŸããããŒã¿éãæžããŸããããã¯ç¹ã«ã倿°ã®æååããŒã¹ã®ã«ã©ã ãæã€ããŒãã«ãç
§äŒããããè€æ°ã®çµåãéçŽãå®è¡ããå Žåã«ç¹ã«å¹æçã§ããåç
§ããŒã¿ãåæåã®ãã¡ã€ã«åœ¢åŒã®å Žåã¯ãç¹å®ã®ã«ã©ã ã®ããŒã¿ã®ã¿ã Amazon S3 ããèªã¿åããããããã¹ãã£ã³ãããããŒã¿éãæžããŸãã æ¬¡ã®è¡šã¯ãçŽ 6000 äžè¡ã®ããã¹ã圢åŒã®éå§çž® 7.25 GB ã®ããŒãã«ã䜿çšãããšãã®é«éåã®æŠèŠããŸãšãããã®ã§ãã ã¯ãšãª å®è¡æé SELECT * FROM lineitem, orders, customer WHERE l_orderkey = o_orderkey AND c_custkey = o_custkey 19.7 ç§ SELECT c_name, l_quantity, o_totalprice FROM lineitem, orders, customer WHERE l_orderkey = o_orderkey AND c_custkey = o_custkey 5.2 ç§ é«éå 73% é«é ããŒãã¹ã®ãã³ã ãã®ã»ã¯ã·ã§ã³ã§ã¯ã远å ã®ããã©ãŒãã³ã¹ãã¥ãŒãã³ã°ã®ãã³ããšããã®èšäºã®æåã®ããŒãžã§ã³ä»¥éã«ãªãªãŒã¹ãããæ°ããããã©ãŒãã³ã¹æåã®æ©èœã«ã€ããŠèª¬æããŸãã ããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ã䜿çšããããŒãã£ã·ã§ã³åŠçã®æé©å ããŒãã£ã·ã§ã³æ°ãéåžžã«å€ããAWS Glue ã®ããŒãã£ã·ã§ã³ã€ã³ããã¯ã¹ã䜿çšããŠããªãå ŽåãããŒãã£ã·ã§ã³æ
å ±ã®åŠçã Athena ã¯ãšãªã®ããã«ããã¯ã«ãªãå¯èœæ§ããããŸããAthena ã® ããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ ã䜿çšãããšãããŒãã£ã·ã§ã³æ°ãéåžžã«å€ãããŒãã«ã®ã¯ãšãªåŠçãé«éåããããŒãã£ã·ã§ã³ç®¡çãèªååã§ããŸããããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ã¯ãããŒãã£ã·ã§ã³æ
å ±ãã¡ã¿ã¹ãã¢ããååŸããã®ã§ã¯ãªããããŒãã£ã·ã§ã³æ
å ±ãèšç®ããŠã¯ãšãªããããããã®ãªãŒããŒããããæå°éã«æããããšãã§ããŸããAWS Glue ããŒãã«ã«ããŒãã£ã·ã§ã³ã®ã¡ã¿ããŒã¿ã远å ããå¿
èŠããªããªããŸã(ãã®æ©èœã¯ Athena ã§ã®ã¿å©çšããããšãã§ããæ©èœã§ããç¹ã«ã泚æãã ãã)ã ããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ã§ã¯ãAWS Glue Data Catalog ã®ãããªãªããžããªããèªã¿åãã®ã§ã¯ãªããæ§æããèšç®ãããããŒãã£ã·ã§ã³å€ãšãã±ãŒã·ã§ã³ã䜿çšãããŸããã¡ã¢ãªå
ã®æäœã¯ãªã¢ãŒãæäœãããéåžžé«éã§ãããããããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ã¯ããŒãã£ã·ã§ã³æ°ãéåžžã«å€ãããŒãã«ã«å¯Ÿããã¯ãšãªã®å®è¡æéãççž®ã§ããŸããã¯ãšãªãšåºç€ãšãªãããŒã¿ã®ç¹æ§ã«ãã£ãŠã¯ãããŒãã£ã·ã§ã³ã¡ã¿ããŒã¿ã®ååŸã§é
å»¶ãçããã¯ãšãªã®å®è¡æéã倧å¹
ã«ççž®ã§ããå¯èœæ§ããããŸãã ããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ã䜿çšããã®ã¯ãããŒãã£ã·ã§ã³ã®ã¹ããŒããåãã§ããå ŽåããããŒãã«ã®ã¹ããŒããããŒãã£ã·ã§ã³ã®ã¹ããŒããæ£ç¢ºã«èšè¿°ããŠããå Žåã«çæ³çã§ããID ã®ãããªéåžžã«å€ã®çš®é¡ãå€ãåããéåžžã«çްããç²åºŠã®æ¥ä»ç¯å²ã§ããŒãã£ã·ã§ã³åå²ããã®ã«äœ¿çšã§ããŸãã 詳现ã«ã€ããŠã¯ã Amazon Athena ã§ã®ããŒãã£ã·ã§ã³ãããžã§ã¯ã·ã§ã³ ãã芧ãã ããã å€§èŠæš¡ãªçµæã»ãããçæããã¯ãšãªã®é«éå (UNLOAD ã®å©çš) Athena ã§ SELECT ã¯ãšãªãå®è¡ãããšãéå§çž® CSV 圢åŒã®åäžã®çµæãã¡ã€ã«ã Amazon S3 ã«çæãããŸããã¯ãšãªã®åºåçµæã倧ãããªããšäºæ³ãããå Žåãåäžã®ãã¡ã€ã«ãžã®æžã蟌ã¿ã«å€ãã®æéãããããŸãã UNLOAD ã䜿çšãããšãçµæã Amazon S3 ã®è€æ°ã®ãã¡ã€ã«ã«åå²ã§ãããããæžãèŸŒã¿æ®µéã§è²»ããããæéãççž®ãããŸãããŸããçµæã»ããã®åœ¢åŒ (ORCãParquetãAvroãJSONãTEXTFILE) ãšå§çž®ã¿ã€ã (ParquetãJSONãTEXTFILE ã®å Žåã¯ããã©ã«ãã§ gzipãORC ã®å Žå㯠zlib) ãæå®ã§ããŸãã æ¬¡ã®è¡šã¯ã SELECT ãš UNLOAD ã¹ããŒãã¡ã³ãã®æ¯èŒã瀺ããŠããŸãããã®ã¯ãšãªã§ã¯ãçŽ 13 GB ã®éå§çž®ããŒã¿ãåºåãããããšãäºæ³ãããŸãã ã¯ãšãª SELECT * FROM lineitem LIMIT 85700000 UNLOAD (SELECT * FROM lineitem LIMIT 85700000) to with (format=âTEXTFILEâ) ç¯çŽ å®è¡æé 362 ç§ 33.2 ç§ ~ 90% é«éå çµæã»ãã 13 GB (CSVãéå§çž®) 3.2 GB (CSVãgzip å§çž®) ~ 75% ã¹ãã¬ãŒãžåæž ããŒã¿ã倿ŽãããŠããªãå Žåã®ã¯ãšãªçµæã®åå©çš ããŒã¿ã¬ã€ã¯ã®ããŒã¿ã»ããã¯ãå€ãã®å Žåã1 æ¥ã« 1 åããŸã㯠1 æ¥ã«æ°åããæŽæ°ãããŸããããããé »ç¹ã«ã¯ãšãªãå®è¡ãããããšããããããŸããããã·ã¥ããŒããæŽæ°ããããã®ã¯ãšãªããã¢ããªã±ãŒã·ã§ã³ã®ãã¥ãŒã«ã¢ã¯ã»ã¹ãããã³ã«å®è¡ãããã¯ãšãªãããå¯èœæ§ããããŸããååå®è¡ããæãããããŒã¿ã倿ŽãããŠããªãå Žåã¯ãçµæãåèšç®ããå¿
èŠã¯ãããŸãããå®éãåèšç®ãããšããæéãããããã³ã¹ããé«ããªããŸãããã®ãããªç¶æ³ã§ã¯ãã¯ãšãªçµæã®åå©çšã掻çšã§ããŸããããã¯ãåãã¯ãšãªãäŸãã°éå» 15 å以å
ã«å®è¡ãããå ŽåãAthena ããã®å®è¡çµæãè¿ãããã«æç€ºããæ©èœã§ãããã®ãããªçµæãããå ŽåãAthena ã¯ããã«ãã®çµæãè¿ããããŒã¿ã®ã¹ãã£ã³ã¯è¡ãããŸããã ã¯ãšãªçµæã®åå©çšã®è©³çްã«ã€ããŠã¯ã Amazon Athena ã®ã¯ãšãªçµæåå©çšã«ããã³ã¹ãåæžãšã¯ãšãªããã©ãŒãã³ã¹åäž ããã³ ã¯ãšãªçµæã®åå©çš ãåç
§ããŠãã ããã çµè« ãã®æçš¿ã§ã¯ãAthena SQL ã§ã®å¯Ÿè©±ååæãæé©åããããã®ããã 10 ã®ãã³ãã玹ä»ããŸããããããã®å®è·µæ¹æ³ã¯ãAmazon EMR äžã® Trino ã䜿çšããéã«ãé©çšã§ããŸãã ãã®èšäºã® ãã«ã³èªç¿»èš³ç ãã芧ããã ããŸãã èè
ã«ã€ã㊠Mert Hocanin ã¯ãAWS Lake Formation ã® Principal Big Data Architect ã§ãã Pathik Shah ã¯ãAmazon Athena ã® Sr. ããã°ããŒã¿ã¢ãŒããã¯ãã§ãã2015 幎㫠AWS ã«å
¥ç€Ÿãã以æ¥ããã°ããŒã¿åæã®åéã«æ³šåããAWS ã®åæãµãŒãã¹ã䜿çšããŠã¹ã±ãŒã©ãã«ã§å
ç¢ãªãœãªã¥ãŒã·ã§ã³ãæ§ç¯ããã客æ§ããµããŒãããŠããŸãã Theo Tolv ã¯ã¹ãŠã§ãŒãã³ãã¹ããã¯ãã«ã ã«æ ç¹ã眮ãã·ãã¢ã¢ããªãã£ã¯ã¹ã¢ãŒããã¯ãã§ãã圌ã¯ãã£ãªã¢ã®å€§åãå°ããªããŒã¿ãšå€§ããªããŒã¿ã§ä»äºãããŠããŸããããã㊠2008 幎ãã AWS äžã§åäœããã¢ããªã±ãŒã·ã§ã³ãæ§ç¯ããŠããŸããäœææéã«ã¯ãé»åå·¥äœãããã£ãããå®å®ãªãã©ãèªãã®ã奜ãã§ãã ç£æ»å±¥æŽ 2024 幎 2 æã« Theo Tolv æ° (ã·ãã¢ã¢ããªãã£ã¯ã¹ã¢ãŒããã¯ã) ã«ããæçµç¢ºèªãšæŽæ°ãè¡ãããŸããã翻蚳ã¯ã¢ããªãã£ã¯ã¹ã¹ãã·ã£ãªã¹ããœãªã¥ãŒã·ã§ã³ã¢ãŒããã¯ãã®å·æãæ
åœããŸãããåæã¯ ãã¡ã ã§ãã