This post is a translation of "Enrich your AWS Glue Data Catalog with generative AI metadata using Amazon Bedrock", published on 2024/11/15. The translation was done by Watanabe (渡邉), Solutions Architect.

Metadata plays a critical role when you use data assets to make data-driven decisions. Generating metadata for data assets is often a manual, time-consuming task. With generative AI, you can automate the generation of comprehensive, documentation-based metadata for your data assets, and strengthen data discovery, data understanding, and overall data governance within your AWS Cloud environment. This post shows how to enrich the AWS Glue Data Catalog with dynamic metadata by using foundation models (FMs) on Amazon Bedrock together with your data documentation.

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API.

Solution overview

In this solution, we automatically generate metadata for table definitions in the Data Catalog by using large language models (LLMs) through Amazon Bedrock. First, we explore the option of in-context learning, where the LLM generates the requested metadata without documentation. Then we improve the metadata generation by adding the data documentation to the LLM prompt using Retrieval Augmented Generation (RAG).

AWS Glue Data Catalog

This post uses the AWS Glue Data Catalog, a centralized metadata repository for data assets across various data sources. The Data Catalog provides a unified interface to store and query information about data formats, schemas, and sources. It acts as an index to the location, schema, and runtime metrics of your data sources.

The most common way to populate the Data Catalog is to use an AWS Glue crawler, which automatically discovers and catalogs data sources. When you run the crawler, it creates metadata tables that are added to a database you specify or to the default database. Each table represents a single data store.

Generative AI models

LLMs are trained on vast volumes of data and use billions of parameters to generate outputs for common tasks such as answering questions, translating languages, and completing sentences. To use an LLM for a specific task such as metadata generation, you need an approach that guides the model toward the output you expect.

This post shows how to generate descriptive metadata for your data with two different approaches:

- In-context learning
- Retrieval Augmented Generation (RAG)

The solution uses two generative AI models available in Amazon Bedrock: one for text generation tasks and Amazon Titan Embeddings V2 for text retrieval tasks.

The following sections describe the implementation details of each approach using Python. The accompanying code is available in the GitHub repository. You can implement it step by step in Amazon SageMaker Studio, JupyterLab, or your own environment. If you're new to SageMaker Studio, check out the Quick setup experience, which lets you launch it with default settings in minutes. You can also use the code in an AWS Lambda function or in your own application.
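Approach 1: In-context learning

In this approach, you use an LLM to generate the metadata descriptions, and you use prompt engineering to tell the LLM what output you want it to generate. This approach is best suited to AWS Glue databases with a small number of tables, because you can send the table information from the Data Catalog as context in the prompt without exceeding the context window (the number of input tokens that most Amazon Bedrock models accept). The following diagram illustrates this architecture.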
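To judge whether a catalog is small enough for in-context learning, a rough size check before building the prompt can help. The sketch below is illustrative only: the `estimate_tokens` and `fits_in_context` helpers, the ratio of 4 characters per token, and the 200,000-token window in the usage line are assumptions rather than values from this post. Check the documentation of the model you use for its actual limit and tokenizer behavior.

```python
import json

# Assumption: roughly 4 characters per token for English-like text.
# Real tokenizers differ, so treat the result as a coarse estimate.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate (character count divided by 4, rounded up)."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fits_in_context(catalog_json: str, instructions: str,
                    context_window: int, reserve_for_output: int = 4096) -> bool:
    """Check whether instructions plus catalog leave room for the model's answer."""
    needed = estimate_tokens(catalog_json) + estimate_tokens(instructions)
    return needed + reserve_for_output <= context_window


# Hand-written sample standing in for the serialized Data Catalog.
sample_catalog = json.dumps(
    [{"Name": "persons",
      "StorageDescriptor": {"Columns": [{"Name": "id", "Type": "string"}]}}]
)
print(fits_in_context(sample_catalog, "Describe the table persons.",
                      context_window=200_000))
```

If the check fails for your catalog, that is a signal to send only a subset of tables or to move to the RAG approach described next.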
Approach 2: Retrieval Augmented Generation (RAG)

If you have hundreds of tables, adding all of the Data Catalog information to the prompt as context can produce a prompt that exceeds the LLM's context window. In some cases, you may also have additional content, such as business requirements documents or technical documentation, that you want the FM to reference before it generates the output. Such documents can be several pages long and typically exceed the maximum number of input tokens that most LLMs accept, so they can't be included in the prompt as they are.

The solution is to use a RAG approach. With RAG, you can optimize the output of an LLM so that it references an authoritative knowledge base outside of its training data sources before generating a response. RAG extends the LLM to a specific domain or to an organization's internal knowledge base without fine-tuning the model. It is a cost-effective approach to improving LLM output so that it remains relevant, accurate, and useful in various contexts.

With RAG, the LLM can reference technical documents and other information about your data before generating the metadata. As a result, the generated descriptions are expected to be richer and more accurate.
The example in this post ingests data from a public Amazon Simple Storage Service (Amazon S3) location: s3://awsglue-datasets/examples/us-legislators/all. The dataset contains data in JSON format about United States legislators and the seats they have held in the US House of Representatives and the US Senate. The data documentation was retrieved from Popolo ( http://www.popoloproject.com/ ).

The following architecture diagram illustrates the RAG approach.

The flow is as follows:

1. Ingest the information from the data documentation. The documentation can come in a variety of formats. In this post, the documentation is a website.
2. Chunk the contents of the HTML pages of the data documentation, then generate and store vector embeddings for the documentation.
3. Fetch the information for the database tables from the Data Catalog.
4. Perform a similarity search in the vector store and retrieve the most relevant information.
5. Build the prompt. Provide instructions on how to create the metadata, and add the retrieved information and the Data Catalog table information as context. Because this is a rather small database containing six tables, all of the information about the database is included.
6. Send the prompt to the LLM, get the response, and update the Data Catalog.
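The steps above can be sketched as a small orchestration function. This is a minimal, hypothetical outline rather than the notebook's implementation: `build_prompt` and `enrich_table` are illustrative names, the prompt wording is shortened, and `retrieve`, `call_llm`, and `update_catalog` are placeholder callables that you would back with a vector store, Amazon Bedrock, and the AWS Glue API respectively.

```python
import json
from typing import Any, Callable, Dict, List


def build_prompt(table: str, catalog_json: str, passages: List[str]) -> str:
    """Step 5: combine instructions, retrieved passages, and catalog info."""
    docs = "\n".join(passages)
    return (
        f"<documentation>{docs}</documentation>\n"
        f"Create metadata descriptions for the table {table}. "
        "Reply with a valid JSON object.\n"
        f"<glue_data_catalog>{catalog_json}</glue_data_catalog>"
    )


def enrich_table(
    table: str,
    catalog: List[Dict[str, Any]],
    retrieve: Callable[[str], List[str]],
    call_llm: Callable[[str], str],
    update_catalog: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Run steps 3 to 6 of the flow for one table."""
    catalog_json = json.dumps(catalog)                     # step 3: catalog info
    passages = retrieve(table)                             # step 4: similarity search
    prompt = build_prompt(table, catalog_json, passages)   # step 5: build the prompt
    table_input = json.loads(call_llm(prompt))             # step 6: call the LLM
    update_catalog(table_input)                            # step 6: update the catalog
    return table_input
```

Passing the external services in as callables keeps the flow easy to test with stubs before any AWS resources exist. Steps 1 and 2 (ingesting and embedding the documentation) happen once, ahead of time, and are represented here only by whatever `retrieve` searches.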
Prerequisites

To follow the steps in this post and deploy the solution in your own AWS account, refer to the GitHub repository.

You need the following resources:

- An AWS account
- Python and boto3
- An AWS Identity and Access Management (IAM) role for the AWS Glue crawler that includes the AWSGlueServiceRole policy or an equivalent policy, plus an inline policy that grants access to the S3 bucket where the data used in this post is stored. As part of the environment setup, this post creates a new S3 bucket named aws-gen-ai-glue-metadata-<random_sequence>. The following is an example inline policy:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-gen-ai-glue-metadata-*/*"
                ]
            }
        ]
    }

- An IAM role for your notebook environment. The role must have the appropriate permissions for AWS Glue, Amazon Bedrock, and Amazon S3. The following is an example policy. You can restrict it further by adding conditions that suit your environment:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GluePermissions",
                "Effect": "Allow",
                "Action": [
                    "glue:GetCrawler",
                    "glue:DeleteDatabase",
                    "glue:GetTables",
                    "glue:DeleteCrawler",
                    "glue:StartCrawler",
                    "glue:CreateDatabase",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                    "glue:UpdateCrawler",
                    "glue:GetTable",
                    "glue:CreateCrawler"
                ],
                "Resource": "*"
            },
            {
                "Sid": "S3Permissions",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:CreateBucket",
                    "s3:ListBucket",
                    "s3:DeleteObject",
                    "s3:DeleteBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::<bucket_name>",
                    "arn:aws:s3:::<bucket_name>/*"
                ]
            },
            {
                "Sid": "IAMPermissions",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "arn:aws:iam::<account_ID>:role/GlueCrawlerRoleBlog"
            },
            {
                "Sid": "BedrockPermissions",
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": [
                    "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
                    "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-text-v2:0"
                ]
            }
        ]
    }

- Model access in Amazon Bedrock to Anthropic's Claude 3 and Amazon Titan Text Embeddings V2
- The notebook how_to_generate_metadata_for_glue_data_catalog_w_bedrock.ipynb

Set up the resources and environment

With the prerequisites in place, switch to the notebook environment to run the next steps. The setup steps in the notebook first create the resources that the notebook needs:

- An S3 bucket
- An AWS Glue database
- An AWS Glue crawler, which runs automatically and generates the AWS Glue database tables

When the setup is complete, an AWS Glue database called legislators has been created. The crawler creates the following metadata tables:

- persons
- memberships
- organizations
- events
- areas
- countries

This is a semi-normalized collection of tables containing legislators and their histories.

Follow the remaining steps in the notebook to complete the environment setup. It takes only a few minutes.

Inspect the Data Catalog

After the setup is complete, inspect the Data Catalog to review the catalog and its metadata. On the AWS Glue console, choose Databases in the navigation pane, then open the newly created legislators database. It should contain six tables, as shown in the following screenshot.

You can open any table to view its details. The table description and the comment for each column are empty because the AWS Glue crawler doesn't populate them automatically.

You can access the technical metadata of each table programmatically with the AWS Glue API. The following code snippet uses the AWS Glue API through the AWS SDK for Python (Boto3) to retrieve the tables of the selected database and print them for verification.
This code, taken from the notebook for this post, is used to retrieve the Data Catalog information programmatically:

    import json
    from datetime import date, datetime

    import boto3

    # Assumes AWS credentials and a default Region are configured. In the
    # notebook, glue_client and database are created during setup.
    glue_client = boto3.client("glue")
    database = "legislators"

    def get_alltables(database):
        """Return every table definition in the given AWS Glue database."""
        tables = []
        get_tables_paginator = glue_client.get_paginator('get_tables')
        for page in get_tables_paginator.paginate(DatabaseName=database):
            tables.extend(page['TableList'])
        return tables

    def json_serial(obj):
        """Serialize datetime/date values, which json.dumps cannot encode by default."""
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        raise TypeError("Type %s not serializable" % type(obj))

    database_tables = get_alltables(database)

    for table in database_tables:
        print(f"Table: {table['Name']}")
        print(f"Columns: {[col['Name'] for col in table['StorageDescriptor']['Columns']]}")
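The `json_serial` helper matters because table definitions returned by the AWS Glue API through Boto3 carry timestamp fields such as `CreateTime` and `UpdateTime` as `datetime` objects, which `json.dumps` cannot encode by default. The following self-contained sketch uses a hand-written sample record, not real API output, to show the difference.

```python
import json
from datetime import date, datetime


def json_serial(obj):
    """Serialize datetime/date values as ISO 8601 strings."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))


# Hypothetical, trimmed-down table record for illustration only.
sample_table = {"Name": "persons", "CreateTime": datetime(2024, 11, 15, 9, 30, 0)}

# Without a default handler, json.dumps raises TypeError on the datetime.
try:
    json.dumps(sample_table)
    default_encoding_worked = True
except TypeError:
    default_encoding_worked = False

# With json_serial, the timestamp is written as a string.
serialized = json.dumps(sample_table, default=json_serial)
print(default_encoding_worked)
print(serialized)
```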
Now that you're familiar with the AWS Glue database and its tables, the next step is to use generative AI to generate descriptions for the table metadata.

Generate table metadata descriptions with Anthropic's Claude 3 using Amazon Bedrock and LangChain

In this step, you generate technical metadata for a selected table in the AWS Glue database. This post uses the persons table. First, get all the tables from the Data Catalog and include them as part of the prompt. Although the code aims to generate metadata for a single table, giving the LLM the wider information helps it detect foreign keys. Install LangChain v0.2.1 in your notebook environment, then review the following code:

    import json
    import pprint

    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from botocore.config import Config
    from langchain_aws import ChatBedrock

    # bedrock_client and model_id are assumed to be defined in the notebook setup.
    glue_data_catalog = json.dumps(get_alltables(database), default=json_serial)

    model_kwargs = {
        # Increase or decrease temperature depending on how much randomness you
        # want in the response. A value closer to 1 increases randomness.
        "temperature": 0.5,
        "top_p": 0.999
    }

    model = ChatBedrock(
        client=bedrock_client,
        model_id=model_id,
        model_kwargs=model_kwargs
    )

    table = "persons"
    response_get_table = glue_client.get_table(DatabaseName=database, Name=table)
    pprint.pp(response_get_table)

    user_msg_template_table = """
    I'd like you to create metadata descriptions for the table called {table} in your AWS Glue data catalog. Please follow these steps:
    1. Review the data catalog carefully
    2. Use all the data catalog information to generate the table description
    3. If a column is a primary key or foreign key to another table mention it in the description.
    4. In your response, reply with the entire JSON object for the table {table}
    5. Remove the DatabaseName, CreatedBy, IsRegisteredWithLakeFormation, CatalogId,VersionId,IsMultiDialectView,CreateTime, UpdateTime.
    6. Write the table description in the Description attribute
    7. List all the table columns under the attribute "StorageDescriptor" and then the attribute Columns. Add Location, InputFormat, and SerdeInfo
    8. For each column in the StorageDescriptor, add the attribute "Comment". If a table uses a composite primary key, then the order of a given column in a table's primary key is listed in parentheses following the column name.
    9. Your response must be a valid JSON object.
    10. Ensure that the data is accurately represented and properly formatted within the JSON structure. The resulting JSON table should provide a clear, structured overview of the information presented in the original text.
    11. If you cannot think of an accurate description of a column, say 'not available'
    Here is the data catalog json in <glue_data_catalog></glue_data_catalog> tags.
    <glue_data_catalog>
    {data_catalog}
    </glue_data_catalog>
    Here is some additional information about the database in <notes></notes> tags.
    <notes>
    Typically foreign key columns consist of the name of the table plus the id suffix
    </notes>
    """

    messages = [
        ("system", "You are a helpful assistant"),
        ("user", user_msg_template_table),
    ]

    prompt = ChatPromptTemplate.from_messages(messages)
    chain = prompt | model | StrOutputParser()

    # Invoke the chain. Pass the catalog JSON string itself, not a set containing it.
    TableInputFromLLM = chain.invoke({"data_catalog": glue_data_catalog, "table": table})
    print(TableInputFromLLM)
In the preceding code, you instructed the LLM to provide a JSON response that fits the TableInput object expected by the Data Catalog update API. The following is an example response:

    {
      "Name": "persons",
      "Description": "This table contains information about individual persons, including their names, identifiers, contact details, and other relevant personal data.",
      "StorageDescriptor": {
        "Columns": [
          { "Name": "family_name", "Type": "string", "Comment": "The family name or surname of the person." },
          { "Name": "name", "Type": "string", "Comment": "The full name of the person." },
          { "Name": "links", "Type": "array<struct<note:string,url:string>>", "Comment": "An array of links related to the person, containing a note and URL." },
          { "Name": "gender", "Type": "string", "Comment": "The gender of the person." },
          { "Name": "image", "Type": "string", "Comment": "A URL or path to an image of the person." },
          { "Name": "identifiers", "Type": "array<struct<scheme:string,identifier:string>>", "Comment": "An array of identifiers for the person, each with a scheme and identifier value." },
          { "Name": "other_names", "Type": "array<struct<lang:string,note:string,name:string>>", "Comment": "An array of other names the person may be known by, including the language, a note, and the name itself." },
          { "Name": "sort_name", "Type": "string", "Comment": "The name to be used for sorting or alphabetical ordering." },
          { "Name": "images", "Type": "array<struct<url:string>>", "Comment": "An array of URLs or paths to additional images of the person." },
          { "Name": "given_name", "Type": "string", "Comment": "The given name or first name of the person." },
          { "Name": "birth_date", "Type": "string", "Comment": "The date of birth of the person." },
          { "Name": "id", "Type": "string", "Comment": "The unique identifier for the person (likely a primary key)." },
          { "Name": "contact_details", "Type": "array<struct<type:string,value:string>>", "Comment": "An array of contact details for the person, including the type (e.g., email, phone) and the value." },
          { "Name": "death_date", "Type": "string", "Comment": "The date of death of the person, if applicable." }
        ],
        "Location": "s3://<your-s3-bucket>/persons/",
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "SerdeInfo": {
          "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
          "Parameters": {
            "paths": "birth_date,contact_details,death_date,family_name,gender,given_name,id,identifiers,image,images,links,name,other_names,sort_name"
          }
        }
      },
      "PartitionKeys": [],
      "TableType": "EXTERNAL_TABLE"
    }
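Model replies do not always consist of bare JSON. They may wrap the object in explanatory text or a Markdown code fence, in which case `json.loads` fails on the raw string. A small helper can pull out the outermost JSON object before parsing. This is a hedged sketch: `extract_json_object` is a hypothetical helper, not part of the notebook, and it assumes the reply contains exactly one top-level JSON object.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Parse the outermost {...} span found in a model reply.

    Assumes a single top-level JSON object; raises ValueError when no
    braces are found, and json.JSONDecodeError when the span is not JSON.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in model reply")
    return json.loads(reply[start:end + 1])


# Invented reply for illustration: prose before and after the object.
sample_reply = 'Here is the table definition:\n{"Name": "persons", "Description": "People"}\nLet me know if you need changes.'
print(extract_json_object(sample_reply)["Name"])
```

When the prompt's instruction to return only a valid JSON object is followed, the helper is a no-op, so it can sit in front of the validation step without changing the happy path.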
You can also validate that the generated JSON conforms to the format expected by the AWS Glue API:

    from jsonschema import validate

    schema_table_input = {
        "type": "object",
        "properties": {
            "Name": {"type": "string"},
            "Description": {"type": "string"},
            "StorageDescriptor": {
                # Nested fields must sit under "properties" to be checked.
                "type": "object",
                "properties": {
                    "Columns": {"type": "array"},
                    "Location": {"type": "string"},
                    "InputFormat": {"type": "string"},
                    "SerdeInfo": {"type": "object"}
                }
            }
        }
    }
    validate(instance=json.loads(TableInputFromLLM), schema=schema_table_input)

Now that the table and column descriptions have been generated, you can update the Data Catalog.

Update the Data Catalog with metadata

In this step, you use the AWS Glue API to update the Data Catalog:

    response = glue_client.update_table(DatabaseName=database, TableInput=json.loads(TableInputFromLLM))
    print(f"Table {table} metadata updated!")

The following screenshot shows the persons table metadata with its description.

The following screenshot shows the column descriptions as part of the table metadata.
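The prompt asks the model to drop several fields that `get_table` returns. Because a model may not always follow such instructions, you might also remove those fields deterministically before calling `update_table`. The helper below is a hypothetical safeguard, not code from the notebook. Its key list is copied from step 5 of the prompt rather than from the AWS Glue API reference, so verify it against the `TableInput` documentation before relying on it.

```python
from typing import Any, Dict

# Keys taken from step 5 of the prompt in this post (an assumption about
# which fields must be removed; check the TableInput reference).
READ_ONLY_KEYS = {
    "DatabaseName",
    "CreatedBy",
    "IsRegisteredWithLakeFormation",
    "CatalogId",
    "VersionId",
    "IsMultiDialectView",
    "CreateTime",
    "UpdateTime",
}


def to_table_input(table: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the table definition without the listed top-level keys."""
    return {key: value for key, value in table.items() if key not in READ_ONLY_KEYS}


# Invented example of a reply that kept two fields it should have removed.
llm_table = {
    "Name": "persons",
    "Description": "Information about individual persons.",
    "DatabaseName": "legislators",
    "VersionId": "3",
    "StorageDescriptor": {"Columns": []},
}
print(sorted(to_table_input(llm_table)))
```

The filter only touches top-level keys and leaves `StorageDescriptor` untouched, so the column comments generated by the model are preserved.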
Now that you have enriched the technical metadata stored in the Data Catalog, you can improve the descriptions further by adding external documentation.

Improve metadata descriptions by adding external documentation with RAG

In this step, you add external documentation to generate more accurate metadata. The documentation for our dataset is available online as HTML. Use the LangChain HTML loader to load it:

    from langchain_community.document_loaders import AsyncHtmlLoader

    # Use the community HTML loader to load the external documentation stored as HTML
    urls = [
        "http://www.popoloproject.com/specs/person.html",
        "http://docs.everypolitician.org/data_structure.html",
        "http://www.popoloproject.com/specs/organization.html",
        "http://www.popoloproject.com/specs/membership.html",
        "http://www.popoloproject.com/specs/area.html",
    ]
    loader = AsyncHtmlLoader(urls)
    docs = loader.load()

After the documents are downloaded, split them into chunks:

    # Imports added here; the original snippet assumes they were made earlier in the notebook.
    from langchain_text_splitters import CharacterTextSplitter
    from langchain_aws import BedrockEmbeddings

    text_splitter = CharacterTextSplitter(
        separator='\n',
        chunk_size=1000,
        chunk_overlap=200,
    )
    split_docs = text_splitter.split_documents(docs)

    # embeddings_model_id is assumed to be defined in the notebook setup.
    embedding_model = BedrockEmbeddings(
        client=bedrock_client,
        model_id=embeddings_model_id
    )

Next, vectorize the documents, store them locally, and run a similarity search. For production workloads, you can use a managed vector store service such as Amazon OpenSearch Service, or a fully managed solution for implementing the RAG architecture such as Amazon Bedrock Knowledge Bases.

    from langchain_community.vectorstores import FAISS

    vs = FAISS.from_documents(split_docs, embedding_model)
    search_results = vs.similarity_search(
        'What standards are used in the dataset?', k=2
    )
    print(search_results[0].page_content)
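Conceptually, a similarity search embeds the query, scores it against every stored chunk vector, and returns the closest chunks. The toy example below shows that idea with cosine similarity over hand-made three-dimensional vectors. The vectors and chunk texts are invented for illustration, and `cosine_similarity` and `top_k` are hypothetical helpers. FAISS itself uses optimized index structures and, depending on configuration, a different distance metric rather than this brute-force loop.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k stored chunks most similar to the query."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# Invented chunk texts and embeddings; real embeddings have far more dimensions.
store = [
    ("Popolo person specification", [0.9, 0.1, 0.0]),
    ("Popolo membership specification", [0.6, 0.7, 0.1]),
    ("Unrelated release notes", [0.0, 0.1, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], store, k=2))
```

The `k` argument plays the same role as `k=2` in the `similarity_search` call: it caps how many chunks are returned and therefore how much documentation ends up in the prompt.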
Next, include the catalog information together with the documentation to generate more accurate metadata:

    from operator import itemgetter
    from langchain_core.callbacks import BaseCallbackHandler
    from typing import Dict, List, Any

    class PromptHandler(BaseCallbackHandler):
        """Callback that prints the final prompt sent to the LLM."""
        def on_llm_start(self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any) -> Any:
            output = "\n".join(prompts)
            print(output)

    system = "You are a helpful assistant. You do not generate any harmful content."

    # Specify a user message
    user_msg_rag = """
    Here is the guidance document you should reference when answering the user:

    <documentation>{context}</documentation>
    I'd like you to create metadata descriptions for the table called {table} in your AWS Glue data catalog. Please follow these steps:
    1. Review the data catalog carefully.
    2. Use all the data catalog information and the documentation to generate the table description.
    3. If a column is a primary key or foreign key to another table mention it in the description.
    4. In your response, reply with the entire JSON object for the table {table}
    5. Remove the DatabaseName, CreatedBy, IsRegisteredWithLakeFormation, CatalogId,VersionId,IsMultiDialectView,CreateTime, UpdateTime.
    6. Write the table description in the Description attribute. Ensure you use any relevant information from the <documentation>
    7. List all the table columns under the attribute "StorageDescriptor" and then the attribute Columns. Add Location, InputFormat, and SerdeInfo
    8. For each column in the StorageDescriptor, add the attribute "Comment". If a table uses a composite primary key, then the order of a given column in a table's primary key is listed in parentheses following the column name.
    9. Your response must be a valid JSON object.
    10. Ensure that the data is accurately represented and properly formatted within the JSON structure. The resulting JSON table should provide a clear, structured overview of the information presented in the original text.
    11. If you cannot think of an accurate description of a column, say 'not available'
    <glue_data_catalog>
    {data_catalog}
    </glue_data_catalog>
    Here is some additional information about the database in <notes></notes> tags.
    <notes>
    Typically foreign key columns consist of the name of the table plus the id suffix
    </notes>
    """

    messages = [
        ("system", system),
        ("user", user_msg_rag),
    ]
    prompt = ChatPromptTemplate.from_messages(messages)

    # Retrieve and generate
    retriever = vs.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 3},
    )

    chain = (
        {"context": itemgetter("table") | retriever, "data_catalog": itemgetter("data_catalog"), "table": itemgetter("table")}
        | prompt
        | model
        | StrOutputParser()
    )

    TableInputFromLLM = chain.invoke({"data_catalog": glue_data_catalog, "table": table})
    print(TableInputFromLLM)

The following is the response from the LLM:

    {
      "Name": "persons",
      "Description": "This table contains information about individual persons, including their names, identifiers, contact details, and other personal information. It follows the Popolo data specification for representing persons involved in government and organizations. The 'person_id' column relates a person to an organization through the 'memberships' table.",
      "StorageDescriptor": {
        "Columns": [
          { "Name": "family_name", "Type": "string", "Comment": "The family or last name of the person." },
          { "Name": "name", "Type": "string", "Comment": "The full name of the person." },
          { "Name": "links", "Type": "array<struct<note:string,url:string>>", "Comment": "An array of links related to the person, with a note and URL for each link." },
          { "Name": "gender", "Type": "string", "Comment": "The gender of the person." },
          { "Name": "image", "Type": "string", "Comment": "A URL or path to an image representing the person." },
          { "Name": "identifiers", "Type": "array<struct<scheme:string,identifier:string>>", "Comment": "An array of identifiers for the person, with a scheme and identifier value for each." },
          { "Name": "other_names", "Type": "array<struct<lang:string,note:string,name:string>>", "Comment": "An array of other names the person may be known by, with language, note, and name for each." },
          { "Name": "sort_name", "Type": "string", "Comment": "The name to be used for sorting or alphabetical ordering of the person." },
          { "Name": "images", "Type": "array<struct<url:string>>", "Comment": "An array of URLs or paths to additional images representing the person." },
          { "Name": "given_name", "Type": "string", "Comment": "The given or first name of the person." },
          { "Name": "birth_date", "Type": "string", "Comment": "The date of birth of the person." },
          { "Name": "id", "Type": "string", "Comment": "The unique identifier for the person. This is likely a primary key." },
          { "Name": "contact_details", "Type": "array<struct<type:string,value:string>>", "Comment": "An array of contact details for the person, with a type and value for each." },
          { "Name": "death_date", "Type": "string", "Comment": "The date of death of the person, if applicable." }
        ],
        "Location": "s3://<your-s3-bucket>/persons/",
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "SerdeInfo": {
          "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
        }
      }
    }

As with the first approach, you can validate the output to confirm that it conforms to the AWS Glue API.

Update the Data Catalog with new metadata

Now that the metadata has been generated, you can update the Data Catalog:

    response = glue_client.update_table(DatabaseName=database, TableInput=json.loads(TableInputFromLLM))
    print(f"Table {table} metadata updated!")

Review the generated technical metadata. A new version should now appear in the Data Catalog for the persons table. You can access schema versions from the AWS Glue console.
Now check the description of the persons table. It should differ slightly from the description entered previously:

- Table description with in-context learning: "This table contains information about persons, including their names, identifiers, contact details, birth and death dates, and associated images and links. The 'id' column is the primary key for this table."
- Table description with RAG: "This table contains information about individual persons, including their names, identifiers, contact details, and other personal information. It follows the Popolo data specification for representing persons involved in government and organizations. The 'person_id' column relates a person to an organization through the 'memberships' table."

In the RAG description, the LLM demonstrated knowledge of the Popolo specification, which was part of the documentation provided to it.

Clean up

You have now completed the steps presented in this post. To avoid incurring unnecessary costs, remember to delete the resources by using the code provided in the Clean Up section of the notebook.

Conclusion

In this post, we explored how to use generative AI, specifically Amazon Bedrock FMs, to enrich the Data Catalog with dynamic metadata and improve data discovery and understanding of existing data assets. The two approaches we demonstrated, in-context learning and RAG, show the flexibility and versatility of this solution. In-context learning works well for AWS Glue databases with a small number of tables, whereas the RAG approach uses external documentation to generate more accurate and detailed metadata, which makes it suitable for larger and more complex data landscapes. By adopting this solution, your organization can make more data-informed decisions, drive data-driven innovation, and get more value from its data. We encourage you to explore the resources and recommendations in this post to further strengthen your data management practices.

About the authors

Manos Samatas is a Principal Solutions Architect in Data and AI at AWS. He works with government, non-profit, education, and healthcare customers in the UK on data and AI projects, helping them build solutions using AWS. He lives in London and, in his spare time, enjoys reading, watching sports, playing video games, and socializing with friends.

Anastasia Tzeveleka is a Senior GenAI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers across EMEA build foundation models (FMs) and create scalable generative AI and machine learning solutions using AWS services.