AWS Resilience Hub is a tool for centrally managing and improving the resilience of your applications from the AWS Management Console. With AWS Resilience Hub you can define resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework.

AWS Resilience Hub provides both resiliency recommendations and operational recommendations. The operational recommendations include Amazon CloudWatch alarms, standard operating procedures (SOPs) based on AWS Systems Manager documents, and chaos experiments that use AWS Fault Injection Service (FIS).

An SOP is a prescriptive set of steps designed to recover your application efficiently when a service is interrupted or an alarm fires. One common anti-pattern defined in the Reliability Pillar of the AWS Well-Architected Framework is having no SOP for operators to follow when they receive an alert. Automating the response to alarms improves system resilience through automatic corrective actions, execution of predefined SOPs, and fewer error-prone manual steps. AWS Resilience Hub provides customizable templates for defining your own SOPs.

This blog post explains how to automate and test the execution of SOPs in response to events and incidents, based on the operational recommendation templates from AWS Resilience Hub. Built into a CI/CD pipeline, this lets you continuously test whether failures can be detected and recovered from. To reproduce a situation that requires an SOP to run, you can use the chaos engineering techniques of AWS FIS. AWS FIS lets you run experiments that are clearly scoped and that have safety mechanisms to roll back if unexpected behavior occurs.

Prerequisites

The example used in this blog post has a few prerequisites:

- A workload architecture that includes EC2 instances in an AWS Auto Scaling group (see Figure 1)
- AWS Cloud Development Kit (AWS CDK); see Getting started with the AWS CDK
- AWS Resilience Hub, used to define and assess the workload architecture deployed in your AWS account. For details on enabling AWS Resilience Hub, see this blog post

Architecture

Figure 1 – Sample architecture used for the experiments in this blog post

Workflow

Figure 2 – Workflow from the user starting the experiment until the SOP runs automatically and the alarm is remediated

Automation solution

AWS Resilience Hub provides recommendations for alarms, SOPs, and FIS experiments. Testing whether these operational recommendations have been implemented correctly is your responsibility. For details on the shared responsibility model in AWS Resilience Hub, see the blog post Shared Responsibility with AWS Resilience Hub.

We recommend automating the recovery of critical resources. This blog post describes an Amazon EventBridge automation that runs an implemented SOP when a specific alarm state is reached, and uses an FIS experiment to test that automation.

Chaos engineering is an advanced mode of resilience experimentation that includes automated experiments in a continuous resilience pipeline. An important principle is to "fail fast": discover and address resilience issues as early as possible, before they occur in production. Integrating chaos experiments into a continuous resilience workflow enables a proactive, iterative approach to resilience experimentation and makes resilience an integral part of the development process.

As shown in Figure 1, the architecture has an application running on Amazon Elastic Compute Cloud (Amazon EC2) in an Auto Scaling group (ASG), with Amazon Relational Database Service (RDS) as the backend.

This example automates the response to high CPU utilization. Consider a use case where this helps: an e-commerce web application has its web server ASG configured with min/desired = 1 and max = 2, and its scaling policy is based on average CPU utilization. When user requests surge, for example during a high-season event, the ASG reaches its maximum capacity of 2, but that is not enough if new users still cannot connect to the application.

There is a time gap before the on-call team investigates the problem and decides to change the ASG maximum. During that gap, connections from new users are dropped, which affects the finances and reputation of the business. Automating this mechanism with an SOP closes the gap, and because an alarm is triggered, the customer's team still becomes aware that further investigation is needed.

Implementing the operational recommendations

We implement automation for all three areas of the AWS Resilience Hub operational recommendations:

- Two alarms
  - AWSResilienceHub-SyntheticCanaryInRegionAlarm_2021-04-01
  - AWSResilienceHub-AsgHighCpuUtilizationAlarm_2020-07-13
- One SOP
  - AWSResilienceHub-ScaleOutAsgSOP_2020-07-01
- One FIS experiment
  - AWSResilienceHub-InjectCpuLoadInAsgTest_2021-09-22

For details on implementing the operational recommendations, see the blog post Measure and Improve Your Application Resilience with AWS Resilience Hub.

To automate with Amazon EventBridge, we created the following AWS CloudFormation template to provision the required resources. With this automation, the SOP "AWSResilienceHub-ScaleOutAsgSOP_2020-07-01" starts when the composite alarm "AsgMaxCapacityReachedAndAsgHighCPUAlarm" is triggered and enters the "In alarm" state.

    AWSTemplateFormatVersion: '2010-09-09'
    Description: CloudFormation template for EventBridge rule 'arh-alarm-asg-cpu-triggered'
    Parameters:
      AlarmTriggerArn:
        Type: String
        Description: Arn of the Alarm that will trigger this Event
      SSMTemplateAssumeRole:
        Type: String
        Description: An ARN of the role that SSM is going to assume
      SSMTemplateASGName:
        Type: String
        Description: Auto scaling group name (for the SSM Template)
    Resources:
      # Allows EventBridge to start the SOP automation and pass the SSM role to it
      AmazonEventBridgeInvokeStartAutomationExecutionPolicy:
        Type: AWS::IAM::ManagedPolicy
        Properties:
          Description: Policy for the Amazon EventBridge Invoke Start Automation Execution
          ManagedPolicyName: !Join ['-', ['AWSResilienceHub-EventBridge_Automation_Policy', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
          Path: '/service-role/'
          PolicyDocument: !Sub '{
            "Version": "2012-10-17",
            "Statement": [
              {
                "Action": "ssm:StartAutomationExecution",
                "Effect": "Allow",
                "Resource": [
                  "arn:${AWS::Partition}:ssm:${AWS::Region}:*:automation-definition/AWSResilienceHub-ScaleOutAsgSOP_2020-07-01:$DEFAULT"
                ]
              },
              {
                "Effect": "Allow",
                "Action": [ "iam:PassRole" ],
                "Resource": "${SSMTemplateAssumeRole}",
                "Condition": {
                  "StringLikeIfExists": { "iam:PassedToService": "ssm.amazonaws.com" }
                }
              }
            ]
            }'
      AmazonEventBridgeInvokeStartAutomationExecution:
        Type: AWS::IAM::Role
        Properties:
          RoleName: !Join ['-', ['AWSResilienceHub-EventBridge_Automation', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
          Description: Amazon EventBridge Invoke Start Automation Execution Role
          AssumeRolePolicyDocument:
            Statement:
              - Action: sts:AssumeRole
                Effect: Allow
                Principal:
                  Service: events.amazonaws.com
            Version: "2012-10-17"
          MaxSessionDuration: 3600
          Path: '/service-role/'
          ManagedPolicyArns:
            - !Ref AmazonEventBridgeInvokeStartAutomationExecutionPolicy
      # Starts the SOP when the composite alarm enters the ALARM state
      EventRuleArhSop:
        Type: AWS::Events::Rule
        Properties:
          EventBusName: default
          EventPattern:
            source:
              - aws.cloudwatch
            detail-type:
              - CloudWatch Alarm State Change
            detail:
              alarmName:
                - !Ref CloudWatchCompositeAlarmAsgMaxCapacityReachedAndAsgHighCPUAlarm
              state:
                value:
                  - ALARM
          Name: !Join ['-', ['arh-alarm-asg-cpu-automation', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
          State: ENABLED
          Targets:
            - Id: Id5b81de31-a5ef-42e2-90de-1fc8348b3229
              Arn: !Sub "arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/AWSResilienceHub-ScaleOutAsgSOP_2020-07-01"
              RoleArn: !GetAtt AmazonEventBridgeInvokeStartAutomationExecution.Arn
              Input: !Sub '{"Dryrun":["false"],"AutoScalingGroupName":["${SSMTemplateASGName}"],"AutomationAssumeRole":["${SSMTemplateAssumeRole}"]}'
      # In alarm when the number of in-service instances has reached the group's maximum size
      CloudWatchAlarmAsgMaxCapacityReached:
        UpdateReplacePolicy: "Retain"
        Type: "AWS::CloudWatch::Alarm"
        Properties:
          ComparisonOperator: "GreaterThanThreshold"
          TreatMissingData: "missing"
          ActionsEnabled: true
          Metrics:
            - Label: "AsgMaxCapacityReached"
              Id: "e1"
              ReturnData: true
              Expression: "IF(m1 >= m2, 1, 0)"
            - ReturnData: false
              MetricStat:
                Period: 120
                Metric:
                  MetricName: "GroupInServiceInstances"
                  Dimensions:
                    - Value: !Ref SSMTemplateASGName
                      Name: "AutoScalingGroupName"
                  Namespace: "AWS/AutoScaling"
                Stat: "Average"
              Id: "m1"
            - ReturnData: false
              MetricStat:
                Period: 120
                Metric:
                  MetricName: "GroupMaxSize"
                  Dimensions:
                    - Value: !Ref SSMTemplateASGName
                      Name: "AutoScalingGroupName"
                  Namespace: "AWS/AutoScaling"
                Stat: "Average"
              Id: "m2"
          AlarmName: !Join ['-', ['ARH-AsgMaxCapacityReached', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
          EvaluationPeriods: 1
          DatapointsToAlarm: 1
          Threshold: 0
      # Combines "maximum capacity reached" with the high CPU alarm passed in as a parameter
      CloudWatchCompositeAlarmAsgMaxCapacityReachedAndAsgHighCPUAlarm:
        UpdateReplacePolicy: "Retain"
        Type: "AWS::CloudWatch::CompositeAlarm"
        Properties:
          ActionsEnabled: true
          AlarmName: !Join ['-', ['ARH-AsgMaxCapacityReachedAndAsgHighCPUAlarm', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
          AlarmRule: !Sub 'ALARM("${CloudWatchAlarmAsgMaxCapacityReached}") AND ALARM("${AlarmTriggerArn}")'

CodeBlock 1 – AWS CloudFormation stack to set up the automation with Amazon EventBridge

To recover quickly when operations are disrupted, SOPs must be prepared, tested, and evaluated in advance.
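
Before running an experiment, it can help to reason about what the template in CodeBlock 1 does. The TypeScript sketch below models its moving parts as pure functions: the metric math alarm (IF(m1 >= m2, 1, 0) compared against a threshold of 0), the AND rule of the composite alarm, the EventBridge event pattern, and the nested !Select/!Split expression that derives a name suffix from the stack ID. All function and type names are hypothetical and not part of any AWS API; the real evaluation is performed by CloudWatch, EventBridge, and CloudFormation.

```typescript
// Sketch only: restates the logic of CodeBlock 1 as pure functions.
// Every name below is hypothetical; nothing here calls AWS.

/** Metric alarm: IF(m1 >= m2, 1, 0) with GreaterThanThreshold and Threshold 0. */
function asgMaxCapacityReached(inServiceInstances: number, groupMaxSize: number): boolean {
  const expressionValue = inServiceInstances >= groupMaxSize ? 1 : 0;
  return expressionValue > 0;
}

/** Composite alarm rule: ALARM(max capacity reached) AND ALARM(high CPU). */
function compositeAlarmInAlarm(maxCapacityReached: boolean, highCpuInAlarm: boolean): boolean {
  return maxCapacityReached && highCpuInAlarm;
}

/** The subset of a CloudWatch alarm state change event that the rule inspects. */
interface AlarmStateChangeEvent {
  source: string;
  "detail-type": string;
  detail: { alarmName: string; state: { value: string } };
}

/** Mirrors the EventPattern: source, detail-type, alarm name, and the ALARM state. */
function ruleMatches(event: AlarmStateChangeEvent, compositeAlarmName: string): boolean {
  return (
    event.source === "aws.cloudwatch" &&
    event["detail-type"] === "CloudWatch Alarm State Change" &&
    event.detail.alarmName === compositeAlarmName &&
    event.detail.state.value === "ALARM"
  );
}

/** Mirrors !Select [4, !Split ['-', !Select [2, !Split ['/', StackId]]]]. */
function stackIdSuffix(stackId: string): string {
  const uuid = stackId.split("/")[2] ?? ""; // third path segment is the stack UUID
  return uuid.split("-")[4] ?? "";          // fifth dash-separated group of the UUID
}
```

With the use case described earlier (maximum capacity 2), asgMaxCapacityReached(2, 2) is true, so the composite alarm only needs the high CPU alarm to fire as well before the rule starts the SOP. With one instance in service, the rule stays silent and normal dynamic scaling handles the load.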

FIS experiments are useful for testing SOPs ahead of time. This scenario uses the SOP recommended by AWS Resilience Hub, and the FIS experiment recommended by AWS Resilience Hub lets us validate our hypothesis about what happens when that SOP runs. In this use case, we test the automatic execution of the SOP, invoked through an Amazon EventBridge rule, when the Auto Scaling capacity reaches its maximum.

Hypothesis

Thanks to EC2 Auto Scaling and the SOP automation, we expect that overall high CPU utilization on the EC2 instances will not adversely affect application performance. The web application remains accessible, and customers experience little to no interruption of service.

The experiment confirms that the alarms, the SOP, the FIS experiment, and the Amazon EventBridge rule are all implemented and automated. Based on the hypothesis, the experiment should show the following:

- The FIS experiment injects CPU load into the Auto Scaling group
- The CloudWatch alarm "AWSResilienceHub-AsgHighCpuUtilizationAlarm" changes to "In alarm"
- Auto Scaling launches a new instance to handle the load
- The FIS experiment injects CPU load into the Auto Scaling group a second time
- Amazon EventBridge processes the event and starts the SOP "AWSResilienceHub-ScaleOutAsgSOP_2020-07-01"
- The SOP scales out the Auto Scaling group and adds EC2 instances
- Both the experiment and the SOP complete successfully

Pre-check

First, check the Auto Scaling group values and the number of instances running the application in the EC2 section of the AWS Management Console.

Figure 3 – Original capacity values of the Auto Scaling group

Figure 4 – Originally, one EC2 instance is running

Running the experiment

To test the hypothesis above, we start the AWS Fault Injection Service (FIS) experiment "AWSResilienceHub-InjectCpuLoadInAsgTest_2021-09-22" recommended by AWS Resilience Hub.

Figure 5 – Running the FIS experiment

In the CloudWatch console, the alarm "AWSResilienceHub-AsgHighCpuUtilizationAlarm" transitions to the "In alarm" state, which indicates that CPU utilization exceeded the configured threshold. This triggers dynamic scaling of the Auto Scaling group, and two instances are now running in the group.

Figure 6 – CloudWatch alarm state change

Figure 7 – Two running EC2 instances

Figure 8 – New values of the Auto Scaling group (ASG)

The experiment ends with two instances running and the alarm in the "OK" state.

When we run the same experiment again, the CloudWatch console shows the CloudWatch alarm transitioning to "In alarm", which indicates that CPU utilization exceeds the configured threshold. In addition, the second alarm, "ARH-AsgMaxCapacityReached", is in the "In alarm" state, which indicates that the Auto Scaling group has reached its maximum capacity. This lets us check whether the Amazon EventBridge rule runs correctly. The rule is based on the composite alarm that combines the two alarms above (also shown in Figure 9).

Figure 9 – CloudWatch alarm state changes (second experiment)

Figure 10 – The Amazon EventBridge rule is triggered correctly

Validating the results

The Monitoring tab of the Amazon EventBridge console confirms that the rule was triggered and that the invocation succeeded. This should automatically run the AWSResilienceHub-ScaleOutAsgSOP_2020-07-01 SOP as the target.

The Automation capability of Systems Manager (SSM) shows that the SOP completed successfully. Without the Amazon EventBridge automation, recovering from the FIS high CPU experiment would require running this SOP manually.

Figure 11 – The SOP ran successfully after the second FIS experiment

Next, check whether new values were set on the Auto Scaling group itself, and how many EC2 instances are currently running.

Figure 12 – New ASG capacity values

Figure 13 – An EC2 instance was added, for a total of three

As you can see, the Desired capacity and Maximum capacity values of the Auto Scaling group have increased. As expected, the Auto Scaling group added an instance to the application. This is also visible in the Auto Scaling group events: the additions were made by the Auto Scaling group alarm and by the SOP.

Figure 14 – Auto Scaling group events

The CloudWatch alarm history shows which actions and state changes occurred. It is important to confirm that the state moved from "OK" to "In alarm" as expected, and that running the SOP returned the alarm to "OK".

Figure 15 – CloudWatch alarm state changing from "OK" to "In alarm" and back to "OK" during the experiment (left), and the number of instances and the maximum capacity (right)

Back in FIS, confirm that the experiment completed successfully. The experiment has finished, and the hypothesis is fully validated.

Figure 16 – The completed experiment shown in AWS Resilience Hub

Verification

We can now verify the results against the original hypothesis:

- The FIS experiment injects CPU load into the ASG
  The FIS experiment ran successfully (Figure 11). The Amazon CloudWatch alarm was triggered and its state changed (Figure 15).
- The CloudWatch alarm "ASGHighCPUUtilization" changes to "In alarm"
  The Amazon CloudWatch alarm was triggered and its state changed (Figure 15).
- Amazon EventBridge processes the event and starts the SOP "ScaleOutAsg"
  The Amazon EventBridge rule ran (Figure 10).
- The SOP scales out the Auto Scaling group and adds EC2 instances when the maximum capacity is reached
  In line with the hypothesis, we confirmed both the automation implemented by the AWS CloudFormation stack that creates the Amazon EventBridge rule, and that the SOP made the required changes successfully. The automated SOP execution is visible in the SSM document that completed without manual intervention (Figure 11). The Auto Scaling group and the number of EC2 instances show the expected results (Figures 12, 13, and 14).
- Both the experiment and the SOP complete successfully
  The SOP and the FIS experiment both completed (Figure 16 and Figure 11).

Running in a CI/CD pipeline

If you want to run this in a CI/CD pipeline, you can create an AWS Step Functions state machine that orchestrates all of the above. The state machine diagram is shown below.

Figure 17 – State machine

- First, create the automation described above.
- Next, wait until the automation is deployed.
- Once the deployment succeeds, start the FIS experiment.
- The Amazon EventBridge automation starts the SOP based on the alarm and the maximum capacity of the Auto Scaling group, which mitigates the problem.
- If an error occurs, an Amazon Simple Notification Service (SNS) message is sent and the workflow fails.
- If the test finishes without errors, success is reported.

The following AWS Cloud Development Kit (AWS CDK) code creates this AWS Step Functions state machine.

    import * as cdk from 'aws-cdk-lib';
    import * as iam from 'aws-cdk-lib/aws-iam';
    import * as stepfunctions from 'aws-cdk-lib/aws-stepfunctions';

    export interface ArhBlogTestImportStackProps extends cdk.StackProps {
    }

    // Builds one inline policy that allows the given actions on all resources.
    const allow = (policyName: string, actions: string[]) => ({
      policyName,
      policyDocument: {
        Version: '2012-10-17',
        Statement: [{ Resource: '*', Action: actions, Effect: 'Allow' }],
      },
    });

    // Pieces of the state machine definition that are used more than once.
    const stackNameFromStackId = "States.ArrayGetItem(States.StringSplit($.StackId, '/'), 1)";
    const catchAll = [{ ErrorEquals: ['States.ALL'], Next: 'DeleteAutomationStackOnFail' }];
    const experimentStatus = '$.Result.Experiment.State.Status';

    // Task that starts the FIS experiment named in the execution input.
    const startExperiment = (next: string) => ({
      Type: 'Task',
      Next: next,
      Parameters: {
        'ClientToken.$': 'States.UUID()',
        'ExperimentTemplateId.$': '$$.Execution.Input.input.ExperimentTemplateId',
      },
      Resource: 'arn:aws:states:::aws-sdk:fis:startExperiment',
      ResultPath: '$.Result',
    });

    // Task that reads the current state of the running experiment.
    const getExperiment = (next: string) => ({
      Type: 'Task',
      Next: next,
      Parameters: { 'Id.$': '$.Result.Experiment.Id' },
      Resource: 'arn:aws:states:::aws-sdk:fis:getExperiment',
      ResultPath: '$.Result',
    });

    // Choice that keeps polling while the experiment runs and reports any other outcome via SNS.
    const experimentStatusChoice = (waitState: string, onCompleted: string) => ({
      Type: 'Choice',
      Choices: [
        {
          Or: ['pending', 'initiating', 'running'].map(
            (status) => ({ Variable: experimentStatus, StringEquals: status })),
          Next: waitState,
        },
        { Variable: experimentStatus, StringEquals: 'completed', Next: onCompleted },
      ],
      Default: 'SNSPublishOnError',
    });

    const definition = {
      Comment: 'A description of my state machine',
      StartAt: 'CreateAutomationStack',
      States: {
        CreateAutomationStack: {
          Type: 'Task',
          Parameters: {
            StackName: 'arh-blog-automation',
            'TemplateURL.$': '$.input.S3UrlToCloudformationStack',
            Capabilities: ['CAPABILITY_NAMED_IAM', 'CAPABILITY_AUTO_EXPAND'],
            Parameters: [
              { ParameterKey: 'AlarmTriggerArn', 'ParameterValue.$': '$.input.AlarmTriggerArn' },
              { ParameterKey: 'SSMTemplateAssumeRole', 'ParameterValue.$': '$.input.SSMTemplateAssumeRole' },
              { ParameterKey: 'SSMTemplateASGName', 'ParameterValue.$': '$.input.SSMTemplateASGName' },
            ],
          },
          Resource: 'arn:aws:states:::aws-sdk:cloudformation:createStack',
          Next: 'WaitForStackToBeReady',
          Catch: catchAll,
        },
        WaitForStackToBeReady: { Type: 'Wait', Seconds: 5, Next: 'DescribeStacks' },
        DescribeStacks: {
          Type: 'Task',
          Next: 'StackDeploymentStatus',
          Parameters: { 'StackName.$': stackNameFromStackId },
          Resource: 'arn:aws:states:::aws-sdk:cloudformation:describeStacks',
          OutputPath: '$.Stacks[0]',
          Catch: catchAll,
        },
        StackDeploymentStatus: {
          Type: 'Choice',
          Choices: [
            {
              Or: [
                { Variable: '$.StackStatus', StringEquals: 'REVIEW_IN_PROGRESS' },
                { Variable: '$.StackStatus', StringEquals: 'CREATE_IN_PROGRESS' },
              ],
              Next: 'WaitForStackToBeReady',
            },
            { Variable: '$.StackStatus', StringEquals: 'CREATE_COMPLETE', Next: 'StartExperiment' },
          ],
          Default: 'DeleteAutomationStackOnFail',
        },
        // First run: dynamic scaling takes the group to its maximum capacity.
        StartExperiment: startExperiment('WaitForExperimentToFinish'),
        WaitForExperimentToFinish: { Type: 'Wait', Seconds: 5, Next: 'GetExperiment' },
        GetExperiment: getExperiment('ExperimentStatus'),
        ExperimentStatus: experimentStatusChoice('WaitForExperimentToFinish', 'Wait'),
        Wait: { Type: 'Wait', Seconds: 20, Next: 'StartExperimentAgain' },
        // Second run: the composite alarm fires and EventBridge starts the SOP.
        StartExperimentAgain: startExperiment('WaitForExperimentToFinishAgain'),
        WaitForExperimentToFinishAgain: { Type: 'Wait', Seconds: 5, Next: 'GetExperimentAgain' },
        GetExperimentAgain: getExperiment('ExperimentStatusAgain'),
        ExperimentStatusAgain: experimentStatusChoice('WaitForExperimentToFinishAgain', 'DeleteAutomationStack'),
        SNSPublishOnError: {
          Type: 'Task',
          Resource: 'arn:aws:states:::sns:publish',
          Parameters: { 'TopicArn.$': '$$.Execution.Input.input.SnsTopic', 'Message.$': '$' },
          Next: 'DeleteAutomationStackOnFail',
        },
        DeleteAutomationStackOnFail: {
          Type: 'Task',
          Parameters: { StackName: 'arh-blog-automation' },
          Resource: 'arn:aws:states:::aws-sdk:cloudformation:deleteStack',
          Next: 'Fail',
        },
        Fail: { Type: 'Fail' },
        DeleteAutomationStack: {
          Type: 'Task',
          Parameters: { 'StackName.$': stackNameFromStackId },
          Resource: 'arn:aws:states:::aws-sdk:cloudformation:deleteStack',
          Next: 'Success',
        },
        Success: { Type: 'Succeed' },
      },
    };

    export class ArhBlogTestImportStack extends cdk.Stack {
      public constructor(scope: cdk.App, id: string, props: ArhBlogTestImportStackProps = {}) {
        super(scope, id, props);

        // Role assumed by the state machine.
        const iamRoleStepFunctionsRole = new iam.CfnRole(this, 'StepFunctionsRole', {
          path: '/service-role/',
          maxSessionDuration: 3600,
          roleName: 'arh-blog-StepFunctions-role-' + id,
          policies: [
            allow('cloudformation-permissions', [
              'cloudformation:CreateStack',
              'cloudformation:DeleteStack',
              'cloudformation:DescribeStacks',
            ]),
            // The CloudFormation actions are repeated here as in the original code.
            allow('cloudwatch-permissions', [
              'cloudformation:CreateStack',
              'cloudformation:DeleteStack',
              'cloudformation:DescribeStacks',
              'cloudwatch:DescribeAlarms',
            ]),
            allow('eventbridge-permissions', [
              'events:DescribeRule',
              'events:DeleteRule',
              'events:PutRule',
              'events:PutTargets',
              'events:RemoveTargets',
            ]),
            allow('fis-permissions', ['fis:StartExperiment', 'fis:GetExperiment']),
            allow('iam-permissions', [
              'iam:CreatePolicy',
              'iam:GetRole',
              'iam:DetachRolePolicy',
              'iam:GetPolicy',
              'iam:CreateRole',
              'iam:DeleteRole',
              'iam:AttachRolePolicy',
              'iam:PutRolePolicy',
              'iam:PassRole',
              'iam:ListPolicyVersions',
              'iam:DeletePolicy',
            ]),
            allow('s3-permissions', ['s3:GetObject']),
            allow('sns-permissions', ['sns:Publish']),
          ],
          assumeRolePolicyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Action: 'sts:AssumeRole',
                Effect: 'Allow',
                Principal: { Service: 'states.amazonaws.com' },
              },
            ],
          },
        });
        iamRoleStepFunctionsRole.cfnOptions.deletionPolicy = cdk.CfnDeletionPolicy.RETAIN;

        const stateMachine = new stepfunctions.CfnStateMachine(this, 'StepFunctionsStateMachine', {
          definitionString: JSON.stringify(definition),
          loggingConfiguration: {
            includeExecutionData: false,
            level: 'OFF',
          },
          stateMachineName: 'arh-blog-statemachine-' + id,
          roleArn: iamRoleStepFunctionsRole.attrArn,
          tags: [],
          stateMachineType: 'STANDARD',
          tracingConfiguration: {
            enabled: false,
          },
        });
        stateMachine.cfnOptions.deletionPolicy = cdk.CfnDeletionPolicy.RETAIN;
      }
    }

CodeBlock 2 – AWS CDK stack that creates the AWS Step Functions state machine

To run the state machine created by the AWS CDK code above, you need to define a few inputs:

- AlarmTriggerArn – ARN of the "AsgHighCpuUtilizationAlarm" alarm recommended by Resilience Hub
- SSMTemplateAssumeRole – ARN of the "AWSResilienceHubAsgScaleOutAssumeRole" role created by the SOP
- SSMTemplateASGName – name of the Auto Scaling group (not its ARN)
- ExperimentTemplateId – ID of the FIS experiment to run (in this case AsgScaleOut)
- SnsTopic – SNS topic that receives a message if the experiment fails
- S3UrlToCloudformationStack – URL of the CloudFormation file in an Amazon Simple Storage Service (S3) bucket. The AWS CloudFormation template from CodeBlock 1 must be stored in an S3 folder
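
The inputs listed above can be checked before an execution is started. The following is a minimal sketch, assuming the values are wrapped in a top-level "input" object, which is what the $.input.* paths in the state machine definition read. The helper names are hypothetical and the check is not part of the original solution.

```typescript
// Sketch only: a pre-flight check for the state machine input. Hypothetical helper names.

const REQUIRED_INPUT_KEYS = [
  "AlarmTriggerArn",
  "SSMTemplateAssumeRole",
  "SSMTemplateASGName",
  "ExperimentTemplateId",
  "SnsTopic",
  "S3UrlToCloudformationStack",
] as const;

type InputKey = (typeof REQUIRED_INPUT_KEYS)[number];

/** Execution input as the state machine reads it: every value sits under "input". */
type ExecutionInput = { input: Partial<Record<InputKey, string>> };

/** Returns the keys that are absent or blank; an empty array means the input is complete. */
function missingInputKeys(execInput: ExecutionInput): InputKey[] {
  return REQUIRED_INPUT_KEYS.filter((key) => {
    const value = execInput.input[key];
    return value === undefined || value.trim() === "";
  });
}

/** SSMTemplateASGName must be a plain name, so a value that starts with "arn:" is suspicious. */
function looksLikeArn(value: string): boolean {
  return value.indexOf("arn:") === 0;
}
```

A pipeline step could call missingInputKeys before starting the execution and stop early if the result is not empty, instead of waiting for the CreateAutomationStack task to fail.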

As an example, the input looks like the following. Update it so that the CDK code works correctly in your environment.

    {
      "input": {
        "AlarmTriggerArn": "arn:aws:cloudwatch:<region>:<accountid>:alarm:AWSResilienceHub-AsgHighCpuUtilizationAlarm-2020-07-13_arh-demo_arh-lab-workload-AutoScalingGroup-oYSKLDR6Vg21",
        "SSMTemplateAssumeRole": "arn:aws:iam::<accountid>:role/arh-sop-AWSResilienceHubAsgScaleOutAssumeRole-qWqL13hCgexP",
        "SSMTemplateASGName": "arh-lab-workload-AutoScalingGroup-oYSKLDR6Vg21",
        "ExperimentTemplateId": "EXT9Au6P89tSQXa",
        "SnsTopic": "arn of the topics",
        "S3UrlToCloudformationStack": "https://<bucketname>.s3.<region>.amazonaws.com/arh-eventbridge.yml"
      }
    }

CodeBlock 3 – AWS CDK input (needs to be updated)
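
To make the control flow of the state machine from CodeBlock 2 easier to follow, the sketch below restates its two kinds of Choice state and its stack name expression as plain TypeScript functions. The state names and status strings are taken from the definition; the function names are hypothetical, and nothing here calls AWS or replaces the state machine.

```typescript
// Sketch only: the routing decisions of the state machine in CodeBlock 2 as pure functions.

/** StackDeploymentStatus: poll while the stack is being created, clean up on any other failure. */
function nextAfterStackStatus(stackStatus: string): string {
  if (stackStatus === "REVIEW_IN_PROGRESS" || stackStatus === "CREATE_IN_PROGRESS") {
    return "WaitForStackToBeReady";
  }
  return stackStatus === "CREATE_COMPLETE" ? "StartExperiment" : "DeleteAutomationStackOnFail";
}

/**
 * ExperimentStatus / ExperimentStatusAgain: poll while the experiment is in progress.
 * After the first run the workflow waits and starts the experiment again; after the
 * second run it deletes the automation stack. Any other status is reported through SNS.
 */
function nextAfterExperimentStatus(status: string, isSecondRun: boolean): string {
  if (status === "pending" || status === "initiating" || status === "running") {
    return isSecondRun ? "WaitForExperimentToFinishAgain" : "WaitForExperimentToFinish";
  }
  if (status === "completed") {
    return isSecondRun ? "DeleteAutomationStack" : "Wait";
  }
  return "SNSPublishOnError";
}

/** Equivalent of States.ArrayGetItem(States.StringSplit($.StackId, '/'), 1). */
function stackNameFromStackId(stackId: string): string {
  return stackId.split("/")[1] ?? "";
}
```

The two experiment runs correspond to the manual walkthrough above: the first run drives the group to its maximum capacity through dynamic scaling, and the second run is the one expected to trigger the composite alarm and the SOP.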

Now that the AWS Step Functions state machine is created, you can integrate it into a pipeline. The blog post "Continually assessing application resilience with AWS Resilience Hub and AWS CodePipeline" explains how to trigger AWS Step Functions from AWS CodePipeline.

Conclusion

Automating the response to well-understood, well-defined events lets engineers focus on more productive tasks. It also helps you reach your resilience goals, for example by improving mean time to recovery (MTTR) and by preventing on-call work from exhausting engineering resources.

Evaluate the scope and duration of your chaos pipeline against your release frequency and the deployment interval of your CI/CD pipeline. In general, Fault Injection Service experiments need a longer runtime to ensure sufficient interactions, data, and experiment conditions on the workload. To avoid slowing down developers, run these experiments in a later stage of the CI/CD pipeline or in their own dedicated pipeline. Whether you use your regular deployment CI/CD pipeline or a dedicated "chaos pipeline", the AWS Resilience Hub recommendations are a useful starting point.

This article is a translation of "Automate Standard Operating Procedures (SOPs) execution with AWS Resilience Hub", published on the AWS Cloud Operations & Migrations Blog on August 22, 2024. The translation was done by Solutions Architect 三好史隆.