Today, we are excited to announce AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. The Llama 3.1 family is a collection of pre-trained and instruction-tuned multilingual large language models (LLMs) in 8B (8 billion), 70B (70 billion), and 405B (405 billion) parameter sizes. In a previous post, we covered how to deploy Llama 3 models on AWS Trainium and Inferentia based instances in Amazon SageMaker JumpStart. In this post, we outline how to fine-tune and deploy the Llama 3.1 family of models on AWS AI chips and benefit from their price performance.

Overview of Llama 3.1 models

The Llama 3.1 family of multilingual LLMs is a collection of pre-trained and instruction-tuned generative models in 8B, 70B, and 405B sizes (text in, text and code out). All models support a long context length (128k) and grouped query attention (GQA), which speeds up inference.

The Llama 3.1 instruction-tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many publicly available chat models on common industry benchmarks. They are trained to generate tool calls for specific tools covering search, image generation, code execution, and mathematical reasoning. They also support zero-shot tool use.

According to Meta, Llama 3.1 405B is the world's largest publicly available LLM. The model sets a new standard for artificial intelligence (AI) and is well suited to enterprise-level applications and research and development. It fits tasks such as synthetic data generation, where the model's outputs are used to improve smaller Llama models after fine-tuning, and model distillation, where knowledge is transferred from the 405B model to smaller models. The model excels at general knowledge, long-form text generation, multilingual translation, machine translation, coding, math, tool use, enhanced contextual understanding, and advanced reasoning and decision-making.

Architecturally, the core LLM of Llama 3 and Llama 3.1 is the same dense architecture: an auto-regressive language model that uses an optimized transformer architecture. The fine-tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Meta's Responsible Use Guide can help you carry out any additional fine-tuning needed to customize and optimize the models, together with appropriate safety mitigations.

AWS Trainium powers Llama 3.1 on Amazon Bedrock and Amazon SageMaker

The fastest way to get started with Llama 3.1 on AWS is Amazon Bedrock, which runs on purpose-built AI infrastructure that includes AWS Trainium. Through its fully managed API, Amazon Bedrock delivers the benefits of that purpose-built infrastructure and simplifies access to these powerful models, so you can focus on building differentiated AI applications.

If you need finer control over the underlying resources, you can fine-tune and deploy Llama 3.1 models with SageMaker. Trainium support for Llama 3.1 in SageMaker JumpStart is coming soon.

AWS Trainium and AWS Inferentia2 deliver high performance and low cost for Llama 3.1 models

If you are building your own ML pipelines for training and inference and want greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let's look at how to get started with the new Llama 3.1 8B/70B models using the AWS Neuron SDK.

Fine-tune Llama 3.1 on Trainium

To start fine-tuning Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library. NeuronX Distributed provides implementations of the more popular distributed training and inference techniques. To start fine-tuning, you can use the following samples:

- Training Llama 3.1 8B
- Training Llama 3.1 70B

Both samples are built on AWS ParallelCluster, which manages the Trainium cluster infrastructure, and Slurm, which handles workload management. The following is an example Slurm command that starts training for Llama 3.1 70B:

    sbatch --exclusive \
        --nodes 32 \
        --cpus-per-task 128 \
        --wrap="srun bash $(pwd)/run_llama3_70B_tp_pp.sh"

Inside the Slurm script, we launch the distributed training process on the cluster. The runner script loads the pre-trained weights and configuration provided by Meta and starts the training process:

    torchrun $DISTRIBUTED_ARGS run_llama_nxd.py \
        --train_batch_size $BS \
        --use_meta_device_init 1 \
        --training_dir $DATA_PATH \
        --training_config $SCRIPT_DIR/${MODEL_SIZE}config_llama${LLAMA_VERSION} \
        --max_steps $max_steps \
        --seq_len $SEQ_LEN \
        --pipeline_parallel_size $PP_DEGREE \
        --tensor_parallel_size $TP_DEGREE \
        --num_microbatches $NUM_MICROBATCHES \
        --lr 0.000015 \
        --min_lr 1e-06 \
        --beta1 0.9 \
        --beta2 0.95 \
        --weight_decay 0.1 \
        --warmup_steps 2000 \
        --constant_steps 0 \
        --use_zero1_optimizer 1 \
        --use_selective_checkpoint 1 \
        --use_flash_attention 1 \
        --qkv_linear 1 \
        --kv_replicator 4 \
        --pretrained_weight 1 \
        --save_load_xser 1 \
        --checkpoint_dir "/shared/llama${LLAMA_VERSION}${MODEL_SIZE}/" \
        --checkpoint_freq $checkpoint_freq \
        --num_kept_checkpoint -1 \
        --loading_step -1 \
        --tb_dir $tb_dir |& tee $LOG_PATH/log
    exit ${PIPESTATUS[0]}

Deploy Llama 3.1 on Inferentia2

When your model is ready to deploy, you can do so by updating the model ID in the earlier Llama 3 8B Neuron sample code:

    from transformers_neuronx.llama.model import LlamaForSampling

    # neuron_config is assumed to be defined as in the earlier Llama 3 8B
    # sample; it is not shown in this post.
    model_id = "meta-llama/Meta-Llama-3.1-8B"
    neuron_model = LlamaForSampling.from_pretrained(
        model_id,
        neuron_config=neuron_config,
        batch_size=1,
        tp_degree=24,
        amp='bf16',
        n_positions=4096,
    )
    # Compile the model and load it onto the Neuron devices.
    neuron_model.to_neuron()

You can use the same sample inference code:

    import time

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    prompt = "Hello, I'm a language model and I like to"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # run inference with top-k sampling
    with torch.inference_mode():
        start = time.time()
        generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
        elapsed = time.time() - start

    generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
    print(f'generated sequences {generated_sequences} in {elapsed} seconds')

For step-by-step details, refer to the new Llama 3.1 samples:

- Meta Llama 3.1 8B
- Meta Llama 3.1 70B
- Meta Llama 3.1 8B 32k
- Meta Llama 3.1 405B on Trainium is coming soon

You can also use Hugging Face's Optimum Neuron library to quickly deploy models from the Hugging Face Model Hub directly through SageMaker. From the Llama 3.1 model card on the Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the sample code into a SageMaker notebook and choose Run.

    import json
    import sagemaker
    import boto3
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        iam = boto3.client("iam")
        role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

    # Hub Model configuration. https://huggingface.co/models
    hub = {
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B",
        "HF_NUM_CORES": "2",
        "HF_AUTO_CAST_TYPE": "fp16",
        "MAX_BATCH_SIZE": "8",
        "MAX_INPUT_LENGTH": "3686",
        "MAX_TOTAL_TOKENS": "4096",
        "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
    }

    assert hub["HF_TOKEN"] != "<REPLACE WITH YOUR TOKEN>", "Please replace '<REPLACE WITH YOUR TOKEN>' with your Hugging Face Hub API token"

    # create Hugging Face Model Class
    huggingface_model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.23"),
        env=hub,
        role=role,
    )

    # deploy model to SageMaker Inference
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.inf2.xlarge",
        container_startup_health_check_timeout=1800,
        volume_size=512,
    )

    # send request
    predictor.predict(
        {
            "inputs": "What is the capital of France?",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 128,
                "temperature": 0.7,
                "top_k": 50,
                "top_p": 0.95,
            }
        }
    )

Additionally, if you want to deploy the models with vLLM, you can set up the environment by following the continuous batching guide. Once the environment is ready, you can use vLLM to deploy Llama 3.1 8B/70B models on AWS Trainium or Inferentia. The following example deploys Llama 3.1 8B:

    from vllm import LLM, SamplingParams

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Create an LLM.
    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-8B",
        max_num_seqs=8,
        # The max_model_len and block_size arguments are required to be same as max sequence length,
        # when targeting neuron device. Currently, this is a known limitation in continuous batching
        # support in transformers-neuronx.
        max_model_len=128,
        block_size=128,
        # The device can be automatically detected when AWS Neuron SDK is installed.
        # The device argument can be either unspecified for automated detection, or explicitly assigned.
        device="neuron",
        tensor_parallel_size=8)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Conclusion

AWS Trainium and Inferentia deliver high performance and low cost for fine-tuning and deploying Llama 3.1 models. We look forward to seeing how you use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about getting started with AWS AI chips, refer to Model Samples and Tutorials in the AWS Neuron documentation.

The Japanese translation of this post was prepared by 常世 of Annapurna Labs. The original post is available here.
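
As a footnote to the SageMaker deployment settings in this post, the sketch below works out how many tokens remain for generation when a prompt uses the full `MAX_INPUT_LENGTH`. It assumes `MAX_TOTAL_TOKENS` caps prompt tokens plus generated tokens per request, which is how these settings are usually described for Hugging Face's text-generation containers; whether the Neuron container enforces exactly the same rule is an assumption here. The helper name `generation_budget` is hypothetical and not part of any library.

```python
def generation_budget(env: dict) -> int:
    """Tokens left for generation after a maximum-length prompt.

    Assumption: MAX_TOTAL_TOKENS bounds prompt tokens + generated tokens.
    """
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return max_total - max_input


# Values copied from the `hub` configuration in this post.
hub = {"MAX_INPUT_LENGTH": "3686", "MAX_TOTAL_TOKENS": "4096"}
max_new_tokens = 128  # value used in the sample request

budget = generation_budget(hub)
print(f"worst-case generation budget: {budget} tokens")
print(f"max_new_tokens={max_new_tokens} fits: {max_new_tokens <= budget}")
```

With the values from the post, a full-length prompt still leaves room for the 128 new tokens requested in the sample call. Under the same assumption, a request that asks for more new tokens than this budget could be rejected or truncated when the prompt is at its maximum length.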