Hello! I'm Suzugamine from the Innovation Center, where I usually work on AI/ML systems. This article is an overview of the Checkpoint API added in CUDA 12.8. I first cover the background (checkpoint use cases and earlier attempts at checkpointing NVIDIA CUDA applications), then explain the newly added CUDA Checkpointing API. Finally, I build a small tool and verify checkpointing against CUDA applications such as torchvision and transformers.

Contents:

- Background
- CUDA Checkpointing
- Implementation and Verification
  - cu_check tool
  - Verification
    - Pytorch Counter
    - torchvision
    - transformers
- Summary

## Background

Checkpointing is a technique for saving the internal state of a running computation (its memory and so on) to disk or other storage, so that the computation can be resumed at an arbitrary later time. Typical use cases include backups for failure recovery, live migration, saving intermediate results of long-running computations, preemption in task scheduling, and evidence preservation in forensic analysis. In the GPU domain in particular, plausible uses include periodically saving long-running training jobs of the kind typified by today's large language models, and migrating some GPU processes to another GPU server as part of resource optimization.

There have been many attempts at checkpointing NVIDIA CUDA applications, but they required modifying the application's execution environment or replaying CUDA Driver API calls, and none were fully transparent [1][2][3][4][5][6][7][8][9][10].

Then, in July 2024, NVIDIA announced on its Technical Blog that checkpointing had been implemented experimentally in the CUDA Driver API as of driver version 555.
The post also describes combining this with CRIU (Checkpoint/Restore in Userspace), an open-source checkpoint utility.

https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/

The tool itself, cuda-checkpoint, is published below. The repository contains only a prebuilt binary; inspecting the strings in that binary shows no driver API calls other than cuGetExportTable [11], so it was presumably implemented on top of undocumented functions.

https://github.com/NVIDIA/cuda-checkpoint

```shell
git clone https://github.com/NVIDIA/cuda-checkpoint.git
cd cuda-checkpoint
strings ./bin/x86_64_Linux/cuda-checkpoint | grep cu
# libcuda.so.1
# cuDriverGetVersion
# cuGetExportTable
```

MemVerge also adopted this API early, running an AI-training PoC on top of it and presenting the results at GTC24 and elsewhere:

> CUDA 12.x driver enhancements will enable the open-source CRIU project to checkpoint and restart a GPU-based compute node. We'll provide a technical overview and demonstrate this new capability.

https://www.nvidia.com/en-us/on-demand/session/gtc24-p63184/

Then, as of March 2025, the API has been officially published as "CUDA Checkpointing", available from driver version 570 and CUDA 12.8.
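Incidentally, the `strings | grep cu` inspection shown above can be scripted if you want to compare the symbols embedded in the binary across releases. This is a generic sketch; the helper name `extract_cu_symbols` is my own, not part of any tool:

```python
import re

def extract_cu_symbols(path, min_len=4):
    """Scan a binary for printable ASCII runs (like the `strings` command)
    and keep the ones that look like CUDA driver symbols (prefix 'cu')."""
    with open(path, "rb") as f:
        data = f.read()
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return sorted({run.decode("ascii") for run in runs if run.startswith(b"cu")})

# Example: extract_cu_symbols("./bin/x86_64_Linux/cuda-checkpoint")
# should list cuDriverGetVersion and cuGetExportTable, matching the
# `strings` output above.
```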
Code that uses part of this new API has since been published in the cuda-checkpoint repository mentioned earlier [12]. It is still used alongside the existing cuda-checkpoint mechanism, however, and does not appear to have been fully switched over to the newly published interface [13]. As a result, at the time of writing there seems to be no sample code that exercises the API newly published in CUDA 12.8. In the next section I examine the published Checkpoint API and work out how to use it.
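Before diving into the API, one check you can already run: use ctypes to see whether your installed driver exports the new entry points. This is a sketch of my own; it returns `None` when `libcuda.so.1` cannot be loaded (for example, on a machine without the NVIDIA driver):

```python
import ctypes

CHECKPOINT_SYMBOLS = (
    "cuCheckpointProcessGetState",
    "cuCheckpointProcessLock",
    "cuCheckpointProcessCheckpoint",
    "cuCheckpointProcessRestore",
    "cuCheckpointProcessUnlock",
    "cuCheckpointProcessGetRestoreThreadId",
)

def probe(lib_path="libcuda.so.1", names=CHECKPOINT_SYMBOLS):
    """Return {symbol: exported?} for the given shared library,
    or None if the library itself cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return None
    # CDLL attribute lookup raises AttributeError for missing symbols,
    # so hasattr doubles as an export check.
    return {name: hasattr(lib, name) for name in names}

# Example (on a machine with driver >= 570):
#   print(probe())
```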
## CUDA Checkpointing

The CUDA Checkpointing API consists of the following calls.

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CHECKPOINT.html

- `CUresult cuCheckpointProcessGetState(int pid, CUprocessState* state)`
  Gets the current state of the CUDA process (`CU_PROCESS_STATE_RUNNING`, `CU_PROCESS_STATE_LOCKED`, `CU_PROCESS_STATE_CHECKPOINTED`, or `CU_PROCESS_STATE_FAILED`).
- `CUresult cuCheckpointProcessLock(int pid, CUcheckpointLockArgs* args)`
  Locks the CUDA process so that subsequent CUDA API calls are blocked.
- `CUresult cuCheckpointProcessCheckpoint(int pid, CUcheckpointCheckpointArgs* args)`
  Saves the contents of GPU memory into host memory and checkpoints the CUDA process.
- `CUresult cuCheckpointProcessRestore(int pid, CUcheckpointRestoreArgs* args)`
  Restores the CUDA process. Its state must be `CU_PROCESS_STATE_CHECKPOINTED`.
- `CUresult cuCheckpointProcessUnlock(int pid, CUcheckpointUnlockArgs* args)`
  Unlocks the CUDA process so that CUDA API calls can resume.
- `CUresult cuCheckpointProcessGetRestoreThreadId(int pid, int* tid)`
  Gets the restore thread ID of the CUDA process.

(Figure: checkpoint/restore flow with CRIU; source: https://arxiv.org/abs/2502.16631 [14])

Referring to the figure above, a running process is checkpointed as follows:

1. `cuCheckpointProcessLock`
2. `cuCheckpointProcessCheckpoint`
3. `criu dump`

And the saved state is restored as follows:

1. `criu restore`
2. `cuCheckpointProcessRestore`
3. `cuCheckpointProcessUnlock`

In the next section I implement a command-line tool that actually drives these calls and check whether applications such as Pytorch can be checkpointed.
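The call-ordering constraints in the flow above can be captured in a tiny state machine. This is purely an illustration in plain Python (no GPU or driver involved); the states mirror `CUprocessState` and the transitions follow the API descriptions, with the post-restore state assumed to be LOCKED since unlock comes afterwards:

```python
# Toy model of the CUprocessState transitions implied by the Checkpoint API.
# Illustration only, not a driver binding.
RUNNING, LOCKED, CHECKPOINTED, FAILED = (
    "RUNNING", "LOCKED", "CHECKPOINTED", "FAILED",
)

# call name -> (required state, resulting state)
TRANSITIONS = {
    "lock":       (RUNNING,      LOCKED),
    "checkpoint": (LOCKED,       CHECKPOINTED),
    "restore":    (CHECKPOINTED, LOCKED),
    "unlock":     (LOCKED,       RUNNING),
}

def apply(state, call):
    """Apply one checkpoint API call to a process state, rejecting
    calls that are illegal in the current state."""
    required, result = TRANSITIONS[call]
    if state != required:
        raise ValueError(f"{call} requires {required}, process is {state}")
    return result

# Checkpoint sequence: lock -> checkpoint (CRIU dump happens outside CUDA)
s = RUNNING
for call in ("lock", "checkpoint"):
    s = apply(s, call)
print(s)  # CHECKPOINTED

# Restore sequence (after CRIU restore): restore -> unlock
for call in ("restore", "unlock"):
    s = apply(s, call)
print(s)  # RUNNING
```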
## Implementation and Verification

### cu_check tool

I implemented a tool that exposes each of the Checkpoint APIs as a subcommand and runs it against a given process ID.

cu_check.c:

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHECK_CU(func)                                                      \
    do {                                                                    \
        CUresult res = (func);                                              \
        if (res != CUDA_SUCCESS) {                                          \
            const char *errName = NULL;                                     \
            const char *errDesc = NULL;                                     \
            cuGetErrorName(res, &errName);                                  \
            cuGetErrorString(res, &errDesc);                                \
            fprintf(stderr, "%s failed: %s %s\n", #func, errName, errDesc); \
            return -1;                                                      \
        }                                                                   \
    } while (0)

const char *getCUprocessState(CUprocessState state) {
    switch (state) {
    case CU_PROCESS_STATE_RUNNING:      return "CU_PROCESS_STATE_RUNNING";
    case CU_PROCESS_STATE_LOCKED:       return "CU_PROCESS_STATE_LOCKED";
    case CU_PROCESS_STATE_CHECKPOINTED: return "CU_PROCESS_STATE_CHECKPOINTED";
    case CU_PROCESS_STATE_FAILED:       return "CU_PROCESS_STATE_FAILED";
    default:                            return "OTHER_STATE";
    }
}

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr,
                "usage: %s [state|thread|lock|checkpoint|restore|unlock] <pid>\n",
                argv[0]);
        return -1;
    }
    const char *subcommand = argv[1];
    int pid = atoi(argv[2]);

    CHECK_CU(cuInit(0));

    if (strcmp(subcommand, "state") == 0) {
        CUprocessState state;
        CHECK_CU(cuCheckpointProcessGetState(pid, &state));
        printf("state: %s\n", getCUprocessState(state));
    } else if (strcmp(subcommand, "thread") == 0) {
        int threadId = 0;
        CHECK_CU(cuCheckpointProcessGetRestoreThreadId(pid, &threadId));
        printf("thread id: %d\n", threadId);
    } else if (strcmp(subcommand, "lock") == 0) {
        CUcheckpointLockArgs args = {
            .timeoutMs = 600000  // 10min timeout
        };
        CHECK_CU(cuCheckpointProcessLock(pid, &args));
        printf("locked successfully\n");
    } else if (strcmp(subcommand, "checkpoint") == 0) {
        CHECK_CU(cuCheckpointProcessCheckpoint(pid, NULL));
        printf("checkpointed successfully\n");
    } else if (strcmp(subcommand, "restore") == 0) {
        CHECK_CU(cuCheckpointProcessRestore(pid, NULL));
        printf("restored successfully\n");
    } else if (strcmp(subcommand, "unlock") == 0) {
        CHECK_CU(cuCheckpointProcessUnlock(pid, NULL));
        printf("unlocked successfully\n");
    } else {
        printf("unknown subcommand: %s\n", subcommand);
        return -1;
    }
    return 0;
}
```

```shell
gcc -I /usr/local/cuda-12.8/include cu_check.c -o cu_check -lcuda

# install
sudo mv cu_check /usr/local/bin

# Usage
cu_check state <pid>
cu_check thread <pid>
cu_check lock <pid>
cu_check checkpoint <pid>
cu_check restore <pid>
cu_check unlock <pid>
```

### Verification

First, install CRIU:

```shell
curl -LO "http://github.com/checkpoint-restore/criu/archive/v4.0/criu-4.0.tar.gz"
tar xvfz criu-4.0.tar.gz
cd criu-4.0/
make -j
sudo make install
```

#### Pytorch Counter

I first verify with a counter that lives in CUDA memory and increments once per second.

torch_counter.py:

```python
import torch, time

counter = torch.tensor(0, device='cuda')
while True:
    print(counter)
    counter.add_(1)
    time.sleep(1)
```

Run it as follows. You should see the counter keep printing once per second, continuing across the checkpoint and restore:

```shell
pip install torch
python torch_counter.py &
sleep 5
PID=$(pgrep -f 'python torch_counter.py')

# checkpoint
rm -rf tcnt && mkdir -p tcnt
cu_check lock $PID
cu_check checkpoint $PID
sudo criu dump -j -D tcnt -t $PID
du -sh tcnt
# 755M

# restore
sudo criu restore -j -D tcnt &
while ! pgrep -f 'python torch_counter.py' > /dev/null 2>&1; do sleep 1; done
sudo cu_check restore $PID
sudo cu_check unlock $PID
sleep 5
kill -9 $PID
```
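The lock → checkpoint → dump and restore → unlock sequences above can also be wrapped in a single helper. This is just a sketch under a few assumptions: `cu_check` (built above) and `criu` are on `PATH`, the script runs with privileges sufficient for CRIU, and `criu restore -d` detaches after restoring the process under its original PID; the binary names are parameters so they are easy to substitute:

```python
import subprocess

def run(*cmd):
    """Echo and run a command, raising on non-zero exit."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True)

def checkpoint(pid, dump_dir, cu_check="cu_check", criu="criu"):
    """Lock -> checkpoint GPU state into host memory -> CRIU dump,
    in the same order as the shell commands above."""
    run(cu_check, "lock", str(pid))
    run(cu_check, "checkpoint", str(pid))
    run(criu, "dump", "-j", "-D", dump_dir, "-t", str(pid))

def restore(pid, dump_dir, cu_check="cu_check", criu="criu"):
    """CRIU restore (detached) -> restore GPU state -> unlock.
    CRIU restores the process under its original PID."""
    run(criu, "restore", "-j", "-D", dump_dir, "-d")
    run(cu_check, "restore", str(pid))
    run(cu_check, "unlock", str(pid))

# Usage: checkpoint(pid, "tcnt"); later restore(pid, "tcnt")
```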
#### torchvision

Next, I verify with a ResNet from torchvision. You can checkpoint the state mid-training and confirm that, after the restore, training resumes from where it left off:

```shell
git clone https://github.com/pytorch/examples.git
cd examples/imagenet/
pip install -r requirements.txt
python main.py -a resnet152 --dummy -j 0 &
sleep 20
PID=$(pgrep -f 'python main.py -a resnet152 --dummy -j 0')

# checkpoint
rm -rf resnet && mkdir -p resnet
cu_check lock $PID
cu_check checkpoint $PID
sudo criu dump -j -D resnet -t $PID
du -sh resnet
# 50G

# restore
sudo criu restore -j -D resnet &
while ! pgrep -f 'python main.py -a resnet152 --dummy -j 0' > /dev/null 2>&1; do sleep 1; done
sudo cu_check restore $PID
sudo cu_check unlock $PID
sleep 20
sudo kill -9 $PID
```

#### transformers

Finally, I verify with transformers. Here too, I confirmed that training continued after the restore.

train_bert.py:

```python
# ref: https://huggingface.co/docs/transformers/training
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

dataset = load_dataset("yelp_review_full")["train"].select(range(10000))
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

small_train_dataset = dataset.map(tokenize_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(num_train_epochs=10000),
    train_dataset=small_train_dataset,
)
trainer.train()
```

Run it as follows:

```shell
pip install transformers datasets accelerate
python train_bert.py &
sleep 60
PID=$(pgrep -f 'python train_bert.py')

# checkpoint
rm -rf bert && mkdir -p bert
cu_check lock $PID
cu_check checkpoint $PID
sudo criu dump -j -D bert -t $PID --tcp-established
du -sh bert
# 5.5G

# restore
sudo criu restore -j -D bert --tcp-established &
while ! pgrep -f 'python train_bert.py' > /dev/null 2>&1; do sleep 1; done
sudo cu_check restore $PID
sudo cu_check unlock $PID
sleep 20
sudo kill -9 $PID
```

## Summary

This article walked through the Checkpoint API introduced in CUDA 12.8. Through an actual implementation, I confirmed that checkpointing works against real CUDA applications, including transformers, the de facto standard library for today's large language models. Going forward, I hope these techniques will be used to snapshot long-running training jobs and to optimize resource usage more efficiently.
ããŠããŸãã Takizawa, Hiroyuki, et al. "CheCUDA: A checkpoint/restart tool for CUDA applications." 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE, 2009. ↩ Nukada, Akira, Hiroyuki Takizawa, and Satoshi Matsuoka. "NVCR: A transparent checkpoint-restart library for NVIDIA CUDA." 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. IEEE, 2011. ↩ Jiang, Hai, et al. "A checkpoint/restart scheme for cuda programs with complex computation states." International Journal of Networked and Distributed Computing 1.4 (2013): 196-212. ↩ Garg, Rohan, et al. "CRUM: Checkpoint-restart support for CUDA's unified memory." 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2018. ↩ Jain, Twinkle, and Gene Cooperman. "CRAC: checkpoint-restart architecture for CUDA with streams and UVM." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020. ↩ Shukla, Dharma, et al. "Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads. arXiv. org (February 2022)." arXiv preprint arXiv:2202.07848. ↩ Eiling, Niklas, et al. "Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support." Concurrency and Computation: Practice and Experience 34.14 (2022): e6474. ↩ Nukada, Akira, Taichiro Suzuki, and Satoshi Matsuoka. "Efficient checkpoint/Restart of CUDA applications." Parallel Computing 116 (2023): 103018. ↩ Eiling, Niklas, Stefan Lankes, and Antonello Monti. "Checkpoint/Restart for CUDA Kernels." Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis. 2023. ↩ Yang, Yanning, et al. "On-demand and Parallel Checkpoint/Restore for GPU Applications." Proceedings of the 2024 ACM Symposium on Cloud Computing. 2024. 
↩ https://forums.developer.nvidia.com/t/cugetexporttable-explanation/259109/2 ↩ https://github.com/NVIDIA/cuda-checkpoint/blob/6ec728aff032c18c9fb0794a272d94c6adcce508/src/r570-features.c ↩ https://github.com/checkpoint-restore/criu/blob/4b099510b35f98a1f1d6589b1660470402fc1fef/plugins/cuda/cuda_plugin.c ↩ Stoyanov, Radostin, et al. "CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads." arXiv preprint arXiv:2502.16631 (2025). ↩