Maximize GPU utilization and minimize cost per experiment with intelligent orchestration for training and inference at scale.

AI teams training large models face a brutal infrastructure problem. GPU compute is expensive, scarce, and underutilized. Data scientists wait hours for GPU access on shared clusters while allocated instances sit idle during data preprocessing or hyperparameter analysis. A Spot instance interruption can ruin a multi-day training run that lacks proper checkpointing and waste thousands of dollars. Because cost per experiment is not visible, the ROI of different research directions cannot be compared. Model artifacts are scattered across personal machines and S3 buckets with no version control or lineage tracking. As an organization scales from single-GPU experiments to distributed multi-node training, the ad hoc tooling that worked for a small team breaks down, and researchers spend more time managing infrastructure than advancing their models.
MicrocosmWorks implements workload-aware GPU scheduling using MIG (Multi-Instance GPU) partitioning on A100/H100 GPUs. Inference workloads are isolated on smaller GPU slices, while training jobs are guaranteed full-GPU or multi-GPU allocations, which prevents the memory fragmentation caused by interference between mixed workloads. The orchestrator understands the memory profiles of the different workload types and schedules them to maximize GPU utilization without triggering out-of-memory failures from fragmented allocations. For clusters that run both inference and training, this approach typically reaches 70-85% GPU utilization, compared with the 30-40% that is common in naively scheduled mixed clusters.
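The placement idea above can be illustrated with a small sketch. This is not MicrocosmWorks' scheduler: the `Gpu` class, the `place` function, the 80 GB capacity, and the first-fit-decreasing heuristic are all assumptions chosen for illustration, and real MIG profiles come in fixed sizes that depend on the GPU model.

```python
from dataclasses import dataclass, field

# Hypothetical capacity in GB; real MIG slice sizes depend on the GPU model.
FULL_GPU_GB = 80


@dataclass
class Gpu:
    name: str
    mode: str = "unassigned"  # "unassigned", "mig" (sliced), or "full"
    free_gb: int = FULL_GPU_GB
    jobs: list = field(default_factory=list)


def place(jobs, gpus):
    """Place (job_name, kind, mem_gb) tuples; return {job_name: gpu_name or None}."""
    placement = {}
    # Largest jobs first (first-fit decreasing) to limit fragmentation.
    for name, kind, mem_gb in sorted(jobs, key=lambda j: -j[2]):
        placement[name] = None
        for gpu in gpus:
            if kind == "training":
                # Training takes a whole GPU that nothing else is using.
                if gpu.mode == "unassigned":
                    gpu.mode, gpu.free_gb = "full", 0
                    gpu.jobs.append(name)
                    placement[name] = gpu.name
                    break
            elif gpu.mode in ("unassigned", "mig") and gpu.free_gb >= mem_gb:
                # Inference shares a GPU that is unassigned or already sliced.
                gpu.mode = "mig"
                gpu.free_gb -= mem_gb
                gpu.jobs.append(name)
                placement[name] = gpu.name
                break
    return placement
```

Keeping training and inference on separate GPUs means a training job never has to fit around leftover inference slices, which is the fragmentation case the paragraph describes. A job that cannot be placed maps to `None` and would stay queued.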
MicrocosmWorks typically deploys GPU orchestration on Kubernetes with the NVIDIA GPU Operator and custom scheduling plugins, augmented by frameworks such as Run:ai or Volcano for gang scheduling, fair-share queuing, and fractional GPU allocation, none of which vanilla Kubernetes supports natively. Standard Kubernetes treats a GPU as an opaque integer resource. Our augmented stack understands GPU topology (NVLink interconnects, PCIe vs. NVSwitch), memory capacity, and compute capability, and makes placement decisions that significantly affect training performance. For large clusters (50 or more GPUs), scheduling intelligence alone can improve effective throughput by 20-40% compared with default Kubernetes GPU scheduling.
MicrocosmWorks uses a tiered GPU procurement strategy that combines on-demand cloud GPUs for burst capacity, Reserved Instances for baseline steady-state workloads, and Spot/Preemptible Instances for fault-tolerant training jobs with checkpointing. This cuts costs by 40-60% compared with on-demand-only pricing. The orchestration layer automatically checkpoints training jobs at configurable intervals, which allows graceful recovery from preemption when Spot Instances are reclaimed, and it routes time-sensitive inference workloads to reserved capacity to guarantee availability. For organizations with sustained GPU demand, we evaluate colocation with owned NVIDIA hardware against a cloud-only approach, because the break-even point for owned hardware is typically 12-18 months of continuous utilization.
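Interval checkpointing with resume after preemption can be sketched as follows. This is a toy, standard-library-only illustration under stated assumptions: the function names, the JSON state file, and the `interrupt_at` parameter (which simulates a Spot reclaim) are hypothetical, and a real training job would save model and optimizer state through its framework instead.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every, interrupt_at=None):
    """Toy loop that resumes from the last checkpoint.

    interrupt_at simulates a Spot reclaim: progress since the last
    checkpoint is lost, which is what the interval bounds.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            return state["step"]  # simulated reclaim before the next step
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["step"]
```

The checkpoint interval is the trade-off knob: a shorter interval loses less work per reclaim but spends more time writing state.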
MicrocosmWorks deploys high-bandwidth, low-latency interconnects using InfiniBand (400Gbps NDR) or RoCE v2 (100-400Gbps) fabrics with NCCL-optimized network topologies. Gradient synchronization between nodes creates a communication bottleneck, so distributed training performance is often network-bound rather than compute-bound. The network architecture includes topology-aware job placement (leaf-spine topology awareness), which co-locates distributed training Pods on nodes connected through the same network switch to minimize cross-switch traffic. For cloud deployments, we use placement groups and cluster networking options (AWS EFA, GCP GPUDirect-TCPX, Azure InfiniBand) that deliver near-bare-metal network performance. Network architecture consulting is available at $35-$50/hour.
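Topology-aware placement can be sketched as a grouping problem: prefer nodes under one leaf switch, and otherwise span as few switches as possible. The `pick_nodes` function, the node names, and the switch labels below are hypothetical; a real scheduler would read topology from node labels or the fabric manager.

```python
from collections import defaultdict


def pick_nodes(free_nodes, needed):
    """Choose `needed` nodes from {node_name: leaf_switch}, or return None.

    Prefers a single leaf switch. Otherwise takes the switches with the
    most free nodes first, which spans as few switches as possible.
    """
    by_switch = defaultdict(list)
    for node, switch in sorted(free_nodes.items()):
        by_switch[switch].append(node)

    # Best case: the tightest single switch that can host the whole job.
    fitting = [nodes for nodes in by_switch.values() if len(nodes) >= needed]
    if fitting:
        return min(fitting, key=len)[:needed]

    # Fallback: combine the largest groups until the job fits.
    chosen = []
    for nodes in sorted(by_switch.values(), key=len, reverse=True):
        chosen.extend(nodes[: needed - len(chosen)])
        if len(chosen) == needed:
            return chosen
    return None  # not enough free nodes; a gang-scheduled job would wait
```

Picking the tightest fitting switch, rather than the largest, leaves bigger contiguous groups free for later multi-node jobs.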
MicrocosmWorks implements namespace-based multi-tenancy with a guaranteed minimum GPU quota per team, burst capacity beyond quota when the cluster has idle resources, and priority-based preemption policies that ensure high-priority production inference workloads always get resources, even during peak training periods. The platform includes a self-service portal where team leads can submit training jobs, check queue position, monitor GPU utilization, and manage their team's job priorities without involving platform engineering. Chargeback reports track the GPU hours consumed by each team and project, so finance teams can allocate AI infrastructure costs accurately across business units.
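The quota and chargeback logic can be sketched in a few lines. Everything here is an illustrative assumption: the `admit` and `chargeback` functions, the record fields, and the flat hourly rate are hypothetical, and the rule that guaranteed work may displace burst work is one possible policy, not a description of the actual platform.

```python
from collections import defaultdict


def admit(team_usage, team_quota, requested, cluster_idle):
    """Classify a GPU request against a team's guaranteed quota.

    "guaranteed": fits inside the quota (may preempt burst jobs if needed).
    "burst":      exceeds the quota but idle GPUs exist; preemptible.
    "queue":      exceeds the quota and nothing is idle.
    """
    if team_usage + requested <= team_quota:
        return "guaranteed"
    if requested <= cluster_idle:
        return "burst"
    return "queue"


def chargeback(job_records, rate_per_gpu_hour):
    """Sum GPU-hours and cost per (team, project).

    job_records: iterable of dicts with keys team, project, gpus, hours.
    """
    gpu_hours = defaultdict(float)
    for job in job_records:
        gpu_hours[(job["team"], job["project"])] += job["gpus"] * job["hours"]
    return {
        key: {"gpu_hours": hours, "cost": round(hours * rate_per_gpu_hour, 2)}
        for key, hours in gpu_hours.items()
    }
```

Keying the report on `(team, project)` lets finance roll costs up to a team or business unit without losing project-level detail.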
Contact us to learn how our expert team can build this solution for your business.
MicrocosmWorks can build an end-to-end GPU orchestration platform that treats compute as a shareable, schedulable resource, with intelligent queuing, preemption policies, and cost tracking. The platform supports both training and inference workloads with different scheduling profiles. Training jobs are scheduled across Spot and on-demand instances with automatic checkpointing, while inference endpoints autoscale based on request patterns. An integrated model registry tracks the code, data, hyperparameters, and resulting artifacts of every experiment with full lineage. Researchers define their resource requirements through a self-service portal, and the platform handles placement, scaling, fault tolerance, and cost attribution automatically.
The platform runs on Kubernetes with GPU-aware scheduling, using a mix of on-demand and Spot instance node pools that autoscale with queue depth. A custom scheduler prioritizes jobs based on team budget, deadline, and resource efficiency. A distributed storage layer gives training jobs high-throughput data access, and the model registry and experiment tracker provide the metadata backbone for reproducibility and governance.
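One way to picture prioritization by team budget, deadline, and resource efficiency is a weighted score feeding a priority queue. The weights, field names, and the `priority_score` and `schedule_order` functions below are assumptions made for illustration; the source does not specify how the three factors are combined.

```python
import heapq


def priority_score(job, now_hours):
    """Lower score means scheduled sooner. Weights are illustrative only."""
    hours_left = max(job["deadline_hours"] - now_hours, 0.0)
    budget_left = job["team_budget_left"]  # fraction of budget remaining, 0..1
    efficiency = job["expected_gpu_util"]  # expected utilization, 0..1
    # Closer deadlines, more remaining budget, and higher efficiency all
    # move a job toward the front of the queue.
    return hours_left * 1.0 - budget_left * 10.0 - efficiency * 5.0


def schedule_order(jobs, now_hours=0.0):
    """Return job names in the order they would be dispatched."""
    heap = [(priority_score(job, now_hours), job["name"]) for job in jobs]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

A linear score is easy to explain to teams competing for the same queue, though a production scheduler would likely add aging so that low-priority jobs cannot starve.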
| Layer | Technology |
|---|---|
| Backend | Python, Go, FastAPI, gRPC, Ray |
| AI / ML | PyTorch, DeepSpeed, Hugging Face Transformers, NVIDIA NCCL, TensorRT, vLLM |
| Frontend | React, Grafana, MLflow UI, custom Jupyter Hub portal |
| Database | PostgreSQL (metadata), MinIO (artifact storage), Redis (job queue), TimescaleDB (metrics) |
| Infrastructure | Kubernetes (EKS with GPU nodes), Karpenter, NVIDIA GPU Operator, Terraform, ArgoCD, Prometheus, DCGM Exporter |
The platform is built in four phases over 12-16 weeks. Weeks 1-3 focus on requirements discovery, GPU workload profiling, and architecture design for the Kubernetes-based scheduling and autoscaling infrastructure using Karpenter and the NVIDIA GPU Operator. Weeks 4-8 implement the GPU-aware scheduler with bin packing and gang scheduling, an elastic node pool manager with a Spot instance procurement strategy, and an MLflow-based model registry with DVC integration. Weeks 9-12 build the self-service researcher portal, the cost attribution engine, and per-team budget enforcement dashboards. Weeks 13-16 cover load testing with representative training jobs, tuning of the checkpoint and recovery workflows for Spot interruptions, and operational training for the ML platform and research teams.
| Metric | Improvement | Details |
|---|---|---|
| GPU utilization | 70-85% average | Bin packing and queue-based scheduling eliminate idle reserved instances |
| Compute cost | 45-60% reduction | Spot instance management with checkpointing cuts cost without the risk of losing work |
| Researcher wait time | 80% reduction | Fair-share scheduling and elastic scaling end first-come, first-served GPU hoarding |
| Experiment reproducibility | 100% | Full lineage tracking from data version to model artifact ensures every result can be reproduced |
| Time to model deployment | 70% reduction | An integrated path from model registry to serving pipeline removes manual handoffs between research and engineering |
Automated, secure, and reproducible delivery pipelines cut deployment time from hours to minutes.