| """ Jetson部署脚本 """ import torch import gc import time import argparse from pathlib import Path
class JetsonDeployer:
    """Loads, quantizes, and benchmarks a causal LM on an NVIDIA Jetson."""

    def __init__(self, model_path: str, quantize_mode: str = "fp16"):
        self.model_path = model_path
        self.quantize_mode = quantize_mode
        self.model = None
        self.tokenizer = None
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"GPU: {torch.cuda.get_device_name()}")
            print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    def load_model(self):
        """Load the model in fp16."""
        print(f"Loading model from {self.model_path}")
        if "llama" in self.model_path.lower():
            from transformers import LlamaForCausalLM, LlamaTokenizer
            self.model = LlamaForCausalLM.from_pretrained(
                self.model_path,
                torch_dtype=torch.float16,
                device_map="auto",
            )
            self.tokenizer = LlamaTokenizer.from_pretrained(self.model_path)
        elif "chatglm" in self.model_path.lower():
            from transformers import AutoModel, AutoTokenizer
            self.model = AutoModel.from_pretrained(
                self.model_path,
                torch_dtype=torch.float16,
                trust_remote_code=True,
            ).half().cuda()
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_path, trust_remote_code=True
            )
        else:
            # Fall back to the generic auto classes for other architectures.
            from transformers import AutoModelForCausalLM, AutoTokenizer
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_path,
                torch_dtype=torch.float16,
                device_map="auto",
                trust_remote_code=True,
            )
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_path, trust_remote_code=True
            )
        gc.collect()
        torch.cuda.empty_cache()
        print("Model loaded successfully")

    def quantize_model(self, mode: str = "int8"):
        """Reload the model with bitsandbytes int8 or int4 (NF4) quantization."""
        print(f"Quantizing model to {mode}")
        from transformers import BitsAndBytesConfig
        if mode == "int8":
            quant_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
                llm_int8_has_fp16_weight=False,
            )
        elif mode == "int4":
            quant_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_quant_type="nf4",
            )
        else:
            raise ValueError(f"Unsupported quantization mode: {mode}")
        # Free the fp16 weights first; a Jetson rarely has memory for two copies.
        del self.model
        gc.collect()
        torch.cuda.empty_cache()
        if "llama" in self.model_path.lower():
            from transformers import LlamaForCausalLM
            self.model = LlamaForCausalLM.from_pretrained(
                self.model_path,
                quantization_config=quant_config,
                device_map="auto",
            )
        else:
            from transformers import AutoModel
            self.model = AutoModel.from_pretrained(
                self.model_path,
                quantization_config=quant_config,
                device_map="auto",
                trust_remote_code=True,
            )
        gc.collect()
        print("Quantization completed")

    def benchmark(self, prompt: str = "介绍自己", num_runs: int = 10):  # "Introduce yourself"
        """Measure end-to-end latency and decoding throughput."""
        print("\n=== Benchmark ===")
        # Warm-up run so one-time CUDA initialization is not timed.
        _ = self.generate(prompt, max_new_tokens=32)
        prompt_tokens = len(self.tokenizer.encode(prompt))
        latencies = []
        tokens_per_second = []
        for i in range(num_runs):
            torch.cuda.synchronize()
            start = time.time()
            output = self.generate(prompt, max_new_tokens=128)
            torch.cuda.synchronize()
            end = time.time()
            latency = end - start
            # Count only newly generated tokens, not the prompt.
            new_tokens = len(self.tokenizer.encode(output)) - prompt_tokens
            tps = new_tokens / latency
            latencies.append(latency)
            tokens_per_second.append(tps)
            print(f"Run {i+1}: {latency:.2f}s, {tps:.2f} tokens/s")
        print("\n=== Results ===")
        print(f"Average latency: {sum(latencies)/len(latencies):.2f}s")
        print(f"Average throughput: {sum(tokens_per_second)/len(tokens_per_second):.2f} tokens/s")

    def generate(self, prompt: str, max_new_tokens: int = 256):
        """Generate a completion with nucleus sampling."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, required=True)
    parser.add_argument("--quantize", type=str, default="fp16",
                        choices=["fp16", "int8", "int4"])
    parser.add_argument("--benchmark", action="store_true")
    # Default prompt: "Outline the history of artificial intelligence".
    parser.add_argument("--prompt", type=str, default="介绍人工智能的发展历史")
    args = parser.parse_args()

    deployer = JetsonDeployer(args.model_path, args.quantize)
    deployer.load_model()
    if args.quantize != "fp16":
        deployer.quantize_model(args.quantize)

    output = deployer.generate(args.prompt)
    print(f"\nOutput:\n{output}")

    if args.benchmark:
        deployer.benchmark(args.prompt)
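
# Usage sketch (a hypothetical invocation; assumes this file is saved as
# jetson_deploy.py and that a local checkpoint directory such as
# ./models/chatglm2-6b exists -- both names are illustrative):
#
#   python jetson_deploy.py --model_path ./models/chatglm2-6b \
#       --quantize int8 --benchmark
#
# Note the design choice in quantize_model: the fp16 weights are freed before
# the quantized reload, trading a second load from disk for a steady-state
# footprint that fits memory-constrained Jetson modules.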