
Optimizing Large-File Uploads to Google Cloud Storage and Fixing Out-of-Memory Errors

news 2025/12/17 10:35:01

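Background

Our project uploads tens of thousands of files in parallel to a downstream GCP Cloud Storage bucket every day. For larger files we used GCP's resumable upload approach, which splits a file into multiple chunks and uploads them to the GCP bucket over several HTTP requests; see https://cloud.google.com/storage/docs/performing-resumable-uploads for details. In practice, however, because a large file is sent as multiple HTTP requests, GCP occasionally returned a 400 error and the upload failed, even though every request parameter looked normal. It is still unclear whether this is caused by some network restriction on the GCP side or by a performance problem in the API. I therefore tried an alternative: uploading files through the official GCP SDK, as follows.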

    GoogleCredentials apiCredentials = GoogleCredentials.fromStream(new FileInputStream(jsonKeyPath))
            .createScoped(scopes);
    Storage storage = StorageOptions.newBuilder().setCredentials(apiCredentials).build().getService();
    BlobId blobId = BlobId.of(bucketName, objectName);
    BlobInfo blobInfo = BlobInfo.newBuilder(blobId)
            .setContentType("application/json;charset=UTF-8")
            .setMd5(checksum)
            .build();
    byte[] bytes = Files.readAllBytes(sourceFIle.toPath());
    storage.create(blobInfo, bytes);

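For the SDK API, see https://cloud.google.com/java/docs/reference/google-cloud-storage/latest/overview.

For small files this is fairly fast, because it uploads directly without splitting the data into chunks. But once the files get larger or more numerous, the following exception appears: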

java.lang.OutOfMemoryError: Required array size too large

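The cause is the call Files.readAllBytes(sourceFIle.toPath()), which loads the entire file into memory at once. When a file is large, or many files are processed at the same time, this leads straight to an out-of-memory error.

The official SDK does provide a resumable upload option for large files, for example: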

    String bucketName = "my-unique-bucket";
    String blobName = "my-blob-name";
    BlobId blobId = BlobId.of(bucketName, blobName);
    byte[] content = "Hello, World!".getBytes(UTF_8);
    BlobInfo blobInfo = BlobInfo.newBuilder(blobId).setContentType("text/plain").build();
    try (WriteChannel writer = storage.writer(blobInfo)) {
        writer.write(ByteBuffer.wrap(content, 0, content.length));
    } catch (IOException ex) {
        // handle exception
    }

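Under the hood, WriteChannel also uploads in chunks, and when a transfer fails because of a network glitch it automatically retransmits from the last successfully transferred chunk. However, that only addresses large-file upload performance; it does nothing about the fact that we already run out of memory while loading the source file.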

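Solution

The fix is actually simple: keep the amount of data read from the source file each time within a controlled bound. In other words, read in chunks and upload in chunks. This keeps large-file uploads efficient and avoids running out of memory from loading too much of a file at once.

Sample code follows.

Main entry point:
The scopes below correspond to the permissions we need; see https://cloud.google.com/storage/docs/authentication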

    public static void main(String[] args) throws IOException {
        List<String> scopes = new ArrayList<>(Arrays.asList(
                "https://www.googleapis.com/auth/devstorage.full_control",
                "https://www.googleapis.com/auth/devstorage.read_write"));
        // GCP JSON key path
        String jsonKeyPath = "";
        // File to be uploaded
        File sourceFile = new File("filePath");
        // Calculate the file checksum (hex-encoded MD5) and close the stream afterwards
        String checksum;
        try (InputStream checksumStream = new FileInputStream(sourceFile)) {
            checksum = DigestUtils.md5Hex(checksumStream);
        }
        // GCS bucket name
        String bucketName = "";
        // GCS target object name
        String objectName = sourceFile.getName();
        // Prepare connection details
        // https://cloud.google.com/storage/docs/authentication
        GoogleCredentials apiCredentials;
        try (InputStream keyStream = new FileInputStream(jsonKeyPath)) {
            apiCredentials = GoogleCredentials.fromStream(keyStream).createScoped(scopes);
        }
        Storage storage = StorageOptions.newBuilder().setCredentials(apiCredentials).build().getService();
        BlobId blobId = BlobId.of(bucketName, objectName);
        // setMd5 expects a base64-encoded MD5, while md5Hex returns a hex string,
        // so the hex variant of the setter is used here
        BlobInfo blobInfo = BlobInfo.newBuilder(blobId)
                .setContentType("application/json;charset=UTF-8")
                .setMd5FromHexString(checksum)
                .build();
        uploadToBucket(storage, sourceFile, blobInfo);
    }
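Cloud Storage records an object's MD5 hash in base64 form, and hashing a large file should not require holding it in memory any more than uploading it does. As a minimal sketch using only the JDK, the digest can be computed in a streaming fashion and then base64-encoded. The class name Md5Checksums and its methods are made up for illustration and are not taken from the original project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper, not part of the GCP SDK or the original project.
final class Md5Checksums {

    // Streams the file through the digest so memory use stays at one small buffer.
    static String md5Base64(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return md5Base64(in);
        }
    }

    // Reads the stream to its end (without closing it) and returns the base64-encoded MD5.
    static String md5Base64(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) >= 0) {
            digest.update(buffer, 0, read);
        }
        return Base64.getEncoder().encodeToString(digest.digest());
    }
}
```

A string produced this way is in the encoding that BlobInfo.Builder.setMd5 documents; if a hex digest is kept instead, the SDK also offers setMd5FromHexString.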

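Upload method:
The file-size threshold here should be set according to the memory and bandwidth actually available on your server. Small files generally do best with a direct upload; larger files need to be split into chunks.
On choosing the chunk size, GCP recommends the following: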

The chunk size should be a multiple of 256 KiB (256 x 1024 bytes), unless it's the last chunk that completes the upload. Larger chunk sizes typically make uploads faster, but note that there's a tradeoff between speed and memory usage. It's recommended that you use at least 8 MiB for the chunk size.

    public static void uploadToBucket(Storage storage, File sourceFIle, BlobInfo blobInfo) throws IOException {
        // For small files, we can upload the file in one go
        // less than 10 MB
        if (sourceFIle.length() < 10000000) {
            byte[] bytes = Files.readAllBytes(sourceFIle.toPath());
            storage.create(blobInfo, bytes);
            return;
        }
        // For big files, we need to split it into multiple chunks
        try (WriteChannel writer = storage.writer(blobInfo)) {
            // Don't read the whole file because it will cause OutOfMemory issue
            byte[] buffer = new byte[10240];
            try (InputStream input = Files.newInputStream(sourceFIle.toPath())) {
                int limit;
                while ((limit = input.read(buffer)) >= 0) {
                    writer.write(ByteBuffer.wrap(buffer, 0, limit));
                }
            }
        }
    }
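To make GCP's chunk-size guidance concrete, here is a minimal sketch that rounds a desired size up to the next multiple of 256 KiB. The names ChunkSizes and alignChunkSize are hypothetical and not part of the GCP SDK. The result could, for instance, be passed to WriteChannel.setChunkSize before writing, if the default chunk size does not fit your memory budget.

```java
// Hypothetical helper for picking a chunk size that respects the 256 KiB rule.
final class ChunkSizes {

    // Chunk sizes must be a multiple of 256 KiB (except for the final chunk).
    static final int GRANULARITY = 256 * 1024;

    // GCP's recommended minimum of 8 MiB, already a multiple of 256 KiB.
    static final int RECOMMENDED_MIN = 8 * 1024 * 1024;

    // Rounds desiredBytes up to the nearest multiple of 256 KiB.
    static int alignChunkSize(int desiredBytes) {
        if (desiredBytes <= 0) {
            throw new IllegalArgumentException("chunk size must be positive");
        }
        // Use long arithmetic so the rounding cannot overflow an int.
        long aligned = ((long) desiredBytes + GRANULARITY - 1) / GRANULARITY * GRANULARITY;
        if (aligned > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunk size too large: " + desiredBytes);
        }
        return (int) aligned;
    }
}
```

For example, a requested size of 10,000,000 bytes is rounded up to 10,223,616 bytes, which is 39 x 256 KiB.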

The line below is the key to solving the problem: each iteration reads only part of the data, so memory does not overflow.

  while((limit = input.read(buffer))>=0){
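The same loop can be written against the JDK's generic channel interface, which makes the bounded-memory property easy to see and to test without a bucket. The sketch below is an illustration under that assumption, not the project's code: ChunkedCopy and copy are hypothetical names, and the target is any WritableByteChannel (the SDK's WriteChannel is one). It also drains each buffer fully, since a channel is allowed to accept fewer bytes than offered in a single write call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper: copies a stream to a channel using one fixed-size buffer.
final class ChunkedCopy {

    // Returns the number of bytes copied; memory use is bounded by bufferSize.
    static long copy(InputStream in, WritableByteChannel out, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        // read() returns -1 at end of stream, so the loop ends once the source is exhausted.
        while ((read = in.read(buffer)) >= 0) {
            ByteBuffer chunk = ByteBuffer.wrap(buffer, 0, read);
            // Keep writing until the channel has consumed the whole chunk.
            while (chunk.hasRemaining()) {
                out.write(chunk);
            }
            total += read;
        }
        return total;
    }
}
```

Whatever the file size, only one buffer of bufferSize bytes is held by this loop at a time; any additional buffering is done by the channel implementation itself.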

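Test Results

The test used the project's GCP Storage and account, uploading four 210 MB files in parallel from multiple threads. Because my machine is on a network in China and the GCP bucket is deployed in Europe, the speed was not especially fast.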
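For readers who want to reproduce a parallel run like this, the following is a minimal sketch of one way to drive several uploads from a fixed-size thread pool. It is not the test harness used above: ParallelUploads and uploadAll are hypothetical names, and the uploader argument stands in for whatever per-file upload routine you use. Capping the pool size also caps the number of read buffers alive at once, which keeps memory predictable.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical driver: runs one upload task per file on a bounded thread pool.
final class ParallelUploads {

    // Returns how many uploads completed without throwing.
    static int uploadAll(List<Path> files, int threads, Consumer<Path> uploader)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(pool.submit(() -> uploader.accept(file)));
            }
            int succeeded = 0;
            for (Future<?> future : futures) {
                try {
                    future.get();
                    succeeded++;
                } catch (ExecutionException e) {
                    // The upload task threw; it is simply not counted as a success here.
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real run the uploader would wrap the upload call and convert its checked IOException into an unchecked one, for example UncheckedIOException.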

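Summary

This solution is not specific to Google Cloud. When we used Alibaba Cloud in the past we uploaded large files the same way, and the approach applies to any file-transfer scenario. At its core it is simply divide and conquer.

Full Code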

https://github.com/EvanLeung08/cloud-solutions/tree/main/gcs-largefile-upload

