BlobReader not buffering properly. #462

@Megabytemb

Description

Environment details

  • OS type and version: Mac 10.15.7
  • Python version: Python 3.9.5
  • pip version: pip 21.1.1
  • google-cloud-storage version: 1.38.0

Steps to reproduce

When trying to stream a file from Google Cloud Storage to Google Drive, the BlobReader doesn't appear to be buffering properly.

Reading through the BlobReader code, it should buffer the file according to chunk_size and download a new chunk only once that buffer is exhausted. In my experience, however, every time the BlobReader is read a second time, it invalidates the buffer and downloads a new chunk.
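
The buffering behaviour expected here can be sketched as follows. This is a minimal illustration and not the library's BlobReader: `ChunkBufferedReader`, `fetch_range` and `make_fetcher` are hypothetical names, and the remote object is simulated with an in-memory bytes value so that downloads can be counted.

```python
class ChunkBufferedReader:
    """Hypothetical reader that serves small reads from one downloaded chunk."""

    def __init__(self, fetch_range, chunk_size):
        self._fetch_range = fetch_range  # callable(start, size) -> bytes
        self._chunk_size = chunk_size
        self._buffer = b""
        self._buffer_pos = 0   # next unread index inside self._buffer
        self._remote_pos = 0   # offset of the next byte to download
        self.downloads = 0     # number of simulated range requests

    def read(self, size):
        parts = []
        remaining = size
        while remaining > 0:
            if self._buffer_pos == len(self._buffer):
                # Buffer exhausted: the only place a download happens.
                self._buffer = self._fetch_range(self._remote_pos, self._chunk_size)
                self._buffer_pos = 0
                self._remote_pos += len(self._buffer)
                self.downloads += 1
                if not self._buffer:
                    break  # end of object
            piece = self._buffer[self._buffer_pos:self._buffer_pos + remaining]
            parts.append(piece)
            self._buffer_pos += len(piece)
            remaining -= len(piece)
        return b"".join(parts)


def make_fetcher(data):
    """Simulate ranged downloads from an in-memory object."""
    def fetch_range(start, size):
        return data[start:start + size]
    return fetch_range


data = bytes(range(256)) * 1024  # 256 KiB stand-in for the blob
reader = ChunkBufferedReader(make_fetcher(data), chunk_size=64 * 1024)

received = bytearray()
while True:
    block = reader.read(8192)  # small reads, like an HTTP client body loop
    if not block:
        break
    received += block
```

In this simulation the 32 small reads are served by four 64 KiB downloads, plus one empty probe at the end of the object.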

The Google API MediaIoBaseUpload appears to request data in 8192-byte chunks, and every time the next chunk is requested from the GCS BlobReader, it downloads the next 40 MB chunk rather than reading from the buffer.

My debugging found that the buffer is actually invalidated when the Python HTTP class seeks to the next chunk, and the math fails here; however, I'm unsure what should be happening.
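
One way to picture the seek problem: if the caller seeks to an absolute offset before every small read (an assumption about the upload path, based on the debugging described above), a reader that drops its buffer on every seek re-downloads each time, whereas one that first checks whether the target lies inside the buffered range does not. The sketch below is hypothetical and is not the BlobReader source; `SeekableChunkReader`, `keep_buffer_on_seek`, `copy_with_seeks` and the in-memory `data` are illustrative names.

```python
class SeekableChunkReader:
    """Hypothetical chunked reader with a switchable seek policy."""

    def __init__(self, data, chunk_size, keep_buffer_on_seek):
        self._data = data              # stands in for the remote object
        self._chunk_size = chunk_size
        self._keep = keep_buffer_on_seek
        self._buffer = b""
        self._buffer_start = 0         # absolute offset of self._buffer[0]
        self._pos = 0                  # absolute position of the next read
        self.downloads = 0

    def seek(self, pos):
        buffer_end = self._buffer_start + len(self._buffer)
        in_buffer = self._buffer_start <= pos <= buffer_end
        if not (self._keep and in_buffer):
            # Discard the buffer; the next read has to download again.
            self._buffer = b""
            self._buffer_start = pos
        self._pos = pos
        return self._pos

    def read(self, size):
        out = bytearray()
        while len(out) < size:
            offset = self._pos - self._buffer_start
            if offset >= len(self._buffer):
                # Simulated ranged download starting at the current position.
                self._buffer = self._data[self._pos:self._pos + self._chunk_size]
                self._buffer_start = self._pos
                self.downloads += 1
                if not self._buffer:
                    break  # end of object
                offset = 0
            piece = self._buffer[offset:offset + size - len(out)]
            out += piece
            self._pos += len(piece)
        return bytes(out)


def copy_with_seeks(reader, total, block=8192):
    """Read the object in small blocks, seeking before each read."""
    out = bytearray()
    pos = 0
    while pos < total:
        reader.seek(pos)
        piece = reader.read(block)
        out += piece
        pos += len(piece)
    return bytes(out)


data = bytes(range(256)) * 1024  # 256 KiB stand-in for the blob
naive = SeekableChunkReader(data, 64 * 1024, keep_buffer_on_seek=False)
fixed = SeekableChunkReader(data, 64 * 1024, keep_buffer_on_seek=True)
naive_copy = copy_with_seeks(naive, len(data))
fixed_copy = copy_with_seeks(fixed, len(data))
```

Both readers return identical bytes; only the number of simulated downloads differs (one per 8192-byte read for the buffer-dropping policy, one per 64 KiB chunk for the buffer-keeping policy).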

The code example below demonstrates the problem: just provide your own client credentials for Drive, upload a CSV file to Google Cloud Storage, and note the blob and bucket names.

Code example

from google.cloud import storage
from googleapiclient.http import MediaIoBaseUpload
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
import os
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

logger = logging.getLogger(__name__)

SCOPES = ['https://www.googleapis.com/auth/drive']


def getCreds():
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'client_secret.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    
    return creds

def gcsToDrive(bucketName, blobName):
    client = storage.Client()

    bucket = client.get_bucket(bucketName)
    blob = bucket.blob(blobName)

    creds = getCreds()

    service = build('drive', 'v3', credentials=creds)

    megabyte = (256 * 1024) * 4
    chunk_size: int = megabyte * 40

    with blob.open("rb", chunk_size=chunk_size) as stream:
        file_metadata = {
            "name": "My Report",
            "mimeType": "application/vnd.google-apps.spreadsheet",
        }
        media = MediaIoBaseUpload(stream, mimetype="text/csv", resumable=True, chunksize=chunk_size)
        file = (
            service.files()
            .create(body=file_metadata, media_body=media, fields="id")
            .execute()
        )

    logger.info(file)
    
    return file

if __name__ == "__main__":
    my_bucket = "my-bucket"
    my_blob = "my-blob.csv"

    gcsToDrive(my_bucket, my_blob)

Example Logs

DEBUG:google.auth._default:Checking /Users/<username>/code/python/gscFileError/sa.json for explicit credentials as part of auth process...
DEBUG:google.auth._default:Checking /Users/<username>/code/python/gscFileError/sa.json for explicit credentials as part of auth process...
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): storage.googleapis.com:443
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /storage/v1/b/<my bucket>?projection=noAcl&prettyPrint=false HTTP/1.1" 200 574
INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /storage/v1/b/<my bucket>/o/my-blob.csv?projection=noAcl&prettyPrint=false HTTP/1.1" 200 738
DEBUG:googleapiclient.discovery:URL being requested: POST https://www.googleapis.com/upload/drive/v3/files?fields=id&alt=json&uploadType=resumable
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2240745
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2224361
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2207977
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2191593
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2175209
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2158825
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2142441
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 2126057
...
...
...
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 94441
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 78057
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 61673
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 45289
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 28905
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 206 12521
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 416 29
DEBUG:urllib3.connectionpool:https://storage.googleapis.com:443 "GET /download/storage/v1/b/<my bucket>/o/my-blob.csv?generation=1623219180016524&alt=media HTTP/1.1" 416 29
INFO:__main__:{'id': '19GXqCAQDsRtK1czPSB_vX_SaBp4JtatlxCl4uQUU66o'}
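
As a rough check on the logs above, the response sizes of consecutive 206 responses can be compared. The sketch below uses only the sizes visible in the excerpt. It assumes that the elided lines follow the same pattern and that the first response (2240745 bytes) covers the whole object; the logs do not confirm either assumption.

```python
# Response sizes copied from the 206 lines in the log excerpt.
head = [2240745, 2224361, 2207977, 2191593, 2175209, 2158825, 2142441, 2126057]
tail = [94441, 78057, 61673, 45289, 28905, 12521]


def steps(sizes):
    """Differences between consecutive response sizes."""
    return {a - b for a, b in zip(sizes, sizes[1:])}


step_sizes = steps(head) | steps(tail)
assert len(step_sizes) == 1, "responses do not shrink by a constant amount"
step = next(iter(step_sizes))

object_size = head[0]                # assumption: first response is the whole object
requests = -(-object_size // step)   # ceiling division
total_downloaded = sum(object_size - i * step for i in range(requests))
amplification = total_downloaded / object_size

print(f"step between requests: {step} bytes")
print(f"ranged requests needed: {requests}")
print(f"bytes transferred:      {total_downloaded}")
print(f"amplification:          {amplification:.1f}x")
```

Under those assumptions, each request starts 16384 bytes further into the object and returns everything up to the end, so about 137 ranged requests transfer roughly 154 MB for a 2.2 MB file. The step is 16384 bytes rather than the 8192 bytes mentioned earlier; the excerpt does not show why.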

Labels

  • api: storage (Issues related to the googleapis/python-storage API.)
  • status: investigating (The issue is under investigation, which is determined to be non-trivial.)
  • type: question (Request for information or clarification. Not an issue.)
