Detecting text in PDF files doesn't work even for the sample PDF file; getting empty strings


#### Environment details

  - OS type and version: Linux
  - Python version: 3.9
  - pip version: 21.3
  - `google-cloud-vision` version: 2.5.0

#### Steps to reproduce

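  1. Took the exact code from this link: https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python
  2. Ran it against the sample file `gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf`; it fails with the `JSONDecodeError` shown in the stack trace.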

#### Code example









```python
def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    import json
    import re
    from google.cloud import vision
    from google.cloud import storage

    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.Feature(
        type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=420)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()

    response = json.loads(json_string)

    # The actual response for the first page of the input file.
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print('Full text:\n')
    print(annotation['text'])


async_detect_document('gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf','My-uri')
```

#### Stack trace
```
Traceback (most recent call last):
  File "/home/nanduri_saivenkataraju/Desktop/projects/searchandQA/src/PDFExtraction.py", line 83, in <module>
    async_detect_document('gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf','my-uri')
  File "/home/nanduri_saivenkataraju/Desktop/projects/searchandQA/src/PDFExtraction.py", line 69, in async_detect_document
    response = json.loads(json_string)
  File "/opt/conda/envs/fastapi/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/envs/fastapi/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/envs/fastapi/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Process finished with exit code 1
```
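
The error at `char 0` means `json.loads` received an empty (or non-JSON) string, which matches the "empty strings" symptom in the title. One possible explanation, which is an assumption and not something confirmed in this report, is that `blob_list[0]` is not an OCR output file at all (for example, a zero-byte placeholder object stored under the same prefix). A minimal sketch of a check for that case is below. The helper names `first_json_output` and `parse_output` are hypothetical, and the fake blobs (including their names) are stand-ins for illustration; the only real attributes relied on are `name` and `size`, which `google.cloud.storage` blobs expose.

```python
import json
from types import SimpleNamespace


def first_json_output(blobs):
    """Return the first non-empty blob whose name ends in '.json', or None."""
    for blob in blobs:
        # blob.size can be None if metadata was not loaded, so treat that as 0.
        if blob.name.endswith('.json') and (blob.size or 0) > 0:
            return blob
    return None


def parse_output(raw_bytes):
    """Parse downloaded bytes as JSON, with a clearer error for empty input."""
    if not raw_bytes or not raw_bytes.strip():
        raise ValueError('Downloaded object is empty; it is probably '
                         'not an OCR output file.')
    return json.loads(raw_bytes)


# Stand-in objects for illustration only. Real code would pass
# list(bucket.list_blobs(prefix=prefix)) instead.
fake_blobs = [
    SimpleNamespace(name='ocr-output/', size=0),
    SimpleNamespace(name='ocr-output/output-1-to-2.json', size=1234),
]

chosen = first_json_output(fake_blobs)
print(chosen.name)
```

In the sample, this would replace `output = blob_list[0]` with `output = first_json_output(blob_list)`, followed by a `None` check before downloading. If the listing contains no non-empty `.json` object at all, the problem is more likely on the output side (destination URI or permissions) than in the parsing step.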



