This repository was archived by the owner on Feb 13, 2024. It is now read-only.

Detecting text in PDF files doesn't work even for the sample PDF file; getting empty strings #257

@SAIVENKATARAJU

Description

Thanks for stopping by to let us know something could be better!

PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.

Please run down the following list and make sure you've tried the usual "quick fixes":

If you are still having issues, please be sure to include as much information as possible:

Environment details

  • OS type and version: Linux
  • Python version (python --version): 3.9
  • pip version (pip --version): 21.3
  • google-cloud-vision version (pip show google-cloud-vision): 2.5.0

Steps to reproduce

  1. Took the exact code from this link: https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python

Code example

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    import json
    import re
    from google.cloud import vision
    from google.cloud import storage

    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.Feature(
        type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=420)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()

    response = json.loads(json_string)

    # The actual response for the first page of the input file.
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print('Full text:\n')
    print(annotation['text'])


async_detect_document('gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf','My-uri')
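A side note on the destination argument: the sample splits it with the regex gs://([^/]+)/(.+), so it has to be a full gs://bucket/prefix URI. With a literal placeholder such as 'My-uri' the regex would not match, and match.group(1) would raise AttributeError, assuming the request got that far. A minimal sketch of a stricter parse follows; parse_gcs_uri is a hypothetical helper name, not part of the sample or of the client library.

```python
import re


def parse_gcs_uri(uri):
    """Split 'gs://bucket/prefix' into (bucket_name, prefix).

    Hypothetical helper: uses the same regex as the sample above, but
    raises a descriptive error instead of failing on match.group().
    """
    match = re.match(r"gs://([^/]+)/(.+)", uri)
    if match is None:
        raise ValueError(f"Expected a URI like gs://bucket/prefix/, got {uri!r}")
    return match.group(1), match.group(2)
```

Inside the sample, bucket_name, prefix = parse_gcs_uri(gcs_destination_uri) could stand in for the three regex lines.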

Stack trace

Traceback (most recent call last):
  File "/home/nanduri_saivenkataraju/Desktop/projects/searchandQA/src/PDFExtraction.py", line 83, in <module>
    async_detect_document('gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf','my-uri')
  File "/home/nanduri_saivenkataraju/Desktop/projects/searchandQA/src/PDFExtraction.py", line 69, in async_detect_document
    response = json.loads(json_string)
  File "/opt/conda/envs/fastapi/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/envs/fastapi/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/envs/fastapi/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Process finished with exit code 1
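
"Expecting value: line 1 column 1 (char 0)" means json.loads was handed an empty or non-JSON payload. One plausible cause, not confirmed in this issue, is that blob_list[0] is not an OCR result at all, for example a zero-byte "folder" placeholder object whose name matches the destination prefix. The sketch below skips such objects before parsing. It assumes the output files end in .json and that blobs expose .name and .download_as_bytes() as google-cloud-storage Blob objects do; load_first_ocr_response is a hypothetical helper name.

```python
import json


def load_first_ocr_response(blobs):
    """Return the parsed JSON of the first blob that actually holds output.

    `blobs` is any iterable of objects with a `.name` attribute and a
    `.download_as_bytes()` method (assumption: google-cloud-storage Blobs).
    """
    for blob in blobs:
        # Skip folder placeholders ("prefix/") and anything that is not a .json file.
        if blob.name.endswith("/") or not blob.name.endswith(".json"):
            continue
        data = blob.download_as_bytes()
        if not data:
            # A zero-byte object would trigger the JSONDecodeError seen above.
            continue
        return json.loads(data)
    raise ValueError("No non-empty .json output found under the destination prefix")
```

In the sample this would replace the blob_list[0] lookup, e.g. response = load_first_ocr_response(bucket.list_blobs(prefix=prefix)).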

Making sure to follow these steps will guarantee the quickest resolution possible.

Thanks!

Labels

  • api: vision (Issues related to the googleapis/python-vision API.)
  • documentation (Improvements or additions to documentation)
  • samples (Issues that are directly related to samples.)
