Detect text in files in PDF doesn't work even for sample pdf file. getting empty strings #257
Description
Thanks for stopping by to let us know something could be better!
PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.
Please run down the following list and make sure you've tried the usual "quick fixes":
- Search the issues already opened: https://github.com/googleapis/python-vision/issues
- Search StackOverflow: https://stackoverflow.com/questions/tagged/google-cloud-platform+python
If you are still having issues, please be sure to include as much information as possible:
Environment details
- OS type and version: Linux
- Python version (python --version): 3.9
- pip version (pip --version): 21.3
- google-cloud-vision version (pip show google-cloud-vision): 2.5.0
Steps to reproduce
- Took the exact code from this link and ran it unchanged: https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python
Code example
def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    import json
    import re
    from google.cloud import vision
    from google.cloud import storage

    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.Feature(
        type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=420)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json.loads(json_string)

    # The actual response for the first page of the input file.
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print('Full text:\n')
    print(annotation['text'])


async_detect_document('gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf','My-uri')
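A note on the destination argument: the sample's regular expression only matches URIs of the form gs://bucket/prefix, and `re.match` returns `None` for anything else, so a literal placeholder like the one in the call above would fail with an `AttributeError` on `match.group(1)` before any JSON is read. The sketch below is a hypothetical helper (the name `split_gcs_uri` is not part of any Google library) that makes that failure explicit. It uses the same pattern as the sample.

```python
import re


def split_gcs_uri(uri):
    """Hypothetical helper: split 'gs://bucket/prefix' into (bucket, prefix)."""
    # Same pattern the sample uses to parse gcs_destination_uri.
    match = re.match(r"gs://([^/]+)/(.+)", uri)
    if match is None:
        raise ValueError(f"Expected a URI like gs://bucket/prefix, got {uri!r}")
    return match.group(1), match.group(2)
```

For example, `split_gcs_uri("gs://my-bucket/ocr/out/")` returns the bucket name and the prefix, while a string without the gs:// scheme raises a `ValueError` with a readable message.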
Stack trace
Traceback (most recent call last):
  File "/home/nanduri_saivenkataraju/Desktop/projects/searchandQA/src/PDFExtraction.py", line 83, in <module>
    async_detect_document('gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf','my-uri')
  File "/home/nanduri_saivenkataraju/Desktop/projects/searchandQA/src/PDFExtraction.py", line 69, in async_detect_document
    response = json.loads(json_string)
  File "/opt/conda/envs/fastapi/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/envs/fastapi/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/envs/fastapi/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Process finished with exit code 1
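"Expecting value: line 1 column 1 (char 0)" is what `json.loads` reports when it is given an empty string, so `blob_list[0]` was probably an empty object. One possible cause, not confirmed in this report, is a zero-byte "folder" placeholder under the destination prefix being listed before the actual output files. The sketch below is a hypothetical helper (`first_json_response` is not part of the sample) that skips such objects. It assumes the output files end in .json and that each blob exposes `.name` and `.download_as_bytes()`, as google-cloud-storage `Blob` objects do.

```python
import json


def first_json_response(blobs):
    """Return the parsed content of the first non-empty .json blob.

    `blobs` is any iterable of objects with a `.name` attribute and a
    `.download_as_bytes()` method (assumed here, see the note above).
    """
    for blob in blobs:
        if not blob.name.endswith(".json"):
            continue  # skip placeholders such as "ocr/out/"
        data = blob.download_as_bytes()
        if data:
            return json.loads(data)
    raise ValueError("No non-empty .json output found under the prefix")
```

In the sample this would replace the `output = blob_list[0]` and `json.loads(json_string)` steps, for example `response = first_json_response(blob_list)`. Printing `blob.name` and `blob.size` for every listed object is a quick way to check whether the first entry is really empty.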
Making sure to follow these steps will guarantee the quickest resolution possible.
Thanks!