Amazon Textract is a fully managed machine learning service that automatically extracts printed text, handwriting, and other data from scanned documents.
Amazon Textract goes beyond simple optical character recognition (OCR): in addition to reading text in images and scanned documents, it extracts data from tables and forms.
Many businesses and government organizations extract data from scanned documents, such as PDFs, tables, and forms, through manual data entry that is slow, expensive, and prone to errors. Others use simple business process automation (BPA) to build fully automated or semi-automated workflows. These workflows require manual configuration that has to be updated each time a form changes. To overcome these manual processes, Textract uses machine learning to process many types of documents quickly, extracting text, forms, and tables without manual effort or custom code.
AWS Textract offers more than the average optical character recognition (OCR) system. It can extract information such as names, birthdates, and social security numbers from images and PDF files stored in S3 buckets.
“Amazon Textract is based on the same proven, highly scalable, deep-learning technology that was developed by Amazon’s computer vision scientists to analyze billions of images and videos daily. You don’t need any machine learning expertise to use it.”
Let’s explore AWS Textract!
In this exercise, we will use the following AWS services:
- Simple Storage Service (S3)
- Identity and Access Management Service (IAM)
- Lambda Service
- Textract Service
We will demonstrate one major use case of the AWS Textract service using AWS Lambda with a Python implementation:
Extracting Text from an S3 Bucket Image. (Complete Hands On)
To use AWS Textract in Python, the latest boto3 package is required. We will download this package and upload it as an AWS Lambda “Layer”. Let’s do it!
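Execute the following command in a command shell. The -t option installs the package into a local python/ folder, which is the folder layout a Lambda layer zip expects.
pip install boto3 -t python/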
Now let’s create a boto3 layer. Go to AWS Lambda -> Layers and click “Create Layer”.
Give the layer a name, select the latest Python runtime, and upload the zip file as shown below.
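The layer zip has to contain the installed packages under a top-level python/ folder, because the Lambda Python runtime looks for layer code in /opt/python. A minimal sketch of building such a zip is given here, assuming boto3 was installed into a local folder (for example with pip's --target option). The function name build_layer_zip and the file names are illustrative, not part of any AWS tooling.

```python
import zipfile
from pathlib import Path


def build_layer_zip(package_dir: str, zip_path: str) -> int:
    """Zip every file in package_dir under a top-level 'python/' folder.

    Returns the number of files written to the archive.
    Keep zip_path outside package_dir so the archive does not include itself.
    """
    root = Path(package_dir)
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Layer contents are unpacked to /opt, and the Python runtime
                # imports from /opt/python, hence the 'python/' prefix.
                arcname = (Path("python") / path.relative_to(root)).as_posix()
                archive.write(path, arcname)
                count += 1
    return count
```

For example, build_layer_zip("python", "boto3-layer.zip") would package a local python/ folder into a zip that can be uploaded on the “Create Layer” screen.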
So, let’s start doing text extraction!
Extracting Text from the image stored in the S3 bucket
We are going to create a Lambda function that is triggered whenever an image is uploaded to the S3 bucket.
1. Creating the S3 Bucket
In the Amazon console, go to the AWS S3 page and click “Create bucket”.
Enter a bucket name and choose the same region that the Lambda function will use. In the “Set permissions” section, set the permissions as shown in the image below, then create the bucket.
2. Creating The S3 Lambda Trigger
Now we will create a Lambda function that will be executed upon new image uploads in the bucket which we have created.
Go to the AWS Lambda service page and click “Create function”.
Select “Use a blueprint”, search for the “s3-get-object-python” template, and click “Configure” as shown in the image below:
Enter the “Function name” and “Role name”, and select the S3 bucket created in the previous step as the “Bucket name”. Add a “Suffix” to restrict the trigger to PNG images. Fill out the rest of the settings as shown in the image below. Don’t change the Lambda function code for now; we will change it later. Click the “Create function” button.
Replace the existing code in the Function Code area with the following code. This code sends the uploaded image to AWS Textract and writes the response as a text file with the same name to the S3 bucket.
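A minimal sketch of such a handler is given here; it is not necessarily the exact code behind this walkthrough. It assumes the function is triggered by a single-record S3 event, that plain text detection (detect_document_text) is sufficient, and that “the same name” means the image key with its extension swapped for .txt. The helper names _clients, extract_lines, and output_key are illustrative.

```python
import functools
import os
import urllib.parse


@functools.lru_cache(maxsize=None)
def _clients():
    """Create the AWS clients once and reuse them across warm invocations."""
    import boto3  # supplied by the Lambda runtime or by the boto3 layer

    return boto3.client("s3"), boto3.client("textract")


def extract_lines(response: dict) -> str:
    """Join the text of every LINE block in a Textract response."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )


def output_key(image_key: str) -> str:
    """Map 'folder/scan.png' to 'folder/scan.txt' (assumed naming scheme)."""
    return os.path.splitext(image_key)[0] + ".txt"


def lambda_handler(event, context):
    s3, textract = _clients()
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Textract reads the image directly from S3, so no download is needed.
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    text_key = output_key(key)
    s3.put_object(
        Bucket=bucket,
        Key=text_key,
        Body=extract_lines(response).encode("utf-8"),
    )
    return {"bucket": bucket, "key": text_key}
```

Because the trigger is restricted to the PNG suffix, writing the .txt result into the same bucket does not invoke the function again.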
3. Attaching Permission Policies to Lambda
Click on “getTextFromImageRole” in the Permissions tab of the Lambda configuration, as displayed in the image below:
This will open the “getTextFromImageRole” configuration page as below.
Click “Attach policy”, select the “AmazonTextractFullAccess” policy, and click “Attach policy” as displayed in the image below.
This gives the Lambda function permission to access the AWS Textract service, as shown in the following image.
4. Adding Custom “boto3-layer” to Lambda
Click “Layers” in the Lambda designer and click “Add a layer” as shown in the image below.
This opens the Add Layer screen shown in the following image. Select Custom Layer, add the “workfall-boto3-layer” that was created earlier, and click the Add button.
5. Testing The S3 Lambda Trigger
Before uploading files to the S3 bucket, let’s test our Lambda function. Click on “Test”. For now we get an error, as shown in the following image: access is denied on the GetObject operation.
Let’s fix this error. Go to the Permissions tab of the Lambda function and click on the role named getTextFromImageRole. On the next screen, click “Add inline policy” to add a policy that gives the Lambda function access to the objects in the bucket, as shown in the following image:
This takes you to the next screen, shown in the image below. Give the policy a name and click the “Create Policy” button.
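The contents of the inline policy are only visible in the screenshot, so as a rough guide, a policy along these lines should cover what the function does. This sketch assumes the bucket is the “workfallbucket” used in this walkthrough and that the function needs both s3:GetObject (to read the image) and s3:PutObject (to write the text file). The helper name build_s3_object_policy is illustrative.

```python
import json


def build_s3_object_policy(bucket_name: str) -> dict:
    """Return an IAM policy document allowing object reads and writes in one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # GetObject to read the uploaded image, PutObject to store the text file.
                "Action": ["s3:GetObject", "s3:PutObject"],
                # The trailing /* scopes the permissions to objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_object_policy("workfallbucket"), indent=2))
```

The printed JSON can be pasted into the JSON view of the inline policy editor instead of building the policy with the visual editor.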
Go to the S3 bucket “workfallbucket” created in the previous step and upload a PNG image with some text. I have uploaded the following image, which represents the Workfall Partner Onboarding process.
Once the image is uploaded, after a few seconds the extracted text file should be created in the same bucket with the same name, as displayed in the following image:
Hope this information is helpful. We will keep sharing more about how to use new AWS services. Stay tuned!
For any further queries, feel free to post your comments; we are happy to help!
Keep Exploring -> Keep Learning -> Keep Mastering