Digitize a Batch Record

In this tutorial, you will digitize a PDF batch record. Critical process data shouldn’t stay trapped in PDFs, and you can start analyzing it in less than five minutes.

Get an API key

You’ll need an API key to follow along with this tutorial. Get a temporary API key sent to you by email.

How to use this tutorial

There are three ways to follow along with the tutorial (from beginner to advanced):

  1. Run the code in the cloud. Without installing anything locally, you’ll be able to both change and run the code. Open the Colab notebook.

  2. Download the code as a Jupyter notebook: batch-record-digitization.ipynb. First-time users of Jupyter notebooks should follow the Getting Started instructions first.

  3. Copy and paste the code snippets below into your own Python development environment.

Install the fathomdata library

From a terminal:

$ pip install fathomdata

Confirm the installation was successful by importing the library. We name the library fd on import by convention.

import fathomdata as fd

Users who are new to Python can find more detailed instructions at Getting Started.

Digitize a batch record

Download a sample batch record, or use the code below to download it programmatically.

with open("batch3.pdf", "wb") as f:
    pdf = fd.get_sample_batch_record("batch3")
    f.write(pdf)

Take a moment to look at the example batch record PDF. If you didn’t change the path above, the PDF will be saved in your current working directory. You can also open it programmatically.
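
One way to open the file programmatically is with Python's standard library. This is a minimal sketch, not part of fathomdata: the helper names pdf_uri and open_pdf are hypothetical, and which application opens the PDF depends on your system's default handler.

```python
import webbrowser
from pathlib import Path


def pdf_uri(path="batch3.pdf"):
    """Return a file:// URI for the PDF (hypothetical helper, not part of fathomdata)."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()


def open_pdf(path="batch3.pdf"):
    """Ask the system's default handler to open the PDF."""
    return webbrowser.open(pdf_uri(path))
```

Calling open_pdf() with no arguments assumes the PDF was saved as batch3.pdf in the current working directory.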

This batch record contains many different types of data, from raw material sources to process metrics. There is a mix of handwritten and typed text, and the formatting varies throughout the record. For this tutorial, we’ll focus on extracting and cleaning any type of data stored in a table (but this is just the start!).

Set your API key for this session.

Tip

Keep your API key in an environment variable so that you don’t accidentally check it into a git repository.

apikey = 'your-api-key-goes-here'
fd.set_api_key(apikey)
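
Following the tip above, a minimal sketch of reading the key from the environment instead of hard-coding it. The variable name FATHOM_API_KEY and the helper load_api_key are assumptions chosen for illustration; fathomdata does not prescribe them.

```python
import os


def load_api_key(env_var="FATHOM_API_KEY"):
    """Read the API key from an environment variable.

    The variable name is an assumption for this sketch; use any name you like.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this tutorial."
        )
    return key
```

You would then pass the result to the call shown above: fd.set_api_key(load_api_key()).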

Now, digitize the batch record using ingest_document.

new_doc_id = fd.ingest_document("batch3.pdf")  # update the path to your download location
print(f"Ingested document with ID {new_doc_id}")
Ingested document with ID 65aab1e5-0031-4679-9f66-eb4930ea6c6d-0

That’s it! Check that the upload was successful by listing the available records.

df = fd.available_documents()
df.head()
DocumentId                              ReceivedTime        Filename    UploadedByUserId
65aab1e5-0031-4679-9f66-eb4930ea6c6d-0  07-20-2021 05:37PM  batch3.pdf  demo@fathom.one

If the df syntax looks familiar, that’s because fathomdata is built on top of pandas. You can work with this dataframe of records using all the pandas slicing and indexing tools.
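
As an illustration of those pandas tools, here is a sketch that runs on a stand-in dataframe copied from the output above, so it does not need an API key. It assumes the columns hold plain strings; the dtypes that fd.available_documents() actually returns may differ.

```python
import pandas as pd

# Stand-in for fd.available_documents(), copied from the output shown above.
docs = pd.DataFrame(
    {
        "DocumentId": ["65aab1e5-0031-4679-9f66-eb4930ea6c6d-0"],
        "ReceivedTime": ["07-20-2021 05:37PM"],
        "Filename": ["batch3.pdf"],
        "UploadedByUserId": ["demo@fathom.one"],
    }
)

# Boolean mask with .loc: look up the document ID for a given filename.
doc_id = docs.loc[docs["Filename"] == "batch3.pdf", "DocumentId"].iloc[0]

# Column selection: keep only the columns of interest.
summary = docs[["Filename", "ReceivedTime"]]

# Parse the timestamps (assumes strings formatted as shown above).
received = pd.to_datetime(docs["ReceivedTime"], format="%m-%d-%Y %I:%M%p")

print(doc_id)
```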

Take a moment to repeat the process and digitize a second sample record. Download the batch4 PDF, or use the first code block above to download it programmatically. Then re-run the rest of the commands, replacing batch3 with batch4. When you are done, your available documents dataframe should look something like this (plus a few columns hidden here to save space).

DocumentId                              ReceivedTime        Filename    UploadedByUserId
65aab1e5-0031-4679-9f66-eb4930ea6c6d-0  07-20-2021 05:37PM  batch3.pdf  demo@fathom.one
9a4000c8-edf3-4bbf-a78e-4f3f235aae90-0  07-20-2021 05:38PM  batch4.pdf  demo@fathom.one
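
If you plan to repeat the download and ingest steps for more records, they can be wrapped in a small helper. This is a sketch under stated assumptions: digitize_sample is a hypothetical name, and it uses only the two fathomdata calls shown earlier in this tutorial (get_sample_batch_record and ingest_document). The module is passed in as an argument so the helper is easy to try out with a stub.

```python
def digitize_sample(fd, name):
    """Download a sample batch record and ingest it; return the document ID.

    `fd` is the imported fathomdata module and `name` is a sample name such
    as "batch4". Hypothetical helper, not part of fathomdata.
    """
    pdf_path = f"{name}.pdf"
    # Save the sample PDF to the current working directory.
    with open(pdf_path, "wb") as f:
        f.write(fd.get_sample_batch_record(name))
    # Upload the saved PDF for digitization.
    return fd.ingest_document(pdf_path)
```

After import fathomdata as fd and setting your API key, digitize_sample(fd, "batch4") would repeat the steps above for the second record.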

Use the digitized data

The extracted data is also returned in a pandas dataframe so it’s quickly available for custom analysis.

doc = fd.get_document(new_doc_id)
materials = doc.get_materials_df()
materials.head()
              SKU        Lot Number  Expiry               Amount  Verifier Initials  Performer Initials
Glucose       7438-1     NUQ25Z      2022-09-01T00:00:00  50g     BUC                JUJ
DMEM Media    8549-YR    HKOJJ5L     2021-08-23T00:00:00  30L     BUC                JUJ
BSA           54894-d64  IITSTT2B    2022-09-19T00:00:00  1g      BUC                JUJ
Trypsin/EDTA  543543     MJ7X23      2021-12-20T00:00:00  50 mL   BUC                JUJ
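
As a starting point for custom analysis, the sketch below cleans two columns of this table with plain pandas. It runs on a stand-in dataframe copied from the output above, and it makes several assumptions: the material names are treated as the index, Expiry and Amount are treated as strings, and the cutoff date is arbitrary. The dataframe returned by get_materials_df() may be structured differently.

```python
import pandas as pd

# Stand-in for doc.get_materials_df(), copied from the output shown above.
# Assumption: the material names form the index.
materials = pd.DataFrame(
    {
        "SKU": ["7438-1", "8549-YR", "54894-d64", "543543"],
        "Lot Number": ["NUQ25Z", "HKOJJ5L", "IITSTT2B", "MJ7X23"],
        "Expiry": [
            "2022-09-01T00:00:00",
            "2021-08-23T00:00:00",
            "2022-09-19T00:00:00",
            "2021-12-20T00:00:00",
        ],
        "Amount": ["50g", "30L", "1g", "50 mL"],
    },
    index=["Glucose", "DMEM Media", "BSA", "Trypsin/EDTA"],
)

# Convert the ISO-formatted expiry strings to timestamps.
materials["Expiry"] = pd.to_datetime(materials["Expiry"])

# Split amounts such as "50 mL" into a numeric quantity and a unit.
parts = materials["Amount"].str.extract(
    r"^\s*(?P<Quantity>[\d.]+)\s*(?P<Unit>[A-Za-z]+)\s*$"
)
materials["Quantity"] = parts["Quantity"].astype(float)
materials["Unit"] = parts["Unit"]

# Materials that expire before an arbitrary cutoff date.
cutoff = pd.Timestamp("2022-01-01")
expiring = materials[materials["Expiry"] < cutoff]
print(expiring[["Lot Number", "Expiry"]])
```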

Next, you can try some statistical process control analytics using this data.