Learn how to create a task that extracts data from a PDF and writes it to an Excel file. Pat Cameron, Director of Automation, shows you two different ways to use Automate for PDF data extraction. In this video you’ll learn how to:
- Use the PDF Action to extract data from the PDF file into a text file and write the data out to Excel
- Use the OCR Action to grab data from specific fields in a PDF to write out to Excel
Watch this chapter now to learn how to get started.
Pat: Let's look at two examples of managing PDF files with Automate. The first one uses the PDF action to extract the data from this PDF file. So, here we have an invoice and what we're going to do is we're gonna extract all the data in that file to a text file, [00:00:30] loop through it, and write the data out to Excel. So, here's our PDF action that we're going to use. In the task that we have built here, we've got a number of variables at the top, file locations, et cetera, we've got a variable for data line, current row that we're going to be incrementing and then also data line count. In this example, we only read the first 30 lines of data, so we're gonna use that when we're parsing out the text file. [00:01:00] The next action then is our PDF action. And as you can see over on the left, we're gonna use the extract action. And we're just gonna extract everything in that PDF to a text file and any pages. We only have one page, but all the pages. You can see I have a lot of options there.
The next thing we'll do then is we'll create an Excel file to write this data to. So, we're just gonna use the open document, create a new spreadsheet, and then we're going [00:01:30] to output the column headings to that spreadsheet. So, we're gonna use the open document spreadsheet and we're just going to use a list that we have down below here, the list value, and write those column headings out, starting up in the upper left column. And we're gonna use a pipe as the delimiter. So, we can write that text right out to that spreadsheet and it will use the delimiter for each of the column headings. We don't wanna use a comma because we have some commas in [00:02:00] our text file. And then we increment that very, that row by one. So, we write the headings, we go down to row number two. Now, what we're gonna do is start looping through that text file that we've created and grabbing the data and writing it out to the spreadsheet. So, we're going to populate the variable of data line and as we loop through, it's gonna increment that line.
So, the first thing that we'll do is we're gonna look for a [00:02:30] data line that has, at position 59, the decimal point and that's where the dollars are. So, it's just one position starting at 59. And now we're going to parse out the rest of that data and only get the data, not get the headings and things like that. So again, we've built a list and this is using the trim function in the text to read through [00:03:00] that text file line by line. So, data line one, five positions, and then a pipe, again for the delimiter, starting at position seven and going out eight positions, we're gonna take that data and write it out to that spreadsheet. So, we've just built what that text file looks like. Now, in our example ... And then we're gonna increment the data line count by one and we're just gonna get the top 30 lines [00:03:30] and we just do this in our demo so you could eliminate this entire little if ... Loop here. When we hit 30 lines, then we're going to stop. We've written the data out and then we're gonna close the document and then I'm just gonna open the PDF, the text and the Excel so that you can see what those files look like.
So, let's go ahead and run this task and you'll see that it runs very quickly. That extract goes [00:04:00] so fast, creates the text file and now we're looping through that text file and writing the data out to that spreadsheet, so it runs very quickly. If that's something you're doing manually, I'm sure it takes you a lot more time than that. So now what we have here is the original PDF file. There's our text file and then we've taken that text file and written that data out to Excel. Now, I have another example that uses the OCR function, [00:04:30] because what I have here is a loan application and I just wanna grab certain fields. I want the amount and the interest rate. I'm gonna take the address and then I'm gonna get the customer's name and I think maybe a social security number. We did this one for a customer that was reading through all of these loan applications manually and writing that data to a spreadsheet.
So, in this case, instead of using the PDF action, we're going to use the OCR action. [00:05:00] So, building the task kind of the same way as all of our tasks. We declare all our variables. Up at the top here, we've got row numbers when we're writing out to Excel that we wanna be able to increment as well. And then we have file locations, et cetera. And then down below here you can see we've got a variable for the social security number, the name, the address. So each of those values that we're pulling in from the PDF, we've created a variable. [00:05:30] So, we're gonna store the value from the PDF file and then write it out to a spreadsheet. So, here we create a new spreadsheet. In this case I am creating it from a template that already has the headings on that spreadsheet and I'm using Excel instead of open document. So, you can see I got lots of different ways that you can do it.
So, with the OCR action you can get text or just certain words or getting ... Or lines. We're going to get texts from just certain regions of the file. [00:06:00] So, here we've got the name of the file and if we do that click or pick ... Click on pick region, it will show us a little thumbnail of the file and then it's gonna highlight the field that we want to be able to capture. I'll make this a little bit bigger so you can see it better and we're gonna go down to that borrower's name and we're just gonna use our mouse and drag it across that field. Now, I'm gonna get more than just the customer name there because I might have a long name that fills [00:06:30] up the field. Over on the right hand side, that preview's gonna show you the data that you have in there. And then it will show you the region or the zone that it captured. Click on that. We'll save that. We're going to take that name and we're gonna populate the variable, Var name.
And then in the next step we're just gonna go ahead and write that text out to the spreadsheet. So, for a data source, it's just a single value. [00:07:00] The new ... And it's ... And the value is that Var names, so you just select the value that you wanna use from your list of variables, Var name is what we're using. We're gonna start up in the upper left corner of the spreadsheets. So, row one, row two, actually, column one, because we've got our headers and then we just go through and do the same thing for each of those fields that we wanna capture. So, here we've got the address. Same thing. Find the region on that field, using that little thumbnail [00:07:30] and then write it out to the spreadsheet using that same value and we're gonna write this in column two. Go ahead and get the next value, write it out, and the next and the next. So, it's just a matter of what would you do manually and we can automate that. So now when this runs, it's gonna create that spreadsheet for us, and now it's gonna go back and look at the PDF file, grab the name, write it out to the spreadsheet, get the address, social [00:08:00] security number, amount and rate. So hopefully that's kind of a quick way to be able to process any PDF files that you might need to do.