PDFs store an incredible amount of information about companies, including email addresses and phone numbers; however, they’re notoriously difficult to extract information from.
Follow along with this brief tutorial on how to scrape PDFs for lead information so that you can bring tons of quality information to your sales and marketing teams.
We’re going to use a PDF of warehouse contact information as our example.
Depending on our PDF, there are multiple ways that we can scrape it. In this case, we’re going to do it the easiest way.
For this specific PDF, it is easiest to select all of the text with Control+A and then paste it into a new .txt file called “plaintext”.
This gets all the text out of the file. Next, we need to pull the information out of it in a way that keeps every piece of data associated with the correct company.
We can read the document’s text into our Python script by calling the open function and then calling .read() on the file object it returns. We will store the result in a variable called text so that we can use it later.
text = open("/Users/kevinmead/PycharmProjects/opendock/plaintext.txt", "r").read()
By looking at the plain text, we can identify a couple of patterns in the data. Almost every line that has “CONTACT” in it is the last line in its group. We will use this to break the text apart into the rows of our CSV.
We want to create an array that we can iterate through with a for loop. In order to do this, we will split the text by every single new line in the TXT file. This will allow us to go line by line with a for loop.
Normally, this would make it difficult to keep track of which information is associated with each company, but using the keyword CONTACT as the last line will solve this for us.
We will call the split() method on our text string and pass in the string ‘\n’. The backslash n is the escape sequence for a new line. The call returns a list containing every line in the file.
text = text.split('\n')
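To make the split step concrete, here is a minimal sketch run on a short made-up sample rather than the real file. The sample text is an assumption for illustration only. It also shows splitlines(), a built-in alternative that may be worth considering if the text file was saved with Windows-style line endings.

```python
# Hypothetical sample standing in for the contents of plaintext.txt.
sample = (
    "Acme Warehousing\n"
    "12 Dock St, Reno, NV\n"
    "(555) 010-0000 www.acme.com CONTACT: Jo Smith"
)

# Splitting on '\n' gives one list entry per line.
lines = sample.split('\n')
print(lines)

# splitlines() gives the same result and also copes with '\r\n' endings.
windows_sample = sample.replace('\n', '\r\n')
windows_lines = windows_sample.splitlines()
print(windows_lines == lines)
```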
Next, we want to reorient all of these lines into their own groups. This will turn 3,780 lines into 484 companies.
We will create an empty string variable called “group” at the top of this section and an empty array of strings called “groups”. As we iterate through every single line in the text, we will add the line plus a new line to our “group” string. We will use the new line so we can make sure we’re keeping the separation in case we need it.
Using an if statement, we will check whether the keyword “CONTACT” appears in the line with the find method. The find method searches a string for a substring and returns the position of the first match. If there is no match, it returns -1.
If it does not return a -1, we know the keyword “CONTACT” is in the line and this is the last line. We will add this line to our group, add the group to our “groups” array, and finally clear the temporary group variable so that this data doesn’t get added to a new row.
group = ''
groups = []
for line in text:
    if line.find('CONTACT') != -1:
        # Last line of a company: finish the group and start a new one
        group += line
        groups.append(group)
        group = ''
    else:
        group += line + '\n'
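The grouping step can also be wrapped in a small function so it is easy to try on sample data. This is only a sketch: the function name group_lines, the marker parameter, and the sample lines are assumptions made for illustration, not part of the original script.

```python
def group_lines(lines, marker="CONTACT"):
    """Collect consecutive lines into groups, closing a group at each marker line."""
    groups = []
    current = []
    for line in lines:
        current.append(line)
        if marker in line:
            # The marker line is the last line of a company's block.
            groups.append("\n".join(current))
            current = []
    # Any trailing lines without a marker are dropped, as in the loop above.
    return groups


# Hypothetical sample data in the same shape as the PDF text.
sample_lines = [
    "Acme Warehousing",
    "12 Dock St, Reno, NV",
    "(555) 010-0000 www.acme.com CONTACT: Jo Smith",
    "Bolt Storage",
    "9 Pier Rd, Erie, PA",
    "(555) 010-1111 www.boltstorage.net CONTACT: Al Reed",
]
sample_groups = group_lines(sample_lines)
print(len(sample_groups))
print(sample_groups[0])
```

Each entry in sample_groups is one company's block as a single string, with its lines still separated by newline characters.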
Next, we will need to create a DataFrame where we can store our PDF data. We will declare the columns as “Company Name”, “Location”, “Phone”, “Website”, and “Contact”.
We will then create a DataFrame called “warehouse_data”. This will act as the place we will store each row of data on these companies. We will export this DataFrame when we finish iterating through each group.
import pandas as pd

columns = ['Company Name', 'Location', 'Phone', 'Website', 'Contact']
warehouse_data = pd.DataFrame(columns=columns)
Luckily, the text in this PDF follows a pretty predictable pattern.
We need to iterate through every group to create a new row for each company, so all of the code in this section sits inside a single for loop. The full for loop is at the bottom of this section.
for group in groups:
The first line of each group is the company name. The second line is the address. The third line will be the most difficult since we need to split it up by phone number, website, and contact.
Luckily, a phone number, website, and the contact have easily identifiable separation points. For the phone number, the left open parenthesis signifies the start of it. For the website, www. signifies the start of it. Finally, for the contact, “CONTACT” signifies the start of it.
We will iterate through each group, create a new DataFrame, and store that information in there. At the end of each group, we will append our new row to the warehouse_data DataFrame.
Since each group is a single string with new line characters separating its lines, we can use the “find()” method to find the position of a specific element.
We will get the company name by finding the first new line character and taking the text from the start of the string up to that position.
We can do that by slicing the string like an array: leaving the start empty before the colon means “from the beginning”, and our variable “first_new_line” marks the end. We will then store the result in a Series object in the Company Name column of our new row DataFrame.
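# New one-row DataFrame for this company (created at the top of the loop)
individual_row = pd.DataFrame(columns=columns)
first_new_line = group.find('\n')
individual_row['Company Name'] = pd.Series(group[:first_new_line])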
Next, we need to find the position where the phone number starts on the third line. We will use the find method again and pass in the left parenthesis. This will give us the start of the phone number. Since the location sits between the company name and the phone number, we can take the part of the string between first_new_line and phone.
phone = group.find('(')
individual_row['Location'] = pd.Series(group[first_new_line:phone])
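To see what these two slices return, here is a small standalone sketch on a made-up group string. The sample data is hypothetical, and the strip() call is an addition that is not in the original script: it removes the new line characters that surround the address in the raw slice.

```python
# Hypothetical group in the same three-line shape as the PDF text.
group = (
    "Acme Warehousing\n"
    "12 Dock St, Reno, NV\n"
    "(555) 010-0000 www.acme.com CONTACT: Jo Smith"
)

first_new_line = group.find('\n')  # end of the company name
phone = group.find('(')            # start of the phone number

company = group[:first_new_line]
raw_location = group[first_new_line:phone]
location = raw_location.strip()    # assumption: trim the surrounding newlines

print(repr(company))
print(repr(raw_location))
print(repr(location))
```

This approach relies on the first left parenthesis in the group belonging to the phone number, so a company name or address containing a parenthesis would need extra handling.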
Next, we want to find the website. Luckily, it will start with www every time. This makes it easy to find the start of the website. This position can be used for us to find the end of the phone number as well.
We will find the point where the domain starts by finding the position of the substring “www”. We will then find the end by finding “.com”. If find returns -1, the domain does not end in .com.
In that case, we’ll look for “.net” instead. If you wanted to make a foolproof script, you would add a list of all the possible domain endings and go through each one in order of how often it is used.
This script is meant to save time, so we aren’t going to overengineer it.
domain = group.find('www')
com = group.find('.com')
if com == -1:
    com = group.find('.net')
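The idea of checking a list of domain endings in order of popularity could be sketched as follows. The function name find_domain_end, the choice of endings, and their order are all assumptions for illustration; adjust the list to whatever actually appears in your PDF.

```python
def find_domain_end(text, start, suffixes=(".com", ".net", ".org")):
    """Return the index just past the first suffix found after start, or -1.

    Suffixes are tried in the order given, so list the most common first.
    """
    for suffix in suffixes:
        pos = text.find(suffix, start)
        if pos != -1:
            return pos + len(suffix)
    return -1


# Hypothetical third line of a group.
line = "(555) 010-1111 www.boltstorage.net CONTACT: Al Reed"
start = line.find("www")
end = find_domain_end(line, start)
website = line[start:end]
print(website)
```

Because this is still plain substring matching, it can be fooled by a domain whose name contains one of the endings, so it is worth spot-checking the output.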
There are a couple extra pieces of data in this PDF that we want to watch out for so that we make sure everything goes in the right place. In particular, some companies have a Fax number listed and some have Industries listed.
If we don’t pay attention to these, the Fax number will get logged into the Phone Number and the Industries will come up in the Contact Name.
To deal with the Fax number, we’re going to call upper on the text and see if we can find “FAX:”. If it exists, find returns a position rather than -1. If it returns -1, there is no Fax listed and we can operate as normal by taking the substring between the left parenthesis and the start of the domain.
If the find method returns a position, there is a fax number and we need to work around it. We will take the string between the phone parenthesis and the fax position as the Phone and put that in the appropriate spreadsheet column.
We will then take the string between the Fax position and the www of the domain as the Fax number. To make this data useful for a CRM, we will replace the string “Fax:” with an empty string to remove the label. Note that replace is case-sensitive, so only the exact spelling “Fax:” is removed.
fax = group.upper().find('FAX:')
if fax != -1:
    # A fax number sits between the phone number and the website
    individual_row['Phone'] = pd.Series(group[phone:fax])
    individual_row['Fax'] = pd.Series(group[fax:domain].replace('Fax:', ''))
else:
    individual_row['Phone'] = pd.Series(group[phone:domain])
    individual_row['Fax'] = ''
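The phone and fax logic can be tried in isolation with a helper like the one below. This is a sketch, not the original script: the function name split_phone_fax and the sample strings are hypothetical, and instead of replace it skips past the label by length, which works whatever the capitalization of “Fax:” is.

```python
def split_phone_fax(group, phone, domain):
    """Split group[phone:domain] into (phone_number, fax_number)."""
    label = "FAX:"
    # upper() keeps string positions unchanged for plain ASCII text.
    fax = group.upper().find(label, phone, domain)
    if fax == -1:
        return group[phone:domain].strip(), ""
    phone_number = group[phone:fax].strip()
    fax_number = group[fax + len(label):domain].strip()
    return phone_number, fax_number


# Hypothetical groups, one with a fax number and one without.
with_fax = "Acme\n1 Dock St\n(555) 010-0000 Fax: (555) 010-0001 www.acme.com CONTACT: Jo"
without_fax = "Bolt\n9 Pier Rd\n(555) 010-1111 www.bolt.net CONTACT: Al"

result_with = split_phone_fax(with_fax, with_fax.find('('), with_fax.find('www'))
result_without = split_phone_fax(without_fax, without_fax.find('('), without_fax.find('www'))
print(result_with)
print(result_without)
```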
To finish grabbing the domain, we will store the string between the www and the end of .com or .net. We add 4 to com because find returns the position where the match starts, and both “.com” and “.net” are four characters long. Without it, the slice would cut off the end of the domain.
individual_row['Website'] = pd.Series(group[domain:com+4])
Next, we need to watch out for the “INDUSTRIES:” keyword. We will check if it exists in our line for the company by using the find method.
If it does not return -1, the INDUSTRIES keyword is in this company’s data. We will put the string between the end of the domain and this position into the Contact column. We will also replace the “CONTACT:” keyword with an empty string to make our data easier to use.
Finally, we will take all the remaining data from the industries tag’s position to the end of the string as the industries they serve. We will then replace that keyword with an empty string as well. Since this PDF has a lot of empty space, we will call the strip method to remove any whitespace at the start or end of the string. This makes it easier to read.
If the INDUSTRIES keyword doesn’t exist, we can simply take everything from the end of the domain to the end of the string, which is what a slice with nothing after the colon gives us.
industries = group.find('INDUSTRIES')
if industries != -1:
    individual_row['Contact'] = pd.Series(group[com+4:industries].replace('CONTACT:', '').strip())
    individual_row['Industries Served'] = pd.Series(group[industries:].replace('INDUSTRIES SERVED:', '').strip())
else:
    individual_row['Contact'] = pd.Series(group[com + 4:].replace('CONTACT:', '').strip())
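Here is the same contact and industries split as a small standalone sketch. The function name split_contact_industries and the sample strings are assumptions for illustration; the input is whatever text follows the website in a group.

```python
def split_contact_industries(rest):
    """Split the text after the website into (contact, industries_served)."""
    industries = rest.find("INDUSTRIES")
    if industries == -1:
        contact_part, industries_part = rest, ""
    else:
        contact_part = rest[:industries]
        industries_part = rest[industries:]
    # Drop the labels and trim the extra whitespace left behind.
    contact = contact_part.replace("CONTACT:", "").strip()
    served = industries_part.replace("INDUSTRIES SERVED:", "").strip()
    return contact, served


# Hypothetical tails of a group's last line.
both = split_contact_industries(" CONTACT: Jo Smith INDUSTRIES SERVED: Food, Retail")
contact_only = split_contact_industries(" CONTACT: Al Reed")
print(both)
print(contact_only)
```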
Our final step at the bottom of our for loop will be to add our new row with the data from this group to the DataFrame. We will do this using the “concat” method.
To concatenate two DataFrames, you create a list of the two DataFrames, which we call frames. Then you call the pd.concat() function, passing in frames and two other parameters.
The sort and ignore_index parameters change how the frames are concatenated: sort=False leaves the column order alone, and ignore_index=True renumbers the rows. They are a preference of mine and are not strictly necessary.
frames = [warehouse_data, individual_row]
warehouse_data = pd.concat(frames, sort=False, ignore_index=True)
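A short sketch on made-up data shows what these two parameters do in practice. The company names and numbers below are hypothetical. The second frame has an extra Fax column, which is the situation our loop creates.

```python
import pandas as pd

# Hypothetical existing data and one new row with an extra column.
warehouse_data = pd.DataFrame(
    {"Company Name": ["Acme Warehousing"], "Phone": ["(555) 010-0000"]}
)
individual_row = pd.DataFrame(
    {"Company Name": ["Bolt Storage"], "Phone": ["(555) 010-1111"], "Fax": ["(555) 010-2222"]}
)

frames = [warehouse_data, individual_row]
combined = pd.concat(frames, sort=False, ignore_index=True)

print(combined.index.tolist())    # rows are renumbered from 0
print(combined.columns.tolist())  # new columns are added at the end
print(combined)
```

Rows that had no value for a column added later, such as Fax for the first company here, are filled in as missing values.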
You can find the full for loop here.
for group in groups:
    individual_row = pd.DataFrame(columns=columns)
    first_new_line = group.find('\n')
    phone = group.find('(')
    individual_row['Company Name'] = pd.Series(group[:first_new_line])
    individual_row['Location'] = pd.Series(group[first_new_line:phone])
    domain = group.find('www')
    com = group.find('.com')
    if com == -1:
        com = group.find('.net')
    industries = group.find('INDUSTRIES')
    fax = group.upper().find('FAX:')
    if fax != -1:
        individual_row['Phone'] = pd.Series(group[phone:fax])
        individual_row['Fax'] = pd.Series(group[fax:domain].replace('Fax:', ''))
    else:
        individual_row['Phone'] = pd.Series(group[phone:domain])
        individual_row['Fax'] = ''
    individual_row['Website'] = pd.Series(group[domain:com+4])
    if industries != -1:
        individual_row['Contact'] = pd.Series(group[com+4:industries].replace('CONTACT:', '').strip())
        individual_row['Industries Served'] = pd.Series(group[industries:].replace('INDUSTRIES SERVED:', '').strip())
    else:
        individual_row['Contact'] = pd.Series(group[com + 4:].replace('CONTACT:', '').strip())
    frames = [warehouse_data, individual_row]
    warehouse_data = pd.concat(frames, sort=False, ignore_index=True)
Finally, you will want to export your data as a CSV so that you can import this company data into your CRM for targeting with ads or for salespeople to reach out.
We will do that by calling the “to_csv” method on our warehouse_data DataFrame object, passing in the file name we want and index=False, which leaves out the row numbers.
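The export step might look like the sketch below. The file name warehouse_leads.csv and the sample row are assumptions, and the file is written to a temporary folder only so the example can run anywhere; in your own script you would pass whatever path you want the CSV saved to.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row stand-in for the finished warehouse_data DataFrame.
warehouse_data = pd.DataFrame(
    {"Company Name": ["Acme Warehousing"], "Phone": ["(555) 010-0000"]}
)

# Assumed output location and file name, for illustration only.
out_path = os.path.join(tempfile.mkdtemp(), "warehouse_leads.csv")

# index=False leaves the row numbers out of the file.
warehouse_data.to_csv(out_path, index=False)

with open(out_path) as f:
    csv_text = f.read()
print(csv_text)
```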
At the end of this process, we have spent about 30 minutes writing a script that pulls 484 high-quality warehouse leads out of this PDF. This is much more efficient than if we had sat down with a TV in the background to pull the information manually. It is even more efficient than sending it to Upwork.
If you are able to master pulling information out of PDFs, you can pull thousands of leads quickly for your salespeople and company.