Data Mine Emails for Job Titles, Phone Numbers, and Addresses

Share on facebook
Share on twitter
Share on linkedin
data-mining-emails (2)

Information is power, especially when it comes to logging information into a Customer Relationship Management Software. The hard part is ensuring you’re logging that information reliably. When it comes to logging information into a CRM, sales teams may have other priorities like closing deals and hitting quota.

In this article, we’re going to show a quick example of how you could mine your sales emails for phone numbers. While your CRM or mail provider may change, the extraction process will not.

The end goal will be a process that can be deployed to read new engagements from HubSpot ( our CRM of choice ) and auto log Street Address, Phone Number, and Job Title onto any contact that is missing this data.

Extract Emails from Your CRM

Before we begin, we need to have a way to pull all of the emails from your CRM or wherever you are keeping track of your emails to prospects.

def get_all_engagements():
    apikey = 'INSERT_YOUR_KEY_HERE'
    limit = 250
    parameter_dict = {'hapikey': apikey, 'limit': limit}

    headers = {}
    has_more = True
    engagement_url = 'https://api.hubapi.com/engagements/v1/engagements/paged?'
    all_engagement_data = pd.DataFrame()

    specific_properties = ['id', 'createdAt', 'createdBy', 'ownerId', 'type', 'timestamp', 'contactIds', 'companyIds', 'dealIds', 'ownerIds', 'ticketIds', 'raw', 'email', 'to', 'cc', 'bcc', 'subject', 'html', 'text', 'startTime', 'endTime', 'body', 'disposition', 'postSendStatus']
    while has_more:
        parameters = urllib.parse.urlencode(parameter_dict)
        get_url = engagement_url + parameters

        response = requests.get(url=get_url, headers=headers)
        response_dict = response.json()
        pprint.pprint(response_dict)
        engagement_df = json_normalize(response_dict['results'])
        parameter_dict['offset'] = response_dict['offset']

        has_more = response_dict['hasMore']

        engagement_df.columns = engagement_df.columns.str.replace('engagement.', '')
        engagement_df.columns = engagement_df.columns.str.replace('.value', '')
        engagement_df.columns = engagement_df.columns.str.replace('associations.', '')
        engagement_df.columns = engagement_df.columns.str.replace('metadata.', '')
        engagement_df.columns = engagement_df.columns.str.replace('from.', '')

        print(engagement_df.columns)
        engagement_df = engagement_df.loc[:,~engagement_df.columns.duplicated()]
        print(engagement_df)
        try:
            engagement_df = engagement_df.filter(specific_properties)

            frames = [all_engagement_data, engagement_df]
            all_engagement_data = pd.concat(frames, axis=0, ignore_index=True, sort=False).reset_index(drop=True)
           # break
        except ValueError:
            print("Value Error")
    
    return all_engagement_data

Clean Your Emails of Threads

Next, we need to remove any email threads so that this email is only associated with the sender and does not include your sales rep’s email footer.

def clean_email_text(self, email_text):
email_text = str(email_text)
thread_start = email_text.find('>')
if thread_start != -1:
email_text = email_text[:thread_start]
return email_text
engagements = get_all_engagements()
incoming_emails = incoming_emails = engagements[engagements['type'] == 'INCOMING_EMAIL'].reset_index(drop=True)
incoming_emails['text'] = incoming_emails['text'].apply(clean_email_text)

Find Phone Numbers in Email

We will be using regexes to identify phone numbers. They are the most reliable way to identify phone numbers in a ton of text. A regex is a regular expression. It is a way to match multiple variations of text as one. They’re very powerful for text recognition and identifying emails and phone numbers. I’d recommend looking at https://ihateregex.io/ for a cheat sheet.

def find_phone(email_text):
    phones = re.findall('(?:\+ *)?\d[\d\- ]{7,}\d', email_text)
    # remove duplicates using a dictionary
    phones = list(dict.fromkeys(phones))
    # check that we aren't returning an empty list
    if len(phones) == 0:
        return ''
    return phones

Find Job Title in Email

While researching for this article, I found a great library for finding job titles within text. I was intending to write a function for this; however, at the core of this blog is not reinventing the wheel so we can focus on real applications.

We will be using the “find-job-titles” library which can be found here. We will import the library using pip install.

pip install find-job-titles

Next, we will need to import it into our script.

from find_job_titles import finder

Finally, we will write a quick function that takes in the text from an email and returns the job title in the future, if it exists.

We will pass in the email and then use the findall function from our library. As shown in the documentation, it will return a list of Tuples ordered by the likelihood that the job title is in the text. This is why we check if there are any results and then get the first result in the list and the title that is stored in the first part of the tuple.

def find_job_title(email):
    results = finder.findall(email)
    if len(results) == 0:
        return ''
    else:
        return results[0][0]

Find Street Address in Email

Street address may be the most difficult part of this script as there are so many variants to find street address. We will attempt to use a Regular Expression for the sake of this article based on this source code.

We will look for any street address followed by city, state, and zipcode. You can get more granular with this, but that is past the scope of this article.

def find_address(email):
    zipcode_regex = '\b\d{5}(?:-\d{4})?\b'
    city_regex = '(?:[A-Z][a-z.-]+[ ]?)+'
    state_regex = 'Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico|New[ ]York|North[ ]Carolina|North[ ]Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island|South[ ]Carolina|South[ ]Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia|Wisconsin|Wyoming'
    state_abrev_regex = 'AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY'
    street_regex = '\d+[ ](?:[A-Za-z0-9.-]+[ ]?)+(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\.?'
    city_state_zip_regex = city_regex + ',[ ](?:' + state_regex + '|' + state_abrev_regex + '[ ]' + zipcode_regex
    street_address_regex = street_regex + '\s' + city_state_zip_regex
    street_address = re.findall(string=email, pattern=street_address_regex)
    return street_address

Update the Contact with Data

We want to ensure that we are only writing data that does not already exist. We will want to check the contact to make sure it doesn’t already have a phone number, job title, or street address so we aren’t replacing correct data with data obtained from email.

First, we need to write a function to update a contact. Luckily, HubSpot has an endpoint that allows us to update a contact with an email.

import json
import requests
apikey = 'YOUR_API_KEY'
def update_contact(email, address, jobtitle, phone):
    url ='https://api.hubapi.com/contacts/v1/contact/email/' + email + '/profile?hapikey=' + apikey
    headers = {'Content-Type': 'application/json', 'Accept-All': 'application/json'}
    data = json.dumps({
        "properties": [
        {
            "property": "address",
            "value": address
        },
        {
            "property": "jobtitle",
            "value": jobtitle
        },
        {
            "property": "phone",
            "value": phone,
        }
    })

Next, we need to pull our engagement data and use the methods from above to find any data that is identifiable.

incoming_emails['phone'] = incoming_emails['text'].apply(find_phone)
incoming_emails['address'] = incoming_emails['text'].apply(find_address)
incoming_emails['jobtitle'] = incoming_emails['text'].apply(find_job_title)
incoming_emails = incoming_emails.fillna('')
incoming_emails = incoming_emails.filter(['email', 'phone', 'address', 'jobtitle'])

We have now used all of our functions to find the fields that can be identified with our Regular expressions. Next, we need to run through every single row and update the CRM.

incoming_emails = incoming_emails.filter(['email', 'phone', 'address', 'jobtitle'])
for index, row in incoming_emails.iterrows():
    update_contact(email=row['email'], address=row['address'], phone=row['phone'], jobtitle=row['jobtitle'])

Now, you have officially added more data into your CRM from all the email footers that normally go unnoticed by your sales reps. This can help if you want to retarget contacts by phone numbers on Facebook, want to send gifts through mail, or want to run an analysis on what job titles occur the most in your CRM.

References

https://www.codeproject.com/Tips/989012/Validate-and-Find-Addresses-with-RegEx