The idea of scraping Yelp is exciting because you can find local businesses that are looking to market themselves, along with their phone numbers, en masse. You are probably asking, “How do I scrape Yelp without getting blacklisted?”
Luckily, Yelp has an Open API that you can use for whatever purposes you need.
Simply apply for an API key here and we’ll get started on your journey to scrape Yelp.
For your first step, you want to go to the Yelp Developer webpage.
Then, you want to click the Yelp Fusion link. Yelp Fusion is a REST API that allows you to pull information from Yelp. Imagine a REST API as a script that is always running on the web. When you send it a request, it processes the data in that request and sends back a response.
Click “Get Started” and create a new account through your Google account.
You will need to submit a message about what you plan to use your new application for. Simply put in that you’re looking to enrich the businesses in your database with data from Yelp.
At the top of the approval page will be a Client Secret and API Key. You will only need the API key for your Python script.
Next, you will want to make a request to the Yelp API. You can find all of the documentation here. In this tutorial, we will only go over searching for businesses in a specific area.
import requests
import pandas as pd

search_api_url = 'https://api.yelp.com/v3/businesses/search'
yelp_key = ''  # Your API Key
city = 'Phoenix'
headers = {'Authorization': 'bearer %s' % yelp_key}
num = 0
searchTerm = 'Restaurant'
params = {'term': searchTerm, 'limit': 50, 'offset': num, 'location': city}
response = requests.get(url=search_api_url, params=params, headers=headers)
response = response.json()
print(response)
Let’s talk about how this code is able to scrape Yelp. After the two import lines, we declare a new variable called “search_api_url”, which holds the URL of the REST API for Yelp’s business search.
That means this is the URL that is waiting for you to send it data so it can process that data and send you back results.
Next, we declare your API key as a variable called “yelp_key”. I left it empty because you shouldn’t share your API keys. Simply assign your API key to “yelp_key” as a string, which means that you want to put quotes around the key.
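If you would rather not paste the key into the script at all, one option is to read it from an environment variable. The sketch below assumes a variable named `YELP_API_KEY` and a helper called `build_headers`; both names are made up for this example and are not something Yelp requires.

```python
import os


def build_headers(api_key):
    """Return the Authorization header dictionary for a Yelp Fusion request."""
    if not api_key:
        raise ValueError("Missing API key; set the YELP_API_KEY environment variable.")
    return {"Authorization": "Bearer %s" % api_key}


# YELP_API_KEY is a hypothetical variable name; an empty string is used if it is unset.
yelp_key = os.environ.get("YELP_API_KEY", "")
```

Calling `build_headers(yelp_key)` then gives you the same kind of dictionary the tutorial stores in “headers”, and it fails loudly if the key was never set.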
We want to make a couple more variables. Create a variable called “city” and store the city you want to search. I passed in “Phoenix”.
Then, we want to create a dictionary called “headers” where we will store our API key under the “Authorization” header. We also will create a variable called “num” set at 0, which we will pass into offset. The offset is the number of results to skip, so it acts like a page marker: with a limit of 50, an offset of 0 returns the first 50 results and an offset of 50 returns the next 50. We will also create a variable called “searchTerm” that will tell Yelp what we are looking for.
We will store all of these variables in a dictionary called “params”. We will store our searchTerm as “term”, limit set at 50, offset set to num, and location set to city.
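To fetch more than one page, the offset has to grow by the limit on each request. Here is a minimal sketch of that idea. The helper `page_params` and its `max_results` argument are hypothetical names for this example, and the value you pick for `max_results` is up to you; check the Yelp documentation for how deep the API lets you page.

```python
def page_params(term, location, max_results, limit=50):
    """Yield one params dictionary per page of search results."""
    for offset in range(0, max_results, limit):
        yield {
            "term": term,
            # Ask for a smaller final page so we never request more than max_results.
            "limit": min(limit, max_results - offset),
            "offset": offset,
            "location": location,
        }


pages = list(page_params("Restaurant", "Phoenix", max_results=120))
print([(p["offset"], p["limit"]) for p in pages])
# prints [(0, 50), (50, 50), (100, 20)]
```

Each dictionary it yields has the same shape as “params”, so it can be passed to the request one page at a time.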
Next, we will use the requests library to run the “get” method. We will pass in our search_api_url as the “url” parameter, our params as the “params” parameter, and our headers as the “headers” parameter.
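It can help to see what the “params” dictionary turns into. The sketch below builds the query string by hand with Python's standard library, only to illustrate the idea; requests does an equivalent encoding for you, so you would not normally write this yourself.

```python
from urllib.parse import urlencode

search_api_url = "https://api.yelp.com/v3/businesses/search"
params = {"term": "Restaurant", "limit": 50, "offset": 0, "location": "Phoenix"}

# Join the endpoint and the encoded parameters the way an HTTP client would.
full_url = search_api_url + "?" + urlencode(params)
print(full_url)
```

The headers are not part of this URL. They travel separately with the request, which is why the API key never shows up in the address.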
Finally, we will store the result of the “get” function into a variable called “response”. The body of that result is JSON text, so we then parse it into a Python dictionary by storing response.json() back into the response variable.
Print your response and see the data you got from Yelp.
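Before working with the data, it is worth checking that the request actually succeeded. The sketch below assumes that a failed request comes back with a top-level “error” object holding “code” and “description” fields; treat that shape as an assumption and confirm it against the Yelp documentation. The helper `get_businesses` and the sample payload are made up for this example, and the sample contains only the fields this tutorial uses.

```python
def get_businesses(payload):
    """Return the list of businesses from a parsed search response.

    Raises RuntimeError if the payload looks like an error response.
    """
    if "error" in payload:
        error = payload["error"]
        # Assumed error shape: {"error": {"code": ..., "description": ...}}
        raise RuntimeError("%s: %s" % (error.get("code"), error.get("description")))
    return payload.get("businesses", [])


# Made-up sample payload, not real Yelp data.
sample = {
    "businesses": [
        {
            "name": "Example Diner",
            "phone": "+15555550100",
            "location": {
                "address1": "1 Main St",
                "city": "Phoenix",
                "state": "AZ",
                "zip_code": "85001",
            },
        }
    ]
}

for biz in get_businesses(sample):
    print(biz["name"], biz["location"]["city"])
```

With the real response you would call `get_businesses(response)` in the same way.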
Working with data in JSON (JavaScript Object Notation) isn’t very easy for someone new to programming. I always recommend storing your data as a pandas DataFrame. DataFrames are superpowered spreadsheets.
First, you want to create a new DataFrame called “yelp_data” that holds one row per business.
columns = ['Company Name', 'Street Address', 'City', 'State',
           'Zip Code', 'Phone Number', 'Vertical']
rows = []

# A failed request has no 'businesses' key, so fall back to an empty list.
for biz in response.get('businesses', []):
    # Some businesses are missing fields; .get() returns None instead of raising KeyError.
    location = biz.get('location') or {}
    rows.append({
        'Company Name': biz.get('name'),
        'Street Address': location.get('address1'),
        'City': location.get('city'),
        'State': location.get('state'),
        'Zip Code': location.get('zip_code'),
        'Phone Number': biz.get('phone'),
        'Vertical': searchTerm,
    })

# Build the DataFrame once from all rows instead of concatenating row by row.
yelp_data = pd.DataFrame(rows, columns=columns)
print(yelp_data)
Now that you have this data inside a DataFrame, you want to be able to export that data so you can put it into a Google Sheet or Excel document.
Lucky for us, it is very easy to export a DataFrame into a CSV that we can import to Google Sheets.
We will call the “to_csv” method and pass in the name of the file we want our CSV to be saved as, along with whether we want to keep the index. Passing index equal to False makes sure that the row number is not saved in the CSV. It cleans everything up and makes the CSV easier to read.
yelp_data.to_csv('file_output.csv', index=False)
If you would like to learn how to export a DataFrame directly into a Google Sheet, you can read this blog.
Now that you have your Yelp data, you can start using it in your Python scripts.
Copyright 2021 Salestream LLC