For marketers getting started with scraping, tables are the easiest place to begin: the data is already structured and meant to be processed, so you don’t have to learn much about formatting it. Here’s how to scrape any table online.
We will start by defining a test table and then move on to an example online.
Here is the example we will use:
|Full Name||Email||Phone||Company Name|
|Michael Bluth||firstname.lastname@example.org||422-321-0000||Bluth Company|
|Richard Hendricks||email@example.com||230-501-2301||Pied Piper|
|Gavin Belson||firstname.lastname@example.org||401-251-5282||Hooli|
|Jared Dunn||email@example.com||329-538-6790||Pied Piper|
Now that we have a sample table, let’s look at how it is written in HTML. For those not familiar with it, HTML is the building block of the web, and knowing its structure is what lets us scrape a webpage in Python.
Our sample table is made up of the following HTML:
<table>
  <tr>
    <th>Full Name</th>
    <th>Email</th>
    <th>Phone</th>
    <th>Company Name</th>
  </tr>
  <tr>
    <td>Michael Bluth</td>
    <td>firstname.lastname@example.org</td>
    <td>422-321-0000</td>
    <td>Bluth Company</td>
  </tr>
  <tr>
    <td>Richard Hendricks</td>
    <td>email@example.com</td>
    <td>230-501-2301</td>
    <td>Pied Piper</td>
  </tr>
  <tr>
    <td>Gavin Belson</td>
    <td>firstname.lastname@example.org</td>
    <td>401-251-5282</td>
    <td>Hooli</td>
  </tr>
  <tr>
    <td>Jared Dunn</td>
    <td>email@example.com</td>
    <td>329-538-6790</td>
    <td>Pied Piper</td>
  </tr>
</table>
In order to scrape this table, we need to find the table on the page, go through each row, and pull the information into a DataFrame that we can export.
Before we pull this straight off a website, let’s try it with training wheels. First, create a new HTML file called “test.html” in PyCharm or whichever Python IDE you use. Then copy the HTML of the table above into the file.
We are going to use “codecs”, a module from Python’s standard library, to open the HTML file and read it. Remember for the future: if you can’t scrape a page directly from the website, you can always inspect the page, save the HTML, and scrape it from your local computer using codecs.
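As a rough sketch of that save-and-read-locally idea, the snippet below writes a small HTML fragment to disk and reads it back two ways: with codecs.open, as this article does, and with Python 3's built-in open, which accepts the same encoding. The file name sample_table.html and the fragment itself are placeholders made up for this illustration.

```python
import codecs
import os
import tempfile

# Placeholder HTML standing in for a page you saved from your browser.
sample_html = "<table><tr><th>Full Name</th></tr><tr><td>Michael Bluth</td></tr></table>"

# Work in a temporary folder so the example cleans up after itself.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample_table.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample_html)

    # Option 1: codecs.open, the approach used in this article.
    with codecs.open(path, "r", "utf-8") as handle:
        source_from_codecs = handle.read()

    # Option 2: the built-in open, which also takes an encoding in Python 3.
    with open(path, "r", encoding="utf-8") as handle:
        source_from_open = handle.read()

print(source_from_codecs == source_from_open)
```

Either string can then be handed to the parser in the same way.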
Next, we are going to import Beautiful Soup, an HTML parsing library that is a staple of Python scraping projects. As always, we also want to import pandas for storing and exporting our data.
import codecs  # standard library; used below to read test.html
import bs4 as soup_parser
import pandas as pd
After we import our libraries, we want to open the HTML document and read it through codecs. We want to store the HTML in a variable called “source”.
Next, we want to parse it with Beautiful Soup and store that parser in a variable called “soup”. When creating a Beautiful Soup parser, you pass in the HTML and the string “html.parser”, which tells Beautiful Soup to use Python’s built-in HTML parser.
source = codecs.open('test.html', 'r', 'utf-8').read()
soup = soup_parser.BeautifulSoup(source, 'html.parser')
Now that we have created a BeautifulSoup object, we can use its methods to break the table down row by row.
We want to be smart about how we write this script so that we can use it on any table on the internet. That is why we are going through the header as well and storing data row by row.
We can get any element in this HTML by calling the “find_all” method on the Soup object and passing in the HTML tag. In this case, we want all of the “tr” tags as the table is made up of all of these. We will store the list of rows in a variable called “rows”.
rows = soup.find_all('tr')
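To make find_all a little more concrete, here is a small sketch that parses an inline HTML string (invented for this example, not taken from a real page). It also shows one variation: calling find first to grab a specific table, then calling find_all on that table so rows from other tables on the page are not mixed in. The id values "nav" and "contacts" are assumptions of the example.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; only the second holds the contacts we want.
page = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="contacts">
  <tr><th>Full Name</th><th>Company Name</th></tr>
  <tr><td>Michael Bluth</td><td>Bluth Company</td></tr>
  <tr><td>Gavin Belson</td><td>Hooli</td></tr>
</table>
"""

soup = BeautifulSoup(page, "html.parser")

# find_all on the whole document returns rows from every table.
all_rows = soup.find_all("tr")

# find returns the first match; here we look the table up by its id attribute.
contacts = soup.find("table", id="contacts")
contact_rows = contacts.find_all("tr")

print(len(all_rows), len(contact_rows))
```

On a page with a single table, calling find_all directly on the soup object, as this article does, gives the same rows.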
Next, we want to create an empty DataFrame where we will store all of our information from the table. We will call it “test_data”. We don’t need to pass in any columns as we want this script to work for any table online.
test_data = pd.DataFrame()
Now that we have a place to store this information, we want to create a for loop that iterates through each row and then finds all the columns of that row.
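We will create two counters, “row_iterator” and “column_iterator”, to keep track of which row and column we are on. We also read the header cells (the “th” tags) into a list called “headers”, which gives us the column names.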
row_iterator = 0
# collect the column names from the header cells
headers = soup.find_all('th')
headers = [th.text for th in headers]
for row in rows:
    print(row)  # print the row out to debug
    items = row.find_all('td')  # find each column of the row
    print(items)  # print the columns to debug
    column_iterator = 0  # create a column iterator
    new_row = pd.DataFrame()
    for td in items:
        if row_iterator == 0:  # if it is the first row, store as header
            headers.append(td.text)
        else:  # if not first row, store in correct column
            text = td.text
            new_row[headers[column_iterator]] = pd.Series(text)
        column_iterator += 1
        print(td.text)
    # we want to add each row to our original data
    frames = [test_data, new_row]
    test_data = pd.concat(frames, sort=False).reset_index(drop=True)
    row_iterator += 1
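The loop above can also be packaged as a function. The sketch below is one possible variation, not the script from this article: it gathers each row into a dictionary and builds the DataFrame once at the end instead of concatenating inside the loop. The name table_to_dataframe and the inline HTML are made up for illustration, and the function assumes the page contains a table whose first row holds the headers.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html):
    """Turn the first <table> found in html into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = table.find_all("tr")

    # Assumption: the first row is the header, written with <th> or <td> cells.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]

    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # skip rows that have no data cells
            records.append(dict(zip(headers, cells)))

    return pd.DataFrame(records, columns=headers)


example_html = """
<table>
  <tr><th>Full Name</th><th>Company Name</th></tr>
  <tr><td>Richard Hendricks</td><td>Pied Piper</td></tr>
  <tr><td>Jared Dunn</td><td>Pied Piper</td></tr>
</table>
"""

contacts = table_to_dataframe(example_html)
print(contacts)
```

Building the frame in one step avoids growing a DataFrame row by row, which tends to get slow on very long tables.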
Finally, we want to print the table data and export it. That’s the point of scraping these tables of data! Simply print your test_data variable and then call the “to_csv” method to save it as a CSV file, which you can open in a spreadsheet or import into a Google Sheet.
print(test_data)
test_data.to_csv('test.csv', index=False)
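A quick way to confirm the export worked is to read the file back. This sketch uses a small stand-in DataFrame and a temporary folder (both assumptions of the example, so it does not touch your real test.csv) and passes dtype=str to read_csv so values such as phone numbers come back as text.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the scraped table; your test_data would go here instead.
sample = pd.DataFrame(
    {
        "Full Name": ["Michael Bluth", "Gavin Belson"],
        "Phone": ["422-321-0000", "401-251-5282"],
    }
)

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "test.csv")

    # index=False leaves out the row numbers pandas would otherwise write.
    sample.to_csv(csv_path, index=False)

    # dtype=str keeps every column as text when reading the file back.
    round_trip = pd.read_csv(csv_path, dtype=str)

print(round_trip)
```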
We started this article off simple by saving an HTML file and then reading it. Let’s try scraping this blog instead and see if it works!
Remember that the Golden Rule of Scraping is “Respect the website”. Please do not run a for loop and request my site or any site 1,000 times. You will get banned. I want you to have access to information, but I want you to be respectful of others.
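One way to put that rule into practice is to check a site's robots.txt rules and pause between requests. The sketch below uses only the standard library and makes no network calls. The robots.txt lines, the example.com URLs, and the one-second delay are all made up for illustration; a real script would load the site's actual robots.txt.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules; a real script would load the site's own file
# with parser.set_url(...) followed by parser.read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Hypothetical URLs we would like to scrape.
urls = [
    "https://example.com/blog/table-post/",
    "https://example.com/private/report/",
]

# Keep only the URLs the rules allow for any user agent.
allowed = [url for url in urls if parser.can_fetch("*", url)]

DELAY_SECONDS = 1  # assumed pause; tune it to what the site can tolerate

for url in allowed:
    print("Would request:", url)
    time.sleep(DELAY_SECONDS)  # wait before moving on to the next request
```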
First, import the requests library.
Next, request this blog article’s URL and store the text of the response in source.
import requests
blog_article = requests.get('https://scriptsformarketers.com/scrape-table-online/')
source = blog_article.text
soup = soup_parser.BeautifulSoup(source, 'html.parser')
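Network requests can fail or hang, so a slightly more defensive variation may be useful. This sketch wraps the request in a function (fetch_html is a made-up name) that sets a timeout and raises an error on a bad status code. The ten-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout_seconds=10):
    """Return the page's HTML, or raise if the request fails (hypothetical helper)."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout_seconds)
    # raise_for_status raises requests.HTTPError on 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html with the article's URL and assigning the result to source would then stand in for the first two lines of the snippet above.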
Continue with the rest of the script from there and you should be able to scrape our example table!
Now, you have the data from this table and can send it to your team or use it in your program! Give it a shot on a table you find in the wild, preferably a very long one.