HTML Cheatsheet: A Scraper's Guide to the Web

Any online marketer can benefit from knowing basic HTML. While tools like WordPress and landing page builders can make it easy to make websites, they often have some limitations. In this HTML cheatsheet, I’ll go over all the basics of HTML so you can succeed in writing your own.

As you scrape, you will be pulling web pages with Python, but you will need a basic understanding of HTML and CSS to know how best to extract your information.

HTML is the bones of each webpage, and CSS is the outfit it wears. I recommend that you download the Chrome Extension “Visual Inspector”.

This will let you check the HTML of the website without doing any weird codey things.

We’ll start by discussing HTML and then move into CSS.

Tag Names

HTML is a language with tags.

Essentially, it has a group of labels that tell a computer what a section is. There are a couple of basic tags to keep an eye out for:

The Head and Body

The head is where all the scripts that power the website will be found. For most scraping purposes, you only need the <body>. The body tag holds all of the tags we will discuss in this article.

The <head> tag holds all of the SEO information and often the scripts. Useful applications of the head would be to learn how a competitor structures their SEO or to see what software a prospect is using.
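
As a rough sketch of reading SEO details out of the head, the snippet below parses a made-up page with BeautifulSoup. The page title, description, and script URL are placeholders invented for illustration, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html = """<html>
  <head>
    <title>Acme Widgets</title>
    <meta name="description" content="Hand-made widgets since 1999.">
    <script src="https://example.com/analytics.js"></script>
  </head>
  <body><p>Welcome</p></body>
</html>"""

# "html.parser" is Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")
head = soup.head

# The page title that search engines display.
title = head.title.get_text()

# The meta description; attrs= is needed because "name" is already
# the first positional argument of find().
description_tag = head.find("meta", attrs={"name": "description"})
description = description_tag["content"] if description_tag else None

# External scripts can hint at the software a site uses.
script_sources = [script["src"] for script in head.find_all("script", src=True)]

print(title)
print(description)
print(script_sources)
```

Looking at the script sources is only a hint: many tools load their scripts from the body or inject them later with JavaScript.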

The most basic version of an HTML document would be the following.
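
A minimal sketch of such a document, held in a Python string and parsed with BeautifulSoup, might look like this. The title and paragraph text are placeholders, and "html.parser" is chosen here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# A bare-bones HTML document: a head with a title, and a body with one paragraph.
html = """<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <p>Hello, world.</p>
  </body>
</html>"""

# Build the soup object that the later examples call find() and find_all() on.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.head.title.get_text()
first_paragraph = soup.body.p.get_text()

print(page_title)
print(first_paragraph)
```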


Div Tag

A large share of the web is built of a div inside another div, like nesting dolls.

A div is notated as <div></div>

# Assumes soup is a BeautifulSoup object built from the page's HTML
div = soup.find('div')
print(div)  # Print the first div on the page
div_list = soup.find_all('div')
for div in div_list:
    print(div)  # Print every div on the page
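
To make the nesting-doll idea concrete, here is a small sketch on a made-up snippet. The id values are invented for illustration. It contrasts searching every nested level with searching only the direct children of a div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one outer div holding two divs, one of which holds a third.
html = (
    '<div id="page">'
    '<div id="header"><div id="logo">Logo</div></div>'
    '<div id="content">Text</div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div")  # the first div in the document, here the outermost one

# By default find_all() looks at every level below the tag it is called on.
all_nested_ids = [d["id"] for d in outer.find_all("div")]

# recursive=False limits the search to direct children only.
direct_child_ids = [d["id"] for d in outer.find_all("div", recursive=False)]

print(all_nested_ids)
print(direct_child_ids)
```

Calling find() or find_all() on a tag you already found, rather than on the whole soup, is a handy way to narrow a search to one part of the page.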

Anchor Tag

The other most-used tag is the <a> tag. This is an “anchor tag”. It’s best to think of it as a link. It will be the second most common tag that you look for.

link = soup.find('a')
print(link)  # Print the first link on the page
link_list = soup.find_all('a')
for link in link_list:
    print(link)  # Print every link on the page
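
Usually you want where a link points rather than the whole tag. The sketch below, run on invented sample links, collects each link's text and its href attribute, and skips anchors that have no href.

```python
from bs4 import BeautifulSoup

# Hypothetical links used only for illustration.
html = """<body>
  <a href="/pricing">Pricing</a>
  <a href="https://example.com/contact">Contact</a>
  <a>No destination</a>
</body>"""
soup = BeautifulSoup(html, "html.parser")

links = []
for link in soup.find_all("a"):
    # .get() returns None when the attribute is missing,
    # whereas link["href"] would raise a KeyError.
    href = link.get("href")
    if href is not None:
        links.append((link.get_text(strip=True), href))

print(links)
```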

Section Tag

Sections are usually used to denote an area where all elements belong to a certain category. They’re big areas denoted as <section></section>

Typically, Sections are used to show different areas of the website such as Pricing or a Contact Form area.

To grab sections on the web page, you can run the following code.

section = soup.find('section')
print(section)  # Print the first section on the page
section_list = soup.find_all('section')
for section in section_list:
    print(section)  # Print every section on the page

Paragraph Tag

A paragraph tag is denoted with a lowercase p as <p></p>

To find a paragraph tag, you can run the following code.

paragraph = soup.find('p')
print(paragraph)  # Print the first paragraph on the page
paragraph_list = soup.find_all('p')
for paragraph in paragraph_list:
    print(paragraph)  # Print every paragraph on the page

Header Tag

h1, h2, h3, h4, h5, h6 – these are all header tags. The number indicates the header's importance to the page: h1 is the most important and h6 the least.

header = soup.find('h1')
print(header)  # Print the first h1 on the page
header_list = soup.find_all('h1')
for header in header_list:
    print(header)  # Print every h1 on the page
# If you want other headers, replace 'h1' with 'h2', 'h3', etc.
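
If you want every header level at once, find_all() also accepts a list of tag names. The following is a sketch on made-up headers. It builds a simple outline of the page in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical page content used only for illustration.
html = """<body>
  <h1>Pricing Guide</h1>
  <p>Intro text</p>
  <h2>Plans</h2>
  <h3>Basic</h3>
  <h2>FAQ</h2>
</body>"""
soup = BeautifulSoup(html, "html.parser")

header_names = ["h1", "h2", "h3", "h4", "h5", "h6"]

# Passing a list matches any of the listed tags, in the order they appear on the page.
outline = [
    (header.name, header.get_text(strip=True))
    for header in soup.find_all(header_names)
]

for level, text in outline:
    print(level, text)
```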

Span Tag

Spans can hold any text, and are written as <span></span>. They’re often used to style part of a paragraph. You may find text “hidden” in them: if you scrape only the text that sits directly inside the paragraph, you can miss any data in a span.

span = soup.find('span')
print(span)  # Print the first span on the page
span_list = soup.find_all('span')
for span in span_list:
    print(span)  # Print every span on the page

Strong Tag

A strong tag is similar to a span tag, with the slight distinction that browsers show it in bold by default. Again, text can be hidden within a strong tag, so if you pull only the text sitting directly in a paragraph tag, you may miss a word that is inside the strong tag.

strong_tag = soup.find('strong')
print(strong_tag)  # Print the first strong tag on the page
strong_list = soup.find_all('strong')
for strong_tag in strong_list:
    print(strong_tag)  # Print every strong tag on the page
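
The "hidden text" problem with span and strong tags can be seen in a short sketch. The sentence below is invented for illustration. Reading only the paragraph's own text skips the nested tags, while get_text() walks into them.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with a strong tag and a span nested inside it.
html = '<p>Price: <strong>$49</strong> per <span class="unit">month</span></p>'
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")

# .string is None because the paragraph has several children, not one piece of text.
single_string = paragraph.string

# Only the text that sits directly inside <p>; the strong and span contents are skipped.
direct_text = "".join(paragraph.find_all(string=True, recursive=False))

# get_text() gathers text from the paragraph and everything nested in it.
full_text = paragraph.get_text()

print(repr(single_string))
print(repr(direct_text))
print(repr(full_text))
```

When in doubt, get_text() is the safer default, since it does not depend on how the page author nested the tags.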

Table Tag

Tables are very easy and intuitive to scrape because they have defined structure. You can read this article to learn how to scrape any table online.

A table in HTML would be written like this.

<table>
  <tr>
    <th>Single Header</th>
  </tr>
  <tr>
    <td>Single Column Value</td>
  </tr>
</table>

Tables in HTML open with the <table> tag and contain multiple table rows, notated as <tr></tr>. Table rows can hold <td> (table data) or <th> (table header) tags. Typically, <th> tags are only used in the first row of the table, while the <td> tags hold the actual values of the table.
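
As a sketch of how that structure maps to code, the example below turns a small made-up table into a list of dictionaries. It assumes the first row holds the <th> headers and every later row holds <td> values, which is common but not guaranteed on real pages.

```python
from bs4 import BeautifulSoup

# Hypothetical table used only for illustration.
html = """<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>$10</td></tr>
  <tr><td>Pro</td><td>$25</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")
rows = table.find_all("tr")

# Assumption: the first row is the header row.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    # Pair each cell with the header of its column.
    records.append(dict(zip(headers, cells)))

print(records)
```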

CSS Class Names

Now you may have noticed that if you are looking for a specific div, it can be like finding a needle in a haystack.

This is where classes can help incredibly. Classes are names you can add to tags to style them with CSS. When you see a link with a different color, that may have a class on it.

A link with a class looks like the following HTML:

<a href="#" class="link-class">Link</a>

In HTML, a class appears as the class attribute on a tag. If you wanted all of the links with this class, you would call

links_with_class = soup.find_all('a', {'class': 'link-class'}) 

The links_with_class variable will be a list of all the anchor tags that have that class.
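
BeautifulSoup offers a few equivalent ways to filter by class. The sketch below, on invented links, shows the dictionary form used above next to the class_ keyword and a CSS selector. A tag with several classes still matches as long as one of them is link-class.

```python
from bs4 import BeautifulSoup

# Hypothetical links used only for illustration.
html = """<a href="/a" class="link-class">First</a>
<a href="/b">Second</a>
<a href="/c" class="link-class highlighted">Third</a>"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary form, as in the article.
by_dict = soup.find_all("a", {"class": "link-class"})

# Keyword form; the trailing underscore is needed because class is a Python keyword.
by_keyword = soup.find_all("a", class_="link-class")

# CSS selector form: tag name, a dot, then the class name.
by_selector = soup.select("a.link-class")

hrefs = [link["href"] for link in by_dict]
print(hrefs)
```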

ID Names

Now that you have a grasp on classes, let’s figure out how to get one *specific* element on the page. Often, if an element on the page is important enough, it will have an “id”, which should be unique on the page. In CSS, an id is denoted with a hashtag, as in “#important-element”.

This will normally look like the following in HTML.

<a href="#" id="important-element">Link</a>

You can access it by calling the following:

important_link = soup.find('a', {'id': 'important-element'})
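
Here is a brief sketch of the same lookup on made-up HTML, alongside two alternative spellings. It also shows that find() returns None when nothing matches, which is worth checking before you read attributes from the result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = """<a href="/home">Home</a>
<a href="/signup" id="important-element">Sign up</a>"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary form, as in the article.
important_link = soup.find("a", {"id": "important-element"})

# Keyword form; the tag name can be left out because an id should be unique.
same_link = soup.find(id="important-element")

# CSS selector form, using the hashtag notation.
selected_link = soup.select_one("#important-element")

# A lookup that matches nothing gives None rather than raising an error.
missing = soup.find(id="does-not-exist")

if important_link is not None:
    print(important_link["href"])
```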

Next Steps

I recommend that you start testing this out on websites that interest you. As always, please be respectful of other people’s websites and do not spam them with visits.
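
One simple way to avoid spamming a site is to pause between requests. The sketch below uses only Python's standard library. The function names, the user agent string, and the two-second default delay are all assumptions made for illustration, not a recommendation from any particular site.

```python
import time
from urllib.request import Request, urlopen


def fetch_page(url, user_agent="my-learning-scraper/0.1"):
    """Download one page and return its HTML as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(urls, delay_seconds=2.0, fetch=fetch_page):
    """Fetch each URL in turn, sleeping between requests (hypothetical helper)."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


# Example usage (not run here because it would make real network requests):
# pages = fetch_politely(["https://example.com/", "https://example.com/about"])
```

The HTML returned for each page can then be handed to BeautifulSoup in the same way as the other examples in this guide.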

Find a website with a table, the simplest element to scrape, and read this article on how to scrape any table online.