Python Web Scraping Best Practices

I've been scraping the web a lot as part of my quest to find the best scratch-off lottery ticket. I want to share some of the practices that I've discovered.

I'll do so by posting the code itself in the style of "literate programming". (It's actually just heavily commented code. Actual literate programming would use something like noweb to "tangle" comments/code.)

"""
Scrapes the Pennsylvania lottery website for scratch-off ticket
data and calculates the expected value for each game.

Pennsylvania publishes the number of tickets printed and how many
tickets are printed at each prize level.

We can calculate the expected value of a game by summing
the value of all the prizes and dividing that by the cost
of all the tickets.

The palottery website has an "index" page that has links to every game.
Each individual game has a link to a "game rules" page.
We can start at the index and visit every game rules page, then we
can find the html table on that page which has the detailed prize
information and run our calculations.

Website that we'll be scraping:
    https://www.palottery.state.pa.us
Example usage:
    python -m pennsylvania
    LOGLEVEL=DEBUG USE_CACHE=True python -m pennsylvania

The following behavior is configurable through shell environment variables.

Set LOGLEVEL to print useful debug info to console.
Defaults to WARNING.

Set USE_CACHE to cache responses. This speeds up development
and is nice to the servers we're hitting.
Defaults to False. Note: setting this env variable to the string "False"
will still enable the cache, because any non-empty string (including
"False") is truthy in Python. Either set it to True or don't set it at all.
"""
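As an aside, the USE_CACHE caveat above comes from a Python gotcha: environment variables are always strings, and any non-empty string is truthy. If you wanted a stricter flag, you could parse it explicitly. A minimal sketch (the `env_flag` helper is my own illustration, not part of the script):

```python
import os

# Any non-empty string is truthy, so the string "False" still
# enables the cache when tested with `if os.environ.get(...)`.
assert bool("False") is True

def env_flag(name, default=False):
    """Parse an environment variable as a boolean flag.

    Treats "1", "true", and "yes" (any case) as True;
    everything else, including "False", as False.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

os.environ["USE_CACHE"] = "False"
print(env_flag("USE_CACHE"))  # the string "False" now reads as False
```

For a one-off script, though, "set it to True or don't set it" is a perfectly reasonable contract.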
import base64
from bs4 import BeautifulSoup as bs
import logging
import os
import re
import requests
from requests import RequestException
from tempfile import gettempdir

# If this were part of a larger package of scripts, logging configuration
# should be handled in the package's `__init__.py`. One way to do this
# is with `dictConfig`. See the python docs for an example.
# For a simple one-off script, it's convenient to set the log level based on
# an environment variable. This way, you can turn logging on or off without
# having to modify any code, and you don't have to add code to handle
# parsing command line arguments.
logging.basicConfig(
    level=getattr(logging, os.environ.get('LOGLEVEL', 'WARNING'))
)

# It's worth assigning to constants values that are used in many
# places throughout a script.
BASE_URL = "https://www.palottery.state.pa.us"
INDEX_URL = "{}/Scratch-Offs/Active-Games.aspx".format(BASE_URL)

def fetch_html(url):
    """Helper to fetch and cache html responses.

    During development and while testing, we'll be hitting the same urls often.
    The content of the pages probably won't be changing.
    Caching the results will speed up development,
    and the servers will appreciate us for not spamming requests.

    The responses are cached in the operating system's tempfile directory.
    That's probably /tmp/ or /var/tmp/ on Unix flavors and C:/temp/ on Windows.
    The filename is based on the URL. But since the URL might contain
    characters that are invalid for filenames, we base64 encode the URL.
    """
    safe_filename = base64.urlsafe_b64encode(bytes(url, "utf-8")).decode("utf-8")
    filepath = os.path.join(gettempdir(), safe_filename)

    if os.path.isfile(filepath) and os.environ.get('USE_CACHE', False):
        with open(filepath, "r") as f:
            return f.read()

    # We are relying on the outside world when we make a request, so we
    # might want to wrap this in a try/except. But we'd
    # only want to do that in two cases.
    # 1. We have a way of handling exceptions.
    #    A good example would be to catch exceptions and retry the
    #    request; maybe the network was down.
    # 2. We can't handle the exception, but we want to log something
    #    more useful than the stack trace that will get spit out if
    #    we just let the exception go uncaught.
    # In this case, I don't think it's worth muddying up the code
    # trying to handle exceptions here. It's easy enough to just re-run
    # the script.
    html = requests.get(url).text
    if os.environ.get('USE_CACHE', False):
        with open(filepath, "w+") as f:
            f.write(html)
    return html
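To see why the base64 trick works, here's the encoding in isolation (a standalone sketch; the URL is just an example):

```python
import base64

url = "https://www.example.com/Scratch-Offs/Active-Games.aspx?id=1"

# Slashes, colons, and question marks are awkward or invalid in
# filenames; urlsafe_b64encode maps everything to [A-Za-z0-9_\-=].
safe_filename = base64.urlsafe_b64encode(bytes(url, "utf-8")).decode("utf-8")
assert "/" not in safe_filename and ":" not in safe_filename

# The encoding is reversible, so a cache filename can always be
# mapped back to the URL it came from.
original = base64.urlsafe_b64decode(safe_filename).decode("utf-8")
assert original == url
```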

def find_game_names(html):
    """Game names can be found on the index page
    in the text of anchor elements
    which have the class "activeGame_li".
    """
    soup = bs(html, "lxml")
    game_elements = soup.find_all("a", class_="activeGame_li")
    return [
        re.sub(
            r"\s+", " ", g.find("div", class_="info").text
        ) for g in game_elements
    ]
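The `re.sub(r"\s+", " ", ...)` call collapses the newlines and runs of spaces that tend to show up in scraped element text (the game name below is made up for illustration):

```python
import re

raw_name = "MONOPOLY\n        DOUBLER  "
clean = re.sub(r"\s+", " ", raw_name)
print(clean)  # "MONOPOLY DOUBLER " — each run of whitespace becomes one space
```

Note it leaves a single leading/trailing space behind; chaining `.strip()` would tidy that up too.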

def find_game_urls(html):
    """Luckily, all of the Pennsylvania games are listed on a single html page.
    We don't have to mess around with pagination or make multiple requests.

    The links are "href" attributes of anchor tags with the class "activeGame_li".
    """
    soup = bs(html, "lxml")
    game_elements = soup.find_all("a", class_="activeGame_li")
    return ["{}{}".format(BASE_URL, e.attrs["href"]) for e in game_elements]

def find_complete_game_rules_url(html):
    """Game pages have a link to the complete game rules.
    The complete game rules page has a table of all prizes for a game.

    The link to the game rules page is in an anchor tag
    nested under a div with the class "instant-games-games-info".
    """
    soup = bs(html, "lxml")
    games_info_div = soup.find("div", class_="instant-games-games-info")
    games_info_anchor = games_info_div.find_all("a")[1]
    games_info_url = games_info_anchor.attrs["href"]
    return games_info_url

def find_rows(html):
    """From a game rules page, find the rows of the table
    that have the number of tickets and the value of each prize.
    """
    soup = bs(html, "lxml")

    # Some game rules pages have multiple tables.
    # The first table has the prizes.
    # soup.find returns the first matching element
    # soup.find_all returns a list of all matching elements.
    prize_table = soup.find("table")
    row_elements = prize_table.find_all("tr")

    # The first row is headers so we sort of want
    # to skip it for the calculations, but it includes
    # an important bit of information that we want.
    # The rows only contain winning ticket info.
    # We also care about a row for the losing prize tier.
    # It will have a value of "0" but we want to know
    # how many losing tickets there are.
    # We can calculate that from the first header. It
    # contains the total number of tickets printed.
    # Let's get the total number of tickets printed so
    # we can subtract the number of winning tickets
    # from it, giving us the number of losing tickets.
    header_row = row_elements[0]
    header_columns = header_row.find_all("th")
    total_number_tickets = int(re.sub(r"\D", "", header_columns[-1].text))

    row_elements = row_elements[1:]

    # We only care about the last and second to last columns.
    # The following helper functions will help us parse
    # the data we care about from each row.
    # The last column is the number of tickets at this prize level.
    # The number of tickets has commas, like 1,350,500.
    # We'll have to parse them out.
    # The second to last column is the prize value.
    # Prize value is usually "$" followed by a number.
    # Those are easy to parse.
    # But for the free ticket prize it's "FREE $1 TICKET"
    def parse_value(row_element):
        columns = row_element.find_all("td")
        try:
            value_element = columns[-3]
            value_text = value_element.text
            return int(re.sub(r"\D", "", value_text))
        except Exception as e:
            # This is an exception we can handle.
            # We can simply return a value of 0 if
            # the row doesn't have what we expect.
            # Our result might be inaccurate, but
            # I'll consider that acceptable.
            # I'll log something useful so I know
            # to look into it.
            logging.warning(
                "Exception parsing value for a row.\n{}".format(e)
            )
            return 0

    def parse_num_tickets(row_element):
        columns = row_element.find_all("td")
        try:
            num_tickets_element = columns[-1]
            num_tickets_text = num_tickets_element.text
            return int(num_tickets_text.replace(",", ""))
        except Exception as e:
            # Same as above, we can handle this.
            # Logging and returning 0 is better than blowing up.
            logging.warning(
                "Exception parsing num_tickets for a row.\n{}".format(e)
            )
            return 0

    # Iterate over each row and parse out the value of the prize tier
    # and the number of remaining tickets at that prize tier.
    rows = [(parse_value(e), parse_num_tickets(e)) for e in row_elements]
    number_winning_tickets = sum(r[1] for r in rows)

    # Insert the losing ticket value, $0, and the number
    # of losing tickets into our rows.
    rows.insert(0, (0, total_number_tickets - number_winning_tickets))
    return rows
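The losing-row bookkeeping above is easy to sanity-check with made-up numbers (the figures below are invented, not real game data):

```python
# Suppose the header says 1,000 total tickets were printed, and the
# winning rows are (prize value, ticket count) pairs from the table.
total_number_tickets = 1000
rows = [(100, 5), (10, 50), (1, 200)]

number_winning_tickets = sum(r[1] for r in rows)  # 255 winners
rows.insert(0, (0, total_number_tickets - number_winning_tickets))

print(rows[0])  # (0, 745): the $0 "prize" tier with 745 losing tickets
```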

def find_price(html):
    """Price is hard to find. It seems to always be a sibling to an
    <i> tag which has the text "Price". So, we can find that <i>
    tag, get the text of its parent, find the last word of that text,
    and that will be the price of the ticket as a string that looks like
    "$10.", which we can then strip of the non-digits.
    """
    soup = bs(html, "lxml")
    price_element = soup.find(text="Price")
    price_text = price_element.parent.parent.text.split(" ")[-1]
    price = int(re.sub(r"\D", "", price_text))
    return price
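Prices, prize values, and ticket counts all go through the same kind of digit-stripping; `re.sub(r"\D", "", ...)` deletes everything that isn't a digit. A few sample strings mirroring the formats described above:

```python
import re

assert int(re.sub(r"\D", "", "$10.")) == 10
assert int(re.sub(r"\D", "", "$1,000")) == 1000
assert int("1,350,500".replace(",", "")) == 1350500

# The free-ticket prize survives this treatment too:
# "FREE $1 TICKET" comes out as the number 1.
assert int(re.sub(r"\D", "", "FREE $1 TICKET")) == 1
```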

def calculate_original_ev(game_url):
    """The "expected value" or "return on investment" of a game
    will be the total value of the remaining prizes
    divided by the total cost of the remaining tickets.

    Imagine you bought every ticket that was printed.

    How much money would you spend? How much money would you get back in prizes?

    If you won $1,500,000 and spent $2,000,000
    then your expected value is 1,500,000 / 2,000,000 = 0.75.

    For every $1 spent on the game, you'll get back $0.75
    for an average loss of $0.25.
    """
    game_html = fetch_html(game_url)
    game_rules_url = find_complete_game_rules_url(game_html)
    game_rules_html = fetch_html(game_rules_url)
    price = find_price(game_rules_html)
    rows = find_rows(game_rules_html)
    total_number_tickets = sum(r[1] for r in rows)
    total_number_winners = sum(r[1] for r in rows if r[0] != 0)
    total_value_tickets = sum(r[1] * r[0] for r in rows)
    total_cost_tickets = total_number_tickets * price
    ev = total_value_tickets / total_cost_tickets
    return ev
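The docstring's numbers can be reproduced with a toy prize table (all figures invented for illustration):

```python
# (prize value, number of tickets) pairs, including the losing tier.
price = 2
rows = [(0, 700000), (5, 300000)]

total_number_tickets = sum(r[1] for r in rows)        # 1,000,000 tickets
total_value_tickets = sum(r[1] * r[0] for r in rows)  # $1,500,000 in prizes
total_cost_tickets = total_number_tickets * price     # $2,000,000 to buy them all

ev = total_value_tickets / total_cost_tickets
print(ev)  # 0.75: every $1 spent returns $0.75 on average
```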

def main():
    index_html = fetch_html(INDEX_URL)
    game_urls = find_game_urls(index_html)
    game_names = find_game_names(index_html)
    # Data will be a list of tuples that looks like:
    # [(Ticket Price, Game Name, Expected Value), ...]
    # The first element of the tuple of the list comprehension below
    # is kind of confusing. We are iterating over game urls.
    # We first fetch the html for the game url. Then we find the
    # game rules url in that page. Then we fetch the html of the game rules
    # page, then we find the price from that html.
    # Hence:
    #     `find_price(fetch_html(find_complete_game_rules_url(fetch_html(url))))`
    data = [
        (
            "${}".format(
                find_price(fetch_html(find_complete_game_rules_url(fetch_html(url))))
            ),
            name,
            calculate_original_ev(url),
        )
        for name, url in zip(game_names, game_urls)
    ]

    def price(game):
        """We want to sort the list by price, but the price in the list
        is a string prepended by "$". We can't sort strings numerically,
        so use this function as the key to our sort.
        """
        return int(game[0].replace("$", ""))

    data = sorted(data, key=price)

    # Ta-da!
    for datum in data:
        print(datum)

if __name__ == "__main__":
    main()