python 3.x – Getting Death Row Inmates' Last Statements

The initial version of the code appeared as an answer to an SO question. I refactored it a bit and it works pretty well, IMHO. I get a solid .csv file with all the data from the Texas Department of Criminal Justice's Death Row page.

What I was especially interested in was getting every offender's last statement, if there was one, which the code accomplishes.

What I'd like to get here is some feedback on how I'm using pandas, as I'm relatively new to it. Some memory efficiency suggestions would be nice too.

For example, should I save the initial version of the .csv file and then read it back so I can append the last statements? Or is keeping everything in memory fine?
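To make the first option concrete, here is a rough sketch of what I mean by "save, then read back" (it reuses get_page, get_offender_data and get_last_statements from the script below, and the same offenders_data.csv file name):

import pandas as pd

# Checkpoint the table to disk first...
offenders_df = get_offender_data(get_page())
offenders_df.to_csv("offenders_data.csv", index=False)

# ...then reload it (possibly in a separate run) and attach the statements.
offenders_df = pd.read_csv("offenders_data.csv")
names_and_urls = list(
    zip(
        offenders_df["First Name"],
        offenders_df["Last Name"],
        offenders_df["Last Statement URL"],
    )
)
offenders_df["Last Statement"] = get_last_statements(names_and_urls)
offenders_df.to_csv("offenders_data.csv", index=False)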

If you find any other holes, do point them out!

The code:

import random
import time

import pandas as pd
import requests
from lxml import html

base_url = "https://www.tdcj.texas.gov/death_row"
statement_xpath = '//*[@id="content_right"]/p[6]/text()'


def get_page() -> str:
    return requests.get(f"{base_url}/dr_executed_offenders.html").text


def clean(first_and_last_name: list) -> str:
    # Strip the suffixes and apostrophes before lowercasing and removing
    # spaces; in the other order, ", Jr." and ", Sr." can never match.
    name = "".join(first_and_last_name)
    name = name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")
    return name.replace(" ", "").lower()


def get_offender_data(page: str) -> pd.DataFrame:
    df = pd.read_html(page, flavor="bs4")
    df = pd.concat(df)
    df.rename(
        columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
        inplace=True,
    )

    df["Offender Information"] = df[
        ["Last Name", "First Name"]
    ].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)

    df["Last Statement URL"] = df[
        ["Last Name", "First Name"]
    ].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)
    return df


def get_last_statement(statement_url: str) -> str:
    page = requests.get(statement_url).text
    statement = html.fromstring(page).xpath(statement_xpath)
    text = next(iter(statement), "")
    return " ".join(text.split())


def get_last_statements(offenders_data: list) -> list:
    statements = []
    for item in offenders_data:
        *names, url = item
        print(f"Fetching statement for {' '.join(names)}...")
        statements.append(get_last_statement(statement_url=url))
        time.sleep(random.randint(1, 4))
    return statements


if __name__ == "__main__":
    offenders_df = get_offender_data(get_page())
    names_and_urls = list(
        zip(
            offenders_df("First Name"),
            offenders_df("Last Name"),
            offenders_df("Last Statement URL"),
        )
    )
    offenders_df("Last Statement") = get_last_statements(names_and_urls)
    offenders_df.to_csv("offenders_data.csv", index=False)

The scraping part is intentionally slow, as I don’t want to abuse the server, but I do want to get the job done. So, if you don’t have a couple of minutes to spare, you can fetch the offenders_data.csv file from here.
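As an aside, if the random sleeps count as one of those holes: an alternative I considered (just a sketch, not what the script above does; polite_get is a made-up helper name) is reusing a single requests.Session with a fixed minimum delay and explicit error handling:

import time

import requests

session = requests.Session()  # reuse one connection instead of reconnecting per request


def polite_get(url: str, min_delay: float = 2.0) -> str:
    # Wait a fixed minimum time between hits instead of a random 1-4 seconds.
    time.sleep(min_delay)
    response = session.get(url)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.text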