Empirical Asset Pricing via Machine Learning

This exercise should help you fight your way through an actual academic application of machine learning methods in asset pricing. It guides you step by step through replicating the empirical framework and, eventually, the main results of the paper Empirical Asset Pricing via Machine Learning by Shihao Gu, Bryan Kelly, and Dacheng Xiu.

Exercises:

Solutions

library(tidyverse)
library(RSQLite)
library(lubridate)
library(archive)
import pandas as pd
import requests
import numpy as np
from io import BytesIO, StringIO
import zipfile
import sqlite3

First, we download the (large) dataset directly from within R. The data is stored as a .csv file within a zipped folder; for such purposes, the archive package is useful. Presumably, you want to store the output in a .sqlite file, for instance in the existing tidy_finance_*.sqlite database or in a new file. Next, we set options(timeout = 1200), which gives the R session up to 1200 seconds to download the dataset. The default in R is 60 seconds, which was too short on my machine.

options(timeout = 1200)
characteristics <- read_csv(archive_read("https://dachxiu.chicagobooth.edu/download/datashare.zip", 
                                         file = "datashare.csv"))

characteristics <- characteristics |>
  rename("month" = "DATE") |>
  mutate(
    month = ymd(month),
    month = floor_date(month, "month")
  ) |>
  rename_with(~ paste0("characteristic_", .), -c(permno, month, sic2))

First, we download the (large) dataset directly from within Python. The data is stored as a .csv file within a zipped folder; for such purposes, the requests library is useful. Presumably, you want to store the output in a .sqlite file, for instance in the existing tidy_finance_*.sqlite database or in a new file. We set timeout=1200, which allows the request up to 1200 seconds to complete; a shorter timeout was not sufficient for this download on my machine.

url = "https://dachxiu.chicagobooth.edu/download/datashare.zip"

# verify=False disables SSL certificate verification for this request;
# remove it if the certificate validates on your machine
response = requests.get(url, timeout=1200, verify=False)

# Extract datashare.csv from the zip archive and read it into a DataFrame
with zipfile.ZipFile(BytesIO(response.content)) as z:
    with z.open("datashare.csv") as csv_file:
        characteristics = pd.read_csv(csv_file)

characteristics = (characteristics
    .rename(columns={'DATE': 'month'})
    # DATE is stored in YYYYMMDD format, so parse it explicitly
    .assign(month=lambda df: (pd.to_datetime(df['month'].astype(str), format='%Y%m%d')
                              .dt.to_period('M').dt.to_timestamp()))
    .rename(columns=lambda col: col if col in ['permno', 'month', 'sic2'] else 'characteristic_' + col)
)

The cross-sectional ranking may be time-consuming. The idea is that at each date, the cross-section of each predictor is rank-transformed and scaled such that the maximum value is one and the minimum value is minus one. The function below explicitly handles NA values so that they do not distort the ranking.

rank_transform <- function(x) {
  rx <- rank(x, na.last = TRUE) # missing values receive the highest ranks
  non_nas <- sum(!is.na(x))
  rx[rx > non_nas] <- NA # set the ranks of missing values back to NA
  2 * ((rx - min(rx, na.rm = TRUE)) / (max(rx, na.rm = TRUE) - min(rx, na.rm = TRUE)) - 0.5) # rescale ranks to [-1, 1]
}

characteristics <- characteristics |>
  group_by(month) |>
  mutate(across(contains("characteristic"), rank_transform))
def rank_transform(x):
    rx = x.rank(na_option='keep')  # missing values keep NaN ranks
    non_nas = x.count()
    rx[rx > non_nas] = np.nan
    return 2 * ((rx - rx.min()) / (rx.max() - rx.min()) - 0.5)  # rescale ranks to [-1, 1]


characteristic_cols = [col for col in characteristics.columns if 'characteristic' in col]

# Apply the rank transformation to each characteristic within each month
characteristics[characteristic_cols] = (characteristics
    .groupby('month')[characteristic_cols]
    .transform(rank_transform)
)
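
To make the effect of the transformation concrete, here is a small sanity check on a toy vector (my own illustration, not part of the original exercise): the smallest observation maps to minus one, the largest to one, and missing values remain missing.

# Toy example: five observations, one of them missing
rank_transform(c(3, NA, 10, 7, 1))
# returns approximately -0.33, NA, 1.00, 0.33, -1.00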

Next, we merge the data with monthly CRSP data to get excess returns. We keep mktcap_lag in the sample so that we can later create portfolio sorts based on the predictions. As a final step, we also include the macroeconomic predictor variables.

tidy_finance <- dbConnect(SQLite(), "../data/tidy_finance.sqlite",
  extended_types = TRUE
)

crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |>
  select(month, permno, mktcap_lag, ret_excess) |>
  collect()

macro_predictors <- tbl(tidy_finance, "macro_predictors") |>
  select(-rp_div) |>
  collect() |>
  rename_with(~ paste0("macro_", .), -month)

characteristics <- characteristics |>
  inner_join(crsp_monthly, by = c("month", "permno")) |>
  inner_join(macro_predictors, by = "month") |>
  arrange(permno, month) |>
  select(permno, month, ret_excess, mktcap_lag, sic2, contains("macro"), contains("characteristic"))
tidy_finance = sqlite3.connect(database="../data/tidy_finance_python.sqlite")

crsp_monthly = pd.read_sql_query(
  sql="SELECT month, permno, mktcap_lag, ret_excess FROM crsp_monthly",
  con=tidy_finance,
  parse_dates={"month"}
)

macro_predictors = (pd.read_sql_query(
    sql="SELECT * FROM macro_predictors",
    con=tidy_finance
).drop(columns=['rp_div'])
 .rename(columns=lambda x: f'macro_{x}' if x != 'month' else x)
)

characteristics = (
    characteristics
    .merge(crsp_monthly, on=['month', 'permno'], how='inner')
    .merge(macro_predictors, on='month', how='inner')
    .sort_values(by=['permno', 'month'])
    .reset_index(drop=True)
)

# Reorder the columns; the macro and characteristic columns have to be looked up
# on the merged data, not on the pre-merge characteristics frame
characteristics = characteristics[
    ['permno', 'month', 'ret_excess', 'mktcap_lag', 'sic2'] +
    [col for col in characteristics.columns if 'macro_' in col] +
    [col for col in characteristics.columns if 'characteristic_' in col]
]

The code below replaces missing values with the cross-sectional median of each characteristic in each month. One of the paper’s coauthors also states that missing values are set to zero; since the rank-transformed characteristics have a cross-sectional median close to zero, both choices lead to very similar results.

replace_nas <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  return(x)
}

characteristics <- characteristics |>
  group_by(month) |>
  mutate(across(contains("characteristic"), replace_nas)) |>
  drop_na(sic2) |>
  mutate(sic2 = as_factor(sic2))
def replace_nas(x):
    # Replace missing values with the median of the (cross-sectional) column
    return x.fillna(x.median(skipna=True))

characteristic_cols = [col for col in characteristics.columns if 'characteristic' in col]

# Replace missing characteristics with their cross-sectional median in each month
characteristics[characteristic_cols] = (characteristics
    .groupby('month')[characteristic_cols]
    .transform(replace_nas)
)

characteristics = (characteristics
    .dropna(subset=['sic2'])
    .astype({'sic2': 'category'})
    .reset_index(drop=True)
)
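
If you instead want to follow the coauthor's remark and set the remaining missing values to zero, a minimal variant is sketched below (in R; the name replace_nas_zero is my own). Because the rank-transformed characteristics have a cross-sectional median close to zero, both imputation choices should produce very similar inputs.

# Alternative imputation: set missing characteristics to zero instead of the median
replace_nas_zero <- function(x) {
  x[is.na(x)] <- 0
  return(x)
}

# Use it in place of replace_nas above:
# characteristics |>
#   group_by(month) |>
#   mutate(across(contains("characteristic"), replace_nas_zero))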

Finally, for your convenience, I store the pre-processed data in a new file, the tidy_finance_ML.sqlite database. You can find the file on Absalon and use it for your own research and replication attempts.

tidy_finance_ML <- dbConnect(SQLite(), "../data/tidy_finance_ML.sqlite",
  extended_types = TRUE
)

dbWriteTable(tidy_finance_ML,
  "stock_characteristics_monthly",
  value = characteristics |> ungroup(),
  overwrite = TRUE
)
tidy_finance_ML = sqlite3.connect(database="../data/tidy_finance_ML.sqlite")

characteristics.to_sql(
  name="stock_characteristics_monthly",
  con=tidy_finance_ML,
  if_exists="replace",
  index=False
)
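
As a quick check, and as a template for the subsequent exercises, you can read the table back from the new database. The sketch below (in R) assumes the tidy_finance_ML.sqlite file and the stock_characteristics_monthly table created above.

tidy_finance_ML <- dbConnect(SQLite(), "../data/tidy_finance_ML.sqlite",
  extended_types = TRUE
)

stock_characteristics_monthly <- tbl(tidy_finance_ML, "stock_characteristics_monthly") |>
  collect()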