Programming and Data

A Very Few Useful Programming and Data Tips.

 

Translating Financial Statement Variables Into COMPUSTAT Data Items

Financial Statement Variable COMPUSTAT Data Number
Long Term Debt DATA9
Property Plant and Equipment (PPE): Wrong Sign! DATA30
Current Portion of Long Term Debt DATA44
Sale of Property Plant and Equipment. (Does not appear to match values in the 10-K. If you can figure out how to calculate the sale of assets from the available COMPUSTAT data items please let me know.) DATA107
Sales (Total Revenue) DATA117
Net Income Before Taxes DATA122
Depreciation and Amortization DATA125
Cost of Goods Sold minus Depreciation and Amortization DATA131
Selling, General and Administrative Expenses (SG&A) DATA132
Provision for Income Taxes DATA135
Notes Payable DATA206
Standard and Poors (S&P) Bond Rating: Translated from their letter code to a numerical value. See the WRDS online documentation to translate back. DATA280
Change in Accounts Receivable DATA302
Change in Inventories DATA303
Change in Accounts Payable DATA304
Cash Interest Paid DATA315
Cash Taxes Paid DATA317

R Programming Package Hints, Tips and Manual Clarifications

Flexclust Package

Flexclust is an R package for creating clusters of similar objects based on their characteristics. It has the ability to call a user supplied distance function and use it to calculate the “distance” between objects in a cluster and the cluster center. I put distance in quotes since it does not need to be a physical distance. It could be, for example, the difference in per capita GDP between a country and the rest of the countries in a cluster.You call the distance function through a command that looks like:

kmFam <- kccaFamily(dist=mydist, cent=colMeans, groupFun = “minSumClusters”).In this example the distance function is mydist. It is not clear from the manual, but this function needs to take as inputs a matrix of values x and matrix of centers c. The x matrix should contain one row per observation. Each row of the centers matrix c is a row vector of centers from candidate clusters. The user supplied distance function should take the matrices x and c and then calculate a distance from each candidate center for each observation in x and then return the result. For example, if x is a nxk matrix and centers matrix is 2xk then the returned matrix has to be nx2. Since flexclust determines how many rows will be passed in the centers matrix the distance function must be able to handle an arbitrary number of them. Each row in the centers matrix.

Matching Data Across Data Frames: Find the First Date in One After Dates in a Column of Another

#You have two data frames. You need to find the largest value of some column
#in the second data frame less than or equal to each value in the first data frame
#and do this ID by ID of some sort. For example, the last rating date prior to a
#trade date bond Cusip by bond Cusip. You then want to record some information from the
#second data frame. In the example, the bond’s rating. After the code is finished
#the resulting data frame will NOT be sorted. If you needed it back in original
#sort order you need to add a line to do that.
#You need the dplyr library for this to work. In this sample code uses an inner_join
#so data is only kept if it is in both data frames.

library(dplyr)

#Some sample data, say bond trades.
b <- data.frame(cusip=c(rep(“A”,30),rep(“B”,10),rep(“C”,5)),td=c(62:91,173:182,266,268,272,279,281),stringsAsFactors=F)

#More sample data, say bond ratings.
r <- data.frame(cusip=c(rep(“A”,10),rep(“B”,5),rep(“C”,7)), rd=5*c(12:21,33:37,52:58), rate=c(101:122), stringsAsFactors=F)

#Create the new data with the columns containing the value in table 2 from the column rate in r that corresponds with the
#largest value in r from the column rd that is less than or equal to td for each cusip in b.
vv <- inner_join(b,r) %>% filter(rd<=td) %>% group_by(cusip,td) %>% filter(rd==max(rd))

#Turn the output back into a data frame and sort by cusip and td.
vv <- data.frame(vv)
vv <- vv[with(vv,order(cusip,td)),]