Python: flattening exported csv relational data with pandas

python-logoIf your data is coming from CSV dumps of relational database tables, then you probably need to flatten the data before starting the training or analysis.

It requires an understanding of the relational schema so you can specify the foreign key lookups, but from that pandas can reconstruct the relationships and produce a single flattened DataFrame using pandas.merge.

As an example, I will use CSV dumps of Microsoft’s Adventure Works database to load in multiple DataFrame and merge them into a single flattened view using the foreign key relationships.

Adventure Works Sales schema

The Adventure Works database has been released alongside every new version of Microsoft SQLServer, and is a large and normalized database (full diagram from blog.jpries) that covers the sales, production, and purchasing of a fictional manufacturer.

For this article, we are going to focus on the SalesOrderHeaders table.  Specifically, we want to analyze how much is being purchased in each state, and the average tax rate.

The SalesOrderHeader directly contains the “TaxAmt” and “TotalDue” fields, so that can easily be used in our analysis.  But notice there is no state field, instead “BillToAddressId” is normalized into another table.

So we will need to resolve BillToAddressId to a state name.  But notice, there is an even deeper level of normalization, because that table uses “StateProvinceID” to look into the “StateProvince” table for the actual state name.

Luckily, the pandas merge method can help us perform these operations.

Download scripts and data

Download the scripts I have provided on github for fetching the CSV files and running the Python3 pandas script.

# make sure wget is installed
sudo apt install wget -y

# download bash script that fetches data
wget https://raw.githubusercontent.com/fabianlee/blogcode/master/pandas/flatten_relational/fetch_relational_data.sh

# download Python3 script that flattens data
https://raw.githubusercontent.com/fabianlee/blogcode/master/pandas/flatten_relational/flatten_relational.py3

# make scripts executable
chmod +x fetch_relational_data.sh
chmod +x flatten_relational.py3

Then fetch the CSV files from the github location.

# fetch the required CSV files for Sales, Address, State tables
./fetch_relational_data.sh

This will retrieve, then do the minimal cleaning necessary to use these files from pandas (converting utf-16 to utf-8).

Install Python3 pandas module

Make sure the pandas module is installed and available to Python3.

pip3 install pandas --user

Run Python3 pandas script

Before I explain the critical pieces of code, let’s run the script and view the final results.

# run script as smoke test
$ ./flatten_relational.py3
 
Processing 500 max sales rows, debug=False


=== TOTAL TAX AND SALES BY STATE =======================
                  taxamt totaldue meantaxrate
state.name                                   
Arizona            $1.2k   $14.6k       0.085
Bayern             $1.4k   $19.8k       0.072
Brandenburg        $0.3k    $4.0k       0.072
British Columbia  $14.4k  $177.9k       0.081
California        $61.9k  $756.9k       0.082
...                  ...      ...         ...
Virginia           $3.3k   $39.1k       0.085
Washington        $33.2k  $400.9k       0.083
Wisconsin          $1.2k   $13.7k       0.085
Wyoming            $7.2k   $84.8k       0.085
Yveline            $0.9k   $11.9k       0.072

[54 rows x 3 columns]

This shows the total amount of tax and total purchase costs from each state.  Along with a synthesized column showing the average tax rate.

This was constructed by reading in the main SalesOrderHeader.csv file, and then resolving address and State foreign keys for a flattened view that can be grouped by State as shown above.

Now let’s go into more detail on how this is done via pandas.

Load DataFrame from CSV file

The ‘read_csv‘ function provides a convenient way of reading a CSV file into a DataFrame.

sales_data = pd.read_csv("SalesOrderHeader.csv",encoding="unicode_escape",sep="\t",nrows=rows)

address_data = pd.read_csv("Address.csv",encoding="unicode_escape",sep="\t")

state_data = pd.read_csv("StateProvince8.csv",encoding="unicode_escape",sep="\t",on_bad_lines="warn",header=0)

We load each of the relational tables into its own DataFrame.

Assign columns to DataFrame

Some CSV files contain header column names, but this particular set does not have them, so we assign names for our convenience.

# define columns since these files do not have headers
# prefixed names (e.g "add." "state.") makes it easier to identify join source later

address_data.columns = ["add.id","add.line1","add.line2","add.city","add.state","add.postalcode","add.spatialloc","add.guid","add.modifieddate"]

state_data.columns = ["state.id","state.code","state.regioncode","state.isstateprovince","state.name","state.territoryid","state.guid","state.modifieddata"]

sales_data.columns = ["id","revision","orderdate","duedate","shipdate","status","onlineflag","salesnum","ponum","acctnum","custid","salespersonid","territoryid","billtoaddressid","shiptoaddressid","shipmethodid","ccid","ccapproval","currencyrateid","subtotal","taxamt","freight","totaldue","comment","guid","modifieddate"]

This knowledge of the format/order usually comes from your DBA group, in our case we have referred to the schema pdf, pdf explanation, and the sql load file.

Flatten Address to include State

The Address table contains a ‘StateProvinceID’ field instead of the State name, so we need to do an inner join to flatten out the DataFrame.

# inner join on Address add.state-> state.stateid
address_data = pd.merge(address_data,state_data,how="left",left_on="add.state",right_on="state.id",suffixes=(None,"_s") )

The address_data DataFrame now contains State name information (col=”state.name”).

Flatten Sales to include Address

The SalesOrderHeader table contains a ‘BillToAddressId’ field instead of the State name, so we need to do an inner join to flatten out the DataFrame.

# inner join on Sales billtoaddressid -> add.id
sales_data = pd.merge(sales_data,address_data,how="left",left_on="billtoaddressid",right_on="add.id",suffixes=(None,"_a") )

The sales_data DataFrame now contains the State name information (col=”state.name”).

Create DataFrame grouping

With all this flattening, our sales_data DataFrame now has a lot of columns.  Too many for a compact display of data, so we will create a new DataFrame with just the columns we want to analyze and display.

From this DataFrame, we will group by ‘state.name’, and then create a new synthesized column ‘meantaxrate’ containing the average tax rate for the state.

# select new DataFrame, with just a few columns we want to analyze
sales_data_select_columns = sales_data[ ["state.name", "taxamt", "totaldue"] ]

# sum up the total taxes and purchased, by State
flattened_and_grouped = sales_data_select_columns.groupby(['state.name']).sum()

# create synthesized mean tax rate column based on tax/total
flattened_and_grouped = flattened_and_grouped.assign(meantaxrate = lambda x: x['taxamt']/x['totaldue'] )

If we were just doing data analyis, we would probably stop here and use ‘to_csv‘ to save the results.

Format column data

Since we want to display it on the console, we will format the columns for easier viewing.

First we transform the columns for tax and total amount with a dollar sign and “k” for currency.  Then we change the tax rate to show 3 significant digits after the decimal point.

# format tax and total columns into currency for viewing
for col in ['taxamt','totaldue']:
  flattened_and_grouped[col] = flattened_and_grouped[col].apply(lambda x: "${:.1f}k".format((x/1000)))

# format tax rate float to 3 significant decimal points
flattened_and_grouped['meantaxrate'] = flattened_and_grouped['meantaxrate'].apply(lambda x: "{0:.3f}".format(x) )

Display DataFrame

Finally, we show the DataFrame summary of 10 rows.

# show 10 row summary of DataFrame
with pd.option_context('display.max_rows',10):
  print(flattened_and_grouped)

 

REFERENCES

pandas, merge dataframes explanation

pandas, merge method documentation

pandas, indexing and selecting

pandas, read_csv reference

github microsoft, Adventure Works CSV files

blog.jpries, Adventure Works explanation

blog.jpries, AdventureWorks schema illustrated as PDF

dataedo.com, pdf explains each field in Adventure Works

svitla.com, dataframes and joining with pandas with diagram of join types

askpython, pandas option_context to show certain row count

stackovervflow, dataframe select versus copy

datagy.io, pandas select and copy

re-thought.com, adding new column in dataframe