After each monthly close, I get several different Excel files that contain all the granular details of how revenue is accrued for each SKU/campaign/project for that specific month. Per ASC 606, revenue is generally accrued when the firm has delivered services as specified in the contract, or, stated another way, when clients have started receiving the benefits of the firm’s services. “The devil is in the details”: drilling into the line-item details of monthly revenue booking not only provides insights for my variance analysis, but also helps me understand the true operation/delivery picture.
There are two main issues when working with these Excel files:
These files have different structures and formats, as they often come from different systems (each dedicated to a different division/subsidiary). At one of my former companies, thanks to 7 add-on M&A deals, there could be 10 different revenue accrual files to deal with, each with its own quirks.
These files are often massive, containing 10k+ rows and 100+ columns. Not all columns are needed for my analysis, and they are not always organized in a logical, easy-to-follow order.
Thus, I need to AUTOMATE the process of extracting, organizing, and consolidating data from different files with varying formats, because manually copy-pasting 30+ columns from each of 10 different Excel files is NEVER my style.
There are two ways/libraries to accomplish this, and we will try both here.
1. USE PANDAS DATAFRAMES
First, let’s read the data from the Excel files into DataFrames to see their sizes:
# Import libraries to use
import pandas as pd

# Set display format for numbers
pd.options.display.float_format = '{:,.0f}'.format

# Specify the path to the Excel files
folder_path = 'C:/Users/Eliza Bui/Automation Illustration/Extract and Consolidate Data/'
report1_name = '2023.06.30 RevRec Snapshot ASC606 - Parent.xlsx'
report2_name = '2023.06.30 RevRec Snapshot ASC606 - Division A.xlsx'
report1_file_path = folder_path + report1_name
report2_file_path = folder_path + report2_name

# Read the Excel files into pandas DataFrames
report1_df = pd.read_excel(report1_file_path)
report2_df = pd.read_excel(report2_file_path)
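To confirm the sizes before any reshaping, one option (a small check added here for illustration, not part of the original workflow) is to print each DataFrame's .shape:

# Quick sanity check on the dimensions of each report
print(f"Report 1 (Parent): {report1_df.shape[0]:,} rows x {report1_df.shape[1]} columns")
print(f"Report 2 (Division A): {report2_df.shape[0]:,} rows x {report2_df.shape[1]} columns")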
We can see that report 1 for the Parent firm has 104 columns and 58k+ rows, and report 2 for Division A has 135 columns and 10k+ rows.
Now I need to specify a list of the specific columns I want from report 1 and put them in my preferred order. I also need to check whether any column name is not showing up in the report, likely due to name changes/report updates. Then, using the df.info() function, I can quickly check that all the columns I want are now in the 'report1_reorg_df' DataFrame.
Report1_Column_List = ['RR Number', 'Performance Obligation Line Item Number', 'Account Formal Name',
                       'Account ID', 'Counter ID', 'Subscription #', 'Contract Number', 'Program Code',
                       'Product: Product Name', 'Product: Product Code', 'Parent Product: Product Code',
                       'Parent Product: Product Name', 'Product: Product Family', 'Fee Type', 'Risk Status',
                       'Risk Type', 'Recognition Start Date', 'Recognition End Date', 'Start Date', 'End Date',
                       'Revenue', 'GAAP Amount', 'Recognized to Date', 'Deferred Amount', 'Subscription Type',
                       'Line Status', 'Deployment ID', 'Deployment Name', 'Comments', 'Contract Term Type',
                       'Subscription Type', 'Revenue Recognition Template: Template Name', 'Contracted Quantity',
                       'Actual Quantity', 'Contracted Hours', 'Actual Hours', 'Performance Obligation Complete',
                       'Cost Center']

# Put specific columns needed into a new dataframe
report1_reorg_df = report1_df[Report1_Column_List]

# Double check output
report1_reorg_df.info()
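The same missing-column check applied to report 2 below can also be run against report 1; a minimal sketch using the list defined above:

# Check if each column in Report1_Column_List is present in report1_df
missing_columns_report1 = [col for col in Report1_Column_List if col not in report1_df.columns]

if not missing_columns_report1:
    print("All columns wanted are present in report1_df.")
else:
    print("Missing columns in report1_df:")
    print(missing_columns_report1)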
All 38 columns I need to work with are showing up as expected.

Now the exact same process can be used with report 2:

Report2_Column_List = ['Revenue: RR Number', 'Historical Linker', 'Revenue: ID', 'Account Name', 'Cost Center',
                       'Product', 'Program', 'Program Code', 'Fee Type', 'Risk Status', 'Risk Type',
                       'Recognition Start Date', 'Recognition End Date', 'GAAP Amount', 'Actual Recognized to Date',
                       'Deferred Revenue (GAAP)', 'Contract Name', 'Contract Year', 'Contract Receipt Date',
                       'Contract Status', 'Contract Term Type', 'Contract Type', 'Contract Start', 'Contract End',
                       'Rev Rec Complete']

# Check if each column in Report2_Column_List is in report2_df
missing_columns_report2 = [col for col in Report2_Column_List if col not in report2_df.columns]

# Print the result
if not missing_columns_report2:
    print("All columns wanted are present in report2_df.")
else:
    print("Missing columns in report2_df:")
    print(missing_columns_report2)

# Put specific columns needed into a new dataframe
report2_reorg_df = report2_df[Report2_Column_List]

# Double check output
report2_reorg_df.info()

All 25 columns I need to work with are showing up as expected.
The last step is to write the two new DataFrames to a consolidated Excel file.
# Put current snapshot date in the file path
snapshot_date_name_to_save = '2023.06.30'
excel_file_name = f'Consolidated RevRec Snapshot {snapshot_date_name_to_save}.xlsx'
excel_file_path = folder_path + excel_file_name

# Create an Excel writer object
with pd.ExcelWriter(excel_file_path, engine='xlsxwriter') as writer:
    # Write DataFrame 1 to a new tab in Excel
    report1_reorg_df.to_excel(writer, sheet_name='Report1', index=False)
    # Write DataFrame 2 to a new tab in Excel
    report2_reorg_df.to_excel(writer, sheet_name='Report2', index=False)

# Display the Excel file path
print(f"Excel file saved to: {excel_file_path}")
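As an optional sanity check (added here for illustration), the consolidated file can be read back to confirm that both tabs were written:

# Read the consolidated file back and list its tabs
consolidated = pd.ExcelFile(excel_file_path)
print(consolidated.sheet_names)  # Expected: ['Report1', 'Report2']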
2. USE openpyxl and re
# Import libraries
import glob, os, openpyxl, re
from openpyxl.utils import get_column_letter

# Get directory and file name
os.chdir("C:/Users/Eliza Bui/Automation Illustration/Extract and Consolidate Data/")

wb_parent_name = None
for file in glob.glob("*RevRec Snapshot ASC606*.xlsx"):
    if "Parent" in file:
        wb_parent_name = file
        break
else:
    print("File not found")

# Open RevRec file, extract needed columns, then write to a new tab within the same file
wb_parent = openpyxl.load_workbook(wb_parent_name)

# Create a new tab to copy data over
ws2 = wb_parent.create_sheet('New DataCut')
ws1 = wb_parent.worksheets[0]

# Create a list of specific columns in preferred order
Report1_Column_List = ['RR Number', 'Performance Obligation Line Item Number', 'Account Formal Name',
                       'Account ID', 'Counter ID', 'Subscription #', 'Contract Number', 'Program Code',
                       'Product: Product Name', 'Product: Product Code', 'Parent Product: Product Code',
                       'Parent Product: Product Name', 'Product: Product Family', 'Fee Type', 'Risk Status',
                       'Risk Type', 'Recognition Start Date', 'Recognition End Date', 'Start Date', 'End Date',
                       'Revenue', 'GAAP Amount', 'Recognized to Date', 'Deferred Amount', 'Subscription Type',
                       'Line Status', 'Deployment ID', 'Deployment Name', 'Comments', 'Contract Term Type',
                       'Subscription Type', 'Revenue Recognition Template: Template Name', 'Contracted Quantity',
                       'Actual Quantity', 'Contracted Hours', 'Actual Hours', 'Performance Obligation Complete',
                       'Cost Center']

# Loop through all column headers, look for the one matching the indexed element of the Column List
for i in range(len(Report1_Column_List)):
    for j in range(1, ws1.max_column + 1):
        if ws1.cell(row=1, column=j).value == Report1_Column_List[i]:
            Col_Letter = get_column_letter(j)
            Current_Col = ws1[Col_Letter + '1:' + Col_Letter + str(ws1.max_row)]
            for row_id, tuple_cell in enumerate(Current_Col, 1):
                for cell in tuple_cell:
                    ws2.cell(row=row_id, column=i + 1).value = cell.value

# Save file
wb_parent.save(wb_parent_name)
print("Parent RevRec snapshot extraction DONE!")
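The re module imported above is not actually used in this block; one way it could be put to work (a hypothetical sketch, assuming the file name starts with a 'YYYY.MM.DD' snapshot date as in '2023.06.30 RevRec Snapshot ASC606 - Parent.xlsx') is to pull the snapshot date out of the matched file name:

# Hypothetical use of re: extract the snapshot date (YYYY.MM.DD) from the file name
match = re.search(r"\d{4}\.\d{2}\.\d{2}", wb_parent_name)
if match:
    print(f"Snapshot date parsed from file name: {match.group()}")
else:
    print("No snapshot date found in file name.")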
Notebook in my GitHub repository:
https://github.com/ExcellentBee/Learning-Everyday/blob/main/Extract%20%26%20Consolidate%20Monthly%20Close%20File.ipynb