ScrapE financial statement data from polygon.io using python

Polygon.io is a financial data platform that provides real-time and historic market data for stocks, cryptocurrencies, and other financial instruments. It offers a wide range of APIs and services designed for developers, traders, and financial professionals to access and analyze market data.

I want to analyze data from historical financial statements for a large number of public firms, mostly Income Statement, Balance Sheet, and Cash Flow Statement from 10-K and 10-Q forms. Thus, I tried out polygon.io API for Financial Statements extraction. Yet, the impression so fast is best described as “there is room for improvements”. I tested extracting data for PowerSchool, PWSC, in this blog post and the final output doesn’t include many data fields I need to perform some analysis I want to do.

Link to GitHub’s repo: https://github.com/ExcellentBee/Scrape-polygon.io/blob/main/Scrape%20polygon.io%20for%20financial%20statement%20data.ipynb

  1. Peek at the structure of data returned from api call

First, the usual import libraries, set up API key and call data for a sample stock “PWSC”

# install package
#pip install polygon-api-client

# Import libraries
import polygon
from collections import Counter
import inspect
import pandas as pd
import numpy as np

# Set the global display format for numbers with commas
pd.options.display.float_format = '{:,.0f}'.format

# Get free API key from polygon to access data
my_API_key = 'ELIZA-BUI'
client = polygon.RESTClient(my_API_key)

# Get an example ticker to inspect data
example_ticker = 'PWSC'

# Call data from API
data = client.vx.list_stock_financials(ticker = example_ticker)

The output is a generator, which is an iterable object that produces values one at a time, not list or dictionary. Using the var() function, we can see the list of keys stored in each item of this generator.

Extract Basic Stock Ticker Data


# Get a list of attribute
attribute_keys = vars(current_row).keys()

# Initialize an empty dictionary to extract its value
attribute_dict = {}

# Extract attribute values to put in the new dictionary
for attribute_name in attribute_keys:
    attribute_value = getattr(current_row, attribute_name)
    attribute_dict[attribute_name] = attribute_value
    
# Get a smaller dictionary for basic info
meta_data_dict = {key: value for key, value in attribute_dict.items() if key != 'financials'}
meta_data_df = pd.DataFrame(meta_data_dict, index = [0])

# Define the desired column order
desired_col_order = ['cik','company_name', 'start_date', 'end_date', 'filing_date', 'fiscal_year', 'fiscal_period', 
                     'source_filing_file_url', 'source_filing_url']
meta_data_df = meta_data_df[desired_col_order]

# Display output dataframe
display(meta_data_df)

# Look at attribute keys of Financials data
financial_keys = vars(attribute_dict['financials']).keys()
display(financial_keys)

# Iterate through each key in financial_keys to inspect the data type of each
for key in financial_keys:
    financial_statement_dict = getattr(attribute_dict['financials'], key)
    print(f"\nType of {key}: ")
    print(type(financial_statement_dict))

The first unusual thing here is that Balance Sheet has dict type but other 3 financial statements: Cash Flow Statement, Comprehensive Income Statement, Income Statement. Thus I will need to process these datasets differently

Extract balance sheet data

Extract Cash flow data

We can continue checking the other 2 financial statements as in my Jupiter notebook

2. extract the whole dataset available for a ticker


# Initialize an empty list to store DataFrames from each iteration
result_df = pd.DataFrame()

# Get a count of entries in the object returned by API call
component_count = len(data_list)

for i in range(component_count):
    # Get a list of attribute
    attribute_keys = vars(data_list[i]).keys()

    # Initialize an empty dictionary to extract its value
    attribute_dict = {}

    # Extract attribute values to put in the new dictionary
    for attribute_name in attribute_keys:
        attribute_value = getattr(data_list[i], attribute_name)
        attribute_dict[attribute_name] = attribute_value

    # Get a smaller dictionary for basic info
    meta_data_dict = {key: value for key, value in attribute_dict.items() if key != 'financials'}
    meta_data_df = pd.DataFrame(meta_data_dict, index = [0])

    # Define the desired column order
    desired_col_order = ['cik','company_name', 'start_date', 'end_date', 'filing_date', 'fiscal_year', 'fiscal_period', 
                         'source_filing_file_url', 'source_filing_url']
    meta_data_df = meta_data_df[desired_col_order]
    
    # Get a list of keys from attribute 'financials'
    financial_keys = vars(attribute_dict['financials']).keys()
    
    # Put financials object into a variable
    financial_object = attribute_dict['financials']
    
    # Create an empty list to store DataFrames to consolidate all df extracted from BS, CF, CIS, IS
    financial_statement_dfs = []

    # Loop through each key in financials attribute of each data_list[index] row
    for eachkey in financial_keys:

        # Extract value from financial object - could be a dictionary or polygon object
        financial_statement_obj = getattr(financial_object, eachkey)

        # If financial_statement_obj type is a dictionary, then do nothing, 
        # but if the type is a polygon object, then convert it to a dictionary using vars()
        if isinstance(financial_statement_obj, dict):
            # Put in dictionary with a standardized name
            financial_statement_dict = financial_statement_obj
        
        elif isinstance(financial_statement_obj, (polygon.rest.models.CashFlowStatement, 
                                                  polygon.rest.models.ComprehensiveIncome, 
                                                  polygon.rest.models.IncomeStatement)):
            
            financial_statement_dict = vars(financial_statement_obj)
        
        else:
            # Put the rest to empty dictionary
            financial_statement_dict = {}

        # Initialize an empty dictionary to store data
        extracted_data = {}

        # Iterate through the dictionary
        for key, datapoint in financial_statement_dict.items():
            extracted_data[key] = datapoint.value

        # Create a DataFrame to put extracted data dictionary into
        financial_statement_df = pd.DataFrame(extracted_data, index=[0])

        # Append the DataFrame to the list
        financial_statement_dfs.append(financial_statement_df)

    # Concatenate all DataFrames in the list column-wise
    combined_df = pd.concat(financial_statement_dfs, axis=1)
    
    # Add meta_data columns to combined_df
    combined_df = pd.concat([meta_data_df, combined_df], axis=1)
    
    # Display the number of columns and their names in the DataFrame
    #print(f"DataFrame {i+1} - Number of columns: {combined_df.shape[1]}")
    
    # Check for duplicate columns in the combined_df
    #duplicate_columns = [col for col, count in Counter(combined_df.columns).items() if count > 1]
    #if duplicate_columns:
    #    print(f"DataFrame {i+1} - Duplicate columns in combined DataFrame: {', '.join(duplicate_columns)}")
    
    # Change empty cells to np.NaN
    combined_df = combined_df.replace('', np.nan)
    
    # Change empty cells to np.NaN
    combined_df = combined_df.replace('None', np.nan)
        
    # Append the resulting DataFrame to the list for all entries in query
    result_df = pd.concat([result_df,combined_df], axis=0, ignore_index=True)

# Display the resulting combined DataFrame
display(result_df)

There are A lot of columns with missing value and list of financial statement line items available are significantly shorter than other data sources like yahoo finance or macrotrend. Thus, this vendor API is not yet helpful to my analysis of financial statements for a large number of stocks

Previous
Previous

Use Python to extract and consolidate monthly close files

Next
Next

Scrape Data from Yahoo Finance