ScrapE financial statement data from polygon.io using python
Polygon.io is a financial data platform that provides real-time and historic market data for stocks, cryptocurrencies, and other financial instruments. It offers a wide range of APIs and services designed for developers, traders, and financial professionals to access and analyze market data.
I want to analyze data from historical financial statements for a large number of public firms, mostly Income Statement, Balance Sheet, and Cash Flow Statement from 10-K and 10-Q forms. Thus, I tried out polygon.io API for Financial Statements extraction. Yet, the impression so fast is best described as “there is room for improvements”. I tested extracting data for PowerSchool, PWSC, in this blog post and the final output doesn’t include many data fields I need to perform some analysis I want to do.
Link to GitHub’s repo: https://github.com/ExcellentBee/Scrape-polygon.io/blob/main/Scrape%20polygon.io%20for%20financial%20statement%20data.ipynb
Peek at the structure of data returned from api call
First, the usual import libraries, set up API key and call data for a sample stock “PWSC”
# install package #pip install polygon-api-client # Import libraries import polygon from collections import Counter import inspect import pandas as pd import numpy as np # Set the global display format for numbers with commas pd.options.display.float_format = '{:,.0f}'.format # Get free API key from polygon to access data my_API_key = 'ELIZA-BUI' client = polygon.RESTClient(my_API_key) # Get an example ticker to inspect data example_ticker = 'PWSC' # Call data from API data = client.vx.list_stock_financials(ticker = example_ticker)
The output is a generator, which is an iterable object that produces values one at a time, not list or dictionary. Using the var() function, we can see the list of keys stored in each item of this generator.
Extract Basic Stock Ticker Data
# Get a list of attribute attribute_keys = vars(current_row).keys() # Initialize an empty dictionary to extract its value attribute_dict = {} # Extract attribute values to put in the new dictionary for attribute_name in attribute_keys: attribute_value = getattr(current_row, attribute_name) attribute_dict[attribute_name] = attribute_value # Get a smaller dictionary for basic info meta_data_dict = {key: value for key, value in attribute_dict.items() if key != 'financials'} meta_data_df = pd.DataFrame(meta_data_dict, index = [0]) # Define the desired column order desired_col_order = ['cik','company_name', 'start_date', 'end_date', 'filing_date', 'fiscal_year', 'fiscal_period', 'source_filing_file_url', 'source_filing_url'] meta_data_df = meta_data_df[desired_col_order] # Display output dataframe display(meta_data_df) # Look at attribute keys of Financials data financial_keys = vars(attribute_dict['financials']).keys() display(financial_keys) # Iterate through each key in financial_keys to inspect the data type of each for key in financial_keys: financial_statement_dict = getattr(attribute_dict['financials'], key) print(f"\nType of {key}: ") print(type(financial_statement_dict))
The first unusual thing here is that Balance Sheet has dict type but other 3 financial statements: Cash Flow Statement, Comprehensive Income Statement, Income Statement. Thus I will need to process these datasets differently
Extract balance sheet data
Extract Cash flow data
We can continue checking the other 2 financial statements as in my Jupiter notebook
2. extract the whole dataset available for a ticker
# Initialize an empty list to store DataFrames from each iteration result_df = pd.DataFrame() # Get a count of entries in the object returned by API call component_count = len(data_list) for i in range(component_count): # Get a list of attribute attribute_keys = vars(data_list[i]).keys() # Initialize an empty dictionary to extract its value attribute_dict = {} # Extract attribute values to put in the new dictionary for attribute_name in attribute_keys: attribute_value = getattr(data_list[i], attribute_name) attribute_dict[attribute_name] = attribute_value # Get a smaller dictionary for basic info meta_data_dict = {key: value for key, value in attribute_dict.items() if key != 'financials'} meta_data_df = pd.DataFrame(meta_data_dict, index = [0]) # Define the desired column order desired_col_order = ['cik','company_name', 'start_date', 'end_date', 'filing_date', 'fiscal_year', 'fiscal_period', 'source_filing_file_url', 'source_filing_url'] meta_data_df = meta_data_df[desired_col_order] # Get a list of keys from attribute 'financials' financial_keys = vars(attribute_dict['financials']).keys() # Put financials object into a variable financial_object = attribute_dict['financials'] # Create an empty list to store DataFrames to consolidate all df extracted from BS, CF, CIS, IS financial_statement_dfs = [] # Loop through each key in financials attribute of each data_list[index] row for eachkey in financial_keys: # Extract value from financial object - could be a dictionary or polygon object financial_statement_obj = getattr(financial_object, eachkey) # If financial_statement_obj type is a dictionary, then do nothing, # but if the type is a polygon object, then convert it to a dictionary using vars() if isinstance(financial_statement_obj, dict): # Put in dictionary with a standardized name financial_statement_dict = financial_statement_obj elif isinstance(financial_statement_obj, (polygon.rest.models.CashFlowStatement, polygon.rest.models.ComprehensiveIncome, polygon.rest.models.IncomeStatement)): financial_statement_dict = vars(financial_statement_obj) else: # Put the rest to empty dictionary financial_statement_dict = {} # Initialize an empty dictionary to store data extracted_data = {} # Iterate through the dictionary for key, datapoint in financial_statement_dict.items(): extracted_data[key] = datapoint.value # Create a DataFrame to put extracted data dictionary into financial_statement_df = pd.DataFrame(extracted_data, index=[0]) # Append the DataFrame to the list financial_statement_dfs.append(financial_statement_df) # Concatenate all DataFrames in the list column-wise combined_df = pd.concat(financial_statement_dfs, axis=1) # Add meta_data columns to combined_df combined_df = pd.concat([meta_data_df, combined_df], axis=1) # Display the number of columns and their names in the DataFrame #print(f"DataFrame {i+1} - Number of columns: {combined_df.shape[1]}") # Check for duplicate columns in the combined_df #duplicate_columns = [col for col, count in Counter(combined_df.columns).items() if count > 1] #if duplicate_columns: # print(f"DataFrame {i+1} - Duplicate columns in combined DataFrame: {', '.join(duplicate_columns)}") # Change empty cells to np.NaN combined_df = combined_df.replace('', np.nan) # Change empty cells to np.NaN combined_df = combined_df.replace('None', np.nan) # Append the resulting DataFrame to the list for all entries in query result_df = pd.concat([result_df,combined_df], axis=0, ignore_index=True) # Display the resulting combined DataFrame display(result_df)
There are A lot of columns with missing value and list of financial statement line items available are significantly shorter than other data sources like yahoo finance or macrotrend. Thus, this vendor API is not yet helpful to my analysis of financial statements for a large number of stocks