Crawling Financial Reports

Update in 2024: Now it can be done easily with some open-source tools such as yfinance or yahoo_fin or Zipline or Pyfolio. However, I still keep this post for fun and as a memory of my journey to learn how to invest in the stock market.

In 2022, armed with a bit of extra cash and an abundance of free time or being procrastinate (a defining characteristic of a Ph.D. student, as you might already know 😊) I started to learn how to invest in the ASX stock market. As one might expect, I lost almost all of my money because of following the hot trend “Buy Now Pay Later” at that time with ZiP and IoU. It didn’t take long for me to realize that perhaps I should not follow the hot trend and instead invest in the company that I understand and believe in its long-term growth as adviced by Warren Buffett. I tried to read the financial reports of some companies but soon realized that it was incredibly time-consuming endeavor. So I decided to crawl the financial reports of all the companies in the ASX and build some automated tools to help me analyze them. And here is it. The code is available on Github here.

What are the financial reports?

There are three main financial reports: Balance Sheet, Income Statement, and Cash Flow Statement. The Balance Sheet shows the assets, liabilities, and equity of a company at a specific point in time. The Income Statement shows the revenues, expenses, and profits of a company over a period of time. The Cash Flow Statement shows the cash inflows and outflows of a company over a period of time. The Balance Sheet, Income Statement, and Cash Flow Statement are the three most important financial statements for analyzing a company’s financial performance.

Where to get the financial reports?

There are several platforms such as Yahoo Finance, that you can get the financial reports of the companies not only ASX but also other stock markets such as NYSE, NASDAQ, etc. However, I found that the most friendly platform is Market Watch because it provides the financial reports in a table format, which is easy to crawl. For example, the financial reports of ASX is shown as follows:

It also provides in annual and quarterly format.

How to crawl the financial reports?

Actually, this tool is not something fancy and difficult to implement, so please ease your expectation and go easy on me .

Utilities

Get the url of the financial reports of a company given its share code. There are 6 urls in total: 2 for Balance Sheet, 2 for Income Statement, and 2 for Cash Flow Statement, annual and quarterly. Market Watch organizes these reports in very reasonable urls, so it is easy to get the urls of the reports given the share code. The following function is used to get the urls of the reports given the share code. You can change the country code to get the reports of other stock markets.

def get_url(share_code):
    """
    Get the url of the financial reports of a company given its share code. 
    There are 6 urls in total: 2 for Balance Sheet, 2 for Income Statement, and 2 for Cash Flow Statement, annual and quarterly.
    """
    url_bs_y = 'https://www.marketwatch.com/investing/stock/{}/financials/balance-sheet?countrycode=au'.format(share_code)
    url_bs_q = 'https://www.marketwatch.com/investing/stock/{}/financials/balance-sheet/quarter?countrycode=au'.format(share_code)

    url_is_y = 'https://www.marketwatch.com/investing/stock/{}/financials?countrycode=au'.format(share_code)
    url_is_q = 'https://www.marketwatch.com/investing/stock/{}/financials/income/quarter?countrycode=au'.format(share_code)

    url_cf_y = 'https://www.marketwatch.com/investing/stock/{}/financials/cash-flow?countrycode=au'.format(share_code)
    url_cf_q = 'https://www.marketwatch.com/investing/stock/{}/financials/cash-flow/quarter?countrycode=au'.format(share_code)

    url_dict = dict()
    url_dict['bs_y'] = url_bs_y
    url_dict['bs_q'] = url_bs_q
    url_dict['is_y'] = url_is_y
    url_dict['is_q'] = url_is_q
    url_dict['cf_y'] = url_cf_y
    url_dict['cf_q'] = url_cf_q
    return url_dict

Because there are a lot of features in the financial reports, I only focus on some key features that I think are important. For example, in the Balance Sheet, I only focus on the Assets and Liabilities, with some important keys such as Cash & Short Term Investments, Inventories, Total Current Assets, Intangible Assets, Total Assets, etc. The get_table_keys function is used to get some key features of the financial reports given the sheet key.

Main Function

The main function is shown as follows. It reads all the ASX share codes from the asx-companies-list.csv file (sorry, I forgot how to get this list). Then it iterates through all the share codes and run process_one function to crawl the financial reports of each company. The process_one function is shown in the next section.

Important notes: Maybe Market Watch has some mechanism to prevent the crawler, so sometimes it will return an error. In this case, I add the error share codes to a list and save it to a file. Then I can run the crawler again with the error share codes to get the reports of these companies. Maybe adding some delay between each request will help to prevent the error.

def main_process():
    # Read all codes 
    df = pd.read_csv('asx-companies-list.csv')
    codes = df['Code'] # format 'ASX:BHP'
    # codes = ['ASX:CBA', 'ASX:NAB', 'ASX:WBC', 'ASX:ANZ', 'ASX:BEN', 'ASX:BOQ'] # custom codes

    error_codes = []
    for idx, share_code in enumerate(codes): 
        share_code = share_code.replace('ASX:','').lower()
        print('---------------')
        print('Processing {}/{} - code = {}'.format(idx+1, len(codes), share_code))

        try:
            output = process_one(share_code)  
            if output is not True: 
                print(output)
                error_codes.append(share_code)      
        except Exception as e: 
            print(e)
            error_codes.append(share_code)

    df_error_codes = pd.DataFrame(error_codes, columns=['code'])
    df_error_codes.to_csv('error_codes.csv')

Process One Function

The following code is the main function to crawl the financial reports of a company given its share code. It iterates through all the sheets (Balance Sheet, Income Statement, and Cash Flow Statement) and all the tables in each sheet. Then it saves the reports to an excel file. It uses BeautifulSoup to parse the html.

RUN_FILES = ['bs_y', 'is_y', 'cf_y', 'bs_q', 'is_q', 'cf_q']
TIME_SLEEP = 2 # second

def process_one(share_code): 
    out_file_name = 'asx/{}.xlsx'.format(share_code)
    raw_file_name = 'asx/raw_{}.npy'.format(share_code)

    url_dict = get_url(share_code)

    data = dict()

    try: 
        for sheet_key in RUN_FILES:
            time.sleep(TIME_SLEEP)
            # print('---------------')
            print('Start crawling share={}, report={}'.format(share_code, sheet_key))
            url_ = url_dict[sheet_key]
            table_keys = get_table_keys(sheet_key)

            req = Request(url_, headers={'User-Agent': 'Mozilla/5.0'})    
            read_data = urlopen(req).read()
            soup_is= BeautifulSoup(read_data,'lxml')

            ls= [] # Create empty list
            for l in soup_is.find_all('div'): 
                #Find all data structure that is ‘div’
                ls.append(l.string) # add each element one by one to the list


            exclude_keys = [
                'Operating Expenses',
                'Non-recurring Events',
                table_keys['period_key'],
            ]
            ls = [e for e in ls if e not in exclude_keys] # Exclude those columns

            new_ls = list(filter(None,ls))

            def cut_invalid_in_end(dataframe): 
                assert(len(dataframe[0]) == len(dataframe[1]))
                for i in range(len(dataframe[0])):
                    if dataframe[0][i] != dataframe[1][i]:
                        break 
                invalid = range(i,len(dataframe[0]))
                new_dataframe = dataframe.drop(invalid, axis=0)
                return new_dataframe


            start_idxes = []
            for idx, e in enumerate(new_ls):
                if e == 'Item':
                    start_idxes.append(idx)

            assert(len(start_idxes) == 2 * table_keys['num_tables'])

            list_df = []
            for t_idx in range(table_keys['num_tables']): 
                if t_idx == table_keys['num_tables'] - 1:
                    short_ls = new_ls[start_idxes[t_idx*2]:]
                else:
                    short_ls = new_ls[start_idxes[t_idx*2]:start_idxes[t_idx*2+2]]
                
                is_data = list(zip(*[iter(short_ls)]*7))
                df = pd.DataFrame(is_data[0:])
                new_df = cut_invalid_in_end(df)
                # data[table_keys['tab_names'][t_idx]] = new_df
                list_df.append(new_df)

            # Concat all tables in one sheet 
            list_df = pd.concat(list_df)

            # Remove first col 
            data[sheet_key] = list_df.drop(list_df.columns[0], axis=1)

            print('Finish crawling {}'.format(sheet_key))

        with pd.ExcelWriter(out_file_name) as writer:
            for sheet_key in RUN_FILES:
                # use to_excel function and specify the sheet_name and index
                # to store the dataframe in specified sheet
                data[sheet_key].to_excel(writer, sheet_name=sheet_key, index=False)
        return True 
    except Exception as e:
        return e 

How to compare the financial reports?

In this project, I also developed a tools to compare some finanical indicators among companies in the same sector. The code can be found in the file 02_compare_fin_reports.py.

Let me go through top-down the code, from the main function to the helpers.

Main Process

The following code helps to compare between Dairy companies provided by codes = ['ASX:A2M', 'ASX:BUB', 'ASX:CLV', 'ASX:AHF', 'ASX:SM1', 'ASX:NUC']. It reads the financial reports of these companies from the excel files saved in the previous section, to build a dataframe for each company. Then it compares the indicators of these companies and save the results to an excel file. You can do some fancy visualization with these results.

The final results is something like this:

def main_process():
    # Read all codes 
    out_file_name = 'compare/dairy.xlsx'
    codes = ['ASX:A2M', 'ASX:BUB', 'ASX:CLV', 'ASX:AHF', 'ASX:SM1', 'ASX:NUC']

    all_data = dict()
    for idx, share_code in enumerate(codes): 
        share_code = share_code.replace('ASX:','').lower()
        print('---------------')
        print('Processing {}/{} - code = {}'.format(idx+1, len(codes), share_code))

        all_data[share_code] = process_one(share_code)  

    data = dict()
    data['annual'] = compare_report(all_data, 'annual')
    data['quarter'] = compare_report(all_data, 'quarter')

    with pd.ExcelWriter(out_file_name) as writer:
        data['annual'].to_excel(writer, sheet_name='annual', index=False)
        data['quarter'].to_excel(writer, sheet_name='quarter', index=False)

Helpers

Below are some important helpers that are used in the main function. To avoid the post is too long, I just explain the main idea of each helper. You can find the full code in the Github repo.

Process One Function: To read the financial reports from the excel files saved in the previous section, add new indicators, and return a dataframe. It combines all three reports (Balance Sheet, Income Statement, and Cash Flow Statement) into one dataframe.

Compare Report Function: Read the dataframe, remove unimportant indicators, and compare the important ones. Indicators can be concatenated vertically (as shown in the above figure) or horizontally.

Add New Indicators Function: This function helps to create new indicators from the existing ones. For example, the Current Ratio is calculated from Total Current Assets and Total Current Liabilities, or Quick Ratio is calculated from Total Accounts Receivable and Total Current Liabilities.

Imporatant Indicators are defined in the global variable INDICATORS. You can add more indicators in your own preference. Here is my list

INDICATORS = [
    'Item',
    # Income statement 
    'Gross Income', 
    'Gross Income Growth',
    'Gross Profit Margin', 
    'Pretax Income', 
    'Pretax Margin', 
    'Net Income', 
    'EBITDA',
    # Balance sheet 
    'Cash & Short Term Investments',
    'Inventories', 
    'Total Current Assets',
    'Intangible Assets', 
    'Total Assets',
    'Total Current Liabilities',
    'Current Ratio', 
    'Quick Ratio', 
    'Accounts Receivable Turnover',
    'Inventory Turnover',
    # Cash flow 
]