Downloaded the announcement of a listed company from cninfo.com.cn and saved it as a PDF file.
The relevant title about it is 国泰君安 - 关于独立董事公开征集投票权的公告.

Let’s extract text and table on the PDF file by python. We have to make sure the library pdfplumber had been installed. Use the command pip install pdfplumber to prepare the environment if there is no pdfplumber.

Extract Text

#! /usr/local/bin/python3
# -*- coding: utf-8 -*-

import pdfplumber

if __name__ == '__main__':
    pdf = pdfplumber.open( '/Users/weiyang/Desktop/国泰君安.PDF' )
    pages = pdf.pages
    target_page = pages[5]
    content = target_page.extract_text()
    print( content )
    pdf.close()

Extract Table

The table in PDF:

The test script:

#! /usr/local/bin/python3
# -*- coding: utf-8 -*-

import pdfplumber

if __name__ == '__main__':
    pdf = pdfplumber.open( '/Users/weiyang/Desktop/国泰君安.PDF' )
    pages = pdf.pages
    target_page = pages[5]
    tables = target_page.extract_tables()
    print( tables[0] )
    pdf.close()

Output:

[['序号', '非累积投票议案名称', '同意', '反对', '弃权'], ['1', '关于提请审议公司 A股限制性股票激励计划(草\n案)及其摘要的议案', '', '', ''], ['2', '关于提请审议公司A股限制性股票激励计划实施\n考核管理办法的议案', '', '', ''], ['3', '关于授权董事会办理A股限制性股票激励计划相\n关事宜的议案', '', '', '']]

The list format is not friendly to humans. So we have to use the library pandas to handle it and show the data by the local xlsx file.

    table = tables[0]

    #remove `\n`
    for i in range( len(table) ):
        for j in range( len(table[i]) ):
            table[i][j] = table[i][j].replace('\n', '')

    df = pd.DataFrame( table[1:], columns=table[0] )
    # show all column
    pd.set_option('display.max_columns', None)
    df.to_excel( 'data.xlsx', index=False )

Output:

Categories: Python

0 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

You cannot copy content of this page