MS Office documents are probably one of the most inconvenient and poorly formalized data sources. It's much better to keep all the data in specialized databases or at least in a wiki. But in real life, MS Office documents are in active use in nearly every organization, simply because Office is a flexible and easy tool that anyone can use. That's why it's important to know automated ways of processing such files.
You can easily edit .docx files without any libraries. Technically, a .docx file is just a zip archive, so you can unzip it, make a replacement in the document.xml file, and zip it again. That's much better than dealing with old binary .doc files. But there are even more elegant ways.
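The unzip, replace, and zip steps can be sketched with the standard zipfile module alone. The function name replace_in_docx and the file paths are hypothetical, chosen for illustration. The sketch assumes the placeholder is stored as one contiguous string in word/document.xml (Word sometimes splits text into several pieces) and that the new value contains no characters that need XML escaping.

```python
import zipfile


def replace_in_docx(src_path, dst_path, old, new):
    """Copy a .docx archive, replacing `old` with `new` in word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # Only the main document part is edited; every other
            # archive member is copied unchanged.
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                data = text.replace(old, new).encode("utf-8")
            dst.writestr(item, data)
```

A call such as replace_in_docx('input/template.docx', 'output/result.docx', 'USERNAME', 'John') would write a modified copy and leave the template untouched.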
Let's say we need to read data from an .xlsx document and generate .docx files based on an existing template. To work with .xlsx files, I will use the openpyxl Python library.
I installed the library and opened the input/table.xlsx file:
    # sudo pip install openpyxl
    # sudo pip2 install openpyxl
    import openpyxl

    wb = openpyxl.load_workbook(filename='input/table.xlsx')
Now we can choose the worksheet and get the value of a cell:
    # wb.worksheets is a list, so pick the first sheet by index
    ws = wb.worksheets[0]
    print(ws["A1"].value)
OK, we can read one cell. What about the table header? If the table has 9 columns, we can read it like this:
    # Columns are numbered from 1, so range(1, 10) covers columns 1 to 9
    for col in range(1, 10):
        print(ws.cell(column=col, row=1).value)
    Username
    Surname
    Attribute1
    Attribute2
    Attribute3
    Attribute4
    Attribute5
    Attribute6
    Attribute7
If we want to read the entire table (without the header), we can do this:
    first_cell_value = "1"
    row = 2  # data starts right below the header row
    table = dict()
    # Read until the first cell of a row is empty
    while first_cell_value is not None:
        if not ws.row_dimensions[row].hidden:
            table[row] = dict()
            for col in range(1, 10):
                table[row][col] = ws.cell(column=col, row=row).value
        row += 1
        first_cell_value = ws.cell(column=1, row=row).value
Note that I skipped the hidden rows.
From such a table structure we can get everything we need for analysis.
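As a variation on the loop above, the rows can be keyed by the header names instead of column numbers, which makes later lookups more readable. This is only a sketch: the helper name read_rows is hypothetical, and it assumes the header sits in row 1 and that ws.max_column reflects the real width of the table. It uses the same openpyxl calls as the code above (ws.cell and ws.row_dimensions).

```python
def read_rows(ws, header_row=1):
    """Return visible data rows as dicts keyed by the header cell values."""
    columns = range(1, ws.max_column + 1)
    header = [ws.cell(row=header_row, column=c).value for c in columns]
    rows = []
    row = header_row + 1
    # Stop at the first row whose first cell is empty, as in the loop above.
    while ws.cell(row=row, column=1).value is not None:
        # Hidden rows are skipped but still counted.
        if not ws.row_dimensions[row].hidden:
            rows.append({name: ws.cell(row=row, column=c).value
                         for name, c in zip(header, columns)})
        row += 1
    return rows
```

With the worksheet from the example, read_rows(ws) would return one dictionary per visible row, so a value can be fetched as rows[0]["Username"] rather than table[2][1].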
Now let's figure out how to work with .docx files. I installed the python-docx library and opened the .docx template file:
    # sudo pip install python-docx
    # sudo pip2 install python-docx
    import docx

    document = docx.Document(docx='input/template.docx')
Here is how we can iterate through the paragraphs of a .docx file, find the keyword "USERNAME", and replace it with the required value "John":
    import re

    for paragraph in document.paragraphs:
        if "USERNAME" in paragraph.text:
            # Replace the keyword everywhere in this paragraph
            paragraph.text = re.sub("USERNAME", "John", paragraph.text)
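One caveat: assigning to paragraph.text replaces the whole paragraph content with a single run, so character formatting such as bold or italic inside that paragraph is lost. A possible alternative, sketched below, is to replace the keyword run by run. The helper name replace_in_runs is hypothetical, and the sketch assumes the keyword sits entirely inside one run; a keyword that Word has split across several runs will not be found this way.

```python
def replace_in_runs(paragraphs, placeholder, value):
    """Replace `placeholder` inside each run so run formatting is kept.

    Returns the number of runs that were changed.
    """
    changed = 0
    for paragraph in paragraphs:
        for run in paragraph.runs:
            if placeholder in run.text:
                # Only the text of this run changes; its font settings stay.
                run.text = run.text.replace(placeholder, value)
                changed += 1
    return changed
```

For the template above this would be called as replace_in_runs(document.paragraphs, "USERNAME", "John"). If it returns 0 although the keyword is visible in the document, the keyword is probably split across runs, and the paragraph-level replacement is the simpler fallback.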
What if we need to read or edit a table in a .docx file? You can read a table like this:
    # Walk every table, row, and cell, and print the text of each paragraph
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    print(paragraph.text)
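If the table contents are needed as data rather than printed output, the same traversal can collect the cell text into nested lists. This is a minimal sketch with a hypothetical helper name, table_to_rows. It relies on the cell.text property of python-docx, which returns the text of all paragraphs in a cell as one string.

```python
def table_to_rows(table):
    """Return the text of each cell, one inner list per table row."""
    return [[cell.text for cell in row.cells] for row in table.rows]
```

For example, table_to_rows(document.tables[0]) would give the first table of the template as a list of rows, with the header row (if the table has one) as the first element.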
To insert rows into a table, we can do this:
    # Each entry is one row to append; the table needs at least 4 columns
    rows_to_add = dict()
    rows_to_add[0] = ['1', '2', '3', '4']
    for table in document.tables:
        for row in rows_to_add:
            cells = table.add_row().cells
            for n in range(len(rows_to_add[row])):
                cells[n].text = rows_to_add[row][n]
And finally, save the modified template to a new file: