Ever since I learned that Microsoft actually introduces new binary file formats in the Office 2007 timeframe at the same time that they are telling everyone that XML is their new mantra, I wondered how the .bin file format would read. At the time of writing, it's totally undocumented. But given that I am an Excel expert, especially the file format, out of curiosity I wanted to know more about this.
A couple introductory elements : when saving an Excel workbook as .xlsb instead of .xslx using the UI, some parts of the new zip file format are saved as .bin zip entries instead of .xml, not all. Typical such parts are the main workbook part, the styles part, the shared strings part, all worksheet parts. In fact, everything that should be .xml except the relationship parts that meet the OPC (open packaging conventions, a sub-file format).
The easiest thing to do to get started is to create a simple file, than save it as .xlsx and as .xlsb ("b" stands for binary). Then open the .bin parts in a hex editor and compare this to the .xml version.
Here follows the reverse engineering of the workbook part.
Here is the workbook part in .xml, followed by the workbook part in .bin :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<fileVersion lastEdited="4" lowestEdited="4" rupBuild="4017"/>
<workbookPr defaultThemeVersion="123820"/>
<bookViews>
<workbookView xWindow="360" yWindow="60" windowWidth="11295" windowHeight="5580"/>
</bookViews>
<sheets>
<sheet name="Sheet1" sheetId="1" r:id="rId1"/>
<sheet name="Sheet2" sheetId="2" r:id="rId2"/>
<sheet name="Sheet3" sheetId="3" r:id="rId3"/>
</sheets>
<calcPr calcId="122211"/>
<webPublishing codePage="1252"/>
</workbook>
83 01 00 80 01 14 04 04 b1 0f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
99 01 0c 20 00 01 00 ac e3 01 00 00 00 00 00 87 01 00 9e 01 1d 68 01 00 00 3c
00 00 00 1f 2c 00 00 cc 15 00 00 58 02 00 00 00 00 00 00 00 00 00 00 78 88 01
00 8f 01 00 9c 01 28 00 00 00 00 00 00 00 00 01 00 00 00 04 00 00 00 72 00 49
00 64 00 31 00 06 00 00 00 53 00 68 00 65 00 65 00 74 00 31 00 9c 01 28 00 00
00 00 00 00 00 00 02 00 00 00 04 00 00 00 72 00 49 00 64 00 32 00 06 00 00 00
53 00 68 00 65 00 65 00 74 00 32 00 9c 01 28 00 00 00 00 00 00 00 00 03 00 00
00 04 00 00 00 72 00 49 00 64 00 33 00 06 00 00 00 53 00 68 00 65 00 65 00 74
00 33 00 90 01 00 9d 01 19 63 dd 01 00 01 00 00 00 64 00 00 00 fc a9 f1 d2 4d
62 50 3f 01 00 00 00 6a 96 04 06 00 00 00 00 00 00 9a 01 01 00 a9 04 0b 07 00
03 60 00 00 00 e4 04 00 00 9b 01 01 00 84 01 00
For each line, I have put the .xml part followed by the bits from the .bin part. Here is how to read each .bin slice : there is a record identifier consisting in two bytes encoded using little endian, followed by a single byte which is the length of the record in bytes (surrounded by parentheses), followed by the record content itself. Here goes :
<workbook>
83 01 (00)
<fileVersion lastEdited="4" lowestEdited="4" rupBuild="4017"/>
80 01 (14) 04 04 b1 0f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
<workbookPr defaultThemeVersion="123820"/>
99 01 (0c) 20 00 01 00 ac e3 01 00 00 00 00 00
<bookViews>
87 01 (00)
<workbookView xWindow="360" yWindow="60" windowWidth="11295" windowHeight="5580"/>
9e 01 (1d) 68 01 00 00 3c 00 00 00 1f 2c 00 00 cc 15 00 00 58 02 00 00 00 00 00 00 00 00 00 00 78
</bookViews>
88 01 (00)
<sheets>
8f 01 (00)
<sheet name="Sheet1" sheetId="1" r:id="rId1"/>
9c 01 (28) 00 00 00 00 00 00 00 00 01 00 00 00 04 00 00 00 72 00 49 00 64 00 31 00 06 00 00 00 53 00 68 00 65
00 65 00 74 00 31 00
<sheet name="Sheet2" sheetId="2" r:id="rId2"/>
9c 01 (28) 00 00 00 00 00 00 00 00 02 00 00 00 04 00 00 00 72 00 49 00 64 00 32 00 06 00 00 00 53 00 68 00 65
00 65 00 74 00 32 00
<sheet name="Sheet3" sheetId="3" r:id="rId3"/>
9c 01 (28) 00 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 72 00 49 00 64 00 33 00 06 00 00 00 53 00 68 00 65
00 65 00 74 00 33 00
</sheets>
90 01 (00)
<calcPr calcId="122211"/>
9d 01 (19) 63 dd 01 00 01 00 00 00 64 00 00 00 fc a9 f1 d2 4d 62 50 3f 01 00 00 00 6a
96 04 (06) 00 00 00 00 00 00
9a 01 (01) 00
<webPublishing codePage="1252"/>
a9 04 (0b) 07 00 03 60 00 00 00 e4 04 00 00
9b 01 (01) 00
</workbook>
84 01 (00)
A couple comments :
- the record identifiers are not those from BIFF8. Believe it or not, the Excel team had so much time in their hands that they really invented an entirely new binary file format.
- not all xml elements have a match for their binary counterpart, meaning that those intending to implementing a read or write routine will find themselves in troubles. Here, the binary writer knows more about what's going on, and stores more info that you cannot find in the xml counterpart.
- the xml and bin parts are not interchangeable. You can not create the bin part from the xml part because the bin part stores more. You can not create the xml part from the bin part either because the namespaces and xml parent/children relations as well as what should be stored as an xml element, versus what should be stored as an xml attribute, are not defined in the bin part.
This is not the new XML world we've been told, I am afraid. Lock-in through undocumented file formats is here to stay.
[Update] : I have finally managed to post an
article about my journey in Office 2007 .bin file formats. You get a lot of undocumented pieces, dirty secrets, plus a BIFF12 reader which can read any cell of an arbitrary Excel 2007 binary workbook.