Automating Documents — From XML Hell to Headless Browsers

Generating documents programmatically sounds simple — fill in a few fields, export as PDF, done. In practice, you quickly end up dealing with XML namespaces, broken styles, and libraries that haven’t seen an update in years.

I’ve tried both approaches: the classic way with Apache POI in Java and the modern approach with Puppeteer. Here’s a comparison — what works, what’s frustrating, and why I wouldn’t go back today.

The Classic Way: Apache POI and the Office XML Structure

What Is Apache POI?

Apache POI is a Java library for reading and writing Microsoft Office files — Word, Excel, PowerPoint. The project has existed for over 20 years and is the de facto standard in the Java ecosystem.

Technically solid, but not exactly pleasant to work with.

The Standard Behind It: Office Open XML (OOXML)

To understand why POI is so cumbersome, it helps to look at what a .docx file actually is.

A .docx file is not a single document but a ZIP archive containing an entire directory structure of XML files. The text resides in word/document.xml, styles in word/styles.xml, relationships between parts in word/_rels/document.xml.rels.
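
To make the ZIP claim concrete, here is a minimal Node.js sketch that lists the part names inside a .docx by reading the archive's central directory from a Buffer. The function name listZipEntries is made up for this illustration, and the parser is deliberately simplified: it assumes a plain single-disk archive without ZIP64 extensions, which is an assumption on my part, not something every file guarantees.

```javascript
// Illustrative helper (hypothetical name): list the entry names of a ZIP archive.
// Simplified on purpose: no ZIP64, no multi-disk archives, names decoded as UTF-8.
function listZipEntries(buf) {
  const EOCD_SIGNATURE = 0x06054b50; // "end of central directory" record
  const CEN_SIGNATURE = 0x02014b50;  // central directory file header

  // The EOCD record sits at the end of the file, so scan backwards for it.
  let eocd = -1;
  for (let i = buf.length - 22; i >= 0; i--) {
    if (buf.readUInt32LE(i) === EOCD_SIGNATURE) { eocd = i; break; }
  }
  if (eocd < 0) throw new Error('Not a ZIP archive');

  const entryCount = buf.readUInt16LE(eocd + 10);
  let offset = buf.readUInt32LE(eocd + 16); // start of the central directory

  const names = [];
  for (let n = 0; n < entryCount; n++) {
    if (buf.readUInt32LE(offset) !== CEN_SIGNATURE) {
      throw new Error('Corrupt central directory');
    }
    const nameLength = buf.readUInt16LE(offset + 28);
    const extraLength = buf.readUInt16LE(offset + 30);
    const commentLength = buf.readUInt16LE(offset + 32);
    names.push(buf.toString('utf8', offset + 46, offset + 46 + nameLength));
    // Skip the fixed header plus the three variable-length fields.
    offset += 46 + nameLength + extraLength + commentLength;
  }
  return names;
}
```

Called as listZipEntries(require('fs').readFileSync('example.docx')), where example.docx is a placeholder file name, this would typically return names such as [Content_Types].xml, word/document.xml, and word/styles.xml.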

The format is called Office Open XML (OOXML) and is an ECMA/ISO standard. On paper, an open, standardized way to represent Office documents. In practice, it looks like this:

<w:p w:rsidR="00A77427" w:rsidRDefault="007F1D13">
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:b/>
      <w:sz w:val="28"/>
    </w:rPr>
    <w:t>Hello World</w:t>
  </w:r>
</w:p>

That is a bold “Hello World” styled as a heading. It takes a paragraph (w:p), paragraph properties (w:pPr), a run (w:r), run properties (w:rPr), and only then the actual text (w:t). Even the font size is indirect: w:sz is measured in half-points, so 28 means 14 pt. If you want “Hello” bold and “World” italic, you need two separate runs.

That’s not a bug — that’s how the standard works. And that’s exactly what makes working with it programmatically so unintuitive.
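
To show what "two separate runs" means when you generate the markup yourself, here is a small sketch in plain JavaScript. The helper names escapeXml, runXml, and paragraphXml are my own, not part of any library, and the output is only a paragraph fragment: a real word/document.xml also needs the surrounding document element and the w: namespace declaration.

```javascript
// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build one run (w:r): optional run properties first, then the text.
function runXml({ text, bold = false, italic = false }) {
  const props = (bold ? '<w:b/>' : '') + (italic ? '<w:i/>' : '');
  const rPr = props ? `<w:rPr>${props}</w:rPr>` : '';
  // xml:space="preserve" keeps leading and trailing spaces inside w:t.
  return `<w:r>${rPr}<w:t xml:space="preserve">${escapeXml(text)}</w:t></w:r>`;
}

// Build one paragraph (w:p) from a list of run descriptions.
function paragraphXml(runs) {
  return `<w:p>${runs.map(runXml).join('')}</w:p>`;
}

// One paragraph, two differently formatted words, therefore two runs.
const xml = paragraphXml([
  { text: 'Hello ', bold: true },
  { text: 'World', italic: true },
]);
console.log(xml);
```

Every change in formatting inside a paragraph forces another run, which is why generated documents tend to contain far more runs than a reader would expect from the visible text.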

Why This Hurts During Development

Nested abstractions: You think in paragraphs and formatting. OOXML thinks in paragraphs, runs, properties, numbering definitions, and relationship parts. The gap between “I want a table with borders” and what you actually have to write in code is enormous.
Inconsistent behavior: Word interprets OOXML liberally. Two visually identical documents can have completely different XML — depending on how they were created.
Style inheritance: Styles inherit from each other, get overridden by theme defaults, and behave differently depending on the document version. Debugging becomes detective work.
Tables: Merging cells, setting borders, and defining column widths each require their own XML construct (w:gridSpan and w:vMerge for merges, w:tblBorders or w:tcBorders for borders, w:tblGrid for column widths). A simple table with merged cells can easily become 100 lines of XML.
The API mirrors the complexity: Apache POI only partially abstracts the XML. You work with XWPFParagraph, XWPFRun, CTTblPr, and similar classes that directly map to the XML structure.

Did it work? Yes. Was it maintainable? Barely.

The Modern Way: Puppeteer

The Idea

Why fight with XML structures when you already know HTML and CSS?

The approach:

Build the document as HTML/CSS (or with a template engine like Handlebars)
Launch Puppeteer (headless Chrome)
Render the HTML
Export as PDF

No XML, no namespaces, no run properties.

In Practice

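const puppeteer = require('puppeteer');

// Sample values for illustration; in practice they come from the CRM or ERP.
const offerNumber = '2024-0042';
const customerName = 'Jane Doe';
const items = [
  { pos: 1, desc: 'Consulting', price: '1,200.00 EUR' },
  { pos: 2, desc: 'Implementation', price: '3,400.00 EUR' },
];

// CommonJS modules have no top-level await, so the work runs in an async function.
async function main() {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    // Values are interpolated as-is; escape anything that is not trusted.
    await page.setContent(`
      <html>
        <head>
          <style>
            body { font-family: Arial, sans-serif; margin: 40px; }
            h1 { color: #333; border-bottom: 2px solid #666; }
            table { width: 100%; border-collapse: collapse; }
            td, th { border: 1px solid #ccc; padding: 8px; }
          </style>
        </head>
        <body>
          <h1>Quote No. ${offerNumber}</h1>
          <p>Dear ${customerName},</p>
          <table>
            <tr><th>Item</th><th>Description</th><th>Price</th></tr>
            ${items.map(i =>
              `<tr><td>${i.pos}</td><td>${i.desc}</td><td>${i.price}</td></tr>`
            ).join('')}
          </table>
        </body>
      </html>
    `);

    await page.pdf({ path: 'quote.pdf', format: 'A4' });
  } finally {
    // Close the browser even if rendering fails, so no Chrome process is left behind.
    await browser.close();
  }
}

main().catch(err => {
  console.error(err);
  process.exitCode = 1;
});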

That is essentially the entire code. The same output would have required many times the number of lines with Apache POI.

Advantages

HTML and CSS — technologies every web developer knows. No new format, no cryptic APIs.
Preview in the browser — open the template, press F12, done. No XML debugging.
CSS layout — Grid, Flexbox, Media Queries. Complex layouts that would be extremely tedious in OOXML are straightforward with CSS.
Template engines — Handlebars, Nunjucks, or simply template literals. Insert data just like on a website.
Maintainability — new team members can get started immediately. HTML/CSS skills are widespread; OOXML expertise is not.
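
One caveat with plain template literals is that values land in the HTML unescaped. As a rough illustration of what a template engine does for you, here is a tiny placeholder renderer in plain JavaScript. The names escapeHtml and renderTemplate are my own, and this is only a stand-in: real engines such as Handlebars or Nunjucks add loops, partials, and helpers on top of this basic idea.

```javascript
// Escape characters that would otherwise be interpreted as HTML markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Replace {{ key }} placeholders with escaped values from `data`.
// Unknown keys are left in place so that missing data is easy to spot.
function renderTemplate(template, data) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(data, key)
      ? escapeHtml(data[key])
      : match
  );
}

const html = renderTemplate('<p>Dear {{ customerName }},</p>', {
  customerName: 'Smith & Sons <Ltd>',
});
console.log(html); // <p>Dear Smith &amp; Sons &lt;Ltd&gt;,</p>
```

The rendered string can be passed to page.setContent() directly, so customer names or item descriptions containing angle brackets or ampersands no longer break the layout.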

Limitations

The Puppeteer approach also has weaknesses:

Page breaks: CSS offers break-before, break-after, and break-inside (plus the older page-break-* properties), but behavior with dynamic content is not always predictable.
Headers and footers: Puppeteer supports them in page.pdf() through the headerTemplate and footerTemplate options, but only as separate HTML templates that do not see the page’s own styles. Page numbers work; complex headers with images become cumbersome.
No .docx output: Puppeteer produces PDFs. If you need an editable Word document, you’ll have to use other libraries or combine both approaches.
Resource consumption: a headless Chrome instance is not lightweight. For batch processing, you need pooling and resource management.
Print CSS: @media print and @page have their own rules. Not everything that works on screen looks the same in a PDF.
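
For the page-break and header/footer items above, a sketch of how the pieces usually fit together may help. The constant printCss, the function buildPdfOptions, and the class name new-page are names I made up. The option names (displayHeaderFooter, headerTemplate, footerTemplate, margin, printBackground, preferCSSPageSize) and the template classes title, pageNumber, and totalPages are taken from Puppeteer's page.pdf() options as I know them, so check them against the version you use.

```javascript
// Print-specific CSS, meant to be placed inside the document's <style> block.
const printCss = `
  @page { size: A4; }
  tr, figure { break-inside: avoid; }  /* do not split table rows or figures */
  .new-page { break-before: page; }    /* explicit page break before an element */
`;

// Build the options object for page.pdf() with a title header and a
// "Page X of Y" footer.
function buildPdfOptions(outputPath) {
  // Header and footer are rendered separately from the page, so the page's
  // styles do not apply here and the font size has to be set inline.
  const boxStyle = 'font-size: 9px; width: 100%; text-align: center;';
  return {
    path: outputPath,
    format: 'A4',              // used when the page declares no @page size
    preferCSSPageSize: true,   // let an @page size in the CSS take priority
    printBackground: true,
    displayHeaderFooter: true,
    headerTemplate: `<div style="${boxStyle}"><span class="title"></span></div>`,
    footerTemplate:
      `<div style="${boxStyle}">Page <span class="pageNumber"></span>` +
      ` of <span class="totalPages"></span></div>`,
    // The margins reserve the space in which header and footer are drawn.
    margin: { top: '25mm', bottom: '25mm', left: '15mm', right: '15mm' },
  };
}
```

With a Puppeteer page at hand, the call would be await page.pdf(buildPdfOptions('quote.pdf')). Images in a header generally have to be inlined as data URIs, which is where the "cumbersome" part starts.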

The key difference: these problems are solvable and well-documented. With OOXML issues, you often end up in decade-old bug tracker entries marked “Won’t Fix.”
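
Resource consumption is a good example of a solvable problem. A common pattern, sketched here under my own assumptions, is to launch one browser, reuse it for the whole batch, and cap the number of pages open at once. The function renderBatch and the shape of the jobs array (objects with an html property) are hypothetical; the browser argument only needs to behave like a Puppeteer Browser (newPage) whose pages offer setContent, pdf, and close.

```javascript
// Render many HTML documents with one shared browser and at most
// `concurrency` pages open at the same time. Returns the PDFs in job order.
async function renderBatch(browser, jobs, concurrency = 4) {
  const results = new Array(jobs.length);
  let next = 0; // index of the next job nobody has claimed yet

  // Each worker keeps claiming jobs until none are left.
  async function worker() {
    while (next < jobs.length) {
      const index = next++; // claimed synchronously, so no two workers collide
      const page = await browser.newPage();
      try {
        await page.setContent(jobs[index].html);
        results[index] = await page.pdf({ format: 'A4' });
      } finally {
        await page.close(); // always release the tab, even on errors
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, jobs.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

The right concurrency value depends on memory and document size, so it is worth measuring instead of guessing. If one job fails, the returned promise rejects; a production version would probably collect errors per job instead.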

Comparison at a Glance

Aspect	Apache POI (OOXML)	Puppeteer (HTML/CSS)
Learning curve	Steep — OOXML knowledge required	Gentle — HTML/CSS is enough
Output format	.docx, .xlsx natively	PDF natively, .docx only via workarounds
Layout control	Tedious, lots of boilerplate	CSS — flexible and powerful
Debugging	Sifting through XML	Browser DevTools
Maintainability	Low, specialized knowledge needed	High, standard web skills
Performance	Lightweight, fast	Headless browser = more resources
Editable documents	Yes	No (PDF only)

Use in Practice

In the context of Document Management Systems (DMS), the HTML-to-PDF approach becomes particularly useful:

Quotes and invoices: data from CRM or ERP, fixed template, variable values.
Contracts and forms: standardized documents with individual fields. Build once, use thousands of times.
Reports: KPI reports, status reports, audit logs. Pull data, fill template, store PDF.
Onboarding documents: employment contracts, checklists, access credentials. Same structure, individual content.
Archiving: automatically generated PDFs are searchable and, once converted to and validated as PDF/A, suitable for long-term storage. Plain page.pdf() output should not be assumed to be PDF/A-conformant.

A typical workflow: an event triggers the generation, data is pulled from the source, the template is filled, and the PDF is automatically stored in the DMS — with metadata, versioning, and access controls.
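
As a rough sketch of that workflow, the four steps can be wired together in one small function. Everything here is hypothetical: the function generateAndStore, the event fields recordId and template, and the four injected dependencies are placeholders for whatever your CRM, template engine, renderer, and DMS client actually provide.

```javascript
// Orchestrates one generation run. All four dependencies are injected, so the
// sketch stays independent of any concrete CRM, template engine, or DMS API.
async function generateAndStore(event, { loadData, renderHtml, htmlToPdf, storeInDms }) {
  // 1. Pull the data for the record that triggered the event.
  const data = await loadData(event.recordId);
  // 2. Fill the template.
  const html = renderHtml(event.template, data);
  // 3. Render the PDF (for example with Puppeteer).
  const pdf = await htmlToPdf(html);
  // 4. Store the result together with its metadata.
  return storeInDms(pdf, {
    documentType: event.template,
    sourceRecord: event.recordId,
    generatedAt: new Date().toISOString(),
  });
}
```

Keeping the steps behind injected functions also makes the pipeline easy to test without launching a browser or touching the DMS.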

Conclusion

OOXML and Apache POI work, but the effort is often disproportionate to the result. Puppeteer is not a universal solution, but it builds on technologies that web developers already know well and turns document generation into a solvable problem.

As long as the output can be PDF, I would always choose the HTML-to-PDF approach today.

Glossary

Office Open XML (OOXML): An ECMA/ISO standard (ECMA-376, ISO/IEC 29500) for Office documents. A .docx file is a ZIP archive containing XML files that describe content, styles, and relationships in a deeply nested structure of paragraphs, runs, and properties.
Headless Browser: A browser without a graphical interface, controlled programmatically. Puppeteer drives a headless Chrome via API — useful for tests, scraping, and generating PDFs from HTML/CSS.
Apache POI: A Java library for reading and writing Microsoft Office files (Word, Excel, PowerPoint). Works directly on the OOXML structure and reflects its complexity in the API.
Template Engine: A software component that replaces placeholders in templates with dynamic data. Examples include Handlebars, Nunjucks, or EJS — widely used in web development for server-side rendering and email templates.
Document Management System (DMS): A system for digital management, versioning, and archiving of documents. Typically supports metadata, access rights, and full-text search — essential for compliance-relevant storage in organizations.