Generating documents programmatically sounds simple — fill in a few fields, export as PDF, done. In practice, you quickly end up dealing with XML namespaces, broken styles, and libraries that haven’t seen an update in years.
I’ve tried both approaches: the classic way with Apache POI in Java and the modern approach with Puppeteer. Here’s a comparison — what works, what’s frustrating, and why I wouldn’t go back today.
The Classic Way: Apache POI and the Office XML Structure
What Is Apache POI?
Apache POI is a Java library for reading and writing Microsoft Office files — Word, Excel, PowerPoint. The project has existed for over 20 years and is the de facto standard in the Java ecosystem.
Technically solid, but not exactly pleasant to work with.
The Standard Behind It: Office Open XML (OOXML)
To understand why POI is so cumbersome, it helps to look at what a .docx file actually is.
A .docx file is not a single document but a ZIP archive containing an entire directory structure of XML files. The text resides in word/document.xml, styles in word/styles.xml, relationships between parts in word/_rels/document.xml.rels.
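You can check the ZIP claim without any library. The following is a minimal sketch in Node.js that lists the entry names of an archive by reading the ZIP central directory. `listZipEntries` is a name made up for this example, `example.docx` is a placeholder path, and the parser deliberately ignores ZIP64 and multi-disk archives.

```javascript
// Lists the entry names of a ZIP archive held in a Buffer.
// Simplified: no ZIP64, no multi-disk archives, names decoded as UTF-8.
function listZipEntries(buf) {
  const EOCD_SIG = 0x06054b50; // "end of central directory" record
  const CDH_SIG = 0x02014b50;  // central directory file header

  // The end-of-central-directory record sits at the end of the file
  // (22 bytes plus an optional comment), so scan backwards for it.
  let eocd = -1;
  for (let i = buf.length - 22; i >= 0; i--) {
    if (buf.readUInt32LE(i) === EOCD_SIG) { eocd = i; break; }
  }
  if (eocd < 0) throw new Error('not a ZIP archive');

  const count = buf.readUInt16LE(eocd + 10); // total number of entries
  let offset = buf.readUInt32LE(eocd + 16);  // where the central directory starts

  const names = [];
  for (let n = 0; n < count; n++) {
    if (buf.readUInt32LE(offset) !== CDH_SIG) {
      throw new Error('corrupt central directory');
    }
    const nameLen = buf.readUInt16LE(offset + 28);
    const extraLen = buf.readUInt16LE(offset + 30);
    const commentLen = buf.readUInt16LE(offset + 32);
    names.push(buf.toString('utf8', offset + 46, offset + 46 + nameLen));
    offset += 46 + nameLen + extraLen + commentLen; // jump to the next header
  }
  return names;
}

// Hypothetical usage:
// const fs = require('fs');
// console.log(listZipEntries(fs.readFileSync('example.docx')));
```

For a Word file, the list contains the parts mentioned above, such as word/document.xml and word/styles.xml.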
The format is called Office Open XML (OOXML) and is standardized as ECMA-376 and ISO/IEC 29500. On paper, it is an open, standardized way to represent Office documents. In practice, it looks like this:
```xml
<w:p w:rsidR="00A77427" w:rsidRDefault="007F1D13">
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:b/>
      <w:sz w:val="28"/>
    </w:rPr>
    <w:t>Hello World</w:t>
  </w:r>
</w:p>
```
A bold “Hello World” text as a heading. That requires a paragraph (w:p), paragraph properties (w:pPr), a run (w:r), run properties (w:rPr), and only then the actual text (w:t). Even the font size is indirect: w:sz is given in half-points, so 28 means 14 pt. If you want “Hello” bold and “World” italic, you need two separate runs.
That’s not a bug — that’s how the standard works. And that’s exactly what makes working with it programmatically so unintuitive.
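To make the run model concrete, here is a small sketch that builds the paragraph XML for mixed formatting by hand. The helper names (`escapeXml`, `buildRun`, `buildParagraph`) are invented for this illustration and are not part of Apache POI or any other library. The output is only the w:p fragment, not a complete document.xml.

```javascript
// Escapes the characters that are special in XML text content.
function escapeXml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Builds one <w:r>. Formatting lives in <w:rPr>, the text in <w:t>.
function buildRun({ text, bold = false, italic = false }) {
  const props = (bold ? '<w:b/>' : '') + (italic ? '<w:i/>' : '');
  const rPr = props ? `<w:rPr>${props}</w:rPr>` : '';
  // xml:space="preserve" keeps leading and trailing spaces in the run text.
  return `<w:r>${rPr}<w:t xml:space="preserve">${escapeXml(text)}</w:t></w:r>`;
}

// Builds one <w:p> from a list of runs. styleId is assumed to be a
// plain identifier such as "Heading1" and is not escaped here.
function buildParagraph(runs, styleId) {
  const pPr = styleId ? `<w:pPr><w:pStyle w:val="${styleId}"/></w:pPr>` : '';
  return `<w:p>${pPr}${runs.map(buildRun).join('')}</w:p>`;
}

// One word bold, the next italic: two runs, each with its own properties.
const paragraphXml = buildParagraph(
  [
    { text: 'Hello ', bold: true },
    { text: 'World', italic: true },
  ],
  'Heading1'
);
console.log(paragraphXml);
```

Every change of formatting inside a paragraph means another run object, which is why even short mixed-format sentences expand into long XML.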
Why This Hurts During Development
- Nested abstractions: You think in paragraphs and formatting. OOXML thinks in paragraphs, runs, properties, numbering definitions, and relationship parts. The gap between “I want a table with borders” and what you actually have to write in code is enormous.
- Inconsistent behavior: Word interprets OOXML liberally. Two visually identical documents can have completely different XML, depending on how they were created.
- Style inheritance: Styles inherit from each other, get overridden by theme defaults, and behave differently depending on the document version. Debugging becomes detective work.
- Tables: Merging cells, setting borders, and defining column widths each require their own XML construct (w:gridSpan and w:vMerge for merges, w:tblBorders for borders, w:tblGrid for column widths). A simple table with merged cells can easily become 100 lines of XML.
- The API mirrors the complexity: Apache POI only partially abstracts the XML. You work with XWPFParagraph, XWPFRun, CTTblPr, and similar classes that directly map to the XML structure.
Did it work? Yes. Was it maintainable? Barely.
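As an illustration of the table point, here is a hand-rolled sketch of the XML for a two-column table whose second row is a single merged cell. The function names are hypothetical, the cell text is assumed to contain no XML special characters, and the snippet emits only the w:tbl fragment. Column widths are in twentieths of a point, border sizes in eighths of a point.

```javascript
// One table cell. A horizontal merge is expressed as w:gridSpan
// inside the cell properties.
function cell(text, span = 1) {
  const tcPr = span > 1 ? `<w:tcPr><w:gridSpan w:val="${span}"/></w:tcPr>` : '';
  // Every cell has to contain at least one paragraph.
  return `<w:tc>${tcPr}<w:p><w:r><w:t>${text}</w:t></w:r></w:p></w:tc>`;
}

function row(cells) {
  return `<w:tr>${cells.join('')}</w:tr>`;
}

function table(columnWidths, rows) {
  // Borders are declared side by side, including the inner grid lines.
  const borders = ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']
    .map((side) => `<w:${side} w:val="single" w:sz="4" w:space="0" w:color="auto"/>`)
    .join('');
  // Column widths live in a separate grid definition.
  const grid = columnWidths.map((w) => `<w:gridCol w:w="${w}"/>`).join('');
  return (
    `<w:tbl><w:tblPr><w:tblBorders>${borders}</w:tblBorders></w:tblPr>` +
    `<w:tblGrid>${grid}</w:tblGrid>${rows.join('')}</w:tbl>`
  );
}

const tableXml = table([4500, 4500], [
  row([cell('Item'), cell('Price')]),
  row([cell('Total: 100.00', 2)]), // one cell spanning both columns
]);
console.log(tableXml);
```

Three cells and six border lines already produce several hundred characters of markup, and vertical merges, cell widths, and shading are not even covered.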
The Modern Way: Puppeteer
The Idea
Why fight with XML structures when you already know HTML and CSS?
The approach:
- Build the document as HTML/CSS (or with a template engine like Handlebars)
- Launch Puppeteer (headless Chrome)
- Render the HTML
- Export as PDF
No XML, no namespaces, no run properties.
In Practice
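```javascript
const puppeteer = require('puppeteer');

// Renders a quote as HTML and writes it to quote.pdf.
// Values are interpolated unescaped, so they must come from a trusted source.
async function generateQuote({ offerNumber, customerName, items }) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.setContent(`
      <html>
        <head><style>
          body { font-family: Arial, sans-serif; margin: 40px; }
          h1 { color: #333; border-bottom: 2px solid #666; }
          table { width: 100%; border-collapse: collapse; }
          td, th { border: 1px solid #ccc; padding: 8px; }
        </style></head>
        <body>
          <h1>Quote No. ${offerNumber}</h1>
          <p>Dear ${customerName},</p>
          <table>
            <tr><th>Item</th><th>Description</th><th>Price</th></tr>
            ${items.map(i =>
              `<tr><td>${i.pos}</td><td>${i.desc}</td><td>${i.price}</td></tr>`
            ).join('')}
          </table>
        </body>
      </html>
    `);
    await page.pdf({ path: 'quote.pdf', format: 'A4' });
  } finally {
    // Always shut down Chrome, even if rendering fails.
    await browser.close();
  }
}

// Sample values for illustration only.
generateQuote({
  offerNumber: '2024-001',
  customerName: 'Jane Doe',
  items: [{ pos: 1, desc: 'Consulting', price: '1,200.00 EUR' }],
}).catch(console.error);
```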
That is essentially the entire code. The same output would have required many times the number of lines with Apache POI.
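One caveat with the template literal above: values are inserted into the HTML unescaped, so a customer name containing `<` or `&` can break the markup. A minimal sketch of one way to handle this is a tagged template that escapes every interpolated value. `escapeHtml`, `raw`, and `html` are helper names made up here; a template engine such as Handlebars escapes by default and would do the same job.

```javascript
// Escapes the characters that are special in HTML text and attributes.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Marks a string as already-safe HTML so that html`` does not escape it again.
const raw = (markup) => ({ __raw: markup });

// Tagged template: escapes every interpolated value unless wrapped in raw().
function html(strings, ...values) {
  let out = strings[0];
  values.forEach((v, i) => {
    const isRaw = v !== null && typeof v === 'object' && '__raw' in v;
    out += (isRaw ? v.__raw : escapeHtml(v)) + strings[i + 1];
  });
  return out;
}

// Sample data for illustration only.
const items = [{ pos: 1, desc: 'Support <24/7> & hosting', price: '99.00' }];

const rowsHtml = items
  .map((i) => html`<tr><td>${i.pos}</td><td>${i.desc}</td><td>${i.price}</td></tr>`)
  .join('');
// The rows are already escaped, so they are passed through as raw markup.
const tableHtml = html`<table>${raw(rowsHtml)}</table>`;
console.log(tableHtml);
```

The resulting string can be passed to `page.setContent()` exactly like the hand-written template.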
Advantages
- HTML and CSS — technologies every web developer knows. No new format, no cryptic APIs.
- Preview in the browser — open the template, press F12, done. No XML debugging.
- CSS layout — Grid, Flexbox, Media Queries. Complex layouts that would be extremely tedious in OOXML are straightforward with CSS.
- Template engines — Handlebars, Nunjucks, or simply template literals. Insert data just like on a website.
- Maintainability — new team members can get started immediately. HTML/CSS skills are widespread; OOXML expertise is not.
Limitations
The Puppeteer approach also has weaknesses:
- Page breaks: CSS offers page-break-before/after and break-inside, but behavior with dynamic content is not always predictable.
- Headers and footers: Puppeteer supports them in page.pdf(), but only as a separate HTML template. Page numbers work; complex headers with images become cumbersome.
- No .docx output: Puppeteer produces PDFs. If you need an editable Word document, you’ll have to use other libraries or combine both approaches.
- Resource consumption: a headless Chrome instance is not lightweight. For batch processing, you need pooling and resource management.
- Print CSS: @media print and @page have their own rules. Not everything that works on screen looks the same in a PDF.
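For the page-break and header/footer points, here is a sketch of what the configuration can look like. `displayHeaderFooter`, `headerTemplate`, `footerTemplate`, and `margin` are options of Puppeteer’s `page.pdf()`, and the `pageNumber` and `totalPages` classes are filled in by Chrome. The function name, the `.page-break` class, and the concrete sizes are assumptions for this example.

```javascript
// Print rules to include in the document's <style> block.
const printCss = `
  tr { break-inside: avoid; }          /* ask the browser not to split a table row */
  .page-break { break-before: page; }  /* force a new page before this element */
`;

// Builds the options object for page.pdf().
// headerText is assumed to be plain, already-escaped text.
function buildPdfOptions(headerText) {
  // Header and footer are rendered separately from the page, so the page's
  // CSS does not reach them. They carry their own inline styles.
  const base = 'font-size: 8px; width: 100%; margin: 0 15mm;';
  return {
    format: 'A4',
    displayHeaderFooter: true,
    headerTemplate: `<div style="${base}">${headerText}</div>`,
    footerTemplate:
      `<div style="${base} text-align: right;">` +
      'Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>',
    // The templates are drawn inside the page margins, so leave room for them.
    margin: { top: '25mm', bottom: '20mm', left: '15mm', right: '15mm' },
  };
}

// Hypothetical usage:
// await page.pdf({ path: 'quote.pdf', ...buildPdfOptions('Quote 2024-001') });
```

Images in such a template are where it gets cumbersome, which is why this sketch sticks to text.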
The key difference: these problems are solvable and well-documented. With OOXML issues, you often end up in decade-old bug reports closed as “Won’t Fix.”
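Resource consumption is a good example of such a solvable problem. The sketch below caps how many jobs run at once, so a batch does not open hundreds of pages in parallel. `mapWithConcurrency` is a hypothetical helper, not a Puppeteer API, and the Puppeteer usage is shown only as a comment.

```javascript
// Runs worker(item, index) for every item, with at most `limit` calls
// in flight at any time. Results keep the order of the input.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one by one until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Hypothetical usage: one shared browser, at most 4 pages open at once.
// const browser = await puppeteer.launch();
// const pdfs = await mapWithConcurrency(htmlDocs, 4, async (doc) => {
//   const page = await browser.newPage();
//   try {
//     await page.setContent(doc);
//     return await page.pdf({ format: 'A4' });
//   } finally {
//     await page.close();
//   }
// });
// await browser.close();
```

Reusing one browser and closing each page after use keeps memory bounded. If a single job fails, `Promise.all` rejects, so real batch code would likely catch errors inside the worker.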
Comparison at a Glance
| Aspect | Apache POI (OOXML) | Puppeteer (HTML/CSS) |
|---|---|---|
| Learning curve | Steep — OOXML knowledge required | Gentle — HTML/CSS is enough |
| Output format | .docx, .xlsx natively | PDF natively, .docx only via workarounds |
| Layout control | Tedious, lots of boilerplate | CSS — flexible and powerful |
| Debugging | Sifting through XML | Browser DevTools |
| Maintainability | Low, specialized knowledge needed | High, standard web skills |
| Performance | Lightweight, fast | Headless browser = more resources |
| Editable documents | Yes | No (PDF only) |
Use in Practice
In the context of Document Management Systems (DMS), this approach becomes particularly useful:
- Quotes and invoices: data from CRM or ERP, fixed template, variable values.
- Contracts and forms: standardized documents with individual fields. Build once, use thousands of times.
- Reports: KPI reports, status reports, audit logs. Pull data, fill template, store PDF.
- Onboarding documents: employment contracts, checklists, access credentials. Same structure, individual content.
- Archiving: automatically generated PDFs are searchable and, once converted to PDF/A, suitable for long-term storage. Chrome’s PDF output is generally not PDF/A-compliant on its own, so plan for a conversion step.
A typical workflow: an event triggers the generation, data is pulled from the source, the template is filled, and the PDF is automatically stored in the DMS — with metadata, versioning, and access controls.
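Sketched as code, that workflow is a short pipeline. Everything here is hypothetical: the event shape and the four injected functions (`fetchData`, `renderTemplate`, `renderPdf`, `store`) stand in for whatever your CRM, template engine, Puppeteer wrapper, and DMS client actually provide.

```javascript
// Handles one generation event: pull data, fill the template,
// render the PDF, store it with metadata. All dependencies are injected,
// so each step can be swapped or stubbed independently.
async function generateAndArchive(event, deps) {
  // 1. Pull the data from the source system.
  const data = await deps.fetchData(event.recordId);

  // 2. Fill the HTML template.
  const markup = deps.renderTemplate(event.template, data);

  // 3. Render the HTML to a PDF (for example with Puppeteer).
  const pdf = await deps.renderPdf(markup);

  // 4. Store the PDF together with its metadata and return
  //    whatever the DMS client returns, such as a document id.
  return deps.store(pdf, {
    title: `${event.template}-${event.recordId}`,
    source: event.source,
    createdAt: new Date().toISOString(),
  });
}
```

Versioning and access controls would live behind `store`, which keeps the generation code free of DMS specifics.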
Conclusion
OOXML and Apache POI work, but the effort is often disproportionate to the result. Puppeteer is not a universal solution, but it builds on technologies that web developers already know, and it turns document generation into a solvable problem.
As long as the output can be PDF, I would always choose the HTML-to-PDF approach today.