Portable Document Format (PDF) Notes

Posted Friday, October 28, 2022 by Sri. Tagged MEMO
EDIT STATUS:new

PDF File Format

PDF can be thought of as the tokenized output of a PostScript source file: a PostScript interpreter generates the output, which is collected into a package that includes any referenced file assets. The file structure defines a number of objects, streams, and dictionaries.

Document Structure Overview

%PDF-1.5
1 0 obj << /Type /Catalog ... endobj 
2 0 obj << /Type /Pages ... endobj
3 0 obj << /Type /Page ... endobj
4 0 obj << /Length 35
        >> stream ... endstream
    endobj
5 0 obj << /Title ... endobj
6 0 obj << /Type /Metadata ... endobj
8 0 obj << /Filter /MySecurityHandlerName ... endobj
xref
  ...
trailer
  << /Size 9 /Root 1 0 R /Info 5 0 R /Encrypt 8 0 R >>
startxref
495
%%EOF

FIRST there are a bunch of objects:

  1. CATALOG object: points to PAGE TREE and METADATA
  2. PAGE TREE object: points to KIDS (individual pages), of which there is 1 here
  3. FIRST PAGE object: parent tree, media box, points to CONTENTS
  4. PAGE CONTENTS object(s): a stream of encrypted "page marking operators"
  5. INFO DICTIONARY object: a dictionary of document information such as the Title
  6. METADATA object: crypto filter parameters, unencrypted XML stream
  7. blank (object number 7 is not used in this example)
  8. ENCRYPTION DICT obj: cryptographic algorithm selection, public keys

THEN there is an XREF table followed by a TRAILER, which points to the "root" object (the CATALOG).

  • note that PDFs are designed to have additional objects appended to them so existing objects don't have to be renumbered (?), which is why the trailer sits at the end of the file. It's sort of like a tape archive in that sense?
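To make the objects / XREF / TRAILER layout concrete, here is a minimal sketch that assembles a tiny one-page PDF by hand. The function name `buildMinimalPdf` and the page content are made up for illustration, and the sketch leaves out encryption, metadata and the info dictionary from the listing above. The point is that the xref table is just a list of byte offsets to each object, and `startxref` is the byte offset of the xref table itself.

```javascript
// Minimal sketch: build a one-page PDF as a string to show the
// header / objects / xref / trailer layout. Not a general-purpose writer.
function buildMinimalPdf() {
  const content = "BT /F1 24 Tf 72 720 Td (Hello) Tj ET"; // page marking operators
  const objects = [
    "<< /Type /Catalog /Pages 2 0 R >>",                       // 1: catalog
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",               // 2: page tree
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] " +
      "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>", // 3: page
    `<< /Length ${content.length} >>\nstream\n${content}\nendstream`, // 4: contents
    "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",  // 5: font
  ];

  let pdf = "%PDF-1.4\n";
  const offsets = [];
  objects.forEach((body, i) => {
    offsets.push(pdf.length); // ASCII only, so string length equals byte offset
    pdf += `${i + 1} 0 obj\n${body}\nendobj\n`;
  });

  const xrefOffset = pdf.length;
  const size = objects.length + 1; // highest object number plus one (object 0 is reserved)
  pdf += `xref\n0 ${size}\n`;
  pdf += "0000000000 65535 f \n"; // every xref entry is exactly 20 bytes long
  for (const offset of offsets) {
    pdf += `${String(offset).padStart(10, "0")} 00000 n \n`;
  }
  pdf += `trailer\n<< /Size ${size} /Root 1 0 R >>\n`;
  pdf += `startxref\n${xrefOffset}\n%%EOF\n`;
  return pdf;
}

const minimalPdf = buildMinimalPdf();
```

A reader works backwards from this: it finds `startxref` near the end of the file, jumps to the xref table, and uses the trailer's `/Root` entry to locate the catalog.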

The specific features of PDF, as far as drawing stuff goes:

  • Graphics Objects describes all the graphic-related operations that PDF supports.
  • Text Objects describes all the text-related operations that PDF supports.

These are also summarized in a big table in the Operator Summary, which includes the PostScript equivalent if it exists.

Also, Type 4 (PostScript Calculator) Functions provide limited dynamic calculation, but they must be pure functions (no variables at all).
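To get a feel for what "pure function, no variables" means here, this is a rough sketch of a stack evaluator for a small subset of the calculator operators. The names `OPERATORS` and `evalCalculator` are my own, and only a handful of arithmetic and stack operators are handled (no `if`/`ifelse`, no type checking), so treat it as an illustration rather than a conforming implementation.

```javascript
// Sketch of a stack machine for a small subset of Type 4 operators.
// Each operator pops its operands from the stack and pushes its result.
const OPERATORS = new Map([
  ["add", (s) => { const b = s.pop(), a = s.pop(); s.push(a + b); }],
  ["sub", (s) => { const b = s.pop(), a = s.pop(); s.push(a - b); }],
  ["mul", (s) => { const b = s.pop(), a = s.pop(); s.push(a * b); }],
  ["div", (s) => { const b = s.pop(), a = s.pop(); s.push(a / b); }],
  ["neg", (s) => { s.push(-s.pop()); }],
  ["dup", (s) => { const a = s.pop(); s.push(a, a); }],
  ["exch", (s) => { const b = s.pop(), a = s.pop(); s.push(b, a); }],
  ["pop", (s) => { s.pop(); }],
]);

function evalCalculator(program, inputs) {
  const stack = [...inputs]; // inputs start on the stack; there are no variables
  const tokens = program.replace(/[{}]/g, " ").split(/\s+/).filter(Boolean);
  for (const token of tokens) {
    const op = OPERATORS.get(token);
    if (op) {
      op(stack);
    } else if (!Number.isNaN(Number(token))) {
      stack.push(Number(token)); // numeric literal
    } else {
      throw new Error(`Unsupported operator: ${token}`);
    }
  }
  return stack; // whatever is left on the stack is the output
}

// x^2 + y^2 for inputs (3, 4): the result is [25]
const calcResult = evalCalculator("{ dup mul exch dup mul add }", [3, 4]);
```

The only state is the operand stack, which is why these functions can be evaluated anywhere in the rendering pipeline without side effects.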

The 1.0 version of the PDF standard is a more accessible place to see the foundational concepts. Since PDF borrows many PostScript conventions, it helps to know a bit about PostScript: here's a reasonable primer (written for data scientists) on how to work with PostScript that might be helpful.

Relation to Adobe Illustrator File Format

Adobe Illustrator has a full PostScript interpreter/renderer. Its file format is Encapsulated PostScript (EPS) with a compact syntax (through dictionaries) and a metadata structure stored in comments. The "Document Structuring Conventions" (DSC) are what turn PostScript into a usable graphics file format.
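As a small illustration of "metadata stored in comments", here is a sketch that pulls the DSC header comments out of an EPS file. The sample header and the `parseDscHeader` name are hypothetical; a real Illustrator file has many more comments, including Illustrator-specific ones that this does not attempt to handle.

```javascript
// Hypothetical EPS header, used only for illustration.
const epsHeader = [
  "%!PS-Adobe-3.0 EPSF-3.0",
  "%%Creator: (hypothetical example)",
  "%%Title: (demo.eps)",
  "%%BoundingBox: 0 0 200 100",
  "%%EndComments",
  "newpath 0 0 moveto 200 100 lineto stroke",
].join("\n");

// Collect "%%Key: value" comments until the header ends at %%EndComments.
function parseDscHeader(source) {
  const comments = {};
  for (const line of source.split(/\r?\n/)) {
    if (line === "%%EndComments") break;
    const match = /^%%([A-Za-z]+):\s*(.*)$/.exec(line);
    if (match) comments[match[1]] = match[2];
  }
  return comments;
}

const dsc = parseDscHeader(epsHeader);
const boundingBox = dsc.BoundingBox.split(/\s+/).map(Number); // [0, 0, 200, 100]
```

Because these are ordinary PostScript comments, the interpreter ignores them while other tools can read them without running any PostScript.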

The Wayback Machine has the old 1998 specification, which might provide some insights into the underlying concepts of what is represented in the file format.

Acrobat JavaScript

You can create Acrobat JavaScript apps that run inside Acrobat or Acrobat Reader?

  • App-level scripts are installed from the user JavaScript folder
  • Document-level scripts are accessed via Tools>Javascript>Document, and run on document open but before Page Open
  • You can pull data from external databases and web services
  • Acrobat templates can be used to add/generate pages, which includes all the form elements
  • PDF Layers are interesting because you can hide/show things.
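Here is a rough sketch of what hiding/showing a layer from a script might look like. It assumes the Acrobat document object exposes `getOCGs()` returning layer objects with `name` and `state` properties, which is my reading of the Acrobat JavaScript API reference and not something I have tested in Acrobat. The function name `setLayerVisible` is made up, and the code sticks to old-style `var` syntax in case the Acrobat engine does not support newer JavaScript.

```javascript
// Sketch: show or hide a PDF layer (OCG) by name.
// `doc` is assumed to be an Acrobat document object
// (`this` inside a document-level script).
function setLayerVisible(doc, layerName, visible) {
  var ocgs = doc.getOCGs() || []; // assumed to be null when there are no layers
  var changed = 0;
  for (var i = 0; i < ocgs.length; i++) {
    if (ocgs[i].name === layerName) {
      ocgs[i].state = visible; // assumed: true shows the layer, false hides it
      changed += 1;
    }
  }
  return changed; // number of layers that matched the name
}
```

In a document-level script this might be called as `setLayerVisible(this, "Notes", false)`, where "Notes" is a hypothetical layer name.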

Embedding EPS into PDF

You can't do this directly, since PDF is not an extension of PostScript. However, you can use Adobe Acrobat to add it as a layer. Layers are a feature called Optional Content Groups (OCG) and can be turned on/off. None of the open source libraries seem to support this, though there are paid PDF libraries that do.

PDFKit

This seems to be a better-documented library similar to jsPDF. It is built upon a different library with a similar name: pdfjs from Mozilla. Note: there is another pdfjs library that isn't the one you want. Make sure you get the Mozilla one.

It does not seem to have the table or pattern fill utilities that jsPDF has. It uses an "HTML5 canvas-like API" and can embed fonts. This is a good reference for the drawing operations and a primer on vector graphics in general. It uses the distribution version of Mozilla's PDF.js.

pdfmake

This is a pure JavaScript library for generating PDFs. There's a sample library of PDFs. It looks pretty basic.

html5-to-pdf, NodePDF

These libraries look suspect... I think they just use a browser emulator and stuff a screenshot into a PDF wrapper. They are not maintained.

Paper.js

This is a library I looked at years ago, with roots in Scriptographer, an Adobe Illustrator plugin that allowed you to extend functionality through JavaScript. Scriptographer is no longer maintained, but there is an Illustrator API that might be worth poking into (this seems to have been last updated in 2017).

The paper.js repo seems to be currently maintained, though there hasn't been a release since last year. However, it does not import/export PDF, so it is out of the running. It is more of a neat graphics library for people like me who are familiar with illustration programs.

PDF.js

This is the library used by PDFKit, and is a Mozilla project. It has actually shipped as part of Firefox since version 19. However, it's not well documented; they just point to the api.js file in the source code.

jsPDF

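// jsPDF is a library for generating PDFs in JavaScript
import { jsPDF } from "jspdf";

// Landscape document, 4 x 2 inches
const doc = new jsPDF({
  orientation: "landscape",
  unit: "in",
  format: [4, 2]
});

doc.text("Hello world!", 1, 1); // draw text at x = 1 in, y = 1 in
doc.save("two-by-four.pdf");    // write out the finished PDF
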
  • PDF is a tokenized representation of PostScript features, but is not actually PostScript code despite sharing many of its organizational and conceptual elements. jsPDF is a library that generates the tokens themselves (for example, m is the PDF token for the PostScript moveto command, but not all tokens have a PostScript equivalent).
  • Shared with PostScript: both use a "page context". You issue commands that set the state of the context, which affects how subsequent drawing commands work. This is 1980s-style computer graphics, similar to Logo.
  • jsPDF creates a PDF object that you invoke calls on, and the result is usually that same PDF object, so you can chain calls.
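The three points above can be sketched together in a few lines. This is not jsPDF's code or API: `ContentStream` and its method names are hypothetical, and the sketch only shows the idea of chainable calls that set graphics state and emit PDF operator tokens (`q`/`Q` save and restore state, `w` sets line width, `RG` sets stroke color, `m`/`l` build a path, `S` strokes it).

```javascript
// Sketch of a chainable builder that emits PDF content stream tokens.
class ContentStream {
  constructor() {
    this.ops = [];
  }
  // Append one "operands operator" line and return this for chaining.
  emit(...parts) {
    this.ops.push(parts.join(" "));
    return this;
  }
  save() { return this.emit("q"); }           // push graphics state
  restore() { return this.emit("Q"); }        // pop graphics state
  lineWidth(width) { return this.emit(width, "w"); }
  strokeColor(r, g, b) { return this.emit(r, g, b, "RG"); }
  moveTo(x, y) { return this.emit(x, y, "m"); }
  lineTo(x, y) { return this.emit(x, y, "l"); }
  stroke() { return this.emit("S"); }
  toString() { return this.ops.join("\n"); }
}

// State-setting calls (lineWidth, strokeColor) affect the later stroke().
const streamText = new ContentStream()
  .save()
  .lineWidth(2)
  .strokeColor(1, 0, 0)
  .moveTo(72, 72)
  .lineTo(200, 72)
  .stroke()
  .restore()
  .toString();
```

The resulting text is what would go between `stream` and `endstream` in a page contents object, before any compression or encryption.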

jsPDF reference

The jsPDF library seems to cover a lot of PostScript-like concepts...here's a raw dump of what's in their API reference. PDFKit, however, looks like a better-documented library.

Page
File
CreationDate
DisplayMode
DocumentProperties
FormObject
Precision
R2L
Point
Rectangle
RenderTarget

DRAWING COMMANDS
  ellipse, circle, rect, roundedRect
  moveTo, curveTo, lineTo
  line, lines (bezier)
  triangle
  path
  stroke
  text

GRAPHICS STATE
  LineCap
  LineDashPattern
  LineJoin
  LineMiterLimit
  LineWidth
  TransformationMatrix
  FillColor
  DrawColor
  TYPOGRAPHY
    Font, FontList, FontSize
    LineWidth, LineHeight
    CharSpace
    LineHeightFactor
    TextColor

ADVANCED MODE
  ShadingPattern
  TilingPattern

PDF SECURITY
  constructor (returns 'security' obj)
  encryptor
  md5cycle
  processOwnerPassword

UTILITIES
  hexToBytes, rc4, toHexString
  lsbFirstWord
  RGBColor

MODULE: AcroForm
  AcroFormButton
  AcroFormCheckBox
  AcroFormChoiceField
  AcroFormComboBox
  AcroFormEditBox
  AcroFormField
  AcroFormListBox
  AcroFormPasswordField
  AcroFormPDFObject
  AcroFormPushButton
  AcroFormRadioButton
  AcroFormTextField

MODULE: addImage, gif_support, png_support, rgba_support, svg, webp_support
insert an image into PDF

MODULE: annotations 
  these are drawn on a page at a location
  Links, Text, Popup, FreeText

MODULE: autoprint, viewerpreferences
makes print dialog open 

MODULE: canvas
mimics HTML5 canvas

MODULE: context2d
every object in a canvas has a context2d
  arc, arcTo, createArc
  rect
  beginPath, closePath, lineTo, moveTo, bezierCurveTo, quadraticCurveTo
  fillRect, strokeRect, clearRect
  fillText, strokeText
  getLineDash, setLineDash
  setTransform, transform
  translate, scale, rotate, clip (current drawing)
  measureText
  save, restore
  stroke
  toDataURL

MODULE: cell
  table drawing

MODULE: fileloading

MODULE: setLanguage, utf8, xmp_metadata
sets the language code for document

MODULE: standard_fonts_metrics, ttfsupport
adds to the built-in core fonts

MODULE: split_text_to_size
given string, returns array of strings that fit 

MODULE: vFS
virtual file system