PDF File Format
PDF is, conceptually, the tokenized output of a PostScript source file: a PostScript interpreter runs the program, and the resulting output is collected into a package that includes any referenced file assets. The file structure is built from a number of objects, streams, and dictionaries.
Document Structure Overview
%PDF-1.5
1 0 obj << /Type /Catalog ... endobj
2 0 obj << /Type /Pages ... endobj
3 0 obj << /Type /Page ... endobj
4 0 obj << /Length 35
>> stream ... endstream
endobj
5 0 obj << /Title ... endobj
6 0 obj << /Type /Metadata ... endobj
8 0 obj << /Filter /MySecurityHandlerName ... endobj
xref
...
trailer
<< /Size 8 /Root 1 0 R /Info 5 0 R /Encrypt 8 0 R >>
startxref
495
%%EOF
FIRST there are a bunch of objects:
- CATALOG object: points to PAGE TREE and METADATA
- PAGE TREE object: points to KIDS (the individual pages); there is only 1 here
- FIRST PAGE object: parent tree, media box, points to CONTENTS
- PAGE CONTENTS object(s): a stream of encrypted "page marking operators"
- INFO DICTIONARY object: a dictionary of document information (title and the like)
- METADATA object: crypto filter parameters, unencrypted XML stream
- object 7 is blank (not defined in this example)
- ENCRYPTION DICT obj: cryptographic algorithm selection, public keys
THEN there is an XREF table followed by a TRAILER, which points to the "root" object (the CATALOG) via /Root.
- note that PDFs are designed so that additional objects can be appended without renumbering the existing ones: an update adds new objects plus a new XREF section and TRAILER at the end of the file, which is why readers start from the trailer. It's sort of like a tape archive in that sense?
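To make the object / XREF / TRAILER layout above concrete, here is a minimal sketch that assembles a one-page PDF by hand. The function names (buildMinimalPdf, findXrefOffset) and the object numbering are made up for illustration and do not come from any library. It assumes ASCII-only text with no parentheses or backslashes, so that string length equals byte length and no escaping is needed. It also leaves out the encryption, info, and metadata objects from the listing above.

```javascript
// Sketch only: build a tiny, unencrypted one-page PDF as a string.
// Assumes `text` is plain ASCII without "(", ")" or "\".
function buildMinimalPdf(text) {
  const content = `BT /F1 24 Tf 72 720 Td (${text}) Tj ET`;
  const objects = [
    "<< /Type /Catalog /Pages 2 0 R >>",
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] " +
      "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    `<< /Length ${content.length} >>\nstream\n${content}\nendstream`,
    "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
  ];

  let pdf = "%PDF-1.4\n";
  const offsets = [];
  objects.forEach((body, i) => {
    offsets.push(pdf.length); // byte offset where "N 0 obj" starts
    pdf += `${i + 1} 0 obj\n${body}\nendobj\n`;
  });

  // The xref table lists the byte offset of every object.
  const xrefOffset = pdf.length;
  pdf += `xref\n0 ${objects.length + 1}\n`;
  pdf += "0000000000 65535 f \n"; // entry 0 is the head of the free list
  for (const off of offsets) {
    pdf += `${String(off).padStart(10, "0")} 00000 n \n`;
  }

  // The trailer names the root (catalog); startxref says where xref begins.
  pdf += `trailer\n<< /Size ${objects.length + 1} /Root 1 0 R >>\n`;
  pdf += `startxref\n${xrefOffset}\n%%EOF\n`;
  return pdf;
}

// A reader works backwards: it finds startxref at the end of the file first.
function findXrefOffset(pdf) {
  const match = /startxref\s+(\d+)\s+%%EOF\s*$/.exec(pdf);
  return match ? Number(match[1]) : -1;
}

const samplePdf = buildMinimalPdf("Hello");
const sampleXrefAt = findXrefOffset(samplePdf);
console.log(samplePdf.slice(sampleXrefAt, sampleXrefAt + 4)); // "xref"
```

Because the reader starts from the end of the file, an update can append new objects plus a fresh xref section and trailer without touching the bytes that are already there.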
The specific features of a PDF as far as drawing stuff:
- Graphics Objects describes all the graphic-related operations that PDF supports.
- Text Objects describes all the text-related operations that PDF supports.
These are also summarized in a big table in the Operator Summary, which includes the PostScript equivalent if it exists.
Also, Type 4 (PostScript Calculator) Functions provide limited dynamic calculation, but they must be pure functions (no variables at all).
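To get a feel for what "pure function, no variables" means in practice, here is a rough sketch of a stack-based evaluator in the style of a PostScript calculator function. The name evalCalculator and the operator table are hypothetical, and only a handful of arithmetic and stack operators are covered. Conditionals and nested procedures are not handled (the braces are simply stripped), so treat it as an illustration of the idea and not as an implementation of the spec.

```javascript
// Sketch only: evaluate a tiny subset of PostScript-calculator-style programs.
// Inputs are pushed on a stack, operators consume and produce stack values,
// and whatever is left on the stack is the output. There are no variables.
const binary = (fn) => (stack) => {
  if (stack.length < 2) throw new Error("stack underflow");
  const b = stack.pop();
  const a = stack.pop();
  stack.push(fn(a, b));
};

const OPERATORS = new Map([
  ["add", binary((a, b) => a + b)],
  ["sub", binary((a, b) => a - b)],
  ["mul", binary((a, b) => a * b)],
  ["div", binary((a, b) => a / b)],
  ["dup", (stack) => {
    if (stack.length < 1) throw new Error("stack underflow");
    stack.push(stack[stack.length - 1]);
  }],
  ["exch", (stack) => {
    if (stack.length < 2) throw new Error("stack underflow");
    const b = stack.pop();
    const a = stack.pop();
    stack.push(b, a);
  }],
  ["pop", (stack) => {
    if (stack.length < 1) throw new Error("stack underflow");
    stack.pop();
  }],
]);

function evalCalculator(program, inputs) {
  const stack = [...inputs];
  // Strip the outer braces and split into whitespace-separated tokens.
  const tokens = program.replace(/[{}]/g, " ").split(/\s+/).filter(Boolean);
  for (const token of tokens) {
    const op = OPERATORS.get(token);
    if (op) {
      op(stack);
      continue;
    }
    const value = Number(token);
    if (Number.isNaN(value)) throw new Error(`unsupported token: ${token}`);
    stack.push(value); // numeric literal
  }
  return stack;
}

// x*x + y*y for the inputs (3, 4), which leaves 25 on the stack
console.log(evalCalculator("{ dup mul exch dup mul add }", [3, 4]));
```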
The 1.0 version of the PDF standard is a bit more accessible if you want to see the foundational concepts. Since PDF is derived from many PostScript conventions, it helps to know a bit about PostScript: here's a reasonable primer (written for data scientists) on how to work with it that might be helpful.
Relation to Adobe Illustrator File Format
Adobe Illustrator has a full PostScript interpreter/renderer. Its file format is Encapsulated PostScript (EPS) with a compact syntax (through dictionaries) and a metadata structure stored in comments. "Document Structuring Conventions" (DSC) is a convention that turns PostScript into a usable graphics file format.
The Wayback Machine has the old 1998 specification, which might provide some insight into the underlying concepts represented in the file format.
Acrobat Javascript
You can create Acrobat JavaScript apps that run inside Acrobat or Acrobat Reader?
- App-level scripts are installed from the user JavaScript folder
- Document-level scripts are accessed via Tools>Javascript>Document, and run on doc open but before Page Open
- You can pull data from external databases and webservices
- Acrobat templates can be used to add/generate pages, which includes all the form elements
- PDF Layers are interesting because you hide/show things.
Embedding EPS into PDF
You can't do this directly since PDF is not an extension of PostScript. However, you can use Adobe Acrobat to add it as a layer. Layers are a feature called Optional Content Groups (OCG), and they can be turned on/off. However, none of the open source libraries seem to support OCG. There are paid libraries for PDF that do, though.
PDFKit
This seems to be a better-documented library similar to jsPDF. It appears to be built upon a different library with a similar name: pdfjs from Mozilla. Note: there is another pdfjs library that isn't the one you want. Make sure you get the Mozilla one.
It does not seem to have the table or pattern-fill utilities that jsPDF has. It uses an "HTML5 canvas-like API" and can embed fonts. This is a good reference for the drawing operations and a primer on vector graphics in general. It uses the distribution version of Mozilla's PDF.js.
pdfmake
This is a pure js library to generate PDFs. There's a sample library of PDFs. Looks pretty basic.
html5-to-pdf, NodePDF
These libraries look suspect...I think they just use a browser emulator and stuff a screenshot into a PDF wrapper. Not maintained.
Paper.js
This is a library I looked at years ago, with roots in the Adobe Illustrator Scriptographer plugin, which allowed you to extend Illustrator's functionality through JavaScript. Scriptographer is no longer maintained, but there is an Illustrator API that might be worth poking into (this seems to have been last updated in 2017).
The paper.js repo seems to be maintained currently, though there hasn't been a release since last year. However, it does not import/export PDF, so it is out of the running. It is more of a neat graphics library for people like me who are familiar with illustration programs.
PDF.js
This is the library used by PDFKit, and it is a Mozilla project. It has actually shipped as part of Firefox since version 19. However, it's not well documented. They just point to the source code api.js file.
jsPDF
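// jsPDF is a library for generating PDFs
import { jsPDF } from "jspdf";

// page setup: a 4 x 2 inch landscape page
const doc = new jsPDF({
  orientation: "landscape",
  unit: "in",
  format: [4, 2]
});

doc.text("Hello world!", 1, 1); // draw text at x = 1in, y = 1in
doc.save("two-by-four.pdf");    // save the result under this filename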
- PDF is a tokenized representation of PostScript features, but is not actually PostScript code despite sharing many of its organizational and conceptual elements. jsPDF is a library that generates the tokens themselves (for example, m is the token for the PostScript moveto command, but not all tokens have a PostScript equivalent).
- Shared: PostScript uses a "page context". You issue commands that set the state of the context, which affects how subsequent drawing commands work. This is 1980s-style computer graphics that is similar to Logo.
- jsPDF creates a PDF object that you invoke calls on, and the result is the PDF object so you can chain calls.
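The two ideas above (a stateful context, and calls that return the object so they can be chained) can be sketched without any library. The ContentStream class below is hypothetical and is not jsPDF's API. It just records PDF content-stream operators (q/Q to save/restore the graphics state, w for line width, m and l for moveto/lineto, S for stroke) and returns itself from every call.

```javascript
// Sketch only: a toy "page context" that emits PDF content-stream operators.
// Every method returns `this`, which is what makes chaining possible.
class ContentStream {
  constructor() {
    this.ops = [];                  // emitted operators, one per line
    this.state = { lineWidth: 1 };  // current graphics state (PDF default width is 1)
    this.saved = [];                // stack used by save()/restore()
  }

  emit(...parts) {
    this.ops.push(parts.join(" "));
    return this;
  }

  // q / Q push and pop the graphics state.
  save() {
    this.saved.push({ ...this.state });
    return this.emit("q");
  }

  restore() {
    if (this.saved.length > 0) this.state = this.saved.pop();
    return this.emit("Q");
  }

  // State-setting command: affects every stroke that follows.
  setLineWidth(width) {
    this.state.lineWidth = width;
    return this.emit(width, "w");
  }

  // Path construction and painting.
  moveTo(x, y) { return this.emit(x, y, "m"); }
  lineTo(x, y) { return this.emit(x, y, "l"); }
  stroke() { return this.emit("S"); }

  toString() { return this.ops.join("\n"); }
}

const stream = new ContentStream()
  .save()
  .setLineWidth(4).moveTo(10, 10).lineTo(100, 10).stroke() // thick line
  .restore()
  .moveTo(10, 20).lineTo(100, 20).stroke();                // back to default width

console.log(stream.toString());
```

The second line is drawn at the default width because restore() pops the state that was saved before setLineWidth(4) was issued.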
jsPDF reference
The jsPDF library seems to cover a lot of PostScript-like concepts...here's a raw dump of what's in their API reference. PDFKit, however, looks like a better-documented library.
Page
File
CreationDate
DisplayMode
DocumentProperties
FormObject
Precision
R2L
Point
Rectangle
RenderTarget
DRAWING COMMANDS
ellipse, circle, rect, roundedRect
moveTo, curveTo, lineTo
line, lines (bezier)
triangle
path
stroke
text
GRAPHICS STATE
LineCap
LineDashPattern
LineJoin
LineMiterLimit
LineWidth
TransformationMatrix
FillColor
DrawColor
TYPOGRAPHY
Font, FontList, FontSize
LineWidth, LineHeight
CharSpace
LineHeightFactor
TextColor
ADVANCED MODE
ShadingPattern
TilingPattern
PDF SECURITY
constructor (returns 'security' obj)
encryptor
md5cycle
processOwnerPassword
UTILITIES
hexToBytes, rc4, toHexString
lsbFirstWord
RGBColor
MODULE: AcroForm
AcroFormButton
AcroFormCheckBox
AcroFormChoiceField
AcroFormComboBox
AcroFormEditBox
AcroFormField
AcroFormListBox
AcroFormPasswordField
AcroFormPDFObject
AcroFormPushButton
AcroFormRadioButton
AcroFormTextField
MODULE: addImage, gif_support, png_support, rgba_support, svg, webp_support
insert an image into PDF
MODULE: annotations
these are drawn on a page at a location
Links, Text, Popup, FreeText
MODULE: autoprint, viewerpreferences
makes print dialog open
MODULE: canvas
mimics HTML5 canvas
MODULE: context2d
every object in a canvas has a context2d
arc, arcTo, createArc
rect
beginPath, closePath, lineTo, moveTo, bezierCurveTo, quadraticCurveTo
fillRect, strokeRect, clearRect
fillText, strokeText
getLineDash, setLineDash
setTransform, transform
translate, scale, rotate, clip (current drawing)
measureText
save, restore
stroke
toDataURL
MODULE: cell
table drawing
MODULE: fileloading
MODULE: setLanguage, utf8, xmp_metadata
sets the language code for document
MODULE: standard_fonts_metrics, ttfsupport
adds to the built-in core fonts
MODULE: split_text_to_size
given string, returns array of strings that fit
MODULE: vFS
virtual file system