PDF for an RFC Series Output Document Format
AT&T Laboratories
200 Laurel Ave. South
Middletown
NJ
07748
USA
tony@att.com
Adobe
345 Park Ave
San Jose
CA
95110
USA
masinter@adobe.com
http://larry.masinter.net
Adobe
345 Park Ave
San Jose
CA
95110
USA
mahardy@adobe.com
RFC Design Team
Requests for Comment
This document discusses options and requirements for the PDF
rendering of RFCs in the RFC Series, as outlined in RFC
6949. It also discusses the use of PDF for Internet-Drafts,
and available or needed software tools for producing and
working with PDF.
The RFC Series is evolving, as outlined in
. Future documents will use a canonical format, XML,
with renderings in various formats, including PDF.
Because PDF has a wide range of capabilities and
alternatives, not all PDFs are "equal". For example, visually
similar documents could consist of scanned or rasterized images,
or include text layout options, hyperlinks, embedded fonts, and
digital signatures.
(See
for a history of PDF.)
This document explains some of the relevant options and
makes recommendations, both for the RFC series and Internet-Drafts.
The PDF format and the tools to manipulate it are
not as well known as those for the other RFC formats, at least
in the IETF community. This document
discusses some of the processes for creating and using PDFs
using both open source and commercial products.
The details described in this document are expected to change based on experience
gained in implementing the RFC production center's toolset.
Revised documents will be published capturing those changes as the toolset is completed.
Other implementers must not expect those changes to remain backwards-compatible with the
details described in this document.
NOTE: [RFC-EDITOR: This note should be removed before publication.]
See
for XML source, related files, and an issue tracker for
this document.
PDF has gone through several revisions,
primarily for the addition of features. PDF
features have generally been added in a way that older viewers
'fail gracefully', but even so, the older the PDF version
produced, the more legacy viewers will support that version,
but the fewer features will be enabled.
As PDF has evolved a broad set of capabilities, additional
standards for PDF files are applicable. These standards
establish ground rules that are important for specific
applications. For example PDF/X was specifically designed
for Prepress digital data exchange, with careful attention
to color management and printing instructions. The PDF/E
standard was designed for engineering documents with dynamic
workflows (where a document continues to be revised after
publication) and allows interactive media (including
animation and 3D).
Two additional standards families are important to the
RFC format, though: long-term preservation (PDF/A), and user
accessibility (PDF/UA). These then have sub-profiles (PDF/A-1,
PDF/A-2, PDF/A-3), each of which have conformance levels.
These standards are then supported
by various software libraries and tools.
It is effective and useful to use these standards to capture
PDF for RFC requirements, and they will make the PDF files useful
in workflows that expect them.
Recommendations:
Use PDF 1.7; although relatively recent, it is well
supported by widely available viewers.
For RFCs, require PDF/A-3 with conformance level "U".
This captures the archivability and
long-term stability of PDF 1.7 files, mandatory Unicode mapping,
and many of the requirement features.
Use PDF/A-3 for embedding additional data (including
the XML source file) in RFCs and Internet-Drafts.
Use PDF/UA for user accessibility.
This section lays out options and requirements for PDFs produced by
the RFC editor for RFCs. There are two sections:
"Visible" options are related to how the PDF appears when
it is viewed with a PDF viewer. "Internal Structure" options
affect the ability to process PDFs in other ways, but do not
control the way the document appears. (Of course, a viewer
UI might display processing capabilities, such as showing
whether a document has been digitally signed.)
In many cases, the choice of PDF requirements is heavily
influenced by the capabilities of available tools to create
PDFs. Most of the discussion of tooling is to be found in
.
PDF supports rich visible layout of fixed-sized pages.
For a consistent "look" of RFC and good style, the PDFs
produced by the RFC editor should have a clear, consistent,
identifiable and easy-to-read
style. They should print well on the widest range of
printers, and look good on displays of varying
resolution.
PDF files are laid out for a particular size of page and
margins. There are two paper sizes in common use: "US
Letter" (8.5 x 11 inches, 216x279 mm, in popular use in
North America) and "A4" (210x297 mm, 8.27x11.7 inches,
standard for the rest of the world). Usually PDF printing
software is used in a "shrink to fit" mode where the
printing is adjusted to fit the paper in the printer.
There is some controversy, but the argument that A4 is
an international standard is compelling.
However, if the margins and header positioning are chosen
appropriately, the document can be printed without any
scaling.
Recommendation:
The Internet-Draft and RFC processors should produce A4 size
by default.
However, the margins and header positioning need to be chosen
to look good on both paper sizes without scaling.
Following the advice found in , this
means that we should use A4 portrait mode with left and right
margins of 20 mm, and top and bottom margins of 33 mm.
Page headers and footers are part of the page layout.
There are a variety of options. Note that page headers and
footers in PDF can be typeset in a way that the entire
(longer) title might fit.
Recommendation: Page headers and footers should contain similar
information as the headings in the current text versions of
documents, including page numbers, title, author, working
group. However, the page headers and footers should be typeset
in a way so as to be unobtrusive.
The page headers and footers should be placed into the PDF
in a way not to interfere with screen readers.
One common feature of the Internet-draft output formats
are optional visible paragraph numbers, to aid in discussions.
In the PDF and thus printed rendition, it is possible
to make paragraph numbers unobtrusive, and even
to impinge on the margins.
Recommendation: When the XML "editing=yes" option has been chosen,
show paragraph numbers in the right margin, typeset in a way so
as to be unobtrusive.
(The right margin instead of the left margin prevents the paragraph
numbers from being confused with the section numbers.)
If possible, the paragraph numbers should be coded in a way
that they do not interfere with screen readers.
By its nature, PDF is paginated, so pagination issues must
be considered.
This is reflected in two areas: running headers and footers,
and how text is layed out on a page for optimal reading.
describes
the process of creating a paged document from running text
such that related material is present
on the same page together and artifacts of pagination
don't interfere with easy reading of the document.
Layout engines differ in the quality of the algorithms
used to automate these processes. In some cases, the
automated processes require some manual assistance
to ensure, for example, that a text line intended
as a heading is "kept" with the text it is heading for.
Recommendations:
Headers and footers should be printed on each page.
The information should include the RFC number or internet-draft name,
the page number, the category (informational, etc.),
a shortened version of the authors' names,
the date of the RFC or internet-draft, and
the short form of the document title.
Choose a layout engine so that manual
intervention is minimized, and that widow and
orphan processing, heading and title contiguation
are automatic.
A PDF may refer to a font by name, or it may use an
embedded font. When a font is not embedded, a PDF viewer
will attempt to locate a locally installed font of the same
name. If it can not find an exact match, it will find a
"close match". If a close match is not available, it will
fall back to something implementation dependent and
usually undesirable.
In addition, the PDF/A standards mandate the embedding
of fonts. Instead of using additional software to embed the fonts,
the software generating the PDF files should produce PDF/A-conforming
files directly, thus ensuring that all glyphs include Unicode
mappings and embedded fonts from the outset.
If the HTML version of the document is
being visually mimicked, the font(s) chosen should have
both variable width and constant width components, as well
as bold and italic representations.
The typefaces used by Internet-Drafts and by RFCs
need not be identical.
Few fonts have glyphs for the entire repertoire of
Unicode characters; for this purpose, the PDF generation
tool may need a set of fonts and a way of choosing them.
The RFC Editor is defining where Unicode characters
may be used within
RFCs.
Typefaces are typically licensed and, in many cases,
there is a fee for use by PDF creation tools; however,
not for display or print of the embedded fonts.
Recommendations:
For consistent viewing, all fonts should be embedded.
The fonts used must be available for use by the IETF community.
Some discussion of available typefaces can be found in
.
The choice of type faces with respect to serif, sans serif,
monospace, etc., should follow the recommendations for
HTML and CSS rendering
and .
The range of Unicode characters allowed in
the XML source for Internet-Drafts and RFCs may
be bounded by the availability of embeddable fonts with
appropriate glyphs .
Typically, when doing page layout of running text,
especially with narrow page width and long words,
layout processors of English text often have the
option of hyphenating words, or using existing
hyphens as a place to introduce word breaks.
However, inserting line breaks mid-word can be harmful
when the "word" is actually a sequence of characters
representing a protocol element or protocol sequence.
Recommendation: avoid introducing hyphenated
line breaks mid-word into the visual display,
consistent with requirements for plain text
and HTML.
PDF supports hyperlinks both to sections of the same
document and to other documents.
The conversion to PDF can generate:
hyperlinks within the document
hyperlinks to other RFCs and Internet-Drafts
hyperlinks to external locations
hyperlinks within a table of contents
hyperlinks within an index
Recommendations:
All hyperlinks available in the HTML rendition of the
RFC should also be visible and active in the PDF
produced.
This includes both internal hyperlinks and hyperlinks to
external resources.
The table of contents, including page numbers, are useful
when printed. These should also be hyperlinked to their
respective sections.
As specified in the section on Referencing RFCs in
, hyperlinks to RFCs from the
references section should point to the RFC "info" page, which
then links to the various formats available.
Hyperlinks to Internet-Drafts from the
references section should point to the datatracker entry
page for the draft, which
then links to the various formats available.
There is some advantage to having the PDF files look like
the text or HTML renderings of the same document. There are
several options even so.
The PDF
could look like the text version of the document, or
could look like the text version of the document but
with pictures rendered as pictures instead of using their
ASCII-art equivalent, or
could look like the HTML version.
Recommendation: the PDF rendition should look like the
HTML rendition, at least in spirit. Some
differences from the HTML rendition would include different
typeface and size (chosen for printing), page numbers in the
table of contents and index, and the use of page headers and footers.
Most of the choices used for the
rendering and are thus applicable.
See those documents for specifics on the rendering of the specific XML elements.
Some notes are:
Every place in the document that would receive an HTML ID would be given
an identical PDF named destination.
In addition, a named destination will be created for each page with the form
"pg-#", as in "pg-35".
No pilcrows are generated or made visible.
The table of contents (generated if the XML's <rfc>
element's tocInclude attribute has the value "true")
will have the section number linked to that section named destination, but
will also include a page number that is linked to the page named destination.
The section title and the page number will be separated by a
visually-appropriate separator and the page numbers will be aligned
with each other.
The index (generated if the XML's <rfc>
element's indexInclude attribute has the value "true")
will have the section number linked to that section named destination, but
will also include a page number that is linked to the page named destination.
The running header in one line (on page 2 and all subsequent
pages) has the RFC number on the left (RFC NNNN), the (possibly
shortened form) title centered, and the date (Month Year) on the
right.
The text is rendered in a way that is visually unobtrusive.
The running footer in one line (on all pages) has the author's
last name on the left, category centered, and the page number on
the right ([Page N]).
The text is rendered in a way that is visually unobtrusive.
We should not attempt to replicate in PDF the feature of the
HTML format that includes a dynamic block that displays
up-to-date information on updates, obsoletions and errata.
PDF offers a number of features which improve the utility
of PDF files in a variety of workflows, at the cost of extra
effort in the xml2rfc conversion process; the tradeoffs may be
different for the RFC editor production of RFCs and for
Internet-Drafts.
The contents of a PDF file can be represented in many
ways. The PDF file could be generated:
as an image of the visual representation, such as a
JPEG image of the word "IETF". That is, there might be no internal
representation of letters, words or paragraphs at all.
placing individual characters in position on the page,
such as saying "put an 'F' here", then "put an 'T' before
it", then "put an 'E' before that", then "put an 'I'
before that" to render the word "IETF". That is, there might be
no internal representation of words or paragraphs at all.
placing words in position on the page, such as keeping
the word "IETF" would be kept together. That is, there might be
no internal representation of paragraphs at all.
ensuring that the running order of text in the content
stream matches the logical reading order. That is, a sentence
such as 'The Internet Engineering Task Force (IETF)
supports the Internet.' would be kept together as a sentence,
and multiple sentences within a paragraph would be kept together.
All of these end up with essentially the same visual
representation of the output. However, each level has
tradeoffs for auxiliary uses, such as searching or indexing,
commenting and annotation, and accessibility
(text-to-speech).
Keeping the running order of text in the content stream in the
proper order supports all of these auxiliary uses.
In addition, the "role map" feature of PDF
()
would additionally allow for the mapping of the logical tags
found in the original XML into tags in the PDF.
Recommendations:
Text in content streams should follow the XML
document's logical order (in the order of tags) to the
extent possible. This will provide optimal reuse by
software that does not understand Tagged PDF.
(PDF/UA requires this.)
It might be possible to use the "role map" annotation
to capture enough of the xml2rfc source structure,
to the point where it is possible to reconstruct
the XML source structure completely.
However, there is not a compelling case to do so
over embedding the original XML, as described in
.
PDF itself does not require use of Unicode. Text is
represented as a sequence of glyphs which then can be
mapped to Unicode.
Recommendations:
PDF files generated must have the full text,
as it appears in the original XML.
Unicode normalization may occur.
Text within SVG for SVG images should
also have Unicode mappings.
Alt-text for images should also support Unicode.
The XML allows both ASCII art and SVG to be used for artwork.
Recommendations:
If both ASCII art and SVG are available for a picture,
the SVG artwork should be the preferred over the ASCII artwork.
ASCII artwork must be rendered using a monospace font.
Guidelines for accessibility of PDF
recommend that images, formulas, and other non-text items
provide textual alternatives, using the '/Alt' Tag in
PDF to provide human-readable text that can be vocalized
by text-to-speech technology.
Recommendation: Any alt-text for artwork and figures
available in the XML source should be stored using the PDF
/Alt property. Internet draft authors and the RFC editor
should ensure inclusion of alt-text for all SVG or images,
within the XML source.
Metadata encodes information about the document authors,
the document series, date created, etc. Having this
metadata within the PDF file allows it to be used by
search engines, viewers and other reuse tools.
PDF supports embedded metadata in a variety of
ways, including using XMP
, the Extensible Metadata Platform (XMP).
The RFC editor maintains metadata about an RFC on its
info page.
Recommendation: The PDFs generated should have all of the
metadata from the XML version embedded directly as XMP
metadata, including the author, date, the
document series, and a URL for where the document can be
retrieved. This information should be consistent with
the RFC editor info page at the time of publication.
PDF supports an "outline" feature where sections of the
document are marked; this could be used in addition to the
table of contents as a navigation aid.
The section structure of an RFC can be mapped into the PDF
elements for the document structure.
This will allow the
bookmark feature of PDF readers to be used to quickly access
sections of the document.
Recommendation: The section structure of an RFC should be
mapped into the PDF elements for the document structure.
This would include section headings for the boilerplate
sections such as the Abstract, Status of the Document, Table
of Contents, and Author Addresses, plus the obvious
section headings that are normally included in
the Table of Contents. If possible, this should be done in a way
that the same fragment identifiers for the HTML version of
the RFC will work for the PDF version.
PDF has the capability of including other files;
the files may be labeled both by a media type and a
role, the AFRelationship key .
In this way, the PDF file acts also as a container.
Embedded content may be compressed.
Many PDF viewers support the ability to
view and extract embedded files, although this capability
is not universal.
Embedding content in the PDF file allows the PDF to act as
a complete package, which can be transformed, archived,
and digitally signed.
(Some sample code illustrating how items can be attached to a
PDF file and subsequently extracted can be found
at .)
Useful possibilities:
Embed the source XML input file itself within the PDF.
If the source SVG and images for illustrations are also
embedded, this would make the PDF file totally
self-referential.
Embed directly extractable components that are useful for
independent processing, including ABNF, MIBs, source code
for reference implementations. This capability might be
supported through other mechanisms from the XML source
files, but could also be supported within the PDF.
Finding, extracting and embedding other components
may require additional markup to clearly identify them,
and additional review to ensure the correctness of
embedded files that are not visible.
Recommendations:
Embed the XML source and all illustrations,
for RFCs, as a standard
feature for xml2rfc's PDF output.
If possible, make this a standard feature for
Internet-Drafts as well.
Named <sourcecode> entries should be embedded.
Bitmap images (SVG sources, JPEGs, PNGs, etc) should be embedded.
The RFC Editor and staff are at times called to provide evidence
that a particular RFC is the "original" and has not been modified;
digital signatures can provide that verification.
As signatures also apply to embedded content, embedding the
XML source will provide a way of signing the source XML
that was used to product the PDF file as well.
PDF has supported digital signatures since PDF 1.2, and there
are multiple methods and options available for signing PDF files.
The signing of internet-drafts and RFCs will be guided by
.
Portable document format -- Part 1: PDF 1.7
ISO
Also available free from Adobe.
Extensible metadata platform (XMP) specification --
Part 1: Data model, serialization and core properties
ISO
Not available free, but there
are a number of descriptive resources, e.g.,
Electronic document file format for long-term
preservation -- Part 2: Use of ISO 32000-1 (PDF/A-2).
ISO
Electronic document file format for long-term
preservation -- Part 3: Use of ISO 32000-1 with support for
embedded files (PDF/A-3)
ISO
Electronic document file format enhancement for
accessibility -- Part 1: Use of ISO 32000-1
(PDF/UA-1)
ISO
NOTE: this section is meant as an overview
to give some background.
The RFC series has for a long time accepted Postscript
renderings of RFCs, either in addition to or instead of the
text renderings of those same RFCs. These have usually been
produced when there was a complicated figure or mathematics
within the document. For example, consider the figures and
mathematics found in RFC 1119 and RFC 1142, and compare the
figures found in the text version of RFC 3550 with those in
the Postscript version. The RFC editor has provided a PDF
rendering of RFCs. Usually, this has been a print of the
text file that does not take advantage of any of the broader
PDF functionality, unless there was a Postscript version
of the RFC, which would then be used by the RFC editor to
generate the PDF.
In addition to PDFs generated and published by the RFC
editor, the IETF tools community has also long supported PDF
for Internet-Drafts. Most RFCs start with Internet-Drafts,
edited by individual authors. The Internet-Drafts submission
tool at https://datatracker.ietf.org/submit/ accepts PDF and
Postscript files in addition to the (required) text submission
and (currently optional) XML. If a PDF wasn't submitted for a
particular version of an Internet-Draft, the tools would
generate one from the Postscript, HTML, or text.
The process of creating a paged document from running text
typically involves ensuring that related material is present
on the same page together, and that artifacts of pagination
don't interfere with easy reading of the document. Typical
high-quality layout processors do several things:
Widows and orphans ()
should be avoided automatically (unless the entire paragraph is only one line).
Ensure that a page break does not occur after the first
line of a paragraph (orphans), if necessary, using slightly longer
page sizes.
Similarly, ensure that a page break does not occur before
the last line of a paragraph (widows).
Do not insert a page break immediately after a section heading.
If there isn't room on a page for the first (two)
lines of a section after the section heading,
insert a page break before the heading.
Figures should not be split from figure titles.
If possible, keep the figure on the same page
as the (first) mention of the figure.
Another common option in producing paginated documents
is to include the column headings of a table if
the table cannot be displayed on a single page.
Similarly, tables should not be split from the table titles.
The XML attributes of "keepWithNext" and "keepWithPrevious"
should be followed whenever possible.
The XML entities such as NBSP and NBHYPHEN should be
followed as directed whenever possible.
This section discusses tools for viewing, comparing,
creating, manipulating, transforming PDF files, including those
currently in use by the RFC editor and Internet-Drafts, as well
as outlining available PDF tools for various processes.
As with most file formats, PDF files are experienced through
a reader or viewer of PDF files. For most of the common
platforms in use (iOS, OS X, Windows, Android, ChromeOS,
Kindle) and for most browsers (Edge, Safari, Chrome,
Firefox), PDF viewing is built in. In addition there are
many PDF viewers available for download and install.
PDF viewers vary in capabilities, and it is important to note
which PDF viewers support the features utilized in PDF RFCs and
Internet-Drafts (features such as links, digital signatures,
Tagged PDF and others mentioned in
).
While almost all viewers also support printing of
PDF files, printing is one of the most important use
cases for PDFs. Some printers have direct PDF
support.
Because the xml2rfc format is a unique format,
software for converting XML source documents to
the various formats will be needed, including
PDF generation.
One promising
direction is suggested in
: using
XSLT to generate XSL-FO which is then processed by a formatting object processor such as Apache FOP.
Several libraries are also available for generating PDF signatures.
The choice of library to use for xml2pdf will depend on many factors:
programming language, quality of implementation, quality of
PDF generated, support, cost, availability,
and so forth.
This section is intended to discuss available typefaces
that might satisfy requirements. Some openly available
fixed-width typefaces (without extensive Unicode support,
however) include:
Source Sans
Source Serif Pro
Source Code Pro
A font that looks promising for its broad Unicode support is
Skolar,
but it requires licensing.
Another potentially useful set of typefaces is the Noto family from Google.
In addition to generating and viewing PDF, other categories of
PDF tools are available and may be useful both during
specification development and for published RFCs.
These include tools for comparing two PDFs, checkers that could
be used to validate the results of conversion, reviewing and commentary
tools that attach annotations to PDF files, and digital signature
creation and validation.
Validation of an arbitrary author-generated PDF file
would be quite difficult; there are few PDF validation
tools. However, if RFCs and Internet-Drafts are generated by
conversion from XML via xml2rfc, then explicit validation of PDF
and adherence to expected profiles would mainly be useful to ensure
that xml2rfc has functioned properly.
Recommendations:
Discourage (but allow) submission of a PDF representation
for Internet-Drafts. In most cases, the PDF for an Internet-Draft
should be produced automatically when XML is submitted,
with an opportunity to verify the conversion.
The input of the following people is gratefully acknowledged:
Nevil Brownlee (ISE),
Brian Carpenter,
Chris Dearlove,
Martin Duerst,
Heather Flanagan (RSE),
Joe Hildebrand,
Paul Hoffman,
Duff Johnson,
Ted Lemon,
Sean Leonard,
Henrik Levkowetz,
Julian Reschke,
Adam Roach,
Leonard Rosenthol,
Alice Russo,
Robert Sparks,
Andrew Sullivan,
and
Dave Thaler.