
The sad state of PDF-Accessibility of LaTeX Documents

August 11, 2016

[I will use this blog as a dump for random things and thoughts from now on. German, Japanese, English – all mixed. Topic-wise: anything from computer science, life in general, up to things related to Japan.]

Accessibility is becoming more important. Whereas the 90s saw quick and uncoordinated development of various technologies that didn’t really account for folks with disabilities, today’s developments fortunately take that into account.

Take PDF files, for example. After being a half-open format for quite some time, PDF is nowadays an official ISO standard. And the specification of PDF requires accessible PDFs to be tagged. What this means is that in addition to the printable graphical description of the content, all content should also be included in a tagged, somewhat primitive XML-like structure. Something like (very much simplified here):

[element header]
 This is a very important document
[end header]
[table start]
    [table row 1]
        [table cell]
        [end table cell]
    [end table row]
[end table]
[graphic start]
    [alternative text]
        This image shows something beautiful
    [end alternative text]
[end graphic]

Well, you can see the idea here. This allows screen readers to extract information out of the document and read it out loud to the visually impaired. It also makes it possible to display information about included elements, such as an alternative text for a graphic.
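To make that a bit more concrete, here is a minimal Python sketch of the idea. It is not how a real PDF stores things: the `Tag` class and the `spoken_lines` function are made up for illustration, the cell text is a placeholder, and only the element names (H1, Table, TR, TD, Figure) are borrowed from PDF's standard structure types. It just shows how a reader can walk such a tree in reading order and fall back to the alternative text for a graphic.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Tag:
    """One node of a simplified structure tree (hypothetical, not a real PDF object)."""
    kind: str                      # e.g. "H1", "Table", "TR", "TD", "Figure"
    text: str = ""                 # text content carried by this node
    alt: Optional[str] = None      # alternative text, mainly for figures
    children: List["Tag"] = field(default_factory=list)


def spoken_lines(tag: Tag) -> Iterator[str]:
    """Yield what a screen reader might announce, in reading order."""
    if tag.kind == "Figure":
        # A figure has no readable text, so fall back to its alternative text.
        yield f"Figure: {tag.alt or 'no description available'}"
        return
    if tag.text:
        yield f"{tag.kind}: {tag.text}"
    for child in tag.children:
        yield from spoken_lines(child)


# The simplified example from above, with placeholder text in the table cell.
document = Tag("Document", children=[
    Tag("H1", text="This is a very important document"),
    Tag("Table", children=[
        Tag("TR", children=[Tag("TD", text="cell content")]),
    ]),
    Tag("Figure", alt="This image shows something beautiful"),
])

for line in spoken_lines(document):
    print(line)
```

The point is that the reader never has to guess: what is a heading and what is a figure is stated in the tree, not inferred from how the page looks.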

It also allows for things like reflow, e.g. when you display a PDF on your Kindle. The Kindle can then extract the text from the tagged PDF, reflow it according to your screen size, display it in a font you chose, and modify it in other ways suitable to the reader.
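Reflow itself is the easy part once the text is available paragraph by paragraph. A rough sketch, assuming the paragraphs have already been pulled out of the tags (the `reflow` helper is hypothetical; `textwrap.fill` is from the Python standard library, and a character count stands in for the real screen width):

```python
import textwrap
from typing import List


def reflow(paragraphs: List[str], width: int) -> str:
    """Re-wrap each paragraph to the given width, keeping paragraph breaks."""
    wrapped = []
    for paragraph in paragraphs:
        # Collapse the line breaks of the original page layout first.
        normalized = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(normalized, width=width))
    return "\n\n".join(wrapped)


paragraphs = [
    "Accessibility is becoming\nmore important nowadays.",
    "PDF is nowadays an official\nISO standard.",
]

print(reflow(paragraphs, width=20))   # narrow screen
print()
print(reflow(paragraphs, width=60))   # wide screen
```

Without tags there are no paragraphs to hand to such a function, only glyphs at positions on a page.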

Sure, this goes somewhat against the original idea of a PDF (you see what you print), but then again, originally 86-DOS was thought of as a quick hack for a computer kit, and we know how it all ended.

Now where is the problem?

LaTeX, or TeX in general, cannot generate PDF documents with tags. pdfTeX is probably the right place to target, but I do not see this happening anytime soon.

And this is a very sad state of affairs.

A lot of institutions nowadays require accessible documents for publication. There are requirements other than tagging, but tagging is the biggest obstacle.

The irony is that a LaTeX document is itself already quite structured. But translating LaTeX’s syntax constructs into tags has never been done. It’s also probably a non-trivial task, since a) there are a bazillion LaTeX packages out there which all use their own syntax constructs and b) TeX/LaTeX was never designed with a clear XML-like [tag] [/tag] structure in mind. So parsing and translating is anything but straightforward.
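A toy Python sketch shows both why this looks tempting and where it falls apart. The mapping table and the `naive_tags` function are my own invention for illustration, and `\fancyheading` is a made-up macro standing in for whatever some package defines. A real translator would have to deal with macro expansion, nested braces and environments, none of which a regular expression can handle.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical mapping from a few standard LaTeX commands to PDF structure types.
COMMAND_TO_TAG = {"section": "H1", "subsection": "H2"}

# Matches \name{argument} or \name*{argument}, without nested braces.
COMMAND = re.compile(r"\\([A-Za-z]+)\*?\{([^{}]*)\}")


def naive_tags(source: str) -> List[Tuple[Optional[str], str]]:
    """Return (tag, text) for known commands and (None, command name) otherwise."""
    result: List[Tuple[Optional[str], str]] = []
    for name, argument in COMMAND.findall(source):
        tag = COMMAND_TO_TAG.get(name)
        if tag is not None:
            result.append((tag, argument))
        else:
            # A package-specific macro: the mapping has no idea what it means.
            result.append((None, name))
    return result


source = r"""
\section{Introduction}
\subsection{Motivation}
\fancyheading{Results}
"""

for tag, text in naive_tags(source):
    print(tag, text)
```

The two standard commands map cleanly; the package macro comes back as `None`. Multiply that by a bazillion packages and you get the picture.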

And that’s just one problem. TeX was designed when things like object-oriented programming were virtually at a research stage, far from being common. Mainframes were the hot thing. And despite Knuth being a genius, look at this fscking mess that TeX is. Take your average computer science graduate from the last ten years. Do you think any of them would be remotely able to understand what is going on there?

Achim Blumensath understood this problem some 15 years ago (kinda funny to accidentally hit his name when writing up this article, as he happened to be one of the tutors of an undergrad logic course I took some 15 years ago), and wrote ANT as a TeX replacement, but the whole thing has been unmaintained since 2007. Guess he was busy kick-starting his university career, which is understandable. Sadly, like most one-man shows, that project never really took off.

My point being that if we didn’t rely on TeX itself and used ANT (or whatever alternative), which is written in the quite elegant OCaml, then hacking it would at least be possible for mere mortals. Although I have to admit, despite being in love with OCaml since my PhD days, it’s also quite a niche language. But imagine if the whole thing were written in Python, or at least C.

I haven’t looked at pdfTeX’s source, but it very much looks to me like development is not a huge common effort, but rather Hàn Thế Thành on his own, chronically overworked and left alone with this huge task. So we are stuck.

There are some folks who think that tagging with LaTeX can be done. For example the guys at CHI have some instructions for it, see here.

The problem is that they are all wrong, which is one motivation for this post.

Basically, what they suggest is to add tags after PDF generation with Acrobat Pro. Of course this doesn’t work for anything more complex than a simple single page. The detection of what is a header, what is a sub-header, what is a table, what is a table header: all this is impossible to do reliably once the PDF is generated, because you have to use heuristics, and heuristics misfire. So kudos to the Adobe guys for trying, but weirdo bastard tags won’t help; they lead to a bigger mess for screen readers than just trying to directly extract the text for reading. If you don’t believe that, just take a random paper from arXiv, run it through Acrobat Pro and add tags, and see the result.
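To illustrate why guessing structure from appearance goes wrong, here is a deliberately simple Python sketch. I have no idea what Acrobat Pro actually does internally; the `Run` class, the `guess_role` function and its thresholds are purely hypothetical. The only thing it demonstrates is that once all you have left is font size and weight, different things become indistinguishable.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """A piece of text as it appears on the page, without any structure."""
    text: str
    font_size: float
    bold: bool = False


def guess_role(run: Run, body_size: float) -> str:
    """Guess a tag from appearance alone (hypothetical heuristic)."""
    if run.font_size >= body_size * 1.4:
        return "H1"
    if run.bold or run.font_size > body_size:
        return "H2"
    return "P"


runs = [
    Run("1 Introduction", 17.0, bold=True),     # a real section heading
    Run("Accessibility matters.", 10.0),        # body text
    Run("Theorem 1.", 10.0, bold=True),         # bold label, not a heading
    Run("Name   Value", 10.0, bold=True),       # table header row, not a heading
]

# Assume the most common size is the body font.
body_size = median(run.font_size for run in runs)

for run in runs:
    print(guess_role(run, body_size), "|", run.text)
```

The theorem label and the table header both come out as sub-headers. You can keep adding rules, but every rule has its own counterexample in some paper on arXiv.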

There is Babette Schalitz’s accessibility package, where she tried to hack the PDF generation in a way that tags are generated automatically. If you take a look at her source code, you can see that this inevitably led to a complete mess (no offense here, but I claim it is simply impossible to do in a clean way on the LaTeX level instead of below, within pdfTeX). The package is unusable on modern TeX distributions and documents won’t compile because, well, the code does all sorts of nasty hacks which don’t work in current versions.

Andy Clifton tried to hack the package and fix these compilation issues, but again: run the output through Acrobat Pro’s accessibility checker, run it through PAC, or better, inspect the document manually using Acrobat Pro. The generated tag structure is completely broken. Spaces are missing, the structure is intertwined. It’s completely useless. You could as well manually add [tag] foobar [/tag] to the document. Sure, some tools like Acrobat Reader (not the Pro version) would then show “document is tagged”, but what is the point?
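The difference between "claims to be tagged" and "has a usable structure" can be sketched in a few lines of Python. The key names `/MarkInfo`, `/Marked`, `/StructTreeRoot` and `/K` are the ones the PDF specification uses in the document catalog and the structure tree; everything else is an assumption of mine: the catalog is modelled as a plain dict rather than read from a file, `/K` is always a list here, and both helper functions are hypothetical.

```python
def claims_to_be_tagged(catalog: dict) -> bool:
    """True if the catalog carries the /MarkInfo << /Marked true >> flag."""
    mark_info = catalog.get("/MarkInfo", {})
    return mark_info.get("/Marked", False) is True


def count_structure_elements(node: dict) -> int:
    """Count structure elements reachable through /K (kids) entries."""
    total = 0
    for kid in node.get("/K", []):
        if isinstance(kid, dict):       # integers stand for marked-content ids
            total += 1 + count_structure_elements(kid)
    return total


# A document that only sets the flag, with an empty structure tree.
hollow = {
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/K": []},
}

# A document with a small but real structure tree.
useful = {
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/K": [
        {"/S": "/Document", "/K": [
            {"/S": "/H1", "/K": [0]},
            {"/S": "/P", "/K": [1]},
        ]},
    ]},
}

for name, catalog in (("hollow", hollow), ("useful", useful)):
    elements = count_structure_elements(catalog["/StructTreeRoot"])
    print(name, claims_to_be_tagged(catalog), elements)
```

Both documents pass the flag check. And even the element count says nothing about whether the tree is correct, which is why there is no way around inspecting the tags by hand.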

Ross Moore wrote some papers on tagged PDFs with LaTeX by directly hacking pdfTeX, but it seems to be a one-man show and a Sisyphean task. There seems to be nothing that is remotely production ready; it’s more like super alpha-alpha stage.

ConTeXt made some efforts in that direction, but there also seem to be all sorts of minor issues, and let’s face it: ConTeXt is an unpredictable one-man show. No defined APIs, documentation is a clusterfsck of entries on wikis here and there or on mailing lists, there are constant syntax changes (especially from MkII to MkIV), examples in the wiki don’t work, there are no books, the official manual is always behind… ConTeXt is a really interesting approach and PRAGMA’s in-house tool of choice, but simply not production ready for outsiders.

And then there are numerous threads on tex.stackexchange with questions on what to do concerning accessibility and tagged PDFs, and the answer is always the same: It doesn’t work.

In some universities and government institutions it is legally mandatory to publish accessible documents, and essentially that rules out LaTeX for document creation. Did I mention that both Word and LibreOffice generate tagged PDFs? (Not perfect, but usable.)

That’s all in all a very sad state of affairs. But it kind of shows the underlying problem: from a coder’s perspective, (La)TeX is a big mess, there is incredible dirt under the carpet, and as such, the development is driven by a few folks who are overworked. Since the development of pdfTeX there have been few substantial developments in the TeX world that address the real core functionality (yes, we have a better packaging system, yes, we have TikZ & beamer now; all nice, but they’re all built on top). And syntax-wise, btw, TikZ is horrible, too.

I sometimes miss WordPerfect. WYSIWYG approach, yet there was “reveal codes”. Still a word processor, and not remotely close to the typesetting quality of TeX, but still.

So? It means I have to stick to Word & LibreOffice in my daily life.

Oh what a mess…