Web page capture util

Mon Aug 28 19:59:40 MST 2006

This is a long road that I've been down many times. The basic question
that must be answered is how closely you want the capture to match the
real site. If the answer is "not much", then you have plenty of options.
One easy one is html2ps (apt-get has it), followed by something simple
like imagemagick to convert from ps to jpg.
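
As a rough sketch of that two-step route, the Python below chains
html2ps and ImageMagick's convert. The function names and file names
are my own placeholders, and it assumes both tools (plus Ghostscript,
which convert relies on for PostScript input) are installed and on the
PATH.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(source, jpg_path):
    """Return the two command lines: HTML -> PostScript -> JPEG."""
    # Intermediate PostScript file sits next to the requested JPEG.
    ps_path = str(Path(jpg_path).with_suffix(".ps"))
    # html2ps accepts a URL or a local file and writes to the -o target.
    to_ps = ["html2ps", "-o", ps_path, source]
    # ImageMagick picks the output format from the file extension.
    to_jpg = ["convert", ps_path, str(jpg_path)]
    return [to_ps, to_jpg]


def capture(source, jpg_path):
    """Run both steps, stopping at the first one that fails."""
    for cmd in build_commands(source, jpg_path):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        capture(sys.argv[1], sys.argv[2])
    else:
        print("usage: capture.py URL-or-file out.jpg")
```

Expect a long page to span several PostScript pages, in which case
convert will typically write a numbered series of JPEGs rather than
one file.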

The trick is when you want it to look just like the page: for example,
when you want full CSS or JS support.

Scripting firefox to grab the page and dump it to an image/printer is a
great idea, but the feature has been broken since like 1998. There are
a few very recent firefox extensions that will do it though. One is:

  http://pearlcrescent.com/products/pagesaver/

The real grail though, the one I've been wanting, is a completely
headless solution. This, I'm afraid, does not exist. Even solutions like
khtml2png:

  http://khtml2png.sourceforge.net/

aren't _really_ headless, and require an X server of some sort to attach
to (even though no window is ever shown). This has to do with the way
everybody builds rendering engines -- they use calls from X or whatever
GUI toolkit to figure out things like "how big is this string when
rendered". So no X, no answers to those layout questions, no render.

Html2ps is supposed to do this, but it doesn't handle CSS well (and has
no JS support), and the same is true of docbook. A fork of html2ps,
rewritten in PHP, is at

  http://www.tufat.com/script19.htm
  http://sourceforge.net/projects/html2ps

and is much better at CSS rendering, doesn't require an X server, and is
very very slow. Too slow. It's good for simple pages though, but it has
its own quirks, just like any browser.

Theoretically, moving firefox/mozilla to a cairo-based back-end
rendering engine will fix this. But the theory hasn't happened yet.

So what's left? Give up on the headlessness and do some sort of
scripting of firefox or a custom engine on top of gecko. Just attach it
to some sort of a dummy head, such as Xvfb. This is exactly what Josh
has done. He made a website for anyone to use, and even more recently
released the source of his custom browser engine:

  http://blog.joshuaeichorn.com/archives/2006/07/18/webthumb/
  http://blog.joshuaeichorn.com/archives/2006/08/21/webthumb-rendering-engine-released/
  http://bluga.net/webthumb/
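
For a feel of the dummy-head idea in general (this is not Josh's code,
just a minimal sketch), the Python below starts Xvfb, points a browser
at it via DISPLAY, and grabs the virtual screen with ImageMagick's
import. The display number, screen geometry, sleep times, and the
browser command line are all assumptions you would need to adjust.

```python
import os
import subprocess
import time


def xvfb_command(display=":99", geometry="1024x768x24"):
    """Command line for a virtual X server with one in-memory screen."""
    return ["Xvfb", display, "-screen", "0", geometry]


def display_env(display=":99", base=None):
    """Copy of the environment with DISPLAY pointed at the dummy head."""
    env = dict(os.environ if base is None else base)
    env["DISPLAY"] = display
    return env


def grab_command(out_path, display=":99"):
    """ImageMagick 'import' call that dumps the whole virtual screen."""
    return ["import", "-display", display, "-window", "root", out_path]


def capture(browser_cmd, out_path, display=":99", settle=10):
    """Run browser_cmd under Xvfb and screenshot it after 'settle' secs."""
    xvfb = subprocess.Popen(xvfb_command(display))
    try:
        time.sleep(1)  # crude wait for the X server to come up
        browser = subprocess.Popen(browser_cmd, env=display_env(display))
        try:
            time.sleep(settle)  # crude wait for the page to finish loading
            subprocess.run(grab_command(out_path, display), check=True)
        finally:
            browser.terminate()
    finally:
        xvfb.terminate()
```

Something like capture(["firefox", "http://example.com/"], "shot.png")
would be the intended use. Note that this only captures what fits on
the virtual screen, browser chrome included, so it is a much cruder
result than a purpose-built engine gives you.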

So there you have it. I still dream of a completely headless gecko-based
html2pdf (or html2whatever) converter though... if anyone notices one
lying around let me know. But beware of trying to do it yourself --
there be even more dragons!

--Brock

On 2006.08.28.14.42, Shawn Badger wrote:
| Does anyone know of a CLI app that can capture a web page to a jpg or
| better a pdf? I need to capture a dynamic page on daily basis and e-mail
| the captured image to various people. I have tried using wget, but it
| saves some weird results. I suspect that is because the page I am
| polling is generated with PHP.
| 
| Any ideas would be appreciated.