Drupal Module: Import_HTML

Synopsis

Facility to import an existing, static HTML site structure into Drupal Nodes.

This is done by allowing an admin to define a source directory (siteroot) of a traditional HTML website, and importing (as much as possible) the content and structure into a Drupal site.

Files will be absorbed completely, and their existing cross-links should be maintained, whilst the standard headers, chrome and navigation blocks should be stripped and replaced with Drupal equivalents. Old structure will be inferred and imported from the old folder hierarchy.

Requirements
Usage: Detailed step-by-step
Intent: What it's intended to do
- Methodology: Exactly how it does it, at a coder level
- Notes: Issues arising, and some detailed explanations
Guide: Reference section
- Setup: Requirements, and installing for the first time
- Import Templates: XSL. With great power comes great complexity
- Settings: Explanation of the user settings
Development / TODO
- Troubleshooting
  - open_basedir (security)
  - max_allowed_packet (server death)
  - Relinking
  - Duplicate content
  - Pathauto conflicts (menu not building right, and slow performance)

Requirements

Before you begin

See the setup section for details. Because of the number of settings, this is not just a point-and-go module.

PHP5. PHP4 support was dropped entirely in 2010.
XML/XSLT support on the server. Check your phpinfo().
HTMLTidy, either as the PHP module or the command-line version.
Some understanding of XSL for advanced template translation.
Some libraries of my own (bundled) to actually do the XSLT.

Usage

This module uses no database tables of its own. It requires XML support on the server, which can be tricky if it's not already enabled.

Given a working system, the process is thus:

Visit the admin/build/import_html/settings page and check the settings.
Just use the 'default' import profile tab for now. Multiple profiles are advanced options
If all values look OK for now, you can try a test run by visiting admin/build/import_html/demo . Choose a 'page' sort of page, not a portal or layout-rich sort of thing. The demo will scrape the given file and import it to the system. Some of the new navigation features will not be apparent yet, as they apply only to large-scale imports, or at least imports that have a defined source siteroot.
Try opening the 'admin/build/import_html' main page and defining a source folder. Enter the root path of the site you wish to import and continue. The UI should display a treeview of the files, from which you can select the ones to import.
It's recommended to just try one page at a time to begin with.
Note: If your server has PHP open_basedir restrictions in effect, the webserver/PHP process may be prevented from accessing files outside of webroot. See the open_basedir notes under Troubleshooting.
Upon importing a page, a new node should be created. The object of the import templates is to trim down the content block to its unique value. This will probably require some template tuning, so make a new template (copy the existing html2simplehtml.xsl), select it (enter the new name in the admin page), tweak the XSL and try again.
If you are extremely lucky, or don't care too much about the extras, you can go straight to bulk import.
If you need to check how the images are turning up, they can safely be imported as well using the previous interface. They will be copied, structured in the same folders they were in originally, into the directory configured in the admin settings. Imported pages will have their links rewritten to find them there.
Two types of content are imported, depending on file suffix: 'Pages' (html), which become nodes, and everything else, which becomes 'files'.
When you are happy that the body field is as tidy as it's going to get (test several pages), you can try a bulk import. This may fill up your node collection a bit, so be prepared to delete them if things don't work perfectly first time. Many static sites have whole sections that are not structured the same as the rest of the pages.
On import, a menu structure and a bunch of aliases will be auto-generated. These can easily be adjusted by hand. For instance, the menu branches will initially be named after the document titles found in the directory structure, which is great if you used a decent folder hierarchy, but some of the labels can probably be tidied up a bit. For that matter, after import you can safely re-arrange the menu structure altogether, shifting whole sections to different places without worrying about links breaking. These changes will show through in the menu, sitemap and breadcrumbs, but not in the path alias, which will remain old-style. There appear to be issues navigating to pages deep in a menu where the parent has not been imported or created yet. This is normal Drupal behaviour when making menu links to non-existent paths.

By following these instructions, you should be able to end up with a version of the old content in the new layout. For large sites (200+ pages) some extra tuning may be necessary, e.g. using different templates for different sources.

Incremental imports (processing just sections at a time) and repeated imports as you tune the content or the transformation should be non-destructive. Re-importing the same file will retain the same node ID, the same path, and any Drupal-specific additions made so far.

Multiple "Import Profiles" can be set up and saved alongside each other. This allows you to run side-by-side imports of different sources without over-writing the settings each time. You may find you require one profile for importing the 'product information' part of a subsite, and another for importing the 'documentation archive' subdirectory. It is mainly provided as a convenience to macro automation tools, and the normal user should just work with the 'default' profile and ignore the multi-profile feature.

Intent / Theory

This is intended as a run-once sort of tool, that, once tuned right on a handful of pages, can churn through a large number of reasonably structured, reasonably formatted pages doing a lot of the boring copy & paste that would otherwise be required.

The existing file paths of the source content will be used to create an automatic menu, and therefore a hierarchical structure identical to the source URLs. With path.module, appropriate aliases will also be created, such that a Drupal instance can TRANSPARENTLY REPLACE an existing static site without breaking any bookmarks!

Methodology Overview / Tasks

A peek under the hood into what happens in what order

We have a facility for spidering/enumerating existing source files. (the admin/build/import_html page)
Define import rules - choose an XSL stylesheet, set some parameters on it, configure presets for the imported pages.
Expose selective selection of files to import (admin UI)
Import each source file by way of these sequential steps:
1. (Optional) download/copy of files to local mirror site.
2. Processing with html-tidy, to prepare for XSL transforms
3. URL-rewriting via XSL. All hrefs are redirected to the new pseudo-location aliases, all srcs are redirected to somewhere under /files.
4. Content-scraping via XSL (XSL stylesheet will probably have to be customized to each source site)
5. Or content-scraping via RegExps and heuristics
6. Deduction (as much as possible) of meta-information like page title, author, date
7. Extra information can be added as hooks specific to core or contrib modules.
8. Validate nodes and save them with node-insert calls
9. Extra API-insert calls (eg to create menu navigation and path aliases) are also called via module-specific hooks after save (required once we know what the new node ID is).
Pages are now first-class nodes, and can be administered through the CMS as usual.

Notes

The more valid and more homogeneous the source site is, the better. A creation using strict XHTML and useful, semantic tags like #title #content or something could be imported swiftly. One with a variety of table structures may not...
Of course, this tool is supposed to be useful when dealing with messy, non-homogeneous legacy sites that need a makeover. Sometimes regular expression parsing may come to the rescue for content extraction, but that's not implemented yet.

I'm choosing XSL because I know it, it's powerful for converting content out of (well-structured) HTML, and I've had success with this approach in the past. Others may object to this abstract technology (XSL is NOT an easy learning curve) but the alternative options include RegExp weirdness or cut and paste (which I may patch on as alternative methods, or someone else can have a go). I've also used both of those approaches successfully in bulk site templating (over THOUSANDS of pages) but it's my call. Making your own XSL import template is non-trivial.

In the interests of good housekeeping, imported files with spaces in the filenames will be renamed to use underscores. Although spaces can be worked around, they just cause trouble in website URLs. Thus, references to the spaced, or %20 versions of the files may break. This rewrite can be disabled in the settings.
Filenames are assumed to be, and will remain, case-sensitive.
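
As a rough illustration of the renaming rule above, here is a minimal Python sketch. The function name and the `replace_spaces` flag are made up for this example (the flag stands in for the setting that disables the rewrite); this is not the module's own code.

```python
def sanitize_source_path(path, replace_spaces=True):
    """Return the name an imported file would be stored under.

    Hypothetical helper: spaces become underscores, case is left untouched.
    """
    if not replace_spaces:
        return path
    return path.replace(" ", "_")


renamed = sanitize_source_path("docs/Annual Report.HTML")
# An old link written with %20 no longer matches the stored name,
# which is why such references may break after import.
old_link_still_valid = ("docs/Annual%20Report.HTML" == renamed)
```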

Guide

Installation/setup

XML/XSL Support

The module uses the PHP5 implementation of XSL(T), but the PHP module does have to be enabled somehow.

If you can see the words XSL or XSLT in your phpinfo() output, you should be fine. The module will test and warn you anyway.

If not,

    $ sudo apt-get install php5-xsl

... should do it for Debian/Ubuntu servers. The Windows binary distributions I've seen come with it compiled in these days, but you may just have to uncomment the line extension=php_xsl.dll in php.ini to enable it.

HTMLTidy Setup

The module also uses the famous HTMLTidy tool. There is a PHP module that implements HTMLTidy natively, which can be installed and enabled, either at PHP build-time, or afterwards as a loaded extension.

If you don't have (sudo) access to that, we can instead run 'tidy' from the command line. Find the appropriate binary release of HTMLTidy for your system, and place it in your PATH, in the modules install directory, or wherever you like, then define the path to the executable in the settings. This works fine under Windows too.

If this sounds complicated, and you have limited access to a Unix host and need to use it, there is an auto-installer (On the settings page under HTMLTidy configuration) that can attempt to set up tidy even on a box you don't have login access to.

The preferred method is to enable the official, binary release tidy extension (not the PECL extension if you can help it). On some distros (Windows, Redhat) this is just a matter of uncommenting extension=tidy.so (extension=php_tidy.dll on Windows) in your php.ini.

Ubuntu-PHP-tidy extension

On Debian/Ubuntu, the quickest method to fetch and enable the extension is:

    $ sudo apt-get install php5-tidy

If that works, you are good to go.

On some systems, you may have to try compiling it for yourself. In Ubuntu (as of 2007) the tidy extension has been left out of the default Debian PHP package :( although it may be found in certain repositories. Official instructions are to recompile php5 from source --with-tidy, but that's a bit scary if you are used to using a package manager.
Instead, this post gives instructions on how to compile just the extension, then add it to PHP. I also had to apt-get install php5-dev to get "phpize" on a brand new clean system, and had to use ./configure --with-tidy=tidy-20051018/ instead of just ./configure

Import Templates

An import template defines the mapping between existing HTML content and our node values. It uses the XSL language because of the power it has to select bits of a structured document. For example, select="//*[@id='content']" will find the block with the id 'content' anywhere in the page, whatever type of element it is, and select="//table[@class='main']//td[3]" will locate the third TD cell (within its row) in the table with the class 'main'. Both these examples would be common when trying to extract the actual text from a legacy site.
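
To get a feel for these selectors without setting up XSL, the same two lookups can be tried in a few lines of Python. This is only a sketch: the sample page is invented, and Python's xml.etree.ElementTree supports just a subset of XPath (paths start with `.//` rather than `//`) and needs well-formed input, which is what HTMLTidy is there to provide.

```python
import xml.etree.ElementTree as ET

# Invented sample page, already tidied into well-formed XHTML.
PAGE = """<html><body>
<table class="main"><tr><td>nav</td><td>ads</td><td>Real text</td></tr></table>
<div id="content"><p>Hello</p></div>
</body></html>"""

root = ET.fromstring(PAGE)

# Any element, of any type, carrying id="content".
content = root.find(".//*[@id='content']")

# The third td (within its row) inside the table with class "main".
third_cell = root.find(".//table[@class='main']//td[3]")

print(content.tag, "|", third_cell.text)
```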

You can begin with the example XSL template. This contains code that attempts to translate a page containing the usual HTML structures like (either title or h1) and (either the div called 'content' or the entire body tag) into a standard, minimal, vanilla, semantically-tagged HTML doc.

It's likely that whatever site you are importing will NOT be shaped exactly like we need it to translate straight using this format. You have to identify the parts of your existing pages that can reliably be scanned for to define content, then come up with an XPath expression to represent this.

If your source, for example, didn't use nice H1 tags to denote the page title, but instead always looked like

<font size='+2'><B>my
  page</B></font>

... your template could be made to find it, wherever it was in the page, using select="//font[@size='+2']/B" and proceed to use that as the node title.

No, the code is not pretty, and if Regular Expressions are a foreign language to you, this is worse.
But this is why developers have been ranting for the last ten years about using semantic markup!!
The uniformity, and the usefulness of the metadata detected in the source files will play a big part here.

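It's easier to develop and test the XSLT using a third-party tool. I recommend Cooktop on Windows, or oXygen on everything else. Be sure to set the tool to an XSLT 1.0 engine matching the one PHP uses under the hood: that is libxslt (xsltproc) for the PHP5 XSL extension. Sablotron was the engine behind the old PHP4 XSLT extension.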

Although it would be possible to configure a logical mapping system to select different import templates based on different content, at this stage the administrator is expected to be doing a bit of hand-tweaking, and predicting all possible inputs is impossible. Some of this sort of logic can however be built into the XSL template itself, if you are good at XSL.

Once importing is taking place, you can even filter it more to improve the structure of the input, for example by removing all redundant FONT tags, or by ensuring that every H1,2,3 tag has an associated #ID for anchoring. Yay XSL.

Your own templates

To start with, you can use the html2simplehtml.xsl template. That contains some logic that makes generic guesses about any source structure. You are best to NOT use this as a base for modification if developing your own template, as the extra logic there may be unwanted. For a starter template, use the much simpler simplehtml2simplehtml.xsl sample instead.

Import to Taxonomy

If you have taxonomy enabled and the source is tagged, these terms can be imported. Links with a rel='tag' attribute will be taken to refer to keywords, tags or terms in your available taxonomy. Each of the following syntaxes should be equivalently detected as the term 'Interesting':

  <a href='whatever/term' rel='tag' >interesting</a>
  <link href='whatever/term' rel='tag' title='interesting' />
  <meta name="keywords" content="interesting" />

First, if a term of that name already exists in any valid vocabulary, the imported page will be tagged with it. If not, the first available freetagging vocabulary will be used to insert the tag. If no freetagging is enabled, only pre-existing terms will be used.
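
The lookup order just described can be sketched as follows. This is a plain Python illustration, not the module's code: the `vocabularies` structure is hypothetical, and the case-insensitive match is an assumption based on 'interesting' being detected as the term 'Interesting'.

```python
def resolve_term(name, vocabularies):
    """Pick (vocabulary, term) for an imported tag, or None if it is dropped.

    `vocabularies` is a hypothetical list of dicts shaped like
    {"name": str, "freetagging": bool, "terms": set of str}.
    """
    wanted = name.strip().lower()
    # 1. Prefer a term that already exists in any vocabulary.
    for vocab in vocabularies:
        for term in vocab["terms"]:
            if term.lower() == wanted:
                return vocab["name"], term
    # 2. Otherwise insert it into the first free-tagging vocabulary.
    for vocab in vocabularies:
        if vocab["freetagging"]:
            vocab["terms"].add(name.strip())
            return vocab["name"], name.strip()
    # 3. No free-tagging vocabulary available: the tag is not stored.
    return None


vocabs = [
    {"name": "Subject", "freetagging": False, "terms": {"Interesting"}},
    {"name": "Tags", "freetagging": True, "terms": set()},
]
existing = resolve_term("interesting", vocabs)
created = resolve_term("funny", vocabs)
```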

Import to Chosen Taxonomy

Although very rarely used, it's possible to specify the target vocab with syntax such as:

  <a href='whatever/term' rel='tag' >subject:interesting</a>
  <a href='places/77' rel='tag' >location:aotearoa</a>

Which will place the page as 'Interesting' in the 'Subject' vocabulary and 'Aotearoa' in the 'Location' vocabulary. Raw imports probably won't have this level of namespace, but you can use it to translate contextual information in the page into import clues using the XSL template. EG, this can be translated quite easily in XSL:

  <div class='subjects'>
    <h3>More:</h3>
    <a href='whatever/term/13' rel='tag' >Interesting</a>
    <a href='whatever/term/funny' rel='tag' >Funny</a>
  </div>
  <div class='location'>
    <h3>Places:</h3>
    <a href='places/aotearoa' rel='tag' >Aotearoa</a>
  </div>

Import to CCK

The base functionality supports placing found content into the $node->body field, not naturally into any arbitrary CCK fields, but this is also possible.

If you have a CCK node with (eg) fields:

field_text, field_byline, field_image

and your input pages are nice and semantically tagged, eg

<body>
  <h1 id='title'>the title</h1>
  <div id='image'><img src='this.gif'/></div>
  <h3 id='byline'>By me</h3>
  <div id='text'>the content html etc</div>
</body>

A mapping from HTML ids to CCK fields will be done automatically, and the content should just fall into place.

  $node->title = "the title";
  $node->field_image = "<img src='this.gif'/>";
  $node->field_byline = "By me";
  $node->field_text = "the content html etc";

(Actually, current CCK field notation internally places all field values into an array with a 'value'. This is correctly supported during import via the modules/content.inc:content_import_html() hook.)
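
To make the id-to-field idea concrete, here is a small Python sketch of the same mapping. It is an illustration under stated assumptions, not the module's implementation: `KNOWN_CCK_FIELDS`, `inner_xml` and `map_ids_to_fields` are invented names, the node is a plain dict, and the real import additionally wraps CCK values in the array form mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical content type with field_image, field_byline and field_text.
KNOWN_CCK_FIELDS = {"image", "byline", "text"}

SAMPLE = """<body>
  <h1 id='title'>the title</h1>
  <div id='image'><img src='this.gif'/></div>
  <h3 id='byline'>By me</h3>
  <div id='text'>the content html etc</div>
</body>"""


def inner_xml(elem):
    """Serialize an element's text and children, without its own tag."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


def map_ids_to_fields(xhtml):
    """Collect every element that has an id into a dict keyed by that id."""
    node = {}
    for elem in ET.fromstring(xhtml).iter():
        key = elem.get("id")
        if not key:
            continue
        # Ids matching known CCK fields get the field_ prefix; others stay as-is.
        if key in KNOWN_CCK_FIELDS:
            key = "field_" + key
        node[key] = inner_xml(elem)
    return node


node = map_ids_to_fields(SAMPLE)
```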

Import sidebar/pullquote

It's common that imported source may contain related non-body content you want to capture.

First - edit your target content type to include a multiline, multivalue text field called 'field_sidebar'

Any source data with the class or id 'sidebar' will now arrive into that textarea. The XSL may need to be adjusted, eg like so:

  <xsl:template name="sidebar" match="//*[@id='leftcol']">
    <div id="sidebar">
      <xsl:apply-templates />
    </div>
  </xsl:template>

The above snippet will find any 'leftcol' content and store it into your own 'sidebar' field. This field can then be rendered as you wish within Drupal, eg by using cck_block.module to put it BACK into a column within your theme :-)
This method can be extended a lot.

In fact, ANY element found in the source text with an ID or class gets added to the $node object during import, although most data found this way is immediately discarded again if the content type doesn't know how to serialize it. Enabling debugging will display the full node object as it exists before saving. Inspecting that data structure may help you tune a storage space in your target content type.
A special-case demonstrated here prepends field_ to known CCK field names. Normally they get labelled as-is.

If the source data is NOT tagged, you'll have to develop a bit of custom XSL to produce the same effect.

customtemplate2simplehtml.xsl

... xsl preamble ...
  <xsl:template name="html_doc" match="/">
    <html>
      <body>
        ... other extractions ...
        <h3 id="byline">
          <xsl:value-of select="./descendant::xhtml:img[2]/@alt" />
        </h3>
      </body>
    </html>
  </xsl:template>

In this example, the byline we wanted to extract was the alt value of the second image found in the page (a real-world example). This has now been extracted and wrapped in an ID-ed h3 during an early phase of the import process, and should now turn up in the CCK field_byline as desired.
XSL is complex, but magic.

Import to Other Modules

Any add-on modules can use a hook to extract their own data from the input file and add data to the node object.
The modules/ directory contains the callbacks that allow any module access to the half-cooked node object and raw source data to extract and insert/modify its own node properties before it is saved.
Contrib examples are in the import_html/modules directory. See that for the hook prototype and docs.

If, for example you were able to extract date/time information out of an import page, event.module could be told to do so, and create detailed event nodes.

Other XML!

I've successfully used it to import other random XML formats (RecipeML), although the advantages of doing so are currently limited.

If it is possible for you to create an XSL template that translates any arbitrary XML dialect into the 'simple' HTML + microformat markup used during the import phase (see examples) then your XML can be imported into Drupal nodes. One source file can produce multiple nodes. In that case the 'simple' HTML half-way phase should be an XML document containing multiple HTML elements. Wrapping them in xt:document nodes is a good idea. (Example needs to be given)

Settings

On the Administer - Site building - Import HTML screen, you can (if you wish):

Choose the import template. These templates translate between the existing page structure and the raw content blocks. XSL templates are supplied in the module's 'templates/' directory. Place your own templates there.
Use the existing examples to start from. simplehtml2simplehtml.xsl is the easiest to build on. html2simplehtml.xsl attempts to be generic, so includes a bunch of catch-all logic.
Customize a parameter used in the import template - the id of the real 'content' block of the source documents. This could be extended into a wizard to work towards an all-purpose template, but that will probably never happen. Can't predict how broken the import sites may be.
When a site is imported, it must bring along some of its baggage. Images and suchlike. You can choose where they will end up as the Extra File Storage Path.
When the imported site is given new URLs (reflecting the original path) we can publish the new nodes in a 'subsite' by applying a prefix directory to the aliases they are issued. The existing old links will be written to point to where the imported neighbouring pages are EXPECTED to end up. Incremental processing means they may not always be there until the whole site is done. Link checking (preferably on-the-fly) would be a nice tidy-up process.
The new URLs generated for the new pages are url aliases based on the original paths. You can choose to have tidy (no suffix) or legacy (old .htm or whatever suffix) aliases - or both. path.module is required for this.
As the input is (a fragment of) Pure HTML, the content filter (input format) must be set correctly. I choose to define a blank filter, which doesn't even add extra BRs, but you can override that if you wish.
If using non-native HTMLTidy support, the path to the tidy executable should be defined in the setup. PHP safe mode or settings can cripple this, so you may be out of luck on cheap hosting servers.
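
The alias options above (a subsite prefix, and tidy versus legacy suffixes) can be pictured with a short Python sketch. The function and parameter names are hypothetical and the special handling of the default document (index.htm and friends) is left out; it only shows how the aliases relate to the original path.

```python
import posixpath


def build_aliases(source_path, prefix="", style="both"):
    """Return the URL aliases a page would be issued.

    style is 'tidy' (no suffix), 'legacy' (original suffix) or 'both'.
    All names here are illustrative, not the module's settings keys.
    """
    legacy = posixpath.join(prefix, source_path)
    tidy, _suffix = posixpath.splitext(legacy)
    if style == "tidy":
        return [tidy]
    if style == "legacy":
        return [legacy]
    return [tidy, legacy]


aliases = build_aliases("about/team.htm", prefix="old")
```

With both styles enabled, an old bookmark to old/about/team.htm and a tidy link to old/about/team would both reach the same node.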

Notes on the Treeview Interface

Files and folders beginning with _ or . are nominally 'hidden' so are skipped and do not show up on this listing. While it's possible to list a thousand or so files, it may be a good idea to make the listing more selective, to scale to larger sites. Do this by entering the Subsection to list before clicking list; otherwise you will be waiting for every file on the server to be enumerated.
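
The listing rule can be sketched in Python like so. This is an assumption-laden illustration (the function name and arguments are invented), showing hidden names being skipped and an optional subsection limiting how much of the tree gets walked.

```python
import os


def list_importable(siteroot, subsection=""):
    """Yield file paths relative to siteroot, skipping '_' and '.' names."""
    start = os.path.join(siteroot, subsection)
    for dirpath, dirnames, filenames in os.walk(start):
        # Prune hidden folders in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith(("_", ".")))
        for name in sorted(filenames):
            if not name.startswith(("_", ".")):
                yield os.path.relpath(os.path.join(dirpath, name), siteroot)
```

Calling list_importable("/var/www/old_site/html", "archives/2007") would enumerate only that subsection, while the yielded paths stay relative to the siteroot.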

Development / TODO

As mentioned in Usage, this module uses no database tables of its own. Pages are read straight into 'page' nodes.

It's easy to imagine this system set up as a synchroniser that could re-fetch and refresh local nodes when remote content changes. This would involve recording exactly what the source URL was (which isn't currently done) but would be a fun feature.

I may fork off the page-parsing into a pluggable method, so that a regexp version can be developed alongside, and be used for folk without XSL support.

How to leverage this to import a local site to a remote server? You must unpack the source files somewhere on that machine, then provide the absolute path where the server can find them.

Also TODO is a 'Spidering' method to try to import URL sites. Way in the future!

TODO There are issues when a page links directly to a file that would be regarded as a resource via an href. Most hrefs are re-written to point to the new node, but things like large images or word docs get imported under 'files'. The XSL rewrite_href_and_src.xsl attempts to correct for this, but there may be some side-effects. Always run a link checker after import.

Troubleshooting

open_basedir

If your server has PHP open_basedir restrictions in effect, the webserver/PHP process may be prevented from accessing files outside of webroot. This is a good security measure, but may stop import_html from reading your source data (even though browsing the source directories may still appear to work). The open_basedir setting can be seen in your phpinfo.
An error like:

    Local file copy failed (/tmp/1fixed/simple.htm to files/imported/simple.htm)

when you are sure the source file does exist and its permissions are readable, may be symptomatic of good security on your server. A reasonable fix is to place your source data inside webroot/files (even if just temporarily) to run the import process, then delete it later. Alternatively, copy your data over top of web root (as described in walkthrough.htm) to do an in-place import. Disabling open_basedir is not recommended, and probably requires root privileges anyway. See the Drupal.org issue discussion.

max_allowed_packet

It has been found that there is a limit to how many batch operations you can queue up at one time. If you get a white screen, and an error in your log saying something like:

    Got a packet bigger than 'max_allowed_packet' bytes

it means the server is trying to do just too damn much. The list of instructions is too long to fit into one database entry! The max_allowed_packet limit can be increased in your MySQL configuration or possibly from code. import_html includes an attempt to fix this problem automatically, but it MAY NOT ALWAYS WORK on some hosts.
If your host has such a limit, you will have to take imports slowly, and only select and process a few hundred pages at a time.

Relinking

I've gone to great lengths to rewrite the links from the new node locations to relative links to the resources that moved over into /files/ but there are problems. When a/long/path/index.html links to its image by going ../../../files/a/long/path/pic.jpg it works which is good. But as a/long/path/index.html is also aliased to a/long/path - that up-and-over path is wrong now the page is being served from what looks to the browser like a different place.

I don't favour embedding anything that hard-codes the Drupal base_url, and we don't want to use HTML BASE. I want to continue to support portable subsites, so embedding site-rooted links (/files/etc) is not great either.

Currently, by happy chance, going up one ../ too far will get ignored by most browsers, so if you are not running Drupal in a subdirectory, the requests for both styles of page will just work. This means that 80% of cases should get by OK. The rest may need an output filter of some sort developed some day.
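
The effect is easy to reproduce with Python's standard URL resolver, which (from Python 3.5 on) follows the same RFC 3986 rules browsers use, including dropping surplus "../" segments at the root. The host name and the /drupal/ subdirectory below are invented for the example.

```python
from urllib.parse import urljoin

# The rewritten image link embedded in a/long/path/index.html
LINK = "../../../files/a/long/path/pic.jpg"


def resolve(page_url):
    """Resolve the embedded link the way a browser would for this page URL."""
    return urljoin(page_url, LINK)


# Drupal at the web root: the legacy URL and the tidy alias both reach
# the file, because the extra "../" from the alias is ignored at the root.
root_legacy = resolve("http://example.com/a/long/path/index.html")
root_alias = resolve("http://example.com/a/long/path")

# Drupal in a subdirectory: the alias climbs out of /drupal/ and misses.
sub_legacy = resolve("http://example.com/drupal/a/long/path/index.html")
sub_alias = resolve("http://example.com/drupal/a/long/path")

for label, url in [("root legacy", root_legacy), ("root alias", root_alias),
                   ("sub legacy", sub_legacy), ("sub alias", sub_alias)]:
    print(label, "->", url)
```

Only the last case resolves outside the Drupal directory, which is the situation that would need the output filter.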

Duplicate content

If you find that duplicate, identical menu items are created (both '/mycontent' and '/mycontent/mycontent') or that child items are created under inaccessible or non-existent parents, check the 'Default Document' in the settings. The process will interpret 'index.htm' and 'index.html' differently, and only the correct one can be used as the parent item. Enter the filename appropriate to the site you are importing from.
If you are using both interchangeably on one site .. :-{

Pathauto

import_html attempts to assign correct paths to all imported content. The menu building process actually requires that imported files be given new paths that match exactly where the content was in the original site. Pathauto can conflict with that and ruin things.
If you are using pathauto, it should be disabled during imports, or it will come up with its own paths and rename nodes as we import content, meaning the importer will immediately lose track of new nodes, and the menu builder will fail. It may be possible to set pathauto to only add to (and not remove) existing aliases. This means that the import process will work OK, but your pathauto rules will add unnecessary and probably inaccurate path aliases to all imported content.
Also, pathauto has been profiled to be incredibly inefficient when operating over a large number of items, so it's best to turn it off when using import_html. You can turn it on again later.

Glossary / Terminology

Source Siteroot

The root of the input site. This may be a folder on the local filesystem, possibly beginning with "/". Imported files will calculate their links relative to there. If the source files have all their links truly relative (no links starting with '/' or 'http') then links will be rewritten as normal. If a link is found that starts with '/' or 'http://site.name/' (root-relative) then this will be converted to behave as if it were starting from the Source Siteroot.
EG: /var/www/old_site/html/

Source Subsection

A folder within Source Siteroot to process. Links will still be recalculated relative to the Source Siteroot, but only the subsection will be displayed for selection. Use this to manage large numbers of files that may take a while to prepare. EG: archives/2007

Link Rewriting

Links will be re-written according to the rules in the settings. No actual link-checking is done. The links are rewritten to the location the other files are expected to end up, so if you are processing subsections or selections from the tree, they may point to places that don't exist yet.
This means that imports are 'atomic' and it is safe to import or re-import pages individually without knowing what has happened before or will happen.
You are advised to run a link-checker afterwards.

There is provision for pages and resources to be placed in different places in the site. In general, pages will be accessed under the normal site root, while Drupal conventions place images and documents into a /sites/site.name/files folder. Most legacy sites have images in /images etc. This needs to become /sites/site.name/files/images within Drupal. The rewriting does this by detecting the suffix of the file being linked to in href= or src= attributes. Any file type that is not in the list of known HTML file types is regarded as a 'resource' and rewritten to appear under the 'files' directory. The $import_html_file_classes array is currently hard-coded in the module. File suffixes are not really good enough for this. Should the suffix list be editable, or should I scan the files themselves?
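
A suffix-based classifier of the kind described might look like the Python sketch below. The suffix list, the 'files' prefix and the function name are all assumptions made for illustration; they are not the contents of $import_html_file_classes, and the real rewriting happens in XSL.

```python
import posixpath
from urllib.parse import urlsplit

PAGE_SUFFIXES = {".htm", ".html", ".shtml", ""}  # assumed, not the module's list
FILES_PREFIX = "files"                           # assumed storage path


def rewrite_link(href):
    """Rewrite a siteroot-relative link according to its target's suffix."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return href  # external or mailto link: left alone in this sketch
    path = parts.path.lstrip("/")
    suffix = posixpath.splitext(path)[1].lower()
    # Pages stay under the site root, everything else moves under files/.
    if suffix not in PAGE_SUFFIXES:
        path = posixpath.join(FILES_PREFIX, path)
    return path + ("#" + parts.fragment if parts.fragment else "")


examples = {
    href: rewrite_link(href)
    for href in ("images/logo.gif", "about/team.html#staff", "/docs/spec.doc")
}
```

A Word document linked from a page is thus treated as a resource, which is the case flagged in the TODO above as having possible side-effects.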

TODO Hard http://old.site/ links-to-self may need to be removed using a different process.

Alternative Profiles

Most users will not have a use for this in the space of one project, but it is possible to save several different import profiles (sets of settings) through the UI, and switch between them as you go.

This feature is intended to be used by other automation processes, or for large projects where you want to import all the 'gallery' pages with one set of rules, and the 'product' pages with another set.

Import HTML profiles are exportable and importable using the 'Features' module, so they provide a way to bundle your settings.

It's probably not worth playing with if you just want one site imported.

Reference

Long ago, I started building this with reference to the existing import/export module but I couldn't find too many common features. The transitional format the XSL templates convert into is a 'microformat' of XHTML (basically XHTML, but with strictly controlled classes and IDs). This is how I see a platform-agnostic dump of content should be exported, when this eventually morphs into import_export_HTML.