Import_HTML: Technical reference

To set a date on imported content:

In HTML

<meta name="created" content="2011-01-01" />
<meta name="DC.Date.Created" scheme="ISO8601" content="2011-01-01" />
or
<span id="created">2011-01-01</span>

Any date format that PHP's strtotime() can read should be accepted.

In code (HOOK_import_html)

$node->created = $timestamp;
or
$node->date = format_date($timestamp, 'custom', 'Y-m-d\TH:i:sP'); // a string

To set a teaser/description:

In HTML

<meta name="description" content="About this page" />
<meta name="DC.Description" content="About this page" />
or
<span id="description">About this page</span>

To set an owner/author/creator:

In HTML

<meta name="creator" content="William Shakespeare" />
<meta name="DC.Creator" content="William Shakespeare" />
or
<span id="author">William Shakespeare</span>
This only takes effect if a username on your site already matches the string exactly.
TODO: XFN, Microformat or SIOC support? An extension module could add that; I'll stick to DC for now.
TODO:
<meta name="author" content="willie@stratford.on.avon" />
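
As a rough illustration, the date, description and creator conventions above can sit together in one page. The sketch below writes such a page to disk for a test import. The function name write_sample_page, the output file name and the page content are hypothetical, and it assumes the meta tags belong in the document head as in ordinary HTML.

```shell
#!/bin/sh
# Sketch: write a sample page carrying the metadata conventions above.
# write_sample_page, the file name and the page content are illustrative only.

write_sample_page() {
  # $1 is the path of the HTML file to create
  cat > "$1" <<'EOF'
<html>
<head>
  <title>Sonnet 18</title>
  <meta name="created" content="2011-01-01" />
  <meta name="description" content="About this page" />
  <meta name="creator" content="William Shakespeare" />
</head>
<body>
  <p>Shall I compare thee to a summer's day?</p>
</body>
</html>
EOF
}

write_sample_page "${TMPDIR:-/tmp}/sample-import.html"
```

Remember that the creator value is only applied if a matching username already exists on the site.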

wget tips

wget --mirror --convert-links http://example.com/

Suggested syntax for a shallow mirror

Fetch the pages and directory structure, but skip large binary downloads. wget has no direct way to ask for this, so list the file suffixes to reject instead. Matching is case-sensitive by default, hence the upper-case entries.

--reject=pdf,doc,docx,ppt,pps,DOC,DOCX,rtf

Avoid utility links that look like pages

--exclude-directories=taxonomy,tag,feed,print
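
The options in this section can be combined into one command. The following is a minimal sketch, not a tested recipe for any particular site: build_mirror_cmd and DRY_RUN are hypothetical names, and the directory list is passed with wget's --exclude-directories option. By default the script only prints the command, so nothing is downloaded until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: assemble a shallow-mirror wget command from the options above.
# build_mirror_cmd and DRY_RUN are illustrative names, not part of wget.

build_mirror_cmd() {
  # $1 is the site to mirror, e.g. http://example.com/
  printf '%s\n' "wget --mirror --convert-links --reject=pdf,doc,docx,ppt,pps,DOC,DOCX,rtf --exclude-directories=taxonomy,tag,feed,print $1"
}

site="${1:-http://example.com/}"
cmd=$(build_mirror_cmd "$site")

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: show the command without touching the network.
  echo "$cmd"
else
  # Word splitting is intentional here; none of the options contain spaces.
  $cmd
fi
```

Adjust the reject and exclude lists to the site being mirrored; the values shown are just the ones suggested above.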