libreccm-legacy/ccm-ldn-importer/src/com/arsdigita/london/importer/package.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>com.arsdigita.london.importer</title>
</head>
<body bgcolor="white">

<p>
    Generic CMS content items importer.
</p>

<p> Importer can import content items from XML source, placing them
  at specified place in folder hierarchy.  Importer expects XML
  input in format similar to output of DomainObjectTraversal, with
  some modifications.  There is also possibility to start workflow on
  imported objects.
</p>

<p> The basic XML input file has a structure like this:
</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;import source="camden.aplaw.org.uk" xmlns="http://xmlns.redhat.com/waf/london/importer/1.0"&gt;
  &lt;folder xmlns="http://www.arsdigita.com/cms/1.0" label="Root Folder " name="/" oid="[com.arsdigita.cms.Folder:{id=556}]"&gt;
   &lt;folder label="one" name="one" oid="[com.arsdigita.cms.Folder:{id=4402}]"&gt;
     &lt;folder label="Two Test" name="two-test" oid="[com.arsdigita.cms.Folder:{id=4407}]"&gt;
        &lt;cms:item xmlns:cms="http://www.arsdigita.com/cms/1.0" oid="[com.arsdigita.cms.contenttypes.Article:{id=4412}]"&gt;
           &lt;fileAttachments oid="[com.arsdigita.cms.contentassets.FileAttachment:{id=4448}]"&gt;
              &lt;name&gt;Address.xsd&lt;/name&gt;
              &lt;content file="content-4448-Address.xsd"/&gt;
           &lt;/fileAttachments&gt;
           &lt;name&gt;eeek&lt;/name&gt;
           &lt;type oid="[com.arsdigita.cms.ContentType:{id=144}]"/&gt;
           &lt;launchDate&gt;Tue Dec 02 00:00:00 GMT 2003&lt;/launchDate&gt;
           &lt;title&gt;Eeek&lt;/title&gt;
           &lt;textAsset oid="[com.arsdigita.cms.TextAsset:{id=4429}]"&gt;
              &lt;content&gt;Edit text here&lt;/content&gt;
           &lt;/textAsset&gt;
           &lt;imageCaptions oid="[com.arsdigita.cms.ArticleImageAssociation:{id=4439}]"&gt;
              &lt;caption&gt;sdfdsfs&lt;/caption&gt;
              &lt;imageAsset oid="[com.arsdigita.cms.ReusableImageAsset:{id=1734}]"&gt;
                 &lt;name&gt;5.jpg&lt;/name&gt;
                 &lt;mimeType oid="[com.arsdigita.cms.MimeType:{mimeType=image/jpeg}]"&gt;
                    &lt;label&gt;JPG image&lt;/label&gt;
                    &lt;mimeType&gt;image/jpeg&lt;/mimeType&gt;
                    &lt;fileExtension&gt;jpg&lt;/fileExtension&gt;
                 &lt;/mimeType&gt;
                 &lt;height&gt;768&lt;/height&gt;
                 &lt;width&gt;512&lt;/width&gt;
                 &lt;content file="content-1734-5.jpg"/&gt;
              &lt;/imageAsset&gt;
           &lt;/imageCaptions&gt;
           &lt;lead&gt;sadfsd&lt;/lead&gt;
           &lt;dublinCore oid="[com.arsdigita.london.cms.dublin.DublinCoreItem:{id=4413}]"&gt;
              &lt;name&gt;eeek-dublin-metadata&lt;/name&gt;
              &lt;dcLanguage&gt;en&lt;/dcLanguage&gt;
           &lt;/dublinCore&gt;
        &lt;/cms:item&gt;
     &lt;/folder&gt;
   &lt;/folder&gt;
  &lt;/folder&gt;
&lt;/import&gt;
</pre>

<h3> The <tt>import</tt> tag </h3>

<p> This is the top-level element.  Its mandatory attribute
  <tt>source</tt> identifies the source of import data.  This is
  an arbitrary string used to track the objects already processed
  by importer.  Whenever an imported object is persisted in database,
  new {@link RemoteOidMapping} record is stored along, with the
  value of <tt>source</tt> attribute being written to <tt>system_id</tt>
  column.
</p>

<h3> The <tt>folder</tt> tag </h3>

<p> Determines the position of imported object.  The <tt>folder</tt>
  tag can be nested to an arbitrary level.  Importer checks if the
  folder with the specified <tt>name</tt> (mandatory attribute) exists.
  If not, it will be created.  Optional attribute <tt>label</tt> is used
  to provide folder's title (label).
</p>

<h3> The <tt>cms:item</tt> tag </h3>

<p> Denotes the start of a content item block.  The mandatory attribute
  <tt>oid</tt> specifies the OID of the source item, not the target one.
  Importer has no control over OIDs assigned to objects
  created during import process.
</p>

<p> However, <tt>oid</tt> is used to determine the correct object type
  for the imported object.  Moreover, if the <tt>defaultDomainClass</tt>
  element is not specified, importer will try to create instance of
  Java class specified in the <tt>oid</tt> attribute.  This must be
  held in mind when importing data from pre-Rickshaw source, or from
  any non-CCM source.  In both cases, source OIDs must be adjusted to
  match the persistence object types present in target CCM instance.
</p>

<p> Opening &lt;cms:item&gt; tag can contain several optional attributes.
  Here's what they are used for: </p>

  <ul>
    <li> <tt>author</tt>: if provided, the attribute value will be interpreted
          as email address.  If a user with that email address does not already
          exist, it will be created with some random password that can be
          reset via password recovery facility.  Workflow will be actived
          for the imported object, with all tasks being locked by said user.
     </li>
    <li> <tt>indexItem</tt>: if equals to <tt>true</tt>, the imported item
          will be set as the index item to the currently processed folder.
     </li>
    <li> <tt>relabelFolder</tt>: if equals to <tt>true</tt> and with
          <tt>indexItem</tt> set to <tt>true</tt> as well, the currently
          active folder will be relabeled after this item's title.
     </li>
  </ul>


<p> Any element underneath opening &lt;cms:item&gt; is expected to
  contain a persistence attribute.  Simple persistence properties, like
  <tt>name</tt> or <tt>title</tt>, are being simply taken from bodies of
  corresponding XML elements.  Role properties, however, represent domain
  objects and their opening tag has to include <tt>oid</tt> attribute,
  as per the reasons stated above.  Keep in mind that OID mappings of
  role properties will also be stored via RemoteOidMapping facility,
  just like the mapping of the top-level content item.
</p>

<h3> BLOB handling </h3>

<p> Persistence properties that contain byte[] values can be imported
  in two ways:
</p>

<ul>
  <li> via external file: name of the file containing raw BLOB data
      can be specified via <tt>file</tt> XML attribute.  If not provided,
      importer looks up a file named
      <tt><em>objectID</em>-<em>propertyName</em>.raw</tt>.
      In both cases the file is expected to be in the <em>lobDir</em>,
      the directory specified by one of the importer invocation
      arguments.
  </li>
  <li> inline: the BLOB value, base64 encoded,  can be specified in body of XML
      element if the <tt>encoding="base64"</tt> XML attribute is
      provided.
  </li>
</ul>

<h3> Transaction handling and the <tt>external</tt> tag</h3>

<p> When importing from XML source with no <tt>external</tt> XML tags,
  importer will open a single transaction and import all the content
  within its context.  This can cause memory problems when import set
  is too big.  Sometimes it's enough to have handful of large BLOBs
  to trigger infamous OutOfMemoryException.  The recommended approach
  in this case is to split import data into several pieces which are
  concatenated together via <tt>external</tt> XML tag:
</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;import source="camden.aplaw.org.uk" xmlns="http://xmlns.redhat.com/waf/london/importer/1.0"&gt;
&lt;folder xmlns="http://www.arsdigita.com/cms/1.0" label="Root Folder " name="/" oid="[com.arsdigita.cms.Folder:{id=556}]"&gt;
   &lt;folder label="one" name="one" oid="[com.arsdigita.cms.Folder:{id=4402}]"&gt;
      &lt;external source="one/eeek.xml"/&gt;
   &lt;/folder&gt;
   &lt;folder label="Two Test" name="two-test" oid="[com.arsdigita.cms.Folder:{id=4407}]"/&gt;
   &lt;external source="sfsdfsdfs.xml"/&gt;
   &lt;external source="mpa-test.xml"/&gt;
&lt;/folder&gt;
&lt;/import&gt;
</pre>

<p> In this case each file referenced in <tt>external</tt> tag will be
  processed in its own transaction.  Each file included in this way is
  expected to contain a single <tt>cms:item</tt> block.
</p>

<h3> Invoking from command line </h3>

<p> Importer can be invoked via standalone command-line tool like this:

<h4>With tools-ng and ecdc</h4>
<pre>
  ant -Dccm.classname=com.arsdigita.london.importer.cms.ItemImportTool \
      -Dccm.parameters="path/to/index/file /path/to/items/dir /path/to/assets/dir content" \
      ccm-run
</pre>

<code>content</code> is the content section where the imported should
be placed.

<h4>With tools-legacy</h4>
<pre>
  ccm-run com.arsdigita.london.importer.cms.ItemImportTool \
      master-import.xml /dir/with/files/to/include /dir/containing/lobs
</pre>

</body>
</html>