As someone who often consults with companies that are adopting a new content management system (CMS) and want to move content over from their existing system, I find our conversations often go something like this:
CLIENT: “We would like you to migrate the pages on our site to the new site.”
CLIENT: “However, we don’t want to change the content in any way; it should be just a ‘lift and shift.’”
CLIENT: “Shouldn’t be too hard—simply move the pages from the old system to the new system.”
Having had this conversation several times, and finding “Well…” an unsatisfactory response to this request, I am laying out a series of considerations for migrating content from one CMS to another.
Anatomy of an HTML Page
- Moving from one CMS to another presents changes in URL structure.
- Links between elements on the page will change, breaking links to other pages, code, styles and assets.
- CSS files used for styling content are also located via URL; therefore, changes in URL structure separate content from its style rules.
- I would like to reiterate that this is the simplest case. Pages may display correctly when the HTML is moved from one system to another; however, the relationships between pages will be broken.
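To make the broken-link problem concrete, here is a minimal sketch of how the same relative references resolve before and after a move. The two page URLs and the three references are invented for illustration; only the resolution rule, as implemented by Python's standard `urllib.parse.urljoin`, is real.

```python
from urllib.parse import urljoin

# Hypothetical page locations before and after the migration.
OLD_PAGE_URL = "https://old.example.com/news/2019/launch.html"
NEW_PAGE_URL = "https://new.example.com/content/site/en/press/launch"

# Relative references as they might appear in the unchanged HTML.
RELATIVE_REFS = ["../../css/site.css", "images/team.jpg", "../archive.html"]


def resolve_all(page_url, refs):
    """Resolve each relative reference against the URL of the page holding it."""
    return {ref: urljoin(page_url, ref) for ref in refs}


before = resolve_all(OLD_PAGE_URL, RELATIVE_REFS)
after = resolve_all(NEW_PAGE_URL, RELATIVE_REFS)

for ref in RELATIVE_REFS:
    print(ref)
    print("  old page resolves it to:", before[ref])
    print("  new page resolves it to:", after[ref])
```

The HTML is byte-for-byte identical in both cases, yet every reference now points somewhere else, and nothing guarantees that a stylesheet or image exists at the new address.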
And the prognosis only gets darker from here.
Content Management Systems
One of the primary reasons for investing in a modern CMS is to separate content creation from the code necessary to display it correctly. In earlier systems, creating a Web page required knowledge of HTML, CSS and JavaScript (JS). This dependency lengthened the time it took to create and publish content.
Modern CMS platforms abstract the presentation of content from the content itself. Content is embedded in components, or widgets, that format the code for the page. These components can then be placed on the page using a WYSIWYG editor. Separating content from the code that displays it dramatically decreases the time necessary to author content and get it in front of its audience.
In general, the CMS provides a set of components which, when added to templates, define the overall page structure. In the end, a “page” in the CMS consists of a template and a set of components operating as a visual wrapper defined by the applied CSS and JS.
When upgrading from an old system to a new CMS, it is likely that content and presentation code are interwoven in the old system. In the worst case, the original pages are solid blocks of HTML.
There will be a temptation just to move these “chunks of HTML” over to the new system. However, we discourage this, as it carries an outdated approach into a contemporary system. A more comprehensive, and correct, solution is to deconstruct the original page and move the content from the HTML to the supported component architecture of the new CMS. Chances are the new CMS will strictly separate the component code, design elements (CSS and JS) and content. As such, the original content must be decomposed and reconstructed in the new CMS during the migration process.
We recommend the migration process begin with a content inventory. The content inventory not only provides us with a set of target pages, but also allows the client to prune the list of pages they wish to migrate.
The second element required is a matching URL reference for where each page will eventually reside in the new CMS. Even if the referential relationships between the pages remain the same, we need to fully qualify the URL of the page location. These two pieces of information provide a “from and to” map for the migration process.
The first step in the migration process is to create an empty shell page on the new platform. This “from and to” set of base pairs represents the frame on which we can hang other attributes we want to associate with pages. In most implementations, the new page assets need to be associated with tags and other pieces of content to segment them into useful groups. Other pieces of useful metadata include:
- Boolean to determine if we should migrate the page
- Template associated with the page
- Any identifiers that match the page with external applications
These attributes can be added to the existing line for each page in the “from and to” map, which is kept as a CSV file.
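As an illustration only, a map file along these lines could be read with Python's standard `csv` module. The layout below is an assumption, not the format used on any particular project: the column names `fromURL`, `migrate`, `template` and `externalId`, and all of the sample values, are hypothetical.

```python
import csv
import io

# Illustrative map file. Column names and values are assumptions made for
# this sketch; a real project would define its own.
MAP_CSV = """fromURL,toURL,migrate,template,externalId
https://old.example.com/press/2019-launch.html,/content/site/press/2019-launch,true,press-release,PR-1001
https://old.example.com/press/2017-retired.html,,false,press-release,
"""


def load_map(text):
    """Parse the map and turn the migrate column into a real boolean."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["migrate"] = row["migrate"].strip().lower() == "true"
        rows.append(row)
    return rows


pages = load_map(MAP_CSV)
to_migrate = [page for page in pages if page["migrate"]]
print(f"{len(to_migrate)} of {len(pages)} pages flagged for migration")
```

Keeping the flag in the same file as the URLs means the client can prune pages by editing one column rather than maintaining a separate list.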
On systems where the “toURL” cannot be pre-assigned, the process should perform an initial run, capture the new URL for each page and place it in the map. At this point, the new system should have a series of shell pages, and the “from and to” URLs should be known and recorded in the map file.
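The bookkeeping for that initial run might look like the sketch below. `create_shell_page` is a hypothetical stand-in for whatever page-creation call the new CMS exposes; here it only fabricates a path so the example runs on its own. The row keys other than `toURL` are likewise assumptions.

```python
import itertools

_page_numbers = itertools.count(1)


def create_shell_page(template):
    """Hypothetical stand-in for the new CMS's page-creation call.

    A real implementation would call the CMS and return the URL it assigns;
    this stub invents a path so the bookkeeping can be demonstrated.
    """
    return f"/content/site/{template}/page-{next(_page_numbers)}"


def assign_to_urls(rows):
    """Create a shell page for every row to migrate that has no toURL yet."""
    for row in rows:
        if row["migrate"] and not row["toURL"]:
            row["toURL"] = create_shell_page(row["template"])
    return rows


# Illustrative rows, shaped like lines of the map file.
rows = [
    {"fromURL": "https://old.example.com/a.html", "toURL": "",
     "migrate": True, "template": "article"},
    {"fromURL": "https://old.example.com/b.html", "toURL": "/content/site/b",
     "migrate": True, "template": "article"},
    {"fromURL": "https://old.example.com/c.html", "toURL": "",
     "migrate": False, "template": "article"},
]

assign_to_urls(rows)
for row in rows:
    print(row["fromURL"], "->", row["toURL"] or "(not migrated)")
```

Pre-assigned URLs are left untouched and pages flagged as not migrating never get a shell, so the run can be repeated without creating duplicates for rows that already have a destination.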
After creating the shell page, all additional metadata can be applied to it. You should assume other pieces of default metadata can be applied, or inherited, through the page hierarchy.
Migrating Page Content
The primary goal of content migration is to map groups of content from the original page to a component in a template within the new CMS. You can accomplish this in several ways, with some tactics being better than others.
On systems where content is separate from the visual display, the migration consists of extracting content from the backend of the current CMS and programmatically creating components within the template specified by the “from/to” mapping file. This approach is the most straightforward of all migration strategies.
However, for most migrations, the most efficient way to extract content is by interrogating the fully rendered page. We typically pull the page using the origin URL and parse it using a DOM manipulation package. HERO is partial to leveraging BeautifulSoup to pull data out of HTML and XML files.
We want to extract content from particular sections of the DOM. We take advantage of the fact that most, if not all, of the pages follow the same pattern, usually because the original CMS enforced a specific orthodoxy or because the team hand-coded pages by copying and pasting from a common pattern.
Here’s a simple example HERO performed for a client to port several hundred press releases to a new CMS:
- We examined the existing pages and found, in most cases, the content resided in a div called <bodyContent>.
- We created a file with the URLs of the existing press releases and wrote a program that read each URL and returned the DOM.
- Using JSoup, we pulled the content of the <bodyContent> div.
- We programmatically created a page in the new CMS, with a specific template type.
- Within that template, we created a TEXT component and inserted the content we pulled from the <bodyContent> div.
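The extraction step in the list above can be sketched as follows. This is not the program used on that project: Python's standard `html.parser` stands in for BeautifulSoup or JSoup so the example has no dependencies, the sample page is invented, and the sketch assumes `bodyContent` is the div's `id` attribute. Fetching the page and creating the component in the new CMS are left out.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the inner HTML of the first <div> with a given id."""

    def __init__(self, div_id):
        super().__init__(convert_charrefs=False)
        self.div_id = div_id
        self.depth = 0      # <div> nesting inside the target; 0 means outside
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif not self.done and tag == "div" and dict(attrs).get("id") == self.div_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_body(html, div_id="bodyContent"):
    """Return the markup inside the target div, or '' if it is absent."""
    parser = DivExtractor(div_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Invented sample page following the pattern described above.
SAMPLE_PAGE = """<html><body><div id="nav">Menu</div>
<div id="bodyContent"><h1>Launch</h1><div class="lead"><p>We shipped &amp; celebrated.</p></div></div>
<div id="footer">Footer</div></body></html>"""

body = extract_body(SAMPLE_PAGE)
print(body)
```

Tracking nesting depth matters because the content div usually contains other divs; stopping at the first closing tag would truncate the body. An empty result is a useful signal that a page does not follow the expected pattern and needs manual review.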
This migration worked because the body content did not contain links to other pages or binary content such as images. In cases where there are links to internal content, the body of the component must be parsed for any HREF links. These links will point to the original endpoints and need to be modified to point to new locations. We would use the from/to map for pages and another map to point to internal binary content.
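A minimal sketch of that link rewrite is shown below, assuming the from/to map has been loaded into a dictionary keyed by the fully qualified original URL. The URLs and markup are invented, and a regular expression is used only to keep the example short; a DOM parser would be the more robust choice for real content.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or href='...' and captures the quote style and the value.
HREF_RE = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_links(html, page_url, url_map):
    """Point href values at their new locations when the map knows the target."""

    def swap(match):
        prefix, quote, target = match.groups()
        # Resolve relative links against the original page before the lookup.
        absolute = urljoin(page_url, target)
        new_target = url_map.get(absolute, target)  # unknown links stay as-is
        return f"{prefix}{quote}{new_target}{quote}"

    return HREF_RE.sub(swap, html)


# Hypothetical map entry and component body.
URL_MAP = {
    "https://old.example.com/press/2018-funding.html": "/content/site/press/2018-funding",
}
ORIGINAL_PAGE = "https://old.example.com/press/2019-launch.html"
BODY = (
    '<p>See <a href="2018-funding.html">our funding news</a> '
    'and <a href="https://example.org/">a partner</a>.</p>'
)

rewritten = rewrite_links(BODY, ORIGINAL_PAGE, URL_MAP)
print(rewritten)
```

Resolving each link against the original page URL first is what lets a relative reference match an entry in the map. The same function could be run a second time with the map for binary content.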
The above approach is the minimum necessary to port content from one container to another. However, page styling must be considered. In the above example, page content was simple HTML with text content and hyperlinks.
Most likely, significant work went into styling the page content, and the HTML is wrapped in sophisticated CSS to make it conform to the design. When content is moved, several styling issues need to be considered.
- CSS applied to the original page might not work within the internal styling applied in the new container. This will force rework to recreate the page on the new system, or require the page to be redesigned.
- There might be differences in the CSS version applied to the original page and the new system.
- Content might be broken up into several different components, breaking styling continuity. We have also found that some CMSs add extra <section> tags to the DOM. This might require that the CSS style chaining applied to the original content be refactored.
Adam Trissel is SVP of Technology at HERO. He is a senior architect with deep expertise in Java-based technologies including Adobe Experience Manager. He also serves as HERO’s chief archivist of Bob Ross videos and as our curator of K-Dramas.