Greg Dawson – Adobe

AEM Infrastructure Series: Oak Clustering

Greg Dawson — Fri, 22 Apr 2016 14:25:47 +0000

In the application server world, clustering is typically implemented for redundancy and scalability, with the ultimate goal of high availability: maintaining uptime on customer-facing sites or services. It’s a common assumption that these same concepts apply to clustering on Adobe Experience Manager 6.x. Right?

Not so fast.

If you read the fine print, Adobe does not recommend clustering on AEM publish instances. Clustering on AEM introduces a new level of dependencies and complexity due to the reliance on MongoDB secondaries across data centers, and the argument is that clustering in AEM can therefore decrease reliability and performance. That direction from Adobe pretty much throws the whole high availability clustering use case we find in the application server world out the window.

Instead, Adobe recommends stand-alone TarMK farms for failover of publish instances. Farms perform better, scale easily, and because they are synced with the Author instance, are inherently fault tolerant. So where should you cluster?

Author! Author!

There are several use cases for clustering AEM Author instances, but as you may have read in my other posts, I am not a fan of clustering author instances simply for failover. That can be achieved using TarMK Cold Standbys with far less complexity and fewer moving parts. See: AEM Infrastructure Series: Disaster Recovery Basics.

For me to recommend a clustering deployment, I like to see one of these use cases:

Exceeding the authoring capacity limits of concurrent editors and contributors (the users accessing the authoring server). And (this is important) where sharding the author instance is not feasible. More about this later.
Where regional performance of the authoring instance is important. Authoring in AEM tends to be “chatty” and low latency can vastly improve the perceived performance of an author instance.
Where uptime of the authoring instance is critical, that is, for organizations that cannot survive even a few minutes of authoring downtime (once you dig deep, I’ve yet to encounter one).

Sharding?

In simple terms, sharding is a technique used to split very large databases into smaller, faster, more easily managed parts. An often overlooked option is to manually shard the authoring instances, that is, to physically split the sites into separate AEM authoring instances. These could be:

Physical sites, e.g. a primary www vs. a support or intranet site.
Portions of existing sites, e.g. localized sites where live copies are not required.
Separating global assets from the primary site authoring instance, e.g. a Global Corporate DAM.

Keep in mind that with independent author instances, each can have its own TarMK Cold Standby, be regionally located to reduce latency, and have a separate maintenance cycle. And you don’t need Mongo DBA resources.

Sharding requires careful forethought and planning: digging deep into the use cases, asking lots of questions, and making decisions. This should be done during your initial AEM stand-up, and these are the questions and advice your AEM consultants should be providing.

Performance Tips

Why performance tips? Because maximizing the performance of your authoring instances can reduce or eliminate the need for clustering based on concurrency. Remember that Oak uses memory outside the JVM heap for deserialization, which speeds up performance. Thus, where in a 5.x system we could get along with 8-12 GB of memory, in a 6.x system we recommend 64 GB of memory with 8 GB allocated to the JVM. Also:

Dedicated CPU cores can increase performance. We typically recommend 12 or 16 cores.
SSD storage for the repository folder.
If your internal infrastructure team is balking at that much SSD storage, consider using the Oak FileDataStore to move the data store (binary content) to slower magnetic media, while leaving the node store on the faster SSD media. Note: we are still waiting for Adobe’s blessing on using the newer and more efficient Oak FileBlobStore. As of this writing it’s technically available, but not supported.
Use Sling Offloading to offload high CPU jobs.

All of the above will help you extract more performance out of a single licensed AEM author instance, and thus further reduce the need for clustering due to concurrency.
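
The sizing figures above can be turned into a quick pre-flight check. What follows is only a minimal sketch in Python: the AuthorHost record and the function names are hypothetical, and the thresholds are simply the numbers quoted in this post (64 GB of memory, 8 GB of JVM heap, 12 cores, SSD for the repository), not limits enforced by AEM.

```python
from dataclasses import dataclass
from typing import List

# Baseline figures taken from the tips in this post; treat them as
# starting points for discussion, not hard limits.
RECOMMENDED_RAM_GB = 64
RECOMMENDED_HEAP_GB = 8
RECOMMENDED_MIN_CORES = 12


@dataclass
class AuthorHost:
    """Hypothetical description of one author server."""
    name: str
    ram_gb: int
    heap_gb: int
    cores: int
    repository_on_ssd: bool


def memory_outside_heap_gb(host: AuthorHost) -> int:
    """Memory left for everything outside the JVM heap (OS, off-heap use)."""
    return host.ram_gb - host.heap_gb


def sizing_warnings(host: AuthorHost) -> List[str]:
    """Return a warning for each way the host falls short of the baseline."""
    warnings = []
    if host.ram_gb < RECOMMENDED_RAM_GB:
        warnings.append(
            f"{host.name}: {host.ram_gb} GB RAM is below {RECOMMENDED_RAM_GB} GB")
    if host.heap_gb < RECOMMENDED_HEAP_GB:
        warnings.append(
            f"{host.name}: {host.heap_gb} GB heap is below {RECOMMENDED_HEAP_GB} GB")
    if host.cores < RECOMMENDED_MIN_CORES:
        warnings.append(
            f"{host.name}: {host.cores} cores is below {RECOMMENDED_MIN_CORES}")
    if not host.repository_on_ssd:
        warnings.append(f"{host.name}: repository folder is not on SSD")
    return warnings
```

Calling sizing_warnings(AuthorHost("author1", 32, 8, 8, False)) would flag the memory, the core count, and the storage, while a host that meets the baseline yields an empty list.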

AEM Infrastructure Series: Oak Clustering was first posted on April 22, 2016 at 9:25 am.
©2016 "Adobe". Use of this feed is for personal non-commercial use only. If you are not reading this article in your feed reader, then the site is guilty of copyright infringement. Please contact me at gserafini@gmail.com

AEM Infrastructure Series: Disaster Recovery Basics

Greg Dawson — Thu, 14 Apr 2016 19:12:45 +0000

When I woke up this morning and turned on the coffee maker, the buttons would not work. I tried the light switch, and the lights would not turn on. Sometime during the night my power had gone out. I realized this was the day to write about disaster recovery for Adobe Experience Manager, but in my case the blocker was not my computer (plenty of battery), or the Internet (thank you, tethering), but how to survive the next few hours without a hot cup of coffee. Here goes.

When planning an Adobe Experience Manager on-premise deployment, I always ask the question “What is your tolerance for a site outage?” And the answer is almost always “None.” This is not surprising, as I knew the answer before I asked the question.

It may seem obvious that the loss of your ecommerce site, even temporarily, can be severely detrimental to sales. But the truth is that all types of businesses suffer: visitors expect your site to always be available, and your corporate image may be tarnished when it is not. Why then are redundancy and disaster recovery not a part of all AEM stand-ups?

Cost? Most IT organizations have disaster recovery plans and initiatives in place, and when properly planned, an AEM infrastructure that is regionally fault tolerant is not necessarily an expensive proposition. Once I explain the options, it’ll be hard to say no.

Author Instances

First, let’s look at a fault tolerant authoring instance. As you’ll see in other blog posts, I’m not a fan of Oak clustering as an answer for fault tolerance, and there are very specific use cases that need to be met before I’ll bless a 6.x clustering design. Clustering increases complexity, is slower than native TarMK, and requires specialized DBAs to manage the MongoDB instances, resources that few customers have. With AEM 6.x there are other database options, but they require “Engineering / Support approval” and few, if any, customers are actually running them in production. Do you want to be first?

A better alternative is to use the TarMK Cold Standby feature, spanning the standbys across physical data centers. Sure, a Cold Standby requires a short amount of downtime because it must be manually activated, but most organizations can tolerate authoring outages for short periods of time.

Publish Instances

It’s simpler than you think. Load balanced publishers in multiple data centers protect against complete data center outages and, if licensed, can provide increased capacity. A global load balancer can route traffic among data centers by handing out IP addresses to visitors. Talk to your Adobe rep about licensing for instances that will be accepting traffic (an active/active scenario), versus instances that will only accept traffic in a disaster recovery situation (active/passive).
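
To make the active/active versus active/passive distinction concrete, here is a minimal sketch in Python of the decision a global load balancer makes. The DataCenter record and the function name are hypothetical, and a real global load balancer expresses this through its own configuration rather than code like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCenter:
    """Hypothetical view of one data center as the load balancer sees it."""
    name: str
    role: str      # "active" takes traffic normally; "passive" only during DR
    healthy: bool


def datacenters_to_serve(datacenters: List[DataCenter]) -> List[str]:
    """Pick the data centers whose addresses should be handed to visitors.

    Healthy active sites share the traffic. Passive sites are used only
    when no active site is healthy, which is the disaster recovery case.
    """
    active = [dc.name for dc in datacenters
              if dc.role == "active" and dc.healthy]
    if active:
        return active
    return [dc.name for dc in datacenters
            if dc.role == "passive" and dc.healthy]
```

With two healthy active sites, both names come back and traffic is shared. If every active site is down, only the healthy passive sites are returned.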

Configure dependent services like LDAP, SMTP and federated search with fallback when possible.
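
As an illustration of what fallback means for a dependent service, the sketch below (Python, with hypothetical function names and placeholder host names) walks an ordered list of endpoints and returns the first one that accepts a TCP connection. It only demonstrates the idea; in practice, fallback is normally configured in the LDAP, SMTP, or search client itself.

```python
import socket
from typing import Callable, List, Optional, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_reachable(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def first_reachable(
    endpoints: List[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first endpoint, in priority order, that the probe accepts."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

For example, first_reachable([("ldap1.example.com", 636), ("ldap2.example.com", 636)]) would return the secondary only when the primary cannot be reached. The probe is injectable so the ordering logic can be exercised without touching the network.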

Creating a Disaster Recovery Plan

Your organization likely already has a disaster recovery plan template created for other projects, and may require that you use it when creating your AEM disaster recovery plan. Using a common template provides documentation consistency for system administrators executing a plan in a real DR situation. Elements a disaster recovery plan should include:

Basic information around the plan purpose, scope, objectives and strategies;
Primary contacts and call lists;
Hardware and software inventories with sizing, mounts, IP, DNS names, login information, and any other special information required to understand the platform;
Infrastructure diagrams to visually show connectivity and interaction between devices, including port numbers, firewalls, IP addresses and DNS names;
Upstream dependencies, how to validate and who to contact;
Tasking orders with detailed recovery procedures, and return to normal procedures;
Detailed testing procedures and the results when initially tested (always test your plan!).

When writing a disaster recovery plan, assume the reader knows literally nothing about the platform. This helps you write and test with clarity to help ensure success during recovery.

And yes, pouring hot water through a coffee filter into a cup does render a decent cup of coffee. Thank you. Disaster averted.

AEM Infrastructure Series: Disaster Recovery Basics was first posted on April 14, 2016 at 2:12 pm.

AEM Infrastructure Series: Making the Case for Consistency

Greg Dawson — Wed, 06 Apr 2016 13:42:06 +0000

The Adobe Deploying and Maintaining AEM documents are an invaluable reference for the new or experienced systems administrator installing and configuring Adobe Experience Manager on-premise. Despite the wealth of references, I all too often see installations that were not well thought through, resulting in maintenance headaches that would have been easy to avoid but can be challenging to remedy after the fact. Common problems include:

Inconsistent installation paths, permissions, and ownership;
Inconsistent startup settings and overall configuration;
No planning or forethought for disaster recovery and expansion;
Little or no monitoring of AEM instances;
Poorly documented release processes and inconsistent access restrictions;
Lack of, or failure to adhere to, a standardized maintenance plan.

These are just a few of the challenges customers face; employee turnover and juggling multiple vendors only aggravate the problem. Deploying AEM for the long term involves careful planning, clear and repeatable documentation, and a maintenance plan that follows established best practices not just from Adobe, but from your internal IT team.

It seems obvious, right?

Not always. For organizations that are new to AEM, getting up and running on the platform in a way that adheres to best practices and promotes long-term maintainability is often overlooked. This task is sometimes given to an experienced AEM developer with little or no actual systems skills, or to an experienced system administrator with no AEM experience.

If you are embarking on a new project to stand up your AEM foundation, here are some tips:

Pre-Installation Tips

Understand best practices from your internal IT organization around separation of systems (e.g. dispatcher zones vs. author/publish instance zones, firewalls, load balancers, etc.).
Create a detailed topology diagram for all environments showing each server and device along with detailed port maps. This visually communicates infrastructure needs, shows how internal and external systems interact with one another, and identifies which firewall ports, load balancer pools, etc. need to be created.
Obtain the necessary approvals and get started on build sheets early.
Understand Disaster Recovery requirements, AEM Cold Standbys, clustering, Jackrabbit Oak Document Store configuration options, and physical sizing requirements. All of these decisions will affect infrastructure topology and build.
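
A port map does not have to live only in a diagram. As a rough sketch (Python, with a hypothetical Flow record and made-up zone names), the same information can be kept as data and used to derive the list of firewall openings to request, which are the flows that cross from one zone into another.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Flow(NamedTuple):
    """One connection from the topology diagram (all names are illustrative)."""
    source: str
    source_zone: str
    dest: str
    dest_zone: str
    port: int


def firewall_openings(flows: List[Flow]) -> Dict[Tuple[str, str], Set[int]]:
    """Group the ports that must be opened by (source zone, destination zone).

    Flows that stay inside a single zone are skipped, since they do not
    cross a firewall in this simplified model.
    """
    openings: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for flow in flows:
        if flow.source_zone != flow.dest_zone:
            openings[(flow.source_zone, flow.dest_zone)].add(flow.port)
    return dict(openings)
```

For instance, two dispatchers in a "dmz" zone talking to publishers in an "app" zone on port 4503 collapse into a single request: open 4503 from dmz to app.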

Installation Tips

Completely read Adobe’s Deploying and Maintaining AEM.
I encourage anyone who will be deploying and maintaining AEM in an enterprise environment to attend the Adobe AEM System Administration training. This training can be invaluable and offers a well-rounded view of how to deploy AEM.
Draft and fully test detailed step-by-step installation documentation locally or on test servers before finalizing.
Establish a set of configuration parameters that can be used both during the initial system stand-up and as a health check guide for the future.
Read through the Adobe Security Hardening Checklist, and maintain a list of your own hardening instructions.
Don’t always go for the latest version, unless there is a compelling feature you can’t live without. Our general rule is to wait until the first service pack is released.
Implement a change management process that includes updating all relevant documentation, including disaster recovery, service packs, hotfixes, health check and hardening guides, etc.
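
One way to make a set of configuration parameters double as a health check is to store the agreed baseline as data and compare each instance against it. The sketch below is a minimal Python illustration. The function name and the parameter keys in the example are hypothetical, and how the actual values are collected from an instance is left out.

```python
from typing import Dict, List


def config_drift(baseline: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Describe every difference between the baseline and an instance."""
    findings = []
    # Walk the union of keys so both missing and unexpected settings show up.
    for key in sorted(baseline.keys() | actual.keys()):
        if key not in actual:
            findings.append(f"{key}: missing (expected {baseline[key]!r})")
        elif key not in baseline:
            findings.append(f"{key}: unexpected setting {actual[key]!r}")
        elif baseline[key] != actual[key]:
            findings.append(
                f"{key}: expected {baseline[key]!r}, found {actual[key]!r}")
    return findings
```

Given a baseline of {"install_path": "/opt/aem", "run_user": "aem"} and an instance reporting {"install_path": "/apps/aem", "run_user": "aem"}, the function returns a single finding for install_path. An empty list means the instance matches the baseline.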

Development, Build and Release Processes Tips

It’s likely your organization has standardized release process guidelines that can be adopted for AEM, and if not you’ll need to develop them.
Make sure the guidelines include necessary security and code scans, release notes templates, rollback procedures, and any other checkpoints required before moving between environments.
Establish source control and document a branching structure early on, before the first line of code is developed.
A developer Standard Operating Procedures document should be created to accelerate developer onboarding and again provide consistency.

And if you are choosing an AEM vendor for the first time, you are now armed with the knowledge to quiz them on their maturity with AEM on-premise deployments.

AEM Infrastructure Series: Making the Case for Consistency was first posted on April 6, 2016 at 8:42 am.