Friday, October 9, 2015

ADF Session Replication on a Cluster for High Availability

Running an ADF application on a WebLogic cluster is fairly simple: deploy the application to the cluster and you're done. This gives you scalability, as the load is distributed over multiple managed servers. However, it does not give you protection against machine failure out of the box. For that, you need to make sure all state in your ADF application is replicated across multiple nodes in the cluster. This post explains the things you have to think about when setting up this session replication.

Most of this information is based on the Configuring High Availability for Fusion Web Applications chapter of the Fusion Web Applications Developer's Guide, but we've added some undocumented features and best practices based on our experience.

View

  • The first thing you need to do is tell WebLogic to replicate the HTTP session to other nodes in the cluster. This is done by setting persistent-store-type to replicated_if_clustered in your weblogic.xml file:
    <weblogic-web-app>
      <session-descriptor>
        <persistent-store-type>
          replicated_if_clustered
        </persistent-store-type>
      </session-descriptor>
    </weblogic-web-app>
    
  • During development on my local workstation I like to set the session-descriptor slightly differently:
    <weblogic-web-app>
      <session-descriptor>
        <persistent-store-type>file</persistent-store-type>
        <persistent-store-dir>c:/temp/</persistent-store-dir>
        <cache-size>0</cache-size>
      </session-descriptor>
    </weblogic-web-app>
    
    This tells WebLogic to write the HTTP session to file after each request. Since the cache-size is set to 0, it will not keep any session in memory, so each subsequent request from the same session has to restore the session from file first. Because this (de)serializes the HTTP session on every request, it behaves just like a session failover on a cluster after each request. This is a great way to test whether your application is really cluster-safe and none of the managed beans or other objects lose their state.
  • In your web.xml you have to set CHECK_FILE_MODIFICATION to false:
    <web-app>
      ...
      <context-param>
        <param-name>
          org.apache.myfaces.trinidad.CHECK_FILE_MODIFICATION
        </param-name>
        <param-value>false</param-value>
      </context-param>
    </web-app>
    
    Having this enabled can lead to errors when a failover occurs. It is best practice to disable it for a production system anyhow. If you want it enabled during local development, look into deployment plans to override such a setting per environment.
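
The file-based session store shown above exercises exactly this (de)serialization on every request. You can also smoke-test an individual bean outside the server with a plain serialization round trip; a minimal sketch, where UserPreferences is a hypothetical session-scope bean of your own:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical session-scope bean; any bean that outlives a single
// request must survive this round trip to be cluster-safe.
class UserPreferences implements java.io.Serializable {
    private static final long serialVersionUID = 1L;
    private String theme = "dark";
    private int pageSize = 25;

    String getTheme() { return theme; }
    int getPageSize() { return pageSize; }
}

public class SerializationCheck {
    // Serialize to bytes and back, mimicking what the file store
    // (or cluster replication) does with the HTTP session.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T bean) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(bean);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (Exception e) {
            // A bean failing here would also fail session replication.
            throw new IllegalStateException("bean is not cluster-safe", e);
        }
    }

    public static void main(String[] args) {
        UserPreferences restored = roundTrip(new UserPreferences());
        System.out.println(restored.getTheme() + " " + restored.getPageSize());
    }
}
```

If a bean holds a non-serializable member, this check fails immediately with a NotSerializableException instead of at the first failover in production.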

Controller

  • The next thing you have to do is tell the ADF Controller you're running in a replicated cluster so all managed beans will be replicated. You do this by setting adf-scope-ha-support to true in your adf-config.xml:
    <adf-controller-config xmlns="http://xmlns.oracle.com/adf/controller/config">
        <adf-scope-ha-support>true</adf-scope-ha-support>
    </adf-controller-config>
    
  • If your application also uses database-based MDS and that database is also a (RAC) cluster, you need to tell the ADF Controller to retry any failed database connection in case a database failover occurs. You do this in the persistence-config section of adf-config.xml where you already configured MDS:
    <persistence-config>
      <metadata-namespaces>...
      <metadata-store-usages>...
      <external-change-detection enabled="false" />
      <read-only-mode enabled="true"/>
      <retry-connection enabled="true"/>
    </persistence-config>
    
  • Managed beans that live longer than a single request need to implement Serializable so they can be transferred "over the wire" to other nodes in the cluster. This is not needed for request-scope and backingbean-scope beans, as they do not live longer than a single request and will not be replicated to other nodes in the cluster. You do need it for every viewScope, pageFlowScope and sessionScope bean.
  • According to the documentation you need to let the ADF Controller know each time a managed bean changes so it knows to replicate that bean (or rather its scope) to the other nodes in the cluster. You do this by invoking ControllerContext.markScopeDirty, passing it the viewScope or pageFlowScope map that contains the changed bean. This sounds like a useful performance optimization: only the changed scopes are replicated across the cluster. However, if you inspect the source code it doesn't really matter which scope you pass to markScopeDirty. It sets a global flag on the request that replicates all ADF Controller scopes at the end of the request, not just the ones you marked dirty. Obviously this might change in a future version, so your safest bet is to invoke markScopeDirty whenever you change a bean.
    One other caveat is that ADF (at least version 12.1.3) internally changes two internal values in a scope on every request. This means that ADF itself invokes markScopeDirty on each request, so all scopes are replicated on every request regardless of what you do. We've filed bugs for this, so this might also change in the future; you shouldn't rely on these bugs and assume that everything is replicated anyhow.
    I typically create a number of base classes that I use as superclasses for all of my viewScope and pageFlowScope beans. Then all you need to do is invoke markDirty from the superclass in each setter or other method that alters the state of the bean:
    import java.io.Serializable;
    import java.util.Map;
    import java.util.Objects;
    import oracle.adf.controller.ControllerContext;

    public abstract class ManagedBean implements Serializable {
      public ManagedBean() {
        /* Workaround for Oracle bug #18508781. When the bean is
           instantiated by the scope itself on first usage, it
           does not automatically mark the scope as dirty,
           even though the documentation states it does. */
        markDirty();
      }
      protected void markDirty() {
        ControllerContext.getInstance().markScopeDirty(getScope());
      }
      // use this in a setter to only mark dirty if a property really changes
      protected void markDirty(Object oldValue, Object newValue) {
        if (!Objects.equals(oldValue, newValue)) {
          markDirty();
        }
      }
      // should return the scope map the managed bean lives in
      protected abstract Map<String, Object> getScope();
    }
    
    public class ViewScopeBean extends ManagedBean {
      @Override
      protected Map<String, Object> getScope() {
        return ADFContext.getCurrent().getViewScope();
      }
    }
    
    public class PageFlowScopeBean extends ManagedBean {
      @Override
      protected Map<String, Object> getScope() {
        return ADFContext.getCurrent().getPageFlowScope();
      }
    }
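
To show the pattern outside a running ADF server, here is a self-contained simulation of the dirty tracking: FakeControllerContext stands in for the real ControllerContext and a plain HashMap stands in for the scope map, but the base class logic is the same as above.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Stand-in for ControllerContext.markScopeDirty: just records that
// some scope needs to be replicated at the end of the request.
class FakeControllerContext {
    static boolean scopeDirty = false;
    static void markScopeDirty(Map<String, Object> scope) { scopeDirty = true; }
}

// Same shape as the ManagedBean base class above, minus the ADF dependency.
abstract class SimulatedManagedBean implements Serializable {
    protected void markDirty() {
        FakeControllerContext.markScopeDirty(getScope());
    }
    protected void markDirty(Object oldValue, Object newValue) {
        if (!Objects.equals(oldValue, newValue)) {
            markDirty();
        }
    }
    protected abstract Map<String, Object> getScope();
}

// Hypothetical pageFlowScope bean showing the setter pattern.
public class DepartmentBean extends SimulatedManagedBean {
    private static final Map<String, Object> SCOPE = new HashMap<>();
    private String name;

    public void setName(String name) {
        markDirty(this.name, name);  // only dirties when the value really changes
        this.name = name;
    }

    @Override
    protected Map<String, Object> getScope() { return SCOPE; }
}
```

Calling setName("Sales") twice in a row marks the scope dirty only the first time, which is exactly the behaviour you want from the two-argument markDirty.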


Model

Finally, you also need to make sure the state (data) in your ADF data controls is replicated across the cluster. This is not handled by the normal state replication of the HTTP session or the ADF Controller; it is the responsibility of the data control itself.

ADF Business Components

  • For ADF Business Components you only have to set jbo.dofailover to true in the bc4j.xcfg configuration file. This is most easily done through the declarative editors in JDeveloper. Alternatively, you can set it as a system property on the command line that starts your WebLogic server: -Djbo.dofailover=true
    This ensures that ADF Business Components writes its entire state to an XML document at the end of each request so that other nodes in the cluster can read that XML and continue the session. You can configure whether this state is saved to a shared file system that all nodes in the cluster can access or to a shared database. You do this by setting jbo.passivationstore to either file or database, where database is the default. When persisting to a database it is advised to set up a dedicated JDBC data source for this using the jbo.server.internal_connection property. If you do not set one, the normal data source of your ADF BC application module will be used to save this state. Having a dedicated data source for state saving allows you to store the state in a different database schema or tune the data source for this behaviour.
    Also be sure to periodically clean up this persisted ADF Business Components state for situations where ADF BC failed to clean it up automatically. On a file system this is fairly straightforward, and for database storage Oracle supplies an adfbc_purge_statesnapshots.sql script in your oracle_common/common/sql directory.
  • When using the database as a store for ADF Business Components state and that database is itself a cluster or some other type of database that supports failover, be sure to use a GridLink data source for state passivation, as GridLink data sources can switch between nodes in a database cluster when a database node fails over.
  • The same applies for the normal data source used by ADF BC to get your business data. If it connects to a clustered database it should be using a GridLink data source to support failover.
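
For reference, the failover switch from the first bullet ends up as a property in bc4j.xcfg. A sketch of roughly what the JDeveloper editor writes; the surrounding element names and layout here are an assumption from memory and may differ per version, only the jbo.dofailover property itself is the documented setting, so prefer the declarative editor over hand-editing:

```xml
<BC4JConfig version="12.1" xmlns="http://xmlns.oracle.com/bc4j/configuration">
  <AppModuleConfigBag ApplicationName="model.AppModule">
    <AppModuleConfig name="AppModuleLocal" ApplicationName="model.AppModule">
      <!-- the setting that enables state passivation after each request -->
      <AM-Pooling jbo.dofailover="true"/>
    </AppModuleConfig>
  </AppModuleConfigBag>
</BC4JConfig>
```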

Data Controls that are not ADF BC

When you are not using ADF Business Components but another type of ADF data control, like the bean data control or the XML Data Control, things get a bit more complicated. As stated before, the data control is responsible for replicating its state across the cluster. The Oracle documentation even describes how a data control should create and restore snapshots. Unfortunately this doesn't really work in real life, at least not in version 12.1.3 and earlier (it might be fixed in future versions). The reason is that a data control can replicate its own state using this method, but your ADF Model is more than just the raw data. It also includes the state of every iterator in your page that "knows" which row in a collection is the current row. These iterators are not implemented by the data controls themselves but are supplied by the ADF framework. In fact, ADF uses an ADF Business Components wrapper around each non-BC data control to get these iterators, the so-called InternalDCAM (internal data control application module).

This made us wonder whether we could get this InternalDCAM to replicate its state across the cluster as well. Since it is just an ADF Business Components application module, we might have a chance. Because it is an internal application module there is no bc4j.xcfg configuration file, which is normally used to configure ADF Business Components. But you can also use system properties to set the defaults for ADF BC configuration, and these defaults should also apply to this internal application module. For this we added a number of system properties to the WebLogic startup script:
-Djbo.dofailover=true -Djbo.passivationstore=file -Djbo.tmpdir=C:\work\wlssessions\ -Doracle.adf.model.bean.supports-passivation=true

Here's the reasoning behind these system properties:
  • jbo.dofailover=true enables the failover behaviour of ADF BC that writes its state to XML after each request
  • we force jbo.passivationstore to file to write the session state to the file system and not to the database. Looking at oracle.adf.model.bean.DCBeanDataControl#findApplicationPool we discovered that writing to the database isn't supported for a non-BC data control anyhow, which makes sense.
  • since we write the state to the file system, we also set jbo.tmpdir to specify the directory where these files are saved
  • finally, we set oracle.adf.model.bean.supports-passivation=true to enable ADF BC state passivation for non-BC data controls. This seems to be an experimental feature, as there is no documentation about this parameter. The system property does not yet exist in 11.1.1.7, but it does in 12.1.3. It seems to be the missing link that enables state passivation for non-BC data controls.
We found that for simple data controls this actually works. Setting oracle.adf.model.bean.supports-passivation so the InternalDCAM passivates its state, combined with the data control itself saving and restoring its state, makes this work on a failing cluster. This can easily be demonstrated with the tip at the beginning of this article: use a file-based WebLogic HTTP session store without any in-memory caching. Unfortunately, with a data control that has more than one level of collections this fails with an exception. For example, you could have a bean data control or XML Data Control with a findAllDepartments() method. That method returns a collection of departments, each department has a collection of employees, and each employee has a collection of accounts. In such a scenario with nested collections, the failover of the iterators in the InternalDCAM fails with missing parent key relations and throws an ArrayIndexOutOfBoundsException. Perhaps this is the reason why oracle.adf.model.bean.supports-passivation is an undocumented feature. Let's hope Oracle will fully support this in a future release so that it also works with these more complex data controls. Until then, I guess you're out of luck replicating your ADF application state in a cluster when using anything other than ADF Business Components.
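
To make the failing shape concrete, here is a minimal sketch of the nested collections behind such a bean data control. The class and method names mirror the example above and the data is made up; a real bean data control would expose JavaBean getters, kept as plain fields here for brevity:

```java
import java.util.Arrays;
import java.util.List;

// Two levels of nesting below the top collection:
// Department -> Employee -> Account. One level passivates fine;
// the iterators over these child collections are where failover breaks.
class Account {
    final String iban;
    Account(String iban) { this.iban = iban; }
}

class Employee {
    final String name;
    final List<Account> accounts;
    Employee(String name, List<Account> accounts) {
        this.name = name;
        this.accounts = accounts;
    }
}

class Department {
    final String name;
    final List<Employee> employees;
    Department(String name, List<Employee> employees) {
        this.name = name;
        this.employees = employees;
    }
}

public class DepartmentService {
    // The method the bean data control exposes to ADF Model.
    public List<Department> findAllDepartments() {
        return Arrays.asList(
            new Department("Sales", Arrays.asList(
                new Employee("King",
                    Arrays.asList(new Account("NL00TEST0123456789"))))));
    }
}
```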

Without the oracle.adf.model.bean.supports-passivation system property things get even worse. Since the InternalDCAM does not replicate its state without this property, it resets all iterators to the first row even though the underlying data control might have replicated its state. This means a user might have been working on row 5 of a collection. When a failover occurs, the server-side iterator resets to row 1, but the user still sees row 5 in his browser. Any changes the user makes are submitted to the server and applied to row 1 without any error message or warning, leading to data corruption.

Networking

One final word about the networking architecture in front of your multi-node WebLogic cluster. Since failover of a session to a different node in the cluster is relatively expensive, we want to prevent it as much as possible and only move a session to a different node when the original node really fails. As long as the node stays up, we should do our best to route each request for the same session to the same node. This is known as session stickiness. When you use an Oracle HTTP Server in front of your WebLogic cluster this is handled automatically: Oracle HTTP Server knows about the WebLogic cluster architecture and routes each request for a session to its primary server. By the way, you might want to look into the mod_weblogic plugin parameters to fine-tune the clustering behaviour and perhaps change some of the timeout settings.

When not using an Oracle HTTP Server but, for example, a hardware load balancer that routes directly to the WebLogic cluster, you need to ensure session stickiness yourself. You have to teach your load balancer the structure of a WebLogic session cookie. The format of the session cookie is sessionid!primary_server_id!secondary_server_id. The load balancer should pick up the id of the primary server between the two exclamation points and use it to route each request to that primary server. Only when that server fails should it use the id of the secondary server and route the request there, as that server already holds the replicated session in memory. If it routed to any other node in the cluster things would still work, but with the extra delay of first fetching the state from another node. By the way, once a session has failed over to a different node, the client gets a new session cookie with new server ids that can be used to route subsequent requests.
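
The routing rule can be sketched in a few lines. This assumes the cookie value follows the sessionid!primary!secondary format described above; the server ids in the comments are made up:

```java
// Extracts the WebLogic server ids from a session cookie value of the
// form sessionid!primary_server_id!secondary_server_id, the same split
// a load balancer's routing rule has to perform.
public class SessionCookieRouter {

    static String primaryServerId(String cookieValue) {
        String[] parts = cookieValue.split("!");
        return parts.length > 1 ? parts[1] : null;
    }

    static String secondaryServerId(String cookieValue) {
        String[] parts = cookieValue.split("!");
        return parts.length > 2 ? parts[2] : null;
    }

    // Route to the primary while it is alive, otherwise to the secondary,
    // which already holds the replicated session in memory.
    static String routeTo(String cookieValue, boolean primaryAlive) {
        return primaryAlive ? primaryServerId(cookieValue)
                            : secondaryServerId(cookieValue);
    }
}
```

For a cookie value like "abc123!server1!server2", requests go to server1 until it fails, after which they go to server2.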