Nuxeo cluster shared nuxeo.tmp.dir causing problems due to nuxeo-launcher jar naming contention

According to this answer, it is a best practice for nodes in a Nuxeo cluster to share their nuxeo.tmp.dir. When doing so, must each node in the cluster have its own tmpdir on the binary store filesystem? I am encountering nuxeo-launcher jar file naming collisions that cause NFS stale file handle errors when multiple servers in a cluster share their tmpdir and I simultaneously invoke nuxeoctl operations (using Ansible) on all nodes in the cluster.

ANSWER



In cluster mode, sharing nuxeo.tmp.dir is not recommended at all: many libraries that we don't control could have a problem with it. This in turn means that you can't leverage the NXP-9361 no-copy optimizations…

On the other hand, if the only problems you have are due to nuxeo-launcher jar file naming, then we could fix this on our end and allow tmp sharing. Please open a JIRA ticket.

Edit: the simplest and surest way is probably to have a shared filesystem but make each node point its nuxeo.tmp.dir to a different subdirectory in it.
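
As a rough illustration of the per-node layout described above, the following shell sketch builds a nuxeo.conf line that points nuxeo.tmp.dir at a node-specific subdirectory of a shared mount. The mount path /mnt/nuxeo-shared/tmp, the SHARED_TMP_ROOT variable, the helper names, and the use of the host name as node identifier are all assumptions made for illustration, not Nuxeo conventions.

```shell
# Hypothetical helpers: the names and the default path are illustrative only.

# Build a node-specific directory path under a shared root.
node_tmp_dir() {
    shared_root="$1"   # directory on the shared filesystem (assumed path)
    node_id="$2"       # any identifier that is unique per cluster node
    # Strip one trailing slash so the result never contains "//".
    printf '%s/%s\n' "${shared_root%/}" "$node_id"
}

# Format the property line that would go into this node's nuxeo.conf.
nuxeo_conf_line() {
    printf 'nuxeo.tmp.dir=%s\n' "$(node_tmp_dir "$1" "$2")"
}

# Example: use the host name reported by uname as the node identifier.
nuxeo_conf_line "${SHARED_TMP_ROOT:-/mnt/nuxeo-shared/tmp}" "$(uname -n)"
```

Each node ends up with its own directory, so files written by one node cannot collide with another node's, while the temp files still live on the shared filesystem.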


rg1
Please clarify.

In cluster mode, do you recommend nuxeo.tmp.dir be set to a cluster-node-unique directory on the shared file system in order to take advantage of NXP-9361? By default, java.io.tmpdir = nuxeo.tmp.dir, right?

Or in cluster mode, should java.io.tmpdir and nuxeo.tmp.dir be set independently? NXP-9361 says java.io.tmpdir should be on the shared file system. Should it be set to a cluster-node-unique directory there and nuxeo.tmp.dir be local?

11/08/2013

If you want the benefits of NXP-9361 in cluster mode then the name collision you see has to be fixed. Given the code in nuxeoctl, the launcher in the tmp dir (which is there to allow the launcher to update itself) should be named nuxeo-launcher-$RANDOM.jar where $RANDOM is randomly generated by bash and should be collision-free (although mktemp would be better). Is that not the case for you? Please open a ticket if you have enough info for us to track this down.
11/12/2013
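
To illustrate the mktemp remark above: bash's $RANDOM only yields values from 0 to 32767, and a name derived from a random number alone does not by itself guarantee uniqueness, whereas mktemp creates the file itself under a name that did not exist before. The sketch below is not the actual nuxeoctl code. The function name and the template are hypothetical, and the .jar suffix is left out because not every mktemp implementation accepts characters after the X placeholders. How reliably exclusive creation behaves on a given NFS setup is also not verified here.

```shell
# Hypothetical helper, not taken from nuxeoctl.
# Creates an empty, uniquely named file in the given directory and
# prints its path; mktemp picks the name and creates the file itself.
unique_launcher_path() {
    tmp_dir="$1"
    mktemp "${tmp_dir%/}/nuxeo-launcher-XXXXXX"
}

# Example: two calls against the same directory yield two distinct files.
demo_dir=$(mktemp -d)
first=$(unique_launcher_path "$demo_dir")
second=$(unique_launcher_path "$demo_dir")
echo "$first"
echo "$second"
rm -rf "$demo_dir"
```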

And yes, once that bug is fixed, using a shared nuxeo.tmp.dir should be OK.
11/12/2013

rg1
As mentioned above, I'm using Ansible over an SSH connection to remotely manage the multiple Nuxeo nodes in my cluster. We regularly see nuxeo-launcher jar collisions when we remotely execute nuxeoctl commands simultaneously across all nodes in the cluster. For now, we have updated each node's nuxeo.conf to set nuxeo.tmp.dir to a unique, node-specific directory within the shared binary file system to work around this issue.

Since this configuration has the temp directory on the shared file system, I would expect the NXP-9361 optimization to be fully functional. Do you agree? In general, this seems like a safer configuration than trying to share a common nuxeo.tmp.dir across all nodes. What are your thoughts?

11/13/2013

Yes, having nuxeo.tmp.dir point to a different part of a shared filesystem on each node is a good way to solve the issue.
11/15/2013

rg1
Thanks. Given our discussion, you might consider updating your original answer since I found it a bit confusing (I would like to be able to mark it as the accepted answer). Also, if pointing nuxeo.tmp.dir to different parts of a shared filesystem depending on the node is a best practice, would it make sense to update the cluster documentation accordingly?

Even if nuxeo-launcher naming collisions were fixed, it seems risky for multiple nodes to share nuxeo.tmp.dir.

11/15/2013

Answer updated, and the doc as well (http://doc.nuxeo.com/x/3IHZ; the change will also be reported in other pages).
11/27/2013