Speed up import

We developed our own importer tool based on the nuxeo-platform-importer package. For the first million documents, the average import speed is about 250 docs/sec. Beyond that, importing takes more and more time, and the average speed drops to 50 docs/sec or less.

We followed all the performance-related instructions for the PostgreSQL database described here. Does anyone know of further measures to speed up the import? We have to import several million documents.


ANSWER

The biggest thing I found to help with mass-import tuning was batch size, that is, the number of documents created before a commit (save) is performed. Too small, and the per-transaction overhead is large. Too big, and Postgres complains (not to mention that a hiccup risks losing all the documents in the uncommitted batch).
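To make the batch-size idea concrete, here is a minimal sketch in Python. It is not the nuxeo-platform-importer API: `import_in_batches`, `create`, and `commit` are hypothetical names standing in for whatever your importer uses to create a document and to save the session. The default of 500 is only a starting point to tune, not a recommendation.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def import_in_batches(
    items: Iterable[T],
    create: Callable[[T], None],   # hypothetical: creates one document
    commit: Callable[[], None],    # hypothetical: saves/commits the session
    batch_size: int = 500,         # assumed starting value, tune by measuring
) -> int:
    """Create every item, committing once per `batch_size` creations."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pending = 0  # documents created since the last commit
    total = 0    # documents created overall
    for item in items:
        create(item)
        pending += 1
        total += 1
        if pending >= batch_size:
            commit()
            pending = 0

    # Commit the final partial batch so nothing is left unsaved.
    if pending:
        commit()
    return total
```

Running the same import with a few different `batch_size` values and comparing docs/sec is usually the quickest way to find where the per-transaction overhead stops dominating.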

The other thing I have done in the past is write my own customer-specific batch importer, tuned for the data, document types, and infrastructure.

Of course, your physical infrastructure can be a real limiting factor. For smaller imports, solid-state drives make a world of difference! For larger imports, your disk subsystem would be a good place to look for performance improvements, along with more CPU, more RAM, further database tuning, etc.

03/16/2012



Hi

Yes, importing a few million documents is a matter of days.

A much faster way is to generate an ad hoc SQL dump and populate the database with the PostgreSQL COPY command. This is possible if the layout of the data to import is simple.
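As a rough illustration of the dump-generation step, the sketch below writes rows in PostgreSQL's default COPY text format (tab-separated columns, `\N` for NULL, backslash escapes). The function names and the two-column layout are assumptions for the example. It does not reproduce the real Nuxeo table layout, which you would have to map yourself.

```python
import io
from typing import Iterable, Optional, Sequence, TextIO


def copy_escape(value: Optional[str]) -> str:
    """Escape one field for the COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    # Escape the backslash first so later escapes are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def write_copy_rows(rows: Iterable[Sequence[Optional[str]]], out: TextIO) -> int:
    """Write one tab-separated line per row; return the row count."""
    count = 0
    for row in rows:
        out.write("\t".join(copy_escape(v) for v in row) + "\n")
        count += 1
    return count


def dump_to_string(rows: Iterable[Sequence[Optional[str]]]) -> str:
    """Convenience helper: render rows to a string instead of a file."""
    buf = io.StringIO()
    write_copy_rows(rows, buf)
    return buf.getvalue()
```

A file produced this way could then be loaded from psql with something like `\copy my_table (id, title) FROM 'docs.tsv'`, where `my_table`, its columns, and the file name are placeholders.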

ben
