Batch Processing

To load data into PostgreSQL, pgloader uses the COPY streaming protocol. While this is the fastest way to load data, COPY has an important drawback: as soon as PostgreSQL raises an error about any bit of data sent to it, whatever the problem is, the whole data set is rejected by PostgreSQL.

To work around that, pgloader cuts the data into batches of 25000 rows each, so that when a problem occurs it only impacts that many rows of data. Each batch is kept in memory while the COPY streaming happens, so that errors can be handled should any happen.
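Batch behavior can also be tuned from the load command itself. The following is only a sketch: the file path, connection string and target table are placeholders, and the batch rows option name is an assumption to check against your pgloader version's reference documentation.

    LOAD CSV
         FROM '/path/to/data.csv' (a, b, c)
         INTO postgresql:///target?tablename=public.sample (a, b, c)
         WITH fields terminated by ',',
              batch rows = 25000;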

When PostgreSQL rejects the whole batch, pgloader logs the error message then isolates the bad row(s) from the accepted ones by retrying the batched rows in smaller batches. To do that, pgloader parses the CONTEXT error message from the failed COPY, as the message contains the line number where the error was found in the batch, as in the following example:

    CONTEXT: COPY errors, line 3, column b: "2006-13-11"

Using that information, pgloader will reload all rows in the batch before the erroneous one, log the erroneous one as rejected, then try loading the remainder of the batch in a single attempt, which may or may not contain other erroneous data.

At the end of a load containing rejected rows, you will find two files in the root-dir location, in a directory named after the target database of your setup. The files are named after the target table; their extensions are .dat for the rejected data and .log for the full PostgreSQL client-side logs about the rejected data.

The .dat file is formatted in the PostgreSQL text COPY format as documented at http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609.
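For example, assuming the default root-dir of /tmp/pgloader (an assumption to verify against your installation), a load into table sample of database target that rejects rows would leave files such as:

    /tmp/pgloader/target/sample.dat
    /tmp/pgloader/target/sample.log

with the rejected rows in the .dat file and the matching client-side logs in the .log file, as described above.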

It is possible to use the following WITH options to control pgloader batch behavior:

  • on error stop, on error resume next

    This option controls whether pgloader uses its batch-based error handling at all. The batch implementation allows pgloader to recover from errors by resending the data that PostgreSQL accepts and by setting aside the data that PostgreSQL rejects.

    To enable retrying the data and loading the good parts, use the option on error resume next, which is the default for file-based data loads (such as CSV, IXF or DBF).

    When migrating from another RDBMS technology, it’s best to have a reproducible loading process. In that case it’s possible to use on error stop and fix either the casting rules, the data transformation functions or in some cases the input data until your migration runs to completion. That’s why on error stop is the default for SQLite, MySQL and MS SQL source kinds (see the sketch after this list).
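As a sketch of the strict setting in a migration command, with placeholder connection strings:

    LOAD DATABASE
         FROM mysql://user@localhost/sourcedb
         INTO postgresql:///targetdb
         WITH on error stop;

Replacing on error stop with on error resume next switches back to the batch-level error recovery described above.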

A Note About Performance

pgloader has been developed with performance in mind, to be able to cope with ever-growing needs in loading large amounts of data into PostgreSQL.

The basic architecture it uses is the old Unix pipe model, where one thread is responsible for loading the data (reading a CSV file, querying MySQL, etc.) and fills a queue with pre-processed data. Another thread feeds from the queue, applies some more transformations to the input data and streams the end result to PostgreSQL using the COPY protocol.

When given a file that the PostgreSQL COPY command knows how to parse and that contains no erroneous data, pgloader will never be as fast as just using the PostgreSQL COPY command.

Note that while the COPY command is restricted to reading either from its standard input or from a local file on the server’s file system, the command line tool psql implements a \copy command that knows how to stream a file local to the client over the network and into the PostgreSQL server, using the same COPY protocol that pgloader uses.

A Note About Parallelism

pgloader uses several concurrent tasks to process the data being loaded:

  • a reader task reads the data in and pushes it to a queue,

  • at least one writer task feeds from the queue, formats the raw data into the PostgreSQL COPY format in batches (so that it’s possible to retry a failed batch without reading the data from the source again), and then sends the data to PostgreSQL using the COPY protocol.

The workers parameter controls how many worker threads may be active at any time (that’s the parallelism level); the concurrency parameter controls how many tasks are started to handle the data (they may not all run at the same time, depending on the workers setting).

In the context of a single table, at most workers workers are allowed to be active at the same time. A single unit of work consists of several kinds of workers:

  • a reader getting raw data from the source,

  • N writers preparing and sending the data down to PostgreSQL.

The N here is set by the concurrency parameter: with a concurrency of 2 we start (+ 1 2) = 3 concurrent tasks, with a concurrency of 4 we start (+ 1 4) = 5 concurrent tasks, of which only workers may be active simultaneously.

The defaults are workers = 4, concurrency = 1 when loading from a database source, and workers = 8, concurrency = 2 when loading from something else (currently, a file). Those defaults are arbitrary and awaiting feedback from users, so please consider providing feedback if you play with the settings.
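To spell those settings out, both parameters go in the WITH clause. A sketch with placeholder connection strings, restating the database-source defaults:

    LOAD DATABASE
         FROM mysql://user@localhost/sourcedb
         INTO postgresql:///targetdb
         WITH workers = 4, concurrency = 1;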

As the CREATE INDEX threads started by pgloader only wait until PostgreSQL is done with the real work, those threads are NOT counted towards the concurrency levels detailed here.

By default, pgloader starts as many CREATE INDEX threads as the maximum number of indexes per table found in your source schema. It is possible to set the max parallel create index WITH option to another number in case there are just too many of them to create.
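As a sketch, with placeholder connection strings, capping the number of parallel index builds could look like:

    LOAD DATABASE
         FROM mysql://user@localhost/sourcedb
         INTO postgresql:///targetdb
         WITH max parallel create index = 4;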