7 Best Practice Tips for PostgreSQL Bulk Data Loading

Posted by 耀阳居士


February 19, 2023

Sometimes, PostgreSQL databases need to import large quantities of data in a single step or a minimal number of steps. This is commonly known as bulk data import, where the data source is typically one or more large files. This process can sometimes be unacceptably slow.

There are many reasons for such poor performance: indexes, triggers, foreign keys, GUID primary keys, or even the Write Ahead Log (WAL) can all cause delays.

In this article, we will cover some best practice tips for bulk importing data into PostgreSQL databases. However, there may be situations where none of these tips will be an efficient solution. We recommend readers consider the pros and cons of any method before applying it.

Tip 1: Change Target Table to Un-logged Mode

For PostgreSQL 9.5 and above, the target table can be first altered to UNLOGGED, then altered back to LOGGED once the data is loaded:

ALTER TABLE <target table> SET UNLOGGED
<bulk data insert operations…>
ALTER TABLE <target table> SET LOGGED

The UNLOGGED mode ensures PostgreSQL is not sending table write operations to the Write Ahead Log (WAL). This can make the load process significantly faster. However, since the operations are not logged, data cannot be recovered if there is a crash or unclean server shutdown during the load. PostgreSQL will automatically truncate any unlogged table when it restarts after such a crash.

Also, unlogged tables are not replicated to standby servers. In such cases, existing replications have to be removed before the load and recreated after the load. Depending on the volume of data in the primary node and the number of standbys, the time for recreating replication may be quite long, and may not be acceptable under high-availability requirements.

We recommend the following best practices for bulk inserting data into un-logged tables:

  • Making a backup of the table and data before altering it to an un-logged mode
  • Recreating any replication to standby servers once data load is complete
  • Using un-logged bulk inserts for tables which can be easily repopulated (e.g. large lookup tables or dimension tables)
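As a sketch of the workflow above, the small helper below generates the statement sequence for an unlogged load. The table name `sales_staging` and the COPY file path are hypothetical placeholders; the statements would still have to be executed in order by a real client:

```python
def unlogged_load_statements(table, load_statement):
    """Wrap a bulk-load statement between SET UNLOGGED and SET LOGGED.

    `table` and `load_statement` are caller-supplied placeholders; the
    returned statements must be executed in order on the server.
    """
    return [
        "ALTER TABLE %s SET UNLOGGED" % table,
        load_statement,
        "ALTER TABLE %s SET LOGGED" % table,
    ]

# Hypothetical table and file path, for illustration only.
stmts = unlogged_load_statements(
    "sales_staging",
    "COPY sales_staging FROM '/tmp/sales.csv' WITH (FORMAT csv)",
)
for s in stmts:
    print(s)
```

Keeping the three statements together in one place makes it harder to forget the final SET LOGGED step, after which a fresh backup should be taken since earlier WAL records do not cover the loaded rows.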

Tip 2: Drop and Recreate Indexes

Existing indexes can cause significant delays during bulk data inserts. This is because as each row is added, the corresponding index entry has to be updated as well.

We recommend dropping indexes on the target table where possible before starting the bulk insert, and recreating them once the load is complete. Creating indexes on large tables can itself be time-consuming, but it is generally faster than updating the indexes row by row during the load.

DROP INDEX <index_name1>, <index_name2>, …, <index_name_n>
<bulk data insert operations…>
CREATE INDEX <index_name> ON <target_table> (<column1>, …, <column_n>)

It may be worthwhile to temporarily increase the maintenance_work_mem configuration parameter just before creating the indexes. The increased working memory can help create the indexes faster.

Another option to play safe is to make a copy of the target table in the same database with existing data and indexes. This newly copied table can be then tested with bulk insert for both scenarios: drop-and-recreate indexes, or dynamically updating them. The method that yields better performance can be then followed for the live table.

Tip 3: Drop and Recreate Foreign Keys

Like indexes, foreign key constraints can also impact bulk load performance. This is because each foreign key in each inserted row has to be checked for the existence of a corresponding primary key. Behind the scenes, PostgreSQL uses a trigger to perform the check. When loading a large number of rows, this trigger has to fire for each row, adding to the overhead.

Unless restricted by business rules, we recommend dropping all foreign keys from the target table, loading the data in a single transaction, then recreating the foreign keys after committing the transaction.

ALTER TABLE <target_table> 
    DROP CONSTRAINT <foreign_key_constraint>

BEGIN TRANSACTION
    <bulk data insert operations…>
COMMIT

ALTER TABLE <target_table> 
    ADD CONSTRAINT <foreign_key_constraint>
    FOREIGN KEY (<foreign_key_field>) 
    REFERENCES <parent_table>(<primary key field>)...

Once again, increasing the maintenance_work_mem configuration parameter can improve the performance of recreating foreign key constraints.

Tip 4: Disable Triggers

INSERT or DELETE triggers (if the load process also involves deleting records from the target table) can cause delays in bulk data loading. This is because each trigger will have logic that needs to be checked and operations that need to complete right after each row is INSERTed or DELETEd. 

We recommend disabling all triggers on the target table before bulk loading data and enabling them after the load is finished. Disabling ALL triggers also includes the system triggers that enforce foreign key constraint checks (disabling these internally generated triggers requires superuser privileges).

ALTER TABLE <target table> DISABLE TRIGGER ALL
<bulk data insert operations…>
ALTER TABLE <target table> ENABLE TRIGGER ALL

Tip 5: Use COPY Command

We recommend using the PostgreSQL COPY command to load data from one or more files. COPY is optimized for bulk data loads. It’s more efficient than running a large number of individual INSERT statements or even multi-valued INSERTs.

COPY <target_table> [(<column1>, …, <column_n>)]
    FROM '<file_name_and_path>'
    WITH (<option1>, <option2>, …, <option_n>)

Other benefits of using COPY include:

  • It supports both text and binary file import
  • It’s transactional in nature
  • It allows specifying the structure of the input files
  • It can conditionally load data using a WHERE clause (PostgreSQL 12 and later)
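To make the shape of a COPY call concrete, here is a small Python sketch that prepares CSV input in memory and shows the corresponding COPY statement. The table name `users_staging` and the file path are hypothetical placeholders; a real load would send the statement through a driver such as psycopg2 against a running server:

```python
import csv
import io

# Build a small CSV payload in memory (a stand-in for a real input file).
rows = [(1, "alice"), (2, "bob"), (3, "carol")]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# The COPY statement a client would run; the table name and path are
# hypothetical placeholders.
copy_sql = (
    "COPY users_staging (id, name) "
    "FROM '/data/users.csv' "
    "WITH (FORMAT csv)"
)
print(copy_sql)
print(buf.getvalue().strip())
```

Because COPY parses the whole file server-side in one command, the per-statement parse/plan overhead of INSERT disappears, which is where most of its speed advantage comes from.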

Tip 6: Use Multi-valued INSERT

Running several thousand or even several hundred thousand INSERT statements can be a poor choice for bulk data loading. That’s because each individual INSERT command has to be parsed and prepared by the query optimizer, go through all constraint checking, run as a separate transaction, and be logged in the WAL. Using a single multi-valued INSERT statement saves much of this overhead.

INSERT INTO <target_table> (<column1>, <column2>, …, <column_n>) 
VALUES 
    (<value a>, <value b>, …, <value x>),
    (<value 1>, <value 2>, …, <value n>),
    (<value A>, <value B>, …, <value Z>),
    (<value i>, <value ii>, …, <value L>),
    ...

Multi-valued INSERT performance is affected by existing indexes. We recommend dropping the indexes before running the command and recreating the indexes afterwards. 

Another area to be aware of is the amount of memory available to PostgreSQL for running multi-valued INSERTs. When a multi-valued INSERT is run, a large number of input values has to fit in the RAM, and unless there is sufficient memory available, the process may fail.

We recommend setting the effective_cache_size parameter to 50% and the shared_buffers parameter to 25% of the machine’s total RAM. Also, to be safe, we recommend running a series of multi-valued INSERTs, with each statement containing values for at most 1,000 rows.
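The batching advice above can be sketched as a small generator that emits multi-valued INSERT statements of at most 1,000 rows each. The table and column names are hypothetical, and values are rendered with repr() purely for illustration; a real loader should rely on the driver's parameter binding for quoting:

```python
def batched_inserts(table, columns, rows, batch_size=1000):
    """Yield multi-valued INSERT statements with at most batch_size rows each.

    repr() is used to render values only as an illustration; use the
    database driver's parameter binding in real code.
    """
    col_list = ", ".join(columns)
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        values = ",\n    ".join(
            "(" + ", ".join(repr(v) for v in row) + ")" for row in chunk
        )
        yield "INSERT INTO %s (%s)\nVALUES\n    %s" % (table, col_list, values)

# 2,500 hypothetical rows split into statements of 1,000 rows each.
rows = [(i, "label_%d" % i) for i in range(2500)]
stmts = list(batched_inserts("items", ("id", "label"), rows))
print(len(stmts))  # -> 3
```

Capping each statement at 1,000 rows keeps the statement text and its value list comfortably within memory while still amortizing the per-statement overhead across many rows.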

Tip 7: Run ANALYZE

This is not related to improving bulk data import performance, but we strongly recommend running the ANALYZE command on the target table immediately after the bulk import. A large number of new rows will significantly skew the data distribution in columns and will cause any existing statistics on the table to be out-of-date. When the query optimizer uses stale statistics, query performance can be unacceptably poor. Running the ANALYZE command will ensure any existing statistics are updated.

Final Thoughts

Bulk data import may not happen every day for a database application, but there’s a performance impact on queries when it runs. That’s why it’s necessary to minimize load time as best as possible. One thing DBAs can do to minimize any surprise is to test the load optimizations in a development or staging environment with similar server specifications and PostgreSQL configurations. Every data load scenario is different, and it’s best to try out each method and find the one that works.

Spark: Best practice for retrieving big data from RDD to local machine

I’ve got a big RDD (1 GB) in a YARN cluster. On the local machine that uses this cluster I have only 512 MB. I’d like to iterate over the values in the RDD on my local machine. I can’t use collect(), because it would create too big an array locally, larger than my heap. I need some iterative way. There is the method iterator(), but it requires some additional information I can’t provide.

UPD: the toLocalIterator method has been committed

toLocalIterator is not ideal if you want to iterate locally over a partition at a time – Landon Kuhn Oct 29 ‘14 at 2:25
@LandonKuhn why not? – Tom Yubing Dong Aug 4 ‘15 at 23:02

5 Answers

Update: RDD.toLocalIterator method that appeared after the original answer has been written is a more efficient way to do the job. It uses runJob to evaluate only a single partition on each step.

Still, the original answer below might give a rough idea of how it works:

First of all, get the array of partition indexes:

val parts = rdd.partitions

Then create smaller rdds filtering out everything but a single partition. Collect the data from smaller rdds and iterate over values of a single partition:

for (p <- parts) {
    val idx = p.index
    val partRdd = rdd.mapPartitionsWithIndex(
        (index, it) => if (index == idx) it else Iterator(), true)
    // The second argument is true to avoid RDD reshuffling
    val data = partRdd.collect // data contains all values from a single
                               // partition in the form of an array
    // Now you can do whatever you want with the data: iterate, save to a file, etc.
}

I didn’t try this code, but it should work. Please write a comment if it doesn’t compile. Of course, it will work only if the partitions are small enough. If they aren’t, you can always increase the number of partitions with rdd.coalesce(numParts, true).
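The partition-at-a-time idea can be illustrated without a Spark cluster: treat the dataset as a list of partitions and materialize only one of them at a time, so peak local memory is bounded by the largest partition. This is a plain-Python sketch of the concept, not Spark API code:

```python
def iterate_partitions(partitions):
    """Yield elements one partition at a time.

    `partitions` stands in for an RDD's partitions; only one partition's
    data is held locally at any moment, mimicking rdd.toLocalIterator.
    """
    for part in partitions:
        # In real Spark, this is where a single-partition job would run
        # and ship its result to the driver.
        local_data = list(part)
        for element in local_data:
            yield element

# Three hypothetical partitions of ten elements each.
parts = [range(0, 10), range(10, 20), range(20, 30)]
result = list(iterate_partitions(parts))
print(result[:5])  # -> [0, 1, 2, 3, 4]
```

The trade-off is the same as with toLocalIterator: memory stays bounded, but partitions are fetched serially, so total wall-clock time grows with the number of partitions.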

does this code cause each partition to be computed serially when it loops through and calls mapPartitionsWithIndex? What’s the best way to remedy this? – foboi1122 Nov 18 '15 at 0:42
    
@foboi1122 Please see the updated answer – Wildfire Nov 18 '15 at 8:36
    
@Wildfire Will this approach resolve this? If not, how can it be resolved with this or another approach? – ChikuMiku 2 days ago

Wildfire’s answer seems semantically correct, but I’m sure you should be able to be vastly more efficient by using the API of Spark. If you want to process each partition in turn, I don’t see why you can’t use map/filter/reduce/reduceByKey/mapPartitions operations. The only time you’d want to have everything in one place in one array is when you’re going to perform a non-monoidal operation, but that doesn’t seem to be what you want. You should be able to do something like:

rdd.mapPartitions(recordsIterator => your code that processes a single chunk)

Or this

rdd.foreachPartition(partition => {
  partition.toArray
  // Your code
})
Don’t these operators execute on the cluster? – epahomov Apr 3 '14 at 7:05
Yes it will, but why are you avoiding that? If you can process each chunk in turn, you should be able to write the code in such a way so it can distribute - like using aggregate. – samthebest Apr 3 ‘14 at 15:54
    
Isn’t the iterator returned by foreachPartition the data iterator for a single partition, and not an iterator over all partitions? – javadba May 20 at 8:23

Here is the same approach as suggested by @Wildfire, but written in pyspark.

The nice thing about this approach is that it lets the user access records in the RDD in order. I’m using this code to feed data from an RDD into the STDIN of a machine learning tool’s process.

rdd = sc.parallelize(range(100), 10)
def make_part_filter(index):
    def part_filter(split_index, iterator):
        if split_index == index:
            for el in iterator:
                yield el
    return part_filter

for part_id in range(rdd.getNumPartitions()):
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    data_from_part_rdd = part_rdd.collect()
    print("partition id: %s elements: %s" % (part_id, data_from_part_rdd))

Produces output:

partition id: 0 elements: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
partition id: 1 elements: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
partition id: 2 elements: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
partition id: 3 elements: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
partition id: 4 elements: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
partition id: 5 elements: [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
partition id: 6 elements: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
partition id: 7 elements: [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
partition id: 8 elements: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
partition id: 9 elements: [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
Map/filter/reduce using Spark and download the results later? I think the usual Hadoop approach will work.

The API says that there are map/filter/saveAsFile commands: https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations

Bad option. I don’t want to do serialization/deserialization. So I want this data retrieved from Spark – epahomov Feb 11 '14 at 10:37
    
How do you intend to get 1 GB without serde (i.e. storing on the disk) on a node with 512 MB? – scrapcodes Feb 12 '14 at 9:13
By iterating over the RDD. You should be able to get each partition in sequence to send each data item in sequence to the master, which can then pull them off the network and work on them. – interfect Feb 12 ‘14 at 18:07

For Spark 1.3.1, the format is as follows:

val parts = rdd.partitions
for (p <- parts) {
    val idx = p.index
    val partRdd = rdd.mapPartitionsWithIndex {
        case (index: Int, value: Iterator[(String, String, Float)]) =>
            if (index == idx) value else Iterator()
    }
    val dataPartitioned = partRdd.collect
    // Apply further processing on the data
}

 
