[Python] psycopg2 모듈을 사용한 효율적인 PostgreSQL bulk insert

언어/파이썬 & 장고 2017. 2. 16. 20:06

PostgreSQL은 데이터를 관리, 구성, 질의 및 검색하는 데 매우 뛰어나지만 Insert 자체가 매우 느릴 수 있습니다. PostgreSQL에서 가장 빠른 Insert 방법은 COPY문을 사용하는 것입니다. 응용 프로그램이 PostgreSQL을 사용할 수 있는 권한이 있다고 해도 소프트웨어 구성 요소 간에 엄격한 기능 분리를 유지하는 관점에 있어서는 COPY문을 사용하는 것을 권하지 않습니다. 다음은 COPY문을 제외한 100,000개의 row들을 insert할 때 찾은 효율적인 방법을 소개합니다.

Test 구성

테스트는 아래의 테이블에 insert하는 데 걸리는 시간을 기준으로 합니다.

CREATE TABLE upload_time_test(
        uuid uuid primary key default uuid_generate_v4(),
        created timestamp with time zone not null default now(),
        text text not null,
        properties hstore not null default ''::hstore
    );

    GRANT ALL ON upload_time_test TO test;

다음은 psycopg2의 연결을 담당하는 함수 부분입니다.

def connect():
    connection= psycopg2.connect(host=HOST,database=DATABASE,user=USER,password=PASSWORD)
    psycopg2.extras.register_hstore(connection)
    return connection
def execute(sql,params={}):
    with connect() as connection:
        with connection.cursor() as cursor:
            cursor.execute(sql,params)

1000개 row insert

아래 Tester 클래스는 인스턴스화 할 때마다 샘플 테이블을 파괴하고 다시 만들면서 진행합니다. 데이터베이스에 row를 insert하는 세 가지 방법을 설명합니다. 각 기능은 서로 다른 기술을 기반으로 설명되어 있습니다.

slowInsert()는 각 행에 대해 새 데이터베이스 연결을 만들기 때문에 가장 느립니다.

insert()는 보통 사용하는 방법입니다. 하나의 연결을 만들고 각 연결에 다시 연결하는 방식입니다. 이 방법은 보통 psycopg2의 executemany()가 수행하는 작업입니다.

fastInsert()는 새로운 방법으로 unnest()를 사용하여 psycopg2를 통해 전달된 배열 집합을 푸는 방식입니다.

class Tester():
    SINGLE_INSERT = """
        INSERT INTO upload_time_test(text,properties)
         VALUES (%(text)s, %(properties)s)
        """
    def __init__(self, count):
        execute(SETUP_SQL)
        self.count = count
        self.data = [
            {
                'text': 'Some text',
                'properties': {"key": "value"},
            }
            for i in range(count)
        ]
    def slowInsert(self):
        '''
            Creates a new connection for each insertion
        '''
        for row in self.data:
            text = row['text']
            properties = row['properties']
            execute(SINGLE_INSERT, locals())
    def insert(self):
        '''
            One connection.
            Multiple queries.
        '''
        with connect() as connection:
            with connection.cursor() as cursor:
                for row in self.data:
                    text = row['text']
                    properties = row['properties']
                    cursor.execute(SINGLE_INSERT, locals())
    def fastInsert(self):
        '''
            One connection, one query.
        '''
        sql = '''
            INSERT INTO upload_time_test(text,properties)
              SELECT unnest(ARRAY '%(texts)s' ) ,
                     unnest(ARRAY '%(properties)s')
        '''
        texts = [r['text'] for r in self.data]
        properties = [r['properties'] for r in self.data]
        execute(sql, locals())

tester = Tester(1000)
    with timer('slow'):
        tester.slowInsert()
    with timer('normal'):
        tester.insert()
    with timer('fast'):
        tester.fastInsert()
 
# 결과
# slow: 7.160489320755005 second(s)
# normal: 0.1441025733947754 second(s)
# fast: 0.042119503021240234 second(s)

위 테스트는 1000개의 데이터를 insert한다고 가정하고 진행했습니다.

연결을 매 insert마다 새로이 하는 것보다 하나를 연결한 다음 재사용할 시 50배정도 빠른 것을 확인할 수 있습니다. 그리고 unnest를 사용하면 이보다 3배정도 빠른 것을 확인할 수 있습니다.

100,000개 row insert

slowInsert()는 매우 느리다는 것을 확인되었으니 insert()와 fastInsert()만 확인하도록 합니다.

ester=Tester(count=100000)
with timer('normal'):
    tester.insert()

tester=Tester(count=100000)
with timer('fast'):
    tester.fastInsert()
 
# 결과
# normal: 14.866096019744873 second(s)
# fast: 3.9566986560821533 second(s)

100,000개의 row를 insert할 때, unnest를 사용한 fastInsert()가 기본 insert()보다 4배정도 빠른 것을 확인이 가능합니다.

아래는 삽입개수와 삽입 속도를 나타내는 표입니다. (count / 총 소요시간)

count	bulk	normal
50	4485.694730	3867.856879
100	10159.389609	4847.897547
200	15212.186276	6057.106548
500	27340.842720	7081.049689
1000	33248.545382	7694.657609
2000	35640.695767	7070.777670
5000	41223.200473	8027.790910
10000	40948.723106	7785.005392
20000	42604.387914	7568.314015
50000	40795.233470	7291.552509
100000	27014.354119	6872.935483

결론

동시에 다수의 row들을 insert하는데 unnest를 사용하면 다음과 같은 장점이 존재합니다.

보통 수천 개의 행을 insert 할 때 일반 insert 반복문보다 상당히 빠름
unnest() 사용의 이점은 적어도 최대 50,000 개의 행을 증가시킴
문자열 연결없이 (합리적으로) 직접 매개 변수화된 SQL을 작성할 수 있습니다. (리스트를 그대로 사용 가능)
원격 데이터베이스에서 이 unnest를 시도하면, 네트워크를 통해 전송되는 데이터의 양이 현저히 줄어듬

ABOUT ME

불곰 불곰

Test 구성

1000개 row insert

100,000개 row insert

결론

티스토리툴바

ABOUT ME

Test 구성

1000개 row insert

100,000개 row insert

결론

관련글 관련글 더보기

티스토리툴바