How to do indexing, partition, and migration in Postgres 10

From DISI
Jump to: navigation, search

This tutorial shows how to do data migration, partition, and indexing in Postgres 10.

1. Change data directory

The dataset is quite large, and it will take up about 220GB of hard disk once loaded and indexed into Postgres. It is better to move the database storage to another large disk instead of the default (root) one. A Solid State Disk (SSD) is preferred for faster disk access.

(1) Check current Postgres data directory.

sudo -i 
su - postgres
psql
SHOW data_directory;
Output:

     data_directory      
-------------------------
 /var/lib/pgsql/10/data
(1 row)

(2) Quit all Postgres Processes and terminate Postgres service.

\q     # quit from psql
sudo systemctl stop postgresql-10           # stop postgres
sudo systemctl status postgresql-10         # check its status

(3) Create new folder in destination directory

cd /ssd/disk1             
mkdir psql_10_data                          # create a new folder
chown postgres psql_10_data                 # Postgres requires exclusive ownership and access to the data directory. 
chmod 700 psql_10_data                      # Change read, write, and execute authority of this folder. No group or world access to this folder. Required by Postgres.

(4) Copy current data and configure files to new folder

cp -R /var/lib/pgsql/10/data/* /ssd/disk1/psql_10_data/

(5) Point to the new data location

vim /var/lib/pgsql/10/data/postgresql.conf       # Edit the configuration file in old folder.

...
data_directory = '/ssd/disk1/psql_10_data'              # use data in another directory
...

(6) Restart PostgreSQL service

sudo systemctl start postgresql-10

Trouble shooting, you may encounter the following error message:

● postgresql-10.service - PostgreSQL 10 database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql-10.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2018-07-24 13:04:11 PDT; 38s ago
     Docs: https://www.postgresql.org/docs/10/static/
  Process: 23952 ExecStart=/usr/pgsql-10/bin/postmaster -D ${PGDATA} (code=exited, status=1/FAILURE)
  Process: 23946 ExecStartPre=/usr/pgsql-10/bin/postgresql-10-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
 Main PID: 23952 (code=exited, status=1/FAILURE)

Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: Starting PostgreSQL 10 database server...
Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og postmaster[23952]: 2018-07-24 13:04:11.944 PDT [23952] LOG:  cou...se
Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og postmaster[23952]: 2018-07-24 13:04:11.944 PDT [23952] HINT:  Is...y.
Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og postmaster[23952]: 2018-07-24 13:04:11.944 PDT [23952] LOG:  cou...se
Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: postgresql-10.service: main process exited, code=ex...URE
Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: Failed to start PostgreSQL 10 database server.
Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: Unit postgresql-10.service entered failed state.
Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: postgresql-10.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

Job for postgresql-10.service failed because the control process exited with error code. See "systemctl status postgresql-10.service" and "journalctl -xe" for details.

journalctl -xe               # check the error message
There is a ******.lock file block access to the PostgreSQL service. Change the ownership or access roles of this file to solve this problem.

(7) Restart psql and check data directory again.

postgres=# SHOW data_directory;

Output
     data_directory      
-------------------------
 /ssd/disk1/psql_10_data
(1 row)



2. Create Partition Tables and Load Data into Tables

In Postgres, a table can be partitioned by certain attributes. Partition speedup search, because only need to focus on a subset of data. We can modify a partition (child) table without effecting the whole table. An automate script is /var/lib/pgsql/script/create_partition_all.sh

(1) Create partition table (parent). Partition on trache_name, a natural way to partition the dataset.

CREATE TABLE IF NOT EXISTS smile_partition(
    smiles varchar(256) NOT NULL, 
    zinc_id varchar(128) NOT NULL, 
    inchikey varchar(128), 
    mwt REAL NOT NULL, 
    logp REAL NOT NULL, 
    reactive REAL NOT NULL, 
    purchase INT NOT NULL, 
    tranche_name varchar(32) NOT NULL, 
    feature text) PARTITION BY LIST(tranche_name);

(2) Create partition table(child).

CREATE TABLE IF NOT EXISTS $tablename PARTITION OF smile_partition FOR VALUES IN ('$prefix');   # $tablename is like smile_partition_aaaa, smile_partition_ebca; '$prefix' is like 'EEBD', 'AAAA'

(3) Load data to the table.

copy smile_partition from '$i' with CSV HEADER Delimiter AS E'\t';   # copy to the partition (parent) table. Data will be automatically load to corresponding child table. Don't copy data to partition (child) table.

(4) Check created tables.

\d

Output:

List of relations
 Schema |         Name         | Type  |  Owner   
--------+----------------------+-------+----------
 public | smile_partition      | table | postgres
 public | smile_partition_aaaa | table | postgres
 public | smile_partition_aaab | table | postgres
 public | smile_partition_aaad | table | postgres
 public | smile_partition_aaae | table | postgres
 public | smile_partition_aaaf | table | postgres
 public | smile_partition_aaba | table | postgres
...
...
 public | smile_partition_kkgf | table | postgres
 public | smile_partition_kkia | table | postgres
 public | smile_partition_kkib | table | postgres
 public | smile_partition_kkie | table | postgres
 public | smile_partition_kkif | table | postgres
(3601 rows)


3. Create Index on Partition Tables

Indexing can greatly speedup searching. There are several types of index in Postgres. In this work, I am using the default BTREE (Binary Tree) index. In the future, we may leverage RDKit for indexing. An automate script to generate index for all partition tables is /var/lib/pgsql/script/create_index_all.sh

CREATE INDEX IF NOT EXISTS $indexname ON $tablename(smiles);  # indexname is like 'idx_abed', tablename is like 'smile_partition_kkie'.

Check Validity of index 1. Check the number of index is the same as the number of partition tables (child, no index for the parent table). 2. Check the details of partition table. e.g.

\d smile_partition_aaaa
                  Table "public.smile_partition_aaaa"
    Column    |          Type          | Collation | Nullable | Default 
--------------+------------------------+-----------+----------+---------
 smiles       | character varying(256) |           | not null | 
 zinc_id      | character varying(128) |           | not null | 
 inchikey     | character varying(128) |           |          | 
 mwt          | real                   |           | not null | 
 logp         | real                   |           | not null | 
 reactive     | real                   |           | not null | 
 purchase     | integer                |           | not null | 
 tranche_name | character varying(32)  |           | not null | 
 feature      | text                   |           |          | 
Partition of: smile_partition FOR VALUES IN ('AAAA')
Indexes:
    "index_aaaa" btree (smiles)

If the index is invalid, following message will be shown.
Indexes:
    "index_aaaa" btree (smiles) INVALID

3. Check by doing a query search.

select * from smile_partition_aaaa where smiles='CS(=O)(=O)N1CCC[C@@H](C(=O)N2CCC3(CC2)OCCO3)C1';
                                       QUERY PLAN                                        
-----------------------------------------------------------------------------------------
 Index Scan using index_aaaa on smile_partition_aaaa  (cost=0.28..8.30 rows=1 width=104)
   Index Cond: ((smiles)::text = 'CS(=O)(=O)N1CCC[C@@H](C(=O)N2CCC3(CC2)OCCO3)C1'::text)
(2 rows)

You should see "Index Scan" indicating that Postgres is using indexing in searching. If you don't see this, check the validity of index.
You can drop and recreate index on individual partition table (child). 
Another pro of partition, if something wrong with one partition table (child), just modify this one.

4. Search the database

Directly perform search on the partition table (parent). Parallel query will be automatically used in searching.

explain select * from smile_partition where smiles='COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC';

Output:

QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Append  (cost=0.28..25863.85 rows=3600 width=250)
   ->  Index Scan using index_aaaa on smile_partition_aaaa  (cost=0.28..8.30 rows=1 width=104)
         Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text)
   ->  Index Scan using index_aaab on smile_partition_aaab  (cost=0.28..8.30 rows=1 width=106)
         Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text)
   ->  Index Scan using index_aaad on smile_partition_aaad  (cost=0.29..8.30 rows=1 width=125)
         Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text)
   ->  Index Scan using index_aaae on smile_partition_aaae  (cost=0.42..8.44 rows=1 width=106)
         Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text)
...
...

Performance of searching with partition, indexing, and parallel query.

Tranche_name Query Running time
ADBA Cc1cc(C(=O)O)cc(C)c1[N+](=O)[O-] 1549.504 ms
BDCD CC(C)OCC(=O)N1CCc2cccc(O)c2C1 1049.261 ms
CAGE NNC(=O)c1nnn(Cc2ccccc2)c1C(=O)NN 971.508 ms
DCEB O=C1c2nc[nH]c2C(=O)C(SCCO)=C1SCCO 882.626 ms
ECAD C[C@@H](CCNC(=O)C(C)(C)F)NC(=O)[C@H]1CCc2nncn2CC1 1001.437 ms
KDED COc1ccc(C=c2sc(=C(C#N)C(=O)N3CCCC3)n(CCN3CCOCC3)c2=O)cc1OCc1ccccc1 1043.976 ms