How to do indexing, partition, and migration in Postgres 10
This tutorial shows how to do data migration, partition, and indexing in Postgres 10.
1. Change data directory
The dataset is quite large, and it will take up about 220GB of hard disk once loaded and indexed into Postgres. It is better to move the database storage to another large disk instead of the default (root) one. A Solid State Disk (SSD) is preferred for faster disk access.
(1) Check current Postgres data directory.
sudo -i su - postgres psql SHOW data_directory;
Output: data_directory ------------------------- /var/lib/pgsql/10/data (1 row)
(2) Quit all Postgres Processes and terminate Postgres service.
\q # quit from psql sudo systemctl stop postgresql-10 # stop postgres sudo systemctl status postgresql-10 # check its status
(3) Create new folder in destination directory
cd /ssd/disk1 mkdir psql_10_data # create a new folder chown postgres psql_10_data # Postgres requires exclusive ownership and access to the data directory. chmod 700 psql_10_data # Change read, write, and execute authority of this folder. No group or world access to this folder. Required by Postgres.
(4) Copy current data and configure files to new folder
cp -R /var/lib/pgsql/10/data/* /ssd/disk1/psql_10_data/
(5) Point to the new data location
vim /var/lib/pgsql/10/data/postgresql.conf # Edit the configuration file in old folder. ... data_directory = '/ssd/disk1/psql_10_data' # use data in another directory ...
(6) Restart PostgreSQL service
sudo systemctl start postgresql-10 Trouble shooting, you may encounter the following error message: ● postgresql-10.service - PostgreSQL 10 database server Loaded: loaded (/usr/lib/systemd/system/postgresql-10.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Tue 2018-07-24 13:04:11 PDT; 38s ago Docs: https://www.postgresql.org/docs/10/static/ Process: 23952 ExecStart=/usr/pgsql-10/bin/postmaster -D ${PGDATA} (code=exited, status=1/FAILURE) Process: 23946 ExecStartPre=/usr/pgsql-10/bin/postgresql-10-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS) Main PID: 23952 (code=exited, status=1/FAILURE) Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: Starting PostgreSQL 10 database server... Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og postmaster[23952]: 2018-07-24 13:04:11.944 PDT [23952] LOG: cou...se Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og postmaster[23952]: 2018-07-24 13:04:11.944 PDT [23952] HINT: Is...y. Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og postmaster[23952]: 2018-07-24 13:04:11.944 PDT [23952] LOG: cou...se Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: postgresql-10.service: main process exited, code=ex...URE Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: Failed to start PostgreSQL 10 database server. Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: Unit postgresql-10.service entered failed state. Jul 24 13:04:11 yod.cluster.ucsf.bkslab.og systemd[1]: postgresql-10.service failed. Hint: Some lines were ellipsized, use -l to show in full. Job for postgresql-10.service failed because the control process exited with error code. See "systemctl status postgresql-10.service" and "journalctl -xe" for details. journalctl -xe # check the error message There is a ******.lock file block access to the PostgreSQL service. Change the ownership or access roles of this file to solve this problem.
(7) Restart psql and check data directory again.
postgres=# SHOW data_directory; Output data_directory ------------------------- /ssd/disk1/psql_10_data (1 row)
2. Create Partition Tables and Load Data into Tables
In Postgres, a table can be partitioned by certain attributes. Partition speedup search, because only need to focus on a subset of data. We can modify a partition (child) table without effecting the whole table. An automate script is /var/lib/pgsql/script/create_partition_all.sh
(1) Create partition table (parent). Partition on trache_name, a natural way to partition the dataset.
CREATE TABLE IF NOT EXISTS smile_partition( smiles varchar(256) NOT NULL, zinc_id varchar(128) NOT NULL, inchikey varchar(128), mwt REAL NOT NULL, logp REAL NOT NULL, reactive REAL NOT NULL, purchase INT NOT NULL, tranche_name varchar(32) NOT NULL, feature text) PARTITION BY LIST(tranche_name);
(2) Create partition table(child).
CREATE TABLE IF NOT EXISTS $tablename PARTITION OF smile_partition FOR VALUES IN ('$prefix'); # $tablename is like smile_partition_aaaa, smile_partition_ebca; '$prefix' is like 'EEBD', 'AAAA'
(3) Load data to the table.
copy smile_partition from '$i' with CSV HEADER Delimiter AS E'\t'; # copy to the partition (parent) table. Data will be automatically load to corresponding child table. Don't copy data to partition (child) table.
(4) Check created tables.
\d Output: List of relations Schema | Name | Type | Owner --------+----------------------+-------+---------- public | smile_partition | table | postgres public | smile_partition_aaaa | table | postgres public | smile_partition_aaab | table | postgres public | smile_partition_aaad | table | postgres public | smile_partition_aaae | table | postgres public | smile_partition_aaaf | table | postgres public | smile_partition_aaba | table | postgres ... ... public | smile_partition_kkgf | table | postgres public | smile_partition_kkia | table | postgres public | smile_partition_kkib | table | postgres public | smile_partition_kkie | table | postgres public | smile_partition_kkif | table | postgres (3601 rows)
3. Create Index on Partition Tables
Indexing can greatly speedup searching. There are several types of index in Postgres. In this work, I am using the default BTREE (Binary Tree) index. In the future, we may leverage RDKit for indexing. An automate script to generate index for all partition tables is /var/lib/pgsql/script/create_index_all.sh
CREATE INDEX IF NOT EXISTS $indexname ON $tablename(smiles); # indexname is like 'idx_abed', tablename is like 'smile_partition_kkie'.
Check Validity of index 1. Check the number of index is the same as the number of partition tables (child, no index for the parent table). 2. Check the details of partition table. e.g.
\d smile_partition_aaaa Table "public.smile_partition_aaaa" Column | Type | Collation | Nullable | Default --------------+------------------------+-----------+----------+--------- smiles | character varying(256) | | not null | zinc_id | character varying(128) | | not null | inchikey | character varying(128) | | | mwt | real | | not null | logp | real | | not null | reactive | real | | not null | purchase | integer | | not null | tranche_name | character varying(32) | | not null | feature | text | | | Partition of: smile_partition FOR VALUES IN ('AAAA') Indexes: "index_aaaa" btree (smiles) If the index is invalid, following message will be shown. Indexes: "index_aaaa" btree (smiles) INVALID
3. Check by doing a query search.
select * from smile_partition_aaaa where smiles='CS(=O)(=O)N1CCC[C@@H](C(=O)N2CCC3(CC2)OCCO3)C1'; QUERY PLAN ----------------------------------------------------------------------------------------- Index Scan using index_aaaa on smile_partition_aaaa (cost=0.28..8.30 rows=1 width=104) Index Cond: ((smiles)::text = 'CS(=O)(=O)N1CCC[C@@H](C(=O)N2CCC3(CC2)OCCO3)C1'::text) (2 rows) You should see "Index Scan" indicating that Postgres is using indexing in searching. If you don't see this, check the validity of index. You can drop and recreate index on individual partition table (child). Another pro of partition, if something wrong with one partition table (child), just modify this one.
4. Search the database
Directly perform search on the partition table (parent). Parallel query will be automatically used in searching.
explain select * from smile_partition where smiles='COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'; Output: QUERY PLAN ------------------------------------------------------------------------------------------------ Append (cost=0.28..25863.85 rows=3600 width=250) -> Index Scan using index_aaaa on smile_partition_aaaa (cost=0.28..8.30 rows=1 width=104) Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text) -> Index Scan using index_aaab on smile_partition_aaab (cost=0.28..8.30 rows=1 width=106) Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text) -> Index Scan using index_aaad on smile_partition_aaad (cost=0.29..8.30 rows=1 width=125) Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text) -> Index Scan using index_aaae on smile_partition_aaae (cost=0.42..8.44 rows=1 width=106) Index Cond: ((smiles)::text = 'COc1cc(C(=O)N2CCOCC2)cc(OS(=O)(=O)O)c1OC'::text) ... ...
Performance of searching with partition, indexing, and parallel query.
Tranche_name | Query | Running time |
---|---|---|
ADBA | Cc1cc(C(=O)O)cc(C)c1[N+](=O)[O-] | 1549.504 ms |
BDCD | CC(C)OCC(=O)N1CCc2cccc(O)c2C1 | 1049.261 ms |
CAGE | NNC(=O)c1nnn(Cc2ccccc2)c1C(=O)NN | 971.508 ms |
DCEB | O=C1c2nc[nH]c2C(=O)C(SCCO)=C1SCCO | 882.626 ms |
ECAD | C[C@@H](CCNC(=O)C(C)(C)F)NC(=O)[C@H]1CCc2nncn2CC1 | 1001.437 ms |
KDED | COc1ccc(C=c2sc(=C(C#N)C(=O)N3CCCC3)n(CCN3CCOCC3)c2=O)cc1OCc1ccccc1 | 1043.976 ms |