2dload
New 2D instructions
New commands
1. pre_process_partition.bash [partition_id] [tranches]
2. python 2dload.py add [partition_id] [preprocess_file] [catalog_shortname]
3. python 2dload.py rollback [partition_id] list
4. python 2dload.py rollback [partition_id] [shortname_list]
5. 2dwrapper.bash [partition_id] [catalog_shortname] [tranches]
6. python 2dload.py postgres [partition_id] [port_number] [shortname_list]
Command Description
1. The new preprocessing command. Preprocessing is now separate from loading, so its results can be saved elsewhere in case the database needs to be loaded from scratch again. The command launches a number of Slurm jobs to preprocess molecules and terminates once they have all completed. It saves a tarball of the collected results to /tmp/${PARTITION_ID}.pre.
2. The new command for adding data to a database. As before, a partition id and a catalog shortname are required. In addition, it requires the ".pre" preprocessing file produced by pre_process_partition.bash.
3. A new feature that lets you view the data currently loaded in a particular database. Databases created before the update display "legacy archive" in their size field when this command is run.
ex:
python 2dload.py rollback 110 list
tranche|date           |short  |catalog size
=======|===============|=======|===============
H26P370|08.31.19.11    |s      |3548322
H26P370|09.01.12.21    |m      |8485914
===============================================
H26P380|08.31.19.59    |s      |3996581
H26P380|09.01.13.50    |m      |8713228
===============================================
H26P390|08.31.20.48    |s      |4887562
H26P390|09.01.15.10    |m      |6547271
===============================================
H26P400|08.31.21.42    |s      |5240524
H26P400|09.01.16.16    |m      |4465551
===============================================
[11.05.09.49]: total substance table size: 44660929
[11.05.09.49]: total supplier table size: 22587771
[11.05.09.49]: total catalog table size: 45884953

python 2dload.py rollback 111 list
tranche|date           |short  |catalog size
=======|===============|=======|===============
H26P340|08.31.19.17    |s      |legacy archive
H26P340|09.01.13.15    |m      |legacy archive
H26P340|10.06.18.46    |s      |legacy archive
H26P340|10.07.11.32    |m      |legacy archive
===============================================
H26P350|08.31.20.09    |s      |legacy archive
H26P350|09.01.15.34    |m      |legacy archive
H26P350|10.06.21.55    |s      |legacy archive
H26P350|10.08.00.25    |m      |legacy archive
===============================================
H26P360|08.31.21.05    |s      |legacy archive
H26P360|09.01.17.34    |m      |legacy archive
H26P360|10.07.02.29    |s      |legacy archive
H26P360|10.08.15.00    |m      |legacy archive
===============================================
[11.05.09.49]: total substance table size: 48379301
[11.05.09.49]: total supplier table size: 23877467
[11.05.09.49]: total catalog table size: 50178572
(As you can see, partition 111 still has duplicate archives: each tranche lists the s and m catalogs twice.)
(IMPORTANT: before you do anything loading-wise, you must delete *all* erroneous archives in the source directory.)
4. Lets you roll back the database to a previous state. The target state is set by the shortname_list argument, a comma-separated list of the catalog shortnames that should remain loaded after the rollback. For example:
python 2dload.py rollback 0 list
tranche|date           |short  |catalog size
=======|===============|=======|===============
H00P000|08.31.19.17    |s      |100000
H00P000|08.31.19.18    |m      |2000
H00P000|09.31.17.15    |u      |-56
(Uh oh, something is wrong with catalog u: its size is negative. Let's roll it back.)
python 2dload.py rollback 0 s,m
...
python 2dload.py rollback 0 list
tranche|date           |short  |catalog size
=======|===============|=======|===============
H00P000|08.31.19.17    |s      |100000
H00P000|08.31.19.18    |m      |2000
5. Wrapper script that will perform the entire loading process in one go, including preprocessing.
6. Not implemented yet. Intended for loading 2D data into postgres. Expects a partition, a port, and a list of shortnames; the shortnames from this partition will be loaded into the postgres server at the specified port number. I've noticed that the catalog information in the postgres databases we have loaded so far is incorrect, so we might need to re-load them at some point. Be aware.
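The check-then-roll-back workflow from item 4 can be sketched in Python. This is only an illustration and not part of 2dload.py: the function names parse_rollback_list and shortnames_to_keep are hypothetical, the parsing assumes the pipe-delimited layout of the "rollback [partition_id] list" output, and treating a size of zero or less as a bad load is an assumption based on the catalog u example.

```python
def parse_rollback_list(text):
    """Parse `rollback [partition_id] list` output into a list of row dicts.

    Assumes the pipe-delimited layout from the examples; the command line,
    header, separator, and "[date]: total ..." lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Data rows have exactly four fields; drop the header and "====" rows.
        if len(fields) != 4 or fields[0] == "tranche" or set(fields[0]) <= {"="}:
            continue
        tranche, date, short, size = fields
        try:
            size = int(size)
        except ValueError:
            size = None  # e.g. "legacy archive": size is unknown
        rows.append({"tranche": tranche, "date": date, "short": short, "size": size})
    return rows


def shortnames_to_keep(rows):
    """Build a comma-separated shortname_list of catalogs that look sane.

    Assumption (not from 2dload.py): a known size <= 0 marks a bad load.
    """
    bad = {r["short"] for r in rows if r["size"] is not None and r["size"] <= 0}
    keep = []
    for r in rows:
        if r["short"] not in bad and r["short"] not in keep:
            keep.append(r["short"])
    return ",".join(keep)


example = """\
tranche|date           |short  |catalog size
=======|===============|=======|===============
H00P000|08.31.19.17    |s      |100000
H00P000|08.31.19.18    |m      |2000
H00P000|09.31.17.15    |u      |-56
"""
print(shortnames_to_keep(parse_rollback_list(example)))  # s,m
```

The printed value is what you would pass as shortname_list, as in "python 2dload.py rollback 0 s,m". Rows reported as "legacy archive" have no known size, so this sketch neither flags nor validates them.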
Additional
To run the 2dload.py script, a python3 environment must be used. Using a python2 environment will get you an "AttributeError: __exit__" error.
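One way to make this failure easier to diagnose is an explicit interpreter check that runs before anything else. The following is a minimal sketch under that assumption, not code taken from 2dload.py; the function name require_python3 is hypothetical.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Exit with a readable message when run under Python 2."""
    if version_info[0] < 3:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(
            "2dload.py requires Python 3; found Python %d.%d"
            % (version_info[0], version_info[1])
        )


require_python3()
```

The check uses only syntax that both Python 2 and Python 3 accept, so it can run and report the problem before the script reaches code that needs Python 3.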