Creating Maps on SmallWorld

Revision as of 18:35, 19 July 2019

Written by Jennifer Young on April 10, 2019. Last modified on July 12, 2019

Install: The developer need only do this once. For the user, skip to the Run section

Step 1: Create the /sw directory

SmallWorld requires a /sw directory which contains /anon and /maps. The /sw directory is located on n-9-22 in

   /srv/db4

This directory contains:

The /anon directory contains 12 TB of pre-computed subgraphs

The /maps directory contains text files computed by SmallWorld that map your molecules of choice to the corresponding nodes in the anonymous graph index

Step 2: Find a location for the sw.jar file

Copy the sw.jar file to a reasonable location such as

   /opt/nextmove/sw.jar

Run

Important! Currently, we only run SmallWorld map generation on n-9-22

Whenever running operations that are rough on the disk, run only one major operation at a time. Also, use nice/renice and ionice to reduce the disruption to other lab members.

You can adjust the priority of a running process using the process ID. Look for the process ID like this

   ps aux | grep <something to help you find your process>

Then use renice to change the priority

   renice -n +19 -p <PID>

Note that +19 is the lowest priority possible. Consult the man page for more information.

You can also use ionice to limit the I/O usage

   ionice -c 3 -p <PID>

Here the 3 selects the idle class. Consult the man page for other options such as -c 2, the best-effort class, which has its own priority levels.
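
Instead of adjusting a process after it has started, the same priorities can be applied when the job is launched. The sketch below is one way to do that; the function name run_gently is hypothetical and not part of any lab tooling, and it falls back to nice alone when ionice is missing or not permitted.

```shell
#!/bin/bash
# Hypothetical helper: start a command at the lowest CPU priority and,
# if ionice is installed and permitted, in the idle I/O class.
run_gently() {
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        nice -n 19 ionice -c 3 "$@"
    else
        nice -n 19 "$@"
    fi
}

# Example: 'nice' with no arguments prints the niceness it is running at.
run_gently nice
```

Any long-running command can be passed in place of the example, so the job never runs at normal priority.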

Step 1: Set SWDIR environment variable

Determine where the /sw directory containing the /anon and /maps directories is located. If using csh, set the SWDIR variable using

   setenv SWDIR /srv/db4

If using bash, set the SWDIR variable using

   export SWDIR=/srv/db4
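
Before starting a long job it can help to confirm that SWDIR points at a directory with the expected layout. The following is a minimal sketch in bash; check_swdir is a hypothetical name, and the only assumption it encodes is the one stated above, that the directory contains anon and maps.

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeed only if the given directory
# (default: $SWDIR) exists and contains both anon/ and maps/.
check_swdir() {
    local dir="${1:-$SWDIR}"
    local sub
    if [ -z "$dir" ]; then
        echo "SWDIR is not set" >&2
        return 1
    fi
    for sub in anon maps; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" >&2
            return 1
        fi
    done
    echo "SWDIR looks usable: $dir"
}
```

Calling check_swdir with no argument tests the current value of SWDIR; passing a path tests that path instead.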

Step 2: Generate the map files for SMILES files LESS THAN 1 Billion

For files with more than 1 billion SMILES, see the section on processing large SMILES files in SmallWorld.

Run sw.jar to convert your smiles file into a map file.

(Optional: use the time command to see how long this takes)

(Optional: tee your results into a log file to check progress)

Here is an example where wait-ok.smi is your .smi file to map:

   (time java -jar /opt/nextmove/sw.jar map /mnt/nfs/ex9/work/smallworld/wait-ok.smi \
   -T /mnt/nfs/ex9/work/smallworld/scratch_sw \
   -o /mnt/nfs/ex9/work/smallworld/wait-ok.anon.map \
   -u /mnt/nfs/ex9/work/smallworld/wait-ok.anon.unmapped) \
   |& tee /mnt/nfs/ex9/work/smallworld/wait-ok_log

This is a long command, so we can break down the different parts.

map The SMILES file you want to map follows the word "map"

-T This is the temp directory you want to write temp files to, rather than writing to /tmp

-o The final name of your output map file

-u Argument to keep track of the unmapped compounds.

Important: The unmapped compounds file may be used to generate extensions and is the same as the .anon.ext.map file mentioned in the SmallWorld documentation
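
Because the output and unmapped file names follow from the input name, the command can be assembled rather than typed by hand. This is only a sketch: print_map_command, SW_JAR and SCRATCH are hypothetical names introduced here, with defaults taken from the paths used in the example. It prints the command instead of running it, so it can be inspected first.

```shell
#!/bin/bash
# Hypothetical helper: print the sw.jar map command for one SMILES file,
# deriving the -o and -u names from the input name as in the example.
SW_JAR="${SW_JAR:-/opt/nextmove/sw.jar}"
SCRATCH="${SCRATCH:-/mnt/nfs/ex9/work/smallworld/scratch_sw}"

print_map_command() {
    local smi="$1"
    local stem="${smi%.smi}"   # strip a trailing .smi, if present
    echo "java -jar $SW_JAR map $smi -T $SCRATCH -o $stem.anon.map -u $stem.anon.unmapped"
}

print_map_command /mnt/nfs/ex9/work/smallworld/wait-ok.smi
```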


For now, generate the map files in the

   /mnt/nfs/ex9/work/smallworld 

directory so they can be easily moved to the maps directory when the server is restarted

The sorting is done automatically when you generate the SmallWorld maps in a single step like this.

Step 2: Processing large SMILES files in SmallWorld GREATER THAN 1 Billion

If your SMILES file contains more than 1 billion molecules, I think it is better to split the input into chunks of 1 billion. That way intermediate results arrive sooner, and less work is lost if the job fails for some reason.

Step 1: Split the input file into chunks of 1 billion smiles

   split -l 1000000000 <your-smiles-file>
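
By default split names its output files xaa, xab, xac and so on in the current directory, which is why the helper script in the next step matches x??. The toy run below shows the behaviour on a made-up five-line file with chunks of two lines; the file contents and the temporary directory are for illustration only.

```shell
#!/bin/bash
# Small-scale illustration of split -l: 5 lines in chunks of 2.
# The file name and contents are made up for the demonstration.
workdir=$(mktemp -d)          # scratch directory, left in place for inspection
cd "$workdir" || exit 1
printf 'C 1\nCC 2\nCCC 3\nCCCC 4\nCCCCC 5\n' > demo.smi

split -l 2 demo.smi           # writes xaa, xab, xac in the current directory

chunks=$(ls x?? | wc -l)      # number of chunk files
total=$(cat x?? | wc -l)      # lines across all chunks
echo "chunks: $((chunks)) lines: $((total))"
```

Comparing the total line count of the chunks against the input is a cheap check that nothing was lost in the split.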

Step 2: Process the chunks in SmallWorld using the command above

Write a helper script to process these using a loop, usually one or two at a time to be mindful of disk overload. An example is shown below. Just modify the path and pattern to match your files.

   #!/bin/csh
   foreach j(/mnt/nfs/ex9/work/smallworld/REAL_SPACE_400_to_500/x??)
       echo $j
       echo '(time java -jar /opt/nextmove/sw.jar map' $j '-T /mnt/nfs/ex9/work/smallworld/scratch_sw -o '${j}'.anon.map -u '${j}'.anon.unmapped) |& tee '${j}'_log'
       (time java -jar /opt/nextmove/sw.jar map $j -T /mnt/nfs/ex9/work/smallworld/scratch_sw -o ${j}.anon.map -u ${j}.anon.unmapped) |& tee ${j}_log
   end

Important: Make sure that the file you want to process only contains two columns. The first column contains the SMILES and the second column contains the ID. If this is not the case, you need to use

   awk '{print $1"\t"$2}'  <your file> > <your_file_smi_ID>

to get only these columns before processing with SmallWorld.
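
As an illustration of what the awk command does, the sketch below runs it on two made-up lines that carry extra columns. The wrapper name keep_smiles_and_id and the IDs are hypothetical. awk splits on runs of whitespace by default, so the input may be space or tab separated.

```shell
#!/bin/bash
# Illustration with made-up data: keep only SMILES (column 1) and ID (column 2),
# separated by a tab.
keep_smiles_and_id() {
    awk '{print $1"\t"$2}' "$@"
}

printf 'CCO ID001 extra\nc1ccccc1 ID002 more stuff\n' | keep_smiles_and_id
```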

Step 3-4 Combined: Rather than concatenating and sorting, merge the map file chunks (assuming they are already sorted)

Use the -T option on sort to set the temp directory of your choice. If your file is large enough, the sort command needs scratch space to write temporary files to later merge.

Adjust the matching in the loop rather than just using *.anon.map if you want to be more specific about the files you are combining

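   #!/bin/bash
   # Create the combined file if it does not exist yet, so the first merge has something to read
   touch <combined_map_name>
   for j in *anon.map
   do
       ls "$j"
       LC_ALL=C sort -T /nfs/db5/jyoung -m -u "$j" <combined_map_name> -o <combined_map_name>
   done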

As suggested in the SmallWorld documentation, you can check if the resulting file is sorted using the sort -c option

   #!/bin/bash
   for j in <combined_map_name>
   do
       ls $j
       LC_ALL=C sort -T /nfs/db5/jyoung -c -u $j
   done
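
The merge and the check can be tried on toy data before committing to a multi-hour run. The sketch below uses made-up files in a temporary directory: sort -m -u merges inputs that are already sorted and drops duplicate lines, and sort -c -u then exits successfully only if the result is in order with no repeated lines.

```shell
#!/bin/bash
# Illustration with made-up files: merge already-sorted inputs with
# sort -m -u, then verify the result with sort -c -u.
workdir=$(mktemp -d)
printf 'A\nC\n' > "$workdir/part1"
printf 'B\nC\nD\n' > "$workdir/part2"

LC_ALL=C sort -m -u -o "$workdir/merged" "$workdir/part1" "$workdir/part2"

if LC_ALL=C sort -c -u "$workdir/merged"; then
    echo "merged file is sorted and has no duplicate lines"
fi
```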

Step 3: Copy to SSD and Concatenate

It is much faster and avoids overloading the disk if you concatenate the results on an SSD rather than a regular hard drive. I have been using /nfs/db5/jyoung as a place to concatenate the files. You can use another helper script to perform that concatenation one file at a time.

   #!/bin/csh
   foreach j(*.anon.map)
       ls $j
       cat $j >> enamine_private_400_to_500_all
   end

Step 4: Sort the resulting map file using bash with locale LC_ALL=C

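It is very important to run the sort in bash with the locale LC_ALL=C, so that lines are compared byte by byte. The file name in the loop must match the file produced by the concatenation step; rename that file with the .anon.map suffix first if needed.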

   #!/bin/bash
   for j in enamine_private_400_to_500_all.anon.map
   do
       ls $j
       (time LC_ALL=C sort -T /nfs/db5/jyoung $j -o sorted_${j}) |& tee sort_time_${j}
   done
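
To see what LC_ALL=C changes, the short example below sorts four made-up lines. In the C locale sort compares raw bytes, so all uppercase ASCII letters come before all lowercase ones; other locales may interleave the two cases, which would give a different order.

```shell
#!/bin/bash
# With LC_ALL=C, sort compares raw bytes, so every uppercase ASCII letter
# orders before every lowercase one.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
echo "$c_sorted"
```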

Step 5: Copy the completed file back to /ex9

   cp <your map file> /nfs/ex9/work/smallworld

Step 6: Move the file to the final location

Move the map file into either the maps/ directory or the private_smallworld_maps/ directory depending on whether the file should be public or private

   mv <your map file> maps/
   mv <your map file> private_smallworld_maps/

Step 2.5 Incremental Updates

There are several equivalent ways to perform incremental updates for SmallWorld maps.

If you just have a new SMILES file input.smi you wish to add to an already existing map named <already_existing>.anon.map, then you can use the sw map command with the --append option provided in the SmallWorld Java Command Line Interface (CLI)

First, if you have not already done so, set an alias "sw" for the SmallWorld Java file as shown below (csh syntax). Make sure the SWDIR environment variable is also set.

   setenv SWDIR /srv/db4
   alias sw 'java -jar /opt/nextmove/sw.jar'

Now you can run the sw map command using --append

   sw map input.smi --append -o <already_existing>.anon.map

Another method for performing incremental updates is discussed below. A worked example is located in the SmallWorld version 4 documentation on pages 17-18 in the Incremental Updates Section

What if you already created a .anon.map file from your new smiles, and simply want to combine the two separate .anon.map files into a single .anon.map file?

Make a copy of the bash script below and modify it with the names of your map files

   /nfs/ex9/work/smallworld/bash_merge_maps

This script simply combines two map files that have already been created

Replace the OLD_MAPFILE environment variable with the old map

Replace the ADD_MAPFILE environment variable with the map you wish to incrementally add

Replace the NEW_MAPFILE environment variable with the name you want for the new combined map

Combining map files in this way does not automatically update the .anon.map.cfg file with the new total number of molecules mapped and other statistics. The create_cfg.py program will automatically generate a new .anon.map.cfg file for the combined map.

The create_cfg.py program is located in

   /mnt/nfs/ex9/work/smallworld/create_cfg.py

The create_cfg.py script will automatically generate a cfg file for a map that is composed of two existing SmallWorld Maps. Run as follows:

   python create_cfg.py first second combined

Note:

   first is first.anon.map
   second is second.anon.map
   combined is combined.anon.map

The program will look for the .anon.map files and the .anon.map.cfg for the first and second before proceeding
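
Since the program expects those files to exist, a quick check beforehand can save a failed run. The function below is a hypothetical sketch, not part of create_cfg.py; it only tests for NAME.anon.map and NAME.anon.map.cfg in the current directory for each name given.

```shell
#!/bin/bash
# Hypothetical pre-flight check mirroring what the text says create_cfg.py
# looks for: NAME.anon.map and NAME.anon.map.cfg for each input name.
have_map_and_cfg() {
    local name f status=0
    for name in "$@"; do
        for f in "$name.anon.map" "$name.anon.map.cfg"; do
            if [ ! -f "$f" ]; then
                echo "missing: $f" >&2
                status=1
            fi
        done
    done
    return $status
}
```

For the example above it would be called as have_map_and_cfg first second before running create_cfg.py.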

The new_cfg.py program is run as follows

   python /mnt/nfs/ex9/work/smallworld/new_cfg.py <combined cfg file name>

This Python program will find all the cfg files in a directory and combine them into a single combined cfg file.

Step 3: How to Make the New Database Available. Stop the server first!

First, stop the server. You do not need to restart Tomcat every time you want to restart SmallWorld. The Tomcat GUI is available at

   http://10.20.9.22:8080/

You need the tunnel to access the private network or this link will not work! Also, the admin username and password are required.

In the Manager App, in the Applications section, use the Stop and Start buttons for the sw-ws app to restart the SmallWorld server

Step 4: Move the Map File into the /maps directory

If you followed the example above, your .anon.map file should be located in

   /mnt/nfs/ex9/work/smallworld

Now, simply move the map file into the "/sw/maps" directory so it can be recognized by SmallWorld.

   mv <your-file-name>.anon.map /mnt/nfs/ex9/work/smallworld/maps

Or, simply

   mv <your-file-name>.anon.map maps


Step 5: Restart the Server

Again, go to the Tomcat manager and, in the Applications section, click Start once you are finished moving the map files.