Contributor

@shrey183 shrey183 commented Aug 25, 2020

[GSoC-2020] General File Reader and Multithreaded Mol Supplier

Overview

The General File Reader, as the name suggests, provides the user with the appropriate MolSupplier object to parse a file of a given format. Previously, to parse a SMILES file such as input.smi, one had to explicitly construct a SmilesMolSupplier object. With the GeneralFileReader, one can instead pass the file name along with supplier options and obtain the appropriate MolSupplier object, determined by the file format. The General File Reader also provides an interface to the MultithreadedMolSupplier objects for the Smiles and SDF file formats. Test cases are included to demonstrate the utility of the General File Reader.

The Multithreaded Mol Supplier provides a concurrent implementation of the usual base class MolSupplier. Due to time constraints, multithreaded versions were implemented only for the Smiles and SD Mol Suppliers. The motivation for this part was parsing large Smiles or SDF files. With the current implementation the user can, for instance, construct a MultithreadedSmilesMolSupplier object to parse a SMILES file with a large number of records. Test cases are included to demonstrate the correctness and performance of the MultithreadedMolSupplier. Here is a brief summary of the performance results obtained by running the function testPerformance on @greglandrum's machine:

Duration for SmilesMolSupplier: 6256 (milliseconds)
Maximum Duration for MultithreadedSmilesMolSupplier: 6972 (milliseconds) with 1 writer thread
Minimum Duration for MultithreadedSmilesMolSupplier: 855 (milliseconds) with 15 writer threads

Duration for SDMolSupplier: 2584 (milliseconds) 
Maximum Duration for MultithreadedSDMolSupplier: 2784 (milliseconds) with 1 writer thread
Minimum Duration for MultithreadedSDMolSupplier: 729 (milliseconds) with 7 writer threads
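
To put these timings in terms of scaling, here is a small sketch (not part of the PR; the helper names `speedup` and `efficiency` are illustrative) that computes the speedup relative to the single-threaded supplier and the fraction of ideal linear scaling. With the numbers above this comes to roughly 7.3x for Smiles with 15 writer threads and roughly 3.5x for SDF with 7 writer threads, which is about half of ideal linear scaling in both cases.

```cpp
#include <cassert>

// Illustrative helpers, not part of RDKit.

// Speedup of the multithreaded run relative to the serial run.
inline double speedup(double serialMs, double parallelMs) {
  assert(parallelMs > 0);
  return serialMs / parallelMs;
}

// Fraction of ideal linear scaling achieved with numThreads writer threads.
inline double efficiency(double serialMs, double parallelMs,
                         unsigned int numThreads) {
  assert(numThreads > 0);
  return speedup(serialMs, parallelMs) / numThreads;
}
```

By the same arithmetic, the runs with a single writer thread are roughly 8% to 11% slower than the corresponding serial suppliers.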

Implementation

The implementation of the General File Reader is quite concise and uses only two methods, determineFormat and getSupplier. The former determines the file format and the compression format from a given pathname, while the latter returns a pointer to a MolSupplier object given a pathname and SupplierOptions.
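
As a rough illustration of extension-based detection, here is a minimal sketch. It is not the code from this PR: the struct `GuessedFormat`, the function `guessFromPath`, and the shortened format lists are assumptions made for the example, and it throws `std::invalid_argument` only because that is a standard exception type.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type, not part of RDKit.
struct GuessedFormat {
  std::string fileFormat;
  std::string compressionFormat;  // empty when the file is not compressed
};

// Hypothetical helper: splits "name.fmt[.gz]" into a file format and an
// optional compression format, using only the path.
inline GuessedFormat guessFromPath(const std::string &path) {
  // Illustrative lists only; the real reader supports more formats.
  static const std::vector<std::string> fileFormats{"sdf", "mae", "smi"};
  static const std::vector<std::string> compressionFormats{"gz"};
  auto contains = [](const std::vector<std::string> &v,
                     const std::string &s) {
    return std::find(v.begin(), v.end(), s) != v.end();
  };

  GuessedFormat res;
  std::string rest = path;
  auto dot = rest.rfind('.');
  if (dot == std::string::npos) {
    throw std::invalid_argument("no extension in: " + path);
  }
  std::string ext = rest.substr(dot + 1);
  if (contains(compressionFormats, ext)) {
    // strip the compression extension and look at the one before it
    res.compressionFormat = ext;
    rest = rest.substr(0, dot);
    dot = rest.rfind('.');
    if (dot == std::string::npos) {
      throw std::invalid_argument("no structure extension in: " + path);
    }
    ext = rest.substr(dot + 1);
  }
  if (!contains(fileFormats, ext)) {
    throw std::invalid_argument("unsupported extension '" + ext +
                                "' in: " + path);
  }
  res.fileFormat = ext;
  return res;
}
```

Under these assumptions, `guessFromPath("big.sdf.gz")` yields the file format "sdf" with the compression format "gz", and an unknown extension raises an exception that names the offending path.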

For the MultithreadedMolSupplier, the first step was to implement a thread-safe blocking queue of fixed capacity, which allows records to be extracted from the file and processed concurrently. The concurrent queue was implemented with a single lock and two condition variables that signal when the queue is no longer empty or no longer full. Test cases checking the correctness of the ConcurrentQueue are also included in the project.
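
The following is a minimal sketch of such a queue, assuming a ring buffer indexed by monotonically increasing head and tail counters. The class name `BoundedQueue` and its exact interface are illustrative and may differ from the ConcurrentQueue in this PR.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative fixed-capacity blocking queue: one mutex, two condition
// variables. T must be default-constructible and copy-assignable.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity)
      : d_capacity(capacity), d_elements(capacity) {
    assert(capacity > 0);
  }

  // Blocks while the queue is full.
  void push(const T &item) {
    std::unique_lock<std::mutex> lk(d_lock);
    // wait() releases the lock while sleeping and reacquires it on wake-up,
    // so consumers can make progress while a producer is blocked here
    d_notFull.wait(lk, [this] { return d_tail - d_head < d_capacity; });
    d_elements[d_tail % d_capacity] = item;
    ++d_tail;
    d_notEmpty.notify_one();
  }

  // Blocks while the queue is empty and not yet marked done.
  // Returns false once the queue is done and fully drained.
  bool pop(T &item) {
    std::unique_lock<std::mutex> lk(d_lock);
    d_notEmpty.wait(lk, [this] { return d_head != d_tail || d_done; });
    if (d_head == d_tail) {
      return false;
    }
    item = d_elements[d_head % d_capacity];
    ++d_head;
    d_notFull.notify_one();
    return true;
  }

  // Signals that no further items will be pushed.
  void setDone() {
    std::lock_guard<std::mutex> lk(d_lock);
    d_done = true;
    d_notEmpty.notify_all();
  }

  bool isEmpty() const {
    std::lock_guard<std::mutex> lk(d_lock);
    return d_head == d_tail;
  }

 private:
  const std::size_t d_capacity;
  std::vector<T> d_elements;
  std::size_t d_head = 0, d_tail = 0;
  bool d_done = false;
  mutable std::mutex d_lock;
  std::condition_variable d_notFull, d_notEmpty;
};
```

Using the predicate overload of `wait` also guards against spurious wake-ups, since the condition is rechecked each time the thread wakes.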

The next step was to implement the base class MultithreadedMolSupplier, which manages the input and output queues. The input queue is populated by the method extractNextRecord, which reads a record from a given file/stream. The output queue is populated by the method processMoleculeRecord, which first pops a record from the input queue and then processes it into an object of type ROMol *. The reader thread thus calls extractNextRecord until no more records can be read, while the writer thread(s) call processMoleculeRecord until the input queue is done and empty. The child classes MultithreadedSmilesMolSupplier and MultithreadedSDMolSupplier primarily provide implementations of the methods extractNextRecord and processMoleculeRecord. Both suppliers were tested on various files with different parameter values for the input queue size, the output queue size, and the number of writer threads.
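
The reader/writer arrangement described above can be sketched as follows. This is a simplified stand-in, not the PR's code: `SimpleQueue` is unbounded for brevity, `processRecord` just returns the record length instead of building a molecule, and `runPipeline` is a hypothetical driver.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Compact stand-in for a concurrent queue (unbounded, for brevity).
template <typename T>
class SimpleQueue {
 public:
  void push(T item) {
    {
      std::lock_guard<std::mutex> lk(d_lock);
      d_items.push(std::move(item));
    }
    d_notEmpty.notify_one();
  }
  // Returns false once the queue is done and drained.
  bool pop(T &item) {
    std::unique_lock<std::mutex> lk(d_lock);
    d_notEmpty.wait(lk, [this] { return !d_items.empty() || d_done; });
    if (d_items.empty()) return false;
    item = std::move(d_items.front());
    d_items.pop();
    return true;
  }
  void setDone() {
    {
      std::lock_guard<std::mutex> lk(d_lock);
      d_done = true;
    }
    d_notEmpty.notify_all();
  }

 private:
  std::queue<T> d_items;
  bool d_done = false;
  std::mutex d_lock;
  std::condition_variable d_notEmpty;
};

using Record = std::pair<unsigned int, std::string>;  // (line number, text)
using Result = std::pair<unsigned int, std::size_t>;  // (line number, value)

// Hypothetical stand-in for parsing a record into a molecule.
inline std::size_t processRecord(const Record &rec) {
  return rec.second.size();
}

inline std::vector<Result> runPipeline(const std::string &data,
                                       unsigned int numWriterThreads) {
  assert(numWriterThreads > 0);
  SimpleQueue<Record> inputQueue;
  SimpleQueue<Result> outputQueue;

  // reader thread: extract records until the stream is exhausted
  std::thread reader([&] {
    std::istringstream strm(data);
    std::string line;
    unsigned int lineNum = 0;
    while (std::getline(strm, line)) inputQueue.push({++lineNum, line});
    inputQueue.setDone();
  });

  // writer threads: process records until the input queue is done and empty
  std::atomic<unsigned int> active{numWriterThreads};
  std::vector<std::thread> writers;
  for (unsigned int i = 0; i < numWriterThreads; ++i) {
    writers.emplace_back([&] {
      Record rec;
      while (inputQueue.pop(rec)) {
        outputQueue.push({rec.first, processRecord(rec)});
      }
      if (--active == 0) outputQueue.setDone();  // last writer closes output
    });
  }

  // the calling thread drains the output queue
  std::vector<Result> results;
  Result out;
  while (outputQueue.pop(out)) results.push_back(out);

  reader.join();
  for (auto &t : writers) t.join();
  // sorted only so that this sketch returns a deterministic result
  std::sort(results.begin(), results.end());
  return results;
}
```

Carrying the line number along with each record is what allows a writer thread to report an accurate position even though records are processed out of order.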

Further Work

Due to time constraints and the difficulty involved in debugging concurrent code, there were a few things that could not be completed.

  1. In cases where the file format is not well defined, it might be useful to parse the file content to discover the file format and possible Supplier options. The current implementation does not support this and uses only the pathname to determine the appropriate Supplier.
  2. Wrappers for the Multithreaded Smiles and SD Suppliers in other languages such as Java were not implemented in this project.

Changes made for the General File Reader and Multithreaded Mol Supplier:

List of important files added:

//! concurrent queue is full so we wait until
//! it is not full
while (d_head + d_capacity == d_tail) {
d_notFull.wait(lk);
Contributor

@bp-kelley bp-kelley Sep 24, 2020

You have a deadlock here. If we push an item, we lock the queue and THEN wait for the queue to be emptied, which can't happen since we hold the lock.

A small change to testProducerConsumer shows this where we call thread::join before launching the consumers.

n.b. this test is in error, the join is blocking

  //! start producer threads
  for (int i = 0; i < numProducerThreads; i++) {
    producers[i] = std::thread(produce, std::ref(q), numToProduce);
  }

  std::for_each(producers.begin(), producers.end(),
                std::mem_fn(&std::thread::join));


  //! start consumer threads
  for (int i = 0; i < numConsumerThreads; i++) {
    consumers[i] = std::thread(consume, std::ref(q), std::ref(results[i]));
  }

  //! the producer is done producing
  q.setDone();

  std::for_each(consumers.begin(), consumers.end(),
                std::mem_fn(&std::thread::join));
  TEST_ASSERT(q.isEmpty());

Member

@bp-kelley, assuming I understand your comment correctly:
The use of the condition_variable d_notFull ensures that we don't get a deadlock here. We no longer hold the lock while d_notFull.wait() is executing.

Contributor

The above test was wrong as the join was blocking. I added a sleep in between the producers and consumers and it looks ok now, although I did fail the assert once (q.isempty). I'm assuming that was a mistake on my part for now.

TDTMolSupplier* tdtsup = new TDTMolSupplier(
strm, true, opt.nameRecord, opt.confId2D, opt.confId3D, opt.sanitize);
return tdtsup;
}
Contributor

It looks like we are using determineFormat to throw the exception here when we have a format we don't understand. This is causing the compiler to think we have a control path with no return value (which is true).

Perhaps we should have determineFormat return true/false and deal with exceptions in the main body.

Also, I would throw BadFileException as opposed to invalid_argument.

Contributor Author

I modified the code to throw BadFileException instead of std::invalid_argument. However, I don't think it would be better to do the error handling in the main body, since we might not be able to produce helpful error messages. The way determineFormat is written now, it terminates exactly when the file or compression format is invalid. Furthermore, we also know in which case it failed and are therefore able to log the precise reason (at least when the compression format is .zst, .bz2 or .7z).

unsigned int numWriterThreads = 0;
};
//! current supported file formats
std::vector<std::string> supportedFileFormats{"sdf", "mae", "maegz", "smi",
Contributor

These should be const

Contributor Author

Done!

const std::string &record, unsigned int lineNum) {
PRECONDITION(dp_inStream, "no stream");
std::istringstream inStream(record);
auto res = MolDataStreamToMol(inStream, lineNum, df_sanitize, df_removeHs,
Contributor

You could also have used an SDMolSupplier on the stream (or raw text) and also get the properties. This would have reduced some code.

Member

@bp-kelley : I think I looked at that and determined that it wouldn't work because we want to have an accurate lineNum available. I agree that we should refactor this at some point, but I think that approach doesn't work at the moment.

Contributor

Great. We can make a feature request to seed sdmolsupplier with a line number in the future.

Collaborator

@d-b-w d-b-w left a comment

It would be useful to include the scaling of this approach as I increase the number of threads - like, do 4 threads really make reading go 4x faster?

In my opinion, this PR probably would have been easier to read if it had been two PRs - one for the generalized reader, and one for the multithreaded reader.

unsigned int numWriterThreads = 0;
};
//! current supported file formats
std::vector<std::string> supportedFileFormats{"sdf", "mae", "maegz", "smi",
Collaborator

I've totally seen .sdfgz files before (meaning gzipped .sdf). Also - RDKit has a PDBMolSupplier - did you consider including that?

Member

hmmm, I'd never seen .sdfgz, so I didn't suggest that. We can always add that and PDB support later.

Contributor Author

I have added .sdfgz as a special format similar to .maegz, in case it is useful to someone.

return;
}
}
throw std::invalid_argument("Unsupported structure or compression extension");
Collaborator

It would be useful to include at least the extension, and maybe the full file name here (and in the other exception in this function)

Contributor Author

I included the filename in all error messages. When all the above cases fail, no compression format has been computed, so it would be an empty string. However, I agree that it would be useful to log the file name.

throw std::invalid_argument("Unsupported structure or compression extension");
}

//! returns a MolSupplier object based on the file name instantiated
Collaborator

I think people usually say in the RDKit codebase:

//! returns a new MolSupplier object...

where "new" helps readers discern that they own the MolSupplier.

Contributor Author

Done!

- the caller owns the memory and therefore the pointer must be deleted
*/

MolSupplier* getSupplier(const std::string& path,
Collaborator

I would have considered encoding the ownership in the structure of the language. AKA - it might make sense to return either a std::unique_ptr or a shared_ptr. It's possible that you'll want the latter for best interoperability with boost::python.

Member

I agree that using an std::unique_ptr here makes sense.

Contributor Author

Done!
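
For readers unfamiliar with the ownership point raised in this thread, here is a minimal sketch of a factory that returns a `std::unique_ptr`. The types `Supplier`, `SmilesSupplier`, `SDSupplier` and the function `makeSupplier` are hypothetical stand-ins, not the RDKit classes or the final signature of getSupplier.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for a supplier class hierarchy.
struct Supplier {
  virtual ~Supplier() = default;
  virtual std::string name() const = 0;
};
struct SmilesSupplier : Supplier {
  std::string name() const override { return "smiles"; }
};
struct SDSupplier : Supplier {
  std::string name() const override { return "sd"; }
};

// The return type documents that the caller owns the new object; it is
// destroyed automatically when the unique_ptr goes out of scope.
inline std::unique_ptr<Supplier> makeSupplier(const std::string &fileFormat) {
  if (fileFormat == "smi") {
    return std::unique_ptr<Supplier>(new SmilesSupplier());
  }
  if (fileFormat == "sdf") {
    return std::unique_ptr<Supplier>(new SDSupplier());
  }
  throw std::invalid_argument("unsupported format: " + fileFormat);
}
```

A caller that needs shared ownership can still write `std::shared_ptr<Supplier> sp = makeSupplier("sdf");`, since a `std::unique_ptr` rvalue converts to a `std::shared_ptr`.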


//! Dispatch to the appropriate supplier
if (fileFormat == "sdf") {
if(opt.numWriterThreads > 0){
Collaborator

prefer spaces to tabs. Or even better - run the clang-format using the .clang-format file in the repo.

Member

We have a collection of code that hasn't been through clang-format (or which has been run with an older version). I'm going to run the whole codebase through once we do the next release.

std::vector<std::string> supportedCompressionFormats{"gz"};

//! given file path determines the file and compression format
void determineFormat(const std::string path, std::string& fileFormat,
Collaborator

I think it would be more robust to make this return enum class objects. Relying on string comparison leaves you vulnerable to coding errors via typos that would only be caught at runtime. Enum class comparisons would be caught at compile time! Also, getSupplier() could then use a switch statement.

std::pair<FileFormat, CompressionFormat> guessFormat(const std::string &path);

Use like (C++17 only):

auto [format, compression] = guessFormat(path);

Member

Worth considering for a future enhancement.
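
A minimal sketch of what the enum-based variant suggested in this thread could look like. The enumerators, the helper `endsWith`, and the function `supplierNameFor` are assumptions made for illustration; only `guessFormat` and the two enum type names come from the suggestion above, and the sketch returns supplier class names as strings rather than constructing suppliers.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

enum class FileFormat { SDF, SMI, MAE };
enum class CompressionFormat { NONE, GZ };

// Illustrative helper: true if s ends with suffix.
inline bool endsWith(const std::string &s, const std::string &suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

inline std::pair<FileFormat, CompressionFormat> guessFormat(
    const std::string &path) {
  std::string stem = path;
  CompressionFormat compression = CompressionFormat::NONE;
  if (endsWith(stem, ".gz")) {
    compression = CompressionFormat::GZ;
    stem.erase(stem.size() - 3);  // drop ".gz"
  }
  if (endsWith(stem, ".sdf")) return {FileFormat::SDF, compression};
  if (endsWith(stem, ".smi")) return {FileFormat::SMI, compression};
  if (endsWith(stem, ".mae")) return {FileFormat::MAE, compression};
  throw std::invalid_argument("unsupported extension in: " + path);
}

// With an enum class, dispatch can use a switch, and compilers can warn
// when an enumerator is left unhandled.
inline const char *supplierNameFor(FileFormat format) {
  switch (format) {
    case FileFormat::SDF:
      return "SDMolSupplier";
    case FileFormat::SMI:
      return "SmilesMolSupplier";
    case FileFormat::MAE:
      return "MaeMolSupplier";
  }
  return "";  // not reached for valid enumerators
}
```

A misspelled enumerator such as `FileFormat::SFD` fails to compile, whereas a misspelled string literal like "sfd" would only surface at runtime.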

@greglandrum
Member

> In my opinion, this PR probably would have been easier to read if it had been two PRs - one for the generalized reader, and one for the multithreaded reader.

@d-b-w I agree with this, but it's easier to satisfy the documentation requirements for GSoC if there's a single PR. Sorry for the extra hassle in reviewing.

@greglandrum
Member

@shrey183 : I don't think the CI build failure was due to your code. I'm running that test again to be sure.

@greglandrum
Member

@bp-kelley, @d-b-w : We'll need to do some ongoing tweaking and tuning, but I think this is ok to go ahead and merge. What's your take?

greglandrum previously approved these changes Oct 7, 2020
@greglandrum greglandrum dismissed their stale review October 7, 2020 03:50

Just noticed a problem

@greglandrum
Member

of course I approved the PR and then tried a build without threading enabled. I will fix those problems and do a PR against @shrey183's repo.

@greglandrum
Member

@shrey183 : I did a PR against your repo with the relevant fixes. If you could accept/merge that then I think we're close to being able to merge this PR into master.

make this work when building in non-threads mode
@shrey183
Contributor Author

shrey183 commented Oct 7, 2020

> @shrey183 : I did a PR against your repo with the relevant fixes. If you could accept/merge that then I think we're close to being able to merge this PR into master.

@greglandrum Done!

@greglandrum
Member

hmm, the CI failures are looking suspiciously real. I will try to find some time to look at these tonight/tomorrow morning.

@shrey183
Contributor Author

shrey183 commented Oct 8, 2020

@greglandrum I merged the recent PR!

@greglandrum
Member

Here's a quick summary of the recent changes I made pre-merge:

  1. I disabled the function in the Python wrapper that created a MultiThreadedSDMolSupplier from a Python stream. This is clearly useful functionality, but what's there has a heisenbug on Windows and Mac. Also we should rethink the API and ensure that MultiThreadedSmilesMolSuppliers can also be created that way.
  2. I added comments indicating that the multi-threaded code is still experimental and that the API may change. I think the tests here are good, but this is significant new functionality that hasn't seen real-world testing yet and I can imagine that there are bugs lurking (Note: not bugs related to the correctness of the molecules coming back from the suppliers!) and that we may want to tweak the API.

@greglandrum greglandrum merged commit 8ea1ac6 into rdkit:master Oct 9, 2020
@greglandrum greglandrum added this to the 2020_09_1 milestone Oct 9, 2020
@greglandrum
Member

@shrey183 Thanks for all the work you put into this! It's been a fun project. :-)
