[GSoC-2020] Generalized and Multithreaded File Reader #3363
Merge branch 'master' of https://github.com/rdkit/rdkit
This PR deals with the Generalized File Reader.
//! concurrent queue is full so we wait until
//! it is not full
while (d_head + d_capacity == d_tail) {
  d_notFull.wait(lk);
You have a deadlock here. If we push an item, we lock the queue and THEN wait for the queue to be emptied, which can't happen since we hold the lock.
A small change to testProducerConsumer shows this where we call thread::join before launching the consumers.
n.b. this test is in error, the join is blocking
//! start producer threads
for (int i = 0; i < numProducerThreads; i++) {
producers[i] = std::thread(produce, std::ref(q), numToProduce);
}
std::for_each(producers.begin(), producers.end(),
std::mem_fn(&std::thread::join));
//! start consumer threads
for (int i = 0; i < numConsumerThreads; i++) {
consumers[i] = std::thread(consume, std::ref(q), std::ref(results[i]));
}
//! the producer is done producing
q.setDone();
std::for_each(consumers.begin(), consumers.end(),
std::mem_fn(&std::thread::join));
TEST_ASSERT(q.isEmpty());
@bp-kelley, assuming I understand your comment correctly:
The use of the condition_variable d_notFull ensures that we don't get a deadlock here. We no longer hold the lock while d_notFull.wait() is executing.
The above test was wrong as the join was blocking. I added a sleep between the producers and consumers and it looks ok now, although I did fail the assert once (q.isEmpty). I'm assuming that was a mistake on my part for now.
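To make the point about the condition variable concrete, here is a minimal sketch of a fixed-capacity blocking queue with one lock and two condition variables. It is not the PR's ConcurrentQueue: the class name BoundedQueue, the member d_notEmpty, and the helper runProducerConsumer are assumptions made for illustration; only d_head, d_tail, d_capacity and d_notFull come from the snippet under review. The key fact is that std::condition_variable::wait releases the lock while the thread is blocked, so a full queue does not prevent consumers from popping.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical fixed-capacity blocking queue; member names mirror the
// reviewed snippet, but this is an illustration, not the PR's class.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity)
      : d_elements(capacity), d_capacity(capacity) {
    assert(capacity > 0);
  }

  void push(const T &value) {
    std::unique_lock<std::mutex> lk(d_lock);
    // wait() atomically releases d_lock while blocked and re-acquires it
    // before returning, so consumers can still pop and make room.
    while (d_head + d_capacity == d_tail) {
      d_notFull.wait(lk);
    }
    d_elements[d_tail % d_capacity] = value;
    ++d_tail;
    d_notEmpty.notify_one();
  }

  // Returns false once the queue is drained and setDone() was called.
  bool pop(T &value) {
    std::unique_lock<std::mutex> lk(d_lock);
    while (d_head == d_tail) {
      if (d_done) return false;
      d_notEmpty.wait(lk);
    }
    value = d_elements[d_head % d_capacity];
    ++d_head;
    d_notFull.notify_one();
    return true;
  }

  void setDone() {
    std::lock_guard<std::mutex> lk(d_lock);
    d_done = true;
    d_notEmpty.notify_all();
  }

 private:
  std::vector<T> d_elements;
  std::size_t d_capacity;
  std::size_t d_head = 0, d_tail = 0;
  bool d_done = false;
  std::mutex d_lock;
  std::condition_variable d_notFull, d_notEmpty;
};

// Producers are launched first and may block in push() once the queue is
// full; the consumer, launched afterwards, drains it. The producers are
// joined only after the consumer is running. Returns the items consumed.
int runProducerConsumer(int numProducers, int itemsPerProducer) {
  BoundedQueue<int> q(2);
  std::vector<std::thread> producers;
  for (int i = 0; i < numProducers; ++i) {
    producers.emplace_back([&q, itemsPerProducer] {
      for (int j = 0; j < itemsPerProducer; ++j) q.push(j);
    });
  }
  int consumed = 0;
  std::thread consumer([&q, &consumed] {
    int v;
    while (q.pop(v)) ++consumed;
  });
  for (auto &p : producers) p.join();
  q.setDone();
  consumer.join();
  return consumed;
}
```

Calling runProducerConsumer(3, 50) should return 150. Joining the producers before starting any consumer, as in the modified test above, would block forever with a capacity of 2, which is a property of the test ordering rather than a deadlock inside push().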
TDTMolSupplier* tdtsup = new TDTMolSupplier(
    strm, true, opt.nameRecord, opt.confId2D, opt.confId3D, opt.sanitize);
return tdtsup;
}
It looks like we are using determineFormat to throw the exception here when we have a format we don't understand. This is causing the compiler to think we have a control path with no return value (which is true).
Perhaps we should have determineFormat return true/false and deal with exceptions in the main body.
Also I would throw BadFileException as opposed to invalid_argument
I modified the code to throw BadFileException instead of std::invalid_argument. However, I don't think it would be better to do error handling in the main body, since we might not be able to get helpful error messages. The way determineFormat is written now, it terminates exactly when the file or compression format is invalid. Furthermore, we also know in which case it failed and are therefore able to log the precise reason (at least in the case when the compression formats are .zst, .bz2 and .7z).
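For comparison, the reviewer's alternative could look roughly like the sketch below: a helper that reports failure through its return value, and a caller that throws, so every control path in the caller visibly returns or throws. Everything here is an assumption made for illustration: tryDetermineFormat, describeSupplier and BadFileError are hypothetical names (BadFileError stands in for RDKit's BadFileException so the example compiles on its own), and only a handful of extensions are handled.

```cpp
#include <initializer_list>
#include <stdexcept>
#include <string>

// Stand-in for RDKit's BadFileException (assumption, for self-containment).
class BadFileError : public std::runtime_error {
 public:
  using std::runtime_error::runtime_error;
};

static bool endsWith(const std::string &s, const std::string &suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: returns false instead of throwing when the
// extension is not recognised. Only a few formats are handled here.
bool tryDetermineFormat(const std::string &path, std::string &fileFormat,
                        std::string &compressionFormat) {
  fileFormat.clear();
  compressionFormat.clear();
  std::string rest = path;
  if (endsWith(rest, ".gz")) {
    compressionFormat = "gz";
    rest.resize(rest.size() - 3);
  }
  for (const char *fmt : {"sdf", "smi", "mae"}) {
    if (endsWith(rest, std::string(".") + fmt)) {
      fileFormat = fmt;
      return true;
    }
  }
  return false;
}

// The caller owns the error handling: it either returns a value or throws,
// and the message includes the offending file name.
std::string describeSupplier(const std::string &path) {
  std::string fileFormat, compressionFormat;
  if (!tryDetermineFormat(path, fileFormat, compressionFormat)) {
    throw BadFileError("Unsupported structure or compression extension: " +
                       path);
  }
  return compressionFormat.empty() ? fileFormat
                                   : fileFormat + "+" + compressionFormat;
}
```

The trade-off raised in the reply still applies: a plain bool loses the information about which step failed, so a richer return type would be needed to keep the precise messages for unsupported compression formats.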
unsigned int numWriterThreads = 0;
};
//! current supported file formats
std::vector<std::string> supportedFileFormats{"sdf", "mae", "maegz", "smi",
These should be const
Done!
const std::string &record, unsigned int lineNum) {
PRECONDITION(dp_inStream, "no stream");
std::istringstream inStream(record);
auto res = MolDataStreamToMol(inStream, lineNum, df_sanitize, df_removeHs,
You could also have used an SDMolSupplier on the stream (or raw text) and also get the properties. This would have reduced some code.
@bp-kelley : I think I looked at that and determined that it wouldn't work because we want to have an accurate lineNum available. I agree that we should refactor this at some point, but I think that approach doesn't work at the moment.
Great. We can make a feature request to seed SDMolSupplier with a line number in the future.
It would be useful to include the scaling of this approach as I increase the number of threads - like, do 4 threads really make reading go 4x faster?
In my opinion, this PR probably would have been easier to read if it had been two PRs - one for the generalized reader, and one for the multithreaded reader.
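On the scaling question, one way to get numbers is a small harness that runs the same workload with different thread counts and compares wall-clock times. The sketch below is only an assumption of how such a measurement could be set up: processRecord is a placeholder for real per-record parsing, and timeWithThreads and Timing are hypothetical names that do not appear in the PR. Workers pull record indices from a shared atomic counter, and a checksum confirms that every thread count does the same work.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Placeholder for per-record parsing cost; a real benchmark would build
// molecules from the records instead of hashing them.
std::size_t processRecord(const std::string &record) {
  std::size_t h = 0;
  for (int round = 0; round < 200; ++round) {
    for (unsigned char c : record) h = h * 131 + c;
  }
  return h;
}

struct Timing {
  double seconds;        // wall-clock time for the whole run
  std::size_t checksum;  // sum of per-record results, independent of order
};

// Processes all records with numThreads workers that pull indices from a
// shared atomic counter, and reports elapsed time plus a checksum.
Timing timeWithThreads(const std::vector<std::string> &records,
                       unsigned int numThreads) {
  std::atomic<std::size_t> next{0};
  std::atomic<std::size_t> checksum{0};
  const auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> workers;
  for (unsigned int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&] {
      std::size_t local = 0;
      for (std::size_t i = next.fetch_add(1); i < records.size();
           i = next.fetch_add(1)) {
        local += processRecord(records[i]);
      }
      checksum.fetch_add(local);
    });
  }
  for (auto &w : workers) w.join();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return {elapsed.count(), checksum.load()};
}
```

The speedup for n threads is then timeWithThreads(records, 1).seconds divided by timeWithThreads(records, n).seconds. In a real reader the single file-reading thread and queue contention limit how close this gets to n, so the measured ratio is what matters, not the thread count.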
unsigned int numWriterThreads = 0;
};
//! current supported file formats
std::vector<std::string> supportedFileFormats{"sdf", "mae", "maegz", "smi",
I've totally seen .sdfgz files before (meaning gzipped .sdf). Also - RDKit has a PDBMolSupplier - did you consider including that?
hmmm, I'd never seen .sdfgz, so I didn't suggest that. We can always add that and PDB support later.
I have added .sdfgz as a special format similar to .maegz, in case it is useful to someone.
return;
}
}
throw std::invalid_argument("Unsupported structure or compression extension");
It would be useful to include at least the extension, and maybe the full file name here (and in the other exception in this function)
I included the filename for all error messages. When all the above cases fail then no compression format would have been computed. So it would be an empty string, however, I agree that it would be useful to log out the file name.
throw std::invalid_argument("Unsupported structure or compression extension");
}

//! returns a MolSupplier object based on the file name instantiated
I think people usually say in the RDKit codebase:
//! returns a new MolSupplier object...
where "new" helps readers discern that they own the MolSupplier.
Done!
- the caller owns the memory and therefore the pointer must be deleted
*/

MolSupplier* getSupplier(const std::string& path,
I would have considered encoding the ownership in the structure of the language. AKA - it might make sense to return either a std::unique_ptr or a shared_ptr. It's possible that you'll want the latter for best interoperability with boost::python.
I agree that using an std::unique_ptr here makes sense.
Done!
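As an illustration of what encoding ownership in the return type buys, here is a minimal factory sketch that returns std::unique_ptr. The types Supplier, SmilesSupplier and SDSupplier and the function makeSupplier are hypothetical stand-ins so the example compiles without RDKit; the real getSupplier signature in the PR may differ.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the supplier hierarchy (not RDKit classes).
struct Supplier {
  virtual ~Supplier() = default;
  virtual std::string kind() const = 0;
};

struct SmilesSupplier : Supplier {
  std::string kind() const override { return "smi"; }
};

struct SDSupplier : Supplier {
  std::string kind() const override { return "sdf"; }
};

// The return type states that the caller owns the object; no comment about
// deleting the pointer is needed, and the object cannot leak by accident.
std::unique_ptr<Supplier> makeSupplier(const std::string &fileFormat) {
  if (fileFormat == "smi") return std::make_unique<SmilesSupplier>();
  if (fileFormat == "sdf") return std::make_unique<SDSupplier>();
  throw std::invalid_argument("Unsupported file format: " + fileFormat);
}
```

A caller that needs shared ownership, for example for a Python wrapper, can still write std::shared_ptr<Supplier> s = makeSupplier("sdf"); because a std::unique_ptr rvalue converts to std::shared_ptr, whereas the reverse conversion is not possible.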
//! Dispatch to the appropriate supplier
if (fileFormat == "sdf") {
if(opt.numWriterThreads > 0){
prefer spaces to tabs. Or even better - run the clang-format using the .clang-format file in the repo.
We have a collection of code that hasn't been through clang-format (or which has been run with an older version). I'm going to run the whole codebase through once we do the next release.
std::vector<std::string> supportedCompressionFormats{"gz"};

//! given file path determines the file and compression format
void determineFormat(const std::string path, std::string& fileFormat,
I think it would be more robust to make this return enum class objects. Relying on string comparison leaves you vulnerable to coding errors via typos that would only be caught at runtime. Enum class comparisons would be caught at compile time! Also, getSupplier() could then use a switch statement.
std::pair<FileFormat, CompressionFormat> guessFormat(const std::string& path);
Use like (C++17 only):
auto [format, compression] = guessFormat(path);
Worth considering for a future enhancement.
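A sketch of what that enhancement might look like is given below. The enumerator lists, the extension handling and the helper supplierName are assumptions made for illustration; only the guessFormat signature and the FileFormat and CompressionFormat type names come from the suggestion above.

```cpp
#include <string>
#include <utility>

// Assumed enumerators; the formats actually supported by the PR may differ.
enum class FileFormat { SDF, SMI, TDT, Unknown };
enum class CompressionFormat { None, GZ };

std::pair<FileFormat, CompressionFormat> guessFormat(const std::string &path) {
  // Returns the text after the last '.', or an empty string if there is none.
  auto extensionOf = [](const std::string &s) {
    const auto dot = s.rfind('.');
    return dot == std::string::npos ? std::string() : s.substr(dot + 1);
  };
  std::string rest = path;
  CompressionFormat compression = CompressionFormat::None;
  if (extensionOf(rest) == "gz") {
    compression = CompressionFormat::GZ;
    rest.erase(rest.rfind('.'));  // strip ".gz" before reading the format
  }
  const std::string ext = extensionOf(rest);
  FileFormat format = FileFormat::Unknown;
  if (ext == "sdf") {
    format = FileFormat::SDF;
  } else if (ext == "smi") {
    format = FileFormat::SMI;
  } else if (ext == "tdt") {
    format = FileFormat::TDT;
  }
  return {format, compression};
}

// Dispatch with a switch: a misspelled enumerator is a compile error, and
// many compilers warn when a case is missing.
const char *supplierName(FileFormat format) {
  switch (format) {
    case FileFormat::SDF:
      return "SDMolSupplier";
    case FileFormat::SMI:
      return "SmilesMolSupplier";
    case FileFormat::TDT:
      return "TDTMolSupplier";
    case FileFormat::Unknown:
      break;
  }
  return "unsupported";
}
```

String comparison is confined to guessFormat, so a typo such as "sfd" can only occur in one place, and the rest of the dispatch code works with values the compiler can check.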
@d-b-w I agree with this, but it's easier to satisfy the documentation requirements for GSoC if there's a single PR. Sorry for the extra hassle to review
modifications to the general file reader
@shrey183 : I don't think the CI build failure was due to your code. I'm running that test again to be sure.
@bp-kelley, @d-b-w : We'll need to do some ongoing tweaking and tuning, but I think this is ok to go ahead and merge. What's your take?
of course I approved the PR and then tried a build without threading enabled. I will fix those problems and do a PR against @shrey183's repo.
@shrey183 : I did a PR against your repo with the relevant fixes. If you could accept/merge that then I think we're close to being able to merge this PR into master.
make this work when building in non-threads mode
@greglandrum Done!
hmm, the CI failures are looking suspiciously real. I will try to find some time to look at these tonight/tomorrow morning.
disable construction from streams in python and mark interface as experimental
@greglandrum I merged the recent PR!
Here's a quick summary of the recent changes I made pre-merge:
@shrey183 Thanks for all the work you put into this! It's been a fun project. :-)
[GSoC-2020] General File Reader and Multithreaded Mol Supplier
Overview
The General File Reader, as the name suggests, provides the user with the appropriate MolSupplier object to parse a file of a given format. Previously, to parse a file of SMILES, say input.smi, one had to explicitly construct a SmilesMolSupplier object. With the implementation provided in the GeneralFileReader, one can instead pass the file name along with supplier options to obtain the appropriate MolSupplier object determined by the file format. Furthermore, the General File Reader also provides an interface to the MultithreadedMolSupplier objects for Smiles and SDF file formats. Besides the implementation, test cases are also included to demonstrate the utility of the General File Reader.
The Multithreaded Mol Supplier provides a concurrent implementation of the usual base class MolSupplier. Due to time constraints, multithreaded versions of only the Smiles and SD Mol Suppliers were implemented. The motivation for this part stemmed from parsing large Smiles or SDF files. With the current implementation the user can, for instance, construct a MultithreadedSmilesMolSupplier object to parse a Smiles file with a large number of records. Besides the implementation, test cases are also included to demonstrate the correctness and performance of the MultithreadedMolSupplier. Here is a brief summary of the performance result obtained by running the function testPerformance on @greglandrum's machine:
Implementation
Implementation of the General File Reader is quite concise and makes use of only two methods, determineFormat and getSupplier. The former determines the file and compression formats given a pathname, while the latter returns a pointer to a MolSupplier object given a pathname and SupplierOptions.
Regarding the implementation of the MultithreadedMolSupplier, the first step was to implement a thread-safe blocking queue of fixed capacity. This would allow us to extract and process records from the file concurrently. The concurrent queue was implemented with a single lock and two condition variables to signal whether the queue was empty or full. Test cases checking the correctness of the ConcurrentQueue are also included in the project.
The next step required the implementation of the base class MultithreadedMolSupplier, which would manage the input and output queues. The input queue would be populated by the method extractNextRecord, which would read a record from a given file/stream, whereas the output queue would be populated by the method processMoleculeRecord, which would first pop a record from the input queue and then process it into an object of type ROMol *. The reader thread would thus call extractNextRecord until no record can be read, while the writer thread(s) would call the method processMoleculeRecord until the output queue is done and empty. The child classes MultithreadedSmilesMolSupplier and MultithreadedSDMolSupplier primarily provide implementations of the methods extractNextRecord and processMoleculeRecord. Both suppliers were tested on various files with different parameter values for input queue size, output queue size, and the number of writer threads.
Further Work
Due to time constraints and the difficulty involved in debugging concurrent code, there were a few things that could not be completed.
Changes made for the General File Reader and Multithreaded Mol Supplier:
List of important files added: