The MolSSI QCArchive

Use cases driving QCArchive's design and development

I am a new student who is interested in machine learning, and I would like to obtain the ANI-1 dataset or similar in a format compatible with Keras using a single line of code to get started.

This is a recent use case we heard involving machine learning. Its many variants all came down to "I want large data to do machine learning with for quantum chemistry." This use case illustrated the need for a regularly structured, indexed, searchable database, but also for the ability to return small quantities of data for any query.

I am an academic or industrial partner who wishes to use quantum chemistry methods to generate bespoke parameters for a new drug molecule using a pre-existing database of quantum chemical data, or contribute molecules covering regions of chemical space to be included in general organic small molecule force field parametrization.

The primary non-machine-learning users are those who need access to quantum chemistry data. Having a central repository of these calculations, especially for small molecule fragments, was a popular request. Such a repository avoids repeating calculations for small, derivative changes, but also recognizes when the molecules have changed enough to require new calculations. This is the primary use case for our partners at The Open Force Field Consortium.

I am a researcher who would like to benchmark a new methodology and would like reference datasets and the ability to compare results easily.

The need to benchmark data is a long and sordid tale throughout all scientific research. Quantum chemistry research relies on access to large data sets to compare calculations against. Hopefully all that data is in the same place, and hopefully it's all formatted in the same searchable way. Having a regular, persistent, searchable database of these calculations vastly reduces the time from needing the data to analyzing the data.

I have a set of reactions that look like the following … Which DFT methods historically give the best results for similar reactions?

Sometimes, you just need to compare different methods. But aggregating many labs' worth of research and calculations can be challenging. An index of molecules and calculations helps researchers find these comparisons quickly.

We want to compute an open AI database to accelerate drug discovery and cheminformatics research. Subscribe to our BOINC server.

Big data is still big data, and building AI databases takes less overhead when that data is well formatted. Although the main use case presented is for the computational molecular sciences (CMS) community, it has garnered the interest of the more general AI community as well.

I am running many computations and I would like to query the database to see if they have already been completed.

One large problem in research is finding what has already been done, whether to reference those data in papers, to look them up as a precursor to your own research, or to avoid a duplicate calculation. The QCArchive looks to reduce this problem through its indexed, regular, searchable results.
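As a rough illustration of what "indexed and searchable" buys you here, the sketch below builds a deterministic key for each computation and uses it to split a batch of requests into those already completed and those still missing. It is a minimal, self-contained toy and not QCArchive's actual schema or API: the function names `computation_key` and `split_completed`, the use of SMILES strings as molecule identifiers, and the in-memory `index` dictionary are all assumptions made for the example.

```python
import hashlib
import json


def computation_key(molecule: str, method: str, basis: str) -> str:
    """Build a deterministic key for one computation specification (hypothetical scheme)."""
    spec = {
        "molecule": molecule.strip(),
        "method": method.strip().lower(),  # normalize case so "B3LYP" == "b3lyp"
        "basis": basis.strip().lower(),
    }
    # sort_keys makes the serialization, and hence the hash, independent of key order
    payload = json.dumps(spec, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def split_completed(requests, index):
    """Partition requested computations into (already done, still missing)."""
    done, missing = [], []
    for request in requests:
        key = computation_key(*request)
        (done if key in index else missing).append(request)
    return done, missing


# Toy index standing in for a database of completed results (placeholder values only).
index = {computation_key("O", "B3LYP", "def2-SVP"): {"status": "complete"}}

requests = [("O", "b3lyp", "DEF2-SVP"), ("CO", "b3lyp", "def2-svp")]
done, missing = split_completed(requests, index)
```

Only the entries in `missing` would need to be computed. A real index would need a much more careful notion of molecular identity than a string comparison.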

I am a researcher who needs to compute a very large open dataset (for many-body force fields, ML-based force fields, method comparisons, etc.) and I need help deploying/accessing software to support this.

Science is often gated behind technical expertise and infrastructure limitations. We QCArchive developers have elected to provide this expertise and build a tool everyone can benefit from, and we hope our efforts reduce the lag time from hypothesis to result. We also understand that many industrial and proprietary data groups have need of such software as well. This is why we have chosen to provide the software open source with instructions and examples of how to set up and run it on your own platforms, without exposing your data.

I have a few reactions and would like to see if I need to run new benchmark energies on them all or if some have already been performed.

This is a deceptively simple use case and one we often hear from researchers, especially PhD students. It is deceptive because it sounds like something that should be easy, but it is hard to do in practice. It requires searching the literature and existing databases (assuming they can be found and searched) for previous calculations. It also requires that the researcher set up the calculations in a comparable, fair way, which in turn requires incorporating new molecules and parameters into their code. QCArchive tries to handle all of these problems by automatically checking whether the calculations already exist, and then setting up and running those that do not.

I am a researcher who has just published a large data set, and I neither want my results to get buried in CSV in a Supplemental Material tarball, nor do I want to receive email inquiries, nor do I want to design a web portal for others to access.

QCArchive's database is designed to support exactly this. You can supply your data through our Portal without setting up your own server. Your data's provenance information can also be submitted with the data so that its author and program can be tracked as well. Note that submitting your data to QCArchive's central database will make it public.

I am a PI, and I don’t want to task my graduate students and postdocs with learning databases and distributed queueing systems.

This is a reasonable request, and we understand graduate student time is finite. As a PI, you will want your graduate students focusing on the science behind their work, not database management. QCArchive's user-facing Portal abstracts the database interface through a more user-friendly set of API calls. Users do not need to understand database tables, or even how we have organized the data on the back end. We'll handle that; you focus on the science.

I am a researcher in a multi-university collaboration with a neat idea I want to test quickly. My collaborator and I each have local and university computing resources we can allocate to the project, as well as an NCSA allocation. I want to make sure all my calculations are run the same way and not spend my days submitting and herding input/output files.

QCArchive is designed to handle multiple compute locations, and handles the distribution of compute jobs for the user, with minimal initial setup. Once you configure the compute manager for a given compute site, you can configure any number of managers to talk back to a central QCFractal server. Then all the researchers in your collaboration can submit jobs to the same central QCFractal server and jobs will be distributed to all attached managers.
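The server-and-managers arrangement described above can be pictured with a small in-memory model. This is only a sketch of the general pattern (a central queue from which site-local managers pull work and report results back), not QCFractal's implementation: the classes `CentralQueue` and `Manager`, their methods, and the site names are hypothetical, and a trivial function stands in for a real calculation.

```python
from collections import deque


class CentralQueue:
    """Toy stand-in for a central server that hands out tasks (hypothetical)."""

    def __init__(self):
        self._waiting = deque()
        self.results = {}

    def submit(self, task_id, payload):
        """Any collaborator can add a task to the shared queue."""
        self._waiting.append((task_id, payload))

    def claim(self, max_tasks):
        """Give a manager up to max_tasks waiting tasks, oldest first."""
        claimed = []
        while self._waiting and len(claimed) < max_tasks:
            claimed.append(self._waiting.popleft())
        return claimed

    def report(self, task_id, manager_name, value):
        """Record a finished task together with the manager that ran it."""
        self.results[task_id] = (manager_name, value)


class Manager:
    """Toy compute manager bound to one compute site (hypothetical)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity

    def work(self, server, run):
        """Claim tasks, run them locally, report back, and return the count."""
        tasks = server.claim(self.capacity)
        for task_id, payload in tasks:
            server.report(task_id, self.name, run(payload))
        return len(tasks)


server = CentralQueue()
for i in range(5):
    server.submit(i, i)

managers = [Manager("local-cluster", 2), Manager("university-hpc", 3)]
for manager in managers:
    # Squaring a number stands in for a real quantum chemistry calculation.
    manager.work(server, run=lambda x: x * x)
```

Because every task passes through the same queue and the same `run` function, all collaborators' jobs are executed the same way regardless of which site picks them up.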

I want to enable “O(0)” quantum chemical calculations by having my quantum chemistry code deposit results in a database automatically, querying it before a computation to see if the result may already exist. The resulting datasets may also be useful to those doing machine learning from large QC datasets.

At its heart, this is what QCArchive aims to accomplish. This use case overlaps with other use cases in this list. Please see the use cases on machine learning, database management, and duplicate calculations above.
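The "query before computing, deposit after computing" idea can be sketched with nothing more than the standard library. The example below is an illustration under stated assumptions, not how QCArchive stores data: the `ResultStore` class, its `get_or_compute` method, the table layout, and the `fake_compute` placeholder are all hypothetical, and an in-memory SQLite database stands in for a shared server.

```python
import json
import sqlite3


class ResultStore:
    """Toy result database keyed by the full computation specification (hypothetical)."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results (spec TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def _key(spec):
        # Sorted keys give the same string for the same specification.
        return json.dumps(spec, sort_keys=True)

    def get_or_compute(self, spec, compute):
        """Return (value, was_cached); run compute(spec) only on a miss."""
        key = self._key(spec)
        row = self._db.execute(
            "SELECT value FROM results WHERE spec = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0]), True
        value = compute(spec)
        # Deposit the new result so later callers can reuse it.
        self._db.execute(
            "INSERT INTO results (spec, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._db.commit()
        return value, False


calls = []


def fake_compute(spec):
    """Placeholder for an expensive calculation; records that it ran."""
    calls.append(spec)
    return {"n_atoms": len(spec["symbols"])}


store = ResultStore()
spec = {"symbols": ["O", "H", "H"], "method": "hf", "basis": "sto-3g"}
first, cached_first = store.get_or_compute(spec, fake_compute)
second, cached_second = store.get_or_compute(spec, fake_compute)
```

The second call returns the stored value without invoking `fake_compute` again, which is the "O(0)" behavior the use case asks for.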

I want to publish a static workflow which accomplishes a single task and allow users and developers to run this static workflow. An example of this would be to compute IR spectra with peak broadening which would involve many quantum chemistry computations.

QCArchive is a dynamic collection of methods and calculations, not a fixed set of immutable known calculations. We encourage users to develop their own procedures and interfaces to other quantum chemistry codes. We also encourage users to contribute these changes back to the QCArchive through GitHub Pull Requests to bring those into the code for everyone else to use.
