What drives the QCArchive's development and features?
The QCArchive project started from the needs and wants of the quantum chemistry community and the fields which benefit from its data. MolSSI reached out to the wider community and asked question such as: What is getting in the way of you focusing on your scientific efforts?
What would help the field as a whole?
What do you think is missing?
It was from these interviews and research that the the concept of the QCArchive arose. From there, the needs were turned into the primary use cases below that QCArchive would then be designed to solve.
This was a modern case we heard involving machine learning. The many derivatives of this case all came down to "I want large data to do machine learning with for quantum chemistry." This use case illustrated the need for a regular structured, indexed, searchable database, but also to have small quantities of data returned for any query.
The primary non-machine learning application are those who need access to Q. Chem. data. Having a central repository of these calculations, especially for small molecule fragments, was a popular request. This avoids repeating calculations for small, derivative changes, but also recognizes when the molecules are changed enough to run new calculations. This is the primary use case for our partners at The Open Force Field Consortium.
The need to benchmark data is a long a sordid tale throughout all scientific research. Quantum chemistry research relies on access to large data sets to compare your calculations against. Hopefully all that data is in the same place, and hopefully its all formatted the same searchable way. Having a regular, persistent, searchable database of these calculations vastly reduces the time from needing the data to analyzing the data.
Sometimes, you just need to compare different methods. But aggregating many labs worth of research and calculations can be challenging. An index of molecules and calculations helps researches find these comparisons quickly.
Big data is still big data, and training AI databases takes less overhead when the Big Data is well formatted. Although the main use case presented is for the CMS community, this use case has garnered the interest of more general AI community as well.
One large problem in research is finding what has already been done. Whether to reference those data in papers, look them up as a precursors to your own research, or to avoid a duplicate calculation. The QCArchive looks to reduce this problem through its indexed, regular, searchable results.
Science is often gated behind technical expertise and infrastructure limitations. We QCArchive developers have elected to provide this expertise and build a tool everyone can benefit from, and we hope our efforts reduce the lag time from hypothesis to result. We also understand that many industrial and proprietary data groups have need of such software as well. This is why we have chosen to provide the software open source with instructions and examples of how to set up and run it on your own platforms, without exposing your data.
This is a deceptively simple use case and one we often hear from researchers, especially PhD students. This is a deceptive use case because it sounds like it something which should be easy, but is hard to do because it requires not only searching the literature and existing data bases (assuming they can be searched or found) for previous calculations. It also requires that the researcher set up the calculations in a comparable, fair way; which in turn requires incorporating new molecules and parameters into their code. QCArchive tries to handle all of these problems by automatically checking for the calculations' existence, and then setting up and running the calculations.
QCArchive's database is designed to support exactly this. You can supply your data through our Portal without setting up your own server. Your data's provenance information can also be submitted with data so its author and program can be tracked as well. Know that submitting your data to QCArchive's central data base will make it public.
This is a reasonable request, and we understand graduate student time is finite. As as PI, you will want your graduate students focusing on the science behind their work, not database management. QCArchive's user facing Portal abstracts the database interface through a more user-friendly set of API calls. Users do not need to understand database tables, or even how we have organized the data on the back end. We'll handle that, you focus on the science.
QCArchive is designed to handle multiple compute locations, and handles the distribution of compute jobs for the user, with minimal initial setup. Once you configure the compute manager for a given compute site, you can configure any number of managers to talk back to a central QCFractal server. Then all the researchers in your collaboration can submit jobs to the same central QCFractal server and jobs will be distributed to all attached managers.
At its heart, this is what QCArchive aims to accomplish. This use case shares overlap with other use cases in this list. Please see the Uses Cases on Machine Learning, Database Management, and Duplicate Calculations above.
QCArchive is a dynamic collection of methods and calculations, not a fixed set of immutable known calculations. We encourage users to develop their own procedures and interfaces to other quantum chemistry codes. We also encourage users to contribute these changes back to the QCArchive through GitHub Pull Requests to bring those into the code for everyone else to use.