How to Predict Solubility: Rules, Models, and ML

Predicting solubility comes down to estimating how much energy it takes to pull a substance apart and how much energy the solvent gains by interacting with it. If the solvent-solute interactions release enough energy to compensate for breaking the solid apart, the substance dissolves. Every prediction method, from a quick rule of thumb to a neural network, is ultimately trying to quantify that energy balance. The approaches range widely in complexity and accuracy, and the best choice depends on whether you need a rough estimate in seconds or a precise value for a regulatory filing.

The Core Thermodynamic Principle

Solubility is governed by the change in Gibbs free energy when a substance moves from its pure form into solution. That free energy change has two competing parts: the enthalpy change (heat absorbed or released) and the entropy change (the increase in disorder). When the overall free energy change is negative, dissolution is favorable and the substance is soluble.

In practice, predicting solubility from first principles means calculating two separate energy steps. First, you need the energy required to break the solid crystal apart (sublimation free energy). Second, you need the energy released when those freed molecules interact with the solvent (hydration free energy, if the solvent is water). Physics-based approaches like free energy perturbation compute both of these values using molecular simulations, effectively running a virtual experiment at the atomic level. These calculations can be highly accurate but require significant computational resources and detailed knowledge of the crystal structure.
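The two-step bookkeeping described above can be sketched in a few lines. The free-energy values here are made up for illustration, and mapping the dissolution free energy directly to a log10 equilibrium constant is a simplification (it ignores activity corrections), but the sign logic matches the text: a crystal-breaking cost fights against a solvation gain.

```python
import math

R = 8.314  # gas constant, J/(mol·K)

def log10_solubility(dG_sub, dG_hyd, T=298.15):
    """Combine sublimation and hydration free energies (J/mol) into the
    overall dissolution free energy, then convert it to a log10 equilibrium
    constant. Positive dG_sub (cost of breaking the crystal apart) competes
    with negative dG_hyd (energy gained by solvating the freed molecules)."""
    dG_dissolution = dG_sub + dG_hyd
    return -dG_dissolution / (math.log(10) * R * T)

# Illustrative numbers only: a 50 kJ/mol crystal cost mostly offset by a
# -45 kJ/mol hydration gain leaves dissolution slightly unfavorable,
# so the predicted log solubility comes out negative.
print(log10_solubility(50e3, -45e3))
```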

The “Like Dissolves Like” Rule, Quantified

The simplest prediction tool is the principle that substances dissolve best in solvents with similar intermolecular forces. Hansen solubility parameters turn this intuition into numbers. Every substance gets three values representing the strength of its dispersion forces (weak attractions between all molecules), polar forces (attractions between partially charged regions), and hydrogen bonding capacity. Each parameter is the square root of a cohesive energy density: the energy of that interaction type per unit of molar volume.
(Strictly, each parameter is the square root of an energy density: the cohesive energy of that interaction type divided by the molar volume, then square-rooted.)

To predict whether a substance will dissolve in a given solvent, you calculate the “distance” between their three parameters in a kind of 3D space. A small distance means good compatibility. Experimentally determining these parameters for a new substance is tedious, requiring extensive measurements to map out what’s called a solubility sphere. But published tables cover thousands of common solvents and polymers, making this a practical first-pass method for formulation work, coatings, and polymer science.
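The distance calculation is simple enough to write out directly. One detail the verbal description glosses over: by convention the dispersion difference is weighted by a factor of 4, which makes the experimentally fitted solubility region roughly spherical. The solvent parameter triples below are typical published-table values; the polymer triple and its sphere radius are invented for illustration.

```python
import math

def hansen_distance(a, b):
    """Hansen distance Ra between two (dispersion, polar, hydrogen-bonding)
    parameter triples, in MPa^0.5. The dispersion term is conventionally
    given a weight of 4."""
    dd, dp, dh = (a[i] - b[i] for i in range(3))
    return math.sqrt(4 * dd**2 + dp**2 + dh**2)

def relative_energy_difference(solute, solvent, R0):
    """RED = Ra / R0. Below 1 predicts dissolution; above 1 predicts not.
    R0 is the experimentally fitted radius of the solute's solubility sphere."""
    return hansen_distance(solute, solvent) / R0

# Hypothetical polymer vs. two common solvents (solvent values are typical
# table entries; the polymer and R0 are illustrative).
polymer = (18.0, 10.0, 5.0)
toluene = (18.0, 1.4, 2.0)   # nonpolar aromatic: poor match for a polar polymer
acetone = (15.5, 10.4, 7.0)  # polar, hydrogen-bond accepting: better match
print(relative_energy_difference(polymer, toluene, R0=8.0))
print(relative_energy_difference(polymer, acetone, R0=8.0))
```

With these numbers, acetone lands inside the hypothetical polymer's solubility sphere (RED below 1) while toluene falls outside it, matching the "like dissolves like" intuition.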

Group Contribution Methods

Rather than measuring properties of a whole molecule, group contribution methods predict behavior by adding up the known contributions of each functional group in the structure. The most widely used is the UNIFAC model, which estimates how much a dissolved molecule deviates from ideal behavior (its “activity coefficient”) by breaking it into fragments like hydroxyl groups, methyl groups, and aromatic rings.

UNIFAC splits the activity coefficient into two parts. The combinatorial part accounts for differences in molecular size and shape between solute and solvent. The residual part captures the energy of interactions between different functional groups, using published tables of interaction parameters measured between thousands of group pairs. You look up the groups present in your molecule, sum their contributions using the model’s equations, and get an activity coefficient that tells you how the real solubility deviates from the ideal case. The interaction parameters between groups depend on temperature, which means the model naturally captures how solubility shifts with heating or cooling. This approach works well for organic mixtures and is built into many commercial process simulation tools.
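The first bookkeeping step of UNIFAC, summing group constants to get a molecule's size and surface-area parameters, is easy to sketch. The R and Q constants below are as I recall them from the published UNIFAC tables; the full model would then feed r and q into the combinatorial equation and the group-interaction tables into the residual part, neither of which appears here.

```python
# Volume (R) and surface-area (Q) constants for a few UNIFAC groups,
# quoted from the published parameter tables.
GROUPS = {
    "CH3": (0.9011, 0.848),
    "CH2": (0.6744, 0.540),
    "OH":  (1.0000, 1.200),
}

def molecule_r_q(group_counts):
    """Sum group contributions to get the molecule's volume parameter r and
    surface-area parameter q, the size/shape inputs to UNIFAC's
    combinatorial term."""
    r = sum(n * GROUPS[g][0] for g, n in group_counts.items())
    q = sum(n * GROUPS[g][1] for g, n in group_counts.items())
    return r, q

# Ethanol decomposes into one CH3, one CH2, and one OH group.
r, q = molecule_r_q({"CH3": 1, "CH2": 1, "OH": 1})
print(r, q)  # r ≈ 2.5755, q ≈ 2.588
```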

How Temperature Changes Solubility

For most solid substances in liquid solvents, solubility increases with temperature. The relationship follows the van’t Hoff equation, which says that plotting the natural log of solubility against the inverse of temperature (in Kelvin) should give roughly a straight line. The slope of that line is proportional to the heat of dissolution.

If dissolving the substance absorbs heat (endothermic), the slope is negative on a 1/T plot, meaning solubility rises as you heat the solution. If dissolution releases heat (exothermic), solubility decreases with warming. This is why sugar dissolves more easily in hot water, while gases like oxygen become less soluble as water warms. The van’t Hoff relationship is an approximation that works well over moderate temperature ranges. Over wider ranges, the heat capacity difference between the dissolved and undissolved states introduces curvature, and an additional correction term (commonly quadratic or logarithmic in temperature) is needed to maintain accuracy.
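The van’t Hoff analysis amounts to a straight-line fit of ln(solubility) against 1/T, with the dissolution enthalpy read off the slope. A minimal stdlib-only sketch, using synthetic data generated from an assumed 20 kJ/mol enthalpy so the fit should recover roughly that value:

```python
import math

R = 8.314  # gas constant, J/(mol·K)

def vant_hoff_enthalpy(temps_K, solubilities):
    """Least-squares slope of ln(solubility) vs 1/T; the dissolution
    enthalpy is -slope * R. A positive result means endothermic
    dissolution, i.e. solubility that rises with temperature."""
    xs = [1.0 / T for T in temps_K]
    ys = [math.log(s) for s in solubilities]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope * R  # J/mol

# Synthetic measurements consistent with a 20 kJ/mol dissolution enthalpy.
dH_true = 20e3
temps = [283.15, 298.15, 313.15, 328.15]
sols = [math.exp(-dH_true / (R * T) + 10.0) for T in temps]
print(vant_hoff_enthalpy(temps, sols))  # ≈ 20000
```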

Molecular Descriptors for Quick Estimation

When you need to estimate aqueous solubility for a large number of compounds quickly, molecular descriptors offer a practical shortcut. These are numerical properties calculated directly from a molecule’s structure: things like the number of oxygen and nitrogen atoms (which form hydrogen bonds with water), the number of halogen atoms (fluorine, chlorine, bromine, which tend to reduce water solubility), the count of flexible bonds, and the ratio of polar to nonpolar surface area.

Studies comparing descriptor importance consistently find that features related to hydrogen bond donors, the balance of polar and nonpolar regions, and the count of specific atom types are the strongest predictors. One large comparison study generated over 1,600 two-dimensional descriptors from molecular structures, then narrowed them down to 177 that carried meaningful predictive signal. Among the most influential were counts of hydrogen bond donors, the number of hydroxyl groups, and descriptors capturing the distribution of polar surface area across the molecule. You don’t need to calculate all of these by hand. Free cheminformatics packages generate them automatically from a molecule’s structural notation.
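A classic example of a descriptor-based estimator is Delaney’s ESOL equation, a linear fit mapping four easily computed 2D descriptors to log10 aqueous solubility. The coefficients below follow the published ESOL fit as I recall it; in practice the descriptor values would come from a cheminformatics package rather than being typed in by hand, and the input numbers here are illustrative.

```python
def esol_log_solubility(clogp, mol_weight, rotatable_bonds, aromatic_proportion):
    """Delaney's ESOL linear model: estimated log10 molar aqueous solubility
    from four 2D descriptors. aromatic_proportion is the fraction of heavy
    atoms that are aromatic; higher logP, weight, and aromaticity all push
    solubility down, while flexibility pushes it slightly up."""
    return (0.16
            - 0.63 * clogp
            - 0.0062 * mol_weight
            + 0.066 * rotatable_bonds
            - 0.74 * aromatic_proportion)

# Illustrative descriptor values, not computed from a real structure.
print(esol_log_solubility(clogp=1.3, mol_weight=180.2,
                          rotatable_bonds=3, aromatic_proportion=0.46))
```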

Computational Solvent Models

COSMO-RS sits between simple descriptor methods and full molecular simulation. It uses quantum chemistry to calculate how electric charge is distributed across a molecule’s surface, then applies statistical thermodynamics to predict how that molecule interacts with a solvent. The key advantage is that it doesn’t need experimental data for the specific solute-solvent pair. It only needs the molecular structures.

For predicting water solubility in hydrocarbons, COSMO-RS achieves deviations within about 30% of experimental values while correctly capturing trends across homologous series (for example, how solubility changes as you lengthen a carbon chain). That level of accuracy is useful for screening and ranking candidates but often isn’t precise enough for final formulation decisions, where experimental confirmation remains necessary.

Machine Learning Approaches

The newest prediction tools use neural networks trained on large experimental solubility databases. The largest curated dataset, AqSolDB, has been combined with other sources to create training sets of nearly 18,000 measured solubility values. Models learn the relationship between molecular structure and solubility without being told the underlying physics.

Graph neural networks, which represent molecules as networks of connected atoms rather than lists of descriptors, have shown particularly strong results. A model called SolPredictor, built on a residual-gated graph neural network architecture, achieved a correlation coefficient of 0.79 and a root mean square error of 1.03 log units on cross-validation. To put that error in context, a 1 log unit error means the prediction could be off by a factor of 10 in either direction. That sounds large, but for early-stage drug discovery where you’re screening thousands of candidates, distinguishing “very soluble” from “practically insoluble” is often sufficient.

The performance varies by chemical space. On some external test sets, the same models achieved errors below 0.6 log units, while on others the error exceeded 1.1 log units. Models tend to perform best on molecules structurally similar to their training data and struggle with unusual scaffolds or functional groups that are underrepresented in existing databases. An older architecture called SolTranNet, trained only on AqSolDB, showed weaker cross-validation performance with an RMSE of 1.46 and a correlation of 0.68, illustrating how much dataset size and model design matter.
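The core idea behind graph networks, representing a molecule as atoms exchanging information along bonds, can be sketched in plain Python. Real models like SolPredictor apply learned weight matrices, gating, and many stacked rounds at this step; this toy version uses plain summation with no learned parameters, purely to show the data flow.

```python
def message_pass(features, adjacency):
    """One round of the simplest message passing: each atom's new feature
    vector is its own vector plus the sum of its bonded neighbors' vectors.
    After several rounds, each atom's features reflect its wider chemical
    environment, which a final readout layer would pool into a prediction."""
    dim = len(features[0])
    out = []
    for i in range(len(features)):
        agg = list(features[i])          # start from the atom's own features
        for j in adjacency[i]:           # add each bonded neighbor's features
            for k in range(dim):
                agg[k] += features[j][k]
        out.append(agg)
    return out

# Toy "molecule": a 3-atom chain (say O-C-C), one-hot encoded by element.
features = [[1, 0], [0, 1], [0, 1]]   # [is_oxygen, is_carbon] per atom
adjacency = [[1], [0, 2], [1]]        # bonds: atom 0-1 and atom 1-2
print(message_pass(features, adjacency))
```

After one round, the middle carbon’s features already encode that it sits next to both an oxygen and another carbon, which is exactly the kind of local-environment information descriptor lists struggle to express.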

Choosing the Right Method

Your choice depends on what you’re solving. If you’re selecting a solvent for a known polymer or resin, Hansen solubility parameters give a fast, practical answer with minimal computation. If you’re designing a crystallization process and need to know how solubility changes with temperature and solvent composition, UNIFAC or COSMO-RS provides the thermodynamic framework. For screening large virtual libraries of drug candidates, machine learning models offer speed that no other method can match, processing thousands of structures in minutes.

In pharmaceutical development, the FDA uses a concrete solubility threshold for regulatory classification: a drug is considered highly soluble if the highest single therapeutic dose dissolves completely in 250 mL or less of water-based media across a pH range of 1.2 to 6.8 at body temperature. This classification, part of the Biopharmaceutics Classification System, determines whether a generic drug can skip certain clinical studies. Prediction methods in pharma ultimately need to be accurate enough to guide decisions toward or away from that threshold.
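The BCS threshold itself reduces to a one-line volume check. A minimal sketch, assuming the caller supplies the worst-case (lowest) solubility measured across the pH 1.2 to 6.8 range at 37 °C; the function name and units are my own:

```python
def is_highly_soluble_bcs(highest_dose_mg, min_solubility_mg_per_ml):
    """BCS-style check: the highest single therapeutic dose must dissolve
    completely in 250 mL or less of aqueous media. min_solubility_mg_per_ml
    is the worst case across the physiological pH range at body temperature."""
    volume_needed_ml = highest_dose_mg / min_solubility_mg_per_ml
    return volume_needed_ml <= 250.0

# A 500 mg dose with worst-case solubility of 4 mg/mL needs 125 mL: passes.
print(is_highly_soluble_bcs(500, 4.0))  # True
# The same dose at 1 mg/mL would need 500 mL: fails.
print(is_highly_soluble_bcs(500, 1.0))  # False
```

A prediction method used for this decision therefore needs to be trusted near the dose-to-solubility boundary, not just to rank compounds broadly.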

No single method dominates across all situations. The most reliable workflow combines a fast screening method (descriptors or machine learning) to narrow candidates, a thermodynamic model (UNIFAC, COSMO-RS, or free energy perturbation) to refine estimates, and experimental measurement to confirm the final values that matter most.