How effective are current AI models on mathematical research problems?


Credit: Vatican Museum

2023-2025: Can chatbots prove mathematical theorems?

Two years ago (23 Feb 2023), the present author tested ChatGPT, then the state of the art in AI-based large language models (LLMs), with requests to produce rigorous mathematical proofs of four well-known but nontrivial theorems:

  1. A general angle cannot be trisected with ruler and compass.
  2. $\pi$ is irrational.
  3. $\pi$ is transcendental.
  4. Every nonconstant polynomial equation with integer coefficients has a root in the complex numbers.

The results, frankly, were abysmal. For example, ChatGPT attempted to prove the irrationality of $\pi$ by reasoning that if $\pi$ were rational, then its decimal expansion would eventually be periodic [but the non-repeating nature of the digits of $\pi$ is a consequence of its irrationality, not the other way around]. Its proof of the fundamental theorem of algebra assumed that a root exists, which is what it was asked to prove. See this Math Scholar article for more details.

In 2025, the present author revisited this exercise, this time trying the newly released DeepSeek. In contrast to the earlier exercise, DeepSeek produced rather good responses, at least for three of the four problems. The author noted a few minor points, but no major flaws. Overall, it was clear that DeepSeek represented a major improvement over the capability of state-of-the-art LLMs from just two years ago. See this Math Scholar article for more details.

2026: An exercise from the theory of Euler sums

One year later, given the very rapid improvement in these models, it is time to assess the ability of these LLMs to handle some more realistic research problems. To that end, the author conducted a small study of four software systems on some research problems from the theory of Euler sums.

An Euler sum is an infinite series involving the harmonic numbers $H_k = 1 + 1/2 + 1/3 + \cdots + 1/k$. Such sums have been studied since the time of Euler, and more recently have arisen in numerous investigations in mathematical physics, studies of the Riemann hypothesis and other areas. One notable feature of these sums is that many have surprisingly elegant analytic evaluations: for example, $\sum_{k=1}^\infty H_k/k^2 = 2 \zeta(3), \; \sum_{k=1}^\infty H_k/k^3 = \pi^4/72$ and $\sum_{k=1}^\infty H_k^2/k^3 = 7 \zeta(5)/2 - \pi^2 \zeta(3) / 6$, where $\zeta(\cdot)$ is the Riemann zeta function.
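These classical evaluations are easy to confirm numerically. Here is a minimal Python sketch using partial sums; the 200,000-term cutoff and the tolerances are ad hoc choices based on crude tail estimates, not part of any rigorous argument:

```python
import math

# Partial sums of the three classical Euler sum evaluations above.
# Terms decay like log(k)/k^2 or faster, so the tail after N terms
# is roughly log(N)/N for s1 and far smaller for s2 and s3.
N = 200_000
H = 0.0
s1 = s2 = s3 = 0.0
for k in range(1, N + 1):
    H += 1.0 / k              # harmonic number H_k
    s1 += H / k**2
    s2 += H / k**3
    s3 += H * H / k**3

# Partial sums for zeta(3) and zeta(5); tails are negligible here
zeta3 = sum(1.0 / n**3 for n in range(1, N + 1))
zeta5 = sum(1.0 / n**5 for n in range(1, 10_000))

assert abs(s1 - 2 * zeta3) < 1e-3                               # sum H_k/k^2 = 2 zeta(3)
assert abs(s2 - math.pi**4 / 72) < 1e-6                         # sum H_k/k^3 = pi^4/72
assert abs(s3 - (3.5 * zeta5 - math.pi**2 * zeta3 / 6)) < 1e-6  # sum H_k^2/k^3
```

This kind of quick check is exactly what one would like the LLMs below to perform on their own intermediate results.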

Recently the present author, in collaboration with Ross McPhedran (University of Sydney, Australia) and Bruno Salvy (INRIA, France), found formulas and techniques giving explicit evaluations, in terms of the digamma and polygamma functions, for a class of Euler sums; see this paper. We originally discovered these results numerically, then subsequently found rigorous proofs. While these results were not trivial, they certainly would not be considered “hard” mathematical problems; they are more in the category of problems that could be assigned to a good student as part of a larger research effort. Incidentally, the proofs we found employed an approach distinct from any mentioned below.

In the course of this research, the present author tried posing some of these questions to several currently available LLMs, including ChatGPT, DeepSeek, Google’s Gemini and Anthropic’s Claude. This exercise was an interesting exploration of both the capabilities and the limitations of these systems. Specific problems that the present author posed to these models include the following, ranging from general (more challenging) to specific (less challenging).

Notation. In the following, $H_k = 1 + 1/2 + \cdots + 1/k$ denotes the $k$-th harmonic number; $\psi(z) = \psi(0,z)$ denotes the digamma function; $\psi(q,z) = \psi^{(q)} (z) = D^{q+1} (\log \Gamma(z))$ denotes the polygamma function; $\binom{n}{k}$ denotes the binomial coefficient; and $\gamma = 0.5772156649\ldots$ is Euler’s constant.

Problems:

  1. Given $t$ not an integer, and integer $p \geq 2$, prove that
    \begin{align}
    \sum_{k=1}^\infty \frac{H_k}{(k + t)^p} &= \frac{1}{2 (p-1)!} \left(t^{-p-1} p! + (-1)^p 2 (\gamma + \psi(0,t)) \psi(p-1,t) \right. \nonumber \\
    & \hspace{1em} \left. + (-1)^p \sum_{k=1}^{p-2} \binom{p-1}{k} \psi(k,t) \psi(p-1-k,t) - (-1)^p \psi(p,1+t)\right). \label{form:thm1}
    \end{align}
  2. For integers $m, n \geq 1, \gcd (m, n) = 1, p \geq 2$, prove that
    \begin{align}
    \sum_{k=1}^\infty \frac{H_k}{(m k + n)^p} &= \frac{1}{2 m^p (p-1)!} \left(\left(n/m \right)^{-p-1} p! + (-1)^p 2 (\gamma + \psi(0,n/m)) \psi(p-1,n/m) \right. \nonumber \\
    & \hspace{-4em} \left. + (-1)^p \sum_{k=1}^{p-2} \binom{p-1}{k} \psi(k,n/m) \psi(p-1-k,n/m) - (-1)^p \psi(p,1+n/m)\right). \label{form:thm2}
    \end{align}
  3. Prove that
    \begin{align}
    \sum_{k=1}^\infty \frac{H_k}{(5k+1)^2} &= \frac{1}{50}\left( 250 +2\gamma\psi(1,1/5) +2\psi(0,1/5)\psi(1,1/5) -\psi(2,6/5)\right). \label{form:S51b}
    \end{align}
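As a sanity check, the evaluation asserted in Problem 3 can be confirmed numerically. The following Python sketch uses the mpmath library, written in the $\psi(q,z)$ notation above; the 200,000-term cutoff and the $10^{-4}$ tolerance are arbitrary choices based on a rough tail estimate:

```python
from mpmath import mp, mpf, digamma, polygamma, euler

mp.dps = 30

# Left side: partial sum of H_k/(5k+1)^2. Terms decay like log(k)/(25 k^2),
# so the tail after N terms is on the order of log(N)/(25 N), well below 1e-4.
N = 200_000
H = 0.0
lhs = 0.0
for k in range(1, N + 1):
    H += 1.0 / k
    lhs += H / (5 * k + 1)**2

# Right side of Problem 3: (1/50)(250 + 2 gamma psi(1,1/5)
#   + 2 psi(0,1/5) psi(1,1/5) - psi(2,6/5))
rhs = (250 + 2 * euler * polygamma(1, mpf(1) / 5)
       + 2 * digamma(mpf(1) / 5) * polygamma(1, mpf(1) / 5)
       - polygamma(2, mpf(6) / 5)) / 50

assert abs(lhs - float(rhs)) < 1e-4
```

Of course, such a check only lends numerical support to the identity; it is no substitute for the rigorous proofs requested of the LLMs.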

The present author did not perform a comprehensive check of all problems and models, but here is a brief summary of the results obtained, mostly for Problem 3, the simplest in the list. For additional details, see this paper.

ChatGPT

The present author first presented ChatGPT with a statement of Problem 3 above. ChatGPT cleverly employed the identity $H_k = \int_0^1 (1 - x^k)/(1 - x) \, dx$ to rewrite the summation on the left-hand side of Problem 3 as $\int_0^1 1/(1-x) \sum_{k=1}^\infty (1-x^k)/(5k+1)^2 \, dx$. Somewhat later, however, it went awry by asserting
\begin{align}
\sum_{k=1}^\infty \frac{x^k}{(5k+1)^2} &= \frac{1}{25} \sum_{k=1}^\infty \frac{(x^5)^k}{(k+\frac{1}{5})^2} \; = \; \frac{1}{25} \left[\Phi(x^5,2,1/5) - \frac{1}{(1/5)^2}\right], \label{form:chatgpt1}
\end{align}
where $\Phi(z,s,a)$ denotes the Lerch phi function. Sadly, the middle expression is completely in error — the numerator $(x^5)^k$ should simply be $x^k$, and thus the right-hand expression is not equal to the left-hand summation. ChatGPT later applied a “known identity”
\begin{align}
\sum_{n=0}^\infty \frac{\psi(n+b)}{(n+a)^2} &= \psi(a) \psi^{(1)} (a) - \frac{1}{2} \psi^{(2)} (a + b - a).
\end{align}
Sadly, this identity, for which no reference was provided, is false.
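ChatGPT's misstep in the Lerch phi identity is easy to expose numerically. The sketch below evaluates both sides at the arbitrary test point $x = 1/2$ using mpmath's `lerchphi`, and also checks the corrected version with $x^k$ in place of $(x^5)^k$:

```python
from mpmath import mp, mpf, lerchphi, nsum, inf

mp.dps = 25
x = mpf(1) / 2   # arbitrary test point in (0, 1)

# Left-hand summation of ChatGPT's asserted identity
lhs = nsum(lambda k: x**k / (5 * k + 1)**2, [1, inf])

# ChatGPT's right-hand expression, with (x^5)^k inside the Lerch phi
claimed = (lerchphi(x**5, 2, mpf(1) / 5) - 25) / 25

# Corrected version, with x^k in place of (x^5)^k
# (note 25 (k + 1/5)^2 = (5k + 1)^2, and the n = 0 term of Phi is 25)
corrected = (lerchphi(x, 2, mpf(1) / 5) - 25) / 25

assert abs(lhs - corrected) < 1e-15   # the corrected identity holds
assert abs(lhs - claimed) > 0.01      # ChatGPT's version does not
```

A check of this kind, inserted after each claimed identity, would have caught the error immediately.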

The present author then tried a more advanced version of ChatGPT, namely GPT5.1 (available from the site https://use.ai on 17 Feb 2026; this required a paid subscription). This version pursued an interesting line of derivation, but went awry with statements involving divergent sums, such as the enigmatic line
\begin{align}
\sum_{k=1}^\infty \frac{1}{k+a} &= \sum_{k=0}^\infty \frac{1}{k+a+1} \; = \; \sum_{k=0}^\infty \frac{1}{k+(a+1)}.
\end{align}
In the end, while GPT5.1’s approach was interesting, the output was too problematic to be useful.

DeepSeek

The author then tried DeepSeek (the version available at https://www.deepseek.com on 10 Feb 2026). DeepSeek began by pursuing a derivation similar to the other LLMs, but went awry with the line
\begin{align}
S &= \int_0^1 \frac{1}{1-x} dx + \frac{1}{25} \int_0^1 \frac{\psi^{(1)} (1/5) - 1 - \Phi(x,2,1/5)}{1-x} \, dx.
\end{align}
DeepSeek acknowledged that the first integral diverges, but it tersely dismissed the problem. It then cited a “known formula”:
\begin{align}
\sum_{k=1}^\infty \frac{H_k}{(k+a)^2} &= \frac{1}{a} \left[ \psi^{(0)} (a+1) \psi^{(1)}(a) - \frac{1}{2} \psi^{(2)}(a+1)\right] + \frac{\psi^{(1)}(a) - 1/a}{a^2} + \frac{\gamma \psi^{(1)}(a)}{a} + \frac{\psi^{(0)}(a+1)}{a^2}.
\end{align}
Sadly, this formula is false. In short, DeepSeek’s output had significant errors and thus was not useful.
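The falsity of DeepSeek's "known formula" is also easy to demonstrate numerically. At $a = 1$ the left-hand side reduces to $\sum_{k \geq 1} H_k/(k+1)^2 = \zeta(3)$ (shift the index: $\sum_{n \geq 2} H_{n-1}/n^2 = 2\zeta(3) - \zeta(3)$), so one can compare against that known value; the cutoff and tolerances below are ad hoc choices:

```python
from mpmath import mp, mpf, digamma, polygamma, euler, zeta

mp.dps = 25
a = mpf(1)

# Left side at a = 1: partial sum of H_k/(k+1)^2, which equals zeta(3).
# Tail after N terms is on the order of log(N)/N, roughly 6e-5 here.
N = 200_000
H = 0.0
lhs = 0.0
for k in range(1, N + 1):
    H += 1.0 / k
    lhs += H / (k + 1)**2
assert abs(lhs - float(zeta(3))) < 1e-3   # partial-sum sanity check

# DeepSeek's cited "known formula", evaluated at a = 1
rhs = ((digamma(a + 1) * polygamma(1, a) - polygamma(2, a + 1) / 2) / a
       + (polygamma(1, a) - 1 / a) / a**2
       + euler * polygamma(1, a) / a
       + digamma(a + 1) / a**2)

assert abs(float(rhs) - float(zeta(3))) > 1.0   # the formula is off by more than 1
```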

Google Gemini

The author then tried Google Gemini (the version available on 17 Feb 2026 at the website https://gemini.google.com). The problem posed was exactly the same as that posed to ChatGPT. Gemini’s derivation proceeded until it presented a “known powerful identity”
\begin{align}
\sum_{k=1}^\infty \frac{H_k}{(k+a)^2} &= \frac{1}{2} \left[\psi^{(2)}(a+1) + 2 \gamma \psi^{(1)}(a+1) + 2 \psi^{(0)}(a+1) \psi^{(1)}(a+1)\right] + \cdots.
\end{align}
However, there is no such identity, and, as it stands (ignoring the enigmatic dots at the end), it is false.

The author then tried Gemini 3 Pro (available from the website https://use.ai on 23 Feb 2026; this required a paid subscription). This software quickly solved Problem 3, in a sense, by citing a “general identity for harmonic sums”:
\begin{align}
\sum_{k=1}^\infty \frac{H_k}{(k+a)^2} &= (\gamma + \psi(a)) \psi^{(1)}(a) - \frac{1}{2} \psi^{(2)}(a).
\end{align}
As it turns out, this identity is correct! But Gemini 3 Pro did not provide any proof or cite any reference.
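One can at least verify this identity numerically at the value relevant to Problem 3, namely $a = 1/5$. A sketch in Python with mpmath, with an ad hoc cutoff and tolerance:

```python
from mpmath import mp, mpf, digamma, polygamma, euler

mp.dps = 25
a = mpf(1) / 5   # the value of a relevant to Problem 3

# Gemini 3 Pro's cited identity: (gamma + psi(a)) psi'(a) - psi''(a)/2
rhs = (euler + digamma(a)) * polygamma(1, a) - polygamma(2, a) / 2

# Partial sum of the left side; tail after N terms ~ log(N)/N, about 7e-5 here
N = 200_000
H = 0.0
lhs = 0.0
for k in range(1, N + 1):
    H += 1.0 / k
    lhs += H / (k + 0.2)**2

assert abs(lhs - float(rhs)) < 1e-3
```

Such a check confirms the identity at a single point, of course; a proof or a literature citation is still required.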

The author then tried Gemini 3 Pro on Problem 1, the more general formula. After some initial derivation, it cited a “generating function” for the sum of harmonic numbers with a linear denominator:
\begin{align}
f(t) &= \sum_{k=1}^\infty \frac{H_k}{k+t} \; = \; \frac{1}{2} \left[(\gamma + \psi(t))^2 + \psi^{(1)}(t)\right] - \frac{1}{2} \zeta(2). \label{form:gem3b}
\end{align}
But clearly this cannot be correct, because the summation does not converge (its terms decay only like $(\log k)/k$). Gemini 3 Pro then continued with some analysis and claimed to have proven the formula, but its derivation had errors and thus the proof was invalid.

Anthropic Claude

The author then tried Anthropic’s Claude (Opus 4.5 version, available from https://use.ai on 25 Feb 2026; this required a paid subscription). The problem posed to Claude was exactly the same as that posed to ChatGPT. Claude’s derivation at one point applied the enigmatic relation
\begin{align}
\Phi(t,2,a) &= \psi^{(1)}(a) + (t - 1) \left[\gamma \psi^{(1)}(a) + \psi^{(0)}(a) - \frac{1}{2} \psi^{(2)}(a)\right] + O((t-1)^2), \label{form:claudex}
\end{align}
and then claimed the result
\begin{align}
\sum_{k=1}^\infty \frac{H_k}{(5k+1)^2} &= \frac{1}{50}\left( 250 +2\gamma\psi(1,1/5) +2\psi(0,1/5)\psi(1,1/5) -\psi(2,6/5)\right).
\end{align}
While this overall line of reasoning seemed promising, the present author was quite disappointed at the lack of detail in the evaluation of the key integral and in the derivation of the final formula. Lacking a detailed explanation of these key steps, this material was not useful.

Conclusion

In general, the author found that the freely available LLMs performed rather poorly. More advanced models available via paid subscription did significantly better, and are greatly improved from just a year or two ago. But even these more advanced models do not yet appear ready for “prime time” as mathematical research assistants for the type of problem addressed in this study. Difficulties include:

  1. Errors of algebra.
  2. Reliance on divergent sums and integrals.
  3. Reliance on formulas or other results without specific literature citation.
  4. Reliance on false formulas and results.
  5. Lack of details at key steps in the proof.
  6. Lack of validity checking, either for intermediate or final results.

With regard to the last item, the output of the systems studied here contained only one or two instances of numerical checking. Greatly expanding this practice should, by itself, significantly improve the reliability of these systems on real mathematical problems.

This exercise has significant limitations, notably that the models were tested only on a very specific set of problems, so these results should not be taken as indicative of likely performance on completely different types of problems. Once more researchers document their experiences in this manner, a clearer picture should emerge. Further, in each case the tested software was a specific version available from the listed site on a specific date. Given the rapid rate of improvement of these models, the results will likely be quite different just a few months from now. We eagerly await these improved versions.

However, one conclusion is clear: for the foreseeable future, mathematical users of these tools must keep firmly in mind that they do make mistakes, and that human users are ultimately responsible for ensuring that published material is free of errors that could corrupt the mathematical research enterprise.

For additional details on these tests, see this report.
