Mathlib Graph - Statistical Properties of Formal Math Libraries

Background

What is Mathlib Graph?

Mathlib is a library of formally verified mathematics, written in Lean 4. Proof formalization libraries like mathlib provide an exciting array of new possibilities, including having a centralized locus of verified mathematical knowledge, training data for LLMs and proof formalizers. These developments bring to the forefront old questions in a new format regarding what mathematics is and how it is practiced and taught. Fortunately, I think that libraries of formalized theorems can help us gain empirical evidence towards answering these questions more satisfactorily. Mathlib Graph is a very small step into what I believe will be a field of increasing importance, namely mathematical metaanalysis. This field analyzes the structure of mathematics as a whole, similar to how scientific metaanalysis synthesizes many relevant studies and allows one to see larger trends which aren't apparent from the ground.

🔗

Proof Dependency Graph

Each theorem that uses another creates a directed edge. We construct a DAG of 387K edges.

📊

Statistical Analysis

We compute in-degree, out-degree, hub scores, longest dependency chains and other stats, then correlate these with proof complexity metrics.

🏷️

Named Theorem Detection

We identify theorems named after mathematicians using two independent signals: docstring bold-markup and camelCase identifier decomposition against a curated name list.

Methods

Data Pipeline

In case you want to play around with the data, this section describes the back-end.

🤗

HuggingFace

mathlib-types
713K rows, Parquet

🗃️

LeanDojo

Zenodo corpus.jsonl
premises → edges

🐙

GitHub Raw

Docstring extraction
at commit hash

→

🦆

DuckDB Graph

mathlib.db
738K decl · 387K edges

We construct the full transitive dependency graph across all of Mathlib, join it with docstrings and source positions from GitHub, and run statistical analyses comparing named vs. unnamed theorems.

Graph Metrics

For each theorem we compute: in-degree (how many proofs cite it), out-degree (how many lemmas it uses), hub score (in × out), BFS depth, and longest dependency chain via Kahn's topological DP.

Named Theorem Detection

Two independent signals: (1) doc bold **...** markup in docstrings, (2) id camelCase/underscore decomposition against ~90 mathematician names. Combined: 6,056 named theorems identified.

Docstring Harvesting

We fetch raw Lean source files from GitHub at a fixed commit hash, parse /-- ... -/ docstring blocks using source line positions from the lean_source table. 43,480 docstrings collected across all of Mathlib.

Statistical Testing

Mann-Whitney U tests (non-parametric) with rank-biserial correlation as effect size. Compares named vs. unnamed theorems across in-degree, out-degree, tactic step count, and code line count.

Results · Theorem Importance

Most Cited Theorems in Mathlib

In-degree (the number of other proofs that directly depend on a theorem) is the most direct proxy for mathematical importance within a formal library. In the future, this analysis can be expanded to perform more sophisticated statistical analyses and make large AI generated theorem libraries human interpretable.

mul_comm (commutativity of multiplication) is cited by 1,462 other proofs. The top 5 are all basic ring axioms, appearing more often than any named theorem like Cauchy's or Fermat's.

Top 30 by In-degree Primitive Concepts Cross-module Flows

Declarations cited by the most other proofs. Includes all kinds (theorems, defs, instances).

#	Name	Kind	Area	In-degree
1	mul_comm	theorem	Algebra	1,462
2	mul_assoc	theorem	Algebra	1,267
3	mul_one	theorem	Algebra	1,171
4	Set.range	def	Data	1,055
5	one_mul	theorem	Algebra	1,015
6	add_comm	unknown	Algebra	950
7	le_antisymm	theorem	Order	814
8	Fintype.card	def	Data	810
9	LE.le.trans	unknown	Order	699
10	CategoryTheory.Category.assoc	unknown	CategoryTheory	695
11	Set.univ	def	Data	692
12	MulZeroClass.mul_zero	unknown	Algebra	644
13	zero_add	unknown	Algebra	634
14	add_zero	unknown	Algebra	616
15	Set	def	Data	599
16	le_rfl	theorem	Order	598
17	sub_eq_add_neg	unknown	Algebra	582
18	le_trans	theorem	Order	565
19	eq_or_ne	theorem	Logic	559
20	MulZeroClass.zero_mul	unknown	Algebra	530
21	Set.mem_setOf_eq	theorem	Data	518
22	div_eq_mul_inv	theorem	Algebra	493
23	smul_eq_mul	lemma	Algebra	486
24	add_assoc	unknown	Algebra	444
25	algebraMap	def	Algebra	437
26	one_smul	lemma	Algebra	435
27	LT.lt.le	unknown	Order	416
28	ENNReal.ofReal	def	Data	416
29	CategoryTheory.Category.comp_id	unknown	CategoryTheory	406
30	le_of_lt	theorem	Order	379

Primitive concepts are declarations with no outgoing dependency edges. (This is partly determined by how sensitive the dependency generator is, determining line by line dependency for term mode proofs was not that easy and could be made to detect more dependencies.)

#	Name	Kind	Area	In-degree
1	Eq.symm	theorem	Prelude	1,730
2	Iff.rfl	theorem	Core	1,413
3	Function.Injective	theorem	Data	1,268
4	Set.range	def	Data	1,055
5	add_comm	unknown	Algebra	950
6	Fintype.card	def	Data	810
7	Or.inl	theorem	Prelude	802
8	Function.comp_apply	theorem	Core	801
9	Or.inr	theorem	Prelude	760
10	eq_comm	theorem	Core	701
11	LE.le.trans	unknown	Order	699
12	CategoryTheory.Category.assoc	unknown	CategoryTheory	695
13	Set.univ	def	Data	692
14	congr_arg	unknown	Logic	680
15	Iff.mp	theorem	Core	655
16	MulZeroClass.mul_zero	unknown	Algebra	644
17	zero_add	unknown	Algebra	634
18	Ne	theorem	Core	627
19	add_zero	unknown	Algebra	616
20	Set	def	Data	599

644,656 of 738,240 declarations (87%) are primitive (omega, simp, decide)

Cross-module dependency traffic analyzes which dependencies between different areas of math. Algebra is the dominant sink: Analysis, RingTheory, LinearAlgebra, and NumberTheory all flow heavily into it.

Analysis→Algebra

11,866

RingTheory→Algebra

11,555

LinearAlg→Algebra

6,840

NumberTh→Algebra

6,168

Analysis→Data

5,095

Algebra→Data

5,017

MeasureTheory→Data

5,009

Topology→Data

4,550

Analysis→Topology

4,534

Analysis→Order

4,226

MeasureTheory→Order

4,061

Topology→Order

4,055

MeasureTheory→Algebra

3,635

NumberTh→Data

3,262

Data→Algebra

3,253

Results · Named Theorems

Which Named Theorems Matter Most?

We identify theorems named after mathematicians using two independent signals, then measure whether they are structurally more important than unnamed results.

de Morgan's laws (not_and_or) top the named-theorem rankings with an in-degree of 113, cited by 113 other proofs. They're followed by the Squeeze theorem (18), Monotone Convergence (16), and the Binomial theorem (16).

Statistical Effect Sizes: Named vs. Unnamed Theorems

Rank-biserial correlation r from Mann-Whitney U tests. Positive = named theorems score higher. All results significant at p < 0.001.

+0.51

In-degree (citations)

Large effect · p < 0.001

+0.38

Out-degree (dependencies)

Medium effect · p < 0.001

+0.46

Tactic steps

Large effect · p < 0.001

+0.44

Code lines

Large effect · p < 0.001

Top Named Theorems by In-degree (docstring signal)

#	Common Name	Lean Identifier	Area	In-degree
1	de Morgan's laws	not_and_or	Logic	113
2	Squeeze theorem	tendsto_of_tendsto_of_tendsto_of_le_of_le	Topology	18
3	Monotone convergence / Beppo-Levi	MeasureTheory.lintegral_iSup	MeasureTheory	16
4	Binomial theorem	add_pow	Data	16
5	Lipschitz continuous	LipschitzWith	Topology	16
6	Krull's theorem	Ideal.exists_le_maximal	RingTheory	15
7	Squeeze theorem (variant)	tendsto_of_tendsto_of_tendsto_of_le_of_le'	Topology	12
8	Bézout's lemma	Nat.gcd_eq_gcd_ab	Data	11
9	Extreme value theorem	IsCompact.exists_isMaxOn	Topology	10
10	Cantor's theorem	Cardinal.cantor	SetTheory	9
11	Integration by parts	intervalIntegral.integral_mul_deriv_eq_deriv_mul	MeasureTheory	8
12	Dominated convergence	MeasureTheory.tendsto_integral_of_dominated_convergence	MeasureTheory	7
13	Rearrangement inequality	mul_add_mul_le_mul_add_mul	Algebra	6
14	Hölder inequality	NNReal.inner_le_Lp_mul_Lq	Analysis	6

Named Theorem Counts by Area

Analysis

270

MeasureTheory

147

Algebra

Topology

Combinatorics

NumberTheory

Probability

RingTheory

GroupTheory

LinearAlgebra

Results · Graph Structure

The Shape of Mathematical Knowledge

Despite hundreds of transitive dependencies, proof chains in Mathlib are surprisingly shallow. Even the most complex theorems have longest chains of only 10-15 steps (this is partly due to the limitation in detecting dependency between very simple lemmas).

🌊

Shallow Chains

PadicInt.norm_int_lt_one_iff_dvd has 195 transitive dependencies yet its longest chain is only 12 steps. Formal proofs are wide, not deep (again partly due to difficulty detecting simple lemma dependence).

🌳

87% Primitive Leaves

644,656 declarations have no outgoing edges

🧲

Algebra is the Attractor

Analysis, RingTheory, LinearAlgebra, and NumberTheory all flow into Algebra at 3,000-12,000 edges each.

A Graph of
Formal Mathematics

What is Mathlib Graph?

Proof Dependency Graph

Statistical Analysis

Named Theorem Detection

Data Pipeline

Graph Metrics

Named Theorem Detection

Docstring Harvesting

Statistical Testing

Most Cited Theorems in Mathlib

Which Named Theorems Matter Most?

Statistical Effect Sizes: Named vs. Unnamed Theorems

Top Named Theorems by In-degree (docstring signal)

Named Theorem Counts by Area

The Shape of Mathematical Knowledge

Shallow Chains

87% Primitive Leaves

Algebra is the Attractor

Mathematics in the Age of Formal Verification

Essay: Framing the New Mathematical Paradigm

A Graph ofFormal Mathematics

What is Mathlib Graph?

Proof Dependency Graph

Statistical Analysis

Named Theorem Detection

Data Pipeline

Graph Metrics

Named Theorem Detection

Docstring Harvesting

Statistical Testing

Most Cited Theorems in Mathlib

Which Named Theorems Matter Most?

Statistical Effect Sizes: Named vs. Unnamed Theorems

Top Named Theorems by In-degree (docstring signal)

Named Theorem Counts by Area

The Shape of Mathematical Knowledge

Shallow Chains

87% Primitive Leaves

Algebra is the Attractor

Mathematics in the Age of Formal Verification

Essay: Framing the New Mathematical Paradigm

A Graph of
Formal Mathematics