#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# latex2unicode.py: Convert a simple inline TeX/LaTeX (aimed at ArXiv abstracts) into Unicode+HTML+CSS, using the OA API.
# Author: Gwern Branwen
# Date: 2023-06-28
# When: Time-stamp: "2026-01-28 15:33:14 gwern"
# License: CC-0
#
# Usage: $ OPENAI_API_KEY="sk-XXX" xclip -o | python latex2unicode.py
#
# Typesetting TeX/LaTeX for web browsers is typically a heavyweight operation; even if done server-side, display often requires a lot of CSS+fonts. And then the result looks highly unnatural and clearly 'alien', interrupting reading flow. This is worthwhile for complex equations, where browser typesetting is not up to snuff, but for many in-the-wild TeX uses, the use is often as simple as `$X$`, which would look better as `X` & take megabytes less to render. So it is desirable for simple TeX expressions to convert them to 'native' Unicode/HTML (augmented with a bit of custom CSS to handle things like superscripts-over-subscripts which pop up in integrals/summations/binomials/matrices etc).
# Unfortunately, TeX is an irregular macro language which is hard to parse and 'compile' to Unicode: it's easy to do many examples, but there's a long tail of weird variables, formatting commands etc, which means that I wind up defining lots of rewrites by hand, even though they are usually pretty 'obvious'. So, quite tedious and unrewarding.
# However, this is a perfect use-case for GPT models: it is hard to write comprehensive rules for, but is an extremely constrained problem in a domain it knows well which requires processing few tokens, where I can give it many few-shot examples, interrogate it for edge-cases to then write rules/examples for, and the harm of an error is relatively minimal (anyone seriously using an equation will need to read the original anyway, so won't be fooled by a wrong translation).
# So we write down a list of general rules, then a bunch of specific examples, then ask GPT-4 to translate from TeX to Unicode/HTML/CSS.
#
# eg.
# $ echo 'a + b = c^2' | python3 latex2unicode.py
# a + b = c<sup>2</sup>
#
# Bonus feature: LLMs are smart enough to generalize, so free-form natural language inputs may also work:
#
# $ echo 'x times 2 but raised to 1/3rds' | latex2unicode.py
# x × 2<sup>1⁄3</sup>
# $ echo 'asymptotically square root n' | latex2unicode.py
# 𝒪(√n)
#
# NOTE: this is intended only for using clean TeX and compiling to something usable in HTML/Markdown. For converting from an image or screenshot to TeX, see dedicated image-to-TeX tools (or prompt a VLM like Claude-3 or GPT-4o-V with an image & request).

import sys
from openai import OpenAI

client = OpenAI()

# Read the TeX fragment from stdin when no argument is given, else from the first argument.
if len(sys.argv) == 1:
    target = sys.stdin.read().strip()
else:
    target = sys.argv[1]

prompt = """
Task: Convert LaTeX inline expressions from ArXiv-style TeX math to inline Unicode+HTML+CSS, for easier reading in web browsers.

Task example:

Input to convert: \\(H\\gg1\\)
Converted output: H ≫ 1

Details:

- Convert only if the result is unambiguous.
- Note that inputs may be very short, because each LaTeX fragment in an abstract is processed individually. Many inputs will be as short as a single letter (typically a variable).
- Assume only default environment settings with no redefinitions or uses like `\\newcommand` or `\\begin`. Skip custom operators.
- Do not modify block-level equations, or complex structures such as diagrams or tables or arrays or matrices (eg `\\begin{bmatrix}`), or illustrations such as drawn by TikZ or `\\draw`, as those require special processing (eg. matrices must be converted into HTML tables). If the input is not an inline math expression, do not convert it & simply repeat it.
- If a TeX command has no reasonable Unicode equivalent, such as the `\\overrightarrow{AB}`/`\\vec{AB}` or `\\check{a}` or `\\underline`/`\\overline` commands in LaTeX, simply repeat it.
- If a TeX command merely adjusts positioning, size, or margin (such as `\\big`/`\\Big`/`\\raisebox`), always omit it from the conversion (as it is probably unnecessary & would need to be handled specially if it was).
- The TeX/LaTeX special glyphs (`\\TeX` & `\\LaTeX`) are handled elsewhere; do not convert them, but simply repeat them.
- Use Unicode entities, eg. MATHEMATICAL SCRIPT CAPITAL O `𝒪` in place of `\\mathcal{O}`, and likewise for the Fraktur ones (`\\mathfrak`) and blackboard-bold ones (`\\mathbb`). Convert to the closest Unicode entity that exists. Convert symbols, special symbols, mathematical operators, and Greek letters. Convert even if the Unicode is rare (such as `𝒪`). If there is no Unicode equivalent (such as because there is not a matching letter in that font family, or no appropriate combining character), then do not convert it.
- If there are multiple reasonable choices, such as `\\approx` which could be represented as `≈` or `~`, choose the simpler-looking one. Do not choose the complex one unless there is some good specific reason for that.
- For superimposed subscript+superscript, use a predefined CSS class `subsup`, eg. `(\\Delta^0_n)` → `Δ<span class="subsup"><sub>n</sub><sup>0</sup></span>`; `\\Xi_{cc}^{++} = ccu` → `Ξ<span class="subsup"><sub>cc</sub><sup>++</sup></span> = ccu`; `\\,\\Lambda_c \\Lambda_c \\to \\Xi_{cc}^{++}\\,n\\,` → `Λ<sub>c</sub> Λ<sub>c</sub> → Ξ<span class="subsup"><sub>cc</sub><sup>++</sup></span>,n`. This is also useful for summations or integrals, such as `\\int_a^b f(x) dx` → `∫<span class="subsup"><sub>a</sub><sup>b</sup></span> f(x) dx`.
    - NOTE: the '<sub>' must come before '<sup>', for Pandoc compatibility. (Hence the name 'subsup' rather than 'supsub'.)
- For small fractions, where both numbers are 3 integer digits or less, use FRACTION SLASH (⁄) to convert (eg. `1/2` or `\\frac{1}{2}` → `1⁄2`). Do not use the Unicode fractions like VULGAR FRACTION ONE HALF `½`.
- For symbolic or large fractions, where one argument is a letter or symbol or >3 integer digits, use U+29F8 BIG SOLIDUS (⧸) instead, like '_a_⧸_b_'.
- For complex fractions which use superscripts or subscripts, multiple arguments etc, do not convert them & simply repeat them. eg. do not convert `\\(\\frac{a^{b}}{c^{d}}\\)`, as it is too complex.
- Convert roots such as square or cube roots if that would be unambiguous. For example, `\\sqrt[3]{8}` → `∛8` is good, but not `\\sqrt[3]{ab}` because `∛ab` is ambiguous; do not convert complex roots like `\\sqrt[3]{ab}`.
- Color & styling: if necessary, you may use simple inline CSS with a `style` declaration on a `<span>`, such as to color something blue.
- Outlines/boxes: you may use simple inline CSS to draw borders.
- Be careful about dash use: correctly use MINUS SIGN (−) vs EM DASH (—) vs EN DASH (–) vs HYPHEN-MINUS (-).

More rules/examples for edge-cases:

- 'O(1)'
𝒪(1)
- '\\(\\mathsf{TC}^0\\)'
TC<sup>0</sup>
- '\\(\\approx\\)'
~
- '\\(1-\\tilde \\Omega(n^{-1/3})\\)'
1 − Ω̃(n<sup>−1⁄3</sup>)
- '\\(\\mathbf{R}^3\\)'
𝐑<sup>3</sup>
- '\\(\\ell_p\\)'
𝓁<sub>p</sub>
- '\\textcircled{r}'
ⓡ
- '\\(\\nabla \\log p_t\\)'
∇ log p<sub>t</sub>
- '\\(\\partial_t u = \\Delta u + \\tilde B(u,u)\\)'
∂<sub>t</sub>u = Δu + B̃(u, u)
- '\\(1 - \\frac{1}{e}\\)'
1 − 1⧸e
- 'O(\\sqrt{T})'
𝒪(√T)
- '\\(^\\circ\\)'
°
- '\\(^\\bullet\\)'
•
- '\\(6\\times 10^{-6}\\)'
6×10<sup>−6</sup>
- '5\\div10'
5 ÷ 10
- '\\Pr(\\text{text} | \\alpha)'
Pr(text | α)
- '\\(\\hbar\\)'
ℏ
- '\\frac{1}{2}'
1⁄2
- '\\nabla'
∇
- '\\(r \\to\\infty\\)'
r → ∞
- '\\hat{a}'
â
- '\\textit{zero-shot}'
<em>zero-shot</em>
- '\\(f(x) = x \\cdot \\text{sigmoid}(\\beta x)\\)'
f(x) = x × sigmoid(β x)
- '\\clubsuit'
♣
- '\\textcolor{red}{x}'
<span style="color: red">x</span>
- '\\textcolor{red}{X}'
<span style="color: red">X</span>
- '\\textbf{bolding}'
<strong>bolding</strong>
- '\\textit{emphasis}'
<em>emphasis</em>
- 'B'
B
- 'u'
u
- 'X + Y'
X + Y
- '\\,\\Lambda_b \\Lambda_b \\to \\Xi_{bb}\\,N\\,'
, Λ<sub>b</sub> Λ<sub>b</sub> → Ξ<sub>bb</sub> N,
- 'x \\in (-\\infty, \\infty)'
x ∈ (-∞, ∞)
- 'p\\bar{p} \\to \\mu^+\\mu^-'
pp̅ → μ<sup>+</sup>μ<sup>−</sup>
- '\\alpha\\omega\\epsilon\\S\\om\\in'
αωε§øm∈
- '^2H ^6Li ^{10}B ^{14}N'
<sup>2</sup>H <sup>6</sup>Li <sup>10</sup>B <sup>14</sup>N
- '\\mathcal{L} \\mathcal{H} \\mathbb{R} \\mathbb{C}'
ℒ ℋ ℝ ℂ
- '\\textrm{M}_\\odot'
M<sub>☉</sub>
- '200+'
200+
- 'M = M_a \\cup M_b \\subseteq \\mathbb{R}^d'
M = M<sub>a</sub> ∪ M<sub>b</sub> ⊆ ℝ<sup>d</sup>
- 'f : \\mathbb{R}^d \\to \\mathbb{R}^p'
f : ℝ<sup>d</sup> → ℝ<sup>p</sup>
- 'M_a'
M<sub>a</sub>
- 'β_k\\bigl(f(M_i)\\bigr) = 0'
β<sub>k</sub>(f(M<sub>i</sub>)) = 0
- 'k \\ge 1'
k ≥ 1
- 'β_0\\bigl(f(M_i)\\bigr) = 1'
β<sub>0</sub>(f(M<sub>i</sub>)) = 1
- 'i =a, b'
i = a, b
- '(n,d,\\lambda)'
(n, d, λ)
- '\\Lambda'
Λ
- '\\not\\approx'
≉
- '\\left\\langle A \\middle| B \\right\\rangle'
⟨A|B⟩
# note: "In Unicode, a few of the more common blackboard bold characters (ℂ, ℍ, ℕ, ℙ, ℚ, ℝ, and ℤ) are encoded in the Basic Multilingual Plane (BMP) in the Letterlike Symbols (2100–214F) area, named DOUBLE-STRUCK CAPITAL C etc. The rest, however, are encoded outside the BMP, in Mathematical Alphanumeric Symbols (1D400–1D7FF), specifically from 1D538–1D550 (uppercase, excluding those encoded in the BMP), 1D552–1D56B (lowercase) and 1D7D8–1D7E1 (digits). Blackboard bold Arabic letters are encoded in Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF), specifically 1EEA1–1EEBB."
- '\\mathcal{R}'
ℛ
- '\\mathbb{R}'
ℝ
- '\\mathbb{N}'
ℕ
- '\\cancel{x}'
x̸
- '\\left{\\frac{1}{2} \\right}'
\\left{\\frac{1}{2} \\right}
- '\\dot{x}'
ẋ
- '\\ddot{x}'
ẍ
- 'x^{y^{z}}'
x<sup>y<sup>z</sup></sup>
- '\\lim_{x \\to \\infty} f(x)'
lim<sub>x → ∞</sub> f(x)
- '\\boxed{A}'
A
- '\\'
- '\\:'
- '\\;'
- '\\quad'
- '\\qquad'
- '!'
- '\\!'
- En space
- Figure space
- Punctuation space
- 'O(m' \\log^2 m')'
𝒪(m′ log<sup>2</sup> m′)
- 'n''
n′
- '$%$'
%
- '%'
%
- "\\(0.90, 0.91, 0.94\\)"
0.90, 0.91, 0.94
- '123/456'
123⁄456
- '123/4567'
123⧸4,567
- '1234/765'
1,234⧸765
- '5610/987980'
5,610⧸987,980
- '504827'
504,827
- '($(\\frac{202680742}{582771} \\cdot 0.1) \\cdot 100$)'
((202,680,742⧸582,771) × 0.1 × 100)
- '740/618'
740⁄618
- '$\\frac{1910}{209} = 9.14$'
1,910⧸209 = 9.14
- '(504827⁄1800) × 1.0 × 100'
(504,827⧸1,800) × 1.0 × 100
- '$n/({\\pi\\over 8}$ lg $n)\\sp{1/2}$'
_n_⧸(𝜋⧸8 log _n_)<sup>1⁄2</sup>
- 'O(\\log n \\operatorname{polyloglog} n)'
𝒪(⟨log⁡n⟩ polyloglog n)
- 'r1,... rm'
r1, ..., rm
- '\\(LCSPACE[s,c,e] = CSPACE[\\Theta(s + e \\log c), \\Theta(c)]\\)'
LCSPACE[s, c, e] = CSPACE[Θ(s + e log c), Θ(c)]
- 'M_{PBH} > 1.4 \\times 10^{17} {\\rm g}'
M<sub>PBH</sub> > 1.4 × 10<sup>17</sup> g
- '\\(<n\\)'
<n
- '$DyT($x$) = \\tanh(α$x$)$'
DyT(x) = tanh(αx)
- '\\hat r'
r̂
- '$x = \\frac{o \\cdot e - (1 - e)}{o}$'
x = o · e − (1 − e) ⧸ o
- '$\\mathcal{V}$'
𝒱
- '\\(\\sim 10^6 \\mathrm{\\mu Lenat/word}\\)'
~10<sup>6</sup> μLenat⧸word
- '\\322\\'
322
- 'E\\in\\mathbb{R}^{m\\times n}'
E ∈ ℝ<sup>m × n</sup>
- 'z = \\frac{147.21 - 147.64}{0.145} = -2.96'
z = (147.21 − 147.64)⧸0.145 = −2.96
- '$12,087$'
12,087
- '1,600'
1,600
- '$2*5=10$'
2 · 5 = 10

Task:

- '""" + target + "'\n"

completion = client.chat.completions.create(
    model="gpt-4.1-mini", # we use GPT-4 because the outputs are short, we want the highest accuracy possible, we provide a lot of examples & instructions which may overload dumber models, and reviewing for correctness can be difficult, so we are willing to spend a few pennies to avoid the risk of a weaker model
    messages=[
        {"role": "system", "content": "You are a skilled mathematician & tasteful typographer, expert in LaTeX."},
        {"role": "user", "content": prompt}
    ]
)

output = completion.choices[0].message.content.rstrip()
print(output, end='') # avoid trailing newline because we might be cleaning inline text & want to avoid injecting newlines