Converting Markdown to PDF_ 字节宝

Converting Markdown to PDF is a common requirement, and several mature tools can do it, such as pandoc and wkhtmltopdf. Many other tools are wrappers around these two.

pandoc

Pandoc is a document format converter. It converts between Markdown, plain text, DOCX, and many other formats, and it can produce PDF output.

Official site: https://pandoc.org/

(1) Install the Python library

pip install pandoc

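(2) Install TeX Live

TeX Live official site: https://tug.org/texlive/

TeX Live is the key dependency for rendering PDFs. Note that a default TeX Live installation does not render Chinese out of the box; it needs XeLaTeX together with suitable CJK packages and fonts. TeX Live can be installed from the yum repository, or downloaded from the official site and installed manually: https://tug.org/texlive/quickinstall.html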

yum install texlive

TinyTeX is a slimmed-down TeX Live distribution that also supports Chinese: https://yihui.org/tinytex/ . Installing either one of the two is enough.

wget "https://yihui.name/gh/tinytex/tools/install-unx.sh"
bash install-unx.sh
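
Before moving on, it can help to confirm that the tools are actually reachable from Python. The following is a minimal sketch, assuming the executables are named `pandoc` and `xelatex` on your system; `missing_tools` is a hypothetical helper written for this article, not part of pandoc or TeX Live.

```python
import shutil


def missing_tools(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them if your distribution differs.
    missing = missing_tools(["pandoc", "xelatex"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pandoc and xelatex are available")
```

If either name is reported as missing, revisit the installation steps above before trying the conversion code.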

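(3) Configure fonts

Use fc-list to see which fonts the system currently provides; it can also list the fonts for a specific language. If no font is installed for a language, characters in that language are rendered as garbled text, and you need to download a font into the font directory. On Linux the font directory is /usr/share/fonts/.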

fc-list :lang=zh
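
The same check can be done from Python, which is handy when a service needs to verify its environment at startup. This is a sketch under assumptions: it expects `fc-list` output lines in the usual `path: family,alias:style=...` shape, and both function names are hypothetical helpers rather than part of fontconfig.

```python
import subprocess


def parse_font_families(fc_list_output: str) -> list:
    """Extract sorted, de-duplicated family names from `fc-list` output.

    Each line is expected to look like:
        /path/to/font.ttf: Family Name,Alias:style=Regular
    """
    families = set()
    for line in fc_list_output.splitlines():
        parts = line.split(":")
        if len(parts) < 2:
            continue  # skip blank or unexpected lines
        # The second field holds one or more comma-separated family names.
        for family in parts[1].split(","):
            family = family.strip()
            if family:
                families.add(family)
    return sorted(families)


def installed_chinese_fonts() -> list:
    """Run `fc-list :lang=zh` and return the family names it reports."""
    result = subprocess.run(
        ["fc-list", ":lang=zh"], capture_output=True, text=True, check=True
    )
    return parse_font_families(result.stdout)
```

An empty list from `installed_chinese_fonts()` suggests that a Chinese font still needs to be installed.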

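For more font-related problems, see: https://github.com/jgm/pandoc/wiki/Pandoc-with-Chinese

(4) Code example

Sample code for converting a Markdown string to PDF is shown below. Pandoc calls TeX Live under the hood. The code invokes the pandoc executable through subprocess, so the pandoc command-line tool must be installed and on PATH.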

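import os
import subprocess
import tempfile


def markdown_to_pdf(markdown_text: str) -> bytes:
    # Reserve a .pdf path; pandoc infers PDF output from the file extension.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp_pdf:
        temp_pdf_path = temp_pdf.name

    try:
        # xelatex is chosen with --pdf-engine; it is not a valid -t output format.
        process = subprocess.Popen(
            ['pandoc', '-f', 'markdown', '--pdf-engine=xelatex', '-o', temp_pdf_path],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )
        _, stderr = process.communicate(input=markdown_text.encode('utf-8'))
        if process.returncode != 0:
            raise RuntimeError(f"Pandoc error: {stderr.decode('utf-8')}")

        with open(temp_pdf_path, 'rb') as pdf_file:
            return pdf_file.read()
    finally:
        # Remove the temporary file whether or not the conversion succeeded.
        os.remove(temp_pdf_path)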
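
If Chinese text comes out blank or garbled in the PDF, pandoc's `CJKmainfont` template variable can name the font that xelatex should use for CJK text. The sketch below only builds the argument list. `build_pandoc_command` is a hypothetical helper, and the font name you pass must be one that `fc-list :lang=zh` actually reports on your machine.

```python
from typing import List, Optional


def build_pandoc_command(output_path: str, cjk_font: Optional[str] = None) -> List[str]:
    """Build the argument list for converting Markdown on stdin to a PDF file."""
    command = ["pandoc", "-f", "markdown", "--pdf-engine=xelatex", "-o", output_path]
    if cjk_font:
        # -V sets a template variable; CJKmainfont selects the CJK font for xelatex.
        command += ["-V", f"CJKmainfont={cjk_font}"]
    return command
```

For example, `build_pandoc_command("out.pdf", "Noto Sans CJK SC")` appends `-V CJKmainfont=Noto Sans CJK SC`; the font name here is only an illustration and may not be installed on your system.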

pdfkit

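The Markdown-to-PDF workflow is:

Use the markdown library to convert the Markdown to HTML.
Use pdfkit to convert the HTML to PDF; pdfkit depends on the wkhtmltopdf tool.

(1) Install the Python dependencies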

pip install markdown pdfkit lxml pymdown-extensions bs4 python-markdown-math markdown_checklist pygments

(2) Download wkhtmltopdf

Official site: https://wkhtmltopdf.org/downloads.html

wget https://github.com/wkhtmltopdf/wkhtmltopdf/releases/download/0.12.4/wkhtmltox-0.12.4_linux-generic-amd64.tar.xz
tar xvfJ wkhtmltox-0.12.4_linux-generic-amd64.tar.xz
cd wkhtmltox/bin
sudo mv ./wkhtmltopdf /usr/local/bin/wkhtmltopdf
sudo chmod +x /usr/local/bin/wkhtmltopdf

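(3) Configure fonts as described in the Pandoc section above. To prevent garbled Chinese text, add a UTF-8 declaration to the head of the HTML; see the complete example in (6) below for the detailed code.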
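
The UTF-8 declaration can be added by wrapping the generated HTML fragment in a full document before handing it to pdfkit. This is a minimal sketch; `wrap_html` is a hypothetical helper, and the `css` parameter is an assumption made so that a stylesheet can be inlined in the same place.

```python
def wrap_html(body_html: str, css: str = "") -> str:
    """Wrap an HTML fragment in a full document that declares UTF-8."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<style>\n{css}\n</style>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n</html>\n"
    )
```

The result of `wrap_html(...)` can then be passed wherever a plain HTML string is expected.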

(4) Code example

import base64

import markdown
import pdfkit


def markdown_to_pdf(markdown_text: str) -> str:
    # Convert Markdown to HTML
    html = markdown.markdown(markdown_text)
    # Convert HTML to PDF; passing False returns the PDF as bytes instead of writing a file
    pdf = pdfkit.from_string(html, False)
    # Encode PDF to base64
    pdf_base64 = base64.b64encode(pdf).decode('utf-8')
    return pdf_base64

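The code above handles simple Markdown. Rendering more complex elements, such as code highlighting and tables, requires further extensions.

(5) Install the extension dependencies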

pip install lxml pymdown-extensions bs4 python-markdown-math markdown_checklist pygments

Generate the stylesheet

pygmentize -f html -a .highlight -S default > pygments.css

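(6) The complete example follows. The handling of the temporary files test.html and test_final.html can be improved; both files can be deleted after the conversion.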

import base64
import codecs
import markdown
import pdfkit
from pymdownx import superfences
from bs4 import BeautifulSoup

# pip install markdown pdfkit lxml pymdown-extensions bs4 python-markdown-math markdown_checklist pygments
# pygmentize -f html -a .highlight -S default > pygments.css

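extensions = [
    'toc',  # table of contents, [TOC]
    'extra',  # abbreviations, attribute lists, definition lists, fenced code blocks, footnotes, Markdown in HTML, tables
]
third_party_extensions = [
    'mdx_math',  # KaTeX math formulas, $E=mc^2$ and $$E=mc^2$$
    'markdown_checklist.extension',  # checklists, - [ ] and - [x]
    'pymdownx.magiclink',  # turn URLs into links automatically
    'pymdownx.caret',  # superscript (and insert)
    'pymdownx.superfences',  # nestable fenced blocks, custom fences for diagrams
    'pymdownx.betterem',  # improved emphasis handling (bold and italic)
    'pymdownx.mark',  # highlighted (marked) text
    'pymdownx.highlight',  # code syntax highlighting
    'pymdownx.tasklist',  # task lists
    'pymdownx.tilde',  # strikethrough (and subscript)
]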
extensions.extend(third_party_extensions)
extension_configs = {
    'mdx_math': {
        'enable_dollar_delimiter': True  # allow single $ delimiters
    },
    'pymdownx.superfences': {
        "custom_fences": [
            {
                'name': 'mermaid',  # enable flowcharts and other diagrams
                'class': 'mermaid',
                'format': superfences.fence_div_format
            }
        ]
    },
    'pymdownx.highlight': {
        'linenums': True,  # show line numbers
        'linenums_style': 'pymdownx-inline'  # keep line numbers separate from the code
    },
    'pymdownx.tasklist': {
        'clickable_checkbox': True,  # make task list checkboxes clickable
    }
}


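# codecs.open() and the built-in open() both open files in Python, but they differ in how they handle encodings.
# codecs.open() lets you set the file encoding explicitly through its encoding parameter, which is useful for
# files in non-standard encodings or when a specific encoding must be guaranteed.
# open() falls back to the system default encoding when no encoding argument is given.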
def convert_markdown_to_pdf(md_content: str):
    html_content = markdown.markdown(md_content, extensions=extensions, extension_configs=extension_configs)
    with codecs.open("test.html", "w", encoding="utf-8") as f:
        # Write an HTML head with a UTF-8 declaration to prevent garbled Chinese text
        with open("pygments.css", "r") as g:
            f.write('''<head>
    <style>
    {}
    </style>
    {}
    </head>\n'''.format(g.read(), '<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>'))
        f.write(html_content)

    # Constrain images in the HTML so they do not overflow the page width
    with codecs.open("test.html", "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, features="lxml")
        image_content = soup.find_all("img")
        for i in image_content:
            i["style"] = "max-width:100%; overflow:hidden;"
        with codecs.open("test_final.html", "w", encoding="utf-8") as g:
            g.write(soup.prettify())

    pdf_content = pdfkit.from_string(soup.prettify(), False)
    pdf_content_base64 = base64.b64encode(pdf_content).decode('utf-8')
    return pdf_content_base64
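
One way to tidy up the temporary files is to write them into a directory that Python removes automatically. The sketch below uses only the standard library and leaves the Markdown and pdfkit calls out; both function names are hypothetical and only illustrate the cleanup pattern.

```python
import os
import tempfile


def write_intermediate_html(html: str, workdir: str, name: str = "test.html") -> str:
    """Write `html` as UTF-8 into `workdir` and return the file path."""
    path = os.path.join(workdir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path


def roundtrip_in_tempdir(html: str) -> str:
    """Write and re-read HTML inside a directory that is removed afterwards."""
    with tempfile.TemporaryDirectory() as workdir:
        path = write_intermediate_html(html, workdir)
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
    # At this point the directory and the file inside it no longer exist.
    return content
```

In the full example, the two `codecs.open(...)` writes could target paths inside such a temporary directory, so nothing is left behind once the PDF bytes have been produced.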

