Trafilatura 2.0: extract() drops <h2> headings no matter how it is configured (solution)

Trafilatura is pleasant to use, but extract() discards perfectly ordinary <h2> tags no matter which options I set.

Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> #!/usr/bin/env python3
>>> import trafilatura
>>> from trafilatura.settings import use_config
>>> 
>>> html = '''
... <body>          <!-- only this line was added -->
... <h2>Understanding Git Undo Mechanisms</h2>
... <p>In software development, erroneous code commits are common occurrences.</p>
... <h2>Basic Undo Operations: git reset</h2>
... <p>git reset is the most commonly used undo command.</p>
... </body>         <!-- and this line -->
... '''
>>> 
>>> cfg = use_config()
>>> cfg.set('DEFAULT', 'EXTRACTION_STRICT', '0')   # strict mode still turned off
>>> 
>>> txt = trafilatura.extract(
...         html,
...         output_format='txt',
...         config=cfg,
...         favor_recall=True,
...         include_formatting=True,
...         favor_precision=False)
>>> 
>>> print(repr(txt))
'In software development, erroneous code commits are common occurrences.\ngit reset is the most commonly used undo command.'
>>> print('contains first h2:', 'Understanding Git Undo Mechanisms' in txt)
contains first h2: False
>>> print('contains second h2:', 'Basic Undo Operations: git reset' in txt)
contains second h2: False
>>>

This feels unsolvable. Why would such a useful tool have a problem like this?


Answer

jerkzhang

Oct 27, 2025


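I went through several hundred rounds of debugging with multiple AI assistants and never got a working solution out of them.

Reading through the discussions in the project's GitHub issue list afterwards is what revealed the root cause.

Trafilatura does not extract content the way a parser like BeautifulSoup (bs) does. It analyzes the page as a whole and relies on its own algorithms to decide what the main content is. The snippet you tested is not the source of a complete HTML page, so the algorithm ends up omitting the h2 headings during extraction. This is not a bug but a characteristic of how Trafilatura works.

So try to assemble the content you want to extract into a reasonably complete page. For example: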

html_blog = f"<header><h1>{title}</h1></header><article>{article_content}</article>"

With this structure in place, the h2 headings in the body can be extracted.
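The wrapping step can be turned into a small helper. This is only a minimal sketch and not part of Trafilatura: `wrap_fragment` and `extract_fragment` are hypothetical names, and escaping the title with `html.escape` is my own assumption (the title is treated as plain text, the article body as ready-made HTML).

```python
from html import escape


def wrap_fragment(title: str, article_content: str) -> str:
    """Wrap an HTML fragment in a header/article skeleton.

    `title` is treated as plain text and escaped; `article_content` is
    assumed to already be HTML and is inserted unchanged.
    """
    return (
        f"<header><h1>{escape(title)}</h1></header>"
        f"<article>{article_content}</article>"
    )


def extract_fragment(title: str, article_content: str):
    """Hypothetical helper: run trafilatura.extract() on the wrapped fragment."""
    import trafilatura  # imported lazily so wrap_fragment works without it

    return trafilatura.extract(
        wrap_fragment(title, article_content),
        output_format="txt",
        include_formatting=True,
    )


page = wrap_fragment(
    "Git Undo & Reset",
    "<h2>Basic Undo Operations: git reset</h2>"
    "<p>git reset is the most commonly used undo command.</p>",
)
print(page)
```

Calling `extract_fragment(title, article_content)` then feeds the wrapped page to `trafilatura.extract()`; whether a given heading survives still depends on Trafilatura's own heuristics.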

Here is a complete example in which the h2 is extracted:

import trafilatura
from trafilatura.settings import use_config
print('trafilatura version :', trafilatura.__version__)
html = b'''
        <header>
            <h1>Article Title</h1>
        </header>
        <article>
            <h2>Related Articles</h2>
            <p>Article contentArticle contentArticle contentArticle contentArticle contentArticle contentArticle content</p>
        </article>
    '''
cfg = use_config()
cfg.set('DEFAULT', 'EXTRACTION_STRICT', '0')
cfg.set('DEFAULT', 'MIN_OUTPUT_SIZE', '0')
txt = trafilatura.extract(html,
                          output_format='txt',
                          config=cfg,
                          favor_recall=True,
                          favor_precision=False,
                          include_formatting=True)   # <-- switch to this
print('---------- repr of txt ----------')
print(repr(txt))
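
To check the result programmatically rather than by reading the repr, the `in txt` test from the question can be generalized. The sketch below uses only the standard library; `HeadingCollector` and `missing_headings` are hypothetical names, and the sample strings are the input and output quoted in the question.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1>..<h6> element in a document."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buffer = None  # list of text chunks while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headings.append(text)
            self._buffer = None


def missing_headings(html, extracted_text):
    """Return the headings found in `html` that do not appear in `extracted_text`."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    extracted_text = extracted_text or ""  # extract() may return None
    return [h for h in collector.headings if h not in extracted_text]


sample_html = """
<body>
<h2>Understanding Git Undo Mechanisms</h2>
<p>In software development, erroneous code commits are common occurrences.</p>
<h2>Basic Undo Operations: git reset</h2>
<p>git reset is the most commonly used undo command.</p>
</body>
"""
sample_txt = (
    "In software development, erroneous code commits are common occurrences.\n"
    "git reset is the most commonly used undo command."
)
missing = missing_headings(sample_html, sample_txt)
print(missing)
# prints ['Understanding Git Undo Mechanisms', 'Basic Undo Operations: git reset']
```

An empty list means every heading in the source made it into the extracted text.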
