How does trafilatura.fetch_url differ from requests.get?

Answer

jerkzhang

Oct 4, 2025

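trafilatura.fetch_url is a wrapper around a lower-level HTTP client (urllib3 by default, or pycurl when it is installed) that takes care of many details for you. For its intended use case, downloading a page so that its text can be extracted, it is a better fit than requests. requests, however, is the more general-purpose and full-featured library.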

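First, here is a simplified sketch of what the function does (paraphrased, not the verbatim library source, and details vary between versions):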

# ~~~~~~~~ simplified sketch of trafilatura/downloads.py (not verbatim) ~~~~~~~~
import logging
from typing import Optional, Union
from .core import baseline
from .settings import DEFAULT_TIMEOUT
from .utils import decode_response, fetch_response, load_html

LOGGER = logging.getLogger(__name__)

def fetch_url(
    url: str,
    *,
    no_ssl: bool = False,
    timeout: int = DEFAULT_TIMEOUT,
    decode: bool = True,
) -> Optional[Union[str, bytes]]:
    """
    One-shot, blocking page download.
    Returns the decoded HTML string on success and None on failure.
    """
    # 1. Download the page
    response = fetch_response(url, no_ssl=no_ssl, timeout=timeout)
    if response is None or response.status != 200:
        return None
    # 2. Decode (UTF-8 first, then charset detection as a fallback)
    if decode:
        return decode_response(response)
    # 3. Optionally return the raw bytes instead (rarely needed)
    return response.data
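
For comparison, the same "string or None" contract can be approximated with plain requests. This is only a minimal sketch: `fetch_html_or_none` is a hypothetical helper name, the 30-second default is an assumption chosen to mirror the table below, and the encoding fallback is a simple heuristic rather than what trafilatura does internally.

```python
from typing import Optional

import requests


def fetch_html_or_none(url: str, timeout: float = 30.0) -> Optional[str]:
    """Hypothetical helper: return decoded HTML on success, None on any failure."""
    try:
        # requests has no default timeout, so one is always passed explicitly.
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        # Covers invalid URLs, connection errors, timeouts, too many redirects.
        return None
    if response.status_code != 200:
        return None
    # If the server did not declare a charset, let requests guess from the body.
    content_type = response.headers.get("Content-Type", "")
    if "charset" not in content_type.lower():
        response.encoding = response.apparent_encoding
    return response.text
```

With this helper, a call such as `fetch_html_or_none("https://example.org")` either yields a string or `None`, so the caller needs a single `is None` check instead of a `try`/`except` block.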

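| Feature | `trafilatura.fetch_url` | `requests.get` |
| ----------------- | -------------------------------------- | -------------------------------- |
| **Underlying library** | `urllib3` by default (`pycurl` if installed), **wrapped with crawl-oriented defaults** | `requests` itself, which is also built on `urllib3` |
| **User-Agent** | ✅ Sends its own default UA that identifies trafilatura; custom UAs can be set in the configuration | ❌ Defaults to `python-requests/...`, which some sites block |
| **Timeout** | ✅ Has a default timeout (30 seconds in the default configuration) | ❌ No default; `timeout` must be passed explicitly |
| **Encoding** | ✅ Detects the encoding and returns decoded text | Partly automatic: `response.text` relies on the HTTP headers, so `response.encoding` may need to be set by hand |
| **Retries** | ✅ Retries automatically (implemented internally) | ❌ None by default; must be configured by hand |
| **Robots.txt** | ❌ Not checked by `fetch_url` itself; check it separately, e.g. with `urllib.robotparser` | ❌ Not checked |
| **Return type** | An **HTML string** (can be passed straight to `extract`) | A `Response` object |
| **Error handling** | Returns `None` on failure (silently, apart from log messages) | Raises on connection errors; HTTP error statuses need a `status_code` check or `raise_for_status()` |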
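
Several rows of the table can be covered by hand when using requests. The sketch below shows one way to do so: a session with a custom User-Agent and a retry policy, plus a robots.txt check from the standard library. The names `build_session`, `allowed_by_robots`, and `USER_AGENT`, the retry count, and the status codes are all illustrative assumptions, not values taken from trafilatura.

```python
import urllib.robotparser

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical UA string; replace it with one that identifies your own crawler.
USER_AGENT = "my-crawler/0.1 (+https://example.org/bot)"


def build_session(user_agent: str = USER_AGENT, retries: int = 2) -> requests.Session:
    """Return a Session with a custom User-Agent and automatic retries."""
    retry = Retry(
        total=retries,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules that have already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A session has no default timeout, so it still has to be passed on every call, for example `build_session().get(url, timeout=30)`.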
