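tornado.http1connection.HTTP1Connection: Message Parsing Implementation

Introduction

HTTP1Connection is an abstraction of an HTTP/1.x connection. It can act as a client, sending requests and parsing responses, or as a server, receiving requests and sending responses. This article focuses on how HTTP1Connection parses request and response data. Sending requests and writing responses involves the write path, which will be covered in a later article.

Before diving into the code, here is a brief review of HTTP/1.x to make the implementation easier to follow.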

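A Brief Introduction to HTTP/1.x

HTTP is an application-layer protocol. The protocol itself does not prescribe which layers it runs on; in fact, HTTP can be implemented on top of any Internet protocol or on other networks. HTTP only assumes that the underlying protocol provides reliable transport, so any protocol that offers this guarantee can carry it. In the TCP/IP protocol suite, HTTP uses TCP as its transport layer.

Versions

HTTP has several versions, including HTTP/1.x and HTTP/2.0. The most widely used today is HTTP/1.x, which covers HTTP/1.0 and HTTP/1.1. The latter is an upgrade of the former, and the two biggest differences are:

Persistent connections are supported by default: multiple HTTP requests and responses can be sent over one TCP connection, which reduces the cost and latency of opening and closing connections.
The Host request header field is supported, so a web server can use different host names on the same IP and port to serve multiple virtual web sites.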

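Request

Request Message

A request is sent from the client to the server. A request message consists of the following 4 parts (RFC 2616 Request):

Request-Line, in the format: Method SP Request-URI SP HTTP-Version CRLF, e.g. "GET /foo HTTP/1.1".
Request Header Fields, *(( general-header | request-header | entity-header ) CRLF). In HTTP/1.1 every request header except Host is optional.
An empty line, CRLF.
Message body, [ message-body ]

Each header field consists of three parts: a field name, a colon (:), and a field value. The name is case-insensitive, the value may be preceded by any amount of whitespace, and a header field can be extended over multiple lines by starting each continuation line with at least one space or tab.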
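
The message layout described above can be illustrated with a small parser. This is only a sketch under simplifying assumptions: the function name parse_request_head is hypothetical, repeated header names overwrite each other, and no error handling is attempted. It is not how Tornado parses headers.

```python
def parse_request_head(raw: bytes):
    """Split a request head into (method, uri, version) and a header dict."""
    text = raw.decode("latin-1")
    # Everything before the first empty line is the request line plus headers.
    head, _, _ = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, uri, version = lines[0].split(" ")

    headers = {}
    last_name = None
    for line in lines[1:]:
        if line[:1] in (" ", "\t") and last_name is not None:
            # Continuation line: it extends the previous field value.
            headers[last_name] += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        last_name = name.strip().lower()  # field names are case-insensitive
        headers[last_name] = value.strip()  # leading whitespace is not significant
    return (method, uri, version), headers


raw = (b"GET /foo HTTP/1.1\r\n"
       b"Host: example.com\r\n"
       b"X-Long:   first part\r\n"
       b"\tsecond part\r\n"
       b"\r\n")
start_line, headers = parse_request_head(raw)
print(start_line, headers)
```

Lower-casing the names is one simple way to get case-insensitive lookups; a real implementation would more likely use a dedicated case-insensitive mapping.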

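Request Methods (Method)

RFC 2616 defines 8 request methods for HTTP/1.1 that operate on a resource in different ways: OPTIONS/GET/HEAD/POST/PUT/DELETE/TRACE/CONNECT (PATCH was added later by RFC 5789). Method names are case-sensitive; see RFC 2616 Request Method for the exact definitions. When the target resource does not support the request method, the server should return status code 405 (Method Not Allowed); when the server does not recognize or does not implement the method, it should return 501 (Not Implemented).

An HTTP server should implement at least GET and HEAD; all other methods are optional. GET and HEAD have no significance other than retrieving information about a resource, so in theory they are "safe" (in practice the outcome depends on the server implementation).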

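Response

Response Message

After receiving and processing a client request, the server replies with a response message, which consists of the following 4 parts (RFC 2616 Response):

Status-Line, in the format: HTTP-Version SP Status-Code SP Reason-Phrase CRLF, e.g. "HTTP/1.1 200 OK".
Response Header Fields, *(( general-header | response-header | entity-header ) CRLF).
An empty line, CRLF.
Message body, [ message-body ]

Status Code

The first line of every HTTP response is the Status-Line, which contains the HTTP version, a 3-digit status code, and a phrase describing the status, separated by spaces.

The first digit of the status code indicates the class of the response. There are currently 5 classes:

1xx Informational: the request has been received by the server and processing continues
2xx Success: the request has been successfully received, understood, and accepted by the server
3xx Redirection: further action is needed to complete the request
4xx Client Error: the request contains bad syntax or cannot be fulfilled
5xx Server Error: the server failed while handling a valid request

Although RFC 2616 recommends reason phrases such as "200 OK" and "404 Not Found", web server developers are free to choose their own phrases, for example to show localized status descriptions or custom messages.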
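
As an illustration of the Status-Line format and the five status classes, here is a minimal sketch. The names parse_status_line and STATUS_CLASSES are hypothetical and chosen for this example only; Tornado's own parser lives in httputil.parse_response_start_line and is not reproduced here.

```python
import re

# First digit of the status code -> response class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def parse_status_line(line: str):
    """Parse 'HTTP-Version SP Status-Code SP Reason-Phrase' (without CRLF)."""
    match = re.match(r"(HTTP/1\.[0-9]) ([0-9]{3})(?: (.*))?$", line)
    if match is None:
        raise ValueError("malformed status line: %r" % line)
    version = match.group(1)
    code = int(match.group(2))
    reason = match.group(3) or ""  # the phrase is free-form text
    return version, code, reason, STATUS_CLASSES.get(code // 100, "unknown")


print(parse_status_line("HTTP/1.1 404 Not Found"))
print(parse_status_line("HTTP/1.1 200 Everything is fine"))
```

Because the reason phrase is captured as free text, a server-specific or localized phrase does not affect how the code is classified.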

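Keep-Alive Persistent Connections

Persistent connections were introduced with HTTP/1.1. With HTTP/1.0, a client can ask the server to keep the connection open by adding the request header Connection: keep-alive. In HTTP/1.1 persistent connections are the default, unless the request explicitly carries the header Connection: close.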
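
The rule above can be written down as a small predicate. This is a simplified sketch: the name wants_keep_alive is hypothetical, headers are assumed to be a plain dict keyed by the exact name "Connection", and Tornado's own _can_keep_alive method takes further conditions into account that are not modeled here.

```python
def wants_keep_alive(version: str, headers: dict) -> bool:
    """Decide whether the connection should stay open after this message."""
    # The Connection header may carry several comma-separated tokens.
    tokens = {t.strip().lower()
              for t in headers.get("Connection", "").split(",")}
    if version == "HTTP/1.1":
        # Persistent by default; only an explicit "close" turns it off.
        return "close" not in tokens
    # HTTP/1.0: non-persistent unless the peer asks for keep-alive.
    return "keep-alive" in tokens


print(wants_keep_alive("HTTP/1.1", {}))
print(wants_keep_alive("HTTP/1.1", {"Connection": "close"}))
print(wants_keep_alive("HTTP/1.0", {"Connection": "Keep-Alive"}))
```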

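Because multiple HTTP requests reuse the same TCP connection, and requests may be sent before earlier responses arrive (pipelining), the boundary of each request or response must be identifiable. HTTP/1.1 relies on the entity header Content-Length for this: it gives the length of the message body, so the boundary of a request or response can be determined exactly from its value.

The Content-Length entity header implies that the sender must know the length of the whole message before sending the request or response (so-called buffer mode). For client requests this is not a problem, but a server often cannot obtain the length so easily. For example, when the data comes from a file or is generated dynamically, knowing its length means allocating a large enough buffer in memory and computing the length only after all the content has been generated, which clearly adds overhead and latency. To solve this problem, HTTP/1.1 introduced chunked transfer encoding, enabled through the general header Transfer-Encoding: chunked.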

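Chunked Transfer Encoding (Transfer-Encoding: chunked)

Chunked transfer encoding lets the server send data without knowing its total size in advance: the data is split into a series of chunks, which are sent one or several at a time. The chunks are usually the same size, but not always.

If the Transfer-Encoding header of an HTTP message (request or response) is chunked, the message body consists of an unspecified number of chunks and ends with a final chunk of size 0.

Each non-empty chunk starts with the number of bytes of data it contains (in hexadecimal), followed by a CRLF, then the data itself, and finally a closing CRLF. Some implementations pad the space between the chunk size and the CRLF with white space (0x20).
The last chunk consists of the chunk size (0), some optional padding white space, and a CRLF. It contains no data, but it may be followed by an optional trailer of message header fields, and the message ends with a final CRLF.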

*******************************************************
 HTTP/1.1 200 OK
 Content-Type: text/plain
 Transfer-Encoding: chunked
 (blank line)
 25
 This is the data in the first chunk
 1C
 and this is the second one
 3
 con
 8
 sequence
 0
 (blank line)
********************************************************
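
To make the framing in the example above concrete, the following sketch decodes the same body from raw bytes. The function name decode_chunked is hypothetical, and the sketch ignores chunk extensions and trailers. Note that the sizes 25 and 1C are hexadecimal (37 and 28): in this example each of the first two chunks counts a line break that is part of the chunk data itself, in addition to the CRLF that terminates the chunk.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode a complete chunked body held in memory (no trailer support)."""
    decoded = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end].strip(), 16)  # chunk size is hexadecimal
        pos = line_end + 2
        if size == 0:
            break  # the zero-size chunk marks the end of the body
        decoded += body[pos:pos + size]
        pos += size
        if body[pos:pos + 2] != b"\r\n":
            raise ValueError("chunk not terminated by CRLF")
        pos += 2
    return bytes(decoded)


wire = (b"25\r\nThis is the data in the first chunk\r\n\r\n"
        b"1C\r\nand this is the second one\r\n\r\n"
        b"3\r\ncon\r\n"
        b"8\r\nsequence\r\n"
        b"0\r\n\r\n")
decoded = decode_chunked(wire)
print(decoded.decode("ascii"))
```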

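Chunked transfer encoding in HTTP/1.1 brings the following benefits:

It lets the server keep an HTTP persistent connection open for dynamically generated content. Normally a persistent connection requires the server to send the Content-Length header field before it starts sending the body, but for dynamically generated content the length is unknown until the content has been created.
It lets the server send header fields at the end of the message. This matters when a header value cannot be known before the content has been generated, for example when the content is signed with a hash and the hash is transmitted in an HTTP header field. Without chunked transfer encoding, the server would have to buffer the content until it is complete, compute the header values, and send those headers before sending the content.
HTTP servers sometimes use compression (gzip or deflate) to shorten transfer time. Chunked transfer encoding can be used to delimit the parts of a compressed object. In this case the chunks are not compressed individually; the whole payload is compressed, and the compressed output is sent in chunks using the scheme described here. With compression, chunked encoding makes it possible to send data while compressing, instead of finishing compression first just to learn the compressed size.

Reference:
Chunked transfer encoding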

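HTTP Message Compression

HTTP supports compressing the message body for transfer (headers are not compressed), which reduces the amount of data on the network and improves transfer efficiency. This is done through the HTTP content-coding header fields; at the protocol level, HTTP message compression is one kind of content coding.

On the client side, the request header Accept-Encoding tells the server whether the client supports compression and which formats it accepts. For example, "Accept-Encoding: gzip, deflate, sdch" means the client supports the gzip, deflate, and sdch formats.

On the server side, the response header Content-Encoding tells the client whether the response data is compressed and in which format. For example, "Content-Encoding: gzip" means gzip compression is used. If there is no Content-Encoding header, or it is "Content-Encoding: identity", the message has not been encoded and therefore is not compressed.

Obviously, the encoding (compression format) used in the server's response must be one that the client supports, as declared by the Accept-Encoding value of the request.

Note on how the common formats gzip and deflate relate: in HTTP content coding, deflate means zlib. gzip and zlib are two different container formats. Both compress the data with the deflate algorithm and differ only in the header and trailer that wrap it (which mainly hold file attributes and checksum information). This is why the common open-source compression library "zlib" supports both the gzip and the zlib format.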
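
The difference between the two containers can be seen with Python's standard gzip and zlib modules. The sketch below is illustrative only; the payload and variable names are made up for this example. The wbits argument of zlib.decompressobj selects the container: zlib.MAX_WBITS expects a zlib header, and 16 + zlib.MAX_WBITS expects a gzip header.

```python
import gzip
import zlib

payload = b"hello tornado " * 20

gzip_bytes = gzip.compress(payload)  # deflate data in a gzip container
zlib_bytes = zlib.compress(payload)  # deflate data in a zlib container

# The same deflate algorithm, but different headers around it.
print(gzip_bytes[:2], zlib_bytes[:1])

gzip_decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip container
zlib_decoder = zlib.decompressobj(zlib.MAX_WBITS)       # zlib container

from_gzip = gzip_decoder.decompress(gzip_bytes) + gzip_decoder.flush()
from_zlib = zlib_decoder.decompress(zlib_bytes) + zlib_decoder.flush()
print(from_gzip == payload, from_zlib == payload)
```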

HTTP1Connection

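read_response is the entry point for HTTP message parsing. Judging by its name, the method only deals with responses, but as the HTTP/1.x message formats above show, requests and responses share the same structure, so the implementation of read_response handles the parsing of both.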

def read_response(self, delegate):
    if self.params.decompress:
        delegate = _GzipMessageDelegate(delegate, self.params.chunk_size)
    return self._read_message(delegate)

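delegate is an HTTPMessageDelegate. If HTTP gzip compression is enabled, it is wrapped once more in a _GzipMessageDelegate. _GzipMessageDelegate was introduced earlier; internally, decompression is done by GzipDecompressor and the zlib module. Python's zlib module supports both gzip and zlib, but its API is somewhat obscure: for gzip, the decompressor instance has to be created like this (how-can-i-decompress-a-gzip-stream-with-zlib): zlib.decompressobj(16 + zlib.MAX_WBITS).

All of the HTTP message parsing logic is encapsulated in _read_message.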

@gen.coroutine
def _read_message(self, delegate):
    need_delegate_close = False
    try:
        # The headers and the body are separated by a blank line
        header_future = self.stream.read_until_regex(
            b"\r?\n\r?\n",
            max_bytes=self.params.max_header_size)
        if self.params.header_timeout is None:
            header_data = yield header_future
        else:
            try:
                header_data = yield gen.with_timeout(
                    self.stream.io_loop.time() + self.params.header_timeout,
                    header_future,
                    io_loop=self.stream.io_loop)
            except gen.TimeoutError:
                self.close()
                raise gen.Return(False)
        # Parse the headers, separating the header fields from the start line
        # (request-line/status-line)
        start_line, headers = self._parse_headers(header_data)
        # As a client we parse the server's response; as a server we parse the client's request.
        # The start_line (status-line/request-line) of a response and a request have different fields:
        # 1. response's status-line: HTTP-Version SP Status-Code SP Reason-Phrase CRLF
        # 2. request's request-line: Method SP Request-URI SP HTTP-Version CRLF
        # The parsed start_line is a namedtuple.
        if self.is_client:
            start_line = httputil.parse_response_start_line(start_line)
            self._response_start_line = start_line
        else:
            start_line = httputil.parse_request_start_line(start_line)
            self._request_start_line = start_line
            self._request_headers = headers

        # For a request or response that is not keep-alive, the connection must be
        # closed once it has been handled.
        self._disconnect_on_finish = not self._can_keep_alive(
            start_line, headers)
        need_delegate_close = True
        with _ExceptionLoggingContext(app_log):
            header_future = delegate.headers_received(start_line, headers)
            if header_future is not None:
                # If header_future is a `Future` instance, wait for it to complete
                # before reading the body.
                yield header_future
        # Presumably the delegate detached the stream in headers_received
        # (e.g. for a WebSocket upgrade).
        if self.stream is None:
            # We've been detached.
            need_delegate_close = False
            raise gen.Return(False)
        skip_body = False
        if self.is_client:
            # As a client, if the request was a HEAD request, the server response
            # should carry no message body
            if (self._request_start_line is not None and
                    self._request_start_line.method == 'HEAD'):
                skip_body = True
            code = start_line.code
            if code == 304:
                # 304 responses may include the content-length header
                # but do not actually have a body.
                # http://tools.ietf.org/html/rfc7230#section-3.3
                skip_body = True
            if code >= 100 and code < 200:
                # 1xx responses should never indicate the presence of
                # a body.
                if ('Content-Length' in headers or
                    'Transfer-Encoding' in headers):
                    raise httputil.HTTPInputError(
                        "Response code %d cannot have body" % code)
                # TODO: client delegates will get headers_received twice
                # in the case of a 100-continue.  Document or change?
                yield self._read_message(delegate)
        else:
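            # The 100-continue status was introduced in HTTP/1.1 to make transfers
            # more efficient. When a client needs to POST a large amount of data to
            # a web server, it can send Expect: 100-continue with the request. If
            # the server accepts the request it replies ``HTTP/1.1 100 (Continue)``
            # and the client goes on to send the request body; otherwise it replies
            # ``HTTP/1.1 417 Expectation Failed`` and the client gives up sending
            # the remaining data. (Note: the Expect header field indicates specific
            # server behavior required by the client; it is defined with an
            # extensible syntax.)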
            if (headers.get("Expect") == "100-continue" and
                    not self._write_finished):
                self.stream.write(b"HTTP/1.1 100 (Continue)\r\n\r\n")
        if not skip_body:
            body_future = self._read_body(
                start_line.code if self.is_client else 0, headers, delegate)
            if body_future is not None:
                if self._body_timeout is None:
                    yield body_future
                else:
                    try:
                        yield gen.with_timeout(
                            self.stream.io_loop.time() + self._body_timeout,
                            body_future, self.stream.io_loop)
                    except gen.TimeoutError:
                        gen_log.info("Timeout reading body from %s",
                                     self.context)
                        self.stream.close()
                        raise gen.Return(False)
        self._read_finished = True
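        # In client mode, it is appropriate to call HTTPMessageDelegate.finish() as
        # soon as the response has been parsed.
        # In server mode, _write_finished tells whether the response has been fully
        # sent; calling HTTPMessageDelegate.finish() is appropriate only before that.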
        if not self._write_finished or self.is_client:
            need_delegate_close = False
            with _ExceptionLoggingContext(app_log):
                delegate.finish()
        # If we're waiting for the application to produce an asynchronous
        # response, and we're not detached, register a close callback
        # on the stream (we didn't need one while we were reading)
        #
        # NOTE:_finish_future resolves when all data has been written and flushed
        # to the IOStream.
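        # Wait until the asynchronous response is complete and all data has been
        # written to the fd before continuing; see the _finish_request/finish
        # methods for details.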
        if (not self._finish_future.done() and
                self.stream is not None and
                not self.stream.closed()):
            self.stream.set_close_callback(self._on_connection_close)
            yield self._finish_future
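        # In client mode, close the connection after handling the response if it is
        # not keep-alive.
        # In server mode, the connection may only be closed after the response is
        # complete; see the _finish_request/finish methods for details.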
        if self.is_client and self._disconnect_on_finish:
            self.close()
        if self.stream is None:
            raise gen.Return(False)
    except httputil.HTTPInputError as e:
        gen_log.info("Malformed HTTP message from %s: %s",
                     self.context, e)
        self.close()
        raise gen.Return(False)
    finally:
        # If the connection is "closed" before the request has been fully handled
        # (before HTTPMessageDelegate.finish() was called), call
        # HTTPMessageDelegate.on_connection_close()
        if need_delegate_close:
            with _ExceptionLoggingContext(app_log):
                delegate.on_connection_close()
        self._clear_callbacks()
    raise gen.Return(True)
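
The client-side branch that decides whether a body should be read at all can be isolated into a small pure function. The sketch below mirrors only the conditions visible in the listing above; the function name should_skip_body is hypothetical, headers are assumed to be a plain dict, and ValueError stands in for httputil.HTTPInputError.

```python
def should_skip_body(request_method, code, headers):
    """Return True when a client should not read a body for this response."""
    skip_body = False
    if request_method == "HEAD":
        # A response to HEAD carries headers only.
        skip_body = True
    if code == 304:
        # 304 may include Content-Length but has no actual body.
        skip_body = True
    if 100 <= code < 200:
        # 1xx responses must never announce a body.
        if "Content-Length" in headers or "Transfer-Encoding" in headers:
            raise ValueError("Response code %d cannot have body" % code)
    return skip_body


print(should_skip_body("HEAD", 200, {"Content-Length": "42"}))
print(should_skip_body("GET", 304, {"Content-Length": "42"}))
print(should_skip_body("GET", 200, {"Content-Length": "42"}))
```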

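The _read_body method reads the body of an HTTP message. Following the HTTP protocol described above, reading the body falls into 3 cases:

Non-persistent connection: the body ends when the connection is closed;
Persistent connection with the body length given by Content-Length;
Persistent connection with the body sent in chunks using Transfer-Encoding: chunked.

These 3 cases are handled by the methods _read_body_until_close, _read_fixed_body, and _read_chunked_body respectively, as the following code shows: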

def _read_body(self, code, headers, delegate):
    if "Content-Length" in headers:
        if "," in headers["Content-Length"]:
            # Proxies sometimes cause Content-Length headers to get
            # duplicated.  If all the values are identical then we can
            # use them but if they differ it's an error.
            pieces = re.split(r',\s*', headers["Content-Length"])
            if any(i != pieces[0] for i in pieces):
                raise httputil.HTTPInputError(
                    "Multiple unequal Content-Lengths: %r" %
                    headers["Content-Length"])
            headers["Content-Length"] = pieces[0]
        content_length = int(headers["Content-Length"])

        if content_length > self._max_body_size:
            raise httputil.HTTPInputError("Content-Length too long")
    else:
        content_length = None

    # 204 No Content: the server has fulfilled the request but the response carries
    # no message-body; header fields may still return updated metadata.
    if code == 204:
        # This response code is not allowed to have a non-empty body,
        # and has an implicit length of zero instead of read-until-close.
        # http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.3
        if ("Transfer-Encoding" in headers or
                content_length not in (None, 0)):
            raise httputil.HTTPInputError(
                "Response with code %d should not have body" % code)
        content_length = 0

    # Persistent connection: Content-Length or Transfer-Encoding
    if content_length is not None:
        return self._read_fixed_body(content_length, delegate)
    if headers.get("Transfer-Encoding") == "chunked":
        return self._read_chunked_body(delegate)
    # Non-persistent connection
    if self.is_client:
        return self._read_body_until_close(delegate)
    return None

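The code applies some extra handling to Content-Length in special cases:

Some proxies may cause the Content-Length value to be duplicated, so the values are checked for consistency, and differing values are treated as an error;
When the HTTP status code is 204 the message must not have a body, so a Transfer-Encoding header or a non-zero Content-Length is rejected.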
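
The decision made by _read_body can be summarized as a pure function that returns which framing applies. This is a sketch for illustration only: the name choose_body_framing, the returned labels, and the default max_body_size are assumptions, headers are a plain dict, and ValueError stands in for httputil.HTTPInputError.

```python
import re


def choose_body_framing(code, headers, is_client,
                        max_body_size=100 * 1024 * 1024):
    """Return (mode, length) where mode is fixed/chunked/until_close/none."""
    content_length = None
    if "Content-Length" in headers:
        # Duplicated values such as "5, 5" are accepted only if all are equal.
        pieces = re.split(r",\s*", headers["Content-Length"])
        if any(p != pieces[0] for p in pieces):
            raise ValueError("Multiple unequal Content-Lengths")
        content_length = int(pieces[0])
        if content_length > max_body_size:
            raise ValueError("Content-Length too long")

    if code == 204:
        # 204 has an implicit length of zero and must not announce a body.
        if "Transfer-Encoding" in headers or content_length not in (None, 0):
            raise ValueError("Response with code 204 should not have body")
        content_length = 0

    if content_length is not None:
        return ("fixed", content_length)
    if headers.get("Transfer-Encoding") == "chunked":
        return ("chunked", None)
    if is_client:
        return ("until_close", None)  # body ends when the server closes
    return ("none", None)  # a request without framing headers has no body


print(choose_body_framing(200, {"Content-Length": "5, 5"}, is_client=True))
print(choose_body_framing(200, {"Transfer-Encoding": "chunked"}, is_client=True))
print(choose_body_framing(200, {}, is_client=True))
print(choose_body_framing(0, {}, is_client=False))
```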

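The _read_body_until_close and _read_fixed_body methods are both very simple and just delegate to the corresponding stream methods. _read_chunked_body has to parse the body according to the Transfer-Encoding: chunked message format described earlier, which makes it slightly more complex. Its code is shown below: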

@gen.coroutine
def _read_chunked_body(self, delegate):
    # TODO: "chunk extensions" http://tools.ietf.org/html/rfc2616#section-3.6.1
    #
    # *************************** chunk extensions *************************
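    # With chunked transfer encoding, the message body consists of an unspecified
    # number of chunks and ends with a final chunk of size 0.
    # 1. Each non-empty chunk starts with the number of bytes of data it contains
    # (in hexadecimal), followed by a CRLF, then the data itself, and a closing CRLF.
    # Some implementations pad the space between the chunk size and the CRLF with
    # white space (0x20).
    # 2. The last chunk consists of the chunk size (0), some optional padding white
    # space, and a CRLF. It contains no data, but may be followed by an optional
    # trailer of message header fields (note: the code below does not support the
    # optional trailer), and ends with a final CRLF.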
    # ----------------------------eg. start--------------------------------
    # HTTP/1.1 200 OK\r\n
    # Content-Type: text/plain\r\n
    # Transfer-Encoding: chunked\r\n
    # \r\n
    # 25\r\n
    # This is the data in the first chunk\r\n
    # 1C\r\n
    # and this is the second one\r\n
    # 3\r\n
    # con\r\n
    # 8\r\n
    # sequence\r\n
    # 0\r\n
    # \r\n
    # ----------------------------eg. end--------------------------------
    # **********************************************************************
    total_size = 0
    while True:
        chunk_len = yield self.stream.read_until(b"\r\n", max_bytes=64)
        chunk_len = int(chunk_len.strip(), 16)
        if chunk_len == 0:
            return
        total_size += chunk_len
        if total_size > self._max_body_size:
            raise httputil.HTTPInputError("chunked body too large")
        bytes_to_read = chunk_len
        while bytes_to_read:
            chunk = yield self.stream.read_bytes(
                min(bytes_to_read, self.params.chunk_size), partial=True)
            bytes_to_read -= len(chunk)
            if not self._write_finished or self.is_client:
                with _ExceptionLoggingContext(app_log):
                    yield gen.maybe_future(delegate.data_received(chunk))
        # chunk ends with \r\n
        crlf = yield self.stream.read_bytes(2)
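        # The optional trailer is announced by the Trailer header field, see
        # http://tools.ietf.org/html/rfc2616#section-14.40.
        # The tornado implementation shown here does not support it. Note that the
        # zero-size last chunk returns above, so a trailer (and the final CRLF) is
        # never read at this point; the assertion below only checks the CRLF that
        # terminates each non-empty chunk. One idea is to detect the last chunk and
        # swallow the optional trailer, e.g. (untested; bytes_to_read is always 0
        # here, so a real fix would have to go into the chunk_len == 0 branch):
        # if bytes_to_read == 0 and crlf != b"\r\n":
        #     yield self.stream.read_until(b"\r\n", max_bytes=self._max_body_size - total_size)
        # else:
        #     assert crlf == b"\r\n"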
        assert crlf == b"\r\n"

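The code above is annotated in detail. Note that this method does not fully implement chunked transfer encoding as specified in the RFC: it does not support the optional trailer that may follow the last chunk.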
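
For comparison, here is a sketch of how a reader could consume the optional trailer after the last chunk. It is a synchronous illustration over an in-memory io.BytesIO, not a patch for Tornado: the function name read_chunked_with_trailers is hypothetical, no size limits are enforced, and chunk extensions after ";" are simply ignored.

```python
import io


def read_chunked_with_trailers(stream):
    """Read a chunked body plus optional trailer fields from a binary stream."""
    body = bytearray()
    while True:
        size_line = stream.readline()
        # Drop any chunk extension ("size;name=value") before parsing the size.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        body += stream.read(size)
        if stream.read(2) != b"\r\n":
            raise ValueError("chunk not terminated by CRLF")

    # After the zero-size chunk: zero or more trailer fields, then an empty line.
    trailers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("latin-1").partition(":")
        trailers[name.strip().lower()] = value.strip()
    return bytes(body), trailers


stream = io.BytesIO(b"3\r\ncon\r\n8\r\nsequence\r\n"
                    b"0\r\nX-Checksum: abc123\r\n\r\nNEXT")
body, trailers = read_chunked_with_trailers(stream)
rest = stream.read()  # bytes belonging to the next message stay untouched
print(body, trailers, rest)
```

Consuming the trailer and the final empty line matters on a persistent connection, because any bytes left unread would otherwise be mistaken for the start of the next message.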