author    ihciah <[email protected]>    2021-12-02 14:38:25 +0800
committer ihciah <[email protected]>    2021-12-02 14:38:25 +0800
commit    9df6a26f2f51aba824ceeee21179fbff4dd42d49 (patch)
tree      031cacffb874f003a313d5fe6094751927efcdf0 /docs
Init: open-source
Diffstat (limited to 'docs')
-rw-r--r--  docs/en/async-communicate.md          17
-rw-r--r--  docs/en/benchmark.md                  66
-rw-r--r--  docs/en/comparing-with-others.md      27
-rw-r--r--  docs/en/configuration.md              73
-rw-r--r--  docs/en/how-intergrate-with-tower.md  11
-rw-r--r--  docs/en/memlock.md                    30
-rw-r--r--  docs/en/platform-support.md           16
-rw-r--r--  docs/en/why-GAT.md                    41
-rw-r--r--  docs/en/why-async-rent.md             35
-rw-r--r--  docs/zh/async-communicate.md          17
-rw-r--r--  docs/zh/benchmark.md                  66
-rw-r--r--  docs/zh/comparing-with-others.md      30
-rw-r--r--  docs/zh/configuration.md              73
-rw-r--r--  docs/zh/how-intergrate-with-tower.md  11
-rw-r--r--  docs/zh/memlock.md                    30
-rw-r--r--  docs/zh/platform-support.md           16
-rw-r--r--  docs/zh/why-GAT.md                    44
-rw-r--r--  docs/zh/why-async-rent.md             35
18 files changed, 638 insertions, 0 deletions
diff --git a/docs/en/async-communicate.md b/docs/en/async-communicate.md
new file mode 100644
index 0000000..d523e93
--- /dev/null
+++ b/docs/en/async-communicate.md
@@ -0,0 +1,17 @@
+---
+title: How to communicate within a thread or between threads
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# How to communicate within a thread or between threads
+
+Cross-thread and cross-task communication are very common operations. For example, if you want to collect statistics, the best-performing approach is to gather and aggregate them within each thread, and only merge the per-thread results at the end.
+
+## Same-thread asynchronous communication
+For asynchronous communication within the same thread, you can use [local-sync](https://crates.io/crates/local-sync), a crate we implemented. It provides mpsc (bounded/unbounded), once cell, oneshot and semaphore implementations. Since it has no cross-thread Sync + Send implementations, its internal data structures need no synchronization primitives such as atomics or Mutex, which makes it more efficient.
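+
+As a minimal sketch of per-thread aggregation over such a channel (the module paths below are assumptions based on the feature list above, not a verified API; check the local-sync docs before relying on them):
+
+```rust
+// Sketch only: assumes local-sync exposes an unbounded mpsc channel at
+// `local_sync::mpsc::unbounded::channel()`; adjust to the real paths.
+use local_sync::mpsc::unbounded;
+
+#[monoio::main]
+async fn main() {
+    let (tx, mut rx) = unbounded::channel::<u64>();
+
+    // Both tasks run on the same thread, so no Send/Sync bounds are needed.
+    monoio::spawn(async move {
+        for i in 0..3 {
+            let _ = tx.send(i);
+        }
+    });
+
+    let mut sum = 0;
+    while let Some(v) = rx.recv().await {
+        sum += v;
+    }
+    println!("per-thread sum = {}", sum);
+}
+```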
+
+## Cross-thread communication
+By default, cross-thread asynchronous communication (that is, being able to `await` something created by another thread) is not supported, but you can still use atomics and similar primitives to communicate across threads.
+
+To enable cross-thread asynchronous communication, enable the `sync` feature. We do not ship such an implementation ourselves; you can use any existing one, such as [async-channel](https://crates.io/crates/async-channel), or the one provided by Tokio (this requires adding the tokio dependency; enabling only its `sync` feature is enough, since it is not published as a separate crate).
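+
+As a rough sketch of the cross-thread case (assuming Monoio is built with its `sync` feature so the task can be woken from another thread):
+
+```rust
+// Sketch: an OS thread produces a value, a Monoio task consumes it through
+// the `async-channel` crate.
+use async_channel::unbounded;
+
+#[monoio::main]
+async fn main() {
+    let (tx, rx) = unbounded::<u32>();
+
+    std::thread::spawn(move || {
+        // `try_send` never reports `Full` on an unbounded channel.
+        tx.try_send(42).unwrap();
+    });
+
+    // This await may be completed by a wake-up coming from the producer thread.
+    println!("got {}", rx.recv().await.unwrap());
+}
+```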
diff --git a/docs/en/benchmark.md b/docs/en/benchmark.md
new file mode 100644
index 0000000..2daa728
--- /dev/null
+++ b/docs/en/benchmark.md
@@ -0,0 +1,66 @@
+---
+title: Performance test and comparison
+date: 2021-12-01 15:50:00
+author: ihciah
+---
+
+# Performance test data and comparison
+
+To measure Monoio's performance, we selected two representative runtimes to compare it against: Tokio and Glommio.
+
+## Testing environment
+The tests were carried out on the ByteDance production network, with the load-generating client and the server running on different physical machines.
+
+Server information:
+> Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
+>
+> Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
+>
+> Linux 5.15.4-arch1-1 #1 SMP PREEMPT Sun, 21 Nov 2021 21:34:33 +0000 x86_64 GNU/Linux
+>
+> rust nightly-2021-11-26
+
+## Testing tools
+The testing tool is written in Rust and built on Monoio.
+
+You can find its source code [here](https://github.com/monoio-rs/monoio-benchmark).
+
+## Testing data
+
+### Extreme performance testing
+In this test we start a fixed number of connections from the client side; the more connections, the higher the load on the server. The goal is to probe the peak performance of the system.
+
+1 Core | 4 Cores
+:-------------------------:|:-------------------------:
+![1core](/.github/resources/benchmark/monoio-bench-1C.png) | ![4cores](/.github/resources/benchmark/monoio-bench-4C.png)
+
+8 Cores | 16 Cores
+:-------------------------:|:-------------------------:
+![8cores](/.github/resources/benchmark/monoio-bench-8C.png) | ![16cores](/.github/resources/benchmark/monoio-bench-16C.png)
+
+With a single core and very few connections, Monoio's latency is higher than Tokio's, which results in lower throughput. The latency gap comes from the difference between io-uring and epoll.
+
+In every other scenario, Monoio outperforms both Tokio and Glommio. Tokio's average peak performance per core drops as the number of cores increases, while Monoio's peak performance has the best horizontal scalability.
+
+With a single core, Monoio performs slightly better than Tokio; with 4 cores its peak performance is about twice Tokio's, and with 16 cores it is close to 3 times. Glommio uses the same model as Monoio, so it also scales well horizontally, but its peak performance still lags behind Monoio's.
+
+![100B](/.github/resources/benchmark/monoio-bench-100B.png)
+We use a 100-byte message size to test peak performance under different core counts (1 KiB messages saturate the network card once more cores are used). Monoio and Glommio stay close to linear, while Tokio shows very little improvement, or even degradation, as cores are added.
+
+### Fixed-load testing
+In production we never drive the server to saturation, so performance under a constant load is also very meaningful to measure.
+
+1 Core * 80 Connections | 4 Cores * 80 Connections
+:-------------------------:|:-------------------------:
+![1core*80](/.github/resources/benchmark/monoio-bench-1C-80conn-qps.png) | ![4cores*80](/.github/resources/benchmark/monoio-bench-4C-80conn-qps.png)
+
+1 Core * 250 Connections | 4 Cores * 250 Connections
+:-------------------------:|:-------------------------:
+![1core*250](/.github/resources/benchmark/monoio-bench-1C-250conn-qps.png) | ![4cores*250](/.github/resources/benchmark/monoio-bench-4C-250conn-qps.png)
+
+As with the previous results, Tokio has a latency advantage over the uring-based Glommio and Monoio when the connection count is small, but Monoio still has the lowest CPU consumption.
+
+As the number of connections increases, Monoio has the lowest latency and CPU usage.
+
+## Reference data
+You can find the raw benchmark data [here](/.github/resources/benchmark/raw_data.txt).
diff --git a/docs/en/comparing-with-others.md b/docs/en/comparing-with-others.md
new file mode 100644
index 0000000..e8dedaa
--- /dev/null
+++ b/docs/en/comparing-with-others.md
@@ -0,0 +1,27 @@
+---
+title: Comparing with Others
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Comparing with Others
+
+There are indeed many runtimes like ours, so what's the difference?
+
+## Mio and Tokio
+Mio and Tokio do not use io-uring; instead, they use epoll-like mechanisms.
+
+Epoll/kqueue is a mechanism that lets a program monitor multiple file descriptors for events: it tells you which fd is ready to read or write. You still have to perform the read or write yourself afterwards, which means extra context switches between user space and kernel space.
+
+Mio is a library that abstracts epoll/kqueue/IOCP across platforms. Tokio is built on top of Mio and provides async IO and cross-thread task scheduling. Tokio is suitable for most common applications and environments.
+
+## Tokio-uring
+How can io-uring be used with Tokio? Tokio-uring manages fds and provides an IO interface that takes ownership of the buffer.
+
+However, it still uses epoll to receive notifications for the uring fd, which is not strictly necessary. On the other hand, tokio-uring reuses most of Tokio's machinery, so it can be seen as a compromise made for compatibility.
+
+## Glommio
+Glommio is a full-featured library for building common applications on io-uring. For compatibility, Glommio copies data, which is unnecessary, and we also found parts of its implementation to be inefficient.
+
+## Performance
+We compared performance with Tokio and Glommio in multiple scenarios. Detailed comparison data can be found [here](/docs/en/benchmark.md).
diff --git a/docs/en/configuration.md b/docs/en/configuration.md
new file mode 100644
index 0000000..5e4475e
--- /dev/null
+++ b/docs/en/configuration.md
@@ -0,0 +1,73 @@
+---
+title: Configuration Guide
+date: 2021-11-26 14:00:00
+author: ihciah
+---
+
+# Configuration Guide
+
+This section describes the configurable options and some of the default behavior inside Monoio.
+
+## Runtime Configuration
+In the current version, there are two main settings you can change at runtime.
+1. entries
+
+    entries is the ring size of io-uring; the default is `1024`, and you can specify the value when creating the runtime. Note that for performance reasons, values below 256 are raised to 256. When your QPS is high, a larger entries value enlarges the ring and reduces the number of submits, which significantly lowers syscall overhead, but it also costs some memory, so choose it sensibly.
+
+    entries also determines the initial size of the in-flight op cache, which defaults to `10 * entries` and currently cannot be configured separately. Again, to avoid frequent growth of this cache, choose entries carefully.
+
+    Specify it when creating a runtime:
+    ```rust
+    RuntimeBuilder::new().with_entries(32768).build()
+    ```
+    Or via the macro:
+    ```rust
+    #[monoio::main(entries = 32768)]
+    async fn main() {
+        // ...
+    }
+    ```
+
+2. enable_timer
+
+    enable_timer controls whether the timer driver is enabled; it is off by default. If you have asynchronous timing needs you must enable it, otherwise those operations will panic.
+
+    What counts as an asynchronous timing need? Anything that uses this crate's time module, such as asynchronous sleep or tick. Simply calling the standard library to read the current time does not count.
+
+    Specify it when creating a runtime:
+    ```rust
+    RuntimeBuilder::new().enable_timer().build()
+    ```
+    Or via the macro:
+    ```rust
+    #[monoio::main(timer_enabled)]
+    async fn main() {
+        // ...
+    }
+    ```
+
+## Compile-time configuration
+There are also some compile-time features that affect runtime behavior.
+1. async-cancel
+
+    async-cancel is enabled by default. With this feature on, dropping a Future pushes a CancelOp into the io-uring to try to cancel the corresponding Op, which may bring some performance improvement. Note that even so we cannot guarantee the Op is actually cancelled, so if you do something like select {read, timeout} and still need to read later, be sure to keep the read Future around.
+
+2. zero-copy
+
+    zero-copy is disabled by default. When enabled, sockets are created with the SOCK_ZEROCOPY flag and sends carry an additional MSG_ZEROCOPY flag. This reduces memory copies but was not stable in our tests; if you want to enable it, make sure your system behaves correctly under stress testing.
+
+3. macros
+
+    macros is enabled by default. With this feature on, you can use macros such as `#[monoio::main]` instead of the `RuntimeBuilder` constructor.
+
+4. sync
+
+    sync is disabled by default. This feature allows Futures to be shared across runtime threads; the most common use case is creating a cross-thread channel for inter-thread communication. Since this is a thread-per-core runtime, heavy use of this approach on the hot path is not recommended.
+
+5. utils
+
+    utils is enabled by default. Currently it contains a single utility, which lets you set the CPU affinity of threads.
+
+6. debug
+
+    debug is disabled by default. When enabled, it prints some debugging information at runtime. It is only meant for debugging during runtime development and is not recommended in production.
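+
+As a rough sketch of how non-default features are selected (the version number below is only a placeholder), they are enabled from your `Cargo.toml`:
+
+```
+[dependencies]
+# Placeholder version; pin to the release you actually use.
+monoio = { version = "0.0", features = ["sync", "zero-copy"] }
+```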
diff --git a/docs/en/how-intergrate-with-tower.md b/docs/en/how-intergrate-with-tower.md
new file mode 100644
index 0000000..300d62a
--- /dev/null
+++ b/docs/en/how-intergrate-with-tower.md
@@ -0,0 +1,11 @@
+---
+title: How to Integrate with Tower
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# How to Integrate with Tower
+
+With GAT, service futures carry lifetimes, so we may need to use Tower in a different way.
+
+TODO \ No newline at end of file
diff --git a/docs/en/memlock.md b/docs/en/memlock.md
new file mode 100644
index 0000000..b47522f
--- /dev/null
+++ b/docs/en/memlock.md
@@ -0,0 +1,30 @@
+---
+title: Set memlock limit
+date: 2021-11-26 14:00:00
+author: ihciah
+---
+
+# Set memlock limit
+io-uring needs memory shared between user mode and kernel mode, such as the rings themselves or registered buffers.
+
+Many default kernel configurations come with a small memlock limit, such as 64 (meaning 64 KiB). We need a larger memlock limit to work properly (if you manually specify the ring size, you may need to make sure the resulting size is still within the limit).
+
+To view the current limit, use `ulimit -l` (if you have just changed the configuration, you need to log in again for it to take effect):
+```
+❯ ulimit -l
+unlimited
+```
+
+To change this limit globally, edit `/etc/security/limits.conf` and add the following two lines:
+```
+* hard memlock unlimited
+* soft memlock unlimited
+```
+
+If you only want the change to apply to the current session, consider running, as root, `ulimit -Sl unlimited && ulimit -Hl unlimited your_cmd`.
+
+Under systemd, you can set the memlock limit with the `LimitMEMLOCK` option; see `/etc/systemd/user.conf` and `/etc/systemd/system.conf`.
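+
+For a single service, a drop-in override is usually enough; a minimal sketch (the unit name is just a placeholder):
+
+```
+# /etc/systemd/system/your-service.service.d/memlock.conf
+[Service]
+LimitMEMLOCK=infinity
+```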
+
+Short of unlimited, a value of 512 is generally sufficient under normal circumstances. If system throughput is high, consider configuring a larger limit (or unlimited) and specifying more ring entries when creating the runtime to get better performance.
+
+What happens when memlock is too small? Reads or writes may return an error such as: `code: 105, kind: Uncategorized, message: "No buffer space available"`.
diff --git a/docs/en/platform-support.md b/docs/en/platform-support.md
new file mode 100644
index 0000000..ba508a6
--- /dev/null
+++ b/docs/en/platform-support.md
@@ -0,0 +1,16 @@
+---
+title: Platform Support
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Platform Support
+
+Currently only Linux with io-uring is supported; kernel 5.6+ should work.
+
+## Plans
+Later we will support epoll/kqueue via mio as a **fallback**. If you do not use io-uring, there is less reason to use Monoio.
+
+An IO interface that takes ownership of the buffer is harder to use, but it performs better in io-uring mode. So if io-uring is available in most of your environments and you care a lot about performance, Monoio is a good choice.
+
+Windows is somewhat harder to support, and we have no plans to support it yet.
\ No newline at end of file
diff --git a/docs/en/why-GAT.md b/docs/en/why-GAT.md
new file mode 100644
index 0000000..edb2788
--- /dev/null
+++ b/docs/en/why-GAT.md
@@ -0,0 +1,41 @@
+---
+title: Why GAT
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Why GAT
+
+We enable GAT globally and use it widely.
+
+We define a lifetime on the associated Future type so it can capture `&self`, instead of having to `Clone` some of self's fields or define a separate struct with a lifetime parameter.
+
+## Define a future
+How do you define a future? Normally, we need to implement a `poll` function that takes a `Context` and synchronously returns `Poll`. To do that we have to manage the future's state by hand, which is hard and error-prone.
+
+You might think: just use `async` and `await` and everything is fine! Indeed, an `async` block generates the state machine for you. However, you cannot name its type, which becomes a problem when you want to use the future as an associated type of another struct, for example a tower `Service`. In that case you need an opaque type and have to enable the `type_alias_impl_trait` feature; alternatively, you can use `Box<dyn Future>` at some runtime cost.
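+
+A minimal sketch of the opaque-type route (nightly only; the names are illustrative):
+
+```rust
+#![feature(type_alias_impl_trait)]
+
+use std::future::Future;
+
+// Give the otherwise unnameable async block type a name via an opaque alias.
+type AnswerFut = impl Future<Output = u32>;
+
+fn answer() -> AnswerFut {
+    async { 42u32 }
+}
+```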
+
+## Generate a future
+Besides using an `async` block, we can manually construct a struct that implements `Future`. There are two common kinds:
+1. A future that owns its data, so the struct needs no lifetime parameter. Once generated, the future is independent of everything else; if it needs shared data, you may have to share ownership via `Rc` or `Arc`.
+2. A future that holds references, so the struct needs a lifetime parameter. For example, in Tokio's `AsyncReadExt`, `read` looks like `fn read<'a>(&'a mut self, buf: &'a mut [u8]) -> Read<'a, Self>`. The constructed struct captures references to `self` and `buf` without the cost of shared ownership. However, without GAT this kind of future cannot be used as an associated type.
+
+## Define IO trait
+Conventionally, we have to define the trait in a `poll` style, e.g. `poll_read`, and any wrapper has to be written in `poll` style and implement the same trait.
+
+For user-friendliness there is usually an additional `Ext` trait with default implementations, whose only bound is the base trait. The `Ext` trait provides methods like `read` that return a future; `await`ing that future is more convenient than managing state and calling `poll` by hand. The returned future is a manually defined struct (with or without a lifetime parameter) implementing `Future`.
+
+Why use `poll` style for the base trait? Because `poll` style is synchronous and always returns `Poll`: it captures nothing, is easy to define, and is generic enough.
+
+With GAT, things get easier. What about returning a future directly from the trait? With GAT we can declare an associated type with a lifetime inside the trait, and then return a future that captures a reference to self.
+```rust
+trait AsyncReadRent {
+    type ReadFuture<'a, T>: Future<Output = BufResult<usize, T>>
+    where
+        Self: 'a,
+        T: 'a;
+    fn read<T: IoBufMut>(&self, buf: T) -> Self::ReadFuture<'_, T>;
+}
+```
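+
+As a sketch of how a caller would use this (the helper below is illustrative, not part of the crate): the future borrows the IO object for the duration of the call, while the buffer is moved in and handed back with the result.
+
+```rust
+// Illustrative helper written against the trait above.
+async fn read_once<IO, B>(io: &IO, buf: B) -> BufResult<usize, B>
+where
+    IO: AsyncReadRent,
+    B: IoBufMut,
+{
+    // `io.read(buf)` returns `IO::ReadFuture<'_, B>`, which captures `&io`
+    // through the GAT lifetime and owns `buf` until completion.
+    io.read(buf).await
+}
+```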
+
+The only problem is that once you use the GAT style, you should use it everywhere: providing a `poll`-style interface on top of GAT is not easy. As an example, `monoio-compat` implements Tokio's `AsyncRead` and `AsyncWrite` on top of GAT-style futures with some unsafe hacks (and a `Box` cost).
diff --git a/docs/en/why-async-rent.md b/docs/en/why-async-rent.md
new file mode 100644
index 0000000..dca681f
--- /dev/null
+++ b/docs/en/why-async-rent.md
@@ -0,0 +1,35 @@
+---
+title: Async Rent
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Async Rent
+
+We use async rent as our core IO abstraction.
+
+## What io-uring requires
+1. The buffer address must stay fixed.
+
+    Since we only submit the buffer to the kernel, we don't know when the kernel will read from or write into it. We must make sure the address stays fixed so it remains valid while the operation runs.
+
+2. The buffer must live long enough.
+
+    Consider the following circumstance:
+    1. The user creates a buffer.
+    2. The user takes a reference to the buffer (whether `&` or `&mut`) and uses it to start a read or write.
+    3. Instead of `await`ing the future, the user just drops it.
+    4. Since the buffer now has no borrower, the user can drop it safely.
+    5. But the `ReadOp` or `WriteOp` is still queued or being processed, and we cannot guarantee that a `CancelOp` is pushed and consumed in time.
+    6. The kernel is operating on invalid memory!
+
+So, we want to make sure the buffer is fixed and valid for the whole operation, and we found no way to do this without taking ownership of the buffer.
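+
+In practice this means the IO call takes the buffer by value and hands it back when the operation completes. A rough sketch of the call shape, assuming `BufResult<usize, B>` is the `(io::Result<usize>, B)` pair used by tokio-uring-style APIs and `AsyncReadRent` is the rent-style read trait described in the GAT doc:
+
+```rust
+// Sketch only: the concrete stream type is whatever implements the
+// rent-style read; `Vec<u8>` stands in for any owned buffer type.
+async fn fill<S: AsyncReadRent>(stream: &S) -> usize {
+    let buf = vec![0u8; 4096];
+    // Ownership of `buf` moves into the operation...
+    let (res, _buf) = stream.read(buf).await;
+    // ...and comes back together with the result once the kernel is done.
+    res.expect("read failed")
+}
+```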
+
+Why doesn't Tokio's async IO require buffer ownership? Put simply: because the operation can be cancelled instantly. Async reads and writes go through a borrow of the buffer, the generated future holds that borrow, and the Rust compiler understands the relationship. Once the future completes or is dropped, the borrow ends, and the kernel cannot still be operating on the buffer (with epoll + syscall, the syscall itself runs synchronously). In other words, if we had async drop, we could do io-uring IO with just a reference to the buffer instead of ownership.
+
+## Ecosystem problem
+The biggest problem is compatibility with the existing async ecosystem.
+
+To mitigate this, we provide a compatibility wrapper for types like `TcpStream`. With the wrapper, users can use Tokio-style async IO at the cost of an extra data copy.
+
+Also, if the user goes through BufRead or BufWrite, there is no need to take ownership of the user's buffer, since the data is copied into the internal buffer immediately; Tokio-style async IO works there as well.
\ No newline at end of file
diff --git a/docs/zh/async-communicate.md b/docs/zh/async-communicate.md
new file mode 100644
index 0000000..4c42342
--- /dev/null
+++ b/docs/zh/async-communicate.md
@@ -0,0 +1,17 @@
+---
+title: How to communicate within a thread or between threads
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# How to communicate within a thread or between threads
+
+Cross-thread and cross-task communication are very common operations. For example, if you want to collect statistics, the best-performing approach is to gather and aggregate them within each thread, and only merge the per-thread results at the end.
+
+## Same-thread asynchronous communication
+For asynchronous communication within the same thread, you can use [local-sync](https://crates.io/crates/local-sync), a crate we implemented. It provides mpsc (bounded/unbounded), once cell, oneshot and semaphore implementations. Since it has no cross-thread Sync + Send implementations, its internal data structures need no synchronization primitives such as atomics or Mutex, which makes it more efficient.
+
+## Cross-thread communication
+By default, cross-thread asynchronous communication (that is, being able to `await` something created by another thread) is not supported, but you can still use atomics and similar primitives to communicate across threads.
+
+To enable cross-thread asynchronous communication, enable the `sync` feature. We do not ship such an implementation ourselves; you can use any existing one, such as [async-channel](https://crates.io/crates/async-channel), or the one provided by Tokio (this requires adding the tokio dependency; enabling only its `sync` feature is enough, since it is not published as a separate crate).
diff --git a/docs/zh/benchmark.md b/docs/zh/benchmark.md
new file mode 100644
index 0000000..a4654bd
--- /dev/null
+++ b/docs/zh/benchmark.md
@@ -0,0 +1,66 @@
+---
+title: Performance test and comparison
+date: 2021-12-01 15:50:00
+author: ihciah
+---
+
+# Performance test data and comparison
+
+To measure Monoio's performance, we selected two representative runtimes to compare against Monoio: Tokio and Glommio.
+
+## Testing environment
+The tests were carried out on the ByteDance production network, with the load generator and the machine under test running on different physical machines.
+
+Configuration of the machine under test:
+> Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
+>
+> Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
+>
+> Linux 5.15.4-arch1-1 #1 SMP PREEMPT Sun, 21 Nov 2021 21:34:33 +0000 x86_64 GNU/Linux
+>
+> rust nightly-2021-11-26
+
+## Testing tools
+The testing tool is written in Rust and built on Monoio.
+
+You can find its source code [here](https://github.com/monoio-rs/monoio-benchmark).
+
+## Testing data
+
+### Extreme performance testing
+In this test we start a fixed number of connections from the client side; the more connections, the higher the load on the server. The goal is to probe the peak performance of the system.
+
+1 Core | 4 Cores
+:-------------------------:|:-------------------------:
+![1core](/.github/resources/benchmark/monoio-bench-1C.png) | ![4cores](/.github/resources/benchmark/monoio-bench-4C.png)
+
+8 Cores | 16 Cores
+:-------------------------:|:-------------------------:
+![8cores](/.github/resources/benchmark/monoio-bench-8C.png) | ![16cores](/.github/resources/benchmark/monoio-bench-16C.png)
+
+With a single core and very few connections, Monoio's latency is higher than Tokio's, which results in lower throughput. The latency gap comes from the difference between io-uring and epoll.
+
+In every other scenario, Monoio outperforms both Tokio and Glommio. Tokio's average peak performance per core drops considerably as the number of cores increases, while Monoio's peak performance has the best horizontal scalability.
+
+With a single core, Monoio performs slightly better than Tokio; with 4 cores its peak performance is about twice Tokio's, and with 16 cores it is close to 3 times. Glommio uses the same model as Monoio, so it also scales well horizontally, but its peak performance still lags behind Monoio's.
+
+![100B](/.github/resources/benchmark/monoio-bench-100B.png)
+We use a 100-byte message size to test peak performance under different core counts (1 KiB messages saturate the network card once more cores are used). Monoio and Glommio stay close to linear, while Tokio shows very little improvement, or even degradation, as cores are added.
+
+### Fixed-load testing
+In production we never drive the server to saturation, so performance under a constant load is also very meaningful to measure.
+
+1 Core * 80 Connections | 4 Cores * 80 Connections
+:-------------------------:|:-------------------------:
+![1core*80](/.github/resources/benchmark/monoio-bench-1C-80conn-qps.png) | ![4cores*80](/.github/resources/benchmark/monoio-bench-4C-80conn-qps.png)
+
+1 Core * 250 Connections | 4 Cores * 250 Connections
+:-------------------------:|:-------------------------:
+![1core*250](/.github/resources/benchmark/monoio-bench-1C-250conn-qps.png) | ![4cores*250](/.github/resources/benchmark/monoio-bench-4C-250conn-qps.png)
+
+As with the previous results, Tokio has a latency advantage over the uring-based Glommio and Monoio when the connection count is small, but Monoio still has the lowest CPU consumption.
+
+As the number of connections grows, Monoio has both the lowest latency and the lowest CPU usage.
+
+## Reference data
+The raw benchmark data can be found [here](/.github/resources/benchmark/raw_data.txt).
\ No newline at end of file
diff --git a/docs/zh/comparing-with-others.md b/docs/zh/comparing-with-others.md
new file mode 100644
index 0000000..5a89e20
--- /dev/null
+++ b/docs/zh/comparing-with-others.md
@@ -0,0 +1,30 @@
+---
+title: Comparison with other runtimes
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Comparison with other runtimes
+
+There are many similar runtimes; this document focuses on the differences and the design trade-offs.
+
+## Mio and Tokio
+Mio and Tokio do not use io-uring; they use epoll (on Linux). Taking Linux as an example, epoll is an IO notification mechanism: with it you can monitor fds and wait for them to become ready. However, after an fd becomes ready you still have to call read/write yourself, which incurs extra switches between user space and kernel space.
+
+Mio is Tokio's underlying IO library; it hides the differences between epoll/kqueue/IOCP across platforms.
+
+Besides using Mio for IO multiplexing, Tokio also performs cross-thread task scheduling (the implementation is quite similar to golang's). For the vast majority of general-purpose scenarios Tokio is a good fit, and it also offers very good compatibility.
+
+## Tokio-uring
+Can io-uring be used inside Tokio? You can try Tokio-uring. Tokio-uring provides an io-uring-based IO interface that operates with ownership of the buffer.
+
+However, it is not a pure io-uring solution: to reuse Tokio's underlying capabilities, it still builds the uring fd on top of epoll, which is clearly unnecessary if compatibility is not a concern.
+
+## Glommio
+Glommio is a sophisticated io-uring-based library. It maintains multiple rings internally and is designed to cover as many scenarios as possible, but this multi-ring design brings some extra performance overhead. For its IO operations it offers Tokio-like interfaces; since the kernel reads and writes the data asynchronously, it performs an extra data copy internally, which adds unwanted overhead in high-performance scenarios.
+
+## Monoio
+Monoio puts performance first in its design. It is not a silver bullet: if the tasks in your workload are not evenly distributed, the utilization of different cores may diverge. For suitable scenarios, such as typical proxy workloads, Monoio performs better than the other runtimes; see the Monoio benchmark for detailed numbers.
+
+## Performance comparison
+We compared performance with Tokio and Glommio in multiple scenarios. Detailed comparison data can be found in [the benchmark doc](/docs/zh/benchmark.md).
diff --git a/docs/zh/configuration.md b/docs/zh/configuration.md
new file mode 100644
index 0000000..17378fd
--- /dev/null
+++ b/docs/zh/configuration.md
@@ -0,0 +1,73 @@
+---
+title: Configuration Guide
+date: 2021-11-26 14:00:00
+author: ihciah
+---
+
+# Configuration Guide
+
+This section describes Monoio's configurable options and some of its default behavior.
+
+## Runtime configuration
+In the current version, there are two main settings you can change at runtime:
+1. entries
+
+    entries is the ring size of io-uring; the default is `1024`, and you can specify the value when creating the runtime. Note that for performance reasons, values below 256 are raised to 256. When your QPS is high, a larger entries value enlarges the ring and reduces the number of submits, which significantly lowers syscall overhead, but it also costs some memory, so choose it sensibly.
+
+    entries also determines the initial size of the in-flight op cache, which defaults to `10 * entries` and currently cannot be configured separately. Again, to avoid frequent growth of this cache, choose entries carefully.
+
+    Specify it when creating a runtime:
+    ```rust
+    RuntimeBuilder::new().with_entries(32768).build()
+    ```
+    Or via the macro:
+    ```rust
+    #[monoio::main(entries = 32768)]
+    async fn main() {
+        // ...
+    }
+    ```
+
+2. enable_timer
+
+    enable_timer controls whether the timer driver is enabled; it is off by default. If you have asynchronous timing needs you must enable it, otherwise those operations will panic.
+
+    What counts as an asynchronous timing need? Anything that uses this crate's time module, such as asynchronous sleep or tick. Simply calling the standard library to read the current time does not count.
+
+    Specify it when creating a runtime:
+    ```rust
+    RuntimeBuilder::new().enable_timer().build()
+    ```
+    Or via the macro:
+    ```rust
+    #[monoio::main(timer_enabled)]
+    async fn main() {
+        // ...
+    }
+    ```
+
+## Compile-time configuration
+There are also some compile-time features that affect runtime behavior.
+1. async-cancel
+
+    async-cancel is enabled by default. With this feature on, dropping a Future pushes a CancelOp into the io-uring to try to cancel the corresponding Op, which may bring some performance improvement. Note that even so we cannot guarantee the Op is actually cancelled, so if you do something like select {read, timeout} and still need to read later, be sure to keep the read Future around.
+
+2. zero-copy
+
+    zero-copy is disabled by default. When enabled, sockets are created with the SOCK_ZEROCOPY flag and sends carry an additional MSG_ZEROCOPY flag. This reduces memory copies but was not stable in our tests; if you want to enable it, make sure your system behaves correctly under stress testing.
+
+3. macros
+
+    macros is enabled by default. With this feature on, you can use macros such as `#[monoio::main]` instead of the `RuntimeBuilder` constructor.
+
+4. sync
+
+    sync is disabled by default. This feature allows Futures to be shared across runtime threads; the most common use case is creating a cross-thread channel for inter-thread communication. Since this is a thread-per-core runtime, heavy use of this approach on the hot path is not recommended.
+
+5. utils
+
+    utils is enabled by default. Currently it contains a single utility, which lets you set the CPU affinity of threads.
+
+6. debug
+
+    debug is disabled by default. When enabled, it prints some debugging information at runtime. It is only meant for debugging during runtime development and is not recommended in production.
diff --git a/docs/zh/how-intergrate-with-tower.md b/docs/zh/how-intergrate-with-tower.md
new file mode 100644
index 0000000..797e260
--- /dev/null
+++ b/docs/zh/how-intergrate-with-tower.md
@@ -0,0 +1,11 @@
+---
+title: How to Integrate with Tower
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# How to Integrate with Tower
+
+With GAT, futures carry lifetimes, so we may need a different way to use the Tower components and abstractions.
+
+TODO \ No newline at end of file
diff --git a/docs/zh/memlock.md b/docs/zh/memlock.md
new file mode 100644
index 0000000..ffc28c8
--- /dev/null
+++ b/docs/zh/memlock.md
@@ -0,0 +1,30 @@
+---
+title: Setting the memlock limit
+date: 2021-11-26 14:00:00
+author: ihciah
+---
+
+# Setting the memlock limit
+io-uring needs memory shared between user mode and kernel mode, such as the rings themselves or registered buffers.
+
+Many default kernel configurations come with a small memlock limit, such as 64 (meaning 64 KiB). We need a larger memlock limit to work properly (if you manually specify the ring size, you may need to make sure the resulting size is still within the limit; in our tests, 64 actually works as well).
+
+To view the current limit, use `ulimit -l` (if you have just changed the configuration, you need to log in again for it to take effect):
+```
+❯ ulimit -l
+unlimited
+```
+
+To change this limit globally, edit `/etc/security/limits.conf` and add the following two lines (you can also replace the asterisk with your username):
+```
+* hard memlock unlimited
+* soft memlock unlimited
+```
+
+If you only want the change to apply to the current session, consider running, as root, `ulimit -Sl unlimited && ulimit -Hl unlimited your_cmd`.
+
+Under systemd, you can set the memlock limit with the `LimitMEMLOCK` option; see `/etc/systemd/user.conf` and `/etc/systemd/system.conf`.
+
+Short of unlimited, a value of 512 is generally sufficient under normal circumstances. If throughput is high, consider configuring a larger limit (or unlimited) and specifying more ring entries when creating the runtime for better performance (see configuration.md).
+
+What happens when memlock is too small? Reads or writes may return an error such as: `code: 105, kind: Uncategorized, message: "No buffer space available"`.
diff --git a/docs/zh/platform-support.md b/docs/zh/platform-support.md
new file mode 100644
index 0000000..ee810f0
--- /dev/null
+++ b/docs/zh/platform-support.md
@@ -0,0 +1,16 @@
+---
+title: Platform Support
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Platform Support
+
+Currently only Linux is supported, and the kernel must support io-uring (5.6 at minimum).
+
+## Future plans
+We will support epoll/kqueue as a **fallback** in the future. Note that even with epoll/kqueue supported, we still hope you use io-uring in most scenarios; otherwise there is little point in using Monoio.
+
+Monoio uses a Tokio-uring-style IO interface that takes ownership of the buffer. This interface is incompatible with Tokio's and requires components to be adapted separately. So if most of your environments support io-uring and you care a lot about performance, you are welcome to use Monoio.
+
+Windows is fairly hard to support, and there is little development or production value in doing so, so there are currently no plans for it.
diff --git a/docs/zh/why-GAT.md b/docs/zh/why-GAT.md
new file mode 100644
index 0000000..e8fe41c
--- /dev/null
+++ b/docs/zh/why-GAT.md
@@ -0,0 +1,44 @@
+---
+title: Why GAT
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Why GAT
+
+We enable GAT globally and use it extensively.
+
+We define a lifetime on the associated Future type in the trait so it can capture `&self`, instead of having to `Clone` some of self's members or define a separate struct with a lifetime parameter.
+
+## Defining a Future
+How do you define a Future? Conventionally, we define a struct and implement the Future trait for it. The key part is the `poll` function, which takes a `Context` and synchronously returns `Poll`. To implement `poll` we usually have to manage state by hand, which is painful and error-prone.
+
+At this point you may ask: can't we just use `async` and `await`? Indeed, an `async` block generates a state machine much like the one you would write by hand. The problem is that the generated type has no name, so it is hard to use as an associated type. You can enable `type_alias_impl_trait` and use an opaque type as the associated type, or pay some runtime cost and use `Box<dyn Future>`.
+
+## Generating a Future
+Besides using an `async` block, the usual way is to manually build a struct that implements `Future`. There are two kinds:
+1. A future that owns its data and needs no extra lifetime parameter. Such a `Future` has no ties to any other struct; if it must depend on data that is not `Copy`, you can consider shared-ownership types such as `Rc` or `Arc`.
+2. A future that holds references, whose struct carries a lifetime parameter. For example, in Tokio's `AsyncReadExt`, the signature of `read` is `fn read<'a>(&'a mut self, buf: &'a mut [u8]) -> Read<'a, Self>`. The constructed `Read<'a, Self>` captures references to self and buf, which costs nothing at runtime compared to shared ownership. However, this kind of future is hard to use as a trait's type alias; you can only do so by enabling `generic_associated_types` and `type_alias_impl_trait` and using an opaque type.
+
+## Defining the IO trait
+Normally, our IO interfaces have to be defined in `poll` style (e.g. `poll_read`), and any wrapper around IO should be built on this trait (let's call it the base trait).
+
+For a user-friendly interface, an additional `Ext` trait is usually provided, used mainly through its default implementations. The `Ext` trait is automatically implemented for every type that implements the base trait. For example, `read` returns a Future, and `await`ing that future is clearly easier than managing state and `poll`ing by hand.
+
+So why is the base trait defined in `poll` style? Can't we return a Future in one step? Because the poll form is synchronous, does not need to capture anything, and is easy to define and general enough. If we defined a Future-returning trait directly, we would either have to hard-code the returned Future type as `Ext` does (which makes wrapping and user implementations impossible and defeats the purpose of having a trait), or make the Future an associated type (which, as mentioned above, cannot carry a lifetime without GAT, i.e. it must be 'static).
+
+To sum up: on current stable Rust, the only workable way to define an IO interface is a poll-style base trait plus a future-style Ext trait.
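+
+As a rough sketch of what that stable-Rust pattern looks like (the trait and struct names below are illustrative, not Monoio's API):
+
+```rust
+use std::future::Future;
+use std::io;
+use std::pin::Pin;
+use std::task::{Context, Poll};
+
+// Poll-style base trait: synchronous, captures nothing.
+trait MyAsyncRead {
+    fn poll_read(
+        self: Pin<&mut Self>,
+        cx: &mut Context<'_>,
+        buf: &mut [u8],
+    ) -> Poll<io::Result<usize>>;
+}
+
+// Future-style Ext trait: `read` returns a hand-written future that borrows
+// `self` and `buf`, so the future struct needs a lifetime parameter.
+trait MyAsyncReadExt: MyAsyncRead {
+    fn read<'a>(&'a mut self, buf: &'a mut [u8]) -> Read<'a, Self>
+    where
+        Self: Unpin + Sized,
+    {
+        Read { io: self, buf }
+    }
+}
+
+impl<T: MyAsyncRead> MyAsyncReadExt for T {}
+
+struct Read<'a, T: ?Sized> {
+    io: &'a mut T,
+    buf: &'a mut [u8],
+}
+
+impl<T: MyAsyncRead + Unpin + ?Sized> Future for Read<'_, T> {
+    type Output = io::Result<usize>;
+
+    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
+        let me = &mut *self;
+        Pin::new(&mut *me.io).poll_read(cx, me.buf)
+    }
+}
+```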
+
+Once GAT is enabled, this becomes possible: we can define an associated Future type with a lifetime directly in the trait, and it can then capture self.
+
+```rust
+trait AsyncReadRent {
+    type ReadFuture<'a, T>: Future<Output = BufResult<usize, T>>
+    where
+        Self: 'a,
+        T: 'a;
+    fn read<T: IoBufMut>(&self, buf: T) -> Self::ReadFuture<'_, T>;
+}
+```
+
+Is this a silver bullet? No. The only problem is that once you adopt the GAT pattern, you have to use it everywhere; if you keep switching between the `poll` form and the GAT form, it becomes very painful. Maintaining your own state on top of a `poll`-style interface can certainly produce a Future (the simplest example being `poll_fn`), but the reverse is awkward: it is hard to store a Future that carries a lifetime. It can be done with some unsafe hacks (at a cost), but it is still very restrictive and not recommended. `monoio-compat` implements Tokio's `AsyncRead` and `AsyncWrite` on top of GAT-based futures; if you really want to try, you can use it as a reference.
diff --git a/docs/zh/why-async-rent.md b/docs/zh/why-async-rent.md
new file mode 100644
index 0000000..ce47e1e
--- /dev/null
+++ b/docs/zh/why-async-rent.md
@@ -0,0 +1,35 @@
+---
+title: Why AsyncRent as the IO abstraction
+date: 2021-11-24 20:00:00
+author: ihciah
+---
+
+# Why AsyncRent as the IO abstraction
+
+We use AsyncRent as our IO abstraction.
+
+## What io-uring requires
+1. The buffer address must stay fixed.
+
+    Since we only submit the buffer to the kernel, we do not know when the kernel will write into it or read from it. We must guarantee that the buffer address remains valid until the IO completes.
+
+2. The buffer's lifetime must be guaranteed.
+
+    Consider the following situation:
+    1. The user creates a buffer.
+    2. The user takes a reference to the buffer (whether `&` or `&mut`) to do a read or write.
+    3. The runtime returns a Future, but the user simply drops it.
+    4. Now nobody holds a reference to the buffer, so the user can drop it as well.
+    5. However, the buffer's address and length have already been submitted to the kernel; the op may be about to be processed or already in flight. We can push a `CancelOp`, but we cannot guarantee it will be consumed immediately.
+    6. The kernel is now operating on the wrong memory, and if that memory is reused by the user program, this leads to memory corruption.
+
+So we want to guarantee that the buffer address is fixed and valid until the IO completes. In Rust this is almost impossible to do without taking ownership of the buffer.
+
+Why doesn't Tokio's AsyncIO need buffer ownership? The reason is simple: fundamentally, the kernel operates on the buffer synchronously, so the IO can be cancelled immediately. We can do reads and writes with just a reference to the buffer, build a future that captures that reference, and the Rust compiler understands their lifetimes and borrow relationships. Once the future completes or is dropped, the borrow ends, and at that point the kernel cannot possibly be touching the buffer (with epoll + syscall, the syscall runs synchronously, so once control returns to user space the kernel is definitely no longer operating on it). In other words, if we could do async drop, we could offer a Tokio-like IO interface that only needs a reference to the buffer on top of io-uring as well.
+
+## Ecosystem problem
+The biggest problem is the ecosystem: an ownership-based IO interface does not fit today's ecosystem well.
+
+To mitigate this, we provide a compatibility wrapper for types like `TcpStream`. Users can operate on these types through Tokio's interfaces, at the cost of one extra data copy.
+
+Also, if the user goes through BufRead or BufWrite, there is no need to use the ownership-taking interface at all: these types already have an internal buffer, so we can copy the data immediately at call time. That way the user's buffer is never left in an indeterminate state (that is, unknown when it will be read or written), and we can offer Tokio-like interfaces without extra overhead.