

Receive packet steering [LWN.net]


By Jonathan Corbet
November 17, 2009
Contemporary networking hardware can move a lot of packets, to the point that the host computer can have a hard time keeping up. In recent years, CPU speeds have stopped increasing, but the number of CPU cores is growing. The implication is clear: if the networking stack is to be able to keep up with the hardware, smarter processing (such as generic receive offload) will not be enough; the system must also be able to distribute the work across multiple processors. Tom Herbert's receive packet steering (RPS) patch aims to help make that happen.

From the operating system's point of view, distributing the work of outgoing data across CPUs is relatively straightforward. The processes generating data will naturally spread out across the system, so the networking stack does not need to think much about it, especially now that multiple transmit queues are supported. Incoming data is harder to distribute, though, because it is coming from a single source. Some network interfaces can help with the distribution of incoming packets; they have multiple receive queues and multiple interrupt lines. Others, though, are equipped with a single queue, meaning that the driver for that hardware must deal with all incoming packets in a single, serialized stream. Parallelizing such a stream requires some intelligence on the part of the host operating system.

Tom's patch provides that intelligence by hooking into the receive path - netif_rx() and netif_receive_skb() - right when the driver passes a packet into the networking subsystem. At that point, it creates a hash from the relevant protocol data (IP addresses and port numbers, in particular) and uses it to pick a CPU; the packet is then enqueued for the target CPU's attention. By default, any CPU on the system is fair game for network processing, but the list of target CPUs for any given interface can be configured explicitly by the administrator if need be.
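As a rough illustration of that steering step, consider the following user-space sketch. It is not the kernel patch itself: the tuple layout, the FNV-1a hash, and the flow_hash() and pick_cpu() helpers are stand-ins for the kernel's own Jenkins-hash-based code, but the shape of the decision is the same: hash the flow identifiers, then use the hash to index into the set of eligible CPUs.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct flow_tuple {
        uint32_t saddr, daddr;   /* IPv4 source and destination addresses */
        uint16_t sport, dport;   /* TCP/UDP source and destination ports  */
    };

    /* FNV-1a over the tuple fields -- a simple stand-in for the Jenkins
     * hash the kernel actually uses. */
    static uint32_t flow_hash(const struct flow_tuple *t)
    {
        uint32_t words[3] = { t->saddr, t->daddr,
                              ((uint32_t)t->sport << 16) | t->dport };
        const uint8_t *p = (const uint8_t *)words;
        uint32_t h = 2166136261u;

        for (size_t i = 0; i < sizeof(words); i++) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Map the hash onto one of the CPUs allowed for this interface;
     * packets of the same flow always hash to the same CPU. */
    static int pick_cpu(uint32_t hash, const int *allowed_cpus, int n_allowed)
    {
        return allowed_cpus[hash % (uint32_t)n_allowed];
    }

    int main(void)
    {
        struct flow_tuple t = { .saddr = 0x0a000001, .daddr = 0x0a000002,
                                .sport = 12345, .dport = 80 };
        int cpus[] = { 0, 1, 2, 3 };   /* CPUs eligible for this interface */

        printf("flow steered to CPU %d\n",
               pick_cpu(flow_hash(&t), cpus, 4));
        return 0;
    }

Because the CPU choice depends only on the flow hash, every packet belonging to a given connection is steered to the same processor, which is the property the next paragraph relies on.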

The code is relatively simple, but it succeeds in distributing the load of receive processing across the system. The use of the hash is important: it ensures that packets for the same stream of data end up on the same processor, increasing cache locality (and, thus, performance). This scheme is also nice in that it requires no driver changes at all, so it can be deployed quickly and with minimal disruption.

There is one place where drivers can help, though. The calculation of the hash requires accessing data from the packet header. That access will necessarily involve one or more cache misses on the CPU running the steering code - that data was just put there by the network interface and thus cannot be in any CPU's cache. Once the packet has been passed over to the CPU which will be doing the real work, that cache miss overhead is likely to be incurred again. Unnecessary cache misses are the bane of high-speed network processing; quite a bit of work has been done to eliminate them wherever possible. Adding a new cache miss for every packet in the steering code would be counterproductive.

It turns out that a number of network interfaces can, themselves, calculate a hash value for incoming packets. That processing comes for free, and it could eliminate the need to calculate that hash (and suffer the overhead of accessing the data) on the dispatching processor. To take advantage of this capability, the RPS patch adds a new rxhash field to the sk_buff (SKB) structure. Drivers which are able to obtain hash values from the hardware can place them in the SKB; the network stack will then skip the calculation of its own hash value. That should keep the packet's data out of the dispatching CPU's cache entirely, speeding processing.
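To make the effect concrete, here is a continuation of the earlier sketch (again purely illustrative, reusing flow_hash(), pick_cpu(), and struct flow_tuple from above; the rxhash field here mirrors the one the patch adds to struct sk_buff): when the hardware has already supplied a hash, the dispatching CPU can steer the packet without ever reading its headers.

    struct pkt {
        uint32_t rxhash;          /* 0 if the hardware supplied no hash       */
        struct flow_tuple tuple;  /* packet headers, cold in this CPU's cache */
    };

    int steer_packet(const struct pkt *p,
                     const int *allowed_cpus, int n_allowed)
    {
        uint32_t h = p->rxhash;

        if (h == 0)               /* no hardware hash: touch the headers */
            h = flow_hash(&p->tuple);

        return pick_cpu(h, allowed_cpus, n_allowed);
    }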

How well does this work? The patch included some benchmark results using the netperf tool. An 8-core server with a tg3-based network interface went from 90,000 transactions per second to 285,000; an e1000-based adapter on the same system went from 90,000 to 292,000. Similar results were obtained for nForce and bnx2x chipsets on 16-core servers. It would appear that this patch does succeed in making network processing faster on multi-core systems.

The patch, incidentally, comes from Google, which has a bit of experience with network processing. It has, evidently, been running on Google's production servers for a while. So the RPS patch is, hopefully, an early component of what will be a broad stream of contributions from Google as that company tries to work more closely with the mainline. It seems like a good start.





 

Receive packet steering

Posted Nov 19, 2009 13:04 UTC (Thu) by cma (subscriber, #49905) [Link]

Wow...keep up the good work Google! Thanks a lot!

Related to MSI-X?

Posted Nov 21, 2009 17:20 UTC (Sat) by kleptog (subscriber, #1183) [Link]

How is this related to MSI-X, the system whereby network cards can assert different MSI interrupts based on a checksum in the header? This allows the load to be spread across CPUs in much the same way as suggested above.

I'm also wondering how this interacts with PCAP. If you have a machine with a dozen processes attached to an interface, then the packet needs to be copied to several different places in userspace (assuming MMAP ring-buffers). These are all going to be running on different CPUs so I don't think the above processing will help. But perhaps the actual BPF filtering can be spread out over multiple CPUs?

I ran into a problem this week, where IO-APIC round-robin interrupt routing is disabled on machines with >= 8 CPUs, which means that if you don't have MSI-X you have >50% of one CPU dedicated to interrupt processing. The scheduler doesn't know this, leading to some odd effects. So if the above system works on ordinary MSI network cards, this could be a solution.

 
