Zero Copy I: User-Mode Perspective
In kernel version 2.1, the sendfile system call was introduced to simplify the transmission of data over the network and between two local files. Introduction of sendfile not only reduces data copying, it also reduces context switches. Use it like this:
sendfile(socket, file, len);
To get a better idea of the process involved, take a look at Figure 3.

Figure 3. Replacing Read and Write with Sendfile
Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine. Then the data is copied by the kernel into the kernel buffer associated with sockets.
Step two: the third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.
You are probably wondering what happens if another process truncates the file we are transmitting with the sendfile system call. If we don't register any signal handlers, the sendfile call simply returns with the number of bytes it transferred before it got interrupted, and the errno will be set to success.
If we get a lease from the kernel on the file before we call sendfile, however, the behavior and the return status are exactly the same. We also get the RT_SIGNAL_LEASE signal before the sendfile call returns.
So far, we have been able to avoid having the kernel make several copies, but we are still left with one copy. Can that be avoided too? Absolutely, with a little help from the hardware. To eliminate all the data duplication done by the kernel, we need a network interface that supports gather operations. This simply means that data awaiting transmission doesn't need to be in consecutive memory; it can be scattered through various memory locations. In kernel version 2.4, the socket buffer descriptor was modified to accommodate those requirements—what is known as zero copy under Linux. This approach not only reduces multiple context switches, it also eliminates data duplication done by the processor. For user-level applications nothing has changed, so the code still looks like this:
sendfile(socket, file, len);
To get a better idea of the process involved, take a look at Figure 4.

Figure 4. Hardware that supports gather can assemble data from multiple memory locations, eliminating another copy.
Step one: the sendfile system call causes the file contents to be copied into a kernel buffer by the DMA engine.
Step two: no data is copied into the socket buffer. Instead, only descriptors with information about the whereabouts and length of the data are appended to the socket buffer. The DMA engine passes data directly from the kernel buffer to the protocol engine, thus eliminating the remaining final copy.
Because data still is actually copied from the disk to the memory and from the memory to the wire, some might argue this is not a true zero copy. This is zero copy from the operating system standpoint, though, because the data is not duplicated between kernel buffers. When using zero copy, other performance benefits can be had besides copy avoidance, such as fewer context switches, less CPU data cache pollution and no CPU checksum calculations.
Now that we know what zero copy is, let's put theory into practice and write some code. You can download the full source code from www.xalien.org/articles/source/sfl-src.tgz. To unpack the source code, type tar -zxvf sfl-src.tgz at the prompt. To compile the code and create the random data file data.bin, run make.
Looking at the code starting with header files:
/* sfl.c sendfile example program
Dragan Stancevic <
header name function / variable
-------------------------------------------------*/
#include <stdio.h> /* printf, perror */
#include <fcntl.h> /* open */
#include <unistd.h> /* close */
#include <errno.h> /* errno */
#include <string.h> /* memset */
#include <sys/socket.h> /* socket */
#include <netinet/in.h> /* sockaddr_in */
#include <sys/sendfile.h> /* sendfile */
#include <arpa/inet.h> /* inet_addr */
#define BUFF_SIZE (10*1024) /* size of the tmp
buffer */
Besides the regular <sys/socket.h> and <netinet/in.h> required for basic socket operation, we need a prototype definition of the sendfile system call. This can be found in the <sys/sendfile.h> server flag:
/* are we sending or receiving */
if(argv[1][0] == 's') is_server++;
/* open descriptors */
sd = socket(PF_INET, SOCK_STREAM, 0);
if(is_server) fd = open("data.bin", O_RDONLY);
The same program can act as either a server/sender or a
client/receiver. We have to check one of the command-prompt
parameters, and then set the flag is_server to run in sender mode.
We also open a stream socket of the INET protocol family. As part
of running in server mode we need some type of data to transmit to
a client, so we open our data file. We are using the system call
sendfile to transmit data, so we don't have to read the actual
contents of the file and store it in our program memory buffer.
Here's the server address:
/* clear the memory */ memset(&sa, 0, sizeof(struct sockaddr_in)); /* initialize structure */ sa.sin_family = PF_INET; sa.sin_port = htons(1033); sa.sin_addr.s_addr = inet_addr(argv[2]);We clear the server address structure and assign the protocol family, port and IP address of the server. The address of the server is passed as a command-line parameter. The port number is hard coded to unassigned port 1033. This port number was chosen because it is above the port range requiring root access to the system.
Here is the server execution branch:
if(is_server){
int client; /* new client socket */
printf("Server binding to [%s]\n", argv[2]);
if(bind(sd, (struct sockaddr *)&sa,
sizeof(sa)) < 0){
perror("bind");
exit(errno);
}
As a server, we need to assign an address to our socket descriptor. This is achieved by the system call bind, which assigns the socket descriptor (sd) a server address (sa):
if(listen(sd,1) < 0){
perror("listen");
exit(errno);
}
Because we are using a stream socket, we have to advertise our
willingness to accept incoming connections and set the connection
queue size. I've set the backlog queue to 1, but it is common to
set the backlog a bit higher for established connections waiting to
be accepted. In older versions of the kernel, the backlog queue was
used to prevent syn flood attacks. Because the system call listen
changed to set parameters for only established connections, the
backlog queue feature has been deprecated for this call. The kernel
parameter tcp_max_syn_backlog has taken over the role of protecting
the system from syn flood attacks:
if((client = accept(sd, NULL, NULL)) < 0){
perror("accept");
exit(errno);
}
The system call accept creates a new connected socket from the
first connection request on the pending connections queue. The
return value from the call is a descriptor for a newly created
connection; the socket is now ready for read, write or poll/select
system calls:
if((cnt = sendfile(client,fd,&off,
BUFF_SIZE)) < 0){
perror("sendfile");
exit(errno);
}
printf("Server sent %d bytes.\n", cnt);
close(client);
A connection is established on the client socket descriptor, so we
can start transmitting data to the remote system. We do this by
calling the sendfile system call, which is prototyped under Linux
in the following manner:
extern ssize_t
sendfile (int __out_fd, int __in_fd, off_t *offset,
size_t __count) __THROW;
The first two parameters are file descriptors. The third parameter
points to an offset from which sendfile should start sending data.
The fourth parameter is the number of bytes we want to transmit. In
order for the sendfile transmit to use zero-copy functionality, you
need memory gather operation support from your networking card. You
also need checksum capabilities for protocols that implement
checksums, such as TCP or UDP. If your NIC is outdated and doesn't
support those features, you still can use sendfile to transmit
files. The difference is the kernel will merge the buffers before
transmitting them.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
- RSS Feeds
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- Validate an E-Mail Address with PHP, the Right Way
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Download the Free Red Hat White Paper "Using an Open Source Framework to Catch the Bad Guy"
- Tech Tip: Really Simple HTTP Server with Python
- Home, My Backup Data Center
- Android is Linux -- why no better inter-operation
48 min 21 sec ago - Connecting Android device to desktop Linux via USB
1 hour 16 min ago - Find new cell phone and tablet pc
2 hours 14 min ago - Epistle
3 hours 43 min ago - Automatically updating Guest Additions
4 hours 52 min ago - I like your topic on android
5 hours 38 min ago - Reply to comment | Linux Journal
6 hours ago - This is the easiest tutorial
12 hours 14 min ago - Ahh, the Koolaid.
17 hours 53 min ago - git-annex assistant
23 hours 52 min ago




Comments
Nice article on ZeroCopy
Nice article on ZeroCopy
sendfile/sendfile64 on Solaris
It looks like it has been a while since this article was written. At least OpenSolaris (and perhaps earlier) supports sendfile on files larger than 2 GB when the application is compiled as 64-bit, otherwise sendfile64 must be used. The sendfile(3EXT) manpage has this description:
sendfile - send files over sockets or copy files to files
so sendfile has the ability to copy files/buffers to sockets as well as files.
sendfilev also exists and supports vector arguments as stated above.
In any case, great article! I like the progression of less and less copies.
sendfile
This article is ok if we want to know about the theory.i want to know the implimentation side. i want to compare it with the copy command.so i would like to go through the code for this two. so if you include the code it will be more helpful for us to do.........so please the favour