Embedded PCI and Linux

by Steven Slupsky

Peripheral Component Interconnect (PCI) technology is prevalent in both desktop and embedded computing environments. PCI has a significant share of the desktop market, and the sizeable volumes associated with this market enable PCI technology to be implemented cost effectively.

Unfortunately, the low-cost nature of PCI has not been realized in the embedded environment. This is mainly due to two factors: the desktop PC form factor is unsuitable for embedded applications, and the low-volume requirements of most industrial applications do not benefit from economies of scale.

The desktop PC form factor is not well suited for many embedded applications, generally due to space constraints or the fact that the form factor itself is inherently unreliable in demanding industrial settings. Alternative form factors are available, such as CompactPCI and PC/104+. However, these technologies suffer because their applications are primarily industrial in nature, so their volumes are significantly lower than those of the desktop environment. Consequently, components for these technologies are costly due to the market niche they occupy.

Overview of Proposed dimmPCI Standard

The proposed dimmPCI standard (see www.dimmpci.com) attempts to alleviate these two problems. The dimmPCI standard is a superset of the 32-bit, 33MHz desktop PCI standard in a dual in-line memory module (DIMM) form factor. This form factor uses a high-volume, commercially available DIMM interconnect component. The use of a high-volume interconnect reduces system cost considerably, as can be seen when a comparison is done between the desktop market and other PCI technologies.

The acceptable sizes defined by the DIMM standard improve reliability by reducing the mass of the circuit card considerably. Reliability is enhanced further by the presence of card locks at both ends of the DIMM connector. Easily manufactured and inexpensive braces also may be added to the middle of the circuit cards for additional support.

By using PCI, a myriad of options are available to the embedded systems developer. The developer now may select from a range of high-volume PCI components used in desktop applications. This brings low-cost, high-performance, system-level solutions within reach of the embedded marketplace.

Use of standard PCI technology enables a significant reduction in development effort. Using existing proven technology reduces time-to-market of both the hardware and software components.

Finally, an operating system was required that met the following requirements: widely used, easily supported, ease of development, low development costs, low per-unit licensing fees, support for a wide range of hardware platforms and devices, maintainable code, availability of qualified developers and maximized exploitation of the technology platform. After analysis of various options, µClinux was chosen.

Linux Implementation

Linux has existed in the desktop domain for a considerable length of time, and thus it has accumulated a significant knowledge base related to PCI development.

Our selection of hardware and software occurred using an iterative process. That is, the initial hardware and operating system requirements (low cost, performance, networking) identified a broad solution set. The new solution set was combined with the initial requirements and information we collected during the first iteration to generate new initial conditions (network stack, OS/RTOS, SRAM/DRAM, PCI). Again, the equation was solved and a solution identified. This process went through three iterations, each time identifying a better solution. At the end of the process, we settled on the hardware and operating system components described herein.

Requirements for Selecting Linux

Our iterative process indicated Linux as a possible choice for our operating system quite early in the process. Based on previous work, use of closed-source OSes for embedded projects has created problems with code debug and trace during development. Another important consideration is off-the-shelf support for different filesystems, network stacks and peripheral devices. Finally, support for a wide range of hardware platforms is important for future development projects that require application migration to higher levels of performance. An analysis of available options quickly identified Linux as the OS of choice.

Requirements for Selecting µClinux

Our first hardware platform (described below) is based on the Motorola Dragonball VZ microcontroller. The Motorola Dragonball family devices use a big-endian byte order and alignment rules that differ from x86-based devices. The Dragonball also lacks a memory management unit (MMU), unlike microprocessors used in desktop computers. Typically, lower-cost processors lack an MMU, and many embedded applications are driven by cost. The software for these applications does not require the benefits that an MMU would provide.

A full Linux kernel requires the presence of an MMU in the hardware system. Due to the absence of an MMU in the Dragonball processor, porting a full Linux kernel to this microprocessor architecture would be a significant undertaking.

µClinux is a Linux 2.0 kernel that has been ported to a Motorola 68k family device that does not have an MMU. Some kernel services need to be adapted to work in an environment without an MMU. For example, services normally provided by the function fork() may be accomplished with vfork(), given that care is used. Note that µClinux originally was ported to the Dragonball EZ processor. Our implementation uses the Dragonball VZ and so requires some additional kernel modifications.

Linux Support for Hardware Devices

A major prerequisite for selection of an OS is a wide range of support for hardware platforms and peripheral components. In order to establish dimmPCI as an open standard, it is necessary to have full software support for as many devices as possible. Thus, migrating an application to the dimmPCI architecture mainly would involve a hardware form-factor change, with only minimal changes to the hardware design and application software. Device-driver design is one of the most substantial costs during software development. A large support base for peripheral devices and components, such as network interfaces, RAM, ROM, serial communications and displays, is essential for cost savings during the development phase. Again, Linux surfaces as the option that is most suitable, especially since the supported knowledge base includes source code.

Multitasking and Real-Time Requirements

Performance was considered during our OS selection process. Linux is a multitasking operating system and provides a rich application development environment. Many practical embedded applications can be developed within this operating system environment, and our immediate application needs were satisfied. We have found that with properly written drivers, one can respond to hard real-time interrupts without having to preempt the kernel. For most of our applications, this is sufficient--especially given that the gcc tools compile C code into efficient code, and embedded assembly code can be written if response times are critical.

We have, from time to time in the past, required a more deterministic response. This type of response is a primary characteristic of a real-time operating system (RTOS). We conducted some research into real-time extensions for Linux and were pleased to find that such technology exists. Projects such as RTLinux and RTAI provide the necessary framework for providing real-time response within Linux. They operate on the principle that an RT kernel runs at the core with the Linux kernel scheduled as a low-priority task.

JFFS Filesystem Implementation

Filesystems for embedded applications require a level of robustness that exceeds that of normal desktop systems. In addition, they must support diskless storage and embedded devices. Achieving these requirements generally involves a sacrifice in efficiency and performance.

The Journaling Flash Filesystem (JFFS) was chosen since it is specifically designed to minimize the risk of data loss from a system crash or noncontrolled shutdown for industry-standard Flash memories. It is included in the standard Linux 2.4 kernel and is available as an open-source patch for the Linux 2.0 kernel. Our implementation using the Linux 2.0 kernel required significant work to obtain reliable operation with our hardware configuration.

JFFS is simply a log-structured list of nodes on the Flash media. Each node contains information about the associated file and, possibly, file data. If data is present, the node contains a field that indicates the location in the file where the data should appear. This allows newer data to overwrite older data. The node also contains information about the amount and location of data to delete from the file. This information is used for truncating files or overwriting selected data within a file. In addition, each node contains information that indicates the relative age of the node. In order to recreate a file, the entire media must be scanned, the individual nodes sorted in order of increasing version number and the data processed according to the instructions in each node. This is required only once, when the filesystem is mounted.

JFFS writes to the Flash media in a cyclic manner. New nodes simply are appended until the end of the media is reached. Before reaching the end of the media, the first block of media must be freed for use. This is accomplished by copying all valid nodes (nodes that have not been made obsolete by later nodes) and then erasing the block. This inherent cyclic nature ensures wear-leveling of the Flash media.

As is evident, JFFS is not an efficient means of storing and retrieving data. The number of bytes required to store a file can be significantly greater than the actual file size. However, in terms of robustness and crash recovery, JFFS rates highly. If the system crashes or experiences an unexpected loss of power, only the last node written might be affected. Thus, the file still can be recreated, excluding the changes described by the last, corrupted node.

Hardware Description

The dimmPCI hardware was selected to address the needs of the embedded marketplace by focusing on scalability, physical size, flexibility, development costs and product cost. Figure 1 illustrates the first-generation hardware platform consisting of a passive developers' backplane and the netdimm CPU module. The developers' backplane is intended to be used by other embedded system developers for quick application development. The backplane includes a standard desktop PCI slot to facilitate software development earlier in the cycle using standard PCI peripheral cards.

Figure 1. First-Generation dimmPCI CPU Module and Backplane

System Architecture

The netdimm CPU architecture consists of a microprocessor, memory, network interface, PCI interface and serial communication ports. The backplane architecture consists of three dimmPCI slots, one desktop PCI slot and associated I/O connectors. Figures 2 and 3 show the key components for the netdimm and passive backplane, respectively.

Figure 2. Block Diagram of CPU Module

Figure 3. Block Diagram of Application Development Backplane


The first-generation product is a complete single-board computer based on a Motorola Dragonball VZ microcontroller. The MC68VZ328 is based on a 68K core and runs at higher speeds and lower power than previous components. The VZ328 is used in many of the most popular PDAs on the market today and has many features integrated into the device including:

  • FLX68000 CPU

  • chip-select logic and 8/16-bit bus interface

  • clock generation module (CGM) and power control

  • interrupt controller

  • 76 GPIO lines grouped into ten ports

  • two pulse-width modulators (PWM 1 and PWM 2)

  • two general-purpose timers

  • two serial peripheral interfaces (SPI 1 and SPI 2)

  • two UARTs (UART 1 and UART 2) and infrared communication support

  • LCD controller

  • real-time clock

  • DRAM controller that supports EDO RAM, Fast Page Mode and SDRAM

  • in-circuit emulation module

  • bootstrap mode


The total memory capacity of the netdimm product is 40MB. This is composed of two parts: volatile and nonvolatile. The volatile memory capacity of the netdimm is 32MB of synchronous dynamic RAM (SDRAM). When Linux was considered in the design process, one of the requirements that contributed to an early design iteration was the memory footprint. This requirement influenced the microprocessor selection because of the need for an integrated SDRAM controller.

The nonvolatile memory storage is limited to 8MB and consists of Flash memory. Total kernel requirements are 450KB, and our root filesystem requirement is 150-400KB, depending on which standard Linux components are required for the application. This leaves approximately 7MB of Flash free for user applications and data.

Our kernel is stored using the ROM filesystem and is set to execute in place. Nonvolatile memory requirements can be reduced in part by using CRAMFS. CRAMFS is a special type of filesystem that uses compression to store information. The executables are decompressed to RAM for execution.


A 10Mbps 10Base-T network interface has been incorporated into the first-generation netdimm product. In addition to Ethernet, one serial RS-232 port and one serial RS-485 port have been included.

A system management bus of sorts has been incorporated using the industry-standard, serial peripheral interface (SPI) physical layer. This bus can be used for communicating with low-bandwidth peripherals.

PCI Architecture

The PCI architecture used on our first-generation module is dictated principally by the Motorola Dragonball's (VZ328) inability to support multiple masters on the address and control buses. One could insert buffers to isolate the VZ328 address and control signals. Such an implementation would suffer, however, because another external device that gains control of the bus would not have access to the peripherals internal to the VZ328. These include functions such as chip-select logic and, most importantly, the SDRAM memory interface. Therefore, little is to be gained by adding the external buffer logic.

A solution exists where a dual-ported memory is used to interface the PCI bus with the Dragonball microprocessor (Figure 4). Cypress Semiconductor offers a component with a dual-ported memory and PCI Master/Target interface combined into one integrated circuit. The Cypress Cy7C09449 acts as a bridge in this implementation. This results in two distinct address spaces: one for the VZ328 and one for the PCI.

Figure 4. Block Diagram of PCI Architecture

An interesting observation at this point is that deviations from industry-standard architectures, including the implementation described here, would be monumental undertakings were it not for the fact that source code is available for Linux.

Linux Support for PCI

In the 2.0.38 kernel we used as a starting point, Linux support for PCI peripherals is extensive but is completely dependent on BIOS calls for PCI configuration. This works in the world of PCs but is a gaping hole in implementing PCI for an embedded system. We were unaware of any Linux implementations of PCI that did not use the BIOS, and so our starting point was to give our system PCI BIOS functionality in order to make use of the base Linux PCI driver.

PCI for Intel Designs vs. dimmPCI Designs

Most developers are familiar with PCI in the Intel x86 environment. It is important to note that such an implementation is a special case: the PCI memory address space and the host processor's (386, 486, Pentium, Athlon) physical address space are one and the same. The bridge arrangement required by the VZ328 architecture described above is the more general case.

There are other significant development issues as well. These stem from the reality that most PCI development is for products intended for the desktop. In addition to the above problem, other problems involved in porting PC drivers to our embedded platform include: 1) byte order generally is ignored, because x86 and PCI are both little-endian; 2) word alignment generally is poor on x86, because these processors automatically generate multiple bus cycles; 3) new hardware still reflects ISAs (industry-standard architectures, i.e., AT) by using the same I/O register maps, rather than introducing improved memory space register maps; and 4) drivers, especially video drivers, sometimes rely on expansion ROMs that contain x86 binary code, which assumes the existence of PC peripheral devices and specific x86 PC BIOS software interrupt vectors.

The consequence of our decision to use the Motorola VZ328 microprocessor is that placing the memory onto the PCI bus would transform the architecture into the special case of the PC, at increased cost and great technical complexity.

PCI BIOS Development

The PCI BIOS must do two things. It must allocate memory space, I/O space and interrupts. It also must allow a device driver access to PCI configuration cycles that allow the device driver to find the card, read the card resources and be able to configure the card as necessary.

The PCI BIOS calls have been extended to provide a hardware abstraction layer to compensate for having a split memory address space. These functions are summarized as follows:

  • pcibios_read_memory_byte()

  • pcibios_write_memory_byte()

  • pcibios_read_memory_word()

  • pcibios_write_memory_word()

  • pcibios_read_memory_dword()

  • pcibios_write_memory_dword()

The VZ328, unlike x86 processors, has a single memory address space and no separate I/O address space. The VZ328 itself cannot generate an I/O cycle, nor does it need to. Instead, the VZ328 communicates the PCI I/O address to the bridge, just as it does for PCI memory addresses, together with a request for the bridge to generate a PCI I/O cycle. The PCI BIOS calls have been extended with the addition of the following functions to support I/O cycles:

  • pcibios_read_io_byte()

  • pcibios_write_io_byte()

  • pcibios_read_io_word()

  • pcibios_write_io_word()

  • pcibios_read_io_dword()

  • pcibios_write_io_dword()

The VZ328, unlike x86 processors, is a big-endian device. PCI was defined to work well with x86 devices, which are little-endian. Since most PCI data transfers are word or dword (i.e., few byte transfers), it is advantageous to avoid byte swapping on every word or dword access, so the data bus between the bridge and the VZ328 has been swapped in hardware. A side effect is that byte accesses now must be corrected by inverting the least significant address bit (A0). Fortunately, this is handled in the supplied PCI BIOS calls (and in the above extensions), so the details largely are hidden by the PCI BIOS:
  • pcibios_read_config_byte()

  • pcibios_write_config_byte()

  • pcibios_read_config_word()

  • pcibios_write_config_word()

  • pcibios_read_config_dword()

  • pcibios_write_config_dword()

Most transfers on the PCI bus are accomplished using direct memory access (DMA) bus cycles. These transfers can be initiated by any PCI device to any memory address reachable on the PCI bus. Since the VZ328 address space is separate from the PCI address space, PCI DMA transfers cannot specify a source or destination that is within the VZ328 memory space.

The Cypress Cy7C09449 contains a 32KB dual-ported memory that is used to bridge the VZ328 memory space to PCI memory space. Since there may be more than one PCI device in a system, there can be more than one PCI device driver installed at any given time. This condition compels us to treat the dual-ported memory as a shared resource and to balance efficiency with performance.

The kernel function kmalloc(), which in the 2.0 kernel calls get_free_pages(), was designed to allocate such special memory using GFP_ flags. GFP_DMA is defined for PC implementations and denotes a page that lies below the 16MB memory limit and does not cross a 64KB boundary (a consequence of the limitations of the 8237 DMA controller of the time).

Since PCs share processor and PCI memory address spaces, there is no special flag defined to select memory for use with DMA and PCI. It appears that the PC standard here is designed around the special case. To compound the problem, there is no central resource to initiate a DMA transfer, as device drivers initiate transfers by writing to nonstandard registers on the device hardware. Generally speaking, this code needs to be altered to become bridge-aware. Therefore, to facilitate sharing the dual-ported RAM on the bridge, the PCI BIOS has been extended to include kmalloc memory tables and a GFP_PCI flag. A device driver may allocate and deallocate DPram on a transfer-by-transfer basis or request a block of memory that it will own permanently, until the device driver is closed.

Finally, interrupts are handled using the register_interrupt() call. PCI specifies an 8-bit field for the interrupt number, so the VZ328 PCI BIOS can assign a machine interrupt number that a device driver then claims with register_interrupt().

Potential Applications

The dimmPCI technology platform has been developed with several applications in mind, including asset management, remote-access data acquisition and instrumentation, industrial networking and control, security and power management. Using internet technologies, the dimmPCI platform enables a new class of internet appliances that brings the rich graphical environment of a web browser to small embedded devices. The dimmPCI platform differs significantly from other embedded implementations, most notably in its ability to act as a web server directly, bypassing the need for a separate data server.

In the future, the dimmPCI technology platform will be expanded to improve performance and broaden peripheral support. Architectural enhancements will include support for USB 2.0 as well.

Steven Slupsky has worked in advanced microelectronics and embedded systems design for over 15 years, first as a design engineer and architect and later in management and leadership roles for over a dozen advanced design projects.
