+Linux Graphics Drivers: an Introduction
+Stéphane Marchesin
+Accelerating graphics is a complex art which suffers from a mostly unjustified
+ reputation of being voodoo magic.
+ This book is intended as an introduction to the inner workings and development
+ of graphics drivers under Linux.
+ Throughout this whole book, knowledge of C programming is expected, along
+ with some familiarity with graphics processors.
+ Although its primary audience is the graphics driver developer, this book
+ details the internals of the full Linux graphics stack and therefore can
+ also be useful to application developers seeking to enhance their vision
+ of the Linux graphics world: one can hope to improve the performance of
+ one's applications through a better understanding of the Linux graphics stack.
+ In this day and age of pervasive 3D graphics and GPU computing, a better
+ comprehension of graphics is a must have!
+Book overview
+ there.
+ Then we paint a high-level view of the Linux graphics stack in Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:The-Big-Picture"
+ and its evolution over the years.
+ Linux that, although primitive, sees wide usage in the embedded space.
+ Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:The-DRM-Kernel"
+ introduces the DRM, a kernel module which is in charge of arbitrating all
+ graphics activity going on in a Linux system.
+ the developer.
+ Video decoding sees its own dedicated part in Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Video-Decoding"
+ Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Mesa"
+ and
+ acceleration under Linux used as the framework for 3D drivers.
+ Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:GPU-Computing"
+ specifications in Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Technical-Specifications"
+ and what you should do aside from pure development in Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Beyond-Development"
+\begin_layout Standard
+\begin_layout Section
+\begin_layout Standard
+Computer graphics move at a fast pace, and this book is not about the past.
+ Obsolete hardware (ISA, VLB, ...), old standards (the VGA standard and its
+ dreadful int10, VESA), outdated techniques (user space modesetting) and
+ old X11 servers (Xsun, XFree86, KDrive...) will not be detailed.
+\begin_layout Chapter
+\begin_layout Standard
+Before diving any further into the subject of graphics drivers, we need
+ to understand the hardware which is at play.
+ This chapter is by no means intended as a complete description of all
+ inner workings of your average computer and its graphics hardware, but
+ only as an introduction to them.
+ The goal of this section is to
+\begin_inset Quotes eld
+cover the bases
+\begin_inset Quotes erd
+ on what will be required later on.
+ Notably, most hardware concepts that will subsequently be required are
+ introduced here.
+ Although we sometimes have to go through architecture-specific hoops, we
+ try to stay as generic as possible and the concepts detailed thereafter
+ generalize well.
+\begin_layout Section
+\begin_layout Standard
+Today, all computers share the same basic architecture: a central processor
+ and a number of peripherals.
+ In order to exchange data, these peripherals are interconnected by a bus
+ over which all communications go.
+ Figure
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:Peripheral-interconnection-in"
+ outlines the layout of peripherals in a standard computer.
+\begin_layout Standard
Peripheral interconnection in a typical computer.
+The first user of the bus is the CPU.
+ The CPU uses the bus to access system memory and other peripherals.
+ However, the CPU is not the only one able to write and read data to the
+ peripherals; the peripherals themselves also have the capability to exchange
+ information directly.
+ In particular, a peripheral which has the ability to read and write to
+ memory without CPU intervention is said to be DMA (Direct Memory Access)
+ capable, and the memory transaction is called a DMA.
+ Today, all graphics cards feature this ability (named DMA bus mastering)
+ which consists in the card requesting and subsequently taking control of
+ the bus for a number of microseconds.
+\begin_layout Standard
+If a peripheral has the ability to achieve DMA to or from a non-contiguous
+ list of memory pages (which is very convenient when the data is not contiguous
+ in memory), it is said to have DMA scatter-gather capability (as it can
+ scatter data to different memory pages, or gather data from different pages).
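As a rough illustration of the scatter-gather idea, the C sketch below models a gather operation in software: a list of separate chunks is copied into one linear buffer. The `sg_entry` structure and the `sg_gather` function are hypothetical names invented for this example; real hardware walks a similar descriptor list on its own, using bus addresses rather than CPU pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor: one contiguous chunk of a larger transfer. */
struct sg_entry {
    uint8_t *addr;   /* start of the chunk (stands in for a bus address) */
    size_t   len;    /* chunk length in bytes */
};

/*
 * Gather: copy the chunks listed in sg[0..n-1], in order, into one linear
 * destination buffer. Returns the number of bytes copied, or 0 if the
 * destination is too small to hold all chunks.
 */
size_t sg_gather(uint8_t *dst, size_t dst_len,
                 const struct sg_entry *sg, size_t n)
{
    size_t off = 0;

    assert(dst != NULL && sg != NULL);
    for (size_t i = 0; i < n; i++) {
        if (sg[i].len > dst_len - off)
            return 0;
        memcpy(dst + off, sg[i].addr, sg[i].len);
        off += sg[i].len;
    }
    return off;
}
```

The scatter direction would be the mirror image: one linear source copied out to the chunks of the list.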
+\begin_layout Standard
+Notice that the DMA capability can be a downside in some cases.
+ For example, on real-time systems, the CPU is unable to access
+ the bus while a DMA transaction is in progress, and since DMA transactions
+ happen asynchronously this can lead to missing a real-time scheduling deadline.
+ Therefore, while DMA has a lot of advantages from a performance viewpoint,
+ there are situations where it should be avoided.
+Bus types
+\begin_layout Standard
+Buses connect the machine peripherals together; each and every communication
+ between different peripherals goes over (at least) one bus.
+ In particular, a bus is the way most graphics cards are connected to the
+ rest of the computer (one notable exception being the case of some embedded
+ systems, where the GPU is directly connected to the CPU).
+ As shown in Table
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:Common-bus-types"
+, there are many bus types suitable for graphics: PCI, AGP, PCI-X, PCI-express
+ to name a (relevant) few.
+ All the bus types we will detail are variants of the PCI bus type; however,
+ some of them feature distinctive improvements over the original PCI design.
+\begin_layout Standard
Common bus types.
+PCI (Peripheral Component Interconnect)
+\begin_layout Standard
+PCI is the most basic bus that allows connecting graphics peripherals today.
+ One of its key features is called bus mastering.
+ This feature allows a given peripheral to take hold of the bus for a given
+ number of cycles and do a complete transaction (called a DMA, Direct Memory
+ Access).
+ The PCI bus is coherent, which means that no explicit flushes are required
+ for the memory to be coherent across devices.
+AGP (Accelerated Graphics Port)
+\begin_layout Standard
+AGP is essentially a modified PCI bus with a number of extra features compared
+ to its ancestor.
+ Most importantly, it is faster thanks to a higher clock speed and the ability
+ to send 2, 4 or 8 bits per lane on each clock tick (for AGP 2x, 4x and
+ 8x respectively).
+ AGP also has three distinctive features:
+\begin_layout Itemize
+The first feature is AGP GART (Graphics Aperture Remapping Table), a simple
+ form of IOMMU (as will be seen in section
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "sec:Virtual-and-Physical"
+ It allows taking a (non contiguous) set of physical memory pages out of
+ system memory and exposing it to the GPU for its use as a contiguous area.
+ This increases the amount of memory usable by the GPU at little cost, and
+ creates a convenient area for sharing data between the CPU and the GPU
+ (AGP graphics cards can do fast DMA to/from this area, and since the GART
+ area is a chunk of system RAM, CPU access is a lot faster than VRAM).
+ One notable drawback is that the GART area is not coherent, and therefore
+ writes to GART (be it from the GPU or CPU) need to be flushed before transactions
+ from the other party can begin.
+ Another drawback is that only a single GART area is handled by the hardware,
+ and it has to be sub-allocated by the driver.
+\begin_layout Itemize
+The second feature is AGP side band addressing (SBA).
+ Side band addressing consists in 8 extra bus bits used as an address bus.
+ Instead of multiplexing the bus bandwidth between addresses and data, the
+ nominal AGP bandwidth can be dedicated to data only.
+ This feature is transparent to the driver developer.
+\begin_layout Itemize
+The third feature is AGP Fast Writes (FW).
+ Fast writes allow sending data to the graphics card directly, without having
+ the card initiate a DMA.
+ This feature is also transparent for the driver developer.
+\begin_layout Standard
+Keep in mind that these last two features are known to be unstable on a
+ wide range of hardware, and oftentimes require chipset-specific hacks to
+ work properly.
+ Therefore it is advisable not to enable them.
+ In fact, they are an extremely frequent cause for strange hardware errors
+ on AGP cards.
+\begin_layout Standard
+PCI-X was developed as a faster PCI for server boards, and very few graphics
+ peripherals exist in this format.
+ It is not to be confused with PCI-Express, which sees real widespread usage.
+\begin_layout Subparagraph*
+PCI-Express (PCI-E)
+\begin_layout Standard
+PCI-Express is the new generation of PCI devices.
+ It has more advantages than a simple improved PCI.
+\begin_layout Standard
+Finally, it is important to note that, depending on the architecture, the
+ CPU-GPU communication does not always rely on a bus.
+ This is especially common on embedded systems where the GPU and the CPU
+ are on a single die.
+ In that case the CPU can access the GPU registers directly.
+\begin_layout Section
+\begin_layout Standard
+The term
+\begin_inset Quotes eld
+\begin_inset Quotes erd
+ has two main different meanings:
+\begin_layout Itemize
+Physical memory.
+ Physical memory is real, hardware memory, as stored in the memory chips.
+\begin_layout Itemize
+Virtual memory.
+ Virtual memory is a translation of physical memory addresses allowing user
+ space applications to see their allocated chunks as if they were contiguous
+ while they are fragmented and scattered on the chips.
+\begin_layout Standard
+In order to simplify programming, it is easier to handle contiguous memory
+ areas.
+ This is easy to achieve as long as only a small area is needed.
+ But allocating a bigger memory chunk would require as much contiguous physical
+ memory, which is difficult if not impossible to achieve shortly after bootup
+ because of memory fragmentation.
+ Therefore, a mechanism is required to keep the appearance of a contiguous
+ piece of memory to the application while using scattered pieces.
+\begin_layout Standard
+To achieve this, memory is split into pages.
+ For the scope of this book, it is sufficient to say that a memory page
+ is a collection of contiguous bytes in physical memory
+\begin_inset Foot
+status open
+\begin_layout Plain Layout
+On x86 and x86-64, a page is usually 4096 bytes long, although different
+ sizes are possible on other architectures or with huge pages.
+In order to make a scattered list of physical pages seem contiguous in virtual
+ space, a piece of hardware called the MMU (memory management unit) converts virtual
+ addresses (used in applications) into physical addresses (used for actually
+ accessing memory) using a page table, as shown in Figure
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:MMU-and-IOMMU"
+ In case a page does not exist in virtual space (and therefore not in the
+ MMU table), the MMU is able to signal it, which provides the basic mechanism
+ for reporting access to non-existent memory areas.
+ This in turn is used by the system to implement advanced memory programming
+ like swapping or on-the-fly page instantiations.
+ As the MMU is only effective for CPU accesses to memory, virtual addresses
+ are not meaningful to peripherals, since they are not able to match them to
+ physical addresses.
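The translation step can be sketched in C with a toy single-level page table. The names (`page_table`, `translate`, `PAGE_INVALID`) and the 16-page address space are assumptions made for illustration; a real MMU uses multi-level tables and performs the lookup in hardware.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12u                 /* 4096-byte pages */
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define NUM_PAGES    16u                 /* toy virtual address space */
#define PAGE_INVALID UINT32_MAX          /* marks an unmapped page */

/* Toy page table: virtual page number -> physical page number. */
struct page_table {
    uint32_t phys_page[NUM_PAGES];
};

/*
 * Translate a virtual address. Returns 0 and stores the physical address
 * in *phys on success, or -1 to signal a "page fault" when the page is
 * not mapped.
 */
int translate(const struct page_table *pt, uint32_t virt, uint32_t *phys)
{
    uint32_t vpn = virt >> PAGE_SHIFT;            /* virtual page number */
    uint32_t offset = virt & (PAGE_SIZE - 1u);    /* offset inside the page */

    assert(pt != 0 && phys != 0);
    if (vpn >= NUM_PAGES || pt->phys_page[vpn] == PAGE_INVALID)
        return -1;
    *phys = (pt->phys_page[vpn] << PAGE_SHIFT) | offset;
    return 0;
}
```

Two consecutive virtual pages can point to unrelated physical pages, which is exactly how a scattered allocation appears contiguous to the application. The -1 return plays the role of the signal that the operating system uses for swapping or on-demand page instantiation.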
+\begin_layout Standard
+\begin_layout Standard
+While the MMU only works for CPU accesses, it has an equivalent for peripherals:
+ the IOMMU.
+ As shown in Figure
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:MMU-and-IOMMU"
+, an IOMMU is the same as an MMU except that it virtualizes the address
+ space of peripherals.
+ The IOMMU can see various incarnations, either on the motherboard chipset
+ (in which case it is shared between all peripherals) or on the graphics
+ card itself (where it will be called AGP GART, PCI GART).
+ The job of the IOMMU is to translate memory addresses from the peripherals
+ into physical addresses.
+ In particular, this allows
+\begin_inset Quotes eld
+\begin_inset Quotes erd
+ a device into restricting its DMAs to a given range of memory and it is
+ required for better security and hardware virtualization.
+\begin_layout Standard
+A special case of IOMMU is the Linux swiotlb which allocates a contiguous
+ piece of physical memory at boot (which makes it feasible to have a large
+ contiguous physical allocation since there is no fragmentation yet) and
+ uses it for DMA.
+ As the memory is physically contiguous, no page translation is required
+ and therefore a DMA can occur to and from this memory range.
+ However, this means that this memory (64MB by default) is preallocated
+ and will not be used for anything else.
+\begin_layout Standard
+AGP GART is another special case of IOMMU present with AGP graphics cards
+ which exposes a single linear area to the card.
+ In that case the IOMMU table is embedded in the AGP chipset, on the motherboard.
+\begin_layout Standard
+Yet another special case of IOMMU is the PCI GART which allows exposing
+ a chunk of system memory to the card.
+ In that case the IOMMU table is embedded in the graphics card, and often
+ the physical memory used does not need to be contiguous.
+\begin_layout Standard
+\begin_layout Standard
+Obviously, with so many different memory types, performance is not homogeneous;
+ not all combinations of accesses are fast, depending on whether they involve
+ the CPU, the GPU, or bus transfers.
+ Another issue which arises is memory coherence: how can one ensure that
+ memory is coherent across devices, in particular that data written by
+ the CPU is available to the GPU (or the opposite)?
+ These two issues are correlated, as higher performance usually means a
+ lower level of memory coherence, and vice-versa.
+\begin_layout Standard
+As far as setting the memory caching parameters goes, there are two ways
+ to set caching attributes on memory ranges:
+\begin_layout Itemize
+ An MTRR (Memory Type Range Register) is a register describing attributes
+ for a range of given physical memory.
+ The number of MTRRs depends on the system, but is very limited.
+ Although this applies to a physical memory range, the effect works on the
+ corresponding virtual memory pages.
+ This for example makes it possible to map pages with a specific caching
+ type.
+\begin_layout Itemize
+PAT (Page Attribute Table) allows setting per-page memory attributes.
+ However it is an extension only available on recent x86 processors.
+\begin_layout Standard
+On top of these, one can use explicit caching instructions on some architectures,
+ for example on x86
+\emph on
+\emph default
+ is an uncached mov instruction and
+\emph on
+\emph default
+ can selectively flush cache lines.
+\begin_layout Standard
+There are 3 caching modes, usable both through MTRR and PAT on system memory:
+\begin_layout Itemize
+UC (UnCached) memory is uncached.
+ No CPU reads or writes to this area are cached, and each memory write instruction
+ triggers an actual immediate memory write.
+ This is helpful to ensure that information has been actually written so
+ as to avoid CPU/GPU race conditions.
+\begin_layout Itemize
+WC (Write Combine) memory is uncached, but CPU writes are combined together
+ in order to improve the performance.
+ This is useful to improve performance in situations where uncached memory
+ is required, but where combining the writes together has no adverse effects.
+\begin_layout Itemize
+WB (Write Back) memory is cached.
+ This is the default mode and leads to the best performance for CPU accesses.
+ However this does not ensure that memory writes are propagated to central
+ memory after a finite time.
+\begin_layout Standard
+Notice that these caching modes apply to the CPU only; GPU accesses
+ are not directly affected by the current caching mode.
+ However, when the GPU has to access an area of memory which was previously
+ filled by the CPU, uncached modes ensure that the memory writes are actually
+ done, and are not pending sitting in a CPU cache.
+ Another way to achieve the same effect is the use of cache flushing instructions
+ present on some x86 processors (like clflush).
+ However this is less portable than using the caching modes.
+ Yet another (portable) way is the use of memory barriers, which ensure
+ that pending memory writes have been committed to main memory before moving
+ on.
+\begin_layout Standard
+Obviously with so many different caching modes, not all accesses have the
+ same performance:
+\begin_layout Itemize
+When it comes to CPU access to system memory, uncached mode provides the
+ worst performance, write back provides the best performance, and write
+ combine is in between.
+\begin_layout Itemize
+When the CPU accesses the video memory from a discrete card, all accesses
+ are extremely slow, be they reads or writes, as each access needs a cycle
+ on the bus.
+ Therefore it is not recommended to access large areas of VRAM with the
+ CPU.
+ Furthermore on some GPUs synchronizing is required or this could cause
+ a GPU hang.
+\begin_layout Itemize
+Obviously the GPU accessing VRAM is extremely fast.
+\begin_layout Itemize
+GPU access to system RAM is unaffected by the caching mode, but still has
+ to go over the bus.
+ This is the case of DMA transactions.
+ As those happen asynchronously, they can be considered
+\begin_inset Quotes eld
+\begin_inset Quotes erd
+ from the viewpoint of the CPU; however, there is a non-negligible setup
+ cost involved for each DMA transaction.
+ This is why, when transferring small amounts of memory, a DMA transaction
+ is not always better than a direct CPU access.
+\begin_layout Standard
+Finally, one last important point to make about memory is the notion of
+ memory barriers and write posting.
+ In the case of a buffered or cached (Write Combine or Write Back) memory area, a memory
+ barrier ensures that pending writes have actually been committed to memory.
+ This is used, for example, before asking the GPU to read a given memory
+ area.
+ For I/O areas, a similar technique called write posting exists: it consists
+ in doing a dummy read inside the I/O area which will, as a side effect,
+ wait until pending writes have taken effect before completing.
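A minimal sketch of both techniques follows, assuming a C11 compiler. The `fake_mmio` array stands in for a mapped I/O area and the register indices are hypothetical. `atomic_thread_fence` is used here as a portable stand-in; Linux kernel code would normally rely on the kernel's own primitives instead (for example `wmb()` and `readl()`).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a mapped I/O region; a real driver maps device registers. */
volatile uint32_t fake_mmio[4];

/* Hypothetical register indices, for illustration only. */
enum { REG_DOORBELL = 0, REG_STATUS = 1 };

/* Fill a buffer, then make the writes visible before notifying the device. */
void submit_buffer(uint32_t *buf, uint32_t n, uint32_t value)
{
    assert(buf != NULL);
    for (uint32_t i = 0; i < n; i++)
        buf[i] = value + i;

    /* Memory barrier: order the buffer writes before the doorbell write. */
    atomic_thread_fence(memory_order_release);

    /* Tell the (simulated) device how many words are ready. */
    fake_mmio[REG_DOORBELL] = n;

    /* Dummy read from the same I/O area, as described in the text. */
    (void)fake_mmio[REG_STATUS];
}
```

The ordering is the point of the sketch: data first, barrier second, notification third, read-back last. Swapping any two of these steps reintroduces the race the barrier is meant to prevent.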
+The Graphics Card
+\begin_layout Standard
+Today, a graphics card is basically a computer-in-the-computer.
+ It is a complex beast with a dedicated processor on a separate card, and
+ features its own computation units, its own bus, and its own memory.
+\begin_layout Subsubsection*
+\begin_layout Standard
+The GPU's memory, which we will from now on refer to as video memory, can
+ be either real, dedicated, on-card memory (in the case of a discrete card),
+ or memory shared with the CPU (in the case of an integrated card).
+ Notice that the case of shared memory has interesting implications, as
+ it means that system to video memory copies can be virtually free if implemented
+ properly; while the case of dedicated memory means that transfers back
+ and forth will need to happen.
+\begin_layout Standard
+It is not uncommon for modern GPUs to feature a form of virtual memory as
+ well, allowing different resources (real video memory or system
+ memory) to be mapped into the GPU address space.
+ This is very similar to the CPU's virtual memory, but uses a completely
+ separate hardware implementation.
+ For example, older Radeon cards (actually since Rage 128) feature a number
+ of surfaces which you can map into the GPU address space, each of which
+ is a contiguous memory resource (video RAM, AGP, PCI).
+ Old Nvidia cards (everything up to NV40) have a similar concept based on
+ objects which describe an area of memory which can then be bound to a given
+ use.
+ Recent cards (starting with NV50 and R800) let you build the address space
+ page by page, with the ability of picking system and dedicated video memory
+ pages at will.
+ The similarity of these with a CPU virtual address space is very striking;
+ in fact, you can have accesses to unmapped pages be signaled to you through
+ an interrupt and act on this in a video memory page fault handler.
+ However, be careful playing with those as the implication here is that
+ driver developers have to juggle with multiple address spaces from the
+ CPU and GPU which are going to be fundamentally different.
+\begin_layout Standard
+Surfaces are the basic sources and targets for all rendering.
+ Although they can be called differently (textures, render targets, buffers...)
+ the basic idea is always the same.
+ Figure
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:The-layout-of"
+ depicts the layout of a graphics surface.
+ The surface width is rounded up to what we call the pitch because of hardware
+ limitations (usually to the next multiple of some power of 2) and therefore
+ there exists a dead zone of pixels which goes unused.
+ The graphics surface has a number of characteristics:
+\begin_layout Itemize
+The pixel format of the surface.
+ A pixel color is represented in memory by its red, green and blue components,
+ plus an alpha component used as the opacity for blending.
+ The number of bits for a whole pixel usually matches hardware sizes (8, 16
+ or 32 bits) but the distribution of the bits between the four components
+ does not have to match those.
+ The number of bits used for each pixel is referred to as bits per pixel,
+ or
+\emph on
+\emph default
+ Common pixel formats include 888 RGBX, 8888 RGBA, 565 RGB, 5551 RGBA,
+ 4444 RGBA.
+ Notice that most cards today work natively in ABGR 8888.
+\begin_layout Itemize
+Width and height are the most obvious characteristics, and are given in
+ pixels.
+\begin_layout Itemize
+The pitch is the width in bytes (not in pixels!) of the surface, including
+ the dead zone pixels.
+ The pitch is convenient for computing memory usages, for example the size
+ of the surface should be computed by
+\begin_inset Formula $height\times pitch$
+ and not
+\begin_inset Formula $height\times width\times bpp$
+ in order to include the dead zone.
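The size computation above can be sketched in C as follows. The function names and the alignment parameter are assumptions for illustration; the actual pitch alignment is dictated by the hardware. Note that this sketch takes the pixel size in bytes rather than in bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round the row size in bytes up to a power-of-two alignment: the pitch. */
size_t surface_pitch(uint32_t width, uint32_t bytes_per_pixel, size_t align)
{
    size_t row = (size_t)width * bytes_per_pixel;

    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    return (row + align - 1) & ~(align - 1);
}

/* Total allocation size, dead zone included: height * pitch. */
size_t surface_size(uint32_t height, size_t pitch)
{
    return (size_t)height * pitch;
}

/* Byte offset of pixel (x, y) inside a linear (non-tiled) surface. */
size_t pixel_offset(uint32_t x, uint32_t y, size_t pitch,
                    uint32_t bytes_per_pixel)
{
    return (size_t)y * pitch + (size_t)x * bytes_per_pixel;
}
```

For instance, a 1366 pixel wide surface at 4 bytes per pixel with a 256-byte alignment has a row size of 5464 bytes but a pitch of 5632 bytes; the difference on each line is the dead zone.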
+\begin_layout Standard
+Notice that surfaces are not always stored linearly in video memory; in
+ fact, for performance reasons it is extremely common that they are not,
+ as this improves the locality of the memory accesses when rendering.
+ Such surfaces are called
+\emph on
+\emph default
+ The exact layout of a tiled surface is highly dependent on the hardware,
+ but is usually a form of space-filling curve like the Z-order curve or the Hilbert
+ curve.
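To give an idea of what such a layout looks like, here is a sketch of the textbook Z-order (Morton) index, which interleaves the bits of the x and y coordinates. The function names are hypothetical and actual hardware tiling formats differ from card to card, so this should be read as an illustration of the locality idea only.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of v so that a zero bit separates each of them. */
uint32_t spread_bits(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order index of pixel (x, y): x bits in even positions, y bits in odd. */
uint32_t morton_index(uint32_t x, uint32_t y)
{
    assert(x <= 0xFFFFu && y <= 0xFFFFu);
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

With this ordering the four pixels of any aligned 2x2 block get consecutive indices, so neighbours in both directions tend to stay close in memory, whereas in a linear layout vertical neighbours are a full pitch apart.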
+\begin_layout Standard
The layout of a surface.
+2D engine
+\begin_layout Standard
+The 2D engine, or blitter, is the hardware used for 2D acceleration.
+ Blitters have been one of the earliest forms of graphics acceleration and
+ are still extremely widespread today.
+ Generally, a 2D engine is capable of the following operations:
+\begin_layout Itemize
+ Blits are a copy of a memory rectangle from one place to another by the
+ GPU.
+ The source and destination can be either video or system memory.
+\begin_layout Itemize
+Solid fills.
+ Solid fills consist in filling a rectangular memory area with a color.
+ Note that this can also include the alpha channel.
+\begin_layout Itemize
+Alpha blits.
+ Alpha blits use the alpha component of pixels of a surface to achieve
+ transparency [Porter & Duff].
+\begin_layout Itemize
+Stretched blits.
+\begin_layout Standard
+\begin_layout Standard
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:Blitting-between-two"
+ shows an example of blitting a rectangle between two different surfaces.
+ This operation is defined by the following parameters: the source and destination
+ coordinates, the source and destination pitches, and the blit width and
+ height.
+ However, these are only 2D coordinates; no perspective is possible.
+\begin_layout Standard
+\begin_layout Standard
+When a blit happens between two overlapping source and destination surfaces,
+ the semantics of the copy is not trivially defined, especially if one considers
+ that what happens for a blit is not a simple move of a rectangle, but is
+ done pixel-by-pixel at the core.
+ As seen on Figure
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:Overlapping-blit-inside"
+, if one does a line-by-line copy top to bottom, some source pixels will
+ be modified as a side effect.
+ Therefore, the notion of blitting direction was introduced into the blitters.
+ In this case, a proper copy requires copying from bottom to top.
+ Some cards will determine the blitting direction automatically according
+ to surface overlap (for example Nvidia GPUs), and others will not.
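The choice of blitting direction can be sketched in software as follows, assuming a linear surface with one byte per pixel. `blit_overlap` is a hypothetical helper, not the interface of any real blitter; it only shows why the row order must depend on the relative position of source and destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a w x h rectangle of bytes inside one linear surface, from (sx, sy)
 * to (dx, dy). Rows are processed in an order that reads every source row
 * before it can be overwritten; memmove handles overlap within one row.
 */
void blit_overlap(uint8_t *surf, size_t pitch,
                  size_t sx, size_t sy, size_t dx, size_t dy,
                  size_t w, size_t h)
{
    assert(surf != NULL && sx + w <= pitch && dx + w <= pitch);

    if (dy <= sy) {
        /* Destination above (or level with) the source: top to bottom. */
        for (size_t row = 0; row < h; row++)
            memmove(surf + (dy + row) * pitch + dx,
                    surf + (sy + row) * pitch + sx, w);
    } else {
        /* Destination below the source: bottom to top. */
        for (size_t row = h; row-- > 0; )
            memmove(surf + (dy + row) * pitch + dx,
                    surf + (sy + row) * pitch + sx, w);
    }
}
```

Moving a rectangle one pixel down and to the right with a plain top-to-bottom loop would overwrite the second source row before it is read; the bottom-to-top branch avoids this.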
+\begin_layout Standard
+Finally, keep in mind that not all current graphics accelerators feature
+ a 2D engine.
+ Since 3D acceleration is technically a superset of 2D acceleration, it
+ is possible to implement 2D acceleration using the 3D engine (and this
+ idea is one of the core ideas behind the Gallium 3D design, which will
+ be detailed in Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Gallium-3D"
+ And indeed some drivers use the 3D engine to implement 2D, which allows
+ GPU makers to completely part with the transistors otherwise dedicated
+ to it.
+ Yet some other cards do not dedicate the transistors, but microprogram
+ 2D operations on top of 3D operations inside the GPU (this is the case
+ for Nvidia cards since NV10 and up to NV50, and for the Radeon R600 series
+ which have an optional firmware that implements 2D on top of 3D).
+ This sometimes has an impact on mixing 2D and 3D operations since those
+ now share hardware units.
+3D engine
+\begin_layout Standard
+A 3D engine is also called
+\begin_inset Quotes eld
+rasterization pipeline
+\begin_inset Quotes erd
+, because it contains a series of stages which exchange data in a pipeline
+ (1-directional) fashion.
+\begin_layout Standard
+vertex -> geom -> fragment
+\begin_layout Standard
+graphics fifo
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+tiled textures
+Overlays and hardware sprites
+Programming the card
+\begin_layout Standard
+Each PCI card exposes a number of PCI resources; lspci -v lists these resources.
+ These can be, but are not limited to, BIOSes, MMIO ranges, video memory
+ (or only some part of it).
+ As the total PCI resource size is limited, oftentimes a card will only
+ expose part of its video memory as a resource, and the only way to access
+ the remaining memory is through DMA from other, reachable areas (in a way
+ similar to bounce pages).
+ This is increasingly common as the video memory sizes keep growing while
+ the PCI resource space stays limited.
+\begin_layout Standard
+MMIO is the most direct access to the card.
+ A range of addresses is exposed to the CPU, where each write goes directly
+ to the GPU.
+ This allows the simplest form of communication of commands from the CPU
+ to the GPU.
+ This type of programming is synchronous, so writes are done by the CPU
+ and executed on the GPU in a lockstep fashion.
+ This results in sub-par performance as each access turns into a packet on the bus.
+\begin_layout Standard
+A direct memory access (DMA) is the use by a peripheral of the bus mastering
+ feature of the bus.
+ This allows one peripheral to talk directly to another, without intervention
+ from the CPU.
+ In the graphics card case, the two most common uses of DMAs are:
+\begin_layout Itemize
+Transfers by the GPU to and from system memory (for reading textures and
+ writing buffers).
+ This allows implementing things like texturing over AGP or PCI, and hardware-accelerated
+ texture transfers.
+\begin_layout Itemize
+The implementation of a command FIFO.
+ As MMIO between the CPU and GPU is synchronous and graphics drivers inherently
+ use a lot of I/O, a faster means of communicating with the card is required.
+ The command FIFO is a piece of memory (either system memory or more rarely
+ video memory) shared between the graphics card and the CPU, where the CPU
+ places commands for later execution by the GPU.
+ Then the GPU reads the FIFO asynchronously using DMA and executes the commands.
+ This model allows asynchronous execution of the CPU and GPU command flows
+ and thus leads to higher performance.
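The command FIFO model can be sketched as a ring buffer with a put pointer advanced by the CPU and a get pointer advanced by the GPU. The structure and function names below are assumptions for illustration, and the GPU side is simulated by a plain function; on real hardware the GPU fetches the commands by DMA and the pointers usually live in registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 8u   /* number of command slots; must be a power of two */

/* Hypothetical command ring shared by a producer (CPU) and a consumer (GPU). */
struct cmd_fifo {
    uint32_t cmds[FIFO_SIZE];
    uint32_t put;   /* total commands written by the CPU so far */
    uint32_t get;   /* total commands consumed by the GPU so far */
};

/* CPU side: queue one command; returns false when the ring is full. */
bool fifo_push(struct cmd_fifo *f, uint32_t cmd)
{
    assert(f != 0);
    if (f->put - f->get == FIFO_SIZE)
        return false;
    f->cmds[f->put % FIFO_SIZE] = cmd;
    f->put++;
    return true;
}

/* GPU side (simulated): fetch one command; returns false when empty. */
bool fifo_pop(struct cmd_fifo *f, uint32_t *cmd)
{
    assert(f != 0 && cmd != 0);
    if (f->get == f->put)
        return false;
    *cmd = f->cmds[f->get % FIFO_SIZE];
    f->get++;
    return true;
}
```

The CPU only stalls when the ring is full, which is what decouples the two command flows. A real driver would also need a memory barrier between writing the command and publishing the new put pointer.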
+\begin_layout Standard
+Interrupts are a way for hardware peripherals in general, and GPUs in particular,
+ to signal events to the CPU.
+ Usage examples for interrupts include signaling completion of a graphics
+ command, signaling a vertical blanking event, reporting a GPU error, ...
+ When an interrupt is raised by the peripheral, the CPU executes a small
+ routine called an interrupt handler, which preempts other current executions.
+ There is a maximum execution time for an interrupt handler, so the drivers
+ have to keep it short (not more than a few microseconds).
+ In order to execute more code, the common solution is to schedule a tasklet
+ from the interrupt handler.
+\begin_layout Section
+\begin_layout Standard
+Display devices are the last link of the graphics chain.
+ They are in charge of presenting the pictures to the user.
+\begin_layout Standard
+digital vs analog signal
+\begin_layout Standard
+hsync, vsync
+\begin_layout Standard
+sync on green
+\begin_layout Standard
+Connectors and encoders: CRTC, TMDS, LVDS, DVI-I, DVI-A, DVI-D, VGA (D-SUB
+ 15 is the proper name)
+\begin_layout Section
+\begin_layout Standard
+Shader engine 4+1
+\begin_layout Standard
+Nvidia hardware has several specificities compared to other architectures.
+ The first one is the availability of multiple contexts, which is implemented
+ using multiple command FIFOs (similar to what some high-end InfiniBand
+ networking cards do) and a context switching mechanism to switch between
+ those FIFOs.
+ A small firmware handles the context switches; it is
+ responsible for saving the graphics card state to a portion of memory and
+ restoring another context.
+ A scheduling system using the round robin algorithm handles the selection
+ of the contexts, and the timeslice is programmable.
+\begin_layout Standard
+The second specificity is the notion of graphics objects.
+ Nvidia hardware features two levels of GPU access: the first one is at
+ the raw level and is used for context switches, and the second one is the
+ graphics objects which microprogram the raw level to achieve high level
+ functionality (for example 2D or 3D acceleration).
+\begin_layout Standard
+Shader engine nv40/nv50
+\begin_layout Standard
+\begin_layout Standard
+Tiling architecture
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+There are multiple memory domains in a computer, and they are not coherent.
+\begin_layout Itemize
+A GPU is a completely separate computer with its own bus, address space
+ and computational units.
+\begin_layout Itemize
+Communication between the CPU and GPU is achieved over a bus, which has
+ non-trivial performance implications.
+\begin_layout Itemize
+GPUs can be programmed using two modes: MMIO and command FIFOs.
+\begin_layout Itemize
+There is no standard output method for display devices.
+\begin_layout Standard
+\begin_layout Standard
+The Linux graphics stack has seen numerous evolutions over the years.
+ The purpose of this section is to detail that history, as well as the justification
+ behind the changes in order to better motivate the current design.
+\begin_layout Section
+\begin_layout Standard
The X11 architecture.
+\begin_layout Standard
+DIX (Device-Independent X), DDX (Device-Dependent X),
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+X protocol
+\begin_layout Standard
+X extensions
+\begin_layout Standard
+shm -> shared memory for transport
+\begin_layout Standard
+XCB -> asynchronous
+\begin_layout Standard
+Another notable X extension is Xv, which will be discussed in further detail
+ in the video decoding chapter.
+\begin_layout Section
+\begin_layout Standard
+Initially (when Linux first supported graphics hardware acceleration), only
+ a single piece of code would access the graphics card directly: the XFree86
+ server.
+ The design was as follows: by running with super-user privileges, the XFree86
+ server could access the card from user space and did not require kernel
+ support to implement 2D acceleration.
+ The advantage of such a design was its simplicity, and the fact that the
+ XFree86 server could be easily ported from one operating system to another
+ since it required no kernel component.
+ For years this was the most widespread X server design (although there
+ were notable exceptions, like XSun which implemented modesetting in the
+ kernel for some drivers).
+\begin_layout Standard
+Later on, Utah-GLX, the first hardware-independent 3D accelerated design,
+ came to Linux.
+ Utah-GLX basically consists of an additional user space 3D driver that
+ implements GLX and directly accesses the graphics hardware from user space,
+ in a way similar to the 2D driver.
+ At a time when 3D hardware was clearly separated from 2D (because
+ the functionality used for 2D and 3D was completely different, or because
+ the 3D card was a completely separate card, à la 3Dfx), it made sense to
+ have a completely separate driver.
+ Furthermore, direct access to the hardware from user space was the simplest
+ approach and the shortest road to getting 3D acceleration going under Linux.
+\begin_layout Standard
+At the same time, framebuffer drivers (which will be detailed in Chapter
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Framebuffer-Drivers"
+) were getting increasingly widespread, and represented another component
+ that could simultaneously access the graphics hardware directly.
+ To avoid potential conflicts between the framebuffer and XFree86 drivers,
+ it was decided that VT switches would emit a signal to the X server telling
+ it to save the graphics hardware state.
+ Asking each driver to save its complete GPU state on VT switches made the
+ drivers more fragile, and life became more difficult for developers who
+ suddenly faced bug-prone interaction between different drivers.
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+Obviously, this model had drawbacks.
+ First, it required that unprivileged user space applications be allowed
+ to access the graphics hardware for 3D.
+ Second, as can be seen on figure XXX, all GL acceleration had to go indirectly
+ through the X protocol, which slowed it down.
+ Because of growing concerns about security in Linux and these performance
+ shortcomings, another model was required.
+\begin_layout Standard
+To address the reliability and security concerns with the Utah-GLX model,
+ the DRI model was put together; it was used in both XFree86 and its successor,
+ X.Org.
+ This model relies on an additional kernel component whose duty is to check
+ the 3D command stream for security problems.
+ The main change is that instead of accessing the card directly, the
+ unprivileged OpenGL application submits command buffers to the kernel,
+ which checks them for security and then passes them to the hardware
+ for execution.
+ The advantage of this model is that trusting user space is no longer required.
+ Notice that although this would have been possible, the 2D command stream
+ from XFree86 still did not go through the DRM, and therefore the X server
+ still required super-user privileges.
+\begin_layout Standard
+\begin_layout Standard
+The current stack evolved from a new set of needs.
+ First, requiring the X server to run with super-user privileges has always
+ had serious security implications.
+ Second, with the previous design different drivers were touching a single
+ piece of hardware, which would often cause issues.
+ The solution is two-fold: first, merge the kernel framebuffer functionality
+ into the DRM module and second, have X.Org access the graphics card through
+ the DRM module and run unprivileged.
+ This is called Kernel Modesetting (KMS); in this model the DRM module is
+ now responsible for providing modesetting services both as a framebuffer
+ driver and to X.Org.
+\begin_layout Standard
+\begin_inset Caption
+\begin_layout Plain Layout
+The new picture of the Linux graphics stack.
+\begin_layout Standard
+VT switches
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Applications communicate with X.Org through a specific library which encapsulates
+ drawing calls.
+\begin_layout Itemize
+The current DRI design has evolved over time in a number of significant
+ steps.
+\begin_layout Itemize
+In a modern stack, all graphics hardware activity is moderated by a kernel
+ module, the DRM.
+\begin_layout Chapter
+Framebuffer Drivers
+\begin_layout Standard
+Framebuffer drivers are the simplest form of graphics drivers under Linux.
+ They are still a relevant option if the only thing you are after is a basic
+ two-dimensional display.
+ Furthermore, when implementing framebuffer acceleration on top of a kernel
+ modesetting DRM driver, the same callbacks need to be filled in.
+ A framebuffer driver implements little functionality, and is therefore
+ extremely easy to create.
+ Such a driver is especially interesting for embedded systems, where memory
+ footprint is essential, or when the intended applications do not require
+ advanced graphics acceleration.
+\begin_layout Standard
+At the core, a framebuffer driver implements the following functionality:
+\begin_layout Itemize
+\begin_layout Itemize
+Basic 2D acceleration (copy, solid fill)
+\begin_layout Standard
+Acceleration is sometimes made available to user space through a hook; user
+ space must then program card-specific bits itself, which requires root
+ privileges.
+\begin_layout Standard
+Framebuffer drivers do not always rely on a specific card model (like nvidiafb/a
+ Drivers on top of VESA, EFI or Open Firmware exist.
+\begin_layout Standard
+\begin_layout Section
+Framebuffer operations
+\begin_layout Standard
+The framebuffer operations structure is how non-modesetting framebuffer
+ callbacks are set.
+ Different callbacks can be set depending on what functionality you wish
+ to implement, like fills, copies, or cursor handling.
+ By filling struct fb_ops callbacks, one can implement the following functions:
+\begin_layout Standard
+/* set color register */
+\begin_layout Standard
+/* set color registers in batch */
+\begin_layout Standard
+/* blank display */
+\begin_layout Standard
+/* pan display */
+\begin_layout Standard
+/* Draws a rectangle */
+\begin_layout Standard
+/* Copy data from one area to another */
+\begin_layout Standard
+/* Draws an image to the display */
+\begin_layout Standard
+/* Draws cursor */
+\begin_layout Standard
+/* Rotates the display */
+\begin_layout Standard
+/* wait for blit idle, optional */
+\begin_layout Standard
+Note that common framebuffer functions (cfb) are available if you do not
+ want to implement everything for your device specifically.
+ These functions are cfb_fillrect, cfb_copyarea and cfb_imageblit and will
+ perform the corresponding function in a generic, unoptimized fashion using
+ the CPU.
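+\begin_layout Standard
+To illustrate the idea of a table of callbacks with a generic CPU fallback,
+ here is a user space sketch.
+ All the types and names (mock_fb, mock_rect, mock_fb_ops, mock_cfb_fillrect)
+ are simplified, hypothetical stand-ins and not the kernel's struct fb_info,
+ struct fb_fillrect, struct fb_ops or cfb_fillrect; a 32 bits per pixel
+ framebuffer is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a framebuffer: 32 bits per pixel. */
struct mock_fb {
    uint32_t *pixels;
    uint32_t width, height;
    uint32_t pitch;          /* pixels per scanline */
};

/* Simplified stand-in for a fill request. */
struct mock_rect {
    uint32_t dx, dy, width, height;
    uint32_t color;
};

/* Table of callbacks, in the spirit of the framebuffer operations. */
struct mock_fb_ops {
    void (*fillrect)(struct mock_fb *fb, const struct mock_rect *rect);
};

/* Generic, unoptimized CPU fill, clipped to the framebuffer bounds.
 * Assumes dx + width and dy + height do not overflow. */
void mock_cfb_fillrect(struct mock_fb *fb, const struct mock_rect *rect)
{
    assert(fb != NULL && fb->pixels != NULL && rect != NULL);
    uint32_t x_end = rect->dx + rect->width;
    uint32_t y_end = rect->dy + rect->height;
    if (x_end > fb->width)
        x_end = fb->width;
    if (y_end > fb->height)
        y_end = fb->height;
    for (uint32_t y = rect->dy; y < y_end; y++)
        for (uint32_t x = rect->dx; x < x_end; x++)
            fb->pixels[(size_t)y * fb->pitch + x] = rect->color;
}

/* A driver without a hardware fill engine points the hook at the fallback. */
const struct mock_fb_ops mock_ops = {
    .fillrect = mock_cfb_fillrect,
};
```

+\begin_layout Standard
+A driver with a fill engine would install its own function in the same slot,
+ so callers never need to know which implementation is used.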
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Framebuffer drivers are the simplest form of Linux graphics driver, requiring
+ little work for implementation.
+\begin_layout Itemize
+Framebuffer drivers deliver a low memory footprint and thus are useful for
+ embedded devices.
+\begin_layout Itemize
+Implementing acceleration is optional as software fallback functions exist.
+\begin_layout Standard
+The use of a kernel module is a requirement in a complex world.
+ The kernel module, or DRM, has multiple purposes:
+\begin_layout Itemize
+Share the rendering hardware between multiple user space components, and
+ arbitrate access.
+\begin_layout Itemize
+Enforce security by preventing applications from performing DMA to arbitrary
+ memory regions, and more generally programming the card in any way that
+ could result in a security hole.
+\begin_layout Itemize
+Manage the memory of the card, by providing video memory allocation functionalit
+y to user space.
+\begin_layout Itemize
+More recently, the DRM was extended to handle modesetting.
+ This simplifies the previous situation, where both the DRM and the framebuffer
+ driver accessed the card, by removing the separate framebuffer driver and
+ implementing its functionality in the DRM.
+\begin_layout Itemize
+Put critical initialization of the card in the kernel, for example uploading
+ firmware or setting up DMA areas.
+\begin_layout Standard
+Kernel module (DRM)
+\begin_layout Standard
+Global DRI/DRM user space/kernel scheme (figure with libdrm - drm - entry
+ points - multiple user space apps)
+\begin_layout Standard
+\begin_layout Plain Layout
+node[mynode] (xorg) {X.Org};
+\begin_layout Plain Layout
+node[mynode, right=0.5cm of xorg] (glapplication) {OpenGL Application};
+\begin_layout Plain Layout
+node[mynode, text width = 6cm, below= of xorg, xshift = 2.2cm] (libdrm) {libdrm};
+\begin_layout Plain Layout
+draw[myarrow] (xorg.south) -> ++(0,-1) (libdrm);
+\begin_layout Plain Layout
+draw[myarrow] (glapplication.south) -> ++(0,-1) (libdrm);
+\begin_layout Plain Layout
+node[mynode, text width = 6cm, below= of libdrm] (drm) {drm};
+\begin_layout Plain Layout
+draw[myarrow] (libdrm.south) -> ++(0,-1) (drm);
+\begin_layout Plain Layout
+node[mynode, text width = 6cm, below= of drm] (hardware) {Graphics Hardware};
+\begin_layout Plain Layout
+draw[myarrow] (drm.south) -> ++(0,-1.0) (hardware);
+\begin_inset Caption
+\begin_layout Plain Layout
+Accessing the DRM through libdrm.
+\begin_layout Standard
+When designing a Linux graphics driver aiming for more than simple framebuffer
+ support, a DRM component is the first thing to write.
+ One should derive a design that is both efficient and enforces security.
+ The DRI/DRM scheme can be implemented in different ways and the interface
+ is indeed entirely card-specific.
+ Do not always follow the existing models that other drivers use, innovate!
+\begin_layout Standard
+Multiplexing of the card command FIFO - For cards which only feature a single
+ hardware command submission FIFO, that FIFO has to be shared between multiple
+ user space components.
+ In that case, the DRM module performs the multiplexing.
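+\begin_layout Standard
+A minimal sketch of such multiplexing, assuming a simple ring buffer with
+ put and get pointers, could look as follows.
+ The cmd_ring structure, the RING_SIZE value and the ring_submit function
+ are hypothetical; real hardware defines its own ring layout, and a real
+ DRM module would also handle locking and waiting for free space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 16u  /* hypothetical ring size in words, power of two */

struct cmd_ring {
    uint32_t words[RING_SIZE];
    uint32_t put;  /* next slot written by the kernel side */
    uint32_t get;  /* next slot consumed by the (simulated) GPU */
};

/* Number of free slots; one slot stays empty to tell full from empty. */
static uint32_t ring_space(const struct cmd_ring *r)
{
    return (r->get - r->put - 1u) & (RING_SIZE - 1u);
}

/* Copy one client's command buffer into the shared ring as a unit.
 * Returns 0 on success, -1 if the ring lacks room for the whole buffer,
 * so buffers from different clients are never interleaved. */
int ring_submit(struct cmd_ring *r, const uint32_t *cmds, uint32_t count)
{
    assert(r != NULL && cmds != NULL);
    if (count > ring_space(r))
        return -1;
    for (uint32_t i = 0; i < count; i++) {
        r->words[r->put] = cmds[i];
        r->put = (r->put + 1u) & (RING_SIZE - 1u);
    }
    return 0;
}
```

+\begin_layout Standard
+Because every submission goes through one function, the kernel side is the
+ single writer of the hardware FIFO, whichever client the commands come from.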
+\begin_layout Standard
+Prevent simultaneous access to the same hardware.
+Prevent arbitrary DMAs to memory.
+ If the hardware does not feature memory protection, you have to check the
+ command stream before submitting it to the GPU.
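+\begin_layout Standard
+As an illustration of command stream checking, the sketch below validates
+ a list of DMA commands against the memory window a client is allowed to
+ touch.
+ The command format (dma_cmd), the mem_window structure and the
+ cmd_stream_is_safe function are assumptions made for this example; an
+ actual checker depends entirely on the command encoding of the card.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command format: a DMA copy into [dst, dst + len). */
struct dma_cmd {
    uint64_t dst;
    uint64_t len;
};

/* Memory window the submitting client is allowed to touch. */
struct mem_window {
    uint64_t base;
    uint64_t size;
};

/* Returns 1 if every command stays inside the window, 0 otherwise.
 * The comparisons are ordered so that dst + len is never computed
 * and therefore cannot overflow. */
int cmd_stream_is_safe(const struct dma_cmd *cmds, size_t count,
                       const struct mem_window *win)
{
    assert(cmds != NULL && win != NULL);
    for (size_t i = 0; i < count; i++) {
        if (cmds[i].dst < win->base)
            return 0;
        uint64_t offset = cmds[i].dst - win->base;
        if (offset > win->size || cmds[i].len > win->size - offset)
            return 0;
    }
    return 1;
}
```

+\begin_layout Standard
+The whole buffer is rejected as soon as one command falls outside the window,
+ which is the conservative choice for a component that must not trust user
+ space.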
+\begin_layout Standard
+Modesetting is the act of setting a display mode on the card.
+ This can range from extremely simple procedures (calling a VGA interrupt
+ or VESA call is a basic form of modesetting) to directly programming the
+ card registers (which brings along the advantage of not needing to rely
+ on a VGA or VESA layer).
+ Historically, this was achieved in user space by the DDX.
+\begin_layout Standard
+However, these days it makes more sense to put it in the kernel once and
+ for all, and share it between different GPU users (framebuffer drivers,
+ DDXes, EGL stacks...).
+ This extension to modesetting is called kernel modesetting (also known
+ as KMS).
+ A number of concepts are used by the modesetting interface (those are inherited
+ from the RandR 1.2 specification).
+The CRTC is in charge of reading the framebuffer memory and routing the data
+ to an encoder.
+The encoder encodes the pixel data for a connector.
+\begin_layout Subsubsection*
+ Notice that connectors can get their data from multiple encoders (for example
+ DVI-I which can feed both analog and digital signals).
+\begin_layout Standard
+Also, on embedded or old hardware, it is common to have encoders and connectors
+ merged for simplicity/power efficiency reasons.
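+\begin_layout Standard
+The relationship between CRTCs, encoders and connectors can be sketched with
+ a few plain C structures.
+ The names below (ms_crtc, ms_encoder, ms_connector and the bitmask fields)
+ are illustrative assumptions and do not reproduce the kernel's structures;
+ the bitmasks simply record which routings the hardware allows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ms_crtc {
    int id;
};

struct ms_encoder {
    int id;
    uint32_t possible_crtcs;    /* bit i set: CRTC index i may feed it */
    struct ms_crtc *crtc;       /* current source, or NULL */
};

struct ms_connector {
    int id;
    uint32_t possible_encoders; /* bit i set: encoder index i may drive it */
    struct ms_encoder *encoder; /* current source, or NULL */
};

/* Route a CRTC to an encoder. Returns 0 on success, -1 if not allowed. */
int ms_encoder_set_crtc(struct ms_encoder *enc, struct ms_crtc *crtc,
                        unsigned crtc_index)
{
    assert(enc != NULL && crtc != NULL);
    if (crtc_index >= 32 ||
        !(enc->possible_crtcs & (UINT32_C(1) << crtc_index)))
        return -1;
    enc->crtc = crtc;
    return 0;
}

/* Route an encoder to a connector. Returns 0 on success, -1 if not allowed. */
int ms_connector_set_encoder(struct ms_connector *conn,
                             struct ms_encoder *enc, unsigned enc_index)
{
    assert(conn != NULL && enc != NULL);
    if (enc_index >= 32 ||
        !(conn->possible_encoders & (UINT32_C(1) << enc_index)))
        return -1;
    conn->encoder = enc;
    return 0;
}

/* Walk connector -> encoder -> CRTC; NULL if the chain is incomplete. */
struct ms_crtc *ms_connector_crtc(const struct ms_connector *conn)
{
    assert(conn != NULL);
    if (conn->encoder == NULL)
        return NULL;
    return conn->encoder->crtc;
}
```

+\begin_layout Standard
+A DVI-I connector would, in this sketch, have two bits set in possible_encoders:
+ one for the analog encoder and one for the digital encoder.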
+\begin_layout Standard
+++ Add a crtc-encoder-connector diagram here
+libdrm is a small (but growing) component that interfaces between user space
+ and the DRM module, and allows calling into the entry points.
+\begin_layout Standard
+Obviously security should not rely on components from libdrm because it
+ is an unprivileged user space component.
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+The DRM manages all graphics activity in a modern Linux graphics stack.
+\begin_layout Itemize
+It is the only trusted piece of the stack and is responsible for security.
+ Therefore it shall not trust the other components.
+\begin_layout Itemize
+It provides basic graphics functionality: modesetting, framebuffer driver,
+ memory management.
+\begin_layout Standard
+This chapter covers the implementation of 2D acceleration inside X.Org.
+\begin_layout Standard
+There are multiple ways to implement a 2D X.Org driver: ShadowFB, XAA, EXA.
+ Another simple way of implementing X.Org support is through the FBDev module.
+ This module implements X.Org on top of an existing, in-kernel framebuffer
+ driver.
+\begin_layout Standard
+\begin_layout Standard
+ShadowFB provides no acceleration proper; instead, a copy of the framebuffer
+ is kept in system memory.
+ The driver implements a single hook that copies graphics from system to
+ video memory.
+ This can be implemented using either a DMA copy, or a CPU copy (depending
+ on the hardware and copy size, either can be better).
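+\begin_layout Standard
+The CPU variant of that copy hook can be sketched as follows.
+ This is an assumption-laden illustration: the damage_rect structure and
+ the shadow_update function are hypothetical names, both surfaces are assumed
+ to be 32 bits per pixel, and the damaged rectangle is assumed to lie inside
+ both surfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rectangle that was modified in the shadow copy, in pixels. */
struct damage_rect {
    uint32_t x, y, w, h;
};

/* Copy a damaged rectangle from the shadow copy (system memory) to the
 * real framebuffer (video memory). The two surfaces may use different
 * pitches, both given in pixels. */
void shadow_update(uint32_t *vram, size_t vram_pitch,
                   const uint32_t *shadow, size_t shadow_pitch,
                   const struct damage_rect *damage)
{
    assert(vram != NULL && shadow != NULL && damage != NULL);
    for (uint32_t row = 0; row < damage->h; row++) {
        size_t y = (size_t)damage->y + row;
        memcpy(vram + y * vram_pitch + damage->x,
               shadow + y * shadow_pitch + damage->x,
               (size_t)damage->w * sizeof(uint32_t));
    }
}
```

+\begin_layout Standard
+Copying only the damaged rows keeps the traffic to video memory proportional
+ to what actually changed on screen.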
+\begin_layout Standard
+Despite the name, ShadowFB is not to be confused with the kernel framebuffer
+ drivers.
+\begin_layout Standard
+Although ShadowFB is a very basic design, it can result in a more efficient
+ and responsive desktop than an incomplete implementation of EXA.
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Standard
+Scanline based acceleration
+\begin_layout Standard
+Offscreen area, same pitch as the screen
+\begin_layout Standard
+Adapted from KAA from Kdrive
+\begin_layout Standard
+Simple interface: Prepare/Act/Finish for each acceleration function
+\begin_layout Standard
+Solid - fill an area with a solid color (RGBA)
+\begin_layout Standard
+Copy - copies a rectangle area from and to video memory
+\begin_layout Standard
+Composite - optional interface used to achieve composite operations like
+ blending.
+ This allows accelerating 2D desktop effects like blending, scaling, operations
+ with masks...
+\begin_layout Standard
+UploadToScreen - copies an area from system memory to video memory
+\begin_layout Standard
+DownloadFromScreen - copies an area from video memory to system memory
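+\begin_layout Standard
+The Prepare/Act/Finish pattern can be illustrated with a small user space
+ mock of the Solid operation.
+ The accel_ops table, the surface and box structures and the sw_* functions
+ below are simplified, hypothetical stand-ins; they are not the EXA driver
+ structure, whose hooks take X server types and more parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct surface {
    uint32_t *pixels;
    int width, height;
};

struct box {
    int x1, y1, x2, y2;   /* x2 and y2 are exclusive */
};

/* Hypothetical driver hooks following the Prepare/Act/Finish pattern. */
struct accel_ops {
    int  (*prepare_solid)(struct surface *dst, uint32_t color);
    void (*solid)(struct surface *dst, int x1, int y1, int x2, int y2);
    void (*done_solid)(struct surface *dst);
};

static uint32_t current_color;   /* state latched by prepare_solid */

static int sw_prepare_solid(struct surface *dst, uint32_t color)
{
    if (dst->pixels == NULL)
        return 0;                /* refuse: caller must fall back */
    current_color = color;
    return 1;
}

static void sw_solid(struct surface *dst, int x1, int y1, int x2, int y2)
{
    for (int y = y1; y < y2; y++)
        for (int x = x1; x < x2; x++)
            dst->pixels[y * dst->width + x] = current_color;
}

static void sw_done_solid(struct surface *dst)
{
    (void)dst;                   /* a real driver might flush here */
}

const struct accel_ops demo_ops = {
    sw_prepare_solid, sw_solid, sw_done_solid
};

/* Prepare once, act once per box, finish once.
 * Returns 1 if the operation was handled, 0 if Prepare refused it. */
int accel_fill_boxes(const struct accel_ops *ops, struct surface *dst,
                     uint32_t color, const struct box *boxes, size_t count)
{
    assert(ops != NULL && dst != NULL && boxes != NULL);
    if (!ops->prepare_solid(dst, color))
        return 0;
    for (size_t i = 0; i < count; i++)
        ops->solid(dst, boxes[i].x1, boxes[i].y1, boxes[i].x2, boxes[i].y2);
    ops->done_solid(dst);
    return 1;
}
```

+\begin_layout Standard
+Splitting the work this way lets the expensive state setup happen once in
+ Prepare, while the Act step is called repeatedly with only coordinates.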
+\begin_layout Standard
+The pixmap migration problem
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Multiple choices exist for accelerating 2D in X.Org.
+\begin_layout Itemize
+The most efficient one is EXA, which puts all the smart optimizations in
+ a common piece of code, and leaves the driver implementation very simple.
+\begin_layout Itemize
+If your card cannot accelerate 2D operations, ShadowFB is probably the path
+ to take.
+Two typical video pipelines: MPEG-2 and H.264
+iDCT -> MC -> CSC -> Final display
+entropy decoding -> iDCT -> MC -> CSC -> Final display
+Entropy encoding is a lossless compression phase.
+ It is the last stage of encoding and therefore also the first stage of
+ decoding.
+\begin_layout Standard
+Color spaces
+\begin_layout Standard
+Linear relation
+\begin_layout Standard
+Conversion matrices
+\begin_layout Standard
+The YUV color space: 1 component luminance (Y) + 2 components chrominance
+ (UV).
+ Chrominance information is less relevant to the eye than luminance, so
+ usually chrominance is subsampled and luminance is kept at the original
+ resolution.
+ Therefore, the Y plane usually has a higher resolution than the U and V
+ planes.
+\begin_layout Standard
+Bandwidth gain (RGBA32 vs YV12)
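+\begin_layout Standard
+As a worked example of this gain, the sketch below computes the size of one
+ frame in both formats.
+ It assumes tightly packed planes with no pitch alignment, which real hardware
+ often does not guarantee; the function names are illustrative.
+ RGBA32 uses 32 bits per pixel, while YV12 averages 12 bits per pixel: 8
+ for Y plus 2 each for U and V, which are subsampled by two in both directions.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one RGBA32 frame: 4 bytes per pixel. */
size_t rgba32_frame_bytes(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    return width * height * 4;
}

/* Bytes for one YV12 frame: full-resolution Y plane plus U and V planes
 * subsampled by two in each direction (dimensions rounded up). */
size_t yv12_frame_bytes(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    size_t chroma_w = (width + 1) / 2;
    size_t chroma_h = (height + 1) / 2;
    return width * height + 2 * chroma_w * chroma_h;
}
```

+\begin_layout Standard
+For a 1920x1080 frame this gives 8294400 bytes in RGBA32 against 3110400
+ bytes in YV12, that is, 8/3 times less data to transfer.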
+\begin_layout Standard
+YUV planar and packed (interleaved) formats
+\begin_layout Standard
+Plane order (YV12 vs NV12)
+\begin_layout Standard
+Order of the planes (YV12, I420)
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:YUV-to-RGB"
+ shows the conversion matrices from ITU-R Recommendation BT.601 (standard
+ definition content) and Recommendation BT.709 (intended for HD content).
+ Notice that although these matrices are very similar, there are numerical
+ differences which will result in slightly off-colored rendering if one is
+ used in place of the other.
+ It is indeed often the case that video decoders with YUV to RGB hardware
+ are used to play back high definition content while no attention is paid to
+ the proper conversion matrix that should be used.
+ Since the colors are only slightly wrong, this problem is commonly overlooked,
+ even though most hardware features at least a BT601/BT709 switch, or a fully
+ programmable conversion matrix.
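+\begin_layout Standard
+To make the difference concrete, here is a sketch of the conversion with
+ a selectable matrix.
+ It assumes the full-range form of the equations, with Y in [0,1] and Cb,
+ Cr centered on zero; studio-range video additionally needs an offset and
+ a scale, which are left out.
+ The coefficients are derived from the luma weights of each recommendation
+ (Kr=0.299, Kb=0.114 for BT.601; Kr=0.2126, Kb=0.0722 for BT.709) and may
+ be normalized differently from the matrices in the figure.
+ The structure and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Non-trivial coefficients of the YCbCr -> RGB matrix. */
struct csc_matrix {
    double cr_to_r;
    double cb_to_g;
    double cr_to_g;
    double cb_to_b;
};

const struct csc_matrix CSC_BT601 = { 1.402,  -0.344136, -0.714136, 1.772  };
const struct csc_matrix CSC_BT709 = { 1.5748, -0.187324, -0.468124, 1.8556 };

struct rgb {
    double r, g, b;
};

/* y in [0,1], cb and cr in [-0.5,0.5]; the result is not clamped. */
struct rgb yuv_to_rgb(const struct csc_matrix *m,
                      double y, double cb, double cr)
{
    assert(m != NULL);
    struct rgb out;
    out.r = y + m->cr_to_r * cr;
    out.g = y + m->cb_to_g * cb + m->cr_to_g * cr;
    out.b = y + m->cb_to_b * cb;
    return out;
}
```

+\begin_layout Standard
+With y=0.5, cb=0 and cr=0.25, the red channel comes out as 0.8505 with the
+ BT.601 coefficients and 0.8937 with the BT.709 ones: close enough to go
+ unnoticed at first glance, yet visibly different side by side.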
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+Pixel scaling
+\begin_layout Standard
+Since the conversion from YUV space to RGB space is linear, filtered scaling
+ can be done either in the YUV or RGB space, which conveniently allows using
+ texture filtering which is available on 3D hardware to sample the YUV data.
+ This allows a single pass color space conversion and scaling.
+ For example, bi-linear filtering will work just fine with three textures
+ for the three Y, U and V planes.
+ Notice that higher quality can be obtained at the expense of performance
+ by using better filtering modes, such as bi-cubic [cite Hadwiger paper].
+ A trade-off can be achieved by implementing bi-cubic filtering for the
+ (most eye-visible) Y plane, and keeping bi-linear filtering for U and V
+ planes.
+\begin_layout Standard
+If the hardware cannot achieve color space conversion and scaling at the
+ same time (for example if you have a YUV->RGB blitter and a shader-less
+ 3D engine), again the linear color conversion allows you to do the scaling
+ in RGB space, and this will produce the same results (barring gamma correction).
+\begin_layout Standard
+Xv is simply about CSC and scaling.
+ In order to implement Xv, a typical X.Org driver will have to implement
+ this color space conversion.
+ Although the Xv API is a little complex for what it implements, the gist
+ of it consists of the PutImage function, which puts a YUV image on screen.
+ Multiple YUV formats can be handled, mainly planar or interleaved.
+ Note that Xv has RGB support as well.
+ Thanks to the bandwidth gains and DMA transfers, even an Xv-only implementation
+ already provides a relevant level of video decoding acceleration, and can
+ prove sufficient depending on the target hardware (for example, it can
+ prove to be fine when coupled with a powerful CPU to decode H.264 content).
+idct + mc +csc
+VAAPI was initially created for Intel's Poulsbo video decoding.
+ The API is very tailored to embedded platforms and has many entry points,
+ at different pipeline stages, which makes it more complex to implement.
+VDPAU was initiated by Nvidia for H.264 and VC-1 decoding support.
+All 3 APIs are intended for full
+\begin_layout Plain Layout
+\begin_layout Itemize
+A video decoding pipeline consists of multiple stages chained together.
+\begin_layout Itemize
+Color space conversion and scaling is the most important stage, and if your
+ driver implements only one operation for simplicity, this is it.
+\begin_layout Itemize
+Implementing a full pipeline can provide a high performance boost, and save
+ battery life on mobile systems.
+\begin_layout Standard
+OpenGL ARB, khronos, bla bla...
+\begin_layout Section
+vertex stage
+\begin_layout Standard
+vertex buffers
+\begin_layout Standard
+Render buffers
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+OpenGL is a suite of stages arranged in a pipeline.
+\begin_layout Standard
+Mesa is the Common Rendering Architecture for all open source graphics drivers.
+\begin_layout Section
+\begin_layout Standard
+Mesa serves two major purposes:
+\begin_layout Itemize
+Mesa is a software implementation of OpenGL.
+ It is considered to be the reference implementation and is useful in checking
+ conformance, seeing that the official OpenGL conformance tests are not
+ publicly available.
+\begin_layout Itemize
+Mesa provides the OpenGL entry points for Open Source graphics drivers under
+ Linux.
+\begin_layout Standard
+In this section, we will focus on the second point.
+\begin_layout Section
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Mesa is the reference OpenGL implementation under Linux.
+\begin_layout Itemize
+All Open Source graphics drivers use Mesa for 3D
+\begin_layout Standard
+Gallium 3D is the Future of 3D Acceleration.
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Section
+Gallium3D: a plan for a new generation of hardware
+\begin_layout Standard
+Ten years ago, GPUs were a direct match with all the OpenGL or Direct3D
+ functionality; back then the GPUs had specific transistors dedicated to
+ each piece of functionality.
+ With the explosion in the amount of 3D functionality, this quickly became
+ impractical both for application developers (who saw the 3D APIs growing
+ huge) and hardware designers (who faced an explosion in the number of specific
+ functions a GPU needed), and shaders were created.
+ Instead of providing specific functionality, the 3D APIs would now let
+ the programmers create these little programs and run them on the GPU.
+ As the hardware was now programmable in a way which was a superset of fixed
+ functionality, the fixed function pipelines were not required any more
+ and were removed from the cards.
+ Gallium 3D is modeled around the simple observation that today's GPUs no
+ longer have a fixed pipeline and only feature shaders, but drivers still
+ have to
+\begin_inset Quotes eld
+\begin_inset Quotes erd
+ fixed function on top of the shaders to provide API compatibility.
+ Doing so in every driver would require a lot of code duplication, and the
+ Gallium model is to put this code in a common place.
+ Therefore Gallium drivers become smaller and easier to write and to maintain.
+\begin_layout Standard
+everything is a shader, including inside the driver
+\begin_layout Standard
+thin layer for fixed pipe -> programmable functionality translation
+\begin_layout Standard
+\begin_layout Standard
+A state tracker implements an API (for example OpenGL, OpenVG, Direct3D...)
+ by turning it into API-agnostic and hardware-agnostic TGSI calls.
+A pipe driver is the main part of a hardware-specific driver.
+The winsys is in charge of talking to the OS/Platform of choice.
+ The pipe driver relies on the Winsys to talk to the hardware.
+ For example, this allows having a single pipe driver with multiple winsyses
+ targeting different operating systems.
+\begin_layout Section
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Section
+In order to operate shaders, Gallium features an internal shader description
+ language which uses 4-component vectors.
+ We will later refer to the 4 components of a vector as x,y,z,w.
+ In particular, v.x is the first component of vector v, v.xyzw are all 4 component
+s of v in that order, and swizzling is allowed, for example v.wzyx reverses
+ the component order.
+ It is also legal to replicate a component, for example v.xxxx means four
+ times the x component of v and v.yyzz means two times y and two times z.
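+\begin_layout Standard
+The swizzling rules above can be sketched in a few lines of C.
+ This is not Gallium code: the vec4 structure and the vec4_swizzle function
+ are hypothetical helpers that only mimic the notation, taking the swizzle
+ as a four-letter string.

```c
#include <assert.h>
#include <stddef.h>

struct vec4 {
    float c[4];   /* c[0]=x, c[1]=y, c[2]=z, c[3]=w */
};

/* Map a component letter to its index, or -1 if it is not x, y, z or w. */
static int component_index(char name)
{
    switch (name) {
    case 'x': return 0;
    case 'y': return 1;
    case 'z': return 2;
    case 'w': return 3;
    default:  return -1;
    }
}

/* Apply a four-letter swizzle such as "wzyx" or "xxxx" to v.
 * Returns 0 on success, -1 if the pattern is malformed.
 * out may alias v, since the result is built in a temporary. */
int vec4_swizzle(const struct vec4 *v, const char *pattern, struct vec4 *out)
{
    assert(v != NULL && pattern != NULL && out != NULL);
    struct vec4 result;
    for (int i = 0; i < 4; i++) {
        int src = (pattern[i] != '\0') ? component_index(pattern[i]) : -1;
        if (src < 0)
            return -1;
        result.c[i] = v->c[src];
    }
    if (pattern[4] != '\0')
        return -1;
    *out = result;
    return 0;
}
```

+\begin_layout Standard
+With v = (1, 2, 3, 4), the pattern wzyx yields (4, 3, 2, 1) and yyzz yields
+ (2, 2, 3, 3).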
+\begin_layout Standard
+These components usually carry no semantics, and despite their name they
+ can very well carry a color or an opacity value indifferently.
+\begin_layout Standard
+TGSI instruction set
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Gallium 3D is the new graphics API.
+\begin_layout Itemize
+Everything is converted to a shader internally, fixed functionality is gone.
+\begin_layout Itemize
+Drivers are simpler than classic Mesa drivers, as one only has to implement
+ shaders to get all fixed functionality to work.
+\begin_layout Standard
+VT switches
+\begin_layout Standard
+Card state
+\begin_layout Standard
+Suspend/resume hooks in the DRM
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Suspend and resume has long been very clumsy, but this is solved now thanks
+ to the DRM implementing more functionality.
+\begin_layout Standard
+Technical specifications are the nuts and bolts of graphics driver work.
+ Without hardware specifications, no work can be started.
+ However, manufacturing companies are usually wary of sharing said specification
+s, as they think this will hinder their business.
+ While this claim is false (because you can't copy a GPU from just its specifica
+tions), it is still very widespread and prevents a lot of hardware from
+ being properly documented.
+ Therefore, getting hold of hardware specifications will be the first major
+ step in any graphics driver development project.
+Public specifications
+\begin_layout Standard
+Some vendors distribute the technical documentation for their hardware publicly
+ without restrictions.
+\begin_layout Standard
+Sometimes, things can be as simple as asking the vendor, who might share
+ the documentation (possibly under NDA, see below).
+Put simply, an NDA is a contract signed between the developer and the hardware
+ company, by which the developer agrees not to disclose the documentation
+ received.
+ However, an NDA can contain further restrictions.
+\begin_layout Standard
+Terms of the NDA
+\begin_layout Standard
+Before signing an NDA, think.
+ Whatever lawyers say, there is no such thing as a
+\begin_inset Quotes eld
+\begin_inset Quotes erd
+ NDA, you can always negotiate.
+\begin_layout Standard
+Can Open Source drivers be written from that documentation under that NDA?
+\begin_layout Standard
+What happens when the NDA expires? Can code still be free, are you bound
+ by any clause?
+\begin_layout Standard
+What about yourself? Are you prevented from doing further work on this hardware?
+When specifications are not easily available or just incomplete, an alternate
+ route is reverse engineering.
+ Reverse engineering consists of figuring out the specifications for a given
+ piece of hardware by yourself, for example by looking at what a black-box
+ binary driver does to the hardware under certain circumstances.
+\begin_layout Standard
+Reverse engineering is not just a tool to obtain missing hardware specifications
+, it is also a strong means of Open Source advocacy.
+ Once a reverse engineered driver exists and ships in Linux distributions,
+ pressure shifts onto the hardware vendor for support.
+ This, in turn, can force the vendor to support Open Source drivers.
+\begin_layout Standard
+Reverse engineering is not as difficult as it seems, but it requires organization
+ and rigor.
+ Write down all bits of information (even incomplete bits), share them among
+ developers, and try to work out bits one by one.
+ Do not hesitate to write ad-hoc tools, as they will save precious time down
+ the road (if you hesitate, you have crossed the line already!).
+The basic idea behind mmio-trace is simple: it first hooks the ioremap call,
+ and therefore prevents mapping of a designated I/O area.
+ Subsequently, accesses to this area will generate page faults, which are
+ caught by the kernel.
+ For each page fault, the faulting instruction is decoded to figure out
+ the write or read address, along with the value written/read.
+ The page is put back, the faulting instruction is then single-stepped,
+ and the page is then removed again.
+ Execution then continues as usual.
+\begin_layout Standard
+mmio-trace is now part of the official Linux kernel.
+ Therefore, any pre-existing driver can be traced.
+libsegfault is similar to mmio-trace in the way it works: after removing
+ some pages to which one wants to track accesses, it will generate a segmentation
+ fault on each access and therefore be able to report each access.
+ The difference is that libsegfault is a user space tool while mmio-trace
+ is a kernel tool.
+Valgrind is a dynamic recompiling and instrumentation framework.
+ Valgrind-mmt is a plugin for Valgrind which implements tracing of reads
+ and writes to a certain range of memory addresses, usually an mmio range
+ accessed from user space.
+ Memory accesses are dynamically instrumented thanks to valgrind and each
+ access to the zones we want to see traced is logged.
+Finally, one last pre-existing tool to help reverse engineering is virtualization.
+ By running a proprietary driver in a controlled environment, one can figure
+ out the inner workings of a GPU.
+ The plan is then to write an emulated GPU while doing the reverse engineering
+ (which imposes the use of an open source virtualization solution like Qemu).
+In addition to these generic tools, you will often find it useful to implement
+ your own additional tools, tailored for specific needs.
+ Renouveau is an example of such a tool that integrates the reverse engineering
+ mechanisms, the command decoding and printing.
+ In order to achieve decoding of the commands, it carries a database of
+ the graphics commands of Nvidia GPUs.
+ This allows quick testing of new database entries.
+ Headers generated from this database are later used in the driver development
+ process.
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Technical specifications are of course very important for authoring graphics
+ drivers.
+\begin_layout Itemize
+NDAs can have unforeseen implications on yourself and your work.
+\begin_layout Itemize
+When they are unavailable, incomplete or just plain wrong, reverse engineering
+ can help you figure out how the hardware actually works.
+\begin_layout Standard
+The official OpenGL testing suite is not publicly available: access requires
+ a (paid) Khronos membership.
+ Instead, most developers use alternate sources for test programs.
+\begin_layout Standard
+gdb needs to run in a terminal emulator, while the application being debugged
+ might be stopped with a lock held.
+ That might result in a deadlock between the application, stuck holding a lock,
+ and gdb, waiting to be able to output text.
+\begin_layout Standard
+printk debug
+\begin_layout Standard
+crash (a gdb front end for analyzing vmcore dumps)
+\begin_layout Standard
+\begin_layout Standard
+serial console
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+Submitting your code for inclusion in the official trees is an important
+ part of the graphics driver development process under Linux.
+ There are multiple motivations for doing this.
+\begin_layout Standard
+First, this allows end users to get hold of your driver more easily.
+\begin_layout Standard
+Second, this makes it easier for your driver maintenance in the future:
+ in the event of interface changes,
+\begin_layout Standard
+Why upstream?
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Standard
+\begin_layout Plain Layout
+\begin_layout Itemize
+Thoroughly testing all your changes can save you the cost of bisection later
+ on.
+\begin_layout Itemize
+Debugging is not easy for graphics drivers.
+\begin_layout Itemize
+By upstreaming your code in official repositories, you save yourself the
+ burden of adapting it to ever-moving programming interfaces in X.Org, Mesa
+ and the kernel.
+\begin_layout Standard
