Date:      Sun, 11 May 2014 16:51:23 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r265864 - head/sys/dev/vt/hw/ofwfb
Message-ID:  <20140511133517.N1100@besplex.bde.org>
In-Reply-To: <201405110158.s4B1wvFA072381@svn.freebsd.org>
References:  <201405110158.s4B1wvFA072381@svn.freebsd.org>

On Sun, 11 May 2014, Nathan Whitehorn wrote:

> Log:
>  Make ofwfb not be painfully slow. This reduces the time for a verbose boot
>  on my G4 iBook by more than half. Still 10% slower than syscons, but that's
>  much better than a factor of 2.
>
>  The slowness had to do with pathological write performance on 8-bit
>  framebuffers, which are almost universally used on Open Firmware systems.
>  Writing 1 byte at a time, potentially nonconsecutively, resulted in many
>  extra PCI write cycles. This patch, in the common case where it's writing
>  one or several characters in an 8x8 font, gangs the writes together into
>  a set of 32-bit writes. This is a port of r143830 to vt(4).

Only 10% slower?  Bitmapped mode with 256 colors is inherently 4 times
slower for an 8x8 font (8 bytes/char instead of 2) and 8 times slower for
an 8x16 font.  That's without any I/O pathology.  Perhaps you are comparing
with a syscons that is already very slow due to the hardware not supporting
text mode.

However, syscons has buffering that should limit this problem.

>  The EFI framebuffer is also extremely slow, probably for the same reason,
>  and the same patch will likely help there.
>
> Modified:
>  head/sys/dev/vt/hw/ofwfb/ofwfb.c
>
> Modified: head/sys/dev/vt/hw/ofwfb/ofwfb.c
> ==============================================================================
> --- head/sys/dev/vt/hw/ofwfb/ofwfb.c	Sun May 11 01:44:11 2014	(r265863)
> +++ head/sys/dev/vt/hw/ofwfb/ofwfb.c	Sun May 11 01:58:56 2014	(r265864)
> @@ -136,6 +136,10 @@ ofwfb_bitbltchr(struct vt_device *vd, co
> 	uint32_t fgc, bgc;
> 	int c;
> 	uint8_t b, m;
> +	union {
> +		uint32_t l;
> +		uint8_t	 c[4];
> +	} ch1, ch2;
>
> 	fgc = sc->sc_colormap[fg];
> 	bgc = sc->sc_colormap[bg];
> @@ -147,36 +151,70 @@ ofwfb_bitbltchr(struct vt_device *vd, co
> 		return;
>
> 	line = (sc->sc_stride * top) + left * sc->sc_depth/8;
> -	for (; height > 0; height--) {
> -		for (c = 0; c < width; c++) {
> -			if (c % 8 == 0)
> +	if (mask == NULL && sc->sc_depth == 8 && (width % 8 == 0)) {
> +		for (; height > 0; height--) {
> +			for (c = 0; c < width; c += 8) {
> 				b = *src++;
> -			else
> -				b <<= 1;
> -			if (mask != NULL) {
> +

Style bug (extra newline).

> +				/*
> +				 * Assume that there is more background than
> +				 * foreground in characters and init accordingly
> +				 */
> +				ch1.l = ch2.l = (bg << 24) | (bg << 16) |
> +				    (bg << 8) | bg;
> +
> +				/*
> +				 * Calculate 2 x 4-chars at a time, and then
> +				 * write these out.
> +				 */
> +				if (b & 0x80) ch1.c[0] = fg;
> +				if (b & 0x40) ch1.c[1] = fg;
> +				if (b & 0x20) ch1.c[2] = fg;
> +				if (b & 0x10) ch1.c[3] = fg;
> +
> +				if (b & 0x08) ch2.c[0] = fg;
> +				if (b & 0x04) ch2.c[1] = fg;
> +				if (b & 0x02) ch2.c[2] = fg;
> +				if (b & 0x01) ch2.c[3] = fg;

Style bugs (missing newlines).

> +
> +				*(uint32_t *)(sc->sc_addr + line + c) = ch1.l;
> +				*(uint32_t *)(sc->sc_addr + line + c + 4) =
> +				    ch2.l;
> +			}
> +			line += sc->sc_stride;
> +		}
> +	} else {
> +		for (; height > 0; height--) {
> +			for (c = 0; c < width; c++) {
> 				if (c % 8 == 0)
> -					m = *mask++;
> +					b = *src++;
> 				else
> -					m <<= 1;
> -				/* Skip pixel write, if mask has no bit set. */
> -				if ((m & 0x80) == 0)
> -					continue;
> -			}
> -			switch(sc->sc_depth) {
> -			case 8:
> -				*(uint8_t *)(sc->sc_addr + line + c) =
> -				    b & 0x80 ? fg : bg;
> -				break;
> -			case 32:
> -				*(uint32_t *)(sc->sc_addr + line + 4*c) =
> -				    (b & 0x80) ? fgc : bgc;
> -				break;
> -			default:
> -				/* panic? */
> -				break;
> +					b <<= 1;
> +				if (mask != NULL) {
> +					if (c % 8 == 0)
> +						m = *mask++;
> +					else
> +						m <<= 1;
> +					/* Skip pixel write, if mask not set. */
> +					if ((m & 0x80) == 0)
> +						continue;
> +				}
> +				switch(sc->sc_depth) {
> +				case 8:
> +					*(uint8_t *)(sc->sc_addr + line + c) =
> +					    b & 0x80 ? fg : bg;
> +					break;
> +				case 32:
> +					*(uint32_t *)(sc->sc_addr + line + 4*c)
> +					    = (b & 0x80) ? fgc : bgc;
> +					break;
> +				default:
> +					/* panic? */
> +					break;
> +				}
> 			}
> +			line += sc->sc_stride;
> 		}
> -		line += sc->sc_stride;
> 	}
> }

A correctly-implemented console driver doesn't have itty-bitty hardware
i/o like the old version of this or itty-bitty buffering like the changed
version.
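
For reference, the ganging technique in the patch above can be sketched in
isolation.  This is only an illustration: expand_row8() and put_row8() are
hypothetical names that do not appear in ofwfb.c, the destination is assumed
to be 4-byte aligned, and memcpy() into temporaries replaces the union so
that the byte order in memory does not depend on endianness.

```c
#include <stdint.h>
#include <string.h>

/*
 * Expand one 8-pixel glyph row (1 bit per pixel, MSB = leftmost pixel)
 * into 8 one-byte pixels.  Hypothetical helper, not taken from ofwfb.c.
 */
static void
expand_row8(uint8_t bits, uint8_t fg, uint8_t bg, uint8_t out[8])
{
	int i;

	/* Start with all background, as the patch does. */
	memset(out, bg, 8);
	for (i = 0; i < 8; i++)
		if (bits & (0x80 >> i))
			out[i] = fg;
}

/*
 * Store the 8 expanded pixels as two 32-bit writes instead of eight
 * 8-bit writes.  dst is assumed to be 4-byte aligned.
 */
static void
put_row8(volatile uint32_t *dst, uint8_t bits, uint8_t fg, uint8_t bg)
{
	uint8_t px[8];
	uint32_t w0, w1;

	expand_row8(bits, fg, bg, px);
	/* memcpy() keeps the in-memory pixel order on any endianness. */
	memcpy(&w0, &px[0], 4);
	memcpy(&w1, &px[4], 4);
	dst[0] = w0;
	dst[1] = w1;
}
```

Calling put_row8(fb, 0xA5, 7, 1) on a two-word buffer leaves the bytes
7,1,7,1,1,7,1,7 in memory.  This still touches the frame buffer once per
glyph row, so it is the "itty-bitty buffering" case, not full buffering.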

I thought that syscons always had correct buffering.  Actually, it
uses a hybrid scheme where, at least in text mode, the initial i/o is
itty-bitty 1 character+attribute at a time (16-bit i/o), but scrolling
and screen refresh are done by bcopy(), bcopy_io(), bcopy_fromio() and
bcopy_toio() and a couple of other functions (bzero_io(), fill*())
from/to a properly cached buffer in normal memory.  It used to use
only bcopy() and a couple of others (bzero(), fill*()), so it
automatically did 64-bit i/o's on 64-bit systems, except for fillw*()
which was intentionally 16 bits for compatibility (but it didn't use
bcopy() which is needed for even more compatibility).  It is unclear
which old systems break with frame buffer i/o's larger (or smaller)
than 16 bits.  I never had any (x86) hardware that didn't work with any
size.  The video card might be 16-bit only, but then it should just
tell the CPU this so that the CPU reduces to 16 bits using standard
x86 mechanisms.  Video cards have been PCI or better for about 20
years.  PCI should support precisely 32-bits, but 64-bit frame buffer
accesses to PCI and AGP video cards always worked for me.

bcopy*io() is more technically correct, but is very badly implemented
and much slower than bcopy() on most systems.  Its misimplementation
includes not even using bus-space on x86.  All bcopy*io() functions
use copyw() on x86, and copyw() is just a dumb 16-bit memcpy() written
in C.  Writing it in C doesn't lose anything when it is used for a
slow i/o memory, but doing 16-bit i/o's does.  And doing 16-bit i/o's
doesn't even give compatibility, since bzero_io() is just bzero() on
x86, so it always does wider i/o's.  syscons has always used fillw*()
and never plain fill() since it doesn't want the corresponding 32-bit
writes that might be given by fill().  fill() actually does 8-bit
writes.  fb also uses the badly named and implemented filll_io().
This doesn't actually support longs, but only u_int32_t.  fill_io()
is at least ifdefed on ${ARCH}, so its access size is not completely
hard-coded.  On arm and mips, all the ifdefed "io" functions except
fill_io use plain memcpy() or memset() so they get a maximum access
size and minimum hardware compatibility.  fillw() is 16 bits on these
arches since the access size is hard-coded in the API (and conversion
to memset() is not done).
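
To make the access-size point concrete, here is a minimal sketch comparing a
16-bit-only copy loop with a 32-bit one.  copy16() and copy32() are
hypothetical stand-ins written for this illustration (they are not the
kernel's copyw() or bcopy()), and each returns the number of stores it
issued so that the difference can be counted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical 16-bit-only copy loop, in the spirit of a "dumb 16-bit
 * memcpy() written in C".  Returns the number of stores issued.
 */
static size_t
copy16(volatile uint16_t *dst, const uint16_t *src, size_t nbytes)
{
	size_t i, n;

	n = nbytes / sizeof(uint16_t);
	for (i = 0; i < n; i++)
		dst[i] = src[i];
	return (n);
}

/* The same loop with 32-bit accesses: half as many stores. */
static size_t
copy32(volatile uint32_t *dst, const uint32_t *src, size_t nbytes)
{
	size_t i, n;

	n = nbytes / sizeof(uint32_t);
	for (i = 0; i < n; i++)
		dst[i] = src[i];
	return (n);
}
```

For one 80x25 text screen (4000 bytes), copy16() issues 2000 stores and
copy32() issues 1000.  When each store is a slow bus cycle, that count is
roughly what determines the copy time, whatever language the loop is
written in.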

Pessimizations in syscons have made it about twice as slow as in FreeBSD-5.
This is probably mostly due to switching from bcopy() to copyw().  There
is a lot of bloat in upper layers, but with 2GHz CPUs it would take a
factor of about 10 pessimizations there to be comparable with i/o
pessimizations.

A correctly-implemented console driver assembles an image of the frame
buffer in fast memory and copies from there to the frame buffer in
large chunks.  It is tricky to keep track of changed regions so as to
not copy unchanged regions.  Copying everything at a refresh rate of
not much slower than 20 Hz works well.  200 Hz for animation, but that
is rarely needed.  The bandwidth for 80x25 text mode at 20 Hz is 80 kB/
second.  That was easy in 1982.  I aimed for 100 Hz refresh on 2 MHz
6809 systems in 1987.  PC hardware at 5 MHz was about twice as slow,
especially for frame buffers.  But it could do 80 kB/second.  The
bandwidth for 80x25 8x16 256 color bitmapped mode is 640kB/second.
This was difficult in 1982, but very easy now.  Yet the WindowsXP
safe mode with command prompt console is about as slow at scrolling
as a 1982 system in graphics mode.  It uses similar techniques to
implement the slowness:
- a large bitmapped screen.  640x200 8 colors in 1982.  Quite
   a bit larger (something like 1024x768 256 colors) in 20XX.
- write to the screen very slowly.  Use 8-bit writes with i/o artifacts
   if possible.  The 1982 system had to do 8-bit writes to 3 color planes.
   256-color mode is simpler than most.  Writes can also be done very
   slowly by using another mode and misaligning text so that every
   character written needs merging with pixels from adjacent characters.
- do scrolling in software by copying 1 pixel at a time, using read-modify-
   write
- I only tested this on 5-10 year old hardware, with a 1920x1080 screen
   but not all of it used for the console window, and with a laptop
   1024x768 screen.  A good way to be slow, one that has been portable to
   PC systems since 1982, is to use the BIOS for video.  The console was
   about twice as fast on the laptop.  This might be due to a combination
   of fewer pixels and a less well pessimized BIOS.
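
The buffering scheme described before this list (assemble the image in fast
memory, track what changed, copy to the frame buffer in large chunks) can be
sketched as follows.  This is a minimal illustration under stated
assumptions: an 80x25 text screen with 16-bit cells, per-row dirty flags as
the change-tracking granularity, and a plain array standing in for the
hardware frame buffer.  struct shadow, shadow_putc() and shadow_flush() are
hypothetical names, not syscons or vt(4) interfaces.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ROWS 25
#define COLS 80

struct shadow {
	uint16_t cells[ROWS][COLS];	/* image kept in normal memory */
	bool dirty[ROWS];		/* rows changed since last flush */
};

/* Write one character+attribute cell; touches only normal memory. */
static void
shadow_putc(struct shadow *sh, int row, int col, uint16_t cell)
{
	if (sh->cells[row][col] != cell) {
		sh->cells[row][col] = cell;
		sh->dirty[row] = true;
	}
}

/*
 * Copy every run of adjacent dirty rows to the frame buffer with one
 * memcpy() per run.  Returns the number of bytes copied.  fb is a plain
 * array here; real hardware would need the platform's bus accessors.
 */
static size_t
shadow_flush(struct shadow *sh, uint16_t fb[ROWS][COLS])
{
	size_t copied, len;
	int r, start;

	copied = 0;
	r = 0;
	while (r < ROWS) {
		if (!sh->dirty[r]) {
			r++;
			continue;
		}
		start = r;
		while (r < ROWS && sh->dirty[r])
			sh->dirty[r++] = false;
		len = (size_t)(r - start) * sizeof(sh->cells[0]);
		memcpy(fb[start], sh->cells[start], len);
		copied += len;
	}
	return (copied);
}
```

A caller would invoke shadow_flush() from a timer at the chosen refresh
rate.  Any number of shadow_putc() calls between two flushes costs at most
one copy of each changed row.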

Some old screen benchmarks.  The benchmark is basically to write lines
of the screen width and scroll.  I stopped updating this often about 15
years ago when frame buffers and CPUs became fast enough.  But it appears
that software bloat and design errors have caught up.

% ISA ET4000: 2.4MB/sec read, 5.9MB/sec write
% VLB ET4000/W32i: 6.8MB/sec read, 25.5MB/sec write
% PCI S3/868: 3.5MB/sec read, 23.1MB/sec write
% PCI S3/Virge: 4.1MB/sec read, 40.0MB/sec write
% PCI S3/Savage: 3.3MB/sec read, 25.8MB/sec write
% PCI Xpert: 5.3MB/sec read, 21.8MB/sec write
% PCI R9200SE: 5.8MB/sec read, 60.2MB/sec write (but 120MB/sec fpu, 250/sec sfpu)
% -o means stty flag -opost
% 
% No-scroll:

Scrolling is avoided by repositioning the cursor after every screenful.

% 
% machine     video        O/S              where      real   user  sys    speed
% ---------   -------      --------------   ---------  -----  ----  -----  -----
% A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen-o  .026  0.00   .026 76.9
% A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen-o .026  0.00   .026 76.9
% A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen    .031  0.00   .031 64.5
% A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen   .031  0.00   .031 64.5

An 11 year old system.

'onscreen' means output to an active vty, 'offscreen' to an inactive vty.
The mere existence of vtys requires full buffering to fast memory for
inactive vtys, since there is no hardware frame buffer memory to write
to for the inactive vtys.  You have to buffer the writes in a form that
can be replayed when an inactive vty becomes active, and converting
immediately to the final form is a good method (it does take more memory
and limits history to a raw form).  'offscreen' is potentially much faster,
but in most cases it is only slightly faster, due to delayed refreshes
for 'onscreen' and relatively fast frame buffer memory.

-opost is tested separately because the Linux console driver was amazingly
slow without it.  This shows that it is possible for the software bloat
to be so large that it dominates hardware slowness.  FreeBSD also has
lots of bloat in the tty and syscons layers near opost, but it is in the
noise compared with the old Linux console driver.

I forget the units for these measurements, except that the speed column
gives a bandwidth in MB/sec (I don't remember if this is for write(2)
bandwidth or is related to frame buffer bandwidth).  Interpret them as
relative.

On a system similar to the above, syscons scrolls at 50000 lines/sec.
Non-virtually, this would require a frame buffer bandwidth of 200MB/sec,
which is several times faster than possible.  Since syscons only does
a direct update for bytes written, it needs only about 1/25 of this
bandwidth or 8MB/sec.  This is not quite in the noise compared with
a frame buffer bandwidth of 60.2MB/sec.
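
The arithmetic behind these figures can be written out as a small sketch.
It assumes an 80x25 text screen with 2 bytes per cell (character plus
attribute); the function names are made up for this illustration.

```c
#include <stdint.h>

/* Bytes in one full text screen. */
static uint64_t
screen_bytes(uint64_t cols, uint64_t rows, uint64_t bytes_per_cell)
{
	return (cols * rows * bytes_per_cell);
}

/* Bytes/sec if every scrolled line rewrote the whole screen. */
static uint64_t
full_redraw_bw(uint64_t lines_per_sec, uint64_t cols, uint64_t rows,
    uint64_t bytes_per_cell)
{
	return (lines_per_sec * screen_bytes(cols, rows, bytes_per_cell));
}

/* Bytes/sec if only the newly written line reaches the frame buffer. */
static uint64_t
new_line_bw(uint64_t lines_per_sec, uint64_t cols,
    uint64_t bytes_per_cell)
{
	return (lines_per_sec * cols * bytes_per_cell);
}
```

With 50000 lines/sec, full_redraw_bw(50000, 80, 25, 2) gives the 200MB/sec
figure, and new_line_bw(50000, 80, 2) is 1/25 of it because one line is one
of 25 rows.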

% K6/233 PCI  S3/Virge     minix-1.6.25++   offscreen   0.2   0.00   0.12 16.0
% K6/233 PCI  S3/Virge     minix-1.6.25++   onscreen    0.2   0.00   0.12 16.0

The Minix driver from 1990 (rewritten to support virtual consoles and to
be efficient) is faster than syscons of course.  It is smarter about
buffering, so the onscreen case goes at almost the same speed as the
offscreen case.

% K6/233 PCI  S3/Virge     FreeBSD-current  onscreen-o  0.23  0.00   0.23  8.85
% K6/233 PCI  S3/Virge     FreeBSD-current  offscreen-o 0.23  0.00   0.23  8.85

syscons is just slightly slower for the offscreen case.  -current was only
current in ~2004.

% K6/233 PCI  S3/Virge     FreeBSD-current  onscreen    0.34  0.00   0.34  5.83
% K6/233 PCI  S3/Virge     FreeBSD-current  offscreen   0.34  0.00   0.34  5.81

But in the onscreen case, syscons is more than 50% slower, due to less
virtualization.  This slowness became smaller with faster frame buffers,
but is still noticeable in benchmarks with the S3/Virge's write bandwidth
of 40.0MB/sec.

% P5/133 PCI  S3/868       FreeBSD-current  onscreen-o  0.39  0.00   0.39  5.10
% P5/133 PCI  S3/868       FreeBSD-current  offscreen-o 0.40  0.00   0.40  5.00
% P5/133 PCI  S3/868       FreeBSD-current  onscreen    0.51  0.00   0.50  3.92
% P5/133 PCI  S3/868       FreeBSD-current  offscreen   0.51  0.00   0.51  3.92
% K6/233 PCI  S3/Virge     linux-2.1.63     offscreen-o 0.97  0.00   0.97  2.06
% K6/233 PCI  S3/Virge     linux-2.1.63     onscreen-o  1.03  0.00   1.03  1.93
% K6/233 PCI  S3/Virge     linux-2.1.63     offscreen   1.18  0.00   1.18  1.69
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen-o 1.18  0.00   1.16  1.69
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen-o  1.27  0.02   1.23  1.57
% K6/233 PCI  S3/Virge     linux-2.1.63     onscreen    1.38  0.00   1.38  1.45
% 486/33 ISA  ET4000       minix-1.6.25++   offscreen   2     0.01   1.45  1.37
% 486/33 ISA  ET4000       minix-1.6.25++   onscreen    2     0.01   1.60  1.24
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen   1.60  0.00   1.59  1.25
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen    1.70  0.01   1.66  1.18
% 486/33 ISA  ET4000       FreeBSD-current  offscreen-o 2.30  0.01   2.28  0.87
% 486/33 ISA  ET4000       FreeBSD-current  onscreen-o  2.39  0.02   2.32  0.84
% 486/33 ISA  ET4000       FreeBSD-current  offscreen   3.15  0.03   3.10  0.63
% 486/33 ISA  ET4000       FreeBSD-current  onscreen    3.27  0.00   3.21  0.61
% DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen-o 3.63  0.01   3.62  0.55
% DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen-o  3.65  0.01   3.63  0.55
% DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen  12.48  0.01  12.47  0.16
% 486/33 ISA  ET4000       linux-1.1.36     offscreen  20.80  0.00  20.80  0.10
% DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen   26.98  0.01  26.95  0.07
% 486/33 ISA  ET4000       linux-1.1.36     onscreen   38.34  0.02  38.38  0.05

The speedup from the worst case (old Linux on old hardware) to the best case
(old Minix on new hardware) is a factor of 38.34/0.026 = 1475.  Hardware
speeds only increased by a factor of about 2223/33 = 67.  Minix was only
1.5 times faster than syscons and 10-20 times faster than Linux on old
hardware.

% 
% Scroll:
% 
% machine     video        O/S              where      real   user  sys    speed
% ---------   -------      --------------   ---------  -----  ----  -----  -----
% A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen-o  .047  0.00   .047 42.6
% A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen-o .047  0.00   .047 42.6
% A/2223 PCI  R9200SE      FreeBSD-5.2m     onscreen    .051  0.00   .051 39.2
% A/2223 PCI  R9200SE      FreeBSD-5.2m     offscreen   .051  0.00   .051 39.2
% K6/233 PCI  S3/Virge     minix-1.6.25++   offscreen   0.2   0.00   0.14 14.0
% K6/233 PCI  S3/Virge     minix-1.6.25++   onscreen    0.2   0.00   0.14 14.0
% K6/233 PCI  S3/Virge     FreeBSD-current  onscreen-o  0.36  0.00   0.36  5.54
% K6/233 PCI  S3/Virge     FreeBSD-current  offscreen-o 0.40  0.00   0.40  5.01
% K6/233 PCI  S3/Virge     FreeBSD-current  onscreen    0.47  0.00   0.47  4.22
% K6/233 PCI  S3/Virge     FreeBSD-current  offscreen   0.51  0.00   0.51  3.92

Scrolling makes no difference for Minix due to the better virtualization.
It slows down syscons by about 50%.  Strangely, the onscreen case is now
faster?!

% P5/133 PCI  S3/868       FreeBSD-current  onscreen-o  1.24  0.00   1.23  1.61
% P5/133 PCI  S3/868       FreeBSD-current  offscreen-o 1.28  0.00   1.27  1.56
% P5/133 PCI  S3/868       FreeBSD-current  onscreen    1.35  0.00   1.34  1.48
% P5/133 PCI  S3/868       FreeBSD-current  offscreen   1.39  0.00   1.38  1.44
% K6/233 PCI  S3/Virge     linux-2.1.63     onscreen-o  1.49  0.00   1.49  1.34
% 486/33 ISA  ET4000       minix-1.6.25++   offscreen   2     0.00   1.70  1.18
% 486/33 ISA  ET4000       minix-1.6.25++   onscreen    2     0.00   1.81  1.10
% K6/233 PCI  S3/Virge     linux-2.1.63     onscreen    1.85  0.00   1.85  1.08
% K6/233 PCI  S3/Virge     linux-2.1.63     offscreen-o 2.88  0.00   2.88  0.69
% K6/233 PCI  S3/Virge     linux-2.1.63     offscreen   3.10  0.00   3.10  0.65
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen-o 3.39  0.02   3.36  0.59
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen-o  3.67  0.02   3.63  0.54
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  offscreen   3.82  0.00   3.81  0.52
% DX2/66 VLB  ET4000/W32i  FreeBSD-current  onscreen    4.14  0.03   4.06  0.48
% DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen-o  4.34  0.01   4.32  0.46
% 486/33 ISA  ET4000       FreeBSD-current  offscreen-o 5.54  0.03   5.48  0.36
% 486/33 ISA  ET4000       FreeBSD-current  onscreen-o  5.73  0.00   5.61  0.35
% 486/33 ISA  ET4000       FreeBSD-current  offscreen   6.41  0.03   6.34  0.31
% 486/33 ISA  ET4000       FreeBSD-current  onscreen    6.62  0.01   6.45  0.30

The old systems didn't have the CPU or frame buffer bandwidth to scroll
at 50000 lines/sec.  Scaling 50000 down by the ratio of this 6.62 to the above 0.026
gives only 196 lines/sec.  That was usable, but since you can see the
scroll move it is not very good.  Rescaling Minix's 2.0 gives 650 lines/sec,
or a full screen refresh rate of 26 Hz.  You can probably see the scroll
flicker but not move at this rate.  Of course, the implementation does
delayed refresh to reach this rate, so most of the scrolling steps are
virtual and you can only see the screen flicker for other reasons.  syscons'
scrolling is also virtual.
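
The rescaling used here is a simple proportion, sketched below.  It assumes
that the scroll rate is inversely proportional to the benchmark time, with
50000 lines/sec at 0.026 as the reference point; the function names are
hypothetical.

```c
/*
 * Scale a reference scroll rate to another benchmark time, assuming the
 * rate is inversely proportional to the elapsed time.
 */
static double
rescale_lines_per_sec(double ref_rate, double ref_time, double t)
{
	return (ref_rate * ref_time / t);
}

/* Full-screen refresh rate implied by a scroll rate on a rows-line screen. */
static double
refresh_hz(double lines_per_sec, int rows)
{
	return (lines_per_sec / rows);
}
```

rescale_lines_per_sec(50000, 0.026, 6.62) is about 196 lines/sec, and
rescale_lines_per_sec(50000, 0.026, 2.0) is about 650 lines/sec, which
refresh_hz() turns into about 26 Hz for a 25-line screen.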

% DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen-o13.48  0.01  13.47  0.15
% DX2/66 VLB  ET4000/W32i  linux-1.2.13     offscreen  22.60  0.01  22.42  0.09
% 486/33 ISA  ET4000       linux-1.1.36     offscreen  23.56  0.03  23.60  0.08
% DX2/66 VLB  ET4000/W32i  linux-1.2.13     onscreen   27.73  0.01  27.72  0.08
% 486/33 ISA  ET4000       linux-1.1.36     onscreen   40.26  0.00  40.27  0.05

Scaling 50000 down by the ratio of this 40.26 to the above 0.026 gives about 32 lines/sec.
That is only a bit better than 1982 pixel mode quality.  But this is for
text mode.

Bruce


