This bug report was migrated from our old Bugzilla tracker.

These attachments are available in the static archive:

patch (patch_bug_4361_refactoring_wrap.diff, text/plain, 2018-11-06 13:14:07 +0000, 60894 bytes)
patch (patch_bug_4361_refactoring_wrap-v2.diff, text/plain, 2018-11-06 15:37:43 +0000, 58889 bytes)
~~SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-26 15:01:42 +0000, 71464 bytes)~~
SDL_ttf.h (SDL_ttf.h, text/x-chdr, 2018-11-26 15:02:18 +0000, 14929 bytes)
Italic screenshot (italic_.zip, application/zip, 2018-11-28 16:34:54 +0000, 44443 bytes)
~~SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-28 16:40:46 +0000, 73696 bytes)~~
bench app (SDL_ttf_bench.c, text/x-csrc, 2018-11-28 16:42:39 +0000, 8206 bytes)
bench logs (bench_logs.zip, application/zip, 2018-11-28 16:47:39 +0000, 6010 bytes)
bench logs android (armeabi-v7a, arm64-v8a) (bench_logs_android.zip, application/zip, 2018-11-29 13:13:17 +0000, 15361 bytes)
~~SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-29 13:21:09 +0000, 73732 bytes)~~
~~SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-11-29 22:05:20 +0000, 75406 bytes)~~
bench logs for render_glyph_32 (glyph_32-64_bits.zip, application/zip, 2018-11-29 22:06:46 +0000, 7431 bytes)
~~SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-03 10:12:20 +0000, 94135 bytes)~~
bench logs for linux/android (logs.zip, application/zip, 2018-12-03 10:13:52 +0000, 21456 bytes)
~~SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-11 16:24:09 +0000, 102792 bytes)~~
bench logs (bench_sse2.txt, text/plain, 2018-12-11 16:39:28 +0000, 6515 bytes)
~~SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-11 21:56:02 +0000, 103540 bytes)~~
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-12 17:11:24 +0000, 106994 bytes)
SDL_ttf.c (SDL_ttf.c, text/x-csrc, 2018-12-18 15:24:53 +0000, 111598 bytes)
bench SSE2 (SSE2.txt, text/plain, 2018-12-18 15:26:29 +0000, 8659 bytes)
bench NEON (NEON.txt, text/plain, 2018-12-18 15:26:59 +0000, 8766 bytes)

Reported in version: unspecified
Reported for operating system, platform: Linux, x86_64

Comments on the original bug report:

On 2018-11-06 13:10:30 +0000, Sylvain wrote:

Factorize the rendering functions by adding primitives Render_Line(), Render_Glyph(), Render_Glyph_Blended(), Create_Surface_{Solid,Blended,Shaded}().

Now TTF_Render_Wrapped() is something like:
legacy split lines code
foreach lines {
Render_Line();
}

Lots of diff, but no behaviour change and no metrics changes.

With this, it has now been easy to add:

6 missing functions: TTF_Render{UTF8,TEXT,UNICODE}_{Shaded,Solid}_Wrapped()

Add and enable USE_DUFF_LOOP for Render_Glyph() and Render_Glyph_Blended(), expecting a slightly faster blit on mobile platforms.

Since lots of functions require a conversion to UTF-8 ({unicode,text}{solid,blended,shaded}{normal,wrapped} == 12 functions),
this has been moved into TTF_Size_Internal(), TTF_Render_Internal() and TTF_Render_Wrapped_Internal(), while still being stack allocated.

Other minor changes:

Find_Glyph() gives back the glyph and either glyph->pixmap or glyph->bitmap depending on the data requested.

Rename font->outline as it somehow "shadows" "FT_Outline outline".

font->style isn't correctly initialized: it doesn't add the font_face style the first time.
Rewrote this more clearly so that font->style holds exactly the styles handled by SDL_ttf.

Switch "ft_render_mode_normal", "ft_kerning_default", "ft_render_mode_mono" to their upper-case names.
freetype.h says the lower-case names are deprecated.

check that TTF_Font input pointer is not null

On 2018-11-06 13:14:07 +0000, Sylvain wrote:

Created attachment 3447
patch

The patch is quite long, but the final code is 80 lines shorter (with more functions, more comments, and Duff loops).

On 2018-11-06 14:43:26 +0000, Sam Lantinga wrote:

Can you rebase this patch on current mercurial?

Thanks!

On 2018-11-06 14:43:59 +0000, Sam Lantinga wrote:

Also, have you done performance testing to make sure this doesn't introduce regression on various platforms?

On 2018-11-06 15:37:43 +0000, Sylvain wrote:

Created attachment 3449
patch

patch updated.

I'll try some performance bench

On 2018-11-06 16:16:03 +0000, Sylvain wrote:

A quick test, on my pc, with clang and -O2
rendering 50 times the same string

SOLID/SHADED (end up using Render_Glyph)

before the current patch, it takes 22-24 ms.
with the patch and DUFF_LOOPS activated, it takes 19-21 ms.
with the patch and without DUFF_LOOPS activated, it also takes 22-24 ms.

so a little faster thanks to duff_loops.

BLENDED (end up using Render_Glyph_Blended)

59-64ms before or patch+no duff_loops
65-70ms with patch+duff_loops

here, we shouldn't activate the duff_loops. But I believe on mobile, it will be faster.

(btw the times take into account 50x SDL_FreeSurface).

On 2018-11-07 10:25:54 +0000, Sylvain wrote:

So I tried on an Android S7.

Only 1 string (multiple times the alphabet), at size 50, which ends up being a texture of 19380x59.
Not taking into account the SDL_FreeSurface.

Rendering in: (Blended, Shaded, Solid)
with code:
OLD: before this patch
DUFF_LOOPS: with this patch and duff loops
NO_DUFF: with this patch, and without the duff loops

Using freetype 2.9.1 (but shouldn't matter since glyph are rendered and cached).

Forcing -O2 in CFLAGS of SDL_ttf

Trying {arm,thumb}x{armeabi-v7a,arm64-v8a}

thumb: when setting: LOCAL_SRC_FILES := SDL_ttf.c
arm: when setting: LOCAL_SRC_FILES := SDL_ttf.c.arm
not sure if this has the same meaning on arm64

Haven't tried neon...
The current default SDL_ttf.c is thumb

I always took the best time of around 50 tries, because the CPU timings show some variability.

most of the time the best times are (6ms, 2ms, 1ms) for (Blended, Shaded, Solid).

except:
arm64 + DUFF_LOOPS where it's (5,2,1) (in arm)
arm64 + DUFF_LOOPS where it's (5,1,1) (in thumb)
armeabi-v7a + DUFF_LOOPS where it's (5,1,1) (in thumb)
armeabi-v7a + NO_DUFF_LOOPS where it's (6,1,1) (in thumb)

Which means that DUFF_LOOPS should be activated for this target.

On 2018-11-26 15:01:42 +0000, Sylvain wrote:

Created attachment 3503
SDL_ttf.c

Hey, a new version with more things:

The current advance isn't accurate:
First, with kerning, because it rounds and then sums, instead of summing and then rounding, so we lose precision.
Second, because the algorithm has improvements in FT:

FT provides more precise algorithms with left and right side bearing error correction ({rsb,lsb}_deltas).
One is named KERNING MODE SMART and the other one is about sub pixel rendering.
Some doc: http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/include/freetype/freetype.h#n1815
Some code: http://git.savannah.gnu.org/cgit/freetype/freetype2-demos.git/tree/src/ftcommon.c#n1381
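
To illustrate the first point (rounding each kerning value before summing loses precision), here is a minimal sketch. It is not the code from the attachment: the function names are hypothetical, and values are assumed to be in FreeType's 26.6 fixed-point format (1/64 pixel).

```c
#include <assert.h>
#include <stddef.h>

/* Round a 26.6 fixed-point value to whole pixels (round half up).
 * Written with division so it does not rely on shifting negative values. */
long round_26_6(long v)
{
    long t = v + 32;
    long q = t / 64;
    if (t % 64 < 0) {
        q -= 1; /* turn C's truncation into a floor for negative values */
    }
    return q;
}

/* Old behaviour: round every kerning delta, then add the rounded values. */
long pen_round_then_sum(const long *kern_26_6, size_t n)
{
    long x = 0;
    size_t i;
    assert(kern_26_6 != NULL || n == 0);
    for (i = 0; i < n; i++) {
        x += round_26_6(kern_26_6[i]);
    }
    return x;
}

/* New behaviour: accumulate in 26.6 and round only once at the end. */
long pen_sum_then_round(const long *kern_26_6, size_t n)
{
    long x = 0;
    size_t i;
    assert(kern_26_6 != NULL || n == 0);
    for (i = 0; i < n; i++) {
        x += kern_26_6[i];
    }
    return round_26_6(x);
}
```

With four deltas of 20/64 px each, the first function drops every delta while the second keeps the accumulated 80/64 px.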

Added the kerning mode smart. This is always enabled.
From FreeType.h:

If you use strong auto-hinting, you must apply these delta values!
Otherwise you will experience far too large inter-glyph spacing at
small rendering sizes in most cases. Note that it doesn't harm to use
the above code for other hinting modes also, since the delta values
are zero then.
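
The delta correction that this freetype.h comment refers to can be sketched as below. This follows the pseudo-code in the FreeType documentation for lsb_delta / rsb_delta, but it is a stand-alone sketch: GlyphMetrics is a hypothetical stand-in for the fields read from the glyph slot after FT_Load_Glyph(), kerning is left out, and all values are 26.6 fixed point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for face->glyph after FT_Load_Glyph() (26.6 units). */
typedef struct {
    long advance_x;
    long lsb_delta;
    long rsb_delta;
} GlyphMetrics;

/* Lay out glyphs left to right, shifting the pen by one pixel (64 units)
 * whenever the hinting deltas of two neighbours add up to a visible error.
 * Writes each glyph origin to origins[] (may be NULL) and returns the
 * final pen position. */
long layout_with_deltas(const GlyphMetrics *glyphs, size_t count, long *origins)
{
    long origin_x = 0;
    long prev_rsb_delta = 0;
    size_t i;

    assert(glyphs != NULL || count == 0);
    for (i = 0; i < count; i++) {
        long diff = prev_rsb_delta - glyphs[i].lsb_delta;
        if (diff > 32) {
            origin_x -= 64;
        } else if (diff < -31) {
            origin_x += 64;
        }
        prev_rsb_delta = glyphs[i].rsb_delta;

        if (origins != NULL) {
            origins[i] = origin_x;
        }
        origin_x += glyphs[i].advance_x;
    }
    return origin_x;
}
```

When every delta is zero (other hinting modes), the loop reduces to a plain sum of advances.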

Added the subpixel text rendering, as it's only a matter of translating glyphs by less than 1px and using hinting 'light'.
It's the modern way: letters look smoother and more uniformly spaced.
This mode is ten times slower as no caching is possible, though it remains fast.
To activate it, call TTF_SetFontHinting(font, TTF_HINTING_LIGHT_SUBPIXEL);

Independently of the previous changes, there is an issue with the current SDL_ttf process:
Once FT has rendered the glyph, the metrics change (width/height), and so do the offsets where the bitmap should be copied.
It can happen with subpixel, but also with italic, where the space between letters is sometimes totally wrong.
In principle, this can happen even in normal mode.
In fact, you only get the real position and size once the glyph is rasterized.

So this is fixed, but we have to clip the glyph against the whole surface before trying to render.
This is some kind of bound check, at glyph level, not at rendering time, so it doesn't hurt perf.

Since it becomes more complex, I have added a buffer to store positions between size() and render_line().
It's simpler, and the string is decoded only once.
And, if you add text shaping, it's also more convenient because render_line() remains the same.

One more bug fix: Wrap() behaves badly with only one line that is unbreakable.
It draws it at full length.
But in fact, it should clip to wrapLength (like it would do if this unbreakable line were in the middle of other lines).

I have tested this, and also compared subpixel result with FT demos programs to make sure it was ok.
I have run some random tests (and will do more) and also some pixel-to-pixel comparisons (with text shaping).
No patch, but the full SDL_ttf.{c,h} files!

On 2018-11-26 15:02:18 +0000, Sylvain wrote:

Created attachment 3504
SDL_ttf.h

header file

On 2018-11-28 16:34:54 +0000, Sylvain wrote:

Created attachment 3506
Italic screenshot

Here's the italic rendering issue fixed now.

On 2018-11-28 16:40:46 +0000, Sylvain wrote:

Created attachment 3507
SDL_ttf.c

Here's new version (again)

Add a cache for FT_Get_Char_Index() (char -> index conversion, for 127 first ascii values)
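
A sketch of what such a cache could look like. It is an illustration only: the lookup is passed as a function pointer standing in for FT_Get_Char_Index(), the 128-entry table size is an assumption, and counting_lookup() exists purely to demonstrate the effect.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ASCII_CACHE_SIZE 128 /* assumption: code points 0..127 */

/* Stand-in for FT_Get_Char_Index(face, charcode). */
typedef uint32_t (*IndexLookup)(void *face, uint32_t ch);

typedef struct {
    void *face;
    IndexLookup lookup;
    uint32_t index[ASCII_CACHE_SIZE];
    uint8_t valid[ASCII_CACHE_SIZE];
} CharIndexCache;

void char_index_cache_init(CharIndexCache *c, void *face, IndexLookup lookup)
{
    assert(c != NULL && lookup != NULL);
    memset(c, 0, sizeof(*c));
    c->face = face;
    c->lookup = lookup;
}

/* Return the glyph index for ch, calling the lookup at most once
 * per ASCII character. Other characters are always looked up. */
uint32_t cached_char_index(CharIndexCache *c, uint32_t ch)
{
    assert(c != NULL);
    if (ch < ASCII_CACHE_SIZE) {
        if (!c->valid[ch]) {
            c->index[ch] = c->lookup(c->face, ch);
            c->valid[ch] = 1;
        }
        return c->index[ch];
    }
    return c->lookup(c->face, ch);
}

/* Demo lookup: 'face' points to an int call counter; index is ch + 1. */
uint32_t counting_lookup(void *face, uint32_t ch)
{
    *(int *)face += 1;
    return ch + 1;
}
```

The char-to-index mapping depends on the face's charmap and not on style or hinting, which is why the cached entries stay usable after those settings change.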

Current head is slower at small sizes, because we call FT_Get_Char_Index() twice to convert a char to its index:
once to access cache_metrics and once more to access cache_bitmap/pixmap.
This has happened since the index started being used as the cache key :(

This was indirectly fixed by the previous patch, because we only called FT_Get_Char_Index() once and stored the result.
It is now improved further by the added cache, which stays valid even after a style/hinting change.

Change the minx/yoffset naming to left/top, which now matches the FreeType examples.

Don't use DUFFS_LOOP...
From size 1 to 50 it is faster (this is what I tried before!), but beyond that it's much slower.
In fact the Duff's loop timings are quite constant; it is the non-Duff loop that became faster.
We see that in all benches without Duff's loop:
Rendering at size 80 takes the same time as rendering at size 60.
And (not dumped in the next log) rendering at size 70 is faster than at size 60.
I believe we hit some kind of compiler optimisation at this size.
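
For readers unfamiliar with the technique, here is a sketch of a Duff's-device style unrolled loop. It is not SDL's DUFFS_LOOP macro itself: the function name is hypothetical and it assumes the inner loop ORs glyph bytes into the destination row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OR 'width' source bytes into dst, unrolled by 4 with Duff's device:
 * the switch jumps into the middle of the loop body to handle the
 * remainder, then the do/while runs full groups of 4. */
void or_copy_duff(uint8_t *dst, const uint8_t *src, size_t width)
{
    size_t n;

    if (width == 0) {
        return;
    }
    assert(dst != NULL && src != NULL);
    n = (width + 3) / 4;
    switch (width & 3) {
    case 0: do { *dst++ |= *src++; /* fall through */
    case 3:      *dst++ |= *src++; /* fall through */
    case 2:      *dst++ |= *src++; /* fall through */
    case 1:      *dst++ |= *src++;
            } while (--n > 0);
    }
}
```

Manual unrolling like this can get in the way of a compiler's own vectorisation of the plain loop, which may be one explanation for the size-dependent results above; that is a guess, not something the benchmarks establish.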

Rewrote render_glyph a little and unrolled render_line, after various tries, to make it faster.

Wrote a more precise benchmark with PerformanceCounter()

Now:

at size 8, it's 2x faster (old 25000, head 41000, new 13000)

at size 100, it's 1.5 faster (old 106000, head 98000, new 61000)

Also some subpixel benchmark (only on the new version)

On 2018-11-28 16:42:39 +0000, Sylvain wrote:

Created attachment 3508
bench app

Small bench app for various size / modes

On 2018-11-28 16:47:39 +0000, Sylvain wrote:

Created attachment 3509
bench logs

bench outputs:

old: old version (~3 months ago) (before I started adding bugs).

currentHead: current head

new: this previous SDL_ttf.c

new_With_Duff_Loops: this previous SDL_ttf.c if you add Duffs Loop. (faster up to size 50, slow after...)

On 2018-11-29 13:13:17 +0000, Sylvain wrote:

Created attachment 3510
bench logs android (armeabi-v7a, arm64-v8a)

Previous Bench was on linux i7-3610QM CPU @ 2.30GHz.

Those ones are on samsung s7.

on arm64-v8a:
new version is better than old version.
starting at size 60-70-80, same phenomenon, new-no-duffs-loop is clearly better than new-with-duffs-loop

on armeabi-v7a: at size 70-80, very little better for duffs loop.

So in the end, because of this effect, we shouldn't enable this USE DUFFS LOOP.
on android, the practical usual sizes are 30 to 90.

NB:

v7a: int64 comparison is not working; need to use "clock_gettime(CLOCK_REALTIME, &res);" instead of PerformanceCounter

all Android benches: there is high variation within the same run.

On 2018-11-29 13:21:09 +0000, Sylvain wrote:

Created attachment 3511
SDL_ttf.c

New version with a little adjustment for Outline Style so that it remains centred:

@@ -1002,6 +1002,7 @@
       int fo = font->outline_val;
       cached->sz_width += 2 * fo;
       cached->sz_rows += 2 * fo;
       cached->sz_left  -= fo;
   }

On 2018-11-29 22:05:20 +0000, Sylvain wrote:

Created attachment 3513
SDL_ttf.c

A new version again with two changes:

For Blended, pre-compute alpha_table, so that instead of doing
*dst++ |= pixel | ((Uint32)alpha_table[alpha] << 24);
It can be rewritten as:
*dst++ |= alpha_table[alpha];
A few percent of improvement.
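
A sketch of the pre-computed table idea, under stated assumptions: ARGB8888 pixels with alpha in the top byte, and glyph coverage scaled by the foreground alpha as `coverage * fg_alpha / 255`. The function names and the scaling formula are hypothetical, not taken from the attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build one full 32-bit pixel per coverage value, so the inner loop
 * needs a single table read instead of a shift and two ORs. */
void build_pixel_table(uint32_t table[256], uint32_t pixel_rgb, uint8_t fg_alpha)
{
    uint32_t i;
    assert(table != NULL);
    for (i = 0; i < 256; i++) {
        uint32_t a = (i * fg_alpha) / 255; /* assumed alpha scaling */
        table[i] = pixel_rgb | (a << 24);
    }
}

/* Blend one row of 8-bit coverage values into 32-bit destination pixels. */
void blend_row_table(uint32_t *dst, const uint8_t *coverage, size_t width,
                     const uint32_t table[256])
{
    assert(dst != NULL && coverage != NULL && table != NULL);
    while (width--) {
        *dst++ |= table[*coverage++];
    }
}
```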

For Solid/Shaded, we can round the glyph width up to a multiple of the integer size and copy faster with 32-bit or 64-bit instructions.
(64 bits fails on android arm-v7a, so only 32 bits is activated).
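
A sketch of the 32-bit variant, assuming 8-bit pixels that are ORed into the destination row. The names are hypothetical, and memcpy is used here for the 4-byte accesses so the sketch stays valid on unaligned rows; the attachment may well do it differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round a byte width up to the next multiple of 4. */
size_t pad_width_32(size_t width)
{
    return (width + 3) & ~(size_t)3;
}

/* OR a row into dst four bytes at a time. The caller guarantees that
 * both rows have padded_width readable/writable bytes and that the
 * padding bytes of src are zero, so they leave dst unchanged. */
void or_row_32(uint8_t *dst, const uint8_t *src, size_t padded_width)
{
    size_t i;
    assert((padded_width & 3) == 0);
    for (i = 0; i < padded_width; i += 4) {
        uint32_t d, s;
        memcpy(&d, dst + i, 4); /* compilers turn this into one load */
        memcpy(&s, src + i, 4);
        d |= s;
        memcpy(dst + i, &d, 4);
    }
}
```

A 5-pixel-wide glyph is stored as 8 bytes per row and copied in two iterations instead of five.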

Gain is quite good:
on linux:
Size 61 shaded: instead of 42 us, it takes 27 us.
from old versions, it means: 71 us -> 27 us
Size 80 shaded: 40 -> 36
from old versions, it means: (also)70 us -> 36 us

on android, even better
armv7a:
Size 61 shaded: 106us, => 57 us
Size 80 shaded: 201 us => 96 us

arm64:
Size 61 shaded: 84 us => 44 us
Size 80 shaded: 106 us => 73us

It doesn't change output, since metrics during size calculation aren't changed, only the rendered glyph is extra padded.
It doesn't change code complexity, since there is already a fallback to clip the glyph if it is out of the output surface.

On 2018-11-29 22:06:46 +0000, Sylvain wrote:

Created attachment 3514
bench logs for render_glyph_32

On 2018-12-03 10:12:20 +0000, Sylvain wrote:

Created attachment 3519
SDL_ttf.c

New version! I've added SSE2 and NEON intrinsics versions of Render_Glyph.
They work on un-aligned memory (loadu, storeu on SSE),
only the glyph width is rounded, as previously.

Not very familiar with SSE2 nor NEON, but I got them working.
NEON doesn't seem to run faster, so I still commented it out.

(there is probably room for improvement, doing prefetch or other instructions...)
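
A sketch of an unaligned SSE2 row copy along those lines, with a scalar fallback so it also builds on non-x86 targets. The function name is hypothetical; the intrinsics (_mm_loadu_si128, _mm_or_si128, _mm_storeu_si128) are the standard ones from emmintrin.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* OR a row of 8-bit pixels into dst, 16 bytes per iteration.
 * padded_width must be a multiple of 16; neither pointer needs
 * to be 16-byte aligned. */
void or_row_sse2(uint8_t *dst, const uint8_t *src, size_t padded_width)
{
    size_t i = 0;
    assert((padded_width & 15) == 0);
#if defined(__SSE2__)
    for (; i < padded_width; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(d, s));
    }
#else
    /* Scalar fallback with the same result. */
    for (; i < padded_width; i++) {
        dst[i] |= src[i];
    }
#endif
}
```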

Also, this is now built with macros so the compiler knows how to optimize things.

Same metrics, on linux:
Size 61 shaded:
=> 25 us, with render_glyph_32
=> 22 us, with render_glyph_64
=> 23 us, with render_glyph sse2

Size 80:
=> 35 us, with render_glyph_32
=> 33 us, with render_glyph_64
=> 31 us, with render_glyph sse2

arm v7a (Render_Glyph_32):
Size 61 shaded: => 59
Size 80 shaded: => 114 ( a little slower but might be the testing as well).

arm 64 (Render_Glyph_64):
Size 61 shaded: => 40 us
Size 80 shaded: => 70 us

Also fix allocation for non-scalable fonts when converting them:
(nonscalable/pvfixed_20b.pcf.gz)
For instance:
src->pixel_mode = 1 (MONO) src->width=2 src->pitch=4
Would allocate, by doing pitch * 8 :
dst->width = 2, dst->pitch=32

src->pixel_mode = 1 (MONO) src->width=2 src->pitch=4
Now allocate:
dst->width=2 dst->pitch=8

On 2018-12-03 10:13:52 +0000, Sylvain wrote:

Created attachment 3520
bench logs for linux/android

Bench logs for previous version

On 2018-12-11 16:24:09 +0000, Sylvain wrote:

Created attachment 3541
SDL_ttf.c

Here's a new version

Spotted and fixed two new bugs:

Blended mode: the line colour (underline/strikethrough) is always fully opaque when it should use the user's alpha value (colour fg.a).
The result is a very bright line when the text is at maybe 10% brightness (depending on the chosen alpha).

Shaded mode: the alpha palette is incorrect: for instance, when the colour alpha is 0, it is set to OPAQUE (255), which disturbs the a-diff ratio.
So the palette alpha is not regular from 0 to 255 and a text with 10% brightness is not displayed correctly: it looks like it has brightened edges.

A few additional optimisations:

Shaded mode: optimise surface creation:

division by 255 (x/255) can be optimised as
(x + 1 + (x>>8))>>8, when positive
(x + 255 + (x>>8))>>8 when negative
(which is a little faster, and non-negligible at small font sizes compared to the total time taken).
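
The two expressions can be checked against plain division with a small sketch like the one below. The function names are hypothetical and the range is restricted to |x| <= 255*255, which is what a product of two 8-bit values needs. The negative variant assumes that `>>` on a negative int is an arithmetic shift, which is implementation-defined in C but holds on the usual compilers.

```c
#include <assert.h>

/* x / 255 for 0 <= x <= 255*255, without a division. */
int div255_pos(int x)
{
    assert(x >= 0 && x <= 255 * 255);
    return (x + 1 + (x >> 8)) >> 8;
}

/* x / 255 (truncating toward zero) for -255*255 <= x <= 0.
 * Assumes arithmetic right shift of negative values. */
int div255_neg(int x)
{
    assert(x <= 0 && x >= -255 * 255);
    return (x + 255 + (x >> 8)) >> 8;
}

/* Returns 1 if both shortcuts agree with '/' over the whole range. */
int div255_matches_division(void)
{
    int x;
    for (x = 0; x <= 255 * 255; x++) {
        if (div255_pos(x) != x / 255) {
            return 0;
        }
        if (div255_neg(-x) != (-x) / 255) {
            return 0;
        }
    }
    return 1;
}
```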

(NB: to get a realistic bench value at small sizes, kerning must be turned off because its timing cost is too high).

Blended mode: optimise surface creation:

provide memory buffer to avoid an extra memset of the full surface from SDL_CreateRGBSurface.

All Blit Glyph routines are now available in duffs loop macro, and cleaned-up.
(still not sure if this is worth enabling on android).

Blended mode:

alpha_table doesn't need to be computed as "color | (alpha << 24)" but only "alpha << 24" since the background is already filled.

also distinguish the cases where Blended is opaque or not:
opaque doesn't need to look up alpha_table at all and can be done faster on the fly.
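
A sketch of the two Blended paths described here, assuming ARGB8888 destination pixels that were pre-filled with the text colour and alpha 0. The names and the `coverage * fg_alpha / 255` scaling are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque foreground: coverage is the final alpha, no table needed. */
void blend_row_opaque(uint32_t *dst, const uint8_t *coverage, size_t width)
{
    assert(dst != NULL && coverage != NULL);
    while (width--) {
        *dst++ |= (uint32_t)(*coverage++) << 24;
    }
}

/* Non-opaque foreground: the table only holds "alpha << 24", since the
 * colour bits are already in the pre-filled surface. */
void build_alpha_only_table(uint32_t table[256], uint8_t fg_alpha)
{
    uint32_t i;
    assert(table != NULL);
    for (i = 0; i < 256; i++) {
        table[i] = ((i * fg_alpha) / 255) << 24; /* assumed scaling */
    }
}

void blend_row_alpha(uint32_t *dst, const uint8_t *coverage, size_t width,
                     const uint32_t table[256])
{
    assert(dst != NULL && coverage != NULL && table != NULL);
    while (width--) {
        *dst++ |= table[*coverage++];
    }
}
```

A caller would pick blend_row_opaque() when the foreground alpha is 255 and the table path otherwise.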

Optimise Blending routines SSE and NEON.
Especially in opaque blending:
NEON can process 16 pixels at once using vzipq_u8 (interleave)
that produces a two lanes value in output (uint8x16x2_t).
Rendering Blended text is now twice as fast.

On 2018-12-11 16:39:28 +0000, Sylvain wrote:

Created attachment 3542
bench logs

Bench log with SSE2 version:

Note that kerning (the call to FT_Get_Kerning()) takes ~8300 us in both logs.
non-opaque means we have to use the alpha_table (as in all previous versions),
whereas opaque is done on the fly (the reason is that there is nothing to compute for this value).

previous (2 dec):

INFO: Size: 40 Shaded Avg: 0.0206667 ms Avg Perf: 19344 (min= 18192)
INFO: Size: 40 Solid Avg: 0.019 ms Avg Perf: 18297 (min= 17103)
INFO: Size: 40 Blended Avg: 0.0776667 ms Avg Perf: 76130 (min= 70774)

INFO: Size: 80 Shaded Avg: 0.032 ms Avg Perf: 31004 (min= 30057)
INFO: Size: 80 Solid Avg: 0.0306667 ms Avg Perf: 30274 (min= 28786)
INFO: Size: 80 Blended Avg: 0.32 ms Avg Perf: 319046 (min= 306633)

this:

INFO: Size: 40 Shaded Avg: 0.0173333 ms Avg Perf: 16609 (min= 15533)
INFO: Size: 40 Solid Avg: 0.0166667 ms Avg Perf: 16394 (min= 15191)
INFO: Size: 40 Blended Avg: 0.043 ms Avg Perf: 41764 (min= 39904) opaque
INFO: Size: 40 Blended Avg: 0.048 ms Avg Perf: 47534 (min= 45519) [non opaque]

INFO: Size: 80 Shaded Avg: 0.0293333 ms Avg Perf: 27837 (min= 26532)
INFO: Size: 80 Solid Avg: 0.029 ms Avg Perf: 27489 (min= 26067)
INFO: Size: 80 Blended Avg: 0.131 ms Avg Perf: 130562 (min= 127157) opaque
INFO: Size: 80 Blended Avg: 0.148 ms Avg Perf: 147344 (min= 144133) [non opaque]

On 2018-12-11 21:56:02 +0000, Sylvain wrote:

Created attachment 3543
SDL_ttf.c

New version, so that the SSE Blended Opaque function also reads Uint8 values 16 at a time, like the NEON one.

INFO: Size: 40 Blended Avg: 0.031 ms Avg Perf: 29897 (min= 29069)

INFO: Size: 80 Blended Avg: 0.115 ms Avg Perf: 114104 (min= 110306)

On 2018-12-12 17:11:24 +0000, Sylvain wrote:

Created attachment 3544
SDL_ttf.c

New version:

Now SSE/NEON Blended Non-Opaque also computes the alpha table on the fly and performs 10-20% better.
(and I removed the alpha_table).

When blitting a glyph, 'srcskip' is not needed: we choose the width and pitch so that srcskip is 0 and can be removed. Except for the case when the glyph is clipped.
(For fixed Font, the pitch may be larger for decoding, so one post-processing step can be needed to shrink them).

Some SSE2 bench (with kerning activated):

Size: 40 Blended Avg: 0.0433333 ms Avg Perf: 41884 (min= 40173) [non opaque]
Size: 80 Blended Avg: 0.128 ms Avg Perf: 127318 (min= 125139) [non opaque]

On 2018-12-18 15:24:53 +0000, Sylvain wrote:

Created attachment 3551
SDL_ttf.c

New version with:

faster access to cache glyph

format decoding (mono/gray2/gray4) doesn't require a larger pitch.

all access to destination are now aligned:

SSE sees some small benefit at larger sizes (> 100)
NEON is much faster (50%), and also more stable (max value vs average).

On 2018-12-18 15:26:29 +0000, Sylvain wrote:

Created attachment 3552
bench SSE2

Also, it now depends on SDL capabilities to free an aligned memory pointer.

attached: Bench SSE2

On 2018-12-18 15:26:59 +0000, Sylvain wrote:

Created attachment 3553
bench NEON

attached: bench NEON

On 2019-01-31 12:52:04 +0000, Sylvain wrote:

Pushed in https://hg.libsdl.org/SDL_ttf/rev/9f46efc0fde2

Also modify public header : https://hg.libsdl.org/SDL_ttf/rev/7714b0b23bf3
 1.7 -int TTF_GlyphIsProvided(const TTF_Font *font, Uint16 ch)
 1.8 +int TTF_GlyphIsProvided(TTF_Font *font, Uint16 ch)

Refactoring and add missing Wrapped functions. #96

SDLBugzilla commented Feb 11, 2021