
Bug 4361

Summary: Refactoring and add missing Wrapped functions.
Product: SDL_ttf Reporter: Sylvain <sylvain.becker>
Component: misc Assignee: Sam Lantinga <slouken>
Status: RESOLVED FIXED QA Contact: Sam Lantinga <slouken>
Severity: normal    
Priority: P2    
Version: unspecified   
Hardware: x86_64   
OS: Linux   
Attachments: patch
patch
SDL_ttf.c
SDL_ttf.h
Italic screenshot
SDL_ttf.c
bench app
bench logs
bench logs android (armeabi-v7a, arm64-v8a)
SDL_ttf.c
SDL_ttf.c
bench logs for render_glyph_32
SDL_ttf.c
bench logs for linux/android
SDL_ttf.c
bench logs
SDL_ttf.c
SDL_ttf.c
SDL_ttf.c
bench SSE2
bench NEON

Description Sylvain 2018-11-06 13:10:30 UTC
Factorize the rendering functions by adding primitives Render_Line(), Render_Glyph(), Render_Glyph_Blended(), Create_Surface_{Solid,Blended,Shaded}().

Now TTF_Render_Wrapped() is something like:
   legacy split lines code
   foreach lines {
     Render_Line();
   }
 
Lots of diff, but no behaviour change and no metrics changes.

With this, it has now been easy to add:
- 6 missing functions: TTF_Render{UTF8,TEXT,UNICODE}_{Shaded,Solid}_Wrapped()
- Add and enable USE_DUFF_LOOP for Render_Glyph() and Render_Glyph_Blended(), which should give a slightly faster blit on mobile platforms.
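
For readers unfamiliar with the technique, here is a minimal sketch of a Duff's-device style unrolled row loop. It is not taken from the patch: the helper name or_row_duff and the choice of OR-ing bytes into the destination are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): OR `width` source bytes into
 * `dst`, unrolled four at a time using Duff's device. */
static void or_row_duff(uint8_t *dst, const uint8_t *src, size_t width)
{
    assert(dst != NULL && src != NULL);
    if (width == 0) {
        return;
    }
    size_t n = (width + 3) / 4;      /* number of passes through the body */
    switch (width & 3) {             /* jump into the body for the remainder */
    case 0: do { *dst++ |= *src++;
    case 3:      *dst++ |= *src++;
    case 2:      *dst++ |= *src++;
    case 1:      *dst++ |= *src++;
            } while (--n > 0);
    }
}
```

The first pass handles the remainder (width modulo 4) and every later pass handles four bytes, so the loop test runs roughly once per four pixels instead of once per pixel.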

Since lots of functions require a conversion to UTF-8 ({unicode,text}*{solid,blended,shaded}*{normal,wrapped} == 12 functions),
this has been moved into TTF_Size_Internal(), TTF_Render_Internal() and TTF_Render_Wrapped_Internal(), while still being stack allocated.


Other minor changes:
- Find_Glyph() gives back the glyph and either glyph->pixmap or glyph->bitmap depending on the data requested.
- Rename font->outline as it somehow "shadows" "FT_Outline outline".
- font->style isn't correctly initialized: it doesn't add the font_face style the first time.
  Rewrote this more clearly so that font->style holds exactly the styles handled by SDL_ttf.
- Switch to upper case for "ft_render_mode_normal", "ft_kerning_default", "ft_render_mode_mono".
  freetype.h says the lower-case names are deprecated.
- check that TTF_Font input pointer is not null
Comment 1 Sylvain 2018-11-06 13:14:07 UTC
Created attachment 3447 [details]
patch

The patch is quite long, but the final code is 80 lines shorter (with more functions, more comments, and Duff loops).
Comment 2 Sam Lantinga 2018-11-06 14:43:26 UTC
Can you rebase this patch on current mercurial?

Thanks!
Comment 3 Sam Lantinga 2018-11-06 14:43:59 UTC
Also, have you done performance testing to make sure this doesn't introduce regression on various platforms?
Comment 4 Sylvain 2018-11-06 15:37:43 UTC
Created attachment 3449 [details]
patch

patch updated.

I'll try some performance benchmarks.
Comment 5 Sylvain 2018-11-06 16:16:03 UTC
A quick test on my PC, with clang and -O2,
rendering the same string 50 times

SOLID/SHADED (end up using Render_Glyph)

before the current patch, it takes 22-24 ms.
with the patch and DUFF_LOOPS activated, it takes 19-21 ms.
with the patch and without DUFF_LOOPS activated, it also takes 22-24 ms.

so a little faster thanks to duff_loops.

BLENDED  (ends up using Render_Glyph_Blended)

59-64ms before or patch+no duff_loops
65-70ms with patch+duff_loops

here, we shouldn't activate the duff_loops. But I believe on mobile, it will be faster.

(btw the times take into account 50x SDL_FreeSurface).
Comment 6 Sylvain 2018-11-07 10:25:54 UTC
So I tried this on an Android S7.

Only 1 string (multiple times the alphabet), at size 50, which ends up being a texture of 19380x59. 
Not taking into account the SDL_FreeSurface.

Rendering in: (Blended, Shaded, Solid)
with code: 
OLD: before this patch
DUFF_LOOPS: with this patch and duff loops
NO_DUFF: with this patch, and without the duff loops

Using freetype 2.9.1 (but it shouldn't matter since glyphs are rendered and cached).

Forcing -O2 in CFLAGS of SDL_ttf


Trying {arm,thumb}x{armeabi-v7a,arm64-v8a}

thumb: when setting: LOCAL_SRC_FILES := SDL_ttf.c
arm: when setting: LOCAL_SRC_FILES := SDL_ttf.c.arm
not sure if this has the same meaning on arm64

Haven't tried neon...
The current default SDL_ttf.c is thumb

I always took the best time out of around 50 tries, because the CPU timings show some variability.

most of the time the best times are (6ms, 2ms, 1ms) for (Blended, Shaded, Solid).

except:
arm64 + DUFF_LOOPS where it's (5,2,1) (in arm)
arm64 + DUFF_LOOPS where it's (5,1,1) (in thumb)
armeabi-v7a + DUFF_LOOPS where it's (5,1,1)  (in thumb)
armeabi-v7a + NO_DUFF_LOOPS where it's (6,1,1)  (in thumb)

Which means that DUFF_LOOPS should be activated for this target.
Comment 7 Sylvain 2018-11-26 15:01:42 UTC
Created attachment 3503 [details]
SDL_ttf.c

Hey, a new version with more things:

- The current advance isn't accurate:
  First, with kerning, because it rounds and then sums, instead of summing and then rounding, so we lose precision.
  Second, because FT provides improved algorithms:

  FT provides more precise algorithms with left and right side bearing error correction ({rsb,lsb}_deltas).
  One is named KERNING MODE SMART and the other one is about sub pixel rendering.
  Some doc:  http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/include/freetype/freetype.h#n1815
  Some code: http://git.savannah.gnu.org/cgit/freetype/freetype2-demos.git/tree/src/ftcommon.c#n1381

- Added the kerning mode smart. This is always enabled.
   From freetype.h:
   *   If you use strong auto-hinting, you **must** apply these delta values!
   *   Otherwise you will experience far too large inter-glyph spacing at
   *   small rendering sizes in most cases.  Note that it doesn't harm to use
   *   the above code for other hinting modes also, since the delta values
   *   are zero then.

- Added subpixel text rendering, as it's only a matter of translating glyphs by less than 1px and using 'light' hinting.
  It's the modern way: letters look smoother and more uniformly spaced.
  This mode is ten times slower as no caching is possible, though it remains fast.
  To activate it, call TTF_SetFontHinting(font, TTF_HINTING_LIGHT_SUBPIXEL);
 
- Independently of the previous changes, there is an issue with the current SDL_ttf process:
  Once FT has rendered the glyph, the metrics change (width/height), and so do the offsets where the bitmap should be copied.
  It can happen with subpixel rendering, but also with italic, where the space between letters is sometimes totally wrong.
  Strictly speaking, this can happen even in normal mode.
  In fact, you only get the real position and size once the glyph is rasterized.
  
  So this is fixed, but we have to clip the glyph against the whole surface before trying to render.
  This is some kind of bound check, at glyph level, not at rendering time, so it doesn't hurt perf.

- Since it becomes more complex, I have added a buffer to store positions between size() and render_line().
  It's simpler, and the string is decoded only once.
  And, if you add text shaping, it's also more convenient because render_line() remains the same.

- One more bug fix: Wrap() behaves badly with only one line that is unbreakable.
  It draws it full length.
  But in fact, it should clip to wrapLength (as it would if this unbreakable line were in the middle of other lines).
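
To illustrate the first point of the list above (round-then-sum versus sum-then-round), here is a small sketch in FreeType's 26.6 fixed-point convention (1/64 pixel units). The helper names are made up for this example and the code is not taken from SDL_ttf.c.

```c
#include <assert.h>
#include <stddef.h>

/* Round a non-negative 26.6 fixed-point value to whole pixels. */
static long f26dot6_round(long v)
{
    assert(v >= 0);                 /* this sketch ignores negative values */
    return (v + 32) >> 6;
}

/* Old behaviour as described: round every advance, then sum the pixels. */
static long width_round_then_sum(const long *adv, size_t n)
{
    long px = 0;
    for (size_t i = 0; i < n; i++) {
        px += f26dot6_round(adv[i]);
    }
    return px;
}

/* Sum in 26.6 first and round only once at the end. */
static long width_sum_then_round(const long *adv, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += adv[i];
    }
    return f26dot6_round(total);
}
```

With four advances of 10.25 px each (656 in 26.6), the first function returns 40 while the second returns 41, which is the exact width.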

I have tested this, and also compared subpixel result with FT demos programs to make sure it was ok.
I have run some random tests (and will do more) and also some pixel-to-pixel comparisons (with text shaping).
No patch, but the full SDL_ttf.{c,h} files!
Comment 8 Sylvain 2018-11-26 15:02:18 UTC
Created attachment 3504 [details]
SDL_ttf.h

header file
Comment 9 Sylvain 2018-11-28 16:34:54 UTC
Created attachment 3506 [details]
Italic screenshot

Here's the italic rendering issue fixed now.
Comment 10 Sylvain 2018-11-28 16:40:46 UTC
Created attachment 3507 [details]
SDL_ttf.c

Here's a new version (again)

- Add a cache for FT_Get_Char_Index()  (char -> index conversion, for 127 first ascii values)

Current head is slower at small sizes, because we call FT_Get_Char_Index() twice to convert a char to its index:
once to access cache_metrics and once more to access cache_bitmap/pixmap.
This has been happening since the index started being used as the cache key :(

This was indirectly fixed by the previous patch because we called FT_Get_Char_Index() only once and stored the result.
This is now improved further with the added cache, which stays valid even after a style/hinting change.
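
A minimal sketch of such a char -> index cache, written without a FreeType dependency so it stands on its own. The struct, the function names and the lookup callback (which stands in for FT_Get_Char_Index()) are hypothetical; demo_lookup is only a counting stub to make the effect visible.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ASCII_CACHE_SIZE 128

/* Stand-in for the real char -> glyph index lookup. */
typedef uint32_t (*lookup_fn)(void *face, uint32_t ch);

typedef struct {
    uint32_t  index[ASCII_CACHE_SIZE];
    uint8_t   valid[ASCII_CACHE_SIZE]; /* separate flag: index 0 is a legal result */
    lookup_fn lookup;
    void     *face;
} char_index_cache;

static void cache_init(char_index_cache *c, lookup_fn lookup, void *face)
{
    assert(c != NULL && lookup != NULL);
    memset(c, 0, sizeof(*c));
    c->lookup = lookup;
    c->face = face;
}

static uint32_t cache_get(char_index_cache *c, uint32_t ch)
{
    if (ch < ASCII_CACHE_SIZE) {
        if (!c->valid[ch]) {
            c->index[ch] = c->lookup(c->face, ch);
            c->valid[ch] = 1;
        }
        return c->index[ch];
    }
    return c->lookup(c->face, ch);     /* outside the cached range */
}

/* Demo stub: counts how often the underlying lookup is really called. */
static int demo_calls = 0;
static uint32_t demo_lookup(void *face, uint32_t ch)
{
    (void)face;
    demo_calls++;
    return ch + 1;
}
```

Since the mapping depends only on the face and its charmap, nothing here has to be flushed when style or hinting changes, which matches the remark above.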

- Change the minx/yoffset naming to left/top, which now matches the FreeType examples.

- Don't use DUFFS_LOOP...
  From size 1 to 50 it is faster (this is what I tried before!), but beyond that it's much slower.
  In fact the Duff's loops are fairly constant; it is the non-Duff loop that becomes faster.
  We see that in all benchmarks without Duff's loops:
  rendering at size 80 takes the same time as rendering at size 60,
  and (not dumped in the next log) rendering at size 70 is faster than at size 60.
  I believe we hit some kind of compiler optimisation at this size ..

 
- Rewrote render_glyph a little and unrolled render_line, after various tries, to make it faster.

- Wrote a more precise benchmark with PerformanceCounter()

Now: 
 - at size 8, it's  2x faster   (old 25000,  head 41000, new 13000)
 - at size 100, it's 1.5x faster (old 106000, head 98000, new 61000)

Also some subpixel benchmark (only on the new version)
Comment 11 Sylvain 2018-11-28 16:42:39 UTC
Created attachment 3508 [details]
bench app

Small bench app for various size / modes
Comment 12 Sylvain 2018-11-28 16:47:39 UTC
Created attachment 3509 [details]
bench logs

bench outputs:

old: old version (~3 months ago) (before I started adding bugs).

currentHead: current head 

new: the previous SDL_ttf.c

new_With_Duff_Loops: the previous SDL_ttf.c with Duff's loops added. (faster up to size 50, slower after...)
Comment 13 Sylvain 2018-11-29 13:13:17 UTC
Created attachment 3510 [details]
bench logs android (armeabi-v7a, arm64-v8a)

The previous bench was on linux, i7-3610QM CPU @ 2.30GHz.

These ones are on a Samsung S7.

- on arm64-v8a: 
  new version is better than old version.
  starting at size 60-70-80, same phenomenon, new-no-duffs-loop is clearly better than new-with-duffs-loop

- on armeabi-v7a: at size 70-80, Duff's loop is only very slightly better.


So in the end, because of this effect, we shouldn't enable the Duff's loops.
On android, the usual practical sizes are 30 to 90.

NB: 
- v7a: int64 comparison is not working, so "clock_gettime(CLOCK_REALTIME, &res);" is needed instead of PerformanceCounter
- all android benches: there is high variation within the same run.
Comment 14 Sylvain 2018-11-29 13:21:09 UTC
Created attachment 3511 [details]
SDL_ttf.c

New version with a little adjustment for Outline Style so that it remains centred:


@@ -1002,6 +1002,7 @@
             int fo = font->outline_val;
             cached->sz_width += 2 * fo;
             cached->sz_rows  += 2 * fo;
+            cached->sz_left  -= fo;
         }
Comment 15 Sylvain 2018-11-29 22:05:20 UTC
Created attachment 3513 [details]
SDL_ttf.c

A new version again with two changes:

1)
For Blended, pre-compute alpha_table, so that instead of doing
  *dst++ |= pixel | ((Uint32)alpha_table[alpha] << 24);
It can be rewritten as:
  *dst++ |= alpha_table[alpha];
A few percent of improvement.
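
A rough sketch of the precomputed table idea, under several assumptions that are not confirmed by the patch: ARGB8888 pixels, a table entry that scales glyph coverage by the user alpha, and a destination whose pixels start at zero so that OR-ing is enough. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: ARGB8888, text colour in the low 24 bits of fg_rgb.
 * Each entry already contains colour and alpha, so the inner loop is one OR. */
static void build_alpha_table(uint32_t table[256], uint32_t fg_rgb, uint8_t fg_alpha)
{
    for (uint32_t cov = 0; cov < 256; cov++) {
        uint32_t a = (cov * fg_alpha) / 255;   /* assumed scaling rule */
        table[cov] = (fg_rgb & 0x00FFFFFFu) | (a << 24);
    }
}

/* Write one glyph row: `coverage` holds the 8-bit glyph pixmap values. */
static void blend_row(uint32_t *dst, const uint8_t *coverage, size_t width,
                      const uint32_t table[256])
{
    assert(dst != NULL && coverage != NULL);
    while (width--) {
        *dst++ |= table[*coverage++];
    }
}
```

The table costs 256 entries per render call, and in exchange the per-pixel shift and OR with the colour disappear from the inner loop.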


2)
For Solid/Shaded, we can round the glyph width up to a multiple of the integer size and copy faster with 32-bit or 64-bit instructions.
(64-bit fails on android arm-v7a, so only 32-bit is activated.)
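
A sketch of the padded-width idea, with hypothetical helper names. It assumes the glyph buffer is zero-padded up to the rounded width and that rows are combined with OR; memcpy is used for the 32-bit loads and stores so the example has no alignment requirement, which may differ from what SDL_ttf.c actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round a glyph width up to the next multiple of 4 bytes. */
static size_t round_up_4(size_t width)
{
    return (width + 3) & ~(size_t)3;
}

/* OR one padded glyph row into the destination, 4 bytes per iteration. */
static void or_row_32(uint8_t *dst, const uint8_t *src, size_t padded_width)
{
    assert(padded_width % 4 == 0);
    for (size_t i = 0; i < padded_width; i += 4) {
        uint32_t s, d;
        memcpy(&s, src + i, sizeof(s));
        memcpy(&d, dst + i, sizeof(d));
        d |= s;
        memcpy(dst + i, &d, sizeof(d));
    }
}
```

The padding bytes are zero, so they leave the destination untouched; the destination only needs enough room for the rounded width, otherwise the glyph has to go through the clipping path.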


Gain is quite good:
on linux:
Size 61 shaded: instead of 42 us, it takes 27 us.
compared with the old version, this means: 71 us -> 27 us
Size 80 shaded: 40 -> 36
compared with the old version, this means: (also) 70 us -> 36 us

on android, even better
armv7a:
Size 61 shaded: 106us, => 57 us
Size 80 shaded: 201 us => 96 us

arm64:
Size 61 shaded: 84 us => 44 us
Size 80 shaded: 106 us => 73us

It doesn't change the output, since the metrics used during size calculation aren't changed; only the rendered glyph gets extra padding.
It doesn't change code complexity, since there is already a fallback to clip the glyph if it is out of the output surface.
Comment 16 Sylvain 2018-11-29 22:06:46 UTC
Created attachment 3514 [details]
bench logs for render_glyph_32
Comment 17 Sylvain 2018-12-03 10:12:20 UTC
Created attachment 3519 [details]
SDL_ttf.c

New version! I've added SSE2 and NEON intrinsics versions of Render_Glyph.
They work on unaligned memory (loadu, storeu on SSE);
only the glyph width is rounded, as before.

I'm not very familiar with SSE2 or NEON, but I got them working.
NEON doesn't seem to run faster, so I have left it commented out for now.

(there is probably room for improvement, with prefetching or other instructions...)

Also, this is now built with macros so the compiler knows how to optimize things.


Same metrics, on linux:
Size 61 shaded:
=> 25 us, with render_glyph_32
=> 22 us, with render_glyph_64
=> 23 us, with render_glyph sse2

Size 80:
=> 35 us, with render_glyph_32
=> 33 us, with render_glyph_64
=> 31 us, with render_glyph sse2


arm v7a (Render_Glyph_32):
Size 61 shaded: => 59
Size 80 shaded: => 114 (a little slower, but that might be down to the testing as well).

arm 64 (Render_Glyph_64):
Size 61 shaded: => 40 us
Size 80 shaded: => 70 us


Also fixed the allocation for non-scalable fonts when converting them:
(nonscalable/pvfixed_20b.pcf.gz)
For instance:
src->pixel_mode = 1 (MONO) src->width=2 src->pitch=4
Previously allocated, by doing pitch * 8:
dst->width=2 dst->pitch=32

src->pixel_mode = 1 (MONO) src->width=2 src->pitch=4
Now allocates:
dst->width=2 dst->pitch=8
Comment 18 Sylvain 2018-12-03 10:13:52 UTC
Created attachment 3520 [details]
bench logs for linux/android

Bench logs for previous version
Comment 19 Sylvain 2018-12-11 16:24:09 UTC
Created attachment 3541 [details]
SDL_ttf.c

Here's a new version

Spotted & fixed two new bugs:
- Blended mode: the line colour (underline/strikethrough) is always fully opaque when it should have the user alpha opacity value (colour fg.a).
  The issue is that it would display a very bright line, when the text is maybe at 10% brightness (depending on the chosen alpha).
- Shaded mode: the alpha palette is incorrect when, for instance, the colour alpha is 0: it is then set to OPAQUE (255) and disturbs the a-diff ratio.
  So the palette alpha is not regular from 0 to 255 and a text with 10% brightness is not correctly displayed: it looks like it has brightened edges.

A few additional optimisations: 

Shaded mode: optimise surface creation:
- division by 255 (x/255) can be optimised as
   (x + 1 + (x>>8))>>8 when positive
   (x + 255 + (x>>8))>>8 when negative
   (which is a little bit faster, and not negligible at small font sizes compared to the total time taken).
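
The non-negative form of this trick can be checked exhaustively. The sketch below covers only that form; the helper name div255 is made up here. The identity holds for 0 <= x <= 65534, which includes every product of two 8-bit values (at most 255 * 255 = 65025).

```c
#include <assert.h>
#include <stdint.h>

/* Division by 255 without a divide instruction.
 * Exact for 0 <= x <= 65534; it is off by one at x = 65535. */
static uint32_t div255(uint32_t x)
{
    assert(x <= 65534u);
    return (x + 1 + (x >> 8)) >> 8;
}
```

For example, div255(255 * 255) gives 255 and div255(254) gives 0, the same as the integer division.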

(NB: at small sizes, to get a realistic bench value, kerning must be turned off because its timing cost is too high).

Blended mode: optimise surface creation:
- provide memory buffer to avoid an extra memset of the full surface from SDL_CreateRGBSurface.

All Blit Glyph routines are now available as Duff's loop macros, and cleaned up.
(still not sure if this is worth enabling on android).

Blended mode: 

- alpha_table doesn't need to be computed as "color | (alpha << 24)" but only "alpha << 24" since the background is already filled.

- also distinguish the cases where Blended is opaque or not:
  opaque doesn't need an alpha_table lookup at all and can be done faster on the fly.

Optimised the SSE and NEON blending routines,
especially for opaque blending:
NEON can process 16 pixels at once using vzipq_u8 (interleave),
which produces a two-lane value as output (uint8x16x2_t).
Rendering Blended text is now twice as fast.
Comment 20 Sylvain 2018-12-11 16:39:28 UTC
Created attachment 3542 [details]
bench logs

Bench log with the SSE2 version:

Note that kerning (the call to FT_Get_Kerning()) takes ~8300 us in both logs.
Non-opaque means we have to use the alpha_table (as in all previous versions),
whereas opaque is done on the fly (the reason being that there is nothing to compute for this value).

previous (2 dec):

INFO: Size: 40 Shaded   Avg: 0.0206667 ms  Avg Perf:   19344 (min=  18192)
INFO: Size: 40 Solid    Avg:   0.019 ms    Avg Perf:   18297 (min=  17103)
INFO: Size: 40 Blended  Avg: 0.0776667 ms  Avg Perf:   76130 (min=  70774)

INFO: Size: 80 Shaded   Avg:   0.032 ms    Avg Perf:   31004 (min=  30057)
INFO: Size: 80 Solid    Avg: 0.0306667 ms  Avg Perf:   30274 (min=  28786)
INFO: Size: 80 Blended  Avg:    0.32 ms    Avg Perf:  319046 (min= 306633)

this:

INFO: Size: 40 Shaded   Avg: 0.0173333 ms  Avg Perf:   16609 (min=  15533)
INFO: Size: 40 Solid    Avg: 0.0166667 ms  Avg Perf:   16394 (min=  15191)
INFO: Size: 40 Blended  Avg:   0.043 ms    Avg Perf:   41764 (min=  39904) opaque
INFO: Size: 40 Blended  Avg:   0.048 ms    Avg Perf:   47534 (min=  45519) [non opaque]

INFO: Size: 80 Shaded   Avg: 0.0293333 ms  Avg Perf:   27837 (min=  26532)
INFO: Size: 80 Solid    Avg:   0.029 ms    Avg Perf:   27489 (min=  26067)
INFO: Size: 80 Blended  Avg:   0.131 ms    Avg Perf:  130562 (min= 127157) opaque
INFO: Size: 80 Blended  Avg:   0.148 ms    Avg Perf:  147344 (min= 144133) [non opaque]
Comment 21 Sylvain 2018-12-11 21:56:02 UTC
Created attachment 3543 [details]
SDL_ttf.c

New version, so that the SSE Blended Opaque function also reads Uint8 values 16 at a time, like the NEON one.

INFO: Size: 40 Blended  Avg:   0.031 ms  Avg Perf:   29897 (min=  29069)

INFO: Size: 80 Blended  Avg:   0.115 ms  Avg Perf:  114104 (min= 110306)
Comment 22 Sylvain 2018-12-12 17:11:24 UTC
Created attachment 3544 [details]
SDL_ttf.c

New version:

- Now SSE/NEON Blended Non-Opaque also computes the alpha on the fly and performs 10-20% better.
(and I removed the alpha_table).


- When blitting a glyph, 'srcskip' is not needed: we choose the width and pitch so that srcskip is 0 and can be removed, except when the glyph is clipped.
(For fixed fonts, the pitch may be larger for decoding, so a post-processing step may be needed to shrink them).

Some SSE2 bench (with kerning activated):

Size: 40 Blended  Avg: 0.0433333 ms  Avg Perf:   41884 (min=  40173) [non opaque]
Size: 80 Blended  Avg:     0.128 ms  Avg Perf:  127318 (min= 125139) [non opaque]
Comment 23 Sylvain 2018-12-18 15:24:53 UTC
Created attachment 3551 [details]
SDL_ttf.c

New version with:

- faster access to cached glyphs
- format decoding (mono/gray2/gray4) doesn't require a larger pitch.
- all accesses to the destination are now aligned:

SSE sees some small benefit for larger sizes (> 100)
NEON is much faster (50%), and also more stable (max value vs average).
Comment 24 Sylvain 2018-12-18 15:26:29 UTC
Created attachment 3552 [details]
bench SSE2

Also, it now depends on SDL's ability to free an aligned memory pointer.

attached: Bench SSE2
Comment 25 Sylvain 2018-12-18 15:26:59 UTC
Created attachment 3553 [details]
bench NEON

attached: bench NEON
Comment 26 Sylvain 2019-01-31 12:52:04 UTC
Pushed in https://hg.libsdl.org/SDL_ttf/rev/9f46efc0fde2

Also modified the public header: https://hg.libsdl.org/SDL_ttf/rev/7714b0b23bf3

     -int TTF_GlyphIsProvided(const TTF_Font *font, Uint16 ch)
     +int TTF_GlyphIsProvided(TTF_Font *font, Uint16 ch)