| Summary: | Refactoring and add missing Wrapped functions. | ||
|---|---|---|---|
| Product: | SDL_ttf | Reporter: | Sylvain <sylvain.becker> |
| Component: | misc | Assignee: | Sam Lantinga <slouken> |
| Status: | RESOLVED FIXED | QA Contact: | Sam Lantinga <slouken> |
| Severity: | normal | ||
| Priority: | P2 | ||
| Version: | unspecified | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Attachments: |
patch
patch SDL_ttf.c SDL_ttf.h Italic screenshot SDL_ttf.c bench app bench logs bench logs android (armeabi-v7a, arm64-v8a) SDL_ttf.c SDL_ttf.c bench logs for render_glyph_32 SDL_ttf.c bench logs for linux/android SDL_ttf.c bench logs SDL_ttf.c SDL_ttf.c SDL_ttf.c bench SSE2 bench NEON |
||
Created attachment 3447 [details]
patch
The patch is quite long, but the final code is 80 lines shorter (with more functions, more comments, and Duff loops).
Can you rebase this patch on current mercurial? Thanks!
Also, have you done performance testing to make sure this doesn't introduce regressions on various platforms?
Created attachment 3449 [details]
patch
patch updated.
I'll try some performance benchmarks.
A quick test on my PC, with clang and -O2, rendering the same string 50 times (the times include the 50 calls to SDL_FreeSurface).
SOLID/SHADED (end up using Render_Glyph):
- before the current patch: 22-24 ms
- with the patch and DUFF_LOOPS activated: 19-21 ms
- with the patch and DUFF_LOOPS not activated: also 22-24 ms
So it is a little faster thanks to the Duff loops.
BLENDED (ends up using Render_Glyph_Shaded):
- before the patch, or with the patch and no Duff loops: 59-64 ms
- with the patch and Duff loops: 65-70 ms
Here, we shouldn't activate the Duff loops. But I believe that on mobile it will be faster.
So I tried on an Android S7.
Only 1 string (multiple times the alphabet), at size 50, which ends up being a texture of 19380x59.
Not taking into account the SDL_FreeSurface.
Rendering in: (Blended, Shaded, Solid)
with code:
OLD: before this patch
DUFF_LOOPS: with this patch and duff loops
NO_DUFF: with this patch, and without the duff loops
Using freetype 2.9.1 (but it shouldn't matter since glyphs are rendered and cached).
Forcing -O2 in CFLAGS of SDL_ttf
Trying {arm,thumb}x{armeabi-v7a,arm64-v8a}
thumb: when setting: LOCAL_SRC_FILES := SDL_ttf.c
arm: when setting: LOCAL_SRC_FILES := SDL_ttf.c.arm
not sure if this has the same meaning in arm64
Haven't tried neon...
The current default SDL_ttf.c is thumb
I always took the best time of around 50 tries, because the CPU has some variability.
most of the time the best times are (6ms, 2ms, 1ms) for (Blended, Shaded, Solid).
except:
arm64 + DUFF_LOOPS where it's (5,2,1) (in arm)
arm64 + DUFF_LOOPS where it's (5,1,1) (in thumb)
armeabi-v7a + DUFF_LOOPS where it's (5,1,1) (in thumb)
armeabi-v7a + NO_DUFF_LOOPS where it's (6,1,1) (in thumb)
Which means that DUFF_LOOPS should be activated for this target.
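For readers unfamiliar with the DUFF_LOOPS variant benchmarked above, here is a minimal sketch of the idea: a row loop unrolled four at a time with Duff's device, next to the plain loop it replaces. The helper names `or_row_duff` and `or_row_plain` are hypothetical and this is not the SDL_ttf code; it only assumes the blit ORs source bytes into the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: OR `width` source bytes into dst, one byte per step. */
static void or_row_plain(uint8_t *dst, const uint8_t *src, size_t width)
{
    while (width-- > 0) {
        *dst++ |= *src++;
    }
}

/* Hypothetical helper: same result, unrolled by 4 with Duff's device.
 * The switch jumps into the middle of the loop body to handle the
 * remainder (width % 4) on the first pass. */
static void or_row_duff(uint8_t *dst, const uint8_t *src, size_t width)
{
    if (width == 0) {
        return;
    }
    assert(dst != NULL && src != NULL);
    size_t n = (width + 3) / 4;   /* number of passes through the loop */
    switch (width & 3) {
    case 0: do { *dst++ |= *src++;
    case 3:      *dst++ |= *src++;
    case 2:      *dst++ |= *src++;
    case 1:      *dst++ |= *src++;
            } while (--n > 0);
    }
}
```

Both functions produce the same bytes; only the number of loop-condition checks differs, which is why the gain depends so much on compiler and target.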
Created attachment 3503 [details]
SDL_ttf.c
Hey, a new version with more things:
- The current advance isn't accurate:
First, with kerning, because it rounds and then sums, instead of summing and then rounding, so we lose precision.
Second, because FT has improved the algorithm: it provides more precise algorithms with left and right side bearing error correction ({rsb,lsb}_deltas). One is named KERNING MODE SMART and the other one is about subpixel rendering.
Some doc: http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/include/freetype/freetype.h#n1815
Some code: http://git.savannah.gnu.org/cgit/freetype/freetype2-demos.git/tree/src/ftcommon.c#n1381
- Added the kerning mode smart. This is always enabled. From freetype.h:
* If you use strong auto-hinting, you **must** apply these delta values!
* Otherwise you will experience far too large inter-glyph spacing at
* small rendering sizes in most cases. Note that it doesn't harm to use
* the above code for other hinting modes also, since the delta values
* are zero then.
- Added subpixel text rendering, as it's only a matter of translating glyphs by less than 1px and using hinting 'light'. It's the modern way: letters look smoother and more uniformly spaced. This mode is ten times slower as no caching is possible, though it remains fast. To activate it, call TTF_SetFontHinting(font, TTF_HINTING_LIGHT_SUBPIXEL);
- Independently of the previous changes, there is an issue with the current SDL_ttf process: once FT has rendered the glyph, the metrics change (width/height), but so do the offsets where the bitmap should be copied. It can happen with subpixel, but also with italic, where the space between letters is sometimes totally wrong. Strictly speaking, this can happen even in normal mode. In fact, you only get the real position and size once the glyph is rasterized. So this is fixed, but we have to clip the glyph against the whole surface before trying to render. This is a kind of bounds check, at glyph level, not at rendering time, so it doesn't hurt perf.
- Since it becomes more complex, I have added a buffer to store positions between size() and render_line(). It's simpler, and the string is decoded only once. And if you add text shaping, it's also more convenient because render_line() remains the same.
- One more bug fix: Wrap() behaves badly with only one line that is unbreakable. It draws it full length. But in fact, it should clip to wrapLength (like it would do if this unbreakable line was in the middle of other lines).
I have tested this, and also compared the subpixel result with the FT demo programs to make sure it was ok. I have run some random tests (and will do more) and also some pixel-to-pixel comparisons (with textshaping).
No patch, but the full SDL_ttf.{c,h} files!
Created attachment 3504 [details]
SDL_ttf.h
header file
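The "rounds and sums instead of summing and rounding" precision loss mentioned for the advance can be illustrated with a small sketch. It assumes FreeType-style 26.6 fixed-point values (1/64 pixel units) and non-negative inputs; the function names are hypothetical and are not taken from SDL_ttf.c.

```c
#include <assert.h>
#include <stddef.h>

/* Round a non-negative 26.6 fixed-point value to whole pixels. */
static long round_26_6(long v)
{
    assert(v >= 0);
    return (v + 32) >> 6;
}

/* Rounds every advance first, then sums: the rounding error of each
 * glyph accumulates over the string. */
static long width_round_then_sum(const long *adv, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += round_26_6(adv[i]);
    }
    return total;
}

/* Sums in 26.6 precision and rounds once at the end. */
static long width_sum_then_round(const long *adv, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += adv[i];
    }
    return round_26_6(total);
}
```

With four advances of 660 (10.3125 px each), rounding first gives 40 px while summing first gives 41 px (41.25 px rounded), so the drift grows with the length of the string.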
Created attachment 3506 [details]
Italic screenshot
Here's the italic rendering issue fixed now.
Created attachment 3507 [details]
SDL_ttf.c
Here's a new version (again):
- Add a cache for FT_Get_Char_Index() (char -> index conversion, for 127 first ascii values)
Current head is slower at small sizes, because we call FT_Get_Char_Index() twice to convert a char to its index.
Once to access cache_metrics and once more to access cache_bitmap/pixmap.
This has been happening since the index is used as the cache key :(
This was indirectly fixed by the previous patch because we only called FT_Get_Char_Index() once and stored the result.
This is now improved further with the added cache, which is valid even after a style/hinting change.
- Change the minx/yoffset naming to left/top, so that it now matches the FreeType examples.
- Don't use DUFFS_LOOP...
From size 1 to 50 it is faster (this is what I tried before!), but beyond that it's much slower.
In fact the Duff loops are quite constant; it is the non-Duff loop which became faster:
We see that in all benches with no Duff loops:
Rendering at size 80 takes the same time as rendering at size 60.
And (not dumped in the next log) rendering at size 70 is faster than at size 60.
I believe we hit some kind of compiler optimisation at this size...
- Rewrote render_glyph a little and unrolled render_line, after various tries, to make it faster.
- Wrote a more precise benchmark with PerformanceCounter()
Now:
- at size 8, it's 2x faster (old 25000, head 41000, new 13000)
- at size 100, it's 1.5x faster (old 106000, head 98000, new 61000)
Also some subpixel benchmark (only on the new version)
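The char -> index cache described in this comment could look roughly like the sketch below. Everything here is hypothetical: the struct, the function names, the 128-entry size, and `fake_lookup`, which merely stands in for FT_Get_Char_Index() so the example is self-contained. A separate `valid` flag is used because 0 is itself a legitimate result (FreeType's "missing glyph" index).

```c
#include <stddef.h>
#include <stdint.h>

#define ASCII_CACHE_SIZE 128   /* assumption: one slot per 7-bit ASCII code */

/* Signature of the slow lookup being cached (stand-in for FT_Get_Char_Index). */
typedef uint32_t (*index_lookup_fn)(void *face, uint32_t ch);

typedef struct {
    uint32_t index[ASCII_CACHE_SIZE];
    uint8_t  valid[ASCII_CACHE_SIZE];  /* 0 = slot not filled yet */
} ascii_index_cache;

/* Return the glyph index for ch, calling lookup() at most once per ASCII code.
 * Codes outside the ASCII range always go through lookup(). */
static uint32_t cached_char_index(ascii_index_cache *c, index_lookup_fn lookup,
                                  void *face, uint32_t ch)
{
    if (ch < ASCII_CACHE_SIZE) {
        if (!c->valid[ch]) {
            c->index[ch] = lookup(face, ch);
            c->valid[ch] = 1;
        }
        return c->index[ch];
    }
    return lookup(face, ch);
}

/* Fake lookup used only to demonstrate the cache: counts its calls. */
static int fake_lookup_calls = 0;
static uint32_t fake_lookup(void *face, uint32_t ch)
{
    (void)face;
    fake_lookup_calls++;
    return ch + 100;
}
```

Because the mapping depends only on the character code and the face, such a cache does not need to be flushed when the style or hinting changes, which matches the remark above.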
Created attachment 3508 [details]
bench app
Small bench app for various size / modes
Created attachment 3509 [details]
bench logs
bench outputs:
old: old version (~3 months ago, before I started adding bugs).
currentHead: current head
new: this previous SDL_ttf.c
new_With_Duff_Loops: this previous SDL_ttf.c if you add Duffs Loop. (faster up to size 50, slow after...)
Created attachment 3510 [details]
bench logs android (armeabi-v7a, arm64-v8a)
Previous Bench was on linux i7-3610QM CPU @ 2.30GHz.
Those ones are on samsung s7.
- on arm64-v8a:
new version is better than old version.
starting at size 60-70-80, same phenomenon, new-no-duffs-loop is clearly better than new-with-duffs-loop
- on armeabi-v7a: at size 70-80, very little better for duffs loop.
So in the end, because of this effect, we shouldn't enable this USE DUFFS LOOP.
on android, the practical usual sizes are 30 to 90.
NB:
- v7a, int64 comparison is not working, need to use "clock_gettime(CLOCK_REALTIME, &res);" instead of PerformanceCounter
- all Android benches: there is high variation within the same run.
Created attachment 3511 [details]
SDL_ttf.c
New version with a little adjustment for Outline Style so that it remains centred:
@@ -1002,6 +1002,7 @@
int fo = font->outline_val;
cached->sz_width += 2 * fo;
cached->sz_rows += 2 * fo;
+ cached->sz_left -= fo;
}
Created attachment 3513 [details]
SDL_ttf.c
A new version again with two changes:
1)
For Blended, pre-compute alpha_table, so that instead of doing
*dst++ |= pixel | ((Uint32)alpha_table[alpha] << 24);
It can be rewritten as:
*dst++ |= alpha_table[alpha];
A few percent of improvement.
2)
For Solid/Shaded, we can round the glyph width up ('ceil') to a multiple of the integer size and copy faster with 32-bit or 64-bit instructions.
(64 bits fails on Android armv7a, so only 32 bits is activated).
Gain is quite good:
on linux:
Size 61 shaded: instead of 42 us, it takes 27 us.
from old versions, it means: 71 us -> 27 us
Size 80 shaded: 40 -> 36
from old versions, it means: (also)70 us -> 36 us
on android, even better
armv7a:
Size 61 shaded: 106us, => 57 us
Size 80 shaded: 201 us => 96 us
arm64:
Size 61 shaded: 84 us => 44 us
Size 80 shaded: 106 us => 73us
It doesn't change the output, since the metrics used during size calculation aren't changed; only the rendered glyph gets extra padding.
It doesn't change code complexity, since there is already a fallback to clip the glyph if it is out of the output surface.
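As an illustration of change 1), a precomputed table can fold the pixel colour and the alpha shift into a single lookup. This is only a sketch under assumptions: `alpha_map` is a hypothetical coverage-to-alpha table, `pixel` is assumed to hold the colour with a zero alpha byte (0x00RRGGBB), and the function names do not come from SDL_ttf.c.

```c
#include <stdint.h>

/* Build table[i] = pixel | (alpha_map[i] << 24) once per render call,
 * so the inner loop needs one lookup instead of a shift and an OR. */
static void build_pixel_table(uint32_t table[256],
                              const uint8_t alpha_map[256],
                              uint32_t pixel)
{
    for (int i = 0; i < 256; i++) {
        table[i] = pixel | ((uint32_t)alpha_map[i] << 24);
    }
}

/* Blit one row of 8-bit coverage values into a 32-bit destination row. */
static void blend_row(uint32_t *dst, const uint8_t *src, int width,
                      const uint32_t table[256])
{
    while (width-- > 0) {
        *dst++ |= table[*src++];
    }
}
```

The table costs 256 iterations up front, so the gain only shows once the rendered text has clearly more than 256 pixels, which is consistent with the modest "few percent" reported.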
Created attachment 3514 [details]
bench logs for render_glyph_32
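A rough sketch of the 32-bit copy idea behind these logs: the glyph width is rounded up to a multiple of 4 bytes and the row is processed one 32-bit word at a time. The names are hypothetical, and the sketch assumes the padding bytes of the glyph are zero, so that OR-ing them leaves the destination untouched. `memcpy` is used for the loads and stores so the code stays valid on unaligned buffers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round a width in bytes up to the next multiple of 4. */
static int pad_width_to_4(int width)
{
    return (width + 3) & ~3;
}

/* OR one glyph row into dst, 4 bytes per step.
 * Both buffers must hold at least padded_width bytes. */
static void or_row_32(uint8_t *dst, const uint8_t *src, int padded_width)
{
    assert(padded_width % 4 == 0);
    for (int i = 0; i < padded_width; i += 4) {
        uint32_t d, s;
        memcpy(&d, dst + i, sizeof d);   /* unaligned-safe load */
        memcpy(&s, src + i, sizeof s);
        d |= s;
        memcpy(dst + i, &d, sizeof d);   /* unaligned-safe store */
    }
}
```

For a 5-pixel-wide glyph the row is processed as 8 bytes in two steps instead of five, and the three zero padding bytes do not alter the destination.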
Created attachment 3519 [details]
SDL_ttf.c
New version! I've added SSE2 and NEON intrinsics versions of Render_Glyph.
They work on unaligned memory (loadu, storeu on SSE),
only the glyph width is rounded, as before.
I'm not very familiar with SSE2 or NEON, but I got them working.
NEON doesn't seem to run faster, so I have left it commented out.
(there is probably room for improvement, using prefetch or other instructions...)
Also, this is now built with macros so the compiler knows how to optimize things.
Same metrics, on linux:
Size 61 shaded:
=> 25 us, with render_glyph_32
=> 22 us, with render_glyph_64
=> 23 us, with render_glyph sse2
Size 80:
=> 35 us, with render_glyph_32
=> 33 us, with render_glyph_64
=> 31 us, with render_glyph sse2
arm v7a (Render_Glyph_32):
Size 61 shaded: => 59
Size 80 shaded: => 114 ( a little slower but might be the testing as well).
arm 64 (Render_Glyph_64):
Size 61 shaded: => 40 us
Size 80 shaded: => 70 us
Also fixed the allocation for non-scalable fonts when converting them:
(nonscalable/pvfixed_20b.pcf.gz)
For instance:
src->pixel_mode = 1 (MONO) src->width=2 src->pitch=4
Would allocate, by doing pitch * 8 :
dst->width = 2, dst->pitch=32
src->pixel_mode = 1 (MONO) src->width=2 src->pitch=4
Now allocate:
dst->width=2 dst->pitch=8
Created attachment 3520 [details]
bench logs for linux/android
Bench logs for previous version
Created attachment 3541 [details]
SDL_ttf.c
Here's a new version
Spotted and fixed two new bugs:
- Blended mode: the line colour (underline/strikethrough) is always fully opaque when it should use the user's alpha opacity value (colour fg.a).
The issue is that it would display a very bright line when the text is at maybe 10% brightness (depending on the chosen alpha).
- Shaded mode: the alpha palette is incorrect: when, for instance, the colour alpha is 0, it is set to OPAQUE (255), which disturbs the a-diff ratio.
So the palette alpha is not regular from 0 to 255 and a text with 10% brightness is not correctly displayed: it looks like it has brightened edges.
A few additional optimisations:
Shaded mode: optimise surface creation:
- divide by 255 (x/255) can be optimised as
(x + 1 + (x>>8))>>8, when positive
(x + 255 + (x>>8))>>8 when negative
(which is a little bit faster, and not negligible at small font sizes, compared to the total time taken).
(NB: at small sizes, to get a realistic bench value, kerning must be turned off because its timing cost is too high).
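The positive-case divide-by-255 trick quoted above can be checked exhaustively. The sketch below is not the SDL_ttf code; `div255` is a hypothetical name, and the stated input range (0 to 65534, which covers any product of two 8-bit values) is the range over which the identity holds.

```c
#include <assert.h>
#include <stdint.h>

/* Exact x / 255 for 0 <= x <= 65534, using shifts and adds only.
 * This covers alpha products such as a * b with a, b in [0, 255]. */
static uint32_t div255(uint32_t x)
{
    assert(x <= 65534u);
    return (x + 1 + (x >> 8)) >> 8;
}
```

For example, `div255(255 * 255)` yields 255 and `div255(254)` yields 0, the same as integer division by 255.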
Blended mode: optimise surface creation:
- provide memory buffer to avoid an extra memset of the full surface from SDL_CreateRGBSurface.
All Blit Glyph routines are now available in duffs loop macro, and cleaned-up.
(still not sure if this is worth enabling on android).
Blended mode:
- alpha_table doesn't need to be computed as "color | (alpha << 24)" but only "alpha << 24" since the background is already filled.
- also add a separate case depending on whether Blended is opaque or not:
opaque doesn't need to look up alpha_table at all and can be done faster on the fly.
Optimise Blending routines SSE and NEON.
Especially in opaque blending:
NEON can process 16 pixels at once using vzipq_u8 (interleave)
that produces a two lanes value in output (uint8x16x2_t).
Blended text rendering is now twice as fast.
Created attachment 3542 [details]
bench logs
Bench log with SSE2 version:
Note that kerning (the call to FT_Get_Kerning()) takes ~8300 us in both logs.
non-opaque means we have to use the alpha_table (as in all previous versions),
whereas opaque is done on the fly (the reason is that there is nothing to compute for this value).
previous (2 dec):
INFO: Size: 40 Shaded Avg: 0.0206667 ms Avg Perf: 19344 (min= 18192)
INFO: Size: 40 Solid Avg: 0.019 ms Avg Perf: 18297 (min= 17103)
INFO: Size: 40 Blended Avg: 0.0776667 ms Avg Perf: 76130 (min= 70774)
INFO: Size: 80 Shaded Avg: 0.032 ms Avg Perf: 31004 (min= 30057)
INFO: Size: 80 Solid Avg: 0.0306667 ms Avg Perf: 30274 (min= 28786)
INFO: Size: 80 Blended Avg: 0.32 ms Avg Perf: 319046 (min= 306633)
this:
INFO: Size: 40 Shaded Avg: 0.0173333 ms Avg Perf: 16609 (min= 15533)
INFO: Size: 40 Solid Avg: 0.0166667 ms Avg Perf: 16394 (min= 15191)
INFO: Size: 40 Blended Avg: 0.043 ms Avg Perf: 41764 (min= 39904) opaque
INFO: Size: 40 Blended Avg: 0.048 ms Avg Perf: 47534 (min= 45519) [non opaque]
INFO: Size: 80 Shaded Avg: 0.0293333 ms Avg Perf: 27837 (min= 26532)
INFO: Size: 80 Solid Avg: 0.029 ms Avg Perf: 27489 (min= 26067)
INFO: Size: 80 Blended Avg: 0.131 ms Avg Perf: 130562 (min= 127157) opaque
INFO: Size: 80 Blended Avg: 0.148 ms Avg Perf: 147344 (min= 144133) [non opaque]
Created attachment 3543 [details]
SDL_ttf.c
New version, so that the SSE Blended Opaque function also reads Uint8 values 16 at a time, like the NEON one.
INFO: Size: 40 Blended Avg: 0.031 ms Avg Perf: 29897 (min= 29069)
INFO: Size: 80 Blended Avg: 0.115 ms Avg Perf: 114104 (min= 110306)
Created attachment 3544 [details]
SDL_ttf.c
New version:
- Now SSE/NEON Blended Non-Opaque also computes the alpha table on the fly and performs 10-20% better.
(and I removed the alpha_table).
- When blitting a glyph, 'srcskip' is not needed: we choose the width and pitch so that srcskip is 0 and can be removed. Except for the case when the glyph is clipped.
(For fixed Font, the pitch may be larger for decoding, so one post-processing step can be needed to shrink them).
Some SSE2 bench (with kerning activated):
Size: 40 Blended Avg: 0.0433333 ms Avg Perf: 41884 (min= 40173) [non opaque]
Size: 80 Blended Avg: 0.128 ms Avg Perf: 127318 (min= 125139) [non opaque]
Created attachment 3551 [details]
SDL_ttf.c
New version with:
- faster access to cached glyphs
- format decoding (mono/gray2/gray4) doesn't require a larger pitch.
- all accesses to the destination are now aligned:
SSE sees a small benefit for larger sizes (> 100)
NEON is much faster (50%), and also more stable (max value vs average).
Created attachment 3552 [details]
bench SSE2
Also, it now depends on SDL capabilities to free an aligned memory pointer.
attached: Bench SSE2
Created attachment 3553 [details]
bench NEON
attached: bench NEON
Pushed in https://hg.libsdl.org/SDL_ttf/rev/9f46efc0fde2
Also modify public header: https://hg.libsdl.org/SDL_ttf/rev/7714b0b23bf3
-int TTF_GlyphIsProvided(const TTF_Font *font, Uint16 ch)
+int TTF_GlyphIsProvided(TTF_Font *font, Uint16 ch) |
Factorize the rendering functions by adding primitives Render_Line(), Render_Glyph(), Render_Glyph_Blended(), Create_Surface_{Solid,Blended,Shaded}().
Now TTF_Render_Wrapped() is something like:
legacy split lines code
foreach lines { Render_Line(); }
Lots of diff, but no behaviour change and no metrics changes.
With this, it has now been easy to add:
- 6 missing functions: TTF_Render{UTF8,TEXT,UNICODE}_{Shaded,Solid}_Wrapped()
- Add and enable USE_DUFF_LOOP for Render_Glyph() and Render_Glyph_Blended(), to expect a slightly faster blit on mobile platforms.
Since lots of functions require a conversion to UTF-8 ({unicode,text}*{solid,blended,shaded}*{normal,wrapped} == 12 functions), this has been moved into TTF_Size_Internal(), TTF_Render_Internal() and TTF_Render_Wrapped_Internal(), while still being stack allocated.
Other minor changes:
- Find_Glyph() gives back the glyph and either glyph->pixmap or glyph->bitmap depending on the data requested.
- Rename font->outline as it somehow "shadows" "FT_Outline outline".
- font->style isn't correctly initialized: it doesn't add the font_face style the first time. Rewrote this more clearly so font->style are exactly the styles handled by SDL_ttf.
- Set to upper case "ft_render_mode_normal", "ft_kerning_default", "ft_render_mode_mono". freetype.h says those lower-case types are deprecated.
- Check that the TTF_Font input pointer is not null.