Does x264 not use SSE2 and SSE3?
kousin@2008-10-17 12:17
CPU-Z detection info:
Processor information
------------------------------------------------------------------------------------
Processor 1 (ID = 0)
Number of cores 2 (max 2)
Number of threads 2 (max 2)
Name Intel Core Duo T2250
Codename Yonah
Specification Genuine Intel(R) CPU T2250 @ 1.73GHz
Package Socket 479 mPGA (platform ID = 5h)
CPUID 6.E.8
Extended CPUID 6.E
Core stepping C0
Technology 65 nm
Core speed 1729.1 MHz (13.0 x 133.0 MHz)
Rated bus speed 532.0 MHz
Stock frequency 1733 MHz
Instruction sets MMX, SSE, SSE2, SSE3
x264 log:
x264 [info]: using cpu capabilities: MMX2 Cache64
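I have tried builds 998, 997, 977 and others, with both MeGUI and lmx264gui as the GUI, and the result is always the same: only MMX2. The SSE2 build of eMule runs fine. So what is going on with x264...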
superkidx@2008-10-17 12:20
many changes to which asm functions are enabled on which cpus.
with Phenom, 3dnow is no longer equivalent to "sse2 is slow", so make a new flag for that.
some sse2 functions are useful only on Core2 and Phenom, so make a "sse2 is fast" flag for that.
some ssse3 instructions didn't become useful until Penryn, so yet another flag.
disable sse2 completely on Pentium M and Core1, because it's uniformly slower than mmx.
enable some sse2 functions on Athlon64 that always were faster and we just didn't notice.
remove mc_luma_sse3, because the only cpu that has lddqu (namely Pentium 4D) doesn't have "sse2 is fast".
don't print mmx1, sse1, nor 3dnow in the detected cpuflags, since we don't really have any such functions. likewise don't print sse3 unless it's used (Pentium 4D).
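
To make the policy in this commit message concrete, here is a minimal C sketch of the idea that the instruction sets a CPU reports and the ones an encoder chooses to use are two different things. The flag names, the `uarch_t` classes and `usable_flags` are all hypothetical, invented for this illustration; they are not x264's actual constants or functions, and the mapping only follows what the message above and the logs in this thread suggest.

```c
#include <stdint.h>

/* Hypothetical flag bits for this sketch; NOT x264's real constants. */
enum {
    FEAT_MMX2      = 1 << 0,
    FEAT_SSE2      = 1 << 1,
    FEAT_SSE3      = 1 << 2,
    FEAT_SSE2_SLOW = 1 << 3,
    FEAT_SSE2_FAST = 1 << 4
};

/* Hypothetical coarse CPU classes, named after the groups in the message. */
typedef enum {
    UARCH_PENTIUM_M_OR_CORE1,
    UARCH_ATHLON64,
    UARCH_CORE2_OR_PHENOM
} uarch_t;

/* Map the flags the CPU reports to the flags the encoder will act on. */
static uint32_t usable_flags(uint32_t detected, uarch_t uarch)
{
    /* SSE3 is dropped here: per the message it is only reported when used. */
    uint32_t flags = detected & FEAT_MMX2;

    if (!(detected & FEAT_SSE2))
        return flags;

    switch (uarch) {
    case UARCH_PENTIUM_M_OR_CORE1:
        /* SSE2 is present but disabled: "uniformly slower than mmx". */
        break;
    case UARCH_ATHLON64:
        flags |= FEAT_SSE2 | FEAT_SSE2_SLOW;
        break;
    case UARCH_CORE2_OR_PHENOM:
        flags |= FEAT_SSE2 | FEAT_SSE2_FAST;
        break;
    }
    return flags;
}
```

Under these assumptions, a Yonah that reports MMX2, SSE2 and SSE3 ends up with MMX2 only, which is consistent with the log in the first post.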
ZhenGod@2008-10-17 13:36
avis [info]: 1920x1080 @ 23.98 fps (104256 frames)
x264 [info]: using SAR=1/1
x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 PHADD SSE4 Cache64
kousin@2008-10-17 14:41
Showing off your CPU, drop dead :mad:
roozhou@2008-10-17 14:49
Whoa (sweat)... PHADD
simonfishx@2008-10-21 16:15
A timid question:
what instruction set is PHADD?
superkidx@2008-10-21 20:30
Surprisingly, searching Baidu for PHADD turns up hardly anything...
qyqgpower@2008-10-21 21:14
It is an SSSE3 instruction, but it is only reasonably fast on Penryn, so only Penryn uses the PHADD-based SATD
+ if( cpu&X264_CPU_SSE4 )
+ {
+ // enabled on Penryn, but slower on Conroe
+ INIT5( satd, _ssse3_phadd );
+ INIT5( satd_x3, _ssse3_phadd );
+ INIT5( satd_x4, _ssse3_phadd );
+ }
+; phaddw is used only in 4x4 hadamard, because in 8x8 it's slower:
+; even on Penryn, phaddw has latency 3 while paddw and punpck* have 1.
+; 4x4 is special in that 4x4 transpose in xmmregs takes extra munging,
+; whereas phaddw-based transform doesn't care what order the coefs end up in.
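
For readers who have not met SATD: it is the sum of the absolute values of the Hadamard-transformed difference between two blocks. Below is a plain scalar C sketch of the 4x4 case, written for illustration only; `satd_4x4_ref` is a hypothetical name and this is not x264's code. The final halving is a normalization chosen for this sketch. Since only absolute values are summed, the order in which the coefficients come out does not matter, which is the property the asm comment above relies on.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference sketch of a 4x4 SATD (hypothetical helper, not x264's). */
static int satd_4x4_ref(const uint8_t *p1, int stride1,
                        const uint8_t *p2, int stride2)
{
    int d[4][4], t[4][4];
    int sum = 0;

    /* Pixel differences between the two blocks. */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y][x] = p1[y * stride1 + x] - p2[y * stride2 + x];

    /* Horizontal 4-point Hadamard butterflies on each row. */
    for (int y = 0; y < 4; y++) {
        int s01 = d[y][0] + d[y][1], d01 = d[y][0] - d[y][1];
        int s23 = d[y][2] + d[y][3], d23 = d[y][2] - d[y][3];
        t[y][0] = s01 + s23;
        t[y][1] = s01 - s23;
        t[y][2] = d01 + d23;
        t[y][3] = d01 - d23;
    }

    /* Vertical butterflies on each column, accumulating absolute values.
       Coefficient order is irrelevant because everything is summed. */
    for (int x = 0; x < 4; x++) {
        int s01 = t[0][x] + t[1][x], d01 = t[0][x] - t[1][x];
        int s23 = t[2][x] + t[3][x], d23 = t[2][x] - t[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }

    return sum / 2; /* normalization assumed for this sketch */
}
```

Two identical blocks give 0, and two flat blocks that differ by 1 in every pixel give 8, because all the energy lands in the single DC coefficient.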
roozhou@2008-10-21 23:25
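phadd is a packed horizontal add instruction. phaddsw a, b is roughly equivalent to the C code a[0]=a[0]+a[1], a[1]=a[2]+a[3], a[2]=b[0]+b[1], a[3]=b[2]+b[3] (with signed saturation, on the four 16-bit words of an MMX register). It seems quite useful, but the 3-cycle latency makes code around it harder to write.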
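
The C equivalence described in this post can be spelled out as a small scalar emulation. This is only a sketch of the 64-bit MMX form of `phaddsw` (four signed 16-bit lanes per operand); `sat16` and `phaddsw_mmx` are hypothetical helper names for this illustration, not intrinsics or library functions.

```c
#include <stdint.h>

/* Clamp a 32-bit sum to the signed 16-bit range (signed saturation). */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar emulation of the MMX form of "phaddsw a, b": the result is
   written back to a. A temporary is used so the function also behaves
   correctly when a and b are the same array. */
static void phaddsw_mmx(int16_t a[4], const int16_t b[4])
{
    int16_t r[4];
    r[0] = sat16((int32_t)a[0] + a[1]);
    r[1] = sat16((int32_t)a[2] + a[3]);
    r[2] = sat16((int32_t)b[0] + b[1]);
    r[3] = sat16((int32_t)b[2] + b[3]);
    for (int i = 0; i < 4; i++)
        a[i] = r[i];
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30000, 30000}, a becomes {3, 7, 30, 32767}: the last lane saturates instead of wrapping. The non-saturating `phaddw` would wrap in that lane.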
tendyang@2008-10-26 22:19
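Frankly, from SSE1 through SSE4 there has been no revolutionary progress. Unless your data structures happen to be a special fit, the average speedup is only a few times. If you cannot adapt your data structures and control flow to suit SSE, four multiplies cannot even beat floating point in time. Another hateful thing is the scarcity of registers...
In decades there has not even been a MAC instruction, which shows they never gave any thought to optimizing for DSP.
Is SSE only meant for speeding up payroll reports? Or is it historical baggage, or cost??
I don't get it. I guess the big shots at Intel and AMD figure we will never need it.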
roozhou@2008-10-26 23:09
Four multiplies cannot beat floating point? Please show evidence. On Intel CPUs at least, that is absolutely impossible.
srta@2008-11-17 00:14
X2 5000+
x264 [info]: using cpu capabilities: MMX2 SSE2Slow
Isn't that even more brain-dead..........~ How did that poster above get so many instruction sets in use?