As some of you might already know, work is under way to make GCC fully support Intel Atom architecture specifics, i.e. to make -mtune=atom generate code optimized for in-order architectures like Intel Atom [1].
I therefore put together a small patch which adds Intel Atom as a new processor family that can be selected at configuration time. It's nothing special and it also requires a patched GCC. I'd just like to get some feedback on it, e.g. is X86_L1_CACHE_SHIFT=6 OK for Atom CPUs (I was not able to find any information on Atom's cacheline size)? Is there any chance of including this patch once the Atom patch has gone into GCC mainline (probably in GCC 4.5)? Any other objections?
> From 6aa86b4431619d38849d469c70904afe1e5a8ca0 Mon Sep 17 00:00:00 2001
> From: Tobias Doerffel <tobias.doerf...@gmail.com>
> Date: Thu, 30 Apr 2009 12:36:46 +0200
> Subject: [PATCH] x86: add specific support for Intel Atom architecture
>
> This adds another option when selecting CPU family so the kernel can
> be optimized for Intel Atom CPUs. This patch requires a GCC with a
> patch applied which adds specific Intel Atom support.
> ---
>  arch/x86/Kconfig.cpu          |   19 ++++++++++++++-----
>  arch/x86/Makefile_32.cpu      |    1 +
>  arch/x86/include/asm/module.h |    2 ++
>  3 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
> index 8130334..8e565b7 100644
> --- a/arch/x86/Kconfig.cpu
> +++ b/arch/x86/Kconfig.cpu
> @@ -262,6 +262,15 @@ config MCORE2
>      family in /proc/cpuinfo. Newer ones have 6 and older ones 15
>      (not a typo)
>
> +config MATOM
> +  bool "Intel Atom"
> +  depends on X86_32
> +  ---help---
> +
> +    Select this for Intel Atom platform. Intel Atom CPUs have an in-order
> +    pipelining architecture and thus can benefit from in-order optimized
> +    code (requires Intel Atom patch in GCC).
> +
>  config GENERIC_CPU
>    bool "Generic-x86-64"
>    depends on X86_64
> @@ -310,7 +319,7 @@ config X86_L1_CACHE_SHIFT
>    default "7" if MPENTIUM4 || MPSC
>    default "4" if X86_ELAN || M486 || M386 || MGEODEGX1
>    default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
> - default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7 || X86_GENERIC || GENERIC_CPU
> + default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
> Makes sense. One question would be X86_L1_CACHE_SHIFT - you set it
> to 2^6 == 64 - that's correct I think, most Atoms come with 64-byte
> cache lines AFAIK.
>
> I've Cc:-ed Intel folks - is this assumption about 64 bytes correct?
Seems to be. At least that's what CPUID reports.
-hpa
-- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
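For anyone who wants to repeat the CPUID check on their own machine, here is a minimal userspace sketch. It assumes GCC or Clang on x86 and uses __get_cpuid() from <cpuid.h>; CPUID leaf 1 reports the CLFLUSH line size in EBX bits 15:8, in units of 8 bytes, which normally matches the cache line size. The helper names are made up for illustration and none of this is kernel code.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Decode CPUID.01H:EBX[15:8], the CLFLUSH line size in 8-byte units. */
unsigned int clflush_line_bytes(uint32_t ebx)
{
	return ((ebx >> 8) & 0xffu) * 8u;
}

/* log2 of a power-of-two line size, i.e. the matching X86_L1_CACHE_SHIFT. */
unsigned int cache_shift(unsigned int line_bytes)
{
	unsigned int shift = 0;

	assert(line_bytes != 0 && (line_bytes & (line_bytes - 1)) == 0);
	while ((1u << shift) < line_bytes)
		shift++;
	return shift;
}

/* Line size reported by the running CPU, or 0 if CPUID leaf 1 is unavailable. */
unsigned int query_line_bytes(void)
{
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return clflush_line_bytes(ebx);
#endif
	return 0;
}
```

A caller would print query_line_bytes() and, if it is non-zero, cache_shift() of that value; a 64-byte line corresponds to a shift of 6.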
> > There should be a fallback option used here rather than requiring a
> > new gcc, e.g. something like:
> >
> >   $(call cc-option,-march=atom,-march=i686)
>
> if it's an in-order architecture, wouldn't it be better to tune for
> i386 or i486 instead ?
-march isn't about tuning, it's about supported instructions. The right line is

  $(call cc-option,-march=atom,-march=core2)

For tuning, our experience is that currently -mtune=generic works best. Not sure yet about the GCCs that have complete Atom tuning support.

Please don't do something like "oh, it's in-order, so was the Pentium, so let's use that"; it actually gives really, really bad results.
-- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> > > There should be a fallback option used here rather than requiring a
> > > new gcc, e.g. something like:
> > >
> > >   $(call cc-option,-march=atom,-march=i686)
> >
> > if it's an in-order architecture, wouldn't it be better to tune for
> > i386 or i486 instead ?
>
> -march isn't about tuning, it's about supported instructions.
agreed, but unless specified otherwise using -mtune, -march also sets default tuning for the indicated CPU. At least in my experience.
> The right line is
>   $(call cc-option,-march=atom,-march=core2)
OK thanks.
> For tuning, our experience is that currently -mtune=generic works best.
OK.
> Not sure about the gcc's that have complete atom tuning support yet.
> Please don't do something like "oh it's in order, so was the Pentium,
> so lets use that"; it actually gives really really bad results.
I know. I was not thinking about tuning for an "advanced" CPU such as the Pentium, but rather for something generic, hence my proposal of i486 or i386. I did not know about the "generic" target. In my experience, tuning for i386/i486 often gives the best overall performance on recent CPUs such as Core 2. I should try "generic" to compare.
>>>> $(call cc-option,-march=atom,-march=i686)
>>> if it's an in-order architecture, wouldn't it be better to tune for
>>> i386 or i486 instead ?
>> -march isn't about tuning, it's about supported instructions.
>
> agreed, but unless specified otherwise using -mtune, -march also sets
> default tuning for the indicated CPU. At least in my experience.
>
>> The right line is
>>   $(call cc-option,-march=atom,-march=core2)
For really old gcc's (we support all the way back to gcc 3.2 still) -march=core2 might not work either.
-hpa
> as some of you already might know, work is going on to make GCC fully support
> Intel Atom architecture specifics, i.e. make -mtune=atom generate code
> optimized for in-order architectures like Intel Atom [1].
>
> I therefore started to make up a small patch which adds Intel Atom as a new
> processor family which can be selected upon configuration. It's nothing
> special and also requires a patched GCC. I'd just like to get some feedback on
> it, i.e. is X86_L1_CACHE_SHIFT=6 ok for Atom CPUs (I was not able to find any
> information on Atom's cacheline size)?

64 bytes.
> Any chance to include this patch once
> the Atom patch went into GCC mainline (probably in GCC 4.5)? Any other
atom support already went into gcc mainline.
> objections?
> Please Cc me, I'm not on the list.
FWIW I have a similar patch, but I haven't submitted it yet due to lack of benchmark numbers.
Some comments on yours.
> diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
> index 8130334..8e565b7 100644
> --- a/arch/x86/Kconfig.cpu
> +++ b/arch/x86/Kconfig.cpu
> @@ -262,6 +262,15 @@ config MCORE2
>      family in /proc/cpuinfo. Newer ones have 6 and older ones 15
>      (not a typo)
I don't think that's necessarily a good idea. You would need benchmarks showing that intel user copy performs better on Atom than the original one. Do you have some?
This should be obsolete anyway; you can just use CORE2. They have compatible ISAs.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
> > On Sunday, 3 May 2009 08:48:54, H. Peter Anvin wrote:
> > > Willy Tarreau wrote:
> > > >> $(call cc-option,-march=atom,-march=i686)
> > > >
> > > > if it's an in-order architecture, wouldn't it be better to tune
> > > > for i386 or i486 instead ?
> > >
> > > Possibly. It would be worth measuring.
> >
> > How would one do that (never benchmarked kernel stuff before)?
>
> A standard method is to run lmbench and compare the results -
> lmbench has a built-in 'report comparison between two runs' feature.
well... you're normally REALLY hard pressed to measure compiler differences this way.....
normally compiler options get benchmarked using speccpu and the like....
--
Arjan van de Ven
Intel Open Source Technology Centre
> well... you're normally REALLY hard pressed to measure compiler
> differences this way.....
>
> normally compiler options get benchmarked using speccpu and the
> like....
Well, if there's no measurable difference in lmbench at all then the options probably don't matter that much. If some workload is found where compiler options show a difference, then that matters. Speccpu only matters if those compiler options also help the kernel in a measurable way.
> I don't think that's necessarily a good idea. You would need benchmarks
> showing that intel user copy performs better on Atom than the original one.
> Do you have some?
You're right here. I made some quick benchmarks of __copy_user[_intel[_nocache]]() and __copy_zeroing[_intel[_nocache]]() in userspace and the generic ones indeed were about 15% faster.
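A userspace comparison of this kind can be sketched roughly as follows. This is not the harness used for the numbers above: the two copy routines are plain stand-ins (a byte loop and the C library's memcpy) for whichever implementations get lifted out of the kernel, all names are hypothetical, and the empty asm statement is a GCC/Clang-specific compiler barrier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t len);

/* Stand-in candidate 1: plain byte loop. */
void copy_bytewise(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len--)
		*d++ = *s++;
}

/* Stand-in candidate 2: whatever the C library's memcpy does. */
void copy_libc(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Run `iters` copies of `len` bytes and return the elapsed CPU time in seconds. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t len,
		 unsigned long iters)
{
	clock_t start, end;
	unsigned long i;

	assert(fn && dst && src);
	start = clock();
	for (i = 0; i < iters; i++) {
		fn(dst, src, len);
		/* Compiler barrier so the repeated copies are not optimized away. */
		__asm__ __volatile__("" : : "r"(dst) : "memory");
	}
	end = clock();
	return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy() for each candidate with the same buffers, over several buffer lengths and enough iterations to get well above the clock() resolution, gives numbers that can be compared as ratios.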
> Similar here. Atom is quite different from PPro/K8.
I made some benchmarks of csum_partial() and csum_partial_copy_generic() as well. Here the PPro version of csum_partial() performed 10-15% better (depending on buffer length), while both implementations of csum_partial_copy_generic() performed equally.
> > This should be obsolete anyways, you can just use CORE2. They have
> > compatible ISAs.
>
> So you would recommend writing
>
> #elif defined CONFIG_MCORE2 || defined CONFIG_MATOM
> #define MODULE_PROC_FAMILY "CORE2 "
>
> ?
Yes. Or maybe you can find a better name.
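To make the suggestion concrete, here is a standalone sketch of what sharing the CORE2 family string would look like. It assumes the Kconfig entry from the patch (config MATOM) shows up as CONFIG_MATOM, defines that macro by hand only so the fragment compiles outside a kernel tree, and shortens the real #if chain to a single branch; same_proc_family() is a made-up helper, not a kernel function.

```c
#include <assert.h>
#include <string.h>

/* Normally provided by Kconfig; defined here only so the sketch is standalone. */
#ifndef CONFIG_MATOM
#define CONFIG_MATOM 1
#endif

/* Abbreviated chain: Atom reuses the Core 2 family string. */
#if defined CONFIG_MCORE2 || defined CONFIG_MATOM
#define MODULE_PROC_FAMILY "CORE2 "
#else
#define MODULE_PROC_FAMILY ""
#endif

/* Hypothetical check: does another build's family string match ours? */
int same_proc_family(const char *other)
{
	assert(other);
	return strcmp(MODULE_PROC_FAMILY, other) == 0;
}
```

With a shared string, a build configured for Atom and one configured for Core 2 report the same processor family; a separate "ATOM " string would make them differ.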
-Andi
On Mon, May 4, 2009 at 12:22 AM, Andi Kleen <a...@firstfloor.org> wrote:
> This should be obsolete anyways, you can just use CORE2. They have compatible ISAs.
Only correct if you don't plan to use the movbe instruction. The kernel would be the one place where I can imagine this to make sense.
On Tue, May 12, 2009 at 07:20:14AM -0700, Ulrich Drepper wrote:
> On Mon, May 4, 2009 at 12:22 AM, Andi Kleen <a...@firstfloor.org> wrote:
> > This should be obsolete anyways, you can just use CORE2. They have compatible ISAs.
>
> Only correct if you don't plan to use the movbe instruction. The
> kernel would be the one place where I can imagine this to make sense.
The problem is that you can't express the situations where movbe is better than bswap (you need both the old and the new value) in inline assembler in a way that lets gcc decide automatically.

I also doubt there are many (any?) situations in the kernel where the destruction of the old register value is a problem; e.g. the network stack normally doesn't care.

My understanding is that movbe is really mainly useful for some special situations where you run an emulator/JIT for a BE ISA, but that's not something the kernel does.
-Andi
On Tue, May 12, 2009 at 8:04 AM, Andi Kleen <a...@firstfloor.org> wrote:
> The problem is that you can't express the situations where
> movbe is better than bswap (you need both and the old and the new
> value) in inline assembler in a way that gcc decides automatically.
True. But I was mostly thinking about loads from memory. A quick search for ntoh*/hton* shows code like
u_int16_t queue_num = ntohs(nfmsg->res_id);
If there would be a ntohs_load() macro movbe could be used.
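A load-and-convert helper of that shape could look roughly like the sketch below. It is written as portable C rather than inline assembler; ntohs_load() and ntohl_load() are hypothetical names, and whether a given compiler turns the byte assembly into mov+bswap, a rotate, or movbe (for instance when built with GCC's -mmovbe) is an assumption that would have to be checked in the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: load a big-endian 16-bit value straight from memory. */
uint16_t ntohs_load(const void *p)
{
	const uint8_t *b = p;

	assert(p);
	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Same idea for a big-endian 32-bit value. */
uint32_t ntohl_load(const void *p)
{
	const uint8_t *b = p;

	assert(p);
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

The quoted line would then read queue_num = ntohs_load(&nfmsg->res_id), which hands the compiler the memory operand instead of an already loaded value.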
On Tue, May 12, 2009 at 10:45:00AM -0700, Ulrich Drepper wrote:
> On Tue, May 12, 2009 at 8:04 AM, Andi Kleen <a...@firstfloor.org> wrote:
> > The problem is that you can't express the situations where
> > movbe is better than bswap (you need both and the old and the new
> > value) in inline assembler in a way that gcc decides automatically.
>
> True. But I was mostly thinking about loads from memory. A quick
> search for ntoh*/hton* shows code like
>
>   u_int16_t queue_num = ntohs(nfmsg->res_id);
>
> If there would be a ntohs_load() macro movbe could be used.
It wouldn't surprise me if

	movbe memory,%reg

generates the same uops sequence internally as

	mov memory,%reg
	bswap %reg

I doubt there's any dedicated hardware for this in Atom (but I don't know for sure)
So unless you're really decoding constrained it would only save a few bytes of code size. Probably not worth having incompatible modules for or adding special code to the source.
-Andi
On Wed, May 13, 2009 at 10:04 PM, Harvey Harrison <harvey.harri...@gmail.com> wrote:
> It's called be16_to_cpup, or on x86, swab16p()
Indeed. If now somebody with an Atom could test whether using movbe has an advantage (my guess is that there is a slight advantage) then one could define a special version of the __beXX_to_cpup and __cpu_to_beXXp functions for Atom and start using these functions more rigorously in the tree.
On Thu, May 14, 2009 at 06:38:48AM -0700, Ulrich Drepper wrote:
> On Wed, May 13, 2009 at 10:04 PM, Harvey Harrison
> <harvey.harri...@gmail.com> wrote:
> > It's called be16_to_cpup, or on x86, swab16p()
>
> Indeed. If now somebody with an Atom could test whether using movbe
> has an advantage (my guess is that there is a slight advantage) then
How would you test that?
-Andi
On Thu, May 14, 2009 at 7:01 AM, Andi Kleen <a...@firstfloor.org> wrote:
> How would you test that?
Compare runtimes with mov+bswap for some simple code which uses the value after the conversion (e.g., just add to something).
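One way to set up such a comparison is sketched below, under the assumption that the same source is built twice (for example with and without a flag such as GCC's -mmovbe) and the timings are compared on an Atom. All function names are hypothetical, the empty asm statement is a GCC/Clang-specific compiler barrier, and the generated code would need to be inspected to confirm which instruction sequence each build actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Big-endian 32-bit load; a compiler may lower this to mov+bswap or movbe. */
uint32_t load_be32(const uint8_t *b)
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Convert every 32-bit word and add it up, so the converted value is used. */
uint32_t sum_be32(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 4 == 0);
	for (i = 0; i < len; i += 4)
		sum += load_be32(buf + i);
	return sum;
}

/* Time `iters` passes over the buffer; the accumulated sum goes to *total. */
double time_sum_be32(const uint8_t *buf, size_t len, unsigned long iters,
		     uint32_t *total)
{
	uint32_t acc = 0;
	unsigned long i;
	clock_t start, end;

	start = clock();
	for (i = 0; i < iters; i++) {
		acc += sum_be32(buf, len);
		/* Compiler barrier: force the buffer to be re-read on each pass. */
		__asm__ __volatile__("" : : "r"(buf) : "memory");
	}
	end = clock();
	*total = acc;
	return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Returning the sum through *total keeps the work observable, and a buffer that fits in L1 keeps memory latency from hiding any difference between the two instruction sequences.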
Or in your case: get the Atom designers to comment.