The C10K Problem

The document discusses strategies for building servers that can handle thousands of clients simultaneously (the "C10K problem"). It describes five popular strategies: 1) Using non-blocking I/O with level-triggered notifications to serve many clients per thread, 2) Using non-blocking I/O with readiness change notifications, 3) Using asynchronous I/O to serve many clients per thread, 4) Using one thread per client with blocking I/O, and 5) Building server code directly into the kernel. It then provides more details on option 1 and references additional resources on high-performance server design.

Uploaded by

Achin Sagar
8/22/2015

The C10K problem

[Help save the best Linux news source on the web: subscribe to Linux Weekly News!]

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web
is a big place now.

And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a
1000Mbit/sec Ethernet card for $1200 or so. Let's see: at 20000 clients, that's 50KHz, 100Kbytes, and
50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the
disk and send them to the network once a second for each of twenty thousand clients. (That works out to
$0.06 per client, by the way. Those $100/client licensing fees some operating systems charge are starting
to look a little heavy!) So hardware is no longer the bottleneck.
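
The per-client figures quoted here are simply the machine totals divided by the number of clients. As a rough sketch (the helper names `per_client` and `print_budget` are hypothetical, and the totals are the decimal figures quoted in the text), the arithmetic can be checked like this:

```c
#include <stdio.h>

/* Hypothetical helper: an even share of one machine-wide resource. */
double per_client(double total, double clients)
{
    return total / clients;
}

/* Print each client's share of the example machine described in the text. */
void print_budget(double clients)
{
    printf("CPU:       %.0f KHz\n", per_client(1000e6, clients) / 1e3);       /* 1000 MHz clock */
    printf("RAM:       %.0f Kbytes\n", per_client(2e9, clients) / 1e3);       /* 2 gigabytes    */
    printf("Bandwidth: %.0f Kbits/sec\n", per_client(1000e6, clients) / 1e3); /* 1000 Mbit/sec  */
}
```

Calling `print_budget(20000)` reproduces the three per-client numbers for the 20000-client case.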
In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a
Gigabit Ethernet pipe. As of 2001, that same speed is now being offered by several ISPs, who expect it to
become increasingly popular with large business customers.

And the thin client model of computing appears to be coming back in style, this time with the server out
on the Internet, serving thousands of clients.

With that in mind, here are a few notes on how to configure operating systems and write code to support
thousands of clients. The discussion centers around Unix-like operating systems, as that's my personal
area of interest, but Windows is also covered a bit.

Contents
The C10K problem
Related Sites
Book to Read First
I/O frameworks
I/O Strategies
    1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness
       notification
        The traditional select()
        The traditional poll()
        /dev/poll (Solaris 2.7+)
        kqueue (FreeBSD, NetBSD)
    2. Serve many clients with each thread, and use nonblocking I/O and readiness change
       notification
        epoll (Linux 2.6+)
        Polyakov's kevent (Linux 2.6+)
        Drepper's New Network Interface (proposal for Linux 2.6+)
        Realtime Signals (Linux 2.4+)
        Signal-per-fd
        kqueue (FreeBSD, NetBSD)
    3. Serve many clients with each thread, and use asynchronous I/O and completion notification
    4. Serve one client with each server thread
        LinuxThreads (Linux 2.0+)
http://www.kegel.com/c10k.html#related

        NGPT (Linux 2.4+)
        NPTL (Linux 2.6, Red Hat 9)
        FreeBSD threading support
        NetBSD threading support
        Solaris threading support
        Java threading support in JDK 1.3.x and earlier
        Note: 1:1 threading vs. M:N threading
    5. Build the server code into the kernel
    6. Bring the TCP stack into userspace
Comments
Limits on open filehandles
Limits on threads
Java issues [Updated 27 May 2001]
Other tips
    Zero-Copy
    The sendfile() system call can implement zero-copy networking.
    Avoid small frames by using writev (or TCP_CORK)
    Some programs can benefit from using non-Posix threads.
    Caching your own data can sometimes be a win.
Other limits
Kernel Issues
Measuring Server Performance
Examples
    Interesting select()-based servers
    Interesting /dev/poll-based servers
    Interesting epoll-based servers
    Interesting kqueue()-based servers
    Interesting realtime signal-based servers
    Interesting thread-based servers
    Interesting in-kernel servers
Other interesting links

Related Sites
See Nick Black's excellent Fast UNIX Servers page for a circa-2009 look at the situation.

In October 2003, Felix von Leitner put together an excellent web page and presentation about network
scalability, complete with benchmarks comparing various networking system calls and operating
systems. One of his observations is that the 2.6 Linux kernel really does beat the 2.4 kernel, but there are
many, many good graphs that will give the OS developers food for thought for some time. (See also the
Slashdot comments; it'll be interesting to see whether anyone does followup benchmarks improving on
Felix's results.)

Book to Read First
If you haven't read it already, go out and get a copy of Unix Network Programming: Networking Apis:
Sockets and Xti (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and
pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem.
And while you're at it, go read Jeff Darcy's notes on high-performance server design.

(Another book which might be more helpful for those who are *using* rather than *writing* a web
server is Building Scalable Web Sites by Cal Henderson.)

I/O frameworks
Prepackaged libraries are available that abstract some of the techniques presented below, insulating your
code from the operating system and making it more portable.
- ACE, a heavyweight C++ I/O framework, contains object-oriented implementations of some of
  these I/O strategies and many other useful things. In particular, its Reactor is an OO way of doing
  nonblocking I/O, and Proactor is an OO way of doing asynchronous I/O.
- ASIO is a C++ I/O framework which is becoming part of the Boost library. It's like ACE updated
  for the STL era.
- libevent is a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and soon
  will support poll and epoll. It's level-triggered only, I think, which has both good and bad sides.
  Niels has a nice graph of time to handle one event as a function of the number of connections. It
  shows kqueue and sys_epoll as clear winners.
- My own attempts at lightweight frameworks (sadly, not kept up to date):
    - Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API
      using whatever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio).
      It's useful for benchmarks that compare the performance of the various APIs. This document
      links to Poller subclasses below to illustrate how each of the readiness APIs can be used.
    - rn is a lightweight C I/O framework that was my second try after Poller. It's lgpl (so it's
      easier to use in commercial apps) and C (so it's easier to use in non-C++ apps). It was used
      in some commercial products.
- Matt Welsh wrote a paper in April 2000 about how to balance the use of worker thread and
  event-driven techniques when building scalable servers. The paper describes part of his Sandstorm I/O
  framework.
- Cory Nelson's Scale! library: an async socket, file, and pipe I/O library for Windows

I/O Strategies
Designers of networking software have many options. Here are a few:
- Whether and how to issue multiple I/O calls from a single thread
    - Don't; use blocking/synchronous calls throughout, and possibly use multiple threads or
      processes to achieve concurrency
    - Use nonblocking calls (e.g. write() on a socket set to O_NONBLOCK) to start I/O, and
      readiness notification (e.g. poll() or /dev/poll) to know when it's OK to start the next I/O on
      that channel. Generally only usable with network I/O, not disk I/O.
    - Use asynchronous calls (e.g. aio_write()) to start I/O, and completion notification (e.g.
      signals or completion ports) to know when the I/O finishes. Good for both network and disk
      I/O.
- How to control the code servicing each client
    - one process for each client (classic Unix approach, used since 1980 or so)
    - one OS-level thread handles many clients; each client is controlled by:
        - a user-level thread (e.g. GNU state threads, classic Java with green threads)
        - a state machine (a bit esoteric, but popular in some circles; my favorite)
        - a continuation (a bit esoteric, but popular in some circles)
    - one OS-level thread for each client (e.g. classic Java with native threads)
    - one OS-level thread for each active client (e.g. Tomcat with apache front end; NT
      completion ports; thread pools)
- Whether to use standard O/S services, or put some code into the kernel (e.g. in a custom driver,
  kernel module, or VxD)

The following five combinations seem to be popular:
1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness
   notification
2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
3. Serve many clients with each server thread, and use asynchronous I/O
4. Serve one client with each server thread, and use blocking I/O
5. Build the server code into the kernel

1. Serve many clients with each thread, and use nonblocking I/O and level-triggered
readiness notification
... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle
has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file
descriptor is ready, whether or not you've done anything with that file descriptor since the last time the
kernel told you about it. (The name 'level triggered' comes from computer hardware design; it's the
opposite of 'edge triggered'. Jonathan Lemon introduced the terms in his BSDCON 2000 paper on
kqueue().)

Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the
file descriptor might not be ready anymore when you try to read from it. That's why it's important to use
nonblocking mode when using readiness notification.
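
To make the "readiness is only a hint" point concrete, here is a minimal sketch using the standard fcntl() and read() calls. The function names `set_nonblocking` and `read_after_hint`, and the -2 return convention for a stale hint, are assumptions made for this illustration only:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read after the kernel hinted that fd is readable.
 * Returns bytes read (> 0), 0 on end of file, -1 on a real error, and
 * -2 (a convention invented here) when the hint was stale and no data
 * is available right now. */
ssize_t read_after_hint(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);   /* retry if interrupted by a signal */
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;                         /* not ready after all; go back to waiting */
    return n;
}
```

Because the descriptor is nonblocking, a stale hint costs one failed read() instead of stalling every other client served by the same thread.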
An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in
core at the moment; setting nonblocking mode on a disk file handle has no effect. Same thing goes for
memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must
wait, and that raw nonthreaded performance goes to waste.

This is what asynchronous I/O is for, but on systems that lack AIO, worker threads or processes that do
the disk I/O can also get around this bottleneck. One approach is to use memory-mapped files, and if
mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef
Poskanzer mentions that Pai, Druschel, and Zwaenepoel's 1999 Flash web server uses this trick; they
gave a talk at Usenix '99 on it. It looks like mincore() is available in BSD-derived Unixes like FreeBSD
and Solaris, but is not part of the Single Unix Specification. It's available as part of Linux as of kernel
2.3.51, thanks to Chuck Lever.
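
As a rough illustration of the mincore() check (not the Flash server's actual code), the sketch below asks whether every page of a mapped range is resident. It assumes Linux with glibc, where the result vector is `unsigned char *`; BSD systems declare it as `char *`. The function name `range_in_core` is hypothetical:

```c
#define _DEFAULT_SOURCE        /* assumption: glibc, which needs this to declare mincore() */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if every page of [addr, addr+len) is resident in memory,
 * 0 if at least one page would need disk I/O, and -1 on error.
 * addr must be page-aligned, e.g. a pointer returned by mmap(). */
int range_in_core(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = (len + page - 1) / page;
    unsigned char *vec = malloc(pages ? pages : 1);
    int resident = 1;
    size_t i;

    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) == -1) {
        free(vec);
        return -1;
    }
    for (i = 0; i < pages; i++) {
        if (!(vec[i] & 1)) {   /* low bit set means the page is in core */
            resident = 0;
            break;
        }
    }
    free(vec);
    return resident;
}
```

A server would call this before touching a mapped file region and hand the request to a worker when it returns 0. The answer is itself only a hint, since a page can be evicted between the check and the access.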
But in November 2003 on the freebsd-hackers list, Vivek Pai et al reported very good results using
system-wide profiling of their Flash web server to attack bottlenecks. One bottleneck they found was
mincore (guess that wasn't such a good idea after all). Another was the fact that sendfile blocks on disk
access; they improved performance by introducing a modified sendfile() that returns something like
EWOULDBLOCK when the disk page it's fetching is not yet in core. (Not sure how you tell the user the
page is now resident... seems to me what's really needed here is aio_sendfile().) The end result of their
optimizations is a SpecWeb99 score of about 800 on a 1GHz/1GB FreeBSD box, which is better than
anything on file at spec.org.

There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:

The traditional select()
Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled in to the standard
library and user programs. (Some versions of the C library let you raise this limit at user app
compile time.)

See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness
notification schemes.
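
For readers who have not used it, a minimal select() wrapper might look like the sketch below. It is not taken from Poller_select; the function name `wait_readable_select` and its calling convention are assumptions for illustration. Note the explicit FD_SETSIZE check, which is the limit discussed above:

```c
#include <stddef.h>
#include <sys/select.h>

/* Wait up to timeout_ms for any of fds[0..n-1] to become readable.
 * On success ready[i] is 1 if fds[i] is readable, else 0.
 * Returns the number of ready descriptors, 0 on timeout, -1 on error. */
int wait_readable_select(const int *fds, int n, int *ready, int timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    int maxfd = -1, rc, i;

    FD_ZERO(&rset);
    for (i = 0; i < n; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1;                 /* select() cannot represent this descriptor */
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(maxfd + 1, &rset, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    for (i = 0; i < n; i++)
        ready[i] = FD_ISSET(fds[i], &rset) ? 1 : 0;
    return rc;
}
```

Because select() is level-triggered, calling this again without reading the data reports the same descriptor as ready again.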
The traditional poll()
There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow
at around a few thousand, since most of the file descriptors are idle at any one time, and scanning
through thousands of file descriptors takes time.

Some OS's (e.g. Solaris 8) speed up poll() et al by use of techniques like poll hinting, which was
implemented and benchmarked by Niels Provos for Linux in 1999.

See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other
readiness notification schemes.
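
The equivalent sketch for poll() is below. Again this is not Poller_poll; `wait_readable_poll` and the `MAX_POLL_FDS` cap are names invented for this example (poll() itself has no such cap). The loop over the whole array after every call is the linear scan that makes poll() slow with thousands of mostly idle descriptors:

```c
#include <poll.h>

#define MAX_POLL_FDS 1024   /* arbitrary cap for this sketch only */

/* Wait up to timeout_ms for readable data on fds[0..n-1].
 * On success ready[i] is 1 if fds[i] is readable or has hung up.
 * Returns the number of descriptors with events, 0 on timeout, -1 on error. */
int wait_readable_poll(const int *fds, int n, int *ready, int timeout_ms)
{
    struct pollfd pfds[MAX_POLL_FDS];
    int i, rc;

    if (n < 0 || n > MAX_POLL_FDS)
        return -1;
    for (i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    rc = poll(pfds, (nfds_t)n, timeout_ms);
    if (rc < 0)
        return -1;
    for (i = 0; i < n; i++)            /* every entry is scanned, ready or not */
        ready[i] = (pfds[i].revents & (POLLIN | POLLHUP)) ? 1 : 0;
    return rc;
}
```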
/dev/poll
This is the recommended poll replacement for Solaris.

The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with
the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once
what files you're interested in by writing to that handle; from then on, you just read the set of
currently ready file descriptors from that handle.

It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris
8; according to Sun, at 750 clients, this has 10% of the overhead of poll().

Various implementations of /dev/poll were tried on Linux, but none of them perform as well as
epoll, and were never really completed. /dev/poll use on Linux is not recommended.

See Poller_devpoll (cc, h benchmarks) for an example of how to use /dev/poll interchangeably
with many other readiness notification schemes. (Caution: the example is for Linux /dev/poll,
might not work right on Solaris.)

kqueue()
This is the recommended poll replacement for FreeBSD (and, soon, NetBSD).
See below. kqueue() can specify either edge triggering or level triggering.

2. Serve many clients with each thread, and use nonblocking I/O and readiness
change notification
Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file
descriptor, and later, when that descriptor transitions from not ready to ready, the kernel notifies you
somehow. It then assumes you know the file descriptor is ready, and will not send any more readiness
notifications of that type for that file descriptor until you do something that causes the file descriptor to
no longer be ready (e.g. until you receive the EWOULDBLOCK error on a send, recv, or accept call, or a
send or recv transfers less than the requested number of bytes).

When you use readiness change notification, you must be prepared for spurious events, since one
common implementation is to signal readiness whenever any packets are received, regardless of whether
the file descriptor was already ready.

This is the opposite of "level-triggered" readiness notification. It's a bit less forgiving of programming
mistakes, since if you miss just one event, the connection that event was for gets stuck forever.
Nevertheless, I have found that edge-triggered readiness notification made programming nonblocking
clients with OpenSSL easier, so it's worth trying.

[Banga, Mogul, Druschel '99] described this kind of scheme in 1999.
There are several APIs which let the application retrieve 'file descriptor became ready' notifications:

kqueue()
This is the recommended edge-triggered poll replacement for FreeBSD (and, soon, NetBSD).

FreeBSD 4.3 and later, and NetBSD-current as of Oct 2002, support a generalized alternative to
poll() called kqueue()/kevent(); it supports both edge-triggering and level-triggering. (See also
Jonathan Lemon's page and his BSDCon 2000 paper on kqueue().)

Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call
kqueue() to allocate one. To change the events you are listening for, or to get the list of current
events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket
readiness, but also for plain file readiness, signals, and even for I/O completion.

Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue();
evidently, when kqueue() blocks, the entire process blocks, not just the calling thread.

See Poller_kqueue (cc, h, benchmarks) for an example of how to use kqueue() interchangeably
with many other readiness notification schemes.

Examples and libraries using kqueue():
- PyKQueue: a Python binding for kqueue()
- Ronald F. Guilmette's example echo server; see also his 28 Sept 2000 post on
  freebsd.questions.

epoll
This is the recommended edge-triggered poll replacement for the 2.6 Linux kernel.

On 11 July 2001, Davide Libenzi proposed an alternative to realtime signals; his patch provides
what he now calls /dev/epoll www.xmailserver.org/linux-patches/nio-improve.html. This is just
like the realtime signal readiness notification, but it coalesces redundant events, and has a more
efficient scheme for bulk event retrieval.

Epoll was merged into the 2.5 kernel tree as of 2.5.46 after its interface was changed from a special
file in /dev to a system call, sys_epoll. A patch for the older version of epoll is available for the 2.4
kernel.

There was a lengthy debate about unifying epoll, aio, and other event sources on the linux-kernel
mailing list around Halloween 2002. It may yet happen, but Davide is concentrating on firming up
epoll in general first.
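
As a small illustration of edge-triggered epoll, the sketch below registers one descriptor with EPOLLET and waits for a single event. It uses the system-call interface as it exists in current Linux (epoll_create1() is newer than the 2.5.46 interface described above), and the function names `epoll_watch_edge` and `epoll_next_ready` are hypothetical:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Create an epoll instance and register fd for edge-triggered read events.
 * Returns the epoll descriptor, or -1 on error. */
int epoll_watch_edge(int fd)
{
    struct epoll_event ev = { 0 };
    int ep = epoll_create1(0);

    if (ep == -1)
        return -1;
    ev.events = EPOLLIN | EPOLLET;     /* EPOLLET selects edge-triggered mode */
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(ep);
        return -1;
    }
    return ep;
}

/* Wait up to timeout_ms for one event.
 * Returns the ready descriptor, or -1 on timeout or error. */
int epoll_next_ready(int ep, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(ep, &ev, 1, timeout_ms);

    return n == 1 ? ev.data.fd : -1;
}
```

After one event has been delivered, a second epoll_next_ready() call times out even if unread data remains, until new data arrives. That is the edge-triggered behavior described in this section.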

Polyakov's kevent (Linux 2.6+)
News flash: On 9 Feb 2006, and again on 9 July 2006, Evgeniy
Polyakov posted patches which seem to unify epoll and aio; his goal is to support network AIO.
See:
- the LWN article about kevent
- his July announcement
- his kevent page
- his naio page
- some recent discussion

Drepper's New Network Interface (proposal for Linux 2.6+)
At OLS 2006, Ulrich Drepper proposed a new high-speed asynchronous networking API. See:
- his paper, "The Need for Asynchronous, Zero-Copy Network I/O"
- his slides
- LWN article from July 22

Realtime Signals
This is the recommended edge-triggered poll replacement for the 2.4 Linux kernel.

The 2.4 linux kernel can deliver socket readiness events via a particular realtime signal. Here's how
to turn this behavior on:
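
    /* Needs <signal.h>, <fcntl.h>, <unistd.h>; on glibc, F_SETSIG requires _GNU_SOURCE. */
    /* Here sigset is a sigset_t, signum is the chosen realtime signal, fd is a socket. */

    /* Mask off SIGIO and the signal you want to use. */
    sigemptyset(&sigset);
    sigaddset(&sigset, signum);
    sigaddset(&sigset, SIGIO);
    sigprocmask(SIG_BLOCK, &sigset, NULL);

    /* For each file descriptor, invoke F_SETOWN, F_SETSIG, and set O_ASYNC. */
    fcntl(fd, F_SETOWN, (int) getpid());
    fcntl(fd, F_SETSIG, signum);
    flags = fcntl(fd, F_GETFL);
    flags |= O_NONBLOCK | O_ASYNC;
    fcntl(fd, F_SETFL, flags);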

This sends that signal when a normal I/O function like read() or write() completes. To use this,
write a normal poll() outer loop, and inside it, after you've handled all the fd's noticed by poll(),
you loop calling sigwaitinfo().

If sigwaitinfo or sigtimedwait returns your realtime signal, siginfo.si_fd and siginfo.si_band give
almost the same information as pollfd.fd and pollfd.revents would after a call to poll(), so you
handle the i/o, and continue calling sigwaitinfo().

If sigwaitinfo returns a traditional SIGIO, the signal queue overflowed, so you flush the signal
queue by temporarily changing the signal handler to SIG_DFL, and break back to the outer poll()
loop.

See Poller_sigio (cc, h) for an example of how to use rtsignals interchangeably with many other
readiness notification schemes.

See Zach Brown's phhttpd for example code that uses this feature directly. (Or don't; phhttpd is a
bit hard to figure out...)

[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of
sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly,
the chief benefit of sigtimedwait4() for them seemed to be it allowed the app to gauge system
overload (so it could behave appropriately). (Note that poll() provides the same measure of system
overload.)

Signal-per-fd
Chandra and Mosberger proposed a modification to the realtime signal approach called
"signal-per-fd" which reduces or eliminates realtime signal queue overflow by coalescing redundant
events. It doesn't outperform epoll, though. Their paper
(www.hpl.hp.com/techreports/2000/HPL-2000-174.html) compares performance of this scheme with
select() and /dev/poll.

Vitaly Luban announced a patch implementing this scheme on 18 May 2001; his patch lives at
www.luban.org/GPL/gpl.html. (Note: as of Sept 2001, there may still be stability problems with
this patch under heavy load. dkftpbench at about 4500 users may be able to trigger an oops.)

See Poller_sigfd (cc, h) for an example of how to use signal-per-fd interchangeably with many
other readiness notification schemes.

3. Serve many clients with each server thread, and use asynchronous I/O
This has not yet become popular in Unix, probably because few operating systems support asynchronous
I/O, also possibly because it (like nonblocking I/O) requires rethinking your application. Under standard
Unix, asynchronous I/O is provided by the aio_ interface (scroll down from that link to "Asynchronous
input and output"), which associates a signal and value with each I/O operation. Signals and their values
are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime
extensions, and is also in the Single Unix Specification, version 2.

AIO is normally used with edge-triggered completion notification, i.e. a signal is queued when the
operation is complete. (It can also be used with level-triggered completion notification by calling
aio_suspend(), though I suspect few people do this.)

glibc 2.1 and later provide a generic implementation written for standards compliance rather than
performance.

Ben LaHaise's implementation for Linux AIO was merged into the main Linux kernel as of 2.5.32. It
doesn't use kernel threads, and has a very efficient underlying api, but (as of 2.6.0-test2) doesn't yet
support sockets. (There is also an AIO patch for the 2.4 kernels, but the 2.5/2.6 implementation is
somewhat different.) More info:
- The page "Kernel Asynchronous I/O (AIO) Support for Linux" which tries to tie together all info
  about the 2.6 kernel's implementation of AIO (posted 16 Sept 2003)
- Round 3: aio vs /dev/epoll by Benjamin C.R. LaHaise (presented at 2002 OLS)
- Asynchronous I/O Support in Linux 2.5, by Bhattacharya, Pratt, Pulavarty, and Morgan, IBM;
  presented at OLS '2003
- Design Notes on Asynchronous I/O (aio) for Linux by Suparna Bhattacharya; compares Ben's
  AIO with SGI's KAIO and a few other AIO projects
- Linux AIO home page: Ben's preliminary patches, mailing list, etc.
- linux-aio mailing list archives
- libaio-oracle: library implementing standard Posix AIO on top of libaio. First mentioned by Joel
  Becker on 18 Apr 2003.

Suparna also suggests having a look at the DAFS API's approach to AIO.

Red Hat AS and Suse SLES both provide a high-performance implementation on the 2.4 kernel; it is
related to, but not completely identical to, the 2.6 kernel implementation.

In February 2006, a new attempt is being made to provide network AIO; see the note above about
Evgeniy Polyakov's kevent-based AIO.

In 1999, SGI implemented high-speed AIO for Linux. As of version 1.1, it's said to work well with
both disk I/O and sockets. It seems to use kernel threads. It is still useful for people who can't wait for
Ben's AIO to support sockets.

The O'Reilly book POSIX.4: Programming for the Real World is said to include a good introduction to
aio.

A tutorial for the earlier, nonstandard, aio implementation on Solaris is online at Sunsite. It's probably
worth a look, but keep in mind you'll need to mentally convert "aioread" to "aio_read", etc.

Note that AIO doesn't provide a way to open files without blocking for disk I/O; if you care about the
sleep caused by opening a disk file, Linus suggests you should simply do the open() in a different thread
rather than wishing for an aio_open() system call.

Under Windows, asynchronous I/O is associated with the terms "Overlapped I/O" and IOCP or "I/O
Completion Port". Microsoft's IOCP combines techniques from the prior art like asynchronous I/O (like
aio_write) and queued completion notification (like when using the aio_sigevent field with aio_write)
with a new idea of holding back some requests to try to keep the number of running threads associated
with a single IOCP constant. For more information, see Inside I/O Completion Ports by Mark
Russinovich at sysinternals.com, Jeffrey Richter's book "Programming Server-Side Applications for
Microsoft Windows 2000" (Amazon, MSPress), U.S. patent #06223207, or MSDN.

4. Serve one client with each server thread
... and let read() and write() block. Has the disadvantage of using a whole stack frame for each client,
which costs memory. Many OS's also have trouble handling more than a few hundred threads. If each
thread gets a 2MB stack (not an uncommon default value), you run out of *virtual memory* at (2^30 /
2^21) = 512 threads on a 32 bit machine with 1GB user-accessible VM (like, say, Linux as normally
shipped on x86). You can work around this by giving each thread a smaller stack, but since most thread
libraries don't allow growing thread stacks once created, doing this means designing your program to
minimize stack use. You can also work around this by moving to a 64 bit processor.
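
The thread-count ceiling is just address space divided by stack size. A tiny sketch of that calculation follows; the function name `max_threads` is hypothetical, and the bound ignores address space taken by code, heap, and guard pages, so real limits are lower:

```c
/* Upper bound on the number of threads when every thread reserves
 * stack_bytes out of vm_bytes of user-accessible address space. */
unsigned long long max_threads(unsigned long long vm_bytes,
                               unsigned long long stack_bytes)
{
    if (stack_bytes == 0)
        return 0;                      /* avoid dividing by zero */
    return vm_bytes / stack_bytes;
}
```

With 1GB of address space, 2MB stacks give the 512 threads computed above, while 64KB stacks would raise the bound to 16384. On POSIX systems the usual way to request a smaller stack is pthread_attr_setstacksize() before creating the thread.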

The thread support in Linux, FreeBSD, and Solaris is improving, and 64 bit processors are just around
the corner even for mainstream users. Perhaps in the not-too-distant future, those who prefer using one
thread per client will be able to use that paradigm even for 10000 clients. Nevertheless, at the current
time, if you actually want to support that many clients, you're probably better off using some other
paradigm.

For an unabashedly pro-thread viewpoint, see Why Events Are A Bad Idea (for High-concurrency
Servers) by von Behren, Condit, and Brewer, UCB, presented at HotOS IX. Anyone from the anti-thread
camp care to point out a paper that rebuts this one? :)

LinuxThreads
LinuxThreads is the name for the standard Linux thread library. It is integrated into glibc since glibc 2.0,
and is mostly Posix-compliant, but with less than stellar performance and signal support.

NGPT: Next Generation Posix Threads for Linux
NGPT is a project started by IBM to bring good Posix-compliant thread support to Linux. It's at stable
version 2.2 now, and works well... but the NGPT team has announced that they are putting the NGPT
codebase into support-only mode because they feel it's "the best way to support the community for the
long term". The NGPT team will continue working to improve Linux thread support, but now focused on
improving NPTL. (Kudos to the NGPT team for their good work and the graceful way they conceded to
NPTL.)

NPTL: Native Posix Thread Library for Linux
NPTL is a project by Ulrich Drepper (the benevolent dict^H^H^H^Hmaintainer of glibc) and Ingo
Molnar to bring world-class Posix threading support to Linux.

As of 5 October 2003, NPTL is now merged into the glibc cvs tree as an add-on directory (just like
linuxthreads), so it will almost certainly be released along with the next release of glibc.

The first major distribution to include an early snapshot of NPTL was Red Hat 9. (This was a bit
inconvenient for some users, but somebody had to break the ice...)

NPTL links:
- Mailing list for NPTL discussion
- NPTL source code
- Initial announcement for NPTL
- Original whitepaper describing the goals for NPTL
- Revised whitepaper describing the final design of NPTL
- Ingo Molnar's first benchmark showing it could handle 10^6 threads
- Ulrich's benchmark comparing performance of LinuxThreads, NPTL, and IBM's NGPT. It seems
  to show NPTL is much faster than NGPT.

Here's my try at describing the history of NPTL (see also Jerry Cooperstein's article):

In March 2002, Bill Abt of the NGPT team, the glibc maintainer Ulrich Drepper, and others met to figure
out what to do about LinuxThreads. One idea that came out of the meeting was to improve mutex
performance; Rusty Russell et al subsequently implemented fast userspace mutexes (futexes), which are
now used by both NGPT and NPTL. Most of the attendees figured NGPT should be merged into glibc.

Ulrich Drepper, though, didn't like NGPT, and figured he could do better. (For those who have ever tried
to contribute a patch to glibc, this may not come as a big surprise :) Over the next few months, Ulrich
Drepper, Ingo Molnar, and others contributed glibc and kernel changes that make up something called
the Native Posix Threads Library (NPTL). NPTL uses all the kernel enhancements designed for NGPT,
and takes advantage of a few new ones. Ingo Molnar described the kernel enhancements as follows:

    While NPTL uses the three kernel features introduced by NGPT: getpid() returns PID,
    CLONE_THREAD and futexes; NPTL also uses (and relies on) a much wider set of new
    kernel features, developed as part of this project.

    Some of the items NGPT introduced into the kernel around 2.5.8 got modified, cleaned up
    and extended, such as thread group handling (CLONE_THREAD). [the CLONE_THREAD
    changes which impacted NGPT's compatibility got synced with the NGPT folks, to make sure
    NGPT does not break in any unacceptable way.]

    The kernel features developed for and used by NPTL are described in the design whitepaper,
    http://people.redhat.com/drepper/nptl-design.pdf ...

    A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID,
    CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (release TID
    futex upon VM-release), the sys_exit_group() system-call, sys_execve() enhancements and
    support for detached threads.

    There was also work put into extending the PID space, eg. procfs crashed due to 64K PID
    assumptions, max_pid, and pid allocation scalability work. Plus a number of performance-
    only improvements were done as well.

    In essence the new features are a no-compromises approach to 1:1 threading; the kernel
    now helps in everything where it can improve threading, and we precisely do the minimally
    necessary set of context switches and kernel calls for every basic threading primitive.

One big difference between the two is that NPTL is a 1:1 threading model, whereas NGPT is an M:N
threading model (see below). In spite of this, Ulrich's initial benchmarks seem to show that NPTL is
indeed much faster than NGPT. (The NGPT team is looking forward to seeing Ulrich's benchmark code
to verify the result.)

FreeBSD threading support
FreeBSD supports both LinuxThreads and a userspace threading library. Also, an M:N implementation
called KSE was introduced in FreeBSD 5.0. For one overview, see
www.unobvious.com/bsd/freebsd-threads.html.

On 25 Mar 2003, Jeff Roberson posted on freebsd-arch:

    ... Thanks to the foundation provided by Julian, David Xu, Mini, Dan Eischen, and everyone
    else who has participated with KSE and libpthread development Mini and I have developed
    a 1:1 threading implementation. This code works in parallel with KSE and does not break it
    in any way. It actually helps bring M:N threading closer by testing out shared bits. ...

And in July 2006, Robert Watson proposed that the 1:1 threading implementation become the default in
FreeBSD 7.x:

    I know this has been discussed in the past, but I figured with 7.x trundling forward, it was
    time to think about it again. In benchmarks for many common applications and scenarios,
    libthr demonstrates significantly better performance over libpthread... libthr is also
    implemented across a larger number of our platforms, and is already libpthread on several.
    The first recommendation we make to MySQL and other heavy thread users is "Switch to
    libthr", which is suggestive, also! ... So the strawman proposal is: make libthr the default
    threading library on 7.x.

NetBSD threading support
According to a note from Noriyuki Soda:

    Kernel supported M:N thread library based on the Scheduler Activations model is merged
    into NetBSD-current on Jan 18 2003.

For details, see An Implementation of Scheduler Activations on the NetBSD Operating System by
Nathan J. Williams, Wasabi Systems, Inc., presented at FREENIX '02.

Solaris threading support
The thread support in Solaris is evolving... from Solaris 2 to Solaris 8, the default threading library used
an M:N model, but Solaris 9 defaults to 1:1 model thread support. See Sun's multithreaded programming
guide and Sun's note about Java and Solaris threading.

Java threading support in JDK 1.3.x and earlier
As is well known, Java up to JDK1.3.x did not support any method of handling network connections
other than one thread per client. Volanomark is a good microbenchmark which measures throughput in
messages per second at various numbers of simultaneous connections. As of May 2003, JDK 1.3
implementations from various vendors are in fact able to handle ten thousand simultaneous connections,
albeit with significant performance degradation. See Table 4 for an idea of which JVMs can handle
10000 connections, and how performance suffers as the number of connections increases.

Note: 1:1 threading vs. M:N threading
There is a choice when implementing a threading library: you can either put all the threading support in
the kernel (this is called the 1:1 threading model), or you can move a fair bit of it into userspace (this is
called the M:N threading model). At one point, M:N was thought to be higher performance, but it's so
complex that it's hard to get right, and most people are moving away from it.
- Why Ingo Molnar prefers 1:1 over M:N
- Sun is moving to 1:1 threads
- NGPT is an M:N threading library for Linux.
- Although Ulrich Drepper planned to use M:N threads in the new glibc threading library, he has
  since switched to the 1:1 threading model.
- MacOSX appears to use 1:1 threading.
- FreeBSD and NetBSD appear to still believe in M:N threading... The lone holdouts? Looks like
  freebsd 7.0 might switch to 1:1 threading (see above), so perhaps M:N threading's believers have
  finally been proven wrong everywhere.

5. Build the server code into the kernel
Novell and Microsoft are both said to have done this at various times, at least one NFS implementation
does this, khttpd does this for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a
blindingly fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo's September 1,
2000 announcement says an alpha version of TUX can be downloaded from
ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join a mailing list for more info.

The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to
be instead of moving web servers into the kernel, the kernel should have the smallest possible hooks
added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach
Brown's remarks about userland vs. kernel http servers. It appears that the 2.4 linux kernel provides
sufficient power to user programs, as the X15 server runs about as fast as Tux, but doesn't use any kernel
modifications.

Bring the TCP stack into userspace
See for instance the netmap packet I/O framework, and the Sandstorm proof-of-concept web server based
on it.

Comments
Richard Gooch has written a paper discussing I/O options.
In 2001, Tim Brecht and Michal Ostrowski measured various strategies for simple select-based servers.
Their data is worth a look.
In 2003, Tim Brecht posted source code for userver, a small web server put together from several servers
written by Abhishek Chandra, David Mosberger, David Pariag, and Michal Ostrowski. It can use select(),
poll(), epoll(), or sigio.
Back in March 1999, Dean Gaudet posted:
I keep getting asked "why don't you guys use a select/event based model like Zeus? It's
clearly the fastest."...
His reasons boiled down to "it's really hard, and the payoff isn't clear". Within a few months, though, it
became clear that people were willing to work on it.
Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel.
Worth reading, even if he seems misinformed on some points. In particular, he seems to think that Linux
2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only
when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an
earlier draft, Ingo Molnar's rebuttal of 30 April 1999, Russinovich's comments of 2 May 1999, a rebuttal
from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't
support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not
so true anymore.
See these pages at sysinternals.com and MSDN for information on "completion ports", which he said
were unique to NT. In a nutshell, win32's "overlapped I/O" turned out to be too low level to be
convenient, and a "completion port" is a wrapper that provides a queue of completion events, plus
scheduling magic that tries to keep the number of running threads constant by allowing more threads to
pick up completion events if other threads that had picked up completion events from this port are
sleeping (perhaps doing blocking I/O).
See also OS/400's support for I/O completion ports.
There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous
Connections" (and the second week of the thread). Highlights:
Ed Hall posted a few notes on his experiences; he's achieved >1000 connects/second on a UP
P2/333 running Solaris. His code used a small pool of threads (1 or 2 per CPU) each managing a
large number of clients using "an event-based model".
Mike Jagdis posted an analysis of poll/select overhead, and said "The current select/poll
implementation can be improved significantly, especially in the blocking case, but the overhead
will still increase with the number of descriptors because select/poll does not, and cannot,
remember what descriptors are interesting. This would be easy to fix with a new API. Suggestions
are welcome..."
Mike posted about his work on improving select() and poll().
Mike posted a bit about a possible API to replace poll()/select(): "How about a 'device like' API
where you write 'pollfd like' structs, the 'device' listens for events and delivers 'pollfd like' structs
representing them when you read it?..."
Rogier Wolff suggested using "the API that the digital guys suggested",
http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
Joerg Pommnitz pointed out that any new API along these lines should be able to wait for not just
file descriptor events, but also signals and maybe SYSV-IPC. Our synchronization primitives
should certainly be able to do what Win32's WaitForMultipleObjects can, at least.
Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and
sigwaitinfo() was a superset of the API proposed in
http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that you keep the signal
blocked at all times if you're interested in performance; instead of the signal being delivered
asynchronously, the process grabs the next one from the queue with sigwaitinfo().
Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and
concluded they're pretty similar.
Alan Cox noted that an older rev of SCT's SIGIO patch is included in 2.3.18ac.
Jordan Mendelson posted some example code showing how to use F_SETSIG.
Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted:
"With a signal dequeuing mechanism, your application is going to get signals destined for various
library components if libraries are using the same mechanism," but the library can set up its own
signal handler, so this shouldn't affect the program (much).
Doug Royer noted that he'd gotten 100,000 connections on Solaris 2.6 while he was working on
the Sun calendar server. Others chimed in with estimates of how much RAM that would require on
Linux, and what bottlenecks would be hit.
Interesting reading!
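
The F_SETSIG approach discussed in that thread (keep the signal blocked, then pull events off the queue synchronously) might look roughly like the sketch below. It is Linux-specific and only an illustration, not the code anyone posted: the helper names (block_event_signal, watch_fd, next_ready_fd, fsetsig_demo) are made up here, it uses sigtimedwait() rather than sigwaitinfo() so the demo cannot hang, and it uses a pipe in place of a client socket. A real server would also need to handle signal queue overflow.

```c
#define _GNU_SOURCE            /* F_SETSIG is a Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Keep `sig` blocked so events queue up instead of interrupting us. */
int block_event_signal(int sig)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, sig);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Ask the kernel to queue realtime signal `sig` when fd's readiness changes. */
int watch_fd(int fd, int sig)
{
    int flags;
    assert(sig >= SIGRTMIN && sig <= SIGRTMAX);   /* realtime signals queue */
    flags = fcntl(fd, F_GETFL);
    if (flags < 0) return -1;
    if (fcntl(fd, F_SETOWN, getpid()) < 0) return -1;
    if (fcntl(fd, F_SETSIG, sig) < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
}

/* Dequeue the next event; returns the ready fd, or -1 on timeout or error. */
int next_ready_fd(int sig, int timeout_ms)
{
    sigset_t set;
    siginfo_t info;
    struct timespec ts;
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (long)(timeout_ms % 1000) * 1000000L;
    sigemptyset(&set);
    sigaddset(&set, sig);
    if (sigtimedwait(&set, &info, &ts) < 0) return -1;
    return info.si_fd;          /* filled in because F_SETSIG was used */
}

/* Demo: a pipe stands in for a client connection. Returns 0 on success. */
int fsetsig_demo(void)
{
    int p[2], ready;
    int sig = SIGRTMIN;
    if (pipe(p) != 0) return -1;
    if (block_event_signal(sig) != 0 || watch_fd(p[0], sig) != 0) return -1;
    if (write(p[1], "x", 1) != 1) return -1;      /* makes p[0] readable */
    ready = next_ready_fd(sig, 1000);
    close(p[0]);
    close(p[1]);
    return ready == p[0] ? 0 : -1;
}
```

A server loop would call next_ready_fd() repeatedly and do nonblocking reads or writes on whichever descriptor comes back.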

Limits on open filehandles
Any Unix: the limits set by ulimit or setrlimit.
Solaris: see the Solaris FAQ, question 3.46 (or thereabouts; they renumber the questions
periodically).
FreeBSD:
Edit /boot/loader.conf, add the line
set kern.maxfiles=XXXX

where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous
reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and
says
"FWIW: You can't actually tune the maximum number of connections in FreeBSD
trivially, via sysctl.... You have to do it in the /boot/loader.conf file.
The reason for this is that the zalloci() calls for initializing the sockets and tcpcb
structures zones occurs very early in system startup, in order that the zone be both type
stable and that it be swappable.
You will also need to set the number of mbufs much higher, since you will (on an
unmodified kernel) chew up one mbuf per connection for tcptempl structures, which
are used to implement keepalive."
Another reader says
"As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have
to worry about one mbuf being chewed up per connection."
See also:
the FreeBSD handbook
SYSCTL TUNING, LOADER TUNABLES, and KERNEL CONFIG TUNING in 'man
tuning'
The Effects of Tuning a FreeBSD 4.3 Box for High Performance, Daemon News, Aug 2001
postfix.org tuning notes, covering FreeBSD 4.2 and 4.4
the Measurement Factory's notes, circa FreeBSD 4.3
OpenBSD: A reader says
"In OpenBSD, an additional tweak is required to increase the number of open
filehandles available per process: the openfiles-cur parameter in /etc/login.conf needs
to be increased. You can change kern.maxfiles either with sysctl -w or in sysctl.conf
but it has no effect. This matters because as shipped, the login.conf limits are a quite
low 64 for nonprivileged processes, 128 for privileged."
Linux: See Bodo Bauer's /proc documentation. On 2.4 kernels:
echo 32768 > /proc/sys/fs/file-max

increases the system limit on open files, and
ulimit -n 32768

increases the current process' limit.
On 2.2.x kernels,
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max

increases the system limit on open files, and
ulimit -n 32768

increases the current process' limit.
I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file
descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000
file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using
initscript and pam_limit.
In older 2.2 kernels, though, the number of open files per process is still limited to 1024, even with
the above changes.
See also Oskar's 1998 post, which talks about the per-process and system-wide limits on file
descriptors in the 2.0.36 kernel.
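
The per-process limit that "ulimit -n" adjusts from the shell can also be raised from inside a server with getrlimit()/setrlimit(). The following is a minimal sketch, assuming a POSIX system; the helper name raise_nofile_limit is hypothetical. It raises only the soft limit and never asks for more than the hard limit, which an unprivileged process cannot exceed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE toward `wanted`, capped at the hard limit.
 * On success returns 0 and stores the resulting soft limit in *result.
 * The soft limit is never lowered.
 */
int raise_nofile_limit(rlim_t wanted, rlim_t *result)
{
    struct rlimit rl;
    assert(result != NULL);
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (wanted > rl.rlim_max)
        wanted = rl.rlim_max;          /* cannot exceed the hard limit */
    if (wanted > rl.rlim_cur) {
        rl.rlim_cur = wanted;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
    }
    *result = rl.rlim_cur;
    return 0;
}
```

A server might call raise_nofile_limit(32768, &soft) at startup and log the value it actually got; the system-wide limits described above still apply on top of this.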

Limits on threads
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid
running out of virtual memory. You can set this at runtime with pthread_attr_init() and
pthread_attr_setstacksize() if you're using pthreads.
Solaris: it supports as many threads as will fit in memory, I hear.
Linux 2.6 kernels with NPTL: /proc/sys/vm/max_map_count may need to be increased to go above
32000 or so threads. (You'll need to use very small stack threads to get anywhere near that number
of threads, though, unless you're on a 64 bit processor.) See the NPTL mailing list, e.g. the thread
with subject "Cannot create more than 32K threads?", for more info.
Linux 2.4: /proc/sys/kernel/threads-max is the max number of threads; it defaults to 2047 on my
Red Hat 8 system. You can increase this as usual by echoing new values into that file, e.g.
"echo 4000 > /proc/sys/kernel/threads-max"
Linux 2.2: Even the 2.2.13 kernel limits the number of threads, at least on Intel. I don't know what
the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this
limit. It appears to be integrated into 2.3.20.
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel.
Wow. This document steps you through a lot of stuff that would be hard to figure out yourself, but
is somewhat dated.
Java: See Volano's detailed benchmark info, plus their info on how to tune various systems to
handle lots of threads.

Java issues
Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model.
There was a way to do nonblocking reads, but no way to do nonblocking writes.
In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and
some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!
HP's java also includes a Thread Polling API.
In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that
they have advantages over blocking sockets in servers handling many (up to 10000) connections. His
class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance
with 10000 connections are available.
See also Dean Gaudet's essay on the subject of Java, network I/O, and threads, and the paper by Matt
Welsh on events vs. worker threads.
Before NIO, there were several proposals for improving Java's networking APIs:
Matt Welsh's Jaguar system proposes preserialized objects, new Java bytecodes, and memory
management changes to allow the use of asynchronous I/O with Java.
Interfacing Java to the Virtual Interface Architecture, by C-C. Chang and T. von Eicken, proposes
memory management changes to allow the use of asynchronous I/O with Java.
JSR-51 was the Sun project that came up with the java.nio package. Matt Welsh participated (who
says Sun doesn't listen?).

Other tips
Zero-Copy
Normally, data gets copied many times on its way from here to there. Any scheme that eliminates
these copies to the bare physical minimum is called "zero-copy".
Thomas Ogrisegg's zero-copy send patch for mmaped files under Linux 2.4.17-2.4.20.
Claims it's faster than sendfile().
IO-Lite is a proposal for a set of I/O primitives that gets rid of the need for many copies.
Back in 1999, Alan Cox noted that zero-copy is sometimes not worth the trouble. (He did like
sendfile(), though.)
Ingo implemented a form of zero-copy TCP in the 2.4 kernel for TUX 1.0 in July 2000, and
says he'll make it available to userspace soon.
Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD; the idea
seems to be that if you call write() or read() on a socket, the pointer is page-aligned, and the
amount of data transferred is at least a page, *and* you don't immediately reuse the buffer,
memory management tricks will be used to avoid copies. But see followups to this message
on linux-kernel for people's misgivings about the speed of those memory management tricks.
According to a note from Noriyuki Soda:
Sending side zero-copy is supported since NetBSD-1.6 release by specifying
"SOSEND_LOAN" kernel option. This option is now default on NetBSD-current
(you can disable this feature by specifying "SOSEND_NO_LOAN" in the kernel
option on NetBSD_current). With this feature, zero-copy is automatically
enabled, if data more than 4096 bytes are specified as data to be sent.
The sendfile() system call can implement zero-copy networking.
The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a
file. This lets the OS do it as efficiently as possible. It can be used equally well in servers
using threads or servers using nonblocking I/O. (In Linux, it's poorly documented at the
moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this. See
also Exploring The sendfile System Call by Jeff Tranter in Linux Gazette issue 91.) Rumor
has it, ftp.cdrom.com benefitted noticeably from sendfile().
A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25
2001.
One developer using sendfile() with FreeBSD reports that using POLLWRBAND instead of
POLLOUT makes a big difference.
Solaris 8 (as of the July 2001 update) has a new system call 'sendfilev'. A copy of the man
page is here. The Solaris 8 7/01 release notes also mention it. I suspect that this will be most
useful when sending to a socket in blocking mode; it'd be a bit of a pain to use with a
nonblocking socket.
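
For illustration, here is roughly what calling sendfile() looks like with the Linux signature from <sys/sendfile.h> (FreeBSD's sendfile() takes different arguments, so this sketch does not apply there as written). The helper names send_file_range and sendfile_demo are hypothetical, and the demo uses a temporary file and a Unix socketpair in place of a real document and client connection.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>      /* Linux-specific */
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send up to `count` bytes of in_fd, starting at *offset, to out_fd.
 * The kernel advances *offset as it goes. Returns the number of bytes sent
 * (short only at end of file), or -1 with errno set. On a nonblocking
 * socket errno may be EAGAIN; the caller should then wait for writability
 * and call again with the updated *offset.
 */
ssize_t send_file_range(int out_fd, int in_fd, off_t *offset, size_t count)
{
    size_t left = count;
    assert(offset != NULL);
    while (left > 0) {
        ssize_t n = sendfile(out_fd, in_fd, offset, left);
        if (n > 0) {
            left -= (size_t)n;
        } else if (n == 0) {
            break;                      /* end of file */
        } else if (errno != EINTR) {
            return -1;
        }
    }
    return (ssize_t)(count - left);
}

/* Demo: push a small temporary file through a socketpair. 0 on success. */
int sendfile_demo(void)
{
    static const char msg[] = "hello, c10k";
    const size_t len = sizeof msg - 1;
    char buf[32];
    int sv[2];
    off_t off = 0;
    ssize_t sent, got = -1;
    FILE *tmp = tmpfile();
    if (tmp == NULL)
        return -1;
    if (fwrite(msg, 1, len, tmp) != len || fflush(tmp) != 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        fclose(tmp);
        return -1;
    }
    sent = send_file_range(sv[0], fileno(tmp), &off, len);
    if (sent == (ssize_t)len)
        got = read(sv[1], buf, sizeof buf);
    close(sv[0]);
    close(sv[1]);
    fclose(tmp);
    return (got == (ssize_t)len && memcmp(buf, msg, len) == 0) ? 0 : -1;
}
```

Passing the offset by pointer means the file's own read position is left alone, so several connections can share one open descriptor for the same file.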
Avoid small frames by using writev (or TCP_CORK)
A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames,
which helps a bit e.g. when there are lots of little write() calls you can't bundle together for some
reason. Unsetting the option flushes the buffer. Better to use writev(), though...
See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about
TCP_CORK and a possible alternative MSG_MORE.
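
As a rough illustration of both options, the sketch below sends a header and a body with a single writev() call, and shows how TCP_CORK would be toggled on a Linux TCP socket. The function names (write_response, set_cork, writev_demo) are hypothetical. The demo exercises only writev(), over a pipe; set_cork() needs a real TCP socket and is not exercised here.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>       /* TCP_CORK (Linux) */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hand header and body to the kernel in one system call, so they can share
 * frames. Returns what writev() returns; on a nonblocking socket that may
 * be a short count, which the caller has to handle.
 */
ssize_t write_response(int fd, const char *hdr, size_t hdr_len,
                       const char *body, size_t body_len)
{
    struct iovec iov[2];
    assert(hdr != NULL && body != NULL);
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = body_len;
    return writev(fd, iov, 2);
}

/* Cork (on = 1) or uncork (on = 0) a TCP socket; uncorking flushes. */
int set_cork(int tcp_fd, int on)
{
    return setsockopt(tcp_fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Demo over a pipe: both pieces arrive as one contiguous byte stream. */
int writev_demo(void)
{
    const char *hdr = "HTTP/1.0 200 OK\r\n\r\n";
    const char *body = "hi";
    const size_t total = strlen(hdr) + strlen(body);
    char buf[64];
    int p[2];
    ssize_t sent, got;
    if (pipe(p) != 0)
        return -1;
    sent = write_response(p[1], hdr, strlen(hdr), body, strlen(body));
    got = read(p[0], buf, sizeof buf);
    close(p[0]);
    close(p[1]);
    if (sent != (ssize_t)total || got != (ssize_t)total)
        return -1;
    return (memcmp(buf, hdr, strlen(hdr)) == 0 &&
            memcmp(buf + strlen(hdr), body, strlen(body)) == 0) ? 0 : -1;
}
```

With TCP_CORK the pattern would be set_cork(fd, 1), several small write() calls, then set_cork(fd, 0) to flush.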
Behave sensibly on overload.
[Provos, Lever, and Tweedie 2000] notes that dropping incoming connections when the server is
overloaded improved the shape of the performance curve, and reduced the overall error rate. They
used a smoothed version of "number of clients with I/O ready" as a measure of overload. This
technique should be easily applicable to servers written with select, poll, or any system call that
returns a count of readiness events per call (e.g. /dev/poll or sigtimedwait4()).
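
One way to get a smoothed ready count is an exponentially weighted moving average of the value returned by each select() or poll() call. The sketch below is an assumption for illustration: the paper's exact smoothing method is not given here, and the type and function names (overload_meter and friends) and the parameters alpha and limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smoothed "number of clients with I/O ready", plus a drop threshold. */
typedef struct {
    double alpha;   /* weight of the newest sample, 0 < alpha <= 1 */
    double limit;   /* smoothed ready count above which we shed load */
    double avg;     /* current smoothed value */
} overload_meter;

void overload_init(overload_meter *m, double alpha, double limit)
{
    assert(m != NULL && alpha > 0.0 && alpha <= 1.0);
    m->alpha = alpha;
    m->limit = limit;
    m->avg = 0.0;
}

/* Feed in the readiness count returned by the latest poll()/select(). */
void overload_update(overload_meter *m, int nready)
{
    m->avg = m->alpha * (double)nready + (1.0 - m->alpha) * m->avg;
}

/* Nonzero means new connections should be dropped for now. */
int overload_should_drop(const overload_meter *m)
{
    return m->avg > m->limit;
}
```

In an event loop, the server would call overload_update() after every poll() and, while overload_should_drop() is true, accept() and immediately close() new connections (or stop accepting) instead of adding them to the set.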
Some programs can benefit from using non-Posix threads.
Not all threads are created equal. The clone() function in Linux (and its friends in other operating
systems) lets you create a thread that has its own current working directory, for instance, which can
be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of
native threads rather than pthreads.
Caching your own data can sometimes be a win.
"Re: fix for hybrid server problems" by Vivek Sadananda Pai on new-httpd,
May 9th, states:
"I've compared the raw performance of a select-based server with a multiple-process
server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a
marginal difference in performance stemming from the software architecture. The big
performance win for select-based servers stems from doing application-level caching.
While multiple-process servers can do it at a higher cost, it's harder to get the same
benefits on real workloads (vs microbenchmarks). I'll be presenting those
measurements as part of a paper that'll appear at the next Usenix conference. If you've
got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/"

Other limits
Old system libraries might use 16 bit variables to hold file handles, which causes trouble above
32767 handles. glibc2.1 should be ok.
Many systems use 16 bit variables to hold process or thread id's. It would be interesting to port the
Volano scalability benchmark to C, and see what the upper limit on number of threads is for the
various operating systems.
Too much thread-local memory is preallocated by some operating systems; if each thread gets
1MB, and total VM space is 2GB, that creates an upper limit of 2000 threads.
Look at the performance comparison graph at the bottom of
http://www.acme.com/software/thttpd/benchmarks.html. Notice how various servers have trouble
above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know.
Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux
2.2.0-2.2.6 had, and the OS or http daemon has a hard limit on the number of connections open,
you would expect exactly this behavior. There may be other causes.

Kernel Issues
For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel
Traffic, the Linux-Kernel mailing list, and my Mindcraft Redux page.
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of
http and smb clients, in which they failed to see good results from Linux. See also my article on
Mindcraft's April 1999 Benchmarks for more info.
See also The Linux Scalability Project. They're doing interesting work, including Niels Provos' hinting
poll patch, and some work on the thundering herd problem.
See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.
Mohit Aron writes that rate-based clocking in TCP can improve HTTP response time
over 'slow' connections by 80%.

Measuring Server Performance
Two tests in particular are simple, interesting, and hard:
1. raw connections per second (how many 512 byte files per second can you serve?)
2. total transfer rate on large files with many slow clients (how many 28.8k modem clients can
simultaneously download from your server before performance goes to pot?)
Jef Poskanzer has published benchmarks comparing many web servers. See
http://www.acme.com/software/thttpd/benchmarks.html for his results.
I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.
Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's
worth a read.
IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.

Examples
Nginx is a web server that uses whatever high-efficiency network event mechanism is available on the
target OS. It's getting popular; there are even two books about it.

Interesting select()-based servers
thttpd: Very simple. Uses a single process. It has good performance, but doesn't scale with the
number of CPU's. Can also use kqueue.
mathopd. Similar to thttpd.
fhttpd
boa
Roxen
Zeus, a commercial server that tries to be the absolute fastest. See their tuning guide.
The other non-Java servers listed at http://www.acme.com/software/thttpd/benchmarks.html
BetaFTPd
Flash-Lite: web server using IO-Lite.
Flash: An efficient and portable Web server; uses select(), mmap(), mincore()
The Flash web server as of 2003: uses select(), modified sendfile(), async open()
xitami: uses select() to implement its own thread abstraction for portability to systems without
threads.
Medusa: a server-writing toolkit in Python that tries to deliver very high performance.
userver: a small http server that can use select, poll, epoll, or sigio

Interesting /dev/poll-based servers
N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc.
USENIX 2000, San Diego, California (June, 2000).] Describes a version of thttpd modified to
support /dev/poll. Performance is compared with phhttpd.

Interesting epoll-based servers
ribs2
cmogstored: uses epoll/kqueue for most networking, threads for disk and accept4

Interesting kqueue()-based servers
thttpd (as of version 2.21?)
Adrian Chadd says "I'm doing a lot of work to make squid actually LIKE a kqueue IO system"; it's
an official Squid subproject; see http://squid.sourceforge.net/projects.html#commloops. (This is
apparently newer than Benno's patch.)

Interesting realtime signal-based servers
Chromium's X15. This uses the 2.4 kernel's SIGIO feature together with sendfile() and
TCP_CORK, and reportedly achieves higher speed than even TUX. The source is available under a
community source (not open source) license. See the original announcement by Fabio Riccardi.
Zach Brown's phhttpd: "a quick web server that was written to showcase the sigio/siginfo event
model. consider this code highly experimental and yourself highly mental if you try and use it in a
production environment." Uses the siginfo features of 2.3.21 or later, and includes the needed
patches for earlier kernels. Rumored to be even faster than khttpd. See his post of 31 May 1999 for
some notes.

Interesting thread-based servers
Hoser FTPD. See their benchmark page.
Peter Eriksson's phttpd and
pftpd
The Java-based servers listed at http://www.acme.com/software/thttpd/benchmarks.html
Sun's Java Web Server (which has been reported to handle 500 simultaneous clients)

Interesting in-kernel servers
khttpd
"TUX" (Threaded linUX webserver) by Ingo Molnar et al. For 2.4 kernel.

Other interesting links
Jeff Darcy's notes on high-performance server design
Ericsson's ARIES project: benchmark results for Apache 1 vs. Apache 2 vs. Tomcat on 1 to 12
processors
Prof. Peter Ladkin's Web Server Performance page.
Novell's FastCache: claims 10000 hits per second. Quite the pretty performance graph.
Rik van Riel's Linux Performance Tuning site

Translations
Belorussian translation provided by Patric Conrad at Ucallweconn

Changelog
2011/07/21
Added nginx.org
$Log: c10k.html,v $
Revision 1.212 2006/09/02 14:52:13 dank
added asio
Revision 1.211 2006/07/27 10:28:58 dank
Link to Cal Henderson's book.
Revision 1.210 2006/07/27 10:18:58 dank
Listify polyakov links, add Drepper's new proposal, note that FreeBSD 7 might move to 1:1
Revision 1.209 2006/07/13 15:07:03 dank
link to Scale! library, updated Polyakov links
Revision 1.208 2006/07/13 14:50:29 dank
Link to Polyakov's patches
Revision 1.207 2003/11/03 08:09:39 dank
Link to Linus's message deprecating the idea of aio_open
Revision 1.206 2003/11/03 07:44:34 dank
link to userver
Revision 1.205 2003/11/03 06:55:26 dank
Link to Vivek Pei's new Flash paper, mention great specweb99 score

Copyright 1999-2014 Dan Kegel
[email protected]
Last updated: 5 February 2014