inDatabases,FileSystems,andStreams
BedaChristophHammerschmidt1,ChristianWerner3,YlvaBrandt3,
VolkerLinnemann2,SvenGroppe2,andStefanFischer3
1
OracleCorporation,400OracleParkway,4OP408,RedwoodShores,CA94065,USA
beda.hammerschmidt@oracle.co
2
InstituteofInformationSystems,UniversityofLuebeck,Germany
{groppe,linnemann}@ifis.uni-luebeck.de3
InstituteofTelematics,UniversityofLuebeck,Germany{werner,brandt,fischer}@itm.uni-luebeck.de
Abstract.Althoughthenative(tree-like)storageofXMLdatabecomesmoreandmoreimportanttherewillbeanenduringdemandtomanageXMLdatainitstextualrepresentation,forinstanceinrelationalstructuresorfilesystems.XMLdatahastobewellformedbydefinitionandadditionally,inmanycases,ithastobevalidaccordingtoagivenXMLschema.BecausetheXMLcolumntypesareoftenderivedfromtexttypes(e.g.CLOBs)guaranteeingwell-formednessaswellasvalidityisnottrivial.Andevenworse,forfrequentlymodifieddataitisusuallytooexpensivetore-validatethewholeXMLdataaftereachupdate–butwaivingre-validationmayleadtoinconsistenciesandmalfunctionsofapplications.Inthispaperwepresentaschema-awarepushdownautomaton(i.e.astackmachine)thatvalidatesanXMLstring/stream.Usinganelement/state-index,thepushdownau-tomatonisabletore-validatelocalmodificationsofthedatawhileguaranteeingoverallvalidity.Updateoperations(e.g.SQLXML,XQueryupdates)arevali-datedbeforeexecutingthem.
1IntroductionandMotivation
ThestorageofXMLdocumentsanddataindatabasemanagementsystemsisadailytask,althoughnosinglestoragetechnologyhasprevailed.Dependingontheapplicationconstraintsonemaychooseanativestoragesystem(e.g.[10,27])or(object-)relationalsystemsthatwereenabledforXML.AcommonapproachistoshredtheXMLdataandmapittocorrespondingtables.AnotlesspopularapproachistostoretheXMLdatainatextualorbinaryrepresentationinaspecialcolumntypeorinthefilesystem.
Here,theoriginalstructureoftheXMLdataispreserved.Therefore,XMLquerylan-guageslikeXPathandXQueryandXMLprocessorslikeXSLTenginescanbeappliedinamorenaturalwaycomparedtotheshreddingapproachthatneedstoreconstructtheXMLdataortorewritetheinstructionstoSQL.Inaddition,thisrepresentationasase-quenceoftagsisthenaturalchoicewhentransmittingXMLovernetworks.CommercialdatabasemanagementsystemsliketheOracleDatabase10g,theMicrosoftSQLServer,andIBMDB2offerthesetwostoringapproachesalthoughnativestorageofXMLdatainrelationaldatabaseshasbeenpropagatedrecentlyin[4]andin[22].Thedifferences
Y.Ioannidis,B.Novikov,andB.Rachev(Eds.):ADBIS2007,LNCS4690,pp.314–329,2007.cSpringer-VerlagBerlinHeidelberg2007
IncrementalValidationofString-BasedXMLData315
andadvantagesofdifferentstoragetechniqueshavebeendiscussedextensively.Theyarenotinthefocusofthispaper.WeconcentrateontheproblemofefficientandpartialvalidationofXMLstringsstoredinaspecializedstringcolumntype,inthefilesystems,orasastreaminnetworkcommunication.
InmanyapplicationstheXMLdatareferstoanXMLSchema[34]thatdefinesthestructureandtheuseddatatypes.WheneveranewrowwithXMLcontentisaddedintoatableitisvalidatedbyaparserthatprocessesthewholeXMLdata.ModificationsofthedatamayviolatetheXMLSchema;thishastobeavoidedbyre-validatingthedata.Inspiteofthefactthatthedataismostoftenmodifiedlocally(e.g.addinganauthorchildtoabookelement)thevalidatingparserprocessesthewholedata.Thisraiseshighcomputationcosts,especiallywhenstoringlargepiecesofXMLdata.Inconsequence,thevalidationisnotperformedaftereachmodificationleadingtopossiblyinconsistentdatathatcannotbecorrected.
InthispaperwepresentanovelapproachforvalidatingXMLdatainitstextualrepresentation.Ourapproach
–istimeefficient,sincethevalidatingautomatonisschema-awareandcreatedonlyonce.ThevalidationtimeislinearinthesizeoftheXMLdata.Schemaviolationsaredetectedimmediatelybecausenotransitionisfoundinourparserautomatonforthenext(invalid)tag.
–ismemoryefficient:unlikeDOM-basedapproacheswedonotloadthefullXMLdataintomainmemory.TheonlymemoryisastackthatmaygrowlinearlyinthedepthoftheXMLdata.
–supportsrecursiveschemes(seeSection2).
–allowsthelocalandincrementalvalidationoftheXMLstring.Usingacombinedelement/state-indexweenabledirectaccesstothestatesinaPush-DownAutoma-ton(PDA)andtoelementsintheXMLdata.ThisisparticularlyadvantageousforXMLdatathatisfrequentlyupdated,sinceonlythisparthastoberevalidatedwhileguaranteeingglobalvalidity.Updateoperationsarevalidatedbeforeexecutingthem.–canbeusedforvariouspurposeswhereXMLdataisrepresentedasastringorastreamandneedstobewellformedandvalid.Examplesincludetext-baseddatabasecolumns,filesystems,andnetworkcommunications(e.g.SOAPmessages).Theremainderofthispaperisorganizedasfollows:westartwithanoverviewonau-tomatausedforvalidationofXMLdatainSection2.Section3introducesandfor-malizesthevalidationproblemforstreamingXMLdata.Ourpushdownautomatonisextendedwithanelement/state-indexinSection4.WepresentexperimentalresultsinSection5andfinishthispaperwithaconclusioninSection6.
2RelatedWork
RelatedworkinthecontextofXMLvalidationandautomatatheoryismanifold.[28]dealswithdifferentclassesofcomplexityforrecursiveandnon-recursiveDTDs.Asrealworld’sXMLSchemesmayberecursive(e.g.tomodeldynamichierarchies),weignorenon-recursive(lessexpressive)approachesinthispaper.Thework[29]introducesthedifferenceofvalidationandstrongvalidationforstreamingXMLdata.Strongvalida-tionincludesthetestofwell-formednesswhereasnon-strongvalidationassumesthat
316B.C.Hammerschmidtetal.
thedataiswell-formedandfailswhenthisisnotthecase.BecauseXMLupdatesareusuallygiveninatextualrepresentation(e.g.SQLXML,XQueryupdates)checkingwell-formednessisamustandimpliesthattheautomatonusesamemory(e.g.astack)tomatchopeningtagsandclosingtags.Therefore,automatawithoutanymemorycan-notcheckwell-formednessofXMLdatarepresentedasastring.Additionally,eveniftheXMLdataiswell-formedastackmayberequiredtocheckthevalidityofsomeschemes(seeExample1inSection3).
Theideaofapushdownautomatonforstrongvalidationispresentedin[29],butincontrasttoourapproach,thewholeXMLdatahastobevalidatedleadingtosignifi-cantperformancedegradation.Additionally,[29]doesnotpresentanimplementationoralgorithm.
[8]presentsaone-counterautomatonforstrongvalidationofXMLdocuments.Asasinglecountervariableisusedinsteadofastack,onlyasubsetofDTDscanbevali-dated,namelythesocalled(restricted)one-counterlanguages.Anexampleforalan-guage(schema)thatcannotbevalidatedisanbmbmanwherea,bareopeningtagsanda,barethecorrespondingclosingtags.Usingastackitispossibletocountandmatchthetwodifferenttags,withasinglevariableitisimpossible.Additionally,[8]supportsonlyglobalvalidation.
AmultitudeofapproachesforpartialvalidationofXMLdataafterupdateshavebeenpresented(e.g.[2,3,5,6,20,21,25]).AlltheseapproachesassumethattheXMLdataaccordstotheDOMmodel,i.e.itisrepresentedasatreeofnodes.Thisrepresentationisinherentlywell-formedandenablesthedirectandefficientnavigationwithinthenodes.Forinstance,itispossibletoaccessallchildrenofagivennode.TheDOMmodelworkswellfornativeXMLdatabasemanagementsystemswherethetree-likerepresentationispreserved.Incontrast,wefocusonthestringrepresentationofXMLdataasitisusedinXMLcolumntypes,messagesystems(e.g.SOAP)orSQLXMLupdatecommands.InthosesystemstheXMLdataisrepresentedasasequenceoftagsandvalues.ThesequencecanbeseenastheresultofapreordertraversalofthecorrespondingXMLtree.
[12]presentsaschemavalidationtechniquewithoutanyautomata.InthisapproachtheXMLdataisrepresentedasasetofrowsinonededicatedXPath-awaretable.There-fore,itisclosertotheXMLshreddingapproachandtheresultsarehardlytransferabletotheXMLcolumntype.
Ourpreviousworks[13,14,15,16,17]dealwith(autonomous)indexingandupdateissuesinnativeXMLstoragesystemsrepresentingXMLdatausingtheDOM-model.ThepushdownautomatonpresentedinthispaperisalsousedtoconvertanXMLstreamintoabinaryrepresentationinordertocompressit.Thisisachievedbyaddingbinarysymbolstoalltransitionsintheautomaton.DetailsonhowwecompressXMLdata,inparticularSOAPmessages,byusingapushdownautomatoncanbefoundin[32].
3TheXMLValidationProblem
Weassumethatthereaderisfamiliarwithbasictermsofformallanguagesandautomatatheorylikeregulargrammarsandfinitestatemachinesaswellascontext-freegrammars
IncrementalValidationofString-BasedXMLData
317 Fig.1.RecursiveXMLSchemadescription andpushdownautomata(finitestatemachineswithastack).Wereferthereaderto[18]forfundamentalsinautomatatheoryandformallanguages.WemotivatethecomplexityoftheXMLSchemavalidationproblembyasimpleinformalexample: TheXMLSchemadocumentpresentedinFigure1consistsofthreedifferentele-mentsa,bandc.Asamapstobandbmapstoa,thisisarecursiveschemaallow-inganarbitrarydepthoftheXMLdata.WepresentamorecompactrepresentationofthisgrammarinFigure2.Asonecanseetheaelementunderbappearsalwaystwice(minoccurs=maxoccurs=2).Therefore,wehavealefta-elementandarighta-elementthatbelongtogether.WhenreadingtheXMLstringanautomatonhastore-memberwhetherana-elementistheleftone(rightaismandatory)ortherightone(nomorea-elementsallowed).AstheXMLdatahasanarbitrarydepth,theautoma-tonhastorememberarbitrarymanyofthesepairs—thisisonlypossibleifweuseanadditionalmemorylikeastack.Evenifweassumewell-formednessoftheXMLdatatobevalidated,wecannotsupportthisXMLSchema(orthecorrespondingDTD)withoutastack.TheapproachesforXMLstream/stringvalidationthatrelyonlyonafiniteautomatonwithoutfurthermemorycallthiskindofschemaanotrecognizableschema[8,29].DOM-treebasedvalidatorsrelyingontree-automatadonotneedastackbecausetheycancheckdirectlywhetherab-elementhastwoachildrenornot.3.1Formalization WerepresentanXMLdatainstance(alsoknownasXMLdocument)asasequenceofopeningandclosingtagswiththeircontent.WeformalizeXMLandXMLSchemaasfollows: Definition1(XMLdatainstance).AnXMLdatainstancex∈Σ∗isasequenceofarbitrarymanysymbolsσ∈ΣwithΣ=Σt∪Σc.Σtisthetag-alphabetwhereas ∗ Σc={a..z,A..Z,1..9}representsthealphabetforthetextualcontent1.ΣcdenotesanarbitrarysequenceofsymbolsofΣcincludingtheemptysequence.Weidentifyclosing + tagswithaline:aistheclosingtagofa.OpeningtagsaresummarizedinΣtwhereas 1 OnecouldusetheUnicodesymbolsforΣc,aswell. 318B.C.Hammerschmidtetal. − closingtagsareinΣt.Atagcorrespondstoanelementname;givenatagtonemaygetitsnamewiththefunctionname(t). +−+− InourexampleitholdsthatΣt={a,b,c},Σt={a,b,c},andΣt=Σt∪Σt.Pleasebeawarethatthisgenericdefinitionallowstheformulationofnon-well-formedandinvalidXMLdata. WeformalizeXMLSchemaasaRegularTreeGrammar(RTG)G.[24]showsthatandhowanXMLSchemacanbemappedtoaRTG.AregulartreegrammarG=(N,T,P,S)isa4-tupleconsistingofasetofnon-terminalsymbolsN,asetofterminalsymbolsT,asetofproductionrulesPandasetofstartsymbolsS⊆N.ThelanguagedefinedbyGisdenotedwithL(G). AllXMLdatatypesresultinnon-terminalsymbols(writtenincapitallettersorpre-fixedwithxsd:).Allpossibleelementnamesresultinterminalsymbols(smallletters).ThesetofstartsymbolsS⊆Ncontainsallnon-terminalsbelongingtoelementsthataredeclaredonthetop-leveloftheXMLSchemadocument. TheproductionrulesPreflectthestructureofvaliddocuments.Thelefthandsideofarulep∈Pisanon-terminalsymboldefininganelementtype.Itcanbedeterminedbythefunctiontype(p):P→N. Therighthandsideconsistsofanelementname(aterminalsymbol∈T)thatcorre-spondstothetype.Itcanbeaccessedusingthefunctionname(p):P→T Theelementnameisfollowedbyaregularexpressionrwhichdescribesthecontentmodelofthiselement,i.e.typesofchildelements.EachregularexpressionelementmapstotypesinN,simpleXMLSchematypes(e.g.xsd:int),ortonothing(). Forourexample,wedepicttheconversionresultinFigure2.See[24]fordetailsabouttheconversionalgorithm. G=({A,B,C,xsd:int},{a,b,c},P,{A})withP={A→a(B|ε), B→b(AA|C),C→c(xsd:int)} Fig.2.RegulartreegrammaroftheschemainFigure1 Therulesinourexamplehavethefollowingmeaning:TheAelementtypeisrepre-sentedbytheelementnameaandcontainsoneB-typeor(|)ornothing(),whereasBcontainseithertwoAtypesoroneCtypecontaininganintegervalue.Hence,GisequivalenttotheSchemainFigure1. ThecontentmodelsofGcanbetransferredtoautomatatheory:Lemma1.Foreachregularexpressioninaproductionrulep∈Pthereexistsafinitestatemachineacceptingthecontentmodeloftype(p).ThesetofallfinitestatemachinesisdenotedbyFSM.Givenap∈PonegetsitscorrespondingFSMwiththefunctionfsm(p):P→FSM. IncrementalValidationofString-BasedXMLData A:q0B:q1r0Ar1CAC:r2s0319 Bxsd:ints1Fig.3.SetoffinitestatemachinesgeneratedfromtheregulartreegrammarofFigure2 Wegivenoproofforthislemmaasitisstandardautomatatheory.Pleasesee[18]describinghowtotransformregularexpressionsintoFSMs.ThecorrespondingstatemachinesforourexamplearepresentedinFigure3. 3.2ConstructionoftheXMLSchema-AwarePushdownAutomaton XMLisaderivativeoftheDyck-languages(parenthesislanguages)thatcanbede-scribedbycontext-freegrammars[7].Therefore,thepushdownautomatonistheap-propriatemodelforstringprocessingofXMLdata.Wepresenttheconstructionofthepushdownautomaton(PDA)intheAppendix(SectionA).Thepushdownautomatonaspresentedhereisaslightmodificationoftheautomatonaspresentedin[32]inthecontextofcompressiontechniquesforSOAPmessages. OurPDAstartswiththespecialsymbolZastheonlystackcontent.ThePDAacceptsadocumentiftheremaininginputsequenceandthestackareempty. ThemainideaofthePDAistocheckthecontentmodelsoftheelementtypesoftheregulartreegrammarbysimulatingthenestedexecutionoftheFSMs.Forthisreason,eachopeningandclosingtagisrepresentedbyonededicatedstate. Thestack’scontentisusedtomemorizethepathtothecurrenttag:Statesrepre-sentingthecurrentpositioninthecontentmodelarepushedonthestackwhilereadingtheXMLdata.Thetransitionscapturevalidtagsequences,i.e.therestrictionsofthecontentmodels.ThePDAconstructedforoursamplegrammarispresentedinFigure4.Thefirstsymbolofatransition’slabelrepresentsatagt∈ΣttobereadfromtheXMLinputx,thesecondsymbolrepresentsthestateofthecontentmodeltobepoppedfromthestack(pleaseseeSectionAfordetails),andthelastsymbol(s)arethestatestobepushedonthestackwhenusingthistransition.Forinstance,ifwereadanopeningtagathesymbolqoispushedonthestack.Becauseamayhaveanemptycontentmodelweareabletoreadtheclosingtagaafterwards.Thetransitionchecksiftheprevioustagwasabypoppingq0fromthestack. Pleasenotethatnospecialtreatmentforterminatingtheexecutionoftheautoma-tonisneededbecausetheautomatonacceptsadocumentifboththeremaininginputsequenceandthestackcontentareempty.Fora-documentsthisisthecasewhenthecorrespondingCloseAstateisreached. 4EfficientlyValidatingUpdates WhenprocessinganXMLinputstringthecomputationalcomplexityislinear(O(n))withrespecttothestring’slengthn.UsuallyO(n)complexityisnotappropriateifthesizeoftheXMLdataexceedsacertaindegree.InthissectionwedescribehowweadaptourapproachtoincrementallyvalidateXMLstringsbeforeperforminganupdate 320B.C.Hammerschmidtetal. Fig.4.PDAconstructedfromtheRTGofFigure2 operation.Incrementalvalidationmeansthatonlythemodifiedpartisvalidatedinitslocalcontext.Inthiswaythecomplexitybecomeslinearinthesizeofthemodification.Weassumethatanupdateoperationiseitheraninsertionofanewelement(withitschildren)orthedeletionofanelement(includingallitschildren).Formally,anupdateoperationoisrepresentedasatripleo=(path,type,content),wherepathisanXPathexpressionselectingonetomanycontextelementsintheXMLdata,typeindicateswhatkindofupdateitis,andcontentisthetextualrepresentationofanewelementtobeinserted.Anexampleforsuchanupdateoperationmightbethefollowinginsertstatement–formulatedinXQueryUpdate[33,35]: doinsert Here,itholdsthatpath=/site/categories,type=appendandcontent= InordertoincrementallyvalidatetheXMLdatawedonotreaditfromthebeginningbutneeddirectaccesstoanoffsetposition.Forthisreason,weuseanindexwhoseen-triesreferenceopen-tagsintheXMLstring.Thisistechnicallydonebytheuseofchar-acteroffsetvalues.Thekeysoftheindex’sentriesaresimplepathexpressions.Simplepathexpressionscontainthechild-axisandelementnames.TheStrongDataGuide[11]isanexampleforsuchanindexstructurewiththecharacteristicthatallpossiblesim-plepathexpressionsarecovered.Inourworkwerelaxthischaracteristicandassumethatatleastsomepathexpressionsareindexedwithoutregardinghowtoselectthem. IncrementalValidationofString-BasedXMLData321 (WehavediscussedtheproblemoffindingsuitableXMLindexesforagivenapplicationin[14,15]). ThemainideaofourapproachistoextendtheindexentriesinsuchawaythattheyreferencenotonlyintotheXMLstringbutalsotothecorrespondingstatesinthePDAwhileprovidingthetopelementofthestackatthesametime.AccordingtotheconstructionofthePDA,thetopelementdenotestheinitialstateofthecorrespondingfinitestatemachine.Theideaistokeep’snapshots’ofthePDAexecutionintheindex.Forthisreasonwedefinetheindexasfollows: Definition2(Element/State-Index).Theelement/state-indexconsistsofasetofen-triesE.Eachentrye∈Eisa2-tuplee=(key,ref)wherekeyisasimplepathexpressionandref=(offsetlist,state,stackelement)areference:offsetlistisalistofcharacterdistanceseachofwhichisanumberindicatingthecharacterdistancefromtheroot;state∈QisthecorrespondingstateinthePDA,andstackelement∈Γisthetopelementofthecorrespondingstackcontent.QdenotesthesetofallstatesandΓisthesetofallstacksymbols.WedescribedetailsintheAppendix. Weneedonlythetopstackelementdenotingtheinitialstateofthecorrespondingfinitestatemachinebecauseregulartreegrammarsarecontextfree,i.e.wecananalyzethecontentofanelementindependentlyofthepositionoftheelementwithinthedocument.Forexample,thebehaviourofthePDAbasicallyisthesameifweanalyzethefirsta-elementortheseconda-elementwithinab-element.Thestackcontentdiffersonlyinpartsthatareirrelevantforanalyzingthea-element. Anoverviewofanelement/state-indexisgiveninFigure5.Forexample,ifab-elementwiththepath/a/bisupdated,onlythenewb-elementhastobecheckedasfollows:ThePDAskipstherootb-elementandstartsinstateOpenBwithr0astheonlystackelement.Aftersuccessfullycheckingthenewb-element,thePDAstopsinstateCloseBwithanemptystack. TheindexispopulatedwheninsertingdataintotheXMLcolumn:Thestringrep-resentationofthedataisprocessedoncebythePDAinordertocheckvalidity.Whileprocessingthetagsequencewecheckifthecurrenttagbelongstoasimplepathex-pressiontobeindexedornot.Ifthisisthecaseweinsertthecurrentoffset,stateandtopmoststacksymbolasanewentryintotheelement/state-index.AnefficientretrievalofentriesisenabledwhenusingasearchstructurelikeaB-treeorhashtable. AstheXMLdataisahierarchyofelements,weillustrateitasatree(triangle).ThetreesinsidetheXMLdatarepresentsubtrees(elementsinelements).PleasenotethattheXMLdataisstoredasasequenceoftagsinthecolumnandnotasatreeofelementnodeslikeinnativeXMLDBMS. Whenperforminganupdateoperationo=(path,type,content)itislikelythatitspathexpressionpathisnotcoveredbyanentryintheindex.However,ifallelementsselectedbypatharedescendantsofanindexelementwecanstillusethatindextoacceleratethevalidationprocess.WhetheranindexedpathexpressioncanbeusedornotisdeterminedusingtheXPathcontainmentalgorithmofMiklauandSuciu[23].Itdetectsifelementsselectedbyapathexpressionpareasubsetofthoseselectedbyp.Inourcasep=pathandpdenotesapathexpressionoftheindexentries 322B.C.Hammerschmidtetal. Root/a/b/a/b/aXMLDataIndexkey PDAref offsetliststateRoot1Start/a/b4OpenB/a/b/a7,14OpenA stackelement Zr0q0 ThetableshowstheindexfortheautomatongiveninFigure4andforthedocument Fig.5.Theelement/state-indexreferencingXMLdataandPDAstates IndexedelementUpdateIndexedelementUpdateelemento Dpt1,t2,.........,ti,......,to,.......,to,....ti,........,t1 Dp Fig.6.FindingtheupdatepositionintheXMLdatausingtheindex (thekeys).Ifp⊆pwemayusethatentry.Ofcourse,incaseofp⊂p,i.e.p=pwehavetocheckforeachelementoftheoffsetlistwhetherthecorrespondingdatauptothispositioncorrespondstop.Ifthereisnoindexentrywithp⊆p,wemustprocesstheXMLdatafromtherootresultinginaconventionalvalidation. Inthefollowingweassumethatasuitableindexentryisfound.Figure6illustratesthesituation:Theindexreferstoanopen-tagelementtiintheXMLdatax.Thecorre-spondingclosingtagelementticanbefoundeasilybecausexiswellformed. pistherelativesimplepathexpressionstartingfromtheindexedelementtiandleadstotheelementtothatisselectedbytheupdateoperation.pcanbecalculatedbyapplyingXPathdifferencefunctions(fordetailssee[13])andconsistsofasequenceofnavigationalstepstochildelements. TheideaofourapproachistousethePDAforvalidationandtopretendthatthemodifyingoperationohasalreadybeenexecuted.Thevalidationtakesplacefromtitoti.Theoffsetinxtotiisfetchedfromtheindex;thePDAisinitializedwiththesettingsbelongingtothatentry.WhenprocessingxusingthePDAwefollowthereferenceandinitializethePDAanditsstack.Wenowstarttheprocessingofxuntilwereachto.Wehavereachedtheelementselectedbytheupdateoperationoifthenexttaginxcorrespondstothelastlocationstepinp.Beforereadingthattagtowememorizethe IncrementalValidationofString-BasedXMLData323 currentstateofthePDAwithitsstackcontent.Thenextstepsofouralgorithmdependonthetypeofthemodificationoperationo: –Delete:Ifoisadeleteoperationalltagsinthesequenceto,....toaretoberemoved.Again,theclosingtagtocanbeidentifiedbecausexiswell-formed.WeskipalltagsinthatsequenceandcontinuetoreadxusingthePDAuntilwereachti.IfthePDAcouldprocessthisshortenedinput,thenxwouldbestillvalidafterperformingthedeletion.Theoffsetsintheindex’entrieshavetobeupdatedbecausetheupdatemayhavechangedthepositionofsometags.Thisisdonebysubtractingthelengthofthedeletedsequencefromalloffsetsthataregreaterthanthepositionofto.–Append:ThePDAreadsthefullsequenceto,....toandstops.Thenexttagsarethecontentoftheupdateoperationo.AfterreadingthemthePDAresumestoreadfromxuntilreachingti.IfthePDAacceptsthatextendedinput,thenxisstillvalidafterperformingtheappendoperation.Again,wehavetoupdatetheoffsetsbyaddingthelengthofthecontenttothoseoffsetsthatarelargerthanthepositionofto.–Insert:Thiscaseisverysimilartothepreviousonewiththedifferencethatthenewelementisnotasiblingbutachild.Therefore,theinsertiontakesplaceafterreadingto,ifthenewelementisthefirstchild,orjustbeforereadingto,ifitisthelastchild.Analogously,theindexoffsetshavetobeadapted. Asthisalgorithmonlyreadsthetagsfromtitoti,weavoidacomplexityofO(lengthofthewholedocument).Instead,weachieveacomplexityofO(lengthofthemodifiedpartofthedocument).Oursolutionisalsoabletohandlethe QueryingXMLDataUsingtheIndex.Theelement/state-indexcanalsobeusedtoacceleratethepathexpressionswhenexecutingqueryingdatabaseoperations.Althoughitisnotinthefocusofthispaper,wepresentthebasicideathatissimilartotheincre-mentalvalidation:thebestindexentryisdeterminedreferencingtopositionsintheXMLdata;Δpiscalculatedandexecutedonthesepositionsleadingtothetagsthatareselectedbythequery.Wereferto[14,13]forthedetails. 5Experiments Inordertodemonstratethebenefitsofourschema-awarepushdownautomatonforvali-dationwecompareitwithotherapproaches.WeusetheXMark[26]benchmarktooltogenerateXMLdataofdifferentsizes.TheXMarkdatamodelsanauctionscenario.Itsschemaconsistsofroughly80elementtypesandallowstherecursivenestingofsomeelementswithoutrestrictingthedocumentsdepth.BecausetheXMarkschemaispro-videdbyaDTD,weconvertittoanequivalentXMLSchemausingSun’sTrang[30].TheexperimentswereperformedonanIntelP4,2.67GHzwith1GBRAM.TheconstructionofthePDAtook270ms.Itconsistsof237statesand627transitions. 324B.C.Hammerschmidtetal. Table1.TimetovalidatetheXMarksampledata PDAgeneration XMLSpyn/aJDOMParsern/aSaxParsern/aXeniaglobal270msXenialocal270ms 100KB 1000ms1010ms622ms87ms22ms 1MB4200ms1322ms1035ms370ms25ms 10MB28000ms3890ms2122ms1061ms25ms 100MBn/a24690ms14210ms12648ms26ms WehavecreatedXMLdocumentsofdifferentsizesandhavecomparedthetimesforvalidationwiththetwowell-knownparsersJDOM[19]andSAX[9],theXMLEditorXMLSpy[1]andourapproachcalledXenia.WesummarizetheresultsofourexperimentsinTable1. Themeasuredvaluesareaveragedovertenruns.The100MBdocumentcouldnotbeprocessedbytheXMLSpyduetomemoryconstraints.Asonecanseeinthecolumnsfor100KBand10MB,Xeniaissignificantlyfasterbecausethevalidatingautomatonisalreadybuiltwhereastheotherparsershavetoreadandprocesstheschemainformation.ForlargerXMLdatathemeasuredtimesarecomparabletotheSAXparser.BothXeniaandSAXdonotneedtokeepthewholedocumentinthememorybecausetheyoperatestream-oriented.TheJDOMparserandXMLSpykeepthewholeXMLdatainmemory.Theexperimentsforincrementalvalidationsafterlocalupdatesarepresentedinthelastrowofthetable.Theupdatedelementhasasizeof20kB.Asonecansee,thetimetovalidatetheupdateisalmostconstantandnotdependentonthetotalsizeoftheXMLdata.Thisisbecauseonlythemodifiedpart(whichisthesameinallcases)hastobechecked.Itdoesnotmakeanydifferencewhetherwedealwithaninsertoradeleteoperation.ThetimeforconstructingthePDAisgivenseparatelyinthefirstcolumnofTable1becauseinourapproachthePDAhastobegeneratedonlyonceforeachschema.ThesamePDAisusedforallupdatesofdocumentshavingthesameschema. 6Conclusion WehaveseenthatconventionalparsersconsumelineartimewhenvalidatinganXMLstring/stream.Especiallyforlargerdatasetsthatareperiodicallyupdatedthecostsforre-validationbecomeprohibitivelyhigh.Waivingthevalidationmaybeinevitablebutisunwantedinmostcases.Inthispaper,weintroduceaschema-awarepushdownautoma-tonusedforvalidation.UsinganindexstructurecapturessnapshotsofthisautomatonwhileprovidingdirectaccessintotheXMLdata.Usingthisindex,thecostsforre-validationareinthesizeoftheupdateoperationbecauseonlyasmallpartoftheXMLdataisprocessedbytheautomaton.Additionally,thevalidationisperformedbeforeupdatingthedata,sothatinvaliddatacannotbecreated.Experimentalresultsshowtheefficiencyofourapproach. TothebestofourknowledgeourapproachisthefirstoneforpartiallyvalidatingXMLstringsafterlocalupdates.Thisisahighlyrelevanttopicforimplementingeffi-cientnon-nativedatabasemanagementsystemswhichusespecialXMLcolumntypesorforstoringXMLdatainthefilesystem. IncrementalValidationofString-BasedXMLData325 References 1.Altova.XMLSpy.URL:http://www.altova.com 2.Balmin,A.,Papakonstantinou,Y.,Vianu,V.:IncrementalvalidationofXMLdocuments.ACMTrans.DatabaseSyst.29(4),710–751(2004) 3.Barbosa,D.,Mendelzon,A.O.,Libkin,L.,Mignet,L.,Arenas,M.:EfficientIncrementalVal-idationofXMLDocuments.In:ICDE’04:Proceedingsofthe20thInternationalConferenceonDataEngineering,Washington,DC,USA,pp.671–682.IEEEComputerSocietyPress,LosAlamitos(2004) ¨4.Beyer,K.,Cochrane,R.,Josifovski,V.,Kleewein,J.,Lapis,G.,Lohman,G.,Lyle,B.,Ozcan, F.,Pirahesh,H.,Seemann,N.,Truong,T.:SystemRX:OnePartRelational,OnePartXML.In:Proceedingsofthe2005ACMSIGMODInternationalConferenceonManagementofData,Baltimore,Maryland,USA,June14-162005,pp.347–358.ACMPress,NewYork(2005) 5.Bouchou,B.,Alves,M.H.F.:UpdatesandIncrementalValidationofXMLDocuments.In:DBPL,pp.216–232(2003) 6.Bouchou,B.,Alves,M.H.F.,Laurent,D.,Duarte,D.:ExtendingTreeAutomatatoModelXMLValidationUnderElementandAttributeConstraints.In:ICEIS(1),pp.184–190(2003)7.Br¨uggemann-Klein,A.,Wood,D.:Balancedcontext-freegrammars,hedgegrammarsandpushdowncaterpillarautomata.In:ExtremeMarkupLanguages(2004) 8.Chitic,C.,Rosu,D.:OnvalidationofXMLstreamsusingfinitestatemachines.In:WebDB’04:Proceedingsofthe7thInternationalWorkshopontheWebandDatabases,pp.85–90.ACMPress,NewYork,NY,USA(2004) 9.Megginson,D.:SimpleAPIforXML.URL:http://www.saxproject.org/ 10.Fiebig,T.,Helmer,S.,Kanne,C.-C.,Moerkotte,G.,Neumann,J.,Schiele,R.,Westmann,T.: AnatomyofanativeXMLbasemanagementsystem.VLDBJournal11(4),292–314(2002)11.Goldman,R.,Widom,J.:DataGuides:EnablingQueryFormulationandOptimizationin SemistructuredDatabases.In:VLDB’97,Proceedingsof23rdInternationalConferenceonVeryLargeDataBases,pp.436–445(1997) 12.Grust,T.,Klinger,S.:Schemavalidationandtypeannotationforencodedtrees.In:Pro-ceedingsoftheFirstInternationalWorkshoponXQueryImplementation(XIME-P),Paris,France,June2004,pp.55–60(2004) 13.Hammerschmidt,B.C.:KeyX:SelectiveKey-OrientedIndexinginNativeXML-Databases.DissertationzumDr.-Ing.,Institutf¨urInformationssysteme,Technisch-NaturwissenschaftlicheFakult¨at,Universit¨atzuL¨ubeck,October,DISDBIS93,Akademis-cheVerlagsgesellschaftAkaGmbH,Berlin2006,ISBN3-89838-493-4(2005) 14.Hammerschmidt,B.C.,Kempa,M.,Linnemann,V.:Aselectivekey-orientedXMLIndex fortheIndexSelectionProbleminXDBMS.In:Galindo,F.,Takizawa,M.,Traunm¨uller,R.(eds.)DEXA2004.LNCS,vol.3180,Springer,Heidelberg(2004) 15.Hammerschmidt,B.C.,Kempa,M.,Linnemann,V.:AutonomousIndexOptimizationin XMLDatabases.In:ProceedingsoftheInternationalWorkshoponSelf-ManagingDatabaseSystems(SMDB2005),Tokyo,Japan,April8-92005,pp.56–65(2005) 16.Hammerschmidt,B.C.,Kempa,M.,Linnemann,V.:OntheIntersectionofXPathExpres-sions.In:Proceedingsofthe9thInternationalDatabaseEngineering&ApplicationSympo-sium(IDEAS2005),Montreal,Canada,July25-27,2005(2005) 17.Hammerschmidt,B.C.,Linnemann,V.:TheIndexUpdateProblemforXMLDatain XDBMS.In:Proceedingsofthe7thInternationalConferenceonEnterpriseInformationSys-tems(ICEIS2005),Miami,USA,pp.27–34(2005) 18.Hopcroft,J.E.,Motwani,R.,Ullman,J.D.:IntroductiontoAutomataTheory,Languages,and Computation.AddisonWesleyPublishingCompany,Reading(2001) 326B.C.Hammerschmidtetal. 19.Hunter,J.,McLaughlin.:JDOM1.0.URL:http://www.jdom.org/ 20.Sang-Kyun,K.,Myungcheol,L.,Kyu-Chul,L.:ImmediateandPartialValidationMechanism fortheConflictResolutionofUpdateOperationsinXMLDatabases.In:Meng,X.,Su,J.,Wang,Y.(eds.)WAIM2002.LNCS,vol.2419,pp.387–396.Springer,Heidelberg(2002)21.Sang-Kyun,K.,Myungcheol,L.,Kyu-Chul,L.:ValidationofXMLDocumentUpdates ˇep´BasedonXMLSchemainXMLDatabases.In:Maˇr´ık,V.,Stˇankov´a,O.,Retschitzegger, W.(eds.)DEXA2003.LNCS,vol.2736,pp.98–108.Springer,Heidelberg(2003) 22.Liu,Z.H.,Krishnaprasad,M.,Arora,V.:NativeXqueryprocessinginOracleXMLDB.In: Proceedingsofthe2005ACMSIGMODInternationalConferenceonManagementofData,Baltimore,Maryland,USA,June14-162005,pp.828–833.ACMPress,NewYork(2005)23.Miklau,G.,Suciu,D.:ContainmentandequivalenceforafragmentofXPath.Journalofthe ACM51(1),2–45(2004) 24.Murata,M.,Lee,D.,Mani,M.,Kawaguchi,K.:TaxonomyofXMLschemalanguagesusing formallanguagetheory.ACMTrans.Inter.Tech.5(4)(2005) 25.Papakonstantinou,Y.,Vianu,V.:IncrementalValidationofXMLDocuments.In:Calvanese, D.,Lenzerini,M.,Motwani,R.(eds.)ICDT2003.LNCS,vol.2572,pp.47–63.Springer,Heidelberg(2002) 26.Schmidt,A.,Waas,F.,Kersten,M.L.,Carey,M.J.,Manolescu,I.,Busse,R.:XMark:A BenchmarkforXMLDataManagement.In:ProceedingsoftheInternationalConferenceonVeryLargeDataBases(VLDB),HongKong,China,pp.974–985(2002)27.Sch¨oning,H.:Tamino-ADBMSdesignedforXML.In:Proceedingsofthe17thInterna-tionalConferenceonDataEngineering,Heidelberg,Germany,April2-6,2001,pp.149–154. IEEEComputerSociety,LosAlamitos(2001) 28.Segoufin,L.:TypingandqueryingXMLdocuments:somecomplexitybounds.In:PODS ’03:Proceedingsofthetwenty-secondACMSIGMOD-SIGACT-SIGARTsymposiumonPrinciplesofdatabasesystems,pp.167–178.ACMPress,NewYork(2003) 29.Segoufin,L.,Vianu,V.:ValidatingstreamingXMLdocuments.In:PODS’02:Proceedings ofthetwenty-firstACMSIGMOD-SIGACT-SIGARTsymposiumonPrinciplesofdatabasesystems,pp.53–64.ACMPress,NewYork(2002) 30.SunMicrosystems,Inc.Trang:Multi-formatschemaconverterbasedonRELAXNG(May 2006),URL:http://www.thaiopensource.com/relaxng/trang.html 31.Thompson,H.S.,Beech,D.,Maloney,M.,Mendelsohn,N.:XMLSchema part1:Structures2edn.W3CRecommendation(October2004),URL:http://www.w3.org/TR/xmlschema-1 32.Werner,C.,Buschmann,C.,Brandt,Y.,Fischer,S.:CompressingSOAPMessagesbyusing PushdownAutomata.In:ProceedingsoftheIEEEInternationalConferenceonWebServices,Chicago,USA,September2006,IEEEComputerSocietyPress,LosAlamitos(2006) 33.WorldWideWebConsortium(W3C).XQueryUpdateFacilityRequirements(2005),URL: http://www.w3.org/TR/xquery-update-requirements/34.WorldWideWebConsortium(W3C).XMLSchema(2006)URL: http://www.w3.org/XML/Schema 35.WorldWideWebConsortium(W3C).XQueryUpdateFacility(2006)URL: http://www.w3.org/TR/2006/WD-xqupdate-20060711/ AAppendix Inthissection,wedescribehowthepushdownautomatonisconstructedbyusingaregulartreegrammarG=(N,T,P,S).Asaninitialstep,weconstructthefinitestatemachinesrepresentingthecontentmodelsofG.First,wedefinesomerequiredfunctions: IncrementalValidationofString-BasedXMLData327 Definition3(Stackmachinefunctions).ThestatesofanFSMf∈FSMcanbedeterminedbythefunctionstates(f).Acceptingstatescanberetrievedwiththefunc-tionstatesaccept(f),theinitialstatewiththefunctionstateinit(f).Givenastateq∈states(f)thefunctionout(q)returnsalloutgoingtransitionsofqandthefunctionin(q)returnsallincomingtransitions.Eachtransitionthasalabelcorrespondingtoanelementtype∈Nthatcanbedeterminedbythefunctiontype(t).Thestatetowhichtmapscanberetrievedbydest(t). ThecrucialpointnowisthattheFSMsfocusonelement-nodes(tree-model)whereasourinputisanXMLstring(sequenceofsymbols).Weuseapushdownautomaton(PDA)thatsimulatestheexecutionoftheFSMswhileprocessingXMLdataasastring.ThisisdonebypushingthereachedopeningstatesoftheFSMontothestack:Thestackrepresentsthepathtothecurrenttag.TheproofthatforeachDTDdthereexistsaPDAacceptingL(d)hasbeenprovidedbySegoufinandVianu([29]).WenowgiveadetailedconstructionalgorithmforXMLSchema.However,wefirsttakeacloserlookatthetagsequence: Lemma2.AnyXMLstringxconsistsofasequenceoftag-symbols∈Σtandtheir ∗ valuesinΣc.Foranyarbitrarytwoconsecutivesymbolst1,t2∈Σttherearefivedifferentcasesforawell-formedsequencex: + 1.t1∈Σtisthefirstsymbolinx +− 2.t1∈Σt∧t2∈Σt∧name(t1)=name(t2). ++∧t2∈Σt3.t1∈Σt−− 4.t1∈Σt∧t2∈Σt −+ 5.t1∈Σt∧t2∈Σt (e.g. (e.g.)[childcase](e.g.)[parentcase](e.g.)[siblingcase] Forthesakeofsimplicity,weignoreallsymbolsofthetextualcontentalphabetΣc.AsXMLSchematypescanbedefinedbyaregularexpression2itisobviousthatany ∗ sequenceσ1,...σn∈ΣccanbecheckedbyanautomatonwhetherornotitcorrespondstoagivenXMLSchematype. Wegeneratethepushdownautomaton(PDA)acceptingL(G)asfollows: Formally,apushdownautomatonKisa6-tuple;K=(Q,Σ,Γ,Ψ,qstart,Z)whereQisasetofstates,Σistheinputalphabet,Γisthestackalphabet,Ψisasetoftransitionsoftheform(Q×Σ×Γ)→(Q×Γ∗),qstart∈Qistheinitialstate,andZ∈Γistheinitialsymbolofthestack. ThetransitionrulesΨdefinewhatsymbolsinΣandΓhavetobereadinastateinQinordertomovetothenextstatewhilewritingontothestack.ThePDAacceptswhentheinputsequencehasbeencompletelyreadandthestackisempty. Definition4(PDAacceptingL(G)).ThePDAhasonededicatedinitialstatenamedsstart.Foreachtagt∈Σtthereisonecorrespondingstateqt∈Q.Thesetofalltag-statesisdenotedbyQΣt.ForeachXMLSchematype(e.g.xsd:int)thereisonededicatedstate3.ThesetypestatesaredenotedbyQtype.NowwedefineQ={qstart}∪ 23 E.g.theregularexpression(−|ε)(0|1|...|9)∗definesthedecimaltype.Seealso[31].Thelatterstatesrepresentsub-automataforcheckingthetextualcontent. 328B.C.Hammerschmidtetal. Algorithm1.PDAConstructionAlgorithm 1:forallp∈Pdo 2:tag←element(p);3:fsm←fsm(p);4:si←stateinit(fsm)5:Sa←statesaccept(fsm)6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38: iftype(p)∈Sthen AddTransition:(sstart,tag+,z)→(state(tag+),[si,z]);endif forallq∈states(fsm)do ifq=si∧q∈Sathen AddTransition:(state(tag+),tag−,q)→(state(tag−),[]);endif ifq=sithen forallt∈out(q)dotypet←type(t)tagt←tag(typet)qdest←dest(t) + ,q)→AddTransition:(state(tag+),tagt + (state(tagt),[stateinit(fsm(typet)),qdest]);endforendif ifq∈Sathen forallt∈in(q)dotypet←type(t)tagt←tag(typet) − ),tag−,q)→(state(tag−),[]);AddTransition:(state(tagt endforendif foralltin∈in(q)doforalltout∈out(q)dotypein←type(tin)typeout←type(tout)tagout←tag(typeout)tagin←tag(typein) −+),tagout,q)→AddTransition:(state(tagin + (state(tagout),[stateinit(fsm(typeout)),dest(tout)]);endforendforendforendfor Leaf-caseRoot-case Child-case Parent-case Sibling-case IncrementalValidationofString-BasedXMLData329 QΣt∪Qtype.TheinputalphabetΣ=Σt∪Σccontainstagsandcontent.Thefunctionstate(t),wheret∈Σt,returnsthecorrespondingstate∈Q.WedonotimposeanyrestrictionsonthestackalphabetΓ.ThetransitionsΨaregeneratedbyAlgorithm1.Algorithm1issplitintofiveparts–correspondingtothefivetag-sequencecases.WhileiteratingovertheproductionrulesoftheregulartreegrammarG(line1)itischeckedwhichoneofthefivecasesholds.ThePDAsimulatestherunofthesetofFSMsbypushingandpoppingcorrespondingFSM-statesonto/fromthestack. Ifanelementisdefinedastop-levelelementintheXMLSchemaitappearsasastart-symbolinG.Transitionsfromtheinitialstateqstarttothecorrespondingopen-tagstatesarecreated(line7). WeiterateoverallstatesintheFSMwhicharegeneratedinline3.Iftheinitialstateisalsoafinalstate,thenithasanemptycontentmodel,i.e.itisaleaf–aclosingtagisreaddirectlyaftertheopeningtag(line11).Nothinghastobepushedontothestackbecausethesubsequenceforthetaghasbeenprocessedcompletely. OutgoingtransitionsoftheinitialstatesoftheFSMsmaptothefirstelementoftheircontentmodel(child).Therefore,thereisatransitioninthePDAtotheopen-tagstatesofeachofsuchchildren(line18).Thetransitionsoftheparent-casearegeneratedinversely(line25)basedonthefinalstatesoftheFSMs.Here,nothinghastobepushedontothestackbecausethesub-sequencebelongingtotheclosingtaghasbeenprocessedcompletely. ForallstatesintheFSMshavinganincomingandoutgoingtransitionwecreateatransitioninthePDAthatcorrespondstothesibling-case.Here,thePDAreadstheclosing-tagofanelementaandafterwardstheopening-tagofanelementb(line34).WhensomeXMLinputisgiven,theautomatonreadsittagbytagandswitchesintocorrespondingstates.Whentheinputisinvalidtherewillbenotransitionorthetop-moststateonthestackdoesnotmatchtherequirements.Inthiscase,theinputcannotbefullyreadandtheinputisrejected. ThecomplexityofthisconstructionalgorithmisO(n3)withnbeingthesizeofstatesintheFSMs.However,theconstructionofthePDAisperformedonlyoncewhendefiningthecolumntype. 因篇幅问题不能全部显示,请点此查看更多更全内容