热门搜索 :
考研考公
您的当前位置:首页正文

Incremental Validation of String-Based XML Data in Databases, File Systems, and Streams

来源:东饰资讯网
IncrementalValidationofString-BasedXMLData

inDatabases,FileSystems,andStreams

BedaChristophHammerschmidt1,ChristianWerner3,YlvaBrandt3,

VolkerLinnemann2,SvenGroppe2,andStefanFischer3

1

OracleCorporation,400OracleParkway,4OP408,RedwoodShores,CA94065,USA

beda.hammerschmidt@oracle.co

2

InstituteofInformationSystems,UniversityofLuebeck,Germany

{groppe,linnemann}@ifis.uni-luebeck.de3

InstituteofTelematics,UniversityofLuebeck,Germany{werner,brandt,fischer}@itm.uni-luebeck.de

Abstract.Althoughthenative(tree-like)storageofXMLdatabecomesmoreandmoreimportanttherewillbeanenduringdemandtomanageXMLdatainitstextualrepresentation,forinstanceinrelationalstructuresorfilesystems.XMLdatahastobewellformedbydefinitionandadditionally,inmanycases,ithastobevalidaccordingtoagivenXMLschema.BecausetheXMLcolumntypesareoftenderivedfromtexttypes(e.g.CLOBs)guaranteeingwell-formednessaswellasvalidityisnottrivial.Andevenworse,forfrequentlymodifieddataitisusuallytooexpensivetore-validatethewholeXMLdataaftereachupdate–butwaivingre-validationmayleadtoinconsistenciesandmalfunctionsofapplications.Inthispaperwepresentaschema-awarepushdownautomaton(i.e.astackmachine)thatvalidatesanXMLstring/stream.Usinganelement/state-index,thepushdownau-tomatonisabletore-validatelocalmodificationsofthedatawhileguaranteeingoverallvalidity.Updateoperations(e.g.SQLXML,XQueryupdates)arevali-datedbeforeexecutingthem.

1IntroductionandMotivation

ThestorageofXMLdocumentsanddataindatabasemanagementsystemsisadailytask,althoughnosinglestoragetechnologyhasprevailed.Dependingontheapplicationconstraintsonemaychooseanativestoragesystem(e.g.[10,27])or(object-)relationalsystemsthatwereenabledforXML.AcommonapproachistoshredtheXMLdataandmapittocorrespondingtables.AnotlesspopularapproachistostoretheXMLdatainatextualorbinaryrepresentationinaspecialcolumntypeorinthefilesystem.

Here,theoriginalstructureoftheXMLdataispreserved.Therefore,XMLquerylan-guageslikeXPathandXQueryandXMLprocessorslikeXSLTenginescanbeappliedinamorenaturalwaycomparedtotheshreddingapproachthatneedstoreconstructtheXMLdataortorewritetheinstructionstoSQL.Inaddition,thisrepresentationasase-quenceoftagsisthenaturalchoicewhentransmittingXMLovernetworks.CommercialdatabasemanagementsystemsliketheOracleDatabase10g,theMicrosoftSQLServer,andIBMDB2offerthesetwostoringapproachesalthoughnativestorageofXMLdatainrelationaldatabaseshasbeenpropagatedrecentlyin[4]andin[22].Thedifferences

Y.Ioannidis,B.Novikov,andB.Rachev(Eds.):ADBIS2007,LNCS4690,pp.314–329,2007.cSpringer-VerlagBerlinHeidelberg2007󰁭

IncrementalValidationofString-BasedXMLData315

andadvantagesofdifferentstoragetechniqueshavebeendiscussedextensively.Theyarenotinthefocusofthispaper.WeconcentrateontheproblemofefficientandpartialvalidationofXMLstringsstoredinaspecializedstringcolumntype,inthefilesystems,orasastreaminnetworkcommunication.

InmanyapplicationstheXMLdatareferstoanXMLSchema[34]thatdefinesthestructureandtheuseddatatypes.WheneveranewrowwithXMLcontentisaddedintoatableitisvalidatedbyaparserthatprocessesthewholeXMLdata.ModificationsofthedatamayviolatetheXMLSchema;thishastobeavoidedbyre-validatingthedata.Inspiteofthefactthatthedataismostoftenmodifiedlocally(e.g.addinganauthorchildtoabookelement)thevalidatingparserprocessesthewholedata.Thisraiseshighcomputationcosts,especiallywhenstoringlargepiecesofXMLdata.Inconsequence,thevalidationisnotperformedaftereachmodificationleadingtopossiblyinconsistentdatathatcannotbecorrected.

InthispaperwepresentanovelapproachforvalidatingXMLdatainitstextualrepresentation.Ourapproach

–istimeefficient,sincethevalidatingautomatonisschema-awareandcreatedonlyonce.ThevalidationtimeislinearinthesizeoftheXMLdata.Schemaviolationsaredetectedimmediatelybecausenotransitionisfoundinourparserautomatonforthenext(invalid)tag.

–ismemoryefficient:unlikeDOM-basedapproacheswedonotloadthefullXMLdataintomainmemory.TheonlymemoryisastackthatmaygrowlinearlyinthedepthoftheXMLdata.

–supportsrecursiveschemes(seeSection2).

–allowsthelocalandincrementalvalidationoftheXMLstring.Usingacombinedelement/state-indexweenabledirectaccesstothestatesinaPush-DownAutoma-ton(PDA)andtoelementsintheXMLdata.ThisisparticularlyadvantageousforXMLdatathatisfrequentlyupdated,sinceonlythisparthastoberevalidatedwhileguaranteeingglobalvalidity.Updateoperationsarevalidatedbeforeexecutingthem.–canbeusedforvariouspurposeswhereXMLdataisrepresentedasastringorastreamandneedstobewellformedandvalid.Examplesincludetext-baseddatabasecolumns,filesystems,andnetworkcommunications(e.g.SOAPmessages).Theremainderofthispaperisorganizedasfollows:westartwithanoverviewonau-tomatausedforvalidationofXMLdatainSection2.Section3introducesandfor-malizesthevalidationproblemforstreamingXMLdata.Ourpushdownautomatonisextendedwithanelement/state-indexinSection4.WepresentexperimentalresultsinSection5andfinishthispaperwithaconclusioninSection6.

2RelatedWork

RelatedworkinthecontextofXMLvalidationandautomatatheoryismanifold.[28]dealswithdifferentclassesofcomplexityforrecursiveandnon-recursiveDTDs.Asrealworld’sXMLSchemesmayberecursive(e.g.tomodeldynamichierarchies),weignorenon-recursive(lessexpressive)approachesinthispaper.Thework[29]introducesthedifferenceofvalidationandstrongvalidationforstreamingXMLdata.Strongvalida-tionincludesthetestofwell-formednesswhereasnon-strongvalidationassumesthat

316B.C.Hammerschmidtetal.

thedataiswell-formedandfailswhenthisisnotthecase.BecauseXMLupdatesareusuallygiveninatextualrepresentation(e.g.SQLXML,XQueryupdates)checkingwell-formednessisamustandimpliesthattheautomatonusesamemory(e.g.astack)tomatchopeningtagsandclosingtags.Therefore,automatawithoutanymemorycan-notcheckwell-formednessofXMLdatarepresentedasastring.Additionally,eveniftheXMLdataiswell-formedastackmayberequiredtocheckthevalidityofsomeschemes(seeExample1inSection3).

Theideaofapushdownautomatonforstrongvalidationispresentedin[29],butincontrasttoourapproach,thewholeXMLdatahastobevalidatedleadingtosignifi-cantperformancedegradation.Additionally,[29]doesnotpresentanimplementationoralgorithm.

[8]presentsaone-counterautomatonforstrongvalidationofXMLdocuments.Asasinglecountervariableisusedinsteadofastack,onlyasubsetofDTDscanbevali-dated,namelythesocalled(restricted)one-counterlanguages.Anexampleforalan-guage(schema)thatcannotbevalidatedisanbmbmanwherea,bareopeningtagsanda,barethecorrespondingclosingtags.Usingastackitispossibletocountandmatchthetwodifferenttags,withasinglevariableitisimpossible.Additionally,[8]supportsonlyglobalvalidation.

AmultitudeofapproachesforpartialvalidationofXMLdataafterupdateshavebeenpresented(e.g.[2,3,5,6,20,21,25]).AlltheseapproachesassumethattheXMLdataaccordstotheDOMmodel,i.e.itisrepresentedasatreeofnodes.Thisrepresentationisinherentlywell-formedandenablesthedirectandefficientnavigationwithinthenodes.Forinstance,itispossibletoaccessallchildrenofagivennode.TheDOMmodelworkswellfornativeXMLdatabasemanagementsystemswherethetree-likerepresentationispreserved.Incontrast,wefocusonthestringrepresentationofXMLdataasitisusedinXMLcolumntypes,messagesystems(e.g.SOAP)orSQLXMLupdatecommands.InthosesystemstheXMLdataisrepresentedasasequenceoftagsandvalues.ThesequencecanbeseenastheresultofapreordertraversalofthecorrespondingXMLtree.

[12]presentsaschemavalidationtechniquewithoutanyautomata.InthisapproachtheXMLdataisrepresentedasasetofrowsinonededicatedXPath-awaretable.There-fore,itisclosertotheXMLshreddingapproachandtheresultsarehardlytransferabletotheXMLcolumntype.

Ourpreviousworks[13,14,15,16,17]dealwith(autonomous)indexingandupdateissuesinnativeXMLstoragesystemsrepresentingXMLdatausingtheDOM-model.ThepushdownautomatonpresentedinthispaperisalsousedtoconvertanXMLstreamintoabinaryrepresentationinordertocompressit.Thisisachievedbyaddingbinarysymbolstoalltransitionsintheautomaton.DetailsonhowwecompressXMLdata,inparticularSOAPmessages,byusingapushdownautomatoncanbefoundin[32].

3TheXMLValidationProblem

Weassumethatthereaderisfamiliarwithbasictermsofformallanguagesandautomatatheorylikeregulargrammarsandfinitestatemachinesaswellascontext-freegrammars

IncrementalValidationofString-BasedXMLData

317

Fig.1.RecursiveXMLSchemadescription

andpushdownautomata(finitestatemachineswithastack).Wereferthereaderto[18]forfundamentalsinautomatatheoryandformallanguages.WemotivatethecomplexityoftheXMLSchemavalidationproblembyasimpleinformalexample:

TheXMLSchemadocumentpresentedinFigure1consistsofthreedifferentele-mentsa,bandc.Asamapstobandbmapstoa,thisisarecursiveschemaallow-inganarbitrarydepthoftheXMLdata.WepresentamorecompactrepresentationofthisgrammarinFigure2.Asonecanseetheaelementunderbappearsalwaystwice(minoccurs=maxoccurs=2).Therefore,wehavealefta-elementandarighta-elementthatbelongtogether.WhenreadingtheXMLstringanautomatonhastore-memberwhetherana-elementistheleftone(rightaismandatory)ortherightone(nomorea-elementsallowed).AstheXMLdatahasanarbitrarydepth,theautoma-tonhastorememberarbitrarymanyofthesepairs—thisisonlypossibleifweuseanadditionalmemorylikeastack.Evenifweassumewell-formednessoftheXMLdatatobevalidated,wecannotsupportthisXMLSchema(orthecorrespondingDTD)withoutastack.TheapproachesforXMLstream/stringvalidationthatrelyonlyonafiniteautomatonwithoutfurthermemorycallthiskindofschemaanotrecognizableschema[8,29].DOM-treebasedvalidatorsrelyingontree-automatadonotneedastackbecausetheycancheckdirectlywhetherab-elementhastwoachildrenornot.3.1Formalization

WerepresentanXMLdatainstance(alsoknownasXMLdocument)asasequenceofopeningandclosingtagswiththeircontent.WeformalizeXMLandXMLSchemaasfollows:

Definition1(XMLdatainstance).AnXMLdatainstancex∈Σ∗isasequenceofarbitrarymanysymbolsσ∈ΣwithΣ=Σt∪Σc.Σtisthetag-alphabetwhereas

Σc={a..z,A..Z,1..9}representsthealphabetforthetextualcontent1.ΣcdenotesanarbitrarysequenceofsymbolsofΣcincludingtheemptysequence.Weidentifyclosing

+

tagswithaline:aistheclosingtagofa.OpeningtagsaresummarizedinΣtwhereas

1

OnecouldusetheUnicodesymbolsforΣc,aswell.

318B.C.Hammerschmidtetal.

closingtagsareinΣt.Atagcorrespondstoanelementname;givenatagtonemaygetitsnamewiththefunctionname(t).

+−+−

InourexampleitholdsthatΣt={a,b,c},Σt={a,b,c},andΣt=Σt∪Σt.Pleasebeawarethatthisgenericdefinitionallowstheformulationofnon-well-formedandinvalidXMLdata.

WeformalizeXMLSchemaasaRegularTreeGrammar(RTG)G.[24]showsthatandhowanXMLSchemacanbemappedtoaRTG.AregulartreegrammarG=(N,T,P,S)isa4-tupleconsistingofasetofnon-terminalsymbolsN,asetofterminalsymbolsT,asetofproductionrulesPandasetofstartsymbolsS⊆N.ThelanguagedefinedbyGisdenotedwithL(G).

AllXMLdatatypesresultinnon-terminalsymbols(writtenincapitallettersorpre-fixedwithxsd:).Allpossibleelementnamesresultinterminalsymbols(smallletters).ThesetofstartsymbolsS⊆Ncontainsallnon-terminalsbelongingtoelementsthataredeclaredonthetop-leveloftheXMLSchemadocument.

TheproductionrulesPreflectthestructureofvaliddocuments.Thelefthandsideofarulep∈Pisanon-terminalsymboldefininganelementtype.Itcanbedeterminedbythefunctiontype(p):P→N.

Therighthandsideconsistsofanelementname(aterminalsymbol∈T)thatcorre-spondstothetype.Itcanbeaccessedusingthefunctionname(p):P→T

Theelementnameisfollowedbyaregularexpressionrwhichdescribesthecontentmodelofthiselement,i.e.typesofchildelements.EachregularexpressionelementmapstotypesinN,simpleXMLSchematypes(e.g.xsd:int),ortonothing(󰁯).

Forourexample,wedepicttheconversionresultinFigure2.See[24]fordetailsabouttheconversionalgorithm.

G=({A,B,C,xsd:int},{a,b,c},P,{A})withP={A→a(B|ε),

B→b(AA|C),C→c(xsd:int)}

Fig.2.RegulartreegrammaroftheschemainFigure1

Therulesinourexamplehavethefollowingmeaning:TheAelementtypeisrepre-sentedbytheelementnameaandcontainsoneB-typeor(|)ornothing(󰁯),whereasBcontainseithertwoAtypesoroneCtypecontaininganintegervalue.Hence,GisequivalenttotheSchemainFigure1.

ThecontentmodelsofGcanbetransferredtoautomatatheory:Lemma1.Foreachregularexpressioninaproductionrulep∈Pthereexistsafinitestatemachineacceptingthecontentmodeloftype(p).ThesetofallfinitestatemachinesisdenotedbyFSM.Givenap∈PonegetsitscorrespondingFSMwiththefunctionfsm(p):P→FSM.

IncrementalValidationofString-BasedXMLData

A:q0B:q1r0Ar1CAC:r2s0319

Bxsd:ints1Fig.3.SetoffinitestatemachinesgeneratedfromtheregulartreegrammarofFigure2

Wegivenoproofforthislemmaasitisstandardautomatatheory.Pleasesee[18]describinghowtotransformregularexpressionsintoFSMs.ThecorrespondingstatemachinesforourexamplearepresentedinFigure3.

3.2ConstructionoftheXMLSchema-AwarePushdownAutomaton

XMLisaderivativeoftheDyck-languages(parenthesislanguages)thatcanbede-scribedbycontext-freegrammars[7].Therefore,thepushdownautomatonistheap-propriatemodelforstringprocessingofXMLdata.Wepresenttheconstructionofthepushdownautomaton(PDA)intheAppendix(SectionA).Thepushdownautomatonaspresentedhereisaslightmodificationoftheautomatonaspresentedin[32]inthecontextofcompressiontechniquesforSOAPmessages.

OurPDAstartswiththespecialsymbolZastheonlystackcontent.ThePDAacceptsadocumentiftheremaininginputsequenceandthestackareempty.

ThemainideaofthePDAistocheckthecontentmodelsoftheelementtypesoftheregulartreegrammarbysimulatingthenestedexecutionoftheFSMs.Forthisreason,eachopeningandclosingtagisrepresentedbyonededicatedstate.

Thestack’scontentisusedtomemorizethepathtothecurrenttag:Statesrepre-sentingthecurrentpositioninthecontentmodelarepushedonthestackwhilereadingtheXMLdata.Thetransitionscapturevalidtagsequences,i.e.therestrictionsofthecontentmodels.ThePDAconstructedforoursamplegrammarispresentedinFigure4.Thefirstsymbolofatransition’slabelrepresentsatagt∈ΣttobereadfromtheXMLinputx,thesecondsymbolrepresentsthestateofthecontentmodeltobepoppedfromthestack(pleaseseeSectionAfordetails),andthelastsymbol(s)arethestatestobepushedonthestackwhenusingthistransition.Forinstance,ifwereadanopeningtagathesymbolqoispushedonthestack.Becauseamayhaveanemptycontentmodelweareabletoreadtheclosingtagaafterwards.Thetransitionchecksiftheprevioustagwasabypoppingq0fromthestack.

Pleasenotethatnospecialtreatmentforterminatingtheexecutionoftheautoma-tonisneededbecausetheautomatonacceptsadocumentifboththeremaininginputsequenceandthestackcontentareempty.Fora-documentsthisisthecasewhenthecorrespondingCloseAstateisreached.

4EfficientlyValidatingUpdates

WhenprocessinganXMLinputstringthecomputationalcomplexityislinear(O(n))withrespecttothestring’slengthn.UsuallyO(n)complexityisnotappropriateifthesizeoftheXMLdataexceedsacertaindegree.InthissectionwedescribehowweadaptourapproachtoincrementallyvalidateXMLstringsbeforeperforminganupdate

320B.C.Hammerschmidtetal.

Fig.4.PDAconstructedfromtheRTGofFigure2

operation.Incrementalvalidationmeansthatonlythemodifiedpartisvalidatedinitslocalcontext.Inthiswaythecomplexitybecomeslinearinthesizeofthemodification.Weassumethatanupdateoperationiseitheraninsertionofanewelement(withitschildren)orthedeletionofanelement(includingallitschildren).Formally,anupdateoperationoisrepresentedasatripleo=(path,type,content),wherepathisanXPathexpressionselectingonetomanycontextelementsintheXMLdata,typeindicateswhatkindofupdateitis,andcontentisthetextualrepresentationofanewelementtobeinserted.Anexampleforsuchanupdateoperationmightbethefollowinginsertstatement–formulatedinXQueryUpdate[33,35]:

doinsert

Harddisksinto/site/categories

Here,itholdsthatpath=/site/categories,type=appendandcontent=Harddisks.

InordertoincrementallyvalidatetheXMLdatawedonotreaditfromthebeginningbutneeddirectaccesstoanoffsetposition.Forthisreason,weuseanindexwhoseen-triesreferenceopen-tagsintheXMLstring.Thisistechnicallydonebytheuseofchar-acteroffsetvalues.Thekeysoftheindex’sentriesaresimplepathexpressions.Simplepathexpressionscontainthechild-axisandelementnames.TheStrongDataGuide[11]isanexampleforsuchanindexstructurewiththecharacteristicthatallpossiblesim-plepathexpressionsarecovered.Inourworkwerelaxthischaracteristicandassumethatatleastsomepathexpressionsareindexedwithoutregardinghowtoselectthem.

IncrementalValidationofString-BasedXMLData321

(WehavediscussedtheproblemoffindingsuitableXMLindexesforagivenapplicationin[14,15]).

ThemainideaofourapproachistoextendtheindexentriesinsuchawaythattheyreferencenotonlyintotheXMLstringbutalsotothecorrespondingstatesinthePDAwhileprovidingthetopelementofthestackatthesametime.AccordingtotheconstructionofthePDA,thetopelementdenotestheinitialstateofthecorrespondingfinitestatemachine.Theideaistokeep’snapshots’ofthePDAexecutionintheindex.Forthisreasonwedefinetheindexasfollows:

Definition2(Element/State-Index).Theelement/state-indexconsistsofasetofen-triesE.Eachentrye∈Eisa2-tuplee=(key,ref)wherekeyisasimplepathexpressionandref=(offsetlist,state,stackelement)areference:offsetlistisalistofcharacterdistanceseachofwhichisanumberindicatingthecharacterdistancefromtheroot;state∈QisthecorrespondingstateinthePDA,andstackelement∈Γisthetopelementofthecorrespondingstackcontent.QdenotesthesetofallstatesandΓisthesetofallstacksymbols.WedescribedetailsintheAppendix.

Weneedonlythetopstackelementdenotingtheinitialstateofthecorrespondingfinitestatemachinebecauseregulartreegrammarsarecontextfree,i.e.wecananalyzethecontentofanelementindependentlyofthepositionoftheelementwithinthedocument.Forexample,thebehaviourofthePDAbasicallyisthesameifweanalyzethefirsta-elementortheseconda-elementwithinab-element.Thestackcontentdiffersonlyinpartsthatareirrelevantforanalyzingthea-element.

Anoverviewofanelement/state-indexisgiveninFigure5.Forexample,ifab-elementwiththepath/a/bisupdated,onlythenewb-elementhastobecheckedasfollows:ThePDAskipstherootb-elementandstartsinstateOpenBwithr0astheonlystackelement.Aftersuccessfullycheckingthenewb-element,thePDAstopsinstateCloseBwithanemptystack.

TheindexispopulatedwheninsertingdataintotheXMLcolumn:Thestringrep-resentationofthedataisprocessedoncebythePDAinordertocheckvalidity.Whileprocessingthetagsequencewecheckifthecurrenttagbelongstoasimplepathex-pressiontobeindexedornot.Ifthisisthecaseweinsertthecurrentoffset,stateandtopmoststacksymbolasanewentryintotheelement/state-index.AnefficientretrievalofentriesisenabledwhenusingasearchstructurelikeaB-treeorhashtable.

AstheXMLdataisahierarchyofelements,weillustrateitasatree(triangle).ThetreesinsidetheXMLdatarepresentsubtrees(elementsinelements).PleasenotethattheXMLdataisstoredasasequenceoftagsinthecolumnandnotasatreeofelementnodeslikeinnativeXMLDBMS.

Whenperforminganupdateoperationo=(path,type,content)itislikelythatitspathexpressionpathisnotcoveredbyanentryintheindex.However,ifallelementsselectedbypatharedescendantsofanindexelementwecanstillusethatindextoacceleratethevalidationprocess.WhetheranindexedpathexpressioncanbeusedornotisdeterminedusingtheXPathcontainmentalgorithmofMiklauandSuciu[23].Itdetectsifelementsselectedbyapathexpressionpareasubsetofthoseselectedbyp󰀂.Inourcasep=pathandp󰀂denotesapathexpressionoftheindexentries

322B.C.Hammerschmidtetal.

Root/a/b/a/b/aXMLDataIndexkey

PDAref

offsetliststateRoot1Start/a/b4OpenB/a/b/a7,14OpenA

stackelement

Zr0q0

ThetableshowstheindexfortheautomatongiveninFigure4andforthedocument

:

Fig.5.Theelement/state-indexreferencingXMLdataandPDAstates

IndexedelementUpdateIndexedelementUpdateelemento

Dpt1,t2,.........,ti,......,to,.......,to,....ti,........,t1

Dp

Fig.6.FindingtheupdatepositionintheXMLdatausingtheindex

(thekeys).Ifp⊆p󰀂wemayusethatentry.Ofcourse,incaseofp⊂p󰀂,i.e.p=p󰀂wehavetocheckforeachelementoftheoffsetlistwhetherthecorrespondingdatauptothispositioncorrespondstop.Ifthereisnoindexentrywithp⊆p󰀂,wemustprocesstheXMLdatafromtherootresultinginaconventionalvalidation.

Inthefollowingweassumethatasuitableindexentryisfound.Figure6illustratesthesituation:Theindexreferstoanopen-tagelementtiintheXMLdatax.Thecorre-spondingclosingtagelementticanbefoundeasilybecausexiswellformed.

󰀔pistherelativesimplepathexpressionstartingfromtheindexedelementtiandleadstotheelementtothatisselectedbytheupdateoperation.󰀔pcanbecalculatedbyapplyingXPathdifferencefunctions(fordetailssee[13])andconsistsofasequenceofnavigationalstepstochildelements.

TheideaofourapproachistousethePDAforvalidationandtopretendthatthemodifyingoperationohasalreadybeenexecuted.Thevalidationtakesplacefromtitoti.Theoffsetinxtotiisfetchedfromtheindex;thePDAisinitializedwiththesettingsbelongingtothatentry.WhenprocessingxusingthePDAwefollowthereferenceandinitializethePDAanditsstack.Wenowstarttheprocessingofxuntilwereachto.Wehavereachedtheelementselectedbytheupdateoperationoifthenexttaginxcorrespondstothelastlocationstepin󰀔p.Beforereadingthattagtowememorizethe

IncrementalValidationofString-BasedXMLData323

currentstateofthePDAwithitsstackcontent.Thenextstepsofouralgorithmdependonthetypeofthemodificationoperationo:

–Delete:Ifoisadeleteoperationalltagsinthesequenceto,....toaretoberemoved.Again,theclosingtagtocanbeidentifiedbecausexiswell-formed.WeskipalltagsinthatsequenceandcontinuetoreadxusingthePDAuntilwereachti.IfthePDAcouldprocessthisshortenedinput,thenxwouldbestillvalidafterperformingthedeletion.Theoffsetsintheindex’entrieshavetobeupdatedbecausetheupdatemayhavechangedthepositionofsometags.Thisisdonebysubtractingthelengthofthedeletedsequencefromalloffsetsthataregreaterthanthepositionofto.–Append:ThePDAreadsthefullsequenceto,....toandstops.Thenexttagsarethecontentoftheupdateoperationo.AfterreadingthemthePDAresumestoreadfromxuntilreachingti.IfthePDAacceptsthatextendedinput,thenxisstillvalidafterperformingtheappendoperation.Again,wehavetoupdatetheoffsetsbyaddingthelengthofthecontenttothoseoffsetsthatarelargerthanthepositionofto.–Insert:Thiscaseisverysimilartothepreviousonewiththedifferencethatthenewelementisnotasiblingbutachild.Therefore,theinsertiontakesplaceafterreadingto,ifthenewelementisthefirstchild,orjustbeforereadingto,ifitisthelastchild.Analogously,theindexoffsetshavetobeadapted.

Asthisalgorithmonlyreadsthetagsfromtitoti,weavoidacomplexityofO(lengthofthewholedocument).Instead,weachieveacomplexityofO(lengthofthemodifiedpartofthedocument).Oursolutionisalsoabletohandletheconstructef-ficiently:ThecontentmodelA&B&Cissplitupintoallpossiblesequences:ABC|ACB|BAC|BCA|CAB|CBA.Ontheonehandthissolutionobviouslyleadstolargerautomata,ontheotherhanditalwayspreservesaruntimecomplexityofO(lengthofthechangedpartofthedocument).See[32]forthedetails.

QueryingXMLDataUsingtheIndex.Theelement/state-indexcanalsobeusedtoacceleratethepathexpressionswhenexecutingqueryingdatabaseoperations.Althoughitisnotinthefocusofthispaper,wepresentthebasicideathatissimilartotheincre-mentalvalidation:thebestindexentryisdeterminedreferencingtopositionsintheXMLdata;Δpiscalculatedandexecutedonthesepositionsleadingtothetagsthatareselectedbythequery.Wereferto[14,13]forthedetails.

5Experiments

Inordertodemonstratethebenefitsofourschema-awarepushdownautomatonforvali-dationwecompareitwithotherapproaches.WeusetheXMark[26]benchmarktooltogenerateXMLdataofdifferentsizes.TheXMarkdatamodelsanauctionscenario.Itsschemaconsistsofroughly80elementtypesandallowstherecursivenestingofsomeelementswithoutrestrictingthedocumentsdepth.BecausetheXMarkschemaispro-videdbyaDTD,weconvertittoanequivalentXMLSchemausingSun’sTrang[30].TheexperimentswereperformedonanIntelP4,2.67GHzwith1GBRAM.TheconstructionofthePDAtook270ms.Itconsistsof237statesand627transitions.

324B.C.Hammerschmidtetal.

Table1.TimetovalidatetheXMarksampledata

PDAgeneration

XMLSpyn/aJDOMParsern/aSaxParsern/aXeniaglobal270msXenialocal270ms

100KB

1000ms1010ms622ms87ms22ms

1MB4200ms1322ms1035ms370ms25ms

10MB28000ms3890ms2122ms1061ms25ms

100MBn/a24690ms14210ms12648ms26ms

WehavecreatedXMLdocumentsofdifferentsizesandhavecomparedthetimesforvalidationwiththetwowell-knownparsersJDOM[19]andSAX[9],theXMLEditorXMLSpy[1]andourapproachcalledXenia.WesummarizetheresultsofourexperimentsinTable1.

Themeasuredvaluesareaveragedovertenruns.The100MBdocumentcouldnotbeprocessedbytheXMLSpyduetomemoryconstraints.Asonecanseeinthecolumnsfor100KBand10MB,Xeniaissignificantlyfasterbecausethevalidatingautomatonisalreadybuiltwhereastheotherparsershavetoreadandprocesstheschemainformation.ForlargerXMLdatathemeasuredtimesarecomparabletotheSAXparser.BothXeniaandSAXdonotneedtokeepthewholedocumentinthememorybecausetheyoperatestream-oriented.TheJDOMparserandXMLSpykeepthewholeXMLdatainmemory.Theexperimentsforincrementalvalidationsafterlocalupdatesarepresentedinthelastrowofthetable.Theupdatedelementhasasizeof20kB.Asonecansee,thetimetovalidatetheupdateisalmostconstantandnotdependentonthetotalsizeoftheXMLdata.Thisisbecauseonlythemodifiedpart(whichisthesameinallcases)hastobechecked.Itdoesnotmakeanydifferencewhetherwedealwithaninsertoradeleteoperation.ThetimeforconstructingthePDAisgivenseparatelyinthefirstcolumnofTable1becauseinourapproachthePDAhastobegeneratedonlyonceforeachschema.ThesamePDAisusedforallupdatesofdocumentshavingthesameschema.

6Conclusion

WehaveseenthatconventionalparsersconsumelineartimewhenvalidatinganXMLstring/stream.Especiallyforlargerdatasetsthatareperiodicallyupdatedthecostsforre-validationbecomeprohibitivelyhigh.Waivingthevalidationmaybeinevitablebutisunwantedinmostcases.Inthispaper,weintroduceaschema-awarepushdownautoma-tonusedforvalidation.UsinganindexstructurecapturessnapshotsofthisautomatonwhileprovidingdirectaccessintotheXMLdata.Usingthisindex,thecostsforre-validationareinthesizeoftheupdateoperationbecauseonlyasmallpartoftheXMLdataisprocessedbytheautomaton.Additionally,thevalidationisperformedbeforeupdatingthedata,sothatinvaliddatacannotbecreated.Experimentalresultsshowtheefficiencyofourapproach.

TothebestofourknowledgeourapproachisthefirstoneforpartiallyvalidatingXMLstringsafterlocalupdates.Thisisahighlyrelevanttopicforimplementingeffi-cientnon-nativedatabasemanagementsystemswhichusespecialXMLcolumntypesorforstoringXMLdatainthefilesystem.

IncrementalValidationofString-BasedXMLData325

References

1.Altova.XMLSpy.URL:http://www.altova.com

2.Balmin,A.,Papakonstantinou,Y.,Vianu,V.:IncrementalvalidationofXMLdocuments.ACMTrans.DatabaseSyst.29(4),710–751(2004)

3.Barbosa,D.,Mendelzon,A.O.,Libkin,L.,Mignet,L.,Arenas,M.:EfficientIncrementalVal-idationofXMLDocuments.In:ICDE’04:Proceedingsofthe20thInternationalConferenceonDataEngineering,Washington,DC,USA,pp.671–682.IEEEComputerSocietyPress,LosAlamitos(2004)

¨4.Beyer,K.,Cochrane,R.,Josifovski,V.,Kleewein,J.,Lapis,G.,Lohman,G.,Lyle,B.,Ozcan,

F.,Pirahesh,H.,Seemann,N.,Truong,T.:SystemRX:OnePartRelational,OnePartXML.In:Proceedingsofthe2005ACMSIGMODInternationalConferenceonManagementofData,Baltimore,Maryland,USA,June14-162005,pp.347–358.ACMPress,NewYork(2005)

5.Bouchou,B.,Alves,M.H.F.:UpdatesandIncrementalValidationofXMLDocuments.In:DBPL,pp.216–232(2003)

6.Bouchou,B.,Alves,M.H.F.,Laurent,D.,Duarte,D.:ExtendingTreeAutomatatoModelXMLValidationUnderElementandAttributeConstraints.In:ICEIS(1),pp.184–190(2003)7.Br¨uggemann-Klein,A.,Wood,D.:Balancedcontext-freegrammars,hedgegrammarsandpushdowncaterpillarautomata.In:ExtremeMarkupLanguages(2004)

8.Chitic,C.,Rosu,D.:OnvalidationofXMLstreamsusingfinitestatemachines.In:WebDB’04:Proceedingsofthe7thInternationalWorkshopontheWebandDatabases,pp.85–90.ACMPress,NewYork,NY,USA(2004)

9.Megginson,D.:SimpleAPIforXML.URL:http://www.saxproject.org/

10.Fiebig,T.,Helmer,S.,Kanne,C.-C.,Moerkotte,G.,Neumann,J.,Schiele,R.,Westmann,T.:

AnatomyofanativeXMLbasemanagementsystem.VLDBJournal11(4),292–314(2002)11.Goldman,R.,Widom,J.:DataGuides:EnablingQueryFormulationandOptimizationin

SemistructuredDatabases.In:VLDB’97,Proceedingsof23rdInternationalConferenceonVeryLargeDataBases,pp.436–445(1997)

12.Grust,T.,Klinger,S.:Schemavalidationandtypeannotationforencodedtrees.In:Pro-ceedingsoftheFirstInternationalWorkshoponXQueryImplementation(XIME-P),Paris,France,June2004,pp.55–60(2004)

13.Hammerschmidt,B.C.:KeyX:SelectiveKey-OrientedIndexinginNativeXML-Databases.DissertationzumDr.-Ing.,Institutf¨urInformationssysteme,Technisch-NaturwissenschaftlicheFakult¨at,Universit¨atzuL¨ubeck,October,DISDBIS93,Akademis-cheVerlagsgesellschaftAkaGmbH,Berlin2006,ISBN3-89838-493-4(2005)

14.Hammerschmidt,B.C.,Kempa,M.,Linnemann,V.:Aselectivekey-orientedXMLIndex

fortheIndexSelectionProbleminXDBMS.In:Galindo,F.,Takizawa,M.,Traunm¨uller,R.(eds.)DEXA2004.LNCS,vol.3180,Springer,Heidelberg(2004)

15.Hammerschmidt,B.C.,Kempa,M.,Linnemann,V.:AutonomousIndexOptimizationin

XMLDatabases.In:ProceedingsoftheInternationalWorkshoponSelf-ManagingDatabaseSystems(SMDB2005),Tokyo,Japan,April8-92005,pp.56–65(2005)

16.Hammerschmidt,B.C.,Kempa,M.,Linnemann,V.:OntheIntersectionofXPathExpres-sions.In:Proceedingsofthe9thInternationalDatabaseEngineering&ApplicationSympo-sium(IDEAS2005),Montreal,Canada,July25-27,2005(2005)

17.Hammerschmidt,B.C.,Linnemann,V.:TheIndexUpdateProblemforXMLDatain

XDBMS.In:Proceedingsofthe7thInternationalConferenceonEnterpriseInformationSys-tems(ICEIS2005),Miami,USA,pp.27–34(2005)

18.Hopcroft,J.E.,Motwani,R.,Ullman,J.D.:IntroductiontoAutomataTheory,Languages,and

Computation.AddisonWesleyPublishingCompany,Reading(2001)

326B.C.Hammerschmidtetal.

19.Hunter,J.,McLaughlin.:JDOM1.0.URL:http://www.jdom.org/

20.Sang-Kyun,K.,Myungcheol,L.,Kyu-Chul,L.:ImmediateandPartialValidationMechanism

fortheConflictResolutionofUpdateOperationsinXMLDatabases.In:Meng,X.,Su,J.,Wang,Y.(eds.)WAIM2002.LNCS,vol.2419,pp.387–396.Springer,Heidelberg(2002)21.Sang-Kyun,K.,Myungcheol,L.,Kyu-Chul,L.:ValidationofXMLDocumentUpdates

ˇep´BasedonXMLSchemainXMLDatabases.In:Maˇr´ık,V.,Stˇankov´a,O.,Retschitzegger,

W.(eds.)DEXA2003.LNCS,vol.2736,pp.98–108.Springer,Heidelberg(2003)

22.Liu,Z.H.,Krishnaprasad,M.,Arora,V.:NativeXqueryprocessinginOracleXMLDB.In:

Proceedingsofthe2005ACMSIGMODInternationalConferenceonManagementofData,Baltimore,Maryland,USA,June14-162005,pp.828–833.ACMPress,NewYork(2005)23.Miklau,G.,Suciu,D.:ContainmentandequivalenceforafragmentofXPath.Journalofthe

ACM51(1),2–45(2004)

24.Murata,M.,Lee,D.,Mani,M.,Kawaguchi,K.:TaxonomyofXMLschemalanguagesusing

formallanguagetheory.ACMTrans.Inter.Tech.5(4)(2005)

25.Papakonstantinou,Y.,Vianu,V.:IncrementalValidationofXMLDocuments.In:Calvanese,

D.,Lenzerini,M.,Motwani,R.(eds.)ICDT2003.LNCS,vol.2572,pp.47–63.Springer,Heidelberg(2002)

26.Schmidt,A.,Waas,F.,Kersten,M.L.,Carey,M.J.,Manolescu,I.,Busse,R.:XMark:A

BenchmarkforXMLDataManagement.In:ProceedingsoftheInternationalConferenceonVeryLargeDataBases(VLDB),HongKong,China,pp.974–985(2002)27.Sch¨oning,H.:Tamino-ADBMSdesignedforXML.In:Proceedingsofthe17thInterna-tionalConferenceonDataEngineering,Heidelberg,Germany,April2-6,2001,pp.149–154.

IEEEComputerSociety,LosAlamitos(2001)

28.Segoufin,L.:TypingandqueryingXMLdocuments:somecomplexitybounds.In:PODS

’03:Proceedingsofthetwenty-secondACMSIGMOD-SIGACT-SIGARTsymposiumonPrinciplesofdatabasesystems,pp.167–178.ACMPress,NewYork(2003)

29.Segoufin,L.,Vianu,V.:ValidatingstreamingXMLdocuments.In:PODS’02:Proceedings

ofthetwenty-firstACMSIGMOD-SIGACT-SIGARTsymposiumonPrinciplesofdatabasesystems,pp.53–64.ACMPress,NewYork(2002)

30.SunMicrosystems,Inc.Trang:Multi-formatschemaconverterbasedonRELAXNG(May

2006),URL:http://www.thaiopensource.com/relaxng/trang.html

31.Thompson,H.S.,Beech,D.,Maloney,M.,Mendelsohn,N.:XMLSchema

part1:Structures2edn.W3CRecommendation(October2004),URL:http://www.w3.org/TR/xmlschema-1

32.Werner,C.,Buschmann,C.,Brandt,Y.,Fischer,S.:CompressingSOAPMessagesbyusing

PushdownAutomata.In:ProceedingsoftheIEEEInternationalConferenceonWebServices,Chicago,USA,September2006,IEEEComputerSocietyPress,LosAlamitos(2006)

33.WorldWideWebConsortium(W3C).XQueryUpdateFacilityRequirements(2005),URL:

http://www.w3.org/TR/xquery-update-requirements/34.WorldWideWebConsortium(W3C).XMLSchema(2006)URL:

http://www.w3.org/XML/Schema

35.WorldWideWebConsortium(W3C).XQueryUpdateFacility(2006)URL:

http://www.w3.org/TR/2006/WD-xqupdate-20060711/

AAppendix

Inthissection,wedescribehowthepushdownautomatonisconstructedbyusingaregulartreegrammarG=(N,T,P,S).Asaninitialstep,weconstructthefinitestatemachinesrepresentingthecontentmodelsofG.First,wedefinesomerequiredfunctions:

IncrementalValidationofString-BasedXMLData327

Definition3(Stackmachinefunctions).ThestatesofanFSMf∈FSMcanbedeterminedbythefunctionstates(f).Acceptingstatescanberetrievedwiththefunc-tionstatesaccept(f),theinitialstatewiththefunctionstateinit(f).Givenastateq∈states(f)thefunctionout(q)returnsalloutgoingtransitionsofqandthefunctionin(q)returnsallincomingtransitions.Eachtransitionthasalabelcorrespondingtoanelementtype∈Nthatcanbedeterminedbythefunctiontype(t).Thestatetowhichtmapscanberetrievedbydest(t).

ThecrucialpointnowisthattheFSMsfocusonelement-nodes(tree-model)whereasourinputisanXMLstring(sequenceofsymbols).Weuseapushdownautomaton(PDA)thatsimulatestheexecutionoftheFSMswhileprocessingXMLdataasastring.ThisisdonebypushingthereachedopeningstatesoftheFSMontothestack:Thestackrepresentsthepathtothecurrenttag.TheproofthatforeachDTDdthereexistsaPDAacceptingL(d)hasbeenprovidedbySegoufinandVianu([29]).WenowgiveadetailedconstructionalgorithmforXMLSchema.However,wefirsttakeacloserlookatthetagsequence:

Lemma2.AnyXMLstringxconsistsofasequenceoftag-symbols∈Σtandtheir

valuesinΣc.Foranyarbitrarytwoconsecutivesymbolst1,t2∈Σttherearefivedifferentcasesforawell-formedsequencex:

+

1.t1∈Σtisthefirstsymbolinx

+−

2.t1∈Σt∧t2∈Σt∧name(t1)=name(t2).

++∧t2∈Σt3.t1∈Σt−−

4.t1∈Σt∧t2∈Σt

−+

5.t1∈Σt∧t2∈Σt

(e.g.)[rootcase]

(e.g.)[leafcase]

(e.g.)[childcase](e.g.)[parentcase](e.g.)[siblingcase]

Forthesakeofsimplicity,weignoreallsymbolsofthetextualcontentalphabetΣc.AsXMLSchematypescanbedefinedbyaregularexpression2itisobviousthatany

sequenceσ1,...σn∈ΣccanbecheckedbyanautomatonwhetherornotitcorrespondstoagivenXMLSchematype.

Wegeneratethepushdownautomaton(PDA)acceptingL(G)asfollows:

Formally,apushdownautomatonKisa6-tuple;K=(Q,Σ,Γ,Ψ,qstart,Z)whereQisasetofstates,Σistheinputalphabet,Γisthestackalphabet,Ψisasetoftransitionsoftheform(Q×Σ×Γ)→(Q×Γ∗),qstart∈Qistheinitialstate,andZ∈Γistheinitialsymbolofthestack.

ThetransitionrulesΨdefinewhatsymbolsinΣandΓhavetobereadinastateinQinordertomovetothenextstatewhilewritingontothestack.ThePDAacceptswhentheinputsequencehasbeencompletelyreadandthestackisempty.

Definition4(PDAacceptingL(G)).ThePDAhasonededicatedinitialstatenamedsstart.Foreachtagt∈Σtthereisonecorrespondingstateqt∈Q.Thesetofalltag-statesisdenotedbyQΣt.ForeachXMLSchematype(e.g.xsd:int)thereisonededicatedstate3.ThesetypestatesaredenotedbyQtype.NowwedefineQ={qstart}∪

23

E.g.theregularexpression(−|ε)(0|1|...|9)∗definesthedecimaltype.Seealso[31].Thelatterstatesrepresentsub-automataforcheckingthetextualcontent.

328B.C.Hammerschmidtetal.

Algorithm1.PDAConstructionAlgorithm

1:forallp∈Pdo

2:tag←element(p);3:fsm←fsm(p);4:si←stateinit(fsm)5:Sa←statesaccept(fsm)6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:

iftype(p)∈Sthen

AddTransition:(sstart,tag+,z)→(state(tag+),[si,z]);endif

forallq∈states(fsm)do

ifq=si∧q∈Sathen

AddTransition:(state(tag+),tag−,q)→(state(tag−),[]);endif

ifq=sithen

forallt∈out(q)dotypet←type(t)tagt←tag(typet)qdest←dest(t)

+

,q)→AddTransition:(state(tag+),tagt

+

(state(tagt),[stateinit(fsm(typet)),qdest]);endforendif

ifq∈Sathen

forallt∈in(q)dotypet←type(t)tagt←tag(typet)

),tag−,q)→(state(tag−),[]);AddTransition:(state(tagt

endforendif

foralltin∈in(q)doforalltout∈out(q)dotypein←type(tin)typeout←type(tout)tagout←tag(typeout)tagin←tag(typein)

−+),tagout,q)→AddTransition:(state(tagin

+

(state(tagout),[stateinit(fsm(typeout)),dest(tout)]);endforendforendforendfor

󰀎Leaf-case󰀎Root-case

󰀎Child-case

󰀎Parent-case

󰀎Sibling-case

IncrementalValidationofString-BasedXMLData329

QΣt∪Qtype.TheinputalphabetΣ=Σt∪Σccontainstagsandcontent.Thefunctionstate(t),wheret∈Σt,returnsthecorrespondingstate∈Q.WedonotimposeanyrestrictionsonthestackalphabetΓ.ThetransitionsΨaregeneratedbyAlgorithm1.Algorithm1issplitintofiveparts–correspondingtothefivetag-sequencecases.WhileiteratingovertheproductionrulesoftheregulartreegrammarG(line1)itischeckedwhichoneofthefivecasesholds.ThePDAsimulatestherunofthesetofFSMsbypushingandpoppingcorrespondingFSM-statesonto/fromthestack.

Ifanelementisdefinedastop-levelelementintheXMLSchemaitappearsasastart-symbolinG.Transitionsfromtheinitialstateqstarttothecorrespondingopen-tagstatesarecreated(line7).

WeiterateoverallstatesintheFSMwhicharegeneratedinline3.Iftheinitialstateisalsoafinalstate,thenithasanemptycontentmodel,i.e.itisaleaf–aclosingtagisreaddirectlyaftertheopeningtag(line11).Nothinghastobepushedontothestackbecausethesubsequenceforthetaghasbeenprocessedcompletely.

OutgoingtransitionsoftheinitialstatesoftheFSMsmaptothefirstelementoftheircontentmodel(child).Therefore,thereisatransitioninthePDAtotheopen-tagstatesofeachofsuchchildren(line18).Thetransitionsoftheparent-casearegeneratedinversely(line25)basedonthefinalstatesoftheFSMs.Here,nothinghastobepushedontothestackbecausethesub-sequencebelongingtotheclosingtaghasbeenprocessedcompletely.

ForallstatesintheFSMshavinganincomingandoutgoingtransitionwecreateatransitioninthePDAthatcorrespondstothesibling-case.Here,thePDAreadstheclosing-tagofanelementaandafterwardstheopening-tagofanelementb(line34).WhensomeXMLinputisgiven,theautomatonreadsittagbytagandswitchesintocorrespondingstates.Whentheinputisinvalidtherewillbenotransitionorthetop-moststateonthestackdoesnotmatchtherequirements.Inthiscase,theinputcannotbefullyreadandtheinputisrejected.

ThecomplexityofthisconstructionalgorithmisO(n3)withnbeingthesizeofstatesintheFSMs.However,theconstructionofthePDAisperformedonlyoncewhendefiningthecolumntype.

Top