Word factorization in $O(n^2 \log n)$ time


12

Given two strings $S_1, S_2$, we write $S_1S_2$ for their concatenation. Given a string $S$ and an integer $k \geq 1$, we write $(S)^k = SS\cdots S$ for the concatenation of $k$ copies of $S$. Now, given a string, we can use this notation to 'compress' it, i.e. $AABAAB$ may be written as $((A)^2B)^2$. Let's call the weight of a compression the number of characters appearing in it, so that the weight of $((A)^2B)^2$ is two and the weight of $(AB)^2A$ (a compression of $ABABA$) is three (separate $A$s are counted separately).
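To make the weight definition concrete, here is a minimal sketch that expands a compression and counts its weight. The function name `expand_and_weigh` and the plain-text syntax (an exponent written as digits directly after a closing parenthesis, as in `((A)2B)2`) are assumptions made for illustration; the alphabet is assumed not to contain digits or parentheses.

```python
def expand_and_weigh(expr: str) -> tuple[str, int]:
    """Return (expanded string, weight) for a compression such as '((A)2B)2'.

    Assumed syntax: every ')' is followed by a decimal exponent, and the
    alphabet contains neither digits nor parentheses.
    """
    pos = 0

    def sequence() -> tuple[str, int]:
        nonlocal pos
        text, weight = "", 0
        while pos < len(expr) and expr[pos] != ")":
            if expr[pos] == "(":
                pos += 1                      # consume '('
                inner, inner_weight = sequence()
                pos += 1                      # consume ')'
                start = pos
                while pos < len(expr) and expr[pos].isdigit():
                    pos += 1
                k = int(expr[start:pos])      # exponent right after ')'
                text += inner * k
                weight += inner_weight        # a repeated block is counted once
            else:
                text += expr[pos]             # a plain character has weight 1
                weight += 1
                pos += 1
        return text, weight

    return sequence()


print(expand_and_weigh("((A)2B)2"))
print(expand_and_weigh("(AB)2A"))
```

The two calls correspond to the two examples in the text above.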

Now consider the problem of computing the 'lightest' compression of a given string $S$ with $|S| = n$. After some thinking there is an obvious dynamic programming approach that runs in $O(n^3 \log n)$ or $O(n^3)$, depending on the exact approach.
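For reference, one possible version of that cubic baseline is sketched below. It is my own reading of the "obvious" interval DP, not the contest's reference solution: `dp[l][r]` is the minimum weight of `s[l:r]`, obtained either by splitting the interval or, when the interval is a whole number of copies of its shortest period, by compressing one copy. The shortest period is read off a KMP prefix function computed per starting position; the name `min_weight` is hypothetical.

```python
def min_weight(s: str) -> int:
    """Minimum weight of a compression of s, by an O(n^3) interval DP."""
    n = len(s)
    if n == 0:
        return 0
    # dp[l][r] = minimum weight of a compression of s[l:r] (half-open interval)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(n - 1, -1, -1):
        # Prefix function (KMP failure function) of the suffix s[l:].
        t = s[l:]
        pi = [0] * len(t)
        for i in range(1, len(t)):
            k = pi[i - 1]
            while k > 0 and t[i] != t[k]:
                k = pi[k - 1]
            if t[i] == t[k]:
                k += 1
            pi[i] = k
        for r in range(l + 1, n + 1):
            length = r - l
            best = length                      # write the interval out verbatim
            for k in range(l + 1, r):          # split into two compressed parts
                best = min(best, dp[l][k] + dp[k][r])
            p = length - pi[length - 1]        # shortest period of s[l:r]
            if p < length and length % p == 0: # s[l:r] == (s[l:l+p])^(length/p)
                best = min(best, dp[l][l + p])
            dp[l][r] = best
    return dp[0][n]


print(min_weight("AABAAB"))
print(min_weight("ABABA"))
```

Only the shortest period is tried for the power case, since a longer period that also tiles the interval is itself a power of the shortest one and cannot be lighter.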

However, I have been told that this problem can be solved in $O(n^2 \log n)$ time, though I cannot find any sources on how to do this. Specifically, this problem was given in a recent programming contest (problem K here, last two pages). During the analysis an $O(n^3 \log n)$ algorithm was presented, and at the end the pseudo-quadratic bound was mentioned (here at the four-minute mark). Sadly the presenter only referred to 'a complicated word combinatorics lemma', so now I have come here to ask for the solution :-)


Just a random property: If for a string $S$ we have $S = X^a = Y^b$, then it must also be that $S = Z^{|S|/\gcd(|X|,|Y|)}$ [I fixed a mistake here], where $Z$ has length $\gcd(|X|,|Y|)$ (which cannot be longer than $X$ or $Y$). Not sure how useful this is though. If you have already found that $S = X^a$ and know that $S$ contains at least 2 distinct characters, and are now looking for a shorter $Y$ such that $S = Y^b$, then you only need to try prefixes $Y$ of $X$ having length that divides $|X|$.
j_random_hacker
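As a small illustration of the property in the comment above, the sketch below finds the shortest $Z$ with $S = Z^k$ by trying only prefix lengths that divide $|S|$, and checks the gcd claim on one example. The helper name `primitive_root` is hypothetical, and the brute-force comparison is chosen for clarity rather than speed.

```python
from math import gcd


def primitive_root(s: str) -> str:
    """Shortest Z such that s == Z * k for some integer k >= 1."""
    n = len(s)
    for d in range(1, n + 1):
        # Only lengths dividing n can tile s exactly.
        if n % d == 0 and s[:d] * (n // d) == s:
            return s[:d]
    return s  # only reached for the empty string


s = "ABABABABABAB"            # s = X^3 = Y^2 with X = "ABAB", Y = "ABABAB"
x, y = "ABAB", "ABABAB"
g = gcd(len(x), len(y))       # g = 2
z = s[:g]
print(z * (len(s) // g) == s) # s is also a power of its prefix of length g
print(primitive_root(s))
```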

The problem is that even after reducing all possible $X^a$, you still need to aggregate the answer by a cubic DP over subsegments (i.e. $DP[l,r] = \min_k DP[l,k] + DP[k+1,r]$), so there still is some extra work to be done after that ...
Timon Knigge

I see what you mean. I think you need some kind of dominance relation that eliminates some $k$ values from needing to be tested, but I haven't been able to think of one. In particular, I considered the following: Suppose $S[1..i]$ has an optimal factorisation $S[1..i] = XY^k$ with $k > 1$; is it possible that there is an optimal solution in which $S$ is factorised as $XY^jZ$ with $j < k$? Unfortunately the answer is yes: for $S = ABABCABC$, $S[1..4]$ has an optimal factorisation $(AB)^2$, but the unique optimal factorisation for $S$ is $AB(ABC)^2$.
j_random_hacker

Answers:


1

If I'm not misunderstanding you, I think the minimum cost factorization can be calculated in $O(n^2)$ time as follows.

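For each index $i$, we will calculate a bunch of values $(p_i^\ell, r_i^\ell)$ for $\ell = 1, 2, \ldots$ as follows. Let $p_i^1 \geq 1$ be the smallest integer such that there is an integer $r \geq 2$ satisfying

$$S[i - r p_i^1 + 1,\ i - p_i^1] = S[i - (r-1) p_i^1 + 1,\ i].$$

For this particular $p_i^1$, let $r_i^1$ be the largest $r$ with this property. If no such $p_i^1$ exists, set $L_i = 0$ so we know there are zero $(p_i^\ell, r_i^\ell)$ values for this index.

Let $p_i^2$ be the smallest integer strictly bigger than $(r_i^1 - 1) p_i^1$ satisfying, likewise,

$$S[i - r_i^2 p_i^2 + 1,\ i - p_i^2] = S[i - (r_i^2 - 1) p_i^2 + 1,\ i]$$

for some $r_i^2 \geq 2$. As before, take $r_i^2$ to be the maximal one for this fixed $p_i^2$. In general, $p_i^\ell$ is the smallest such number strictly bigger than $(r_i^{\ell-1} - 1) p_i^{\ell-1}$. If no such $p_i^\ell$ exists, then $L_i = \ell - 1$.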

Note that for each index $i$, we have $L_i = O(\log(i+1))$ due to the $p_i^\ell$ values increasing geometrically with $\ell$. (If $p_i^{\ell+1}$ exists, it's not just strictly bigger than $(r_i^\ell - 1) p_i^\ell$ but bigger than that by at least $p_i^\ell / 2$. This establishes the geometric increase.)
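To make the definition of the $(p_i^\ell, r_i^\ell)$ values easier to check on small inputs, here is a brute-force sketch that follows it literally for a single 1-based end index $i$. It is only an illustration of the definition, not the fast computation; the function name `runs_ending_at` is hypothetical.

```python
def runs_ending_at(s: str, i: int) -> list[tuple[int, int]]:
    """Brute-force list of (p, r) pairs for the prefix S[1..i] (i is 1-based)."""

    def max_r(p: int) -> int:
        # Largest r such that the last r blocks of length p ending at i are
        # all equal to the final block s[i-p:i]; returns 1 if no repetition.
        r = 1
        while i - (r + 1) * p >= 0 and s[i - (r + 1) * p : i - r * p] == s[i - p : i]:
            r += 1
        return r

    pairs = []
    lower = 0                  # the next p must be strictly bigger than this
    p = 1
    while 2 * p <= i:          # r >= 2 needs at least two blocks of length p
        if p > lower:
            r = max_r(p)
            if r >= 2:
                pairs.append((p, r))
                lower = (r - 1) * p
        p += 1
    return pairs


print(runs_ending_at("ABABCABC", 8))
print(runs_ending_at("AAAA", 4))
```

The length of the returned list plays the role of $L_i$ for that index.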

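Suppose now all $(p_i^\ell, r_i^\ell)$ values are given to us. The minimum cost is given by the recurrence

$$dp(i,j) = \min\Big\{ dp(i,j-1) + 1,\ \min_{\ell} \big( dp(i,\ j - r_j^\ell p_j^\ell) + dp(j - r_j^\ell p_j^\ell + 1,\ j - p_j^\ell) \big) \Big\}$$

with the understanding that for $i > j$ we set $dp(i,j) = +\infty$. The table can be filled in $O(n^2 + n \sum_j L_j)$ time.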

We already observed above that $\sum_j L_j = O(\sum_j \log(j+1)) = \Theta(n \log n)$ by bounding the sum term by term. But actually if we look at the whole sum, we can prove something sharper.

Consider the suffix tree $T(\overleftarrow{S})$ of the reverse of $S$ (i.e., the prefix tree of $S$). We will charge each contribution to the sum $\sum_i L_i$ to an edge of $T(\overleftarrow{S})$ so that each edge will be charged at most once. Charge each $p_i^j$ to the edge emanating from $nca(v(i), v(i - p_i^j))$ and going towards $v(i - p_i^j)$. Here $v(i)$ is the leaf of the prefix tree corresponding to $S[1..i]$ and $nca$ denotes the nearest common ancestor.

This shows that $O(\sum_i L_i) = O(n)$. The values $(p_i^j, r_i^j)$ can be calculated in time $O(n + \sum_i L_i)$ by a traversal of the suffix tree, but I will leave the details to a later edit if anyone is interested.

Let me know if this makes sense.


-1

Let S be your initial string of length n. Here is pseudocode for the method.

next_end_bracket = n
for i in [0:n]: # main loop

    break if i >= length(S) # due to compression
    w = (next_end_bracket - i)# width to analyse

    for j in [w/2:0:-1]: # period loop, look for largest period first
        for r in [1:n]: # number of repetition loop
            if i+j*(r+1) > w:
                break r loop

            for k in [0:j-i]:
                # compare term to term and break at first difference
                if S[i+k] != S[i+r*j+k]:
                    break r loop

        if r > 1:
            # compress
            replace S[i:i+j*(r+1)] with ( S[i:i+j] )^r
            # don't forget to record end bracket...
            # and reduce w for the i-run, carrying on the j-loop for eventual smaller periods. 
            w = j-i

I intentionally gave few details on "end brackets", as handling them needs a lot of steps to stack and unstack, which would obscure the core method. The idea is to test for a possible further contraction inside the first one, for example ABCBCABCBC => (ABCBC)² => (A(BC)²)².

So the main point is to look for large periods first. Note that S[i] is the ith term of S skipping any "(", ")" or power.

  • i-loop is O(n)
  • j-loop is O(n)
  • r+k-loops is O(log(n)) as it stops at first difference

This is globally O(n²log(n)).


It's not clear to me that the r and k loops are O(log n) -- even separately. What ensures that a difference is found after at most O(log n) iterations?
j_random_hacker

Do I understand correctly that you are compressing greedily? Because that is incorrect, consider e.g. ABABCCCABCCC which you should factorize as AB(ABC^3)^2.
Timon Knigge

Yeah, you are totally right about that; I have to think about this.
Optidad
Licensed under cc by-sa 3.0 with attribution required.