#P8131. [ICPC2020 WF] Gene Folding

[ICPC2020 WF] Gene Folding

题目背景

ICPC2020 WF D

题目描述

International Cell Processing Company (ICPC) is a world leader in the analysis of genetic sequences. A genetic sequence is a sequence of nucleotides, which in this problem is represented by a string containing only the letters A\texttt{A}, C\texttt{C}, G\texttt{G}, and T\texttt{T} in some combination, each letter representing a single nucleotide (A\textbf{A}denine, C\textbf{C}ytosine, G\textbf{G}uanine, and T\textbf{T}hymine, respectively).

One of the key discoveries made by ICPC is that through a process called Genetically Optimized Organic Folding (GOOF), they can take a genetic sequence and transform it into a simpler one, while preserving many of the properties of the sequence that ICPC wants to analyze.

A single application of GOOF works as follows. Find a point between two adjacent nucleotides in the nucleotide sequence, such that the sequence reads the same from that point in both directions, up until the nearer end of the sequence. For instance, in the sequence ATTACC\texttt{ATTACC}, there are two such points: AT-TACC\texttt{AT-TACC} and ATTAC-C\texttt{ATTAC-C}. Then pick one of these points (say, the first one), and fold the genetic sequence at that point, merging the identical nucleotides (so, in this case the AT\texttt{AT} and TA\texttt{TA} would become merged, and the resulting sequence would be CCAT\texttt{CCAT} or TACC\texttt{TACC}).

Through repeated application of GOOF, a nucleotide can potentially be made much shorter. However, manually searching for the appropriate folding points is very time-consuming. ICPC reached out to you to write a program that would automate the process of finding the folding points and choosing them so as to obtain the shortest possible genetic sequence from a given input sequence.

输入格式

The input contains a single string ss representing the nucleotide sequence to be analyzed. The string consists of characters A\texttt{A}, C\texttt{C}, G\texttt{G}, and T\texttt{T} only. The length of ss is between 11 and 41064 \cdot 10^6, inclusive.

输出格式

Output the smallest possible length of a sequence obtained from the input by applying GOOF zero or more times.

题目大意

题目描述

国际细胞加工公司(ICPC)在基因序列分析方面处于世界领先地位。遗传序列是核苷酸的序列,在这个问题中,它由一个只包含字母A\texttt{A}C\texttt{C}G\texttt{G}T\texttt{T}的某种组合的字符串来表示,每个字母分别代表一个核苷酸腺嘌呤(A\textbf{A}denine)、胞嘧啶(C\textbf{C}ytosine)、鸟嘌呤(G\textbf{G}uanine)和胸腺嘧啶(T\textbf{T}hymine)。

ICPC的一个重要发现是,通过一种被称为基因优化有机折叠(GOOF)的过程,他们可以将一个基因序列转化为一个更简单的序列,同时保留ICPC想要分析的序列的许多属性。

以下是GOOF的一个应用。在核苷酸序列中两个相邻核苷酸之间找到一个点,使从该点开始的序列在两个方向上都是相同的,直到序列的最后一个点。例如,在序列ATTACC\texttt{ATTACC}中,有两个这样的点:AT-TACC\texttt{AT-TACC}ATTAC-C\texttt{ATTAC-C}。然后选择其中一个点(比如第一个点),在那个点折叠基因序列,合并相同的核苷酸(因此,在这种情况下,AT\texttt{AT}TA\texttt{TA}会合并,得到的序列将是CCAT\texttt{CCAT}TACC\texttt{TACC})。

通过重复使用GOOF,可以使核苷酸变得更短。但是,手工寻找合适的折叠点非常耗时。ICPC找到你,让你写一个程序来自动找到折叠点并选择它们,从而从给定的输入序列中获得尽可能短的基因序列。

输入格式

输入包含一个表示要分析的核苷酸序列的字符串。只包含A\texttt{A}C\texttt{C}G\texttt{G}T\texttt{T}s 长度在11 ~ 41064\cdot10^6之间。

输出格式

输出从输入中应用零次或多次GOOF得到的序列的最小可能长度。

ATTACC
3
AAAAGAATTAA
5