Text Processing

Source: 东饰资讯网

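(4) Unicode Transformation Formats
UTF (Unicode Transformation Formats)
(5) NSString
The most important thing to remember about NSString is that it represents text encoded in UTF-16: length, indices, and ranges are all expressed in UTF-16 code units. Overlooking this leads to the following pitfalls:
Length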

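We usually call NSString's length method to get the length of a string. Most of the time that works, but length counts UTF-16 code units, so a string that contains emoji (or any other character outside the Basic Multilingual Plane) reports a value larger than the number of characters the user sees. For example: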

NSString *s = @"\U0001F30D"; // earth globe emoji 🌍
NSLog(@"The length of %@ is %lu", s, (unsigned long)[s length]);
// => The length of 🌍 is 2

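The following code returns the number of Unicode code points instead. Note that a character composed of several code points (for example e followed by a combining accent) still counts more than once:

NSUInteger realLength =
    [s lengthOfBytesUsingEncoding:NSUTF32StringEncoding] / 4;
NSLog(@"The real length of %@ is %lu", s, (unsigned long)realLength);
// => The real length of 🌍 is 1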

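Random Access

Accessing a unichar directly by index with characterAtIndex: has the same problem. Use rangeOfComposedCharacterSequenceAtIndex: to find out whether the unichar at a given position is part of a sequence of code units that represents a single character (possibly made up of several code points). Do this whenever you pass a range of a string with unknown contents to another method, so that no Unicode character gets split in the middle.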
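As a minimal sketch of that advice, the helper below widens a caller-supplied range to composed character boundaries before taking a substring. SafeSubstring is a hypothetical name introduced here for illustration; it relies on NSString's rangeOfComposedCharacterSequencesForRange:, and it assumes the range passed in already lies within the bounds of the string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): returns the substring for
// `range`, first widening the range so that it never starts or ends inside
// a composed character sequence such as a surrogate pair.
// Assumption: `range` lies within the bounds of `string`.
static NSString *SafeSubstring(NSString *string, NSRange range) {
    NSRange safeRange = [string rangeOfComposedCharacterSequencesForRange:range];
    return [string substringWithRange:safeRange];
}
```

For the string @"a\U0001F30Db", the range {0, 2} ends between the two code units of the emoji, so the helper widens it to {0, 3} and returns the letter a followed by the complete emoji.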
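Enumeration

With rangeOfComposedCharacterSequenceAtIndex: you could write a boilerplate loop that correctly walks through every character in a string, but repeating that loop each time you need to iterate is inconvenient. Fortunately, NSString offers a better way: the enumerateSubstringsInRange:options:usingBlock: method. It hides the low-level Unicode details and lets you iterate over the composed character sequences, words, lines, sentences, or paragraphs of a string. You can also add the NSStringEnumerationLocalized option so that the user's locale is taken into account when determining word and sentence boundaries. To enumerate individual characters, pass NSStringEnumerationByComposedCharacterSequences: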

NSString *s = @"The weather on \U0001F30D is \U0001F31E today.";
// The weather on 🌍 is 🌞 today.
NSRange fullRange = NSMakeRange(0, [s length]);
[s enumerateSubstringsInRange:fullRange
                      options:NSStringEnumerationByComposedCharacterSequences
                   usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop)
{
    NSLog(@"%@ %@", substring, NSStringFromRange(substringRange));
}];
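The same enumeration can also answer the length question from the Length section in terms of what the user sees. The sketch below counts composed character sequences; ComposedCharacterCount is a hypothetical helper name, and the NSStringEnumerationSubstringNotRequired option is added only to avoid creating substrings that the block never reads.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): counts user-perceived
// characters (composed character sequences) instead of UTF-16 code units
// or Unicode code points.
static NSUInteger ComposedCharacterCount(NSString *string) {
    __block NSUInteger count = 0;
    NSStringEnumerationOptions options =
        NSStringEnumerationByComposedCharacterSequences |
        NSStringEnumerationSubstringNotRequired; // substring is nil in the block
    [string enumerateSubstringsInRange:NSMakeRange(0, string.length)
                               options:options
                            usingBlock:^(NSString *substring, NSRange substringRange,
                                         NSRange enclosingRange, BOOL *stop) {
        count++;
    }];
    return count;
}
```

Unlike the UTF-32 byte-length trick, this count treats e followed by a combining accent as a single character, at the cost of walking the whole string.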

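Comparison

Some characters can be represented either by a single code point or by a sequence of several code points. The two forms look the same and mean the same, but they are not equal as far as Unicode is concerned. Both isEqual: and isEqualToString: compare literally, unit by unit, without normalizing. If you want the precomposed and decomposed forms to match, normalize the strings yourself first: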

NSString *s = @"\u00E9"; // é
NSString *t = @"e\u0301"; // e + ´
BOOL isEqual = [s isEqualToString:t];
NSLog(@"%@ is %@ to %@", s, isEqual ? @"equal" : @"not equal", t);
// => é is not equal to é
 
// Normalizing to form C
NSString *sNorm = [s precomposedStringWithCanonicalMapping];
NSString *tNorm = [t precomposedStringWithCanonicalMapping];
BOOL isEqualNorm = [sNorm isEqualToString:tNorm];
NSLog(@"%@ is %@ to %@", sNorm, isEqualNorm ? @"equal" : @"not equal", tNorm);
 
// => é is equal to é
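
The compare: family of methods is more lenient. As the following example shows, localizedCompare: also treats compatibility-equivalent strings, such as a ligature and its component letters, as equal:

NSString *s = @"ff"; // ff
NSString *t = @"\uFB00"; // ﬀ ligature
NSComparisonResult result = [s localizedCompare:t];
NSLog(@"%@ is %@ to %@", s, result == NSOrderedSame ? @"equal" : @"not equal", t);
// => ff is equal to ﬀ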
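If a locale-independent equality check is preferable to localizedCompare:, one option is to normalize both strings with the compatibility mapping (form KC) before comparing. This is a sketch rather than a recommendation from the source; StringsAreCompatibilityEquivalent is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): compares two strings after
// normalizing both to form KC, so that canonically equivalent forms
// (é vs. e + ´) and compatibility-equivalent forms (ff vs. the ﬀ ligature)
// compare as equal.
static BOOL StringsAreCompatibilityEquivalent(NSString *a, NSString *b) {
    NSString *aNorm = [a precomposedStringWithCompatibilityMapping];
    NSString *bNorm = [b precomposedStringWithCompatibilityMapping];
    return [aNorm isEqualToString:bNorm];
}
```

Compatibility mapping discards distinctions such as ligatures, so it suits searching and matching better than storage, where the original form may need to be kept.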

Text Kit is a set of classes and protocols that provide high-quality typographical services which enable apps to store, lay out, and display text with all the characteristics of fine typesetting, such as kerning, ligatures, line breaking, and justification.

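The layoutManager lays out the content stored in textStorage into the textViews (UITextView), within the regions defined by the textContainers.
In MVC terms, textStorage and textContainers play the role of the model, textViews the view, and layoutManager the controller.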
An NSLayoutManager object orchestrates the operation of the other text handling objects. It intercedes in operations that convert the data in an NSTextStorage object to rendered text in a view’s display area. It maps Unicode character codes to glyphs and oversees the layout of the glyphs within the areas defined by NSTextContainer objects.

In other words, NSLayoutManager converts Unicode characters into glyphs and lays those glyphs out within the regions defined by NSTextContainer objects.
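To make the relationship between these objects concrete, here is a minimal sketch that wires the stack together by hand. TextStack is a hypothetical wrapper class introduced for illustration, not a UIKit type; it keeps strong references to every object because a layout manager does not own its text storage.

```objc
#import <UIKit/UIKit.h>

// Hypothetical wrapper (not part of UIKit) that builds the Text Kit stack
// manually and keeps every object alive.
@interface TextStack : NSObject
@property (nonatomic, strong, readonly) NSTextStorage *textStorage;     // the characters and attributes
@property (nonatomic, strong, readonly) NSLayoutManager *layoutManager; // glyph generation and layout
@property (nonatomic, strong, readonly) NSTextContainer *textContainer; // the region to fill
@property (nonatomic, strong, readonly) UITextView *textView;           // the display
- (instancetype)initWithText:(NSString *)text frame:(CGRect)frame;
@end

@implementation TextStack
- (instancetype)initWithText:(NSString *)text frame:(CGRect)frame {
    self = [super init];
    if (self) {
        _textStorage = [[NSTextStorage alloc] initWithString:text];

        // The storage notifies its layout managers whenever the text changes.
        _layoutManager = [[NSLayoutManager alloc] init];
        [_textStorage addLayoutManager:_layoutManager];

        // The layout manager fills its containers in the order they were added.
        _textContainer = [[NSTextContainer alloc] initWithSize:frame.size];
        [_layoutManager addTextContainer:_textContainer];

        // The text view displays the glyphs laid out in its container.
        _textView = [[UITextView alloc] initWithFrame:frame
                                        textContainer:_textContainer];
    }
    return self;
}
@end
```

A caller would create the stack with something like [[TextStack alloc] initWithText:@"Hello" frame:CGRectMake(0, 0, 200, 100)] and add its textView to a view hierarchy. Building the stack manually is mainly useful when a custom NSTextStorage or NSLayoutManager subclass is needed; a plain UITextView creates an equivalent stack on its own.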
The layout manager performs the following actions:
Controls text storage and text container objects
Generates glyphs from characters
Computes glyph locations and stores the information
Manages ranges of glyphs and characters
Draws glyphs in text views when requested by the view
Computes bounding box rectangles for lines of text
Controls hyphenation
Manipulates character attributes and glyph properties

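In short, the layout manager coordinates the text storage and text containers, turns Unicode characters into glyphs, computes and caches glyph positions and per-line bounding rectangles, tracks glyph and character ranges, draws glyphs into the view, handles hyphenation, and applies text attributes such as font, color, and subscript.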

Text Kit handles three kinds of text attributes:
character attributes, paragraph attributes, and document attributes.
Character attributes include traits such as font, color, and subscript, which can be associated with an individual character or a range of characters.
Paragraph attributes are traits such as indentation, tabs, and line spacing. Document attributes include documentwide traits such as paper size, margins, and view zoom percentage.

Character attributes: font, color, subscript
Paragraph attributes: indentation, tabs, line spacing
Document attributes: paper size, margins, view zoom percentage
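The first two kinds of attributes map directly onto NSAttributedString keys. The sketch below applies a font and a paragraph style to the whole string and a color to the first character only; MakeStyledText is a hypothetical helper name, and the specific values (13 pt font, 20 pt indent, 4 pt line spacing, red) are arbitrary choices made for illustration.

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper (not part of UIKit): builds an attributed string that
// carries both character attributes and paragraph attributes.
static NSAttributedString *MakeStyledText(NSString *text) {
    // Paragraph attributes: indentation and line spacing.
    NSMutableParagraphStyle *paragraph = [[NSMutableParagraphStyle alloc] init];
    paragraph.firstLineHeadIndent = 20.0;
    paragraph.lineSpacing = 4.0;

    NSDictionary *baseAttributes = @{
        NSFontAttributeName: [UIFont systemFontOfSize:13],  // character attribute
        NSParagraphStyleAttributeName: paragraph            // paragraph attribute
    };
    NSMutableAttributedString *styled =
        [[NSMutableAttributedString alloc] initWithString:text
                                               attributes:baseAttributes];

    // Character attribute on a sub-range: color only the first character,
    // using a composed-character range so a surrogate pair is never split.
    if (text.length > 0) {
        NSRange first = [text rangeOfComposedCharacterSequenceAtIndex:0];
        [styled addAttribute:NSForegroundColorAttributeName
                       value:[UIColor redColor]
                       range:first];
    }
    return styled;
}
```

The result can be assigned to the attributedText property of a UILabel or UITextView. Document attributes are not covered by this sketch.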


UIFont's metrics have a practical use. Suppose a region should display at most six lines of text. With UILabel we can simply set the numberOfLines property; without UILabel, we can use UIFont's lineHeight property instead, as follows:

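// SCREEN_WIDTH is assumed to be a macro defined elsewhere in the project.
+ (CGFloat)calculateContentHeight:(NSString *)content {
    // Nothing to measure for an empty string.
    if (content.length == 0) {
        return 0;
    }

    UIFont *font = [UIFont systemFontOfSize:13];
    CGFloat maxWidth = SCREEN_WIDTH - 30;
    // Cap the height at six lines of text.
    CGFloat maxHeight = ceil(font.lineHeight) * 6;

    // sizeWithFont:constrainedToSize:lineBreakMode: is deprecated since iOS 7;
    // boundingRectWithSize:options:attributes:context: replaces it.
    CGRect bounds = [content boundingRectWithSize:CGSizeMake(maxWidth, CGFLOAT_MAX)
                                          options:NSStringDrawingUsesLineFragmentOrigin
                                       attributes:@{NSFontAttributeName: font}
                                          context:nil];
    return MIN(maxHeight, ceil(bounds.size.height));
}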

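The Unicode control characters inserted into the text are invisible on screen and take up no display space. They only influence, silently, how bidirectional text is displayed.
Unicode directional control characters fall into two classes.
The first class is the implicit directional control characters:
U+200E: LEFT-TO-RIGHT MARK (LRM)
U+200F: RIGHT-TO-LEFT MARK (RLM)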

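Put simply, you can think of these control characters as strong characters that are never displayed: LRM is a strong left-to-right character, and RLM is a strong right-to-left character.
The second class is the explicit directional control characters: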
U+202A: LEFT-TO-RIGHT EMBEDDING (LRE)
U+202B: RIGHT-TO-LEFT EMBEDDING (RLE)
U+202D: LEFT-TO-RIGHT OVERRIDE (LRO)
U+202E: RIGHT-TO-LEFT OVERRIDE (RLO)
U+202C: POP DIRECTIONAL FORMATTING (PDF)
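As a small illustration of how the explicit characters are used, the sketch below wraps a run of text in an embedding and closes it with PDF, and appends an implicit mark to a string. The constant and function names are hypothetical and exist only for this example; the code points are the ones listed above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical constants for the control characters listed above.
static NSString *const kLeftToRightMark      = @"\u200E"; // LRM
static NSString *const kLeftToRightEmbedding = @"\u202A"; // LRE
static NSString *const kRightToLeftEmbedding = @"\u202B"; // RLE
static NSString *const kPopDirectional       = @"\u202C"; // PDF

// Hypothetical helper: wraps `text` in an explicit embedding. Every LRE or
// RLE is paired with a closing PDF so the embedding does not leak into the
// text that follows.
static NSString *EmbedText(NSString *text, BOOL rightToLeft) {
    NSString *opener = rightToLeft ? kRightToLeftEmbedding : kLeftToRightEmbedding;
    return [NSString stringWithFormat:@"%@%@%@", opener, text, kPopDirectional];
}

// Hypothetical helper: appends an invisible strong left-to-right character,
// for example after trailing punctuation that should stay on the right.
static NSString *AppendLeftToRightMark(NSString *text) {
    return [text stringByAppendingString:kLeftToRightMark];
}
```

Each control character occupies one UTF-16 code unit, so EmbedText adds two to the length reported by NSString even though nothing extra is displayed.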