From: Malcolm Kay
Organization: At home
To: zhangweiwu@realss.com, "Zhang Weiwu", questions@freebsd.org
Date: Tue, 6 Jan 2004 12:30:42 +1030
Message-Id: <200401061230.42038.malcolm.kay@internode.on.net>
Subject: Re: help me with this sed expression

On Mon, 5 Jan 2004 22:19, Zhang Weiwu wrote:
> Hello. I've worked for an hour trying to figure out a series of sed commands
> to process some text (without any luck; you know I'm kind of a newbie). I
> would really appreciate your help.
>
> The original text file is in this form, for each line:
> one Chinese word, then one or two English words separated by a space.
>
> I wish to change it to:
> 1) target file: one English word, then a space, then the Chinese word
> corresponding to that English word.
> 2) if in the original file one Chinese word has more than one English word
> following it on the same line, repeat the Chinese word to satisfy 1).
> > Define: Chinese word =3D one or more continous bytes of data where each= byte > is greater then 128 in value. (it is true in GB2312 Chinese charset whi= ch > this email is written in.) > Define: English word =3D one or more continous bytes of [a-z]. > > Say, for the original file: > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > =D2=BBa av > =BF=C9=B8=E8=BF=C9=C6=FCaaav > =CE=DE=BF=C9=B7=EE=B8=E6aacm > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > The target file should be: > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > a =D2=BB > av =D2=BB > aaav =BF=C9=B8=E8=BF=C9=C6=FC > aacm =CE=DE=BF=C9=B7=EE=B8=E6 > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > I tried to do things like s/\(.*\)\([a-z]*\)/\2 \1/ but the first \(.*\= ) is > too greedy and included the rest [a-z]. Well the greedy part is easily fixed with: s/\([^a-z]*\)\([a-z]*\)/\2 \1/ But this will not work for those lines with 2 english words. The followin= g should: % sed -n -e 's/\([^a-z]*\)\([a-z]*\) .*/\2 \1/p' -e 's/\([^a-z]*\)[a-z]* = \([a-z]*\)/\2 \1/p' original > target Malcolm Kay