To: freebsd-hackers, Kyle Evans
From: Yuri Pankov
Subject: libc/regex: r302824 added invalid check breaking collating ranges
Date: Tue, 23 Jan 2018 03:53:19 +0300

(CCing Kyle as he's working on regex at the moment and not because he broke something)

Hi,

r302824 added an invalid check which breaks collating ranges:

-if (table->__collate_load_error) {
-	(void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
+if (table->__collate_load_error || MB_CUR_MAX > 1) {
+	(void)REQUIRE(start <= finish, REG_ERANGE);

The "MB_CUR_MAX > 1" is wrong: we should be doing a proper comparison according to the current locale's collation, not simply comparing the wchar_t values.

Example -- see Table 1 in http://www.unicode.org/reports/tr10/:

Let's try Swedish collation:

$ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[ö-z]'
grep: invalid character range
$ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[z-ö]'

OK, the above seems to be correct, 'ö' > 'z' in Swedish collation, but we just got lucky here, as the wchar_t comparison gives us the same result.

Now the German one:

$ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[ö-z]'
grep: invalid character range
$ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[z-ö]'

Same result, but according to the table, 'ö' < 'z' in German collation!

I think the fix here would be to drop the "if (table->__collate_load_error || MB_CUR_MAX > 1)" block entirely. We no longer use "table", so there's no point in getting it and checking the error; wcscoll(), which is eventually called from p_range_cmp(), does the table handling itself. And we can't use the direct comparison for anything other than the 'C' locale (I'm not sure it's applicable even there).
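
To make the difference concrete, here is a minimal sketch of a collation-aware range check, as opposed to comparing raw wchar_t values. The names collate_cmp(), range_is_valid() and range_check_demo() are hypothetical and are not the functions in regcomp.c; the only real API assumed is the standard wcscoll(), which compares according to the current LC_COLLATE.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <wchar.h>

/*
 * Hypothetical helper: compare two characters using the current
 * LC_COLLATE rules instead of their wchar_t values.
 */
static int
collate_cmp(wchar_t a, wchar_t b)
{
	wchar_t s1[2] = { a, L'\0' };
	wchar_t s2[2] = { b, L'\0' };

	return (wcscoll(s1, s2));
}

/*
 * Hypothetical range check: a bracket range [start-finish] is accepted
 * when start does not collate after finish in the current locale.
 */
static bool
range_is_valid(wchar_t start, wchar_t finish)
{
	return (collate_cmp(start, finish) <= 0);
}

/*
 * Small self-check in the "C" locale, where 'a' sorts before 'z'.
 */
static void
range_check_demo(void)
{
	setlocale(LC_COLLATE, "C");
	assert(range_is_valid(L'a', L'z'));
	assert(!range_is_valid(L'z', L'a'));
}
```

With LC_COLLATE set to a German locale, range_is_valid(L'ö', L'z') would be expected to hold if the locale data follows the table referenced above, whereas the plain "start <= finish" test rejects it because the code point of 'ö' is larger than that of 'z'.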