沉沙
2018-09-25
Abstract: This tutorial examines how a gprecoverseg run led to a metadata catalog problem that left a Greenplum database unable to start, and walks through the diagnosis and repair. We hope it deepens your understanding of big data technology.
I. Symptoms
After one run of gprecoverseg, the database could no longer accept connections. The symptoms were as follows.
The gprecoverseg log:
gprecoverseg:mdw-1:gpadmin-[INFO]:-Starting gprecoverseg with args: -r
gprecoverseg:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev'
gprecoverseg:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)'
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 1
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 2
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 3
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 4
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 5
gprecoverseg:mdw-1:gpadmin-[CRITICAL]:-gprecoverseg failed. (Reason='Unable to connect to database and start transaction') exiting...
This shows that the gprecoverseg command failed.
We then tried to start the database with gpstart:
gpstart:mdw-1:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
..
gpstart:mdw-1:gpadmin-[INFO]:-Process results...
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:- Successful segment starts = 8
gpstart:mdw-1:gpadmin-[INFO]:- Failed segment starts = 0
gpstart:mdw-1:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-
gpstart:mdw-1:gpadmin-[INFO]:-Successfully started 8 of 8 segment instances
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-Starting Master instance mdw-1 directory /home/data/master/gpseg-1
gpstart:mdw-1:gpadmin-[INFO]:-Command pg_ctl reports Master mdw-1 instance active
gpstart:mdw-1:gpadmin-[WARNING]:-server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
gpstart:mdw-1:gpadmin-[INFO]:-Starting standby master
gpstart:mdw-1:gpadmin-[INFO]:-Checking if standby master is running on host: mdw-2 in directory: /home/data/master/gpseg-1
gpstart:mdw-1:gpadmin-[INFO]:-Check status of database with gpstate utility
At this point all segments report as started, but the database itself did not come up.
Next, run gpstate to confirm the status:
gpstate:mdw-1:gpadmin-[INFO]:-Starting gpstate with args:
gpstate:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev'
gpstate:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)'
gpstate:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gpstate:mdw-1:gpadmin-[INFO]:-Gathering data from segments...
.
gpstate:mdw-1:gpadmin-[INFO]:-Greenplum instance status summary
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:- Master instance = Active
gpstate:mdw-1:gpadmin-[INFO]:- Master standby = mdw-2
gpstate:mdw-1:gpadmin-[INFO]:- Standby master state = Standby host passive
gpstate:mdw-1:gpadmin-[INFO]:- Total segment instance count from metadata = 8
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:- Primary Segment Status
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:- Total primary segments = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total primary segment valid (at master) = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total primary segment failures (at master) = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files found = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs found = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files found = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes found = 4
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:- Mirror Segment Status
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:- Total mirror segments = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total mirror segment valid (at master) = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total mirror segment failures (at master) = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files found = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs found = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files found = 4
gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes missing = 0
gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes found = 4
gpstate:mdw-1:gpadmin-[WARNING]:-Total number mirror segments acting as primary segments = 1 <<<<<<<<
gpstate:mdw-1:gpadmin-[INFO]:- Total number mirror segments acting as mirror segments = 3
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
This indicates that a primary/mirror role switch has taken place.
Attempting to connect to the database:
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
At this point the database could not be connected to at all.
II. Diagnosis
1. Check the logs
First, examine the master log:
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 37). The new session id = 38",,,,,,,0,,"cdbgang.c",1638,
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","Failed to connect to seg0 xx.xx.xx.xx:50000 (xx.xx.xx.xx is the segment's IP)",,,,,,,0,,"cdbgang_async.c",174,
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con39,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 38). The new session id = 39",,,,,,,0,,"cdbgang.c",1638,
CST,,,p30743,th0,,,CST,0,con39,,seg-1,,,,,"PANIC","XX000","Unexpected internal error: Master process received signal SIGSEGV",,,,,,,0,,,,"1 0xa0ce4b postgres StandardHandlerForSigillSigsegvSigbus_OnMainThread + 0x1b2
2 0x8d376e postgres CdbProgramErrorHandler + 0xf1
3 0x7fee63a14100 libpthread.so.0
4 0xaaa718 postgres
5 0xaaa3e4 postgres CdbDispatchCommand + 0x35
6 0xb75719 postgres
7 0xb755d4 postgres
8 0xb7524b postgres
9 0xb71c51 postgres
10 0xb720ae postgres initTM + 0xe2
11 0xb78864 postgres cdb_setup + 0x4b
12 0xa1b6e8 postgres InitPostgres + 0x9fa
"
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","server process (PID 30743) was terminated by signal 11: Segmentation fault",,,,,,,0,,"postmaster.c",5604,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",5284,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","seqserver process (PID 28900) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","ftsprobe process (PID 28901) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","sweeper process (PID 28902) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","BeginResetOfPostmasterAfterChildrenAreShutDown: counter 0",,,,,,,0,,"postmaster.c",2171,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","gp_session_id high-water mark is 39",,,,,,,0,,"postmaster.c",2197,
This line shows that one segment could not be reached:
"Failed to connect to seg0 xx.xx.xx.xx:50000 (xx.xx.xx.xx is the segment's IP)",,,,,,,0,,"cdbgang_async.c",174,
Logging in to that segment's host and checking its log, we found no relevant entries around that point in time.
2. Check other status information
Run gpstate -s to inspect the state of every node; below is the excerpt for the problematic segment pair:
gpstate:mdw-1:gpadmin-[INFO]:- Segment Info
gpstate:mdw-1:gpadmin-[INFO]:- Hostname = sdw-2
gpstate:mdw-1:gpadmin-[INFO]:- Address = sdw-2
gpstate:mdw-1:gpadmin-[INFO]:- Datadir = /gpadmin/data/mirror/gpseg0
gpstate:mdw-1:gpadmin-[INFO]:- Port = 50000
gpstate:mdw-1:gpadmin-[INFO]:- Mirroring Info
gpstate:mdw-1:gpadmin-[INFO]:- Current role = Primary
gpstate:mdw-1:gpadmin-[INFO]:- Preferred role = Mirror
gpstate:mdw-1:gpadmin-[INFO]:- Mirror status = Resynchronizing
gpstate:mdw-1:gpadmin-[INFO]:- Change Tracking Info
gpstate:mdw-1:gpadmin-[INFO]:- Change tracking data size = 32.1 kB
gpstate:mdw-1:gpadmin-[INFO]:- Resynchronization Info
gpstate:mdw-1:gpadmin-[INFO]:- Resynchronization mode = Incremental
gpstate:mdw-1:gpadmin-[INFO]:- Data synchronized = 0 bytes
gpstate:mdw-1:gpadmin-[INFO]:- Estimated total data to synchronize = Not Available
gpstate:mdw-1:gpadmin-[INFO]:- Estimated resync progress with mirror = Not Available
gpstate:mdw-1:gpadmin-[INFO]:- Total resync objects = 0
gpstate:mdw-1:gpadmin-[INFO]:- Objects to resync = 0
gpstate:mdw-1:gpadmin-[INFO]:- Estimated resync end time = Not Available
gpstate:mdw-1:gpadmin-[INFO]:- Status
gpstate:mdw-1:gpadmin-[INFO]:- PID = 26649
gpstate:mdw-1:gpadmin-[INFO]:- Configuration reports status as = Up
gpstate:mdw-1:gpadmin-[INFO]:- Database status = Stopping
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:- Segment Info
gpstate:mdw-1:gpadmin-[INFO]:- Hostname = sdw-1
gpstate:mdw-1:gpadmin-[INFO]:- Address = sdw-1
gpstate:mdw-1:gpadmin-[INFO]:- Datadir = /gpadmin/data/primary/gpseg0
gpstate:mdw-1:gpadmin-[INFO]:- Port = 40000
gpstate:mdw-1:gpadmin-[INFO]:- Mirroring Info
gpstate:mdw-1:gpadmin-[INFO]:- Current role = Mirror
gpstate:mdw-1:gpadmin-[INFO]:- Preferred role = Primary
gpstate:mdw-1:gpadmin-[INFO]:- Mirror status = Resynchronizing
gpstate:mdw-1:gpadmin-[INFO]:- Status
gpstate:mdw-1:gpadmin-[INFO]:- PID = 30458
gpstate:mdw-1:gpadmin-[INFO]:- Configuration reports status as = Up
gpstate:mdw-1:gpadmin-[WARNING]:- Segment status = Down <<<<<<<<
Here the segment pair is in change-tracking/resynchronizing state, the acting primary's database status is Stopping, and the mirror on sdw-1 (port 40000) is reported Down, which conflicts with the Up status recorded in the configuration.
Since the database could not be connected to normally, we connected directly to the master in master-only mode and examined the metadata catalog:
postgres=# select * from gp_segment_configuration;
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
 (healthy rows omitted)
    2 |       0 | m    | p              | r    | u      | 40000 | sdw-1    | sdw-1   |            41000 |
    6 |       0 | p    | m              | r    | u      | 50000 | sdw-2    | sdw-2   |            51000 |
The catalog marks this segment pair as resynchronizing (mode = 'r'), yet both rows have status 'u'. The metadata is clearly wrong: one of the two segments is actually down.
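For reference, this catalog can be inspected even when normal connections fail. A sketch, assuming a GP 4.3-style cluster: with the cluster stopped, start only the master (`gpstart -m`), connect in utility mode, and query the pair directly:

```sql
-- Hedged sketch: connect to the master-only instance in utility mode, e.g.
--   PGOPTIONS='-c gp_session_role=utility' psql postgres
-- then inspect every primary/mirror pair's recorded state:
SELECT dbid, content, role, preferred_role, mode, status, port, hostname
FROM gp_segment_configuration
ORDER BY content, dbid;
```

Comparing this output against what gpstate -s actually reports is what exposed the inconsistency here.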
III. Solution
From all of the information above we can conclude:
The database can neither start normally nor accept connections.
Because no connection is possible, a full recovery with gprecoverseg -F is not an option.
The metadata is wrong: at startup the database trusts the catalog, believes the segment pair is healthy, and tries to resynchronize it; but one segment is actually down, the synchronization can never complete, and the database fails to start.
gpcheckcat is the tool for repairing catalog errors (it produces a report and repair suggestions), but it requires a running database, so it cannot be used here.
We therefore need to repair the metadata by hand so that Greenplum can start. First, some background on Greenplum's primary/mirror states:
Greenplum's mechanism for probing segment state is called FTS (fault tolerance service). Consider a failover in which the primary survives and the mirror dies: the master probes the primary successfully and learns from the primary's response that the mirror is down (the primary detects the mirror failure itself and switches into change-tracking mode). The master then updates its metadata and moves on to probe the next segment. From this state, primary and mirror will later resynchronize automatically. Knowing this, we can edit the metadata catalog so that Greenplum completes the synchronization on its own.
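In catalog terms, the transition just described can be sketched as follows (illustrative values only, not taken from this cluster):

```sql
-- Hedged, generic illustration of the FTS transition in
-- gp_segment_configuration for one primary/mirror pair:
--
--   before (synchronized):          after FTS detects the dead mirror:
--     role | mode | status            role | mode | status
--      p   |  s   |   u                p   |  c   |   u   -- change tracking
--      m   |  s   |   u                m   |  c   |   d   -- marked down
--
-- The pair's current flags can be read back with:
SELECT role, mode, status
FROM gp_segment_configuration
WHERE content = 0;
```

A catalog in this 'c'/'d' state is one Greenplum knows how to recover from, which is exactly the state the manual fix below aims to restore.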
Next, consider Greenplum's primary/mirror replication mechanism:
Data is written to the primary first, and the primary then ships it to the mirror. So after a primary/mirror role switch, the data to synchronize must flow from the former primary (the current mirror) to the former mirror (the current primary). With that in mind, we modify the metadata table.
Set mode to 'c' for dbid 2 and status to 'd' for dbid 6, then restart the database. It starts normally and the data enters the synchronizing state.
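As a sketch, the catalog edit described above could look like this in a utility-mode master session. The dbid values (2 and 6) are specific to this cluster's gp_segment_configuration output, and the allow_system_table_mods syntax shown is the GP 4.3 form; editing system catalogs is dangerous and should be a last resort:

```sql
-- Hedged sketch of the fix described in the text.
-- Back up the master data directory before touching system catalogs.
SET allow_system_table_mods = 'dml';   -- GP 4.3 syntax for permitting catalog DML
UPDATE gp_segment_configuration SET mode = 'c'   WHERE dbid = 2;
UPDATE gp_segment_configuration SET status = 'd' WHERE dbid = 6;
-- Then stop the master-only instance and restart the cluster
-- (gpstop -m followed by gpstart) so startup proceeds from the
-- corrected state.
```

After the restart, gpstate should show the pair in change tracking, and a subsequent gprecoverseg can bring it back into sync.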
This article was compiled and published by Zhizuobiao (职坐标). To learn more about big data technology, follow Zhizuobiao's big data and cloud computing channel.