Loading the Google +1 Button with jQuery

Thanks to the almighty GFW, adding the script the official way can delay page load by five seconds or more, so it's better to inject the script tag asynchronously with jQuery.

$(function () {
    // Create the script tag ourselves so the +1 button loads asynchronously.
    var plus = document.createElement('script');
    plus.src = 'http://apis.google.com/js/plusone.js';
    // The language configuration is passed as the script tag's text content.
    plus.text = "{lang: 'zh-CN'}";
    document.documentElement.firstChild.appendChild(plus);
});
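If you prefer to stay within jQuery's own API, a roughly equivalent sketch uses $.getScript; the window.___gcfg global is assumed here as an alternative way to pass the same language setting:

$(function () {
    // Sketch only: window.___gcfg is assumed to carry the same language
    // setting that the script tag's text content does in the snippet above.
    window.___gcfg = {lang: 'zh-CN'};
    // jQuery fetches and executes the script asynchronously.
    $.getScript('http://apis.google.com/js/plusone.js');
});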

Random Notes from QCon 2011 & the Full Version of 视觉中国's MongoDB Practice

The three days of QCon are over, and I only lasted two. I really wanted to hear today's talk on Twitter's tuning and 54chen's keynote, but after catching a cold yesterday I couldn't make it to the venue today, which is a pity.
For me, attending QCon is all about learning. Next to so many top people in the industry, my own breadth of knowledge and depth of research looked rather thin.
The biggest gain, as always, was meeting a few more people, some of whom I had been following on T for a long time. The offline conversations gave me plenty of inspiration and I benefited a lot.

I was in the NoSQL track, hosted by 洪强宁 from Douban. He was an excellent host and gave me a lot of directional guidance while I was preparing my slides. Compared with the other two talks, which focused on implementation details and were technically much denser, mine took a fairly shallow angle and was more of a running account of actual usage. During the internal dry run my colleagues weren't impressed, so I suspect the developers in the audience didn't get much out of it either. I kept my own timer while speaking, but still got a bit flustered when the staff held up the time card, because it seemed slightly off from my own count. The Q&A at the end was also a little quiet, yet afterwards several attendees came up with questions, which surprised me a bit. In the evening I hitched a ride home with 强宁 again; being a track host is not an easy job.

The two sessions I got the most out of were, first, Taobao 褚霸's talk on tuning MySQL for the product database: some of the points I had used before, others I had used without understanding why they worked. The second was Robin Lu's talk on high-quality iOS development. I'm a layman in that area, so I felt I learned even more, and skipped the developer-oriented MySQL optimization session for it. Taobao's front-end optimization talk was a bit different from what I expected, probably because my notion of "front end" versus "back end" was mixed up. Although we don't use Java for the front end, I did use Velocity for a while back in the day, so it was still interesting to listen to,
and it gave me some new ideas for adjusting parts of our own architecture. The unfortunate part: out of laziness I sat behind a projector instead of standing, and having already overheated during the iOS session, I promptly came down with a cold and had to head home.

For the next while I need to properly digest the ideas and approaches from QCon, so I'll stay holed up at home.

-------

A small correction: on the 9th, while waiting for the Netflix talk, a developer from 奇艺 asked me about storing a verification code in MongoDB and being able to read it from a slave immediately after saving it; was that possible? I answered that you need to read from the primary, not from a slave, because a read from a slave can't be guaranteed to see it. The Netflix talk was about to start, so I didn't think it through carefully. Thinking about it again after I got back, that answer isn't quite right: you can achieve it with the w parameter (w = primary + slaves). But that effectively makes replication synchronous, so you should set wtimeout to avoid blocking for too long. I still don't recommend doing it, but I hope that developer sees this correction so I haven't misled you.
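A rough illustration in the mongo shell (the collection name and values are made up): after the insert, getLastError waits until the write has reached a given number of replica-set members, with a timeout so the call can't hang indefinitely.

// Hypothetical collection and values, for illustration only.
db.captcha.insert({code: '7x3k9', created: new Date()});
// Wait until the write has reached 2 members (primary + one slave),
// but give up after 5 seconds instead of blocking indefinitely.
db.runCommand({getlasterror: 1, w: 2, wtimeout: 5000});

If wtimeout expires the command returns a timeout error, and the write may still replicate later; that is exactly why I'd rather not rely on this pattern.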

-------

Because of the time limit, the keynote I gave at QCon was an abridged version; I cut the parts at the end that I felt were fluff and that most people would already know. For the record, I've uploaded it to SlideShare.

MongoDB Beijing Technical Meetup

After finishing a company meeting I hurried over to the MongoDB venue and found it packed.
Between sessions I ran into 范凯, 程显峰, Sina's timyang, and a few people from 淘宝 and 盛大.

My own keynote was still fairly lightweight; for the DBAs and architects in the room there probably wasn't much that made their eyes light up. The root cause is the difference in business characteristics, plus the limited data volume of a small-to-medium site. I was glad to see that companies like 淘宝, 盛大, and 豆瓣 are starting to test MongoDB and roll it out for real use cases, and I'm quite looking forward to the practice reports they will share in the future.

After the meetup I chatted with 超群. Setting aside the eye-catching selling points like performance and scalability, MongoDB's biggest benefit for ordinary developers is that it strikes a reasonable middle ground between a KV store and an RDBMS, which makes development more cost-effective.

The biggest gain today was, once again, seeing old friends and meeting new ones, which is probably the main reason to attend technical gatherings like this.

Review: MongoDB 1.8 Release

The full release notes are here: http://www.mongodb.org/display/DOCS/1.8+Release+Notes

A few of the improvements that matter most for our usage:

1. Journaling

This improves MongoDB's single-node reliability, which is currently its weak spot. Because it uses mmap'd files that are only flushed to disk every 60 seconds,
a fatal crash in between (try kill -9, or pulling the power cord) leaves the data inconsistent on restart and forces a repair. Repair takes an extremely long time; the power outages in our data center have made me suffer through this more than once. There are some workarounds today, but journaling is the more practical answer.

Of course, every improvement has a cost: the journal adds overhead to writes. Fortunately group commit is used, which recovers part of the performance; the group commit interval is currently 100 ms and may become tunable in the future.
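A quick way to see journaling at work, assuming mongod was started with journaling enabled (and assuming the 1.8 serverStatus output includes a dur section; exact field names may vary by build):

// Sketch: with journaling on, serverStatus() exposes a "dur" section
// with group-commit counters (commits per interval, journaled MB, etc.).
var dur = db.serverStatus().dur;
printjson(dur);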

In my view another small benefit of group commit is that it can slightly reduce fragmentation. The other MongoDB feature I'm waiting for most is online compaction, which will probably have to wait for 2.0.

2. Sparse and Covered Indexes

A sparse index addresses the problem of index files growing too large. Sometimes the attribute we want to index is not present on every document; a normal index includes every document,
while a sparse index only includes the documents that actually have the attribute, which keeps the index smaller. The current limitation is that a sparse index can only cover a single attribute, but even so it's worth using in practice.
In my own development, a field is often added later and queried with $exists, and a sparse index fits that scenario well.
It also lets you avoid giving fields default values, which reduces junk data. A minimal example is sketched below.
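A minimal sketch in the mongo shell (collection and field names are made up): index only the documents that carry a promo_code field, then query them with $exists as described above.

// Hypothetical collection: only some documents carry promo_code.
db.items.ensureIndex({promo_code: 1}, {sparse: true});
// Documents without the field are simply left out of the index,
// keeping it small; $exists is the query pattern mentioned above.
db.items.find({promo_code: {$exists: true}}).limit(10);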

Covered indexes address another scenario: if the fields you query are all contained in the index, the result can be returned directly from the index without touching the actual documents.

This is quite useful for our ORM implementation, which uses lazy loading: most queries only return _id, and a data mapper then fetches the full document. A sketch of such a covered query follows.
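A rough sketch of that pattern in the mongo shell (names invented): to keep the query covered, the compound index includes _id, and the projection asks only for indexed fields.

// Hypothetical: look up users by email but return only _id.
// Including _id in the index keeps the projection fully covered.
db.users.ensureIndex({email: 1, _id: 1});
db.users.find({email: 'foo@example.com'}, {_id: 1});
// explain() should report indexOnly: true when the index alone answers it.
db.users.find({email: 'foo@example.com'}, {_id: 1}).explain();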

Looking Ahead to the Year of the Rabbit

The Year of the Rabbit is here, the last year before 2012, so there is a lot to get done.

In early March there is the MongoDB conference in Beijing; I've been helping Adam from 10gen coordinate it since late last November, robbin (@robbin) and CSDN joined later, and it finally came together before the new year. In early April there is the InfoQ Beijing conference; thanks to @hongqn for the recommendation. Being a homebody and well aware of my limited abilities, I've generally avoided these events, yet somehow they all landed at once, and I'm rather anxious about getting something wrong and leading people astray.

At work this year I'm responsible for the overall community/product department, a quasi-business-unit split with clearer responsibilities, and I hope we can genuinely get out of the rut of the past few years. Beyond the technical side, building a useful product is also one of my goals. Making a product is easy; actually creating and operating one that is genuinely useful is very hard. Everyone can lecture you with a pile of theory and a pile of examples (just never a successful example of their own), and of course the "everyone" defined here certainly includes me.

Although I come from a technical background, I don't agree that technology is the most important link in the chain. I'm equally put off by an operations-above-all approach; striking the balance is where a team shows its skill.

In short, this year I want to charge ahead with my teammates, hit our targets, and make everyone's year-end bonus bigger and fatter. Even if it won't buy a ticket on the big ship, it should still make people happy.

If I can squeeze out some spare time, I hope to finish the few open-source projects sitting on my Things to-do list.

Last and most important, this year I'll complete two major life events and enter a new stage.

----
Putting this on record. 2011.2.7

c30k and nginx AIO

Recently our download service ran into c30k, which brought the nginx-fronted downloads nearly to a halt. The reason is simple: the servers are deployed overseas and, for the reason everyone knows, the SL data center's links are unstable, with heavy jitter in egress bandwidth across regions. To speed up downloads we loosened the restrictions and allowed users to use multi-threaded download tools, which naturally produced a c10k problem. The files aren't small, each user runs at least four threads and downloads several assets at once, so concurrent connections easily exceed 30k.
We're also constrained by money and can't add capacity (and if we had the money we wouldn't be running overseas anyway), so we have to raise the concurrency and throughput of a single machine.

Our download service is a Plack application written in Perl, a typical PSGI app handling download authentication, a real-time firewall, download tracking, and so on, so we can't serve the files as plain static downloads. (Perl's performance is actually quite good: deployed under Starman it is roughly 10x the PHP implementation under PHP-FPM.)

Starman is a solid PSGI server that uses the traditional prefork model. Efficient as it is, prefork simply can't cope with c10k; I can't raise Starman's worker count to several hundred or a thousand. As mentioned in an earlier post, evented IO is one way to handle c10k, so I swapped Starman out for Twiggy. Twiggy is a single-process PSGI server built on AE (AnyEvent). At low concurrency a single Twiggy process has lower qps than Starman, but at high concurrency Twiggy's advantage shows. In the actual deployment I start several Twiggy processes, each listening on its own port, with nginx load-balancing across them via an upstream block. Ten Twiggy processes already deliver far more throughput than fifty Starman workers, and Twiggy's overhead is small, so adding more processes is cheap.
Thanks to the PSGI interface spec, switching from Starman to Twiggy required no application changes at all (provided the application does no blocking IO).

The other problem is the server's relatively high IO wait; downloads are, after all, an IO-bound workload.
Nginx supports Linux native AIO, so I wondered: would AIO cut IO wait substantially and give a noticeable performance boost?
There is material online hyping the magical performance gains of nginx AIO. I was skeptical, because none of it comes with comparative test data;
it's all hearsay, and most of the configurations shown are flawed in one way or another.

I run CentOS, and to use nginx AIO you need at least CentOS 5.5, because only the 5.5 kernel has the AIO backport; nginx does not use libaio.
Also, nginx's AIO was originally developed for FreeBSD. It works on Linux, but is subject to quite a few limitations of Linux AIO:
1. Direct IO is mandatory, which means the VM's disk cache can't be used.
2. Only the portion of a file aligned to whole blocks (the block size defined by directio_alignment) can be read with AIO; the unaligned parts before and after those whole blocks are read with blocking IO, which is why output_buffers is needed. The right directio_alignment depends on your filesystem. The default is 512, but for XFS, note that unless you have changed the XFS bsize you should set it to XFS's default of 4k.

The configuration I used:
location /archive {
    internal;
    aio on;
    directio 4k;
    directio_alignment 4k;
    output_buffers 1 128k;
}

With AIO enabled you can see in vmstat that the memory used for cache drops quickly, because AIO requires direct IO, which bypasses the VM disk cache.

What about actual performance: is AIO necessarily fast? Even Igor isn't sure about that.

In our own results AIO brought no noticeable improvement; on the contrary, it occasionally increased IO wait slightly. That's because the disk cache can't be used,
and when most file reads are misaligned with directio_alignment (especially for resumed downloads, where the read offset usually falls outside the aligned block boundary), that portion has to be read with blocking IO and without any disk cache, so higher IO wait is understandable.

The conclusion: rather than relying on nginx AIO, which isn't all that dependable, it's better to run a few more nginx workers and make full use of the VM disk cache. With memory at 100% utilization, nginx serves static files faster than it does in AIO mode.

BTW, this case reconfirms one of my beliefs: don't trust online claims that come with no test data. Most of them are copy-and-paste legends, everybody praising something that few have actually verified.

MongoDB ReplicaSet Problem

Due to a bug in the current version (v1.6.3), hosts cannot be given as internal IP addresses when initializing a replica set. For example,
suppose you run the rs initialization on 192.168.8.3:

rs.initiate({_id: 'cv_rs1', members: [
    {_id: 0, host: '192.168.8.3:27017'},
    {_id: 1, host: '192.168.8.8:27017'},
    {_id: 2, host: '192.168.8.9:27017'}]
});

You'll get an error:
“all members and seeds must be reachable to initiate set”

The workaround is to use hostnames instead of the private IP addresses:

rs.initiate({_id: 'cv_rs1', members: [
    {_id: 0, host: 's3:27017'},
    {_id: 1, host: 's8:27017'},
    {_id: 2, host: 's9:27017'}]
});

This bug is fixed in 1.7.1.

Foursquare outage post mortem (from the MongoDB user mailing list)

(Note: this is being posted with Foursquare’s permission.)
As many of you are aware, Foursquare had a significant outage this
week. The outage was caused by capacity problems on one of the
machines hosting the MongoDB database used for check-ins. This is an
account of what happened, why it happened, how it can be prevented,
and how 10gen is working to improve MongoDB in light of this outage.
It’s important to note that throughout this week, 10gen and Foursquare
engineers have been working together very closely to resolve the
issue.
* Some history
Foursquare has been hosting check-ins on a MongoDB database for some
time now. The database was originally running on a single EC2
instance with 66GB of RAM. About 2 months ago, in response to
increased capacity requirements, Foursquare migrated that single
instance to a two-shard cluster. Now, each shard was running on its
own 66GB instance, and both shards were also replicating to a slave
for redundancy. This was an important migration because it allowed
Foursquare to keep all of their check-in data in RAM, which is
essential for maintaining acceptable performance.
The data had been split into 200 evenly distributed chunks based on
user id. The first half went to one server, and the other half to the
other. Each shard had about 33GB of data in RAM at this point, and
the whole system ran smoothly for two months.
* What we missed in the interim
Over these two months, check-ins were being written continually to
each shard. Unfortunately, these check-ins did not grow evenly across
chunks. It’s easy to imagine how this might happen: assuming certain
subsets of users are more active than others, it’s conceivable that
their updates might all go to the same shard. That’s what occurred in
this case, resulting in one shard growing to 66GB and the other only
to 50GB. [1]
* What went wrong
On Monday morning, the data on one shard (we’ll call it shard0)
finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
machine. Whenever data size grows beyond physical RAM, it becomes
necessary to read and write to disk, which is orders of magnitude
slower than reading and writing RAM. Thus, certain queries started to
become very slow, and this caused a backlog that brought the site
down.
We first attempted to fix the problem by adding a third shard. We
brought the third shard up and started migrating chunks. Queries were
now being distributed to all three shards, but shard0 continued to hit
disk very heavily. When this failed to correct itself, we ultimately
discovered that the problem was due to data fragmentation on shard0.
In essence, although we had moved 5% of the data from shard0 to the
new third shard, the data files, in their fragmented state, still
needed the same amount of RAM. This can be explained by the fact that
Foursquare check-in documents are small (around 300 bytes each), so
many of them can fit on a 4KB page. Removing 5% of these just made
each page a little more sparse, rather than removing pages
altogether.[2]
After the first day’s outage it had become clear that chunk migration,
sans compaction, was not going to solve the immediate problem. By the
time the second day’s outage occurred, we had already move 5% of the
data off of shard0, so we decided to start an offline process to
compact the database using MongoDB’s repairDatabase() feature. This
process took about 4 hours (partly due to the data size, and partly
because of the slowness of EBS volumes at the time). At the end of
the 4 hours, the RAM requirements for shard0 had in fact been reduced
by 5%, allowing us to bring the system back online.
* Afterwards
Since repairing shard0 and adding a third shard, we’ve set up even
more shards, and now the check-in data is evenly distributed and there
is a good deal of extra capacity. Still, we had to address the
fragmentation problem. We ran a repairDatabase() on the slaves, and
promoted the slaves to masters, further reducing the RAM needed on
each shard to about 20GB.
* How is this issue triggered?
Several conditions need to be met to trigger the issue that brought
down Foursquare:
1. Systems are at or over capacity. How capacity is defined varies; in
the case of Foursquare, all data needed to fit into RAM for acceptable
performance. Other deployments may not have such strict RAM
requirements.
2. Document size is less than 4k. Such documents, when moved, may be
too small to free up pages and, thus, memory.
3. Shard key order and insertion order are different. This prevents
data from being moved in contiguous chunks.
Most sharded deployments will not meet these criteria. Anyone whose
documents are larger than 4KB will not suffer significant
fragmentation because the pages that aren’t being used won’t be
cached.
* Prevention
The main thing to remember here is that once you’re at max capacity,
it’s difficult to add more capacity without some downtime when objects
are small. However, if caught in advance, adding more shards on a
live system can be done with no downtime.
For example, if we had notifications in place to alert us 12 hours
earlier that we needed more capacity, we could have added a third
shard, migrated data, and then compacted the slaves.
Another salient point: when you’re operating at or near capacity,
realize that if things get slow at your hosting provider, you may find
yourself all of a sudden effectively over capacity.
* Final Thoughts
The 10gen tech team is working hard to correct the issues exposed by
this outage. We will continue to work as hard as possible to ensure
that everyone using MongoDB has the best possible experience. We are
thankful for the support that we have received from Foursquare and our
community during this unfortunate episode. As always, please let us
know if you have any questions or concerns.
[1] Chunks get split when they are 200MB into 2 100MB halves. This
means that even if the number of chunks on each shard was the same,
data size is not always so. This is something we are going to be
addressing in MongoDB. We’ll be making splitting balancing look for
this imbalance so it can act upon it.
[2] The 10gen team is working on doing online incremental compaction
of both data files and indexes. We know this has been a concern in
non-sharded systems as well. More details about this will be coming
in the next few weeks.
=====
From the mailing list: http://groups.google.com/group/mongodb-user/browse_thread/thread/528a94f287e9d77e# — archived here so you don't have to climb over the wall to read it.

AnyMongo – a MongoDB driver for AnyEvent applications

I've recently built a prototype MongoDB driver for AnyEvent. Most of the code comes from the official Perl driver, and part of it from the Ruby driver, which I find
the more interesting of the two. The plain socket operations are replaced with AE handles, BSON decode/encode was added, and the write_*** functions were rewritten.
There is also an AnyMongo::Compat compatibility package: code written against the official MongoDB driver should run unchanged, since this package passes most of the
official 0.36 test suite (t/perl-driver-api/*.t). Code that uses low-level functions such as write/recv is not compatible.

Since this is a prototype, the current version is for study and reference only and not for production use; it's still missing many features, including authentication,
pairs/replica-set support, reconnect, timeouts, and so on.

Performance is neither great nor terrible: CRUD operations are 20-30% faster than the official driver, but cursor operations are, conversely, about 30% slower. The official driver's
cursor performance has its own issues, though: once a cursor iterates over more than 100,000 records, the gap between the two narrows to around 5%.

There is still plenty of room to improve AnyMongo's performance. For now the priority is to fill in the missing features and test compatibility with Coro,
and only then tackle the cursor performance.

The plan is for it to be production-ready around 0.10, just in time to use it in our new project.

AnyMongo is released under the Perl license; source code:

http://github.com/nightsailer/any-mongo

Updated:
A quick round of profiling showed the bottleneck was some debugging code plus Moose. After removing the debugging code and enabling Moose inlined destructors,
AnyMongo's cursor is now almost on par with the official driver, with less than a 1% gap.