[Designs, Lessons and Advice from Building Large Distributed Systems](https://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf),Google 杰夫·迪恩(Jeff Dean)2009年一次演讲的PPT。2010年,斯坦福大学请杰夫·迪恩到大学里给他们讲了一节课,你可以在YouTube上看一下,[Building Software Systems At Google and Lessons Learned](https://www.youtube.com/watch?v=modXC5IWTJI) ,其回顾了Google发展的历史。
[Notes on Distributed Systems for Young Bloods](http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/) ,给准备进入分布式系统领域的人的一些忠告。
[On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/index.html)([中译版](http://darktea.github.io/notes/2014/07/23/On-Designing-and-Deploying-Internet-Scale-Services.html)),微软Windows Live服务平台的一些经验性的总结文章,很值得一读。
[4 Things to Keep in Mind When Building a Platform for the Enterprise](https://blog.box.com/blog/4-things-to-keep-in-mind-when-building-a-platform-for-the-enterprise/) ,Box平台VP海蒂·威廉姆斯(Heidi Williams)撰写的一篇文章,阐述了为企业构建平台时需要牢记的四件关于软件设计方面的事:1. Design Broadly, Build Narrowly; 2. Platforms Are Powerful and Flexible. Choose wisely what to expose when!;3. Build Incrementally, Get Feedback, and Iterate;4. Create a Platform-first Mentality。文章中有详细的解读,推荐看看。
[Principles of Chaos Engineering](https://www.usenix.org/conference/srecon17americas/program/presentation/rosenthal) ,我们知道,Netflix公司有一个叫Chaos Monkey的东西,这个东西会到分布式系统里“瞎搞”,以此来测试系统的健壮和稳定性。这个视频中,Netflix分享了一些软件架构的经验和原则,值得一看。
[Building Fast & Resilient Web Applications](https://www.igvita.com/2016/05/20/building-fast-and-resilient-web-applications/) ,伊利亚·格里高利克(Ilya Grigorik)在Google I/O 2016上的一次关于如何通过弹力设计来实现快速和可容错的网站架构的演讲,其中有好些经验分享。
[Design for Resiliency](http://highscalability.com/blog/2012/12/31/designing-for-resiliency-will-be-so-2013.html) ,这篇文章带我们全面认识“弹力(Resiliency)”,以及弹力对于系统的重要性,并详细阐述了如何设计和实现系统的弹力。
微软的Azure网站上有一系列的 [Design Principle](https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/) 的文章,你可以看看这几篇: [Design for Self-healing](https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/self-healing) 、[Design for Scaling Out](https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/scale-out) 和 [Design for Evolution](https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/design-for-evolution) 。
[Writing Code that Scales](https://www.codeproject.com/Articles/151520/Write-Scalable-Code) ,Rackspace的一篇很不错的博文,告诉我们一些很不错的写出高扩展和高性能代码的工程原则。
[Automate and Abstract: Lessons from Facebook on Engineering for Scale](https://architecht.io/lessons-from-facebook-on-engineering-for-scale-f5716f0afc7a) ,软件自动化和软件抽象,这是软件工程中最重要的两件事了。通过这篇文章,我们可以看到Facebook的关于这方面的一些经验教训。
[Automating Failure Testing Research at Internet Scale](https://people.ucsc.edu/~palvaro/socc16.pdf) ,Netflix公司伙同圣克鲁斯加利福尼亚大学和Gremlin游戏公司一同撰写的一篇论文。
## 弹性伸缩
[4 Architecture Issues When Scaling Web Applications: Bottlenecks, Database, CPU, IO](http://highscalability.com/blog/2014/5/12/4-architecture-issues-when-scaling-web-applications-bottlene.html) ,本文讲解了后端程序的主要性能指标,即响应时间和可伸缩性这两者如何能提高的解决方案,讨论了包括纵向和横向扩展,可伸缩架构、负载均衡、数据库的伸缩、CPU密集型和I/O密集型程序的考量等。
[Scaling Stateful Objects](http://ithare.com/scaling-stateful-objects/) ,这是一本叫《Development&Deployment of Multiplayer Online Games》书中一章内容的节选,讨论了有状态和无状态的节点如何伸缩的问题。虽然还没有写完,但是可以给你一些很不错的基本概念和想法。
[Scale Up vs Scale Out: Hidden Costs](https://blog.codinghorror.com/scaling-up-vs-scaling-out-hidden-costs/) ,Coding Horror上的一篇有趣的文章,详细分析了可伸缩性架构的不同扩展方案(横向扩展或纵向扩展)所带来的成本差异,帮助你更好地选择合理的扩展方案,可以看看。
[Best Practices for Scaling Out](https://www.infoq.cn/article/hiXg6WRDjvNE0VNuwzPg) ,OpenShift的一篇讨论Scale out最佳实践的文章。
[Reddit: Lessons Learned From Mistakes Made Scaling To 1 Billion Pageviews A Month](http://highscalability.com/blog/2013/8/26/reddit-lessons-learned-from-mistakes-made-scaling-to-1-billi.html) ,Reddit分享的一些关于系统扩展的经验教训。
下面是几篇关于自动化弹性伸缩的文章。
- [Autoscaling Pinterest](https://medium.com/@Pinterest_Engineering/auto-scaling-pinterest-df1d2beb4d64);
- [Square: Autoscaling Based on Request Queuing](https://medium.com/square-corner-blog/autoscaling-based-on-request-queuing-c4c0f57f860f);
- [PayPal: Autoscaling Applications](https://medium.com/paypal-engineering/autoscaling-applications-paypal-fb5bb9fdb821);
- [Trivago: Your Definite Guide For Autoscaling Jenkins](http://tech.trivago.com/2017/02/17/your-definite-guide-for-autoscaling-jenkins/);
- [Scryer: Netflix’s Predictive Auto Scaling Engine](https://medium.com/netflix-techblog/scryer-netflixs-predictive-auto-scaling-engine-a3f8fc922270)。
## 一致性哈希
[Distributing Content to Open Connect](https://medium.com/netflix-techblog/distributing-content-to-open-connect-3e3e391d4dc9) ,Netflix的一个对一致性哈希的实践,提出了Uniform Consistent Hashing,是挺有意思的一篇文章。
[Consistent Hashing in Cassandra](https://blog.imaginea.com/consistent-hashing-in-cassandra/) ,这是Cassandra中使用到的一致性哈希的相关设计。
[Why you don’t want to shard](https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/) ,这是Percona的一篇文章,其中表达了,不到万不得已不要做数据库分片。是的,最好还是先按业务来拆分,先把做成微服务的架构,然后把数据集变简单,然后再做Sharding会更好。
[How to Scale Big Data Applications](https://www.percona.com/sites/default/files/presentations/How%20to%20Scale%20Big%20Data%20Applications.pdf) ,这也是Percona给出的一篇关于怎样给大数据应用做架构扩展的文章。值得一读。
[MySQL Sharding with ProxySQL](https://www.percona.com/blog/2016/08/30/mysql-sharding-with-proxysql/) ,用ProxySQL来支撑MySQL数据分片的一篇实践文章。
[Design Of A Modern Cache](http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html) ,设计一个现代化的缓存系统需要注意到的东西。
[Netflix: Caching for a Global Netflix](https://medium.com/netflix-techblog/caching-for-a-global-netflix-7bcc457012f1) ,Netflix公司的全局缓存架构实践。
[Facebook: An analysis of Facebook photo caching](https://code.facebook.com/posts/220956754772273/an-analysis-of-facebook-photo-caching/) ,Facebook公司的图片缓存使用分析,这篇文章挺有意思的,用数据来调优不同的缓存大小和算法。
[How trivago Reduced Memcached Memory Usage by 50%](https://tech.trivago.com/2017/12/19/how-trivago-reduced-memcached-memory-usage-by-50/) ,Trivago公司一篇分享自己是如何把Memcached的内存使用率降了一半的实践性文章。很有意思,可以让你学到很多东西。
[Caching Internal Service Calls at Yelp](https://engineeringblog.yelp.com/2018/03/caching-internal-service-calls-at-yelp.html) ,Yelp公司的缓存系统架构。
## 消息队列
[Understanding When to use RabbitMQ or Apache Kafka](https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka) ,什么时候使用RabbitMQ,什么时候使用Kafka,通过这篇文章可以让你明白如何做技术决策。
[Trello: Why We Chose Kafka For The Trello Socket Architecture](https://tech.trello.com/why-we-chose-kafka/) ,Trello的Kafka架构分享。
[LinkedIn: Running Kafka At Scale](https://engineering.linkedin.com/kafka/running-kafka-scale) ,LinkedIn公司的Kafka架构扩展实践。
[Should You Put Several Event Types in the Same Kafka Topic?](https://www.confluent.io/blog/put-several-event-types-kafka-topic/) ,这个问题可能经常困扰你,这篇文章可以为你找到答案。
[Billions of Messages a Day - Yelp’s Real-time Data Pipeline](https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html) ,Yelp公司每天十亿级实时消息的架构。
[Uber: Building Reliable Reprocessing and Dead Letter Queues with Kafka](https://eng.uber.com/reliable-reprocessing/) ,Uber公司的Kafka应用。
[Uber: Introducing Chaperone: How Uber Engineering Audits Kafka End-to-End](https://eng.uber.com/chaperone/) ,Uber公司对Kafka消息的端到端审计。
[Publishing with Apache Kafka at The New York Times](https://open.nytimes.com/publishing-with-apache-kafka-at-the-new-york-times-7f0e3b7d2077) ,纽约时报的Kafka工程实践。
[Kafka Streams on Heroku](https://blog.heroku.com/kafka-streams-on-heroku) ,Heroku公司的Kafka Streams实践。
[Salesforce: How Apache Kafka Inspired Our Platform Events Architecture](https://engineering.salesforce.com/how-apache-kafka-inspired-our-platform-events-architecture-2f351fe4cf63) ,Salesforce的Kafka工程实践。
[Exactly-once Semantics are Possible: Here’s How Kafka Does it](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/) ,怎样用Kafka让只发送一次的语义变为可能。这是业界中一个很难的工程问题。
[Delivering billions of messages exactly once](https://segment.com/blog/exactly-once-delivery/) 同上,这也是一篇挑战消息只发送一次这个技术难题的文章。
[Benchmarking Streaming Computation Engines at Yahoo!](https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at)。Yahoo!的Storm团队在为他们的流式计算做技术选型时,发现市面上缺乏针对不同计算平台的性能基准测试。于是,他们研究并设计了一种方案来做基准测试,测试了Apache Flink、Apache Storm和Apache Spark这三种平台。文中给出了结论和具体的测试方案。(如果原文链接不可用,请尝试搜索引擎对该网页的快照。)
## 关于日志方面
[Using Logs to Build a Solid Data Infrastructure - Martin Kleppmann](https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/) ,设计基于log结构应用架构的一篇不错的文章。
[LogDevice: a distributed data store for logs](https://code.facebook.com/posts/357056558062811/logdevice-a-distributed-data-store-for-logs/) ,Facebook分布式日志系统方面的一些工程分享。
[Performance is a Feature](https://blog.codinghorror.com/performance-is-a-feature/) ,Coding Horror上的一篇让你关注性能的文章。
[Make Performance Part of Your Workflow](https://codeascraft.com/2014/12/11/make-performance-part-of-your-workflow/) ,这篇文章是图书《[Designing for Performance](http://shop.oreilly.com/product/0636920033578.do)》中的节选(国内没有卖的),其中给出来了一些和性能有关的设计上的平衡和美学。
[CloudFlare: How we built rate limiting capable of scaling to millions of domains](https://blog.cloudflare.com/counting-things-a-lot-of-different-things/),讲述了CloudFlare公司是怎样实现他们的限流功能的。从最简单的每客户IP限流开始分析,进一步讲到anycast,在这种情况下PoP的分布式限流是怎样实现的,并详细解释了具体的算法。
## 关于搜索方面
- [Instagram: Search Architecture](https://instagram-engineering.com/search-architecture-eeb34a936d3a)
- [eBay: The Architecture of eBay Search](http://www.cs.otago.ac.nz/homepages/andrew/papers/2017-8.pdf)
- [eBay: Improving Search Engine Efficiency by over 25%](https://www.ebayinc.com/stories/blogs/tech/making-e-commerce-search-faster/)
- [LinkedIn: Introducing LinkedIn’s new search architecture](https://engineering.linkedin.com/search/did-you-mean-galene)
- [LinkedIn: Search Federation Architecture at LinkedIn](https://engineering.linkedin.com/blog/2018/03/search-federation-architecture-at-linkedin)
- [Slack: Search at Slack](https://slack.engineering/search-at-slack-431f8c80619e)
- [DoorDash: Search and Recommendations at DoorDash](https://blog.doordash.com/powering-search-recommendations-at-doordash-8310c5cfd88c)
- [Twitter: Search Service at Twitter (2014)](https://blog.twitter.com/engineering/en_us/a/2014/building-a-complete-tweet-index.html)
- [Pinterest: Manas: High Performing Customized Search System](https://medium.com/@Pinterest_Engineering/manas-a-high-performing-customized-search-system-cf189f6ca40f)
- [Sherlock: Near Real Time Search Indexing at Flipkart](https://tech.flipkart.com/sherlock-near-real-time-search-indexing-95519783859d)
- [Airbnb: Nebula: Storage Platform to Build Search Backends](https://medium.com/airbnb-engineering/nebula-as-a-storage-platform-to-build-airbnbs-search-backends-ecc577b05f06)
## 各公司的架构实践
[High Scalability](http://highscalability.com/) ,这个网站会定期分享一些大规模系统架构是怎样构建的,下面是迄今为止各个公司的架构说明。
- [YouTube Architecture](http://highscalability.com/youtube-architecture)
- [Scaling Pinterest](http://highscalability.com/blog/2013/4/15/scaling-pinterest-from-0-to-10s-of-billions-of-page-views-a.html)
- [Google Architecture](http://highscalability.com/google-architecture)
- [Scaling Twitter](http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster)
- [The WhatsApp Architecture](http://highscalability.com/blog/2014/2/26/the-whatsapp-architecture-facebook-bought-for-19-billion.html)
- [Flickr Architecture](http://highscalability.com/flickr-architecture)
- [Amazon Architecture](http://highscalability.com/amazon-architecture)
- [Stack Overflow Architecture](http://highscalability.com/blog/2009/8/5/stack-overflow-architecture.html)
- [Pinterest Architecture](http://highscalability.com/blog/2012/5/21/pinterest-architecture-update-18-million-visitors-10x-growth.html)
- [Tumblr Architecture](http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html)
- [Instagram Architecture](http://highscalability.com/blog/2011/12/6/instagram-architecture-14-million-users-terabytes-of-photos.html)
- [TripAdvisor Architecture](http://highscalability.com/blog/2011/6/27/tripadvisor-architecture-40m-visitors-200m-dynamic-page-view.html)
- [Scaling Mailbox](http://highscalability.com/blog/2013/6/18/scaling-mailbox-from-0-to-one-million-users-in-6-weeks-and-1.html)
- [Salesforce Architecture ](http://highscalability.com/blog/2013/9/23/salesforce-architecture-how-they-handle-13-billion-transacti.html)
- [ESPN Architecture](http://highscalability.com/blog/2013/11/4/espns-architecture-at-scale-operating-at-100000-duh-nuh-nuhs.html)
- [Uber Architecture](http://highscalability.com/blog/2015/9/14/how-uber-scales-their-real-time-market-platform.html)
- [Dropbox Design](https://www.youtube.com/watch?v=PE4gwstWhmc)
- [Splunk Architecture](http://www.splunk.com/view/SP-CAAABF9)
# 小结
今天我们分享的内容是高手成长篇分布式架构部分的最后一篇——分布式架构工程设计,讲述了设计原则、设计模式等方面的内容,尤其整理和推荐了国内外知名企业的设计思路和工程实践,十分具有借鉴意义。
下篇文章中,我们将分享微服务架构方面的内容。敬请期待。
下面是《程序员练级攻略》系列文章的目录。
- [开篇词](https://time.geekbang.org/column/article/8136)