TikTok is a video-sharing app that lets users create and share short videos. It impresses users with recommendations that are precisely personalized “for you”. It is highly addictive and very popular among young people. Behind the scenes, it is powered by artificial intelligence technologies. This post is an overview of TikTok's system design.
Table of Contents
- TikTok Architecture
- Big data frameworks
- Machine learning
- Microservices architecture
- Back-of-the-envelope calculation
1. TikTok Architecture
The architecture of the TikTok recommendation system includes three components: big data frameworks, machine learning, and microservices architecture.
1. Big data frameworks are the starting point of the system. They provide real-time data stream processing, data computing, and data storage.
2. Machine learning is the brain of the recommendation system. A range of machine learning and deep learning algorithms and techniques are applied to build models and generate recommendations to suit individual preferences.
3. Microservices architecture is the underlying infrastructure that lets the whole system serve quickly and efficiently.
2. Big data frameworks
No data, no intelligence.
Most of the data comes from users’ smartphones, including the operating system, installed apps, and so on. More importantly, TikTok pays special attention to users’ activity logs, such as watch time, swipes, likes, shares, and comments.
The log data are collected and aggregated through Flume and Scribe, then piped into Kafka queues. Apache Storm then processes the data streams in real time, alongside other components in the Apache Hadoop ecosystem.
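The streaming step can be pictured with a small, self-contained sketch. It is not TikTok's pipeline: a plain Python list stands in for the Kafka queue, and the event fields, the sample values, and the 60-second tumbling window are all assumptions made for illustration. It shows the kind of per-user aggregation a stream processor might compute over activity logs.

```python
from collections import defaultdict

# Hypothetical activity-log events:
# (timestamp_seconds, user_id, video_id, watch_seconds).
EVENTS = [
    (0, "u1", "v1", 12.0),
    (5, "u2", "v1", 3.0),
    (42, "u1", "v2", 30.0),
    (61, "u1", "v3", 8.0),
    (75, "u2", "v2", 20.0),
]


def tumbling_watch_time(events, window_seconds=60):
    """Sum watch time per (window_start, user_id) over a stream of events."""
    totals = defaultdict(float)
    for ts, user_id, _video_id, watch_seconds in events:
        # Assign each event to the fixed-size window that contains it.
        window_start = (ts // window_seconds) * window_seconds
        totals[(window_start, user_id)] += watch_seconds
    return dict(totals)


if __name__ == "__main__":
    for (window_start, user_id), total in sorted(tumbling_watch_time(EVENTS).items()):
        print(f"window={window_start:>3}s user={user_id} watch_time={total}s")
```

In a real deployment the loop body would run inside a stream processor consuming from the queue, and the totals would be emitted when each window closes rather than at the end.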
The Apache Hadoop ecosystem is a distributed system for data processing and storage. It includes MapReduce, a first-generation distributed data processing system that processes data in parallel as batch jobs. YARN is a framework for job scheduling and cluster resource management. HDFS is a distributed file system. HBase is a scalable, distributed database that supports structured data storage for large tables. Hive is a data warehouse infrastructure that provides data summarization and querying. ZooKeeper is a high-performance coordination service.
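To make the MapReduce model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce). It does not use Hadoop, and the record layout `(user_id, video_id, action)` is a hypothetical example; on a real cluster the same phases would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (video_id, 1) pair for every 'like' record."""
    for _user_id, video_id, action in records:
        if action == "like":
            yield video_id, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_likes(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical batch of activity-log records.
    batch = [
        ("u1", "v1", "like"),
        ("u2", "v1", "like"),
        ("u1", "v2", "share"),
        ("u3", "v2", "like"),
    ]
    print(count_likes(batch))
```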
As data volumes grow fast, real-time data processing frameworks come into the picture. Apache Spark is the 3rd generation framework that helps with near real-time distributed processing for big data workloads. Spark improves on the performance of MapReduce by doing the processing in memory. In the last couple of years, TikTok has adopted Flink, a 4th generation framework designed to do real-time stream processing natively.
The database systems include MySQL, MongoDB, and many others.
3. Machine learning
This is at the heart of how TikTok earned its reputation for a “hyper-personalized, addictive algorithm”.
After vast datasets pour in, the next steps are content analysis, user profiling, and context analysis. Neural-network deep learning frameworks such as TensorFlow are used to perform computer vision and natural language processing (NLP). Computer vision deciphers the images in photos and videos. NLP includes classification, labeling, and evaluation.
Some classic machine learning algorithms are used, including logistic regression (LR), convolutional neural networks (CNN), recurrent neural networks (RNN), and gradient-boosting decision trees (GBDT). Common recommendation approaches are also applied, such as content-based filtering (CBF), collaborative filtering (CF), and the more advanced matrix factorization (MF).
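As an illustration of matrix factorization, the sketch below learns low-dimensional user and video vectors with stochastic gradient descent so that their dot product approximates an observed engagement score. This is a generic textbook formulation, not TikTok's model; the function names, hyperparameters, and toy scores are assumptions.

```python
import random


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.05, reg=0.02, epochs=500, seed=0):
    """Learn user factors P and item factors Q from (user, item, score) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, score in ratings:
            err = score - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted score is the dot product of user and item factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def mse(P, Q, ratings):
    return sum((s - predict(P, Q, u, i)) ** 2 for u, i, s in ratings) / len(ratings)


if __name__ == "__main__":
    # Hypothetical engagement scores in [0, 1]: (user index, video index, score).
    observed = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0),
                (1, 2, 0.2), (2, 1, 0.9), (2, 2, 0.8)]
    P, Q = factorize(observed, n_users=3, n_items=3)
    print("training MSE:", mse(P, Q, observed))
    print("unseen pair (user 1, video 1):", predict(P, Q, 1, 1))
```

The last line scores a user and video pair that never appeared in the training data, which is the point of the technique: the learned factors generalize to unseen combinations.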
The secret weapons that TikTok uses to read your mind are:
1. Algorithm experimentation platform: the engineers experiment with mixing multiple machine learning algorithms, such as LR and deep neural networks (DNN). They then run A/B tests and make adjustments.
2. Extensive classification and labeling: The models are based on users’ engagement, such as watch time and swipes, in addition to the commonly used likes and shares (what you do, as a reflection of your subconscious, says more about you than what you say in public). The number of user features, vectors, and categories is greater than in most recommendation systems in the world. And they keep adding more.
3. User feedback engine: It updates the models after retrieving feedback from the users in multiple iterations. The experience management platform is built on this engine and ultimately improves the predictions and recommendations.
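The feedback loop in the list above can be sketched as an online learner that takes one gradient step per feedback event. The example uses logistic regression because the post names it; the class, the two features, and the label definition are hypothetical, and a production system would be far more elaborate.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineLogisticModel:
    """Logistic regression updated one feedback event at a time (hypothetical)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Estimated probability that the user engages with the video."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def update(self, features, engaged):
        """Single SGD step on log loss; engaged is 1 (watched) or 0 (swiped away)."""
        error = self.predict(features) - engaged
        for j, x in enumerate(features):
            self.weights[j] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


if __name__ == "__main__":
    # Assumed features: [video is in a favorite category, video is very long].
    model = OnlineLogisticModel(n_features=2)
    for _ in range(200):
        model.update([1.0, 0.0], engaged=1)  # watched to the end
        model.update([0.0, 1.0], engaged=0)  # swiped away quickly
    print(model.predict([1.0, 0.0]), model.predict([0.0, 1.0]))
```

Because each update touches only the current event, the model can follow behavior such as watch time and swipes as it arrives instead of waiting for a full retraining run.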
To solve the cold-start problem in recommendation, a recall strategy is used. It selects, from tens of millions of videos, thousands of candidates that have proven to be popular and of high quality.
Meanwhile, some of the AI work has been moved to the client side for super-fast response. That includes real-time training, modeling, and inference with smaller models, done on the device. Machine learning frameworks such as TensorFlow Lite or ByteNN are used on the client side.
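A recall step of this kind can be sketched as a filter followed by a top-k selection. The `popularity` and `quality` fields, the quality threshold, and the sample values are assumptions for illustration; the post does not say how TikTok scores candidates.

```python
import heapq


def recall_candidates(videos, k, min_quality=0.5):
    """Return the k most popular videos among those that pass a quality bar."""
    eligible = (v for v in videos if v["quality"] >= min_quality)
    # nlargest avoids sorting the whole pool when k is much smaller than it.
    return heapq.nlargest(k, eligible, key=lambda v: v["popularity"])


if __name__ == "__main__":
    # Hypothetical video pool with precomputed scores in [0, 1].
    pool = [
        {"id": "a", "popularity": 0.90, "quality": 0.8},
        {"id": "b", "popularity": 0.95, "quality": 0.3},
        {"id": "c", "popularity": 0.60, "quality": 0.9},
        {"id": "d", "popularity": 0.70, "quality": 0.7},
    ]
    print([v["id"] for v in recall_candidates(pool, k=2)])
```

For a new user with no history, candidates chosen this way give the ranking models something reasonable to show until personal signals accumulate.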
4. Microservices architecture
TikTok has embraced cloud-native infrastructure. The recommendation components, such as user profiling, predictions, cold-start, recall, and the user feedback engine, are served as APIs. The services are hosted on clouds such as Amazon AWS and Microsoft Azure. The output of the system, the curated videos, is pushed to users through the cloud.
TikTok employs Kubernetes-based containerization technology. Kubernetes is known as a container orchestrator. It is the toolset to automate the application’s life cycle. Kubeflow is dedicated to deploying machine learning workflows on Kubernetes.
As part of the cloud-native stack, the service mesh is another tool to handle service-to-service communication. It controls how different parts of an application share data with one another. It inserts features or services at the platform layer, rather than the application layer.
Due to the requirement of high concurrency, the services are built with the Go language and gRPC. In TikTok, Go has become the dominant language in service development because of its good built-in network and concurrency support. gRPC is a Remote Procedure Call framework to build and connect services efficiently.
TikTok's success comes from going the extra mile to provide the best user experience. They build in-house tools to maximize performance at a low level (system level). For example, ByteMesh is an improved version of Service Mesh, Kitex is a high-performance Golang RPC framework, and Sonic is an enhanced Golang JSON library. Other in-house tools or systems include parameter servers, ByteNN, and abuse to name a few.
As Xiang Liang, a TikTok machine learning principal, put it, sometimes the infrastructure beneath is more important than the (machine learning) algorithms above.
5. Back-of-the-envelope calculation (2022)
# of daily active users (US) | 50 million |
# of monthly active users (worldwide / US) | 1.2 billion / 138 million |
# of videos watched per minute | 167 million |
Average time spent daily | 52 minutes |
Feedback update time | Within 10 minutes |
Daily training data volume | 300 TB ~ 1 PB |
*The above are not official statistics. The numbers may go up and down as we speak.
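The figures above can be turned into rates with a few lines of arithmetic. The sketch below only rescales the unofficial numbers from the table, assuming decimal units (1 PB = 1000 TB, 1 TB = 1000 GB) and traffic spread evenly over the day. Under those assumptions, 167 million videos per minute is about 2.8 million videos per second, and 300 TB to 1 PB of daily training data is a sustained ingest rate of roughly 3.5 to 11.6 GB per second.

```python
SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 86_400

# Unofficial figures taken from the table above.
VIDEOS_WATCHED_PER_MINUTE = 167_000_000
DAILY_TRAINING_DATA_TB = (300, 1000)  # 1 PB taken as 1000 TB


def videos_per_second(per_minute):
    """Average videos watched per second, assuming an even spread."""
    return per_minute / SECONDS_PER_MINUTE


def ingest_rate_gb_per_second(tb_per_day):
    """Average ingest rate in GB/s for a daily volume given in TB."""
    return tb_per_day * 1000 / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"videos/second: {videos_per_second(VIDEOS_WATCHED_PER_MINUTE):,.0f}")
    low, high = DAILY_TRAINING_DATA_TB
    print(f"training data ingest: {ingest_rate_gb_per_second(low):.1f} "
          f"to {ingest_rate_gb_per_second(high):.1f} GB/s")
```

These are averages; real traffic peaks at certain hours, so capacity planning would need headroom above these numbers.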