This is a guest post co-written with Tim Krause, Lead MLOps Architect at CONXAI.
CONXAI Technology GmbH is pioneering the development of an advanced AI platform for the Architecture, Engineering, and Construction (AEC) industry. The platform uses AI to empower construction domain experts to build complex use cases efficiently.
Construction sites typically employ multiple CCTV cameras, generating vast amounts of visual data. These camera feeds can be analyzed using AI to extract valuable insights. However, to comply with GDPR regulations, all individuals captured in the footage must be anonymized by masking or blurring their identities.
In this post, we dive deep into how CONXAI hosts the state-of-the-art OneFormer segmentation model on AWS using Amazon Simple Storage Service (Amazon S3), Amazon Elastic Kubernetes Service (Amazon EKS), KServe, and NVIDIA Triton.
Our AI solution is offered in two forms:
- Model as a service (MaaS) – Our AI model is accessible through an API, enabling seamless integration. Pricing is based on processing batches of 1,000 images, offering flexibility and scalability for users.
- Software as a service (SaaS) – This option provides a user-friendly dashboard, acting as a central control panel. Users can add and manage new cameras, view footage, perform analytical searches, and enforce GDPR compliance with automatic person anonymization.
Our AI model, fine-tuned with a proprietary dataset of over 50,000 self-labeled images from construction sites, achieves significantly greater accuracy compared to other MaaS solutions. With the ability to recognize more than 40 specialized object classes—such as cranes, excavators, and portable toilets—our AI solution is uniquely designed and optimized for the construction industry.
Our journey to AWS
CONXAI initially worked with a small cloud provider specializing in affordable GPUs. However, it lacked essential services required for machine learning (ML) applications, such as frontend and backend infrastructure, DNS, load balancers, scaling, blob storage, and managed databases. At that time, the application was deployed as a single monolithic container, which included Kafka and a database. This setup was neither scalable nor maintainable.
After migrating to AWS, we gained access to a robust ecosystem of services. Initially, we deployed the all-in-one AI container on a single Amazon Elastic Compute Cloud (Amazon EC2) instance. Although this provided a basic solution, it wasn’t scalable, necessitating the development of a new architecture.
Our choice of AWS was driven primarily by the team’s extensive experience with the platform. Additionally, the initial cloud credits provided by AWS were invaluable for us as a startup. We now use AWS managed services wherever possible, particularly for data-related tasks, to minimize maintenance overhead and pay only for the resources we actually use.
At the same time, we aimed to remain cloud-agnostic. To achieve this, we chose Kubernetes, enabling us to deploy our stack directly on a customer’s edge, such as on construction sites, when needed. Some customers have strict compliance requirements and don’t allow data to leave the construction site. Another opportunity is federated learning: training on the customer’s edge and transferring only model weights, without sensitive data, into the cloud. In the future, this approach might lead to having one model fine-tuned for each camera to achieve the best accuracy, which requires hardware resources on-site. For the time being, we use Amazon EKS to offload the management overhead to AWS, but we could easily deploy on a standard Kubernetes cluster if needed.
Our previous model was running on TorchServe. With our new model, we first tried performing inference in Python with Flask and PyTorch, as well as with BentoML. Achieving high inference throughput with high GPU utilization for cost-efficiency was very challenging. Exporting the model to ONNX format was particularly difficult because the OneFormer model lacks strong community support. It took us some time to identify why the OneFormer model was so slow in the ONNX Runtime with NVIDIA Triton. We ultimately resolved the issue by converting ONNX to TensorRT.
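As a rough illustration of that export path, the following sketch uses a placeholder PyTorch module, hypothetical file names, and NVIDIA’s trtexec tool (which ships with TensorRT) to build the engine; the real OneFormer export required additional model-specific workarounds not shown here.

```python
import subprocess

import torch

# Placeholder network standing in for the real OneFormer model; the actual
# export needs model-specific wrappers and tracing workarounds.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU())
model.eval()
dummy_input = torch.randn(1, 3, 1024, 1024)  # example input shape

# Step 1: export the PyTorch model to ONNX (file name is hypothetical).
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)

# Step 2: build a TensorRT engine from the ONNX graph with trtexec
# (ships with TensorRT); Triton loads TensorRT engines as model.plan.
subprocess.run(
    [
        "trtexec",
        "--onnx=model.onnx",
        "--saveEngine=model.plan",
        "--fp16",  # optional reduced precision for higher throughput
    ],
    check=True,
)
```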
Defining the final architecture, training the model, and optimizing costs took approximately 2–3 months. We continue to improve the model by incorporating increasingly accurate labeled data, a process that takes around 3–4 weeks of training on a single GPU. Deployment is fully automated with GitLab CI/CD pipelines, Terraform, and Helm, requiring less than an hour to complete without any downtime. New model versions are typically rolled out in shadow mode for 1–2 weeks to verify stability and accuracy before full deployment.
Solution overview
The following diagram illustrates the solution architecture.
The architecture consists of the following key components:
- The S3 bucket (1) is the most important data source. It is cost-effective, scalable, and provides almost unlimited blob storage. We encrypt the S3 bucket, and we delete all data with privacy concerns after processing is complete. Almost all microservices read and write files from and to Amazon S3, which ultimately triggers (2) Amazon EventBridge (3). The process begins when a customer uploads an image to Amazon S3 using a presigned URL provided by our API, which handles user authentication and authorization through Amazon Cognito (see the presigned URL sketch after this list).
- The S3 bucket is configured to forward (2) all events to EventBridge.
- TriggerMesh is a Kubernetes controller where we use AWSEventBridgeSource (6). It abstracts the infrastructure automation and automatically creates an Amazon Simple Queue Service (Amazon SQS) processing queue (5), which acts as a processing buffer. Additionally, it creates an EventBridge rule (4) to forward the S3 event from the event bus into the SQS processing queue. Finally, TriggerMesh creates a Kubernetes pod that polls events from the processing queue and feeds them into the Knative broker (7). The resources in the Kubernetes cluster are deployed in a private subnet.
- The central place for Knative Eventing is the Knative broker (7). It is backed by Amazon Managed Streaming for Apache Kafka (Amazon MSK) (8).
- The Knative trigger (9) polls the Knative broker based on a specific CloudEventType and forwards it accordingly to the KServe InferenceService (10).
- KServe is a standard model inference platform on Kubernetes that uses Knative Serving as its foundation and is fully compatible with Knative Eventing. It also pulls models from a model repository into the container before the model server starts, eliminating the need to build a new container image for each model version.
- We use KServe’s “Collocate transformer and predictor in same pod” feature to maximize inference speed and throughput, because containers within the same pod can communicate over localhost and the network traffic never leaves the node.
- After many performance tests, we achieved the best performance with the NVIDIA Triton Inference Server (11) after converting our model first to ONNX and then to TensorRT.
- Our transformer (12) uses Flask with Gunicorn and is optimized for the number of workers and CPU cores to maintain GPU utilization over 90%. The transformer receives a CloudEvent with the reference to the image’s Amazon S3 path, downloads the image, and performs model inference over HTTP (see the transformer sketch after this list). After getting back the model results, it performs postprocessing and uploads the processed results back to Amazon S3.
- We use Karpenter as the cluster autoscaler. Karpenter scales the inference component to handle high user request loads by launching new EC2 instances when the system experiences increased demand, allowing the system to automatically scale up computing resources to meet the increased workload.
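The presigned upload URL mentioned in the first list item could be generated along the following lines. This is a minimal sketch with a hypothetical bucket name, key layout, and expiry, and it omits the Amazon Cognito authentication and authorization that our API performs.

```python
import uuid

import boto3

s3 = boto3.client("s3")

def create_upload_url(customer_id: str) -> dict:
    """Return a presigned PUT URL that a client can use to upload one image."""
    # Hypothetical bucket name and key layout.
    key = f"uploads/{customer_id}/{uuid.uuid4()}.jpg"
    url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "example-inference-input", "Key": key},
        ExpiresIn=900,  # URL is valid for 15 minutes
    )
    return {"upload_url": url, "s3_key": key}

# Example usage (requires AWS credentials with access to the bucket):
# print(create_upload_url("customer-123"))
```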
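The transformer flow referenced above could look roughly like the following sketch. It assumes a hypothetical CloudEvent payload layout, model and tensor names, and output key scheme, and it leaves out batching, error handling, and the anonymization-specific postprocessing.

```python
import io

import boto3
import numpy as np
import tritonclient.http as triton_http
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)
s3 = boto3.client("s3")
# Triton runs in the same pod, so the transformer talks to it over localhost.
triton = triton_http.InferenceServerClient(url="localhost:8000")

@app.route("/", methods=["POST"])
def handle_cloudevent():
    # Knative delivers the S3 event as a CloudEvent; this payload layout is simplified.
    event = request.get_json()
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    # Download and preprocess the image (real resizing/normalization differs).
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    image = Image.open(io.BytesIO(body)).convert("RGB").resize((1024, 1024))
    batch = np.asarray(image, dtype=np.float32).transpose(2, 0, 1)[None]

    # Run inference on Triton over HTTP; model and tensor names are hypothetical.
    infer_input = triton_http.InferInput("input", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)
    result = triton.infer(model_name="oneformer", inputs=[infer_input])
    mask = result.as_numpy("output")

    # Postprocess and upload the result next to the source image (simplified).
    out_key = key.rsplit(".", 1)[0] + "_mask.npy"
    buffer = io.BytesIO()
    np.save(buffer, mask)
    s3.put_object(Bucket=bucket, Key=out_key, Body=buffer.getvalue())
    return jsonify({"result_key": out_key}), 200
```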
All of this divides our architecture into two main parts, the AWS managed data services and the Kubernetes cluster:
- The S3 bucket, EventBridge, the SQS queue, and Amazon MSK are all fully managed AWS services. This keeps our data management effort low.
- We use Amazon EKS for everything else. TriggerMesh, AWSEventBridgeSource, the Knative broker, the Knative trigger, KServe with our Python transformer, and the Triton Inference Server are all within the same EKS cluster on a dedicated EC2 instance with a GPU. Because our EKS cluster is used only for processing, it is fully stateless.
Summary
From initially having our own highly customized model, to transitioning to AWS, improving our architecture, and introducing our new OneFormer model, CONXAI is now proud to provide scalable, reliable, and secure ML inference to customers, enabling them to improve and accelerate work on their construction sites. We achieved a GPU utilization of over 90%, and the number of processing errors has dropped almost to zero in recent months. One of the major design choices was the separation of the model from the preprocessing and postprocessing code in the transformer. With this technology stack, we gained the ability to scale down to zero on Kubernetes using the Knative serverless feature, while our scale-up time from a cold state is just 5–10 minutes, which can save significant infrastructure costs for potential batch inference use cases.
The next important step is to use these model results with proper analytics and data science. They can also serve as a data source for generative AI features such as automated report generation. Furthermore, we want to label more diverse images and train the model on additional construction domain classes as part of a continuous improvement process. We also work closely with AWS specialists to bring our model to AWS Inferentia chips for better cost-efficiency.
About the Authors
Tim Krause is Lead MLOps Architect at CONXAI. He takes care of all activities where AI meets infrastructure. He joined the company with prior platform, Kubernetes, DevOps, and big data experience, and previously trained LLMs from scratch.
Mehdi Yosofie is a Solutions Architect at AWS, working with startup customers and leveraging his expertise to help them design their workloads on AWS.