SEC Filings App With Streamlit, Docker, and Google Cloud Run

Ndackyssa Oyima-Antseleve

6/10/24

Introduction

  • Welcome to this tutorial on building an application that answers templated questions about the content of SEC filings.
  • The app combines Retrieval-Augmented Generation (RAG), with context retrieved from Kay.ai, and large language models from OpenAI.

Objective

In this tutorial, you will learn how to:

  • Use the RAG (Retrieval-Augmented Generation) framework for context retrieval and answer generation.

  • Integrate LLMs (Large Language Models) using the OpenAI API to generate human-like responses.

  • Build interactive web applications with Streamlit.

  • Containerize applications using Docker.

  • Deploy applications on Google Cloud Run.

How It Works

Process Diagram

1. Insert Question Template

Users add a question template.

2. Retrieve Context with RAG

The app retrieves relevant context from Kay.ai.

3. Generate Response with OpenAI

The context is passed to the OpenAI API for generating the answer.

4. Deliver Answer

The answer is presented to the user.
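
Putting the four steps together, here is a minimal sketch of the flow using the same LangChain components covered later in this tutorial (KayAiRetriever, ChatOpenAI, ConversationalRetrievalChain). It assumes KAY_API_KEY and OPENAI_API_KEY are already set in the environment; the question template and its values are only examples.

from langchain.retrievers import KayAiRetriever
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI

# 1. Insert question template
question = (
    "Does the {company}, cik:{cik}, currently operate a "
    "supply chain finance program (SCF program)?"
).format(company="COCA COLA CO", cik="0000021344")

# 2. Retrieve context with RAG (Kay.ai's SEC filings index)
retriever = KayAiRetriever.create(
    dataset_id="company", data_types=["10-K"], num_contexts=6
)

# 3. Generate response with OpenAI, grounded in the retrieved context
model = ChatOpenAI(model_name="gpt-3.5-turbo")
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
result = qa({"question": question, "chat_history": []})

# 4. Deliver answer
print(result["answer"])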

What is Retrieval-Augmented Generation (RAG)?

  • Retrieval-Augmented Generation (RAG) is a powerful technique that combines the strengths of retrieval-based methods and generation-based methods to enhance the performance of large language models (LLMs).
  • RAG addresses several limitations of LLMs, such as hallucinations, outdated knowledge, and untraceable reasoning processes, by integrating external knowledge bases into the generation process.

How RAG Works

RAG operates through three main stages:

  1. Retrieval: Relevant document chunks are retrieved from an external knowledge base using semantic similarity calculations.

  2. Generation: The retrieved document chunks, combined with the user’s query, form a comprehensive prompt for the LLM to generate an accurate and contextually relevant answer.

  3. Augmentation: The generation process is enhanced by continuously updating and integrating domain-specific information from the knowledge base.

Key Components of RAG

  • Retrieval:
    • Indexing: Documents are split into smaller chunks, encoded into vectors, and stored in a vector database.
    • Query Encoding: The user’s query is transformed into a vector representation.
    • Similarity Matching: The system computes similarity scores between the query vector and document vectors, retrieving the top K most relevant chunks.
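
As a toy illustration of the retrieval components above, the snippet below scores made-up three-dimensional chunk vectors against a query vector with cosine similarity and keeps the top K. A real system uses a learned embedding model and a vector database; the numbers here are invented purely to show the mechanics.

import numpy as np

# Made-up embeddings for three document chunks (illustration only)
chunk_vectors = np.array([
    [0.9, 0.1, 0.0],   # chunk 0: supply chain finance discussion
    [0.1, 0.8, 0.1],   # chunk 1: marketing spend discussion
    [0.7, 0.2, 0.1],   # chunk 2: supplier payment terms
])

# Query encoding: the user's question as a vector (also made up)
query_vector = np.array([0.8, 0.1, 0.1])

# Similarity matching: cosine similarity between the query and each chunk
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)

# Keep the top K most relevant chunks
top_k = 2
top_chunks = np.argsort(scores)[::-1][:top_k]
print(top_chunks)  # indices of the most relevant chunks, e.g. [0 2]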

Key Components of RAG

  • Generation:
    • Contextual Prompting: The retrieved document chunks and the user’s query are combined into a prompt for the LLM

    • Answer Generation: The LLM generates a response based on the combined prompt.
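
To make contextual prompting concrete, here is a sketch of how retrieved chunks and the user's query can be combined into a single prompt. The prompt wording is an assumption for illustration; in this tutorial, ConversationalRetrievalChain performs this assembly for you.

# Shortened, illustrative excerpts standing in for retrieved chunks
retrieved_chunks = [
    "Two global financial institutions offer a voluntary supply chain finance (SCF) program...",
    "Our current payment terms with the majority of our suppliers are 120 days...",
]
query = "Does the company currently operate a supply chain finance program?"

# Contextual prompting: concatenate the context and the question into one prompt
context = "\n\n".join(retrieved_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
    "Answer:"
)

# Answer generation: the prompt is then sent to the LLM,
# e.g. ChatOpenAI(model_name="gpt-3.5-turbo").invoke(prompt)
print(prompt)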

Key Components of RAG

  • Augmentation:
    • Knowledge Integration: Continuous updates and domain-specific knowledge are integrated into the generation process, enhancing the accuracy and relevance of the responses.
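
A rough sketch of what knowledge integration can look like in practice: when a new filing is published, its chunks are embedded and appended to the index so that later retrievals can draw on them. The embed() helper and in-memory index below are hypothetical stand-ins for a real embedding model and vector database (Kay.ai manages this for you).

# Hypothetical stand-ins for a real embedding model and vector database
def embed(text: str) -> list[float]:
    # A real system would call an embedding model here
    return [float(len(text)), 0.0, 0.0]

index: list[tuple[list[float], str]] = []  # (vector, chunk text) pairs

def add_document(chunks: list[str]) -> None:
    # Embed new chunks and append them so future retrievals can use them
    for chunk in chunks:
        index.append((embed(chunk), chunk))

# When a new 10-Q is filed, integrate it into the knowledge base
add_document(["Excerpt from a newly published 10-Q discussing the SCF program..."])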

Advantages of RAG

  • Enhanced Accuracy: By incorporating external knowledge, RAG reduces the likelihood of generating factually incorrect content.

  • Continuous Updates: RAG allows for continuous knowledge updates, ensuring the model stays current with new information.

  • Domain-Specific Information: RAG enables the integration of domain-specific information, making it suitable for specialized tasks.

RAG Workflow

Source: Gao et al., 2024.

Kay.ai

Quick Use of Kay.ai

# Import necessary modules
from getpass import getpass
import os
from langchain.retrievers import KayAiRetriever
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI

# Prompt user for KAY_API_KEY and set it as an environment variable
KAY_API_KEY = getpass()  # Securely prompt for KAY_API_KEY
os.environ["KAY_API_KEY"] = KAY_API_KEY  # Set environment variable for KAY_API_KEY

# Set up the retriever using the KAY_API_KEY environment variable
retriever = KayAiRetriever.create(
    dataset_id="company", data_types=["10-Q", "10-K","8-K"], num_contexts=10
)

Quick Use of Kay.ai

# Retrieve relevant documents
docs = retriever.get_relevant_documents(
    "Does the coca cola co, cik:0000021344, currently operate a supply chain finance program (SCF program)? (Yes/No)"
    # "Does the apple inc currently operate a supply chain finance program (SCF program) in 2023? (Yes/No)"
)


docs[0]
Out[3]: Document(page_content='Company Name: COCA COLA CO \n Company Industry: BEVERAGES \n Form Title: 10-K 2021-FY \n Form Section: Risk Factors \n Text: Based on all of the aforementioned factors, the Company believes its current liquidity position is strong and will continue to be sufficient to fund our operating activities and cash commitments for investing and financing activities for the foreseeable future.Cash Flows from Operating Activities As part of our continued efforts to improve our working capital efficiency, we have worked with our suppliers over the past several years to revisit terms and conditions, including the extension of payment terms.Our current payment terms with the majority of our suppliers are 120 days.Additionally, two global financial institutions offer a voluntary supply chain finance ("SCF") program which enables our suppliers, at their sole discretion, to sell their receivables from the Company to these 51 financial institutions on a non recourse basis at a rate that leverages our credit rating and thus may be more beneficial to them.The SCF program is available to suppliers of goods and services included in cost of goods sold as well as suppliers of goods and services included in selling, general and administrative expenses in our consolidated statement of income.The Company and our suppliers agree on the contractual terms for the goods and services we procure, including prices, quantities and payment terms, regardless of whether the supplier elects to participate in the SCF program.', metadata={'chunk_type': 'text', 'chunk_years_mentioned': [], 'company_name': 'COCA COLA CO', 'company_sic_code_description': 'BEVERAGES', 'data_source': '10-K', 'data_source_link': 'https://www.sec.gov/Archives/edgar/data/21344/000002134422000009', 'data_source_publish_date': '2022-02-22T00:00:00+00:00', 'data_source_uid': '0000021344-22-000009', 'title': 'COCA COLA CO |  10-K 2021-FY '})

Quick Use of OpenAI

# Prompt user for OPENAI_API_KEY and set it as an environment variable
OPENAI_API_KEY = getpass()  # Securely prompt for OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY  # Set environment variable for OPENAI_API_KEY

# Set up the conversational retrieval chain using the OpenAI API key
model = ChatOpenAI(model_name="gpt-3.5-turbo")
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)


# Expanded list of questions with explicit binary answer instruction where applicable
questions = [
    "Answer with 'Yes' or 'No' or 'Uncertain': Does the company currently operate a supply chain finance program (SCF program)?",
    "Answer with 'Yes' or 'No' or 'Uncertain': Does the implementation of the SCF program positively impact the company's liquidity?",
]

Quick Use of OpenAI

# Initialize chat history
chat_history = []

# Iterate through questions and get answers
for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
-> **Question**: Answer with 'Yes' or 'No' or 'Uncertain': Does the company currently operate a supply chain finance program (SCF program)? 

**Answer**: Yes 

-> **Question**: Answer with 'Yes' or 'No' or 'Uncertain': Does the implementation of the SCF program positively impact the company's liquidity? 

**Answer**: The implementation of the SCF program positively impacts the company's liquidity by providing additional funding to eligible depository institutions, including the company. This helps ensure that the bank has the ability to meet the needs of all their depositors. The Program offers advances of up to one year in length to banks, savings associations, credit unions, and other eligible depository institutions pledging collateral eligible for purchase by the Federal Reserve Bank, such as U.S. Treasuries, U.S. agency securities, and U.S. agency mortgage-backed securities. This additional funding source can enhance the bank's liquidity position and ability to meet operational needs. 

Full Streamlit Application Code

What is Streamlit?

Streamlit is an open-source Python library that makes it easy to create and share custom web apps for machine learning and data science. With Streamlit, you can turn data scripts into shareable web applications in minutes. Key features include:

  1. Ease of Use
  2. Interactive Widgets
  3. Beautiful Visualizations
  4. Live Updates
  5. Deployment

Streamlit Application

Project setup

  1. Create a folder named “my_app”.

  2. Inside that folder, create a Python file called “app.py”.

Folder Structure

my_app/

├── app.py                # Main Streamlit application file
├── README.md             # Project documentation

Streamlit Application

import streamlit as st
import pandas as pd

# Title of the app
st.title("Simple Streamlit App")

# Display a text input widget
name = st.text_input("Enter your name:")

# Display a button and capture its state
if st.button("Submit"):
    st.write(f"Hello, {name}!")

# Display a data frame
data = pd.DataFrame({
    'Column 1': [1, 2, 3, 4],
    'Column 2': [10, 20, 30, 40]
})
st.write(data)

Run Streamlit Application

  • Navigate to the directory containing the app.py file.
     cd "path_folder_my_app" 
  • Run the Streamlit app with the following command:
     streamlit run app.py

Full Streamlit Application Code

Construct The App

  1. Import Necessary Libraries:
    import streamlit as st
    import os
    from langchain.retrievers import KayAiRetriever
    from langchain.chains import ConversationalRetrievalChain
    from langchain_openai import ChatOpenAI

Construct The App

  2. Streamlit App Title:
st.title("SEC Filings Query Interface")
  • Sets the title of the Streamlit app.
  3. Input Fields for API Keys:
   KAY_API_KEY = st.text_input("Enter KAY API Key:", type="password")
   OPENAI_API_KEY = st.text_input("Enter OPENAI API Key:", type="password")
  • Input fields for the user to enter API keys for Kay.ai and OpenAI.

Construct The App

  4. Set Environment Variables for API Keys:
   if KAY_API_KEY and OPENAI_API_KEY:
       os.environ["KAY_API_KEY"] = KAY_API_KEY
       os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
  5. Create the Retriever, Model, and QA Chain:
    retriever = KayAiRetriever.create(dataset_id="company", data_types=["10-Q", "10-K", "8-K"], num_contexts=10)
    model = ChatOpenAI(model_name="gpt-4-turbo-preview")
    qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
  6. Input Fields for Company Name, CIK, and Keywords:
    company_name = st.text_input("Enter Company Name:")
    cik = st.text_input("Enter CIK (Central Index Key):")
    keywords = st.text_input("Enter Keywords (comma-separated):")

Construct The App

  7. Button to Trigger the Retrieval and Query:
    if st.button("Retrieve and Ask"):
        if company_name and cik and keywords:
            dynamic_questions = [
                f"Does the {company_name}, cik:{cik}, currently operate a supply chain finance program (SCF program)?",
            ]
            
            for question in dynamic_questions:
                docs = retriever.get_relevant_documents(question)
                result = qa({"question": question, "chat_history": []})
                st.write(f"-> **Question**: {question}")
                st.write(f"**Answer**: {result['answer']}")
                st.markdown("-------")
                
                for doc in docs[0:5]:
                    page_content = doc.page_content
                    highlighted_content = page_content
                    for kw in [k.strip() for k in keywords.split(",") if k.strip()]:
                        highlighted_content = highlighted_content.replace(kw, f'<mark style="background-color: #FFFF00">**{kw}**</mark>')
                    st.markdown(highlighted_content, unsafe_allow_html=True)
                    metadata = doc.metadata
                    st.write(f"Data Source: {metadata.get('data_source')} - Publish Date: {metadata.get('data_source_publish_date')}")
        else:
            st.error("Please enter all required fields.")

Full Streamlit Application Code with Docker

Folder structure

project_root/

├── app.py                # Main Streamlit application file
├── Dockerfile            # Dockerfile to containerize the app
├── docker-compose.yml    # Docker Compose file for multi-container applications
├── requirements.txt      # Python dependencies
├── README.md             # Project documentation (optional)
└── .streamlit/           # Streamlit configuration file (optional)
    └── config.toml

Dockerfile

# Use the official Python image from the Docker Hub
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements.txt file into the container
COPY requirements.txt requirements.txt

# Install the Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container
COPY . .

# Set environment variable to disable Streamlit's file watcher
ENV STREAMLIT_SERVER_FILE_WATCHER_TYPE=none

# Expose the port that Streamlit will run on
EXPOSE 8501

# Run the Streamlit app
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

requirements.txt

requirements.txt

streamlit
langchain
langchain_openai
langchain-community
kay
  • Lists the dependencies needed for the application.

docker-compose.yml

docker-compose.yml

version: '3.8'

services:
  streamlit-app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8501:8501"
    environment:
      - KAY_API_KEY=${KAY_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - STREAMLIT_SERVER_FILE_WATCHER_TYPE=none
    volumes:
      - .:/app
    restart: always

Run the Docker Container

  1. Navigate to the Project Directory:
   cd path/to/project_root
  2. Build the Docker Image:
   docker-compose build
  3. Run the Docker Container:
   docker-compose up
  4. Access the Streamlit App:

    • Open your web browser and go to http://localhost:8501.

Instructions to Push Docker Container to Google Cloud Using Artifact Registry

  1. Create Repo

  2. Build the Image

  3. Authorize Docker

  4. Tag Image

  5. Push Image

Instructions to Push Docker Container to Google Cloud

  1. Create a Repository in Artifact Registry:
    • Enable the Artifact Registry API.
    • Create a repository in your chosen region (e.g., us-central1) and note the repository path
    us-central1-docker.pkg.dev/fintechclass/fintechdemo
  2. Build the Docker Image:
    • Navigate to the directory with your Dockerfile.
    • Build the image:
       docker build -t streamlit-app .
  3. Verify and list all Images:
docker image ls

Instructions to Push Docker Container to Google Cloud

  4. Authenticate Docker with Artifact Registry:
      gcloud auth configure-docker us-central1-docker.pkg.dev
WARNING: Your config file at [/Users/ndackyssa/.docker/config.json] contains these credential helper entries:

 {
  "credHelpers": {
    "asia.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "europe-west1-docker.pkg.dev": "gcloud",
    "gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "us-central1-docker.pkg.dev": "gcloud"
  }
}
Adding credentials for: us-central1-docker.pkg.dev/fintechclass/fintechdemo
gcloud credential helpers already registered correctly.

Instructions to Push Docker Container to Google Cloud

  5. Tag the Docker Image:
    • Tag the image with the repository path:
docker tag streamlit-app us-central1-docker.pkg.dev/fintechclass/fintechdemo/streamlit-app

docker image ls
      
REPOSITORY                                                          TAG       IMAGE ID       CREATED          SIZE
us-central1-docker.pkg.dev/fintechclass/fintechdemo/streamlit-app   latest    6ef16c1f7fb2   27 minutes ago   605MB
streamlit-app                                                       latest    6ef16c1f7fb2   27 minutes ago   605MB

Instructions to Push Docker Container to Google Cloud

  6. Push the Docker Image to Artifact Registry:
    • Push the image:
      docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/REPO_NAME/streamlit-app
      
      Using default tag: latest
The push refers to repository [us-central1-docker.pkg.dev/fintechclass/fintechdemo/streamlit-app]
eb810b7dbcc4: Pushed
88c20489b4b3: Pushing [========================================>          ]    371MB/452.7MB
65c6c5030a15: Pushed
e3118f7a7cf8: Pushed
88e353c378e1: Pushed
f9e773ed6a29: Pushed
16a19a1fe64b: Pushed
e8a6046370e7: Pushed
2bd1a2222589: Pushed
  7. Deploy to Cloud Run:
    • In Google Cloud Console, navigate to Cloud Run and create a new service.
    • Select the image from Artifact Registry and deploy it, allowing unauthenticated access.

For Macs with Apple Silicon (ARM) Chips

# Build the Docker image for AMD64 platform
docker buildx build --platform linux/amd64 -t streamlit-app .

# Tag the Docker image
docker tag streamlit-app us-central1-docker.pkg.dev/YOUR_PROJECT_ID/REPO_NAME/streamlit-app

# Push the Docker image to Artifact Registry
docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/REPO_NAME/streamlit-app

# Deploy the container on Cloud Run
gcloud run deploy streamlit-app \
  --image us-central1-docker.pkg.dev/YOUR_PROJECT_ID/REPO_NAME/streamlit-app \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated

Summary Commands

For example, if your project ID is my-project and your region is us-central1:

# Build the Docker image
docker build -t streamlit-app .

# Authenticate Docker with Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev

# Tag the Docker image
docker tag streamlit-app us-central1-docker.pkg.dev/my-project/my-repo/streamlit-app

# Push the Docker image to Artifact Registry
docker push us-central1-docker.pkg.dev/my-project/my-repo/streamlit-app

# Deploy the container on Cloud Run
gcloud run deploy streamlit-app \
  --image us-central1-docker.pkg.dev/my-project/my-repo/streamlit-app \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated

Set Up Google Cloud Environment

  1. Install Google Cloud SDK:

    • Follow the instructions to install the Google Cloud SDK: https://cloud.google.com/sdk/docs/install
    • After installation, initialize the SDK:
      gcloud init

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
core:
  account: donothack@gmail.com
  disable_usage_reporting: 'True'
  project: fintechclass

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings
 [2] Create a new configuration
Please enter your numeric choice:  1

Your current configuration has been set to: [default]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Checking network connection...done.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

Choose the account you would like to use to perform operations for this configuration:
 [1] donothack@gmail.com
 [2] Log in with a new account
Please enter your numeric choice:  1

You are logged in as: [donothack@gmail.com].

Pick cloud project to use:
 [1] addr******
 [2] fintechclass

Please enter numeric choice or text value (must exactly match list item):  2

Your current project has been set to: [fintechclass].

Not setting default zone/region (this feature makes it easier to use
[gcloud compute] by setting an appropriate default value for the
--zone and --region flag).
See https://cloud.google.com/compute/docs/gcloud-compute section on how to set
default compute region and zone manually. If you would like [gcloud init] to be
able to do this for you the next time you run it, make sure the
Compute Engine API is enabled for your project on the
https://console.developers.google.com/apis page.

Your Google Cloud SDK is configured and ready to use!

* Commands that require authentication will use donothack@gmail.com by default
* Commands will reference project `fintechclass` by default
Run `gcloud help config` to learn how to change individual settings

This gcloud configuration is called [default]. You can create additional configurations if you work with multiple accounts and/or projects.
Run `gcloud topic configurations` to learn more.

Some things to try next:

* Run `gcloud --help` to see the Cloud Platform services you can interact with. And run `gcloud help COMMAND` to get help on any gcloud command.
* Run `gcloud topic --help` to learn about advanced features of the SDK like arg files and output formatting
* Run `gcloud cheat-sheet` to see a roster of go-to `gcloud` commands.


Updates are available for some Google Cloud CLI components.  To install them,
please run:
  $ gcloud components update



To take a quick anonymous survey, run:
  $ gcloud survey

Set Up Google Cloud Environment

  2. Authenticate with Google Cloud:
   gcloud auth login

gcloud auth login
Your browser has been opened to visit:

    https://*accounts.google.com/o/oauth2/auth?response_type=****


You are now logged in as [donothack@gmail.com].
Your current project is [fintechclass].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID

Set Up Google Cloud Environment

  3. Set the Project:

    • If you don’t have a project yet, create one in the Google Cloud Console.
    • Set the project:
      gcloud config set project YOUR_PROJECT_ID

Q&A