Cyberax's Blog

Personal Blog of https://github.com/Cyberax

Some years ago, AWS introduced the SSM (Simple Systems Manager) Agent. It's an agent that can be started on EC2 instances and perform multiple utility functions. Over the years, the SSM agent was added to all the major cloud-enabled Linux distribution AMIs, including Ubuntu, Amazon Linux, and RHEL.

SSM Agent supports a wide range of functionality. It can inventory running processes, apply patches, run shell commands, establish terminal sessions to EC2 instances, and even set up port forwarding.

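The most significant advantage of the SSM Agent is that it needs no inbound connectivity into the VPC. The agent itself opens an outbound connection to an AWS service endpoint, ssmmessages, so it works even in a VPC that has no connectivity with the public Internet, as long as the Systems Manager endpoints are reachable (for example, through VPC interface endpoints).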

I was primarily interested in using this to set up port forwarding and avoid using bastion hosts for SSH or PostgreSQL access.

Unfortunately, AWS's existing tooling around the SSM protocol is very clunky and can't easily be used as a composable standalone library. So I spent some time building one. The result of this work is Gimlet.

Looking at how SSM is supposed to be used normally

The normal way to use SSM port forwarding is through the AWS CLI together with the Session Manager Plugin, which is installed separately.

For example, to set up port forwarding to the instance i-0e3a964d49f28a5b8 and port 22 we need to run:

aws ssm start-session --target i-0e3a964d49f28a5b8 --document-name AWS-StartPortForwardingSession --parameters '{"portNumber":["22"], "localPortNumber":["56789"]}'

Starting session with SessionId: admin-0b6458385d2cffd35
Port 56789 opened for sessionId admin-0b6458385d2cffd35.
Waiting for connections...
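
When this command has to be scripted, the JSON passed to --parameters is the fiddly part. Here is a minimal Python sketch that assembles the same invocation. The helper name build_port_forward_command is made up for illustration, and actually running the result assumes that the AWS CLI and the Session Manager Plugin are installed.

```python
import json
import shlex


def build_port_forward_command(instance_id, remote_port, local_port):
    """Build the argv for `aws ssm start-session` port forwarding (hypothetical helper)."""
    # As in the command above, every parameter value is a list of strings.
    parameters = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(parameters),
    ]


cmd = build_port_forward_command("i-0e3a964d49f28a5b8", 22, 56789)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

Passing the returned list to subprocess.run would start the session without any manual shell quoting.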

The AWS CLI itself only initiates the session. The actual work of port forwarding is handled by a separate process that it spawns:

cyberax@CybArm:~$ ps aux | grep session-manager-plugin
cyberax          43380   0,0  0,0 408636096   1488 s003  S+   10:01     0:00.00 grep session-manager-plugin
cyberax          43163   0,0  0,0 35315540  14816 s001  S+   10:00     0:00.27 session-manager-plugin {"SessionId": "admin-0b6458385d2cffd35", "TokenValue": "AAEAA......417bvh4OL", "StreamUrl": "wss://ssmmessages.us-east-1.amazonaws.com/v1/data-channel/admin-0b6458385d2cffd35?role=publish_subscribe&cell-number=AAEAAf2PldzWvh3EDdw8q7A0JB+nBIqkCvU+htAEPX0+D2QYAAAAAGPPc/9A2Ugc4EBbN5Qt9rCZMB9iBuYX6zUdShZndrZ5tvWh3g==", "ResponseMetadata": {"RequestId": "208ec786-7805-4965-91d7-8e6f7e95a603", "HTTPStatusCode": 200, "HTTPHeaders": {"server": "Server", "date": "Tue, 24 Jan 2023 06:00:31 GMT", "content-type": "application/x-amz-json-1.1", "content-length": "947", "connection": "keep-alive", "x-amzn-requestid": "208ec786-7805-4965-91d7-8e6f7e95a603"}, "RetryAttempts": 0}} us-east-1 StartSession pers {"Target": "i-0e3a964d49f28a5b8", "DocumentName": "AWS-StartPortForwardingSession", "Parameters": {"portNumber": ["22"], "localPortNumber": ["56789"]}} https://ssm.us-east-1.amazonaws.com

Yup. The parameters, including connection tokens and request metadata, are passed through the command line. Sigh.
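
To make the exposure concrete, here is a small Python sketch that takes an argument vector shaped like the ps listing above and pulls out the sensitive parts. The positional layout is inferred from that listing, the first JSON blob is trimmed to two fields, and the token keeps the elision from the listing. The name summarize_plugin_args is invented for this example.

```python
import json

# Argument vector shaped like the `ps` listing above (first JSON blob trimmed).
argv = [
    "session-manager-plugin",
    '{"SessionId": "admin-0b6458385d2cffd35", "TokenValue": "AAEAA......417bvh4OL"}',
    "us-east-1",
    "StartSession",
    "pers",
    '{"Target": "i-0e3a964d49f28a5b8", "DocumentName": "AWS-StartPortForwardingSession", '
    '"Parameters": {"portNumber": ["22"], "localPortNumber": ["56789"]}}',
    "https://ssm.us-east-1.amazonaws.com",
]


def summarize_plugin_args(args):
    """Return what any local user who can list processes would learn."""
    response = json.loads(args[1])  # StartSession response, token included
    request = json.loads(args[5])   # the original StartSession request
    return {
        "session_id": response["SessionId"],
        "token": response["TokenValue"],
        "region": args[2],
        "profile": args[4],  # assumption: this position holds the CLI profile name
        "target": request["Target"],
        "endpoint": args[6],
    }


for key, value in summarize_plugin_args(argv).items():
    print(f"{key}: {value}")
```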

“Reverse engineering” the SSM port forwarding protocol

It's clear that the default implementation of session-manager-plugin leaves a lot to be desired. So we should just re-implement it! AWS is known for its pretty good documentation, so it should be simple, right?

The SSM port forwarding API calls are very eloquently documented in AWS as special operations used by AWS Systems Manager. Which is about the total extent of the available documentation.

Fortunately, we do have the source code for both the server side and the client side. So we just need to read it and untangle its twisted web.

I documented the results of my investigation in Gimlet's README file.

In the next post, I'm going to demonstrate how Gimlet can be used to build a simple SSH proxy to allow passwordless access to EC2 instances.


What is Multitenancy

Even in the microservice world there's a common requirement to host data for many tenants. But first let's discuss what exactly a tenant is. When discussing multitenancy, various websites and some books give examples like this: a company that resells database access and needs to store data from multiple clients in one database. This is an extremely naïve example that doesn't reflect reality.

For the purpose of this document, a tenant is a group of users that work within the same organization (or are somehow associated). For example, if we're making an enterprise chat application (a Slack clone, why not?) then a tenant would be an organization that subscribes to it. And the main goal would be to make sure that one tenant can't access the data of other tenants.

We likely still need to add some kind of access control within the tenant, but we absolutely need to make sure that data doesn't leak across the tenant boundary. Moreover, we should make it a goal to ensure that any bug in our code does not result in leaking data outside of the tenant. Basically, imagine how you can deal with the worst possible scenarios: arbitrary unlimited SQL injection, or a fully exploitable buffer overflow in server code.

PostgreSQL Row-Level Security

So with this in mind let's start designing the data access code. We're using PostgreSQL as our main data storage, so we'll be looking at securing it. We can't do database-per-tenant or schema-per-tenant partitioning, as this would blow up the complexity of routine operations such as database upgrades. Instead we'll be looking at row-level security.

Let's start with the schema and some sample data. I purposefully use human-readable names for tenants; in actual production code it's better to use something like uuid default uuid_generate_v4() instead (and probably for the orders table as well).

-- The application role
create role slack_app nosuperuser nocreatedb nocreaterole login password '123';

-- The tenants table
create table tenant (tenant_id varchar primary key, name varchar);
grant select on tenant to slack_app;

-- And a simple data table
create table orders (order_id int8 primary key, order_text varchar, 
    tenant_id varchar not null references tenant(tenant_id) on delete restrict);
grant select, insert, update, delete on orders to slack_app;

insert into tenant(tenant_id, name) values('HHL','Horns&Hooves Ltd.');
insert into tenant(tenant_id, name) values('CIA', 'Scary Government Agency');
insert into orders(order_id, order_text, tenant_id) values (1, 'Chairs', 'HHL');
insert into orders(order_id, order_text, tenant_id) values (2, 'Diamonds', 'HHL');
insert into orders(order_id, order_text, tenant_id) values (3, 'Killer Drones', 'CIA');

alter table tenant enable row level security;
alter table orders enable row level security;

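So far so good. We can log into the database as slack_app, but because row-level security is now enabled and no policies exist yet, PostgreSQL falls back to default-deny and slack_app sees no rows at all.

Now we need policies that describe which rows each tenant may access. This how-to guide has a nice tutorial, so we'll follow it.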

create policy tenant_isolation_policy on tenant using (tenant_id = current_setting('app.current_tenant'));
create policy tenant_isolation_policy on orders using (tenant_id = current_setting('app.current_tenant'));

Now let's test it by logging in as slack_app and trying to do something:

cyberax@CybMac:/tmp$ psql --user slack_app slackapp
slackapp=> select * from orders;
ERROR:  unrecognized configuration parameter "app.current_tenant"

Good, we can't see all the data. Now let's try to change the tenant:

slackapp=> set app.current_tenant = 'HHL';
SET
slackapp=> select * from orders;
 order_id | order_text | tenant_id
----------+------------+-----------
        1 | Chairs     | HHL
        2 | Diamonds   | HHL
(2 rows)

Great! We can only see our own data. But there's one small problem: nothing whatsoever stops an attacker who has gained the ability to do arbitrary SQL injection from doing this:

slackapp=> set app.current_tenant = 'CIA';
SET
slackapp=> select * from orders;
 order_id |  order_text   | tenant_id
----------+---------------+-----------
        3 | Killer Drones | CIA
(1 row)

There is no way to make a SET operation in PostgreSQL one-time only. There is also no way to prohibit SET commands altogether, or at least to limit them to a set of whitelisted options.

Tokenizing Everything

One possible solution is to use unguessable tenant names (e.g. UUIDs), so the attacker likely won't know the other tenants' IDs right off the bat. But this is not a good solution: any leak of a tenant ID would give an attacker access to all of that tenant's documents.

But this approach seems to be on the right track. What if instead of tenant IDs we use time-limited tokens? These tokens can be issued by a small, highly secure service and passed to the main application.

Let's try it! We need to create a table for the tokens and a stored procedure to check them:

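-- Tokens map an unguessable string to a tenant for a limited time
create table token(
    token varchar primary key,
    tenant_id varchar not null references tenant(tenant_id) on delete restrict,
    valid_until timestamp);

-- Resolve the tenant for the token stored in the app.token setting
create or replace function get_tenant()
returns varchar language plpgsql security definer
as $$ declare
  tenant_res varchar;
begin
  -- current_setting(..., true) returns null instead of raising an error when app.token is unset
  select tenant_id into tenant_res from token
    where token = current_setting('app.token', true) and now() < valid_until;
  return tenant_res;
end $$;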

Some explanations: the security definer modifier means that the function always runs with the privileges of the user that owns it (the superuser in this case). This is necessary because we absolutely DO NOT want to give the slack_app user permission to select from our token table.

The rest is straightforward: we need to modify the row-level security policies on the tables and insert some test tokens.

drop policy tenant_isolation_policy on tenant;
drop policy tenant_isolation_policy on orders;

create policy tenant_isolation_policy on tenant using (tenant_id = get_tenant());
create policy tenant_isolation_policy on orders using (tenant_id = get_tenant());

-- Insert some test tokens (they MUST be unguessable cryptographically random strings in a real application)
insert into token (tenant_id, token, valid_until) values ('CIA', 'token-high', now() + interval '2 hours');
insert into token (tenant_id, token, valid_until) values ('HHL', 'token-low', now() + interval '2 hours');

And now let's test it!

cyberax@CybMac:/tmp$ psql --user slack_app slackapp
psql (13.3)
Type "help" for help.

slackapp=> select * from orders ;
 order_id | order_text | tenant_id
----------+------------+-----------
(0 rows)

slackapp=> set app.token = 'token-high';
slackapp=> select * from orders;
 order_id |  order_text   | tenant_id
----------+---------------+-----------
        3 | Killer Drones | CIA
(1 row)

slackapp=> set app.token = 'token-low';
slackapp=> select * from orders;
 order_id | order_text | tenant_id
----------+------------+-----------
        1 | Chairs     | HHL
        2 | Diamonds   | HHL
(2 rows)

And this is exactly what we want! The time-limited tokens provide authorization to access the data of individual tenants. The tokens can be generated by a small service and communicated to the app over a secure channel. And there's nothing an attacker can do without knowing a token.
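
Here is a rough sketch of what the two sides might look like in Python. It only builds the statements and their parameters, using %s placeholders in the style of psycopg, so it runs without a database. The function names, the 2-hour lifetime, and the use of set_config with is_local = true are assumptions of this example, not something the setup above prescribes.

```python
import secrets
from datetime import timedelta

# Matches the `token` table defined above; the expiry is computed by the database.
INSERT_TOKEN_SQL = (
    "insert into token (tenant_id, token, valid_until) "
    "values (%s, %s, now() + %s)"
)

# set_config(name, value, true) scopes the setting to the current transaction,
# so a pooled connection does not carry a token over to the next request.
USE_TOKEN_SQL = "select set_config('app.token', %s, true)"


def issue_token(tenant_id, ttl=timedelta(hours=2)):
    """Token service side: create a random token and the insert that stores it."""
    token = secrets.token_urlsafe(32)  # 32 random bytes from the OS CSPRNG
    return token, INSERT_TOKEN_SQL, (tenant_id, token, ttl)


def use_token(token):
    """Application side: the first statement to run inside each transaction."""
    return USE_TOKEN_SQL, (token,)


token, sql, params = issue_token("HHL")
print(sql)
print(use_token(token)[0])
```

The token service would execute the insert with its own privileged credentials, while the application only ever runs the set_config statement as slack_app.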

Performance

Row-level security is implemented as a hidden where clause, which is visible in the explain output:

slackapp=> explain select * from orders;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on orders  (cost=0.00..222.62 rows=4 width=72)
   Filter: ((tenant_id)::text = (get_tenant())::text)
(2 rows)

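This hidden clause needs to be taken into account when designing queries and indexes. For example, indexes on tenant tables usually want tenant_id as a leading column, and because functions are volatile by default, get_tenant() would have to be declared stable before the planner can use it in an index scan condition.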
